TW202122993A - Memory-based processors

Info

Publication number: TW202122993A
Application number: TW109127495A
Authority: TW
Inventors: 埃利亞德希勒爾; 埃拉德希提; 雪妮布羅多; 大衛夏彌爾; 蓋兒戴楊
Original assignee: 埃利亞德希勒爾; 埃拉德希提; 雪妮布羅多; 大衛夏彌爾; 蓋兒戴楊
Priority date: 2019-08-13
Filing date: 2020-08-13
Publication date: 2021-06-16
Also published as: KR20220078566A; EP4010808A4; WO2021028723A2; CN114586019A; WO2021028723A3; EP4010808A2

Abstract

In some embodiments, an integrated circuit may include a substrate and a memory array disposed on the substrate, where the memory array includes a plurality of discrete memory banks. The integrated circuit may also include a processing array disposed on the substrate, where the processing array includes a plurality of processor subunits, each one of the plurality of processor subunits being associated with one or more discrete memory banks among the plurality of discrete memory banks. The integrated circuit may also include a controller configured to implement at least one security measure with respect to an operation of the integrated circuit and take one or more remedial actions if the at least one security measure is triggered.

Description

Memory-based processors

Priority

This application claims priority to the following: U.S. Provisional Application No. 62/886,328, filed on August 13, 2019; U.S. Provisional Application No. 62/907,659, filed on September 29, 2019; U.S. Provisional Application No. 62/971,912, filed on February 7, 2020; and U.S. Provisional Application No. 62/983,174, filed on February 28, 2020. The foregoing applications are incorporated herein by reference in their entirety.

The present invention relates generally to apparatuses for facilitating memory-intensive operations. In particular, the present invention relates to hardware chips that include processing elements coupled to dedicated memory banks. The present invention also relates to apparatuses for improving the power efficiency and speed of memory chips. In particular, the present invention relates to systems and methods for implementing partial refresh, or even no refresh, on a memory chip. The present invention also relates to size-selectable memory chips and to dual-port capability on a memory chip.

As both processor speeds and memory sizes continue to increase, a significant limitation on effective processing speed is the von Neumann bottleneck. The von Neumann bottleneck results from throughput limitations caused by conventional computer architecture. In particular, data transfer from memory to the processor is often bottlenecked compared to the actual computations performed by the processor. Consequently, the number of clock cycles spent reading from and writing to memory increases significantly for memory-intensive processes. These clock cycles result in lower effective processing speeds, because reading from and writing to memory consumes clock cycles that cannot be used to perform operations on data. Moreover, the computational bandwidth of the processor is generally greater than the bandwidth of the buses that the processor uses to access the memory.

These bottlenecks are particularly pronounced for memory-intensive processes, such as neural networks and other machine learning algorithms; database construction, index searching, and querying; and other tasks that include more read and write operations than data processing operations.

In addition, the rapid growth in the volume and granularity of available digital data has created opportunities to develop machine learning algorithms and has enabled new technologies. However, it has also brought difficult challenges to the fields of databases and parallel computing. For example, the rise of social media and the Internet of Things (IoT) creates digital data at a record rate. This new data may be used to create algorithms for a variety of purposes, ranging from new advertising techniques to more precise control methods for industrial processes. However, the new data is difficult to store, process, analyze, and handle.

New data resources can be massive, sometimes on the order of petabytes to zettabytes. Moreover, the growth rate of these data resources may exceed data processing capabilities. Data scientists have therefore turned to parallel data processing techniques to tackle these challenges. In an effort to increase computation power and handle massive amounts of data, scientists have attempted to create systems and methods capable of parallel intensive computing. But these existing systems and methods have not kept up with data processing requirements, often because the techniques employed are limited by their demand for additional resources for data management, integration of divided data, and analysis of the sectioned data.

To facilitate the manipulation of large data sets, engineers and scientists now seek to improve the hardware used to analyze data. For example, new semiconductor processors or chips (such as those described herein) may be designed specifically for data-intensive tasks by incorporating memory and processing functions in a single substrate fabricated in technologies better suited to memory operations than to arithmetic computation. With integrated circuits designed specifically for data-intensive tasks, it is possible to meet the new data processing requirements. Nonetheless, this new approach to data processing of large data sets requires solving new problems in chip design and fabrication. For instance, if new chips designed for data-intensive tasks are manufactured with the fabrication techniques and architectures used for general-purpose chips, they will have poor performance and/or unacceptable yields. In addition, if the new chips are designed to operate with current data handling methods, they will have poor performance because current methods can limit the chip's ability to handle parallel operations.

The present invention describes solutions for mitigating or overcoming one or more of the problems set forth above, among other problems in the prior art.

In some embodiments, an integrated circuit may include a substrate and a memory array disposed on the substrate, wherein the memory array includes a plurality of discrete memory banks. The integrated circuit may also include a processing array disposed on the substrate, wherein the processing array includes a plurality of processor subunits, each of the plurality of processor subunits being associated with one or more discrete memory banks among the plurality of discrete memory banks. The integrated circuit may also include a controller configured to implement at least one security measure with respect to an operation of the integrated circuit and to take one or more remedial actions if the at least one security measure is triggered.

The disclosed embodiments may also include a method of securing an integrated circuit against tampering, wherein the method includes implementing, using a controller associated with the integrated circuit, at least one security measure with respect to an operation of the integrated circuit, and taking one or more remedial actions if the at least one security measure is triggered, and wherein the integrated circuit includes: a substrate; a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks; and a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of the plurality of processor subunits being associated with one or more discrete memory banks among the plurality of discrete memory banks.

The disclosed embodiments may include an integrated circuit comprising: a substrate; a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks; a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of the plurality of processor subunits being associated with one or more discrete memory banks among the plurality of discrete memory banks; and a controller configured to implement at least one security measure with respect to an operation of the integrated circuit, wherein the at least one security measure includes duplicating program code in at least two different memory portions.
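The duplicated-code security measure can be illustrated with a minimal sketch. It assumes, hypothetically, that the controller compares the two stored copies byte for byte and treats any mismatch as a trigger; the function names and the "halt" remedial action are illustrative only and are not taken from the source.

```python
def copies_match(copy_a: bytes, copy_b: bytes) -> bool:
    """Return True when the two stored copies of the program code are identical."""
    return copy_a == copy_b


def controller_step(copy_a: bytes, copy_b: bytes) -> str:
    """Hypothetical controller policy for the duplicated-code measure."""
    # Assumption: a mismatch between the copies indicates tampering and
    # triggers the security measure; "halt" stands in for a remedial action.
    if copies_match(copy_a, copy_b):
        return "continue"
    return "halt"
```

In this sketch the comparison itself is the security measure; what the controller does on a mismatch is left open by the text above.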

In some embodiments, a distributed processor memory chip is provided that includes: a substrate; a memory array disposed on the substrate; a processing array disposed on the substrate; a first communication port; and a second communication port. The memory array may include a plurality of discrete memory banks. The processing array may include a plurality of processor subunits, each of the plurality of processor subunits being associated with one or more discrete memory banks among the plurality of discrete memory banks. The first communication port may be configured to establish a communication connection between the distributed processor memory chip and an external entity other than another distributed processor memory chip. The second communication port may be configured to establish a communication connection between the distributed processor memory chip and a first additional distributed processor memory chip.

In some embodiments, a method of transferring data between a first distributed processor memory chip and a second distributed processor memory chip may include: determining, using a controller associated with at least one of the first distributed processor memory chip and the second distributed processor memory chip, whether a first processor subunit among a plurality of processor subunits disposed on the first distributed processor memory chip is ready to transfer data to a second processor subunit included in the second distributed processor memory chip; and, after determining that the first processor subunit is ready to transfer the data to the second processor subunit, using a clock enable signal controlled by the controller to initiate the transfer of the data from the first processor subunit to the second processor subunit.

In some embodiments, a memory unit may include: a memory array including a plurality of memory banks; at least one controller configured to control at least one aspect of read operations with respect to the plurality of memory banks; and at least one zero value detection logic unit configured to detect a multi-bit zero value stored at a particular address of the plurality of memory banks; wherein the at least one controller and the at least one zero value detection logic unit are configured to return a zero value indicator to one or more circuits outside the memory unit in response to a zero value detection by the at least one zero value detection logic unit.

Some embodiments may include a method for detecting a zero value at a particular address of a plurality of discrete memory banks, the method comprising: receiving, from a circuit outside a memory unit, a request to read data stored at an address of the plurality of discrete memory banks; activating, by a controller and in response to the received request, a zero value detection logic unit to detect a zero value at the received address; and transmitting, by the controller, a zero value indicator to the circuit in response to the zero value detection by the zero value detection logic unit.
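The zero value detection flow can be sketched in software as follows. This is only an illustration under stated assumptions: the class `MemoryUnit`, the dictionary-based bank layout, and the choice to return the indicator in place of the data word are all hypothetical and are not specified by the text above.

```python
from typing import Dict, Optional, Tuple


class MemoryUnit:
    """Toy model of a memory unit with zero value detection logic."""

    def __init__(self, banks: Dict[int, Dict[int, int]]) -> None:
        # banks[bank_id][address] holds the stored multi-bit word.
        self.banks = banks

    def _detect_zero(self, word: int) -> bool:
        # Stand-in for the zero value detection logic unit.
        return word == 0

    def read(self, bank_id: int, address: int) -> Tuple[bool, Optional[int]]:
        """Serve a read request from a circuit outside the memory unit.

        Returns (zero_indicator, data). Assumption: when the stored word
        is zero, only the indicator is returned and the data is omitted.
        """
        word = self.banks[bank_id][address]
        if self._detect_zero(word):
            return True, None
        return False, word
```

A caller would check the indicator first and read the data field only when the indicator is not set.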

Some embodiments may include a non-transitory computer-readable medium storing a set of instructions executable by a controller of a memory unit to cause the memory unit to detect a zero value at a particular address of a plurality of discrete memory banks, the method comprising: receiving, from a circuit outside the memory unit, a request to read data stored at an address of the plurality of discrete memory banks; activating, by the controller and in response to the received request, a zero value detection logic unit to detect a zero value at the received address; and transmitting, by the controller, a zero value indicator to the circuit in response to the zero value detection by the zero value detection logic unit.

In some embodiments, a memory unit may include: one or more memory banks; a bank controller; and an address generator; wherein the address generator is configured to provide to the bank controller a current address in a current row to be accessed in an associated memory bank, to determine a predicted address of a next row to be accessed in the associated memory bank, and to provide the predicted address to the bank controller before a read operation with respect to the current row associated with the current address is completed.
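The interaction between the address generator and the bank controller can be sketched as below. The prediction rule (next row equals current row plus a fixed stride) is an assumption made for illustration; the text above does not say how the predicted address is determined, and the class and method names are hypothetical.

```python
class BankController:
    """Toy bank controller that can prepare one row ahead of time."""

    def __init__(self) -> None:
        self.prepared_row = None  # row prepared from a predicted address

    def access(self, row: int) -> bool:
        # Returns True when the requested row was already prepared.
        return row == self.prepared_row

    def prepare(self, row: int) -> None:
        self.prepared_row = row


class AddressGenerator:
    """Provides the current row and a predicted next row to the controller."""

    def __init__(self, controller: BankController, stride: int = 1) -> None:
        self.controller = controller
        self.stride = stride  # assumed sequential access pattern

    def run(self, rows) -> int:
        hits = 0
        for row in rows:
            if self.controller.access(row):
                hits += 1
            # The prediction is handed over before the current read is
            # considered complete, i.e. before the next loop iteration.
            self.controller.prepare(row + self.stride)
        return hits
```

With a sequential access pattern every row after the first is found already prepared, whereas a jump in the pattern costs one miss.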

In some embodiments, a memory unit may include: one or more memory banks, wherein each of the one or more memory banks includes a plurality of rows; a first row controller configured to control a first subset of the plurality of rows; a second row controller configured to control a second subset of the plurality of rows; a single data input for receiving data to be stored in the plurality of rows; and a single data output for providing data retrieved from the plurality of rows.

In some embodiments, a distributed processor memory chip may include: a substrate; a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks; a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of the processor subunits being associated with a corresponding dedicated memory bank among the plurality of discrete memory banks; a first plurality of buses, each connecting one of the plurality of processor subunits to its corresponding dedicated memory bank; and a second plurality of buses, each connecting one of the plurality of processor subunits to another of the plurality of processor subunits. At least one of the memory banks may include at least one DRAM memory mat disposed on the substrate. At least one of the processor subunits may include one or more logic components associated with the at least one memory mat. The at least one memory mat and the one or more logic components may be configured to serve as a cache for one or more of the plurality of processor subunits.

In some embodiments, a method of executing at least one instruction in a distributed processor memory chip may include: retrieving one or more data values from a memory array of the distributed processor memory chip; storing the one or more data values in a register formed in a memory mat of the distributed processor memory chip; and accessing the one or more data values stored in the register in accordance with at least one instruction executed by a processor element; wherein the memory array includes a plurality of discrete memory banks disposed on a substrate; wherein the processor element is a processor subunit among a plurality of processor subunits included in a processing array disposed on the substrate, each of the processor subunits being associated with a corresponding dedicated memory bank among the plurality of discrete memory banks; and wherein the register is provided by a memory mat disposed on the substrate.

Some embodiments may include a device comprising: a substrate; a processing unit disposed on the substrate; and a memory unit disposed on the substrate, wherein the memory unit is configured to store data to be accessed by the processing unit, and wherein the processing unit includes a memory mat configured to serve as a cache for the processing unit.

Processing systems are expected to process increasing amounts of information at extremely high rates. For example, the fifth generation (5G) mobile Internet is expected to receive vast streams of information and to process these streams at an increasing rate.

The processing system may include one or more buffers and a processor. The processing operations applied by the processor may have a certain latency, and this may require vast buffers. Vast buffers may be costly and/or area consuming.

Transferring vast amounts of information from the buffers to the processor may require high-bandwidth connectors and/or a high-bandwidth bus between the buffers and the processor, which may also increase the cost and area of the processing system.

There is a growing need to provide an efficient processing system.

The processing system may include one or more buffers and a processor. The processing operations applied by the processor may have a certain latency, and this may require vast buffers. Vast buffers may be costly and/or area consuming.

A disaggregated server includes multiple subsystems, each subsystem having a unique role. For example, a disaggregated server may include one or more switching subsystems, one or more compute subsystems, and one or more storage subsystems.

The one or more compute subsystems and the one or more storage subsystems are coupled to each other via the one or more switching subsystems.

A compute subsystem may include multiple compute units.

A switching subsystem may include multiple switching units.

A storage subsystem may include multiple storage units.

The bottleneck of such a disaggregated server lies in the bandwidth required to convey information between the subsystems.

This is especially true when performing distributed computations that require sharing information units among all (or at least most) of the compute units (such as graphics processing units) of different compute subsystems.

Assume that N compute units participate in the sharing, that N is a very large integer (for example, at least 1024), and that each of the N compute units must send information units to all other compute units (and receive information units from all other compute units). Under these assumptions, approximately N×N transfer processes of information units need to be performed. The large number of transfer processes is time and energy consuming and will dramatically limit the throughput of the disaggregated server.
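The scale of the all-to-all exchange described above can be checked with a short calculation. The sketch assumes that every unit sends one information unit to each of the other N − 1 units, which gives N(N − 1) transfers, close to the N×N estimate; the function name is illustrative.

```python
def all_to_all_transfers(n: int) -> int:
    """Number of one-way transfers when each of n units sends to all others."""
    return n * (n - 1)


# For the example size mentioned in the text (N = 1024):
print(all_to_all_transfers(1024))  # 1047552, compared with 1024 * 1024 = 1048576
```

Doubling N roughly quadruples the transfer count, which is why the exchange dominates as the number of compute units grows.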

There is a growing need to provide efficient disaggregated servers and efficient ways of performing distributed processing.

A database includes many entries, each of which includes multiple fields. Database processing usually includes executing one or more queries that include one or more filtering parameters (for example, identifying one or more relevant fields and one or more relevant field values) and also include one or more operation parameters, which may determine the type of operation to be executed, the variables or constants to be used when applying the operation, and the like.

For example, a database query may request that a statistical operation (operation parameter) be performed on all records of the database in which a certain field has a value within a predefined range (filtering parameter). As another example, a database query may request the deletion (operation parameter) of records in which a certain field is smaller than a threshold (filtering parameter).

A large database is usually stored in storage devices. To respond to a query, the database is sent to a memory unit, usually one database segment after another.

The entries of a database segment are sent from the memory unit to a processor that does not belong to the same integrated circuit as the memory unit. The entries are then processed by the processor.

For each database segment of the database stored in the memory unit, the processing includes the following steps: (i) selecting a record of the database segment; (ii) sending the record from the memory unit to the processor; (iii) filtering the record by the processor to determine whether the record is relevant; and (iv) performing one or more additional operations on the relevant records (summing, applying any other mathematical and/or statistical operation).
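Steps (i) through (iv) can be sketched as a loop that also counts how many records cross from the memory unit to the processor. This is a simplified model under stated assumptions: records are plain dictionaries, and the names `process_segment`, `is_relevant`, and `operation` are hypothetical.

```python
def process_segment(segment, is_relevant, operation):
    """Conventional flow: every record is sent to the processor before filtering."""
    transferred = 0
    relevant = []
    for record in segment:           # (i) select a record of the segment
        transferred += 1             # (ii) send it from memory unit to processor
        if is_relevant(record):      # (iii) filter at the processor
            relevant.append(record)
    result = operation(relevant)     # (iv) additional operation on relevant records
    return result, transferred


segment = [{"age": 10, "v": 1}, {"age": 30, "v": 2}, {"age": 40, "v": 3}]
total, sent = process_segment(
    segment,
    is_relevant=lambda r: 20 <= r["age"] <= 50,    # filtering parameter
    operation=lambda rs: sum(r["v"] for r in rs),  # operation parameter
)
```

Here all three records are transferred even though only two pass the filter, which illustrates the bandwidth cost of filtering on the processor side.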

The filtering process ends after all the records have been sent to the processor and the processor has determined which records are relevant.

In cases where the relevant entries of a database segment are not stored in the processor, these relevant records need to be sent to the processor for further processing after the filtering phase (applying the operations that follow the processing).

When multiple processing operations follow a single filtering, the result of each operation may be sent to the memory unit and then sent again to the processor.

This process is bandwidth consuming and time consuming.

There is a growing need to provide an efficient way to perform database processing.

Word embedding is the collective name for a set of language modeling and feature learning techniques in natural language processing (NLP) in which words or phrases from a vocabulary are mapped to vectors of elements. Conceptually, it involves a mathematical embedding from a space with many dimensions per word to a continuous vector space with a much lower dimension (www.wikipedia.org).

Methods for generating this mapping include neural networks, dimensionality reduction on the word co-occurrence matrix, probabilistic models, explainable knowledge base methods, and explicit representation in terms of the context in which words appear.

Word and phrase embeddings, when used as the underlying input representation, have been shown to boost performance in NLP tasks such as syntactic parsing and sentiment analysis.

A sentence may be segmented into words or phrases, and each segment may be represented by a vector. A sentence may be represented by a matrix that includes all the vectors representing the words or phrases of the sentence.

A vocabulary that maps words to vectors may be stored in a memory unit, such as a dynamic random access memory (DRAM), that may be accessed using the words or phrases (or an index representing the words).
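The mapping from a sentence to a matrix of vectors can be illustrated with a small sketch. The vocabulary contents, the two-element vectors, and the whitespace segmentation are assumptions chosen for brevity; a real vocabulary would be far larger and would reside in the memory unit.

```python
# Hypothetical vocabulary mapping each word to its embedding vector.
vocabulary = {
    "memory": [0.1, 0.2],
    "chip": [0.3, 0.4],
}


def sentence_to_matrix(sentence: str, vocab: dict) -> list:
    """Represent a sentence as a matrix with one vocabulary vector per word."""
    # Each lookup corresponds to one access to the memory unit that
    # stores the vocabulary; the order of accesses follows the sentence.
    return [vocab[word] for word in sentence.split()]


matrix = sentence_to_matrix("memory chip memory", vocabulary)
```

Because the lookup order follows the words of the sentence, the resulting memory accesses have no particular locality.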

Such accesses may be random accesses, which reduces the throughput of the DRAM. In addition, the accesses may saturate the DRAM, especially when a large number of accesses are fed to the DRAM.

In particular, the words included in a sentence are usually fairly random. Even when DRAM bursts are used, accessing the DRAM memory that stores the mapping will usually yield the low performance of random access, because during a burst typically only a small portion of the DRAM memory bank entries (among the multiple entries of the different memory banks that are accessed concurrently) will store entries related to a given sentence.

The throughput of the DRAM memory is therefore low and non-continuous.

Each word or phrase of the sentence is retrieved from the DRAM memory under the control of a host computer that is outside the integrated circuit of the DRAM memory and that must control each retrieval of each vector representing each word or segment based on knowledge of the locations of the words, which is a time-consuming and resource-consuming task.

Data centers and other computerized systems are expected to process and exchange increasing amounts of information at extremely high rates.

The exchange of increasing amounts of data may be a bottleneck for data centers and other computerized systems and may cause them to use only a part of their capabilities.

FIG. 96A illustrates an example of a prior art database 12010 and a prior art server motherboard 12011. The database may include multiple servers, each server including multiple server motherboards (also denoted "CPU + memory + network"). Each server motherboard 12011 includes a CPU 12012 (such as, but not limited to, an Intel XEON) that receives traffic and is connected to a memory unit 12013 (denoted RAM) and to multiple database accelerators (DB accelerators) 12014.

The DB accelerators are optional, and the DB acceleration operations may be performed by the CPU 12012.

All traffic flows through the CPU, and the CPU may be coupled to the DB accelerators via links of relatively limited bandwidth (such as PCIe).

A large amount of resources is dedicated to routing information units between the multiple server motherboards.

There is a growing need to provide efficient data centers and other computerized systems.

The size of artificial intelligence (AI) applications such as neural networks has increased significantly. To cope with the increased size of neural networks, multiple servers, each serving as an AI acceleration server (including a server motherboard), are used to perform neural network processing tasks, such as but not limited to training. An example of a system including multiple AI acceleration servers arranged in different racks is shown in FIG. 1.

在典型的訓練工作階段中，同時處理大量影像以提供大量值，諸如損失。大量值在不同AI加速伺服器之間輸送且導致例外量的訊務。舉例而言，可跨越位於不同AI加速伺服器中之多個GPU運算一些神經網路層，且可能需要消耗頻寬之網路上聚集。 In a typical training session, a large number of images are processed at the same time to provide a large number of values, such as loss. A large number of values are transmitted between different AI acceleration servers and result in an exceptional amount of traffic. For example, some neural network layers can be calculated across multiple GPUs located in different AI acceleration servers, and may need to be aggregated on a network that consumes bandwidth.

例外量之訊務的傳送需要超高頻寬，其可能不可行或可能不具成本效益。 Transferring this exceptionally large amount of traffic requires ultra-high bandwidth, which may be infeasible or may not be cost-effective.
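The scale of such traffic can be illustrated with a rough estimate. The following is a minimal sketch only: it assumes gradients are aggregated with a ring all-reduce (a common technique that this description does not itself specify), and the function names, parameter count, value width, worker count, and step rate are all hypothetical figures chosen for illustration.

```python
def ring_allreduce_bytes_per_worker(num_params: int,
                                    bytes_per_value: int,
                                    num_workers: int) -> float:
    """Bytes each worker sends during one ring all-reduce of a gradient vector.

    In a ring all-reduce every worker transmits 2 * (N - 1) / N times the
    payload size (reduce-scatter phase plus all-gather phase).
    """
    if num_workers < 2:
        return 0.0  # a single worker has nothing to exchange
    payload = num_params * bytes_per_value
    return 2 * (num_workers - 1) / num_workers * payload


def required_bandwidth_gbps(bytes_per_step: float,
                            steps_per_second: float) -> float:
    """Sustained per-worker link bandwidth, in gigabits per second."""
    return bytes_per_step * steps_per_second * 8 / 1e9


# Hypothetical workload: 100 million 4-byte values, 8 servers, 10 steps/s.
per_step = ring_allreduce_bytes_per_worker(100_000_000, 4, 8)
print(per_step)                               # bytes sent per worker per step
print(required_bandwidth_gbps(per_step, 10))  # sustained Gbit/s per worker
```

Under these assumed figures each server would need to sustain tens of gigabits per second for aggregation alone, which is consistent with the observation above that the required bandwidth may not be cost-effective.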

圖97A說明包括子系統之系統12050，每一子系統包括：交換器12051，其用於連接具有伺服器主機板12055之AI加速伺服器12052，該伺服器主機板包括RAM記憶體(RAM 12056)、中央處理單元(CPU)12054、網路介面卡(NIC)12053，而CPU 12054連接(經由PCIe匯流排)至多個AI加速器12057(諸如，圖形處理單元、AI晶片(AI ASIC)、FPGA及其類似者)。NIC藉由網路(使用例如乙太網路、UDP鏈路及其類似者)耦接至彼此(例如，藉由一或多個交換器)，且此等NIC可能夠輸送系統所需之超高頻寬。 FIG. 97A illustrates a system 12050 that includes subsystems. Each subsystem includes a switch 12051 for connecting an AI acceleration server 12052 having a server motherboard 12055. The server motherboard includes RAM memory (RAM 12056), a central processing unit (CPU) 12054, and a network interface card (NIC) 12053, and the CPU 12054 is connected (via a PCIe bus) to multiple AI accelerators 12057 (such as graphics processing units, AI chips (AI ASICs), FPGAs, and the like). The NICs are coupled to one another over a network (using, for example, Ethernet, UDP links, and the like), for example through one or more switches, and these NICs may be capable of conveying the ultra-high bandwidth required by the system.

愈來愈需要提供高效AI運算系統。 There is an increasing need to provide efficient AI computing systems.

符合其他所揭示實施例，非暫時性電腦可讀儲存媒體可儲存程式指令，該等程式指令由至少一個處理裝置執行且執行本文中所描述之方法中的任一者。 Consistent with other disclosed embodiments, a non-transitory computer-readable storage medium may store program instructions that, when executed by at least one processing device, perform any of the methods described herein.

前文之一般描述及下文之詳細描述僅為例示性及解釋性的，且不受申請專利範圍限制。 The foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the claims.

38:線 38: line

100:CPU 100: CPU

110:處理單元 110: processing unit

120a:處理器子單元 120a: processor subunit

120b:處理器子單元 120b: processor subunit

130:快取記憶體 130: Cache memory

140a:共用記憶體 140a: shared memory

140b:共用記憶體 140b: shared memory

200:GPU 200: GPU

210:處理單元 210: Processing Unit

220a:處理器子單元 220a: processor subunit

220b:處理器子單元 220b: processor subunit

220c:處理器子單元 220c: processor subunit

220d:處理器子單元 220d: processor subunit

220e:處理器子單元 220e: processor subunit

220f:處理器子單元 220f: processor subunit

220g:處理器子單元 220g: processor subunit

220h:處理器子單元 220h: processor subunit

220i:處理器子單元 220i: processor subunit

220j:處理器子單元 220j: processor subunit

220k:處理器子單元 220k: processor subunit

220l:處理器子單元 220l: processor subunit

220m:處理器子單元 220m: processor subunit

220n:處理器子單元 220n: processor subunit

220o:處理器子單元 220o: processor subunit

220p:處理器子單元 220p: processor subunit

230a:快取記憶體 230a: Cache memory

230b:快取記憶體 230b: Cache memory

230c:快取記憶體 230c: Cache memory

230d:快取記憶體 230d: Cache memory

250a:共用記憶體 250a: shared memory

250b:共用記憶體 250b: shared memory

250c:共用記憶體 250c: shared memory

250d:共用記憶體 250d: shared memory

300:硬體晶片 300: hardware chip

300':硬體晶片 300': hardware chip

310a:處理群組 310a: Processing group

310b:處理群組 310b: Processing group

310c:處理群組 310c: Processing group

310d:處理群組 310d: Processing group

320a:邏輯及控制子單元 320a: logic and control subunit

320b:邏輯及控制子單元 320b: logic and control subunit

320c:邏輯及控制子單元 320c: logic and control subunit

320d:邏輯及控制子單元 320d: logic and control subunit

320e:邏輯及控制子單元 320e: logic and control subunit

320f:邏輯及控制子單元 320f: logic and control subunit

320g:邏輯及控制子單元 320g: logic and control subunit

320h:邏輯及控制子單元 320h: logic and control subunit

330a:專用記憶體例項 330a: Dedicated memory instance

330b:專用記憶體例項 330b: Dedicated memory instance

330c:專用記憶體例項 330c: Dedicated memory instance

330d:專用記憶體例項 330d: Dedicated memory instance

330e:專用記憶體例項 330e: Dedicated memory instance

330f:專用記憶體例項 330f: Dedicated memory instance

330g:專用記憶體例項 330g: Dedicated memory instance

330h:專用記憶體例項 330h: Dedicated memory instance

340a:控制件 340a: control

340b:控制件 340b: control

340c:控制件 340c: control

340d:控制件 340d: control part

350:主機 350: host

350a:處理器子單元 350a: processor subunit

350b:處理器子單元 350b: processor subunit

350c:處理器子單元 350c: processor subunit

350d:處理器子單元 350d: processor subunit

360a:匯流排 360a: bus

360b:匯流排 360b: bus

360c:匯流排 360c: bus

360d:匯流排 360d: bus

360e:匯流排 360e: bus

360f:匯流排 360f: bus

400:處理程序 400: process

410:處理群組 410: Processing Group

420:專用記憶體例項 420: Dedicated memory instance

430:處理器子單元 430: processor subunit

440:處理元件 440: processing element

450:位址產生器 450: address generator

460:控制件 460: control

500:用於執行專門命令之例示性處理程序 500: exemplary process for executing specialized commands

510:處理群組 510: Processing Group

520:專用記憶體例項 520: Dedicated memory instance

530:處理元件 530: processing components

600:處理群組 600: Processing group

610:處理器子單元 610: processor subunit

620:記憶體/記憶體元件 620: Memory/Memory Components

630:匯流排 630: Bus

640:處理元件 640: processing element

650:加速器/MUX 650: accelerator/MUX

660:輸入多工器(MUX)/DEMUX 660: Input multiplexer (MUX)/DEMUX

670:輸出DEMUX 670: output DEMUX

710:基板 710: Substrate

720a:記憶體組 720a: memory bank

720b:記憶體組 720b: Memory bank

720c:記憶體組 720c: memory bank

720d:記憶體組 720d: memory bank

720e:記憶體組 720e: memory bank

720f:記憶體組 720f: memory bank

720g:記憶體組 720g: memory bank

720h:記憶體組 720h: memory bank

730a:處理器子單元 730a: processor subunit

730b:處理器子單元 730b: processor subunit

730c:處理器子單元 730c: processor subunit

730d:處理器子單元 730d: processor subunit

730e:處理器子單元 730e: processor subunit

730f:處理器子單元 730f: processor subunit

730g:處理器子單元 730g: processor subunit

730h:處理器子單元 730h: processor subunit

740a:匯流排 740a: bus

740b:匯流排 740b: bus

740c:匯流排 740c: bus

740d:匯流排 740d: bus

740e:匯流排 740e: bus

740f:匯流排 740f: bus

740g:匯流排 740g: bus

740h:匯流排 740h: bus

750a:匯流排 750a: bus

750b:匯流排 750b: bus

750c:匯流排 750c: bus

750d:匯流排 750d: bus

750e:匯流排 750e: bus

750f:匯流排 750f: bus

750g:匯流排 750g: bus

750h:匯流排 750h: bus

750i:匯流排 750i: bus

750j:匯流排 750j: bus

760:架構 760: Architecture

770a:記憶體晶片 770a: memory chip

770b:記憶體晶片 770b: Memory chip

770c:記憶體晶片 770c: memory chip

770d:記憶體晶片 770d: memory chip

800:用於編譯一系列指令之方法 800: Method for compiling a series of instructions

810:步驟 810: step

820:步驟 820: step

830:步驟 830: step

840:步驟 840: step

850:步驟 850: step

900:組 900: bank

910:列解碼器 910: row decoder

920:全域感測放大器 920: global sense amplifier

930-1:墊 930-1: mat

930-2:墊 930-2: mat

940-1:墊 940-1: mat

940-2:墊 940-2: mat

950:字線 950: word line

960:位元線 960: bit line

1000:墊 1000: mat

1010-1:區域放大器 1010-1: local amplifier

1010-2:區域放大器 1010-2: local amplifier

1010-x:區域放大器 1010-x: local amplifier

1020-1:字線驅動器 1020-1: word line driver

1020-2:字線驅動器 1020-2: Word line driver

1020-x:字線驅動器 1020-x: word line driver

1030-1:胞元 1030-1: Cell

1030-2:胞元 1030-2: Cell

1030-3:胞元 1030-3: Cell

1040:字線 1040: word line

1050:位元線 1050: bit line

1100:組/子組 1100: bank/subbank

1110:組列解碼器 1110: bank row decoder

1120:組行解碼器 1120: bank column decoder

1121a:位元線 1121a: bit line

1121b:位元線 1121b: bit line

1130a:子組控制器 1130a: subbank controller

1130b:子組控制器 1130b: subbank controller

1130c:子組控制器 1130c: subbank controller

1131a:匯流排 1131a: bus

1131b:匯流排 1131b: bus

1131c:匯流排 1131c: bus

1140a:解算器 1140a: solver

1140b:解算器 1140b: solver

1140c:解算器 1140c: solver

1150a:邏輯 1150a: logic

1150b:邏輯 1150b: logic

1150c:邏輯 1150c: logic

1160a:解碼器 1160a: decoder

1160b:解碼器 1160b: decoder

1160c:解碼器 1160c: decoder

1160d:解碼器 1160d: decoder

1160e:解碼器 1160e: decoder

1160f:解碼器 1160f: decoder

1160g:解碼器 1160g: decoder

1160h:解碼器 1160h: decoder

1160i:解碼器 1160i: decoder

1170a:子組 1170a: subbank

1170b:子組 1170b: subbank

1170c:子組 1170c: subbank

1180a:解碼器 1180a: decoder

1180b:解碼器 1180b: decoder

1180c:解碼器 1180c: decoder

1181a:位元線 1181a: bit line

1181b:位元線 1181b: bit line

1190a-1:墊 1190a-1: mat

1190a-2:墊 1190a-2: mat

1190a-x:墊 1190a-x: mat

1190b-1:墊 1190b-1: mat

1190b-2:墊 1190b-2: mat

1190b-x:墊 1190b-x: mat

1190c-1:墊 1190c-1: mat

1190c-2:墊 1190c-2: mat

1190c-x:墊 1190c-x: mat

1200:記憶體子組 1200: memory subbank

1210:記憶體控制器 1210: memory controller

1220a:熔斷器及比較器 1220a: fuse and comparator

1220b:熔斷器及比較器 1220b: fuse and comparator

1230a:列解碼器 1230a: row decoder

1230b:列解碼器 1230b: row decoder

1240a:墊 1240a: mat

1240b:墊 1240b: mat

1250a:行解碼器 1250a: column decoder

1250b:行解碼器 1250b: column decoder

1251:位元線 1251: bit line

1253:位元線 1253: bit line

1260a-1:胞元 1260a-1: cell

1260a-2:胞元 1260a-2: cell

1260a-x:胞元 1260a-x: cell

1260b-1:胞元 1260b-1: cell

1260b-2:胞元 1260b-2: cell

1260b-x:胞元 1260b-x: cell

1300:記憶體晶片 1300: memory chip

1301:基板 1301: substrate

1302:位址管理器 1302: Address Manager

1304:記憶體陣列 1304: memory array

1304(a,a):記憶體組/記憶體區塊/記憶體例項 1304(a,a): memory bank/memory block/memory instance

1304(z,z):記憶體組/記憶體例項 1304(z,z): memory bank/memory instance

1306:記憶體邏輯 1306: memory logic

1308:商業邏輯 1308: business logic

1310:冗餘商業邏輯/冗餘邏輯區塊/冗餘商業區塊 1310: Redundant business logic/redundant logic block/redundant business block

1312:不啟動開關 1312: deactivation switch

1314:啟動開關 1314: activation switch

1400:冗餘邏輯區塊集合 1400: set of redundant logic blocks

1402:位址匯流排 1402: address bus

1404:資料匯流排 1404: data bus

1500:邏輯區塊 1500: logical block

1504:取得電路 1504: fetch circuit

1506:解碼器 1506: decoder

1508:暫存器 1508: register

1510:運算單元 1510: computation unit

1512:複製運算單元 1512: duplicate computation unit

1514:開關電路 1514: switching circuit

1516:開關電路 1516: switching circuit

1518:寫回電路 1518: write-back circuit

1602:邏輯區塊 1602: logical block

1602(a):邏輯區塊 1602(a): logical block

1602(b):邏輯區塊 1602(b): logical block

1602(c):邏輯區塊 1602(c): logical block

1604:熔斷識別件 1604: Fuse identification

1604(a):邏輯區塊 1604(a): logical block

1604(b):邏輯區塊 1604(b): logical block

1604(c):邏輯區塊 1604(c): logical block

1614:位址匯流排 1614: address bus

1616:命令線 1616: command line

1618:資料線 1618: data line

1702:單元 1702: unit

1702(a):單元 1702(a): unit

1702(b):發生故障的單元 1702(b): The failed unit

1702(c):單元 1702(c): unit

1712:單元 1712: unit

1712(a):單元 1712(a): unit

1712(b):單元 1712(b): unit

1712(c):單元 1712(c): unit

1722:開關電路 1722: switch circuit

1722(a):開關電路 1722(a): Switching circuit

1722(b):開關電路 1722(b): Switching circuit

1722(c):開關電路 1722(c): Switch circuit

1728:開關電路 1728: switch circuit

1728(a):開關電路 1728(a): Switch circuit

1728(c):開關電路 1728(c): Switch circuit

1730:樣本電路 1730: sample circuit

1730(a):樣本電路 1730(a): Sample circuit

1730(b):樣本電路 1730(b): Sample circuit

1730(c):樣本電路 1730(c): Sample circuit

1804:I/O區塊 1804: I/O block

1806:單元 1806: unit

1808:開關箱 1808: switch box

1810:連接箱 1810: connection box

1902(a):單元 1902(a): unit

1902(b):單元 1902(b): unit

1902(c):單元 1902(c): unit

1902(d):單元 1902(d): unit

1902(e):單元 1902(e): unit

1902(f):單元 1902(f): unit

1904(a):組態開關 1904(a): configuration switch

1904(b):組態開關 1904(b): configuration switch

1904(c):組態開關 1904(c): configuration switch

1904(d):組態開關 1904(d): configuration switch

1904(e):組態開關 1904(e): configuration switch

1904(f):組態開關 1904(f): configuration switch

1904(g):組態開關 1904(g): configuration switch

1904(h):組態開關 1904(h): configuration switch

2000:冗餘區塊賦能處理程序 2000: Redundant block enablement process

2002:步驟 2002: steps

2004:步驟 2004: steps

2006:步驟 2006: steps

2008:步驟 2008: steps

2010:步驟 2010: steps

2100:位址指派處理程序 2100: Address assignment process

2102:步驟 2102: step

2104:步驟 2104: step

2106:步驟 2106: step

2108:步驟 2108: step

2110:步驟 2110: steps

2112:步驟 2112: steps

2114:步驟 2114: step

2116:步驟 2116: step

2118:步驟 2118: step

2120:步驟 2120: step

2122:步驟 2122: step

2200:處理裝置 2200: processing device

2202:第一記憶體區塊 2202: The first memory block

2204:第二記憶體區塊 2204: second memory block

2210:記憶體控制器 2210: Memory Controller

2212:組態管理器 2212: Configuration Manager

2214:邏輯區塊/單元 2214: logical block/unit

2216:加速器/單元 2216: accelerator/unit

2216(a):加速器 2216(a): accelerator

2216(n):加速器 2216(n): accelerator

2218:線 2218: line

2220:線 2220: line

2230:主機 2230: host

2300:處理裝置 2300: processing device

2302:MAC單元 2302: MAC unit

2304:組態管理器 2304: Configuration Manager

2306:記憶體控制器 2306: Memory Controller

2308(a):記憶體區塊 2308(a): memory block

2308(b):記憶體區塊 2308(b): memory block

2308(c):記憶體區塊 2308(c): memory block

2308(d):記憶體區塊 2308(d): memory block

2500:記憶體組態處理程序 2500: memory configuration process

2502:步驟 2502: step

2504:步驟 2504: step

2506:步驟 2506: step

2508:步驟 2508: step

2510:步驟 2510: steps

2512:步驟 2512: step

2514:步驟 2514: step

2600:記憶體讀取處理程序 2600: Memory read process

2602:步驟 2602: step

2604:步驟 2604: step

2606:步驟 2606: step

2608:步驟 2608: step

2614:步驟 2614: step

2616:步驟 2616: step

2618:步驟 2618: step

2620:步驟 2620: steps

2700:執行處理程序 2700: execution process

2702:步驟 2702: step

2704:步驟 2704: step

2706:步驟 2706: step

2708:步驟 2708: step

2710:步驟 2710: Step

2712:步驟 2712: step

2714:步驟 2714: step

2176:步驟 2176: step

2718:步驟 2718: step

2720:步驟 2720: step

2800:記憶體晶片 2800: memory chip

2801a:記憶體組 2801a: memory bank

2803:再新控制器 2803: refresh controller

2805:控制器 2805: controller

2900:實例再新控制器 2900: example refresh controller

2900':實例再新控制器 2900': example refresh controller

2901:計時器 2901: timer

2903:列計數器 2903: row counter

2905:有效位元 2905: valid bit

2907:加法器 2907: adder

2909:資料儲存器 2909: data storage

2911:再新閘 2911: refresh gate

3000:用於記憶體晶片中之部分再新的處理程序 3000: process for partial refresh in a memory chip

3010:步驟 3010: steps

3020:步驟 3020: steps

3030:步驟 3030: steps

3100:用於判定記憶體晶片之再新的處理程序 3100: process for determining refresh of a memory chip

3110:步驟 3110: step

3120:步驟 3120: step

3130:步驟 3130: step

3140:步驟 3140: step

3200:用於判定記憶體晶片之再新的處理程序 3200: process for determining refresh of a memory chip

3210:步驟 3210: steps

3220:步驟 3220: steps

3230:步驟 3230: steps

3240:步驟 3240: step

3250:步驟 3250: steps

3300:實例再新控制器 3300: example refresh controller

3301:計時器 3301: timer

3303:列計數器 3303: row counter

3305:加法器 3305: adder

3307:資料儲存器 3307: data storage

3400:用於判定記憶體晶片之再新的處理程序 3400: process for determining refresh of a memory chip

3410:步驟 3410: steps

3420:步驟 3420: step

3430:步驟 3430: step

3440:步驟 3440: step

3501:晶圓 3501: Wafer

3503:晶粒 3503: Die

3504:第二區/記憶體晶片/群組 3504: second zone/memory chip/group

3505:區/記憶體晶片 3505: zone/memory chip

3506:記憶體晶片 3506: memory chip

3506A:記憶體晶粒/記憶體晶片 3506A: Memory Die/Memory Chip

3506B:記憶體晶粒/記憶體晶片 3506B: memory die/memory chip

3506C:記憶體晶片 3506C: Memory chip

3506D:記憶體晶片 3506D: Memory chip

3507:基板 3507: substrate

3511A:記憶體組 3511A: Memory Bank

3511B:記憶體組 3511B: Memory Bank

3511C:記憶體組 3511C: Memory Bank

3511D:記憶體組 3511D: Memory Bank

3511E:記憶體組 3511E: memory bank

3511F:記憶體組 3511F: Memory Bank

3511G:記憶體組 3511G: memory bank

3511H:記憶體組 3511H: memory bank

3512:匯流排或連接件 3512: bus or connector

3515A:處理器子單元 3515A: processor subunit

3515B:處理器子單元 3515B: processor subunit

3515C:處理器子單元 3515C: processor subunit

3515D:處理器子單元 3515D: processor subunit

3515E:處理器子單元 3515E: processor subunit

3515F:處理器子單元 3515F: processor subunit

3515G:處理器子單元 3515G: processor subunit

3515H:處理器子單元/連接件 3515H: Processor subunit/connector

3515I:處理器子單元 3515I: processor subunit

3515J:處理器子單元 3515J: processor subunit

3515K:處理器子單元 3515K: processor subunit

3515L:處理器子單元 3515L: processor subunit

3515M:處理器子單元 3515M: processor subunit

3515N:處理器子單元 3515N: processor subunit

3515O:處理器子單元 3515O: processor subunit

3515P:處理器子單元 3515P: processor subunit

3516A:連接件 3516A: Connector

3516B:連接件 3516B: Connector

3516C:連接件 3516C: Connector

3516D:連接件 3516D: Connector

3517:區/記憶體晶片 3517: zone/memory chip

3521:輸入輸出(IO)控制器/IO控制模組 3521: Input and output (IO) controller/IO control module

3521A:IO控制器 3521A: IO controller

3521B:IO控制器 3521B: IO controller

3521:IO控制器/IO控制模組 3521: IO controller/IO control module

3522:IO控制器/IO控制模組 3522: IO controller/IO control module

3523:IO控制模組 3523: IO control module

3524:IO控制器 3524: IO controller

3530:輸入輸出匯流排/匯流排線 3530: input and output bus/bus line

3530A:輸入輸出匯流排 3530A: Input and output bus

3530B:輸入輸出匯流排 3530B: input and output bus

3531A:分支 3531A: branch

3531B:分支 3531B: branch

3532:記憶體晶片 3532: memory chip

3533:線 3533: line

3540:晶粒 3540: Die

3542A:輸入/輸出控制器 3542A: Input/Output Controller

3542B:輸入/輸出控制器 3542B: Input/Output Controller

3546:晶粒 3546: Die

3554:熔斷器/熔斷器元件 3554: Fuse/fuse element

3554A:熔斷器 3554A: Fuse

3554B:熔斷器 3554B: Fuse

3555:熔斷器/熔斷器元件 3555: Fuse/fuse element

3556:熔斷器/熔斷器元件 3556: Fuse/fuse element

3557:熔斷器/熔斷器元件 3557: fuse/fuse element

3601:域 3601: domain

3602:域 3602: domain

3603:域 3603: domain

3611:匯流排線 3611: bus line

3612:匯流排線 3612: bus line

3613:匯流排線 3613: bus line

3711:膠合邏輯/邏輯電路/膠合邏輯單元 3711: glue logic / logic circuit / glue logic unit

3713:群組 3713: Group

3715:群組 3715: Group

3801:水平切割 3801: horizontal cut

3802:線/切割 3802: line/cut

3803:豎直切割 3803: vertical cut

3804:線/切割 3804: line/cut

3806:線/切割 3806: line/cut

3811A:區 3811A: region

3811B:區 3811B: region

3811C:區 3811C: region

3820:匯流排線 3820: bus line

3822:區 3822: region

3824:匯流排線 3824: bus line

3901:記憶體單元之群組 3901: Group of memory units

3905:連接件 3905: connector

4000:自晶粒群組建置記憶體晶片之實例處理程序 4000: An example process for building a memory chip from a die group

4011:步驟 4011: step

4015:步驟 4015: steps

4017:步驟 4017: step

4100:用於製造含有多個晶粒之記憶體晶片的實例處理程序 4100: An example process for manufacturing a memory chip containing multiple dies

4101:處理程序 4101: process

4102:處理程序 4102: process

4111:步驟 4111: Step

4113:步驟 4113: Step

4115:步驟 4115: step

4117:步驟 4117: Step

4119:步驟 4119: step

4131:步驟 4131: step

4133:步驟 4133: Step

4140:步驟 4140: Step

4200:實例電路系統 4200: example circuitry

4201:記憶體陣列 4201: memory array

4203:列解碼器 4203: row decoder

4205a:行多工器(「mux」) 4205a: column multiplexer ("mux")

4205b:行多工器(「mux」) 4205b: column multiplexer ("mux")

4300:實例電路系統 4300: example circuitry

4301:記憶體陣列 4301: memory array

4303:列解碼器 4303: row decoder

4305:行多工器 4305: column multiplexer

4400:實例電路系統 4400: example circuitry

4401:記憶體陣列 4401: memory array

4403:列解碼器 4403: row decoder

4405:行解碼器(或多工器) 4405: column decoder (or multiplexer)

4500:方法 4500: method

4600:實例電路系統 4600: example circuitry

4601a:列解碼器 4601a: row decoder

4601b:列解碼器 4601b: row decoder

4603a:行多工器 4603a: column multiplexer

4603b:行多工器 4603b: column multiplexer

4607:列控制件 4607: row control

4609a:記憶體墊 4609a: memory mat

4609b:記憶體墊 4609b: memory mat

4611a:字線 4611a: word line

4611b:字線 4611b: word line

4613a:開關元件 4613a: switching element

4613b:開關元件 4613b: switching element

4615a:位元線 4615a: bit line

4615b:位元線 4615b: bit line

4700:用於在單埠記憶體陣列或墊上提供雙埠存取之處理程序 4700: Process used to provide dual-port access on a single-port memory array or pad

4710:步驟 4710: step

4720:步驟 4720: step

4730:步驟 4730: step

4740:步驟 4740: step

4750:用於在單埠記憶體陣列或墊上提供雙埠存取的處理程序 4750: A process used to provide dual-port access on a single-port memory array or pad

4760:步驟 4760: Step

4770:步驟 4770: step

4780:步驟 4780: step

4790:步驟 4790: step

4800:實例電路系統 4800: example circuitry

4801a:列解碼器 4801a: row decoder

4801b:列解碼器 4801b: row decoder

4803a:行多工器 4803a: column multiplexer

4803b:行多工器 4803b: column multiplexer

4900:記憶體墊 4900: memory mat

5000:實例積體電路 5000: Example integrated circuit

5001:記憶體胞元 5001: memory cell

5008:記憶體胞元 5008: memory cell

5011:記憶體讀取路徑 5011: Memory read path

5018:記憶體讀取路徑 5018: Memory read path

5020:輸出埠 5020: output port

5021:位元 5021: bit

5028:位元 5028: bit

5030:縮減單元 5030: Reduced unit

5040:讀取電路系統 5040: Reading circuit system

5050:記憶體胞元之陣列 5050: Array of memory cells

5100:記憶體組 5100: memory bank

5101:記憶體組 5101: Memory Bank

5102:記憶體單元 5102: memory unit

5111:陣列 5111: Array

5112:列解碼器 5112: row decoder

5113:行多工器 5113: column multiplexer

5114:主I/O匯流排 5114: main I/O bus

5115:輸出匯流排 5115: output bus

5116:記憶體內處理(PIM)邏輯 5116: In-Memory Processing (PIM) logic

5117:匯流排 5117: bus

5118:PIM位址匯流排/位址行匯流排 5118: PIM address bus/address line bus

5119:匯流排 5119: bus

5130:實例方法 5130: instance method

5132:步驟 5132: step

5134:步驟 5134: step

5140:記憶體晶片 5140: Memory chip

5140(1):記憶體組之部分 5140(1): Part of the memory bank

5140(2):記憶體組之部分 5140(2): Part of the memory bank

5140(3):記憶體組之部分 5140(3): Part of the memory bank

5140(4):記憶體組之部分 5140(4): part of the memory bank

5140(5):記憶體組之部分 5140(5): Part of the memory bank

5140(6):記憶體組之部分 5140(6): Part of the memory bank

5141:記憶體墊及相關聯邏輯 5141: memory mat and associated logic

5142:記憶體墊及相關聯邏輯 5142: memory mat and associated logic

5143:記憶體墊及相關聯邏輯 5143: memory mat and associated logic

5144:記憶體墊及相關聯邏輯 5144: memory mat and associated logic

5145:記憶體墊及相關聯邏輯 5145: memory mat and associated logic

5146:記憶體墊及相關聯邏輯 5146: memory mat and associated logic

5147:匯流排 5147: bus

5150(10):記憶體墊 5150(10): memory mat

5150(2):記憶體墊 5150(2): memory mat

5150(3):記憶體墊 5150(3): memory mat

5150(4):記憶體墊 5150(4): memory mat

5150(5):記憶體墊 5150(5): memory mat

5150(6):記憶體墊 5150(6): memory mat

5151(1):記憶體墊 5151(1): memory mat

5151(2):記憶體墊 5151(2): memory mat

5151(3):記憶體墊 5151(3): memory mat

5151(4):記憶體墊 5151(4): memory mat

5151(5):記憶體墊 5151(5): memory mat

5151(6):記憶體墊 5151(6): memory mat

5152(1):記憶體墊/全域字線 5152(1): memory mat/global word line

5152(2):記憶體墊/全域字線 5152(2): memory mat/global word line

5152(3):記憶體墊/全域字線 5152(3): memory mat/global word line

5152(4):記憶體墊/全域字線 5152(4): memory mat/global word line

5152(5):記憶體墊/全域字線 5152(5): memory mat/global word line

5152(6):記憶體墊/全域字線 5152(6): memory mat/global word line

5152(8):全域字線 5152(8): Global word line

5153(1):延遲或隔離電路 5153(1): Delay or isolation circuit

5153(3):延遲或隔離電路 5153(3): Delay or isolation circuit

5154(1):延遲或隔離電路 5154(1): Delay or isolation circuit

5154(3):延遲或隔離電路 5154(3): Delay or isolation circuit

5155(1):正反器 5155(1): Flip-flop

5155(3):正反器 5155(3): Flip-flop

5156(1):正反器 5156(1): Flip-flop

5156(3):正反器 5156(3): Flip-flop

5157(1):開關 5157(1): switch

5157(3):開關 5157(3): switch

5157(8):開關 5157(8): switch

5158(1):開關 5158(1): switch

5158(3):開關 5158(3): switch

5158(8):開關 5158(8): switch

5159(1):反相器閘或緩衝器 5159(1): inverter gate or buffer

5159(3):反相器閘或緩衝器 5159(3): inverter gate or buffer

5159'(1):反相器閘或緩衝器 5159'(1): inverter gate or buffer

5159'(3):反相器閘或緩衝器 5159'(3): inverter gate or buffer

5160(1):列控制單元 5160(1): row control unit

5160(2):單元 5160(2): unit

5160(3):單元 5160(3): unit

5170(1):列部分賦能信號 5170(1): row portion enable signal

5170(2):列部分賦能信號 5170(2): row portion enable signal

5180:全域字線 5180: Global word line

5190:用於操作記憶體單元之方法 5190: Method for operating memory unit

5192:步驟 5192: Step

5194:步驟 5194: step

5200:測試器 5200: Tester

5201:開關 5201: switch

5202:區段/完整晶圓 5202: Segment/complete wafer

5210:晶片(或晶片之晶圓)/積體電路/記憶體 5210: chip (or wafer of chip)/integrated circuit/memory

5211:晶片介面 5211: chip interface

5212:記憶體組 5212: Memory Bank

5213:匯流排 5213: bus

5214:I/O控制器 5214: I/O Controller

5215:邏輯單元/邏輯 5215: Logic Unit/Logic

5216:熔斷器介面 5216: Fuse interface

5217:匯流排 5217: bus

5218:測試單元(TU) 5218: Test Unit (TU)

5219:測試圖案產生器 5219: Test Pattern Generator

5221:寫入測試序列/第一步驟 5221: Write test sequence/first step

5222:讀回測試結果 5222: Read back test results

5223:寫入預期結果序列/第二步驟 5223: Write expected result sequence / second step

5224:讀取故障位址以修復/第三步驟 5224: Read the fault address to repair/the third step

5225:程式化熔斷器/第四步驟 5225: Programmable Fuse/Fourth Step

5231:測試結果 5231: test result

5232:測試程式碼 5232: test code

5300:用於測試記憶體組之方法 5300: Method for testing memory bank

5302:步驟 5302: Step

5310:步驟 5310: step

5320:步驟 5320: steps

5350:用於測試積體電路之記憶體組的方法 5350: Method for testing the memory bank of an integrated circuit

5352:步驟 5352: step

5355:步驟 5355: step

5358:步驟 5358: step

7000:用於分散式處理之方法 7000: Method for decentralized processing

7001:用於分散式處理之方法 7001: Method for decentralized processing

7010:步驟 7010: steps

7020:步驟 7020: steps

7030:步驟 7030: steps

7040:步驟 7040: steps

7050:步驟 7050: steps

7011:記憶體/處理單元 7011: memory/processing unit

7012:記憶體/處理單元 7012: memory/processing unit

7013:記憶體/處理單元 7013: memory/processing unit

7014:積體電路 7014: Integrated Circuit

7015:積體電路 7015: Integrated Circuit

7101:分解式系統 7101: disaggregated system

7102:分解式系統 7102: disaggregated system

7103:分解式系統 7103: disaggregated system

7104:分解式系統 7104: disaggregated system

7110:處理/記憶體子系統 7110: Processing/Memory Subsystem

7120:運算子系統 7120: computing subsystem

7120(1):運算單元PU(1) 7120(1): arithmetic unit PU(1)

7120(n):運算單元PU(n) 7120(n): arithmetic unit PU(n)

7120(N):運算單元PU(N) 7120(N): arithmetic unit PU(N)

7121(1):部分模型更新 7121(1): Partial model update

7121(n):部分模型更新 7121(n): Partial model update

7121(N):部分模型更新 7121(N): Partial model update

7122:經更新模型 7122: updated model

7130:儲存子系統 7130: Storage subsystem

7140:交換子系統 7140: Exchange subsystem

7150:加速器子系統 7150: accelerator subsystem

7200:積體電路/積體晶片 7200: Integrated Circuit/Integrated Chip

7210:記憶體陣列 7210: memory array

7210_1:離散記憶體組 7210_1: Discrete memory bank

7210_2:離散記憶體組 7210_2: Discrete memory bank

7210_j:專用記憶體組 7210_j: dedicated memory bank

7210_J1:離散記憶體組 7210_J1: Discrete memory bank

7210_Jn:離散記憶體組 7210_Jn: Discrete memory bank

7220:處理器子單元 7220: processor subunit

7220_1:處理器子單元 7220_1: processor subunit

7220_2:處理器子單元 7220_2: processor subunit

7220_k:處理器子單元 7220_k: processor subunit

7220_k+1:處理器子單元 7220_k+1: processor subunit

7220_K:處理器子單元 7220_K: processor subunit

7230:通信埠 7230: Communication port

7240:控制器 7240: Controller

7241:網路攻擊偵測器 7241: Cyber Attack Detector

7242:回應模組 7242: Response module

7243:存取控制規則 7243: Access Control Rules

7244:程式/模型操作圖案 7244: program/model operation pattern

7245:篡改偵測器 7245: tamper detector

7246:回應模組 7246: Response module

7247:曲線 7247: Curve

7250:匯流排 7250: bus

7260:第一匯流排 7260: the first bus

7261:第二匯流排 7261: second bus

7270:主機電腦 7270: host computer

7280:主機記憶體 7280: host memory

7281:可改變資料 7281: data can be changed

7282:不可改變資料 7282: Unchangeable data

7283:命令 7283: command

7450:方法 7450: method

7452:步驟 7452: step

7454:步驟 7454: step

7500:第一分散式處理器記憶體晶片 7500: The first distributed processor memory chip

7500':第二分散式處理器記憶體晶片 7500': Second distributed processor memory chip

7500":第三分散式處理器記憶體晶片 7500": The third distributed processor memory chip

7500''':分散式處理器記憶體晶片 7500''': Distributed processor memory chip

7500A:分散式處理器記憶體晶片 7500A: Distributed processor memory chip

7500B:分散式處理器記憶體晶片 7500B: Distributed processor memory chip

7500C:分散式處理器記憶體晶片 7500C: Distributed processor memory chip

7500D:分散式處理器記憶體晶片 7500D: Distributed processor memory chip

7500E:分散式處理器記憶體晶片 7500E: Distributed processor memory chip

7500F:分散式處理器記憶體晶片 7500F: Distributed processor memory chip

7500G:分散式處理器記憶體晶片 7500G: Distributed processor memory chip

7500H:分散式處理器記憶體晶片 7500H: Distributed processor memory chip

7500I:分散式處理器記憶體晶片 7500I: Distributed processor memory chip

7500A':分散式處理器記憶體晶片 7500A': Distributed processor memory chip

7500B':分散式處理器記憶體晶片 7500B': Distributed processor memory chip

7500C':分散式處理器記憶體晶片 7500C': Distributed processor memory chip

7510:記憶體陣列 7510: memory array

7510_1:專用記憶體組 7510_1: dedicated memory bank

7510_2:專用記憶體組 7510_2: dedicated memory bank

7510_3:專用記憶體組 7510_3: dedicated memory bank

7510_4:專用記憶體組 7510_4: dedicated memory bank

7510_5:專用記憶體組 7510_5: dedicated memory bank

7510_6:專用記憶體組 7510_6: dedicated memory bank

7510':記憶體陣列 7510': memory array

7510":記憶體陣列 7510": Memory array

7520:處理陣列 7520: Processing array

7520_1:分散式處理器子單元 7520_1: Distributed processor subunit

7520_2:分散式處理器子單元 7520_2: Distributed processor subunit

7520_3:分散式處理器子單元 7520_3: Distributed processor subunit

7520_4:分散式處理器子單元 7520_4: Distributed processor subunit

7520_5:分散式處理器子單元 7520_5: Distributed processor subunit

7520_6:分散式處理器子單元 7520_6: Distributed processor subunit

7520_K:分散式處理器子單元 7520_K: Distributed processor subunit

7520':處理陣列 7520': Processing array

7520":處理陣列 7520": Processing array

7530:第一通信埠 7530: the first communication port

7530':通信埠 7530': Communication port

7530":通信埠 7530": Communication port

7531:第二通信埠 7531: second communication port

7531':通信埠 7531': Communication port

7531":通信埠 7531": Communication port

7532:第三通信埠 7532: third communication port

7532':通信埠 7532': Communication port

7532":通信埠 7532": Communication port

7533:匯流排 7533: bus

7533':匯流排 7533': bus

7534:匯流排 7534: bus

7534':匯流排 7534': bus

7535:匯流排 7535: bus

7540:控制器 7540: Controller

7540':控制器 7540': Controller

7540":控制器 7540": Controller

7547:控制器及介面模組 7547: Controller and interface module

7548_1:通信介面 7548_1: Communication interface

7548_N:通信介面 7548_N: Communication interface

7570:主機通信埠 7570: Host communication port

7570':埠 7570': Port

7572:晶片埠/通信埠 7572: chip port/communication port

7580:第一匯流排 7580: The first bus

7580':第二匯流排 7580': second bus

7600:分散式處理器記憶體晶片 7600: Distributed processor memory chip

S7710:步驟 S7710: steps

S7720:步驟 S7720: steps

S7730:步驟 S7730: steps

7800:用於偵測儲存於複數個記憶體組之一或多個特定位址中之零值的系統/記憶體單元 7800: System/memory unit used to detect zero values stored in one or more specific addresses in a plurality of memory groups

7810:記憶體晶片 7810: memory chip

7811A:記憶體組 7811A: Memory Bank

7811B:記憶體組 7811B: Memory Bank

7812:IO匯流排 7812: IO bus

7820:主機 7820: host

7830:零值偵測邏輯單元 7830: Zero detection logic unit

7830A:零值指示符線 7830A: Zero indicator line

7830B:零值指示符線 7830B: Zero indicator line

7831:內部零值指示符線 7831: Internal zero indicator line

7840:匯流排 7840: bus

7840A:匯流排 7840A: Bus

7840B:匯流排 7840B: bus

7841:匯流排 7841: bus

7841A:匯流排 7841A: Bus

7911:記憶體組 7911: Memory Bank

7912A:記憶體墊 7912A: memory mat

7912B:記憶體墊 7912B: memory mat

7913A:記憶體墊控制器 7913A: memory mat controller

7913B:記憶體墊控制器 7913B: memory mat controller

7914A:零值偵測邏輯單元 7914A: zero-value detection logic unit

7914B:零值偵測邏輯單元 7914B: zero-value detection logic unit

7915A:區域感測放大器 7915A: local sense amplifier

7915B:區域感測放大器 7915B: local sense amplifier

7916:全域感測放大器 7916: global sense amplifier

7931A:零值指示符線 7931A: Zero indicator line

7931B:零值指示符線 7931B: Zero indicator line

8000:偵測儲存於複數個記憶體組之特定位址中之零值的例示性方法 8000: An exemplary method for detecting zero values stored in specific addresses of multiple memory banks

8010:步驟 8010: steps

8020:步驟 8020: steps

8030:步驟 8030: steps

8100:系統/記憶體單元 8100: system/memory unit

8180:記憶體組 8180: memory bank

8180A:記憶體組 8180A: Memory bank

8180B:記憶體組 8180B: memory bank

8181:記憶體子組 8181: memory subbank

8183A:第一子組列控制器 8183A: first subbank row controller

8183B:第二子組列控制器 8183B: second subbank row controller

8191:組控制器 8191: bank controller

8192:當前及預測位址產生器 8192: Current and predicted address generator

8192A:計數器 8192A: Counter

8192B:當前位址產生器 8192B: current address generator

8192C:預測位址產生器 8192C: Predictive address generator

8193:快取記憶體 8193: Cache memory

8280:雙重控制記憶體組 8280: Dual control memory bank

8290:資料輸入(DIN) 8290: Data input (DIN)

8291:組控制器/列位址(ROW) 8291: bank controller/row address (ROW)

8292:行位址(COLUMN) 8292: column address (COLUMN)

8293:第一命令輸入(COMMAND_1) 8293: The first command input (COMMAND_1)

8294:第二命令輸入(COMMAND_2) 8294: Second command input (COMMAND_2)

8295:資料輸出(Dout) 8295: Data output (Dout)

8400:傳統電腦架構 8400: traditional computer architecture

8402:CPU 8402: CPU

8406:外部記憶體 8406: External memory

8500a:分散式處理器記憶體晶片 8500a: Distributed processor memory chip

8500b:分散式處理器記憶體晶片 8500b: Distributed processor memory chip

8500c:裝置 8500c: device

8502:基板 8502: substrate

8504:暫存器檔案 8504: Register file

8510a:處理群組 8510a: Processing group

8510b:處理群組 8510b: Processing group

8510c:處理群組 8510c: Processing group

8520:記憶體陣列 8520: memory array

8520a:專用記憶體組 8520a: dedicated memory bank

8520b:專用記憶體組 8520b: dedicated memory bank

8520c:專用記憶體組 8520c: dedicated memory bank

8522a:記憶體墊 8522a: memory mat

8522b:記憶體墊 8522b: memory mat

8522c:記憶體墊 8522c: memory mat

8524a:記憶體墊 8524a: memory mat

8524b:記憶體墊 8524b: memory mat

8524c:記憶體墊 8524c: memory mat

8526a:記憶體墊 8526a: memory mat

8526b:記憶體墊 8526b: memory mat

8526c:記憶體墊 8526c: memory mat

8530:處理陣列 8530: processing array

8530a:處理器子單元/加速器 8530a: processor subunit/accelerator

8530b:處理器子單元 8530b: processor subunit

8530c:處理器子單元 8530c: processor subunit

8532a:記憶體墊/暫存器檔案 8532a: memory mat/register file

8532b:記憶體墊 8532b: memory mat

8532c:記憶體墊 8532c: memory mat

8534a:邏輯組件 8534a: logic component

8534b:邏輯組件 8534b: logic component

8534c:邏輯組件 8534c: logical component

8540a:匯流排 8540a: bus

8540b:匯流排 8540b: bus

8540c:匯流排 8540c: bus

8550a:匯流排 8550a: bus

8550b:匯流排 8550b: bus

8560:基板 8560: substrate

8570:第一記憶體組 8570: first memory bank

8572:第二記憶體組 8572: second memory bank

8580:處理單元 8580: processing unit

8582:暫存器檔案 8582: Register file

8584:處理器 8584: processor

8600:流程圖 8600: Flow chart

8602:步驟 8602: step

8604:步驟 8604: step

8606:步驟 8606: step

9010:記憶體/處理單元 9010: memory/processing unit

9011:記憶體/處理單元 9011: memory/processing unit

9012:記憶體/處理單元 9012: memory/processing unit

9013:記憶體/處理單元 9013: memory/processing unit

9014:記憶體/處理單元 9014: memory/processing unit

9015:記憶體/處理單元 9015: memory/processing unit

9018:主機 9018: host

9019:記憶體/處理單元 9019: memory/processing unit

9020:控制器/邏輯 9020: Controller/Logic

9021:內部匯流排 9021: internal bus

9022:匯流排 9022: Bus

9030:邏輯 9030: logic

9033:緩衝器 9033: Buffer

9039:狀態線 9039: Status line

9040:記憶體組 9040: memory bank

9050:向量處理器 9050: vector processor

9070:詞彙表 9070: Vocabulary

9071:擷取金鑰 9071: Retrieval key

9072:字/片語 9072: words/phrases

9073:向量 9073: Vector

9100:資料庫查詢 9100: database query

9101:篩選操作之最終結果 9101: Final result of the filtering operation

9102:部分回應 9102: partial response

9103:完整回應 9103: complete response

9210:儲存裝置 9210: storage device

9211:介面 9211: Interface

9220:記憶體及篩選系統 9220: Memory and filtering system

9220(k):資料庫區段 9220(k): database section

9222:記憶體單元條目 9222: Memory cell entry

9224:篩選單元 9224: Filtering unit

9224':相關性旗標 9224': relevance flag

9225:處理單元 9225: Processing Unit

9227:記憶體/處理單元 9227: memory/processing unit

9228:記憶體/處理系統 9228: Memory/Processing System

9229:仲裁器/記憶體及處理系統 9229: Arbiter/Memory and Processing System

9240:CPU 9240: CPU

9300:用於資料庫分析加速之方法 9300: Method for accelerating database analysis

9301:用於資料庫分析加速之方法 9301: Method for accelerating database analysis

9302:用於資料庫分析加速之方法 9302: Method for accelerating database analysis

9303:用於資料庫分析加速之方法 9303: Method for accelerating database analysis

9304:資料庫分析加速之方法 9304: Methods of accelerating database analysis

9305:資料庫分析加速之方法 9305: Methods of accelerating database analysis

9310:步驟 9310: step

9314:步驟 9314: step

9315:步驟 9315: step

9320:步驟 9320: step

9324:步驟 9324: step

9325:步驟 9325: step

9330:步驟 9330: step

9331:步驟 9331: step

9332:步驟 9332: step

9333:步驟 9333: step

9334:步驟 9334: step

9335:步驟 9335: step

9340:步驟 9340: step

9341:步驟 9341: step

9342:步驟 9342: step

9344:步驟 9344: step

9351:步驟 9351: step

9352:步驟 9352: step

9390:步驟 9390: step

9391:步驟 9391: step

9400:用於嵌入之方法 9400: Method for embedding

9401:用於嵌入之方法 9401: Method for embedding

9402:用於嵌入之方法 9402: Method for embedding

9410:步驟 9410: step

9420:步驟 9420: step

9430:步驟 9430: step

9431:步驟 9431: step

9440:步驟 9440: step

9442:步驟 9442: step

10800:用於至少一個資訊串流之分散式處理的方法 10800: Method for distributed processing of at least one information stream

10810:步驟 10810: step

10820:步驟 10820: step

10830:步驟 10830: step

10840:步驟 10840: step

10850:步驟 10850: step

10900:系統 10900: System

10901:系統 10901: System

10902:系統 10902: system

10903:系統 10903: System

10908:DMA控制器 10908: DMA controller

10909:預處理器 10909: preprocessor

10910:記憶體/處理單元 10910: memory/processing unit

10910(1):記憶體/處理單元 10910(1): memory/processing unit

10910(N):記憶體/處理單元 10910(N): Memory/processing unit

10911(1,1):處理資源 10911(1,1): Processing resources

10911(1,2):處理資源 10911(1,2): Processing resources

10911(1,K):處理資源 10911(1,K): Processing resources

10912(1,1):記憶體資源 10912(1,1): Memory resource

10912(1,2):記憶體資源 10912(1,2): Memory resources

10912(1,J-1):記憶體資源 10912(1,J-1): Memory resource

10912(1,J):記憶體資源 10912(1,J): Memory resource

10915:鏈路 10915: link

10920:處理器 10920: processor

10931:鏈路 10931: link

10931(1):鏈路 10931(1): link

10931(N):鏈路 10931(N): link

10932:鏈路 10932: link

10932(1):鏈路 10932(1): link

10932(N):鏈路 10932(N): link

10933:鏈路 10933: link

11011:混合積體電路 11011: Hybrid integrated circuit

11011':混合積體電路 11011': Hybrid integrated circuit

11012:混合積體電路 11012: Hybrid integrated circuit

11012':導體 11012': Conductor

11013:混合積體電路 11013: Hybrid integrated circuit

11013':混合積體電路 11013': Hybrid integrated circuit

11014:混合積體電路/匯流排 11014: Hybrid integrated circuit/bus

11014':混合積體電路 11014': Hybrid integrated circuit

11015:混合積體電路/匯流排 11015: Hybrid integrated circuit/bus

11015':混合積體電路 11015': Hybrid integrated circuit

11016:混合積體電路/匯流排 11016: Hybrid integrated circuit/bus

11016':混合積體電路 11016': Hybrid integrated circuit

11017:封裝基板 11017: Package substrate

11017':混合積體電路 11017': Hybrid integrated circuit

11018:中介層 11018: Interposer

11018':混合積體電路 11018': Hybrid integrated circuit

11019:基礎晶粒 11019: Base die

11019':記憶體處理單元/混合積體電路 11019': Memory processing unit/hybrid integrated circuit

11020:微凸塊 11020: Micro bump

11021:DRAM晶圓/DRAM晶粒 11021: DRAM wafer/DRAM die

11021':記憶體/處理單元 11021': memory/processing unit

11022:第二記憶體控制器 11022: Second memory controller

11022':導體 11022': Conductor

11023:WOW中間層 11023: WOW middle layer

11030:HBM DRAM堆疊/晶圓 11030: HBM DRAM stack/wafer

11031:第一記憶體控制器 11031: The first memory controller

11032:HDM DRAM記憶體晶片/HDM DRAM晶粒/第二記憶體控制器 11032: HDM DRAM memory chip/HDM DRAM die/second memory controller

11039:TSV 11039: TSV

11040:晶圓/HBM記憶體晶片堆疊 11040: Wafer/HBM memory chip stack

11051:處理器 11051: processor

11052:L2快取記憶體 11052: L2 cache

11053:記憶體單元 11053: memory unit

11061:WOW接合部 11061: WOW joint

11062:第二晶片 11062: second chip

11100:用於記憶體密集型處理之方法 11100: Method for memory-intensive processing

11110:步驟 11110: step

11120:步驟 11120: step

11130:步驟 11130: step

11140:步驟 11140: step

11150:電腦系統 11150: computer system

11200:用於資料庫加速之方法 11200: Method for database acceleration

11210:步驟 11210: step

11220:步驟 11220: step

11230:步驟 11230: step

11240:步驟 11240: step

11250:步驟 11250: step

11260:步驟 11260: step

11270:步驟 11270: step

11271:步驟 11271: step

11272:步驟 11272: step

11300:用於操作資料庫加速積體電路之群組的方法 11300: Method for operating a group of database acceleration integrated circuits

11310:步驟 11310: step

11311:步驟 11311: step

11312:步驟 11312: step

11314:步驟 11314: step

11316:步驟 11316: step

11320:步驟 11320: step

11350:用於資料庫加速之方法 11350: Method for database acceleration

11352:步驟 11352: step

11354:步驟 11354: step

11355:步驟 11355: step

11356:步驟 11356: step

11358:步驟 11358: step

11359:步驟 11359: step

11510:運算系統 11510: computing system

11511:管理器 11511: Manager

11512:運算節點 11512: computing node

11513:管理單元 11513: Management Unit

11520:用於資料庫加速之裝置 11520: Device used for database acceleration

11530:資料庫加速積體電路 11530: Database acceleration integrated circuit

11531:網路通信介面/單元 11531: network communication interface/unit

11531(1):網路通信介面之第一埠/乙太網路埠 11531(1): The first port of the network communication interface/Ethernet port

11531(2):網路通信介面之一或多個第二埠 11531(2): One or more second ports of network communication interface

11531(4):乙太網路埠 11531(4): Ethernet port

11531(5):串列擴展埠 11531(5): Serial expansion port

11531(9):PCIe埠 11531(9): PCIe port

11532:第一處理單元 11532: the first processing unit

11533:記憶體控制器 11533: Memory Controller

11534:大輸送量介面 11534: High-throughput interface

11535:資料庫加速單元 11535: Database acceleration unit

11536:互連件 11536: Interconnect

11537:密碼編譯引擎 11537: Cryptographic engine

11538:二階靜態隨機存取記憶體(L2 SRAM) 11538: Level-2 static random access memory (L2 SRAM)

11540:SATA控制器 11540: SATA controller

11545:RDMA單元 11545: RDMA unit

11546:遠端RAM 11546: remote RAM

11547:乙太網路記憶體DIMM/資料庫加速子單元 11547: Ethernet memory DIMM/database accelerator subunit

11548:三階L3記憶體 11548: Level-3 (L3) memory

11549:DMA引擎 11549: DMA engine

11550:記憶體資源 11550: Memory resources

11551:記憶體處理積體電路 11551: Memory processing integrated circuit

11560:儲存系統 11560: storage system

11561:本端儲存單元 11561: Local storage unit

11563:非揮發性記憶體(NVM) 11563: Non-volatile memory (NVM)

11571:快取記憶體 11571: Cache memory

11572:獨立資料庫處理單元 11572: Independent database processing unit

11573:資料庫處理子單元 11573: database processing subunit

11574:DB加速器之可重組態陣列 11574: Reconfigurable array of DB accelerator

11575:共用記憶體單元 11575: Shared memory unit

11576:可組態鏈路或互連件 11576: Configurable link or interconnect

11580:刀鋒/群組 11580: Blade/Group

11590:交換器 11590: switch

11601:PCIe交換器 11601: PCIe switch

11611:交換系統 11611: Switching system

11612:儲存系統 11612: storage system

11613:運算系統 11613: computing system

11615:用於資料庫加速之一或多個裝置 11615: One or more devices used for database acceleration

11621:系統 11621: system

11622:系統 11622: system

11700:用於資料庫加速之方法 11700: Method for database acceleration

11710:步驟 11710: step

11720:步驟 11720: step

11730:步驟 11730: step

11740:步驟 11740: step

11750:步驟 11750: step

11760:步驟 11760: step

12010:資料庫 12010: database

12011:伺服器主機板 12011: Server motherboard

12012:CPU 12012: CPU

12013:記憶體單元 12013: memory unit

12014:資料庫加速器 12014: Database Accelerator

12020:資料庫 12020: database

12021:管理單元 12021: Management Unit

12022:DB加速器板 12022: DB accelerator board

12024:處理器 12024: processor

12026:記憶體/處理單元 12026: memory/processing unit

12031:RDMA引擎 12031: RDMA engine

12033:DDR控制器 12033: DDR controller

12034:DB查詢資料庫引擎 12034: DB query database engine

12040:混合系統 12040: Hybrid system

12042:處理器 12042: processor

12043:記憶體/處理單元(MPU) 12043: Memory/Processing Unit (MPU)

12044:緊密微控制器 12044: compact microcontroller

12049:快取記憶體 12049: Cache memory

12050:系統 12050: System

12051:交換器 12051: Switch

12052:AI加速伺服器 12052: AI acceleration server

12053:網路介面卡(NIC) 12053: Network Interface Card (NIC)

12054:中央處理單元(CPU) 12054: Central Processing Unit (CPU)

12055:伺服器主機板 12055: Server motherboard

12056:RAM 12056: RAM

12057:AI加速器 12057: AI accelerator

12060:系統 12060: System

12061:交換器 12061: Switch

12063:AI處理及網路連接單元 12063: AI processing and network connection unit

12064:伺服器主機板 12064: Server motherboard

A:矩陣 A: Matrix

A1:資料/分片 A1: Data/Shard

A15:資料/分片 A15: Data/Shard

B:矩陣 B: Matrix

B1:資料/分片 B1: data/shard

B2:資料/分片 B2: data/shard

B3:資料/分片 B3: data/shard

B4:資料/分片 B4: data/shard

B5:資料/分片 B5: data/shard

B6:資料/分片 B6: data/shard

B7:資料/分片 B7: data/shard

B8:資料/分片 B8: data/shard

B9:資料/分片 B9: data/shard

B10:資料/分片 B10: data/shard

B11:資料/分片 B11: data/shard

B12:資料/分片 B12: data/shard

B13:資料/分片 B13: data/shard

B14:資料/分片 B14: data/shard

B15:資料/分片 B15: data/shard

RD:列解碼器 RD: Row decoder

COL:行解碼器 COL: Column decoder

併入於本發明中且構成本發明之一部分的隨附圖式說明各種所揭示實施例。在圖式中： The accompanying drawings, which are incorporated in and constitute a part of this disclosure, illustrate various disclosed embodiments. In the drawings:

圖1為中央處理單元(CPU)之圖解表示。 Figure 1 is a diagrammatic representation of a central processing unit (CPU).

圖2為圖形處理單元(GPU)之圖解表示。 Figure 2 is a diagrammatic representation of a graphics processing unit (GPU).

圖3A為符合所揭示實施例之例示性硬體晶片之實施例的圖解表示。 FIG. 3A is a diagrammatic representation of an embodiment of an exemplary hardware chip in accordance with the disclosed embodiment.

圖3B為符合所揭示實施例之例示性硬體晶片之另一實施例的圖解表示。 FIG. 3B is a diagrammatic representation of another embodiment of an exemplary hardware chip in accordance with the disclosed embodiment.

圖4為符合所揭示實施例之由例示性硬體晶片執行之一般命令的圖解表示。 Figure 4 is a diagrammatic representation of general commands executed by an exemplary hardware chip consistent with the disclosed embodiments.

圖5為符合所揭示實施例之由例示性硬體晶片執行之專門命令的圖解表示。 Figure 5 is a diagrammatic representation of a special command executed by an exemplary hardware chip consistent with the disclosed embodiment.

圖6為符合所揭示實施例之供用於例示性硬體晶片中之處理群組的圖解表示。 Figure 6 is a diagrammatic representation of processing groups for use in an exemplary hardware chip consistent with the disclosed embodiments.

圖7A為符合所揭示實施例之處理群組之矩形陣列的圖解表示。 Figure 7A is a diagrammatic representation of a rectangular array of processing groups consistent with the disclosed embodiment.

圖7B為符合所揭示實施例之處理群組之橢圓形陣列的圖解表示。 Figure 7B is a diagrammatic representation of an elliptical array of processing groups consistent with the disclosed embodiment.

圖7C為符合所揭示實施例之硬體晶片之一陣列的圖解表示。 Figure 7C is a diagrammatic representation of an array of hardware chips in accordance with the disclosed embodiments.

圖7D為符合所揭示實施例之硬體晶片之另一陣列的圖解表示。 Figure 7D is a diagrammatic representation of another array of hardware chips in accordance with the disclosed embodiments.

圖8為描繪符合所揭示實施例之用於編譯一系列指令以供在例示性硬體晶片上執行的例示性方法之流程圖。 FIG. 8 is a flowchart depicting an exemplary method for compiling a series of instructions for execution on an exemplary hardware chip consistent with the disclosed embodiments.

圖9為記憶體組之圖解表示。 Figure 9 is a diagrammatic representation of the memory bank.

圖10為記憶體組之圖解表示。 Figure 10 is a diagrammatic representation of the memory bank.

圖11為符合所揭示實施例之具有子組控制件的例示性記憶體組之一實施例的圖解表示。 FIG. 11 is a diagrammatic representation of an embodiment of an exemplary memory bank with subbank controls, consistent with the disclosed embodiments.

圖12為符合所揭示實施例之具有子組控制件的例示性記憶體組之另一實施例的圖解表示。 FIG. 12 is a diagrammatic representation of another embodiment of an exemplary memory bank with subbank controls, consistent with the disclosed embodiments.

圖13為符合所揭示實施例之例示性記憶體晶片的方塊圖。 FIG. 13 is a block diagram of an exemplary memory chip in accordance with the disclosed embodiment.

圖14為符合所揭示實施例之例示性冗餘邏輯區塊集合的方塊圖。 FIG. 14 is a block diagram of an exemplary set of redundant logic blocks in accordance with the disclosed embodiment.

圖15為符合所揭示實施例之例示性邏輯區塊的方塊圖。 FIG. 15 is a block diagram of an exemplary logic block in accordance with the disclosed embodiment.

圖16為符合所揭示實施例之與匯流排連接之例示性邏輯區塊的方塊圖。 FIG. 16 is a block diagram of an exemplary logic block connected to the bus in accordance with the disclosed embodiment.

圖17為符合所揭示實施例之串聯連接之例示性邏輯區塊的方塊圖。 FIG. 17 is a block diagram of an exemplary logic block connected in series in accordance with the disclosed embodiment.

圖18為符合所揭示實施例之成二維陣列連接之例示性邏輯區塊的方塊圖。 FIG. 18 is a block diagram of an exemplary logic block connected in a two-dimensional array in accordance with the disclosed embodiment.

圖19為符合所揭示實施例之成複雜連接之例示性邏輯區塊的方塊圖。 FIG. 19 is a block diagram of an exemplary logic block forming a complex connection in accordance with the disclosed embodiment.

圖20為說明符合所揭示實施例之冗餘區塊賦能處理程序的例示性流程圖。 FIG. 20 is an exemplary flowchart illustrating a redundant block enabling processing procedure in accordance with the disclosed embodiment.

圖21為說明符合所揭示實施例之位址指派處理程序的例示性流程圖。 FIG. 21 is an exemplary flowchart illustrating an address assignment processing procedure in accordance with the disclosed embodiment.

圖22提供符合所揭示實施例之例示性處理裝置的方塊圖。 Figure 22 provides a block diagram of an exemplary processing device in accordance with the disclosed embodiments.

圖23為符合所揭示實施例之例示性處理裝置的方塊圖。 Figure 23 is a block diagram of an exemplary processing device in accordance with the disclosed embodiments.

圖24包括符合所揭示實施例之例示性記憶體組態圖。 Figure 24 includes an exemplary memory configuration diagram consistent with the disclosed embodiments.

圖25為說明符合所揭示實施例之記憶體組態處理程序的例示性流程圖。 FIG. 25 is an exemplary flowchart illustrating a memory configuration processing procedure in accordance with the disclosed embodiment.

圖26為說明符合所揭示實施例之記憶體讀取處理程序的例示性流程圖。 FIG. 26 is an exemplary flowchart illustrating a memory read processing procedure in accordance with the disclosed embodiment.

圖27為說明符合所揭示實施例之處理程序執行的例示性流程圖。 FIG. 27 is an exemplary flowchart illustrating the execution of a processing program in accordance with the disclosed embodiment.

圖28展示符合本發明之具有再新控制器的實例記憶體晶片。 Figure 28 shows an example memory chip with a refresh controller, consistent with the present invention.

圖29A展示符合本發明之一實例再新控制器。 Figure 29A shows an example refresh controller consistent with the present invention.

圖29B展示符合本發明之另一實例再新控制器。 Figure 29B shows another example refresh controller consistent with the present invention.

圖30為符合本發明之由再新控制器執行之處理程序的實例流程圖。 Fig. 30 is an example flowchart of a process executed by a refresh controller, consistent with the present invention.

圖31為符合本發明之由編譯器實施之處理程序的一實例流程圖。 FIG. 31 is a flowchart of an example of a processing program implemented by a compiler according to the present invention.

圖32為符合本發明之由編譯器實施之處理程序的另一實例流程圖。 FIG. 32 is a flowchart of another example of a processing program implemented by a compiler according to the present invention.

圖33展示符合本發明之根據所儲存圖案組態的實例再新控制器。 Figure 33 shows an example refresh controller configured according to a stored pattern, consistent with the present invention.

圖34為符合本發明之由再新控制器內之軟體實施的處理程序的實例流程圖。 FIG. 34 is an example flowchart of a process implemented by software within a refresh controller, consistent with the present invention.

圖35A展示符合本發明之包括晶粒的實例晶圓。 Figure 35A shows an example wafer including dies in accordance with the present invention.

圖35B展示符合本發明之連接至輸入/輸出匯流排的實例記憶體晶片。 Figure 35B shows an example memory chip connected to the input/output bus in accordance with the present invention.

圖35C展示符合本發明之包括成列配置且連接至輸入輸出匯流排之記憶體晶片的實例晶圓。 Figure 35C shows an example wafer including memory chips arranged in rows and connected to the input and output bus in accordance with the present invention.

圖35D展示符合本發明之形成群組且連接至輸入輸出匯流排的兩個記憶體晶片。 FIG. 35D shows two memory chips that are grouped and connected to the input and output bus in accordance with the present invention.

圖35E展示符合本發明之實例晶圓，其包括以六邊形晶格置放且連接至輸入輸出匯流排之晶粒。 Figure 35E shows an example wafer in accordance with the present invention, which includes dies placed in a hexagonal lattice and connected to an input and output bus.

圖36A至圖36D展示符合本發明之連接至輸入/輸出匯流排之記憶體晶片的各種可能組態。 36A to 36D show various possible configurations of the memory chip connected to the input/output bus in accordance with the present invention.

圖37展示符合本發明之共用膠合邏輯(glue logic)之晶粒的實例分組。 Figure 37 shows an example grouping of dies sharing glue logic, consistent with the present invention.

圖38A至圖38B展示符合本發明之穿過晶圓的實例切割。 Figures 38A to 38B show example cutting through the wafer in accordance with the present invention.

圖38C展示符合本發明之晶圓上之晶粒的實例配置及輸入輸出匯流排之配置。 FIG. 38C shows an example configuration of the die on a wafer and the configuration of the input/output bus in accordance with the present invention.

圖39展示符合本發明之具有互連處理器子單元之晶圓上的實例記憶體晶片。 Figure 39 shows an example memory chip on a wafer with interconnected processor subunits in accordance with the present invention.

圖40為符合本發明之自晶圓佈置記憶體晶片之群組的處理程序的一實例流程圖。 FIG. 40 is an example flow chart of the processing procedure for arranging groups of memory chips from the wafer in accordance with the present invention.

圖41A為符合本發明之自晶圓佈置記憶體晶片之群組的處理程序的另一實例流程圖。 FIG. 41A is a flowchart of another example of a processing procedure for arranging a group of memory chips from a wafer according to the present invention.

圖41B至圖41C為符合本發明之判定用於自晶圓切割記憶體晶片之一或多個群組的切割圖案之處理程序的實例流程圖。 41B to 41C are flow charts of an example of a processing procedure for determining a cutting pattern for cutting one or more groups of memory chips from a wafer according to the present invention.

圖42展示符合本發明的提供沿著行之雙埠存取的記憶體晶片內之電路系統的實例。 FIG. 42 shows an example of circuitry within a memory chip that provides dual-port access along columns, consistent with the present invention.

圖43展示符合本發明的提供沿著列之雙埠存取的記憶體晶片內之電路系統的實例。 FIG. 43 shows an example of a circuit system in a memory chip that provides dual port access along a row in accordance with the present invention.

圖44展示符合本發明的提供沿著列及行兩者之雙埠存取的記憶體晶片內之電路系統的實例。 FIG. 44 shows an example of circuitry within a memory chip that provides dual-port access along both rows and columns, consistent with the present invention.

圖45A展示使用複製記憶體陣列或墊的雙讀取。 Figure 45A shows double read using a replicated memory array or pad.

圖45B展示使用複製記憶體陣列或墊的雙寫入。 Figure 45B shows double writing using a replicated memory array or pad.

圖46展示符合本發明之具有用於沿著列之雙埠存取的開關元件之記憶體晶片內之電路系統的實例。 FIG. 46 shows an example of a circuit system in a memory chip having switching elements for dual-port access along a row in accordance with the present invention.

圖47A為符合本發明之用於在單埠記憶體陣列或墊上提供雙埠存取之一處理程序的實例流程圖。 FIG. 47A is an example flowchart of a process for providing dual-port access on a single-port memory array or pad according to the present invention.

圖47B為符合本發明之用於在單埠記憶體陣列或墊上提供雙埠存取之另一處理程序的實例流程圖。 FIG. 47B is an example flowchart of another processing procedure for providing dual-port access on a single-port memory array or pad according to the present invention.

圖48展示符合本發明之提供沿著列及行兩者之雙埠存取的記憶體晶片記憶體晶片內之電路系統的另一實例。 FIG. 48 shows another example of circuitry within a memory chip that provides dual-port access along both rows and columns, consistent with the present invention.

圖49展示符合本發明之用於記憶體墊內的雙埠存取之開關元件的實例。 FIG. 49 shows an example of a switch element for dual-port access in a memory pad according to the present invention.

圖50說明符合本發明之具有經組態以存取部分字之縮減單元的實例積體電路。 Figure 50 illustrates an example integrated circuit with a reduction unit configured to access partial words, consistent with the present invention.

圖51說明用於使用如關於圖50所描述之縮減單元的記憶體組。 FIG. 51 illustrates a memory bank for use with a reduction unit as described with respect to FIG. 50.

圖52說明符合本發明之使用整合至PIM邏輯中之縮減單元的記憶體組。 FIG. 52 illustrates a memory bank that uses a reduction unit integrated into the PIM logic, consistent with the present invention.

圖53說明符合本發明之使用PIM邏輯以啟動用於存取部分字之開關的記憶體組。 FIG. 53 illustrates a memory bank that uses PIM logic to activate switches for accessing partial words, consistent with the present invention.

圖54A說明符合本發明之具有用於不啟動以存取部分字之分段行多工器的記憶體組。 FIG. 54A illustrates a memory bank with segmented column multiplexers that are left unactivated in order to access partial words, consistent with the present invention.

圖54B為符合本發明之用於記憶體中的部分字存取之處理程序的實例流程圖。 FIG. 54B is an example flowchart of a processing procedure for partial word access in memory according to the present invention.

圖55說明包括多個記憶體墊的現有記憶體晶片。 FIG. 55 illustrates a conventional memory chip including multiple memory pads.

圖56說明符合本發明之具有用於在線斷開期間減少功率消耗之啟動電路的一實例記憶體晶片。 Figure 56 illustrates an example memory chip with activation circuits for reducing power consumption during line opening, consistent with the present invention.

圖57說明符合本發明之具有用於在線斷開期間減少功率消耗之啟動電路的另一實例記憶體晶片。 FIG. 57 illustrates another example memory chip with activation circuits for reducing power consumption during line opening, consistent with the present invention.

圖58說明符合本發明之具有用於在線斷開期間減少功率消耗之啟動電路的又一實例記憶體晶片。 FIG. 58 illustrates yet another example memory chip with activation circuits for reducing power consumption during line opening, consistent with the present invention.

圖59說明符合本發明之具有用於在線斷開期間減少功率消耗之啟動電路的額外實例記憶體晶片。 FIG. 59 illustrates an additional example memory chip with activation circuits for reducing power consumption during line opening, consistent with the present invention.

圖60說明符合本發明之具有用於在線斷開期間減少功率消耗之全域字線及區域字線的一實例記憶體晶片。 FIG. 60 illustrates an example memory chip with global word lines and local word lines for reducing power consumption during line opening, consistent with the present invention.

圖61說明符合本發明之具有用於在線斷開期間減少功率消耗之全域字線及區域字線的另一實例記憶體晶片。 Fig. 61 illustrates another example memory chip with global word lines and local word lines for reducing power consumption during line opening, consistent with the present invention.

圖62為符合本發明之用於依序斷開記憶體中的線之處理程序的實例流程圖。 FIG. 62 is an example flowchart of a process for sequentially opening lines in a memory, consistent with the present invention.

圖63說明用於記憶體晶片之一現有測試器。 Figure 63 illustrates an existing tester used for memory chips.

圖64說明用於記憶體晶片之另一現有測試器。 Figure 64 illustrates another conventional tester used for memory chips.

圖65說明符合本發明之使用與記憶體在同一基板上的邏輯單元測試記憶體晶片的一實例。 FIG. 65 illustrates an example of testing a memory chip using logic cells on the same substrate as the memory according to the present invention.

圖66說明符合本發明之使用與記憶體在同一基板上的邏輯單元測試記憶體晶片的另一實例。 FIG. 66 illustrates another example of testing a memory chip using logic cells on the same substrate as the memory according to the present invention.

圖67說明符合本發明之使用與記憶體在同一基板上的邏輯單元測試記憶體晶片的又一實例。 FIG. 67 illustrates another example of testing a memory chip using logic cells on the same substrate as the memory according to the present invention.

圖68說明符合本發明之使用與記憶體在同一基板上的邏輯單元測試記憶體晶片的額外實例。 FIG. 68 illustrates an additional example of testing a memory chip using logic cells on the same substrate as the memory in accordance with the present invention.

圖69說明符合本發明之使用與記憶體在同一基板上的邏輯單元測試記憶體晶片的另一實例。 FIG. 69 illustrates another example of testing a memory chip using logic cells on the same substrate as the memory according to the present invention.

圖70為符合本發明之用於測試記憶體晶片之一處理程序的實例流程圖。 FIG. 70 is an example flow chart of a processing procedure for testing memory chips in accordance with the present invention.

圖71為符合本發明之用於測試記憶體晶片之另一處理程序的實例流程圖。 FIG. 71 is an example flowchart of another processing procedure for testing memory chips in accordance with the present invention.

圖72A為符合本發明之實施例的包括記憶體陣列及處理陣列之積體電路的圖解表示。 FIG. 72A is a diagrammatic representation of an integrated circuit including a memory array and a processing array in accordance with an embodiment of the present invention.

圖72B為符合本發明之實施例的積體電路內部之記憶體區的圖解表示。 FIG. 72B is a diagrammatic representation of a memory area inside an integrated circuit according to an embodiment of the present invention.

圖73A為符合本發明之實施例的具有控制器之實例組態的積體電路之圖解表示。 Figure 73A is a diagrammatic representation of an integrated circuit with an example configuration of a controller in accordance with an embodiment of the present invention.

圖73B為符合本發明之實施例的用於同時執行複製模型的組態之圖解表示。 Figure 73B is a diagrammatic representation of a configuration for executing duplicate models simultaneously, in accordance with an embodiment of the present invention.

圖74A為符合本發明之實施例的具有控制器之另一實例組態的積體電路之圖解表示。 Figure 74A is a diagrammatic representation of an integrated circuit with another example configuration of a controller in accordance with an embodiment of the present invention.

圖74B為根據例示性所揭示實施例之保護積體電路的方法之流程圖表示。 FIG. 74B is a flowchart representation of a method of protecting an integrated circuit according to an exemplary disclosed embodiment.

圖74C為根據例示性所揭示實施例之位於晶片內之各個點處的偵測元件之圖解表示。 FIG. 74C is a diagrammatic representation of detection elements located at various points within a chip, according to exemplary disclosed embodiments.

圖75A為符合本發明之實施例的包括複數個分散式處理器記憶體晶片之可擴展處理器記憶體系統的圖解表示。 FIG. 75A is a diagrammatic representation of an expandable processor memory system including a plurality of distributed processor memory chips in accordance with an embodiment of the present invention.

圖75B為符合本發明之實施例的包括複數個分散式處理器記憶體晶片之可擴展處理器記憶體系統的圖解表示。 FIG. 75B is a diagrammatic representation of an expandable processor memory system including a plurality of distributed processor memory chips in accordance with an embodiment of the present invention.

圖75C為符合本發明之實施例的包括複數個分散式處理器記憶體晶片之可擴展處理器記憶體系統的圖解表示。 FIG. 75C is a diagrammatic representation of an expandable processor memory system including a plurality of distributed processor memory chips in accordance with an embodiment of the present invention.

圖75D為符合本發明之實施例的雙埠分散式處理器記憶體晶片之圖解表示。 FIG. 75D is a diagrammatic representation of a dual-port distributed processor memory chip in accordance with an embodiment of the present invention.

圖75E為符合本發明之實施例的實例時序圖。 FIG. 75E is an example timing diagram according to an embodiment of the present invention.

圖76為符合本發明之實施例的具有整合式控制器及介面模組且構成可擴展處理器記憶體系統之處理器記憶體晶片的圖解表示。 FIG. 76 is a diagrammatic representation of processor memory chips that have an integrated controller and interface module and that make up an expandable processor memory system, in accordance with an embodiment of the present invention.

圖77為符合本發明之實施例的用於在圖75A中所展示之可擴展處理器記憶體系統中的處理器記憶體晶片之間傳送資料的流程圖。 FIG. 77 is a flowchart for transferring data between processor memory chips in the expandable processor memory system shown in FIG. 75A according to an embodiment of the present invention.

圖78A說明符合本發明之實施例的用於在晶片層級偵測儲存於實施於記憶體晶片中之複數個記憶體組的一或多個特定位址中之零值的系統。 FIG. 78A illustrates a system for detecting zero values stored in one or more specific addresses of a plurality of memory banks implemented in a memory chip at the chip level in accordance with an embodiment of the present invention.

圖78B說明符合本發明之實施例的用於在記憶體組層級偵測儲存於複數個記憶體組之特定位址中的一或多者中之零值的記憶體晶片。 FIG. 78B illustrates a memory chip for detecting, at the memory bank level, zero values stored in one or more specific addresses of a plurality of memory banks, in accordance with an embodiment of the present invention.

圖79說明符合本發明之實施例的用於在記憶體墊層級偵測儲存於複數個記憶體墊之特定位址中的一或多者中之零值的記憶體組。 FIG. 79 illustrates a memory bank for detecting, at the memory pad level, zero values stored in one or more specific addresses of a plurality of memory pads, in accordance with an embodiment of the present invention.

圖80為說明符合本發明之實施例的偵測複數個離散記憶體組之特定位址中之零值的例示性方法之流程圖。 FIG. 80 is a flowchart illustrating an exemplary method for detecting zero values in specific addresses of a plurality of discrete memory banks, in accordance with an embodiment of the present invention.

圖81A說明符合本發明之實施例的用於基於下一列預測啟動與記憶體組相關聯之下一列的系統。 FIG. 81A illustrates a system for activating the next row associated with a memory bank based on the prediction of the next row in accordance with an embodiment of the present invention.

圖81B說明符合本發明之實施例的圖81A之系統的另一實施例。 FIG. 81B illustrates another embodiment of the system of FIG. 81A in accordance with an embodiment of the present invention.

圖81C說明符合本發明之實施例的每一記憶體子組之第一及第二子組列控制器。 FIG. 81C illustrates first and second subbank row controllers for each memory subbank, in accordance with an embodiment of the present invention.

圖81D說明符合本發明之實施例的下一列預測之實施例。 Figure 81D illustrates an embodiment of next row prediction in accordance with an embodiment of the present invention.

圖81E說明符合本發明之實施例的記憶體組之實施例。 FIG. 81E illustrates an embodiment of a memory bank in accordance with an embodiment of the present invention.

圖81F說明符合本發明之實施例的記憶體組之另一實施例。 FIG. 81F illustrates another embodiment of the memory bank in accordance with the embodiment of the present invention.

圖82說明符合本發明之實施例的用於減少記憶體列啟動懲罰之雙重控制記憶體組。 FIG. 82 illustrates a dual-control memory bank for reducing memory row activation penalty in accordance with an embodiment of the present invention.

圖83A說明存取及啟動記憶體組之列的第一實例。 Figure 83A illustrates the first example of accessing and activating a row of memory banks.

圖83B說明存取及啟動記憶體組之列的第二實例。 FIG. 83B illustrates a second example of accessing and activating a row of memory banks.

圖83C說明存取及啟動記憶體組之列的第三實例。 FIG. 83C illustrates a third example of accessing and activating a row of memory banks.

圖84提供習知CPU/暫存器檔案及外部記憶體架構之圖解表示。 Figure 84 provides a graphical representation of the conventional CPU/register file and external memory architecture.

圖85A說明符合一個實施例之具有充當暫存器檔案之記憶體墊的例示性分散式處理器記憶體晶片。 Figure 85A illustrates an exemplary distributed processor memory chip with memory pads acting as register files in accordance with one embodiment.

圖85B說明符合另一實施例之具有經組態以充當暫存器檔案之記憶體墊的例示性分散式處理器記憶體晶片。 FIG. 85B illustrates an exemplary distributed processor memory chip with a memory pad configured to act as a register file in accordance with another embodiment.

圖85C說明符合另一實施例之具有充當暫存器檔案之記憶體墊的例示性裝置。 FIG. 85C illustrates an exemplary device with a memory pad acting as a register file in accordance with another embodiment.

圖86提供表示符合所揭示實施例之用於在分散式處理器記憶體晶片中執行至少一個指令的例示性方法之流程圖。 FIG. 86 provides a flowchart representing an exemplary method for executing at least one instruction in a distributed processor memory chip consistent with the disclosed embodiments.

圖87A包括分解式伺服器之實例； Figure 87A includes an example of a disaggregated server;

圖87B為分散式處理之實例； Figure 87B is an example of distributed processing;

圖87C為記憶體/處理單元之實例； Figure 87C is an example of a memory/processing unit;

圖87D為記憶體/處理單元之實例； Figure 87D is an example of a memory/processing unit;

圖87E為記憶體/處理單元之實例； Figure 87E is an example of a memory/processing unit;

圖87F為包括記憶體/處理單元及一或多個通信模組之積體電路的實例； FIG. 87F is an example of an integrated circuit including a memory/processing unit and one or more communication modules;

圖87G為包括記憶體/處理單元及一或多個通信模組之積體電路的實例； FIG. 87G is an example of an integrated circuit including a memory/processing unit and one or more communication modules;

圖87H為方法之實例； Figure 87H is an example of the method;

圖87I為方法之實例； Figure 87I is an example of the method;

FIG. 88A is an example of a method;

FIG. 88B is an example of a method;

FIG. 88C is an example of a method;

FIG. 89A is an example of a memory/processing unit and a vocabulary;

FIG. 89B is an example of a memory/processing unit;

FIG. 89C is an example of a memory/processing unit;

FIG. 89D is an example of a memory/processing unit;

FIG. 89E is an example of a memory/processing unit;

FIG. 89F is an example of a memory/processing unit;

FIG. 89G is an example of a memory/processing unit;

FIG. 89H is an example of a memory/processing unit;

FIG. 90A is an example of a system;

FIG. 90B is an example of a system;

FIG. 90C is an example of a system;

FIG. 90D is an example of a system;

FIG. 90E is an example of a system;

FIG. 90F is an example of a method;

FIG. 91A is an example of a memory and filtering system, a storage device, and a CPU;

FIG. 91B is an example of a memory and processing system, a storage device, and a CPU;

FIG. 92A is an example of a memory and processing system, a storage device, and a CPU;

FIG. 92B is an example of a memory/processing unit;

FIG. 92C is an example of a memory and filtering system, a storage device, and a CPU;

FIG. 92D is an example of a memory and processing system, a storage device, and a CPU;

FIG. 92E is an example of a memory and processing system, a storage device, and a CPU;

FIG. 92F is an example of a method;

FIG. 92G is an example of a method;

FIG. 92H is an example of a method;

FIG. 92I is an example of a method;

FIG. 92J is an example of a method;

FIG. 92K is an example of a method;

FIG. 93A is a cross-sectional view of an example of a hybrid integrated circuit;

FIG. 93B is a cross-sectional view of an example of a hybrid integrated circuit;

FIG. 93C is a cross-sectional view of an example of a hybrid integrated circuit;

FIG. 93D is a cross-sectional view of an example of a hybrid integrated circuit;

FIG. 93E is a top view of an example of a hybrid integrated circuit;

FIG. 93F is a top view of an example of a hybrid integrated circuit;

FIG. 93G is a top view of an example of a hybrid integrated circuit;

FIG. 93H is a cross-sectional view of an example of a hybrid integrated circuit;

FIG. 93I is a cross-sectional view of an example of a hybrid integrated circuit;

FIG. 93J is an example of a method;

FIG. 94A is an example of a storage system, one or more devices, and a computing system;

FIG. 94B is an example of a storage system, one or more devices, and a computing system;

FIG. 94C is an example of one or more devices and a computing system;

FIG. 94D is an example of one or more devices and a computing system;

FIG. 94E is an example of a database acceleration integrated circuit;

FIG. 94F is an example of a database acceleration integrated circuit;

FIG. 94G is an example of a database acceleration integrated circuit;

FIG. 94H is an example of a database acceleration unit;

FIG. 94I is an example of a blade and a group of database acceleration integrated circuits;

FIG. 94J is an example of a group of database acceleration integrated circuits;

FIG. 94K is an example of a group of database acceleration integrated circuits;

FIG. 94L is an example of a group of database acceleration integrated circuits;

FIG. 94M is an example of a group of database acceleration integrated circuits;

FIG. 94N is an example of a system;

FIG. 94O is an example of a system;

FIG. 94P is an example of a method;

FIG. 95A is an example of a method;

FIG. 95B is an example of a method;

FIG. 95C is an example of a method;

FIG. 96A is an example of a prior art system;

FIG. 96B is an example of a system;

FIG. 96C is an example of a database accelerator board;

FIG. 96D is an example of a part of a system;

FIG. 97A is an example of a prior art system;

FIG. 97B is an example of a system; and

FIG. 97C is an example of an AI network interface card.

The following detailed description refers to the accompanying drawings. Wherever convenient, the same reference numbers are used in the drawings and the following description to refer to the same or similar parts. While several illustrative embodiments are described herein, modifications, adaptations, and other implementations are possible. For example, substitutions, additions, or modifications may be made to the components illustrated in the drawings, and the illustrative methods described herein may be modified by substituting, reordering, removing, or adding steps to the disclosed methods. Accordingly, the following detailed description is not limited to the disclosed embodiments and examples. Instead, the proper scope is defined by the appended claims.

Processor architecture

As used throughout the present invention, the term "hardware chip" refers to a semiconductor wafer (such as silicon or the like) on which one or more circuit elements (such as transistors, capacitors, resistors, and/or the like) are formed. The circuit elements may form processing elements or memory elements. A "processing element" refers to one or more circuit elements that, together, perform at least one logic function (such as an arithmetic function, a logic gate, another Boolean operation, or the like). A processing element may be a general-purpose processing element (such as a configurable plurality of transistors) or a special-purpose processing element (such as a particular logic gate or a plurality of circuit elements designed to perform a particular logic function). A "memory element" refers to one or more circuit elements that can be used to store data. A "memory element" may also be referred to as a "memory cell." A memory element may be dynamic (such that electrical refreshes are required to maintain the stored data), static (such that data persists for at least some time after power loss), or non-volatile memory.

Processing elements may be joined to form processor subunits. A "processor subunit" may thus comprise the smallest grouping of processing elements that can execute at least one task or instruction (for example, of a processor instruction set). For example, a subunit may comprise one or more general-purpose processing elements configured to execute instructions together, one or more general-purpose processing elements paired with one or more special-purpose processing elements configured to execute instructions in a complementary fashion, or the like. The processor subunits may be arranged in an array on a substrate (for example, a wafer). Although the "array" may have a rectangular shape, any arrangement of the subunits in the array may be formed on the substrate.

Memory elements may be joined to form memory banks. For example, a memory bank may comprise one or more rows of memory elements linked along at least one wire (or other conductive connection). Furthermore, the memory elements may be linked along at least one additional wire in another direction. For example, the memory elements may be arranged along word lines and bit lines, as explained below. Although a memory bank may comprise multiple rows, any arrangement of the elements in the bank may be used to form the bank on the substrate. Moreover, one or more banks may be electrically joined to at least one memory controller to form a memory array. Although the memory array may comprise a rectangular arrangement of banks, any arrangement of the banks in the array may be formed on the substrate.

As further used throughout the present invention, a "bus" refers to any communicative connection between elements of a substrate. For example, a wire or a line (forming an electrical connection), an optical fiber (forming an optical connection), or any other connection conducting communications between components may be referred to as a "bus."

Conventional processors pair general-purpose logic circuits with shared memories. The shared memories may store both the instruction sets executed by the logic circuits and the data used for, and resulting from, execution of those instruction sets. As described below, some conventional processors use a caching system to reduce delays in fetching from the shared memory; however, conventional caching systems remain shared. Conventional processors include central processing units (CPUs), graphics processing units (GPUs), various application-specific integrated circuits (ASICs), or the like. FIG. 1 shows an example of a CPU, and FIG. 2 shows an example of a GPU.

As shown in FIG. 1, a CPU 100 may comprise a processing unit 110 that includes one or more processor subunits, such as processor subunit 120a and processor subunit 120b. Although not depicted in FIG. 1, each processor subunit may comprise a plurality of processing elements. Moreover, processing unit 110 may include one or more levels of on-chip cache. Such cache elements are generally formed on the same semiconductor die as processing unit 110, rather than being connected to processor subunits 120a and 120b via one or more buses formed in the substrate containing processor subunits 120a and 120b and the cache elements. An arrangement directly on the same die, rather than connected via buses, is common for both first-level (L1) and second-level (L2) caches in conventional processors. Alternatively, in older processors, L2 caches were shared among processor subunits using back-side buses between the subunits and the L2 caches. Back-side buses are generally larger than the front-side buses described below. Accordingly, because the cache is to be shared by all processor subunits on the die, cache 130 may be formed on the same die as processor subunits 120a and 120b or communicatively coupled to processor subunits 120a and 120b via one or more back-side buses. In both embodiments without buses (for example, where the cache is formed directly on the die) and embodiments using back-side buses, the caches are shared between the processor subunits of the CPU.

Moreover, processing unit 110 communicates with shared memory 140a and memory 140b. For example, memories 140a and 140b may represent memory banks of shared dynamic random access memory (DRAM). Although depicted with two banks, most conventional memory chips include between eight and sixteen memory banks. Accordingly, processor subunits 120a and 120b may use shared memories 140a and 140b to store data that is then operated on by processor subunits 120a and 120b. This arrangement, however, results in the buses between memories 140a and 140b and processing unit 110 acting as a bottleneck when the clock speed of processing unit 110 exceeds the data transfer speed of the buses. This is generally the case for conventional processors, resulting in an effective processing speed lower than the rated processing speed based on clock rate and number of transistors.
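
The bottleneck described above can be illustrated with a short calculation. The following Python sketch is not part of the disclosure; the function name and all numeric values are hypothetical and chosen only to show how a shared bus can cap the effective processing speed below the rated speed.

```python
def effective_ops_per_second(clock_hz, ops_per_cycle,
                             bus_bytes_per_second, bytes_per_op):
    """Return the achievable operation rate when every operation must
    fetch bytes_per_op bytes over a shared bus (hypothetical model)."""
    rated = clock_hz * ops_per_cycle                  # rate the logic could sustain
    bus_limit = bus_bytes_per_second / bytes_per_op   # rate the bus can feed
    return min(rated, bus_limit)


# Hypothetical numbers: a 2 GHz unit issuing 4 operations per cycle,
# each operation needing 8 bytes from shared memory.
rated_speed = effective_ops_per_second(2e9, 4, float("inf"), 8)
bus_bound_speed = effective_ops_per_second(2e9, 4, 16e9, 8)

print(rated_speed)      # limited only by clock rate and issue width
print(bus_bound_speed)  # limited by the 16 GB/s bus
```

Under these assumed numbers the bus, not the logic, determines the effective speed, which is the situation the preceding paragraph describes for conventional processors.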

As shown in FIG. 2, similar deficiencies persist in GPUs. A GPU 200 may comprise a processing unit 210 that includes one or more processor subunits (for example, subunits 220a, 220b, 220c, 220d, 220e, 220f, 220g, 220h, 220i, 220j, 220k, 220l, 220m, 220n, 220o, and 220p). Moreover, processing unit 210 may include one or more levels of on-chip cache and/or register files. Such cache elements are generally formed on the same semiconductor die as processing unit 210. Indeed, in the example of FIG. 2, cache 210 is formed on the same die as processing unit 210 and shared among all of the processor subunits, while caches 230a, 230b, 230c, and 230d are each formed on a subset of the processor subunits and dedicated to that subset.

Moreover, processing unit 210 communicates with shared memories 250a, 250b, 250c, and 250d. For example, memories 250a, 250b, 250c, and 250d may represent memory banks of shared DRAM. Accordingly, the processor subunits of processing unit 210 may use shared memories 250a, 250b, 250c, and 250d to store data that is then operated on by the processor subunits. This arrangement, however, results in the buses between memories 250a, 250b, 250c, and 250d and processing unit 210 acting as a bottleneck, similar to the bottleneck described above for CPUs.

Overview of disclosed hardware chips

FIG. 3A is a diagrammatic representation of an embodiment depicting an exemplary hardware chip 300. Hardware chip 300 may comprise a distributed processor designed to mitigate the bottlenecks described above for CPUs, GPUs, and other conventional processors. A distributed processor may include a plurality of processor subunits distributed spatially on a single substrate. Moreover, as explained above, in the distributed processors of the present invention, the corresponding memory banks are also spatially distributed on the substrate. In some embodiments, a distributed processor may be associated with a set of instructions, and each of the processor subunits of the distributed processor may be responsible for performing one or more tasks included in the set of instructions.

As depicted in FIG. 3A, hardware chip 300 may comprise a plurality of processor subunits, for example, logic and control subunits 320a, 320b, 320c, 320d, 320e, 320f, 320g, and 320h. As further depicted in FIG. 3A, each processor subunit may have a dedicated memory instance. For example, logic and control subunit 320a is operably connected to dedicated memory instance 330a, logic and control subunit 320b is operably connected to dedicated memory instance 330b, logic and control subunit 320c is operably connected to dedicated memory instance 330c, logic and control subunit 320d is operably connected to dedicated memory instance 330d, logic and control subunit 320e is operably connected to dedicated memory instance 330e, logic and control subunit 320f is operably connected to dedicated memory instance 330f, logic and control subunit 320g is operably connected to dedicated memory instance 330g, and logic and control subunit 320h is operably connected to dedicated memory instance 330h.

Although FIG. 3A depicts each memory instance as a single memory bank, hardware chip 300 may include two or more memory banks as a dedicated memory instance for a processor subunit on hardware chip 300. Furthermore, although FIG. 3A depicts each processor subunit as comprising both a logic component and a control for the dedicated memory bank(s), hardware chip 300 may use controls for the memory banks that are, at least in part, separate from the logic components. Moreover, as depicted in FIG. 3A, two or more processor subunits and their corresponding memory banks may be grouped, for example, into processing groups 310a, 310b, 310c, and 310d. A "processing group" may represent a spatial distinction on the substrate on which hardware chip 300 is formed. Accordingly, a processing group may include further controls for the memory banks in the group, for example, controls 340a, 340b, 340c, and 340d. Additionally or alternatively, a "processing group" may represent a logical grouping for the purpose of compiling code for execution on hardware chip 300. Accordingly, a compiler for hardware chip 300 (further described below) may divide an overall set of instructions among the processing groups on hardware chip 300.

Furthermore, host 350 may provide instructions, data, and other input to hardware chip 300 and read output from the hardware chip. Accordingly, a set of instructions may be executed entirely on a single die, for example, the die hosting hardware chip 300. Indeed, the only communications off-die may include the loading of instructions to hardware chip 300, any input sent to hardware chip 300, and any output read from hardware chip 300. Accordingly, all calculations and memory operations may be performed on-die (on hardware chip 300) because the processor subunits of hardware chip 300 communicate with the dedicated memory banks of hardware chip 300.
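
The arrangement of FIG. 3A, in which each processor subunit works only on its own dedicated memory instance and the host merely loads input and reads output, can be modeled in a few lines. The following Python sketch is a simplified software analogy under stated assumptions; the class names, method names, and bank size are hypothetical and do not describe the actual circuitry.

```python
class ProcessorSubunit:
    """Hypothetical model of a logic and control subunit with one dedicated bank."""

    def __init__(self, name, bank_size):
        self.name = name
        self.bank = [0] * bank_size  # dedicated memory instance, private to this subunit

    def load(self, values):
        if len(values) > len(self.bank):
            raise ValueError("data does not fit in the dedicated bank")
        self.bank[:len(values)] = values

    def run(self, task):
        # The task reads only this subunit's own bank, so no shared bus is involved.
        return task(self.bank)


class HardwareChip:
    """Hypothetical model of a chip holding several subunit/bank pairs."""

    def __init__(self, names, bank_size):
        self.subunits = {n: ProcessorSubunit(n, bank_size) for n in names}

    def host_load(self, data_by_subunit):
        # Off-die communication: the host sends input to each dedicated bank.
        for name, values in data_by_subunit.items():
            self.subunits[name].load(values)

    def execute(self, task):
        # On-die work: every subunit applies the task to its own bank.
        return {name: su.run(task) for name, su in self.subunits.items()}


chip = HardwareChip(["320a", "320b"], bank_size=4)
chip.host_load({"320a": [1, 2, 3, 4], "320b": [10, 20, 30, 40]})
results = chip.execute(sum)  # the host then reads back only the outputs
print(results)
```

The dictionary comprehension in `execute` runs sequentially in Python; on the hardware described above, the subunits would operate in parallel.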

FIG. 3B is a diagrammatic representation of an embodiment depicting another exemplary hardware chip 300'. Although depicted as an alternative to hardware chip 300, the architecture depicted in FIG. 3B may be combined, at least in part, with the architecture depicted in FIG. 3A.

As depicted in FIG. 3B, hardware chip 300' may comprise a plurality of processor subunits, for example, processor subunits 350a, 350b, 350c, and 350d. As further depicted in FIG. 3B, each processor subunit may have a plurality of dedicated memory instances. For example, processor subunit 350a is operably connected to dedicated memory instances 330a and 330b, processor subunit 350b is operably connected to dedicated memory instances 330c and 330d, processor subunit 350c is operably connected to dedicated memory instances 330e and 330f, and processor subunit 350d is operably connected to dedicated memory instances 330g and 330h. Moreover, as depicted in FIG. 3B, the processor subunits and their corresponding memory banks may be grouped, for example, into processing groups 310a, 310b, 310c, and 310d. As explained above, a "processing group" may represent a spatial distinction on the substrate on which hardware chip 300' is formed and/or a logical grouping for the purpose of compiling code for execution on hardware chip 300'.

As further depicted in FIG. 3B, the processor subunits may communicate with each other via buses. For example, as shown in FIG. 3B, processor subunit 350a may communicate with processor subunit 350b via bus 360a, with processor subunit 350c via bus 360c, and with processor subunit 350d via bus 360f. Similarly, processor subunit 350b may communicate with processor subunit 350a via bus 360a (as described above), with processor subunit 350c via bus 360e, and with processor subunit 350d via bus 360d. In addition, processor subunit 350c may communicate with processor subunit 350a via bus 360c (as described above), with processor subunit 350b via bus 360e (as described above), and with processor subunit 350d via bus 360b. Accordingly, processor subunit 350d may communicate with processor subunit 350a via bus 360f (as described above), with processor subunit 350b via bus 360d (as described above), and with processor subunit 350c via bus 360b (as described above). One of ordinary skill in the art will understand that fewer buses than depicted in FIG. 3B may be used. For example, bus 360e may be eliminated such that communications between processor subunits 350b and 350c pass through processor subunit 350a and/or 350d. Similarly, bus 360f may be eliminated such that communications between processor subunit 350a and processor subunit 350d pass through processor subunit 350b or 350c.
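
The effect of eliminating a bus can be sketched as a path search over the bus connections listed above. The following Python example is illustrative only: the bus-to-subunit mapping is taken from the description of FIG. 3B, while the `route` function and its breadth-first search are an assumed software model, not a routing mechanism stated in the disclosure.

```python
from collections import deque

# Bus connections as described for FIG. 3B.
BUSES = {
    "360a": ("350a", "350b"),
    "360b": ("350c", "350d"),
    "360c": ("350a", "350c"),
    "360d": ("350b", "350d"),
    "360e": ("350b", "350c"),
    "360f": ("350a", "350d"),
}


def route(src, dst, removed=()):
    """Return a shortest list of subunits from src to dst, or None.

    `removed` names buses that are left out of the design (hypothetical)."""
    neighbors = {}
    for bus, (u, v) in BUSES.items():
        if bus in removed:
            continue
        neighbors.setdefault(u, []).append(v)
        neighbors.setdefault(v, []).append(u)

    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        if path[-1] == dst:
            return path
        for nxt in neighbors.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


direct = route("350b", "350c")                       # uses bus 360e
relayed = route("350b", "350c", removed={"360e"})    # passes through 350a or 350d
print(direct, relayed)
```

With bus 360e removed, the path gains one intermediate subunit, which matches the statement that communications then pass through processor subunit 350a and/or 350d.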

Moreover, one of ordinary skill in the art will understand that architectures other than those depicted in FIGS. 3A and 3B may be used. For example, an array of processing groups, each with a single processor subunit and memory instance, may be arranged on a substrate. Processor subunits may additionally or alternatively form part of controllers for corresponding dedicated memory banks, part of controllers for memory mats of corresponding dedicated memory, or the like.

In view of the architecture described above, hardware chips 300 and 300' may significantly increase efficiency for memory-intensive tasks as compared with traditional architectures. For example, database operations and artificial intelligence algorithms (such as neural networks) are examples of memory-intensive tasks for which traditional architectures are less efficient than hardware chips 300 and 300'. Accordingly, hardware chips 300 and 300' may be referred to as database accelerator processors and/or artificial intelligence accelerator processors.

Configuring the disclosed hardware chips

The hardware chip architecture described above may be configured for execution of code. For example, each processor subunit may individually execute code (defining a set of instructions) apart from the other processor subunits in the hardware chip. Accordingly, rather than relying on an operating system to manage multithreading or using multitasking (which is concurrent rather than parallel), hardware chips of the present invention may allow the processor subunits to operate fully in parallel.

In addition to the fully parallel implementation described above, at least some of the instructions assigned to each processor subunit may overlap. For example, a plurality of processor subunits on a distributed processor may execute overlapping instructions as, for example, an implementation of an operating system or other management software, while executing non-overlapping instructions in order to perform parallel tasks within the context of the operating system or other management software.

FIG. 4 depicts an exemplary process 400 for executing a general command with processing group 410. For example, processing group 410 may comprise a portion of a hardware chip of the present invention (for example, hardware chip 300, hardware chip 300', or the like).

As depicted in FIG. 4, a command may be sent to processor subunit 430, which is paired with dedicated memory instance 420. An external host (for example, host 350) may send the command to processing group 410 for execution. Alternatively, host 350 may have sent a set of instructions including the command for storage in memory instance 420, such that processor subunit 430 may retrieve the command from memory instance 420 and execute the retrieved command. Accordingly, the command may be executed by processing element 440, which is a general-purpose processing element configurable to execute the received command. Moreover, processing group 410 may include a control 460 for memory instance 420. As depicted in FIG. 4, control 460 may perform any reads from and/or writes to memory instance 420 that processing element 440 requires when executing the received command. After execution of the command, processing group 410 may output the result of the command, for example, to the external host or to a different processing group on the same hardware chip.
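
The two command paths described above (a command sent directly by the host, or a command previously stored in the memory instance and then retrieved) can be sketched as follows. This Python model is a rough analogy only; the `(op, src_a, src_b, dst)` command encoding and all method names are hypothetical and are not taken from the disclosure.

```python
class ProcessingGroup:
    """Hypothetical software model of processing group 410."""

    def __init__(self, memory):
        self.memory = list(memory)  # stands in for memory instance 420

    # The two helpers below stand in for control 460, which performs the
    # reads and writes needed while a command executes.
    def _read(self, addr):
        return self.memory[addr]

    def _write(self, addr, value):
        self.memory[addr] = value

    def execute(self, command):
        """Execute one (op, src_a, src_b, dst) command and return its result."""
        op, src_a, src_b, dst = command
        a, b = self._read(src_a), self._read(src_b)
        result = {"add": a + b, "sub": a - b, "mul": a * b}[op]
        self._write(dst, result)
        return result

    def execute_stored(self, addr):
        """Retrieve a command the host stored in memory earlier, then run it."""
        return self.execute(self._read(addr))


group = ProcessingGroup([2, 3, 0, ("mul", 0, 1, 2)])
sent_result = group.execute(("add", 0, 1, 2))  # command sent by the host
stored_result = group.execute_stored(3)        # command retrieved from memory
print(sent_result, stored_result)
```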

In some embodiments, as depicted in FIG. 4, processor subunit 430 may further include an address generator 450. An "address generator" may comprise a plurality of processing elements configured to determine addresses in one or more memory banks for performing reads and writes, and may also perform operations on the data located at the determined addresses (for example, addition, subtraction, multiplication, or the like). For example, address generator 450 may determine addresses for any reads from or writes to memory. In one example, address generator 450 may increase efficiency by overwriting a read value with a new value determined based on the command when the read value is no longer needed. Additionally or alternatively, address generator 450 may select available addresses for storing results from execution of the command. This may allow readout of the results to be scheduled for a later clock cycle, when it is more convenient for the external host. In another example, address generator 450 may determine addresses to read from and write to during a multi-cycle calculation, such as a vector or matrix multiply-accumulate calculation. Accordingly, address generator 450 may maintain or calculate memory addresses for reading data and writing intermediate results of the multi-cycle calculation, such that processor subunit 430 may continue processing without having to store these memory addresses.
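
The role of an address generator in a multi-cycle multiply-accumulate calculation can be sketched in software. In the following Python example, the generator function plays the part of address generator 450 by producing the read addresses for each cycle, so that the accumulating loop does not track them itself. The function names, the flat list used as a memory bank, and the stride parameter are assumptions made for illustration.

```python
def mac_addresses(base_a, base_b, length, stride=1):
    """Yield one (addr_a, addr_b) pair per cycle of a multiply-accumulate pass."""
    for i in range(length):
        yield base_a + i * stride, base_b + i * stride


def dot_product(bank, base_a, base_b, length, result_addr):
    """Accumulate bank[a] * bank[b] over the generated addresses and store
    the result at result_addr (an address assumed to be free)."""
    acc = 0
    for addr_a, addr_b in mac_addresses(base_a, base_b, length):
        acc += bank[addr_a] * bank[addr_b]
    bank[result_addr] = acc  # the result stays in memory for later readout
    return acc


bank = [1, 2, 3, 4, 5, 6, 0]  # two 3-element vectors plus one result slot
total = dot_product(bank, base_a=0, base_b=3, length=3, result_addr=6)
print(total, bank)
```

Storing the result in the bank rather than returning it immediately mirrors the point above that readout may be scheduled for a later clock cycle.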

FIG. 5 depicts an exemplary process 500 for executing a specialized command with processing group 510. For example, processing group 510 may comprise a portion of a hardware chip of the present invention (for example, hardware chip 300, hardware chip 300', or the like).

As depicted in FIG. 5, a specialized command (e.g., a multiply-accumulate command) may be sent to processing element 530, which is paired with dedicated memory instance 520. An external host (e.g., host 350) may send the command to processing element 530 for execution. Accordingly, the command may be executed at a given signal from the host by processing element 530, a specialized processing element configurable to execute particular commands (including the received command). Alternatively, processing element 530 may retrieve the command from memory instance 520 for execution. Thus, in the example of FIG. 5, processing element 530 is a multiply-accumulate (MAC) circuit configured to execute MAC commands received from the external host or retrieved from memory instance 520. After executing the command, processing group 510 may output the result of the command, for example, to the external host or to a different processing group on the same hardware chip. Although depicted with a single command and a single result, a plurality of commands may be received or retrieved and executed, and a plurality of results may be combined on processing group 510 before output.

Although depicted as a MAC circuit in FIG. 5, additional or alternative specialized circuits may be included in processing group 510. For example, a MAX-read command (which returns the maximum value of a vector), a MAX0-read command (a common function also termed a rectifier, which returns the entire vector but also takes the maximum with 0), or the like may be implemented.
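As a plain software illustration of the two read commands just mentioned, the following sketch shows one reading of their semantics. The function names `max_read` and `max0_read` are hypothetical, and the element-wise interpretation of MAX0 (each element replaced by the larger of itself and 0) is an assumption based on its description as a rectifier.

```python
def max_read(vector):
    """MAX-read: return the largest element of the vector."""
    return max(vector)


def max0_read(vector):
    """MAX0-read (rectifier): return the whole vector, clamping negatives to 0."""
    return [max(x, 0) for x in vector]
```

A dedicated circuit would perform these operations as the data is read out of the memory bank; the sketch only captures the input-to-output behavior.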

Although depicted separately, the general processing group 410 of FIG. 4 and the specialized processing group 510 of FIG. 5 may be combined. For example, a general processor subunit may be coupled to one or more specialized processor subunits to form a processor subunit. Accordingly, the general processor subunit may be used for all instructions that cannot be executed by the one or more specialized processor subunits.

One of ordinary skill in the art will understand that neural network implementations and other memory-intensive tasks may be handled with specialized logic circuits. For example, database queries, packet inspection, string comparison, and other functions may gain efficiency when executed by the hardware chips described herein.

Memory-based architecture for distributed processing

On hardware chips consistent with the present invention, dedicated buses may transfer data between processor subunits on the chip and/or between the processor subunits and their corresponding dedicated memory banks. The use of dedicated buses may reduce arbitration costs because competing requests are either impossible or easily avoided using software rather than hardware.

FIG. 6 schematically depicts a diagrammatic representation of a processing group 600. Processing group 600 may be used in a hardware chip (e.g., hardware chip 300, hardware chip 300', or the like). Processor subunit 610 may be connected to memory 620 via bus 630. Memory 620 may comprise a random access memory (RAM) element that stores data and code for execution by processor subunit 610. In some embodiments, memory 620 may be an N-way memory (where N is a number equal to or greater than 1 that indicates the number of segments in an interleaved memory 620). Because processor subunit 610 is coupled via bus 630 to memory 620, which is dedicated to processor subunit 610, N may be kept relatively small without compromising execution performance. This represents an improvement over conventional multiway register files or caches, in which a lower N generally results in lower execution performance and a higher N generally results in large area and power penalties.
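To make the role of N concrete, the following sketch maps a flat word address onto one of N interleaved segments. Low-order interleaving (segment selected by the address modulo N) is an assumption made here for illustration only; the specification does not state which interleaving scheme is used, and the function name `locate` is hypothetical.

```python
def locate(address, n_ways):
    """Map a flat word address to (segment, offset) under low-order interleaving."""
    if n_ways < 1:
        raise ValueError("N must be at least 1")
    return address % n_ways, address // n_ways
```

Under this assumed scheme, consecutive addresses fall in different segments, so a small N already lets sequential accesses overlap; with N equal to 1 the memory degenerates to a single segment.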

The size of memory 620, the number of ways, and the width of bus 630 may be adjusted to meet the requirements of the tasks and application implementations of a system using processing group 600, for example, according to the size of the data involved in one or more tasks. Memory element 620 may comprise one or more types of memory known in the art, for example, volatile memory (such as RAM, DRAM, SRAM, phase-change RAM (PRAM), magnetoresistive RAM (MRAM), resistive RAM (ReRAM), or the like) or non-volatile memory (such as flash memory or ROM). According to some embodiments, one portion of memory element 620 may comprise a first memory type, while another portion may comprise another memory type. For example, the code region of memory element 620 may comprise a ROM element, while the data region of memory element 620 may comprise a DRAM element. Another example of such partitioning is storing the weights of a neural network in flash memory while storing the data used for calculation in DRAM.

Processor subunit 610 comprises a processing element 640, which may comprise a processor. The processor may be pipelined or non-pipelined, and may be a customized reduced instruction set computing (RISC) element or another processing scheme implemented on any commercial integrated circuit (IC) known in the art (such as ARM, ARC, RISC-V, etc.), as appreciated by one of ordinary skill in the art. Processing element 640 may comprise a controller that, in some embodiments, includes an arithmetic logic unit (ALU) or other controller.

According to some embodiments, processing element 640, which executes received or stored code, may comprise a general processing element and may therefore be flexible and capable of performing a wide variety of processing operations. When comparing the power consumed during execution of a particular operation, non-dedicated circuitry typically consumes more power than circuitry dedicated to that operation. Therefore, when performing particular complex arithmetic calculations, processing element 640 may consume more power and perform less efficiently than dedicated hardware. Accordingly, in some embodiments, the controller of processing element 640 may be designed to perform particular operations (e.g., addition or "move" operations).

In one example, particular operations may be performed by one or more accelerators 650. Each accelerator may be dedicated and programmed to perform a particular calculation (such as multiplication, floating-point vector operations, or the like). By using accelerators, the average power consumed per calculation per processor subunit may be lowered, and calculation throughput thereby increased. Accelerators 650 may be chosen according to the application that the system is designed to implement (e.g., execution of neural networks, execution of database queries, or the like). Accelerators 650 may be configured by processing element 640 and may operate in tandem with it to lower power consumption and accelerate calculations and computations. The accelerators may additionally or alternatively be used to transfer data between memory and the MUX/DEMUX/input/output ports of processing group 600 (e.g., MUX 660 and DEMUX 670), for example as a smart direct memory access (DMA) peripheral.

Accelerators 650 may be configured to perform a variety of functions. For example, one accelerator may be configured to perform the 16-bit floating-point calculations or 8-bit integer calculations commonly used in neural networks. Another example of an accelerator function is the 32-bit floating-point calculation commonly used during the training stage of a neural network. Yet another example of an accelerator function is query processing, such as that used in databases. In some embodiments, accelerators 650 may comprise specialized processing elements to perform these functions and/or may be configured according to configuration data stored on memory element 620 such that they may be modified.

Accelerators 650 may additionally or alternatively implement a configurable scripted list of memory movements to time the movement of data to and from memory 620 or to and from other accelerators and/or inputs/outputs. Accordingly, as explained further below, all data movement inside a hardware chip using processing group 600 may use software synchronization rather than hardware synchronization. For example, an accelerator in one processing group (e.g., group 600) may transfer data from its input to its accelerator every tenth cycle and then output the data on the next cycle, thereby letting information flow from the memory of the processing group to another memory.
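The every-tenth-cycle example above can be sketched as a small cycle-level simulation. The function name `run_movement_script`, the `period` parameter, and the use of a queue for the input are hypothetical; the point of the sketch is only that the ordering of transfers follows from the cycle count, with no handshake or arbitration signal modelled.

```python
from collections import deque


def run_movement_script(input_words, total_cycles, period=10):
    """Simulate one accelerator following a fixed, cycle-timed movement list.

    On cycles where cycle % period == 0 the accelerator latches one word from
    its input; on the following cycle it drives that word onto its output.
    Returns the (cycle, word) pairs observed on the output.
    """
    if period < 2:
        raise ValueError("period must leave room for a separate output cycle")
    pending = deque(input_words)
    held = None
    output_log = []
    for cycle in range(total_cycles):
        if cycle % period == 0 and pending:
            held = pending.popleft()           # input -> accelerator
        elif cycle % period == 1 and held is not None:
            output_log.append((cycle, held))   # accelerator -> output
            held = None
    return output_log
```

Because the output cycles are known in advance, a receiving processing group can be programmed to read on exactly those cycles, which is the sense in which synchronization is done in software.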

As further depicted in FIG. 6, in some embodiments, processing group 600 may further comprise at least one input multiplexer (MUX) 660 connected to its input port and at least one output DEMUX 670 connected to its output port. These MUXs/DEMUXs may be controlled by control signals (not shown) from processing element 640 and/or from one of accelerators 650, the control signals being determined according to the current instruction being carried out by processing element 640 and/or the operation being executed by one of accelerators 650. In some scenarios, processing group 600 may be required (according to a predefined instruction from its code memory) to transfer data from its input ports to its output ports. Accordingly, in addition to each of the DEMUXs/MUXs being connected to processing element 640 and accelerators 650, one or more of the input MUXs (e.g., MUX 660) may be directly connected via one or more buses to an output DEMUX (e.g., DEMUX 670).

Processing group 600 of FIG. 6 may be arrayed to form a distributed processor, for example, as depicted in FIG. 7A. The processing groups may be disposed on substrate 710 to form an array. In some embodiments, substrate 710 may comprise a semiconductor substrate, such as silicon. Additionally or alternatively, substrate 710 may comprise a circuit board, such as a flexible circuit board.

As depicted in FIG. 7A, substrate 710 may include a plurality of processing groups, such as processing group 600, disposed thereon. Accordingly, substrate 710 includes a memory array that includes a plurality of banks, such as banks 720a, 720b, 720c, 720d, 720e, 720f, 720g, and 720h. Furthermore, substrate 710 includes a processing array that may include a plurality of processor subunits, such as subunits 730a, 730b, 730c, 730d, 730e, 730f, 730g, and 730h.

Furthermore, as explained above, each processing group may include a processor subunit and one or more corresponding memory banks dedicated to that processor subunit. Accordingly, as depicted in FIG. 7A, each subunit is associated with a corresponding dedicated memory bank: processor subunit 730a is associated with memory bank 720a, processor subunit 730b is associated with memory bank 720b, processor subunit 730c is associated with memory bank 720c, processor subunit 730d is associated with memory bank 720d, processor subunit 730e is associated with memory bank 720e, processor subunit 730f is associated with memory bank 720f, processor subunit 730g is associated with memory bank 720g, and processor subunit 730h is associated with memory bank 720h.

To allow each processor subunit to communicate with its corresponding dedicated memory bank(s), substrate 710 may include a first plurality of buses, each connecting one of the processor subunits to its corresponding dedicated memory bank(s). Accordingly, bus 740a connects processor subunit 730a to memory bank 720a, bus 740b connects processor subunit 730b to memory bank 720b, bus 740c connects processor subunit 730c to memory bank 720c, bus 740d connects processor subunit 730d to memory bank 720d, bus 740e connects processor subunit 730e to memory bank 720e, bus 740f connects processor subunit 730f to memory bank 720f, bus 740g connects processor subunit 730g to memory bank 720g, and bus 740h connects processor subunit 730h to memory bank 720h. Moreover, to allow each processor subunit to communicate with other processor subunits, substrate 710 may include a second plurality of buses, each connecting one of the processor subunits to another of the processor subunits. In the example of FIG. 7A, bus 750a connects processor subunit 730a to processor subunit 730e, bus 750b connects processor subunit 730a to processor subunit 730b, bus 750c connects processor subunit 730b to processor subunit 730f, bus 750d connects processor subunit 730b to processor subunit 730c, bus 750e connects processor subunit 730c to processor subunit 730g, bus 750f connects processor subunit 730c to processor subunit 730d, bus 750g connects processor subunit 730d to processor subunit 730h, bus 750h connects processor subunit 730h to processor subunit 730g, bus 750i connects processor subunit 730g to processor subunit 730f, and bus 750j connects processor subunit 730f to processor subunit 730e.

Accordingly, in the example arrangement shown in FIG. 7A, the plurality of logical processor subunits is arranged in at least one row and at least one column. The second plurality of buses connects each processor subunit to at least one adjacent processor subunit in the same row and to at least one adjacent processor subunit in the same column. The arrangement of FIG. 7A may be referred to as a "partial block connection."

The arrangement shown in FIG. 7A may be modified to form a "complete block connection." A complete block connection includes additional buses connecting diagonal processor subunits. For example, the second plurality of buses may include additional buses between processor subunit 730a and processor subunit 730f, between processor subunit 730b and processor subunit 730e, between processor subunit 730b and processor subunit 730g, between processor subunit 730c and processor subunit 730f, between processor subunit 730c and processor subunit 730h, and between processor subunit 730d and processor subunit 730g.
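The difference between the two connection schemes can be sketched by enumerating the buses of a small grid. The sketch assumes a 2×4 layout with subunits 730a to 730d in the top row and 730e to 730h in the bottom row, which is inferred from the diagonal pairs listed above rather than stated in the text; the function name `grid_buses` and the coordinate representation are hypothetical.

```python
def grid_buses(rows, cols, diagonals=False):
    """Return the set of buses in a grid, each as a frozenset of two (row, col) cells.

    diagonals=False gives the row/column links of a partial block connection;
    diagonals=True adds the diagonal links of a complete block connection.
    """
    steps = [(0, 1), (1, 0)]          # right and down neighbours
    if diagonals:
        steps += [(1, 1), (1, -1)]    # down-right and down-left neighbours
    buses = set()
    for r in range(rows):
        for c in range(cols):
            for dr, dc in steps:
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols:
                    buses.add(frozenset({(r, c), (nr, nc)}))
    return buses
```

For the assumed 2×4 layout, the partial scheme yields ten buses, matching the count of buses 750a to 750j, and the complete scheme adds six diagonal buses, matching the six additional pairs listed above.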

A complete block connection may be used for convolution calculations, in which data and results stored in nearby processor subunits are used. For example, during convolutional image processing, each processor subunit may receive a block of the image (such as a pixel or a group of pixels). To compute the convolution results, each processor subunit may acquire data from all eight adjacent processor subunits, each of which has received a corresponding block. In a partial block connection, data from the diagonally adjacent processor subunits may be passed through the other adjacent processor subunits connected to the processor subunit. Accordingly, the distributed processor on the chip may be an artificial intelligence accelerator processor.

In a specific example of a convolution calculation, an N×M image may be divided across a plurality of processor subunits. Each processor subunit may perform a convolution with an A×B filter on its corresponding block. To perform the filtering on one or more pixels on a boundary between blocks, each processor subunit may require data from neighboring processor subunits whose blocks include pixels on the same boundary. Accordingly, the code generated for each processor subunit configures the subunit to compute the convolutions and to pull data from the second plurality of buses whenever data is needed from an adjacent subunit. Corresponding commands to output data to the second plurality of buses are provided to the subunits to ensure proper timing of the required data transfers.
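The need for boundary data from neighboring subunits can be illustrated with a simplified software sketch. Several simplifications are assumptions made for brevity and are not taken from the specification: the image is split into horizontal bands rather than two-dimensional blocks, image borders are zero-padded, the filter is applied as a sliding-window sum of products without flipping the kernel, and the names `correlate_same` and `tiled_correlate` are hypothetical.

```python
def correlate_same(image, kernel):
    """Reference: zero-padded, same-size sliding-window filter over the whole image."""
    n, m = len(image), len(image[0])
    a, b = len(kernel), len(kernel[0])
    out = [[0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            for u in range(a):
                for v in range(b):
                    y, x = i + u - a // 2, j + v - b // 2
                    if 0 <= y < n and 0 <= x < m:
                        out[i][j] += image[y][x] * kernel[u][v]
    return out


def tiled_correlate(image, kernel, band_height):
    """Each 'subunit' owns a band of rows and borrows boundary rows from neighbours."""
    a = len(kernel)
    top, bottom = a // 2, a - 1 - a // 2   # rows needed above and below a band
    if band_height < max(top, bottom, 1):
        raise ValueError("band too short to supply a neighbour's boundary rows")
    bands = [image[r:r + band_height] for r in range(0, len(image), band_height)]
    result = []
    for k, band in enumerate(bands):
        # Rows that would arrive over the buses from the adjacent subunits.
        above = bands[k - 1][len(bands[k - 1]) - top:] if k > 0 else []
        below = bands[k + 1][:bottom] if k + 1 < len(bands) else []
        local = correlate_same(above + band + below, kernel)
        result.extend(local[len(above):len(above) + len(band)])
    return result
```

The banded result equals the whole-image result, which shows that each subunit only needs the few rows adjacent to its boundary rather than the full image.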

The partial block connection of FIG. 7A may be modified to be an N-partial block connection. In this modification, the second plurality of buses may further connect each processor subunit to the processor subunits within a threshold distance of that processor subunit (e.g., within n processor subunits) in the four directions along which the buses of FIG. 7A run (i.e., up, down, left, and right). A similar modification may be made to the complete block connection (to produce an N-complete block connection), such that the second plurality of buses further connects each processor subunit to the processor subunits within a threshold distance of that processor subunit (e.g., within n processor subunits) in the four directions along which the buses of FIG. 7A run, in addition to the two diagonal directions.

Other arrangements are possible. For example, in the arrangement shown in FIG. 7B, bus 750a connects processor subunit 730a to processor subunit 730d, bus 750b connects processor subunit 730a to processor subunit 730b, bus 750c connects processor subunit 730b to processor subunit 730c, and bus 750d connects processor subunit 730c to processor subunit 730d. Accordingly, in the example arrangement shown in FIG. 7B, the plurality of processor subunits is arranged in a star pattern. The second plurality of buses connects each processor subunit to at least one adjacent processor subunit within the star pattern.

Further arrangements (not shown) are possible. For example, a neighbor connection arrangement may be used, in which the plurality of processor subunits is arranged in one or more lines (e.g., similar to that depicted in FIG. 7A). In a neighbor connection arrangement, the second plurality of buses connects each processor subunit to the processor subunit to its left in the same line, to the processor subunit to its right in the same line, to both the processor subunit to its left and the processor subunit to its right in the same line, and so on.

In another example, an N-linear connection arrangement may be used. In an N-linear connection arrangement, the second plurality of buses connects each processor subunit to the processor subunits within a threshold distance of that processor subunit (e.g., within n processor subunits). The N-linear connection arrangement may be used with a line array (described above), a rectangular array (depicted in FIG. 7A), an elliptical array (depicted in FIG. 7B), or any other geometric array.

In yet another example, an N-log connection arrangement may be used. In an N-log connection arrangement, the second plurality of buses connects each processor subunit to the processor subunits within a power-of-two threshold distance of that processor subunit (e.g., within 2ⁿ processor subunits). The N-log connection arrangement may be used with a line array (described above), a rectangular array (depicted in FIG. 7A), an elliptical array (depicted in FIG. 7B), or any other geometric array.
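For a single line of processor subunits, the N-linear and N-log arrangements can be sketched as neighbor lists. The function names are hypothetical, and the N-log function follows a literal reading in which each subunit is linked to every subunit within 2ⁿ positions; another reading, with links only at distances 1, 2, 4, and so on, is also conceivable, and the text does not settle which is intended.

```python
def n_linear_neighbors(index, count, n):
    """Subunits within n positions of `index` in a line of `count` subunits."""
    low, high = max(0, index - n), min(count, index + n + 1)
    return [j for j in range(low, high) if j != index]


def n_log_neighbors(index, count, n):
    """Literal reading of N-log: all subunits within 2**n positions of `index`."""
    return n_linear_neighbors(index, count, 2 ** n)
```

Under this reading, raising n by one doubles the reach of each subunit in the N-log arrangement, whereas it extends the reach by one position in the N-linear arrangement.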

Any of the connection schemes described above may be combined for use in the same hardware chip. For example, a complete block connection may be used in one region while a partial block connection is used in another region. In another example, an N-linear connection arrangement may be used in one region while an N-complete block connection is used in another region.

Instead of or in addition to dedicated buses between the processor subunits of the memory chip, one or more shared buses may be used to interconnect all (or a subset of) the processor subunits of the distributed processor. Collisions on the shared buses may still be avoided by timing the data transfers on the shared buses using code executed by the processor subunits, as explained further below. In addition to or instead of shared buses, configurable buses may be used to dynamically connect processor subunits to form groups of processor subunits connected to separate buses. For example, the configurable buses may include transistors or other mechanisms that may be controlled by a processor subunit to direct data transfers to a selected processor subunit.
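One way to see how code-timed transfers can keep a shared bus free of collisions is a fixed time-slot schedule, sketched below. The slot-per-subunit policy (subunit i may drive the bus only on cycles where the cycle number modulo the number of subunits equals i) is an assumption chosen for illustration and is not the scheduling method described in the specification; the names `build_schedule` and `collisions` are hypothetical.

```python
from collections import Counter


def build_schedule(requests, num_subunits):
    """Assign each (sender, receiver) request to the sender's next free slot.

    Returns a sorted list of (cycle, sender, receiver) tuples.
    """
    slots_used = {}  # sender -> number of its slots already consumed
    schedule = []
    for sender, receiver in requests:
        turn = slots_used.get(sender, 0)
        cycle = turn * num_subunits + sender  # sender owns cycles congruent to its index
        schedule.append((cycle, sender, receiver))
        slots_used[sender] = turn + 1
    return sorted(schedule)


def collisions(schedule):
    """Return the cycles on which more than one sender drives the bus."""
    counts = Counter(cycle for cycle, _, _ in schedule)
    return sorted(cycle for cycle, k in counts.items() if k > 1)
```

Because every sender owns a disjoint set of cycles, the schedule is collision-free by construction and no hardware arbiter is needed to resolve competing requests.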

In both FIG. 7A and FIG. 7B, the plurality of processor subunits of the processing array is spatially distributed among the plurality of discrete memory banks of the memory array. In other alternative embodiments (not shown), the plurality of processor subunits may be clustered in one or more regions of the substrate, and the plurality of memory banks may be clustered in one or more other regions of the substrate. In some embodiments, a combination of spatial distribution and clustering may be used (not shown). For example, one region of the substrate may include a cluster of processor subunits, another region of the substrate may include a cluster of memory banks, and yet another region of the substrate may include processing arrays distributed among memory banks.

One of ordinary skill in the art will recognize that arraying processing groups 600 on a substrate is not the only possible embodiment. For example, each processor subunit may be associated with at least two dedicated memory banks. Accordingly, processing groups 310a, 310b, 310c, and 310d of FIG. 3B may be used in lieu of, or in combination with, processing group 600 to form the processing array and the memory array. Other processing groups (not shown) including, for example, three, four, or more than four dedicated memory banks may be used.

Each of the plurality of processor subunits may be configured to execute software code associated with a particular application independently of the other processor subunits included in the plurality of processor subunits. For example, as explained below, a plurality of sub-series of instructions may be grouped as machine code and provided to each processor subunit for execution.

In some embodiments, each dedicated memory bank comprises at least one dynamic random access memory (DRAM). Alternatively, the memory banks may comprise a mix of memory types, such as static random access memory (SRAM), DRAM, flash memory, or the like.

In conventional processors, data sharing between processor subunits is usually performed through shared memory. Shared memory typically requires a large portion of chip area and/or a bus managed by additional hardware (such as arbiters). As described above, such a bus creates a bottleneck. In addition, the shared memory, which may be external to the chip, typically includes cache coherency mechanisms and more complex caches (e.g., L1 cache, L2 cache, and shared DRAM) in order to provide accurate and up-to-date data to the processor subunits. As explained further below, the dedicated buses depicted in FIGS. 7A and 7B allow for hardware chips that are free of hardware management (such as arbiters). Moreover, the use of dedicated memories as depicted in FIGS. 7A and 7B allows the elimination of complex cache layers and coherency mechanisms.

Instead, to allow each processor subunit to access data calculated by other processor subunits and/or stored in memory banks dedicated to other processor subunits, buses are provided whose timing is performed dynamically using code executed individually by each processor subunit. This allows most, if not all, of the bus management hardware conventionally used to be eliminated. Moreover, direct transfers over these buses replace complex caching mechanisms, reducing latency during memory reads and writes.

Memory-based processing array

As depicted in FIGS. 7A and 7B, a memory chip of the present invention may operate independently. Alternatively, a memory chip of the present invention may be operably connected with one or more additional integrated circuits, such as a memory device (e.g., one or more DRAM banks), a system-on-a-chip, a field-programmable gate array (FPGA), or another processing and/or memory chip. In such embodiments, the tasks in a series of instructions executed by the architecture may be divided (e.g., by a compiler, as described below) between the processor subunits of the memory chip and any processor subunits of the additional integrated circuit(s). For example, the other integrated circuits may comprise a host (e.g., host 350 of FIG. 3A) that inputs instructions and/or data to the memory chip and receives output from it.

In order to interconnect a memory chip of the present invention with one or more additional integrated circuits, the memory chip may include a memory interface, such as one that conforms to a Joint Electron Device Engineering Council (JEDEC) standard or any of its variants. The one or more additional integrated circuits may then connect to the memory interface. Accordingly, if the one or more additional integrated circuits are connected to a plurality of memory chips of the present invention, data may be shared between the memory chips through the one or more additional integrated circuits. Additionally or alternatively, the one or more additional integrated circuits may include buses that connect to buses on the memory chips of the present invention, such that the one or more additional integrated circuits may execute code in tandem with the memory chips of the present invention. In such embodiments, the one or more additional integrated circuits further assist with distributed processing even though they may be on different substrates than the memory chips of the present invention.

Furthermore, memory chips of the present invention may be arrayed in order to form an array of distributed processors. For example, one or more buses may connect a memory chip 770a to an additional memory chip 770b, as depicted in FIG. 7C. In the example of FIG. 7C, memory chip 770a includes processor subunits with one or more corresponding memory banks dedicated to each processor subunit, e.g.: processor subunit 730a is associated with memory bank 720a, processor subunit 730b is associated with memory bank 720b, processor subunit 730e is associated with memory bank 720c, and processor subunit 730f is associated with memory bank 720d. Buses connect each processor subunit to its corresponding memory bank. Accordingly, bus 740a connects processor subunit 730a to memory bank 720a, bus 740b connects processor subunit 730b to memory bank 720b, bus 740c connects processor subunit 730e to memory bank 720c, and bus 740d connects processor subunit 730f to memory bank 720d. Moreover, bus 750a connects processor subunit 730a to processor subunit 730e, bus 750b connects processor subunit 730a to processor subunit 730b, bus 750c connects processor subunit 730b to processor subunit 730f, and bus 750d connects processor subunit 730e to processor subunit 730f. Other arrangements of memory chip 770a may be used, for example, as described above.

Similarly, memory chip 770b includes processor subunits with one or more corresponding memory banks dedicated to each processor subunit, e.g.: processor subunit 730c is associated with memory bank 720e, processor subunit 730d is associated with memory bank 720f, processor subunit 730g is associated with memory bank 720g, and processor subunit 730h is associated with memory bank 720h. Buses connect each processor subunit to its corresponding memory bank. Accordingly, bus 740e connects processor subunit 730c to memory bank 720e, bus 740f connects processor subunit 730d to memory bank 720f, bus 740g connects processor subunit 730g to memory bank 720g, and bus 740h connects processor subunit 730h to memory bank 720h. Moreover, bus 750g connects processor subunit 730c to processor subunit 730g, bus 750h connects processor subunit 730d to processor subunit 730h, bus 750i connects processor subunit 730c to processor subunit 730d, and bus 750j connects processor subunit 730g to processor subunit 730h. Other arrangements of memory chip 770b may be used, for example, as described above.

The processor subunits of memory chips 770a and 770b may be connected using one or more buses. Accordingly, in the example of FIG. 7C, bus 750e may connect processor subunit 730b of memory chip 770a with processor subunit 730c of memory chip 770b, and bus 750f may connect processor subunit 730f of memory chip 770a with processor subunit 730c of memory chip 770b. For example, bus 750e may serve as an input bus to memory chip 770b (and thus as an output bus for memory chip 770a), while bus 750f may serve as an input bus to memory chip 770a (and thus as an output bus for memory chip 770b), or vice versa. Alternatively, buses 750e and 750f may both serve as two-way buses between memory chips 770a and 770b.

Buses 750e and 750f may include direct wires or may be interleaved on a high-speed connection in order to reduce the number of pins used for the inter-chip interface between memory chip 770a and integrated circuit 770b. Moreover, any of the connection arrangements described above for use within the memory chip itself may be used to connect the memory chip to one or more additional integrated circuits. For example, memory chips 770a and 770b may be connected using full-tile or partial-tile connections rather than only the two buses shown in FIG. 7C.

Accordingly, although depicted using buses 750e and 750f, architecture 760 may include fewer buses or additional buses. For example, a single bus may be used between processor subunits 730b and 730c or between processor subunits 730f and 730c. Alternatively, additional buses may be used, e.g., between processor subunits 730b and 730d, between processor subunits 730f and 730d, or the like.

Furthermore, although depicted as using a single memory chip and an additional integrated circuit, a plurality of memory chips may be connected using buses as explained above. For example, as depicted in the example of FIG. 7C, memory chips 770a, 770b, 770c, and 770d are connected in an array. Each memory chip includes processor subunits and dedicated memory banks similar to those of the memory chips described above. Accordingly, a description of these components is not repeated here.

In the example of FIG. 7C, memory chips 770a, 770b, 770c, and 770d are connected in a loop. Accordingly, bus 750a connects memory chips 770a and 770d, bus 750c connects memory chips 770a and 770b, bus 750e connects memory chips 770b and 770c, and bus 750g connects memory chips 770c and 770d. Although memory chips 770a, 770b, 770c, and 770d may be connected with full-tile connections, partial-tile connections, or other connection arrangements, the example of FIG. 7C allows for fewer pin connections between memory chips 770a, 770b, 770c, and 770d.
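The trade-off of the loop arrangement (fewer pins, but possibly more than one hop between chips) can be illustrated with a small adjacency table. The Python below is only a sketch: the bus-to-chip mapping is copied from the paragraph above, while the helper names `neighbors` and `hops` and the idea of counting hops are illustrative assumptions, not part of the disclosure.

```python
from collections import deque

# Inter-chip buses of the loop, as listed in the text above.
BUSES = {
    "750a": ("770a", "770d"),
    "750c": ("770a", "770b"),
    "750e": ("770b", "770c"),
    "750g": ("770c", "770d"),
}

def neighbors(chip):
    """Return the chips directly wired to `chip`."""
    out = []
    for a, b in BUSES.values():
        if a == chip:
            out.append(b)
        elif b == chip:
            out.append(a)
    return out

def hops(src, dst):
    """Breadth-first search: minimum number of buses crossed from src to dst."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        chip, dist = queue.popleft()
        if chip == dst:
            return dist
        for nxt in neighbors(chip):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    raise ValueError("chips are not connected")
```

In this sketch every chip has exactly two neighbors, so four buses suffice, whereas wiring every pair of the four chips directly would take six; the price is that opposite chips (for instance 770a and 770c) are two hops apart.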

Relatively Large Memories

Embodiments of the present invention may use dedicated memories that are relatively large compared with the shared memories of conventional processors. The use of dedicated memories rather than shared memories allows efficiency gains to continue without tapering off as memory increases. This allows memory-intensive tasks such as neural network processing and database queries to be performed more efficiently than in conventional processors, where the efficiency gains of increasing shared memory taper off due to the von Neumann bottleneck.

For example, in distributed processors of the present invention, a memory array disposed on the substrate of the distributed processor may include a plurality of discrete memory banks, each of which may have a capacity greater than one megabyte. The distributed processor may also include a processing array disposed on the substrate, the processing array including a plurality of processor subunits. As explained above, each one of the processor subunits may be associated with a corresponding, dedicated one of the plurality of discrete memory banks. In some embodiments, the plurality of processor subunits may be spatially distributed among the plurality of discrete memory banks within the memory array. By using dedicated memories of at least one megabyte, rather than the few megabytes of shared cache found in a large CPU or GPU, the distributed processors of the present invention gain efficiencies that are not possible in conventional systems because of the von Neumann bottleneck in CPUs and GPUs.

Different memories may be used as the dedicated memories. For example, each dedicated memory bank may comprise at least one DRAM bank. Alternatively, each dedicated memory bank may comprise at least one static random access memory (SRAM) bank. In other embodiments, different types of memories may be combined on a single hardware chip.

As explained above, each dedicated memory may be at least one megabyte. Accordingly, each dedicated memory bank may be the same size, or at least two of the plurality of memory banks may have different sizes.

Moreover, as described above, the distributed processor may include a first plurality of buses, each connecting one of the plurality of processor subunits to a corresponding dedicated memory bank, and a second plurality of buses, each connecting one of the plurality of processor subunits to another one of the plurality of processor subunits.

Synchronization Using Software

As explained above, hardware chips of the present invention may manage data transfers using software rather than hardware. In particular, because the timings of transfers on the buses, of reads from and writes to the memory banks, and of calculations by the processor subunits are set by the sub-series of instructions executed by the processor subunits, hardware chips of the present invention may execute code to prevent collisions on the buses. Accordingly, hardware chips of the present invention may avoid hardware mechanisms conventionally used to manage data transfers (such as network controllers within a chip, packet parsers and packet transferors between processor subunits, bus arbiters, a plurality of buses to avoid arbitration, or the like).

If hardware chips of the present invention transferred data conventionally, connecting N processor subunits with buses would require bus arbitration or wide MUXs controlled by an arbiter. Instead, as described above, embodiments of the present invention may use a bus that is only a wire, an optical cable, or the like between processor subunits, where the processor subunits individually execute code to avoid collisions on the buses. Accordingly, embodiments of the present invention may save space on the substrate and materials cost, and may avoid efficiency losses (e.g., the power and time consumed by arbitration). The efficiency and space gains are even greater when compared with other architectures that use first-in-first-out (FIFO) controllers and/or mailboxes.
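The idea of replacing an arbiter with code-defined timing can be illustrated with a toy model. The sketch below is not taken from the disclosure: it assumes a single shared wire, one word per cycle, and a simple round-robin assignment of drive cycles, and the function names `round_robin_programs` and `find_conflicts` are hypothetical. It only shows that a collision-free schedule can be verified offline, before any code runs on the chip.

```python
def round_robin_programs(n_subunits, n_words):
    """Give subunit i the drive cycles i, i+N, i+2N, ... (one word per cycle)."""
    return {
        i: [i + k * n_subunits for k in range(n_words)]
        for i in range(n_subunits)
    }

def find_conflicts(programs):
    """Return {cycle: [subunits]} for cycles in which more than one subunit drives the bus."""
    drivers = {}
    for subunit, cycles in programs.items():
        for cycle in cycles:
            drivers.setdefault(cycle, []).append(subunit)
    return {c: s for c, s in drivers.items() if len(s) > 1}
```

For example, `find_conflicts(round_robin_programs(4, 3))` returns an empty dictionary, whereas a faulty pair of programs such as `{0: [0, 1], 1: [1]}` is reported as colliding in cycle 1.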

Furthermore, as explained above, each processor subunit may include one or more accelerators in addition to one or more processing elements. In some embodiments, the accelerator(s), rather than the processing element(s), may read from and write to the buses. In such embodiments, additional efficiency may be obtained by allowing the accelerator(s) to transmit data during the same cycle in which the processing element(s) perform one or more calculations. Such embodiments, however, require additional materials for the accelerator(s). For example, additional transistors may be required to fabricate the accelerator(s).

The code may also account for the internal behavior of the processor subunits (e.g., of the processing elements and/or accelerators forming part of a processor subunit), including timing and latencies. For example, a compiler (as described below) may perform pre-processing that accounts for timing and latencies when generating the sub-series of instructions that control the data transfers.

In one example, a plurality of processor subunits may be assigned the task of calculating a neural network layer containing a plurality of neurons, all of which are connected to a previous layer containing a larger plurality of neurons. Assuming the data of the previous layer is evenly spread between the plurality of processor subunits, one way to perform the calculation may be to configure each processor subunit to transmit the data of the previous layer to the main bus in turn, after which each processor subunit multiplies this data by the weights of the corresponding neurons that the subunit implements. Because each processor subunit calculates more than one neuron, each processor subunit will transmit the data of the previous layer a number of times equal to the number of neurons. Thus, the code of each processor subunit is not the same as the code for the other processor subunits, because the subunits will transmit at different times.
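The turn-taking scheme in this example can be sketched in a few lines of Python. This is a simplified model under stated assumptions rather than the disclosed implementation: the function name `layer_forward` and the data layout are hypothetical, the bus is modeled as a plain loop, and each previous-layer value is broadcast once and applied by every listener to all of its neurons (the text describes repeating the transmission once per neuron).

```python
def layer_forward(prev_slices, weights_per_subunit):
    """
    prev_slices[s]         : previous-layer values stored in subunit s's bank
    weights_per_subunit[s] : one weight row per neuron owned by subunit s;
                             each row spans the whole previous layer
    Returns the accumulated outputs, grouped by owning subunit.
    """
    n = len(prev_slices)
    # Offset of each subunit's slice inside the full previous layer.
    offsets = [0]
    for sl in prev_slices[:-1]:
        offsets.append(offsets[-1] + len(sl))
    # One accumulator per neuron, held locally by the owning subunit.
    acc = [[0.0] * len(rows) for rows in weights_per_subunit]
    for sender in range(n):                          # subunits take turns on the bus
        for j, x in enumerate(prev_slices[sender]):  # one value per bus cycle
            col = offsets[sender] + j
            for s in range(n):                       # every subunit listens
                for k, row in enumerate(weights_per_subunit[s]):
                    acc[s][k] += row[col] * x
    return acc
```

Because the sender changes from turn to turn, each subunit's instruction stream differs from the others only in when it drives the bus, which matches the observation above that the per-subunit code is not identical.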

In some embodiments, a distributed processor may comprise a substrate (e.g., a semiconductor substrate such as silicon and/or a circuit board such as a flexible circuit board); a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks; and a processing array disposed on the substrate, the processing array including a plurality of processor subunits, as depicted, e.g., in FIGS. 7A and 7B. As explained above, each one of the processor subunits may be associated with a corresponding, dedicated one of the plurality of discrete memory banks. Moreover, as depicted, e.g., in FIGS. 7A and 7B, the distributed processor may further comprise a plurality of buses, each one of the plurality of buses connecting one of the plurality of processor subunits to at least another one of the plurality of processor subunits.

As explained above, the plurality of buses may be controlled in software. Accordingly, the plurality of buses may be free of timing hardware logic components, such that data transfers between processor subunits and across corresponding ones of the plurality of buses are not controlled by timing hardware logic components. In one example, the plurality of buses may be free of bus arbiters, such that data transfers between processor subunits and across corresponding ones of the plurality of buses are not controlled by bus arbiters.

In some embodiments, as depicted, e.g., in FIGS. 7A and 7B, the distributed processor may further comprise a second plurality of buses connecting one of the plurality of processor subunits to a corresponding, dedicated memory bank. Similar to the plurality of buses described above, the second plurality of buses may be free of timing hardware logic components, such that data transfers between processor subunits and corresponding, dedicated memory banks are not controlled by timing hardware logic components. In one example, the second plurality of buses may be free of bus arbiters, such that data transfers between processor subunits and corresponding, dedicated memory banks are not controlled by bus arbiters.

As used herein, the phrase "free of" does not necessarily imply the absolute absence of components such as timing hardware logic components (e.g., bus arbiters, arbitration trees, FIFO controllers, mailboxes, or the like). Such components may still be included in a hardware chip described as "free of" those components. Instead, the phrase "free of" refers to the function of the hardware chip; that is, a hardware chip "free of" timing hardware logic components controls the timing of its data transfers without using the timing hardware logic components, if any, included therein. For example, such a hardware chip executes code including sub-series of instructions that control data transfers between processor subunits of the hardware chip, even if the hardware chip includes timing hardware logic components as a secondary precaution against collisions caused by errors in the executed code.

As explained above, the plurality of buses may comprise at least one of wires or optical fibers between corresponding ones of the plurality of processor subunits. Accordingly, in one example, a distributed processor free of timing hardware logic components may include only wires or optical fibers, without bus arbiters, arbitration trees, FIFO controllers, mailboxes, or the like.

In some embodiments, the plurality of processor subunits is configured to transfer data across at least one of the plurality of buses in accordance with code executed by the plurality of processor subunits. Accordingly, as explained below, a compiler may organize sub-series of instructions, each sub-series comprising code executed by a single processor subunit. The sub-series of instructions may instruct the processor subunit when to transfer data onto one of the buses and when to retrieve data from the buses. When the sub-series are executed in tandem across the distributed processor, the timing of transfers between the processor subunits may be governed by the transfer and retrieve instructions included in the sub-series. Thus, the code dictates the timing of data transfers across at least one of the plurality of buses. The compiler may generate code to be executed by a single processor subunit. Additionally, the compiler may generate code to be executed by groups of processor subunits. In some cases, the compiler may treat all of the processor subunits together as if they were one super-processor (e.g., a distributed processor), and the compiler may generate code for execution by that defined super-processor/distributed processor.
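A minimal sketch of how a compiler might turn a list of transfers into per-subunit sub-series follows. Everything in it is an illustrative assumption: the subunit and bus labels (`P0`, `B01`, and so on), the `SEND`/`RECV` mnemonics, the one-cycle transfer cost, and the greedy earliest-cycle policy are hypothetical and are not described in the source.

```python
def compile_transfers(transfers):
    """
    transfers: list of (src, dst, bus) tuples in program order.
    Returns {subunit: [(cycle, op, bus), ...]}, the sub-series for each subunit.
    """
    bus_free = {}    # first cycle at which each bus is idle
    unit_free = {}   # first cycle at which each subunit is idle
    programs = {}
    for src, dst, bus in transfers:
        # Earliest cycle at which the bus and both endpoints are available.
        cycle = max(bus_free.get(bus, 0),
                    unit_free.get(src, 0),
                    unit_free.get(dst, 0))
        programs.setdefault(src, []).append((cycle, "SEND", bus))
        programs.setdefault(dst, []).append((cycle, "RECV", bus))
        bus_free[bus] = unit_free[src] = unit_free[dst] = cycle + 1
    return programs

# Hypothetical example: three transfers among four subunits.
example = compile_transfers([
    ("P0", "P1", "B01"),
    ("P1", "P3", "B13"),
    ("P0", "P2", "B02"),
])
```

Here the second and third transfers both land in cycle 1 because they use different buses and different endpoints, while each sender and receiver is told the exact cycle at which to act, so no arbiter is needed at run time.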

As explained above and as depicted in FIGS. 7A and 7B, the plurality of processor subunits may be spatially distributed among the plurality of discrete memory banks within the memory array. Alternatively, the plurality of processor subunits may be clustered in one or more regions of the substrate, and the plurality of memory banks may be clustered in one or more other regions of the substrate. In some embodiments, a combination of spatial distribution and clustering may be used, as explained above.

In some embodiments, a distributed processor may comprise a substrate (e.g., a semiconductor substrate including silicon and/or a circuit board such as a flexible circuit board) with a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks. A processing array may also be disposed on the substrate, the processing array including a plurality of processor subunits, as depicted, e.g., in FIGS. 7A and 7B. As explained above, each one of the processor subunits may be associated with a corresponding, dedicated one of the plurality of discrete memory banks. Moreover, as depicted, e.g., in FIGS. 7A and 7B, the distributed processor may further comprise a plurality of buses, each one of the plurality of buses connecting one of the plurality of processor subunits to a corresponding, dedicated one of the plurality of discrete memory banks.

As explained above, the plurality of buses may be controlled in software. Accordingly, the plurality of buses may be free of timing hardware logic components, such that data transfers between a processor subunit and a corresponding, dedicated one of the plurality of discrete memory banks and across a corresponding one of the plurality of buses are not controlled by timing hardware logic components. In one example, the plurality of buses may be free of bus arbiters, such that data transfers between processor subunits and across corresponding ones of the plurality of buses are not controlled by bus arbiters.

In some embodiments, as depicted, e.g., in FIGS. 7A and 7B, the distributed processor may further comprise a second plurality of buses connecting one of the plurality of processor subunits to at least another one of the plurality of processor subunits. Similar to the plurality of buses described above, the second plurality of buses may be free of timing hardware logic components, such that data transfers between processor subunits and corresponding, dedicated memory banks are not controlled by timing hardware logic components. In one example, the second plurality of buses may be free of bus arbiters, such that data transfers between processor subunits and corresponding, dedicated memory banks are not controlled by bus arbiters.

In some embodiments, the distributed processor may use a combination of software timing components and hardware timing components. For example, a distributed processor may comprise a substrate (e.g., a semiconductor substrate including silicon and/or a circuit board such as a flexible circuit board) with a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks. A processing array may also be disposed on the substrate, the processing array including a plurality of processor subunits, as depicted, e.g., in FIGS. 7A and 7B. As explained above, each one of the processor subunits may be associated with a corresponding, dedicated one of the plurality of discrete memory banks. Moreover, as depicted, e.g., in FIGS. 7A and 7B, the distributed processor may further comprise a plurality of buses, each one of the plurality of buses connecting one of the plurality of processor subunits to at least another one of the plurality of processor subunits. Moreover, as explained above, the plurality of processor subunits may be configured to execute software that controls the timing of data transfers across the plurality of buses in order to avoid colliding with data transfers on at least one of the plurality of buses. In such an example, the software may control the timing of the data transfers, but the transfers themselves may be controlled, at least in part, by one or more hardware components.

Division of Code

As explained above, hardware chips of the present invention may execute code in parallel across the processor subunits included on the substrate forming the hardware chip. Additionally, hardware chips of the present invention may perform multi-tasking. For example, hardware chips of the present invention may perform area multi-tasking, in which one group of processor subunits of the hardware chip executes one task (e.g., audio processing) while another group of processor subunits of the hardware chip executes another task (e.g., image processing). In another example, hardware chips of the present invention may perform timing multi-tasking, in which one or more processor subunits of the hardware chip execute one task during a first period of time and another task during a second period of time. A combination of area and timing multi-tasking may also be used, such that one task may be assigned to a first group of processor subunits during a first period of time while another task is assigned to a second group of processor subunits during the first period of time, after which a third task may be assigned to the processor subunits included in the first group and the second group during a second period of time.
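The combined area and timing multi-tasking described above amounts to a lookup from (time period, group) to task. The table below is a hypothetical illustration: the subunit labels, group sizes, and the name `task_for` are assumptions, and only the audio/image/third-task pattern comes from the text.

```python
# Hypothetical grouping of eight subunits; the labels are illustrative only.
GROUPS = {
    "group_1": ["P0", "P1", "P2", "P3"],
    "group_2": ["P4", "P5", "P6", "P7"],
}

# Time period -> group -> task, following the combined example in the text.
SCHEDULE = {
    1: {"group_1": "audio processing", "group_2": "image processing"},
    2: {"group_1": "third task", "group_2": "third task"},
}

def task_for(subunit, period):
    """Return the task a subunit runs during the given time period."""
    for group, members in GROUPS.items():
        if subunit in members:
            return SCHEDULE[period][group]
    raise KeyError(subunit)
```

During period 1 the two groups run different tasks side by side (area multi-tasking); in period 2 both groups switch to the same third task (timing multi-tasking).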

In order to organize machine code for execution on memory chips of the present invention, the machine code may be divided between the processor subunits of the memory chip. For example, a processor on a memory chip may comprise a substrate and a plurality of processor subunits disposed on the substrate. The memory chip may further comprise a corresponding plurality of memory banks disposed on the substrate, each one of the plurality of processor subunits being connected to at least one dedicated memory bank that is not shared by any other processor subunit of the plurality of processor subunits. Each processor subunit on the memory chip may be configured to execute a series of instructions independently of the other processor subunits. Each series of instructions may be executed by configuring one or more general processing elements of the processor subunit in accordance with code defining the series of instructions and/or by activating one or more special processing elements (e.g., one or more accelerators) of the processor subunit in accordance with a sequence provided in the code defining the series of instructions.

Accordingly, each series of instructions may define a series of tasks to be performed by a single processor subunit. A single task may comprise an instruction within an instruction set defined by the architecture of one or more processing elements in the processor subunit. For example, the processor subunit may include particular registers, and a single task may push data onto a register, pull data from a register, perform an arithmetic function on data within a register, perform a logic operation on data within a register, or the like. Moreover, the processor subunit may be configured for any number of operands, such as a 0-operand processor subunit (also called a "stack machine"), a 1-operand processor subunit (also called an accumulator machine), a 2-operand processor subunit (such as a RISC), a 3-operand processor subunit (such as a complex instruction set computer (CISC)), or the like. In another example, the processor subunit may include one or more accelerators, and a single task may activate an accelerator to perform a specific function, such as a MAC function, a MAX function, a MAX-0 function, or the like.

The series of instructions may further include tasks for reading from and writing to the dedicated memory banks of the memory chip. For example, a task may include writing a piece of data to a memory bank dedicated to the processor subunit executing the task, reading a piece of data from a memory bank dedicated to the processor subunit executing the task, or the like. In some embodiments, the reading and writing may be performed by the processor subunit in tandem with a controller of the memory bank. For example, the processor subunit may execute a read or write task by sending a control signal to the controller to perform the read or write. In some embodiments, the control signal may include a particular address to use for the read or write. Alternatively, the processor subunit may defer to the memory controller to select an available address for the read or write.

另外或替代地，讀取及寫入可由一或多個加速器與記憶體組之控制器協同執行。舉例而言，該等加速器可產生用於記憶體控制器之控制信號，此類似於處理器子單元如何產生控制信號，如上文所描述。 Additionally or alternatively, reading and writing can be performed by one or more accelerators in cooperation with the controller of the memory bank. For example, the accelerators can generate control signals for the memory controller, which is similar to how the processor sub-units generate control signals, as described above.

In any of the embodiments described above, an address generator may also be used to direct reads and writes to specific addresses of a memory bank. For example, the address generator may comprise a processing element configured to generate memory addresses for reads and writes. The address generator may be configured to generate addresses so as to increase efficiency, e.g., by writing the result of a later calculation to the same address as a result of a former calculation that is no longer needed. Accordingly, the address generator may generate control signals for the memory controller either in response to a command from the processor subunit (e.g., from a processing element included therein or from one or more accelerators therein) or in tandem with the processor subunit. Additionally or alternatively, the address generator may generate addresses based on some configuration or registers, for example by generating a nested loop structure that iterates over certain addresses in the memory in a certain pattern.
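
The nested-loop addressing pattern described above can be illustrated with a minimal sketch. The function name, the base/count/stride parameters, and the row-major layout are assumptions made for illustration only; the source does not specify how the address generator is configured.

```python
from itertools import product


def nested_loop_addresses(base, counts, strides):
    """Yield addresses visited by a nested loop.

    counts[0]/strides[0] describe the outermost loop (hypothetical
    configuration registers; not taken from the source).
    """
    for indices in product(*(range(c) for c in counts)):
        yield base + sum(i * s for i, s in zip(indices, strides))


# Walk a 2 x 3 tile of a row-major array with 8 words per row,
# starting at address 0x100.
addrs = list(nested_loop_addresses(0x100, counts=(2, 3), strides=(8, 1)))
```

In this sketch the inner loop advances by one word and the outer loop advances by one row, so the generator visits three consecutive words in each of two rows.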

在一些實施例中，每一系列指令可包含定義對應的一系列任務之機器碼的集合。因此，上文所描述之該系列任務可囊封於包含該系列指令之機器碼內。在一些實施例中，如下文關於圖8所解釋，該系列任務可由編譯器定義，該編譯器經組態以將較高階系列之任務作為複數個系列之任務分佈於複數個邏輯電路當中。舉例而言，編譯器可基於較高階系列之任務產生複數個系列之任務，使得協同執行對應的每一系列任務之處理器子單元執行與由較高階系列之任務所概述之功能相同的功能。 In some embodiments, each series of instructions may include a set of machine codes that define a corresponding series of tasks. Therefore, the series of tasks described above can be encapsulated in the machine code containing the series of instructions. In some embodiments, as explained below with respect to FIG. 8, the series of tasks can be defined by a compiler that is configured to distribute higher-order series of tasks as a plurality of series of tasks among a plurality of logic circuits. For example, the compiler can generate multiple series of tasks based on the higher-order series of tasks, so that the processor subunits that perform each series of tasks cooperatively perform the same functions as those outlined by the higher-order series of tasks.

如下文進一步所解釋，較高階系列之任務可包含用人類可讀程式設計語言編寫之指令集。對應地，每一處理器子單元之該系列任務可包含較低階系列之任務，該等任務中之每一者包含用機器碼編寫之指令集。 As explained further below, the higher-level series of tasks can include a set of instructions written in a human-readable programming language. Correspondingly, the series of tasks of each processor subunit may include a lower-level series of tasks, and each of these tasks includes an instruction set written in machine code.

As explained above with respect to FIGS. 7A and 7B, the memory chip may further comprise a plurality of buses, each bus connecting one of the plurality of processor subunits to at least one other of the plurality of processor subunits. Moreover, as explained above, data transfers on the plurality of buses may be controlled using software. Accordingly, data transfers across at least one of the plurality of buses may be predefined by the series of instructions included in the processor subunit connected to the at least one of the plurality of buses. Therefore, one of the tasks included in the series of instructions may include outputting data to one of the buses or pulling data from one of the buses. Such tasks may be executed by a processing element of the processor subunit or by one or more accelerators included in the processor subunit. In the latter embodiment, the processor subunit may perform a calculation or send a control signal to a corresponding memory bank in the same cycle during which the accelerator pulls data from one of the buses or places data on one of the buses.

In one example, the series of instructions included in the processor subunit connected to at least one of the plurality of buses may include a send task, the send task comprising a command for the processor subunit connected to the at least one of the plurality of buses to write data to the at least one of the plurality of buses. Additionally or alternatively, the series of instructions included in the processor subunit connected to at least one of the plurality of buses may include a receive task, the receive task comprising a command for the processor subunit connected to the at least one of the plurality of buses to read data from the at least one of the plurality of buses.

In addition to, or as an alternative to, distributing code among the processor subunits, data may be divided among the memory banks of the memory chip. For example, as explained above, a distributed processor on a memory chip may comprise a plurality of processor subunits disposed on the memory chip and a plurality of memory banks disposed on the memory chip. Each of the plurality of memory banks may be configured to store data independent of the data stored in the other memory banks of the plurality, and each of the plurality of processor subunits may be connected to at least one dedicated memory bank from among the plurality of memory banks. For example, each processor subunit may have access to one or more memory controllers of the one or more corresponding memory banks dedicated to that processor subunit, and no other processor subunit may have access to these corresponding one or more memory controllers. Accordingly, the data stored in each memory bank may be unique to the dedicated processor subunit. Moreover, the data stored in each memory bank may be independent of the data stored in the other memory banks, because no memory controller is shared between memory banks.

In some embodiments, as described below with respect to FIG. 8, the data stored in each of the plurality of memory banks may be defined by a compiler configured to distribute data among the plurality of memory banks. Moreover, the compiler may be configured to distribute data defined in a higher-level series of tasks among the plurality of memory banks using a plurality of lower-level tasks distributed among corresponding processor subunits.

As explained above with respect to FIGS. 7A and 7B, the memory chip may further comprise a plurality of buses, each bus connecting one of the plurality of processor subunits to one or more corresponding, dedicated memory banks from among the plurality of memory banks. Moreover, as explained above, data transfers on the plurality of buses may be controlled using software. Accordingly, data transfers across a particular one of the plurality of buses may be controlled by the corresponding processor subunit connected to that particular bus. Therefore, one of the tasks included in the series of instructions may include outputting data to one of the buses or pulling data from one of the buses. As explained above, such tasks may be executed by (i) a processing element of the processor subunit or (ii) one or more accelerators included in the processor subunit. In the latter embodiment, the processor subunit may perform a calculation, or use a bus connecting the processor subunit to other processor subunits, in the same cycle during which the accelerator pulls data from, or places data on, one of the buses connected to the one or more corresponding, dedicated memory banks.

Accordingly, in one example, the series of instructions included in the processor subunit connected to at least one of the plurality of buses may include a send task. The send task may comprise a command for the processor subunit connected to the at least one of the plurality of buses to write data to the at least one of the plurality of buses for storage in the one or more corresponding, dedicated memory banks. Additionally or alternatively, the series of instructions included in the processor subunit connected to at least one of the plurality of buses may include a receive task. The receive task may comprise a command for the processor subunit connected to the at least one of the plurality of buses to read data from the at least one of the plurality of buses for storage in the one or more corresponding, dedicated memory banks. Accordingly, the send and receive tasks in such embodiments may comprise control signals that are sent, along the at least one of the plurality of buses, to one or more memory controllers of the one or more corresponding, dedicated memory banks. Moreover, the send and receive tasks may be executed by one portion of the processing subunit (e.g., by one or more accelerators of the processing subunit) concurrently with a calculation or other task executed by another portion of the processing subunit (e.g., by one or more different accelerators of the processing subunit). An example of such concurrent execution is a MAC-relay command, in which receiving, multiplying, and sending are executed in tandem.
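
The MAC-relay idea mentioned above can be pictured with a small software model. The source states only that receiving, multiplying, and sending are executed in tandem; the accumulate step, the chain topology, and all names below are assumptions made for illustration.

```python
def mac_relay(received, weight, operand):
    """One relay step (hypothetical model): take the value received from
    an upstream bus, multiply-accumulate with local data, and return the
    value to send on the downstream bus."""
    return received + weight * operand


def run_chain(weights, operands):
    """Pass a partial sum through a chain of subunits, one relay per subunit."""
    partial = 0
    for weight, operand in zip(weights, operands):
        partial = mac_relay(partial, weight, operand)
    return partial


# Three subunits, each holding one weight and one operand in its own bank.
result = run_chain([1, 2, 3], [4, 5, 6])
```

In hardware the three actions would overlap within a subunit; the sequential loop here only models the data flow, not the timing.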

In addition to distributing data among the memory banks, particular portions of data may be duplicated across different memory banks. For example, as explained above, a distributed processor on a memory chip may comprise a plurality of processor subunits disposed on the memory chip and a plurality of memory banks disposed on the memory chip. Each of the plurality of processor subunits may be connected to at least one dedicated memory bank from among the plurality of memory banks, and each memory bank of the plurality may be configured to store data independent of the data stored in the other memory banks of the plurality. Moreover, at least some of the data stored in one particular memory bank from among the plurality of memory banks may comprise a duplicate of data stored in at least one other memory bank of the plurality. For example, a number, string, or other type of data used in the series of instructions may be stored in a plurality of memory banks dedicated to different processor subunits, rather than being transferred from one memory bank to other processor subunits in the memory chip.

In one example, parallel string matching may use the data duplication described above. For example, a plurality of strings may be compared to the same string. A conventional processor would compare each string in the plurality to the same string in sequence. On a hardware chip of the present invention, the same string may be duplicated across the memory banks such that the processor subunits may compare separate strings of the plurality to the duplicated string in parallel.
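
A minimal software sketch of this parallel string matching follows. Worker threads stand in for processor subunits and dictionaries stand in for dedicated memory banks; these stand-ins and all names are assumptions for illustration, not the chip's actual mechanism.

```python
from concurrent.futures import ThreadPoolExecutor


def make_banks(candidates, pattern):
    """Give each (hypothetical) bank one candidate string plus its own
    copy of the pattern, so no bank needs data from another bank."""
    return [{"candidate": c, "pattern": pattern} for c in candidates]


def compare_in_bank(bank):
    """Work done by one subunit using only its own bank."""
    return bank["candidate"] == bank["pattern"]


def parallel_match(candidates, pattern):
    banks = make_banks(candidates, pattern)
    # One worker per bank; map() returns results in bank order.
    with ThreadPoolExecutor(max_workers=len(banks)) as pool:
        return list(pool.map(compare_in_bank, banks))


matches = parallel_match(["cat", "dog", "cat"], "cat")
```

Because every bank holds its own copy of the pattern, no comparison waits on a transfer from another bank.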

In some embodiments, as described below with respect to FIG. 8, the at least some data duplicated across one particular memory bank from among the plurality of memory banks and at least one other memory bank of the plurality is defined by a compiler configured to duplicate data across memory banks. Moreover, the compiler may be configured to duplicate the at least some data using a plurality of lower-level tasks distributed among corresponding processor subunits.

Duplication of data may be useful for certain tasks that reuse the same portions of data across different calculations. By duplicating these portions of data, the different calculations may be distributed among the processor subunits of the memory chip for parallel execution, while each processor subunit may store the portions of data in, and access the stored portions from, a dedicated memory bank (rather than pushing and pulling the portions of data across buses connecting the processor subunits). In one example, the at least some data duplicated across one particular memory bank from among the plurality of memory banks and at least one other memory bank of the plurality may comprise weights of a neural network. In this example, each node in the neural network may be defined by at least one processor subunit from among the plurality of processor subunits. For example, each node may comprise machine code executed by the at least one processor subunit defining the node. In this example, duplication of the weights may allow each processor subunit to execute machine code to effect, at least in part, a corresponding node while accessing only one or more dedicated memory banks (rather than performing data transfers with other processor subunits). Because the timing of reads and writes to the dedicated memory bank(s) is independent of other processor subunits, whereas the timing of data transfers between processor subunits requires timing synchronization (e.g., using software, as explained above), duplicating memory to avoid data transfers between processor subunits may further increase the efficiency of overall execution.

As explained above with respect to FIGS. 7A and 7B, the memory chip may further comprise a plurality of buses, each bus connecting one of the plurality of processor subunits to one or more corresponding, dedicated memory banks from among the plurality of memory banks. Moreover, as explained above, data transfers on the plurality of buses may be controlled using software. Accordingly, data transfers across a particular one of the plurality of buses may be controlled by the corresponding processor subunit connected to that particular bus. Therefore, one of the tasks included in the series of instructions may include outputting data to one of the buses or pulling data from one of the buses. As explained above, such tasks may be executed by (i) a processing element of the processor subunit or (ii) one or more accelerators included in the processor subunit. As further explained above, such tasks may include a send task and/or a receive task comprising control signals that are sent, along at least one of the plurality of buses, to one or more memory controllers of the one or more corresponding, dedicated memory banks.

圖8描繪用於編譯一系列指令以供在本發明之例示性記憶體晶片(例如，如圖7A及圖7B中所描繪)上執行之方法800的流程圖。方法800可藉由任何習知處理器(無論係通用抑或專用的)實施。 FIG. 8 depicts a flowchart of a method 800 for compiling a series of instructions for execution on an exemplary memory chip of the present invention (eg, as depicted in FIGS. 7A and 7B). The method 800 can be implemented by any conventional processor (whether general purpose or special purpose).

Method 800 may be executed as part of a computer program forming a compiler. As used herein, a "compiler" refers to any computer program that converts a higher-level language (e.g., a procedural language, such as C, FORTRAN, BASIC, or the like; an object-oriented language, such as Java, C++, Pascal, Python, or the like; etc.) into a lower-level language (e.g., assembly code, object code, machine code, or the like). The compiler may allow a human to program a series of instructions in a human-readable language, which is then converted into a machine-executable language.

At step 810, the processor may assign tasks associated with the series of instructions to different ones of the processor subunits. For example, the series of instructions may be divided into subgroups, the subgroups to be executed in parallel across the processor subunits. In one example, a neural network may be divided into its nodes, and one or more nodes may be assigned to separate processor subunits. In this example, each subgroup may comprise a plurality of nodes connected across different layers. Thus, a processor subunit may implement a node from a first layer of the neural network, a node from a second layer that is connected to the node from the first layer implemented by the same processor subunit, and the like. By assigning nodes based on their connections, data transfers between the processor subunits may be reduced, which may result in greater efficiency, as explained above.
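
One way to picture the connection-based assignment of step 810 is the sketch below. The round-robin seeding of the first layer, the majority-vote rule for later layers, and all names are assumptions chosen for illustration; the source does not prescribe a particular assignment rule.

```python
from collections import Counter


def assign_nodes(layers, edges, num_subunits):
    """Assign each node to a subunit (hypothetical heuristic).

    layers: list of lists of node names, first layer first.
    edges:  set of (source, destination) node pairs.
    """
    owner = {}
    # Spread the first layer across subunits round-robin.
    for i, node in enumerate(layers[0]):
        owner[node] = i % num_subunits
    # Place every later node with the subunit holding most of its inputs.
    for layer in layers[1:]:
        for node in layer:
            votes = Counter(owner[src] for src, dst in edges if dst == node)
            if votes:
                owner[node] = min(votes, key=lambda u: (-votes[u], u))
            else:
                owner[node] = 0
    return owner


layers = [["a0", "a1"], ["b0", "b1"]]
edges = {("a0", "b0"), ("a1", "b1")}
owner = assign_nodes(layers, edges, num_subunits=2)
```

Here each second-layer node lands on the subunit that already implements its only first-layer input, so neither connection requires an inter-subunit transfer.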

As explained above with reference to FIGS. 7A and 7B, the processor subunits may be spatially distributed among the plurality of memory banks disposed on the memory chip. Accordingly, the assignment of tasks may be, at least in part, a spatial division as well as a logical division.

在步驟820處，處理器可產生用以在記憶體晶片之多對處理器子單元之間傳送資料的任務，每一對處理器子單元由匯流排連接。舉例而言，如上文所解釋，該等資料傳送可使用軟體來控制。因此，處理器子單元可經組態以在同步時間將資料推送於匯流排上及取得匯流排上之資料。所產生之任務可因此包括用於執行資料之此同步推送及取得的任務。 At step 820, the processor may generate a task for transferring data between multiple pairs of processor sub-units of the memory chip, each pair of processor sub-units connected by a bus. For example, as explained above, such data transmission can be controlled by software. Therefore, the processor subunit can be configured to push data on the bus and obtain data on the bus at the synchronization time. The generated tasks may therefore include tasks for performing this synchronous push and acquisition of data.

As explained above, step 820 may include pre-processing to account for the internal behavior of the processor subunits, including timing and latencies. For example, the processor may use known timings and latencies of the processor subunits (e.g., the time to push data to a bus, the time to pull data from a bus, the latency between a calculation and a push or pull, or the like) to ensure that the generated tasks synchronize. Accordingly, a data transfer comprising at least one push by one or more processor subunits and at least one pull by one or more processor subunits may occur simultaneously, without incurring a delay due to timing differences between the processor subunits, latencies of the processor subunits, or the like.
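
The latency-aware pre-processing described above can be sketched as a small scheduling calculation. The cycle-based timing model, the meaning given to each latency, and all names are assumptions for illustration; actual timing parameters would come from the chip design.

```python
def schedule_transfer(sender_ready, receiver_free, push_latency, pull_latency):
    """Pick issue cycles so that a push and a pull meet on the bus.

    Assumed model: data appears on the bus push_latency cycles after the
    push is issued, and the receiver samples the bus pull_latency cycles
    after the pull is issued.
    """
    # Earliest cycle at which the data can be on the bus AND be sampled.
    on_bus = max(sender_ready + push_latency, receiver_free + pull_latency)
    return {
        "push_issue": on_bus - push_latency,
        "pull_issue": on_bus - pull_latency,
        "on_bus": on_bus,
    }


plan = schedule_transfer(sender_ready=10, receiver_free=4,
                         push_latency=2, pull_latency=1)
```

A compiler following this approach could fill the cycles before `pull_issue` on the receiving subunit with other tasks or idle instructions so that both sides stay aligned.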

在步驟830處，處理器可將所指派及產生之任務分組成子系列指令之複數個群組。舉例而言，該等子系列指令可各包含供單個處理器子單元執行的一系列任務。因此，子系列指令之複數個群組中之每一者可對應於複數個處理器子單元中之一不同處理器子單元。因此，步驟810、820及830可導致將該系列指令分成子系列指令之複數個群組。如上文所解釋，步驟820可確保不同群組之間的任何資料傳送同步。 At step 830, the processor may group the assigned and generated tasks into a plurality of groups of sub-series of instructions. For example, the sub-series of instructions may each include a series of tasks for a single processor sub-unit to perform. Therefore, each of the plurality of groups of the sub-series of instructions may correspond to a different processor sub-unit of the plurality of processor sub-units. Therefore, steps 810, 820, and 830 may result in dividing the series of commands into a plurality of groups of sub-series commands. As explained above, step 820 can ensure synchronization of any data transmission between different groups.

在步驟840處，處理器可產生對應於子系列指令之複數個群組中之每一者的機器碼。舉例而言，可將表示子系列指令之較高階程式碼轉換成可由對應處理器子單元執行的較低階程式碼，諸如機器碼。 At step 840, the processor may generate machine code corresponding to each of the plurality of groups of the sub-series of instructions. For example, a higher-level program code representing a sub-series of instructions can be converted into a lower-level program code that can be executed by the corresponding processor subunit, such as machine code.

At step 850, the processor may assign the generated machine code corresponding to each of the plurality of groups of sub-series instructions to a corresponding one of the plurality of processor subunits in accordance with the division. For example, the processor may label each sub-series of instructions with an identifier of the corresponding processor subunit. Thus, when the sub-series of instructions are uploaded to the memory chip for execution (e.g., by host 350 of FIG. 3A), each sub-series may configure the correct processor subunit.
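
Steps 830 through 850 can be illustrated together with a minimal sketch. The task mnemonics, the placeholder "code generation" (joining mnemonics into bytes), and the record layout are all hypothetical; real machine code generation is target-specific and not described at this level in the source.

```python
from collections import defaultdict


def group_and_tag(assigned_tasks):
    """Group tasks per subunit and label each group with its subunit id.

    assigned_tasks: list of (subunit_id, task_mnemonic) in program order.
    """
    groups = defaultdict(list)
    for subunit_id, task in assigned_tasks:
        groups[subunit_id].append(task)  # step 830: one sub-series per subunit
    images = []
    for subunit_id, tasks in sorted(groups.items()):
        code = ";".join(tasks).encode("ascii")  # step 840: placeholder codegen
        images.append({"subunit_id": subunit_id, "code": code})  # step 850: tag
    return images


tasks = [(0, "LOAD"), (1, "LOAD"), (0, "MAC"), (0, "SEND"), (1, "RECV")]
images = group_and_tag(tasks)
```

A host could then use the `subunit_id` label on each record to route the code to the matching processor subunit.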

In some embodiments, assigning tasks associated with the series of instructions to different ones of the processor subunits may depend, at least in part, on the spatial proximity between two or more of the processor subunits on the memory chip. For example, as explained above, efficiency may be increased by reducing the number of data transfers between processor subunits. Accordingly, the processor may minimize data transfers that move data across more than two of the processor subunits. Therefore, the processor may use a known layout of the memory chip in combination with one or more optimization algorithms (such as a greedy algorithm) to assign sub-series to processor subunits in a way that maximizes (at least locally) transfers between adjacent processor subunits and minimizes (at least locally) transfers to non-adjacent processor subunits.
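
A greedy placement of the kind mentioned above might look like the following sketch. The rectangular grid layout, the Manhattan-distance cost, the traffic table, and all names are assumptions for illustration; the source names greedy algorithms only as one example of an optimization algorithm.

```python
def greedy_place(groups, traffic, grid_w, grid_h):
    """Place each sub-series on a free grid position, one at a time,
    choosing the position with the lowest traffic-weighted distance to
    the sub-series already placed (locally optimal, not globally)."""
    free = [(x, y) for y in range(grid_h) for x in range(grid_w)]
    placed = {}
    for group in groups:
        def cost(pos):
            return sum(
                traffic.get(frozenset((group, other)), 0)
                * (abs(pos[0] - p[0]) + abs(pos[1] - p[1]))
                for other, p in placed.items()
            )
        best = min(free, key=cost)
        free.remove(best)
        placed[group] = best
    return placed


# A talks to B and B talks to C, but A and C never exchange data.
traffic = {frozenset(("A", "B")): 10, frozenset(("B", "C")): 10}
placement = greedy_place(["A", "B", "C"], traffic, grid_w=3, grid_h=1)
```

In this example the communicating pairs end up on adjacent positions, which is the locally optimal outcome the paragraph above describes.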

Method 800 may include further optimizations for memory chips of the present invention. For example, the processor may group data associated with the series of instructions based on the division and assign the data to the memory banks in accordance with the grouping. Accordingly, each memory bank may hold the data used by the sub-series of instructions assigned to the processor subunit to which that memory bank is dedicated.

在一些實施例中，將資料分組可包括判定資料之至少一部分以在記憶體組中之兩者或多於兩者中複製。舉例而言，如上文所解釋，可跨越多於一個子系列指令而使用一些資料。此資料可跨越專用於經指派不同子系列指令之複數個處理器子單元的記憶體組而複製。此最佳化可進一步減少跨越處理器子單元之資料傳送。 In some embodiments, grouping data may include determining at least a portion of the data to be copied in two or more of the memory groups. For example, as explained above, some data can be used across more than one sub-series of commands. This data can be copied across memory banks dedicated to a plurality of processor subunits assigned different sub-series commands. This optimization can further reduce data transfer across processor subunits.
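
The grouping and duplication decision described above can be sketched as follows. The input format (a map from data item to the subunits that use it) and all names are assumptions for illustration.

```python
def plan_data_placement(uses):
    """Decide which data goes in which bank and which data is duplicated.

    uses: dict mapping a data item name to the set of subunit ids whose
          sub-series of instructions use that item (hypothetical input).
    Returns (banks, duplicated), where banks maps subunit id to the set of
    items stored in its dedicated bank.
    """
    banks = {}
    for name, subunits in uses.items():
        for subunit_id in subunits:
            banks.setdefault(subunit_id, set()).add(name)
    # Any item used by more than one sub-series is stored more than once.
    duplicated = {name for name, subunits in uses.items() if len(subunits) > 1}
    return banks, duplicated


uses = {"weights": {0, 1}, "frame0": {0}, "frame1": {1}}
banks, duplicated = plan_data_placement(uses)
```

Only `weights` is shared by both sub-series in this example, so it is the only item stored in both banks.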

The output of method 800 may be input to a memory chip of the present invention for execution. For example, a memory chip may comprise a plurality of processor subunits and a corresponding plurality of memory banks, each processor subunit being connected to at least one memory bank dedicated to that processor subunit, and the processor subunits of the memory chip may be configured to execute the machine code generated by method 800. As explained above with respect to FIG. 3A, host 350 may input the machine code generated by method 800 to the processor subunits for execution.

Sub-banks and Sub-controllers

In conventional memory banks, controllers are provided at the bank level. Each bank includes a plurality of mats, which are typically arranged in a rectangular manner but may be arranged in any geometric shape. Each mat includes a plurality of memory cells, which are also typically arranged in a rectangular manner but may be arranged in any geometric shape. Each cell may store a single bit of data (e.g., depending on whether the cell is held at a high voltage or a low voltage).

An example of this conventional architecture is depicted in FIGS. 9 and 10. As shown in FIG. 9, at the bank level, a plurality of mats (e.g., mats 930-1, 930-2, 940-1, and 940-2) may form bank 900. In a conventional rectangular organization, bank 900 may be controlled across global wordlines (e.g., wordline 950) and global bitlines (e.g., bitline 960). Accordingly, row decoder 910 may select the correct wordline based on an incoming control signal (e.g., a request to read from an address, a request to write to an address, or the like), and global sense amplifier 920 (and/or a global column decoder, not shown in FIG. 9) may select the correct bitline based on the control signal. Amplifier 920 may also amplify any voltage levels from the selected bank during a read operation. Although depicted as using a row decoder for initial selection and performing amplification along columns, a bank may additionally or alternatively use a column decoder for initial selection and perform amplification along rows.

FIG. 10 depicts an example of a mat 1000. For example, mat 1000 may form a portion of a memory bank, such as bank 900 of FIG. 9. As depicted in FIG. 10, a plurality of cells (e.g., cells 1030-1, 1030-2, and 1030-3) may form mat 1000. Each cell may comprise a capacitor, a transistor, or other circuitry that stores at least one bit of data. For example, a cell may comprise a capacitor that is charged to represent a "1" and discharged to represent a "0", or may comprise a flip-flop having a first state representing a "1" and a second state representing a "0". A conventional mat may comprise, for example, 512 bits by 512 bits. In embodiments where mat 1000 forms a portion of MRAM, ReRAM, or the like, a cell may comprise a transistor, resistor, capacitor, or other mechanism for isolating an ion or a portion of a material that stores at least one bit of data. For example, a cell may comprise electrolyte ions, a portion of chalcogenide glass, or the like, having a first state representing a "1" and a second state representing a "0".

As further depicted in FIG. 10, in a conventional rectangular organization, mat 1000 may be controlled across local wordlines (e.g., wordline 1040) and local bitlines (e.g., bitline 1050). Accordingly, wordline drivers (e.g., wordline drivers 1020-1, 1020-2, ..., 1020-x) may control the selected wordline to perform a read, write, or refresh based on a control signal (e.g., a request to read from an address, a request to write to an address, or a refresh signal) from a controller associated with the memory bank of which mat 1000 forms a part. Moreover, local sense amplifiers (e.g., local amplifiers 1010-1, 1010-2, ..., 1010-x) and/or local column decoders (not shown in FIG. 10) may control the selected bitline to perform a read, write, or refresh. The local sense amplifiers may also amplify any voltage levels from a selected cell during a read operation. Although depicted as using wordline drivers for initial selection and performing amplification along columns, a mat may instead use bitline drivers for initial selection and perform amplification along rows.

As explained above, a large number of mats are duplicated to form a memory bank. Memory banks may be grouped to form a memory chip. For example, a memory chip may comprise eight to thirty-two memory banks. Accordingly, pairing processor subunits with the memory banks of a conventional memory chip may yield only eight to thirty-two processor subunits. Embodiments of the present invention may therefore include memory chips with an additional sub-bank hierarchy. Such memory chips may include processor subunits, each paired with a memory sub-bank that serves as its dedicated memory bank, so as to form a larger number of sub-processors, which may in turn achieve higher parallelism and performance for in-memory computing.

In some embodiments of the present invention, the global row decoder and global sense amplifier of bank 900 may be replaced with sub-bank controllers. Accordingly, rather than sending control signals to a global row decoder and a global sense amplifier of the memory bank, a controller of the memory bank may direct the control signal to the appropriate sub-bank controller. The direction may be dynamically controlled or may be hard-wired (e.g., via one or more logic gates). In some embodiments, fuses may be used to indicate to the controller of each sub-bank or mat whether to block the control signal or pass it to the appropriate sub-bank or mat. In such embodiments, faulty sub-banks may thus be deactivated using the fuses.
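By way of illustration only, the routing described above may be sketched in software as follows. This is a minimal behavioral model, not a description of any circuit of the invention: the class names `SubBankController` and `BankController`, the `fuse_blown` flag, and the address ranges are all hypothetical, and a blown fuse is assumed here to cause the sub-bank controller to block every control signal.

```python
from dataclasses import dataclass, field


@dataclass
class SubBankController:
    """Hypothetical model of one sub-bank controller and its storage."""
    base: int                 # first address served by this sub-bank (assumed)
    size: int                 # number of addresses served (assumed)
    fuse_blown: bool = False  # True models a fuse that keeps a faulty sub-bank inactive
    cells: dict = field(default_factory=dict)

    def accepts(self, address: int) -> bool:
        # A blown fuse blocks the control signal regardless of the address.
        return (not self.fuse_blown) and self.base <= address < self.base + self.size

    def handle(self, op: str, address: int, value=None):
        if op == "write":
            self.cells[address] = value
            return None
        return self.cells.get(address, 0)


class BankController:
    """Directs each control signal to the appropriate sub-bank controller."""

    def __init__(self, sub_banks):
        self.sub_banks = list(sub_banks)

    def route(self, op: str, address: int, value=None):
        for sub_bank in self.sub_banks:
            if sub_bank.accepts(address):
                return sub_bank.handle(op, address, value)
        # No active sub-bank passed the signal on.
        raise LookupError(f"no active sub-bank serves address {address}")
```

In this sketch, a request that falls in the range of a sub-bank whose fuse is blown raises `LookupError`, while requests to the remaining sub-banks proceed normally.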

In one example of such embodiments, a memory chip may include a plurality of memory banks, each memory bank having a bank controller and a plurality of memory sub-banks, and each memory sub-bank having a sub-bank row decoder and a sub-bank column decoder to allow reads and writes to locations on the memory sub-bank. Each sub-bank may comprise a plurality of memory mats, each memory mat having a plurality of memory cells and optionally having internal local row decoders, column decoders, and/or local sense amplifiers. The sub-bank row decoders and sub-bank column decoders may process read and write requests from the bank controller or from the sub-bank processor subunit used for in-memory computations on the sub-bank memory, as described below. Additionally, each memory sub-bank may further have a controller configured to determine whether to process read and write requests from the bank controller and/or forward them to the next level (e.g., the row and column decoders on a mat), or to block the requests, e.g., to allow an internal processing element or processor subunit to access the memory. In some embodiments, the bank controller may be synchronized to a system clock. The sub-bank controllers, however, need not be synchronized to the system clock.

As explained above, the use of sub-banks may allow a memory chip to include a larger number of processor subunits than if processor subunits were paired with the memory banks of a conventional chip. Accordingly, each sub-bank may further have a processor subunit that uses the sub-bank as dedicated memory. As explained above, the processor subunit may comprise a RISC, a CISC, or other general-purpose processing subunit and/or may comprise one or more accelerators. Additionally, the processor subunit may include an address generator, as explained above. In any of the embodiments described above, each processor subunit may be configured to access the sub-bank dedicated to it using the row decoder and column decoder of that sub-bank, without using the bank controller. The processor subunit associated with a sub-bank may also handle the memory mats (including the decoders and memory redundancy mechanisms described below) and/or determine whether a read or write request from an upper level (e.g., the bank level or the memory level) is forwarded and handled accordingly.

In some embodiments, the sub-bank controller may further include a register that stores a state of the sub-bank. Accordingly, the sub-bank controller may return an error if it receives a control signal from the memory controller while the register indicates that the sub-bank is in use. In embodiments where each sub-bank further includes a processor subunit, the register may indicate an error if the processor subunit in the sub-bank is accessing memory in conflict with an external request from the memory controller.
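For illustration, the behavior of such a state register may be modeled as below. This is a simplified sketch under stated assumptions: the class `SubBankWithStatus`, its method names, and the `("ok", ...)` / `("error", ...)` return convention are hypothetical and serve only to show an external request being rejected while the sub-bank is marked as in use.

```python
class SubBankWithStatus:
    """Hypothetical sub-bank whose controller keeps an 'in use' state register."""

    def __init__(self):
        self.in_use = False  # models the state register
        self.cells = {}

    def begin_local_access(self):
        # The local processor subunit starts using the sub-bank.
        self.in_use = True

    def end_local_access(self):
        self.in_use = False

    def external_request(self, op, address, value=None):
        """Handle a control signal arriving from the memory controller."""
        if self.in_use:
            # Conflict with the local processor subunit: report an error.
            return ("error", None)
        if op == "write":
            self.cells[address] = value
            return ("ok", None)
        return ("ok", self.cells.get(address, 0))
```

A caller in this model would be expected to retry the external request once the sub-bank is released; the retry policy itself is outside the scope of the sketch.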

FIG. 11 shows an example of another embodiment of a memory bank using sub-bank controllers. In the example of FIG. 11, bank 1100 has a row decoder 1110, a column decoder 1120, and a plurality of memory sub-banks (e.g., sub-banks 1170a, 1170b, and 1170c) with sub-bank controllers (e.g., controllers 1130a, 1130b, and 1130c). The sub-bank controllers may include address resolvers (e.g., resolvers 1140a, 1140b, and 1140c), which may determine whether to pass a request to one or more sub-banks controlled by the sub-bank controller.

The sub-bank controllers may further include one or more logic circuits (e.g., logic 1150a, 1150b, and 1150c). For example, a logic circuit comprising one or more processing elements may allow one or more operations, such as refreshing cells in the sub-bank, clearing cells in the sub-bank, or the like, to be performed without processing requests from outside bank 1100. Alternatively, the logic circuit may comprise a processor subunit, as explained above, such that the processor subunit has any sub-banks controlled by the sub-bank controller as corresponding dedicated memory. In the example of FIG. 11, logic 1150a may have sub-bank 1170a as corresponding dedicated memory, logic 1150b may have sub-bank 1170b as corresponding dedicated memory, and logic 1150c may have sub-bank 1170c as corresponding dedicated memory. In any of the embodiments described above, the logic circuits may have buses to the sub-banks, e.g., buses 1131a, 1131b, or 1131c. As further depicted in FIG. 11, the sub-bank controllers may each include a plurality of decoders, such as a sub-bank row decoder and a sub-bank column decoder, to allow a processing element or processor subunit, or a higher-level memory controller issuing commands, to read and write to addresses on the memory sub-bank. For example, sub-bank controller 1130a includes decoders 1160a, 1160b, and 1160c, sub-bank controller 1130b includes decoders 1160d, 1160e, and 1160f, and sub-bank controller 1130c includes decoders 1160g, 1160h, and 1160i. Based on a request from bank row decoder 1110, the sub-bank controllers may select a wordline using the decoders included in the sub-bank controllers. The described system may allow a processing element or processor subunit of a sub-bank to access the memory without interrupting other banks or even other sub-banks, thereby allowing each sub-bank processor subunit to perform memory computations in parallel with the other sub-bank processor subunits.

Furthermore, each sub-bank may comprise a plurality of memory mats, each memory mat having a plurality of memory cells. For example, sub-bank 1170a includes mats 1190a-1, 1190a-2, ..., 1190a-x; sub-bank 1170b includes mats 1190b-1, 1190b-2, ..., 1190b-x; and sub-bank 1170c includes mats 1190c-1, 1190c-2, ..., 1190c-3. As further depicted in FIG. 11, each sub-bank may include at least one decoder. For example, sub-bank 1170a includes decoder 1180a, sub-bank 1170b includes decoder 1180b, and sub-bank 1170c includes decoder 1180c. Accordingly, bank column decoder 1120 may select a global bitline (e.g., bitline 1121a or 1121b) based on external requests, while the sub-bank selected by bank row decoder 1110 may use its column decoder to select a local bitline (e.g., bitline 1181a or 1181b) based on local requests from the logic circuit to which the sub-bank is dedicated. Each processor subunit may thus be configured to access the sub-bank dedicated to it using the row decoder and column decoder of the sub-bank, without using the bank row decoder and the bank column decoder, and may therefore access the corresponding sub-bank without interrupting other sub-banks. Moreover, when a request to the sub-bank originates outside the processor subunit, the sub-bank decoders may reflect the accessed data to the bank decoders. Alternatively, in embodiments where each sub-bank has only one row of memory mats, the local bitlines may be the bitlines of the mat rather than bitlines of the sub-bank.

The embodiments using sub-bank row decoders and sub-bank column decoders may be combined with the embodiment depicted in FIG. 11. For example, the bank row decoder may be eliminated while the bank column decoder is retained and local bitlines are used.

FIG. 12 shows an example of an embodiment of a memory sub-bank 1200 having a plurality of mats. For example, sub-bank 1200 may represent a portion of sub-bank 1100 of FIG. 11 or may represent an alternative implementation of a memory bank. In the example of FIG. 12, sub-bank 1200 includes a plurality of mats (e.g., mats 1240a and 1240b). Moreover, each mat may include a plurality of cells. For example, mat 1240a includes cells 1260a-1, 1260a-2, ..., 1260a-x, and mat 1240b includes cells 1260b-1, 1260b-2, ..., 1260b-x.

Each mat may be assigned a range of addresses that will be assigned to the memory cells of the mat. These addresses may be configured at production such that mats may be shuffled around and such that faulty mats may be deactivated and left unused (e.g., using one or more fuses, as explained further below).

Sub-bank 1200 receives read and write requests from memory controller 1210. Although not depicted in FIG. 12, requests from memory controller 1210 may be filtered through a controller of sub-bank 1200 and directed to the appropriate mat of sub-bank 1200 for address resolution. Alternatively, at least a portion (e.g., the higher bits) of the address of a request from memory controller 1210 may be transmitted to all mats of sub-bank 1200 (e.g., mats 1240a and 1240b), such that each mat processes the full address and the request associated with that address only if the assigned address range of the mat includes the address specified in the command. Similar to the sub-bank direction described above, the mat determination may be dynamically controlled or may be hard-wired. In some embodiments, fuses may be used to determine the address range of each mat, which also allows a faulty mat to be disabled by assigning it an illegal address range. Mats may additionally or alternatively be disabled by other common methods or by connection of fuses.
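The broadcast-and-compare alternative described above may be sketched as follows, purely as an illustration. The sketch assumes, hypothetically, that the low `MAT_BITS` bits of an address select a cell within a mat and the remaining higher bits are compared against a per-mat tag set by fuses; a tag of `None` stands in for an illegal address range that disables a faulty mat. None of these names or widths come from the figures.

```python
MAT_BITS = 10  # assumed: low 10 bits select a cell inside a mat


class Mat:
    """Hypothetical mat with a fuse-programmed address tag and a comparator."""

    def __init__(self, tag):
        # tag models the fuse-programmed higher address bits;
        # None models an illegal range, so the comparator never matches.
        self.tag = tag
        self.cells = {}

    def matches(self, high_bits):
        return self.tag is not None and self.tag == high_bits


def broadcast(mats, op, address, value=None):
    """Send the higher address bits to every mat; only the matching mat responds."""
    high_bits = address >> MAT_BITS
    offset = address & ((1 << MAT_BITS) - 1)
    for mat in mats:
        if mat.matches(high_bits):
            if op == "write":
                mat.cells[offset] = value
                return None
            return mat.cells.get(offset, 0)
    raise LookupError(f"no enabled mat serves address {address}")
```

For example, with mats tagged `0`, `None`, and `1`, the second mat is treated as faulty and never responds, while the third mat serves the range that the faulty mat would otherwise have held.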

In any of the embodiments described above, each mat of the sub-bank may include a row decoder (e.g., row decoder 1230a or 1230b) for selecting a wordline in the mat. In some embodiments, each mat may further include fuses and comparators (e.g., 1220a and 1220b). As described above, the comparators may allow each mat to determine whether to process an incoming request, and the fuses may allow each mat to be deactivated if faulty. Alternatively, row decoders of the bank and/or sub-bank may be used rather than a row decoder in each mat.

Furthermore, in any of the embodiments described above, a column decoder included in the appropriate mat (e.g., column decoder 1250a or 1250b) may select a local bitline (e.g., bitline 1251 or 1253). The local bitline may be connected to a global bitline of the memory bank. In embodiments where the sub-bank has local bitlines of its own, the local bitline of the cell may be further connected to the local bitline of the sub-bank. Accordingly, data in the selected cell may be read through the column decoder (and/or sense amplifier) of the cell, then through the column decoder (and/or sense amplifier) of the sub-bank (in embodiments including a sub-bank column decoder and/or sense amplifier), and then through the column decoder (and/or sense amplifier) of the bank.

Mat 1200 may be duplicated and arrayed to form a memory bank (or a memory sub-bank). For example, a memory chip of the present invention may comprise a plurality of memory banks, each memory bank having a plurality of memory sub-banks, and each memory sub-bank having a sub-bank controller for processing reads and writes to locations on the memory sub-bank. Furthermore, each memory sub-bank may comprise a plurality of memory mats, each memory mat having a plurality of memory cells and having a mat row decoder and a mat column decoder (e.g., as depicted in FIG. 12). The mat row decoders and mat column decoders may process read and write requests from the sub-bank controller. For example, the mat decoders may receive all requests and determine (e.g., using a comparator) whether to process a request based on the known address range of each mat, or the mat decoders may receive only requests within the known address range, based on selection of a mat by the sub-bank (or bank) controller.

Controller data transfer

In addition to sharing data using processing subunits, any of the memory chips of the present invention may also share data using memory controllers (or sub-bank controllers or mat controllers). For example, a memory chip of the present invention may comprise: a plurality of memory banks (e.g., SRAM banks, DRAM banks, or the like), each memory bank having a bank controller, a row decoder, and a column decoder to allow reads and writes to locations on the memory bank; and a plurality of buses connecting each controller of the plurality of bank controllers to at least one other controller of the plurality of bank controllers. The plurality of buses may be similar to the buses connecting the processing subunits, as described above, but connect the bank controllers directly rather than through the processing subunits. Furthermore, although described as connecting bank controllers, the buses may additionally or alternatively connect sub-bank controllers and/or mat controllers.

In some embodiments, the plurality of buses may be accessed without interrupting data transfers on the main buses of the memory banks connected to one or more processor subunits. Accordingly, a memory bank (or sub-bank) may transmit data to or from a corresponding processor subunit in the same clock cycle in which it transmits data to or from a different memory bank (or sub-bank). In embodiments where each controller is connected to a plurality of other controllers, the controllers may be configurable to select one of the other controllers for sending or receiving data. In some embodiments, each controller may be connected to at least one neighboring controller (e.g., pairs of spatially adjacent controllers may be connected to one another).

Redundant logic in memory circuits

The present invention generally relates to memory chips having primary logic portions for on-chip data processing. The memory chip may include redundant logic portions, which may replace defective primary logic portions to increase the manufacturing yield of the chip. Accordingly, the chip may include on-chip components that allow logic blocks in the memory chip to be configured based on individual testing of the logic portions. This feature may increase yields because memory chips with larger areas dedicated to logic portions are more susceptible to manufacturing failures. For example, DRAM memory chips with large redundant logic portions may be prone to manufacturing issues that reduce yield. Implementing redundant logic portions, however, may increase yield and reliability because it enables a manufacturer or user of DRAM memory chips to turn entire logic portions on or off while maintaining high parallelism. It should be noted that, here and throughout the present invention, examples of certain memory types (such as DRAM) may be identified to facilitate explanation of the disclosed embodiments. It is to be understood, however, that in such instances the identified memory types are not intended to be limiting. Rather, memory types such as DRAM, flash memory, SRAM, ReRAM, PRAM, MRAM, ROM, or any other memory may be used with the disclosed embodiments, even if fewer examples are specifically identified in a certain section of the present invention.

FIG. 13 is a block diagram of an exemplary memory chip 1300, consistent with the disclosed embodiments. Memory chip 1300 may be implemented as a DRAM memory chip. Memory chip 1300 may also be implemented as any type of volatile or non-volatile memory, such as flash memory, SRAM, ReRAM, PRAM, and/or MRAM. Memory chip 1300 may include a substrate 1301 in which are disposed an address manager 1302, a memory array 1304 including a plurality of memory banks 1304(a,a) to 1304(z,z), memory logic 1306, business logic 1308, and redundant business logic 1310. Memory logic 1306 and business logic 1308 may constitute primary logic blocks, while redundant business logic 1310 may constitute redundant blocks. In addition, memory chip 1300 may include configuration switches, which may include deactivation switches 1312 and activation switches 1314. Deactivation switches 1312 and activation switches 1314 may also be disposed in substrate 1301. In this application, memory logic 1306, business logic 1308, and redundant business logic 1310 may also be collectively referred to as "logic blocks".

Address manager 1302 may include row and column decoders or other types of memory auxiliaries. Alternatively or additionally, address manager 1302 may include a microcontroller or processing unit.

In some embodiments, as shown in FIG. 13, memory chip 1300 may include a single memory array 1304 that arranges the plurality of memory blocks in a two-dimensional array on substrate 1301. In other embodiments, however, memory chip 1300 may include multiple memory arrays 1304, and each of the memory arrays 1304 may arrange memory blocks in a different configuration. For example, memory blocks (also referred to as memory banks) in at least one of the memory arrays may be arranged in a radial distribution to facilitate routing from address manager 1302 or memory logic 1306 to the memory blocks.

Business logic 1308 may be used to perform in-memory computation for an application that is unrelated to the logic used to manage the memory itself. For example, business logic 1308 may implement AI-related functions, such as floating-point, integer, or MAC operations used as activation functions. In addition, business logic 1308 may implement database-related functions such as min, max, sort, count, and others. Memory logic 1306 may perform tasks related to memory management, including (but not limited to) read, write, and refresh operations. Business logic may therefore be added at one or more of the bank level, the mat level, or the mat group level. Business logic 1308 may have one or more address outputs and one or more data inputs/outputs. For example, business logic 1308 may be addressed by row\column lines to address manager 1302. In certain embodiments, however, the logic blocks may additionally or alternatively be addressed via data inputs\outputs.
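As a purely illustrative aid, the kinds of operations attributed to business logic above (a multiply-accumulate and simple database-style reductions) may be written out as follows. The function names `mac` and `summarize` and the list-based data layout are hypothetical; the sketch shows only the arithmetic and is not a model of how any logic block is implemented.

```python
def mac(weights, inputs, acc=0):
    """Multiply-accumulate: acc + sum of pairwise products of the two sequences."""
    for w, x in zip(weights, inputs):
        acc += w * x
    return acc


def summarize(values):
    """Database-style reductions over the values held in one memory block."""
    return {
        "min": min(values),
        "max": max(values),
        "count": len(values),
        "sorted": sorted(values),
    }
```

For instance, `mac([1, 2, 3], [4, 5, 6])` evaluates 1·4 + 2·5 + 3·6 = 32.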

Redundant business logic 1310 may be a replica of business logic 1308. In addition, redundant business logic 1310 may be connected to deactivation switches 1312 and/or activation switches 1314, which may include small fuses\anti-fuses and are used to logically disable or enable one of the instances (e.g., an instance connected by default) and to enable one of the other logic blocks (e.g., an instance disconnected by default). In some embodiments, as further described in connection with FIG. 15, the redundancy of blocks may be local within a logic block, such as business logic 1308.

In some embodiments, the logic blocks in memory chip 1300 may be connected to subsets of memory array 1304 by dedicated buses. For example, a set of memory logic 1306, business logic 1308, and redundant business logic 1310 may be connected to the first row of memory blocks in memory array 1304 (i.e., memory blocks 1304(a,a) to 1304(a,z)). The dedicated buses may allow the associated logic blocks to quickly access data from the memory blocks without requiring communication lines to be opened through, for example, address manager 1302.

Each of the plurality of primary logic blocks may be connected to at least one of the plurality of memory banks 1304. Likewise, redundant blocks, such as redundant business block 1310, may be connected to at least one of the memory instances 1304(a,a) to 1304(z,z). A redundant block may replicate at least one of the plurality of primary logic blocks, such as memory logic 1306 or business logic 1308. Deactivation switches 1312 may be connected to at least one of the plurality of primary logic blocks, and activation switches 1314 may be connected to at least one of the plurality of redundant blocks.

In these embodiments, upon detection of a fault associated with one of the plurality of primary logic blocks (memory logic 1306 and/or business logic 1308), deactivation switches 1312 may be configured to disable that primary logic block. At the same time, activation switches 1314 may be configured to enable the one of the plurality of redundant blocks, such as redundant logic block 1310, that replicates that primary logic block.

In addition, activation switches 1314 and deactivation switches 1312, which may be collectively referred to as "configuration switches," may include an external input for configuring the state of the switch. For example, activation switch 1314 may be configured so that an activation signal on the external input causes a closed-switch condition, while deactivation switch 1312 may be configured so that a deactivation signal on the external input causes an open-switch condition. In some embodiments, all configuration switches in memory chip 1300 may be deactivated by default and become activated or enabled after a test indicates that the associated logic block is functional and a signal is applied to the external input. Alternatively, in some cases, all configuration switches in memory chip 1300 may be enabled by default and may be deactivated or disabled after a test indicates that the associated logic block is not functional and a deactivation signal is applied to the external input.

Regardless of whether a configuration switch is initially enabled or disabled, upon detection of a fault associated with its associated logic block, the configuration switch may disable that logic block. Where the configuration switch is initially enabled, its state may be changed to disabled in order to disable the associated logic block. Where the configuration switch is initially disabled, its state may be left disabled in order to disable the associated logic block. For example, the result of an operability test may indicate that a certain logic block is non-operational or cannot operate within certain specifications. In such cases, the logic block may be disabled by not enabling its corresponding configuration switch.
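The two default policies described above lead to the same final state. The following is a minimal sketch of that behavior, not the disclosed circuit; the class `ConfigSwitch` and its method name are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ConfigSwitch:
    """Hypothetical model of a configuration switch guarding one logic block."""

    enabled: bool  # default state chosen at design time (assumption)

    def apply_test_result(self, block_passed: bool) -> bool:
        """Set the final state from a test result; return True if the state changed."""
        changed = self.enabled != block_passed
        # A passing block ends up enabled and a failing block ends up disabled,
        # whichever default the switch started from.
        self.enabled = block_passed
        return changed


# Default-enabled switch, faulty block: the state must be changed to disabled.
sw_a = ConfigSwitch(enabled=True)
changed_a = sw_a.apply_test_result(block_passed=False)

# Default-disabled switch, faulty block: the state is simply left disabled.
sw_b = ConfigSwitch(enabled=False)
changed_b = sw_b.apply_test_result(block_passed=False)
```

In both cases the faulty block ends up disabled; only the default-enabled switch needs an explicit deactivation signal.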

In some embodiments, a configuration switch may be connected to two or more logic blocks and may be configured to select between them. For example, a configuration switch may be connected to both business logic 1308 and redundant business logic 1310. The configuration switch may enable redundant business logic 1310 while disabling business logic 1308.

Alternatively or additionally, at least one of the plurality of primary logic blocks (memory logic 1306 and/or business logic 1308) may be connected to a subset of the plurality of memory banks or memory instances 1304 by a first dedicated connection. At least one redundant block of the plurality of redundant blocks (such as redundant business logic 1310) that replicates the at least one primary logic block may then be connected to the same subset of memory banks or instances 1304 by a second dedicated connection.

In addition, memory logic 1306 may have functions and capabilities different from those of business logic 1308. For example, while memory logic 1306 may be designed to enable read and write operations in memory banks 1304, business logic 1308 may be designed to perform in-memory computations. Therefore, if business logic 1308 includes a first business logic block and a second business logic block (such as redundant business logic 1310), it is possible to disconnect defective business logic 1308 and connect redundant business logic 1310 in its place without losing any capability.

In some embodiments, the configuration switches (including deactivation switches 1312 and activation switches 1314) may be implemented with fuses, anti-fuses, programmable devices (including one-time programmable devices), or other forms of non-volatile memory.

FIG. 14 is a block diagram of an exemplary redundant logic block set 1400, consistent with the disclosed embodiments. In some embodiments, redundant logic block set 1400 may be disposed in substrate 1301. Redundant logic block set 1400 may include at least one of business logic 1308 and redundant business logic 1310, connected to switches 1312 and 1314, respectively. In addition, business logic 1308 and redundant business logic 1310 may be connected to an address bus 1402 and a data bus 1404.

In some embodiments, as shown in FIG. 14, switches 1312 and 1314 may connect the logic blocks to a clock node. In this way, the configuration switches may engage or disengage the logic blocks from the clock signal, effectively activating or deactivating them. In other embodiments, however, switches 1312 and 1314 may connect the logic blocks to other nodes for activation or deactivation. For example, a configuration switch may connect a logic block to a voltage supply node (e.g., VCC), to a ground node (e.g., GND), or to a clock signal. In this way, the logic blocks may be enabled or disabled by the configuration switches, because the switches can create an open circuit or cut off the power supplied to the logic block.

In some embodiments, as shown in FIG. 14, address bus 1402 and data bus 1404 may be on opposite sides of the logic blocks, which are connected in parallel to each of the buses. In this way, routing of the different on-chip components may be facilitated by logic block set 1400.

In some embodiments, each of the plurality of deactivation switches 1312 couples at least one of the plurality of primary logic blocks to a clock node, and each of the plurality of activation switches 1314 may couple at least one of the plurality of redundant blocks to the clock node, so that connecting or disconnecting the clock serves as a simple activation/deactivation mechanism.

Redundant business logic 1310 of redundant logic block set 1400 allows the designer to choose, based on area and routing, which blocks are worth duplicating. For example, a chip designer may select larger blocks for duplication because larger blocks may be more error-prone. Alternatively, a designer may prefer to duplicate smaller logic blocks because they can be duplicated without a significant loss of space. Moreover, using the configuration in FIG. 14, a designer may readily choose which logic blocks to duplicate depending on the error statistics of each area.

FIG. 15 is a block diagram of an exemplary logic block 1500, consistent with the disclosed embodiments. The logic block may be business logic 1308 and/or redundant business logic 1310. In other embodiments, however, the exemplary logic block may describe memory logic 1306 or other components of memory chip 1300.

Logic block 1500 presents yet another embodiment in which logic redundancy is used within a small processor pipeline. Logic block 1500 may include a register 1508, a fetch circuit 1504, a decoder 1506, and a write-back circuit 1518. In addition, logic block 1500 may include a computation unit 1510 and a duplicated computation unit 1512. In other embodiments, however, logic block 1500 may include other units that do not comprise a controller pipeline but instead include distributed processing elements that comprise the required business logic.

Computation unit 1510 and duplicated computation unit 1512 may include digital circuits capable of performing digital calculations. For example, computation unit 1510 and duplicated computation unit 1512 may include an arithmetic logic unit (ALU) to perform arithmetic and bitwise operations on binary numbers. Alternatively, computation unit 1510 and duplicated computation unit 1512 may include a floating-point unit (FPU), which operates on floating-point numbers. In addition, in some embodiments, computation unit 1510 and duplicated computation unit 1512 may implement database-related functions such as minimum, maximum, count, and compare operations, among others.

In some embodiments, as shown in FIG. 15, computation unit 1510 and duplicated computation unit 1512 may be connected to switching circuits 1514 and 1516. When activated, the switching circuits may enable or disable the computation units.

In logic block 1500, duplicated computation unit 1512 may replicate computation unit 1510. Moreover, in some embodiments, register 1508, fetch circuit 1504, decoder 1506, and write-back circuit 1518 (collectively referred to as the local logic units) may be smaller than computation unit 1510. Because larger elements are more prone to problems during fabrication, a designer may decide to replicate larger units (such as computation unit 1510) rather than smaller units (such as the local logic units). Depending on historical yields and error rates, however, a designer may also elect to duplicate the local logic units in addition to, or instead of, the large units (or the entire block). For example, computation unit 1510 may be larger, and thus more error-prone, than register 1508, fetch circuit 1504, decoder 1506, and write-back circuit 1518. A designer may then choose to duplicate computation unit 1510 rather than the other elements in logic block 1500 or the entire block.

Logic block 1500 may include a plurality of local configuration switches, each connected to at least one of computation unit 1510 or duplicated computation unit 1512. When a fault is detected in computation unit 1510, the local configuration switches may be configured to disable computation unit 1510 and enable duplicated computation unit 1512.

FIG. 16 shows block diagrams of exemplary logic blocks connected with a bus, consistent with the disclosed embodiments. In some embodiments, logic blocks 1602 (which may represent memory logic 1306, business logic 1308, or redundant business logic 1310) may be independent of one another, may be connected via a bus, and may be activated externally by addressing them specifically. For example, memory chip 1300 may include many logic blocks, each having an ID code. In other embodiments, however, a logic block 1602 may represent a larger unit that includes a plurality of one or more of memory logic 1306, business logic 1308, or redundant business logic 1310.

In some embodiments, each of logic blocks 1602 may be redundant with the other logic blocks 1602. This complete redundancy, in which every block may operate as either a primary or a redundant block, may improve fabrication yields because a designer can disconnect faulty units while maintaining the functionality of the overall chip. For example, a designer may be able to disable error-prone logic areas while maintaining similar computational capability, because all duplicate blocks may be connected to the same address and data buses. For example, the initial number of logic blocks 1602 may be greater than the target capability, so that disabling some logic blocks 1602 does not affect the target capability.

The buses connected to the logic blocks may include an address bus 1614, command lines 1616, and data lines 1618. As shown in FIG. 16, each of the logic blocks may be connected independently to each line in the bus. In certain embodiments, however, logic blocks 1602 may be connected in a hierarchical structure to facilitate routing. For example, each line in the bus may be connected to a multiplexer that routes the line to different logic blocks 1602.

In some embodiments, to allow external access without knowledge of the internal chip structure (which may change as units are enabled and disabled), each of the logic blocks may include a fused ID, such as fused identification 1604. Fused identification 1604 may include an array of switches (such as fuses) that determine an ID, and it may be connected to a managing circuit. For example, fused identification 1604 may be connected to address manager 1302. Alternatively, fused identification 1604 may be connected to a higher-level memory address unit. In these embodiments, fused identification 1604 may be configurable for a specific address. For example, fused identification 1604 may include a programmable, non-volatile device that determines a final ID based on instructions received from the managing circuit.

A distributed processor on a memory chip may be designed with the configuration depicted in FIG. 16. A test procedure, executed as a BIST at chip wake-up or at factory testing, may assign running ID codes to those blocks of the plurality of primary logic blocks (memory logic 1306 and business logic 1308) that pass a test protocol. The test procedure may also assign illegal ID codes to those primary logic blocks that fail the test protocol, thereby disabling them, and may assign running ID codes to those blocks of the plurality of redundant blocks (redundant business logic 1310) that pass the test protocol. Because redundant blocks replace failing primary logic blocks, the number of redundant blocks assigned running ID codes may be equal to or greater than the number of primary logic blocks assigned illegal ID codes. In addition, each of the plurality of primary logic blocks and each of the plurality of redundant blocks may include at least one fused identification 1604. Also, as shown in FIG. 16, the bus connecting logic blocks 1602 may include command lines, data lines, and address lines.

In other embodiments, however, all logic blocks 1602 connected to the bus start out disabled and without an ID code. The blocks are tested one by one: each good logic block receives a running ID code, while logic blocks that do not work retain an illegal ID, which keeps them disabled. In this manner, redundant logic blocks may improve fabrication yields by replacing blocks found to be defective during the testing process.

Address bus 1614 may couple a managing circuit to each of the plurality of memory banks, each of the plurality of primary logic blocks, and each of the plurality of redundant blocks. These connections allow the managing circuit, upon detecting a fault associated with a primary logic block (such as business logic 1308), to assign an invalid address to that primary logic block and a valid address to one of the plurality of redundant blocks.

For example, as shown in FIG. 16A, an illegal ID (e.g., address 0xFFF) is initially configured for all logic blocks 1602(a) to 1602(c). After testing, logic blocks 1602(a) and 1602(c) are verified to be functional, while logic block 1602(b) is not. In FIG. 16A, unshaded logic blocks represent logic blocks that passed the functionality test, while shaded logic blocks represent logic blocks that failed it. The test procedure therefore changes the illegal ID to a legal ID for each functional logic block, while leaving the illegal ID in place for the non-functional logic block. In FIG. 16A, the addresses of logic blocks 1602(a) and 1602(c) are changed from 0xFFF to 0x001 and 0x002, respectively, whereas the address of logic block 1602(b) remains the illegal address 0xFFF. In some embodiments, the ID is changed by programming the corresponding fused identification 1604.
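The ID assignment of FIG. 16A can be sketched in a few lines. This is an illustrative model only: the function name `assign_ids` and the dictionary representation are assumptions, and only the values 0xFFF, 0x001, and 0x002 come from the example above.

```python
ILLEGAL_ID = 0xFFF  # illegal address used in the FIG. 16A example


def assign_ids(test_results):
    """Map each block name to an ID, given {name: passed} in test order.

    Hypothetical helper: passing blocks receive consecutive running IDs
    starting at 0x001; failing blocks keep the illegal ID and stay disabled.
    """
    ids = {}
    next_id = 0x001
    for name, passed in test_results.items():
        if passed:
            ids[name] = next_id
            next_id += 1
        else:
            ids[name] = ILLEGAL_ID
    return ids


# FIG. 16A scenario: 1602(a) and 1602(c) pass, 1602(b) fails.
fig_16a = assign_ids({"1602(a)": True, "1602(b)": False, "1602(c)": True})
```

Because running IDs are handed out consecutively, an external device can address working blocks 0x001 and 0x002 without knowing which physical block failed.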

Different test results for logic blocks 1602 may produce different configurations. For example, as shown in FIG. 16B, address manager 1302 may initially assign the illegal ID (i.e., 0xFFF) to all logic blocks 1602. The test results may then indicate that the two logic blocks 1602(a) and 1602(b) are functional. In that case, testing logic block 1602(c) may be unnecessary, because memory chip 1300 may need only two logic blocks. Therefore, to minimize testing resources, logic blocks may be tested only until the minimum number of functional logic blocks required by the product definition of memory chip 1300 is reached, leaving the other logic blocks untested. In FIG. 16B, unshaded logic blocks represent tested logic blocks that passed the functionality test, and shaded logic blocks represent untested logic blocks.

In these embodiments, a production tester (external or internal, automatic or manual) or a controller executing a BIST at startup may change the illegal ID to a running ID for the tested logic blocks that are functional, while leaving the illegal ID for the untested logic blocks. In FIG. 16B, the addresses of logic blocks 1602(a) and 1602(b) are changed from 0xFFF to 0x001 and 0x002, respectively, whereas the address of untested logic block 1602(c) remains the illegal address 0xFFF.
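The early-stopping variant of FIG. 16B can be sketched as follows. The function `assign_until_enough`, its `run_test` callback, and the `required` parameter are hypothetical names used for illustration; the disclosure does not specify how the tester is implemented.

```python
ILLEGAL_ID = 0xFFF  # every block starts with the illegal ID


def assign_until_enough(blocks, run_test, required):
    """Test blocks in order and stop once `required` working blocks are found.

    Returns (ids, tested): the ID of every block, and the blocks actually tested.
    Untested blocks simply keep the illegal ID.
    """
    ids = {name: ILLEGAL_ID for name in blocks}
    tested = []
    working = 0
    for name in blocks:
        if working >= required:
            break  # product definition already satisfied; skip remaining tests
        tested.append(name)
        if run_test(name):
            working += 1
            ids[name] = working  # running IDs 0x001, 0x002, ...
    return ids, tested


# FIG. 16B scenario: the chip needs two blocks and the first two pass.
fig_16b_ids, fig_16b_tested = assign_until_enough(
    ["1602(a)", "1602(b)", "1602(c)"], run_test=lambda name: True, required=2
)
```

Here block 1602(c) is never tested, which is the saving in test resources that the passage describes.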

FIG. 17 is a block diagram of exemplary units 1702 and 1712 connected in series, consistent with the disclosed embodiments. FIG. 17 may represent an entire system or chip. Alternatively, FIG. 17 may represent a block in a chip containing other functional blocks.

Units 1702 and 1712 may represent complete units that include a plurality of logic blocks, such as memory logic 1306 and/or business logic 1308. In these embodiments, units 1702 and 1712 may also include elements required to perform operations, such as address manager 1302. In other embodiments, however, units 1702 and 1712 may represent logic units such as business logic 1308 or redundant business logic 1310.

FIG. 17 presents embodiments in which units 1702 and 1712 may need to communicate with one another. In such cases, units 1702 and 1712 may be connected in series. A non-working unit, however, may break the continuity between the logic blocks. Therefore, the connection between units may include a bypass option for when a unit needs to be disabled because of a defect. The bypass option may also be part of the bypassed unit itself.

In FIG. 17, units may be connected in series (e.g., 1702(a) to 1702(c)), and a failing unit (e.g., 1702(b)) may be bypassed when it is defective. The units may further be connected in parallel with switching circuits. For example, in some embodiments, units 1702 and 1712 may be connected with switching circuits 1722 and 1728, as depicted in FIG. 17. In the example depicted in FIG. 17, unit 1702(b) is defective; for example, it did not pass a test for circuit functionality. Therefore, unit 1702(b) may be disabled using, for example, activation switches 1314 (not shown in FIG. 17), and/or switching circuit 1722(b) may be activated to bypass unit 1702(b) and maintain connectivity between the logic blocks.

Accordingly, when a plurality of primary units are connected in series, each of the plurality of units may be connected in parallel with a parallel switch. Upon detection of a fault associated with one of the plurality of units, the parallel switch connected to that unit may be activated to connect its two neighboring units to each other.
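The effect of activating a parallel switch can be illustrated by computing which units end up directly connected. This is a minimal sketch under the assumption that a bypassed unit simply drops out of the chain; the function `effective_links` is a hypothetical name.

```python
def effective_links(units, failed):
    """Return the unit-to-unit links of a series chain after bypassing failed units.

    `units` is the chain in order; `failed` is the set of units whose parallel
    (bypass) switch has been activated.
    """
    working = [u for u in units if u not in failed]
    # Each working unit is linked to the next working unit in the chain.
    return list(zip(working, working[1:]))


# FIG. 17 scenario: 1702(b) is defective and bypassed.
links = effective_links(["1702(a)", "1702(b)", "1702(c)"], {"1702(b)"})
```

With unit 1702(b) bypassed, 1702(a) connects directly to 1702(c), so the continuity of the chain is preserved.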

In other embodiments, as shown in FIG. 17, switching circuits 1728 may include one or more sampling points that introduce a delay of one or more cycles in order to maintain synchronization between different lines of units. When a unit is disabled, shorting the connection between the adjacent logic blocks may cause synchronization errors with other calculations. For example, if a task requires data from both line A and line B, and each of A and B is carried by an independent series of units, disabling a unit would desynchronize the lines and require further data management. To prevent desynchronization, sampling circuit 1730 may simulate the delay that disabled unit 1712(b) would have contributed. In some embodiments, however, the parallel switch may include an anti-fuse instead of sampling circuit 1730.
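The synchronization argument can be made concrete with a small latency model. This is a sketch under a simplifying assumption that is not stated in the disclosure: every working unit, and every sampling point, adds exactly one cycle. The function name `row_latency` is hypothetical.

```python
def row_latency(units_ok, bypass_with_sampling):
    """Cycles needed for data to traverse one series line of units.

    Assumption: each working unit costs one cycle. A disabled unit costs one
    cycle if its bypass contains a sampling point, and zero if it is a plain short.
    """
    cycles = 0
    for ok in units_ok:
        if ok or bypass_with_sampling:
            cycles += 1
    return cycles


line_a = row_latency([True, True, True], bypass_with_sampling=False)         # all units work
line_b_short = row_latency([True, False, True], bypass_with_sampling=False)  # plain bypass
line_b_sampled = row_latency([True, False, True], bypass_with_sampling=True) # delayed bypass
```

Under this model the plain bypass makes line B one cycle faster than line A, while the sampled bypass keeps both lines at the same latency.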

FIG. 18 is a block diagram of exemplary units connected in a two-dimensional array, consistent with the disclosed embodiments. FIG. 18 may represent an entire system or chip. Alternatively, FIG. 18 may represent a block in a chip containing other functional blocks.

Units 1806 may represent autonomous units that include a plurality of logic blocks, such as memory logic 1306 and/or business logic 1308. In other embodiments, however, units 1806 may represent logic units such as business logic 1308. Where convenient, the discussion of FIG. 18 may refer to elements identified in FIG. 13 (e.g., memory chip 1300) and discussed above.

As shown in FIG. 18, units may be arranged in a two-dimensional array in which units 1806 (which may include or represent one or more of memory logic 1306, business logic 1308, or redundant business logic 1310) are interconnected via switching boxes 1808 and connection boxes 1810. In addition, to control the configuration of the two-dimensional array, the array may include I/O blocks 1804 at its periphery.

Connection boxes 1810 may be programmable and reconfigurable devices that respond to signals input from I/O blocks 1804. For example, a connection box may include a plurality of input pins from units 1806 and may also be connected to switching boxes 1808. Alternatively, connection boxes 1810 may include a group of switches that connect the pins of programmable logic cells to routing tracks, while switching boxes 1808 may include a group of switches that connect different tracks.

In certain embodiments, connection boxes 1810 and switching boxes 1808 may be implemented with configuration switches such as switches 1312 and 1314. In these embodiments, connection boxes 1810 and switching boxes 1808 may be configured by a production tester or by a BIST executed at chip startup.

In some embodiments, connection boxes 1810 and switching boxes 1808 may be configured after units 1806 are tested for circuit functionality. In these embodiments, I/O blocks 1804 may be used to send test signals to units 1806. Depending on the test results, I/O blocks 1804 may send programming signals that configure connection boxes 1810 and switching boxes 1808 so as to disable the units 1806 that fail the test protocol and enable the units 1806 that pass it.

In these embodiments, the plurality of primary logic blocks and the plurality of redundant blocks may be disposed on the substrate in a two-dimensional grid. Accordingly, each of the plurality of primary units 1806 and each of the plurality of redundant blocks (such as redundant business logic 1310) may be interconnected with switching boxes 1808, and input blocks may be disposed at the periphery of each row and each column of the two-dimensional grid.

FIG. 19 is a block diagram of exemplary units in a complex connection, consistent with the disclosed embodiments. FIG. 19 may represent an entire system. Alternatively, FIG. 19 may represent a block in a chip containing other functional blocks.

The complex connection of FIG. 19 includes units 1902(a) to 1902(f) and configuration switches 1904(a) to 1904(f). Units 1902 may represent autonomous units that include a plurality of logic blocks, such as memory logic 1306 and/or business logic 1308. In other embodiments, however, units 1902 may represent logic units such as memory logic 1306, business logic 1308, or redundant business logic 1310. Configuration switches 1904 may include any of deactivation switches 1312 and activation switches 1314.

As shown in FIG. 19, the complex connection may include units 1902 in two planes. For example, the complex connection may include two independent substrates separated along the z-axis. Alternatively or additionally, units 1902 may be arranged on two surfaces of a substrate. For example, to reduce the area of memory chip 1300, substrate 1301 may be arranged in two overlapping surfaces and connected with configuration switches 1904 arranged in three dimensions. The configuration switches may include deactivation switches 1312 and/or activation switches 1314.

A first plane of the substrate may include "main" units 1902, which may be enabled by default. In these embodiments, a second plane may include "redundant" units 1902, which may be disabled by default.

在一些實施例中，組態開關1904可包括反熔斷器。因此，在測試單元1902之後，區塊可藉由將某些反熔斷器切換至「始終接通」及使選定單元1902去能來連接於起作用單元之塊中，即使該等單元在不同平面中亦如此。在圖19中所呈現之實例中，「主」單元中之一者(單元1902(e))不工作。圖19可將不起作用區塊或未測試區塊表示為陰影區塊，而受測試或起作用區塊可為無陰影的。因此，組態開關1904經組態以使得不同平面中之邏輯區塊中之一者(例如，單元1902(f))變為在作用中。以此方式，即使主邏輯區塊中之一者有缺陷，記憶體晶片仍藉由替換備用邏輯單元而工作。 In some embodiments, the configuration switches 1904 may include anti-fuses. Therefore, after the units 1902 are tested, the blocks can be connected into a tile of functioning units by switching certain anti-fuses to "always on" and disabling selected units 1902, even if the units are in different planes. In the example presented in FIG. 19, one of the "main" units (unit 1902(e)) does not work. FIG. 19 may represent non-functional or untested blocks as shaded blocks, while tested or functional blocks may be unshaded. Therefore, the configuration switches 1904 are configured so that one of the logic blocks in a different plane (e.g., unit 1902(f)) becomes active. In this way, even if one of the main logic blocks is defective, the memory chip still works by substituting a spare logic unit.

圖19另外展示不測試第二平面中之單元1902中之一者(亦即，1902(c))或對其賦能，此係因為主邏輯區塊起作用。舉例而言，在圖19中，兩個主單元1902(a)及1902(d)通過功能性測試。因此，單元1902(c)未被測試或賦能。因此，圖19展示特定地選擇取決於測試結果而變為在作用中之邏輯區塊的能力。 FIG. 19 additionally shows that one of the cells 1902 in the second plane (ie, 1902(c)) is not tested or enabled because the main logic block functions. For example, in Figure 19, two main units 1902(a) and 1902(d) pass the functional test. Therefore, unit 1902(c) has not been tested or enabled. Therefore, FIG. 19 shows the ability to specifically select the logic block that becomes active depending on the test result.

在一些實施例中，如圖19中所展示，並非第一平面中之所有單元1902均可具有對應的備用或冗餘區塊。然而，在其他實施例中，所有單元可彼此冗餘以實現完全冗餘，其中所有單元均為主要或冗餘的。此外，雖然一些實施方案可遵循圖19中所描繪之星形網路拓樸，但其他實施方案可使用並聯連接、串聯連接及/或將不同元件與組態開關並聯地或串聯地耦接。 In some embodiments, as shown in FIG. 19, not all cells 1902 in the first plane may have corresponding spare or redundant blocks. However, in other embodiments, all units may be redundant with each other to achieve complete redundancy, where all units are primary or redundant. In addition, although some implementations may follow the star network topology depicted in FIG. 19, other implementations may use parallel connections, series connections, and/or coupling different components in parallel or in series with the configuration switch.

圖20為說明符合所揭示實施例之冗餘區塊賦能處理程序2000的例示性流程圖。可針對記憶體晶片1300且特別地針對DRAM記憶體晶片實施賦能處理程序2000。在一些實施例中，處理程序2000可包括以下步驟：測試記憶體晶片之基板上的複數個邏輯區塊中之每一者的至少一個電路功能性；基於測試結果識別複數個主要邏輯區塊中之故障邏輯區塊；測試記憶體晶片之基板上的至少一個冗餘或額外邏輯區塊的至少一個電路功能性；藉由將外部信號施加至不啟動開關來使至少一個故障邏輯區塊去能；及藉由將該外部信號施加至啟動開關來對該至少一個冗餘區塊賦能，該啟動開關與該至少一個冗餘區塊連接且安置於該記憶體晶片之該基板上。以下圖20之描述進一步詳述處理程序2000之每一步驟。 FIG. 20 is an exemplary flowchart illustrating a redundant block enabling processing program 2000 in accordance with the disclosed embodiment. The enabling processing program 2000 can be implemented for the memory chip 1300 and particularly for the DRAM memory chip. In some embodiments, the processing program 2000 may include the following steps: testing at least one circuit functionality of each of a plurality of logic blocks on the substrate of the memory chip; identifying faulty logic blocks among the plurality of main logic blocks based on the test results; testing at least one circuit functionality of at least one redundant or additional logic block on the substrate of the memory chip; disabling at least one faulty logic block by applying an external signal to an inactivation switch; and enabling the at least one redundant block by applying the external signal to an activation switch, the activation switch being connected to the at least one redundant block and disposed on the substrate of the memory chip. The following description of FIG. 20 further details each step of the processing program 2000.

處理程序2000可包括測試諸如商業區塊1308之複數個邏輯區塊 (步驟2002)以及複數個冗餘區塊(例如，冗餘商業區塊1310)。測試可在封裝之前使用例如用於晶圓上測試之探測站進行。然而，步驟2000亦可在封裝之後執行。 The processing program 2000 may include testing a plurality of logic blocks such as business block 1308 (Step 2002) and a plurality of redundant blocks (for example, redundant commercial block 1310). Testing can be performed before packaging using, for example, a probing station for on-wafer testing. However, step 2000 can also be performed after packaging.

步驟2002中之測試可包括將測試信號之有限序列施加至記憶體晶片1300中之每個邏輯區塊或記憶體晶片1300中之邏輯區塊的子集。該等測試信號可包括請求預期得到0或1之運算。在其他實施例中，測試信號可請求讀取記憶體組中之特定位址或寫入特定記憶體組中。 The test in step 2002 may include applying a finite sequence of test signals to each logic block in the memory chip 1300 or to a subset of the logic blocks in the memory chip 1300. The test signals may include requests for operations that are expected to yield a 0 or a 1. In other embodiments, the test signals may request reading a specific address in a memory bank or writing to a specific memory bank.

可在步驟2002中實施測試技術以測試邏輯區塊在反覆處理程序下之回應。舉例而言，該測試可涉及藉由傳輸將資料寫入記憶體組中之指令及接著驗證寫入資料之完整性來測試邏輯區塊。在一些實施例中，該測試可包括利用反轉資料重複演算法。 Test techniques can be implemented in step 2002 to test the response of the logic blocks under an iterative process. For example, the test may involve testing a logic block by transmitting instructions that write data into a memory bank and then verifying the integrity of the written data. In some embodiments, the test may include repeating the algorithm with inverted data.
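
The write-then-verify test with inverted data described above can be sketched as follows. This is only an illustrative sketch, not the tester defined by the disclosure: the `Bank` class, its `stuck_low` fault model, the 8-bit word width, and the 0xAA pattern are all assumptions introduced for the example.

```python
def passes_inverted_data_test(write, read, addresses, pattern=0xAA, width=8):
    """Write a pattern to every address, read it back, then repeat with the
    bitwise-inverted pattern so that every bit is exercised as both 0 and 1."""
    mask = (1 << width) - 1
    for value in (pattern & mask, ~pattern & mask):
        for addr in addresses:
            write(addr, value)
        for addr in addresses:
            if read(addr) != value:
                return False  # integrity check on the written data failed
    return True


class Bank:
    """Hypothetical memory bank model; `stuck_low` maps an address to the
    bits that are stuck at 0 at that address."""

    def __init__(self, stuck_low=None):
        self.cells = {}
        self.stuck_low = stuck_low or {}

    def write(self, addr, value):
        self.cells[addr] = value & ~self.stuck_low.get(addr, 0)

    def read(self, addr):
        return self.cells[addr]


good = Bank()
bad = Bank(stuck_low={3: 0b1})  # bit 0 of address 3 is stuck at 0
good_ok = passes_inverted_data_test(good.write, good.read, range(8))
bad_ok = passes_inverted_data_test(bad.write, bad.read, range(8))
```

In this model the first pass (0xAA) does not expose the fault, because the stuck bit is written as 0 anyway; only the inverted pass (0x55) does, which is the motivation for repeating the algorithm with inverted data.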

在替代實施例中，步驟2002之測試可包括運行邏輯區塊之模型以基於測試指令集產生目標記憶體影像。接著，可對記憶體晶片中之邏輯區塊執行同一指令序列，且可記錄結果。模擬之殘餘記憶體影像亦可與自測試獲得之影像進行比較，且任何失配可標示為故障。 In an alternative embodiment, the test of step 2002 may include running a model of a logic block to generate a target memory image based on the test instruction set. Then, the same command sequence can be executed on the logic block in the memory chip, and the result can be recorded. The simulated residual memory image can also be compared with the image obtained from the self-test, and any mismatch can be marked as a fault.
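
The comparison of a model-generated target memory image against the image left by the block under test might look like the sketch below. The two-instruction set (`store`, `add`), the injected `faulty_adder` fault, and all function names are hypothetical and serve only to make the example runnable.

```python
def execute(memory, instruction, faulty_adder=False):
    """Tiny hypothetical instruction set:
    ("store", addr, value) and ("add", dst, src_a, src_b)."""
    op = instruction[0]
    if op == "store":
        _, addr, value = instruction
        memory[addr] = value
    elif op == "add":
        _, dst, a, b = instruction
        result = memory[a] + memory[b]
        # The injected fault flips bit 0 of every sum.
        memory[dst] = result ^ 1 if faulty_adder else result
    else:
        raise ValueError(f"unknown instruction: {op}")


def memory_image(program, faulty_adder=False):
    """Run the whole program and return the residual memory image."""
    memory = {}
    for instruction in program:
        execute(memory, instruction, faulty_adder)
    return memory


def mismatches(expected, observed):
    """Sorted addresses at which the two memory images differ."""
    addresses = expected.keys() | observed.keys()
    return sorted(a for a in addresses if expected.get(a) != observed.get(a))


program = [("store", 0, 5), ("store", 1, 7), ("add", 2, 0, 1)]
target = memory_image(program)                        # model run
observed = memory_image(program, faulty_adder=True)   # stand-in for the block under test
faults = mismatches(target, observed)
```

Any non-empty `faults` list would correspond to a mismatch that is flagged as a fault.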

替代地，在步驟2002中，該測試可包括陰影模型化，在陰影模型化中會產生診斷，但未必預測結果。實情為，使用陰影模型化之測試可對記憶體晶片及模擬兩者並列地運行。舉例而言，當記憶體晶片中之邏輯區塊完成指令或任務時，模擬可經發信以執行同一指令。一旦記憶體晶片中之邏輯區塊完成該等指令，便可將兩個模型之架構狀態進行比較。若存在失配，則標示故障。 Alternatively, in step 2002, the test may include shadow modeling, in which a diagnosis is generated but the result is not necessarily predicted. Instead, a test using shadow modeling may run in parallel on both the memory chip and a simulation. For example, when a logic block in the memory chip completes an instruction or task, the simulation may be signaled to execute the same instruction. Once the logic block in the memory chip has completed the instructions, the architectural states of the two models can be compared. If there is a mismatch, a fault is flagged.

在一些實施例中，可在步驟2002中測試所有邏輯區塊(包括例如記憶體邏輯1306、商業邏輯1308或冗餘商業邏輯1310中之每一者)。然而，在其他實施例中，可在不同測試回合中僅測試邏輯區塊之子集。舉例而言，在第一測試回合中，可僅測試記憶體邏輯1306及相關聯區塊。在第二回合中，可僅測試商業邏輯1308及相關聯區塊。在第三回合中，取決於前兩個回合之結果，可測試與冗餘商業邏輯1310相關聯之邏輯區塊。 In some embodiments, all logic blocks (including, for example, each of memory logic 1306, business logic 1308, or redundant business logic 1310) may be tested in step 2002. However, in other embodiments, only a subset of the logic blocks may be tested in each test round. For example, in a first test round, only the memory logic 1306 and associated blocks may be tested. In a second round, only the business logic 1308 and associated blocks may be tested. In a third round, depending on the results of the first two rounds, the logic blocks associated with the redundant business logic 1310 may be tested.

處理程序2000可繼續至步驟2004。在步驟2004中，可識別故障邏輯區塊，且亦可識別故障冗餘區塊。舉例而言，未通過步驟2002之測試的邏輯區塊可在步驟2004中識別為故障區塊。然而，在其他實施例中，最初僅可識別某些故障邏輯區塊。舉例而言，在一些實施例中，僅可識別與商業邏輯1308相關聯之邏輯區塊，且僅在需要故障冗餘區塊以替代故障邏輯區塊的情況下識別故障冗餘區塊。此外，識別故障區塊可包括在記憶體組或非揮發性記憶體上寫入經識別故障區塊之識別資訊。 The processing procedure 2000 may continue to step 2004. In step 2004, the failed logical block can be identified, and the failed redundant block can also be identified. For example, a logic block that fails the test of step 2002 can be identified as a faulty block in step 2004. However, in other embodiments, only certain faulty logic blocks can be identified initially. For example, in some embodiments, only logical blocks associated with the business logic 1308 can be identified, and a failed redundant block can only be identified if a failed redundant block is needed to replace the failed logical block. In addition, identifying the faulty block may include writing identification information of the identified faulty block on the memory bank or non-volatile memory.

在步驟2006中，可使故障邏輯區塊去能。舉例而言，使用組態電路，可藉由將故障邏輯區塊與時脈、接地及/或電源節點斷開來使故障邏輯區塊去能。替代地，可藉由以避開邏輯區塊之配置組態連接箱來使故障邏輯區塊去能。又，在其他實施例中，可藉由自位址管理器1302接收不合法位址來使故障邏輯區塊去能。 In step 2006, the faulty logic blocks may be disabled. For example, using a configuration circuit, a faulty logic block may be disabled by disconnecting it from the clock, ground, and/or power nodes. Alternatively, a faulty logic block may be disabled by configuring the connection boxes in an arrangement that bypasses the logic block. In still other embodiments, a faulty logic block may be disabled by receiving an illegal address from the address manager 1302.

在步驟2008中，可識別複製故障邏輯區塊之冗餘區塊。即使一些邏輯區塊已發生故障，為了支援記憶體晶片的相同能力，在步驟2008中，可識別可用且可複製故障邏輯區塊之冗餘區塊。舉例而言，若執行向量之乘法的邏輯區塊經判定為發生故障，則在步驟2008中，位址管理器1302或晶載控制器可識別亦執行向量之乘法的可用冗餘邏輯區塊。 In step 2008, the redundant block that replicates the failed logical block can be identified. Even if some logical blocks have failed, in order to support the same capability of the memory chip, in step 2008, redundant blocks that are available and can replicate the failed logical block can be identified. For example, if the logic block that performs vector multiplication is determined to be faulty, in step 2008, the address manager 1302 or the on-chip controller can identify the available redundant logic block that also performs vector multiplication.

在步驟2010中，可對在步驟2008中所識別之冗餘區塊賦能。與步驟2006之去能操作相比，在步驟2010中，可藉由將經識別冗餘區塊連接至時脈、接地及/或電源節點來對該等經識別冗餘區塊賦能。替代地，可藉由以連接經識別冗餘區塊之配置組態連接箱來對經識別冗餘區塊賦能。又，在其他實施例中，可藉由在測試程序執行時間接收運行位址來對經識別冗餘區塊賦能。 In step 2010, the redundant blocks identified in step 2008 may be enabled. In contrast to the disabling operation of step 2006, in step 2010 the identified redundant blocks may be enabled by connecting them to the clock, ground, and/or power nodes. Alternatively, the identified redundant blocks may be enabled by configuring the connection boxes in an arrangement that connects them. In still other embodiments, the identified redundant blocks may be enabled by receiving a run address at the time the test procedure executes.
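
Steps 2002 through 2010 can be summarized in a small software model. The sketch below is an assumption-laden illustration rather than the disclosed circuitry: the `Block` record, its `passes_test` flag (standing in for the outcome of step 2002), and matching spares by a `function` string are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Block:
    name: str
    function: str              # e.g. "vector_multiply"
    redundant: bool = False
    passes_test: bool = True   # stand-in for the outcome of step 2002
    enabled: bool = False


def enable_redundant_blocks(blocks):
    """Return {faulty primary name: replacement name or None} and update
    the `enabled` flag of every block in place."""
    primaries = [b for b in blocks if not b.redundant]
    spares = [b for b in blocks if b.redundant]
    replacements = {}
    for block in primaries:
        if block.passes_test:
            block.enabled = True
            continue
        block.enabled = False                  # step 2006: disable the faulty block
        spare = next(
            (s for s in spares
             if s.passes_test and not s.enabled and s.function == block.function),
            None,
        )                                      # step 2008: find an equivalent spare
        if spare is not None:
            spare.enabled = True               # step 2010: enable the spare
        replacements[block.name] = spare.name if spare else None
    return replacements


blocks = [
    Block("mul0", "vector_multiply", passes_test=False),
    Block("add0", "vector_add"),
    Block("mul_spare", "vector_multiply", redundant=True),
    Block("add_spare", "vector_add", redundant=True),
]
result = enable_redundant_blocks(blocks)
```

Here `add_spare` stays disabled because its primary block works, which mirrors the untested and unenabled unit 1902(c) in FIG. 19.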

圖21為說明符合所揭示實施例之位址指派處理程序2100的例示性流程圖。可針對記憶體晶片1300且特別地針對DRAM記憶體晶片實施位址指派處理程序2100。如關於圖16所描述，在一些實施例中，記憶體晶片1300中之邏輯區塊可連接至資料匯流排且具有位址識別項。處理程序2100描述位址指派方法，該位址指派方法使故障邏輯區塊去能且對通過測試之邏輯區塊賦能。處理程序2100中所描述之步驟將描述為由生產測試器或在晶片起動時所執行之BIST執行；然而，記憶體晶片1300之其他組件及/或外部裝置亦可執行處理程序2100之一或多個步驟。 FIG. 21 is an exemplary flowchart illustrating an address assignment processing program 2100 in accordance with the disclosed embodiment. The address assignment process 2100 can be implemented for the memory chip 1300 and particularly for the DRAM memory chip. As described with respect to FIG. 16, in some embodiments, the logic blocks in the memory chip 1300 can be connected to the data bus and have address identification items. The processing program 2100 describes an address assignment method that disables the failed logic block and enables the logic block that passes the test. The steps described in the processing procedure 2100 will be described as being executed by the production tester or the BIST executed when the chip is started; however, other components of the memory chip 1300 and/or external devices may also execute one or more of the processing procedures 2100 Steps.

在步驟2102中，測試器可藉由在晶片層級將不合法識別項指派給每一邏輯區塊來使所有邏輯區塊及冗餘區塊去能。 In step 2102, the tester can disable all logical blocks and redundant blocks by assigning illegal identification items to each logical block at the chip level.

在步驟2104中，測試器可執行邏輯區塊之測試協定。舉例而言，測試器可針對記憶體晶片1300中之邏輯區塊中之一或多者運行步驟2002中所描述的測試方法。 In step 2104, the tester can execute the test protocol of the logic block. For example, the tester can run the test method described in step 2002 for one or more of the logic blocks in the memory chip 1300.

在步驟2106中，取決於步驟2104中之測試之結果，測試器可判定邏輯區塊是否有缺陷。若邏輯區塊無缺陷(步驟2106：否)，則位址管理器可在步驟2108中將運行ID指派給受測試邏輯區塊。若邏輯區塊有缺陷(步驟2106：是)，則位址管理器1302可在步驟2110中為有缺陷邏輯區塊保留不合法ID。 In step 2106, depending on the result of the test in step 2104, the tester can determine whether the logic block is defective. If the logical block is not defective (step 2106: No), the address manager can assign the running ID to the tested logical block in step 2108. If the logical block is defective (step 2106: Yes), the address manager 1302 can reserve an illegal ID for the defective logical block in step 2110.

在步驟2112中，位址管理器1302可選擇再製有缺陷邏輯區塊之冗餘邏輯區塊。在一些實施例中，再製有缺陷邏輯區塊之冗餘邏輯區塊可具有與有缺陷邏輯區塊相同的組件及連接。然而，在其他實施例中，冗餘邏輯區塊可具有不同於有缺陷邏輯區塊的組件及/或連接，但能夠執行等效操作。舉例而言，若有缺陷邏輯區塊經設計以執行向量之乘法，則選定冗餘邏輯區塊將能夠執行向量之乘法，即使選定冗餘邏輯區塊不具有與有缺陷單元相同的架構亦如此。 In step 2112, the address manager 1302 may select a redundant logic block that replicates the defective logic block. In some embodiments, the redundant logic block that replicates the defective logic block may have the same components and connections as the defective logic block. However, in other embodiments, the redundant logic block may have components and/or connections that differ from those of the defective logic block but may still be able to perform an equivalent operation. For example, if the defective logic block is designed to perform vector multiplication, the selected redundant logic block will be able to perform vector multiplication even if it does not have the same architecture as the defective unit.

在步驟2114中，位址管理器1302可測試冗餘區塊。舉例而言，測試器可將步驟2104中應用之測試技術應用於經識別冗餘區塊。 In step 2114, the address manager 1302 may test the redundant block. For example, the tester can apply the testing technique applied in step 2104 to the identified redundant blocks.

在步驟2116中，基於步驟2114中之測試之結果，測試器可判定冗餘區塊是否有缺陷。在步驟2118中，若冗餘區塊無缺陷(步驟2116：否)，則測試器可將運行ID指派給經識別冗餘區塊。在一些實施例中，處理程序2100可在步驟2118之後返回至步驟2104，以產生測試記憶體晶片中之所有邏輯區塊的反覆迴圈。 In step 2116, based on the result of the test in step 2114, the tester can determine whether the redundant block is defective. In step 2118, if the redundant block is not defective (step 2116: No), the tester may assign a running ID to the identified redundant block. In some embodiments, the processing procedure 2100 may return to step 2104 after step 2118 to generate an iterative loop for testing all logic blocks in the memory chip.

若測試器判定冗餘區塊有缺陷(步驟2116：是)，則在步驟2120中，測試器可判定額外冗餘區塊是否可用。舉例而言，測試器可向記憶體組查詢關於可用冗餘邏輯區塊之資訊。若冗餘邏輯區塊可用(步驟2120：是)，則測試器可返回至步驟2112且識別再製有缺陷邏輯區塊之新的冗餘邏輯區塊。若冗餘邏輯區塊不可用(步驟2120：否)，則在步驟2122中，測試器可產生錯誤信號。該錯誤信號可包括有缺陷邏輯區塊及有缺陷冗餘區塊之資訊。 If the tester determines that the redundant block is defective (step 2116: Yes), then in step 2120, the tester can determine whether the additional redundant block is available. For example, the tester can query the memory bank for information about available redundant logic blocks. If the redundant logic block is available (step 2120: Yes), the tester can return to step 2112 and identify a new redundant logic block that reproduces the defective logic block. If the redundant logic block is not available (step 2120: No), then in step 2122, the tester can generate an error signal. The error signal may include information on defective logical blocks and defective redundant blocks.
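
The control flow of steps 2102 through 2122 can be traced in a short sketch. It assumes, for simplicity, that every spare replicates every primary block, that run IDs are consecutive integers, and that an illegal ID is modeled as `None`; none of these details come from the disclosure, and `assign_run_ids` and `is_defective` are hypothetical names.

```python
ILLEGAL_ID = None  # hypothetical marker for an illegal (disabled) identifier


def assign_run_ids(primaries, spares, is_defective):
    """Map every block name to a run ID, or to ILLEGAL_ID if it stays disabled."""
    ids = {name: ILLEGAL_ID for name in list(primaries) + list(spares)}  # step 2102
    unused_spares = list(spares)
    next_id = 0
    for block in primaries:
        if not is_defective(block):            # steps 2104 and 2106
            ids[block] = next_id               # step 2108
            next_id += 1
            continue
        # Step 2110: the defective block keeps its illegal ID.
        while True:
            if not unused_spares:              # step 2120: no spare left
                raise RuntimeError(f"no working spare for {block}")  # step 2122
            spare = unused_spares.pop(0)       # step 2112
            if not is_defective(spare):        # steps 2114 and 2116
                ids[spare] = next_id           # step 2118
                next_id += 1
                break
    return ids


defective = {"b", "s1"}
ids = assign_run_ids(["a", "b", "c"], ["s1", "s2"], lambda name: name in defective)
```

In this run, block `b` and spare `s1` both fail, so `s2` receives the run ID and both failed blocks keep the illegal ID. The `RuntimeError` plays the role of the error signal of step 2122.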

耦接之記憶體組 Coupled memory bank

本發明所揭示之實施例亦包括分散式高效能處理器。該處理器可包括介接記憶體組及處理單元之記憶體控制器。該處理器可為可組態的以加快將資料遞送至處理單元以用於計算。舉例而言，若處理單元需要兩個資料例項以執行任務，則記憶體控制器可經組態以使得通信線獨立地提供對來自兩個資料例項之資訊的存取。所揭示之記憶體架構試圖最小化與複雜快取記憶體及複雜暫存器檔案方案相關聯之硬體要求。通常，處理器晶片包括允許核心直接與暫存器一起工作的快取記憶體階層。然而，快取記憶體操作需要相當大的晶粒面積且消耗額外功率。所揭示之記憶體架構藉由在記憶體中添加邏輯組件來避免使用快取記憶體階層。 The disclosed embodiments also include a distributed high-performance processor. The processor may include a memory controller that interfaces memory banks and processing units. The processor may be configurable to expedite the delivery of data to the processing units for computation. For example, if a processing unit requires two data instances to perform a task, the memory controller may be configured so that communication lines independently provide access to the information from the two data instances. The disclosed memory architecture seeks to minimize the hardware requirements associated with complex cache memory and complex register file schemes. Typically, processor chips include a cache hierarchy that allows the cores to work directly with registers. However, cache operations require considerable die area and consume additional power. The disclosed memory architecture avoids the use of a cache hierarchy by adding logic components in the memory.

所揭示架構亦實現資料在記憶體組中之策略性(或甚至最佳化)置放。即使記憶體組具有單個埠及高潛時，所揭示之記憶體架構亦可藉由將資料策略性地定位於記憶體組之不同區塊中來實現高效能且避免記憶體存取瓶頸。以將資料之連續串流提供至處理單元為目標，編譯最佳化步驟可針對特定或一般任務判定資料應如何儲存於記憶體組中。接著，介接處理單元及記憶體組之記憶體控制器可經組態以在特定處理單元需要資料以執行操作時向該等特定處理單元授權存取。 The disclosed architecture also enables strategic (or even optimized) placement of data in the memory banks. Even if the memory banks have a single port and high latency, the disclosed memory architecture can achieve high performance and avoid memory access bottlenecks by strategically positioning data in different blocks of the memory banks. With the goal of providing a continuous stream of data to the processing units, a compilation optimization step may determine how data should be stored in the memory banks for specific or general tasks. The memory controller that interfaces the processing units and the memory banks may then be configured to grant access to specific processing units when they need data to perform operations.
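
The effect of data placement on single-port, high-latency banks can be illustrated with a toy timing model. The model is an assumption made for this sketch only: each bank is taken to be unavailable for a fixed number of cycles after every read, and the processing unit issues at most one read per cycle. The function name `issue_cycles` and its parameters are hypothetical.

```python
def issue_cycles(bank_per_word, bank_busy_cycles):
    """Number of cycles needed to issue every read in order, at best one read
    per cycle, when a single-port bank cannot accept a new read until
    `bank_busy_cycles` cycles after its previous one."""
    free_at = {}   # bank -> first cycle at which it can accept another read
    cycle = 0
    for bank in bank_per_word:
        cycle = max(cycle, free_at.get(bank, 0))  # stall until the bank is free
        free_at[bank] = cycle + bank_busy_cycles
        cycle += 1
    return cycle


# Eight consecutive words stored in one bank versus interleaved over four banks.
same_bank = issue_cycles([0] * 8, bank_busy_cycles=4)
interleaved = issue_cycles([i % 4 for i in range(8)], bank_busy_cycles=4)
```

Under these assumptions, the single-bank layout needs 29 cycles to issue eight reads, while the interleaved layout needs 8, one read per cycle with no stalls. A placement step of the kind the text attributes to the compiler would aim for the second layout.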

記憶體晶片之組態可由處理單元(例如，組態管理者)或外部介面執行。該組態亦可由編譯器或其他SW工具寫入。此外，記憶體控制器之組態可基於記憶體組中之可用埠及記憶體組中之資料的組織。因此，所揭示架構可向處理單元提供來自不同記憶體區塊之恆定資料流或同時資訊。以此方式，記憶體內之運算任務可藉由避免潛時瓶頸或快取記憶體要求來快速地處理。 The configuration of the memory chip can be performed by a processing unit (for example, a configuration manager) or an external interface. The configuration can also be written by the compiler or other SW tools. In addition, the configuration of the memory controller can be based on the available ports in the memory bank and the organization of the data in the memory bank. Therefore, the disclosed architecture can provide a constant data stream or simultaneous information from different memory blocks to the processing unit. In this way, computing tasks in the memory can be quickly processed by avoiding latency bottlenecks or cache memory requirements.

此外，儲存於記憶體晶片中之資料可基於編譯最佳化步驟進行配置。編譯可允許建置處理常式，其中處理器將任務高效地指派給處理單元而無記憶體潛時相關聯之延遲。該編譯可由編譯器執行且被傳輸至連接至基板中之外部介面之主機。通常，某些存取圖案的高潛時及/或低的埠數目將導致需要資料之處理單元的資料瓶頸。然而，所揭示編譯可按使得處理單元能夠甚至在不利記憶體類型之情況下仍連續地接收資料的方式將資料定位於記憶體組中。 In addition, the data stored in the memory chip can be configured based on the compilation optimization step. Compilation may allow the creation of processing routines in which the processor efficiently assigns tasks to processing units without the delays associated with memory latency. The compilation can be executed by the compiler and transmitted to the host connected to the external interface in the substrate. Generally, the high latency and/or low port number of certain access patterns will cause data bottlenecks in the processing units that require data. However, the disclosed compilation can locate the data in the memory bank in a way that enables the processing unit to continuously receive the data even in the case of unfavorable memory types.

此外，在一些實施例中，組態管理器可基於任務所需之運算向所需處理單元發信。晶片中之不同處理單元或邏輯區塊可具有用於不同任務之專門硬體或架構。因此，取決於將執行之任務，可選擇處理單元或處理單元群組來執行任務。基板上之記憶體控制器可為可組態的以根據處理子單元之選擇來投送資料或授權存取，以改善資料傳送速率。舉例而言，基於編譯最佳化及記憶體架構，當需要處理單元以執行任務時，可授權該等處理單元對記憶體組之存取。 In addition, in some embodiments, a configuration manager may signal the required processing units based on the computations a task requires. Different processing units or logic blocks in the chip may have specialized hardware or architectures for different tasks. Therefore, depending on the task to be performed, a processing unit or a group of processing units may be selected to perform the task. The memory controller on the substrate may be configurable to route data or grant access according to the selection of processing subunits, in order to improve data transfer rates. For example, based on the compilation optimization and the memory architecture, processing units may be granted access to the memory banks when they are needed to perform a task.

此外，晶片架構可包括晶載組件，該等晶載組件藉由減少存取記憶體組中之資料所需的時間來便利資料之傳送。因此，本發明描述用於能夠使用簡單的記憶體例項執行特定或一般任務的高效能處理器的晶片架構連同編譯最佳化步驟。記憶體例項可具有高的隨機存取潛時及/或低的埠數目，諸如DRAM裝置或面向其他記憶體之技術中所使用之彼等記憶體例項，但所揭示架構可藉由實現自記憶體組至處理單元之連續(或幾乎連續)資料流來克服此等缺點。 In addition, the chip architecture may include on-chip components that facilitate the transfer of data by reducing the time required to access data in the memory banks. Therefore, the present invention describes a chip architecture, together with a compilation optimization step, for a high-performance processor capable of performing specific or general tasks using simple memory instances. The memory instances may have high random-access latency and/or a low number of ports, such as those used in DRAM devices or other memory-oriented technologies, but the disclosed architecture can overcome these shortcomings by enabling a continuous (or nearly continuous) flow of data from the memory banks to the processing units.

在本申請案中，同時通信可指時脈循環內之通信。替代地，同時通信可指在預定時間量內發送資訊。舉例而言，同時通信可指在幾奈秒內之通信。 In this application, simultaneous communication may refer to communication within a clock cycle. Alternatively, simultaneous communication may refer to sending information within a predetermined amount of time. For example, simultaneous communication may refer to communication within a few nanoseconds.

圖22提供符合所揭示實施例之例示性處理裝置的方塊圖。圖22A展示處理裝置2200之第一實施例，其中記憶體控制器2210使用多工器連接第一記憶體區塊2202及第二記憶體區塊2204。記憶體控制器2210亦可連接至少一組態管理器2212、一邏輯區塊2214及多個加速器2216(a)至2216(n)。圖22B展示處理裝置2200之第二實施例，其中記憶體控制器2210使用匯流排連接記憶體區塊2202及2204，該匯流排連接記憶體控制器2210與至少一組態管理器2212、一邏輯區塊2214及多個加速器2216(a)至2216(n)。此外，主機2230可在處理裝置2200外部且經由例如外部介面連接至處理裝置。 Figure 22 provides a block diagram of an exemplary processing device in accordance with the disclosed embodiments. FIG. 22A shows the first embodiment of the processing device 2200, in which the memory controller 2210 uses a multiplexer to connect the first memory block 2202 and the second memory block 2204. The memory controller 2210 can also be connected to at least one configuration manager 2212, a logic block 2214, and multiple accelerators 2216(a) to 2216(n). 22B shows a second embodiment of the processing device 2200, in which the memory controller 2210 uses a bus to connect the memory blocks 2202 and 2204, the bus connects the memory controller 2210 and at least one configuration manager 2212, a logic Block 2214 and multiple accelerators 2216(a) to 2216(n). In addition, the host 2230 may be external to the processing device 2200 and connected to the processing device via, for example, an external interface.

記憶體區塊2202及2204可包括DRAM墊或墊群組、DRAM組、MRAM\PRAM\RERAM\SRAM單元、快閃記憶體墊或其他記憶體技術。記憶體區塊2202及2204可替代地包括非揮發性記憶體、快閃記憶體裝置、電阻式隨機存取記憶體(ReRAM)裝置或磁阻式隨機存取記憶體(MRAM)裝置。 The memory blocks 2202 and 2204 may include DRAM mats or groups of mats, DRAM banks, MRAM\PRAM\RERAM\SRAM units, flash memory mats, or other memory technologies. The memory blocks 2202 and 2204 may alternatively include non-volatile memory, flash memory devices, resistive random access memory (ReRAM) devices, or magnetoresistive random access memory (MRAM) devices.

記憶體區塊2202及2204可另外包括複數個記憶體胞元，該等複數個記憶體胞元在複數條字線(未圖示)與複數條位元線(未圖示)之間按列及行配置。每一列記憶體胞元之閘極可連接至複數條字線中之各別者。每一行記憶體胞元可連接至複數條位元線中之各別者。 The memory blocks 2202 and 2204 may additionally include a plurality of memory cells arranged in rows and columns between a plurality of word lines (not shown) and a plurality of bit lines (not shown). The gates of each row of memory cells may be connected to a respective one of the plurality of word lines. Each column of memory cells may be connected to a respective one of the plurality of bit lines.

在其他實施例中，記憶體區域(包括記憶體區塊2202及2204)由簡單的記憶體例項建置。在本申請案中，術語「記憶體例項」可與術語「記憶體區塊」互換地使用。記憶體例項(或區塊)可具有不良特性。舉例而言，記憶體可為僅單埠記憶體且可具有高隨機存取潛時。替代地或另外，記憶體在行及排改變期間可能無法存取且面臨與例如電容充電及/或電路系統設置相關之資料存取問題。然而，藉由允許記憶體例項與處理單元之間的專用連接及以考量區塊之特性的某一方式來配置資料，圖22中所呈現之架構仍便利記憶體裝置中之並列處理。 In other embodiments, the memory area (including memory blocks 2202 and 2204) is built from simple memory instances. In this application, the term "memory instance" may be used interchangeably with the term "memory block". The memory instances (or blocks) may have undesirable characteristics. For example, the memory may be a single-port-only memory and may have high random-access latency. Alternatively or in addition, the memory may be inaccessible during column and line changes and may face data access problems related to, for example, capacitor charging and/or circuitry setup. Nevertheless, by allowing dedicated connections between memory instances and processing units and by arranging data in a way that takes the characteristics of the blocks into account, the architecture presented in FIG. 22 still facilitates parallel processing in the memory device.

在一些裝置架構中，記憶體例項可包括若干埠，以便利並列操作。然而，在此等實施例中，當資料基於晶片架構來編譯及組織時，晶片仍可達成改善效能。舉例而言，編譯器可藉由提供指令及組織資料置放來改善記憶體區域中之存取的效率，因此即使使用單埠記憶體，仍能夠容易存取記憶體區域。 In some device architectures, the memory instance can include several ports to facilitate parallel operation. However, in these embodiments, when the data is compiled and organized based on the chip architecture, the chip can still achieve improved performance. For example, the compiler can improve the efficiency of access in the memory area by providing instructions and organizing data placement. Therefore, even if a single-port memory is used, the memory area can still be easily accessed.

此外，記憶體區塊2202及2204可為單個晶片中之多種類型的記憶體。舉例而言，記憶體區塊2202及2204可為eFlash及eDRAM。又，記憶體區塊可包括具有ROM例項之DRAM。 In addition, the memory blocks 2202 and 2204 can be multiple types of memory in a single chip. For example, the memory blocks 2202 and 2204 can be eFlash and eDRAM. Also, the memory block may include DRAM with ROM instances.

記憶體控制器2210可包括用以處置記憶體存取及將結果傳回至模組之其餘部分的邏輯電路。舉例而言，記憶體控制器2210可包括位址管理器及諸如多工器之選擇裝置，以在記憶體區塊與處理單元之間投送資料或授權對記憶體區塊之存取。替代地，記憶體控制器2210可包括用以驅動DDR SDRAM之雙資料速率(DDR)記憶體控制器，其中資料係在系統之記憶體時脈的上升緣及下降緣傳送。 The memory controller 2210 may include logic circuits for handling memory accesses and returning the results to the rest of the module. For example, the memory controller 2210 may include an address manager and selection devices, such as multiplexers, to route data between the memory blocks and the processing units or to grant access to the memory blocks. Alternatively, the memory controller 2210 may include a double data rate (DDR) memory controller for driving DDR SDRAM, in which data is transferred on both the rising and falling edges of the system's memory clock.

此外，記憶體控制器2210可構成雙通道記憶體控制器。雙通道記憶體之併入可便利記憶體控制器2210對並列存取線之控制。該等並列存取線可經組態以具有相同長度，以在結合使用多條線時便利資料同步。替代地或另外，該等並列存取線可允許存取記憶體組之多個記憶體埠。 In addition, the memory controller 2210 can constitute a dual-channel memory controller. The incorporation of dual-channel memory can facilitate the memory controller 2210 to control the parallel access lines. The parallel access lines can be configured to have the same length to facilitate data synchronization when multiple lines are used in combination. Alternatively or in addition, the parallel access lines may allow access to multiple memory ports of the memory bank.

在一些實施例中，處理裝置2200可包括可連接至處理單元之一或多個多工器。該等處理單元可包括可直接連接至多工器之組態管理器2212、邏輯區塊2214及加速器2216。又，記憶體控制器2210可包括自複數個記憶體組或區塊2202及2204之至少一個資料輸入端，及連接至複數個處理單元中之每一者的至少一個資料輸出端。藉由此組態，記憶體控制器2210可經由兩個資料輸入端同時自記憶體組或記憶體區塊2202及2204接收資料，且經由兩個資料輸出端同時將經由接收之資料傳輸至至少一個選定處理單元。然而，在一些實施例中，至少一個資料輸入端及至少一個資料輸出端可實施於單個埠中，以僅允許讀取或寫入操作。在此等實施例中，單個埠可實施為包括資料線、位址線及命令線之資料匯流排。 In some embodiments, the processing device 2200 may include one or more multiplexers connectable to the processing units. The processing units may include the configuration manager 2212, the logic block 2214, and the accelerators 2216, which may be connected directly to the multiplexers. In addition, the memory controller 2210 may include at least one data input from the plurality of memory banks or blocks 2202 and 2204, and at least one data output connected to each of the plurality of processing units. With this configuration, the memory controller 2210 can simultaneously receive data from the memory banks or memory blocks 2202 and 2204 via two data inputs, and simultaneously transmit the received data to at least one selected processing unit via two data outputs. However, in some embodiments, the at least one data input and the at least one data output may be implemented in a single port that allows only read or write operations. In such embodiments, the single port may be implemented as a data bus including data lines, address lines, and command lines.

記憶體控制器2210可連接至複數個記憶體區塊2202及2204中之每一者，且亦可經由例如選擇開關連接至處理單元。基板上之處理單元(包括組態管理器2212、邏輯區塊2214及加速器2216)亦可獨立地連接至記憶體控制器2210。在一些實施例中，組態管理器2212可接收待執行之任務的提示，且作為回應，根據儲存於記憶體中或自外部供應之組態而組態記憶體控制器2210、加速器2216及/或邏輯區塊2214。替代地，記憶體控制器2210可由外部介面組態。該任務可能需要可用以自複數個處理單元選擇至少一個選定處理單元之至少一次運算。替代地或另外，該選擇可至少部分地基於選定處理單元執行至少一次運算之能力。作為回應，記憶體控制器2210可授權對記憶體組之存取，或使用專用匯流排及/或以管線式記憶體存取在至少一個選定處理單元與至少兩個記憶體組之間投送資料。 The memory controller 2210 may be connected to each of the plurality of memory blocks 2202 and 2204, and may also be connected to the processing units via, for example, selection switches. The processing units on the substrate (including the configuration manager 2212, the logic block 2214, and the accelerators 2216) may also be independently connected to the memory controller 2210. In some embodiments, the configuration manager 2212 may receive an indication of a task to be performed and, in response, configure the memory controller 2210, the accelerators 2216, and/or the logic block 2214 according to a configuration stored in memory or supplied externally. Alternatively, the memory controller 2210 may be configured by an external interface. The task may require at least one computation that can be used to select at least one selected processing unit from the plurality of processing units. Alternatively or in addition, the selection may be based at least in part on the ability of the selected processing unit to perform the at least one computation. In response, the memory controller 2210 may grant access to the memory banks, or route data between the at least one selected processing unit and at least two memory banks, using dedicated buses and/or pipelined memory access.
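
Selecting processing units by capability and then granting them access to memory blocks can be sketched in a few lines. The capability names (`mac`, `dma`, `compare`), the unit names, and the routing-table representation are hypothetical and are not part of the disclosure.

```python
def select_units(capabilities, required_ops):
    """Names of the processing units able to perform every required operation."""
    required = set(required_ops)
    return [name for name, ops in capabilities.items() if required <= set(ops)]


def grant_access(selected_units, memory_blocks):
    """Hypothetical routing table: each selected unit may access the listed blocks."""
    return {unit: tuple(memory_blocks) for unit in selected_units}


capabilities = {
    "accel_a": {"mac"},
    "accel_b": {"dma"},
    "accel_n": {"mac", "compare"},
}
selected = select_units(capabilities, ["mac"])
table = grant_access(selected, [2202, 2204])
```

In this sketch the task's required computation drives the selection, and the resulting table stands in for the configuration a memory controller would apply when routing data to the selected units.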

在一些實施例中，至少兩個記憶體區塊中之第一記憶體區塊2202可配置於複數個處理單元之第一側；且至少兩個記憶體組中之第二記憶體組2204可配置於該等複數個處理單元之與該第一側相對的第二側。另外，用以執行任務之選定處理單元(例如，加速器2216(n))可經組態以在至第一記憶體組或第一記憶體區塊2202之通信線開放的時脈循環期間存取第二記憶體組2204。替代地，該選定處理單元可經組態以在通信線開放至第一記憶體區塊2202的時脈循環期間將資料傳送至第二記憶體區塊2204。 In some embodiments, the first memory block 2202 of the at least two memory blocks may be arranged on a first side of the plurality of processing units, and the second memory bank 2204 of the at least two memory banks may be arranged on a second side of the plurality of processing units, opposite the first side. In addition, the processing unit selected to perform the task (e.g., accelerator 2216(n)) may be configured to access the second memory bank 2204 during a clock cycle in which a communication line to the first memory bank or first memory block 2202 is open. Alternatively, the selected processing unit may be configured to transfer data to the second memory block 2204 during a clock cycle in which a communication line is open to the first memory block 2202.

在一些實施例中，記憶體控制器2210可實施為獨立元件，如圖22中所展示。然而，在其他實施例中，記憶體控制器2210可嵌入於記憶體區域中或可沿著加速器2216(a)至2216(n)安置。 In some embodiments, the memory controller 2210 can be implemented as an independent component, as shown in FIG. 22. However, in other embodiments, the memory controller 2210 may be embedded in the memory area or may be placed along the accelerators 2216(a) to 2216(n).

處理裝置2200中之處理區域可包括組態管理器2212、邏輯區塊2214及加速器2216(a)至2216(n)。加速器2216可包括具有預定義功能之多個處理電路且可由特定應用程式定義。舉例而言，加速器可為處置模組之間的記憶體移動之向量乘法累加(MAC)單元或直接記憶體存取(DMA)單元。加速器2216亦可能夠計算其自身位址且向記憶體控制器2210請求資料或將資料寫入至記憶體控制器。舉例而言，組態管理器2212可向加速器2216中之至少一者發信該加速器可存取記憶體組。接著，加速器2216可組態記憶體控制器2210以投送資料或向加速器本身授權存取。此外，加速器2216可包括至少一個算術邏輯單元、至少一個向量處置邏輯單元、至少一個字串比較邏輯單元、至少一個暫存器及至少一個直接記憶體存取件。 The processing area in the processing device 2200 may include a configuration manager 2212, a logic block 2214, and accelerators 2216(a) to 2216(n). The accelerator 2216 may include multiple processing circuits with predefined functions and may be defined by a specific application. For example, the accelerator may be a vector multiplication accumulation (MAC) unit or a direct memory access (DMA) unit for memory movement between processing modules. The accelerator 2216 may also be able to calculate its own address and request data from the memory controller 2210 or write data to the memory controller. For example, the configuration manager 2212 may send a message to at least one of the accelerators 2216 that the accelerator has access to the memory bank. Then, the accelerator 2216 can configure the memory controller 2210 to deliver data or authorize access to the accelerator itself. In addition, the accelerator 2216 may include at least one arithmetic logic unit, at least one vector processing logic unit, at least one string comparison logic unit, at least one register, and at least one direct memory access device.

Configuration manager 2212 may include digital processing circuits for configuring accelerators 2216 and instructing the execution of tasks. For example, configuration manager 2212 may be connected to memory controller 2210 and to each of the plurality of accelerators 2216. Configuration manager 2212 may have its own dedicated memory to hold the configurations of accelerators 2216. Configuration manager 2212 may use the memory banks to fetch commands and configurations via memory controller 2210. Alternatively, configuration manager 2212 may be programmed through an external interface. In certain embodiments, configuration manager 2212 may be implemented with an on-chip reduced instruction set computer (RISC) or an on-chip complex CPU with its own cache hierarchy. In some embodiments, configuration manager 2212 may also be omitted, and the accelerators may be configured through an external interface.

Processing device 2200 may also include an external interface (not shown). The external interface allows access to the memory from an upper level, such as a memory bank controller that receives commands from external host 2230 or an on-chip main processor, or directly from external host 2230 or the on-chip main processor. The external interface may allow configuration manager 2212 and accelerators 2216 to be programmed by writing configurations or code to the memory via memory controller 2210, for later use by configuration manager 2212 or by units 2214 and 2216 themselves. The external interface may, however, also program the processing units directly, without routing through memory controller 2210. Where configuration manager 2212 is a microcontroller, configuration manager 2212 may allow code to be loaded from a main memory into the controller's local memory via the external interface. Memory controller 2210 may be configured to interrupt a task in response to receiving a request from the external interface.

The external interface may include multiple connectors associated with logic circuits, the connectors providing a glueless interface to a variety of elements on the processing device. The external interface may include: data I/O inputs for data reads and outputs for data writes; external address outputs; an external CE0 chip-select pin; an active-low chip selector; byte-enable pins; a pin for wait states on the memory cycle; a write-enable pin; an output-enable active pin; and a read/write-enable pin. The external interface therefore has the inputs and outputs required to control processes and to obtain information from the processing device. For example, the external interface may conform to a JEDEC DDR standard. Alternatively or additionally, the external interface may conform to other standards, such as SPI/OSPI or UART.

In some embodiments, the external interface may be disposed on the chip substrate and may be connected to external host 2230. The external host may access memory blocks 2202 and 2204, memory controller 2210, and the processing units via the external interface. Alternatively or additionally, external host 2230 may read from and write to the memory, or may signal configuration manager 2212, through read and write commands, to perform operations such as starting and/or stopping a process. In addition, external host 2230 may configure accelerators 2216 directly. In some embodiments, external host 2230 may perform read/write operations directly on memory blocks 2202 and 2204.

In some embodiments, configuration manager 2212 and accelerators 2216 may be configured to connect the device area with the memory area using direct buses, depending on the target task. For example, a subset of accelerators 2216 may be connected to memory instance 2204 when that subset is capable of performing the computations required to execute the task. By making such a separation, it is possible to ensure that dedicated accelerators obtain the bandwidth (BW) they need to memory blocks 2202 and 2204. Moreover, this configuration with dedicated buses may allow a large memory to be split into smaller instances or blocks, because connecting the memory instances to memory controller 2210 allows quick access to data in different memories even with high row latency. To parallelize the connections, memory controller 2210 may be connected to each of the memory instances with data buses, address buses, and/or control buses.

The inclusion of memory controller 2210 discussed above may eliminate the need for a cache hierarchy or a complex register file in the processing device. Although a cache hierarchy can be added for additional capability, the architecture of processing device 2200 may allow a designer to add enough memory blocks or instances for the processing operations and to manage those instances accordingly, without a cache hierarchy. For example, the architecture of processing device 2200 may eliminate the need for a cache hierarchy by implementing pipelined memory access. In pipelined memory access, the processing units may receive a sustained flow of data in every cycle, in which certain data lines may be opened (or activated) while other data lines receive or transmit data. A sustained flow of data over independent communication lines may improve execution speed and minimize the latency caused by line changes.

Moreover, because the architecture disclosed in FIG. 22 enables pipelined memory access, it may be possible to organize data in a small number of memory blocks and to save the power lost to line switching. For example, in some embodiments a compiler may communicate to host 2230 the organization of data in the memory banks, or a method for organizing data in the memory banks, to facilitate access to the data during a given task. Configuration manager 2212 may then define which memory banks, and in some cases which ports of the memory banks, may be accessed by the accelerators. This synchronization between the location of data in the memory banks and the data access method improves computing tasks by feeding data to the accelerators with minimal latency. For example, in embodiments in which configuration manager 2212 includes a RISC/CPU, the method may be implemented in offline software (SW), and configuration manager 2212 may then be programmed to execute the method. The method may be developed in any language executable by RISC/CPU computers and may be executed on any platform. The inputs to the method may include the configuration of the memories behind the memory controller and the data itself, together with the pattern of memory accesses. In addition, the method may be implemented in a language or machine language specific to the embodiment, and may also be simply a series of configuration values represented in binary or text.

As discussed above, in some embodiments a compiler may provide instructions to host 2230 for organizing data in memory blocks 2202 and 2204 in preparation for pipelined memory access. The pipelined memory access may generally include the steps of: receiving a plurality of addresses of a plurality of memory banks or memory blocks 2202 and 2204; accessing the plurality of memory banks according to the received addresses using independent data lines; supplying data from a first address to at least one of the plurality of processing units through a first communication line and opening a second communication line to a second address, the first address being in a first memory bank of the plurality of memory banks and the second address being in a second memory bank 2204 of the plurality of memory banks; and, within a second clock cycle, supplying data from the second address to the at least one of the plurality of processing units through the second communication line and opening a third communication line to a third address in the first memory bank in the first line. In some embodiments, the pipelined memory access may be performed with two memory blocks connected to a single port. In such embodiments, memory controller 2210 may hide the two memory blocks behind a single port but still transmit data to the processing units using the pipelined memory access method.
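The step sequence above can be illustrated with a small scheduling sketch. It is an illustrative model only, not the disclosed controller: the function name `pipelined_schedule` and the `(bank, line)` address tuples are assumptions introduced here, as is the rule that a bank keeps only one line open at a time. In each cycle the data at the current address is supplied while the line for the next address is opened in another bank.

```python
def pipelined_schedule(addresses):
    """Build a per-cycle plan for a list of (bank, line) addresses.

    Each entry is (cycle, supply_from, open_next): during `cycle` the data at
    `supply_from` is delivered while the line for `open_next` is opened.
    Hypothetical model: a bank can keep only one line open at a time.
    """
    schedule = []
    for cycle, current in enumerate(addresses, start=1):
        # The next address, if any, is opened while the current one is read.
        nxt = addresses[cycle] if cycle < len(addresses) else None
        if nxt is not None and nxt[0] == current[0] and nxt[1] != current[1]:
            # Same bank, different line: the open cannot overlap the read.
            raise ValueError(f"line conflict in bank {current[0]} at cycle {cycle}")
        schedule.append((cycle, current, nxt))
    return schedule
```

Calling `pipelined_schedule([(0, 0), (1, 0), (0, 1)])` alternates between two banks, so every line opening overlaps a data transfer. Two consecutive accesses to different lines of one bank are rejected, because in this model they could not be overlapped.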

In some embodiments, the compiler may run on host 2230 before the task is executed. In such embodiments, the compiler may be able to determine the configuration of the data flow from the architecture of the memory device, because that configuration will be known to the compiler.

In other embodiments, if the configuration of memory blocks 2204 and 2202 is unknown at offline time, the pipelined method may run on host 2230, which may arrange the data in the memory blocks before starting the computation. For example, host 2230 may write data directly into memory blocks 2204 and 2202. In such embodiments, processing units such as configuration manager 2212 and memory controller 2210 may have no information about the required hardware until run time. It may then be necessary to delay the selection of an accelerator 2216 until the task starts running. In these situations, the processing units or memory controller 2210 may select an accelerator 2216 at random and generate a test data access pattern, which may be modified as the task executes.

When the task is known in advance, however, the compiler may organize the data and instructions in the memory banks for host 2230 to provide to a processing unit such as configuration manager 2212, in order to set up signal connections that minimize access latency. For example, in some cases an accelerator 2216 may require n words at the same time, whereas each memory instance supports retrieving only m words at a time, where m and n are integers and m < n. The compiler may therefore place the required data across different memory instances or blocks to facilitate data access. Also, to avoid row-miss latency where processing device 2200 includes multiple memory instances, the host may split the data across different rows of different memory instances. Dividing the data in this way may allow the next row of data to be accessed in the next instance while data from the current instance is still in use.
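A minimal sketch of the two placement ideas in this paragraph follows. The function names and the `(instance, row)` mapping are assumptions made for illustration; the source does not prescribe a particular formula.

```python
import math


def instances_needed(n, m):
    """Minimum number of memory instances so that n words can be read at once
    when each instance supplies at most m words per access (m < n)."""
    return math.ceil(n / m)


def interleave_rows(num_rows, num_instances):
    """Map logical row i of the data to (instance, row-within-instance).

    Consecutive logical rows land in different instances, so the next row
    can be opened in the next instance while the current row is still in use.
    """
    return [(i % num_instances, i // num_instances) for i in range(num_rows)]
```

With n = 8 and m = 3, `instances_needed` returns 3. With two instances, `interleave_rows(5, 2)` alternates instances 0 and 1, so a row miss in one instance can be hidden behind an access to the other.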

For example, accelerator 2216(a) may be configured to multiply two vectors. Each of the vectors may be stored in an independent memory block, such as memory blocks 2202 and 2204, and each vector may include multiple words. To complete a task that requires a multiplication by accelerator 2216(a), it may therefore be necessary to access both memory blocks and retrieve multiple words. In some embodiments, however, a memory block allows only one word to be accessed per clock cycle; for example, the memory block may have a single port. In such cases, to speed up data transfer during the operation, the compiler may organize the words composing the vectors in different memory blocks, allowing parallel and/or simultaneous reads of the words. In these situations, the compiler may store the words in memory blocks that have dedicated lines. For example, if each vector includes two words and the memory controller has direct access to four memory blocks, the compiler may arrange the data in the four memory blocks, so that each memory block transmits one word and data delivery is accelerated. Moreover, in embodiments in which memory controller 2210 has more than a single connection to each memory block, the compiler may instruct configuration manager 2212 (or another processing unit) to access a specific port. In this way, processing device 2200 may perform pipelined memory access, continuously providing data to the processing units by loading words on some lines while transmitting data on others. This pipelined memory access may thus avoid latency issues.
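The two-word vector example can be worked through in a short sketch. It assumes single-port blocks that each deliver one word per cycle and are read in parallel; `place_words`, `read_cycles`, and `multiply_accumulate` are hypothetical helpers, not part of the disclosed device.

```python
def place_words(vec_a, vec_b, num_blocks):
    """Spread the words of both vectors over the blocks, one word at a time."""
    words = [("a", i, v) for i, v in enumerate(vec_a)]
    words += [("b", i, v) for i, v in enumerate(vec_b)]
    blocks = [[] for _ in range(num_blocks)]
    for k, word in enumerate(words):
        blocks[k % num_blocks].append(word)
    return blocks


def read_cycles(blocks):
    """Blocks are read in parallel; each single-port block yields one word per cycle."""
    return max(len(block) for block in blocks)


def multiply_accumulate(blocks):
    """Gather the words back from the blocks and compute sum(a[i] * b[i])."""
    a, b = {}, {}
    for block in blocks:
        for name, index, value in block:
            (a if name == "a" else b)[index] = value
    return sum(a[i] * b[i] for i in a)
```

Under these assumptions, two two-word vectors placed in four blocks are read in one cycle, in two blocks they take two cycles, and in a single block they take four. The result of the multiplication is the same in every case; only the delivery time changes.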

FIG. 23 is a block diagram of an exemplary processing device 2300, consistent with disclosed embodiments. The block diagram shows a simplified processing device 2300 with a single accelerator in the form of MAC unit 2302, a configuration manager 2304 (equivalent or similar to configuration manager 2212), a memory controller 2306 (equivalent or similar to memory controller 2210), and a plurality of memory blocks 2308(a) to 2308(d).

In some embodiments, MAC unit 2302 may be a specific accelerator for processing a particular task. As an example, processing device 2300 may be tasked with 2D convolution. Configuration manager 2304 may then signal an accelerator that has the appropriate hardware to perform the calculations associated with the task. For example, MAC unit 2302 may have four internal incrementing counters (logical adders and registers that manage the four loops required by the convolution calculation) and a multiply-accumulate unit. Configuration manager 2304 may signal MAC unit 2302 to process incoming data and execute the task, and may transmit an indication to MAC unit 2302 to do so. In these situations, MAC unit 2302 may iterate over the calculated addresses, multiply the numbers, and accumulate them into an internal register.
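The four loops mentioned above can be sketched in software. The mapping of the four counters to output row, output column, kernel row, and kernel column is an assumption, since the source states only that four loops are needed. The sketch computes a "valid" 2D convolution in correlation form (no kernel flip) on plain nested lists.

```python
def conv2d_mac(image, kernel):
    """2D convolution (valid region, no kernel flip) driven by four counters."""
    rows, cols = len(image), len(image[0])
    k_rows, k_cols = len(kernel), len(kernel[0])
    out = []
    for y in range(rows - k_rows + 1):        # counter 1: output row
        out_row = []
        for x in range(cols - k_cols + 1):    # counter 2: output column
            acc = 0                           # internal accumulation register
            for i in range(k_rows):           # counter 3: kernel row
                for j in range(k_cols):       # counter 4: kernel column
                    # One multiply-accumulate step per computed address pair.
                    acc += image[y + i][x + j] * kernel[i][j]
            out_row.append(acc)
        out.append(out_row)
    return out
```

Each innermost iteration corresponds to one multiply-accumulate over a pair of computed addresses, which is the operation the paragraph attributes to MAC unit 2302.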

In some embodiments, configuration manager 2304 may configure the accelerators while memory controller 2306 grants access to blocks 2308 and MAC unit 2302 using dedicated buses. In other embodiments, however, memory controller 2306 may configure the accelerators directly, based on instructions received from configuration manager 2304 or from an external interface. Alternatively or additionally, configuration manager 2304 may preload several configurations and allow an accelerator to run iteratively on different addresses with different sizes. In such embodiments, configuration manager 2304 may include a cache memory that stores a command before the command is transmitted to at least one of a plurality of processing units, such as accelerators 2216. In other embodiments, however, configuration manager 2304 may not include a cache.

In some embodiments, configuration manager 2304 or memory controller 2306 may receive addresses that need to be accessed for a task. Configuration manager 2304 or memory controller 2306 may check a register to determine whether an address is already in a line loaded to one of memory blocks 2308. If it is, memory controller 2306 may read the word from memory block 2308 and pass it to MAC unit 2302. If the address is not in a loaded line, configuration manager 2304 may request memory controller 2306 to load the line and may signal MAC unit 2302 to wait until the address is retrieved.
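The register check described above behaves like an open-line lookup. The following sketch assumes one tracked line per memory block and a fixed line size; the class `LineTracker` and its return values are hypothetical names used only for illustration.

```python
class LineTracker:
    """Tracks which line is currently loaded in each memory block (hypothetical)."""

    def __init__(self, num_blocks, line_size):
        self.line_size = line_size
        # One entry per block: the "register" consulted before every access.
        self.open_line = [None] * num_blocks

    def access(self, block, address):
        """Return 'hit' if the address lies in the loaded line; otherwise load
        the line and return 'stall' so the accelerator can be told to wait."""
        line = address // self.line_size
        if self.open_line[block] == line:
            return "hit"
        self.open_line[block] = line
        return "stall"
```

A first access to a block always stalls in this model, later accesses to the same line hit, and an access to another line of the same block stalls again while that line is loaded.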

In some embodiments, as shown in FIG. 23, memory controller 2306 may include two inputs that take two independent addresses. If more than two addresses must be accessed simultaneously and those addresses reside in a single memory block (for example, only in memory block 2308(a)), memory controller 2306 or configuration manager 2304 may raise an exception. Alternatively, configuration manager 2304 may return an invalid-data signal when the two addresses can be accessed only through a single line. In other embodiments, the unit may delay execution of the process until all required data can be retrieved. This may reduce overall performance. The compiler, however, may be able to find a configuration and data placement that prevent such delays.
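As a rough illustration of the first of these behaviours (raising an exception), the check below counts the simultaneous requests per block and compares the count with the number of ports. It is a simplification: `PortConflict`, `check_simultaneous`, and the `ports_per_block` parameter are assumed names, and the invalid-data and stall alternatives are not modelled.

```python
from collections import Counter


class PortConflict(Exception):
    """Raised when more words are requested from a block than its ports allow."""


def check_simultaneous(requests, ports_per_block=1):
    """requests: list of (block, address) pairs needed in the same cycle.

    Returns the requests unchanged when every block can serve its share;
    raises PortConflict otherwise.
    """
    per_block = Counter(block for block, _ in requests)
    for block, count in per_block.items():
        if count > ports_per_block:
            raise PortConflict(
                f"block {block}: {count} requests, {ports_per_block} port(s)"
            )
    return requests
```

A compiler-side placement pass could run such a check on every cycle of a planned access pattern and rearrange the data whenever it fails.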

In some embodiments, the compiler may generate a configuration or instruction set for processing device 2300 that configures configuration manager 2304, memory controller 2306, and accelerator 2302 to handle situations in which multiple addresses must be accessed in a single memory block that has only one port. For example, the compiler may rearrange the data in memory blocks 2308 so that the processing units can access multiple rows in memory blocks 2308.

In addition, memory controller 2306 may work on more than one input at the same time. For example, memory controller 2306 may allow one of memory blocks 2308 to be accessed through one port and supply the data while receiving, on another input, a request for a different memory block. This operation may therefore result in accelerator 2216, tasked with the exemplary 2D convolution, receiving data over the dedicated communication lines of the relevant memory blocks.

Additionally or alternatively, memory controller 2306 or a logic block may keep a refresh counter for each memory block 2308 and handle the refresh of all rows. Having such a counter allows memory controller 2306 to insert refresh cycles between the idle access times of the device.
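One way to picture a per-block refresh counter is the sketch below. The policy of refreshing one row of every idle block per cycle is an assumption chosen for simplicity, and `RefreshScheduler` is a hypothetical name; real refresh timing depends on the memory technology.

```python
class RefreshScheduler:
    """Keeps one refresh counter per memory block (illustrative model)."""

    def __init__(self, num_blocks, rows_per_block):
        self.rows_per_block = rows_per_block
        self.next_row = [0] * num_blocks  # next row to refresh in each block

    def tick(self, busy_blocks):
        """Advance one cycle. Blocks not in `busy_blocks` are idle, so one row
        of each is refreshed. Returns the (block, row) pairs refreshed."""
        refreshed = []
        for block, row in enumerate(self.next_row):
            if block in busy_blocks:
                continue  # an access is in flight; do not steal the cycle
            refreshed.append((block, row))
            self.next_row[block] = (row + 1) % self.rows_per_block
        return refreshed
```

Because the counter advances only when a block is idle, refresh work is slotted into gaps in the access pattern instead of interrupting transfers. A production controller would also need to guarantee that every row is refreshed within its retention period, which this sketch does not enforce.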

In addition, memory controller 2306 may be configurable to perform pipelined memory access, receiving addresses and opening lines in the memory blocks before supplying the data. Pipelined memory access may provide data to the processing units without interrupting or delaying clock cycles. For example, while memory controller 2306 or one of the logic blocks accesses data over the right line in FIG. 23, it may be transmitting data over the left line. These methods are explained in more detail with respect to FIG. 26.

In response to the data required, processing device 2300 may use multiplexers and/or other switching devices to choose which devices are serviced to perform a given task. For example, configuration manager 2304 may configure the multiplexers so that at least two data lines reach MAC unit 2302. In this way, a task that requires data from multiple addresses, such as 2D convolution, may be performed faster, because the vectors or words to be multiplied during the convolution can reach the processing unit simultaneously, in a single clock. This data transfer method may allow processing units such as accelerators 2216 to output results quickly.

In some embodiments, configuration manager 2304 may be configurable to execute processes according to task priority. For example, configuration manager 2304 may be configured to let a running process finish without any interruption. In that case, configuration manager 2304 may provide the instructions or configuration of a task to accelerators 2216, let them run uninterrupted, and switch the multiplexers only when the task is finished. In other embodiments, however, configuration manager 2304 may interrupt a task and reconfigure the data routing when it receives a priority task, such as a request from an external interface. Where there are enough memory blocks 2308, memory controller 2306 may nevertheless be configurable to route data, or grant access, to the processing units over dedicated lines that need not be changed until the task is completed. Moreover, in some embodiments, all devices may be connected by a bus to the configuration manager 2304 entity, and the devices may manage access between themselves and the bus (for example, using the same logic as a multiplexer). Memory controller 2306 may therefore be connected directly to several memory instances or memory blocks.

Alternatively, memory controller 2306 may be connected directly to memory sub-instances. In some embodiments, each memory instance or block may be built from sub-instances (for example, a DRAM may be built from mats with independent data lines, arranged in multiple sub-blocks). Further, the instances may include at least one of DRAM mats, DRAM, banks, flash memory mats, SRAM mats, or any other type of memory. Memory controller 2306 may then include dedicated lines to address the sub-instances directly, minimizing latency during pipelined memory access.

In some embodiments, memory controller 2306 may also hold the logic required by a specific memory instance (such as row/column decoders, refresh logic, and so on), and memory blocks 2308 may handle their own logic. Memory blocks 2308 may therefore receive an address and generate the commands for returning or writing data.

FIG. 24 depicts an exemplary memory configuration diagram, consistent with disclosed embodiments. In some embodiments, a compiler that generates code or a configuration for processing device 2200 may perform a method of configuring loading from memory blocks 2202 and 2204 by pre-arranging the data in each block. For example, the compiler may pre-arrange the data so that each word required by the task is associated with a row of a memory instance or memory block. For tasks that require more memory blocks than are available in processing device 2200, however, the compiler may implement methods of fitting the data into more than one memory location of each memory block. The compiler may also store the data in sequence and evaluate the latency of each memory block to avoid row-miss latency. In some embodiments, the host may be part of a processing unit, such as configuration manager 2212, but in other embodiments the compiler host may be connected to processing device 2200 via an external interface. In such embodiments, the host may run compilation functions such as those described for the compiler.

In some embodiments, configuration manager 2212 may be a CPU or a microcontroller (uC). In such embodiments, configuration manager 2212 may have to access the memory to fetch the commands or instructions placed there. A specific compiler may generate the code and place it in memory in a manner that allows consecutive commands to be stored in the same memory row and across several memory banks, so that pipelined memory access can also be applied to the fetched commands. In such embodiments, configuration manager 2212 and memory controller 2210 may be able to avoid row latency in linear execution by facilitating pipelined memory access.

The preceding case of linear program execution describes a method by which the compiler recognizes and places instructions to allow pipelined memory execution. Other software structures may be more complex, however, and would require the compiler to recognize them and act accordingly. For example, where a task requires loops and branches, the compiler may place all of the loop code within a single line, so that the loop can iterate over that line without incurring line-opening latency. Memory controller 2210 may then not need to change lines during execution.
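A compiler-side placement rule of this kind could look like the following sketch, which pads the code address so that a loop body never straddles a line boundary. The function name `place_loop` and the flat address model are assumptions made for illustration only.

```python
def place_loop(cursor, loop_len, line_size):
    """Return the start address for a loop body of `loop_len` words.

    `cursor` is the next free code address. If the loop would cross a line
    boundary, it is moved to the start of the next line so that the whole
    loop can iterate within a single open line.
    """
    if loop_len > line_size:
        raise ValueError("loop body does not fit in one line")
    offset = cursor % line_size
    if offset + loop_len <= line_size:
        return cursor  # already fits in the current line
    return cursor + (line_size - offset)  # skip to the next line boundary
```

With 16-word lines, a 4-word loop at address 10 stays where it is, while the same loop at address 14 is moved to address 16. The skipped words are the cost of avoiding a line change on every iteration.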

In some embodiments, configuration manager 2212 may include an internal cache or a small memory. The internal cache may store commands that configuration manager 2212 executes to handle branches and loops. For example, the commands in the internal cache may include instructions for configuring the accelerators that access the memory blocks.

FIG. 25 is an exemplary flowchart illustrating a possible memory configuration process 2500, consistent with disclosed embodiments. Where convenient in describing memory configuration process 2500, reference may be made to the identifiers of the elements depicted in FIG. 22 and described above. In some embodiments, process 2500 may be executed by a compiler that provides instructions to a host connected through an external interface. In other embodiments, process 2500 may be executed by components of processing device 2200, such as configuration manager 2212.

In general, process 2500 may include: determining the number of words required simultaneously to perform a task; determining the number of words that can be accessed simultaneously from each of a plurality of memory banks; and, when the number of words required simultaneously is greater than the number of words that can be accessed simultaneously, dividing the words required simultaneously among multiple memory banks. Dividing the words required simultaneously may further include organizing the words cyclically and assigning them sequentially, one word per memory bank.
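The decision and the cyclic division described above can be sketched as follows. The function name `divide_words`, the dictionary return type, and the capacity check are assumptions introduced here; the source describes the division only in general terms.

```python
def divide_words(words, per_bank_limit, num_banks):
    """Divide the words needed at the same time among memory banks.

    If one bank can supply all of them at once, they stay together in bank 0.
    Otherwise they are dealt out cyclically, one word per bank in turn.
    """
    if len(words) <= per_bank_limit:
        return {0: list(words)}
    if len(words) > per_bank_limit * num_banks:
        raise ValueError("not enough banks to supply all words at once")
    banks = {bank: [] for bank in range(num_banks)}
    for k, word in enumerate(words):
        banks[k % num_banks].append(word)  # cyclic, one word per bank
    return banks
```

With a limit of one word per bank and four banks, four words end up in four different banks and can be fetched together. With a limit of two words and three banks, five words are dealt out as two, two, and one.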

More specifically, process 2500 may begin with step 2502, in which the compiler may receive a task specification. The specification includes the required computations and/or a priority level.

In step 2504, the compiler may identify an accelerator, or a group of accelerators, that can perform the task. Alternatively, the compiler may generate instructions so that a processing unit, such as configuration manager 2212, can identify an accelerator to perform the task. For example, using the required computation, configuration manager 2212 may identify accelerators in the group of accelerators 2216 that can process the task.

In step 2506, the compiler may determine the number of words that must be accessed simultaneously to execute the task. For example, multiplying two vectors requires access to at least two vectors, so the compiler may determine that the vector words must be accessed simultaneously to perform the operation.

In step 2508, the compiler may determine the number of cycles necessary to execute the task. For example, if the task requires a convolution operation over four by-products, the compiler may determine that at least four cycles will be necessary to perform the task.

In step 2510, the compiler may place words that need to be accessed simultaneously in different memory banks. In that way, memory controller 2210 may be configured to open lines to different memory instances and access the required memory blocks within a clock cycle, without any need for cached data.

In step 2512, the compiler may place words that are accessed sequentially in the same memory bank. For example, where four cycles of operation are required, the compiler may generate instructions to write the words needed in sequential cycles into a single memory block, to avoid changing lines between different memory blocks during execution.
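Steps 2510 and 2512 can be read together as a placement rule: operands needed in the same cycle go to different banks, while the words one operand needs over successive cycles stay in one bank. The sketch below illustrates that rule under simplifying assumptions that are not in the disclosure: each operand is modeled as a "stream" of one word per cycle, bank `i` is reserved for stream `i`, and addresses within a bank are consecutive.

```python
def plan_layout(num_streams, num_cycles):
    """Return {(stream, cycle): (bank, address)} for a task.

    Hypothetical model: stream i is one operand (e.g. one input vector),
    and the task consumes one word of every stream in each cycle.
    """
    layout = {}
    for stream in range(num_streams):
        for cycle in range(num_cycles):
            # Same cycle, different streams -> different banks (step 2510).
            # Same stream, successive cycles -> same bank (step 2512).
            layout[(stream, cycle)] = (stream, cycle)
    return layout


def banks_used_in_cycle(layout, cycle):
    """Banks that must be read during the given cycle."""
    return sorted(bank for (_, c), (bank, _) in layout.items() if c == cycle)


layout = plan_layout(num_streams=2, num_cycles=4)
```

With two streams and four cycles, every cycle reads banks 0 and 1 in parallel, and neither bank has to hand its line over to the other during the task.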

In step 2514, the compiler generates instructions for programming a processing unit, such as configuration manager 2212. The instructions may specify conditions for operating a switching device (such as a multiplexer) or for configuring a data bus. With these instructions, configuration manager 2212 may configure memory controller 2210, according to the task, to route data from memory blocks to processing units over dedicated communication lines or to grant access to those memory blocks.

FIG. 26 is an exemplary flowchart illustrating a memory read process 2600, consistent with the disclosed embodiments. Where convenient in describing memory read process 2600, reference may be made to the identifiers of elements depicted in FIG. 22 and described above. In some embodiments, as described below, process 2600 may be implemented by memory controller 2210. In other embodiments, however, process 2600 may be implemented by other elements of processing device 2200, such as configuration manager 2212.

In step 2602, memory controller 2210, configuration manager 2212, or another processing unit may receive an indication to route data from a memory bank or to grant access to a memory bank. The request may specify an address and a memory block.

In some embodiments, the request may be received via a data bus that specifies a read command on line 2218 and an address on line 2220. In other embodiments, the request may be received via a demultiplexer connected to memory controller 2210.

In step 2604, configuration manager 2212, a host, or another processing unit may query an internal register. The internal register may include information about open lines to memory banks, open addresses, open memory blocks, and/or upcoming tasks. Based on the information in the internal register, it may be determined whether a line is open to the memory bank and/or memory block for which the request was received in step 2602. Alternatively or additionally, memory controller 2210 may query the internal register directly.

If the internal register indicates that the memory bank is not loaded in an open line (step 2606: No), process 2600 may continue to step 2616, and a line may be loaded to the memory bank associated with the received address. In addition, memory controller 2210 or a processing unit such as configuration manager 2212 may, in step 2616, signal a delay to the element requesting information from the memory address. For example, if accelerator 2216 is requesting memory information located in an already occupied memory block, memory controller 2210 may send a delay signal to the accelerator in step 2618. In step 2620, configuration manager 2212 or memory controller 2210 may update the internal register to indicate that a line has been opened to a new memory bank or a new memory block.

If the internal register indicates that the memory bank is loaded in an open line (step 2606: Yes), process 2600 may continue to step 2608. In step 2608, it may be determined whether the line loaded with the memory bank is being used for a different address. If the line is being used for a different address (step 2608: Yes), this indicates that there are two instances in a single block, and the two instances therefore cannot be accessed simultaneously. Accordingly, an error or exemption signal may be sent, in step 2616, to the element requesting information from the memory address. If the line is not being used for a different address (step 2608: No), a line may be opened for the address, the data may be retrieved from the target memory bank, and the process may continue to step 2614 to transmit the data to the element requesting information from the memory address.
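The branch structure of process 2600 can be summarized in a short sketch. This is an illustrative model only: the `LineRegister` class, the `handle_read` function, the status strings, and the dictionary-based memory are all hypothetical and are not defined by the disclosure.

```python
from dataclasses import dataclass, field


@dataclass
class LineRegister:
    """Hypothetical internal register: bank -> address of its open line."""
    open_lines: dict = field(default_factory=dict)


def handle_read(register, bank, address, memory):
    """Return (status, data) for one read request, following FIG. 26."""
    if bank not in register.open_lines:
        # Step 2606: No -> load the line, signal a delay, and record the
        # newly opened line in the register (step 2620).
        register.open_lines[bank] = address
        return ("delay", None)
    if register.open_lines[bank] != address:
        # Step 2608: Yes -> the open line serves a different address, so
        # the two instances cannot be accessed at the same time.
        return ("error", None)
    # Step 2608: No -> retrieve the data and transmit it (step 2614).
    return ("ok", memory[bank][address])


memory = {0: {5: "word-A"}}
register = LineRegister()
first = handle_read(register, 0, 5, memory)   # line not open yet
second = handle_read(register, 0, 5, memory)  # line now open for address 5
third = handle_read(register, 0, 6, memory)   # conflicts with the open line
```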

With process 2600, processing device 2200 can establish direct connections between processing units and the memory blocks or memory instances that contain the information required to perform a task. This organization of data enables reading information from organized vectors in different memory instances, and allows information to be retrieved simultaneously from different memory blocks when a device requests a plurality of such addresses.

FIG. 27 is an exemplary flowchart illustrating an execution process 2700, consistent with the disclosed embodiments. Where convenient in describing execution process 2700, reference may be made to the identifiers of elements depicted in FIG. 22 and described above.

In step 2702, a compiler or a local unit such as configuration manager 2212 may receive an indication of a task that needs to be performed. The task may include a single operation (e.g., multiplication) or a more complex operation (e.g., convolution between matrices). The task may also indicate the required computation.

In step 2704, the compiler or configuration manager 2212 may determine the number of words required simultaneously to perform the task. For example, the compiler may determine that two words are required simultaneously to perform a multiplication between vectors. In another example, a 2D convolution task, configuration manager 2212 may determine that a convolution between matrices requires "n" times "m" words, where "n" and "m" are the matrix dimensions. In step 2704, configuration manager 2212 may also determine the number of cycles necessary to perform the task.

In step 2706, depending on the determinations in step 2704, the compiler may write the words that need to be accessed simultaneously to a plurality of memory banks disposed on the substrate. For example, when the number of words that can be accessed simultaneously from one of the plurality of memory banks is lower than the number of words required simultaneously, the compiler may organize the data across multiple memory banks to facilitate access to the different required words within a clock cycle. Moreover, when configuration manager 2212 or the compiler determines the number of cycles necessary to perform the task, the compiler may write the words needed in sequential cycles to a single memory bank of the plurality of memory banks, to prevent switching of lines between memory banks.

In step 2708, memory controller 2210 may be configured to read, or grant access to, at least one first word from a first memory bank of the plurality of memory banks or blocks, using a first memory line.

In step 2710, a processing unit (for example, one of accelerators 2216) may process the task using the at least one first word.

In step 2712, memory controller 2210 may be configured to open a second memory line in a second memory bank. For example, based on the task and using a pipelined memory access approach, memory controller 2210 may be configured to open a second memory line in the second memory block, to which the information required for the task was written in step 2706. In some embodiments, the second memory line may be opened when the task in step 2710 is about to be completed. For example, if the task requires 100 clocks, the second memory line may be opened in the 90th clock.

In some embodiments, steps 2708 to 2712 may be executed within one line access cycle.

In step 2714, memory controller 2210 may be configured to grant access to data of at least one second word from the second memory bank, using the second memory line opened in step 2712.

In step 2716, a processing unit (for example, one of accelerators 2216) may process the task using the at least one second word.

In step 2718, memory controller 2210 may be configured to open a second memory line in the first memory bank. For example, based on the task and using a pipelined memory access approach, memory controller 2210 may be configured to open a second memory line to the first memory block. In some embodiments, the second memory line to the first block may be opened when the task in step 2716 is about to be completed.

In some embodiments, steps 2714 to 2718 may be executed within one line access cycle.

In step 2720, memory controller 2210 may read, or grant access to, at least one third word from the first memory bank of the plurality of memory banks or blocks, using the second memory line in the first bank or a first line in a third bank, and may continue in the same manner through different memory banks.
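The pipelined access pattern of process 2700 overlaps the opening of the next line with the tail of the current task. The sketch below illustrates the timing only. The 10-clock line-open latency is an assumption chosen so that a 100-clock task opens the next line at clock 90, matching the example above; the function names and the strict alternation between two banks are likewise hypothetical.

```python
def open_time(task_start, task_clocks, line_open_latency):
    """Clock at which to start opening the next line so that it is ready
    when the current task finishes (never earlier than the task start)."""
    return task_start + max(task_clocks - line_open_latency, 0)


def schedule(task_clocks_list, line_open_latency):
    """Return a list of (clock, event) pairs, alternating between two banks."""
    events, clock = [], 0
    for i, clocks in enumerate(task_clocks_list):
        events.append((clock, f"process word from bank {i % 2}"))
        if i + 1 < len(task_clocks_list):
            # Open the other bank's line shortly before this task completes.
            when = open_time(clock, clocks, line_open_latency)
            events.append((when, f"open line in bank {(i + 1) % 2}"))
        clock += clocks
    return events


timeline = schedule([100, 100], line_open_latency=10)
```

The resulting timeline processes from bank 0 at clock 0, opens the line in bank 1 at clock 90, and starts processing from bank 1 at clock 100 without waiting for the line to open.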

Partial Refresh

Some memory chips, such as dynamic random access memory (DRAM) chips, use refresh to keep stored data (e.g., data stored as capacitance) from being lost through voltage decay in the capacitors or other electrical components of the chip. For example, in DRAM, each cell must be refreshed from time to time (depending on the specific process and design) to restore the charge in its capacitor, so that data is not lost or corrupted. As the memory capacity of DRAM chips increases, the amount of time required to refresh the memory becomes significant. During the period in which a given line of memory is being refreshed, the bank containing that line cannot be accessed, which can reduce performance. The power associated with the refresh process can also be significant. Earlier efforts have tried to reduce the rate at which refresh is performed in order to limit these adverse effects, but most of those efforts have focused on the physical layer of the DRAM.

Refresh is similar to reading a row of memory and writing it back. Using this principle, and focusing on the patterns in which memory is accessed, embodiments of the present disclosure include software and hardware techniques, as well as modifications to memory chips, that use less power for refresh and reduce the amount of time during which memory is being refreshed. For example, as an overview, some embodiments may use hardware and/or software to track line access timing and skip recently accessed rows within a refresh cycle (e.g., based on a timing threshold). In another example, some embodiments may rely on software executed by a refresh controller of the memory chip to assign reads and writes such that access to the memory is non-random. The software can therefore control refresh more precisely and avoid wasted refresh cycles and/or lines. These techniques may be used alone, or in combination with a compiler that encodes commands for the refresh controller together with machine code for the processor, so that access to the memory is again non-random. Using any combination of the techniques and configurations described in detail below, the disclosed embodiments may lower memory refresh power requirements and/or improve system performance by reducing the amount of time during which a memory unit is being refreshed.

FIG. 28 depicts an example memory chip 2800 with a refresh controller 2803, consistent with the present disclosure. For example, memory chip 2800 may include a plurality of memory banks (e.g., memory bank 2801a and the like) on a substrate. In the example of FIG. 28, the substrate includes four memory banks, each with four lines. A line may refer to a wordline within one or more memory banks of memory chip 2800, or to any other collection of memory cells within memory chip 2800, such as a portion of a memory bank, an entire row along a memory bank, or a group of memory banks.

In other embodiments, the substrate may include any number of memory banks, and each memory bank may include any number of lines. Some memory banks may include the same number of lines (as shown in FIG. 28), while others may include different numbers of lines. As further depicted in FIG. 28, memory chip 2800 may include a controller 2805 that receives input to memory chip 2800 and transmits output from memory chip 2800 (e.g., as described above in "Division of Code").

In some embodiments, the plurality of memory banks may comprise dynamic random access memory (DRAM). However, the plurality of memory banks may comprise any volatile memory that stores data requiring periodic refresh.

As discussed in more detail below, the disclosed embodiments may use counters or resistor-capacitor circuits to time refresh cycles. For example, a counter or timer may count the time since the last full refresh cycle, and when that counter reaches its target value, another counter may be used to iterate over all rows. Embodiments of the present disclosure may additionally track accesses to sections of memory chip 2800 and reduce the refresh power required. For example, although not depicted in FIG. 28, memory chip 2800 may further include a data storage configured to store access information indicative of access operations to one or more sections of the plurality of memory banks. The one or more sections may comprise any portion of the rows, columns, or any other grouping of memory cells within memory chip 2800. In one particular example, the one or more sections may include at least one row of memory structures within the plurality of memory banks. Refresh controller 2803 may be configured to perform a refresh operation on the one or more sections based, at least in part, on the stored access information.

For example, the data storage may comprise one or more registers, static random access memory (SRAM) cells, or the like, associated with sections of memory chip 2800 (e.g., rows, columns, or any other grouping of memory cells within memory chip 2800). The data storage may be configured to store bits indicating whether the associated section was accessed in one or more previous cycles. A "bit" may comprise any data structure that stores at least one bit, such as a register, an SRAM cell, a non-volatile memory, or the like. A bit may be set by turning on a corresponding switch (or switching element, such as a transistor) of the data structure, which may be equivalent to "1" or "true". Additionally or alternatively, a bit may be set by modifying any other property of the data structure (such as charging a floating gate of a flash memory, modifying the state of one or more flip-flops in an SRAM, or the like) in order to write a "1" (or any other value indicating that the bit is set) to the data structure. If a bit is determined to be set as part of a refresh operation of the memory controller, refresh controller 2803 may skip the refresh cycle for the associated section and clear the register associated with that section.
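As an illustration of the access-bit scheme, the following minimal sketch keeps one bit per row, skips rows whose bit is set during a refresh pass, and clears those bits for the next cycle. The class and method names are hypothetical, and a Python list stands in for the registers or SRAM cells of the data storage.

```python
class AccessBitRefresher:
    """Hypothetical model of a refresh controller with one access bit per row."""

    def __init__(self, num_rows):
        self.accessed = [False] * num_rows  # stands in for the data storage

    def on_access(self, row):
        # A read or write also restores the row's charge, so record it.
        self.accessed[row] = True

    def refresh_cycle(self):
        """Run one refresh pass and return the rows actually refreshed."""
        refreshed = []
        for row, was_accessed in enumerate(self.accessed):
            if was_accessed:
                # Skip the refresh for this row and clear its bit.
                self.accessed[row] = False
            else:
                refreshed.append(row)
        return refreshed


controller = AccessBitRefresher(num_rows=4)
controller.on_access(1)
controller.on_access(3)
first_pass = controller.refresh_cycle()   # rows 1 and 3 are skipped
second_pass = controller.refresh_cycle()  # no accesses since: all rows
```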

In another example, the data storage may comprise one or more non-volatile memories (e.g., flash memory or the like) associated with sections of memory chip 2800 (e.g., rows, columns, or any other grouping of memory cells within memory chip 2800). The non-volatile memory may be configured to store bits indicating whether the associated section was accessed in one or more previous cycles.

Some embodiments may additionally or alternatively add, for each row or group of rows (or other section of memory chip 2800), a timestamp register that holds the last time the line was accessed within the current refresh cycle. This means that, with each row access, the refresh controller may update the row's timestamp register. When the next refresh is due (e.g., at the end of a refresh cycle), the refresh controller may compare the stored timestamp, and if the associated section was accessed within a certain period (e.g., within a threshold applied to the stored timestamp), the refresh controller may skip to the next section. This avoids spending refresh power on sections that were accessed recently. The refresh controller may also continue to track accesses to make sure that each section is either accessed or refreshed in the next cycle.

Accordingly, in yet another example, the data storage may comprise one or more registers or non-volatile memories associated with sections of memory chip 2800 (e.g., rows, columns, or any other grouping of memory cells within memory chip 2800). Rather than using bits to indicate whether the associated section has been accessed, the registers or non-volatile memories may be configured to store timestamps or other information indicating the most recent access to the associated section. In this example, refresh controller 2803 may determine whether to refresh or access the associated section based on whether the amount of time between the timestamp stored in the associated register or memory and the current time (e.g., from a timer, as explained below with respect to FIGS. 29A and 29B) exceeds a predetermined threshold (e.g., 8 ms, 16 ms, 32 ms, 64 ms, or the like).

The predetermined threshold may therefore be the duration of a refresh cycle, which ensures that the associated section is refreshed (if not accessed) at least once within each refresh cycle. Alternatively, the predetermined threshold may be shorter than the duration of a refresh cycle (e.g., to ensure that any required refresh or access signal can reach the associated section before the refresh cycle completes). For example, the predetermined time may be 7 ms for a memory chip with an 8 ms refresh period, so that if a section has not been accessed within 7 ms, the refresh controller sends a refresh or access signal that reaches the section by the end of the 8 ms refresh period. In some embodiments, the predetermined threshold may depend on the size of the associated section. For example, the predetermined threshold may be smaller for smaller sections of memory chip 2800.
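The timestamp variant can be sketched in a few lines. The example below assumes a 7 ms threshold for an 8 ms refresh period, as in the text; the function name, the use of a list of per-row timestamps, and the choice to refresh when the elapsed time is greater than or equal to the threshold are illustrative assumptions.

```python
def refresh_pass(last_access_ms, now_ms, threshold_ms):
    """Refresh every row whose last access or refresh is at least
    threshold_ms old; update its timestamp and return the refreshed rows."""
    refreshed = []
    for row, stamp in enumerate(last_access_ms):
        if now_ms - stamp >= threshold_ms:
            refreshed.append(row)
            # A refresh restores the charge, so it resets the row's clock.
            last_access_ms[row] = now_ms
        # Otherwise the row was touched recently enough and is skipped.
    return refreshed


stamps = [0.0, 2.5, 6.0, 7.5]  # hypothetical last-access times in ms
due = refresh_pass(stamps, now_ms=8.0, threshold_ms=7.0)
```

Here only row 0, untouched for 8 ms, is refreshed; the other rows were accessed within the last 7 ms and are skipped in this pass.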

Although described above with respect to a memory chip, the refresh controllers of the present disclosure may also be used in distributed processor architectures, such as those described in the sections above and throughout the present disclosure. One example of such an architecture is depicted in FIG. 7A. In such embodiments, the same substrate as memory chip 2800 may include a plurality of processing groups disposed thereon, e.g., as depicted in FIG. 7A. As explained above with respect to FIG. 3A, a "processing group" may refer to two or more processor subunits and their corresponding memory banks on the substrate. The group may represent a spatial distribution on the substrate and/or a logical grouping for the purpose of compiling code for execution on memory chip 2800. Accordingly, the substrate may include a memory array that includes a plurality of banks, such as bank 2801a and the other banks shown in FIG. 28. The substrate may also include a processing array that includes a plurality of processor subunits (such as subunits 730a, 730b, 730c, 730d, 730e, 730f, 730g, and 730h shown in FIG. 7A).

As further explained above with respect to FIG. 7A, each processing group may include a processor subunit and one or more corresponding memory banks dedicated to that processor subunit. To allow each processor subunit to communicate with its corresponding dedicated memory bank, the substrate may include a first plurality of buses, each connecting one of the processor subunits to its corresponding dedicated memory bank.

In such embodiments, as shown in FIG. 7A, the substrate may include a second plurality of buses that connect each processor subunit to at least one other processor subunit (e.g., an adjacent subunit in the same row, an adjacent processor subunit in the same column, or any other processor subunit on the substrate). The first plurality of buses and/or the second plurality of buses may be free of timing hardware logic components, so that data transfers between processor subunits and across corresponding ones of the buses are not controlled by timing hardware logic components, as explained above in the section "Synchronization Using Software".

In embodiments where the same substrate as memory chip 2800 includes a plurality of processing groups disposed thereon (e.g., as depicted in FIG. 7A), the processor subunits may further include address generators (e.g., address generator 450 as depicted in FIG. 4). Each processing group may include a processor subunit and one or more corresponding memory banks dedicated to that processor subunit. Accordingly, each of the address generators may be associated with a corresponding, dedicated one of the plurality of memory banks. In addition, the substrate may include a plurality of buses, each connecting one of the plurality of address generators to its corresponding dedicated memory bank.

FIG. 29A depicts an example refresh controller 2900 consistent with the present disclosure. Refresh controller 2900 may be incorporated in a memory chip of the present disclosure, such as memory chip 2800 of FIG. 28. As depicted in FIG. 29A, refresh controller 2900 may include a timer 2901, which may comprise an on-chip oscillator or any other timing circuit for refresh controller 2900. In the configuration depicted in FIG. 29A, timer 2901 may trigger a refresh cycle periodically (e.g., every 8 ms, 16 ms, 32 ms, 64 ms, or the like). The refresh cycle may use a row counter 2903 to cycle through all rows of the corresponding memory chip, and may use adder 2901 combined with an active bit 2905 to generate a refresh signal for each row. As shown in FIG. 29A, bit 2905 may be fixed at 1 ("true") to ensure that every row is refreshed during a cycle.
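The baseline configuration of FIG. 29A, in which the active bit is fixed at 1, amounts to refreshing every row on every timer period. A minimal sketch under that reading follows; the function name, the 1 ms simulation tick, and the log format are assumptions made for illustration.

```python
def run_baseline_refresh(num_rows, period_ms, total_ms):
    """Simulate a timer that fires every period_ms and a row counter that
    walks all rows; return a log of (time_ms, row) refresh signals."""
    ACTIVE_BIT = 1  # fixed at 1 ("true"), so no row is ever skipped
    log = []
    for now in range(1, total_ms + 1):       # hypothetical 1 ms ticks
        if now % period_ms != 0:
            continue                         # timer has not fired yet
        for row in range(num_rows):          # row counter
            if ACTIVE_BIT:
                log.append((now, row))       # refresh signal for this row
    return log


signals = run_baseline_refresh(num_rows=2, period_ms=8, total_ms=16)
```

Replacing the constant active bit with a per-row value derived from stored access information is what allows rows to be skipped in the embodiments described next.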

In embodiments of the present disclosure, refresh controller 2900 may include a data storage. As described above, the data storage may comprise one or more registers or non-volatile memories associated with sections of memory chip 2800 (e.g., rows, columns, or any other grouping of memory cells within memory chip 2800). The registers or non-volatile memories may be configured to store timestamps or other information indicating the most recent access to the associated section.

Refresh controller 2900 may use the stored information to skip refreshes of segments of memory chip 2800. For example, if the information indicates that a segment was refreshed during one or more previous refresh cycles, refresh controller 2900 may skip the segment in the current refresh cycle. In another example, if the difference between the stored timestamp of a segment and the current time is below a threshold, refresh controller 2900 may skip the segment in the current refresh cycle. Refresh controller 2900 may further continue to track accesses and refreshes of the segments of memory chip 2800 across multiple refresh cycles. For example, refresh controller 2900 may update the stored timestamps using timer 2901. In such embodiments, refresh controller 2900 may be configured to use the output of the timer to clear the access information stored in the data storage after a threshold time interval. For example, in embodiments where the data storage stores a timestamp of the most recent access or refresh of an associated segment, refresh controller 2900 may store a new timestamp in the data storage whenever an access command or a refresh signal is sent to the segment. If the data storage stores bits rather than timestamps, timer 2901 may be configured to clear bits that have remained set for longer than a threshold period of time. For example, in embodiments where the data storage stores bits indicating that the associated segments were accessed in one or more previous cycles, refresh controller 2900 may clear a bit in the data storage (e.g., set it to 0) whenever timer 2901 triggers a new refresh cycle that falls a threshold number of cycles (e.g., one, two, or the like) after the associated bit was set (e.g., set to 1).
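The timestamp-based skipping described above may be illustrated with a minimal software sketch. The sketch is not the disclosed hardware; the class name `TimestampRefreshController`, its methods, and the use of milliseconds are assumptions made only for illustration. Each segment holds one timestamp, an access or refresh overwrites it, and a refresh cycle skips any segment whose timestamp is more recent than the threshold.

```python
class TimestampRefreshController:
    """Hypothetical software model: one timestamp per segment, times in ms."""

    def __init__(self, num_segments, threshold_ms):
        self.threshold_ms = threshold_ms
        # Assumption: every segment is treated as refreshed at time 0.
        self.last_touch = [0.0] * num_segments

    def record_access(self, segment, now_ms):
        # A read or write also restores the cells, so it updates the timestamp.
        self.last_touch[segment] = now_ms

    def refresh_cycle(self, now_ms):
        """Return the segments refreshed in this cycle; recent ones are skipped."""
        refreshed = []
        for segment, stamp in enumerate(self.last_touch):
            if now_ms - stamp < self.threshold_ms:
                continue  # accessed or refreshed recently: skip this segment
            refreshed.append(segment)
            self.last_touch[segment] = now_ms  # store the new timestamp
        return refreshed


controller = TimestampRefreshController(num_segments=4, threshold_ms=64.0)
controller.record_access(segment=2, now_ms=40.0)
print(controller.refresh_cycle(now_ms=64.0))   # [0, 1, 3]
print(controller.refresh_cycle(now_ms=128.0))  # [0, 1, 2, 3]
```

In this sketch, segment 2 is skipped once because it was accessed 24 ms before the first cycle, and it is refreshed in the following cycle once its timestamp has aged past the threshold.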

Refresh controller 2900 may track accesses of the segments of memory chip 2800 in cooperation with other hardware of memory chip 2800. For example, a memory chip uses sense amplifiers to perform read operations (e.g., as shown above in FIGS. 9 and 10). The sense amplifiers may comprise a plurality of transistors configured to sense low-power signals from a segment of memory chip 2800 storing data in one or more memory cells, and to amplify the small voltage swing to higher voltage levels so that the data can be interpreted by logic, such as an external CPU or GPU or an integrated processor subunit as explained above. Although not depicted in FIG. 29A, refresh controller 2900 may further communicate with a sense amplifier configured to access the one or more segments and change the state of at least one bit register. For example, when the sense amplifier accesses the one or more segments, it may set (e.g., to 1) the bits associated with those segments, indicating that the associated segments were accessed in the previous cycle. In embodiments where the data storage stores a timestamp of the most recent access or refresh of an associated segment, when the sense amplifier accesses the one or more segments, it may trigger a write of a timestamp from timer 2901 to the register, memory, or other element comprising the data storage.

In any of the embodiments described above, refresh controller 2900 may be integrated with a memory controller for the plurality of memory banks. For example, similar to the embodiment depicted in FIG. 3A, refresh controller 2900 may be incorporated into a logic and control subunit associated with a memory bank or other segment of memory chip 2800.

FIG. 29B depicts another example refresh controller 2900' consistent with the present invention. Refresh controller 2900' may be incorporated into a memory chip of the present invention (such as memory chip 2800 of FIG. 28). Similar to refresh controller 2900, refresh controller 2900' includes timer 2901, row counter 2903, valid bit 2905, and adder 2907. In addition, refresh controller 2900' may include data storage 2909. As shown in FIG. 29B, data storage 2909 may comprise one or more registers or non-volatile memories associated with segments of memory chip 2800 (e.g., lines, columns, or any other grouping of memory cells within memory chip 2800), and the states within the data storage may be configured to change in response to one or more segments being accessed (e.g., by a sense amplifier and/or other elements of refresh controller 2900', as described above). Accordingly, refresh controller 2900' may be configured to skip a refresh of one or more segments based on the states within the data storage. For example, if the state associated with a segment is activated (e.g., set to 1 by being switched on, by having a property altered so as to store a "1," or the like), refresh controller 2900' may skip the refresh cycle for the associated segment and clear the state associated with that segment. The state may be stored in at least a one-bit register or in any other memory structure configured to store at least one bit of data.

To ensure that the segments of the memory chip are refreshed or accessed during each refresh cycle, refresh controller 2900' may reset or otherwise clear the states so that a refresh signal is triggered during the next refresh cycle. In some embodiments, after a segment is skipped, refresh controller 2900' may clear the associated state to ensure that the segment is refreshed in the next refresh cycle. In other embodiments, refresh controller 2900' may be configured to reset the states within the data storage after a threshold time interval. For example, refresh controller 2900' may clear a state in the data storage (e.g., set it to 0) whenever timer 2901 exceeds a threshold time since the associated state was set (e.g., set to 1 by being switched on, by having a property altered so as to store a "1," or the like). In some embodiments, refresh controller 2900' may use a threshold number of refresh cycles (e.g., one, two, or the like) or a threshold number of clock cycles (e.g., two, four, or the like) rather than a threshold time.
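The skip-then-clear behavior may be sketched as follows. This is an illustrative software model under stated assumptions, not the disclosed circuit: the class `BitRefreshController` and its method names are hypothetical, and the state of each segment is modeled as a single bit that is cleared immediately after one skipped refresh.

```python
class BitRefreshController:
    """Hypothetical model: one access bit per segment, cleared after a skip."""

    def __init__(self, num_segments):
        self.accessed = [0] * num_segments

    def record_access(self, segment):
        # E.g., a sense amplifier sets the bit when it accesses the segment.
        self.accessed[segment] = 1

    def refresh_cycle(self):
        """Return the segments refreshed in this cycle."""
        refreshed = []
        for segment, bit in enumerate(self.accessed):
            if bit:
                # Skip once, then clear so the next cycle refreshes the segment.
                self.accessed[segment] = 0
            else:
                refreshed.append(segment)
        return refreshed


controller = BitRefreshController(num_segments=3)
controller.record_access(1)
print(controller.refresh_cycle())  # [0, 2]
print(controller.refresh_cycle())  # [0, 1, 2]
```

Clearing the bit on the skip bounds the time any segment can go without a refresh or an access to roughly two refresh cycles, which is why the threshold in such a scheme would be chosen with the retention time of the cells in mind.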

In other embodiments, the state may comprise a timestamp of the most recent refresh or access of the associated segment, such that if the amount of time between the timestamp and the current time (e.g., from timer 2901 of FIGS. 29A and 29B) exceeds a predetermined threshold (e.g., 8 ms, 16 ms, 32 ms, 64 ms, or the like), refresh controller 2900' may send an access command or a refresh signal to the associated segment and update the timestamp associated with that segment (e.g., using timer 2901). Additionally or alternatively, refresh controller 2900' may be configured to skip a refresh operation with respect to one or more segments of the plurality of memory banks if the refresh time indicator indicates that the last refresh time is within a predetermined time threshold. In such embodiments, after skipping the refresh operation with respect to the one or more segments, refresh controller 2900' may be configured to alter the stored refresh time indicator associated with the one or more segments so that the one or more segments will be refreshed during the next operation cycle. For example, as described above, refresh controller 2900' may use timer 2901 to update the stored refresh time indicator.

Accordingly, the data storage may include a timestamp register configured to store a refresh time indicator indicative of the time at which the one or more segments of the plurality of memory banks were last refreshed. In addition, refresh controller 2900' may use the output of the timer to clear the access information stored in the data storage after a threshold time interval.

In any of the embodiments described above, the access to the one or more segments may include a write operation associated with the one or more segments. Additionally or alternatively, the access to the one or more segments may include a read operation associated with the one or more segments.

In addition, as depicted in FIG. 29B, refresh controller 2900' may include row counter 2903 and adder 2907, which are configured to assist in updating data storage 2909 based, at least in part, on the states within the data storage. Data storage 2909 may comprise a bit table associated with the plurality of memory banks. For example, the bit table may comprise an array of switches (or switching elements, such as transistors) or registers (e.g., SRAM or the like) configured to hold bits for the associated segments. Additionally or alternatively, data storage 2909 may store timestamps associated with the plurality of memory banks.

In addition, refresh controller 2900' may include refresh gate 2911, which is configured to control whether a refresh of the one or more segments occurs based on the corresponding value stored in the bit table. For example, refresh gate 2911 may comprise a logic gate (such as an "AND" gate) configured to nullify a refresh signal from row counter 2903 if the corresponding state in data storage 2909 indicates that the associated segment was refreshed or accessed during one or more previous clock cycles. In other embodiments, refresh gate 2911 may comprise a microprocessor or other circuit configured to nullify a refresh signal from row counter 2903 if the corresponding timestamp from data storage 2909 indicates that the associated segment was refreshed or accessed within a predetermined threshold time value.
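The gating of the refresh signal may be pictured with the following sketch, offered only as an illustration. The functions `refresh_gate` and `run_refresh_cycle` are hypothetical names, the row counter and adder are modeled as an integer incremented by one, and the gate is modeled as an AND of the refresh request with the inverted bit-table entry.

```python
def refresh_gate(refresh_request, accessed_bit):
    """AND of the refresh request with the inverted access bit."""
    return refresh_request and not accessed_bit


def run_refresh_cycle(bit_table):
    """Step a row counter over the bit table and collect the refreshes issued."""
    issued = []
    row = 0  # models the row counter
    while row < len(bit_table):
        if refresh_gate(True, bool(bit_table[row])):
            issued.append(row)  # the refresh signal passes through the gate
        row = row + 1  # models the adder incrementing the counter
    return issued


print(run_refresh_cycle([0, 1, 0, 1]))  # [0, 2]
```

Rows 1 and 3 have their bits set in this example, so the gate nullifies the refresh signal for them while rows 0 and 2 are refreshed.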

FIG. 30 is an example flowchart of a process 3000 for partial refreshes in a memory chip (e.g., memory chip 2800 of FIG. 28). Process 3000 may be executed by a refresh controller consistent with the present invention, such as refresh controller 2900 of FIG. 29A or refresh controller 2900' of FIG. 29B.

At step 3010, the refresh controller may access information indicative of access operations for one or more segments of a plurality of memory banks. For example, as explained above with respect to FIGS. 29A and 29B, the refresh controller may include a data storage that is associated with segments of memory chip 2800 (e.g., lines, columns, or any other grouping of memory cells within memory chip 2800) and is configured to store timestamps or other information indicative of a most recent access of the associated segments.

At step 3020, the refresh controller may generate refresh and/or access commands based, at least in part, on the accessed information. For example, as explained above with respect to FIGS. 29A and 29B, the refresh controller may skip a refresh operation with respect to one or more segments of the plurality of memory banks if the accessed information indicates that the last refresh or access time is within a predetermined time threshold and/or that the last refresh or access occurred during one or more previous clock cycles. Additionally or alternatively, the refresh controller may generate commands to refresh or access the associated segments based on whether the accessed information indicates that the last refresh or access time exceeds a predetermined threshold and/or that the last refresh or access did not occur during one or more previous clock cycles.

At step 3030, the refresh controller may alter the stored refresh time indicator associated with the one or more segments so that the one or more segments will be refreshed during the next operation cycle. For example, after skipping a refresh operation with respect to the one or more segments, the refresh controller may alter the information indicative of access operations for the one or more segments so that the one or more segments will be refreshed during the next clock cycle. Accordingly, after skipping a refresh cycle, the refresh controller may clear the state of the segment (e.g., set it to 0). Additionally or alternatively, the refresh controller may set the state (e.g., to 1) of segments that are refreshed and/or accessed during the current cycle. In embodiments where the information indicative of access operations for the one or more segments includes timestamps, the refresh controller may update any stored timestamps associated with segments that are refreshed and/or accessed during the current cycle.

Process 3000 may further include additional steps. For example, in addition to or as an alternative to step 3030, a sense amplifier may access the one or more segments and may change the information associated with the one or more segments. Additionally or alternatively, the sense amplifier may signal the refresh controller when an access has occurred so that the refresh controller may update the information associated with the one or more segments. As explained above, a sense amplifier may comprise a plurality of transistors configured to sense low-power signals from a segment of the memory chip storing data in one or more memory cells, and to amplify the small voltage swing to higher voltage levels so that the data can be interpreted by logic, such as an external CPU or GPU or an integrated processor subunit as explained above. In this example, whenever the sense amplifier accesses the one or more segments, it may set (e.g., to 1) the bits associated with the segments, indicating that the associated segments were accessed in the previous cycle. In embodiments where the information indicative of access operations for the one or more segments includes timestamps, whenever the sense amplifier accesses the one or more segments, it may trigger a write of a timestamp from the timer of the refresh controller to the data storage to update any stored timestamps associated with the segments.

FIG. 31 is an example flowchart of a process 3100 for determining refreshes for a memory chip (e.g., memory chip 2800 of FIG. 28). Process 3100 may be implemented within a compiler consistent with the present invention. As explained above, a "compiler" refers to any computer program that converts a higher-level language (e.g., a procedural language, such as C, FORTRAN, BASIC, or the like; an object-oriented language, such as Java, C++, Pascal, Python, or the like; etc.) to a lower-level language (e.g., assembly code, object code, machine code, or the like). The compiler may allow a human to program a series of instructions in a human-readable language, which is then converted to a machine-executable language. The compiler may comprise software instructions executed by one or more processors.

At step 3110, the one or more processors may receive higher-level computer code. For example, the higher-level computer code may be encoded in one or more files on a memory (e.g., a non-volatile memory such as a hard disk drive or the like, a volatile memory such as DRAM, or the like) or received over a network (e.g., the Internet or the like). Additionally or alternatively, the higher-level computer code may be received from a user (e.g., using an input device such as a keyboard).

At step 3120, the one or more processors may identify a plurality of memory segments, distributed over a plurality of memory banks associated with the memory chip, that are to be accessed by the higher-level computer code. For example, the one or more processors may access a data structure defining the plurality of memory banks of the memory chip and a corresponding structure. The one or more processors may access the data structure from a memory (e.g., a non-volatile memory such as a hard disk drive or the like, a volatile memory such as DRAM, or the like) or receive the data structure over a network (e.g., the Internet or the like). In such embodiments, the data structure is included in one or more libraries accessible by the compiler to permit the compiler to generate instructions for the particular memory chip to be accessed.

At step 3130, the one or more processors may assess the higher-level computer code to identify a plurality of memory read commands to occur over a plurality of memory access cycles. For example, the one or more processors may identify each operation within the higher-level computer code that requires one or more read commands from memory and/or one or more write commands to memory. Such instructions may include variable initialization, variable reassignment, logic operations on variables, input-output operations, or the like.

At step 3140, the one or more processors may cause a distribution of data associated with the plurality of memory access commands across each of the plurality of memory segments such that each of the plurality of memory segments is accessed during each of the plurality of memory access cycles. For example, the one or more processors may identify the memory segments from the data structure defining the structure of the memory chip and then assign variables from the higher-level code to various ones of the memory segments such that each memory segment is accessed (e.g., via a write or a read) at least once during each refresh cycle (which may comprise a particular number of clock cycles). In this example, the one or more processors may access information indicating how many clock cycles each line of the higher-level code requires in order to assign variables from the lines of the higher-level code such that each memory segment is accessed (e.g., via a write or a read) at least once during the particular number of clock cycles.

In another example, the one or more processors may first generate machine code or other lower-level code from the higher-level code. The one or more processors may then assign variables from the lower-level code to various ones of the memory segments such that each memory segment is accessed (e.g., via a write or a read) at least once during each refresh cycle (which may comprise a particular number of clock cycles). In this example, each line of the lower-level code may require a single clock cycle.
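A minimal sketch of this kind of assignment is given below. It is not the disclosed compiler: the functions `assign_round_robin` and `segments_missed` are hypothetical, the assignment policy (round-robin) is merely one simple choice, and each entry of the access trace is assumed to take a single clock cycle, as in the lower-level code example above. The second function checks, for each refresh window, which segments would not be accessed and would therefore still need a refresh or an access command.

```python
def assign_round_robin(variables, num_segments):
    """Spread variables over the segments in order (one simple policy)."""
    return {var: index % num_segments for index, var in enumerate(variables)}


def segments_missed(trace, assignment, num_segments, window):
    """For each window of `window` clock cycles, list segments never accessed.

    `trace` names the variable accessed at each clock cycle (assumption:
    one access per cycle).
    """
    all_segments = set(range(num_segments))
    missed = []
    for start in range(0, len(trace), window):
        touched = {assignment[var] for var in trace[start:start + window]}
        missed.append(all_segments - touched)
    return missed


assignment = assign_round_robin(["a", "b", "c", "d"], num_segments=2)
print(assignment)  # {'a': 0, 'b': 1, 'c': 0, 'd': 1}
print(segments_missed(["a", "b", "c", "c"], assignment, 2, window=2))
# [set(), {1}]
```

In the example, both segments are accessed in the first window, while segment 1 is missed in the second window because only variable `c` (assigned to segment 0) is accessed there.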

In any of the examples given above, the one or more processors may further assign logic operations or other commands that use temporary output to various ones of the memory segments. Such temporary outputs may also result in read and/or write commands such that the assigned memory segment is still accessed during that refresh cycle even though a named variable has not been assigned to that memory segment.

Process 3100 may further include additional steps. For example, in embodiments where the variables are assigned prior to compiling, the one or more processors may generate machine code or other lower-level code from the higher-level code. Moreover, the one or more processors may transmit the compiled code for execution by the memory chip and corresponding logic circuits. The logic circuits may comprise conventional circuits such as GPUs or CPUs, or may comprise processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A. Accordingly, as described above, the substrate may include a memory array comprising a plurality of banks, such as bank 2801a and the other banks shown in FIG. 28. In addition, the substrate may include a processing array that may include a plurality of processor subunits (such as subunits 730a, 730b, 730c, 730d, 730e, 730f, 730g, and 730h shown in FIG. 7A).

FIG. 32 is another example flowchart of a process 3200 for determining refreshes for a memory chip (e.g., memory chip 2800 of FIG. 28). Process 3200 may be implemented within a compiler consistent with the present invention. Process 3200 may be executed by one or more processors executing software instructions comprising the compiler. Process 3200 may be implemented separately from or in combination with process 3100 of FIG. 31.

At step 3210, similar to step 3110, the one or more processors may receive higher-level computer code. At step 3220, similar to step 3120, the one or more processors may identify a plurality of memory segments, distributed over a plurality of memory banks associated with the memory chip, that are to be accessed by the higher-level computer code.

At step 3230, the one or more processors may assess the higher-level computer code to identify a plurality of memory read commands, each implicating one or more of the plurality of memory segments. For example, the one or more processors may identify each operation within the higher-level computer code that requires one or more read commands from memory and/or one or more write commands to memory. Such instructions may include variable initialization, variable reassignment, logic operations on variables, input-output operations, or the like.

In some embodiments, the one or more processors may simulate an execution of the higher-level code using the logic circuits and the plurality of memory segments. For example, the simulation may comprise a line-by-line step-through of the higher-level code, similar to that of a debugger or other instruction set simulator (ISS). The simulation may further maintain internal variables that represent the addresses of the plurality of memory segments, similar to how a debugger may maintain internal variables that represent the registers of a processor.

At step 3240, the one or more processors may, based on an analysis of the memory access commands and for each memory segment among the plurality of memory segments, track the amount of time that would accrue from a last access to the memory segment. For example, using the simulation described above, the one or more processors may determine the lengths of time between each access (e.g., a read or a write) to one or more addresses within each of the plurality of memory segments. The lengths of time may be measured in absolute time, clock cycles, or refresh cycles (e.g., as determined by a known refresh rate of the memory chip).

At step 3250, in response to a determination that the amount of time since the last access to any particular memory segment would exceed a predetermined threshold, the one or more processors may introduce into the higher-level computer code at least one of a memory refresh command or a memory access command configured to cause an access to the particular memory segment. For example, the one or more processors may include a refresh command for execution by a refresh controller (e.g., refresh controller 2900 of FIG. 29A or refresh controller 2900' of FIG. 29B). In embodiments where the logic circuits are not embedded on the same substrate as the memory chip, the one or more processors may generate the refresh commands for sending to the memory chip separately from the lower-level code for sending to the logic circuits.

Additionally or alternatively, the one or more processors may include an access command for execution by a memory controller (which may be separate from the refresh controller or incorporated into it). The access command may comprise a dummy command configured to trigger a read operation on the memory segment without causing the logic circuits to perform any further operation on the variable read from or written to the memory segment.
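Steps 3240 and 3250 may be illustrated with the following sketch, which is a simplification rather than the disclosed compiler. The function `insert_refreshes` is hypothetical; the simulated program is reduced to a trace naming the segment accessed at each clock cycle, every segment is assumed to have been touched at cycle 0, and an injected refresh command is assumed to consume no clock cycle of its own.

```python
def insert_refreshes(trace, num_segments, threshold_cycles):
    """Walk a simulated access trace and inject refreshes where needed.

    `trace[i]` is the segment accessed at clock cycle i + 1. A segment whose
    time since last access reaches `threshold_cycles` receives a refresh
    command (a dummy read could be emitted instead).
    """
    last_touch = [0] * num_segments  # assumption: all segments fresh at cycle 0
    program = []
    for cycle, segment in enumerate(trace, start=1):
        program.append(("access", segment))
        last_touch[segment] = cycle
        for other in range(num_segments):
            if cycle - last_touch[other] >= threshold_cycles:
                program.append(("refresh", other))
                last_touch[other] = cycle
    return program


print(insert_refreshes([0, 0, 0, 0], num_segments=2, threshold_cycles=3))
# [('access', 0), ('access', 0), ('access', 0), ('refresh', 1), ('access', 0)]
```

Here segment 1 is never accessed by the program, so a refresh for it is injected once three cycles have elapsed; segment 0 needs no refresh because the program itself accesses it every cycle.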

In some embodiments, the compiler may include a combination of steps from process 3100 and steps from process 3200. For example, the compiler may assign variables according to step 3140 and then run the simulation described above to add any additional memory refresh commands or memory access commands according to step 3250. This combination may allow the compiler to distribute the variables across as many memory segments as possible and to generate refresh or access commands for any memory segments that cannot be accessed within the predetermined threshold amount of time. In another combinatory example, the compiler may simulate the code according to step 3230 and assign variables according to step 3140 based on any memory segments that the simulation indicates will not be accessed within the predetermined threshold amount of time. In some embodiments, this combination may further include step 3250 to allow the compiler to generate refresh or access commands for any memory segments that cannot be accessed within the predetermined threshold amount of time, even after the assignments according to step 3140 are complete.

Refresh controllers of the present invention may allow software executed by logic circuits (whether conventional logic circuits such as CPUs and GPUs or processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A) to disable the automatic refresh executed by the refresh controller and to control the refresh via the executed software instead. Accordingly, some embodiments of the present invention may provide software with a known access pattern to a memory chip (e.g., if the compiler has access to a data structure defining the plurality of memory banks of the memory chip and a corresponding structure). In such embodiments, a post-compiling optimizer may disable the automatic refresh and manually set refresh control only for segments of the memory chip that are not accessed within a threshold amount of time. Thus, similar to step 3250 described above but after compilation, the post-compiling optimizer may generate refresh commands to ensure that each memory segment is accessed or refreshed within the predetermined threshold amount of time.

Another example of reducing refresh cycles may include using predefined patterns of access to the memory chip. For example, if software executed by the logic circuits can control its access pattern for the memory chip, some embodiments may create access patterns for refresh that go beyond conventional linear line refreshes. For example, if a controller determines that the software executed by the logic circuits regularly accesses every second row of the memory, a refresh controller of the present invention may use a refresh pattern that does not refresh every second line, in order to speed up the memory chip and reduce power usage.
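As a hedged illustration of the every-second-row case, the rows that still require refresh are simply those the software does not access. The function `rows_to_refresh` and its stride and offset parameters are assumptions introduced for this sketch.

```python
def rows_to_refresh(num_rows, accessed_stride, accessed_offset=0):
    """Rows left for refresh when software accesses every `accessed_stride`-th row."""
    accessed = set(range(accessed_offset, num_rows, accessed_stride))
    return [row for row in range(num_rows) if row not in accessed]


print(rows_to_refresh(num_rows=8, accessed_stride=2))  # [1, 3, 5, 7]
```

With eight rows and software that accesses rows 0, 2, 4, and 6, only the four odd rows need refresh commands, halving the refresh work in this example.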

An example of such a refresh controller is shown in FIG. 33. FIG. 33 depicts an example refresh controller 3300 configured by stored patterns consistent with the present invention. Refresh controller 3300 may be incorporated in a memory chip of the present invention, e.g., one having a plurality of memory banks and a plurality of memory segments included in each of the plurality of memory banks, such as memory chip 2800 of FIG. 28.

Refresh controller 3300 includes a timer 3301 (similar to timer 2901 of FIGS. 29A and 29B), a row counter 3303 (similar to row counter 2903 of FIGS. 29A and 29B), and an adder 3305 (similar to adder 2907 of FIGS. 29A and 29B). In addition, refresh controller 3300 includes a data storage 3307. Unlike data storage 2909 of FIG. 29B, data storage 3307 may store at least one memory refresh pattern to be implemented in refreshing the plurality of memory segments included in each of the plurality of memory banks. For example, as depicted in FIG. 33, data storage 3307 may include Li (e.g., L1, L2, L3, and L4 in the example of FIG. 33) and Hi (e.g., H1, H2, H3, and H4 in the example of FIG. 33) that define segments in the memory banks by row and/or column. Moreover, each segment may be associated with an Inci variable (e.g., Inc1, Inc2, Inc3, and Inc4 in the example of FIG. 33) that defines how the rows associated with the segment are incremented (e.g., whether each row is accessed or refreshed, whether every other row is accessed or refreshed, or the like). Thus, as shown in FIG. 33, the refresh pattern may comprise a table including a plurality of memory segment identifiers assigned by the software to identify ranges of the plurality of memory segments in a particular memory bank that are to be refreshed during a refresh cycle and ranges of the plurality of memory segments in the particular memory bank that are not to be refreshed during the refresh cycle.
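One way to picture the table of Li, Hi, and Inci values is the sketch below. The field names, the inclusive treatment of Hi, and the use of an increment of 0 to mean "do not refresh this segment" are assumptions made for illustration; the disclosure does not fix an encoding.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SegmentEntry:
    low: int   # Li: first row of the segment
    high: int  # Hi: last row of the segment (treated as inclusive here, an assumption)
    inc: int   # Inci: row step; 0 is used in this sketch to mean "skip the segment"


def rows_for_pattern(pattern):
    """Expand a table of (Li, Hi, Inci) entries into the rows to refresh."""
    rows = []
    for entry in pattern:
        if entry.inc > 0:
            rows.extend(range(entry.low, entry.high + 1, entry.inc))
    return rows


pattern = [
    SegmentEntry(low=0, high=3, inc=1),   # refresh every row of the segment
    SegmentEntry(low=4, high=7, inc=2),   # refresh every other row
    SegmentEntry(low=8, high=11, inc=0),  # segment not refreshed this cycle
]
print(rows_for_pattern(pattern))
```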

Accordingly, data storage 3307 may define a refresh pattern that software executed by logic circuits (whether conventional logic circuits such as CPUs and GPUs or processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A) may select for use. The memory refresh pattern may be configurable using software to identify which of the plurality of memory segments in a particular memory bank are to be refreshed during a refresh cycle and which of the plurality of memory segments in the particular memory bank are not to be refreshed during the refresh cycle. Thus, refresh controller 3300 may refresh, according to Inci, some or all rows within the defined segments that are not accessed during a current cycle. Refresh controller 3300 may skip other rows of the defined segments that are set to be accessed during the current cycle.

In embodiments where data storage 3307 of refresh controller 3300 includes a plurality of memory refresh patterns, each memory refresh pattern may represent a different refresh pattern for refreshing the plurality of memory segments included in each of the plurality of memory banks. The memory refresh patterns may be selectable for use on the plurality of memory segments. Accordingly, refresh controller 3300 may be configured to allow selection of which of the plurality of memory refresh patterns is implemented during a particular refresh cycle. For example, software executed by logic circuits (whether conventional logic circuits such as CPUs and GPUs or processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A) may select different memory refresh patterns for use during one or more different refresh cycles. Alternatively, software executed by the logic circuits may select one memory refresh pattern for use throughout some or all of the different refresh cycles.
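Selection among several stored patterns might look like the following sketch. The class `RefreshPatternStore`, the pattern names, and the (low, high, inc) tuple layout are hypothetical; they merely stand in for whatever encoding the data storage uses.

```python
class RefreshPatternStore:
    """Holds several named refresh patterns; software picks the active one."""

    def __init__(self, patterns):
        if not patterns:
            raise ValueError("at least one pattern is required")
        self._patterns = dict(patterns)
        self._selected = next(iter(self._patterns))  # default: first pattern

    def select(self, name):
        """Called by software to choose the pattern for upcoming cycles."""
        if name not in self._patterns:
            raise KeyError(name)
        self._selected = name

    def rows_for_next_cycle(self):
        """Rows the controller would refresh under the selected pattern."""
        rows = []
        for low, high, inc in self._patterns[self._selected]:
            if inc > 0:  # inc == 0 means "skip this segment" in this sketch
                rows.extend(range(low, high + 1, inc))
        return rows


store = RefreshPatternStore({"all_rows": [(0, 3, 1)], "odd_rows": [(1, 3, 2)]})
print(store.rows_for_next_cycle())  # pattern used until software selects another
store.select("odd_rows")
print(store.rows_for_next_cycle())
```

The selected pattern stays in force until `select` is called again, which covers both cases in the text: a different pattern per cycle, or one pattern kept across many cycles.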

The memory refresh patterns may be encoded using one or more variables stored in data storage 3307. For example, in embodiments where the plurality of memory segments are arranged in rows, each memory segment identifier may be configured to identify a particular location within a row of memory where a memory refresh should either begin or end. For example, in addition to Li and Hi, one or more additional variables may define which portions of the rows defined by Li and Hi are within the segment.

FIG. 34 is an example flowchart of a process 3400 for determining refreshes for a memory chip (e.g., memory chip 2800 of FIG. 28). Process 3400 may be implemented by software within a refresh controller consistent with the present invention (e.g., refresh controller 3300 of FIG. 33).

At step 3410, the refresh controller may store at least one memory refresh pattern to be implemented in refreshing the plurality of memory segments included in each of the plurality of memory banks. For example, as explained above with respect to FIG. 33, the refresh pattern may comprise a table including a plurality of memory segment identifiers assigned by the software to identify ranges of the plurality of memory segments in a particular memory bank that are to be refreshed during a refresh cycle and ranges of the plurality of memory segments in the particular memory bank that are not to be refreshed during the refresh cycle.

In some embodiments, the at least one refresh pattern may be encoded onto the refresh controller during manufacture (e.g., onto a read-only memory associated with, or at least accessible by, the refresh controller). Accordingly, the refresh controller may access the at least one memory refresh pattern without storing it.

At steps 3420 and 3430, the refresh controller may use software to identify which of the plurality of memory segments in a particular memory bank are to be refreshed during a refresh cycle and which of the plurality of memory segments in the particular memory bank are not to be refreshed during the refresh cycle. For example, as explained above with respect to FIG. 33, software executed by logic circuits (whether conventional logic circuits such as CPUs and GPUs or processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A) may select the at least one memory refresh pattern. Moreover, the refresh controller may access the selected at least one memory refresh pattern during each refresh cycle to generate corresponding refresh signals. The refresh controller may refresh, according to the at least one memory refresh pattern, some or all portions within the defined segments that are not accessed during a current cycle, and may skip other portions of the defined segments that are set to be accessed during the current cycle.

At step 3440, the refresh controller may generate corresponding refresh commands. For example, as depicted in FIG. 33, adder 3305 may comprise a logic circuit configured to nullify refresh signals for particular segments that are not to be refreshed according to the at least one memory refresh pattern in data storage 3307. Additionally or alternatively, a microprocessor (not shown in FIG. 33) may generate particular refresh signals based on which segments are to be refreshed according to the at least one memory refresh pattern in data storage 3307.
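The nullifying behavior at step 3440 can be modeled in software as a mask applied to the output of the row counter. The sketch below is a behavioral illustration only, not a description of the actual logic circuit; the (low, high, inc) tuples and the rule that a row is refreshed when it lies on the segment's increment grid are assumptions.

```python
def refresh_signals(num_rows, pattern):
    """Walk the row counter over all rows and nullify the refresh signal for
    rows that the stored pattern does not mark for refresh.

    pattern: list of (low, high, inc) tuples (hypothetical encoding).
    Returns a list of (row, asserted) pairs.
    """
    signals = []
    for row in range(num_rows):  # models the row counter
        asserted = any(
            inc > 0 and low <= row <= high and (row - low) % inc == 0
            for low, high, inc in pattern
        )
        signals.append((row, asserted))  # False: the signal is nullified
    return signals


signals = refresh_signals(8, [(0, 3, 1), (4, 7, 2)])
print([row for row, asserted in signals if asserted])
```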

Method 3400 may further include additional steps. For example, in embodiments where the at least one memory refresh pattern is configured to change every one, two, or other number of refresh cycles (e.g., moving from L1, H1, and Inc1 to L2, H2, and Inc2, as shown in FIG. 33), the refresh controller may access a different portion of the data storage for a next determination of refresh signals according to steps 3430 and 3440. Similarly, if software executed by logic circuits (whether conventional logic circuits such as CPUs and GPUs or processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A) selects a new memory refresh pattern from the data storage for use in one or more future refresh cycles, the refresh controller may access a different portion of the data storage for a next determination of refresh signals according to steps 3430 and 3440.

Memory Chips of Selectable Size

When designing a memory chip and aiming for a certain memory capacity, changing the memory capacity to a larger or smaller size may require redesigning the product and redesigning the entire mask set. Often, product design proceeds in parallel with market research, and in some cases product design is completed before the market research is available. Thus, there may be a disconnect between the product design and the actual demands of the market. The present invention proposes a way to flexibly provide memory chips with memory capacities that meet market demand. The design method may include designing dies on a wafer along with appropriate interconnect circuitry such that memory chips, each of which may contain one or more dies, may be selectively cut from the wafer, providing an opportunity to produce memory chips of variable memory capacity from a single wafer.

The present invention relates to systems and methods for fabricating memory chips by cutting them from a wafer. The method may be used to produce memory chips of selectable size from the wafer. An example embodiment of a wafer 3501 containing dies 3503 is shown in FIG. 35A. Wafer 3501 may be formed from a semiconductor material (e.g., silicon (Si), silicon-germanium (SiGe), silicon on insulator (SOI), gallium nitride (GaN), aluminum nitride (AlN), aluminum gallium nitride (AlGaN), boron nitride (BN), gallium arsenide (GaAs), aluminum gallium arsenide (AlGaAs), indium nitride (InN), combinations thereof, and the like). Dies 3503 may include any suitable circuit elements (e.g., transistors, capacitors, resistors, and/or the like), which may include any suitable semiconductor, dielectric, or metallic components. Dies 3503 may be formed from a semiconductor material that may be the same as or different from the material of wafer 3501. In addition to dies 3503, wafer 3501 may include other structures and/or circuitry. In some embodiments, one or more coupling circuits may be provided and may couple one or more of the dies together. In an example embodiment, such a coupling circuit may include a bus shared by two or more dies 3503. Additionally, the coupling circuit may include one or more logic circuits designed to control circuitry associated with dies 3503 and/or to direct information to and from dies 3503. In some cases, the coupling circuit may include memory access management logic. Such logic may translate logical memory addresses into physical addresses associated with dies 3503. It should be noted that the term fabrication, as used herein, may refer collectively to any of the steps for building the disclosed wafers, dies, and/or chips. For example, fabrication may refer to the simultaneous laying out and forming of the various dies (and any other circuitry) included on the wafer. Fabrication may also refer to the cutting of memory chips of selectable size from the wafer to include one die in some cases, or multiple dies in other cases. Of course, the term fabrication is not intended to be limited to these examples but may include other aspects associated with the generation of the disclosed memory chips and any or all of the intermediate structures.
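The address translation performed by the memory access management logic is not specified further in the text. A minimal sketch, assuming equally sized dies laid out back to back in one linear logical address space, could be the following; the function name and the linear layout are hypothetical.

```python
def translate(logical_addr, die_capacity, num_dies):
    """Map a logical address to (die index, offset within that die).

    Assumes equally sized dies mapped contiguously (an assumption, not a
    mapping stated in the disclosure).
    """
    if not 0 <= logical_addr < die_capacity * num_dies:
        raise ValueError("logical address out of range")
    return divmod(logical_addr, die_capacity)


# Two dies of 4096 addressable units each.
print(translate(5000, die_capacity=4096, num_dies=2))
```

With such a mapping, the same logic would serve a single-die chip (`num_dies=1`) or a multi-die chip cut from the same wafer, only the parameter changes.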

A die 3503 or a group of dies may be used to fabricate a memory chip. The memory chip may include a distributed processor, as described in other sections of the present disclosure. As shown in FIG. 35B, die 3503 may include a substrate 3507 and a memory array disposed on the substrate. The memory array may include one or more memory units, such as memory banks 3511A through 3511D designed to store data. In various embodiments, the memory banks may include semiconductor-based circuit elements such as transistors, capacitors, and the like. In an example embodiment, a memory bank may include multiple rows and columns of storage cells. In some cases, such a memory bank may have a capacity greater than one megabyte. The memory banks may include dynamic or static access memory.

Die 3503 may further include a processing array disposed on the substrate, the processing array including a plurality of processor subunits 3515A through 3515D, as shown in FIG. 35B. As described above, each memory bank may include a dedicated processor subunit connected by a dedicated bus. For example, processor subunit 3515A is associated with memory bank 3511A via a bus or connection 3512. It should be understood that various connections between memory banks 3511A through 3511D and processor subunits 3515A through 3515D are possible, and only some illustrative connections are shown in FIG. 35B. In an example embodiment, a processor subunit may perform read/write operations on its associated memory bank and may further perform refresh operations or any other suitable operations with respect to the memory stored in the various memory banks.

As mentioned, die 3503 may include a first group of buses configured to connect the processor subunits with their corresponding memory banks. An example bus may include a set of wires or conductors that connect electrical components and allow transfers of data and addresses to and from each memory bank and its associated processor subunit. In an example embodiment, connection 3512 may serve as a dedicated bus for connecting processor subunit 3515A to memory bank 3511A. Die 3503 may include a group of such buses, each connecting a processor subunit to a corresponding dedicated memory bank. Additionally, die 3503 may include another group of buses, each connecting processor subunits (e.g., subunits 3515A through 3515D) to each other. For example, such buses may include connections 3516A through 3516D. In various embodiments, data for memory banks 3511A through 3511D may be delivered via input-output bus 3530. In an example embodiment, input-output bus 3530 may carry data-related information and command-related information for controlling the operation of the memory units of die 3503. Data information may include data for storage in the memory banks, data read from the memory banks, processing results from one or more of the processor subunits based on operations performed with respect to data stored in the corresponding memory banks, command-related information, various codes, and so on.

In various cases, data and commands transmitted by input-output bus 3530 may be controlled by an input-output (IO) controller 3521. In an example embodiment, IO controller 3521 may control the flow of data from bus 3530 to and from processor subunits 3515A through 3515D. IO controller 3521 may determine from which of processor subunits 3515A through 3515D information is retrieved. In various embodiments, IO controller 3521 may include a fuse 3554 configured to deactivate IO controller 3521. Fuse 3554 may be used if multiple dies are combined to form a larger memory chip (also referred to as a multi-die memory chip, as an alternative to a single-die memory chip that contains only one die). The multi-die memory chip may then use the IO controller of one of the die units forming the multi-die memory chip while disabling the IO controllers of the other die units by using the fuses corresponding to those other IO controllers.

As mentioned, each memory chip or predecessor die or group of dies may include distributed processors associated with corresponding memory banks. In some embodiments, these distributed processors may be arranged in a processing array disposed on the same substrate as a plurality of memory banks. Additionally, the processing array may include one or more logic portions, each including an address generator (also referred to as an address generator unit (AGU)). In some cases, the address generator may be part of at least one processor subunit. The address generator may generate the memory addresses required for fetching data from the one or more memory banks associated with the memory chip. Address generation calculations may involve integer arithmetic operations, such as addition, subtraction, modulo operations, or bit shifts. The address generator may be configured to operate on multiple operands at a time. Furthermore, multiple address generators may perform more than one address calculation operation simultaneously. In various embodiments, an address generator may be associated with a corresponding memory bank. The address generators may be connected with their corresponding memory banks by means of corresponding bus lines.
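The integer operations named above (addition, modulo, bit shift) are enough to express a typical address computation. The following sketch computes addresses in a circular buffer; the circular-buffer use case, the function names, and all parameters are assumptions chosen for illustration rather than details of the disclosed address generator.

```python
def generate_address(base, index, elem_shift, buffer_len):
    """Address of element `index` in a circular buffer of `buffer_len` elements."""
    wrapped = index % buffer_len    # modulo: wrap the index into the buffer
    offset = wrapped << elem_shift  # bit shift: scale by element size 2**elem_shift
    return base + offset            # addition: apply the buffer's base address


def generate_addresses(base, indices, elem_shift, buffer_len):
    """Operate on several operands at once, as a batch of address requests."""
    return [generate_address(base, i, elem_shift, buffer_len) for i in indices]


print(hex(generate_address(0x1000, index=10, elem_shift=2, buffer_len=8)))
print([hex(a) for a in generate_addresses(0x1000, [0, 1, 9], 2, 8)])
```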

In various embodiments, memory chips of selectable size may be formed from wafer 3501 by selectively cutting different regions of the wafer. As mentioned, the wafer may include a group of dies 3503, the group including any group of two or more dies (e.g., 2, 3, 4, 5, 10, or more dies) included on the wafer. As will be discussed further below, in some cases a single memory chip may be formed by cutting a portion of the wafer that includes only one die of the group of dies. In such cases, the resulting memory chip would include memory units associated with one die. In other cases, however, memory chips of selectable size may be formed to include more than one die. Such memory chips may be formed by cutting regions of the wafer that include two or more dies of the group of dies included on the wafer. In such cases, the dies together with a coupling circuit that couples the dies together provide a multi-die memory chip. Some additional circuit elements, such as clock elements, data buses, or any suitable logic circuits, may also be wired on board between the chips.

In some cases, at least one controller associated with the group of dies may be configured to control the operation of the group of dies as a single memory chip (e.g., a multiple memory unit memory chip). The controller may include one or more circuits that manage the flow of data going to and from the memory chip. A memory controller may be a part of the memory chip, or it may be a part of a separate chip not directly related to the memory chip. In an example embodiment, the controller may be configured to facilitate read and write requests or other commands associated with the distributed processors of the memory chip, and may be configured to control any other suitable aspects of the memory chip (e.g., refreshing the memory chip, interacting with the distributed processors, and so on). In some cases, the controller may be part of die 3503, and in other cases the controller may be laid out adjacent to die 3503. In various embodiments, the controller may also include at least one memory controller for at least one of the memory units included on the memory chip. In some cases, a protocol used for accessing information on a memory chip may be agnostic to duplicate logic and memory units (e.g., memory banks) that may be present on the memory chip. The protocol may be configured to have different IDs or address ranges for adequate access to data on the memory chip. Examples of a chip with such a protocol may include a chip with a Joint Electron Device Engineering Council (JEDEC) double data rate (DDR) controller, where different memory banks may have different address ranges, a serial peripheral interface (SPI) connection, where different memory units (e.g., memory banks) have different identifications (IDs), and the like.

In various embodiments, multiple regions may be cut from the wafer, with each region including one or more dies. In some cases, each separate region may be used to build a multi-die memory chip. In other cases, each region to be cut from the wafer may include a single die to provide a single-die memory chip. In some cases, two or more of the regions may have the same shape and have the same number of dies coupled to the coupling circuit in the same way. Alternatively, in some example embodiments, a first group of regions may be used to form a first type of memory chip, and a second group of regions may be used to form a second type of memory chip. For example, as shown in FIG. 35C, wafer 3501 may include a region 3505 that may include a single die, and a second region 3504 that may include a group of two dies. When region 3505 is cut from wafer 3501, a single-die memory chip will be provided. When region 3504 is cut from wafer 3501, a multi-die memory chip will be provided. The groups shown in FIG. 35C are illustrative only, and various other regions and groups of dies may be cut out from wafer 3501.

In various embodiments, dies may be formed on wafer 3501 such that they are arranged along one or more rows of the wafer, as shown, for example, in FIG. 35C. The dies may share an input-output bus 3530 corresponding to the one or more rows. In an example embodiment, a group of dies may be cut out from wafer 3501 using various cutting shapes, where, when cutting out a group of dies that may be used to form a memory chip, at least a portion of the shared input-output bus 3530 may be excluded (e.g., only a portion of input-output bus 3530 may be included as a part of the memory chip formed to include the group of dies).

As previously discussed, when multiple dies (e.g., dies 3506A and 3506B, as shown in FIG. 35C) are used to form a memory chip 3517, one IO controller corresponding to one of the dies may be enabled and configured to control the data flow to all the processor subunits of dies 3506A and 3506B. For example, FIG. 35D shows memory dies 3506A and 3506B combined to form memory chip 3517, which includes memory banks 3511A through 3511H, processor subunits 3515A through 3515H, IO controllers 3521A and 3521B, and fuses 3554A and 3554B. It should be noted that, prior to its removal from the wafer, memory chip 3517 corresponds to a region 3517 of wafer 3501. In other words, as used here and elsewhere in the present disclosure, regions 3504, 3505, 3517, etc. of wafer 3501, once cut from wafer 3501, will result in memory chips 3504, 3505, 3517, etc. Additionally, fuses are also referred to herein as disabling elements. In an example embodiment, fuse 3554B may be used to deactivate IO controller 3521B, and IO controller 3521A may be used to control the data flow to all memory banks 3511A through 3511H by communicating data to processor subunits 3515A through 3515H. In an example embodiment, IO controller 3521A may be connected to the various processor subunits using any suitable connection. In some embodiments, as further described below, processor subunits 3515A through 3515H may be interconnected, and IO controller 3521A may be configured to control the data flow to processor subunits 3515A through 3515H, which form the processing logic of memory chip 3517.

In an example embodiment, IO controllers such as controllers 3521A and 3521B and corresponding fuses 3554A and 3554B may be formed on wafer 3501 together with memory banks 3511A through 3511H and processor subunits 3515A through 3515H. In various embodiments, when forming memory chip 3517, one of the fuses (e.g., fuse 3554B) may be activated such that dies 3506A and 3506B are configured to form memory chip 3517, which functions as a single chip and is controlled by a single input-output controller (e.g., controller 3521A). In an example embodiment, activating a fuse may include applying a current to trigger the fuse. In various embodiments, when more than one die is used to form a memory chip, all but one of the IO controllers may be deactivated via their corresponding fuses.
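The rule that all but one IO controller are deactivated can be stated as a small configuration function. This is a bookkeeping sketch only; the function name, the die labels, and the default choice of the first die as the active one are assumptions.

```python
def configure_io_fuses(die_ids, active_die=None):
    """Return {die: fuse_activated} for a chip built from `die_ids`.

    Exactly one IO controller stays enabled; the fuse of every other die's
    controller is activated (i.e., that controller is deactivated).
    """
    if not die_ids:
        raise ValueError("a memory chip needs at least one die")
    if active_die is None:
        active_die = die_ids[0]  # assumption: keep the first die's controller
    if active_die not in die_ids:
        raise ValueError("active die is not part of this chip")
    return {die: die != active_die for die in die_ids}


print(configure_io_fuses(["3506A", "3506B"]))  # multi-die chip
print(configure_io_fuses(["3505"]))            # single-die chip: no fuse activated
```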

In various embodiments, as shown in FIG. 35C, multiple dies are formed on wafer 3501 together with a set of input-output buses and/or control buses. An example input-output bus 3530 is shown in FIG. 35C. In an example embodiment, one of the input-output buses (e.g., input-output bus 3530) may be connected to multiple dies. FIG. 35C shows an example embodiment in which input-output bus 3530 passes close to dies 3506A and 3506B. The configuration of dies 3506A and 3506B and input-output bus 3530 shown in FIG. 35C is only illustrative, and various other configurations may be used. For example, FIG. 35E illustrates dies 3540 formed on wafer 3501 and arranged in a hexagonal formation. A memory chip 3532 that includes four dies 3540 may be cut from wafer 3501. In an example embodiment, memory chip 3532 may include a portion of input-output bus 3530 connected to the four dies by suitable bus lines (e.g., line 3533, as shown in FIG. 35E). To route information to the appropriate memory unit of memory chip 3532, memory chip 3532 may include input-output controllers 3542A and 3542B placed at branch points of bus 3530. Controllers 3542A and 3542B may receive command data via input-output bus 3530 and select a branch of bus 3530 for transmitting information to the appropriate memory unit. For example, if the command data includes information to be read from, or written to, memory units associated with die 3546, controller 3542A may receive the command request and transmit data to branch 3531A of bus 3530, as shown in FIG. 35D, while controller 3542B may receive a command request and transmit data to branch 3531B. FIG. 35E indicates various possible cuts of different regions, with cut lines represented by dashed lines.

In an example embodiment, a group of dies and interconnecting circuitry may be designed for inclusion in memory chip 3506, as shown in FIG. 36A. Such an embodiment may include processor subunits (for in-memory processing) that may be configured to communicate with one another. For example, each die to be included in memory chip 3506 may include various memory units such as memory banks 3511A to 3511D, processor subunits 3515A to 3515D, and IO controllers 3521 and 3522. IO controllers 3521 and 3522 may be connected in parallel to input-output bus 3530. IO controller 3521 may have fuse 3554, and IO controller 3522 may have fuse 3555. In an example embodiment, processor subunits 3515A to 3515D may be connected by means of, for example, bus 3613. In some cases, one of the IO controllers may be disabled using its corresponding fuse. For example, IO controller 3522 may be disabled using fuse 3555, and IO controller 3521 may control data flow into memory banks 3511A to 3511D via processor subunits 3515A to 3515D, which are connected to one another via bus 3613.

The configuration of memory units shown in FIG. 36A is only illustrative, and various other configurations may be formed by cutting different regions of wafer 3501. For example, FIG. 36B shows a configuration with three domains 3601 to 3603 that contain memory units and are connected to input-output bus 3530. In an example embodiment, domains 3601 to 3603 are connected to input-output bus 3530 using IO control modules 3521 to 3523, which may be disabled by corresponding fuses 3554 to 3556. Another example of an embodiment arranging domains that contain memory units is shown in FIG. 36C, in which three domains 3601, 3602, and 3603 are connected to input-output bus 3530 using bus lines 3611, 3612, and 3613. FIG. 36D shows another example embodiment, in which memory chips 3506A to 3506D are connected to input-output buses 3530A and 3530B via IO controllers 3521 to 3524. In an example embodiment, the IO controllers may be deactivated using corresponding fuse elements 3554 to 3557, as shown in FIG. 36D.

FIG. 37 shows various groups of dies 3503, such as group 3713 and group 3715, each of which may include one or more dies 3503. In an example embodiment, in addition to dies 3503 formed on wafer 3501, wafer 3501 may also contain logic circuits 3711, referred to as glue logic 3711. Glue logic 3711 may occupy some space on wafer 3501, so that fewer dies are fabricated per wafer 3501 than could have been fabricated without glue logic 3711. However, the presence of glue logic 3711 may allow multiple dies to be configured to function together as a single memory chip. For example, the glue logic may connect multiple dies without requiring configuration changes and without dedicating area inside any of the dies themselves to circuitry used only for connecting dies together. In various embodiments, glue logic 3711 provides an interface with other memory controllers, such that a multi-die memory chip functions as a single memory chip. Glue logic 3711 may be cut together with a group of dies (e.g., as shown by group 3713). Alternatively, if a memory chip requires only one die, as, for example, for group 3715, the glue logic may not be cut. For example, glue logic may be selectively eliminated where cooperation between different dies is not needed. In FIG. 37, various cuts of different regions may be made, as shown, for example, by the dashed-line regions. In various embodiments, as shown in FIG. 37, one glue logic element 3711 may be laid out on the wafer for every two dies 3506. In some cases, one glue logic element 3711 may be used for any suitable number of dies 3506 forming a group of dies. Glue logic 3711 may be configured to connect to all dies of the group. In various embodiments, dies connected to glue logic 3711 may be configured to form multi-die memory chips, and may be configured to form separate single-die memory chips when they are not connected to glue logic 3711. In various embodiments, dies connected to glue logic 3711 and designed to function together may be cut from wafer 3501 as a group that includes glue logic 3711, as indicated, for example, by group 3713. Dies not connected to glue logic 3711 may be cut from wafer 3501 without glue logic 3711 (as indicated, for example, by group 3715) to form single-die memory chips.

In some embodiments, during manufacturing of multi-die memory chips from wafer 3501, one or more cutting shapes (e.g., the shapes forming groups 3713 and 3715) may be determined for producing a desired set of multi-die memory chips. In some cases, as shown by group 3715, a cutting shape may exclude glue logic 3711.

In various embodiments, glue logic 3711 may be a controller for controlling multiple memory units of a multi-die memory chip. In some cases, glue logic 3711 may include parameters that may be modified by various other controllers. For example, coupling circuitry for a multi-die memory chip may include circuitry for configuring parameters of glue logic 3711 or parameters of memory controllers (e.g., processor subunits 3515A to 3515D, as shown, for example, in FIG. 35B). Glue logic 3711 may be configured to perform a variety of tasks. For example, logic 3711 may be configured to determine which die needs to be addressed. In some cases, logic 3711 may be used to synchronize multiple memory units. In various embodiments, logic 3711 may be configured to control various memory units such that the memory units operate as a single chip. In some cases, amplifiers may be added between an input-output bus (e.g., bus 3530, as shown in FIG. 35C) and processor subunits 3515A to 3515D to amplify data signals from bus 3530.
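One of the glue logic tasks named above, determining which die needs to be addressed, might be sketched as follows. The flat chip-level address space, the equal per-die capacity `words_per_die`, and the function name `decode_address` are assumptions made purely for illustration; the source does not specify an addressing scheme.

```python
def decode_address(global_addr, words_per_die, n_dies):
    """Split a chip-level address into (die index, address within that die).

    Assumes, hypothetically, that dies are mapped back to back in one flat
    address space and that every die holds `words_per_die` words.
    """
    if not 0 <= global_addr < words_per_die * n_dies:
        raise ValueError("address outside the multi-die memory chip")
    return divmod(global_addr, words_per_die)


# A chip built from four dies of 4096 words each.
die_index, local_addr = decode_address(5000, words_per_die=4096, n_dies=4)
```

With this mapping, a host sees one contiguous memory, which is one way a multi-die chip could present itself as a single memory chip.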

In various embodiments, cutting complex shapes from wafer 3501 may be technically difficult or expensive, and a simpler cutting approach may be adopted, provided that the dies are aligned on wafer 3501. For example, FIG. 38A shows dies 3506 aligned to form a rectangular grid. In an example embodiment, vertical cuts 3803 and horizontal cuts 3801 may be made across the entire wafer 3501 to separate groups of dies. In an example embodiment, vertical cuts 3803 and horizontal cuts 3801 may produce groups containing a selected number of dies. For example, cuts 3803 and 3801 may produce regions containing a single die (e.g., region 3811A), regions containing two dies (e.g., region 3811B), and regions containing four dies (e.g., region 3811C). The regions formed by cuts 3801 and 3803 are only illustrative, and any other suitable regions may be formed. In various embodiments, various cuts may be made depending on the alignment of the dies. For example, if the dies are arranged in a triangular grid, as shown in FIG. 38B, cut lines such as lines 3802, 3804, and 3806 may be used to produce multi-die memory chips. For example, some regions may include six dies, five dies, four dies, three dies, two dies, one die, or any other suitable number of dies.
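For the rectangular-grid case, the way full-wafer cuts yield regions of one, two, or four dies can be worked through in a short sketch. The function name `regions_from_cuts` and the convention that a cut at index k separates row (or column) k-1 from row (or column) k are assumptions for illustration, and the wafer is idealized as a complete rectangle of dies.

```python
def regions_from_cuts(n_rows, n_cols, h_cuts, v_cuts):
    """Return the die count of each region produced by full-width cuts.

    h_cuts and v_cuts list the interior grid lines along which the wafer
    is cut; every cut spans the whole (idealized, rectangular) wafer.
    """
    if any(not 0 < k < n_rows for k in h_cuts):
        raise ValueError("horizontal cut outside the grid")
    if any(not 0 < k < n_cols for k in v_cuts):
        raise ValueError("vertical cut outside the grid")
    row_edges = [0, *sorted(set(h_cuts)), n_rows]
    col_edges = [0, *sorted(set(v_cuts)), n_cols]
    counts = []
    for r0, r1 in zip(row_edges, row_edges[1:]):
        for c0, c1 in zip(col_edges, col_edges[1:]):
            counts.append((r1 - r0) * (c1 - c0))  # dies inside this rectangle
    return counts


# A 4 x 4 grid of dies with two horizontal and two vertical cuts.
counts = regions_from_cuts(4, 4, h_cuts=[1, 2], v_cuts=[1, 2])
```

Here the same set of straight cuts produces single-die, two-die, and four-die regions, comparable to regions 3811A, 3811B, and 3811C.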

FIG. 38C shows bus lines 3530 arranged in a triangular grid, with dies 3503 aligned at the centers of the triangles formed by intersecting bus lines 3530. Dies 3503 may be connected to all neighboring bus lines via bus lines 3820. When a region containing two or more adjacent dies is cut (e.g., region 3822, as shown in FIG. 38C), at least one bus line (e.g., line 3824) remains within region 3822, and bus line 3824 may be used to supply data and commands to the multi-die memory chip formed from region 3822.

FIG. 39 shows various connections that may be formed between processor subunits 3515A to 3515P to allow a group of memory units to function as a single memory chip. For example, group 3901 of various memory units may include connection 3905 between processor subunit 3515B and subunit 3515E. Connection 3905 may serve as a bus line for transmitting data and commands from subunit 3515B to subunit 3515E, which may be used to control the respective memory bank 3511E. In various embodiments, connections between processor subunits may be implemented during formation of the dies on wafer 3501. In some cases, additional connections may be fabricated during a packaging stage of a memory chip formed from several dies.

As shown in FIG. 39, processor subunits 3515A to 3515P may be connected to one another using various buses (e.g., connection 3905). Connection 3905 may be free of timing hardware logic components, such that data transfers between processor subunits and across connection 3905 may not be controlled by timing hardware logic components. In various embodiments, the buses connecting processor subunits 3515A to 3515P may be laid out on wafer 3501 before various circuits are fabricated on wafer 3501.

In various embodiments, processor subunits (e.g., subunits 3515A to 3515P) may be interconnected. For example, subunits 3515A to 3515P may be connected by suitable buses (e.g., connection 3905). Connection 3905 may connect any one of subunits 3515A to 3515P with any other of subunits 3515A to 3515P. In an example embodiment, connected subunits may be on the same die (e.g., subunits 3515A and 3515B), and in other cases, connected subunits may be on different dies (e.g., subunits 3515B and 3515E). Connection 3905 may include a dedicated bus for connecting subunits and may be configured to efficiently transmit data between subunits 3515A to 3515P.

Various aspects of the present invention relate to methods for producing selectable-size memory chips from a wafer. In an example embodiment, a selectable-size memory chip may be formed from one or more dies. As noted before, the dies may be arranged along one or more rows, as shown, for example, in FIG. 35C. In some cases, at least one shared input-output bus corresponding to the one or more rows may be laid out on wafer 3501. For example, bus 3530 may be laid out as shown in FIG. 35C. In various embodiments, bus 3530 may be electrically connected to memory units of at least two of the dies, and the connected dies may be used to form a multi-die memory chip. In an example embodiment, one or more controllers (e.g., input-output controllers 3521 and 3522, as shown in FIG. 35B) may be configured to control the memory units of the at least two dies used to form the multi-die memory chip. In various embodiments, the dies having memory units connected to bus 3530 may be cut from the wafer together with at least one corresponding portion of the shared input-output bus (e.g., bus 3530, as shown in FIG. 35B), which transmits information to at least one controller (e.g., controllers 3521, 3522) to configure the controller to control the memory units of the connected dies to function together as a single chip.

In some cases, memory units located on wafer 3501 may be tested before memory chips are manufactured by cutting regions of wafer 3501. The testing may be performed using at least one shared input-output bus (e.g., bus 3530, as shown in FIG. 35C). When memory units pass the testing, a memory chip may be formed from a group of dies containing those memory units. Memory units that fail the testing may be discarded and not used for manufacturing memory chips.

FIG. 40 shows an example process 4000 of building memory chips from a group of dies. At step 4011 of process 4000, dies may be laid out on semiconductor wafer 3501. At step 4015, the dies may be fabricated on wafer 3501 using any suitable approach. For example, dies may be fabricated by etching wafer 3501, depositing various dielectric, metallic, or semiconductor layers, further etching the deposited layers, and so on. For example, multiple layers may be deposited and etched. In various embodiments, layers may be n-type doped or p-type doped using any suitable doping elements. For example, semiconductor layers may be n-type doped with phosphorus and p-type doped with boron. As shown in FIG. 35A, dies 3503 may be separated from one another by a space that may be used to cut dies 3503 from wafer 3501. For example, dies 3503 may be spaced apart from one another by spacing regions, where the width of the spacing regions may be selected to allow wafer cuts in these regions.

At step 4017, dies 3503 may be cut from wafer 3501 using any suitable approach. In an example embodiment, dies 3503 may be cut using a laser. In an example embodiment, wafer 3501 may be scribed first, followed by mechanical dicing. Alternatively, a mechanical dicing saw may be used. In some cases, a stealth dicing process may be used. During dicing, wafer 3501 may be mounted on a dicing tape that holds the dies once they are cut. In various embodiments, large cuts may be made, as shown, for example, by cuts 3801 and 3803 in FIG. 38A or by cuts 3802, 3804, or 3806 in FIG. 38B. Once dies 3503 are cut out individually or in groups, as shown, for example, by group 3504 in FIG. 35C, dies 3503 may be packaged. Packaging of the dies may include forming contacts to dies 3503, depositing protective layers over the contacts, attaching heat-management devices (e.g., heat sinks), and encapsulating dies 3503. In various embodiments, an appropriate configuration of contacts and buses may be used depending on how many dies are selected to form a memory chip. In an example embodiment, some of the contacts between different dies forming the memory chip may be made during packaging of the memory chip.

FIG. 41A shows an example process 4100 for manufacturing memory chips containing multiple dies. Step 4011 of process 4100 may be the same as step 4011 of process 4000. At step 4111, glue logic 3711 may be laid out on wafer 3501, as shown in FIG. 37. Glue logic 3711 may be any suitable logic for controlling operations of dies 3506, as shown in FIG. 37. As described before, the presence of glue logic 3711 may allow multiple dies to function as a single memory chip. Glue logic 3711 may provide an interface with other memory controllers, such that a memory chip formed from multiple dies functions as a single memory chip.

At step 4113 of process 4100, buses (e.g., input-output buses and control buses) may be laid out on wafer 3501. The buses may be laid out such that they are connected with various dies and logic circuits, such as glue logic 3711. In some cases, buses may connect memory units. For example, buses may be configured to connect processor subunits of different dies. At step 4115, dies, glue logic, and buses may be fabricated using any suitable approach. For example, logic elements may be fabricated by etching wafer 3501, depositing various dielectric, metallic, or semiconductor layers, further etching the deposited layers, and so on. Buses may be fabricated using, for example, metal evaporation.

At step 4140, cutting shapes may be used to cut groups of dies connected to a single glue logic 3711, as shown, for example, in FIG. 37. The cutting shapes may be determined using memory requirements for a memory chip containing multiple dies 3503. For example, FIG. 41B shows process 4101, which may be a variation of process 4100 in which step 4140 is preceded by steps 4117 and 4119. At step 4117, a system for cutting wafer 3501 may receive instructions describing requirements for a memory chip. For example, the requirements may include forming a memory chip that includes four dies 3503. In some cases, at step 4119, program software may determine a periodic pattern for a group of dies and glue logic 3711. For example, a periodic pattern may include two glue logic elements 3711 and four dies 3503, with every two dies connected to one glue logic 3711. Alternatively, at step 4119, the pattern may be provided by a designer of the memory chips.

In some cases, the pattern may be selected to maximize the yield of memory chips from wafer 3501. In an example embodiment, memory units of dies 3503 may be tested to identify dies with faulty memory units (such dies are referred to as faulty or failed dies), and, based on the locations of the faulty dies, groups of dies 3503 that contain memory units passing the test may be identified and an appropriate cutting pattern may be determined. For example, if a large number of dies 3503 fail at the edges of wafer 3501, a cutting pattern may be determined that avoids dies at the edges of wafer 3501. Other steps of process 4101, such as steps 4011, 4111, 4113, 4115, and 4140, may be the same as the identically numbered steps of process 4100.
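The idea of choosing die groups around failed dies can be sketched with a simple greedy pass over a pass/fail map. This is not the method of the source, which leaves the pattern selection open; the Boolean `pass_map`, the restriction to horizontal runs of adjacent dies, and the function name `select_groups` are all assumptions made for illustration.

```python
def select_groups(pass_map, group_size):
    """Greedily pick horizontal runs of `group_size` passing dies in each row.

    pass_map[r][c] is True when the die at row r, column c passed testing.
    Returns a list of groups, each a list of (row, column) die positions.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    groups = []
    for r, row in enumerate(pass_map):
        c = 0
        while c + group_size <= len(row):
            if all(row[c:c + group_size]):
                groups.append([(r, c + i) for i in range(group_size)])
                c += group_size  # these dies are used; continue after them
            else:
                c += 1  # slide past a position that includes a failed die
    return groups


# Failed dies (False) cluster at the left edge of this hypothetical wafer map.
pass_map = [
    [False, True, True, True],
    [True, True, True, True],
    [False, False, True, True],
]
groups = select_groups(pass_map, group_size=2)
```

In this example the selected two-die groups avoid every failed die, at the cost of leaving one passing die in the first row unused; a production planner would likely also weigh glue logic placement and cut-line constraints.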

FIG. 41C shows an example process 4102 that may be a variation of process 4101. Steps 4011, 4111, 4113, 4115, and 4140 of process 4102 may be the same as the identically numbered steps of process 4101; step 4131 of process 4102 may substitute for step 4117 of process 4101, and step 4133 of process 4102 may substitute for step 4119 of process 4101. At step 4131, a system for cutting wafer 3501 may receive instructions describing requirements for a first set of memory chips and a second set of memory chips. For example, the requirements may include forming a first set of memory chips in which each memory chip consists of four dies 3503, and forming a second set of memory chips in which each memory chip consists of two dies 3503. In some cases, more than two sets of memory chips may need to be formed from wafer 3501. For example, a third set of memory chips may include memory chips consisting of only one die 3503. In some cases, at step 4133, program software may determine a periodic pattern for groups of dies and glue logic 3711 for forming the memory chips of each set. For example, the first set of memory chips may include memory chips containing two glue logic elements 3711 and four dies 3503, with every two dies connected to one glue logic 3711. In various embodiments, glue logic units 3711 for the same memory chip may be linked together to function as a single glue logic. For example, during fabrication of glue logic 3711, appropriate bus lines may be formed that link glue logic units 3711 with one another.

The second set of memory chips may include memory chips containing one glue logic 3711 and two dies 3503, with dies 3503 connected to glue logic 3711. In some cases, when a third set of memory chips is selected and includes memory chips consisting of a single die 3503, no glue logic 3711 may be needed for these memory chips.
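The die and glue logic counts implied by the three example chip sets can be tallied with a short calculation. The sketch below assumes one glue logic element per two dies (as in the example of FIG. 37) and none for single-die chips; the function names and the order quantities in the usage example are hypothetical.

```python
import math


def glue_logic_per_chip(dies_per_chip, dies_per_glue=2):
    """Glue logic elements needed by one chip under the assumed 2:1 layout."""
    if dies_per_chip < 1:
        raise ValueError("a memory chip needs at least one die")
    if dies_per_chip == 1:
        return 0  # single-die chips may not need glue logic
    return math.ceil(dies_per_chip / dies_per_glue)


def wafer_budget(chip_sets):
    """Total (dies, glue logic elements) for a mapping dies_per_chip -> chip count."""
    dies = sum(size * count for size, count in chip_sets.items())
    glue = sum(glue_logic_per_chip(size) * count for size, count in chip_sets.items())
    return dies, glue


# Hypothetical order: 10 four-die chips, 20 two-die chips, 5 single-die chips.
total_dies, total_glue = wafer_budget({4: 10, 2: 20, 1: 5})
```

For this order the wafer must supply 85 good dies and 40 glue logic elements, which is the kind of requirement a pattern-selection step could check against the available wafer area.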

Dual-port functionality

When designing a memory chip or a memory instance within a chip, one important characteristic is the number of words that can be accessed simultaneously during a single clock cycle. The more addresses that can be accessed at the same time for reading and/or writing (e.g., addresses along rows, also called words or word lines, and columns, also called bits or bit lines), the faster the memory chip. Although there has been some activity in developing memories that include multiple ports allowing simultaneous access to multiple addresses, for example, for building register files, caches, or shared memories, most such instances use memory mats that are large in size and support multiple address accesses. DRAM chips, however, usually include a single bit line and a single row line connected to each capacitor of each memory cell. Accordingly, embodiments of the present invention seek to provide multi-port access on existing DRAM chips without modifying this conventional single-port memory structure of DRAM arrays.

Embodiments of the present invention may clock memory instances or chips at twice the speed of the logic circuits that use the memory. Any logic circuit that uses the memory may thus "correspond" to the memory and any of its components. Accordingly, embodiments of the present invention may retrieve or write two addresses in two memory array clock cycles, which are equivalent to a single processing clock cycle for the logic circuits. The logic circuits may comprise circuits such as controllers, accelerators, GPUs, or CPUs, or may comprise processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A. As explained above with respect to FIG. 3A, a "processing group" may refer to two or more processor subunits and their corresponding memory banks on a substrate. The group may represent a spatial distribution on the substrate and/or a logical grouping for the purposes of compiling code for execution on memory chip 2800. Accordingly, as described above with respect to FIG. 7A, a substrate with the memory chip may include a memory array with a plurality of banks, such as bank 2801a and other banks shown in FIG. 28. Furthermore, the substrate may also include a processing array that may include a plurality of processor subunits (such as subunits 730a, 730b, 730c, 730d, 730e, 730f, 730g, and 730h shown in FIG. 7A).

Accordingly, embodiments of the present invention may retrieve data from the array in each of two consecutive memory cycles in order to handle two addresses for each logic cycle and provide the logic with two results, as though the single-port memory array were a two-port memory chip. Additional clocking may allow memory chips of the present invention to function as though the single-port memory array were a two-port memory instance, a three-port memory instance, a four-port memory instance, or any other multi-port memory instance.

FIG. 42 depicts example circuitry 4200, consistent with the present invention, that provides dual-port access along columns of a memory chip in which circuitry 4200 is used. The embodiment depicted in FIG. 42 may use one memory array 4201 with two column multiplexers ("muxes") 4205a and 4205b to access two words on the same row during the same clock cycle of the logic circuit. For example, during a memory clock cycle, RowAddrA is used in row decoder 4203 and ColAddrA is used in multiplexer 4205a to buffer data from the memory cell having address (RowAddrA, ColAddrA). During the same memory clock cycle, ColAddrB is used in multiplexer 4205b to buffer data from the memory cell having address (RowAddrA, ColAddrB). Thus, circuitry 4200 may allow dual-port access to data (e.g., DataA and DataB) stored in memory cells at two different addresses along the same row or word line. The two addresses may therefore share a row, such that row decoder 4203 activates the same word line for both retrievals. Moreover, embodiments like the example depicted in FIG. 42 may use column multiplexers so that the two addresses can be accessed during the same memory clock cycle.
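The same-row access of FIG. 42 can be sketched behaviorally as follows. This is a simplified illustration, not the circuit itself: the function name `read_two_words_same_row` and the list-of-lists array are assumptions introduced here.

```python
def read_two_words_same_row(array, row_addr, col_addr_a, col_addr_b):
    """Model one memory clock cycle in which a single word line feeds two column muxes."""
    word_line = array[row_addr]     # row decoder activates one word line (RowAddrA)
    data_a = word_line[col_addr_a]  # first column mux selects ColAddrA
    data_b = word_line[col_addr_b]  # second column mux selects ColAddrB
    return data_a, data_b


# Hypothetical 4x8 array whose cell value encodes its own (row, column) position.
array = [[10 * r + c for c in range(8)] for r in range(4)]
data_a, data_b = read_two_words_same_row(array, row_addr=2, col_addr_a=1, col_addr_b=6)
```

Because only one word line is activated, no second memory cycle is needed in this model; the cost is the second multiplexer.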

Similarly, FIG. 43 depicts example circuitry 4300, consistent with the present invention, that provides dual-port access along rows of a memory chip in which circuitry 4300 is used. The embodiment depicted in FIG. 43 may use one memory array 4301 with a row decoder 4303 coupled to a multiplexer ("mux") to access two words on the same column during the same clock cycle of the logic circuit. For example, on the first of two memory clock cycles, RowAddrA is used in row decoder 4303 and ColAddrA is used in column multiplexer 4305 to buffer data from the memory cell having address (RowAddrA, ColAddrA) (e.g., to the "Buffered Word" buffer of FIG. 43). On the second of the two memory clock cycles, RowAddrB is used in row decoder 4303 and ColAddrA is used in column multiplexer 4305 to buffer data from the memory cell having address (RowAddrB, ColAddrA). Thus, circuitry 4300 may allow dual-port access to data (e.g., DataA and DataB) stored in memory cells at two different addresses along the same column or bit line. The two addresses may therefore share a column, such that the column decoder (which may be separate from or combined with one or more column multiplexers, as depicted in FIG. 43) activates the same bit line for both retrievals. Embodiments like the example depicted in FIG. 43 may use two memory clock cycles because row decoder 4303 may require one memory clock cycle to activate each word line. Accordingly, a memory chip using circuitry 4300 may function as a dual-port memory if clocked at least twice as fast as the corresponding logic circuit.

Accordingly, as explained above, the circuitry of FIG. 43 may retrieve DataA and DataB during two memory clock cycles that are faster than the clock cycle of the corresponding logic circuit. For example, the row decoder (e.g., row decoder 4303 of FIG. 43) and the column decoder (which may be separate from or combined with one or more column multiplexers, as depicted in FIG. 43) may be configured to be clocked at a rate at least twice the rate at which the corresponding logic circuit generates the two addresses. For example, a clock circuit for circuitry 4300 (not shown in FIG. 43) may clock circuitry 4300 at a rate at least twice the rate at which the corresponding logic circuit generates the two addresses.

The embodiments of FIG. 42 and FIG. 43 may be used separately or in combination. Accordingly, circuitry providing dual-port functionality on a single-port memory array or mat (e.g., circuitry 4200 or 4300) may comprise a plurality of memory banks arranged along at least one row and at least one column. The plurality of memory banks is depicted as memory array 4201 in FIG. 42 and as memory array 4301 in FIG. 43. The embodiments may further use at least one row multiplexer (as depicted in FIG. 43) or at least one column multiplexer (as depicted in FIG. 42) configured to receive, during a single clock cycle, two addresses for reading or writing. Moreover, the embodiments may use a row decoder (e.g., row decoder 4203 of FIG. 42 and row decoder 4303 of FIG. 43) and a column decoder (which may be separate from or combined with one or more column multiplexers, as depicted in FIGS. 42 and 43) to read from or write to the two addresses. For example, during a first cycle, the row decoder and the column decoder may retrieve the first of the two addresses from the at least one row multiplexer or the at least one column multiplexer and decode the word line and bit line corresponding to the first address. During a second cycle, the row decoder and the column decoder may retrieve the second of the two addresses from the at least one row multiplexer or the at least one column multiplexer and decode the word line and bit line corresponding to the second address. The retrievals may each comprise activating the word line corresponding to the address using the row decoder and activating, on the activated word line, the bit line corresponding to the address using the column decoder.

Although described above for retrievals, the embodiments of FIG. 42 and FIG. 43, whether implemented separately or in combination, may include write commands. For example, during the first cycle, the row decoder and the column decoder may write first data, retrieved from the at least one row multiplexer or the at least one column multiplexer, to the first of the two addresses. During the second cycle, the row decoder and the column decoder may write second data, retrieved from the at least one row multiplexer or the at least one column multiplexer, to the second of the two addresses.

The example of FIG. 42 shows this process when the first and second addresses share a word line address, while the example of FIG. 43 shows this process when the first and second addresses share a column address. As described further below with respect to FIG. 47, the same process may be implemented when the first and second addresses share neither a word line address nor a column address.

Accordingly, although the examples above provide dual-port access along at least one of rows or columns, additional embodiments may provide dual-port access along both rows and columns. FIG. 44 depicts example circuitry 4400, consistent with the present invention, that provides dual-port access along both rows and columns of a memory chip in which circuitry 4400 is used. Accordingly, circuitry 4400 may represent a combination of circuitry 4200 of FIG. 42 and circuitry 4300 of FIG. 43.

The embodiment depicted in FIG. 44 may use one memory array 4401 with a row decoder 4403 coupled to a multiplexer ("mux") to access two rows during the same clock cycle of the logic circuit. Moreover, the embodiment depicted in FIG. 44 may use memory array 4401 with a column decoder (or multiplexer) 4405 coupled to a multiplexer ("mux") to access two columns during the same clock cycle. For example, on the first of two memory clock cycles, RowAddrA is used in row decoder 4403 and ColAddrA is used in column multiplexer 4405 to buffer data from the memory cell having address (RowAddrA, ColAddrA) (e.g., to the "Buffered Word" buffer of FIG. 44). On the second of the two memory clock cycles, RowAddrB is used in row decoder 4403 and ColAddrB is used in column multiplexer 4405 to buffer data from the memory cell having address (RowAddrB, ColAddrB). Thus, circuitry 4400 may allow dual-port access to data (e.g., DataA and DataB) stored in memory cells at two different addresses. Embodiments like the example depicted in FIG. 44 may use the additional buffer because row decoder 4403 may require one memory clock cycle to activate each word line. Accordingly, a memory chip using circuitry 4400 may function as a dual-port memory if clocked at least twice as fast as the corresponding logic circuit.

Although not depicted in FIG. 44, circuitry 4400 may further include the additional circuitry of FIG. 46 (described further below) along the rows or word lines and/or similar additional circuitry along the columns or bit lines. Accordingly, circuitry 4400 may activate corresponding circuitry (e.g., by opening one or more switching elements, such as one or more of switching elements 4613a, 4613b, and the like of FIG. 46) to activate the disconnected portions that include the addresses (e.g., by connecting voltages or allowing current to flow to the disconnected portions). Circuitry may thus "correspond" when elements of the circuitry (such as lines or the like) include locations identified by the addresses and/or when elements of the circuitry (such as switching elements) control a supply of voltage and/or a flow of current to the memory cells identified by the addresses. Circuitry 4400 may then use row decoder 4403 and column multiplexer 4405 to decode the corresponding word lines and bit lines in order to retrieve data from, or write data to, the addresses located in the activated disconnected portions.

As further depicted in FIG. 44, circuitry 4400 may further use at least one row multiplexer (depicted as separate from row decoder 4403 but possibly incorporated therein) and/or at least one column multiplexer (e.g., depicted as separate from column multiplexer 4405 but possibly incorporated therein) configured to receive, during a single clock cycle, two addresses for reading or writing. Accordingly, embodiments may use a row decoder (e.g., row decoder 4403) and a column decoder (which may be separate from or combined with column multiplexer 4405) to read from or write to the two addresses. For example, during a memory clock cycle, the row decoder and the column decoder may retrieve the first of the two addresses from the at least one row multiplexer or the at least one column multiplexer and decode the word line and bit line corresponding to the first address. During the same memory cycle, the row decoder and the column decoder may retrieve the second of the two addresses from the at least one row multiplexer or the at least one column multiplexer and decode the word line and bit line corresponding to the second address.

FIGS. 45A and 45B depict existing replication techniques for providing dual-port functionality on a single-port memory array or mat. As shown in FIG. 45A, dual-port reads may be provided by keeping copies of the data synchronized across memory arrays or mats. Reads may therefore be performed from both copies of the memory instance, as depicted in FIG. 45A. Moreover, as shown in FIG. 45B, dual-port writes may be provided by replicating all writes across the memory arrays or mats. For example, the memory chip may require that logic circuits using the memory chip send write commands in duplicate, one for each copy of the data. Alternatively, in some embodiments, as shown in FIG. 45A, additional circuitry may allow the logic circuits using the memory instance to send a single write command, which the additional circuitry automatically replicates to produce copies of the written data across the memory arrays or mats, thereby keeping the copies synchronized. The embodiments of FIGS. 42, 43, and 44 may reduce the redundancy of these existing replication techniques, either by using multiplexers to access two bit lines in a single memory clock cycle (e.g., as depicted in FIG. 42) and/or by clocking the memory faster than the corresponding logic circuit (e.g., as depicted in FIGS. 43 and 44) and providing additional multiplexers to handle the additional addresses, rather than replicating all of the data in the memory.
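To make the redundancy of the replication approach concrete, the following toy model keeps two synchronized copies of a memory and duplicates every write, roughly in the spirit of FIGS. 45A and 45B. The class name `ReplicatedMemory` and its methods are hypothetical and serve only as an illustration.

```python
class ReplicatedMemory:
    """Toy model of replication: two synchronized copies of one single-port memory."""

    def __init__(self, size):
        self.copies = [[0] * size, [0] * size]

    def write(self, addr, value):
        # Every write is duplicated so that both copies stay synchronized.
        for copy in self.copies:
            copy[addr] = value

    def read_pair(self, addr_a, addr_b):
        # Each copy serves one of the two simultaneous reads.
        return self.copies[0][addr_a], self.copies[1][addr_b]

    def cells_used(self):
        # Total storage consumed, counting both copies.
        return sum(len(copy) for copy in self.copies)


memory = ReplicatedMemory(size=8)
memory.write(3, 7)
memory.write(5, 9)
pair = memory.read_pair(3, 5)
```

In this model, eight addressable words consume sixteen cells, which is the overhead that the multiplexer-based and faster-clocked embodiments aim to avoid.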

In addition to the faster clocking and/or additional multiplexers described above, embodiments of the present invention may use circuitry that disconnects bit lines and/or word lines at some points within the memory array. Such embodiments may allow multiple simultaneous accesses to the array, as long as the row and column decoders access different locations that are not coupled to the same portions of the disconnect circuitry. For example, locations with different word lines and bit lines may be accessed simultaneously because the disconnect circuitry may allow the row and column decoding to access different addresses without electrical interference. During design of the memory chip, the granularity of the disconnected regions within the memory array may be weighed against the additional area required by the disconnect circuitry.
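The condition for simultaneous access can be sketched as a conservative check on whether two addresses share a word line segment or a bit line segment. The segment sizes and all function names below are assumptions chosen for illustration; the check is only a sufficient condition in this toy model, and other mechanisms (such as the two column multiplexers of FIG. 42) may still permit accesses that it rejects.

```python
# Hypothetical granularity of the disconnected regions (illustrative values only).
COLS_PER_WORD_LINE_SEGMENT = 16
ROWS_PER_BIT_LINE_SEGMENT = 2


def word_line_segment(row, col):
    """Identify the word line segment holding the cell: same row, same column span."""
    return (row, col // COLS_PER_WORD_LINE_SEGMENT)


def bit_line_segment(row, col):
    """Identify the bit line segment holding the cell: same column, same row span."""
    return (col, row // ROWS_PER_BIT_LINE_SEGMENT)


def electrically_independent(addr_a, addr_b):
    """True if the two cells share neither a word line segment nor a bit line segment."""
    return (word_line_segment(*addr_a) != word_line_segment(*addr_b)
            and bit_line_segment(*addr_a) != bit_line_segment(*addr_b))
```

Smaller segments make more address pairs independent in this model, at the price of more switching elements, which mirrors the granularity-versus-area tradeoff noted above.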

An architecture for implementing such simultaneous access is depicted in FIG. 46. In particular, FIG. 46 depicts example circuitry 4600 that provides dual-port functionality on a single-port memory array or mat. As depicted in FIG. 46, circuitry 4600 may include a plurality of memory mats (e.g., memory mat 4609a, mat 4609b, and the like) arranged along at least one row and at least one column. The layout of circuitry 4600 further includes a plurality of word lines, such as word lines 4611a and 4611b corresponding to rows, and bit lines 4615a and 4615b corresponding to columns.

The example of FIG. 46 includes twelve memory mats, each having two lines and eight columns. In other embodiments, the substrate may include any number of memory mats, and each memory mat may include any number of lines and any number of columns. Some memory mats may include the same number of lines and columns (as shown in FIG. 46), while other memory mats may include different numbers of lines and/or columns.

Although not depicted in FIG. 46, circuitry 4600 may further use at least one row multiplexer (separate from or incorporated with row decoder 4601a and/or 4601b) or at least one column multiplexer (e.g., column multiplexer 4603a and/or 4603b) configured to receive, during a single clock cycle, two (or three, or any plurality of) addresses for reading or writing. Moreover, embodiments may use a row decoder (e.g., row decoder 4601a and/or 4601b) and a column decoder (which may be separate from or combined with column multiplexer 4603a and/or 4603b) to read from or write to the two (or more) addresses. For example, during a memory clock cycle, the row decoder and the column decoder may retrieve the first of the two addresses from the at least one row multiplexer or the at least one column multiplexer and decode the word line and bit line corresponding to the first address. During the same memory cycle, the row decoder and the column decoder may retrieve the second of the two addresses from the at least one row multiplexer or the at least one column multiplexer and decode the word line and bit line corresponding to the second address. As explained above, the two accesses may occur during the same memory clock cycle as long as the two addresses are in different locations that are not coupled to the same portions of the disconnect circuitry (e.g., switching elements such as 4613a, 4613b, and the like). Additionally, circuitry 4600 may access a first two addresses simultaneously during a first memory clock cycle and then access a next two addresses simultaneously during a second memory clock cycle. In such embodiments, a memory chip using circuitry 4600 may function as a four-port memory if clocked at least twice as fast as the corresponding logic circuit.
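The four-port figure follows from simple arithmetic: two simultaneous accesses per memory cycle, times two memory cycles per logic cycle. A minimal sketch of that calculation is given below; the function `effective_ports` and the clock values are hypothetical and only illustrate the relationship.

```python
def effective_ports(simultaneous_accesses, memory_clock_hz, logic_clock_hz):
    """Addresses served per logic clock cycle, as seen by the logic circuit."""
    # Whole memory clock cycles that fit inside one logic clock cycle.
    memory_cycles_per_logic_cycle = memory_clock_hz // logic_clock_hz
    return simultaneous_accesses * memory_cycles_per_logic_cycle


# Two simultaneous accesses with a memory clocked twice as fast as the logic.
ports_fig46 = effective_ports(2, memory_clock_hz=2_000_000_000, logic_clock_hz=1_000_000_000)

# One access per memory cycle with the same 2x clocking, as in FIG. 43.
ports_fig43 = effective_ports(1, memory_clock_hz=2_000_000_000, logic_clock_hz=1_000_000_000)
```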

FIG. 46 further includes at least one row circuit and at least one column circuit configured to function as switches. For example, corresponding switching elements such as 4613a, 4613b, and the like may comprise transistors or any other electrical elements configured to allow or stop a flow of current and/or to connect or disconnect a voltage to the word lines or bit lines connected to the switching elements such as 4613a, 4613b, and the like. The corresponding switching elements may thus divide circuitry 4600 into disconnected portions. Although depicted as comprising single rows with sixteen columns per row, the disconnected regions within circuitry 4600 may have different levels of granularity depending on the design of circuitry 4600.

Circuitry 4600 may use a controller (e.g., row control 4607) to activate corresponding ones of the at least one row circuit and the at least one column circuit in order to activate the corresponding disconnected regions during the address operations described above. For example, circuitry 4600 may transmit one or more control signals to close corresponding ones of the switching elements (e.g., switching elements 4613a, 4613b, and the like). In embodiments where switching elements 4613a, 4613b, and the like comprise transistors, the control signals may comprise voltages to open the transistors.

Depending on the disconnected regions that include the addresses, more than one of the switching elements may be activated by circuitry 4600. For example, to reach an address within memory mat 4609b of FIG. 46, the switching element allowing access to memory mat 4609a must be opened as well as the switching element allowing access to memory mat 4609b. Row control 4607 may determine which switching elements to activate in order to retrieve a particular address within circuitry 4600 according to that address.
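The switch selection performed by the row control can be sketched as follows, under the assumption that the mats along a word line are chained and that one switching element gates each mat. That one-switch-per-mat layout, the constant `COLUMNS_PER_MAT`, and the function `switches_to_open` are hypothetical simplifications rather than details taken from FIG. 46.

```python
COLUMNS_PER_MAT = 8  # matches the eight columns per mat in the FIG. 46 example


def switches_to_open(col_addr):
    """Return the indices of the switching elements needed to reach a column.

    Index 0 stands for the switch in front of the first mat on the word line
    (e.g., mat 4609a) and index 1 for the next mat (e.g., mat 4609b).
    """
    target_mat = col_addr // COLUMNS_PER_MAT
    # Every switch between the row decoder and the target mat must be opened.
    return list(range(target_mat + 1))


first_mat_switches = switches_to_open(5)    # column 5 lies in the first mat
second_mat_switches = switches_to_open(12)  # column 12 lies in the second mat
```

In this model, reaching the second mat requires opening both the first and the second switch, consistent with the 4609a and 4609b example above.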

FIG. 46 represents an example of circuitry 4600 used to divide the word lines of a memory array (e.g., comprising memory mat 4609a, mat 4609b, and the like). However, other embodiments may use similar circuitry (e.g., switching elements that divide circuitry 4600 into disconnected regions) to divide the bit lines of the memory array. Accordingly, the architecture of circuitry 4600 may be used for dual-column access (as depicted in FIG. 42 or FIG. 44) as well as for dual-row access (as depicted in FIG. 43 or FIG. 44).

A process for multi-cycle access to a memory array or mat is depicted in FIG. 47A. In particular, FIG. 47A is an example flowchart of a process 4700 for providing dual-port access on a single-port memory array or mat (e.g., using circuitry 4300 of FIG. 43 or circuitry 4400 of FIG. 44). Process 4700 may be executed using row and column decoders consistent with the present invention, such as row decoder 4303 or 4403 of FIG. 43 or FIG. 44, respectively, and a column decoder (which may be separate from or combined with one or more column multiplexers, such as column multiplexer 4305 or 4405 depicted in FIG. 43 or FIG. 44, respectively).

At step 4710, during a first memory clock cycle, the circuitry may use at least one row multiplexer and at least one column multiplexer to decode the word line and bit line corresponding to the first of two addresses. For example, at least one row decoder may activate the word line, and at least one column multiplexer may amplify a voltage from the memory cell that lies along the activated word line and corresponds to the first address. The amplified voltage may be provided to the logic circuit using the memory chip that includes the circuitry, or may be buffered according to step 4720 described below. The logic circuits may comprise circuits such as GPUs or CPUs, or may comprise processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A.

Although described above as a read operation, process 4700 may similarly handle a write operation. For example, at least one row decoder may activate the word line, and at least one column multiplexer may apply a voltage to the memory cell that lies along the activated word line and corresponds to the first address, in order to write new data to that memory cell. In some embodiments, the circuitry may provide a confirmation of the write to the logic circuit using the memory chip that includes the circuitry, or may buffer the confirmation according to step 4720 below.

At step 4720, the circuitry may buffer the data retrieved from the first address. For example, as depicted in FIGS. 43 and 44, the buffer may allow the circuitry to retrieve the second of the two addresses (as described below in step 4730) and return the results of both retrievals together. The buffer may comprise a register, an SRAM, a non-volatile memory, or any other data storage device.

At step 4730, during a second memory clock cycle, the circuitry may use at least one row multiplexer and at least one column multiplexer to decode the word line and bit line corresponding to the second of the two addresses. For example, at least one row decoder may activate the word line, and at least one column multiplexer may amplify a voltage from the memory cell that lies along the activated word line and corresponds to the second address. The amplified voltage may be provided to the logic circuit using the memory chip that includes the circuitry, whether individually or together with, for example, the buffered voltage from step 4720. The logic circuits may comprise circuits such as GPUs or CPUs, or may comprise processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A.

Although described above as a read operation, process 4700 may similarly handle a write operation. For example, at least one row decoder may activate the word line, and at least one column multiplexer may apply a voltage to the memory cell that lies along the activated word line and corresponds to the second address, in order to write new data to that memory cell. In some embodiments, the circuitry may provide a confirmation of the write to the logic circuit using the memory chip that includes the circuitry, whether individually or together with, for example, the buffered voltage from step 4720.

At step 4740, the circuitry may output the data retrieved from the second address together with the buffered data from the first address. For example, as depicted in FIGS. 43 and 44, the circuitry may return the results of both retrievals (e.g., from steps 4710 and 4730) together. The circuitry may return the results to the logic circuit using the memory chip that includes the circuitry. The logic circuits may comprise circuits such as GPUs or CPUs, or may comprise processing groups on the same substrate as the memory chip, e.g., as depicted in FIG. 7A.

Although described with reference to multiple cycles, process 4700 may allow single-cycle access to the two addresses if the two addresses share a word line, as depicted in FIG. 42. For example, steps 4710 and 4730 may be performed during the same memory clock cycle because multiple column multiplexers may decode different bit lines on the same word line during the same memory clock cycle. In such embodiments, buffering step 4720 may be skipped.
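Steps 4710 through 4740, including the single-cycle shortcut for addresses that share a word line, can be summarized in a short behavioral sketch. The function name `process_4700`, the list-of-lists array, and the returned cycle count are illustrative assumptions, not part of the described circuitry.

```python
def process_4700(array, addr_a, addr_b):
    """Return ((data_a, data_b), memory_cycles_used) for a dual-port read."""
    (row_a, col_a), (row_b, col_b) = addr_a, addr_b

    if row_a == row_b:
        # Shared word line: two column muxes read both cells in one memory
        # cycle, so the buffering step 4720 is skipped.
        word_line = array[row_a]
        return (word_line[col_a], word_line[col_b]), 1

    buffered_word = array[row_a][col_a]  # step 4710 (decode, read) and step 4720 (buffer)
    second_word = array[row_b][col_b]    # step 4730 in the second memory cycle
    return (buffered_word, second_word), 2  # step 4740: both results returned together


# Hypothetical 4x8 array whose cell value encodes its own (row, column) position.
array = [[10 * r + c for c in range(8)] for r in range(4)]
two_cycle_result = process_4700(array, (1, 2), (3, 5))
one_cycle_result = process_4700(array, (1, 2), (1, 7))
```

The cycle count makes the required clocking explicit: only the two-cycle path needs the memory to run at least twice as fast as the logic.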

A process for simultaneous access (e.g., using circuitry 4600 described above) is depicted in FIG. 47B. Accordingly, although shown sequentially, the steps of FIG. 47B may all be performed during the same memory clock cycle, and at least some of the steps may be executed simultaneously (e.g., steps 4760 and 4780, or steps 4770 and 4790). In particular, FIG. 47B is an example flowchart of a process 4750 for providing dual-port access on a single-port memory array or mat (e.g., using circuitry 4200 of FIG. 42 or circuitry 4600 of FIG. 46). Process 4750 may be executed using row and column decoders consistent with the present invention, such as row decoder 4203 or row decoders 4601a and 4601b of FIG. 42 or FIG. 46, respectively, and a column decoder (which may be separate from or combined with one or more column multiplexers, such as column multiplexers 4205a and 4205b or column multiplexers 4603a and 4603b depicted in FIG. 42 or FIG. 46, respectively).

At step 4760, during a memory clock cycle, the circuitry may activate corresponding ones of at least one row circuit and at least one column circuit based on a first address of two addresses. For example, the circuitry may transmit one or more control signals to close corresponding ones of the switching elements comprising the at least one row circuit and the at least one column circuit. Accordingly, the circuitry may access the corresponding disconnected region that includes the first address.

At step 4770, during the memory clock cycle, the circuitry may use at least one row multiplexer and at least one column multiplexer to decode the word line and bit line corresponding to the first address. For example, at least one row decoder may activate the word line, and at least one column multiplexer may amplify the voltage from the memory cell along the activated word line that corresponds to the first address. The amplified voltage may be provided to a logic circuit that uses the memory chip including the circuitry. For example, as described above, the logic circuit may comprise a circuit such as a GPU or CPU, or may comprise a processing group on the same substrate as the memory chip, e.g., as depicted in FIG. 7A.

Although described above as a read operation, method 4500 may similarly handle write operations. For example, at least one row decoder may activate the word line, and at least one column multiplexer may apply a voltage to the memory cell along the activated word line that corresponds to the first address, in order to write new data to that memory cell. In some embodiments, the circuitry may provide confirmation of the write to the logic circuit that uses the memory chip including the circuitry.

At step 4780, during the same cycle, the circuitry may activate corresponding ones of at least one row circuit and at least one column circuit based on a second address of the two addresses. For example, the circuitry may transmit one or more control signals to close corresponding ones of the switching elements comprising the at least one row circuit and the at least one column circuit. Accordingly, the circuitry may access the corresponding disconnected region that includes the second address.

At step 4790, during the same cycle, the circuitry may use at least one row multiplexer and at least one column multiplexer to decode the word line and bit line corresponding to the second address. For example, at least one row decoder may activate the word line, and at least one column multiplexer may amplify the voltage from the memory cell along the activated word line that corresponds to the second address. The amplified voltage may be provided to a logic circuit that uses the memory chip including the circuitry. For example, as described above, the logic circuit may comprise a conventional circuit such as a GPU or CPU, or may comprise a processing group on the same substrate as the memory chip, e.g., as depicted in FIG. 7A.

Although described above as a read operation, method 4500 may similarly handle write operations. For example, at least one row decoder may activate the word line, and at least one column multiplexer may apply a voltage to the memory cell along the activated word line that corresponds to the second address, in order to write new data to that memory cell. In some embodiments, the circuitry may provide confirmation of the write to the logic circuit that uses the memory chip including the circuitry.

Although described with reference to a single cycle, method 4500 may allow multi-cycle access to the two addresses if they lie in disconnected regions that share a word line or bit line (or otherwise share switching elements in the at least one row circuit and the at least one column circuit). For example, steps 4760 and 4770 may be performed during a first memory clock cycle, in which a first row decoder and a first column multiplexer decode the word line and bit line corresponding to the first address, while steps 4780 and 4790 may be performed during a second memory clock cycle, in which a second row decoder and a second column multiplexer decode the word line and bit line corresponding to the second address.
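The choice between the single-cycle and multi-cycle variants of the process of FIG. 47B can be sketched as below. The sketch assumes, purely for illustration, that each disconnected region is identified by a hypothetical pair of indices (one for its group of word lines, one for its group of bit lines), and that two regions with the same index in either position share the corresponding lines; the actual partitioning by switching elements is not specified here.

```python
from typing import NamedTuple


class Region(NamedTuple):
    """Hypothetical identifier of a disconnected region of the memory mat."""
    word_line_group: int  # regions with equal values are assumed to share word lines
    bit_line_group: int   # regions with equal values are assumed to share bit lines


def cycles_for_regions(first: Region, second: Region) -> int:
    """Memory clock cycles needed to access one address in each region."""
    shares_word_lines = first.word_line_group == second.word_line_group
    shares_bit_lines = first.bit_line_group == second.bit_line_group
    if shares_word_lines or shares_bit_lines:
        # Conflict: steps 4760/4770 run in a first cycle and
        # steps 4780/4790 run in a second cycle.
        return 2
    # No shared lines: all four steps may run in the same cycle.
    return 1
```

Under these assumptions, only addresses in regions that differ in both indices are served in a single memory clock cycle.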

Another example of an architecture for dual-port access along both rows and columns is depicted in FIG. 48. In particular, FIG. 48 depicts example circuitry 4800 that uses multiple row decoders in combination with multiple column multiplexers to provide dual-port access along both rows and columns. In FIG. 48, row decoder 4801a may access a first word line, and column multiplexer 4803a may decode data from one or more memory cells along the first word line, while row decoder 4801b may access a second word line, and column multiplexer 4803b may decode data from one or more memory cells along the second word line.

As described with respect to FIG. 47B, this access may occur simultaneously during one memory clock cycle. Accordingly, similar to the architecture of FIG. 46, the architecture of FIG. 48 (including the memory mat described below with respect to FIG. 49) may allow multiple addresses to be accessed in the same clock cycle. For example, the architecture of FIG. 48 may include any number of row decoders and any number of column multiplexers, such that a number of addresses corresponding to the number of row decoders and column multiplexers may all be accessed within a single memory clock cycle.

In other embodiments, this access may occur sequentially over two memory clock cycles. By clocking memory chip 4800 faster than the corresponding logic circuit, two memory clock cycles may be equivalent to one clock cycle of the logic circuit that uses the memory. For example, as described above, the logic circuit may comprise a conventional circuit such as a GPU or CPU, or may comprise a processing group on the same substrate as the memory chip, e.g., as depicted in FIG. 7A.

Other embodiments may allow simultaneous access. For example, as described with respect to FIG. 42, multiple column decoders (which may comprise column multiplexers, such as 4803a and 4803b, as shown in FIG. 48) may read multiple bit lines along the same word line during a single memory clock cycle. Additionally or alternatively, as described with respect to FIG. 46, circuitry 4800 may incorporate additional circuitry such that this access may be simultaneous. For example, row decoder 4801a may access the first word line, and column multiplexer 4803a may decode data from memory cells along the first word line, during the same memory clock cycle in which row decoder 4801b accesses the second word line and column multiplexer 4803b decodes data from memory cells along the second word line.

The architecture of FIG. 48 may be used with modified memory mats forming a memory bank, as shown in FIG. 49. In FIG. 49, each memory cell (depicted as a capacitor, similar to DRAM, but which may also comprise a number of transistors arranged as in SRAM or any other memory cell) is accessed by two word lines and two bit lines. Accordingly, memory mat 4900 of FIG. 49 allows two different bits, or even the same bit, to be accessed simultaneously by two different logic circuits. However, unlike the embodiments above, which implement a dual-port solution on standard DRAM memory mats wired for single-port access, the embodiment of FIG. 49 relies on a modification to the memory mats.

Although described as having two ports, any of the embodiments described above may be extended to more than two ports. For example, the embodiments of FIGS. 42, 46, 48, and 49 may include additional column multiplexers or row multiplexers, respectively, to provide access to additional columns or rows during a single clock cycle. As another example, the embodiments of FIGS. 43 and 44 may include additional row decoders and/or column multiplexers to provide access to additional rows or columns, respectively, during a single clock cycle.

記憶體中之可變字長存取 Variable word length access in memory

如上文及下文進一步所使用，術語「耦接」可包括直接連接、間接連接、電通信及其類似者。 As used further above and below, the term "coupled" can include direct connection, indirect connection, electrical communication, and the like.

In addition, terms such as "first," "second," and the like are used to distinguish elements or method steps having the same or similar names or titles, and do not necessarily indicate a spatial or temporal order.

Generally, a memory chip may include memory banks. A memory bank may be coupled to a row decoder and a column decoder configured to select a specific word (or other fixed-size data unit) to be read or written. Each memory bank may include memory cells for storing the data units, circuitry (e.g., sense amplifiers) for amplifying voltages from the memory cells selected by the row decoder and column decoder, and any other appropriate circuits.

每一記憶體組通常具有特定I/O寬度。舉例而言，I/O寬度可包含字。 Each memory bank usually has a specific I/O width. For example, the I/O width can include words.

雖然由使用記憶體晶片之邏輯電路執行之一些處理程序可受益於使用極長字，但一些其他處理程序可僅需要該字之一部分。 Although some processing procedures executed by logic circuits using memory chips can benefit from using extremely long words, some other processing procedures may only require a portion of the word.

In fact, in-memory computing units (such as processor subunits disposed on the same substrate as the memory chip, e.g., as depicted and described with respect to FIG. 7A) frequently perform memory access operations that require only a portion of the word.

To reduce the latency associated with accessing an entire word when only a portion of it is used, embodiments of the present disclosure may provide methods and systems for fetching only one or more portions of a word, thereby reducing the data losses associated with transferring unneeded portions of the word and allowing power savings in the memory device.

In addition, embodiments of the present disclosure may also reduce the power consumption of interactions between the memory chip and other entities that access the memory chip and that may receive or write only a portion of the word. Such entities include logic circuits, whether separate (such as CPUs and GPUs) or included on the same substrate as the memory chip (such as the processor subunits depicted and described with respect to FIG. 7A).

A memory access command (e.g., from a logic circuit using the memory) may include an address in the memory. For example, the address may include a row address and a column address, or may be translated into a row address and a column address, e.g., by a memory controller of the memory.

In many volatile memories, such as DRAM, the row address is sent (e.g., directly by the logic circuit or using a memory controller) to a row decoder, which activates the entire row (also referred to as a word line) and loads all of the bit lines included in that row.

The column address identifies the bit lines on the activated row that are transferred outside the memory bank that includes the bit lines and on to the next level of circuitry. For example, the next level of circuitry may comprise an I/O bus of the memory chip. In embodiments using in-memory processing, the next level of circuitry may comprise a processor subunit of the memory chip (e.g., as depicted in FIG. 7A).
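The two-stage addressing described above can be sketched as a toy software model. The representation of a bank as a list of rows of bits, the function name `read_word`, and the fixed `word_bits` parameter are illustrative assumptions; real hardware performs these steps with a decoder, sense amplifiers, and a multiplexer rather than list operations.

```python
from typing import List


def read_word(bank: List[List[int]], row: int, column: int, word_bits: int) -> List[int]:
    """Return the word at (row, column) from a bank modeled as rows of bits."""
    # Row address: activating the word line loads every bit line in that row,
    # even if only part of the row is eventually needed.
    loaded_bit_lines = list(bank[row])
    # Column address: select the bit lines that leave the bank toward the
    # next level of circuitry (an I/O bus or a processor subunit).
    start = column * word_bits
    return loaded_bit_lines[start:start + word_bits]
```

The model makes the inefficiency discussed in this section visible: the entire row is loaded regardless of how many bits the requester actually uses.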

Accordingly, the memory chips described below may be included in, or may otherwise comprise, a memory chip as illustrated in any one of FIG. 3A, FIG. 3B, FIGS. 4 to 6, FIGS. 7A to 7D, FIGS. 11 to 13, FIGS. 16 to 19, FIG. 22, or FIG. 23.

該記憶體晶片可藉由針對記憶體胞元而非邏輯胞元而最佳化之第一製造製程來製造。舉例而言，由第一製造製程所製造之記憶體胞元可展現比由第一製造製程所製造之邏輯電路之臨界尺寸小的臨界尺寸(例如，小超過2倍、3倍、4倍、5倍、6倍、7倍、8倍、9倍、10倍及其類似者)。舉例而言，第一製造製程可包含類比製造製程、DRAM製造製程及其類似者。 The memory chip can be manufactured by a first manufacturing process optimized for memory cells instead of logic cells. For example, the memory cell manufactured by the first manufacturing process may exhibit a critical size smaller than that of the logic circuit manufactured by the first manufacturing process (for example, more than 2 times, 3 times, 4 times, 5 times, 6 times, 7 times, 8 times, 9 times, 10 times and the like). For example, the first manufacturing process may include an analog manufacturing process, a DRAM manufacturing process, and the like.

The memory chip may comprise an integrated circuit, and the integrated circuit may include a memory unit. The memory unit may include memory cells, an output port, and read circuitry. In some embodiments, the memory unit may further include a processing unit, such as a processor subunit as described above.

舉例而言，該讀取電路系統可包括縮減單元及第一群組記憶體讀取路徑，該等記憶體讀取路徑用於經由輸出埠輸出多達第一數目個位元。該輸出埠可連接至晶片外邏輯電路(諸如，加速器、CPU、GPU或其類似者)或晶載處理器子單元，如上文所描述。 For example, the read circuit system may include a reduction unit and a first group of memory read paths, and the memory read paths are used to output up to a first number of bits through the output port. The output port can be connected to off-chip logic circuits (such as accelerators, CPUs, GPUs, or the like) or on-chip processor sub-units, as described above.

在一些實施例中，該處理單元可包括縮減單元，可為縮減單元之一部分，可不同於縮減單元，或可用其他方式包含縮減單元。 In some embodiments, the processing unit may include a reduction unit, may be a part of the reduction unit, may be different from the reduction unit, or may include the reduction unit in other ways.

The in-memory read paths may be included in the integrated circuit (e.g., in the memory unit) and may include any circuits and/or links configured for reading from and/or writing to the memory cells. For example, the in-memory read paths may include sense amplifiers, conductors coupled to the memory cells, multiplexers, and the like.

該處理單元可經組態以將讀取請求發送至該記憶體單元以自該記憶體單元讀取第二數目個位元。另外或替代地，該讀取請求可源自晶片外邏輯電路(諸如，加速器、CPU、GPU或其類似者)。 The processing unit can be configured to send a read request to the memory unit to read the second number of bits from the memory unit. Additionally or alternatively, the read request may originate from an off-chip logic circuit (such as an accelerator, CPU, GPU, or the like).

該縮減單元可經組態以例如藉由使用本文中所描述之部分字存取中之任一者來輔助減少與存取請求相關之功率消耗。 The reduction unit can be configured to assist in reducing the power consumption associated with the access request, for example, by using any of the partial word accesses described herein.

該縮減單元可經組態以在由該讀取請求觸發之讀取操作期間基於第一數目個位元及第二數目個位元而控制記憶體讀取路徑。舉例而言，來自縮減單元之控制信號可影響讀取路徑之記憶體消耗，以減少與所請求之第二數目個位元不相關的記憶體讀取路徑之能量消耗。舉例而言，該縮減單元可經組態以在第二數目小於第一數目時控制不相關的記憶體讀取路徑。 The reduction unit can be configured to control the memory read path based on the first number of bits and the second number of bits during the read operation triggered by the read request. For example, the control signal from the reduction unit can affect the memory consumption of the read path, so as to reduce the energy consumption of the memory read path that is not related to the second number of bits requested. For example, the reduction unit can be configured to control unrelated memory read paths when the second number is less than the first number.

As explained above, the integrated circuit may be included in, may include, or may otherwise comprise a memory chip as illustrated in any one of FIG. 3A, FIG. 3B, FIGS. 4 to 6, FIGS. 7A to 7D, FIGS. 11 to 13, FIGS. 16 to 19, FIG. 22, or FIG. 23.

不相關的記憶體內讀取路徑可與第一數目個位元中之不相關位元相關，諸如第一數目個位元中之不包括於第二數目個位元中的位元。 The unrelated in-memory read path may be related to unrelated bits in the first number of bits, such as bits in the first number of bits that are not included in the second number of bits.

圖50說明實例積體電路5000，其包括：記憶體胞元陣列5050中之記憶體胞元5001至5008；輸出埠5020，其包括位元5021至5028；讀取電路系統5040，其包括記憶體讀取路徑5011至5018；及縮減單元5030。 FIG. 50 illustrates an example integrated circuit 5000, which includes: memory cells 5001 to 5008 in a memory cell array 5050; output port 5020, which includes bits 5021 to 5028; and a read circuit system 5040, which includes memory Reading paths 5011 to 5018; and reduction unit 5030.

When the second number of bits is read using the corresponding memory read paths, the irrelevant bits of the first number of bits may correspond to bits that should not be read (e.g., bits not included in the second number of bits).

During the read operation, reduction unit 5030 may be configured to activate the memory read paths corresponding to the second number of bits, such that the activated memory read paths may be configured to convey the second number of bits. In such embodiments, only the memory read paths corresponding to the second number of bits may be activated.

在讀取操作期間，縮減單元5030可經組態以切斷每一不相關的記憶體讀取路徑之至少一部分。舉例而言，不相關的記憶體讀取路徑可對應於第一數目個位元中之不相關位元。 During the read operation, the reduction unit 5030 can be configured to cut off at least a part of each unrelated memory read path. For example, the unrelated memory read path may correspond to unrelated bits in the first number of bits.

應注意，替代切斷不相關的記憶體路徑之至少一部分，縮減單元5030可替代地保證不啟動不相關的記憶體路徑。 It should be noted that instead of cutting off at least a part of the unrelated memory path, the reduction unit 5030 may alternatively ensure that the unrelated memory path is not activated.

另外或替代地，在讀取操作期間，縮減單元5030可經組態以將不相關的記憶體讀取路徑維持於低功率模式中。舉例而言，低功率模式可包含分別向不相關的記憶體路徑供應低於正常工作電壓或電流之電壓或電流。 Additionally or alternatively, during a read operation, the reduction unit 5030 may be configured to maintain the unrelated memory read path in a low power mode. For example, the low-power mode may include supplying voltages or currents lower than normal operating voltages or currents to unrelated memory paths, respectively.
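The per-path control options described above (activating only the relevant paths, and cutting off or keeping the irrelevant paths in a low-power mode) can be sketched as follows. This is a hedged illustration: the `PathState` enum, the function `control_read_paths`, and the `offset` parameter (the position of the requested bits within the port width) are hypothetical names introduced here, and the sketch assumes the requested bits are contiguous.

```python
from enum import Enum
from typing import List


class PathState(Enum):
    ACTIVE = "active"        # path conveys one of the requested bits
    LOW_POWER = "low_power"  # path held below normal operating voltage/current
    CUT_OFF = "cut_off"      # at least part of the path is shut down


def control_read_paths(first_number: int, second_number: int, offset: int = 0,
                       idle_state: PathState = PathState.LOW_POWER) -> List[PathState]:
    """Assign a state to each of the first_number read paths for one read."""
    if second_number < 0 or offset < 0 or offset + second_number > first_number:
        raise ValueError("requested bits must fit within the output port width")
    # Only paths carrying requested bits are activated; the rest stay idle.
    return [PathState.ACTIVE if offset <= i < offset + second_number else idle_state
            for i in range(first_number)]
```

With an eight-path port such as the one in FIG. 50, `control_read_paths(8, 3)` would mark three paths as active and leave the remaining five in the chosen idle state.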

縮減單元5030可經進一步組態以控制不相關的記憶體讀取路徑之位元線。 The reduction unit 5030 can be further configured to control the bit lines of unrelated memory read paths.

因此，縮減單元5030可經組態以載入相關的記憶體讀取路徑之位元線，且將不相關的記憶體讀取路徑之位元線維持於低功率模式下。舉例而言，僅可載入相關的記憶體讀取路徑之位元線。 Therefore, the reduction unit 5030 can be configured to load the bit lines of the relevant memory read path and maintain the bit lines of the unrelated memory read path in the low power mode. For example, only bit lines of the relevant memory read path can be loaded.

另外或替代地，縮減單元5030可經組態以載入相關的記憶體讀取路徑之位元線，同時將不相關的記憶體讀取路徑之位元線維持為不啟動。 Additionally or alternatively, the reduction unit 5030 can be configured to load the bit lines of the relevant memory read path while keeping the bit lines of the unrelated memory read path inactive.

In some embodiments, reduction unit 5030 may be configured to utilize portions of the relevant memory read paths during the read operation and to maintain a portion of each irrelevant memory read path in a low-power mode, where that portion is different from a bit line.

As explained above, a memory chip may use sense amplifiers to amplify voltages from the memory cells included in the memory chip. Accordingly, reduction unit 5030 may be configured to utilize portions of the relevant memory read paths during the read operation and to maintain the sense amplifiers associated with at least some of the irrelevant memory read paths in a low-power mode.

在此等實施例中，縮減單元5030可經組態以在讀取操作期間利用相關的記憶體讀取路徑之部分，且將與所有不相關的記憶體讀取路徑相關聯之一或多個感測放大器維持於低功率模式下。 In these embodiments, the reduction unit 5030 may be configured to utilize part of the related memory read path during a read operation, and associate one or more of all unrelated memory read paths. The sense amplifier is maintained in a low power mode.

另外或替代地，縮減單元5030可經組態以在讀取操作期間利用相關的記憶體讀取路徑之部分，且將在與不相關的記憶體讀取路徑相關聯之一或多個感測放大器之後(例如，在空間上及/或在時間上)的不相關的記憶體讀取路徑之部分維持於低功率模式下。 Additionally or alternatively, the reduction unit 5030 may be configured to utilize a portion of the related memory read path during a read operation, and to use one or more sensing elements associated with an unrelated memory read path. Part of the unrelated memory read path after the amplifier (e.g., spatially and/or temporally) is maintained in the low power mode.

In any of the embodiments described above, the memory unit may include a column multiplexer (not shown).

In such embodiments, reduction unit 5030 may be coupled between the column multiplexer and the output port.

Additionally or alternatively, reduction unit 5030 may be embedded in the column multiplexer.

Additionally or alternatively, reduction unit 5030 may be coupled between the memory cells and the column multiplexer.

Reduction unit 5030 may comprise reduction subunits that may be independently controllable. For example, different reduction subunits may be associated with different columns of the memory unit.

儘管上文關於讀取操作及讀取電路系統進行了描述，但以上實施例可類似地應用於寫入操作及寫入電路系統。 Although the reading operation and the reading circuit system are described above, the above embodiments can be similarly applied to the writing operation and the writing circuit system.

For example, an integrated circuit according to the present disclosure may include a memory unit comprising memory cells, an output port, and write circuitry. In some embodiments, the memory unit may further include a processing unit, such as a processor subunit as described above. The write circuitry may include a reduction unit and a first group of memory write paths for outputting up to a first number of bits via the output port. The processing unit may be configured to send a write request to the memory unit to write a second number of bits to the memory unit. Additionally or alternatively, the write request may originate from an off-chip logic circuit (such as an accelerator, CPU, GPU, or the like). Reduction unit 5030 may be configured to control the memory write paths, based on the first number of bits and the second number of bits, during a write operation triggered by the write request.

FIG. 51 illustrates a memory bank 5100 that includes an array 5111 of memory cells addressed using row addresses and column addresses (e.g., from an on-chip processor subunit or an off-chip logic circuit, such as an accelerator, CPU, GPU, or the like). As shown in FIG. 51, the memory cells are connected to bit lines (vertical) and word lines (horizontal; many word lines are omitted for simplicity). In addition, row decoder 5112 may be fed with a row address (e.g., from an on-chip processor subunit, an off-chip logic circuit, or a memory controller not shown in FIG. 51), column multiplexer 5113 may be fed with a column address (e.g., from an on-chip processor subunit, an off-chip logic circuit, or a memory controller not shown in FIG. 51), and column multiplexer 5113 may receive outputs from up to an entire line and output up to one word over output bus 5115. In FIG. 51, output bus 5115 of column multiplexer 5113 is coupled to main I/O bus 5114. In other embodiments, output bus 5115 may be coupled to a processor subunit of the memory chip (e.g., as depicted in FIG. 7A) that sends the row addresses and column addresses. For simplicity, the division of the memory bank into memory mats is not shown.

FIG. 52 illustrates a memory bank 5101. In FIG. 52, the memory bank is also illustrated as including processing-in-memory (PIM) logic 5116, which has an input coupled to output bus 5115. PIM logic 5116 may generate addresses (e.g., comprising row addresses and column addresses) and output the addresses via PIM address bus 5118 to access the memory bank. PIM logic 5116 is an example of a reduction unit (e.g., unit 5030) that also comprises a processing unit. PIM logic 5116 may control other circuits, not shown in FIG. 52, that assist in reducing power. PIM logic 5116 may further control the memory paths of a memory unit that includes memory bank 5101.

如上文所解釋，在一些狀況下，字長(例如，選擇一次傳送之位元線之數目)可為大的。 As explained above, in some situations, the word length (for example, the number of bit lines selected for one transmission) may be large.

在彼等狀況下，用於讀取及/或寫入之每一字可與可在讀取及/或寫入操作之各種階段消耗功率的記憶體路徑相關聯，例如： Under these conditions, each word used for reading and/or writing can be associated with a memory path that can consume power at various stages of the reading and/or writing operation, such as:

a. Loading the bit lines: to avoid loading a bit line to the needed value (either from the capacitor on the bit line in a read cycle, or to the new value to be written to the capacitor in a write cycle), the sense amplifier located at the end of the memory array must be disabled, and the capacitor holding the data must not be discharged or charged (otherwise, the data stored on it would be destroyed); and

b. Moving the data from the sense amplifiers, through the column multiplexer that selects the bit lines, to the rest of the chip (either to the I/O bus that transfers data into and out of the chip, or to embedded logic that will use the data, such as a processor subunit on the same substrate as the memory).

To achieve power savings, an integrated circuit of the present disclosure may determine, at row activation time, that some portions of a word are irrelevant, and then send a disable signal to one or more sense amplifiers for those irrelevant portions of the word.

FIG. 53 illustrates a memory unit 5102 that includes memory cell array 5111, row decoder 5112, column multiplexer 5113 coupled to output bus 5115, and PIM logic 5116.

Memory unit 5102 also includes switches 5201 that enable or disable the passage of bits to column multiplexer 5113. Switches 5201 may comprise analog switches, transistors configured to function as switches, or any other circuitry configured to control the supply or flow of voltage and/or current to portions of memory unit 5102. Sense amplifiers (not shown) may be located at the end of the memory cell array, e.g., before (spatially and/or temporally) switches 5201.

Switches 5201 may be controlled by enable signals sent from PIM logic 5116 via bus 5117. When open, the switches disconnect the sense amplifiers (not shown) of memory unit 5102, so that the bit lines disconnected from the sense amplifiers are neither discharged nor charged.
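One way to picture the enable signals carried over bus 5117 is as a bit mask with one bit per switch. The sketch below is an assumption-laden illustration: the mask encoding, the function names `switch_enable_mask` and `disconnected_bit_lines`, and the one-switch-per-bit-line granularity are all hypothetical and are not stated in the disclosure.

```python
def switch_enable_mask(line_bits: int, start: int, length: int) -> int:
    """Build a mask in which bit i is 1 when the switch for bit line i is closed."""
    if start < 0 or length < 0 or start + length > line_bits:
        raise ValueError("relevant portion must lie within the activated row")
    # Close only the switches covering the relevant portion of the word.
    return ((1 << length) - 1) << start


def disconnected_bit_lines(line_bits: int, mask: int) -> int:
    """Count bit lines left disconnected, and so neither charged nor discharged."""
    closed = bin(mask & ((1 << line_bits) - 1)).count("1")
    return line_bits - closed
```

For a 16-bit line in which only bits 4 through 7 are relevant, the mask is `0x00F0` and twelve bit lines stay disconnected from their sense amplifiers.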

開關5201及PIM邏輯5116可形成縮減單元(例如，縮減單元5030)。 The switches 5201 and the PIM logic 5116 may form a reduction unit (for example, reduction unit 5030).

在又一實例中，PIM邏輯5116可將賦能信號發送至感測放大器(例如，當感測放大器具有賦能輸入時)而非發送至開關5201。 In yet another example, the PIM logic 5116 may send the enabling signal to the sense amplifier (for example, when the sense amplifier has an enabling input) instead of sending it to the switch 5201.

位元線可另外或替代地在其他點處斷開，例如，不在位元線之末端處及在感測放大器之後斷開。舉例而言，位元線可在進入陣列5111之前斷開。 The bit line may additionally or alternatively be disconnected at other points, for example, not at the end of the bit line and after the sense amplifier. For example, the bit line can be disconnected before entering the array 5111.

在此等實施例中，在自感測放大器及轉送硬體(諸如，輸出匯流排5115)進行資料傳送時，亦可節省功率。 In these embodiments, power can also be saved when data is transmitted from the sense amplifier and the forwarding hardware (such as the output bus 5115).

其他實施例(其可節省較少功率，但可較容易實施)聚焦於節省行多工器5113之功率且將損失自行多工器5113轉移至下一層級電路系統。舉例而言，如上文所解釋，下一層級電路系統可包含記憶體晶片之I/O匯流排(諸如，匯流排5115)。在使用記憶體內處理之實施例中，下一層級電路系統可另外或替代地包含記憶體晶片之處理器子單元(諸如，PIM邏輯5116)。 Other embodiments (which may save less power but may be easier to implement) focus on saving the power of the column multiplexer 5113 and the transfer losses from the column multiplexer 5113 to the next level of circuitry. For example, as explained above, the next level of circuitry may include the I/O bus of the memory chip (such as bus 5115). In embodiments using in-memory processing, the next level of circuitry may additionally or alternatively include a processor subunit of the memory chip (such as PIM logic 5116).

圖54A說明分段為多個區段5202之行多工器5113。行多工器5113之每一區段5202可藉由自PIM邏輯5116經由匯流排5119發送之賦能及/或去能信號來個別地賦能或去能。行多工器5113亦可由位址行匯流排5118饋入。 FIG. 54A illustrates a column multiplexer 5113 that is segmented into multiple segments 5202. Each segment 5202 of the column multiplexer 5113 can be individually enabled or disabled by enable and/or disable signals sent from the PIM logic 5116 via a bus 5119. The column multiplexer 5113 may also be fed by an address column bus 5118.

圖54A之實施例可提供對來自行多工器5113之輸出之不同部分的較佳控制。 The embodiment of FIG. 54A may provide better control over different parts of the output from the column multiplexer 5113.

應注意，對不同記憶體路徑之控制可具有不同解析度，例如範圍為自一位元解析度至多位元解析度。前者在功率節省之意義上可能更有效。後者之實施可能較簡單且需要較少控制信號。 It should be noted that the control of different memory paths may have different resolutions, for example, the range is from a one-bit resolution to a multi-bit resolution. The former may be more effective in the sense of power saving. The latter may be simpler to implement and require fewer control signals.
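
The trade-off between control resolution and the number of control signals can be illustrated with a short calculation. The following Python sketch is not taken from the source: the function name `control_tradeoff` and its parameters are hypothetical, and it assumes that the requested bits are contiguous and aligned to the start of the path group.

```python
import math


def control_tradeoff(first_number, second_number, bits_per_signal):
    """Estimate the cost of gating memory paths at a given resolution.

    first_number    -- bits the path group can carry (hypothetical name)
    second_number   -- bits actually requested by the access
    bits_per_signal -- how many bit paths one enable signal gates

    Returns (control_signals, powered_bits).
    """
    if bits_per_signal <= 0 or not 0 <= second_number <= first_number:
        raise ValueError("invalid widths")
    # One enable signal is needed per gated group of paths.
    control_signals = math.ceil(first_number / bits_per_signal)
    # A whole group must stay powered if any requested bit falls inside it.
    enabled_groups = math.ceil(second_number / bits_per_signal)
    powered_bits = min(first_number, enabled_groups * bits_per_signal)
    return control_signals, powered_bits


if __name__ == "__main__":
    for resolution in (1, 8, 64):
        print(resolution, control_tradeoff(64, 10, resolution))
```

Under these assumptions, one-bit resolution powers only the requested paths but needs one control signal per bit, while coarser resolutions need fewer signals and leave more unneeded paths powered.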

圖54B說明實例方法5130。舉例而言，可使用上文關於圖50、圖51、圖52、圖53或圖54A所描述之記憶體單元中之任一者來實施方法5130。 FIG. 54B illustrates an example method 5130. For example, the method 5130 can be implemented using any of the memory units described above with respect to FIG. 50, FIG. 51, FIG. 52, FIG. 53, or FIG. 54A.

方法5130可包括步驟5132及5134。 The method 5130 may include steps 5132 and 5134.

步驟5132可包括：藉由積體電路之處理單元(例如，PIM邏輯5116)發送存取請求且發送至積體電路之記憶體單元以自該記憶體單元讀取第二數目個位元。該記憶體單元可包括記憶體胞元(例如，陣列5111之記憶體胞元)、輸出埠(例如，輸出匯流排5115)，及讀取/寫入電路系統，該讀取/寫入電路系統可包括縮減單元(例如，縮減單元5030)及第一群組記憶體讀取/寫入路徑，該等記憶體讀取/寫入路徑用於經由輸出埠輸出及/或輸入多達第一數目個位元。 Step 5132 may include sending, by a processing unit of the integrated circuit (for example, PIM logic 5116), an access request to a memory unit of the integrated circuit to read a second number of bits from the memory unit. The memory unit may include memory cells (for example, the memory cells of the array 5111), an output port (for example, the output bus 5115), and read/write circuitry, which may include a reduction unit (for example, reduction unit 5030) and a first group of memory read/write paths for outputting and/or inputting up to a first number of bits through the output port.

存取請求可包含讀取請求及/或寫入請求。 The access request may include a read request and/or a write request.

記憶體輸入/輸出路徑可包含記憶體讀取路徑、記憶體寫入路徑及/或用於讀取及寫入兩者之路徑。 The memory input/output path may include a memory read path, a memory write path, and/or a path for both reading and writing.

步驟5134可包括對存取請求作出回應。 Step 5134 may include responding to the access request.

舉例而言，步驟5134可包括在由存取請求觸發之存取操作期間藉由縮減單元(例如，單元5030)基於第一數目個位元及第二數目個位元而控制記憶體讀取/寫入路徑。 For example, step 5134 may include controlling the memory read/write paths, by the reduction unit (for example, unit 5030), based on the first number of bits and the second number of bits, during the access operation triggered by the access request.

步驟5134可進一步包括以下操作中之任一者及/或以下操作中之任一者的任何組合。下文列出之操作中的任一者可在對存取請求作出回應期間執行，但亦可在對存取請求作出回應之前及/或之後執行。 Step 5134 may further include any of the following operations and/or any combination of any of the following operations. Any of the operations listed below can be performed during the response to the access request, but can also be performed before and/or after the response to the access request.

因此，步驟5134可包括以下操作中之至少一者： Therefore, step 5134 may include at least one of the following operations:

a.在第二數目小於第一數目時控制不相關的記憶體讀取路徑，其中不相關的記憶體讀取路徑與第一數目個位元中之不包括於第二數目個位元中的位元相關聯； a. Control the unrelated memory read paths when the second number is less than the first number, where the unrelated memory read paths are associated with those bits of the first number of bits that are not included in the second number of bits;

b.在讀取操作期間啟動相關的記憶體讀取路徑，其中相關的記憶體讀取路徑經組態以輸送第二數目個位元； b. Start the relevant memory read path during the read operation, wherein the relevant memory read path is configured to transmit a second number of bits;

c.在讀取操作期間切斷不相關的記憶體讀取路徑中之每一者的至少一部分； c. Cut off at least part of each of the unrelated memory read paths during the read operation;

d.在讀取操作期間將不相關的記憶體讀取路徑維持於低功率模式中； d. Maintain the unrelated memory read path in the low power mode during the read operation;

e.控制不相關的記憶體讀取路徑之位元線； e. Control the bit lines of unrelated memory read paths;

f.載入相關的記憶體讀取路徑之位元線且將不相關的記憶體讀取路徑之位元線維持於低功率模式中； f. Load the bit lines of the relevant memory read path and maintain the bit lines of the unrelated memory read path in low power mode;

g.載入相關的記憶體讀取路徑之位元線，同時將不相關的記憶體讀取路徑之位元線維持為不啟動； g. Load the bit lines of the relevant memory read path, while keeping the bit lines of the unrelated memory read path inactive;

h.在讀取操作期間利用相關的記憶體讀取路徑之部分且將每一不相關的記憶體讀取路徑之一部分維持於低功率模式中，其中該部分不同於位元線； h. During a read operation, use a portion of the related memory read path and maintain a portion of each unrelated memory read path in a low power mode, where the portion is different from the bit line;

i.在讀取操作期間利用相關的記憶體讀取路徑之部分且將用於不相關的記憶體讀取路徑中之至少一些的感測放大器維持於低功率模式中； i. Use part of the related memory read path during a read operation and maintain the sense amplifiers used for at least some of the unrelated memory read paths in a low power mode;

j.在讀取操作期間利用相關的記憶體讀取路徑之部分且將不相關的記憶體讀取路徑中之至少一些的感測放大器維持於低功率模式中；及 j. Use part of the relevant memory read path during a read operation and maintain the sense amplifiers of at least some of the unrelated memory read paths in a low power mode; and

k.在讀取操作期間利用相關的記憶體讀取路徑之部分且將在不相關的記憶體讀取路徑之感測放大器之後的不相關的記憶體讀取路徑維持於低功率模式中。 k. Use part of the related memory read path during the read operation and maintain the unrelated memory read path after the sense amplifier of the unrelated memory read path in a low power mode.
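
The core of the operations listed above, keeping relevant paths active and holding unrelated paths in a low power mode, can be sketched in software. This is a minimal illustration rather than the described hardware: the names `PathState` and `respond_to_access` are hypothetical, and the sketch assumes that the requested bits map to the lowest-numbered paths.

```python
from enum import Enum


class PathState(Enum):
    ACTIVE = "active"
    LOW_POWER = "low_power"


def respond_to_access(first_number, second_number):
    """Return the state of each read path for one access.

    first_number  -- number of paths in the group (maximum bits per access)
    second_number -- number of bits requested by the access request
    """
    if not 0 <= second_number <= first_number:
        raise ValueError("requested bits must fit within the path group")
    # Paths carrying requested bits are relevant; the rest are unrelated
    # and are kept in a low power mode for the duration of the access.
    return [
        PathState.ACTIVE if path < second_number else PathState.LOW_POWER
        for path in range(first_number)
    ]


if __name__ == "__main__":
    states = respond_to_access(first_number=8, second_number=3)
    print([state.value for state in states])
```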

低功率模式或閒置模式可包含記憶體存取路徑之功率消耗低於在記憶體存取路徑用於存取操作時記憶體存取路徑之功率消耗的模式。在一些實施例中，低功率模式可能甚至涉及切斷記憶體存取路徑。低功率模式可另外或替代地包括不啟動記憶體存取路徑。 The low power mode or the idle mode may include a mode in which the power consumption of the memory access path is lower than the power consumption of the memory access path when the memory access path is used for an access operation. In some embodiments, the low power mode may even involve cutting off the memory access path. The low power mode may additionally or alternatively include not activating the memory access path.

應注意，在位元線階段期間發生的功率減少可能需要在開放字線之前應知曉記憶體存取路徑之相關性或不相關性。在別處發生(例如，在行多工器中)之功率減少可替代地允許在每次存取時決定記憶體存取路徑之相關性或不相關性。 It should be noted that power reduction that occurs during the bit line stage may require the relevance or irrelevance of the memory access paths to be known before the word line is opened. Power reduction that occurs elsewhere (for example, in the column multiplexer) may instead allow the relevance or irrelevance of the memory access paths to be decided on each access.

快速及低功率啟動以及快速存取記憶體 Fast and low power startup and fast memory access

DRAM及其他記憶體類型(諸如，SRAM、快閃記憶體或其類似者)常常自記憶體組建置，該等記憶體組通常建置為允許列及行存取方案。 DRAM and other memory types (such as SRAM, flash memory, or the like) are often built from memory banks, which are usually built to allow a row and column access scheme.

圖55說明記憶體晶片5140之實例，該記憶體晶片包括多個記憶體墊及相關聯邏輯(諸如，列及行解碼器，在圖55中分別描繪為RD及COL)。在圖55之實例中，墊被分組成組且具有通過其的字線及位元線。記憶體墊及相關聯邏輯在圖55中表示為5141、5142、5143、5144、5145及5146，且共用至少一個匯流排5147。 Figure 55 illustrates an example of a memory chip 5140 that includes a plurality of memory pads and associated logic (such as row and column decoders, depicted as RD and COL, respectively, in Figure 55). In the example of Figure 55, the pads are grouped into banks and have word lines and bit lines passing through them. The memory pads and associated logic are denoted 5141, 5142, 5143, 5144, 5145, and 5146 in Figure 55 and share at least one bus 5147.

記憶體晶片5140可包括於如圖3A、圖3B、圖4至圖6、圖7A至圖7D、圖11至圖13、圖16至圖19、圖22或圖23中之任一者中所說明的記憶體晶片中，可包括該記憶體晶片或以其他方式包含該記憶體晶片。 The memory chip 5140 may be included in, may include, or may otherwise comprise a memory chip as illustrated in any of FIGS. 3A, 3B, 4-6, 7A-7D, 11-13, 16-19, 22, or 23.

舉例而言，在DRAM中，與啟動新列(例如，準備用於存取之新線)相關聯的耗用很大。一旦一線經啟動(亦被稱作開放)，彼列內之資料便可用於更快存取。在DRAM中，此存取可能以隨機方式進行。 For example, in DRAM, the overhead associated with activating a new row (for example, preparing a new line for access) is high. Once a line is activated (also referred to as opened), the data within that row is available for much faster access. In DRAM, this access may occur in a random manner.

與啟動新線相關聯之兩個問題為功率及時間： The two issues associated with starting a new line are power and time:

a.由於一起存取該線上之所有電容器及必須載入該線所導致的電流驟增，功率會上升(例如，當開放僅具有幾個記憶體組之線時，功率可達到若干安培)；及 a. Power rises because of the current surge caused by accessing all the capacitors on the line together and having to load the line (for example, power may reach several amperes when opening a line with only a few memory banks); and

b.時間延遲問題主要與載入列(字)線及接著載入位元(行)線所花費之時間相關聯。 b. The time delay problem is mainly associated with the time it takes to load the row (word) line and then the bit (column) lines.

本發明之一些實施例可包括用以在啟動線期間減少峰值功率消耗且減少線啟動時間之系統及方法。一些實施例可至少在一定程度上犧牲一線內之完全隨機存取，以減少此等功率及時間成本。 Some embodiments of the present invention may include systems and methods for reducing peak power consumption during activation of a line and for reducing line activation time. Some embodiments may sacrifice, at least to some extent, fully random access within a line in order to reduce these power and time costs.

舉例而言，在一個實施例中，記憶體單元可包括第一記憶體墊、第二記憶體墊及啟動單元，該啟動單元經組態以啟動包括於第一記憶體墊中之第一群組記憶體胞元，而不啟動包括於第二記憶體墊中之第二群組記憶體胞元。該等第一群組記憶體胞元及該等第二群組記憶體胞元可皆屬於該記憶體單元之單個列。 For example, in one embodiment, the memory unit may include a first memory pad, a second memory pad, and an activation unit configured to activate a first group of memory cells included in the first memory pad without activating a second group of memory cells included in the second memory pad. The first group of memory cells and the second group of memory cells may both belong to a single row of the memory unit.

替代地，該啟動單元可經組態以啟動包括於第二記憶體墊中之第二群組記憶體胞元，而不啟動第一群組記憶體胞元。 Alternatively, the activation unit may be configured to activate the second group of memory cells included in the second memory pad without activating the first group of memory cells.

在一些實施例中，該啟動單元可經組態以在啟動第一群組記憶體胞元之後啟動第二群組記憶體胞元。 In some embodiments, the activation unit may be configured to activate the second group of memory cells after activating the first group of memory cells.

舉例而言，該啟動單元可經組態以在第一群組記憶體胞元的啟動已完成之後起始的延遲時段期滿之後啟動第二群組記憶體胞元。 For example, the activation unit may be configured to activate the second group of memory cells after expiration of a delay period that starts once activation of the first group of memory cells has completed.

另外或替代地，該啟動單元可經組態以基於信號之值而啟動第二群組記憶體胞元，該信號係在耦接至第一群組記憶體胞元的第一字線區段上產生的。 Additionally or alternatively, the activation unit may be configured to activate the second group of memory cells based on the value of a signal generated on a first word line segment coupled to the first group of memory cells.

在上文所描述之實施例中之任一者中，該啟動單元可包括安置於第一字線區段與第二字線區段之間的中間電路。在此等實施例中，第一字線區段可耦接至第一記憶體胞元且第二字線區段可耦接至第二記憶體胞元。中間電路之非限制性實例包括開關、正反器、緩衝器、反相器及其類似者，其中之一些貫穿圖56至圖61加以說明。 In any of the above-described embodiments, the activation unit may include an intermediate circuit disposed between the first word line section and the second word line section. In these embodiments, the first word line segment can be coupled to the first memory cell and the second word line segment can be coupled to the second memory cell. Non-limiting examples of intermediate circuits include switches, flip-flops, buffers, inverters, and the like, some of which are described throughout FIGS. 56 to 61.

在一些實施例中，第二記憶體胞元可耦接至第二字線區段。在此等實施例中，第二字線區段可耦接至通過至少第一記憶體墊之旁路字線路徑。此類旁路路徑之實例說明於圖61中。 In some embodiments, the second memory cells may be coupled to the second word line segment. In these embodiments, the second word line segment may be coupled to a bypass word line path that passes through at least the first memory pad. An example of such a bypass path is illustrated in Figure 61.

該啟動單元可包含控制單元，該控制單元經組態以基於來自與單個列相關聯之字線的啟動信號而控制電壓(及/或電流)至第一群組記憶體胞元及第二群組記憶體胞元的供應。 The activation unit may include a control unit configured to control the supply of voltage (and/or current) to the first group of memory cells and the second group of memory cells based on an activation signal from the word line associated with the single row.

在另一實例實施例中，記憶體單元可包括第一記憶體墊、第二記憶體墊及啟動單元，該啟動單元經組態以將啟動信號供應至第一記憶體墊之第一群組記憶體胞元，且延遲該啟動信號至第二記憶體墊之第二群組記憶體胞元的供應，至少直至第一群組記憶體胞元的啟動已完成。該等第一群組記憶體胞元及該等第二群組記憶體胞元可屬於該記憶體單元之單個列。 In another example embodiment, the memory unit may include a first memory pad, a second memory pad, and an activation unit configured to supply an activation signal to a first group of memory cells of the first memory pad and to delay the supply of the activation signal to a second group of memory cells of the second memory pad at least until activation of the first group of memory cells has completed. The first group of memory cells and the second group of memory cells may belong to a single row of the memory unit.

舉例而言，該啟動單元可包括可經組態以延遲供應啟動信號之延遲單元。 For example, the activation unit may include a delay unit that can be configured to delay the supply of the activation signal.

另外或替代地，該啟動單元可包括比較器，該比較器可經組態以在其輸入端處接收啟動信號且基於啟動信號之至少一個特性而控制延遲單元。 Additionally or alternatively, the activation unit may include a comparator that may be configured to receive the activation signal at its input and control the delay unit based on at least one characteristic of the activation signal.

在另一實例實施例中，記憶體單元可包括第一記憶體墊、第二記憶體墊及隔離單元，該隔離單元可經組態以：在第一記憶體墊之第一記憶體胞元被啟動的初始啟動時段期間將該等第一記憶體胞元與第二記憶體墊之第二記憶體胞元相隔離；及在該初始啟動時段之後將該等第一記憶體胞元耦接至該等第二記憶體胞元。第一記憶體胞元及第二記憶體胞元可屬於記憶體單元之單個列。 In another example embodiment, the memory unit may include a first memory pad, a second memory pad, and an isolation unit that may be configured to: isolate first memory cells of the first memory pad from second memory cells of the second memory pad during an initial activation period in which the first memory cells are activated; and couple the first memory cells to the second memory cells after the initial activation period. The first memory cells and the second memory cells may belong to a single row of the memory unit.

在以下實例中，可能不需要對記憶體墊本身進行修改。在某些實例中，實施例可依賴於對記憶體組之少量修改。 In the following example, the memory pad itself may not need to be modified. In some instances, embodiments may rely on minor modifications to the memory bank.

以下圖式描繪縮短添加至記憶體組之字信號藉此將字線分裂成數個較短部分的機構。 The following figures depict mechanisms, added to the memory banks, that shorten the word signal by splitting the word line into several shorter parts.

在以下諸圖中，為了清楚起見省略各種記憶體組組件。 In the following figures, various memory bank components are omitted for clarity.

圖56至圖61說明記憶體組之部分(分別表示為5140(1)、5140(2)、5140(3)、5140(4)、5140(5)及5140(6))，該等部分包括分組於不同群組內之列解碼器5112及多個記憶體墊(諸如，5150(1)、5150(2)、5150(3)、5150(4)、5150(5)、5150(6)、5151(1)、5151(2)、5151(3)、5151(4)、5151(5)、5151(6)、5152(1)、5152(2)、5152(3)、5152(4)、5152(5)及5152(6))。 Figures 56 to 61 illustrate parts of memory banks (denoted 5140(1), 5140(2), 5140(3), 5140(4), 5140(5), and 5140(6), respectively), which include a row decoder 5112 and multiple memory pads grouped in different groups (such as 5150(1), 5150(2), 5150(3), 5150(4), 5150(5), 5150(6), 5151(1), 5151(2), 5151(3), 5151(4), 5151(5), 5151(6), 5152(1), 5152(2), 5152(3), 5152(4), 5152(5), and 5152(6)).

配置成一列之記憶體墊可包括不同群組。 The memory pads arranged in a row may include different groups.

圖56至圖59及圖61說明記憶體墊之九個群組，其中每一群組包括一對記憶體墊。可使用任何數目個群組，每一群組具有任何數目個記憶體墊。 Figures 56 to 59 and Figure 61 illustrate nine groups of memory pads, where each group includes a pair of memory pads. Any number of groups can be used, and each group has any number of memory pads.

記憶體墊5150(1)、5150(2)、5150(3)、5150(4)、5150(5)及5150(6)配置成一列，共用多條記憶體線，且分成三個群組：第一上部群組，其包括記憶體墊5150(1)及5150(2)；第二上部群組，其包括記憶體墊5150(3)及5150(4)；及第三上部群組，其包括記憶體墊5150(5)及5150(6)。 The memory pads 5150(1), 5150(2), 5150(3), 5150(4), 5150(5) and 5150(6) are arranged in a row, share multiple memory lines, and are divided into three groups: The first upper group, which includes memory pads 5150(1) and 5150(2); the second upper group, which includes memory pads 5150(3) and 5150(4); and the third upper group, which Including memory pads 5150(5) and 5150(6).

類似地，記憶體墊5151(1)、5151(2)、5151(3)、5151(4)、5151(5)及5151(6)配置成一列，共用多條記憶體線且分成三個群組：第一中間群組，其包括記憶體墊5151(1)及5151(2)；第二中間群組，其包括記憶體墊5151(3)及5151(4)；及第三中間群組，其包括記憶體墊5151(5)及5151(6)。 Similarly, the memory pads 5151(1), 5151(2), 5151(3), 5151(4), 5151(5) and 5151(6) are arranged in a row, share multiple memory lines and are divided into three groups Group: the first middle group, which includes memory pads 5151(1) and 5151(2); the second middle group, which includes memory pads 5151(3) and 5151(4); and the third middle group , Which includes memory pads 5151(5) and 5151(6).

此外，記憶體墊5152(1)、5152(2)、5152(3)、5152(4)、5152(5)及5152(6)配置成一列，共用多條記憶體線且分組成三個群組：第一下部群組，其包括記憶體墊5152(1)及5152(2)；第二下部群組，其包括記憶體墊5152(3)及5152(4)；及第三下部群組，其包括記憶體墊5152(5)及5152(6)。任何數目個記憶體墊可配置成一列並共用記憶體線，且可分成任何數目個群組。 In addition, the memory pads 5152(1), 5152(2), 5152(3), 5152(4), 5152(5) and 5152(6) are arranged in a row, share multiple memory lines and are grouped into three groups Group: the first lower group, which includes memory pads 5152(1) and 5152(2); the second lower group, which includes memory pads 5152(3) and 5152(4); and the third lower group Group, which includes memory pads 5152(5) and 5152(6). Any number of memory pads can be arranged in a row and share memory lines, and can be divided into any number of groups.

舉例而言，每個群組之記憶體墊的數目可為一個、兩個或可超過兩個。 For example, the number of memory pads in each group can be one, two, or more than two.

如上文所解釋，啟動電路可經組態以啟動記憶體墊之一個群組，而不啟動共用相同記憶體線或至少耦接至具有同一線位址之不同記憶體線區段的記憶體墊之另一群組。 As explained above, the activation circuit can be configured to activate a group of memory pads without activating memory pads that share the same memory line or are at least coupled to different memory line segments with the same line address Another group.

圖56至圖61說明啟動電路之不同實例。在一些實施例中，啟動電路之至少一部分(諸如，中間電路)可位於記憶體墊群組之間，以允許啟動一個群組之記憶體墊，而不啟動同一列之記憶體墊的另一群組。 Figures 56 to 61 illustrate different examples of activation circuits. In some embodiments, at least a part of the activation circuit (such as an intermediate circuit) may be located between the memory pad groups to allow the memory pads of one group to be activated without activating another group of memory pads in the same row.

圖56說明如定位於記憶體墊之第一上部群組的不同線與記憶體墊之第二上部群組的不同線之間的中間電路，諸如延遲或隔離電路5153(1)至5153(3)。 FIG. 56 illustrates intermediate circuits, such as delay or isolation circuits 5153(1) to 5153(3), positioned between the different lines of the first upper group of memory pads and the different lines of the second upper group of memory pads.

圖56亦說明如定位於記憶體墊之第二上部群組的不同線與記憶體墊之第三上部群組的不同線之間的中間電路，諸如延遲或隔離電路5154(1)至5154(3)。另外，一些延遲或隔離電路定位於由中間群組之記憶體墊形成的群組之間。此外，一些延遲或隔離電路定位於由下部群組之記憶體墊形成的群組之間。 FIG. 56 also illustrates intermediate circuits, such as delay or isolation circuits 5154(1) to 5154(3), positioned between the different lines of the second upper group of memory pads and the different lines of the third upper group of memory pads. In addition, some delay or isolation circuits are positioned between the groups formed by the memory pads of the middle groups, and some are positioned between the groups formed by the memory pads of the lower groups.

該等延遲或隔離電路可延遲或停止字線信號自列解碼器5112沿著一列傳播至另一群組。 The delay or isolation circuits can delay or stop the propagation of the word line signal from the row decoder 5112 along a row to another group.

圖57說明包含正反器(諸如，5155(1)至5155(3)及5156(1)至5156(3))之中間電路，諸如延遲或隔離電路。 Figure 57 illustrates an intermediate circuit, such as a delay or isolation circuit, including flip-flops (such as 5155(1) to 5155(3) and 5156(1) to 5156(3)).

當將啟動信號注入至字線時，啟動第一墊群組中之一者(取決於該字線)，而沿著該字線之其他群組保持不啟動。可在下一時脈循環啟動其他群組。舉例而言，可在下一時脈循環啟動其他群組中之第二群組，且可在又一時脈循環之後啟動其他群組中之第三群組。 When the activation signal is injected into the word line, one of the first pad groups (depending on the word line) is activated, while the other groups along the word line remain inactive. Other groups can be activated in the next clock cycle. For example, the second group in other groups can be activated in the next clock cycle, and the third group in other groups can be activated after another clock cycle.

正反器可包含D型正反器或任何其他類型的正反器。為簡單起見，自圖式省略饋入至D型正反器的時脈。 The flip-flop can include a D-type flip-flop or any other type of flip-flop. For simplicity, the clock fed to the D-type flip-flop is omitted from the diagram.

因此，對第一群組的存取可使用電力以僅對與第一群組相關聯之字線的部分充電，此充電比對整條字線充電更快且需要更少電流。 Therefore, access to the first group can use power to charge only part of the word line associated with the first group, which charge is faster and requires less current than charging the entire word line.

可在記憶體墊群組之間使用多於一個正反器，藉此增加開放部分之間的延遲。另外或替代地，實施例可使用較慢時脈以增加延遲。 More than one flip-flop can be used between the memory pad groups, thereby increasing the delay between the open parts. Additionally or alternatively, embodiments may use a slower clock to increase the delay.
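
The staggered activation produced by flip-flops between memory pad groups can be modeled as a shift register. The sketch below is only an illustration under stated assumptions: the function name `simulate_flip_flop_chain` is hypothetical, every flip-flop is assumed to be a D-type flop on a common clock, and the first group is assumed to see the decoder output directly at cycle 0.

```python
def simulate_flip_flop_chain(num_groups, flops_per_gap=1):
    """Return {group_index: clock cycle at which its word line segment is driven}.

    num_groups    -- number of memory pad groups along one word line
    flops_per_gap -- flip-flops placed between neighbouring groups
    """
    if num_groups < 1 or flops_per_gap < 1:
        raise ValueError("need at least one group and one flop per gap")
    # Outputs of all flip-flops along the word line, initially low.
    stages = [0] * ((num_groups - 1) * flops_per_gap)
    activation_cycle = {0: 0}  # group 0 is driven directly (assumption)
    cycle = 0
    while len(activation_cycle) < num_groups:
        cycle += 1
        # On each clock edge every flop captures its predecessor's output;
        # the first flop captures the (high) activation signal.
        stages = [1] + stages[:-1]
        for group in range(1, num_groups):
            last_flop = group * flops_per_gap - 1
            if group not in activation_cycle and stages[last_flop]:
                activation_cycle[group] = cycle
    return activation_cycle


if __name__ == "__main__":
    print(simulate_flip_flop_chain(3))                   # one flop per gap
    print(simulate_flip_flop_chain(3, flops_per_gap=2))  # longer delay
```

In this model, adding flip-flops per gap lengthens the delay between segment activations, and only one segment starts charging in any given cycle.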

此外，經啟動之群組可仍含有來自所使用之先前線值的群組。舉例而言，該方法可允許啟動新的線區段，同時仍存取先前線之資料，藉此減少與啟動新線相關聯之懲罰。 In addition, the activated groups may still include groups from the previously used line value. For example, this approach may allow a new line segment to be activated while data of the previous line is still being accessed, thereby reducing the penalty associated with activating a new line.

因此，一些實施例可具有經啟動之第一群組且允許先前經啟動線之其他群組保持在作用中，其中位元線之信號彼此不干擾。 Therefore, some embodiments may have a first group activated and allow other groups of previously activated lines to remain active, where the signals of the bit lines do not interfere with each other.

另外，一些實施例可包括開關及控制信號。該等控制信號可由組控制器控制或藉由在控制信號之間添加正反器(例如，產生上文所描述之機構具有的相同時序效應)來控制。 In addition, some embodiments may include switches and control signals. These control signals can be controlled by the group controller or by adding a flip-flop between the control signals (for example, producing the same timing effect as the mechanism described above).

圖58說明諸如延遲或隔離電路之中間電路，該等電路為開關(諸如，5157(1)至5157(3)及5158(1)至5158(3))且定位於一個群組與另一群組之間。定位於群組之間的一組開關可由專用控制信號控制。在圖58中，控制信號可由列控制單元5160(1)發送且由不同組開關之間的一或多個延遲單元(例如，單元5160(2)及5160(3))之序列延遲。 Figure 58 illustrates intermediate circuits, such as delay or isolation circuits, that are switches (such as 5157(1) to 5157(3) and 5158(1) to 5158(3)) positioned between one group and another. A set of switches positioned between groups can be controlled by a dedicated control signal. In Figure 58, the control signal can be sent by a row control unit 5160(1) and delayed by a sequence of one or more delay units (for example, units 5160(2) and 5160(3)) between the different sets of switches.

圖59說明諸如延遲或隔離電路之中間電路，該等電路為反相器閘或緩衝器(諸如，5159(1)至5159(3)及5159'(1)至5159'(3))之序列且定位於記憶體墊群組之間。 Figure 59 illustrates intermediate circuits such as delay or isolation circuits, which are sequences of inverter gates or buffers (such as 5159(1) to 5159(3) and 5159'(1) to 5159'(3)) And positioned between the memory pad groups.

替代開關，可在記憶體墊群組之間使用緩衝器。緩衝器可能不允許開關之間沿著字線降低電壓，電壓降低為在使用單個電晶體結構時有時會發生的效應。 Instead of switches, buffers can be used between memory pad groups. Buffers may prevent the voltage from dropping along the word line between switches, an effect that sometimes occurs when a single-transistor structure is used.

其他實施例可允許更多的隨機存取，且藉由使用添加至記憶體組之區域仍提供極低的啟動功率及時間。 Other embodiments may allow more random access, and still provide extremely low startup power and time by using the area added to the memory bank.

實例展示於圖60中，該圖說明使用接近記憶體墊定位之全域字線(諸如，5152(1)至5152(8))。此等字線可能通過或可能不通過記憶體墊且經由諸如開關(諸如，5157(1)至5157(8))之中間電路耦接至記憶體墊內之字線。該等開關可控制將啟動哪一記憶體墊且允許記憶體控制器在每一時間點僅啟動相關線部分。不同於上文所描述之使用線部分之依序啟動的實施例，圖60之實例可提供更好的控制。 An example is shown in Figure 60, which illustrates the use of global word lines (such as 5152(1) to 5152(8)) positioned close to the memory pads. These word lines may or may not pass through the memory pads and are coupled to the word lines within the memory pads via intermediate circuits such as switches (such as 5157(1) to 5157(8)). The switches can control which memory pad will be activated and allow the memory controller to activate only the relevant line part at each point in time. Unlike the embodiments described above that use sequential activation of line parts, the example of Figure 60 can provide better control.
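
The selective activation of local word line segments from a global word line can be sketched as simple gating logic. The code below is a hypothetical software model, not the circuit itself: the function names, and the assumption that consecutive bit-line indices map to consecutive memory pads of equal width, are illustrative only.

```python
def enables_for_access(bit_line, bit_lines_per_pad, num_pads):
    """Return one enable flag per pad, asserting only the pad that holds
    the requested bit line (assumed contiguous address mapping)."""
    if bit_lines_per_pad <= 0 or num_pads <= 0:
        raise ValueError("pad geometry must be positive")
    if not 0 <= bit_line < bit_lines_per_pad * num_pads:
        raise ValueError("bit line outside the row")
    target_pad = bit_line // bit_lines_per_pad
    return [pad == target_pad for pad in range(num_pads)]


def local_word_lines(global_word_line, enables):
    """A local segment is driven only when the global word line is high
    and the switch for that pad is enabled."""
    return [global_word_line and enable for enable in enables]


if __name__ == "__main__":
    enables = enables_for_access(bit_line=300, bit_lines_per_pad=128, num_pads=8)
    print(local_word_lines(True, enables))
```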

諸如列部分賦能信號5170(1)及5170(2)之賦能信號可源自未展示之邏輯，諸如記憶體控制器。 Enable signals, such as row part enable signals 5170(1) and 5170(2), may originate from logic that is not shown, such as a memory controller.

圖61說明全域字線5180通過記憶體墊且形成用於可能不需要在墊外部投送之字線信號的旁路路徑。因此，圖61中所展示之實施例可以一些記憶體密度為代價來減小記憶體組之面積。 Figure 61 illustrates global word lines 5180 that pass through the memory pads and form bypass paths for word line signals, which may then not need to be routed outside the pads. Therefore, the embodiment shown in Figure 61 can reduce the area of the memory bank at the expense of some memory density.

在圖61中，全域字線可不間斷地通過記憶體墊且可能不連接至記憶體胞元。區域字線區段可由開關中之一者控制且連接至墊中之記憶體胞元。 In Figure 61, a global word line may pass through a memory pad uninterrupted and may not be connected to the memory cells. A local word line segment may be controlled by one of the switches and connected to the memory cells in the pad.

當記憶體墊群組提供字線之實質分割時，記憶體組可實際上支援完全隨機存取。 When the memory pad group provides substantial division of word lines, the memory group can actually support full random access.

用於減緩啟動信號沿著字線之散佈的另一實施例亦可節省一些佈線及邏輯，在記憶體墊之間使用開關及/或其他緩衝或隔離電路，而不使用專用賦能信號及專用線來輸送賦能信號。 Another embodiment for slowing the dispersion of the activation signal along the word line can also save some wiring and logic, using switches and/or other buffering or isolation circuits between memory pads, instead of using dedicated enable signals and dedicated Wire to convey the enabling signal.

舉例而言，比較器可用以控制開關或其他緩衝或隔離電路。當由比較器監視之字線區段上的信號之位準達到某一位準時，比較器可啟動開關或其他緩衝或隔離電路。舉例而言，某一位準可提示完全載入先前字線區段。 For example, the comparator can be used to control a switch or other buffer or isolation circuit. When the level of the signal on the word line segment monitored by the comparator reaches a certain level, the comparator can activate a switch or other buffering or isolation circuits. For example, a certain level may indicate that the previous word line segment is fully loaded.
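
The comparator-driven hand-off between word line segments can be approximated with a first-order RC charging model. This model is an assumption introduced here for illustration and does not appear in the source. In the sketch, each segment charges as V(t) = Vdd * (1 - exp(-t / RC)), and the comparator enables the next segment when the monitored segment crosses a threshold. The function name and default values are hypothetical.

```python
import math


def segment_start_times(num_segments, rc_seconds, vdd=1.0, threshold=0.9):
    """Return the time at which each word line segment starts charging.

    Each segment is modelled as an identical RC load (assumption). The
    comparator on segment i enables segment i + 1 once the voltage on
    segment i reaches `threshold`.
    """
    if num_segments < 1 or rc_seconds <= 0:
        raise ValueError("need at least one segment and a positive RC")
    if not 0 < threshold < vdd:
        raise ValueError("threshold must lie strictly between 0 and vdd")
    # Solve vdd * (1 - exp(-t / RC)) = threshold for t.
    crossing_time = -rc_seconds * math.log(1.0 - threshold / vdd)
    return [index * crossing_time for index in range(num_segments)]


if __name__ == "__main__":
    for start in segment_start_times(num_segments=3, rc_seconds=1e-9):
        print(f"{start:.3e} s")
```

A higher threshold gives more confidence that the previous segment is fully loaded, at the cost of a longer total activation time.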

圖62說明用於操作記憶體單元之方法5190。舉例而言，可使用上文關於圖56至圖61所描述之記憶體組中之任一者來實施方法5190。 Figure 62 illustrates a method 5190 for operating a memory unit. For example, the method 5190 can be implemented using any of the memory banks described above with respect to FIGS. 56-61.

方法5190可包括步驟5192及5194。 The method 5190 may include steps 5192 and 5194.

步驟5192可包括藉由啟動單元啟動包括於記憶體單元之第一記憶體墊中的第一群組記憶體胞元，而不啟動包括於記憶體單元之第二記憶體墊中的第二群組記憶體胞元。該等第一群組記憶體胞元及該等第二群組記憶體胞元可皆屬於該記憶體單元之單個列。 Step 5192 may include activating, by an activation unit, a first group of memory cells included in a first memory pad of the memory unit without activating a second group of memory cells included in a second memory pad of the memory unit. The first group of memory cells and the second group of memory cells may both belong to a single row of the memory unit.

步驟5194可包括藉由啟動單元啟動第二群組記憶體胞元，例如，在步驟5192之後。 Step 5194 may include activating the second group of memory cells by the activation unit, for example, after step 5192.

可在啟動第一群組記憶體胞元時，在完全啟動第一群組記憶體胞元之後，在第一群組記憶體胞元的啟動已完成之後起始的延遲時段期滿之後，在第一群組記憶體胞元不啟動之後及在類似情況下執行步驟5194。 Step 5194 may be executed while the first group of memory cells is being activated, once the first group of memory cells is fully activated, after expiration of a delay period that starts once activation of the first group of memory cells has completed, after the first group of memory cells is deactivated, and the like.

延遲時段可為固定或可調整的。舉例而言，延遲時段之持續時間可基於記憶體單元之預期存取圖案，或可無關於預期存取圖案而設定。延遲時段之範圍可介於少於一毫秒與多於一秒之間。 The delay period can be fixed or adjustable. For example, the duration of the delay period may be based on the expected access pattern of the memory cell, or may be set regardless of the expected access pattern. The range of the delay period can be between less than one millisecond and more than one second.

在一些實施例中，步驟5194可基於信號之值起始，該信號係在耦接至第一群組記憶體胞元的第一字線區段上產生的。舉例而言，當信號之值超過第一臨限值時，其可提示第一群組記憶體胞元完全啟動。 In some embodiments, step 5194 may be initiated based on the value of the signal generated on the first word line segment coupled to the first group of memory cells. For example, when the value of the signal exceeds the first threshold, it can prompt the first group of memory cells to be fully activated.

步驟5192及5194中之任一者可涉及使用安置於第一字線區段與第二字線區段之間的中間電路(例如，啟動單元之中間電路)。第一字線區段可耦接至第一記憶體胞元且第二字線區段可耦接至第二記憶體胞元。 Any of steps 5192 and 5194 may involve the use of an intermediate circuit (e.g., the intermediate circuit of the activation cell) disposed between the first word line section and the second word line section. The first word line segment can be coupled to the first memory cell and the second word line segment can be coupled to the second memory cell.

中間電路之實例貫穿圖56至圖61加以說明。 Examples of the intermediate circuit are described throughout FIGS. 56 to 61.

步驟5192及5194可進一步包括藉由控制單元來控制啟動信號自與單個列相關聯之字線至第一群組記憶體胞元及第二群組記憶體胞元的供應。 Steps 5192 and 5194 may further include controlling the supply of the activation signal from the word line associated with a single row to the first group of memory cells and the second group of memory cells by the control unit.
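
The two steps of method 5190, with a delay period between them, can be sketched as a small state machine. This is an illustrative model only: the class `ActivationUnit`, its `tick` method, and the assumption that the first group completes activation immediately in step 5192 are all hypothetical simplifications.

```python
class ActivationUnit:
    """Toy model of an activation unit serving two groups of one row."""

    def __init__(self, delay_ticks):
        if delay_ticks < 0:
            raise ValueError("delay must be non-negative")
        self.delay_ticks = delay_ticks
        self.active = {"first": False, "second": False}
        self._ticks_since_first = None  # None until step 5192 has run

    def step_5192(self):
        # Activate the first group only; the second group stays inactive.
        self.active["first"] = True
        self._ticks_since_first = 0

    def step_5194(self):
        # Activate the second group.
        self.active["second"] = True

    def tick(self):
        """Advance time by one unit and run step 5194 when the delay expires."""
        if self._ticks_since_first is None or self.active["second"]:
            return
        self._ticks_since_first += 1
        if self._ticks_since_first >= self.delay_ticks:
            self.step_5194()


if __name__ == "__main__":
    unit = ActivationUnit(delay_ticks=2)
    unit.step_5192()
    for _ in range(2):
        unit.tick()
        print(unit.active)
```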

使用記憶體並列性來加速測試時間及使用向量測試記憶體中之邏輯 Use memory parallelism to speed up test time and use vectors to test logic in memory

本發明之一些實施例可使用晶片內測試單元來加速測試。 Some embodiments of the present invention may use in-chip test units to accelerate testing.

一般而言，記憶體晶片測試需要大量測試時間。減少測試時間可減少生產成本且亦允許進行更多測試，以產生更可靠的產品。 Generally speaking, memory chip testing requires a lot of testing time. Reducing testing time can reduce production costs and also allow more testing to produce more reliable products.

圖63及圖64說明測試器5200及晶片(或晶片之晶圓)5210。測試器5200可包括管理測試之軟體。測試器5200可將不同資料序列運行至所有記憶體5210，且接著讀回該等序列以識別記憶體5210之發生故障的位元位於何處。一旦辨識到，測試器5200便可發出修復位元之命令，且若能夠修復問題，則測試器5200可聲明記憶體5210通過。在其他狀況下，可聲明一些晶片未通過。 Figures 63 and 64 illustrate the tester 5200 and the chip (or wafer of chips) 5210. The tester 5200 may include software for managing tests. The tester 5200 can run different data sequences to all memories 5210, and then read back the sequences to identify where the faulty bit of the memory 5210 is located. Once identified, the tester 5200 can issue a repair bit command, and if the problem can be repaired, the tester 5200 can declare that the memory 5210 has passed. In other situations, it may be declared that some chips have failed.

測試器5200可寫入測試序列且接著讀回資料以將其與預期結果進行比較。 The tester 5200 can write a test sequence and then read back the data to compare it with the expected result.

圖64展示測試系統，其具有測試器5200及被並列地測試之晶片(諸如，5210)之完整晶圓5202。舉例而言，測試器5200可藉由導線匯流排連接至晶片中之每一者。 Figure 64 shows a test system with a tester 5200 and a complete wafer 5202 of chips (such as 5210) being tested in parallel. For example, the tester 5200 can be connected to each of the chips by a wire bus.

如圖64中所展示，測試器5200必須數次讀取及寫入所有記憶體晶片，且彼資料必須經由外部晶片介面傳遞。 As shown in FIG. 64, the tester 5200 must read and write all memory chips several times, and the data must be transferred through the external chip interface.

此外，例如使用可程式化組態資訊測試積體電路之邏輯及記憶體組兩者可為有益的，該組態資訊可使用規則I/O操作來提供。 In addition, it can be beneficial to test both the logic of the integrated circuit and the memory bank, for example, using programmable configuration information, which can be provided using regular I/O operations.

該測試亦可受益於積體電路內存在測試單元。 This test can also benefit from the presence of test units in the integrated circuit.

該等測試單元可屬於積體電路且可分析測試結果，且找到例如邏輯(例如，如圖7A中所描繪及所描述之處理器子單元)及/或記憶體(例如，跨越複數個記憶體組)中的故障。 These test units may belong to the integrated circuit and may analyze the test results and find faults in, for example, the logic (e.g., the processor subunits depicted and described in FIG. 7A) and/or the memory (e.g., across a plurality of memory banks).

記憶體測試器通常極簡單且根據簡單格式與積體電路交換測試向量。舉例而言，可存在寫入向量，該等寫入向量包括成對的待寫入之記憶體條目的位址與待寫入至記憶體條目之值。亦可存在讀取向量，該讀取向量包括待讀取之記憶體條目的位址。寫入向量之位址中的至少一些可與讀取向量中之至少一些位址相同。寫入向量之至少一些其他位址可不同於讀取向量之至少一些其他位址。當經程式化時，記憶體測試器亦可接收預期結果向量，該預期結果向量可包括待讀取之記憶體條目的位址及待讀取之預期值。記憶體測試器可將預期值與其讀取值進行比較。 Memory testers are usually very simple and exchange test vectors with the integrated circuit according to a simple format. For example, there may be write vectors that include pairs of an address of a memory entry to be written and a value to be written to that memory entry. There may also be read vectors that include addresses of memory entries to be read. At least some of the addresses of the write vectors may be the same as at least some of the addresses of the read vectors. At least some other addresses of the write vectors may differ from at least some other addresses of the read vectors. When programmed, the memory tester may also receive an expected result vector, which may include the addresses of the memory entries to be read and the expected values to be read. The memory tester may compare the expected values with the values it reads.
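
By way of non-limiting illustration only, the write-vector and expected-result-vector format described above may be sketched in Python as follows. The names `WriteVector`, `ExpectedResult`, `run_memory_test`, and `StuckAtZeroMemory` are assumptions introduced for this sketch and do not appear in the disclosure; the memory is modeled as a plain mapping from address to value.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WriteVector:
    address: int   # memory entry to be written
    value: int     # value to be written to that entry

@dataclass(frozen=True)
class ExpectedResult:
    address: int   # memory entry to be read
    expected: int  # value the tester expects to read back

def run_memory_test(memory, writes, expected_results):
    """Apply the write vectors, then read back and compare with expectations.

    Returns the addresses whose read value differs from the expected value.
    """
    for w in writes:
        memory[w.address] = w.value
    return [e.address for e in expected_results if memory[e.address] != e.expected]

class StuckAtZeroMemory(dict):
    """Hypothetical faulty memory: the entry at bad_address always stores 0."""
    def __init__(self, bad_address):
        super().__init__()
        self.bad_address = bad_address

    def __setitem__(self, address, value):
        super().__setitem__(address, 0 if address == self.bad_address else value)

writes = [WriteVector(a, 0xA5) for a in range(4)]
expected = [ExpectedResult(a, 0xA5) for a in range(4)]
good_failures = run_memory_test({}, writes, expected)
bad_failures = run_memory_test(StuckAtZeroMemory(bad_address=2), writes, expected)
```

In this sketch a healthy memory yields an empty failure list, while the simulated defect at address 2 is reported by address only.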

根據實施例，積體電路(具有或不具有積體電路之記憶體)之邏輯(例如，處理器子單元)可藉由記憶體測試器使用同一協定/格式來測試。舉例而言，寫入向量中之一些值可為待由積體電路之邏輯執行的命令(且可例如涉及計算及/或記憶體存取)。可運用讀取向量及預期結果向量來程式化記憶體測試器，該預期結果向量可包括記憶體條目位址，該等記憶體條目位址中之至少一些儲存計算之預期值。因此，記憶體測試器可用於測試邏輯以及記憶體。記憶體測試器通常比邏輯測試器更簡單且更便宜，且所提議方法允許使用簡單的記憶體測試器執行複雜的邏輯測試。 According to an embodiment, the logic (e.g., the processor subunits) of the integrated circuit may be tested, with or without the memory of the integrated circuit, by the memory tester using the same protocol/format. For example, some of the values in a write vector may be commands to be executed by the logic of the integrated circuit (and may, for example, involve computations and/or memory accesses). The memory tester may be programmed with a read vector and an expected result vector, where the expected result vector may include memory entry addresses, at least some of which store expected values of the computations. Thus, the memory tester may be used to test the logic as well as the memory. Memory testers are generally simpler and cheaper than logic testers, and the proposed approach allows complex logic tests to be performed using a simple memory tester.
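
As one possible illustration, and not as a description of any actual command encoding, the following Python sketch models a chip in which a value written to a designated entry is interpreted by the logic as a command. The address `CMD_ADDRESS`, the tuple-shaped `("add", src_a, src_b, dst)` command, and the class `ChipWithLogic` are all hypothetical. The tester side uses only writes, reads, and comparisons, as a simple memory tester would.

```python
CMD_ADDRESS = 0xFF  # hypothetical entry that the logic watches for commands

class ChipWithLogic:
    """Toy model: one memory bank plus logic that executes 'add' commands."""
    def __init__(self):
        self.bank = {}

    def write(self, address, value):
        self.bank[address] = value
        if address == CMD_ADDRESS:
            # A written value may itself be a command for the logic.
            op, src_a, src_b, dst = value
            if op == "add":
                self.bank[dst] = self.bank[src_a] + self.bank[src_b]

    def read(self, address):
        return self.bank.get(address)

def run_vectors(chip, write_vector, expected_vector):
    """Tester side: plain writes, plain reads, and a comparison."""
    for address, value in write_vector:
        chip.write(address, value)
    return [addr for addr, want in expected_vector if chip.read(addr) != want]

write_vector = [(0, 7), (1, 5), (CMD_ADDRESS, ("add", 0, 1, 2))]
expected_vector = [(0, 7), (1, 5), (2, 12)]  # entry 2 holds the computed sum
mismatches = run_vectors(ChipWithLogic(), write_vector, expected_vector)
```

The expected result vector here covers both stored data (entries 0 and 1) and a computed value (entry 2), so one vector exercises the memory and the adder together.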

在一些實施例中，記憶體內之邏輯可藉由僅使用向量(或其他資料結構)而不使用邏輯測試中常見之更複雜機制(諸如，例如經由介面與控制器通信，告知邏輯待測試哪一電路)來對記憶體內之邏輯的測試賦能。 In some embodiments, the logic within the memory may enable testing of the logic within the memory using only vectors (or other data structures), without the more complex mechanisms common in logic testing (such as communicating with a controller via an interface to tell the logic which circuit to test).

替代使用測試單元，記憶體控制器可經組態以接收存取包括於組態資訊中之記憶體條目的指令，且執行存取指令並輸出結果。 Instead of using a test unit, the memory controller can be configured to receive instructions to access memory items included in the configuration information, and execute the access instructions and output the results.

圖65至圖69中所說明之積體電路中之任一者可執行測試，甚至在缺乏測試單元之情況下或在存在不能夠執行測試之測試單元的情況下亦如此。 Any one of the integrated circuits illustrated in FIGS. 65 to 69 can perform testing, even in the absence of test units or in the presence of test units that cannot perform the test.

本發明之實施例可包括使用記憶體並列性及內部晶片頻寬來加速及改善測試時間之方法及系統。 Embodiments of the present invention may include methods and systems that use memory parallelism and internal chip bandwidth to accelerate and improve test time.

該方法及系統可基於記憶體晶片測試本身(相對於測試器運行測試、讀取測試結果及分析結果)，保存結果且最終允許測試器讀取該等結果(且在需要時，往回程式化記憶體晶片，例如以啟動冗餘機構)。該測試可包括測試記憶體或測試記憶體組及邏輯(在具有要測試之起作用邏輯部分的運算記憶體之狀況下，諸如上文在圖7A中所描述之狀況)。 The method and system may be based on the memory chip testing itself (as opposed to the tester running the test, reading the test results, and analyzing them), saving the results, and eventually allowing the tester to read them (and, if needed, to program the memory chip back, for example to activate redundancy mechanisms). The test may include testing the memory, or testing the memory banks and the logic (in the case of a computational memory having functional logic portions to be tested, such as the case described above in FIG. 7A).

在一個實施例中，該方法可包括讀取及寫入晶片內之資料使得外部頻寬不限制測試。 In one embodiment, the method may include reading and writing data in the chip so that the external bandwidth does not limit the test.

在記憶體晶片包括處理器子單元之實施例中，每一處理器子單元可藉由測試程式碼或組態來程式化。 In embodiments where the memory chip includes processor sub-units, each processor sub-unit can be programmed by test code or configuration.

在記憶體晶片具有無法執行測試程式碼之處理器子單元或不具有處理器子單元但具有記憶體控制器的實施例中，記憶體控制器接著可經組態以讀取及寫入圖案(例如，在外部程式化至控制器)且標記故障之位置(例如，將值寫入至記憶體條目，讀取該條目，及接收不同於寫入值之值)以供進一步分析。 In embodiments where the memory chip has processor subunits that cannot execute test code, or has no processor subunits but has memory controllers, the memory controllers may be configured to read and write patterns (e.g., patterns programmed into the controllers externally) and to mark the locations of faults (e.g., writing a value to a memory entry, reading the entry, and receiving a value different from the written value) for further analysis.

應注意，測試記憶體可能需要測試大量位元，例如，測試記憶體之每一位元及驗證受測位元是否起作用。此外，有時可在不同電壓及溫度條件下重複記憶體測試。 It should be noted that testing memory may need to test a large number of bits, for example, testing each bit of the memory and verifying whether the tested bit works. In addition, sometimes the memory test can be repeated under different voltage and temperature conditions.

對於一些缺陷，可啟動一或多個冗餘機構(例如，藉由程式化快閃記憶體或OTP或燒斷熔斷器)。此外，可能亦必須測試記憶體晶片之邏輯及類比電路(例如，控制器、調節器、I/O)。 For some defects, one or more redundant mechanisms can be activated (for example, by programming a flash memory or OTP or blowing a fuse). In addition, it may also be necessary to test the logic and analog circuits of the memory chip (eg, controller, regulator, I/O).

在一個實施例中，積體電路可包括：基板、安置於基板上之記憶體陣列、安置於基板上之處理陣列，及安置於基板上之介面。 In one embodiment, the integrated circuit may include a substrate, a memory array disposed on the substrate, a processing array disposed on the substrate, and an interface disposed on the substrate.

本文中所描述之積體電路可包括於如圖3A、圖3B、圖4至圖6、圖7A至圖7D、圖11至圖13、圖16至圖19、圖22或圖23中之任一者中所說明的記憶體晶片中，可包括該記憶體晶片，或以其他方式包含該記憶體晶片。 The integrated circuit described herein may be included in, may include, or may otherwise comprise a memory chip as illustrated in any of FIGS. 3A, 3B, 4 to 6, 7A to 7D, 11 to 13, 16 to 19, 22, or 23.

圖65至圖69說明各種積體電路5210及測試器5200。 Figures 65 to 69 illustrate various integrated circuits 5210 and testers 5200.

該積體電路說明為包括記憶體組5212、晶片介面5211(諸如，由該等記憶體組共用之I/O控制器5214及匯流排5213)及邏輯單元(在下文中為「邏輯」)5215。圖66說明熔斷器介面5216及耦接至熔斷器介面及不同記憶體組之匯流排5217。 The integrated circuit is illustrated as including memory banks 5212, a chip interface 5211 (such as an I/O controller 5214 and a bus 5213 shared by the memory banks), and logic units (hereinafter "logic") 5215. Figure 66 illustrates a fuse interface 5216 and a bus 5217 coupled to the fuse interface and to the different memory banks.

圖65至圖70亦說明測試處理程序中之各種步驟，諸如： Figure 65 to Figure 70 also illustrate various steps in the test process, such as:

a.寫入測試序列5221(圖65、圖67、圖68及圖69)； a. Write test sequence 5221 (Figure 65, Figure 67, Figure 68 and Figure 69);

b.讀回測試結果5222(圖67、圖68及圖69)； b. Read back the test result 5222 (Figure 67, Figure 68 and Figure 69);

c.寫入預期結果序列5223(圖65)； c. Write the expected result sequence 5223 (Figure 65);

d.讀取故障位址以修復5224(圖66)；及 d. Read the faulty address to repair 5224 (Figure 66); and

e.程式化熔斷器5225(圖66)。 e. Programming fuses 5225 (Figure 66).

每一記憶體組可耦接至其自身的邏輯單元5215及/或由該邏輯單元來控制。然而，如上文所描述，可提供對邏輯單元5215之任何記憶體組分配。因此，邏輯單元5215之數目可不同於記憶體組之數目，邏輯單元可控制多於單個記憶體組或一記憶體組之一部分，及其類似者。 Each memory bank can be coupled to and/or controlled by its own logic unit 5215. However, as described above, any memory group allocation to the logic unit 5215 can be provided. Therefore, the number of logic units 5215 can be different from the number of memory groups, and the logic units can control more than a single memory group or a part of a memory group, and the like.

邏輯單元5215可包括一或多個測試單元。圖65說明邏輯5215內之測試單元(TU)5218。TU可包括於所有或一些邏輯單元5212中。應注意，測試單元可與邏輯單元分開或與邏輯單元整合。 The logic unit 5215 may include one or more test units. Figure 65 illustrates the test unit (TU) 5218 in the logic 5215. The TU may be included in all or some of the logical units 5212. It should be noted that the test unit can be separated from the logic unit or integrated with the logic unit.

圖65亦說明TU 5218內之測試圖案產生器(表示為GEN)5219。 Figure 65 also illustrates the test pattern generator (denoted as GEN) 5219 in the TU 5218.

測試圖案產生器可包括於所有或一些測試單元中。為簡單起見，測試圖案產生器及測試單元未說明於圖66至圖70中，但可包括於此等實施例中。 The test pattern generator may be included in all or some of the test units. For simplicity, the test pattern generator and the test unit are not illustrated in FIGS. 66 to 70, but may be included in these embodiments.

該記憶體陣列可包括多個記憶體組。此外，該處理陣列可包括複數個測試單元。該等複數個測試單元可經組態以測試多個記憶體組以提供測試結果。該介面可經組態以將提示測試結果之資訊輸出至在積體電路外部之裝置。 The memory array may include a plurality of memory banks. In addition, the processing array may include a plurality of test units. The plurality of test units may be configured to test the plurality of memory banks to provide test results. The interface may be configured to output information indicative of the test results to a device external to the integrated circuit.

該等複數個測試單元可包括至少一個測試圖案產生器，該至少一個測試圖案產生器經組態以產生至少一個測試圖案以供用於測試多個記憶體組中之一或多者。在一些實施例中，如上文所解釋，該等複數個測試單元中之每一者可包括測試圖案產生器，該測試圖案產生器經組態以產生測試圖案以供該等複數個測試單元中之特定測試單元使用以測試多個記憶體組中之至少一者。如上文所提示，圖65說明測試單元內之測試圖案產生器(GEN)5219。一或多個或甚至所有邏輯單元可包括測試圖案產生器。 The plurality of test units may include at least one test pattern generator configured to generate at least one test pattern for use in testing one or more of the plurality of memory banks. In some embodiments, as explained above, each of the plurality of test units may include a test pattern generator configured to generate a test pattern for use by a particular test unit of the plurality of test units to test at least one of the plurality of memory banks. As indicated above, FIG. 65 illustrates the test pattern generator (GEN) 5219 within the test unit. One or more, or even all, of the logic units may include a test pattern generator.

至少一個測試圖案產生器可經組態以自介面接收用於產生至少一個測試圖案之指令。測試圖案可包括在測試期間應存取(例如，讀取及/或寫入)之記憶體條目及/或待寫入至該等條目之值，及其類似者。 The at least one test pattern generator can be configured to receive instructions for generating at least one test pattern from the interface. The test pattern may include memory entries that should be accessed (for example, read and/or write) during the test and/or values to be written to these entries, and the like.
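
For illustration only, a test pattern generator that expands a compact instruction into the entries to be accessed and the values to be written might be sketched as below. The instruction format (the keys `kind`, `start`, and `count`) and the two pattern kinds are assumptions made for this sketch; the disclosure does not prescribe any particular encoding.

```python
def generate_pattern(instruction):
    """Expand a compact instruction into (address, value) pairs.

    The instruction keys 'kind', 'start' and 'count' are hypothetical.
    """
    start, count = instruction["start"], instruction["count"]
    kind = instruction["kind"]
    for offset in range(count):
        address = start + offset
        if kind == "checkerboard":
            # Alternate 0101... and 1010... bit patterns on adjacent entries.
            value = 0x55 if offset % 2 == 0 else 0xAA
        elif kind == "address_as_data":
            # Store the low byte of the address in the entry itself.
            value = address & 0xFF
        else:
            raise ValueError(f"unknown pattern kind: {kind}")
        yield address, value

pattern = list(generate_pattern({"kind": "checkerboard", "start": 16, "count": 4}))
```

Because the generator runs on the chip side in this sketch, only the short instruction would need to cross the external interface, not the expanded pattern.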

該介面可經組態以自可在積體電路外部之外部單元接收組態資訊，該組態資訊包括用於產生至少一個測試圖案之指令。 The interface can be configured to receive configuration information from external units that can be external to the integrated circuit, the configuration information including instructions for generating at least one test pattern.

至少一個測試圖案產生器可經組態以自記憶體陣列讀取組態資訊，該組態資訊包括用於產生至少一個測試圖案之指令。 The at least one test pattern generator can be configured to read configuration information from the memory array, the configuration information including commands for generating at least one test pattern.

在一些實施例中，該組態資訊可包括向量。 In some embodiments, the configuration information may include vectors.

該介面可經組態以自可在積體電路外部之裝置接收組態資訊，該組態資訊可包括可為至少一個測試圖案之指令。 The interface can be configured to receive configuration information from a device that can be external to the integrated circuit, and the configuration information can include commands that can be at least one test pattern.

舉例而言，至少一個測試圖案可包括待在記憶體陣列之測試期間存取的記憶體陣列條目。 For example, the at least one test pattern may include memory array entries to be accessed during testing of the memory array.

至少一個測試圖案進一步可包括待寫入至在記憶體陣列之測試期間存取之記憶體陣列條目的輸入資料。 The at least one test pattern may further include input data to be written to the memory array entry accessed during the test of the memory array.

另外或替代地，至少一個測試圖案進一步可包括待寫入至在記憶體陣列之測試期間存取之記憶體陣列條目的輸入資料，及待自在記憶體陣列之測試期間存取之記憶體陣列條目讀取的輸出資料之預期值。 Additionally or alternatively, the at least one test pattern may further include input data to be written to the memory array entries accessed during the test of the memory array, and expected values of output data to be read from the memory array entries accessed during the test of the memory array.

在一些實施例中，該等複數個測試單元可經組態以自記憶體陣列擷取一旦由該等複數個測試單元執行便使該等複數個測試單元測試該記憶體陣列之測試指令。 In some embodiments, the plurality of test units may be configured to retrieve the test instructions from the memory array that, once executed by the plurality of test units, cause the plurality of test units to test the memory array.

舉例而言，該等測試指令可包括於組態資訊中。 For example, these test commands can be included in the configuration information.

組態資訊可包括記憶體陣列之測試的預期結果。 The configuration information may include the expected result of the test of the memory array.

另外或替代地，該組態資訊可包括待自在記憶體陣列之測試期間存取之記憶體陣列條目讀取的輸出資料之值。 Additionally or alternatively, the configuration information may include the value of the output data to be read from the memory array entry accessed during the test of the memory array.

另外或替代地，該組態資訊可包括向量。 Additionally or alternatively, the configuration information may include vectors.

在一些實施例中，該等複數個測試單元可經組態以自記憶體陣列擷取一旦由該等複數個測試單元執行便使該等複數個測試單元測試該記憶體陣列且測試該處理陣列之測試指令。 In some embodiments, the plurality of test units may be configured to retrieve, from the memory array, test instructions that, once executed by the plurality of test units, cause the plurality of test units to test the memory array and to test the processing array.

該組態資訊可包括向量。 The configuration information may include vectors.

另外或替代地，該組態資訊可包括記憶體陣列及處理陣列之測試的預期結果。 Additionally or alternatively, the configuration information may include the expected result of the test of the memory array and the processing array.

在一些實施例中，如上文所描述，該等複數個測試單元可能缺乏測試圖案產生器，該測試圖案產生器用於產生在多個記憶體組之測試期間使用的測試圖案。 In some embodiments, as described above, the plurality of test units may lack a test pattern generator, which is used to generate test patterns used during testing of a plurality of memory banks.

在此等實施例中，該等複數個測試單元中之至少兩個可經組態以並列地測試多個記憶體組中之至少兩個。 In these embodiments, at least two of the plurality of test units may be configured to test at least two of the plurality of memory groups in parallel.

替代地，該等複數個測試單元中之至少兩個可經組態以串列地測試多個記憶體組中之至少兩個。 Alternatively, at least two of the plurality of test units may be configured to test at least two of the plurality of memory groups in series.
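
The difference between parallel and serial testing of banks may be sketched as follows, purely as a software analogy. Each call to `check_bank` stands in for one test unit testing one bank; `FaultyBank` is a hypothetical model of a bank with entries that ignore writes. Python threads are used here only to illustrate concurrent scheduling and do not model hardware timing.

```python
from concurrent.futures import ThreadPoolExecutor

class FaultyBank(list):
    """Hypothetical bank model: entries listed in stuck_addresses ignore writes."""
    def __init__(self, size, stuck_addresses=()):
        super().__init__([0] * size)
        self.stuck = set(stuck_addresses)

    def __setitem__(self, address, value):
        if address not in self.stuck:
            super().__setitem__(address, value)

def check_bank(bank):
    """One test unit testing one bank: write, read back, collect faults."""
    faults = []
    for address in range(len(bank)):
        bank[address] = 0x5A
        if bank[address] != 0x5A:
            faults.append(address)
    return faults

def run_serially(banks):
    # One bank after another.
    return [check_bank(bank) for bank in banks]

def run_in_parallel(banks):
    # One worker per bank, mirroring one test unit per bank.
    with ThreadPoolExecutor(max_workers=len(banks)) as pool:
        return list(pool.map(check_bank, banks))

serial_result = run_serially([FaultyBank(8), FaultyBank(8, {3})])
parallel_result = run_in_parallel([FaultyBank(8), FaultyBank(8, {3})])
```

Both schedules report the same faults; they differ only in whether the banks are tested one after another or at the same time.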

在一些實施例中，提示測試結果之資訊可包括故障記憶體陣列條目之識別符。 In some embodiments, the information indicative of the test results may include identifiers of faulty memory array entries.

在一些實施例中，該介面可經組態以在記憶體陣列之測試期間多次擷取由複數個測試電路獲得之部分測試結果。 In some embodiments, the interface can be configured to capture partial test results obtained by a plurality of test circuits multiple times during the test of the memory array.
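
A minimal sketch of retrieving partial results several times during a test, under the assumption that the bank is tested in fixed-size chunks, is given below. The function name `scan_in_chunks`, the chunking scheme, and the `stuck` set used to simulate defects are illustrative assumptions only.

```python
def scan_in_chunks(cells, stuck, chunk_size, value=0xC3):
    """Test cells chunk by chunk, yielding (chunk_start, faulty_addresses).

    stuck is a set of addresses that ignore writes (a simulated defect).
    """
    for start in range(0, len(cells), chunk_size):
        faults = []
        for address in range(start, min(start + chunk_size, len(cells))):
            if address not in stuck:
                cells[address] = value   # a healthy cell stores the value
            if cells[address] != value:  # read back and compare
                faults.append(address)
        # A partial result becomes available before the whole test finishes.
        yield start, faults

partial_results = list(scan_in_chunks([0] * 8, stuck={1, 6}, chunk_size=4))
```

Each yielded pair corresponds to one retrieval by the interface in this analogy, so faults in early chunks can be reported while later chunks are still being tested.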

在一些實施例中，該積體電路可包括錯誤校正單元，該錯誤校正單元經組態以校正在記憶體陣列之測試期間偵測到的至少一個錯誤。舉例而言，該錯誤校正單元可經組態以使用任何適當技術來修復記憶體誤差，例如藉由使一些記憶體字去能及用冗餘字替換該等字。 In some embodiments, the integrated circuit may include an error correction unit configured to correct at least one error detected during the test of the memory array. For example, the error correction unit may be configured to repair memory errors using any appropriate technique, such as by disabling some memory words and replacing them with redundant words.
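
One way to picture the replacement of faulty words with redundant words is a remap table, sketched below. The class `RepairableMemory`, its methods, and the first-free allocation of spare words are assumptions for illustration; an actual device might realize the same idea with fuses, OTP, or flash programming.

```python
class RepairableMemory:
    """Toy memory with a few spare (redundant) words and a remap table."""
    def __init__(self, size, spares):
        self.words = [0] * size
        self.spare_words = [0] * spares
        self.remap = {}  # faulty address -> index of the spare replacing it

    def repair(self, faulty_address):
        """Redirect a faulty word to an unused spare; False if none remain."""
        if faulty_address in self.remap:
            return True
        if len(self.remap) >= len(self.spare_words):
            return False
        self.remap[faulty_address] = len(self.remap)
        return True

    def write(self, address, value):
        if address in self.remap:
            self.spare_words[self.remap[address]] = value
        else:
            self.words[address] = value

    def read(self, address):
        if address in self.remap:
            return self.spare_words[self.remap[address]]
        return self.words[address]

mem = RepairableMemory(size=8, spares=1)
first_ok = mem.repair(3)    # uses the only spare word
mem.write(3, 99)            # lands in the spare, not in the disabled word
second_ok = mem.repair(5)   # no spare left
```

When `repair` returns False in this sketch, the redundancy is exhausted, which corresponds to a chip that cannot be repaired and would be declared as failing.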

在上文所描述之實施例中之任一者中，該積體電路可為記憶體晶片。 In any of the embodiments described above, the integrated circuit may be a memory chip.

舉例而言，該積體電路可包括分散式處理器，其中處理陣列可包括分散式處理器之複數個子單元，如圖7A中所描繪。 For example, the integrated circuit may include a distributed processor, where the processing array may include a plurality of subunits of the distributed processor, as depicted in FIG. 7A.

在此等實施例中，該等處理器子單元中之每一者可與多個記憶體組中之對應的專用記憶體組相關聯。 In these embodiments, each of the processor sub-units may be associated with a corresponding dedicated memory group among multiple memory groups.

在上文所描述之實施例中之任一者中，提示測試結果之資訊可提示至少一個記憶體組之狀態。可按一或多個粒度來提供記憶體組之狀態：每一記憶體字，每一條目群組，或每一完整記憶體組。 In any of the embodiments described above, the information indicative of the test results may indicate the status of at least one memory bank. The status of a memory bank may be provided at one or more granularities: per memory word, per group of entries, or per complete memory bank.
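
The three granularities can be illustrated with a small summary function. The name `summarize_status`, the dictionary layout, and the use of contiguous fixed-size entry groups are assumptions made for this sketch.

```python
def summarize_status(faulty_addresses, group_size):
    """Report the status of one bank at word, entry-group and bank granularity."""
    faulty = sorted(set(faulty_addresses))
    return {
        "words": faulty,                                       # per memory word
        "groups": sorted({a // group_size for a in faulty}),   # per entry group
        "bank": "fail" if faulty else "pass",                  # per complete bank
    }

status = summarize_status([18, 5, 17], group_size=8)
```

Coarser granularities produce less output: in this sketch three faulty words collapse into two faulty groups and a single bank-level flag.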

圖65至圖66說明測試器測試階段中之四個步驟。 Figure 65 to Figure 66 illustrate the four steps in the test phase of the tester.

在第一步驟中，測試器寫入(5221)測試序列，且組之邏輯單元將資料寫入至其記憶體。該邏輯亦可能足夠複雜以自測試器接收命令且其自身產生序列(如下文所解釋)。 In the first step, the tester writes (5221) the test sequence, and the logic units of the banks write the data to their memory. The logic may also be sophisticated enough to receive commands from the tester and generate the sequence itself (as explained below).

在第二步驟中，測試器將預期結果寫入(5223)至受測記憶體，且邏輯單元將預期結果與自其記憶體組讀取之資料進行比較，以保存錯誤清單。若邏輯足夠複雜以自身產生預期結果之序列(如下文所解釋)，則可簡化預期結果之寫入。 In the second step, the tester writes (5223) the expected result to the memory under test, and the logic unit compares the expected result with the data read from its memory bank to save the error list. If the logic is complex enough to produce a sequence of expected results by itself (as explained below), the writing of expected results can be simplified.

在第三步驟中，測試器自邏輯單元讀取(5224)故障位址。 In the third step, the tester reads (5224) the fault address from the logic unit.

在第四步驟中，測試器對結果採取動作(5225)且可修復錯誤。舉例而言，測試器可連接至特定介面以程式化記憶體中之熔斷器，但亦可使用允許程式化記憶體內之錯誤校正機構的任何其他機構。 In the fourth step, the tester takes action on the result (5225) and the error can be repaired. For example, the tester can be connected to a specific interface to program the fuses in the memory, but any other mechanism that allows programming of the error correction mechanism in the memory can also be used.

在此等實施例中，記憶體測試器可使用向量以測試記憶體。 In these embodiments, the memory tester can use vectors to test the memory.

舉例而言，每一向量可自輸入系列及輸出系列建置。 For example, each vector can be built from the input series and the output series.

輸入系列可包括成對的位址與寫入至記憶體之資料(在許多實施例中，此系列可模型化為允許程式在需要時產生的公式，該程式諸如由邏輯單元執行之程式)。 The input series may include pairs of an address and data to be written to the memory (in many embodiments, this series may be modeled as a formula that allows a program, such as a program executed by the logic unit, to generate the series when needed).

在一些實施例中，測試圖案產生器可產生此類向量。 In some embodiments, the test pattern generator can generate such vectors.

應注意，向量為實例資料結構，但一些實施例可使用其他資料結構。該等資料結構可與由位於積體電路外部之測試器產生的其他測試資料結構相容。 It should be noted that the vector is an example data structure, but some embodiments may use other data structures. These data structures are compatible with other test data structures generated by testers located outside the integrated circuit.

該輸出系列可包括位址與資料對，其包含待自記憶體讀回之預期資料(在一些實施例中，該系列可另外或替代地由程式在執行階段產生，例如藉由邏輯單元)。 The output series may include address and data pairs, which include expected data to be read back from memory (in some embodiments, the series may be generated in addition or alternatively by the program at the execution stage, such as by a logic unit).

記憶體測試通常包括執行向量清單，每一向量根據輸入系列將資料寫入至記憶體，且接著根據輸出系列讀回資料並將該資料與其預期資料進行比較。 Memory testing usually involves executing a list of vectors, each vector writes data to memory based on the input series, and then reads back the data based on the output series and compares the data with its expected data.

在失配之狀況下，記憶體可分類為發生故障的，或若記憶體包括用於冗餘之機構，則可啟動冗餘機構使得再次在經啟動冗餘機構上測試向量。 In the case of a mismatch, the memory can be classified as malfunctioning, or if the memory includes a mechanism for redundancy, the redundant mechanism can be activated so that the vector is tested on the activated redundant mechanism again.

在記憶體包括處理器子單元(如上文關於圖7A所描述)或含有許多記憶體控制器之實施例中，整個測試可由組之邏輯單元操縱。因此，記憶體控制器或處理器子單元可執行測試。 In embodiments where the memory includes processor subunits (as described above with respect to FIG. 7A) or contains many memory controllers, the entire test may be handled by the logic units of the banks. Thus, the memory controllers or the processor subunits may perform the test.

該記憶體控制器可自測試器程式化，且測試結果可保存於控制器本身中以稍後由測試器讀取。 The memory controller can be programmed from the tester, and the test results can be saved in the controller itself for later read by the tester.

為了組態及測試邏輯單元之操作，測試器可組態邏輯單元以用於記憶體存取且確認結果可藉由記憶體存取來讀取。 In order to configure and test the operation of the logic unit, the tester can configure the logic unit for memory access and confirm that the result can be read by the memory access.

舉例而言，輸入向量可含有用於邏輯單元之程式化序列，且輸出向量可含有此測試之預期結果。舉例而言，若諸如處理器子單元之邏輯單元包含經組態以對記憶體中之兩個位址執行運算的乘法器或加法器，則輸入向量可包括將資料寫入至記憶體之一組命令以及至加法器/乘法器邏輯之一組命令。只要可將加法器/乘法器結果讀回至輸出向量，便可將結果發送至測試器。 For example, the input vector may contain a programming sequence for the logic unit, and the output vector may contain the expected results of the test. For example, if a logic unit such as a processor subunit includes a multiplier or an adder configured to operate on two addresses in the memory, the input vector may include a set of commands that write data to the memory and a set of commands to the adder/multiplier logic. As long as the adder/multiplier results can be read back to the output vector, the results can be sent to the tester.

該測試可進一步包括自記憶體載入邏輯組態及將邏輯輸出發送至記憶體。 The test may further include loading logic configuration from memory and sending logic output to memory.

在邏輯單元自記憶體載入其組態(例如，若該邏輯為記憶體控制器)之實施例中，該邏輯單元可運行來自記憶體本身之程式碼。 In embodiments where the logic unit loads its configuration from memory (for example, if the logic is a memory controller), the logic unit can run code from the memory itself.

因此，該輸入向量可包括用於邏輯單元之程式，且該程式本身可測試邏輯單元中之各種電路。 Therefore, the input vector can include a program for the logic unit, and the program itself can test various circuits in the logic unit.

因此，測試可能不限於接收呈由外部測試器使用之格式的向量。 Therefore, testing may not be limited to receiving vectors in a format used by an external tester.

若載入至邏輯單元之命令發指令給邏輯單元以將結果寫回至記憶體組中，則測試器可讀取彼等結果且將該等結果與預期輸出系列進行比較。 If the commands loaded into the logic unit instruct the logic unit to write results back to the memory bank, the tester can read those results and compare them with the expected output series.

舉例而言，寫入至記憶體之向量可為或可包括用於邏輯單元之測試程式(例如，測試可假定記憶體有效，但即使記憶體無效，寫入之測試程式仍將不工作且測試將未通過，此為可接受之結果，此係因為晶片無論如何為無效的)及/或邏輯單元如何運行程式碼及將結果寫回至記憶體。由於邏輯單元之所有測試可經由記憶體進行(例如，將邏輯測試輸入寫入至記憶體及將測試結果寫回至該記憶體)，因此測試器可運用輸入序列及預期輸出序列來運行簡單的向量測試。 For example, the vector written to the memory may be or may include a test program for the logic unit (e.g., the test may assume that the memory is valid; even if the memory is invalid, the written test program will simply not work and the test will fail, which is an acceptable result because the chip is invalid anyway) and/or how the logic unit runs the code and writes the results back to the memory. Because all testing of the logic unit can be performed through the memory (e.g., writing the logic test inputs to the memory and writing the test results back to the memory), the tester can run a simple vector test with an input sequence and an expected output sequence.

邏輯組態及結果可作為讀取及/或寫入命令來存取。 The logical configuration and results can be accessed as read and/or write commands.

圖68說明發送寫入測試序列5221之測試器5200，該寫入測試序列為向量。 Figure 68 illustrates the tester 5200 sending a write test sequence 5221, which is a vector.

向量之部分包括在耦接至處理陣列之邏輯5215的記憶體組5212之間分裂的測試程式碼5232。 The part of the vector includes the test code 5232 split between the memory bank 5212 coupled to the logic 5215 of the processing array.

每一邏輯5215可執行儲存於其相關聯記憶體組中之程式碼5232，且該執行可包括存取一或多個記憶體組，執行計算及將結果(例如，測試結果5231)儲存於記憶體組5212中。 Each logic 5215 can execute code 5232 stored in its associated memory bank, and the execution may include accessing one or more memory banks, performing calculations and storing results (for example, test results 5231) in memory Body group 5212.
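
As a rough software analogy, test code stored in a bank and executed by the associated logic might look like the sketch below. The tuple-shaped instructions, the opcodes `add` and `mul`, and the address layout (data at 0 and 1, code at 100 and 101, results at 2 and 3) are all hypothetical and chosen only to keep the example small.

```python
def run_test_code(bank, code_base, code_len):
    """Logic executes the instructions stored in its own bank, in order."""
    for pc in range(code_base, code_base + code_len):
        op, a, b, dst = bank[pc]
        if op == "add":
            bank[dst] = bank[a] + bank[b]
        elif op == "mul":
            bank[dst] = bank[a] * bank[b]
        else:
            raise ValueError(f"unknown opcode: {op}")

# Tester side: one write sequence carries both the data and the test code.
bank = {}
write_sequence = {
    0: 6,
    1: 7,
    100: ("add", 0, 1, 2),  # exercises the adder, result stored at entry 2
    101: ("mul", 0, 1, 3),  # exercises the multiplier, result stored at entry 3
}
bank.update(write_sequence)

run_test_code(bank, code_base=100, code_len=2)

# Tester side: read the results back and compare with the expected series.
expected = {2: 13, 3: 42}
read_back = {address: bank[address] for address in expected}
```

If either the stored code or the logic were defective, the read-back values would differ from the expected series and the tester would flag the chip without needing any logic-specific test protocol.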

測試結果可由測試器5200發送回(例如，讀回結果5222)。 The test result may be sent back by the tester 5200 (eg, read back the result 5222).

此可允許邏輯5215受由I/O控制器5214接收之命令控制。 This allows the logic 5215 to be controlled by commands received by the I/O controller 5214.

在圖68中，I/O控制器5214連接至記憶體組及邏輯。在其他實施例中，邏輯可連接於I/O控制器5214與記憶體組之間。 In Figure 68, the I/O controller 5214 is connected to the memory bank and logic. In other embodiments, logic can be connected between the I/O controller 5214 and the memory bank.

圖70說明用於測試記憶體組之方法5300。舉例而言，可使用上文關於圖65至圖69所描述之記憶體組中之任一者來實施方法5300。 Figure 70 illustrates a method 5300 for testing a memory bank. For example, the method 5300 can be implemented using any of the memory groups described above with respect to FIGS. 65 to 69.

方法5300可包括步驟5302、5310及5320。步驟5302可包括接收測試積體電路之記憶體組的請求。該積體電路可包括：基板、安置於基板上且包含記憶體組之記憶體陣列、安置於基板上之處理陣列，及安置於基板上之介面。該處理陣列可包括複數個測試單元，如上文所描述。 The method 5300 may include steps 5302, 5310, and 5320. Step 5302 may include receiving a request to test the memory bank of the integrated circuit. The integrated circuit may include a substrate, a memory array arranged on the substrate and including a memory group, a processing array arranged on the substrate, and an interface arranged on the substrate. The processing array may include a plurality of test units, as described above.

在一些實施例中，該請求可包括組態資訊、一或多個向量、命令，及其類似者。 In some embodiments, the request may include configuration information, one or more vectors, commands, and the like.

在此等實施例中，該組態資訊可包括記憶體陣列之測試的預期結果、指令、資料、待自在記憶體陣列之測試期間存取之記憶體陣列條目讀取的輸出資料之值、測試圖案，及其類似者。 In these embodiments, the configuration information may include the expected results, commands, data of the memory array test, the value of the output data read from the memory array entry to be accessed during the test of the memory array, and the test Patterns, and similar ones.

該測試圖案可包括以下各者中之至少一者：(i)待在記憶體陣列之測試期間存取的記憶體陣列條目，(ii)待寫入至在記憶體陣列之測試期間存取之記憶體陣列條目的輸入資料，或(iii)待自在記憶體陣列之測試期間存取之記憶體陣列條目讀取的輸出資料之預期值。 The test pattern may include at least one of the following: (i) memory array entries to be accessed during the test of the memory array, (ii) input data to be written to the memory array entries accessed during the test of the memory array, or (iii) expected values of output data to be read from the memory array entries accessed during the test of the memory array.

步驟5302可包括以下各者中之至少一者及/或其後可接著以下各者中之至少一者： Step 5302 may include at least one of the following and/or may be followed by at least one of the following:

a.藉由至少一個測試圖案產生器自介面接收用於產生至少一個測試圖案之指令； a. At least one test pattern generator receives an instruction for generating at least one test pattern from the interface;

b.藉由該介面及自在積體電路外部之外部單元接收組態資訊，該組態資訊包括用於產生至少一個測試圖案之指令； b. Receiving, by the interface and from an external unit outside the integrated circuit, configuration information that includes instructions for generating at least one test pattern;

c.藉由至少一個測試圖案產生器自記憶體陣列讀取組態資訊，該組態資訊包括用於產生至少一個測試圖案之指令； c. Read configuration information from the memory array by at least one test pattern generator, the configuration information including instructions for generating at least one test pattern;

d.藉由該介面及自在積體電路外部之外部單元接收組態資訊，該組態資訊包含為至少一個測試圖案之指令； d. Receiving, by the interface and from an external unit outside the integrated circuit, configuration information that includes instructions that are the at least one test pattern;

e.藉由複數個測試單元及自記憶體陣列擷取一旦由該等複數個測試單元執行便使該等複數個測試單元測試記憶體陣列之測試指令；及 e. Retrieving, by the plurality of test units and from the memory array, test instructions that, once executed by the plurality of test units, cause the plurality of test units to test the memory array; and

f.藉由複數個測試單元及自該記憶體陣列接收一旦由該等複數個測試單元執行便使該等複數個測試單元測試記憶體陣列且測試處理陣列之測試指令。 f. Receiving, by the plurality of test units and from the memory array, test instructions that, once executed by the plurality of test units, cause the plurality of test units to test the memory array and to test the processing array.

步驟5302之後可接著步驟5310。步驟5310可包括藉由複數個測試單元且回應於請求而測試多個記憶體組以提供測試結果。 Step 5302 can be followed by step 5310. Step 5310 may include testing a plurality of memory groups by a plurality of test units in response to requests to provide test results.

方法5300可進一步包括藉由該介面在記憶體陣列之測試期間複數次接收由複數個測試電路獲得之部分測試結果。 The method 5300 may further include receiving part of the test results obtained by the plurality of test circuits multiple times during the test period of the memory array through the interface.

步驟5310可包括以下各者中之至少一者及/或其後可接著以下各者中之至少一者： Step 5310 may include at least one of the following and/or may be followed by at least one of the following:

a.藉由一或多個測試圖案產生器(例如，包括於複數個測試單元中之一者、一些或全部中)產生測試圖案以供一或多個測試單元用於測試多個記憶體組中之至少一者； a. Generate test patterns by one or more test pattern generators (for example, included in one, some or all of a plurality of test units) for one or more test units to test multiple memory groups At least one of;

b.藉由該等複數個測試單元中之至少兩個並列地測試多個記憶體組中之至少兩個； b. Testing at least two of the plurality of memory banks in parallel by at least two of the plurality of test units;

c.藉由該等複數個測試單元中之至少兩個串列地測試多個記憶體組中之至少兩個； c. Testing at least two of the plurality of memory banks in series by at least two of the plurality of test units;

d.將值寫入至記憶體條目，讀取記憶體條目及比較結果；及 d. Write the value to the memory entry, read the memory entry and compare the result; and

e.藉由錯誤校正單元校正在記憶體陣列之測試期間偵測到的至少一個錯誤。 e. Correct at least one error detected during the test of the memory array by the error correction unit.

步驟5310之後可接著步驟5320。步驟5320可包括藉由介面及在積體電路外部輸出提示測試結果之資訊。 Step 5310 may be followed by step 5320. Step 5320 may include outputting, by the interface and to outside the integrated circuit, information indicative of the test results.

提示測試結果之資訊可包括故障記憶體陣列條目之識別符。此可藉由不發送關於每一記憶體條目之讀取資料來節省時間。 The information indicative of the test results may include identifiers of faulty memory array entries. This may save time by not sending the read data for every memory entry.

另外或替代地，提示測試結果之資訊可提示至少一個記憶體組之狀態。 Additionally or alternatively, the information indicative of the test results may indicate the status of at least one memory bank.

因此，在一些實施例中，提示測試結果之資訊可比在測試期間寫入至記憶體組或自記憶體組讀取之資料單元的總大小小得多，且可比在無測試單元輔助之情況下可自測試記憶體之測試器發送的輸入資料小得多。 Thus, in some embodiments, the information indicative of the test results may be much smaller than the total size of the data units written to or read from the memory banks during the test, and may be much smaller than the input data that would be sent from a tester testing the memory without the assistance of the test units.
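
The size difference can be made concrete with a back-of-the-envelope estimate. The function below is a simplified model, not a measurement: it assumes that, without test units, every bit is written once and read once per test pass over the external interface, and that, with test units, only the addresses of faulty entries are output. The function name, its parameters, and the numbers in the usage line are arbitrary illustrative assumptions, and the configuration information sent to the chip is ignored.

```python
def external_traffic_bits(memory_bits, passes, faults, address_bits):
    """Estimate external-interface traffic, in bits, for the two approaches."""
    # Tester-driven: each bit is written once and read once per pass.
    without_test_units = 2 * passes * memory_bits
    # On-chip testing: only identifiers of faulty entries are output.
    with_test_units = faults * address_bits
    return without_test_units, with_test_units

# Arbitrary example: 1024-bit memory, 2 passes, 3 faults, 10-bit addresses.
tester_driven, on_chip = external_traffic_bits(1024, 2, 3, 10)
```

Under these assumptions the tester-driven traffic grows with the memory size and the number of passes, while the on-chip result grows only with the number of faults found.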

受測試積體電路可包含如先前諸圖中之任一者中所說明的記憶體晶片及/或分散式處理器。舉例而言，本文中所描述之積體電路可包括於如圖3A、圖3B、圖4至圖6、圖7A至圖7D、圖11至圖13、圖16至圖19、圖22或圖23中之任一者中所說明的記憶體晶片中，可包括該記憶體晶片，或以其他方式包含該記憶體晶片。 The integrated circuit under test may include a memory chip and/or a distributed processor as illustrated in any of the previous figures. For example, the integrated circuit described herein may be included in, may include, or may otherwise comprise a memory chip as illustrated in any of FIGS. 3A, 3B, 4 to 6, 7A to 7D, 11 to 13, 16 to 19, 22, or 23.

圖71說明用於測試積體電路之記憶體組的方法5350之實例。舉例而言，可使用上文關於圖65至圖69所描述之記憶體組中之任一者來實施方法5350。 Figure 71 illustrates an example of a method 5350 for testing the memory bank of an integrated circuit. For example, the method 5350 can be implemented using any of the memory groups described above with respect to FIGS. 65-69.

方法5350可包括步驟5352、5355及5358。步驟5352可包括藉由積體電路之介面接收包含指令之組態資訊。包括介面之積體電路亦可包括基板、包含記憶體組且安置於基板上之記憶體陣列、安置於基板上之處理陣列，及安置於基板上之介面。 Method 5350 may include steps 5352, 5355, and 5358. Step 5352 may include receiving, by an interface of the integrated circuit, configuration information that includes instructions. The integrated circuit that includes the interface may also include a substrate, a memory array that includes memory banks and is disposed on the substrate, a processing array disposed on the substrate, and the interface disposed on the substrate.

該組態資訊可包括記憶體陣列之測試的預期結果、指令、資料、待自在記憶體陣列之測試期間存取之記憶體陣列條目讀取的輸出資料之值、測試圖案，及其類似者。 The configuration information may include expected results of the test of the memory array, instructions, data, values of output data to be read from memory array entries accessed during the test of the memory array, test patterns, and the like.

另外或替代地，該組態資訊可包括指令、用以寫入該等指令之記憶體條目的位址、輸入資料，且亦可包括用以接收在指令執行期間計算之輸出值的記憶體條目之位址。 Additionally or alternatively, the configuration information may include instructions, addresses of the memory entries to which the instructions are to be written, and input data, and may also include addresses of memory entries for receiving output values calculated during execution of the instructions.

步驟5352之後可接著步驟5355。步驟5355可包括藉由處理陣列執行指令，該執行藉由存取記憶體陣列，執行運算操作及提供結果來進行。 Step 5352 may be followed by step 5355. Step 5355 may include executing the instructions by the processing array, the executing being performed by accessing the memory array, performing computational operations, and providing results.

步驟5355之後可接著步驟5358。步驟5358可包括藉由介面及在積體電路外部輸出提示結果之資訊。 Step 5355 may be followed by step 5358. Step 5358 may include outputting, by the interface and to outside the integrated circuit, information indicative of the results.
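The three steps of method 5350 can be sketched in Python as follows. This is only a toy model under stated assumptions: the `ConfigInfo` and `IntegratedCircuitModel` classes, the tuple-based instruction format, and the two supported operations are hypothetical names introduced for illustration and do not appear in the disclosure.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ConfigInfo:
    """Hypothetical container for the configuration information of step 5352."""
    instructions: list       # e.g. [("add", src_a, src_b, dst), ...]
    input_data: dict         # memory address -> value to preload
    output_addresses: list   # addresses whose values are reported in step 5358


@dataclass
class IntegratedCircuitModel:
    memory: dict = field(default_factory=dict)
    config: Optional[ConfigInfo] = None

    def receive_configuration(self, config):
        # Step 5352: receive configuration information through the interface.
        self.config = config
        self.memory.update(config.input_data)

    def execute(self):
        # Step 5355: access memory, perform computations, store results.
        ops = {"add": lambda a, b: a + b, "xor": lambda a, b: a ^ b}
        for op, src_a, src_b, dst in self.config.instructions:
            self.memory[dst] = ops[op](self.memory[src_a], self.memory[src_b])

    def output_result(self):
        # Step 5358: output only the information indicative of the results.
        return {addr: self.memory[addr] for addr in self.config.output_addresses}


chip = IntegratedCircuitModel()
chip.receive_configuration(ConfigInfo(
    instructions=[("add", 0x0, 0x1, 0x10), ("xor", 0x10, 0x1, 0x11)],
    input_data={0x0: 3, 0x1: 5},
    output_addresses=[0x10, 0x11],
))
chip.execute()
result = chip.output_result()
```

A tester could compare `result` against expected values supplied with the configuration information, without reading back the rest of the memory.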

網路(cyber)安全性及篡改偵測技術 Cyber security and tamper detection technology

記憶體晶片及/或處理器可為惡意行動者之目標，且可能會受到各種類型之網路攻擊。在一些狀況下，此類攻擊可能嘗試改變儲存於一或多個記憶體資源中之資料及/或程式碼。相對於經訓練神經網路或取決於儲存於記憶體中之大量資料的其他類型之人工智慧(AI)模型，網路攻擊可能尤其成問題。若所儲存資料被操縱或甚至遮蔽，則此操縱可為有害的。舉例而言，若資料密集型AI模型所依賴之資料被破壞或遮蔽，則依賴於該等模型以識別其他車輛或行人等之自主車輛系統可能會不正確地評估主機車輛之環境。結果，可能會發生事故。隨著AI模型在廣泛技術中變得愈來愈普遍，針對與此類模型相關聯之資料的網路攻擊可能造成重大破壞。 Memory chips and/or processors may be targets of malicious actors and may be subject to various types of cyber attacks. In some cases, such attacks may attempt to change data and/or code stored in one or more memory resources. Cyber attacks may be especially problematic with respect to trained neural networks or other types of artificial intelligence (AI) models that depend on large amounts of data stored in memory. If the stored data is manipulated or even obscured, the manipulation can be harmful. For example, an autonomous vehicle system that relies on data-intensive AI models to identify other vehicles, pedestrians, and the like may incorrectly assess the environment of the host vehicle if the data on which those models rely is corrupted or obscured. As a result, accidents may occur. As AI models become increasingly common across a wide range of technologies, cyber attacks against data associated with such models may cause significant damage.

在其他狀況下，網路攻擊可包括一或多個行動者篡改或嘗試篡改與處理器或其他類型之基於積體電路之邏輯元件相關聯的操作參數。舉例而言，處理器通常經設計以在某些操作規格內操作。涉及篡改之網路攻擊可試圖改變處理器、記憶體單元或其他電路之操作參數中之一或多者，使得處理器、記憶體單元或其他電路超出其設計操作規格(例如，時脈速度、頻寬規格、溫度限制、操作速率等)。此篡改可導致目標硬體發生故障。 In other situations, cyber attacks may include one or more actors tampering or attempting to tamper with operating parameters associated with the processor or other types of integrated circuit-based logic components. For example, processors are often designed to operate within certain operating specifications. Cyber attacks involving tampering can attempt to change one or more of the operating parameters of the processor, memory unit, or other circuit, causing the processor, memory unit, or other circuit to exceed its design operating specifications (for example, clock speed, Bandwidth specifications, temperature limits, operating speeds, etc.). This tampering can cause the target hardware to malfunction.

用於防禦網路攻擊之習知技術可包括在處理器層級操作之電腦程式(例如，防病毒軟體或防惡意軟體的軟體)。其他技術可包括使用與路由器或其他硬體相關聯的基於軟體之防火牆。雖然此等技術可使用在記憶體單元外部執行之軟體程式來對抗網路攻擊，但仍需要用於高效地保護儲存於記憶體單元中之資料的額外或替代技術，尤其在彼資料之準確性及可用性對諸如神經網路等之記憶體密集型應用之操作至關重要的情況下。本發明之實施例可提供包含記憶體之抵抗對記憶體之網路攻擊的各種積體電路設計。 Conventional technologies used to defend against cyber attacks may include computer programs that operate at the processor level (for example, anti-virus software or anti-malware software). Other technologies may include the use of software-based firewalls associated with routers or other hardware. Although these technologies can use software programs running outside the memory unit to fight against cyber attacks, they still need additional or alternative technologies to efficiently protect the data stored in the memory unit, especially in the accuracy of the data. And availability is critical to the operation of memory-intensive applications such as neural networks. The embodiments of the present invention can provide various integrated circuit designs including memory to resist cyber attacks on the memory.

以安全方式將敏感資訊及命令擷取至積體電路(例如，在至晶片/積體電路外部之介面尚未起作用時的開機處理程序期間)及接著維護積體電路內之敏感資訊及命令而不將其曝露於積體電路外部，此可增加敏感資訊及命令之安全性。CPU及其他類型之處理單元易受網路攻擊，尤其在彼等CPU/處理單元與外部記憶體一起操作時。包括安置於記憶體陣列當中之記憶體晶片上之分散式處理器子單元的所揭示實施例可能不易受到網路攻擊及篡改(例如，此係因為處理在記憶體晶片內發生)，該記憶體陣列包括複數個記憶體組。包括在下文更詳細地論述之所揭示安全措施的任何組合可進一步降低所揭示實施例對網路攻擊及/或篡改之易感性。 Retrieving sensitive information and commands into the integrated circuit in a secure manner (for example, during a boot process when the interfaces to the outside of the chip/integrated circuit are not yet functional), and then maintaining the sensitive information and commands within the integrated circuit without exposing them outside the integrated circuit, can increase the security of the sensitive information and commands. CPUs and other types of processing units are vulnerable to cyber attacks, especially when those CPUs/processing units operate with external memory. The disclosed embodiments, which include distributed processor subunits disposed on a memory chip among a memory array that includes a plurality of memory banks, may be less susceptible to cyber attacks and tampering (for example, because the processing occurs within the memory chip). Including any combination of the disclosed security measures discussed in more detail below may further reduce the susceptibility of the disclosed embodiments to cyber attacks and/or tampering.

圖72A為符合本發明之實施例的包括記憶體陣列及處理陣列之積體電路7200的圖解表示。舉例而言，積體電路7200可包括在以上章節中且貫穿本發明描述之記憶體晶片上分散式處理器架構(及特徵)中之任一者。記憶體陣列及處理陣列可形成於共同基板上，且在某些所揭示實施例中，積體電路7200可構成記憶體晶片。舉例而言，如上文所論述，積體電路7200可包括記憶體晶片，該記憶體晶片包括複數個記憶體組及在空間上分佈於記憶體晶片上之複數個處理器子單元，其中複數個記憶體組中之每一者與複數個處理器子單元中之專用的一或多者相關聯。在一些狀況下，每一處理器子單元可專用於一或多個記憶體組。 FIG. 72A is a diagrammatic representation of an integrated circuit 7200 including a memory array and a processing array in accordance with an embodiment of the present invention. For example, the integrated circuit 7200 may include any of the on-memory-chip distributed processor architectures (and features) described in the above sections and throughout the present invention. The memory array and the processing array can be formed on a common substrate, and in some disclosed embodiments, the integrated circuit 7200 can constitute a memory chip. For example, as discussed above, the integrated circuit 7200 may include a memory chip including a plurality of memory banks and a plurality of processor sub-units spatially distributed on the memory chip, of which a plurality of Each of the memory groups is associated with a dedicated one or more of the plurality of processor subunits. In some cases, each processor subunit can be dedicated to one or more memory banks.

在一些實施例中，記憶體陣列可包括複數個離散記憶體組7210_1、7210_2……7210_J1、7210_Jn，如圖72A中所展示。根據本發明之實施例，記憶體陣列7210可包含一或多種類型之記憶體，包括例如揮發性記憶體(諸如，RAM、DRAM、SRAM、相變RAM(PRAM)、磁阻式RAM(MRAM)、電阻式RAM(ReRAM)或其類似者)或非揮發性記憶體(諸如，快閃記憶體或ROM)。根據本發明之一些實施例，記憶體組7210_1至7210_Jn可包括複數個MOS記憶體結構。 In some embodiments, the memory array may include a plurality of discrete memory groups 7210_1, 7210_2...7210_J1, 7210_Jn, as shown in FIG. 72A. According to an embodiment of the present invention, the memory array 7210 may include one or more types of memory, including, for example, volatile memory (such as RAM, DRAM, SRAM, phase change RAM (PRAM), magnetoresistive RAM (MRAM)) , Resistive RAM (ReRAM) or the like) or non-volatile memory (such as flash memory or ROM). According to some embodiments of the present invention, the memory banks 7210_1 to 7210_Jn may include a plurality of MOS memory structures.

如上文所提及，處理陣列可包括複數個處理器子單元7220_1至7220_K。在一些實施例中，處理器子單元7220_1至7220_K中之每一者可與複數個離散記憶體組7210_1至7210_Jn當中之一或多個離散記憶體組相關聯。雖然圖72A之實例實施例說明每一處理器子單元與兩個離散記憶體組7210相關聯，但應瞭解，每一處理器子單元可與任何數目個離散的專用記憶體組相關聯。且反之亦然，每一記憶體組可與任何數目個處理器子單元相關聯。根據本發明之實施例，包括於積體電路7200之記憶體陣列中的離散記憶體組之數目可等於、小於或大於包括於積體電路7200之處理陣列中的處理器子單元之數目。 As mentioned above, the processing array may include a plurality of processor subunits 7220_1 to 7220_K. In some embodiments, each of the processor subunits 7220_1 to 7220_K may be associated with one or more discrete memory banks among the plurality of discrete memory banks 7210_1 to 7210_Jn. Although the example embodiment of FIG. 72A illustrates each processor subunit as associated with two discrete memory banks 7210, it should be understood that each processor subunit may be associated with any number of discrete dedicated memory banks. Conversely, each memory bank may be associated with any number of processor subunits. According to embodiments of the present invention, the number of discrete memory banks included in the memory array of the integrated circuit 7200 may be equal to, less than, or greater than the number of processor subunits included in the processing array of the integrated circuit 7200.

積體電路7200可進一步包括符合本發明之實施例(且如描述於以上章節中)的複數個第一匯流排7260。每一匯流排7260可將處理器子單元7220_k連接至對應的專用記憶體組7210_j。根據本發明之一些實施例，積體電路7200可進一步包括複數個第二匯流排7261。每一匯流排7261可將處理器子單元7220_k連接至另一處理器子單元7220_k+1。如圖72A中所展示，複數個處理器子單元7220_1至7220_K可經由匯流排7261連接至彼此。雖然圖72A將形成迴路之複數個處理器子單元7220_1至7220_K說明為其經由匯流排7261串聯連接，但應瞭解，處理器單元7220可用任何其他方式連接。舉例而言，在一些狀況下，特定處理器子單元可能不經由匯流排7261連接至其他處理器子單元。在其他狀況下，特定處理器子單元可僅連接至一個其他處理器子單元，且在另外其他狀況下，特定處理器子單元可經由一或多個匯流排7261連接至兩個或多於兩個其他處理器子單元(例如，形成串聯連接、並聯連接、分支連接等)。應注意，本文中所描述之積體電路7200的實施例僅為例示性的。在一些狀況下，積體電路7200可具有不同的內部組件及連接，且在其他狀況下，可省略內部組件及所描述連接中之一或多者(例如，取決於特定應用之需要)。 The integrated circuit 7200 may further include a plurality of first buses 7260 consistent with embodiments of the present invention (and as described in the sections above). Each bus 7260 may connect a processor subunit 7220_k to a corresponding dedicated memory bank 7210_j. According to some embodiments of the present invention, the integrated circuit 7200 may further include a plurality of second buses 7261. Each bus 7261 may connect a processor subunit 7220_k to another processor subunit 7220_k+1. As shown in FIG. 72A, the plurality of processor subunits 7220_1 to 7220_K may be connected to one another via buses 7261. Although FIG. 72A illustrates the plurality of processor subunits 7220_1 to 7220_K as connected in series via buses 7261 to form a loop, it should be understood that the processor subunits 7220 may be connected in any other manner. For example, in some cases, a particular processor subunit may not be connected to other processor subunits via a bus 7261. In other cases, a particular processor subunit may be connected to only one other processor subunit, and in still other cases, a particular processor subunit may be connected to two or more other processor subunits via one or more buses 7261 (for example, forming series connections, parallel connections, branched connections, and the like).
It should be noted that the embodiment of the integrated circuit 7200 described herein is only illustrative. In some cases, the integrated circuit 7200 may have different internal components and connections, and in other cases, one or more of the internal components and the described connections may be omitted (for example, depending on the needs of a particular application).

返回參看圖72A，積體電路7200可包括用於相對於積體電路7200實施至少一個安全措施的一或多個結構。在一些狀況下，此等結構可經組態以偵測操縱或遮蔽(或嘗試操縱或遮蔽)儲存於記憶體組中之一或多者中之資料的網路攻擊。在其他狀況下，此等結構可經組態以偵測篡改與積體電路7200相關聯之操作參數或篡改直接或間接影響與積體電路7200相關聯之一或多個操作的一或多個硬體元件(無論包括於積體電路7200內抑或積體電路7200外部)。 Referring back to FIG. 72A, the integrated circuit 7200 may include one or more structures for implementing at least one safety measure with respect to the integrated circuit 7200. In some cases, these structures can be configured to detect cyber attacks that manipulate or obscure (or attempt to manipulate or obscure) data stored in one or more of the memory banks. In other situations, these structures can be configured to detect tampering with the operating parameters associated with the integrated circuit 7200, or tampering directly or indirectly affects one or more of one or more operations associated with the integrated circuit 7200 Hardware components (whether included in the integrated circuit 7200 or outside the integrated circuit 7200).

在一些狀況下，控制器7240可包括於積體電路7200中。控制器7240可經由一或多個匯流排7250連接至例如處理器子單元7220_1……7220_k中之一或多者。控制器7240亦可連接至記憶體組7210_1……7210_Jn中之一或多者。雖然圖72A之實例實施例展示一個控制器7240，但應理解，控制器7240可包括多個處理器元件及/或邏輯電路。在所揭示實施例中，控制器7240可經組態以相對於積體電路7200之至少一個操作實施至少一個安全措施。另外，在所揭示實施例中，若至少一個安全措施被觸發，則控制器7240可經組態以採取(或引起)一或多個補救動作。 In some cases, a controller 7240 may be included in the integrated circuit 7200. The controller 7240 may be connected via one or more buses 7250 to, for example, one or more of the processor subunits 7220_1...7220_k. The controller 7240 may also be connected to one or more of the memory banks 7210_1...7210_Jn. Although the example embodiment of FIG. 72A shows a single controller 7240, it should be understood that the controller 7240 may include multiple processor elements and/or logic circuits. In the disclosed embodiments, the controller 7240 may be configured to implement at least one security measure with respect to at least one operation of the integrated circuit 7200. In addition, in the disclosed embodiments, the controller 7240 may be configured to take (or cause) one or more remedial actions if the at least one security measure is triggered.

根據本發明之一些實施例，至少一個安全措施可包括用於鎖定對積體電路7200之某些態樣之存取的控制器實施處理程序。存取鎖定涉及使控制器防止自晶片外部對記憶體之某些區的存取(讀取及/或寫入)。可按位址解析度、記憶體組解析度之部分、記憶體組解析度及其類似者來應用存取控制。在一些狀況下，可鎖定與積體電路7200相關聯之記憶體中的一或多個實體位置(例如，積體電路7200之一或多個記憶體組或記憶體組中之一或多者的任何部分)。在一些實施例中，控制器7240可鎖定對與人工智慧模型(或其他類型的基於軟體之系統)之執行相關聯的積體電路7200之某些部分的存取。舉例而言，在一些實施例中，控制器7240可鎖定對儲存於與積體電路7200相關聯之記憶體中的神經網路模型之權重的存取。應注意，軟體程式(亦即，模型)可包括三個組件，包括：程式之輸入資料、程式之程式碼資料及執行程式之輸出資料。此等組件亦可適用於神經網路模型。在此模型之操作期間，可產生輸入資料並將其饋入至模型，且執行模型可產生輸出資料以供讀取。然而，與使用所接收輸入資料執行模型相關聯的程式碼及資料值(例如，預定模型權重等)可保持固定。 According to some embodiments of the present invention, the at least one security measure may include a controller implementation processing program for locking access to certain aspects of the integrated circuit 7200. Access locking involves making the controller prevent access (reading and/or writing) to certain areas of the memory from outside the chip. Access control can be applied based on address resolution, part of memory bank resolution, memory bank resolution, and the like. In some cases, one or more physical locations in the memory associated with the integrated circuit 7200 may be locked (for example, one or more memory groups or one or more of the memory groups of the integrated circuit 7200 Any part of). In some embodiments, the controller 7240 may lock access to certain parts of the integrated circuit 7200 associated with the execution of an artificial intelligence model (or other type of software-based system). For example, in some embodiments, the controller 7240 may lock access to the weights of the neural network model stored in the memory associated with the integrated circuit 7200. It should be noted that a software program (ie, a model) may include three components, including: input data of the program, code data of the program, and output data of the execution program. These components can also be applied to neural network models. 
During the operation of the model, input data can be generated and fed to the model, and the execution model can generate output data for reading. However, the code and data values (for example, predetermined model weights, etc.) associated with executing the model using the received input data may remain fixed.

如本文中所描述，鎖定可指控制器例如不允許自晶片/積體電路外部起始之相對於記憶體之某些區的讀取或寫入操作的操作。晶片/積體電路之I/O可通過的控制器不僅可鎖定全部記憶體組，而且可鎖定記憶體組內之記憶體位址的任何範圍，自單個記憶體位址至包括可用記憶體組之所有位址的位址範圍(或兩者之間的任何位址範圍)。 As described herein, locking may refer to an operation of the controller that, for example, disallows read or write operations that are initiated from outside the chip/integrated circuit and directed to certain regions of the memory. The controller, through which the I/O of the chip/integrated circuit may pass, may lock not only entire memory banks but also any range of memory addresses within a memory bank, from a single memory address to an address range that includes all addresses of the available memory banks (or any address range in between).

因為與接收輸入資料及儲存輸出資料相關聯之記憶體位置係與改變值及與積體電路7200外部之組件(例如，供應輸入資料或接收輸出資料之組件)的互動相關聯，所以鎖定對彼等記憶體位置之存取在一些狀況下可能不切實際。另一方面，限制對與模型程式碼及固定資料值相關聯之記憶體位置的存取可有效抵抗某些類型之網路攻擊。因此，在一些實施例中，作為安全措施，可鎖定與程式碼及資料值相關聯之記憶體(例如，不用於寫入/接收輸入資料及用於讀取/提供輸出資料之記憶體)。限制存取可包括鎖定某些記憶體位置使得無法對某些程式碼及/或資料值(例如，與基於所接收輸入資料執行模型相關聯的彼等程式碼及/或資料值)進行改變。另外，亦可鎖定與中間資料(例如，在執行模型期間產生之資料)相關聯之記憶體區域以抵抗外部存取。因此，雖然各種運算邏輯(無論為在積體電路7200板上抑或位於積體電路7200外部)可將資料提供至與接收輸入資料或擷取所產生輸出資料相關聯之記憶體位置或自該等記憶體位置接收資料，但此運算邏輯將不能夠基於所接收輸入資料來存取或修改儲存與程式執行相關聯之程式碼及資料值的記憶體位置。 Because the memory locations associated with receiving input data and storing output data are associated with changing values and with interactions with components outside the integrated circuit 7200 (for example, components that supply input data or receive output data), locking access to those memory locations may be impractical in some cases. On the other hand, restricting access to memory locations associated with model code and fixed data values can be effective against certain types of cyber attacks. Therefore, in some embodiments, memory associated with code and data values (for example, memory not used for writing/receiving input data or for reading/providing output data) may be locked as a security measure. Restricting access may include locking certain memory locations so that certain code and/or data values (for example, those associated with executing the model based on received input data) cannot be changed. In addition, memory regions associated with intermediate data (for example, data generated during execution of the model) may also be locked against external access. 
Therefore, although various computational logic (whether on board the integrated circuit 7200 or located outside it) may provide data to, or receive data from, memory locations associated with receiving input data or retrieving generated output data, such computational logic will not be able to access or modify the memory locations that store the code and data values associated with program execution based on the received input data.

除鎖定積體電路7200上之記憶體位置以提供安全措施以外，亦可藉由限制對經組態以執行與特定程式或模型相關聯之程式碼的某些運算邏輯元件(及其存取之記憶體區)的存取來實施其他安全措施。在一些狀況下，可相對於位於積體電路7200上之運算邏輯(及其相關聯之記憶體區)(例如，運算記憶體(例如，包括運算能力之記憶體，諸如本文中所揭示之記憶體晶片上的分散式處理器)等)實現此存取約束。亦可鎖定/限制對與儲存於積體電路7200之鎖定記憶體部分中的程式碼之任何執行相關聯或與對儲存於積體電路7200之鎖定記憶體部分中的資料值之任何存取相關聯的運算邏輯(及相關聯之記憶體位置)之存取，而無關於彼運算邏輯是否位於積體電路7200板上。限制對負責執行程式/模型之運算邏輯的存取可進一步確保與對所接收輸入資料之操作相關聯的程式碼及資料值仍受到保護以免被操縱、遮蔽等。 In addition to locking memory locations on the integrated circuit 7200 to provide a security measure, other security measures may be implemented by restricting access to certain computational logic elements (and the memory regions they access) that are configured to execute code associated with a particular program or model. In some cases, this access restriction may be implemented with respect to computational logic (and its associated memory regions) located on the integrated circuit 7200 (for example, computational memory, i.e., memory that includes computational capabilities, such as the distributed processors on a memory chip disclosed herein). Access to computational logic (and associated memory locations) that is associated with any execution of code stored in a locked memory portion of the integrated circuit 7200, or with any access to data values stored in a locked memory portion of the integrated circuit 7200, may also be locked/restricted, regardless of whether that computational logic is located on board the integrated circuit 7200. Restricting access to the computational logic responsible for executing the program/model can further ensure that the code and data values associated with operating on the received input data remain protected from manipulation, obscuring, and the like.

可用任何合適的方式實現控制器實施之安全措施，包括鎖定或限制對與積體電路7200之記憶體陣列之某些部分相關聯的基於硬體之區的存取。在一些實施例中，可藉由將命令添加或供應至經組態以使控制器7240鎖定某些記憶體部分之控制器7240來實施此鎖定。在一些實施例中，待鎖定的基於硬體之記憶體部分可由特定記憶體位址(例如，與記憶體組7210_1……7210_J2等之任何記憶體元件相關聯的位址)指明。在一些實施例中，記憶體之鎖定區可在程式或模型執行期間保持固定。在其他狀況下，鎖定區可為可組態的。亦即，在一些狀況下，可向控制器7240供應命令使得在程式或模型之執行期間，鎖定區可改變。舉例而言，在特定時間，可將某些記憶體位置添加至記憶體之鎖定區。或在特定時間，可自記憶體之鎖定區排除某些記憶體位置(例如，先前鎖定之記憶體位置)。 The security measures implemented by the controller can be implemented in any suitable manner, including locking or restricting access to hardware-based areas associated with certain portions of the memory array of the integrated circuit 7200. In some embodiments, this locking can be implemented by adding or supplying commands to the controller 7240 that is configured to cause the controller 7240 to lock certain memory portions. In some embodiments, the hardware-based memory portion to be locked may be specified by a specific memory address (for example, an address associated with any memory element of the memory group 7210_1...7210_J2, etc.). In some embodiments, the locked area of the memory can be kept fixed during the execution of the program or model. In other situations, the locked area can be configurable. That is, in some situations, a command can be supplied to the controller 7240 so that the lock area can be changed during the execution of the program or model. For example, at a specific time, certain memory locations can be added to the locked area of the memory. Or at a specific time, certain memory locations (for example, previously locked memory locations) can be excluded from the locked area of the memory.

可用任何合適的方式實現某些記憶體位置之鎖定。在一些狀況下，鎖定記憶體位置之記錄(例如，儲存及識別鎖定記憶體位址之檔案、資料庫、資料結構等)可為可由控制器7240存取的，使得控制器7240可判定某一記憶體請求是否與鎖定記憶體位置相關。在一些狀況下，控制器7240維護鎖定位址之資料庫以使用控制對某些記憶體位置之存取。在其他狀況下，控制器可具有可組態直至鎖定之表或一或多個暫存器之集合，且可包括識別待鎖定之記憶體位置(例如，應限制自晶片外部對該等記憶體位置之記憶體存取)的固定預定值。舉例而言，當請求記憶體存取時，控制器7240可比較與記憶體存取請求相關聯之記憶體位址與鎖定記憶體位址。若判定與記憶體存取請求相關聯之記憶體位址在鎖定記憶體位址之清單內，則可拒絕記憶體存取請求(例如，讀取抑或寫入操作)。 Locking of certain memory locations may be accomplished in any suitable manner. In some cases, a record of locked memory locations (for example, a file, database, data structure, or the like that stores and identifies locked memory addresses) may be accessible by the controller 7240, so that the controller 7240 can determine whether a given memory request relates to a locked memory location. In some cases, the controller 7240 maintains a database of locked addresses that it uses to control access to certain memory locations. In other cases, the controller may have a table, or a set of one or more registers, that is configurable until locked and that may include fixed predetermined values identifying the memory locations to be locked (for example, memory locations to which memory access from outside the chip should be restricted). For example, when a memory access is requested, the controller 7240 may compare the memory address associated with the memory access request with the locked memory addresses. If the memory address associated with the memory access request is determined to be within the list of locked memory addresses, the memory access request (for example, a read or write operation) may be denied.
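The address comparison described above can be sketched as follows. This is a minimal Python model, not the controller design of the disclosure: the `AccessController` and `LockedRange` names, the flat address space, and the convention of returning `False` or `None` for a denied request are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LockedRange:
    start: int  # first locked address (inclusive)
    end: int    # last locked address (inclusive)


class AccessController:
    """Hypothetical model of a controller that filters off-chip memory requests."""

    def __init__(self, memory_size):
        self.memory = [0] * memory_size
        self.locked = []  # record of locked address ranges

    def lock(self, start, end):
        self.locked.append(LockedRange(start, end))

    def unlock(self, start, end):
        self.locked = [r for r in self.locked if (r.start, r.end) != (start, end)]

    def is_locked(self, address):
        # Compare the requested address against every locked range.
        return any(r.start <= address <= r.end for r in self.locked)

    def external_write(self, address, value):
        if self.is_locked(address):
            return False  # request denied, memory unchanged
        self.memory[address] = value
        return True

    def external_read(self, address):
        if self.is_locked(address):
            return None  # request denied, nothing disclosed
        return self.memory[address]


ctrl = AccessController(memory_size=256)
ctrl.lock(0x40, 0x7F)                    # e.g. a region holding model weights
ok_input = ctrl.external_write(0x10, 42)  # input region: allowed
denied = ctrl.external_write(0x40, 99)    # locked region: rejected
```

Because ranges can be added with `lock` and removed with `unlock` at any time, the same structure also models a locked region that changes during program execution.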

如上文所論述，至少一個安全措施可包括鎖定對不用於接收輸入資料或用於提供對所產生輸出資料之存取的記憶體陣列7210之某些記憶體部分的存取。在一些狀況下，可調整鎖定區內之記憶體部分。舉例而言，可將鎖定記憶體部分解除鎖定，且可鎖定非鎖定記憶體部分。任何合適的方法可用於將鎖定記憶體部分解除鎖定。舉例而言，所實施之安全措施可包括需要用於將鎖定記憶體區之一或多個部分解除鎖定的複雜密碼。 As discussed above, at least one security measure may include locking access to certain memory portions of the memory array 7210 that are not used for receiving input data or for providing access to generated output data. In some cases, the memory portions within the locked region may be adjusted. For example, locked memory portions may be unlocked, and unlocked memory portions may be locked. Any suitable method may be used to unlock a locked memory portion. For example, the implemented security measure may include requiring a complex password (passphrase) to unlock one or more portions of the locked memory region.

在偵測到對抗所實施之安全措施的任何動作後，可觸發所實施之安全措施。舉例而言，嘗試對鎖定記憶體部分進行存取(無論為讀取抑或寫入請求)可觸發安全措施。另外，若所鍵入之複雜密碼(例如，試圖將鎖定記憶體部分解除鎖定)不匹配預定複雜密碼，則可觸發安全措施。在一些狀況下，若在可允許的臨限數目次複雜密碼條目嘗試(例如，1次、2次、3次等)中未提供正確的複雜密碼，則可觸發安全措施。 After detecting any action against the implemented security measures, the implemented security measures can be triggered. For example, an attempt to access a portion of the locked memory (whether it is a read or write request) can trigger security measures. In addition, if the entered complex password (for example, an attempt to unlock part of the locked memory) does not match the predetermined complex password, a security measure can be triggered. In some situations, if the correct complex password is not provided in the allowable threshold number of complex password entry attempts (eg, 1, 2, 3, etc.), security measures can be triggered.
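The attempt-threshold trigger described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the `UnlockGuard` class, its attribute names, and the choice to refuse all further attempts once triggered are hypothetical. The constant-time comparison via the standard library's `hmac.compare_digest` is a common precaution added here, not something the source specifies.

```python
import hmac


class UnlockGuard:
    """Hypothetical guard that triggers after too many wrong unlock attempts."""

    def __init__(self, secret: bytes, max_attempts: int = 3):
        self._secret = secret
        self._max_attempts = max_attempts
        self._failed = 0
        self.triggered = False  # set once the security measure fires
        self.unlocked = False

    def try_unlock(self, candidate: bytes) -> bool:
        if self.triggered:
            return False  # no further attempts are accepted (assumption)
        if hmac.compare_digest(candidate, self._secret):
            self.unlocked = True
            self._failed = 0
            return True
        self._failed += 1
        if self._failed >= self._max_attempts:
            self.triggered = True  # a remedial action would be taken here
        return False


guard = UnlockGuard(secret=b"correct horse battery staple", max_attempts=2)
first = guard.try_unlock(b"guess-1")
second = guard.try_unlock(b"guess-2")
late = guard.try_unlock(b"correct horse battery staple")
```

Setting `max_attempts=1` models the strictest case mentioned in the text, in which a single mismatch triggers the security measure.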

可在任何合適的時間鎖定記憶體部分。舉例而言，在一些狀況下，可在程式執行期間之各個時間鎖定記憶體部分。在其他狀況下，可在起動後或在程式/模型執行之前鎖定記憶體部分。舉例而言，可連同程式/模型程式碼之程式化或在產生及儲存待由程式/模型存取之資料後判定及識別待鎖定之記憶體位址。藉此，可在程式/模型執行開始時或之後、在已產生及儲存待由程式/模型使用之資料之後等的時間期間減少或消除對記憶體陣列7210之攻擊的漏洞。 The memory part can be locked at any suitable time. For example, in some situations, the memory portion can be locked at various times during program execution. In other situations, the memory part can be locked after startup or before the program/model is executed. For example, it can be combined with the programming of the program/model code or determine and identify the memory address to be locked after generating and storing the data to be accessed by the program/model. In this way, the vulnerability of attacks on the memory array 7210 can be reduced or eliminated during the time period at the beginning or after the execution of the program/model, after the data to be used by the program/model has been generated and stored.

可藉由任何合適的方法或在任何合適的時間實現鎖定記憶體之解除鎖定。如上文所描述，可在接收到正確的複雜密碼或密碼等之後將鎖定記憶體部分解除鎖定。在其他狀況下，可藉由重新啟動(藉由命令或藉由斷電及通電)或刪除整個記憶體陣列7210將鎖定記憶體解除鎖定。另外或替代地，可實施釋放命令序列以將一或多個記憶體部分解除鎖定。 Unlocking of locked memory may be accomplished by any suitable method or at any suitable time. As described above, a locked memory portion may be unlocked after the correct complex password, password, or the like is received. In other cases, locked memory may be unlocked by restarting (by command or by powering off and on) or by erasing the entire memory array 7210. Additionally or alternatively, a release command sequence may be implemented to unlock one or more memory portions.

根據本發明之實施例且如上文所描述，控制器7240可經組態以控制至及自積體電路7200之訊務，尤其自在積體電路7200外部之源的訊務。舉例而言，如圖72A中所展示，可藉由控制器7240控制在積體電路7200外部之組件與在積體電路7200內部之組件(例如，記憶體陣列7210或處理器子單元7220)之間的訊務。此訊務可通過控制器7240或由控制器7240控制或監視之一或多個匯流排(例如，7250、7260或7261)。 Consistent with embodiments of the present invention and as described above, the controller 7240 may be configured to control traffic to and from the integrated circuit 7200, especially traffic from sources outside the integrated circuit 7200. For example, as shown in FIG. 72A, traffic between components outside the integrated circuit 7200 and components inside the integrated circuit 7200 (for example, the memory array 7210 or the processor subunits 7220) may be controlled by the controller 7240. This traffic may pass through the controller 7240 or through one or more buses (for example, 7250, 7260, or 7261) that are controlled or monitored by the controller 7240.

根據本發明之一些實施例，積體電路7200可在開機處理程序期間接收不可改變資料(例如，固定資料；例如模型權重、係數等)及某些命令(例如，程式碼；例如識別待鎖定之記憶體部分)。此處，不可改變資料可指在程式或模型之執行期間保持固定且可保持不變直至後續開機處理程序的資料。在程式執行期間，積體電路7200可與可改變資料互動，該可改變資料可包括待處理之輸入資料及/或由與積體電路7200相關聯之處理產生的輸出資料。如上文所論述，可在程式或模型執行期間限制對記憶體陣列7210或處理陣列7220之存取。舉例而言，存取可限於記憶體陣列7210之某些部分或限於某些處理器子單元，該等處理器子單元與以下各者相關聯：關於待寫入之傳入輸入資料的處理或與待寫入之傳入輸入資料的互動，或關於待讀取之所產生輸出資料的處理或與待讀取之所產生輸出資料的互動。在程式或模型執行期間，可鎖定含有不可改變資料之記憶體部分且藉此使其不可存取。在一些實施例中，與待鎖定之記憶體部分相關聯的不可改變資料及/或命令可包括於任何適當的資料結構中。舉例而言，可經由可在開機序列期間或之後存取的一或多個組態檔案使此類資料及/或命令可用於控制器7240。 According to some embodiments of the present invention, the integrated circuit 7200 may receive unchangeable data (for example, fixed data such as model weights, coefficients, and the like) and certain commands (for example, code, such as code identifying the memory portions to be locked) during a boot process. Here, unchangeable data may refer to data that remains fixed during execution of a program or model and that may remain unchanged until a subsequent boot process. During program execution, the integrated circuit 7200 may interact with changeable data, which may include input data to be processed and/or output data generated by processing associated with the integrated circuit 7200. As discussed above, access to the memory array 7210 or the processing array 7220 may be restricted during execution of the program or model. For example, access may be limited to certain portions of the memory array 7210, or to certain processor subunits, that are associated with processing of or interaction with incoming input data to be written, or with processing of or interaction with generated output data to be read. During execution of the program or model, the memory portions containing unchangeable data may be locked and thereby made inaccessible. In some embodiments, the unchangeable data and/or the commands associated with the memory portions to be locked may be included in any suitable data structure. 
For example, such data and/or commands can be made available to the controller 7240 through one or more configuration files that can be accessed during or after the boot sequence.

返回參看圖72A，積體電路7200可進一步包括通信埠7230。如圖72A中所展示，控制器7240可耦接於通信埠7230與匯流排7250之間，該匯流排在處理子單元7220_1至7220_K之間共用。在一些實施例中，通信埠7230可間接地或直接地耦接至主機電腦7270，該主機電腦與可包括例如非揮發性記憶體之主機記憶體7280相關聯。在一些實施例中，主機電腦7270可自其相關聯之主機記憶體7280擷取可改變資料7281(例如，待在程式或模型之執行期間使用的輸入資料)、不可改變資料7282及/或命令7283。可改變資料7281、不可改變資料7282及命令7283可在開機處理程序期間經由通信埠7230自主機電腦7270上傳至控制器7240。 Referring back to FIG. 72A, the integrated circuit 7200 may further include a communication port 7230. As shown in FIG. 72A, the controller 7240 may be coupled between the communication port 7230 and the bus 7250, which is shared among the processor subunits 7220_1 to 7220_K. In some embodiments, the communication port 7230 may be coupled, directly or indirectly, to a host computer 7270 that is associated with a host memory 7280, which may include, for example, non-volatile memory. In some embodiments, the host computer 7270 may retrieve changeable data 7281 (for example, input data to be used during execution of a program or model), unchangeable data 7282, and/or commands 7283 from its associated host memory 7280. The changeable data 7281, unchangeable data 7282, and commands 7283 may be uploaded from the host computer 7270 to the controller 7240 via the communication port 7230 during a boot process.

圖72B為符合本發明之實施例的積體電路內部之記憶體區的圖解表示。如所展示，圖72B描繪包括於主機記憶體7280中之資料結構的實例。 FIG. 72B is a diagrammatic representation of a memory area inside an integrated circuit according to an embodiment of the present invention. As shown, FIG. 72B depicts an example of the data structure included in the host memory 7280.

現參看圖73A，其為符合本發明之實施例的積體電路之另一實例。如圖73A中所展示，控制器7240可包括網路攻擊偵測器7241及回應模組7242。在本發明之一些實施例中，控制器7240可經組態以儲存或存取存取控制規則7243。根據本發明之一些實施例，存取控制規則7243可包括於控制器7240可存取之組態檔案中。在一些實施例中，存取控制規則7243可在開機處理程序期間上傳至控制器7240。存取控制規則7243可包含提示與以下各者中之任一者相關聯之存取規則的資訊：可改變資料7281、不可改變資料7282及命令7283以及其對應記憶體位置。如上文所解釋，存取控制規則7243或組態檔案可包括識別記憶體陣列7210當中之某些記憶體位址的資訊。在一些實施例中，控制器7240可經組態以提供鎖定機制及/或功能，該鎖定機制及/或功能鎖定記憶體陣列7210之各種位址，例如用於儲存命令或不可改變資料之位址。 Reference is now made to FIG. 73A, which is another example of an integrated circuit consistent with embodiments of the present invention. As shown in FIG. 73A, the controller 7240 may include a cyber attack detector 7241 and a response module 7242. In some embodiments of the present invention, the controller 7240 may be configured to store or access access control rules 7243. According to some embodiments of the present invention, the access control rules 7243 may be included in a configuration file accessible to the controller 7240. In some embodiments, the access control rules 7243 may be uploaded to the controller 7240 during a boot process. The access control rules 7243 may include information indicative of access rules associated with any of the changeable data 7281, the unchangeable data 7282, and the commands 7283, together with their corresponding memory locations. As explained above, the access control rules 7243 or the configuration file may include information identifying certain memory addresses within the memory array 7210. In some embodiments, the controller 7240 may be configured to provide a locking mechanism and/or function that locks various addresses of the memory array 7210, for example, addresses used for storing commands or unchangeable data.

The controller 7240 may be configured to enforce the access control rules 7243, for example, to prevent unauthorized entities from changing the unchangeable data or the commands. In some embodiments, reading of the unchangeable data or the commands may also be prohibited under the access control rules 7243. According to some embodiments, the controller 7240 may be configured to determine whether an access attempt has been made to at least a portion of certain commands or unchangeable data. The controller 7240 (e.g., including the cyber attack detector 7241) may compare the memory address associated with an access request against the memory addresses of the unchangeable data and the commands to detect whether an unauthorized access attempt has been made to one or more locked memory locations. In this way, for example, the cyber attack detector 7241 of the controller 7240 may be configured to determine whether a suspected cyber attack has occurred, such as a request to alter one or more commands or to change or obscure unchangeable data associated with one or more locked memory portions. The response module 7242 may be configured to determine how to respond to a detected cyber attack and/or to implement the response. For example, in some cases, in response to detecting an attack on data or commands in one or more locked memory locations, the response module 7242 of the controller 7240 may implement, or cause to be implemented, a response that may include, for example, halting one or more operations, such as the memory access operation associated with the detected attack. The response to a detected attack may also include halting one or more operations associated with execution of the program or model, returning a warning or other indicator of the attempted attack, asserting an indication line to the host, or erasing the entire memory, among others.
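By way of a non-limiting illustration, the address comparison performed by a controller enforcing locked memory locations might be sketched in software as follows. The class name `AccessController`, its fields, and the half-open address ranges are hypothetical choices made for this sketch and do not appear in the disclosure; an actual controller would implement the comparison in hardware.

```python
from dataclasses import dataclass, field


@dataclass
class AccessController:
    """Hypothetical software model of a controller that enforces locked ranges."""
    locked_ranges: list            # (start, end) pairs, end exclusive (assumed layout)
    read_locked: bool = False      # if True, reads of locked addresses are refused too
    alerts: list = field(default_factory=list)

    def is_locked(self, address):
        # Compare the requested address against every locked range.
        return any(start <= address < end for start, end in self.locked_ranges)

    def request(self, address, op):
        """Return True if the access may proceed; otherwise log an alert and refuse."""
        if self.is_locked(address) and (op == "write" or self.read_locked):
            self.alerts.append((op, address))   # stand-in for a warning to the host
            return False
        return True
```

In this sketch the recorded alert stands in for the remedial actions listed above (halting the access, returning a warning, and so on); which action is taken would be a policy decision of the response module.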

In addition to locking memory portions, other techniques for defending against cyber attacks may be implemented to provide the described security measures associated with the integrated circuit 7200. For example, in some embodiments, the controller 7240 may be configured to duplicate a program or model in different memory locations and processor subunits associated with the integrated circuit 7200. In this way, the program/model and its duplicate may be executed independently, and the results of the independent executions may be compared. For example, a program/model may be duplicated in two memory banks 7210 and executed by different processor subunits 7220 in the integrated circuit 7200. In other embodiments, the program/model may be duplicated in two different integrated circuits 7200. In either case, the results of the executions may be compared to determine whether any difference exists between them. A detected difference in execution results (e.g., intermediate execution results, final execution results, etc.) may indicate a cyber attack that has altered one or more aspects of the program/model or its associated data. In some embodiments, different memory banks 7210 and processor subunits 7220 may be assigned to execute two duplicated models on the same input data. In some embodiments, intermediate results may be compared while the two duplicated models execute on the same input data, and if the two intermediate results at the same stage do not match, execution may be suspended as a potential remedial action. Where processor subunits of the same integrated circuit execute the two duplicated models, that integrated circuit may also compare the results. This may be done without informing any entity outside the integrated circuit about the execution of the two duplicated models. In other words, entities outside the chip are unaware that duplicated models are running in parallel on the integrated circuit.

Although duplication of a single program/model is described as one example for detecting possible cyber attacks, any number of duplicates (e.g., 1, 2, 3, or more than 3) may be used. As the number of duplicates and independent program/model executions increases, the level of confidence in the detection of a cyber attack may also increase. A larger number of duplicates may also reduce the potential success rate of a cyber attack, because it may be more difficult for an attacker to affect multiple program/model duplicates. The number of program or model duplicates may be determined at run time to further increase the difficulty for a cyber attacker of successfully affecting execution of the program or model.

In some embodiments, the duplicated models may differ from one another in one or more aspects. In this example, the code associated with the two programs/models may be made different, but the programs/models may be designed so that both return the same output results. In this sense at least, the two programs/models may be regarded as duplicates of one another. For example, two neural network models may have a different ordering of neurons in a layer relative to one another. Despite this change in the model code, both neural network models may return the same output results. Duplicating a program/model in this manner may make it more difficult for a cyber attacker to identify such effective duplicates of the program or model to be compromised. As a result, the duplicated models/programs may not only provide redundancy that minimizes the impact of a cyber attack, but may also enhance cyber attack detection (e.g., by highlighting tampering or unauthorized access where a cyber attacker alters one program/model or its data but fails to make corresponding changes to the program/model duplicates).
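As a purely illustrative sketch of the neuron-ordering example, the following shows a small two-layer network whose hidden neurons are reordered without changing its output. The network shape, the ReLU activation, and the function names `forward` and `permute_hidden` are assumptions made for this sketch; the disclosure does not specify a particular model structure.

```python
def relu(x):
    return x if x > 0 else 0.0


def forward(x, w1, b1, w2):
    """Two-layer network: hidden = relu(w1 * x + b1), output = w2 . hidden."""
    hidden = [relu(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(w1, b1)]
    return sum(w * h for w, h in zip(w2, hidden))


def permute_hidden(w1, b1, w2, order):
    """Reorder hidden neurons; each neuron's input weights, bias, and
    output weight move together, so the computed function is unchanged."""
    return ([w1[i] for i in order],
            [b1[i] for i in order],
            [w2[i] for i in order])
```

The stored weights of the two variants differ position by position, yet both compute the same function. Because floating-point addition is order dependent, reordered variants may in general agree only approximately, which is consistent with comparing outputs against a tolerance rather than for exact equality.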

In many cases, duplicated programs/models (especially those that exhibit code differences) may be designed so that their outputs do not match exactly but instead constitute soft values (e.g., approximately equal output values) rather than exact, fixed values. In such embodiments, the output results from two or more effective program/model duplicates may be compared (e.g., using a dedicated module or by a host processor) to determine whether the difference between their output results (whether intermediate or final) falls within a predetermined range. A difference in the output soft values that does not exceed a predetermined threshold or range may be regarded as evidence that no tampering, unauthorized access, etc. has occurred. On the other hand, if the difference in the output soft values exceeds the predetermined threshold or range, the difference may be regarded as evidence that a cyber attack has occurred in the form of tampering, unauthorized access to memory, etc. In such cases, the duplicated program/model security measure is triggered and one or more remedial actions may be taken (e.g., halting execution of the program or model, shutting down one or more operations of the integrated circuit 7200, or operating in a safe mode with limited functionality, among many other actions).
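The tolerance-based comparison of intermediate results from duplicated executions might be sketched as follows. This is a minimal illustration that assumes each duplicate reports one scalar result per stage; the function name `first_divergent_stage` and the trace format are hypothetical.

```python
def first_divergent_stage(replica_traces, tolerance):
    """Return the index of the first stage at which the duplicates' results
    differ by more than `tolerance`, or None if all stages agree.

    replica_traces: one list of per-stage scalar results per duplicate.
    """
    for stage, values in enumerate(zip(*replica_traces)):
        # Spread between the largest and smallest result at this stage.
        if max(values) - min(values) > tolerance:
            return stage
    return None
```

A return value other than `None` would correspond to a triggered security measure; execution could then be suspended at that stage, as described above for mismatched intermediate results.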

The security measures associated with the integrated circuit 7200 may also involve quantitative analysis of data associated with a program or model that is executing or has executed. For example, in some embodiments, the controller 7240 may be configured to calculate one or more checksum/hash/cyclic redundancy check (CRC)/parity values over data stored in at least a portion of the memory array 7210. The calculated values may be compared with one or more predetermined values. If the compared values deviate, the deviation may be interpreted as evidence of tampering with the data stored in that portion of the memory array 7210. In some embodiments, a checksum/hash/CRC/parity value may be calculated over all memory locations associated with the memory array 7210 to identify changes in the data. In this example, the entire memory (or memory bank) in question may be read by, for example, the host computer 7270 or a processor associated with the integrated circuit 7200 in order to calculate the checksum/hash/CRC/parity value. In other cases, a checksum/hash/CRC/parity value may be calculated over a predetermined subset of the memory locations associated with the memory array 7210 to identify changes in the data associated with that subset. In some embodiments, the controller 7240 may be configured to calculate checksum/hash/CRC/parity values associated with a predetermined data path (e.g., associated with a memory access pattern), and the calculated values may be compared with one another or with predetermined values to determine whether tampering or another form of cyber attack has occurred.
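As an illustration only, a CRC check over a predetermined subset of memory locations might look like the following sketch. It models memory as a `bytearray` and uses the CRC-32 from Python's standard `zlib` module; the disclosure does not name a specific CRC polynomial, and the function names and the `(start, length)` region format are assumptions.

```python
import zlib


def region_crc(memory, start, length):
    """CRC-32 over `length` bytes of `memory` beginning at `start`."""
    return zlib.crc32(bytes(memory[start:start + length]))


def tampered_regions(memory, expected):
    """Return the regions whose current CRC deviates from its reference value.

    expected: dict mapping (start, length) to the predetermined CRC.
    """
    return [region for region, reference in expected.items()
            if region_crc(memory, *region) != reference]
```

A non-empty result corresponds to the deviation described above and could be treated as evidence of tampering with the affected region.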

The integrated circuit 7200 may be made even more secure against cyber attacks by protecting one or more predetermined values (e.g., expected checksum/hash/CRC/parity values, expected differences between intermediate or final output results, expected ranges of differences associated with certain values, etc.) in a location within, or accessible to, the integrated circuit 7200. For example, in some embodiments, the one or more predetermined values may be stored in registers of the memory array 7210 and may be used (e.g., by the controller 7240 of the integrated circuit 7200) during or after each run of a model to evaluate intermediate or final output results, checksums, etc. In some cases, a "save last result data" command may be used to update the register values so that the predetermined values are computed on the fly, and the computed values may be saved in a register or another memory location. In this way, valid output values may be used to update the predetermined values used for comparison after each full or partial execution of a program or model. This technique may increase the difficulty a cyber attacker experiences when attempting to modify or otherwise tamper with one or more predetermined reference values designed to expose the attacker's activity.

In operation, CRC calculators may be used to track memory accesses. For example, such calculation circuits may be placed at the memory bank level, in the processor subunits, or at the controller, and each calculation circuit may be configured to accumulate into its CRC calculator on every memory access.
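One way to picture a CRC calculator that accumulates on every memory access is the running-checksum sketch below. The record encoding (operation code, 8-byte little-endian address, optional data) and the class name `AccessCrcTracker` are assumptions made for illustration; `zlib.crc32` accepts a starting value, which allows the checksum to be carried from one access to the next.

```python
import zlib


class AccessCrcTracker:
    """Hypothetical model of a per-bank (or per-subunit) access CRC accumulator."""

    def __init__(self):
        self.crc = 0

    def record(self, op, address, data=b""):
        """Fold one memory access into the running CRC and return the new value."""
        entry = op.encode() + address.to_bytes(8, "little") + bytes(data)
        # Passing the previous CRC continues the checksum over the whole access history.
        self.crc = zlib.crc32(entry, self.crc)
        return self.crc
```

Two runs that perform the same sequence of accesses end with the same accumulated value, so a final value that deviates from a reference recorded during a known-good run indicates that the access sequence changed.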

Reference is now made to FIG. 74A, which provides a diagrammatic representation of another embodiment of the integrated circuit 7200. In the example embodiment represented by FIG. 74A, the controller 7240 may include a tamper detector 7245 and a response module 7246. As in other disclosed embodiments, the tamper detector 7245 may be configured to detect evidence of a potential tampering attempt. According to some embodiments, a security measure associated with the integrated circuit 7200 and implemented by the controller 7240 may include, for example, comparing an actual program/model operation pattern with a predetermined/allowed operation pattern. If the actual operation pattern differs from the predetermined/allowed operation pattern in one or more aspects, the security measure may be triggered. If the security measure is triggered, the response module 7246 of the controller 7240 may be configured to implement one or more remedial measures in response.

FIG. 74C is a diagrammatic representation of detection elements that may be located at various points within a chip, according to exemplary disclosed embodiments. As described above, detection of cyber attacks and tampering may be performed using detection elements located at various points within the chip, as shown, for example, in FIG. 74C. For example, certain code may be associated with an expected number of processing events within a certain period of time. The detector shown in FIG. 74C may count the number of events (monitored by an event counter) that the system experiences during a certain period of time (monitored by a time counter). If the number of events exceeds a predetermined threshold (e.g., the number of events expected during a predefined period of time), tampering may be indicated. Such detectors may be included at multiple points in the system to monitor various types of events, as shown in FIG. 74C.
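The pairing of an event counter with a time counter might be modeled as in the sketch below. The fixed-window policy (the window restarts at the first event after the previous window expires), the class name `EventRateDetector`, and the abstract time units are assumptions; the disclosure does not specify how the time period is delimited.

```python
class EventRateDetector:
    """Hypothetical detector: flags more than `max_events` events per window."""

    def __init__(self, window, max_events):
        self.window = window          # length of the time period, in arbitrary ticks
        self.max_events = max_events  # expected number of events per period
        self.window_start = 0         # time counter reference
        self.count = 0                # event counter

    def on_event(self, now):
        """Record an event at time `now`; return True if the threshold is exceeded."""
        if now - self.window_start >= self.window:
            # The previous period has elapsed: start a new one at this event.
            self.window_start = now
            self.count = 0
        self.count += 1
        return self.count > self.max_events
```

Several such detectors, each with its own window and threshold, could watch different event types at different points in the system, in line with the placement of detectors described above.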

More specifically, in some embodiments, the controller 7240 may be configured to store or have access to expected program/model operation patterns 7244. For example, in some cases, an operation pattern may be represented as a curve 7247 indicating the allowed load-per-time patterns and the prohibited or illegal load-per-time patterns. A tampering attempt may cause the memory array 7210 or the processing array 7220 to operate outside certain operating specifications. This may cause the memory array 7210 or the processing array 7220 to heat up or malfunction, and may make it possible to change data or code associated with the memory array 7210 or the processing array 7220. Such changes may cause the operation pattern to fall outside the allowed operation pattern indicated by the curve 7247.

According to some embodiments, the controller 7240 may be configured to monitor an operation pattern associated with the memory array 7210 or the processing array 7220. The operation pattern may be associated with the number of access requests, the type of access requests, the timing of access requests, and so on. The controller 7240 may be further configured to detect a tampering attack if the operation pattern differs from the allowable operation pattern.

It should be noted that the disclosed embodiments may be used to protect not only against cyber attacks but also against non-malicious errors in operation. For example, the disclosed embodiments may also be effective in protecting a system such as the integrated circuit 7200 against errors caused by environmental factors such as temperature or voltage changes or levels, especially where those levels fall outside the operating specifications of the integrated circuit 7200.

In response to detection of a suspected cyber attack (e.g., in response to a triggered security measure), any suitable remedial action may be implemented. For example, the remedial action may include halting one or more operations associated with program/model execution, operating one or more components associated with the integrated circuit 7200 in a safe mode, locking one or more components of the integrated circuit 7200 against further input or access, and so on.

FIG. 74B provides a flowchart representation of a method 7450 of protecting an integrated circuit against tampering, according to exemplary disclosed embodiments. For example, step 7452 may include implementing, using a controller associated with the integrated circuit, at least one security measure with respect to an operation of the integrated circuit. At step 7454, one or more remedial actions may be taken if the at least one security measure is triggered. The integrated circuit includes: a substrate; a memory array disposed on the substrate, the memory array including a plurality of discrete memory banks; and a processing array disposed on the substrate, the processing array including a plurality of processor subunits, each one of the plurality of processor subunits being associated with one or more discrete memory banks among the plurality of discrete memory banks.

In some embodiments, the disclosed security measures may be implemented in multiple memory chips, and at least one or more of the disclosed security mechanisms may be implemented for each memory chip/integrated circuit. In some cases, each memory chip/integrated circuit may implement the same security measures, while in other cases different memory chips/integrated circuits may implement different security measures (e.g., where a different security measure is better suited to a certain type of operation associated with a particular integrated circuit). In some embodiments, more than one security measure may be implemented by a particular controller of an integrated circuit. For example, a particular integrated circuit may implement any number or type of the disclosed security measures. In addition, a particular integrated circuit controller may be configured to implement multiple different remedial measures in response to a triggered security measure.

It should also be noted that two or more of the above security mechanisms may be combined to improve security against cyber attacks or tampering attacks. In addition, security measures may be implemented across different integrated circuits, and such integrated circuits may coordinate implementation of the security measures. For example, model duplication may be performed within one memory chip or across different memory chips. In this example, results from one memory chip, or results from two or more memory chips, may be compared to detect a potential cyber attack or tampering attack. In some embodiments, the replicated security measures applied across multiple integrated circuits may include one or more of the following: the disclosed access locking mechanism, hash protection mechanism, model duplication, program/model execution pattern analysis, or any combination of these or other disclosed embodiments.

Multi-port processor subunits in DRAM

As described above, the presently disclosed embodiments may include a distributed processor memory chip that includes an array of processor subunits and an array of memory banks, where each of the processor subunits may be dedicated to at least one of the memory banks in the array. As discussed in the following sections, the distributed processor memory chip may serve as the basis for a scalable system. That is, in some cases, a distributed processor memory chip may include one or more communication ports configured to transfer data from one distributed processor memory chip to another. In this way, any desired number of distributed processor memory chips may be linked together (e.g., in series, in parallel, in a loop, or in any combination thereof) to form a scalable array of distributed processor memory chips. Such an array may offer a flexible solution for efficiently performing memory-intensive operations and for scaling the computing resources associated with the performance of memory-intensive operations. Because distributed processor memory chips may include clocks with differing timing patterns, the disclosed embodiments include features for accurately controlling data transfers between distributed processor memory chips even in the presence of clock timing differences. Such embodiments may enable efficient data sharing among different distributed processor memory chips.

FIG. 75A is a diagrammatic representation of a scalable processor memory system including a plurality of distributed processor memory chips, consistent with embodiments of the present invention. According to embodiments of the present invention, the scalable processor memory system may include a plurality of distributed processor memory chips, such as a first distributed processor memory chip 7500, a second distributed processor memory chip 7500', and a third distributed processor memory chip 7500". Each of the first distributed processor memory chip 7500, the second distributed processor memory chip 7500', and the third distributed processor memory chip 7500" may include any of the configurations and/or features associated with any of the distributed processor embodiments described herein.

In some embodiments, each of the first distributed processor memory chip 7500, the second distributed processor memory chip 7500', and the third distributed processor memory chip 7500" may be implemented similarly to the integrated circuit 7200 shown in FIG. 72A. As shown in FIG. 75A, the first distributed processor memory chip 7500 may include a memory array 7510, a processing array 7520, and a controller 7540. The memory array 7510, the processing array 7520, and the controller 7540 may be configured similarly to the memory array 7210, the processing array 7220, and the controller 7240 of FIG. 72A.

According to embodiments of the present invention, the first distributed processor memory chip 7500 may include a first communication port 7530. In some embodiments, the first communication port 7530 may be configured to communicate with one or more external entities. For example, the communication port 7530 may be configured to establish a communication connection between the distributed processor memory chip 7500 and an external entity other than another distributed processor memory chip (such as the distributed processor memory chips 7500' and 7500"). For example, the communication port 7530 may be coupled, directly or indirectly, to a host computer (e.g., as illustrated in FIG. 72A) or to any other computing device, communication module, etc.

According to embodiments of the present invention, the first distributed processor memory chip 7500 may further include one or more additional communication ports configured to communicate with other distributed processor memory chips, such as 7500' or 7500". In some embodiments, the one or more additional communication ports may include a second communication port 7531 and a third communication port 7532, as shown in FIG. 75A. The second communication port 7531 may be configured to communicate with the second distributed processor memory chip 7500' and to establish a communication connection between the first distributed processor memory chip 7500 and the second distributed processor memory chip 7500'. Similarly, the third communication port 7532 may be configured to communicate with the third distributed processor memory chip 7500" and to establish a communication connection between the first distributed processor memory chip 7500 and the third distributed processor memory chip 7500". In some embodiments, the first distributed processor memory chip 7500 (and any of the memory chips disclosed herein) may include a plurality of communication ports, including any suitable number of communication ports (e.g., 2, 3, 4, 5, 6, 7, 8, 9, 10, 20, 50, 100, 1000, etc.).

In some embodiments, the first communication port, the second communication port, and the third communication port are associated with corresponding buses. The corresponding bus may be a bus common to each of the first communication port, the second communication port, and the third communication port. In some embodiments, the corresponding buses associated with each of the first, second, and third communication ports are all connected to the plurality of discrete memory banks. In some embodiments, the first communication port is connected to at least one of a main bus internal to the memory chip or at least one processor subunit included in the memory chip. In some embodiments, the second communication port is connected to at least one of a bus internal to the memory chip or at least one processor subunit included in the memory chip.

Although the configuration of the disclosed distributed processor memory chips is explained with respect to the first distributed processor memory chip 7500, it should be noted that the second distributed processor memory chip 7500' and the third distributed processor memory chip 7500" may be configured similarly to the first distributed processor memory chip 7500. For example, the second distributed processor memory chip 7500' may also include a memory array 7510', a processing array 7520', a controller 7540', and/or a plurality of communication ports, such as ports 7530', 7531', and 7532'. Similarly, the third distributed processor memory chip 7500" may include a memory array 7510", a processing array 7520", a controller 7540", and/or a plurality of communication ports, such as ports 7530", 7531", and 7532". In some embodiments, the second communication port 7531' and the third communication port 7532' of the second distributed processor memory chip 7500' may be configured to communicate with the third distributed processor memory chip 7500" and the first distributed processor memory chip 7500, respectively. Similarly, the second communication port 7531" and the third communication port 7532" of the third distributed processor memory chip 7500" may be configured to communicate with the first distributed processor memory chip 7500 and the second distributed processor memory chip 7500', respectively. This similarity in configuration among distributed processor memory chips may facilitate scaling of a computing system based on the disclosed distributed processor memory chips. In addition, the disclosed arrangement and configuration of the communication ports associated with each distributed processor memory chip may enable flexible arrangement of an array of distributed processor memory chips (e.g., including serial connections, parallel connections, ring connections, star connections, or network connections, etc.).

According to embodiments of the present invention, distributed processor memory chips, such as the first to third distributed processor memory chips 7500, 7500', and 7500", may communicate with each other via a bus 7533. In some embodiments, the bus 7533 may connect two communication ports of two different distributed processor memory chips. For example, the second communication port 7531 of the first processor memory chip 7500 may be connected via the bus 7533 to the third communication port 7532' of the second processor memory chip 7500'. According to embodiments of the present invention, distributed processor memory chips, such as the first to third distributed processor memory chips 7500, 7500', and 7500", may also communicate with an external entity (e.g., a host computer) via a bus such as bus 7534. For example, the first communication port 7530 of the first distributed processor memory chip 7500 may be connected to one or more external entities via the bus 7534. The distributed processor memory chips may be connected to each other in various ways. In some cases, the distributed processor memory chips may exhibit serial connectivity, in which each distributed processor memory chip is connected to a pair of adjacent distributed processor memory chips. In other cases, the distributed processor memory chips may exhibit a higher degree of connectivity, in which at least one distributed processor memory chip is connected to two or more other distributed processor memory chips. In some cases, every distributed processor memory chip within a plurality of memory chips may be connected to every other distributed processor memory chip in the plurality of memory chips.
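
The connection patterns described above can be sketched as sets of chip-to-chip links. The following is a minimal illustration only: the function names (`chain`, `ring`, `full_mesh`, `degree`) and the string labels used for the chips are assumptions made for this sketch and do not appear in the disclosure.

```python
from itertools import combinations

def chain(chips):
    """Serial connectivity: each chip links to the next chip in the list."""
    return {frozenset(pair) for pair in zip(chips, chips[1:])}

def ring(chips):
    """A chain plus one link closing the loop from the last chip to the first."""
    return chain(chips) | {frozenset((chips[-1], chips[0]))}

def full_mesh(chips):
    """Every chip is linked to every other chip."""
    return {frozenset(pair) for pair in combinations(chips, 2)}

def degree(links, chip):
    """Number of other chips that `chip` is directly linked to."""
    return sum(1 for link in links if chip in link)

# Hypothetical labels standing in for chips 7500, 7500' and 7500".
chips = ["7500", "7500'", '7500"']
for name, links in [("chain", chain(chips)), ("ring", ring(chips)),
                    ("full mesh", full_mesh(chips))]:
    print(name, [degree(links, c) for c in chips])
```

With only three chips, the ring and the fully connected arrangement contain the same three links; the two arrangements diverge once a fourth chip is added.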

As shown in FIG. 75A, bus 7533 (or any other bus associated with the embodiment of FIG. 75A) may be unidirectional. Although FIG. 75A illustrates bus 7533 as unidirectional and as having a certain data transfer flow (as indicated by the arrows shown in FIG. 75A), bus 7533 (or any other bus in FIG. 75A) may be implemented as a bidirectional bus. According to some embodiments of the present invention, a bus connected between two distributed processor memory chips may be configured to have a higher communication speed than a bus connected between a distributed processor memory chip and an external entity. In some embodiments, communication between a distributed processor memory chip and an external entity may occur during limited periods, for example, during preparation for execution (loading of program code, input data, weight data, etc. from a host computer) and during periods in which results generated by execution of a neural network model, etc. are output to the host computer. During execution of one or more programs associated with the distributed processors of chips 7500, 7500', and 7500" (e.g., during memory-intensive operations associated with artificial intelligence applications, etc.), communication between the distributed processor memory chips may take place via buses 7533, 7533', etc. In some embodiments, communication between a distributed processor memory chip and an external entity may occur less frequently than communication between two processor memory chips. Depending on the communication requirements and the embodiment, the bus between a distributed processor memory chip and an external entity may be configured to have a communication speed equal to, greater than, or less than the communication speed of the bus between the distributed processor memory chips.

In some embodiments, as represented by FIG. 75A, a plurality of distributed processor memory chips, such as the first to third distributed processor memory chips 7500, 7500', and 7500", may be configured to communicate with each other. As mentioned, this capability may facilitate the assembly of a scalable distributed processor memory chip system. For example, the memory arrays 7510, 7510', and 7510" and the processing arrays 7520, 7520', and 7520" of the first to third processor memory chips 7500, 7500', and 7500", when linked by communication channels (such as the buses shown in FIG. 75A), may be regarded as effectively belonging to a single distributed processor memory chip.

According to embodiments of the present invention, communication between a plurality of distributed processor memory chips and/or communication between a distributed processor memory chip and one or more external entities may be managed in any suitable manner. In some embodiments, such communications may be managed by processing resources, such as the processing array 7520 in the distributed processor memory chip 7500. In some other embodiments, for example, to relieve the processing resources provided by the array of distributed processors of the computational load imposed by communication management, a controller of a distributed processor memory chip, such as controller 7540, 7540', 7540", etc., may be configured to manage communication between the distributed processor memory chips and/or communication between a distributed processor memory chip and one or more external entities. For example, each controller 7540, 7540', and 7540" of the first to third processor memory chips 7500, 7500', and 7500" may be configured to manage the communications of its corresponding distributed processor memory chip with respect to the other distributed processor memory chips. In some embodiments, the controllers 7540, 7540', and 7540" may be configured to control such communications via corresponding communication ports, such as ports 7531, 7531', 7531", 7532, 7532', and 7532".

The controllers 7540, 7540', and 7540" may also be configured to manage communication between the distributed processor memory chips while taking into account timing differences that may exist among the distributed processor memory chips. For example, a distributed processor memory chip (e.g., 7500) may be fed by an internal clock, which may differ from the clocks of other distributed processor memory chips (e.g., 7500' and 7500"). Therefore, in some embodiments, the controller 7540 may be configured to implement one or more strategies for accounting for the different clock timing patterns among the distributed processor memory chips, and to manage communication between the distributed processor memory chips by considering possible time deviations between them.

For example, in some embodiments, the controller 7540 of the first distributed processor memory chip 7500 may be configured to enable data to be transferred from the first distributed processor memory chip 7500 to the second processor memory chip 7500' under certain conditions. In some cases, the controller 7540 may inhibit the data transfer if one or more processor subunits of the first distributed processor memory chip 7500 are not ready to transfer data. Alternatively or additionally, the controller 7540 may inhibit the data transfer if the receiving processor subunit of the second distributed processor memory chip 7500' is not ready to receive data. In some cases, the controller 7540 may initiate the transfer of data from a sending processor subunit (e.g., in chip 7500) to a receiving processor subunit (e.g., in chip 7500') after determining both that the sending processor subunit is ready to send data and that the receiving processor subunit is ready to receive data. In other embodiments, the controller 7540 may initiate the data transfer based only on whether the sending processor subunit is ready to send data, especially where the data can be buffered in the controller 7540 or 7540', for example, until the receiving processor subunit is ready to receive the transferred data.
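
The readiness conditions described above can be expressed as a small decision procedure. This is a sketch under stated assumptions, not the disclosed controller: the class name `Controller`, the `can_buffer` flag, and the three return strings are hypothetical and serve only to make the inhibit, send, and buffer cases concrete.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Controller:
    """Hypothetical model of a controller gating one chip-to-chip transfer."""
    can_buffer: bool = False                      # assumed: controller may hold data
    buffer: deque = field(default_factory=deque)  # data waiting for the receiver

    def try_transfer(self, data, sender_ready, receiver_ready):
        """Return 'sent', 'buffered', or 'inhibited' for one transfer attempt."""
        if not sender_ready:
            return "inhibited"        # the sending subunit is not ready to send
        if receiver_ready:
            return "sent"             # both sides ready: the transfer may start
        if self.can_buffer:
            self.buffer.append(data)  # hold the data until the receiver is ready
            return "buffered"
        return "inhibited"            # receiver not ready and nowhere to hold data

strict = Controller()
buffering = Controller(can_buffer=True)
print(strict.try_transfer("w0", sender_ready=True, receiver_ready=False))
print(buffering.try_transfer("w0", sender_ready=True, receiver_ready=False))
```

The second controller corresponds to the embodiment in which the transfer is initiated on sender readiness alone because the data can be buffered.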

According to embodiments of the present invention, the controller 7540 may be configured to determine whether one or more other timing constraints are satisfied in order to enable a data transfer. Such timing constraints may relate to the time difference between the transmission time from the sending processor subunit and the reception time at the receiving processor subunit, an access request from an external entity (e.g., a host computer) for the processed data, a refresh operation performed on a memory resource (e.g., a memory array) associated with the sending or receiving processor subunit, and others.

FIG. 75E is an example timing diagram consistent with embodiments of the present invention. FIG. 75E illustrates the following example.

In some embodiments, the controller 7540 and other controllers associated with the distributed processor memory chips may be configured to use a clock enable signal to manage data transfers between the chips. For example, the processing array 7520 may be fed by a clock. In some embodiments, whether one or more processor subunits respond to the supplied clock signal may be controlled, for example, by the controller 7540 using a clock enable signal (e.g., shown as "to CE" in FIG. 75A). Each processor subunit, e.g., 7520_1 to 7520_K, may execute program code, and the program code may include communication commands. According to some embodiments of the present invention, the controller 7540 may control the timing of the communication commands by controlling the clock enable signals to the processor subunits 7520_1 to 7520_K. For example, according to some embodiments, when a sending processor subunit (e.g., in the first processor memory chip 7500) is programmed to transfer data at a certain cycle (e.g., the 1000th clock cycle) and a receiving processor subunit (e.g., in the second processor memory chip 7500') is programmed to receive data at a certain cycle (e.g., the 1000th clock cycle), the controller 7540 of the first processor memory chip 7500 and the controller 7540' of the second processor memory chip 7500' may not allow the data transfer until both the sending processor subunit and the receiving processor subunit are ready to perform it. For example, the controller 7540 may "inhibit" a data transfer from the sending processor subunit in chip 7500 by supplying it with a certain clock enable signal (e.g., logic low) that prevents the sending processor subunit from sending data in response to the received clock signal. A certain clock enable signal may "freeze" the entire distributed processor memory chip or any portion of it. On the other hand, the controller 7540 may cause the sending processor subunit to initiate the data transfer by supplying it with the opposite clock enable signal (e.g., logic high), which causes the sending processor subunit to respond to the received clock signal. Similar operations, such as reception or non-reception by the receiving processor subunit in chip 7500', may be controlled using a clock enable signal issued by the controller 7540'.

In some embodiments, the clock enable signal may be sent to all of the processor subunits (e.g., 7520_1 to 7520_K) in a processor memory chip (e.g., 7500). In general, the clock enable signal may have the effect of causing the processor subunits either to respond to their respective clock signals or to ignore them. For example, in some cases, when the clock enable signal is high (depending on the convention of the particular application), a processor subunit may respond to its clock signal and may execute one or more instructions according to its clock signal timing. On the other hand, when the clock enable signal is low, the processor subunit is prevented from responding to its clock signal, so that it does not execute instructions in response to the clock timing. In other words, when the clock enable signal is low, the processor subunit may ignore the received clock signal.
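
The effect of a clock enable signal on program execution can be illustrated with a small cycle-by-cycle simulation. This is a behavioural sketch only, assuming the active-high convention mentioned above and one instruction per enabled cycle; the function `run` and the instruction names are hypothetical.

```python
def run(program, clock_enable):
    """Step a subunit through `program`, one instruction per enabled clock cycle.

    clock_enable holds one boolean per clock cycle (True = enable signal high).
    Returns the (cycle, instruction) pairs that actually executed.
    """
    executed = []
    pc = 0  # the program counter advances only on enabled cycles
    for cycle, enabled in enumerate(clock_enable):
        if pc >= len(program):
            break                     # nothing left to execute
        if enabled:
            executed.append((cycle, program[pc]))
            pc += 1
        # when the enable signal is low, the clock edge is ignored entirely
    return executed

# The controller holds the enable signal low for cycles 1 and 2,
# which delays the "send" command without altering the program itself.
trace = run(["load", "send", "store"], [True, False, False, True, True])
print(trace)
```

Because the subunit simply ignores clock edges while the enable signal is low, the relative order of its instructions is preserved and only their absolute timing shifts.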

Returning to the example of FIG. 75A, any of the controllers 7540, 7540', or 7540" may be configured to use a clock enable signal to control the operation of its respective distributed processor memory chip by causing one or more processor subunits in the respective array to respond, or not to respond, to the received clock signal. In some embodiments, the controller 7540, 7540', or 7540" may be configured to selectively advance execution of program code, for example, where the code relates to or includes data transfer operations and their timing. In some embodiments, the controller 7540, 7540', or 7540" may be configured to use a clock enable signal to control the timing of data transmission between two different distributed processor memory chips via any of the communication ports 7531, 7531', 7531", 7532, 7532', and 7532", etc. In some embodiments, the controller 7540, 7540', or 7540" may be configured to use a clock enable signal to control the time of data reception between two different distributed processor memory chips via any of the communication ports 7531, 7531', 7531", 7532, 7532', and 7532", etc.

In some embodiments, the timing of data transfers between two different distributed processor memory chips may be configured based on a compilation optimization step. The compilation may allow processing routines to be built in which tasks can be assigned efficiently to processing subunits without being affected by transmission delays on the bus connected between two different processor memory chips. The compilation may be performed by a compiler in a host computer or transmitted to a host computer. Ordinarily, a transfer delay on the bus between two different processor memory chips would create a data bottleneck for the processing subunits that need the data. The disclosed compilation may schedule data transmissions in a manner that enables a processing unit to receive data continuously even when there is an unfavorable transmission delay on the bus.
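
One way a compile-time step could hide bus delay is to issue each send early by the expected latency, so that data arrives in the cycle in which it is needed. The sketch below illustrates only that idea; it assumes a fixed, known latency, and `schedule_sends` is a hypothetical helper rather than the disclosed compiler.

```python
def schedule_sends(consume_cycles, bus_latency):
    """Return the cycle at which each item must be sent so that it arrives on time.

    consume_cycles: cycles at which the receiving subunit needs each item.
    bus_latency: assumed fixed number of cycles an item spends on the bus.
    """
    sends = [cycle - bus_latency for cycle in consume_cycles]
    if any(send < 0 for send in sends):
        raise ValueError("an item is needed before it could possibly arrive")
    return sends

# The receiver consumes one item per cycle starting at cycle 10.
# With a 4-cycle bus delay, the sender starts at cycle 6 and the
# receiver never stalls waiting for data.
print(schedule_sends([10, 11, 12], bus_latency=4))
```

A real schedule would also have to respect bus bandwidth and the sender's own readiness, which this sketch leaves out.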

Although the embodiment of FIG. 75A includes three ports for each distributed processor memory chip (7500, 7500', 7500"), any number of ports may be included in a distributed processor memory chip according to the disclosed embodiments. For example, in some cases, a distributed processor memory chip may include more or fewer ports. In the embodiment of FIG. 75B, each distributed processor memory chip (e.g., 7500A to 7500I) may be configured with multiple ports. These ports may be substantially the same as one another or may differ. In the example shown, each distributed processor memory chip includes five ports, including one host communication port 7570 and four chip ports 7572. The host communication port 7570 may be configured to communicate (via bus 7534) between any of the distributed processors in the array (as shown in FIG. 75B) and a host computer located, for example, remotely from the array of distributed processor memory chips. The chip ports 7572 may be configured to enable communication between distributed processor memory chips via bus 7535.

Any number of distributed processor memory chips may be connected to each other. In the example shown in FIG. 75B, in which each distributed processor includes four chip ports, the memory chips may enable an array in which each distributed processor memory chip is connected to two or more other distributed processor memory chips and, in some cases, certain chips may be connected to four other distributed processor memory chips. Including more chip ports in the distributed processor memory chips may enable more interconnectivity among the distributed processor memory chips.
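
The statement that each chip connects to between two and four others can be checked with a short neighbour computation. The sketch assumes that the nine chips 7500A to 7500I are laid out as a 3x3 grid with links only between horizontally and vertically adjacent chips; that layout is an assumption made for illustration, and `grid_neighbours` is a hypothetical helper.

```python
def grid_neighbours(rows, cols):
    """Map each (row, col) chip position to its directly linked neighbours."""
    links = {}
    for r in range(rows):
        for c in range(cols):
            candidates = [(r - 1, c), (r + 1, c), (r, c - 1), (r, c + 1)]
            # keep only the positions that fall inside the grid
            links[(r, c)] = [(nr, nc) for nr, nc in candidates
                             if 0 <= nr < rows and 0 <= nc < cols]
    return links

links = grid_neighbours(3, 3)
degrees = sorted(len(neighbours) for neighbours in links.values())
print(degrees)  # corner chips use 2 chip ports, edge chips 3, the centre chip 4
```

Under this assumed layout only the centre chip uses all four chip ports, which matches the observation that certain chips may be connected to four others.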

In addition, although the distributed processor memory chips 7500A to 7500I are shown in FIG. 75B as having two different types of communication ports 7570 and 7572, in some cases a single type of communication port may be included in each distributed processor memory chip. In other cases, more than two different types of communication ports may be included in one or more of the distributed processor memory chips. In the example of FIG. 75C, each of the distributed processor memory chips 7500A' to 7500C' includes two (or more) communication ports 7570 of the same type. In this embodiment, the communication ports 7570 may be configured to enable communication with an external entity, such as a host computer, via bus 7534, and may also be configured to enable communication between distributed processor memory chips (e.g., between distributed processor memory chips 7500B' and 7500C') via bus 7535.

In some embodiments, ports provided on one or more distributed processor memory chips may be used to provide access to more than one host. For example, in the embodiment shown in FIG. 75D, a distributed processor memory chip includes two or more ports 7570. The ports 7570 may constitute host ports, chip ports, or a combination of host ports and chip ports. In the example shown, two ports 7570 and 7570' may enable two different hosts (e.g., host computers, computing elements, or other types of logic units) to access the distributed processor memory chip 7500A via buses 7534 and 7534'. This embodiment may enable two (or more) different host computers to access the distributed processor memory chip 7500A. In other embodiments, however, both buses 7534 and 7534' may be connected to the same host entity, for example, where that host entity requires additional bandwidth or parallel access to one or more of the processor subunits/memory banks of the distributed processor memory chip 7500A.

In some cases, as shown in FIG. 75D, more than one controller 7540 and 7540' may be used to control access to the distributed processor subunits/memory banks of the distributed processor memory chip 7500A. In other cases, a single controller may be used to handle communications from one or more external host entities.

In addition, one or more buses internal to the distributed processor memory chip 7500A may enable parallel access to the distributed processor subunits/memory banks of the distributed processor memory chip 7500A. For example, the distributed processor memory chip 7500A may include a first bus 7580 and a second bus 7580' that enable parallel access to, for example, the distributed processor subunits 7520_1 to 7520_6 and their corresponding dedicated memory banks 7510_1 to 7510_6. This arrangement may allow two different locations in the distributed processor memory chip 7500A to be accessed simultaneously. Further, where not all of the ports are used at the same time, the ports may share hardware resources within the distributed processor memory chip 7500A (e.g., a common bus and/or a common controller) and may constitute IOs multiplexed to that hardware.

In some embodiments, some of the computing units (e.g., processor subunits 7520_1 to 7520_6) may be connected to an additional port (7570') or controller, while others are not. However, data from the computing units that are not connected to the additional port 7570' may pass through an internal grid of connections to the computing units that are connected to port 7570'. In this way, communication may be performed at both ports 7570 and 7570' simultaneously without adding an additional bus.

Although the communication ports (e.g., 7530 to 7532) and the controller (e.g., 7540) have been described as separate elements, it should be understood that the communication ports and the controller (or any other components) may be implemented as an integrated unit according to embodiments of the present invention. FIG. 76 provides a diagrammatic representation of a distributed processor memory chip 7600 having an integrated controller and interface module, consistent with embodiments of the present invention. As shown in FIG. 76, the processor memory chip 7600 may be implemented with an integrated controller and interface module 7547 that is configured to perform the functions of the controller 7540 and the communication ports 7530, 7531, and 7532 in FIG. 75. As shown in FIG. 76, the controller and interface module 7547 is configured to communicate with multiple different entities, such as external entities, one or more distributed processor memory chips, etc., via interfaces 7548_1 to 7548_N, which are similar to communication ports (e.g., 7530, 7531, and 7532). The controller and interface module 7547 may be further configured to control communication between distributed processor memory chips or between the distributed processor memory chip 7600 and an external entity such as a host computer. In some embodiments, the controller and interface module 7547 may include communication interfaces 7548_1 to 7548_N configured to communicate in parallel with one or more other distributed processor memory chips and with external entities such as a host computer, a communication module, etc.

FIG. 77 provides a flowchart representing a process for transferring data between distributed processor memory chips in the scalable processor memory system shown in FIG. 75, consistent with embodiments of the present invention. For purposes of illustration, the flow for transferring data will be described with reference to FIG. 75, and it is assumed that data is transferred from the first processor memory chip 7500 to the second processor memory chip 7500'.

At step S7710, a data transfer request may be received. It should be noted, however, that as described above, a data transfer request may not be necessary in some embodiments. For example, in some cases, the timing of the data transfer may be predetermined (e.g., by specific software code). In such cases, the data transfer may proceed without a separate data transfer request. Step S7710 may be performed by, for example, the controller 7540, among others. In some embodiments, the data transfer request may include a request to transfer data from one processor subunit of the first distributed processor memory chip 7500 to another processor subunit of the second distributed processor memory chip 7500'.

At step S7720, the data transfer timing may be determined. As mentioned, the data transfer timing may be predetermined and may depend on the execution order of a particular software program. Step S7720 may be performed by, for example, the controller 7540, among others. In some embodiments, the data transfer timing may be determined by considering (1) whether the sending processor subunit is ready to transfer data and/or (2) whether the receiving processor subunit is ready to receive data. According to embodiments of the present invention, whether one or more other timing constraints are satisfied so as to enable the data transfer may also be considered. The one or more timing constraints may relate to the time difference between the transmission time from the sending processor subunit and the reception time at the receiving processor subunit, an access request from an external entity (e.g., a host computer) for the processed data, a refresh operation performed on a memory resource (e.g., a memory array) associated with the sending or receiving processor subunit, etc. According to embodiments of the present invention, the processing subunits may be fed by a clock. In some embodiments, the clock supplied to a processing subunit may be controlled, for example, using a clock enable signal. According to some embodiments of the present invention, the controller 7540 may control the timing of the communication commands by controlling the clock enable signals to the processor subunits 7520_1 to 7520_K.

At step S7730, the data transfer may be performed based on the data transfer timing determined at step S7720. Step S7730 may be performed by, for example, the controller 7540, among others. For example, the sending processor subunit of the first processor memory chip 7500 may transfer data to the receiving processor subunit of the second processor memory chip 7500' according to the data transfer timing determined at step S7720.
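
Steps S7710 to S7730 can be summarised as a short procedure. The following is a sketch under stated assumptions, not the disclosed process: readiness and refresh activity are modelled as per-cycle boolean lists, the request is a plain dictionary, and the function names are hypothetical.

```python
def determine_transfer_cycle(sender_ready, receiver_ready, refresh_busy):
    """S7720: first cycle in which both sides are ready and no refresh is running."""
    per_cycle = zip(sender_ready, receiver_ready, refresh_busy)
    for cycle, (sending, receiving, refreshing) in enumerate(per_cycle):
        if sending and receiving and not refreshing:
            return cycle
    return None  # no suitable cycle within the observed window

def transfer(request, sender_ready, receiver_ready, refresh_busy):
    """S7710 -> S7720 -> S7730 for one request; returns (cycle, payload) or None."""
    # S7710: `request` has been received by the (hypothetical) controller.
    cycle = determine_transfer_cycle(sender_ready, receiver_ready, refresh_busy)
    if cycle is None:
        return None                    # the transfer stays inhibited
    return cycle, request["payload"]   # S7730: transfer at the chosen cycle

result = transfer({"payload": 42},
                  sender_ready=[False, True, True, True],
                  receiver_ready=[False, False, True, True],
                  refresh_busy=[False, False, True, False])
print(result)
```

In this example both subunits are ready at cycle 2, but a refresh operation is in progress, so the transfer is deferred to cycle 3.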

The disclosed architecture may be useful in a variety of applications. For example, in some cases, the above architecture may facilitate sharing data among different distributed processor memory chips, such as weights, neuron values, or partial neuron values associated with a neural network (especially a large neural network). In addition, certain operations, such as SUM and AVG, may require data from multiple different distributed processor memory chips. In such cases, the disclosed architecture may facilitate sharing this data among the multiple distributed processor memory chips. As a further example, the disclosed architecture may facilitate sharing records between distributed processor memory chips to support join operations of queries.

It should also be noted that while the above embodiments have been described with respect to distributed processor memory chips, the same principles and techniques may be applied to, for example, conventional memory chips that do not include distributed processor subunits. For example, in some cases, multiple memory chips may be combined into a multi-port memory chip to form an array of memory chips even without an array of processor subunits. In another embodiment, multiple memory chips may be combined to form an array of connected memories, effectively providing the host with one larger memory made up of the multiple memory chips.

The internal connection of a port may be to a main bus or to one of the internal processor subunits included in the processing array.

In-Memory Zero Detection

Some embodiments of the present invention relate to memory units for detecting a zero value stored in one or more particular addresses of a plurality of memory banks. This zero value detection feature of the disclosed memory units may be useful for reducing the power consumption of a computing system and, additionally or alternatively, may reduce the processing time required to retrieve zero values from memory. The feature may be especially relevant in systems where much of the data read is actually zero and is used in computational operations such as multiplication, addition, subtraction, and others. For such operations, retrieving the zero value from memory may be unnecessary (e.g., the product of a zero value and any other value is zero), and the computing circuitry may use the fact that one of the operands is zero to compute the result more efficiently in both time and energy. In such cases, detecting the presence of a zero value may be used in place of the memory access and retrieval of the zero value from memory.

Throughout this section, the disclosed embodiments are described with respect to a read function. It should be noted, however, that the disclosed architectures and techniques apply equally to zero-value write operations, or to operations on other specific predetermined non-zero values in cases where other values may occur more often.

In the disclosed embodiments, instead of retrieving a zero value from memory, when such a value is detected at a particular address, the memory unit may return a zero value indicator to one or more circuits outside the memory unit (e.g., one or more processors, CPUs, etc. located outside the memory unit). The zero value is a multi-bit zero value (e.g., a zero-value byte, a zero-value word, a multi-bit zero value that is smaller than one byte or larger than one byte, and the like). The zero value indicator is a 1-bit signal indicating a zero value stored in the memory, so it is beneficial to transfer the 1-bit indicator rather than the n data bits stored in the memory. The transmitted zero indication may reduce the energy consumed by the transfer to 1/n and may speed up computations, for example where multiplication operations are involved in computing inputs weighted by neuron weights, convolutions, applying a kernel to input data, and many other computations associated with trained neural networks, artificial intelligence, and a wide range of other types of computation. To provide this functionality, the disclosed memory unit may include one or more zero value detection logic units that may detect the presence of a zero value in a particular location in memory, prevent retrieval of the zero value (e.g., via a read command), and cause a zero value indicator to be transmitted instead to circuitry outside the memory unit (e.g., using one or more control lines of the memory, one or more buses associated with the memory unit, etc.). Zero value detection may be performed at the memory mat level, at the bank level, at the sub-bank level, at the chip level, and so on.
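
The behavior described above, in which a 1-bit indicator is returned in place of the n stored data bits, can be modeled in a few lines. This is a software sketch under stated assumptions, not the disclosed circuit: `ReadResult`, `read_with_zero_detect`, and `bits_driven` are hypothetical names, and the bank is modeled as a plain sequence of integer words.

```python
from typing import NamedTuple, Optional, Sequence


class ReadResult(NamedTuple):
    zero_indicator: bool    # models the 1-bit zero value indicator
    data: Optional[int]     # None when the data bits are not driven


def read_with_zero_detect(bank: Sequence[int], address: int) -> ReadResult:
    """Return a zero indicator instead of the word when the stored word is zero."""
    word = bank[address]
    if word == 0:
        return ReadResult(zero_indicator=True, data=None)
    return ReadResult(zero_indicator=False, data=word)


def bits_driven(result: ReadResult, word_bits: int) -> int:
    """Count the bits transferred: 1 for the indicator, n for a data word."""
    return 1 if result.zero_indicator else word_bits
```

For a 16-bit word, a zero read drives 1 bit instead of 16, which corresponds to the 1/n reduction in transfer energy mentioned in the text (ignoring, as an assumption, any fixed overhead of the command itself).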

It should be noted that while the disclosed embodiments are described with respect to delivering a zero indicator to a location outside the memory chip, the disclosed embodiments and features may also provide significant benefits in systems where processing may be performed inside the memory chip. For example, in embodiments such as the distributed processor memory chips disclosed herein, processing may be performed on the data in the various memory banks by the corresponding processor subunits. In many cases, such as neural network execution or data analysis where the associated data may include many zeros, the disclosed techniques may speed up processing and/or reduce the power consumption associated with processing performed by the processor subunits in a distributed processor memory chip.

FIG. 78A illustrates a system 7800 for detecting, at a chip level, a zero value stored in one or more particular addresses of a plurality of memory banks implemented in a memory chip 7810, consistent with embodiments of the present invention. System 7800 may include memory chip 7810 and host 7820. Memory chip 7810 may include a plurality of control units, and each control unit may have a dedicated memory bank. For example, a control unit may be operably connected to a dedicated memory bank.

In some cases, for example with respect to the distributed processor memory chips disclosed herein, which include processor subunits spatially distributed among an array of memory banks, processing within the memory chip may involve memory accesses (whether reads or writes). Even in the case of processing internal to the memory chip, the disclosed techniques of detecting a zero value associated with a read or write command may allow an internal processor unit or subunit to forgo transferring the actual zero value. Instead, in response to zero value detection and transmission of a zero value indicator (e.g., to one or more internal processing subunits), the distributed processor memory chip may save energy that would otherwise have been used to transfer zero data values within the memory chip.

In another example, each of memory chip 7810 and host 7820 may include an input/output (IO) to enable communication between memory chip 7810 and host 7820. Each IO may be coupled with zero value indicator line 7830A and bus 7840A. Zero value indicator line 7830A may transfer a zero value indicator from memory chip 7810 to host 7820, where the zero value indicator may include a 1-bit signal generated by memory chip 7810 upon detecting a zero value stored in a particular address of a memory bank requested by host 7820. Upon receiving the zero value indicator via zero value indicator line 7830A, host 7820 may perform one or more predefined actions associated with the zero value indicator. For example, if host 7820 requested that memory chip 7810 retrieve an operand for a multiplication, host 7820 may calculate the multiplication more efficiently because it will know from the received zero value indicator (without receiving the actual memory value) that one of the operands is zero. Host 7820 may also provide instructions, data, and other input to memory chip 7810 and read output from memory chip 7810 via bus 7840. Upon receiving a communication from host 7820, memory chip 7810 may retrieve data associated with the received communication and transfer the retrieved data to host 7820 via bus 7840.
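
One way a host could act on the zero value indicator when the requested operand feeds a multiplication is to skip the multiply entirely. The sketch below is illustrative only: `indicator_aware_dot` is a hypothetical name, and each memory operand is modeled as a pair of (zero indicator, data), with data being None when the indicator is set.

```python
from typing import Iterable, Optional, Tuple


def indicator_aware_dot(
    operands: Iterable[Tuple[bool, Optional[int]]],
    inputs: Iterable[int],
) -> Tuple[int, int]:
    """Accumulate sum(operand * input), skipping operands flagged as zero.

    Returns the accumulated total and the number of multiplications performed.
    """
    total = 0
    multiplies = 0
    for (is_zero, value), x in zip(operands, inputs):
        if is_zero:
            continue  # the product is known to be zero; no data word was received
        total += value * x
        multiplies += 1
    return total, multiplies
```

With half of the operands flagged as zero, the loop performs half as many multiplications and never needs the corresponding data words.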

In some embodiments, the host may send a zero value indicator, rather than a zero data value, to the memory chip. In this way, the memory chip (e.g., a controller disposed on the memory chip) may store or refresh a zero value in memory without having to receive the zero data value. This update may occur based on receipt of the zero value indicator (e.g., as part of a write command).

FIG. 78B illustrates a memory chip 7810 for detecting, at a memory bank level, a zero value stored in one or more particular addresses of a plurality of memory banks 7811A to 7811B, consistent with embodiments of the present invention. Memory chip 7810 may include a plurality of memory banks 7811A to 7811B and an IO bus 7812. Although FIG. 78B depicts two memory banks 7811A to 7811B implemented in memory chip 7810, memory chip 7810 may include any number of memory banks.

IO bus 7812 may be configured to transfer data to and from an external chip (e.g., host 7820 in FIG. 78A) via bus 7840B. Bus 7840B may function similarly to bus 7840A in FIG. 78A. IO bus 7812 may also transmit a zero value indicator via zero value indicator line 7830B, which may function similarly to zero value indicator line 7830A in FIG. 78A. IO bus 7812 may also be configured to communicate with memory banks 7811A to 7811B via internal zero value indicator line 7831 and bus 7841. IO bus 7812 may transmit data received from the external chip to one of memory banks 7811A to 7811B. For example, IO bus 7812 may transfer, via bus 7841, data that includes an instruction to read data stored in a particular address of memory bank 7811A. A multiplexer may be included between IO bus 7812 and memory banks 7811A to 7811B and may be connected by internal zero value indicator line 7831 and bus 7841A. The multiplexer may be configured to transmit data received from IO bus 7812 to a particular memory bank, and may be further configured to transmit data or a zero value indicator received from a particular memory bank to IO bus 7812.

In some cases, a host entity may be configured only to receive regular data transfers and may not be equipped to interpret or respond to the disclosed zero value indicator. In such cases, the disclosed embodiments (e.g., a controller, chip IO, etc.) may regenerate the zero value on the data lines to the host IO in place of the zero value indicator signal, so that data transfer power is still saved inside the chip.

Each of memory banks 7811A to 7811B may include a control unit. The control unit may detect a zero value stored in a requested address of the memory bank. Upon detecting a stored zero value, the control unit may generate a zero value indicator and transmit it to IO bus 7812 via internal zero value indicator line 7831, from which the zero value indicator is further transferred to the external chip via zero value indicator line 7830B.

FIG. 79 illustrates a memory bank 7911 for detecting, at a memory mat level, a zero value stored in one or more particular addresses of a plurality of memory mats, consistent with embodiments of the present invention. In some embodiments, memory bank 7911 may be organized into memory mats 7912A to 7912B, each of which may be independently controlled and independently accessed. Memory bank 7911 may include memory mat controllers 7913A to 7913B, which may include zero value detection logic units 7914A to 7914B. Each of memory mat controllers 7913A to 7913B may allow reads from and writes to locations on memory mats 7912A to 7912B. Memory bank 7911 may further include read disable elements, local sense amplifiers 7915A to 7915B, and/or a global sense amplifier 7916.

Each of memory mats 7912A to 7912B may include a plurality of memory cells. Each of the plurality of memory cells may store one bit of binary information. For example, any one of the memory cells may individually store a zero value. If all of the memory cells in a particular memory mat store zero values, a zero value may be associated with the entire memory mat.

Each of memory mat controllers 7913A to 7913B may be configured to access a dedicated memory mat and to read data stored in, or write data to, the dedicated memory mat.

In some embodiments, zero value detection logic unit 7914A or 7914B may be implemented in memory bank 7911. One or more zero value detection logic units 7914A to 7914B may be associated with memory banks, memory sub-banks, memory mats, and sets of one or more memory cells. Zero value detection logic unit 7914A or 7914B may detect that a requested particular address (e.g., in memory mat 7912A or 7912B) stores a zero value. The detection may be performed in a number of ways.

A first method may include using a digital comparator against zero. The digital comparator may be configured to take two numbers as inputs in binary form and determine whether the first number (the retrieved data) is equal to the second number (zero). If the digital comparator determines that the two numbers are equal, the zero value detection logic unit may generate a zero value indicator. The zero value indicator may be a 1-bit signal and may disable the amplifiers (e.g., local sense amplifiers 7915A to 7915B), transmitters, and buffers that would otherwise send the data bits to the next level (e.g., IO bus 7812 in FIG. 78B). The zero value indicator may be further transmitted to global sense amplifier 7916 via zero value indicator line 7931A or 7931B, although in some cases the global sense amplifier may be bypassed.

A second method for zero detection may include using an analog comparator. The analog comparator may function similarly to the digital comparator, except that it compares the voltages of two analog inputs. For example, all of the bits may be sensed, and the comparator may act as a logical OR function across the signals.
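
The two comparator-based approaches can be contrasted in a small software model. This is a sketch for intuition only: real detection would be done by comparator circuits on sensed bits, and the function names and the explicit `width` parameter are assumptions introduced here.

```python
def is_zero_by_compare(word: int) -> bool:
    """Model of the digital comparator: compare the retrieved word against zero."""
    return word == 0


def is_zero_by_or_reduction(word: int, width: int) -> bool:
    """Model of the OR-style detection: the word is zero when no sensed bit is high."""
    any_bit_high = False
    for bit in range(width):
        any_bit_high = any_bit_high or bool((word >> bit) & 1)
    return not any_bit_high
```

Both functions give the same answer for any word that fits in `width` bits; the OR form mirrors hardware in which a single high bit line is enough to rule out a zero value.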

A third method for zero value detection may include using the signals transferred from local sense amplifiers 7915A to 7915B into global sense amplifier 7916, where global sense amplifier 7916 is configured to sense whether any of its inputs is high (non-zero) and to use that logic signal to control the next level of amplifiers. Local sense amplifiers 7915A to 7915B and global sense amplifier 7916 may include a plurality of transistors configured to sense low-power signals from the plurality of memory banks, and the amplifiers amplify the small voltage swings to higher voltage levels so that the data stored in the plurality of memory banks can be interpreted by at least one controller, such as memory mat controller 7913A or 7913B. For example, memory cells may be laid out in rows and columns on memory bank 7911. Each line may be attached to each memory cell in the row. The lines running along the rows are called word lines, which are activated by selectively applying a voltage to them. The lines running along the columns are called bit lines, and two such complementary bit lines may be attached to a sense amplifier at the edge of the memory array. The number of sense amplifiers may correspond to the number of bit lines (columns) on memory bank 7911. To read a bit from a particular memory cell, the word line along the cell's row is turned on, activating all of the memory cells in that row. The stored value (0 or 1) from each cell then becomes available on the bit line associated with that cell. At the end of the two complementary bit lines, the sense amplifier may amplify the small voltage to a normal logic level. The bit from the desired cell may then be latched from the cell's sense amplifier into a buffer and placed on the output bus.

A fourth method for zero value detection may include using an extra bit for each word saved to memory, which is set at write time if the value is zero, and using that extra bit when the data is read out to know whether the data is zero. This method may also avoid writing all of the zeros to memory, thereby saving more energy.
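
The extra-bit approach can be illustrated with a small model that keeps one flag per word alongside the data array. Everything here is hypothetical scaffolding: the class name `ZeroTaggedMemory`, the access counters, and the assumption that the memory starts out flagged as all zero are introduced only for the sketch.

```python
from typing import Optional, Tuple


class ZeroTaggedMemory:
    """Word memory with one extra 'is zero' bit per word (illustrative model)."""

    def __init__(self, size: int) -> None:
        self._data = [0] * size
        self._zero_flag = [True] * size   # assumption: memory starts as all zeros
        self.data_writes = 0              # counts accesses to the data array
        self.data_reads = 0

    def write(self, address: int, value: int) -> None:
        if value == 0:
            # Only the flag is updated; the data bits are not written.
            self._zero_flag[address] = True
            return
        self._zero_flag[address] = False
        self._data[address] = value
        self.data_writes += 1

    def read(self, address: int) -> Tuple[bool, Optional[int]]:
        if self._zero_flag[address]:
            return True, None             # zero indicator, data array untouched
        self.data_reads += 1
        return False, self._data[address]
```

Note that after a zero is written over a non-zero word, the data array may still hold the stale value; in this model the flag takes precedence on reads, so the stale bits are never observed.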

As described above and throughout this disclosure, some embodiments may include a memory unit (such as memory unit 7800) that includes a plurality of processor subunits. These processor subunits may be spatially distributed on a single substrate (e.g., a substrate of a memory chip such as memory unit 7800). Moreover, each of the plurality of processor subunits may be dedicated to a corresponding memory bank among a plurality of memory banks of memory unit 7800, and these memory banks dedicated to corresponding processor subunits may also be spatially distributed on the substrate. In some embodiments, memory unit 7800 may be associated with a particular task (e.g., performing one or more operations associated with running a neural network, etc.), and each of the processor subunits of memory unit 7800 may be responsible for performing a portion of this task. For example, each processor subunit may be equipped with instructions that may include data handling and memory operations, arithmetic and logical operations, and so on. In some cases, the zero value detection logic may be configured to provide a zero value indicator to one or more of the described processor subunits spatially distributed on memory unit 7800.

Reference is now made to FIG. 80, which is a flowchart illustrating an exemplary method 8000 of detecting a zero value stored in a particular address of a plurality of memory banks, consistent with embodiments of the present invention. Method 8000 may be performed by a memory chip (e.g., memory chip 7810 of FIG. 78B). In particular, a controller of the memory unit (e.g., controller 7913A of FIG. 79) and a zero value detection logic unit (e.g., zero value detection logic unit 7914A) may perform method 8000.

In step 8010, a read or write operation may be initiated by any suitable technique. In some cases, the controller may receive a request to read data stored in a particular address of a plurality of discrete memory banks (e.g., the memory banks depicted in FIG. 78). The controller may be configured to control at least one aspect of read/write operations with respect to the plurality of discrete memory banks.

In step 8020, one or more zero value detection circuits may be used to detect the presence of a zero value associated with the read or write command. For example, a zero value detection logic unit (e.g., zero value detection logic unit 7830 of FIG. 78) may detect a zero value associated with the particular address involved in the read or write.

In step 8030, the controller may transmit a zero value indicator to one or more circuits outside the memory unit in response to the zero value detection performed by the zero value detection logic unit in step 8020. For example, the zero value detection logic may detect that the requested address stores a zero value and may transmit an indication that the value is zero to an entity (e.g., one or more circuits) outside the memory chip (or inside the memory chip, for example where the disclosed distributed processor memory chip includes processor subunits distributed among an array of memory banks). If no zero value associated with the read or write command is detected, the controller may transmit the data value instead of the zero value indicator. In some embodiments, the one or more circuits to which the zero value indicator is returned may be inside the memory unit.

While the disclosed embodiments have been described with respect to zero value detection, the same principles and techniques apply to detecting other memory values (e.g., 1, etc.). In some cases, in addition to the zero value indicator, the detection logic may return one or more indicators of other values (e.g., 1, etc.) associated with a read or write command, and these indicators may be returned or transmitted whenever any value corresponding to a value indicator is detected. In some cases, the values may be adjusted by a user (e.g., by updating one or more registers). Such updates may be especially useful where characteristics of a data set are known and it is understood (e.g., by the user) that certain values are likely to be more prevalent in the data than others. In such cases, one, two, three, or more value indicators may be associated with the most prevalent values in the data set.
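
The generalization to user-adjustable values can be sketched as a detector whose matched values live in a small register list. The class `ValueDetector`, its method names, and the choice to return the index of the matching indicator are assumptions made for illustration; the source does not specify how multiple indicators would be encoded.

```python
from typing import Iterable, List, Optional


class ValueDetector:
    """Detects words equal to any value held in configurable registers."""

    def __init__(self, values: Iterable[int] = (0,)) -> None:
        self._registers: List[int] = list(values)

    def update_registers(self, values: Iterable[int]) -> None:
        """Model of a user updating the detected values (e.g., per data set)."""
        self._registers = list(values)

    def detect(self, word: int) -> Optional[int]:
        """Return the index of the matching value indicator, or None if no match."""
        for index, value in enumerate(self._registers):
            if word == value:
                return index
        return None
```

A user who knows that, say, 0, 1, and 255 dominate a data set could load those three values so that each is reported by a short indicator instead of a full data word.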

Compensating for DRAM Activation Penalty

In certain types of memory (e.g., DRAM), memory cells may be arranged in arrays within a memory bank, and the values held in the memory cells may be accessed and retrieved (read) one line of memory cells of the array at a time. This read process may involve first opening (activating) a line (or row) of memory cells so that the data values stored by the memory cells become available. Next, the values of the memory cells in the open line may be sensed simultaneously, and column addresses may be used to cycle through individual memory cell values or groups of memory cell values (i.e., words), connecting each memory cell value to an external data bus so that it can be read. These processes take time. In some cases, opening a memory line for reading may require 32 cycles of computing time, and reading the values from the open line may require another 32 cycles. If the next line to be read is opened only after the read operation on the currently open line has completed, significant latency may result. In this example, no data is read during the 32 cycles needed to open the next line, so reading each line effectively requires a total of 64 cycles rather than only the 32 cycles needed to traverse the line data. Conventional memory systems do not allow a second line in the same bank to be opened while a first line is being read or written. To save latency, the next line to be opened may therefore be in a different bank or in a special bank for dual-line access, as discussed in further detail below. The current line may be sampled into flip-flops or latches before the next line is opened, with all processing done on the flip-flops or latches while the next line is being opened. If the next predicted line is in the same bank (and none of the above arrangements is present), the latency may not be avoidable and the system may need to wait. These mechanisms are relevant both to standard memories and, in particular, to memory processing devices.

The disclosed embodiments may reduce this latency by, for example, predicting the next memory line to be opened before the read operation on the currently open memory line has completed. That is, if the next line to be opened can be predicted, the process of opening it may begin before the read operation on the current line has completed. Depending on when in the process the next-line prediction is made, the latency associated with opening the next line may be reduced from 32 cycles (in the particular example described above) to fewer than 32 cycles. In one particular example, if the next line to open is predicted 20 cycles in advance, the additional latency is only 12 cycles. In another example, if the next line to open is predicted 32 cycles in advance, there is no latency at all. As a result, instead of requiring a total of 64 cycles to open and read each row serially, the effective time to read each row may be reduced by opening the next row while the current row is being read.
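
The cycle counts in this example follow from a simple relation: the added latency is the activation time minus however far ahead the next line was predicted, floored at zero. The helper names below are hypothetical, and the sketch assumes activation of the next line can fully overlap with reading the current one.

```python
def added_latency(activation_cycles: int, prediction_lead: int) -> int:
    """Cycles spent waiting for the next line after the current read completes."""
    return max(0, activation_cycles - prediction_lead)


def effective_cycles_per_line(
    activation_cycles: int, read_cycles: int, prediction_lead: int
) -> int:
    """Read time plus whatever part of the activation could not be overlapped."""
    return read_cycles + added_latency(activation_cycles, prediction_lead)
```

With the figures from the text (32 cycles to open, 32 to read), no prediction gives 64 cycles per line, a 20-cycle lead gives 32 + 12 = 44 cycles, and a 32-cycle lead gives 32 cycles.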

The following mechanisms may require the current line and the predicted line to be in the same bank, but they may also be used if there is a bank that supports activating a line and working on a line simultaneously.

In the disclosed embodiments, next-row prediction may be performed using various techniques (discussed in more detail below). For example, the next-row prediction may be based on pattern recognition, on a predetermined row access schedule, on the output of an artificial intelligence model (e.g., a trained neural network that analyzes row accesses and predicts the next row to be opened), or on any other suitable prediction technique. In some embodiments, 100% successful prediction may be achieved by using a delayed address generator, a formula, or other methods, as described below. Prediction may include building a system capable of predicting the next line to be opened sufficiently ahead of the time at which that line needs to be accessed. In some cases, next-row prediction may be performed by a next-row predictor, which may be implemented in various ways. For example, a predicted address generator may be used together with the generation of the current address for reading from and/or writing to memory rows. The entity that generates the addresses for accessing memory (for reads or writes) may be based on any logic circuit or controller/CPU that executes software instructions. The predicted address generator may include a pattern learning model that observes the rows accessed, identifies one or more patterns associated with the accesses (e.g., sequential line accesses, accesses to every second line, accesses to every third line, etc.), and estimates the next row to be accessed based on the observed patterns. In other examples, the predicted address generator may include a unit that applies a formula or algorithm to predict the next row to be accessed. In still other embodiments, the predicted address generator may include a trained neural network that outputs the predicted next row to be accessed (including one or more addresses associated with the predicted row) based on inputs such as the current address/row being accessed and the last 2, 3, 4, or more addresses/rows accessed. Predicting the next memory line to be accessed with any of the described predicted address generators may significantly reduce the latency associated with memory accesses. The described predicted address/row generators may be used in any system that accesses memory to retrieve data. In some cases, the described predicted address/row generators and the associated techniques for predicting the next memory line access may be especially suitable for systems that execute artificial intelligence models, because AI models may be associated with repetitive memory access patterns that facilitate next-row prediction.
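The pattern-learning variant described above can be sketched in a few lines. This is only an illustrative software model, not the disclosed circuit: the class name `StridePredictor`, the `history` parameter, and the rule "predict only when all recent strides are equal" are assumptions made for the example.

```python
from collections import deque
from typing import Optional


class StridePredictor:
    """Hypothetical pattern-learning predictor that detects a constant row stride."""

    def __init__(self, history: int = 4) -> None:
        # Keep only the most recent row accesses (assumed window size).
        self.rows = deque(maxlen=history)

    def observe(self, row: int) -> None:
        """Record a row that was actually accessed."""
        self.rows.append(row)

    def predict(self) -> Optional[int]:
        """Return the predicted next row, or None when no pattern is visible."""
        if len(self.rows) < 3:
            return None
        rows = list(self.rows)
        strides = {b - a for a, b in zip(rows, rows[1:])}
        if len(strides) != 1:
            return None  # accesses look irregular, so make no prediction
        return rows[-1] + strides.pop()
```

Feeding the predictor rows 0, 2, 4, 6 (every second row) makes `predict()` return 8, while an irregular history such as 1, 5, 2 yields `None`.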

FIG. 81A illustrates a system 8100 for activating a next row associated with a memory bank 8180 based on a next-row prediction, consistent with embodiments of the present invention. System 8100 may include a current and predicted address generator 8192, a bank controller 8191, and memory banks 8180A to 8180B. The address generator may be an entity that generates addresses for accessing memory banks 8180A to 8180B, and may be based on any logic circuit, controller, or microprocessor that executes a software program. Bank controller 8191 may be configured to access a current row of memory bank 8180A (e.g., using a current row identifier generated by address generator 8192). Bank controller 8191 may also be configured to activate a predicted next row to be accessed in memory bank 8180B, based on a predicted row identifier generated by address generator 8192. The following example describes two banks; in other examples, more banks may be used. In some embodiments, there may be memory banks that allow more than one row to be accessed at a time (as discussed below), so the same process may be carried out on a single bank. As described above, activation of the predicted next row to be accessed may begin before the read operation performed on the current row is complete. Thus, in some cases, address generator 8192 may predict the next row to be accessed and may send the identifier of the predicted next row (e.g., one or more addresses) to bank controller 8191 at any time before access to the current row has completed. This timing may allow the bank controller to initiate activation of the predicted next row at any point while the current row is being accessed and before that access completes. In some cases, bank controller 8191 may initiate activation of the predicted next row of memory bank 8180 at the same time as (or within a few clock cycles of) the completion of activation of the current row to be accessed and/or the start of the read operation on the current row.

In some embodiments, the operation performed on the current row associated with the current address may be a read or a write operation. In some embodiments, the current row and the next row may be in the same memory bank. In some embodiments, the same memory bank may allow the next row to be accessed while the current row is being accessed. The current row and the next row may be in different memory banks. In some embodiments, the memory unit may include a processor configured to generate the current address and the predicted address. In some embodiments, the memory unit may include a distributed processor. The distributed processor may include a plurality of processor subunits of a processing array that are spatially distributed among a plurality of discrete memory banks of a memory array. In some embodiments, the predicted address may be generated by a series of flip-flops that sample the generated address with a delay. The delay may be configurable via a multiplexer that selects among the flip-flops storing the sampled addresses.

It should be noted that once the predicted next row is confirmed to be the next row that the executing software actually requests to access (e.g., after the read operation on the current row is complete), the predicted next row may become the current row to be accessed. In the disclosed embodiments, because the process of activating the predicted next row can be initiated before the current-row read operation completes, the next row to be accessed may already be fully or partially activated by the time the predicted next row is confirmed as the correct next row. This may significantly reduce the latency associated with line activation. A power reduction may be obtained if the next row is activated such that its activation ends before, or at the same time as, the read of the current row ends.
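The latency benefit can be made concrete with a simple cycle count. The model below is an assumption-laden sketch rather than anything stated in the source: it assumes every prediction is correct, that activation of row i+1 starts exactly when the read of row i starts, and that `t_activate` and `t_read` are fixed cycle counts.

```python
def total_cycles(num_rows: int, t_activate: int, t_read: int, predicted: bool) -> int:
    """Cycles needed to read num_rows rows one after another.

    Without prediction, every row pays its activation followed by its read.
    With (always correct) prediction, the activation of row i+1 overlaps the
    read of row i, so only the first activation is fully exposed.
    """
    if num_rows <= 0:
        return 0
    if not predicted:
        return num_rows * (t_activate + t_read)
    # Each later row costs whichever of read or activation takes longer.
    return t_activate + t_read + (num_rows - 1) * max(t_read, t_activate)
```

With the hypothetical values of 4 rows, 10 activation cycles, and 8 read cycles, the unpredicted sequence takes 72 cycles and the overlapped sequence takes 48.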

Current and predicted address generator 8192 may include any suitable logic components, computing units, memory units, algorithms, trained models, etc., configured to identify a row to be accessed in memory bank 8180 (e.g., based on program execution) and to predict the next row to be accessed (e.g., based on a pattern observed in row accesses, or based on a predetermined pattern (n+1, n+2, etc.)). For example, in some embodiments, current and predicted address generator 8192 may include a counter 8192A, a current address generator 8192B, and a predicted address generator 8192C. Current address generator 8192B may be configured to generate the current address of the current row to be accessed in memory bank 8180 based on the output of counter 8192A, for example based on a request from a computing unit. The address associated with the current row to be accessed may be provided to bank controller 8191. Predicted address generator 8192C may be configured to determine the predicted address of the next row to be accessed in memory bank 8180 based on the output of counter 8192A, based on a predetermined access pattern (e.g., in conjunction with counter 8192A), or based on the output of a trained neural network or another type of pattern prediction algorithm that observes line accesses and predicts the next line to be accessed based on, for example, patterns associated with the observed line accesses. Address generator 8192 may provide the predicted next row address from predicted address generator 8192C to bank controller 8191.
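A counter-driven pair of generators with a predetermined pattern (n+1, n+2, and so on) might look like the sketch below. The function name `generate_addresses` and the parameters `base_row`, `step`, `num_rows`, and `lookahead` are hypothetical; the source only states that both addresses can be derived from the counter output.

```python
from typing import Tuple


def generate_addresses(counter: int, base_row: int, step: int,
                       num_rows: int, lookahead: int = 1) -> Tuple[int, int]:
    """Return (current_row, predicted_row) for a counter-driven access pattern.

    The predicted generator is modeled as a copy of the current generator
    that runs `lookahead` counter steps ahead (an assumption of this sketch).
    """
    current = (base_row + counter * step) % num_rows
    predicted = (base_row + (counter + lookahead) * step) % num_rows
    return current, predicted
```

For example, `generate_addresses(0, 5, 1, 16)` gives `(5, 6)`, and the predicted row wraps around to row 0 once the current row reaches the last row of a 16-row bank.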

In some embodiments, current address generator 8192B and predicted address generator 8192C may be implemented inside or outside system 8100. An external host may also be implemented outside system 8100 and further connected to system 8100. For example, current address generator 8192B may be software at an external host that executes the program, and, to avoid any latency, predicted address generator 8192C may be implemented either inside system 8100 or outside system 8100.

As mentioned, the predicted next row address may be determined using a trained neural network that predicts the next row to be accessed based on inputs that may include one or more previously accessed row addresses. The trained neural network or other type of model may run within the logic associated with predicted address generator 8192C. In some cases, the trained neural network or other model may be executed by one or more computing units that are external to predicted address generator 8192C but in communication with it.

In some embodiments, predicted address generator 8192C may include a replica or substantial replica of current address generator 8192B. Further, the timing of the operations of current address generator 8192B and predicted address generator 8192C may be fixed or adjustable relative to one another. For example, in some cases, predicted address generator 8192C may be configured to output the address identifier associated with the predicted next row at a fixed time (e.g., a fixed number of clock cycles) relative to when current address generator 8192B issues the address identifier associated with the next row to be accessed. In some cases, the predicted next row identifier may be generated before or after activation of the current row to be accessed begins, before or after the read operation associated with the current row to be accessed begins, or at any time before the read operation associated with the current row being accessed is complete. In some cases, the predicted next row identifier may be generated at the same time that activation of the current row to be accessed begins, or at the same time that the read operation associated with the current row to be accessed begins.

In other cases, the time between generation of the predicted next row identifier and either the activation of the current row to be accessed or the initiation of the read operation associated with the current row may be adjustable. For example, in some cases, this time may be lengthened or shortened during operation of memory unit 8100 based on values associated with one or more operating parameters. In some cases, a current temperature (or any other parameter value) associated with the memory unit or with another component of the computing system may cause current address generator 8192B and predicted address generator 8192C to change their relative operating timing. In embodiments that involve processing in memory, the prediction mechanism may be part of that logic.

Current and predicted address generator 8192 may generate a confidence level associated with its determination of the predicted next row to be accessed. This confidence level (which may be determined by predicted address generator 8192C as part of the prediction process) may be used, for example, to decide whether to initiate activation of the predicted next row during the read operation on the current row (i.e., before the current-row read operation has completed and before the identity of the next row to be accessed has been confirmed). For example, in some cases, the confidence level associated with the predicted next row to be accessed may be compared with a threshold level. If the confidence level falls below the threshold level, memory unit 8100 may, for example, forgo activating the predicted next row. If, on the other hand, the confidence level exceeds the threshold level, memory unit 8100 may initiate activation of the predicted next row in memory bank 8180.

The mechanism for testing the confidence level of the predicted next row against the threshold level, and for subsequently initiating or not initiating activation of the predicted next row, may be implemented in any suitable manner. In some cases, for example, if the confidence level associated with the predicted next row falls below the threshold, predicted address generator 8192C may forgo outputting its predicted next row result to downstream logic components. Alternatively, in such cases, current and predicted address generator 8192 may withhold the predicted next row identifier from bank controller 8191, or the bank controller (or another logic unit) may be equipped to use the confidence level of the predicted next row to determine whether to begin activating the predicted next row before the read operation associated with the current row being read is complete.
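The threshold test can be sketched as a small gate placed between the predictor and the bank controller. The function name `gate_prediction` and the numeric confidence scale are assumptions; the source does not say how a confidence exactly equal to the threshold is handled, so this sketch treats equality as "not confident enough".

```python
from typing import Optional


def gate_prediction(predicted_row: int, confidence: float,
                    threshold: float) -> Optional[int]:
    """Forward the predicted row only if its confidence exceeds the threshold.

    Returning None models withholding the prediction, so that the bank
    controller does not start an early activation.
    """
    # Assumption: a confidence equal to the threshold does not pass the gate.
    return predicted_row if confidence > threshold else None
```

A caller could then start early activation only when the gate returns a row, for instance `gate_prediction(42, 0.9, 0.75)` forwards row 42 while `gate_prediction(42, 0.5, 0.75)` forwards nothing.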

The confidence level associated with the predicted next row may be generated in any suitable manner. In some cases, such as where the predicted next row is identified based on a predetermined, known access pattern, predicted address generator 8192C may generate a high confidence level or, given the predetermined pattern of row accesses, may forgo generating a confidence level altogether. On the other hand, where predicted address generator 8192C executes one or more algorithms to monitor row accesses and outputs a predicted row based on a pattern computed from the monitored row accesses, or where one or more trained neural networks or other models are configured to output a predicted next row based on inputs including recent row accesses, the confidence level of the predicted next row may be determined based on any relevant parameters. For example, in some cases, the confidence level may depend on whether one or more previous next-row predictions proved accurate (e.g., a past performance indicator). The confidence level may also be based on one or more characteristics of the inputs to the algorithm/model. For example, inputs that include actual row accesses following a pattern may result in a higher confidence level than actual row accesses that exhibit less patterning. In some cases where randomness is detected in the stream of inputs including recent row accesses, for example, the generated confidence level may be low. Further, where randomness is detected, the next-row prediction process may be suspended altogether, one or more components of memory unit 8100 may ignore the next-row prediction, or any other action may be taken to forgo activating the predicted next row.
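One way to derive a confidence level from past performance is a sliding-window hit rate, sketched below. The class name `ConfidenceTracker`, the window size, and the choice of a plain hit ratio are assumptions made for illustration; the source leaves the metric open.

```python
from collections import deque


class ConfidenceTracker:
    """Hypothetical past-performance indicator: hit rate over recent predictions."""

    def __init__(self, window: int = 8) -> None:
        # Each entry records whether one past prediction matched the real access.
        self.outcomes = deque(maxlen=window)

    def record(self, predicted_row: int, actual_row: int) -> None:
        """Store the outcome of one completed prediction."""
        self.outcomes.append(predicted_row == actual_row)

    def confidence(self) -> float:
        """Fraction of recent predictions that were correct (0.0 with no history)."""
        if not self.outcomes:
            return 0.0
        return sum(self.outcomes) / len(self.outcomes)
```

Returning 0.0 with an empty history is a conservative choice: it keeps early activation disabled until some predictions have been checked against real accesses.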

In some cases, a feedback mechanism may be included with respect to the operation of memory 8100. For example, periodically, or even after each next-row prediction, the accuracy of predicted address generator 8192C in predicting the actual next row to be accessed may be determined. In some cases, if there is an error in predicting the next row to be accessed (or after a predetermined number of errors), the next-row prediction operation of predicted address generator 8192C may be temporarily suspended. In other cases, predicted address generator 8192C may include a learning element, such that one or more aspects of its prediction operation may be adjusted based on received feedback regarding its accuracy in predicting the next row to be accessed. This capability may refine the operation of predicted address generator 8192C, such that address generator 8192C can adapt to changing access patterns, etc.
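The suspend-on-error feedback path can be modeled with a small state machine. Everything named here (`PredictionFeedback`, `max_errors`, `cooldown`) is hypothetical, and counting consecutive errors, rather than total errors, is an assumption of this sketch.

```python
from typing import Optional


class PredictionFeedback:
    """Hypothetical feedback element that pauses prediction after repeated misses."""

    def __init__(self, max_errors: int = 3, cooldown: int = 16) -> None:
        self.max_errors = max_errors  # consecutive misses tolerated
        self.cooldown = cooldown      # accesses to skip once suspended
        self.errors = 0
        self.suspended_for = 0

    @property
    def enabled(self) -> bool:
        """True while next-row prediction is allowed to run."""
        return self.suspended_for == 0

    def report(self, predicted_row: Optional[int], actual_row: int) -> None:
        """Call once per completed row access with the prediction that was made."""
        if self.suspended_for > 0:
            self.suspended_for -= 1  # still suspended: only count down
            return
        if predicted_row == actual_row:
            self.errors = 0
            return
        self.errors += 1
        if self.errors >= self.max_errors:
            self.errors = 0
            self.suspended_for = self.cooldown
```

A learning predictor could use the same `report` calls as its training signal instead of, or in addition to, the suspension shown here.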

In some embodiments, the generation of the predicted next row and/or the timing of the activation of the predicted next row may depend on the overall operation of memory unit 8100. For example, after power-up or after a reset of memory unit 8100, predicting the next row to be accessed (or forwarding the predicted next row to bank controller 8191) may be temporarily suspended (e.g., for a predetermined amount of time or number of clock cycles, until a predetermined number of row accesses/reads have completed, until the confidence level of the predicted next row exceeds a predetermined threshold, or based on any other suitable criterion).

FIG. 81B illustrates another configuration of memory unit 8100, according to exemplary disclosed embodiments. In system 8100B of FIG. 81B, a cache 8193 may be associated with bank controller 8191. For example, cache 8193 may be configured to store one or more rows of data after they have been accessed, avoiding the need to activate those rows again. Thus, cache 8193 may enable bank controller 8191 to access row data from cache 8193 rather than accessing memory bank 8180. For example, cache 8193 may store the last X rows of data (or follow any other cache retention policy), and bank controller 8191 may fill cache 8193 according to the predicted rows. Furthermore, if a predicted row is already in cache 8193, the predicted row need not be opened again, and the bank controller (or a cache controller implemented in cache 8193) may protect the predicted row from being swapped out. Cache 8193 may provide several benefits. First, because rows are loaded into cache 8193 and the bank controller can access cache 8193 to retrieve row data, neither a special bank nor more than one bank is needed for next-row prediction. Second, reading from and writing to cache 8193 may save energy, because the physical distance from bank controller 8191 to cache 8193 is smaller than the physical distance from bank controller 8191 to memory bank 8180. Third, the latency incurred by cache 8193 is generally lower than that of memory bank 8180, because cache 8193 is smaller and closer to controller 8191. In some cases, when the predicted next row is activated in memory bank 8180 by bank controller 8191, the identifier of the predicted next row generated by the predicted address generator may, for example, be stored in cache 8193. Based on program execution, etc., current address generator 8192B may identify the actual next row to be accessed in memory bank 8180. The identifier associated with the actual next row to be accessed may be compared with the identifier of the predicted next row stored in cache 8193. If the actual next row to be accessed is the same as the predicted next row, bank controller 8191 may begin a read operation on that row once its activation (which may have been fully or partially carried out as a result of the next-row prediction process) is complete. If, on the other hand, the actual next row to be accessed (as determined by current address generator 8192B) does not match the predicted next row identifier stored in cache 8193, no read operation will be started on the fully or partially activated predicted next row; instead, the system will begin activating the actual next row to be accessed.
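A row cache that keeps recently used rows and shields the predicted row from eviction can be sketched as follows. This is a software model under assumptions, not the disclosed hardware: the class name `RowCache`, the least-recently-used policy, and the single protected slot are all choices made for the example.

```python
from collections import OrderedDict
from typing import Optional


class RowCache:
    """Hypothetical row cache: LRU eviction with one row protected from eviction."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.rows = OrderedDict()  # row id -> row data, least recently used first
        self.protected: Optional[int] = None

    def protect(self, row: Optional[int]) -> None:
        """Mark the predicted row so that it is not swapped out."""
        self.protected = row

    def get(self, row: int):
        """Return cached row data, or None when the bank must be accessed."""
        if row not in self.rows:
            return None
        self.rows.move_to_end(row)  # refresh recency on a hit
        return self.rows[row]

    def put(self, row: int, data) -> None:
        """Insert a row, evicting the oldest unprotected rows if needed."""
        self.rows[row] = data
        self.rows.move_to_end(row)
        while len(self.rows) > self.capacity:
            victim = next(r for r in self.rows if r != self.protected)
            del self.rows[victim]
```

With a capacity of two, caching rows 1 and 2, protecting row 1, and then inserting row 3 evicts row 2 even though row 1 is older.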

Dual-activation bank

As discussed, it is valuable to describe several mechanisms that allow building a bank capable of activating one row while another row is still being processed. Several embodiments may be provided for a bank that activates an additional row while another row is being accessed. Although the embodiments describe the activation of only two rows, it should be appreciated that they can be applied to more rows. In the first proposed embodiment, a memory bank may be divided into memory sub-banks, and the described embodiments may be used to perform a read operation on a line in one sub-bank while activating a predicted or required next row in another sub-bank. For example, as shown in FIG. 81C, memory bank 8180 may be arranged to include multiple memory sub-banks 8181. Further, bank controller 8191 associated with memory bank 8180 may include a plurality of sub-bank controllers associated with corresponding sub-banks. A first sub-bank controller of the plurality of sub-bank controllers may be configured to enable access to data included in a current row of a first sub-bank of the plurality of sub-banks, while a second sub-bank controller of the plurality of sub-bank controllers may activate a next row in a second sub-bank of the plurality of sub-banks. Only one column decoder may be used when words are accessed in only one sub-bank at a time. The two banks may be tied to the same output bus so as to appear as a single bank. The inputs of the new single bank may also be a single address and an additional row address for opening the next row.

FIG. 81C illustrates first and second sub-bank row controllers (8183A, 8183B) for each memory sub-bank 8181. Memory bank 8180 may include a plurality of sub-banks 8181, as shown in FIG. 81C. Further, bank controller 8191 may include a plurality of sub-bank controllers 8183A to 8183B, each associated with a corresponding sub-bank 8181. A first sub-bank controller 8183A of the plurality of sub-bank controllers may be configured to enable access to data included in a current row of a first portion of sub-bank 8181, while a second sub-bank controller 8183B may activate a next row in a second portion of sub-bank 8181.

Because activating a row directly adjacent to a row being accessed may distort the accessed row and/or corrupt the data being read from it, the disclosed embodiments may be configured such that the predicted next row to be activated is separated, for example, by at least two rows from the current row in the first sub-bank whose data is being accessed. In some embodiments, the rows to be activated may be separated by at least one mat, so that the activations are performed in different mats. The second sub-bank controller may be configured to enable access to data included in a current row of the second sub-bank while the first sub-bank controller activates a next row in the first sub-bank. The activated next row of the first sub-bank may be separated by at least two rows from the current row in the second sub-bank whose data is being accessed.

This predefined distance between the row being read/accessed and the row being activated may be determined by hardware, for example hardware that couples different portions of the memory bank to different row decoders, and software may maintain the predefined distance so as not to corrupt data. The spacing between the rows may exceed two rows (e.g., it may be 3, 4, 5, or even more than 5 rows). The distance may change over time, for example based on an evaluation of the distortion introduced into the stored data. Distortion may be evaluated in various ways, such as by calculating a signal-to-noise ratio, an error rate, the error codes needed to repair the distortion, and the like. If the two rows are far enough apart and two bank controllers are implemented on the same bank, two rows can in practice be activated. The new architecture (two controllers implemented on the same bank) may prevent opening multiple lines in the same mat.
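On the software side, maintaining the predefined distance amounts to a check before a second row is opened. The sketch below assumes that rows are numbered in physical order, that each mat holds `rows_per_mat` consecutive rows, and that "separated by at least two rows" means an index difference of at least `min_distance`; none of these details is fixed by the source.

```python
def mat_of(row: int, rows_per_mat: int) -> int:
    """Index of the mat holding a row (assumes consecutive rows share a mat)."""
    return row // rows_per_mat


def can_activate_together(current_row: int, next_row: int,
                          rows_per_mat: int, min_distance: int = 2) -> bool:
    """True when next_row may be activated while current_row is being accessed."""
    far_enough = abs(next_row - current_row) >= min_distance
    different_mat = mat_of(current_row, rows_per_mat) != mat_of(next_row, rows_per_mat)
    return far_enough and different_mat
```

For 8-row mats, rows 7 and 9 may be open together (different mats, two rows apart), while rows 10 and 11, or rows 8 and 12, may not.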

FIG. 81D illustrates an embodiment of next-row prediction, consistent with embodiments of the present invention. The embodiment may include an additional pipeline of flip-flops (address registers A to C). The pipeline may be implemented after the address generator with any number of flip-flops (stages), providing the delay needed for activation and delaying overall execution so that it uses the delayed address. The prediction may then be the newly generated address (at the head of the pipeline, below address register C), while the current address is at the end of the pipeline. In this embodiment, no duplicate address generator is needed. A selector (the multiplexer shown in FIG. 81D) may be added to configure the delay, while the address registers provide the delay.
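The delayed-address pipeline can be modeled cycle by cycle. In this sketch the class name `DelayedAddressPipeline` and its parameters are hypothetical: `stages` stands for the number of address registers and `delay` for the multiplexer setting that selects which register feeds the current address.

```python
from collections import deque
from typing import Optional, Tuple


class DelayedAddressPipeline:
    """Hypothetical model of a flip-flop pipeline with a selectable delay."""

    def __init__(self, stages: int = 3, delay: int = 3) -> None:
        if not 1 <= delay <= stages:
            raise ValueError("delay must be between 1 and the number of stages")
        # regs[0] holds the address from one cycle ago, regs[1] from two, etc.
        self.regs = deque([None] * stages, maxlen=stages)
        self.delay = delay

    def clock(self, new_address: int) -> Tuple[int, Optional[int]]:
        """Advance one cycle and return (predicted_address, current_address)."""
        current = self.regs[self.delay - 1]  # multiplexer picks one register
        self.regs.appendleft(new_address)    # shift; the oldest address drops out
        return new_address, current
```

Because the predicted address is simply the address that will become current `delay` cycles later, this arrangement predicts correctly by construction, at the cost of delaying execution by the same number of cycles.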

FIG. 81E illustrates an embodiment of a memory bank, consistent with embodiments of the present invention. The memory bank may be implemented such that, if a newly activated line is far enough from the current line, activating the new line will not corrupt the current line. As shown in FIG. 81E, the memory bank may include an additional memory mat (black) between every two lines of mats. Thus, a control unit (such as a row decoder) can activate multiple lines that are separated by a mat.

In some embodiments, the memory unit may be configured to receive, at predetermined times, a first address for processing and a second address to activate and access.

FIG. 81F illustrates another embodiment of a memory bank, consistent with embodiments of the present invention. The memory bank may be implemented such that, if a newly activated line is far enough from the current line, activating the new line will not corrupt the current line. The embodiment depicted in FIG. 81F may allow the row decoder to open lines n and n+1 by ensuring that all even lines are implemented in the upper half of the memory bank and all odd lines are implemented in the lower half of the memory bank. This implementation may allow access to consecutive lines that are always far enough apart.
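The even/odd placement can be expressed as a logical-to-physical row mapping. The function name `physical_row` is hypothetical, and treating the "upper half" as the lower-numbered physical rows is an assumption of this sketch.

```python
def physical_row(logical_row: int, total_rows: int) -> int:
    """Map a logical line to its physical position in the bank.

    Even lines fill the upper half (physical rows 0 .. total_rows/2 - 1) and
    odd lines fill the lower half, so lines n and n+1 always land far apart.
    """
    if total_rows % 2 != 0:
        raise ValueError("total_rows must be even")
    if not 0 <= logical_row < total_rows:
        raise ValueError("logical_row out of range")
    half = total_rows // 2
    if logical_row % 2 == 0:
        return logical_row // 2       # even line: upper half
    return half + logical_row // 2    # odd line: lower half
```

For an 8-row bank, logical lines 0 to 7 map to physical rows 0, 4, 1, 5, 2, 6, 3, 7, so any two consecutive logical lines are at least three physical rows apart.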

According to the disclosed embodiments, a dual-control memory bank may allow different portions of a single memory bank to be accessed and activated, even when the dual-control memory bank is configured to output one data unit at a time. For example, as described, dual control may enable the memory bank to access a first row while a second row (e.g., a predicted next row or a predetermined next row to be accessed) is being activated.

圖82說明符合本發明之實施例的用於減少記憶體列啟動懲罰(例如，潛時)之雙重控制記憶體組8280。雙重控制記憶體組8280可包括輸入，該等輸入包括資料輸入(DIN)8290、列位址(ROW)8291、行位址(COLUMN) 8292、第一命令輸入(COMMAND_1)8293及第二命令輸入(COMMAND_2)8294。記憶體組8280可包括資料輸出(Dout)8295。 FIG. 82 illustrates a dual-control memory bank 8280 for reducing the memory row activation penalty (e.g., latency), consistent with embodiments of the present invention. The dual-control memory bank 8280 may include inputs comprising a data input (DIN) 8290, a row address (ROW) 8291, a column address (COLUMN) 8292, a first command input (COMMAND_1) 8293, and a second command input (COMMAND_2) 8294. The memory bank 8280 may also include a data output (Dout) 8295.

假定位址可包括列位址及行位址，且存在兩個列解碼器。可提供位址之其他配置，列解碼器之數目可超過兩個，且可存在多於單個行解碼器。 It is assumed that an address may include a row address and a column address, and that there are two row decoders. Other address configurations may be provided, the number of row decoders may exceed two, and there may be more than a single column decoder.

列位址(ROW)8291可識別與諸如啟動命令之命令相關聯的列。因為列啟動後可接著自該列讀取或寫入至該列，所以接著在該列開放(在其啟動之後)，可能不需要發送用於寫入至開放列或自開放列讀取之列位址。 The row address (ROW) 8291 may identify the row associated with a command such as an activate command. Because a row can be read from or written to once it has been activated, while the row is open (following its activation) there may be no need to send the row address for writes to, or reads from, the open row.

第一命令輸入(COMMAND_1)8293可用以將命令(諸如但不限於啟動命令)發送至由第一列解碼器存取之列。第二命令(COMMAND_2)輸入8294可用以將命令(諸如但不限於啟動命令)發送至由第二列解碼器存取之列。 The first command input (COMMAND_1) 8293 may be used to send commands (such as, but not limited to, an activate command) to rows accessed by the first row decoder. The second command input (COMMAND_2) 8294 may be used to send commands (such as, but not limited to, an activate command) to rows accessed by the second row decoder.

資料輸入(DIN)8290可用以在執行寫入操作時饋入資料。 Data input (DIN) 8290 can be used to feed data when performing write operations.

因為無法一次讀取整列，所以可依序讀取單個列區段，且行位址(COLUMN)8292可提示待讀取該列之哪一區段(哪些行)。為解釋簡單起見，可假定存在2^Q個區段且行輸入具有Q個位元；Q為超過一的正整數。 Because an entire row cannot be read at once, individual row segments may be read sequentially, and the column address (COLUMN) 8292 may indicate which segment (which columns) of the row is to be read. For simplicity of explanation, it may be assumed that there are 2^Q segments and that the column input has Q bits, where Q is a positive integer greater than one.
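As a rough illustration of how a Q-bit column input can select one of 2^Q row segments, the sketch below computes which columns belong to a given segment. The function name `segment_columns`, the `row_width` parameter, and the assumption that segments are equal-sized and contiguous are hypothetical choices for the example and are not taken from the disclosure.

```python
def segment_columns(column_address: int, q: int, row_width: int) -> range:
    """Return the column indices covered by one row segment.

    Assumptions for illustration: the row is split into 2**q equal,
    contiguous segments, and column_address is the q-bit COLUMN input.
    """
    segments = 1 << q                    # 2^Q segments per row
    if not 0 <= column_address < segments:
        raise ValueError("column address does not fit in q bits")
    if row_width % segments != 0:
        raise ValueError("row width must be a multiple of the segment count")
    width = row_width // segments        # columns per segment
    start = column_address * width
    return range(start, start + width)


if __name__ == "__main__":
    # Q = 5 gives 32 segments, matching the 32-segment rows used in the
    # examples of FIGS. 83A to 83C; the 1024-column row is an assumption.
    print(segment_columns(0, q=5, row_width=1024))
    print(segment_columns(31, q=5, row_width=1024))
```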

雙重控制記憶體組8280可在具有或不具有上文關於圖81A至圖81B所描述之位址預測的情況下操作。當然，為減少操作潛時，根據所揭示實施例，雙重控制記憶體組可在具有位址預測之情況下操作。 The dual control memory bank 8280 can operate with or without the address prediction described above with respect to FIGS. 81A to 81B. Of course, in order to reduce the operating latency, according to the disclosed embodiment, the dual control memory bank can be operated with address prediction.

圖83A、圖83B及圖83C說明存取及啟動記憶體組8180之列的實例。如上文所提及，假定在一個實例中，讀取列及啟動列兩者均需要32個循環(區段)。另外，為了減少啟動懲罰(具有表示為差量(Delta)的長度)，預先(在需要存取下一列之前至少差量)知曉應開放下一列可為有益的。在一些狀況下，差量可等於四個循環。圖83A、圖83B及圖83C中所描繪之每一記憶體組可包括兩個或多於兩個子組，在該兩個或多於兩個子組內，在一些實施例中，在任何給定時間可僅開放一個列。在一些狀況下，偶數列可與第一子組相關聯，且奇數列可與第二子組相關聯。在此實例中，使用所揭示之預測性定址實施例可使得能夠在到達相對於另一記憶體子組之列的讀取操作之末尾之前(在到達末尾之前的延遲時段)起始某一記憶體子組之一個列的啟動。以此方式，可用高效方式進行依序記憶體存取(例如，預定義記憶體存取序列，其中列1、2、3、4、5、6、7、8……待讀取，且列1、3、5……等與第一記憶體子組相關聯且列2、4、6……等與第二不同記憶體子組)相關聯。 FIGS. 83A, 83B, and 83C illustrate examples of accessing and activating rows of the memory bank 8180. As mentioned above, assume that in one example both reading a row and activating a row require 32 cycles (segments). In addition, to reduce the activation penalty (whose length is denoted Delta), it may be beneficial to know in advance (at least Delta before the next row needs to be accessed) that the next row should be opened. In some cases, Delta may equal four cycles. Each memory bank depicted in FIGS. 83A, 83B, and 83C may include two or more sub-banks, within which, in some embodiments, only one row may be open at any given time. In some cases, even rows may be associated with a first sub-bank and odd rows with a second sub-bank. In this example, the disclosed predictive addressing embodiments may allow activation of a row in one memory sub-bank to be initiated before the end of a read operation on a row in another memory sub-bank is reached (that is, a delay period before the end). In this way, sequential memory accesses may be performed efficiently (for example, a predefined memory access sequence in which rows 1, 2, 3, 4, 5, 6, 7, 8, ... are to be read, where rows 1, 3, 5, ... are associated with the first memory sub-bank and rows 2, 4, 6, ... are associated with a different, second memory sub-bank).

圖83A可說明用於存取包括於兩個不同記憶體子組中之記憶體列的狀態。在圖83A中所展示之狀態中： FIG. 83A may illustrate a state for accessing memory rows included in two different memory sub-banks. In the state shown in FIG. 83A:

a.列A可為可由第一列解碼器存取的。可在第一列解碼器啟動列A之後存取第一區段(以灰色標記之最左區段)。 a. Row A may be accessible by the first row decoder. The first segment (the leftmost segment, marked in gray) may be accessed after the first row decoder activates row A.

b.列B可為可由第二列解碼器存取的。在圖83A中所展示之此等狀態中，列B被關閉且尚未啟動。 b. Row B may be accessible by the second row decoder. In the state shown in FIG. 83A, row B is closed and has not yet been activated.

圖83A中所說明之狀態之前可為將啟動命令及列A之位址發送至第一列解碼器。 The state illustrated in FIG. 83A may be preceded by sending an activate command and the address of row A to the first row decoder.

圖83B說明用於在存取列A之後存取列B的狀態。根據此實例：列A可為可由第一列解碼器存取的。在圖83B中所展示之狀態中，第一列解碼器啟動列A且已存取除四個最右區段(未以灰色標記之四個區段)以外的所有區段。因為差量(列A中之四個白色區段)等於四個循環，所以組控制器可使得第二列解碼器能夠在存取列A中之最右區段之前啟動列B。在一些狀況下，啟動列B可回應於預定存取圖案(例如，依序列存取，其中奇數列指明於第一子組中且偶數列指明於第二子組中)。在其他狀況下，啟動列B可回應於上文所描述之任何列預測技術。組控制器可使得第二列解碼器能夠預先啟動列B，使得當存取列B時，已啟動(開放)列B而非等待啟動列B以開放列B。 FIG. 83B illustrates a state for accessing row B after accessing row A. According to this example, row A may be accessible by the first row decoder. In the state shown in FIG. 83B, the first row decoder has activated row A, and all segments except the four rightmost segments (the four segments not marked in gray) have been accessed. Because Delta (the four white segments in row A) equals four cycles, the bank controller may enable the second row decoder to activate row B before the rightmost segments in row A are accessed. In some cases, row B may be activated in response to a predetermined access pattern (e.g., sequential row access in which odd rows are assigned to the first sub-bank and even rows to the second sub-bank). In other cases, row B may be activated in response to any of the row prediction techniques described above. The bank controller may enable the second row decoder to activate row B in advance so that, by the time row B is accessed, it is already activated (open) and there is no need to wait for its activation.

圖83B中所說明之狀態之前可為以下操作： The state illustrated in FIG. 83B may be preceded by the following operations:

a.將啟動命令及列A之位址發送至第一列解碼器。 a. Sending an activate command and the address of row A to the first row decoder.

b.寫入或讀取列A之前二十八個區段。 b. Writing or reading the first twenty-eight segments of row A.

c.在對列之二十八個區段進行讀取或寫入操作之後，將相對於列B之位址的啟動命令發送至第二列解碼器。 c. After the read or write operations on the twenty-eight segments of the row, sending an activate command with the address of row B to the second row decoder.
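The three operations listed above can be written out as a short command trace. This is only a sketch: the tuple layout `(input, operation, operand)` and the function name `command_trace` are hypothetical, while the input names COMMAND_1, COMMAND_2, and COLUMN, the 32-segment row, and Delta = 4 follow the example in the text.

```python
def command_trace(row_a, row_b, segments=32, delta=4):
    """List the commands leading to the state of FIG. 83B.

    Each entry is (input name, operation, operand). The format is an
    assumption made for illustration only.
    """
    trace = [("COMMAND_1", "ACTIVATE", row_a)]        # step a
    for segment in range(segments - delta):           # step b: first 28 segments
        trace.append(("COLUMN", "ACCESS", segment))
    trace.append(("COMMAND_2", "ACTIVATE", row_b))    # step c
    return trace


if __name__ == "__main__":
    trace = command_trace("A", "B")
    print(trace[0], trace[-1], len(trace))
```

Issuing the second activate command while `delta` segments of row A remain unread is what allows row B to be open by the time the access to row A ends.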

在一些實施例中，偶數編號列位於一或多個記憶體組之一半中。在一些實施例中，奇數編號列位於一或多個記憶體組之一半中。 In some embodiments, the even-numbered rows are located in one half of the one or more memory banks. In some embodiments, the odd-numbered rows are located in one half of the one or more memory banks.

在一些實施例中，一排額外冗餘墊置放於兩個墊排中之每一者之間以建立用於允許啟動之距離。在一些實施例中，可能不同時啟動彼此接近之多個排。 In some embodiments, a line of extra redundant pads is placed between every two lines of pads to create a distance that allows activation. In some embodiments, multiple lines that are close to each other may not be activated at the same time.

圖83C可說明用於在存取列A之後存取列C(例如，包括於第一子組中之下一奇數列)的狀態。如圖83C中所展示，列B可為可由第二列解碼器存取的。如所展示，第二列解碼器已啟動列B且已存取除四個最右區段(未以灰色標記之四個剩餘區段)以外的所有區段。因為在此實例中，差量等於四個循環，組控制器可使得第一列解碼器能夠在存取列B中之最右區段之前啟動列C。組控制器可使得第一列解碼器能夠預先啟動列C，使得當存取列C時，已啟動列C而非等待啟動列C。以此方式操作可減少或完全消除與記憶體讀取操作相關聯之潛時。 FIG. 83C may illustrate a state for accessing row C (e.g., the next odd row included in the first sub-bank) after accessing row A. As shown in FIG. 83C, row B may be accessible by the second row decoder. As shown, the second row decoder has activated row B, and all segments except the four rightmost segments (the four remaining segments not marked in gray) have been accessed. Because Delta equals four cycles in this example, the bank controller may enable the first row decoder to activate row C before the rightmost segments in row B are accessed. The bank controller may enable the first row decoder to activate row C in advance so that, by the time row C is accessed, it is already activated and there is no need to wait for its activation. Operating in this manner may reduce or completely eliminate the latency associated with memory read operations.
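The benefit of alternating the two row decoders can be estimated with a small timing model. The sketch below is a simplified illustration under stated assumptions, not a description of the actual circuit: one segment is read per cycle, a row has 32 segments, an activation takes Delta = 4 cycles, and consecutive rows always fall in different sub-banks. The function name `sequential_read_cycles` and its parameters are hypothetical.

```python
def sequential_read_cycles(num_rows, segments=32, delta=4, preactivate=True):
    """Count the cycles needed to read num_rows consecutive rows.

    Simplified model (assumptions): one segment per cycle, activation
    takes `delta` cycles, and consecutive rows sit in different sub-banks
    so the other row decoder is free to pre-activate the next row.
    """
    cycle = 0           # current time, in cycles
    ready_at = delta    # the first activation is issued at cycle 0
    for _ in range(num_rows):
        cycle = max(cycle, ready_at)        # stall until the row is open
        read_done = cycle + segments
        if preactivate:
            # The next row is activated while the last `delta` segments
            # of the current row are still being read.
            ready_at = max(read_done - delta, cycle) + delta
        else:
            # The next activation starts only after the current read ends.
            ready_at = read_done + delta
        cycle = read_done
    return cycle


if __name__ == "__main__":
    print(sequential_read_cycles(8, preactivate=True))   # overlapped activations
    print(sequential_read_cycles(8, preactivate=False))  # serialized activations
```

Under these assumptions, only the first activation is exposed when pre-activation is used; every later activation is hidden behind the tail of the previous read.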

作為暫存器檔案之記憶體墊 Memory pads as register files

在電腦架構中，處理器暫存器構成電腦處理器(例如，中央處理單元(CPU))可快速存取之儲存位置。暫存器通常包括最接近處理器核心(L0)之記憶體單元。暫存器可提供存取某些類型之資料的最快方式。電腦可具有若干類型之暫存器，其各根據其儲存之資訊的類型或基於對某一類型之暫存器中之資訊操作的指令之類型而分類。舉例而言，電腦可包括：資料暫存器，其保存數值資訊、運算元、中間結果及組態；位址暫存器，其儲存由指令使用以存取主要記憶體之位址資訊；通用暫存器，其儲存資料及位址資訊兩者；及狀態暫存器；以及其他暫存器。暫存器檔案包括可供電腦處理單元使用之暫存器的邏輯群組。 In computer architecture, processor registers constitute storage locations that a computer's processor (e.g., a central processing unit (CPU)) can access quickly. Registers typically include the memory units closest to the processor core (L0). Registers may provide the fastest way to access certain types of data. A computer may have several types of registers, each classified according to the type of information it stores or the type of instructions that operate on the information in that type of register. For example, a computer may include data registers, which hold numerical information, operands, intermediate results, and configuration; address registers, which store address information used by instructions to access main memory; general-purpose registers, which store both data and address information; and status registers, among others. A register file includes a logical group of registers available for use by a computer processing unit.

在許多狀況下，電腦之暫存器檔案位於處理單元(例如，CPU)內且由邏輯電晶體實施。然而，在所揭示實施例中，運算處理單元可能不駐存於傳統的CPU中。實情為，此等處理元件(例如，處理器子單元)可作為處理陣列在空間上分佈於(如以上章節中所描述)記憶體晶片內。每一處理器子單元可與一或多個對應及專用的記憶體單元(例如，記憶體組)相關聯。經由此架構，每一處理器子單元可在空間上位於儲存特定處理器子單元操作之資料的一或多個記憶體元件附近。如本文中所描述，此架構可藉由例如消除由典型CPU及外部記憶體架構所經歷之記憶體存取瓶頸來顯著加速某些記憶體密集型操作中之操作。 In many cases, the computer's register file is located in the processing unit (for example, CPU) and implemented by logic transistors. However, in the disclosed embodiment, the arithmetic processing unit may not reside in a traditional CPU. In fact, these processing elements (for example, processor sub-units) can be spatially distributed in the memory chip (as described in the above section) as a processing array. Each processor subunit can be associated with one or more corresponding and dedicated memory units (e.g., memory banks). With this structure, each processor sub-unit can be spatially located near one or more memory elements that store data for the operation of a specific processor sub-unit. As described herein, this architecture can significantly speed up operations in certain memory-intensive operations by, for example, eliminating memory access bottlenecks experienced by typical CPU and external memory architectures.

然而，本文中所描述之分散式處理器記憶體晶片架構可仍利用暫存器檔案，其包括用於對來自專用於對應處理器子單元之記憶體元件之資料進行操作的各種類型之暫存器。然而，由於處理器子單元可分佈於記憶體晶片之記憶體元件當中，因此有可能將一或多個記憶體元件(相較於特定製造製程中之邏輯元件，該一或多個記憶體元件可受益於彼同一製程)添加於對應處理器子單元中，以充當用於對應處理器子單元之暫存器檔案或快取記憶體，而非充當主要記憶體儲存器。 However, the distributed processor memory chip architecture described herein may still make use of register files that include various types of registers for operating on data from the memory elements dedicated to the corresponding processor subunits. Because the processor subunits may be distributed among the memory elements of the memory chip, it is possible to add one or more memory elements (which, compared with logic elements fabricated in a given manufacturing process, may benefit from that same process) to a corresponding processor subunit to serve as a register file or cache for that processor subunit rather than as main memory storage.

此架構可提供若干優點。舉例而言，由於暫存器檔案為對應處理器子單元之部分，因此處理器子單元可在空間上位於相關暫存器檔案附近。此配置可顯著增加操作效率。習知暫存器檔案由邏輯電晶體實施。舉例而言，習知暫存器檔案之每一位元由約12個邏輯電晶體製成，且因此16個位元之暫存器檔案由192個邏輯電晶體製成。此暫存器檔案可能需要大量邏輯組件來存取邏輯電晶體，且因此可佔用大的空間。相較於由邏輯電晶體實施之暫存器檔案，本發明所揭示之實施例的暫存器檔案可能需要顯著更少的空間。此大小減小可藉由使用包括記憶體胞元之記憶體墊實施所揭示實施例之暫存器檔案來實現，該等記憶體胞元係藉由經最佳化以用於製造記憶體結構而非用於製造邏輯結構之製程來製造。大小減小亦可允許較大暫存器檔案或快取記憶體。 This architecture may provide several advantages. For example, because the register file is part of the corresponding processor subunit, the processor subunit may be spatially located near the relevant register file. This arrangement may significantly increase operating efficiency. Conventional register files are implemented with logic transistors. For example, each bit of a conventional register file is made of about 12 logic transistors, so a 16-bit register file is made of 192 logic transistors. Such a register file may require a large number of logic components to access the logic transistors and may therefore occupy a large area. Compared with a register file implemented with logic transistors, the register files of the disclosed embodiments may require significantly less space. This size reduction may be achieved by implementing the register files of the disclosed embodiments using memory pads that include memory cells manufactured by a process optimized for fabricating memory structures rather than logic structures. The size reduction may also allow for larger register files or caches.
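The transistor count quoted above is simple arithmetic, reproduced in the sketch below. The figure of about 12 logic transistors per bit comes from the text; the one-transistor-per-bit figure for a memory cell is an assumption made here for comparison (a one-transistor, one-capacitor DRAM cell), and the count deliberately ignores capacitors, sense amplifiers, and other peripheral logic.

```python
LOGIC_TRANSISTORS_PER_BIT = 12  # approximate figure quoted in the text
DRAM_TRANSISTORS_PER_BIT = 1    # assumption: one-transistor, one-capacitor cell


def storage_transistors(bits: int, transistors_per_bit: int) -> int:
    """Transistors needed for the storage cells alone (no peripheral logic)."""
    return bits * transistors_per_bit


logic_count = storage_transistors(16, LOGIC_TRANSISTORS_PER_BIT)
dram_count = storage_transistors(16, DRAM_TRANSISTORS_PER_BIT)

if __name__ == "__main__":
    print(logic_count, dram_count)  # prints: 192 16
```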

在一些實施例中，可提供分散式處理器記憶體晶片。分散式處理器記憶體晶片可包括：基板；記憶體陣列，其安置於基板上且包括複數個離散記憶體組；及處理陣列，其安置於基板上且包括複數個處理器子單元。該等處理器子單元中之每一者可與複數個離散記憶體組中之對應的專用記憶體組相關聯。分散式處理器記憶體晶片亦可包括第一複數個匯流排及第二複數個匯流排。第一複數個匯流排中之每一者可將複數個處理器子單元中之一者連接至其對應的專用記憶體組。第二複數個匯流排中之每一者可將複數個處理器子單元中之一者連接至複數個處理器子單元中之另一者。在一些狀況下，第二複數個匯流排可將複數個處理器子單元中之一或多者連接至複數個處理器子單元當中之兩個或多於兩個其他處理器子單元。處理器子單元中之一或多者亦可包括安置於基板上之至少一個記憶體墊。至少一個記憶體墊可經組態以充當用於複數個處理子單元中之一或多者的暫存器檔案之至少一個暫存器。 In some embodiments, a distributed processor memory chip may be provided. The distributed processor memory chip may include: a substrate; a memory array disposed on the substrate and including a plurality of discrete memory groups; and a processing array disposed on the substrate and including a plurality of processor subunits. Each of the processor subunits can be associated with a corresponding dedicated memory group among a plurality of discrete memory groups. The distributed processor memory chip may also include a first plurality of buses and a second plurality of buses. Each of the first plurality of bus bars can connect one of the plurality of processor subunits to its corresponding dedicated memory bank. Each of the second plurality of bus bars can connect one of the plurality of processor subunits to another of the plurality of processor subunits. In some cases, the second plurality of buses may connect one or more of the plurality of processor sub-units to two or more other processor sub-units among the plurality of processor sub-units. One or more of the processor sub-units may also include at least one memory pad disposed on the substrate. The at least one memory pad may be configured to serve as at least one register for a register file for one or more of the plurality of processing subunits.

在一些狀況下，暫存器檔案可與一或多個邏輯組件相關聯以使得記憶體墊能夠充當暫存器檔案之一或多個暫存器。舉例而言，此等邏輯組件可包括開關、放大器、反相器、感測放大器以及其他者。在暫存器檔案由動態隨機存取記憶體(DRAM)墊實施之實例中，可包括邏輯組件以執行再新操作從而防止所儲存資料丟失。此等邏輯組件可包括列及行多工器(「mux」)。此外，由DRAM墊實施之暫存器檔案可包括冗餘機構以對抗良率下降。 In some cases, the register file may be associated with one or more logic components that enable the memory pad to serve as one or more registers of the register file. For example, such logic components may include switches, amplifiers, inverters, and sense amplifiers, among others. In examples where the register file is implemented by a dynamic random access memory (DRAM) pad, logic components may be included to perform refresh operations so that stored data is not lost. Such logic components may include row and column multiplexers ("mux"). In addition, a register file implemented by a DRAM pad may include redundancy mechanisms to counter yield loss.

圖84說明包括CPU 8402及外部記憶體8406之傳統電腦架構8400。在操作期間，可將來自記憶體8406之值載入至與包括於CPU 8402中之暫存器檔案8504相關聯的暫存器中。 FIG. 84 illustrates a traditional computer architecture 8400 including a CPU 8402 and an external memory 8406. During operation, the value from the memory 8406 can be loaded into the register associated with the register file 8504 included in the CPU 8402.

圖85A說明符合所揭示實施例之例示性分散式處理器記憶體晶片8500a。相比於圖84之架構，分散式處理器記憶體晶片8500a包括安置於同一基板上之記憶體元件及處理器元件。亦即，晶片8500a可包括記憶體陣列及處理陣列，該處理陣列包括各與包括於記憶體陣列中之一或多個專用記憶體組相關聯的複數個處理器子單元。在圖85之架構中，由處理器子單元使用之暫存器係藉由安置於同一基板上之一或多個記憶體墊提供，記憶體陣列及處理陣列形成於該基板上。 FIG. 85A illustrates an exemplary distributed processor memory chip 8500a in accordance with the disclosed embodiments. Compared with the architecture of FIG. 84, the distributed processor memory chip 8500a includes memory elements and processor elements arranged on the same substrate. That is, the chip 8500a may include a memory array and a processing array, the processing array including a plurality of processor subunits each associated with one or more dedicated memory groups included in the memory array. In the architecture of FIG. 85, the register used by the processor subunit is provided by one or more memory pads disposed on the same substrate, and the memory array and the processing array are formed on the substrate.

如圖85A中所描繪，分散式處理器記憶體晶片8500a可藉由安置於基板8502上之複數個處理群組8510a、8510b及8510c形成。更具體而言，分散式處理器記憶體晶片8500a可包括安置於基板8502上之記憶體陣列8520及處理陣列8530。記憶體陣列8520可包括複數個記憶體組，諸如記憶體組8520a、8520b及8520c。處理陣列8530可包括複數個處理器子單元，諸如處理器子單元8530a、8530b及8530c。 As depicted in FIG. 85A, the distributed processor memory chip 8500a can be formed by a plurality of processing groups 8510a, 8510b, and 8510c disposed on a substrate 8502. More specifically, the distributed processor memory chip 8500a may include a memory array 8520 and a processing array 8530 disposed on a substrate 8502. The memory array 8520 may include a plurality of memory banks, such as memory banks 8520a, 8520b, and 8520c. The processing array 8530 may include a plurality of processor subunits, such as processor subunits 8530a, 8530b, and 8530c.

此外，處理群組8510a、8510b及8510c中之每一者可包括處理器子單元及專用於該處理器子單元之一或多個對應記憶體組。在圖85A中所描繪之實施例中，處理器子單元8530a、8530b及8530c中之每一者可與對應的專用記憶體組8520a、8520b或8520c相關聯。亦即，處理器子單元8530a可與記憶體組8520a相關聯；處理器子單元8530b可與記憶體組8520b相關聯；且處理器子單元8530c可與記憶體組8520c相關聯。 In addition, each of the processing groups 8510a, 8510b, and 8510c may include a processor subunit and one or more corresponding memory groups dedicated to the processor subunit. In the embodiment depicted in FIG. 85A, each of the processor subunits 8530a, 8530b, and 8530c may be associated with a corresponding dedicated memory bank 8520a, 8520b, or 8520c. That is, the processor subunit 8530a can be associated with the memory group 8520a; the processor subunit 8530b can be associated with the memory group 8520b; and the processor subunit 8530c can be associated with the memory group 8520c.

為了允許每一處理器子單元與其對應的專用記憶體組通信，分散式處理器記憶體晶片8500a可包括將處理器子單元中之一者連接至其對應的專用記憶體組之第一複數個匯流排8540a、8540b及8540c。在圖85A中所描繪之實施例中，匯流排8540a可將處理器子單元8530a連接至記憶體組8520a；匯流排8540b可將處理器子單元8530b連接至記憶體組8520b；且匯流排8540c可將處理器子單元8530c連接至記憶體組8520c。 To allow each processor subunit to communicate with its corresponding dedicated memory bank, the distributed processor memory chip 8500a may include a first plurality of buses 8540a, 8540b, and 8540c, each connecting one of the processor subunits to its corresponding dedicated memory bank. In the embodiment depicted in FIG. 85A, bus 8540a may connect processor subunit 8530a to memory bank 8520a; bus 8540b may connect processor subunit 8530b to memory bank 8520b; and bus 8540c may connect processor subunit 8530c to memory bank 8520c.

此外，為了允許每一處理器子單元與其他處理器子單元通信，分散式處理器記憶體晶片8500a可包括將處理器子單元中之一者連接至至少另一處理器子單元之第二複數個匯流排8550a及8550b。在圖85中所描繪之實施例中，匯流排8550a可將處理器子單元8530a連接至處理器子單元8530b，且匯流排8550b可將處理器子單元8530a連接至處理器子單元8550b，等等。 In addition, to allow each processor subunit to communicate with other processor subunits, the distributed processor memory chip 8500a may include a second plurality of buses 8550a and 8550b, each connecting one of the processor subunits to at least one other processor subunit. In the embodiment depicted in FIG. 85, bus 8550a may connect processor subunit 8530a to processor subunit 8530b, bus 8550b may connect processor subunit 8530a to processor subunit 8550b, and so on.

離散記憶體組8520a、8520b及8520c中之每一者可包括複數個記憶體墊。在圖84中所描繪之實施例中，記憶體組8520a可包括記憶體墊8522a、8524a及8526a；記憶體組8520b可包括記憶體墊8522b、8524b及8526b；且記憶體組8520c可包括記憶體墊8522c、8524c及8526c。如先前關於圖10所揭示，記憶體墊可包括複數個記憶體胞元，且每一胞元可包含電容器、電晶體或儲存至少一個資料位元之其他電路系統。習知記憶體墊可包含例如512個位元×512個位元，但本文中所揭示之實施例不限於此。 Each of the discrete memory groups 8520a, 8520b, and 8520c may include a plurality of memory pads. In the embodiment depicted in FIG. 84, the memory set 8520a may include memory pads 8522a, 8524a, and 8526a; the memory set 8520b may include memory pads 8522b, 8524b, and 8526b; and the memory set 8520c may include memory Pads 8522c, 8524c, and 8526c. As previously disclosed with respect to FIG. 10, the memory pad may include a plurality of memory cells, and each cell may include a capacitor, a transistor, or other circuit system that stores at least one data bit. The conventional memory pad may include, for example, 512 bits×512 bits, but the embodiments disclosed herein are not limited thereto.

處理器子單元8530a、8530b及8530c中之至少一者可包括經組態以充當用於對應處理器子單元8530a、8530b及8530c之暫存器檔案的至少一個記憶體墊，諸如記憶體墊8532a、8532b及8532c。亦即，至少一個記憶體墊8532a、8532b及8532c提供由處理器子單元8530a、8530b及8530c中之一或多者使用的暫存器檔案之至少一個暫存器。暫存器檔案可包括一或多個暫存器。在圖85A中所描繪之實施例中，處理器子單元8530a中之記憶體墊8532a可充當用於處理器子單元8530a(及/或包括於分散式處理器記憶體晶片8500a中之任何其他處理器子單元)之暫存器檔案(亦被稱作「暫存器檔案8532a」)；處理器子單元 8530b中之記憶體墊8532b可充當用於處理器子單元8530b之暫存器檔案；且處理器子單元8530c中之記憶體墊8532c可充當用於處理器子單元8530c之暫存器檔案。 At least one of the processor subunits 8530a, 8530b, and 8530c may include at least one memory pad, such as memory pads 8532a, 8532b, and 8532c, configured to serve as a register file for the corresponding processor subunit 8530a, 8530b, or 8530c. That is, the at least one memory pad 8532a, 8532b, and 8532c provides at least one register of a register file used by one or more of the processor subunits 8530a, 8530b, and 8530c. The register file may include one or more registers. In the embodiment depicted in FIG. 85A, the memory pad 8532a in processor subunit 8530a may serve as a register file (also referred to as "register file 8532a") for processor subunit 8530a (and/or for any other processor subunit included in the distributed processor memory chip 8500a); the memory pad 8532b in processor subunit 8530b may serve as a register file for processor subunit 8530b; and the memory pad 8532c in processor subunit 8530c may serve as a register file for processor subunit 8530c.

處理器子單元8530a、8530b及8530c中之至少一者亦可包括至少一個邏輯組件，諸如邏輯組件8534a、8534b及8534c。每一邏輯組件8534a、8534b或8534c可經組態以使得對應記憶體墊8532a、8532b或8532c能夠充當用於對應處理器子單元8530a、8530b或8530c之暫存器檔案。 At least one of the processor subunits 8530a, 8530b, and 8530c may also include at least one logic component, such as logic components 8534a, 8534b, and 8534c. Each logic component 8534a, 8534b, or 8534c can be configured so that the corresponding memory pad 8532a, 8532b, or 8532c can serve as a register file for the corresponding processor subunit 8530a, 8530b, or 8530c.

在一些實施例中，至少一個記憶體墊可安置於基板上，且至少一個記憶體墊可含有經組態以提供用於複數個處理器子單元中之一或多者之至少一個冗餘暫存器的至少一個冗餘記憶體位元。在一些實施例中，處理器子單元中之至少一者可包括用以停止當前任務且在某些時間觸發記憶體再新操作以再新記憶體墊之機制。 In some embodiments, at least one memory pad may be disposed on the substrate, and the at least one memory pad may contain at least one redundant memory bit configured to provide at least one redundant register for one or more of the plurality of processor subunits. In some embodiments, at least one of the processor subunits may include a mechanism for halting the current task and triggering a memory refresh operation at certain times to refresh the memory pad.

圖85B說明符合所揭示實施例之例示性分散式處理器記憶體晶片8500b。圖85B中所說明之記憶體晶片8500b與圖85A中所說明之記憶體晶片8500大體上相同，除了圖85B中之記憶體墊8532a、8532b及8532c不包括於對應處理器子單元8530a、8530b及8530c中以外。實情為，圖85B中之記憶體墊8532a、8532b及8532c安置於對應處理器子單元8530a、8530b及8530c外部但在空間上靠近該等處理器子單元。以此方式，記憶體墊8532a、8532b及8532c仍可充當用於對應處理器子單元8530a、8530b及8530c之暫存器檔案。 FIG. 85B illustrates an exemplary distributed processor memory chip 8500b consistent with the disclosed embodiments. The memory chip 8500b illustrated in FIG. 85B is substantially the same as the memory chip 8500 illustrated in FIG. 85A, except that the memory pads 8532a, 8532b, and 8532c in FIG. 85B are not included within the corresponding processor subunits 8530a, 8530b, and 8530c. Instead, the memory pads 8532a, 8532b, and 8532c in FIG. 85B are disposed outside of, but spatially close to, the corresponding processor subunits 8530a, 8530b, and 8530c. In this way, the memory pads 8532a, 8532b, and 8532c may still serve as register files for the corresponding processor subunits 8530a, 8530b, and 8530c.

圖85C說明符合所揭示實施例之裝置8500c。裝置8500c包括基板8560、第一記憶體組8570、第二記憶體組8572及處理單元8580。第一記憶體組8570、第二記憶體組8572及處理單元8580安置於基板8560上。處理單元8580包括處理器8584及由記憶體墊實施之暫存器檔案8582。在處理單元8580之操作期間，處理器8584可存取暫存器檔案8582以讀取或寫入資料。 Figure 85C illustrates a device 8500c in accordance with the disclosed embodiment. The device 8500c includes a substrate 8560, a first memory group 8570, a second memory group 8572, and a processing unit 8580. The first memory group 8570, the second memory group 8572 and the processing unit 8580 are disposed on the substrate 8560. The processing unit 8580 includes a processor 8584 and a register file 8582 implemented by a memory pad. During the operation of the processing unit 8580, the processor 8584 can access the register file 8582 to read or write data.

分散式處理器記憶體晶片8500a、8500b或裝置8500c可基於處理器子單元對由記憶體墊提供之暫存器的存取而提供多種功能。舉例而言，在一些實施例中，分散式處理器記憶體晶片8500a或8500b可包括處理器子單元，該處理器子單元充當耦接至記憶體之加速器，從而允許其使用更多記憶體頻寬。在圖85A中所描繪之實施例中，處理器子單元8530a可充當加速器(亦被稱作「加速器8530a」)。加速器8530a可使用安置於加速器8530a中之記憶體墊8532a以提供暫存器檔案之一或多個暫存器。替代地，在圖85B中所描繪之實施例中，加速器8530a可使用安置於加速器8530a外部之記憶體墊8532a作為暫存器檔案。又另外，加速器8530a可使用記憶體組8520b中之記憶體墊8522b、8524b及8526b中之任一者或記憶體組8520c中之記憶體墊8522c、8524c及8526c中之任一者，以提供一或多個暫存器。 The distributed processor memory chips 8500a and 8500b or the device 8500c may provide a variety of functions based on the processor subunits' access to registers provided by memory pads. For example, in some embodiments, the distributed processor memory chip 8500a or 8500b may include a processor subunit that acts as an accelerator coupled to memory, thereby allowing it to use more memory bandwidth. In the embodiment depicted in FIG. 85A, processor subunit 8530a may act as an accelerator (also referred to as "accelerator 8530a"). Accelerator 8530a may use the memory pad 8532a disposed within accelerator 8530a to provide one or more registers of a register file. Alternatively, in the embodiment depicted in FIG. 85B, accelerator 8530a may use the memory pad 8532a disposed outside accelerator 8530a as a register file. Further still, accelerator 8530a may use any of the memory pads 8522b, 8524b, and 8526b in memory bank 8520b, or any of the memory pads 8522c, 8524c, and 8526c in memory bank 8520c, to provide one or more registers.

所揭示實施例可尤其適用於某些類型之影像處理、神經網路、資料庫分析、壓縮及解壓縮以及更多應用。舉例而言，在圖85A或圖85B之實施例中，記憶體墊可提供用於與記憶體墊包括在同一晶片上之一或多個處理器子單元的暫存器檔案之一或多個暫存器作為記憶體墊。一或多個暫存器可用以儲存由處理器子單元頻繁存取之資料。舉例而言，在卷積影像處理期間，卷積加速器可在保存於記憶體中之整個影像上反覆使用相同係數。用於此卷積加速器之所建議實施方案可將所有此等係數保存於在一或多個暫存器內之「關閉」暫存器檔案中，該一或多個暫存器包括於專用於一或多個處理器子單元之記憶體墊內，該一或多個處理器子單元與暫存器檔案記憶體墊位於同一晶片上。此架構可將暫存器(及所儲存之係數值)置放成緊密接近對係數值操作之處理器子單元。因為由記憶體墊實施之暫存器檔案可充當在空間上緊密之高效快取記憶體，所以可達成資料傳送之顯著較低損失及存取之較低潛時。 The disclosed embodiments may be especially useful for certain types of image processing, neural networks, database analysis, compression and decompression, and other applications. For example, in the embodiment of FIG. 85A or FIG. 85B, a memory pad may provide one or more registers of a register file for one or more processor subunits included on the same chip as the memory pad. The one or more registers may be used to store data frequently accessed by the processor subunits. For example, during convolutional image processing, a convolution accelerator may use the same coefficients repeatedly across an entire image held in memory. A proposed implementation for such a convolution accelerator may hold all of these coefficients in a "close" register file, within one or more registers included in a memory pad dedicated to one or more processor subunits located on the same chip as the register file memory pad. This architecture may place the registers (and the stored coefficient values) in close proximity to the processor subunits that operate on the coefficient values. Because a register file implemented by a memory pad may serve as a spatially close, efficient cache, significantly lower data transfer losses and lower access latency may be achieved.

在另一實例中，所揭示實施例可包括可將字輸入至由記憶體墊提供之暫存器中的加速器。加速器可將暫存器處置為循環緩衝器以在單個循環中將向量相乘。舉例而言，在圖85C中所說明之裝置8500c中，處理單元8580中之處理器8584充當加速器，其使用由記憶體墊實施之暫存器檔案8582作為循環緩衝器以儲存資料A1、A2、A3……。第一記憶體組8570儲存待與資料A1、A2、A3……相乘之資料B1、B2、B3……。第二記憶體組8572儲存乘法結果C1、C2、C3……。亦即，Ci=Ai×Bi。若處理單元8580中不存在暫存器檔案，則處理器8584將需要更多記憶體頻寬及更多循環以自諸如記憶體組8570或8572之外部記憶體組讀取資料A1、A2、A3……及資料B1、B2、B3……兩者，此可產生顯著延遲。另一方面，在本實施例中，資料A1、A2、A3……儲存於形成於處理單元8580內之暫存器檔案8582中。因此，處理器8584將僅需要自外部記憶體組8570讀取資料B1、B2、B3……。因此，可顯著減少記憶體頻寬。 In another example, the disclosed embodiments may include an accelerator that can input words into registers provided by a memory pad. The accelerator may treat the registers as a circular buffer in order to multiply vectors in a single cycle. For example, in the device 8500c illustrated in FIG. 85C, the processor 8584 in the processing unit 8580 acts as an accelerator that uses the register file 8582, implemented by a memory pad, as a circular buffer storing data A1, A2, A3, .... The first memory bank 8570 stores data B1, B2, B3, ... to be multiplied by data A1, A2, A3, .... The second memory bank 8572 stores the multiplication results C1, C2, C3, .... That is, Ci = Ai × Bi. If no register file were present in the processing unit 8580, the processor 8584 would need more memory bandwidth and more cycles to read both data A1, A2, A3, ... and data B1, B2, B3, ... from external memory banks such as memory bank 8570 or 8572, which could cause significant delay. In this embodiment, by contrast, data A1, A2, A3, ... is stored in the register file 8582 formed within the processing unit 8580. The processor 8584 therefore only needs to read data B1, B2, B3, ... from the external memory bank 8570, and the required memory bandwidth may be reduced significantly.
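The bandwidth saving in the FIG. 85C example can be illustrated with a small software model. This is only a sketch: the function name, the use of `collections.deque` to stand in for the circular-buffer register file, and the counting of one external read per B value are assumptions made for the example. It also models the case where the A values are reused cyclically when there are more B values than A values, which the text does not specify.

```python
from collections import deque


def multiply_with_register_file(a_values, bank_b):
    """Compute C_i = A_i * B_i with A held in a circular-buffer register file.

    Returns the results (destined for the second memory bank) and the
    number of reads issued to the external bank holding B.
    """
    register_file = deque(a_values)   # A1, A2, A3, ... loaded once
    bank_c = []                       # stands in for the result bank
    external_reads = 0
    for b in bank_b:
        external_reads += 1           # only B comes from an external bank
        a = register_file[0]          # current head of the circular buffer
        register_file.rotate(-1)      # advance the buffer by one word
        bank_c.append(a * b)
    return bank_c, external_reads


if __name__ == "__main__":
    results, reads = multiply_with_register_file([1, 2, 3], [10, 20, 30, 40, 50, 60])
    print(results, reads)
```

Without the local register file, each product would require fetching both operands from external banks, roughly doubling the number of external reads in this model.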

在記憶體處理程序中，記憶體墊通常允許單向存取(亦即，單次存取)。在單向存取中，存在至記憶體之一個埠。結果，可在某一時間僅執行對特定位址之一個存取操作，例如讀取或寫入。然而，若記憶體墊本身允許雙向存取，則雙向存取可為有效選項。在雙向存取中，可在某一時間存取兩個不同位址。存取記憶體墊之方法可基於面積及要求而判定。在一些狀況下，若由記憶體墊實施之暫存器檔案連接至需要讀取兩個源且具有一個目的地暫存器之處理器，則該等暫存器檔案可允許四向存取。在一些狀況下，當暫存器檔案由DRAM墊實施以儲存組態或快取記憶體資料時，暫存器檔案可僅允許單向存取。標準CPU可包括多向存取墊，而單向存取墊對於DRAM應用可為更佳的。 In the memory processing procedure, the memory pad usually allows one-way access (ie, single access). In one-way access, there is a port to the memory. As a result, only one access operation to a specific address, such as reading or writing, can be performed at a certain time. However, if the memory pad itself allows two-way access, then two-way access may be a valid option. In bidirectional access, two different addresses can be accessed at a certain time. The method of accessing the memory pad can be determined based on the area and requirements. In some situations, if the register files implemented by the memory pad are connected to a processor that needs to read two sources and has a destination register, the register files can allow four-way access. In some situations, when the register file is implemented by a DRAM pad to store configuration or cache memory data, the register file may only allow one-way access. Standard CPUs may include multi-directional access pads, while unidirectional access pads may be better for DRAM applications.

When a controller or accelerator is designed so that it needs only single access to the registers (in the few cases where this is possible), registers implemented by a memory pad may be used instead of a traditional register file. With single access, only one word can be accessed at a time. For example, a processing unit may access two words at a given time from two register files, each of which may be implemented by a memory pad (e.g., a DRAM pad) that allows only single access.

In most technologies, a memory pad IP, which is a closed block (IP) obtained from the manufacturer, comes with wiring already in place for row and column access, such as word lines and column lines. The memory pad IP does not, however, include the surrounding logic components. Therefore, a register file implemented by a memory pad as disclosed in the present embodiments may include logic components. The size of the memory pad may be selected based on the required size of the register file.

Certain challenges may arise when a memory pad is used to provide the registers of a register file, and these challenges may depend on the particular memory technology used to form the memory pad. For example, in memory production, not all manufactured memory cells operate properly after production. This is a known problem, especially when a chip carries a high density of SRAM or DRAM. To address this problem, one or more redundancy mechanisms may be used to keep yield at a reasonable level. In the disclosed embodiments, because the number of memory instances (e.g., memory banks) used to provide the registers of a register file may be quite small, redundancy mechanisms may be less important than in ordinary memory applications. On the other hand, the same production issues that affect memory functionality may also affect whether a particular memory pad functions properly when providing one or more registers. As a result, redundant elements may be included in the disclosed embodiments. For example, at least one redundant memory pad may be disposed on the substrate of the distributed processor memory chip. The at least one redundant memory pad may be configured to provide at least one redundant register for one or more of the plurality of processor subunits. In another example, the pad may be larger than the required size (e.g., 620×620 rather than 512×512), and redundancy mechanisms may be built into the region of the memory pad outside the 512×512 region or its equivalent.
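As a hedged illustration of the oversized-pad example above, the following sketch remaps defective rows of the logical region onto spare rows that lie outside it. Row-level remapping is only one possible redundancy mechanism and is an assumption made here; the disclosure does not specify the mechanism, and the function name `build_row_map` is hypothetical.

```python
def build_row_map(logical_rows, physical_rows, defective_rows):
    """Map each logical row to a working physical row.

    Rows 0..logical_rows-1 form the region the register file uses
    (e.g., 512 rows); rows logical_rows..physical_rows-1 are spares
    (e.g., up to row 619). A defective logical row is redirected to the
    first spare row that is not itself defective.
    """
    spares = [r for r in range(logical_rows, physical_rows)
              if r not in defective_rows]
    row_map = {}
    for row in range(logical_rows):
        if row in defective_rows:
            if not spares:
                raise ValueError("not enough working spare rows")
            row_map[row] = spares.pop(0)
        else:
            row_map[row] = row
    return row_map


if __name__ == "__main__":
    # Small stand-in for a 512-row region inside a 620-row pad.
    print(build_row_map(4, 6, {1, 4}))
```

The same idea could be applied to columns, or to whole redundant pads as in the first example above.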

Another challenge may relate to timing. The timing of loading the word and bit lines is generally determined by the size of the memory. Because the register file may be implemented by a single, fairly small memory pad (e.g., 512×512 bits), the time required to load a word from the memory pad will be short, and the timing may be sufficient to run fairly fast compared with the logic.

Refresh: some memory types, such as DRAM, need to be refreshed periodically. The refresh may be performed while the processor or accelerator is paused. For small memory pads, the refresh time may be a small fraction of the total time. Therefore, even if the system stops for a short period, the gain obtained by using memory pads as registers may be worth the downtime in terms of overall performance. In one embodiment, the processing unit may include a counter that counts down from a predefined number. When the counter reaches "0", the processing unit may stop the current task being executed by the processor (e.g., the accelerator) and trigger a refresh operation that refreshes the memory pad row by row. When the refresh operation completes, the processor may resume its task, and the counter may be reset to count down from the predefined number.
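The countdown-and-refresh behavior described above can be sketched, purely as an illustrative model, as follows. The class name `RefreshingProcessingUnit`, the per-cycle `tick` method, and the event log are hypothetical conveniences and do not describe actual hardware.

```python
class RefreshingProcessingUnit:
    """Toy model of a processing unit that pauses work to refresh a pad."""

    def __init__(self, refresh_interval, num_rows):
        self.refresh_interval = refresh_interval  # the predefined number
        self.num_rows = num_rows                  # rows in the memory pad
        self.counter = refresh_interval
        self.log = []                             # records what happened

    def tick(self):
        """Advance one step: do useful work, or stall and refresh."""
        if self.counter == 0:
            # The current task is paused; the pad is refreshed row by row.
            for row in range(self.num_rows):
                self.log.append(f"refresh row {row}")
            # The task resumes and the counter restarts its countdown.
            self.counter = self.refresh_interval
        else:
            self.log.append("work")
            self.counter -= 1


if __name__ == "__main__":
    unit = RefreshingProcessingUnit(refresh_interval=3, num_rows=2)
    for _ in range(4):
        unit.tick()
    print(unit.log)
```

With a small pad the refresh loop is short relative to the interval, which is the condition under which the downtime stays a small fraction of the total time.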

FIG. 86 provides a flowchart 8600 representing an exemplary method for executing at least one instruction in a distributed processor memory chip, consistent with the disclosed embodiments. For example, at step 8602, at least one data value may be retrieved from a memory array on a substrate of the distributed processor memory chip. At step 8604, the retrieved data value may be stored in a register provided by a memory pad of the memory array on the substrate of the distributed processor memory chip. At step 8606, a processor element, such as one or more of the distributed processor subunits on board the distributed processor memory chip, may operate on the stored data value from the memory pad register.

Here and throughout, it should be understood that all references to a register file apply equally to a cache, because a register file may be the lowest-level cache.

Processing bottlenecks

The terms "first", "second", "third", and the like are used only to distinguish between different terms. These terms may not imply an order, timing, and/or importance of the elements. For example, a first process may be preceded by a second process, and so on.

The term "coupled" may mean connected directly and/or indirectly.

The terms "memory/processing", "memory and processing", and "memory processing" are used interchangeably.

Multiple methods, computer-readable media, memory/processing units, and/or systems that may be memory/processing units may be provided.

A memory/processing unit is a hardware unit that has memory and processing capabilities.

A memory/processing unit may be a memory processing integrated circuit, may be included in a memory processing integrated circuit, or may include one or more memory processing integrated circuits.

A memory/processing unit may be a distributed processor as described in PCT Patent Application Publication WO2019025892.

A memory/processing unit may include a distributed processor as described in PCT Patent Application Publication WO2019025892.

A memory/processing unit may belong to a distributed processor as described in PCT Patent Application Publication WO2019025892.

A memory/processing unit may be a memory chip as described in PCT Patent Application Publication WO2019025892.

A memory/processing unit may include a memory chip as described in PCT Patent Application Publication WO2019025892.

A memory/processing unit may be a distributed processor as described in PCT Patent Application No. PCT/IB2019/001005.

A memory/processing unit may belong to a distributed processor as described in PCT Patent Application No. PCT/IB2019/001005.

A memory/processing unit may be a memory chip as described in PCT Patent Application No. PCT/IB2019/001005.

A memory/processing unit may include a memory chip as described in PCT Patent Application No. PCT/IB2019/001005.

A memory/processing unit may belong to a memory chip as described in PCT Patent Application No. PCT/IB2019/001005.

A memory/processing unit may be integrated circuits that are connected to each other using wafer-to-wafer bonding and multiple conductors.

Any reference to a distributed processor memory chip, a distributed memory processing integrated circuit, a memory chip, or a distributed processor may be implemented as a pair of integrated circuits connected to each other by wafer-to-wafer bonding and multiple conductors.

A memory/processing unit may be manufactured by a first manufacturing process that is better suited to memory cells than to logic cells. The first manufacturing process may therefore be regarded as a memory-class manufacturing process. A memory cell may include one or more transistors. A logic cell may include one or more transistors. The first manufacturing process may be applied to manufacture memory banks. A logic cell may include one or more transistors that together implement a logic function, and may be used as a basic building block of larger logic circuits. A memory cell may include one or more transistors that together implement a memory function, and may be used as a basic building block of larger logic circuits. Corresponding logic cells may implement the same logic function.

A memory/processing unit may differ from any of a processor, a processing integrated circuit, and/or a processing unit that is manufactured by a second manufacturing process that is better suited to logic cells than to memory cells. The second manufacturing process may therefore be regarded as a logic-class manufacturing process. The second manufacturing process may be used to manufacture central processing units, graphics processing units, and the like.

Compared with a processor, a processing integrated circuit, and/or a processing unit, a memory/processing unit may be better suited to performing less arithmetic-intensive operations.

For example, memory cells manufactured by the first manufacturing process may exhibit a critical dimension that exceeds, and even greatly exceeds (for example, by a factor of more than 2, 3, 4, 5, 6, 7, 8, 9, 10, and the like), the critical dimension of logic circuits manufactured by the first manufacturing process.

The first manufacturing process may be an analog manufacturing process, may be a DRAM manufacturing process, and the like.

The size of a logic cell manufactured by the first manufacturing process may be at least twice the size of a corresponding logic cell manufactured by the second manufacturing process. The corresponding logic cell may have the same functionality as the logic cell manufactured by the first manufacturing process.

The second manufacturing process may be a digital manufacturing process.

The second manufacturing process may be any of complementary metal-oxide-semiconductor (CMOS), bipolar, bipolar CMOS (BiCMOS), double-diffused metal-oxide-semiconductor (DMOS), silicon-on-oxide manufacturing processes, and the like.

A memory/processing unit may include multiple processor subunits.

The processor subunits of one or more memory/processing units may operate independently of each other, and/or may cooperate with each other, and/or may perform distributed processing. The distributed processing may be performed in various manners, for example in a flat manner or in a hierarchical manner.

The flat manner may involve having the processor subunits perform the same operations (and may or may not involve outputting processing results between processor subunits).

The hierarchical manner may involve executing a sequence of processing operations at different levels, where the processing operations of one level follow the processing operations of another level. Processor subunits may be allocated (dynamically or statically) to different levels and participate in the hierarchical processing.

The distributed processing may also involve other units, for example a controller of a memory/processing unit and/or units that do not belong to a memory/processing unit.

The terms logic and processor subunit are used interchangeably.

Any processing mentioned in this application may be performed in any manner (distributed and/or non-distributed, and the like).

In the following, PCT Patent Application Publication WO2019025892 and PCT Patent Application No. PCT/IB2019/001005 (September 9, 2019) are referenced in various places and/or incorporated by reference. PCT Patent Application Publication WO2019025892 and/or PCT Patent Application No. PCT/IB2019/001005 provide non-limiting examples of various methods, systems, processors, memory chips, and the like. Other methods, systems, and processors may be provided.

A processing system (system) may be provided in which a processor is preceded by one or more memory/processing units, each memory and processing unit (memory/processing unit) having processing resources and storage resources.

The processor may request or instruct the one or more memory/processing units to perform various processing tasks. Execution of the various processing tasks may offload the processor, reduce latency, in some cases reduce the total information bandwidth between the one or more memory/processing units and the processor, and the like.

The processor may provide instructions and/or requests at different granularities. For example, the processor may send instructions aimed at certain processing resources, or may send higher-level instructions aimed at a memory/processing unit without specifying any processing resources.

A memory/processing unit may manage its processing and/or memory resources in any manner (dynamic, static, distributed, centralized, offline, online, and the like). The management of resources may be performed autonomously, under the control of the processor, following configuration by the processor, and the like.

For example, a task may be partitioned into subtasks that may require execution of one or more instructions by one or more processing resources and/or memory resources of the one or more memory/processing units. Each processing resource may be configured to execute (e.g., independently or not) at least one instruction. See, for example, the execution of sub-series of instructions by processing resources such as the processor subunits of PCT Patent Application Publication WO2019025892.

The allocation of at least the memory resources may also be provided to entities other than the one or more memory/processing units, for example a direct memory access (DMA) unit that may be coupled to the one or more memory/processing units.

A compiler may prepare a configuration file for each type of task performed by a memory/processing unit. The configuration file includes the memory allocations and processing resource allocations associated with the task type. The configuration file may include instructions that may be executed by different processing resources and/or that may define memory allocations.

For example, a configuration file related to a matrix multiplication task (multiplying matrix A by matrix B, A*B=C) may indicate where to store the elements of matrix A, where to store the elements of matrix B, where to store the elements of matrix C, and where to store intermediate results generated during the matrix multiplication, and may include instructions aimed at the processing resources used to perform any mathematical operations related to the matrix multiplication. The configuration file is an example of a data structure; other data structures may be provided.
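As a purely illustrative sketch of such a data structure, the following shows how a per-task-type configuration for matrix multiplication might be represented. The class `TaskConfig`, its field names, and the labels such as `bank_0` and `proc_0` are hypothetical placeholders chosen for this example; the disclosure does not define a file format.

```python
from dataclasses import dataclass, field


@dataclass
class TaskConfig:
    """Hypothetical configuration record for one task type."""
    task_type: str
    # Where each operand or result lives: name -> memory resource label.
    memory_allocation: dict
    # Instructions per processing resource: label -> list of instructions.
    instructions: dict = field(default_factory=dict)


# Example configuration for C = A * B (all labels are placeholders).
matmul_config = TaskConfig(
    task_type="matrix_multiply",
    memory_allocation={
        "A": "bank_0",
        "B": "bank_1",
        "C": "bank_2",
        "intermediate": "bank_3",
    },
    instructions={
        "proc_0": ["load A", "load B", "multiply_accumulate", "store C"],
    },
)

if __name__ == "__main__":
    print(matmul_config.memory_allocation["C"])
```

A record of this kind keeps the memory allocation and the per-resource instructions together, so one lookup by task type yields everything a memory/processing unit would need for that task.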

The matrix multiplication may be performed in any manner by the one or more memory/processing units.

The one or more memory/processing units may multiply matrix A by vector V. This may be done in any manner. For example, it may involve maintaining one row or column of the matrix per processing resource (a different row or column for each processing resource), circulating (among the different processing resources) the results of multiplying the rows or columns of the matrix by the vector (during the first iteration), and circulating the results of the previous multiplication (during the second through last iterations).

Assume that matrix A is a 4×4 matrix, that vector V is a 1×4 vector, and that there are four processing resources. Under these assumptions, the first row of matrix A is stored at the first processing resource, the second row at the second processing resource, the third row at the third processing resource, and the fourth row at the fourth processing resource. The multiplication begins by sending the first through fourth elements of vector V to the first through fourth processing resources, respectively, and multiplying the first through fourth elements of vector V by different vectors of A to provide first intermediate results. The multiplication continues by circulating the first intermediate results: each processing resource sends the first intermediate result that it computed to its neighboring processing resource. Each processing resource then multiplies the first multiplication result by the vector to provide a second multiplication result. This process is repeated until the multiplication of matrix A by vector V is complete.
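The passage above does not fully specify which values circulate between the processing resources. As one possible and commonly used ring schedule, offered only as an assumption and not as the method of the disclosure, the sketch below keeps one row of A at each processing resource, circulates the elements of V around the ring, and accumulates one output element per resource. The function name `ring_matvec` is hypothetical.

```python
def ring_matvec(matrix, vector):
    """Multiply an n x n matrix by a length-n vector on a simulated ring.

    Resource i permanently holds row i of the matrix. It starts with
    element i of the vector, and on every step passes the element it holds
    to the next resource in the ring. After n steps each resource has seen
    every vector element exactly once.
    """
    n = len(matrix)
    held = list(vector)            # held[i]: element currently at resource i
    held_index = list(range(n))    # which vector index that element has
    acc = [0] * n                  # acc[i]: partial result at resource i
    for _ in range(n):
        for i in range(n):
            acc[i] += matrix[i][held_index[i]] * held[i]
        # Pass each held element to the neighboring resource (i -> i + 1).
        held = held[-1:] + held[:-1]
        held_index = held_index[-1:] + held_index[:-1]
    return acc


if __name__ == "__main__":
    a = [[1, 0, 0, 0], [0, 2, 0, 0], [0, 0, 3, 0], [0, 0, 0, 4]]
    print(ring_matvec(a, [1, 1, 1, 1]))
```

In this schedule each resource only ever exchanges data with its ring neighbor, which matches the serial loop connectivity of the processing resources described elsewhere in this description.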

FIG. 90A is an example of a system 10900 that includes one or more memory/processing units (collectively denoted 10910) and a processor 10920. Processor 10920 may send requests or instructions (via link 10931) to the one or more memory/processing units 10910, which in turn fulfill (or selectively fulfill) the requests and/or instructions and send results (via link 10932) to processor 10920, as described above. Processor 10920 may further process the results to provide (via link 10933) one or more outputs.

The one or more memory/processing units may include J (J being a positive integer) memory resources 10912(1,1) to 10912(1,J) and K (K being a positive integer) processing resources 10911(1,1) to 10911(1,K).

J may equal K or may differ from K.

Processing resources 10911(1,1) to 10911(1,K) may be, for example, processing groups or processor subunits, as described in PCT Patent Application Publication WO2019025892.

Memory resources 10912(1,1) to 10912(1,J) may be memory instances, memory pads, or memory banks, as described in PCT Patent Application Publication WO2019025892.

There may be any connectivity and/or any functional relationship between any of the resources (memory or processing) of the one or more memory/processing units.

FIG. 90B is an example of a memory/processing unit 10910(1).

In FIG. 90B, K (K being a positive integer) processing resources 10911(1,1) to 10911(1,K) form a loop because they are connected to each other in series (see links 10915). Each processing resource is also coupled to its own pair of dedicated memory resources (for example, processing resource 10911(1) is coupled to memory resources 10912(1) and 10912(2), and processing resource 10911(K) is coupled to memory resources 10912(J-1) and 10912(J)). The processing resources may be connected to each other in any other manner. The number of memory resources allocated to each processing resource may differ from two. Examples of connectivity between the different resources are described in PCT Patent Application Publication WO2019025892.
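The loop connectivity and the pairing of memory resources with processing resources described for FIG. 90B can be sketched as simple index maps. The function name `ring_topology` and the 1-based integer labels are illustrative assumptions; they stand in for reference numerals such as 10911(k) and 10912(j).

```python
def ring_topology(k, mems_per_resource=2):
    """Build the loop and memory pairing for k processing resources.

    Returns (neighbors, memories):
      neighbors[i] is the processing resource that follows resource i
                   in the loop (resource k wraps around to resource 1);
      memories[i]  lists the dedicated memory resources of resource i.
    With two memories per resource, J = 2 * K and resource K is paired
    with memory resources J-1 and J.
    """
    neighbors = {i: (i % k) + 1 for i in range(1, k + 1)}
    memories = {
        i: list(range((i - 1) * mems_per_resource + 1,
                      i * mems_per_resource + 1))
        for i in range(1, k + 1)
    }
    return neighbors, memories


if __name__ == "__main__":
    nbrs, mems = ring_topology(4)
    print(nbrs)
    print(mems)
```

Changing `mems_per_resource` models the case, noted above, in which the number of memory resources allocated to each processing resource differs from two.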

FIG. 90C is an example of a system 10901 that includes N (N being a positive integer) memory/processing units 10910(1) to 10910(N) and a processor 10920. Processor 10920 may send requests or instructions (via links 10931(1) to 10931(N)) to memory/processing units 10910(1) to 10910(N), which in turn fulfill the requests and/or instructions and send results (via links 10932(1) to 10932(N)) to processor 10920, as described above. Processor 10920 may further process the results to provide (via link 10933) one or more outputs.

FIG. 90D is an example of a system 10902 that includes N (N being a positive integer) memory/processing units 10910(1) to 10910(N) and a processor 10920. FIG. 90D illustrates a preprocessor 10909 that precedes memory/processing units 10910(1) to 10910(N). The preprocessor may perform various preprocessing operations, such as frame extraction, header detection, and the like.

FIG. 90E is an example of a system 10903 that includes one or more memory/processing units 10910 and a processor 10920. FIG. 90E illustrates a preprocessor 10909 preceding the one or more memory/processing units 10910 and a DMA controller 10908.

FIG. 90F illustrates a method 10800 for distributed processing of at least one information stream.

Method 10800 may begin with step 10810 of receiving at least one information stream, by one or more memory processing integrated circuits, via a first communication channel, where each memory processing integrated circuit includes a controller, multiple processor subunits, and multiple memory units.

Step 10810 may be followed by steps 10820 and 10830.

Step 10820 may include buffering the information stream by the one or more memory processing integrated circuits.

Step 10830 may include performing, by the one or more memory processing integrated circuits, a first processing operation on the at least one information stream to provide first processing results.

Step 10830 may involve compression or decompression.

Accordingly, the total size of the information streams may exceed the total size of the first processing results. The total size of the information streams may reflect the amount of information received during a period of a given duration. The total size of the first processing results may reflect the amount of first processing results output during any period of the same given duration.

Alternatively, the total size of the information streams (or of any other information entity mentioned in this specification) may be smaller than the total size of the first processing results. In this case, decompression is obtained.

Step 10830 may be followed by step 10840 of sending the first processing results to one or more processing integrated circuits.

The one or more memory processing integrated circuits may be manufactured by a memory-class manufacturing process.

The one or more processing integrated circuits may be manufactured by a logic-class manufacturing process.

In a memory processing integrated circuit, each of the memory units may be coupled to a processor subunit.

Step 10840 may be followed by step 10850 of performing, by the one or more processing integrated circuits, a second processing operation on the first processing results to provide second processing results.

Step 10820 and/or step 10830 may be instructed by the one or more processing integrated circuits, may be requested by the one or more processing integrated circuits, may be executed after the one or more processing integrated circuits configure the one or more memory processing integrated circuits, or may be executed independently, without intervention of the one or more processing integrated circuits.

The first processing operation may have a lower arithmetic intensity than the second processing operation.

Step 10830 and/or step 10850 may be at least one of the following: (a) a cellular network processing operation; (b) another network-related processing operation (processing of a network other than a cellular network); (c) a database processing operation; (d) a database analytics processing operation; (e) an artificial intelligence processing operation; or any other processing operation.
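The flow of method 10800 can be sketched as a two-stage pipeline. This is a minimal behavioral model under stated assumptions: the function names are hypothetical, a simple filter stands in for the lower-intensity first processing operation, a sum of squares stands in for the second processing operation, and a function call stands in for the transfer of step 10840.

```python
def first_processing(buffered):
    """Stand-in for step 10830: a low-arithmetic-intensity filter."""
    return [x for x in buffered if x % 2 == 0]


def second_processing(first_results):
    """Stand-in for step 10850: a more arithmetic-intensive reduction."""
    return sum(x * x for x in first_results)


def method_10800(stream):
    """Model of steps 10810 to 10850 for a single information stream."""
    buffered = list(stream)                      # steps 10810 and 10820
    first_results = first_processing(buffered)   # step 10830
    # Step 10840: the first processing results are handed to the
    # processing integrated circuit (modeled here as a function call).
    second_result = second_processing(first_results)  # step 10850
    return first_results, second_result


if __name__ == "__main__":
    print(method_10800(range(6)))
```

In this example the first stage forwards three of the six received items, so the stream entering the processing stage is smaller than the stream received.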

Disaggregated system, memory/processing units, and methods for distributed processing

There may be provided a disaggregated system, a method for distributed processing, a processing/memory unit, a method for operating a disaggregated system, a method for operating a processing/memory unit, and a computer-readable medium that is non-transitory and stores instructions for executing any of these methods. A disaggregated system allocates different subsystems to perform different functions. For example, storage may be implemented mainly in one or more storage subsystems, while computation may be performed mainly in one or more computing subsystems.

The disaggregated system may be a disaggregated server, may be one or more disaggregated servers, and/or may differ from one or more servers.

The disaggregated system may include one or more switching subsystems, one or more computing subsystems, one or more storage subsystems, and one or more processing/memory subsystems.

The one or more processing/memory subsystems, the one or more computing subsystems, and the one or more storage subsystems are coupled to each other via the one or more switching subsystems.

The one or more processing/memory subsystems may be included in one or more subsystems of the disaggregated system.

FIG. 87A illustrates various examples of disaggregated systems.

Any number of subsystems of any type may be provided. A disaggregated system may include one or more additional subsystems of types not shown in FIG. 87A, may include fewer types of subsystems, and the like.

Disaggregated system 7101 includes two storage subsystems 7130, a computing subsystem 7120, a switching subsystem 7140, and a processing/memory subsystem 7110.

Disaggregated system 7102 includes two storage subsystems 7130, a computing subsystem 7120, a switching subsystem 7140, a processing/memory subsystem 7110, and an accelerator subsystem 7150.

Disaggregated system 7103 includes two storage subsystems 7130, a computing subsystem 7120, and a switching subsystem 7140 that includes a processing/memory subsystem 7110.

Disaggregated system 7104 includes two storage subsystems 7130, a computing subsystem 7120, a switching subsystem 7140 that includes a processing/memory subsystem 7110, and an accelerator subsystem 7150.

將處理/記憶體子系統7110包括於交換子系統7140中可減少分解式系統7103及7104內之訊務，可減少切換之潛時，及其類似者。 Including the processing/memory subsystem 7110 in the switching subsystem 7140 may reduce traffic within disaggregated systems 7103 and 7104, may reduce switching latency, and the like.

分解式系統之不同子系統可使用各種通信協定彼此通信。已發現，使用乙太網路及甚至乙太網路RDMA通信協定可增加輸送量，且可能甚至降低與分解式系統之元件之間的資訊單元之交換相關的各種控制及/或儲存操作之複雜度。 Different subsystems of the disaggregated system may communicate with each other using various communication protocols. It has been found that using Ethernet, and even Ethernet RDMA, communication protocols may increase throughput and may even reduce the complexity of various control and/or storage operations related to the exchange of information units between elements of the disaggregated system.

分解式系統可藉由允許處理/記憶體子系統參與計算(尤其藉由執行記憶體密集型計算)來執行分散式處理。 Disaggregated systems can perform distributed processing by allowing the processing/memory subsystem to participate in calculations (especially by performing memory-intensive calculations).

舉例而言，假定N個運算單元應在其間共用資訊單元(全部共用)，則(a)可將N個資訊單元發送至一或多個處理/記憶體子系統之一或多個處理/記憶體單元，(b)一或多個處理/記憶體單元可執行需要全部共用之計算，且(c)將N個經更新資訊單元發送至N個運算單元。此將需要大約N個傳送操作。 For example, assuming that N computing units should share information units among themselves (all-to-all sharing), then (a) the N information units may be sent to one or more processing/memory units of the one or more processing/memory subsystems, (b) the one or more processing/memory units may perform the calculation that requires the all-to-all sharing, and (c) N updated information units may be sent to the N computing units. This requires on the order of N transfer operations.
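
The linear growth can be made concrete with a small count. The following is a minimal sketch, not part of the disclosure: it counts N transfers for step (a) plus N transfers for step (c), and compares that with an assumed baseline in which every computing unit sends its information unit directly to every other computing unit. The baseline and both function names are assumptions introduced only for illustration.

```python
def hub_transfers(n: int) -> int:
    """Transfers when N computing units share through a processing/memory unit.

    N information units are sent in (step a) and N updated information
    units are sent back (step c), so the count grows linearly with N.
    """
    return 2 * n


def pairwise_transfers(n: int) -> int:
    """Assumed baseline: each unit sends its information unit to every other unit."""
    return n * (n - 1)


# Compare the two counts for a few system sizes.
for n in (4, 16, 64):
    print(f"N={n}: via processing/memory unit={hub_transfers(n)}, "
          f"direct pairwise={pairwise_transfers(n)}")
```

Under these assumptions the hub scheme stays proportional to N, while the direct exchange grows with the square of N.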

舉例而言，圖87B說明更新神經網路之模型(該模型包括指派給神經網路之節點的權重)的分散式處理。 For example, Figure 87B illustrates distributed processing for updating a model of a neural network (the model includes the weights assigned to the nodes of the neural network).

N個運算單元PU(1)7120(1)至PU(N)7120(N)中之每一者可屬於分解式系統7101、7102、7103及7104中之任一者的運算子系統7120。 Each of the N computing units PU(1) 7120(1) to PU(N) 7120(N) may belong to the computing subsystem 7120 of any one of disaggregated systems 7101, 7102, 7103, and 7104.

N個運算單元計算N個部分模型更新(經更新之N個不同部分)7121(1)至7121(N)，且將其發送(經由交換子系統7140)至處理/記憶體子系統7110。 The N computing units calculate N partial model updates (N different updated parts) 7121(1) to 7121(N) and send them (via the switching subsystem 7140) to the processing/memory subsystem 7110.

處理/記憶體子系統7110計算經更新模型7122且將經更新模型發送(經由交換子系統7140)至N個運算單元PU(1)7120(1)至PU(N)7120(N)。 The processing/memory subsystem 7110 calculates the updated model 7122 and sends the updated model (via the switching subsystem 7140) to the N computing units PU(1) 7120(1) to PU(N) 7120(N).
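
The flow of Figure 87B may be sketched as follows. This is an illustrative sketch only: the dictionary representation of the model, the rule that a partial update simply overwrites the weights it covers, and the names `merge_partial_updates` and `broadcast` are assumptions and are not taken from the disclosure.

```python
from typing import Dict, List

Model = Dict[str, float]  # node name -> weight (hypothetical representation)


def merge_partial_updates(model: Model, partial_updates: List[Model]) -> Model:
    """Role of processing/memory subsystem 7110: build the updated model 7122.

    Assumes each partial update 7121(n) covers a different part of the
    model, so merging overwrites only the weights that the update names.
    """
    updated = dict(model)
    for part in partial_updates:
        updated.update(part)
    return updated


def broadcast(updated: Model, n_units: int) -> List[Model]:
    """Send one copy of the updated model to each of the N computing units."""
    return [dict(updated) for _ in range(n_units)]


model = {"w0": 0.0, "w1": 0.0, "w2": 0.0}
partials = [{"w0": 0.5}, {"w1": -0.25}, {"w2": 1.0}]  # from PU(1)..PU(3)
new_model = merge_partial_updates(model, partials)
copies = broadcast(new_model, n_units=len(partials))
print(new_model, len(copies))
```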

圖87C、圖87D及圖87E分別說明記憶體/處理單元7011、7012及7013之實例，且圖87F及圖87G說明積體電路7014及7015，積體電路包括記憶體/處理單元9010以及諸如乙太網路模組及乙太網路RDMA模組7022之一或多個通信模組。 Figures 87C, 87D, and 87E illustrate examples of memory/processing units 7011, 7012, and 7013, respectively, and Figures 87F and 87G illustrate integrated circuits 7014 and 7015, each of which includes a memory/processing unit 9010 and one or more communication modules, such as an Ethernet module and an Ethernet RDMA module 7022.

記憶體/處理單元包括控制器9020、內部匯流排9021以及多對邏輯9030及記憶體組9040。控制器經組態以作為通信模組操作或可耦接至通信模組。 The memory/processing unit includes a controller 9020, an internal bus 9021, and multiple pairs of logic 9030 and memory banks 9040. The controller is configured to operate as a communication module or may be coupled to a communication module.

可用其他方式實施控制器9020與多對邏輯9030及記憶體組9040之間的連接性。可用其他方式(不成對)配置記憶體組及邏輯。 The connectivity between the controller 9020 and the multiple pairs of logic 9030 and memory bank 9040 can be implemented in other ways. Other ways (unpaired) can be used to configure memory groups and logic.

處理/記憶體子系統7110之一或多個記憶體/處理單元9010可並列地處理(使用不同邏輯且自不同記憶體組並列地擷取模型之不同部分)模型更新，且受益於大量記憶體資源、記憶體組與邏輯之間的連接之極高頻寬，可用高效方式執行此等計算。 One or more memory/processing units 9010 of the processing/memory subsystem 7110 may process the model update in parallel (using different logic and fetching different parts of the model in parallel from different memory banks) and, benefiting from the large amount of memory resources and the very high bandwidth of the connections between the memory banks and the logic, may perform these calculations efficiently.
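
The bank-level parallelism described above can be imitated in software. The sketch below is an analogy under stated assumptions, not a model of the hardware: each list stands for the slice of the model held in one memory bank 9040, each worker thread stands for the logic 9030 paired with that bank, and the additive `delta` update is a hypothetical update rule.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List


def apply_update_in_bank(bank: List[float], delta: List[float]) -> List[float]:
    """Work done by one logic/bank pair: update only the locally stored weights."""
    return [weight + change for weight, change in zip(bank, delta)]


def update_all_banks(banks: List[List[float]],
                     deltas: List[List[float]]) -> List[List[float]]:
    """Run one worker per bank so that all slices are updated in parallel.

    Assumes at least one bank and one delta list per bank.
    """
    with ThreadPoolExecutor(max_workers=len(banks)) as pool:
        # map() preserves bank order in the returned results.
        return list(pool.map(apply_update_in_bank, banks, deltas))


banks = [[1.0, 2.0], [3.0]]
deltas = [[0.5, 0.5], [1.0]]
print(update_all_banks(banks, deltas))
```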

圖87C至圖87E之記憶體/處理單元7011、7012及7013以及圖87F及圖87G之積體電路7014及7015包括一或多個通信模組，諸如乙太網路模組7023(在圖87C至圖87G中)及乙太網路RDMA模組7022(在圖87E及圖87G中)。 The memory/processing units 7011, 7012, and 7013 of Figures 87C to 87E and the integrated circuits 7014 and 7015 of Figures 87F and 87G include one or more communication modules, such as an Ethernet module 7023 (in Figures 87C to 87G) and an Ethernet RDMA module 7022 (in Figures 87E and 87G).

具有此等RDMA及/或乙太網路模組(在記憶體/處理單元內或在與記憶體/處理單元相同的積體電路內)大大加速分解式系統之不同元件之間的通信，且在RDMA之狀況下，大大簡化分解式系統之不同元件之間的通信。 Having these RDMA and/or Ethernet modules (within the memory/processing unit or within the same integrated circuit as the memory/processing unit) greatly accelerates communication between the different elements of the disaggregated system and, in the case of RDMA, greatly simplifies that communication.

應注意，包括RDMA及/或乙太網路模組之記憶體/處理單元在其他環境中可為有益的，即使當記憶體/處理單元不包括於分解式系統中時亦如此。 It should be noted that a memory/processing unit that includes RDMA and/or Ethernet modules may be beneficial in other environments, even when the memory/processing unit is not included in a disaggregated system.

亦應注意，例如出於減少成本原因，可針對記憶體/處理單元之每個群組分配RDMA及/或乙太網路模組。 It should also be noted that, for example, for cost reduction reasons, RDMA and/or Ethernet modules can be allocated to each group of memory/processing units.

應注意，記憶體/處理單元、記憶體/處理單元之群組及甚至處理/記憶體子系統可包括其他通信埠，例如PCIe通信埠。 It should be noted that the memory/processing unit, the group of memory/processing units, and even the processing/memory subsystem may include other communication ports, such as PCIe communication ports.

使用RDMA及/或乙太網路模組可具有成本效益，此係因為可消除將記憶體/處理單元連接至橋接器之需要，該橋接器連接至可具有乙太網路埠之網路積體電路(NIC)。 Using RDMA and/or Ethernet modules may be cost-effective because it eliminates the need to connect the memory/processing unit to a bridge that is connected to a network integrated circuit (NIC), which may have an Ethernet port.

使用RDMA及/或乙太網路模組可使乙太網路(或乙太網路 RDMA)為記憶體/處理單元中原生的。 Using RDMA and/or Ethernet modules may make Ethernet (or Ethernet RDMA) native to the memory/processing unit.

應注意，乙太網路僅為區域網路(LAN)協定之實例。PCIe僅為可在比乙太網路更長之距離上使用的另一通信協定之實例。 It should be noted that Ethernet is only an example of a local area network (LAN) protocol. PCIe is only an example of another communication protocol that can be used over longer distances than Ethernet.

圖87H說明用於分散式處理之方法7000。 Figure 87H illustrates a method 7000 for distributed processing.

方法7000可包括一或多個處理反覆。 Method 7000 may include one or more processing iterations.

處理反覆可由分解式系統之一或多個記憶體處理積體電路執行。 A processing iteration may be performed by one or more memory processing integrated circuits of the disaggregated system.

處理反覆可由分解式系統之一或多個處理積體電路執行。 A processing iteration may be performed by one or more processing integrated circuits of the disaggregated system.

由一或多個記憶體處理積體電路執行之處理反覆之後可接著為由一或多個處理積體電路執行之處理反覆。 A processing iteration performed by the one or more memory processing integrated circuits may be followed by a processing iteration performed by one or more processing integrated circuits.

由一或多個記憶體處理積體電路執行之處理反覆可在由一或多個處理積體電路執行之處理反覆之前。 A processing iteration performed by the one or more memory processing integrated circuits may precede a processing iteration performed by one or more processing integrated circuits.

又一處理反覆可由分解式系統之其他電路執行。舉例而言，一或多個預處理電路可執行任何類型之預處理，包括準備用於一或多個記憶體處理積體電路執行之處理反覆的資訊單元。 Yet another processing iteration may be performed by other circuits of the disaggregated system. For example, one or more preprocessing circuits may perform any type of preprocessing, including preparing the information units for the processing iteration performed by the one or more memory processing integrated circuits.

方法7000可包括藉由分解式系統之一或多個記憶體處理積體電路接收資訊單元的步驟7020。 Method 7000 may include a step 7020 of receiving information units by one or more memory processing integrated circuits of the disaggregated system.

每一記憶體處理積體電路可包括控制器、多個處理器子單元及多個記憶體單元。 Each memory processing integrated circuit may include a controller, multiple processor subunits, and multiple memory units.

資訊單元可輸送神經網路之模型的部分。 The information units may convey parts of a model of a neural network.

資訊單元可輸送至少一個資料庫查詢之部分結果。 The information units may convey partial results of at least one database query.

資訊單元可輸送至少一個聚集資料庫查詢之部分結果。 The information units may convey partial results of at least one aggregate database query.

步驟7020可包括自分解式系統之一或多個儲存子系統接收資訊單元。 Step 7020 may include receiving the information units from one or more storage subsystems of the disaggregated system.

步驟7020可包括自分解式系統之一或多個運算子系統接收資訊單元，一或多個運算子系統可包括由邏輯類別之製造製程製造的多個處理積體電路。 Step 7020 may include receiving the information units from one or more computing subsystems of the disaggregated system; the one or more computing subsystems may include multiple processing integrated circuits manufactured by a logic-class manufacturing process.

步驟7020之後可接著為藉由一或多個記憶體處理積體電路對資訊單元執行處理操作以提供處理結果的步驟7030。 Step 7020 can be followed by step 7030 in which one or more memory processing integrated circuits perform processing operations on the information unit to provide processing results.

資訊單元之總大小可超過，可等於或可小於處理結果之總大小。 The total size of the information unit may exceed, may be equal to, or may be smaller than the total size of the processing result.

步驟7030之後可接著為藉由一或多個記憶體處理積體電路輸出處理結果之步驟7040。 Step 7030 can be followed by step 7040 of outputting processing results by one or more memory processing integrated circuits.

步驟7040可包括將處理結果輸出至分解式系統之一或多個運算子系統，一或多個運算子系統可包括由邏輯類別之製造製程製造的多個處理積體電路。 Step 7040 may include outputting the processing result to one or more computing subsystems of the decomposition system, and the one or more computing subsystems may include multiple processing integrated circuits manufactured by the manufacturing process of the logic category.

步驟7040可包括將處理結果輸出至分解式系統之一或多個儲存子系統。 Step 7040 may include outputting the processing result to one or more storage subsystems of the disaggregated system.

資訊單元可自多個處理積體電路之處理單元的不同群組發送，且可為藉由多個處理積體電路以分散式方式執行之處理程序的中間結果之不同部分。處理單元之群組可包括至少一個處理積體電路。 Information units can be sent from different groups of processing units of multiple processing integrated circuits, and can be different parts of intermediate results of processing procedures executed in a distributed manner by multiple processing integrated circuits. The group of processing units may include at least one processing integrated circuit.

步驟7030可包括處理資訊單元以提供整個處理程序之結果。 Step 7030 may include processing the information unit to provide the result of the entire processing procedure.

步驟7040可包括將整個處理程序之結果發送至多個處理積體電路中之每一者。 Step 7040 may include sending the result of the entire processing procedure to each of the multiple processing integrated circuits.

中間結果之不同部分可為經更新神經網路模型之不同部分，且其中整個處理程序之結果為經更新神經網路模型。 The different parts of the intermediate result can be different parts of the updated neural network model, and the result of the entire processing procedure is the updated neural network model.

步驟7040可包括將經更新神經網路模型發送至多個處理積體電路中之每一者。 Step 7040 may include sending the updated neural network model to each of the multiple processing integrated circuits.

步驟7040之後可接著為藉由多個處理積體電路至少部分地基於至多個處理積體電路之處理結果來執行另一處理的步驟7050。 Step 7040 may be followed by step 7050 of performing, by the multiple processing integrated circuits, further processing based at least in part on the processing results sent to the multiple processing integrated circuits.

步驟7040可包括使用分解式系統之交換子單元輸出處理結果。 Step 7040 may include using the exchange subunit of the decomposition system to output the processing result.

步驟7020可包括接收資訊單元，該等資訊單元為經預處理之資訊單元。 Step 7020 may include receiving information units, which are preprocessed information units.

圖87I說明用於分散式處理之方法7001。 Figure 87I illustrates a method 7001 for distributed processing.

方法7001與方法7000的不同之處在於包括藉由多個處理積體電路預處理資訊以提供經預處理之資訊單元的步驟7010。 The method 7001 is different from the method 7000 in that it includes a step 7010 of preprocessing information by a plurality of processing integrated circuits to provide a preprocessed information unit.

步驟7010之後可接著為步驟7020、7030及7040。 Step 7010 may be followed by steps 7020, 7030, and 7040.
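
The ordering of the steps of method 7001 may be sketched as a simple pipeline. This is a minimal sketch under stated assumptions: the function name `run_method_7001` and the callables `preprocess` and `process` are hypothetical placeholders for whatever the processing integrated circuits and the memory processing integrated circuits actually compute.

```python
from typing import Callable, Iterable, List, TypeVar

Raw = TypeVar("Raw")
Unit = TypeVar("Unit")
Result = TypeVar("Result")


def run_method_7001(raw_information: Iterable[Raw],
                    preprocess: Callable[[Raw], Unit],
                    process: Callable[[List[Unit]], Result]) -> Result:
    # Step 7010: processing integrated circuits preprocess the information.
    preprocessed_units = [preprocess(item) for item in raw_information]
    # Step 7020: memory processing integrated circuits receive the units.
    received_units = list(preprocessed_units)
    # Step 7030: memory processing integrated circuits process the units.
    processing_result = process(received_units)
    # Step 7040: the processing result is output.
    return processing_result


# Hypothetical usage: double each value, then aggregate by summation.
print(run_method_7001([1, 2, 3], preprocess=lambda x: x * 2, process=sum))
```

Method 7000 corresponds to the same sketch with step 7010 omitted.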

資料庫分析加速 Database analysis acceleration

提供一種裝置、方法及電腦可讀媒體，該裝置、該方法及該電腦可讀媒體儲存用於藉由屬於與記憶體單元相同之積體電路的篩選單元至少執行篩選的指令，而篩選器可提示哪些條目與某一資料庫查詢相關。仲裁器或任何其他流程控制管理器可將相關條目發送至處理器且不將不相關條目發送至處理器，因此節省了至處理器及來自處理器之幾乎大部分訊務。 There are provided a device, a method, and a computer-readable medium that stores instructions for performing at least screening by a screening unit that belongs to the same integrated circuit as a memory unit, where the screening unit may indicate which entries are relevant to a certain database query. An arbiter or any other flow control manager may send the relevant entries to the processor and not send the irrelevant entries to the processor, thus saving most of the traffic to and from the processor.

參見例如圖91A，其展示處理器(CPU 9240)、包括記憶體及篩選系統9220之積體電路。記憶體及篩選系統9220可包括耦接至記憶體單元條目9222及一或多個仲裁器(諸如，用於將相關條目發送至處理器之仲裁器9229)之篩選單元9224。可應用任何仲裁處理程序。條目之數目、篩選單元之數目及仲裁器之數目之間可存在任何關係。 See, for example, Figure 91A, which shows a processor (CPU 9240) and an integrated circuit that includes a memory and screening system 9220. The memory and screening system 9220 may include screening units 9224 coupled to memory unit entries 9222 and to one or more arbiters (such as arbiter 9229, which sends the relevant entries to the processor). Any arbitration process may be applied. There may be any relationship between the number of entries, the number of screening units, and the number of arbiters.

仲裁器可由能夠控制資訊流之任何單元替換，例如通信介面、流控制器及其類似者。 The arbiter can be replaced by any unit capable of controlling the flow of information, such as a communication interface, a flow controller, and the like.

參考篩選，其係基於一或多個相關性/篩選準則。 As for the screening, it is based on one or more relevance/screening criteria.

可針對每個資料庫查詢設定相關性，且可用任何方式提示相關性，例如記憶體單元可儲存提示哪一條目相關之相關性旗標9224'。亦存在儲存 K個資料庫區段9220(k)之儲存裝置9210，而k之範圍為1與K之間。應注意，整個資料庫可儲存於記憶體單元中而不儲存於儲存裝置中(該解決方案亦被稱作揮發性記憶體儲存之資料庫)。 The relevance may be set per database query and may be indicated in any manner; for example, the memory unit may store relevance flags 9224' that indicate which entries are relevant. There is also a storage device 9210 that stores K database segments 9220(k), where k ranges between 1 and K. It should be noted that the entire database may be stored in the memory unit rather than in the storage device (a solution also referred to as a volatile-memory-stored database).

記憶體單元條目可能太小而無法儲存整個資料庫，且因此一次可接收一個區段。 The memory unit entries may be too small to store the entire database and may therefore receive one segment at a time.
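
Segment-by-segment screening with relevance flags may be sketched as follows. The sketch is illustrative only: the generator `filter_segments`, the in-memory list of flags, and the predicate `is_relevant` are hypothetical stand-ins for the screening units, the relevance flags 9224', and the screening criteria of a particular query.

```python
from typing import Callable, Iterable, Iterator, List, TypeVar

Entry = TypeVar("Entry")


def filter_segments(segments: Iterable[List[Entry]],
                    is_relevant: Callable[[Entry], bool]) -> Iterator[Entry]:
    """Load one database segment at a time and forward only relevant entries."""
    for segment in segments:  # only one of the K segments is resident at a time
        # One relevance flag per memory unit entry, set for the current query.
        relevance_flags = [is_relevant(entry) for entry in segment]
        for entry, flag in zip(segment, relevance_flags):
            if flag:
                yield entry  # the arbiter sends this entry to the processor


# Hypothetical usage: K = 2 segments, query "value greater than 4".
database_segments = [[1, 5], [7, 2]]
print(list(filter_segments(database_segments, lambda value: value > 4)))
```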

篩選單元可執行篩選操作，諸如比較欄位之值與臨限值，比較欄位之值與預定義值，判定欄位之值是否在預定義範圍內，及其類似者。 The filtering unit can perform filtering operations, such as comparing the value of the field with the threshold value, comparing the value of the field with a predefined value, and determining whether the value of the field is within the predefined range, and the like.

因此，篩選單元可執行已知資料庫篩選操作，且可為緊密且廉價的電路。 Therefore, the screening unit can perform screening operations of the known database, and can be a compact and inexpensive circuit.
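
The screening operations listed above map onto very small predicates, which is consistent with a compact circuit. The sketch below is an assumption-laden illustration: the source does not fix the direction of the threshold comparison or whether range bounds are inclusive, so "greater than" and inclusive bounds are choices made here, and all function names are hypothetical.

```python
from typing import Callable, Dict

Entry = Dict[str, float]
Predicate = Callable[[Entry], bool]


def above_threshold(field: str, threshold: float) -> Predicate:
    """Compare a field value with a threshold (assumed: strictly greater than)."""
    return lambda entry: entry[field] > threshold


def equals(field: str, value: float) -> Predicate:
    """Compare a field value with a predefined value."""
    return lambda entry: entry[field] == value


def in_range(field: str, low: float, high: float) -> Predicate:
    """Check whether a field value lies in a predefined range (assumed inclusive)."""
    return lambda entry: low <= entry[field] <= high


def all_of(*predicates: Predicate) -> Predicate:
    """Combine several screening criteria; every one must hold."""
    return lambda entry: all(predicate(entry) for predicate in predicates)


entry = {"price": 12.0, "quantity": 3.0}
criterion = all_of(above_threshold("price", 10.0), in_range("quantity", 1.0, 3.0))
print(criterion(entry))
```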

將篩選操作之最終結果(例如，相關資料庫條目之內容)9101發送至CPU 9240以供處理。 The final result 9101 of the screening operation (for example, the content of the relevant database entries) is sent to the CPU 9240 for processing.

記憶體及篩選系統9220可由如圖91B中所說明之記憶體及處理系統替換。 The memory and screening system 9220 can be replaced by the memory and processing system as illustrated in FIG. 91B.

記憶體及處理系統9229包括耦接至記憶體單元條目9222之處理單元9225。處理單元9225可執行篩選操作，且可至少部分地參與對相關記錄執行一或多個額外操作。 The memory and processing system 9229 includes a processing unit 9225 coupled to a memory unit entry 9222. The processing unit 9225 may perform screening operations, and may at least partially participate in performing one or more additional operations on related records.

處理單元可經定製以執行特定操作及/或可為經組態以執行多個操作之可程式化單元。舉例而言，處理單元可為管線化處理單元，可包括ALU，可包括多個ALU，及其類似者。 The processing unit may be customized to perform specific operations and/or may be a programmable unit configured to perform multiple operations. For example, the processing unit may be a pipelined processing unit, may include ALU, may include multiple ALUs, and the like.

處理單元9225可執行全部的一或多個額外操作。 The processing unit 9225 may perform all one or more additional operations.

替代地，一或多個額外操作之一部分由處理單元執行，且處理器(CPU 9240)可執行一或多個額外操作之另一部分。 Alternatively, part of the one or more additional operations is performed by the processing unit, and the processor (CPU 9240) may perform another part of the one or more additional operations.

將處理操作之最終結果(例如，對資料庫查詢之部分回應9102，或完整回應9103)發送至CPU 9240。 The final result of the processing operation (for example, a partial response 9102 to the database query, or a complete response 9103) is sent to the CPU 9240.

部分回應需要進一步處理。 Some responses require further processing.
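
The difference between a partial response and a complete response can be illustrated with an averaging query. This is a sketch under stated assumptions: the query, the `(sum, count)` form of the partial response, and the names `partial_response` and `complete_response` are hypothetical and are not taken from the disclosure.

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple

Entry = Dict[str, float]


def partial_response(entries: Iterable[Entry],
                     is_relevant: Callable[[Entry], bool],
                     field: str) -> Tuple[float, int]:
    """Computed next to the memory: (sum, count) over the relevant entries only."""
    total, count = 0.0, 0
    for entry in entries:
        if is_relevant(entry):
            total += entry[field]
            count += 1
    return total, count


def complete_response(partials: List[Tuple[float, int]]) -> Optional[float]:
    """Further processing on the CPU: combine partial responses into an average."""
    total = sum(partial[0] for partial in partials)
    count = sum(partial[1] for partial in partials)
    return total / count if count else None  # None when no entry was relevant


# Hypothetical usage: average salary of entries with age above 40.
unit_a = [{"age": 30.0, "salary": 10.0}, {"age": 50.0, "salary": 30.0}]
unit_b = [{"age": 45.0, "salary": 20.0}]
older = lambda entry: entry["age"] > 40
partials = [partial_response(unit, older, "salary") for unit in (unit_a, unit_b)]
print(partials, complete_response(partials))
```

Only the small `(sum, count)` pairs cross to the CPU in this sketch; the entries themselves stay with the memory.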

圖92A說明包括經組態以執行篩選及額外處理之記憶體/處理單元9227的記憶體/處理系統9228。 Figure 92A illustrates a memory/processing system 9228 that includes a memory/processing unit 9227 configured to perform filtering and additional processing.

記憶體/處理系統9228藉由記憶體/處理單元9227實施圖91之處理單元及記憶體單元。 The memory/processing system 9228 implements the processing unit and the memory unit of FIG. 91 through the memory/processing unit 9227.

處理器之作用可包括控制處理單元、執行一或多個額外操作之至少一部分，及其類似者。 The role of the processor may include controlling the processing unit, performing at least part of one or more additional operations, and the like.

記憶體條目與處理單元之組合可至少部分地由一或多個記憶體/處理單元實施。 The combination of memory entries and processing units can be implemented at least in part by one or more memory/processing units.

圖92B說明實例記憶體/處理單元9010。 Figure 92B illustrates an example memory/processing unit 9010.

記憶體/處理單元9010包括控制器9020、內部匯流排9021以及多對邏輯9030及記憶體組9040。控制器經組態以作為通信模組操作或可耦接至通信模組。 The memory/processing unit 9010 includes a controller 9020, an internal bus 9021, and multiple pairs of logic 9030 and a memory bank 9040. The controller is configured to operate as a communication module or can be coupled to the communication module.

可用其他方式實施控制器9020與多對邏輯9030及記憶體組9040之間的連接性。可用其他方式(不成對)配置記憶體組及邏輯。多個記憶體組可耦接至單個邏輯及/或由單個邏輯管理。 The connectivity between the controller 9020 and the multiple pairs of logic 9030 and memory bank 9040 can be implemented in other ways. Other ways (unpaired) can be used to configure memory groups and logic. Multiple memory banks can be coupled to and/or managed by a single logic.

記憶體/處理系統經由介面9211接收資料庫查詢9100。介面9211可為匯流排、埠、輸入/輸出介面及其類似者。 The memory/processing system receives database query 9100 via interface 9211. The interface 9211 can be a bus, a port, an input/output interface, and the like.

應注意，對資料庫查詢之回應可由以下各者中之至少一者(或以下各者中之一或多者的組合)產生：一或多個記憶體/處理系統、一或多個記憶體及處理系統、一或多個記憶體及篩選系統、位於此等系統外部之一或多個處理器，及其類似者。 It should be noted that the response to the database query can be generated by at least one of the following (or a combination of one or more of the following): one or more memories/processing systems, one or more memories And processing systems, one or more memory and screening systems, one or more processors located outside these systems, and the like.

應注意，對資料庫查詢之回應可由以下各者中之至少一者(或以下各者中之一或多者的組合)產生：一或多個篩選單元、一或多個記憶體/處理單元、一或多個處理單元、一或多個其他處理器(諸如，一或多個其他CPU)，及其類似者。 It should be noted that the response to the database query can be generated by at least one of the following (or a combination of one or more of the following): one or more filtering units, one or more memories/processing Unit, one or more processing units, one or more other processors (such as one or more other CPUs), and the like.

任何處理程序可包括尋找相關資料庫條目，及其處理相關資料庫條目。處理可由一或多個處理實體執行。 Any processing procedure may include searching for related database entries and processing related database entries. The processing can be performed by one or more processing entities.

處理實體可為以下各者中之至少一者：記憶體及處理系統之處理單元(例如，記憶體及處理系統9229之處理單元9225)、記憶體/處理單元之處理器子單元(或邏輯)、另一處理器(例如，圖91A、圖91B及圖74之CPU 9240)，及其類似者。 The processing entity can be at least one of the following: the processing unit of the memory and the processing system (for example, the processing unit 9225 of the memory and processing system 9229), the processor subunit (or logic) of the memory/processing unit , Another processor (for example, the CPU 9240 of FIG. 91A, FIG. 91B, and FIG. 74), and the like.

在產生對資料庫查詢之回應中所涉及的處理可由以下各者中之任一者或以下各者之組合產生： The processing involved in generating a response to a database query can be generated by any one of the following or a combination of the following:

a.記憶體及處理系統9229之處理單元9225。 a. A processing unit 9225 of a memory and processing system 9229.

b.不同記憶體及處理系統9229之處理單元9225。 b. Processing units 9225 of different memory and processing systems 9229.

c.記憶體/處理系統9228之一或多個記憶體/處理單元9227的處理器子單元(或邏輯9030)。 c. Processor subunits (or logic 9030) of one or more memory/processing units 9227 of a memory/processing system 9228.

d.不同記憶體/處理系統9228之記憶體/處理單元9227的處理器子單元(或邏輯9030)。 d. Processor subunits (or logic 9030) of memory/processing units 9227 of different memory/processing systems 9228.

e.記憶體/處理系統9228之一或多個記憶體/處理單元9227的控制器。 e. Controllers of one or more memory/processing units 9227 of a memory/processing system 9228.

f.不同記憶體/處理系統9228之一或多個記憶體/處理單元9227的控制器。 f. Controllers of one or more memory/processing units 9227 of different memory/processing systems 9228.

因此，在對資料庫查詢之回應中所涉及的處理可由以下各者之組合或子組合產生：(a)一或多個記憶體/處理單元之一或多個控制器、(b)記憶體處理系統之一或多個處理單元、(c)一或多個記憶體/處理單元之一或多個處理器子單元，及(d)一或多個其他處理器，及其類似者。 Therefore, the processing involved in responding to a database query may be performed by a combination or sub-combination of: (a) one or more controllers of one or more memory/processing units, (b) one or more processing units of memory and processing systems, (c) one or more processor subunits of one or more memory/processing units, and (d) one or more other processors, and the like.

由多於一個處理實體執行之處理可被稱作分散式處理。 Processing performed by more than one processing entity can be referred to as distributed processing.

應注意，篩選可由一或多個篩選單元及/或一或多個處理單元及/或一或多個處理器子單元中之篩選實體執行。在此意義上，執行篩選操作之處理單元及/或處理器子單元可被稱作篩選單元。 It should be noted that screening may be performed by a screening entity among one or more screening units, and/or one or more processing units, and/or one or more processor subunits. In this sense, a processing unit and/or processor subunit that performs a screening operation may be referred to as a screening unit.

處理實體可為篩選實體或可不同於篩選實體。 The processing entity may be a screening entity or may be different from the screening entity.

處理實體可執行由另一篩選實體視為相關之資料庫條目的處理操作。 The processing entity can perform processing operations on the database entries deemed to be related by another screening entity.

處理實體亦可執行篩選操作。 The processing entity can also perform filtering operations.

對資料庫查詢之回應可利用一或多個篩選實體及一或多個處理實體。 One or more screening entities and one or more processing entities can be used in response to database queries.

一或多個篩選實體及一或多個處理實體可屬於同一系統(例如，記憶體/處理系統9228、記憶體及處理系統9229、記憶體及篩選系統9220)或屬於不同系統。 One or more screening entities and one or more processing entities may belong to the same system (eg, memory/processing system 9228, memory and processing system 9229, memory and screening system 9220) or different systems.

記憶體/處理單元可包括多個處理器子單元。處理器子單元可彼此獨立地操作，可彼此部分地合作，可參與分散式處理，及其類似者。 The memory/processing unit may include multiple processor sub-units. The processor sub-units can operate independently of each other, can partially cooperate with each other, can participate in distributed processing, and the like.

圖92C說明多個記憶體及篩選系統9220、多個其他處理器(諸如，CPU 9240)及儲存裝置9210。 FIG. 92C illustrates multiple memory and screening systems 9220, multiple other processors (such as CPU 9240), and storage device 9210.

多個記憶體及篩選系統9220可基於多個資料庫查詢中之一者內的一或多個篩選準則來參與(同時或不同時)一或多個資料庫條目之篩選。 Multiple memories and screening systems 9220 can participate (at the same time or at different times) in the screening of one or more database entries based on one or more screening criteria in one of the multiple database queries.

圖92D說明多個記憶體及處理系統9229、多個其他處理器(諸如，CPU 9240)及儲存裝置9210。 FIG. 92D illustrates multiple memory and processing systems 9229, multiple other processors (such as CPU 9240), and storage device 9210.

多個記憶體及處理系統9229可參與(同時或不同時)在對多個資料庫查詢中之一者作出回應中所涉及的篩選及至少部分處理。 Multiple memories and processing systems 9229 can participate (simultaneously or at different times) in the screening and at least part of the processing involved in responding to one of the multiple database queries.

圖92F說明多個記憶體/處理系統9228、多個其他處理器(諸如，CPU 9240)及儲存裝置9210。 Figure 92F illustrates multiple memory/processing systems 9228, multiple other processors (such as CPU 9240), and storage device 9210.

多個記憶體/處理系統9228可參與(同時或不同時)在對多個資料庫查詢中之一者作出回應中所涉及的篩選及至少部分處理。 Multiple memory/processing systems 9228 may participate (simultaneously or at different times) in the screening and at least part of the processing involved in responding to one of multiple database queries.

圖92G說明用於資料庫分析加速之方法9300。 Figure 92G illustrates a method 9300 for database analysis acceleration.

方法9300可開始於藉由記憶體處理積體電路接收資料庫查詢之步驟9310，該資料庫查詢包含提示資料庫中與資料庫查詢相關之資料庫條目的至少一個相關性準則。 Method 9300 may begin with step 9310 of receiving, by a memory processing integrated circuit, a database query that includes at least one relevance criterion indicating which database entries of a database are relevant to the database query.

資料庫中與資料庫查詢相關之資料庫條目可能並非資料庫之資料庫條目，可為資料庫之資料庫條目中的一者、一些或全部。 The database entries that are relevant to the database query may be none of the database entries of the database, or may be one, some, or all of them.

記憶體處理積體電路可包括控制器、多個處理器子單元及多個記憶體單元。 The memory processing integrated circuit may include a controller, a plurality of processor sub-units, and a plurality of memory units.

步驟9310之後可接著為藉由記憶體處理積體電路且基於至少一個相關性準則而判定儲存於記憶體處理積體電路中之相關資料庫條目之群組的步驟9320。 Step 9310 can be followed by step 9320 of determining the group of related database entries stored in the memory processing integrated circuit based on at least one correlation criterion by the memory processing integrated circuit.

步驟9320之後可接著為將相關資料庫條目之群組發送至一或多個處理實體以供進一步處理而實質上不將儲存於記憶體處理積體電路中之不相關資料條目發送至該一或多個處理實體的步驟9330。 Step 9320 can be followed by sending the group of related database entries to one or more processing entities for further processing without substantially sending irrelevant data entries stored in the memory processing integrated circuit to the one or Step 9330 of multiple processing entities.

片語「而實質上不發送」意謂根本不發送(在對資料庫查詢作出回應期間)或發送數目不多的不相關條目。不多可意謂至多1、2、3、4、5、6、7、8、9、10個百分比，或發送對頻寬無顯著影響之任何量。 The phrase "without substantially sending" means either not sending irrelevant entries at all (during the response to the database query) or sending only an insignificant number of them. Insignificant may mean at most 1, 2, 3, 4, 5, 6, 7, 8, 9, or 10 percent, or any amount that has no significant effect on bandwidth.
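
The percentage bound can be expressed as a one-line check. The source does not state what the percentage is measured against; the sketch below assumes it is the share of the stored irrelevant entries that are nevertheless sent, and both the function name and the default bound of 10 percent are illustrative choices.

```python
def substantially_not_sent(irrelevant_sent: int,
                           irrelevant_stored: int,
                           max_percent: float = 10.0) -> bool:
    """Return True if at most max_percent of the irrelevant entries were sent.

    Assumption: the percentage is taken relative to the number of
    irrelevant entries stored in the memory processing integrated circuit.
    """
    if irrelevant_sent < 0 or irrelevant_sent > irrelevant_stored:
        raise ValueError("irrelevant_sent must be between 0 and irrelevant_stored")
    if irrelevant_stored == 0:
        return True  # nothing irrelevant exists, so nothing irrelevant was sent
    return 100.0 * irrelevant_sent / irrelevant_stored <= max_percent


print(substantially_not_sent(0, 1000))                    # none sent
print(substantially_not_sent(30, 1000, max_percent=5.0))  # 3 percent sent
```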

步驟9330之後可接著為處理相關資料庫條目之群組以提供對資料庫查詢之回應的步驟9340。 Step 9330 can be followed by step 9340 of processing groups of related database entries to provide responses to database queries.

圖92H說明用於資料庫分析加速之方法9301。 Figure 92H illustrates a method 9301 for database analysis acceleration.

假定對資料庫查詢作出回應所需之篩選及整個處理由記憶體處理積體電路執行。 It is assumed that the screening and the entire processing required to respond to database queries are performed by the memory processing integrated circuit.

方法9301可開始於藉由記憶體處理積體電路接收資料庫查詢之步驟9310，該資料庫查詢包含提示資料庫中與資料庫查詢相關之資料庫條目的至少一個相關性準則。 Method 9301 may begin with step 9310 of receiving, by the memory processing integrated circuit, a database query that includes at least one relevance criterion indicating which database entries of a database are relevant to the database query.

步驟9320之後可接著為將相關資料庫條目之群組發送至記憶體處理積體電路之一或多個處理實體以供完全處理而實質上不將儲存於記憶體處理積體電路中之不相關資料條目發送至該一或多個處理實體的步驟9331。 Step 9320 may be followed by step 9331 of sending the group of relevant database entries to one or more processing entities of the memory processing integrated circuit for full processing, without substantially sending irrelevant data entries stored in the memory processing integrated circuit to the one or more processing entities.

步驟9331之後可接著為完全處理相關資料庫條目之群組以提供對資料庫查詢之回應的步驟9341。 Step 9331 can be followed by step 9341 of completely processing the group of related database entries to provide responses to database queries.

步驟9341之後可接著為自記憶體處理積體電路輸出對資料庫查詢之回應的步驟9351。 Step 9341 can be followed by step 9351 of outputting the response to the database query from the memory processing integrated circuit.

圖92I說明用於資料庫分析加速之方法9302。 Figure 92I illustrates a method 9302 for database analysis acceleration.

假定對資料庫查詢作出回應所需之篩選以及處理之僅一部分由記憶體處理積體電路執行。記憶體處理積體電路將輸出部分結果，該等部分結果將由位於記憶體處理積體電路外部之一或多個其他處理實體處理。 It is assumed that only part of the screening and processing required to respond to database queries is performed by the memory processing integrated circuit. The memory processing integrated circuit will output partial results, and these partial results will be processed by one or more other processing entities located outside the memory processing integrated circuit.

方法9302可開始於藉由記憶體處理積體電路接收資料庫查詢之步驟9310，該資料庫查詢包含提示資料庫中與資料庫查詢相關之資料庫條目的至少一個相關性準則。 Method 9302 may begin with step 9310 of receiving, by the memory processing integrated circuit, a database query that includes at least one relevance criterion indicating which database entries of a database are relevant to the database query.

步驟9320之後可接著為將相關資料庫條目之群組發送至記憶體處理積體電路之一或多個處理實體以供部分處理而實質上不將儲存於記憶體處理積體電路中之不相關資料條目發送至該一或多個處理實體的步驟9332。 Step 9320 may be followed by step 9332 of sending the group of relevant database entries to one or more processing entities of the memory processing integrated circuit for partial processing, without substantially sending irrelevant data entries stored in the memory processing integrated circuit to the one or more processing entities.

步驟9332之後可接著為部分地處理相關資料庫條目之群組以提供對資料庫查詢之中間回應的步驟9342。 Step 9332 can be followed by step 9342 of partially processing groups of related database entries to provide intermediate responses to database queries.

步驟9342之後可接著為自記憶體處理積體電路輸出對資料庫查詢之中間回應的步驟9352。 Step 9342 can be followed by step 9352 of outputting an intermediate response to the database query from the memory processing integrated circuit.

步驟9352之後可接著為進一步處理中間回應以提供對資料庫之回應的步驟9390。 Step 9352 can be followed by step 9390 of further processing the intermediate response to provide a response to the database.

圖92J說明用於資料庫分析加速之方法9303。 Figure 92J illustrates a method 9303 for database analysis acceleration.

假定記憶體處理積體電路執行相關資料庫條目之篩選，但不執行相關資料庫條目之處理。記憶體處理積體電路將輸出將由位於記憶體處理積體電路外部之一或多個其他處理實體完全處理的相關資料庫條目之群組。 It is assumed that the memory processing integrated circuit performs the screening of related database entries, but does not perform the processing of related database entries. The memory processing integrated circuit will output a group of related database entries that will be completely processed by one or more other processing entities located outside the memory processing integrated circuit.

步驟9320之後可接著為將相關資料庫條目之群組發送至位於記憶體處理積體電路外部之一或多個處理實體而實質上不將儲存於記憶體處理積體電路中之不相關資料條目發送至該一或多個處理實體的步驟9333。 After step 9320, the group of related database entries can be sent to one or more processing entities located outside the memory processing integrated circuit without substantially storing the irrelevant data entries in the memory processing integrated circuit. Step 9333 of sending to the one or more processing entities.

步驟9333之後可接著為完全處理中間回應以提供對資料庫之回應的步驟9391。 Step 9333 can be followed by step 9391 of completely processing the intermediate response to provide a response to the database.

圖92K說明資料庫分析加速之方法9304。 Figure 92K illustrates a method 9304 for database analysis acceleration.

方法9304可開始於藉由積體電路接收資料庫查詢之步驟9315，該資料庫查詢包含提示資料庫中與資料庫查詢相關之資料庫條目的至少一個相關性準則；其中該積體電路包含控制器、篩選單元及多個記憶體單元。 Method 9304 may begin with step 9315 of receiving, by an integrated circuit, a database query that includes at least one relevance criterion indicating which database entries of a database are relevant to the database query, where the integrated circuit includes a controller, a screening unit, and multiple memory units.

步驟9315之後可接著為藉由篩選單元且基於至少一個相關性準則來判定儲存於積體電路中之相關資料庫條目之群組的步驟9325。 Step 9315 can be followed by step 9325 of determining the group of related database entries stored in the integrated circuit based on at least one correlation criterion by the screening unit.

步驟9325之後可接著為將相關資料庫條目之群組發送至位於積體電路外部之一或多個處理實體以供進一步處理而實質上不將儲存於積體電路中之不相關資料條目發送至一或多個處理實體的步驟9335。 Step 9325 can be followed by step 9335 of sending the group of related database entries to one or more processing entities located outside the integrated circuit for further processing, substantially without sending irrelevant data entries stored in the integrated circuit to the one or more processing entities.

步驟9335之後可接著為步驟9391。 Step 9335 can be followed by step 9391.
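The split between on-chip screening and external processing described for method 9304 can be sketched as follows. This is only an illustrative model under stated assumptions: the dictionary-based entry format, the function names `filter_in_memory` and `process_externally`, and the example data are all hypothetical and do not come from the application.

```python
from typing import Callable, Iterable

Entry = dict  # hypothetical representation of a database entry


def filter_in_memory(banks: Iterable[list],
                     is_relevant: Callable[[Entry], bool]) -> list:
    """Sketch of steps 9325 and 9335: keep related entries, drop the rest on-chip."""
    group = []
    for bank in banks:  # in hardware, the banks could be scanned in parallel
        group.extend(entry for entry in bank if is_relevant(entry))
    return group        # only this group would cross the external link


def process_externally(group: list) -> float:
    """Sketch of step 9391: an external processing entity fully processes the group."""
    return sum(entry["amount"] for entry in group)


# Hypothetical data: two memory banks holding four entries in total.
banks = [
    [{"region": "EU", "amount": 10.0}, {"region": "US", "amount": 5.0}],
    [{"region": "EU", "amount": 2.5}, {"region": "APAC", "amount": 7.0}],
]
related = filter_in_memory(banks, lambda entry: entry["region"] == "EU")
total = process_externally(related)
```

In this model only two of the four entries leave the integrated circuit, which is the bandwidth saving that the screening step is meant to provide.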

圖92L說明資料庫分析加速之方法9305。 Figure 92L illustrates a method 9305 for database analysis acceleration.

方法9305可開始於藉由積體電路接收資料庫查詢之步驟9314，該資料庫查詢包含提示資料庫中與資料庫查詢相關之資料庫條目的至少一個相關性準則；其中積體電路包含控制器、篩選單元及多個記憶體單元。 Method 9305 can start with step 9314 of receiving a database query by an integrated circuit, the database query including at least one correlation criterion that indicates which database entries in the database are related to the database query; wherein the integrated circuit includes a controller, a screening unit, and multiple memory units.

步驟9314之後可接著為藉由處理單元且基於至少一個相關性準則來判定儲存於積體電路中之相關資料庫條目之群組的步驟9324。 Step 9314 can be followed by step 9324 of determining the group of related database entries stored in the integrated circuit by the processing unit based on at least one correlation criterion.

步驟9324之後可接著為藉由處理單元處理相關資料庫條目之群組而不藉由處理單元處理儲存於積體電路中之不相關資料條目以提供處理結果的步驟9334。 Step 9324 can be followed by step 9334 of processing, by the processing unit, the group of related database entries, without processing by the processing unit the irrelevant data entries stored in the integrated circuit, to provide a processing result.

步驟9334之後可接著為自積體電路輸出處理結果之步驟9344。 Step 9334 can be followed by step 9344 of outputting the processing result from the integrated circuit.

在方法9300、9301、9302、9304及9305中之任一者中，記憶體處理積體電路輸出一輸出。該輸出可為相關資料庫條目、一或多個中間結果或一或多個(完整)結果之群組。 In any of methods 9300, 9301, 9302, 9304, and 9305, the memory processing integrated circuit outputs an output. The output can be a group of related database entries, one or more intermediate results, or one or more (complete) results.

該輸出之前可為自記憶體處理積體電路之篩選實體及/或處理實體擷取一或多個相關資料庫條目及/或一或多個結果(完整或中間)。 Before the output, one or more related database entries and/or one or more results (complete or intermediate) can be retrieved from the screening entity and/or the processing entity of the memory processing integrated circuit.

該擷取可用一或多種方式控制且可由記憶體處理積體電路之仲裁器及/或一或多個控制器控制。 The retrieval can be controlled in one or more ways, and can be controlled by an arbiter and/or one or more controllers of the memory processing integrated circuit.

輸出及/或擷取可包括控制擷取及/或輸出之一或多個參數。該等參數可包括擷取時序、擷取速率、擷取源、頻寬、次序或擷取、輸出時序、輸出速率、輸出源、頻寬、次序或輸出、擷取方法之類型、仲裁方法之類型及其類似者。 The outputting and/or the retrieval may include controlling one or more parameters of the retrieval and/or the outputting. The parameters may include retrieval timing, retrieval rate, retrieval source, bandwidth, order of retrieval, output timing, output rate, output source, bandwidth, order of output, type of retrieval method, type of arbitration method, and the like.

輸出及/或擷取可執行流控制處理程序。 The outputting and/or the retrieval may execute a flow control process.

輸出及/或擷取(例如，應用流控制處理程序)可對自一或多個處理實體輸出的關於群組之資料庫條目的處理之完成的指示符作出回應。指示符可提示中間結果是否已準備好被自處理實體擷取。 The outputting and/or the retrieval (for example, applying the flow control process) may be responsive to indicators, output from the one or more processing entities, regarding the completion of the processing of the database entries of the group. An indicator may indicate whether an intermediate result is ready to be retrieved from a processing entity.

輸出可包括嘗試匹配在輸出期間使用之頻寬與鏈路上之最大可允許頻寬，該鏈路將記憶體處理積體電路耦接至請求者單元。該鏈路可為至記憶體處理積體電路之輸出之接收者的鏈路。最大可允許頻寬可藉由鏈路之容量及/或可用性、所輸出內容之接收者的容量及/或可用性及其類似者規定。 The output may include an attempt to match the bandwidth used during the output with the maximum allowable bandwidth on the link that couples the memory processing integrated circuit to the requester unit. The link may be a link to the receiver of the output of the memory processing integrated circuit. The maximum allowable bandwidth may be specified by the capacity and/or availability of the link, the capacity and/or availability of the recipient of the output content, and the like.

輸出可包括嘗試以最佳或次最佳方式輸出所輸出內容。 The outputting may include attempting to output the output content in an optimal or sub-optimal manner.

所輸出內容之輸出可包括嘗試維持輸出訊務速率之波動低於臨限值。 The output of the output content may include an attempt to maintain the fluctuation of the output traffic rate below a threshold value.
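The two output goals above (using the link close to its maximum allowable bandwidth, and keeping the output traffic rate steady) can be illustrated with a minimal pacing sketch. The tick-based model, the function name `shape_output`, and the sizes used are assumptions made for illustration only; the application does not prescribe a particular traffic shaping algorithm.

```python
from collections import deque


def shape_output(result_sizes, max_per_tick):
    """Emit queued results tick by tick, never exceeding max_per_tick units per tick.

    A result that does not fit in the remaining budget of a tick is split
    and continues in the following tick(s).
    """
    queue = deque(result_sizes)
    schedule = []  # number of units sent in each tick
    while queue:
        budget = max_per_tick
        sent = 0
        while queue and budget > 0:
            take = min(queue[0], budget)
            queue[0] -= take
            budget -= take
            sent += take
            if queue[0] == 0:
                queue.popleft()
        schedule.append(sent)
    return schedule


# Hypothetical burst of four results of uneven size, paced over a link
# that is assumed to accept at most 4 units per tick.
schedule = shape_output([5, 1, 7, 3], max_per_tick=4)
fluctuation = max(schedule) - min(schedule)
```

With these example sizes every tick carries exactly the link budget, so the fluctuation of the output rate is zero even though the individual results differ in size.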

方法9300、9301、9302及9305之任何方法可包括藉由一或多個處理實體產生處理狀態指示符，處理狀態指示符可提示相關資料庫條目之群組的進一步處理之進展。 Any of methods 9300, 9301, 9302, and 9305 may include generating, by one or more processing entities, processing status indicators that indicate the progress of the further processing of the group of related database entries.

當包括於上文所提及之方法中之任一者中的處理由多於單個處理實體執行時，則處理可被視為分散式處理，此係因為處理以分散式方式執行。 When the processing included in any of the methods mentioned above is performed by more than a single processing entity, then the processing can be regarded as distributed processing because the processing is performed in a distributed manner.

如上文所提示，可以階層式方式或以平面方式執行處理。 As indicated above, the processing can be performed in a hierarchical manner or in a flat manner.

方法9300至9305中之任一者可由可同時或依序地對一個或多個資料庫查詢作出回應之多個系統執行。 Any one of methods 9300 to 9305 can be executed by multiple systems that can respond to one or more database queries simultaneously or sequentially.

字嵌入 Word embedding

如上文所提及，字嵌入(word embedding)為自然語言處理(NLP)中之語言模型化及特徵學習技術之集合的總稱，其中來自詞彙表之字或片語映射至元素之向量。概念上，字嵌入涉及自每個字具有許多維度之空間至具有低得多之維度之連續向量空間的數學嵌入。 As mentioned above, word embedding is a general term for a collection of language modeling and feature learning techniques in natural language processing (NLP), in which words or phrases from a vocabulary are mapped to vectors of elements. Conceptually, word embedding involves mathematical embedding from a space where each word has many dimensions to a continuous vector space with much lower dimensions.

可對該等向量進行數學處理。舉例而言，可對屬於矩陣之向量進行加總以提供加總向量。 These vectors can be mathematically processed. For example, the vectors belonging to the matrix can be summed to provide a summed vector.

又對於另一實例，可計算(語句之)矩陣的協方差。此可包括將矩陣乘以其轉置矩陣。 For another example, the covariance of the matrix (of sentences) can be calculated. This can include multiplying the matrix by its transposed matrix.
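The two operations mentioned above can be sketched in plain Python. The sketch assumes that the sentence is a matrix whose rows are word vectors. It shows only the multiplication of the matrix by its transpose and omits the mean-centering and normalization that a full covariance estimate would normally include. The function names and the example values are illustrative assumptions.

```python
def sum_vectors(matrix):
    """Element-wise sum of all row vectors of the matrix."""
    return [sum(column) for column in zip(*matrix)]


def transpose(matrix):
    """Swap rows and columns."""
    return [list(column) for column in zip(*matrix)]


def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, column)) for column in zip(*b)]
            for row in a]


# Hypothetical sentence of three words, each embedded as a 2-element vector.
sentence = [[1, 2], [3, 4], [5, 6]]
summed = sum_vectors(sentence)
product = matmul(transpose(sentence), sentence)  # transposed matrix times matrix
```

Here `summed` collapses the three word vectors into one vector, and `product` is a 2-by-2 matrix whose size depends only on the vector dimension, not on the sentence length.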

記憶體/處理單元可儲存詞彙表。特定而言，詞彙表之部分可儲存於記憶體/處理單元之多個記憶體組中。 The memory/processing unit can store a vocabulary. In particular, portions of the vocabulary can be stored in multiple memory banks of the memory/processing unit.

因此，可使用將表示語句之片語的字之集合的存取資訊(諸如，擷取金鑰)來存取記憶體/處理單元，使得將自記憶體/處理單元之記憶體組中的至少一些擷取表示語句之片語之字的向量。 Therefore, the memory/processing unit can be accessed using access information (such as retrieval keys) that represents a set of words or phrases of a sentence, so that the vectors representing the words or phrases of the sentence are retrieved from at least some of the memory banks of the memory/processing unit.

記憶體/處理單元之不同記憶體組可儲存詞彙表之不同部分，且可被並列地存取(取決於語句之索引的分佈)。即使在需要依序存取記憶體組之多於單排時，預測亦可減少懲罰。 Different memory banks of the memory/processing unit can store different parts of the vocabulary and can be accessed in parallel (depending on the distribution of the indexes of the sentence). Even when more than a single row of a memory bank must be accessed sequentially, prediction can reduce the penalty.

可最佳化在記憶體/處理單元之不同記憶體組之間分配詞彙表之字，或該分配在可增加每語句對記憶體/處理單元之不同記憶體組之並列存取的機會之意義上為高度有益的。該分配可按每個使用者學習，可按每個一般群體學習或可按每個人群來學習。 The allocation of the words of the vocabulary among the different memory banks of the memory/processing unit can be optimized, or can be highly beneficial in the sense that it increases the chance of parallel accesses, per sentence, to different memory banks of the memory/processing unit. The allocation can be learned per user, per general population, or per group of people.
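The benefit of a good allocation can be illustrated with a small model. The sketch assumes that each memory bank can deliver one vector per access round, so a sentence needs as many rounds as the largest number of its words that share a bank. The mappings, the word list, and the function name `access_rounds` are hypothetical.

```python
from collections import Counter


def access_rounds(sentence_keys, bank_of):
    """Rounds needed when each bank serves at most one vector per round."""
    per_bank = Counter(bank_of[key] for key in sentence_keys)
    return max(per_bank.values(), default=0)


# Two hypothetical allocations of the same four vocabulary words to banks.
clustered = {"the": 0, "cat": 0, "sat": 0, "mat": 1}
spread = {"the": 0, "cat": 1, "sat": 2, "mat": 3}

sentence = ["the", "cat", "sat"]
rounds_clustered = access_rounds(sentence, clustered)
rounds_spread = access_rounds(sentence, spread)
```

Under the clustered allocation all three words fall in the same bank and must be fetched one after another, while the spread allocation lets the three banks be accessed in parallel in a single round. A learned allocation would aim for the second situation for the sentences that are expected in practice.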

此外，記憶體/處理單元亦可用以執行處理操作中之至少一些(藉由其邏輯)，且藉此可減少自記憶體/處理單元外部之匯流排所需的頻寬，可用高效方式(甚至並列)計算多個運算(並列地使用記憶體/處理單元之多個處理器)。 In addition, the memory/processing unit can also be used to perform at least some of the processing operations (using its logic). This can reduce the bandwidth required from the buses outside the memory/processing unit, and multiple operations can be computed efficiently (even in parallel) by using multiple processors of the memory/processing unit in parallel.

記憶體組可與邏輯相關聯。 The memory bank can be associated with logic.

處理操作之至少一部分可由一或多個額外處理器(諸如，向量處理器，包括但不限於向量加法器)執行。 At least part of the processing operations may be performed by one or more additional processors (such as vector processors, including but not limited to vector adders).

記憶體/處理單元可包括可分配給記憶體組(邏輯對)中之一些或全部的一或多個額外處理器。 The memory/processing unit may include one or more additional processors that can be allocated to some or all of the memory bank/logic pairs.

因此，可將單個額外處理器分配給記憶體組(邏輯對)中之全部或一些。又對於另一實例，該等額外處理器可用階層式方式配置，使得某一層級之額外處理器處理來自較低層級之額外處理器的輸出。 Therefore, a single additional processor can be allocated to all or some of the memory bank/logic pairs. As yet another example, the additional processors can be arranged in a hierarchical manner, so that an additional processor of a certain level processes the outputs of additional processors of a lower level.

應注意，處理操作可在不使用任何額外處理器之情況下執行，但可由記憶體/處理單元之邏輯執行。 It should be noted that the processing operations can be performed without using any additional processor, and can instead be performed by the logic of the memory/processing unit.

圖89A、圖89B、圖89C、圖89D、圖89E、圖89F及圖89G分別說明記憶體/處理單元9010、9011、9012、9013、9014、9015及9019之實例。記憶體/處理單元9010包括控制器9020、內部匯流排9021以及多對邏輯9030及記憶體組9040。 FIGS. 89A, 89B, 89C, 89D, 89E, 89F, and 89G illustrate examples of memory/processing units 9010, 9011, 9012, 9013, 9014, 9015, and 9019, respectively. The memory/processing unit 9010 includes a controller 9020, an internal bus 9021, and multiple pairs of logic 9030 and a memory bank 9040.

應注意，邏輯9030及記憶體組9040可用其他方式耦接至控制器及/或彼此耦接，例如，多個匯流排可設置於控制器與邏輯之間，邏輯可配置於多個層中，單個邏輯可由多個記憶體組共用(參見例如圖89E)，及其類似者。 It should be noted that the logic 9030 and the memory banks 9040 can be coupled to the controller and/or to each other in other ways. For example, multiple buses can be provided between the controller and the logic, the logic can be arranged in multiple layers, a single logic can be shared by multiple memory banks (see, for example, FIG. 89E), and the like.

可用任何方式定義記憶體/處理單元9010內之每一記憶體組的頁面之長度，例如，其可足夠小，且記憶體組之數目可足夠大以使得能夠並列地輸出大量向量而不會在不相關資訊上浪費許多位元。 The page length of each memory bank in the memory/processing unit 9010 can be defined in any manner. For example, it can be small enough, and the number of memory banks can be large enough, to enable a large number of vectors to be output in parallel without wasting many bits on irrelevant information.

邏輯9030可包括完整ALU、部分ALU、記憶體控制器、部分記憶體控制器及其類似者。部分ALU(記憶體控制器)單元能夠僅執行可由完整ALU(記憶體控制器)執行之功能的一部分。在本申請案中說明之任何邏輯或子處理器可包括完整ALU、部分ALU、記憶體控制器、部分記憶體控制器及其類似者。 The logic 9030 may include a full ALU, a partial ALU, a memory controller, a partial memory controller, and the like. A partial ALU (memory controller) unit can perform only a part of the functions that can be performed by a full ALU (memory controller). Any logic or sub-processor described in this application may include a full ALU, a partial ALU, a memory controller, a partial memory controller, and the like.

記憶體/處理單元9010可能不具有額外向量，且向量(來自記憶體組)之處理由邏輯9030進行。 The memory/processing unit 9010 may not have an additional vector, and the vector (from the memory bank) is processed by logic 9030.

圖89B說明額外處理器，諸如耦接至內部匯流排9021之向量處理器9050。 FIG. 89B illustrates an additional processor, such as a vector processor 9050 coupled to the internal bus 9021.

圖89C說明額外處理器，諸如耦接至內部匯流排9021之向量處理器9050。一或多個額外處理器執行(單獨或與邏輯相配合)處理操作。 Figure 89C illustrates an additional processor, such as a vector processor 9050 coupled to the internal bus 9021. One or more additional processors perform (alone or in conjunction with logic) processing operations.

圖89D亦說明經由匯流排9022耦接至記憶體/處理單元9010之主機9018。 FIG. 89D also illustrates the host 9018 coupled to the memory/processing unit 9010 via the bus 9022.

圖89D亦說明將字/片語9072映射至向量9073之詞彙表9070。使用擷取金鑰9071存取記憶體/處理單元，每一擷取金鑰表示先前辨識之字或片語。主機9018將表示語句之多個擷取金鑰9071發送至記憶體/處理單元，且記憶體/處理單元可輸出向量9073或由與語句相關之向量所應用的處理操作之最終結果。字/片語通常不儲存於記憶體/處理單元9010中。 FIG. 89D also illustrates a vocabulary 9070 that maps words/phrases 9072 to vectors 9073. The memory/processing unit is accessed using retrieval keys 9071, each of which represents a previously recognized word or phrase. The host 9018 sends multiple retrieval keys 9071 representing a sentence to the memory/processing unit, and the memory/processing unit can output the vectors 9073 or the final result of the processing operations applied to the vectors related to the sentence. The words/phrases themselves are usually not stored in the memory/processing unit 9010.

用於控制記憶體組之記憶體控制器功能性可包括(單獨或部分地)於邏輯中，可包括(單獨或部分地)於控制器9020中及/或可包括(單獨或部分地)於記憶體/處理單元9010內之一或多個記憶體控制器(未圖示)中。 The memory controller functionality for controlling the memory banks may be included (solely or partially) in the logic, may be included (solely or partially) in the controller 9020, and/or may be included (solely or partially) in one or more memory controllers (not shown) within the memory/processing unit 9010.

記憶體/處理單元可經組態以最大化發送至主機9018之向量/結果的輸送量，或可應用用於控制內部記憶體/處理單元訊務及/或控制記憶體/處理單元與主機電腦(或記憶體/處理單元外部之任何其他實體)之間的訊務的任何處理程序。 The memory/processing unit can be configured to maximize the throughput of the vectors/results sent to the host 9018, or can apply any process for controlling the traffic inside the memory/processing unit and/or the traffic between the memory/processing unit and the host computer (or any other entity outside the memory/processing unit).

不同邏輯9030耦接至記憶體/處理單元之記憶體組9040，且可對向量執行(較佳並列地)數學運算以產生經處理向量。一個邏輯9030可將向量發送至另一邏輯(參見例如圖89G之線38)，且另一邏輯可對所接收向量及其計算之向量應用數學運算。邏輯可按層級配置，且某一層級之邏輯可處理來自前一層級邏輯之向量或中間結果(由應用數學運算產生)。該等邏輯可形成樹(二元、三元及其類似者)。 The different logic 9030 is coupled to the memory bank 9040 of the memory/processing unit, and can perform (preferably in parallel) mathematical operations on the vector to generate the processed vector. One logic 9030 can send a vector to another logic (see, for example, line 38 in FIG. 89G), and the other logic can apply mathematical operations to the received vector and the vector it calculates. Logic can be configured in levels, and a certain level of logic can handle vectors or intermediate results (generated by applying mathematical operations) from the previous level of logic. Such logic can form trees (binary, ternary, and the like).

當經處理向量之總大小超過結果之總大小時，則獲得輸出頻寬(在記憶體/處理單元外部)之減少。舉例而言，當K個向量由記憶體/處理單元加總以提供單個輸出向量時，則獲得頻寬的K：1減少。 When the total size of the processed vectors exceeds the total size of the result, the output bandwidth (outside the memory/processing unit) is reduced. For example, when K vectors are summed by the memory/processing unit to provide a single output vector, a K:1 reduction in bandwidth is obtained.
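The K:1 reduction, combined with the tree arrangement of logic mentioned earlier, can be sketched as a pairwise reduction. The sketch assumes a binary tree, equal-length vectors, and at least one input vector. The function name `tree_sum` and the example values are illustrative assumptions.

```python
def tree_sum(vectors):
    """Pairwise (binary tree) reduction; each level roughly halves the vector count.

    Assumes a non-empty list of equal-length vectors.
    """
    level = [list(vector) for vector in vectors]
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            next_level.append([a + b for a, b in zip(level[i], level[i + 1])])
        if len(level) % 2:  # an unpaired vector passes through to the next level
            next_level.append(level[-1])
        level = next_level
    return level[0]


# Hypothetical case with K = 5 vectors of three words each.
vectors = [[1, 0, 2], [0, 1, 1], [3, 3, 3], [1, 1, 1], [2, 2, 0]]
result = tree_sum(vectors)
words_read_internally = sum(len(vector) for vector in vectors)
words_output = len(result)
reduction = words_read_internally // words_output
```

Fifteen words are read inside the unit but only three leave it, which corresponds to the 5:1 reduction in output bandwidth for K = 5.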

控制器9020可經組態以藉由廣播待存取之不同向量的位址來並列地開放多個記憶體組。 The controller 9020 can be configured to open multiple memory banks in parallel by broadcasting the addresses of different vectors to be accessed.

控制器可經組態以至少部分地基於語句中之字或片語的次序來控制自多個記憶體組(或自儲存不同向量之任何中間緩衝或儲存電路，參見圖89D之緩衝器9033)擷取不同向量之次序。 The controller can be configured to control the order in which the different vectors are retrieved from the multiple memory banks (or from any intermediate buffer or storage circuit that stores the different vectors; see buffers 9033 of FIG. 89D), based at least in part on the order of the words or phrases in the sentence.

控制器9020可經組態以基於與在記憶體/處理單元9010外部輸出向量相關之一或多個參數來管理不同向量之擷取，例如，自記憶體組擷取不同向量之速率可設定為實質上等於自記憶體/處理單元9010輸出不同向量之可允許速率。 The controller 9020 can be configured to manage the retrieval of the different vectors based on one or more parameters related to outputting the vectors outside the memory/processing unit 9010. For example, the rate at which the different vectors are retrieved from the memory banks can be set to be substantially equal to the allowable rate at which the different vectors are output from the memory/processing unit 9010.

控制器可藉由應用任何訊務塑形處理程序來在記憶體/處理單元9010外部輸出不同向量。舉例而言，控制器9020可旨在以儘可能接近主機電腦或將記憶體/處理單元9010耦接至主機電腦之鏈路可允許之最大速率的速率輸出不同向量。又對於另一實例，控制器可輸出不同向量，同時最少化或至少實質上減少訊務速率隨時間的波動。 The controller can output the different vectors outside the memory/processing unit 9010 by applying any traffic shaping process. For example, the controller 9020 may aim to output the different vectors at a rate that is as close as possible to the maximum rate allowed by the host computer or by the link that couples the memory/processing unit 9010 to the host computer. As yet another example, the controller can output the different vectors while minimizing, or at least substantially reducing, fluctuations in the traffic rate over time.

控制器9020屬於與記憶體組9040及邏輯9030相同之積體電路，且因此可易於自不同邏輯/記憶體組接收關於不同向量之擷取狀態(例如，向量是否準備好，向量是否準備好但自同一記憶體組正擷取或將要擷取另一向量)的反饋，及其類似者。可用任何方式提供反饋：經由專用控制線，經由共用控制線。使用一或多個狀態位元及其類似者(參見圖89F之狀態線9039)。 The controller 9020 belongs to the same integrated circuit as the memory banks 9040 and the logic 9030, and can therefore easily receive feedback from the different logic/memory banks regarding the retrieval status of the different vectors (for example, whether a vector is ready, or whether a vector is ready but another vector is being retrieved or is about to be retrieved from the same memory bank), and the like. The feedback can be provided in any manner: via dedicated control lines, via shared control lines, using one or more status bits, and the like (see status lines 9039 of FIG. 89F).

控制器9020可獨立地控制不同向量之擷取及輸出，且因此可減少主機電腦之參與。替代地，主機電腦可能不知曉控制器之管理能力，且可能繼續發送詳細指令，且在此狀況下，記憶體/處理單元9010可忽略詳細指令，可隱藏控制器之管理能力，及其類似者。可基於可由主機電腦管理之協定使用所提及之以上解決方案。 The controller 9020 can independently control the retrieval and output of the different vectors, and can therefore reduce the involvement of the host computer. Alternatively, the host computer may be unaware of the management capabilities of the controller and may continue to send detailed instructions, in which case the memory/processing unit 9010 can ignore the detailed instructions, can hide the management capabilities of the controller, and the like. The above-mentioned solutions can be used based on a protocol that can be managed by the host computer.

已發現，在記憶體/處理單元中執行處理操作為極有益的(就能量而言)，即使當此等操作相比在主機中之處理操作消耗更多功率時且即使當此等操作相比在主機與記憶體/處理單元之間的傳送操作消耗更多功率時亦如此。舉例而言，假定向量足夠大，傳送資料單元之能量消耗為4pJ，資料單元之處理操作(藉由主機)的能量消耗為0.1pJ，則當藉由記憶體/處理單元處理資料單元之能量消耗低於5pJ時，藉由記憶體/處理單元處理資料單元更有效。 It has been found that performing processing operations in the memory/processing unit is highly beneficial (in terms of energy), even when these operations consume more power than the processing operations in the host, and even when these operations consume more power than the transfer operations between the host and the memory/processing unit. For example, assuming that the vectors are large enough, that the energy consumption of transferring a data unit is 4 pJ, and that the energy consumption of a processing operation on the data unit (by the host) is 0.1 pJ, then processing the data unit by the memory/processing unit is more efficient when the energy consumption of processing the data unit by the memory/processing unit is lower than 5 pJ.

(表示語句之矩陣的)每一向量可由字(或其他多位元區段)之序列表示。為解釋簡單起見，假定多個位元區段為字。 Each vector (of the matrix that represents the sentence) can be represented by a sequence of words (or other multi-bit segments). For simplicity of explanation, it is assumed that the multi-bit segments are words.

當向量包括零值字時，可獲得額外功率節省。替代輸出整個零值字，可輸出短於字(例如，位元)之零值旗標(甚至經由專用控制線輸送)而非整個字。可將旗標分配給其他值(例如，值1之字)。 Additional power savings can be obtained when a vector includes zero-valued words. Instead of outputting an entire zero-valued word, a zero-value flag that is shorter than a word (for example, a single bit) can be output (or even conveyed via a dedicated control line). Flags can also be allocated to other values (for example, a word of value 1).
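The zero-value flag can be illustrated with a simple encoding sketch. It assumes 16-bit words and a one-bit flag that accompanies every word, with the payload sent only for non-zero words. These choices, and the function names, are assumptions for illustration; the text leaves the flag format open and also allows the flag to travel on a dedicated control line.

```python
def encode(words):
    """Return (flag, payload) pairs; flag 1 marks a zero word and carries no payload."""
    return [(1, None) if word == 0 else (0, word) for word in words]


def decode(stream):
    """Rebuild the original words from (flag, payload) pairs."""
    return [0 if flag else payload for flag, payload in stream]


def bits_sent(stream, word_bits=16):
    """One flag bit per word, plus a full word only when the flag is 0."""
    return sum(1 + (0 if flag else word_bits) for flag, _ in stream)


# Hypothetical sparse vector with three zero-valued words out of five.
vector = [0, 7, 0, 0, 3]
stream = encode(vector)
flagged_bits = bits_sent(stream)
plain_bits = len(vector) * 16
```

For this example the flagged stream needs 37 bits instead of 80, and `decode(stream)` restores the original vector. The saving grows with the fraction of zero-valued words.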

圖88A說明用於嵌入之方法9400，或確切而言，可為用於擷取特徵向量相關資訊之方法。特徵向量相關資訊可包括特徵向量及/或處理特徵向量之結果。 FIG. 88A illustrates a method 9400 for embedding, or, to be precise, a method for extracting information related to feature vectors. The feature vector related information may include the feature vector and/or the result of processing the feature vector.

方法9400可開始於藉由記憶體處理積體電路接收擷取資訊以用於擷取多個所請求特徵向量的步驟9410，該等特徵向量可映射至多個語句區段。 Method 9400 can start with step 9410 of receiving, by a memory processing integrated circuit, retrieval information for retrieving multiple requested feature vectors, which can be mapped to multiple sentence segments.

記憶體處理單元可包括控制器、多個處理器子單元及多個記憶體單元。記憶體單元中之每一者可耦接至處理器子單元。 The memory processing unit may include a controller, multiple processor sub-units, and multiple memory units. Each of the memory units can be coupled to a processor sub-unit.

步驟9410之後可接著為自多個記憶體單元中之至少一些擷取多個所請求特徵向量的步驟9420。 Step 9410 may be followed by step 9420 of retrieving multiple requested feature vectors from at least some of the multiple memory cells.

該擷取可包括向兩個或多於兩個記憶體單元同時請求儲存於該兩個或多於兩個記憶體單元中之所請求特徵向量。 The retrieval may include simultaneously requesting, from two or more memory units, the requested feature vectors stored in the two or more memory units.

該請求係基於語句區段與映射至語句區段之特徵向量的位置之間的已知映射而執行。 The request is executed based on the known mapping between the sentence section and the location of the feature vector mapped to the sentence section.

該映射可在記憶體處理積體電路之開機處理程序期間上傳。 The mapping can be uploaded during the boot process of the memory processing integrated circuit.

一次擷取儘可能多的所請求特徵向量可為有益的，但此取決於所請求特徵向量儲存之處及不同記憶體單元之數目。 It may be beneficial to retrieve as many requested feature vectors as possible at one time, but this depends on where the requested feature vectors are stored and the number of different memory cells.

若多於一個所請求特徵向量儲存於同一記憶體組中，則可應用預測性擷取以用於減少與自記憶體組擷取資訊相關聯之懲罰。在本申請案之各種章節中說明用於減少懲罰之各種方法。 If more than one requested feature vector is stored in the same memory bank, predictive retrieval can be applied to reduce the penalty associated with retrieving information from the memory bank. Various methods for reducing the penalty are described in various sections of this application.

擷取可包括應用儲存於單個記憶體單元中之所請求特徵向量之集合中的至少一些所請求特徵向量之預測性擷取。 The retrieval may include applying predictive retrieval to at least some of the requested feature vectors of a set of requested feature vectors stored in a single memory unit.

所請求特徵向量可用最佳方式分佈於記憶體單元之間。 The requested feature vectors can be optimally distributed among memory cells.

所請求特徵向量可基於預期擷取圖案而分佈於記憶體單元之間。 The requested feature vectors can be distributed among the memory units based on expected retrieval patterns.

多個所請求特徵向量之擷取可根據某一次序執行。舉例而言，根據語句區段在一或多個語句中之次序。 The retrieval of the multiple requested feature vectors can be performed according to a certain order, for example, according to the order of the sentence segments in one or more sentences.

多個所請求特徵向量之擷取可至少部分無序地執行；且其中擷取進一步可包括將多個所請求特徵向量重新排序。 The extraction of the plurality of requested feature vectors may be performed at least partially out of order; and the extraction may further include reordering the plurality of requested feature vectors.
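Out-of-order retrieval followed by reordering can be sketched as follows. The sketch assumes that each arriving vector is tagged with the position of its sentence segment; that tagging scheme and the function name `reorder` are hypothetical, since the text does not specify how the order is tracked.

```python
def reorder(arrivals, num_requested):
    """Place vectors that arrived in any order back into sentence order.

    arrivals: iterable of (position_in_sentence, vector) pairs in arrival order.
    """
    slots = [None] * num_requested
    for position, vector in arrivals:
        slots[position] = vector
    return slots


# Hypothetical arrival order: the bank holding the third segment answered first.
arrivals = [(2, [0.3, 0.3]), (0, [0.1, 0.1]), (1, [0.2, 0.2])]
ordered = reorder(arrivals, num_requested=3)
```

Any slot that is still `None` after all arrivals have been consumed would indicate a requested feature vector that has not been retrieved yet.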

多個所請求特徵之擷取可包括在多個所請求特徵向量可由控制器讀取之前緩衝該多個所請求特徵向量。 The retrieval of the multiple requested feature vectors may include buffering the multiple requested feature vectors before they can be read by the controller.

多個所請求特徵之擷取可包括產生緩衝狀態指示符，該等緩衝狀態指示符提示與多個記憶體單元相關聯之一或多個緩衝器何時儲存一或多個所請求特徵向量。 The retrieval of the multiple requested feature vectors may include generating buffer status indicators that indicate when one or more buffers associated with the multiple memory units store one or more of the requested feature vectors.

該方法可包括經由專用控制線輸送緩衝狀態指示符。 The method may include transmitting the buffer status indicator via a dedicated control line.

可每記憶體單元分配一個專用控制線。 Each memory unit can be assigned a dedicated control line.

緩衝狀態指示符可為儲存於一或多個緩衝器中之狀態位元。 The buffer status indicator can be a status bit stored in one or more buffers.

該方法可包括經由一或多個共用控制線輸送緩衝狀態指示符。 The method may include transmitting the buffer status indicator via one or more common control lines.

步驟9420之後可接著為處理多個所請求特徵向量以提供處理結果之步驟9430。 Step 9420 can be followed by step 9430 of processing multiple requested feature vectors to provide processing results.

另外或替代地，步驟9420之後可接著為自記憶體處理積體電路輸出可包括以下各者中之至少一者之輸出的步驟9440：(a)所請求特徵向量；及(b)處理所請求特徵向量之結果。(a)所請求特徵向量及(b)處理所請求特徵向量之結果中的至少一者亦被稱作特徵向量相關資訊。 Additionally or alternatively, step 9420 can be followed by step 9440 of outputting, from the memory processing integrated circuit, an output that can include at least one of the following: (a) the requested feature vectors; and (b) a result of processing the requested feature vectors. At least one of (a) the requested feature vectors and (b) the result of processing the requested feature vectors is also referred to as feature vector related information.

當執行步驟9430時，則步驟9440可包括輸出(至少)處理所請求特徵向量之結果。 When step 9430 is performed, step 9440 may include outputting (at least) the result of processing the requested feature vector.

當跳過步驟9430時，則步驟9440包括輸出所請求特徵向量且可能不包括輸出處理所請求特徵向量之結果。 When step 9430 is skipped, step 9440 includes outputting the requested feature vector and may not include outputting the result of processing the requested feature vector.
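The flow of method 9400 (receive retrieval information, retrieve, optionally process, output) can be summarized in a short sketch. The data layout (a list of banks, each a list of vectors), the `location_of` mapping from retrieval keys to bank and row, and the function name are illustrative assumptions and are not taken from the application.

```python
def retrieve_and_output(retrieval_keys, location_of, banks, process=None):
    """Sketch of steps 9410 to 9440 for one request."""
    # Step 9420: use the known mapping to fetch each requested feature vector.
    vectors = []
    for key in retrieval_keys:
        bank_index, row = location_of[key]
        vectors.append(banks[bank_index][row])
    # Step 9430 is optional; when it is skipped, the raw vectors are output.
    if process is None:
        return {"vectors": vectors}
    # Step 9440 after step 9430: output (at least) the processing result.
    return {"result": process(vectors)}


# Hypothetical layout: two banks and three retrieval keys.
banks = [[[1, 2], [3, 4]], [[5, 6]]]
location_of = {10: (0, 0), 11: (1, 0), 12: (0, 1)}

raw = retrieve_and_output([10, 11, 12], location_of, banks)
summed = retrieve_and_output(
    [10, 11, 12], location_of, banks,
    process=lambda vs: [sum(column) for column in zip(*vs)],
)
```

The first call corresponds to skipping step 9430 and outputting the feature vectors themselves. The second call outputs a single summed vector in place of the three retrieved vectors.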

圖88B說明用於嵌入之方法9401。 Figure 88B illustrates the method 9401 for embedding.

假定輸出包括所請求特徵向量，但不包括處理所請求特徵向量之結果。 It is assumed that the output includes the requested feature vector, but does not include the result of processing the requested feature vector.

方法9401可開始於藉由記憶體處理積體電路接收擷取資訊以用於擷取多個所請求特徵向量之步驟9410，該等特徵向量可映射至多個語句區段。 The method 9401 can start with the step 9410 of receiving the retrieval information by the memory processing integrated circuit for retrieving a plurality of requested feature vectors, and the feature vectors can be mapped to a plurality of sentence segments.

步驟9410之後可接著為自多個記憶體單元中之至少一些擷取多個所請求特徵向量的步驟9420。 Step 9410 can be followed by step 9420 of retrieving the multiple requested feature vectors from at least some of the multiple memory units.

步驟9420之後可接著為自記憶體處理積體電路輸出包括所請求特徵向量但不包括處理所請求特徵向量之結果之輸出的步驟9431。 Step 9420 can be followed by step 9431 of outputting an output including the requested feature vector but not the result of processing the requested feature vector from the memory processing integrated circuit.

圖88C說明用於嵌入之方法9402。 Figure 88C illustrates the method 9402 for embedding.

假定輸出包括處理所請求特徵向量之結果。 It is assumed that the output includes the result of processing the requested feature vector.

方法9402可開始於藉由記憶體處理積體電路接收擷取資訊以用於擷取多個所請求特徵向量之步驟9410，該等特徵向量可映射至多個語句區段。 The method 9402 can start with the step 9410 of receiving the retrieved information by the memory processing integrated circuit for retrieving a plurality of requested feature vectors, and the feature vectors can be mapped to a plurality of sentence segments.

步驟9430之後可接著為自記憶體處理積體電路輸出可包括處理所請求特徵向量之結果之輸出的步驟9442。 Step 9430 can be followed by step 9442 of outputting, from the memory processing integrated circuit, an output that can include the result of processing the requested feature vectors.

該輸出之輸出可包括對輸出應用訊務塑形。 Outputting the output may include applying traffic shaping to the output.

該輸出之輸出可包括嘗試匹配在輸出期間使用之頻寬與鏈路上之最大可允許頻寬，該鏈路將記憶體處理積體電路耦接至請求者單元。 Outputting the output may include attempting to match the bandwidth used during the outputting to the maximum allowable bandwidth on a link that couples the memory processing integrated circuit to a requester unit.

該輸出之輸出可包括嘗試維持輸出訊務速率之波動低於臨限值。 Outputting the output may include attempting to keep fluctuations in the output traffic rate below a threshold.

擷取及輸出中之任何步驟可在主機之控制下及/或獨立地或部分地由控制器執行。 Any of the retrieving and outputting steps can be executed under the control of the host and/or independently, or partially independently, by the controller.

主機可發送具有不同粒度之擷取命令，自一般發送擷取資訊而無關於所請求特徵向量在多個記憶體單元內之位置，直至基於所請求特徵向量在多個記憶體單元內之位置而發送詳細擷取資訊。 The host can send retrieval commands of different granularities, ranging from general retrieval information that is independent of the locations of the requested feature vectors within the multiple memory units, to detailed retrieval information that is based on the locations of the requested feature vectors within the multiple memory units.

主機可控制(或嘗試控制)記憶體處理積體電路內之不同擷取操作之時序，但可能與時序無關。 The host can control (or attempt to control) the timing of the different retrieval operations within the memory processing integrated circuit, or may be indifferent to the timing.

控制器可藉由主機在各種層級中控制，且可甚至忽略主機之詳細命令，且獨立地至少控制擷取及/或輸出。 The controller can be controlled by the host in various levels, and can even ignore the detailed commands of the host, and independently control at least the capture and/or output.

所請求特徵向量之處理可由以下各者中之至少一者(以下各者中之一或多者的組合)執行：一或多個記憶體/處理單元，及位於一或多個記憶體/處理單元外部之一或多個處理器，及其類似者。 The processing of the requested feature vector can be performed by at least one of the following (one or a combination of more of the following): one or more memories/processing units, and one or more memories/processing One or more processors outside the unit, and the like.

應注意，所請求特徵向量之處理可由以下各者中之至少一者(以下各者中之一或多者的組合)執行：一或多個處理器子單元、控制器、一或多個向量處理器，及位於一或多個記憶體/處理單元外部之一或多個記憶體/處理單元。 It should be noted that the processing of the requested feature vector can be performed by at least one of the following (a combination of one or more of the following): one or more processor sub-units, a controller, one or more vectors The processor, and one or more memories/processing units located outside of the one or more memories/processing units.

所請求特徵向量之處理可由以下各者中之任一者或以下各者之組合執行，可由以下各者中之任一者或以下各者之組合產生： The processing of the requested feature vector can be performed by any one of the following or a combination of the following, and can be generated by any one of the following or a combination of the following:

a.記憶體/處理單元之處理器子單元(或邏輯9030)。 a. The processor sub-unit of the memory/processing unit (or logic 9030).

b.多個記憶體/處理單元之處理器子單元(或邏輯9030)。 b. Processor sub-units of multiple memory/processing units (or logic 9030).

c.記憶體/處理單元之控制器。 c. Controller of memory/processing unit.

d.多個記憶體/處理單元之控制器。 d. Controllers of multiple memory/processing units.

e.記憶體/處理單元之一或多個向量處理器。 e. One or more vector processors of a memory/processing unit.

f.一或多個向量處理器、多個記憶體/處理單元。 f. One or more vector processors of multiple memory/processing units.

因此，所請求特徵向量之處理可由以下各者之任何組合或子組合執行：(a)一或多個記憶體/處理單元之一或多個控制器；(b)一或多個記憶體/處理單元之一或多個處理器子單元；(c)一或多個記憶體/處理單元之一或多個向量處理器；及(d)位於一或多個記憶體/處理單元外部之一或多個其他處理器。 Therefore, the processing of the requested feature vectors can be performed by any combination or sub-combination of: (a) one or more controllers of one or more memory/processing units; (b) one or more processor sub-units of one or more memory/processing units; (c) one or more vector processors of one or more memory/processing units; and (d) one or more other processors located outside the one or more memory/processing units.

記憶體/處理單元可包括多個處理器子單元。處理器子單元可彼此獨立地操作，可彼此部分地合作，可參與分散式處理，及其類似者。 The memory/processing unit may include multiple processor sub-units. The processor sub-units can operate independently of each other, can partially cooperate with each other, can participate in distributed processing, and the like.

可用平面方式執行處理，其中所有處理器子單元執行相同操作(且在其之間可能輸出或可能不輸出處理結果)。 The processing can be performed in a flat manner, in which all processor sub-units perform the same operations (and may or may not output processing results between themselves).

可用階層式方式執行處理，其中處理涉及不同層級之處理操作序列，而某一層之處理操作在又一層級之處理操作之後。處理器子單元可經分配(動態地或靜態地)給不同層且參與階層式處理。 The processing can be performed in a hierarchical manner, where the processing involves a sequence of processing operations at different levels, and the processing operations of a certain level follow the processing operations of another level. Processor sub-units can be allocated (dynamically or statically) to different layers and participate in hierarchical processing.

所請求特徵向量之任何處理可由多於一個處理實體(處理器子單元、控制器、向量處理器、其他處理器)執行，可用任何方式(用平面、階層式或其他方式)進行分散式處理。舉例而言，處理器子單元可將其處理結果輸出至控制器，該控制器可進一步處理該等結果。位於一或多個記憶體/處理單元外部之一或多個其他處理器可進一步處理記憶體處理積體電路之輸出。 Any processing of the requested feature vector can be performed by more than one processing entity (processor subunits, controllers, vector processors, other processors), and distributed processing can be performed in any manner (planar, hierarchical, or other methods). For example, the processor sub-unit can output its processing results to the controller, and the controller can further process the results. One or more other processors located outside the one or more memory/processing units can further process the output of the memory processing integrated circuit.

應注意，擷取資訊亦可包括用於擷取不映射至語句區段之所請求特徵向量的資訊。此等特徵向量可映射至一或多個人員、裝置或可與語句區段相關之任何其他實體。舉例而言，感測語句區段之裝置的使用者、感測區段之裝置、識別為語句區段之來源的使用者、在產生語句時存取之網站、俘獲語句之位置，及其類似者。 It should be noted that the retrieval information may also include information for retrieving requested feature vectors that are not mapped to sentence segments. Such feature vectors can be mapped to one or more persons, devices, or any other entities that may be related to the sentence segments. Examples include the user of the device that sensed the sentence segment, the device that sensed the segment, the user identified as the source of the sentence segment, the website accessed when the sentence was generated, the location where the sentence was captured, and the like.

在細節上作必要修改後，方法9400、9401及9402可適用於不映射至語句區段之處理及/或所請求擷取向量。 Methods 9400, 9401, and 9402 can be applied, mutatis mutandis, to processing and/or to requested retrieved vectors that are not mapped to sentence segments.

特徵向量之處理的非限制性實例可包括加總、加權和、平均、減法或應用任何其他數學函數。 Non-limiting examples of the processing of feature vectors may include summation, weighted sum, average, subtraction, or application of any other mathematical function.
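The operations listed above can be illustrated with a minimal Python sketch. It assumes that feature vectors are plain lists of floats of equal length; the function names `vec_sum`, `vec_weighted_sum`, `vec_average`, and `vec_subtract` are hypothetical and do not appear in the original text.

```python
from typing import List, Sequence

Vector = List[float]


def vec_sum(vectors: Sequence[Vector]) -> Vector:
    """Element-wise sum of equally sized feature vectors."""
    return [sum(column) for column in zip(*vectors)]


def vec_weighted_sum(vectors: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Element-wise sum in which vector i is scaled by weights[i]."""
    return [sum(w * x for w, x in zip(weights, column)) for column in zip(*vectors)]


def vec_average(vectors: Sequence[Vector]) -> Vector:
    """Element-wise arithmetic mean of the feature vectors."""
    return [total / len(vectors) for total in vec_sum(vectors)]


def vec_subtract(a: Vector, b: Vector) -> Vector:
    """Element-wise difference a - b."""
    return [x - y for x, y in zip(a, b)]
```

Any of these functions could, in principle, run on a processor sub-unit, a controller, a vector processor, or an external processor; the sketch says nothing about where the work is placed.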

混合裝置 Hybrid device

隨著處理器速度及記憶體大小兩者均繼續增加，對有效處理速度之顯著限制係馮諾依曼(von Neumann)瓶頸。馮諾依曼瓶頸係由習知電腦架構所導致之輸送量限制造成。特定而言，相較於由處理器進行之實際運算，自記憶體至處理器之資料傳送(在諸如外部DRAM記憶體之邏輯晶粒外部)常常遇到瓶頸。因此，用以對記憶體進行讀取及寫入之時脈循環的數目隨著記憶體密集型處理程序而顯著增加。此等時脈循環導致較低的有效處理速度，此係因為對記憶體進行讀取及寫入會消耗時脈循環，該等時脈循環無法用於對資料執行操作。此外，處理器之運算頻寬通常大於處理器用以存取記憶體之匯流排的頻寬。 As both processor speed and memory size continue to increase, a significant limitation on effective processing speed is the von Neumann bottleneck. The von Neumann bottleneck results from the throughput limitations of conventional computer architecture. In particular, data transfer from the memory to the processor (where the memory, such as an external DRAM memory, is outside the logic die) is often the bottleneck compared with the actual computation performed by the processor. Accordingly, the number of clock cycles spent reading from and writing to memory increases significantly for memory-intensive processes. These clock cycles result in a lower effective processing speed, because reading from and writing to memory consumes clock cycles that cannot be used to perform operations on data. In addition, the computational bandwidth of the processor is generally greater than the bandwidth of the bus that the processor uses to access the memory.

可提供一種用於記憶體密集型處理之混合裝置，該混合裝置可包括基礎晶粒、多個處理器、至少另一晶粒之第一記憶體資源，及至少一個其他晶粒之第二記憶體資源。 A hybrid device for memory-intensive processing may be provided. The hybrid device may include a base die, multiple processors, a first memory resource of at least another die, and a second memory resource of at least one other die.

該基礎晶粒及該至少另一晶粒藉由晶圓上晶圓接合彼此連接。 The base die and the at least another die are connected to each other by wafer-on-wafer bonding.

多個處理器經組態以執行處理操作，且擷取儲存於第一記憶體資源中之所擷取資訊。 The multiple processors are configured to perform processing operations and retrieve the retrieved information stored in the first memory resource.

第二記憶體資源經組態以將來自第二記憶體資源之額外資訊發送至第一記憶體資源。 The second memory resource is configured to send additional information from the second memory resource to the first memory resource.

基礎晶粒與至少另一晶粒之間的第一路徑之總頻寬超過至少另一晶粒與至少一個其他晶粒之間的第二路徑之總頻寬，且第一記憶體資源之儲存容量為第二記憶體資源之儲存容量的一部分。 The total bandwidth of the first path between the base die and the at least another die exceeds the total bandwidth of the second path between the at least another die and the at least one other die, and the storage capacity of the first memory resource is a fraction of the storage capacity of the second memory resource.

第二記憶體資源為高頻寬記憶體(HBM)資源。 The second memory resource is a high-bandwidth memory (HBM) resource.

至少一個其他晶粒為高頻寬記憶體(HBM)晶片之堆疊。 At least one other die is a stack of high-bandwidth memory (HBM) chips.

第二記憶體資源中之至少一些可屬於另一晶粒，該另一晶粒藉由不同於晶圓間接合之連接性而連接至基礎晶粒。 At least some of the second memory resources may belong to another die that is connected to the base die by a different connectivity from the inter-wafer bonding.

第二記憶體資源中之至少一些屬於另一晶粒，該另一晶粒藉由不同於晶圓間接合之連接性而連接至另一晶粒。 At least some of the second memory resources belong to another die, and the other die is connected to the other die by a different connectivity from the inter-wafer bonding.

第一記憶體資源及第二記憶體資源為不同層級之快取記憶體。 The first memory resource and the second memory resource are different levels of cache memory.

第一記憶體資源定位於基礎晶粒與第二記憶體資源之間。 The first memory resource is positioned between the basic die and the second memory resource.

第一記憶體資源定位於第二記憶體資源之一側。 The first memory resource is positioned on one side of the second memory resource.

另一晶粒經組態以執行額外處理，其中另一晶粒包含複數個處理器子單元及第一記憶體資源。 The other die is configured to perform additional processing, and the other die includes a plurality of processor sub-units and a first memory resource.

每一處理器子單元耦接至分配給處理器子單元之第一記憶體資源的唯一部分。 Each processor sub-unit is coupled to a unique portion of the first memory resource allocated to the processor sub-unit.

第一記憶體資源之唯一部分為至少一個記憶體組。 The unique portion of the first memory resource is at least one memory bank.

多個處理器為包括於為記憶體處理晶片第一記憶體資源中之複數個處理器子單元。 The multiple processors are a plurality of processor sub-units included in a memory processing chip that also includes the first memory resource.

基礎晶粒包含多個處理器，其中多個處理器為經由使用晶圓間接合形成之導體耦接至第一記憶體資源的複數個處理器子單元。 The basic die includes a plurality of processors, wherein the plurality of processors are a plurality of processor sub-units coupled to the first memory resource through a conductor formed by using inter-wafer bonding.

可提供混合積體電路，其可利用晶圓上晶圓(WOW)連接性以將基礎晶粒之至少一部分耦接至第二記憶體資源，該等第二記憶體資源包括於一或多個其他晶粒中且使用不同於WOW連接性之連接性而連接。第二記憶體資源之實例可為高頻寬記憶體(HBM)記憶體資源。在各種圖中，第二記憶體資源包括於HBM記憶體單元之堆疊中，可使用矽穿孔(TSV)連接性而耦接至控制器。控制器可包括於基礎晶粒中或耦接(例如，經由微凸塊)至基礎晶粒之至少部分。 A hybrid integrated circuit may be provided that uses wafer-on-wafer (WOW) connectivity to couple at least a portion of the base die to second memory resources, which are included in one or more other dies and are connected using a connectivity different from WOW connectivity. An example of the second memory resources is high-bandwidth memory (HBM) memory resources. In the various figures, the second memory resources are included in a stack of HBM memory units that can be coupled to a controller using through-silicon via (TSV) connectivity. The controller may be included in the base die or coupled (for example, via micro-bumps) to at least a portion of the base die.

基礎晶粒可為邏輯晶粒，但可為記憶體/處理單元。 The base die may be a logic die, but may alternatively be a memory/processing unit.

WOW連接性用以將基礎晶粒之一或多個部分耦接至另一晶粒(WOW連接之晶粒)之一或多個部分，該另一晶粒可為記憶體晶粒或記憶體/處理單元。WOW連接性為極高輸送量連接性。 WOW connectivity is used to couple one or more portions of the base die to one or more portions of another die (the WOW-connected die), which may be a memory die or a memory/processing unit. WOW connectivity is a very high-throughput connectivity.

高頻寬記憶體(HBM)晶片之堆疊可耦接至基礎晶粒(直接或經由WOW連接之晶粒)，且可提供高輸送量連接及極廣泛記憶體資源。 The stack of high-bandwidth memory (HBM) chips can be coupled to the basic die (die connected directly or via WOW), and can provide high throughput connections and extremely extensive memory resources.

WOW連接之晶粒可耦接於HBM晶片之堆疊與基礎晶粒之間以形成HBM記憶體晶片堆疊，該堆疊具有TSV連接性且在其底部具有WOW連接之晶粒。 The WOW-connected die can be coupled between the stack of HBM chips and the base die to form an HBM memory chip stack that has TSV connectivity and has WOW-connected dies at the bottom.

具有TSV連接性且在底部具有WOW連接之晶粒的HBM晶片堆疊可提供多層記憶體階層，其中WOW連接之晶粒可用作基礎晶粒可存取之較低層級記憶體(例如，3階快取記憶體)，其中自較高層級HBM記憶體堆疊之提取及/或預提取操作填充WOW連接之晶粒。 An HBM chip stack with TSV connectivity and a WOW-connected die at the bottom can provide a multi-level memory hierarchy, in which the WOW-connected die serves as a lower-level memory accessible to the base die (for example, a level-3 cache), and in which fetch and/or prefetch operations from the higher-level HBM memory stack fill the WOW-connected die.
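The fill-by-prefetch behavior described above can be modeled with a short Python sketch. This is only an illustrative software model under stated assumptions: the class name `TwoLevelMemory`, the dictionary-based tiers, and the first-in-first-out eviction policy are all hypothetical and are not taken from the original text.

```python
class TwoLevelMemory:
    """Toy model: a small fast tier filled from a large slow tier."""

    def __init__(self, far_tier, near_capacity):
        self.far = dict(far_tier)        # stands in for the HBM stack (large)
        self.near = {}                   # stands in for the WOW-connected die (small)
        self.near_capacity = near_capacity
        self.far_reads = 0               # counts transfers from the far tier

    def prefetch(self, key):
        """Copy one item from the far tier into the near tier ahead of use."""
        if key in self.near:
            return
        if len(self.near) >= self.near_capacity:
            # Evict the oldest inserted item (FIFO, an arbitrary illustrative policy).
            self.near.pop(next(iter(self.near)))
        self.near[key] = self.far[key]
        self.far_reads += 1

    def read(self, key):
        """Read as the base die would: always served from the near tier."""
        if key not in self.near:
            self.prefetch(key)           # demand fetch on a miss
        return self.near[key]
```

In this model, a read that was preceded by a matching `prefetch` call causes no additional far-tier transfer, which is the benefit the hierarchy is meant to provide.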

HBM記憶體晶片可為HBM DRAM晶片，但可使用任何其他記憶體技術。 The HBM memory chip can be an HBM DRAM chip, but any other memory technology can be used.

使用WOW連接性與HBM晶片之組合使得能夠提供多層記憶體結構，該多層記憶體結構可包括可提供頻寬與記憶體密度之間的不同取捨的多個記憶體層。 Using a combination of WOW connectivity and HBM chips enables a multi-layer memory structure that may include multiple memory layers offering different trade-offs between bandwidth and memory density.

所建議解決方案可充當傳統的DRAM記憶體/HBM至邏輯晶粒之內部快取記憶體之間的額外的全新記憶體階層，從而在DRAM側實現更多頻寬以及更佳管理及重複使用。 The proposed solution can serve as an additional brand new memory hierarchy between the traditional DRAM memory/HBM and the internal cache memory of the logic die, thereby achieving more bandwidth and better management and reuse on the DRAM side.

此可在DRAM側提供以快速方式較佳地管理記憶體讀取之新的記憶體階層。 This can provide a new memory hierarchy on the DRAM side that better manages memory reads in a fast manner.

圖93A至圖93I分別說明混合積體電路11011'至11019'。 FIGS. 93A to 93I illustrate the hybrid integrated circuits 11011' to 11019', respectively.

圖93A說明具有TSV連接性且在最低層級具有微凸塊之HBM DRAM堆疊(共同表示為11030)，該堆疊包括彼此耦接且使用TSV(11039)耦接至基礎晶粒之第一記憶體控制器11031的HBM DRAM記憶體晶片11032之堆疊。 FIG. 93A illustrates an HBM DRAM stack with TSV connectivity and micro-bumps at the lowest level (collectively denoted 11030). The stack includes a stack of HBM DRAM memory chips 11032 that are coupled to each other and, using TSVs (11039), to the first memory controller 11031 of the base die.

圖93A亦說明至少具有記憶體資源且使用WOW技術耦接之晶圓(共同表示為11040)，該晶圓包括經由一或多個WOW中間層(11023)耦接至DRAM晶圓(11021)之基礎晶粒11019的第二記憶體控制器11022。一或多個WOW中間層可由不同材料製成，但可不同於墊連接性及/或可不同於TSV連接性。 FIG. 93A also illustrates a wafer that has at least memory resources and is coupled using WOW technology (collectively denoted 11040). It includes the second memory controller 11022 of the base die 11019, which is coupled to a DRAM wafer (11021) via one or more WOW intermediate layers (11023). The one or more WOW intermediate layers may be made of different materials, but may differ from pad connectivity and/or from TSV connectivity.

穿過一或多個WOW中間層之導體11022'將DRAM晶粒電耦接至基礎晶粒之組件。 Conductors 11022' passing through one or more WOW intermediate layers electrically couple the DRAM die to the components of the base die.

基礎晶粒11019耦接至中介層11018，該中介層又使用微凸塊耦接至封裝基板11017。封裝基板在其下表面處具有微凸塊之陣列。 The base die 11019 is coupled to the interposer 11018, which in turn is coupled to the package substrate 11017 using micro bumps. The package substrate has an array of micro bumps on its lower surface.

微凸塊可由其他連接性替換。中介層11018及封裝基板11017可由其他層替換。 The micro bumps can be replaced by other connectivity. The interposer 11018 and the packaging substrate 11017 can be replaced by other layers.

第一及/或第二記憶體控制器(分別為11031及11032)可定位於(至少部分)基礎晶粒11019外部，例如定位於DRAM晶圓中，DRAM晶圓與基礎晶粒之間，HBM記憶體單元之堆疊與基礎晶粒之間，及其類似者。 The first and/or second memory controllers (11031 and 11032, respectively) can be positioned (at least partly) outside the base die 11019, for example, positioned in the DRAM wafer, between the DRAM wafer and the base die, HBM Between the stack of memory cells and the base die, and the like.

第一及/或第二記憶體控制器(分別為11031及11032)可屬於同一控制器或可屬於不同控制器。 The first and/or second memory controllers (11031 and 11032, respectively) may belong to the same controller or may belong to different controllers.

HBM記憶體單元中之一或多者可包括邏輯以及記憶體，且可為或可包括記憶體/處理單元。 One or more of the HBM memory units may include logic and memory, and may or may include memory/processing units.

第一及第二記憶體控制器藉由多個匯流排11016彼此耦接，以用於在第一記憶體資源與第二記憶體資源之間輸送資訊。圖93A亦說明自第二記憶體控制器至基礎晶粒之組件(例如，多個處理器)的匯流排11014。圖93A進一步說明自第一記憶體控制器至基礎晶粒之組件(例如，多個處理器，如圖93C中所展示)的匯流排11015。 The first and second memory controllers are coupled to each other by multiple buses 11016 for transferring information between the first memory resources and the second memory resources. FIG. 93A also illustrates the bus 11014 from the second memory controller to components of the base die (for example, multiple processors). FIG. 93A further illustrates the bus 11015 from the first memory controller to components of the base die (for example, multiple processors, as shown in FIG. 93C).

圖93B說明混合積體電路11012，其與圖93A之混合積體電路11011的不同之處在於具有記憶體/處理單元11021'而非DRAM晶粒11021。 FIG. 93B illustrates the hybrid integrated circuit 11012, which is different from the hybrid integrated circuit 11011 of FIG. 93A in that it has a memory/processing unit 11021' instead of a DRAM die 11021.

圖93C說明混合積體電路11013，其與圖93A之混合積體電路11011的不同之處在於具有HBM記憶體晶片堆疊，該堆疊具有TSV連接性且在其底部具有WOW連接之晶粒(共同表示為11040)，該晶粒包括HBM記憶體單元之堆疊與基礎晶粒11018之間的DRAM晶粒11021。 FIG. 93C illustrates the hybrid integrated circuit 11013, which is different from the hybrid integrated circuit 11011 of FIG. 93A in that it has an HBM memory chip stack, which has TSV connectivity and has WOW connected die at its bottom (commonly indicated 11040), the die includes a DRAM die 11021 between the stack of HBM memory cells and the base die 11018.

DRAM晶粒11021使用WOW技術(參見WOW中間層11023)耦接至基礎晶粒11019之第一記憶體控制器11031。HBM記憶體晶粒11032中之一或多者可包括邏輯以及記憶體，且可為或可包括記憶體/處理單元。 The DRAM die 11021 is coupled to the first memory controller 11031 of the base die 11019 using WOW technology (see WOW intermediate layer 11023). One or more of the HBM memory die 11032 may include logic and memory, and may or may include a memory/processing unit.

最下部DRAM晶粒(在圖93C中表示為DRAM晶粒11021)可為HBM記憶體晶粒或可不同於HBM晶粒。最下部DRAM晶粒(DRAM晶粒11021)可由記憶體/處理單元11021'替換，如由圖93D之混合積體電路11014所說明。 The lowermost DRAM die (denoted as DRAM die 11021 in FIG. 93C) may be an HBM memory die or may be different from an HBM die. The lowermost DRAM die (DRAM die 11021) can be replaced by a memory/processing unit 11021', as illustrated by the hybrid integrated circuit 11014 of FIG. 93D.

圖93E至圖93G分別說明混合積體電路11015、11016及11016'，其中基礎晶粒11019耦接至在最低層級具有TSV連接性及微凸塊之HBM DRAM堆疊(11020)及至少具有記憶體資源且使用WOW技術耦接之晶圓(11030)的多個例項，及/或耦接至具有TSV連接性且在底部具有WOW連接之晶粒的HBM記憶體晶片堆疊(11040)之多個例項。 FIGS. 93E to 93G illustrate hybrid integrated circuits 11015, 11016, and 11016', respectively, in which the base die 11019 is coupled to multiple instances of the HBM DRAM stack with TSV connectivity and micro-bumps at the lowest level (11020) and of the wafer that has at least memory resources and is coupled using WOW technology (11030), and/or to multiple instances of the HBM memory chip stack with TSV connectivity and a WOW-connected die at the bottom (11040).

圖93H說明混合積體電路11014'，該混合積體電路與圖93D之混合積體電路11014的不同之處在於說明記憶體單元53、二階快取記憶體(L2快取記憶體52)、多個處理器11051。多個處理器11051耦接至L2快取記憶體11052，且可饋入有儲存於記憶體單元11053及L2快取記憶體11052中之係數及/或資料。 FIG. 93H illustrates the hybrid integrated circuit 11014', which differs from the hybrid integrated circuit 11014 of FIG. 93D in that it shows a memory unit 53, a second-level cache (L2 cache 52), and multiple processors 11051. The multiple processors 11051 are coupled to the L2 cache 11052 and can be fed with coefficients and/or data stored in the memory unit 11053 and the L2 cache 11052.

上文所提及之混合積體電路中的任一者可用於人工智慧(AI)處理，該處理為頻寬密集的。 Any of the hybrid integrated circuits mentioned above can be used for artificial intelligence (AI) processing, which is bandwidth intensive.

當使用WOW技術耦接至記憶體控制器時，圖93D及93H之記憶體/處理單元11021'可執行AI計算，且可用極高速率自HBM DRAM堆疊及/或自WOW連接之晶粒接收資料及係數兩者。 When coupled to a memory controller using WOW technology, the memory/processing unit 11021' of Figures 93D and 93H can perform AI calculations and can receive data from HBM DRAM stacks and/or from WOW connected dies at a very high rate And the coefficient both.

任何記憶體/處理單元可包括分散式記憶體陣列及處理器陣列。分散式記憶體及處理器陣列可包括多個記憶體組及多個處理器。多個處理器可形成處理陣列。 Any memory/processing unit can include distributed memory arrays and processor arrays. Distributed memory and processor arrays may include multiple memory banks and multiple processors. Multiple processors can form a processing array.

參看圖93C、圖93D及圖93H且假定需要混合積體電路(11013、11014或11014')來執行一般的矩陣向量乘法(GEMV)，該等乘法包括計算矩陣與向量之乘積。因為不存在對所擷取矩陣之重複使用，所以此類型之計算為頻寬密集的。因此，僅需要擷取及使用整個矩陣一次。 Refer to FIGS. 93C, 93D, and 93H and assume that a hybrid integrated circuit (11013, 11014, or 11014') is required to perform general matrix vector multiplication (GEMV), which includes calculating the product of a matrix and a vector. Because there is no repeated use of the extracted matrix, this type of calculation is bandwidth-intensive. Therefore, it is only necessary to retrieve and use the entire matrix once.

GEMV可為數學運算序列之一部分，其涉及(i)將第一矩陣(A)乘以第一向量(V1)以提供第一中間向量，對第一中間向量應用第一非線性運算(NLO1)以提供第一中間結果；(ii)將第二矩陣(B)乘以第一中間結果以提供第二中間向量，對第二中間向量應用第二非線性運算(NLO2)以提供第二中間結果，等等(直至接收第N中間結果，N可超過2)。 GEMV can be part of a sequence of mathematical operations that involves (i) multiplying a first matrix (A) by a first vector (V1) to provide a first intermediate vector, and applying a first nonlinear operation (NLO1) to the first intermediate vector to provide a first intermediate result; (ii) multiplying a second matrix (B) by the first intermediate result to provide a second intermediate vector, and applying a second nonlinear operation (NLO2) to the second intermediate vector to provide a second intermediate result; and so on (until an N-th intermediate result is obtained, where N may exceed 2).

假定每一矩陣為大的(例如，1Gb)，計算將需要1Tbs運算功率及1Tbs之頻寬/輸送量。可並列地執行運算及計算。 Assuming that each matrix is large (for example, 1Gb), the calculation will require 1Tbs of computing power and 1Tbs of bandwidth/throughput. The operations and calculations can be performed in parallel.

假定GEMV計算展現N=4且具有以下形式：結果=NLO4(D*(NLO3(C*(NLO2(B*(NLO1(A*V1)))))))。 Assume that the GEMV calculation exhibits N=4 and has the following form: result=NLO4(D*(NLO3(C*(NLO2(B*(NLO1(A*V1))))))).
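The nested form above can be written as a simple loop. The following Python sketch is only illustrative: it assumes matrices are lists of rows, uses a rectifier (`relu`) as a stand-in for the unspecified nonlinear operations NLO1 to NLO4, and the names `matvec` and `gemv_chain` are hypothetical.

```python
def matvec(matrix, vector):
    """General matrix-vector product: one dot product per matrix row."""
    return [sum(a * x for a, x in zip(row, vector)) for row in matrix]


def relu(vector):
    """Example nonlinear operation (an assumption; the text does not name one)."""
    return [max(0.0, x) for x in vector]


def gemv_chain(matrices, nonlinearities, v1):
    """Compute NLO_N(M_N * ... NLO_2(M_2 * NLO_1(M_1 * v1)) ...)."""
    result = v1
    for matrix, nlo in zip(matrices, nonlinearities):
        # Each matrix is read once and never reused, which is why the
        # computation is limited by memory bandwidth rather than arithmetic.
        result = nlo(matvec(matrix, result))
    return result
```

For N=4, the call would be `gemv_chain([A, B, C, D], [NLO1, NLO2, NLO3, NLO4], V1)`.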

亦假定DRAM晶粒11021(或記憶體/處理單元11021')不具有足夠的記憶體資源以同時儲存A、B、C及D，則此等矩陣中之至少一些將儲存於HBM DRAM晶粒11032中。 Assume also that the DRAM die 11021 (or the memory/processing unit 11021') does not have enough memory resources to store A, B, C, and D at the same time; at least some of these matrices will then be stored in the HBM DRAM dies 11032.

假定基礎晶粒為包括諸如但不限於處理器、算術邏輯單元及其類似者之計算單元的邏輯晶粒。 It is assumed that the basic die is a logic die including a computing unit such as but not limited to a processor, arithmetic logic unit, and the like.

在第一晶粒計算A*V1時，第一記憶體控制器11031自一或多個HBM DRAM晶粒11032擷取其他矩陣之缺失部分以用於接下來的計算。 When the first die calculates A*V1, the first memory controller 11031 retrieves the missing parts of other matrices from one or more HBM DRAM die 11032 to use in the next calculation.

參看圖93H且假定(a)DRAM晶粒11021具有2TBs頻寬及512Mb容量，(b)HBM DRAM晶粒11032具有0.2TBs頻寬及8Gb容量，且(c)L2快取記憶體11052為具有6Ts頻寬及10Mb容量之SRAM。 Referring to FIG. 93H, assume that (a) the DRAM die 11021 has a bandwidth of 2TBs and a capacity of 512Mb, (b) the HBM DRAM die 11032 has a bandwidth of 0.2TBs and a capacity of 8Gb, and (c) the L2 cache 11052 is an SRAM with a bandwidth of 6Ts and a capacity of 10Mb.

矩陣乘法涉及重複使用資料，將大的矩陣分段成多個區段(例如，5Mb個區段以適合可在雙緩衝器組態下使用之L2快取記憶體)及將所提取之第一矩陣區段乘以第二矩陣之區段(一個第二矩陣區段接著另一第二矩陣區段)。 Matrix multiplication involves reusing data, segmenting a large matrix into multiple segments (for example, 5Mb segments to fit the L2 cache that can be used in a double-buffer configuration) and extracting the first The matrix segment is multiplied by the second matrix segment (one second matrix segment followed by another second matrix segment).

在將第一矩陣區段乘以第二矩陣區段時，將另一第二矩陣區段自(記憶體處理單元11021'之)DRAM晶粒11021提取至L2快取記憶體。 When the first matrix section is multiplied by the second matrix section, another second matrix section is extracted from the DRAM die 11021 (of the memory processing unit 11021') to the L2 cache.
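The segment-by-segment multiplication with a double buffer can be sketched as follows. This Python sketch is a sequential simulation under stated assumptions: real hardware would overlap the fetch with the multiplication, matrices are lists of rows, the second matrix is cut into column segments, and the names `split_columns`, `multiply`, and `double_buffered_matmul` are hypothetical.

```python
def split_columns(matrix, width):
    """Cut a matrix into column segments of at most `width` columns."""
    n_cols = len(matrix[0])
    return [[row[start:start + width] for row in matrix]
            for start in range(0, n_cols, width)]


def multiply(a, b):
    """Plain matrix product of two lists-of-rows matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def double_buffered_matmul(first_segment, second_matrix, width):
    """Multiply one first-matrix segment by successive second-matrix segments."""
    segments = split_columns(second_matrix, width)
    buffers = [segments[0], None]        # two cache-sized buffers
    active = 0
    out = [[] for _ in first_segment]
    for i in range(len(segments)):
        if i + 1 < len(segments):
            # "Fetch" the next segment into the idle buffer. In hardware this
            # transfer would run while the multiplication below is in progress.
            buffers[1 - active] = segments[i + 1]
        partial = multiply(first_segment, buffers[active])
        for out_row, part_row in zip(out, partial):
            out_row.extend(part_row)
        active = 1 - active              # swap the roles of the two buffers
    return out
```

The result equals `multiply(first_segment, second_matrix)`; only the order in which the data is brought into the buffers differs.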

假定矩陣各為1Gb，在執行取得及計算時，DRAM晶粒11021或記憶體/處理單元11021'饋入有來自HBM DRAM晶粒11032之矩陣區段。 Assuming that the matrices are 1Gb each, the DRAM die 11021 or the memory/processing unit 11021' is fed with matrix segments from the HBM DRAM dies 11032 while the fetching and the calculations are performed.

DRAM晶粒11021或記憶體/處理單元11021'聚集矩陣區段，且矩陣區段接著經由WOW中間層(11023)饋入至基礎晶粒11019。 The DRAM die 11021 or the memory/processing unit 11021' gathers matrix segments, and the matrix segments are then fed to the base die 11019 through the WOW intermediate layer (11023).

記憶體/處理單元11021'可藉由執行計算及發送結果而非發送經計算以提供結果之中間值來減少經由WOW中間層(11023)發送至基礎晶粒11019的資訊之量。當處理多個(Q個)中間值以提供結果時，則壓縮比可為Q比1。 The memory/processing unit 11021' can reduce the amount of information sent to the base die 11019 through the WOW intermediate layer (11023) by performing the calculation itself and sending the result, instead of sending the intermediate values from which the result is computed. When multiple (Q) intermediate values are processed to provide one result, the compression ratio may be Q to 1.
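The Q-to-1 ratio can be checked with a small Python sketch. It assumes, purely for illustration, that the result is the sum of the intermediate values and that every value has the same size; the names `reduce_before_sending` and `transfer_ratio` are hypothetical.

```python
def reduce_before_sending(intermediate_values):
    """Combine Q intermediate values into a single result near the memory."""
    return [sum(intermediate_values)]


def transfer_ratio(intermediate_values):
    """Ratio of values crossing the link without and with local reduction."""
    sent_without_reduction = len(intermediate_values)                       # Q values
    sent_with_reduction = len(reduce_before_sending(intermediate_values))   # 1 value
    return sent_without_reduction / sent_with_reduction
```

With five intermediate values the ratio is 5 to 1, matching the Q-to-1 figure stated above for Q=5.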

圖93I說明使用WOW技術實施之記憶體處理單元11019'的實例。邏輯單元9030(可為處理器子單元)、控制器9020及匯流排9021位於一個晶片111061中，分配給不同邏輯單元之記憶體組9040位於第二晶片11062中，而第一及第二晶片使用穿過WOW接合部11061之導體11012'彼此連接，該WOW接合部可包括一或多個WOW中間層。 FIG. 93I illustrates an example of a memory processing unit 11019' implemented using WOW technology. The logic unit 9030 (may be a processor sub-unit), the controller 9020 and the bus 9021 are located in one chip 111061, the memory group 9040 allocated to different logic units is located in the second chip 11062, and the first and second chips are used The conductors 11012' passing through the WOW junction 11061 are connected to each other, and the WOW junction may include one or more WOW intermediate layers.

圖93J為用於記憶體密集型處理之方法11100的實例。記憶體密集意謂處理需要高頻寬記憶體消耗或與高頻寬記憶體消耗相關聯。 Figure 93J shows an example of a method 11100 for memory-intensive processing. Memory intensive means that processing requires high bandwidth memory consumption or is related to high bandwidth memory consumption.

方法11100可開始於步驟11110、11120及11130。 Method 11100 can start at steps 11110, 11120, and 11130.

步驟11110包括藉由多個處理器混合裝置執行處理操作，該混合裝置包含基礎晶粒、至少另一晶粒之第一記憶體資源及至少一個其他晶粒之第二記憶體資源；其中基礎晶粒及至少另一晶粒藉由晶圓上晶圓接合彼此連接。 Step 11110 includes performing processing operations by multiple processors of a hybrid device, the hybrid device including a base die, a first memory resource of at least another die, and a second memory resource of at least one other die, wherein the base die and the at least another die are connected to each other by wafer-on-wafer bonding.

步驟11120包括藉由多個處理器擷取儲存於第一記憶體資源中之所擷取資訊。 Step 11120 includes using multiple processors to retrieve the retrieved information stored in the first memory resource.

步驟11130可包括將來自第二記憶體資源之額外資訊發送至第一記憶體資源，其中基礎晶粒與至少另一晶粒之間的第一路徑之總頻寬超過至少另一晶粒與至少一個其他晶粒之間的第二路徑之總頻寬，且其中第一記憶體資源之儲存容量為第二記憶體資源之儲存容量的一部分。 Step 11130 may include sending additional information from the second memory resource to the first memory resource, wherein the total bandwidth of the first path between the basic die and at least another die exceeds that of at least another die and at least another die. The total bandwidth of the second path between one other die, and the storage capacity of the first memory resource is a part of the storage capacity of the second memory resource.

方法11100亦可包括藉由包括複數個處理器子單元及第一記憶體資源之另一晶粒執行額外處理的步驟11140。 The method 11100 may also include a step 11140 of performing additional processing by another die including a plurality of processor sub-units and a first memory resource.

每一處理器子單元可耦接至分配給處理器子單元之第一記憶體資源的唯一部分。 Each processor sub-unit can be coupled to a unique portion of the first memory resource allocated to the processor sub-unit.

步驟11110、11120、11130及11140可同時、以部分重疊方式及其類似方式執行。 Steps 11110, 11120, 11130, and 11140 can be performed simultaneously, in a partially overlapping manner, and similar manners.

第二記憶體資源可為高頻寬記憶體(HBM)記憶體資源或可不同於HBM記憶體資源。 The second memory resource may be a high-bandwidth memory (HBM) memory resource or may be different from the HBM memory resource.

至少一個其他晶粒為高頻寬記憶體(HBM)記憶體晶片之堆疊。 At least one other die is a stack of high-bandwidth memory (HBM) memory chips.

通信晶片 Communication chip

資料庫包括許多條目，該等條目包括多個欄位。資料庫處理通常包括執行一或多個查詢，該一或多個查詢包括一或多個篩選參數(例如，識別一或多個相關欄位及一或多個相關欄位值)且亦包括一或多個操作參數，該一或多個操作參數可判定待執行之操作的類型、待在應用操作時使用之變數或常數，及其類似者。資料處理可包括資料庫分析或其他資料庫處理程序。 A database includes many entries, and the entries include multiple fields. Database processing usually includes executing one or more queries that include one or more filter parameters (for example, identifying one or more relevant fields and one or more relevant field values) and that also include one or more operation parameters, which may determine the type of operation to be performed, the variables or constants to be used when applying the operation, and the like. Data processing may include database analytics or other database processing procedures.

對於儲存於記憶體單元中之資料庫的每一資料庫區段，處理包括以下步驟：(i)選擇資料庫區段之記錄；(ii)將記錄自記憶體單元發送至處理器；(iii)藉由處理器篩選記錄以判定記錄是否相關；及(iv)對相關記錄執行一或多個額外操作(求和、應用任何其他數學運算及/或統計操作)。 For each database segment of a database stored in a memory unit, the processing includes the following steps: (i) selecting a record of the database segment; (ii) sending the record from the memory unit to the processor; (iii) filtering the record by the processor to determine whether the record is relevant; and (iv) performing one or more additional operations on the relevant records (summing, applying any other mathematical and/or statistical operation).
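Steps (i) to (iv) above can be sketched in Python as a filter followed by an aggregation. The sketch assumes records are dictionaries, the filter is an equality test on one field, and the additional operation is a sum; the names `process_segment` and `run_query` are hypothetical.

```python
def process_segment(segment, filter_field, filter_value, sum_field):
    """Apply steps (i) to (iv) to one database segment and return the sum."""
    total = 0
    for record in segment:                              # (i) select, (ii) hand to the processor
        if record.get(filter_field) == filter_value:    # (iii) filter for relevance
            total += record[sum_field]                  # (iv) additional operation: summation
    return total


def run_query(segments, filter_field, filter_value, sum_field):
    """Repeat the per-segment processing for every segment of the database."""
    return sum(process_segment(segment, filter_field, filter_value, sum_field)
               for segment in segments)
```

In this conventional flow every record is moved to the processor before it is filtered, so irrelevant records also consume transfer bandwidth.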

在資料庫區段之相關條目不儲存於處理器中之狀況下，則需要在篩選階段之後將此等相關記錄發送至處理器以供進一步處理(應用在處理之後的操作)。 In the case where the relevant entries of the database segment are not stored in the processor, the relevant records need to be sent to the processor after the filtering stage for further processing (applying the operations that follow the processing).

可提供一種可包括資料庫加速積體電路之裝置。 A device that may include a database acceleration integrated circuit may be provided.

可提供一種可包括資料庫加速積體電路之一或多個群組的裝置，該等資料庫加速積體電路可經組態以在資料庫加速積體電路之一或多個群組中的資料庫加速積體電路之間交換資訊及/或加速結果(藉由資料庫加速積體電路進行之處理的最終結果)。 A device may be provided that may include one or more groups of database acceleration integrated circuits, which may be configured to exchange information and/or acceleration results (the final results of the processing performed by the database acceleration integrated circuits) between database acceleration integrated circuits of the one or more groups.

群組之資料庫加速積體電路可連接至同一印刷電路板。 The database acceleration integrated circuits of a group may be connected to the same printed circuit board.

群組之資料庫加速積體電路可屬於電腦化系統之模組化單元。 The database acceleration integrated circuits of a group may belong to a modular unit of a computerized system.

不同群組之資料庫加速積體電路可連接至不同印刷電路板。 The database acceleration integrated circuits of different groups may be connected to different printed circuit boards.

不同群組之資料庫加速積體電路可屬於電腦化系統之不同模組化單元。 The database acceleration integrated circuits of different groups may belong to different modular units of the computerized system.

該裝置可經組態以藉由一或多個群組之資料庫加速積體電路執行分散式處理程序。 The device may be configured to execute a distributed process by the database acceleration integrated circuits of the one or more groups.

該裝置可經組態以使用至少一個交換器以用於在一或多個群組中之不同群組的資料庫加速積體電路之間交換(a)資訊及(b)資料庫加速結果中之至少一者。 The device may be configured to use at least one switch for exchanging at least one of (a) information and (b) database acceleration results between database acceleration integrated circuits of different groups of the one or more groups.

該裝置可經組態以藉由一或多個群組中之一些的資料庫加速積體電路中之一些執行分散式處理程序。 The device may be configured to execute a distributed process by some of the database acceleration integrated circuits of some of the one or more groups.

該裝置可經組態以執行第一及第二資料結構之分散式處理程序，其中第一及第二資料結構之總大小超過多個記憶體處理積體電路之儲存能力。 The device may be configured to execute a distributed process on first and second data structures, where the total size of the first and second data structures exceeds the storage capacity of the multiple memory processing integrated circuits.

該裝置可經組態以藉由執行以下步驟之多個反覆來執行分散式處理程序：(a)執行將第一資料結構部分及第二資料結構部分之不同對新分配給不同資料庫加速積體電路；及(b)處理不同對。 The device may be configured to execute the distributed process by performing multiple iterations of: (a) newly allocating different pairs of first data structure portions and second data structure portions to different database acceleration integrated circuits; and (b) processing the different pairs.
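One possible reading of the iterations above is sketched below in Python. It assumes that every pairing of a first-structure portion with a second-structure portion must be processed exactly once and that each accelerator handles one pair per iteration; the text does not specify the allocation order, and the names `plan_iterations` and `run_distributed` are hypothetical.

```python
from itertools import product


def plan_iterations(first_parts, second_parts, num_accelerators):
    """Group all (first, second) pairs into rounds of at most one pair per accelerator."""
    pairs = list(product(first_parts, second_parts))
    return [pairs[i:i + num_accelerators]
            for i in range(0, len(pairs), num_accelerators)]


def run_distributed(first_parts, second_parts, num_accelerators, process_pair):
    """Run the rounds one after another and collect the per-pair results."""
    results = []
    for round_pairs in plan_iterations(first_parts, second_parts, num_accelerators):
        # (a) new allocation: the accelerator with index k receives round_pairs[k]
        # (b) processing: simulated here by calling process_pair sequentially
        results.extend(process_pair(a, b) for a, b in round_pairs)
    return results
```

Because only one round of pairs is resident at a time, the two data structures together may be larger than the combined storage of the accelerators, which is the situation described above.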

圖94A及圖94B說明儲存系統11560、電腦系統11150及用於資料庫加速之一或多個裝置11520的實例。用於資料庫加速之一或多個裝置11520可用各種方式(藉由監聽或藉由定位於電腦系統11150與儲存系統11560之間)監視儲存系統11560與電腦系統11150之間的通信。 FIGS. 94A and 94B illustrate examples of a storage system 11560, a computer system 11150, and one or more devices 11520 for database acceleration. The one or more devices 11520 for database acceleration can monitor the communication between the storage system 11560 and the computer system 11150 in various ways (by snooping, or by being positioned between the computer system 11150 and the storage system 11560).

儲存系統11560可包括許多(例如，多於20個、50個、100個及其類似者)儲存單元(諸如，磁碟或磁碟之raid)，且可例如儲存多於100萬億位元組資訊。運算系統11510可為大型電腦系統且可包括數十、數百及甚至數千個處理單元。 The storage system 11560 may include many (for example, more than 20, 50, 100, and the like) storage units (such as disks or RAID arrays of disks) and may, for example, store more than 100 terabytes of information. The computing system 11510 may be a large computer system and may include tens, hundreds, or even thousands of processing units.

運算系統11510可包括由管理器11511控制之多個運算節點11512。 The computing system 11510 may include a plurality of computing nodes 11512 controlled by the manager 11511.

運算節點可控制或以其他方式與用於資料庫加速之一或多個裝置11520互動。 The computing node can control or interact with one or more devices 11520 for database acceleration.

用於資料庫加速之一或多個裝置11520可包括一或多個資料庫加速積體電路(參見例如圖94A及圖94B之資料庫加速積體電路11530)及記憶體資源11550。記憶體資源可屬於專用於記憶體但可屬於記憶體/處理單元之一或多個晶片。 The one or more devices 11520 for database acceleration may include one or more database acceleration integrated circuits (see, for example, the database acceleration integrated circuit 11530 of FIGS. 94A and 94B) and memory resources 11550. The memory resources may belong to one or more chips that are dedicated to memory, or may belong to memory/processing units.

圖94C及圖94D說明電腦系統11150及用於資料庫加速之一或多個裝置11520的實例。 94C and 94D illustrate an example of the computer system 11150 and one or more devices 11520 for database acceleration.

用於資料庫加速之一或多個裝置11520的一或多個資料庫加速積體電路可由管理單元11513控制，該管理單元可位於電腦系統內(參見圖94C)或位於用於資料庫加速之一或多個裝置11520內(圖94D)。 The one or more database acceleration integrated circuits of the one or more devices 11520 for database acceleration may be controlled by a management unit 11513, which may be located in the computer system (see FIG. 94C) or in the one or more devices 11520 for database acceleration (FIG. 94D).

圖94E說明用於資料庫加速之裝置11520，該裝置包括資料庫加速積體電路11530及多個記憶體處理積體電路11551。每一記憶體處理積體電路可包括控制器、多個處理器子單元及多個記憶體單元。 FIG. 94E illustrates a device 11520 for database acceleration. The device includes a database acceleration integrated circuit 11530 and multiple memory processing integrated circuits 11551. Each memory processing integrated circuit may include a controller, multiple processor sub-units, and multiple memory units.

資料庫加速積體電路11530經說明為包括網路通信介面11531、第一處理單元11532、記憶體控制器11533、資料庫加速單元11535、互連件11536及管理單元11513。 The database acceleration integrated circuit 11530 is illustrated as including a network communication interface 11531, a first processing unit 11532, a memory controller 11533, a database acceleration unit 11535, an interconnect 11536, and a management unit 11513.

網路通信介面(11531)可經組態以自大量儲存單元接收(例如，經由網路通信介面之第一埠11531(1))大量資訊。每一儲存單元可用超過數十及甚至數百百萬位元組/秒之速率輸出資訊，而資料傳送速率預期隨時間增加(例如，每2至3年加倍)。儲存資料單元之數目(大數目)可超過10個、50個、100個、200個及甚至更多個。大量資訊可超過數十、數百十億位元組/秒，且甚至可在萬億位元組/秒及千兆位元組/秒之範圍內。 The network communication interface (11531) may be configured to receive a large amount of information from a large number of storage units (for example, via a first port 11531(1) of the network communication interface). Each storage unit may output information at a rate exceeding tens or even hundreds of megabytes per second, and the data transfer rate is expected to increase over time (for example, doubling every 2 to 3 years). The number of storage units (the large number) may exceed 10, 50, 100, 200, or even more. The large amount of information may exceed tens or hundreds of gigabytes per second, and may even be in the range of terabytes per second and petabytes per second.

第一處理單元11532可經組態以對大量資訊進行第一處理(預處理)以提供第一經處理資訊。 The first processing unit 11532 can be configured to perform first processing (preprocessing) on a large amount of information to provide first processed information.

記憶體控制器11533可經組態以經由大輸送量介面11534將第一經處理資訊發送至多個記憶體處理積體電路。 The memory controller 11533 can be configured to send the first processed information to a plurality of memory processing integrated circuits via the high throughput interface 11534.

多個記憶體處理積體電路11551可經組態以藉由多個記憶體處理積體電路對第一經處理資訊之至少部分進行第二處理(處理)以提供第二經處理資訊。 The plurality of memory processing integrated circuits 11551 may be configured to perform second processing (processing) on at least a part of the first processed information through the plurality of memory processing integrated circuits to provide second processed information.

記憶體控制器11533可經組態以自多個記憶體處理積體電路擷取所擷取資訊。所擷取資訊可包括以下各者中之至少一者：(a)第一經處理資訊之至少一部分；及(b)第二經處理資訊之至少一部分。 The memory controller 11533 can be configured to retrieve the captured information from multiple memory processing integrated circuits. The retrieved information may include at least one of the following: (a) at least a part of the first processed information; and (b) at least a part of the second processed information.

資料庫加速單元11535可經組態以對所擷取資訊執行資料庫處理操作，以提供資料庫加速結果。 The database acceleration unit 11535 can be configured to perform database processing operations on the retrieved information to provide database acceleration results.

資料庫加速積體電路可經組態以輸出資料庫加速結果，例如經由網路通信介面之一或多個第二埠11531(2)。 The database acceleration integrated circuit can be configured to output the database acceleration result, for example, via one or more second ports 11531(2) of the network communication interface.
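The data flow described above (receive, first processing, send to the memory processing integrated circuits, second processing, retrieve, database processing, output) can be sketched in software as a chain of stages. This is only an illustrative model under stated assumptions: the function names, the choice of decompression as the first processing, the dropping of empty rows as the second processing, and the row count as the database operation are all hypothetical stand-ins and are not prescribed by this description.

```python
import zlib


def first_processing(raw_chunks):
    """Preprocessing stage: here, decompress each received chunk."""
    return [zlib.decompress(chunk) for chunk in raw_chunks]


def second_processing(records, num_memory_ics=4):
    """Spread records round-robin over hypothetical memory processing ICs;
    each IC drops empty rows from its own shard."""
    shards = [records[i::num_memory_ics] for i in range(num_memory_ics)]
    return [[row for row in shard if row.strip()] for shard in shards]


def database_processing(shards):
    """Third stage: a toy database operation that counts the surviving rows."""
    return sum(len(shard) for shard in shards)


def accelerate(raw_chunks):
    first = first_processing(raw_chunks)    # first processed information
    second = second_processing(first)       # second processed information
    retrieved = second                      # memory controller retrieves the results
    return database_processing(retrieved)   # database acceleration result


chunks = [zlib.compress(b"row-1"), zlib.compress(b""), zlib.compress(b"row-3")]
result = accelerate(chunks)
print(result)
```

In hardware the stages would run concurrently on different information units; the sequential calls here only show the order in which information moves between the units.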

圖94E亦說明管理單元11513，該管理單元經組態以管理以下各者中之至少一者：所擷取資訊之擷取、第一處理(預處理)、第二處理(處理)及第三處理(資料庫處理)。管理單元11513可位於資料庫加速積體電路外部。 FIG. 94E also illustrates the management unit 11513, which is configured to manage at least one of the following: the retrieval of the retrieved information, the first processing (preprocessing), the second processing (processing), and the third processing (database processing). The management unit 11513 may be located outside the database acceleration integrated circuit.

管理單元可經組態以基於執行計劃而執行該管理。執行計劃可由管理單元產生，或可由位於資料庫加速積體電路外部之實體產生。執行計劃可包括以下各者中之至少一者：(a)待由資料庫加速積體電路之各種組件執行的指令、(b)實施執行計劃所需之資料及/或係數、(c)指令及/或資料之記憶體分配。 The management unit may be configured to perform the management based on an execution plan. The execution plan may be generated by the management unit, or by an entity located outside the database acceleration integrated circuit. The execution plan may include at least one of the following: (a) instructions to be executed by the various components of the database acceleration integrated circuit, (b) data and/or coefficients required to implement the execution plan, and (c) memory allocation of the instructions and/or data.
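One way to picture the three kinds of content an execution plan may carry is as a small record with one field per item. The sketch below is a hypothetical software representation only; the class name, field names, component names, and sample values are assumptions made for illustration and do not come from this description.

```python
from dataclasses import dataclass, field


@dataclass
class ExecutionPlan:
    # (a) instructions per component: component name -> list of instructions
    instructions: dict = field(default_factory=dict)
    # (b) data and/or coefficients needed to carry out the plan
    data: dict = field(default_factory=dict)
    # (c) memory allocation: item name -> (target, offset, size in bytes)
    memory_allocation: dict = field(default_factory=dict)

    def instructions_for(self, component: str) -> list:
        """Return the instructions assigned to one component (empty if none)."""
        return self.instructions.get(component, [])


plan = ExecutionPlan(
    instructions={
        "first_processing_unit": ["decompress", "filter"],
        "database_acceleration_unit": ["join"],
    },
    data={"filter_threshold": 30},
    memory_allocation={"table_A_slice_0": ("memory_ic_0", 0, 4096)},
)
print(plan.instructions_for("database_acceleration_unit"))
```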

管理單元可經組態以藉由分配以下各者中之至少一些來執行管理：(a)網路通信網路介面資源、(b)解壓縮單元資源、(c)記憶體控制器資源、(d)多個記憶體處理積體電路資源，及(e)資料庫加速單元資源。 The management unit can be configured to perform management by allocating at least some of the following: (a) network communication network interface resources, (b) decompression unit resources, (c) memory controller resources, ( d) Multiple memory processing integrated circuit resources, and (e) database acceleration unit resources.

如圖94E及圖94G中所說明，網路通信網路介面可包括不同類型之網路通信埠。 As illustrated in FIG. 94E and FIG. 94G, the network communication network interface may include different types of network communication ports.

不同類型之網路通信埠可包括儲存介面協定埠(例如，SATA埠、ATA埠、ISCSI埠、網路檔案系統、光纖通道埠)及通用網路儲存介面協定埠(例如，乙太網路ATA、乙太網路光纖通道、NVME、Roce及其他)。 Different types of network communication ports can include storage interface protocol ports (for example, SATA port, ATA port, ISCSI port, network file system, Fibre Channel port) and general network storage interface protocol ports (for example, Ethernet ATA , Ethernet Fibre Channel, NVME, Roce and others).

不同類型之網路通信埠可包括儲存介面協定埠及PCIe埠。 Different types of network communication ports can include storage interface protocol ports and PCIe ports.

圖94F包括虛線，該等虛線說明大量資訊、第一經處理資訊、所擷取資訊及資料庫加速結果之流。圖94F將資料庫加速積體電路11530說明為耦接至多個記憶體資源11550。多個記憶體資源11550可能不屬於記憶體處理積體電路。 FIG. 94F includes dashed lines that illustrate the flow of the large amount of information, the first processed information, the retrieved information, and the database acceleration results. FIG. 94F illustrates the database acceleration integrated circuit 11530 as being coupled to a plurality of memory resources 11550. The plurality of memory resources 11550 may not belong to a memory processing integrated circuit.

用於資料庫加速之裝置11520可經組態以藉由資料庫加速積體電路11530同時執行多個任務，此係因為網路通信介面11531可接收多個資訊串流(同時)，第一處理單元11532可同時對多個資訊單元執行第一處理，記憶體控制器11533可同時將多個第一經處理資訊單元發送至多個記憶體處理積體電路11551，資料庫加速單元11535可同時處理多個所擷取資訊單元。 The device 11520 for database acceleration may be configured to perform multiple tasks concurrently by the database acceleration integrated circuit 11530. This is because the network communication interface 11531 can receive multiple information streams (concurrently), the first processing unit 11532 can perform the first processing on multiple information units concurrently, the memory controller 11533 can send multiple first processed information units to the plurality of memory processing integrated circuits 11551 concurrently, and the database acceleration unit 11535 can process multiple retrieved information units concurrently.

用於資料庫加速之裝置11520可經組態以藉由大型運算系統之運算節點基於發送至資料庫加速積體電路的執行計劃而執行擷取、第一處理、發送及第三處理中之至少一者。 The device 11520 for database acceleration may be configured to perform at least one of the retrieval, the first processing, the sending, and the third processing based on an execution plan sent to the database acceleration integrated circuit by a compute node of a large computing system.

用於資料庫加速之裝置11520可經組態以用實質上最佳化資料庫加速積體電路之利用的方式管理擷取、第一處理、發送及第三處理中之至少一者。該最佳化考慮潛時、輸送量及任何其他時序或儲存或處理考慮因素，且嘗試使沿著流徑之所有組件保持忙碌且無瓶頸。 The device 11520 for database acceleration can be configured to manage at least one of acquisition, first processing, sending, and third processing in a manner that substantially optimizes the use of database acceleration integrated circuits. The optimization takes into account latency, throughput, and any other timing or storage or processing considerations, and tries to keep all components along the flow path busy and without bottlenecks.

用於資料庫加速之裝置11520可經組態以實質上最佳化藉由網路通信網路介面交換之訊務的頻寬。 The device 11520 for database acceleration can be configured to substantially optimize the bandwidth of the traffic exchanged through the network communication network interface.

用於資料庫加速之裝置11520可經組態以用實質上最佳化資料庫加速積體電路之利用的方式實質上防止在擷取、第一處理、發送及第三處理中之至少一者中形成瓶頸。 The device 11520 for database acceleration may be configured to substantially prevent the formation of bottlenecks in at least one of the retrieval, the first processing, the sending, and the third processing, in a manner that substantially optimizes the utilization of the database acceleration integrated circuit.

用於資料庫加速之裝置11520可經組態以根據時間I/O頻寬來分配資料庫加速積體電路之資源。 The device 11520 for database acceleration may be configured to allocate the resources of the database acceleration integrated circuit according to the temporal I/O bandwidth.

圖94G說明用於資料庫加速之裝置11520，該裝置包括資料庫加速積體電路11530及多個記憶體處理積體電路11551。圖94G亦說明耦接至資料庫加速積體電路11530之各種單元：遠端RAM 11546、乙太網路記憶體DIMM 11547、儲存系統11560、本端儲存單元11561及非揮發性記憶體(NVM)11563(該非揮發性記憶體可為快速NVM單元(NVME))。 FIG. 94G illustrates a device 11520 for database acceleration. The device includes a database acceleration integrated circuit 11530 and a plurality of memory processing integrated circuits 11551. FIG. 94G also illustrates various units coupled to the database acceleration integrated circuit 11530: a remote RAM 11546, an Ethernet memory DIMM 11547, a storage system 11560, a local storage unit 11561, and a non-volatile memory (NVM) 11563 (the non-volatile memory may be an NVM express (NVMe) unit).

資料庫加速積體電路11530經說明為包括乙太網路埠11531(1)、RDMA單元11545、串列擴展埠11531(15)、SATA控制器11540、PCIe埠11531(9)、第一處理單元11532、記憶體控制器11533、資料庫加速單元11535、互連件11536、管理單元11513、用於執行密碼操作之密碼編譯引擎11537，及二階靜態隨機存取記憶體(L2 SRAM)11538。 The database acceleration integrated circuit 11530 is illustrated as including an Ethernet port 11531 (1), an RDMA unit 11545, a serial expansion port 11531 (15), a SATA controller 11540, a PCIe port 11531 (9), and a first processing unit 11532, memory controller 11533, database acceleration unit 11535, interconnect 11536, management unit 11513, cryptographic engine 11537 for performing cryptographic operations, and second-level static random access memory (L2 SRAM) 11538.

資料庫加速單元經說明為包括DMA引擎11549、三階(L3)記憶體11548及資料庫加速子單元11547。資料庫加速子單元11547可為可組態單元。 The database acceleration unit is illustrated as including a DMA engine 11549, a third-level (L3) memory 11548, and a database acceleration sub-unit 11547. The database acceleration subunit 11547 can be a configurable unit.

乙太網路埠11531(1)、RDMA單元11545、串列擴展埠11531(15)、SATA控制器11540、PCIe埠11531(9)可被視為網路通信介面11531之部分。 The Ethernet port 11531 (1), RDMA unit 11545, serial expansion port 11531 (15), SATA controller 11540, PCIe port 11531 (9) can be regarded as part of the network communication interface 11531.

遠端RAM 11546、乙太網路記憶體DIMM 11547、儲存系統11560耦接至乙太網路埠11531(1)，該乙太網路埠又耦接至RDMA單元11545。 The remote RAM 11546, the Ethernet memory DIMM 11547, and the storage system 11560 are coupled to the Ethernet port 11531(1), which is in turn coupled to the RDMA unit 11545.

本端儲存單元11561耦接至SATA控制器11540。 The local storage unit 11561 is coupled to the SATA controller 11540.

PCIe埠11531(9)耦接至NVM 11563。PCIe埠亦可用於交換命令，例如用於管理目的。 The PCIe port 11531(9) is coupled to the NVM 11563. The PCIe port can also be used to exchange commands, for example for management purposes.

圖94H為資料庫加速單元11535之實例。 FIG. 94H is an example of the database acceleration unit 11535.

資料庫加速單元11535可經組態以藉由資料庫處理子單元11573同時執行資料庫處理指令，其中資料庫加速單元可包括共用一共用記憶體單元11575之資料庫加速器子單元方的群組。 The database acceleration unit 11535 can be configured to simultaneously execute database processing commands by the database processing subunit 11573, wherein the database acceleration unit can include a group of database accelerator subunits sharing a common memory unit 11575.

資料庫加速子單元11535之不同組合可動態地彼此鏈接(經由可組態鏈路或互連件11576)以提供執行可包括多個指令之資料庫處理操作所需的執行管線。 Different combinations of the database acceleration subunits 11535 may be dynamically linked to each other (via configurable links or interconnects 11576) to provide the execution pipeline required to perform a database processing operation that may include multiple instructions.

每一資料庫處理子單元可經組態以執行特定類型之資料庫處理指令(例如，篩選、合併、累加及其類似者)。 Each database processing subunit can be configured to execute specific types of database processing commands (for example, filtering, merging, accumulation, and the like).
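The idea of linking single-purpose subunits into an execution pipeline can be illustrated in software by composing small functions, each standing in for one subunit type. This is a loose analogy under stated assumptions: the helper names `make_filter`, `make_accumulate`, and `link`, and the sample rows, are hypothetical and are not part of this description.

```python
from functools import reduce


def make_filter(predicate):
    """Stand-in for a subunit configured to execute filter instructions."""
    return lambda rows: [row for row in rows if predicate(row)]


def make_accumulate(key):
    """Stand-in for a subunit configured to execute accumulate instructions."""
    return lambda rows: sum(row[key] for row in rows)


def link(*subunits):
    """Chain subunits so that each one's output feeds the next one's input."""
    return lambda rows: reduce(lambda acc, unit: unit(acc), subunits, rows)


rows = [
    {"region": "north", "sales": 10},
    {"region": "south", "sales": 5},
    {"region": "north", "sales": 7},
]

# Link a filter subunit to an accumulate subunit for one particular operation.
pipeline = link(make_filter(lambda r: r["region"] == "north"), make_accumulate("sales"))
total = pipeline(rows)
print(total)
```

A different database operation would link a different combination of the same building blocks, which mirrors the dynamic linking described above.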

圖94H亦說明耦接至快取記憶體11571之獨立資料庫處理單元11572。替代DB加速器之可重組態陣列11574或除DB加速器之可重組態陣列11574以外，亦可提供資料庫處理單元11572及快取記憶體11571。 FIG. 94H also illustrates the independent database processing unit 11572 coupled to the cache memory 11571. Instead of the reconfigurable array 11574 of the DB accelerator, or in addition to the reconfigurable array 11574 of the DB accelerator, a database processing unit 11572 and a cache memory 11571 can also be provided.

該裝置可便利向內擴展及/或向外擴展，因此使得多個資料庫加速積體電路11530(及其相關聯之記憶體資源11550或其相關聯之多個記憶體處理積體電路11551)能夠例如藉由參與資料庫操作之分散式處理而彼此相配合。 The device may facilitate scaling in and/or scaling out, thereby enabling multiple database acceleration integrated circuits 11530 (and their associated memory resources 11550 or their associated pluralities of memory processing integrated circuits 11551) to cooperate with each other, for example by participating in distributed processing of database operations.

圖94I說明包括兩個資料庫加速積體電路11530(及其相關聯之記憶體資源11550)的模組化單元，諸如刀鋒11580。該刀鋒可包括一個、兩個或多於兩個記憶體處理積體電路11551及其相關聯之記憶體資源11550。 FIG. 94I illustrates a modular unit, such as the blade 11580, that includes two database acceleration integrated circuits 11530 (and its associated memory resource 11550). The blade may include one, two, or more than two memory processing integrated circuits 11551 and their associated memory resources 11550.

該刀鋒亦可包括一或多個非揮發性記憶體單元、乙太網路交換器及PCIe交換器。 The blade may also include one or more non-volatile memory units, an Ethernet switch, and a PCIe switch.

多個刀鋒可使用任何通信方法、通信協定及連接性彼此通信。 Multiple blades can communicate with each other using any communication method, communication protocol, and connectivity.

圖94I說明彼此完全連接之四個資料庫加速積體電路11530(及其相關聯之記憶體資源11550)，每一資料庫加速積體電路11530連接至所有三個其他資料庫加速積體電路11530。連接性可使用任何通信協定，例如藉由使用乙太網路RDMA協定達成。 FIG. 94I illustrates four database acceleration integrated circuits 11530 (and its associated memory resource 11550) that are fully connected to each other. Each database acceleration integrated circuit 11530 is connected to all three other database acceleration integrated circuits 11530 . Connectivity can use any communication protocol, for example, by using the Ethernet RDMA protocol.

圖94I亦說明資料庫加速積體電路11530，該資料庫加速積體電路連接至其相關聯之記憶體資源11550以及包括RAM記憶體及乙太網路埠之單元11531。 FIG. 94I also illustrates the database acceleration integrated circuit 11530, which is connected to its associated memory resources 11550 and to a unit 11531 that includes RAM memory and an Ethernet port.

圖94J、圖94K、圖94L及圖94M說明資料庫加速積體電路之四個群組11580，每一群組包括四個資料庫加速積體電路11530(彼此完全連接)及其相關聯之記憶體資源11550。不同群組經由交換器11590彼此連接。 FIG. 94J, FIG. 94K, FIG. 94L, and FIG. 94M illustrate four groups 11580 of database acceleration integrated circuits, each group including four database acceleration integrated circuits 11530 (fully connected to each other) and their associated memory resources 11550. The different groups are connected to each other via a switch 11590.

群組之數目可為兩個、三個或多於四個。每群組之資料庫加速積體電路的數目可為兩個、三個或多於四個。群組之數目可相同於(或可不同於)每群組之資料庫加速積體電路的數目。 The number of groups can be two, three or more than four. The number of database accelerated integrated circuits in each group can be two, three, or more than four. The number of groups can be the same as (or can be different from) the number of database acceleration integrated circuits per group.

圖94K說明兩個表A及B，該兩個表過大(例如，1萬億位元組)而無法一次高效地接合。 FIG. 94K illustrates two tables A and B that are too large (for example, 1 terabyte) to be joined efficiently in a single pass.

將表實際上分段成分片且將接合操作應用於包括表A之分片及表B之分片的對。 The tables are virtually segmented into slices, and the join operation is applied to pairs that each include a slice of table A and a slice of table B.

資料庫加速積體電路之群組可用各種方式處理分片。 The group of the database accelerated integrated circuit can handle the slices in various ways.

舉例而言，裝置可經組態以藉由以下操作來執行分散式處理程序： For example, the device can be configured to perform distributed processing by:

g.將不同的第一資料結構部分(表A之分片，例如第一至第十六分片A0至A15)分配給一或多個群組之不同資料庫加速積體電路。 g. Allocating different first data structure portions (slices of table A, for example the first through sixteenth slices A0 to A15) to different database acceleration integrated circuits of the one or more groups.

h.執行以下各者之多個反覆：(i)將不同的第二資料結構部分(表B之分片，例如第一直至第十六分片B0至B15)新分配給一或多個群組之不同資料庫加速積體電路；及(ii)藉由資料庫加速積體電路處理第一及第二資料結構部分。 h. Performing multiple iterations of the following: (i) newly allocating different second data structure portions (slices of table B, for example the first through sixteenth slices B0 to B15) to different database acceleration integrated circuits of the one or more groups; and (ii) processing the first and second data structure portions by the database acceleration integrated circuits.

裝置可經組態以用與當前反覆之處理至少部分時間重疊的方式執行下一反覆之新分配。 The device can be configured to perform a new allocation for the next iteration in a manner that overlaps at least part of the time with the current iteration.

裝置可經組態以藉由在不同資料庫加速積體電路之間交換第二資料結構部分來執行新分配。 The device can be configured to perform a new allocation by exchanging the second data structure part between different database accelerated integrated circuits.

交換可用與處理程序至少部分時間重疊之方式執行。 The exchange can be performed in a manner that overlaps at least part of the time with the processing procedure.
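Overlapping the next reallocation with the current processing resembles prefetching with double buffering. The sketch below models it with a single background worker that fetches the next slice while the current one is being processed. It is an analogy under stated assumptions: `fetch_slice` and `process` are hypothetical placeholders for the exchange and the join work, and the sample data is invented.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_slice(index):
    """Placeholder for receiving slice `index` of the second data structure."""
    return list(range(index * 3, index * 3 + 3))


def process(a_slice, b_slice):
    """Placeholder for the join work: count values present in both slices."""
    return len(set(a_slice) & set(b_slice))


def run(a_slice, num_b_slices):
    results = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(fetch_slice, 0)
        for i in range(num_b_slices):
            current = pending.result()  # wait for the slice needed now
            if i + 1 < num_b_slices:
                # Start fetching the next slice; this overlaps the processing below.
                pending = pool.submit(fetch_slice, i + 1)
            results.append(process(a_slice, current))
    return results


matches = run([1, 4, 5, 100], 3)
print(matches)
```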

裝置可經組態以藉由以下操作來執行新分配：在群組之不同資料庫加速積體電路之間交換第二資料結構部分；及一旦該交換已完成，則在資料庫加速積體電路之不同群組之間交換第二資料結構部分。 The device can be configured to perform a new allocation by: exchanging the second data structure part between the different database acceleration integrated circuits of the group; and once the exchange has been completed, then accelerating the integrated circuit in the database The second data structure part is exchanged between different groups.

在圖94K中，展示接合操作中之一些的四個循環，例如參考左上方群組之左上方資料庫加速積體電路11530，四個循環包括計算Join(A0，B0)、Join(A0，B3)、Join(A0，B2)及Join(A0，B1)。在此等四個循環期間，A0保持在同一資料庫加速積體電路11530處，而矩陣B之分片(B0、B1、B2及B3)在資料庫加速積體電路11530之同一群組的成員之間旋轉。 FIG. 94K shows four cycles of some of the join operations. For example, for the upper-left database acceleration integrated circuit 11530 of the upper-left group, the four cycles include computing Join(A0, B0), Join(A0, B3), Join(A0, B2), and Join(A0, B1). During these four cycles, A0 remains at the same database acceleration integrated circuit 11530, while the slices of matrix B (B0, B1, B2, and B3) rotate among the members of the same group of database acceleration integrated circuits 11530.

在圖94L中，第二矩陣之分片在不同群組之間旋轉，(a)將分片B0、B1、B2及B3(先前由左上方群組處理)自左上方群組發送至左下方群組，(b)將分片B4、B5、B6及B7(先前由左下方群組處理)自左下方群組發送至右上方群組，(c)將分片B8、B9、B10及B11(先前由右上方群組處理)自右上方群組發送至右下方群組，且(d)將分片B12、B13、B14及B15(先前由右下方群組處理)自右下方群組發送至左上方群組。 In Figure 94L, the slices of the second matrix are rotated between different groups. (a) The slices B0, B1, B2, and B3 (previously processed by the upper left group) are sent from the upper left group to the lower left Group, (b) send the slices B4, B5, B6 and B7 (previously processed by the lower left group) from the lower left group to the upper right group, (c) send the slices B8, B9, B10 and B11 (Previously processed by the upper right group) sent from the upper right group to the lower right group, and (d) segments B12, B13, B14, and B15 (previously processed by the lower right group) were sent from the lower right group To the upper left group.
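The rotation scheme of FIG. 94K and FIG. 94L can be checked with a short simulation: slices of A stay fixed, slices of B rotate inside each group for four cycles, and then each group hands its B slices to the next group. The sketch below assumes four groups of four integrated circuits, indexes the groups 0 to 3 in the order upper left, lower left, upper right, lower right, and uses hypothetical names throughout; it is a model of the schedule, not an implementation of the device.

```python
from itertools import product

GROUPS, PER_GROUP = 4, 4
N = GROUPS * PER_GROUP


def schedule():
    """Yield (round, cycle, ic, a_slice, b_slice) for every join that is executed."""
    b_at = list(range(N))  # b_at[ic] = index of the B slice currently held by that IC
    for rnd in range(GROUPS):
        for cycle in range(PER_GROUP):
            for ic in range(N):
                yield rnd, cycle, ic, ic, b_at[ic]  # A slice number `ic` never moves
            # Intra-group rotation: each IC passes its B slice to the next IC in its group.
            rotated = b_at[:]
            for g in range(GROUPS):
                base = g * PER_GROUP
                for i in range(PER_GROUP):
                    rotated[base + (i + 1) % PER_GROUP] = b_at[base + i]
            b_at = rotated
        # Inter-group rotation: each group hands its four B slices to the next group.
        b_at = [b_at[(ic - PER_GROUP) % N] for ic in range(N)]


joins = list(schedule())
pairs = {(a, b) for _, _, _, a, b in joins}
all_pairs = set(product(range(N), repeat=2))
first_ic = [b for rnd, _, ic, _, b in joins if ic == 0 and rnd == 0]

print(first_ic)            # B slices seen by the first IC during the first four cycles
print(pairs == all_pairs)  # whether every (A slice, B slice) pair was joined
```

In this model the first integrated circuit joins A0 with B0, B3, B2, and B1 in turn, matching the order given for FIG. 94K, and after four rounds each of the 256 slice pairs has been processed exactly once.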

圖94N為系統之實例，該系統包括多個刀鋒11580、SATA控制器11540、本端儲存單元11561、NVME 11563、PCIe交換器11601、乙太網路記憶體DIMM 11547及乙太網路埠11531(4)。 FIG. 94N is an example of a system that includes multiple blades 11580, a SATA controller 11540, a local storage unit 11561, an NVMe unit 11563, a PCIe switch 11601, an Ethernet memory DIMM 11547, and Ethernet ports 11531(4).

刀鋒11580可耦接至PCIE交換器11601、乙太網路埠11531及SATA控制器11540中之每一者。 The blade 11580 can be coupled to each of the PCIE switch 11601, the Ethernet port 11531, and the SATA controller 11540.

圖94O說明兩個系統11621及11622。 Figure 94O illustrates two systems 11621 and 11622.

系統11621可包括用於資料庫加速之一或多個裝置11520、交換系統11611、儲存系統11612及運算系統11613。交換系統11611提供用於資料庫加速之一或多個裝置11520、儲存系統11612及運算系統11613之間的連接性。 The system 11621 may include one or more devices 11520 for database acceleration, an exchange system 11611, a storage system 11612, and a computing system 11613. The exchange system 11611 provides connectivity between one or more of the devices 11520, the storage system 11612, and the computing system 11613 for database acceleration.

系統11622可包括儲存系統以及用於資料庫加速之一或多個裝置11615、交換系統11611及運算系統11613。交換系統11611提供儲存系統以及用於資料庫加速之一或多個裝置11615及運算系統11613之間的連接性。 The system 11622 may include a storage system and one or more devices 11615 for database acceleration, an exchange system 11611, and a computing system 11613. The exchange system 11611 provides a storage system and a connection between one or more devices 11615 and a computing system 11613 for database acceleration.

圖95A說明用於資料庫加速之方法11200。 Figure 95A illustrates a method 11200 for database acceleration.

方法11200可開始於藉由資料庫加速積體電路之網路通信網路介面自大量儲存單元擷取大量資訊的步驟11210。 The method 11200 can start at step 11210 of retrieving a large amount of information from a large number of storage units through a network communication network interface of a database accelerated integrated circuit.

連接至大量儲存單元(例如，使用多個不同匯流排)使得網路通信網路介面能夠接收大量資訊，即使當單個儲存單元具有有限輸送量時亦如此。 Connecting to a large number of storage units (for example, using multiple different buses) enables the network communication network interface to receive a large amount of information, even when a single storage unit has a limited throughput.

步驟11210之後可接著為對大量資訊進行第一處理以提供第一經處理資訊。第一處理可包括緩衝、自有效負載提取資訊、移除標頭、解壓縮、壓縮、解密、篩選資料庫查詢或執行任何其他處理操作。第一處理亦可能限於緩衝。 Step 11210 can be followed by performing first processing on a large amount of information to provide first processed information. The first processing may include buffering, extracting information from the payload, removing headers, decompressing, compressing, decrypting, filtering database queries, or performing any other processing operations. The first processing may also be limited to buffering.

步驟11210之後可接著為藉由資料庫加速積體電路之記憶體控制器且經由大輸送量介面將第一經處理資訊發送至多個記憶體處理積體電路的步驟11220，其中每一記憶體處理積體電路可包括控制器、多個處理器子單元及多個記憶體單元。記憶體處理積體電路可為記憶體/處理單元或分散式處理器或記憶體晶片，如本專利申請案之任何其他部分中所說明。 Step 11210 may be followed by step 11220 of sending, by a memory controller of the database acceleration integrated circuit and via a high-throughput interface, the first processed information to a plurality of memory processing integrated circuits, where each memory processing integrated circuit may include a controller, a plurality of processor subunits, and a plurality of memory units. The memory processing integrated circuits may be memory/processing units, distributed processors, or memory chips, as described in any other part of this patent application.

步驟11220之後可接著為藉由多個記憶體處理積體電路對第一經處理資訊之至少部分進行第二處理以提供第二經處理資訊的步驟11230。 Step 11220 can be followed by step 11230 of performing a second processing on at least part of the first processed information by a plurality of memory processing integrated circuits to provide second processed information.

步驟11230可包括藉由資料庫加速積體電路同時執行多個任務。 Step 11230 may include accelerating the integrated circuit to perform multiple tasks at the same time through the database.

步驟11230可包括藉由資料庫處理子單元同時執行資料庫處理指令，其中資料庫加速單元可包括共用一共用記憶體單元之資料庫加速器子單元的群組。 Step 11230 may include simultaneous execution of database processing commands by the database processing subunits, where the database acceleration unit may include a group of database accelerator subunits sharing a common memory unit.

步驟11230之後可接著為藉由資料庫加速積體電路之記憶體控制器自多個記憶體處理積體電路擷取所擷取資訊的步驟11240，其中所擷取資訊可包括以下各者中之至少一者：(a)第一經處理資訊之至少一部分；及(b)第二經處理資訊之至少一部分。 Step 11230 may be followed by step 11240 of retrieving, by the memory controller of the database acceleration integrated circuit, retrieved information from the plurality of memory processing integrated circuits, where the retrieved information may include at least one of the following: (a) at least a part of the first processed information; and (b) at least a part of the second processed information.

步驟11240之後可接著為藉由資料庫加速積體電路之資料庫加速單元對所擷取資訊執行資料庫處理操作以提供資料庫加速結果的步驟11250。 Step 11240 may be followed by step 11250 of performing, by a database acceleration unit of the database acceleration integrated circuit, database processing operations on the retrieved information to provide database acceleration results.

步驟11250可包括根據時間I/O頻寬分配資料庫加速積體電路之資源。 Step 11250 may include allocating the resources of the database acceleration integrated circuit according to the temporal I/O bandwidth.

步驟11250之後可接著為輸出資料庫加速結果之步驟11260。 Step 11250 can be followed by step 11260 of outputting the database acceleration result.

步驟11260可包括動態地鏈接資料庫處理子單元以提供執行可包括多個指令之資料庫處理操作所需的執行管線。 Step 11260 may include dynamically linking the database processing subunits to provide an execution pipeline required to perform database processing operations that may include multiple instructions.

步驟11260可包括將資料庫加速結果輸出至本端儲存器及自本端儲存器擷取資料庫加速結果。 Step 11260 may include outputting the database acceleration result to the local storage and retrieving the database acceleration result from the local storage.

應注意，方法11200之步驟11210、11220、11230、11240、11250及11260或任何其他步驟可用管線化方式執行。可同時或以不同於上文所提及之次序的次序執行此等步驟。 It should be noted that steps 11210, 11220, 11230, 11240, 11250, and 11260, or any other steps of method 11200, may be executed in a pipelined manner. These steps may be performed concurrently or in an order different from the order mentioned above.

舉例而言，步驟11220之後可接著為步驟11250，使得第一經處理資訊由資料庫加速單元進一步處理。 For example, step 11220 may be followed by step 11250, so that the first processed information is further processed by the database acceleration unit.

又對於另一實例，第一經處理資訊可發送至多個記憶體處理積體電路，且接著發送(不由多個記憶體處理積體電路處理)至資料庫加速單元。 For yet another example, the first processed information can be sent to multiple memory processing integrated circuits, and then sent (not processed by multiple memory processing integrated circuits) to the database acceleration unit.

又對於另一實例，第一經處理資訊及/或第二經處理資訊可自資料庫加速積體電路輸出，而不由資料庫加速度單元進行資料庫處理。 For another example, the first processed information and/or the second processed information can be output from the database acceleration integrated circuit, instead of database processing by the database acceleration unit.

該方法可包括藉由大型運算系統之運算節點基於發送至資料庫加速積體電路的執行計劃而執行以下操作中之至少一者：擷取、第一處理、發送及第三處理。 The method may include performing at least one of the following operations based on an execution plan sent to the database acceleration integrated circuit by a compute node of a large computing system: the retrieval, the first processing, the sending, and the third processing.

該方法可包括以實質上最佳化資料庫加速積體電路之利用的方式管理擷取、第一處理、發送及第三處理中之至少一者。 The method may include managing at least one of the retrieval, the first processing, the sending, and the third processing in a manner that substantially optimizes the utilization of the database acceleration integrated circuit.

該方法可包括實質上最佳化藉由網路通信網路介面交換之訊務的頻寬。 The method may include substantially optimizing the bandwidth of the traffic exchanged through the network communication network interface.

該方法可包括以實質上最佳化資料庫加速積體電路之利用的方式實質上防止在擷取、第一處理、發送及第三處理中之至少一者中形成瓶頸。 The method may include substantially preventing the formation of bottlenecks in at least one of the retrieval, the first processing, the sending, and the third processing, in a manner that substantially optimizes the utilization of the database acceleration integrated circuit.

方法11200亦可包括以下步驟中之至少一者： The method 11200 may also include at least one of the following steps:

步驟11270可包括藉由資料庫加速積體電路之管理單元來管理擷取、第一處理、發送及第三處理中之至少一者。 Step 11270 may include managing at least one of retrieval, first processing, sending, and third processing by the management unit of the database accelerated integrated circuit.

該管理可基於由資料庫加速積體電路之管理單元產生的執行計劃而執行。 The management can be performed based on the execution plan generated by the management unit of the database accelerated integrated circuit.

該管理可基於由資料庫加速積體電路之管理單元接收而並非由管理單元產生的執行計劃而執行。 The management can be performed based on the execution plan received by the management unit of the database accelerated integrated circuit, but not generated by the management unit.

該管理可包括分配以下各者中之至少一些：(a)網路通信網路介面資源、(b)解壓縮單元資源、(c)記憶體控制器資源、(d)多個記憶體處理積體電路資源，及(e)資料庫加速單元資源。 The management may include allocating at least some of the following: (a) network communication interface resources, (b) decompression unit resources, (c) memory controller resources, (d) resources of the plurality of memory processing integrated circuits, and (e) database acceleration unit resources.

步驟11271可包括藉由大型運算系統之運算節點控制擷取、第一處理、發送及第三處理中之至少一者中之至少一者。 Step 11271 may include controlling at least one of retrieval, first processing, sending, and third processing by the computing node of the large computing system.

步驟11272可包括藉由位於資料庫加速積體電路外部之管理單元來管理擷取、第一處理、發送及第三處理中之至少一者。 Step 11272 may include managing at least one of capture, first processing, sending, and third processing by a management unit located outside the database acceleration integrated circuit.

圖95B說明用於操作資料庫加速積體電路之群組的方法11300。 FIG. 95B illustrates a method 11300 for operating a group of database acceleration integrated circuits.

方法11300可開始於藉由資料庫加速積體電路執行資料庫加速操作之步驟11310。步驟11310可包括執行方法11200之一或多個步驟。 The method 11300 can start at step 11310 of performing a database acceleration operation by the database acceleration integrated circuit. Step 11310 may include performing one or more steps of method 11200.

方法11300亦可包括在資料庫加速積體電路之一或多個群組的資料庫加速積體電路之間交換(a)資訊及(b)資料庫加速結果中之至少一者的步驟11320。 The method 11300 may also include a step 11320 of exchanging at least one of (a) information and (b) database acceleration results between one or more groups of database acceleration integrated circuits.

步驟11310及11320之組合可相當於藉由一或多個群組之資料庫加速積體電路執行分散式處理。 The combination of steps 11310 and 11320 may amount to performing distributed processing by the database acceleration integrated circuits of the one or more groups.

可使用一或多個群組之資料庫加速積體電路的網路通信網路介面執行交換。 The exchange may be performed using the network communication interfaces of the database acceleration integrated circuits of the one or more groups.

可經由多個群組執行交換，該等群組可藉由星形連接而彼此連接。 The exchange can be performed through multiple groups, and the groups can be connected to each other by a star connection.

步驟11320可包括使用至少一個交換器以用於在一或多個群組中之不同群組的資料庫加速積體電路之間交換以下各者中之至少一者：(a)資訊；及(b)資料庫加速結果。 Step 11320 may include using at least one switch to exchange, between database acceleration integrated circuits of different groups among the one or more groups, at least one of the following: (a) information; and (b) database acceleration results.

步驟11310可包括藉由一或多個群組中之一些的資料庫加速積體電路中之一些執行分散式處理的步驟11311。 Step 11310 may include step 11311 of performing distributed processing by some of the database acceleration integrated circuits of some of the one or more groups.

步驟11311可包括執行第一及第二資料結構之分散式處理，其中第一及第二資料結構之總大小超過多個記憶體處理積體電路之儲存能力。 Step 11311 may include executing distributed processing of the first and second data structures, where the total size of the first and second data structures exceeds the storage capacity of multiple memory processing integrated circuits.

分散式處理之執行可包括執行以下各者之多個反覆：(a)執行將第一資料結構部分及第二資料結構部分之不同對新分配給不同資料庫加速積體電路；及(b)處理不同對。 The execution of distributed processing may include the execution of multiple iterations of the following: (a) execution of newly allocating different pairs of the first data structure part and the second data structure part to different database accelerated integrated circuits; and (b) Deal with different pairs.

分散式處理之執行可包括執行資料庫接合操作。 The execution of distributed processing may include the execution of database joining operations.

步驟11310可包括(a)將不同的第一資料結構部分分配給一或多個群組之不同資料庫加速積體電路的步驟11312；及(b)執行以下各者之多個反覆：將不同的第二資料結構部分新分配給一或多個群組之不同資料庫加速積體電路的步驟11314；及藉由資料庫加速積體電路處理第一及第二資料結構部分的步驟11316。 Step 11310 may include (a) step 11312 of allocating different first data structure portions to different database acceleration integrated circuits of the one or more groups; and (b) performing multiple iterations of the following: step 11314 of newly allocating different second data structure portions to different database acceleration integrated circuits of the one or more groups; and step 11316 of processing the first and second data structure portions by the database acceleration integrated circuits.

可用與當前反覆之處理至少部分時間重疊的方式執行步驟11314。 Step 11314 may be performed in a manner that overlaps at least part of the time with the current repeated processing.

步驟11314可包括在不同資料庫加速積體電路之間交換第二資料結構部分。 Step 11314 may include exchanging the second data structure part between different database accelerated integrated circuits.

可用與步驟11310至少部分時間重疊之方式執行步驟11320。 Step 11320 can be performed in a manner that overlaps with step 11310 at least partially in time.

步驟11314可包括在群組之不同資料庫加速積體電路之間交換第二資料結構部分；及一旦交換已完成，便在資料庫加速積體電路之不同群組之間交換第二資料結構部分。 Step 11314 may include exchanging the second data structure part between the different database acceleration integrated circuits of the group; and once the exchange is completed, exchanging the second data structure part between the different groups of the database acceleration integrated circuit .

圖95C說明用於資料庫加速之方法11350。 Figure 95C illustrates method 11350 for database acceleration.

方法11350可包括藉由資料庫加速積體電路之網路通信網路介面自大量儲存單元擷取大量資訊的步驟11352。 The method 11350 may include step 11352 of retrieving, by a network communication interface of a database acceleration integrated circuit, a large amount of information from a large number of storage units.

步驟11352之後可接著為對大量資訊進行第一處理以提供第一經處理資訊的步驟11354。 Step 11352 can be followed by step 11354 of performing first processing on a large amount of information to provide first processed information.

步驟11352之後可接著藉由資料庫加速積體電路之記憶體控制器且經由大輸送量介面將第一經處理資訊發送至多個記憶體資源的步驟11354。 Step 11352 may be followed by step 11354 of sending, by a memory controller of the database acceleration integrated circuit and via a high-throughput interface, the first processed information to a plurality of memory resources.

步驟11354之後可接著為自多個記憶體資源擷取所擷取資訊之步驟11356。 Step 11354 can be followed by step 11356 of retrieving the retrieved information from multiple memory resources.

步驟11356之後可接著為藉由資料庫加速積體電路之資料庫加速單元對所擷取資訊執行資料庫處理操作以提供資料庫加速結果的步驟11358。 Step 11356 can be followed by step 11358 in which the database acceleration unit of the database acceleration integrated circuit performs a database processing operation on the retrieved information to provide a database acceleration result.

步驟11358之後可接著為輸出資料庫加速結果之步驟11359。 Step 11358 can be followed by step 11359 of outputting the database acceleration result.

該方法亦可包括對第一經處理資訊進行第二處理以提供第二經處理資訊之步驟11355。第二處理由多個處理器執行，該多個處理器位於進一步包含多個記憶體資源之一或多個記憶體處理積體電路中。步驟11355在步驟11354之後且在步驟11356之前。 The method may also include a step 11355 of performing second processing on the first processed information to provide second processed information. The second processing is executed by a plurality of processors, and the plurality of processors are located in an integrated circuit that further includes one or more of a plurality of memory resources. Step 11355 is after step 11354 and before step 11356.

第二經處理資訊之總大小可小於第一經處理資訊之總大小。 The total size of the second processed information may be smaller than the total size of the first processed information.

第一經處理資訊之總大小可小於大量資訊之總大小。 The total size of the first processed information may be smaller than the total size of the large amount of information.

第一處理可包括篩選資料庫條目。因此，在執行任何其他處理之前及/或甚至在將不相關的資料庫條目儲存於多個記憶體資源之前，篩選出與查詢不相關之資料庫條目，藉此節省頻寬、儲存資源及其他處理資源。 The first processing may include filtering database entries. Database entries that are irrelevant to the query are thus filtered out before any other processing is performed, and/or even before the irrelevant database entries are stored in the multiple memory resources, thereby saving bandwidth, storage resources, and other processing resources.

第二處理可包括篩選資料庫條目。篩選可在篩選條件可為複雜的(包括多個條件)時應用，且可能需要在篩選進行之前接收多個資料庫條目欄位。舉例而言，當搜尋(a)超過某一年齡且喜歡香蕉之人及(b)超過另一年齡且喜歡蘋果之人時。 The second process may include filtering database entries. Filtering can be applied when the filtering conditions can be complex (including multiple conditions), and it may be necessary to receive multiple database entry fields before the filtering can proceed. For example, when searching for (a) people over a certain age who like bananas and (b) people over another age who like apples.
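The two filtering stages can be sketched as follows. This is only an illustrative model under stated assumptions: the `Entry` record, the field names, and the age thresholds are hypothetical, and the first stage is assumed to be a coarse single-field filter while the second stage applies the compound condition from the example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    age: int
    likes: str


def first_processing(entries, age_a, age_b):
    """Coarse single-field filter: drop entries too young for either branch."""
    floor = min(age_a, age_b)
    return [e for e in entries if e.age > floor]


def second_processing(entries, age_a, age_b):
    """Compound filter that needs both the age and the preference fields."""
    return [
        e for e in entries
        if (e.age > age_a and e.likes == "banana")
        or (e.age > age_b and e.likes == "apple")
    ]


entries = [
    Entry("A", 20, "banana"),
    Entry("B", 35, "banana"),
    Entry("C", 45, "apple"),
    Entry("D", 35, "apple"),
    Entry("E", 50, "cherry"),
]
stage1 = first_processing(entries, age_a=30, age_b=40)
stage2 = second_processing(stage1, age_a=30, age_b=40)
```

In this toy data set the entry count shrinks at each stage, which mirrors the statement that the second processed information may be smaller than the first processed information, which in turn may be smaller than the retrieved information.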

資料庫database

以下實例可參考資料庫。資料庫可為資料中心，可為資料中心之部分，或可能不屬於資料中心。 The following examples can refer to the database. The database may be a data center, may be part of a data center, or may not belong to a data center.

資料庫可經由一或多個網路耦接至多個使用者。資料庫可為雲端資料庫。 The database can be coupled to multiple users via one or more networks. The database can be a cloud database.

可提供包括一或多個管理單元及多個資料庫加速器板之資料庫，該等加速器板包括一或多個記憶體/處理單元。 A database including one or more management units and multiple database accelerator boards can be provided. The accelerator boards include one or more memory/processing units.

圖96B說明資料庫12020，該資料庫包括管理單元12021及多個DB加速器板12022，該等加速器板各包括通信/管理處理器(處理器12024)及多個記憶體/處理單元12026。 FIG. 96B illustrates the database 12020. The database includes a management unit 12021 and a plurality of DB accelerator boards 12022, each of which includes a communication/management processor (processor 12024) and a plurality of memory/processing units 12026.

處理器12024可支援各種通信協定，諸如但不限於PCIe、類似ROCE之協定，及其類似者。 The processor 12024 can support various communication protocols, such as but not limited to PCIe, ROCE-like protocols, and the like.

資料庫命令可由記憶體/處理單元12026執行，且處理器可在記憶體/處理單元12026之間、在不同DB加速器板12022之間且與管理單元12021投送訊務。 Database commands may be executed by the memory/processing units 12026, and the processor may route traffic between the memory/processing units 12026, between different DB accelerator boards 12022, and to and from the management unit 12021.

尤其在包括大型內部記憶體組時，使用多個記憶體/處理單元12026可顯著加速資料庫命令之執行且避免通信瓶頸。 Especially when a large internal memory bank is included, the use of multiple memory/processing units 12026 can significantly speed up the execution of database commands and avoid communication bottlenecks.

圖96C說明包括處理器12024及多個記憶體/處理單元12026之DB加速器板12022。處理器12024包括多個通信專用組件，諸如用於與記憶體/處理單元12026、RDMA引擎12031、DB查詢資料庫引擎12034及其類似者通信之DDR控制器12033。DDR控制器為通信控制器之實例，且RDMA引擎為任何通信引擎之實例。 FIG. 96C illustrates a DB accelerator board 12022 that includes a processor 12024 and multiple memory/processing units 12026. The processor 12024 includes multiple communication-dedicated components, such as a DDR controller 12033 for communicating with the memory/processing units 12026, an RDMA engine 12031, a DB query database engine 12034, and the like. The DDR controller is an example of a communication controller, and the RDMA engine is an example of a communication engine.

可提供一種用於操作圖96B、圖96C及圖96D中之任一者之系統(或操作系統之任何部分)的方法。 A method may be provided for operating the system of any one of FIG. 96B, FIG. 96C, and FIG. 96D (or for operating any part of that system).

應注意，資料庫加速積體電路11530可與多個記憶體資源相關聯，該等記憶體資源不包括於多個記憶體處理積體電路中或以其他方式不與處理單元相關聯。在此狀況下，處理主要且甚至僅由資料庫加速積體電路執行。 It should be noted that the database acceleration integrated circuit 11530 may be associated with multiple memory resources, which are not included in the multiple memory processing integrated circuits or otherwise not associated with the processing unit. In this situation, the processing is mainly and even only performed by the database accelerated integrated circuit.

圖94P說明用於資料庫加速之方法11700。 Figure 94P illustrates a method 11700 for database acceleration.

方法11700可包括藉由資料庫加速積體電路之網路通信介面自儲存單元擷取資訊的步驟11710。 Method 11700 may include step 11710 of retrieving information from storage units by a network communication interface of a database acceleration integrated circuit.

步驟11710之後可接著為對資訊量進行第一處理以提供第一經處理資訊的步驟11720。 Step 11710 can be followed by step 11720 of performing a first processing on the amount of information to provide first processed information.

步驟11720之後可接著為藉由資料庫加速積體電路之記憶體控制器且經由輸送量介面將第一經處理資訊發送至多個記憶體資源的步驟11730。 Step 11720 may be followed by step 11730 of sending, by a memory controller of the database acceleration integrated circuit and over a throughput interface, the first processed information to multiple memory resources.

步驟11730之後可接著為自多個記憶體資源擷取資訊之步驟11740。 Step 11730 can be followed by step 11740 of retrieving information from multiple memory resources.

步驟11740之後可接著為藉由資料庫加速積體電路之資料庫加速單元對所擷取資訊執行資料庫處理操作以提供資料庫加速結果的步驟11750。 Step 11740 can be followed by step 11750 in which the database acceleration unit of the database acceleration integrated circuit performs a database processing operation on the retrieved information to provide a database acceleration result.

步驟11750之後可接著為輸出資料庫加速結果之步驟11760。 Step 11750 can be followed by step 11760 of outputting the database acceleration result.

第一處理及/或第二處理可包括篩選資料庫條目，判定應進一步處理哪些資料庫條目。 The first processing and/or the second processing may include screening database entries to determine which database entries should be further processed.

第二處理包含篩選資料庫條目。 The second process involves screening database entries.

混合系統 Hybrid system

記憶體/處理單元在執行可為記憶體密集的及/或瓶頸與擷取操作相關之計算時可為高效的。當瓶頸與運算操作相關時，面向處理(且較少面向記憶體)之處理器單元(諸如但不限於圖形處理單元、中央處理單元)可更有效。 Memory/processing units may be efficient when performing computations that are memory-intensive and/or whose bottleneck relates to retrieval operations. When the bottleneck relates to arithmetic operations, processing-oriented (and less memory-oriented) processor units (such as, but not limited to, graphics processing units and central processing units) may be more effective.

混合系統可包括彼此可完全或部分連接之一或多個處理器單元及一或多個記憶體/處理單元兩者。 The hybrid system may include both one or more processor units and one or more memory/processing units that can be fully or partially connected to each other.

記憶體/處理單元(MPU)可藉由相比邏輯胞元更佳地適合記憶體胞元之第一製造製程來製造。舉例而言，由第一製造製程製造之記憶胞元可展現相比由第一製造製程製造之邏輯電路之臨界尺寸較小且甚至小得多(例如，小超過2倍、3倍、4倍、5倍、6倍、7倍、8倍、9倍、10倍及其類似者)的臨界尺寸。舉例而言，第一製造製程可為類比製造製程，第一製造製程可為DRAM製造製程，及其類似者。 A memory/processing unit (MPU) may be manufactured by a first manufacturing process that is better suited to memory cells than to logic cells. For example, memory cells manufactured by the first manufacturing process may exhibit a critical dimension that is smaller, and even much smaller (for example, smaller by a factor exceeding 2, 3, 4, 5, 6, 7, 8, 9, 10, and the like), than the critical dimension of logic circuits manufactured by the first manufacturing process. For example, the first manufacturing process may be an analog manufacturing process, a DRAM manufacturing process, and the like.

處理器可由較佳地適合邏輯之第二製造製程製造。舉例而言，由第二製造製程製造之邏輯電路的臨界尺寸可比由第一製造製程製造之邏輯電路的臨界尺寸小且甚至小得多。又對於另一實例，由第二製造製程製造之邏輯電路的臨界尺寸可比由第一製造製程製造之記憶體胞元的臨界尺寸小且甚至小得多。舉例而言，第二製造製程可為類比製造製程，第二製造製程可為CMOS製造製程，及其類似者。 The processor may be manufactured by a second manufacturing process that is better suited to logic. For example, the critical dimension of logic circuits manufactured by the second manufacturing process may be smaller, and even much smaller, than the critical dimension of logic circuits manufactured by the first manufacturing process. As yet another example, the critical dimension of logic circuits manufactured by the second manufacturing process may be smaller, and even much smaller, than the critical dimension of memory cells manufactured by the first manufacturing process. For example, the second manufacturing process may be an analog manufacturing process, a CMOS manufacturing process, and the like.

可藉由考慮每一單元之益處及與在單元之間傳送資料相關的任何懲罰而以靜態或動態方式在不同單元之間分配任務。 Tasks can be allocated between different units in a static or dynamic manner by considering the benefits of each unit and any penalties associated with transferring data between the units.

舉例而言，可將記憶體密集型處理程序分配給記憶體/處理單元，而可將處理密集型記憶體輕處理分配給處理單元。 For example, memory-intensive processes may be allocated to the memory/processing units, while processing-intensive, memory-light processes may be allocated to the processing units.
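One way to picture such an allocation is a simple cost comparison that weighs each unit's benefit against the penalty of moving data to the processor. The sketch below is hypothetical throughout: the function name `choose_unit`, the linear cost model, and all bandwidth and throughput figures are illustrative assumptions, not values taken from this specification.

```python
def choose_unit(bytes_touched, ops, *,
                mpu_mem_bw=100e9,   # assumed in-memory bandwidth, bytes/s
                mpu_ops=1e9,        # assumed MPU compute rate, ops/s
                link_bw=10e9,       # assumed MPU-to-processor link, bytes/s
                cpu_ops=100e9):     # assumed processor compute rate, ops/s
    """Return the unit with the lower estimated completion time."""
    # On the MPU the data is read in place; compute is comparatively slow.
    mpu_time = bytes_touched / mpu_mem_bw + ops / mpu_ops
    # On the processor the data must first cross the link (transfer penalty).
    cpu_time = bytes_touched / link_bw + ops / cpu_ops
    return "mpu" if mpu_time <= cpu_time else "processor"


scan_job = choose_unit(bytes_touched=1e9, ops=1e6)     # memory-intensive
math_job = choose_unit(bytes_touched=1e6, ops=1e10)    # compute-intensive
```

A static allocation would evaluate such a model once ahead of time, whereas a dynamic allocation could re-evaluate it with measured figures at run time.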

處理器可請求或發指令給一或多個記憶體/處理單元以執行各種處理任務。各種處理任務之執行可減輕處理器之負擔，減少潛時，且在一些狀況下減少一或多個記憶體/處理單元與處理器之間的總資訊頻寬，及其類似者。 The processor may request or instruct one or more memory/processing units to perform various processing tasks. Executing these tasks on the memory/processing units may reduce the burden on the processor, reduce latency, and in some cases reduce the total information bandwidth between the one or more memory/processing units and the processor, and the like.

圖96D為包括一或多個記憶體/處理單元(MPU)12043及處理器12042之混合系統12040的實例。處理器12042可將請求或指令發送至一或多個MPU 12043，該一或多個MPU又完成(或選擇性地完成)請求及/或指令且將結果發送至處理器12042，如上文所說明。 FIG. 96D is an example of a hybrid system 12040 including one or more memory/processing units (MPU) 12043 and a processor 12042. The processor 12042 may send a request or instruction to one or more MPUs 12043, which in turn completes (or selectively completes) the request and/or instruction and sends the result to the processor 12042, as explained above .

處理器12042可進一步處理結果以提供一或多個輸出。 The processor 12042 may further process the results to provide one or more outputs.

每一MPU包括記憶體資源、處理資源(諸如，緊密微控制器12044)及快取記憶體12049。微控制器可具有有限運算能力(例如，可主要包括乘法累加單元)。 Each MPU includes memory resources, processing resources (such as compact microcontroller 12044), and cache memory 12049. The microcontroller may have limited computing capabilities (for example, it may mainly include a multiplication and accumulation unit).

微控制器12044可出於記憶體內加速目的而應用處理程序，亦可為CPU或整個DB處理引擎或其子集。 The microcontroller 12044 may apply processes for in-memory acceleration purposes, and may also be a CPU, or an entire DB processing engine or a subset thereof.

MPU 12043可包括可用網狀/環形/或其他拓樸連接以用於快速組間通信的微處理器及封包處理單元。 The MPU 12043 may include a microprocessor and a packet processing unit that can be connected in a mesh/ring/or other topology for fast inter-group communication.

可存在多於一個DDR控制器以用於快速DIMM間通信。 There may be more than one DDR controller for fast inter-DIMM communication.

記憶體內封包處理器之目標為減少BW、資料移動、功率消耗，且增加效能。相比標準解決方案，使用記憶體內封包處理器將使效能/TCO顯著增加。 The goal of the in-memory packet processor is to reduce bandwidth (BW), data movement, and power consumption, and to increase performance. Compared with a standard solution, using an in-memory packet processor will significantly increase performance/TCO.

應注意，管理單元為可選的。 It should be noted that the management unit is optional.

每一MPU可作為人工智慧(AI)記憶體/處理單元操作，此係因為其可執行AI計算且僅將結果傳回至處理器，藉此減少訊務量，尤其在MPU接收及儲存待用於多個計算中之神經網路係數時，且每次使用神經網路之一部分以處理新資料時不需要自外部晶片接收係數。 Each MPU may operate as an artificial intelligence (AI) memory/processing unit, because it can perform AI computations and return only the results to the processor, thereby reducing traffic. This holds especially when the MPU receives and stores neural network coefficients to be used in multiple computations, so that the coefficients need not be received from an external chip each time a portion of the neural network is used to process new data.

MPU可判定係數何時為零，且通知處理器不需要執行包括零值係數之乘法。 The MPU can determine when the coefficient is zero, and inform the processor that it is not necessary to perform multiplication including zero-valued coefficients.
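The zero-coefficient shortcut can be sketched as follows. This is an illustrative model only: the helper names `nonzero_indices` and `sparse_dot` are hypothetical, and the notification to the processor is assumed here to take the form of a list of indices whose coefficients are non-zero.

```python
def nonzero_indices(coefficients):
    """MPU side: report which stored coefficients are non-zero."""
    return [i for i, c in enumerate(coefficients) if c != 0]


def sparse_dot(coefficients, inputs, indices):
    """Processor side: multiply-accumulate only at the reported indices."""
    return sum(coefficients[i] * inputs[i] for i in indices)


coefficients = [0, 2, 0, -1]
inputs = [5, 3, 7, 4]
useful = nonzero_indices(coefficients)
result = sparse_dot(coefficients, inputs, useful)
```

Here two of the four multiplications are skipped without changing the result of the dot product.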

應注意，第一處理及第二處理可包括篩選資料庫條目。 It should be noted that the first processing and the second processing may include filtering database entries.

MPU可為本說明書中、PCT專利申請案WO2019025862及PCT專利申請案第PCT/IB2019/001005號中之任一者中所說明的任何記憶體處理單元。 The MPU may be any memory processing unit described in any of this specification, PCT patent application WO2019025862, and PCT patent application No. PCT/IB2019/001005.

可提供AI運算系統(及可由系統執行之系統)，其中網路介面卡具有AI處理能力且經組態以執行一些AI處理任務，以便減少待經由耦接多個AI加速伺服器之網路發送的訊務之量。 An AI computing system (and a system executable by the system) may be provided, in which a network interface card has AI processing capabilities and is configured to perform some AI processing tasks, so as to reduce the amount of traffic to be sent over the network that couples multiple AI acceleration servers.

舉例而言，在一些推斷系統中，輸入為網路(例如，連接至AI伺服器之IP攝影機的多個串流)。在此等狀況下，在處理及網路連接單元上利用RDMA+AI可減小CPU及PCIe匯流排之負載且對處理及網路連接單元提供處理，而非由不包括於處理及網路連接單元中之GPU提供處理。 For example, in some inference systems the input arrives over a network (for example, multiple streams from IP cameras connected to an AI server). In such cases, using RDMA+AI on the processing and network connection unit can reduce the load on the CPU and the PCIe bus, with the processing provided on the processing and network connection unit rather than by a GPU that is not included in the processing and network connection unit.

舉例而言，替代計算初始結果及將初始結果發送至目標AI加速伺服器(應用一或多個AI處理操作)，處理及網路連接單元可執行減少發送至目標AI加速伺服器之值之量的預處理。目標AI運算伺服器為經分配以對由其他AI加速伺服器提供之值執行計算的AI運算伺服器。此減少在AI加速伺服器之間交換的訊務之頻寬且亦減小目標AI加速伺服器之負載。 For example, instead of computing initial results and sending the initial results to a target AI acceleration server (which applies one or more AI processing operations), the processing and network connection unit may perform preprocessing that reduces the number of values sent to the target AI acceleration server. The target AI computing server is an AI computing server allocated to perform computations on values provided by other AI acceleration servers. This reduces the bandwidth of the traffic exchanged between AI acceleration servers and also reduces the load on the target AI acceleration server.

可藉由使用負載平衡或其他分配演算法以動態或靜態方式分配目標AI加速伺服器。可存在多於單個目標AI加速伺服器。 The target AI acceleration server can be allocated dynamically or statically by using load balancing or other allocation algorithms. There may be more than a single target AI acceleration server.

舉例而言，若目標AI加速伺服器添加了多個損失，則處理及網路連接單元可添加由其AI加速伺服器產生之損失且將損失總和發送至目標AI加速伺服器，藉此減少頻寬。當執行諸如導數計算及聚集以及其類似者之其他預處理操作時，可獲得相同益處。 For example, if the target AI acceleration server sums multiple losses, the processing and network connection unit may sum the losses generated by its own AI acceleration server and send the sum of the losses to the target AI acceleration server, thereby reducing bandwidth. The same benefit may be obtained when performing other preprocessing operations, such as derivative computation and aggregation, and the like.
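The bandwidth saving from local summation can be shown with a small sketch. The message format and the function names below are hypothetical and chosen only for illustration; each inner list stands for the loss values produced on one AI acceleration server.

```python
def send_raw(local_losses):
    """Without preprocessing: every loss value crosses the network."""
    return list(local_losses)


def send_presummed(local_losses):
    """With preprocessing: the unit sends one locally computed sum."""
    return [sum(local_losses)]


def target_total(messages):
    """Target server: add up whatever values it received."""
    return sum(value for message in messages for value in message)


servers = [[0.5, 0.25], [1.0, 2.0, 0.25]]
raw = [send_raw(losses) for losses in servers]
presummed = [send_presummed(losses) for losses in servers]
values_sent_raw = sum(len(m) for m in raw)
values_sent_presummed = sum(len(m) for m in presummed)
```

The target computes the same total in both cases, but with local summation it receives one value per server instead of one value per loss.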

圖97B說明包括子系統之系統12060，每一子系統包括用於將具有伺服器主機板12064之AI處理及網路連接單元12063連接至彼此的交換器12061。伺服器主機板包括具有網路能力且具有AI處理能力之一或多個AI處理及網路連接單元12063。AI處理及網路連接單元12063可包括一或多個NIC及ALU或用於執行預處理之其他計算電路。 FIG. 97B illustrates a system 12060 including sub-systems, each sub-system including a switch 12061 for connecting the AI processing and network connection unit 12063 with the server motherboard 12064 to each other. The server motherboard includes one or more AI processing and network connection units 12063 with network capabilities and AI processing capabilities. The AI processing and network connection unit 12063 may include one or more NICs and ALUs or other calculation circuits for performing preprocessing.

AI處理及網路連接單元12063可為晶片，或可包括多於單個晶片。具有為單個晶片之AI處理及網路連接單元12063可為有益的。 The AI processing and network connection unit 12063 may be a chip, or may include more than a single chip. It may be beneficial to have the AI processing and network connection unit 12063 as a single chip.

AI處理及網路連接單元12063可包括(僅或主要)處理資源。AI處理及網路連接單元12063可包括記憶體內運算電路，或可不包括記憶體內運算電路，或可能不包括大量記憶體內運算電路。 The AI processing and network connection unit 12063 may include (only or main) processing resources. The AI processing and network connection unit 12063 may include in-memory arithmetic circuits, or may not include in-memory arithmetic circuits, or may not include a large number of in-memory arithmetic circuits.

AI處理及網路連接單元12063可為積體電路，可包括多於單個積體電路，可為積體電路之一部分，及其類似者。 The AI processing and network connection unit 12063 may be an integrated circuit, may include more than a single integrated circuit, may be a part of an integrated circuit, and the like.

AI處理及網路連接單元12063可在包括AI處理及網路連接單元12063之AI加速伺服器與其他AI加速伺服器之間輸送(參見例如圖97C)訊務(例如，藉由使用諸如DDR通道、網路通道及/或PCIe通道之通信埠)。AI處理及網路連接單元12063亦可耦接至諸如DDR記憶體之外部記憶體。處理及網路連接單元可包括記憶體及/或可包括記憶體/處理單元。 The AI processing and network connection unit 12063 can transmit traffic between the AI acceleration server including the AI processing and network connection unit 12063 and other AI acceleration servers (see, for example, Figure 97C) (for example, by using channels such as DDR , Network channel and/or PCIe channel communication port). The AI processing and network connection unit 12063 can also be coupled to an external memory such as DDR memory. The processing and network connection unit may include memory and/or may include a memory/processing unit.

在圖97C中，AI處理及網路連接單元12063經說明為包括本端DDR連接、DDR通道、AI加速器、RAM記憶體、加密/解密引擎、PCIe交換器、PCIe介面、多個核心處理陣列、快速網路連接及其類似者。 In FIG. 97C, the AI processing and network connection unit 12063 is illustrated as including a local DDR connection, DDR channels, an AI accelerator, RAM memory, an encryption/decryption engine, a PCIe switch, a PCIe interface, multiple core processing arrays, fast network connections, and the like.

可提供一種用於操作圖97B及圖97C中之任一者之系統(或操作系統之任何部分)的方法。 A method may be provided for operating the system of any one of FIG. 97B and FIG. 97C (or for operating any part of that system).

可提供在本申請案中所提及之任何方法的任何步驟之任何組合。 Any combination of any steps of any method mentioned in this application can be provided.

可提供在本申請案中所提及之任何單元、積體電路、記憶體資源、邏輯、處理子單元、控制器、組件的任何組合。 Any combination of any units, integrated circuits, memory resources, logic, processing subunits, controllers, and components mentioned in this application can be provided.

對「包括」及/或「包含」之任何參考可在細節上作必要修改後應用於「組成」、「實質上組成」。 Any reference to "including" and/or "comprising" may be applied, mutatis mutandis, to "consisting of" and "consisting essentially of".

已出於說明之目的呈現前述描述。先前描述並不詳盡且不限於所揭示之精確形式或實施例。自本說明書之考慮及所揭示實施例之實踐，修改及調適對熟習此項技術者將為顯而易見的。另外，儘管所揭示實施例之態樣描述為儲存於記憶體中，但熟習此項技術者將瞭解，此等態樣亦可儲存於其他類型之電腦可讀媒體上，諸如次要儲存裝置，例如硬碟或CD ROM，或其他形式之RAM或ROM、USB媒體、DVD、藍光、4K超HD藍光，或其他光碟機媒體。 The foregoing description has been presented for illustrative purposes. The previous description is not exhaustive and is not limited to the precise form or embodiment disclosed. From the consideration of this specification and the practice of the disclosed embodiments, modifications and adaptations will be obvious to those familiar with the technology. In addition, although the aspects of the disclosed embodiments are described as being stored in memory, those skilled in the art will understand that these aspects can also be stored on other types of computer-readable media, such as secondary storage devices. Such as hard disk or CD ROM, or other forms of RAM or ROM, USB media, DVD, Blu-ray, 4K Ultra HD Blu-ray, or other optical drive media.

基於書面描述及所揭示方法之電腦程式在有經驗開發者之技能內。各種程式或程式模組可使用熟習此項技術者已知的技術中之任一者來產生或可結合現有軟體來設計。舉例而言，程式區段或程式模組可用或藉助於.Net Framework、.Net Compact Framework(及相關語言，諸如Visual Basic、C等)、Java、C++、Objective-C、HTML、HTML/AJAX組合、XML或包括Java小程式之HTML來設計。 Computer programs based on written descriptions and disclosed methods are within the skills of experienced developers. Various programs or program modules can be generated using any of the technologies known to those skilled in the art or can be designed in combination with existing software. For example, program sections or program modules can be used or combined with .Net Framework, .Net Compact Framework (and related languages, such as Visual Basic, C, etc.), Java, C++, Objective-C, HTML, HTML/AJAX , XML or HTML including Java applets.

此外，雖然本文中已描述說明性實施例，但熟習此項技術者基於本發明將瞭解具有等效元件、修改、省略、組合(例如，跨越各種實施例之態樣的組合)、調適及/或更改的任何及所有實施例之範圍。申請專利範圍中之限制應基於申請專利範圍中所使用之語言來廣泛地解譯，且不限於本說明書中所描述或在本申請案之審查期間的實例。實例應解釋為非排他性的。此外，所揭示方法之步驟可用包括藉由對步驟重排序及/或插入或刪除步驟的任何方式來修改。因此，本說明書及實例意欲僅被視為說明性的，其中真實範圍及精神由以下申請範圍及其等效物之完整範圍提示。 In addition, although illustrative embodiments have been described herein, those skilled in the art will appreciate, based on the present disclosure, the scope of any and all embodiments having equivalent elements, modifications, omissions, combinations (for example, of aspects across various embodiments), adaptations, and/or alterations. The limitations in the claims are to be interpreted broadly based on the language employed in the claims and are not limited to examples described in this specification or during the prosecution of this application. The examples are to be construed as non-exclusive. Furthermore, the steps of the disclosed methods may be modified in any manner, including by reordering steps and/or inserting or deleting steps. It is intended, therefore, that the specification and examples be considered as illustrative only, with the true scope and spirit being indicated by the following claims and their full scope of equivalents.

300:硬體晶片 300: hardware chip

310a:處理群組 310a: Processing group

310b:處理群組 310b: Processing group

310c:處理群組 310c: Processing group

310d:處理群組 310d: Processing group

320a:邏輯及控制子單元 320a: logic and control subunit

320b:邏輯及控制子單元 320b: logic and control subunit

320c:邏輯及控制子單元 320c: logic and control subunit

320d:邏輯及控制子單元 320d: logic and control subunit

320e:邏輯及控制子單元 320e: logic and control subunit

320f:邏輯及控制子單元 320f: logic and control subunit

320g:邏輯及控制子單元 320g: logic and control subunit

320h:邏輯及控制子單元 320h: logic and control subunit

330a:專用記憶體例項 330a: Dedicated memory instance

330b:專用記憶體例項 330b: Dedicated memory instance

330c:專用記憶體例項 330c: Dedicated memory instance

330d:專用記憶體例項 330d: Dedicated memory instance

330e:專用記憶體例項 330e: Dedicated memory instance

330f:專用記憶體例項 330f: Dedicated memory instance

330g:專用記憶體例項 330g: Dedicated memory instance

330h:專用記憶體例項 330h: Dedicated memory instance

340a:控制件 340a: control

340b:控制件 340b: control

340c:控制件 340c: control

340d:控制件 340d: control

350:主機 350: host

Claims

An integrated circuit comprising:

A substrate;

A memory array arranged on the substrate, the memory array including a plurality of discrete memory groups;

A processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of the plurality of processor subunits being associated with one or more discrete memory groups among the plurality of discrete memory groups; and

A controller, which is configured to:

Implement at least one security measure with respect to an operation of the integrated circuit.

The integrated circuit of claim 1, wherein the controller is configured to take one or more remedial actions if the at least one security measure is triggered.

The integrated circuit of claim 1, wherein the controller is configured to implement at least one security measure in at least one memory location.

The integrated circuit according to claim 2, wherein the data includes weight data of a neural network model.

The integrated circuit of claim 1, wherein the controller is configured to implement at least one security measure, and the at least one security measure includes locking access to one or more memory portions of the memory array that are not used for data input or data output operations.

The integrated circuit of claim 1, wherein the controller is configured to implement at least one security measure, the at least one security measure including locking only a subset of the memory array.

The integrated circuit of claim 6, wherein the subset of the memory array is specified by certain memory addresses.

The integrated circuit of claim 6, wherein the subset of the memory array is configurable.

The integrated circuit of claim 1, wherein the controller is configured to implement at least one security measure, and the at least one security measure includes controlling traffic to or from the integrated circuit.

The integrated circuit according to claim 1, wherein the controller is configured to implement at least one security measure, and the at least one security measure includes uploading changeable data, program code, or fixed data.

The integrated circuit as described in claim 1, wherein the uploading of changeable data, code or fixed data occurs during a boot process.

The integrated circuit of claim 1, wherein the controller is configured to implement at least one security measure, and the at least one security measure includes uploading a configuration file during a boot process, the configuration file identifying certain memory addresses of at least part of the memory array that are to be locked after the boot process is completed.

The integrated circuit of claim 1, wherein the controller is further configured to require a complex password in order to unlock access to a memory portion of the memory array associated with one or more memory addresses.

The integrated circuit according to claim 1, wherein the at least one security measure is triggered after an attempt to access one of the at least one locked memory address is detected.

The integrated circuit of claim 1, wherein the controller is configured to implement at least one security measure, and the at least one security measure includes:

Calculating a checksum, a hash, a cyclic redundancy check (CRC), or a parity with respect to at least part of the memory array; and

Comparing the calculated checksum, hash, CRC, or parity with a predetermined value.

The integrated circuit of claim 15, wherein the controller is configured as part of the at least one security measure to determine whether the calculated checksum, hash, CRC, or parity matches the predetermined value.

The integrated circuit according to claim 1, wherein the at least one security measure includes copying a program code in at least two different memory portions.

The integrated circuit according to claim 17, wherein the at least one security measure includes determining whether the output results of executing the program code in the at least two different memory portions are different.

The integrated circuit according to claim 18, wherein the output results include intermediate or final output results.

The integrated circuit according to claim 17, wherein the at least two different memory portions are included in the integrated circuit.

The integrated circuit according to claim 1, wherein the at least one security measure includes determining whether an operation pattern is different from one or more predetermined operation patterns.

The integrated circuit according to claim 2, wherein the one or more remedial actions include stopping execution of operations.

A method of protecting an integrated circuit against tampering, the method comprising:

Using a controller associated with the integrated circuit to implement at least one security measure with respect to an operation of the integrated circuit; wherein the integrated circuit includes:

A substrate;

A memory array arranged on the substrate, the memory array including a plurality of discrete memory groups; and

A processing array disposed on the substrate, the processing array including a plurality of processor subunits, each of the plurality of processor subunits being associated with one or more discrete memory groups among the plurality of discrete memory groups.

The method according to claim 23, further comprising taking one or more remedial actions when the at least one security measure is triggered.

An integrated circuit comprising:

A substrate;

A controller, which is configured to:

Implement at least one security measure with respect to an operation of the integrated circuit; wherein the at least one security measure includes copying a program code in at least two different memory portions.

An integrated circuit comprising:

A substrate;

A controller configured to implement at least one security measure with respect to an operation of the integrated circuit.

The integrated circuit of claim 26, wherein the controller is further configured to take one or more remedial actions if the at least one security measure is triggered.

A distributed processor memory chip, which includes:

A substrate;

A first communication port configured to establish a communication connection between the distributed processor memory chip and an external entity other than another distributed processor memory chip; and

A second communication port configured to establish a communication connection between the distributed processor memory chip and a first additional distributed processor memory chip.

The distributed processor memory chip of claim 28, which further includes a third communication port configured to establish a communication connection between the distributed processor memory chip and a second additional distributed processor memory chip.

The distributed processor memory chip of claim 29, further comprising a controller configured to control communication via at least one of the first communication port, the second communication port, and the third communication port.

The distributed processor memory chip according to claim 29, wherein each of the first communication port, the second communication port, and the third communication port is associated with a corresponding bus.

The distributed processor memory chip according to claim 31, wherein the corresponding bus is a bus common to each of the first communication port, the second communication port, and the third communication port.

The distributed processor memory chip according to claim 31, wherein

The corresponding buses associated with each of the first communication port, the second communication port, and the third communication port are connected to the plurality of discrete memory groups.

The distributed processor memory chip according to claim 31, wherein at least one bus associated with the first communication port, the second communication port and the third communication port is unidirectional.

The distributed processor memory chip according to claim 31, wherein at least one bus associated with the first communication port, the second communication port and the third communication port is bidirectional.

The distributed processor memory chip of claim 30, wherein the controller is configured to schedule a data transmission between the distributed processor memory chip and the first additional distributed processor memory chip such that a receiving processor subunit of the first additional distributed processor memory chip executes its associated program code based on the data transmission and during a time period in which the data transmission is received.

The distributed processor memory chip of claim 30, wherein the controller is configured to send a clock enable signal to at least one of the plurality of processor subunits of the distributed processor memory chip to control one or more operation modes of the at least one of the plurality of processor subunits.

The distributed processor memory chip of claim 37, wherein the controller is configured to control a timing of one or more communication commands associated with the at least one of the plurality of processor subunits by controlling the clock enable signal sent to the at least one of the plurality of processor subunits.

The distributed processor memory chip of claim 30, wherein the controller is configured to selectively start execution of program code by one or more of the plurality of processor subunits on the distributed processor memory chip.

The distributed processor memory chip of claim 30, wherein the controller is configured to use a clock enable signal to control a timing of data transfers from one or more of the plurality of processor subunits to at least one of the second communication port and the third communication port.

The distributed processor memory chip of claim 28, wherein a communication speed associated with the first communication port is lower than a communication speed associated with the second communication port.

The distributed processor memory chip of claim 30, wherein the controller is configured to determine whether a first processor subunit among the plurality of processor subunits is ready to transfer data to a second processor subunit included in the first additional distributed processor memory chip, and, after determining that the first processor subunit is ready to transfer the data to the second processor subunit, to use a clock enable signal to initiate a transfer of the data from the first processor subunit to the second processor subunit.

The distributed processor memory chip according to claim 42, wherein the controller is further configured to determine whether the second processor subunit is ready to receive the data, and, after determining that the second processor subunit is ready to receive the data, to use the clock enable signal to initiate the transfer of the data from the first processor subunit to the second processor subunit.

The distributed processor memory chip of claim 42, wherein the controller is further configured to determine whether the second processor subunit is ready to receive the data, and to buffer the data included in the transfer until after a determination that the second processor subunit of the first additional distributed processor memory chip is ready to receive the data.

A method for transferring data between a first distributed processor memory chip and a second distributed processor memory chip, the method comprising:

Use a controller associated with at least one of the first distributed processor memory chip and the second distributed processor memory chip to determine whether a first processor subunit among a plurality of processor subunits disposed on the first distributed processor memory chip is ready to transfer data to a second processor subunit included in the second distributed processor memory chip; and, after determining that the first processor subunit is ready to transfer the data to the second processor subunit, use a clock enable signal controlled by the controller to initiate a transfer of the data from the first processor subunit to the second processor subunit.

The method according to claim 45, which further includes:

Use the controller to determine whether the second processor subunit is ready to receive the data; and

After determining that the second processor subunit is ready to receive the data, use the clock enable signal to initiate the transfer of the data from the first processor subunit to the second processor subunit.

The method according to claim 45, which further includes:

Use the controller to determine whether the second processor subunit is ready to receive the data, and buffer the data included in the transfer until after a determination that the second processor subunit of the first additional distributed processor memory chip is ready to receive the data.
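The readiness checks, clock-enable gating, and buffering recited in the method claims above can be pictured with a small software model. This is only an illustrative sketch: the class names `Subunit` and `TransferController`, the `outbox`/`inbox` queues, and the one-word-per-step granularity are assumptions made for the example and are not taken from the claims.

```python
from collections import deque


class Subunit:
    """Hypothetical processor subunit; it only transmits while its clock is enabled."""

    def __init__(self):
        self.clock_enabled = False
        self.outbox = deque()          # data waiting to be transferred
        self.inbox = []                # data received so far
        self.ready_to_receive = False

    def ready_to_send(self):
        return bool(self.outbox)


class TransferController:
    """Hypothetical controller that gates a chip-to-chip transfer with a clock enable."""

    def __init__(self, sender, receiver):
        self.sender = sender
        self.receiver = receiver
        self.buffer = deque()          # holds data until the receiver is ready

    def step(self):
        # Deliver anything buffered earlier once the receiver reports ready.
        if self.receiver.ready_to_receive:
            while self.buffer:
                self.receiver.inbox.append(self.buffer.popleft())
        # Start a transfer only after the sender is determined to be ready.
        if self.sender.ready_to_send():
            self.sender.clock_enabled = True       # clock enable initiates the transfer
            word = self.sender.outbox.popleft()
            self.sender.clock_enabled = False
            if self.receiver.ready_to_receive:
                self.receiver.inbox.append(word)
            else:
                self.buffer.append(word)           # buffer until the receiver is ready


tx, rx = Subunit(), Subunit()
controller = TransferController(tx, rx)
tx.outbox.extend([1, 2])
controller.step()                  # receiver not ready: the word is buffered
rx.ready_to_receive = True
controller.step()                  # buffered word is delivered, then the next one
```

In this model the buffer preserves ordering, so data held back while the receiver was busy always arrives before newer data.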

A memory chip including:

A substrate;

A first communication port configured to establish a communication connection between the memory chip and an external entity other than another memory chip; and

A second communication port configured to establish a communication connection between the memory chip and a first additional memory chip.

The memory chip according to claim 48, wherein the first communication port is connected to at least one of a main bus inside the memory chip or at least one processor subunit included in the memory chip.

The memory chip according to claim 48, wherein the second communication port is connected to at least one of a main bus inside the memory chip or at least one processor subunit included in the memory chip.

A memory unit, which includes:

A memory array, which includes a plurality of memory banks;

At least one controller configured to control at least one aspect of the read operation with respect to the plurality of memory banks;

At least one zero-value detection logic unit configured to detect a multi-bit zero value associated with data stored in a specific address of one of the plurality of memory banks; and

The at least one controller is configured to return a zero value indicator to one or more circuits in response to a zero value detection performed by the at least one zero value detection logic unit.

The memory unit according to claim 51, wherein the one or more circuits to which the zero value indicator is returned are outside the memory unit.

The memory unit according to claim 51, wherein the one or more circuits to which the zero value indicator is returned are inside the memory unit.

The memory unit according to claim 51, wherein the memory unit further comprises at least one read disabling element configured to interrupt a read command associated with the specific address when the at least one zero-value detection logic unit detects a zero value associated with the specific address.

The memory unit of claim 51, wherein the at least one controller is configured to send the zero-value indicator to the one or more circuits instead of sending the zero-value data stored in the specific address.

The memory unit according to claim 51, wherein a size of the zero value indicator is smaller than a size of zero data.

The memory unit according to claim 51, wherein an energy consumed by a first process that includes the following operations is less than an energy consumed by sending zero-value data to the one or more circuits: (a) detecting the zero value; (b) generating the zero value indicator; and (c) sending the zero value indicator to the one or more circuits.

The memory unit according to claim 57, wherein the energy consumed by the first process is less than half of the energy consumed by sending the zero-value data to the one or more circuits.

The memory unit according to claim 51, wherein the memory unit further includes at least one sense amplifier configured to prevent activation of at least one of the plurality of memory banks after the at least one zero-value detection logic unit performs a zero value detection.

The memory unit according to claim 59, wherein the at least one sense amplifier includes a plurality of transistors, and the plurality of transistors are configured to sense low-power signals from the plurality of memory banks, and The at least one sense amplifier amplifies a small voltage swing to a higher voltage level so that the data stored in the plurality of memory groups can be interpreted by the at least one controller.

The memory unit according to claim 51, wherein each of the plurality of memory banks is further organized into sub-banks, the at least one controller includes a sub-bank controller, and the at least one zero-value detection logic unit includes zero-value detection logic associated with the sub-banks.

The memory unit according to claim 61, wherein the memory unit further comprises at least one read disabling element, and the at least one read disabling element includes a sense amplifier associated with each of the sub-banks.

The memory unit according to claim 51, further comprising a plurality of processor subunits spatially distributed in the memory unit, wherein each of the plurality of processor subunits is associated with at least one dedicated memory bank among the plurality of memory banks, and each of the plurality of processor subunits is configured to access and operate on data stored in the corresponding memory bank.

The memory unit according to claim 63, wherein the one or more circuits include one or more of the processor subunits.

The memory unit according to claim 63, wherein each of the plurality of processor subunits is connected by one or more buses to two or more other processor subunits among the plurality of processor subunits.

The memory unit according to claim 51, which further includes a plurality of buses.

The memory unit of claim 66, wherein the plurality of buses are configured to transfer data between the plurality of memory groups.

The memory unit of claim 67, wherein at least one of the plurality of buses is configured to transmit the zero value indicator to the one or more circuits.

A method for detecting a zero value in a specific address of a plurality of memory groups, which includes:

Receiving a request to read data stored in an address of a plurality of memory groups from a circuit outside a memory unit;

In response to the received request, activating, by a controller, a zero value detection logic unit to detect a zero value at the received address; and

In response to the zero value detection performed by the zero value detection logic unit, a zero value indicator is transmitted to the circuit by the controller.

The method of claim 69, further comprising configuring, by the controller, a read disable element to interrupt a read command associated with the requested address when the zero value detection logic unit detects a zero value associated with the requested address.

The method according to claim 69, further comprising configuring, by the controller, a sense amplifier to prevent activation of at least one of the plurality of memory banks when the zero value detection logic unit detects a zero value.
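The zero-value read path recited above (receive a read request, run zero-value detection on the requested address, return an indicator instead of the data) can be sketched as follows. The names `MemoryUnit`, `ZERO_INDICATOR`, and `words_sent` are hypothetical, and the model inspects the stored word directly; it does not attempt to show how hardware would avoid the full read or bank activation.

```python
ZERO_INDICATOR = "Z"   # hypothetical compact indicator, smaller than a full data word


class MemoryUnit:
    """Toy model of a memory unit with zero-value detection on the read path."""

    def __init__(self, banks):
        self.banks = banks       # list of banks, each a list of integer words
        self.words_sent = 0      # counts full data words actually transmitted

    def read(self, bank, address):
        value = self.banks[bank][address]
        if value == 0:
            # Zero value detected: return the indicator and skip sending the word.
            return ZERO_INDICATOR
        self.words_sent += 1
        return value


memory = MemoryUnit([[0, 7, 0], [5, 0, 9]])
results = [memory.read(0, a) for a in range(3)]
# Only one full word crosses the bus; the two zero words become indicators.
```

The counter `words_sent` stands in for the energy argument in the dependent claims: every indicator returned is a full word that was not transmitted.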

A non-transitory computer-readable medium storing a set of instructions executable by a controller of a memory unit to cause the memory unit to perform a method for detecting a zero value in a specific address of a plurality of memory banks, the method comprising:

The non-transitory computer-readable medium according to claim 72, wherein the method further comprises configuring, by the controller, a read disable element to interrupt a read command associated with the requested address when the zero value detection logic unit detects a zero value associated with the requested address.

The non-transitory computer-readable medium of claim 72, wherein the method further comprises configuring, by the controller, a sense amplifier to prevent activation of at least one of the plurality of memory banks when the zero value detection logic unit detects a zero value.

An integrated circuit comprising:

A memory unit including a plurality of memory banks, at least one controller configured to control at least one aspect of read operations relative to the plurality of memory banks, and at least one zero-value detection logic unit configured to detect a multi-bit zero value associated with data stored in a specific address of the plurality of memory banks;

A processing unit configured to send a read request to the memory unit for reading data from the memory unit; and

The at least one controller and the at least one zero-value detection logic unit are configured to return a zero value indicator to one or more circuits in response to a zero value detection performed by the at least one zero-value detection logic unit.

A memory unit, which includes:

A memory array, which includes a plurality of memory banks;

At least one detection logic unit configured to detect a predetermined multi-bit value associated with data stored in a specific address of one of the plurality of memory groups; and

The at least one controller is configured to return a value indicator to one or more circuits in response to a detection of the predetermined multi-bit value by the at least one detection logic.

The memory unit according to claim 76, wherein the predetermined multi-bit value can be selected by a user.

A memory unit, which includes:

A memory array, which includes a plurality of memory banks;

At least one controller configured to control at least one aspect of the write operation with respect to the plurality of memory banks;

At least one detection logic unit configured to detect a predetermined multi-bit value associated with data to be written to a specific address of the plurality of memory groups; and

The at least one controller is configured to provide a value indicator to one or more circuits in response to a detection of the predetermined multi-bit value by the at least one detection logic.

A distributed processor memory chip, which includes:

A substrate;

A memory array including a plurality of memory banks arranged on the substrate;

A plurality of processor sub-units, which are arranged on the substrate;

The at least one controller is configured to return a value indicator to one or more of the plurality of processor subunits in response to a detection of the predetermined multi-bit value by the at least one detection logic.

A memory unit, which includes:

One or more memory banks;

A set of controllers; and

An address generator;

The address generator is configured to:

Provide, to the set of controllers, a current address of a current row to be accessed in an associated memory bank among the one or more memory banks;

Determine a predicted address of a next row to be accessed in the associated memory bank; and

Provide the predicted address to the set of controllers before an operation relative to the current row associated with the current address is completed.

The memory unit according to claim 80, wherein the operation relative to the current row associated with the current address is a read operation or a write operation.

The memory unit according to claim 80, wherein the current row and the next row are in the same memory bank.

The memory unit of claim 82, wherein the same memory group allows access to the next row while the current row is being accessed.

The memory unit according to claim 80, wherein the current row and the next row are in a different memory group.

The memory unit according to claim 80, further comprising a distributed processor, wherein the distributed processor includes a plurality of processor subunits of a processing array that are spatially distributed among a plurality of discrete memory banks of the memory array.

The memory unit of claim 80, wherein the set of controllers is configured to access the current row and activate the next row before a completion of the operation relative to the current row.

The memory unit according to claim 80, wherein each of the one or more memory banks includes at least a first sub-bank and a second sub-bank, and the set of controllers associated with each of the one or more memory banks includes a first sub-bank controller associated with the first sub-bank and a second sub-bank controller associated with the second sub-bank.

The memory unit of claim 87, wherein the first sub-bank controller is configured to enable access to data included in a current row of the first sub-bank, and the second sub-bank controller activates a next row in the second sub-bank.

The memory unit according to claim 88, wherein the activated next row of the second sub-bank is separated by at least two rows from the current row of the first sub-bank in which data is being accessed.

The memory unit of claim 87, wherein the second sub-bank controller is configured to enable access to data included in a current row of the second sub-bank, and the first sub-bank controller activates a next row in the first sub-bank.

The memory unit according to claim 90, wherein the activated next row of the first sub-bank is separated by at least two rows from the current row of the second sub-bank in which data is being accessed.

The memory unit according to claim 80, wherein a trained neural network is used to determine the predicted address.

The memory unit according to claim 80, wherein the predicted address is determined based on a determined row access pattern.

The memory unit according to claim 80, wherein the address generator includes a first address generator configured to generate the current address and a second address generator configured to generate the predicted address.

The memory unit of claim 94, wherein the second address generator is configured to calculate the predicted address within a predetermined period of time after the first address generator has generated the current address.

The memory unit according to claim 95, wherein the predetermined time period is adjustable.

The memory unit according to claim 96, wherein the predetermined time period is adjusted based on a value of at least one operating parameter associated with the memory unit.

The memory unit according to claim 97, wherein the at least one operating parameter includes a temperature of the memory unit.

The memory unit according to claim 80, wherein the address generator is further configured to generate a confidence level associated with the predicted address and, if the confidence level falls below a predetermined threshold, to cause the set of controllers to forgo accessing the next row at the predicted address.
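One way to picture an address generator that supplies a predicted next row together with a confidence (trust) level is a simple stride predictor. The sketch below is an assumption made for illustration: the claims do not state that a stride predictor is used, and the names `AddressGenerator`, `threshold`, and `next` are hypothetical.

```python
class AddressGenerator:
    """Toy stride predictor: predicts the next row only when confidence is high enough."""

    def __init__(self, threshold=2):
        self.prev = None          # previously issued row address
        self.stride = None        # last observed difference between rows
        self.confidence = 0       # number of consecutive repeats of the stride
        self.threshold = threshold

    def next(self, current):
        """Return (current_address, predicted_address or None)."""
        if self.prev is not None:
            stride = current - self.prev
            if stride == self.stride:
                self.confidence += 1
            else:
                self.stride = stride
                self.confidence = 0
        self.prev = current
        if self.stride is None or self.confidence < self.threshold:
            # Confidence below threshold: controllers should not pre-activate a row.
            return current, None
        return current, current + self.stride


generator = AddressGenerator(threshold=2)
trace = [generator.next(row) for row in (0, 1, 2, 3, 10)]
```

With this trace the predictor stays silent until the stride of 1 has repeated twice, predicts row 4 while row 3 is accessed, and withdraws its prediction as soon as the jump to row 10 breaks the pattern.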

The memory unit of claim 80, wherein the predicted address is generated by a series of flip-flops that sample the address generated in a delay.

The memory unit of claim 100, wherein the delay can be configured via a multiplexer that selects between flip-flops storing sampled addresses.
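The flip-flop arrangement in the two claims above can be read as follows: the generator runs ahead, a chain of flip-flops holds the addresses it produced on earlier clock cycles, and a multiplexer picks which stored sample is used as the current address, which sets the lead of the predicted address. The sketch below models that reading only; the class name `DelayedAddressPipeline` and the parameters `depth` and `delay` are hypothetical.

```python
class DelayedAddressPipeline:
    """Toy model of a flip-flop chain with a multiplexer-selected delay."""

    def __init__(self, depth, delay):
        if not 1 <= delay <= depth:
            raise ValueError("delay must select one of the flip-flops")
        self.flops = [None] * depth   # flops[k] holds the address from k+1 cycles ago
        self.delay = delay            # multiplexer select: how many cycles to lag

    def clock(self, generated_address):
        """Advance one cycle; return (current_address, predicted_address)."""
        current = self.flops[self.delay - 1]      # multiplexer output
        predicted = generated_address             # undelayed generator output
        # Shift the chain: every flip-flop samples its predecessor.
        self.flops = [generated_address] + self.flops[:-1]
        return current, predicted


pipeline = DelayedAddressPipeline(depth=3, delay=2)
outputs = [pipeline.clock(address) for address in (10, 11, 12, 13)]
```

Changing `delay` changes how far ahead of the current address the predicted address runs, without changing the generator itself.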

The memory unit according to claim 80, wherein the set of controllers is configured to ignore the predicted address received from the address generator during a predetermined period of time after a reset of the memory unit.

The memory unit of claim 80, wherein the address generator is configured to forgo providing the predicted address to the set of controllers after detecting a random pattern in the row accesses relative to the associated memory bank.

A memory unit, which includes:

One or more memory banks, wherein each of the one or more memory banks includes:

A plurality of rows;

A first row controller configured to control a first subset of the plurality of rows;

A second row controller configured to control a second subset of the plurality of rows;

A single data input terminal for receiving data to be stored in the plurality of rows; and

A single data output terminal, which is used to provide data retrieved from the plurality of rows.

The memory unit of claim 104, wherein the memory unit is configured to receive, at a predetermined time, a first address for processing and a second address for activation and access.

The memory unit according to claim 104, wherein the first subset of the plurality of rows is composed of even-numbered rows.

The memory unit according to claim 106, wherein the even-numbered rows are located in one half of the one or more memory banks.

The memory unit according to claim 106, wherein the odd-numbered rows are located in one half of the one or more memory groups.

The memory unit according to claim 104, wherein the second subset of the plurality of rows is composed of odd-numbered rows.

The memory unit according to claim 104, wherein the first subset of the plurality of rows is contained in a first sub-bank of a memory bank, the first sub-bank being adjacent to a second sub-bank of the memory bank that contains the second subset of the plurality of rows.

The memory unit of claim 104, wherein the first row controller is configured to cause an access to data included in a row in the first subset of the plurality of rows, and the second row controller activates a row in the second subset of the plurality of rows.

The memory unit according to claim 111, wherein the activated row in the second subset of the plurality of rows is separated by at least two rows from the row in the first subset of the plurality of rows in which data is being accessed.

The memory unit of claim 104, wherein the second row controller is configured to cause an access to data included in a row in the second subset of the plurality of rows, and the first row controller activates a row in the first subset of the plurality of rows.

The memory unit according to claim 113, wherein the activated row in the first subset of the plurality of rows is separated by at least two rows from the row in the second subset of the plurality of rows in which data is being accessed.

The memory unit according to claim 104, wherein each of the one or more memory banks includes a column input terminal for receiving a column identifier that indicates a portion of a row to be accessed.

The memory unit according to claim 104, wherein an extra redundant row of pads is placed between each two rows of pads to create a distance that allows activation.

The memory unit according to claim 104, wherein rows close to each other may not be activated at the same time.
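The interplay of two row controllers with the row-spacing restriction can be modelled in a few lines. This is a sketch under assumptions: it takes the even/odd split of the dependent claims, treats "at least two rows" as a minimum index difference of two, and uses the hypothetical names `Bank`, `MIN_ROW_GAP`, `access`, and `activate`.

```python
MIN_ROW_GAP = 2   # assumed reading of "separated by at least two rows"


class Bank:
    """Toy bank with a first controller for even rows and a second for odd rows."""

    def __init__(self, num_rows):
        self.num_rows = num_rows
        self.accessed = None     # row whose data is currently being accessed
        self.activated = None    # row pre-activated by the other controller

    @staticmethod
    def controller_for(row):
        return "first" if row % 2 == 0 else "second"

    def access(self, row):
        self.accessed = row

    def activate(self, row):
        """Try to activate `row` while another row is being accessed."""
        if self.accessed is not None:
            if self.controller_for(row) == self.controller_for(self.accessed):
                return False     # that controller is busy with the ongoing access
            if abs(row - self.accessed) < MIN_ROW_GAP:
                return False     # rows too close to be active at the same time
        self.activated = row
        return True


bank = Bank(num_rows=16)
bank.access(4)                   # first controller serves even row 4
adjacent_ok = bank.activate(5)   # rejected: only one row away
distant_ok = bank.activate(7)    # accepted: odd row, far enough from row 4
```

Because the two subsets have opposite parity, any permitted pair in this model is at least three row indices apart, which satisfies either reading of the spacing requirement.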

A distributed processor on a memory chip, which includes:

A substrate;

A processing array arranged on the substrate, the processing array including a plurality of processor subunits, each of the processor subunits being associated with a corresponding dedicated memory bank among a plurality of discrete memory banks; and

At least one memory pad disposed on the substrate, wherein the at least one memory pad is configured to serve as at least one register of a register file for one or more of the plurality of processor subunits.

The memory chip according to claim 118, wherein the at least one memory pad is included in at least one of the plurality of processor subunits of the processing array.

The memory chip according to claim 118, wherein the register file is configured as a data register file.

The memory chip according to claim 118, wherein the register file is configured as an address register file.

The memory chip of claim 118, wherein the at least one memory pad is configured to provide at least one register of a register file for one or more of the plurality of processor subunits to store data to be accessed by one or more of the plurality of processor subunits.

The memory chip of claim 118, wherein the at least one memory pad is configured to provide at least one register of a register file for one or more of the plurality of processor subunits, wherein the at least one register of the register file is configured to store coefficients used by the plurality of processor subunits during execution of a convolution accelerator operation by the plurality of processor subunits.

The memory chip according to claim 118, wherein the at least one memory pad is a DRAM memory pad.

The memory chip of claim 118, wherein the at least one memory pad is configured to communicate via one-way access.

The memory chip according to claim 118, wherein the at least one memory pad allows bidirectional access.

The memory chip according to claim 1, further comprising at least one redundant memory pad disposed on the substrate, wherein the at least one redundant memory pad is configured to provide at least one redundant register for one or more of the plurality of processor subunits.

The memory chip according to claim 118, further comprising at least one memory pad disposed on the substrate, wherein the at least one memory pad includes at least one redundant memory bit configured to provide at least one redundant register for one or more of the plurality of processor subunits.

The memory chip according to claim 118, which further includes:

A first plurality of buses, each of the first plurality of buses connects one of the plurality of processor subunits to a corresponding dedicated memory bank; and

A second plurality of buses, each of the second plurality of buses connecting one of the plurality of processor subunits to another of the plurality of processor subunits.

The memory chip of claim 118, wherein at least one of the processor subunits includes a counter configured to count backward from a predefined number, and, after the counter reaches a zero value, the at least one of the processor subunits is configured to stop a current task and trigger a memory refresh operation.

The memory chip according to claim 118, wherein at least one of the processor subunits includes a mechanism for stopping a current task and triggering a memory refresh operation at certain times to refresh the memory pad.
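The countdown mechanism described in the two claims above, in which a processor subunit interrupts its task once a counter reaches zero so that the memory pad backing its registers can be refreshed, can be sketched as a small state machine. The names `RefreshingSubunit`, `reload`, and `tick` are hypothetical, and one task step per tick is an assumption.

```python
class RefreshingSubunit:
    """Toy subunit that counts down from a predefined number, then refreshes."""

    def __init__(self, reload):
        self.reload = reload       # predefined number to count backward from
        self.counter = reload
        self.refreshes = 0
        self.work_done = 0

    def tick(self):
        if self.counter == 0:
            # Counter reached zero: stop the current task and refresh the pad.
            self.refreshes += 1
            self.counter = self.reload
            return "refresh"
        self.counter -= 1
        self.work_done += 1
        return "work"


subunit = RefreshingSubunit(reload=3)
schedule = [subunit.tick() for _ in range(8)]
```

Here the subunit performs three task steps between refreshes; a real design would choose the reload value from the retention time of the memory pad.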

The memory chip of claim 118, wherein the register file is configured to be used as a cache memory.

A method for executing at least one instruction in a distributed processor memory chip, the method comprising:

Retrieve one or more data values from a memory array of the distributed processor memory chip;

Storing the one or more data values in a register formed in a memory pad of the distributed processor memory chip; and

Access the one or more data values stored in the register according to at least one instruction executed by a processor element;

The memory array includes a plurality of discrete memory banks arranged on a substrate;

The processor element is one of a plurality of processor subunits included in a processing array arranged on the substrate, wherein each of the processor subunits is associated with a corresponding dedicated memory bank among the plurality of discrete memory banks; and

The register is provided by a memory pad arranged on the substrate.

The method of claim 133, wherein the processor element is configured to act as an accelerator, and the method further includes:

Access the first data stored in the register;

Access the second data from the memory array;

Perform an operation on the first data and the second data.

The method according to claim 133, wherein at least one memory pad includes a plurality of word lines and bit lines, and the method further includes:

Determine a timing for loading the word lines and bit lines, the timing being determined by a size of the memory pad.

The method according to claim 133, which further comprises:

Periodically refresh the register.

The method of claim 12, wherein the memory pad includes a DRAM memory pad.

The method of claim 133, wherein the memory pad is included in at least one of the plurality of discrete memory groups in the memory array.

A device comprising:

A substrate;

A processing unit arranged on the substrate; and

A memory unit arranged on the substrate, wherein the memory unit is configured to store data to be accessed by the processing unit, and

The processing unit includes a memory pad configured to act as a cache for the processing unit.

A method for distributed processing of at least one information stream, the method comprising:

Receiving the at least one information stream by one or more memory processing integrated circuits via a first communication channel, wherein each memory processing integrated circuit includes a controller, a plurality of processor subunits, and a plurality of memory units;

Buffering the at least one information stream by the one or more memory processing integrated circuits;

Performing a first processing operation on the at least one information stream by the one or more memory processing integrated circuits to provide a first processing result;

Sending the first processing results to a processing integrated circuit; and

Performing a second processing operation on the first processing results by the processing integrated circuit to provide a second processing result;

The size of the logic cell of the one or more memory processing integrated circuits is smaller than the size of the logic cell of the processing integrated circuit.
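The flow of the method above (receive, buffer, first processing in the memory processing integrated circuits, hand-off, second processing) can be illustrated with a toy pipeline. The sketch assumes that the second operation runs on the separate processing integrated circuit to which the first results are sent, that the first operation is a simple filter, and that the second is a sum; `MemoryProcessingIC`, `second_processing`, and `run` are hypothetical names.

```python
from collections import deque


class MemoryProcessingIC:
    """Toy memory processing IC: buffers stream units and filters them."""

    def __init__(self):
        self.buffer = deque()

    def receive(self, unit):
        self.buffer.append(unit)              # buffering step

    def first_processing(self, keep):
        results = [unit for unit in self.buffer if keep(unit)]
        self.buffer.clear()
        return results                        # typically smaller than the input


def second_processing(first_results):
    """Stand-in for the heavier operation on the separate processing IC."""
    return sum(first_results)


def run(stream, keep, num_ics=2):
    ics = [MemoryProcessingIC() for _ in range(num_ics)]
    for index, unit in enumerate(stream):
        ics[index % num_ics].receive(unit)    # spread the stream across the ICs
    first = [r for ic in ics for r in ic.first_processing(keep)]
    return first, second_processing(first)


first_results, second_result = run(range(10), keep=lambda x: x % 5 == 0)
```

In this example ten stream units are reduced to two first processing results before anything is sent onward, which mirrors the dependent claim stating that the received stream exceeds the size of the first processing results.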

The method of claim 140, wherein each of the plurality of memory units is coupled to at least one of the plurality of processor subunits.

The method according to claim 140, wherein a total size of the information unit of the at least one information stream received during a certain duration exceeds a total size of the first processing result output during the certain duration.

The method according to claim 140, wherein a total size of the at least one information stream is smaller than a total size of the first processing results.

The method according to claim 140, wherein the manufacturing process of the memory type is a DRAM manufacturing process.

The method according to claim 140, wherein the one or more memory processing integrated circuits are manufactured by a memory type manufacturing process; and

The processing integrated circuit is manufactured by a logic type manufacturing process.

The method according to claim 140, wherein a size of a logic cell of the one or more memory processing integrated circuits is at least twice a size of a corresponding logic cell of the processing integrated circuit.

The method according to claim 140, wherein a critical size of a logic cell of the one or more memory processing integrated circuits is at least twice a critical size of a corresponding logic cell of the processing integrated circuit.

The method of claim 140, wherein a critical size of a memory cell of the one or more memory processing integrated circuits is at least twice a critical size of a corresponding logic cell of the processing integrated circuit.

The method according to claim 140, which includes requesting the one or more memory processing integrated circuits through the processing integrated circuit to perform the first processing operations.

The method according to claim 140, which comprises: sending instructions to the one or more memory processing integrated circuits through the processing integrated circuit to perform the first processing operations.

The method according to claim 140, which includes configuring, by the processing integrated circuit, the one or more memory processing integrated circuits to perform the first processing operations.

The method according to claim 140, which includes executing the first processing operations by the one or more memory processing integrated circuits without the processing integrated circuit intervening.

The method according to claim 140, wherein the computational complexity of the first processing operations is lower than that of the second processing operations.

The method according to claim 140, wherein a total throughput of the first processing operations exceeds a total throughput of the second processing operations.

The method of claim 140, wherein the at least one information stream includes one or more preprocessed information streams.

The method according to claim 157, wherein the one or more preprocessed information streams are data extracted from a network transmission unit.

The method according to claim 140, wherein one part of the first processing operations is executed by one of the plurality of processor subunits, and another part of the first processing operations is executed by another of the plurality of processor subunits.

The method according to claim 140, wherein the first processing operations and the second processing operations include cellular network processing operations.

The method according to claim 140, wherein the first processing operations and the second processing operations include database processing operations.

The method according to claim 140, wherein the first processing operations and the second processing operations include database analysis processing operations.

The method according to claim 140, wherein the first processing operations and the second processing operations include artificial intelligence processing operations.

A method for distributed processing, the method comprising:

Receiving information units by one or more memory processing integrated circuits of a decomposition system, the decomposition system including one or more computing subsystems separated from one or more storage subsystems; wherein each of the one or more memory processing integrated circuits includes a controller, multiple processor sub-units, and multiple memory units;

Wherein the one or more computing subsystems include multiple processing integrated circuits;

Wherein a size of a logic cell of the one or more memory processing integrated circuits is at least twice a size of a corresponding logic cell of the multiple processing integrated circuits;

Performing processing operations on the information units by the one or more memory processing integrated circuits to provide processing results; and

Outputting the processing results from the one or more memory processing integrated circuits.

The method according to claim 162, which includes outputting the processing results to the one or more computing subsystems of the decomposition system.

The method of claim 162, which includes receiving the information units from the one or more storage subsystems of the decomposition system.

The method according to claim 162, which includes outputting the processing results to the one or more storage subsystems of the decomposition system.

The method of claim 162, which includes receiving the information units from the one or more computing subsystems of the decomposition system.

The method of claim 166, wherein the information units sent from different groups of processing units of the plurality of processing integrated circuits include different parts of an intermediate result of a processing procedure executed by the plurality of processing integrated circuits, wherein a group of processing units includes at least one processing integrated circuit.

The method according to claim 167, which includes outputting a result of the entire processing procedure by the one or more memory processing integrated circuits.

The method of claim 168, which includes sending the result of the entire processing procedure to each of the plurality of processing integrated circuits.

The method of claim 168, wherein the different parts of the intermediate result are different parts of an updated neural network model, and wherein the result of the entire processing procedure is the updated neural network model.

The method of claim 168, which includes sending the updated neural network model to each of the plurality of processing integrated circuits.
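
For illustration only, and not as part of the claims, the flow of claims 167 to 171 can be sketched in Python: different groups contribute different parts of an updated model, a memory processing integrated circuit assembles the whole model, and the result is sent to every processing integrated circuit. The names `assemble_model` and `broadcast`, and the representation of a part as a list of weights keyed by its start offset, are assumptions made for this sketch.

```python
from typing import Dict, List


def assemble_model(parts: Dict[int, List[float]]) -> List[float]:
    """Concatenate model parts, keyed by start offset, into one full model."""
    model: List[float] = []
    for offset in sorted(parts):
        # Each part must start exactly where the previous one ended.
        if offset != len(model):
            raise ValueError(f"gap or overlap at offset {offset}")
        model.extend(parts[offset])
    return model


def broadcast(model: List[float], num_processing_ics: int) -> List[List[float]]:
    """Return one independent copy of the full model per processing IC."""
    return [list(model) for _ in range(num_processing_ics)]


# Three groups of processing ICs each contribute a different part (hypothetical data).
parts_from_groups = {0: [0.1, 0.2], 2: [0.3], 3: [0.4, 0.5]}
full_model = assemble_model(parts_from_groups)
copies = broadcast(full_model, num_processing_ics=3)
```

The sketch only models the data movement; how the parts are actually combined (for example, summed or averaged) is not specified by these claims.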

The method according to claim 162, which includes using the exchange subunit of the decomposition system to output the processing results.

The method of claim 162, wherein the one or more memory processing integrated circuits are included in a memory processing subsystem of the decomposition system.

The method of claim 162, wherein at least one of the one or more memory processing integrated circuits is included in the one or more computing subsystems of the decomposition system.

The method of claim 162, wherein at least one of the one or more memory processing integrated circuits is included in the one or more storage subsystems of the decomposition system.

The method of claim 162, wherein at least one of the following is true: (a) the information units are received from at least one of the plurality of processing integrated circuits; and (b) the processing results are sent to one or more of the plurality of processing integrated circuits.

The method according to claim 176, wherein a critical size of a logic cell of the one or more memory processing integrated circuits is at least twice a critical size of a corresponding logic cell of the multiple processing integrated circuits.

The method of claim 176, wherein a critical size of a memory cell of the one or more memory processing integrated circuits is at least twice a critical size of a corresponding logic cell of the multiple processing integrated circuits.

The method according to claim 162, wherein the information units include preprocessed information units.

The method according to claim 179, which includes preprocessing the information units by the plurality of processing integrated circuits to provide the preprocessed information units.

The method according to claim 162, wherein the information units deliver parts of a model of a neural network.

The method according to claim 162, wherein the information units deliver partial results of at least one database query.

The method according to claim 162, wherein the information units deliver partial results of at least one aggregate database query.

A method for accelerating database analysis, the method includes:

Receiving a database query by a memory processing integrated circuit, the database query including at least one relevance criterion that indicates which database entries in a database are relevant to the database query;

Wherein the memory processing integrated circuit includes a controller, a plurality of processor sub-units and a plurality of memory units;

Determining, by the memory processing integrated circuit and based on the at least one relevance criterion, a group of related database entries stored in the memory processing integrated circuit; and

Sending the group of related database entries to one or more processing entities for further processing, substantially without sending irrelevant database entries stored in the memory processing integrated circuit to the one or more processing entities;

Wherein the irrelevant database entries are different from the related database entries.
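
For illustration only, and not as part of the claims, the filtering step of this method can be sketched as follows. The class `MemoryProcessingIC`, its `handle_query` method, the `send` callback standing in for the link to a processing entity, and the example entries are all assumptions of this sketch. The relevance criterion is modeled as a Python predicate.

```python
from typing import Callable, Dict, Iterable, List

Entry = Dict[str, object]


class MemoryProcessingIC:
    """Toy model: entries live in the IC; only relevant ones ever leave it."""

    def __init__(self, entries: Iterable[Entry]) -> None:
        self.entries: List[Entry] = list(entries)

    def handle_query(self,
                     criterion: Callable[[Entry], bool],
                     send: Callable[[Entry], None]) -> int:
        # Determine the group of related entries inside the IC.
        related = [e for e in self.entries if criterion(e)]
        # Send only the related entries; irrelevant ones are never sent.
        for entry in related:
            send(entry)
        return len(related)


ic = MemoryProcessingIC([
    {"id": 1, "country": "TW"},
    {"id": 2, "country": "KR"},
    {"id": 3, "country": "TW"},
])
received: List[Entry] = []
sent_count = ic.handle_query(lambda e: e["country"] == "TW", received.append)
```

The point of the sketch is that the traffic leaving the integrated circuit scales with the number of related entries rather than with the size of the stored data.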

The method of claim 184, wherein the one or more processing entities are included in the plurality of processor subunits of the memory processing integrated circuit.

The method according to claim 185, which includes further processing the group of related database entries by the memory processing integrated circuit to complete a response to the database query.

The method of claim 186, which includes outputting the response to the database query from the memory processing integrated circuit.

The method according to claim 187, wherein the outputting includes applying a flow control processing program.

The method of claim 188, wherein the applying of the flow control processing program is responsive to indicators output from the one or more processing entities, the indicators relating to completion of processing of one or more database entries of the group.

The method of claim 185, which includes further processing the group of related database entries by the memory processing integrated circuit to provide an intermediate response to the database query.

The method of claim 190, which includes outputting the intermediate response to the database query from the memory processing integrated circuit.

The method according to claim 191, wherein the outputting includes applying a flow control processing program.

The method of claim 192, wherein the applying of the flow control processing program is responsive to indicators output from the one or more processing entities, the indicators relating to completion of partial processing of database entries of the group.

The method according to claim 185, which includes generating processing status indicators by the one or more processing entities, the processing status indicators indicating a progress of the further processing of the group of related database entries.

The method of claim 185, which includes using the memory processing integrated circuit to further process the group of related database entries.

The method according to claim 195, wherein the processing is performed by the plurality of processor subunits.

The method of claim 195, wherein the processing includes calculating an intermediate result by one of the plurality of processor sub-units, sending the intermediate result to another of the plurality of processor sub-units, and performing an additional calculation by the other processor sub-unit.

The method according to claim 195, wherein the processing is executed by the controller.

The method according to claim 195, wherein the processing is performed by the plurality of processor subunits and the controller.

The method of claim 184, wherein the one or more processing entities are located outside the memory processing integrated circuit.

The method according to claim 200, which includes outputting the group of related database entries from the memory processing integrated circuit.

The method according to claim 201, wherein the outputting includes applying a flow control processing program.

The method of claim 202, wherein the applying of the flow control processing program is responsive to indicators output from the one or more processing entities, the indicators relating to a relevance of database entries associated with the one or more processing entities.

The method according to claim 184, wherein the plurality of processor sub-units includes a complete arithmetic logic unit.

The method according to claim 184, wherein the plurality of processor subunits include partial arithmetic logic units.

The method of claim 184, wherein the plurality of processor sub-units includes a memory controller.

The method according to claim 184, wherein the plurality of processor sub-units include partial memory controllers.

The method of claim 184, which includes outputting at least one of the following: (i) the group of related database entries, (ii) a response to the database query, and (iii) an intermediate response to the database query.

The method of claim 212, wherein the outputting includes applying traffic shaping.

The method of claim 212, wherein the outputting includes attempting to match a bandwidth used during the outputting with a maximum allowable bandwidth on a link, the link coupling the memory processing integrated circuit to a requester unit.

The method of claim 212, wherein the outputting includes maintaining fluctuations of an output traffic rate below a threshold value.
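
For illustration only, and not as part of the claims, one simple way to cap output bandwidth is to pack queued items into fixed time slots whose byte budget equals the maximum allowable bandwidth of the link per slot. The slot model, the function names `shape_output` and `rate_fluctuation`, and the example sizes are assumptions of this sketch; the claims do not prescribe a particular shaping algorithm.

```python
from typing import List


def shape_output(item_sizes: List[int], max_bytes_per_slot: int) -> List[List[int]]:
    """Pack items, in order, into time slots that never exceed the slot budget."""
    slots: List[List[int]] = [[]]
    used = 0
    for size in item_sizes:
        if size > max_bytes_per_slot:
            raise ValueError("item larger than one slot's budget")
        if used + size > max_bytes_per_slot:
            # Current slot is full for this item: open the next slot.
            slots.append([])
            used = 0
        slots[-1].append(size)
        used += size
    return slots


def rate_fluctuation(slots: List[List[int]]) -> int:
    """Difference between the busiest and the quietest slot, in bytes."""
    totals = [sum(slot) for slot in slots]
    return max(totals) - min(totals)


sizes = [400, 400, 300, 900, 100, 500]  # hypothetical response sizes in bytes
slots = shape_output(sizes, max_bytes_per_slot=1000)
```

A caller could compare `rate_fluctuation(slots)` with a threshold and, if it is exceeded, split or delay items; that policy is left out of the sketch.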

The method according to claim 184, wherein the one or more processing entities include multiple processing entities, wherein at least one of the multiple processing entities belongs to the memory processing integrated circuit, and at least one other of the multiple processing entities does not belong to the memory processing integrated circuit.

The method according to claim 184, wherein the one or more processing entities belong to another memory processing integrated circuit.

A method for accelerating database analysis, the method includes:

Receiving a database query by a plurality of memory processing integrated circuits, the database query including at least one relevance criterion that indicates which database entries in a database are relevant to the database query; wherein each of the plurality of memory processing integrated circuits includes a controller, multiple processor sub-units and multiple memory units;

Determining, by each of the plurality of memory processing integrated circuits and based on the at least one relevance criterion, a group of related database entries stored in the memory processing integrated circuit; and

Sending, by each of the plurality of memory processing integrated circuits, the group of related database entries stored in the memory processing integrated circuit to one or more processing entities for further processing, substantially without sending irrelevant database entries stored in the memory processing integrated circuit to the one or more processing entities; wherein the irrelevant database entries are different from the related database entries.

A method for accelerating database analysis, the method includes:

Receiving a database query by an integrated circuit, the database query including at least one relevance criterion that indicates which database entries in a database are relevant to the database query; wherein the integrated circuit includes a controller, filtering units and multiple memory units;

Determining, by the filtering units and based on the at least one relevance criterion, a group of related database entries stored in the integrated circuit; and

Sending the group of related database entries to one or more processing entities located outside the integrated circuit for further processing, substantially without sending irrelevant database entries stored in the integrated circuit to the one or more processing entities.

A method for accelerating database analysis, the method includes:

Receiving a database query by an integrated circuit, the database query including at least one relevance criterion that indicates which database entries in a database are relevant to the database query;

Wherein the integrated circuit includes a controller, processing units and a plurality of memory units;

Determining, by the processing units and based on the at least one relevance criterion, a group of related database entries stored in the integrated circuit;

Processing the group of related database entries by the processing units, without processing irrelevant database entries stored in the integrated circuit by the processing units, to provide processing results; wherein the irrelevant database entries are different from the related database entries; and

Outputting the processing results from the integrated circuit.

A method for extracting information related to feature vectors, the method comprising:

Receiving, by a memory processing integrated circuit, captured information for use in retrieving a plurality of requested feature vectors mapped to a plurality of sentence segments; wherein the memory processing integrated circuit includes a controller, a plurality of processor sub-units and a plurality of memory units, each of the memory units being coupled to a processor sub-unit;

Retrieving the plurality of requested feature vectors from at least some of the plurality of memory units; wherein the retrieving includes simultaneously requesting, from two or more memory units, requested feature vectors stored in the two or more memory units; and

Outputting, from the memory processing integrated circuit, an output that includes at least one of the following: (a) the requested feature vectors; and (b) a result of processing the requested feature vectors.
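
For illustration only, and not as part of the claims, the retrieval step can be sketched in Python. Memory units are modeled as dictionaries, the mapping from sentence segments to storage locations as a lookup table, and simultaneous requests as tasks in a thread pool. The names `memory_units`, `location`, `read_unit` and `retrieve`, and all data values, are assumptions of this sketch.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from typing import Dict, List, Tuple

# Hypothetical contents: each memory unit maps an address to a feature vector.
memory_units: List[Dict[int, List[float]]] = [
    {0: [1.0, 0.0], 1: [0.5, 0.5]},
    {0: [0.0, 1.0]},
]
# Hypothetical known mapping: sentence segment -> (memory unit index, address).
location: Dict[str, Tuple[int, int]] = {"the": (0, 0), "cat": (1, 0), "sat": (0, 1)}


def read_unit(unit_idx: int, addresses: List[int]) -> Dict[Tuple[int, int], List[float]]:
    """Read all requested addresses from one memory unit."""
    unit = memory_units[unit_idx]
    return {(unit_idx, addr): unit[addr] for addr in addresses}


def retrieve(segments: List[str]) -> List[List[float]]:
    """Fetch vectors for the segments, one concurrent request per memory unit."""
    if not segments:
        return []
    per_unit: Dict[int, List[int]] = defaultdict(list)
    for seg in segments:
        unit_idx, addr = location[seg]
        per_unit[unit_idx].append(addr)
    fetched: Dict[Tuple[int, int], List[float]] = {}
    with ThreadPoolExecutor(max_workers=len(per_unit)) as pool:
        futures = [pool.submit(read_unit, u, addrs) for u, addrs in per_unit.items()]
        for future in futures:
            fetched.update(future.result())
    # Emit the vectors in the order of the sentence segments.
    return [fetched[location[seg]] for seg in segments]


vectors = retrieve(["the", "cat", "sat"])
```

Because the final list is built from the segment order, the output order is independent of the order in which the memory units answer.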

The method of claim 217, wherein the output includes the requested feature vectors.

The method of claim 217, wherein the output includes the result of the processing of the requested feature vectors.

The method according to claim 219, wherein the processing is performed by the plurality of processor subunits.

The method of claim 220, wherein the processing includes sending a requested feature vector from one processing sub-unit to another processing sub-unit.

The method according to claim 220, wherein the processing includes calculating an intermediate result by one processing subunit, sending the intermediate result to another processing subunit, and calculating another intermediate result by the other processing subunit or One processing result.

The method according to claim 219, wherein the processing is executed by the controller.

The method according to claim 219, wherein the processing is performed by the plurality of processor subunits and the controller.

The method according to claim 219, wherein the processing is performed by a vector processor of the memory processing integrated circuit.

The method of claim 217, wherein the controller is configured to simultaneously request the requested feature vectors based on a known mapping between sentence segments and the locations of feature vectors mapped to the sentence segments.

The method according to claim 11, wherein the mapping is uploaded during a boot process of the memory processing integrated circuit.

The method of claim 217, wherein the controller is configured to manage the extraction of the plurality of requested feature vectors.

The method of claim 217, wherein the plurality of sentence sections have a certain order, and wherein the output of the requested feature vectors is performed according to the certain order.

The method according to claim 229, wherein the extraction of the plurality of requested feature vectors is performed according to the certain order.

The method of claim 229, wherein the retrieval of the plurality of requested feature vectors is performed at least partially out of order; and wherein the retrieval further includes reordering the plurality of requested feature vectors.

The method of claim 217, wherein the extraction of the plurality of requested features includes buffering the plurality of requested feature vectors before the plurality of requested feature vectors are read by the controller.

The method of claim 232, wherein the retrieval of the plurality of requested features includes generating buffer status indicators that indicate when one or more buffers associated with the plurality of memory units store one or more requested feature vectors.

The method according to claim 233, which includes transmitting the buffer status indicators via a dedicated control line.

The method according to claim 234, wherein each memory unit is allocated a dedicated control line.

The method of claim 234, wherein the buffer status indicator includes one or more status bits stored in one or more of the buffers.

The method of claim 234, which includes transmitting the buffer status indicators via one or more common control lines.

The method according to claim 217, wherein the captured information is included in one or more capturing commands of a first resolution, and the first resolution represents a certain number of bits.

The method of claim 238, which includes managing the capture by the controller at a finer resolution, the finer resolution representing a number of bits smaller than the certain number of bits.

The method of claim 238, wherein the controller is configured to manage the capture according to a feature vector resolution.

The method of claim 238, which includes independently managing the capture by the controller.

The method according to claim 217, wherein the plurality of processor sub-units includes a complete arithmetic logic unit.

The method according to claim 217, wherein the plurality of processor subunits include partial arithmetic logic units.

The method of claim 217, wherein the plurality of processor sub-units includes a memory controller.

The method according to claim 217, wherein the plurality of processor sub-units include partial memory controllers.

The method of claim 217, wherein the outputting includes applying traffic shaping to the output.

The method of claim 217, wherein the outputting includes attempting to match a bandwidth used during the outputting with a maximum allowable bandwidth on a link, the link coupling the memory processing integrated circuit to a requester unit.

The method of claim 217, wherein the outputting includes maintaining fluctuations of an output traffic rate below a predetermined threshold.

The method of claim 217, wherein the retrieval includes applying a predictive retrieval to at least some of the requested feature vectors from a set of requested feature vectors stored in a single memory unit.

The method of claim 217, wherein the requested feature vectors are distributed among the memory units.

The method of claim 217, wherein the requested feature vectors are distributed among the memory units based on an expected extraction pattern.

A method for memory-intensive processing, the method includes:

Performing processing operations by a plurality of processors included in a hybrid device, the hybrid device including a base die, first memory resources associated with at least one second die, and second memory resources associated with at least one third die; wherein the base die and the at least one second die are connected to each other by an inter-wafer bonding;

Retrieving, by the plurality of processors, information stored in the first memory resources; and

Sending additional information from the second memory resources to the first memory resources, wherein a total bandwidth of a first path between the base die and the at least one second die exceeds a total bandwidth of a second path between the at least one second die and the at least one third die, and wherein a storage capacity of the first memory resources is smaller than a storage capacity of the second memory resources.
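
For illustration only, and not as part of the claims, the relationship between the two memory resources can be modeled as a small, near tier that is refilled from a larger, far tier. The class `HybridMemory`, its first-in-first-out eviction policy, and the example data are assumptions of this sketch; the claims do not specify how the first memory resources are managed.

```python
from collections import OrderedDict
from typing import Dict, Hashable


class HybridMemory:
    """Small first tier (fast path) backed by a larger second tier (slower path)."""

    def __init__(self, first_capacity: int, second: Dict[Hashable, object]) -> None:
        if first_capacity < 1:
            raise ValueError("first tier needs room for at least one item")
        if first_capacity >= len(second):
            raise ValueError("first tier is expected to be smaller than the second")
        self.first: "OrderedDict[Hashable, object]" = OrderedDict()
        self.first_capacity = first_capacity
        self.second = second
        self.transfers = 0  # items sent from the second tier to the first tier

    def read(self, key: Hashable) -> object:
        """Processors read from the first tier; misses pull data from the second."""
        if key not in self.first:
            if len(self.first) >= self.first_capacity:
                self.first.popitem(last=False)  # assumed policy: evict oldest entry
            self.first[key] = self.second[key]
            self.transfers += 1
        return self.first[key]


mem = HybridMemory(first_capacity=2, second={"a": 1, "b": 2, "c": 3})
values = [mem.read(k) for k in ["a", "b", "a", "c"]]
```

In this model only misses use the lower-bandwidth path, which is why a smaller first tier on a higher-bandwidth path can still serve most reads.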

The method according to claim 252, wherein the second memory resources include high-bandwidth memory (HBM) resources.

The method of claim 252, wherein the at least one third die includes a stack of high-bandwidth memory (HBM) chips.

The method of claim 252, wherein at least some of the second memory resources belong to a third die of the at least one third die, and the third die is connected to the base die without using an inter-wafer bonding.

The method of claim 252, wherein at least some of the second memory resources belong to a third die of the at least one third die, and the third die is connected to a second die of the at least one second die without using an inter-wafer bonding.

The method according to claim 252, wherein the first memory resources and the second memory resources include different levels of cache memory.

The method according to claim 252, wherein the first memory resources are positioned between the base die and the second memory resources.

The method of claim 252, wherein the first memory resources are not located on top of the second memory resources.

The method according to claim 252, which includes performing additional processing by one second die in at least one second die, the second die including a plurality of processor subunits and the first memory resources .

The method of claim 260, wherein at least one processor sub-unit is coupled to a dedicated portion of the first memory resources allocated to the processor sub-unit.

The method according to claim 261, wherein the dedicated portion of the first memory resources includes at least one memory bank.

The method according to claim 252, wherein the plurality of processors belong to a memory processing chip that also includes the first memory resources.

The method of claim 252, wherein the base die includes the plurality of processors, wherein the plurality of processors include a plurality of processor subunits coupled to the first memory resources through conductors formed using the inter-wafer bonding.

The method of claim 264, wherein each processor sub-unit is coupled to a dedicated portion of the first memory resources allocated to the processor sub-unit.

A hybrid device for memory-intensive processing, the hybrid device comprising:

A base die;

Multiple processors;

First memory resources associated with at least one second die; and

Second memory resources associated with at least one third die;

Wherein the base die and the at least one second die are connected to each other through an inter-wafer bonding;

Wherein the multiple processors are configured to perform processing operations and to retrieve information stored in the first memory resources;

Wherein the second memory resources are configured to send additional information from the second memory resources to the first memory resources;

Wherein a total bandwidth of a first path between the base die and the at least one second die exceeds a total bandwidth of a second path between the at least one second die and the at least one third die; and

Wherein a storage capacity of the first memory resources is smaller than a storage capacity of the second memory resources.

The hybrid device according to claim 266, wherein the second memory resources include high-bandwidth memory (HBM) resources.

The hybrid device according to claim 266, wherein the at least one third die includes a stack of high-bandwidth memory (HBM) chips.

The hybrid device according to claim 266, wherein at least some of the second memory resources belong to a third die of the at least one third die, and the third die is connected to the base die without using an inter-wafer bonding.

The hybrid device according to claim 266, wherein at least some of the second memory resources belong to a third die of the at least one third die, and the third die is connected to a second die of the at least one second die without using an inter-wafer bonding.

The hybrid device according to claim 266, wherein the first memory resources and the second memory resources include different levels of cache memory.

The hybrid device according to claim 266, wherein the first memory resources are positioned between the base die and the second memory resources.

The hybrid device according to claim 266, wherein the first memory resources are located on one side of the second memory resources.

The hybrid device according to claim 266, wherein a second die of the at least one second die is configured to perform additional processing, wherein the second die includes a plurality of processor subunits and the first memory resources.

The hybrid device of claim 274, wherein each processor sub-unit is coupled to a dedicated portion of the first memory resources allocated to the processor sub-unit.

The hybrid device according to claim 275, wherein the dedicated portion of the first memory resources includes at least one memory bank.

The hybrid device according to claim 266, wherein the plurality of processors are a plurality of processor subunits of a memory processing chip that also includes the first memory resources.

The hybrid device according to claim 266, wherein the base die includes the plurality of processors, and wherein the plurality of processors include multiple processor subunits coupled to the first memory resources through conductors formed using the inter-wafer bonding.

The hybrid device of claim 278, wherein each processor sub-unit is coupled to a dedicated portion of the first memory resources allocated to the processor sub-unit.

A method for database acceleration, the method includes:

Retrieving, by a network communication interface of a database acceleration integrated circuit, an amount of information from storage units;

Performing first processing on the amount of information to provide first processed information;

Sending, by a memory controller of the database acceleration integrated circuit, the first processed information to a plurality of memory processing integrated circuits through an interface, wherein each of the memory processing integrated circuits includes a controller, a plurality of processor sub-units and a plurality of memory units;

Performing, by the plurality of memory processing integrated circuits, second processing on at least part of the first processed information to provide second processed information;

Retrieving, by the memory controller of the database acceleration integrated circuit, retrieved information from the plurality of memory processing integrated circuits, wherein the retrieved information includes at least one of the following: (a) at least a part of the first processed information; and (b) at least a part of the second processed information;

Performing, by a database acceleration unit of the database acceleration integrated circuit, database processing operations on the retrieved information to provide database acceleration results; and

Outputting the database acceleration results.
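
For illustration only, and not as part of the claims, the stages of this method can be lined up as a small Python pipeline. The choice of decompression as the first processing, row filtering as the second processing, and a sum as the database processing operation are assumptions of this sketch, as are all function names and the generated data.

```python
import zlib
from typing import List


def retrieve_from_storage() -> bytes:
    """Stand-in for the network interface: returns compressed 'id,group' rows."""
    rows = "\n".join(f"{i},{i % 3}" for i in range(9))
    return zlib.compress(rows.encode())


def first_processing(blob: bytes) -> List[str]:
    """Assumed first processing: decompress and split into rows."""
    return zlib.decompress(blob).decode().split("\n")


def second_processing(rows: List[str], num_ics: int = 3) -> List[List[str]]:
    """Distribute rows over memory processing ICs; each keeps rows of group 0."""
    shards = [rows[i::num_ics] for i in range(num_ics)]
    return [[row for row in shard if row.endswith(",0")] for shard in shards]


def database_processing(filtered: List[List[str]]) -> int:
    """Assumed database operation: sum the ids of the surviving rows."""
    return sum(int(row.split(",")[0]) for shard in filtered for row in shard)


first = first_processing(retrieve_from_storage())
second = second_processing(first)
result = database_processing(second)
```

Each function corresponds to one claimed step, so swapping in a different first or second processing changes a single stage without touching the others.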

The method according to claim 280, which includes using a management unit of the database acceleration integrated circuit to manage at least one of the retrieval, the first processing, the sending, and the processing.

The method of claim 281, wherein the management is performed based on an execution plan generated by the management unit of the database acceleration integrated circuit.

The method according to claim 281, wherein the management is performed based on an execution plan received by the management unit of the database acceleration integrated circuit but not generated by the management unit.

The method of claim 281, wherein the management includes allocating at least one of the following: (a) network communication interface resources; (b) decompression unit resources; (c) memory controller resources; (d) resources of the plurality of memory processing integrated circuits; and (e) database acceleration unit resources.

The method according to claim 280, wherein the network communication interface includes two or more different types of network communication ports.

The method according to claim 285, wherein the two or more different types of network communication ports include a storage interface protocol port and a common network protocol storage interface port.

The method according to claim 285, wherein the two or more different types of network communication ports include a storage interface protocol port and an Ethernet protocol storage interface port.

The method according to claim 285, wherein the two or more network communication ports of different types include a storage interface protocol port and a PCIe port.

The method according to claim 280, wherein the database acceleration integrated circuit is included in a computing node of a computing system and is controlled by a manager of the computing system.

The method according to claim 280, which includes controlling the at least one of the retrieval, the first processing, the sending, and the third processing by a computing node of a computing system.

The method according to claim 280, which includes executing multiple tasks at the same time by the database acceleration integrated circuit.

The method according to claim 280, which includes using a management unit located outside the database acceleration integrated circuit to manage at least one of the retrieval, the first processing, the sending, and the third processing.

The method according to claim 280, wherein the database acceleration integrated circuit belongs to a computing system.

The method according to claim 280, wherein the database acceleration integrated circuit does not belong to a computing system.

The method according to claim 280, which includes executing at least one of the retrieval, the first processing, the sending, and the third processing based on an execution plan sent to the database acceleration integrated circuit by a computing node of a computing system.

The method according to claim 280, wherein the execution of the database processing operations includes simultaneous execution of database processing commands by database processing subunits, wherein the database acceleration unit includes a group of database accelerator subunits that share a common memory unit.

The method described in claim 296, wherein each database processing subunit is configured to execute a specific type of database processing command.

The method according to claim 297, which includes dynamically linking the database processing subunits to provide an execution pipeline for executing a database processing operation that includes a plurality of instructions.
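
For illustration only, and not as part of the claims, dynamic linking of single-purpose subunits into an execution pipeline can be sketched as function composition. The subunit names, the `link` helper and the example rows are assumptions of this sketch; each subunit handles one type of command, and the pipeline is built at run time from a list of command names.

```python
from functools import reduce
from typing import Callable, Dict, List

# Hypothetical subunits, each executing a single type of database command.
SUBUNITS: Dict[str, Callable] = {
    "filter_in_stock": lambda rows: [r for r in rows if r["qty"] > 0],
    "project_value": lambda rows: [r["price"] * r["qty"] for r in rows],
    "sum": lambda values: sum(values),
}


def link(command_names: List[str]) -> Callable:
    """Chain the named subunits so each one consumes the previous one's output."""
    stages = [SUBUNITS[name] for name in command_names]
    return lambda data: reduce(lambda acc, stage: stage(acc), stages, data)


rows = [
    {"price": 2.0, "qty": 3},
    {"price": 5.0, "qty": 0},
    {"price": 1.5, "qty": 2},
]
pipeline = link(["filter_in_stock", "project_value", "sum"])
total = pipeline(rows)
```

A different operation reuses the same subunits in a different order or subset, for example `link(["project_value", "sum"])` to include rows with zero quantity.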

The method of claim 280, wherein performing the database processing operations includes allocating the resources of the database acceleration integrated circuit based on temporal I/O bandwidth.

The method described in claim 280 includes outputting the database acceleration results to a local storage and retrieving the database acceleration results from the local storage.

The method according to claim 280, wherein the network communication interface includes an RDMA unit.

The method according to claim 280, which includes exchanging information among one or more groups of database acceleration integrated circuits.

The method according to claim 280, which includes exchanging database acceleration results among one or more groups of database acceleration integrated circuits.

The method according to claim 280, which includes exchanging at least one of the following among one or more groups of database acceleration integrated circuits: (a) information; and (b) database acceleration results.

The method according to claim 304, wherein the database acceleration integrated circuits of a group are connected to a common printed circuit board.

The method according to claim 304, wherein the database acceleration integrated circuits of a group belong to a modular unit of a computerized system.

The method according to claim 304, wherein the database acceleration integrated circuits of different groups are connected to different printed circuit boards.

The method according to claim 304, wherein the database acceleration integrated circuits of different groups belong to different modular units of a computerized system.

The method according to claim 304, which includes performing distributed processing by the database acceleration integrated circuits of the one or more groups.

The method according to claim 304, wherein the exchange is performed using network communication interfaces of the database acceleration integrated circuits of the one or more groups.

The method of claim 304, wherein the exchange is performed on multiple groups connected to each other by a star connection.

The method according to claim 304, which includes using at least one switch to exchange at least one of the following between database acceleration integrated circuits of different groups of the one or more groups: (a) information; and (b) database acceleration results.

The method of claim 304, which includes performing distributed processing by at least some of the database acceleration integrated circuits of the one or more groups.

The method according to claim 304, which includes performing distributed processing of first and second data structures, wherein the total size of the first and second data structures exceeds a storage capacity of the plurality of memory processing integrated circuits.

The method of claim 314, wherein performing the distributed processing includes performing multiple iterations of: (a) performing a new allocation of different pairs of a first data structure part and a second data structure part to different database acceleration integrated circuits; and (b) processing the different pairs.

The method of claim 315, wherein performing the distributed processing includes a database join operation.

The method of claim 315, wherein performing the distributed processing includes:

Allocating different first data structure parts to different database acceleration integrated circuits of the one or more groups; and

Performing multiple iterations of each of the following:

Newly allocating different second data structure parts to different database acceleration integrated circuits of the one or more groups, and

Processing the first and second data structure parts by the database acceleration integrated circuits.
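The iteration recited above (fixed first data structure parts, second data structure parts newly allocated on every iteration) can be sketched as a software simulation of a join. The sketch makes several assumptions that are not in the claims: the new allocation is modeled as a simple rotation of the second parts, each circuit is modeled as a list index, and the per-circuit processing is a hash-based equality join. The names `split`, `local_join`, and `distributed_join` are hypothetical.

```python
from typing import Dict, List, Tuple

Record = Tuple[int, str]  # (join key, payload)

def split(records: List[Record], n: int) -> List[List[Record]]:
    """Divide a data structure into n parts, one per simulated circuit."""
    return [records[i::n] for i in range(n)]

def local_join(first_part: List[Record], second_part: List[Record]) -> List[Tuple[int, str, str]]:
    """Processing done by one circuit on the pair of parts it currently holds."""
    index: Dict[int, List[str]] = {}
    for key, payload in second_part:
        index.setdefault(key, []).append(payload)
    return [(key, a, b) for key, a in first_part for b in index.get(key, [])]

def distributed_join(first: List[Record], second: List[Record], num_circuits: int):
    first_parts = split(first, num_circuits)    # allocated once
    second_parts = split(second, num_circuits)  # newly allocated every iteration
    results = []
    for _ in range(num_circuits):
        # Processing step: every circuit joins the pair of parts it holds.
        for circuit in range(num_circuits):
            results.extend(local_join(first_parts[circuit], second_parts[circuit]))
        # New allocation (assumed here to be a rotation): circuit i receives
        # the second part previously held by circuit i + 1.
        second_parts = second_parts[1:] + second_parts[:1]
    return results

first = [(1, "a"), (2, "b"), (3, "c"), (2, "d")]
second = [(2, "x"), (3, "y"), (4, "z"), (2, "w")]
joined = sorted(distributed_join(first, second, 3))
```

After as many iterations as there are circuits, every first part has met every second part exactly once, so the combined result equals the join of the full data structures even though no single circuit ever holds both in full.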

The method of claim 317, wherein the new allocation of a next iteration is performed in a manner that at least partially overlaps in time with the processing of a current iteration.

The method of claim 317, wherein the new allocation includes exchanging the second data structure parts between the different database acceleration integrated circuits.

The method according to claim 319, wherein the exchange is performed in a manner that at least partially overlaps in time with the processing.

The method of claim 317, wherein the new allocation includes: exchanging the second data structure parts between the different database acceleration integrated circuits of a group; and, once that exchange has been completed, exchanging the second data structure parts between different groups of database acceleration integrated circuits.
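One way to picture the two-level exchange (first within a group, then between groups) is the following schedule simulation. It is a sketch under stated assumptions, not the claimed implementation: both exchange levels are modeled as rotations, all groups are assumed to hold the same number of circuits, and `two_level_schedule` is a hypothetical helper that only tracks which second data structure part each circuit has seen.

```python
from typing import List, Set

def two_level_schedule(num_groups: int, per_group: int) -> List[List[Set[int]]]:
    """Return, for each circuit, the set of second-part ids it gets to process."""
    # holding[g][k] is the id of the part currently held by circuit k of group g.
    holding = [[g * per_group + k for k in range(per_group)] for g in range(num_groups)]
    seen: List[List[Set[int]]] = [[set() for _ in range(per_group)] for _ in range(num_groups)]

    for _ in range(num_groups):
        for _ in range(per_group):
            # Processing step: each circuit works on the part it holds.
            for g in range(num_groups):
                for k in range(per_group):
                    seen[g][k].add(holding[g][k])
            # Intra-group exchange (assumed rotation inside each group).
            holding = [row[1:] + row[:1] for row in holding]
        # Intra-group exchange is complete; inter-group exchange follows
        # (assumed rotation of whole groups).
        holding = holding[1:] + holding[:1]
    return seen

seen = two_level_schedule(num_groups=3, per_group=4)
```

In this model the cheaper intra-group traffic (for example, circuits on one board) is exhausted before any traffic crosses group boundaries, and every circuit still ends up seeing every part.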

The method according to claim 280, wherein the database acceleration integrated circuit is included in a blade, and the blade includes a plurality of database acceleration integrated circuits, one or more non-volatile memory units, an Ethernet circuit switch, a PCIe switch, an Ethernet switch, and the plurality of memory processing integrated circuits.

A device for database acceleration, the device including:

A database acceleration integrated circuit; and

A plurality of memory processing integrated circuits, wherein each memory processing integrated circuit includes a controller, a plurality of processor subunits, and a plurality of memory units;

Wherein a network communication interface of the database acceleration integrated circuit is configured to receive information from a storage unit;

The database acceleration integrated circuit is configured to perform first processing on a certain amount of the information to provide first processed information;

A memory controller of the database acceleration integrated circuit is configured to send the first processed information to the plurality of memory processing integrated circuits through an interface;

The plurality of memory processing integrated circuits are configured to perform second processing on at least part of the first processed information to provide second processed information;

The memory controller of the database acceleration integrated circuit is configured to retrieve information from the plurality of memory processing integrated circuits, wherein the retrieved information includes at least one of the following: (a) at least a part of the first processed information; and (b) at least a part of the second processed information;

A database acceleration unit of the database acceleration integrated circuit is configured to perform a database processing operation on the retrieved information to provide a database acceleration result; and wherein the database acceleration integrated circuit is configured to output the database acceleration result.

The device according to claim 323, which is configured to use a management unit of the database acceleration integrated circuit to manage at least one of the retrieval of the retrieved information, the first processing, and the second processing.

The device of claim 324, wherein the management unit is configured to perform management based on an execution plan generated by the management unit of the database acceleration integrated circuit.

The device of claim 324, wherein the management unit is configured to perform management based on an execution plan received by the management unit of the database accelerated integrated circuit but not generated by the management unit.

The device of claim 324, wherein the management unit is configured to perform management by allocating one or more of the following: (a) network communication interface resources; (b) decompression unit resources; (c) memory controller resources; (d) resources of the plurality of memory processing integrated circuits; and (e) database acceleration unit resources.

The device according to claim 323, wherein the network communication interface includes different types of network communication ports.

The device according to claim 328, wherein the different types of network communication ports include storage interface protocol ports and general network protocol storage interface ports.

The device according to claim 328, wherein the different types of network communication ports include storage interface protocol ports and Ethernet protocol storage interface ports.

The device according to claim 328, wherein the different types of network communication ports include storage interface protocol ports and PCIe ports.

The device according to claim 323, wherein the device is coupled to a management unit, and the management unit includes a computing node of a computing system and is controlled by a manager of the computing system.

The device according to claim 323, wherein the device is configured to be controlled by a computing node of a computing system.

The device according to claim 323, wherein the device is configured to perform multiple tasks concurrently by the database acceleration integrated circuit.

The device according to claim 323, wherein the database acceleration integrated circuit belongs to a computing system.

The device according to claim 323, wherein the database acceleration integrated circuit does not belong to a computing system.

The device according to claim 323, which is configured to execute at least one of the retrieval, the first processing, the sending, and the third processing based on an execution plan sent to the database acceleration integrated circuit by a computing node of a computing system.

The device according to claim 323, wherein the database acceleration unit is configured to concurrently execute database processing commands by database processing subunits, wherein the database acceleration unit includes a group of database accelerator subunits that share a common memory unit.

The device of claim 338, wherein each database processing subunit is configured to execute a specific type of database processing command.

The device according to claim 339, wherein the device is configured to dynamically link the database processing subunit to provide an execution pipeline for executing a database processing operation including a plurality of instructions.

The device of claim 323, wherein the device is configured to allocate the resources of the database acceleration integrated circuit based on temporal I/O bandwidth.

The device according to claim 323, wherein the device includes a local storage that can be accessed by the database accelerated integrated circuit.

The device according to claim 323, wherein the network communication interface includes an RDMA unit.

The device according to claim 323, wherein the device includes one or more groups of database acceleration integrated circuits, and the database acceleration integrated circuits are configured to exchange information among the database acceleration integrated circuits of the one or more groups.

The device according to claim 323, wherein the device includes one or more groups of database acceleration integrated circuits, and the database acceleration integrated circuits are configured to exchange database acceleration results among the database acceleration integrated circuits of the one or more groups.

The device according to claim 323, wherein the device includes one or more groups of database acceleration integrated circuits, and the database acceleration integrated circuits are configured to exchange at least one of the following among the database acceleration integrated circuits of the one or more groups: (a) information; and (b) database acceleration results.

The device according to claim 346, wherein the database acceleration integrated circuits of a group are connected to the same printed circuit board.

The device according to claim 346, wherein the database acceleration integrated circuits of a group belong to a modular unit of a computerized system.

The device according to claim 346, wherein the database acceleration integrated circuits of different groups are connected to different printed circuit boards.

The device according to claim 346, wherein the database acceleration integrated circuits of different groups belong to different modular units of a computerized system.

The device of claim 346, wherein the exchange is performed using network communication interfaces of the database acceleration integrated circuits of the one or more groups.

The device of claim 346, wherein the exchange is performed on a plurality of groups connected to each other by a star connection.

The device of claim 346, wherein the device is configured to use at least one switch to exchange at least one of the following between database acceleration integrated circuits of different groups of the one or more groups: (a) information; and (b) database acceleration results.

The device of claim 346, wherein the device is configured to perform distributed processing by some of the database acceleration integrated circuits of some of the one or more groups.

The device of claim 346, wherein the device is configured to perform distributed processing of first and second data structures, wherein the total size of the first and second data structures exceeds a storage capacity of the plurality of memory processing integrated circuits.

The device of claim 355, wherein the device is configured to perform the distributed processing by performing multiple iterations of: (a) performing a new allocation of different pairs of a first data structure part and a second data structure part to different database acceleration integrated circuits; and (b) processing the different pairs.

The device according to claim 355, wherein the distributed processing includes a database join operation.

The device of claim 355, wherein the device is configured to perform the distributed processing by:

Perform multiple iterations of each of the following:

The device of claim 358, wherein the device is configured to perform the new allocation of a next iteration in a manner that at least partially overlaps in time with the processing of a current iteration.

The device of claim 358, wherein the device is configured to perform the new allocation by exchanging the second data structure parts between the different database acceleration integrated circuits.

The device according to claim 360, wherein the database acceleration integrated circuits are configured to perform the exchange in a manner that at least partially overlaps in time with the processing.

The device of claim 358, wherein the device is configured to perform the new allocation by: exchanging the second data structure parts between the different database acceleration integrated circuits of a group; and, once that exchange has been completed, exchanging the second data structure parts between different groups of database acceleration integrated circuits.

The device according to claim 323, wherein the database acceleration integrated circuit is included in a blade, and the blade includes a plurality of database acceleration integrated circuits, one or more non-volatile memory units, an Ethernet circuit switch, a PCIe switch, an Ethernet switch, and the plurality of memory processing integrated circuits.

A method for database acceleration, the method including:

Retrieving information from a storage unit by a network communication interface of a database acceleration integrated circuit;

Performing first processing on a certain amount of the information to provide first processed information;

Sending, by a memory controller of the database acceleration integrated circuit and through an interface, the first processed information to a plurality of memory resources;

Retrieving information from the plurality of memory resources;

Performing, by a database acceleration unit of the database acceleration integrated circuit, database processing operations on the retrieved information to provide database acceleration results; and

Outputting the database acceleration results.

The method of claim 364, further comprising processing the first processed information to provide second processed information, wherein the processing of the first processed information is performed by a plurality of processors, the plurality of processors being located in an integrated circuit that further includes one or more of the plurality of memory resources.

The method of claim 364, wherein the first processing includes filtering database entries.

The method of claim 364, wherein the second processing includes filtering database entries.

The method of claim 364, wherein the first processing and the second processing include filtering database entries.
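The overall flow of the method (retrieval, first processing, sending to memory resources, second processing near the memory, retrieval, and a database processing operation) can be sketched in software as follows. Everything here is an illustrative assumption: the `MemoryResource` class, the `run_acceleration` function, the entry format, the round-robin distribution, and the choice of a sum as the database processing operation are hypothetical and are not taken from the claims.

```python
from typing import Callable, Dict, Iterable, List

Entry = Dict[str, int]
Predicate = Callable[[Entry], bool]

class MemoryResource:
    """Hypothetical stand-in for one memory resource with local processors."""

    def __init__(self) -> None:
        self.entries: List[Entry] = []

    def store(self, entries: Iterable[Entry]) -> None:
        self.entries = list(entries)

    def second_processing(self, keep: Predicate) -> None:
        # Filtering performed next to the stored data.
        self.entries = [e for e in self.entries if keep(e)]

    def retrieve(self) -> List[Entry]:
        return list(self.entries)

def run_acceleration(storage_unit: Iterable[Entry], first_filter: Predicate,
                     second_filter: Predicate, resources: List[MemoryResource]) -> int:
    retrieved = list(storage_unit)                               # retrieval from storage
    first_processed = [e for e in retrieved if first_filter(e)]  # first processing
    for i, resource in enumerate(resources):                     # sending (round-robin)
        resource.store(first_processed[i::len(resources)])
    for resource in resources:                                   # second processing
        resource.second_processing(second_filter)
    gathered = [e for r in resources for e in r.retrieve()]      # retrieval from memory
    return sum(e["amount"] for e in gathered)                    # database operation (assumed: SUM)

storage = [
    {"region": 1, "amount": 10},
    {"region": 2, "amount": 20},
    {"region": 1, "amount": 30},
    {"region": 1, "amount": 5},
]
total = run_acceleration(
    storage,
    first_filter=lambda e: e["region"] == 1,
    second_filter=lambda e: e["amount"] >= 10,
    resources=[MemoryResource(), MemoryResource()],
)
```

Filtering in both stages reduces the volume of data moved at each hop, which is the practical motivation for placing filters in the first and second processing in this sketch.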