TW202414200A - Reducing memory bank conflicts in a hardware accelerator - Google Patents


Info

Publication number: TW202414200A
Application number: TW112127069A
Authority: TW (Taiwan)
Other languages: Chinese (zh)
Inventor: 安德烈 阿尤波夫 (Andrei Ayupov)
Original assignee / applicant: 美商谷歌有限責任公司 (Google LLC)
Prior art keywords: memory, bank, requests, physical memory, accessing


Abstract

Methods and systems, including computer-readable media, are described for reducing or preventing memory bank conflicts in a hardware accelerator to allow for concurrent access of memory banks at a hardware accelerator. A compute tile of the hardware accelerator receives requests that are used to access a tile memory of the accelerator. For each of the requests: a logical address represented by a sequence of bits is identified in the request and a first subset of bits is obtained from the sequence. An identifier is generated based on a bank generation function that uses the first subset of bits. The identifier identifies a particular bank among physical memory banks of the tile memory. Each request is processed using the respective bank identifier that is generated for that request. Multiple distinct memory banks are accessed concurrently during the same clock cycle in response to processing the requests.

Description

Reducing memory bank conflicts in a hardware accelerator

This specification generally relates to memory operations of a hardware integrated circuit.

A neural network is a machine learning model that employs one or more layers of nodes to generate an output (e.g., a classification) for a received input. Some neural networks include one or more hidden layers in addition to an output layer. Some neural networks can be convolutional neural networks (CNNs) configured for image processing, or recurrent neural networks (RNNs) configured for speech and language processing. Different types of neural network architectures can be used to perform a variety of tasks related to classification or pattern recognition, prediction involving data modeling, and information clustering.

A neural network layer can have a corresponding set of parameters, or weights. The weights are used to process inputs (e.g., a batch of inputs) through the layer to generate a corresponding layer output for computing a neural network inference. A batch of inputs and a set of kernels can be represented as a tensor (i.e., a multi-dimensional array) of inputs and weights. A hardware accelerator is a special-purpose integrated circuit for implementing neural networks. The circuit includes memory with locations that correspond to elements of a tensor and that can be traversed or accessed using control logic of the circuit.

This document describes techniques for reducing (or preventing) memory bank conflicts at the tile memory of a hardware accelerator to allow concurrent access of its physical memory banks.

A compute tile of the hardware accelerator receives requests for accessing a tile memory of the accelerator. For each of the requests: i) a logical address represented by a sequence of bits is identified in the request; ii) a first subset of bits is obtained from the sequence; and iii) an identifier is generated based on a bank generation function that uses the first subset of bits.

The identifier identifies a particular bank among the physical memory banks of the tile memory. Each request is processed using the respective bank identifier ("bank ID") generated for that request. In response to processing the requests, multiple distinct memory banks are accessed concurrently during the same clock cycle (e.g., a single clock cycle).

One aspect of the subject matter described in this specification can be embodied in a computer-implemented method for concurrently accessing memory banks of a hardware accelerator. The method includes receiving multiple requests, where each request is for accessing a tile memory of the hardware accelerator. For each of the multiple requests, the method includes: identifying, in the request, a respective logical address represented by a sequence of bits; obtaining a first subset of bits from the sequence of bits; and generating, based on a bank generation function that uses the first subset of bits, a respective bank identifier that identifies a particular bank among a plurality of physical memory banks of the tile memory. The method further includes: i) processing each of the multiple requests using the respective bank identifier generated for the request; and ii) in response to processing each of the multiple requests, concurrently accessing multiple distinct physical memory banks of the tile memory during a clock cycle.
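
The per-request steps of the method above (identify the logical address, obtain a subset of its bits, run the bank generation function) can be sketched as follows. The patent text does not disclose the exact bank generation function, so the XOR-fold hash below is an illustrative stand-in; the 32-bank, 16-byte-row parameters match the example implementation described later.

```python
def bank_id_for_request(logical_addr: int, num_banks: int = 32, row_width: int = 16) -> int:
    """Sketch of steps i-iii for one request: the logical address is a bit
    sequence; a first subset of its bits feeds a bank generation function
    that returns a bank ID. The XOR-fold hash here is an assumed example,
    not the patented function."""
    offset_bits = (row_width - 1).bit_length()       # 16-byte rows -> 4 offset bits
    bank_bits = (num_banks - 1).bit_length()         # 32 banks -> 5 bank bits
    row_index = logical_addr >> offset_bits          # drop the byte offset
    first_subset = row_index & (num_banks - 1)       # low bits of the row index
    second_subset = (row_index >> bank_bits) & (num_banks - 1)
    return first_subset ^ second_subset              # bank generation function (assumed)
```

With this stand-in, even addresses separated by the problematic 512-byte stride map to distinct banks, which is the property the claimed function is said to provide.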

These and other implementations can each optionally include one or more of the following features. For example, in some implementations, concurrently accessing the multiple distinct physical memory banks includes accessing the multiple distinct physical memory banks during a single clock cycle for a particular stride value. The particular stride value can be a memory access stride equal to a difference between particular rows of the same physical memory bank.

In some implementations, concurrently accessing the multiple distinct physical memory banks includes accessing the multiple distinct physical memory banks without a bank conflict. In one aspect, a bank conflict occurs when two or more requestors request access to the same physical memory bank of the tile memory during the same clock cycle. Each physical memory bank can include multiple rows, and the method can include, for each of the multiple requests: i) obtaining a second subset of bits from the sequence of bits; ii) providing the second subset of bits as an input to the bank generation function; and iii) based on applying the bank generation function to the second subset of bits, generating a respective row identifier that identifies a particular row in the particular bank among the multiple physical memory banks of the tile memory.

In some implementations, each row of the multiple rows has a width of 16 bytes; a partition of the tile memory includes 32 physical memory banks; and the memory access stride is equal to row_width * num_banks. Obtaining a first subset of bits from the sequence of bits can include obtaining two or more bits from among the least significant bits (LSBs) of the sequence.
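
With those numbers the stride works out to 512 bytes, and a hypothetical naive bank selector (low bits of the row index, not the disclosed function) shows why that stride is the troublesome case:

```python
ROW_WIDTH = 16                      # bytes per row
NUM_BANKS = 32                      # banks per tile-memory partition
STRIDE = ROW_WIDTH * NUM_BANKS      # memory access stride: 16 * 32 = 512 bytes

# Hypothetical naive selector: bank = low bits of the row index.
def naive_bank(addr: int) -> int:
    return (addr // ROW_WIDTH) % NUM_BANKS

# Eight requesters stepping by the stride all land in bank 0: an 8-way conflict
# that would serialize the accesses over eight clock cycles.
banks = [naive_bank(k * STRIDE) for k in range(8)]
assert banks == [0] * 8
```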

Other implementations of this and other aspects include corresponding systems, apparatus, and computer programs, encoded on computer storage devices, configured to perform the actions of the method. A system of one or more computers can be so configured by virtue of software, firmware, hardware, or a combination of them installed on the system that in operation causes the system to perform the actions. One or more computer programs can be so configured by virtue of having instructions that, when executed by a data processing apparatus, cause the apparatus to perform the actions.

The subject matter described in this specification can be implemented in particular embodiments to realize one or more of the following advantages. A bank generation function is disclosed that can be used to process multiple requests for memory resources of a tile memory without triggering a bank conflict.

The bank generation function is configured to generate a respective bank ID for each request in a group of requests such that no two requests will need to access the same physical memory bank of the tile memory. The bank generation function provides an access pattern that allows each request in the group to be processed in parallel during the same clock cycle (e.g., a single clock cycle). The access pattern provided by the bank generation function allows multiple requests to be processed concurrently, irrespective of the stride value used for the memory access.
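
The stride-independence property can be checked with a small simulation. The XOR-fold hash is again an assumed stand-in for the disclosed function; under it, a group of eight requesters stepping by any power-of-two stride from 16 to 512 bytes lands in eight distinct banks, so the whole group can issue in one cycle.

```python
def bank_id(addr: int) -> int:
    """Assumed XOR-fold bank hash for 32 banks of 16-byte rows."""
    row_index = addr >> 4                               # drop the 16-byte offset
    return (row_index & 31) ^ ((row_index >> 5) & 31)

# Eight concurrent requesters, each offset by the same stride from base 0.
for stride in (16, 32, 64, 128, 256, 512):              # strides in bytes
    banks = {bank_id(k * stride) for k in range(8)}
    assert len(banks) == 8                              # all distinct: no conflict
```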

The details of one or more implementations of the subject matter described in this specification are set forth in the accompanying drawings and the description below. Other potential features, aspects, and advantages of the subject matter will become apparent from the description, the drawings, and the claims.

FIG. 1A is a block diagram of an example computing system 100 for implementing a neural network model at a hardware integrated circuit, such as a machine learning hardware accelerator. The computing system 100 includes one or more compute tiles 101, a host 120, and a higher-level controller 125 ("controller 125"). As described in more detail below, the host 120 and the controller 125 cooperate to provide data sets and instructions to the one or more compute tiles 101 of the system 100.

In some implementations, the host 120 and the controller 125 are the same device. The host 120 and the controller 125 can also perform distinct functions while being integrated in a single device package. For example, the host 120 and the controller 125 can form a central processing unit (CPU) that interacts or cooperates with a hardware accelerator that includes multiple compute tiles 101. In some implementations, the host 120, the controller 125, and the multiple compute tiles 101 are included or formed on a single integrated circuit die. For example, the host 120, the controller 125, and the multiple compute tiles 101 can form a special-purpose system-on-chip (SoC) optimized for executing neural network models to process machine learning workloads.

Each compute tile 101 generally includes a controller 103 that provides one or more control signals 105 to cause inputs (or activations) of an input vector 102 to be stored at, or accessed from, a memory location of a first memory 108 ("memory 108"). Likewise, the controller 103 can also provide one or more control signals 105 to cause weights (or parameters) of a matrix structure of weights 104 to be stored at, or accessed from, a memory location of a second memory 110 ("memory 110"). In some implementations, the input vector 102 is obtained from an input tensor, while the matrix structure of weights is obtained from a parameter tensor. Each of the input tensor and the parameter tensor can be a multi-dimensional data structure, such as a multi-dimensional matrix or tensor. This is described in more detail below with reference to FIG. 6.

Each memory location of the memories 108, 110 can be identified by a corresponding memory address, such as a logical address that has a corresponding mapping to a physical row of a physical memory bank of the memory. Referring to the example of FIG. 1B, a compute tile 101 can derive a set of contiguous addresses (e.g., virtual/logical addresses) from a group of requests 130. For example, the set of contiguous addresses can be derived with reference to a logical memory that corresponds to the physical memory 108 of the compute tile 101. This is also described below with reference to the implementations of FIG. 4 and FIG. 5.

The logical memory has multiple logical ports 135, and each port can be connected to, or associated with, a different requestor that requests access to the physical resources of the memory 108. To process a given access request, for each port, the tile 101 (or its controller 103) determines, based on an address in the request, a bank to which the request will be routed. For each bank, the compute tile 101 can include an arbiter 140 that arbitrates access to that bank from the multiple ports, for example, according to a unique bank generation function configured to mitigate some (or all) requests being routed to the same physical memory bank. The logical memory, its ports, and the arbiters can be implemented in software, hardware, or both. In some implementations, the logical memory, its ports, and the arbiters are controlled based on control signals generated by the controller 103.
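
A minimal software model of the port-to-bank routing and per-bank arbitration just described might look like the following; the bank hash and the single-grant-per-bank policy are illustrative assumptions, not the disclosed hardware.

```python
from collections import defaultdict

def bank_id(addr: int) -> int:
    """Assumed XOR-fold bank hash for 32 banks of 16-byte rows."""
    row_index = addr >> 4
    return (row_index & 31) ^ ((row_index >> 5) & 31)

def arbitrate(port_requests: dict[int, int]) -> dict[int, int]:
    """Route each port's requested address to a bank, then grant one port
    per bank per cycle (toy policy: lowest port number wins)."""
    by_bank = defaultdict(list)
    for port, addr in sorted(port_requests.items()):
        by_bank[bank_id(addr)].append(port)
    return {bank: ports[0] for bank, ports in by_bank.items()}

# Four ports issuing 512-byte-strided addresses map to four different banks,
# so every port is granted in the same cycle.
grants = arbitrate({port: port * 512 for port in range(4)})
assert sorted(grants.values()) == [0, 1, 2, 3]
```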

Each of the memories 108, 110 can be implemented as a series of physical banks, units, or any other related storage medium or device. Each of the memories 108, 110 can include one or more registers, buffers, or both. In some implementations, the memory 108 is an input/activation memory, while the memory 110 is a parameter memory. In some other implementations, inputs or activations are stored at the memory 108, the memory 110, or both; and weights are stored at the memory 108, the memory 110, or both. For example, inputs and weights can be transferred between the memory 108 and the memory 110 to facilitate particular neural network computations. In some implementations, each of the memory 108 and the memory 110 is referred to as a tile memory.

Each compute tile 101 also includes an input activation bus 106, an output activation bus 107, and a compute unit 112 that includes one or more hardware multiply-accumulate circuits (MACs) in each of the cells 114a/b/c. The controller 103 can generate control signals 105 to obtain operands stored at the memories of the compute tile 101. For example, the controller 103 can generate control signals 105 to obtain: i) an example input vector 102 stored at the memory 108; and ii) the weights 104 stored at the memory 110. Each input obtained from the memory 108 is provided to the input activation bus 106 for routing (e.g., direct routing) to a compute cell 114a/b/c in the compute unit 112. Similarly, each weight obtained from the memory 110 is routed to a cell 114a/b/c of the compute unit 112.

As described below, each cell 114a/b/c performs computations that produce partial sums or accumulated values used to generate outputs for a given neural network layer. An activation function can be applied to a set of outputs to generate a set of output activations for the neural network layer. In some implementations, the outputs or output activations are routed via the output activation bus 107 for storage and/or transfer. For example, a set of output activations can be transferred from a first compute tile 101 to a second, different compute tile 101 to be processed at the second compute tile 101 as input activations for a different layer of the neural network.

In general, each compute tile 101 and the system 100 can include additional hardware structures to perform computations associated with multi-dimensional data structures, such as tensors, matrices, and/or data arrays. In some implementations, inputs for an input vector (or tensor) 102 and weights 104 for a parameter tensor can be pre-loaded into the memories 108, 110 of the compute tile 101. The inputs and weights are received as sets of data values that arrive at a particular compute tile 101 from a host 120 (e.g., an external host) via a host interface, or from a higher-level control such as the controller 125.

Each of the compute tile 101 and the controller 103 can include one or more processors, processing devices, and various types of memory. In some implementations, the processors of the compute tile 101 and the controller 103 include one or more devices, such as microprocessors or central processing units (CPUs), graphics processing units (GPUs), application-specific integrated circuits (ASICs), or a combination of different processors. Each of the compute tile 101 and the controller 103 can also include other computing and storage resources, such as buffers, registers, control circuitry, and so on. These resources cooperate to provide additional processing options for performing one or more of the determinations and calculations described in this specification.

In some implementations, the processing unit(s) of the controller 103 execute programmed instructions stored in memory to cause the controller 103 and the compute tile 101 to perform one or more functions described in this specification. The memory of the controller 103 can include one or more non-transitory machine-readable storage media. A non-transitory machine-readable storage medium can include solid-state memory, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (e.g., EPROM, EEPROM, or flash memory), or any other tangible medium capable of storing information or instructions.

The system 100 receives instructions that define a particular compute operation to be performed by a compute tile 101. In some implementations, a host can generate sets of parameters (i.e., weights) and corresponding inputs for processing at a neural network layer. For example, a host can generate, for a given operation, sets of compressed sparse parameters (CSPs) and corresponding mapping vectors (e.g., a non-zero map (NZM)) that map a neural network input to a non-zero parameter in a set of CSPs. The host 120 can send the parameters via a host interface to a compute tile 101 for further processing at the tile. The controller 103 can execute programmed instructions to analyze a data stream associated with the received weights and inputs, including the compressed sparse parameters and the corresponding mapping vectors.

The controller 103 causes the inputs and weights of the data stream to be stored at the compute tile 101. For example, the controller 103 can store the mapping vectors and the compressed sparse parameters in the local tile memory of the compute tile 101. This is described in more detail below. The controller 103 can also analyze the input data stream to detect an operation code ("opcode"). The system 100 can support various types of opcodes, such as opcode types that indicate vector-matrix multiplications, element-wise vector operations, and whether a given operation uses compressed sparse parameters or uncompressed parameters/weights.

Based on one or more opcodes, the controller 103 can launch or execute a bank generation function (described below) to arbitrate requests for accessing a tile memory of a compute tile 101. For example, the controller 103 uses the bank generation function to arbitrate two or more requests such that each of the two or more requests is processed against a different physical bank of the tile memory. The controller 103 can employ a predetermined bank selection scheme that is programmed or encoded at the controller 103 before inference determinations are performed at the compute tile 101.

In some implementations, a given compute operation involves multiple requestors/accessors that each need to access resources of the memory 108. For example, a compute workload executed at a compute tile 101 can trigger memory access requests due to tensor traversal operations that require read and write access to respective address locations of the memory 108. As described below, these address locations can correspond to elements of an input tensor that is processed as part of the workload.

In addition to tensor read/write operations, processing a workload can also involve processing read or write access requests to: i) move data (e.g., parameters) from the memory 108 (narrow) to the memory 110 (wide); and ii) move data from the memory 110 (wide) to the memory 108 (narrow). In some cases, to process an example workload, a first compute tile 101 arbitrates and executes access requests (e.g., read/write requests) against the memory 108, where the requests are based on external data communications that originate outside the first compute tile 101.

These different types of access requests correspond to one or more compute threads that all need to access the same physical memory 108. In some cases, multiple requests can correspond to a single compute thread (or clock cycle). The disclosed bank generation function can be used to process multiple requests for memory resources of a tile memory without triggering a bank conflict. For example, the bank generation function is configured to generate a respective bank ID for each request in a group of requests such that no two requests will need to access the same physical memory bank of the tile memory.

In some implementations, based on the bank identifiers returned as outputs of the bank generation function, the controller 103 is able to arbitrate requests by processing two or more requests against different physical memory banks of the tile memory during the same clock cycle (e.g., a single clock cycle). The bank generation function allows the controller 103 to realize these advantages even for particularly problematic strides, such as memory accesses where the stride varies as a function of the number of banks and the number of bytes in a row of a bank. This is described in more detail below.

Additionally, based on the opcodes, the controller 103 can activate special-purpose datapath logic associated with one or more compute cells 114a/b/c to perform computations (e.g., sparse computations) using the parameters and the inputs/activations, along with the corresponding mapping vectors for mapping the inputs/activations to a subset of the parameters. As used in this document, a sparse computation includes a neural network computation performed for a neural network layer using the non-zero weight values in a set of compressed sparse parameters generated from a set of weights of that layer.

In some implementations, an opcode indicates details about the operations associated with the inputs and weights of a given layer, such as the sparsity of one or more parameter tensors associated with the layer. The controller 103: i) detects the opcode, including any relevant tensor sparsity information; ii) uses local read logic to obtain parameters from a tile memory (e.g., the memory 108 or 110) based on the opcode; and iii) writes or routes those parameters to the cells 114a/b/c of the compute tile 101. The controller 103 can also analyze an example data stream and, based on the analysis, generate a set of compressed sparse parameters and a corresponding mapping vector that maps discrete inputs of an input vector to respective non-zero weight values in the compressed sparse parameters. To the extent operations and/or processes for generating compressed sparse parameters and corresponding mapping vectors are described with reference to the controller 103, each of those operations and processes can also be performed by the host 120, the controller 125, or both.
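
One plausible encoding of compressed sparse parameters and their mapping vector is sketched below. The exact CSP/NZM layout is not specified in this section, so the index-list form of the mapping vector is an assumption for illustration.

```python
weights = [0, 3, 0, 0, 8, 5]                    # one row of a parameter tensor

# Compressed sparse parameters: keep only the non-zero weights.
csp = [w for w in weights if w != 0]
# Mapping vector (NZM-style): which input index each kept weight pairs with.
nzm = [i for i, w in enumerate(weights) if w != 0]

inputs = [1, 2, 3, 4, 5, 6]
# The compressed dot product touches only the non-zero terms...
sparse_sum = sum(inputs[i] * w for i, w in zip(nzm, csp))
# ...and matches the dense computation over all six terms.
assert sparse_sum == sum(a * w for a, w in zip(inputs, weights))
```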

In some implementations, some (or all) operations, such as analyzing tensor indices, performing direct memory access (DMA) operations to read address spaces in system memory (e.g., SRAM, DRAM) to obtain input and weight values, generating the compressed sparse parameters, and generating the corresponding mapping vectors, are performed at the host 120. Doing so reduces the processing time at each compute tile 101 and improves the data throughput of the system 100. For example, performing these operations at the host 120 using the controller 125 allows a set of already-compressed parameters to be sent to a given compute tile 101, which reduces the size and amount of data routed at the system 100.

FIG. 2 shows an example processing pipeline 200 for routing inputs and outputs between a memory and compute cells of a hardware integrated circuit. In general, the pipeline 200 uses the input bus 106 to route inputs obtained from a memory location of the memory 108 to one or more compute cells 114, and uses the output bus 107 to route outputs produced by multiplications performed at the one or more compute cells 114 to a memory location of the memory 108.

The pipeline 200 utilizes a hardware architecture in which the input bus 106 is coupled (e.g., directly coupled) to each of multiple groupings of hardware compute cells of the special-purpose integrated circuit. The system 100 can provide a first operand from a location in the memory 108 and a second operand from a location in the memory 110. The operands 204 are used for a computation performed at a cell 114, and the cell 114 generates an output corresponding to a result of the computation. In some implementations, the computation is used for a machine learning operation, such as processing inputs through a neural network layer of an artificial neural network.

In this implementation, a compute tile 101 can provide first operands corresponding to inputs or activations (e.g., a0, a1, a2, etc.) of an input feature map to a subset of the cells 114. For example, a respective input of an input vector 102 is provided via the input bus 106 of the compute tile 101 to each MAC in the subset. The system 100 can perform this broadcast operation across multiple compute tiles 101 to compute the products for a given neural network layer using the respective input groupings and corresponding weights at each compute tile 101. At a given compute tile 101, a product is computed by multiplying a respective input (e.g., a1) at each MAC in the subset with a corresponding weight (e.g., w1) using the multiplication circuitry of the MAC.

系統100可基於在一運算單元112之胞元114a/b/c之一子集中之一胞元114a/b/c之各MAC處運算之多個各自乘積之累加來產生層之一輸出。如下文參考圖6闡釋,在一運算晶片塊101內執行之乘法運算可涉及:i)儲存於記憶體108之一記憶體位置處之一第一運算元(例如,一輸入或啟動),其對應於一輸入張量之一各自元素;及ii)儲存於記憶體110之一記憶體位置處之一第二運算元(例如,一權重),其對應於一參數張量之一各自元素。The system 100 may generate an output of a layer based on the accumulation of multiple respective products computed at each MAC of a cell 114a/b/c in a subset of cells 114a/b/c of a computing unit 112. As explained below with reference to FIG. 6, a multiplication operation performed within a computing chip block 101 may involve: i) a first operand (e.g., an input or activation) stored at a memory location of the memory 108, which corresponds to a respective element of an input tensor; and ii) a second operand (e.g., a weight) stored at a memory location of the memory 110, which corresponds to a respective element of a parameter tensor.

在圖2之實例中,一移位暫存器202可提供移位功能性,其中將運算元204之一輸入廣播至輸入匯流排106上並路由至胞元114之一或多個MAC。在一些實施方案中,移位暫存器202在一運算晶片塊101處啟用一或多種輸入廣播模式。例如,移位暫存器202可用於自記憶體108循序地(一個接一個地)廣播輸入(第一廣播模式),自記憶體108同時(例如,並行)廣播輸入(第二廣播模式),或使用此等廣播模式之某一組合廣播輸入。移位暫存器202可為記憶體108之一整合式功能,且可在硬體、軟體或兩者中實施。In the example of FIG. 2 , a shift register 202 may provide shift functionality, wherein an input of an operator 204 is broadcast onto the input bus 106 and routed to one or more MACs of the cell 114. In some implementations, the shift register 202 enables one or more input broadcast modes at a computing die 101. For example, the shift register 202 may be used to broadcast inputs sequentially (one by one) from the memory 108 (a first broadcast mode), broadcast inputs simultaneously (e.g., in parallel) from the memory 108 (a second broadcast mode), or broadcast inputs using some combination of these broadcast modes. The shift register 202 may be an integrated function of the memory 108 and may be implemented in hardware, software, or both.

在一些實施方案中,運算元206之權重(w3)可具有零之一權重值。當控制器103判定權重(w3)具有一零值時,為節省處理資源,可略過一輸入(a2)與權重(w3)之間的一乘法,使得彼等運算元未被路由至一胞元114 a/b/c或由胞元114 a/b/c消耗。略過該特定乘法運算之判定可基於將一輸入向量之離散輸入(a_n)映射至一參數張量之個別權重(w_n)之一映射向量,如上文所描述。In some implementations, the weight (w3) of operand 206 may have a weight value of zero. When the controller 103 determines that the weight (w3) has a zero value, a multiplication between an input (a2) and the weight (w3) may be skipped to save processing resources, so that those operands are not routed to, or consumed by, a cell 114a/b/c. The determination to skip that particular multiplication may be based on a mapping vector that maps discrete inputs (a_n) of an input vector to individual weights (w_n) of a parameter tensor, as described above.
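The zero-weight skip can be modeled as follows. This is a hedged sketch (the function and variable names are illustrative, not the patent's): any input/weight pair whose weight is zero is skipped rather than routed to a compute cell.

```python
def accumulate_skipping_zero_weights(inputs, weights):
    # Skip any input/weight pair whose weight is zero, so those
    # operands are never routed to (or consumed by) a compute cell.
    total = 0
    skipped = 0
    for a, w in zip(inputs, weights):
        if w == 0:
            skipped += 1
            continue
        total += a * w
    return total, skipped

# The last input pairs with a zero weight, so one multiply is skipped.
total, skipped = accumulate_skipping_zero_weights([1, 2, 3], [4, 5, 0])
```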

圖3繪示展示用於產生用於存取記憶體108之實體記憶體庫之位址之不同函數302、304之記憶體/資料遍歷之實例的一遍歷表300。特定言之,針對一第一函數302展示一第一記憶體遍歷,而針對一第二函數304展示一第二、不同記憶體遍歷。在一些實施方案中,記憶體108包含一或多個記憶體分割區。FIG. 3 illustrates a traversal table 300 showing examples of memory/data traversals for different functions 302, 304 used to generate addresses for accessing a physical memory bank of the memory 108. Specifically, a first memory traversal is shown for a first function 302, and a second, different memory traversal is shown for a second function 304. In some implementations, the memory 108 includes one or more memory partitions.

在圖3之實例中,記憶體108之一分割區包含經展示為庫0至庫31之32個實體記憶體庫312。各庫可包含一定數目個列,例如,16列、24列等。在一些實施方案中,各列係16位元組(16B)寬,使得一運算晶片塊101以16B組塊(chunk) (例如,128位元)自記憶體108存取資料。在一些實施方案中,記憶體108可包含更多或更少實體記憶體庫且一庫之各列可具有大於16B或小於16B (例如,1B)之一寬度。In the example of FIG. 3, a partition of memory 108 includes 32 physical memory banks 312, shown as bank 0 through bank 31. Each bank may include a number of rows, e.g., 16 rows, 24 rows, etc. In some implementations, each row is 16 bytes (16B) wide, so that a computing chip block 101 accesses data from memory 108 in 16B chunks (e.g., 128 bits). In some implementations, memory 108 may include more or fewer physical memory banks, and each row of a bank may have a width greater than 16B or less than 16B (e.g., 1B).
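Under the layout just described (32 banks, 16-byte rows), a byte address decomposes into a byte-within-row offset, a conventional bank number, and a row within the bank. A minimal sketch of that decomposition (the variable names are ours, not the patent's):

```python
ROW_BYTES = 16   # each row is 16 bytes (16B) wide
NUM_BANKS = 32   # banks 0 through 31 in the partition

def decompose(addr):
    byte_in_row = addr % ROW_BYTES        # offset within a 16B row
    row_address = addr // ROW_BYTES       # which 16B row, globally
    bank_id = row_address % NUM_BANKS     # conventional bank number
    row_id = row_address // NUM_BANKS     # row within that bank
    return bank_id, row_id, byte_in_row

# Consecutive 16B chunks fall in consecutive banks; after 32 rows
# (512 bytes) the decomposition wraps back to bank 0, one row deeper.
```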

遍歷表300包含一第一位址306、一第二位址308及一第三位址310。第一位址306、第二位址308及第三位址310之各者之間的記憶體遍歷距離可基於一步幅值。在圖3之實例中,步幅值係512。然而,在其他實例中,步幅值可為不同的。因此,其他步幅值係在本發明之範疇內。當運算系統100執行一給定任務或機器學習工作負載之運算時,可能需要一步幅操作。步幅操作可基於一步幅參數或值。一給定步幅值在系統100處可為可程式設計的。例如,系統100可基於系統100之一編譯器已知為特定於一特定推理操作之特定有問題的步幅來針對一給定推理程式設計一步幅。The traversal table 300 includes a first address 306, a second address 308, and a third address 310. The memory traversal distance between each of the first address 306, the second address 308, and the third address 310 may be based on a stride value. In the example of FIG. 3, the stride value is 512. However, in other examples the stride value may be different; other stride values are therefore within the scope of this disclosure. A stride operation may be required when the computing system 100 performs computations for a given task or machine learning workload. The stride operation may be based on a stride parameter or value. A given stride value may be programmable at the system 100. For example, the system 100 may program a stride for a given inference based on particular problematic strides that a compiler of the system 100 knows to be specific to a particular inference operation.

在一些實施方案中,基於記憶體108之一硬體組態、經執行之機器學習操作之類型或兩者來判定步幅值。例如,系統100之運算晶片塊101可用於實施經調諧或使用以用於影像及視訊內容之壓縮及/或辨識之一神經網路(例如,一卷積神經網路)。一步幅可為神經網路之一組成部分。在此實例中,機器學習操作涉及透過神經網路之一層根據對應於該層之一組權重/參數之一濾波器來處理影像。In some implementations, the stride value is determined based on a hardware configuration of memory 108, the type of machine learning operation being performed, or both. For example, computing chip block 101 of system 100 may be used to implement a neural network (e.g., a convolutional neural network) tuned or used for compression and/or recognition of image and video content. A stride may be a component of the neural network. In this example, the machine learning operation involves processing an image through a filter of a layer of the neural network according to a set of weights/parameters corresponding to the layer.

參考記憶體之實體列及庫之一硬體組態,與影像相關聯之影像或像素值可跨記憶體108之實體列及庫儲存。在此實例中,一步幅係神經網路之一濾波器(或核心)之一組成部分或參數。步幅係用於修改濾波器在一影像或視訊上移動之一量。例如,當步幅經設定為1時,則運算晶片塊101在區域上方一次一個像素(或輸入)地移動(若干)濾波器。同樣地,當步幅係2時,則運算晶片塊101在區域上方一次兩個像素地移動(若干)濾波器。Referring to a hardware configuration of physical rows and banks of memory, image or pixel values associated with an image can be stored across physical rows and banks of memory 108. In this example, a stride is a component or parameter of a filter (or kernel) of a neural network. The stride is used to modify the amount by which the filter moves over an image or video. For example, when the stride is set to 1, the computing chip block 101 moves (several) filters over the area one pixel (or input) at a time. Similarly, when the stride is 2, the computing chip block 101 moves (several) filters over the area two pixels at a time.

因此,濾波器可基於一層之一步幅值移位,且在一些實施方案中,系統100可跨不同層之多個運算晶片塊101重複執行此程序,直至一影像之不同區域之輸入具有一對應點積。基於一步幅值在一影像之一區域之輸入上移動一濾波器可包含根據該步幅值自記憶體108中之各個位置擷取、獲得或以其他方式存取輸入。Thus, filters may be shifted based on a step value of a layer, and in some implementations, the system 100 may repeat this process across multiple computing chips 101 of different layers until inputs of different regions of an image have a corresponding dot product. Moving a filter on inputs of a region of an image based on a step value may include fetching, obtaining, or otherwise accessing inputs from various locations in the memory 108 according to the step value.
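The effect of a stride on filter placement can be illustrated with a small helper (hypothetical names; one-dimensional for simplicity):

```python
def filter_positions(region_width, filter_width, stride):
    # Left-edge positions of the filter as it slides across a region.
    return list(range(0, region_width - filter_width + 1, stride))

# With stride 1 the filter moves one pixel at a time; with stride 2,
# two pixels at a time, so fewer positions (and inputs) are visited.
pos_stride_1 = filter_positions(8, 3, 1)
pos_stride_2 = filter_positions(8, 3, 2)
```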

如上文所提及,影像或與影像相關聯之像素值係跨記憶體108之實體列及庫儲存。圖3之實例可參考512之一步幅值進行描述。在一些實例中,512表示與特定影像處理操作相關聯之一共同步幅。取決於操作之類型,系統100之運算晶片塊101可根據一系列步幅值來處理用於存取記憶體108之請求。例如,一第一晶片塊101可基於一第一步幅值處理請求,而一第二、不同晶片塊101可基於一第二、不同步幅值處理請求。As mentioned above, images or pixel values associated with images are stored across physical rows and banks of memory 108. The example of FIG. 3 may be described with reference to a stride value of 512. In some examples, 512 represents a common stride associated with a particular image processing operation. Depending on the type of operation, the computing chip blocks 101 of the system 100 may process requests for accessing memory 108 according to a range of stride values. For example, a first chip block 101 may process requests based on a first stride value, while a second, different chip block 101 may process requests based on a second, different stride value.

如遍歷表300處所展示,當庫產生函數302係用於產生用於處理請求之庫ID時,512之一存取步幅在晶片塊記憶體之同一實體記憶體庫314 (例如,庫0)處重複。此指示一庫衝突。如本文件中所使用,當兩個或更多個請求者在同一時脈週期期間請求存取晶片塊記憶體之同一實體記憶體庫時,發生一庫衝突。As shown at traversal table 300, when the bank generation function 302 is used to generate a bank ID for processing a request, an access stride of 512 is repeated at the same physical memory bank 314 (e.g., bank 0) of the chip memory. This indicates a bank conflict. As used in this document, a bank conflict occurs when two or more requesters request access to the same physical memory bank of the chip memory during the same clock cycle.
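The conflict described above is easy to reproduce with the conventional (function 302-style) bank mapping. In the 32-bank, 16B-row layout, a stride of 512 bytes advances exactly one full sweep of the banks, so every access lands back in the same bank. A sketch (illustrative only, not the patent's exact function 302):

```python
ROW_BYTES, NUM_BANKS = 16, 32

def conventional_bank_id(addr):
    # Function 302-style mapping: bank = global row number mod bank count.
    return (addr // ROW_BYTES) % NUM_BANKS

# Four accesses at a 512-byte stride all map to bank 0: a bank conflict
# if two or more of them are issued in the same clock cycle.
banks = [conventional_bank_id(i * 512) for i in range(4)]
```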

如下文參考圖4所描述,庫產生函數304經組態使得處理用於存取記憶體108之記憶體庫中之實體列之請求一致地或至少實質上導致存取記憶體108之相異實體記憶體庫。更明確言之,對於存取記憶體108之一請求群組中之各請求,庫產生函數304使用來自該請求中之一位址之位元來產生一庫ID,當(例如)在一單個時脈週期期間並行處理該請求群組時,該庫ID導致對不同實體記憶體庫之存取。As described below with reference to FIG. 4, the bank generation function 304 is configured so that processing requests for accessing physical rows in a memory bank of the memory 108 uniformly or at least substantially results in accessing distinct physical memory banks of the memory 108. More specifically, for each request in a group of requests to access the memory 108, the bank generation function 304 uses bits from an address in the request to generate a bank ID that results in accessing different physical memory banks when, for example, the group of requests is processed in parallel during a single clock cycle.

例如,如圖3中所展示,庫產生函數304容許基於引起運算晶片塊101存取記憶體108之不同實體記憶體庫之一存取型樣來處理兩個或更多個請求。庫產生函數304經組態以產生容許此存取型樣之庫ID及列ID,而與用於記憶體存取之步幅值無關。For example, as shown in FIG. 3, the bank generation function 304 allows two or more requests to be processed based on an access pattern that causes the computing chip block 101 to access different physical memory banks of the memory 108. The bank generation function 304 is configured to generate bank IDs and row IDs that allow this access pattern regardless of the stride value used for the memory access.

在一些實施方案中,基於藉由庫產生函數304產生之庫ID:i)一第一請求引起運算晶片塊101存取一第一實體記憶體庫316 (例如,庫「0」)之一列中之16B;ii)一第二請求引起運算晶片塊101存取一第二、不同實體記憶體庫318 (例如,庫「4」)之一列中之16B;及iii)一第三請求引起運算晶片塊101存取一第三、不同實體記憶體庫320 (例如,庫「8」)之一列中之16B。In some implementations, based on the bank IDs generated by the bank generation function 304: i) a first request causes the computing chip block 101 to access 16B in a row of a first physical memory bank 316 (e.g., bank "0"); ii) a second request causes the computing chip block 101 to access 16B in a row of a second, different physical memory bank 318 (e.g., bank "4"); and iii) a third request causes the computing chip block 101 to access 16B in a row of a third, different physical memory bank 320 (e.g., bank "8").

第一、第二及第三請求之各者可為不同請求,可來自不同請求者,或兩者。基於由庫產生函數304啟用之庫ID及/或存取型樣,不存在或實質上不存在其中兩個或更多個請求者在同一時脈週期期間請求存取晶片塊記憶體之同一實體記憶體庫的例項。因此,第一、第二及第三請求之各者可在同一時脈週期(例如,一單個時脈週期)期間同時處理,而不會在記憶體108處發生一庫衝突。此外,第一、第二及第三請求之各者可在同一時脈週期(例如,一單個時脈週期)期間針對一系列步幅值(例如,512之一步幅)同時處理,而不會發生一庫衝突。Each of the first, second, and third requests may be a different request, may come from a different requester, or both. Based on the bank IDs and/or the access pattern enabled by the bank generation function 304, there are no, or substantially no, instances in which two or more requesters request access to the same physical memory bank of the chip memory during the same clock cycle. Therefore, each of the first, second, and third requests may be processed simultaneously during the same clock cycle (e.g., a single clock cycle) without a bank conflict occurring at the memory 108. Furthermore, each of the first, second, and third requests may be processed simultaneously during the same clock cycle (e.g., a single clock cycle) for a range of stride values (e.g., a stride of 512) without a bank conflict occurring.
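One way to obtain the bank 0 / bank 4 / bank 8 pattern described above is to rotate the conventional bank number by an amount that grows with the row ID. The patent's algorithm 402 works at the bit level, so the arithmetic below is only a stand-in that reproduces the same access pattern; the rotation step of 4 is an assumption:

```python
ROW_BYTES, NUM_BANKS = 16, 32
ROTATION_STEP = 4  # assumed rotation per row; the real shift is configurable

def rotated_bank_id(addr):
    row_address = addr // ROW_BYTES
    a = row_address % NUM_BANKS        # conventional bank id (variable A)
    row_id = row_address // NUM_BANKS  # row within the bank
    # Rotate the conventional bank id by a row-dependent amount so that
    # stride-512 accesses no longer collide in a single bank.
    return (a + row_id * ROTATION_STEP) % NUM_BANKS

# Stride-512 requests now touch banks 0, 4, 8, 12 instead of 0, 0, 0, 0.
banks = [rotated_bank_id(i * 512) for i in range(4)]
```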

在一些實施方案中,運算晶片塊101可操作以路由及儲存輸出,使得記憶體108 (例如,一啟動記憶體)在獲得經儲存之輸出值作為第二、不同神經網路層之輸入啟動時不經歷一庫衝突。此在以下段落中參考啟動值進行描述,但適用於可寫入至(或儲存於)記憶體108之實體位置之其他資料類型/值。In some embodiments, the computing chip 101 is operable to route and store outputs so that the memory 108 (e.g., an activation memory) does not experience a bank conflict when obtaining the stored output value as an input activation of a second, different neural network layer. This is described in the following paragraphs with reference to activation values, but is applicable to other data types/values that can be written to (or stored in) a physical location of the memory 108.

運算晶片塊101可包含將啟動函數應用於由在運算單元112處執行之運算所引起之經累加值之一非線性單元。在一項實例中,非線性單元可為包含於運算晶片塊101之乘法及加法電路當中之一硬體電路。在另一實例中,非線性單元係包含於運算晶片塊101處但在運算單元112外部。The computing chip block 101 may include a nonlinear unit that applies activation functions to accumulated values resulting from operations performed at the computing unit 112. In one example, the nonlinear unit may be a hardware circuit included in the multiplication and addition circuits of the computing chip block 101. In another example, the nonlinear unit is included at the computing chip block 101 but is external to the computing unit 112.

非線性單元應用其啟動函數以產生一組經啟動值。經啟動值可為一機器學習工作負載之輸出。例如,經啟動值可為路由至記憶體108之一記憶體庫且儲存於該記憶體庫處之一第一神經網路層之輸出。此等經啟動值(例如,一第一層之輸出)可自記憶體108之記憶體庫擷取且作為輸入啟動提供以用於透過一第二、不同神經網路層進行處理。The nonlinear unit applies its activation function to produce a set of activated values. The activated values may be the output of a machine learning workload. For example, the activated values may be the output of a first neural network layer routed to and stored at a memory bank of memory 108. These activated values (e.g., outputs of a first layer) may be retrieved from the memory bank of memory 108 and provided as input activations for processing by a second, different neural network layer.

運算晶片塊101可向記憶體108發出存取請求(例如,寫入請求)以在記憶體108之實體列及庫處儲存諸如啟動值之輸出。在一些實例中,此等請求可經路由至用作儲存啟動之一啟動記憶體之記憶體108之一分割區。在一些實施方案中,可使用庫產生函數304來處理此等寫入存取請求,使得當輸出/啟動值儲存於該晶片塊之記憶體108處時,系統100在一特定運算晶片塊101處不經歷一庫衝突。The computing die 101 may issue access requests (e.g., write requests) to the memory 108 to store outputs such as activation values at physical rows and banks of the memory 108. In some examples, these requests may be routed to a partition of the memory 108 used as an activation memory for storing activations. In some implementations, the bank generation function 304 may be used to process these write access requests so that the system 100 does not experience a bank conflict at a particular computing die 101 when the output/activation value is stored at the memory 108 of the die.

當處理存取請求(例如,讀取請求)以擷取或提取輸出值時,庫產生函數304亦可用於減少或防止庫衝突。例如,可提取輸出值以作為輸入啟動提供至第二、不同神經網路層。The bank generation function 304 may also be used to reduce or prevent bank conflicts when processing access requests (e.g., read requests) to retrieve or fetch output values. For example, the output values may be fetched to be provided as input activations to a second, different neural network layer.

圖4展示用於減少一硬體積體電路處之記憶體庫衝突之一實例性庫產生函數304。如上文所描述,庫產生函數304經組態使得用於存取記憶體108之記憶體庫中之實體列之請求一致地或至少實質上導致存取記憶體108之相異實體記憶體庫。FIG. 4 shows an example bank generation function 304 for reducing memory bank conflicts at a hardware integrated circuit. As described above, the bank generation function 304 is configured so that requests for accessing physical rows in a memory bank of the memory 108 consistently, or at least substantially, result in accessing distinct physical memory banks of the memory 108.

例如,給定存取記憶體108之一請求群組,運算晶片塊101之一控制器103可自該請求群組導出一組位址。在一些情況下,控制器103可自請求群組導出一組連續位址(例如,虛擬/邏輯位址)。給定此組連續位址,庫產生函數304經組態以產生一組對應庫識別符(「庫ID」)。例如,庫產生函數304針對請求群組中之各請求產生一各自庫ID。For example, given a group of requests to access the memory 108, a controller 103 of the computing chip block 101 may derive a set of addresses from the request group. In some cases, the controller 103 may derive a set of consecutive addresses (e.g., virtual/logical addresses) from the request group. Given this set of consecutive addresses, the bank generation function 304 is configured to generate a set of corresponding bank identifiers ("bank IDs"). For example, the bank generation function 304 generates a respective bank ID for each request in the request group.

系統100或一運算晶片塊101可在一或多個時脈週期內或在同一時脈週期內使用庫ID組來處理請求群組,而不會在記憶體108處發生一庫衝突。換言之,基於藉由庫產生函數304產生之庫ID組,一運算晶片塊101可同時(例如,並行)處理請求群組中之各請求且沒有兩個請求將需要在同一時脈週期期間存取晶片塊記憶體108之同一實體記憶體庫。The system 100 or a computing chip 101 can use the bank ID set to process the request group in one or more clock cycles or in the same clock cycle without a bank conflict occurring at the memory 108. In other words, based on the bank ID set generated by the bank generation function 304, a computing chip 101 can process each request in the request group at the same time (e.g., in parallel) and no two requests will need to access the same physical memory bank of the chip memory 108 during the same clock cycle.

在圖4之實例中,庫產生函數304基於一演算法402針對各請求產生一各自庫ID。演算法402基於包含於用於存取記憶體108之一對應請求中之位址位元404之一實例性序列來產生一庫ID。例如,運算晶片塊101使用位址位元404之該序列來獲得作為輸入提供至演算法402之一或多個變數之一各自值。輸入變數可對應於位址位元404之序列之不同部分。In the example of FIG. 4, the bank generation function 304 generates a respective bank ID for each request based on an algorithm 402. The algorithm 402 generates a bank ID based on an example sequence of address bits 404 included in a corresponding request for accessing the memory 108. For example, the computing chip block 101 uses the sequence of address bits 404 to obtain a respective value of one or more variables provided as input to the algorithm 402. The input variables may correspond to different portions of the sequence of address bits 404.

一第一變數A可自位元之一第一部分獲得,一第二變數B可自位元之一第二部分獲得,一第三變數row_id可自位元之一第三部分獲得,且一第四變數byte_in_row可自位元之一第四部分獲得。例如,位元404之序列可為分成兩個部分之一輸入位址:i) bytes_in_row (記憶體列中之位元組之數目);及ii)一row_address。使用庫產生函數304,row_address可被分成兩個部分:i)一row_id (例如,各庫中之row_id);及ii)經定義為變數A之一習知庫id。row_id中之最低有效位元(LSB)可經定義為變數B。A first variable A may be obtained from a first portion of the bits, a second variable B from a second portion, a third variable row_id from a third portion, and a fourth variable byte_in_row from a fourth portion. For example, the sequence of bits 404 may be an input address divided into two parts: i) bytes_in_row (the number of bytes in a memory row); and ii) a row_address. Using the bank generation function 304, the row_address may be divided into two parts: i) a row_id (e.g., the row_id within each bank); and ii) a conventional bank id, defined as variable A. The least significant bit (LSB) of the row_id may be defined as variable B.
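The bit-field split described in this paragraph can be sketched with masks and shifts. The field widths below assume the 16B-row, 32-bank example (4 byte-offset bits, 5 bank-id bits); the variable names A, B, row_id, and byte_in_row follow the text, while the exact widths are an assumption:

```python
BYTE_BITS = 4   # 16B rows -> bits 0..3 select the byte within a row
BANK_BITS = 5   # 32 banks -> 5 bits for the conventional bank id

def split_logical_address(addr):
    byte_in_row = addr & ((1 << BYTE_BITS) - 1)   # lowest 4 bits
    row_address = addr >> BYTE_BITS               # remaining upper bits
    a = row_address & ((1 << BANK_BITS) - 1)      # conventional bank id (A)
    row_id = row_address >> BANK_BITS             # row within each bank
    b = row_id & 1                                # LSB of row_id (B)
    return byte_in_row, row_address, a, row_id, b
```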

如圖4之實例中所指示,對於一給定位元序列404,序列404之一個部分中之一或多個位元可與序列之另一部分中之一或多個位元重疊。在一些實施方案中,一輸入變數可對應於一請求中之最低有效位元(lsb)或一請求中之最高有效位元(msb)。As indicated in the example of FIG. 4, for a given bit sequence 404, one or more bits in one portion of the sequence 404 may overlap with one or more bits in another portion of the sequence. In some implementations, an input variable may correspond to the least significant bit (lsb) in a request or the most significant bit (msb) in a request.

在一些實施方案中,庫產生函數304經組態以執行至少一第一組操作412、一第二組操作414及一第三操作416。第一及/或第二組操作可用於建立特定輸入變數,諸如row_address、row_id、A及B。在一些情況下,row_address係用於提取序列404中之特定位元,諸如位元4至15,而row_id係用於提取特定其他位元,諸如位元8至15。在一些實施方案中,變數可用於提取其他範圍或組合之位元。In some implementations, the bank generation function 304 is configured to perform at least a first set of operations 412, a second set of operations 414, and a third operation 416. The first and/or second set of operations may be used to establish specific input variables, such as row_address, row_id, A, and B. In some cases, row_address is used to extract specific bits in the sequence 404, such as bits 4 through 15, while row_id is used to extract specific other bits, such as bits 8 through 15. In some implementations, the variables may be used to extract other ranges or combinations of bits.

第三操作(或操作組) 416可用於基於一移位參數或操作(諸如rotation_banking_shift)使一位元向量旋轉。該位元向量可基於使用庫產生函數304執行之操作自用於存取記憶體108之一請求中之一位址位元序列導出。位元向量可自位址位元序列中之msb或位址位元序列中之lsb導出。例如,位元向量可為變數A及B之一組合且庫產生函數304經組態以對此變數組合應用一rotation_banking_shift操作。The third operation (or set of operations) 416 may be used to rotate a bit vector based on a shift parameter or operation, such as rotation_banking_shift. The bit vector may be derived, based on operations performed using the bank generation function 304, from a sequence of address bits in a request for accessing the memory 108. The bit vector may be derived from the msb of the address bit sequence or from the lsb of the address bit sequence. For example, the bit vector may be a combination of variables A and B, and the bank generation function 304 is configured to apply a rotation_banking_shift operation to this combination of variables.

rotation_banking_shift操作可參考一運算晶片塊101之記憶體中之一庫數目來應用。例如,rotation_banking_shift操作可用於:i)使一msb位元向量移位由操作之一移位值指定之一量;或ii)使一lsb位元向量移位由操作之一移位值指定之一量。系統100執行移位操作使得一新庫之一庫ID將屬於一特定群組。因此,第三操作416可根據庫之數目以及與變數A及/或B相關聯之任何列、庫及位元組屬性來產生一經旋轉庫ID。The rotation_banking_shift operation may be applied with reference to a number of banks in the memory of a computing chip 101. For example, the rotation_banking_shift operation may be used to: i) shift an msb bit vector by an amount specified by a shift value of the operation; or ii) shift an lsb bit vector by an amount specified by a shift value of the operation. The system 100 performs the shift operation so that a bank ID of a new bank will belong to a particular group. Thus, the third operation 416 may generate a rotated bank ID based on the number of banks and any row, bank, and byte attributes associated with variables A and/or B.
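The rotation itself is a fixed-width rotate of a bit vector. A generic rotate-left helper is sketched below; the patent does not pin down the rotation direction or exact operand widths, so both are assumptions here:

```python
def rotate_left(value, shift, width):
    # Rotate a `width`-bit vector left by `shift` positions.
    mask = (1 << width) - 1
    value &= mask
    shift %= width
    if shift == 0:
        return value
    return ((value << shift) | (value >> (width - shift))) & mask

# Rotating a 5-bit bank id: bank 16 (0b10000) rotated left by 1 wraps to 1.
rotated = rotate_left(0b10000, 1, 5)
```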

在一些實施方案中,系統100判定用於定義msb或lsb之一最小位元數目。例如,系統100可基於記憶體108之一給定分割區中之實體記憶體庫之數量來判定最小位元數目。例如,若存在32個實體記憶體庫,則系統100可判定需要5位元之一最小數目來表示32個數字或32列ID。In some implementations, the system 100 determines a minimum number of bits used to define the msb or lsb. For example, the system 100 may determine the minimum number of bits based on the number of physical memory banks in a given partition of the memory 108. For example, if there are 32 physical memory banks, the system 100 may determine that a minimum of 5 bits is required to represent 32 numbers or 32 bank IDs.
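The minimum bit count mentioned here is just the ceiling of log2 of the bank count; for 32 banks it is 5. A one-line helper (illustrative name):

```python
def min_bits(n):
    # Minimum number of bits needed to represent n distinct ids (0..n-1).
    return max(1, (n - 1).bit_length())

# 32 physical memory banks need 5 bits; 33 would need 6.
```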

在一些實施方案中,如圖4之實例中之演算法402處所展示般組態或編碼庫產生函數304。在一些其他實施方案中,一經修改庫產生函數可經組態或編碼(使用替代命令)以產生可在一運算晶片塊101處同時處理而不觸發記憶體108處之一庫衝突的一組庫ID。例如,可針對相對於記憶體108包含更多或更少實體記憶體庫及更多或更少列之一晶片塊記憶體組態或編碼此經修改庫產生函數。In some implementations, the bank generation function 304 is configured or coded as shown at algorithm 402 in the example of FIG. 4. In some other implementations, a modified bank generation function may be configured or coded (using alternative commands) to generate a set of bank IDs that can be processed simultaneously at a computing chip block 101 without triggering a bank conflict at the memory 108. For example, this modified bank generation function may be configured or coded for a chip memory that includes more or fewer physical memory banks and more or fewer rows relative to the memory 108.

庫產生函數304可為接收一或多個輸入且產生一或多個輸出之一演算法。例如,如上文所論述,輸入可為一或多個列之一各自位址(「row_address」)、各庫中之列數目(「rows_per_bank」)、一請求中之最低有效位元(lsb)、lsb之一最小位元數量、一請求中之最高有效位元(msb)、msb之一最小位元數量、一庫旋轉參數或一移位參數(「msb_shift_minus_1」)。輸出可為記憶體108之實體記憶體庫之各自庫ID及記憶體108之一實體記憶體庫中之實體列之各自列ID。更多或更少輸入係在本發明之範疇內且可用於結合庫產生函數304。The bank generation function 304 may be an algorithm that receives one or more inputs and generates one or more outputs. For example, as discussed above, the inputs may be a respective address of one or more rows ("row_address"), the number of rows in each bank ("rows_per_bank"), the least significant bit (lsb) in a request, a minimum number of bits of the lsb, the most significant bit (msb) in a request, a minimum number of bits of the msb, a bank rotation parameter, or a shift parameter ("msb_shift_minus_1"). The outputs may be respective bank IDs of physical memory banks of the memory 108 and respective row IDs of physical rows in a physical memory bank of the memory 108. More or fewer inputs are within the scope of the present invention and may be used in conjunction with the bank generation function 304.

圖5係用於減少一硬體加速器或專用硬體積體電路之一晶片塊記憶體中之記憶體庫衝突之一實例性程序500。在一些實施方案中,程序500係在用於實施於一硬體加速器上之一神經網路機器學習模型之運算期間執行。例如,該等運算可經執行以使用一專用神經網路處理器來處理一神經網路輸入(諸如一影像或語音話語)。此一處理器可由參考圖1A所描述之積體電路或系統100表示。FIG. 5 is an example process 500 for reducing memory bank conflicts in a chip memory of a hardware accelerator or dedicated hardware integrated circuit. In some implementations, the process 500 is performed during computations for a neural network machine learning model implemented on a hardware accelerator. For example, the computations may be performed to process a neural network input (such as an image or speech utterance) using a dedicated neural network processor. Such a processor may be represented by the integrated circuit or system 100 described with reference to FIG. 1A.

例如,硬體積體電路可經組態以實施包含多個神經網路層之一CNN。在一些情況下,神經網路層可包含一群組卷積層。輸入可為如上文所描述之一實例性影像,包含各種其他類型之數位影像或相關圖形資料。在至少一個實例中,積體電路可實施一RNN以用於處理自一話語或其他音訊內容導出之輸入。在一些情況下,程序500係相對於其他資料處理技術容許在加速用以產生影像或語音處理輸出之神經網路運算時實現延時及處理量之改良之一技術之部分。For example, the hardware integrated circuit may be configured to implement a CNN comprising a plurality of neural network layers. In some cases, the neural network layers may include a group of convolutional layers. The input may be an example image as described above, including various other types of digital images or related graphical data. In at least one example, the integrated circuit may implement an RNN for processing input derived from a speech or other audio content. In some cases, process 500 is part of a technique that allows improvements in latency and throughput to be achieved in accelerating neural network operations used to produce image or speech processing outputs relative to other data processing techniques.

程序500可使用上文描述之系統100實施或執行。因此,程序500之描述可參考系統100之上述運算資源。在一些實例中,程序500之步驟或動作係由經程式化之韌體指令、軟體指令或兩者來實現。各類型之指令可儲存於一非暫時性機器可讀儲存裝置中且可由本文件中描述之裝置及資源之一或多個處理器執行。Process 500 may be implemented or executed using the system 100 described above. Therefore, the description of process 500 may refer to the above-described computing resources of system 100. In some examples, the steps or actions of process 500 are implemented by programmed firmware instructions, software instructions, or both. Instructions of various types may be stored in a non-transitory machine-readable storage device and may be executed by one or more processors of the devices and resources described in this document.

在一些實施方案中,程序500之步驟係在一硬體電路處執行以產生用於一神經網路層之一層輸出。該輸出可為用以產生一影像處理或影像辨識輸出之一機器學習任務或推理工作負載之運算之一部分。如上文所指示,積體電路可為一專用神經網路處理器或經組態以加速用於產生各種類型之資料處理輸出之運算之硬體機器學習加速器。In some implementations, the steps of process 500 are performed at a hardware circuit to generate a layer output for a neural network layer. The output may be part of a computation of a machine learning task or reasoning workload to generate an image processing or image recognition output. As indicated above, the integrated circuit may be a dedicated neural network processor or a hardware machine learning accelerator configured to accelerate computations used to generate various types of data processing outputs.

再次參考程序500,系統100接收用於存取系統之記憶體資源之多個請求(502)。例如,該多個請求之各請求可用於存取一積體電路或硬體加速器之一第一、晶片塊記憶體。在一些實施方案中,系統100跨多個晶片塊處理多個請求。例如,系統100可在一積體電路之各運算晶片塊101處處理一各自請求子集。Referring again to process 500, the system 100 receives multiple requests for accessing memory resources of the system (502). For example, each of the multiple requests may be for accessing a first, chip-block memory of an integrated circuit or hardware accelerator. In some implementations, the system 100 processes the multiple requests across multiple chip blocks. For example, the system 100 may process a respective subset of the requests at each computing chip block 101 of an integrated circuit.

對於各請求,系統100可識別由請求中之一位元序列表示之一各自邏輯位址(504)。如上文所描述,記憶體108可為列可定址的,使得可基於一對應請求中之一位址來識別或存取一記憶體庫之一列。一般而言,各請求指定可映射至一對應實體位址之一邏輯位址。例如,系統100之一編譯器可判定一組邏輯位址至一組對應實體位址之一映射,其中實體位址指定記憶體中之一實體位置(諸如第一記憶體108中之一庫之一位置或第一記憶體108中之一庫中之一列之一位置)。For each request, the system 100 may identify a respective logical address represented by a bit sequence in the request (504). As described above, the memory 108 may be row-addressable such that a row of a memory bank may be identified or accessed based on an address in a corresponding request. Generally, each request specifies a logical address that may be mapped to a corresponding physical address. For example, a compiler of the system 100 may determine a mapping of a set of logical addresses to a set of corresponding physical addresses, where the physical address specifies a physical location in memory (e.g., a location in a bank in the first memory 108 or a location in a row in a bank in the first memory 108).

對於各請求,系統100可自位元序列獲得一第一位元子集(506)。例如,邏輯位址可由包含一位元序列(例如,8位元、16位元等)之一資料結構指定。在一些實施方案中,一運算晶片塊101之一控制器103掃描表示由一請求指定之一邏輯位址之各資料結構。回應於掃描形成資料結構之位元,控制器識別或判定可用作至一庫產生函數之輸入之一位元子集。在一些實施方案中,控制器基於晶片塊記憶體108之一組態來判定位元子集。For each request, the system 100 may obtain a first subset of bits from the bit sequence (506). For example, the logical address may be specified by a data structure comprising a sequence of bits (e.g., 8 bits, 16 bits, etc.). In some implementations, a controller 103 of a computing chip block 101 scans each data structure representing a logical address specified by a request. In response to scanning the bits that form the data structure, the controller identifies or determines a subset of bits that can be used as input to a bank generation function. In some implementations, the controller determines the subset of bits based on a configuration of the chip memory 108.

對於各請求,系統100可基於使用第一位元子集之一庫產生函數304來產生一各自庫識別符(「庫ID」) (508)。每一各自庫ID識別記憶體108之多個實體記憶體庫當中之一特定實體庫。如上文所描述,各實體記憶體庫可包含多個列。對於各請求,運算晶片塊101可:i)自位元序列獲得一第二位元子集;ii)將該第二位元子集作為一輸入提供至庫產生函數;及iii)基於將庫產生函數應用於第二位元子集,產生識別晶片塊記憶體之實體記憶體庫當中的特定庫中之一特定列之一各自列ID。For each request, the system 100 may generate a respective bank identifier ("bank ID") based on the bank generation function 304 using the first subset of bits (508). Each respective bank ID identifies a particular physical bank among the multiple physical memory banks of the memory 108. As described above, each physical memory bank may include multiple rows. For each request, the computing chip block 101 may: i) obtain a second subset of bits from the bit sequence; ii) provide the second subset of bits as an input to the bank generation function; and iii) based on applying the bank generation function to the second subset of bits, generate a respective row ID identifying a particular row in a particular bank among the physical memory banks of the chip memory.

系統100經組態以使用針對各請求產生之各自庫ID來處理該請求(510)。在一些實施方案中,一運算晶片塊101處理多個請求(例如,同時),且回應於處理該等請求,並行存取兩個或更多個實體記憶體庫,而不發生一庫衝突(512)。例如,運算晶片塊101之一控制器103產生控制信號以存取記憶體108之一實體庫或列以自第一記憶體108之位址位置擷取一輸入向量102。The system 100 is configured to process the requests using the respective bank ID generated for each request (510). In some implementations, a computing chip 101 processes multiple requests (e.g., simultaneously), and in response to processing the requests, accesses two or more physical memory banks in parallel without a bank conflict (512). For example, a controller 103 of the computing chip 101 generates control signals to access a physical bank or row of memory 108 to fetch an input vector 102 from an address location of the first memory 108.

在一些實施方案中,回應於處理請求之各者以擷取一或多個輸入向量,運算晶片塊101可在一單個時脈週期期間同時存取晶片塊記憶體之多個相異實體記憶體庫。輸入向量102可對應於一影像之一輸入特徵映射且可為神經網路輸入(諸如由一先前神經網路層產生之啟動)之一矩陣結構。In some implementations, in response to processing each of the requests to fetch one or more input vectors, the computing chip 101 may simultaneously access multiple distinct physical memory banks of the chip memory during a single clock cycle. The input vector 102 may correspond to an input feature map of an image and may be a matrix structure of neural network inputs (such as activations generated by a previous neural network layer).

如上文所論述,庫產生函數304容許基於引起運算晶片塊101存取記憶體108之不同實體記憶體庫之一存取型樣來處理兩個或更多個請求。庫產生函數304經組態以產生容許此存取型樣之庫ID及列ID,而與用於記憶體存取之步幅值無關。As discussed above, the bank generation function 304 allows two or more requests to be processed based on an access pattern that causes the computing chip 101 to access different physical memory banks of the memory 108. The bank generation function 304 is configured to generate bank IDs and row IDs that allow this access pattern regardless of the stride value used for the memory access.

Notably, the bank generation function can provide this access pattern even when the memory access stride equals the difference between particular rows of the same physical memory bank. For example, each row of a physical memory bank may have a width of 16 bytes, and a partition of the tile memory may have 32 physical memory banks. In this example, the bank generation function can still provide this access pattern even when the memory access stride equals row_width * num_banks (i.e., stride = 512).
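The stride-512 case can be made concrete with a short simulation. The XOR-based swizzle below is a standard illustration of a conflict-avoiding bank function and is only an assumed stand-in for the patent's function 304: with plain modulo banking, every address in a stride-512 walk lands in the same bank, while the swizzled function spreads the same walk across distinct banks.

```python
ROW_WIDTH = 16                   # bytes per row
NUM_BANKS = 32                   # banks per tile-memory partition
STRIDE = ROW_WIDTH * NUM_BANKS   # 512 bytes

def modulo_bank(addr):
    """Plain modulo banking: a stride-512 walk hits one bank repeatedly."""
    return (addr // ROW_WIDTH) % NUM_BANKS

def swizzled_bank(addr):
    """Assumed XOR swizzle: fold the row index into the bank bits so a
    stride-512 walk rotates through distinct banks."""
    line = addr // ROW_WIDTH
    return (line ^ (line // NUM_BANKS)) % NUM_BANKS

addrs = [i * STRIDE for i in range(8)]
naive_banks = [modulo_bank(a) for a in addrs]
swizzled_banks = [swizzled_bank(a) for a in addrs]
```

Here `naive_banks` is `[0]*8` (eight conflicting requests that would serialize over eight cycles), while `swizzled_banks` is `[0, 1, 2, 3, 4, 5, 6, 7]`, so all eight accesses could proceed in one clock cycle.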

FIG. 6 illustrates examples of tensors, or multi-dimensional matrices, 600, including an input tensor 604, variations of a parameter tensor 606, and an output tensor 608. In the example of FIG. 6, each of the tensors 600 includes respective elements, where each element can correspond to a respective data value (or operand) used for computations performed at a given layer of a neural network.

For example, each input of the input tensor 604 can correspond to a respective element along a given dimension of the input tensor 604, each weight of the parameter tensor 606 can correspond to a respective element along a given dimension of the parameter tensor 606, and each output value or activation in a set of outputs can correspond to a respective element along a given dimension of the output tensor 608. Relatedly, each element can correspond to a respective memory location or address in a memory of a compute tile 101 that is assigned to operate on one or more dimensions of a given tensor 604, 606, 608.

Computations performed at a given neural network layer can include multiplying an input/activation tensor 604 with a parameter/weight tensor 606 over one or more processor clock cycles to generate layer outputs, which can include output activations. Multiplying an activation tensor 604 with a weight tensor 606 includes multiplying an activation from an element of tensor 604 with a weight from an element of tensor 606 to generate one or more partial sums. The example tensors 606 of FIG. 6 can be unmodified parameter tensors, modified parameter tensors, or a combination of these. In some implementations, each parameter tensor 606 corresponds to a modified parameter tensor that includes non-zero CSP values derived based on the particular sparsity-exploitation techniques described above.
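The element-wise multiply-and-accumulate described above can be sketched as follows; treating each step as one conceptual clock cycle is an illustrative simplification:

```python
def accumulate_partial_sums(activations, weights):
    """Multiply each activation element by the matching weight element and
    return the running partial sums (one multiply-accumulate per step)."""
    acc, partials = 0, []
    for a, w in zip(activations, weights):
        acc += a * w          # one partial sum per multiply-accumulate step
        partials.append(acc)
    return partials
```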

A processor core of the system 100 can operate on: i) a scalar corresponding to a discrete element in a given multi-dimensional tensor 604, 606; ii) a vector (e.g., input vector 102) containing the values of multiple discrete elements 609 along the same or different dimensions of a given multi-dimensional tensor 604, 606; or iii) a combination of these. Depending on the dimensionality of the tensor, a discrete element 609 (or each of multiple discrete elements 609) in a given multi-dimensional tensor can be represented using X,Y coordinates (2D) or X,Y,Z coordinates (3D).
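The correspondence between an element's X,Y (or X,Y,Z) coordinates and a memory address can be sketched with a simple row-major flattening; the layout and the 4-byte element size are assumptions for illustration only:

```python
def element_address(coords, dims, base=0, elem_size=4):
    """Map a tensor element's coordinates to a flat memory address,
    assuming a row-major layout (the last coordinate varies fastest)."""
    flat = 0
    for c, d in zip(coords, dims):
        flat = flat * d + c
    return base + flat * elem_size
```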

The system 100 can compute multiple partial sums corresponding to products generated by multiplying a batch of inputs with corresponding weight values. As mentioned above, the system 100 can perform an accumulation of the products (e.g., partial sums) over many clock cycles. For example, the accumulation of products can be performed in a random-access memory, shared memory, or scratchpad memory of one or more compute tiles based on the techniques described in this document. In some implementations, an input-weight multiplication can be written as a sum of products of each weight element multiplied by the discrete inputs of an input vector 102, such as a row or slice of the input tensor 604. This row or slice can represent a given dimension, such as a first dimension 610 of the input tensor 604 or a second, different dimension 615 of the input tensor 604.
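Writing an input-weight multiplication as a sum of products, as described above, can be sketched as a plain matrix-vector product over one row or slice of the input:

```python
def sum_of_products(weight_rows, input_vector):
    """Compute each output as the sum of products of one row of weights
    with the discrete inputs of the input vector."""
    return [sum(w * x for w, x in zip(row, input_vector))
            for row in weight_rows]
```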

In some implementations, an example set of computations can be used to compute an output of a convolutional neural network (CNN) layer. The computations of the CNN layer can involve performing a 2D spatial convolution between a 3D input tensor 604 and at least one 3D filter (weight tensor 606). For example, convolving one 3D filter 606 over the 3D input tensor 604 can produce a 2D spatial plane 620 or 625. The computations can involve computing sums of dot products over a particular dimension of an input volume that includes the input vector 102.
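A minimal sketch of the 2D spatial convolution described above, holding the input as (channels, height, width) with valid padding at stride 1 — both assumptions for illustration:

```python
def conv2d_spatial(input_chw, filt_chw):
    """Convolve one 3D filter over a 3D input; the channel dimension is
    fully reduced, so each output element is a dot product over an input
    volume, and the result is a single 2D spatial plane."""
    C = len(input_chw)
    H, W = len(input_chw[0]), len(input_chw[0][0])
    KH, KW = len(filt_chw[0]), len(filt_chw[0][0])
    out_h, out_w = H - KH + 1, W - KW + 1
    plane = [[0] * out_w for _ in range(out_h)]
    for y in range(out_h):
        for x in range(out_w):
            acc = 0
            for c in range(C):
                for dy in range(KH):
                    for dx in range(KW):
                        acc += input_chw[c][y + dy][x + dx] * filt_chw[c][dy][dx]
            plane[y][x] = acc
    return plane
```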

For example, the spatial plane 620 can include output values from sums of products computed from inputs along dimension 610, while the spatial plane 625 can include output values from sums of products computed from inputs along dimension 615. The sum-of-products computations used to generate the output values in each of the spatial planes 620 and 625 can be: i) performed at the compute cells 114a, 114b, and 114c; ii) performed directly at the memory 110 using an arithmetic operator coupled to a shared memory bank of the memory 110; or iii) both. In some implementations, the reduction operations can be simplified and performed directly at a memory cell (or location) of the memory 110 using various techniques for reducing the accumulated values.

Embodiments of the subject matter and the functional operations described in this specification can be implemented in digital electronic circuitry, in tangibly embodied computer software or firmware, in computer hardware, including the structures disclosed in this specification and their structural equivalents, or in combinations of one or more of them. Embodiments of the subject matter described in this specification can be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier for execution by, or to control the operation of, data processing apparatus.

Alternatively or in addition, the program instructions can be encoded on an artificially generated propagated signal, e.g., a machine-generated electrical, optical, or electromagnetic signal, that is generated to encode information for transmission to suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium can be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.

The term "computing system" encompasses all kinds of apparatus, devices, and machines for processing data, including by way of example a programmable processor, a computer, or multiple processors or computers. The apparatus can include special-purpose logic circuitry, e.g., an FPGA (field-programmable gate array) or an ASIC (application-specific integrated circuit). The apparatus can also include, in addition to hardware, code that creates an execution environment for the computer program in question, e.g., code that constitutes processor firmware, a protocol stack, a database management system, an operating system, or a combination of one or more of them.

A computer program (which may also be referred to or described as a program, software, a software application, a module, a software module, a script, or code) can be written in any form of programming language, including compiled or interpreted languages, or declarative or procedural languages, and it can be deployed in any form, including as a stand-alone program or as a module, component, subroutine, or other unit suitable for use in a computing environment.

A computer program may, but need not, correspond to a file in a file system. A program can be stored in a portion of a file that holds other programs or data, e.g., one or more scripts stored in a markup-language document; in a single file dedicated to the program in question; or in multiple coordinated files, e.g., files that store one or more modules, sub-programs, or portions of code. A computer program can be deployed to be executed on one computer or on multiple computers that are located at one site or distributed across multiple sites and interconnected by a communication network.

The processes and logic flows described in this specification can be performed by one or more programmable computers executing one or more computer programs to perform functions by operating on input data and generating output. The processes and logic flows can also be performed by, and apparatus can also be implemented as, special-purpose logic circuitry, e.g., an FPGA (field-programmable gate array), an ASIC (application-specific integrated circuit), or a GPGPU (general-purpose graphics processing unit).

Computers suitable for the execution of a computer program include, by way of example, general-purpose or special-purpose microprocessors, or both, or any other kind of central processing unit. Generally, a central processing unit will receive instructions and data from a read-only memory or a random-access memory or both. Some elements of a computer are a central processing unit for performing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer will also include, or be operatively coupled to receive data from or transfer data to, or both, one or more mass storage devices for storing data, e.g., magnetic, magneto-optical, or optical disks. However, a computer need not have such devices. Moreover, a computer can be embedded in another device, e.g., a mobile telephone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a Global Positioning System (GPS) receiver, or a portable storage device, e.g., a universal serial bus (USB) flash drive, to name just a few.

Computer-readable media suitable for storing computer program instructions and data include all forms of non-volatile memory, media, and memory devices, including by way of example semiconductor memory devices, e.g., EPROM, EEPROM, and flash memory devices; magnetic disks, e.g., internal hard disks or removable disks; magneto-optical disks; and CD-ROM and DVD-ROM disks. The processor and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.

To provide for interaction with a user, embodiments of the subject matter described in this specification can be implemented on a computer having a display device, e.g., an LCD (liquid-crystal display) monitor, for displaying information to the user, and a keyboard and a pointing device, e.g., a mouse or a trackball, by which the user can provide input to the computer. Other kinds of devices can be used to provide for interaction with a user as well; for example, feedback provided to the user can be any form of sensory feedback, e.g., visual feedback, auditory feedback, or tactile feedback; and input from the user can be received in any form, including acoustic, speech, or tactile input. In addition, a computer can interact with a user by sending documents to and receiving documents from a device that is used by the user; for example, by sending web pages to a web browser on a user's client device in response to requests received from the web browser.

Embodiments of the subject matter described in this specification can be implemented in a computing system that includes a back-end component, e.g., as a data server, or that includes a middleware component, e.g., an application server, or that includes a front-end component, e.g., a client computer having a graphical user interface or a web browser through which a user can interact with an implementation of the subject matter described in this specification, or any combination of one or more such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication, e.g., a communication network. Examples of communication networks include a local area network ("LAN") and a wide area network ("WAN"), e.g., the Internet.

The computing system can include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.

While this specification contains many specific implementation details, these should not be construed as limitations on the scope of any invention or of what may be claimed, but rather as descriptions of features that may be specific to particular embodiments of particular inventions. Certain features that are described in this specification in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination. Moreover, although features may be described above as acting in certain combinations and even initially claimed as such, one or more features from a claimed combination can in some cases be excised from the combination, and the claimed combination may be directed to a subcombination or a variation of a subcombination.

Similarly, while operations are depicted in the drawings in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order, or that all illustrated operations be performed, to achieve desirable results. In certain circumstances, multitasking and parallel processing may be advantageous. Moreover, the separation of various system modules and components in the embodiments described above should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can generally be integrated together in a single software product or packaged into multiple software products.

Particular embodiments of the subject matter have been described. Other embodiments are within the scope of the following claims. For example, the actions recited in the claims can be performed in a different order and still achieve desirable results. As one example, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In certain implementations, multitasking and parallel processing may be advantageous.

100: computing system
101: compute tile
102: input vector
103: controller
104: weights
105: control signals
106: input activation bus
107: output activation bus
108: first memory / tile memory
110: second memory
112: compute unit
114: compute cell
114a: compute cell
114b: compute cell
114c: compute cell
120: host
125: higher-level controller
130: requests
135: logical ports
140: arbiter
200: processing pipeline
202: shift register
204: operand
206: operand
300: traversal table
302: first function / bank generation function
304: second function / bank generation function
306: first address
308: second address
310: third address
312: physical memory bank
314: physical memory bank
316: first physical memory bank
318: second, different physical memory bank
320: third, different physical memory bank
402: algorithm
404: address bits / bit sequence
412: first set of operations
414: second set of operations
416: third operation
500: process
502: step
504: step
506: step
508: step
510: step
512: step
600: tensors or multi-dimensional matrices
604: input tensor / activation tensor
606: parameter tensor / weight tensor / 3D filter
608: output tensor
609: discrete elements
610: first dimension
615: second, different dimension
620: 2D spatial plane
625: 2D spatial plane

FIG. 1A is a block diagram of an example computing system for implementing a neural network machine learning model.

FIG. 1B is a block diagram of an example computing system for implementing a neural network machine learning model.

FIG. 2 shows an example processing pipeline for routing inputs and outputs between a memory and compute cells of a hardware integrated circuit.

FIG. 3 shows examples of memory traversals for different banking (bank generation) functions.

FIG. 4 shows an example bank generation function for reducing memory bank conflicts at a hardware integrated circuit.

FIG. 5 is an example process for reducing memory bank conflicts in a hardware integrated circuit.

FIG. 6 illustrates an example of an input tensor, a parameter tensor, and an output tensor.

Like reference numbers and designations in the various drawings indicate like elements.


Claims (20)

A computer-implemented method for concurrently accessing memory banks of a hardware accelerator, the method comprising:
receiving a plurality of requests, each of the plurality of requests being used to access a tile memory of the hardware accelerator;
for each of the plurality of requests:
identifying, in the request, a respective logical address represented by a sequence of bits;
obtaining a first subset of bits from the sequence of bits; and
generating, based on a bank generation function that uses the first subset of bits, a respective bank identifier that identifies a particular bank among a plurality of physical memory banks of the tile memory;
processing each of the plurality of requests using the respective bank identifier generated for the request; and
in response to processing each of the plurality of requests, concurrently accessing multiple distinct physical memory banks of the tile memory during a clock cycle.

The method of claim 1, wherein concurrently accessing the multiple distinct physical memory banks comprises:
for a particular stride value, accessing the multiple distinct physical memory banks during a single clock cycle.

The method of claim 2, wherein the particular stride value is a memory access stride equal to a difference between particular rows of the same physical memory bank.

The method of claim 3, wherein concurrently accessing the multiple distinct physical memory banks comprises:
accessing the multiple distinct physical memory banks without a bank conflict.
The method of claim 4, wherein a bank conflict occurs when two or more requestors request access to the same physical memory bank of the tile memory during the same clock cycle.

The method of claim 5, wherein each physical memory bank comprises a plurality of rows, and the method comprises:
for each of the plurality of requests:
obtaining a second subset of bits from the sequence of bits;
providing the second subset of bits as an input to the bank generation function; and
generating, based on applying the bank generation function to the second subset of bits, a respective row identifier that identifies a particular row in the particular bank among the plurality of physical memory banks of the tile memory.

The method of claim 6, wherein:
each row of the plurality of rows comprises a width of 16 bytes;
a partition of the tile memory comprises 32 physical memory banks; and
the memory access stride is equal to row_width * num_banks.

The method of claim 1, wherein obtaining a first subset of bits from the sequence of bits comprises:
obtaining two or more bits from among the least significant bits (LSBs) of the sequence of bits.
A system, comprising:
a hardware accelerator;
a processing device; and
a non-transitory machine-readable storage medium storing instructions that are executable by the processing device to cause performance of operations comprising:
receiving a plurality of requests, each of the plurality of requests being used to access a tile memory of the hardware accelerator;
for each of the plurality of requests:
identifying, in the request, a respective logical address represented by a sequence of bits;
obtaining a first subset of bits from the sequence of bits; and
generating, based on a bank generation function that uses the first subset of bits, a respective bank identifier that identifies a particular bank among a plurality of physical memory banks of the tile memory;
processing each of the plurality of requests using the respective bank identifier generated for the request; and
in response to processing each of the plurality of requests, concurrently accessing multiple distinct physical memory banks of the tile memory during a single clock cycle.

The system of claim 9, wherein concurrently accessing the multiple distinct physical memory banks comprises:
for a particular stride value, accessing the multiple distinct physical memory banks during a single clock cycle.

The system of claim 10, wherein the particular stride value is a memory access stride equal to a difference between particular rows of the same physical memory bank.
The system of claim 11, wherein concurrently accessing the multiple distinct physical memory banks comprises:
accessing the multiple distinct physical memory banks without a bank conflict.

The system of claim 12, wherein a bank conflict occurs when two or more requestors request access to the same physical memory bank of the tile memory during the same clock cycle.

The system of claim 13, wherein each physical memory bank comprises a plurality of rows, and the operations further comprise:
for each of the plurality of requests:
obtaining a second subset of bits from the sequence of bits;
providing the second subset of bits as an input to the bank generation function; and
generating, based on applying the bank generation function to the second subset of bits, a respective row identifier that identifies a particular row in the particular bank among the plurality of physical memory banks of the tile memory.

The system of claim 14, wherein:
each row of the plurality of rows comprises a width of 16 bytes;
a partition of the tile memory comprises 32 physical memory banks; and
the memory access stride is equal to row_width * num_banks.

The system of claim 9, wherein obtaining a first subset of bits from the sequence of bits comprises:
obtaining two or more bits from among the least significant bits (LSBs) of the sequence of bits.
A non-transitory machine-readable storage medium storing instructions that are executable by a processing device of a hardware accelerator to cause performance of operations comprising:
receiving a plurality of requests, each of the plurality of requests being used to access a tile memory of the hardware accelerator;
for each of the plurality of requests:
identifying, in the request, a respective logical address represented by a sequence of bits;
obtaining a first subset of bits from the sequence of bits; and
generating, based on a bank generation function that uses the first subset of bits, a respective bank identifier that identifies a particular bank among a plurality of physical memory banks of the tile memory;
processing each of the plurality of requests using the respective bank identifier generated for the request; and
in response to processing each of the plurality of requests, concurrently accessing multiple distinct physical memory banks of the tile memory during a single clock cycle.

The non-transitory machine-readable storage medium of claim 17, wherein concurrently accessing the multiple distinct physical memory banks comprises:
for a particular stride value, accessing the multiple distinct physical memory banks during a single clock cycle.

The non-transitory machine-readable storage medium of claim 18, wherein the particular stride value is a memory access stride equal to a difference between particular rows of the same physical memory bank.
The non-transitory machine-readable storage medium of claim 19, wherein concurrently accessing the multiple distinct physical memory banks comprises:
accessing the multiple distinct physical memory banks without a bank conflict.
TW112127069A 2022-09-15 2023-07-20 Reducing memory bank conflicts in a hardware accelerator TW202414200A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
WOPCT/US22/76500 2022-09-15

Publications (1)

Publication Number Publication Date
TW202414200A true TW202414200A (en) 2024-04-01


Similar Documents

Publication Publication Date Title
US20210019600A1 (en) Multi-memory on-chip computational network
US10846621B2 (en) Fast context switching for computational networks
JP7430744B2 (en) Improving machine learning models to improve locality
Oyama et al. The case for strong scaling in deep learning: Training large 3d cnns with hybrid parallelism
CN111465943B (en) Integrated circuit and method for neural network processing
Mittal A survey of accelerator architectures for 3D convolution neural networks
JP2020513120A (en) Vector reduction processor
US20210350230A1 (en) Data dividing method and processor for convolution operation
CN113760531A (en) Scheduler, method of operating scheduler, and accelerator apparatus including scheduler
JP2023508812A (en) Hardware circuitry for accelerating neural network calculations
JP7413549B2 (en) Shared scratchpad memory with parallel load stores
KR20220161339A (en) Feature reordering based on similarity for improved memory compression transfer in machine learning tasks
TW202414200A (en) Reducing memory bank conflicts in a hardware accelerator
Khan et al. Accelerating SpMV multiplication in probabilistic model checkers using GPUs
Diamantopoulos et al. A system-level transprecision FPGA accelerator for BLSTM using on-chip memory reshaping
US20220223201A1 (en) Caching Techniques for Deep Learning Accelerator
CN114912590A (en) Processor, method of operating the processor, and electronic device including the processor
WO2024058810A1 (en) Reducing memory bank conflicts in a hardware accelerator
Ghosh et al. A parallel cyclic reduction algorithm for pentadiagonal systems with application to a convection-dominated Heston PDE
CN114265673A (en) Spatial slicing of a compute array with shared control
KR20220049325A (en) Accelerator and electronic device including the same
Li Towards efficient hardware acceleration of deep neural networks on fpga
CN118159986A (en) Neural network model based on hardware accelerator optimizing group convolution
KR20240063137A (en) Hardware accelerator-optimized group convolution-based neural network model
CN111506384B (en) Simulation operation method and simulator