DE102017117381A1

DE102017117381A1 - Accelerator for sparse convolutional neural networks

Info

Publication number: DE102017117381A1
Application number: DE102017117381.1A
Authority: DE
Inventors: William J. Dally; Angshuman Parashar; Joel Springer Emer; Stephen William Keckler
Original assignee: Nvidia Corp
Current assignee: Nvidia Corp
Priority date: 2016-08-11
Filing date: 2017-08-01
Publication date: 2018-02-15

Abstract

A method, computer program product, and system perform computations using an accelerator for sparse convolutional neural networks. A first vector comprising only non-zero weight values and first associated positions of the non-zero weight values within a three-dimensional space is received. A second vector comprising only non-zero input activation values and second associated positions of the non-zero input activation values within a two-dimensional space is received. The non-zero weight values are multiplied by the non-zero input activation values within a multiplier array to produce a third vector of products. The first associated positions are combined with the second associated positions to produce a fourth vector of positions, where each position in the fourth vector is associated with a respective product in the third vector. The products in the third vector are transmitted to adders in an accumulator array based on the position associated with each of the products.

Description

Claimed priority

This application claims the benefit of U.S. Provisional Application No. 62/373,919 (Attorney Docket No. NVIDP1137+/16-SC-0139-US01) titled "Sparse Convolutional Neural Network Accelerator," filed August 11, 2016, the entire contents of which are incorporated herein by reference.

Field of the invention

The present invention relates to convolutional neural networks, and more particularly to an accelerator for convolutional neural networks.

Background

Driven by the availability of massive data and the computational capability to process it, deep learning has recently emerged as a critical tool for solving complex problems across a wide range of domains, including image recognition, speech processing, natural language processing, language translation, and autonomous vehicles. Convolutional neural networks (CNNs) have become the most popular algorithmic approach to deep learning in many of these domains. High performance and extreme energy efficiency are critical for deploying CNNs in a wide range of situations, especially on mobile platforms such as autonomous vehicles, cameras, and electronic assistants.

Employing CNNs can be decomposed into two tasks: (1) training, in which the parameters of a neural network are learned by observing numerous training examples, and (2) classification, in which a trained neural network is deployed in the field and classifies the observed data. Today, training is often performed on graphics processing units (GPUs) or farms of GPUs, while classification depends on the application and may employ central processing units (CPUs), GPUs, field-programmable gate arrays (FPGAs), or application-specific integrated circuits (ASICs).

During the training process, a deep learning expert typically architects the network, establishing the number of layers in the neural network, the operation performed by each layer, and the connectivity between layers. Many layers have parameters, typically filter weights, that determine the exact computation performed by the layer. The objective of the training process is to learn the filter weights, usually via a stochastic gradient descent-based excursion through the space of weights. The training process typically employs a forward-propagation calculation for each training example, a measurement of the error between the computed and the desired output, and then back-propagation through the network to update the weights. Inference has similarities, but includes only the forward-propagation calculation. Nonetheless, the computation requirements for inference can be very high, particularly with the emergence of deeper networks (hundreds of layers) and larger input sets, such as high-definition video. Furthermore, the energy efficiency of this computation is important, especially for mobile platforms such as autonomous vehicles, cameras, and electronic personal assistants. The computation requirements and energy consumption of a neural network for machine learning present challenges to mobile platforms. Thus, there is a need to address these and/or other issues associated with the prior art.

Summary

A method, computer program product, and system perform computations using an accelerator for sparse convolutional neural networks. A first vector comprising only non-zero weight values and first associated positions of the non-zero weight values within a three-dimensional space is received. A second vector comprising only non-zero input activation values and second associated positions of the non-zero input activation values within a two-dimensional space is received. The non-zero weight values are multiplied by the non-zero input activation values within a multiplier array to produce a third vector of products. The first associated positions are combined with the second associated positions to produce a fourth vector of positions, where each position in the fourth vector is associated with a respective product in the third vector. The third vector is transmitted to an accumulator array, where each one of the products in the third vector is transmitted to an adder in the accumulator array that is configured to generate an output activation value at the position associated with the product.

Brief description of the drawings

FIG. 1 shows a flowchart of a method for performing computations using a sparse convolutional neural network (SCNN) accelerator, in accordance with one embodiment;

FIG. 2A shows a block diagram of an SCNN accelerator, in accordance with one embodiment;

FIG. 2B shows a conceptual diagram of the organization of input activations and filter weights for processing by the SCNN accelerator shown in FIG. 2A, in accordance with one embodiment;

FIG. 2C shows a block diagram of a processing element, in accordance with one embodiment;

FIG. 3A shows a block diagram of another processing element, in accordance with one embodiment;

FIG. 3B shows two 3×3 weight kernels and positions, in accordance with one embodiment;

FIG. 3C shows a single-stage F*I arbitrated crossbar, in accordance with one embodiment;

FIG. 3D shows an accumulator unit, in accordance with one embodiment;

FIG. 3E shows a two-stage F*I arbitrated crossbar, in accordance with one embodiment;

FIG. 4A shows a flowchart of a method for compressing weight and activation values, in accordance with one embodiment;

FIG. 4B shows a tile of weight values for two output channels, in accordance with one embodiment;

FIG. 4C shows a coding scheme for weights and input activations (IA), in accordance with one embodiment;

FIG. 4D shows the weight values for four 3×3 convolution kernels, in accordance with one embodiment;

FIG. 4E shows an encoding of the positions for the weight values in the four 3×3 convolution kernels shown in FIG. 4D, in accordance with one embodiment;

FIG. 4F shows a block diagram for determining the weight coordinates (r, s), in accordance with one embodiment;

FIG. 4G shows a block diagram for determining the input activation coordinates (x, y), in accordance with one embodiment;

FIG. 5A shows a non-linear coding scheme for input activation zero-count values, in accordance with one embodiment;

FIG. 5B shows another coding scheme for input activation zero-count values, in accordance with one embodiment;

FIG. 5C shows another coding scheme for input activation zero-count values, in accordance with one embodiment;

FIG. 5D shows another coding scheme for weight zero-count values, in accordance with one embodiment;

FIG. 5E shows another coding scheme for weight zero-count values, in accordance with one embodiment; and

FIG. 6 shows an exemplary system in which the various architectures and/or functionality of the various previous embodiments may be implemented.

Detailed description

Neural networks typically have significant redundancy and can be pruned dramatically during training without substantially affecting the accuracy of the neural network. The number of weights that can be eliminated varies widely across the layers of the neural network, but typically ranges from 20% to 80%. Eliminating weights results in a neural network with a substantial number of zero values, which can reduce the computational requirements of inference.

The inference computation also offers a further optimization opportunity. In particular, many neural networks employ the rectified linear unit (ReLU) function, a non-linear operator that clamps all negative activation values to zero. The activations are the output values of an individual layer that are passed as inputs to the next layer. For typical data sets, 50-70% of the activations are clamped to zero. Since the multiplication of weights and activations is the key computation for inference, the combination of activations that are zero and weights that are zero can reduce the amount of computation required by more than an order of magnitude. A sparse CNN (SCNN) accelerator architecture described herein exploits the sparsity of weights and/or activations to reduce energy consumption and improve processing throughput. The SCNN accelerator architecture uses an algorithmic dataflow that eliminates all multiplications with a zero operand while employing a compressed representation of weights and activations through almost the entire computation. In one embodiment, each non-zero weight and activation value is represented by a (value, position) pair.
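The claimed reduction of more than an order of magnitude can be checked with a short calculation. The following is a minimal sketch in Python, not part of the described embodiments; the function names and the sample densities are illustrative assumptions, and the estimate assumes that zero weights and zero activations are distributed independently of each other.

```python
def remaining_multiply_fraction(weight_density, activation_density):
    """Fraction of the dense multiplications whose operands are both non-zero.

    Assumption (illustration only): zero weights and zero activations
    occur independently of each other.
    """
    return weight_density * activation_density


def relu(values):
    """Rectified linear unit: clamp negative activation values to zero."""
    return [max(0, v) for v in values]


# Hypothetical layer: 80% of the weights pruned, 70% of the activations zero.
fraction = remaining_multiply_fraction(weight_density=0.2, activation_density=0.3)
reduction = 1 / fraction  # how many times fewer multiplications are needed

print(f"remaining multiplications: {fraction:.0%}, reduction: {reduction:.1f}x")
print(relu([-2, 0, 3]))  # negative activation becomes zero
```

With these illustrative densities, about 6% of the multiplications remain, which corresponds to a reduction of more than ten-fold. Layers with less sparsity would see a smaller benefit.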

Additional benefits can be achieved by a compressed or compact encoding for sparse weights and/or activations that include several zeros, which allows more weight and/or activation values to fit in on-chip random access memory (RAM) and reduces the number of energy-costly dynamic random access memory (DRAM) accesses needed to read activations and weights. Furthermore, transmitting the compact encoding may reduce the number of transfers on buses, further reducing energy consumption. Finally, only the non-zero elements of weights and input activations are provided as operands to the multipliers, ensuring that each multiplier within a processing element (PE) generates a product that affects an output activation value. In the context of the following description, activation refers to an input and/or output activation. In the context of the following description, the weight and activation values are multi-bit values representing zero, positive values, or negative values. In the context of the following description, the positions are coordinates in an N-dimensional space.

FIG. 1 shows a flowchart of a method 100 for performing computations using an SCNN accelerator, in accordance with one embodiment. Although the method 100 is described in the context of a processing element within an SCNN accelerator, the method 100 may also be performed by a program, custom circuitry, or a combination of custom circuitry and a program. Furthermore, persons of ordinary skill in the art will understand that any system that performs the method 100 is within the scope and spirit of embodiments of the present invention.

At step 105, a first vector comprising only non-zero weight values and first associated positions of the non-zero weight values within a three-dimensional (3D) space is received. In one embodiment, the first vector is received from a memory. In one embodiment, the first vector is received by a processing element (PE) within an SCNN accelerator, such as the SCNN accelerator 200 described in conjunction with FIG. 2A.

At step 110, a second vector comprising only non-zero input activation values and second associated positions of the non-zero input activation values within a two-dimensional (2D) space is received. In one embodiment, the second vector is received from a memory. In one embodiment, the second vector is received by a PE within an SCNN accelerator, such as the SCNN accelerator 200. In one embodiment, the second vector is generated by the SCNN accelerator 200 during processing of a previous layer of a neural network.

At step 115, each one of the non-zero weight values is multiplied by every one of the non-zero input activation values, within a multiplier array, to produce a third vector of products. At step 120, the first associated positions are combined with the second associated positions to produce a fourth vector of positions, where each position in the fourth vector is associated with a respective product in the third vector. In one embodiment, the combining comprises performing a vector addition that sums coordinates in the first associated positions with coordinates in the second associated positions to produce the fourth vector of positions.

At step 125, the third vector is transmitted to an accumulator array, where each one of the products in the third vector is transmitted to an adder in the accumulator array that is configured to generate an output activation value at the position associated with the product. In one embodiment, the third vector is transmitted through an array of buffers in the accumulator array, where each one of the buffers is coupled to an input of one of the adders in the accumulator array.
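Steps 105 through 125 can be modeled in software as an all-pairs multiplication followed by a scatter-accumulate. The sketch below is an illustration under stated assumptions rather than the hardware described here: the function name `scnn_step`, the use of a dictionary as the accumulator array, and the interpretation of a weight position as an offset (dx, dy, k) that is added to an activation position (x, y) are all assumptions made for the example.

```python
from itertools import product


def scnn_step(weights, activations, out_shape):
    """Software model of steps 105-125 (illustrative, not the hardware).

    weights:     list of (value, (dx, dy, k)) pairs, non-zero values only;
                 the 3D position is assumed to be an x/y offset plus an
                 output channel k.
    activations: list of (value, (x, y)) pairs, non-zero values only.
    out_shape:   (width, height, channels) of the output activation volume.
    """
    width, height, channels = out_shape
    accumulators = {}  # stands in for the accumulator array
    # Step 115: every non-zero weight meets every non-zero activation.
    for (w, (dx, dy, k)), (a, (x, y)) in product(weights, activations):
        value = w * a
        # Step 120: combine positions by vector addition (assumed form).
        ox, oy = x + dx, y + dy
        # Products that fall outside the output volume are discarded.
        if 0 <= ox < width and 0 <= oy < height and 0 <= k < channels:
            # Step 125: route the product to the adder for its position.
            accumulators[(ox, oy, k)] = accumulators.get((ox, oy, k), 0) + value
    return accumulators


weights = [(2, (0, 0, 0)), (4, (1, 0, 0))]
activations = [(3, (1, 0)), (5, (0, 0))]
print(scnn_step(weights, activations, out_shape=(3, 1, 1)))
```

In this example two of the four products share the output position (1, 0, 0) and are summed by the same accumulator entry, which is the behavior the accumulator array is described as providing.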

More illustrative information will now be set forth regarding various optional architectures and features with which the foregoing framework may or may not be implemented, per the desires of the user. It should be noted that the following information is set forth for illustrative purposes and should not be construed as limiting. Any of the following features may be optionally incorporated with or without the exclusion of the other features described.

Accelerator for sparse convolutional neural networks

FIG. 2A shows a block diagram of the SCNN 200, in accordance with one embodiment. The SCNN 200 uses an algorithmic dataflow that eliminates all multiplications with a zero operand while transmitting a compact representation of weights and/or input activations between memory and logic blocks within the SCNN 200. The SCNN 200 includes a memory interface 205, a layer sequencer 215, and an array of processing elements (PEs) 210.

The memory interface 205 reads weight and activation data from a memory coupled to the SCNN 200, and the memory interface 205 may also write weight and/or activation data from the SCNN 200 to the memory. In one embodiment, all of the activation data is stored within the PEs 210, so that only weight data is accessed through the memory interface 205. The weight and/or activation data may be stored in the memory in a compact format or an expanded format. The compact format may comprise vectors that include only non-zero elements (weights or activations) and the positions associated with the non-zero elements.

The memory may be implemented using dynamic random access memory (DRAM) or the like. In one embodiment, the memory interface 205 or the PEs 210 are configured to compact multi-bit data, such as the weights, input activations, and output activations. The layer sequencer 215 controls the reading of the memory to obtain the compact input activations and the compact weights. The compact input activations and the compact weights may be stored within the memory interface 205 before being transmitted to the PEs 210.

In one embodiment, the compact activations and the compact weights are data sequences encoded as non-zero elements and positions. In one embodiment, each non-zero element and its position are encoded as a (value, position) pair. If needed, the compact activations and the compact weights may be expanded into data sequences of weights and activations that include multi-bit zero and non-zero elements. Importantly, when the weights and input activations are in compact form, only non-zero weights and non-zero input activations are transferred from the memory interface 205 to the PEs 210. In one embodiment, the non-zero elements are 8 bits and the positions are 4 bits. However, the non-zero elements may be more or fewer than 8 bits, and the positions may be more or fewer than 4 bits.
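The (value, position) encoding described above can be sketched as follows. This is a minimal illustration, not the encoding of the embodiment: the function names `compress` and `expand` are hypothetical, and the position is assumed to be the absolute index of the element in the expanded vector, which the text does not specify.

```python
def compress(dense):
    """Keep only the non-zero elements, each paired with its position.

    Assumption: the position is the absolute index in the expanded vector.
    """
    return [(value, pos) for pos, value in enumerate(dense) if value != 0]


def expand(pairs, length):
    """Rebuild the expanded vector, filling unlisted positions with zeros."""
    dense = [0] * length
    for value, pos in pairs:
        dense[pos] = value
    return dense


dense = [0, 7, 0, 0, 3, 0, 0, 0, 5]
pairs = compress(dense)                # only three of nine elements are kept
restored = expand(pairs, len(dense))   # round trip back to the expanded format
```

Only the pairs would need to travel from the memory interface to the PEs; the zeros are implied by the missing positions.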

The layer sequencer 215 reads the weights and outputs the weight vectors to be multiplied by the PEs 210. In one embodiment, the weights are in compact form and are read from off-chip DRAM only once and stored within the SCNN accelerator 200. In one embodiment, the layer sequencer 215 transmits a weight vector to each PE 210 and sequences through multiple activation vectors before transmitting another weight vector. In one embodiment, the layer sequencer 215 transmits an input activation vector to each PE 210 and sequences through multiple weight vectors before transmitting another input activation vector. The products generated by the multipliers within each PE 210 are accumulated into intermediate values (e.g., partial sums) that become the output activations after one or more iterations. When the output activations for a neural network layer have been computed and stored in an output activation buffer, the layer sequencer 215 may process another layer by applying the output activations as input activations.

Each PE 210 includes a multiplier array that accepts a vector of weights (weight vector) and a vector of input activations (activation vector), where each multiplier within the array is configured to generate a product from one input activation value in the activation vector and one weight in the weight vector. The weights and input activations in the vectors can all be multiplied by one another in the manner of a Cartesian product. For example, if the input vectors are a, b, c, d and p, q, r, s, the output is a 16-vector with the values a*p, a*q, a*r, a*s, b*p, b*q, b*r, b*s, c*p, c*q, c*r, c*s, d*p, d*q, d*r, and d*s.
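The all-to-all multiplication can be illustrated with a short sketch. The function name `cartesian_products` and the numeric values are hypothetical; in hardware the products would be formed in parallel by the multiplier array, whereas this sketch forms them one after another.

```python
from itertools import product


def cartesian_products(first, second):
    """Multiply every element of `first` by every element of `second`.

    The result is ordered like the example in the text:
    a*p, a*q, a*r, a*s, b*p, ...
    """
    return [x * y for x, y in product(first, second)]


first = [2, 3, 5, 7]            # stands in for a, b, c, d
second = [1, 10, 100, 1000]     # stands in for p, q, r, s
products = cartesian_products(first, second)   # 4 x 4 = 16 products
```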

Importantly, only non-zero weights and non-zero input activations are transmitted to the multiplier array within each PE 210. In addition, the input activation vectors may be reused within each PE 210 in an input-stationary fashion across multiple weight vectors to reduce data accesses. The products generated by the multipliers are then summed to form partial sums and the output activations. However, because the zero values have been removed, the mapping of products to accumulators may vary for each product generated within the multiplier array. For example, in a conventional implementation in which the zero values are retained, the products generated during a clock cycle may be summed together to produce a single partial sum. In contrast, the products generated during a clock cycle within a PE 210 are not necessarily summed together into one partial sum. Therefore, the output coordinates associated with each multiplication are tracked within the PE 210, and an output position (defined by the output coordinates) and a product are provided to a distributed accumulator array for summing. The distributed accumulator array allows any product to be transferred to any adder, depending on the output position associated with the product. In one embodiment, the PEs 210 are configured to perform convolution operations on the weights and input activations. Summing the products in the adders completes the convolution operation and generates the output activations.
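How an output position can steer each product to its accumulator may be sketched as below. The sketch rests on assumptions that are not taken from the text: a single input channel, unit stride, zero padding so that the output plane has the same size as the input plane, and the coordinate rule `ox = x - r + R // 2`, `oy = y - s + S // 2`. The name `scatter_accumulate` is hypothetical.

```python
def scatter_accumulate(weights, activations, K, W, H, R, S):
    """Accumulate products of non-zero weights and activations by position.

    weights:     list of (value, (k, r, s)) for non-zero weights
    activations: list of (value, (x, y)) for non-zero input activations
    Assumption: unit stride and zero padding, so outputs form a W x H plane.
    """
    acc = [[[0] * H for _ in range(W)] for _ in range(K)]
    for w_val, (k, r, s) in weights:
        for a_val, (x, y) in activations:
            # Combine the two positions into one output coordinate.
            ox = x - r + R // 2
            oy = y - s + S // 2
            if 0 <= ox < W and 0 <= oy < H:
                # Each product may land on a different accumulator.
                acc[k][ox][oy] += w_val * a_val
    return acc


weights = [(2, (0, 1, 1)), (1, (0, 0, 1))]
activations = [(5, (1, 1)), (3, (0, 2))]
acc = scatter_accumulate(weights, activations, K=1, W=3, H=3, R=3, S=3)
```

In this example the four products go to four different output positions, which is why they cannot simply be summed into one partial sum.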

The SCNN 200 may be configured to implement CNN algorithms, which are executed as a cascaded set of pattern recognition filters trained with supervision. A CNN consists of a series of layers, which include convolutional layers, non-linear scalar operator layers, and layers that downsample the intermediate data, for example by pooling. The convolutional layers represent the core of the CNN computation and are characterized by a set of filters that are usually 1 × 1 or 3 × 3, and occasionally 5 × 5 or larger. The values of these filters are the weights, which are trained using a training set for the network. Some deep neural networks (DNNs) also include fully connected layers, typically toward the end of the DNN. During classification, a new image (in the case of image recognition) is presented to the neural network, which classifies the image into the training categories by computing each layer of the neural network in succession. The SCNN 200 accelerates the convolutional layers, receiving weights and input activations and generating output activations.

Sparsity in a layer of a CNN is defined as the fraction of zeros in the weight and input activation matrices of the layer. The primary technique for creating weight sparsity is to prune the network during training. In one embodiment, any weight with an absolute value close to zero (e.g., below a defined threshold) is set to zero. The pruning process has the effect of removing weights from the filters, and sometimes even forces an output activation to always equal zero. The remaining network may be retrained to regain the accuracy lost through naive pruning. The result is a smaller network with accuracy extremely close to that of the original network. The process can be repeated iteratively to reduce the network size while maintaining accuracy.
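The threshold-based pruning step and the sparsity measure can be illustrated as follows. This is a sketch only: the names `prune` and `sparsity`, the weight values, and the threshold are hypothetical, and the retraining step mentioned above is not modeled.

```python
def prune(weights, threshold):
    """Set every weight whose absolute value is below the threshold to zero."""
    return [0.0 if abs(w) < threshold else w for w in weights]


def sparsity(values):
    """Fraction of elements that are exactly zero."""
    return sum(1 for v in values if v == 0) / len(values)


weights = [0.8, -0.02, 0.0, 0.3, 0.01, -0.6, 0.04, -0.9]
pruned = prune(weights, threshold=0.05)

before = sparsity(weights)   # one zero out of eight
after = sparsity(pruned)     # four zeros out of eight
```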

Activation sparsity occurs dynamically during inference and is highly dependent on the data being processed. In one embodiment, activations having negative values are set to zero. In one embodiment, input activations having an absolute value below a defined threshold are set to zero.

In one embodiment, a compaction engine within the PE 210 sets output activations having an absolute value below a defined threshold to zero. If the activations are in compact form, the compaction engine reformats the activations as needed after one or more activations have been set to zero, in order to produce compact activations. After the computation of the output activations for a layer of a CNN is complete, each element in the output activation matrices that is below a threshold may be set to zero before the output activation data is passed to the next layer.
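Zeroing activations and then re-encoding them in compact form might look like the sketch below. It combines two of the embodiments above (negative values and values below a threshold are both set to zero) purely for illustration; the function names, the threshold, and the use of an absolute index as the position are assumptions.

```python
def zero_small_activations(acts, threshold):
    """Zero negative activations and activations with magnitude below threshold."""
    return [0 if (a < 0 or abs(a) < threshold) else a for a in acts]


def to_compact(dense):
    """Re-encode as (value, position) pairs, dropping the zeros."""
    return [(value, pos) for pos, value in enumerate(dense) if value != 0]


acts = [-3, 0.2, 4, 0, -1, 0.05, 2]
thresholded = zero_small_activations(acts, threshold=0.1)
compact = to_compact(thresholded)   # what would be handed to the next layer
```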

SCNN Computation Dataflow

The core operation in a CNN layer is a two-dimensional sliding-window convolution of an R × S element filter over a W × H element input activation plane to produce a W × H element output activation plane. There may be multiple (C) input activation planes, which are referred to as input channels. A distinct filter is applied to each input activation channel, and the filter outputs for each of the C channels are accumulated element-wise into a single output activation plane. Multiple filters (K) can be applied to the same body of input activations to produce K output channels of output activations. Finally, a batch of length N of groups of C channels of input activation planes can be applied to the same volume of filter weights.

FIG. 2B shows the input activations, weights, and output activations for a single CNN layer according to one embodiment. The set of computations for the complete layer can be formulated as a loop nest over the seven variables (N, K, C, W, H, R, and S). Because multiply-add operations are associative (modulo rounding errors, which are ignored in the context of the following description), all permutations of the seven loop variables are legal. Table 1 shows an example of a loop nest based on one such permutation. The nest can be described concisely as N → K → C → W → H → R → S. Each point in the seven-dimensional space formed by the variables represents a single multiply-accumulate operation. Note that for the remainder of the description a batch size of 1 is assumed, which is a common batch size for inference tasks.

Table 1: Seven-dimensional loop nest
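A loop nest in the order N → K → C → W → H → R → S can be sketched as follows. This is an illustration under stated assumptions rather than a reproduction of Table 1: the array names `inp`, `filt`, and `out`, the unit stride, and the zero padding that keeps the output plane at W × H are all chosen for the example.

```python
def conv_layer(inp, filt, N, K, C, W, H, R, S):
    """Seven-deep loop nest; each innermost iteration is one multiply-accumulate.

    inp[n][c][x][y] holds input activations, filt[k][c][r][s] holds weights.
    Assumption: unit stride, zero padding (out-of-range inputs are skipped).
    """
    out = [[[[0] * H for _ in range(W)] for _ in range(K)] for _ in range(N)]
    for n in range(N):
        for k in range(K):
            for c in range(C):
                for w in range(W):
                    for h in range(H):
                        for r in range(R):
                            for s in range(S):
                                x = w + r - R // 2
                                y = h + s - S // 2
                                if 0 <= x < W and 0 <= y < H:
                                    out[n][k][w][h] += inp[n][c][x][y] * filt[k][c][r][s]
    return out


# Batch of 1, one input channel, one filter, all values equal to 1.
inp = [[[[1] * 3 for _ in range(3)]]]
filt = [[[[1] * 3 for _ in range(3)]]]
out = conv_layer(inp, filt, N=1, K=1, C=1, W=3, H=3, R=3, S=3)
```

Because every iteration only adds one product to one output element, the seven loops can be reordered without changing the result, apart from rounding effects in floating point.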

The simple loop nest shown in Table 1 can be transformed in numerous ways to capture different reuse patterns of the activations and weights and to map the computation onto a hardware accelerator implementation, such as the SCNN accelerator 200. The dataflow of a CNN defines how the loops are ordered, partitioned, and parallelized. The choice of dataflow can have a significant impact on the area and energy efficiency of an architecture.

While the concept of dataflow has been studied for dense architectures, sparse architectures can also employ various alternative dataflows, each with its own set of trade-offs. One such specific dataflow, described herein, is a sparse planar-tiled input-stationary (PTIS-sparse) dataflow. PTIS-sparse enables reuse patterns that exploit the characteristics of sparse weights and activations. First, an equivalent dense dataflow (PTIS-dense) is described to explain the decomposition of the computations. Then the specific features of PTIS-sparse are described.

FIG. 2C shows a PE 220 according to one embodiment. To understand the temporal component of the PTIS-dense dataflow, the operation of the PE 220 is described. PTIS employs an input-stationary computation order in which an input activation is held stationary at the computation units while it is multiplied by all of the filter weights needed to make all of its contributions to each of the K output channels (a K × R × S sub-volume). Thus, each input activation contributes to a volume of K × R × S output activations. The input-stationary computation order maximizes the reuse of the input activations, at the cost of streaming the weights to the PEs 220. Accommodating multiple input channels (C) adds an additional outer loop and results in the loop nest C → W → H → R → S.

The PTIS-dense dataflow relies on input buffers, namely a weight buffer 230 and an input activation buffer 235, to store the weights and the input activations, respectively. An accumulator buffer 250 stores the partial sums of the output activations. A read-add-write operation is performed for every access to a previously written partial sum in the accumulator buffer 250. The accumulator buffer 250, together with an attached adder unit 255, forms an accumulation unit 245.

The parameters of contemporary networks cause the weight buffer 230 and the input activation buffer 235 to be large and energy-expensive to access. The input-stationary temporal loop nest amortizes the energy cost of accessing the weight buffer 230 and the input activation buffer 235 over multiple accesses to the weight buffer 230 and the accumulator buffer 250. More specifically, the register in which the input is held stationary over K × R × S iterations serves as an inner buffer that filters accesses to the larger input buffer (e.g., the weight buffer 230 or the input activation buffer 235).

Unfortunately, holding the input activations stationary comes at the cost of more accesses to the weights in the weight buffer 230 (or in the memory) and to the partial sums in the accumulator buffer 250. Blocking the weights and partial sums in the output channel (K) dimension can increase reuse of the weight buffer 230 and the accumulator buffer 250, which improves energy efficiency. The output channel variable (K) can be factored into K_c (called an output channel group), where K/K_c is the number of output channel groups. In one embodiment, only the weights and outputs of a single output channel group are stored in the weight buffer 230 and the accumulator buffer 250 at any one time. The sub-volumes held in buffers at the computation unit are therefore:
Weights: K_c × R × S
Input activations: C × W × H
Partial sums: K_c × W × H
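The three sub-volumes listed above translate directly into element counts. The sketch below evaluates them for one hypothetical layer; the function name and the layer dimensions are illustrative and do not come from the description.

```python
def buffer_elements(K_c, C, W, H, R, S):
    """Number of elements held in each buffer for one output channel group."""
    return {
        "weights": K_c * R * S,
        "input_activations": C * W * H,
        "partial_sums": K_c * W * H,
    }


# Hypothetical layer: 64 input channels, 14 x 14 planes, 3 x 3 filters,
# and output channel groups of 8.
sizes = buffer_elements(K_c=8, C=64, W=14, H=14, R=3, S=3)
```

Choosing a smaller K_c shrinks the weight and partial-sum volumes but increases the number of output channel groups K/K_c that must be processed.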

An outer loop over all K/K_c output channel groups results in the complete loop nest K/K_c → C → W → H → K_c → R → S. Note that for each iteration of the outer loop, the weight buffer 230 must be refilled and the accumulator buffer 250 must be drained and cleared, while the contents of the input activation buffer 235 are fully reused, because the same input activations are used across all output channels.
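The blocked, input-stationary loop order K/K_c → C → W → H → K_c → R → S can be sketched as below. The sketch assumes a batch size of 1, unit stride, zero padding, and that K is a multiple of K_c; the name `ptis_dense_blocked`, the array layout, and the coordinate rule are assumptions made for the example.

```python
def ptis_dense_blocked(inp, filt, K, K_c, C, W, H, R, S):
    """Input-stationary loop nest blocked over output channel groups.

    inp[c][x][y] holds input activations, filt[k][c][r][s] holds weights.
    Assumption: unit stride, zero padding, K divisible by K_c.
    """
    out = [None] * K
    for g in range(K // K_c):
        # New output channel group: the accumulator buffer starts cleared.
        acc = [[[0] * H for _ in range(W)] for _ in range(K_c)]
        for c in range(C):
            for x in range(W):
                for y in range(H):
                    a = inp[c][x][y]   # held stationary for K_c * R * S steps
                    for kc in range(K_c):
                        for r in range(R):
                            for s in range(S):
                                w = x - r + R // 2
                                h = y - s + S // 2
                                if 0 <= w < W and 0 <= h < H:
                                    acc[kc][w][h] += a * filt[g * K_c + kc][c][r][s]
        # Drain the accumulator buffer before the next group.
        for kc in range(K_c):
            out[g * K_c + kc] = acc[kc]
    return out


K, K_c, C, W, H, R, S = 4, 2, 2, 3, 3, 3, 3
inp = [[[1] * H for _ in range(W)] for _ in range(C)]
filt = [[[[k + 1] * S for _ in range(R)] for _ in range(C)] for k in range(K)]
out = ptis_dense_blocked(inp, filt, K, K_c, C, W, H, R, S)
```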

To exploit the parallelism of the many multipliers within a PE 220, a vector of F filter weights may be fetched from the weight buffer 230 and a vector of I inputs may be fetched from the input activation buffer 235. The vectors are delivered to an array of F × I multipliers 240 to compute a full Cartesian product of output partial sums. Each product yields a useful partial sum, so that no extraneous fetches or computations are performed. PTIS-sparse exploits the same property to perform efficient computations on compressed-sparse weights and input activations.

The multiplier outputs (e.g., products) are sent to the accumulation unit 245, which updates the partial sums stored in the accumulator buffer 250. Each product is accumulated with a partial sum at the output coordinates in the output activation space that match (i.e., equal) a position associated with the product. The output positions for the products are computed in parallel with the products (not shown in FIG. 2C). In one embodiment, the coordinates defining the output positions are computed by a state machine or logic circuit in the accumulation unit 245. The number of adders in the adder unit 255 does not necessarily equal the number of multipliers in the F × I multiplier array 240. However, the accumulation unit 245 must employ at least F × I adders in the adder unit 255 to match the throughput of the F × I multiplier array 240.

Table 2 shows pseudocode for the PTIS-dense dataflow, including blocking in the K dimension, fetching vectors of input activations and weights (B, D), and computing the Cartesian product in parallel (E, F). Note that this PTIS-dense dataflow is simply a reordered, partitioned, and parallelized version of the pseudocode shown in Table 1.

Table 2: Pseudocode for the PTIS-dense dataflow
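One possible reading of such a dataflow, with vectors of I input activations and F weights fetched and multiplied as a Cartesian product, is sketched below. It is not a reproduction of Table 2: the helper `chunks`, the function name, the buffer layout, and the coordinate rule are assumptions, and the letters in the comments only follow the roles the text attributes to steps (B), (D), (E), and (F). The inner two loops would run in parallel on the F × I multiplier array; here they run sequentially.

```python
def chunks(seq, n):
    """Split a list into consecutive vectors of at most n elements."""
    return [seq[i:i + n] for i in range(0, len(seq), n)]


def ptis_dense_vectorized(inp, filt, K, K_c, C, W, H, R, S, F, I):
    """PTIS-dense sketch with vector fetches (assumed unit stride, zero padding)."""
    out_buf = [[[0] * H for _ in range(W)] for _ in range(K)]
    act_pos = [(x, y) for x in range(W) for y in range(H)]
    wt_pos = [(kc, r, s) for kc in range(K_c) for r in range(R) for s in range(S)]
    for g in range(K // K_c):                       # blocking in the K dimension
        for c in range(C):
            for act_vec in chunks(act_pos, I):      # (B) fetch I input activations
                for wt_vec in chunks(wt_pos, F):    # (D) fetch F weights
                    for (x, y) in act_vec:          # (E) Cartesian product
                        for (kc, r, s) in wt_vec:
                            k = g * K_c + kc
                            ox = x - r + R // 2     # output position from indices
                            oy = y - s + S // 2
                            if 0 <= ox < W and 0 <= oy < H:
                                # (F) accumulate the product at its output position
                                out_buf[k][ox][oy] += inp[c][x][y] * filt[k][c][r][s]
    return out_buf


K, K_c, C, W, H, R, S = 4, 2, 2, 3, 3, 3, 3
inp = [[[1] * H for _ in range(W)] for _ in range(C)]
filt = [[[[k + 1] * S for _ in range(R)] for _ in range(C)] for k in range(K)]
out_buf = ptis_dense_vectorized(inp, filt, K, K_c, C, W, H, R, S, F=4, I=4)
```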

Note that the output positions associated with an output buffer (out_buf) can be computed from the loop indices shown in section (F) of Table 2.

To scale beyond the practical limits on the number of multipliers and the buffer sizes within a PE 220, a tiling strategy may be employed to distribute the work across an array of PEs 210, so that each PE 210 can operate independently. In one embodiment of the PTIS-dense technique, the W × H element activation plane is partitioned into smaller W_t × H_t element tiles that are distributed across the PEs 210 in the SCNN accelerator 200. Each tile extends fully in the input channel dimension C, resulting in an input activation volume of C × W_t × H_t assigned to each PE 210. The weights are broadcast to the PEs 210, and each PE 210 operates on an exclusive subset of the input and output activation space. In other words, there is no duplication of input activations or output activations between the PEs 210.

Unfortunately, strictly partitioning both input activations and output activations into W_t × H_t tiles does not work, because the sliding-window nature of the convolution operation introduces cross-tile dependencies at the tile edges. These dependencies are called halos. Halos can be resolved in two ways. The first technique for handling halos is to size the input activation buffers 235 in each PE 210 slightly larger than C × W_t × H_t to accommodate the halos. The halo input activation values are replicated across adjacent PEs 210, but the computed products remain strictly private to each PE 210. The replicated input activation values can be multicast when the input activation values are stored into the input activation buffers 235. The second technique for handling halos is to size the accumulation buffer in each PE 210 slightly larger than K_c × W × H to accommodate the halos. The halos then contain partial sums that must be communicated to neighboring PEs 210 for accumulation. In one embodiment, the communication between neighboring PEs 210 occurs at the end of the computation of each output channel group.
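The origin of output halos in the second technique can be seen in a one-dimensional sketch. It assumes, purely for illustration, a stride of 1 and the coordinate rule out_x = x − r for an input at x and a filter offset r in [0, R); the text does not fix this rule, and the function names are hypothetical.

```python
def output_span(x0, W_t, R):
    """Output x-coordinates touched by one tile, assuming out_x = x - r and stride 1."""
    return range(x0 - (R - 1), x0 + W_t)


def halo_positions(x0, W_t, R):
    """Output positions produced by a tile that lie outside its own range [x0, x0 + W_t)."""
    return [x for x in output_span(x0, W_t, R) if not (x0 <= x < x0 + W_t)]


# A tile covering inputs 4..7 with a filter of width R = 3.
halo = halo_positions(x0=4, W_t=4, R=3)
```

Under these assumptions the tile produces partial sums for outputs 2 through 7; positions 2 and 3 belong to the neighboring tile, so the accumulation buffer needs R − 1 extra entries in this dimension and their contents must be sent to the neighbor.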

The PTIS-sparse technique is a natural extension of the PTIS-dense technique, in which the PTIS-sparse technique exploits sparsity in the weights and activations. The PTIS-sparse dataflow is specifically designed to operate on a compressed-sparse (i.e., compact) encoding of the weights and input activations and to produce a compressed-sparse encoding of the output activations. At a CNN layer boundary, the output activations of the previous layer become the input activations of the next layer. The specific format used to generate the compressed-sparse encoded data is orthogonal to the sparse architecture itself. The key feature is that decoding a sparse format ultimately yields a non-zero data value and a position indicating the coordinates of the value in the weight matrix or input activation matrix. In one embodiment, the position is defined by an index or an address, such as an address corresponding to one of the accumulation buffers 250 or adder units 255.

FIG. 3A illustrates a block diagram of a PE 210, according to one embodiment. The PE 210 is configured to support the PTIS-sparse dataflow. Like the PE 220 shown in FIG. 2C, the PE 210 includes a weight buffer 305, an input activation buffer 310, and an F×I multiplier array 325. Parallelism within a PE 210 is achieved by processing a vector of F non-zero filter weights and a vector of I non-zero input activations within the F×I multiplier array 325. F×I products are generated in each processing cycle by each PE 210 in the SCNN accelerator 200. In one embodiment, F = I = 4. In other embodiments, F and I can each be any positive integer, and the value of F may be greater than or less than I. The values of F and I can each be tuned to balance overall performance against circuit area. At typical density values of 30% for both weights and activations, 16 multiplications of the compressed-sparse weight and input activation values are equivalent to 178 multiplications in a dense accelerator that processes weight and input activation values including zeros.
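The figure of 178 follows from dividing the number of useful multiplications by the fraction of weight/activation pairs in which both operands are non-zero. A short check, assuming the two densities are independent (an assumption of this sketch):

```python
F, I = 4, 4
weight_density = 0.30       # fraction of non-zero weights
activation_density = 0.30   # fraction of non-zero input activations

sparse_multiplies = F * I   # 16 useful products per cycle

# A dense accelerator performs every multiplication, but only a fraction
# weight_density * activation_density of them has two non-zero operands.
dense_equivalent = sparse_multiplies / (weight_density * activation_density)
```

The quotient 16 / 0.09 is approximately 177.8, which rounds to the 178 multiplications stated in the text.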

The accumulator array 340 may include one or more accumulation buffers and accumulation adders to store the products generated in the multiplier array 325 and sum the products into the partial sums. The PE 210 also includes position buffers 315 and 320, an index buffer 355, a destination calculation unit 330, an F×I arbitrated crossbar 335, and a post-processing unit 345.

To facilitate easier decoding of the compressed-sparse data, weights are grouped into compressed-sparse blocks at the granularity of an output channel group, with K_c × R × S weights encoded into one compressed-sparse block. Likewise, input activations are encoded at the granularity of input channels, with a block of W_t × H_t encoded into one compressed-sparse block. At each access, the weight buffer 305 and the position buffer 315 deliver a vector of F non-zero filter weights along with the associated positions (e.g., coordinates) within the K_c × R × S region. Similarly, the input activation buffer 310 and the position buffer 320 deliver a vector of I non-zero input activations and the associated positions (e.g., coordinates) within the W_t × H_t region. Similar to the PTIS-dense dataflow, the F×I multiplier array 325 computes the full cross product of F×I partial sum outputs, with no extraneous computations. Unlike a dense architecture that includes zero values, the output coordinates defining the output positions are not derived from loop indices in a state machine but are instead derived from the positions (e.g., coordinates) of the non-zero elements embedded in the compressed format.
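The cross product of the two vectors, together with the derivation of an output position for each product, can be sketched as follows. The data layout (value paired with a coordinate tuple) and the coordinate rule out = in − filter offset are assumptions made for illustration; the text only states that the positions are combined.

```python
from itertools import product


def multiply_vectors(weights, activations):
    """Cross product of F non-zero weights and I non-zero activations.

    weights:     list of (value, (k, r, s)), positions within K_c x R x S
    activations: list of (value, (x, y)),    positions within W_t x H_t
    Returns a list of F*I entries (product, (k, out_x, out_y)).
    """
    results = []
    for (w, (k, r, s)), (a, (x, y)) in product(weights, activations):
        # Assumed coordinate rule, for illustration only.
        results.append((w * a, (k, x - r, y - s)))
    return results


weights = [(2.0, (0, 0, 0)), (-1.0, (1, 1, 0))]   # F = 2
activations = [(3.0, (2, 2)), (0.5, (3, 1))]      # I = 2
products = multiply_vectors(weights, activations)
```

Every one of the F × I = 4 products is useful because no operand is zero, and the output positions come from the stored coordinates rather than from loop counters.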

Although computing the output positions of the products is not difficult, the products, unlike in the PTIS-dense technique, are typically not contiguous when the PTIS-sparse technique is used. The products generated by the F×I multiplier array 325 therefore need to be scattered to discontiguous positions within the K_c × W_t × H_t output space. Because any partial sum in the output space can be zero, the accumulator array 340 stores the data in a dense format that includes both non-zero and zero values. In fact, output activations will with high probability have a high density, even when the density of the weights and input activations is very low (i.e., sparsity is high), until the output activations pass through a ReLU operation.

To accommodate the accumulation of sparse partial sums, the monolithic K_c × W_t × H_t accumulation buffer 250 used in the PTIS-dense dataflow is modified into a distributed array of smaller accumulation buffers accessed via a scatter network, which may be implemented as a crossbar switch such as the F×I arbitrated crossbar 335. The F×I arbitrated crossbar 335 routes F×I products to an array of A accumulator units based on the output position associated with each product. The positions may be translated to form an address. A particular product is transmitted to the accumulator unit in the accumulator array 340 that is configured to compute the output activation for the position associated with the product. Taken together, a scatter accumulator array comprising the F×I arbitrated crossbar 335 and the accumulator array 340 is associated with a K_c × W_t × H_t address range. The address space is distributed across the A accumulator units, and each accumulator unit includes a bank of addressable storage and an adder to accumulate a partial sum for the output position (when processing of a tile is complete, the partial sum is an output activation).
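A software analogue of routing products to A accumulator banks is sketched below. The linear address formula and the low-order interleaving of addresses across banks are hypothetical choices for illustration; the text states only that positions may be translated into an address and that the address space is distributed across the A units.

```python
def scatter_add(products, banks, K_c, W_t, H_t):
    """Route each product to one of A accumulator banks and add it to a partial sum.

    products: list of (value, (k, x, y)) with positions in the K_c x W_t x H_t space
    banks:    list of A dicts, each mapping a linear address to a partial sum
    """
    A = len(banks)
    for value, (k, x, y) in products:
        assert 0 <= k < K_c and 0 <= x < W_t and 0 <= y < H_t
        address = (k * H_t + y) * W_t + x   # hypothetical linear address
        bank = address % A                  # hypothetical interleaving across banks
        banks[bank][address] = banks[bank].get(address, 0.0) + value


banks = [dict() for _ in range(4)]          # A = 4 accumulator units
scatter_add([(6.0, (0, 2, 2)), (1.0, (0, 2, 2)), (-3.0, (1, 1, 2))],
            banks, K_c=2, W_t=4, H_t=4)
```

The two products that share position (0, 2, 2) land in the same bank and are summed. The sketch handles products one at a time; in hardware, products that target the same bank in the same cycle would have to be arbitrated by the crossbar.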

The PTIS-sparse technique can be implemented with small adjustments to the pseudocode shown in Table 2. Instead of fetching a dense vector, (B) and (D) are modified to fetch compressed-sparse input activations and weights, respectively. In addition, the positions of the non-zero elements in the compressed-sparse form of the data structures are fetched from the corresponding buffers (not shown in Table 2). After the weights, input activations, and positions are fetched, the accumulator buffer (F) is indexed with the output positions computed from the sparse weights and sparse input activations.

In one embodiment, the accumulation unit 245 shown in FIG. 2C and the scatter accumulator array are double buffered, so that products generated for one tile of weights are accumulated in one set of adders within the accumulator array 340 while registers in the accumulator array 340 that store partial products for the previous tile are accessed to resolve halos and encode the resulting output activations into the compressed format. Finally, when the computation for the output channel group has been completed, the accumulator array 340 is drained, the compressed output activations are stored into the output activation buffer 350, and the output coordinates are stored into the index buffer 355.

Table 3 shows pseudocode for the PTIS-sparse dataflow. Referring to FIG. 2A, the layer sequencer 215 controls the memory interface 205 to read the weights once from off-chip DRAM in fully compressed form and transmit the weights to the PEs 210. Within each PE 210, the weights are ordered by tile (i.e., output channel group) (g), then by input channel (c), then by output channel within the tile (k). The per-PE computation using the tile/input channel/output channel ordering is shown in Table 3.

Table 3: Pseudocode for the PTIS-sparse dataflow
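The weight ordering described above (tile g, then input channel c, then output channel k within the tile) can be illustrated with a short sketch. This is not the pseudocode of Table 3; the tuple layout and the use of a sort are assumptions made only to show the resulting order.

```python
K_c = 8  # output channels per tile (output channel group); example value

# Each weight is (absolute output channel, input channel, value); hypothetical layout.
weights = [(9, 0, 0.5), (1, 1, -2.0), (0, 0, 1.5), (8, 1, 3.0), (3, 0, 0.25)]


def order_key(weight):
    k_abs, c, _ = weight
    g = k_abs // K_c   # tile (output channel group)
    k = k_abs % K_c    # output channel relative to the tile
    return (g, c, k)


ordered = sorted(weights, key=order_key)
```

With K_c = 8, the weights for output channels 0, 3, and 1 (tile 0) precede those for channels 9 and 8 (tile 1), and within each tile all weights of input channel 0 come before those of input channel 1.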

Processing Element

Referring to FIG. 3A, as the weights are read from DRAM by the memory interface 205, the weights are transmitted to the PEs 210 and held locally on a per-PE basis in a weight buffer 305. The input activations may be read from DRAM by the memory interface 205 or transmitted from the output activation buffer 350, and are stored locally on a per-PE basis in an input activation buffer 310.

A state machine within the destination calculation unit 330 operates on the weights and input activations in the order defined by the PTIS-sparse dataflow to produce an output channel group of K_c × W_t × H_t partial sums in the accumulator array 340. First, a vector of F compressed weights and a vector of I compressed input activations are fetched from the weight buffer 305 and the input activation buffer 310, respectively. The vectors are distributed into the F×I multiplier array 325, which computes a form of the Cartesian product of the vectors.

While the vectors are processed by the F×I multiplier array 325 to compute products, the positions of the sparse-compressed weights and activations are processed by the destination calculation unit 330 to compute the output positions associated with the products. The F×I products are delivered to an array of A accumulator units within the accumulator array 340, which are addressed by the output positions. Each accumulator unit within the accumulator array 340 includes an addressable bank of storage, an adder, and a register for storing partial sums associated with the output channel group being processed. When processing of an output channel group is complete, the partial sum stored in each register is the output activation value for one of the output positions. In one embodiment, the accumulator units are double buffered so that one set of registers can store new partial sums while the second set of registers is drained by the post-processing unit 345. When the output channel group is complete, the post-processing unit 345 performs the following tasks: (1) exchanging partial sums with neighboring PEs 210 for the halo regions at the boundary of the output activations of the PE 210, (2) applying the non-linear activation (e.g., ReLU), pooling, and dropout functions, and (3) compressing the output activations into the compressed-sparse form, writing the compressed-sparse output activations into the output activation buffer 350, and writing the output positions associated with the compressed-sparse output activations into the index buffer 355. In one embodiment, the post-processing unit 345 includes a compression engine configured to encode the output activations and output positions into the compressed-sparse form.
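Tasks (2) and (3) of the post-processing step can be sketched in a few lines. The sketch covers only ReLU and compression; halo exchange, pooling, and dropout are omitted, and the representation of the accumulator contents as a dictionary and of the compressed form as two parallel lists is a hypothetical simplification.

```python
def relu_and_compress(partial_sums):
    """Apply ReLU to dense partial sums and keep only the non-zero results.

    partial_sums: dict mapping an output position to its accumulated value
                  (dense contents of the accumulators; may contain zeros and negatives)
    Returns (values, positions): parallel lists standing in for the output
    activation buffer and the index buffer.
    """
    values, positions = [], []
    for pos in sorted(partial_sums):
        activated = max(0.0, partial_sums[pos])   # ReLU
        if activated != 0.0:                      # drop zeros when compressing
            values.append(activated)
            positions.append(pos)
    return values, positions


sums = {(0, 0, 0): 1.5, (0, 1, 0): -2.0, (0, 2, 0): 0.0, (0, 0, 1): 4.0}
values, positions = relu_and_compress(sums)
```

Of the four dense partial sums, only the two positive ones survive ReLU, which illustrates why the output activations become sparse again before being written back in compressed form.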

In one embodiment, the weight buffer 305 is a first-in first-out (FIFO) buffer (WFIFO). The weight buffer 305 should have enough storage capacity to hold all of the non-zero weights for one input channel within one tile (i.e., for the innermost "For" loop in Table 3). When possible, the weights and input activations are held in the weight buffer 305 and the input activation buffer 310, respectively, and are never swapped out to DRAM. If the output activation volume of one neural network layer can serve as the input activation volume for the next neural network layer, then the output activation buffer 350 is logically swapped with the input activation buffer 310 between the processing of the different neural network layers. Similarly, the index buffer 355 is logically swapped with the buffer 320 between the processing of the different neural network layers.

In one embodiment, when the weight buffer 305 within any PE 210 becomes full, the transmission of weight values into the weight buffer 305 is stalled. If the weight buffer 305 is large enough to hold a few input channels of a tile, some PEs 210 can move ahead to the next input channel while one or more other PEs 210 are a few channels behind, which smooths out load imbalance between the PEs 210. In one embodiment, the weight buffer 305 has sufficient storage capacity to hold more than all of the weights in a tile (i.e., an output channel group) in order to smooth out some load imbalance between the PEs 210.

The different logic blocks within the PE 210 may be pipelined as needed to achieve a target clock rate. However, the pipeline registers between pipeline stages need to be able to freeze when the logic block that receives the data output by the pipeline registers is stalled. Alternatively, elastic buffers can be used between the pipeline stages to simplify the distribution of a ready signal indicating that data can be accepted.

In one embodiment, the weight buffer 305 is a FIFO buffer that includes a tail pointer, a channel pointer, and a head pointer. The layer sequencer 215 controls the "input" side of the weight buffer 305, pushing weight vectors into the weight buffer 305. The tail pointer is not allowed to advance past the channel pointer. A full condition is signaled when the tail pointer would advance past the channel pointer if another write vector were stored. The buffer 315 may be implemented in the same manner as the weight buffer 305 and is configured to store the positions associated with each weight vector. In one embodiment, the weight buffer 305 outputs a weight vector of F weights {w[0] ... w[F−1]}, and the buffer 315 outputs the associated positions {x[0] ... x[F−1]}. Each position specifies r, s, and k for a weight. The output channel k is encoded relative to the tile. For example, if the tile contains channels 40 to 47, channel 42 is encoded as k = 2, an offset of 2 from 40, the base of the tile.
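The full condition of such a FIFO can be sketched with a circular buffer. The class and method names are hypothetical, and the convention of keeping one slot empty to distinguish full from empty is an assumption of this sketch rather than something stated in the text.

```python
class WeightFifo:
    """Circular FIFO with tail, channel, and head pointers (illustrative only)."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.tail = 0      # next slot to be written by the layer sequencer
        self.channel = 0   # first weight vector of the channel still in use
        self.head = 0      # next weight vector to be read

    def is_full(self):
        # Writing one more vector would make the tail reach the channel pointer,
        # overwriting weights that may still be re-read for the current channel.
        return (self.tail + 1) % len(self.slots) == self.channel

    def push(self, vector):
        if self.is_full():
            return False   # the sender must stall
        self.slots[self.tail] = vector
        self.tail = (self.tail + 1) % len(self.slots)
        return True


fifo = WeightFifo(capacity=4)
accepted = [fifo.push(v) for v in ("w0", "w1", "w2", "w3")]
fifo.channel = 1               # the consumer has finished with the first vector
accepted_after = fifo.push("w3")
```

The fourth push is refused because the tail would reach the channel pointer; once the channel pointer advances, space is released and the push succeeds. The channel pointer, not the head pointer, bounds the writer because the head pointer is repeatedly rolled back to the start of the current channel.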

The destination calculation unit 330 controls the head and channel pointers (HeadPtr and ChannelPtr) of the weight buffer 305 and the buffer 315 to sequence the computation of a tile. The input activation buffer 310 and the buffer 320 may be a set of registers or an SRAM configured to store the input activations and the positions associated with each input activation value. The destination calculation unit 330 also controls a pointer (IAPtr) into the input activation buffer 310 and the buffer 320 to sequence the computation of a tile. The sequence implemented by the destination calculation unit 330 corresponds to the three inner loops of the pseudocode shown in Table 3. Pseudocode for the operation of the destination calculation unit 330 is shown in Table 4. ScatterAdd is a function that transmits the products to the A accumulator units within the accumulator array 340.

Table 4: Pseudocode for sequencing the computations for a tile

Although the pseudocode shown in Table 4 is several lines long, each iteration of the inner loop takes a single cycle, and the overhead of incrementing the counters and testing the loop bounds takes place in parallel. Therefore, the F*I multiplier array 325 performs F×I multiplications (of values and positions) in each processing cycle until the weight buffer 305 runs empty or the F*I arbitrated crossbar 335 signals that it cannot accept further inputs. When processing is not stalled, the destination calculation unit 330 increments the head pointers each processing cycle, outputting another vector of F weights (and associated positions) each processing cycle. The destination calculation unit 330 continues to increment the head pointer in every processing cycle in which processing is not stalled, until the next increment would pass the end of the current channel (i.e., pass the channel pointer). When the end of the current channel is reached, the destination calculation unit 330 advances the IAPtr, and the head pointer is rolled back (reset) to the start of the current channel. The IAPtr is then used to read the next vector of I input activations, and the rolled-back head pointer is used to read the first vector of F weights. The destination calculation unit 330 then sequences all of the weights for another vector of input activations to produce another vector of products. When the last vector of input activations for channel c has been processed, the destination calculation unit 330 advances to channel c + 1 by setting the channel pointer to point to the first weight vector of channel c + 1.
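The issue order described above can be sketched in Python. This is a minimal illustration and not the pseudocode of Table 4: the function name `sequence_tile`, the list-of-lists layout of the per-channel vectors, and the omission of stalls are assumptions made for the example.

```python
def sequence_tile(weights_by_channel, acts_by_channel):
    """Yield (channel, weight_vector, activation_vector) in issue order.

    weights_by_channel[c] is a list of F-wide weight vectors for channel c;
    acts_by_channel[c] is a list of I-wide activation vectors for channel c.
    Stalls from the crossbar are not modeled (assumption).
    """
    channels = zip(weights_by_channel, acts_by_channel)
    for c, (w_vecs, a_vecs) in enumerate(channels):
        for ia_ptr in range(len(a_vecs)):        # IAPtr advances once per pass
            for head_ptr in range(len(w_vecs)):  # head pointer advances every cycle
                yield c, w_vecs[head_ptr], a_vecs[ia_ptr]
            # Leaving the inner loop corresponds to rolling the head pointer
            # back to the start of the current channel.


# Channel 0 has two weight vectors and two activation vectors; channel 1 has one of each.
order = list(sequence_tile(
    [[["w0"], ["w1"]], [["w2"]]],
    [[["a0"], ["a1"]], [["a2"]]],
))
```

Each yielded triple stands for one processing cycle: all weight vectors of a channel are replayed for every activation vector of that channel before the next channel starts.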

At the end of an input channel, not all F weights or I activations need be valid. Invalid activations are indicated by a value of zero and do not result in a request to the ScatterAdd function. The end of an input channel c is identified by counting. The number of weights and activations for each input channel is determined by a count of the non-zero elements for the channel. At the start of the channel, IACnt and WCnt are initialized to the number of I-wide or F-wide entries for the channel. IACnt and WCnt are decremented after each vector is consumed and checked for zero to determine the end of the channel. To avoid losing a processing cycle reading IACnt and WCnt for a channel, the counts are kept in a pair of separate small RAMs, one for the weight counts and one for the IA counts (not shown in FIG. 3A).

Position conversion to an accumulator address

FIG. 3B illustrates two 3 × 3 weight kernels and positions, according to one embodiment. A first set of weights for k = 1 has the non-zero elements a, b, and c, and a second set of weights for k = 2 has the non-zero elements d, e, and f. The (r, s, k) format encodes the positions of the non-zero weights as the following position vector:
(2, 0, 1), (0, 1, 1), (1, 2, 1), (0, 1, 2), (2, 1, 2), (1, 2, 2)

As multiplication is performed on the "value" component of each (value, position) pair, the destination calculation unit 330 performs a vector addition on the positions, yielding an (x, y, k) position (e.g., output coordinates) for the resulting product. Specifically, for each product, the x coordinates associated with the weight and input activation positions are summed, and the y coordinates associated with the weight and input activation positions are summed, to produce the (x, y, k) position of the resulting product. For example, summing the first position in the weight position vector with a set of four positions of non-zero input activations, (7, 3), (12, 3), (20, 3), and (24, 3), produces the product position vector (9, 3, 1), (14, 3, 1), (22, 3, 1), and (26, 3, 1).
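The position addition in this example can be reproduced with a short Python sketch. The helper name `product_positions` is hypothetical, and the code assumes, as the worked numbers suggest, that r is added to x and s is added to y while k is carried through unchanged.

```python
def product_positions(weight_pos, act_positions):
    """Add one weight position (r, s, k) to each activation position (x, y)."""
    r, s, k = weight_pos
    return [(x + r, y + s, k) for (x, y) in act_positions]


acts = [(7, 3), (12, 3), (20, 3), (24, 3)]
out = product_positions((2, 0, 1), acts)
# out == [(9, 3, 1), (14, 3, 1), (22, 3, 1), (26, 3, 1)]
```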

The destination calculation unit 330 then linearizes the coordinates of the output position to produce an accumulator address that is output to the F*I arbitrated crossbar 335. Table 5 is pseudocode for the operations performed in the F*I multiplier array 325 and the destination calculation unit 330.

Table 5: Pseudocode for product and position computations

The "forall" in Table 5 implies that all P iterations of the inner loop are performed in parallel, in a single cycle. After the output position of each product p[t] has been computed in (x, y, k) form, the output position is linearized into an accumulator address p[t].a according to the formula:

p[t].a = p[t].x + p[t].y*max_x_oa + p[t].k*max_x_oa*max_y_oa (1)

Note that max_x_oa is typically greater than max_x_ia by one less than R, the width of the convolution kernel (max_x_weight); that is, max_x_oa = max_x_ia + R - 1. Similarly, max_y_oa is typically greater than max_y_ia by one less than S, the height of the convolution kernel (max_y_weight). max_x_oa and max_y_oa refer to the dimensions of the halo. Continuing the previous example, the output position vector (9, 3, 0), (14, 3, 0), (22, 3, 0), and (26, 3, 0) is converted into 105, 110, 118, and 122, assuming an output tile has max_x_oa = 32.
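Formula (1) and the numbers of this example can be checked with a few lines of Python. The function name `linearize` is hypothetical, and the value max_y_oa = 32 is an assumption chosen only to make the call complete; it does not affect the result here because k = 0 for every position.

```python
def linearize(x, y, k, max_x_oa, max_y_oa):
    """Linearize an (x, y, k) output position into an accumulator address, per formula (1)."""
    return x + y * max_x_oa + k * max_x_oa * max_y_oa


positions = [(9, 3, 0), (14, 3, 0), (22, 3, 0), (26, 3, 0)]
addresses = [linearize(x, y, k, max_x_oa=32, max_y_oa=32) for (x, y, k) in positions]
# addresses == [105, 110, 118, 122]
```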

The F*I arbitrated crossbar 335 transmits the products to the associated accumulator in the accumulator array 340 based on the product positions. The low-order bits of the linearized accumulator address are used by the F*I arbitrated crossbar 335 to route each product to an accumulator unit in the accumulator array 340, and the product is added by the adder in the accumulator array 340 to a partial sum that is selected by the high-order bits of the address. The operation of the F*I arbitrated crossbar 335 is described in detail in conjunction with FIG. 3C.

When arbitration is performed and two products are associated with the same output position (e.g., address), one of the two products is transmitted by the F*I arbitrated crossbar 335 and stored in an accumulator unit in the accumulator array 340, while the other product, destined for the same accumulator unit, is stalled by the F*I arbitrated crossbar 335. Each accumulator unit may be considered a bank of addressable storage combined with an adder, so that products associated with the same address can be accumulated. In one embodiment, when a product is stalled, output registers in the F*I multiplier array 325 are stalled and the computation of new products stalls. In one embodiment, a FIFO buffer at the output of each multiplier in the F*I multiplier array 325 is used to smooth load imbalance between accumulator units. Performance may improve when the number of banks A is greater than the number of products F*I. In one embodiment, A = 2F*I, where F*I = 16 and A = 32.

After all partial sums for a tile have been computed, the double-buffered accumulator array 340 is switched. The PE 210 may begin processing the next tile using the "primary" accumulator array 340 while the post-processing unit 345 begins post-processing the last tile in parallel using the "secondary" accumulator array 340. The post-processing unit 345 performs the following steps: halo resolution, non-linear function evaluation, and encoding. The adders and registers in the "secondary" accumulator array 340 are also cleared once the encoding process is complete, forcing the partial sums for a subsequent tile to zero.

Scatter-Add

A scatter-add function is performed by a combination of the F*I arbitrated crossbar 335 and the accumulator array 340. The F*I arbitrated crossbar 335 receives F*I = P products and output positions from the F*I multiplier array 325. In one embodiment, the output positions are represented as linear addresses. The products are routed to the adders in the accumulator array 340, with each product routed to a particular adder selected by the linear address associated with the product. In one embodiment, products are routed to the adder through a buffer (e.g., an accumulator unit). The product is then added to the value stored in the register paired with the adder to produce a partial sum. Table 6 is pseudocode for the scatter-add function performed by the F*I arbitrated crossbar 335 and the accumulator array 340.

Table 6: Pseudocode for the scatter-add function
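Functionally, scatter-add amounts to adding each product to the partial sum at its linear address. The following Python sketch illustrates only that behavior and is not a reproduction of Table 6; the name `scatter_add` and the (value, address) tuple layout are assumptions, and the loop runs sequentially where the hardware processes the P products in parallel.

```python
def scatter_add(acc, products):
    """Add each (value, address) product into the partial-sum list acc."""
    for value, address in products:
        acc[address] += value
    return acc


# Two products target address 2, so their values are accumulated together.
acc = scatter_add([0.0] * 8, [(1.5, 2), (2.0, 5), (0.5, 2)])
```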

To simplify addressing of the adders, the number of accumulator units should be a power of 2, A = 2^b, where b is an integer. The low-order bits of the address select the accumulator unit containing Acc[a], and the high-order bits of the address specify an offset within the storage in the accumulator unit. The F*I arbitrated crossbar 335 contains a network to route the values to the appropriate accumulator unit. A should be greater than F*I to reduce contention for the accumulator units and to provide adequate processing throughput. For small values of A, the network may be a single stage of arbitrated multiplexers. For larger values of A, a multi-stage network may be used to reduce wiring complexity. In one embodiment, a FIFO is provided at each input to the F*I arbitrated crossbar 335 to smooth load imbalance between the accumulator units.
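The split of a linear address into a bank selector and an in-bank offset can be illustrated as follows. This is a sketch under the stated condition that A is a power of two; the function name `split_address` is hypothetical.

```python
def split_address(address, num_banks):
    """Return (bank, offset) for a linear accumulator address.

    The low-order b bits select the accumulator unit (bank) and the
    remaining high-order bits give the offset inside that unit.
    """
    if num_banks <= 0 or num_banks & (num_banks - 1):
        raise ValueError("num_banks must be a power of two")
    b = num_banks.bit_length() - 1      # A = 2**b
    bank = address & (num_banks - 1)    # low-order bits
    offset = address >> b               # high-order bits
    return bank, offset


bank, offset = split_address(105, 32)   # 105 = 3 * 32 + 9
```

With A = 32, address 105 maps to bank 9 at offset 3, so consecutive addresses fall into different banks.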

FIG. 3C illustrates a single-stage F*I arbitrated crossbar 335, according to one embodiment. The single-stage F*I arbitrated crossbar 335 includes a FIFO 362, a decoder 364, an arbiter 365, a multiplexer 366, and an OR gate 370. An accumulator unit 368 in the accumulator array 340 is coupled to the output of the multiplexer 366. Head-of-line blocking at the accumulator array 340 can be avoided by using multiple input FIFOs at the accumulator units 368, with each FIFO holding (p, a) pairs for a subset of the accumulator units 368. Only one input and one accumulator unit 368 are shown in FIG. 3C. A complete F*I arbitrated crossbar 335 includes P FIFOs 362, P decoders 364, P OR gates 370, A arbiters 365, and A multiplexers 366 coupled to A accumulator units 368.

The products p[i] are pushed into the FIFO 362. In one embodiment, the FIFO 362 has a depth of 2 or 3. If any one of the FIFOs 362 fills, the F*I arbitrated crossbar 335 becomes not ready and stalls the F*I multiplier array 325. The output of the FIFO 362 consists of a product p[i] and an address a[i]. The product p[i] from input i is connected to the i-th input of the multiplexer 366 at the input to each accumulator unit 368. The low-order bits of the address a[i] are decoded by the decoder 364 into a one-hot request vector r[i][j]. Across all inputs, if r[i][j] is true, input i is making a request for the j-th accumulator unit 368. When the FIFO 362 is empty, the decoder 364 is disabled, so that no requests are asserted. In one embodiment, the selection of the low-order bits of a[i] is replaced with a hash to spread the addresses in the accumulator array 340 across the accumulator units 368 and reduce bank conflicts.

Each accumulator unit 368 in the accumulator array 340 functions as a bank of storage (e.g., a latch or register array) associated with an adder. The requests rq[*][j] from the decoder 364 to the accumulator unit 368 are input to the arbiter 365. The arbiter 365 generates a grant vector gr[*][j] (selecting the winning i for accumulator unit 368 j). Across all accumulator units 368, if bit gr[i][j] of the P×I grant matrix is true, input i has been granted access to accumulator unit 368 j for the next cycle. The grant signals are used both to control the multiplexer 366, selecting the winning product and address from the multiplexer inputs, and to provide an indication back to the FIFO 362, so that the winning product is dequeued from the FIFO 362 at the end of the processing cycle.
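One arbitration cycle of such a crossbar can be modeled in Python. This is a behavioral sketch only: the function name `arbitrate_cycle` is hypothetical, and the fixed-priority policy (lowest input index wins) is an assumption, since the text does not state how the arbiter chooses among competing requests.

```python
from collections import deque


def arbitrate_cycle(fifos, num_banks):
    """Run one cycle; return {bank: (input_index, product, address)} for the winners."""
    grants = {}
    for i, fifo in enumerate(fifos):
        if not fifo:                     # empty FIFO: decoder disabled, no request
            continue
        product, address = fifo[0]       # head-of-line entry makes the request
        bank = address % num_banks       # low-order bits select the accumulator unit
        if bank not in grants:           # lowest input index wins (assumed policy)
            grants[bank] = (i, product, address)
    for i, _, _ in grants.values():
        fifos[i].popleft()               # winners are dequeued at the end of the cycle
    return grants


# Inputs 0 and 1 both target bank 4 (addresses 4 and 36); input 2 is idle.
fifos = [deque([(1.0, 4)]), deque([(2.0, 36)]), deque(), deque([(3.0, 5)])]
grants = arbitrate_cycle(fifos, num_banks=32)
```

After this cycle, input 1 still holds its product and requests bank 4 again in the next cycle, which is the stall behavior the FIFOs are meant to absorb.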

FIG. 3D illustrates the accumulator unit 368, according to one embodiment. The accumulator unit 368 includes a flip-flop 382, a storage array 380, and an adder 385. The address output of the multiplexer 366 is used to select a latch or register from the storage array 380 for output to the adder 385. The storage array 380 stores partial sums and is read using the address a'[i]. The product p'[i] received by the accumulator unit 368 is summed with the partial sum stored in the storage array 380 at the location associated with the address a'[i]. As shown in FIG. 3D, the address a'[i] is latched by a flip-flop 382 and thereby delayed by one clock cycle to be used as a write address for storing the sum output by the adder 385. In other embodiments, a'[i] may be delayed by more than one clock cycle before the sum generated by the adder 385 is written, accumulating the product into the partial sum.
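The read, add, and delayed write-back can be modeled cycle by cycle. The class below is an illustrative sketch: the names `AccumulatorUnit`, `pending`, and `cycle` are hypothetical, and performing the write-back before the read within a cycle is an assumption made so that back-to-back products for the same address see the updated partial sum.

```python
class AccumulatorUnit:
    """Behavioral model of one bank of partial sums with an adder."""

    def __init__(self, depth):
        self.storage = [0.0] * depth   # storage array of partial sums
        self.pending = None            # (write_address, sum) held for one cycle

    def cycle(self, incoming=None):
        """Advance one clock; incoming is (product, address) or None."""
        # Write back the sum computed in the previous cycle, using the
        # address that was delayed by one clock.
        if self.pending is not None:
            addr, total = self.pending
            self.storage[addr] = total
            self.pending = None
        # Read the partial sum, add the product, and hold the result.
        if incoming is not None:
            product, addr = incoming
            self.pending = (addr, self.storage[addr] + product)


unit = AccumulatorUnit(depth=4)
unit.cycle((1.5, 2))   # read 0.0, compute 1.5
unit.cycle((0.5, 2))   # write 1.5, then read it and compute 2.0
unit.cycle()           # write 2.0
```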

Head-of-line blocking at the accumulator array 340 can be avoided by using multiple input FIFOs at the accumulator unit 368, with each FIFO holding (p, a) pairs for a subset of the accumulator units 368. In one embodiment, a separate FIFO is provided at each input to each adder 385 for each of the accumulator units 368 (i.e., virtual output queueing is employed at the outputs of the accumulator unit 368). A drawback of the single-stage F*I arbitrated crossbar 335 shown in FIG. 3C is its complex wiring, because there is a direct path from every product input to every accumulator unit 368, resulting in P × A paths. For example, with P = 16 and A = 32, there are 512 paths, each carrying a product, an address, a request, and a returning grant. The wiring complexity can be reduced by factoring the scatter-add function.

FIG. 3E illustrates a two-stage F*I arbitrated crossbar 380, according to one embodiment. Although the two-stage F*I arbitrated crossbar 335 is described for P = 16 and A = 32, other values of P and A may be used with two or more stages. A first stage has 4 instances of the single-stage F*I arbitrated crossbar 335 with P = 4 and A = 8. A second stage has 8 instances of the single-stage F*I arbitrated crossbar 335 with P = 4 and A = 4. Each of the stages requires 128 direct paths. The number of stages can be increased to reduce the number of direct paths. In one embodiment, FIFOs are included at the intermediate stages of a multi-stage arbitrated crossbar. However, if all arbitration can be completed in one processing cycle, FIFOs at the intermediate stages do not necessarily provide any benefit in terms of processing throughput.
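The path counts follow from multiplying the number of crossbar instances by the inputs and outputs of each instance. A small Python check, with the helper name `crossbar_paths` chosen only for this illustration:

```python
def crossbar_paths(instances, inputs, outputs):
    """Direct paths in a stage built from identical single-stage crossbars."""
    return instances * inputs * outputs


single_stage = crossbar_paths(1, 16, 32)   # one crossbar with P = 16, A = 32
stage_one = crossbar_paths(4, 4, 8)        # 4 instances with P = 4, A = 8
stage_two = crossbar_paths(8, 4, 4)        # 8 instances with P = 4, A = 4
# single_stage == 512, stage_one == 128, stage_two == 128
```

The two stages together need 256 direct paths, half of what a single stage with P = 16 and A = 32 requires.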

The energy of accessing the accumulator array 340 can be reduced by combining products associated with the same output position. In one embodiment, to maximize the probability of combining, the products are buffered at the accumulator units 368 in a combining buffer (e.g., a FIFO with 8 entries), and the products are accumulated into the partial sum only when the combining buffer becomes full. Addresses of arriving products are compared with entries in the combining buffer, and when the address of an arriving product matches the address of a stored product, the arriving product is summed with the stored product. In one embodiment, the combining buffers have multiple write ports, allowing two or more arriving products to be inserted into the combining buffer simultaneously.
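A combining buffer of this kind can be sketched in Python. The class name `CombiningBuffer`, the dictionary-based lookup, and the policy of draining every entry once the buffer is full are assumptions for illustration; the text states only that accumulation happens when the buffer becomes full.

```python
class CombiningBuffer:
    """Merge products that share an address before touching the accumulator."""

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.entries = {}              # address -> combined product

    def push(self, product, address, acc):
        """Insert a product; drain into acc first if the buffer is full."""
        if address in self.entries:
            self.entries[address] += product   # combine on address match
            return
        if len(self.entries) == self.capacity:
            self.drain(acc)                    # assumed policy: drain everything
        self.entries[address] = product

    def drain(self, acc):
        """Accumulate every buffered entry into the partial sums in acc."""
        for addr, value in self.entries.items():
            acc[addr] += value
        self.entries.clear()


acc = [0.0] * 4
buf = CombiningBuffer(capacity=2)
buf.push(1.0, 0, acc)
buf.push(2.0, 0, acc)   # combined with the entry for address 0
buf.push(1.0, 1, acc)   # buffer is now full
buf.push(4.0, 2, acc)   # forces a drain before insertion
```

In this run the accumulator is written twice during the drain, although three products were buffered, because the two products for address 0 were merged first.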

Post-processing

The post-processing unit 345 performs three functions: halo resolution, nonlinear function evaluation, and encoding of the sparse output activations. In one embodiment, the accumulator array 340 is double-buffered. The three functions are performed on a completed tile of output activations in the secondary accumulator array 340 while a current tile of output activations is being computed in the primary accumulator array 340.

The number of operations that the post-processing unit 345 must perform is relatively small compared with the FxI multiplier array 325. The FxI multiplier array 325 executes a six-deep nested loop (over x, y, r, s, c, k), whereas the post-processing unit 345 executes only a three-deep nested loop (over x, y, k). A post-processing unit 345 that performs one operation per cycle should therefore keep pace with an FxI multiplier array 325 that performs 16 operations per cycle. In one embodiment, the post-processing unit 345 is implemented using a microcontroller or a state machine.

The pseudocode for halo resolution is shown in Table 7.

Table 7: Pseudocode for halo resolution

The pseudocode shown in Table 7 iterates over the eight halo regions. Each region is described by a 7-tuple that is loaded from a region descriptor table. The 7-tuple includes the x and y ranges of the halo region in the source PE 210 (x1:x2, y1:y2). The 7-tuple includes the x and y offsets (xo, yo) that translate a position in this PE 210 to a position in the destination PE 210. (The offsets are signed values.) Finally, the 7-tuple includes the neighbor number of the destination PE 210. The function linearAddress converts (x, y, k) into a linear accumulator address according to:

linearAddress(x, y, k) = x + y*max_x_oa + k*max_x_oa*max_y_oa    (2)
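Because the body of Table 7 is not reproduced here, the following Python sketch shows one way the described loop could look. It is a reconstruction from the prose, not the patent's own pseudocode: the names `resolve_halo` and `send`, the inclusive treatment of the ranges x1:x2 and y1:y2, and the flat list used for the accumulators are all assumptions.

```python
def linear_address(x, y, k, max_x_oa, max_y_oa):
    # Equation (2): flatten (x, y, k) into a linear accumulator address.
    return x + y * max_x_oa + k * max_x_oa * max_y_oa


def resolve_halo(accumulators, descriptors, num_channels, max_x_oa, max_y_oa, send):
    """Walk every halo region and forward its values to the destination PE.

    descriptors: iterable of 7-tuples (x1, x2, y1, y2, xo, yo, pe).
    send: callback (pe, destination_address, value); hypothetical interface.
    """
    for x1, x2, y1, y2, xo, yo, pe in descriptors:
        for x in range(x1, x2 + 1):          # ranges assumed inclusive
            for y in range(y1, y2 + 1):
                for k in range(num_channels):
                    src = linear_address(x, y, k, max_x_oa, max_y_oa)
                    dst = linear_address(x + xo, y + yo, k, max_x_oa, max_y_oa)
                    send(pe, dst, accumulators[src])


# Tiny example: a 4 x 4 x 1 accumulator tile with one two-element halo region.
sent = []
resolve_halo(
    accumulators=list(range(16)),
    descriptors=[(0, 0, 1, 2, 2, 0, (-1, 0))],
    num_channels=1,
    max_x_oa=4,
    max_y_oa=4,
    send=lambda pe, dst, value: sent.append((pe, dst, value)),
)
print(sent)
```

The three nested loops over x, y, and k match the three-deep loop structure attributed to the post-processing unit.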

Consider an example in which R × S = 3 × 3 convolutions are performed on input activations of dimensions 50 × 50 × c, and the result is a set of output activations of dimensions 52 × 52 × |h|. The halo consists of eight regions: four edges and four corners. The eight region descriptors for this case are shown in Table 8.

Table 8: Halo region descriptors for R = S = 3 and W = H = 50

Region        x1   x2   y1   y2   xo    yo    PE
Left          0    0    1    51   51    0     (-1, 0)
Top           1    51   0    0    0     51    (0, -1)
Right         52   52   1    51   -51   0     (1, 0)
Bottom        1    51   52   52   0     -51   (0, 1)
Upper-Left    0    0    0    0    51    51    (-1, -1)
Upper-Right   52   52   0    0    -51   51    (1, -1)
Lower-Right   52   52   52   52   -51   -51   (1, 1)
Lower-Left    0    0    52   52   51    -51   (-1, 1)

In the example, the Left region specifies a source range of (0, 1:51), an offset of (51, 0), and a PE 210 with coordinates (-1, 0) relative to the current PE 210. The offset yields the destination range (51, 1:51). The post-processing unit 345 uses the descriptor to read the accumulator array 340 in the PE 210, walk the left edge, and send value-position pairs to the neighboring PE 210 to the left (-1, 0). The neighboring PE 210 processes the value-position pairs in the same way as value-position pairs coming from the FxI multiplier array 325, except that the value-position pairs are input to the secondary accumulator units 368. Additional input ports are provided on the FxI arbitrated crossbar 335 to route the value-position pairs from each of the neighboring PEs 210 to the secondary accumulator units 368. The PEs 210 at the edges and corners of the PE array in the SCNN accelerator 200 are missing 3 (edge) or 5 (corner) neighbors. The descriptors for the missing neighbors are marked invalid, which causes the post-processing unit 345 to skip halo resolution for the nonexistent neighbors.
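The Left example and the rule about missing neighbors can be reproduced from the Table 8 values. The sketch below is illustrative only: the dictionary layout, the function names, and the PE grid dimensions passed to `valid_descriptors` are hypothetical, and validity is modeled simply as the neighbor lying inside the grid.

```python
# Table 8 as data: (x1, x2, y1, y2, xo, yo, (pe_dx, pe_dy)).
HALO_DESCRIPTORS = {
    "Left": (0, 0, 1, 51, 51, 0, (-1, 0)),
    "Top": (1, 51, 0, 0, 0, 51, (0, -1)),
    "Right": (52, 52, 1, 51, -51, 0, (1, 0)),
    "Bottom": (1, 51, 52, 52, 0, -51, (0, 1)),
    "Upper-Left": (0, 0, 0, 0, 51, 51, (-1, -1)),
    "Upper-Right": (52, 52, 0, 0, -51, 51, (1, -1)),
    "Lower-Right": (52, 52, 52, 52, -51, -51, (1, 1)),
    "Lower-Left": (0, 0, 52, 52, 51, -51, (-1, 1)),
}


def destination_range(descriptor):
    """Apply the signed offset to the source range: returns (x1, x2, y1, y2)."""
    x1, x2, y1, y2, xo, yo, _ = descriptor
    return (x1 + xo, x2 + xo, y1 + yo, y2 + yo)


def valid_descriptors(pe_x, pe_y, grid_w, grid_h):
    """Keep only descriptors whose neighbor PE exists in a grid_w x grid_h array."""
    valid = {}
    for name, descriptor in HALO_DESCRIPTORS.items():
        dx, dy = descriptor[6]
        if 0 <= pe_x + dx < grid_w and 0 <= pe_y + dy < grid_h:
            valid[name] = descriptor
    return valid


left_destination = destination_range(HALO_DESCRIPTORS["Left"])
corner_missing = 8 - len(valid_descriptors(0, 0, grid_w=4, grid_h=4))
edge_missing = 8 - len(valid_descriptors(1, 0, grid_w=4, grid_h=4))
print(left_destination, corner_missing, edge_missing)
```

For the Left region this gives x = 51 and y = 1:51, matching the destination range (51, 1:51), and the counts of skipped descriptors come out as 5 for a corner PE and 3 for an edge PE.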

After halo resolution has been completed for a PE 210 and all of its immediate neighbors, the post-processing unit 345 scans the accumulator array 340 and applies a nonlinear function to each output activation in the tile. The pseudocode for the nonlinear function is shown in Table 9.

Table 9: The nonlinear function

The pseudocode shown in Table 9 iterates over the non-halo region of the accumulator array 340. The non-halo region includes all of the accumulator units in the accumulator array 340 that are not part of an edge or corner region. In the previous example, the non-halo region is (1:51, 1:51). The most commonly used nonlinear function is the rectified nonlinear function (ReLU), which converts negative values to zero, but other functions (such as sigmoid) can also be used. Some functions can be approximated as piecewise linear functions. In one embodiment, positive values below a predetermined threshold are set to zero and negative values above a predetermined threshold are set to zero.
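Since the body of Table 9 is not reproduced here, the following Python sketch illustrates the scan it describes. It is a reconstruction under assumptions: the function names, the inclusive ranges, the flat accumulator list addressed with equation (2), and the two-threshold interface of `clamp_small_to_zero` are all chosen for illustration.

```python
def relu(value):
    # Rectification: negative values become zero.
    return value if value > 0 else 0


def clamp_small_to_zero(value, pos_threshold, neg_threshold):
    """Zero out positive values below pos_threshold and negative values above
    neg_threshold; all other values pass through unchanged."""
    if 0 < value < pos_threshold or neg_threshold < value < 0:
        return 0
    return value


def apply_nonlinearity(acc, x_range, y_range, num_channels, max_x_oa, max_y_oa, fn=relu):
    """Apply fn in place to every accumulator in the non-halo region."""
    (x1, x2), (y1, y2) = x_range, y_range
    for k in range(num_channels):
        for y in range(y1, y2 + 1):          # ranges assumed inclusive
            for x in range(x1, x2 + 1):
                addr = x + y * max_x_oa + k * max_x_oa * max_y_oa
                acc[addr] = fn(acc[addr])
    return acc


# A 4 x 4 x 1 tile of negative values whose interior (1:2, 1:2) is rectified.
tile = [-1.0] * 16
apply_nonlinearity(tile, (1, 2), (1, 2), num_channels=1, max_x_oa=4, max_y_oa=4)
print([i for i, v in enumerate(tile) if v == 0])
```

Passing a different `fn`, for example a lambda wrapping `clamp_small_to_zero` with fixed thresholds, models the thresholded embodiment without changing the scan.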

After the nonlinear function has been applied to the tile in the secondary registers of the accumulator array 340, the tile is encoded to compress the non-zero elements. The pseudocode for the compression operation is shown in Table 10.

Table 10: Pseudocode for the compression operation

The pseudocode shown in Table 10 walks the accumulator array 340 one channel of output activations at a time and writes a (value, position) entry to the output activation buffer 350 and the index buffer 355 for each non-zero output activation value. The function "encode" encodes the position relative to the last position using one of the methods described below. Note that "encode" emits one or more "dummy" values (a non-zero entry having a value of zero) when the difference between the current position (x, y) and "lastNZPos" cannot be encoded directly. After each channel has been processed, the number of non-zero values in that channel (nzCount) is stored in a separate table. When outputs are encoded, the OAptr addresses individual (value, position) entries in the output activation buffer 350 and the index buffer 355. After all tiles in a layer of the neural network have been completed, the output activation buffer 350 and the index buffer 355 swap roles with the input activation buffer 310 and the buffer 320, respectively, and the next layer of the neural network is processed. When the output activation buffer 350 and the index buffer 355 are switched, the IAptr reads four vectors of (value, position) at a time.
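The body of Table 10 is not reproduced here, so the sketch below shows one plausible shape for the compression pass. Several points are assumptions rather than statements from the text: positions are encoded as the count of zeros since the previous entry, the scan is row-major within a channel, the position state restarts for each channel, dummy entries are counted in nzCount, and the names `compress_tile` and `max_gap` are hypothetical.

```python
def compress_tile(tile, max_gap=15):
    """Compress tile[k][y][x] into value and index streams, one channel at a time.

    Returns (values, indices, nz_counts); values and indices stand in for the
    output activation buffer and the index buffer respectively.
    """
    values, indices, nz_counts = [], [], []
    for channel in tile:
        zeros = 0    # zeros seen since the last emitted entry (models lastNZPos)
        count = 0    # entries written for this channel (models nzCount)
        for row in channel:
            for value in row:
                if value == 0:
                    zeros += 1
                    continue
                while zeros > max_gap:
                    # Gap too large to encode directly: emit a dummy entry whose
                    # value is zero; it covers max_gap zeros plus its own slot.
                    values.append(0)
                    indices.append(max_gap)
                    zeros -= max_gap + 1
                    count += 1
                values.append(value)
                indices.append(zeros)
                zeros = 0
                count += 1
        nz_counts.append(count)
    return values, indices, nz_counts


small = compress_tile([[[0, 3, 0], [0, 0, 4]], [[0, 0, 0], [0, 0, 0]]])
long_gap = compress_tile([[[0] * 16 + [5]]])
print(small, long_gap)
```

The second call shows the dummy mechanism: a run of 16 zeros exceeds the largest directly encodable gap, so a zero-valued entry is written ahead of the real value.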

To increase parallelism beyond a single PE 210, multiple PEs 210 are operated in parallel, each working on a separate three-dimensional tile of input activations. Because activations are compressed end to end, both the input and output activations of each tile can be stored locally to the PE 210 that processes the tile, further reducing energy-hungry data transfers. Overall, the SCNN accelerator 200 provides efficient compressed storage and delivery of input operands to the FxI multiplier array 325 and high reuse of the input operands in the FxI multiplier array 325, and it spends no processing cycles on multiplications with zero-valued operands.

Compressed-sparse weights and activations

To reduce the energy consumed by zero-valued weights and input activations, the architecture of the SCNN 200 exploits sparse weights and activations. A dense encoding of sparse weights and activations is used between various levels of the memory hierarchy and between various logic circuits within the SCNN 200 to reduce the bandwidth needed to transfer the weight and activation values from memory to the SCNN 200. Input data, such as weights and activations that include zero values, can be represented in a compact form referred to as the compressed-sparse format. The amount by which the input data can be compressed increases with the number of zeros. Even when only 10% of the multi-bit elements are zero, it may be worthwhile to encode the input data in the compressed-sparse format. Encoding the sparse weights and/or activations reduces the data footprint, which allows larger matrices to be stored in a memory structure of a given size, such as the input activation buffer 235 and the weight buffer 230. In one embodiment, the weight buffer 230 and the input activation buffer 235 each carry a 10-bit overhead for each 16-bit value to encode the multi-dimensional positions of non-zero elements in the compressed-sparse format.

FIG. 4A illustrates a flowchart of a method 400 for processing compressed-sparse data in the SCNN 200 according to one embodiment. Although the method 400 is described in the context of a processing element in the SCNN 200, the method 400 may also be performed by a program, by custom circuitry, or by a combination of custom circuitry and a program. Furthermore, persons skilled in the art will understand that any system that performs the method 400 is within the scope and spirit of embodiments of the present invention.

In step 405, compressed-sparse data is received for input to the PE 210, where the compressed-sparse data encodes non-zero elements and corresponding multi-dimensional positions. In one embodiment, the compressed-sparse data represents weight values. In another embodiment, the compressed-sparse data represents input activation values.

In step 410, the non-zero elements are processed in parallel by the PE 210 to produce a plurality of result values. In one embodiment, the non-zero elements are multiplied in the FxI multiplier array 325 to produce result values that are products. In step 415, the corresponding multi-dimensional positions are processed in parallel to produce a destination address for each result value in the plurality of result values. In one embodiment, the multi-dimensional positions are processed in the destination calculation unit 330 to produce, for each of the result values, a destination accumulator address associated with a location in the accumulator array 340. More specifically, the destination accumulator address may indicate a location in the storage array 380 (i.e., a bank) within an accumulator unit 368. In step 420, each result value is transmitted to the accumulator unit 368 that is associated with the destination address for the result value. In one embodiment, each result value is a product that is transmitted through the FxI arbitrated crossbar 335 to one of the accumulator units 368 according to the corresponding destination address.
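Steps 410 through 420 can be pictured as a Cartesian product followed by a scatter-add. The Python sketch below is a sequential software model of that idea, not the parallel hardware: the function name `process_pe` is hypothetical, the dictionary stands in for the accumulator array, and the rule that combines positions, (x + r, y + s, k), is an assumption, because the exact combination is not given in this passage.

```python
def process_pe(weights, activations):
    """Multiply every non-zero weight by every non-zero activation and
    accumulate each product at its destination position.

    weights: list of (value, (r, s, k)); activations: list of (value, (x, y)).
    """
    accumulators = {}
    for w_value, (r, s, k) in weights:
        for a_value, (x, y) in activations:
            product = w_value * a_value                  # step 410: multiply
            destination = (x + r, y + s, k)              # step 415: assumed rule
            accumulators[destination] = (                # step 420: accumulate
                accumulators.get(destination, 0) + product
            )
    return accumulators


result = process_pe(
    weights=[(2, (0, 0, 0)), (3, (1, 0, 0))],
    activations=[(5, (0, 0)), (7, (1, 0))],
)
print(result)
```

Only non-zero operands appear in the two input lists, so no multiplication by zero is ever performed; two products land on the same destination and are summed there.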

In one embodiment, the SCNN 200 uses a simple compressed-sparse encoding approach based on a run-length encoding scheme. A data vector can be extracted from the compressed-sparse encoded data, where the data vector is a sequence of non-zero values. An index vector can be extracted from the compressed-sparse encoded data, where the index vector is a sequence of zero counts (the number of zeros between successive non-zero elements). For example, a compressed-sparse encoding of the data shown in FIG. 3B is (a, b, c, d, e, f) and (2, 0, 3, 4, 1, 1), representing a data vector and a corresponding index vector, where each element of the index vector is the number of zeros preceding the corresponding non-zero element.
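The data-vector and index-vector pairing can be sketched in a few lines of Python. The function names are hypothetical, and the dense sequence printed for the (a, b, c, d, e, f) example is simply what this decoding rule produces; the actual layout of FIG. 3B is not shown here, and trailing zeros after the last non-zero element are not represented by the two vectors.

```python
def rle_encode(dense):
    """Split a dense sequence into non-zero values and preceding-zero counts."""
    data, index, zeros = [], [], 0
    for value in dense:
        if value == 0:
            zeros += 1
        else:
            data.append(value)
            index.append(zeros)
            zeros = 0
    return data, index


def rle_decode(data, index):
    """Rebuild the dense sequence up to and including the last non-zero value."""
    dense = []
    for value, zeros in zip(data, index):
        dense.extend([0] * zeros)
        dense.append(value)
    return dense


dense = rle_decode(list("abcdef"), [2, 0, 3, 4, 1, 1])
print(dense)
print(rle_encode(dense))
```

Encoding the decoded sequence returns the original pair of vectors, which confirms that the two functions are inverses for sequences that end in a non-zero element.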

Determining the coordinates of a location in the accumulator array 340 for each product output by a multiplier in the FxI multiplier array 325 requires reading the index vectors for F and I and combining the index vectors with the coordinates of the portion of the output activation space that is currently being processed. Four bits per index allow up to 15 zeros to appear between any two non-zero elements. When more than 15 zeros occur between two non-zero elements, a zero-value placeholder (i.e., a zero pad) is inserted as an intervening non-zero element without incurring any noticeable degradation in compression efficiency. With an expected non-zero element density of 30%, there are on average approximately two zeros between non-zero elements.
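The figures in this paragraph can be sanity-checked with a simple probability model. The sketch below assumes that each element is non-zero independently with the given density, so that zero runs are geometrically distributed; this independence assumption and the function names are not from the text.

```python
def mean_zero_run(density):
    """Expected number of zeros before the next non-zero element."""
    return (1 - density) / density


def prob_run_exceeds(density, max_run=15):
    """Probability that more than max_run zeros precede the next non-zero
    element, i.e. that a zero-value placeholder would be needed."""
    return (1 - density) ** (max_run + 1)


density = 0.30
print(f"mean zeros between non-zero elements: {mean_zero_run(density):.2f}")
print(f"fraction of gaps needing a placeholder: {prob_run_exceeds(density):.4f}")
```

Under this model the mean gap at 30% density is a little over two zeros, and well under 1% of gaps are longer than 15, which is in line with the statement that placeholders cost little compression efficiency.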

While the SCNN 200 operates most efficiently when the activations fit in the input activation buffers 235, large networks may require activations to be saved to and restored from DRAM through the memory interface 205. The SCNN 200 may therefore use a tiling approach that operates on a 2D subset of the activation space at a time. The DRAM accesses needed to read one tile of input activations can be hidden by pipelining the read operations with the computation of the previous tile of output activations. Similarly, weights can be read from DRAM at tile granularity.

In one embodiment, the weights are encoded in a compressed-sparse format of tiles that include at most K output channels, and the tiles are ordered by input channel. The goal is to maximize the reuse of input activations under the constraint of a fixed number of accumulators (and hence a limit on the number of output channels). The compressed-sparse format allows W weights and their corresponding positions (r, s, k) to be read in parallel for an input channel c. A format in which the weights and positions have fixed locations is therefore desirable, so that incremental decoding is not required. The weight values correspond to a four-dimensional matrix, where x, y, c, and k are the four dimensions. A tile is a slice of a weight data vector with k in {k_1, k_2, ..., k_K}, i.e., arbitrary values of r and s, but with k restricted to a set of K values. A tile can be encoded in a compressed-sparse format that includes K (the number of output channels), k_1, k_2, ..., k_K (the actual numbers of the K output channels), and C (the number of input channels in the tile). For each input channel, the compressed-sparse format includes a delta-coded index c for the input channel (i.e., a difference from the previous input channel) and a count of the non-zero weights in the input channel. For each output channel k, the compressed-sparse format includes three parameters for each non-zero weight in the kernel c_k. A first parameter is the number of zeros between the previous non-zero weight and the current weight. Note that the zeros at the end of one kernel and the beginning of the next kernel are encoded together. A second parameter is the encoded weight value w_xyck, which represents either a binary weight or an index into a codebook.

FIG. 4B illustrates a tile 340 of weight values for two output channels according to one embodiment. In one embodiment, 3 × 3 convolutions can be performed using the tile 340 of weight values over two input channels to produce results for two output channels. The tile 340 of weight values is sparse and can be represented in a compressed-sparse format.

In one embodiment, the tile 340 of weight values is encoded as {2, 1, 2, 4, 0, 6, 1, 3, 4, 4, 1, 5, 0, 6, 3, 7, 3, 8, 0, ...}. The first four symbols indicate the "shape" of the tile: K = 2 with k₁ = 1 and k₂ = 2, and C = 4. The first 0 indicates the first input channel at an offset of 0 from the starting position, c = 0. The following 6 indicates that there are six nonzero weights in the first input channel. The next six symbols are (zero count, weight) pairs that encode the kernel c = 0, k = 1. The 1 indicates that there is one zero before the 3, and the first 4 indicates that there are four zeros between the 3 and the 4. Because the 5 is in the last position for c = 0, k = 1, the 0 after the 5 must begin the encoding of the next channel. The next six symbols encode the kernel c = 0, k = 2. The final 0 indicates that there are no empty channels before the next input channel, so the following symbols encode channel c = 1. The sequence of zero counts gives the number of zeros before the first nonzero weight value and between adjacent pairs of nonzero weight values.
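The pair-wise encoding described above can be illustrated with a minimal Python sketch. It assumes that each 3 × 3 kernel is stored in row-major order as nine positions and that the twelve symbols after the headers split into three pairs per kernel; the function name `expand_kernel` is hypothetical and does not come from the description.

```python
def expand_kernel(pairs, size=9):
    """Expand (zero_count, weight) pairs into a dense kernel.

    ASSUMPTION: the kernel is a flat, row-major list of `size` entries.
    """
    dense = [0] * size
    index = -1  # position of the previous nonzero weight (none yet)
    for zero_count, weight in pairs:
        index += zero_count + 1  # skip the zeros, then land on the weight
        dense[index] = weight
    return dense


# Symbols that follow the tile header (2, 1, 2, 4) and channel header (0, 6).
symbols = [1, 3, 4, 4, 1, 5, 0, 6, 3, 7, 3, 8]
pairs = list(zip(symbols[0::2], symbols[1::2]))

kernel_k1 = expand_kernel(pairs[:3])  # kernel c = 0, k = 1
kernel_k2 = expand_kernel(pairs[3:])  # kernel c = 0, k = 2
```

Under these assumptions the weight 5 lands in the ninth and last position of the first kernel, which agrees with the statement that the symbol after the 5 starts a new kernel.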

After the tile header (2, 1, 2, 4) and the channel header (0, 6) are stripped, the next 12 symbols can be read in parallel as the data vector and the index vector, yielding six weights together with their corresponding positions r, s, k. A running sum is needed to compute a linear index for each weight, and the linear indices are then converted to the position coordinates r, s, k. To make it easier to decode the linear index into the coordinates r and s, r_max may be rounded up to the next power of 2. For example, a 3 × 3 kernel becomes a 3 × 4 (s_max × r_max) kernel with the last column of weights set to zero. In one embodiment, when a running sum is used to compute the linear index, the low two bits form r and the remaining bits form s.

Each position r, s, k for a weight, or each position (x, y) for an input activation, may be computed from the position coordinates of the previous weight or the previous input activation, respectively. The weight position calculation is shown in Table 11, where "value" is the zero count.

Table 11: Pseudocode for the position calculations

A coordinate type of R specifies the zero count, i.e., the number of zeros between the last nonzero element and the current nonzero element. When a running sum in a dimension (e.g., position.r and position.s) exceeds the maximum dimension value for r (r_max), the position can optionally be wrapped: y is incremented and r is reduced by r_max. A coordinate type of S increments the s coordinate of the position by 1 and sets the r position to the value. A coordinate type of K increments the k coordinate of the position, resets s to 0, and sets r to the value. The wrapping procedure is shown in Table 12, where max_r is r_max and max_s is s_max.

Table 12: Pseudocode for the position calculations with wrapping
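The pseudocode of Tables 11 and 12 is not reproduced in this text, so the following Python sketch is reconstructed from the prose alone and should be read as one possible interpretation. It assumes that an R code advances r by the zero count plus one, that decoding starts one step before the first element (r = -1), and that wrapping triggers when r reaches max_r or s reaches max_s. The names `Position`, `advance`, and `decode` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Position:
    r: int = -1  # ASSUMPTION: start one step before the first element
    s: int = 0
    k: int = 0


def advance(pos, ctype, value, max_r=None, max_s=None):
    """Return the position of the next nonzero element.

    Wrapping is applied only when max_r (and max_s) are given.
    """
    r, s, k = pos.r, pos.s, pos.k
    if ctype == "R":
        r += value + 1  # ASSUMPTION: skip `value` zeros, then step once
    elif ctype == "S":
        s += 1
        r = value
    elif ctype == "K":
        k += 1
        s = 0
        r = value
    else:
        raise ValueError(f"unknown coordinate type: {ctype!r}")
    if max_r is not None:
        while r >= max_r:  # wrap r into the next row
            r -= max_r
            s += 1
    if max_s is not None:
        while s >= max_s:  # wrap s into the next kernel
            s -= max_s
            k += 1
    return Position(r, s, k)


def decode(codes, max_r=None, max_s=None):
    """Decode a list of (coordinate type, value) codes into (r, s, k) tuples."""
    pos = Position()
    out = []
    for ctype, value in codes:
        pos = advance(pos, ctype, value, max_r, max_s)
        out.append((pos.r, pos.s, pos.k))
    return out


wrapped = decode([("R", 1), ("R", 4), ("R", 1), ("R", 0)], max_r=3, max_s=3)
unwrapped = decode([("R", 1), ("S", 0), ("S", 0), ("K", 2)])
```

With wrapping enabled, plain zero counts are enough to cross row and kernel boundaries. Without wrapping, the stream has to use S and K codes to advance s and k explicitly.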

Wrapping can potentially lead to a denser encoding by providing more encoding options. However, supporting wrapping requires more complex decoding circuitry to perform the divide and modulo operations. An intermediate option is to perform wrapping but restrict r_max and s_max to powers of 2, which simplifies the divide and modulo to shift and mask operations, respectively. Alternatively, wrapping can be omitted, in which case the appropriate coordinate type is required to advance the s or k coordinate. The coordinates (r, s, k) may be replaced with the coordinates (x, y), omitting k, to perform the position calculations for input activations.

In one embodiment, the weights may be represented as direct 16-bit or 8-bit values, each paired with a variable-bit-width "code" value that is used to index a "codebook" to read the associated zero count. Different codebooks may be used for different tiles. The coordinate type and zero-count value should be encoded in a manner that maximizes encoding efficiency by providing more encodings for the more common coordinate types and zero counts.

FIG. 4C illustrates an encoding scheme for weights and input activations (IA), according to one embodiment. A 4-bit code represents the coordinate type and the zero-count value. Other encoding schemes are possible, and an encoding scheme may use more or fewer than four bits. The weight codes have the coordinate types R, S, and K, while the activation codes have only the coordinate types X and Y. For the weights, a larger number of codes (10) is devoted to the R coordinate type because it is used most often. The increment between values need not be 1. For example, the zero-count values R9 and X9 are omitted to allow more "reach" between nonzero elements. Nine zeros between two nonzero elements can be encoded as R4 (or X4) followed by R4 (or X4), with a zero weight value padded between the two runs of zeros. For the activation encoding, a more aggressive encoding of long runs of zeros may be used, with large increment gaps between codes.
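The padding idea for zero counts that have no code of their own can be sketched as follows. The set of available zero counts in `ALLOWED` is hypothetical, since the table of FIG. 4C is not reproduced here, and the greedy strategy is only one possible choice; the helper names are likewise assumptions.

```python
# HYPOTHETICAL set of zero counts that have a code; 9 is deliberately missing.
ALLOWED = (0, 1, 2, 3, 4, 5, 6, 7, 8, 10)


def split_zero_run(run, allowed=ALLOWED):
    """Split a run of zeros into available zero counts.

    Consecutive counts are separated by one explicitly stored zero-valued
    element, which itself occupies one position of the run.
    """
    if 0 not in allowed:
        raise ValueError("a zero count of 0 must be available")
    counts = []
    remaining = run
    while True:
        step = max(c for c in allowed if c <= remaining)  # greedy choice
        counts.append(step)
        remaining -= step
        if remaining == 0:
            return counts
        remaining -= 1  # the padded zero-valued element uses one position


def covered(counts):
    """Number of zero positions covered by counts plus padded elements."""
    return sum(counts) + len(counts) - 1


greedy_nine = split_zero_run(9)
```

For a run of nine zeros the greedy choice yields the counts 8 and 0 around one padded zero element. The split 4 and 4 mentioned in the text covers the same nine positions, so both are valid encodings.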

When groups of F weights and I input activations are read in each cycle from the weight buffer 305 and the input activation buffer 310, respectively, the position portions of the weights and input activations, read from the buffers 315 and 320, respectively, are decoded from the 4-bit values shown in the table of FIG. 4C into full positions: (x, y) for the activations and (r, s, k) for the weights. As previously explained, the F×I multiplier array 325 accepts F weights and I input activations and produces P = F*I products. Each product is associated with a position that is computed by the destination calculation unit 330. For all product-position pairs, the nonzero weight and input activation values are multiplied in the compressed-sparse format without expansion. The position portion of the compressed-sparse format includes zero counts, which are decoded into (r, s, k) for each weight and (x, y) for each input activation and then added to produce an (x, y, k) position for the corresponding product. The product-position calculation was shown earlier in Table 5.
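As a rough software analogue of the F×I multiplier array, the sketch below forms all pairwise products of nonzero weights and nonzero input activations and attaches a position to each. Table 5 is not part of this passage, so the position rule (x + r, y + s, k) is taken from the prose here and is an assumption. The function name `multiply_with_positions` is hypothetical.

```python
from itertools import product


def multiply_with_positions(weights, activations):
    """Form all F*I products of nonzero values with their positions.

    weights:     list of (value, (r, s, k))
    activations: list of (value, (x, y))
    Returns a list of (product, (x_out, y_out, k)).
    ASSUMPTION: positions are combined by adding r to x and s to y.
    """
    results = []
    for (w, (r, s, k)), (a, (x, y)) in product(weights, activations):
        results.append((w * a, (x + r, y + s, k)))
    return results


weights = [(3, (1, 0, 0)), (4, (0, 2, 0))]   # F = 2 nonzero weights
activations = [(2, (1, 1)), (5, (2, 1))]     # I = 2 nonzero activations
products = multiply_with_positions(weights, activations)
```

No zero operand ever enters the loop, which mirrors the statement that the values are multiplied in compressed-sparse form without expansion.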

FIG. 4D illustrates weight values for four 3 × 3 convolution kernels 435, according to one embodiment. FIG. 4E illustrates an encoding 440 of the positions for the weight values in the four 3 × 3 convolution kernels 435, according to one embodiment. The first row of the encoding 440 comprises a stream of 12 codes, one for each nonzero weight value in the four 3 × 3 convolution kernels 435. The positions in the first row of the encoding 440 are encoded with wrapping and r_max = s_max = 3. The first S1 corresponds to the zero in the upper left position, which is followed by a 3. The S4 that follows the first S1 corresponds to the zero in the first row following the 3 and to the three zeros in the second row of the first convolution kernel. A second S1, following the S4, corresponds to the one zero in the third row of the first convolution kernel between the 4 and the 5. The second S1 is followed by two S0 codes, which correspond to the absence of zeros between the 5 and the 6 and between the 6 and the 7 in the first row of the second convolution kernel. The two S0 codes are followed by an S5, which corresponds to the five zeros before the 8 in the third row of the second convolution kernel. The remaining codes are derived in a similar manner.

The second row of the encoding 440 shown in FIG. 4E gives the positions of the nonzero weight values in the four 3 × 3 convolution kernels 435. The positions may be determined from the codes in the first row. Starting from an initial position of (0, 0, 0), the first S1 is decoded into the position (r, s, k) = (1, 0, 0) shown in the second row, which corresponds to the position of the weight value 3 in the first row of the first convolution kernel. The first S4 is decoded into the position (r, s, k) = (0, 2, 0) shown in the second row, which corresponds to the position of the weight value 4 in the third row of the first convolution kernel. The remaining positions may be derived in a similar manner.

In one embodiment, linear indices are derived for the codes in the top row of the encoding 440 by computing a running sum that starts at -1 and adds 1 for each weight value along with the zero-count value. Extracting the zero counts from the top row yields {1, 4, 1, 0, 0, 5, 2, 1, 1, 1, 4, 1}. If r_max is set to 4 rather than 3 (for a convolution kernel that is 4 × 3 instead of 3 × 3), the zero counts become {1, 6, 1, 0, 0, 7, 3, 2, 1, 2, 6, 1}. The zero counts are then converted to a running sum that starts at -1, with 1 added at each position for each of the corresponding weights. The running sum, which is the linear index L_i of the zero counts C_i, is {1, 8, 10, 11, 12, 20, 24, 27, 29, 32, 39, 41}, where L_i = L_{i-1} + C_i + 1 and L_0 = -1. The linear index is then converted to the position coordinates (r, s, k).
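The recurrence L_i = L_{i-1} + C_i + 1 with L_0 = -1 can be checked with a few lines of Python. This is only an illustrative sketch; the function name `linear_indices` is hypothetical.

```python
def linear_indices(zero_counts, start=-1):
    """Running sum L_i = L_{i-1} + C_i + 1, starting from L_0 = start."""
    indices = []
    current = start
    for count in zero_counts:
        current += count + 1  # skip the zeros, then step onto the weight
        indices.append(current)
    return indices


# Zero counts for r_max = 4, as listed in the text.
zero_counts_rmax4 = [1, 6, 1, 0, 0, 7, 3, 2, 1, 2, 6, 1]
indices = linear_indices(zero_counts_rmax4)
```

Applied to the zero counts given for r_max = 4, the function reproduces the running sum listed in the text.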

When r_max is set to 4 (or another power of 2), r can be extracted simply by splitting off the two least significant bits. A division by 3 is needed to separate k and s from the remaining bits. The division can be avoided by rounding the kernel dimensions up to 4 × 4 (or another power of 2 in each dimension), although the compressed-sparse encoding may then be less dense because of the extra zeros. Note that the k coordinate in the position (r, s, k) is not the absolute address of the output channel, but rather the temporary address of the accumulator currently holding the output channel. The positions extracted from the linear index are shown in the second row of the encoding 440.
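The two decoding variants can be sketched in Python as follows. The bit layout (r in the low bits, then s, then k) follows the prose, while the function names and default parameters are assumptions made for illustration.

```python
def decode_linear_index(index, r_bits=2, s_max=3):
    """Split a linear index into (r, s, k) for r_max = 2**r_bits.

    r comes from the low bits; separating s and k needs a division
    by s_max when s_max is not a power of 2.
    """
    r = index & ((1 << r_bits) - 1)  # mask off the low bits
    rest = index >> r_bits           # remaining bits hold s and k
    k, s = divmod(rest, s_max)
    return r, s, k


def decode_linear_index_pow2(index, r_bits=2, s_bits=2):
    """Same split when both kernel dimensions are rounded up to powers
    of 2 (e.g. 4 x 4), so that only shifts and masks are needed."""
    r = index & ((1 << r_bits) - 1)
    s = (index >> r_bits) & ((1 << s_bits) - 1)
    k = index >> (r_bits + s_bits)
    return r, s, k


first = decode_linear_index(1)   # linear index of the first weight
second = decode_linear_index(8)  # linear index of the second weight
```

For the linear indices 1 and 8, the first variant returns (1, 0, 0) and (0, 2, 0), the positions that the text gives for the weight values 3 and 4. The second variant trades the division for a sparser index space.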

FIG. 4F illustrates a block diagram 450 for determining the weight coordinates (r, s, k), according to one embodiment. Assume that r_max is small (smaller than the maximum zero count) and is therefore rounded up to a power of 2, so that the r and s coordinates can be handled as a single field rs, with the low bits being r and the high bits being s. An adder 425 sums a zero count z_i and 1 with rs_{i-1} to produce a tentative value rs_i. The operation that separates k and s does not actually require a divide; it can instead be performed using a running divide technique. At each step of computing the running sum, the tentative value rs_i can be compared with rs_max = r_max*s_max. If the sum is greater than or equal to rs_max, rs_max is subtracted from the tentative value rs_i and k is incremented. The running divide technique may be used to separate r and s when r_max is not rounded up to the next power of 2.

A max subtract 455 subtracts rs_max from the tentative value rs_i output by the adder 425 and determines whether the result is positive, as indicated by the signal pos output by the max subtract 445. If the result is positive, the result of the subtraction is retained and selected for output as rs_i by a multiplexer 460. If the result is not positive, the multiplexer 460 selects the tentative value rs_i for output as rs_i. An incrementer 455 receives k_{i-1} and increments k_{i-1} to update the output k_i when the result is positive. Note that if rs_max is smaller than the maximum zero count, it may be necessary to compare against 2*rs_max and other multiples. However, when rs_max is small, in one embodiment rs_max may be rounded up to the next power of 2, and a running sum should be computed on a combined field krs.
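The running divide technique can be modeled in software as shown below. This is a behavioral sketch rather than a description of the circuit: the `while` loop stands in for the comparisons against rs_max and its multiples, the wrap condition uses "greater than or equal" as stated for the running sum, and the names `running_divide` and `split_rs` are hypothetical.

```python
def running_divide(zero_counts, rs_max, rs_start=-1, k_start=0):
    """Track (rs, k) across a stream of zero counts without dividing.

    Each step adds zero_count + 1 to rs, then subtracts rs_max and
    increments k for as long as rs is at least rs_max.
    """
    rs, k = rs_start, k_start
    out = []
    for z in zero_counts:
        tentative = rs + z + 1       # role of the adder
        while tentative >= rs_max:   # role of the max subtract and mux
            tentative -= rs_max
            k += 1                   # role of the incrementer
        rs = tentative
        out.append((rs, k))
    return out


def split_rs(rs, r_bits=2):
    """Split the combined field rs into (r, s) when r_max = 2**r_bits."""
    return rs & ((1 << r_bits) - 1), rs >> r_bits


zero_counts = [1, 6, 1, 0, 0, 7, 3, 2, 1, 2, 6, 1]  # r_max = 4 counts from the text
linear = [1, 8, 10, 11, 12, 20, 24, 27, 29, 32, 39, 41]
rs_max = 4 * 3  # r_max * s_max
tracked = running_divide(zero_counts, rs_max)
```

Every tracked pair satisfies rs + k*rs_max = L_i for the linear indices listed earlier in the text, so the subtract-and-increment loop yields the same result as dividing each linear index by rs_max.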

In one embodiment, the encoding for input activations is the same as for weights, except that the (r, s) coordinates are replaced with (x, y) coordinates and the k coordinate is omitted. However, the size of an input activation slice may be considerably larger. For a high-definition (HD) image size of 1920 × 1080 pixels distributed over an 8 × 8 array of PEs 210, each PE 210 holds a 240 × 135 slice. At the other extreme, a deep convolutional layer may be only 14 × 14, with an x_max of just 1 or 2. When the sizes are too large to be rounded up to powers of 2, the running divide technique may be used for input activations to separate x, y, and k.

FIG. 4G illustrates a block diagram 470 for determining the input activation coordinates (x, y), according to one embodiment. The calculation used for the input activation coordinates is similar to the calculation of the weight coordinates, except that (1) there is no k field and the positions all come from the same input channel c; and (2) the x coordinate is compared with x_max at each step and, if needed, x_max is subtracted. For input activations, x_max can become large, making it costly to round up to the next power of 2.

An adder 475 sums a zero count t_i and 1 with x_{i-1} to produce a tentative value x_i. A max subtract 485 subtracts x_max from the tentative value x_i output by the adder 475 and determines whether the result is positive, as indicated by the signal pos output by the max subtract 485. If the result is positive, the result of the subtraction is retained and selected for output as x_i by a multiplexer 480. If the result is not positive, the multiplexer 480 selects the tentative value x_i for output as x_i. An incrementer 490 receives y_{i-1} and increments y_{i-1} to update the output y_i when the result is positive.

Note that the input activation coordinate system is tied to the halo, so that for a 3 × 3 convolution kernel the actual input activations begin at (1, 1). Once the positions (r, s, k) of the weights and the positions (x, y) of the input activations have been computed by the destination calculation unit 330, the destination calculation unit 330 sums the r and x coordinates and sums the s and y coordinates to compute the output activation positions in the form (x, y, k). The destination calculation unit 330 then converts the output activation positions into a linear accumulator address according to:

address_i = x + y * x_{max_halo} + k * x_{max_halo} * y_{max_halo}

Note that x_{max_halo} and y_{max_halo} refer to the dimensions of the halo and that (x, y, k) is the output activation position. The values multiplied by y and k may be rounded up, if necessary, to reduce the cost of the multiplication. However, rounding up may increase the cost of the accumulators in terms of extra operations that are not needed.
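The address formula and the effect of rounding up the multipliers can be illustrated with the sketch below. The second function rounds x_{max_halo} and y_{max_halo} individually up to powers of 2 so that the multiplications become shifts; this is one possible reading of the rounding described above, and all function names are hypothetical.

```python
def accumulator_address(x, y, k, x_max_halo, y_max_halo):
    """Linear accumulator address for output position (x, y, k)."""
    return x + y * x_max_halo + k * x_max_halo * y_max_halo


def pow2_shift(n):
    """Smallest shift amount whose power of 2 is >= n (for n >= 1)."""
    return (n - 1).bit_length()


def accumulator_address_shift(x, y, k, x_max_halo, y_max_halo):
    """Same mapping with both dimensions rounded up to powers of 2.

    ASSUMPTION: each dimension is rounded separately, which replaces
    the multiplications by shifts but leaves gaps in the address space.
    """
    x_shift = pow2_shift(x_max_halo)
    y_shift = pow2_shift(y_max_halo)
    return x + (y << x_shift) + (k << (x_shift + y_shift))


exact = accumulator_address(2, 3, 1, 6, 6)
rounded = accumulator_address_shift(2, 3, 1, 6, 6)
```

For a 6 × 6 halo the exact form needs 36 accumulator addresses per output channel, while the rounded form spreads them over 64, which illustrates the cost trade-off noted in the text.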

FIG. 5A illustrates a non-linear encoding scheme 505 for input activation zero-count values, according to one embodiment. One technique for reducing the cost of rounding x_max (or xy_max) up to the next power of 2 is to space the zero-count symbols non-linearly. The first row of the encoding scheme 505 is the 4-bit code, and the second row is the corresponding zero-count value. The first eight codes encode the zero counts 0 to 7 linearly, as described in conjunction with FIG. 4C. The next eight codes, however, encode larger, non-linear zero-count values (e.g., 12, 16, 24, 32, 48, 64, 96, and 128) in order to "jump around" in the large empty region created by rounding up x_max. For example, if x_max is 129 and is rounded up to 256, it may be necessary to jump by 128.

FIG. 5B illustrates another encoding scheme 510 for input activation zero-count values, according to one embodiment. The encoding scheme 510 allows the zero count to specify that the x coordinate should be set to the specified value and that the y coordinate should be incremented. As in the encoding scheme 505, the first eight codes specify the zero counts 0 to 7. The next eight codes, of the form Yn, instruct the destination calculation unit 330 to increment the y coordinate and set the x coordinate to x = n. With this form of encoding, there is no need to first convert to a linear index. The zero-count codes can be converted directly to (x, y).

FIG. 5C illustrates another encoding scheme 515 for input activation zero-count values, according to one embodiment. Because Y is not incremented in most cases, it makes sense to have more "normal" codes than "Y-increment" codes. The encoding scheme 515 therefore includes 11 codes that allow the zero count to specify that the x coordinate should be set to the specified value, and 5 codes for incrementing the y coordinate.

FIG. 5D illustrates another coding scheme 520 for weight zero-count values, according to one embodiment. For weights, the non-zero values are encoded in a three-dimensional space r, s, k, so r_max can be rounded up to the next power of 2, and a skip to the next channel k is encoded with distinct zero-count values. Coding scheme 520 lets the zero count specify that the r-coordinate should be set to the specified value and that the k-coordinate should be incremented. The first 14 code values specify zero counts 0 to 13. The last two code values, of the form Kn, instruct the destination compute unit 330 to set rs to 0 and skip to the next output channel k.

FIG. 5E illustrates another coding scheme 525 for weight zero-count values, according to one embodiment. Coding scheme 525 lets the zero count specify that the r-coordinate should be set to the specified value and that either the s-coordinate or the k-coordinate should be incremented. The first 10 code values specify zero counts 0 to 9. The next three code values, of the form Sn, instruct the destination compute unit 330 to set r to zero and increment s. The last two code values, of the form Kn, instruct the destination compute unit 330 to set r and s to zero and increment k.

As illustrated in FIGS. 5A-5E, the compressed-sparse format may encode the non-zero weights and input activations as dense vectors of values along with a skip-encoded (i.e., non-linear) vector of code values representing the positions, where z is the value. In one embodiment, one or more code values may specify one of (i) add z + 1 to the last coordinate, wrapping in r, s and/or k as appropriate (r = r + z + 1, wrap), (ii) skip to the next row (s = s + 1, x = z, wrap), or (iii) skip to the next channel (k = k + 1, s = 0, r = z, wrap).
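The three kinds of code values listed above can be illustrated with a small decoder for weight positions. This is a sketch under stated assumptions: the codes are represented as hypothetical `(op, z)` pairs rather than packed bit patterns, the names `decode_weight_positions`, `r_max` and `s_max` are illustrative, the start state is r = -1, s = 0, k = 0, and the "x = z" of item (ii) is treated as the within-row coordinate r.

```python
from typing import Iterable, List, Tuple


def decode_weight_positions(
    codes: Iterable[Tuple[str, int]], r_max: int, s_max: int
) -> List[Tuple[int, int, int]]:
    """Decode (op, z) codes into (r, s, k) positions of non-zero weights."""
    r, s, k = -1, 0, 0  # assumed start: just before the first weight
    positions: List[Tuple[int, int, int]] = []
    for op, z in codes:
        if op == "skip":    # (i) r = r + z + 1
            r += z + 1
        elif op == "row":   # (ii) s = s + 1, within-row coordinate = z
            s += 1
            r = z
        elif op == "chan":  # (iii) k = k + 1, s = 0, r = z
            k += 1
            s = 0
            r = z
        else:
            raise ValueError(f"unknown op: {op}")
        # Wrap: overflow in r carries into s, overflow in s carries into k.
        s += r // r_max
        r %= r_max
        k += s // s_max
        s %= s_max
        positions.append((r, s, k))
    return positions
```

For a 3 x 3 filter (`r_max = s_max = 3`), the codes `[("skip", 0), ("skip", 3), ("row", 2), ("skip", 0), ("chan", 1)]` decode to `[(0, 0, 0), (1, 1, 0), (2, 2, 0), (0, 0, 1), (1, 0, 2)]`. The fourth code shows a wrap that carries from r through s into k.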

The previously described efforts to exploit sparsity in CNN accelerators have focused on reducing energy or saving time, which invariably also saves energy. Eliminating the multiplication when an input operand is zero, by gating an operand input to a multiplier, is a natural way to save energy. Gating an operand saves energy but does not reduce the number of processing cycles. The SCNN accelerator 200 also saves energy by eliminating all unnecessary multiplications; when any input operand is zero, the circuitry is not even prepared to perform a multiplication, which saves time as well.
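The difference between gating and skipping can be made concrete with a toy count of multiplier slots. The model below is an assumption made purely for illustration: every weight is paired with every activation, each pairing occupies one slot, and the function names are hypothetical. It does not model the actual timing of any accelerator.

```python
from typing import Sequence


def slots_with_gating(weights: Sequence[float], activations: Sequence[float]) -> int:
    """Gating: zero operands are masked, but every pairing still takes a slot."""
    return len(weights) * len(activations)


def slots_with_skipping(weights: Sequence[float], activations: Sequence[float]) -> int:
    """Skipping: only pairings of two non-zero operands are ever issued."""
    nonzero_weights = sum(1 for w in weights if w != 0)
    nonzero_activations = sum(1 for a in activations if a != 0)
    return nonzero_weights * nonzero_activations
```

With weights `[0, 2, 0, 3]` and activations `[1, 0, 0, 4, 0]`, gating still occupies 20 slots, whereas skipping issues only 4 multiplications. In this toy model both approaches perform the same 4 useful multiplications, so the energy spent on arithmetic is similar, but only skipping shortens the schedule.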

Another approach to reducing energy is to reduce data transfer costs when the data is sparse. The input activations may be compressed for transfer to and from DRAM, saving energy (and time) by reducing the number of DRAM accesses. However, conventional systems expand the compressed input activations before loading them into an on-chip buffer, so the input activations are stored in expanded form. Consequently, there are no savings on transfers from one internal buffer to another internal buffer or to the multipliers. In contrast, the SCNN accelerator 200 uses a compressed representation for all data coming from DRAM and maintains the compressed representation in the on-die buffers. The SCNN accelerator 200 thus keeps both the weights and the activations in compressed form in both DRAM and the internal buffers. This saves data transfer time and energy on all data transfers and allows the chip to hold larger models for a given amount of internal memory.
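One way to picture the compressed form that is kept in DRAM and in the on-die buffers is the data-vector and index-vector layout recited in the claims: each index entry gives the number of zeros preceding the corresponding data element, and a zero placeholder is inserted when a run of zeros exceeds the largest count the index can express. The sketch below assumes a linear index with a default limit of 15 (a 4-bit code); the function names and the explicit `length` argument for restoring trailing zeros are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def compress(values: Sequence[int], max_zeros: int = 15) -> Tuple[List[int], List[int]]:
    """Return (data, index): kept values and the zero count preceding each."""
    data: List[int] = []
    index: List[int] = []
    zeros = 0
    for v in values:
        if v == 0:
            if zeros == max_zeros:
                # Run too long for one index entry: store this zero as a placeholder.
                data.append(0)
                index.append(max_zeros)
                zeros = 0
            else:
                zeros += 1
        else:
            data.append(v)
            index.append(zeros)
            zeros = 0
    return data, index  # trailing zeros are implied by the known vector length


def decompress(data: Sequence[int], index: Sequence[int], length: int) -> List[int]:
    """Rebuild the dense vector of the given length from (data, index)."""
    out: List[int] = []
    for v, z in zip(data, index):
        out.extend([0] * z)
        out.append(v)
    out.extend([0] * (length - len(out)))
    return out
```

For example, with `max_zeros=3` the vector `[0, 0, 5, 0, 0, 0, 0, 7, 0]` compresses to data `[5, 0, 7]` and index `[2, 3, 0]`; the middle data element is a placeholder for the fourth zero of the long run. Because nothing in this layout requires expansion before use, the same vectors can be passed between buffers unchanged.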

The SCNN accelerator 200 exploits sparsity in both the weights and the activations by means of the PTIS-sparse dataflow (planar-tiled, input-stationary, sparse dataflow). The PTIS-sparse dataflow enables the SCNN accelerator 200 to use a novel Cartesian-product-based computation architecture that maximizes reuse of weights and input activations within a set of distributed PEs 210. In addition, the PTIS-sparse dataflow allows the use of a dense compressed-sparse encoding for both the weights and the activations, which is used almost exclusively throughout the processing flow. The amount of data transferred within the SCNN accelerator 200 is reduced, and the amount of on-die storage capacity is effectively increased. Results show that, for an equivalent area, the SCNN accelerator 200 architecture achieves higher energy efficiency than an energy-optimized dense architecture when the weights and activations are each less than 85% dense. On three contemporary networks, the SCNN accelerator 200 architecture achieved performance improvements over the dense architecture by a factor of 2.6 while still being more energy-efficient by a factor of 2.5.
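The Cartesian-product computation can be sketched in software as follows. Every non-zero weight is multiplied by every non-zero input activation, the two positions are combined into an output position, and each product is added into the accumulator for that position. This is a minimal sketch: the function name, the tuple layout, and the use of plain coordinate addition (as the claims word it) are assumptions, and tile edges, strides and input channels are ignored.

```python
from typing import Dict, List, Tuple

Weight = Tuple[int, Tuple[int, int, int]]   # (value, (r, s, k))
Activation = Tuple[int, Tuple[int, int]]    # (value, (x, y))


def cartesian_product_pe(
    weights: List[Weight], activations: List[Activation]
) -> Dict[Tuple[int, int, int], int]:
    """Accumulate all weight x activation products by output position (k, x, y)."""
    accumulators: Dict[Tuple[int, int, int], int] = {}
    for w, (r, s, k) in weights:
        for a, (x, y) in activations:
            # Combine the positions; simple addition is an assumed convention.
            out_pos = (k, x + r, y + s)
            # Route the product to the accumulator for that position.
            accumulators[out_pos] = accumulators.get(out_pos, 0) + w * a
    return accumulators
```

With weights `[(2, (0, 0, 0)), (3, (1, 0, 1)), (4, (2, 1, 0))]` and activations `[(5, (0, 0)), (7, (2, 1))]`, the six products land in five accumulators: `{(0, 0, 0): 10, (0, 2, 1): 34, (1, 1, 0): 15, (1, 3, 1): 21, (0, 4, 2): 28}`. Position `(0, 2, 1)` receives two products (14 and 20), which is the case the accumulator array has to handle. No zero operand ever reaches the inner loop, because only non-zero values are stored.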

Exemplary System

FIG. 6 illustrates an exemplary system 600 in which the various architectures and/or functionalities of the various previous embodiments may be implemented. As shown, a system 600 is provided that includes at least one SCNN accelerator 200 connected to a communication bus 602. The communication bus 602 may be implemented using any suitable protocol, such as PCI (Peripheral Component Interconnect), PCI-Express, AGP (Accelerated Graphics Port), HyperTransport, or any other bus or point-to-point communication protocol. The system 600 also includes a main memory 604. Control logic (software) and data are stored in the main memory 604, which may take the form of random access memory (RAM).

The system 600 also includes a central processor 601 (e.g., a CPU), input devices 612, a graphics processor 606, and a display 608, i.e., a conventional CRT (cathode ray tube), LCD (liquid crystal display), LED (light emitting diode) display, plasma display, or the like. User input may be received from the input devices 612, e.g., a keyboard, mouse, touchpad, microphone, and the like. In one embodiment, the graphics processor 606 may include a plurality of shader modules, rasterization modules, etc. Each of the foregoing modules may be situated on a single semiconductor platform to form a graphics processing unit (GPU).

In the present description, a single semiconductor platform may refer to a sole unitary semiconductor-based integrated circuit or chip. Note that the term single semiconductor platform may also refer to multi-chip modules with increased connectivity that simulate on-chip operation and offer substantial improvements over a conventional CPU and bus implementation. Of course, the various modules may also be situated separately or in various combinations of semiconductor platforms, as the user desires.

The system 600 may also include a secondary storage 610. The secondary storage 610 includes, for example, a hard disk drive and/or a removable storage drive, representing a floppy disk drive, a magnetic tape drive, a compact disc drive, a DVD drive, a recording device, or a USB flash memory. The removable storage drive reads from and/or writes to a removable storage unit in a well-known manner.

Computer programs or computer control logic algorithms, input data for the SCNN accelerator 200, output data generated by the SCNN accelerator 200, and the like may be stored in the main memory 604 and/or the secondary storage 610. Such computer programs, when executed, enable the system 600 to perform various functions. The memory 604, the storage 610, and/or any other storage are possible examples of computer-readable media.

In one embodiment, the architecture and/or functionality of the various previous figures may be implemented in the context of the SCNN accelerator 200, the central processor 601, the graphics processor 606, an integrated circuit (not shown) that is capable of at least a portion of the capabilities of one or more of the SCNN accelerator 200, the central processor 601, and the graphics processor 606, a chipset (i.e., a group of integrated circuits designed to work and be sold as a unit for performing related functions, etc.), and/or any other integrated circuit for that matter.

Furthermore, the architecture and/or functionality of the various previous figures may be implemented in the context of a general computer system, a circuit board system, a game console system dedicated to entertainment purposes, an application-specific system, and/or any other desired system. For example, the system 600 may take the form of a desktop computer, a laptop computer, a server, a workstation, a game console, an embedded system, and/or any other type of logic. In addition, the system 600 may take the form of various other devices, including a personal digital assistant (PDA), a mobile phone, a television, and so forth.

Moreover, although not shown, the system 600 may be coupled to a network (e.g., a telecommunications network, a local area network (LAN), a wireless network, a wide area network (WAN) such as the Internet, a peer-to-peer network, a cable network, or the like) for communication purposes.

Although various embodiments have been described above, it should be understood that they have been presented by way of example only, and not limitation. Thus, the breadth and scope of a preferred embodiment should not be limited by any of the above-described exemplary embodiments, but should be defined only in accordance with the following claims and their equivalents.

Claims

A method comprising: acquiring a first vector comprising only non-zero weight values and first associated positions of the non-zero weight values in a first space; acquiring a second vector comprising only non-zero input activation values and second associated positions in a second space; multiplying, in a multiplier array, the non-zero weight values by the non-zero input activation values to produce a third vector of products; combining the first associated positions with the second associated positions to produce a fourth vector of positions, wherein each position in the fourth vector is associated with a corresponding product in the third vector; and transmitting the third vector to an accumulator array, wherein each product in the third vector is transmitted to an adder in the accumulator array that is configured to generate an output activation value at the position associated with the product.

The method of claim 1, further comprising: acquiring a fifth vector comprising only additional non-zero weight values and fifth associated positions of the additional non-zero weight values in the first space; multiplying, in the multiplier array, the additional non-zero weight values by the non-zero input activation values to produce a seventh vector of products; generating an eighth vector of positions, wherein each position in the eighth vector is associated with a corresponding product in the seventh vector of products; and, for each matching position in the fourth vector and the eighth vector, summing the corresponding products in the third vector and the seventh vector by the accumulator array to produce partial sums.

The method of claim 1 or 2, wherein the first space is a three-dimensional space, and wherein the second space is a two-dimensional space.

The method of any one of claims 1 to 3, further comprising transmitting the third vector through an array of buffers in the accumulator array, wherein each of the buffers is coupled to an input of one of the adders in the accumulator array.

The method of any one of claims 1 to 4, further comprising compressing the output activation values to produce a set of vectors that contain only non-zero output activation values.

The method of claim 5, wherein the set of vectors further comprises positions associated with the non-zero output activation values.

The method of claim 2, wherein the second vector has been generated during processing of a first layer of a neural network, and wherein the seventh vector of products is generated during processing of a second layer of the neural network.

The method of claim 1, further comprising transmitting a first product in the third vector from a first accumulator input in the accumulator array to a first adder in the accumulator array, wherein the first product is associated with a first position at an edge of the second space.

The method of claim 1, wherein the combining comprises performing a vector addition that adds coordinates of the first associated positions to coordinates of the second associated positions to generate the fourth vector of positions, wherein each position in the fourth vector is associated with a corresponding product in the third vector.

The method of any one of claims 1 to 9, wherein the second space is divided into two-dimensional tiles, and wherein the multiplier array generates products for one of the two-dimensional tiles in parallel with additional multiplier arrays that produce additional products for the remaining two-dimensional tiles.

The method of claim 10, wherein each of the additional multiplier arrays acquires an additional vector comprising only non-zero input activation values and additional associated positions in another tile of the second space.

The method of claim 10 or 11, wherein the tile is extended, for a number of input channels, into a further dimension of the first space and the second space, and further comprising acquiring additional vectors comprising only non-zero weight values and additional associated positions of the non-zero weight values for each of the number of input channels.

A convolutional neural network accelerator comprising: an array of processing elements, each processing element comprising a multiplier array configured to: acquire a first vector comprising only non-zero weight values and first associated positions of the non-zero weight values in a first space; acquire a second vector comprising only non-zero input activation values and second associated positions in a second space; multiply the non-zero weight values by the non-zero input activation values to produce a third vector of products; combine the first associated positions with the second associated positions to generate a fourth vector of positions, wherein each position in the fourth vector is associated with a corresponding product in the third vector; and transmit the third vector to an accumulator array, wherein each product in the third vector is transmitted to an adder in the accumulator array that is configured to generate an output activation value at the position associated with the product.

The convolutional neural network accelerator of claim 13, wherein the multiplier array is further configured to: acquire a fifth vector comprising only additional non-zero weight values and fifth associated positions of the additional non-zero weight values in the first space; multiply the additional non-zero weight values by the non-zero input activation values to produce a seventh vector of products; generate an eighth vector of positions, wherein each position in the eighth vector is associated with a corresponding product in the seventh vector of products; and, for each matching position in the fourth vector and the eighth vector, sum the corresponding products in the third vector and the seventh vector by the accumulator array to produce partial sums.

The convolutional neural network accelerator of claim 13 or 14, wherein the first space is a three-dimensional space and the second space is a two-dimensional space.

A convolutional neural network accelerator according to any one of claims 13 to 15, wherein the first vector is transmitted to each processing element in the array of processing elements.

The convolutional neural network accelerator of any one of claims 13 to 16, wherein the second space is divided into two-dimensional tiles, and wherein the multiplier array generates products for one of the two-dimensional tiles in parallel with additional multiplier arrays that produce additional products for the remaining two-dimensional tiles.

The convolutional neural network accelerator of claim 17, wherein each of the additional multiplier arrays acquires an additional vector comprising only non-zero input activation values and additional associated positions in another tile of the second space.

The convolutional neural network accelerator of claim 17 or 18, wherein the tile is extended, for a number of input channels, into an additional dimension of the first space and the second space, and wherein additional vectors comprising only non-zero weight values and additional associated positions of the non-zero weight values are acquired for each of the number of input channels.

A system comprising: a memory storing vectors comprising only non-zero weight values and first associated positions of the non-zero weight values in a first space; and a convolutional neural network accelerator coupled to the memory and comprising: an array of processing elements, each processing element comprising a multiplier array configured to: acquire from the memory a first vector of the vectors; acquire a second vector comprising only input activation values and second associated positions in a second space; multiply the non-zero weight values by the non-zero input activation values to produce a third vector of products; combine the first associated positions with the second associated positions to generate a fourth vector of positions, wherein each position in the fourth vector is associated with a corresponding product in the third vector; and transmit the third vector to an accumulator array, wherein each of the products in the third vector is transmitted to an adder in the accumulator array that is configured to generate an output activation value at the position associated with the product.

The method of claim 5 or 6, wherein the set of vectors comprises a data vector containing the non-zero output activation values and a corresponding index vector, and wherein each element in the index vector corresponds to a number of zero values that precede the corresponding element in the data vector.

The method of claim 21, wherein the number of zero values between two elements of the data vector is limited to a predetermined number, and wherein, when the number of zero values between two non-zero elements of the data vector exceeds the predetermined number, a zero-value placeholder is inserted into the data vector.

The method of claim 21 or 22, wherein the index vector uses an N-bit code that encodes the number of zero values, and wherein the N-bit code is non-linear so as to allow a greater number of zero values to precede the corresponding element in the data vector.

A convolutional neural network accelerator according to any one of claims 13-19, wherein the convolutional neural network accelerator is adapted to perform the method of any of claims 1-12 or 21-23.

The system of claim 20, wherein the system is configured to perform the method of any of claims 1-12 or 21-23.