DE102020201182A1

DE102020201182A1 - Hardware-accelerated calculation of convolutions

Info

Publication number: DE102020201182A1
Application number: DE102020201182.6A
Authority: DE
Inventors: Armin Runge; Taha Ibrahim Ibrahim Soliman; Leonardo Luiz Ecco
Original assignee: Robert Bosch GmbH
Current assignee: Robert Bosch GmbH
Priority date: 2020-01-31
Filing date: 2020-01-31
Publication date: 2021-08-05
Also published as: JP2023513064A; EP4097646A1; WO2021151749A1

Abstract

Method (100) for calculating a convolution (4) of an input tensor (1) of input data (1a) with a tensorial convolution kernel (2), wherein
• the convolution kernel (2) is moved (110) through a predetermined grid of positions (21, 22) within the input tensor (1),
• the convolution kernel (2) is applied at each of these positions (21, 22) by forming (120) a sum (3), weighted with the values (2a) of the convolution kernel (2), from the input data (1a) in the region of the input tensor (1) covered by the convolution kernel (2) at its current position (21, 22), and
• this weighted sum (3) is assigned (130) in the convolution (4) to the current position (21, 22) of the convolution kernel (2), wherein
• the weighted sum (3) is calculated (121) with at least one hardware accelerator (5) that has an input memory (51) and a fixed number of multipliers (52), each of which fetches its operands (52a, 52b) from predetermined memory locations (51a-51h) of the input memory (51), and
• in at least one pass of the hardware accelerator (5), more summands are processed (122) than corresponds to a depth (11) of the input tensor (1), wherein
• the assignment between operands (52a, 52b) and memory locations (51a-51h) of the input memory (51) is varied (123) during the calculation of the convolution (4), and/or input data (1a) and/or values (2a) of the convolution kernel (2) are stored (124) multiple times in the input memory (51).

Description

The present invention relates to calculating the convolution of input data with a convolution kernel by means of a hardware accelerator.

State of the art

Convolutional neural networks (CNNs) have become established above all for processing image data and audio data. Evaluation methods for such data that use CNNs have a significantly higher performance potential than methods that are not based on artificial intelligence. However, they also require significantly more computation. Even comparatively small CNNs can consist of several million parameters and require several billion arithmetic operations to process one set of input variables into one set of output variables.

A large share of this effort is taken up by convolutions of data with convolution kernels. In these convolutions, sums of data are calculated that are weighted with weights from the convolution kernels. A very large number of products of data and weights is therefore calculated, and these products are summed. In the broadest sense, inner products of data and convolution kernels are calculated.

Dedicated hardware accelerators, such as inner-product computing units, are increasingly used for this basic task. These units are designed to compute a complete inner product of two vectors of a fixed length in one clock cycle.

Disclosure of the invention

Within the scope of the invention, a method was developed for calculating a convolution of an input tensor of input data with a tensorial convolution kernel.

In this convolution, the convolution kernel is moved through a predetermined grid of positions within the input tensor. The distance between adjacent positions of this grid is also referred to as the "stride". At each of the positions, the convolution kernel is applied by forming a sum, weighted with the values of the convolution kernel, from the input data in the region of the input tensor covered by the convolution kernel at its current position. In the convolution, this weighted sum is assigned to the current position of the convolution kernel.
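As a minimal sketch of the procedure just described (not taken from the application itself), the following pure-Python function moves a kernel through a grid of positions spaced by the stride and forms the weighted sum at each position. The function name `convolve`, the nested-list layout (height × width × depth), and the absence of padding are assumptions made purely for illustration.

```python
def convolve(inp, kernel, stride=1):
    """Direct convolution without padding.

    inp:    nested lists of shape H x W x D (assumed layout)
    kernel: nested lists of shape kh x kw x D
    Returns a 2-D list with one weighted sum per kernel position.
    """
    height, width, depth = len(inp), len(inp[0]), len(inp[0][0])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    # The grid of kernel positions; neighbouring positions are `stride` apart.
    for y in range(0, height - kh + 1, stride):
        row = []
        for x in range(0, width - kw + 1, stride):
            weighted_sum = 0
            # Sum over the region currently covered by the kernel.
            for dy in range(kh):
                for dx in range(kw):
                    for c in range(depth):
                        weighted_sum += inp[y + dy][x + dx][c] * kernel[dy][dx][c]
            # The weighted sum is assigned to the current kernel position.
            row.append(weighted_sum)
        out.append(row)
    return out


# 3 x 3 input of depth 1 with values 1..9, 2 x 2 kernel of ones
inp = [[[1], [2], [3]], [[4], [5], [6]], [[7], [8], [9]]]
kernel = [[[1], [1]], [[1], [1]]]
print(convolve(inp, kernel, stride=1))
```

With a stride of 1 the example yields a 2 × 2 result, one entry per kernel position.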

If, for example, the convolution kernel serves to detect a specific sought feature in the input data, then the weighted sum is greatest for those positions of the convolution kernel at which the sought feature embodied in this convolution kernel best matches the input data. The result of the convolution with a convolution kernel is therefore also referred to as the feature map with respect to this convolution kernel.

The weighted sum is calculated with at least one hardware accelerator. This hardware accelerator has an input memory and a fixed number of multipliers that fetch their operands, here input data and values of the convolution kernel, from predetermined memory locations of the input memory. For example, an inner-product computing unit that calculates the inner product of two vectors of a fixed length typically contains as many multipliers as each vector has elements. All multiplications required for the inner product can thus be carried out simultaneously. The resulting products then only have to be accumulated with adders. Overall, the inner product can be calculated in fewer clock cycles in this way.

When calculating convolutions, such hardware accelerators are usually operated so that each pass processes at most as many summands as corresponds to a depth of the input tensor, that is, the extent of the input tensor in one dimension, measured in number of elements. If, for example, the input data comprise RGB image data, the input tensor has a depth of 3, since three intensity values for red, green and blue are assigned to each image pixel. In the convolution with the convolution kernel, the three intensity values assigned to a given pixel are then always summed, weighted with values of the convolution kernel. This calculation is repeated for all pixels currently covered by the convolution kernel, and all inner products obtained are then added.
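The conventional mode of operation can be sketched as follows. This is an illustrative software model, not the circuit from the application: `accelerator_pass`, `conventional_window_sum`, the unit width `N_MULT = 128` and the 2 × 2 RGB window are all assumptions chosen for the example.

```python
N_MULT = 128  # assumed vector length of the inner-product unit


def accelerator_pass(data, weights, n_mult=N_MULT):
    """Model of one pass: all products at once, then accumulation."""
    if len(data) != len(weights) or len(data) > n_mult:
        raise ValueError("operand vectors must match and fit the unit")
    return sum(d * w for d, w in zip(data, weights))


def conventional_window_sum(window, kernel):
    """One pass per covered pixel, each with only `depth` summands."""
    total, passes = 0, 0
    for pixel, pixel_weights in zip(window, kernel):
        total += accelerator_pass(pixel, pixel_weights)
        passes += 1
    return total, passes


# Four RGB pixels covered by a hypothetical 2 x 2 kernel (depth 3)
window = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
kernel = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
total, passes = conventional_window_sum(window, kernel)
print(total, passes)  # 48 after 4 passes
```

Each of the four passes uses only 3 of the 128 modelled multipliers, which is the inefficiency the method addresses.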

Within the method, this procedure is modified so that, in at least one pass of the hardware accelerator, more summands are processed than corresponds to the depth of the input tensor.

In the RGB image example mentioned above, not only the intensity values and kernel values for a single pixel but also the corresponding data for further pixels can be loaded into the input memory of the hardware accelerator, so that the hardware accelerator ideally operates with a completely filled input memory. Since the pixel-wise intermediate results calculated so far are all added anyway to obtain the final result of the convolution, it makes no difference to the result if the calculation for several, or ideally all, pixels is combined into one pass of the hardware accelerator. The result is, however, delivered considerably faster, because significantly fewer passes of the hardware accelerator are required overall.
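A hedged sketch of this packing idea: the data and weights of all covered pixels are flattened and fed to the unit in chunks of up to `n_mult` summands, so that the same sum is obtained in fewer passes. The function name `packed_window_sum`, the flattening order, and the example values are assumptions for illustration; the sketch deliberately ignores how operands reach the right multipliers in hardware, which is the issue the description turns to next.

```python
def packed_window_sum(window, kernel, n_mult=128):
    """Weighted sum over a kernel window using as few passes as possible.

    window, kernel: lists of pixels, each pixel a list of `depth` values.
    n_mult:         assumed number of multipliers of the unit.
    """
    # Flatten pixel-wise data and weights in the same order.
    data = [v for pixel in window for v in pixel]
    weights = [w for pixel_weights in kernel for w in pixel_weights]
    total, passes = 0, 0
    for start in range(0, len(data), n_mult):
        chunk_d = data[start:start + n_mult]
        chunk_w = weights[start:start + n_mult]
        # One pass: up to n_mult products formed at once, then accumulated.
        total += sum(d * w for d, w in zip(chunk_d, chunk_w))
        passes += 1
    return total, passes


window = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
kernel = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [1, 1, 1]]
print(packed_window_sum(window, kernel))            # one pass of 12 summands
print(packed_window_sum(window, kernel, n_mult=6))  # two passes of 6 summands
```

The total is the same as when each pixel is processed in its own pass; only the number of passes changes.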

This is based on the insight that each pass of the hardware accelerator always takes the same amount of time regardless of the content of the input memory, since all required multiplications are carried out simultaneously.

The time saving is particularly large when a convolution is calculated in a layer of a CNN that has a large lateral extent and at the same time a small depth. For example, the RGB image mentioned above may have Full HD resolution (1,920 × 1,080 pixels) at a depth of only 3. If an inner-product computing unit for vectors with a length of 128 elements is used, the conventional mode of operation carries out only three instead of 128 multiplications per pass of this unit. Almost 98% of the available computing capacity therefore lies idle. With the method proposed here, the hardware accelerator is utilized much better.
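The figure of almost 98% follows directly from the numbers given: 3 of 128 multipliers are busy, so 125/128 ≈ 97.7% are idle. The short calculation below reproduces this; the additional estimate of packing whole pixels into one pass (128 // 3 = 42 pixels) is an assumption for illustration and presumes the kernel covers at least that many pixels.

```python
def utilization(summands_per_pass, n_mult):
    """Fraction of multipliers doing useful work in one pass."""
    return summands_per_pass / n_mult


depth, n_mult = 3, 128

conventional = utilization(depth, n_mult)               # 3 / 128
pixels_per_pass = n_mult // depth                       # whole pixels that fit
packed = utilization(pixels_per_pass * depth, n_mult)   # 126 / 128

print(f"conventional: {conventional:.2%} used, {1 - conventional:.2%} idle")
print(f"packed:       {pixels_per_pass} pixels per pass, {packed:.2%} used")
```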

Because the assignment between the operands of the multipliers and the memory locations in the input memory of the hardware accelerator is usually fixed, merely reorganizing the arithmetic operations is not sufficient.

Only in the special case in which the positions at which the convolution kernel is applied are always spaced apart by the extent of the convolution kernel do the regions of the input tensor processed at positions adjacent in the grid not overlap. Each value in the input tensor then enters the calculation of the convolution for only one position of the filter kernel. In this special case, the input data can therefore be loaded into the input memory of the hardware accelerator according to a rule that is the same for all positions of the convolution kernel, and this mere reorganization is sufficient to obtain the same result as before considerably faster.

In the general case, however, one and the same value of the input data enters the calculation of the convolution several times at different positions of the convolution kernel, and each time it must be weighted with a different value of the convolution kernel. In other words, the value of the input data must sit in the list of input data in the input memory of the hardware accelerator at the same position at which the matching kernel value sits in the list of kernel values in the input memory. The method provides two ways of ensuring this, so that the result of the convolution, previously obtained at very low utilization of the hardware accelerator, can now be reproduced exactly at greatly improved utilization.
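The difference between the special case and the general case can be made concrete by counting, for a one-dimensional toy input, how many kernel positions each input element contributes to. The helper `usage_counts` and the example sizes are assumptions for illustration only.

```python
def usage_counts(length, kernel_size, stride):
    """For each input index, count the kernel positions that cover it."""
    counts = [0] * length
    for start in range(0, length - kernel_size + 1, stride):
        for i in range(start, start + kernel_size):
            counts[i] += 1
    return counts


# Special case: stride equals the kernel extent, so windows do not overlap.
print(usage_counts(6, kernel_size=3, stride=3))

# General case: stride 1, so inner elements are reused with different weights.
print(usage_counts(6, kernel_size=3, stride=1))
```

In the second call an inner element is covered by up to three kernel positions, each time aligned with a different kernel value, which is why a fixed operand-to-memory assignment no longer suffices.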

The first option is to vary, during the calculation of the convolution, the assignment between operands and memory locations of at least one input memory, and/or the assignment between operands and memory locations of at least one parameter memory for values of the convolution kernel. For this purpose, a multiplexer can for example be connected between at least one multiplier and at least one input memory. Alternatively or in combination, a multiplexer can be connected between at least one multiplier and at least one parameter memory for values of the convolution kernel. One and the same multiplexer can also have access both to the at least one input memory and to the at least one parameter memory. In this way, a given operand for a specific multiplication can be read selectively from one of several possible memory locations. The hardware accelerator thus has at least limited random access to the input memory and/or to the parameter memory. This freedom of choice is sufficient to reuse input data that are in the correct place in the input memory for a first position of the convolution kernel also for a second position of the convolution kernel that occurs during the convolution. At the same time, the circuitry required is still considerably less than, for example, for a bus system or a "network on chip". In investigations by the inventors, a 4:1 multiplexer in particular has proven to be an optimal compromise between freedom of choice, and thus efficiency, on the one hand and hardware cost on the other. The multiplexing can be applied to the input data, to the values of the convolution kernel, or to both.
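A software model can illustrate how a small multiplexer in front of each multiplier allows one memory load to serve several kernel positions. The wiring assumed here (multiplier i can read memory locations i, i+1, i+2 or i+3, selected by a common select signal) is a hypothetical choice for a one-dimensional example with stride 1; the application does not specify this particular wiring, and `mux_convolve_1d` is an illustrative name.

```python
def mux_convolve_1d(memory, weights, mux_inputs=4):
    """Model a bank of multipliers, each fed through a mux_inputs:1 multiplexer.

    memory:  input values held in the input memory (loaded once)
    weights: kernel values, one per multiplier (fixed assignment)
    Assumed wiring: multiplier i reads memory[i + select].
    """
    n = len(weights)
    results = []
    for select in range(mux_inputs):
        if select + n > len(memory):
            break  # this select setting would read past the loaded data
        # Every multiplier picks its operand via the shared select signal.
        operands = [memory[i + select] for i in range(n)]
        results.append(sum(a * w for a, w in zip(operands, weights)))
    return results


memory = [1, 2, 4, 8, 16, 32]
weights = [1, 1, 1]
print(mux_convolve_1d(memory, weights))  # four kernel positions, one load
```

Under these assumptions, four kernel positions are evaluated without rewriting the input memory; a wider multiplexer would cover more positions at higher hardware cost.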

The second option, which can be used alternatively or in combination, is to store input data and/or values of the convolution kernel multiple times in the input memory. For example, a separate copy can then be placed in the input memory for each intended use of a given value from the input data in the course of the convolution. For each intended pass of the hardware accelerator, the collection of values from the input data to be processed in that pass can be stored in the input memory in the correct order. When the convolution then advances to the respective pass, these values can be fetched en bloc by the multipliers of the hardware accelerator.
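A minimal sketch of the replication idea for a one-dimensional input: for every planned pass, the values needed in that pass are copied into their own block in the correct order, so each pass reads one contiguous block. The names `replicate_for_passes` and `run_passes` are assumptions for illustration; the layout is similar in spirit to what is commonly called im2col, which the application does not mention by name.

```python
def replicate_for_passes(signal, kernel_size, stride=1):
    """Lay out one block per pass; overlapping values are stored repeatedly."""
    return [signal[s:s + kernel_size]
            for s in range(0, len(signal) - kernel_size + 1, stride)]


def run_passes(blocks, weights):
    """Each pass fetches its block en bloc and forms the weighted sum."""
    return [sum(a * w for a, w in zip(block, weights)) for block in blocks]


signal = [1, 2, 3, 4]
blocks = replicate_for_passes(signal, kernel_size=3)
print(blocks)                          # values 2 and 3 are stored twice
print(run_passes(blocks, [1, 10, 100]))
print(sum(len(b) for b in blocks), "memory cells instead of", len(signal))
```

The price is visible in the last line: the replicated layout needs more memory cells than the original data.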

For example, an input memory with at least one separate memory or memory area, also called a partition or bank, can be chosen for each multiplier. The input data or kernel values that the respective multiplier needs in the course of calculating the convolution can then be loaded into this memory or memory area. When moving from one position of the convolution kernel to the next, only the next value from each partition has to be fetched and fed to the multiplier. This requires no random access of the hardware accelerator to the input memory; it is sufficient, for example, to design each partition as a shift register.
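The per-multiplier partitions can be modelled with one first-in, first-out queue per multiplier, standing in for a shift register. In this hedged one-dimensional sketch, `build_partitions` preloads each queue with exactly the values its multiplier will need, and `run_partitions` advances all queues by one element per kernel position; both names and the example data are illustrative assumptions.

```python
from collections import deque


def build_partitions(signal, kernel_size, stride=1):
    """One queue per multiplier, filled in the order the values are needed."""
    positions = range(0, len(signal) - kernel_size + 1, stride)
    return [deque(signal[p + i] for p in positions)
            for i in range(kernel_size)]


def run_partitions(partitions, weights):
    """Per kernel position, shift one value out of every partition."""
    results = []
    while partitions[0]:
        operands = [q.popleft() for q in partitions]  # no random access needed
        results.append(sum(a * w for a, w in zip(operands, weights)))
    return results


signal = [1, 2, 3, 4, 5]
partitions = build_partitions(signal, kernel_size=3)
print([list(q) for q in partitions])
print(run_partitions(partitions, [1, 10, 100]))
```

Only sequential access to each partition is used, which matches the observation that shift registers suffice.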

It is optional whether the input data and/or the values of the convolution kernel are replicated. The effect is the same in each case, namely that those values of the input data and values of the convolution kernel whose product is actually contained in the sought convolution meet at the multipliers of the hardware accelerator. Replicating the values of the convolution kernel, that is, the weights, can be useful in particular in layers of neural networks that process input data of small depth and have only a few filters. Precisely in these layers, the available hardware memory units for the weights are often not fully utilized, so that the weights can be replicated at little or no additional cost for further hardware memory units.

The preceding explanations show that the possibility of feeding more summands to the hardware accelerator in each pass, and thus utilizing it better, does not come for free in any obvious way. Rather, an up-front cost must first be paid in the form of additional hardware for the at least limited random access to the input memory and/or in the form of increased memory requirements in the input memory.

In a particularly advantageous embodiment, an inner-product computing unit for vectors with a length of between 16 and 128 elements is chosen as the hardware accelerator. The larger the number of elements, the more data can be processed in each pass and the greater the speed gain when this unit is optimally utilized. However, the effort of providing the correct input data at the correct multipliers also grows. In investigations by the inventors, the range between 16 and 128 elements has proven to be an optimal compromise.

A main application of CNNs is the processing of measurement data into output variables relevant to the respective application. In the context of at least partially automated driving, for example, better utilization of the hardware accelerators means lower hardware costs for a corresponding evaluation system and correspondingly lower energy consumption.

The invention therefore also relates generally to a method for evaluating measurement data recorded with at least one sensor, and/or realistic synthetic measurement data of this at least one sensor, into one or more output variables with at least one neural network. Realistic synthetic measurement data can be used, for example, instead of or in combination with physically recorded measurement data in order to train the evaluation system. Typically, a data set with realistic synthetic measurement data of a sensor is difficult to distinguish from measurement data physically recorded with this sensor.

The neural network has at least one convolutional layer. In this convolutional layer, a convolution of a tensor of input data with at least one predetermined convolution kernel is determined. This convolution is calculated with the method described above. As explained above, this means that, for given hardware resources, the sought output variables can be obtained from the input variables particularly quickly. For a given processing speed, the evaluation can be carried out with fewer hardware resources and thus also with lower energy consumption.

In a particularly advantageous embodiment, the convolution in the first convolutional layer through which the measurement data pass is calculated using the method described above, while this method is not used in at least one convolutional layer passed through later. As explained above, the speed gain from the method described above is greatest in those layers of the CNN that are laterally very extensive but have only a small depth. The corresponding circuitry for the at least restricted random access of the hardware accelerator to the input memory, or the corresponding storage space in the input memory of the hardware accelerator, should therefore preferably be used for such layers.

In a particularly advantageous embodiment, the measurement data comprise image data of at least one optical camera or thermal camera, and/or audio data, and/or measurement data obtained by probing a spatial region with ultrasound, radar radiation or LIDAR. In the state in which they are fed into the CNN, precisely these data are laterally very extensive and of high resolution, but of comparatively small depth. The lateral resolution is successively reduced by the convolution from layer to layer, while the depth can increase.

The output variables sought can comprise, in particular, for example

• at least one class of a predefined classification, and/or
• at least one regression value of a regression variable sought, and/or
• a detection of at least one object, and/or
• a semantic segmentation of the measurement data with respect to classes and/or objects.

These are output variables for whose extraction from high-dimensional input variables CNNs are preferably used.

In a further particularly advantageous embodiment, a control signal is formed from the output variable or variables. A robot, and/or a vehicle, and/or a classification system, and/or a system for monitoring areas, and/or a system for quality control of mass-produced products, and/or a system for medical imaging is controlled with this control signal. Using the method described above to calculate the convolution means that, for given hardware resources for the evaluation, these systems react more quickly to measurement data recorded by sensors. If, on the other hand, the response time is specified, hardware resources can be saved.

In particular, the methods can be wholly or partly computer-implemented. The invention therefore also relates to a computer program with machine-readable instructions which, when executed on one or more computers, cause the computer or computers to carry out one of the described methods. In this sense, control units for vehicles and embedded systems for technical devices, which are likewise able to execute machine-readable instructions, are also to be regarded as computers.

The invention likewise relates to a machine-readable data carrier and/or to a download product with the parameter set and/or with the computer program. A download product is a digital product that can be transmitted via a data network, i.e. downloaded by a user of the data network, and that can be offered for immediate download, for example in an online shop.

Furthermore, a computer can be equipped with the computer program, with the machine-readable data carrier or with the download product.

Further measures that improve the invention are presented in more detail below, together with the description of the preferred exemplary embodiments of the invention, with reference to figures.

Embodiments

The figures show:

Figure 1 Exemplary embodiment of the method 100 for calculating a convolution 4;
Figure 2 Illustration of the basic mechanism that accelerates the calculation;
Figure 3 Changing the assignment of operands 52a, 52b to memory locations 51a-51h in the input memory 51 of a hardware accelerator 5 with a multiplexer 53;
Figure 4 Multiple storage of input data 1a and/or values 2a of a convolution kernel 2 in the input memory 51 for more efficient processing;
Figure 5 Exemplary embodiment of the method 200 for evaluating measurement data 61, 62.

Figure 1 is a schematic flow diagram of an exemplary embodiment of the method 100, with which the convolution 4 of an input tensor 1 of input data 1a with a tensorial convolution kernel 2 is calculated. In step 110, the convolution kernel is guided in a predefined grid of positions 21, 22 within the input tensor 1. In step 120, the convolution kernel 2 is applied in each of these positions 21, 22 by forming, from the input data 1a in the region of the input tensor 1 covered by the convolution kernel 2 at its current position 21, 22, a sum 3 weighted with the values 2a of the convolution kernel 2. In step 130, this weighted sum 3 is assigned in the convolution 4 to the current position 21, 22 of the convolution kernel 2. By successively working through all positions 21, 22 of the convolution kernel 2, the overall result of the convolution 4 is obtained.
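
Steps 110 to 130 can be sketched in software as follows. This is a minimal reference sketch and not the hardware-accelerated method itself. The function name `convolve`, the index order `[y][x][c]`, the `stride` parameter and the absence of padding are assumptions made for illustration. As is usual for CNNs, the kernel is not flipped.

```python
def convolve(input_tensor, kernel, stride=1):
    """Reference convolution: input_tensor[y][x][c], kernel[ky][kx][c].

    Returns a 2D list holding one weighted sum per kernel position.
    Assumes the kernel has the same depth as the input tensor.
    """
    height = len(input_tensor)
    width = len(input_tensor[0])
    depth = len(input_tensor[0][0])
    k_h, k_w = len(kernel), len(kernel[0])

    result = []
    # Step 110: guide the kernel over a predefined grid of positions.
    for y in range(0, height - k_h + 1, stride):
        row = []
        for x in range(0, width - k_w + 1, stride):
            # Step 120: weighted sum over the region covered by the kernel.
            weighted_sum = 0
            for ky in range(k_h):
                for kx in range(k_w):
                    for c in range(depth):
                        weighted_sum += (input_tensor[y + ky][x + kx][c]
                                         * kernel[ky][kx][c])
            # Step 130: assign the sum to the current kernel position.
            row.append(weighted_sum)
        result.append(row)
    return result
```

The innermost three loops form exactly the weighted sum that the hardware accelerator is meant to compute in as few operations as possible.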

When applying 120 the convolution kernel 2, a hardware accelerator 5 is used according to block 121; here, according to block 125, in particular, for example, an inner-product arithmetic unit for vectors with a length between 16 and 128 elements is selected. According to block 122, in at least one operation of the hardware accelerator 5, more summands are processed than corresponds to a depth 11 of the input tensor 1. The more operations of the hardware accelerator 5 can be better utilized in this way, the faster the overall result of the convolution 4 is obtained.

Box 122 breaks down how, when the hardware accelerator 5 is used, it is ensured that at all positions 21, 22 of the convolution kernel 2 the correct input data 1a are brought together with the correct values 2a of the convolution kernel 2 as operands 52a, 52b of multiplications in the multipliers 52 of the hardware accelerator 5.

According to block 123, as explained above, the assignment between operands 52a, 52b and memory locations 51a-51h of the input memory 51 can be varied during the calculation of the convolution 4 in order to give the multipliers 52 at least restricted random access to the input memory 51. For this purpose, a multiplexer 53 can be used according to block 123a, as explained in more detail in Figure 3.

According to block 124, input data 1a and/or values 2a of the convolution kernel 2 can be stored multiple times in the input memory 51. In this way, the input data 1a and values 2a can be supplied to the hardware accelerator at each position 21, 22 of the convolution kernel 2 in an arrangement relative to one another which ensures that the hardware accelerator 5 calculates summands that actually occur in the weighted sum 3. This is explained in more detail in Figure 4.

For example, according to block 124a, an input memory 51 with at least one separate memory or memory area for each multiplier 52 can be selected. According to block 124b, those input data 1a and values 2a of the convolution kernel 2 that the respective multiplier 52 requires in the course of the calculation of the convolution 4 can then be loaded into this memory or memory area.

Figure 2 explains the basic principle of the improved utilization of a hardware accelerator 5. In the illustrative example shown in Figure 2, the weighted sum of the shaded input data 1a in the input tensor 1 is to be calculated as part of a convolution 4, with the shaded values 2a of the convolution tensor 2 serving as weights. In this example, the input tensor 1 has a depth 11 of 3.

In conventional use of the hardware accelerator 5, each operation of the hardware accelerator 5 would only ever process as many input data 1a and values 2a of the convolution tensor 2 as lie one above the other along the depth 11. Processing all of the shaded values 1a, 2a would therefore require three operations of the hardware accelerator 5. If, however, all values 1a and 2a to be processed are each combined into one vector in the input memory 51 of the hardware accelerator 5, the weighted sum of the shaded values 1a, 2a can be calculated with only one operation of the hardware accelerator 5.
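
The saving can be estimated with a small calculation. The sketch below assumes that the accelerator is an inner-product unit with `vector_length` multipliers, that conventional use spends one operation per lateral kernel element (more if the depth exceeds the vector length), and that packed use fills each operation completely. The function names and the 1x3 kernel footprint in the usage note are illustrative assumptions, not taken from the figures.

```python
import math


def passes_conventional(kernel_height, kernel_width, depth, vector_length):
    """Operations per kernel position when each operation covers only
    the values stacked along the depth at one lateral kernel element."""
    return kernel_height * kernel_width * math.ceil(depth / vector_length)


def passes_packed(kernel_height, kernel_width, depth, vector_length):
    """Operations per kernel position when all summands are packed
    densely into vectors of the accelerator's length."""
    summands = kernel_height * kernel_width * depth
    return math.ceil(summands / vector_length)


def utilization(kernel_height, kernel_width, depth, vector_length, passes):
    """Fraction of multiplier slots that carry a real summand."""
    summands = kernel_height * kernel_width * depth
    return summands / (passes * vector_length)
```

For a depth of 3, a hypothetical 1x3 kernel footprint and a 16-element unit, `passes_conventional(1, 3, 3, 16)` gives 3 operations and `passes_packed(1, 3, 3, 16)` gives 1, which matches the three-versus-one comparison in the text.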

As explained above, for this purpose the correct input data 1a from the input tensor 1 must be multiplied by the correct values 2a of the convolution kernel 2 at each position 21, 22 of the convolution kernel 2, so that the weighted sum 3 as a whole contains only summands that actually occur in the convolution 4. Figures 3 and 4 illustrate the ways explained above in which this can be ensured.

Figure 3 illustrates the use of a multiplexer 53 to give a multiplier 52 in a hardware accelerator 5 at least restricted random access to the input memory 51 of the hardware accelerator 5. In this illustrative example, eight memory locations 51a-51h of the input memory 51 are shown. The 4:1 multiplexer 53 can be used to select whether a value 1a is retrieved from memory location 51a, 51c, 51e or 51g of the input memory 51 and supplied to the multiplier 52 as the first operand 52a. Figure 3 shows two exemplary possible sources from which the second operand 52b can originate. The second operand 52b can be supplied to the multiplier 52, for example, from memory location 51b of the input memory 51. However, the multiplexer 53, or a further multiplexer, can also have access, for example, to different memory locations 55a-55d of the parameter memory 55, each of which stores a different value 2a of the convolution kernel 2. The multiplexer 53 can then selectively deliver one of these values 2a to the multiplier 52 as the second operand 52b. This option is drawn with dashed lines in Figure 3.
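
The operand selection described for Figure 3 can be modelled in a few lines. This is a behavioural sketch only. The memory contents, the function names `mux` and `multiply`, and the convention that `select_b=None` means "take the second operand from 51b" are assumptions made for illustration.

```python
def mux(sources, select):
    """Behavioural model of an N:1 multiplexer: forward one source."""
    if not 0 <= select < len(sources):
        raise ValueError("select signal out of range")
    return sources[select]


# Hypothetical contents of the input memory 51 and parameter memory 55.
input_memory = {"51a": 2, "51b": 10, "51c": 3, "51d": 0,
                "51e": 5, "51f": 0, "51g": 7, "51h": 0}
parameter_memory = {"55a": 1, "55b": -1, "55c": 4, "55d": 6}


def multiply(select_a, select_b=None):
    """Multiplier 52 with a 4:1 multiplexer in front of the first operand.

    select_a picks 51a, 51c, 51e or 51g as the first operand 52a.
    select_b=None takes the second operand 52b from 51b; otherwise a
    second 4:1 multiplexer picks one of 55a-55d (the dashed option).
    """
    operand_a = mux([input_memory[k] for k in ("51a", "51c", "51e", "51g")],
                    select_a)
    if select_b is None:
        operand_b = input_memory["51b"]
    else:
        operand_b = mux([parameter_memory[k]
                         for k in ("55a", "55b", "55c", "55d")], select_b)
    return operand_a * operand_b
```

Changing `select_a` between kernel positions is what varies the assignment between operand and memory location without moving any data in the input memory.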

The multiplier 52 multiplies the two operands 52a and 52b and delivers the product 52c as the result. A further multiplier 52', drawn by way of example, likewise delivers such a product 52c, which it has formed from other operands 52a and 52b. Products 52c delivered by different multipliers 52 are added by adders 54 to form intermediate results 54a. The intermediate results 54a are in turn accumulated by further adders 54 (not shown in Figure 4) until finally the weighted sum 3, or at least a part of it, has been calculated. The maximum increase in efficiency is obtained when, for a position 21, 22 of the convolution kernel 2, the complete weighted sum 3 can be calculated with only one operation of the hardware accelerator 5. However, an increase in efficiency already sets in as soon as even one such operation can be saved in the course of calculating a weighted sum 3. Any desired compromise between hardware costs and increase in efficiency can be set via the number of multipliers 52, 52' in the hardware accelerator 5.
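
The interplay of multipliers, adders and the number of operations can be sketched as follows. The pairwise adder tree and the names `accelerator_pass` and `weighted_sum` are illustrative assumptions. The text specifies only that products are added into intermediate results and accumulated further.

```python
def accelerator_pass(operands_a, operands_b):
    """One operation: multiply element-wise, then reduce with an adder tree."""
    products = [a * b for a, b in zip(operands_a, operands_b)]
    if not products:
        return 0
    # Each loop iteration models one level of adders 54.
    while len(products) > 1:
        if len(products) % 2:
            products.append(0)  # pad so that every adder has two inputs
        products = [products[i] + products[i + 1]
                    for i in range(0, len(products), 2)]
    return products[0]


def weighted_sum(values, weights, num_multipliers):
    """Compute a weighted sum and count the accelerator operations used."""
    total, passes = 0, 0
    for start in range(0, len(values), num_multipliers):
        end = start + num_multipliers
        total += accelerator_pass(values[start:end], weights[start:end])
        passes += 1
    return total, passes
```

With nine summands, four multipliers need three operations while sixteen multipliers need one, which illustrates the trade-off between hardware cost and efficiency named in the text.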

Figure 4 illustrates the replication of input data 1a in the input memory 51 of the hardware accelerator 5 with the aim of being able to retrieve, for each position 21, 22 of the convolution kernel 2, the correct input data 1a as operands 2a for the multipliers 52. In this illustrative example, the input tensor 1 comprises three planes, i.e. it has a depth 11 of 3. Some different positions of input data 1a in these planes are marked by different hatching.

In the input memory 51, some values 1a from the input tensor 1 are written one below the other for each of the positions 21 and 22, in the order in which they are required for multiplications with values 2a of the convolution tensor 2. For illustration, one value 1a is singled out and labelled with the reference sign 1a. At the first position 21 of the convolution kernel 2, this value 1a is in fourth place from the top in the input memory 51, because, starting from the upper left corner of the planes of the input tensor 1, a "column" in the direction of the depth 11 of the input tensor 1 is processed first, and the value 1a forms the beginning of the second such "column". If the convolution kernel 2 now advances to position 22, however, the value 1a must be multiplied by the first value 2a of the convolution kernel 2. For this position 22, the value 1a is therefore required in first place in the input memory 51. For this purpose, the input data 1a are replicated in the input memory 51 as drawn in Figure 4.
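
The replicated layout can be reproduced with a short sketch. It assumes that kernel elements are visited row by row, that each "column" along the depth is written out completely before the next lateral element, and that position 22 lies one step to the right of position 21. The function name `memory_layout` and the tuple-valued test tensor are hypothetical.

```python
def memory_layout(input_tensor, y0, x0, kernel_height, kernel_width):
    """List the input values for one kernel position in operand order.

    input_tensor[y][x] is the depth "column" at lateral position (y, x).
    """
    layout = []
    for ky in range(kernel_height):
        for kx in range(kernel_width):
            # Copy the whole depth column before moving on laterally.
            layout.extend(input_tensor[y0 + ky][x0 + kx])
    return layout


# Hypothetical 3x4 tensor of depth 3 whose entries name their own indices.
tensor = [[[(y, x, c) for c in range(3)] for x in range(4)] for y in range(3)]

layout_21 = memory_layout(tensor, 0, 0, 2, 2)  # kernel at position 21
layout_22 = memory_layout(tensor, 0, 1, 2, 2)  # kernel moved one step right

marked = (0, 1, 0)  # start of the second depth column
print(layout_21.index(marked), layout_22.index(marked))
```

The marked value sits at index 3 (fourth place) for position 21 and at index 0 (first place) for position 22, so it has to be stored twice if both layouts are held in the input memory at the same time.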

Figure 5 is a schematic flow diagram of an exemplary embodiment of the method 200 for evaluating measurement data. These can be any mixture of measurement data 61 that were physically recorded with at least one sensor 6 and realistic synthetic measurement data 62 of this at least one sensor 6.

In step 210, the measurement data 61, 62 are processed with a neural network 8 to form output variables 7. The neural network 8 comprises a plurality of convolutional layers 81-83 through which the measurement data 61, 62 pass in succession. That is, the measurement data 61, 62 are processed by layer 81 into an intermediate result ("feature map"), which is then processed by layer 82 into a further intermediate result and by layer 83 into the final output variables 7.

In each convolutional layer 81-83, a convolution 4 of a tensor 1 of input data 1a with at least one predefined convolution kernel 2 is determined. According to block 210a, at least one such convolution 4 is calculated using the method 100 described above.

In particular, according to block 210b, the convolution 4 in the first convolutional layer 81 through which the measurement data 61, 62 pass can be calculated using the method 100, while this method is not used in at least one convolutional layer 82, 83 passed through later. As explained above, the additional effort required to combine many calculations in one operation of the hardware accelerator 5 can in this way be concentrated preferentially on those convolutional layers in which the gain in efficiency is particularly large because of their comparatively small depth.

In step 220, a control signal 220a is formed from the output variables 7. In step 230, a robot 91, and/or a vehicle 92, and/or a classification system 93, and/or a system 94 for monitoring areas, and/or a system 95 for quality control of mass-produced products, and/or a system 96 for medical imaging is controlled with this control signal.

Claims

Method (100) for calculating a convolution (4) of an input tensor (1) of input data (1a) with a tensorial convolution kernel (2), wherein • the convolution kernel (2) is guided (110) in a predefined grid of positions (21, 22) within the input tensor (1), • the convolution kernel (2) is applied in each of these positions (21, 22) by forming (120), from the input data (1a) in the region of the input tensor (1) covered by the convolution kernel (2) at its current position (21, 22), a sum (3) weighted with the values (2a) of the convolution kernel (2), and • this weighted sum (3) is assigned (130) in the convolution (4) to the current position (21, 22) of the convolution kernel (2), wherein • the weighted sum (3) is calculated (121) with at least one hardware accelerator (5) which has an input memory (51) and a fixed number of multipliers (52) that each retrieve their operands (52a, 52b) from predefined memory locations (51a-51h) of the input memory (51), and • in at least one operation of the hardware accelerator (5), more summands are processed than corresponds to a depth (11) of the input tensor (1) (122), wherein • the assignment between operands (52a, 52b) and memory locations (51a-51h) of the input memory (51) is varied (123) during the calculation of the convolution (4), and/or • input data (1a) and/or values (2a) of the convolution kernel (2) are stored multiple times in the input memory (51) (124).

Method (100) according to claim 1, wherein an inner-product arithmetic unit for vectors with a length between 16 and 128 elements is selected (125) as the hardware accelerator (5).

Method (100) according to one of claims 1 to 2, wherein the assignment between operands (52a, 52b) and memory locations (51a-51h) of the input memory (51), and/or the assignment between operands (52a, 52b) and memory locations (55a-55d) of a parameter memory (55) for values (2a) of the convolution kernel (2), is varied (123a) with at least one multiplexer (53) connected between a multiplier (52) and at least one input memory (51), and/or between at least one multiplier (52) and at least one parameter memory (55).

Method (100) according to claim 3, wherein a 4:1 multiplexer is selected as the multiplexer (53).

Method (100) according to one of claims 1 to 4, wherein an input memory (51) with at least one separate memory or memory area for each multiplier (52) is selected (124a), and those input data (1a) and values (2a) of the convolution kernel (2) that the respective multiplier (52) requires in the course of the calculation of the convolution (4) are loaded (124b) into this memory or memory area.

Method (200) for evaluating measurement data (61) recorded with at least one sensor (6), and/or realistic synthetic measurement data (62) of this at least one sensor (6), to obtain one or more output variables (7) with at least one neural network (8), wherein this neural network (8) has at least one convolutional layer (81-83), wherein in the convolutional layer (81-83) a convolution (4) of a tensor (1) of input data (1a) with at least one predefined convolution kernel (2) is determined (210), and wherein this convolution (4) is calculated (210a) using the method (100) according to one of claims 1 to 5.

Method (200) according to claim 6, wherein the convolution (4) in the first convolutional layer (81) through which the measurement data (61, 62) pass is calculated (210b) using the method (100) according to one of claims 1 to 5, while at the same time the convolution (4) in at least one convolutional layer (82-83) passed through later is not calculated (210c) using the method (100) according to one of claims 1 to 5.

Method (200) according to one of claims 6 to 7, wherein the measurement data (61, 62) comprise image data of at least one optical camera or thermal camera, and/or audio data, and/or measurement data obtained by probing a spatial region with ultrasound, radar radiation or LIDAR.

Method (200) according to one of claims 6 to 8, wherein the output variables (7) comprise • at least one class of a predefined classification, and/or • at least one regression value of a regression variable sought, and/or • a detection of at least one object, and/or • a semantic segmentation of the measurement data with respect to classes and/or objects.

Method (200) according to one of claims 6 to 9, wherein a control signal (220a) is formed (220) from the output variable or variables (7), and a robot (91), and/or a vehicle (92), and/or a classification system (93), and/or a system (94) for monitoring areas, and/or a system (95) for quality control of mass-produced products, and/or a system (96) for medical imaging is controlled (230) with this control signal (220a).

Computer program containing machine-readable instructions which, when executed on one or more computers, cause the computer or computers to carry out a method (100, 200) according to one of claims 1 to 9.

Machine-readable data carrier with the computer program according to claim 11.

Computer equipped with the computer program according to claim 11 and/or with the machine-readable data carrier according to claim 12.