WO2018158293A1 - Allocation of computational units in object classification - Google Patents

Allocation of computational units in object classification

Info

Publication number
WO2018158293A1
WO2018158293A1 (PCT/EP2018/054891; EP2018054891W)
Authority
WO
WIPO (PCT)
Prior art keywords
filter
feature map
map
input feature
operations
Prior art date
Application number
PCT/EP2018/054891
Other languages
English (en)
Inventor
Rastislav STRUHARIK
Bogdan VUKOBRATOVIC
Mihajlo KATONA
Original Assignee
Frobas Gmbh
Priority date
Filing date
Publication date
Application filed by Frobas Gmbh
Publication of WO2018158293A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/06Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
    • G06F18/24133Distances to prototypes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G06V10/443Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components by matching or filtering
    • G06V10/449Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters
    • G06V10/451Biologically inspired filters, e.g. difference of Gaussians [DoG] or Gabor filters with interaction between the filter responses, e.g. cortical complex cells
    • G06V10/454Integrating the filters into a hierarchical structure, e.g. convolutional neural networks [CNN]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/94Hardware or software architectures specially adapted for image or video understanding
    • G06V10/955Hardware or software architectures specially adapted for image or video understanding using specific electronic processors

Definitions

  • Various examples of the invention relate to techniques of controlling at least one memory and a plurality of computational units to perform a plurality of filter operations between an input feature map and a filter map for classification of at least one object. Furthermore, various examples of the invention relate to techniques of selecting a filter geometry from a plurality of filter geometries and using filters of the filter map which have the selected filter geometry for object classification.
  • an object includes a set of features.
  • the features are typically arranged in a certain inter-relationship with respect to each other.
  • typical objects in the field of image analysis for assisted and autonomous driving may include: neighboring vehicles; lane markings; traffic signs; pedestrians; etc.
  • Features may include: edges; colors; geometrical shapes; etc.
  • Neural networks are typically hierarchical graph-based algorithms, wherein the structure of the graph correlates with previously trained recognition capabilities.
  • the neural networks break down the problem of classification of an object into sequential recognition of the various features and their interrelationship.
  • the neural network is initially trained by inputting information (machine learning or training), e.g., using techniques of backpropagation; here, supervised and semi-supervised training or even fully automatic training helps to configure the neural network to accurately recognize objects.
  • the trained neural network can be applied to a classification task.
  • input data sometimes also called input instance
  • input data such as image data, audio data, sensor data, video data, etc.
  • a typical neural network includes a plurality of layers which are arranged sequentially. Each layer receives a corresponding input feature map which has been processed by a preceding layer. Each layer processes the respective input feature map based on a layer-specific filter map including filter coefficients. The filter map defines a strength of connection (weights) between data points of subsequent layers (neurons). Different layers correspond to different features of the object. Each layer outputs a processed input feature map (output feature map) to the next layer. The last layer then provides - as the respective output feature map - output data (sometimes also referred to as classification vector) which is indicative of the recognized objects, e.g., their position, orientation, count, and/or type/class.
  • Neural networks can be implemented in software and/or hardware.
  • a neural network can be implemented in software executed on a general-purpose central processing unit (CPU). It is also possible to implement neural network algorithms on a graphics processing unit (GPU). In other examples, it is also possible to implement neural network algorithms at least partly in hardware, e.g., using a field-programmable gate array (FPGA) integrated circuit or an application-specific integrated circuit (ASIC).
  • FPGA: field-programmable gate array
  • ASIC: application-specific integrated circuit
  • processing of unclassified data by means of a neural network requires significant computational resources, typically both in terms of processing power and memory access requirements.
  • One bottleneck in terms of computational resources is that, typically, on-chip memory is insufficient for storing the various intermediate feature maps and filter maps that occur during processing of particular input data. This typically results in a need for significant memory read/write operations (data movement) with respect to an off-chip/external memory. Such data movement can be both energy-inefficient and time-consuming.
  • a circuit includes at least one memory.
  • the at least one memory is configured to store an input feature map and a filter map.
  • the input feature map represents at least one object.
  • the circuit further includes a plurality of computational units.
  • the circuit further includes a control logic.
  • the control logic is configured to control the at least one memory and the plurality of computational units to perform a plurality of filter operations between the input feature map and the filter map. Said performing of the filter operations of the plurality of filter operations is for classification of the at least one object.
  • Each filter operation of the plurality of filter operations includes a plurality of combinational operations.
  • the control logic is configured to sequentially assign at least two or all combinational operations of the same filter operation to the same computational unit of the plurality of computational units.
  • a method includes storing an input feature map and a filter map.
  • the input feature map represents at least one object.
  • the method further includes controlling a plurality of computational units to perform a plurality of filter operations between the input feature map and the filter map for classification of the at least one object.
  • Each filter operation of the plurality of filter operations includes a plurality of combinational operations.
  • the method further includes sequentially assigning at least two or all combinational operations of the same filter operation to the same computational unit of the plurality of computational units.
  • a computer program product or computer program includes program code that can be executed by at least one computer. Executing the program code causes the at least one computer to perform a method.
  • the method includes storing an input feature map and a filter map.
  • the input feature map represents at least one object.
  • the method further includes controlling a plurality of computational units to perform a plurality of filter operations between the input feature map and the filter map for classification of the at least one object.
  • Each filter operation of the plurality of filter operations includes a plurality of combinational operations.
  • the method further includes sequentially assigning at least two or all combinational operations of the same filter operation to the same computational unit of the plurality of computational units.
  • a method includes loading an input feature map.
  • the input feature map represents at least one object.
  • the method further includes selecting at least one filter geometry from a plurality of filter geometries.
  • the method further includes performing a plurality of filter operations between receptive fields of the input feature map and filters of the filter map for classification of the at least one object.
  • the filters have the selected at least one filter geometry.
  • a circuit includes at least one memory configured to store an input feature map and a filter map, the input feature map representing at least one object.
  • the circuit further includes a control logic configured to select at least one filter geometry from a plurality of filter geometries.
  • the circuit further includes a plurality of computational units configured to perform a plurality of filter operations between receptive fields of the input feature map and filters of the filter map for classification of the at least one object.
  • the filters have the selected at least one filter geometry.
  • a computer program product includes program code that can be executed by at least one computer. Executing the program code causes the at least one computer to perform a method.
  • the method includes loading an input feature map.
  • the input feature map represents at least one object.
  • the method further includes selecting at least one filter geometry from a plurality of filter geometries.
  • the method further includes performing a plurality of filter operations between receptive fields of the input feature map and filters of the filter map for classification of the at least one object.
  • the filters have the selected at least one filter geometry.
  • a computer program includes program code that can be executed by at least one computer. Executing the program code causes the at least one computer to perform a method.
  • the method includes loading an input feature map.
  • the input feature map represents at least one object.
  • the method further includes selecting at least one filter geometry from a plurality of filter geometries.
  • the method further includes performing a plurality of filter operations between receptive fields of the input feature map and filters of the filter map for classification of the at least one object.
  • the filters have the selected at least one filter geometry.
  • a circuit includes a plurality of computational units; and a first cache memory associated with the plurality of computational units; and a second cache memory associated with the plurality of computational units.
  • the circuit also includes an interface configured to connect to an off-chip random-access memory for storing an input feature map and a filter map.
  • the circuit also includes a control logic configured to select allocations of the first cache memory and the second cache memory to the input feature map and to the filter map, respectively.
  • the control logic is further configured to route the input feature map and the filter map to the plurality of computational units via the first cache memory or the second cache memory, respectively, and to control the plurality of computational units to perform a plurality of filter operations between the input feature map and the filter map for classification of at least one object represented by the input feature map.
  • the circuit also may include at least one router configured to dynamically route data stored by the first cache memory to computational units of the plurality of computational units.
  • the second cache memory may comprise a plurality of blocks, wherein different blocks of the plurality of blocks are statically connected with different computational units of the plurality of computational units.
  • the blocks of the plurality of blocks of the second cache memory may all have the same size.
  • the control logic may be configured to select a first allocation of the first cache memory and the second cache memory to a first input feature map and a first filter map, respectively.
  • the control logic may be configured to select a second allocation of the first cache memory and the second cache memory to a second input feature map and a second filter map, respectively.
  • the first feature map and the first filter map may be associated with a first layer of a multi-layer neural network.
  • the second feature map and the second filter map may be associated with a second layer of a multi-layer neural network.
  • the control logic may be configured to select the allocation of the first cache memory and the second cache memory based on at least one of a size of the input feature map, a size of the filter map, a relation of the size of the input feature map with respect to the size of the filter map.
  • the control logic may be configured to select the allocation of the first cache memory to the input feature map if the size of the input feature map is larger than the size of the kernel map.
  • the control logic may be configured to select the allocation of the first cache memory to the filter map if the size of the input feature map is not larger than the size of the kernel map.
  • Each block of the plurality of blocks may be dimensioned in size to store an entire receptive field of the input feature map and/or to store an entire filter of the filter map.
  • Data written to the first cache memory by a single refresh event may be routed to multiple computational units of the plurality of computational units.
  • Data written to the second cache memory by a single refresh event may be routed to a single computational unit of the plurality of computational units.
  • a rate of refresh events of the first cache memory may be larger than a rate of refresh events of the second cache memory.
  • the circuit may further comprise at least one cache memory providing level-2 cache functionality to the plurality of computational units, and optionally at least one cache memory providing level-3 cache functionality to the plurality of computational units.
  • the control logic may be configured to, depending on a size of receptive fields of the input feature map and a stride size associated with filters of the filter map, control data written to a given cache memory of the at least one cache memory providing level-2 cache functionality to the plurality of computational units.
  • the circuit may further include a first cache memory providing level-2 cache functionality to the plurality of computational units and being allocated to the input feature map and not to the filter map; and a second cache memory providing level-2 cache functionality and level-3 cache functionality to the plurality of computational units and being allocated to the input feature map and to the filter map.
  • the control logic may be configured to allocate the first cache memory to a first one of the input feature map and the filter map and to allocate the second cache memory to a second one of the input feature map and the filter map.
  • the first cache memory and the second cache memory are arranged at the same hierarchy level with respect to the plurality of computational units.
  • the first cache memory and the second cache memory may be at level-1 hierarchy with respect to the plurality of computational units.
  • the plurality of filter operations may comprise convolutions of a convolutional layer of a convolutional neural network, the convolutions being between a respective kernel of the filter map and a respective receptive field of the input feature map.
  • a method includes selecting allocations of a first cache memory associated with a plurality of computational units and of a second level-2 cache memory associated with the plurality of computational units to an input feature map and a filter map, respectively; and routing the input feature map and the filter map to the plurality of computational units via the first cache memory or the second cache memory, respectively; and controlling the plurality of computational units to perform a plurality of filter operations between the input feature map and the filter map for classification of at least one object represented by the input feature map.
  • FIG. 1 schematically illustrates a circuit including an external memory and a computer including an internal memory.
  • FIG. 2 is a flowchart of a method of processing data using a multi-layer neural network according to various examples.
  • FIG. 3 schematically illustrates the various layers of the multi-layer neural network, as well as receptive fields of neurons of the neural network arranged with respect to respective feature maps according to various examples.
  • FIG. 4 schematically illustrates a convolutional layer of the layers of the multi-layer network according to various examples.
  • FIG. 5 schematically illustrates the stride of a convolution of an input feature map with a kernel positioned at various positions throughout the feature map according to various examples, wherein the different positions correspond to different receptive fields.
  • FIG. 6 schematically illustrates arithmetic operations associated with a convolution according to various examples.
  • FIG. 7 schematically illustrates a cubic kernel having a large kernel size according to various examples.
  • FIG. 8 schematically illustrates a cubic kernel having a small kernel size according to various examples.
  • FIG. 9 schematically illustrates a spherical kernel having a large kernel size according to various examples.
  • FIG. 10 schematically illustrates a pooling layer of the layers of the multi-layer network according to various examples.
  • FIG. 11 schematically illustrates an adding layer of the layers of the multi-layer network according to various examples.
  • FIG. 12 schematically illustrates a concatenation layer of the layers of the multi-layer network according to various examples.
  • FIG. 13 schematically illustrates a fully-connected layer of the layers of the multi-layer network according to various examples, wherein the fully-connected layer is connected to a not-fully-connected layer.
  • FIG. 14 schematically illustrates a fully-connected layer of the layers of the multi-layer network according to various examples, wherein the fully-connected layer is connected to a fully-connected layer.
  • FIG. 15 schematically illustrates a circuit including an external memory and a computer according to various examples, wherein the computer includes a plurality of calculation modules.
  • FIG. 16 schematically illustrates a circuit including an external memory and a computer according to various examples, wherein the computer includes a single calculation module.
  • FIG. 17 schematically illustrates a circuit including an external memory and a computer according to various examples, wherein the computer includes a plurality of calculation modules.
  • FIG. 18 schematically illustrates details of a calculation module, the calculation module including a plurality of computational units according to various examples.
  • FIG. 19 schematically illustrates assigning multiple convolutions to multiple computational units.
  • FIG. 20 schematically illustrates assigning multiple convolutions to multiple computational units according to various examples.
  • FIG. 21 is a flowchart of a method according to various examples.
  • FIG. 22 is a flowchart of a method according to various examples.
  • FIG. 23 schematically illustrates details of a calculation module, the calculation module including a plurality of computational units according to various examples.
  • FIG. 24 is a flowchart of a method according to various examples.
  • FIG. 25 is a flowchart of a method according to various examples.
  • FIG. 26 schematically illustrates techniques of dynamic memory allocation between input feature maps and filter maps according to various examples, and further illustrates refresh events of level-1 cache memories according to various examples.
  • neural networks are employed for the object classification.
  • a particular form of neural networks that can be employed according to examples are convolutional neural networks (CNN).
  • CNNs are a type of feed-forward neural networks in which the connectivity between the neurons is inspired by the connectivity found in the animal visual cortex.
  • Individual neurons from the visual cortex respond to stimuli from a restricted region of space, known as receptive field.
  • the receptive field of a neuron may designate a 3-D region within the respective input feature map to which said neuron is directly connected to.
  • the receptive fields of neighboring neurons may partially overlap.
  • the receptive fields may span the entire visual field, i.e., the entire input feature map. It was shown that the response of an individual neuron to stimuli within its receptive field can be approximated mathematically by a convolution, so CNNs extensively make use of convolution.
  • a convolution includes a plurality of combinational operations which can be denoted as inner products of vectors.
  • a convolution may be defined with respect to a certain kernel.
  • a convolution may be between a 3-D kernel - or, generally, a 3-D filter - and a 3-D input feature map.
  • a convolution includes a plurality of combinational operations, i.e., applying 2-D channels of the 3-D kernel - or, generally, 2-D filter coefficients - to 2-D sections of a 3-D receptive field associated with a certain neuron; such applying of 2-D channels to 2-D sections may include multiple arithmetic operations, e.g., multiplication and adding operations.
  • a CNN is formed by stacking multiple layers that transform the input data into an appropriate output data, e.g., holding the class scores.
  • the CNN may include layers which are selected from the group of layer types including: Convolutional Layer, Pooling Layer, Non-Linear Activation Layer, Adding Layer, Concatenation Layer, Fully-connected Layer.
  • CNNs are typically characterized by the following features: (i) 3-D volume of neurons: the layers of a CNN have neurons arranged in 3-D: width, height and depth.
  • the neurons inside a layer are selectively connected to a sub-region of the input feature map obtained from the previous layer, called a receptive field. Distinct types of layers, both locally and completely connected, are stacked to form a CNN.
  • Various techniques described herein are based on the finding that the computational resources associated with implementing the CNN can vary from layer to layer, in particular, depending on a layer type. For example, it has been found that weight sharing as implemented by the convolutional layer can significantly reduce the number of free parameters being learnt, such that the memory access requirements for running the network are reduced. In other words, a filter map of the convolutional layers can be comparably small.
  • the convolutional layers may require significant processing power, because a large number of convolutions may have to be performed.
  • different convolutional layers may rely on different kernels: in particular, the kernel geometry may vary from layer to layer. Hence, computational resources in terms of processing power and memory access requirements can change from convolutional layer to convolutional layer.
  • Convolutional layers, in other words, are often characterized by a relatively small number of weights, since kernels are shared; but because the input feature maps and output feature maps of convolutional layers are large, there is often a large number of combinational operations to be performed. So, in convolutional layers, the memory access requirements are often comparably limited, but the required processing power is large. In fully-connected layers, the situation is often the opposite: here, the memory access requirements can be high since there is no weight sharing between the neurons, but the number of combinational operations is small. According to various examples it is possible to provide a circuit which flexibly provides efficient usage of the available computational units for the various layers encountered in a CNN - even in view of different requirements in terms of memory access requirements and/or processing power, as described above. This is facilitated by a large degree of freedom and flexibility provided when assigning combinational operations to available computational units, i.e., when allocating computational units for certain combinational operations.
  • the techniques described herein may be of particular use for multi-layer filter networks which iteratively employ multiple filters, wherein different iterations are associated with a different balance between computational resources in terms of processing power on the one hand and computational resources in terms of memory access requirements on the other hand.
  • while the techniques are described hereinafter primarily in the context of convolutional layers of CNNs - requiring significant processing power - such techniques may also be applied to other kinds and types of layers of CNNs, e.g., fully-connected layers - having significant memory access requirements.
  • FIG. 1 schematically illustrates aspects with respect to the circuit 100 that can be configured to implement a neural network.
  • the circuit 100 could be implemented by an ASIC or FPGA.
  • the circuit 100 includes a computer 121 that may be integrated on a single chip/die which includes an on-chip/internal memory 122.
  • the internal memory 122 could be implemented by cache or buffer memory.
  • the circuit 100 also includes external memory 111, e.g., DDR3 RAM.
  • FIG. 1 schematically illustrates input data 201 which is representing an object 285.
  • the circuit 100 is configured to recognize and classify the object 285.
  • the input data 201 is processed.
  • a set of filter maps 280 is stored in the external memory 111.
  • Each filter map 280 includes a plurality of filters, e.g., kernels for the convolutional layers of a CNN.
  • Each filter map 280 is associated with a corresponding layer of a multi-layer neural network, e.g., a CNN.
  • FIG. 2 is a flowchart of a method.
  • the method of FIG. 2 illustrates aspects with respect to processing of the input data 201 .
  • FIG. 2 illustrates aspects with respect to iteratively processing the input data 201 using multiple filters.
  • the input data 201 is read as a current input feature map.
  • the input data may be read from the external memory 111.
  • the input data 201 may be retrieved from a sensor.
  • each layer, i.e., each execution of 1002, corresponds to an iteration of the analysis of the data.
  • Different iterations of 1002 may be associated with different requirements for computational resources, e.g., in terms of processing power vs. memory access requirements.
  • such layer processing may include one or more filter operations between the current input feature map and the filters of the respective filter map 280.
  • an output feature map is written.
  • the output feature map may be written to the external memory 111.
  • In 1004 it is checked whether the CNN includes a further layer. If this is not the case, then the current output feature map of the current iteration of 1003 is output; the current output feature map then provides the classification of the object 285. Otherwise, in 1005 the current output feature map is read as the current input feature map, e.g., from the external memory 111. Then, 1002 - 1004 are re-executed in a next iteration.
  • processing of the input data requires multiple read and multiple write operations to the external memory 111, e.g., for different iterations of 1002 or even multiple times per iteration 1002.
  • Such data movement can be energy inefficient and may require significant time.
  • multiple input feature maps are subsequently processed in the multiple iterations of 1002. This can be time-consuming.
  • Various techniques described herein enable to efficiently implement the layer processing of 1002. In particular, according to examples, it is possible to avoid idling of computational units during execution of 1002.
  • FIG. 3 illustrates aspects with respect to a CNN 200.
  • the CNN includes a count of sixteen layers 260.
  • FIG. 3 illustrates the input data 201 converted to a respective input feature map.
  • the first layer 260 which receives the input data is typically called an input layer.
  • the feature maps 202, 203, 205, 206, 208, 209, 211 - 213, 215 - 217 are associated with convolutional layers 260.
  • the feature maps 204, 207, 210, 214 are associated with pooling layers 260.
  • the feature maps 219, 220 are associated with fully-connected layers 260.
  • the output data 221 corresponds to the output feature map of the last fully connected layer 260.
  • the last layer which outputs the output data is typically called an output layer.
  • Layers 260 between the input layer and the output layer are sometimes referred to as hidden layers 260.
  • the output feature maps of every convolutional layer 260 and of at least some of the fully-connected layers are post-processed using a non-linear post-processing function (not shown in FIG. 3), e.g., a rectified linear activation function and/or a softmax activation function.
  • a non-linear post-processing function e.g., a rectified linear activation function and/or a softmax activation function.
  • dedicated layers can be provided for non-linear post-processing (not shown in FIG. 3).
  • FIG. 3 also illustrates the receptive fields 251 of neurons 255 of the various layers 260.
  • the lateral size (xy-plane) of the receptive fields 251 - and thus of the corresponding kernels - is the same for all layers 260, e.g., 3x3 neurons.
  • different layers 260 could rely on kernels and receptive fields having different lateral sizes.
  • the various convolutional layers 260 employ receptive fields and kernels having different depth dimensions (z-axis). For example, in FIG. 3, the smallest depth dimension equals 3 neurons while the largest depth dimension equals 512 neurons.
  • the pooling layers 260 employ 2x2 pooling kernels of different depths. Similar to convolutional layers, different pooling layers may use differently sized pooling kernels and/or stride sizes. The size of 2x2 is an example, only.
  • the CNN 200 according to the example of FIG. 3 and also of the various further examples described herein may have about 15,000,000 neurons and 138,000,000 network parameters and may require more than 15,000,000,000 arithmetic operations.
  • FIG. 4 illustrates aspects with respect to a convolutional layer 260.
  • the input feature map 208 is processed to obtain the output feature map 209 of the convolutional layer 260.
  • Convolutional layers 260 can be seen as the core building blocks of a CNN 200.
  • the convolutional layers 260 are associated with a set of learnable 3-D filters - also called kernels 261, 262 - stored in a filter map.
  • Each filter has limited lateral dimensions (xy-plane) - associated with the small receptive fields 251, 252 typical for the convolutional layers -, but typically extends through the full depth of the input feature map (in FIG. 4, only 2-D slices 261-1, 262-1 of the kernels 261, 262 are illustrated for the sake of simplicity, but the arrows along the z-axis indicate that the kernels 261, 262 are, in fact, 3-D structures; also cf. FIG.
  • the different kernels 261, 262 are each associated with a plurality of combinational operations 2011, 2012; the various combinational operations 2011, 2012 of a kernel 261, 262 correspond to different receptive fields 251 (in FIG. 5, per kernel 261, 262 a single combinational operation 2011, 2012 corresponding to a given receptive field 251, 252 is illustrated).
  • each kernel 261, 262 will be applied to different receptive fields 251, 252; each such application of a kernel 261, 262 to a certain receptive field defines a respective combinational operation 2011, 2012 between the respective kernel 261, 262 and the respective receptive field 251, 252. As illustrated in FIG. 4, each kernel 261, 262 is convolved across the width and height of the input feature map 208 (in FIG.
  • kernels 261 , 262 activate when detecting some specific type of feature at some spatial position in the input feature map.
  • Stacking such activation maps for all kernels 261, 262 along the depth dimension (z-axis) forms the full output feature map 209 of the convolutional layer 260. Every entry in the output feature map 209 can thus also be interpreted as a neuron 255 that perceives a small receptive field of the input feature map 208 and shares parameters with neurons 255 in the same slice of the output feature map. Often, when dealing with high-dimensional inputs such as images, it may be undesirable to connect neurons 255 of the current convolutional layer to all neurons 255 of the previous layer, because such a network architecture does not take the spatial structure of the data into account.
  • CNNs exploit spatially local correlation by enforcing a local connectivity pattern between the neurons 255 of adjacent layers 260: each neuron is connected to only a small region of the input feature map. The extent of this connectivity is a parameter called the receptive field of the neuron.
  • the connections are local in space (along width and height of the input feature map), but typically extend along the entire depth of the input feature map. Such architecture ensures that the learnt filters produce the strongest response to a spatially local input pattern.
  • the stride (S) parameter controls how receptive fields 251, 252 of different neurons 255 slide around the lateral dimensions (width and height; xy-plane) of the input feature map 208.
  • if the stride is set to 1,
  • the receptive fields 251, 252 of adjacent neurons are located at spatial positions only 1 spatial unit apart (horizontally, vertically or both). This leads to heavily overlapping receptive fields 251, 252 between the neurons, and also to large output feature maps 209.
  • for larger stride values, the receptive fields will overlap less and the resulting output feature map 209 will have smaller lateral dimensions (cf. FIG.
  • the spatial size of the output feature map 209 can be computed as a function of the width W and height H of the input feature map 208, the kernel field size KW x KH of the convolutional layer neurons, the stride S with which they are applied, and the amount of zero padding P used on the border.
  • the number of neurons that "fit" a given output feature map is given by WO = (W - KW + 2P)/S + 1 and HO = (H - KH + 2P)/S + 1.
  • each depth slice of the convolutional layer can be computed as a 3D convolution 2001 , 2002 of the neurons' 255 weights (kernel coefficients) with the section of the input volume including its receptive field 251 , 252:
  • OFM[z][x][y] = B[z] + Σ_k Σ_i Σ_j IFM[k][Sx + i][Sy + j] · Kernel[z][k][i][j]   (Eq. 2), where the sums run over k = 0, ..., KD-1; i = 0, ..., KW-1; j = 0, ..., KH-1, and where:
  • IFM and OFM are 3-D input feature map 208 and output feature map 209, respectively;
  • WI, HI, DI and WO, HO, DO are the width, height and depth of the input and output feature maps 208, 209, respectively;
  • B is the bias value for each kernel 261, 262 from the kernel map;
  • Kernel is the kernel map;
  • KW, KH and KD are the width, height and depth of every kernel 261, 262, respectively.
  • each convolution 2001, 2002 - defined by a certain value of the parameter z - includes a plurality of combinational operations 2011, 2012, i.e., the sums defining the inner vector product.
  • the various combinational operations 2011, 2012 correspond to different neurons 255 of the output feature map 209, i.e., different values of x and y in Eq. 2.
  • Each combinational operation 2011, 2012 can be broken down into a plurality of arithmetic operations, i.e., multiplications and sums (cf. FIG. 5).
  • FIG. 4 illustrates how the process of calculating 3-D convolutions 2001 , 2002 is performed.
  • the same 3-D kernel 261, 262 is being used when calculating the 3-D convolution.
  • What differs between the neurons 255 from the same depth slice are the 3-D receptive fields 251, 252 used in the 3-D convolution.
  • the neuron 255 (d1, x1, y1) uses the 3-D kernel 261 Kernel(d1) and the receptive field 251.
  • All neurons 255 of the output feature map 209 located in the depth slice d1 use the same kernel 261 Kernel(d1), i.e., are associated with the same 3-D convolution 2001.
  • for a different depth slice, a different 3-D kernel 262 Kernel(d2) is used.
  • output feature map neurons OFM(d1, x1, y1) and OFM(d2, x1, y1) would use the same 3-D receptive field 251 from the input feature map, but different 3-D kernels 261, 262, Kernel(d1) and Kernel(d2), respectively.
  • FIG. 7 illustrates a slice of a kernel 261, 262 having a cuboid shape. From a comparison of FIGs. 7 and 8 it can be seen that different kernel geometries may be achieved by varying the size 265 of the respective kernel 261, 262.
  • From FIG. 9 it can be seen that different kernel geometries may also be achieved by varying the 3-D shape of the kernel 261, 262, i.e., cuboid in FIGs. 7 and 8 and spherical in FIG. 9.
  • FIG. 10 illustrates aspects with respect to a pooling layer 260 which performs a filter operation in the form of pooling.
  • Pooling is generally a form of non-linear down-sampling. The intuition is that once a feature has been found, its exact location may not be as important as its rough location relative to other features, i.e., its spatial inter-relationship to other features.
  • the pooling layer 260 operates independently on every depth slice of the input feature map 209 and resizes it spatially.
  • the function of the pooling layer 260 is to progressively reduce the spatial size of the representation to reduce the amount of parameters and computation in the CNN 200, and hence to also control overfitting.
  • Pooling layers 260 may be inserted in-between successive convolutional layers 260.
  • the pooling operation provides a form of translation invariance.
  • OFM[z][x][y] = max_{i,j} { IFM[z][Sx + i][Sy + j] }, where i and j run over the respective pooling region,
  • This is illustrated in FIG. 10 for two neurons 255 associated with different pooling regions 671, 672.
  • An example is a pooling layer with filters of size 2x2 applied with a stride of 2, which downsamples the spatial size of the input feature map 209 at every depth slice by a factor of 2 along both width and height, discarding in the process 75% of the input feature map values within every 2x2 sub-region. Every max operation would in this case be taking a max over 4 numbers. The depth dimension remains unchanged. (ii) Average pooling partitions the input feature map 209 into non-overlapping sub-regions and for each sub-region outputs the average value of the input feature map points located within it.
  • (iii) L2-norm pooling partitions the input feature map 209 into non-overlapping sub-regions and for each sub-region outputs the L2-norm of the input feature map points located within it, defined by OFM[z][x][y] = sqrt( Σ_{i,j} IFM[z][Sx + i][Sy + j]² ).
  • Pooling is performed on a depth-slice-by-depth-slice basis within the pooling layer 260, as can be seen from FIG. 10.
  • Each neuron 255 of the output feature map 210, irrespective of its lateral position in the xy-plane, uses an identical 2-D pooling region 671, 672, but applied to different slices of the input feature map 209, because, similar to the convolutional layer, each neuron from the pooling layer has its own unique region of interest.
  • a resulting value from the output feature map 210 is calculated.
  • FIG. 11 illustrates aspects with respect to an adding layer 260 which performs a filter operation in the form of a point-wise addition of two or more input feature maps 225-227 to yield a corresponding output feature map 228.
  • each neuron 255 of the output feature map 228, OFM(d, x, y), actually represents a sum of the neurons from all input feature maps 225-227 located at the same location within those input feature maps.
  • the neuron 255 OFM(d1, x1, y1) of the output feature map 228 is calculated as the sum of the neurons of the input feature maps 225 - 227 located at the same coordinates, i.e., as OFM(d1, x1, y1) = IFM1(d1, x1, y1) + IFM2(d1, x1, y1) + IFM3(d1, x1, y1).
  • FIG. 12 illustrates aspects with respect to a concatenation layer 260.
  • the concatenation layer concatenates two or more input feature maps 225-227, usually along the depth axis and, thereby, implements a corresponding filter operation.
  • every concatenation layer accepts N input feature maps 225-227, each of them with identical spatial size WI x HI, but with possibly different depths DI1, DI2, ..., DIN; it produces an output feature map 228 of size WO x HO x DO, where DO = DI1 + DI2 + ... + DIN, and introduces zero parameters, since it computes a fixed function of the input feature maps 225-227.
  • the input feature map 225 IFM1 is located at depth slices 1:DI1 within the output feature map 228, then comes the input feature map 226 IFM2, located at depth slices (DI1+1):(DI1+DI2) within the output feature map 228, while the input feature map 227 IFM3 is located at depth slices (DI1+DI2+1):(DI1+DI2+DI3) of the output feature map 228.
  • FIGs. 13 and 14 illustrate aspects with respect to fully-connected layers.
  • the high-level reasoning in the CNN 200 is done by means of fully-connected layers 260.
  • Neurons 255 in a fully connected layer 260 have full connections to all neurons 255 in the respective input feature map 218, 219.
  • Their activations can hence be computed with a filter operation implemented by a matrix multiplication followed by a bias offset (Eq. 7 or Eq. 8), where Eq. 7 applies to a scenario where the input feature map 218 is associated with a not-fully-connected layer 260 (cf. FIG. 13) while Eq. 8 applies to a scenario where the input feature map 219 is associated with a fully-connected layer 260 (cf. FIG. 14).
  • the memory access requirements can be comparably high, in particular if compared to executing Eq. 2 for a convolutional layer.
  • the accumulated weighted sums for all neurons 255 of the fully-connected layer 260 are passed through some non-linear activation function, Af.
  • Activation functions which are most commonly used within the fully-connected layer are the ReLU, Sigmoid and Softmax functions.
  • FIGs. 13 and 14 illustrate that every neuron 255 of the respective output feature map 219, 220 is determined based on all values of the respective input feature map 218, 219. What is different between the neurons 255 of the fully-connected layer are the weight values that are used to modify the input feature map values 218, 219. Each neuron 255 of the output feature map 219, 220 uses its own unique set of weights. In other words, the receptive fields of all neurons 255 of an output feature map 219, 220 of a fully-connected layer 260 identically span the entire input feature map 218, 219, but different neurons 255 use different kernels.
  • every fully-connected layer accepts an input feature map 218, 219 of size WI x HI x DI if the input feature map is the product of a convolutional, pooling, non-linear activation, adding or concatenation layer; if the input feature map is the product of a fully-connected layer, then its size is equal to NI neurons. It produces an output feature map 219, 220 of size NO, and introduces a total of WI x HI x DI x NO weights and NO biases in case of non-fully-connected input feature maps, or NI x NO weights and NO biases in case of fully-connected input feature maps.
  • FIG. 15 illustrates aspects with respect to the architecture of the circuit 100. In the example of FIG. 15,
  • the circuit 100 includes a memory controller 112 which controls data movement from and to the external memory 111. Furthermore, a memory access arbiter 113 is provided which distributes data between multiple calculation modules (CMs) 123.
  • CMs calculation modules
  • Each CM 123 may include internal memory 122.
  • Each CM 123 may include one or more computational units (not illustrated in FIG. 15; sometimes also referred to as functional units, FUs), e.g., an array of FU units.
  • the FU array can be re-configured to perform processing for different types of layers 260, e.g., convolutional layer, pooling layer, etc. This is why the FU array is sometimes also referred to as a reconfigurable computing unit (RCU).
  • RCU reconfigurable computing unit
  • the circuit 100 includes a plurality of CMs aligned in parallel.
  • the circuit 100 could include a plurality of CMs arranged in a network geometry, e.g., a 2-D mesh network (not illustrated in FIG. 15).
  • By means of multiple CMs 123, different instances of the input data can be processed in parallel. For example, different frames of a video could be assigned to different CMs 123. Pipelined processing may be employed.
  • FIG. 16 illustrates aspects with respect to the CMs 123.
  • FIG. 16 illustrates an example where the circuit 100 only includes a single CM 123. However, it would be possible that the circuit 100 includes a larger number of CMs 123.
  • a - generally optional - feature map memory 164 is provided at the hierarchy level of the computer 121.
  • the feature map memory 164 may be referred to as level 2 cache.
  • the feature map memory is configured to cache at least parts of the input feature maps and/or output feature maps that are currently processed by the computer 121. This may facilitate power reduction, because read/write operations to the external memory 111 can be reduced.
  • the feature map memory 164 could store all intermediate feature maps 201-220.
  • a tradeoff between reduced energy consumption and increased on-chip memory may be found.
  • the CM 123 includes an input stream manager 162 for controlling data movement from the feature map memory 164 and/or the external memory 111 to a computing unit array 161.
  • the CM 123 also includes an output stream manager 163 for controlling data movement to the feature map memory 164 and/or the external memory 111 from the computing unit array 161.
  • the input stream manager 162 is configured to supply all data to be processed - such as configuration data, the kernel map, and the input feature map coming from the external memory - to the proper FU units.
  • the output stream manager 163 is configured to format and stream processed data to the external memory 111.
  • FIG. 17 illustrates aspects with respect to the CMs 123.
  • the example of FIG. 17 generally corresponds to the example of FIG. 16.
  • a plurality of CMs 123 is provided. This facilitates pipelined or parallel processing of different instances of the input data.
  • kernel maps may be shared between multiple CMs 123. For example, if a kernel map is unloaded from a first CM 123, the kernel map may be loaded by a second CM 123.
  • the feature map cache is employed. This helps to avoid frequent read/write operations to the external memory 111 with respect to the kernel maps.
  • kernel maps 280 are handed from CM 123 to CM 123 instead of moving kernel maps 280 back and forth to the external memory 111.
  • Such sharing of kernel maps between the CMs 123 can relate to parallel processing of different instances of the input data.
  • different CMs 123 use the same kernel map - which is thus shared between multiple CMs 123; and different CMs 123 process different instances of the input data or different feature maps of the input data.
  • if the CMs 123 use pipelined processing, then every CM 123 uses a different kernel map, because each CM 123 evaluates a different CNN layer; here, the feature maps move along the pipeline, sliding one CM at a time.
  • different input instances, e.g., different video frames or different images, are then processed by different layers of the CNN depending on where the respective input instance is currently located in the pipeline.
  • FIG. 18 illustrates aspects with respect to the FU array 161 , the input stream manager 162, and the output stream manager 163. Typically, these elements are integrated on a single chip or die.
  • the FU array 161 includes a plurality of FU units 321 - 323. While in the example of FIG. 18 a count of three FU units 321 - 323 is illustrated, in other examples it would be possible that the FU array 161 includes a larger count of FU units 321 - 323.
  • the various FU units 321 - 323 can be implemented alike or can be identical to each other.
  • the various FU units 321 - 323 may be configured to perform basic arithmetic operations such as multiplication or summation.
  • the FU array 161 may include a count of at least 200 FU units 321 - 323, optionally at least 1000 FU units 321 - 323, further optionally at least 5000 FU units 321 - 323.
  • the FU array 161 also includes shared memory 301. Different sections of the shared memory can be associated with data for different ones of the FU units 321 - 323 (in FIG. 18 the three illustrated partitions are associated with different FU units 321 - 323). In other words, different sections of the shared memory may be allocated to different FU units 321 -323.
  • routing elements 319 e.g., multiplexers, are employed.
  • An encoder 352 may encode the output data.
  • An inter-related decoder 342 is provided in the input stream manager 162.
  • the input stream manager 162 includes a stick buffer 341; here, it is possible to pre-buffer certain data that is later on provided to the shared memories 301.
  • the output stream manager 163 includes an output buffer 351 . These buffers 341 , 351 are optional.
  • output registers 329 associated with the various FU units 321 - 323.
  • the registers 329 can be used to buffer data that has been processed by the FU units 321 - 323.
  • the FU array 161 also includes a postprocessing unit 330.
  • the postprocessing unit 330 can be configured to modify the data processed by the FU units 321 - 323 based on linear or non-linear functions. While in the example of FIG. 18 a dedicated postprocessing unit 330 for non-linear postprocessing is illustrated, in other examples it would also be possible that non-linear postprocessing is associated with a dedicated layer of the CNN 200.
  • Examples of non-linear postprocessing functions include: the rectified linear unit (ReLU), sigmoid and softmax functions mentioned above.
  • FIG. 19 illustrates aspects with respect to the assignment (arrows in FIG. 19) of convolutions 2001-2003 to FU units 321-323.
  • different combinational operations 2011, 2012 of each convolution 2001-2003 are assigned to different FU units 321-323.
  • the various convolutions 2001, 2002 are processed sequentially. Because the number of combinational operations 2011, 2012 of a given convolution 2001-2003 may not match the number of FU units 321-323, this may result in an idling FU unit 323. This reduces the efficiency.
  • FIG. 20 illustrates aspects with respect to the assignment of convolutions 2001-2003 using different kernels 261, 262 to FU units 321-323.
  • the combinational operations required to complete processing of an input feature map 201 - 220, 225 - 228 are flexibly assigned to the various FU units 321 - 323. This is based on the finding that such a flexible assignment of the combinational operations can reduce idling of the FU units 321 - 323 if compared to a static assignment, i.e., a predefined assignment which does not vary - e.g., from layer to layer 260 of the CNN 200. If a static, predefined assignment is used it may not be possible to flexibly adjust the assignment depending on properties of the respective input feature map 201 - 220, 225 - 228 and/or the respective kernel map 280.
  • the flexible assignment enables to tailor allocation of the FU units 321 - 323.
  • the assignment can take into account properties such as the size 265 of the used kernel 261 , 262 or the shape of the used kernel 261 , 262 - or generally the kernel geometry.
  • the assignment can take into account the stride 269.
  • a control logic - e.g., implemented by a control 343 in the input stream manager 162 and/or a control 353 in the output stream manager 163 or another control of the computer 121 - is configured to sequentially assign at least two combinational operations 2011, 2012 of the same convolution 2001, 2002 to the same FU unit 321 - 323. This avoids idling of the FU units 321-323.
  • the processing time is reduced.
  • the control logic 343, 353 is configured to sequentially assign all combinational operations 2011, 2012 of a given convolution 2001 - 2003 to the same FU unit 321 - 323.
  • the values of all neurons 255 of a given slice of the output feature map of the respective convolution or layer 260 are determined by the same FU unit 321 - 323.
  • all arithmetic operations of a given value of z in equation 2 are performed by the same FU unit 321 - 323.
  • idling of FU units 321 - 323 is avoided for a scenario where the count of FU units 321 - 323 is different from a count of convolutions 2001 , 2002 - or generally the count of filter operations.
  • the count of convolutions 2001 , 2002 can depend on various parameters such as the kernel geometry; the stride size; etc.
  • the count of filter operations may be different for convolutional layers 260 if compared to fully-connected layers.
  • the control logic 343, 353 is configured to selectively assign at least two combinational operations 2011, 2012 of the same convolution 2001, 2002 to the same FU unit 321 - 323 depending on at least one of the following: a kernel geometry; a stride size; a count of the FU units 321 - 323; a count of the kernels 261, 262 of the kernel map 280 or, generally, a count of the filter operations of the filter map; a size of the on-chip memory 301 (which may limit the number of combinational operations 2011, 2012 that can possibly be executed in parallel); and, generally, the layer 260 of the CNN 200 which is currently processed.
  • In FIG. 20, a scenario is illustrated where combinational operations 2011, 2012 associated with different convolutions 2001 - 2003 are completed at the same point in time.
  • the control logic 343, 353 may be configured to monitor completion of the first combinational operation by the respective FU unit 321 - 323; and then trigger a second combinational operation of the same convolution 2001 - 2003 to be performed by the respective FU unit 321 - 323.
  • the trigger time points are generally not required to be synchronized for different filter operations.
  • FIG. 21 is a flowchart of a method according to various examples.
  • a plurality of filter operations is performed.
  • Each filter operation includes a plurality of combinational operations.
  • the filter operations may correspond to 3-D convolutions between an input feature map and a kernel map.
  • the plurality of combinational operations in 1011 may correspond to arithmetic operations such as multiplications and summations between a plurality of two-dimensional slices of 3-D receptive fields of the feature map and associated 2-D filter coefficients of a 3-D kernel of the kernel map (see the per-slice sketch following this list).
  • the filter operations in 1011 may be part of processing of a not-fully-connected layer or of a fully-connected layer.
  • 1011 may be re-executed for various layers of a multi-layer neural network (cf. FIG. 2, 1002).
  • At least two combinational operations of the same filter operation are assigned to the same FU unit.
  • the same FU unit sequentially calculates at least parts of the filter operation. This facilitates efficient utilization of the available FU units. For example, for at least one filter operation, it would be possible to assign all respective combinational operations to the same FU unit.
  • FIG. 22 illustrates a method which could enable such flexible selection of the parameters of the various layers of the multi-layer neural network.
  • an input feature map is loaded.
  • the input feature map may be associated with an input layer or an output layer or a hidden layer.
  • the input feature map may be loaded from an external memory or from some on-chip memory, e.g., a level-2 cache, etc.
  • the input feature map may also correspond to the output feature map of a previous layer, e.g., in case of a hidden layer.
  • a filter geometry is selected from a plurality of filter geometries. It is possible that different layers of the multi-layer neural network are associated with different filter geometries. For example, different filter geometries may refer to different filter sizes in the lateral plane (xy - plane) and/or different filter shapes.
  • Possible filter shapes may correspond to: cuboid; spherical; cubic; and/or cylindric.
  • the filter geometries may be selected in view of the feature recognition task. Selecting the appropriate filter geometry may increase an accuracy with which the features can be recognized.
  • one and the same filter geometry is used throughout receptive fields across the entire input feature map of the respective layer of the multi-layer neural network.
  • different filter geometries are selected for different receptive fields of the respective input feature map, i.e., that different filter geometries are used for one and the same layer of the multi-layer neural network.
  • it would be possible that all layers, e.g., all convolutional layers, use filters having the same geometry, for example 3x3.
  • different convolutional layers use different filter geometries, e.g., different lateral filter shapes.
  • different filter geometries are used at different depths: e.g., neurons from depth 1 of the output feature map use a cubical kernel, neurons from depth 2 use a spherical kernel, neurons from depth 3 again use a cubical kernel but with a different xy size, etc. (see the kernel-mask sketch following this list).
  • Such a flexible selection of the filter geometry can break the translational invariance and, thus, help to accurately identify objects, e.g., based on a-priori knowledge.
  • the filter geometry is selected based on a-priori knowledge on objects represented by the input feature map.
  • the a-priori knowledge may correspond to distance information for one or more objects represented by the input feature map.
  • a-priori knowledge is obtained by sensor fusion between a sensor providing the input data and one or more further sensors providing the a-priori knowledge.
  • distance information on one or more objects represented by the input data is obtained from a distance sensor such as RADAR or LIDAR or a stereoscopic camera.
  • filter operations include convolutions.
  • the corresponding layer is a convolutional layer of a CNN, e.g., a CNN as described in further examples disclosed herein.
  • the filter operations rely on one or more filters having the selected filter geometry.
  • 1022 is executed after 1021. In other examples, it would also be possible that 1022 is executed prior to executing 1021 .
  • selecting of the filter geometry from the plurality of filter geometries can be in response to said loading of the input feature map.
  • FIG. 23 illustrates aspects with respect to the RCU 161 , the input stream manager 162, and the output stream manager 163. Typically, these elements are integrated on a single chip or die.
  • the RCU 161 of FIG. 23 generally corresponds to the RCU 161 of FIG. 18.
  • the techniques of assigning at least some or all combinational operations of the same filter operation to the same FU unit 321 - 323, as explained above, can also be implemented for the RCU 161 of FIG. 23.
  • the RCU 161 includes multiple instances of L1 cache memory 301, 302 associated with the FUs 321 - 323.
  • the L1 cache memory 301 and the L1 cache memory 302 are arranged on the same level of hierarchy with respect to the FUs 321 - 323, because access of the FUs 321 - 323 to the L1 cache memory 301 is not via the L1 cache memory 302, and vice versa.
  • an input stream router 344 implements a control functionality configured for selecting an allocation of the cache memory 301 and an allocation of the cache memory 302 to the respective input feature map and to the respective kernel map of the active layer of the CNN 200, respectively.
  • the input stream router 344 is configured to route the input feature map and the kernel map to the FUs 321 - 323 via the cache memory 301 or the cache memory 302, respectively.
  • the L1 cache memory 301 is connected to the FUs 321 - 323 via routers 319.
  • the shared L1 cache memory 301 is shared between the multiple FUs 321 - 323.
  • different sections of the shared L1 cache memory 301 can be allocated for data associated with different parts of the allocated map. This may be helpful, in particular, if the map allocated to the L1 cache memory 302 is replicated more than once across different blocks 311 - 313 of the L1 cache memory 302.
  • the shared L1 cache memory 301 providing data to multiple FUs 321 - 323 is implemented in a single memory entity in terms of a geometrical arrangement on the respective chip and/or in terms of an address space.
  • the routers 319 could be configured by the input stream router 344 to access the appropriate address space of the shared L1 cache memory 301 .
  • the input stream manager 162 also includes L2 cache memory 341 (labeled stick buffer in FIG. 23). Data is provided to the shared L1 cache memory 301 by the input stream router 344 via the L2 cache memory 341.
  • the cache memory 341 may be used to buffer sticks of data - e.g., receptive fields 251, 252 - from the respective map, e.g., the input feature map.
  • the cache memory 341 may be allocated to storing data of the input feature map, but not allocated to store data of the kernel map.
  • the refresh events of the cache memory 341 may be controlled depending on the size of the receptive fields 251, 252 of the input feature map and the stride size 269 of the kernels 261, 262 of the kernel map.
  • a sequence of processing the convolutions 2001, 2002 may be set appropriately.
  • every data point of the input feature map is re-used in 9 different convolutions 2001, 2002, e.g., in the case of a 3x3 kernel with stride 1 (see the reuse-factor sketch following this list).
  • the data movement to the external memory 101 can be reduced by a factor of 9.
  • the cache memory 341 is arranged up-stream of the input stream router 344, i.e., closer to the external memory 101 . Hence, it is possible to store data of the input feature map irrespective of the allocation of the input feature map to the local L1 cache memory 302 or to the shared L1 cache memory 301 .
  • the cache memory 341 provides L2 cache memory functionality for the input feature map - but not for the kernel map.
  • the cache memory 164 provides L3 cache memory functionality for the input feature map - because of the intermediate cache memory 341 - and, at the same time, the cache memory 164 provides L2 cache memory functionality for the kernel map.
  • also illustrated in FIG. 23 is an implementation of the L1 cache memory 302 including a plurality of blocks 311 - 313. Different blocks 311 - 313 are statically connected with different FUs 321 - 323. Hence, no intermediate routers are required between the blocks 311 - 313 and the FUs 321 - 323.
  • the L1 cache memory 302 is local memory associated with the FUs 321 - 323.
  • the blocks 311 - 313 of the local L1 cache memory 302 are separately implemented, i.e., on different positions on the respective chip.
  • the blocks 311 - 313 use different address spaces.
  • by implementing L1 cache memory both as shared RAM and as local RAM, as illustrated in FIG. 23, data movement can be significantly reduced when processing a layer 260 of the CNN 200. For example, refresh events - where the content of at least parts of the respective L1 cache memory 301, 302 is flushed and new content is written to the respective L1 cache memory 301, 302 - may occur less frequently.
  • This is reflected by the following finding: for example, for processing a convolutional layer 260 of the CNN 200, convolutions are performed between multiple kernels 261, 262 and one and the same receptive field 251, 252. Likewise, it is also required to perform convolutions between one and the same kernel 261, 262 and multiple receptive fields 251, 252.
  • because each block 311 - 313 of the local L1 cache memory 302 stores at least parts or all of a receptive field 251, 252 - i.e., in a scenario where the input feature map is allocated to the local L1 cache memory 302 -, it is possible to reuse that data for multiple convolutions with different kernels 261, 262 that are stored in the shared L1 cache memory 301.
  • using the routers 319, different sections of the shared L1 cache memory 301 can be flexibly routed to the FUs 321 - 323.
  • the shared L1 cache memory 301 stores, at a given point in time / in response to a given refresh event, data which is being processed by each FU 321 - 323, i.e., being routed to multiple FUs 321 - 323. Differently, the data written to the local L1 cache memory 302 by a single refresh event is routed to a single FU 321 - 323.
  • multiple FUs 321 - 323 calculate different 3-D convolutions using their locally stored convolution coefficients or kernels 261, 262 on the same data of the input feature map, e.g., the same receptive field 251, 252 currently stored in the shared L1 cache memory 301.
  • if the ratio of available FUs 321 - 323 to the number of different kernels 261, 262 allows simultaneous calculation of all required convolutions (e.g., more FUs 321 - 323 than kernels 261, 262), this is achieved by storing data of different receptive fields 251, 252 of the input feature map at a given point in time in the shared L1 cache memory 301.
  • the kernel map is stored stationarily in the local L1 cache memory 302, while different data of the input feature map associated with one or more receptive fields 251, 252 is sequentially stored in the shared L1 cache memory 301.
  • SIFM and SKM implement different allocations of the shared L1 cache memory 301 to a first one of the input feature map and the kernel map, and of the local L1 cache memory 302 to a second one of the input feature map and the kernel map. From the above, it is apparent that data structures of the same size are stored in the blocks 311 - 313. Therefore, in some examples, it is possible that the blocks 311 - 313 of the L1 cache memory 302 are all of the same size. Then, refresh events may occur in a correlated manner - e.g., within a threshold time or synchronously - for all blocks 311 - 313 of the local L1 cache memory 302.
  • in order to facilitate a reduced number of refresh events for the blocks 311 - 313 throughout the processing of a particular layer 260 of the CNN 200, it can be possible to implement the L1 cache memory 302 with a particularly large size. For example, it would be possible that the size of the local L1 cache memory 302 is larger than the size of the shared L1 cache memory 301. For example, the time-alignment between multiple convolutions that re-use certain data between the FUs 321 - 323 can require a high rate of refresh events for the shared L1 cache memory 301. Differently, the local L1 cache memory 302 may have a comparably low rate of refresh events, e.g., only a single refresh event at the beginning of processing a particular layer 260.
  • this may be the case, e.g., if each block 311 - 313 of the local L1 cache memory 302 is dimensioned to store an entire receptive field 251, 252 or an entire kernel 261, 262, i.e., if the local L1 cache memory 302 is dimensioned to store the entire input feature map or the entire kernel map.
  • FIG. 24 is a flowchart of a method according to various examples.
  • the method of FIG. 24 may be executed by the input stream router 344.
  • a first one of an input feature map and a filter map - e.g., a kernel map - may be allocated to the first L1 cache memory; and the second one of the input feature map and the kernel map may be allocated to the second L1 cache memory.
  • the first L1 cache memory is shared memory associated with multiple computational units of a plurality of computational units; while the second L1 cache memory is local memory, wherein each block of the local memory is associated with the respective computational unit of the plurality of computational units.
  • this facilitates filter operations such as convolutions typically required for processing a layer of a multi-layer neural network, where a given block of data - e.g., a receptive field or a kernel - is combined with many other blocks of data - e.g., kernels or receptive fields.
  • FIG. 25 is a flowchart of a method according to various examples.
  • FIG. 25 illustrates aspects with respect to selecting an allocation for the first and second L1 cache memory.
  • the method according to FIG. 25 could be executed as part of 7011 (cf. FIG. 24).
  • in 7021, it is checked whether a further layer exists in a multi-layer neural network for which an allocation of L1 cache memories is required. The further layer is selected as the current layer, if applicable.
  • if the size of the input feature map is smaller than the size of the kernel map, the local L1 cache memory is allocated to the input feature map. If, however, the size of the input feature map is not smaller than the size of the kernel map, then, in 7024, the local L1 cache memory is allocated to the kernel map. By allocating the smaller one of the input feature map and the kernel map to the local L1 cache memory, it is possible to reduce the size of the local L1 cache memory.
  • the kernel map is allocated to the local L1 cache memory. If, however, the input feature map is smaller than the kernel map, then the input feature map is allocated to the local L1 cache memory. In other words, the smaller of the maps is allocated to the local L1 cache memory. Thereby, the size of the local L1 cache memory 302 can be significantly reduced. This is based on the finding that, in a typical CNN, the size of input feature maps gets smaller towards deeper layers. The opposite is true for kernel maps.
  • the local L1 cache memory 302 can have a total size which is equal to max_l { min { IFMSize(l), KMSize(l) } }, where l stands for the l-th CNN layer and the maximum operator is taken over all CNN layers (see the sizing sketch following this list).
  • this would require that the total size of the local RAM memories is roughly at least 2 * 550 KB ≈ 1.1 MB, which is 4-6 times less than the sizes required if one cannot select the allocation of the input feature map and the kernel map to the local L1 cache memory 302 on a per-layer basis. Reducing the size of the local L1 cache memory 302 helps to simplify the hardware requirements.
  • the equations for calculating the required sizes of the local L1 cache memory 302 are examples only and may be more complex if the stride size is taken into account as well.
  • as illustrated in FIG. 25, it is possible that different allocations are selected for different layers of the multi-layer neural network.
  • the method according to FIG. 25 can be executed prior to the forward pass or during the forward pass of the respective multi-layer neural network.
  • the allocation is selected based on a relation of the size of the input feature map with respect to the size of the kernel map.
  • it would also be possible to select the allocation based on other decision criteria such as the (absolute) size of the input feature map and/or the (absolute) size of the kernel map.
  • one decision criterion that may be taken into account is whether the size of the local L1 cache memory is sufficient to store the entire input feature map and/or the entire kernel map.
  • FIG. 26 schematically illustrates aspects with respect to refresh events 701 - 710.
  • FIG. 26 schematically illustrates a timeline of processing different input feature maps 209 - 211.
  • processing of the input feature map 209 associated with the respective layer 260 of the CNN 200 commences.
  • the input feature map 209 is allocated to the shared L1 cache memory while the respective kernel map is allocated to the local L1 cache memory.
  • the input feature map 209 is partly loaded into the shared L1 cache memory and the kernel map is fully loaded into the local L1 cache memory.
  • the map allocated to the local L1 cache memory 302 is replicated over a number of FUs 321 - 323 / blocks 311 - 313. This may be the case because there is a correspondingly large count of FUs 321 - 323 and blocks 311 - 313. This is facilitated by the shared L1 cache memory 301 being able to store different sections of the respective other map, which makes it possible to keep all FUs 321 - 323 busy even if they are allocated to the same sections of the respective map.
  • the number of, e.g., input feature map bundles allocated to the shared L1 cache memory 301 that can be processed in parallel is equal to the replication factor of the kernel map allocated to the local L1 cache memory 302 (or vice versa).
  • processing of the input feature map 210 commences. For example, here, it would be possible that the size of the input feature map 210 is significantly smaller than the size of the input feature map 209. For this reason, while processing the input feature map 210, the input feature map 210 is allocated to the local L1 cache memory 302 while the respective kernel map is allocated to the shared L1 cache memory 301. A similar allocation is also used for processing the input feature map 211 which may have the same size as the input feature map 210.
  • a rate of refresh events 705 - 710 is larger for data associated with the respective kernel maps - now stored in the shared L1 cache memory 301 - than for data associated with the respective input feature maps 210, 211.
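The following sketches are illustrative only and do not form part of the disclosed hardware implementation. Per-slice sketch: a minimal functional model, assuming zero bias, a single output neuron and hypothetical function names, of how a 3-D convolution decomposes into per-slice combinational operations, i.e., multiplications and summations between 2-D slices of a receptive field and the associated 2-D filter coefficients (cf. 1011 of FIG. 21):

```python
import numpy as np

def output_neuron(receptive_field, kernel):
    """Value of one output neuron: the 3-D convolution is evaluated as a sum of
    per-depth-slice combinational operations, each one combining a 2-D slice of
    the receptive field with the corresponding 2-D slice of the kernel."""
    assert receptive_field.shape == kernel.shape
    partial_sums = [
        np.sum(receptive_field[z] * kernel[z])  # one combinational operation
        for z in range(receptive_field.shape[0])
    ]
    return sum(partial_sums)

# Usage: for a 3x3x3 receptive field and kernel, the per-slice decomposition
# gives the same result as the direct 3-D dot product.
rng = np.random.default_rng(0)
rf = rng.standard_normal((3, 3, 3))
k = rng.standard_normal((3, 3, 3))
assert np.isclose(output_neuron(rf, k), np.sum(rf * k))
print(output_neuron(rf, k))
```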
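Scheduling sketch: a minimal software model of the flexible assignment discussed in connection with FIG. 20 and FIG. 21. The class and function names are hypothetical, and the simple round-robin mapping of convolutions to FU units is an assumption made for illustration; the actual control logic 343, 353 is a hardware implementation.

```python
from collections import deque

class FuUnit:
    """Hypothetical software model of one FU unit 321 - 323: it executes the
    combinational operations queued to it strictly in sequence."""
    def __init__(self, name):
        self.name = name
        self.queue = deque()

    def submit(self, op):
        self.queue.append(op)

    def run(self):
        # All combinational operations of one convolution were queued to this
        # unit, so they complete back-to-back without idling between them.
        return [op() for op in self.queue]

def schedule_convolutions(convolutions, fu_units):
    """Assign all combinational operations of a given convolution to the same
    FU unit (cf. the sequential assignment discussed above)."""
    for idx, combinational_ops in enumerate(convolutions):
        fu = fu_units[idx % len(fu_units)]  # assumed round-robin policy
        for op in combinational_ops:
            fu.submit(op)

# Usage: 3 convolutions, each split into 4 combinational operations, mapped
# onto 2 FU units; convolution 2 shares a unit with convolution 0, so no unit
# idles while another still has work.
convs = [[(lambda c=c, o=o: (c, o)) for o in range(4)] for c in range(3)]
fus = [FuUnit("FU0"), FuUnit("FU1")]
schedule_convolutions(convs, fus)
for fu in fus:
    print(fu.name, fu.run())
```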
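Kernel-mask sketch: one possible way to express the filter geometries discussed in connection with FIG. 22 (cuboid, spherical, cylindric) as Boolean masks over kernel coefficients. The function name and the parametrization are assumptions for illustration only.

```python
import numpy as np

def make_kernel_mask(shape_xy, depth, geometry):
    """Boolean mask of shape (depth, shape_xy, shape_xy) selecting which kernel
    coefficients are active for the given geometry ('cuboid', 'spherical' or
    'cylindric', cf. the filter shapes listed above)."""
    z, y, x = np.meshgrid(
        np.linspace(-1.0, 1.0, depth),
        np.linspace(-1.0, 1.0, shape_xy),
        np.linspace(-1.0, 1.0, shape_xy),
        indexing="ij",
    )
    if geometry == "cuboid":      # full box: every coefficient is active
        return np.ones((depth, shape_xy, shape_xy), dtype=bool)
    if geometry == "spherical":   # coefficients inside the unit ball only
        return x**2 + y**2 + z**2 <= 1.0
    if geometry == "cylindric":   # circular lateral cross-section, full depth
        return x**2 + y**2 <= 1.0
    raise ValueError(f"unknown geometry: {geometry}")

# Usage: for the same 5x5x5 size, the spherical geometry keeps fewer active
# coefficients than the cuboid geometry, changing the lateral footprint of the
# filter.
print(make_kernel_mask(5, 5, "cuboid").sum())     # 125
print(make_kernel_mask(5, 5, "spherical").sum())  # fewer than 125
```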
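Reuse-factor sketch: the counting argument behind the factor of 9 for a 3x3 kernel with stride 1, ignoring border effects; the function name is an assumption.

```python
def reuse_factor(kernel_size, stride):
    """Upper bound on the number of receptive-field positions whose window
    contains a given data point of the input feature map: at most
    ceil(kernel_size / stride) positions per lateral dimension, squared for
    the two lateral dimensions."""
    per_dim = -(-kernel_size // stride)  # ceiling division
    return per_dim * per_dim

# Usage: a 3x3 kernel with stride 1 re-uses each interior data point in 9
# convolutions, so buffering it on-chip (e.g., in the stick buffer 341) avoids
# up to 8 redundant reads from the external memory 101.
print(reuse_factor(3, 1))  # 9
print(reuse_factor(3, 2))  # 4
```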
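Sizing sketch: the per-layer allocation decision of FIG. 25 and the dimensioning rule for the local L1 cache memory 302. The per-layer sizes used in the example are invented for illustration and do not correspond to any particular CNN discussed above.

```python
def choose_allocation(ifm_size, km_size):
    """Per-layer decision of FIG. 25: allocate the smaller of the input feature
    map (IFM) and the kernel map (KM) to the local L1 cache memory 302 and the
    other one to the shared L1 cache memory 301."""
    if ifm_size < km_size:
        return {"local": "input_feature_map", "shared": "kernel_map"}
    return {"local": "kernel_map", "shared": "input_feature_map"}

def required_local_l1_size(layers):
    """Dimensioning rule discussed above: for every layer l the local L1 cache
    memory must hold the smaller of IFMSize(l) and KMSize(l), so its total size
    is max over l of min(IFMSize(l), KMSize(l))."""
    return max(min(ifm, km) for ifm, km in layers)

# Usage with invented (hypothetical) per-layer sizes in KB; in a typical CNN
# the IFM shrinks towards deeper layers while the KM grows.
layers = [(600, 2), (300, 40), (80, 550), (10, 1200)]
for index, (ifm, km) in enumerate(layers):
    print("layer", index, choose_allocation(ifm, km))
print("required local L1 size:", required_local_l1_size(layers), "KB")
```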

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Neurology (AREA)
  • Biodiversity & Conservation Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

At least one memory (301, 341) and a plurality of computational units (321-323) perform a plurality of filter operations between an input feature map and a filter map for the classification of at least one object represented by the input feature map. Each filter operation of the plurality of filter operations comprises a plurality of combinational operations. Control logic is configured to sequentially assign at least two or all combinational operations of the same filter operation to the same computational unit of the plurality of computational units.
PCT/EP2018/054891 2017-02-28 2018-02-28 Attribution d'unités de calcul dans une classification d'objets WO2018158293A1 (fr)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
DE102017104103.6 2017-02-28
DE102017104103 2017-02-28
DE102017105217.8 2017-03-13
DE102017105217 2017-03-13

Publications (1)

Publication Number Publication Date
WO2018158293A1 true WO2018158293A1 (fr) 2018-09-07

Family

ID=61563377

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/EP2018/054891 WO2018158293A1 (fr) 2017-02-28 2018-02-28 Attribution d'unités de calcul dans une classification d'objets

Country Status (1)

Country Link
WO (1) WO2018158293A1 (fr)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112116071A (zh) * 2020-09-07 2020-12-22 地平线(上海)人工智能技术有限公司 神经网络计算方法、装置、可读存储介质以及电子设备
CN112434184A (zh) * 2020-12-15 2021-03-02 四川长虹电器股份有限公司 基于历史影视海报的深度兴趣网络的排序方法
CN112926595A (zh) * 2021-02-04 2021-06-08 深圳市豪恩汽车电子装备股份有限公司 深度学习神经网络模型的训练装置、目标检测系统及方法
US20210174177A1 (en) * 2019-12-09 2021-06-10 Samsung Electronics Co., Ltd. Method and device with neural network implementation
TWI782328B (zh) * 2019-12-05 2022-11-01 國立清華大學 適用於神經網路運算的處理器
EP4258009A1 (fr) * 2022-04-14 2023-10-11 Aptiv Technologies Limited Procédé, appareil et produit programme informatique de classification de scènes

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
CHEN ZHANG ET AL: "Optimizing FPGA-based Accelerator Design for Deep Convolutional Neural Networks", PROCEEDINGS OF THE 2015 ACM/SIGDA INTERNATIONAL SYMPOSIUM ON FIELD-PROGRAMMABLE GATE ARRAYS, FPGA '15, 22 February 2015 (2015-02-22), New York, New York, USA, pages 161 - 170, XP055265150, ISBN: 978-1-4503-3315-3, DOI: 10.1145/2684746.2689060 *
CHEN, YU-HSIN ET AL.: "Eyeriss: An energy-efficient reconfigurable accelerator for deep convolutional neural networks", IEEE JOURNAL OF SOLID-STATE CIRCUITS, 2016
DUNDAR, AYSEGUL ET AL.: "Embedded Streaming Deep Neural Networks Accelerator With Applications", IEEE TRANSACTIONS ON NEURAL NETWORKS AND LEARNING SYSTEMS, 2016
HUIMIN LI ET AL: "A high performance FPGA-based accelerator for large-scale convolutional neural networks", 2016 26TH INTERNATIONAL CONFERENCE ON FIELD PROGRAMMABLE LOGIC AND APPLICATIONS (FPL), EPFL, 29 August 2016 (2016-08-29), pages 1 - 9, XP032971527, DOI: 10.1109/FPL.2016.7577308 *
MAURICE PEEMEN ET AL: "Memory-centric accelerator design for Convolutional Neural Networks", 2013 IEEE 31ST INTERNATIONAL CONFERENCE ON COMPUTER DESIGN (ICCD), 1 October 2013 (2013-10-01), pages 13 - 19, XP055195589, ISBN: 978-1-47-992987-0, DOI: 10.1109/ICCD.2013.6657019 *
SRIMAT CHAKRADHAR ET AL: "A dynamically configurable coprocessor for convolutional neural networks", PROCEEDINGS OF THE 37TH ANNUAL INTERNATIONAL SYMPOSIUM ON COMPUTER ARCHITECTURE, ISCA '10, ACM PRESS, NEW YORK, NEW YORK, USA, 19 June 2010 (2010-06-19), pages 247 - 257, XP058174461, ISBN: 978-1-4503-0053-7, DOI: 10.1145/1815961.1815993 *
YONGMING SHEN ET AL: "Maximizing CNN Accelerator Efficiency Through Resource Partitioning", ARXIV:1607.00064V1 [CS.AR], 30 June 2016 (2016-06-30), XP055303793, Retrieved from the Internet <URL:https://arxiv.org/abs/1607.00064v1> [retrieved on 20160919] *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI782328B (zh) * 2019-12-05 2022-11-01 國立清華大學 適用於神經網路運算的處理器
US20210174177A1 (en) * 2019-12-09 2021-06-10 Samsung Electronics Co., Ltd. Method and device with neural network implementation
US11829862B2 (en) * 2019-12-09 2023-11-28 Samsung Electronics Co., Ltd. Method and device with neural network implementation
CN112116071A (zh) * 2020-09-07 2020-12-22 地平线(上海)人工智能技术有限公司 神经网络计算方法、装置、可读存储介质以及电子设备
CN112434184A (zh) * 2020-12-15 2021-03-02 四川长虹电器股份有限公司 基于历史影视海报的深度兴趣网络的排序方法
CN112434184B (zh) * 2020-12-15 2022-03-01 四川长虹电器股份有限公司 基于历史影视海报的深度兴趣网络的排序方法
CN112926595A (zh) * 2021-02-04 2021-06-08 深圳市豪恩汽车电子装备股份有限公司 深度学习神经网络模型的训练装置、目标检测系统及方法
EP4258009A1 (fr) * 2022-04-14 2023-10-11 Aptiv Technologies Limited Procédé, appareil et produit programme informatique de classification de scènes

Similar Documents

Publication Publication Date Title
US11960999B2 (en) Method and apparatus with neural network performing deconvolution
WO2018158293A1 (fr) Attribution d'unités de calcul dans une classification d'objets
US11521039B2 (en) Method and apparatus with neural network performing convolution
US11508146B2 (en) Convolutional neural network processing method and apparatus
CN110175671B (zh) 神经网络的构建方法、图像处理方法及装置
US11461998B2 (en) System and method for boundary aware semantic segmentation
US9786036B2 (en) Reducing image resolution in deep convolutional networks
US20200160535A1 (en) Predicting subject body poses and subject movement intent using probabilistic generative models
US20220215227A1 (en) Neural Architecture Search Method, Image Processing Method And Apparatus, And Storage Medium
CN112215332B (zh) 神经网络结构的搜索方法、图像处理方法和装置
US11636306B2 (en) Implementing traditional computer vision algorithms as neural networks
CN111797983A (zh) 一种神经网络构建方法以及装置
WO2017155661A1 (fr) Analyse vidéo avec des réseaux de neurones récurrents d'attention convolutionnels
Fuhl et al. Multi layer neural networks as replacement for pooling operations
CN109918204B (zh) 数据处理系统及方法
EP3800585A1 (fr) Procédé et appareil de traitement de données
US11410040B2 (en) Efficient dropout inference for bayesian deep learning
US20210174177A1 (en) Method and device with neural network implementation
CN111931901A (zh) 一种神经网络构建方法以及装置
EP3971781A1 (fr) Procédé et appareil avec opération de réseau de neurones
CN111178495A (zh) 用于检测图像中极小物体的轻量卷积神经网络
KR20220097161A (ko) 인공신경망을 위한 방법 및 신경 프로세싱 유닛
EP3401840A1 (fr) Flux de données comprimés dans la reconnaissance d'objet
CN114120045B (zh) 一种基于多门控混合专家模型的目标检测方法和装置
CN114846382A (zh) 具有卷积神经网络实现的显微镜和方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 18708643

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 18708643

Country of ref document: EP

Kind code of ref document: A1