EP3766020B1 - Efficient Convolutional Engine - Google Patents

Efficient convolutional engine

Info

Publication number
EP3766020B1
Authority
EP
European Patent Office
Prior art keywords
storage element
data storage
multiplier
product
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
EP19708715.8A
Other languages
English (en)
French (fr)
Other versions
EP3766020C0 (de)
EP3766020A1 (de)
Inventor
Eugene M. Feinberg
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Recogni Inc
Original Assignee
Recogni Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Recogni Inc filed Critical Recogni Inc
Priority to EP23166763.5A (published as EP4220488A1)
Publication of EP3766020A1
Application granted
Publication of EP3766020C0
Publication of EP3766020B1
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/15 Correlation function computation including computation of convolution operations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F5/00 Methods or arrangements for data conversion without changing the order or content of the data handled
    • G06F5/01 Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/046 Forward inferencing; Production systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]

Definitions

  • the present invention relates to a hardware architecture for a convolutional engine, and more particularly relates to an efficient way to provide data values to compute units (called convolver units or functional units) of the convolutional engine.
  • convolutional neural networks are widely used for performing image recognition/classification, object recognition/classification and image segmentation. While having numerous applications (e.g., object identification for self-driving cars, facial recognition for social networks, etc.), neural networks require intensive computational processing and frequent memory accesses. Described herein is an efficient hardware architecture for implementing a convolutional neural network.
  • US 2017/0011006 to Saber et al. discloses a processor comprising an input buffer configured to receive data and store the data, a data extractor configured to extract kernel data corresponding to a kernel in the data from the input buffer, a multiplier configured to multiply the extracted kernel data by a convolution coefficient, and an adder configured to calculate a sum of multiplication results from the multiplier.
  • Figure 1 depicts a diagram providing an overview of the training phase and the inference phase in a neural network.
  • pairs of input and known (or desired) output may be provided to train model parameters (also called "weights") of classification model 104.
  • input 102 is a matrix of numbers (which may represent the pixels of an image)
  • known output 106 is a vector of classification probabilities (e.g., the probability that the input image is a cat is 1, the probability that the input image is a dog is 0, and the probability that the input image is a human is 0).
  • the classification probabilities may be provided by a human (e.g., a human can recognize that the input image depicts a cat and assign the classification probabilities accordingly).
  • the model parameters may simply be the parameters that minimize the error between the model's classification (or the model's classification probabilities) of a given set of input and the known classification (or known classification probabilities), while at the same time avoiding "model overfitting".
  • in the inference (or prediction or feed-forward) phase, classification model 104 with trained parameters (i.e., parameters trained during the training phase) is used to classify a set of input.
  • the trained classification model 104 provides the classification output 110 of a vector of probabilities (e.g., the probability that the input image is a cat is 0.3, the probability that the input image is a dog is 0.6, and the probability that the input image is a human is 0.1) in response to input 108.
  • classification model 104 is a convolutional neural network.
  • a basic building block of a convolutional neural network is a convolution operation, which is described in Figures 2-7 .
  • a convolution operation may refer to a 2-dimensional convolution operation with 2-dimensional input and a 2-dimensional filter, a 3-dimensional convolution operation with 3-dimensional input and a 3-dimensional filter, etc.
  • Figure 2 depicts a diagram of the input, model parameters and output of a 2-dimensional convolution operation.
  • the input includes a 2-dimensional matrix of numerical values (each of the numerical values abstractly represented by "•").
  • the matrix in the example of Figure 2 is a 4x4 matrix, but other input could have different dimensions (e.g., could be a 100x100 square matrix, a 20x70 rectangular matrix, etc.). Examples presented later will illustrate that the input may even be a 3-dimensional object. In fact, the input may be an object of any number of dimensions.
  • the input may represent pixel values of an image or may represent the output of a previous convolution operation.
  • the model parameters may include a filter and a bias.
  • the filter is a 3x3 matrix of values (the values also called "weights") and the bias is a scalar value.
  • the example in Figure 2 includes one filter, so there is one corresponding bias. However, in certain embodiments, if there were 5 filters, there would be 5 associated biases, one for each of the filters.
  • the convolution operator 208 receives input 202 and the model parameters 204, 206, and generates output 210 called an activation map or a feature map. Each value of the activation map is generated as the sum of a dot product between input 202 and filter 204 (at a certain spatial location relative to input 202) and bias 206.
  • the computations to arrive at activation map 210 are described in more detail below in Figure 3 .
  • the first row of Figure 3 describes the computation of the element at position (1, 1) of activation map 210, in which the center of filter 204 is spatially aligned with the element at position (1, 1) of input 202.
  • Such computation assumes the use of "zero padding" in which the input 202 is implicitly surrounded by a border of zeros.
  • zero padding is that the dimensions of input 202 and output activation map 210 remain constant when using a 3x3 filter.
  • a dot product is computed between filter 204 and the four values of input 202 that spatially align with filter 204. The dot product is then summed with bias b to arrive at the element at position (1, 1) of activation map 210.
  • the second row of Figure 3 describes the computation of the element at position (1, 2) of activation map 210.
  • the center of filter 204 is spatially aligned with the element at position (1, 2) of input 202.
  • a dot product is computed between filter 204 and the six values of input 202 that spatially align with filter 204.
  • the dot product is then summed with bias b to arrive at the element at position (1, 2) of activation map 210.
  • the third row of Figure 3 describes the computation of the element at position (1, 3) of activation map 210.
  • the center of filter 204 is spatially aligned with the element at position (1, 3) of input 202.
  • a dot product is computed between filter 204 and the six values of input 202 that spatially align with filter 204.
  • the dot product is then summed with bias b to arrive at the element at position (1, 3) of activation map 210.
  • the fourth row of Figure 3 describes the computation of the element at position (4, 4) of activation map 210.
  • the center of filter 204 is spatially aligned with the element at position (4, 4) of input 202.
  • a dot product is computed between filter 204 and these four values of input 202 that spatially align with filter 204.
  • the dot product is then summed with bias b to arrive at the element at position (4, 4) of activation map 210.
  • the convolution operation comprises a plurality of shift (or align), dot product and bias (or sum) steps.
  • the filter was shifted by 1 spatial position between dot product computations (called the step size or stride), but other step sizes of 2, 3, etc. are possible.
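A minimal NumPy sketch of the shift, dot product and bias steps just described, with zero padding and a configurable stride; the function name and test values are illustrative only and not part of the patent:

```python
import numpy as np

def conv2d(x, w, b, stride=1):
    # Slide the 3x3 filter over the zero-padded input (as in Figure 3),
    # computing a dot product plus bias at each aligned position.
    H, W = x.shape
    xp = np.pad(x, 1)                      # implicit border of zeros
    out = np.zeros((H // stride, W // stride))
    for i in range(0, H, stride):
        for j in range(0, W, stride):
            window = xp[i:i + 3, j:j + 3]  # values aligned with the filter
            out[i // stride, j // stride] = np.sum(window * w) + b
    return out

x = np.arange(16, dtype=float).reshape(4, 4)   # a 4x4 input, as in Figure 2
w = np.ones((3, 3))                            # illustrative filter weights
print(conv2d(x, w, b=1.0))                     # 4x4 activation map
```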
  • Figure 4 is similar to Figure 2 , except that there are F filters 404, F biases 406 and F activation maps 410 instead of a single filter 204, a single bias 206 and a single activation map 210.
  • the relation between the F filters 404, F biases 406 and F activation maps 410 is as follows: Filter f 1 , bias b 1 and input 402 are used to compute activation map y 1 (in very much the same way that filter 204, bias 206 and input 202 were used to compute activation map 210 in Figure 2 ); filter f 2 , bias b 2 and input 402 are used to compute activation map y 2 ; and so on.
  • Figure 5 is similar to Figure 2, except that instead of a 2-dimensional input 202 and a 2-dimensional filter 204, a 3-dimensional input 502 and a 3-dimensional filter 504 are used.
  • the computations to arrive at activation map 510 are described in more detail below in Figure 6 .
  • input 502 and filter 504 are 3-dimensional
  • activation map 510 is 2-dimensional, as will become clearer in the associated description of Figure 6 .
  • Each "slice" of filter 504 (analogous to a "channel" of input 502) may be called a kernel.
  • filter 504 is composed of five kernels
  • input 502 is composed of five channels.
  • the number of kernels of filter 504 (or the size of the "z" dimension of filter 504) must match the number of channels of input 502 (or the size of the "z" dimension of input 502).
  • channel 1 of input 502 aligns with kernel 1 of filter 504;
  • channel 2 of input 502 aligns with kernel 2 of filter 504; and so on.
  • the central axis 506 of filter 504 (with the central axis drawn parallel to the z-axis) is aligned with the elements at positions (1, 1, z) for z ∈ {1, ..., 5} of input 502.
  • a dot product is computed between filter 504 and the twenty values of input 502 that spatially align with filter 504 (4 aligned values per channel x 5 channels).
  • the dot product is then summed with bias b to arrive at the element at position (1, 1) of activation map 510.
  • the second row of Figure 6 describes the computation of the element at position (1, 2) of activation map 510.
  • the central axis 506 of filter 504 is aligned with the elements at positions (1, 2, z) for z ∈ {1, ..., 5} of input 502.
  • a dot product is computed between filter 504 and the thirty values of input 502 that spatially align with filter 504 (6 aligned values per channel x 5 channels).
  • the dot product is then summed with bias b to arrive at the element at position (1, 2) of activation map 510.
  • the third row of Figure 6 describes the computation of the element at position (1, 3) of activation map 510.
  • the central axis 506 of filter 504 is aligned with the elements at positions (1, 3, z) for z ∈ {1, ..., 5} of input 502.
  • a dot product is computed between filter 504 and the thirty values of input 502 that spatially align with filter 504 (6 aligned values per channel x 5 channels).
  • the dot product is then summed with bias b to arrive at the element at position (1, 3) of activation map 510.
  • the fourth row of Figure 6 describes the computation of the element at position (4, 4) of activation map 510.
  • the central axis 506 of filter 504 is aligned with the elements at positions (4, 4, z) for z ∈ {1, ..., 5} of input 502.
  • a dot product is computed between filter 504 and the twenty values of input 502 that spatially align with filter 504 (4 aligned values per channel x 5 channels).
  • the dot product is then summed with bias b to arrive at the element at position (4, 4) of activation map 510.
  • Figure 7 is similar to Figure 5 , except that there are F 3-dimensional filters 704, F biases 706 and F activation maps 710 ( F > 1), instead of a single 3-dimensional filter 504, a single bias 505 and a single activation map 510.
  • the relation between the F 3-dimensional filters 704, F biases 706 and F activation maps 710 is as follows: Filter f 1 , bias b 1 and input 702 are used to compute activation map y 1 (in very much the same way that filter 504, bias 505 and input 502 were used to compute activation map 510 in Figure 5 ); filter f 2 , bias b 2 and input 702 are used to compute activation map y 2 ; and so on.
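Extending the earlier sketch to the multi-channel, multi-filter case of Figures 5-7: each of the F 3-dimensional filters is dotted with the volume of input that aligns with it, summing over all channels, and each filter has its own bias. Shapes and names below are illustrative assumptions:

```python
import numpy as np

def conv3d(x, filters, biases):
    # x: (C, H, W) input; filters: (F, C, 3, 3); biases: (F,).
    # Returns F 2-dimensional activation maps, as in Figure 7.
    F, C = filters.shape[:2]
    H, W = x.shape[1:]
    xp = np.pad(x, ((0, 0), (1, 1), (1, 1)))   # zero-pad spatial dims only
    out = np.zeros((F, H, W))
    for f in range(F):
        for i in range(H):
            for j in range(W):
                # dot product across all C channels (Figure 6), plus bias
                out[f, i, j] = np.sum(xp[:, i:i + 3, j:j + 3] * filters[f]) + biases[f]
    return out
```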
  • Figure 8 depicts convolutional engine 708, in accordance with one embodiment of the invention.
  • Convolutional engine 708 (depicted in Figure 8 ) is a hardware architecture of the convolution operator (“conv") 708 (depicted in Figure 7 ).
  • Convolutional engine 708 includes a 2-D shift register with an array of data storage elements:

        d_{1,1}  d_{1,2}  d_{1,3}  d_{1,4}
        d_{2,1}  d_{2,2}  d_{2,3}  d_{2,4}
        d_{3,1}  d_{3,2}  d_{3,3}  d_{3,4}
        d_{4,1}  d_{4,2}  d_{4,3}  d_{4,4}
  • the array is a four by four array.
  • Each of the data storage elements may be formed by a plurality of D flip-flops (i.e., one D flip-flop to store each bit of a data signal). Therefore, if data storage element d 1,1 were to store eight bits, d 1,1 may be formed from eight D flip-flops.
  • Each of the arrows between pairs of data storage elements represents an electrical connection (i.e., may be implemented as a wire). For example, data storage element d 1,1 (ref. num. 802) is electrically coupled to data storage element d 2,1 via electrical connection 804.
  • the arrow may represent a one-directional flow of data (i.e., data being transmitted from data storage element d 1,1 to data storage element d 2,1 , but not from d 2,1 to data storage element d 1,1 ).
  • the first row of data storage elements may be called a "header", and the last row of data storage elements may be called a "footer".
  • Convolutional engine 708 may further include an array of convolver units:

        CU_{1,1}  CU_{1,2}  CU_{1,3}  CU_{1,4}
        CU_{2,1}  CU_{2,2}  CU_{2,3}  CU_{2,4}
  • an array of convolver units may be called "a convolver array".
  • the convolver array is a two by four array.
  • Convolver unit CU 1,2 has been labeled with reference numeral 806 (to facilitate later discussion). It is understood that a more typical embodiment will contain many more convolver units, such as in the example embodiment of Figure 30 .
  • the operation of the 2-D shift register and the operation of the convolver units will be described in detail in the following figures.
  • Figure 9A depicts the loading of data values into convolutional engine 708, in accordance with one embodiment of the invention.
  • Each channel of input may be loaded into convolutional engine 708 in a serial fashion.
  • Figure 9A depicts the loading of the first channel 702a of input 702 into convolutional engine 708 (assuming that the channels are numbered from 1 to 5 in the left to right direction).
  • the rows of a particular channel may be loaded into convolutional engine 708 in a serial fashion.
  • terms such as "row" and "column" are used for convenience and with respect to how elements are depicted in the figures. Nevertheless, the meaning of such terms may or may not translate into how circuit elements are laid out on a chip, where a row could be interpreted as a column and vice versa, depending on the viewer's orientation with respect to the chip.
  • this first example describing the hardware architecture of a convolutional engine will handle the case in which the number of columns of an input channel is equal to the number of columns of the convolver array.
  • the number of columns of input channel 702a is assumed to equal the number of columns of the convolver array.
  • input channel 702a may be a ten by four matrix of data values.
  • Figures 27A-27C describe how to handle the scenario in which the number of columns of an input channel is greater than the number of columns of the convolver array.
  • Figures 28 , 29A and 29B describe two schemes for handling the case in which the number of columns of an input channel is less than the number of columns of the convolver array.
  • convolutional engine 708 can only compute the convolution operation for a certain number of contiguous rows of the data values before the output needs to be saved (copied to a memory location separate from the convolver units - see memory 3002 in Figure 30 ). Once the output is saved, the convolutional engine 708 can continue onto the next set of contiguous rows.
  • convolutional engine 708 can compute the output of n contiguous input rows (plus two padding rows, explained below). For simplicity of explanation, n contiguous input rows will be called a "horizontal stripe" of data.
  • the loading of a leading row (i.e., first row of a horizontal stripe to be loaded) that is an external edge may be preceded by the loading of a zero padding row (as in row n of horizontal stripe 902a); the loading of a trailing row (i.e., last row of a horizontal stripe to be loaded) that is an external edge may be followed by the loading of a zero padding row (as in row 1 of horizontal stripe 902b); the loading of a leading row that is an internal edge may be preceded by the loading of a data padding row (as in row n of horizontal stripe 902b); and the loading of a trailing row that is an internal edge may be followed by the loading of a data padding row (as in row 1 of horizontal stripe 902a).
  • an "external edge” refers to a leading or trailing row of a horizontal stripe that forms an external boundary of an input channel
  • an internal edge refers to a leading or trailing row of a horizontal stripe that is not part of an external boundary of an input channel.
  • the reason for the zero or data padding row is tied to the 3x3 filter requiring data from a row above and a row below the row of interest to compute the convolution output. For a 5x5 filter, two padding rows (for the top row of a stripe) and two padding rows (for the bottom row of a stripe) or a total of four padding rows would have been needed.
  • n+2 rows within the bolded and dashed rectangle are loaded into convolutional engine 708.
  • the n+2 rows include a zero padding row, the n rows of horizontal stripe 902a and a data padding row (equivalent to row n of horizontal stripe 902b).
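The choice of padding rows around each horizontal stripe can be expressed compactly. The following sketch (a hypothetical helper, not from the patent) yields, for each stripe, the n+2 rows fed to the engine: zeros at an external edge, otherwise the adjacent data row from the neighboring stripe:

```python
import numpy as np

def stripe_rows(x, n):
    # For each horizontal stripe of n rows, yield the n+2 rows loaded into
    # the engine: the padding row on one side (zeros at an external edge,
    # otherwise a data padding row), the stripe itself in load order, and
    # the padding row on the other side.
    H, W = x.shape
    zeros = np.zeros(W)
    for top in range(0, H, n):
        above = x[top - 1] if top > 0 else zeros
        below = x[top + n] if top + n < H else zeros
        yield [above] + list(x[top:top + n]) + [below]
```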
  • Figures 9C-9D depict the loading of filter weights to convolutional engine 708, in accordance with one embodiment of the invention. More specifically, Figure 9C depicts the loading of the nine weights of kernel 704a into each of the convolver units of the first row of the convolver array (i.e., CU 1,1 , CU 1,2 , CU 1,3 and CU 1,4 ), and Figure 9D depicts the loading of the nine weights of kernel 704b into each of the convolver units of the second row of the convolver array (i.e., CU 2,1 , CU 2,2 , CU 2,3 and CU 2,4 ).
  • Kernel 704a is the first kernel of filter f 1 , and each of its weights is labeled with the superscript "1,1", which is shorthand for (filter f 1 , kernel 1).
  • Kernel 704b is the first kernel of filter f 2 , and each of its weights is labeled with the superscript "2,1", which is shorthand for (filter f 2 , kernel 1).
  • Figures 10A-10B depict the loading of a row of zero values into the 2-D shift register.
  • Figures 10B-10D depict a row-by-row loading of data values from the first input channel 702a into the 2-D shift register and a row-to-row shifting of the data values through the 2-D shift register.
  • Data values x_{n,1}, x_{n,2}, x_{n,3} and x_{n,4} may represent values from row n of horizontal stripe 902a of input channel 702a.
  • Data values x_{n-1,1}, x_{n-1,2}, x_{n-1,3} and x_{n-1,4} may represent values from row n-1 of horizontal stripe 902a of input channel 702a.
  • Data values x_{n-2,1}, x_{n-2,2}, x_{n-2,3} and x_{n-2,4} may represent values from row n-2 of horizontal stripe 902a of input channel 702a.
  • once row n of horizontal stripe 902a has been loaded into the data storage elements corresponding to the first row of convolver units (i.e., CU 1,1, CU 1,2, CU 1,3 and CU 1,4), that row of convolver units may be activated.
  • by "corresponding," it is meant that there is a logical correspondence between convolver unit CU 1,1 and data storage element d 2,1, convolver unit CU 1,2 and data storage element d 2,2, and so on.
  • Active convolver units are drawn in Figure 11A in bolded lines while non-active convolver units are drawn in Figure 11A using non-bolded lines.
  • "active" means that a convolver unit is powered on, while "non-active" means that a convolver unit is powered off to save power.
  • a controller (depicted as controller 2202 in Figure 22 and controller 3006 in Figure 30 , but not depicted in other figures for conciseness of presentation) may be responsible for powering on and off convolver units.
  • the controller may power on a row of convolver units once the data from row n of a horizontal stripe has been loaded into the data storage elements corresponding to the row of convolver units.
  • the controller may power off a row of convolver units once data from row 1 of a horizontal stripe has been transferred out of the data storage elements corresponding to the row of convolver units.
  • Figures 11A and 11B describe the processing of two out of the four active convolver units for the spatial orientation of the data values depicted in Figure 10D. While the processing of the two convolver units is described in two separate figures, it is understood that such processing typically occurs in parallel (i.e., at the same time) in order to increase the number of computations per clock cycle.
  • convolver unit CU 1,1 receives data and/or zero values from five neighboring data storage elements and one data value from the data storage element corresponding to convolver unit CU 1,1.
  • convolver unit CU 1,1 may compute the partial sum y1 = w2^{1,1}·x_{n-1,1} + w3^{1,1}·x_{n-1,2} + w5^{1,1}·x_{n,1} + w6^{1,1}·x_{n,2} (where w2^{1,1}, w3^{1,1}, w5^{1,1} and w6^{1,1} are four of the nine weights of kernel 704a depicted in Figure 9C) and store the partial sum y1 in accumulator 1102a of convolver unit CU 1,1.
  • Accumulator 1102a may be part of a linear array of n accumulators, where n is the number of rows within horizontal stripe 902a. Accumulator 1102a may be configured to store the partial sums corresponding to row n of a horizontal stripe; accumulator 1102b may be configured to store the partial sums corresponding to row n-1 of a horizontal stripe; and so on.
  • the bottom instance of convolver unit CU 1,1 and the top instance of convolver unit CU 1,1 are one and the same convolver unit, with the bottom instance showing additional details of the top instance.
  • convolver unit CU 1,2 receives data and/or zero values from eight neighboring data storage elements and one data value from the data storage element corresponding to convolver unit CU 1,2.
  • convolver unit CU 1,2 may compute the partial sum y2 = w1^{1,1}·x_{n-1,1} + w2^{1,1}·x_{n-1,2} + w3^{1,1}·x_{n-1,3} + w4^{1,1}·x_{n,1} + w5^{1,1}·x_{n,2} + w6^{1,1}·x_{n,3} (where w1^{1,1}, w2^{1,1}, w3^{1,1}, w4^{1,1}, w5^{1,1} and w6^{1,1} are six of the nine weights of kernel 704a depicted in Figure 9C) and store the partial sum y2 in accumulator 1104a of convolver unit CU 1,2.
  • Figure 12 depicts the 2-D shift register after the data and/or zero values have been shifted down one row of data storage elements, and data values x_{n-2,1}, x_{n-2,2}, x_{n-2,3} and x_{n-2,4} from row n-2 of horizontal stripe 902a have been loaded into the 2-D shift register.
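The behavior of the 2-D shift register itself can be modeled in a few lines: on each clock cycle every row of data storage elements passes its contents to the row below, and the next input row (data, zero padding, or data padding) enters at the top. A minimal sketch with illustrative values:

```python
import numpy as np

def shift_in(register, new_row):
    # Shift every row of data storage elements down by one, then load
    # the next input row into the top ("header") row of the register.
    register[1:] = register[:-1].copy()
    register[0] = new_row
    return register

reg = np.zeros((4, 4))                        # the 4x4 array of Figure 8
shift_in(reg, np.zeros(4))                    # zero padding row
shift_in(reg, np.array([1., 2., 3., 4.]))     # row n of the stripe
shift_in(reg, np.array([5., 6., 7., 8.]))     # row n-1 of the stripe
```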
  • Figures 13A-13D describe the processing of four of the eight active convolver units, in accordance with one embodiment of the invention. While the processing of the four convolver units is described in four separate figures, it is understood that such processing typically occurs in parallel (i.e., at the same time) in order to increase the number of computations per clock cycle.
  • convolver unit CU 1,1 may receive data values from the five neighboring data storage elements and the one corresponding data storage element.
  • Convolver unit CU 1,1 may compute the partial sum y5 = w2^{1,1}·x_{n-2,1} + w3^{1,1}·x_{n-2,2} + w5^{1,1}·x_{n-1,1} + w6^{1,1}·x_{n-1,2} + w8^{1,1}·x_{n,1} + w9^{1,1}·x_{n,2} and store the partial sum y5 in accumulator 1102b of convolver unit CU 1,1.
  • convolver unit CU 1,2 receives data values from the eight neighboring data storage elements and the one corresponding data storage element.
  • Convolver unit CU 1,2 computes the partial sum y6 = w1^{1,1}·x_{n-2,1} + w2^{1,1}·x_{n-2,2} + w3^{1,1}·x_{n-2,3} + w4^{1,1}·x_{n-1,1} + w5^{1,1}·x_{n-1,2} + w6^{1,1}·x_{n-1,3} + w7^{1,1}·x_{n,1} + w8^{1,1}·x_{n,2} + w9^{1,1}·x_{n,3} and stores the partial sum y6 in accumulator 1104b of convolver unit CU 1,2.
  • convolver unit CU 1,3 receives data values from the eight neighboring data storage elements and the one corresponding data storage element.
  • Convolver unit CU 1,3 computes the partial sum y7 = w1^{1,1}·x_{n-2,2} + w2^{1,1}·x_{n-2,3} + w3^{1,1}·x_{n-2,4} + w4^{1,1}·x_{n-1,2} + w5^{1,1}·x_{n-1,3} + w6^{1,1}·x_{n-1,4} + w7^{1,1}·x_{n,2} + w8^{1,1}·x_{n,3} + w9^{1,1}·x_{n,4} and stores the partial sum y7 in accumulator 1106b of convolver unit CU 1,3.
  • convolver unit CU 2,1 may receive data and/or zero values from the five neighboring data storage elements and the one corresponding data storage element. Convolver unit CU 2,1 may then compute the partial sum y9 = w2^{2,1}·x_{n-1,1} + w3^{2,1}·x_{n-1,2} + w5^{2,1}·x_{n,1} + w6^{2,1}·x_{n,2} (where w2^{2,1}, w3^{2,1}, w5^{2,1} and w6^{2,1} are four of the nine weights of kernel 704b depicted in Figure 9D) and store the partial sum y9 in accumulator 1110a of convolver unit CU 2,1.
  • Figure 14A depicts the loading of data values from the second input channel 702b into convolutional engine 708, in accordance with one embodiment of the invention.
  • the second input channel 702b may include horizontal stripes 904a and 904b, and horizontal stripe 904a may be loaded into convolutional engine 708 in a similar manner as horizontal stripe 902a was loaded.
  • Figures 14C-14D depict the loading of filter weights into convolutional engine 708, in accordance with one embodiment of the invention. More specifically, Figure 14C depicts the loading of the nine weights of kernel 704c into each of the convolver units of the first row of the convolver array (i.e., CU 1,1, CU 1,2, CU 1,3 and CU 1,4), and Figure 14D depicts the loading of the nine weights of kernel 704d into each of the convolver units of the second row of the convolver array (i.e., CU 2,1, CU 2,2, CU 2,3 and CU 2,4).
  • Kernel 704c is the second kernel of filter f 1 , and each of its weights is labeled with the superscript "1,2", which is shorthand for (filter f 1 , kernel 2).
  • Kernel 704d is the second kernel of filter f 2 , and each of its weights is labeled with the superscript "2,2", which is shorthand for (filter f 2 , kernel 2).
  • Figures 15A-15B depict the loading of a row of zero values into the 2-D shift register.
  • Figures 15B-15D depict a row-by-row loading of data values from the second input channel 702b into the 2-D shift register and a row-to-row shifting of the data values through the 2-D shift register.
  • Data values x'_{n,1}, x'_{n,2}, x'_{n,3} and x'_{n,4} may represent values from row n of horizontal stripe 904a of input channel 702b.
  • Data values x'_{n-1,1}, x'_{n-1,2}, x'_{n-1,3} and x'_{n-1,4} may represent values from row n-1 of horizontal stripe 904a of input channel 702b.
  • Data values x'_{n-2,1}, x'_{n-2,2}, x'_{n-2,3} and x'_{n-2,4} may represent values from row n-2 of horizontal stripe 904a of input channel 702b.
  • the first row of convolver units may be activated (as shown in Figure 16A ).
  • Figures 16A and 16B describe the processing of two out of the four active convolver units for the spatial orientation of the data values depicted in Figure 15D.
  • convolver unit CU 1,1 may receive data and/or zero values from the five neighboring data storage elements and one data value from the data storage element corresponding to convolver unit CU 1,1 .
  • convolver unit CU 1,1 may compute the partial sum y13 = w2^{1,2}·x'_{n-1,1} + w3^{1,2}·x'_{n-1,2} + w5^{1,2}·x'_{n,1} + w6^{1,2}·x'_{n,2} (where w2^{1,2}, w3^{1,2}, w5^{1,2} and w6^{1,2} are four of the nine weights of kernel 704c depicted in Figure 14C).
  • the partial sum y 13 may be summed with y 1 (the partial sum previously computed by convolver unit CU 1,1 for row n) and the new partial sum y 1 + y 13 may be stored in accumulator 1102a.
  • convolver unit CU 1,2 receives data and/or zero values from the eight neighboring data storage elements and one data value from the data storage element corresponding to convolver unit CU 1,2 .
  • convolver unit CU 1,2 may compute the partial sum y14 = w1^{1,2}·x'_{n-1,1} + w2^{1,2}·x'_{n-1,2} + w3^{1,2}·x'_{n-1,3} + w4^{1,2}·x'_{n,1} + w5^{1,2}·x'_{n,2} + w6^{1,2}·x'_{n,3} (where w1^{1,2}, w2^{1,2}, w3^{1,2}, w4^{1,2}, w5^{1,2} and w6^{1,2} are six of the nine weights of kernel 704c depicted in Figure 14C).
  • the partial sum y14 may be summed with y2 (the partial sum previously computed by convolver unit CU 1,2 for row n) and the new partial sum y2 + y14 may be stored in accumulator 1104a.
  • Figure 17 depicts the 2-D shift register after the data and/or zero values have been shifted down one row of data storage elements, and data values x'_{n-2,1}, x'_{n-2,2}, x'_{n-2,3} and x'_{n-2,4} from row n-2 of horizontal stripe 904a have been loaded into the 2-D shift register.
  • Figures 18A-18B describe the processing of two of the eight active convolver units, in accordance with one embodiment of the invention.
  • convolver unit CU 1,1 may receive data values from the five neighboring data storage elements and the one corresponding data storage element.
  • Convolver unit CU 1,1 may then compute the partial sum y17 = w2^{1,2}·x'_{n-2,1} + w3^{1,2}·x'_{n-2,2} + w5^{1,2}·x'_{n-1,1} + w6^{1,2}·x'_{n-1,2} + w8^{1,2}·x'_{n,1} + w9^{1,2}·x'_{n,2}.
  • the partial sum y17 may be summed with y5 (the partial sum previously computed by convolver unit CU 1,1 for row n-1) and the new partial sum y5 + y17 may be stored in accumulator 1102b.
  • convolver unit CU 1,2 receives data values from the eight neighboring data storage elements and the one corresponding data storage element. Convolver unit CU 1,2 then computes the partial sum y18 = w1^{1,2}·x'_{n-2,1} + w2^{1,2}·x'_{n-2,2} + w3^{1,2}·x'_{n-2,3} + w4^{1,2}·x'_{n-1,1} + w5^{1,2}·x'_{n-1,2} + w6^{1,2}·x'_{n-1,3} + w7^{1,2}·x'_{n,1} + w8^{1,2}·x'_{n,2} + w9^{1,2}·x'_{n,3}.
  • the partial sum y18 may be summed with y6 (the partial sum previously computed by convolver unit CU 1,2 for row n-1) and the new partial sum y6 + y18 may be stored in accumulator 1104b.
  • the processing of the 2-D shift register and the plurality of convolutional units continues in a similar fashion until row 1 of horizontal stripe 904a has been shifted through the 2-D shift register.
  • the processing of the 2-D shift register and the plurality of convolutional units then continues until all of the remaining input channels have been processed in a manner similar to the processing of the first two input channels.
  • bias values may be loaded into the convolver units. More specifically, Figure 19A depicts the loading of bias value b1 into the first row of convolver units (CU 1,1, CU 1,2, CU 1,3 and CU 1,4) and Figure 19B depicts the loading of bias value b2 into the second row of convolver units (CU 2,1, CU 2,2, CU 2,3 and CU 2,4).
  • the partial sums computed by the first row of convolver units may be biased by bias value b 1
  • the partial sums computed by the second row of convolver units may be biased by bias value b 2 (as depicted in Figure 20 ) to yield the output of the convolution operation.
  • in the examples discussed so far, the number of rows of the convolver array equals the number of filters. This relationship, however, does not always hold. If the number of filters were less than the number of rows of the convolver array, unused rows of the convolver array could be deactivated. If the number of filters were more than the number of rows of the convolver array, the convolution operations would essentially need to be repeated, as illustrated in the sketch below. For instance, if there were six filters and only three rows of convolver units, then the convolution operations could be performed for filters 1-3, and the same convolution operations would be repeated, except that filters 1-3 would be substituted with filters 4-6.
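This scheduling of filters onto rows of convolver units can be sketched as a simple grouping (a hypothetical helper, not from the patent):

```python
def filter_groups(num_filters, num_rows):
    # Split the filters into groups no larger than the number of rows of
    # convolver units; each group requires one full pass over the input.
    return [list(range(i, min(i + num_rows, num_filters)))
            for i in range(0, num_filters, num_rows)]

print(filter_groups(6, 3))   # [[0, 1, 2], [3, 4, 5]] -> two passes
```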
  • the architecture essentially attempts to strike a balance between the fan-out of data storage elements (related to the sizing of circuit components) and the number of computations per clock cycle (related to the speed of computation).
  • the 2-D shift register could have been reduced to three rows of data storage elements, with CU 1,1 , CU 2,1 , CU 3,1 , ... wired to the same six data storage elements; CU 1,2 , CU 2,2 , CU 3,2 , ... wired to the same nine data storage elements, etc.
  • FIG. 21 depicts internal components of convolver unit 806 (i.e., CU 1,2 ), in accordance with one embodiment of the invention.
  • Convolver unit 806 includes nine multipliers (2102a, ... , 2102i). Each of the multipliers is electrically coupled to a data storage element (i.e., one of the data storage elements of the 2-D shift register) and is configured to receive a data value stored in the corresponding data storage element.
  • multipliers 2102a, 2102b, 2102c, 2102d, 2102e, 2102f, 2102g, 2102h, and 2102i are electrically coupled to data storage elements d 1,1, d 1,2, d 1,3, d 2,1, d 2,2, d 2,3, d 3,1, d 3,2, and d 3,3, respectively, and are configured to receive data values x1, x2, x3, x4, x5, x6, x7, x8, and x9 from those data storage elements.
  • the data value stored in a data storage element typically changes with each clock cycle. For example, in the context of Figure 10C, x1 would equal x_{n,1}; in Figure 10D, x1 would equal x_{n-1,1}; and so on. The same comment applies for the other data values.
  • Each of the multipliers is further configured to receive a weight.
  • multipliers 2102a, 2102b, 2102c, 2102d, 2102e, 2102f, 2102g, 2102h, and 2102i are configured to receive weights w 1 , w 2 , w 3 , w 4 , w 5 , w 6 , w 7 , w 8 , and w 9 , respectively.
  • a different set of weights may be loaded for each channel of input data 702. For example, in the context of Figure 9C, w1 would equal w1^{1,1}; in the context of Figure 14C, w1 would equal w1^{1,2}; and so on.
  • multipliers 2102a, 2102b, 2102c, 2102d, 2102e, 2102f, 2102g, 2102h, and 2102i may multiply data values x 1 , x 2 , x 3 , x 4 , x 5 , x 6 , x 7 , x 8 , and x 9 with weights w 1 , w 2 , w 3 , w 4 , w 5 , w 6 , w 7 , w 8 , and w 9 so as to generate the products w 1 x 1 , w 2 x 2 , w 3 x 3 , w 4 x 4 , w 5 x 5 , w 6 x 6 , w 7 x 7 , w 8 x 8 , and w 9 x 9 , respectively.
  • a specialized multiplier may be implemented using a bit-shifter and an adder (the specialized multiplier further performing a log-to-linear transformation).
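To make the idea concrete: if a weight is quantized to a signed power of two (stored as a sign and a log2 exponent), the multiplication reduces to a bit shift. This is only a sketch of the principle behind such a specialized multiplier (cf. the logarithmic representation of Miyashita et al., cited below), not the patented circuit:

```python
def shift_multiply(x, w_sign, w_exp):
    # Multiply integer activation x by a weight w = w_sign * 2**w_exp:
    # the multiplication reduces to a left or right bit shift.
    p = x << w_exp if w_exp >= 0 else x >> -w_exp
    return w_sign * p

print(shift_multiply(12, -1, 2))   # -48, i.e., 12 * (-4)
```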
  • Convolver unit 806 may further include a plurality of adders and the values that are summed by the adders may depend on control signal s1.
  • control signal s1 may be set to 0, causing output selector 2106 to deliver the zero value to adder 2104h.
  • the partial sum w1·x1 + w2·x2 + w3·x3 + w4·x4 + w5·x5 + w6·x6 + w7·x7 + w8·x8 + w9·x9 is computed, and is not based on any previous partial sums.
  • the partial sum is then stored in one of the accumulators 1104a, 1104b, etc. depending on which row of a horizontal stripe the data values are from. If the data values are from row n, the partial sum would be stored in accumulator 1104a; if the data values are from row n-1, the partial sum would be stored in accumulator 1104b; and so on.
  • control signal s1 may be set to 1, causing output selector 2106 to deliver a previously computed partial sum to adder 2104h.
  • the previously computed partial sum stored in accumulator 1104a would be provided to adder 2104h; if the data values are from row n-1, the previously computed partial sum stored in accumulator 1104b would be provided to adder 2104h; and so on.
  • output selector 2106 may be configured to deliver a partial sum from an accumulator to adder 2104j, which sums the partial sum with bias b k .
  • the resulting sum may be stored back into the accumulator from which the partial sum was read.
  • an entire vector of partial sums may be read from the accumulator array (1104a, 1104b, ...), summed with bias b k , and the vector (now with biasing) may be stored back into the accumulator array.
  • Such computation may implement the biasing operation described for CU 1,2 in Figure 20 .
  • specialized adders may receive two values in the linear domain (since the preceding specialized multipliers performed a log-to-linear transformation) and return the resulting sum in the log domain. Details of such specialized adders may also be found in Daisuke Miyashita et al. "Convolutional Neural Networks using Logarithmic Data Representation" arXiv preprint arXiv:1603.01025, 2016 .
  • any of the convolver units that receive nine data values (and nine weights) may have a similar hardware architecture as convolver unit CU 1,2 , and hence will not be described for conciseness.
  • the hardware architecture could still be similar to the hardware architecture of convolver unit CU 1,2 , except that some of the inputs to the multipliers could be hardwired to the zero value (data input or weight could be set to the zero value).
  • weights w 1 , w 4 and w 7 could be set to zero.
  • some of the multipliers could even be omitted.
  • multipliers 2102a, 2102d and 2102g could be omitted.
  • the computations of all nine multipliers (or their equivalents in the log domain) and nine adders (or their equivalents in the log domain) all take place within one clock cycle. That is, if data values are stored in the nine data storage elements at clock cycle n, the partial sum is stored in the accumulators at clock cycle n+1. Further, for increased throughput, new data values are stored in the nine data storage elements at clock cycle n+1 while the partial sum is stored. Therefore, the computation of a new partial sum is performed during every clock cycle.
  • the stride (or the step size) is the number of pixels or data values that the filter is shifted between dot product operations.
  • Figure 22 illustrates that by setting every odd row and every odd column of convolver units to be active and setting every even row and every even column of convolver units to be non-active (by means of control signals provided by controller 2202), a stride of 2 may be achieved. It should be apparent how other stride values can be set.
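The active/non-active pattern for a given stride can be sketched as a simple mask over the convolver array (an illustrative helper with 0-indexed rows and columns, not the patent's control logic):

```python
def active_mask(num_rows, num_cols, stride):
    # Convolver units whose (0-indexed) row and column are both multiples
    # of the stride stay active; all others are powered off.
    return [[(r % stride == 0) and (c % stride == 0)
             for c in range(num_cols)]
            for r in range(num_rows)]

for row in active_mask(4, 4, 2):   # stride of 2, as in Figure 22
    print(row)
```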
  • for a stride of 3, rows 3x+1 for x ∈ {0, 1, 2, ...} of convolver units and columns 3x+1 for x ∈ {0, 1, 2, ...} of convolver units may be set to be active and all other rows and columns may be set to be non-active. Even strides of less than 1 are possible. For example, for a stride of 1/2, input 702 can be interpolated before it is loaded into convolutional engine 708.
  • given a 2x2 input matrix

        a  b
        c  d

    the following 3x3 interpolated matrix can be provided as input to convolutional engine 708 in order to achieve a stride of 1/2:

        a        (a+b)/2      b
        (a+c)/2  (a+b+c+d)/4  (b+d)/2
        c        (c+d)/2      d
  • while a linear interpolation was used in the present example, it is understood that other forms of interpolation (e.g., polynomial interpolation, spline interpolation, etc.) are also possible.
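A sketch of the linear (bilinear) case, upsampling an HxW input to (2H-1)x(2W-1) so that a stride of 1 over the interpolated input emulates a stride of 1/2; the implementation is illustrative:

```python
import numpy as np

def interpolate_half_stride(x):
    # Bilinear upsampling: original samples at even positions, averages of
    # neighbors in between, matching the 3x3 matrix shown above for 2x2 input.
    H, W = x.shape
    out = np.zeros((2 * H - 1, 2 * W - 1))
    out[::2, ::2] = x                                   # original samples
    out[1::2, ::2] = (x[:-1] + x[1:]) / 2               # between rows
    out[:, 1::2] = (out[:, :-2:2] + out[:, 2::2]) / 2   # between columns
    return out

x = np.array([[1., 2.], [3., 4.]])   # the 2x2 matrix [[a, b], [c, d]]
print(interpolate_half_stride(x))
```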
  • a convolutional neural network typically involves other types of operations, such as the max pool and rectification operators.
  • the convolver unit was presented first for ease of understanding; a more generalized form of the convolver unit, called a "functional unit," will now be described for handling other types of operations common in a convolutional neural network in addition to the convolution operation.
  • FIG. 23 depicts convolutional engine 2300 including a 2-D shift register and an array of functional units, in accordance with one embodiment of the invention.
  • Convolutional engine 2300 is similar to the above-described convolutional engine 708, except that the convolver units have been replaced with functional units.
  • One of the functional units, FU 1,2, is labeled as 2302 and its hardware architecture is described below in Figure 24.
  • Figure 24 depicts internal components of functional unit 2302, in accordance with one embodiment of the invention. There are two main differences between functional unit 2302 and convolver unit 806. First, functional unit 2302 has the ability to compute the maximum of a sum (needed to perform the max pool operation). Second, functional unit 2302 has the ability to compute the rectification of a value. In order to compute the maximum of a sum, each of the nine adders (2104a, ..., 2104i) of the convolver unit may be replaced with a function selector (2404a, ..., 2404i). The function selector receives control signal s2, allowing the selection between an adder and a comparator (see inset in Figure 24 ).
  • when the adders are selected (via control signal s2), the functional unit is transformed back into the hardware architecture of convolver unit 806, and functional unit 2302 is configured to perform the above-described convolution operation.
  • with the comparators selected, functional unit 2302 is configured to compute max(w1·x1, w2·x2, w3·x3, w4·x4, w5·x5, w6·x6, w7·x7, w8·x8, w9·x9) when control signal s1 is set to 0, and max(w1·x1, w2·x2, w3·x3, w4·x4, w5·x5, w6·x6, w7·x7, w8·x8, w9·x9, previous partial sum) when control signal s1 is set to 1.
  • in this manner, the maximum of the pointwise multiplication of a three-dimensional filter (e.g., f1) with a three-dimensional volume of input (i.e., a volume of input that aligns with the filter, as described in Figure 6) may be computed.
  • the max pool operator may be implemented with the comparators of a functional unit selected and the stride set equal to the magnitude of one dimension of a kernel of the filter (e.g., for a 3x3 kernel, the stride would be set to be 3).
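A minimal sketch of this max pool configuration, assuming non-overlapping k x k windows and all weights equal to 1 (function name and test values are illustrative):

```python
import numpy as np

def max_pool(x, k):
    # Max pool as performed by the functional units: comparators selected
    # instead of adders, all weights 1, stride equal to the kernel size k.
    H, W = x.shape
    out = np.zeros((H // k, W // k))
    for i in range(0, H - k + 1, k):
        for j in range(0, W - k + 1, k):
            out[i // k, j // k] = np.max(x[i:i + k, j:j + k])
    return out

x = np.arange(36, dtype=float).reshape(6, 6)
print(max_pool(x, 3))   # 2x2 output of window maxima
```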
  • when control signal s1 is set to 2, the functional unit is configured to perform the rectification operation.
  • rectifier 2408 can be configured to return 0 whenever the sign bit indicates a negative number or if the zero bit is set, and return the magnitude otherwise.
  • when control signal s1 is set to 3, the functional unit is configured to add a bias value to the data stored in accumulators 1104a, 1104b, etc., similar to the operation of convolver unit 806.
  • Figure 25 depicts three scenarios of data values being loaded from input channel 702a into convolutional engine 708 having m columns of convolver units, with scenario (a) illustrating input channel 702a having m columns of data values, scenario (b) illustrating input channel 702a having 3m-4 columns of data values, and scenario (c) illustrating input channel 702a having m/2 columns of data values, in accordance with one embodiment of the invention.
  • Scenario (a) was previously described in Figure 9B , but will be more fully discussed in Figures 26A-26B .
  • Scenario (b) discusses an example in which the number of columns of input channel 702a is greater than the number of columns of the convolver array.
  • Scenario (c) discusses an example in which the number of columns of input channel 702a is less than the number of columns of the convolver array. While a convolutional engine is more abstractly depicted, it should be understood that the architecture of a convolutional engine may be similar to earlier described examples, with a 2-D shift register and a convolver array.
  • Figure 26A depicts the loading of a zero padding row, horizontal stripe 902a and a data padding row (corresponding to row n of horizontal stripe 902b) into convolutional engine 708.
  • the bolded dashed rectangle denotes the portion of input channel 702a being loaded into convolutional engine 708.
  • the zero padding row is first loaded into the 2-D shift register of convolutional engine 708, followed by row n of horizontal stripe 902a, followed by row n-1 of horizontal stripe 902a, ... followed by row 1 of horizontal stripe 902a, and followed by the data padding row.
  • each time a row of data storage elements stores row n of a horizontal stripe, the convolver units corresponding to that row of data storage elements are activated.
  • Figure 26B depicts the loading of one data padding row (corresponding to row 1 of horizontal stripe 902a), horizontal stripe 902b and a zero padding row into convolutional engine 708. More specifically, the data padding row is first loaded into the 2-D shift register of convolutional engine 708, followed by row n of horizontal stripe 902b, followed by row n-1 of horizontal stripe 902b, ... followed by row 1 of horizontal stripe 902b, and followed by the zero padding row.
  • While input channel 702a included two horizontal stripes to illustrate the concept of a single "horizontal cut line" through the input data (conceptually located at the boundary of horizontal stripes 902a and 902b), it is understood that an input channel would have more horizontal stripes if there were more horizontal cut lines. For a horizontal stripe that is bordered above and below by other horizontal stripes, the loading of that horizontal stripe would be preceded by a data padding row and followed by another data padding row.
  • Figures 27A-27C illustrate a scenario in which "vertical cut lines" through input channel 702a are needed, and how to handle the vertical cut lines.
  • a vertical cut line is needed whenever the number of columns of the input channel is greater than the number of columns of the convolver array.
  • the present example discusses the scenario in which the number of columns of the input channel is equal to 3m-4, where m is the number of columns of the convolver array.
  • with this relationship, the convolver array is utilized in an efficient manner (no unused convolver units); if this relationship does not hold, the concepts described below still apply, but the convolver array will be utilized in a less efficient manner (i.e., will have unused convolver units).
  • horizontal cut lines, zero padding rows, and data padding rows are not discussed in the example of Figures 27A-27C. Nevertheless, it is expected that one of ordinary skill in the art will be able to combine concepts from Figures 26A-26B and 27A-27C in order to handle scenarios in which there are both horizontal and vertical cut lines.
  • input channel 702a is divided into vertical stripes 906a, 906b and 906c.
  • first vertical cut line separating vertical stripe 906a from vertical stripe 906b
  • second vertical cut line separating vertical stripe 906b from 906c.
  • interior vertical stripes such as 906b
  • exterior vertical stripes such as 906a and 906c
  • Figure 27A depicts m columns (including the m-1 columns of vertical stripe 906a and one data padding column) being loaded into convolutional engine 708.
  • the rightmost column of convolver units (which aligns with the data padding column) is non-active, as these convolver units would have produced a convolution output that treats the data padding column as an external column (which is not true in the current scenario).
  • the remaining m-1 columns of the convolver units operate in a similar manner as the convolver units that have been previously described.
  • Figure 27B depicts m columns (including the m-2 columns of vertical stripe 906b bordered on the right and left sides by a data padding column) being loaded into convolutional engine 708.
  • the leftmost and rightmost columns of convolver units (which align with the data padding columns) are non-active, for reasons similar to those provided above.
  • the remaining m-2 columns of the convolver units operate in a similar manner as the convolver units that have been previously described.
  • Figure 27C depicts m columns (including one data padding column and the m-1 columns of vertical stripe 906c) being loaded into convolutional engine 708.
  • the leftmost column of convolver units (which aligns with the data padding column) is non-active, for reasons similar to those provided above.
  • the remaining m-1 columns of the convolver units operate in a similar manner as the convolver units that have been previously described.
  • Figure 28 describes the scenario in which the number of columns of the input channel 702a is equal to m /2, in which m is the number of columns of the convolutional engine.
  • the variable m is assumed to be an even number for the example of Figure 28 , but need not be an even number in general.
  • with this relationship, the convolver array is utilized in an efficient manner (i.e., will have no unused convolver units); if this relationship does not hold, the concepts described below still apply, but the convolver array will be utilized in a less efficient manner (i.e., will have unused convolver units).
  • FIG. 28 illustrates the concept of a "vertical cut line" through convolutional engine 708, in which there is no transfer of data between region 708a (which includes the first half of the "columns" of the convolutional engine) and region 708b (which includes the second half of the "columns" of the convolutional engine).
  • the term "column," when used in the context of a convolutional engine, includes a column of the 2-D shift register and the corresponding column of convolutional units.
  • a vertical cut line that separates region 708a from region 708b.
  • Region 708a essentially functions independently from region 708b, allowing region 708a to be configured to perform a convolution with a first set of filters (e.g., filters 1 through 10), and region 708b to be configured to perform the convolution with a second set of filters (e.g., filters 11-20).
  • the number of filters (10 in each region) was chosen for clarity of explanation, and it is understood that there could have been a different number of filters in one or both of the two regions.
  • convolver units in the rightmost column of region 708a have weights w3, w6 and w9 set to zero (regardless of what those weights might be from the filter kernels), and convolver units in the leftmost column of region 708b have weights w1, w4 and w7 set to zero (regardless of what those weights might be from the filter kernels).
  • when input channel 702a is loaded into convolutional engine 708, it is loaded into region 708a row-by-row and, at the same time, into region 708b row-by-row.
  • if the propagation of data through convolutional engine 708 could conceptually be viewed as a ticker tape traversing in the vertical direction, there would be one ticker tape traversing down region 708a, and there would be a mirror image of that ticker tape traversing down region 708b.
  • while Figure 28 illustrated an example with one vertical cut line through the convolutional engine, it should be apparent how a convolutional engine could be modified to have multiple vertical cut lines. Further, for the sake of clarity of illustration and explanation, horizontal cut lines, zero padding rows, and data padding rows are not discussed in the example of Figure 28. Nevertheless, it is expected that one of ordinary skill in the art will be able to combine concepts from Figures 26A-26B and 28 together to handle scenarios in which there are both horizontal and vertical cut lines.
  • Figures 29A-29B illustrate another scheme for handling the scenario in which the number of columns of the input channel 702a is equal to m/2, in which m is the number of columns of convolutional engine 708.
  • the scheme involves combining the concept of a horizontal cut line through the input data (described in Figures 26A-26B ) and the concept of a vertical cut line through the convolver array (described in Figure 28 ).
  • in the scheme of Figures 26A-26B, the two horizontal stripes were processed one after another (i.e., serially).
  • the horizontal stripes 908a and 908b are processed in parallel, with horizontal stripe 908a processed in region 708a, and horizontal stripe 908b processed in region 708b.
  • the same filters are populated in regions 708a and 708b, in contrast to the scheme of Figure 28 .
  • since there are several overlapping rectangles in Figure 29A, the scheme is conceptually redrawn in Figure 29B, which more clearly shows the data loaded into region 708a and region 708b. If not already apparent, it is noted that row 1 of horizontal stripe 908a is identical to the data padding row that precedes horizontal stripe 908b, and the data padding row that follows horizontal stripe 908a is identical to row n of horizontal stripe 908b.
  • the scheme of Figures 29A-29B also has the effect of doubling the throughput.
  • One consideration between the scheme of Figure 28 and the scheme of Figures 29A-29B is the number of filters versus the number of rows of the input channel. If there are many more filters than the number of rows of the input channel, then the scheme of Figure 28 might be preferred, whereas if there are many more rows of the input channel than the number of filters, then the scheme of Figures 29A-29B might be preferred.
  • the former case would be analogous to a long skinny column of filters, in which it would be advantageous to cut the long skinny column of filters in half (place one half in region 708a and the other half in region 708b), whereas the latter case would be analogous to a long skinny column of input data, in which it would be advantageous to cut the long skinny column of input data in half and process the two halves of input data in parallel.
  • FIG. 30 depicts convolutional engine 708 as one component of system 3000, in accordance with one embodiment of the invention.
  • System 3000 may include memory 3002, shift and format module 3004, convolutional engine 708 and controller 3006.
Memory 3002 may be implemented using static random-access memory (SRAM), and may store input data 702 and the output of convolutional engine 708 (e.g., convolution output, max pool output, rectified output, etc.).
Shift and format module 3004 is an interface between memory 3002 and convolutional engine 708, and is configured to shift and format the data. For instance, in the example of Figure 29A, providing horizontal stripe 908b to region 708b of the convolutional engine would be one task performed by shift and format module 3004. Achieving a stride of 1/2 (or any stride less than one) could also involve shift and format module 3004, in which case the above-described interpolation could be performed by this module. One plausible realization is sketched below.
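As one plausible realization (the patent's interpolation is described earlier in the specification; plain linear interpolation between adjacent rows is assumed here, and the helper name is ours), the shift and format module could synthesize intermediate rows so that a stride-1 convolver array effectively sees a stride of 1/2:

```python
import numpy as np

def upsample_rows_linear(x: np.ndarray) -> np.ndarray:
    """Insert a linearly interpolated row between each pair of adjacent rows
    (an illustrative realization of a 1/2 stride)."""
    rows = [x[0]]
    for r0, r1 in zip(x[:-1], x[1:]):
        rows.append((r0 + r1) / 2.0)   # interpolated row between r0 and r1
        rows.append(r1)
    return np.stack(rows)

x = np.arange(12, dtype=float).reshape(4, 3)
print(upsample_rows_linear(x).shape)   # (7, 3): a stride-1 convolution over the
                                       # upsampled rows acts like stride 1/2
```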
Unlike the simplified examples above, convolutional engine 708 in Figure 30 contains a more typical number of data storage elements and convolver units.
Specifically, Figure 30 depicts a convolutional engine with a 64-by-256 array of convolver units 806 and a 66-by-256 array of data storage elements configured as a 2-D shift register. As in the previously described embodiments, the first row of convolver units logically corresponds with the second row of data storage elements, and the last row of convolver units logically corresponds with the second-to-last row of data storage elements. A toy model of this arrangement is sketched below.
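A toy functional model of the 2-D shift register (class and method names are ours) clarifies the row correspondence: on each clock cycle every stored row moves down one position, and a convolver row reads the storage row it is centered on plus the rows directly above and below it:

```python
import numpy as np

class ShiftRegister2D:
    """Toy model of the 2-D shift register: each clock cycle shifts every
    stored row down one position and loads the next input row at the top."""
    def __init__(self, rows: int, cols: int):
        self.mem = np.zeros((rows, cols))

    def clock(self, new_row: np.ndarray) -> None:
        self.mem = np.roll(self.mem, 1, axis=0)   # shift all rows down by one
        self.mem[0, :] = new_row                  # next input row enters on top

# 66 storage rows serve a 64-row convolver array: convolver row k is centered
# on storage row k + 1, i.e. the first convolver row on the second storage row
# and the last convolver row on the second-to-last storage row.
reg = ShiftRegister2D(rows=66, cols=256)
for t in range(3):
    reg.clock(np.full(256, float(t)))
print(reg.mem[0, 0], reg.mem[2, 0])   # 2.0 0.0: earlier rows have shifted down
```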
Controller 3006 may be responsible for performing many of the above-described control operations. For example, controller 3006 may provide the control signals that set convolver units active or inactive (hence, the above-described controller 2202 may be part of controller 3006). Controller 3006 may be responsible for providing control signal s1 (described in Figures 21 and 24) for controlling the output of output selectors 2106 and 2406, and for providing control signal s2 (described in Figure 24) for controlling whether a functional unit is programmed to output a convolution output or a max pool output.
Controller 3006 may logically partition an input channel into horizontal stripes and/or vertical stripes (more appropriately called chunks when there are both vertical and horizontal cut lines) based on the dimensions of the input channel relative to the dimensions of the convolver array. Controller 3006 may control shift and format module 3004 to perform the necessary shift and format operations, may determine which weights are to be loaded into which convolver units, and may determine whether to override filter weights with zero values in order to logically partition the convolutional engine into multiple independent regions (as depicted in Figures 28, 29A and 29B).
Controller 3006 may also contain the logic that determines, for the loading of a horizontal stripe into the convolutional engine, whether the stripe is to be preceded by a zero padding row or a data padding row, and whether it is to be followed by a zero padding row or a data padding row; an illustrative sketch of this decision follows. These are merely some examples of the functions that may be performed by controller 3006.
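One way such logic might look (this decision rule is our assumption of the behavior, not the patent's verbatim algorithm): stripes at the top or bottom of the input get zero padding rows at the image boundary, while interior seams get data padding rows copied from the neighboring stripe:

```python
def padding_rows(stripe_index: int, num_stripes: int):
    """Illustrative padding decision for loading horizontal stripe number
    stripe_index out of num_stripes (assumed rule, for illustration only)."""
    before = "zero padding row" if stripe_index == 0 else "data padding row"
    after = "zero padding row" if stripe_index == num_stripes - 1 else "data padding row"
    return before, after

print(padding_rows(0, 3))   # ('zero padding row', 'data padding row')
print(padding_rows(1, 3))   # ('data padding row', 'data padding row')
```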
Figure 31 depicts a block diagram of weight decompressor 3100, which decompresses filter weights before the weights are provided to the convolver units, in accordance with one embodiment of the invention. Weight decompressor 3100 may utilize dictionary 3102 to decompress the weights.
Compressed weights serve as keys into a look-up table (one embodiment of the dictionary), and the records corresponding to those keys are the decompressed weights. The 256 convolver units may be logically and/or physically grouped into 16 groups, each group including 16 convolver units, and the decompressed weights may be provided to each of the 16 groups of convolver units; a minimal sketch follows.
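A minimal sketch of such dictionary-based decompression (the table contents, key width, and fan-out below are invented for illustration):

```python
import numpy as np

# Compressed weights are keys into a look-up table; the records corresponding
# to the keys are the decompressed weights. A 2-bit key space is assumed
# purely for illustration.
dictionary = {0: -0.5, 1: 0.0, 2: 0.25, 3: 1.0}

compressed = np.array([3, 0, 1, 2, 2, 0, 3, 1, 1])            # one 3x3 filter
decompressed = np.array([dictionary[k] for k in compressed]).reshape(3, 3)

# The decompressed weights would then be broadcast to each of the 16 groups
# of 16 convolver units.
weight_broadcast = [decompressed] * 16
print(decompressed)
```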


Claims (4)

  1. An apparatus, comprising:
    a two-dimensional synchronous shift register comprising a three-by-four array of data storage elements

      d1,1  d1,2  d1,3  d1,4
      d2,1  d2,2  d2,3  d2,4
      d3,1  d3,2  d3,3  d3,4

    wherein:
    the data storage element d1,1 is electrically coupled to the data storage element d2,1,
    the data storage element d2,1 is electrically coupled to the data storage element d3,1,
    the data storage element d1,2 is electrically coupled to the data storage element d2,2,
    the data storage element d2,2 is electrically coupled to the data storage element d3,2,
    the data storage element d1,3 is electrically coupled to the data storage element d2,3,
    the data storage element d2,3 is electrically coupled to the data storage element d3,3,
    the data storage element d1,4 is electrically coupled to the data storage element d2,4, and
    the data storage element d2,4 is electrically coupled to the data storage element d3,4;
    a first convolver unit comprising a first plurality of multipliers m^1_{1,1}, m^1_{1,2}, m^1_{1,3}, m^1_{2,1}, m^1_{2,2}, m^1_{2,3}, m^1_{3,1}, m^1_{3,2} and m^1_{3,3}, wherein:
    the multiplier m^1_{1,1} is electrically coupled to the data storage element d1,1,
    the multiplier m^1_{1,2} is electrically coupled to the data storage element d1,2,
    the multiplier m^1_{1,3} is electrically coupled to the data storage element d1,3,
    the multiplier m^1_{2,1} is electrically coupled to the data storage element d2,1,
    the multiplier m^1_{2,2} is electrically coupled to the data storage element d2,2,
    the multiplier m^1_{2,3} is electrically coupled to the data storage element d2,3,
    the multiplier m^1_{3,1} is electrically coupled to the data storage element d3,1,
    the multiplier m^1_{3,2} is electrically coupled to the data storage element d3,2, and
    the multiplier m^1_{3,3} is electrically coupled to the data storage element d3,3; and
    a second convolver unit comprising a second plurality of multipliers m^2_{1,1}, m^2_{1,2}, m^2_{1,3}, m^2_{2,1}, m^2_{2,2}, m^2_{2,3}, m^2_{3,1}, m^2_{3,2} and m^2_{3,3}, wherein:
    the multiplier m^2_{1,1} is electrically coupled to the data storage element d1,2,
    the multiplier m^2_{1,2} is electrically coupled to the data storage element d1,3,
    the multiplier m^2_{1,3} is electrically coupled to the data storage element d1,4,
    the multiplier m^2_{2,1} is electrically coupled to the data storage element d2,2,
    the multiplier m^2_{2,2} is electrically coupled to the data storage element d2,3,
    the multiplier m^2_{2,3} is electrically coupled to the data storage element d2,4,
    the multiplier m^2_{3,1} is electrically coupled to the data storage element d3,2,
    the multiplier m^2_{3,2} is electrically coupled to the data storage element d3,3, and
    the multiplier m^2_{3,3} is electrically coupled to the data storage element d3,4,
    wherein, at a beginning of a first clock cycle,
    the data storage element d1,1 is configured to store the data value x1,1,
    the data storage element d1,2 is configured to store the data value x1,2,
    the data storage element d1,3 is configured to store the data value x1,3,
    the data storage element d1,4 is configured to store the data value x1,4,
    the data storage element d2,1 is configured to store the data value x2,1,
    the data storage element d2,2 is configured to store the data value x2,2,
    the data storage element d2,3 is configured to store the data value x2,3,
    the data storage element d2,4 is configured to store the data value x2,4,
    the data storage element d3,1 is configured to store the data value x3,1,
    the data storage element d3,2 is configured to store the data value x3,2,
    the data storage element d3,3 is configured to store the data value x3,3, and
    the data storage element d3,4 is configured to store the data value x3,4;
    wherein, during the first clock cycle,
    the multiplier m^1_{1,1} is configured to receive the data value x1,1 from the data storage element d1,1 and to multiply the data value x1,1 by the weight w1 so as to generate a product w1x1,1,
    the multiplier m^1_{1,2} is configured to receive the data value x1,2 from the data storage element d1,2 and to multiply the data value x1,2 by the weight w2 so as to generate a product w2x1,2,
    the multiplier m^1_{1,3} is configured to receive the data value x1,3 from the data storage element d1,3 and to multiply the data value x1,3 by the weight w3 so as to generate a product w3x1,3,
    the multiplier m^1_{2,1} is configured to receive the data value x2,1 from the data storage element d2,1 and to multiply the data value x2,1 by the weight w4 so as to generate a product w4x2,1,
    the multiplier m^1_{2,2} is configured to receive the data value x2,2 from the data storage element d2,2 and to multiply the data value x2,2 by the weight w5 so as to generate a product w5x2,2,
    the multiplier m^1_{2,3} is configured to receive the data value x2,3 from the data storage element d2,3 and to multiply the data value x2,3 by the weight w6 so as to generate a product w6x2,3,
    the multiplier m^1_{3,1} is configured to receive the data value x3,1 from the data storage element d3,1 and to multiply the data value x3,1 by the weight w7 so as to generate a product w7x3,1,
    the multiplier m^1_{3,2} is configured to receive the data value x3,2 from the data storage element d3,2 and to multiply the data value x3,2 by the weight w8 so as to generate a product w8x3,2,
    the multiplier m^1_{3,3} is configured to receive the data value x3,3 from the data storage element d3,3 and to multiply the data value x3,3 by the weight w9 so as to generate a product w9x3,3;
    the multiplier m^2_{1,1} is configured to receive the data value x1,2 from the data storage element d1,2 and to multiply the data value x1,2 by the weight w1 so as to generate a product w1x1,2,
    the multiplier m^2_{1,2} is configured to receive the data value x1,3 from the data storage element d1,3 and to multiply the data value x1,3 by the weight w2 so as to generate a product w2x1,3,
    the multiplier m^2_{1,3} is configured to receive the data value x1,4 from the data storage element d1,4 and to multiply the data value x1,4 by the weight w3 so as to generate a product w3x1,4,
    the multiplier m^2_{2,1} is configured to receive the data value x2,2 from the data storage element d2,2 and to multiply the data value x2,2 by the weight w4 so as to generate a product w4x2,2,
    the multiplier m^2_{2,2} is configured to receive the data value x2,3 from the data storage element d2,3 and to multiply the data value x2,3 by the weight w5 so as to generate a product w5x2,3,
    the multiplier m^2_{2,3} is configured to receive the data value x2,4 from the data storage element d2,4 and to multiply the data value x2,4 by the weight w6 so as to generate a product w6x2,4,
    the multiplier m^2_{3,1} is configured to receive the data value x3,2 from the data storage element d3,2 and to multiply the data value x3,2 by the weight w7 so as to generate a product w7x3,2,
    the multiplier m^2_{3,2} is configured to receive the data value x3,3 from the data storage element d3,3 and to multiply the data value x3,3 by the weight w8 so as to generate a product w8x3,3, and
    the multiplier m^2_{3,3} is configured to receive the data value x3,4 from the data storage element d3,4 and to multiply the data value x3,4 by the weight w9 so as to generate a product w9x3,4,
    wherein, prior to a second clock cycle that follows the first clock cycle,
    the first convolver unit is configured to generate a first sum of terms comprising at least the product w1x1,1, the product w2x1,2, the product w3x1,3, the product w4x2,1, the product w5x2,2, the product w6x2,3, the product w7x3,1, the product w8x3,2 and the product w9x3,3, and
    the second convolver unit is configured to generate a second sum of terms comprising at least the product w1x1,2, the product w2x1,3, the product w3x1,4, the product w4x2,2, the product w5x2,3, the product w6x2,4, the product w7x3,2, the product w8x3,3 and the product w9x3,4,
    wherein, at a beginning of the second clock cycle that follows the first clock cycle,
    a first accumulator is configured to store the first sum of terms,
    a second accumulator is configured to store the second sum of terms,
    the data storage element d1,1 is configured to store the data value x0,1,
    the data storage element d1,2 is configured to store the data value x0,2,
    the data storage element d1,3 is configured to store the data value x0,3,
    the data storage element d1,4 is configured to store the data value x0,4,
    the data storage element d2,1 is configured to store the data value x1,1,
    the data storage element d2,2 is configured to store the data value x1,2,
    the data storage element d2,3 is configured to store the data value x1,3,
    the data storage element d2,4 is configured to store the data value x1,4,
    the data storage element d3,1 is configured to store the data value x2,1,
    the data storage element d3,2 is configured to store the data value x2,2,
    the data storage element d3,3 is configured to store the data value x2,3, and
    the data storage element d3,4 is configured to store the data value x2,4;
    wherein, during the second clock cycle:
    the multiplier m^1_{1,1} is configured to receive the data value x0,1 from the data storage element d1,1 and to multiply the data value x0,1 by the weight w1 so as to generate a product w1x0,1,
    the multiplier m^1_{1,2} is configured to receive the data value x0,2 from the data storage element d1,2 and to multiply the data value x0,2 by the weight w2 so as to generate a product w2x0,2,
    the multiplier m^1_{1,3} is configured to receive the data value x0,3 from the data storage element d1,3 and to multiply the data value x0,3 by the weight w3 so as to generate a product w3x0,3,
    the multiplier m^1_{2,1} is configured to receive the data value x1,1 from the data storage element d2,1 and to multiply the data value x1,1 by the weight w4 so as to generate a product w4x1,1,
    the multiplier m^1_{2,2} is configured to receive the data value x1,2 from the data storage element d2,2 and to multiply the data value x1,2 by the weight w5 so as to generate a product w5x1,2,
    the multiplier m^1_{2,3} is configured to receive the data value x1,3 from the data storage element d2,3 and to multiply the data value x1,3 by the weight w6 so as to generate a product w6x1,3,
    the multiplier m^1_{3,1} is configured to receive the data value x2,1 from the data storage element d3,1 and to multiply the data value x2,1 by the weight w7 so as to generate a product w7x2,1,
    the multiplier m^1_{3,2} is configured to receive the data value x2,2 from the data storage element d3,2 and to multiply the data value x2,2 by the weight w8 so as to generate a product w8x2,2,
    the multiplier m^1_{3,3} is configured to receive the data value x2,3 from the data storage element d3,3 and to multiply the data value x2,3 by the weight w9 so as to generate a product w9x2,3;
    the multiplier m^2_{1,1} is configured to receive the data value x0,2 from the data storage element d1,2 and to multiply the data value x0,2 by the weight w1 so as to generate a product w1x0,2,
    the multiplier m^2_{1,2} is configured to receive the data value x0,3 from the data storage element d1,3 and to multiply the data value x0,3 by the weight w2 so as to generate a product w2x0,3,
    the multiplier m^2_{1,3} is configured to receive the data value x0,4 from the data storage element d1,4 and to multiply the data value x0,4 by the weight w3 so as to generate a product w3x0,4,
    the multiplier m^2_{2,1} is configured to receive the data value x1,2 from the data storage element d2,2 and to multiply the data value x1,2 by the weight w4 so as to generate a product w4x1,2,
    the multiplier m^2_{2,2} is configured to receive the data value x1,3 from the data storage element d2,3 and to multiply the data value x1,3 by the weight w5 so as to generate a product w5x1,3,
    the multiplier m^2_{2,3} is configured to receive the data value x1,4 from the data storage element d2,4 and to multiply the data value x1,4 by the weight w6 so as to generate a product w6x1,4,
    the multiplier m^2_{3,1} is configured to receive the data value x2,2 from the data storage element d3,2 and to multiply the data value x2,2 by the weight w7 so as to generate a product w7x2,2,
    the multiplier m^2_{3,2} is configured to receive the data value x2,3 from the data storage element d3,3 and to multiply the data value x2,3 by the weight w8 so as to generate a product w8x2,3, and
    the multiplier m^2_{3,3} is configured to receive the data value x2,4 from the data storage element d3,4 and to multiply the data value x2,4 by the weight w9 so as to generate a product w9x2,4,
    wherein, prior to a third clock cycle that follows the second clock cycle,
    the first convolver unit is configured to generate a third sum of terms comprising at least the product w1x0,1, the product w2x0,2, the product w3x0,3, the product w4x1,1, the product w5x1,2, the product w6x1,3, the product w7x2,1, the product w8x2,2 and the product w9x2,3, and
    the second convolver unit is configured to generate a fourth sum of terms comprising at least the product w1x0,2, the product w2x0,3, the product w3x0,4, the product w4x1,2, the product w5x1,3, the product w6x1,4, the product w7x2,2, the product w8x2,3 and the product w9x2,4, and
    wherein, at a beginning of the third clock cycle that follows the second clock cycle,
    a third accumulator is configured to store the third sum of terms, and
    a fourth accumulator is configured to store the fourth sum of terms.
  2. The apparatus of claim 1, wherein the first sum of terms further comprises b1, wherein b1 is a predetermined value.
  3. The apparatus of claim 1, wherein the second sum of terms further comprises b1, wherein b1 is a predetermined value.
  4. The apparatus of claim 1, further comprising:
    a third convolver unit comprising a third plurality of multipliers m^3_{1,1}, m^3_{1,2}, m^3_{1,3}, m^3_{2,1}, m^3_{2,2}, m^3_{2,3}, m^3_{3,1}, m^3_{3,2} and m^3_{3,3}, wherein:
    the multiplier m^3_{1,1} is electrically coupled to the data storage element d1,1,
    the multiplier m^3_{1,2} is electrically coupled to the data storage element d1,2,
    the multiplier m^3_{1,3} is electrically coupled to the data storage element d1,3,
    the multiplier m^3_{2,1} is electrically coupled to the data storage element d2,1,
    the multiplier m^3_{2,2} is electrically coupled to the data storage element d2,2,
    the multiplier m^3_{2,3} is electrically coupled to the data storage element d2,3,
    the multiplier m^3_{3,1} is electrically coupled to the data storage element d3,1,
    the multiplier m^3_{3,2} is electrically coupled to the data storage element d3,2, and
    the multiplier m^3_{3,3} is electrically coupled to the data storage element d3,3; and
    a fourth convolver unit comprising a fourth plurality of multipliers m^4_{1,1}, m^4_{1,2}, m^4_{1,3}, m^4_{2,1}, m^4_{2,2}, m^4_{2,3}, m^4_{3,1}, m^4_{3,2} and m^4_{3,3}, wherein:
    the multiplier m^4_{1,1} is electrically coupled to the data storage element d1,2,
    the multiplier m^4_{1,2} is electrically coupled to the data storage element d1,3,
    the multiplier m^4_{1,3} is electrically coupled to the data storage element d1,4,
    the multiplier m^4_{2,1} is electrically coupled to the data storage element d2,2,
    the multiplier m^4_{2,2} is electrically coupled to the data storage element d2,3,
    the multiplier m^4_{2,3} is electrically coupled to the data storage element d2,4,
    the multiplier m^4_{3,1} is electrically coupled to the data storage element d3,2,
    the multiplier m^4_{3,2} is electrically coupled to the data storage element d3,3, and
    the multiplier m^4_{3,3} is electrically coupled to the data storage element d3,4.
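For readers tracing claim 1, the following Python sketch is a functional model of the recited dataflow (not the claimed circuit; variable names and sample values are ours). It steps the three-by-four register through the two recited clock cycles and forms the four sums of terms:

```python
import numpy as np

# Weights w1..w9 of the 3x3 filter, laid out row-major as in the claim.
w = np.arange(1, 10, dtype=float).reshape(3, 3)

# Rows x1, x2, x3 occupy the register at the start of the first clock cycle;
# row x0 shifts in before the second cycle. Values are arbitrary placeholders.
x0, x1, x2, x3 = (np.arange(1, 5, dtype=float) + 10 * k for k in range(4))

def convolver_sums(reg):
    """First unit reads columns 1-3 of the register; second unit reads 2-4."""
    s1 = (reg[:, 0:3] * w).sum()   # sum of products for the first convolver unit
    s2 = (reg[:, 1:4] * w).sum()   # same weights applied one column to the right
    return s1, s2

reg = np.stack([x1, x2, x3])       # rows d1,*, d2,*, d3,* during the first cycle
sum1, sum2 = convolver_sums(reg)   # stored in the first and second accumulators

reg = np.stack([x0, x1, x2])       # after the shift: x0 enters at the top
sum3, sum4 = convolver_sums(reg)   # stored in the third and fourth accumulators
print(sum1, sum2, sum3, sum4)
```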
EP19708715.8A 2018-03-13 2019-02-13 Efficient convolutional engine Active EP3766020B1 (de)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP23166763.5A 2018-03-13 2019-02-13 Method for processing horizontal data stripes in an efficient convolutional engine

Applications Claiming Priority (3)

Application Number Priority Date Filing Date Title
US201862642578P 2018-03-13 2018-03-13
US201862694290P 2018-07-05 2018-07-05
PCT/US2019/017787 WO2019177735A1 (en) 2018-03-13 2019-02-13 Efficient convolutional engine

Related Child Applications (2)

Application Number Title Priority Date Filing Date
EP23166763.5A Division EP4220488A1 (de) 2018-03-13 2019-02-13 Method for processing horizontal data stripes in an efficient convolutional engine
EP23166763.5A Division-Into EP4220488A1 (de) 2018-03-13 2019-02-13 Method for processing horizontal data stripes in an efficient convolutional engine

Publications (3)

Publication Number Publication Date
EP3766020A1 (de) 2021-01-20
EP3766020C0 (de) 2023-11-08
EP3766020B1 (de) 2023-11-08

Family

ID=65635810

Family Applications (2)

Application Number Title Priority Date Filing Date
EP23166763.5A Pending EP4220488A1 (de) 2018-03-13 2019-02-13 Method for processing horizontal data stripes in an efficient convolutional engine
EP19708715.8A Active EP3766020B1 (de) 2018-03-13 2019-02-13 Efficient convolutional engine

Family Applications Before (1)

Application Number Title Priority Date Filing Date
EP23166763.5A Pending EP4220488A1 (de) 2018-03-13 2019-02-13 Method for processing horizontal data stripes in an efficient convolutional engine

Country Status (7)

Country Link
US (6) US11468302B2 (de)
EP (2) EP4220488A1 (de)
JP (2) JP7171883B2 (de)
KR (1) KR102516039B1 (de)
CN (1) CN112236783B (de)
IL (3) IL301126A (de)
WO (1) WO2019177735A1 (de)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11514136B2 (en) * 2019-05-17 2022-11-29 Aspiring Sky Co. Limited Circuit for neural network convolutional calculation of variable feature and kernel sizes
US11782310B2 (en) * 2021-12-07 2023-10-10 3M Innovative Properties Company Backlighting for display systems
JP2023159945A (ja) * 2022-04-21 2023-11-02 Hitachi, Ltd. Information processing device, information processing method, information processing program, software creation device, software creation method, and software creation program
US11762946B1 (en) * 2022-09-23 2023-09-19 Recogni Inc. Systems for using shifter circuit and 3×3 convolver units to emulate functionality of larger sized convolver units

Family Cites Families (32)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4347580A (en) * 1980-07-21 1982-08-31 The United States Of America As Represented By The Secretary Of The Navy Array convolver/correlator
US5014235A (en) * 1987-12-15 1991-05-07 Steven G. Morton Convolution memory
US5138695A (en) 1989-10-10 1992-08-11 Hnc, Inc. Systolic array image processing system
US5949920A (en) * 1996-08-13 1999-09-07 Hewlett-Packard Co. Reconfigurable convolver circuit
US10572824B2 (en) 2003-05-23 2020-02-25 Ip Reservoir, Llc System and method for low latency multi-functional pipeline with correlation logic and selectively activated/deactivated pipelined data processing engines
CA2718129A1 (en) * 2008-03-27 2009-10-01 Ge Healthcare Bioscience Bioprocess Corp. A method for preventing an unauthorized use of disposable bioprocess components
US8533250B1 (en) * 2009-06-17 2013-09-10 Altera Corporation Multiplier with built-in accumulator
US8442927B2 (en) 2009-07-30 2013-05-14 Nec Laboratories America, Inc. Dynamically configurable, multi-ported co-processor for convolutional neural networks
US9477999B2 (en) 2013-09-20 2016-10-25 The Board Of Trustees Of The Leland Stanford Junior University Low power programmable image processor
US10417525B2 (en) * 2014-09-22 2019-09-17 Samsung Electronics Co., Ltd. Object recognition with reduced neural network weight precision
US10223635B2 (en) * 2015-01-22 2019-03-05 Qualcomm Incorporated Model compression and fine-tuning
US9965824B2 (en) * 2015-04-23 2018-05-08 Google Llc Architecture for high performance, power efficient, programmable image processing
US10373050B2 (en) * 2015-05-08 2019-08-06 Qualcomm Incorporated Fixed point neural network based on floating point neural network quantization
US10049322B2 (en) * 2015-05-21 2018-08-14 Google Llc Prefetching weights for use in a neural network processor
US20160379109A1 (en) 2015-06-29 2016-12-29 Microsoft Technology Licensing, Llc Convolutional neural networks on hardware accelerators
KR102325602B1 (ko) 2015-07-06 2021-11-12 Samsung Electronics Co., Ltd. Apparatus and method for processing data in parallel
CN106203617B (zh) * 2016-06-27 2018-08-21 Harbin Institute of Technology Shenzhen Graduate School Acceleration processing unit and array structure based on a convolutional neural network
US10546211B2 (en) * 2016-07-01 2020-01-28 Google Llc Convolutional neural network on programmable two dimensional image processor
US20180007302A1 (en) * 2016-07-01 2018-01-04 Google Inc. Block Operations For An Image Processor Having A Two-Dimensional Execution Lane Array and A Two-Dimensional Shift Register
US20180046903A1 (en) * 2016-08-12 2018-02-15 DeePhi Technology Co., Ltd. Deep processing unit (dpu) for implementing an artificial neural network (ann)
JP2018067154A (ja) * 2016-10-19 2018-04-26 Sony Semiconductor Solutions Corporation Arithmetic processing circuit and recognition system
US9779786B1 (en) 2016-10-26 2017-10-03 Xilinx, Inc. Tensor operations and acceleration
JP6961011B2 (ja) 2016-12-09 2021-11-05 Beijing Horizon Information Technology Co., Ltd. Systems and methods for data management
WO2018138603A1 (en) 2017-01-26 2018-08-02 Semiconductor Energy Laboratory Co., Ltd. Semiconductor device and electronic device including the semiconductor device
CN106951395B (zh) 2017-02-13 2018-08-17 Shanghai Kelu Information Technology Co., Ltd. Parallel convolution operation method and apparatus for compressed convolutional neural networks
CN106951961B (zh) 2017-02-24 2019-11-26 Tsinghua University Coarse-grained reconfigurable convolutional neural network accelerator and system
US10489878B2 (en) 2017-05-15 2019-11-26 Google Llc Configurable and programmable image processor unit
CN111133452A (zh) * 2017-05-19 2020-05-08 Movidius Ltd. Methods, systems and apparatus for improving convolution efficiency
US10929746B2 (en) 2017-11-27 2021-02-23 Samsung Electronics Co., Ltd. Low-power hardware acceleration method and system for convolution neural network computation
KR20190066473A (ko) * 2017-12-05 2019-06-13 Samsung Electronics Co., Ltd. Method and apparatus for processing a convolution operation in a neural network
US10747844B2 (en) * 2017-12-12 2020-08-18 Tesla, Inc. Systems and methods for converting a matrix input to a vectorized input for a matrix processor
US11256977B2 (en) * 2017-12-29 2022-02-22 Facebook, Inc. Lowering hardware for neural networks

Also Published As

Publication number Publication date
US11580372B2 (en) 2023-02-14
US20220351028A1 (en) 2022-11-03
US11593630B2 (en) 2023-02-28
US20220351031A1 (en) 2022-11-03
US11694069B2 (en) 2023-07-04
KR102516039B1 (ko) 2023-03-30
JP7171883B2 (ja) 2022-11-15
US11645504B2 (en) 2023-05-09
US20220351030A1 (en) 2022-11-03
WO2019177735A1 (en) 2019-09-19
US20220351027A1 (en) 2022-11-03
IL277197A (en) 2020-10-29
US11468302B2 (en) 2022-10-11
EP3766020C0 (de) 2023-11-08
CN112236783B (zh) 2023-04-11
US20220351029A1 (en) 2022-11-03
US11694068B2 (en) 2023-07-04
US20190286975A1 (en) 2019-09-19
JP2023014091A (ja) 2023-01-26
CN112236783A (zh) 2021-01-15
EP3766020A1 (de) 2021-01-20
EP4220488A1 (de) 2023-08-02
KR20200140282A (ko) 2020-12-15
IL277197B2 (en) 2023-02-01
IL277197B (en) 2022-10-01
JP2021517702A (ja) 2021-07-26
IL295915A (en) 2022-10-01
IL301126A (en) 2023-05-01

Similar Documents

Publication Publication Date Title
EP3766020B1 (de) Efficient convolutional engine
US20190095776A1 (en) Efficient data distribution for parallel processing
EP3836028A1 (de) Accelerating 2D convolutional layer mapping on a dot product architecture
US10366328B2 (en) Approximating fully-connected layers with multiple arrays of 3x3 convolutional filter kernels in a CNN based integrated circuit
US10387772B1 (en) Ensemble learning based image classification systems
US11164032B2 (en) Method of performing data processing operation
CN110059815B (zh) Artificial intelligence inference computing device
US20230376733A1 (en) Convolutional neural network accelerator hardware
US20210019602A1 (en) Using and training cellular neural network integrated circuit having multiple convolution layers of duplicate weights in performing artificial intelligence tasks
CN116152037A (zh) Image deconvolution method and device, and storage medium
CN114662647A (zh) Processing data for a layer of a neural network
CN114580618A (zh) Deconvolution processing method and apparatus, electronic device, and medium
EP4290395A1 (de) Low-power hardware architecture for handling accumulation overflows in a convolution operation
EP4303771A1 (de) Iteration engine for computing large kernels in convolution accelerators
US20230075264A1 (en) Methods and devices for efficient general deconvolution implementation on hardware accelerator
EP4300369A1 (de) Methods and systems for executing a neural network on a neural network accelerator
CN116090518A (zh) Feature map processing method and device based on a systolic operation array, and storage medium
WO2023212203A1 (en) Matrix multiplication performed using convolution engine which includes array of processing elements

Legal Events

Date Code Title Description
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: UNKNOWN

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE INTERNATIONAL PUBLICATION HAS BEEN MADE

PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: REQUEST FOR EXAMINATION WAS MADE

17P Request for examination filed

Effective date: 20200908

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

AX Request for extension of the european patent

Extension state: BA ME

DAV Request for validation of the european patent (deleted)
DAX Request for extension of the european patent (deleted)
STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

17Q First examination report despatched

Effective date: 20220915

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: GRANT OF PATENT IS INTENDED

RIC1 Information provided on ipc code assigned before grant

Ipc: G06N 3/0464 20230101ALI20230125BHEP

Ipc: G06N 3/063 20060101AFI20230125BHEP

INTG Intention to grant announced

Effective date: 20230208

GRAJ Information related to disapproval of communication of intention to grant by the applicant or resumption of examination proceedings by the epo deleted

Free format text: ORIGINAL CODE: EPIDOSDIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: EXAMINATION IS IN PROGRESS

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: GRANT OF PATENT IS INTENDED

INTC Intention to grant announced (deleted)
INTG Intention to grant announced

Effective date: 20230605

RAP3 Party data changed (applicant data changed or rights of an application transferred)

Owner name: RECOGNI INC.

RIN1 Information on inventor provided before grant (corrected)

Inventor name: FEINBERG, EUGENE, M.

GRAS Grant fee paid

Free format text: ORIGINAL CODE: EPIDOSNIGR3

GRAA (expected) grant

Free format text: ORIGINAL CODE: 0009210

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE PATENT HAS BEEN GRANTED

AK Designated contracting states

Kind code of ref document: B1

Designated state(s): AL AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HR HU IE IS IT LI LT LU LV MC MK MT NL NO PL PT RO RS SE SI SK SM TR

REG Reference to a national code

Ref country code: GB

Ref legal event code: FG4D

REG Reference to a national code

Ref country code: CH

Ref legal event code: EP

REG Reference to a national code

Ref country code: DE

Ref legal event code: R096

Ref document number: 602019041006

Country of ref document: DE

REG Reference to a national code

Ref country code: IE

Ref legal event code: FG4D

U01 Request for unitary effect filed

Effective date: 20231115

U07 Unitary effect registered

Designated state(s): AT BE BG DE DK EE FI FR IT LT LU LV MT NL PT SE SI

Effective date: 20231122

U20 Renewal fee paid [unitary effect]

Year of fee payment: 6

Effective date: 20240227

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: GR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20240209

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: IS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20240308

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: ES

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20231108

PG25 Lapsed in a contracting state [announced via postgrant information from national office to epo]

Ref country code: RS

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20231108

Ref country code: PL

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20231108

Ref country code: NO

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20240208

Ref country code: HR

Free format text: LAPSE BECAUSE OF FAILURE TO SUBMIT A TRANSLATION OF THE DESCRIPTION OR TO PAY THE FEE WITHIN THE PRESCRIBED TIME-LIMIT

Effective date: 20231108