TWI630544B - Operation device and method for convolutional neural network - Google Patents

Operation device and method for convolutional neural network Download PDF

Info

Publication number
TWI630544B
TWI630544B TW106104513A
Authority
TW
Taiwan
Prior art keywords
data
layer
convolution
neural network
weighting
Prior art date
Application number
TW106104513A
Other languages
Chinese (zh)
Other versions
TW201830232A (en)
Inventor
李一雷
杜源
杜力
管延城
劉峻誠
Original Assignee
耐能股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 耐能股份有限公司
Priority to TW106104513A (granted as TWI630544B)
Priority to US15/801,887 (published as US20180232621A1)
Application granted
Publication of TWI630544B
Publication of TW201830232A

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/004 Artificial life, i.e. computing arrangements simulating life
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/544 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices for evaluating functions by calculation
    • G06F7/5443 Sum of products
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F5/00 Methods or arrangements for data conversion without changing the order or content of the data handled
    • G06F5/01 Methods or arrangements for data conversion without changing the order or content of the data handled for shifting, e.g. justifying, scaling, normalising
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F7/00 Methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F7/38 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation
    • G06F7/48 Methods or arrangements for performing computations using exclusively denominational number representation, e.g. using binary, ternary, decimal representation using non-contact-making devices, e.g. tube, solid state device; using unspecified devices
    • G06F7/50 Adding; Subtracting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2207/00 Indexing scheme relating to methods or arrangements for processing data by operating upon the order or content of the data handled
    • G06F2207/38 Indexing scheme relating to groups G06F7/38 - G06F7/575
    • G06F2207/48 Indexing scheme relating to groups G06F7/48 - G06F7/575
    • G06F2207/4802 Special implementations
    • G06F2207/4818 Threshold devices
    • G06F2207/4824 Neural networks

Abstract

A computing method for a convolutional neural network, comprising: performing an addition operation on a plurality of input data to output accumulated data; performing a bit-shift operation on the accumulated data to output shifted data; and performing a weighting operation on the shifted data to output weighted data, wherein a factor of the weighting operation depends on the number of input data, the number of bits shifted to the right in the bit-shift operation, and a scaling weight of a subsequent layer of the convolutional neural network.

Description

Operation device and method for convolutional neural network

The present invention relates to a computing method for a convolutional neural network, and more particularly to a device and method for performing an average pooling operation.

A convolutional neural network (CNN) is a feedforward neural network that typically contains multiple sets of convolution layers and pooling layers. A pooling layer performs a max pooling or average pooling operation over a region of the input data for a particular feature, reducing the number of parameters and the amount of computation in the network. For average pooling, the conventional approach is to perform the additions first and then divide the sum. However, division consumes considerable processor effort, so it easily overburdens hardware resources. Furthermore, when many data are accumulated, overflow is also likely to occur.

Providing a pooling scheme that can perform average pooling with less processor effort is therefore an important current problem.

In view of the above, an object of the present invention is to provide a convolution operation device and a pooling operation method that avoid overburdening hardware resources and thereby improve the performance of the pooling operation.

A computing method for a convolutional neural network comprises: performing an addition operation on a plurality of input data to output accumulated data; performing a bit-shift operation on the accumulated data to output shifted data; and performing a weighting operation on the shifted data to output weighted data, wherein a factor of the weighting operation depends on the number of input data, the number of bits shifted to the right in the bit-shift operation, and a scaling weight of a subsequent layer of the convolutional neural network.

In one embodiment, the factor of the weighting operation increases with the scaling weight and with the number of bits shifted to the right in the bit-shift operation, and is inversely proportional to the number of input data; the weighted data equals the shifted data multiplied by the factor.
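As a sketch (not part of the patent text), the two-stage scheme described above can be written in Python. The function name and toy values are illustrative assumptions; the point is that the weighting factor is proportional to the scaling weight and to 2 raised to the shift amount, and inversely proportional to the number of inputs, so the result equals the scaling weight times the true average.

```python
def avg_pool_two_stage(inputs, n_shift, scale_w):
    """Two-stage average pooling: add, shift right, then weight.

    The pooling stage performs only the addition and a cheap right
    shift; the remaining factor scale_w * 2**n_shift / N is applied
    as a single multiply in the subsequent layer.
    """
    acc = sum(inputs)            # addition in the pooling layer
    shifted = acc >> n_shift     # bit shift replaces part of the division
    # weighting in the subsequent layer; grouped so the result is exact
    # whenever acc is a multiple of 2**n_shift
    return shifted * scale_w * (1 << n_shift) / len(inputs)

# 3x3 window: 9 inputs, shift right by 2 bits (i.e. divide by 4)
data = [4, 8, 12, 4, 8, 12, 4, 8, 12]
result = avg_pool_two_stage(data, n_shift=2, scale_w=1.0)
```

With `scale_w=1.0` the result is simply the average of the nine inputs; note the shift truncates, so when the sum is not a multiple of `2**n_shift` the result is approximate.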

In one embodiment, the number of bits shifted to the right in the bit-shift operation depends on the size of a pooling window, and the number of input data also depends on the size of the pooling window.

In one embodiment, the subsequent layer is the next convolution layer of the convolutional neural network, the scaling weight is a filter coefficient of that next convolution layer, and the addition and bit-shift operations are operations in a pooling layer of the convolutional neural network.

In one embodiment, the division operation of the pooling layer is integrated into the multiplication operation of the next convolution layer.

A computing method for a convolutional neural network comprises: performing an addition operation on a plurality of input data in a pooling layer to output accumulated data; and performing a weighting operation on the accumulated data in a subsequent layer to output weighted data, wherein a factor of the weighting operation depends on the number of input data and a scaling weight of the subsequent layer, and the weighted data equals the accumulated data multiplied by the factor.

In one embodiment, the subsequent layer is the next convolution layer, the scaling weight is a filter coefficient, the weighting operation is a convolution operation, and the factor of the weighting operation equals the filter coefficient divided by the number of input data.
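A minimal sketch of this embodiment, assuming a single scalar filter coefficient per pooling window (the function name and values are illustrative, not from the patent): the pooling layer emits only sums, and the division by the window size rides along with the convolution multiply of the next layer.

```python
def pooled_then_conv(windows, filter_coeff):
    """Pooling layer outputs raw sums; the division by the window
    size N is folded into the next convolution layer's multiply,
    so no division is ever performed in the pooling layer."""
    n = len(windows[0])                 # number of inputs per pooling window
    factor = filter_coeff / n           # weighting factor = coefficient / N
    sums = [sum(w) for w in windows]    # pooling layer: addition only
    return [s * factor for s in sums]   # one multiply does both avg and conv

out = pooled_then_conv([[2, 4, 6, 8]], filter_coeff=3.0)
# equivalent to filter_coeff * average([2, 4, 6, 8])
```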

In one embodiment, the number of input data depends on the size of the pooling window.

A computing method for a convolutional neural network comprises: multiplying a scaling weight by an original filter coefficient to produce a weighted filter coefficient; and performing, in a convolution layer, a convolution operation on input data and the weighted filter coefficient.

In one embodiment, the method further comprises: performing a bit-shift operation on the input data; and inputting the bit-shifted input data to the convolution layer, wherein the scaling weight depends on an original scaling weight and the number of bits shifted to the right in the bit-shift operation.

A computing device for a convolutional neural network is capable of performing the aforementioned methods.

As described above, in the computing device and method of the present invention, the average pooling operation is performed in two stages: the pooling unit performs only additions, combined with a bit-shift operation to avoid the data overflow caused by accumulation, and the output of the pooling unit is then weighted to obtain the final average. Since the pooling unit performs no division, the processor is spared the heavy cost of division, thereby improving the performance of the pooling operation.

1‧‧‧memory

2‧‧‧buffer device

21‧‧‧memory control unit

3‧‧‧convolution operation module

4‧‧‧interleaved summing unit

5‧‧‧summing buffer unit

51‧‧‧partial-sum block

52‧‧‧pooling unit

6‧‧‧coefficient retrieval controller

7‧‧‧control unit

71‧‧‧instruction decoder

72‧‧‧data read controller

ADD_IN, RD, WR‧‧‧lines

C1~Cn‧‧‧data

DMA‧‧‧direct memory access

F1~Fn, WF1~WFn‧‧‧filter coefficients

P1~Pn‧‧‧data

W‧‧‧scaling weight

FIG. 1 is a schematic diagram of some layers of a convolutional neural network.

FIG. 2 is a schematic diagram of an integrated operation of a convolutional neural network.

FIG. 3 is a schematic diagram of a convolutional neural network.

FIG. 4 is a functional block diagram of a convolution operation device according to an embodiment of the present invention.

A convolution operation device and method according to specific embodiments of the present invention will be described below with reference to the accompanying drawings, in which the same elements are denoted by the same reference numerals. The drawings are for illustration only and are not intended to limit the present invention.

FIG. 1 is a schematic diagram of some layers of a convolutional neural network. Referring to FIG. 1, a convolutional neural network has a plurality of operation layers, such as convolution layers and pooling layers; there may be multiple convolution layers and pooling layers, and the output of each layer can serve as the input of another or a subsequent layer. For example, the output of the Nth convolution layer is the input of the Nth pooling layer or of another subsequent layer, the output of the Nth pooling layer is the input of the (N+1)th convolution layer or of another subsequent layer, and the output of the Nth operation layer can be the input of the (N+1)th operation layer.

To improve computational performance, operations that belong to different layers but are similar in nature can be appropriately integrated and computed together. For example, when the pooling operation of a pooling layer is average pooling, its division can be integrated into the next operation layer; if the next operation layer is a convolution layer, the division of the average pooling is computed together with the convolution multiplication of that next convolution layer. Alternatively, the pooling layer can perform a shift operation in place of the division required for averaging, and the part of the division not fully covered by the shift is then computed together with the convolution multiplication of the next convolution layer.

FIG. 2 is a schematic diagram of an integrated operation of a convolutional neural network. Referring to FIG. 2, in the convolution layer, a plurality of data P1~Pn are convolved with a plurality of filter coefficients F1~Fn to produce a plurality of data C1~Cn, and the data C1~Cn serve as the input data of the pooling layer. In the pooling layer, the plurality of input data undergo an addition operation to output accumulated data. In a subsequent layer, the accumulated data undergo a weighting operation to output weighted data, wherein the factor of the weighting operation depends on the number of input data and a scaling weight W of the subsequent layer, and the weighted data equals the accumulated data multiplied by the factor.

For example, the subsequent layer may be the next convolution layer, the scaling weight a filter coefficient, and the weighting operation a convolution operation; the factor of the weighting operation then equals the filter coefficient divided by the number of input data. The number of input data depends on the size of the pooling window.

On the other hand, before the accumulated data enter another layer's computation, part of the division result can be obtained by a shift operation. For example, the accumulated data may undergo a bit-shift operation to output shifted data, and a weighting operation is then performed on the shifted data to output weighted data, wherein the factor of the weighting operation depends on the number of input data, the number of bits shifted to the right in the bit-shift operation, and a scaling weight W of a subsequent layer of the convolutional neural network. The factor increases with the scaling weight W and with the number of bits shifted to the right, and is inversely proportional to the number of input data; the weighted data equals the shifted data multiplied by the factor.

The number of bits shifted to the right in the bit-shift operation depends on the size of the pooling window. Shifting right by one bit is equivalent to one division by 2; if the number of bits shifted right is n, then 2 to the power n is the largest power of 2 that does not exceed the size of the pooling window. For a 2×2 pooling window, the window size is 4, n is 2, and the data is shifted right by 2 bits; for a 3×3 pooling window, the window size is 9, n is 3, and the data is shifted right by 3 bits.
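The rule above (2^n is the largest power of two not exceeding the window size) amounts to taking the floor of the base-2 logarithm of the window size; `shift_bits` below is an assumed helper name used only for illustration.

```python
def shift_bits(window_size):
    """Return n such that 2**n <= window_size < 2**(n+1),
    i.e. the largest right-shift that does not over-divide."""
    return window_size.bit_length() - 1

n_2x2 = shift_bits(2 * 2)  # window size 4  -> shift by 2
n_3x3 = shift_bits(3 * 3)  # window size 9  -> shift by 3
```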

The number of input data depends on the size of the pooling window. The subsequent layer is the next convolution layer of the convolutional neural network, the scaling weight is a filter coefficient of that next convolution layer, and the addition and bit-shift operations are operations in a pooling layer of the convolutional neural network.

For example, if a feature region contains 9 data to be average-pooled, the 9 data can first be accumulated to obtain an accumulated value. To avoid overflow of the accumulated value, a bit-shift operation is applied to it, for example shifting the accumulated value right by two bits to obtain a shifted value, which is equivalent to dividing the accumulated value by 4; the shifted value is then multiplied by a weighting coefficient to obtain a weighted value. The weighting coefficient is chosen according to the shift amount; in this embodiment the weighting coefficient is 1/2.25, so the final weighted value is equivalent to dividing the accumulated value by 9. Since bit-shift and weighting operations do not take up much processing, this two-stage shift-and-weight scheme lets the processor perform average pooling with less effort, improving the performance of the pooling operation.
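A quick numeric check of this 9-element example. The values below are illustrative and chosen so the sum is a multiple of 4; when it is not, the right shift truncates and the two-stage result is only approximate.

```python
values = [2, 4, 6, 2, 4, 6, 2, 4, 6]  # 9 data in the feature region
acc = sum(values)                     # accumulate: 36
shifted = acc >> 2                    # shift right 2 bits = divide by 4
weighted = shifted / 2.25             # weighting coefficient 1/2.25
true_avg = acc / 9                    # direct division by 9
# net effect: (acc / 4) / 2.25 == acc / 9, with no division in the
# pooling stage itself
```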

FIG. 3 is a schematic diagram of an integrated operation of a convolutional neural network. Referring to FIG. 3, the convolution operation of the convolution layer multiplies the input data by the filter coefficients; when the input data needs to be weighted or scaled, this weighting or scaling can be integrated into the convolution operation and handled together. That is, the input weighting (or scaling) of the convolution layer and the convolution operation can be completed in the same multiplication.

The data P1~Pn input to the convolution layer may be the pixels of an image or the output of the previous layer of the convolutional neural network, for example a preceding pooling layer or hidden layer. In FIG. 3, the computing method of the convolutional neural network includes: multiplying a scaling weight W by the original filter coefficients F1~Fn to produce weighted filter coefficients WF1~WFn; and performing, in a convolution layer, a convolution operation on the input data P1~Pn and the weighted filter coefficients WF1~WFn. The original convolution multiplies the input data P1~Pn by the original filter coefficients F1~Fn; to integrate the weighting or scaling, the coefficients actually used by the convolution layer are the weighted filter coefficients WF1~WFn rather than the original filter coefficients F1~Fn. The input of the convolution layer thus needs no extra multiplication for weighting or scaling.
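A sketch of this folding with assumed toy values: pre-multiplying W into the coefficients (WF = W × F) gives the same result as performing the plain convolution and scaling afterwards, but saves the separate scaling multiply.

```python
W = 0.5                         # scaling weight
F = [1.0, 2.0, 3.0]             # original filter coefficients F1~Fn
P = [10.0, 20.0, 30.0]          # input data P1~Pn

WF = [W * f for f in F]         # weighted filter coefficients WF1~WFn

# convolution (dot product) with the pre-weighted coefficients...
out_folded = sum(p * wf for p, wf in zip(P, WF))
# ...equals scaling the plain convolution after the fact
out_plain = W * sum(p * f for p, f in zip(P, F))
```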

In addition, when the weighting or scaling requires a division, or the weighting or scaling value is less than 1, the method can first perform a bit-shift operation on the input data and then feed the bit-shifted input data to the convolution layer. The scaling weight W then depends on an original scaling weight and the number of bits shifted right in the bit-shift operation. For example, if the original scaling weight is 0.4, the bit-shift operation can be set to shift right by 1 bit (equivalent to multiplying by 0.5) and the scaling weight W set to 0.8, so that the overall result is still equivalent to multiplying the input data by the original scaling weight (0.5*0.8=0.4). Replacing the division with a shift also reduces the hardware burden, and the input of the convolution layer needs no extra multiplication for weighting or scaling.
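The 0.4 = 0.5 × 0.8 decomposition can be checked directly; the helper name and input value below are assumptions, and the shift step is lossless here only because the input is even (an odd input would be truncated by the shift).

```python
def scale_via_shift(x, n_shift, residual_w):
    """Scale x by a cheap right shift (a power-of-two division in
    hardware) and fold the remaining factor into the convolution
    weight, avoiding a general division."""
    return (x >> n_shift) * residual_w

x = 100
approx = scale_via_shift(x, n_shift=1, residual_w=0.8)  # (x >> 1) * 0.8
exact = x * 0.4                                         # original scaling weight
```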

FIG. 4 is a functional block diagram of a convolution operation device according to an embodiment of the present invention. Referring to FIG. 4, the convolution operation device includes a memory 1, a buffer device 2, a convolution operation module 3, an interleaved summing unit 4, a summing buffer unit 5, a coefficient retrieval controller 6, and a control unit 7. The convolution operation device can be used in convolutional neural network (CNN) applications.

The memory 1 stores the data to be convolved, which may be, for example, image, video, audio, or statistical data, or the data of one layer of a convolutional neural network. For image data this is, for example, pixel data; for video data it is, for example, the pixel data or motion vectors of video frames, or the audio in a video; the data of one layer of a convolutional neural network is usually a two-dimensional array, and for image data it is usually a two-dimensional array of pixel data. In this embodiment, the memory 1 is, for example, a static random-access memory (SRAM); besides the data to be convolved, it can also store the data whose convolution is completed, and it may have a multi-layer storage structure that separately holds the data awaiting operation and the data already operated on. In other words, the memory 1 can serve as a cache memory inside the convolution operation device.

In practical applications, all or most of the data may first be stored elsewhere, for example in another memory such as a dynamic random-access memory (DRAM) or another kind of memory. When the convolution operation device is to perform a convolution, the data is loaded, in whole or in part, from the other memory into the memory 1, and then input through the buffer device 2 to the convolution operation module 3 for the convolution operation. If the input data is streaming data, the latest stream data is written into the memory 1 at any time for convolution.

The buffer device 2 is coupled to the memory 1, the convolution operation module 3, and the summing buffer unit 5. The buffer device 2 is also coupled to other elements of the convolution operation device, such as the interleaved summing unit 4 and the control unit 7. Moreover, for operations on image data or video frame data, the processing order is column by column while reading multiple rows at once, so in one clock cycle the buffer device 2 inputs data on different rows of the same column from the memory 1; for this purpose, the buffer device 2 of this embodiment acts as a column buffer. When an operation is to be performed, the buffer device 2 can first retrieve from the memory 1 the data that the convolution operation module 3 needs, and after retrieval adjust that data into a form that can be written smoothly into the convolution operation module 3. On the other hand, since the buffer device 2 is also coupled to the summing buffer unit 5, the data computed by the summing buffer unit 5 are also temporarily stored and reordered by the buffer device 2 before being transferred back to the memory 1 for storage. In other words, besides its column-buffering function, the buffer device 2 also acts as a relay for temporary data, or in other words the buffer device 2 can serve as a data register with a reordering function.

It is worth mentioning that the buffer device 2 further includes a memory control unit 21; data retrieval from and writing to the memory 1 by the buffer device 2 can be controlled via the memory control unit 21. In addition, since the interface to the memory 1 has a limited memory access width, also called bandwidth, the convolution operations that the convolution operation module 3 can actually perform are bound by the access width of the memory 1. In other words, the computational performance of the convolution operation module 3 is limited by this access width. Therefore, if the input from the memory 1 becomes a bottleneck, the performance of the convolution operation will take a hit and drop.

The convolution operation module 3 has a plurality of convolution units; each convolution unit performs a convolution operation based on a filter and a plurality of current data, and retains part of the current data after the convolution operation. The buffer device 2 obtains a plurality of new data from the memory 1 and inputs the new data to the convolution units; the new data does not overlap with the current data. The convolution units of the convolution operation module 3 then perform the next round of convolution based on the filter, the retained current data, and the new data. The interleaved summing unit 4 is coupled to the convolution operation module 3 and produces a feature output result according to the result of the convolution operation. The summing buffer unit 5 is coupled to the interleaved summing unit 4 and the buffer device 2 and temporarily stores the feature output result; when the convolution operations of a specified range are completed, the buffer device 2 writes all the temporarily stored data from the summing buffer unit 5 into the memory 1.

The coefficient retrieval controller 6 is coupled to the convolution operation module 3, and the control unit 7 is coupled to the buffer device 2. In practice, besides the data itself, the convolution operation module 3 also needs the filter coefficients as an input before the operation can proceed; in this embodiment this refers to the coefficient input of a 3×3 convolution unit array. The coefficient retrieval controller 6 can input the filter coefficients directly from an external memory by direct memory access (DMA). Besides being coupled to the convolution operation module 3, the coefficient retrieval controller 6 can also be connected to the buffer device 2 to accept various commands from the control unit 7, so that the filter coefficients can be input by having the control unit 7 control the coefficient retrieval controller 6.

控制單元7可包括一指令解碼器71以及一數據讀取控制器72。指令解碼器71係從數據讀取控制器72得到控制指令並將指令解碼，藉以得到目前輸入數據的大小、輸入數據的行數、輸入數據的列數、輸入數據的特徵編號以及輸入數據在記憶體1中的起始位址。另外，指令解碼器71也可從數據讀取控制器72得到有關濾波器的種類資訊以及輸出特徵的編號，並輸出適當的空置訊號到緩衝裝置2。緩衝裝置2則根據指令解碼後所提供的資訊來運行，也進而控制卷積運算模組3以及加總緩衝單元5的運作，例如數據從記憶體1輸入到緩衝裝置2以及卷積運算模組3的時序、卷積運算模組3的卷積運算的規模、數據從記憶體1到緩衝裝置2的讀取位址、數據從加總緩衝單元5到記憶體1的寫入位址、卷積運算模組3及緩衝裝置2所運作的卷積模式。 The control unit 7 can include an instruction decoder 71 and a data read controller 72. The instruction decoder 71 obtains a control instruction from the data read controller 72 and decodes it, thereby obtaining the size of the current input data, the number of rows of the input data, the number of columns of the input data, the feature number of the input data, and the start address of the input data in the memory 1. In addition, the instruction decoder 71 can also obtain the filter-type information and the output feature number from the data read controller 72, and output an appropriate idle signal to the buffer device 2. The buffer device 2 operates according to the information provided by the decoded instruction, and in turn controls the operation of the convolution operation module 3 and the summation buffer unit 5, for example: the timing at which data is transferred from the memory 1 to the buffer device 2 and the convolution operation module 3, the scale of the convolution operations of the convolution operation module 3, the read address for data from the memory 1 to the buffer device 2, the write address for data from the summation buffer unit 5 to the memory 1, and the convolution mode in which the convolution operation module 3 and the buffer device 2 operate.
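The fields the instruction decoder recovers can be collected into a small record, sketched below. The field names and types are assumptions for illustration; the patent does not specify an instruction encoding.

```python
# Illustrative record of the decoded quantities listed above (input size,
# row/column counts, feature number, start address, filter type, output
# feature number). Names are assumptions, not the patent's encoding.
from dataclasses import dataclass

@dataclass
class DecodedConvInstruction:
    input_size: int         # size of the current input data
    num_rows: int           # number of rows of the input data
    num_cols: int           # number of columns of the input data
    feature_id: int         # feature number of the input data
    start_address: int      # start address of the input data in memory 1
    filter_type: int        # filter-type information
    output_feature_id: int  # output feature number
```

A decoder implementation would populate one such record per control instruction and hand it to the buffer device.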

另一方面，控制單元7則同樣可藉由直接記憶體存取DMA的方式由外部之記憶體提取所需的控制指令及卷積資訊，指令解碼器71將指令解碼之後，該些控制指令及卷積資訊由緩衝裝置2擷取，指令可包含移動窗的步幅大小、移動窗的位址以及欲提取特徵的影像數據行列數。 On the other hand, the control unit 7 can likewise fetch the required control instructions and convolution information from external memory through direct memory access (DMA). After the instruction decoder 71 decodes the instructions, these control instructions and convolution information are retrieved by the buffer device 2; an instruction can include the stride size of the moving window, the address of the moving window, and the number of rows and columns of the image data from which features are to be extracted.

加總緩衝單元5耦接交錯加總單元4，加總緩衝單元5包括一部分加總區塊51以及一池化單元52。部分加總區塊51暫存交錯加總單元4輸出的數據。池化單元52對暫存於部分加總區塊51的數據進行池化運算。池化運算為最大值池化或平均池化。 The summation buffer unit 5 is coupled to the interleaving summation unit 4 and includes a partial-sum block 51 and a pooling unit 52. The partial-sum block 51 temporarily stores the data output by the interleaving summation unit 4. The pooling unit 52 performs a pooling operation on the data temporarily stored in the partial-sum block 51. The pooling operation is either max pooling or average pooling.

舉例來說，加總緩衝單元5可將經由卷積運算模組3卷積計算結果及交錯加總單元4的輸出特徵結果予以暫存於部分加總區塊51。接著，再透過池化單元52對暫存於部分加總區塊51的數據進行池化(pooling)運算，池化運算可針對輸入數據某個區域上的特定特徵，取其平均值或者取其最大值作為概要特徵提取或統計特徵輸出，此統計特徵相較於先前之特徵而言不僅具有更低的維度，還可改善運算的處理結果。 For example, the summation buffer unit 5 can temporarily store the convolution results computed by the convolution operation module 3 and the output feature results of the interleaving summation unit 4 in the partial-sum block 51. The pooling unit 52 then performs a pooling operation on the data temporarily stored in the partial-sum block 51: for a specific feature over a region of the input data, the pooling operation takes its average or its maximum as a summary feature extraction or statistical feature output. Compared with the original features, this statistical feature not only has a lower dimensionality but can also improve the processing results of subsequent operations.
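The two pooling variants named above can be sketched as follows. The 2×2 window size is an assumption for illustration; the patent does not fix the window here.

```python
# Minimal sketch of max pooling vs. average pooling over a 2x2 region:
# max keeps the strongest response, average summarizes the region's mean.

def pool2x2(block, mode="max"):
    """Pool a 2x2 block given as [[a, b], [c, d]]."""
    values = [block[0][0], block[0][1], block[1][0], block[1][1]]
    if mode == "max":
        return max(values)
    return sum(values) / len(values)
```

Note that average pooling involves a division by the window size, which is exactly the operation the device defers, as described below.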

須說明者，此處的暫存，仍係將輸入數據中的部分運算結果相加(partial sum)後才將其於部分加總區塊51之中暫存，因此稱其為部分加總區塊51與加總緩衝單元5，或者可將其簡稱為PSUM單元與PSUM BUFFER模組。另一方面，本實施例之池化單元52的池化運算，係可採用前述平均池化(average pooling)的計算方式取得統計特徵輸出。待所輸入的數據全部均被卷積運算模組3及交錯加總單元4處理計算完畢後，加總緩衝單元5輸出最終的數據處理結果，並同樣可透過緩衝裝置2將結果回存至記憶體1，或者再透過記憶體1輸出至其他元件。與此同時，卷積運算模組3與交錯加總單元4仍持續地進行數據特徵的取得與運算，以提高卷積運算裝置的處理效能。 It should be noted that the temporary storage here still adds up partial operation results of the input data (a partial sum) before storing them in the partial-sum block 51, which is why these are called the partial-sum block 51 and the summation buffer unit 5, or simply the PSUM unit and the PSUM BUFFER module. On the other hand, the pooling operation of the pooling unit 52 of this embodiment can use the aforementioned average pooling to obtain the statistical feature output. After all the input data have been processed by the convolution operation module 3 and the interleaving summation unit 4, the summation buffer unit 5 outputs the final data processing result, which can likewise be written back to the memory 1 through the buffer device 2, or output from the memory 1 to other components. Meanwhile, the convolution operation module 3 and the interleaving summation unit 4 continue to extract and operate on data features, improving the processing performance of the convolution operation device.

採用前述平均池化(average pooling)的情況下，原本在記憶體中的卷積層的濾波器係數需經調整，實際輸入到卷積運算模組3的是經調整後的因子，因子可以是前述整合池化層以及次一卷積層的運算中所使用的因子，由於因子的產生已在前述實施例說明，故此不再贅述。當卷積運算裝置在處理當前層的卷積層以及池化層，池化單元52可以先不處理當前層池化層中平均池化的除法部分，待卷積運算裝置處理到下一層卷積層時，卷積運算模組3再將先前池化單元52尚未處理的平均池化的除法部分整合在卷積的乘法運算中。另一方面，當卷積運算裝置在處理當前層的卷積層以及池化層時，池化單元52可以利用移位運算當作部分的除法，但留下還未完全平均池化的除法部分，待卷積運算裝置處理到下一層卷積層時，卷積運算模組3再將先前池化單元52尚未處理的平均池化的除法部分整合在卷積的乘法運算中。 When the aforementioned average pooling is used, the filter coefficients of the convolution layer originally stored in the memory must be adjusted; what is actually input to the convolution operation module 3 is an adjusted factor, which can be the factor used in the aforementioned operation that merges the pooling layer with the next convolution layer. Since the generation of this factor has been described in the previous embodiments, it is not repeated here. When the convolution operation device is processing the convolution layer and the pooling layer of the current layer, the pooling unit 52 can skip the division part of the average pooling in the current pooling layer; when the device proceeds to the next convolution layer, the convolution operation module 3 integrates the division part left unprocessed by the pooling unit 52 into the multiplication of the convolution. Alternatively, while processing the convolution layer and the pooling layer of the current layer, the pooling unit 52 can use a shift operation as a partial division, leaving the remainder of the averaging division unprocessed; when the device proceeds to the next convolution layer, the convolution operation module 3 integrates that remaining division into the multiplication of the convolution.
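The division-deferral just described can be sketched in two small functions. This is an illustrative sketch under stated assumptions (the function names and the specific shift amount are not from the patent): the pooling stage only sums, optionally shifting right by s bits as a partial divide, and the next convolution layer absorbs the remaining part of the 1/n average into its filter coefficients.

```python
# Division-free average pooling: the pooling stage accumulates (and may
# shift right by s bits as a partial divide); the remaining division by n
# is folded into the next convolution layer's filter coefficients.

def pool_sum_shift(values, shift_bits):
    """Pooling stage: accumulate, then shift right; no true division.
    Exact when the sum is divisible by 2**shift_bits (integer >> truncates)."""
    return sum(values) >> shift_bits

def adjusted_coefficient(filter_coeff, n_inputs, shift_bits):
    """Adjusted factor for the next conv layer: coeff * 2**s / n.
    Multiplying the shifted pooling output by this factor completes
    the deferred average inside the convolution's multiplication."""
    return filter_coeff * (2 ** shift_bits) / n_inputs
```

For example, with a 2×2 pooling window (n = 4) and no shift (s = 0), a next-layer coefficient of 3.0 becomes 0.75, so multiplying by the raw accumulated sum yields the same result as convolving the true averages.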

卷積運算裝置可包括多個卷積運算模組3，卷積運算模組3的卷積單元以及交錯加總單元4係能夠選擇性地操作在一低規模卷積模式以及一高規模卷積模式。在低規模卷積模式中，交錯加總單元4配置來對卷積運算模組3中對應順序的各卷積運算的結果交錯加總以各別輸出一加總結果。在高規模卷積模式中，交錯加總單元4將各卷積單元的各卷積運算的結果交錯加總作為輸出。 The convolution operation device can include a plurality of convolution operation modules 3. The convolution units of the convolution operation modules 3 and the interleaving summation unit 4 can selectively operate in a low-scale convolution mode or a high-scale convolution mode. In the low-scale convolution mode, the interleaving summation unit 4 is configured to interleave and sum the results of the corresponding-order convolution operations across the convolution operation modules 3, outputting a separate summed result for each order. In the high-scale convolution mode, the interleaving summation unit 4 interleaves and sums the results of the convolution operations of all convolution units into a single output.
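The two summation modes can be sketched as follows. The mapping of units to outputs shown here is an assumption for illustration; the patent describes the modes only at the level above.

```python
# Sketch of the two interleaved-summation modes: in the low-scale mode,
# corresponding-order results across modules are summed into separate
# outputs; in the high-scale mode, every unit's result is merged into one.

def interleave_sum(module_results, mode="low"):
    """module_results: one list of convolution results per module."""
    if mode == "high":
        # one large convolution: merge every unit's result into one output
        return sum(sum(m) for m in module_results)
    # low-scale: sum corresponding-order results across modules
    return [sum(vals) for vals in zip(*module_results)]
```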

綜上所述，本發明的運算裝置及運算方法中，以兩階段進行平均池化運算，池化單元僅進行加法運算，並搭配位元移位運算，以避免累加過程所造成的數據溢位，再對池化單元的輸出結果進行加權運算，而得到最終的平均結果。由於池化單元並未做除法運算，故可避免處理器耗費較多的效能，進而達成提升池化運算之效能的功效。 In summary, in the operation device and operation method of the present invention, the average pooling operation is performed in two stages: the pooling unit performs only additions, combined with a bit-shift operation to avoid data overflow during accumulation, and the output of the pooling unit is then weighted to obtain the final average result. Since the pooling unit performs no division, the processor is spared that comparatively expensive operation, thereby improving the performance of the pooling operation.
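The two-stage average pooling summarized above can be expressed end to end in a short sketch (variable names are assumptions): stage one adds the n inputs and shifts right by s bits; stage two multiplies by a factor w·2^s/n, where w is the subsequent layer's scaling weight, so the result equals w times the true average.

```python
# Two-stage average pooling: addition and bit-shift only in the pooling
# stage, with the remaining division absorbed into the weighting factor.

def two_stage_average(inputs, s, w):
    accumulated = sum(inputs)             # stage 1: addition only, no divide
    shifted = accumulated >> s            # stage 1: partial divide by 2**s
    factor = w * (2 ** s) / len(inputs)   # proportional to w and 2**s,
                                          # inversely proportional to n
    return shifted * factor               # stage 2: weighted data
```

With inputs [2, 4, 6, 8] (average 5), s = 1 and w = 3, the result is 15, i.e. 3 times the average, without the pooling stage ever dividing.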

上述實施例並非用以限定本發明，任何熟悉此技藝者，在未脫離本發明之精神與範疇內，而對其進行之等效修改或變更，均應包含於後附之申請專利範圍中。 The above embodiments are not intended to limit the present invention; any equivalent modifications or variations made by those skilled in the art without departing from the spirit and scope of the present invention shall be included in the scope of the appended claims.

Claims (10)

1. 一種卷積神經網路的運算方法，包括：對多個輸入數據進行一加法運算以輸出一累加數據；對該累加數據進行一位元移位運算以輸出一移位數據；以及對該移位數據進行一加權運算以輸出一加權數據，其中該加權運算的一因子依據該等輸入數據的數量、該位元移位運算中向右移位的位元數量以及卷積神經網路的一後續層的一縮放權值而定；其中該加權運算的該因子隨該縮放權值以及該位元移位運算中向右移位的位元數量呈正比，該加權運算的該因子隨該等輸入數據的數量呈反比，該加權數據等於該移位數據乘以該因子。 An operation method of a convolutional neural network, comprising: performing an addition operation on a plurality of input data to output an accumulated data; performing a bit-shift operation on the accumulated data to output a shifted data; and performing a weighting operation on the shifted data to output a weighted data, wherein a factor of the weighting operation depends on the number of the input data, the number of bits shifted to the right in the bit-shift operation, and a scaling weight of a subsequent layer of the convolutional neural network; wherein the factor of the weighting operation is proportional to the scaling weight and to the number of bits shifted to the right in the bit-shift operation, the factor of the weighting operation is inversely proportional to the number of the input data, and the weighted data equals the shifted data multiplied by the factor.

2. 一種卷積神經網路的運算方法，包括：對多個輸入數據進行一加法運算以輸出一累加數據；對該累加數據進行一位元移位運算以輸出一移位數據；以及對該移位數據進行一加權運算以輸出一加權數據，其中該加權運算的一因子依據該等輸入數據的數量、該位元移位運算中向右移位的位元數量以及卷積神經網路的一後續層的一縮放權值而定；其中該位元移位運算中向右移位的位元數量依據一池化窗的規模而定，該等輸入數據的數量依據該池化窗的規模而定。 An operation method of a convolutional neural network, comprising: performing an addition operation on a plurality of input data to output an accumulated data; performing a bit-shift operation on the accumulated data to output a shifted data; and performing a weighting operation on the shifted data to output a weighted data, wherein a factor of the weighting operation depends on the number of the input data, the number of bits shifted to the right in the bit-shift operation, and a scaling weight of a subsequent layer of the convolutional neural network; wherein the number of bits shifted to the right in the bit-shift operation depends on the size of a pooling window, and the number of the input data depends on the size of the pooling window.

3. 一種卷積神經網路的運算方法，包括：對多個輸入數據進行一加法運算以輸出一累加數據；對該累加數據進行一位元移位運算以輸出一移位數據；以及對該移位數據進行一加權運算以輸出一加權數據，其中該加權運算的一因子依據該等輸入數據的數量、該位元移位運算中向右移位的位元數量以及卷積神經網路的一後續層的一縮放權值而定；其中該後續層為卷積神經網路的一次一層卷積層，該縮放權值為該次一層卷積層的一濾波器係數，該加法運算以及該位元移位運算是卷積神經網路的一池化層中的運算。 An operation method of a convolutional neural network, comprising: performing an addition operation on a plurality of input data to output an accumulated data; performing a bit-shift operation on the accumulated data to output a shifted data; and performing a weighting operation on the shifted data to output a weighted data, wherein a factor of the weighting operation depends on the number of the input data, the number of bits shifted to the right in the bit-shift operation, and a scaling weight of a subsequent layer of the convolutional neural network; wherein the subsequent layer is a next convolution layer of the convolutional neural network, the scaling weight is a filter coefficient of the next convolution layer, and the addition operation and the bit-shift operation are operations in a pooling layer of the convolutional neural network.

4. 如申請專利範圍第3項所述的運算方法，其中該池化層中的除法運算整合在該次一層卷積層的乘法運算中進行。 The operation method according to claim 3, wherein the division operation in the pooling layer is integrated into and performed in the multiplication operation of the next convolution layer.

5. 一種卷積神經網路的運算方法，包括：在一池化層中對多個輸入數據進行一加法運算以輸出一累加數據；以及在一後續層中對該累加數據進行一加權運算以輸出一加權數據，其中該加權運算的一因子依據該等輸入數據的數量以及該後續層的一縮放權值而定，該加權數據等於該累加數據乘以該因子。 An operation method of a convolutional neural network, comprising: performing an addition operation on a plurality of input data in a pooling layer to output an accumulated data; and performing a weighting operation on the accumulated data in a subsequent layer to output a weighted data, wherein a factor of the weighting operation depends on the number of the input data and a scaling weight of the subsequent layer, and the weighted data equals the accumulated data multiplied by the factor.

6. 如申請專利範圍第5項所述的運算方法，其中該後續層為一次一層卷積層，該縮放權值為一濾波器係數，該加權運算為卷積運算，該加權運算的該因子等於該濾波器係數除以該等輸入數據的數量。 The operation method according to claim 5, wherein the subsequent layer is a next convolution layer, the scaling weight is a filter coefficient, the weighting operation is a convolution operation, and the factor of the weighting operation equals the filter coefficient divided by the number of the input data.

7. 如申請專利範圍第5項所述的運算方法，其中該等輸入數據的數量依據該池化窗的規模而定。 The operation method according to claim 5, wherein the number of the input data depends on the size of the pooling window.

8. 一種卷積神經網路的運算方法，包括：將一縮放權值與一原始濾波器係數相乘以產生一加權後濾波器係數；以及在一卷積層對一輸入數據以及該加權後濾波器係數進行卷積運算。 An operation method of a convolutional neural network, comprising: multiplying a scaling weight by an original filter coefficient to generate a weighted filter coefficient; and performing a convolution operation on an input data and the weighted filter coefficient in a convolution layer.

9. 如申請專利範圍第8項所述的運算方法，其中該運算方法更包括：對輸入數據進行一位元移位運算；以及將位元移位運算後的輸入數據輸入至該卷積層，其中，縮放權值依據一原始縮放權值以及位元移位運算中向右移位的位元數量而定。 The operation method according to claim 8, further comprising: performing a bit-shift operation on the input data; and inputting the bit-shifted input data to the convolution layer, wherein the scaling weight depends on an original scaling weight and the number of bits shifted to the right in the bit-shift operation.

10. 一種卷積神經網路的運算裝置，進行如申請專利範圍第1項至第9項其中任一項所述之運算方法。 An operation device of a convolutional neural network, performing the operation method according to any one of claims 1 to 9.
TW106104513A 2017-02-10 2017-02-10 Operation device and method for convolutional neural network TWI630544B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
TW106104513A TWI630544B (en) 2017-02-10 2017-02-10 Operation device and method for convolutional neural network
US15/801,887 US20180232621A1 (en) 2017-02-10 2017-11-02 Operation device and method for convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW106104513A TWI630544B (en) 2017-02-10 2017-02-10 Operation device and method for convolutional neural network

Publications (2)

Publication Number Publication Date
TWI630544B true TWI630544B (en) 2018-07-21
TW201830232A TW201830232A (en) 2018-08-16

Family

ID=63106464

Family Applications (1)

Application Number Title Priority Date Filing Date
TW106104513A TWI630544B (en) 2017-02-10 2017-02-10 Operation device and method for convolutional neural network

Country Status (2)

Country Link
US (1) US20180232621A1 (en)
TW (1) TWI630544B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110942145A (en) * 2019-10-23 2020-03-31 南京大学 Convolutional neural network pooling layer based on reconfigurable computing, hardware implementation method and system
US10748033B2 (en) 2018-12-11 2020-08-18 Industrial Technology Research Institute Object detection method using CNN model and object detection apparatus using the same
CN111752879A (en) * 2020-06-22 2020-10-09 深圳鲲云信息科技有限公司 Acceleration system, method and storage medium based on convolutional neural network

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109272113B (en) * 2018-09-13 2022-04-19 深思考人工智能机器人科技(北京)有限公司 Convolutional neural network establishing device and method based on channel
US20220004856A1 (en) * 2018-11-06 2022-01-06 Genesys Logic, Inc. Multichip system and data processing method adapted to the same for implementing neural network application
JP7177000B2 (en) 2019-05-16 2022-11-22 日立Astemo株式会社 Arithmetic device and method
WO2020249085A1 (en) * 2019-06-14 2020-12-17 华为技术有限公司 Data processing method and device based on neural network computation
US10872295B1 (en) 2019-09-19 2020-12-22 Hong Kong Applied Science and Technology Institute Company, Limited Residual quantization of bit-shift weights in an artificial neural network
US11275562B2 (en) * 2020-02-19 2022-03-15 Micron Technology, Inc. Bit string accumulation
CN112346703B (en) * 2020-11-24 2021-10-22 华中科技大学 Global average pooling circuit for convolutional neural network calculation

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320495A (en) * 2014-07-22 2016-02-10 英特尔公司 Weight-shifting mechanism for convolutional neural network
CN105512723A (en) * 2016-01-20 2016-04-20 南京艾溪信息科技有限公司 Artificial neural network calculating device and method for sparse connection
US20160328645A1 (en) * 2015-05-08 2016-11-10 Qualcomm Incorporated Reduced computational complexity for fixed point neural network

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105320495A (en) * 2014-07-22 2016-02-10 英特尔公司 Weight-shifting mechanism for convolutional neural network
TW201617977A (en) * 2014-07-22 2016-05-16 英特爾股份有限公司 Weight-shifting mechanism for convolutional neural networks
US20160328645A1 (en) * 2015-05-08 2016-11-10 Qualcomm Incorporated Reduced computational complexity for fixed point neural network
CN105512723A (en) * 2016-01-20 2016-04-20 南京艾溪信息科技有限公司 Artificial neural network calculating device and method for sparse connection

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10748033B2 (en) 2018-12-11 2020-08-18 Industrial Technology Research Institute Object detection method using CNN model and object detection apparatus using the same
CN110942145A (en) * 2019-10-23 2020-03-31 南京大学 Convolutional neural network pooling layer based on reconfigurable computing, hardware implementation method and system
CN111752879A (en) * 2020-06-22 2020-10-09 深圳鲲云信息科技有限公司 Acceleration system, method and storage medium based on convolutional neural network
CN111752879B (en) * 2020-06-22 2022-02-22 深圳鲲云信息科技有限公司 Acceleration system, method and storage medium based on convolutional neural network

Also Published As

Publication number Publication date
US20180232621A1 (en) 2018-08-16
TW201830232A (en) 2018-08-16

Similar Documents

Publication Publication Date Title
TWI630544B (en) Operation device and method for convolutional neural network
TWI607389B (en) Pooling operation device and method for convolutional neural network
US10936937B2 (en) Convolution operation device and convolution operation method
TWI665563B (en) Convolution operation device and method of scaling convolution input for convolution neural network
US10169295B2 (en) Convolution operation device and method
CN107633297B (en) Convolutional neural network hardware accelerator based on parallel fast FIR filter algorithm
WO2019076108A1 (en) Operation circuit of convolutional neural network
CN108573305B (en) Data processing method, equipment and device
WO2018139177A1 (en) Processor, information processing device, and processor operation method
WO2020062284A1 (en) Convolutional neural network-based image processing method and device, and unmanned aerial vehicle
US10402196B2 (en) Multi-dimensional sliding window operation for a vector processor, including dividing a filter into a plurality of patterns for selecting data elements from a plurality of input registers and performing calculations in parallel using groups of the data elements and coefficients
JP4846306B2 (en) Semiconductor memory device, semiconductor integrated circuit system using the same, and method for controlling semiconductor memory device
TWI634436B (en) Buffer device and convolution operation device and method
TWI645335B (en) Convolution operation device and convolution operation method
CN108416430A (en) The pond arithmetic unit and method of convolutional neural networks
CN108415881A (en) The arithmetic unit and method of convolutional neural networks
US10162799B2 (en) Buffer device and convolution operation device and method
US20200225877A1 (en) Information processing apparatus and memory control method
US11544523B2 (en) Convolutional neural network method and system
TWI616840B (en) Convolution operation apparatus and method
CN108846808B (en) Image processing method and device
JP4156538B2 (en) Matrix operation unit
US20240126831A1 (en) Depth-wise convolution accelerator using MAC array processor structure
CN114611683A (en) Convolutional neural network operation implementation method, device, equipment and storage medium
JP2006319844A (en) Image processing apparatus and mobile phone