TWI616813B - Convolution operation method - Google Patents

Convolution operation method

Info

Publication number
TWI616813B
TWI616813B TW105137131A
Authority
TW
Taiwan
Prior art keywords
convolution operation, scale, convolution, small, block
Prior art date
Application number
TW105137131A
Other languages
Chinese (zh)
Other versions
TW201818232A (en)
Inventor
杜力
杜源
李一雷
管延城
劉峻誠
Original Assignee
耐能股份有限公司
Priority date
Filing date
Publication date
Application filed by 耐能股份有限公司
Priority to TW105137131A
Application granted
Publication of TWI616813B
Publication of TW201818232A

Abstract

The invention discloses a convolution operation method comprising the following steps: dividing a large-scale convolution operation block into a plurality of small-scale convolution operation blocks; performing convolution operations on the small-scale convolution operation blocks to respectively produce partial results; and adding the partial results together as the convolution operation result of the large-scale convolution operation block. The invention also discloses a convolution operation device whose hardware can support the foregoing convolution operation method.

Description

Convolution operation method

The present invention relates to a convolution operation device and a convolution operation method, and in particular to a convolution operation method that can divide a large-scale convolution operation block into a plurality of small-scale convolution operation blocks for the convolution operation.

Deep learning has become one of the key application technologies for developing artificial intelligence (AI). The convolutional neural network (CNN) is a deep-learning technique for efficient recognition that has attracted wide attention in recent years; it is built from a plurality of cascaded feature filters, and the convolution operation block scale of a filter can range from small block scales such as 1×1 or 3×3 up to larger block scales such as 5×5, 7×7, or even 11×11 large-scale convolution operation blocks.

However, convolution is a computationally expensive operation; the convolution of large-scale convolution operation blocks in particular occupies most of a processor's capacity. Moreover, a convolution operation unit that filters data features is usually designed only for a specific convolution operation block scale, or for a specific input data scale, which imposes an operational limit, or hardware support limit, whereby the convolution unit can only handle blocks no larger than that scale. To compute a convolution with a larger block scale, software assistance or additional hardware resources are required.

It is therefore an important current topic to provide a convolution operation method that relaxes the restriction to a specific convolution operation block scale and achieves the computation of a large-scale convolution operation block, and obtains its convolution operation result, without additional hardware resources.

In view of the above, the present invention provides a convolution operation device and a convolution operation method that relax the restriction to a specific convolution operation block scale and achieve the computation of a large-scale convolution operation block, and obtain its convolution operation result, without additional hardware resources.

To achieve the above object, the present invention provides a convolution operation method comprising the following steps: dividing a large-scale convolution operation block into a plurality of small-scale convolution operation blocks; performing convolution operations on the small-scale convolution operation blocks to respectively produce partial results; and adding the partial results together as the convolution operation result of the large-scale convolution operation block.

In one embodiment, the small-scale convolution operation blocks are of the same size.

In one embodiment, the convolution operation method further comprises filling with 0 the portions of the small-scale convolution operation blocks that extend beyond the large-scale convolution operation block.

In one embodiment, in the step of performing the convolution operation, the small-scale convolution operation blocks are convolved by at least one convolution unit to respectively produce the partial results, the block scale of the small-scale convolution operation blocks being equal to the maximum convolution scale supported by the convolution unit hardware.

In one embodiment, in the step of performing the convolution operation, the small-scale convolution operation blocks are convolved in parallel by a corresponding number of convolution units to respectively produce the partial results.

In one embodiment, the large-scale convolution operation block comprises a plurality of filter coefficients, and the filter coefficients are assigned to the small-scale convolution operation blocks according to their arrangement order and the scale of the small-scale convolution operation blocks.

In one embodiment, the large-scale convolution operation block comprises a plurality of data, and the data are assigned to the small-scale convolution operation blocks according to their arrangement order and the scale of the small-scale convolution operation blocks.

In one embodiment, the scale of the large-scale convolution operation block is 5×5 or 7×7, and the scale of the small-scale convolution operation blocks is 3×3.

In one embodiment, the step of adding the partial results further comprises: providing a plurality of move addresses to the respective small-scale convolution operation blocks, the partial results being moved within a coordinate system according to the move addresses and superimposed on one another.

In one embodiment, the convolution operation method further comprises determining a convolution operation mode according to the scale of a current convolution operation block. When the convolution operation mode is the split mode, the current convolution operation block is a large-scale convolution operation block, and the method divides the large-scale convolution operation block into a plurality of small-scale convolution operation blocks, performs convolution operations on the small-scale convolution operation blocks to respectively produce partial results, and adds the partial results together as the convolution operation result of the large-scale convolution operation block. When the convolution operation mode is the non-split mode, the current convolution operation block is not divided, and the convolution operation is performed on the current convolution operation block directly.

In one embodiment, the convolution operation method further comprises performing a partial operation of a subsequent layer of a convolutional neural network.

To achieve the above object, the present invention further provides a convolution operation device capable of executing the convolution operation method provided above.

As described above, the convolution operation method provided by the present invention divides a large-scale convolution operation block into a plurality of small-scale convolution operation blocks, performs convolution operations on the small-scale convolution operation blocks to respectively produce partial results, and adds the partial results together as the convolution operation result of the large-scale convolution operation block. The restriction to a specific convolution operation block scale is thereby relaxed, and the computation of the large-scale convolution operation block and its convolution operation result is achieved without additional hardware resources.

1‧‧‧memory

2‧‧‧buffer device

21‧‧‧memory control unit

3‧‧‧convolution operation module

30‧‧‧convolution unit array

4‧‧‧interleaving summation unit

5‧‧‧sum buffer unit

51‧‧‧partial sum block

52‧‧‧pooling unit

6‧‧‧coefficient retrieval controller

7‧‧‧control unit

71‧‧‧instruction decoder

72‧‧‧data read controller

8‧‧‧data buffer controller

9‧‧‧convolution unit

91‧‧‧address decoder

92‧‧‧adder

DMA‧‧‧direct memory access

F1~F9‧‧‧small-scale convolution operation blocks

PE‧‧‧convolution unit

PE0~PE8‧‧‧processing engines

(0,0), (0,3), (3,0), (3,3)‧‧‧move addresses

FIG. 1 is a schematic diagram of a convolution operation on two-dimensional data.

FIG. 2 is a schematic diagram of a convolution unit.

FIG. 3A is a schematic diagram of dividing a 5×5 large-scale convolution operation block into four 3×3 small-scale convolution operation blocks according to an embodiment of the invention.

FIG. 3B is a schematic diagram of assigning a plurality of filter coefficients to convolution operation blocks according to their arrangement order and the scale of the convolution operation blocks, according to an embodiment of the invention.

FIG. 3C is a schematic diagram of assigning a plurality of data to convolution operation blocks according to their arrangement order and the scale of the convolution operation blocks, according to an embodiment of the invention.

FIG. 4 is a schematic diagram of dividing a 7×7 large-scale convolution operation block into nine 3×3 small-scale convolution operation blocks according to a further embodiment of the invention.

FIG. 5 is a block diagram of a convolution operation device according to an embodiment of the invention.

FIG. 6 is a partial schematic diagram of the convolution operation device shown in FIG. 5.

FIG. 7 is a functional block diagram of a convolution unit according to an embodiment of the invention.

A convolution operation device and a convolution operation method according to preferred embodiments of the present invention are described below with reference to the related drawings, in which the same elements are denoted by the same reference numerals.

Please refer to FIG. 1, which is a schematic diagram of a convolution operation on two-dimensional data. The two-dimensional data has a plurality of rows and columns and is, for example, an image; only 5×4 of its pixels are shown schematically here. A filter of 3×3 matrix size, with coefficients FC0~FC8, is used for the convolution operation on the two-dimensional data, and the stride by which the filter moves is smaller than the shortest width of the filter. The scale of the filter corresponds to a sliding window, i.e. the convolution window. The sliding window moves across the 5×4 image at intervals, and at each position a 3×3 convolution operation is performed on the corresponding data P0~P8 inside the window; the result of the convolution operation can be called a feature value. The interval by which the sliding window S moves each time is called the stride; since the stride does not exceed the size of the sliding window S or the convolution size, the sliding-window stride of this embodiment is a movement of fewer than 3 pixels. Moreover, adjacent convolution operations often share overlapping data. With a stride equal to 1, the data P2, P5, P8 are new data, while the data P0, P1, P3, P4, P6, P7 were already input in the previous round of convolution. For typical convolutional neural network applications, common sliding-window sizes are 1×1, 3×3, 5×5, and 7×7, of which the 3×3 size used in this embodiment is the most common.
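
As a minimal software model of the sliding-window behavior just described (the array contents and the averaging coefficients below are illustrative assumptions, not values taken from the figure), each window position yields one feature value:

```python
import numpy as np

def sliding_conv3x3(image, fc, stride=1):
    """Slide a 3x3 window over a 2D array; each position yields one feature
    value, the elementwise product-sum of the window data and coefficients."""
    rows, cols = image.shape
    return np.array([[float(np.sum(image[r:r+3, c:c+3] * fc))
                      for c in range(0, cols - 2, stride)]
                     for r in range(0, rows - 2, stride)])

image = np.arange(20, dtype=float).reshape(4, 5)  # a 5x4 pixel patch as in FIG. 1
fc = np.full((3, 3), 1.0 / 9)                     # hypothetical coefficients FC0~FC8
print(sliding_conv3x3(image, fc))                 # 2x3 feature map at stride 1
```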

FIG. 2 is a schematic diagram of a convolution unit. Referring to FIG. 2, the convolution unit can perform the convolution operation of FIG. 1. The convolution unit has nine multipliers Mul_0~Mul_8 in a 3×3 array; each multiplier has a data input, a filter coefficient input, and a multiplication output OUT, the data input and the filter coefficient input being the two multiplication inputs of the multiplier. The outputs OUT of the multipliers are connected to inputs #0~#8 of an adder, which adds the multiplier outputs to produce a convolution output OUT. After one round of convolution, the multipliers Mul_0, Mul_3, Mul_6 pass the data currently held in them (the Q0, Q1, Q2 input in the current round) to the next-stage multipliers Mul_1, Mul_4, Mul_7, and the multipliers Mul_1, Mul_4, Mul_7 pass the data currently held in them (corresponding to the Q0, Q1, Q2 input in the previous round) to the next-stage multipliers Mul_2, Mul_5, Mul_8; in this way, part of the data already input to the convolution unit is retained for the next round of convolution. The multipliers Mul_0, Mul_3, Mul_6 then receive new data Q0, Q1, Q2 in the next round. The previous, current, and next rounds of convolution are separated from one another by at least one clock cycle.
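
A behavioral sketch of this data reuse follows; it is a toy model only (the exact routing of the coefficients FC0~FC8 to individual multipliers is simplified), with the coefficients held resident while each round fetches just one new column of data:

```python
import numpy as np

class Conv3x3Unit:
    """Toy model of the FIG. 2 convolution unit with column shifting."""
    def __init__(self, fc):
        self.fc = np.asarray(fc, dtype=float).reshape(3, 3)  # resident FC0~FC8
        self.win = np.zeros((3, 3))                          # data held in Mul_0~Mul_8

    def step(self, new_col):
        # New data enter Mul_0/3/6; the older columns shift on to Mul_1/4/7
        # and Mul_2/5/8, so six of the nine operands are reused.
        self.win = np.column_stack((new_col, self.win[:, 0], self.win[:, 1]))
        return float(np.sum(self.win * self.fc))             # the adder's output
```

After three steps the unit holds a full 3×3 window, and every further step produces one stride-1 convolution output while fetching only three new data.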

For example, in the usual case the filter coefficients need not be updated frequently: once the coefficients FC0~FC8 have been input to the multipliers Mul_0~Mul_8, they can remain resident in the multipliers for the multiplications; alternatively, the coefficients FC0~FC8 can be input to the multipliers Mul_0~Mul_8 continuously.

In other implementations, the convolution unit may be an array of a size other than 3×3, for example a 5×5 or 7×7 array; the invention is not limited in this respect. The convolution units PE can also process the convolution operations of different sets of input data in parallel.

Please refer to FIGS. 3A to 3C together. FIG. 3A is a schematic diagram of dividing a 5×5 large-scale convolution operation block into four 3×3 small-scale convolution operation blocks according to an embodiment of the invention. FIG. 3B is a schematic diagram of assigning a plurality of filter coefficients to the convolution operation blocks according to their arrangement order and the scale of the convolution operation blocks. FIG. 3C is a schematic diagram of assigning a plurality of data to the convolution operation blocks according to their arrangement order and the scale of the convolution operation blocks.

In FIG. 3A, a filter primarily configured to operate on two-dimensional 5×5 pixel data is provided; it is a 5×5 convolution unit array, that is, a 5×5 large-scale convolution operation block (hereinafter, the convolution operation block). FIG. 3B shows the 5×5 pixel data corresponding to the original 5×5 large-scale convolution operation block. In general, it is more direct and efficient to process 5×5 pixel data with a 5×5 large-scale convolution operation block; if, however, the hardware of the convolution operation device cannot support the convolution of a 5×5 convolution operation block, the convolution must be carried out by other means.

To this end, refer first to FIG. 3A, in which the original 5×5 large-scale convolution operation block is divided into a plurality of small-scale convolution operation blocks, in this embodiment four 3×3 small-scale convolution operation blocks of identical size. In other implementations, the 5×5 large-scale convolution operation block, or an even larger 7×7 large-scale convolution operation block, may be divided into convolution operation blocks of still smaller scale, for example 1×1 small-scale convolution operation blocks; the invention is not limited in this respect. It should be noted that because the numbers of rows and columns of the 5×5 large-scale convolution operation block before division are not integer multiples of the numbers of rows and columns of the small-scale convolution operation blocks, the four small-scale convolution operation blocks extend beyond the original 5×5 large-scale convolution operation block. The convolution operation method of the invention therefore additionally fills with 0 the portions of the small-scale convolution operation blocks that extend beyond the large-scale convolution operation block. In this embodiment, one additional row and one additional column of 0 coefficients are appended to the original 5×5 large-scale convolution operation block, so that the row and column counts of the zero-padded large-scale convolution operation block (6×6) are integer multiples of those of the 3×3 small-scale convolution operation blocks, and the 3×3 small-scale convolution operation blocks are non-overlapping. After the large-scale convolution operation block has been divided, four 3×3 small-scale convolution operation blocks are produced; for convenience of description they are referred to here as the small-scale convolution operation blocks F1~F4.
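
Why the split is exact can be stated compactly. Writing K for the original 5×5 coefficients, K' for the zero-padded 6×6 block, and D for the input data (and treating the CNN "convolution" as the sliding-window correlation of FIG. 1), the method relies on the following identity, sketched here in LaTeX:

```latex
P_{ab}(s,t) = \sum_{u=0}^{2}\sum_{v=0}^{2} K'(3a+u,\,3b+v)\, D(s+u,\,t+v),
  \qquad a,b \in \{0,1\}

y(x_1,x_2) = \sum_{i=0}^{4}\sum_{j=0}^{4} K(i,j)\, D(x_1+i,\,x_2+j)
           = \sum_{a=0}^{1}\sum_{b=0}^{1} P_{ab}(x_1+3a,\,x_2+3b)
```

Here P_{ab} is the partial result of the 3×3 block with block indices (a,b). Since K'(i,j) = K(i,j) for i,j ≤ 4 and the padded row and column of K' are zero, the filled-in zeros contribute nothing to the sum, and the offsets (3a,3b) are exactly the move addresses described below.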

Once the small-scale convolution operation blocks F1~F4 have been obtained, they can be used to begin the convolution operation on the pixel data, thereby respectively producing partial results (image results). Referring to FIGS. 3B and 3C, the large-scale convolution operation block comprises a plurality of filter coefficients and a plurality of data; the filter coefficients and the data are assigned to the small-scale convolution operation blocks F1~F4 according to their own arrangement order and the scale of the small-scale convolution operation blocks F1~F4.

In the step of performing the convolution operation, the small-scale convolution operation blocks F1~F4 are convolved by at least one convolution unit to respectively produce the partial results; here at least four convolution units are used for the convolution operations (the block F4 holds only four effective coefficients, the remainder being the padded zeros). The block scale of the small-scale convolution operation blocks F1~F4 equals the maximum convolution scale supported by the convolution unit hardware; in other words, the small-scale convolution operation blocks F1~F4 represent the hardware support limit of this embodiment, which supports convolution operation blocks of at most 3×3 scale. Moreover, the small-scale convolution operation blocks F1~F4 are convolved in parallel by a corresponding number of convolution units to respectively produce the partial results.

After the small-scale convolution operation blocks F1~F4 have performed their convolution operations and respectively produced partial results, the partial results produced by the small-scale convolution operation blocks F1~F4 are finally added together as the convolution operation result of the 5×5 large-scale convolution operation block. The partial results can be added by providing a plurality of move addresses to the respective small-scale convolution operation blocks, the partial results being moved within a coordinate system according to the provided move addresses and superimposed on one another. For example, the different move addresses (0,0), (0,3), (3,0), and (3,3) can be assigned to the small-scale convolution operation blocks F1, F2, F3, and F4, respectively. Because the small-scale convolution operation blocks F1~F4 do not overlap one another and have different move addresses, each small-scale convolution operation block F1~F4 scans the data (pixel data) of FIG. 3B according to the filter coefficients assigned to it, producing the partial results I1~I4 and a final partial result I5 (I1~I5 not shown). After the initial buffer value of the final partial result I5 has been set to 0, the partial results I1~I4 output by the four small-scale convolution operation blocks F1~F4 begin to be added together.

Because the move address of the small-scale convolution operation block F1 is (0,0), the partial result I1 is superimposed directly onto the final partial result I5. The move address of the small-scale convolution operation block F2 is (0,3), so each pixel datum located at coordinates (X,Y) in the output partial result I2 is superimposed onto the final partial result I5 at coordinates (X,Y-3). The move address of the small-scale convolution operation block F3 is (3,0), so each pixel datum located at coordinates (X,Y) in the output partial result I3 is superimposed onto the final partial result I5 at coordinates (X-3,Y). The move address of the small-scale convolution operation block F4 is (3,3), so each pixel datum located at coordinates (X,Y) in the output partial result I4 is superimposed onto the final partial result I5 at coordinates (X-3,Y-3). In this way, the output partial results I1~I4 of the small-scale convolution operation blocks are moved within the coordinate system according to their different move addresses and superimposed on one another, producing the final partial result I5.
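
The bookkeeping above can be checked numerically. The sketch below is a software model under the assumption that "convolution" here means the sliding-window correlation of FIG. 1 (the function names are ours, not the patent's): it zero-pads the kernel, splits it into non-overlapping 3×3 blocks, and superimposes the shifted partial results, reproducing the direct result for both the 5×5 case here and the 7×7 case of FIG. 4 below:

```python
import numpy as np
from scipy.signal import correlate2d

def conv_via_3x3_blocks(data, kernel):
    """Split a KxK kernel into non-overlapping 3x3 blocks and superimpose
    the shifted partial results, as a software model of the method."""
    k = kernel.shape[0]
    n = -(-k // 3) * 3                          # pad K up to a multiple of 3 (5->6, 7->9)
    kp = np.zeros((n, n))
    kp[:k, :k] = kernel                         # extra rows/columns filled with 0
    d = np.pad(data, ((0, n - k), (0, n - k)))  # padded input meets only 0 coefficients
    oh, ow = data.shape[0] - k + 1, data.shape[1] - k + 1
    out = np.zeros((oh, ow))                    # final partial result, buffer preset to 0
    for a in range(n // 3):
        for b in range(n // 3):
            block = kp[3*a:3*a+3, 3*b:3*b+3]             # one small-scale block F1..Fn
            partial = correlate2d(d, block, mode="valid")
            out += partial[3*a:3*a+oh, 3*b:3*b+ow]       # superimpose at move address (3a,3b)
    return out

rng = np.random.default_rng(0)
data = rng.standard_normal((12, 14))
for k in (5, 7):                                # the 5x5 and 7x7 embodiments
    kern = rng.standard_normal((k, k))
    assert np.allclose(conv_via_3x3_blocks(data, kern),
                       correlate2d(data, kern, mode="valid"))
```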

Accordingly, the first step of the decomposition method of this embodiment is to divide a large-scale convolution operation block into a plurality of small-scale convolution operation blocks (step S10); next, convolution operations are performed on the small-scale convolution operation blocks to respectively produce partial results (step S20); and the partial results are added together as the convolution operation result of the large-scale convolution operation block (step S30).

In addition, in step S10 of dividing a large-scale convolution operation block into a plurality of small-scale convolution operation blocks, when the small-scale convolution operation blocks extend beyond the large-scale convolution operation block, a further step is included: filling with 0 the portions of the small-scale convolution operation blocks that extend beyond the large-scale convolution operation block (step S11). Moreover, step S30 of adding the partial results together as the convolution operation result of the large-scale convolution operation block further includes step S31: providing a plurality of move addresses to the small-scale convolution operation blocks, the partial results being moved within a coordinate system according to the move addresses and superimposed on one another (step S31).

Please refer to FIG. 4, which is a schematic diagram of dividing a 7×7 large-scale convolution operation block into nine 3×3 small-scale convolution operation blocks according to a further embodiment of the invention.

The 7×7 large-scale convolution operation block of this embodiment is divided and convolved in a manner similar to the 5×5 large-scale convolution operation block of the previous embodiment. Because the numbers of rows and columns of the 7×7 large-scale convolution operation block are likewise not integer multiples of the numbers of rows and columns of the 3×3 small-scale convolution operation blocks, the nine small-scale convolution operation blocks extend beyond the original 7×7 large-scale convolution operation block. The convolution operation method of the invention therefore again fills with 0 the portions of the small-scale convolution operation blocks that extend beyond the large-scale convolution operation block. In this embodiment, two additional rows and two additional columns of 0 coefficients are appended to the original 7×7 large-scale convolution operation block, so that the row and column counts of the zero-padded large-scale convolution operation block (9×9) are integer multiples of those of the 3×3 small-scale convolution operation blocks. After the large-scale convolution operation block has been divided, nine 3×3 small-scale convolution operation blocks are produced, referred to as the small-scale convolution operation blocks F1~F9; these 3×3 small-scale convolution operation blocks are likewise non-overlapping. Finally, the small-scale convolution operation blocks F1~F9 likewise output partial results I1~I9, which are moved within the coordinate system according to their different move addresses and superimposed on one another, producing a final partial result I10.

For the other technical features of this embodiment concerning the division of the 7×7 large-scale convolution operation block into nine 3×3 small-scale convolution operation blocks and the convolution of the data, reference is made to the description of the previous embodiment, which is not repeated here.

In addition, in one embodiment the convolution operation method further comprises determining the convolution operation mode according to the scale of the current convolution operation block, so that blocks of different scales are handled in the appropriate convolution operation mode.

When the convolution operation mode is the split mode, the current convolution operation block is a large-scale convolution operation block: the large-scale convolution operation block is divided into a plurality of small-scale convolution operation blocks, the divided small-scale convolution operation blocks are convolved to respectively produce partial results, and these partial results are then added together as the convolution operation result of the large-scale convolution operation block.

When the convolution operation mode is the non-split mode, the current convolution operation block is not divided, and the convolution operation is performed on the current convolution operation block directly.
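
A minimal sketch of this mode decision, reusing conv_via_3x3_blocks from the sketch above and assuming, as in this embodiment, that the hardware directly supports at most a 3×3 block:

```python
from scipy.signal import correlate2d

MAX_DIRECT_SCALE = 3   # assumed hardware limit of the convolution units

def run_convolution(data, kernel):
    """Pick the convolution operation mode from the current block's scale."""
    if kernel.shape[0] <= MAX_DIRECT_SCALE:
        # non-split mode: convolve the current block directly
        return correlate2d(data, kernel, mode="valid")
    # split mode: divide, convolve the small-scale blocks, superimpose partials
    return conv_via_3x3_blocks(data, kernel)
```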

In addition, the convolution operation method may further comprise performing a partial operation of a subsequent layer of the convolutional neural network; the partial operation is, for example, performing the summation, averaging, maximum, or other operation of a subsequent layer in advance at the current layer of the convolutional neural network.

A hardware arrangement capable of supporting the foregoing operations is exemplified below. Please refer to FIG. 5, which is a block diagram of a convolution operation device according to an embodiment of the invention. The convolution operation device includes a memory 1, a buffer device 2, a convolution operation module 3, an interleaving summation unit 4, a sum buffer unit 5, a coefficient retrieval controller 6, and a control unit 7. The convolution operation device can be used in convolutional neural network (CNN) applications.

The memory 1 stores the data to be convolved, which may be, for example, image, video, audio, or statistical data, or the data of one layer of a convolutional neural network. For image data, this is for example pixel data; for video data, it is for example the pixel data or motion vectors of video frames, or the audio in the video; the data of one layer of a convolutional neural network is usually a two-dimensional array of data, which for image data is usually a two-dimensional array of pixel data. In this embodiment, the memory 1 is exemplified by a static random-access memory (SRAM); besides the data to be convolved, it can also store the data whose convolution has been completed, and it can have a multi-level storage structure holding the to-be-computed and the computed data separately. In other words, the memory 1 can serve as a cache memory inside the convolution operation device.

In practice, all or most of the data can first be stored elsewhere, for example in another memory such as a dynamic random-access memory (DRAM) or another kind of memory. When the convolution operation device is to perform a convolution, the data are loaded, wholly or partly, from the other memory into the memory 1, and are then input through the buffer device 2 to the convolution operation module 3 for the convolution operation. If the input data come from a data stream, the memory 1 continually writes the latest data from the data stream for the convolution operation.

For example, a control unit or processing unit can decide which mode of convolution operation is to be performed: when the control unit or processing unit finds that the scale of a convolution operation block is larger than the maximum scale the hardware can compute directly, the split mode is used. For example, if the hardware of the convolution operation module 3 can directly perform at most a 3×3 convolution, the control unit or processing unit first divides the current convolution operation block into a plurality of 3×3 operation blocks, writes the 3×3 operation blocks to the memory 1 in sequence, and commands the convolution operation device to perform 3×3 convolutions on these 3×3 operation blocks; the divided 3×3 operation blocks are convolved by the convolution operation module 3 to respectively produce partial results. These partial results are added together as the convolution operation result of the original current convolution operation block: for example, the partial results are added by the sum buffer unit 5, the sum is then written through the buffer device 2 to the memory 1, and the control unit or processing unit obtains the convolution operation result of the original current convolution operation block from the memory 1. Alternatively, the partial results may all be written directly through the buffer device 2 to the memory 1 without being added by the sum buffer unit 5; the control unit or processing unit then retrieves these partial results from the memory 1 and itself treats them as the convolution operation result of the original current convolution operation block.

The buffer device 2 is coupled to the memory 1, the convolution operation module 3, and the sum buffer unit 5, and is also coupled to other elements of the convolution operation device, such as the interleaving summation unit 4 and the control unit 7. For frame data operations on images or video, the processing order is column by column while multiple rows are read simultaneously; in one clock cycle, therefore, the buffer device 2 inputs from the memory 1 the data on different rows of the same column, and the buffer device 2 of this embodiment thus acts as a column buffer. When an operation is to be performed, the buffer device 2 first retrieves from the memory 1 the data that the convolution operation module 3 needs, and after retrieval adjusts those data into a form that can be written smoothly into the convolution operation module 3. On the other hand, because the buffer device 2 is also coupled to the sum buffer unit 5, the data computed by the sum buffer unit 5 are likewise temporarily stored and reordered by the buffer device 2 before being transferred back to the memory 1 for storage. In other words, besides its column-buffering function, the buffer device 2 also acts as a relay for temporarily held data; that is, the buffer device 2 can serve as a data register with a reordering function.

It is worth mentioning that the buffer device 2 further includes a memory control unit 21, through which data retrieval from and writing to the memory 1 are controlled. In addition, because the memory access width between the buffer device 2 and the memory 1, also called the bandwidth, is finite, the convolution operations that the convolution operation module 3 can actually perform are also related to the access width of the memory 1. In other words, the computing performance of the convolution operation module 3 is limited by this access width; if the input from the memory 1 is a bottleneck, the performance of the convolution operation drops accordingly.

The convolution operation module 3 has a plurality of convolution units; each convolution unit performs a convolution operation based on a filter and a plurality of current data, and retains part of the current data after the convolution operation. The buffer device 2 obtains a plurality of new data from the memory 1 and inputs the new data to the convolution units; the new data do not duplicate the current data, being for example data not yet used in the previous round of convolution but needed in the current round. The convolution units of the convolution operation module 3 perform the next round of convolution based on the filter, the retained current data, and the new data. The interleaving summation unit 4 is coupled to the convolution operation module 3 and produces a feature output result from the results of the convolution operations. The sum buffer unit 5 is coupled to the interleaving summation unit 4 and the buffer device 2 and temporarily stores the feature output results; when the convolution operations of the specified range are completed, the buffer device 2 writes all the temporarily stored data from the sum buffer unit 5 to the memory 1.

The coefficient retrieval controller 6 is coupled to the convolution operation module 3, and the control unit 7 is coupled to the buffer device 2. In practice, besides the data themselves, the convolution operation module 3 needs filter coefficients as input before the operation can proceed; in this embodiment these are the coefficient inputs of the 3×3 convolution unit array. The coefficient retrieval controller 6 can input the filter coefficients directly from an external memory by direct memory access (DMA). Besides being coupled to the convolution operation module 3, the coefficient retrieval controller 6 can also be connected to the buffer device 2 to accept various commands from the control unit 7, so that the filter coefficients are input to the convolution operation module 3 under the control that the control unit 7 exercises over the coefficient retrieval controller 6.

The control unit 7 can include an instruction decoder 71 and a data read controller 72. The instruction decoder 71 obtains control instructions from the data read controller 72 and decodes them to obtain the size of the current input data, the numbers of columns and rows of the input data, the feature number of the input data, and the start address of the input data in the memory 1. The instruction decoder 71 can also obtain from the data read controller 72 the type information of the filter and the number of the output feature, and output appropriate control signals to the buffer device 2. The buffer device 2 operates according to the information provided by the decoded instructions and in turn controls the operation of the convolution operation module 3 and the sum buffer unit 5, for example the timing at which data are input from the memory 1 to the buffer device 2 and the convolution operation module 3, the scale of the convolution performed by the convolution operation module 3, the read addresses for data from the memory 1 to the buffer device 2, the write addresses for data from the sum buffer unit 5 to the memory 1, and the convolution mode in which the convolution operation module 3 and the buffer device 2 operate.

On the other hand, the control unit 7 can likewise fetch the required control instructions and convolution information from an external memory by direct memory access (DMA). After the instruction decoder 71 decodes the instructions, the control instructions and convolution information are retrieved by the buffer device 2; the instructions can include the stride size of the sliding window, the address of the sliding window, and the numbers of columns and rows of the image data from which features are to be extracted.

The sum buffer unit 5 is coupled to the interleaving summation unit 4 and includes a partial sum block 51 and a pooling unit 52. The partial sum block 51 temporarily stores the data output by the interleaving summation unit 4. The pooling unit 52 performs a pooling operation on the data temporarily stored in the partial sum block 51; the pooling operation is max pooling or average pooling.

For example, the sum buffer unit 5 can temporarily store the convolution results of the convolution operation module 3 and the output feature results of the interleaving summation unit 4 in the partial sum block 51. The pooling unit 52 then performs a pooling operation on the data temporarily stored in the partial sum block 51: for a particular feature over a certain region of the input data, the pooling operation takes the average or the maximum as the extracted summary feature or the statistical feature output. Compared with the original features, this statistical feature not only has a lower dimensionality but also improves the processing results of the operations.
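
As a small sketch of the two pooling choices (the 2×2, stride-2 region is an illustrative assumption; the text fixes only the max-versus-average choice):

```python
import numpy as np

def pool(feature_map, mode="max"):
    """Reduce each non-overlapping 2x2 region of a feature map to one value."""
    h, w = feature_map.shape
    blocks = feature_map[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))
```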

It should be noted that the temporary storage here still adds up partial sums of the operation results of the input data before storing them in the partial sum block 51, which is why the names partial sum block 51 and sum buffer unit 5 are used; they can be abbreviated as the PSUM unit and the PSUM BUFFER module. On the other hand, the pooling operation of the pooling unit 52 of this embodiment obtains the statistical feature output by max pooling; in other implementations average pooling (avg pooling) can be chosen instead, the pooling method being decided by the usage requirements. After all the input data have been processed by the convolution operation module 3 and the interleaving summation unit 4, the sum buffer unit 5 outputs the final data processing result, which can likewise be stored back to the memory 1 through the buffer device 2, or output through the memory 1 to other elements. Meanwhile, the convolution operation module 3 and the interleaving summation unit 4 continue to retrieve and operate on data features, to improve the processing performance of the convolution operation device.

The convolution operation device can include a plurality of convolution operation modules 3; the convolution units of the convolution operation modules 3 and the interleaving summation unit 4 can selectively operate in a low-scale convolution mode or a high-scale convolution mode. In the low-scale convolution mode, the interleaving summation unit 4 is configured to interleave and sum the results of the corresponding-order convolution operations in the convolution operation modules 3 and output a summed result for each. In the high-scale convolution mode, the interleaving summation unit 4 interleaves and sums the results of the convolution operations of the convolution units as its output.

For example, the control unit 7 can receive a control signal or mode command and decide from it in which mode the other modules and units are to operate. This control signal or mode command can come from another control unit or processing unit.

Please refer to FIG. 6, which is a partial schematic diagram of the convolution operation device shown in FIG. 5. The coefficient retrieval controller 6 is coupled to each 3×3 convolution unit of the convolution operation module 3 through the filter coefficient FC and control signal Ctrl lines. After the buffer device 2 has obtained the instructions, the convolution information, and the data, it controls the convolution units to perform the convolution operations.

The interleaving summation unit 4 is coupled to the convolution operation module 3. The convolution operation module 3 can operate on different features of the input data and output the feature operation results; when data are written for multiple features, the convolution operation module 3 correspondingly outputs multiple operation results. The function of the interleaving summation unit 4 is to combine the multiple operation results of the convolution operation module 3 into one output feature result. After the interleaving summation unit 4 has obtained the output feature result, it transfers the output feature result to the sum buffer unit 5 for the next stage of processing.

For example, a convolutional neural network has a plurality of operation layers, such as convolution layers and pooling layers; there can be multiple convolution layers and pooling layers, and the output of each layer can serve as the input of another layer or a subsequent layer. For example, the output of the Nth convolution layer is the input of the Nth pooling layer or of other subsequent layers; the output of the Nth pooling layer is the input of the (N+1)th convolution layer or of other subsequent layers; and the output of the Nth operation layer can be the input of the (N+1)th operation layer.

To improve computing performance, while the operations of the Nth operation layer are being performed, part of the operations of the (N+i)th operation layer (i>0; N and i are natural numbers) can be performed depending on the availability of computing resources (hardware), making effective use of the computing resources and reducing the amount of computation actually required at the (N+i)th operation layer.

In this embodiment, in one operating case, for example a 3×3 convolution, the convolution operation module 3 performs the operations of a convolution layer of the convolutional neural network, the interleaving summation unit 4 performs no partial operations of subsequent layers, and the sum buffer unit 5 performs the operations of the pooling layer of the same stage of the convolutional neural network. In another operating case, for example a 1×1 convolution, the convolution operation module 3 performs the operations of a convolution layer of the convolutional neural network, the interleaving summation unit 4 performs partial operations of a subsequent layer of the convolutional neural network, for example addition and summation, and the sum buffer unit 5 performs the operations of the pooling layer of the same stage of the convolutional neural network. In other embodiments, besides the operations of the pooling layer, the sum buffer unit 5 can also perform partial operations of subsequent layers of the convolutional neural network. The aforementioned partial operations are, for example, performing the summation, averaging, maximum, or other operations of a subsequent layer in advance at the current layer of the convolutional neural network.
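
As an illustration of pulling a subsequent layer's work forward (a sketch with assumed names, not the device's actual datapath), a running 2×2 maximum can be folded into the buffer that receives each feature value, so the pooled result is ready as soon as the convolution layer finishes:

```python
import numpy as np
from scipy.signal import correlate2d

def conv_with_fused_max_pool(data, kernel):
    """Accumulate each convolution output directly into a 2x2 max-pool buffer,
    performing the next layer's max operation at the current layer."""
    y = correlate2d(data, kernel, mode="valid")       # stand-in for the conv units
    out = np.full((y.shape[0] // 2, y.shape[1] // 2), -np.inf)
    for r in range(out.shape[0] * 2):                 # feature values arrive one by one
        for c in range(out.shape[1] * 2):
            out[r // 2, c // 2] = max(out[r // 2, c // 2], y[r, c])
    return out
```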

FIG. 7 is a functional block diagram of a convolution unit according to an embodiment of the invention. As shown in FIG. 7, the convolution unit 9 includes nine processing engines PE0~PE8, an address decoder 91, and an adder 92. The convolution unit 9 can serve as the convolution unit described above.

In the 3×3 convolution mode, the input data to be convolved are input via the lines data[47:0] to the processing engines PE0~PE2; the processing engines PE0~PE2 pass the input data of the current clock cycle to the processing engines PE3~PE5 on a later clock cycle for the next round of convolution, and the processing engines PE3~PE5 pass the input data of the current clock cycle to the processing engines PE6~PE8 on a later clock cycle for the next round of convolution. The 3×3 filter coefficients are input to the processing engines PE0~PE8 through the line fc_bus[47:0]. With a stride of 1, three new data are input to the processing engines while the six old data already input are moved to the other processing engines. When the convolution is executed, the processing engines PE0~PE8 multiply the filter coefficients at the addresses selected through the address decoder 91 by the input data supplied to the processing engines PE0~PE8. When the convolution unit 9 performs a 3×3 convolution, the adder 92 adds the results of the multiplications to obtain the convolution result as the output psum[35:0].

When the convolution unit 9 performs a 1×1 convolution, the input data to be convolved are fed via the line data[47:0] into PE0~PE2, and three 1×1 filter coefficients are fed into PE0~PE2 through the line fc_bus[47:0]. With a stride of 1, three new data values enter the process engines per clock. During the operation, PE0~PE2 multiply, via the address decoder 91, the filter coefficients at the selected address with their input data. In the 1×1 mode, the adder 92 passes the results of the PE0~PE2 multiplications straight through as the outputs pm_0[31:0], pm_1[31:0], and pm_2[31:0]. Since the process engines PE3~PE8 take no part in this operation, they can be switched off to save power. Moreover, although the convolution unit 9 has three 1×1 outputs, only two of them may be connected to the interleaved summation unit 4; alternatively, all three may be connected, in which case the number of 1×1 results delivered to the interleaved summation unit 4 is determined by switching PE0~PE2 on or off.
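A matching sketch of the 1×1 mode under the same assumptions: only PE0~PE2 are active, and the three products leave the unit separately instead of passing through the adder.

```python
def conv1x1_clock(new3, coeffs3):
    """Sketch of 1x1 mode: PE3~PE8 are powered off, no shifting occurs."""
    # each active process engine emits its own product: pm_0, pm_1, pm_2
    return [d * c for d, c in zip(new3, coeffs3)]

pm = conv1x1_clock([0.5, 1.0, 2.0], [3.0, 3.0, 3.0])  # -> [1.5, 3.0, 6.0]
```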

After all the data of an image have been processed by the convolution operation module 3, the interleaved summation unit 4, and the summation buffer unit 5, and the final result has been written back to the memory 1, the buffer device 2 sends a stop signal to the instruction decoder 71 and the control unit 7, notifying the control unit 7 that the computation is finished and that the device is awaiting the next processing instruction.

In this way, because each convolution unit of the convolution operation device retains part of the current data after a convolution, the buffer device fetches from memory only new data, which do not duplicate the current data, and feeds them to the convolution units. The processing efficiency of the convolution operation is therefore improved, making the device well suited to convolutions over streaming data. When handling data for convolution and continuous parallel operations, the device offers excellent computing performance and low power consumption, and can process data streams.
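A rough model of the memory-traffic saving this implies, not a description of the device itself: at stride 1, a 3×3 window shares six of its nine values with the previous position, so only three new values need to be fetched per output.

```python
def fetches_per_output(kernel=3, stride=1):
    """Compare memory fetches per output with and without data reuse."""
    naive = kernel * kernel   # refetch the whole window every step
    reuse = kernel * stride   # fetch only the newly exposed column(s)
    return naive, reuse

print(fetches_per_output())   # (9, 3): a 3x reduction in steady state
```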

The convolution operation method can be applied to or implemented in the convolution operation device of the foregoing embodiments; the related variations and implementations are therefore not repeated here. The method can also be applied to or implemented in other computing devices. For example, it can run on a processor capable of executing instructions: the instructions configured to carry out the method are stored in a memory, and the processor, coupled to the memory, executes them. Such a processor may include a cache that stores the data stream, an arithmetic unit that performs the convolution operations, and internal registers that retain part of the data of the current round of convolution inside the convolution operation module for the next round.

In summary, the convolution operation method provided by the present invention divides a large-scale convolution operation block into a plurality of small-scale convolution operation blocks, performs a convolution on each small-scale block to produce a partial result, and adds the partial results together as the convolution result of the large-scale block. The restriction to a particular convolution block size is thereby removed, and the result of a large-scale convolution can be obtained without any additional hardware resources.
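The complete method can be demonstrated end to end in a few lines of numpy. The sketch below is illustrative and the function names are our own: it splits a 5×5 kernel into zero-filled 3×3 sub-blocks, convolves each, shifts the partial results by the sub-block offsets, and superimposes them; the sum matches a direct 5×5 convolution.

```python
import numpy as np

def conv2d_valid(img, k):
    """Plain valid-mode 2D correlation, used as the reference."""
    kh, kw = k.shape
    H, W = img.shape
    out = np.zeros((H - kh + 1, W - kw + 1))
    for y in range(out.shape[0]):
        for x in range(out.shape[1]):
            out[y, x] = np.sum(img[y:y+kh, x:x+kw] * k)
    return out

def conv_by_splitting(img, k, small=3):
    """Large-kernel convolution via small-scale blocks, shift, and add."""
    KH, KW = k.shape
    H, W = img.shape
    out_h, out_w = H - KH + 1, W - KW + 1
    ny, nx = -(-KH // small), -(-KW // small)    # small blocks per axis
    # pad the input so the shifted small convolutions exist at every offset
    imgp = np.pad(img, ((0, ny*small - KH), (0, nx*small - KW)))
    out = np.zeros((out_h, out_w))
    for by in range(ny):
        for bx in range(nx):
            oy, ox = by*small, bx*small          # shift address of this block
            sub = np.zeros((small, small))       # zero-fill beyond the kernel
            blk = k[oy:oy+small, ox:ox+small]
            sub[:blk.shape[0], :blk.shape[1]] = blk
            partial = conv2d_valid(imgp, sub)
            out += partial[oy:oy+out_h, ox:ox+out_w]  # shift and superimpose
    return out

img = np.random.rand(8, 8)
k5 = np.random.rand(5, 5)
assert np.allclose(conv_by_splitting(img, k5), conv2d_valid(img, k5))
```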

The above description is merely illustrative and not restrictive. Any equivalent modification or alteration that does not depart from the spirit and scope of the present invention shall be included in the scope of the appended claims.

Claims (10)

1. A convolution operation method, comprising the following steps: determining a convolution operation mode according to the scale of a current convolution operation block; when the convolution operation mode is a splitting mode, the current convolution operation block being a large-scale convolution operation block, dividing the large-scale convolution operation block into a plurality of small-scale convolution operation blocks; when the convolution operation mode is a non-splitting mode, performing the convolution operation on the current convolution operation block without dividing it; performing a convolution operation on each of the small-scale convolution operation blocks to produce a respective partial result; and adding the partial results together as a convolution operation result of the large-scale convolution operation block.

2. The convolution operation method of claim 1, wherein the small-scale convolution operation blocks are of the same size.

3. The convolution operation method of claim 1, further comprising: filling with zeros the portions of the small-scale convolution operation blocks that extend beyond the large-scale convolution operation block.

4. The convolution operation method of claim 1, wherein in the step of performing the convolution operation, the small-scale convolution operation blocks are convolved by at least one convolution unit to produce the respective partial results, and the block scale of the small-scale convolution operation blocks equals the maximum convolution scale supported by the hardware of the convolution unit.

5. The convolution operation method of claim 1, wherein in the step of performing the convolution operation, the small-scale convolution operation blocks are convolved in parallel by a corresponding number of convolution units to produce the respective partial results.

6. The convolution operation method of claim 1, wherein the large-scale convolution operation block includes a plurality of filter coefficients, and the filter coefficients are assigned to the small-scale convolution operation blocks according to their arrangement order and the scale of the small-scale convolution operation blocks.

7. The convolution operation method of claim 1, wherein the large-scale convolution operation block includes a plurality of data, and the data are assigned to the small-scale convolution operation blocks according to their arrangement order and the scale of the small-scale convolution operation blocks.

8. The convolution operation method of claim 1, wherein the scale of the large-scale convolution operation block is 5×5 or 7×7 and the scale of the small-scale convolution operation blocks is 3×3.

9. The convolution operation method of claim 1, wherein the step of adding the partial results further comprises: providing a plurality of shift addresses to the small-scale convolution operation blocks, the partial results being moved within a coordinate system according to the shift addresses and superimposed on one another.

10. The convolution operation method of claim 1, further comprising: performing a partial operation of a subsequent layer of a convolutional neural network.

