TW202230229A - Computing circuit and data processing method based on convolution neural network and computer readable storage medium - Google Patents

Computing circuit and data processing method based on convolution neural network and computer readable storage medium

Info

Publication number: TW202230229A
Application number: TW110140625A
Authority: TW (Taiwan)
Prior art keywords: data, output data, output, memory, filter
Other languages: Chinese (zh)
Other versions: TWI840715B (en)
Inventors: 林文翔, 潘偉正, 林金岷
Original Assignee: 創惟科技股份有限公司
Priority: CN202210055056.1A (CN114781626A); US17/578,416 (US20220230055A1)
Granted as TWI840715B

Abstract

A computing circuit and a data processing method based on a convolutional neural network, and a computer-readable storage medium, are provided. Input data is obtained from the memory. A first computation is performed on a first part of the input data to obtain first output data. The first output data is buffered in a first buffer area. When the buffered first output data exceeds a first predetermined data amount, a second computation is performed on the first output data. The second output data is buffered in a second buffer area. Third output data, obtained by performing a third computation on the second output data, is output to the memory. While the second computation is performed on the first output data, the first computation continues on the input data. Accordingly, the number of accesses to the main memory can be reduced.

Description

Computing circuit, data processing method, and computer-readable storage medium based on a convolutional neural network

The present invention relates to machine learning (ML) technology, and more particularly, to a computing circuit, a data processing method, and a computer-readable storage medium based on a convolutional neural network (CNN).

Machine learning is an important topic in artificial intelligence (AI). It analyzes training samples to derive patterns from them and uses those patterns to make predictions about unknown data. The machine learning model constructed through this learning is then used to perform inference on the data to be evaluated.

There are many machine learning algorithms. For example, neural networks make decisions by simulating the operation of human brain cells. Among them, convolutional neural networks deliver better results in image and speech recognition, and have gradually become one of the most widely used and actively developed machine learning architectures.

Notably, in a convolutional layer of a convolutional neural network, a processing element slides a convolution kernel (or filter) over the input data and performs a specific operation. The processing element must repeatedly read input data and weight values from memory and write the operation results back to memory. Moreover, if different convolutional layers use kernels of different sizes or different convolution operations, the number of memory accesses increases substantially. For example, the MobileNet model combines convolution operations with depthwise separable convolution operations, and each of these operations requires its own memory accesses.

In view of this, embodiments of the present invention provide a computing circuit, a data processing method, and a computer-readable storage medium based on a convolutional neural network, which fuse multiple convolutional layers and thereby reduce the number of memory accesses.

The data processing method based on a convolutional neural network according to an embodiment of the present invention includes (but is not limited to) the following steps. Input data is read from a memory. A first operation is performed on a first part of the input data to obtain first output data. The first operation uses a first filter, and the size of the first output data is related to the size of the first filter and the size of the first part of the data. The first output data is temporarily stored in a first buffer area. When the first output data temporarily stored in the first buffer area exceeds a first predetermined data amount, a second operation is performed on the first output data to obtain second output data. The second operation uses a second filter, and the size of the second output data is related to the size of the second filter. The second output data is temporarily stored in a second buffer area. Third output data, obtained by performing a third operation on the second output data, is output to the memory. While the second operation is performed on the first output data, the first operation continues on the input data.

The computing circuit based on a convolutional neural network according to an embodiment of the present invention includes (but is not limited to) a memory and a processing element. The memory stores input data. The processing element is coupled to the memory and includes first, second, and third arithmetic units, a first temporary memory, and a second temporary memory. The first arithmetic unit performs a first operation on a first part of the input data to obtain first output data and temporarily stores the first output data in the first temporary memory of the processing element. The size of the first output data is related to the size of the first filter of the first operation and the size of the first part of the data. When the first output data temporarily stored in the first temporary memory meets the size required by the second operation, the second arithmetic unit performs the second operation on the first output data to obtain second output data and temporarily stores the second output data in the second temporary memory of the processing element. The second operation uses a second filter, and the size of the second output data is related to the size of the second filter. The third arithmetic unit outputs third output data, obtained by performing a third operation on the second output data, to the memory. While the second arithmetic unit performs the second operation, the first arithmetic unit continues to perform the first operation.

The computer-readable storage medium according to an embodiment of the present invention stores program code, and a processor loads the program code to execute the aforementioned data processing method based on a convolutional neural network.

Based on the above, the computing circuit, data processing method, and computer-readable storage medium based on a convolutional neural network according to embodiments of the present invention temporarily store output data in memory inside the processing element and trigger the operation of the next arithmetic unit (i.e., the next operation layer) according to its start condition. In this way, the next operation layer can start computing early, without waiting for the previous operation layer to finish processing all of its input data. In addition, embodiments of the present invention reduce the number of times input data is accessed from the memory.

To make the above features and advantages of the present invention more comprehensible, embodiments are described in detail below with reference to the accompanying drawings.

FIG. 1 is a block diagram of the components of a computing circuit 100 based on a convolutional neural network according to an embodiment of the present invention. Referring to FIG. 1, the computing circuit 100 includes (but is not limited to) a memory 110 and one or more processing elements (PE) 120.

The memory 110 may be a dynamic random access memory (DRAM), a flash memory, a register, a combinational logic circuit, or a combination of the above elements.

The processing element 120 is coupled to the memory 110. The processing element 120 includes (but is not limited to) a feature temporary memory 131, a first first-in first-out (FIFO) unit 132, a first weight temporary memory 133, a first arithmetic unit 135, a first temporary memory 151, a second weight temporary memory 153, a second arithmetic unit 155, a second temporary memory 171, a second FIFO unit 172, a third weight temporary memory 173, and a third arithmetic unit 175.

In one embodiment, the feature temporary memory 131, the first FIFO unit 132, the first weight temporary memory 133, and the first arithmetic unit 135 correspond to one convolution/operation layer. In addition, the first arithmetic unit 135 is provided with a first filter used for the first operation.

In one embodiment, the feature temporary memory 131 stores part or all of the input data from the memory 110; the first FIFO unit 132 inputs and/or outputs the data in the feature temporary memory 131 according to the first-in-first-out rule; the first weight temporary memory 133 stores the weights from the memory 110 (forming the first convolution kernel/filter); and the first arithmetic unit 135 performs the first operation. In one embodiment, the first operation is a convolution operation, detailed in subsequent embodiments. In another embodiment, the first operation may also be a depthwise separable convolution operation or another type of convolution operation.

In one embodiment, the first temporary memory 151, the second weight temporary memory 153, and the second arithmetic unit 155 correspond to one convolution/operation layer. In addition, the second arithmetic unit 155 is provided with a second filter used for the second operation.

In one embodiment, the first temporary memory 151 stores part or all of the data output by the first arithmetic unit 135; the second weight temporary memory 153 stores the weights from the memory 110 (forming the second convolution kernel/filter); and the second arithmetic unit 155 performs the second operation. In one embodiment, the second operation is a depthwise convolution operation, detailed in subsequent embodiments. In another embodiment, the second operation may also be a standard convolution operation or another type of convolution operation.

In one embodiment, the second temporary memory 171, the second FIFO unit 172, the third weight temporary memory 173, and the third arithmetic unit 175 correspond to one convolution/operation layer. In addition, the third arithmetic unit 175 is provided with a third filter used for the third operation.

In one embodiment, the second temporary memory 171 stores part or all of the data output by the second arithmetic unit 155; the second FIFO unit 172 inputs and/or outputs the data in the second temporary memory 171 according to the first-in-first-out rule; the third weight temporary memory 173 stores the weights from the memory 110 (forming the third convolution kernel/filter); and the third arithmetic unit 175 performs the third operation. In one embodiment, the third operation is a pointwise convolution operation, detailed in subsequent embodiments. In another embodiment, the third operation may also be a standard convolution operation or another type of convolution operation.

In one embodiment, the aforementioned feature temporary memory 131, first temporary memory 151, second temporary memory 171, first weight temporary memory 133, second weight temporary memory 153, and third weight temporary memory 173 may be static random access memory (SRAM), flash memory, registers, various types of buffers, or a combination of the above elements.

In one embodiment, some or all of the elements in the computing circuit 100 may form a neural network processing unit (NPU), a system on chip (SoC), or an integrated circuit (IC).

In one embodiment, the first arithmetic unit 135 has a first maximum computation amount per unit time, the second arithmetic unit 155 has a second maximum computation amount in the same unit time, and the third arithmetic unit 175 has a third maximum computation amount in the same unit time. The first maximum computation amount is greater than the second maximum computation amount, and the first maximum computation amount is greater than the third maximum computation amount.

Hereinafter, the method according to embodiments of the present invention is described in conjunction with the devices, elements, and modules in the computing circuit 100. Each step of the method may be adjusted according to the implementation, and is not limited thereto.

FIG. 2 is a flowchart of a data processing method based on a convolutional neural network according to an embodiment of the present invention. Referring to FIG. 2, the processing element 120 reads input data from the memory 110 (step S210). Specifically, the input data may be the data of some or all of the pixels in an image (e.g., color level, brightness, or gray level). Alternatively, the input data may be a data set related to speech, text, patterns, or other modalities.

There are many ways to read the input data. In one embodiment, the processing element 120 reads all of the input data and uses it as the first part of the data. In another embodiment, the processing element 120 reads a portion of the input data at a time, sized according to the amount of data required by the first operation or the capacity of the feature temporary memory 131, and uses that portion as the first part of the data.
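
As an illustrative sketch (not the patent's implementation), such partial reads can be modeled as overlapping row bands whose height is bounded by the feature buffer; the function name and the band/step parameters are assumptions:

```python
import numpy as np

def read_in_bands(data, band_h, step):
    # Yield height-band_h slices of an (H, W, C) array; a step smaller
    # than band_h makes consecutive bands overlap, as a stride-1
    # convolution over row bands would require.
    h = data.shape[0]
    for top in range(0, h - band_h + 1, step):
        yield data[top:top + band_h]

f = np.zeros((8, 32, 16))              # input data F
bands = list(read_in_bands(f, 3, 1))   # six (3, 32, 16) first-part reads
```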

FIG. 3 is a schematic diagram of input data F according to an embodiment of the present invention. Referring to FIG. 3, assume the size of the input data F is (height H, width W, number of channels C), and the size of the first part data F_fi1 read by the processing element 120 is (height H_fi1, width W_fi1, number of channels C). The height H_fi1 may be less than or equal to H, and the width W_fi1 may be less than or equal to W.

It should be noted that the input data may be stored in specific blocks or locations in the memory 110, but embodiments of the present invention do not limit the storage location of each element of the input data in the memory 110.

The feature temporary memory 131 stores part or all of the input data from the memory 110; that is, it stores the first part of the data. The first arithmetic unit 135 performs the first operation on the first part of the input data to obtain the first output data (step S230). Specifically, the first operation applies the corresponding weights to the first part of the data (e.g., a convolution operation). The size of the first output data is related to the size of the first filter of the first operation and the size of the first part of the data.

For example, FIG. 4 is a schematic diagram of the first operation according to an embodiment of the present invention. Referring to FIG. 4, the first operation is exemplified by a convolution of the first part data F_fi1 (of height H_fi1, width W_fi1, and channel count C_fi1) with the first filter K_n (of height H_kd and width W_kd). If H_fi1 is greater than or equal to H_kd and W_fi1 is greater than or equal to W_kd, the first arithmetic unit 135 may trigger the first operation. The result of the first operation (i.e., the first output data F_fo1) has height H_f1o = H_fi1 - H_kd + 1, width W_f1o = W_fi1 - W_kd + 1, and a channel count C_f1o equal to C_fi1.

For another example, if the size (height, width, channels) of the first part data F_fi1 is (3, 32, 16), and the size (height, width) of the first filter K_n is (3, 3), then the size (height, width, channels) of the first output data F_fo1 is (1, 30, 16).
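
A minimal sketch of this size relation for a stride-1, unpadded convolution (the helper name is illustrative, not from the patent):

```python
def conv_output_size(h_in, w_in, h_k, w_k):
    # Valid, stride-1 convolution: each spatial dimension shrinks by
    # (kernel size - 1); the channel count follows the filter count.
    return h_in - h_k + 1, w_in - w_k + 1

print(conv_output_size(3, 32, 3, 3))  # (1, 30) -> F_fo1 is (1, 30, 16)
```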

In one embodiment, the first arithmetic unit 135 adopts a systolic array structure. The first arithmetic unit 135 divides the first part of the data into multiple first systolic array inputs and performs the first operation on each of them to obtain multiple first systolic array outputs. The size of each first systolic array output is limited by the size of the systolic array. For example, the number of elements in a first systolic array output is less than or equal to the capacity of the systolic array. The first systolic array outputs based on the same first part of the data together form the first output data.

For example, FIG. 5A and FIG. 5B are schematic diagrams of systolic array inputs and outputs according to an embodiment of the present invention. Referring to FIG. 5A and FIG. 5B, assume the size of the systolic array is (M_sa × N_sa). The systolic array output SA_1o has height H_a1o = 1, width W_a1o up to M_sa, and channel count C_a1o up to N_sa. Accordingly, the first arithmetic unit 135 divides the first part of the data into the systolic array input SA_1i (of height H_a1i, width W_a1i, and channel count C_a1i) and the systolic array input SA_2i (of height H_a2i, width W_a2i, and channel count C_a2i). These two systolic array inputs SA_1i and SA_2i are each convolved with the per-channel weights of the filter K_n, yielding the systolic array outputs SA_1o and SA_2o. The systolic array output SA_2o has height H_a2o = 1, width W_a2o less than or equal to M_sa, and channel count C_a2o less than or equal to N_sa.

For another example, suppose the size (height, width, channels) of the first part of the data is (3, 32, 16), the size of the systolic array is 16×16, and the size of the filter K_n is 3×3. The systolic array output SA_1o has height H_a1o = 1, width W_a1o = 16, and channel count C_a1o = 16, while the systolic array input SA_1i has height H_a1i = 3, width W_a1i = 18, and channel count C_a1i = 16. After separating the systolic array input SA_1i from the first part of the data, the first arithmetic unit 135 obtains the systolic array input SA_2i, which has height H_a2i = 3, width W_a2i = 16, and channel count C_a2i = 16. The corresponding systolic array output SA_2o has height H_a2o = 1, width W_a2o = 14 (i.e., the width W_a2i minus the width of the filter K_n, plus 1), and channel count C_a2o = 16.
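
A sketch of that width split, assuming each systolic array pass produces at most 16 output columns and a 3-wide kernel needs 2 extra input columns per tile (names are illustrative):

```python
def split_into_tiles(w_in, w_k, tile_out=16):
    # Yield (input start, input width, output width) per tile for a
    # stride-1 valid convolution; adjacent tiles overlap by w_k - 1.
    w_out = w_in - w_k + 1
    j = 0
    while j < w_out:
        out_w = min(tile_out, w_out - j)
        yield j, out_w + w_k - 1, out_w
        j += out_w

print(list(split_into_tiles(32, 3)))  # [(0, 18, 16), (16, 16, 14)]
```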

For another example, Tables (1) to (3) show the first, second, and fifteenth pieces of first part data stored in the feature temporary memory 131 (the remaining pieces follow by analogy). I(i1, j1, n1) denotes the value of the read input data at position (height i1, width j1, channel n1). The first FIFO unit 132 feeds this data to the first arithmetic unit 135 from right to left and from top to bottom.

Table (1)
I(0,2,15)~I(0,2,0) | I(0,1,15)~I(0,1,0) | I(0,0,15)~I(0,0,0)
I(1,2,15)~I(1,2,0) | I(1,1,15)~I(1,1,0) | I(1,0,15)~I(1,0,0)
I(2,2,15)~I(2,2,0) | I(2,1,15)~I(2,1,0) | I(2,0,15)~I(2,0,0)

Table (2)
I(0,3,15)~I(0,3,0) | I(0,2,15)~I(0,2,0) | I(0,1,15)~I(0,1,0)
I(1,3,15)~I(1,3,0) | I(1,2,15)~I(1,2,0) | I(1,1,15)~I(1,1,0)
I(2,3,15)~I(2,3,0) | I(2,2,15)~I(2,2,0) | I(2,1,15)~I(2,1,0)

Table (3)
I(0,17,15)~I(0,17,0) | I(0,16,15)~I(0,16,0) | I(0,15,15)~I(0,15,0)
I(1,17,15)~I(1,17,0) | I(1,16,15)~I(1,16,0) | I(1,15,15)~I(1,15,0)
I(2,17,15)~I(2,17,0) | I(2,16,15)~I(2,16,0) | I(2,15,15)~I(2,15,0)

Table (4) shows the data of the 16-channel 3×3 filters used for the convolution operation. F_dn(i2, j2, n2) denotes the value of the read n-th filter at position (height i2, width j2, channel n2).

Table (4)
Channel 0 | Channel 1 | … | Channel 14 | Channel 15
F_d0(2,2,15) | F_d1(2,2,15) | … | F_d14(2,2,15) | F_d15(2,2,15)
F_d0(2,2,0) | F_d1(2,2,0) | … | F_d14(2,2,0) | F_d15(2,2,0)
⋮ | ⋮ | ⋮ | ⋮ | ⋮
F_d0(0,0,0) | F_d1(0,0,0) | … | F_d14(0,0,0) | F_d15(0,0,0)

Table (5) shows the systolic array output. A(i3, j3, n3) denotes the value of the systolic array output at position (height i3, width j3, channel n3).

Table (5)
A(0,0,0) | A(0,0,1) | … | A(0,0,14) | A(0,0,15)
A(0,1,0) | A(0,1,1) | … | A(0,1,14) | A(0,1,15)
⋮ | ⋮ | ⋮ | ⋮ | ⋮
A(0,15,0) | A(0,15,1) | … | A(0,15,14) | A(0,15,15)

Its mathematical expression is:

A(i3, j3, n3) = Σ_{p=0}^{2} Σ_{q=0}^{2} Σ_{c=0}^{15} I(i3+p, j3+q, c) × F_dn3(p, q, c) …(1)
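
Formula (1) written out as an unoptimized loop, a sketch assuming the 3×3×16 filters and the index convention above:

```python
import numpy as np

def conv_element(I, F, i3, j3, n3):
    # A(i3, j3, n3) per formula (1): a 3x3 window of I, summed over all
    # 16 input channels and weighted by filter n3.
    acc = 0.0
    for p in range(3):           # filter height
        for q in range(3):       # filter width
            for c in range(16):  # input channels
                acc += I[i3 + p, j3 + q, c] * F[n3, p, q, c]
    return acc

I = np.random.rand(3, 18, 16)     # one systolic-array input tile
F = np.random.rand(16, 3, 3, 16)  # 16 filters of size 3x3x16
row = [[conv_element(I, F, 0, j, n) for n in range(16)] for j in range(16)]
# 'row' matches Table (5): a 1x16x16 systolic array output.
```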

FIG. 6A is a schematic diagram of the systolic array output SA_3o according to an embodiment of the present invention. Referring to FIG. 6A, the systolic array output SA_3o has height H_a3o = 1, width W_a3o = 16, and channel count C_a3o = 16. In mathematical expression (1), i3 ∈ {0} indexes the height, j3 ∈ {0, …, 15} indexes the width, and n3 ∈ {0, …, 15} indexes the filter output channel.

FIG. 6B is a schematic diagram of the systolic array output SA_4o according to an embodiment of the present invention. Referring to FIG. 6B, the systolic array output SA_4o has height H_a4o = 1, width W_a4o = 14, and channel count C_a4o = 16. In mathematical expression (1), i3 ∈ {0}, j3 ∈ {16, …, 29}, and n3 ∈ {0, …, 15}. In addition, the completed output 601 is the systolic array output SA_3o of FIG. 6A, and the current processing output 602 is the systolic array output SA_4o.

By analogy, FIG. 6C is a schematic diagram of the systolic array output SA_5o according to an embodiment of the present invention. Referring to FIG. 6C, the systolic array output SA_5o has height H_a5o = 1, width W_a5o = 14, and channel count C_a5o = 16. In mathematical expression (1), i3 ∈ {4}, j3 ∈ {16, …, 29}, and n3 ∈ {0, …, 15}. In addition, the current processing output 602 is the systolic array output SA_5o. These systolic array outputs SA_3o to SA_5o can form one or more pieces of first output data.

In one embodiment, the first operation is a convolution operation, and the first arithmetic unit 135 reads the first part of the data from the input data stored in the memory 110 along a first sliding direction. The first arithmetic unit 135 divides the input data into multiple sections and successively reads the next section along the first sliding direction, parallel to the height of the input data, as the next first part of the data.

For example, FIG. 7A is a schematic diagram illustrating reading the input data F_fi1 and F_fi2 according to an embodiment of the present invention. Referring to FIG. 7A, once the first operation on the first part data F_fi1 is finished, the first arithmetic unit 135 treats F_fi1 as the completed input 701 and reads the first part data F_fi2 of the next section of the input data F along the direction D1 (e.g., downward in the figure) as the current processing input 702.

FIG. 7B is a schematic diagram of the first output data F_fo1 and F_fo2 according to an embodiment of the present invention. Referring to FIG. 7B, the first output data F_fo1 is the output of the convolution operation on the first part data F_fi1 of FIG. 7A and serves as the completed output 703. The first output data F_fo2 is the output of the convolution operation on the first part data F_fi2 of FIG. 7A and serves as the current processing output 704. The first output data F_fo2 is likewise arranged below the first output data F_fo1, following the direction D1 of FIG. 7A.

FIG. 7C is a schematic diagram illustrating reading the input data F_fi3 according to an embodiment of the present invention. Referring to FIG. 7C, once the completed input 701 has reached the bottom of the input data F, the first arithmetic unit 135 moves along the direction D2 (e.g., rightward in the figure) and reads, from top to bottom (corresponding to the direction D1 of FIG. 7A), the first part data F_fi3 of the next section of the input data F as the current processing input 702.

FIG. 7D is a schematic diagram of the first output data F_fo3 according to an embodiment of the present invention. Referring to FIG. 7D, the first output data F_fo3 is the output of the convolution operation on the first part data F_fi3 of FIG. 7C and serves as the current processing output 704. Similarly, the current processing output 704 is arranged to the right of the completed output 703.

FIG. 7E is a schematic diagram illustrating reading the input data F_fi4 according to an embodiment of the present invention. Referring to FIG. 7E, the first part data F_fi4, serving as the current processing input 702, is the last section of the input data.

FIG. 7F is a schematic diagram of the first output data F_fo4 according to an embodiment of the present invention. Referring to FIG. 7F, the first output data F_fo4 is the output of the convolution operation on the first part data F_fi4 of FIG. 7E and serves as the current processing output 704. Similarly, the current processing output 704 is arranged below the completed output 703, which completes the convolution operation on the input data F.

In another embodiment, the first arithmetic unit 135 reads the first part of the data from the input data stored in the memory 110 along a second sliding direction (different from the first sliding direction). Similarly, the first arithmetic unit 135 divides the input data into multiple sections and successively reads the next section along the second sliding direction, parallel to the width of the input data, as the next first part of the data.

For example, FIG. 8 is a schematic diagram illustrating reading the input data according to an embodiment of the present invention. Referring to FIG. 8, once the first operation on the first part data F_fi1 is finished, the first arithmetic unit 135 treats F_fi1 as the completed input 701 and reads the first part data F_fi6 of the next section of the input data F along the direction D2 (e.g., rightward in the figure) as the current processing input 702. Similarly, once the last section of the same row has been read along the direction D2, the first arithmetic unit 135 reads the section below the first part data F_fi1. The arrangement of the first part data F_fi1 and the other first part data (not shown) follows the preceding description and is not repeated here.
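
The two sliding orders differ only in which loop advances first; a sketch over a pre-split grid of sections (the grid representation is an assumption):

```python
def height_major(sections):
    # FIGS. 7A-7F: walk down each column of sections, then move right.
    for c in range(len(sections[0])):
        for r in range(len(sections)):
            yield sections[r][c]

def width_major(sections):
    # FIG. 8: walk across each row of sections, then move down.
    for r in range(len(sections)):
        for c in range(len(sections[0])):
            yield sections[r][c]
```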

Referring to FIG. 2, the first arithmetic unit 135 temporarily stores one or more pieces of first output data in the first buffer area of the first temporary memory 151 (step S250). Specifically, unlike the prior art, which outputs the first output data to the memory 110, in the embodiments of the present invention the first output data is output to the first temporary memory 151 of the second arithmetic unit 155, thereby reducing the number of accesses to the memory 110.

When the first output data temporarily stored in the first temporary memory 151 (or the first buffer area) exceeds a first predetermined data amount, the second arithmetic unit 155 performs the second operation on the first output data to obtain the second output data (step S270). Specifically, in the existing multi-convolutional-layer architecture, the next convolutional layer must wait until the previous convolutional layer has operated on all of its input data and written the results to the main memory before it reads that output from the main memory as its input. Unlike the prior art, in addition to buffering in storage media other than the memory 110 (e.g., the first temporary memory 151 or the second temporary memory 171), the embodiments of the present invention trigger the convolution operation of the next convolutional layer whenever the amount of input data required by that layer (i.e., the first predetermined data amount) is available. Meanwhile, if the previous convolutional layer has not yet finished operating on all of its input data, the operations of the two convolutional layers can proceed simultaneously. In other words, while the second arithmetic unit 155 performs the second operation on the first output data, the first arithmetic unit 135 continues to perform the first operation on the input data.
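
A toy model of this fused trigger (stand-in operations, not the patent's arithmetic): stage 1 keeps producing output rows, and stage 2 fires whenever its buffer holds a full filter window:

```python
from collections import deque

def conv_row(row):        # stand-in for the first operation on one band
    return row

def depthwise_row(rows):  # stand-in for the second operation on a window
    return sum(rows)

def fused_pipeline(bands, kernel_h=3):
    buf = deque(maxlen=kernel_h)   # first buffer area (keeps last rows)
    out2 = []
    for band in bands:             # the first operation never stalls...
        buf.append(conv_row(band))
        if len(buf) == kernel_h:   # ...and the second operation fires
            out2.append(depthwise_row(list(buf)))  # on each full window
    return out2

print(fused_pipeline(range(6)))    # [3, 6, 9, 12]
```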

It is worth noting that the second part of the data input to the second operation includes the first output data temporarily stored in the first temporary memory 151, and the size of the second output data is related to the size of the second filter of the second operation. Assume the second operation is a depthwise convolution operation. Each filter of a depthwise convolution operation corresponds to exactly one channel of the second part of the data; that is, any one filter convolves the data of only one channel. Therefore, the number of filters in a depthwise convolution operation usually equals the number of channels of the second part of the data. In contrast, each filter of a standard convolution operation convolves the data of all channels. Moreover, as soon as the accumulated first output data reaches the filter's height and the filter's width, the filter can perform the depthwise convolution operation on this accumulated first output data (as the second part of the data).

In one embodiment, assume each filter used in the depthwise convolution operation has height H_kd and width W_kd, and the first output data of each section has height H_f1o and width W_f1o. When the first output data temporarily stored in the first temporary memory 151 or the first buffer area exceeds W_kd × H_kd, the second arithmetic unit 155 can perform the second operation. When the first output data temporarily stored in the first temporary memory 151 or the first buffer area exceeds the first predetermined data amount, the accumulated first output data forms a block of height M_H × H_f1o and width M_W × W_f1o, where the multiples M_H and M_W are positive integers, M_H × H_f1o is not less than H_kd, and M_W × W_f1o is not less than W_kd. In other words, while the height M_H × H_f1o of the accumulated first output data is less than the filter height H_kd or its width M_W × W_f1o is less than the filter width W_kd, the second arithmetic unit 155 keeps waiting for the next piece of first output data or systolic array output, until the accumulated height M_H × H_f1o is greater than or equal to H_kd and the accumulated width M_W × W_f1o is greater than or equal to W_kd.
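
That start condition reduces to a simple predicate, sketched here with the same symbols (M_H and M_W count the stacked stage-1 output blocks):

```python
def second_op_ready(m_h, h_f1o, m_w, w_f1o, h_kd, w_kd):
    # Fire the second operation once the stacked first output data
    # covers at least one full filter window in height and in width.
    return m_h * h_f1o >= h_kd and m_w * w_f1o >= w_kd

print(second_op_ready(3, 1, 1, 16, 3, 3))  # True: three 1x16 rows vs 3x3
```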

For example, FIG. 9A is a schematic diagram illustrating the trigger condition of the second operation according to an embodiment of the present invention, and FIG. 9B is a schematic diagram of the temporarily stored first output data according to an embodiment of the present invention. Referring to FIG. 9A, the completed input 901 in the input data corresponds to the completed output 903 of the first output data. If the combined size of the current processing output 904 (corresponding to the current processing input 902) and the completed output 903 meets the size required by the second operation, the second operation can be triggered.

Referring to FIG. 9B, assume the completed output 903 and the current processing output 904 of FIG. 9A form the temporarily stored first output data F_tfo. The systolic array used by the first arithmetic unit 135 has size 16×16, so the width of a systolic array output may be W_tfo1 = 16 or W_tfo2 = 14. Assume each filter used in the depthwise convolution operation has height 3 and width 3; both widths W_tfo1 and W_tfo2 already exceed 3. Once the fifth systolic array output is temporarily stored in the first temporary memory 151, the first through fifth systolic array outputs, each of size (height, width, channels) = (1, 16, 16) or (1, 14, 16) (i.e., the channel count C_tfo is 16), together form a region that satisfies the 3×3 size. That is, three systolic array outputs of height 1 are stacked so that the stacked height is 3. At this point, these systolic array outputs can serve as the second part of the data and be used by the second operation.

It should be noted that in FIG. 9A and FIG. 9B, the second operation is triggered as soon as the number of stacked rows equals the filter height. In other embodiments, however, the number of stacked rows may also exceed the filter height.

For the depthwise convolution operation, FIG. 10A is a schematic diagram of the second operation according to an embodiment of the present invention. Referring to FIG. 10A, assume the size (height, width, channels) of the second part data F_si1 is (5, 30, 16), and the filters F_d used in the depthwise convolution operation have size 3×3. I(i4, j4, n4) denotes the value of the second part of the data at position (height i4, width j4, channel n4), and F_dn4(i5, j5) denotes the value of the read filter n4 at position (height i5, width j5). A(i4, j4, n4) denotes the value of the second output data or systolic array output at position (height i4, width j4, channel n4), and its mathematical expression is:

A(i4, j4, n4) = Σ_{p=0}^{2} Σ_{q=0}^{2} I(i4+p, j4+q, n4) × F_dn4(p, q) …(2)
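
Formula (2) as code, a sketch following the (5, 30, 16) example, where each channel is convolved only with its own 3×3 filter:

```python
import numpy as np

def depthwise_conv(I, Fd):
    # Valid, stride-1 depthwise convolution per formula (2).
    # I: (H, W, C); Fd: (C, kh, kw), one kh x kw filter per channel.
    C, kh, kw = Fd.shape
    H, W, _ = I.shape
    out = np.zeros((H - kh + 1, W - kw + 1, C))
    for n4 in range(C):
        for i4 in range(H - kh + 1):
            for j4 in range(W - kw + 1):
                out[i4, j4, n4] = np.sum(
                    I[i4:i4 + kh, j4:j4 + kw, n4] * Fd[n4])
    return out

out = depthwise_conv(np.random.rand(5, 30, 16), np.random.rand(16, 3, 3))
print(out.shape)  # (3, 28, 16): three rows like F_so1, F_so2, F_so3
```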

FIG. 10B is a schematic diagram of the second output data F_so1 according to an embodiment of the present invention. Referring to FIG. 10B, assume the size (height H_so1, width W_so1, channels C_so1) of the currently processed second output data F_so1 is (1, 28, 16). Each value in the second output data F_so1 follows formula (2) with i4 = 0:

A(0, 0, n4) = Σ_{p=0}^{2} Σ_{q=0}^{2} I(p, q, n4) × F_dn4(p, q) …(3)
A(0, 1, n4) = Σ_{p=0}^{2} Σ_{q=0}^{2} I(p, 1+q, n4) × F_dn4(p, q) …(4)
…
A(0, 27, n4) = Σ_{p=0}^{2} Σ_{q=0}^{2} I(p, 27+q, n4) × F_dn4(p, q) …(5)

The rest follow by analogy and are not repeated here.

FIG. 10C is a schematic diagram of the second output data F_so2 according to an embodiment of the present invention. Referring to FIG. 10C, the completed output 101 is the second output data F_so1 of FIG. 10B. The second output data F_so2 is the current processing output 102, and its size may be the same as that of F_so1 in FIG. 10B. Each value in F_so2 follows formula (2) with i4 = 1:

A(1, 0, n4) = Σ_{p=0}^{2} Σ_{q=0}^{2} I(1+p, q, n4) × F_dn4(p, q) …(6)
A(1, 1, n4) = Σ_{p=0}^{2} Σ_{q=0}^{2} I(1+p, 1+q, n4) × F_dn4(p, q) …(7)
…
A(1, 27, n4) = Σ_{p=0}^{2} Σ_{q=0}^{2} I(1+p, 27+q, n4) × F_dn4(p, q) …(8)

The rest follow by analogy and are not repeated here.

FIG. 10D is a schematic diagram of the second output data F_so3 according to an embodiment of the present invention. Referring to FIG. 10D, the second output data F_so3 is the current processing output 102, and its size may be the same as that of F_so1 in FIG. 10B. Each value in F_so3 follows formula (2) with i4 = 2:

A(2, 0, n4) = Σ_{p=0}^{2} Σ_{q=0}^{2} I(2+p, q, n4) × F_dn4(p, q) …(9)
A(2, 1, n4) = Σ_{p=0}^{2} Σ_{q=0}^{2} I(2+p, 1+q, n4) × F_dn4(p, q) …(10)
…
A(2, 27, n4) = Σ_{p=0}^{2} Σ_{q=0}^{2} I(2+p, 27+q, n4) × F_dn4(p, q) …(11)

The rest follow by analogy and are not repeated here.

In one embodiment, the second arithmetic unit 155 adopts a systolic array structure. The second arithmetic unit 155 divides the second part of the data (i.e., part of the temporarily stored first output data) into multiple second systolic array inputs and performs the second operation on each of them to obtain multiple second systolic array outputs. The size of each second systolic array output is limited by the size of the systolic array. For example, the number of elements in a second systolic array output is less than or equal to the capacity of the systolic array. The second systolic array outputs based on the same second part of the data together form the second output data. Taking FIG. 10B as an example, if the size of the systolic array is 16×16, the second output data F_so1 comprises second systolic array outputs of sizes 1×16×16 and 1×12×16.

For the next convolutional layer, FIG. 11 is a flowchart of a data processing method based on a convolutional neural network according to an embodiment of the present invention. Referring to FIG. 11, in one embodiment, the second arithmetic unit 155 may temporarily store one or more pieces of second output data in the second buffer area of the second temporary memory 171 (step S111). Specifically, and similarly to the above, the embodiments of the present invention buffer the output of the previous convolutional layer in the buffer of the next convolutional layer instead of outputting the data directly to the memory 110.

When the second output data temporarily stored in the second temporary memory 171 or the second buffer area exceeds a second predetermined data amount, the third arithmetic unit 175 may perform the third operation on the second output data to obtain the third output data (step S113). Specifically, the third part of the data input to the third operation includes the second output data temporarily stored in the second temporary memory 171, and the size of the third part of the data is related to the filter size of the third operation. Assume the third operation is a pointwise convolution operation, whose filters each have size 1×1. As with a standard convolution operation, each filter of a pointwise convolution operation convolves the data of all channels. Moreover, as soon as the accumulated second output data reaches the filter height (1) and the filter width (1), the filter can perform the pointwise convolution operation on this accumulated second output data (as the third part of the data).
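
Because a 1×1 filter mixes channels at a single position, the pointwise stage is just a matrix product along the channel axis; a sketch assuming 16 filters over 16 channels:

```python
import numpy as np

def pointwise_conv(I, Fp):
    # Third operation: I is (H, W, C_in), Fp is (C_out, C_in); each
    # output pixel is a weighted sum over input channels only.
    return np.tensordot(I, Fp, axes=([2], [1]))

out = pointwise_conv(np.random.rand(1, 28, 16), np.random.rand(16, 16))
print(out.shape)  # (1, 28, 16), matching F_to in FIG. 13B
```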

In one embodiment, as shown in FIG. 10B to FIG. 10D, every piece of second output data already satisfies the size required by the pointwise convolution operation. Therefore, the second FIFO unit 172 may sequentially input each piece of second output data to the third arithmetic unit 175, and the third arithmetic unit 175 may perform the third operation on each temporarily stored piece of second output data.

For example, FIG. 12A is a schematic diagram of the buffered first output data F tfo according to an embodiment of the present invention, and FIG. 12B is a schematic diagram of the buffered second output data F tso according to an embodiment of the present invention. Referring to FIG. 12A and FIG. 12B, once the second operator 155 has completed the convolution on a part of the buffered first output data, the second buffer memory 171 can buffer a second systolic array output of size (height, width, channels) = (1, W so21, C so2) or second output data of size (1, W so21+W so22, C so2), which accordingly becomes the buffered second output data F tso. The channel number C so2 is the same as the channel number C tf2.

FIG. 13A is a schematic diagram of the third operation according to an embodiment of the present invention. Referring to FIG. 13A, the third operator 175 takes the buffered second output data F tso of FIG. 12B as the third partial data F ti (whose width W so3 is W so21+W so22) and performs the third operation on the third partial data F ti with the filter F p used by the pointwise convolution.

FIG. 13B is a schematic diagram of the third output data F to according to an embodiment of the present invention. Referring to FIG. 13A and FIG. 13B, the third output data F to has the same size as the third partial data F ti; that is, the width W to1 is the same as the width W so3, and the channel number C to1 is the same as the channel number C so2.

In one embodiment, the third operator 175 adopts a systolic array structure. The third operator 175 divides the third partial data into multiple third systolic array inputs and performs the third operation on each of these inputs to obtain multiple third systolic array outputs. The size of each third systolic array output is limited by the size of the systolic array; for example, the number of elements of a third systolic array output is less than or equal to the capacity of the systolic array. The third systolic array outputs derived from the same third partial data (i.e., part of the buffered second output data) together form the third output data. For example, if the third partial data is 1×28×16 and the systolic array is 16×16, the third output data consists of a 1×16×16 and a 1×12×16 third systolic array output.
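One plausible reading of the capacity limit is that a (1, W, C) piece occupies one array column per spatial position and one row per channel. The following check encodes that reading; the mapping itself is our assumption, since the text only states that the element count must not exceed the array's capacity:

```python
import numpy as np

def fits_systolic_array(piece, array_rows=16, array_cols=16):
    # piece: (1, W, C); under the assumed mapping it fits when W <= columns
    # and C <= rows of the systolic array.
    _, w, c = piece.shape
    return w <= array_cols and c <= array_rows

third_partial = np.zeros((1, 28, 16))
pieces = [third_partial[:, :16, :], third_partial[:, 16:, :]]
print([fits_systolic_array(p) for p in pieces])   # [True, True]
```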

For example, FIG. 14A is a schematic diagram of the systolic array output SA 6o according to an embodiment of the present invention. Referring to FIG. 14A, Table (6) lists the second output data stored in the second buffer memory 171:

Table (6)
I(0,0,15)   I(0,0,14)   …  I(0,0,1)   I(0,0,0)
I(0,1,15)   I(0,1,14)   …  I(0,1,1)   I(0,1,0)
…
I(0,15,15)  I(0,15,14)  …  I(0,15,1)  I(0,15,0)

I(i6,j6,n6) denotes the value of the read input data at position (height i6, width j6, channel n6). The second first-in-first-out unit 172 feeds these data to the third operator 175 in order from right to left and from top to bottom.
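The right-to-left, top-to-bottom feed order of Table (6) amounts to iterating channels from 0 upward within each width position, then advancing to the next width position. A small generator sketch (the function name is our own):

```python
def fifo_feed_order(height, width, channels):
    # Mirrors the second FIFO unit: within a row of Table (6), cells are read
    # right to left (channel 0 first); rows are read top to bottom.
    for i in range(height):
        for j in range(width):
            for n in range(channels):
                yield (i, j, n)

# First few (i6, j6, n6) indices for a 1x16x16 tile:
print(list(fifo_feed_order(1, 16, 16))[:4])  # [(0,0,0), (0,0,1), (0,0,2), (0,0,3)]
```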

Table (7) lists the data of the sixteen 1×1 filters used by the pointwise convolution (each filter produces one output channel):

Table (7)
Filter 0       Filter 1       …  Filter 14       Filter 15
F p0(0,0,15)   F p1(0,0,15)   …  F p14(0,0,15)   F p15(0,0,15)
…
F p0(0,0,1)    F p1(0,0,1)    …  F p14(0,0,1)    F p15(0,0,1)
F p0(0,0,0)    F p1(0,0,0)    …  F p14(0,0,0)    F p15(0,0,0)

F dn(i7,j7,n7) denotes the value read from the n-th filter at position (height i7, width j7, channel n7).

Table (9) shows the systolic array output:

Table (9)
A(0,0,0)   A(0,0,1)   …  A(0,0,14)   A(0,0,15)
A(0,1,0)   A(0,1,1)   …  A(0,1,14)   A(0,1,15)
…
A(0,15,0)  A(0,15,1)  …  A(0,15,14)  A(0,15,15)

A(i6,j6,n6) denotes the value of the systolic array output at position (height i6, width j6, channel n6), and its mathematical expression is:
A(i6,j6,n6) = I(i6,j6,0)×F dn6(0,0,0)+I(i6,j6,1)×F dn6(0,0,1)+…+I(i6,j6,15)×F dn6(0,0,15)…(12).

Therefore, the values of the systolic array output SA 6o are (n6 ∈ 0~15):
A(0,0,n6) = I(0,0,0)×F dn6(0,0,0)+I(0,0,1)×F dn6(0,0,1)+…+I(0,0,15)×F dn6(0,0,15)…(13);
A(0,1,n6) = I(0,1,0)×F dn6(0,0,0)+I(0,1,1)×F dn6(0,0,1)+…+I(0,1,15)×F dn6(0,0,15)…(14);
A(0,15,n6) = I(0,15,0)×F dn6(0,0,0)+I(0,15,1)×F dn6(0,0,1)+…+I(0,15,15)×F dn6(0,0,15)…(15), and so on for the rest, which are not repeated here.
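Equations (13) to (15) are 16-term dot products over the channel axis. A NumPy sketch that reproduces them for the whole 16×16 tile (the array names are our assumptions):

```python
import numpy as np

I = np.random.rand(1, 16, 16)  # buffered second output data of Table (6)
F = np.random.rand(16, 16)     # F[n6, c] = F dn6(0, 0, c), the 1x1 filter weights

A = np.zeros((1, 16, 16))
for j6 in range(16):
    for n6 in range(16):
        # Equations (13)-(15): sum of I(0, j6, c) x F dn6(0, 0, c) over c.
        A[0, j6, n6] = sum(I[0, j6, c] * F[n6, c] for c in range(16))

# The loop agrees with a single tensor contraction over the channel axis.
assert np.allclose(A, np.einsum('hjc,nc->hjn', I, F))
```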

For another example, FIG. 14B is a schematic diagram of the systolic array output SA 7o according to an embodiment of the present invention. Referring to FIG. 14B, Table (10) lists the second output data stored in the second buffer memory 171:

Table (10)
I(0,16,15)  I(0,16,14)  …  I(0,16,1)  I(0,16,0)
I(0,17,15)  I(0,17,14)  …  I(0,17,1)  I(0,17,0)
…
I(0,27,15)  I(0,27,14)  …  I(0,27,1)  I(0,27,0)

Table (11) shows the systolic array output:

Table (11)
A(0,16,0)  A(0,16,1)  …  A(0,16,14)  A(0,16,15)
A(0,17,0)  A(0,17,1)  …  A(0,17,14)  A(0,17,15)
…
A(0,27,0)  A(0,27,1)  …  A(0,27,14)  A(0,27,15)

Therefore, the values of the systolic array output SA 7o are (n6 ∈ 0~15):
A(0,16,n6) = I(0,16,0)×F dn6(0,0,0)+I(0,16,1)×F dn6(0,0,1)+…+I(0,16,15)×F dn6(0,0,15)…(16);
A(0,17,n6) = I(0,17,0)×F dn6(0,0,0)+I(0,17,1)×F dn6(0,0,1)+…+I(0,17,15)×F dn6(0,0,15)…(17);
A(0,27,n6) = I(0,27,0)×F dn6(0,0,0)+I(0,27,1)×F dn6(0,0,1)+…+I(0,27,15)×F dn6(0,0,15)…(18), and so on for the rest, which are not repeated here. In addition, the systolic array output SA 6o of FIG. 14A is the completed output 141, and the systolic array output SA 7o is the currently processed output 142.

For yet another example, FIG. 14C is a schematic diagram of the systolic array output SA 8o according to an embodiment of the present invention. Referring to FIG. 14C, Table (12) lists the second output data stored in the second buffer memory 171:

Table (12)
I(2,16,15)  I(2,16,14)  …  I(2,16,1)  I(2,16,0)
I(2,17,15)  I(2,17,14)  …  I(2,17,1)  I(2,17,0)
…
I(2,27,15)  I(2,27,14)  …  I(2,27,1)  I(2,27,0)

Table (13) shows the systolic array output:

Table (13)
A(2,16,0)  A(2,16,1)  …  A(2,16,14)  A(2,16,15)
A(2,17,0)  A(2,17,1)  …  A(2,17,14)  A(2,17,15)
…
A(2,27,0)  A(2,27,1)  …  A(2,27,14)  A(2,27,15)

Therefore, the values of the systolic array output SA 8o, the last currently processed output 142, are (n6 ∈ 0~15):
A(2,16,n6) = I(2,16,0)×F dn6(0,0,0)+I(2,16,1)×F dn6(0,0,1)+…+I(2,16,15)×F dn6(0,0,15)…(19);
A(2,17,n6) = I(2,17,0)×F dn6(0,0,0)+I(2,17,1)×F dn6(0,0,1)+…+I(2,17,15)×F dn6(0,0,15)…(20);
A(2,27,n6) = I(2,27,0)×F dn6(0,0,0)+I(2,27,1)×F dn6(0,0,1)+…+I(2,27,15)×F dn6(0,0,15)…(21), and so on for the rest, which are not repeated here.

In one embodiment, while the third operator 175 runs the third operation, the first operator 135 and the second operator 155 keep running the first operation and the second operation, respectively. That is, as long as the first operator 135 and the second operator 155 have not finished processing all the input data, the operations of the first operator 135, the second operator 155, and the third operator 175 can proceed concurrently.
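The concurrency described here resembles a three-stage software pipeline in which each stage reads from the buffer in front of it and writes to the buffer behind it. The thread-and-queue sketch below is only an analogy for the hardware behavior; the simple arithmetic lambdas stand in for the convolution, depthwise, and pointwise operations:

```python
import queue
import threading

results = []

def stage(fn, src, dst):
    # One operator: consume tiles from its input buffer, transform them, and
    # hand them downstream without touching main memory until the last stage.
    while True:
        item = src.get()
        if item is None:                  # end-of-stream marker
            if dst is not None:
                dst.put(None)
            break
        out = fn(item)
        if dst is not None:
            dst.put(out)
        else:
            results.append(out)           # last stage writes back to "memory"

feed = queue.Queue()
buf1 = queue.Queue(maxsize=4)             # stands in for the first buffer memory
buf2 = queue.Queue(maxsize=4)             # stands in for the second buffer memory
threads = [
    threading.Thread(target=stage, args=(lambda t: t + 1, feed, buf1)),
    threading.Thread(target=stage, args=(lambda t: t * 2, buf1, buf2)),
    threading.Thread(target=stage, args=(lambda t: t - 3, buf2, None)),
]
for t in threads:
    t.start()
for tile in range(8):
    feed.put(tile)
feed.put(None)
for t in threads:
    t.join()
print(results)   # same as [(t + 1) * 2 - 3 for t in range(8)]
```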

Referring to FIG. 2, finally, the third operator 175 outputs the third output data obtained from the third operation to the memory 110 (step S290).

To make the complete flow easier to follow, another embodiment is described below. FIG. 15 is a flowchart of a data processing method for the MobileNet architecture according to an embodiment of the present invention. Referring to FIG. 15, the first operator 135 reads, from the input data in the memory 110, data spanning the width of a defined section as the first partial data (step S1501). The first operator 135 determines whether the number of lines processed so far is greater than or equal to (the size of the first filter − 1) and whether the remainder of dividing this line count by the first stride used in the first operation equals 1 (step S1503). If the condition of step S1503 is met, the first first-in-first-out unit 132 outputs the first partial data to the first operator 135 in sequence (step S1505). The first operator 135 reads the weights of the first filter used by the convolution from the memory 110 (step S1511), performs the convolution (step S1513), and outputs the resulting first output data to the first buffer memory 151 (step S1515). The first operator 135 determines whether all data of the current line in the current section (whose size is the same as the filter size) has undergone the convolution (step S1517). If this data has not finished the convolution, the first first-in-first-out unit 132 continues to output the first partial data to the first operator 135 (step S1505). If all data of this line has undergone the convolution, the first operator 135 determines whether all data of all lines in the current section has undergone the convolution (step S1518). If one or more lines in the current section have not finished the convolution, or the condition of step S1503 is not met, the first operator 135 continues with the next line of the input data in the memory 110 (step S1507). If all data of every line in the current section has undergone the convolution, the first operator 135 determines whether all data of all sections has undergone the convolution (step S1519). If some section still has data that has not finished the convolution, the first operator 135 continues with the data of the next section (step S1509). In addition, the first operator 135 resets the current line count to zero and sets the current width count to: original width count + section width − (first stride used in the first operation − 1 + second stride used in the second operation − 1). Once all data of all sections has undergone the convolution, the first operator 135 has completed all convolutions on the input data (step S1520).
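The gating test of step S1503 can be expressed compactly. The function below is a sketch under our reading of the step; the name and the exact interpretation of the remainder test are assumptions:

```python
def should_feed_first_operator(line_count, first_filter_size, first_stride):
    # Step S1503: at least (filter size - 1) lines have been read, and the
    # line count leaves remainder 1 when divided by the first stride.
    return (line_count >= first_filter_size - 1
            and line_count % first_stride == 1)

# With a 3-line filter and stride 2, lines 3, 5, 7, 9, ... trigger the feed.
print([n for n in range(10) if should_feed_first_operator(n, 3, 2)])  # [3, 5, 7, 9]
```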

The second operator 155 determines whether the first output data buffered in the first buffer memory 151 exceeds the first predetermined data amount (step S1521). Taking a 3×3 second filter as an example, the second operator 155 checks whether three lines of first output data have been buffered. The second operator 155 also determines whether the remainder of dividing the number of lines already processed by the first operation by the second stride used in the second operation equals zero (step S1523). If the remainder is zero, the second operator 155 reads the buffered first output data as the second partial data (step S1525).
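Steps S1521 and S1523 together form the trigger for the depthwise stage. A sketch of that trigger, with our own function name and defaults:

```python
def should_run_depthwise(buffered_lines, lines_processed,
                         second_filter_height=3, second_stride=1):
    # S1521: the first buffer holds a full filter window (three lines for a
    # 3x3 filter). S1523: the first operation's line count is aligned with
    # the second stride (remainder zero).
    return (buffered_lines >= second_filter_height
            and lines_processed % second_stride == 0)

print(should_run_depthwise(buffered_lines=3, lines_processed=4, second_stride=2))  # True
print(should_run_depthwise(buffered_lines=2, lines_processed=4, second_stride=2))  # False
```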

If the conditions of steps S1521 and S1523 are not met, the second operator 155 determines whether the data currently processed is the first piece of first partial data (step S1531). If the condition of step S1531 is not met, all second operations end (step S1540). Otherwise, the second operator 155 reads the weights of the second filter used by the depthwise convolution from the memory 110 (step S1533), performs the depthwise convolution (step S1535), and outputs the resulting second output data to the second buffer memory 171 (step S1537). The second operator 155 determines whether all data of all lines in the current section has undergone the depthwise convolution (step S1538). If one or more lines in the current section have not finished the convolution, the second operator 155 shifts to the next point index (e.g., one second stride away) (step S1527) and processes the next piece of data. Once all data of all lines in the first buffer memory 151 has undergone the depthwise convolution, the second operator 155 sets the current line count to: original line count + 1, and resets the current width count to zero (step S1539). The second operator 155 then completes all depthwise convolutions on the second input data (step S1540).
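For reference, the depthwise convolution performed in step S1535 applies one kernel per channel with no summation across channels. A minimal loop-based NumPy sketch (for clarity only; it is not the systolic-array implementation of the circuit):

```python
import numpy as np

def depthwise_conv(x, f, stride=1):
    # x: (H, W, C); f: (kh, kw, C) -- one kernel per channel.
    kh, kw, _ = f.shape
    oh = (x.shape[0] - kh) // stride + 1
    ow = (x.shape[1] - kw) // stride + 1
    y = np.zeros((oh, ow, x.shape[2]))
    for i in range(oh):
        for j in range(ow):
            window = x[i * stride:i * stride + kh, j * stride:j * stride + kw, :]
            y[i, j, :] = (window * f).sum(axis=(0, 1))   # per-channel sums only
    return y

y = depthwise_conv(np.random.rand(3, 30, 16), np.random.rand(3, 3, 16))
print(y.shape)   # (1, 28, 16)
```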

The third operator 175 determines whether the second output data buffered in the second buffer memory 171 has accumulated to one full line of second output data (step S1541). The third operator 175 reads the buffered second output data as the third partial data (step S1543). The third operator 175 reads the weights of the third filter used by the pointwise convolution from the memory 110 (step S1551), performs the pointwise convolution (step S1553), and outputs the resulting third output data to the memory 110 (step S1555). The third operator 175 determines whether all data of all lines in the second buffer memory 171 has undergone the pointwise convolution. If this data has not finished the pointwise convolution, the second first-in-first-out unit 172 continues to output the third partial data to the third operator 175. Once all data of all lines in the second buffer memory 171 has finished the pointwise convolution, the third operator 175 completes all pointwise convolutions on the third partial data (step S1560).
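Steps S1541 to S1555 amount to draining the second buffer one line at a time through the 1×1 convolution and writing each result back. A sketch of that loop (the container types and names are our assumptions):

```python
import numpy as np

def third_stage(second_buffer, filters, memory_out):
    # Whenever a full line of second output data is buffered (S1541), read it
    # (S1543), apply the pointwise convolution (S1551-S1553), and write the
    # result to memory (S1555).
    while second_buffer:
        line = second_buffer.pop(0)                      # FIFO order
        memory_out.append(np.einsum('hwc,kc->hwk', line, filters))

memory = []
third_stage([np.random.rand(1, 28, 16)], np.random.rand(16, 16), memory)
print(memory[0].shape)   # (1, 28, 16)
```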

An embodiment of the present invention further provides a computer-readable storage medium (for example, a hard disk, an optical disc, a flash memory, a solid state disk (SSD), or other storage media) for storing program code. The computing circuit 100 or another processor can load the program code and accordingly execute the corresponding flow of one or more data processing methods of the embodiments of the present invention. These flows can be found in the description above and are not repeated here.

In summary, in the convolutional-neural-network-based computing circuit, data processing method, and computer-readable storage medium of the embodiments of the present invention, the first output data and/or the second output data are buffered rather than written to the memory, and the second operation and/or the third operation can start as soon as the buffered data reaches the size required by that operation. In this way, the number of memory accesses can be reduced and computing efficiency improved.

Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the present invention. Anyone with ordinary skill in the art may make some changes and refinements without departing from the spirit and scope of the present invention; therefore, the protection scope of the present invention shall be defined by the appended claims.

100: computing circuit
110: memory
120: processing element
131: feature buffer memory
151: first buffer memory
171: second buffer memory
132: first first-in-first-out unit
172: second first-in-first-out unit
133: first weight buffer
153: second weight buffer
173: third weight buffer
135: first operator
155: second operator
175: third operator
S210~S290, S111~S113, S1501~S1560: steps
F: input data
H, H fi1, H kd, H f1o, H a1o, H a2o, H a1i, H a2i, H a3o, H s1o: height
W, W fi1, W kd, W f1o, W a1o, W a2o, W a1i, W a2i, W a3o, W a4o, W a5o, W tfo1, W tfo2, W s1o, W so21, W so22, W so3, W to1: width
C, C fi1, C f1o, C a1o, C a2o, C a1i, C a2i, C a3o, C tfo, C s1o, C so2, C to1: number of channels
F fi1, F fi2, F fi3, F fi4, F fi6: first partial data
K n, F d, F p: filters
F fo1, F fo2, F fo3, F fo4, F tfo: first output data
SA 1i, SA 2i: systolic array inputs
SA 1o, SA 2o, SA 3o, SA 4o, SA 5o, SA 6o, SA 7o, SA 8o: systolic array outputs
601, 703, 903, 141: completed output
602, 704, 904, 142: currently processed output
D1, D2: directions
701, 901: completed input
702, 902: currently processed input
i4: height position
j4: width position
n4: channel position
I(,,,), A(,,,): values
F dn4(,,): weights
F si1: second partial data
F so1, F so2, F so3, F tso: second output data
F ti: third partial data
F to: third output data

FIG. 1 is a block diagram of components of a computing circuit based on a convolutional neural network according to an embodiment of the present invention.
FIG. 2 is a flowchart of a data processing method based on a convolutional neural network according to an embodiment of the present invention.
FIG. 3 is a schematic diagram of input data according to an embodiment of the present invention.
FIG. 4 is a schematic diagram of a first operation according to an embodiment of the present invention.
FIG. 5A and FIG. 5B are schematic diagrams of systolic array input and output according to an embodiment of the present invention.
FIG. 6A to FIG. 6C are schematic diagrams of systolic array outputs according to an embodiment of the present invention.
FIG. 7A is a schematic diagram illustrating the reading of input data according to an embodiment of the present invention.
FIG. 7B is a schematic diagram of first output data according to an embodiment of the present invention.
FIG. 7C is a schematic diagram illustrating the reading of input data according to an embodiment of the present invention.
FIG. 7D is a schematic diagram of first output data according to an embodiment of the present invention.
FIG. 7E is a schematic diagram illustrating the reading of input data according to an embodiment of the present invention.
FIG. 7F is a schematic diagram of first output data according to an embodiment of the present invention.
FIG. 8 is a schematic diagram illustrating the reading of input data according to an embodiment of the present invention.
FIG. 9A is a schematic diagram illustrating a trigger condition of the second operation according to an embodiment of the present invention.
FIG. 9B is a schematic diagram of buffered first output data according to an embodiment of the present invention.
FIG. 10A is a schematic diagram of a second operation according to an embodiment of the present invention.
FIG. 10B to FIG. 10D are schematic diagrams of second output data according to an embodiment of the present invention.
FIG. 11 is a flowchart of a data processing method based on a convolutional neural network according to an embodiment of the present invention.
FIG. 12A is a schematic diagram of buffered first output data according to an embodiment of the present invention.
FIG. 12B is a schematic diagram of buffered second output data according to an embodiment of the present invention.
FIG. 13A is a schematic diagram of a third operation according to an embodiment of the present invention.
FIG. 13B is a schematic diagram of third output data according to an embodiment of the present invention.
FIG. 14A to FIG. 14C are schematic diagrams of systolic array outputs according to an embodiment of the present invention.
FIG. 15 is a flowchart of a data processing method for the MobileNet architecture according to an embodiment of the present invention.

S210~S290: Steps

Claims (13)

1. A data processing method based on a convolutional neural network (CNN), comprising:
reading input data from a memory;
performing a first operation on first partial data of the input data to obtain first output data, wherein the first operation is provided with a first filter, and the size of the first output data is related to the size of the first filter of the first operation and the size of the first partial data;
buffering the first output data in a first buffer area;
when the first output data buffered in the first buffer area exceeds a first predetermined data amount, performing a second operation on the first output data to obtain second output data, wherein the second operation is provided with a second filter, and the size of the second output data is related to the size of the second filter of the second operation;
buffering the second output data in a second buffer area; and
outputting third output data, obtained by performing a third operation on the second output data, to the memory,
wherein the first operation continues to be performed on the input data while the second operation is performed on the first output data.

2. The data processing method based on a convolutional neural network as claimed in claim 1, wherein the second operation is different from the third operation, and the step of buffering the second output data in the second buffer area comprises:
when the second output data buffered in the second buffer area exceeds a second predetermined data amount, performing the third operation on the second output data to obtain the third output data, wherein the third operation is provided with a third filter, and the size of the third output data is related to the size of the third filter.

3. The data processing method based on a convolutional neural network as claimed in claim 2, wherein the step of outputting the third output data obtained by performing the third operation on the second output data to the memory comprises:
continuing the first operation and the second operation while the third operation is performed on the second output data.

4. The data processing method based on a convolutional neural network as claimed in claim 1, wherein the second operation is a depthwise convolution operation, the height of the second filter is H kd, the width of the second filter is W kd, the height of the first output data is H f1o, the width of the first output data is W f1o, H kd, W kd, H f1o, and W f1o are positive integers, and the step of performing the second operation on the first output data comprises:
performing the second operation when the first output data buffered in the first buffer area exceeds W kd×H kd, wherein the maximum amount of data buffered in the first buffer area has a height of M H×H f1o and a width of M W×W f1o, M H and M W are multiples and positive integers, M H×H f1o is not smaller than H kd, and M W×W f1o is not smaller than W kd.

5. The data processing method based on a convolutional neural network as claimed in claim 2, wherein the third operation is a pointwise convolution operation, the height and the width of the third filter are both 1, and the step of performing the third operation on the third input data comprises:
performing the third operation on each piece of the buffered second output data.

6. The data processing method based on a convolutional neural network as claimed in claim 4, wherein the first operation is a convolution operation, and the step of reading the input data from the memory comprises:
reading the first partial data from the input data along a first sliding direction, wherein the first sliding direction is parallel to the height of the input data.

7. The data processing method based on a convolutional neural network as claimed in claim 4, wherein the first operation is a convolution operation, and the step of reading the input data from the memory comprises:
reading the first partial data from the input data along a second sliding direction, wherein the second sliding direction is parallel to the width of the input data.

8. The data processing method based on a convolutional neural network as claimed in claim 2, wherein the step of performing the first operation on the first partial data of the input data, performing the second operation on the first output data, or performing the third operation on the second output data comprises:
dividing the first partial data into a plurality of first systolic array inputs; and
performing the first operation on each of the first systolic array inputs to obtain a plurality of first systolic array outputs, wherein the first systolic array outputs constitute the first output data.

9. A computing circuit based on a convolutional neural network, comprising:
a memory, configured to store input data; and
a processing element, coupled to the memory and comprising:
a first operator, configured to perform a first operation on first partial data of the input data to obtain first output data, and to buffer the first output data in a first buffer memory of the processing element, wherein the first operation is provided with a first filter, and the size of the first output data is related to the size of the first filter of the first operation and the size of the first partial data;
a second operator, configured to, when the first output data buffered in the first buffer memory exceeds a first predetermined data amount, perform a second operation on the first output data to obtain second output data, and to buffer the second output data in a second buffer memory, wherein the second operation is provided with a second filter, and the size of the second output data is related to the size of the second filter of the second operation;
the second buffer memory, configured to store the second output data; and
a third operator, configured to output third output data, obtained by performing a third operation on the second output data, to the memory,
wherein the first operator continues the first operation while the second operator performs the second operation.

10. The computing circuit based on a convolutional neural network as claimed in claim 9, wherein the first operator has a first maximum computation amount per unit time, the second operator has a second maximum computation amount per unit time, the third operator has a third maximum computation amount per unit time, the first maximum computation amount is greater than the second maximum computation amount, and the first maximum computation amount is greater than the third maximum computation amount.

11. The computing circuit based on a convolutional neural network as claimed in claim 9, wherein the first operator and the second operator keep running the first operation and the second operation while the third operator runs the third operation.

12. The computing circuit based on a convolutional neural network as claimed in claim 9, wherein the first buffer memory and the second buffer memory are static random access memories.

13. A computer-readable storage medium storing program code, the program code being loaded by a processor to execute the data processing method based on a convolutional neural network as claimed in any one of claims 1 to 8.
TW110140625A 2021-01-21 2021-11-01 Computing circuit and data processing method based on convolution neural network and computer readable storage medium TWI840715B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202210055056.1A CN114781626A (en) 2021-01-21 2022-01-18 Arithmetic circuit, data processing method, and computer-readable storage medium
US17/578,416 US20220230055A1 (en) 2021-01-21 2022-01-18 Computing circuit and data processing method based on convolutional neural network and computer readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US202163139809P 2021-01-21 2021-01-21
US63/139,809 2021-01-21

Publications (2)

Publication Number Publication Date
TW202230229A true TW202230229A (en) 2022-08-01
TWI840715B TWI840715B (en) 2024-05-01


Also Published As

Publication number Publication date
TW202230113A (en) 2022-08-01
TWI792665B (en) 2023-02-11
