TWI798591B - Convolutional neural network operation method and device - Google Patents
Description
The present application relates to the field of data processing technology, and in particular to a convolutional neural network operation method and device.
Deep learning is one of the key technologies driving the development of artificial intelligence (AI) and is widely used in fields such as computer vision and speech recognition. Among deep-learning techniques, the convolutional neural network (CNN) is an efficient recognition technology that has attracted attention in recent years: raw image or speech data is input directly and convolved with the data of multiple feature filters over several layers of convolution and vector operations, producing highly accurate results in image and speech recognition.
However, with the development and wide application of convolutional neural networks, the challenges they face keep growing. For example, the parameter scale of CNN models is getting larger and larger, and network structures are complex and variable: a CNN model often contains multiple convolutional layers, and the depth of each convolutional layer and the size of its convolution kernels differ from layer to layer. In the shallower layers of a CNN, the plane size of the input data to be processed tends to be large while its size in the channel direction is small; as the network gets deeper, the depth of some convolution kernels in the channel direction grows, or the number of convolution kernels in a convolutional layer increases. Consequently, a multiply-accumulate operation array composed of multiple MACs (multiply-accumulate cells) in an electronic device faces a huge amount of data to compute. The processing capability an electronic device provides is limited; that is, the maximum amount of data that can be fed into the multiply-accumulate operation array in one round of operation is fixed. For example, if the multiply-accumulate operation array of an electronic device has a processing capacity of 256, the array contains 256 multipliers, so at most 256 weight values can be multiplied simultaneously with their corresponding 256 input data values. Since the input data is generally far larger than 256, the convolution kernels and the input data must each be split into multiple data blocks that are processed one by one. In the prior art, however, the convolution kernels and input data are split in the same way for every convolutional layer, which fails to make effective use of the hardware resources of the electronic device. How to improve the resource utilization of a hardware accelerator during computation has therefore become an urgent problem to be solved.
The present application provides a convolutional neural network operation method and device aimed at improving the resource utilization of hardware accelerators.
The present application provides a convolutional neural network operation method applied to a convolutional neural network operation device that includes a multiply-accumulate operation array, the array comprising a plurality of multiply-accumulate operation units. The method includes: determining the number of target convolution kernels in a target convolutional layer and first size information of the target convolution kernels; determining a target scheduling mode according to the number of target convolution kernels and the first size information, wherein the target scheduling mode corresponds to the size of a convolution calculation block; reorganizing the weight data in the target convolution kernels according to the target scheduling mode and outputting the reorganized weight data to the multiply-accumulate operation array; reorganizing the input data of the target convolutional layer according to the target scheduling mode and outputting the reorganized input data to the multiply-accumulate operation array; and performing a multiply-accumulate operation with the multiply-accumulate operation array based on the reorganized weight data and the reorganized input data, wherein the number of multiply-accumulate operation units used in each round of the array's operation corresponds to the size of the convolution calculation block.
The present application further provides a convolutional neural network operation device, which includes a scheduling mode unit, a first data processing unit, a second data processing unit, and a multiply-accumulate operation array. The scheduling mode unit determines a target scheduling mode according to the number of target convolution kernels and first size information, wherein the target scheduling mode corresponds to the size of a convolution calculation block. The first data processing unit reorganizes the weight data in the target convolution kernels according to the target scheduling mode. The second data processing unit reorganizes the input data of the target convolutional layer according to the target scheduling mode. The multiply-accumulate operation array includes a plurality of multiply-accumulate operation units and performs multiply-accumulate operations based on the reorganized weight data and the reorganized input data, wherein the number of multiply-accumulate operation units used in each round of the array's operation corresponds to the size of the convolution calculation block.
With the convolutional neural network operation scheme provided by the embodiments of the present application, the target scheduling mode can be adjusted dynamically for the convolutional layers of a network that have different structures, so that each convolutional layer splits the input data to be processed and the target convolution kernels into data blocks using a scheduling mode matched to the structure of the multiply-accumulate operation array. The number of weight values in each split weight data block and of input values in each input data block then maximizes the use of the array's computing resources, which improves the resource utilization of the hardware accelerator as a whole and in turn increases the operation speed of the convolutional neural network.
101, 102, 103, 104, 105: steps
00~09, 10~17: weight data blocks
0~41, 47, 53, 59, 65, 71: input data blocks
40: convolutional neural network operation device
401: mode scheduling unit
402: weight data processing unit
403: feature data processing unit
404: multiply-accumulate operation array
405: register
406: cache
407: memory
408: direct memory access controller
409: cache
K1, K2, K3, KM: convolution kernels
R, W, w: width
C, D, d: depth
S, H, h: height
In order to explain the technical solutions in the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application; those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flowchart of the convolutional neural network operation method provided by an embodiment of the present application; FIG. 2 is a schematic diagram of the data structure of the input data and convolution kernels of a convolutional layer; FIG. 3 is a schematic diagram of splitting the input data and convolution kernel data of a convolutional layer in an embodiment; and FIG. 4 is a schematic block diagram of the convolutional neural network operation device provided by an embodiment of the present application applied to an electronic device.
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. The same reference numerals denote the same elements, and the principles of the present invention are illustrated as implemented in a suitable application environment. All other embodiments obtained by those skilled in the art based on the embodiments in this application without inventive effort fall within the protection scope of this application.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the present application. The appearances of this term at various places in the specification do not necessarily all refer to the same embodiment, nor to independent or alternative embodiments that are mutually exclusive with other embodiments. Those skilled in the art understand, explicitly or implicitly, that the embodiments described herein can be combined with other embodiments.
The embodiments of the present application provide a convolutional neural network operation method. The method may be executed by the convolutional neural network operation device provided by the embodiments of the present application, or by an electronic device that integrates that device. In practice, the convolutional neural network operation device can be implemented in hardware, in software, or in a combination of hardware and software.
The convolutional neural network operation scheme provided by the embodiments of the present application can be applied to a convolutional neural network (hereinafter CNN) of any structure, for example a CNN with only a single convolutional layer, as well as complex CNNs containing up to hundreds of convolutional layers or more. In addition, the CNN in the embodiments of the present application may also have pooling layers, fully connected layers, and so on. In other words, the scheme of the embodiments is not limited to one particular convolutional neural network: any neural network that contains convolutional layers can be regarded as a "convolutional neural network" in this application, and its convolutional-layer portion can be computed according to the embodiments of the present application.
It should be noted that the convolutional neural network of the embodiments of the present application can be applied in many scenarios, for example image recognition fields such as face recognition and license plate recognition, feature-extraction fields such as image feature extraction and speech feature extraction, speech recognition, natural language processing, and so on. Images, or feature data converted from other forms of data, are fed into a pre-trained convolutional neural network, which then performs the computation for classification, recognition, or feature extraction.
Please refer to FIG. 1, a schematic flowchart of the convolutional neural network operation method provided by an embodiment of the present application, and to FIG. 4, a schematic block diagram of the convolutional neural network operation device provided by an embodiment of the present application applied to an electronic device. The convolutional neural network operation device 40 can be used to implement the convolutional neural network operation method of FIG. 1. The specific steps of the method and the operation of the device 40 are described below.
In step 101, the number of target convolution kernels in a target convolutional layer and first size information of the target convolution kernels are determined.
For an electronic device that integrates the convolutional neural network operation device, a convolutional layer performs convolution operations on input data and convolution kernel data to obtain output data. The input data may be a raw image, speech data, or the output of a previous convolutional or pooling layer; within the convolutional neural network operation device the input data is generally feature data, so the input data of the convolutional neural network operation device 40 can be the feature data of the target convolutional layer.
The input data can have multiple channels, and the data on each channel can be understood as two-dimensional data. When the number of channels is greater than 1, the input data can be understood as stereoscopic data formed by stacking the two-dimensional data of multiple channels, whose depth equals the number of channels. The target convolutional layer (the convolutional layer currently awaiting the convolution operation) can contain one or more convolution kernels, also called filters. The number of channels of each convolution kernel equals the number of channels of that layer's input data, and the number of convolution kernels equals the number of channels of the target convolutional layer's output data. That is, convolving the input data with one convolution kernel yields one piece of two-dimensional data; when the target convolutional layer has multiple convolution kernels, the two-dimensional outputs of all the kernels are stacked into one piece of three-dimensional output data.
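The channel relationships described above (each kernel is as deep as the input has channels; the output has one channel per kernel) can be illustrated with a naive convolution in Python. This is a minimal sketch for clarity, not the patent's hardware implementation:

```python
import numpy as np

def conv_layer(x, kernels, stride=1):
    """Naive convolution: x has shape (C, W, H); kernels has shape (M, D, R, S)
    with D == C. Each kernel yields one 2-D map, so the stacked output has M
    channels, matching the text above."""
    C, W, H = x.shape
    M, D, R, S = kernels.shape
    assert D == C, "kernel depth must equal the input channel count"
    W_out = (W - R) // stride + 1
    H_out = (H - S) // stride + 1
    y = np.zeros((M, W_out, H_out))
    for k in range(M):
        for i in range(W_out):
            for j in range(H_out):
                patch = x[:, i * stride:i * stride + R, j * stride:j * stride + S]
                y[k, i, j] = np.sum(patch * kernels[k])  # sum over all channels
    return y

x = np.ones((3, 8, 8))           # C=3 channels, 8x8 plane
kernels = np.ones((5, 3, 3, 3))  # M=5 kernels, each 3x3 with depth 3
out = conv_layer(x, kernels)
print(out.shape)  # (5, 6, 6): five 2-D maps stacked into a 3-D output
```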
When performing computations based on a convolutional neural network, the mode scheduling unit 401 determines the target convolutional layer from the convolutional neural network currently used for the computation. In practice, the mode scheduling unit 401 can obtain information about the target convolutional layer from the configuration register 405, for example which layer is the target convolutional layer, how many convolution kernels it has, and the plane size of those kernels and their depth in the channel direction.
Please refer to FIG. 2, a schematic diagram of the data structure of a convolutional layer's input data and convolution kernels. The convolutional layer in FIG. 2 contains M convolution kernels, K1, K2, K3, ..., KM. These M kernels have the same size, D×R×S, where, as shown, D represents the depth of a kernel in the channel direction and R×S represents its size in the plane direction. The size of the input data is C×W×H, where C represents the depth of the input data in the channel direction; in practice, C=D. W×H represents the size of the input data in the plane direction.
Since the size and number of convolution kernels may differ from one convolutional layer to another, in the embodiments of the present application, when the computation of a target convolutional layer begins, the number of target convolution kernels in that layer and the size and/or depth information of the target convolution kernels are determined first.
In step 102, a target scheduling mode is determined according to the number of target convolution kernels and the first size information, wherein the target scheduling mode corresponds to the size of a convolution calculation block.
Because the number of multiply-accumulate operation units in the multiply-accumulate operation array is limited while the amount of parameters in a convolutional layer (for example the data volume of its convolution kernels) is huge, a single convolutional layer may require many rounds of array operations to complete. The convolution kernels and the input data therefore need to be split into multiple data blocks, and in each round a certain number of weight data blocks and a corresponding number of input data blocks are fed into the multiply-accumulate operation array for the multiply-accumulate operation. To use the computing resources of the multiply-accumulate operation array 404 effectively, the mode scheduling unit 401 can determine a target scheduling mode according to the number of target convolution kernels and their size information. In practice, the mode scheduling unit 401 can select the target scheduling mode from a plurality of preset scheduling modes according to the number of target convolution kernels and their size information, where each scheduling mode corresponds to the size of a specific convolution calculation block, the smallest unit in which the convolution operation is performed. The target scheduling mode selected by the mode scheduling unit 401 makes the most effective use of the multiply-accumulate operation array 404; for example, when the convolutional neural network operation device 40 operates in the target scheduling mode, the array can complete the multiply-accumulate operations on the input data and the target convolution kernels in the fewest rounds.
In one embodiment, the size of the convolution calculation block corresponding to a scheduling mode can be the size of the data obtained in one round of the multiply-accumulate operation array's operation: m weight data blocks of size d×w×h taken from m target convolution kernels, where d represents the depth of a weight data block in the channel direction and w×h represents its size in the plane direction, with m, d, w, and h all positive integers. Setting a preset scheduling mode can be regarded as setting the specific values of m, d, w, and h, which requires weighing several factors.
First, the processing capability of the electronic device's multiply-accumulate operation array 404 must be considered, that is, the number of multiply-accumulate units in the array. For example, if the array 404 has a total of 256 multiply-accumulate units, at most 256 multiply-accumulate unit operations can be performed simultaneously in one round. Therefore, when m weight data blocks of size d×w×h are taken from m target convolution kernels in one round of the array's operation, m, d, w, and h must satisfy m×d×w×h ≤ 256.
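The capacity constraint m×d×w×h ≤ 256 can be checked mechanically. The sketch below assumes a 256-unit array as in the example; the helper function is illustrative, not part of the patent:

```python
MAC_UNITS = 256  # assumed per-round capacity of the multiply-accumulate array

def mac_utilization(m, d, w, h, mac_units=MAC_UNITS):
    """Fraction of MAC units a block shape (m, d, w, h) occupies in one round.
    The shape is only valid when m*d*w*h <= mac_units."""
    used = m * d * w * h
    if used > mac_units:
        raise ValueError("block shape exceeds the array's per-round capacity")
    return used / mac_units

print(mac_utilization(16, 16, 1, 1))  # 1.0: all 256 multipliers busy
print(mac_utilization(1, 16, 3, 3))   # 0.5625: 144 of 256 multipliers busy
```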
Second, the needs of the actual network must be considered, such as the size and number of convolution kernels in each convolutional layer. For example, some convolution kernels are 1×1×64 while others are 11×11×3; some convolutional layers may have 8 convolution kernels and others 2048. After weighing these parameters, convolution calculation blocks of different sizes are set to suit different network layers.
For example, with the processing capability of the multiply-accumulate units fixed, that is, under the condition m×d×w×h ≤ 256, when a convolutional layer has many convolution kernels of small depth, m can be set larger and d smaller, for example m=64, d=4, w=1, h=1, or m=16, d=16, w=1, h=1. Conversely, when a convolutional layer has few convolution kernels of large depth, m can be set smaller and d larger, for example m=4, d=64, w=1, h=1. Special settings can also be made for kernels of particular sizes; for example, for a 3×3 convolution kernel one can set m=1, d=32, w=3, h=3. In that case the computing resources of the multiply-accumulate units cannot be utilized one hundred percent, but a high utilization rate is still achieved. Referring again to FIG. 2, the convolutional layer of FIG. 2 contains M convolution kernels; in this example m is preferably a positive factor of M, that is, M is an integer multiple of m.
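The trade-off described above, larger m for many shallow kernels versus larger d for few deep ones, can be sketched as a selection over preset modes. The preset list and the selection rule below are assumptions echoing the examples in the text; the patent does not enumerate its presets:

```python
MAC_UNITS = 256
# Hypothetical preset modes (m, d, w, h), illustrative only.
PRESET_MODES = [(64, 4, 1, 1), (16, 16, 1, 1), (4, 64, 1, 1), (1, 16, 3, 3)]

def pick_mode(num_kernels, kernel_depth, modes=PRESET_MODES, mac_units=MAC_UNITS):
    """Choose a preset whose block fits the layer: m should divide the kernel
    count, d must not exceed the kernel depth, and the block must fit the
    array. Among candidates, prefer the highest per-round MAC usage."""
    fits = [(m, d, w, h) for (m, d, w, h) in modes
            if num_kernels % m == 0 and d <= kernel_depth
            and m * d * w * h <= mac_units]
    if not fits:
        raise ValueError("no preset scheduling mode fits this layer")
    return max(fits, key=lambda b: b[0] * b[1] * b[2] * b[3])

print(pick_mode(num_kernels=64, kernel_depth=4))  # (64, 4, 1, 1): many shallow kernels
print(pick_mode(num_kernels=4, kernel_depth=64))  # (4, 64, 1, 1): few deep kernels
```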
In practice, the most suitable convolution calculation block size can be evaluated in advance for various numbers and sizes of convolution kernels, multiple preset scheduling modes can be set accordingly, and a lookup table containing the mapping between convolution kernel parameters and preset scheduling modes can be built in a memory. The mode scheduling unit 401 can then look up the target scheduling mode in this table according to the number of target convolution kernels and their size information. In one embodiment, the mode scheduling unit 401 can be implemented by a processor executing program code, while information such as the data volume of the target convolutional layer's input data and the number and size of its convolution kernels is stored in the configuration register 405, from which the mode scheduling unit 401 obtains the number of target convolution kernels and their size information.
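The lookup-table idea can be sketched as a dictionary keyed by layer parameters. The keys and values below are illustrative assumptions; the patent only states that such a mapping is stored in memory, not its contents:

```python
MODE_TABLE = {
    # (num_kernels, kernel_depth, R, S) -> (m, d, w, h); illustrative entries
    (16, 32, 3, 3): (16, 16, 1, 1),
    (64, 4, 1, 1):  (64, 4, 1, 1),
    (4, 64, 1, 1):  (4, 64, 1, 1),
}

def lookup_mode(num_kernels, kernel_depth, r, s):
    """Mimic the mode scheduling unit querying the table with values read
    from the configuration register."""
    return MODE_TABLE[(num_kernels, kernel_depth, r, s)]

print(lookup_mode(16, 32, 3, 3))  # (16, 16, 1, 1)
```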
As shown in FIG. 4, during the convolution operation the convolutional neural network operation device 40 fetches the weight values and input data needed for each round from the memory 407 via the cache 409, and the intermediate results produced by the multiply-accumulate operation array 404 are temporarily stored in the cache 406. For an electronic device, the storage space allocated to the convolution operation is limited. For this reason, when presetting the scheduling modes, besides the network structure, the storage space occupied when computing with each scheduling mode must also be considered in order to set suitable modes. Therefore, in one embodiment, the mode scheduling unit 401 also determines the target scheduling mode according to the size of the cache 409 and/or the cache 406.
In step 103, the weight data in the target convolution kernels are reorganized according to the target scheduling mode, and the reorganized weight data are output to the multiply-accumulate operation array.
After the target scheduling mode is determined, the weight data processing unit 402 splits the weight data in the target convolution kernels and reorganizes them appropriately according to the target scheduling mode, so that the reorganized weight data are fed into the multiply-accumulate operation array 404 in a suitable order and the array 404 can complete the required convolution operation.
In practice, the weight data of the target convolution kernels can be stored in the memory 407, and the weight data processing unit 402 can read the weight data from the memory 407 via the cache 409 under the control of the direct memory access (DMA) controller 408.
In one embodiment, for each scheduling mode the weight data processing unit 402 is configured with a corresponding operation setting for reading and reorganizing the weight data; once the target scheduling mode is determined, the weight data processing unit 402 reads and reorganizes the weight data in the target convolution kernels with the operation setting corresponding to that mode. In practice, the weight data processing unit 402 can write the weight data of the target convolution kernels into a cache in their original order and then read them out of the cache in the required order, thereby reorganizing and reordering the weight data.
Please refer to FIG. 3, a schematic diagram of splitting a convolutional layer's input data and convolution kernel data in an embodiment. After the target scheduling mode is determined, suppose the convolution calculation block corresponding to that mode has size m×d×w×h; that is, in one round of the multiply-accumulate operation array's operation, the weight data processing unit 402 reads m weight data blocks of size d×w×h from the target convolution kernels and feeds the reorganized weight data into the multiply-accumulate operation array 404. In detail, the weight data processing unit 402 obtains m weight data blocks from the m convolution kernels K1, K2, ..., Km, where each weight data block has depth d in the channel direction and size w×h in the plane.
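The block extraction just described can be sketched with array slicing. Shapes follow FIG. 2/FIG. 3; the function name and offset parameters are ours, added for illustration:

```python
import numpy as np

def weight_blocks_for_round(kernels, m, d, w, h, d_off=0, r_off=0, s_off=0):
    """Cut one d x w x h block from the same position in each of the first m
    kernels -- a sketch of what the weight data processing unit assembles for
    one round. kernels has shape (M, D, R, S)."""
    return kernels[:m,
                   d_off:d_off + d,
                   r_off:r_off + w,
                   s_off:s_off + h]

kernels = np.arange(16 * 32 * 3 * 3).reshape(16, 32, 3, 3)  # M=16, D=32, 3x3
blocks = weight_blocks_for_round(kernels, m=16, d=16, w=1, h=1)
print(blocks.shape)  # (16, 16, 1, 1): 16 blocks of 16x1x1 = 256 weight values
```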
In step 104, the input data of the target convolutional layer are reorganized according to the target scheduling mode, and the reorganized input data are output to the multiply-accumulate operation array.
After the target scheduling mode is determined, the feature data processing unit 403 splits the input data of the target convolutional layer and reorganizes them appropriately according to the target scheduling mode, so that the reorganized input data are fed into the multiply-accumulate operation array 404 in an order matching the corresponding weight data blocks, allowing the required convolution operation to be completed.
In practice, the input data of the target convolutional layer can be stored in the memory 407, and the feature data processing unit 403 can read the input data from the memory 407 via the cache 409 under the control of the direct memory access controller.
Similarly, for each scheduling mode the feature data processing unit 403 is configured with a corresponding operation setting for reading and reorganizing the input data; once the target scheduling mode is determined, the feature data processing unit 403 reads and reorganizes the input data of the target convolutional layer with the operation setting corresponding to that mode. In practice, the feature data processing unit 403 can also write the input data of the target convolutional layer into a cache in their original order and then read them out in the required order, for example in an order matching the data order of the corresponding weight data blocks, thereby reorganizing and reordering the input data.
Referring again to FIG. 3, in the embodiment of FIG. 3 the feature data processing unit 403 splits the input data of the target convolutional layer into multiple input data blocks of size d×w×h and reorganizes each input data block so that its data order matches the corresponding weight data block, allowing the multiply-accumulate operation array 404 to complete the multiply-accumulate operations correctly.
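The input-side split mirrors the weight-side one and can be sketched the same way. The function and offsets are illustrative; shapes match the FIG. 3 example:

```python
import numpy as np

def input_block(x, d, w, h, c_off, w_off, h_off):
    """Cut one d x w x h input data block from input x of shape (C, W, H) --
    a sketch of the split performed by the feature data processing unit."""
    return x[c_off:c_off + d, w_off:w_off + w, h_off:h_off + h]

x = np.arange(32 * 6 * 6).reshape(32, 6, 6)  # C=32, W=6, H=6 as in FIG. 3
block0 = input_block(x, d=16, w=1, h=1, c_off=0, w_off=0, h_off=0)
print(block0.shape)  # (16, 1, 1)
```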
In step 105, a multiply-accumulate operation is performed based on the reorganized weight data and the reorganized input data.
The multiply-accumulate operation array 404 performs the multiply-accumulate operation based on the reorganized weight data and the reorganized input data, wherein the number of multiply-accumulate operation units used in each round of the array's operation corresponds to the size of the convolution calculation block.
After one round of operation, the multiply-accumulate operation array 404 stores the computed result in the cache 406 as intermediate data. During the multiply-accumulate operation, the array 404 adds up the products along the channel direction within the same convolution kernel before storing them as intermediate data. Then, following the order in which the kernels are convolved over the input data, the weight data processing unit 402 continues to read weight data blocks from the cache 409, while the feature data processing unit 403 reads and reorganizes input data from the cache 409 to output the input data blocks matching those weight data blocks, and the array 404 performs another round of operation. This cycle repeats until every data block of the input data has been processed against the weight data blocks.
The present invention is explained next with a specific application scenario; please continue to refer to FIG. 3. Assume C=32, W=6, H=6; D=32, R=3, S=3, M=16; and d=16, m=16, w=1, h=1. As shown in the figure, the input data can then be split into 72 input data blocks, and each convolution kernel can be split into 18 weight data blocks. Assuming the stride of this convolutional layer is 1 and zero padding of length 2 is applied in both the length and width directions, all 36 input data blocks 0 to 35 of the input data need an inner-product operation with the weight data block 00 of every convolution kernel.
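The block counts in this scenario follow directly from the shapes: the input splits into (32/16)×6×6 blocks and each kernel into (32/16)×3×3 blocks. A quick arithmetic check:

```python
# Block counts for the worked example above (input C=32, W=6, H=6;
# kernels D=32, R=3, S=3; block shape d=16, w=1, h=1).
def num_blocks(depth, width, height, d, w, h):
    assert depth % d == 0 and width % w == 0 and height % h == 0
    return (depth // d) * (width // w) * (height // h)

input_blocks = num_blocks(32, 6, 6, d=16, w=1, h=1)
kernel_blocks = num_blocks(32, 3, 3, d=16, w=1, h=1)
print(input_blocks, kernel_blocks)  # 72 18, matching FIG. 3
```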
For example, consider feature data block 0 of size 16×1×1 on the input data (the gray square of the input data in FIG. 3), which needs an inner-product operation with the 16×1×1 weight data block 00 of every convolution kernel (the gray square of the convolution kernels in FIG. 3). According to the convolution calculation block size corresponding to the target scheduling mode (d=16, m=16, w=1, h=1), the weight data processing unit 402 reads the first weight data block from each of the 16 convolution kernels in the cache, obtaining 16 weight data blocks 00 of size 16×1×1. The feature data processing unit 403 reads one 16×1×1 input data block 0 from the input data and matches input data block 0 to each of the 16 weight data blocks 00 (that is, input data block 0 is reused 16 times in one round of the multiply-accumulate operation array 404, which amounts to 256 data values). Input data block 0 and the 16 weight data blocks 00 are fed into the array 404 for computation, yielding 16 values (the products along the channel direction are summed), which are stored as intermediate results; this process is one round of operation of the array 404. Data is then read a second time for the array's second round. As mentioned above, all 36 input data blocks 0 to 35 need inner products with weight data block 00 of every kernel, so the second read does not need to fetch weight data block 00 again: it only reads the 16×1×1 input data block 1 from the input data, matches it to the 16 weight data blocks 00, and uses the array 404 to compute 16 values, which are also stored as an intermediate result. Reading data in the same way as in the second round, a 16×1×1 input data block 2 is read for the array's third round, and so on, until input data block 35 is read for the array's 36th round. At this point the convolution of the input data with weight data block 00 is complete, and 36 groups of intermediate results have been stored, each containing 16 values.
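One such round can be sketched numerically: the 16 weight blocks 00 form a 16×16 matrix, the input block is a 16-vector, and the round is a matrix-vector product. Random data stands in for real weights and features:

```python
import numpy as np

# One round in the worked example: input data block 0 (16 values) is reused
# against weight block 00 of all 16 kernels, producing 16 partial sums
# (products along the channel direction are summed per kernel).
rng = np.random.default_rng(0)
weight_blocks_00 = rng.random((16, 16))  # 16 kernels x 16 channel weights each
input_block_0 = rng.random(16)           # one 16x1x1 input data block

round_result = weight_blocks_00 @ input_block_0  # 16 values, one per kernel
print(round_result.shape)      # (16,)
print(weight_blocks_00.size)   # 256 multiplications in this single round
```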
Next, in the 37th round, 16 weight data blocks 01 of size 16×1×1 are read from the cache and one 16×1×1 input data block 0 is read from the input data; input data block 0 is matched to the 16 weight data blocks 01, and the array 404 computes 16 values. Since these 16 values correspond to the same positions on the output data as the 16 values obtained for input data block 0 in the first round, they must be added to those 16 first-round values to obtain 16 new values, which are stored as a new intermediate result, overwriting the 16 intermediate results stored by the first round. Fetching data in the same way as in the first 36 rounds, the array 404 performs rounds 37 through 72, completing the convolution of the input data with weight data block 01.
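The accumulation across rounds works because an inner product over all channels equals the sum of inner products over channel slices. A minimal numeric check with illustrative data:

```python
import numpy as np

# The full 32-channel inner product for one output position is built from two
# 16-channel partial sums, with the later round's result added onto the
# stored intermediate result.
rng = np.random.default_rng(1)
weights = rng.random((16, 32))  # 16 kernels, 32 channels at one kernel position
features = rng.random(32)       # the matching 32-channel column of input data

partial_a = weights[:, :16] @ features[:16]  # first channel half (e.g. block 00)
partial_b = weights[:, 16:] @ features[16:]  # second channel half (e.g. block 01)
accumulated = partial_a + partial_b

print(np.allclose(accumulated, weights @ features))  # the sums are equivalent
```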
The above computation process is repeated until all convolution operations of the target convolution kernels on the input data are complete, yielding 16 pieces of two-dimensional output data; these 16 pieces are stacked to obtain the three-dimensional output data of the target convolutional layer. If the next layer is also a convolutional layer, this output data can be read into the cache as the input data of the next layer's operation, and the convolution operation continues.
It should be noted that the above is one specific embodiment given to help the reader understand the scheme of the present application. In this embodiment, since the number of weight values read at one time exceeds the number of input data values, the weight data blocks are not read again when the weight values used in two adjacent rounds are the same, in order to improve data-processing efficiency. This is not a limitation of the scheme of the present application: in other embodiments the data may also be read in other orders, and when reading in other orders, data blocks may be read repeatedly or not.
In specific implementations, the present application is not limited by the described execution order of the steps; where no conflict arises, some steps may be performed in another order or simultaneously.
In summary, the convolutional neural network operation method proposed in the embodiments of this application can dynamically adjust the target scheduling mode for each convolutional layer of the network, even when the layers have different network structures. Each convolutional layer can use a scheduling mode matched to its structure to split and reorganize the input data and the target convolution kernels into data blocks, so that the numbers of weight values in the split weight data blocks and of feature values in the feature data blocks maximize utilization of the computing resources of the multiply-accumulate operation array. This raises the overall resource utilization of the hardware accelerator and thereby increases the operation speed of the convolutional neural network.
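As an illustration of the scheduling idea, a mode (d, m, w, h) could be chosen per layer by enumerating candidate block shapes and keeping the one that fills the most multiply-accumulate lanes. This is a hypothetical selection heuristic — the patent does not prescribe this particular search — and the 256-lane array capacity is an assumed figure:

```python
from itertools import product

MAC_LANES = 256  # assumed capacity of the multiply-accumulate array

def pick_mode(depth, n_kernels, width, height):
    """Pick (d, m, w, h) maximizing lane utilization for one conv layer.

    d: channel depth per block, m: kernels processed per round,
    w, h: spatial extent of the block.
    """
    best, best_util = None, -1.0
    for d, m, w, h in product((1, 2, 4, 8, 16), repeat=4):
        # A candidate must fit within the layer's dimensions...
        if d > depth or m > n_kernels or w > width or h > height:
            continue
        # ...and within the multiply-accumulate array.
        lanes_used = d * m * w * h
        if lanes_used > MAC_LANES:
            continue
        util = lanes_used / MAC_LANES
        if util > best_util:
            best, best_util = (d, m, w, h), util
    return best, best_util

# A 16-channel layer with 16 kernels can fully occupy a 256-lane array,
# matching the (d=16, m=16, w=1, h=1) mode used in the worked example.
mode, util = pick_mode(depth=16, n_kernels=16, width=28, height=28)
```

A shallow layer (e.g. a 3-channel input layer) would get a different mode, which is the point: the block shape adapts per layer instead of being fixed by the hardware.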
The convolutional neural network operation method and device provided by the embodiments of this application have been introduced in detail above. Specific examples are used herein to explain the principles and implementations of this application, and the descriptions of the above embodiments are intended only to help readers understand the method of this application and its core idea. Those skilled in the art may, based on the idea of this application, make changes to the specific implementations and the scope of application; in summary, the content of this specification should not be construed as limiting this application.
101, 102, 103, 104, 105: steps
Claims (15)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW109134795A TWI798591B (en) | 2020-10-07 | 2020-10-07 | Convolutional neural network operation method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
TW109134795A TWI798591B (en) | 2020-10-07 | 2020-10-07 | Convolutional neural network operation method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
TW202215300A TW202215300A (en) | 2022-04-16 |
TWI798591B true TWI798591B (en) | 2023-04-11 |
Family
ID=82197207
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
TW109134795A TWI798591B (en) | 2020-10-07 | 2020-10-07 | Convolutional neural network operation method and device |
Country Status (1)
Country | Link |
---|---|
TW (1) | TWI798591B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20220083857A1 (en) * | 2020-09-15 | 2022-03-17 | Sigmastar Technology Ltd. | Convolutional neural network operation method and device |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP3385844A1 (en) * | 2017-04-09 | 2018-10-10 | INTEL Corporation | Neural network scheduling mechanism |
US10572225B1 (en) * | 2018-09-26 | 2020-02-25 | Xilinx, Inc. | Circuit arrangements and methods for performing multiply-and-accumulate operations |
CN111178519A (en) * | 2019-12-27 | 2020-05-19 | 华中科技大学 | Convolutional neural network acceleration engine, convolutional neural network acceleration system and method |
Also Published As
Publication number | Publication date |
---|---|
TW202215300A (en) | 2022-04-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
TWI811291B (en) | Deep learning accelerator and method for accelerating deep learning operations | |
CN109919311B (en) | Method for generating instruction sequence, method and device for executing neural network operation | |
CN112200300B (en) | Convolutional neural network operation method and device | |
WO2020073211A1 (en) | Operation accelerator, processing method, and related device | |
US20230026006A1 (en) | Convolution computation engine, artificial intelligence chip, and data processing method | |
CN113469350B (en) | Deep convolutional neural network acceleration method and system suitable for NPU | |
TWI775210B (en) | Data dividing method and processor for convolution operation | |
US20230229917A1 (en) | Hybrid multipy-accumulation operation with compressed weights | |
CN117574970A (en) | Inference acceleration method, system, terminal and medium for large-scale language model | |
WO2022041188A1 (en) | Accelerator for neural network, acceleration method and device, and computer storage medium | |
CN111353591A (en) | Computing device and related product | |
CN112799599A (en) | Data storage method, computing core, chip and electronic equipment | |
CN110647981B (en) | Data processing method, data processing device, computer equipment and storage medium | |
TWI798591B (en) | Convolutional neural network operation method and device | |
CN113837922A (en) | Computing device, data processing method and related product | |
WO2024191479A1 (en) | Dynamic uncompression for channel-separable operation in neural network | |
CN112966729A (en) | Data processing method and device, computer equipment and storage medium | |
CN112766473A (en) | Arithmetic device and related product | |
CN112801276B (en) | Data processing method, processor and electronic equipment | |
WO2022001500A1 (en) | Computing apparatus, integrated circuit chip, board card, electronic device, and computing method | |
CN116957018A (en) | Method for realizing channel-by-channel convolution | |
KR102372869B1 (en) | Matrix operator and matrix operation method for artificial neural network | |
CN110533176B (en) | Caching device for neural network computation and related computing platform thereof | |
KR20220078819A (en) | Method and apparatus for performing deep learning operations | |
CN115081603A (en) | Computing device, integrated circuit device and board card for executing Winograd convolution |