TWI798591B

TWI798591B - Convolutional neural network operation method and device

Info

Publication number: TWI798591B
Application number: TW109134795A
Authority: TW
Inventors: 李超; 朱煒; 林博
Original assignee: 大陸商星宸科技股份有限公司
Priority date: 2020-10-07
Filing date: 2020-10-07
Publication date: 2023-04-11
Also published as: TW202215300A

Abstract

A convolutional neural network device includes a scheduling mode unit, a first data processor unit, a second data processor unit, and a multiply-accumulation operation array. The scheduling mode unit determines a target scheduling mode according to a number of target convolutional kernel(s) and size information of the target convolutional kernel(s), in which the target scheduling mode corresponds to a size of a convolutional computing block. The first data processor unit recombines weighted data in the target convolutional kernel(s) according to the target scheduling mode. The second data processor unit recombines input data of a target convolutional kernel layer. The multiply-accumulation operation array includes multiply-accumulation cells which performs a multiply-accumulation operation based on the recombined weighted data and the recombined input data. A number of the multiply-accumulation cells being utilized by the multiply-accumulation array in each operation corresponds to a size of the convolutional computing block.

Description

Convolutional neural network computing method and device

本申請涉資料處理技術領域，具體涉及一種卷積神經網路運算方法及裝置。 The present application relates to the technical field of data processing, in particular to a convolutional neural network computing method and device.

深度學習(Deep learning)是發展AI(Artificial intelligence，人工智慧)的重要應用技術之一，其廣泛應用於電腦視覺、語音辨識等領域。其中CNN(Convolutional Neural Network，卷積神經網路)則是近年來引起重視的一種深度學習高效識別技術，它通過直接輸入原始圖像或語音資料，與多個特徵濾波器(filter)資料進行若干層的卷積運算及向量運算，從而在圖像和語音辨識方面產生高準確性結果。 Deep learning (Deep learning) is one of the important application technologies for the development of AI (Artificial intelligence), which is widely used in computer vision, speech recognition and other fields. Among them, CNN (Convolutional Neural Network, Convolutional Neural Network) is a deep learning and efficient recognition technology that has attracted attention in recent years. Layer convolution operations and vector operations to produce high-accuracy results in image and speech recognition.

然而，隨著卷積神經網路的發展和廣泛應用，其面臨的挑戰也越來越多，例如，CNN模型的參數規模越來越大，並且網路結構複雜多變，一個CNN模型中往往包含多個卷積層，並且每一卷積層的深度以及卷積核的尺寸等資料都不相同。其中，在CNN網路中較淺的層中，待處理輸入資料的平面尺寸可能會較大、通道方向上的尺寸會較小，而隨著網路層數的深入，一些卷積核在通道方向上的深度會較大，或者卷積層中卷積核的數量會較多。那麼，對於電子設備中由多個MAC(Multiply Accumulation Cell，乘積累加運算單元)組成的乘積累加運算陣列來說，就會面對龐大需計算的資料量。而電子設備提供的處理能力往往是有限的，也就是說，乘積累加運算陣列的一輪運算中，能夠輸入的最大資料量是固定的。例如，電子設備的乘積累加運算陣列包含多個乘積累加運算單元的處理能力為256，則該乘積累加運算陣列具有256個乘法器，即同時最多能夠進行256個權重值分別與對應的256個輸入資料值進行乘法運算。而一般輸入資料是遠大於256的，因此，需要分別將卷積核和輸入資料拆分為多個資料塊，進行逐一運算。然而，先前技術中針對不同的卷積層是以相同的方式來拆分卷積核和輸入資料，這樣的作法無法有效利用電子設備中的硬體資源，基於此，如何提高硬體加速器在計算過程中的資源利用率成為亟待解決的問題。 However, with the development and wide application of convolutional neural network, it faces more and more challenges. For example, the parameter scale of CNN model is getting larger and larger, and the network structure is complex and changeable. A CNN model often Contains multiple convolutional layers, and the depth of each convolutional layer and the size of the convolution kernel are different. Among them, in the shallower layers of the CNN network, the plane size of the input data to be processed may be larger, and the size in the channel direction may be smaller, and as the number of network layers deepens, some convolution kernels in the channel The depth in the direction will be larger, or the number of convolution kernels in the convolutional layer will be larger. then, yes For a multiplication-accumulation operation array composed of multiple MACs (Multiply Accumulation Cells, multiplication-accumulation operation units) in an electronic device, it will face a huge amount of data to be calculated. However, the processing capability provided by electronic devices is often limited, that is to say, the maximum amount of data that can be input in one round of the multiply-accumulate array is fixed. For example, if the multiply-accumulate operation array of the electronic device includes a plurality of multiply-accumulate operation units with a processing capacity of 256, then the multiply-accumulate operation array has 256 multipliers, that is, at the same time, up to 256 weight values and corresponding 256 input Data values are multiplied. The general input data is much larger than 256. Therefore, it is necessary to split the convolution kernel and input data into multiple data blocks and perform calculations one by one. However, in the prior art, the convolution kernel and the input data are split in the same way for different convolution layers, which cannot effectively utilize the hardware resources in electronic devices. Based on this, how to improve the performance of hardware accelerators in the calculation process The utilization of resources in the system has become an urgent problem to be solved.

本申請提供一種卷積神經網路運算方法及裝置，旨在提高硬體加速器的資源利用率。 The present application provides a convolutional neural network computing method and device, aiming at improving resource utilization of hardware accelerators.

本申請提供一種卷積神經網路運算方法，其應用於包含乘積累加運算陣列的卷積神經網路運算裝置，乘積累加運算陣列包含多個乘積累加運算單元。本申請的卷積神經網路運算方法包括：確定一目標卷積層中的目標卷積核的數量與所述目標卷積核的一第一尺寸資訊；根據所述目標卷積核的數量和所述第一尺寸資訊確定一目標調度模式，其中，所述目標調度模式對應一卷積計算塊的大小；根據所述目標調度模式，對所述目標卷積核中的權重資料進行重組，並將重組後所述權重資料輸出至所述乘積累加運算陣列；根據所述目標調度模式，對所述目標卷積層中的輸入資料進行重組，並將重組後所述輸入資料輸出至所述乘積累加運算陣列；以及以所述乘積累加運算陣列基於重組後所述權重資料和重組後所述輸入資料進行乘積累加運算，其中，所述乘積累加運算陣列每輪運算中所使用的所述乘積累加運算單元的數目對應所述卷積計算塊的大小。 The present application provides a convolutional neural network computing method, which is applied to a convolutional neural network computing device including a multiply-accumulate array, and the multiply-accumulate array includes a plurality of multiply-accumulate computing units. The convolutional neural network operation method of the present application includes: determining the number of target convolution kernels in a target convolution layer and a first size information of the target convolution kernel; according to the number of the target convolution kernels and the set The first size information determines a target scheduling mode, wherein the target scheduling mode corresponds to the size of a convolution calculation block; according to the target scheduling mode, the weight data in the target convolution kernel is reorganized, and After recombination, the weight data is output to the multiply-accumulate operation array; according to the purpose A standard scheduling mode, reorganizing the input data in the target convolutional layer, and outputting the reorganized input data to the multiply-accumulate operation array; and using the multiply-accumulate operation array based on the reorganized weight data and After recombining, the input data is subjected to a multiply-accumulate operation, wherein the number of multiply-accumulate units used in each round of the multiply-accumulate array corresponds to the size of the convolution calculation block.

本申請另提供一種卷積神經網路運算裝置，其包含一調度模式單元、一第一資料處理單元、一第二資料處理單元及一乘積累加運算陣列。調度模式單元根據所述目標卷積核的數量及一第一尺寸資訊確定一目標調度模式，其中，所述目標調度模式對應一卷積計算塊的大小。第一資料處理單元根據所述目標調度模式，對所述目標卷積核中的權重資料進行重組。第二資料處理單元根據所述目標調度模式，對所述目標卷積層中的輸入資料進行重組。乘積累加運算陣列包含多個乘積累加運算單元，其基於重組後所述權重資料和重組後所述輸入資料進行乘積累加運算，其中，所述乘積累加運算陣列每輪運算中所使用的所述乘積累加運算單元的數目對應所述卷積計算塊的大小。 The present application also provides a convolutional neural network computing device, which includes a scheduling mode unit, a first data processing unit, a second data processing unit, and a multiply-accumulate array. The scheduling mode unit determines a target scheduling mode according to the number of target convolution kernels and a first size information, wherein the target scheduling mode corresponds to the size of a convolution calculation block. The first data processing unit reorganizes the weight data in the target convolution kernel according to the target scheduling mode. The second data processing unit reorganizes the input data in the target convolutional layer according to the target scheduling mode. The multiply-accumulate operation array includes a plurality of multiply-accumulate operation units, which perform multiply-accumulate operations based on the reorganized weight data and the reorganized input data, wherein the product used in each round of the multiply-accumulate operation array The number of accumulation operation units corresponds to the size of the convolution calculation block.

本申請實施例提供的卷積神經網路的運算方案，對於卷積神經網路中具有不同網路結構的各個卷積層來說，可以動態地調整目標調度模式，使得每一個卷積層可以採用與其乘積累加運算陣列結構匹配的調度模式，來對待處理輸入資料和目標卷積核進行資料塊的拆分，使得拆分後的權重資料塊中包含的權重值和輸入資料塊中包含的輸入資料的數量能夠實現對乘積累加運算陣列的運算資源的最大化利用，從整體上提高了硬體加速器的資源利用率，進而提高卷積神經網路的運算速度。 The convolutional neural network calculation scheme provided by the embodiment of the present application can dynamically adjust the target scheduling mode for each convolutional layer with different network structures in the convolutional neural network, so that each convolutional layer can adopt a different The scheduling mode of the multiplication and accumulation operation array structure matching is used to split the data block between the input data to be processed and the target convolution kernel, so that the weight value contained in the split weight data block and the input data contained in the input data block The number can realize the maximum utilization of the operation resources of the multiply-accumulate operation array, which improves the resource utilization rate of the hardware accelerator as a whole, and then improves the operation speed of the convolutional neural network.

101,102,103,104,105:步驟 101, 102, 103, 104, 105: steps

00~09,10~17:權重資料塊 00~09,10~17: weight data block

0~41,47,53,59,65,71:輸入資料塊 0~41,47,53,59,65,71: input data block

40:卷積神經網路運算裝置 40: Convolutional neural network computing device

401:模式調度單元 401: Mode Scheduling Unit

402:權重資料處理單元 402: Weight data processing unit

403:特徵資料處理單元 403: Feature data processing unit

404:乘積累加運算陣列 404: multiply-accumulate operation array

405:暫存器 405: scratchpad

406:快取 406: Cache

407:記憶體 407: memory

408:直接記憶體存取控制器 408: Direct Memory Access Controller

409:快取 409: Cache

K₁,K₂,K₃,K_M:卷積核 K ₁ , K ₂ , K ₃ , K _M : convolution kernel

R,W,w:寬度 R, W, w: width

C,D,d:深度 C,D,d: Depth

S,H,h:高度 S, H, h: Height

為了更清楚地說明本申請實施例中的技術方案，下面將對實施例描述中所需要使用的附圖作簡單地介紹，顯而易見地，下面描述中的附圖僅僅是本申請的一些實施例，對於本領域技術人員來講，在不付出進步性勞動的前提下，還可以根據這些附圖獲得其他的附圖。 In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings that need to be used in the description of the embodiments will be briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the present application. For those skilled in the art, other drawings can also be obtained based on these drawings on the premise of not paying progressive efforts.

〔圖1〕是本申請實施例提供的卷積神經網路的運算方法的流程示意圖；〔圖2〕為一卷積層的輸入資料及卷積核的資料結構示意圖；〔圖3〕為一實施例中對卷積層的輸入資料及卷積核的資料拆分示意圖；以及〔圖4〕是本申請實施例提供的卷積神經網路運算裝置應用於電子設備的方塊示意圖。 [Fig. 1] is a schematic flow diagram of the operation method of the convolutional neural network provided by the embodiment of the present application; [Fig. 2] is a schematic diagram of the input data of a convolution layer and a data structure of a convolution kernel; [Fig. 3] is an implementation In the example, a schematic diagram of the input data of the convolutional layer and data splitting of the convolution kernel; and [FIG. 4] is a schematic block diagram of the application of the convolutional neural network computing device provided by the embodiment of the present application to electronic equipment.

下面將結合本申請實施例中的附圖，對本申請實施例中的技術方案進行清楚、完整地描述，顯然，所描述的實施例僅僅是本申請一部分實施例，而不是全部的實施例。其中相同的元件符號代表相同的元件，本發明的原理是以實施在一適當的應用環境中來舉例說明。基於本申請中的實施例，本領域技術人員在沒有作出進步性勞動前提下所獲得的所有其他實施例，都屬於本申請保護的範圍。 The following will clearly and completely describe the technical solutions in the embodiments of the application with reference to the drawings in the embodiments of the application. Apparently, the described embodiments are only some of the embodiments of the application, not all of them. Where the same reference numerals represent the same components, the principle of the present invention is illustrated by being implemented in a suitable application environment. Based on the embodiments in this application, all other embodiments obtained by those skilled in the art without making progressive efforts belong to the scope of protection of this application.

在本文中提及『實施例』意味著，結合實施例描述的特定特徵、結構或特性可以包含在本申請的至少一個實施例中。在說明書中的各個位置出現該用語並不一定均是指相同的實施例，也不是與其它實施例互斥的獨立的或備選的實施例。本領域技術人員可顯然地或可隱含地理解的是，本文所描述的實施例可以與其它實施例相結合。 Reference herein to "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of this term in various places in the specification are not necessarily all referring to the same embodiment, nor are they mutually exclusive from other embodiments. or alternative embodiments. It may be apparent or implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.

本申請實施例提供一種卷積神經網路的運算方法，該卷積神經網路的運算方法的執行主體可以是本申請實施例提供的卷積神經網路運算裝置，或者是整合了該卷積神經網路的運算裝置的電子設備，實施上，卷積神經網路的運算裝置可以採用硬體、軟體或硬體結合軟體的方式實現。 The embodiment of the present application provides a convolutional neural network computing method. The execution subject of the convolutional neural network computing method may be the convolutional neural network computing device provided in the embodiment of the present application, or integrate the convolutional neural network The electronic equipment of the computing device of the neural network, in terms of implementation, the computing device of the convolutional neural network can be realized by hardware, software, or a combination of hardware and software.

本申請實施例提供的卷積神經網路的運算方案，可以應用於任何結構的卷積神經網路(以下簡稱CNN)，例如，可以應用於只有一個卷積層的CNN，還可以應用於一些複雜CNN，比如包括多達上百或者更多的卷積層的CNN。此外，本申請實施例中的CNN還可以有池化層、全連接層等。也就是說，本申請實施例的方案並不局限於某種特定的卷積神經網路，只要是包含有卷積層的神經網路，都可以認為是本申請中的『卷積神經網路』，其卷積層部分都可以按照本申請實施例進行運算。 The convolutional neural network operation scheme provided by the embodiment of the present application can be applied to any structured convolutional neural network (hereinafter referred to as CNN), for example, it can be applied to a CNN with only one convolutional layer, and it can also be applied to some complex CNNs, such as CNNs that include up to hundreds or more convolutional layers. In addition, the CNN in the embodiment of the present application may also have a pooling layer, a fully connected layer, and the like. That is to say, the scheme of the embodiment of this application is not limited to a specific convolutional neural network, as long as it is a neural network that includes a convolutional layer, it can be considered as a "convolutional neural network" in this application , the convolutional layer part can be operated according to the embodiment of the present application.

需要說明的是，本申請實施例的卷積神經網路可以應用於多種場景，例如，諸如人臉識別、車牌識別等圖像識別的領域、諸如圖像特徵提取、語音特徵提取的特徵領域、語音辨識領域、自然語言處理領域等，將圖像或者由其他形式的資料轉換得到的特徵資料登錄到預先訓練好的卷積神經網路，即可利用該卷積神經網路進行運算，以達到或分類或識別或特徵提取的目的。 It should be noted that the convolutional neural network of the embodiment of the present application can be applied to various scenarios, for example, image recognition fields such as face recognition and license plate recognition, feature fields such as image feature extraction and speech feature extraction, In the field of speech recognition, natural language processing, etc., the feature data converted from images or other forms of data can be registered into the pre-trained convolutional neural network, and the convolutional neural network can be used to perform calculations to achieve Or classification or recognition or feature extraction purposes.

請參閱圖1，圖1是本申請實施例提供的卷積神經網路運算方法的流程示意圖。圖4是本申請實施例提供的卷積神經網路運算裝置應用於電子設備的方塊示意圖。卷積神經網路運算裝置40可用以實現圖1中的卷積神經網路運算方法。卷積神經網路運算方法的具體步驟及卷積神經網路運算裝置40的操作方式說明如下。 Please refer to FIG. 1 . FIG. 1 is a schematic flowchart of a convolutional neural network operation method provided by an embodiment of the present application. FIG. 4 is a schematic block diagram of a convolutional neural network computing device provided by an embodiment of the present application applied to an electronic device. The convolutional neural network operation device 40 can be used to realize the convolutional neural network in Fig. 1 Road operation method. The specific steps of the convolutional neural network computing method and the operation of the convolutional neural network computing device 40 are described as follows.

步驟101中，確定一目標卷積層中的目標卷積核的數量以及所述目標卷積核的一第一尺寸資訊。 In step 101, the number of target convolution kernels in a target convolution layer and a first size information of the target convolution kernels are determined.

對於整合有卷積神經網路運算裝置的電子設備來說，卷積層對輸入資料和卷積核資料進行卷積運算，得到輸出資料。輸入資料可以是原始的圖像、語音資料或者上一卷積層或池化層輸出的資料，而卷積神經網路運算裝置中的輸入資料一般為特徵資料，因此，卷積神經網路運算裝置40的輸入資料可為目標卷積層的特徵資料。 For an electronic device integrated with a convolutional neural network computing device, the convolution layer performs convolution operations on input data and convolution kernel data to obtain output data. The input data can be the original image, voice data, or the output data of the previous convolutional layer or pooling layer, and the input data in the convolutional neural network computing device is generally characteristic data. Therefore, the convolutional neural network computing device The input data of 40 may be feature data of the target convolutional layer.

輸入資料可以有多個通道(channel)，每一個通道上的輸入資料可以理解為一個二維資料，當輸入資料的通道數大於1時，可以將輸入資料理解為多個通道的二維資料疊在一起的立體資料，其深度等於通道數。目標卷積層(即當下待進行卷積運算的卷積層)可以包含有一個或者多個卷積核，卷積核又稱為濾波器(filter)，每一卷積核的通道數等於該層輸入資料的通道數，卷積核資料的個數等於該目標卷積層的輸出資料的通道數。也就是說，輸入資料與一個卷積核資料進行卷積運算後，得到一個二維的資料，當目標卷積層有多個卷積核時，每一個卷積核輸出的二位元資料疊加得到一個三維的輸出資料。 The input data can have multiple channels. The input data on each channel can be understood as a two-dimensional data. When the number of channels of the input data is greater than 1, the input data can be understood as a two-dimensional data stack of multiple channels. Together the volume data has a depth equal to the number of channels. The target convolutional layer (that is, the convolutional layer to be convolutional at present) can contain one or more convolution kernels, which are also called filters, and the number of channels of each convolution kernel is equal to the input of the layer The number of channels of the data, the number of convolution kernel data is equal to the number of channels of the output data of the target convolution layer. That is to say, after the input data is convoluted with a convolution kernel data, a two-dimensional data is obtained. When the target convolution layer has multiple convolution kernels, the two-bit data output by each convolution kernel is superimposed to obtain A 3D output data.

在基於卷積神經網路進行運算時，模式調度單元401從當前用於運算的卷積神經網路中確定出目標卷積層，實施上，模式調度單元401可自配置暫存器405取得目標卷積層的相關訊息，例如自配置暫存器405得知何者為目標卷積層及其具有的卷積核數量及卷積核的平面尺寸及通道方向上的深度等訊息。 When performing calculations based on convolutional neural networks, the mode scheduling unit 401 determines the target convolutional layer from the convolutional neural network currently used for calculations. In practice, the mode scheduling unit 401 can obtain the target volume from the configuration register 405 Layer-related information, such as learning from the configuration register 405 which is the target The convolutional layer and its number of convolution kernels, the plane size of the convolution kernel, and the depth in the direction of the channel.

請參閱圖2，圖2為一卷積層的輸入資料及卷積核的資料結構示意圖。圖2的卷積層中包含M個卷積核，分別為K₁、K₂、K₃......K_M。這M個卷積核的尺寸相同，均為D×R×S，如圖所示，D為代表卷積核在通道方向上的深度，R×S為代表卷積核在平面方向上的尺寸。輸入資料的尺寸為C×W×H，其中，C為代表輸入資料在通道方向上的深度。實施上，C=D。而W×H為代表輸入資料在平面方向上的尺寸。 Please refer to FIG. 2 . FIG. 2 is a schematic diagram of the input data of a convolution layer and the data structure of the convolution kernel. The convolution layer in Fig. 2 includes M convolution kernels, namely K ₁ , K ₂ , K ₃ . . . K _M . The M convolution kernels have the same size, all of which are D×R×S. As shown in the figure, D represents the depth of the convolution kernel in the channel direction, and R×S represents the size of the convolution kernel in the plane direction. . The size of the input data is C×W×H, where C represents the depth of the input data in the channel direction. In practice, C=D. And W×H represents the size of the input data in the plane direction.

由於每個卷積層中卷積核的尺寸和數量可能都不相同，本申請實施例中，在開始對目標卷積層進行運算時，先確定該目標卷積層中的目標卷積核的數量、目標卷積核的尺寸及/或深度資訊。 Since the size and number of convolution kernels in each convolution layer may be different, in the embodiment of the present application, when starting to operate on the target convolution layer, first determine the number of target convolution kernels in the target convolution layer, the target Size and/or depth information of the convolution kernel.

步驟102中，根據所述目標卷積核的數量和所述第一尺寸資訊確定一目標調度模式，其中，所述目標調度模式對應一卷積計算塊的大小。 In step 102, a target scheduling mode is determined according to the number of target convolution kernels and the first size information, wherein the target scheduling mode corresponds to the size of a convolution calculation block.

在乘積累加運算陣列中乘積累加運算單元的數量是有限的情況下，而卷積層中的參數量(例如卷積核的資料量)巨大，對於一個卷積層來說，可能需要乘積累加運算陣列的多輪運算才能完成全部運算，因此，需要分別將卷積核和輸入資料拆分為多個資料塊，每一次將一定數量的權重資料塊和對應數量的輸入資料塊輸入到乘積累加運算陣列中，然後進行乘積累加運算。為有效利用乘積累加運算陣列404的運算資源，模式調度單元401可根據目標卷積核的數量及其相關的尺寸訊息確定一目標調度模式，實施上，模式調度單元401可根據目標卷積核的數量及其相關的尺寸訊息自多個預設的調度模式中選出目標調度模式，而每一調度模式對應一特定的卷積計算塊的大小，卷積計算塊為進行卷積運算的最小單位，模式調度單元401所選出的目標調度模式可最有效的利用乘積累加運算陣列404，例如，當卷積神經網路運算裝置40操作在目標調度模式下，乘積累加運算陣列可以最少運算輪數完成對輸入資料及目標卷積核的乘積累加運算。 In the case where the number of multiply-accumulate operation units in the multiply-accumulate operation array is limited, and the amount of parameters in the convolution layer (such as the amount of data of the convolution kernel) is huge, for a convolution layer, it may be necessary to multiply-accumulate. Multiple rounds of calculations can complete all calculations. Therefore, it is necessary to split the convolution kernel and input data into multiple data blocks, and input a certain number of weight data blocks and corresponding number of input data blocks into the multiplication-accumulation operation array each time. , and then perform a multiply-accumulate operation. In order to effectively utilize the computing resources of the multiply-accumulate array 404, the mode scheduling unit 401 can determine a target scheduling mode according to the number of target convolution kernels and related size information. In practice, the mode scheduling unit 401 can determine a target scheduling mode according to the target convolution kernel The number and its related size information select the target scheduling mode from multiple preset scheduling modes, and each scheduling mode corresponds to a specific convolution calculation block size, convolution calculation A block is the smallest unit for performing convolution operations. The target scheduling mode selected by the mode scheduling unit 401 can make the most effective use of the multiply-accumulate array 404. For example, when the convolutional neural network computing device 40 operates in the target scheduling mode, the product The accumulation operation array can complete the multiplication and accumulation operation of the input data and the target convolution kernel with the minimum number of operation rounds.

在一實施例中，調度模式所對應的卷積計算塊的大小可為乘積累加運算陣列的一輪運算中從m個目標卷積核中獲取m個大小為d×w×h的權重資料塊的資料塊的大小，其中，d為代表權重資料塊在通道方向上的深度，w×h為代表權重資料塊在平面方向上的尺寸，其中，m、d、w、h皆為正整數。在設置預設調度模式時，可視為在設置m、d、w、h的具體數值，需要綜合考慮各種因素。 In an embodiment, the size of the convolution calculation block corresponding to the scheduling mode can be obtained from m target convolution kernels in one round of operation of the multiply-accumulate operation array. m weight data blocks with a size of d×w×h The size of the data block, where d represents the depth of the weight data block in the channel direction, and w×h represents the size of the weight data block in the plane direction, where m, d, w, and h are all positive integers. When setting the preset scheduling mode, it can be regarded as setting the specific values of m, d, w, and h, and various factors need to be considered comprehensively.

首先，需要考慮電子設備的乘積累加運算陣列404的處理能力，即需要考慮乘積累加運算陣列404中乘積累加單元的數量，比如，乘積累加運算陣列404中一共有256個乘積累加單元，則一輪運算中最多可以同時進行256個乘積累加單元運算。因此，在乘積累加運算陣列的一輪運算中分別從m個目標卷積核中獲取m個大小為d×w×h的權重資料塊時，m、d、w、h的大小需要滿足：m×d×w×h

256。 First, the processing capability of the multiply-accumulate array 404 of the electronic device needs to be considered, that is, the number of multiply-accumulate units in the multiply-accumulate array 404 needs to be considered. For example, there are 256 multiply-accumulate units in the multiply-accumulate array 404, and one round of operation A maximum of 256 multiply-accumulate unit operations can be performed simultaneously in the system. Therefore, when m weight data blocks of size d×w×h are respectively obtained from m target convolution kernels in a round of operation of the multiply-accumulate array, the sizes of m, d, w, and h need to satisfy: m× d×w×h

256.

其次，還要考慮實際網路的需要，比如，卷積層中卷積核的尺寸和數量，比如，有的卷積核的大小為1×1×64，有的卷積核大小為11×11×3，有的卷積層可能有8個卷積核，有的卷積層可能有2048個卷積核。在綜合考慮上述參數後，設置了不同大小的卷積計算塊，以適應不同的網路層。 Secondly, the needs of the actual network must also be considered, for example, the size and number of convolution kernels in the convolution layer, for example, the size of some convolution kernels is 1×1×64, and the size of some convolution kernels is 11×11 ×3, some convolutional layers may have 8 convolutional kernels, and some convolutional layers may have 2048 convolutional kernels. After comprehensive consideration of the above parameters, different sizes of convolution calculation blocks are set to adapt to different network layers.

比如，在乘積累加運算單元的處理能力固定的前提下，即在m×d×w×h

256的條件下，當卷積層的卷積核數量較多、深度較小時，可以將m 的值設置的大一些，d的值設置的小一些，比如，m=64，d=4，w=1，h=1，或者m=16，d=16，w=1，h=1。反之，當卷積層的卷積核數量較少、深度較大時，可以將m的值設置的較小一些，d的值設置的較大一些，比如，m=4，d=64，w=1，h=1。或者，對於一些特別尺寸的卷積核，還可以做一些特別的設置，比如，對於3×3的卷積核，可以設置m=1，d=32，w=3，h=3，這種情況下，雖然不能百分之百的利用乘積累加運算單元的計算資源，但是也達到了較高的利用率。請再參閱圖2，圖2的卷積層中包含M個卷積核，此例中m較佳為的M的正因數，也就是說，M為m的整數倍。 For example, under the premise that the processing capability of the multiply-accumulate operation unit is fixed, that is, in m×d×w×h

Under the condition of 256, when the number of convolution kernels of the convolutional layer is large and the depth is small, the value of m can be set larger, and the value of d can be set smaller, for example, m=64, d=4, w =1, h=1, or m=16, d=16, w=1, h=1. Conversely, when the number of convolution kernels of the convolutional layer is small and the depth is large, the value of m can be set smaller, and the value of d can be set larger, for example, m=4, d=64, w= 1, h=1. Or, for some special-sized convolution kernels, you can also make some special settings. For example, for a 3×3 convolution kernel, you can set m=1, d=32, w=3, h=3, this kind of In this case, although the computing resources of the multiply-accumulate operation unit cannot be utilized 100%, a relatively high utilization rate is achieved. Please refer to FIG. 2 again. The convolution layer in FIG. 2 includes M convolution kernels. In this example, m is preferably a positive factor of M, that is, M is an integer multiple of m.

實施上，可針對各種卷積層的卷積核數量及尺寸預先評估最適合的卷積計算塊的大小，再據以預先設置多種預設調度模式，並於一記憶體中建立一查閱資料表，此查閱資料表包含卷積核的參數與預設調度模式之間的映射關係。模式調度單元401可根據目標卷積核的數量及其相關的尺寸訊息自查閱資料表查得目標調度模式。在一實施例中，模式調度單元401可由一處理器執行程式碼來實現，而目標卷積層的輸入資料的資料量訊息及卷積核的數量及尺寸等訊息是儲存於配置暫存器405中，模式調度單元401自配置暫存器405中獲得目標卷積核的數量及其相關的尺寸訊息。 In practice, it is possible to pre-evaluate the size of the most suitable convolution calculation block for the number and size of convolution kernels of various convolution layers, and then pre-set a variety of preset scheduling modes based on this, and create a look-up data table in a memory, This lookup data table contains the mapping relationship between the parameters of the convolution kernel and the default scheduling mode. The mode scheduling unit 401 can look up the target scheduling mode from the lookup data table according to the number of target convolution kernels and their related size information. In one embodiment, the mode scheduling unit 401 can be implemented by a processor executing the program code, and the data volume information of the input data of the target convolution layer and the number and size of convolution kernels are stored in the configuration register 405. , the mode scheduling unit 401 obtains the number of target convolution kernels and related size information from the configuration register 405 .

如圖4所示，卷積神經網路運算裝置40在進行卷積運算的過程中，是經由快取409自記憶體407中取得每一輪運算所需要的權重值和輸入資料，而乘積累加運算陣列404運算過程中產生的中間結果是暫存於快取406中。對於電子設備來說，分配給卷積運算所使用的儲存空間是有限的，基於上述原因，在預先設置調度模式時，除了考慮網路結構之外，還需要考慮使用這種調度模式運算時，對儲存空間的佔用情況，以設置合適的調度模式。因此，在一實施例中，模式調度單元401亦會根據快取409及/或快取406的大小來確定目標調度模式。 As shown in FIG. 4 , during the convolution operation, the convolutional neural network operation device 40 obtains the weight value and input data required for each round of operation from the memory 407 via the cache 409, and the multiply-accumulate operation The intermediate results generated during the operation of the array 404 are temporarily stored in the cache 406 . For electronic devices, the storage space allocated for convolution operations is limited. For the above reasons, when pre-setting the scheduling mode, in addition to considering the network structure, it is also necessary to consider when using this scheduling mode operation. Occupancy of storage space to set an appropriate scheduling mode. Therefore, in a In an embodiment, the mode scheduling unit 401 also determines the target scheduling mode according to the size of the cache 409 and/or the cache 406 .

步驟103中，根據目標調度模式，對目標卷積核中的權重資料進行重組，並將重組後的權重資料輸出至乘積累加運算陣列。 In step 103, according to the target scheduling mode, reorganize the weight data in the target convolution kernel, and output the reorganized weight data to the multiply-accumulate array.

目標調度模式確定之後，權重資料處理單元402根據目標調度模式對目標卷積核中的權重資料進行拆分及適當的資料重組，使重組後的權重資料得以適當的順序輸入到乘積累加運算陣列404，而乘積累加運算陣列404可完成所需的卷積運算。 After the target scheduling mode is determined, the weight data processing unit 402 splits and properly reorganizes the weight data in the target convolution kernel according to the target scheduling mode, so that the reorganized weight data can be input into the multiply-accumulate operation array 404 in an appropriate order , and the multiply-accumulate array 404 can complete the required convolution operation.

實施上，目標卷積核中的權重資料可儲存於記憶體407中，權重資料處理單元402可經由快取409在直接記憶體存取(Direct Memory Access，DMA)控制器408的控制下，自記憶體407讀取權重資料。 In practice, the weight data in the target convolution kernel can be stored in the memory 407, and the weight data processing unit 402 can automatically store the weight data under the control of the direct memory access (Direct Memory Access, DMA) controller 408 via the cache 409. The memory 407 reads weight data.

一實施例中，針對每一調度模式，權重資料處理單元402配置有一對應的權重資料讀取及重組的操作設定，當目標調度模式確定後，權重資料處理單元402根據目標調度模式以對應的操作設定讀取及重組目標卷積核中的權重資料。實施上，權重資料處理單元402可藉由將目標卷積核中的權重資料以原始順序寫入一快取中，再以所需的順序自快取中將權重資料讀出，實現對權重資料重組及重排序的目的。 In one embodiment, for each scheduling mode, the weight data processing unit 402 is configured with a corresponding operation setting for reading and reorganizing the weight data. When the target scheduling mode is determined, the weight data processing unit 402 operates according to the target scheduling mode. Set to read and reorganize the weight data in the target convolution kernel. In practice, the weight data processing unit 402 can write the weight data in the target convolution kernel into a cache in the original order, and then read the weight data from the cache in the required order to realize the weight data Purposes of reorganization and reordering.

請參閱圖3，圖3為一實施例中對卷積層的輸入資料及卷積核的資料拆分示意圖。在確定出目標調度模式後，假設該目標調度模式對應的卷積計算塊的尺寸為m×d×w×h，也就是說，在乘積累加運算陣列的一輪運算中，權重資料處理單元402從目標卷積核讀取m個尺寸為d×w×h的權重資料塊，將經重組後權重資料的輸入至乘積累加運算陣列404。詳細來說，權重資料處理單元 402分別從K₁、K₂、……、K_m這m個卷積核中獲取m個權重資料塊，其中，每一個權重資料塊在通道方向上的深度為d，在平面上的尺寸為w×h。 Please refer to FIG. 3 . FIG. 3 is a schematic diagram of splitting the input data of the convolution layer and the data of the convolution kernel in an embodiment. After the target scheduling mode is determined, it is assumed that the size of the convolution calculation block corresponding to the target scheduling mode is m×d×w×h. The target convolution kernel reads m weight data blocks with a size of d×w×h, and inputs the reorganized weight data to the multiply-accumulate array 404 . In detail, the weight data processing unit 402 obtains m weight data blocks from the m convolution kernels K ₁ , K ₂ , ..., K _m respectively, wherein the depth of each weight data block in the channel direction is d, the dimension on the plane is w×h.

步驟104，根據目標調度模式，對目標卷積層中的輸入資料進行重組，並將重組後的輸入資料輸出至乘積累加運算陣列。 Step 104, according to the target scheduling mode, reorganize the input data in the target convolutional layer, and output the reorganized input data to the multiply-accumulate array.

目標調度模式確定之後，特徵資料處理單元403根據目標調度模式對目標卷積層中的輸入資料進行拆分及適當的資料重組，使重組後的輸入資料得以匹配對應的權重資料塊的順序輸入到乘積累加運算陣列404，以完成所需的卷積運算。 After the target scheduling mode is determined, the characteristic data processing unit 403 splits the input data in the target convolutional layer according to the target scheduling mode and appropriately reorganizes the data, so that the reorganized input data can match the order of the corresponding weight data blocks and input them to the product Accumulate the operation array 404 to complete the required convolution operation.

實施上，目標卷積層中的輸入資料可儲存於記憶體407中，特徵資料處理單元403可經由快取409在直接記憶體存取控制器的控制下，自記憶體407讀取輸入資料。 In practice, the input data of the target convolutional layer can be stored in the memory 407 , and the feature data processing unit 403 can read the input data from the memory 407 through the cache 409 under the control of the direct memory access controller.

類似地，針對每一調度模式，特徵資料處理單元403配置有一對應的輸入資料讀取及重組的操作設定，當目標調度模式確定後，特徵資料處理單元403根據目標調度模式以對應的操作設定讀取及重組目標卷積層中的輸入資料。實施上，特徵資料處理單元403亦可藉由將目標卷積層中的輸入資料以原始順序寫入一快取中，再以所需的順序自快取中將輸入資料讀出，例如以匹配對應的權重資料塊的資料順序自快取中將輸入資料讀出，實現對輸入資料重組及重排序的目的。 Similarly, for each scheduling mode, the feature data processing unit 403 is configured with a corresponding operation setting for reading and reorganizing input data. When the target scheduling mode is determined, the feature data processing unit 403 reads the corresponding operation settings according to the target scheduling mode. Fetch and reorganize the input data in the target convolutional layer. In practice, the characteristic data processing unit 403 can also write the input data in the target convolutional layer into a cache in the original order, and then read the input data from the cache in the required order, for example, to match the corresponding The data sequence of the weighted data blocks is read from the cache to achieve the purpose of reorganizing and reordering the input data.

請再參閱圖3，在圖3的實施例中特徵資料處理單元403將目標卷積層中的輸入資料拆分為多個尺寸為d×w×h的輸入資料塊，並對每一輸入資料塊進行資料重組，使重組後輸入資料塊的資料順序能匹配對應的權重資料塊，使乘積累加運算陣列404得以據以完成正確的乘積累加運算。 Please refer to FIG. 3 again. In the embodiment of FIG. 3, the feature data processing unit 403 splits the input data in the target convolutional layer into a plurality of input data blocks whose size is d×w×h, and for each input data block Data reorganization is performed so that the data order of the input data blocks after reorganization can match the corresponding weight data blocks, so that the multiply-accumulate operation array 404 can complete the correct multiply-accumulate operation accordingly.

步驟105，基於重組後權重資料和重組後輸入資料進行乘積累加運算。 Step 105, perform multiply-accumulate calculation based on the reorganized weight data and the reorganized input data.

乘積累加運算陣列404基於重組後權重資料和重組後的輸入資料進行乘積累加運算，其中，乘積累加運算陣列404每輪運算中所使用的乘積累加運算單元的數目對應卷積計算塊的大小。 The multiply-accumulate operation array 404 performs multiply-accumulate operations based on the reorganized weight data and the reorganized input data, wherein the multiply-accumulate operation array 404 uses the number of multiply-accumulate operation units used in each round of operation corresponding to the size of the convolution calculation block.

乘積累加運算陣列404進行一輪運算後，將計算得到的結果作為中間資料儲存到快取406中。乘積累加運算陣列404在乘積累加運算時，會將同一個卷積核中通道方向上的乘積相加後作為中間資料儲存。然後，權重資料處理單元402根據卷積核在輸入資料上進行卷積運算的順序，繼續從快取409中讀取權重資料塊，而特徵資料處理單元403從快取409中讀取並重組輸入資料，以輸出與該權重資料塊匹配的輸入資料塊，乘積累加運算陣列404據以進行另一輪的運算，如此循環往復，直至完成輸入資料中的每一個資料塊與權重資料塊的運算。 After the multiply-accumulate operation array 404 performs one round of operation, the calculated result is stored in the cache 406 as intermediate data. During the multiply-accumulate operation, the multiply-accumulate operation array 404 will add the products in the channel direction of the same convolution kernel and store them as intermediate data. Then, the weight data processing unit 402 continues to read weight data blocks from the cache 409 according to the order in which the convolution kernel performs convolution operations on the input data, while the feature data processing unit 403 reads and reorganizes the input data from the cache 409. data to output an input data block that matches the weight data block, and the multiplication and accumulation operation array 404 performs another round of calculations, and so on, until the calculation of each data block in the input data and the weight data block is completed.

接下來以一個具體應用場景為例對本發明進行說明，請繼續參閱圖3，假設C=32，W=6，H=6；D=32，R=3，S=3，M=16；d=16，m=16，w=1，h=1。則如圖所示，輸入資料可以被拆分72個輸入資料塊，一個卷積核可以被拆分為18個權重資料塊。假設該卷積層對應的步長為1，並在長、寬方向均進行了長度為2的零填充，則輸入資料上的0~35這36個輸入資料塊都需要與每一個卷積核中的00對應的權重資料塊進行內積運算。 Next, take a specific application scenario as an example to illustrate the present invention, please continue to refer to Figure 3, assuming C=32, W=6, H=6; D=32, R=3, S=3, M=16; d =16, m=16, w=1, h=1. As shown in the figure, the input data can be split into 72 input data blocks, and one convolution kernel can be split into 18 weight data blocks. Assuming that the step size corresponding to the convolutional layer is 1, and zero padding with a length of 2 is performed in both the length and width directions, the 36 input data blocks of 0~35 on the input data need to be combined with each convolution kernel. Inner product operation is performed on the weight data block corresponding to 00.

比如，輸入資料上的大小為16×1×1(圖3中的輸入資料的灰色方塊)的特徵資料塊0，該資料塊需要與每一個卷積核中的大小為16×1×1的權重資料塊00(圖3中的卷積核的灰色方塊)進行內積運算。因此，權重資料處理單元402根據目標調度模式對應的卷積計算塊的大小(d=16，m=16，w=1，h=1)，從快取中的16個卷積核中，分別讀取第一個權重資料塊，得到16個大小為16×1×1的權重資料塊00。特徵資料處理單元403從輸入資料中讀取一個大小為16×1×1的輸入資料塊0，並將輸入資料塊0分別匹配給16個權重資料塊00(也就是說，輸入資料塊0在乘積累加運算陣列404的一輪運算中重複使用了16次，相當於256個資料)。輸入資料塊0及16個權重資料塊00輸入到乘積累加運算陣列404進行運算，得到16個數值(通道方向上的乘積進行了相加處理)，儲存為中間結果，這個過程為乘積累加運算陣列404的一輪運算。然後，再進行第二次讀數據，以進行乘積累加運算陣列404的第二輪運算。如前文所述，輸入資料上的0-35這36個輸入資料塊都需要與每一個卷積核中的權重資料塊00進行內積運算，因此，第二次讀數據時無需重複讀權重資料塊00，只需從輸入資料中讀取大小為16×1×1的輸入資料塊1，將輸入資料塊1分別匹配給16個權重資料00，並利用乘積累加運算陣列404進行運算，得到16個數值，也儲存為中間結果。然後按照與第二輪運算相同的讀數據方式，讀取一個大小為16×1×1的輸入資料塊2，進行乘乘積累加運算陣列404的第三輪運算，……，然後按照與第二輪運算相同的讀數據方式，讀取一個大小為16×1×1的輸入資料塊35，進行乘積累加運算陣列404的第36輪運算。至此，完成了輸入資料與權重資料塊00的卷積運算，同時，儲存了36組中間結果，每一組中間結果有16個數值。 For example, the feature data block 0 with a size of 16×1×1 (the gray square of the input data in Figure 3) on the input data needs to be matched with the size of 16×1×1 in each convolution kernel The weight data block 00 (the gray square of the convolution kernel in Fig. 3 ) performs the inner product operation. Therefore, the weight data According to the size of the convolution calculation block corresponding to the target scheduling mode (d=16, m=16, w=1, h=1), the processing unit 402 reads the first weight data blocks to obtain 16 weight data blocks 00 with a size of 16×1×1. The feature data processing unit 403 reads an input data block 0 with a size of 16×1×1 from the input data, and matches the input data block 0 to 16 weight data blocks 00 respectively (that is to say, the input data block 0 is in In one round of operation of the multiply-accumulate array 404, 16 times are repeatedly used, which is equivalent to 256 data). The input data block 0 and 16 weight data blocks 00 are input to the multiply-accumulate array 404 for calculation, and 16 values are obtained (the products in the channel direction are added), which are stored as intermediate results. This process is the multiply-accumulate array 404 round of operations. Then, the data is read for the second time, so as to perform the second round of operation of the multiply-accumulate operation array 404 . As mentioned above, the 36 input data blocks of 0-35 on the input data need to be inner producted with the weight data block 00 in each convolution kernel, so there is no need to read the weight data repeatedly when reading the data for the second time Block 00, just read the input data block 1 with a size of 16×1×1 from the input data, match the input data block 1 to the 16 weight data 00, and use the multiply-accumulate operation array 404 to perform calculations to obtain 16 values are also stored as intermediate results. Then read the data block 2 that is 16 * 1 * 1 according to the same data reading mode as the second round operation, read a size and carry out the third round operation of the multiply-multiply-accumulate operation array 404, ..., then according to the second round operation The same way of reading data as the round operation, reads an input data block 35 with a size of 16×1×1, and performs the 36th round operation of the multiply-accumulate array 404 . So far, the convolution operation of the input data and the weight data block 00 is completed, and at the same time, 36 sets of intermediate results are stored, and each set of intermediate results has 16 values.

接下來，在第37輪運算時，從快取中讀取16個大小為16×1×1的權重資料塊01，從輸入資料中讀取一個大小為16×1×1的輸入資料塊0，將輸入資料塊0分別匹配給16個權重資料塊01，並利用乘積累加運算陣列404進行運算，得到16個數值，由於這16個數值與第一輪運算中得到的16數值均對應於輸出資料上的輸入資料塊0，因此需要將這16個數值和分別與第一輪運算中得到的16個數值相加，得到新的16個數值，並儲存為新的中間結果，覆蓋掉第一輪運算儲存的16個中間結果。按照與前36次同樣的取資料方式，乘積累加運算陣列404進行第37輪至第72輪運算，完成輸入資料與權重資料塊01的卷積運算。 Next, in the 37th round of calculation, read 16 weight data blocks 01 with a size of 16×1×1 from the cache, and read an input data block 0 with a size of 16×1×1 from the input data , match the input data block 0 to the 16 weight data blocks 01 respectively, and use the multiply-accumulate operation array 404 to perform calculations to obtain 16 values, because these 16 values and the 16 values obtained in the first round of calculation are all corresponding to the input The input data block 0 on the output data, so these 16 values need to be added to the 16 values obtained in the first round of calculations to obtain new 16 values, and stored as new intermediate results, overwriting the first 16 intermediate results stored in a round of calculation. According to the same data fetching method as the previous 36 times, the multiply-accumulate operation array 404 performs the 37th to 72nd rounds of operations to complete the convolution operation of the input data and the weight data block 01 .

重複執行上述運算過程，直至完成目標卷積核對輸入資料的全部卷積運算，得到16個二維的輸出資料，這16個二維的輸出資料疊加，得到該目標卷積層的三維的輸出資料。其中，如果下一層也是卷積層，則該輸出資料可以被讀取到快取中，作為下一層運算的輸入資料，繼續進行卷積運算。 Repeat the above operation process until all the convolution operations of the target convolution kernel on the input data are completed, and 16 two-dimensional output data are obtained. These 16 two-dimensional output data are superimposed to obtain the three-dimensional output data of the target convolution layer. Wherein, if the next layer is also a convolutional layer, the output data can be read into the cache and used as the input data for the next layer operation to continue the convolution operation.

需要說明的是，上文所舉的是一個為了便於讀者理解本申請方案的一個具體實施例。在該實施例中，由於一次讀取的權重值數量多於輸入資料數量，因此，當相鄰兩輪運算中使用的權重值重複時，為了提高資料處理效率，沒有重複讀取權重資料塊。但是這並不是對本申請方案的限制，在其他實施例中，也可以按照其他順序讀取資料，在按照其他順序讀取資料時，可以重複或者不重複的讀取資料塊。 It should be noted that what is mentioned above is a specific embodiment for the convenience of readers to understand the solution of this application. In this embodiment, since the number of weight values read at one time is more than the number of input data, when the weight values used in two adjacent rounds of calculations are repeated, in order to improve data processing efficiency, the weight data blocks are not repeatedly read. However, this is not a limitation to the solution of the present application. In other embodiments, data may also be read in other orders. When reading data in other orders, data blocks may be read repeatedly or not.

具體實施時，本申請不受所描述的各個步驟的執行順序的限制，在不產生衝突的情況下，某些步驟還可以採用其它順序進行或者同時進行。 During specific implementation, the present application is not limited by the execution order of the described steps, and some steps may be performed in other orders or simultaneously in the case of no conflict.

綜上所述，本申請實施例提出的卷積神經網路的運算方法，對於卷積神經網路中具有不同網路結構的各個卷積層來說，可以動態地調整目標調度模式，使得每一個卷積層可以採用與其結構匹配的調度模式，來對輸入資料和目標卷積核進行資料塊的拆分及重組，使得拆分後的權重資料塊中包含的權重值和特徵資料塊中包含的特徵值的數量能夠實現對乘積累加運算陣列的運算資源的最大化利用，從整體上提高了硬體加速器的資源利用率，進而提高卷積神經網路的運算速度。 In summary, the convolutional neural network computing method proposed in the embodiment of this application can dynamically adjust the target scheduling mode for each convolutional layer with different network structures in the convolutional neural network, so that each The convolutional layer can use a scheduling mode that matches its structure to split and reorganize the input data and the target convolution kernel data blocks, so that the weight values contained in the split weight data blocks and the features contained in the feature data blocks The number of values enables the operation of an array of multiply-accumulate operations The maximum utilization of computing resources improves the resource utilization of hardware accelerators as a whole, thereby improving the computing speed of convolutional neural networks.

以上對本申請實施例所提供的一種卷積神經網路運算方法及裝置進行了詳細介紹，本文中應用了具體個例對本申請的原理及實施方式進行了闡述，以上實施例的說明只是用於幫助理解本申請的方法及其核心思想；同時，對於本領域的技術入員，依據本申請的思想，在具體實施方式及應用範圍上均會有改變之處，綜上，本說明書內容不應理解為對本申請的限制。 The above is a detailed introduction to a convolutional neural network computing method and device provided by the embodiment of the present application. In this paper, a specific example is used to illustrate the principle and implementation of the present application. The description of the above embodiment is only used to help Understand the method of this application and its core idea; at the same time, for technical personnel in this field, according to the idea of this application, there will be changes in the specific implementation and application scope. In summary, the content of this specification should not be understood For the limitation of this application.

101,102,103,104,105:步驟 101, 102, 103, 104, 105: steps

Claims

A convolutional neural network computing method, applied to a convolutional neural network computing device, the convolutional neural network computing device includes a multiply-accumulate array, the multiply-accumulate array includes a plurality of multiply-accumulate computing units , and the operation method of the convolutional neural network includes: determining the number of target convolution kernels in a target convolution layer and a first size information of the target convolution kernels; determining a target scheduling mode according to the number of target convolution kernels and the first size information, wherein the target scheduling mode corresponds to a size of a convolution calculation block; According to the target scheduling mode, reorganize the weight data in the target convolution kernel, and output the reorganized weight data to the multiply-accumulate array; According to the target scheduling mode, reorganize the input data in the target convolutional layer, and output the reorganized input data to the multiply-accumulate operation array; and Using the multiply-accumulate array to perform multiply-accumulate operations based on the reorganized weight data and the reorganized input data, wherein the number of multiply-accumulate units used in each round of the multiply-accumulate array corresponds to the The size of the convolution computation block described above.

The computing method of the convolutional neural network according to claim 1, wherein the target scheduling mode corresponds to the minimum number of rounds of computing for the input data and the target convolution kernel to be completed by the multiply-accumulate computing array.

The convolutional neural network computing method according to claim 1, wherein the first size information includes depth information of the target convolution kernel in a channel direction.

The operation method of the convolutional neural network according to claim 1, wherein the target scheduling mode is selected from a plurality of preset scheduling modes.

The operation method of the convolutional neural network according to claim 1, wherein the multiply-accumulate operation array stores intermediate data in a cache, and the step of determining the target scheduling mode is further based on the size of the cache to determine the target scheduling mode.

The operation method of the convolutional neural network as claimed in item 1, wherein the number of the target convolution kernel is M, the size of the convolution calculation block is an integer multiple of m, and M is an integer multiple of m, M and m are all positive integers.

The convolutional neural network computing method according to claim 1, wherein in the step of reorganizing the input data in the target convolution layer, the reorganized input data matches the reorganized weight data.

A convolutional neural network computing device is used to perform convolution operations on a target convolution kernel and input data in a target convolutional layer, and the convolutional neural network computing device includes: A scheduling mode unit, which determines a target scheduling mode according to the number of target convolution kernels and a first size information, wherein the target scheduling mode corresponds to the size of a convolution calculation block; A first data processing unit, reorganizing the weight data in the target convolution kernel according to the target scheduling mode; a second data processing unit, reorganizing the input data in the target convolutional layer according to the target scheduling mode; and A multiply-accumulate operation array, including a plurality of multiply-accumulate operation units, which perform multiply-accumulate operations based on the reorganized weight data and the reorganized input data, wherein, the multiply-accumulate operation array used in each round of calculation The number of the multiply-accumulate operation units corresponds to the size of the convolution calculation block.

The convolutional neural network computing device according to claim 8, wherein the target scheduling mode corresponds to the minimum number of computing rounds for the multiply-accumulate computing array to complete the input data and the target convolution kernel.

The convolutional neural network computing device according to claim 8, wherein the first size information includes depth information of the target convolution kernel in a channel direction.

The convolutional neural network computing device according to claim 8, wherein the target scheduling mode is selected from a plurality of preset scheduling modes.

The convolutional neural network computing device according to claim 11, wherein the plurality of preset scheduling modes are stored in a memory.

The convolutional neural network computing device according to claim 8, wherein the multiply-accumulate array stores intermediate data in a cache, and the scheduling mode unit determines the target scheduling mode according to the size of the cache .

The convolutional neural network computing device according to claim 8, wherein the number of the target convolution kernels is M, the size of the convolution calculation block is an integer multiple of m, and M is an integer multiple of m, M and m All are positive integers.

The convolutional neural network computing device according to claim 8, wherein the first data processing unit performs data reorganization by writing and reading weight data in the target convolution kernel from a cache.