CN113902107A - Data processing method, readable medium and electronic device for a neural network model fully connected layer
- Publication number: CN113902107A (application CN202111370380.4A)
- Authority: CN (China)
- Prior art keywords: matrix, data, convolution, convolved, processed
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS; G06—COMPUTING; CALCULATING OR COUNTING; G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS; G06N3/00—Computing arrangements based on biological models; G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons; G06N3/063—Physical realisation using electronic means
- G06N3/04—Architecture, e.g. interconnection topology; G06N3/045—Combinations of networks
Abstract
The present application relates to the field of artificial intelligence and provides a data processing method for the fully connected layer of a neural network model. The method converts the fully connected operation between N data vectors to be processed and the weight matrix of the fully connected layer into a convolution operation between a matrix to be convolved and convolution kernels, so that the weight matrix W needs to be loaded from memory only once to process all N data vectors. The matrix to be convolved is formed from the data vectors to be processed of the fully connected layer, and the convolution kernels are converted from the weight matrix. The data in the result of the converted convolution operation are identical to the corresponding data in the processing result of the fully connected layer, and their arrangement order is also the same. The number of weight-coefficient loads is therefore reduced without affecting the operation of the network layer that follows the fully connected layer, which greatly improves data processing efficiency.
Description
Technical Field
The present application relates to the field of artificial intelligence, and in particular, to a data processing method, a readable medium, and an electronic device for a fully connected layer of a neural network model.
Background
With the development of artificial intelligence, neural network models such as convolutional neural networks (CNNs), recurrent neural networks (RNNs), and deep neural networks (DNNs) have been applied in many fields; for example, neural network models can be used for speech recognition, image classification, and object detection.
In practice, the fully connected layer is an indispensable part of a neural network model: it passes the operation results of the neurons in one layer to the inputs of all neurons in the next layer so that the next layer can continue the computation, and it integrates the local features of the processed object into a global feature. Although the amount of computation in the fully connected layer is small, the number of weight coefficients (weights) it requires is large; in most cases, the weights of the fully connected layers account for more than 70% of the weights of the entire neural network. How to reduce the reading of the weight coefficients of the fully connected layer when running a neural network model therefore becomes crucial.
Disclosure of Invention
The application aims to provide a data processing method, a readable medium, and an electronic device for the fully connected layer of a neural network model. The fully connected operation performed by the fully connected layer is converted into a convolution operation performed by a convolutional layer to obtain a convolution result. During the convolution operation, multiple images to be recognized or multiple speech segments can be processed while the weight coefficients are fetched from memory only once, and the processing results for all of them are output. This reduces the processing time for multiple pieces of data to be processed and improves the computational efficiency of the neural network model.
A first aspect provides a data processing method for a fully connected layer of a neural network model, applied to an electronic device, including: acquiring a plurality of data vectors to be processed and the weight matrix of the fully connected layer; converting the plurality of data vectors to be processed into a matrix to be convolved, and converting the weight matrix into at least one convolution kernel; and performing a convolution operation on the matrix to be convolved and the convolution kernels to obtain the fully-connected-layer processing results of the plurality of data vectors to be processed.
That is, in the data processing method for the fully connected layer of a neural network model provided by the present application, the operation accelerator 100 converts the fully connected operation between the data vectors to be processed and the weight matrix into a convolution operation between a matrix to be convolved and convolution kernels, so that N data vectors to be processed can be handled while the weight matrix W is fetched from memory only once. The weight matrix does not need to be loaded from memory once per processed data vector, as it is in an ordinary fully connected computation. It can be understood that the method guarantees that each datum in the convolution output matrix is identical to the corresponding datum in the N output data of the fully connected layer, and that the data are arranged in the same order. The number of weight-coefficient loads is therefore reduced, and since the order of the output data is unchanged, the operation of the network layer following the fully connected layer is unaffected, which greatly improves data processing efficiency.
In a possible implementation of the first aspect, the height of the matrix to be convolved equals the dimension (width value) of a data vector to be processed, and the product of the width and the depth of the matrix to be convolved equals the number of data vectors to be processed.
In one possible implementation of the first aspect, the height of each convolution kernel is the same as the width of the weight matrix, and the number of convolution kernels is the same as the height of the weight matrix.
In a possible implementation of the first aspect, each column of data in the matrix to be convolved corresponds to one data vector to be processed.
In one possible implementation of the first aspect, the convolution kernels are one-dimensional vectors, and each convolution kernel corresponds to a row of data in the weight matrix.
In a possible implementation of the first aspect, during the convolution of the matrix to be convolved with the convolution kernels, the order in which the columns of the matrix to be convolved are multiplied by the corresponding convolution kernels is the same as the order in which the plurality of data vectors to be processed are input into the fully connected layer.
In a possible implementation of the first aspect, when the matrix to be convolved is convolved with the convolution kernels, the sliding stride is set to 1 and the padding is set to 0.
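To make the shape relations in these implementations concrete, the following is a minimal Python sketch; the variable names and concrete sizes are illustrative assumptions, not taken from the patent:

```python
# Shape relations from the implementations above (illustrative sizes).
N, X, H = 6, 4, 2        # number of data vectors, vector dimension, output dimension
conv_w, conv_d = 2, 3    # any factorization works as long as width * depth == N

assert conv_w * conv_d == N          # product of width and depth equals N
matrix_shape = (conv_d, conv_w, X)   # matrix to be convolved: depth x width x height
kernel_shape = (1, 1, X)             # each convolution kernel: width 1, depth 1, height X
num_kernels = H                      # one kernel per row of the weight matrix
```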
In a second aspect, embodiments of the present application provide a readable medium having instructions stored thereon that, when executed by a processor of an electronic device, cause the electronic device to implement any of the data processing methods for the fully connected layer of a neural network model provided in the first aspect and its possible implementations.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory for storing instructions to be executed by one or more processors of the electronic device, and one or more processors for executing the instructions in the memory to perform any of the data processing methods for the fully connected layer of a neural network model provided in the first aspect and its possible implementations.
In a fourth aspect, the present application provides a computer program product including a computer program/instructions, which when executed by a processor, implement the data processing method for a neural network model full connectivity layer in the first aspect and possible implementations of the first aspect.
Drawings
FIG. 1 is a diagram illustrating a convolutional neural network model according to an embodiment of the present application;
FIG. 2 is a schematic diagram illustrating a recurrent neural network model according to an embodiment of the present application;
FIG. 3 is a flow diagram illustrating a forward computation flow of a fully-concatenated operation, according to an embodiment of the present application;
FIG. 4 is a diagram illustrating an operation of multiplying a first to-be-processed data vector of a fully-connected layer by a transpose of a weight matrix of the fully-connected layer, according to an embodiment of the present application;
FIG. 5 is an exemplary diagram illustrating the conversion of 6 to-be-processed data vectors of a fully-connected layer into a to-be-convolved matrix of a convolutional layer, according to an embodiment of the present application;
FIG. 6 is a diagram illustrating an example of a weight matrix for fully connected layers converted into convolution kernels for convolutional layers, according to an embodiment of the present application;
FIG. 7A is a diagram illustrating an example of a convolution matrix to be convolved with 2 convolution kernels to obtain a convolution output matrix according to an embodiment of the present application;
FIGS. 7B-7D are diagrams illustrating an example of convolving data blocks in the matrix to be convolved with convolution kernel K1 to obtain data in the convolution output matrix, according to an embodiment of the application;
FIG. 8 is a block diagram illustrating a computing accelerator, according to an embodiment of the present application;
FIG. 9 is a block diagram illustrating another computing accelerator, according to an embodiment of the present application;
FIG. 10 is a flow chart illustrating a forward calculation of a convolution operation according to an embodiment of the present application;
FIG. 11 is a flow chart illustrating a method of data processing according to an embodiment of the present application;
FIG. 12 is a block diagram illustrating an electronic device according to some embodiments of the present application;
fig. 13 is a block diagram illustrating a system on a chip (SoC), according to some embodiments of the present application.
Detailed Description
Embodiments of the present application include, but are not limited to, a data processing method, readable medium, and electronic device for a neural network model fully connected layer.
The application provides a data processing method (hereinafter the data processing method) for the fully connected layer of a neural network model, in which the fully connected operation performed by the fully connected layer is converted into a convolution operation performed by a convolutional layer to obtain a convolution result. During the convolution operation, multiple images to be recognized or multiple speech segments can be processed while the weight coefficients are fetched from memory only once, and the processing results for all of them are output. This reduces the processing time for multiple pieces of data to be processed and improves the computational efficiency of the neural network model.
It can be understood that the idea of the data processing method in the present application is applicable to a scenario in which a neural network model including a fully-connected layer needs to process a plurality of data to be processed, where the neural network model including the fully-connected layer may be at least one of a recurrent neural network model, a convolutional neural network model, and a multi-layer feedforward neural network model.
For the convenience of understanding the technical solutions provided by the embodiments of the present application, the following key terms used in the embodiments of the present application are explained:
(1) Neural network model (algorithm): the core of artificial intelligence and one of its branches. Machine learning theory mainly concerns designing and analyzing algorithms that allow computers to "learn" automatically: algorithms that automatically analyze data to obtain rules and use those rules to predict unknown data. The core elements of machine learning are therefore data, algorithms (models), and computing power. Machine learning is applied in a wide range of fields, for example: data mining, data classification, computer vision, natural language processing (NLP), biometric recognition, search engines, medical diagnosis, stock market analysis, DNA sequencing, speech and handwriting recognition, strategic games, and robotics. Machine learning algorithms (models) include, but are not limited to, convolutional neural network models, recurrent neural network models, and deep neural network models.
(2) Convolutional neural network (CNN): a multi-layer neural network in which each layer is composed of several two-dimensional planes and each plane is composed of several independent neurons; the neurons of a plane share weights, and this weight sharing reduces the number of parameters in the network. Currently, the convolution performed by a processor in a convolutional neural network is usually implemented by converting the convolution between the input feature and the weights into a matrix multiplication between a feature map matrix and a weight matrix.
(3) Convolution kernel format: the weight tensor of a convolutional layer has four dimensions: convolution kernel height, convolution kernel width, number of input channels (convolution kernel depth), and number of output channels (number of convolution kernels). When the convolutional layer convolves with only one convolution kernel, the weight matrix is that convolution kernel; when it convolves with two or more convolution kernels, the weight matrix can be regarded as a matrix composed of the convolution kernels used for the convolution.
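As a concrete illustration of these four dimensions, a convolution weight tensor can be laid out as follows; this is a sketch with arbitrary sizes, and the axis order is an assumption (it varies between frameworks):

```python
import numpy as np

# kernel height, kernel width, input channels (kernel depth), output channels (number of kernels)
kH, kW, C_in, C_out = 1, 1, 4, 2
weights = np.zeros((kH, kW, C_in, C_out))   # four-dimensional convolution weight tensor
```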
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
As described above, the fully-connected layer is an indispensable part constituting the neural network model, and the data processing procedure of the fully-connected layer is described below by taking the convolutional neural network and the cyclic neural network as examples.
FIG. 1 illustrates a convolutional neural network 11 including fully connected layers, according to an embodiment of the present application. As shown in FIG. 1, the convolutional neural network 11 may include convolutional layers 1 to a (a ≧ 2), pooling layers 1 to a, fully connected layers 1 to a, and a softmax layer (a processing layer of the convolutional neural network that outputs the probabilities of the classes to which a picture belongs), and can classify input image data, voice data, and the like.
Fig. 2 illustrates a recurrent neural network 20 that includes fully connected layers, according to an embodiment of the present application. The recurrent neural network 20 is mainly used to process sequence data, such as input speech segments, time series data, character strings, and dialogs. As shown in FIG. 2, the recurrent neural network model 20 includes input layers 1 to c (c ≧ 2), RNN units 1 to c, output layers 1 to c, and fully connected layers 1 to 2c-1.
Specifically, in a convolutional neural network or a recurrent neural network, the core operation of the fully connected layer (FC) is the fully connected operation, which is a matrix-vector product; the output data Fout of the fully connected layer can be represented by formula (1):
Fout = Act(Fin × W^T + b)   (1)
where Act() is the activation function; Fin is the data vector to be processed; W is the weight matrix of the fully connected layer, and W^T is its transpose; b is the bias term of the fully connected layer. The dimension of Fin is X, i.e., the data vector to be processed contains X elements; the dimension of Fout is H, i.e., the output data contains H elements; the dimension of W is (H, X), i.e., W contains X × H elements; and the dimension of b is H. In the embodiments of the present application, "×" denotes multiplication.
In some embodiments, in the neural network model, the activation function may include, but is not limited to, at least one of: a sigmoid function, a tanh function, a binary (step) function, or a rectified linear unit (ReLU).
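As an illustration of formula (1), the following minimal sketch computes the output of a fully connected layer for a single data vector, assuming NumPy and ReLU as Act(); the values are random placeholders:

```python
import numpy as np

def relu(x):                              # Act(): ReLU chosen for illustration
    return np.maximum(x, 0.0)

X_dim, H_dim = 4, 2                       # X and H of formula (1)
rng = np.random.default_rng(0)
Fin = rng.standard_normal(X_dim)          # data vector to be processed, dimension X
W = rng.standard_normal((H_dim, X_dim))   # weight matrix, dimension (H, X)
b = rng.standard_normal(H_dim)            # bias term, dimension H

Fout = relu(Fin @ W.T + b)                # formula (1): Fout = Act(Fin x W^T + b)
```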
FIG. 3 is a flow chart of the forward computation of the fully connected operation. As shown in FIG. 3, Fin represents the data vector to be processed (the feature vector input to the fully connected layer), W represents the weight matrix of the fully connected layer, b represents the bias, Act() is the activation function, Fout0 represents the result of multiplying Fin by the transpose of W, and Fout represents the output data (the feature vector output by the fully connected layer). In FIG. 3, 301 denotes multiplying the feature vector Fin by the transpose of the weight matrix W to obtain Fout0; 302 denotes adding the bias b to Fout0 to obtain Fout1; and 303 denotes computing, through the activation operation, the activation value of each datum in Fout1 to obtain the final result.
Fig. 4 shows a schematic diagram of multiplying the first to-be-processed data vector of a fully connected layer by the transpose of the weight matrix of the fully connected layer. For illustration, the number of data vectors Fin to be processed is set to 6, and the dimension of each data vector Fin is set to 4, i.e., each data vector to be processed comprises 4 elements. As shown in FIG. 4, the first data vector Fin comprises the four elements X11, X12, X13, X14; the second data vector Fin comprises the four elements X21, X22, X23, X24; and so on, up to the sixth data vector Fin, which comprises X61, X62, X63, X64. The dimension of the transpose of the weight matrix of the fully connected layer is (2, 4), i.e., it comprises 8 elements: W11, W21, W12, W22, W13, W23, W14, W24. As shown in FIG. 4, multiplying the first data vector Fin by the transpose of the weight matrix yields the first output data Fout, which comprises 2 elements, Y11 and Y12, where Y11 = X11×W11 + X12×W12 + X13×W13 + X14×W14 and Y12 = X11×W21 + X12×W22 + X13×W23 + X14×W24. By analogy, multiplying the sixth data vector Fin by the transpose of the weight matrix yields the sixth output data Fout, which comprises 2 elements, Y61 and Y62, where Y61 = X61×W11 + X62×W12 + X63×W13 + X64×W14 and Y62 = X61×W21 + X62×W22 + X63×W23 + X64×W24.
As can be seen from formula (1), FIG. 3, and FIG. 4, the fully connected layer reads the weight matrix W from memory once for every data vector Fin it processes. When a convolutional neural network or a recurrent neural network needs to process N samples, the fully connected layer needs to process N data vectors Fin, and the weight matrix W must be fetched from memory N times. The N samples may be N images to be processed, N texts to be translated, or N speech segments to be recognized. It can be understood that when the number N of samples is small, the weight matrix W is re-read from memory only a few times; but when N is large, W is repeatedly read many times, so implementing formula (1) requires a large amount of data reading and run time.
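The cost described here can be sketched as follows; load_weight_matrix is a hypothetical stand-in for fetching W from memory once per vector, and the other names are likewise illustrative:

```python
import numpy as np

def act(x):
    return np.maximum(x, 0.0)

N, X_dim, H_dim = 6, 4, 2
rng = np.random.default_rng(0)
samples = rng.standard_normal((N, X_dim))          # the N data vectors Fin
W_in_memory = rng.standard_normal((H_dim, X_dim))
b = np.zeros(H_dim)

def load_weight_matrix():
    # hypothetical helper: stands in for a read of W from (off-chip) memory
    return W_in_memory.copy()

outputs = []
for Fin in samples:              # N iterations, so W is fetched N times
    W = load_weight_matrix()
    outputs.append(act(Fin @ W.T + b))
```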
To solve this problem, the present application provides a data processing method that converts the fully connected operation between the data vectors to be processed and the weight matrix into a convolution operation between a matrix to be convolved and convolution kernels, so that N data vectors to be processed can be handled while the weight matrix W is fetched from memory only once. The matrix to be convolved is formed from the data vectors to be processed of the fully connected layer, and the convolution kernels are converted from the weight matrix. Specifically, to make the result of the convolution operation match the result of the fully connected operation, the method keeps the data in the matrix to be convolved in the same order as the data vectors to be processed, and the data in the convolution kernels in the same order as the data in the weight matrix; it sets the height of the matrix to be convolved and the height of each convolution kernel to the dimension of a data vector to be processed, the product of the depth and the width of the matrix to be convolved to the number of data vectors to be processed, and the number of convolution kernels to the dimension of the output data of the fully connected layer. This guarantees that the result of convolving the matrix to be convolved with the convolution kernels matches the result of multiplying the data vectors to be processed by the weight matrix of the fully connected layer.
For example, consider the multiplication of the first data vector to be processed by the transpose of the weight matrix shown in FIG. 4. The operation factors of the fully connected operation are: X11 is multiplied by W11, X12 by W12, X13 by W13, and X14 by W14, and the products are summed to obtain the value of Y11. The convolution operation must therefore also multiply X11 by W11, X12 by W12, X13 by W13, and X14 by W14, and sum the products to obtain the value of Y11. The implementation of the convolution operation of the present application is described in detail below with reference to FIGS. 5 to 7.
Fig. 5 shows an exemplary diagram of converting 6 data vectors to be processed of a fully connected layer into a matrix to be convolved. FIG. 5(a) shows the 6 data vectors to be processed; FIG. 5(b) shows the converted matrix to be convolved. For convenience of illustration, the convolutional layer is configured to process 6 data vectors of the fully connected layer at a time, and the dimension of each data vector is set to 4.
Specifically, as shown in fig. 5(a), the number of data vectors to be processed of the fully connected layer is 6 and their dimension is 4. The data in the first data vector are arranged in the order X11, X12, X13, X14; the data in the second data vector in the order X21, X22, X23, X24; and so on, up to the sixth data vector, whose data are arranged in the order X61, X62, X63, X64. The data processing method converts these 6 data vectors into the matrix to be convolved while keeping the arrangement order of the data unchanged. As shown in fig. 5(b), the height of the matrix to be convolved is set to the dimension of a data vector to be processed, i.e., 4, and the product of its width and depth is set to the number of data vectors to be processed, i.e., 6. For example, as shown in fig. 5(b), the width of the matrix to be convolved may be set to 2 and its depth to 3. The data in the matrix to be convolved are arranged in the order X11, X12, X13, ... X62, X63, X64, forming a three-dimensional matrix of size 2 × 3 × 4.
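In NumPy terms this conversion is a reshape that preserves the data order; the following is a minimal sketch under the sizes of FIG. 5, where the depth-major, width-minor layout matches the sliding order of FIGS. 7B to 7D and the variable names are assumptions:

```python
import numpy as np

N, X_dim = 6, 4                       # number of data vectors and their dimension
conv_w, conv_d = 2, 3                 # chosen so that conv_w * conv_d == N

rng = np.random.default_rng(0)
Fin = rng.standard_normal((N, X_dim))            # the 6 data vectors, in input order

# depth x width x height layout: position (d, w) holds data vector d * conv_w + w,
# so the element order X11, X12, ..., X64 is preserved
to_convolve = Fin.reshape(conv_d, conv_w, X_dim)
assert to_convolve.shape == (3, 2, 4)
```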
FIG. 6 shows an exemplary diagram of converting the weight matrix of the fully connected layer into convolution kernels. FIG. 6(a) shows the weight matrix of the fully connected layer; FIG. 6(b) shows the converted convolution kernels. In practice, the width of the weight matrix is the dimension of a data vector to be processed, and its height is the dimension of the output data of the fully connected layer. For convenience of explanation, the height of the weight matrix is set to 2 and its width to 4 in this example.
As shown in fig. 6(a), the data of the weight matrix of the fully connected layer are arranged in the order W11, W12, W13, ..., W24, forming a two-dimensional matrix of size 4 × 2. The data processing method converts the weight matrix into convolution kernels while keeping the arrangement order of the data unchanged. As shown in fig. 6(b), the number of converted convolution kernels is set to the height of the weight matrix, i.e., 2; the height of each converted convolution kernel is set to the width of the weight matrix, i.e., 4; and the width and depth of each converted convolution kernel are set to 1. As shown in fig. 6(b), the 2 converted convolution kernels are convolution kernel K1 and convolution kernel K2. The data in convolution kernel K1 are the first row of the weight matrix, arranged in the order W11, W12, W13, W14; K1 is a matrix of size 1 × 1 × 4. The data in convolution kernel K2 are the second row of the weight matrix, arranged in the order W21, W22, W23, W24; K2 is a matrix of size 1 × 1 × 4.
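The corresponding conversion of the weight matrix is likewise a reshape that keeps each row of W intact as one 1 × 1 × 4 kernel; a sketch under the same assumptions as above:

```python
import numpy as np

X_dim, H_dim = 4, 2
rng = np.random.default_rng(0)
W = rng.standard_normal((H_dim, X_dim))   # weight matrix: H rows of X weights each

# H kernels of size 1 x 1 x X; kernel h is row h of W, data order unchanged
kernels = W.reshape(H_dim, 1, 1, X_dim)
assert kernels.shape == (2, 1, 1, 4)      # K1 = kernels[0], K2 = kernels[1]
```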
Fig. 7A shows an exemplary diagram of convolving the matrix to be convolved with the 2 convolution kernels to obtain the convolution output matrix. As shown in fig. 7A, the matrix to be convolved is the one shown in fig. 5(b) and the convolution kernels are those shown in fig. 6(b). Convolving the matrix to be convolved with convolution kernels K1 and K2 yields the convolution output matrix, whose data are arranged in the order Y11, Y12, Y21, Y22, Y31, ..., Y61, Y62, forming a three-dimensional matrix of size 2 × 3 × 2. The operation process of the three-dimensional convolution is described below, taking as an example the convolution of convolution kernel K1 with the matrix to be convolved to obtain Y11, Y21, Y31, Y41, Y51, and Y61 of the convolution output matrix.
Specifically, as shown in fig. 7B to 7D, for the matrix to be convolved of size 2 × 3 × 4, a 1 × 1 × 4 sliding window starts at the first datum X11 at its upper left and slides over the matrix with a stride of 1 and a padding of 0. As shown in fig. 7B, sliding from left to right along the width direction first yields the 1 × 1 × 1 data Y11 and Y21 in turn, where Y11 = X11×W11 + X12×W12 + X13×W13 + X14×W14 and Y21 = X21×W11 + X22×W12 + X23×W13 + X24×W14. As shown in fig. 7C, the window then slides back one datum along the depth direction from X11 and again slides from left to right, yielding Y31 and Y41 in turn, where Y31 = X31×W11 + X32×W12 + X33×W13 + X34×W14 and Y41 = X41×W11 + X42×W12 + X43×W13 + X44×W14. As shown in fig. 7D, the window slides back one more datum along the depth direction from X31 and again slides from left to right, yielding Y51 and Y61 in turn, where Y51 = X51×W11 + X52×W12 + X53×W13 + X54×W14 and Y61 = X61×W11 + X62×W12 + X63×W13 + X64×W14.
The above three-dimensional convolution is thus performed with a stride of 1 and a padding of 0. The data W11, W12, W13, W14 in convolution kernel K1 are convolved with the corresponding data of the matrix to be convolved in their arrangement order, so the operation factors of the convolution operation match those of the fully connected operation. That is, the values of Y11, Y21, Y31, Y41, Y51, and Y61 in the convolution output matrix are the same as the values of Y11, Y21, Y31, Y41, Y51, and Y61 in the 6 output data obtained in FIG. 4 by multiplying the 6 data vectors to be processed by the weight matrix of the fully connected layer.
Similarly, convolution kernel K2 is convolved with the matrix to be convolved to obtain Y12, Y22, Y32, Y42, Y52, and Y62 of the convolution output matrix. The only difference from convolution kernel K1 is the data in the kernel: K1 contains W11, W12, W13, W14, while K2 contains W21, W22, W23, W24. The specific convolution process is otherwise the same as in FIGS. 7B to 7D and is not repeated here.
It can be understood that the data processing method of the present application converts the 6 data vectors to be processed and the weight matrix of the fully connected layer into the matrix to be convolved and the convolution kernels, respectively; when the matrix to be convolved is convolved with the convolution kernels, X11 is multiplied by W11, X12 by W12, X13 by W13, and X14 by W14, and the products are summed to obtain the value of Y11, exactly as in the fully connected operation.
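Putting the two conversions together, the following sketch checks numerically that the stride-1, padding-0 convolution reproduces the fully connected result element by element and in the same order; with 1 × 1 × X kernels, the convolution at each window position reduces to a dot product:

```python
import numpy as np

N, X_dim, H_dim = 6, 4, 2
conv_w, conv_d = 2, 3
rng = np.random.default_rng(0)
Fin = rng.standard_normal((N, X_dim))
W = rng.standard_normal((H_dim, X_dim))

fc_out = Fin @ W.T                                 # fully connected reference, N x H

to_convolve = Fin.reshape(conv_d, conv_w, X_dim)   # matrix to be convolved
kernels = W.reshape(H_dim, 1, 1, X_dim)            # H kernels of 1 x 1 x X

conv_out = np.empty((conv_d, conv_w, H_dim))
for d in range(conv_d):                            # slide along the depth direction
    for w in range(conv_w):                        # slide along the width, stride 1
        for h in range(H_dim):
            conv_out[d, w, h] = to_convolve[d, w] @ kernels[h, 0, 0]

# same values and same order as the fully connected output
assert np.allclose(conv_out.reshape(N, H_dim), fc_out)
```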
Therefore, the present application designs an operation accelerator that, by converting the fully connected operation into a convolution operation, reads the weight matrix W once and processes N data vectors Fin at a time with that single weight matrix, obtaining the N results Fout0 in one convolution operation. This reduces the number of reads of the weight matrix while still implementing the fully connected operation of formula (1), and thus efficiently computes the large amount of data in a convolutional or recurrent neural network.
It can be understood that the operation accelerator provided by the present application can be applied to any machine learning task that includes fully connected operations. Applicable scenarios include the processing of images, speech, text, and the like, especially scenarios that require processing many images, speech segments, or texts, with a large data volume and a high speed requirement, for example: an electronic device translating multiple passages of text or recognizing multiple speech segments through a recurrent neural network, or an electronic device recognizing multiple images through a convolutional neural network.
In some embodiments, the operation accelerator of the present application may be a neural network processing unit (NPU) or another processor, or a hardware unit that performs convolution operations. The hardware unit for convolution operations may be any one of the following: an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. It can be understood that the operation accelerator of the present application applies to devices capable of performing convolution operations, such as mobile phones, tablet computers, servers, wearable devices, smart speakers, and smart televisions.
Fig. 8 is a block diagram illustrating an operation accelerator 100 according to an embodiment of the present application. As shown in fig. 8, the arithmetic accelerator 100 includes an arithmetic circuit 101, and a first matrix converter 102 and a second matrix converter 103 coupled to the arithmetic circuit 101.
The first matrix converter 102 is configured to obtain N to-be-processed data vectors of a fully-connected layer, and determine dimensions of the N to-be-processed data vectors of the fully-connected layer, where the dimension of each to-be-processed data vector of the fully-connected layer is X. For example, as shown in fig. 5(a), the first to-be-processed data vector of the fully-connected layer is arranged in the following order: x11, X12, X13 and X14, wherein the formed matrix is a one-dimensional matrix with the dimension of 4; the second to-be-processed data vector of the full connection layer is arranged in the following sequence: x21, X22, X23 and X24, wherein the formed matrix is a one-dimensional matrix with the dimension of 4; by analogy, the order of the sixth to-be-processed data vector of the full connection layer is as follows: x61, X62, X63, and X64, the matrix formed is a one-dimensional matrix with dimension 4.
To ensure that the operation factors of the convolution operation match those of the fully connected operation, the first matrix converter 102 converts the N data vectors to be processed of the fully connected layer into the matrix to be convolved of the convolutional layer according to their dimension. The data in the matrix to be convolved keep the same arrangement order as the N data vectors to be processed, the height of the matrix to be convolved equals the dimension of a data vector to be processed, and the product of the width and the depth of the matrix to be convolved equals the number N of data vectors to be processed.
For example, as shown in fig. 5, the first matrix converter 102 converts the 6 data vectors to be processed of the fully connected layer shown in fig. 5(a) into the matrix to be convolved of the convolutional layer shown in fig. 5(b) according to their dimension. As shown in fig. 5(b), the data in the matrix to be convolved are arranged in the order X11, X12, X13, ... X62, X63, X64, forming a three-dimensional matrix of size 2 × 3 × 4; the height (H) of the matrix to be convolved equals the dimension 4 of a data vector to be processed, and the product of the width (W) and the depth (D) of the matrix to be convolved equals the number 6 of data vectors to be processed. As shown in fig. 5(b), W of the matrix to be convolved is 2 and D of the matrix to be convolved is 3. It can be understood that the present application does not specifically limit W and D of the matrix to be convolved, provided that their product equals the number of data vectors to be processed.
And the second matrix converter 103 is configured to obtain a weight matrix of the fully-connected layer and determine a dimension of the weight matrix of the fully-connected layer. The dimension of the weight matrix of the full connection layer is represented as (X, H), that is, the weight matrix of the full connection layer is a two-dimensional matrix with the size of X × H, the width X of the weight matrix of the full connection layer is the dimension of the data vector to be processed of the full connection layer, and the height H of the weight matrix of the full connection layer is the dimension of the output data of the full connection layer.
For example, as shown in fig. 6(a), the arrangement order of the data of the weight matrix of the full connection layer is: w11, W12, W13 … … and W24. The formed weight matrix of the full connection layer is a two-dimensional matrix with the size of 4 multiplied by 2, the dimension of the weight matrix of the full connection layer is (4, 2), the dimension of the data vector to be processed of the full connection layer is 4, and the dimension of the output data of the full connection layer is 2.
To ensure that the operation factors of the convolution operation match those of the fully connected operation, the second matrix converter 103 converts the weight matrix of the fully connected layer into the H convolution kernels of the convolutional layer according to the dimension of the weight matrix. The data in the H convolution kernels keep the same arrangement order as the data in the weight matrix, the number H of convolution kernels equals the dimension of the output data of the fully connected layer, the height of each convolution kernel equals the dimension of a data vector to be processed, and the width and depth of each convolution kernel are both 1.
For example, as shown in fig. 6, the second matrix converter 103 converts the weight matrix of the fully connected layer shown in fig. 6(a) into the convolution kernels of the convolutional layer according to the dimension of the weight matrix, where the number of convolution kernels equals the dimension 2 of the output data of the fully connected layer. As shown in fig. 6(b), the 2 converted convolution kernels are convolution kernel K1 and convolution kernel K2. The data in convolution kernel K1 are the first row of the weight matrix, arranged in the order W11, W12, W13, W14, forming a matrix of size 1 × 1 × 4; the height of K1 equals the dimension 4 of a data vector to be processed, and its width and depth are both 1. The data in convolution kernel K2 are the second row of the weight matrix, arranged in the order W21, W22, W23, W24, forming a matrix of size 1 × 1 × 4; the height of K2 equals the dimension 4 of a data vector to be processed, and its width and depth are both 1.
The operation circuit 101 is connected to the first matrix converter 102 and the second matrix converter 103, respectively, and is configured to obtain the matrix to be convolved and the H convolution kernels and to calculate their convolution product to obtain the convolution output matrix. When calculating the convolution product of the matrix to be convolved and the H convolution kernels, the operation circuit 101 sets the sliding stride to 1 and the padding to 0.
For example, as shown in fig. 7A, the matrix to be convolved is the 2 × 3 × 4 three-dimensional matrix shown in fig. 5(b), and the 2 convolution kernels are the two 1 × 1 × 4 three-dimensional matrices shown in fig. 6(b). The convolution output matrix obtained by convolving them has its data arranged in the order Y11, Y12, Y21, Y22, Y31, ..., Y61, Y62, forming a three-dimensional matrix of size 2 × 3 × 2.
It can be understood that the first matrix converter 102 converts the 6 data vectors to be processed of the fully connected layer into the matrix to be convolved, and the second matrix converter 103 converts the weight matrix of the fully connected layer into the 2 convolution kernels, which ensures that the operation factors of the convolution operation match those of the fully connected operation. This in turn ensures that the results Y11, Y12, Y21, Y22, Y31, ..., Y61 in the convolution output matrix obtained by the convolution operation of FIG. 7A match the output data Y11, Y12, Y21, Y22, Y31, ..., Y61 obtained by the fully connected operation of FIG. 4. In addition, as can be seen from the operations in FIGS. 7A to 7D, the convolution output matrix and the 6 output data of the fully connected layer arrange the corresponding data in the same order, namely Y11, Y12, Y21, Y22, Y31, ..., Y61, Y62.
As can be seen from fig. 4 and fig. 8, the operation accelerator 100 provided by the present application converts the fully connected operation between the data vectors to be processed and the weight matrix into the convolution operation between the matrix to be convolved and the convolution kernels, so that the weight matrix W needs to be fetched from memory only once to process N data vectors to be processed; the weight matrix does not need to be loaded from memory once per data vector, as in fig. 4. It can be understood that the operation accelerator 100 ensures that each datum in the convolution output matrix is identical to the corresponding datum in the N output data of the fully connected layer, and that the data are arranged in the same order. The number of weight-coefficient loads is therefore reduced, and since the order of the output data is unchanged, the operation of the network layer following the fully connected layer (for example, fully connected layer 2 following fully connected layer 1 in fig. 1) is unaffected, which greatly improves data processing efficiency.
In some embodiments, the memory storing the N data vectors to be processed or the weight matrix of the fully connected layer may be an off-chip memory, such as a random-access memory (RAM) or a double data rate synchronous dynamic random-access memory (DDR SDRAM). In other embodiments, this memory may also be a memory provided on the operation accelerator 100.
Optionally, referring to fig. 9, fig. 9 is a structural diagram of another operation accelerator 100 according to an embodiment of the present application. As shown in fig. 9, the operation accelerator 100 may further include a first memory 104 and a second memory 105. The first memory 104 stores the N data vectors to be processed of the fully connected layer, and the second memory 105 stores the weight matrix of the fully connected layer. The first matrix converter 102 may read the N data vectors to be processed from the first memory 104, and the second matrix converter 103 may read the weight matrix from the second memory 105. The first matrix converter 102 and the first memory 104 may exchange data via a bus or may be directly connected as a small subsystem, and the same holds for the second matrix converter 103 and the second memory 105.
In some embodiments, the first matrix converter 102 sequentially obtains the N to-be-processed data vectors of the fully-connected layer from the first memory 104 according to the arrangement order of the N to-be-processed data vectors of the fully-connected layer in the first memory 104, the first matrix converter 102 determines the dimensionality of the obtained N to-be-processed data vectors of the fully-connected layer according to the obtained N to-be-processed data vectors of the fully-connected layer, and converts the N to-be-processed data vectors of the fully-connected layer into the to-be-convolved matrix of the convolution layer according to the dimensionality of the N to-be-processed data vectors of the fully-connected layer.
In some embodiments, the second matrix converter 103 sequentially obtains the data in the weight matrices of the full connection layer according to the arrangement order of the weight matrices of the full connection layer in the second memory 105. The second matrix converter 103 determines the dimension of the weight matrix of the fully-connected layer according to the obtained weight matrix of the fully-connected layer, and converts the weight matrix of the fully-connected layer into H convolution kernels of the convolution layer according to the dimension of the weight matrix of the fully-connected layer.
In some embodiments, after the operation circuit 101 calculates the convolution product of the matrix to be convolved and the H convolution kernels to obtain the convolution output matrix, the operation circuit 101 further applies the same activation function and bias term b (bias) as the fully connected layer to obtain the final output result Fout′ of the convolutional layer, which can be represented by formula (2):
Fout′ = Act(CONV(Fin′, kernel)stride=1 + b)   (2)
where Act() is the activation function of the fully connected layer; Fin′ is the matrix to be convolved; kernel denotes the H convolution kernels; b is the bias term of the fully connected layer; CONV() is the convolution operation of the convolutional layer; and the subscript stride=1 indicates that the convolution is performed with a stride of 1.
In some embodiments, fig. 10 is a flow chart of the forward computation of the convolution operation. As shown in fig. 10, Fin′ is the matrix to be convolved, kernel denotes the H convolution kernels, b is the bias, Act() represents the activation function, and CONV() represents the convolution operation of the convolutional layer; Fout0′ represents the convolution product of Fin′ and the H convolution kernels, and Fout′ represents the final output result of the convolutional layer. In fig. 10, 1001 denotes calculating the convolution product of Fin′ and the H convolution kernels to obtain Fout0′; 1002 denotes adding the bias b to Fout0′ to obtain Fout1′; and 1003 denotes computing, through the activation operation, the activation value of each datum in Fout1′ to obtain the final output result Fout′ of the convolutional layer.
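A sketch of formula (2), reusing the conversions above with ReLU standing in for Act() (the patent leaves Act() generic, so the concrete activation here is an assumption):

```python
import numpy as np

def act(x):                                    # Act(): ReLU chosen for illustration
    return np.maximum(x, 0.0)

N, X_dim, H_dim = 6, 4, 2
conv_w, conv_d = 2, 3
rng = np.random.default_rng(0)
Fin = rng.standard_normal((N, X_dim))
W = rng.standard_normal((H_dim, X_dim))
b = rng.standard_normal(H_dim)

Fin_conv = Fin.reshape(conv_d, conv_w, X_dim)      # Fin': matrix to be convolved
kernels = W.reshape(H_dim, 1, 1, X_dim)            # kernel: H kernels of 1 x 1 x X

# CONV(Fin', kernel) with stride 1 and padding 0: a dot product per (d, w) position
Fout0 = np.einsum('dwx,hx->dwh', Fin_conv, kernels[:, 0, 0, :])
Fout = act(Fout0 + b)                              # formula (2)

assert np.allclose(Fout.reshape(N, H_dim), act(Fin @ W.T + b))  # matches formula (1)
```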
Based on the hardware structure of the operation accelerator 100 in fig. 8 or fig. 9, an embodiment of the present application provides a data processing method whose steps, shown in the flowchart of fig. 11, are executed by the operation accelerator 100. As shown in fig. 11, the method includes:
s1101: the method comprises the steps of obtaining N data vectors to be processed and a weight matrix of a full connection layer, and determining the dimensionality of the N data vectors to be processed and the dimensionality of the weight matrix of the full connection layer.
For example, as shown in fig. 5(a), the number of the to-be-processed data vectors of the fully-connected layer acquired by the computation accelerator 100 is 6, and the dimension of the to-be-processed data vector of the fully-connected layer is 4, that is, the width value of the to-be-processed data vector of the fully-connected layer is 4. The data in the first to-be-processed data vector of the full connection layer are arranged in the sequence: x11, X12, X13, X14; the data in the second to-be-processed data vector is arranged in the order: x21, X22, X23, X24; by analogy, the data in the sixth to-be-processed data vector are arranged in the following sequence: x61, X62, X63 and X64. For example, as shown in fig. 6(a), the order of arrangement of the data of the weight matrix acquired by the computation accelerator 100 is: w11, W12, W13 … … and W24. The weight matrix formed is a two-dimensional matrix of size 4 × 2, the width value of the weight matrix is 4, and the height value of the weight matrix is 2.
S1102: and respectively converting the N data vectors to be processed of the full connection layer into a matrix to be convolved and converting the weight matrix into a convolution kernel based on the dimensionality of the N data vectors to be processed of the full connection layer and the dimensionality of the weight matrix.
In some embodiments, the operation accelerator 100 converts the N data vectors to be processed of the fully connected layer into the matrix to be convolved and converts the weight matrix into the convolution kernels based on their dimensions. The data in the matrix to be convolved keep the same arrangement order as the N data vectors to be processed; the height of the matrix to be convolved equals the dimension of a data vector to be processed; and the product of the width and the depth of the matrix to be convolved equals the number N of data vectors to be processed. The data in the convolution kernels keep the same arrangement order as the data in the weight matrix; the number of convolution kernels equals the dimension of the output data of the fully connected layer; the height of each convolution kernel equals the dimension of a data vector to be processed; and the width and depth of each convolution kernel are both 1.
For example, the arithmetic accelerator 100 converts 6 to-be-processed data vectors of the fully-connected layer into a to-be-convolved matrix while ensuring that the to-be-convolved matrix and the to-be-processed data vectors of the fully-connected layer are arranged in the same order. As shown in fig. 5(b), the height of the matrix to be convolved is set to the dimension of the data vector to be processed of the fully connected layer, i.e., 4. The product of the width and the depth of the matrix to be convolved is set as the number of the data vectors to be processed of the full connection layer, namely 6. For example, as shown in fig. 5(b), the width of the matrix to be convolved may be set to 2, and the depth of the matrix to be convolved may be set to 3. The sequence of the matrices to be convolved is: x11, X12, X13, … … X62, X63, X64, the matrix formed being a three-dimensional matrix of size 2 × 3 × 4.
For example, the arithmetic accelerator 100 converts the weight matrix of the fully-connected layer into a convolution kernel while ensuring that the data in the convolution kernel is in the same order as the arrangement of the data in the weight matrix of the fully-connected layer. As shown in fig. 6(b), the number of transformed convolution kernels is set to the height of the weight matrix of the full-connected layer, i.e., 2; the height of each convolution kernel after conversion is set to the width of the weight matrix of the fully-connected layer, i.e., 4. The width and depth of each convolution kernel after conversion are set to 1. As shown in fig. 6(b), the converted 2 convolution kernels are convolution kernel K1 and convolution kernel K2, respectively. The data in the convolution kernel K1 is the first row of data in the weight matrix of the full connection layer, and the arrangement order of the data in the convolution kernel K1 is: w11, W12, W13 and W14, the convolution kernel K1 is a matrix of 1 × 1 × 4 in size. The data in the convolution kernel K2 is the second row of data in the weight matrix of the full connection layer, and the arrangement order of the data in the convolution kernel K2 is: w21, W22, W23 and W24, the convolution kernel K2 is a matrix of 1 × 1 × 4 in size.
S1103: and calculating the product of the matrix to be convolved and the convolution kernel to obtain a convolution output matrix, wherein the sliding step length set by the operation circuit 101 is 1, and the set filling is 0.
For example, as shown in fig. 7B to 7D, for the matrix to be convolved of size 2 × 3 × 4, a 1 × 1 × 4 sliding window is slid over the matrix with a step of 1 and padding of 0, starting from the first data X11 at its upper left. As shown in fig. 7B, the window first slides from left to right along the width direction, yielding in turn data Y11 of size 1 × 1 × 1 and data Y21 of size 1 × 1 × 1, where Y11 = X11 × W11 + X12 × W12 + X13 × W13 + X14 × W14 and Y21 = X21 × W11 + X22 × W12 + X23 × W13 + X24 × W14. As shown in fig. 7C, the window then moves one position along the depth direction from the position of data X11 and again slides from left to right, yielding in turn data Y31 and data Y41 of size 1 × 1 × 1, where Y31 = X31 × W11 + X32 × W12 + X33 × W13 + X34 × W14 and Y41 = X41 × W11 + X42 × W12 + X43 × W13 + X44 × W14. As shown in fig. 7D, the window then moves one further position along the depth direction from the position of data X31 and slides from left to right once more, yielding in turn data Y51 and data Y61 of size 1 × 1 × 1, where Y51 = X51 × W11 + X52 × W12 + X53 × W13 + X54 × W14 and Y61 = X61 × W11 + X62 × W12 + X63 × W13 + X64 × W14.
The above three-dimensional convolution operation is thus performed with a step of 1 and padding of 0. Specifically, following the arrangement order of the data in the matrix to be convolved, the data W11, W12, W13, W14 in convolution kernel K1 are convolved with the corresponding data in the matrix to be convolved, so that the operands of the convolution operation coincide with the operands of the fully connected operation. That is, the values Y11, Y21, Y31, Y41, Y51, and Y61 in the convolution output matrix are identical to the corresponding values among the 6 output data of the fully connected layer obtained in fig. 4 by multiplying the 6 to-be-processed data vectors by the weight matrix of the fully connected layer.
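Putting the two conversions together, a minimal end-to-end sketch (again with hypothetical names and random placeholder data) can check this equality: a 1 × 1 × 4 window with step 1 and padding 0 reproduces the fully connected output values in the same order.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))   # rows: the 6 to-be-processed vectors
W = rng.standard_normal((2, 4))   # rows: kernel K1 (W1*) and K2 (W2*)

# Reference: the plain fully connected operation.
fc_out = X @ W.T                  # shape (6, 2); row i is (Yi1, Yi2)

# Converted form: vectors arranged on a 2 x 3 grid, dimension as height.
to_conv = X.reshape(3, 2, 4)      # depth x width x height

# Slide the 1 x 1 x 4 window with step 1 and padding 0.
conv_out = np.empty((3, 2, 2))
for d in range(3):                # move along the depth direction
    for w in range(2):            # slide left to right along the width
        for k in range(2):        # one output value per kernel
            conv_out[d, w, k] = to_conv[d, w] @ W[k]

# Same values, same arrangement order as the fully connected output.
assert np.allclose(conv_out.reshape(6, 2), fc_out)
```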
It can be understood that, with the data processing method provided by the present application, each data in the convolution output matrix is guaranteed to be identical to the corresponding data in the N output data of the fully connected layer, and the data in the convolution output matrix is also in the same arrangement order as the corresponding data in the N output data. The number of times the weight coefficients are loaded is therefore reduced while the arrangement order of the output data is unchanged, so that the operation of the network layer following the fully connected layer (for example, the fully connected layer 2 following the fully connected layer 1 in fig. 1) is unaffected, and the data processing efficiency is greatly improved.
FIG. 12 shows a block diagram of an electronic device 10 according to an embodiment of the present application. In one embodiment, the electronic device 10 may include one or more processors 1004, system control logic 1008 coupled to at least one of the processors 1004, system memory 1012 coupled to the system control logic 1008, non-volatile memory (NVM) 1016 coupled to the system control logic 1008, and a network interface 1020 coupled to the system control logic 1008.
In some embodiments, the processor 1004 may include one or more single-core or multi-core processors. In some embodiments, the processor 1004 may include any combination of general-purpose processors and special-purpose processors (e.g., graphics processors, application processors, baseband processors, etc.). In embodiments where the electronic device 10 employs an eNB (enhanced Node B) or a RAN (Radio Access Network) controller, the processor 1004 may be configured to perform the various embodiments described herein.
In some embodiments, system control logic 1008 may include any suitable interface controllers to provide any suitable interface to at least one of processors 1004 and/or any suitable device or component in communication with system control logic 1008.
In some embodiments, system control logic 1008 may include one or more memory controllers to provide an interface to the system memory 1012. The system memory 1012 may be used to load and store data and/or instructions. In some embodiments, the system memory 1012 may include any suitable volatile memory, such as a suitable dynamic random access memory (DRAM).
The NVM 1016 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions. In some embodiments, the NVM 1016 may include any suitable non-volatile memory, such as flash memory, and/or any suitable non-volatile storage device, such as at least one of an HDD (Hard Disk Drive), CD (Compact Disc) Drive, DVD (Digital Versatile Disc) Drive.
The NVM 1016 may include a portion of the storage resources of the apparatus on which the electronic device 10 is installed, or it may be accessible by, but not necessarily part of, the electronic device 10. For example, the NVM 1016 may be accessed over a network via the network interface 1020.
In particular, the system memory 1012 and the NVM 1016 may include a temporary copy and a permanent copy of instructions 1024. The instructions 1024 may include instructions that, when executed by at least one of the processors 1004, cause the electronic device 10 to perform the data processing method described above. In some embodiments, the instructions 1024, hardware, firmware, and/or software components thereof may additionally or alternatively be disposed in the system control logic 1008, the network interface 1020, and/or the processors 1004.
The network interface 1020 may include a transceiver to provide a radio interface for the electronic device 10 to communicate with any other suitable devices (e.g., front-end modules, antennas, etc.) over one or more networks. In some embodiments, the network interface 1020 may be integrated with other components of the electronic device 10. For example, the network interface 1020 may be integrated with at least one of the processors 1004, the system memory 1012, the NVM 1016, and a firmware device (not shown) having instructions that, when executed by at least one of the processors 1004, cause the electronic device 10 to implement the data processing method described above.
The network interface 1020 may further include any suitable hardware and/or firmware to provide a multiple-input multiple-output radio interface. For example, network interface 1020 may be a network adapter, a wireless network adapter, a telephone modem, and/or a wireless modem.
In one embodiment, at least one of the processors 1004 may be packaged together with logic for one or more controllers of system control logic 1008 to form a System In Package (SiP). In one embodiment, at least one of the processors 1004 may be integrated on the same die with logic for one or more controllers of system control logic 1008 to form a system on a chip (SoC).
The electronic device 10 may further include: input/output (I/O) devices 1032.
Fig. 13 shows a block diagram of an SoC (System on Chip) 1100 including the operation accelerator 100 according to an embodiment of the present application. The SoC 1100 is provided in the electronic device 10. In fig. 13, like parts have the same reference numerals. In fig. 13, the SoC 1100 includes: an interconnect unit 1150 coupled to an application processor 1110; a system agent unit 1170; a bus controller unit 1180; an integrated memory controller unit 1140; the operation accelerator 100; a static random access memory (SRAM) unit 1130; and a direct memory access (DMA) unit 1160. When the first memory 104 and the second memory 105 are not disposed on the operation accelerator 100, the SRAM unit 1130 may include the first memory 104 and the second memory 105, where the first memory 104 is configured to store the N to-be-processed data vectors of the fully connected layer and the second memory 105 is configured to store the weight matrix of the fully connected layer. In one embodiment, the SoC 1100 may also include a processor such as a network or communication processor, a compression engine, a GPU, a high-throughput MIC processor, or an embedded processor.
In some embodiments, the operation accelerator 100 may be a neural network processing unit (NPU) or another processor, or may be a hardware unit that performs convolution operations.
It is to be appreciated that as used herein, the term module may refer to or include an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor (shared, dedicated, or group) and/or memory that executes one or more software or firmware programs, a combinational logic circuit, and/or other suitable hardware components that provide the described functionality, or may be part of such hardware components.
It is to be appreciated that in various embodiments of the present application, the processor may be a microprocessor, a digital signal processor, a microcontroller, or the like, and/or any combination thereof. According to another aspect, the processor may be a single-core processor, a multi-core processor, the like, and/or any combination thereof.
Embodiments of the mechanisms disclosed herein may be implemented in hardware, software, firmware, or a combination of these implementations. Embodiments of the application may be implemented as computer programs or program code executing on programmable systems comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For purposes of this application, a processing system includes any system having a processor such as, for example, a Digital Signal Processor (DSP), a microcontroller, an Application Specific Integrated Circuit (ASIC), or a microprocessor.
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. The program code can also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described in this application are not limited in scope to any particular programming language. In any case, the language may be a compiled or interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed via a network or via other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or tangible machine-readable memories used to transmit information over the Internet in the form of electrical, optical, acoustical, or other propagated signals (e.g., carrier waves, infrared signals, digital signals, etc.). Thus, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
In the drawings, some features of the structures or methods may be shown in a particular arrangement and/or order. However, it is to be understood that such specific arrangement and/or ordering may not be required. Rather, in some embodiments, the features may be arranged in a manner and/or order different from that shown in the illustrative figures. In addition, the inclusion of a structural or methodical feature in a particular figure is not meant to imply that such feature is required in all embodiments, and in some embodiments, may not be included or may be combined with other features.
It should be noted that, in the apparatus embodiments of the present application, each unit/module is a logical unit/module. Physically, one logical unit/module may be one physical unit/module, may be part of one physical unit/module, or may be implemented by a combination of multiple physical units/modules; the physical implementation of the logical units/modules themselves is not what matters most, and the combination of functions implemented by these logical units/modules is what is key to solving the technical problem addressed by the present application. Furthermore, in order to highlight the innovative part of the present application, the above apparatus embodiments do not introduce units/modules that are not closely related to solving the technical problem addressed by the present application, which does not mean that no other units/modules exist in the above apparatus embodiments.
It is noted that, in the examples and description of this patent, relational terms such as first and second are used solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
While the present application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those of ordinary skill in the art that various changes in form and details may be made therein without departing from the spirit and scope of the present application.
Claims (10)
1. A data processing method for a fully connected layer of a neural network model, for use in an electronic device, characterized by comprising:
acquiring a plurality of to-be-processed data vectors and a weight matrix of a full connection layer;
converting a plurality of data vectors to be processed of the full connection layer into a matrix to be convolved, and converting the weight matrix into at least one convolution kernel;
and performing convolution operation on the matrix to be convolved and the convolution kernel to obtain a plurality of full-connection layer processing results of a plurality of data vectors to be processed of the full-connection layer.
2. The method according to claim 1, wherein the height value of the matrix to be convolved is the same as the width value of the data vector to be processed, and the product of the width value of the matrix to be convolved and the depth value of the matrix to be convolved is equal to the number of the data vectors to be processed.
3. The method of claim 1, wherein the height of the convolution kernel is the same as the width of the weight matrix, and wherein the number of convolution kernels is the same as the height of the weight matrix.
4. The method of claim 2, wherein a column of data in the matrix to be convolved corresponds to one of the vectors of data to be processed.
5. The method of claim 3, wherein the convolution kernels are one-dimensional vectors, and wherein each convolution kernel corresponds to a row of data in the weight matrix.
6. The method according to claim 1, wherein during the convolution operation between the matrix to be convolved and the convolution kernel, the order of multiplying each column of data in the matrix to be convolved by the corresponding convolution kernel is the same as the order of inputting the plurality of data vectors to be processed into the fully-connected layer.
7. The method according to claim 1, wherein, when performing convolution operation on the matrix to be convolved and the convolution kernel, the set sliding step is 1, and the set padding is 0.
8. A readable medium of an electronic device, wherein the readable medium has stored thereon instructions which, when executed on the electronic device, cause the electronic device to perform the data processing method for the neural network model fully connected layer of any one of claims 1 to 7.
9. An electronic device, comprising:
a memory storing instructions for execution by the one or more processors of the electronic device; and one or more processors configured to execute the instructions in the memory to perform the data processing method for the neural network model fully connected layer of any one of claims 1 to 7.
10. A computer program product comprising a computer program/instructions which, when executed by a processor, implements the data processing method for the neural network model fully connected layer of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202111370380.4A CN113902107A (en) | 2021-11-18 | 2021-11-18 | Data processing method, readable medium and electronic device for neural network model full connection layer |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113902107A (en) | 2022-01-07
Family
ID=79194570
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113902107A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114529771A (en) * | 2022-02-23 | 2022-05-24 | 中国石油大学(华东) | Shale large-vision-field image classification method based on machine learning |
CN114529771B (en) * | 2022-02-23 | 2023-04-28 | 中国石油大学(华东) | Shale large-view-area image classification method based on machine learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication ||
SE01 | Entry into force of request for substantive examination ||