CN114600126A - Convolution operation circuit and convolution operation method - Google Patents

Convolution operation circuit and convolution operation method

Info

Publication number
CN114600126A
Authority
CN
China
Prior art keywords
matrix
sub
data
matrix data
circuit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201980101499.6A
Other languages
Chinese (zh)
Inventor
董镇江
李震桁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Publication of CN114600126A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/16 Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Abstract

The present application provides a convolution operation circuit and a convolution operation method in the field of artificial intelligence, which are used to implement a convolution operation on first matrix data and second matrix data and which comprise a splitting circuit and a matrix multiply-accumulate circuit. The splitting circuit is configured to split the first matrix data to obtain first sub-matrix data of the first N/2 dimensions and second sub-matrix data of the last N/2 dimensions, and to split the second matrix data to obtain third sub-matrix data of the first N/2 dimensions and fourth sub-matrix data of the last N/2 dimensions; the first matrix data and the second matrix data are both N-dimensional matrix data, and N is a positive even number. The matrix multiply-accumulate circuit is configured to perform a multiply-accumulate operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data and the fourth sub-matrix data to obtain an operation result. Because the processor splits the convolution of two high-bit-width numbers into three low-bit-width multiplications, the area and power consumption of the processor can be reduced.

Description

Convolution operation circuit and convolution operation method
Technical Field
The present application relates to the field of computer technologies, and in particular, to a convolution operation circuit and a convolution operation method.
Background
In computer technology, a convolutional neural network (CNN) is a multi-layer neural network. Currently, the convolution operation performed by a processor in a convolutional neural network usually converts the convolution of an input signal feature with a weight into a matrix multiplication between a signal matrix and a weight matrix, that is, A×B ([M×K] × [K×N]), where A denotes the signal matrix (input matrix) and B denotes the weight matrix. In general, the A matrix is the input matrix extracted from the input data according to the convolution kernel and stride when the convolution is performed, that is, the input matrix obtained by transforming the input signal features.
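As an illustration of this A×B mapping, a toy im2col transform (hypothetical sizes and helper name; a sketch, not the circuit described below) flattens each convolution window into one row of the signal matrix A, so the whole convolution becomes a single matrix multiplication:

```python
import numpy as np

def im2col(x, k, stride=1):
    """Unfold k x k windows of a 2-D input into rows of a matrix,
    so convolution becomes one matrix multiplication (GEMM)."""
    h, w = x.shape
    out_h = (h - k) // stride + 1
    out_w = (w - k) // stride + 1
    rows = []
    for i in range(0, out_h * stride, stride):
        for j in range(0, out_w * stride, stride):
            rows.append(x[i:i + k, j:j + k].ravel())
    return np.array(rows)            # shape [M, K], M = out_h*out_w, K = k*k

x = np.arange(16.0).reshape(4, 4)    # toy 4x4 input
w = np.ones((3, 3))                  # toy 3x3 convolution kernel
a = im2col(x, 3)                     # signal matrix A: [4, 9]
b = w.ravel()[:, None]               # weight matrix B: [9, 1]
y = a @ b                            # convolution computed as A x B
```

Each entry of `y` equals the sum over one 3x3 window, i.e. the direct convolution output at that position.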
Generally, the input matrix and the weight matrix are high-dimensional matrices, and implementing the convolution operation of two high-dimensional matrices occupies more processor area, which undoubtedly increases the area and power consumption of the processor. Therefore, how to improve the computational efficiency of the processor so as to reduce its area and power consumption is a technical problem that urgently needs to be solved.
Disclosure of Invention
The convolution operation circuit and convolution operation method provided herein split the convolution operation of two high-bit-width numbers into three low-bit-width multiplications, realizing an N-dimensional multiplication through multiplications of two N/2-dimensional operands, so that the operation complexity is reduced from n^2 to n^1.585 and the area and power consumption of the processor can be reduced.
In a first aspect, a convolution operation circuit is provided. The convolution operation circuit includes a splitting circuit and a matrix multiply-accumulate circuit. The splitting circuit is configured to symmetrically split the first matrix data to obtain first sub-matrix data of the first N/2 dimensions and second sub-matrix data of the last N/2 dimensions; here, the first sub-matrix data of the first N/2 dimensions refers to the high-order part of the first matrix data, and the second sub-matrix data of the last N/2 dimensions refers to the low-order part of the first matrix data. The splitting circuit also symmetrically splits the second matrix data to obtain third sub-matrix data of the first N/2 dimensions and fourth sub-matrix data of the last N/2 dimensions; similarly, the third sub-matrix data refers to the high-order part of the second matrix data, and the fourth sub-matrix data refers to the low-order part of the second matrix data. The first matrix data and the second matrix data are both N-dimensional matrix data, and N is a positive even number; here, the matrix data being N-dimensional means that the scale of the first matrix data and the second matrix data is relatively large. The matrix multiply-accumulate circuit is configured to perform a multiply-accumulate operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data and the fourth sub-matrix data to obtain an operation result.
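On a single operand, the symmetric split can be sketched as follows (an unsigned illustration; the patent applies the split element-wise to matrix data, and `split` is a hypothetical helper name):

```python
def split(x, n):
    """Split an n-bit unsigned value into its high N/2 bits (the
    'first sub-matrix data' role) and low N/2 bits (the 'second
    sub-matrix data' role)."""
    assert n % 2 == 0, "N must be a positive even number"
    k = n // 2
    hi = x >> k                 # high-order part: first N/2 dimensions
    lo = x & ((1 << k) - 1)     # low-order part: last N/2 dimensions
    return hi, lo

hi, lo = split(0xAB, 8)         # 0xAB splits into hi = 0xA, lo = 0xB
assert (hi << 4) | lo == 0xAB   # recombination recovers the operand
```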
By implementing this embodiment of the application, when realizing the convolution operation of two high-bit-width numbers, the operation circuit symmetrically splits the two N-dimensional high-bit-width numbers to obtain four corresponding N/2-dimensional low-bit-width numbers, and then performs three multiplications and the accumulation operations on the four N/2-dimensional low-bit-width numbers. Since an N-dimensional multiplication is realized through multiplications of two N/2-dimensional operands, the operation complexity is reduced from n^2 to n^1.585, the computing speed of the processor can be increased, and thus the area and power consumption of the processor can be reduced.
In a possible implementation, the matrix multiply-accumulate circuit may include a matrix multiplication circuit and an accumulation circuit. The matrix multiplication circuit is configured to perform a multiplication according to the first sub-matrix data and the third sub-matrix data to obtain first intermediate data; perform a multiplication according to the second sub-matrix data and the fourth sub-matrix data to obtain second intermediate data; and perform a multiplication according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data and the fourth sub-matrix data to obtain third intermediate data. Here, in each of the three multiplications, the N-dimensional multiplication is realized through multiplications of two N/2-dimensional operands. The accumulation circuit is configured to accumulate the first intermediate data, the second intermediate data and the third intermediate data to obtain the operation result. By implementing this embodiment, when realizing the convolution operation of two high-bit-width numbers, the matrix multiplication circuit and the accumulation circuit realize the N-dimensional multiplication through multiplications of two N/2-dimensional operands, so that the operation complexity is reduced from n^2 to n^1.585, the computing speed of the processor can be increased, and thus the area and power consumption of the processor can be reduced.
In a possible implementation, the matrix multiplication circuit includes a first matrix multiplication circuit, a second matrix multiplication circuit and a third matrix multiplication circuit. The first matrix multiplication circuit is configured to multiply the first sub-matrix data by the third sub-matrix data to obtain first sub-intermediate data, and to shift the first sub-intermediate data left by N bits to obtain the first intermediate data. The second matrix multiplication circuit is configured to multiply the second sub-matrix data by the fourth sub-matrix data to obtain the second intermediate data. The third matrix multiplication circuit is configured to accumulate the first sub-matrix data and the second sub-matrix data to obtain first summation matrix data, accumulate the third sub-matrix data and the fourth sub-matrix data to obtain second summation matrix data, and multiply the first summation matrix data by the second summation matrix data to obtain fourth intermediate data; the first sub-intermediate data, the second intermediate data and the fourth intermediate data are then accumulated and shifted left by N/2 bits to obtain the third intermediate data. Here, the dimension of the first intermediate data, the second intermediate data and the third intermediate data may be N, or N-1. In each of the three multiplications, the N-dimensional multiplication is realized through multiplications of two N/2-dimensional operands.
By implementing this embodiment, when realizing the convolution operation of two high-bit-width numbers, each of the three matrix multiplication circuits realizes an N-dimensional multiplication through multiplications of two N/2-dimensional operands, so that the operation complexity is reduced from n^2 to n^1.585, the computing speed of the processor can be increased, and thus the area and power consumption of the processor can be reduced.
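The three products and two shifts described above follow the classic Karatsuba decomposition. The sketch below assumes that the circuit's combination of the first sub-intermediate, second and fourth intermediate data corresponds to the signed combination p4 - p1 - p2 of standard Karatsuba; the function name and scalar setting are illustrative, not the circuit itself:

```python
def karatsuba_step(a, b, n):
    """One level of Karatsuba: an n-bit product from three n/2-bit products."""
    k = n // 2
    a_hi, a_lo = a >> k, a & ((1 << k) - 1)   # first / second sub-matrix data
    b_hi, b_lo = b >> k, b & ((1 << k) - 1)   # third / fourth sub-matrix data
    p1 = a_hi * b_hi                          # first sub-intermediate data
    p2 = a_lo * b_lo                          # second intermediate data
    p4 = (a_hi + a_lo) * (b_hi + b_lo)        # fourth intermediate data
    p3 = (p4 - p1 - p2) << k                  # third intermediate data, shifted N/2
    return (p1 << n) + p3 + p2                # first intermediate (shifted N) + rest

assert karatsuba_step(0xAB, 0xCD, 8) == 0xAB * 0xCD   # only 3 half-width products
```

Each recursion level replaces one n-bit product with three n/2-bit products, which is where the n^1.585 (= n^log2(3)) complexity comes from.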
In a possible implementation, the splitting circuit is further configured to split first large matrix data and second large matrix data into m first matrix blocks and m second matrix blocks, respectively; the ith matrix block among the m first matrix blocks is used as the first matrix data, and the ith matrix block among the m second matrix blocks is used as the second matrix data; i takes the values 1 to m in turn, yielding m groups of first matrix data and second matrix data. Here, a first matrix block refers to a part of the data in the first large matrix data and, similarly, a second matrix block is a part of the data in the second large matrix data. The matrix multiply-accumulate circuit is configured to perform, for each group of first matrix data and second matrix data, the multiply-accumulate operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data and the fourth sub-matrix data, obtaining m operation intermediate results. That is to say, for the data blocks obtained by splitting the first large matrix data and the second large matrix data, the operation intermediate result corresponding to each group of two data blocks can be obtained, and the m operation intermediate results are then accumulated to obtain the operation result. By implementing this embodiment, when realizing the convolution operation of two large matrix data, the large matrix data can be partitioned into blocks; the operation intermediate result corresponding to each group of first matrix data and second matrix data is then computed by the matrix multiply-accumulate circuit, and the multiple operation intermediate results are accumulated to obtain the operation result.
For any one of the multiplications, the N-dimensional multiplication is realized through multiplications of two N/2-dimensional operands, so the operation complexity is reduced from n^2 to n^1.585, the computing speed of the processor can be increased, and thus the area and power consumption of the processor can be reduced.
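Block partitioning with accumulation of the m operation intermediate results can be sketched as follows (NumPy stands in for the matrix multiply-accumulate circuit; shapes and the helper name are illustrative):

```python
import numpy as np

def blocked_matmul(A, B, m):
    """Split A column-wise and B row-wise into m blocks; each pair of
    i-th blocks yields one operation intermediate result, and the m
    intermediate results are accumulated into the final operation result."""
    a_blocks = np.split(A, m, axis=1)       # m "first matrix blocks"
    b_blocks = np.split(B, m, axis=0)       # m "second matrix blocks"
    acc = np.zeros((A.shape[0], B.shape[1]))
    for i in range(m):                      # i takes the values 1..m in turn
        acc += a_blocks[i] @ b_blocks[i]    # one intermediate result per group
    return acc

A = np.arange(8.0).reshape(2, 4)
B = np.arange(12.0).reshape(4, 3)
assert np.allclose(blocked_matmul(A, B, 2), A @ B)
```

Accumulating the per-block products is exact because matrix multiplication distributes over this inner-dimension partition.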
In a possible implementation, the splitting circuit is further configured to split the first large matrix data and the second large matrix data into m first matrix blocks and m second matrix blocks, respectively, taking the ith matrix block among the m first matrix blocks as a first matrix vector and the ith matrix block among the m second matrix blocks as a second matrix vector; i takes the values 1 to m in turn, yielding m groups of first matrix vectors and second matrix vectors. Here, a first matrix block refers to a part of the data in the first large matrix data and, similarly, a second matrix block is a part of the data in the second large matrix data. The matrix multiply-accumulate circuit is configured to perform, for each group of first matrix vector and second matrix vector, the multiply-accumulate operation according to the first sub-matrix vector, the second sub-matrix vector, the third sub-matrix vector and the fourth sub-matrix vector, obtaining m operation intermediate results. That is to say, for the data blocks obtained by splitting the first large matrix data and the second large matrix data, the operation intermediate result corresponding to each group of two data blocks can be obtained, and the m operation intermediate results are then accumulated to obtain the operation result. By implementing this embodiment, when realizing the convolution operation of two large-scale matrix data, the large-scale matrix data can be partitioned into blocks; the vectors corresponding to each group of first matrix data and second matrix data are extracted, the operation intermediate result corresponding to each group of first matrix vector and second matrix vector is computed by the matrix multiply-accumulate circuit, and the multiple operation intermediate results are accumulated to obtain the operation result.
For any one of the multiplications, the N-dimensional multiplication is realized through multiplications of two N/2-dimensional operands, so the operation complexity is reduced from n^2 to n^1.585, the computing speed of the processor can be increased, and thus the area and power consumption of the processor can be reduced.
In one possible implementation, the first matrix data is a signed number, and the first sub-matrix data and the second sub-matrix data each include a sign bit; that is to say, each of the split first sub-matrix data and second sub-matrix data consists of two parts, one part being the high-order or low-order data and the other being the sign bit. The second matrix data is likewise a signed number, and the third sub-matrix data and the fourth sub-matrix data each include a sign bit; similarly, each of the split third sub-matrix data and fourth sub-matrix data consists of two parts, one part being the high-order or low-order data and the other being the sign bit. The operation circuit further includes an OR-logic judgment circuit connected to the splitting circuit. The OR-logic judgment circuit is configured to perform an add-1 operation on the first sub-matrix data when the non-sign bits of the second sub-matrix data are not all 0; here, the add-1 operation on the first sub-matrix data yields the corresponding binary complement form. Likewise, when the non-sign bits of the fourth sub-matrix data are not all 0, an add-1 operation is performed on the third sub-matrix data; similarly, the add-1 operation on the third sub-matrix data yields the corresponding binary complement form. By implementing this embodiment, when the two high-bit-width numbers participating in the convolution operation are signed numbers, the operation circuit can convert the signed numbers into the corresponding binary complement form through the OR-logic judgment circuit, and then perform the three multiplications and the accumulation operations on the four N/2-dimensional low-bit-width numbers obtained by splitting.
Converting the signed numbers through the OR-logic judgment circuit avoids errors in the convolution operation. In addition, for any one of the multiplications, because the N-dimensional multiplication is realized through multiplications of two N/2-dimensional operands, the operation complexity is reduced from n^2 to n^1.585, the computing speed of the processor can be increased, and thus the area and power consumption of the processor can be reduced.
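One way to read the add-1 correction: when a negative operand is negated half-by-half, the high half must be incremented whenever the low half's non-sign bits are not all 0. The sketch below demonstrates that identity on magnitudes; it is an interpretation under that assumption, not the exact OR-logic circuit, and the helper name is illustrative:

```python
def negate_split(hi, lo, k):
    """Half-by-half two's-complement negation of x = hi*2^k + lo.
    When lo's non-sign bits are not all 0, the high half takes the +1
    correction (like the add-1 the OR-logic judgment circuit applies);
    the low half becomes its complement 2^k - lo."""
    if lo != 0:                          # non-sign bits of low half not all 0
        return -(hi + 1), (1 << k) - lo  # -(hi*2^k + lo) = -(hi+1)*2^k + (2^k - lo)
    return -hi, 0                        # all-zero low half: no correction needed

k = 4
for x in range(1 << (2 * k)):            # exhaustive check over 8-bit magnitudes
    hi, lo = x >> k, x & ((1 << k) - 1)
    nhi, nlo = negate_split(hi, lo, k)
    assert nhi * (1 << k) + nlo == -x    # the two halves recombine to -x
```

Without the +1 on the high half, the recombined value would be off by exactly 2^k whenever the low half is nonzero, which is the error condition the judgment circuit guards against.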
In a second aspect, an embodiment of the present application provides a convolution operation method applied to a convolution operation circuit that includes a splitting circuit and a matrix multiply-accumulate circuit. The method includes the following steps: splitting the first matrix data through the splitting circuit in the convolution operation circuit to obtain first sub-matrix data of the first N/2 dimensions and second sub-matrix data of the last N/2 dimensions; here, the first sub-matrix data of the first N/2 dimensions refers to the high-order part of the first matrix data, and the second sub-matrix data of the last N/2 dimensions refers to the low-order part of the first matrix data; splitting the second matrix data to obtain third sub-matrix data of the first N/2 dimensions and fourth sub-matrix data of the last N/2 dimensions; similarly, the third sub-matrix data refers to the high-order part of the second matrix data, and the fourth sub-matrix data refers to the low-order part of the second matrix data; the first matrix data and the second matrix data are both N-dimensional matrix data, and N is a positive even number; here, the matrix data being N-dimensional means that the scale of the first matrix data and the second matrix data is relatively large; and performing, through the matrix multiply-accumulate circuit in the operation circuit, a multiply-accumulate operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data and the fourth sub-matrix data to obtain an operation result.
By implementing this embodiment, when realizing the convolution operation of two high-bit-width numbers, the two N-dimensional high-bit-width numbers are symmetrically split to obtain four corresponding N/2-dimensional low-bit-width numbers, and three multiplications and the accumulation operations are then performed on the four N/2-dimensional low-bit-width numbers. Since an N-dimensional multiplication is realized through multiplications of two N/2-dimensional operands, the operation complexity is reduced from n^2 to n^1.585, the computing speed of the processor can be increased, and thus the area and power consumption of the processor can be reduced.
In one possible implementation, the matrix multiply-accumulate circuit includes a matrix multiplication circuit and an accumulation circuit, and performing the multiply-accumulate operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data and the fourth sub-matrix data through the matrix multiply-accumulate circuit in the operation circuit to obtain the operation result includes: performing a multiplication according to the first sub-matrix data and the third sub-matrix data through the matrix multiplication circuit to obtain first intermediate data; performing a multiplication according to the second sub-matrix data and the fourth sub-matrix data to obtain second intermediate data; and performing a multiplication according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data and the fourth sub-matrix data to obtain third intermediate data; here, in each of the three multiplications, the N-dimensional multiplication is realized through multiplications of two N/2-dimensional operands; and accumulating the first intermediate data, the second intermediate data and the third intermediate data through the accumulation circuit to obtain the operation result.
In one possible implementation, the matrix multiplication circuit includes a first matrix multiplication circuit, a second matrix multiplication circuit, and a third matrix multiplication circuit; performing multiplication operation according to the first sub-matrix data and the third sub-matrix data through a matrix multiplication circuit to obtain first intermediate data; performing multiplication operation according to the second sub-matrix data and the fourth sub-matrix data to obtain second intermediate data; performing multiplication operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data and the fourth sub-matrix data to obtain third intermediate data, including: performing multiplication operation on the first sub-matrix data and the third sub-matrix data through a first matrix multiplication circuit to obtain first sub-intermediate data, and shifting the first sub-intermediate data to the left by N bits to obtain first intermediate data; performing multiplication operation on the second sub-matrix data and the fourth sub-matrix data through a second matrix multiplication circuit to obtain second intermediate data; performing accumulation operation on the first sub-matrix data and the second sub-matrix data through a third matrix multiplication circuit to obtain first summation matrix data, performing accumulation operation on the third sub-matrix data and the fourth sub-matrix data to obtain second summation matrix data, and performing multiplication operation on the first summation matrix data and the second summation matrix data to obtain fourth intermediate data; and after accumulating the first sub-intermediate data, the second intermediate data and the fourth intermediate data, shifting the data to the left by N/2 bits to obtain third intermediate data. Here, the dimension of the first intermediate data, the second intermediate data, and the third intermediate data may be N, or N-1. 
In each of the three multiplication operations, the multiplication operation of the N dimension is realized by two multiplications of the N/2 dimension.
In one possible implementation, the method further includes: partitioning the first large matrix data and the second large matrix data through the splitting circuit to obtain m first matrix blocks and m second matrix blocks, respectively; the ith matrix block among the m first matrix blocks is used as the first matrix data, and the ith matrix block among the m second matrix blocks is used as the second matrix data; i takes the values 1 to m in turn, yielding m groups of first matrix data and second matrix data. Here, a first matrix block refers to a part of the data in the first large matrix data and, similarly, a second matrix block is a part of the data in the second large matrix data. Performing the multiply-accumulate operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data and the fourth sub-matrix data through the matrix multiply-accumulate circuit in the operation circuit to obtain the operation result includes: performing, through the matrix multiply-accumulate circuit for each group of first matrix data and second matrix data, the multiply-accumulate operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data and the fourth sub-matrix data to obtain m operation intermediate results. That is to say, for the data blocks obtained by splitting the first large matrix data and the second large matrix data, the operation intermediate result corresponding to each group of two data blocks can be obtained, and the m operation intermediate results are then accumulated to obtain the operation result.
In one possible implementation, the method further includes: partitioning the first large matrix data and the second large matrix data through the splitting circuit to obtain m first matrix blocks and m second matrix blocks, respectively; taking the ith matrix block among the m first matrix blocks as a first matrix vector and the ith matrix block among the m second matrix blocks as a second matrix vector; i takes the values 1 to m in turn, yielding m groups of first matrix vectors and second matrix vectors. Here, a first matrix block refers to a part of the data in the first large matrix data and, similarly, a second matrix block is a part of the data in the second large matrix data. Performing the multiply-accumulate operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data and the fourth sub-matrix data through the matrix multiply-accumulate circuit in the operation circuit to obtain the operation result includes: performing, through the matrix multiply-accumulate circuit for each group of first matrix vector and second matrix vector, the multiply-accumulate operation according to the first sub-matrix vector, the second sub-matrix vector, the third sub-matrix vector and the fourth sub-matrix vector to obtain m operation intermediate results. That is to say, for the data blocks obtained by splitting the first large matrix data and the second large matrix data, the operation intermediate result corresponding to each group of two data blocks can be obtained, and the m operation intermediate results are then accumulated to obtain the operation result.
In one possible implementation, the first matrix data is a signed number, and the first sub-matrix data and the second sub-matrix data each include a sign bit; that is to say, each of the split first sub-matrix data and second sub-matrix data consists of two parts, one part being the high-order or low-order data and the other being the sign bit. The second matrix data is likewise a signed number, and the third sub-matrix data and the fourth sub-matrix data each include a sign bit; similarly, each of the split third sub-matrix data and fourth sub-matrix data consists of two parts, one part being the high-order or low-order data and the other being the sign bit. The method further includes: performing an add-1 operation on the first sub-matrix data through an OR-logic judgment circuit when the non-sign bits of the second sub-matrix data are not all 0; and performing an add-1 operation on the third sub-matrix data when the non-sign bits of the fourth sub-matrix data are not all 0. Here, the add-1 operation on the first sub-matrix data yields the corresponding binary complement.
In a third aspect, an embodiment of the present application provides a chip, where the chip includes the convolution operation circuit provided in the first aspect and at least one vector calculation circuit coupled to the convolution operation circuit; the vector calculation circuit is used for calculating other layer network structures in the convolutional neural network according to the operation result to obtain the identification result.
In a fourth aspect, an embodiment of the present application provides a board, where the board includes the chip provided in the third aspect, and at least one memory device and a control device coupled to the chip; the storage device is used for storing the calculation data of the convolutional neural network; and the control device is used for communicating with the chip to realize the operation of the convolutional neural network.
In a fifth aspect, an embodiment of the present application provides an electronic device, where the electronic device includes the board card provided in the fourth aspect.
In a sixth aspect, the present application provides a computer-readable storage medium, in which a computer program is stored, the computer program including program instructions, which, when executed by a processor, cause the processor to perform the method of the second aspect.
Drawings
Fig. 1 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present disclosure;
fig. 2 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present disclosure;
fig. 3 is a schematic diagram of converting a three-dimensional convolution kernel into a two-dimensional convolution kernel by using a GEMM according to an embodiment of the present application;
fig. 4a is a specific implementation scenario of a convolutional neural network according to an embodiment of the present disclosure;
fig. 4b is a specific implementation scenario of another convolutional neural network provided in the embodiment of the present application;
FIG. 5a is a diagram illustrating a hardware structure of an artificial intelligence processor according to an embodiment of the present disclosure;
FIG. 5b is a diagram illustrating a hardware structure of an artificial intelligence processor according to an embodiment of the present disclosure;
fig. 5c is a schematic diagram of a hardware structure of an arithmetic circuit according to an embodiment of the present disclosure;
FIG. 5d is a diagram illustrating a hardware configuration of another artificial intelligence processor according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram of a vector calculation unit in an artificial intelligence processor according to the present application;
fig. 7a is a schematic flowchart of a convolution operation method according to an embodiment of the present application;
fig. 7b is a schematic diagram of a symmetric splitting manner provided in the embodiment of the present application;
FIG. 7c is a diagram illustrating a convolution operation according to an embodiment of the present application;
fig. 7d is a block diagram according to an embodiment of the present disclosure;
fig. 7e is a schematic diagram of splitting signed matrix data according to an embodiment of the present application;
fig. 8a is a schematic structural diagram of a board card provided in the embodiment of the present application;
fig. 8b is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The embodiments of the present application will be described below with reference to the drawings.
The terms "first," "second," "third," and "fourth," etc. in the description and claims of this application and in the accompanying drawings are used for distinguishing between different elements and not for describing a particular sequential order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment can be included in at least one embodiment of the application. The appearances of the phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It is explicitly and implicitly understood by one skilled in the art that the embodiments described herein can be combined with other embodiments.
As used in this specification, the terms "component," "module," "system," and the like are intended to refer to a computer-related entity, either hardware, firmware, a combination of hardware and software, or software in execution. For example, a component may be, but is not limited to being, a process running on a processor, an object, an executable, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device can be a component. One or more components can reside within a process and/or thread of execution and a component can be localized on one computer and/or distributed between 2 or more computers. In addition, these components can execute from various computer readable media having various data structures stored thereon. The components may communicate by way of local and/or remote processes such as in accordance with a signal having one or more data packets (e.g., data from two components interacting with another component in a local system, distributed system, and/or across a network such as the internet with other systems by way of the signal).
In order to facilitate better understanding of the technical solutions described in the present application, technical terms related to the embodiments of the present application are explained below:
(1) a convolutional neural network.
A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure and is a deep learning architecture, where a deep learning architecture refers to performing multiple levels of learning at different levels of abstraction through a machine learning algorithm. As a deep learning architecture, a CNN is a feed-forward artificial neural network in which individual neurons respond to overlapping regions in the image input to the network.
As shown in fig. 1, Convolutional Neural Network (CNN)100 may include an input layer 110, a convolutional/pooling layer 120, where the pooling layer is optional, and a neural network layer 130. This is described in detail below:
convolutional layer/pooling layer 120:
Convolutional layers:
the convolutional layer/pooling layer 120 shown in FIG. 1 may include, for example, 121-126 layers, in one implementation, 121 layers are convolutional layers, 122 layers are pooling layers, 123 layers are convolutional layers, 124 layers are pooling layers, 125 layers are convolutional layers, and 126 layers are pooling layers; in another implementation, 121, 122 are convolutional layers, 123 are pooling layers, 124, 125 are convolutional layers, and 126 are pooling layers. I.e., the output of a convolutional layer may be used as input to a subsequent pooling layer, or may be used as input to another convolutional layer to continue the convolution operation.
Taking convolutional layer 121 as an example, convolutional layer 121 may include a plurality of convolution operators, also called kernels, whose role in image processing is equivalent to a filter that extracts specific information from an input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is usually moved across the input image one pixel at a time (or two pixels at a time, depending on the value of the stride) in the horizontal direction, so as to extract a specific feature from the image. The size of the weight matrix is related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends to the entire depth of the input image during the convolution operation. Thus, convolving with a single weight matrix produces a convolved output with a single depth dimension, but in most cases not a single weight matrix is used; instead, a plurality of weight matrices of the same dimensions are applied, and the output of each weight matrix is stacked to form the depth dimension of the convolved image. Different weight matrices can be used to extract different features of the image: for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract specific colors of the image, and yet another weight matrix is used to blur unwanted noise points in the image. The dimensions of the multiple weight matrices are the same, so the feature maps extracted by the weight matrices of the same dimensions also have the same dimensions, and the extracted feature maps of the same dimensions are combined to form the output of the convolution operation.
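As an illustrative sketch (not part of the patent), the sliding of a single weight matrix over an input image described above can be modelled as follows; the image, kernel, and stride values are hypothetical examples:

```python
# Minimal sketch of a single convolution operator (weight matrix) sliding
# over an input image with a given stride, no padding ("valid" convolution).
def conv2d(image, kernel, stride=1):
    kh, kw = len(kernel), len(kernel[0])
    out_h = (len(image) - kh) // stride + 1
    out_w = (len(image[0]) - kw) // stride + 1
    out = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            acc = 0
            for i in range(kh):
                for j in range(kw):
                    # multiply each pixel in the window by the weight
                    acc += image[r * stride + i][c * stride + j] * kernel[i][j]
            out[r][c] = acc
    return out

# Hypothetical 4x4 image and a 3x3 edge-detection-style weight matrix
image = [[1, 2, 3, 0],
         [4, 5, 6, 1],
         [7, 8, 9, 2],
         [1, 3, 5, 7]]
kernel = [[0, 1, 0],
          [1, -4, 1],
          [0, 1, 0]]
print(conv2d(image, kernel))  # [[0, -6], [-8, -15]]
```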
The weight values in these weight matrices need to be obtained through a large amount of training in practical application, and each weight matrix formed by the trained weight values can extract information from the input image, thereby helping the convolutional neural network 100 to make correct prediction.
When convolutional neural network 100 has multiple convolutional layers, the initial convolutional layers (e.g., 121) tend to extract more general features, which may also be referred to as low-level features. As the depth of the convolutional neural network 100 increases, the features extracted by later convolutional layers (e.g., 126) become more complex, for example features with high-level semantics; features with higher-level semantics are more suitable for the problem to be solved.
A pooling layer:
since it is often necessary to reduce the number of training parameters, pooling layers are often introduced periodically after convolutional layers. In the layers 121-126 illustrated by 120 in fig. 1, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may comprise an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain smaller-size images. The average pooling operator may average the pixel values in the image within a particular range. The max pooling operator may take the pixel with the largest value within a particular range as the max pooling result. In addition, just as the size of the weight matrix used in the convolutional layer is related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
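As an illustrative sketch (not part of the patent), the average and max pooling operators described above can be modelled over non-overlapping windows; the image and window size are hypothetical examples:

```python
# Minimal sketch of average/max pooling over non-overlapping size x size
# windows, downsampling the input image to a smaller image.
def pool2d(image, size=2, mode="max"):
    out_h, out_w = len(image) // size, len(image[0]) // size
    out = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            window = [image[r * size + i][c * size + j]
                      for i in range(size) for j in range(size)]
            # each output pixel is the max or average of its sub-region
            out[r][c] = max(window) if mode == "max" else sum(window) / len(window)
    return out

image = [[1, 3, 2, 4],
         [5, 7, 6, 8],
         [9, 2, 1, 0],
         [4, 6, 3, 5]]
print(pool2d(image, mode="max"))  # [[7, 8], [9, 5]]
print(pool2d(image, mode="avg"))  # [[4.0, 5.0], [5.25, 2.25]]
```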
The neural network layer 130:
after processing by convolutional layer/pooling layer 120, convolutional neural network 100 is not sufficient to output the required output information. Because, as previously described, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. However, to generate the final output information (class information or other relevant information as needed), the convolutional neural network 100 needs to generate one or a set of outputs of the number of classes as needed using the neural network layer 130. Accordingly, a plurality of hidden layers (131, 132 to 13n shown in fig. 1) and an output layer 140 may be included in the neural network layer 130, and parameters included in the hidden layers may be obtained by performing pre-training according to related training data of a specific task type, for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and the like.
After the hidden layers in the neural network layer 130, the last layer of the whole convolutional neural network 100 is the output layer 140. The output layer 140 has a loss function similar to categorical cross-entropy, which is specifically used for calculating the prediction error. Once the forward propagation of the whole convolutional neural network 100 (i.e., the propagation from 110 to 140 in fig. 1) is completed, the backward propagation (i.e., the propagation from 140 to 110 in fig. 1) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
It should be noted that the convolutional neural network 100 shown in fig. 1 is only an example of a convolutional neural network, and in a specific application, the convolutional neural network may also exist in the form of other network models, for example, as shown in fig. 2, a plurality of convolutional layers/pooling layers are parallel, and the features extracted respectively are all input to the neural network layer 130 for processing.
As mentioned above, in a convolutional neural network there are usually a plurality of convolution kernels, and these convolution kernels are often three-dimensional, including data of three dimensions: the x and y directions are the length and width of the data, and the z direction can be regarded as the depth of the data. In practical applications, a three-dimensional convolution kernel can be converted into a two-dimensional one so that the convolution can be computed by General Matrix-Matrix Multiplication (GEMM). Specifically, fig. 3 is a schematic diagram of reducing the dimensions of a three-dimensional convolution kernel by using GEMM according to an embodiment of the present application.
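As an illustrative sketch (not part of the patent), lowering three-dimensional convolution kernels into rows of a two-dimensional weight matrix — so that the convolution becomes a GEMM against similarly flattened input patches — can look as follows; the kernel values and shapes are hypothetical examples:

```python
# Minimal sketch: flatten each depth x height x width convolution kernel
# into one row of a 2D weight matrix, the kernel side of a GEMM lowering.
def flatten_kernels(kernels):
    """kernels: list of 3D kernels (depth x kh x kw) -> 2D weight matrix."""
    return [[v for plane in k for row in plane for v in row] for k in kernels]

# Two hypothetical 2x2x2 kernels -> a 2x8 weight matrix
k0 = [[[1, 0], [0, 1]], [[2, 0], [0, 2]]]
k1 = [[[1, 1], [1, 1]], [[0, 0], [0, 0]]]
W = flatten_kernels([k0, k1])
print(W)  # [[1, 0, 0, 1, 2, 0, 0, 2], [1, 1, 1, 1, 0, 0, 0, 0]]
```

A matching im2col step would flatten every input patch into a column, after which one matrix product yields all output channels at once.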
An application scenario in which the convolutional neural network 100 may be applicable is exemplarily described below.
A first application scenario:
in the embodiment of the present application, the convolutional neural network 100 can be applied to various electronic devices. In one specific implementation scenario, as shown in fig. 4a, smartphones 302 and 304 have built-in processors associated with the convolutional neural network 100. The mobile smartphone client 301 initiates a voice call to the mobile smartphone client 305: the voice signal is sent out via the smartphone 302 and forwarded to the smartphone 304 via the base station 303. Because a sudden rainstorm with strong lightning and thunder occurs when the voice call is initiated, the input signal 306 is severely attenuated and contains much noise. The smartphone 304 is equipped with the convolutional neural network 100, which may be implemented in a chip in the form of a dedicated circuit, or as program instructions running in a Central Processing Unit (CPU) or another processor. The input signal 306 is processed by the convolutional neural network in the smartphone 304, including noise removal and effective signal enhancement, to obtain an output signal 307, which completely retains the voice information transmitted by the calling party and avoids interference of the harsh natural environment with the signal.
A second application scenario:
the embodiment of the present application provides another specific implementation scenario of the convolutional neural network 100. As shown in fig. 4b, a car 403 runs at high speed on a road, and a passerby 401 uses a digital camera 402 to photograph the license plate number of the car 403. Because the car 403 has a high speed v, a motion blur phenomenon occurs in the input signal 404 of the digital camera, which is a two-dimensional digital image signal. The digital camera 402 is equipped with the convolutional neural network 100, which may be implemented in a chip in the form of a dedicated circuit or as a software module running in an image signal processor. After the input signal 404 is processed by the convolutional neural network in the digital camera 402, including car motion model estimation and motion blur removal, an output signal 405 is obtained. The definition of the license plate number information contained in the output signal is improved, so that the license plate can be accurately identified.
As described above, convolutional neural networks widely used in the fields of image recognition, audio recognition, and the like often need to perform a large number of matrix multiplication operations, which require a very high memory bandwidth and a large amount of operations. In order to fully utilize the processing capacity of hardware, in the embodiment of the present application, the convolution operation is optimized based on the Karatsuba algorithm to improve the calculation speed.
In the embodiment of the present application, the Karatsuba algorithm is a fast multiplication algorithm. It reduces the number of single-digit multiplications required to multiply two n-bit numbers from n^2 to n^(log_2 3) ≈ n^1.585, and is therefore faster to compute than the traditional algorithm.
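As an illustrative sketch (not part of the patent), the Karatsuba recursion that replaces one n-bit multiplication with three multiplications of half the width can be written as follows; the base-case threshold and test values are hypothetical:

```python
# Minimal sketch of Karatsuba multiplication: one n-bit product is computed
# from three (n/2)-bit products plus shifts and additions.
def karatsuba(a, b, n):
    """Multiply two n-bit unsigned integers; n assumed to be a power of two."""
    if n <= 8:                                   # small enough: multiply directly
        return a * b
    half = n // 2
    a1, a2 = a >> half, a & ((1 << half) - 1)    # front / rear n/2 bits of a
    b1, b2 = b >> half, b & ((1 << half) - 1)    # front / rear n/2 bits of b
    p1 = karatsuba(a1, b1, half)                 # front x front
    p2 = karatsuba(a2, b2, half)                 # rear x rear
    p3 = karatsuba(a1 + a2, b1 + b2, half)       # sum x sum (the third multiply)
    return (p1 << n) + p2 + ((p3 - p1 - p2) << half)

assert karatsuba(0xDEAD, 0xBEEF, 16) == 0xDEAD * 0xBEEF
```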
In the embodiment of the present application, the theoretical basis for implementing the convolution operation of two high-bit numbers based on the Karatsuba algorithm is as follows: a convolution operation can be understood as multiplying numbers two by two and accumulating the products one by one.
In the embodiment of the present application, the implementation process of performing the convolution operation on two high-bit numbers may be described as follows: based on the Karatsuba algorithm, the convolution of two high-bit numbers is expressed as three low-bit multiplications. The first low-bit multiplication is reflected in: multiplying the front N/2-bit data obtained by splitting each of the two high-bit numbers. The second low-bit multiplication is reflected in: multiplying the rear N/2-bit data obtained by splitting each of the two high-bit numbers. The third low-bit multiplication is reflected in: multiplying the two summation numbers, each summation number being the sum of the front and rear N/2-bit data of one of the two high-bit numbers.
In the embodiment of the present application, the high-order number refers to a type of data whose bit width is greater than a first preset threshold. For example, the bit width of the first matrix data is N (N is 100), and when N is greater than a first preset threshold (e.g., the first preset threshold is 70), the first matrix data may be considered as a high bit number. In contrast, a low bit number refers to a type of data having a bit width smaller than a first predetermined threshold. For example, the bit width of the first sub-matrix data is N/2(N is 100), and when N/2 is smaller than a first preset threshold (e.g., the first preset threshold is 70), the first sub-matrix data may be considered as a low bit number.
In some implementations, the convolution operation based on the Karatsuba algorithm may be as shown in equation (1):

a*b = (a_1 × b_1)<<N + (a_2 × b_2) + [(a_1 + a_2) × (b_1 + b_2) − (a_1 × b_1) − (a_2 × b_2)]<<N/2    (1)

where a_1 and b_1 are the front N/2 bits of a and b, a_2 and b_2 are the rear N/2 bits, "*" indicates a convolution operation, and "×" indicates a multiplication operation.
As can be seen from equation (1), the operation result of the Karatsuba-based convolution operation consists of three parts. The first part can be described as: the front N/2-bit data obtained by splitting each of the two high-bit numbers are multiplied, and the product is shifted left by N bits. The second part can be described as: the rear N/2-bit data obtained by splitting each of the two high-bit numbers are multiplied. The third part can be described as: the summation numbers corresponding to the two high-bit numbers are multiplied, the products obtained in the first and second parts are subtracted from this product, and the difference is shifted left by N/2 bits; the three parts are then summed and accumulated to obtain the operation result.
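As an illustrative sketch (not part of the patent), the three-part decomposition of equation (1) can be checked numerically for one pair of N-bit values; the operand values are hypothetical:

```python
# Minimal sketch: verify the three-part result of equation (1) for N = 16.
N = 16
a, b = 0xA1B2, 0x3C4D
half = N // 2
mask = (1 << half) - 1
a1, a2 = a >> half, a & mask            # front / rear N/2-bit data of a
b1, b2 = b >> half, b & mask            # front / rear N/2-bit data of b
part1 = (a1 * b1) << N                  # first part: front product, shifted left N
part2 = a2 * b2                         # second part: rear product
part3 = ((a1 + a2) * (b1 + b2) - a1 * b1 - a2 * b2) << half  # third part
assert part1 + part2 + part3 == a * b   # the three parts reproduce a x b
```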
In some implementations, the convolution operation based on the Karatsuba algorithm may be as shown in equation (2):
∑ a_i * b_i = ∑ {(a_i1 × b_i1)<<N + (a_i2 × b_i2) + [(a_i1 + a_i2) × (b_i1 + b_i2) − (a_i1 × b_i1) − (a_i2 × b_i2)]<<N/2}

= ∑ (a_i1 × b_i1)<<N + ∑ (a_i2 × b_i2) + ∑ [(a_i1 + a_i2) × (b_i1 + b_i2) − (a_i1 × b_i1) − (a_i2 × b_i2)]<<N/2

= ∑_{16-1} (a_i1 × b_i1)<<N + ∑_{16-1} (a_i2 × b_i2) + ∑_{16-1} [(a_i1 + a_i2) × (b_i1 + b_i2) − (a_i1 × b_i1) − (a_i2 × b_i2)]<<N/2

+ ∑_{16-2} (a_i1 × b_i1)<<N + ∑_{16-2} (a_i2 × b_i2) + ∑_{16-2} [(a_i1 + a_i2) × (b_i1 + b_i2) − (a_i1 × b_i1) − (a_i2 × b_i2)]<<N/2

+ …… + ∑_{16-m} (a_i1 × b_i1)<<N + ∑_{16-m} (a_i2 × b_i2) + ∑_{16-m} [(a_i1 + a_i2) × (b_i1 + b_i2) − (a_i1 × b_i1) − (a_i2 × b_i2)]<<N/2    (2)
here, "+" indicates a convolution operation, and "×" indicates a multiplication operation.
As can be seen from equation (2), the two high-bit numbers (high-bit number 1 and high-bit number 2) are each partitioned into blocks to obtain m sub high-bit numbers 1 and m sub high-bit numbers 2. The first sub high-bit number among the m sub high-bit numbers 1 is convolved with the first sub high-bit number among the m sub high-bit numbers 2, and the result of this Karatsuba-based convolution operation consists of three parts. The first part can be described as: the front N/2-bit data obtained by splitting each of the two sub high-bit numbers are multiplied, and the product is shifted left by N bits. The second part can be described as: the rear N/2-bit data obtained by splitting each of the two sub high-bit numbers are multiplied. The third part can be described as: the summation number corresponding to the first sub high-bit number is multiplied by the summation number corresponding to the second sub high-bit number, the products obtained in the first and second parts are subtracted from this product, and the difference is shifted left by N/2 bits. The three parts are then summed and accumulated to obtain an intermediate operation result for the first block of sub high-bit numbers;
it can be understood that the convolution of the second sub-high bit number in the m sub-high bit number 1 and the second sub-high bit number in the m sub-high bit number 2 can obtain a set of operation intermediate results of the second sub-high bit number. In this case, the operation intermediate results corresponding to the m groups of sub high-order digits are accumulated to obtain the operation result.
In some implementations, based on the theoretical formula shown in equation (2), the operation result may be obtained by extracting the vector corresponding to each sub high-bit number and then performing vector calculation. For example, the two high-bit numbers (high-bit number 1 and high-bit number 2) are each partitioned into m sub high-bit numbers 1 and m sub high-bit numbers 2. The first sub high-bit vector among the m sub high-bit numbers 1 is convolved with the first sub high-bit vector among the m sub high-bit numbers 2, and the result of this Karatsuba-based convolution operation consists of three parts. The first part can be described as: the front N/2-bit data obtained by splitting the vectors corresponding to the two sub high-bit numbers (the first sub high-bit vector and the second sub high-bit vector) are multiplied, and the product is shifted left by N bits. The second part can be described as: the rear N/2-bit data obtained by splitting the two vectors (the first sub high-bit vector and the second sub high-bit vector) are multiplied. The third part can be described as: the summation corresponding to the first sub high-bit vector is multiplied by the summation corresponding to the second sub high-bit vector, the products obtained in the first and second parts are subtracted from this product, and the difference is shifted left by N/2 bits. The three parts are then summed and accumulated to obtain an intermediate operation result for the first group of sub high-bit vectors;
it can be understood that the convolution of the second sub-high-order vector in the m sub-high-order 1 and the second sub-high-order vector in the m sub-high-order 2 can obtain the intermediate result of the operation of the second sub-high-order. In this case, the operation results may be obtained by accumulating the operation intermediate results corresponding to the m sub high-order vectors.
Based on the mathematical computation theory shown in equation (1) or equation (2), fig. 5a is a hardware structure diagram of the chip provided in the embodiment of the present application, which may include an artificial intelligence processor (NPU) 50 used to implement the operation function of the convolutional neural network 100. The artificial intelligence processor in the embodiment of the application can be applied to various devices that can execute matrix multiplication operations, such as mobile phones, tablet computers, servers, and wearable devices.

In the embodiment of the present application, the artificial intelligence processor 50 may be any processor suitable for large-scale operation processing, such as an NPU, a Google Tensor Processing Unit (TPU), or a Graphics Processing Unit (GPU). Taking the NPU as an example: the NPU may be mounted as a coprocessor on a main CPU (host CPU), which assigns tasks to it. The core part of the NPU is the arithmetic circuit 503, through which first matrix data and second matrix data are extracted from the memory, where the first matrix data and the second matrix data are both N-dimensional matrix data. The two pieces of N-dimensional matrix data are symmetrically split into four corresponding pieces of N/2-dimensional matrix data, on which low-bit multiplication and accumulation operations are performed to obtain the operation result.
In some implementations, the arithmetic circuit 503 includes a plurality of processing elements (PEs). In some implementations, the arithmetic circuit 503 may be a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or another electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 503 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. In the embodiment of the present application, the representation of the input buffer 501 may be an input memory 5011 and a weight memory 5012. The arithmetic circuit 503 fetches the weight data of the matrix B from the weight memory 5012 and buffers it on each PE in the arithmetic circuit 503. The arithmetic circuit 503 acquires input data of the matrix a from the input memory 5011, performs matrix arithmetic on the input data of the matrix a and weight data of the matrix B, and stores partial results or final results of the obtained matrix in the output buffer 505.
In this embodiment, the input buffer 501 and the output buffer 505 may be a Random Access Memory (RAM) or a power-down volatile Memory device, such as a Static Random Access Memory (SRAM), a Dynamic Random Access Memory (DRAM), a Synchronous Dynamic Random Access Memory (Synchronous DRAM, SDRAM), a double Data Rate SDRAM (DDR SDRAM), and the like, which is not limited in this embodiment.
In the embodiment of the present application, the weight data is directly transferred to the weight memory 5012 through a Direct Memory Access Controller (DMAC) 501.
The BIU is the Bus Interface Unit, i.e., bus interface 507, which is used for the interaction between the AXI bus, the DMAC, and the instruction fetch buffer 509.

The bus interface 507 (Bus Interface Unit, BIU for short) is used for the instruction fetch memory 506 to fetch instructions from the external memory, and is also used for the memory unit access controller 501 to fetch the original data of the input matrix A or the weight matrix B from the external memory.
The DMAC is primarily used to carry weight data into the weight memory 5012 or input data into the input memory 5011.
The vector calculation circuit 504 includes a plurality of arithmetic processing circuits and, when necessary, further processes the outputs of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, and magnitude comparison. It is mainly used for non-convolution/FC layer network calculation in the neural network, such as pooling (Pooling), batch normalization (Batch Normalization), and local response normalization (Local Response Normalization).
In some implementations, the vector calculation circuit 504 stores the vector of processed outputs to the output buffer 505. For example, the vector calculation circuit 504 may apply a non-linear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation circuit 504 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuitry 503, for example for use in subsequent layers in a neural network.
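As an illustrative sketch (not part of the patent), the vector calculation circuit's post-processing role — applying a nonlinear function to the accumulated outputs of the arithmetic circuit to generate activation values — can be modelled as follows; ReLU is used here as a hypothetical example of the nonlinear function:

```python
# Minimal sketch: apply a nonlinear function (ReLU) element-wise to a
# vector of accumulated values, producing the activation values that are
# written back to the output buffer.
def relu_vector(accumulated):
    return [x if x > 0 else 0 for x in accumulated]

acc = [-3, 0, 5, -1, 8]        # hypothetical accumulated values
print(relu_vector(acc))        # [0, 0, 5, 0, 8]
```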
The operations of the layers in the convolutional neural networks shown in fig. 1 and 2 can be performed by the vector calculation circuit 504.
Fig. 5b is a schematic structural diagram of an artificial intelligence processor 50 according to an embodiment of the present disclosure. The operation circuit 503 may include a splitting circuit 5031 and a matrix multiply-accumulate circuit 5032.
In this embodiment of the application, the splitting circuit 5031 is configured to split the first matrix data to obtain first N/2-dimensional first sub-matrix data and second N/2-dimensional second sub-matrix data; splitting the second matrix data to obtain first N/2-dimensional third sub-matrix data and second N/2-dimensional fourth sub-matrix data; the first matrix data and the second matrix data are both N-dimensional matrix data; and N is a positive even number.
In the embodiment of the present application, the first matrix data may be input matrix data, for example, the input matrix data may be specifically image matrix data. The second matrix data may be weight matrix data.
In the embodiment of the present application, the first matrix data and the second matrix data may be unsigned numbers or signed numbers, and the present application is not limited specifically. Taking the first matrix data as image matrix data as an example, if the image matrix data is matrix data corresponding to an original image which is not subjected to any processing, the image matrix data may be an unsigned number; if the image matrix data is matrix data corresponding to the residual image, the image matrix data may be signed or unsigned, and should be determined in combination with specific practice.
In the embodiment of the application, the first matrix data and the second matrix data are symmetrically split, and an N-dimensional multiplication is realized by N/2-dimensional multiplications, so that the processing speed of the operation circuit can be improved. In general, when the bit width N of a multiplication operation is large, the optimized N/2-dimensional symmetrically split multiplication is faster than the N-dimensional multiplication. In terms of area, an N-dimensional multiplier occupies not 2 times but 3 or even 4 times the area of a simple N/2-dimensional multiplier, and the larger N is, the higher this multiple becomes.
In the embodiment of the present application, the first matrix data and the second matrix data participating in the convolution operation may be stored in the input buffer 501, and specifically, the first matrix data may be stored in the input memory 5011, and the second matrix data may be stored in the weight memory 5012. The splitting circuit 5031 may obtain the first matrix data from the input memory 5011, obtain the second matrix data from the weight memory, and then split the first matrix data and the second matrix data.
A matrix multiply-accumulate circuit 5032 configured to perform a multiply-accumulate operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data, and the fourth sub-matrix data to obtain an operation result. In the embodiment of the application, when the operation circuit realizes convolution operation of two high-digit numbers, the two high-digit numbers of N dimensions are symmetrically split respectively to obtain four corresponding low-digit numbers of N/2 dimensions, and then three times of multiplication operation and accumulation operation are carried out on the four low-digit numbers of N/2 dimensions. Since the multiplication operation of N dimension is realized by the multiplication of two N/2 dimensions, the operation complexity is N2Reduced to n1.585The computing speed of the processor can be increased, and thus the area and power consumption of the processor can be reduced. .
In an embodiment of the application, the matrix multiply-accumulate circuit 5032 may include a matrix multiplication circuit and an accumulation circuit. The matrix multiplication circuit is used to perform a multiplication operation according to the first sub-matrix data and the third sub-matrix data to obtain first intermediate data; perform a multiplication operation according to the second sub-matrix data and the fourth sub-matrix data to obtain second intermediate data; and perform a multiplication operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data, and the fourth sub-matrix data to obtain third intermediate data. The accumulation circuit is used to accumulate the first intermediate data, the second intermediate data, and the third intermediate data to obtain the operation result. In the application, when the convolution operation of two high-bit numbers is implemented, the N-dimensional multiplication can be realized by N/2-dimensional multiplications through the matrix multiplication circuit and the accumulation circuit, and the operation complexity is reduced from n^2 to n^1.585, which can increase the computing speed of the processor and thus reduce the area and power consumption of the processor.
In an embodiment of the application, the matrix multiply-accumulate circuit 5032 can include a plurality of matrix multiplication circuits 5032a (e.g., 3, or an integer multiple of 3) and one or more Karatsuba accumulation circuits 5032b.
In some possible implementations, the matrix multiply-accumulate circuit 5032 can include three matrix multiplication circuits 5032a and one Karatsuba accumulation circuit 5032b; the three matrix multiplication circuits 5032a may be a first matrix multiplication circuit, a second matrix multiplication circuit, and a third matrix multiplication circuit, respectively; and the Karatsuba accumulation circuit 5032b may perform a Karatsuba accumulation operation on the operation results generated by the three matrix multiplication circuits 5032a.
In this embodiment of the application, the first matrix multiplication circuit 5032a is configured to perform a multiplication operation on the first sub-matrix data and the third sub-matrix data to obtain first sub-intermediate data, and shift the first sub-intermediate data left by N bits to obtain first intermediate data;
a second matrix multiplication circuit 5032a configured to perform a multiplication operation on the second sub-matrix data and the fourth sub-matrix data to obtain second intermediate data;
a third matrix multiplication circuit 5032a, configured to perform an accumulation operation on the first sub-matrix data and the second sub-matrix data to obtain first summation matrix data, perform an accumulation operation on the third sub-matrix data and the fourth sub-matrix data to obtain second summation matrix data, and perform a multiplication operation on the first summation matrix data and the second summation matrix data to obtain fourth intermediate data; and, after accumulating the first sub-intermediate data, the second intermediate data, and the fourth intermediate data, shift the result left by N/2 bits to obtain third intermediate data.
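Treating each operand as an N-bit number, the three multiplications and two shifts performed by circuits 5032a follow the standard Karatsuba identity. A minimal Python sketch is below; function and variable names are illustrative rather than from the source, and note that the textbook identity subtracts the first and second products from the cross product, which hardware may realize through complement-based accumulation:

```python
def karatsuba_multiply(a, b, n):
    """Multiply two n-bit unsigned integers using three n/2-bit
    multiplications (Karatsuba). n must be even."""
    half = n // 2
    mask = (1 << half) - 1
    a_hi, a_lo = a >> half, a & mask    # first / second sub-data
    b_hi, b_lo = b >> half, b & mask    # third / fourth sub-data

    z2 = a_hi * b_hi                    # first sub-intermediate data
    z0 = a_lo * b_lo                    # second intermediate data
    z3 = (a_hi + a_lo) * (b_hi + b_lo)  # fourth intermediate data
    # Textbook Karatsuba subtracts z2 and z0 from the cross term;
    # the shifted result corresponds to the third intermediate data.
    z1 = (z3 - z2 - z0) << half

    return (z2 << n) + z0 + z1          # accumulate the three terms
```

Here z2 shifted left by N bits, z0, and z1 play the roles of the first, second, and third intermediate data respectively.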
In a specific implementation, the plurality of matrix multiplication circuits 5032a included in the matrix multiply-accumulate circuit 5032 can perform matrix multiplication independently. Referring to fig. 5c, the matrix multiply-accumulate circuit 5032 of fig. 5c is illustrated as comprising 3 matrix multiplication circuits 5032a. Each matrix multiplication circuit 5032a includes M operation groups composed of operation blocks; each operation group includes K operation blocks, and each operation block includes N sub-operation units. Each sub-operation unit has two inputs, which respectively receive data sent by different memories (such as the input memory 5011 and the weight memory 5012), and multiplies the two inputs. The Karatsuba accumulation circuit 5032b performs a Karatsuba accumulation on the operation results generated by the 3 matrix multiplication circuits 5032a to obtain the final operation result.
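As a rough software analogue of a single operation block (N sub-operation units, each multiplying one input-memory value with one weight-memory value), under the assumption that the block reduces its partial products by summation, i.e. computes a dot product:

```python
def operation_block(input_data, weight_data):
    """One operation block: each of the N sub-operation units multiplies
    an input-memory value by a weight-memory value; the N partial
    products are accumulated into a single result (a dot product)."""
    assert len(input_data) == len(weight_data)
    return sum(x * w for x, w in zip(input_data, weight_data))
```

An operation group would then combine K such blocks, and a matrix multiplication circuit M such groups.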
In some practical applications, taking the first matrix data as the image matrix data and the second matrix data as the weight matrix data as an example, the image matrix data and the weight matrix data may include negative numbers. As shown in fig. 5d, which is a schematic structural diagram of another artificial intelligence processor 50 provided in the embodiment of the present application, the operation circuit 503 may include one or more splitting circuits 5031, one or more OR logic determination circuits 5033, and a plurality of matrix multiply-accumulate circuits 5032. The splitting circuit 5031 is connected to the OR logic determination circuit 5033, and the OR logic determination circuit 5033 is connected to the matrix multiply-accumulate circuit 5032. Specifically, the splitting circuit 5031 is configured to split the first matrix data to obtain first sub-matrix data and second sub-matrix data, and to split the second matrix data to obtain third sub-matrix data and fourth sub-matrix data; the OR logic determination circuit 5033 is configured to process a signed number to obtain the two's complement of the signed number; and the matrix multiply-accumulate circuit 5032 then obtains the two's complement of the signed number and performs a matrix multiply-accumulate operation to obtain an operation result.
In the embodiment of the application, the first matrix data is a signed number, and the first sub-matrix data and the second sub-matrix data obtained by splitting both include sign bits; the second matrix data is a signed number, and the third sub-matrix data and the fourth sub-matrix data obtained by splitting both include sign bits. It is understood that each piece of sub-matrix data obtained by splitting is composed of two parts: one part is the matrix data itself, and the other part is the sign bit.
Then, in this case, the OR logic determination circuit 5033 is configured to: add 1 to the first sub-matrix data in the case that the non-sign bits of the second sub-matrix data are not all 0; and add 1 to the third sub-matrix data in the case that the non-sign bits of the fourth sub-matrix data are not all 0.

It is understood that the OR logic determination circuit 5033 is further configured to: not add 1 to the first sub-matrix data in the case that the non-sign bits of the second sub-matrix data are all 0; and not add 1 to the third sub-matrix data in the case that the non-sign bits of the fourth sub-matrix data are all 0.
In the embodiment of the application, taking the first sub-matrix data as a signed number as an example, performing the add-1 operation on the first sub-matrix data yields the corresponding two's-complement form of the first sub-matrix data, so that the matrix multiply-accumulate circuit 5032 can perform a matrix multiply-accumulate operation on the first sub-matrix data.
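The exact sign-bit encoding and the precise condition under which the OR logic applies the +1 correction depend on the circuit's internal number format, which the text describes only at a high level. As a hedged software analogue (names are illustrative), the identity underlying the split of a two's-complement value into a signed high half and an unsigned low half is:

```python
def split_signed(x, n):
    """Split an n-bit two's-complement value x into a signed high half
    and an unsigned low half with x == hi * 2**(n//2) + lo. The borrow
    bookkeeping that the OR-logic circuit handles with its conditional
    +1 is absorbed here by the arithmetic right shift."""
    half = n // 2
    lo = x & ((1 << half) - 1)   # unsigned low half
    hi = x >> half               # arithmetic shift preserves the sign
    return hi, lo
```

In Python the right shift on negative integers is arithmetic (floor division), so the reconstruction identity holds for every representable value.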
In the embodiment of the present application, fig. 6 shows the structure of the vector calculation unit 504, which is used to generate the normalized value, the combined value, or both. The vector calculation unit 504 mainly includes an activation circuit, a normalization circuit, and a pooling circuit; the vector of processed outputs can be used as activation inputs to the operation circuit 503, for example for use in subsequent layers in a neural network.
an activation circuit 5041, configured to apply a non-linear function, such as the hyperbolic tangent tanh(x), to each accumulated value to generate an activation value;
a normalization circuit 5042 that generates a normalization value from the activation value;
the pooling circuit 5043 applies an aggregation function to the normalized values to generate pooled values, e.g., in some implementations, the aggregation function is a function that returns a maximum, minimum, or average value in the set of normalized values.
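The three circuits above form a pipeline. A minimal sketch of the chain is given below; the normalization scheme shown (dividing by the sum of absolute activations) and the function names are assumptions for illustration, since the source does not specify the normalization formula:

```python
import math

def vector_calc_unit(accumulated_values, pool=max):
    """Sketch of the vector calculation unit 504: activation circuit
    (tanh), normalization circuit, then pooling circuit, whose
    aggregation function may be max, min, or average."""
    activated = [math.tanh(v) for v in accumulated_values]  # 5041
    total = sum(abs(v) for v in activated) or 1.0           # guard against /0
    normalized = [v / total for v in activated]             # 5042 (assumed scheme)
    return pool(normalized)                                 # 5043
```

Passing `pool=min` or a mean function selects one of the other aggregation functions mentioned above.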
In this embodiment, taking the convolutional neural network shown in fig. 1 as an example, the operation circuit 503 is configured to perform the convolution operation in the convolutional layer according to the first matrix data and the second matrix data to extract image feature data of the image/video to be processed, and the vector calculation unit 504 is then configured to perform the operations of the neural network layer according to the image feature data to obtain a recognition result. For example, the recognition result may be an image recognition result or a speech recognition result; the embodiment of the present application is not specifically limited in this regard.
Based on the hardware structure diagram shown in fig. 5a or fig. 5d, the following describes how to implement convolution operation in the embodiment of the present application in combination with the flowchart of the convolution operation method shown in fig. 7a according to the embodiment of the present application.
In the embodiment of the present application, the convolution operation method is applied to a convolution operation circuit, where the operation circuit includes a splitting circuit and a matrix multiply-accumulate circuit, and the method may include, but is not limited to, the following steps:
step S700, splitting the first matrix data through a splitting circuit in a convolution operation circuit to obtain first sub-matrix data of front N/2 dimensions and second sub-matrix data of rear N/2 dimensions; splitting the second matrix data to obtain first N/2-dimensional third sub-matrix data and second N/2-dimensional fourth sub-matrix data; wherein the first matrix data and the second matrix data are both N-dimensional matrix data; and N is a positive even number.
In some implementations, as previously described, the first matrix data can be stored in the input memory 5011 and the second matrix data can be stored in the weight memory 5012. Then, in this case, the splitting circuit may obtain the first matrix data from the input memory 5011 and the second matrix data from the weight memory 5012.
In the embodiment of the present application, if the first matrix data a (e.g., the image matrix data) is an M × K matrix, the element in row i and column j of the first matrix data a may be denoted as a_ij, where i = 1, 2, 3, …, M and j = 1, 2, 3, …, K.
In the embodiment of the present application, if the second matrix data b (e.g., the weight matrix data) is an N × K matrix, the element in row i and column j of the second matrix data b may be denoted as b_ij, where i = 1, 2, 3, …, N and j = 1, 2, 3, …, K.
In the embodiment of the present application, the image matrix data and the weight matrix data may be unsigned numbers or signed numbers; the present application is not specifically limited. Taking the image matrix data as an example, if the image matrix data is matrix data corresponding to an original image that has not undergone any processing, the image matrix data can be an unsigned number; if the image matrix data is matrix data corresponding to a residual image, the image matrix data may be signed or unsigned, which should be determined in combination with specific practice. In the embodiment of the present application, the first matrix data may be split in a symmetric splitting manner to obtain first sub-matrix data of the front N/2 dimensions and second sub-matrix data of the rear N/2 dimensions; that is, the first sub-matrix data is the high-order image matrix data in the image matrix data, and the second sub-matrix data is the low-order image matrix data in the image matrix data.
Similarly, in the embodiment of the present application, the second matrix data may be split in a symmetric split manner to obtain first N/2-dimensional third sub-matrix data and second N/2-dimensional fourth sub-matrix data, that is, the third sub-matrix data is high-order weight matrix data in the weight matrix data, and the fourth sub-matrix data is low-order weight matrix data in the weight matrix data. Specifically, as shown in fig. 7b, the size of the image matrix data is 12 × 12, the size of the weight matrix data is 12 × 12, and the splitting is performed in a symmetric splitting manner, so that the sizes of the first sub-matrix data, the second sub-matrix data, the third sub-matrix data, and the fourth sub-matrix data are all 12 × 6.
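The symmetric split of fig. 7b can be sketched as a column-wise halving (pure Python, illustrative names):

```python
def split_matrix(matrix):
    """Symmetrically split an M x K matrix (K even) into front and rear
    M x K/2 halves, as fig. 7b does for the 12 x 12 image and weight
    matrices (yielding 12 x 6 sub-matrices)."""
    k = len(matrix[0])
    assert k % 2 == 0, "symmetric split requires an even number of columns"
    front = [row[:k // 2] for row in matrix]
    rear = [row[k // 2:] for row in matrix]
    return front, rear
```

Applying this to both the image matrix data and the weight matrix data yields the four sub-matrices consumed by the multiply-accumulate step.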
Step S702, a matrix multiply-accumulate circuit in the operation circuit performs multiply-accumulate operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data, and the fourth sub-matrix data to obtain an operation result.
In some implementations, the matrix multiply-accumulate circuit may include a matrix multiply circuit and an accumulate circuit, and in this case, the step of performing the multiply-accumulate operation by the matrix multiply-accumulate circuit in the operation circuit according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data and the fourth sub-matrix data may include: performing multiplication operation according to the first sub-matrix data and the third sub-matrix data through a matrix multiplication circuit to obtain first intermediate data; performing multiplication operation according to the second sub-matrix data and the fourth sub-matrix data to obtain second intermediate data; performing multiplication operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data and the fourth sub-matrix data to obtain third intermediate data; and accumulating the first intermediate data, the second intermediate data and the third intermediate data through an accumulation circuit to obtain an operation result.
In some implementations, the matrix multiplication circuit includes a first matrix multiplication circuit, a second matrix multiplication circuit, and a third matrix multiplication circuit; in this case, the step of performing the multiply-accumulate operation by the matrix multiply-accumulate circuit according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data, and the fourth sub-matrix data may be as shown in fig. 7c, and specifically may include: performing a multiplication operation on the first sub-matrix data and the third sub-matrix data through the first matrix multiplication circuit to obtain first sub-intermediate data, and shifting the first sub-intermediate data left by N bits to obtain first intermediate data; performing a multiplication operation on the second sub-matrix data and the fourth sub-matrix data through the second matrix multiplication circuit to obtain second intermediate data; performing an accumulation operation on the first sub-matrix data and the second sub-matrix data through the third matrix multiplication circuit to obtain first summation matrix data, performing an accumulation operation on the third sub-matrix data and the fourth sub-matrix data to obtain second summation matrix data, and performing a multiplication operation on the first summation matrix data and the second summation matrix data to obtain fourth intermediate data; accumulating the first sub-intermediate data, the second intermediate data, and the fourth intermediate data, and shifting the result left by N/2 bits to obtain third intermediate data; and accumulating the first intermediate data, the second intermediate data, and the third intermediate data to obtain the operation result.
In the embodiment of the application, the first sub-intermediate data, the second intermediate data, and the fourth intermediate data may be N-dimensional or (N-1)-dimensional, so that an N-dimensional multiplication operation can be realized by multiplying two pieces of N/2-dimensional matrix data.
It should be noted that, in this embodiment of the application, taking the first matrix data and the second matrix data as an example, the first matrix data may be split into first sub-matrix data and second sub-matrix data, and the second matrix data may be split into third sub-matrix data and fourth sub-matrix data. For example, the first matrix data and the second matrix data are both 12 × 12 matrix data, and in this case, the minimum calculation unit may be a multiplication operation of two 12 × 6 matrices. For another example, the first matrix data and the second matrix data are both 24 × 24 matrix data, and in this case, the minimum calculation unit may be a multiplication of two 24 × 12 matrices.
In some implementation manners, considering that in practical applications the image matrix data and the weight matrix data are both large matrix data, the first large matrix data and the second large matrix data may be partitioned to obtain m first matrix blocks and m second matrix blocks; the ith matrix block of the m first matrix blocks is used as first matrix data, and the ith matrix block of the m second matrix blocks is used as second matrix data, with i taking values from 1 to m in sequence to obtain m groups of first matrix data and second matrix data. That is, the convolution operation on the plurality of matrix blocks may be based on the minimum calculation unit described above, and the specific implementation may include: performing an operation for each group of first matrix data and second matrix data sequentially through the matrix multiply-accumulate circuit to obtain m operation intermediate results, and accumulating the m operation intermediate results to obtain the operation result. For example, as shown in fig. 7d, the size of the large image matrix data before partitioning is 24 × 24 and the size of the large weight matrix data is 24 × 24; partitioning them with, for example, m = 2 yields two groups of first matrix data and second matrix data. For example, the first group includes image matrix data 1 and weight matrix data 1, and the second group includes image matrix data 2 and weight matrix data 2.
According to the above description, the operation intermediate result corresponding to the 1st matrix block in the image matrix data and the 1st matrix block in the weight matrix data can be obtained and stored in the output buffer; the operation intermediate result corresponding to the first group of first matrix data and second matrix data stored in the output buffer is then accumulated with the operation intermediate result of the 2nd group of first matrix data and second matrix data, thereby obtaining the operation result.
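The block-and-accumulate flow of fig. 7d can be sketched as follows; scalars stand in for matrix blocks and `min_unit` stands in for the minimum compute unit, both of which are simplifications for illustration:

```python
def blocked_operation(first_blocks, second_blocks, min_unit):
    """Sketch of fig. 7d: run the minimum compute unit once per pair of
    matrix blocks and accumulate the m operation intermediate results
    (as the output buffer does) into the final operation result."""
    result = None
    for a_block, b_block in zip(first_blocks, second_blocks):
        partial = min_unit(a_block, b_block)  # one operation intermediate result
        result = partial if result is None else result + partial
    return result
```

With m = 2 and ordinary multiplication as the stand-in compute unit, the result is the sum of the two blockwise products.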
In some implementation manners, considering that in practical applications the image matrix data and the weight matrix data are both large matrix data, the first large matrix data and the second large matrix data may be partitioned to obtain m first matrix blocks and m second matrix blocks; the ith matrix block of the m first matrix blocks is used as first matrix data, and the ith matrix block of the m second matrix blocks is used as second matrix data, with i taking values from 1 to m in sequence to obtain m groups of first matrix data and second matrix data. That is, the convolution operation on the plurality of matrix blocks may be based on the minimum calculation unit described above. In addition, the operation may be performed by obtaining a vector of each matrix data, and the specific implementation may include: performing one operation for each group of first matrix vector and second matrix vector in sequence through the matrix multiply-accumulate circuit to obtain m operation intermediate results, and then accumulating the m operation intermediate results to obtain the operation result. For example, with m = 2, two groups of first matrix data and second matrix data are obtained. Then, a vector corresponding to the first matrix data and a vector corresponding to the second matrix data are respectively obtained, yielding a first matrix vector and a second matrix vector. According to the above description, 2 operation intermediate results can be obtained by performing one operation for each group of first matrix vector and second matrix vector, and the 2 operation intermediate results are then accumulated to obtain the operation result.
In some implementations, the operation results may be stored in an output buffer.
In the embodiment of the present application, after the convolution operation is performed by the operation circuit, the output buffer temporarily stores the operation result for the following operation (e.g., pooling operation) of the neural network layer, and the like. Specifically, the feature extraction operation for the image may be completed by convolution operation, and then, the extracted features may be input to the neural network layer for processing.
By implementing the embodiment of the application, when a convolution operation on two high-bit-width numbers is realized, the operation circuit symmetrically splits the two N-dimensional high-bit-width numbers respectively to obtain four corresponding N/2-dimensional low-bit-width numbers, and then performs three multiplication operations and accumulation operations on the four N/2-dimensional low-bit-width numbers. Since the N-dimensional multiplication operation is realized by N/2-dimensional multiplications, the operation complexity is reduced from N^2 to N^1.585, which can increase the computing speed of the processor and thus reduce the area and power consumption of the processor.
In some practical applications, the image matrix data and the weight matrix data may be signed numbers or unsigned numbers. In the case of a signed number, the signed number may be processed to obtain the corresponding two's complement for ease of calculation.
In this embodiment of the application, as shown in fig. 7e, taking the second matrix data as an example, and the second matrix data is a signed number, when the second matrix data is split, the third sub-matrix data and the fourth sub-matrix data obtained by splitting both include a sign bit. Then, in this case, the third sub-matrix data may be considered to be composed of two parts, one of which is a sign bit and the other of which is a non-sign bit.
In this embodiment of the application, determining, by the OR logic determination circuit, the two's complement corresponding to the third sub-matrix data and the fourth sub-matrix data may include: the OR logic determination circuit adds 1 to the third sub-matrix data in the case that the non-sign bits of the fourth sub-matrix data are not all 0, so that the third sub-matrix data is expressed in two's-complement form, and the result can be input into the matrix multiply-accumulate circuit to realize the multiply-accumulate operation of the two matrices. Correspondingly, the OR logic determination circuit does not perform the add-1 operation on the third sub-matrix data if the non-sign bits of the fourth sub-matrix data are all 0.
By implementing the embodiment of the application, in the case that the two high-bit-width numbers participating in the convolution operation are signed numbers, the operation circuit can convert the signed numbers into the corresponding two's-complement form through the OR logic determination circuit in the operation circuit, and then perform three multiplication operations and accumulation operations on the four N/2-dimensional low-bit-width numbers obtained by splitting. Converting the signed numbers through the OR logic determination circuit avoids errors in the convolution operation; in addition, since the N-dimensional multiplication operation is realized by N/2-dimensional multiplications, the operation complexity is reduced from N^2 to N^1.585, which can increase the computing speed of the processor and thus reduce the area and power consumption of the processor.
The embodiment of the application further provides a board card, which includes the chip package structure described above. Referring to fig. 8a, fig. 8a provides a board card that may include, in addition to the chip 889, other components including but not limited to: at least one memory device 890, an interface device 891, and a control device 892;
the memory device 890 is connected to the chip in the chip package structure through a bus, and is configured to store data.
In this embodiment, the at least one memory device 890 may store various intermediate operation result data, configuration data, and the like generated by the chip running an algorithm (e.g., a convolutional neural network algorithm).
In particular implementations, the memory device may include multiple groups of memories 893. Each group of memories is connected to the chip through a bus. It is understood that each group of memories may be a random access memory (RAM) or another volatile memory device, such as a static random access memory (SRAM), a dynamic random access memory (DRAM), a synchronous dynamic random access memory (SDRAM), or a double data rate SDRAM (DDR SDRAM), etc.
Taking DDR as an example, DDR can double the speed of SDRAM without increasing the clock frequency: DDR allows data to be read out on both the rising and falling edges of the clock pulse, making it twice as fast as standard SDRAM. In one embodiment, the memory device may include 4 groups of memories, and each group of memories may include a plurality of DDR4 chips. In one embodiment, the chip may internally include four 72-bit DDR4 controllers, where 64 bits of each 72-bit DDR4 controller are used for data transmission and 8 bits are used for ECC checking. It can be understood that when DDR4-3200 chips are adopted in each group of memories, the theoretical bandwidth of data transmission can reach 25600 MB/s.
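The 25600 MB/s figure follows directly from the transfer rate and the data-path width; a back-of-envelope check, using decimal megabytes:

```python
# DDR4-3200 with a 64-bit data path (the 8 ECC bits carry no payload):
transfers_per_second = 3200 * 10**6   # 3200 MT/s
bytes_per_transfer = 64 // 8          # 8 bytes per transfer
bandwidth_mb_per_s = transfers_per_second * bytes_per_transfer // 10**6
# bandwidth_mb_per_s matches the 25600 MB/s theoretical figure above
```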
In one embodiment, each group of memories comprises a plurality of double data rate synchronous dynamic random access memories arranged in parallel. DDR can transfer data twice in one clock cycle. A controller for controlling the DDR is arranged in the chip and is used for controlling the data transmission and data storage of each memory.
The interface device is electrically connected with the chip in the chip package structure and is used for realizing data transmission between the chip and an external device (such as a server or a computer). For example, in one embodiment, the interface device may be a standard PCIE interface: the data to be processed is transmitted to the chip by the server through the standard PCIE interface, so as to implement the data transfer. Preferably, when a PCIE 3.0 x16 interface is adopted for transmission, the theoretical bandwidth can reach 16000 MB/s. In another embodiment, the interface device may also be another interface; the present application does not limit the specific form of the other interface, as long as the interface unit can implement the transfer function. In addition, the operation result of the chip is transmitted back to the external device (e.g., a server) by the interface device.
Wherein the control device is electrically connected with the chip. The control device is used for communicating with the chip and calling the chip to perform the convolutional neural network operation, and can also be used for monitoring the state of the chip. In some implementations, the chip and the control device may be electrically connected through an SPI interface. The control device may include a microcontroller unit (MCU). For example, the chip may include a plurality of processing chips, a plurality of processing cores, or a plurality of processing circuits, and may carry a plurality of loads; therefore, the chip can be in different working states such as multi-load and light-load.
The embodiment of the application further provides electronic equipment which comprises the board card.
The embodiment of the application also provides an electronic device, which can include a data processing device, a robot, a computer, a scanner, a tablet computer, an intelligent terminal, a mobile phone, a driving recorder, a navigator, a sensor, a camera, a cloud server, a video camera, a projector, a watch, an earphone, a mobile storage device, or a wearable device.
As shown in fig. 8b, the electronic device 800 may include: a processor 801, a memory 802, a communication bus 803, and a communication interface 804; the processor 801 is connected to the memory 802 and the communication interface 804 through the communication bus 803.
The processor 801 may be a Central Processing Unit (CPU); the processor 801 may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor 801 may be any conventional processor or the like.
The processor 801 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the data processing method of the present application may be implemented by integrated logic circuits of hardware or instructions in the form of software in the processor 801. The processor 801 may also be a general-purpose processor, a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, which can implement or perform the various methods, steps, and logic blocks disclosed in the embodiments of the present application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as RAM, flash memory, ROM, PROM, EPROM, or registers. The storage medium is located in the memory 802, and the processor 801 reads the information in the memory 802 and completes, in combination with its hardware, the convolution operation in the convolutional neural network to extract image features.
The Memory 802 may be a Read-Only Memory (ROM), a Random Access Memory (RAM), or other Memory. In the embodiment of the present application, the memory 802 is used to store data and various software programs, such as a program of the convolution operation method described in the embodiment of the present application.
The communication interface 804 enables communication between the electronic device 800 and other devices or communication networks using transceiver apparatuses such as, but not limited to, transceivers. For example, the raw data set, the first data set, and the like may be obtained through the communication interface 804 to enable information interaction with a training device, a client device, a user device, or a terminal device.
Optionally, the electronic device may further include an artificial intelligence processor 805, where the artificial intelligence processor 805 may be any processor suitable for large-scale exclusive-OR operation processing, such as a neural network processor (NPU), a Tensor Processing Unit (TPU), or a Graphics Processing Unit (GPU). The artificial intelligence processor 805 may be mounted as a coprocessor to a main CPU (host CPU), which assigns tasks to it. The artificial intelligence processor 805 may implement one or more of the operations involved in the data processing methods described above. For example, taking an NPU as an example, the core portion of the NPU is the operation circuit, and the controller controls the operation circuit to extract matrix data from the memory 802 and perform multiply-add operations.
The processor 801 is configured to call the data and the program codes in the memory, and perform:
splitting the first matrix data to obtain first sub-matrix data of front N/2 dimensions and second sub-matrix data of rear N/2 dimensions; splitting the second matrix data to obtain first N/2-dimensional third sub-matrix data and second N/2-dimensional fourth sub-matrix data; wherein the first matrix data and the second matrix data are both N-dimensional matrix data; n is a positive even number;
and performing multiplication accumulation operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data and the fourth sub-matrix data to obtain an operation result.
The processor 801 performs multiply-accumulate operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data, and the fourth sub-matrix data to obtain an operation result, and the method includes:
performing multiplication operation according to the first sub-matrix data and the third sub-matrix data to obtain first intermediate data; performing multiplication operation according to the second sub-matrix data and the fourth sub-matrix data to obtain second intermediate data; performing multiplication operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data and the fourth sub-matrix data to obtain third intermediate data;
and accumulating the first intermediate data, the second intermediate data and the third intermediate data to obtain an operation result.
The performing, by the processor 801, of a multiplication operation according to the first sub-matrix data and the third sub-matrix data to obtain the first intermediate data, a multiplication operation according to the second sub-matrix data and the fourth sub-matrix data to obtain the second intermediate data, and a multiplication operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data, and the fourth sub-matrix data to obtain the third intermediate data includes:
performing multiplication operation on the first sub-matrix data and the third sub-matrix data to obtain first sub-intermediate data, and shifting the first sub-intermediate data to the left by N bits to obtain first intermediate data;
performing multiplication operation on the second sub-matrix data and the fourth sub-matrix data to obtain second intermediate data;
performing an accumulation operation on the first sub-matrix data and the second sub-matrix data to obtain first summation matrix data, performing an accumulation operation on the third sub-matrix data and the fourth sub-matrix data to obtain second summation matrix data, and performing a multiplication operation on the first summation matrix data and the second summation matrix data to obtain fourth intermediate data; and accumulating the first sub-intermediate data, the second intermediate data, and the fourth intermediate data, and shifting the accumulated result to the left by N/2 bits to obtain the third intermediate data.
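The three intermediate terms above are the Karatsuba decomposition referred to later in this application. The sketch below illustrates the flow for two unsigned N-bit operands; all names are illustrative, and the "accumulation" of the middle term is read in the standard Karatsuba sense, in which the first and second sub-products are subtracted from the cross product before the N/2-bit left shift:

```python
def karatsuba_mul(a, b, n):
    # Split each N-bit operand into an upper and a lower N/2-bit half
    # (the first/second and third/fourth sub-data in the text above).
    half = n // 2
    mask = (1 << half) - 1
    a_hi, a_lo = a >> half, a & mask
    b_hi, b_lo = b >> half, b & mask

    z_hi = a_hi * b_hi                       # first sub-intermediate data
    z_lo = a_lo * b_lo                       # second intermediate data
    z_cross = (a_hi + a_lo) * (b_hi + b_lo)  # fourth intermediate data

    first = z_hi << n                        # first intermediate data
    third = (z_cross - z_hi - z_lo) << half  # third intermediate data
    return first + third + z_lo              # accumulated operation result
```

Only three N/2-bit multiplications are needed in place of four, which is the motivation for the three-multiplier circuit structure.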
Wherein, the processor 801 is further configured to:
partitioning the first large matrix data and the second large matrix data to obtain m first matrix blocks and m second matrix blocks respectively; wherein an ith matrix block of the m first matrix blocks is used as the first matrix data, and an ith matrix block of the m second matrix blocks is used as the second matrix data; wherein, i takes values from 1 to m in sequence to obtain m groups of first matrix data and second matrix data;
the processor 801 performs a multiply-accumulate operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data, and the fourth sub-matrix data to obtain an operation result, including:
for each group of the first matrix data and the second matrix data, performing the multiply-accumulate operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data, and the fourth sub-matrix data to obtain m operation intermediate results; and accumulating the m operation intermediate results to obtain the operation result.
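The blockwise accumulation above can be sketched as follows. A plain element-wise multiply-accumulate stands in for the per-block Karatsuba operation, and all names are illustrative:

```python
def blocked_mac(x, y, m):
    """Partition two equally sized operand sequences into m blocks,
    produce one operation intermediate result per block pair, then
    accumulate the m intermediate results into the final result."""
    assert len(x) == len(y) and len(x) % m == 0
    step = len(x) // m
    intermediates = [
        sum(a * b for a, b in zip(x[i * step:(i + 1) * step],
                                  y[i * step:(i + 1) * step]))
        for i in range(m)
    ]
    return sum(intermediates)  # accumulate the m intermediate results
```

Because addition is associative, the blocked result equals the unblocked multiply-accumulate over the full operands, which is what lets large matrix data be processed block by block.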
Wherein, the processor 801 is further configured to:
partitioning the first large matrix data and the second large matrix data to obtain m first matrix blocks and m second matrix blocks respectively; wherein, the ith matrix block in the m first matrix blocks is used as a first matrix vector, and the ith matrix block in the m second matrix blocks is used as a second matrix vector; the value of i is sequentially taken from 1 to m to obtain m groups of first matrix vectors and second matrix vectors;
the processor 801 performs a multiply-accumulate operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data, and the fourth sub-matrix data to obtain an operation result, including:
for each group of the first matrix vector and the second matrix vector, performing multiplication accumulation operation according to the first sub-matrix vector, the second sub-matrix vector, the third sub-matrix vector and the fourth sub-matrix vector to obtain m operation intermediate results; and accumulating the m operation intermediate results to obtain an operation result.
Wherein the first matrix data is a signed number, and the first sub-matrix data and the second sub-matrix data each include a sign bit; the second matrix data is a signed number, and the third sub-matrix data and the fourth sub-matrix data each include a sign bit; the processor 801 is further configured to perform:
adding 1 to the first sub-matrix data when the non-sign bits of the second sub-matrix data are not all 0; and adding 1 to the third sub-matrix data when the non-sign bits of the fourth sub-matrix data are not all 0.
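The add-1 correction can be illustrated with a two's-complement split: when an N-bit signed value is split and the lower half is itself treated as a signed quantity, the upper half must sometimes be incremented so that the two halves still reconstruct the original value. The sketch below is written under that assumption; the patent states its trigger in terms of the non-sign bits of the lower sub-data, whereas the analogous textbook condition used here is that the low half is negative:

```python
def split_signed(a, n):
    """Split a signed n-bit value into halves (hi, lo) such that
    a == hi * 2**(n//2) + lo, with lo interpreted as a signed
    (n//2)-bit two's-complement value. When lo comes out negative,
    hi is one larger than the plain arithmetic shift -- the add-1
    correction applied by the OR-logic judgment circuit."""
    half = n // 2
    lo = a & ((1 << half) - 1)
    if lo >= 1 << (half - 1):   # low half is negative as a signed value
        lo -= 1 << half
    hi = (a - lo) >> half       # equals (a >> half) + 1 whenever lo < 0
    return hi, lo
```

With both halves signed, the three Karatsuba sub-products can be formed directly on signed sub-data, which is why each sub-matrix above carries its own sign bit.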
It should be understood that the implementation of each device may also refer to the corresponding description in the foregoing method embodiments; details are not repeated in the embodiments of the present application.
Embodiments of the present application further provide a computer-readable storage medium, where the computer-readable storage medium stores a computer program for implementing convolution of two high-bit-width numbers based on the Karatsuba algorithm, and the computer program enables an electronic device to perform some or all of the steps of any one of the convolution operation methods described in the above method embodiments.
Embodiments of the present application also provide a computer program product comprising a non-transitory computer-readable storage medium storing a computer program operable to cause an electronic device to perform some or all of the steps of any one of the convolution operation methods described in the above method embodiments.
It is to be understood that one of ordinary skill in the art would recognize that the units and algorithm steps of the various examples described in connection with the embodiments disclosed herein can be implemented as electronic hardware, or as combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Those of skill would appreciate that the functions described in connection with the various illustrative logical blocks, modules, and algorithm steps disclosed herein may be implemented as hardware, software, firmware, or any combination thereof. If implemented in software, the functions described in the various illustrative logical blocks, modules, and steps may be stored on or transmitted over a computer-readable medium as one or more instructions or code and executed by a hardware-based processing unit. The computer-readable medium may include a computer-readable storage medium, which corresponds to a tangible medium such as a data storage medium, or any communication medium that facilitates transfer of a computer program from one place to another (e.g., according to a communication protocol). In this manner, a computer-readable medium may generally correspond to (1) a non-transitory tangible computer-readable storage medium, or (2) a communication medium such as a signal or carrier wave. A data storage medium may be any available medium that can be accessed by one or more computers or one or more processors to retrieve instructions, code, and/or data structures for implementing the techniques described herein. The computer program product may include a computer-readable medium.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the units is merely a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the shown or discussed mutual coupling, direct coupling, or communication connection may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in an electrical, mechanical, or other form.
The units described as separate parts may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the part of the technical solution of the present application that substantially contributes to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods according to the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (16)

  1. A convolution operation circuit, configured to implement a convolution operation of first matrix data and second matrix data, characterized by comprising a splitting circuit and a matrix multiply-accumulate circuit; wherein:
    the splitting circuit is configured to split the first matrix data to obtain first sub-matrix data of the front N/2 dimensions and second sub-matrix data of the rear N/2 dimensions, and split the second matrix data to obtain third sub-matrix data of the front N/2 dimensions and fourth sub-matrix data of the rear N/2 dimensions; wherein the first matrix data and the second matrix data are both N-dimensional matrix data, and N is a positive even number; and
    the matrix multiply-accumulate circuit is configured to perform a multiply-accumulate operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data, and the fourth sub-matrix data to obtain an operation result.
  2. The arithmetic circuit of claim 1, wherein the matrix multiply-accumulate circuit comprises a matrix multiplication circuit and an accumulation circuit; wherein:
    the matrix multiplication circuit is used for executing multiplication operation according to the first sub-matrix data and the third sub-matrix data to obtain first intermediate data; performing multiplication operation according to the second sub-matrix data and the fourth sub-matrix data to obtain second intermediate data; performing multiplication operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data and the fourth sub-matrix data to obtain third intermediate data;
    the accumulation circuit is configured to accumulate the first intermediate data, the second intermediate data, and the third intermediate data to obtain an operation result.
  3. The arithmetic circuit of claim 2, wherein the matrix multiplication circuit comprises a first matrix multiplication circuit, a second matrix multiplication circuit, and a third matrix multiplication circuit; wherein:
    the first matrix multiplication circuit is configured to perform multiplication operation on the first sub-matrix data and the third sub-matrix data to obtain first sub-intermediate data, and shift the first sub-intermediate data by N bits to the left to obtain the first intermediate data;
    the second matrix multiplication circuit is configured to perform multiplication operation on the second sub-matrix data and the fourth sub-matrix data to obtain second intermediate data;
    the third matrix multiplication circuit is configured to perform an accumulation operation on the first sub-matrix data and the second sub-matrix data to obtain first summation matrix data, perform an accumulation operation on the third sub-matrix data and the fourth sub-matrix data to obtain second summation matrix data, and perform a multiplication operation on the first summation matrix data and the second summation matrix data to obtain fourth intermediate data; and accumulate the first sub-intermediate data, the second intermediate data, and the fourth intermediate data, and shift the accumulated result to the left by N/2 bits to obtain the third intermediate data.
  4. The operational circuit of any of claims 1-3,
    the splitting circuit is further configured to split the first large matrix data and the second large matrix data into blocks to obtain m first matrix blocks and m second matrix blocks, respectively; wherein an ith matrix block of the m first matrix blocks is used as the first matrix data, and an ith matrix block of the m second matrix blocks is used as the second matrix data; wherein, i takes values from 1 to m in sequence to obtain m groups of first matrix data and second matrix data;
    the matrix multiply-accumulate circuit is configured to perform multiply-accumulate operation on the first sub-matrix data, the second sub-matrix data, the third sub-matrix data, and the fourth sub-matrix data for each group of the first matrix data and the second matrix data to obtain m operation intermediate results; and accumulating the m operation intermediate results to obtain an operation result.
  5. The operational circuit of any of claims 1-3,
    the splitting circuit is further configured to split the first large matrix data and the second large matrix data into blocks to obtain m first matrix blocks and m second matrix blocks, respectively; wherein an ith matrix block of the m first matrix blocks is used as a first matrix vector, and an ith matrix block of the m second matrix blocks is used as a second matrix vector; the value of i is sequentially taken from 1 to m to obtain m groups of first matrix vectors and second matrix vectors;
    the matrix multiply-accumulate circuit is configured to perform multiply-accumulate operation according to the first sub-matrix vector, the second sub-matrix vector, the third sub-matrix vector, and the fourth sub-matrix vector for each group of the first matrix vector and the second matrix vector to obtain m operation intermediate results; and accumulating the m operation intermediate results to obtain an operation result.
  6. The arithmetic circuit of any one of claims 1-5, wherein the first matrix data is a signed number, and the first sub-matrix data and the second sub-matrix data each include a sign bit; the second matrix data is a signed number, and the third sub-matrix data and the fourth sub-matrix data each include a sign bit; and the arithmetic circuit further comprises an OR-logic judgment circuit connected to the splitting circuit; wherein:
    the OR-logic judgment circuit is configured to add 1 to the first sub-matrix data when the non-sign bits of the second sub-matrix data are not all 0, and add 1 to the third sub-matrix data when the non-sign bits of the fourth sub-matrix data are not all 0.
  7. A convolution operation method for performing a convolution operation of first matrix data and second matrix data, comprising:
    splitting, by a splitting circuit in a convolution operation circuit, the first matrix data to obtain first sub-matrix data of the front N/2 dimensions and second sub-matrix data of the rear N/2 dimensions, and splitting the second matrix data to obtain third sub-matrix data of the front N/2 dimensions and fourth sub-matrix data of the rear N/2 dimensions; wherein the first matrix data and the second matrix data are both N-dimensional matrix data, and N is a positive even number; and
    and performing multiplication and accumulation operation by a matrix multiplication and accumulation circuit in the operation circuit according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data and the fourth sub-matrix data to obtain an operation result.
  8. The method of claim 7, wherein the matrix multiply accumulate circuit comprises a matrix multiply circuit and an accumulate circuit; the obtaining of an operation result by a matrix multiply-accumulate circuit in the operation circuit according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data, and the fourth sub-matrix data includes:
    executing multiplication operation according to the first sub-matrix data and the third sub-matrix data through the matrix multiplication circuit to obtain first intermediate data; performing multiplication operation according to the second sub-matrix data and the fourth sub-matrix data to obtain second intermediate data; performing multiplication operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data and the fourth sub-matrix data to obtain third intermediate data;
    and accumulating the first intermediate data, the second intermediate data and the third intermediate data through the accumulation circuit to obtain an operation result.
  9. The method of claim 8, wherein the matrix multiplication circuit comprises a first matrix multiplication circuit, a second matrix multiplication circuit, and a third matrix multiplication circuit; performing multiplication operation according to the first sub-matrix data and the third sub-matrix data by the matrix multiplication circuit to obtain first intermediate data; performing multiplication operation according to the second sub-matrix data and the fourth sub-matrix data to obtain second intermediate data; performing a multiplication operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data, and the fourth sub-matrix data to obtain third intermediate data, including:
    performing multiplication operation on the first sub-matrix data and the third sub-matrix data through the first matrix multiplication circuit to obtain first sub-intermediate data, and performing left shift on the first sub-intermediate data by N bits to obtain first intermediate data;
    performing, by the second matrix multiplication circuit, a multiplication operation on the second sub-matrix data and the fourth sub-matrix data to obtain second intermediate data;
    performing, by the third matrix multiplication circuit, an accumulation operation on the first sub-matrix data and the second sub-matrix data to obtain first summation matrix data, performing an accumulation operation on the third sub-matrix data and the fourth sub-matrix data to obtain second summation matrix data, and performing a multiplication operation on the first summation matrix data and the second summation matrix data to obtain fourth intermediate data; and accumulating the first sub-intermediate data, the second intermediate data, and the fourth intermediate data, and shifting the accumulated result to the left by N/2 bits to obtain the third intermediate data.
  10. The method of any one of claims 7-9, further comprising:
    partitioning the first large matrix data and the second large matrix data through the splitting circuit to obtain m first matrix blocks and m second matrix blocks respectively; wherein an ith matrix block of the m first matrix blocks is used as the first matrix data, and an ith matrix block of the m second matrix blocks is used as the second matrix data; wherein, i takes values from 1 to m in sequence to obtain m groups of first matrix data and second matrix data;
    the obtaining of an operation result by a matrix multiply-accumulate circuit in the operation circuit according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data, and the fourth sub-matrix data includes:
    executing, by the matrix multiply-accumulate circuit, for each set of the first matrix data and the second matrix data, the multiply-accumulate operation according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data, and the fourth sub-matrix data to obtain m operation intermediate results; and accumulating the m operation intermediate results to obtain an operation result.
  11. The method of any one of claims 7-9, further comprising:
    partitioning the first large matrix data and the second large matrix data through the splitting circuit to obtain m first matrix blocks and m second matrix blocks respectively; wherein an ith matrix block of the m first matrix blocks is used as a first matrix vector, and an ith matrix block of the m second matrix blocks is used as a second matrix vector; the value of i is sequentially taken from 1 to m to obtain m groups of first matrix vectors and second matrix vectors;
    the obtaining of an operation result by a matrix multiply-accumulate circuit in the operation circuit according to the first sub-matrix data, the second sub-matrix data, the third sub-matrix data, and the fourth sub-matrix data includes:
    performing, by the matrix multiply-accumulate circuit, multiply-accumulate operation according to the first sub-matrix vector, the second sub-matrix vector, the third sub-matrix vector, and the fourth sub-matrix vector for each group of the first matrix vector and the second matrix vector to obtain m operation intermediate results; and accumulating the m operation intermediate results to obtain an operation result.
  12. The method of any one of claims 7-11, wherein the first matrix data is a signed number; the first sub-matrix data and the second sub-matrix data each include a sign bit; the second matrix data is a signed number; the third sub-matrix data and the fourth sub-matrix data each include a sign bit; and the method further comprises:
    adding 1 to the first sub-matrix data by an OR-logic judgment circuit when the non-sign bits of the second sub-matrix data are not all 0; and adding 1 to the third sub-matrix data when the non-sign bits of the fourth sub-matrix data are not all 0.
  13. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program comprising program instructions which, when executed by a processor, cause the processor to carry out the steps of the method according to any one of claims 7-12.
  14. A chip comprising the convolution operation circuit of any one of claims 1-6 and at least one vector calculation circuit coupled to the operation circuit;
    the at least one vector calculation circuit is configured to calculate other network layers in the convolutional neural network according to the operation result to obtain a recognition result.
  15. A board comprising the chip of claim 14 and at least one of a memory device and a control device coupled to the chip;
    the at least one storage device is used for storing the calculation data of the convolutional neural network;
    and the control device is used for communicating with the chip to realize the operation of the convolutional neural network.
  16. An electronic device, characterized in that it comprises the board of claim 15.
CN201980101499.6A 2019-10-30 2019-10-30 Convolution operation circuit and convolution operation method Pending CN114600126A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/114504 WO2021081854A1 (en) 2019-10-30 2019-10-30 Convolution operation circuit and convolution operation method

Publications (1)

Publication Number Publication Date
CN114600126A true CN114600126A (en) 2022-06-07

Family

ID=75715705

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201980101499.6A Pending CN114600126A (en) 2019-10-30 2019-10-30 Convolution operation circuit and convolution operation method

Country Status (2)

Country Link
CN (1) CN114600126A (en)
WO (1) WO2021081854A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113434814B (en) * 2021-06-26 2023-08-25 上海寒武纪信息科技有限公司 Matrix multiplication operation method based on neural network and related device
CN113792867A (en) * 2021-09-10 2021-12-14 中科寒武纪科技股份有限公司 Arithmetic circuit, chip and board card

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN107862378B (en) * 2017-12-06 2020-04-24 芯原微电子(上海)股份有限公司 Multi-core-based convolutional neural network acceleration method and system, storage medium and terminal
CN111859273A (en) * 2017-12-29 2020-10-30 华为技术有限公司 Matrix multiplier
US20190318227A1 (en) * 2018-04-13 2019-10-17 Fabula Al Limited Recommendation system and method for estimating the elements of a multi-dimensional tensor on geometric domains from partial observations
CN109543816B (en) * 2018-09-20 2022-12-06 中国科学院计算技术研究所 Convolutional neural network calculation method and system based on weight kneading
CN109522052B (en) * 2018-11-27 2020-05-08 中科寒武纪科技股份有限公司 Computing device and board card

Also Published As

Publication number Publication date
WO2021081854A1 (en) 2021-05-06


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination