WO2022110386A1

WO2022110386A1 - Data processing method and artificial intelligence processor

Info

Publication number: WO2022110386A1
Application number: PCT/CN2020/137453
Authority: WO
Inventors: 裴京; 施路平; 徐明坤; 王冠睿; 马骋
Original assignee: 清华大学
Priority date: 2020-11-30
Filing date: 2020-12-18
Publication date: 2022-06-02
Also published as: CN112395092B; CN112395092A

Abstract

A data processing method and an artificial intelligence processor. The method comprises: reading first pixel data from a storage unit according to a preset pixel reading bit width; during a T-th operation of a Ky-th row of k convolution kernels, reading first weight data from the storage unit according to a preset weight reading bit width, wherein the first weight data comprises weight data at an M-th channel, the Ky-th row, and a convolution kernel position T of the K convolution kernels; selecting, from the first pixel data and according to the stride Sx of the convolution kernels, "a" pieces of pixel data corresponding to the convolution kernel position T as second pixel data; and when T>1, for a q-th column of MACs in a MAC array, multiplying the second pixel data by q-th weight data in the first weight data, and adding same to the result of the (t-1)-th operation to obtain "a" first convolution operation results of the q-th column of MACs in the T-th operation. The data processing method can effectively improve the efficiency of convolution operation.

Description

Data processing method and artificial intelligence processor

technical field

The present disclosure relates to the field of computer technology, and in particular, to a data processing method and an artificial intelligence processor.

Background technique

Neuromorphic chips are important platforms for realizing biologically interpretable brain-like algorithms such as spiking neural networks based on brain-like computing. Among them, the convolution operation is one of the important logical operations for the realization of artificial neural network by neuromorphic chips based on many-core architecture.

How to implement efficient convolution operations based on neuromorphic chips is the key to improving the computing efficiency of neuromorphic chips.

SUMMARY OF THE INVENTION

In view of this, the present disclosure proposes a data processing method and an artificial intelligence processor to efficiently implement convolution operations.

According to an aspect of the present disclosure, a data processing method is provided, which is applied to a processing core of an artificial intelligence processor, the artificial intelligence processor includes a plurality of processing cores, each processing core includes a storage unit and an arithmetic unit, the The storage unit is used to store the pixel data of the image and the weight data of the N convolution kernels; the operation unit includes a multiplier-accumulator MAC array for performing operations according to the pixel data and the weight data, wherein the The size of the image is W ₀ ×H ₀ ×C ₀ , the size of the convolution kernel is K×K×C ₀ , the stride in the row direction is Sx, and W ₀ , H ₀ , C ₀ , K, and Sx are positive Integer, the method includes: reading first pixel data from the storage unit according to a preset pixel read bit width, where the first pixel data includes the mth channel of the image, the continuous line of the Pyth line M pixel data, 1≤m≤C ₀ , 1≤Py≤H ₀ , 1<M≤W ₀ ; in the T-th operation of the Ky-th row of the k convolution kernels, read according to the preset weight Take the bit width, read the first weight data from the storage unit, and the first weight data includes the mth channel of the k convolution kernels, the Kyth row, and the weight data at the position T of the convolution kernel, 1< k≤N, 1≤T≤K, 1≤Ky≤K; according to the step size Sx of the convolution kernel, select a pixel data corresponding to the convolution kernel position T from the first pixel data as the first pixel data Two pixel data, 1<a<M; when T>1, for the qth column MAC in the MAC array, the second pixel data is compared with the qth weight data in the first weight data Multiply and add with the result of the T-1th operation to obtain a first convolution operation result of the Tth operation of the MAC in the qth column, 1≤q≤k.

In a possible implementation manner, the method further includes: when T=1, for the qth column MAC, comparing the second pixel data with the qth weight data in the first weight data Multiply and add with the convolution operation result of the Kth operation in the Ky-1th row to obtain a first convolution operation results of the qth column MAC in the first operation, 1≤q≤k.

In a possible implementation manner, the method further includes: for the qth column MAC, after completing the operations of the K rows of the k convolution kernels, obtain the ath channel of the mth channel. Two convolution operation results; after the convolution operation results of the C ₀ channels are obtained, the convolution operation results of the C ₀ channels of each convolution kernel are added to obtain a target volume output by the qth column MAC. Product operation result.

In a possible implementation manner, the method further includes: storing the weight data of the N convolution kernels according to the weight storage bit width, wherein the weight storage bit width is consistent with the weight read bit width ; Described storing the weight data of the N convolution kernels according to the weight storage bit width, including: for each convolution kernel in the N convolution kernels, sequentially according to the row direction, column direction of the convolution kernel direction and the order of channel C ₀ , the weight data of the convolution kernel is vertically arranged into a first weight vector; the first weight vectors of the N convolution kernels are horizontally aligned and merged into a first weight matrix; according to the weight The bit width is stored, and the weight data in the first weight matrix is horizontally stored.

In a possible implementation manner, the storing the weight data in the first weight matrix horizontally according to the weight storage bit width includes: when N is greater than the column number Q of the MAC array, storing the weight data according to each Q The first weight matrix is split vertically by column to obtain F second weight matrices, where,

In the case that the width of the second weight matrix is less than or equal to the weight storage bit width, store the weight data in the f-th second weight matrix in the order of row direction and column direction, 1≤f≤F ; Arrange the f-1th second weight matrix before the fth second weight matrix; wherein, the width of the second weight matrix is equal to Q multiplied by the first storage unit of the weight data, the weight data The first storage unit of is determined according to the data type of the weight data.

In a possible implementation manner, the method further includes: in the case that the width of the second weight matrix is greater than the weight storage bit width, for the f-th second weight matrix, splitting vertically according to each weight storage bit width Divide the f-th second weight matrix to obtain F ₀ third weight matrices, where,

Store the weight data in the f _0th third weight matrix in the order of the row direction and the column direction in turn, 1≤f ₀ ≤F ₀ ; arrange the f ₀ -1th third weight matrix in the f _0th Before the triple weight matrix.

In a possible implementation manner, the method further includes: storing pixel data of the image according to a pixel storage bit width, wherein the pixel storage bit width is consistent with the pixel read bit width; Pixel storage bit width, storing the pixel data of the image, including: dividing the pixel data of the mth channel and the Pyth row of the image into B first storage vectors according to each consecutive b pixel data, B is equal to the result of dividing W ₀ by b and rounded up, 1≤b≤W ₀ ; for each first storage vector, the first storage vector is divided into E second storage vectors according to each b bytes, so The b byte is less than or equal to the pixel storage bit width; according to the pixel storage bit width, the E second storage vectors are sequentially stored, and the address space less than the weight storage bit width is filled with 0; the mth storage vector is sequentially stored Channel, pixel data of line Py.

In a possible implementation manner, reading the first weight data from the storage unit according to the preset weight reading bit width includes: when T=1, determining the mth channel of the k convolution kernels , the row L of the weight data at row Ky, the convolution kernel position T in the target weight matrix; read the bit width according to the weight, read the Lth row of the target weight matrix from the storage unit The weight data is taken as the first weight data read from the storage unit; when 1<T≤K, read the bit width according to the preset weight, read the target weight matrix from the storage unit The weight data of the L+T-1th row is taken as the first weight data read from the storage unit; wherein, the target weight matrix includes the second weight matrix or the third weight matrix.

In a possible implementation manner, according to the step size Sx of the convolution kernel, selecting a pixel data corresponding to the position T of the convolution kernel from the first pixel data as the second pixel data, including: When T=1, select a pixel data at intervals of the step size Sx from the first pixel data as the second pixel data, and the second pixel data includes the mth channel and the Pyth line of the image. Pixel data at X[0], X[Sx], X[2Sx], X[3Sx], ..., X[(a-1)Sx]; when 1<T≤K, according to the convolution Expansion rate Ex of the kernel, X[(T-1)Ex], X[Sx+(T-1)Ex], X[Sx+(T-1)Ex], X[ 2Sx+(T-1)Ex], X[3Sx+(T-1)Ex], ..., pixel data at X[(a-1)Sx+(T-1)Ex] are used as the second pixel data.

In a possible implementation manner, after obtaining a target convolution operation result output by the MAC in the qth column, the method further includes: according to the mth channel of the image, the Xth [aSx in the Pyth row ] the first storage vector corresponding to the pixel data at, determine the first storage vector corresponding to the pixel data at the Xth [aSx] in the Pyth row corresponds to the first storage start address in the storage unit; according to The preset pixel read bit width and the first storage start address, the third pixel data is read from the storage unit, and the third pixel data includes reading from the first storage start address. The acquired consecutive M pieces of pixel data, so that the operation unit can continue the operation.

In a possible implementation manner, the method further includes: after completing the convolution operation between the k convolution kernels and the K rows of pixel data, determine the difference between The second storage start address of the first pixel data of the first row interval Sy-1 row of the K rows of pixel data; according to the preset pixel read bit width and the second storage start address, from the The storage unit reads fourth pixel data, where the fourth pixel data includes consecutive M pieces of pixel data read from the second storage start address, so that the operation unit continues to operate.

In a possible implementation manner, the multiplier-accumulator MAC array includes an array based on a crossbar matrix structure; the operation unit further includes at least one buffer module, the buffer module is configured to The read bit width reads pixel data from the storage unit, and reads the weight data from the storage unit according to a preset weight read bit width.

According to another aspect of the present disclosure, an artificial intelligence processor is provided, the artificial intelligence processor includes a plurality of processing cores, each processing core includes a storage unit and an operation unit, the storage unit is used for storing pixels of an image data and the weight data of the N convolution kernels; the operation unit includes a multiplier-accumulator MAC array for performing operations according to the pixel data and the weight data, wherein the processing core passes any one of the The data processing method performs the convolution operation.

In the embodiment of the present disclosure, by reading the first pixel data and the first weight data, according to the step size Sx of the convolution kernel, a pixel data corresponding to the position T of the convolution kernel is selected from the first pixel data as the second Pixel data, for the qth column MAC in the MAC array, multiply the second pixel data by the qth weight data in the first weight data, and add the result of the T-1th operation to obtain the qth column The a first convolution operation result of the T-th operation of MAC can realize multi-point parallel convolution operation between multiple pixel data and weight data corresponding to multiple convolution kernels in each operation, so as to improve the volume It can improve the operation efficiency of artificial intelligence processors.

Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.

Description of drawings

The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate exemplary embodiments, features, and aspects of the disclosure, and together with the description, serve to explain the principles of the disclosure.

1 shows a flowchart of a data processing method according to an embodiment of the present disclosure;

FIG. 2 shows a schematic diagram of storage of pixel data according to an embodiment of the present disclosure;

3 shows a schematic diagram of N convolution kernels according to an embodiment of the present disclosure;

4a shows a schematic diagram of a first weight vector according to an embodiment of the present disclosure;

Fig. 4b shows a schematic diagram of a first weight matrix according to an embodiment of the present disclosure;

FIG. 5 shows a schematic diagram of storage of weight data according to an embodiment of the present disclosure;

6 shows a schematic diagram of splitting a second weight matrix according to an embodiment of the present disclosure;

FIG. 7 shows a schematic structural diagram of a MAC array according to an embodiment of the present disclosure;

Figure 8a shows a block diagram of an artificial intelligence processor according to an embodiment of the present disclosure;

Figure 8b shows a block diagram of a processing core according to an embodiment of the present disclosure;

9a shows a schematic diagram of selecting pixel data according to an embodiment of the present disclosure;

FIG. 9b shows yet another schematic diagram of selecting pixel data according to an embodiment of the present disclosure.

Detailed ways

Various exemplary embodiments, features and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. The same reference numbers in the figures denote elements that have the same or similar functions. While various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.

The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.

In addition, in order to better illustrate the present disclosure, numerous specific details are given in the following detailed description. It will be understood by those skilled in the art that the present disclosure may be practiced without certain specific details. In some instances, methods, means, components and circuits well known to those skilled in the art have not been described in detail so as not to obscure the subject matter of the present disclosure.

In the embodiment of the present disclosure, the artificial intelligence processor may be a neuromorphic chip based on a many-core architecture. Various artificial intelligence algorithms can be implemented based on the artificial intelligence processor. The artificial intelligence processor can include multiple processing cores, and each processing core can include a storage unit and an arithmetic unit. Wherein, the storage unit can be used to store the data to be operated, and the operation unit can be used to perform logical operation and arithmetic operation. The present disclosure does not limit the specific type of the artificial intelligence processor.

It can be known that in the field of artificial intelligence, especially in the field of image processing, the convolution operation occupies a large part of the total calculation amount, and as the depth and/or breadth of the convolutional neural network increases, the operation of the convolution operation Efficiency may have a greater impact on the operating efficiency of artificial intelligence processors, so improving the efficiency of convolution operations can improve the operating efficiency of artificial intelligence processors to a certain extent.

At present, when the neuromorphic chip based on the many-core structure realizes the convolution operation, it generally expands the multiple input channels of the input image into a one-dimensional vector form, and performs multiplier accumulation calculation on the pixel data and the corresponding weight data one by one. Due to the structural limitation of the multiplier-accumulator MAC in the current neuromorphic chip, only the product of a single pixel data and the weight data corresponding to multiple convolution kernels can be performed in each operation, and the convolution operation result is output after the accumulation.

Based on this, the operation unit in the embodiment of the present disclosure may include a multiplier-accumulator MAC array, and the MAC array may include an array based on a crossbar matrix structure. In one possible implementation, the MAC array may include A row×Q column MACs. The specific values of A and Q can be set according to actual requirements. Considering that the number N of convolution kernels is usually a power of 2, the MAC array can be, for example, a 4×32 MAC array. The embodiment of the present disclosure does not limit the structure of the MAC array in the operation unit. Based on the MAC array in the embodiment of the present disclosure, parallel convolution operations between multiple pixel data and weight data corresponding to multiple convolution kernels can be implemented, thereby improving the efficiency of convolution operations.

In a possible implementation manner, in order to implement the convolution operation between the pixel data and the weight data, the storage unit in each processing core may be used to store the pixel data of the image and the weight data of the N convolution kernels. The operation unit may include a multiplier-accumulator MAC array for performing operations according to pixel data and weight data, wherein the size of the image may be width W ₀ × height H ₀ × number of channels C ₀ , and the size of the convolution kernel may be width K×height K×channel number C ₀ , the step size in the row direction can be Sx, and W ₀ , H ₀ , C ₀ , K, and Sx are positive integers. It can be understood that, the pixel data and the weight data in the embodiments of the present disclosure may be data to be subjected to a convolution operation. The embodiments of the present disclosure do not limit the size and quantity of pixel data and weight data.

FIG. 1 shows a flowchart of a data processing method according to an embodiment of the present disclosure. As shown in FIG. 1 , the data processing method includes:

Step 11: Read the first pixel data from the storage unit according to the preset pixel read bit width, the first pixel data includes the mth channel of the image and the Pyth row of continuous M pixel data, 1≤m≤ C ₀ , 1≤Py≤H ₀ , 1<M≤W ₀ ;

Step 12: During the T-th operation of the Ky-th row of the k convolution kernels, read the bit width according to the preset weight, read the first weight data from the storage unit, and the first weight data includes k convolution kernels The weight data at the mth channel, the Kyth line, and the convolution kernel position T, 1<k≤N, 1≤T≤K, 1≤Ky≤K;

Step 13: According to the step size Sx of the convolution kernel, select a pixel data corresponding to the position T of the convolution kernel from the first pixel data as the second pixel data, 1<a<M;

Step 14: When T>1, for the qth column MAC in the MAC array, multiply the second pixel data by the qth weight data in the first weight data, and multiply it with the result of the T-1th operation. Add to obtain a first convolution operation result of the T-th operation of the MAC in the q-th column, 1≤q≤k.

In a possible implementation manner, the data processing method in the embodiment of the present disclosure may be applied to an artificial intelligence processor.

In a possible implementation manner, before performing step 11, the parameters required for performing the convolution operation may also be obtained by obtaining the primitive parameters, and the primitive parameters may include data required for performing the convolution operation, for example , the primitive parameters can include: image size W ₀ ×H ₀ ×C ₀ , convolution kernel size K×K×C ₀ , number of convolution kernels N, row-wise stride Sx, column-wise stride Sy, dilation For parameters such as rate Ex, padding parameter, and bias parameter Bias, the specific form of the primitive parameter is not limited in this embodiment of the present disclosure.

In the embodiment of the present disclosure, by reading the first pixel data and the first weight data, according to the step size Sx of the convolution kernel, a pixel data corresponding to the position T of the convolution kernel is selected from the first pixel data as the first pixel data. Two pixel data, for the qth column MAC in the MAC array, multiply the second pixel data by the qth weight data in the first weight data, and add the result of the T-1th operation to obtain the qth A first convolution operation result of the T-th operation of the column MAC can realize multi-point parallel convolution operation between multiple pixel data and weight data corresponding to multiple convolution kernels in each operation, thereby improving the Convolution operation efficiency to improve the operation efficiency of artificial intelligence processors.

In a possible implementation manner, before performing step 11, the data processing method may further include: storing pixel data of the image according to the pixel storage bit width, wherein the pixel storage bit width is consistent with the pixel read bit width, So that in step 11, the first pixel data is read from the storage unit according to the preset pixel read bit width.

In a possible implementation manner, according to the pixel storage bit width, storing the pixel data of the image may include: for the pixel data of the mth channel and the Pyth row of the image, according to each consecutive b pixel data is divided into B first storage vectors, B is equal to the result of dividing W ₀ by b and rounded up, 1≤b≤W ₀ ; for each first storage vector, split the first storage vector into E according to each b bytes A second storage vector, the b byte is less than or equal to the pixel storage bit width; according to the pixel storage bit width, E second storage vectors are sequentially stored, and the address space less than the weight storage bit width is filled with 0; Py row of pixel data.

For example, it is assumed that the size of the image X is W ₀ ×H ₀ ×C ₀ =138×127×3, that is, the image X has 3 channels, and each channel contains 138×127 pixel data. Assuming that b is 16, the pixel storage bit width is 32 bytes. For the pixel data of the first channel and the first row of the image, it is divided into B first storage vectors according to each consecutive 16 pixel data, and B=138/16 is rounded up to 9; it is assumed that each pixel data occupies 1 byte, then for each first storage vector, the first storage vector can be divided into 9 second storage vectors according to every 16 bytes, that is, E=9; according to the pixel storage bit width of 32B, 9 are stored in sequence The second storage vector, since the ninth second storage vector contains 10 pixel data, that is, the ninth second storage vector has a width of 10B, which is less than 32B, then the address of the second storage vector that is not enough for the weight storage bit width is in the storage unit. The space is filled with 0, which is equivalent to storing the pixel data of the first channel and the first line of the image.

In a possible implementation manner, after the pixel data of the first channel and the first row are stored in sequence, the pixel data of the first channel and the second row are stored until the completion of all rows of the first channel. After the pixel data is stored, the pixel data of all rows of the second channel are stored until the storage of the pixel data of all channels is completed.

It can be understood that the number E of the second storage vectors obtained by splitting is related to the second storage unit of the pixel data, and the second storage unit of the pixel data is determined according to the data type of the pixel data. For example, if the second storage unit of pixel data is 2 bytes, for each first storage vector, the first storage vector may be divided into 8 second storage vectors every 16 bytes.

In a possible implementation manner, the data type may include multi-precision data types such as three-valued (-1, 0, 1), int8, uit8, etc. The embodiment of the present disclosure does not limit the data type of pixel data.

In a possible implementation manner, the specific value of b can be set according to actual needs. In some cases, the pixel data can be a multiple of 16, then b can be set to a multiple of 16, for example, 16 or 32, etc. This embodiment of the present disclosure is not limited. Among them, the b byte is less than or equal to the pixel storage bit width, so as to align and store the pixel data in the storage unit.

In a possible implementation manner, the pixel storage bit width may be the storage width of pixel data in the storage unit set according to actual requirements. In order to facilitate the storage of pixel data, the pixel storage bit width may be a multiple of 16, for example, It may be 16B, 32B, or 64B, etc., which is not limited in this embodiment of the present disclosure. In order to facilitate reading pixel data from the storage unit, the pixel storage bit width may be consistent with the pixel read bit width.

FIG. 2 shows a schematic diagram of storing pixel data according to an embodiment of the present disclosure. Among them, Px represents the Px-th column of the image X, Py represents the Py-th row of the image X, and RGB represents the red, green, and blue channels of the image. As shown in Figure 2, the first row of 16B storage space stores the 0th to 15th pixel data of the first row of the R channel of the image X, namely X[0][0:15], and so on, X[0] [Px-1; 0] means that the pixel data of the first row is stored, the pixel data of this row is filled with 0 in the address space less than 16B in the storage space, and the pixel data of the second row is stored after the pixel data of the first row is stored.

In the embodiment of the present disclosure, by dividing the pixel data into the first storage vector and the second storage vector, the storage efficiency of the pixel data can be improved, and it is convenient to read the pixel data corresponding to the weight data from the storage unit.

In a possible implementation manner, before step 11 is performed, the data processing method may further include: storing weight data of N convolution kernels according to the weight storage bit width. The weight storage bit width is consistent with the weight read bit width, so that in step 12, the first weight data is read from the storage unit according to the preset weight read bit width.

In a possible implementation manner, storing the weight data of N convolution kernels according to the weight storage bit width may include: for each convolution kernel in the N convolution kernels, sequentially according to the row of the convolution kernel direction, column direction and the order of channel C ₀ , the weight data of the convolution kernel is vertically arranged as the first weight vector; the first weight vectors of the N convolution kernels are horizontally aligned and merged into the first weight matrix; according to the weight storage Bit width, horizontally storing the weight data in the first weight matrix.

FIG. 3 shows a schematic diagram of N convolution kernels according to an embodiment of the present disclosure. Fig. 4a shows a schematic diagram of a first weight vector according to an embodiment of the present disclosure. FIG. 4b shows a schematic diagram of a first weight matrix according to an embodiment of the present disclosure. For example, as shown in FIG. 3, for N convolution kernels of K×K×C ₀ =3×3×3, the convolution kernel 1 can be longitudinally arranged in the order of row direction, column direction and channel C ₀ . Arranged as the first weight vector as shown in Figure 4a, the first weight vectors corresponding to other convolution kernels are analogized in turn, and will not be repeated; the first weight vectors corresponding to the N convolution kernels are horizontally aligned and merged into Figure 4b The first weight matrix shown, and then according to the weight storage bit width, the first weight matrix is stored, so as to realize the storage of the weight data.

In the embodiment of the present disclosure, by processing the weight data of the N convolution kernels into the first weight matrix, and then storing the first weight matrix according to the weight storage bit width, the sequential storage of the weight data can be realized, and the storage efficiency of the weight data can be improved. .

In a possible implementation manner, horizontally storing the weight data in the first weight matrix according to the weight storage bit width, may include: when N is greater than the number of columns Q of the MAC array, splitting vertically according to each Q column Divide the first weight matrix to obtain F second weight matrices, where,

Indicates round-up, that is, F is equal to the round-up value of (N/Q); in the case that the width of the second weight matrix is less than or equal to the weight storage bit width, the order of the row direction and the column direction is followed. The weight data in the f second weight matrices, 1≤f≤F, the address space that is not enough for the weight storage bit width is filled with 0; the f-1th second weight matrix is arranged before the fth second weight matrix; wherein , the width of the second weight matrix is equal to Q multiplied by the first storage unit of the weight data, and the first storage unit of the weight data is determined according to the data type of the weight data.

In a possible implementation manner, the data type may include multi-precision data types such as three-value (-1, 0, 1), int8, uit8, etc. The embodiment of the present disclosure does not limit the data type of the weight data.

For example, assuming that the weight storage bit width is 32 bytes B, there are 64 convolution kernels, the number of columns of the MAC array is 32, and the storage unit of the weight data is 2 bits, then F=64÷32=2, that is, the first weight The matrix is divided into two second weight matrices; the width of the second weight matrix is 2bit×32=64bit<weight storage bit width 32B, and the two second weight matrices can be stored in the order of row direction and column direction in turn. Weight data, arrange the first second weight matrix before the second second weight matrix. It can be understood that the weight data in the first second weight matrix is stored in the row direction and the column direction first, and then the weight data in the second second weight matrix is stored in the row direction and the column direction, wherein the second weight matrix The weight data of each row in the storage unit is not enough for the address space of the weight storage bit width to be filled with 0. After filling with 0, it is equivalent to storing a row of weight data in the second weight matrix. After storing the weight data of the current row, store the next A row of weight data.

In a possible implementation, when the width of the second weight matrix is greater than the weight storage bit width, for the f second weight matrix, the f second weight matrix is vertically split according to each weight storage bit width , get F ₀ third weight matrix, where,

That is, F ₀ is equal to the rounded-up value of (the width of the second weight matrix/weight storage bit width); the weight data in the f _0th third weight matrix is stored in the order of the row direction and the column direction, 1≤ f ₀ ≤F ₀ ; arrange the f ₀ -1 th third weight matrix before the f ₀ th third weight matrix.

For example, assuming that the weight storage bit width is 32B, there are 64 convolution kernels, the number of columns of the MAC array is 32, and the storage unit of the weight data is 2B, then F=64÷32=2, that is, the first weight matrix split are two second weight matrices; the width of the second weight matrix is 2B×32=64B>the weight storage bit width is 32B. For each second weight matrix, the second weight matrix can be divided vertically according to every 32B to obtain two third The weight matrix is equivalent to dividing the first weight matrix into 4 third weight matrices vertically according to each 32B, and stores the weight data in the third weight matrix in the order of row direction and column direction.

In some cases, the number N of convolution kernels may also be less than or equal to the number of columns Q of the MAC array, then in a possible implementation manner, the weight data in the first weight matrix is horizontally stored according to the weight storage bit width , may also include: when N is less than or equal to the number of columns Q of the MAC array, and the width of the first weight matrix is greater than the weight storage bit width, splitting the first weight matrix vertically according to each weight storage bit width to obtain F ₁ The fourth weight matrix; according to the order of row direction and column direction, the weight data in the f _1th fourth weight matrix is stored, 1≤f ₁ ≤F ₁ ; the f ₁ -1th fourth weight matrix is arranged in the Before the _f1th fourth weight matrix; wherein, the width of the first weight matrix is equal to N times the first storage unit of the weight data.

For example, assuming that the weight storage bit width is 32B, 16 convolution kernels, the number of columns of the MAC array is 32B, and the storage unit of the weight data is 4B, the number of convolution kernels is less than the number of columns of the MAC array, and the first weight The width of the matrix is 16×4B=64B>the weight storage bit width is 32B. The first weight matrix can be divided vertically every 32B to obtain two fourth weight matrices, and then two fourth weight matrices can be stored in the order of row and column directions. Weight matrix, arrange the first fourth weight matrix before the second fourth weight matrix.

In a possible implementation, there may also be a situation where N is less than or equal to the number of columns Q of the MAC array, and the width of the first weight matrix is less than or equal to the weight storage bit width. The weight data in the first weight matrix is stored in the order of , and each row of weight data is filled with 0 in the address space where the weight storage bit width is insufficient in the storage unit.

FIG. 5 shows a schematic diagram of storing weight data according to an embodiment of the present disclosure. Among them, Kx represents the Kx column of the convolution kernel, Ky represents the Ky-th row of the convolution kernel, RGB represents the three channels of the convolution kernel corresponding to the red, green and blue channels of the image, F0 represents the first target weight matrix, F1 Represents the second target weight matrix, and so on. As shown in Figure 5, "R channel_F0" means that the first target weight matrix under the first channel of the convolution kernel is stored, and the first 32B stores the first target weight matrix. row, the second 32B stores the second row of the first target weight matrix, and so on, where [0,0] represents the channel, the first row, the first weight data, [Ky-1,Kx -1] represents the weight data under the channel, the Ky-th row, the Kx-th weight data, and so on, arrange the first target weight matrix F0 before the second target weight matrix F1, and fill the address space with insufficient weight storage bit width with 0 .

In a possible implementation manner, the weight storage bit width can be set according to the actual demand to set the storage width of the weight data in the storage unit. In some cases, the number of convolution kernels in the convolution layer is usually a multiple of 16 , for example, 32, 64, 128, 256, etc., then the weight storage bit width can be set to be a multiple of 16, for example, 32 bytes, 64 bytes, etc., which is not limited by this embodiment of the present disclosure.

In a possible implementation manner, the weight storage bit width and the weight read bit width may be consistent, so that the cache module reads the first weight data from the storage unit in step 12 according to the preset weight read bit width.

In the embodiment of the present disclosure, the weight data is stored according to the column number Q of the MAC array and the weight storage bit width, which can improve the storage efficiency of the weight data, so that the weight data sequentially read from the storage unit in each operation is the same as the pixel value. corresponding to the data to further improve the efficiency of the convolution operation.

In a possible implementation manner, the arithmetic unit of each processing core of the artificial intelligence processor may further include at least one cache module, and the cache module may be configured to read from the storage unit according to a preset pixel read bit width Take the pixel data, and read the weight data from the storage unit according to the preset weight read bit width, then in step 11, read the first pixel data from the storage unit according to the preset pixel read bit width, which may be through At least one cache module reads the first pixel data from the storage unit, and in step 12, reads the first weight data from the storage unit according to the preset weight read bit width, which may be from the storage unit through at least one cache module Read the first weight data.

In a possible implementation manner, the cache module may use a register, a dual-port random access memory, a non-volatile memory, or other memory that can implement shift fetching, which is not limited to this embodiment of the present disclosure.

In a possible implementation manner, the size and quantity of the cache module may be set according to actual requirements. In this embodiment of the present disclosure, the cache module may be larger than the pixel read bit width and the weight read bit width. For example, if the pixel read If the bit width is 32B, the register of 48B can be selected to ensure the continuous loading of data during the operation, thereby ensuring the continuity of the operation.

In a possible implementation manner, after the size of the cache module is determined, one or more cache modules can be used according to actual requirements. For example, if a 48B register is to be used, the 48B register can use three 16B registers It is also possible to use a 48B register, in which multiple cache modules can be selected to realize the multiplexing of the cache modules and improve the utilization rate of resources.

In a possible implementation manner, if the width of the data loaded by the cache module is smaller than the size of the cache module, the cache module can load the data with this width, and other storage spaces in the cache module are filled with 0, for example, a 48B register. If the pixel data or weight data is less than 16B, load the 16B data, and add 0 to other storage spaces in the cache module. If the loaded pixel data or weight data is less than 32B, load the 32B data, and add 0 to other storage spaces in the cache module.

In a possible implementation manner, the reading of the first pixel data in step 11 may be continuous. In other words, when the data in the cache module cannot meet the requirements of the current operation, the cache module may read from the storage unit Continuous pixel data to ensure the continuity of the operation. For example, if a 48B register is used to read data, whenever the 16B data is shifted in the register, the register will load the next 16B data from the storage unit, thus ensuring that Continuity of operations.

In a possible implementation manner, in step 12, reading the first weight data from the storage unit according to the preset weight reading bit width may include:

When T=1, determine the mth channel, the Kyth row, and the row L of the weight data at the convolution kernel position T of the k convolution kernels in the target weight matrix; The weight data of the Lth row of the target weight matrix is read in the unit as the first weight data read from the storage unit;

When 1<T≤K, read the bit width according to the preset weight, and read the weight data of the L+T-1th row of the target weight matrix from the storage unit as the first weight data read from the storage unit ;

Wherein, the target weight matrix may include the second weight matrix or the third weight matrix. In a possible implementation manner, the target weight matrix may further include a first weight matrix or a fourth weight matrix.

In a possible implementation manner, the convolution kernel position T may refer to the T th weight data of the Ky th row of the m th channel of the convolution kernel.

In a possible implementation manner, reading the second weight data from the storage unit may be after determining the starting address of the weight data to be read, that is, the storage address corresponding to the weight data in the Lth row of the target weight matrix , read the weight data of the L+T-1th row according to the sequential addressing method, that is, the address is accumulated by 1.

Through the embodiments of the present disclosure, during the T-th operation of the Ky-th row of the k convolution kernels, it is possible to sequentially read the m-th channel, the Ky-th row, and the convolution kernel position T of the k convolution kernels. weight data. FIG. 6 shows a schematic diagram of splitting a second weight matrix according to an embodiment of the present disclosure. For example, as shown in Figure 6, when T=1 operation, the weight data of the first row of the second weight matrix such as a1 and e1 can be read, which is equivalent to reading the first row of 32 convolution kernels. channel, the first row, the first weight data, when T=2, the weight data of the second row of the second weight matrix such as a2 and e2 can be read, which is equivalent to reading the first k convolution kernels. Channel, first row, second weight data, and so on.

In the embodiment of the present disclosure, after determining the row L of the weight data at the m channels, the Ky th row, and the convolution kernel position T in the target weight matrix, the first weight data is read row by row, and the reading can be realized. Weight data corresponding to pixel data.

In a possible implementation, in step 13, according to the step size Sx of the convolution kernel, select a pixel data corresponding to the position T of the convolution kernel from the first pixel data as the second pixel data, including:

When T=1, select a pixel data with an interval step Sx from the first pixel data as the second pixel data, and the second pixel data includes the mth channel of the image, X[0] of the Pyth row, X[0] Sx], X[2Sx], X[3Sx], ..., pixel data at X[(a-1)Sx];

When 1<T≤K, according to the expansion rate Ex of the convolution kernel, select the mth channel of the image and X[(T-1)Ex] of the Pyth row from the first pixel data, X[Sx+(T- 1)Ex], X[2Sx+(T-1)Ex], X[3Sx+(T-1)Ex], ..., the pixel data at X[(a-1)Sx+(T-1)Ex] as second pixel data.

It can be understood that when implementing the convolution operation between the convolution kernel and the image, the convolution kernel usually performs the convolution operation according to the moving step size in the row direction and the column direction. One pixel data, when T=1, select a second pixel data X[0], X[Sx], X[2Sx], X[3Sx], ... , X[(a-1)Sx], which is equivalent to selecting a second pixel data corresponding to the mth channel, the Kyth row, and the first weight data of multiple convolution kernels.

Among them, when 1<T≤K, according to the expansion rate Ex of the convolution kernel, select the mth channel of the image and X[(T-1)Ex] of the Pyth row from the first pixel data, X[Sx+( T-1)Ex], X[2Sx+(T-1)Ex], X[3Sx+(T-1)Ex], ..., pixel at X[(a-1)Sx+(T-1)Ex] The data is used as the second pixel data, which is equivalent to selecting a second pixel data corresponding to the mth channel, the Kyth line, and the Tth weight data of multiple convolution kernels. For example, when T=2, the second pixel The data can include X[Ex], X[Sx+Ex], X[2Sx+Ex], X[3Sx+Ex], ..., X[(a-1)Sx+Ex], when T=3, the first The two-pixel data may include X[2Ex], X[Sx+2Ex], X[2Sx+2Ex], X[3Sx+2Ex], ..., X[(a-1)Sx+2Ex], and so on.

In a possible implementation, the expansion rate Ex of the convolution kernel can be set according to the actual volume and operation requirements. When the expansion rate Ex=1, it means that the ordinary convolution operation is performed. When the expansion rate Ex> When it is 1, it means that the dilated convolution operation is performed.

In a possible implementation manner, the value of a may be less than or equal to the row number A of the MAC array, for example, for a 4×32 MAC array, a may be an integer in [1, 4].

In the embodiment of the present disclosure, a piece of second pixel data is selected from the first pixel data according to the step size Sx and the expansion rate Ex, so that the selected multiple pixel data corresponds to the weight data of multiple convolution kernels, Thus, the convolution operation of pixel data and weight data can be accurately realized, and the dilated convolution operation can also be supported.

In a possible implementation manner, the first weight data read in step 12 and the second weight data selected in step 13 can be input into the MAC array through the cache module, and then the second pixel data and the corresponding weight can be realized in step 14 The multiplier accumulation of the data, that is, the convolution operation of the weight data and the pixel data is realized.

FIG. 7 shows a schematic structural diagram of a MAC array according to an embodiment of the present disclosure. In order to facilitate the understanding of the process of obtaining the first convolution operation result in step 14, a schematic diagram of a MAC array shown in FIG. 7 is used as an example for illustration. As shown in FIG. 7, each circle contains 4 MACs, then a Can be 4, there are 5 columns of MAC, then k can be 5.

Then when T=2, the selected second pixel data X[Ex], X[Sx+Ex], X[2Sx+Ex], X[3Sx+Ex] are input into the MAC array from the row direction of the MAC array ; Input the first weight data of the mth channel, the Kyth row, and the convolution kernel position 2 of the 5 convolution kernels into the MAC array from the column direction. For the qth column MAC in the MAC array, you can The products of the 4 pixel data and the qth first weight data are obtained respectively. By analogy, when T=1, the second pixel data (X[0], X[Sx], X[2Sx], X[3Sx]) can be obtained, and the first weight data (5 convolution kernels The product of the mth channel, the Kyth row, the weight data at kernel position 1).

Then for each column of MAC, when T=2, T=2 operations to obtain 4 products and T=1 operation to obtain 4 products to accumulate respectively to obtain the T=2th operation result, and so on, in the Tth During the first operation, the product obtained by the T-th operation and the T-1 operation result are cyclically accumulated to obtain a first convolution operation result of the T-th operation of the MAC in the q-th column.

In the embodiment of the present disclosure, a multi-point parallel convolution operation between multiple pixel data and weight data corresponding to multiple convolution kernels can be implemented in each operation, so that the convolution operation can be efficiently implemented and artificial intelligence can be improved. The operating efficiency of the processor.

In the embodiment of the present disclosure, when the T=Kth operation is completed, for the qth column MAC, it is equivalent to completing the mth channel of the convolution kernel corresponding to the column, the Kyth row weight data, and the ath column of the corresponding pixel data. A convolution operation result. It can be known that, when implementing the convolution operation between the convolution kernel and the image, it is usually necessary to calculate the product of each row of weight data and pixel data of each channel.

In a possible implementation manner, the data processing method according to an embodiment of the present disclosure may further include: Step 15: when T=1, for the qth column MAC, compare the second pixel data with the first weight data in the first weight data. The q weight data are multiplied and added with the convolution operation result of the Kth operation in the Ky-1th row to obtain a first convolution operation result of the 1st operation of the MAC in the qth column, 1≤q≤ k.

The convolution operation result of the Kth operation in the Ky-1th row can be obtained by using the processing methods disclosed in steps 11 to 14 in the above embodiments of the present disclosure, and details are not described herein again.

In the embodiment of the present disclosure, the cyclic accumulation of the convolution operation results of each row of weight data and the corresponding pixel data can be implemented, so as to obtain the convolution operation result of the mth channel.

In practical applications, when implementing the convolution operation between the convolution kernel and the image, for each convolution kernel, it is usually necessary to accumulate the convolution operation results of the weight data of each channel and the pixel data of the corresponding channel to obtain the final convolution Operation result.

In a possible implementation manner, the data processing method may further include: Step 16, for the qth column MAC, after completing the operation of the K rows of the k convolution kernels, obtain a second volume of the mth channel Product operation results; Step 17, after the convolution operation results of C ₀ channels are obtained, add the convolution operation results of C ₀ channels of each convolution kernel to obtain a target output by the qth column MAC The result of the convolution operation.

Wherein, the convolution operation result of each channel is actually obtained by accumulating the convolution operation results of each row of convolution kernels under the channel, and the K rows of operations of k convolution kernels are completed in step 16. The processing methods disclosed in steps 11 to 15 are obtained, and details are not repeated here.

In the embodiment of the present disclosure, for the qth column MAC, by adding the convolution operation results of the C ₀ channels of each convolution kernel, a target convolution operation result output by the qth column MAC can be obtained, which is equivalent to The values of the adjacent a points in the same row of the k output graphs are obtained.

In practical applications, after obtaining the result of a target convolution operation output by the qth column MAC output, that is, after obtaining the values of the adjacent a points in the same row of the k output images, the input image may not be fully completed in the row direction. For the convolution of pixel data and weight data, in a possible implementation manner, after obtaining a target convolution operation result output by the qth column MAC output, the data processing method may further include:

According to the mth channel of the image and the first storage vector corresponding to the pixel data at the X[aSx] position in the Pyth row, determine that the first storage vector corresponding to the pixel data at the X[aSx] position in the Pyth row is in a first storage start address corresponding to the storage unit;

According to the preset pixel read bit width and the first storage start address, the third pixel data is read from the storage unit, and the third pixel data includes consecutive M pieces of pixel data read from the first storage start address, so that the operation unit can continue to operate.

In a possible implementation, after obtaining the size of the input image, the padding parameter, the size of the convolution kernel (width K×height K), the step size of the convolution kernel (including the step size Sx in the row direction, the column The size of the output graph can be obtained after parameters such as the step size Sy) in the direction of Whether to complete the convolution operation of all pixel data and weight data in the input image in the row direction when calculating the result of the product operation.

Among them, P _out is the width or height of the output image, P _in is the width or height of the input image, and S represents the step size in the row direction or the step size in the column direction.

For example, if the width of the output image is calculated to be 16, and the current qth column MAC outputs 4 target convolution operation results, it means that the convolution of all pixel data and weight data in the row direction of the input image has not been completed.

Wherein, according to the mth channel of the image and the first storage vector corresponding to the pixel data at the Xth [aSx] in the Pyth row, determine the first storage vector corresponding to the pixel data at the Xth [aSx] in the Pyth row The vector corresponds to the first storage start address in the storage unit, because for the first pixel data of the Pyth row, the mth channel of the image and the X[0], X[Sx], X of the Pyth row were previously selected. [2Sx], X[3Sx], ..., the second pixel data at X[(a-1)Sx], so according to the step size Sx of the convolution kernel, it is necessary to select the first pixel from X[aSx]. Two pixel data for convolution operation.

In the embodiment of the present disclosure, considering that the corresponding storage address of the pixel data at X[aSx] is not easy to determine, and if the situation of reading data from the storage address corresponding to the pixel data at X[aSx] is complicated, Therefore, in this embodiment of the present disclosure, the first storage vector corresponding to the pixel data at X[aSx] is determined to correspond to the first storage start address in the storage unit, and then the bit width and the first storage start address are read according to the preset pixel. Storing the starting address and reading the third pixel data from the storage unit can easily and quickly determine the starting address of the cache module fetching from the storage unit. Since the first storage vector is stored in alignment in the storage unit, and at the same time It can facilitate the fetching of the cache module.

In a possible implementation manner, the first storage vector corresponding to the pixel data at X[aSx] in the mth channel of the image and the Pyth row can be determined according to aSx and nb-1, n∈[1, B], to determine the first storage vector corresponding to the pixel data at X[aSx].

For example, assuming that b is equal to 16, that is, the first storage vector is obtained by dividing every 16 pixel data. There are 3 first storage vectors including [0,15], [16,31], [16,31], For the pixel data of [32,47], if aSx=12, since 12 is less than 15, the corresponding first storage vector is [0,15], and it is necessary to read data from the storage unit starting from the 0th pixel data. aSx=18, 18 is greater than 15 and less than 31, data needs to be read from the storage unit starting from the 16th pixel data, and so on.

In a possible implementation manner, after the third pixel data is read from the storage unit according to the preset pixel read bit width and the first storage start address, the step size Sx of the convolution kernel may be Select a pixel data corresponding to the convolution kernel position T from the three pixel data as the second pixel data, including:

When T=1, select a pixel data with an interval step Sx from the third pixel data as the second pixel data, and the second pixel data includes the mth channel of the image, X[aSx] in the Pyth row, X[aSx], X [(a+1)Sx], X[(a+1)Sx]], X[(a+3)Sx]... Pixel data at X[(2a-1)Sx];

When 1<T≤K, according to the expansion rate Ex of the convolution kernel, select the mth channel of the image and X[aSx+(T-1)Ex] of the Pyth row from the third pixel data, X[(a+ 1)Sx]+(T-1)Ex], X[(a+1)Sx]+(T-1)Ex], X[(a+3)Sx+(T-1)Ex],..., The pixel data at X[(2a-1)Sx+(T-1)Ex] is taken as the second pixel data.

It should be noted that, according to the first storage vector corresponding to the pixel data at X[aSx] in the mth channel of the image and the Pyth row, the pixel data at X[aSx] in the Pyth row is determined to correspond to The first storage vector corresponding to the first storage start address in the storage unit is an implementation provided by the embodiment of the present disclosure, but those skilled in the art can understand that the present disclosure should not be limited thereto. Under the inspiration of the embodiments of the present disclosure, those skilled in the art can also determine the first storage vector corresponding to the pixel data at the mth channel of the image and the Xth [2aSx] in the Pyth row of the image, to determine the first storage vector in the Pyth row. The first storage vector corresponding to the pixel data at X[2aSx] corresponds to the first storage start address in the storage unit, and so on. For the sake of brevity, the embodiments of the present disclosure are not exhaustive.

In a possible implementation manner, the third pixel data is equivalent to the first pixel data in step 11. After the third pixel data is read from the storage unit, the steps 11 to 16 in the above embodiment of the present disclosure may be used. The data processing method described above can obtain another target convolution operation result output by each column of MAC, so that the convolution of all pixel data and weight data in the row direction of the image can be completed, that is, all values in the same row of the output image can be obtained.

In the embodiment of the present disclosure, by starting to read pixel data corresponding to the first storage start address in the storage unit from the first storage vector corresponding to the pixel data at the Xth[aSx], Xth[2aSx], etc., It is relatively simple and effective to implement line-feeding to read pixel data, so that the operation unit can continue to operate, and finally all the values of the same line of the multiple output graphs can be obtained.

In practical applications, after obtaining the data of a certain row of the output graph, it is necessary to perform a moving cycle operation in the column direction of the image according to the step size Sy in the column direction of the convolution kernel, and calculate the data of the next row of the output graph. In a possible implementation manner, the data processing method according to the embodiment of the present disclosure may further include: after completing the convolution operation of k convolution kernels and K rows of pixel data, according to the step size Sy of the convolution kernel in the column direction, Determine the second storage start address of the first pixel data at the interval Sy-1 row with the first row of K rows of pixel data; read from the storage unit according to the preset pixel read bit width and the second storage start address Fourth pixel data, the fourth pixel data includes consecutive M pieces of pixel data read from the second storage start address, so that the operation unit can continue to operate.

In a possible implementation manner, after reading the fourth pixel data, the fourth pixel data is equivalent to the first pixel data, then the data processing method disclosed in steps 11 to 17 in the above-mentioned embodiment of the present disclosure can be obtained to obtain The result of a target convolution operation output by the qth column MAC to complete the convolution operation between the convolution kernel and the image.

In the embodiment of the present disclosure, by reading the fourth pixel data according to the moving step size Sy in the column direction, it is convenient to calculate the data of each row of the output graph, and finally obtain the output graph corresponding to each convolution kernel.

It should be noted that the output map in the embodiment of the present disclosure may refer to a feature map obtained by convolution operation, and the input image and image may refer to the original image, or may refer to the image after the convolution operation has been performed. feature diagram, which is not limited to this embodiment of the present disclosure.

Fig. 8a shows a block diagram of an artificial intelligence processor according to an embodiment of the present disclosure, and Fig. 8b shows a block diagram of a processing core according to an embodiment of the present disclosure. As shown in FIG. 8 a , the artificial intelligence processor 100 includes a plurality of processing cores 101 , and as shown in FIG. 8 b , each processing core 101 includes a storage unit 102 and an operation unit 103 .

In a possible implementation manner, the storage unit 102 is used to store the pixel data of the image and the weight data of the N convolution kernels; the operation unit 103 includes a multiplier-accumulator MAC array 104, which is used for performing the processing according to the pixel data and the weight data. operation.

In a possible implementation manner, the operation unit may further include at least one cache module 105, and the cache module is configured to read pixel data from the storage unit 102 according to a preset pixel read bit width, and read pixel data according to a preset weight. The weight data is read from the storage unit 102 by taking the bit width.

In a possible implementation manner, the buffering module 105 may send the gated data into the MAC array for convolution operation, and output the convolution operation result to the address space specified by the address generating module 106 in the storage unit.

In a possible implementation manner, the operation unit may further include an address generation module 106 for generating an address pointer when the cache module reads data, so that the cache module 105 can implement sequential addressing and/or jump addressing according to the address pointer .

In one possible implementation, the MAC array 104 includes an array based on a crossbar switching crossbar matrix structure. The MAC array 104 can be expanded into two dimensions of rows and columns, and can support multi-point parallel convolution operations.

In a possible implementation manner, the processing core 101 may perform a convolution operation by using the data processing method described in any one of the foregoing embodiments of the present disclosure.

In a possible implementation manner, the storage unit 102 can be used to store data according to a specific storage logic of pixel data and weight data, wherein the storage logic of pixel data includes: each channel image is stored in sequence, and the pixels of each channel are stored in sequence. The data is expanded into a vector along its image width direction, and b consecutively stored as a storage vector. The vector is divided into multiple pieces according to the b-byte alignment, and stored one by one. The storage sequence of different pixel data in the storage unit is first along the image width direction, and then stored along the image height direction. The entire image storage is aligned according to the pixel storage bit width, and the lack of zero is filled to facilitate the calculation of register fetching. Among them, the pixel storage bit width is greater than or equal to the set b bytes.

In a possible implementation manner, the weight data and the pixel data may specify a storage address in the storage unit 102 . According to the storage address of the weight data, when reading the weight data, the cache module 105 can read the weight data from the starting address of the storage address in a manner of adding one to the address.

In a possible implementation manner, when the cache module 105 reads pixel data, an address jump will be generated, that is, to read pixel data by line jump, a configured address jump value can be set in the primitive parameter. The address generation module 106 generates the target address according to the address jump value, and counts it through the loop clock counter that comes with the artificial intelligence processor. After the count meets the jump condition, the loop clock counter generates a jump signal, and instructs the cache module through the jump signal. 105 realizes the jump of the address pointer according to the target address generated by the address generating module 106 .

In the embodiments of the present disclosure, by using the artificial intelligence processor of the embodiments of the present disclosure, efficient convolution operations can be implemented, and the operation efficiency of the artificial intelligence processor can be improved.

In a possible implementation, the convolution kernel is four-dimensional data, with a total of K×K×C ₀ ×N weight data, and each output graph N (the number of convolution kernels, the number of output channels) is used as the vector length to expand, A weight matrix with a height of Kx×Ky×C ₀ and a width of N is formed. The weight data in the weight matrix are arranged in the order of the first direction, the column direction, and the channel C ₀ direction in the height direction.

In a possible implementation, when storing the weight matrix, when N is greater than 32 and the weight data is greater than or equal to 1B, the weight matrix is divided into W_grp groups according to 32B alignment, and the data of each group is arranged in the storage unit in Below the data for the previous set. In one case, when the weight data is 2 bits, the weight matrix is aligned and split into W_grp groups every 32 columns (that is, every 32×2bit=8B).

In a possible implementation, each channel of the input image is expanded into a vector along the width direction, 16 pixel data are continuously stored as a first storage vector, and the first storage vector is divided into multiple second storage vectors according to 16B alignment. , store the second storage vector one by one. The storage sequence of the input images in the storage unit is in the row direction first, and then in the column direction. The entire input image is aligned according to 32B in the storage unit, and the address space less than 32B is filled with zeros to facilitate register fetching and calculation.

In a possible implementation, a 48B shift register or three 16B registers can be used to read data from the storage unit. When reading data from the storage unit, 48B data can be loaded in 3 clocks, for example, the adjacent 48B (0th to 47th) of the first row of the first channel (usually R channel) of the input image The pixel is loaded into the 48B register according to the high and low 16 bits in three clocks. If the width of the input image is less than 16B, only 16B data will be loaded, and zero will be filled when the width is insufficient; if the width of the input image is less than 32B, only 32B data will be loaded, and zero will be added when the width is insufficient. The reading operation can be controlled by the cyclic clock counter.

In a possible implementation manner, when selecting data from the register and outputting it to the MAC array for operation, whenever the register will remove one 16B data, the register will load the next 16B data to maintain the continuity of the operation.

In a possible implementation, based on a 4×32 2D MAC array, it is possible to multiply up to 4 pixel data at the same time with the weight data of up to 32 convolution kernels at the same position. For example, in one operation, the X[0], X[Sx], X[2Sx], X[3Sx] 4 pixel data in the register can be gated, and the first channel, The first row and the first weight data are multiplied together. Then shift the gate X[Ex], X[Ex+Sx], X[Ex+2Sx], X[Ex+3Sx] according to the row direction of the convolution kernel and the first one of the 32 convolution kernels The channel, the first row, and the second weight data are subjected to convolution operation, and Ex represents the expansion rate, until the convolution operation of the K pixel data and the corresponding weight data is completed.

In a possible implementation, by expanding the 2D MAC array into row and column directions, and expanding into Q groups in the column direction, to provide calculations in the direction of Q output channels; expanding into A groups in the row direction, to provide calculations in the row direction of A pixel data. After each A pixel data is multiplied by the corresponding weight data, it is accumulated in the column direction of the MAC array based on the Crossbar structure, and the convolution result of consecutive A points is generated in the form of pipeline, realizing the support of multiple pixel data and weight data. Parallel operation.

Taking a 4x32 2D MAC array, the K of the convolution kernel is 11, and the image of the RGB channel as an example, an implementation of the data processing method in the embodiment of the present disclosure is described, including the following steps:

Step 1, get primitive parameters.

Step 2: Load the adjacent 48B (0-47) pixels of the first row of the image R channel into the 48B register in 3 clocks according to the high and low 16 bits. Select 4 pixel data X[0], X[Sx], X[2Sx], X[3Sx] from the register and send them to the 2D MAC array, and multiply them with the weight data at the same position of the 32 convolution kernels. 32 convolution operation results are obtained in parallel.

Step 3, then shift according to the row direction, gate X[Ex], X[Ex+Sx], X[Ex+2Sx], X[Ex+3Sx] and multiply the corresponding weight data until K pixels are completed The convolution operation of the data and the corresponding weight data.

Step 4, read the 48B pixel data of the next row in a new line, select 4 pixel data at a time along the row direction and perform a convolution operation with the corresponding weight data at the same time.

Step 5: Repeat the above steps 1 to 4 until the K×K convolution operations under the R channel are completed, and then calculate the convolution operations of other channels such as the G channel and the B channel respectively, and then the output graph can be obtained at the same time. The adjacent four points of the first 32 channels: P ₀ [0,0], P ₀ [0,1], P ₀ [0,2], P ₀ [0,3]. At this time, P ₀ [0,0], P ₀ [0,1], P ₀ [0,2], and P ₀ [0,3] of the first 32 channels need to be written back to the storage unit. Repeat the above steps 1 to 5 until four adjacent points of the output map of all channels are obtained: P ₀ [0,0], P ₀ [0,1], P ₀ [0,2], P ₀ [0,3].

Step 6: Determine whether the starting position of the pixel data read by the register for the second time on the window exceeds the position of the 15th pixel. If it exceeds 15, read 48B from the address of the 16th to 63rd pixels in the first row. , otherwise, the pixel data is still read from the address of the 0th to 48th pixel. Still use the 48B register to read the data, and then select the pixel data at X[4Sx], X[5Sx], X[6Sx], X[7Sx] from the register to perform convolution calculation until K×K convolutions are completed. Operation, to get 32 output graphs with four adjacent points in the same row: P ₀ [0,4], P ₀ [0,5], P ₀ [0,6], P ₀ [0,7].

Step 7: After obtaining the data of the first row of the output image, start to calculate the data of the next row of the output image. At this time, read the corresponding pixel data from the 0+Sy row of the input image, and perform the operations from steps 1 to 6 above.

In a possible implementation, based on a 4×32 MAC array based on a crossbar structure, a maximum of 4 pixel data can be selected from the register at a time, and at the same time, a weight data at the same position of a maximum of 32 convolution kernels can be selected. Convolution operation.

In a possible implementation manner, when the pixel data is gated from the register, a shift read operation is performed according to the size of the convolution kernel, the stride Sx of the convolution kernel, and the expansion rate Ex. FIG. 9a shows a schematic diagram of selecting pixel data according to an embodiment of the present disclosure. As shown in Figure 9a, assuming a convolution kernel of K×K=11×11, the expansion rate Ex=1, each pixel data of the image X occupies 1B, and the 48B read in the register is the first 48B of the first line of the image. Pixel data, the stride Sx of the convolution kernel is 3, and the 3 registers "Reg[0], Reg[1], Reg[3]" read the first "0-" of the first row X[0] of the image X 47” pixel data.

In the first operation, select X[0], X[Sx], X[2Sx], X[3Sx] in the register, that is, the 1st, 4th, 7th, and 10th pixel data "0" , 3, 6, 9", select X[1], X[Sx+1], X[2Sx+1], X[3Sx+1] for the second time, that is, the pixel data "1, 4, 7, A" , select X[2], X[Sx+2], X[2Sx+2], X[3Sx+2] for the third time, that is, the pixel data "2, 5, 8, B", to the 11th operation, Select X[10], X[Sx+10], X[2Sx+10], X[3Sx+10], namely pixel data "A, D, 16, 19".

Since the size of the convolution kernel is 11×11, after 11 times of pixel data is selected from the register and sent to the MAC array, it is equivalent to calculating the convolution of the weight data of the first row of the convolution kernel and the corresponding pixel data.

At this time, the register jumps to the start address corresponding to the pixel data of the second row of image X, loads the first 48B data of the pixel data of the second row, and the gating logic for selecting the pixel data is consistent with the above. After the convolution operation is performed on the weight data of the second row of the convolution kernel and the corresponding pixel data, the first 48B data of the pixel data of the third row is loaded, and the logic of each data loading and data gating is consistent with the above.

For the pixel data read by the register from the storage unit, when the K times of pixel data are selected by the shift, the address pointer of the register jumps to the storage address corresponding to the pixel data of the second row of the image for the second fetch, until the read K times of data is calculated, which is equivalent to the convolution of the first layer of weight data between the image R channel and the convolution kernel, and then jumps to the starting address of the first line of pixel data in the image G channel, according to the same reading as the R channel. The gating logic calculates the convolution of the pixel data of the G channel and the weight data of the second layer of the convolution kernel, and so on, and then calculates the B channel. After the convolution of the three channels of RGB is completed, 4 numbers of the same row of the 32 output images can be obtained in parallel.

Next, when calculating the last four numbers of the same row of the output image, it is necessary to return to the storage address corresponding to the pixel data of the first row of the image, and load the pixel data of the first row, that is, perform the second windowing. In the second windowing, determine the pixel data to be selected (X[4Sx]), that is, whether the starting position of the second windowing has passed 15 (0 is the start, and every 16 pixel data is used as a storage vector ), if it exceeds, read 48B data from the storage address corresponding to the 16th pixel in the first row, if not, read 48B data from the storage address corresponding to the first pixel. Fig. 9b shows yet another schematic diagram of selecting pixel data according to an embodiment of the present disclosure. As shown in Fig. 9b, when reading pixel data by windowing for the second time, since X[4Sx]=12 is less than 15, 3 registers Reg [0], Reg[1], Reg[3] still read the pixel data of the first "0-47" of the first line X[0] of the image X, and then select the pixel data, select X[ 4Sx], X[5Sx], X[6Sx], pixel data at X[7Sx], namely "C, F, 18, 21", the second time select X[4Sx+1], X[5Sx+1] , X[6Sx+1], X[7Sx+1], namely "D, 16, 19, 22", until the 11th time select X[4Sx+10], X[5Sx+10], X[6Sx+10 ], the pixel data at X[7Sx+10].

When the window is applied for the third time, judge whether X[8Sx] exceeds 15. If not, load data from the storage address corresponding to the first pixel. If it exceeds 15, judge whether it exceeds 31. If it exceeds 31, start from the 32nd pixel. The storage address corresponding to the pixel reads 48B data, and if it does not exceed 31, 48B data is read from the address corresponding to the 16th pixel. After reading the pixel data in the window for the third time, select the pixel data according to the step size Sx, the size of the convolution kernel K, and the expansion rate Ex, which will not be repeated.

After obtaining the value of the current row of the output image, read the pixel data of the row 0+Sy according to the step size Sy in the column direction to calculate the value of the next row of the output image, until the values of all the rows of the output image are obtained.

In the embodiment of the present disclosure, it can be seen from the above-mentioned convolution operation flow that to obtain an output graph with a size of Ox×Oy×N, a total of six layers of loops are required, which are the width Kx of the convolution kernel and the height of the convolution kernel. Ky, channel C ₀ , output channel N, output map width Ox, output map height Oy direction cycle.

In the embodiment of the present disclosure, the storage logic of the weight data of the convolution kernel in the storage unit is consistent with the calculation process. Therefore, in the cyclic calculation process, it is sufficient to add 1 to the starting address of the weight. The storage sequence of the output graph and the output data sequence of the MAC array also follow certain rules, so the storage sequence of the output graph can also be determined directly according to the hardware solidification logic. When reading the pixel data of the input image, since the sliding window fetching operation is involved, the address jump value of each layer loop can be set as a configurable primitive parameter.

In the embodiment of the present disclosure, the 2D MAC array based on the structure of Crossbar can support multi-point parallel operation of data by expanding the two dimensions of row and column.

In the embodiment of the present disclosure, based on the storage logic of the input image and the convolution kernel, the storage efficiency can be improved. Each channel is stored in sequence, and each channel is expanded into a vector along its image width direction, and 16 consecutively stored as a storage vector, The vector is split into multiple pieces according to 16B alignment and stored one by one. The storage sequence of different pixel data in the storage unit is first along the image width direction, and then stored along the image height direction. The entire image storage is aligned according to 32B, and the shortage is filled with zeros to facilitate the calculation of register fetching.

In the embodiment of the present disclosure, by using multiple shift registers, dynamic data reading and gating can be realized, and the operation of pixel data and corresponding weight data can be accurately and efficiently realized.

In the embodiment of the present disclosure, based on the 2D MAC array, the input image is stored in the order of advance direction, column direction, and channel direction, and the target row of the input image is extracted by designing multiple shift registers to realize dynamic data reading and gating logic. In the data register, by multiplying the pixel data of the corresponding line of the convolution kernel, the multi-point parallel operation logic of the data can be realized, and the multi-point convolution operation result can be output in parallel by continuous operation in the form of line pipeline.

In the embodiments of the present disclosure, a novel convolution operation logic and data storage mode of a neuromorphic chip based on a many-core architecture is implemented, and the convolution operation and data storage efficiency between images and convolution kernels are improved.

Various embodiments of the present disclosure have been described above, and the foregoing descriptions are exemplary, not exhaustive, and not limiting of the disclosed embodiments. Numerous modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein was chosen to best explain the principles of the various embodiments, the practical application or improvement over the technology in the marketplace, or to enable others of ordinary skill in the art to understand the various embodiments disclosed herein.

Claims

A data processing method, characterized in that it is applied to a processing core of an artificial intelligence processor, wherein the artificial intelligence processor includes a plurality of processing cores, and each processing core includes a storage unit and an arithmetic unit,

The storage unit is used to store the pixel data of the image and the weight data of the N convolution kernels; the operation unit includes a multiplier-accumulator MAC array for performing operations according to the pixel data and the weight data, wherein, The size of the image is W 0 ×H 0 ×C 0 , the size of the convolution kernel is K×K×C 0 , the step size in the row direction is Sx, W 0 , H 0 , C 0 , K, Sx is a positive integer,

The method includes:

According to the preset pixel read bit width, read the first pixel data from the storage unit, the first pixel data includes the mth channel of the image and the continuous M pixel data of the Pyth row, 1 ≤m≤C 0 , 1≤Py≤H 0 , 1<M≤W 0 ;

During the T-th operation of the Ky-th row of the k convolution kernels, the bit width is read according to the preset weight, and the first weight data is read from the storage unit, where the first weight data includes k convolutions The mth channel of the kernel, the Kyth row, and the weight data at the convolution kernel position T, 1<k≤N, 1≤T≤K, 1≤Ky≤K;

According to the step size Sx of the convolution kernel, select a pixel data corresponding to the convolution kernel position T from the first pixel data as the second pixel data, 1<a<M;

When T>1, for the qth column MAC in the MAC array, multiply the second pixel data by the qth weight data in the first weight data, and perform the T-1th operation with The results are added to obtain a first convolution operation result of the T-th operation of the MAC in the q-th column, 1≤q≤k.
The method according to claim 1, wherein the method further comprises:

When T=1, for the qth column MAC, multiply the second pixel data by the qth weight data in the first weight data, and perform the Kth operation with the Ky-1th row The convolution operation results of the qth column are added to obtain a first convolution operation results of the first operation of the MAC in the qth column, 1≤q≤k.
The method according to claim 1, wherein the method further comprises:

For the qth column MAC, after completing the operations of the K rows of the k convolution kernels, a second convolution operation result of the mth channel is obtained;

After the convolution operation results of the C 0 channels are obtained, the convolution operation results of the C 0 channels of each convolution kernel are added to obtain a target convolution operation results output by the qth column MAC.
The method according to claim 1, wherein the method further comprises: storing the weight data of the N convolution kernels according to the weight storage bit width, wherein the weight storage bit width and the weight read Take the same bit width;

The storage of the weight data of the N convolution kernels according to the weight storage bit width includes:

For each convolution kernel in the N convolution kernels, according to the row direction, column direction and channel C 0 order of the convolution kernel, the weight data of the convolution kernel is longitudinally arranged into a first weight vector ;

horizontally aligning the first weight vectors of the N convolution kernels into a first weight matrix;

According to the weight storage bit width, the weight data in the first weight matrix is horizontally stored.
The method according to claim 4, wherein the storing the weight data in the first weight matrix horizontally according to the weight storage bit width comprises:

In the case where N is greater than the column number Q of the MAC array, the first weight matrix is vertically split according to each Q column to obtain F second weight matrices, wherein,

In the case where the width of the second weight matrix is less than or equal to the weight storage bit width, the weight data in the f-th second weight matrix is stored in the order of row direction and column direction, 1≤f≤F ;

Arrange the f-1th second weight matrix before the fth second weight matrix;

The width of the second weight matrix is equal to Q multiplied by the first storage unit of the weight data, and the first storage unit of the weight data is determined according to the data type of the weight data.
The method according to claim 5, wherein the method further comprises:

When the width of the second weight matrix is larger than the weight storage bit width, for the f second weight matrix, split the f second weight matrix vertically according to the weight storage bit width to obtain F 0 The third weight matrix, where,

Store the weight data in the f 0th third weight matrix in sequence in the row direction and the column direction, 1≤f 0 ≤F 0 ;

Arrange the f 0 -1 th third weight matrix before the f 0 th third weight matrix.
The method according to claim 1, wherein the method further comprises: storing pixel data of the image according to a pixel storage bit width, wherein the pixel storage bit width is consistent with the pixel read bit width ;

The storing of the pixel data of the image according to the pixel storage bit width includes:

For the pixel data of the mth channel and the Pyth row of the image, it is divided into B first storage vectors according to each consecutive b pixel data, B is equal to the result of dividing W 0 by b and rounded up, 1≤ b≤W 0 ;

For each first storage vector, split the first storage vector into E second storage vectors according to every b bytes, where the b bytes are less than or equal to the pixel storage bit width;

According to the pixel storage bit width, the E second storage vectors are sequentially stored, and the address space that is insufficient for the weight storage bit width is filled with 0;

The pixel data of the mth channel and the Pyth row are sequentially stored.
The method according to any one of claims 4 to 6, wherein reading the first weight data from the storage unit according to a preset weight reading bit width, comprising:

When T=1, determine the mth channel of the k convolution kernels, the Kyth row, and the row L of the weight data at the convolution kernel position T in the target weight matrix; read the bit width according to the weight, Read the weight data of the Lth row of the target weight matrix from the storage unit as the first weight data read from the storage unit;

When 1<T≤K, according to the preset weight read bit width, read the weight data of the L+T-1th row of the target weight matrix from the storage unit, as the read from the storage unit The first weight data taken;

Wherein, the target weight matrix includes the second weight matrix or the third weight matrix.
The method according to claim 3, wherein, according to the step size Sx of the convolution kernel, a pixel data corresponding to the position T of the convolution kernel is selected from the first pixel data as the second pixel data ,include:

When T=1, select a pixel data at intervals of the step size Sx from the first pixel data as the second pixel data, and the second pixel data includes the mth channel and the Pyth row of the image. The pixel data at X[0], X[Sx], X[2Sx], X[3Sx], ..., X[(a-1)Sx];

When 1<T≤K, according to the expansion rate Ex of the convolution kernel, select the mth channel of the image and X[(T-1)Ex] of the Pyth row from the first pixel data, X[Sx+(T-1)Ex], X[2Sx+(T-1)Ex], X[3Sx+(T-1)Ex], ..., X[(a-1)Sx+(T-1)Ex] ] as the second pixel data.
The method according to claim 9, wherein after obtaining a target convolution operation result output by the qth column MAC, the method further comprises:

According to the mth channel of the image and the first storage vector corresponding to the pixel data at X[aSx] in the Pyth row, determine the first storage vector corresponding to the pixel data at X[aSx] in the Pyth row A storage vector corresponds to the first storage start address in the storage unit;

According to the preset pixel read bit width and the first storage start address, read third pixel data from the storage unit, where the third pixel data includes starting from the first storage start address Read consecutive M pieces of pixel data so that the operation unit can continue to operate.
The method of claim 10, wherein the method further comprises:

After completing the convolution operation between the k convolution kernels and the K rows of pixel data, according to the step size Sy of the convolution kernel in the column direction, determine the interval Sy-1 row with the first row of the K rows of pixel data. The second storage start address of the first pixel data;

According to the preset pixel read bit width and the second storage starting address, read fourth pixel data from the storage unit, where the fourth pixel data includes starting from the second storage starting address Read consecutive M pieces of pixel data so that the operation unit can continue to operate.
The method according to any one of claims 1-11, wherein the multiplier-accumulator MAC array comprises an array based on a crossbar switching crossbar matrix structure;

The arithmetic unit further includes at least one cache module, which is configured to read pixel data from the storage unit according to a preset pixel read bit width, and read the bit width from the storage unit according to a preset weight. The weight data is read from the storage unit.
An artificial intelligence processor, characterized in that the artificial intelligence processor includes a plurality of processing cores, and each processing core includes a storage unit and an arithmetic unit, and the storage unit is used to store pixel data of an image and N convolutions. weight data of the core; the operation unit includes a multiplier-accumulator MAC array for performing operations according to the pixel data and the weight data,

Wherein, the processing core performs a convolution operation through the data processing method according to any one of claims 1-12.