WO2022110386A1 - Data processing method and artificial intelligence processor - Google Patents

Data processing method and artificial intelligence processor

Info

Publication number
WO2022110386A1
Authority
WO
WIPO (PCT)
Prior art keywords
weight
data
pixel data
storage
row
Prior art date
Application number
PCT/CN2020/137453
Other languages
French (fr)
Chinese (zh)
Inventor
裴京
施路平
徐明坤
王冠睿
马骋
Original Assignee
清华大学 (Tsinghua University)
Priority date
Filing date
Publication date
Application filed by 清华大学 (Tsinghua University)
Publication of WO2022110386A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00: Arrangements for program control, e.g. control units
    • G06F 9/06: Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46: Multiprogramming arrangements
    • G06F 9/50: Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005: Allocation of resources to service a request
    • G06F 9/5027: Allocation of resources to service a request, the resource being a machine, e.g. CPUs, Servers, Terminals
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/15: Correlation function computation including computation of convolution operations
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present disclosure relates to the field of computer technology, and in particular, to a data processing method and an artificial intelligence processor.
  • Neuromorphic chips are important platforms for realizing biologically interpretable brain-like algorithms such as spiking neural networks based on brain-like computing.
  • The convolution operation is one of the important logical operations for implementing artificial neural networks on neuromorphic chips based on a many-core architecture.
  • the present disclosure proposes a data processing method and an artificial intelligence processor to efficiently implement convolution operations.
  • A data processing method is provided, which is applied to a processing core of an artificial intelligence processor. The artificial intelligence processor includes a plurality of processing cores, each processing core includes a storage unit and an operation unit, the storage unit is used to store the pixel data of an image and the weight data of N convolution kernels, and the operation unit includes a multiplier-accumulator (MAC) array for performing operations according to the pixel data and the weight data, wherein the size of the image is W0 × H0 × C0, the size of the convolution kernel is K × K × C0, the stride in the row direction is Sx, and W0, H0, C0, K, and Sx are positive integers. The method includes: reading first pixel data from the storage unit according to a preset pixel read bit width, where the first pixel data includes M consecutive pixel data of the m-th channel and the Py-th row of the image, 1 ≤ m ≤ C0, 1 ≤ Py ≤ H0, 1 ≤ M ≤ W0; during the T-th operation of the Ky-th row of k convolution kernels, reading first weight data from the storage unit according to a preset weight read bit width, where the first weight data includes the weight data of the k convolution kernels at the m-th channel, the Ky-th row, and convolution kernel position T, 1 ≤ k ≤ N, 1 ≤ T ≤ K, 1 ≤ Ky ≤ K; selecting, according to the stride Sx of the convolution kernel, a pixel data corresponding to the convolution kernel position T from the first pixel data as second pixel data, 1 ≤ a ≤ M; and, when T > 1, for the q-th column of MACs in the MAC array, multiplying the second pixel data by the q-th weight data in the first weight data and adding the products to the results of the (T-1)-th operation to obtain a first convolution operation results of the T-th operation of the q-th column of MACs, 1 ≤ q ≤ k.
  • The method further includes: for the q-th column of MACs, after completing the operations of the K rows of the k convolution kernels, obtaining a second convolution operation results of the m-th channel; after the convolution operation results of the C0 channels are obtained, adding the convolution operation results of the C0 channels of each convolution kernel to obtain a target convolution operation results output by the q-th column of MACs.
  • The method further includes: storing the weight data of the N convolution kernels according to a weight storage bit width, wherein the weight storage bit width is consistent with the weight read bit width. Storing the weight data of the N convolution kernels according to the weight storage bit width includes: for each of the N convolution kernels, vertically arranging the weight data of the convolution kernel into a first weight vector in the order of the row direction, the column direction, and the channel C0 direction; horizontally aligning and merging the first weight vectors of the N convolution kernels into a first weight matrix; and horizontally storing the weight data in the first weight matrix according to the weight storage bit width.
  • Horizontally storing the weight data in the first weight matrix according to the weight storage bit width includes: when N is greater than the number of columns Q of the MAC array, vertically splitting the first weight matrix every Q columns to obtain F second weight matrices, where F = ⌈N/Q⌉; when the width of the second weight matrix is less than or equal to the weight storage bit width, storing the weight data in the f-th second weight matrix in the order of the row direction and then the column direction, 1 ≤ f ≤ F; and arranging the (f-1)-th second weight matrix before the f-th second weight matrix; wherein the width of the second weight matrix is equal to Q multiplied by the first storage unit of the weight data, and the first storage unit of the weight data is determined according to the data type of the weight data.
  • The method further includes: when the width of the second weight matrix is greater than the weight storage bit width, for the f-th second weight matrix, vertically splitting the f-th second weight matrix every weight storage bit width to obtain F0 third weight matrices, where F0 = ⌈(width of the second weight matrix) / (weight storage bit width)⌉; storing the weight data in the f0-th third weight matrix in the order of the row direction and then the column direction, 1 ≤ f0 ≤ F0; and arranging the (f0-1)-th third weight matrix before the f0-th third weight matrix.
  • The method further includes: storing the pixel data of the image according to a pixel storage bit width, wherein the pixel storage bit width is consistent with the pixel read bit width.
  • Storing the pixel data of the image according to the pixel storage bit width includes: dividing the pixel data of the m-th channel and the Py-th row of the image into B first storage vectors, each of b consecutive pixel data, where B = ⌈W0/b⌉ and 1 ≤ b ≤ W0; for each first storage vector, splitting the first storage vector into E second storage vectors of b bytes each, the b bytes being less than or equal to the pixel storage bit width; and storing the E second storage vectors in sequence according to the pixel storage bit width, filling the address space that falls short of the pixel storage bit width with 0, thereby sequentially storing the pixel data of the m-th channel and the Py-th row.
  • The method further includes: determining, according to the first storage vector corresponding to the pixel data at X[aSx] of the m-th channel and the Py-th row of the image, the first storage start address in the storage unit corresponding to that first storage vector; and reading third pixel data from the storage unit according to the preset pixel read bit width and the first storage start address, the third pixel data including M consecutive pixel data read from the first storage start address, so that the operation unit can continue the operation.
  • The method further includes: after completing the convolution operation between the k convolution kernels and the K rows of pixel data, determining, according to the stride Sy of the convolution kernel in the column direction, the second storage start address of the first pixel data of the row that is Sy-1 rows apart from the first of the K rows of pixel data; and reading fourth pixel data from the storage unit according to the preset pixel read bit width and the second storage start address, the fourth pixel data including M consecutive pixel data read from the second storage start address, so that the operation unit can continue the operation.
  • The multiplier-accumulator MAC array includes an array based on a crossbar matrix structure; the operation unit further includes at least one cache module, and the cache module is configured to read pixel data from the storage unit according to the preset pixel read bit width and to read weight data from the storage unit according to a preset weight read bit width.
  • An artificial intelligence processor includes a plurality of processing cores, each processing core includes a storage unit and an operation unit, the storage unit is used to store the pixel data of an image and the weight data of N convolution kernels, and the operation unit includes a multiplier-accumulator MAC array for performing operations according to the pixel data and the weight data, wherein the processing core performs the convolution operation through any one of the above data processing methods.
  • In this way, a pixel data corresponding to the convolution kernel position T are selected from the first pixel data as second pixel data; for the q-th column of MACs in the MAC array, the second pixel data are multiplied by the q-th weight data in the first weight data and added to the results of the (T-1)-th operation to obtain a first convolution operation results of the T-th operation of the q-th column of MACs. A multi-point parallel convolution operation between multiple pixel data and the weight data of multiple convolution kernels can thus be realized in each operation, which improves the efficiency of the convolution operation and thereby the operating efficiency of the artificial intelligence processor.
  • FIG. 1 shows a flowchart of a data processing method according to an embodiment of the present disclosure
  • FIG. 2 shows a schematic diagram of storage of pixel data according to an embodiment of the present disclosure
  • FIG. 3 shows a schematic diagram of N convolution kernels according to an embodiment of the present disclosure
  • FIG. 4a shows a schematic diagram of a first weight vector according to an embodiment of the present disclosure
  • Fig. 4b shows a schematic diagram of a first weight matrix according to an embodiment of the present disclosure
  • FIG. 5 shows a schematic diagram of storage of weight data according to an embodiment of the present disclosure
  • FIG. 6 shows a schematic diagram of splitting a second weight matrix according to an embodiment of the present disclosure
  • FIG. 7 shows a schematic structural diagram of a MAC array according to an embodiment of the present disclosure
  • Figure 8a shows a block diagram of an artificial intelligence processor according to an embodiment of the present disclosure
  • Figure 8b shows a block diagram of a processing core according to an embodiment of the present disclosure
  • FIG. 9a shows a schematic diagram of selecting pixel data according to an embodiment of the present disclosure.
  • FIG. 9b shows yet another schematic diagram of selecting pixel data according to an embodiment of the present disclosure.
  • the artificial intelligence processor may be a neuromorphic chip based on a many-core architecture.
  • Various artificial intelligence algorithms can be implemented based on the artificial intelligence processor.
  • the artificial intelligence processor can include multiple processing cores, and each processing core can include a storage unit and an arithmetic unit.
  • the storage unit can be used to store the data to be operated, and the operation unit can be used to perform logical operation and arithmetic operation.
  • the present disclosure does not limit the specific type of the artificial intelligence processor.
  • The convolution operation occupies a large part of the total amount of computation, and as the depth and/or breadth of a convolutional neural network increases, the efficiency of the convolution operation may have a greater impact on the operating efficiency of the artificial intelligence processor; therefore, improving the efficiency of the convolution operation can, to a certain extent, improve the operating efficiency of the artificial intelligence processor.
  • When a neuromorphic chip based on a many-core architecture implements the convolution operation, it generally expands the multiple input channels of the input image into a one-dimensional vector and performs multiply-accumulate calculations on the pixel data and the corresponding weight data one by one. Due to the structural limitation of the multiplier-accumulator MAC in current neuromorphic chips, each operation can only compute the products of a single pixel data with the weight data of multiple convolution kernels, and the convolution operation result is output after accumulation.
  • the operation unit in the embodiment of the present disclosure may include a multiplier-accumulator MAC array, and the MAC array may include an array based on a crossbar matrix structure.
  • The MAC array may include A rows × Q columns of MACs.
  • The specific values of A and Q can be set according to actual requirements. Considering that the number N of convolution kernels is usually a power of 2, the MAC array can be, for example, a 4 × 32 MAC array.
  • the embodiment of the present disclosure does not limit the structure of the MAC array in the operation unit. Based on the MAC array in the embodiment of the present disclosure, parallel convolution operations between multiple pixel data and weight data corresponding to multiple convolution kernels can be implemented, thereby improving the efficiency of convolution operations.
  • the storage unit in each processing core may be used to store the pixel data of the image and the weight data of the N convolution kernels.
  • The operation unit may include a multiplier-accumulator MAC array for performing operations according to the pixel data and the weight data, wherein the size of the image may be width W0 × height H0 × number of channels C0, the size of the convolution kernel may be width K × height K × number of channels C0, the stride in the row direction may be Sx, and W0, H0, C0, K, and Sx are positive integers.
  • the pixel data and the weight data in the embodiments of the present disclosure may be data to be subjected to a convolution operation.
  • the embodiments of the present disclosure do not limit the size and quantity of pixel data and weight data.
  • FIG. 1 shows a flowchart of a data processing method according to an embodiment of the present disclosure. As shown in FIG. 1 , the data processing method includes:
  • Step 11: read first pixel data from the storage unit according to a preset pixel read bit width, the first pixel data including M consecutive pixel data of the m-th channel and the Py-th row of the image, 1 ≤ m ≤ C0, 1 ≤ Py ≤ H0, 1 ≤ M ≤ W0;
  • Step 12: during the T-th operation of the Ky-th row of k convolution kernels, read first weight data from the storage unit according to a preset weight read bit width, the first weight data including the weight data of the k convolution kernels at the m-th channel, the Ky-th row, and convolution kernel position T, 1 ≤ k ≤ N, 1 ≤ T ≤ K, 1 ≤ Ky ≤ K;
  • Step 13: according to the stride Sx of the convolution kernel, select a pixel data corresponding to the convolution kernel position T from the first pixel data as second pixel data, 1 ≤ a ≤ M;
  • Step 14: when T > 1, for the q-th column of MACs in the MAC array, multiply the second pixel data by the q-th weight data in the first weight data and add the products to the results of the (T-1)-th operation to obtain a first convolution operation results of the T-th operation of the q-th column of MACs, 1 ≤ q ≤ k. A sketch of these steps is given below.
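  • As a rough illustration of steps 11 to 14, the following Python sketch (not part of the patent; the function and variable names are hypothetical, and a dilation rate of 1 and a = 4 selected pixels are assumed) performs the multiply-accumulate of one T-th operation for every column of MACs:

```python
def mac_column_step(first_pixel, first_weights, acc, Sx, T, a=4):
    """first_pixel: M consecutive pixel values of channel m, row Py (step 11).
    first_weights: the k weights at (channel m, row Ky, position T), one per kernel (step 12).
    acc[q][i]: partial sums carried over from the (T-1)-th operation, updated in place (step 14)."""
    # Step 13: gate the a pixels corresponding to kernel position T, i.e. X[(T-1) + i*Sx].
    second_pixel = [first_pixel[(T - 1) + i * Sx] for i in range(a)]
    for q, w in enumerate(first_weights):      # one MAC column per convolution kernel
        for i, x in enumerate(second_pixel):   # the a rows of the MAC array work in parallel
            acc[q][i] += x * w                 # multiply-accumulate
    return acc
```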
  • the data processing method in the embodiment of the present disclosure may be applied to an artificial intelligence processor.
  • The parameters required for performing the convolution operation may be obtained through primitive parameters, and the primitive parameters may include the data required for performing the convolution operation, for example: the image size W0 × H0 × C0, the convolution kernel size K × K × C0, the number of convolution kernels N, the row-direction stride Sx, the column-direction stride Sy, the dilation rate Ex, the padding parameter, and the bias parameter Bias. The specific form of the primitive parameters is not limited in this embodiment of the present disclosure.
  • In this way, a pixel data corresponding to the convolution kernel position T are selected from the first pixel data as second pixel data, and a first convolution operation results of the T-th operation of each column of MACs are obtained; a multi-point parallel convolution operation between multiple pixel data and the weight data of multiple convolution kernels is thus realized in each operation, thereby improving the efficiency of the convolution operation and the operating efficiency of the artificial intelligence processor.
  • In a possible implementation manner, the data processing method may further include: storing the pixel data of the image according to the pixel storage bit width, wherein the pixel storage bit width is consistent with the pixel read bit width, so that in step 11 the first pixel data can be read from the storage unit according to the preset pixel read bit width.
  • Storing the pixel data of the image may include: dividing the pixel data of the m-th channel and the Py-th row of the image into B first storage vectors, each of b consecutive pixel data, where B = ⌈W0/b⌉ and 1 ≤ b ≤ W0; for each first storage vector, splitting the first storage vector into E second storage vectors of b bytes each, the b bytes being less than or equal to the pixel storage bit width; and storing the E second storage vectors in sequence according to the pixel storage bit width, filling the address space that falls short of the pixel storage bit width with 0, thereby storing the pixel data of the m-th channel and the Py-th row. A sketch of this layout is given after this paragraph.
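  • The following sketch illustrates the alignment and zero-padding described above under assumed parameters (b = 16 pixels per first storage vector, a 32-byte pixel storage bit width, and 1-byte uint8 pixels); the helper name and exact packing are illustrative, not the patent's implementation:

```python
def store_row(row_pixels, b=16, storage_bit_width=32):
    """row_pixels: the W0 uint8 pixels of one channel and one row. Returns the padded byte stream."""
    stored = bytearray()
    for start in range(0, len(row_pixels), b):            # B = ceil(W0 / b) first storage vectors
        chunk = bytes(row_pixels[start:start + b])         # 1 byte per pixel is assumed here
        stored += chunk + bytes(b - len(chunk))            # zero-fill a short final vector
    stored += bytes((-len(stored)) % storage_bit_width)    # align the whole row to the bit width
    return bytes(stored)

# e.g. a 138-pixel row gives 9 first storage vectors (the last holding 10 pixels),
# which are zero-padded and aligned to 32-byte boundaries in the storage unit.
```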
  • For example, suppose the pixel storage bit width is 32 bytes. If the ninth second storage vector contains only 10 pixel data, that is, its width is 10 B, which is less than 32 B, the address space of that second storage vector that falls short of the storage bit width is filled with 0 in the storage unit; after the padding, this is equivalent to having stored the pixel data of the first channel and the first row of the image.
  • After the pixel data of the first channel and the first row are stored in sequence, the pixel data of the first channel and the second row are stored, and so on until all rows of the first channel are stored; then the pixel data of all rows of the second channel are stored, until the pixel data of all channels have been stored.
  • the number E of the second storage vectors obtained by splitting is related to the second storage unit of the pixel data, and the second storage unit of the pixel data is determined according to the data type of the pixel data. For example, if the second storage unit of pixel data is 2 bytes, for each first storage vector, the first storage vector may be divided into 8 second storage vectors every 16 bytes.
  • The data type may include multi-precision data types such as ternary (-1, 0, 1), int8, uint8, etc.
  • the embodiment of the present disclosure does not limit the data type of pixel data.
  • The specific value of b can be set according to actual needs. For example, the number of pixel data in a row is often a multiple of 16, so b can be set to a multiple of 16, for example, 16 or 32, which is not limited in this embodiment of the present disclosure.
  • the b byte is less than or equal to the pixel storage bit width, so as to align and store the pixel data in the storage unit.
  • the pixel storage bit width may be the storage width of pixel data in the storage unit set according to actual requirements.
  • the pixel storage bit width may be a multiple of 16, for example, It may be 16B, 32B, or 64B, etc., which is not limited in this embodiment of the present disclosure.
  • the pixel storage bit width may be consistent with the pixel read bit width.
  • FIG. 2 shows a schematic diagram of storing pixel data according to an embodiment of the present disclosure.
  • Px represents the Px-th column of the image X
  • Py represents the Py-th row of the image X
  • RGB represents the red, green, and blue channels of the image.
  • As shown in FIG. 2, the first 16 B row of storage space stores the 0th to 15th pixel data of the first row of the R channel of image X, namely X[0][0:15], and so on up to X[0][Px-1], indicating that the pixel data of the first row have been stored; the address space of this row that falls short of 16 B is filled with 0, and the pixel data of the second row are stored after the pixel data of the first row.
  • the storage efficiency of the pixel data can be improved, and it is convenient to read the pixel data corresponding to the weight data from the storage unit.
  • the data processing method may further include: storing weight data of N convolution kernels according to the weight storage bit width.
  • the weight storage bit width is consistent with the weight read bit width, so that in step 12, the first weight data is read from the storage unit according to the preset weight read bit width.
  • Storing the weight data of the N convolution kernels according to the weight storage bit width may include: for each of the N convolution kernels, vertically arranging the weight data of the convolution kernel into a first weight vector in the order of the row direction, the column direction, and the channel C0 direction; horizontally aligning and merging the first weight vectors of the N convolution kernels into a first weight matrix; and horizontally storing the weight data in the first weight matrix according to the weight storage bit width, as sketched below.
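  • A minimal NumPy sketch of this arrangement (assuming the kernels are given as an array of shape (N, C0, K, K); the function name is illustrative):

```python
import numpy as np

def first_weight_matrix(kernels):
    """kernels: array of shape (N, C0, K, K) = (kernel, channel, Ky, Kx).
    Flattening each kernel in C order makes Kx vary fastest, then Ky, then C0, which matches
    the row-direction / column-direction / channel-C0 order described above."""
    N = kernels.shape[0]
    columns = [kernels[n].reshape(-1) for n in range(N)]   # first weight vectors, length K*K*C0
    return np.stack(columns, axis=1)                       # first weight matrix, width N
```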
  • FIG. 3 shows a schematic diagram of N convolution kernels according to an embodiment of the present disclosure.
  • Fig. 4a shows a schematic diagram of a first weight vector according to an embodiment of the present disclosure.
  • FIG. 4b shows a schematic diagram of a first weight matrix according to an embodiment of the present disclosure.
  • the convolution kernel 1 can be longitudinally arranged in the order of row direction, column direction and channel C 0 .
  • The first weight vectors corresponding to the other convolution kernels are obtained by analogy and are not repeated here; the first weight vectors corresponding to the N convolution kernels are horizontally aligned and merged into the first weight matrix shown in FIG. 4b, and the first weight matrix is then stored according to the weight storage bit width, thereby realizing the storage of the weight data.
  • the sequential storage of the weight data can be realized, and the storage efficiency of the weight data can be improved.
  • In a possible implementation manner, horizontally storing the weight data in the first weight matrix according to the weight storage bit width may include: when N is greater than the number of columns Q of the MAC array, vertically splitting the first weight matrix every Q columns to obtain F second weight matrices, where F = ⌈N/Q⌉ (the round-up value of N/Q); when the width of the second weight matrix is less than or equal to the weight storage bit width, storing the weight data in the f-th second weight matrix in the order of the row direction and then the column direction, 1 ≤ f ≤ F, filling the address space that falls short of the weight storage bit width with 0; and arranging the (f-1)-th second weight matrix before the f-th second weight matrix; wherein the width of the second weight matrix is equal to Q multiplied by the first storage unit of the weight data, and the first storage unit of the weight data is determined according to the data type of the weight data. A sketch of this splitting is given below.
  • The data type may include multi-precision data types such as ternary (-1, 0, 1), int8, uint8, etc.
  • the embodiment of the present disclosure does not limit the data type of the weight data.
  • For example, suppose the weight storage bit width is 32 bytes (B), there are 64 convolution kernels, the number of columns of the MAC array is 32, and the first storage unit of the weight data is 2 bits. The first weight matrix is then split into two second weight matrices, and the first second weight matrix is arranged before the second second weight matrix.
  • The weight data in the first second weight matrix are stored first, in the order of the row direction and then the column direction, and then the weight data in the second second weight matrix are stored in the same order. For each row of weight data, the address space in the storage unit that falls short of the weight storage bit width is filled with 0; after the padding, this is equivalent to having stored one row of weight data of the second weight matrix, and after the weight data of the current row are stored, the next row of weight data are stored.
  • In a possible implementation manner, when the width of the second weight matrix is greater than the weight storage bit width, for the f-th second weight matrix, the f-th second weight matrix is vertically split every weight storage bit width to obtain F0 third weight matrices, where F0 = ⌈(width of the second weight matrix) / (weight storage bit width)⌉; the weight data in the f0-th third weight matrix are stored in the order of the row direction and then the column direction, 1 ≤ f0 ≤ F0; and the (f0-1)-th third weight matrix is arranged before the f0-th third weight matrix.
  • For example, suppose the weight storage bit width is 32 B, there are 64 convolution kernels, the number of columns of the MAC array is 32, and the first storage unit of the weight data is 2 B. The width of each second weight matrix is then 32 × 2 B = 64 B, which is greater than 32 B, so each second weight matrix can be vertically split every 32 B to obtain two third weight matrices; this is equivalent to vertically splitting the first weight matrix into 4 third weight matrices every 32 B, and the weight data in each third weight matrix are stored in the order of the row direction and then the column direction.
  • The number N of convolution kernels may also be less than or equal to the number of columns Q of the MAC array. In a possible implementation manner, horizontally storing the weight data in the first weight matrix according to the weight storage bit width may further include: when N is less than or equal to the number of columns Q of the MAC array and the width of the first weight matrix is greater than the weight storage bit width, vertically splitting the first weight matrix every weight storage bit width to obtain F1 fourth weight matrices; storing the weight data in the f1-th fourth weight matrix in the order of the row direction and then the column direction, 1 ≤ f1 ≤ F1; and arranging the (f1-1)-th fourth weight matrix before the f1-th fourth weight matrix; wherein the width of the first weight matrix is equal to N multiplied by the first storage unit of the weight data.
  • For example, suppose the weight storage bit width is 32 B, there are 16 convolution kernels, the number of columns of the MAC array is 32, and the first storage unit of the weight data is 4 B. The number of convolution kernels is less than the number of columns of the MAC array, and the width of the first weight matrix is 16 × 4 B = 64 B, which is greater than 32 B, so the first weight matrix can be vertically split every 32 B to obtain two fourth weight matrices; the two fourth weight matrices are stored in the order of the row direction and then the column direction, with the first fourth weight matrix arranged before the second fourth weight matrix.
  • When N is less than or equal to the number of columns Q of the MAC array and the width of the first weight matrix is less than or equal to the weight storage bit width, the weight data in the first weight matrix are stored directly in the order of the row direction and then the column direction, and for each row of weight data the address space in the storage unit that falls short of the weight storage bit width is filled with 0.
  • FIG. 5 shows a schematic diagram of storing weight data according to an embodiment of the present disclosure.
  • Kx represents the Kx column of the convolution kernel
  • Ky represents the Ky-th row of the convolution kernel
  • RGB represents the three channels of the convolution kernel corresponding to the red, green and blue channels of the image
  • F0 represents the first target weight matrix
  • F1 Represents the second target weight matrix, and so on.
  • R channel_F0 indicates that the first target weight matrix under the first channel of the convolution kernels is stored: the first 32 B stores the first row of the first target weight matrix, the second 32 B stores the second row of the first target weight matrix, and so on, where [0,0] denotes the first weight data of the first row under this channel and [Ky-1,Kx-1] denotes the Kx-th weight data of the Ky-th row under this channel. The first target weight matrix F0 is arranged before the second target weight matrix F1, and the address space that falls short of the weight storage bit width is filled with 0.
  • The weight storage bit width is the storage width of the weight data in the storage unit and can be set according to actual requirements.
  • the number of convolution kernels in the convolution layer is usually a multiple of 16 , for example, 32, 64, 128, 256, etc.
  • the weight storage bit width can be set to be a multiple of 16, for example, 32 bytes, 64 bytes, etc., which is not limited by this embodiment of the present disclosure.
  • the weight storage bit width and the weight read bit width may be consistent, so that the cache module reads the first weight data from the storage unit in step 12 according to the preset weight read bit width.
  • In this way, the weight data are stored according to the number of columns Q of the MAC array and the weight storage bit width, which can improve the storage efficiency of the weight data, so that the weight data sequentially read from the storage unit in each operation correspond to the pixel data, further improving the efficiency of the convolution operation.
  • In a possible implementation manner, the operation unit of each processing core of the artificial intelligence processor may further include at least one cache module, and the cache module may be configured to read pixel data from the storage unit according to the preset pixel read bit width and to read weight data from the storage unit according to the preset weight read bit width. In that case, reading the first pixel data from the storage unit according to the preset pixel read bit width in step 11 may be performed by reading the first pixel data from the storage unit through the at least one cache module, and reading the first weight data from the storage unit according to the preset weight read bit width in step 12 may be performed by reading the first weight data from the storage unit through the at least one cache module.
  • the cache module may use a register, a dual-port random access memory, a non-volatile memory, or other memory that can implement shift fetching, which is not limited to this embodiment of the present disclosure.
  • the size and quantity of the cache module may be set according to actual requirements.
  • The cache module may be larger than the pixel read bit width and the weight read bit width. For example, if the pixel read bit width is 32 B, a 48 B register can be selected to ensure continuous loading of data during the operation, thereby ensuring the continuity of the operation.
  • One or more cache modules can be used according to actual requirements. For example, a 48 B cache can be implemented either with three 16 B registers or with a single 48 B register; selecting multiple cache modules allows the cache modules to be multiplexed and improves resource utilization.
  • If the pixel data or weight data to be loaded are narrower than the cache module, the cache module loads the data at that width and the remaining storage space in the cache module is filled with 0. For example, for a 48 B register, if the pixel data or weight data are less than 16 B, 16 B of data are loaded and the remaining storage space in the register is filled with 0; if the loaded pixel data or weight data are less than 32 B, 32 B of data are loaded and the remaining storage space in the register is filled with 0.
  • In a possible implementation manner, the reading of the first pixel data in step 11 may be continuous: the cache module may read continuous pixel data from the storage unit to ensure the continuity of the operation. For example, if a 48 B register is used to read data, whenever 16 B of data has been shifted out of the register, the register loads the next 16 B of data from the storage unit, thereby ensuring the continuity of the operation.
  • reading the first weight data from the storage unit according to the preset weight reading bit width may include:
  • the target weight matrix may include the second weight matrix or the third weight matrix.
  • the target weight matrix may further include a first weight matrix or a fourth weight matrix.
  • the convolution kernel position T may refer to the T th weight data of the Ky th row of the m th channel of the convolution kernel.
  • Reading the weight data from the storage unit may proceed as follows: after the start address of the weight data to be read is determined, that is, the storage address corresponding to the weight data in the L-th row of the target weight matrix, the weight data of the (L+T-1)-th row are read by sequential addressing, that is, by incrementing the address by 1. A sketch of this address arithmetic is given below.
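  • The address arithmetic below is an illustrative reading of this sequential addressing (it assumes each stored row of the target weight matrix occupies exactly one weight-storage-bit-width slot, which is an assumption of the sketch, and the name is hypothetical):

```python
def weight_address_for_T(row_L_addr, T, weight_bit_width=32):
    """row_L_addr: storage address of the L-th row of the target weight matrix, i.e. the start
    of the current kernel row Ky. The (L + T - 1)-th row is reached by advancing T - 1 rows."""
    return row_L_addr + (T - 1) * weight_bit_width
```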
  • FIG. 6 shows a schematic diagram of splitting a second weight matrix according to an embodiment of the present disclosure.
  • For example, the weight data of the first row of the second weight matrix, such as a1 and e1, can be read, which is equivalent to reading one weight data from each of the 32 convolution kernels; then the weight data of the second row of the second weight matrix, such as a2 and e2, can be read, which is equivalent to reading the next weight data of the k convolution kernels. In this way, the first weight data are read row by row.
  • In a possible implementation manner, in step 13, selecting, according to the stride Sx of the convolution kernel, a pixel data corresponding to the convolution kernel position T from the first pixel data as the second pixel data may include the following.
  • When implementing the convolution operation between the convolution kernel and the image, the convolution kernel usually slides according to the stride in the row direction and the column direction.
  • The dilation rate Ex of the convolution kernel can be set according to the actual convolution operation requirements; when the dilation rate Ex is greater than 1, a dilated convolution operation is performed.
  • The value of a may be less than or equal to the number of rows A of the MAC array; for example, for a 4 × 32 MAC array, a may be an integer in [1, 4].
  • In this way, a second pixel data are selected from the first pixel data according to the stride Sx and the dilation rate Ex, so that the selected pixel data correspond to the weight data of the multiple convolution kernels; the convolution operation between the pixel data and the weight data can thus be realized accurately, and the dilated convolution operation can also be supported, as illustrated by the sketch below.
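  • The gating of step 13 can be summarized by the index pattern below (an illustrative sketch; the helper name is hypothetical):

```python
def gate_second_pixels(first_pixel, T, Sx, Ex=1, a=4):
    """Return the a pixel values X[(T-1)*Ex + i*Sx], i = 0..a-1, used by the T-th operation."""
    return [first_pixel[(T - 1) * Ex + i * Sx] for i in range(a)]

# e.g. T = 2, Sx = 1, Ex = 1 selects X[1], X[2], X[3], X[4];
#      T = 2, Sx = 1, Ex = 2 (dilated convolution) selects X[2], X[3], X[4], X[5].
```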
  • The first weight data read in step 12 and the second pixel data selected in step 13 can be input into the MAC array through the cache module, and the multiply-accumulate of the second pixel data and the corresponding weight data, that is, the convolution operation of the weight data and the pixel data, is then realized in step 14.
  • FIG. 7 shows a schematic structural diagram of a MAC array according to an embodiment of the present disclosure.
  • The MAC array shown in FIG. 7 is used as an example for illustration. As shown in FIG. 7, each circle contains 4 MACs, so a can be 4; there are 5 columns of MACs, so k can be 5.
  • The selected second pixel data X[Ex], X[Sx+Ex], X[2Sx+Ex], X[3Sx+Ex] are input into the MAC array along the row direction, and the first weight data of the 5 convolution kernels at the m-th channel, the Ky-th row, and convolution kernel position 2 are input into the MAC array along the column direction. For the q-th column of MACs in the MAC array, the products of the 4 pixel data and the q-th first weight data are obtained respectively and accumulated with the products of the second pixel data (X[0], X[Sx], X[2Sx], X[3Sx]) and the first weight data (the weight data of the 5 convolution kernels at the m-th channel, the Ky-th row, and convolution kernel position 1) obtained in the previous operation.
  • In this way, a multi-point parallel convolution operation between multiple pixel data and the weight data of multiple convolution kernels can be implemented in each operation, so that the convolution operation is carried out efficiently and the operating efficiency of the artificial intelligence processor is improved, as in the sketch below.
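  • Functionally, one such operation of the crossbar-style MAC array amounts to an outer product of the gated pixels and the per-kernel weights, accumulated onto the previous partial sums; the NumPy line below is an illustration of that behaviour, not a model of the hardware:

```python
import numpy as np

def mac_array_step(acc, second_pixel, first_weights):
    """acc: (a, k) partial sums from the (T-1)-th operation; returns the updated (a, k) sums."""
    return acc + np.outer(np.asarray(second_pixel), np.asarray(first_weights))
```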
  • When T = 1, for the q-th column of MACs, the second pixel data are multiplied by the q-th weight data in the first weight data and added to the convolution operation result of the K-th operation of the (Ky-1)-th row to obtain a first convolution operation results of the 1st operation of the q-th column of MACs, 1 ≤ q ≤ k.
  • the convolution operation result of the Kth operation in the Ky-1th row can be obtained by using the processing methods disclosed in steps 11 to 14 in the above embodiments of the present disclosure, and details are not described herein again.
  • the cyclic accumulation of the convolution operation results of each row of weight data and the corresponding pixel data can be implemented, so as to obtain the convolution operation result of the mth channel.
  • In a possible implementation manner, the data processing method may further include: step 16, for the q-th column of MACs, after completing the operations of the K rows of the k convolution kernels, obtaining a second convolution operation results of the m-th channel; and step 17, after the convolution operation results of the C0 channels are obtained, adding the convolution operation results of the C0 channels of each convolution kernel to obtain a target convolution operation results output by the q-th column of MACs.
  • The convolution operation result of each channel is obtained by accumulating the convolution operation results of each row of the convolution kernel under that channel; the results of the K rows of operations of the k convolution kernels in step 16 can be obtained by the processing disclosed in steps 11 to 15, and details are not repeated here.
  • In this way, a target convolution operation results output by the q-th column of MACs can be obtained, which is equivalent to obtaining the values of a adjacent points in the same row of each of the k output maps.
  • the data processing method may further include:
  • determining, according to the first storage vector corresponding to the pixel data at X[aSx] of the m-th channel and the Py-th row of the image, the first storage start address in the storage unit corresponding to that first storage vector;
  • reading third pixel data from the storage unit according to the preset pixel read bit width and the first storage start address, the third pixel data including M consecutive pixel data read from the first storage start address, so that the operation unit can continue the operation.
  • In a possible implementation manner, the size of the output map can be obtained from parameters such as the size of the input image, the size of the convolution kernel, the stride Sx in the row direction, and the stride Sy in the column direction, and can be used to determine whether the convolution operation between all pixel data and weight data of the input image in the row direction has been completed when the target convolution operation results are calculated. The output size follows the standard convolution formula P_out = ⌊(P_in - K + 2 × padding) / S⌋ + 1, where P_out is the width or height of the output image, P_in is the width or height of the input image, and S is the stride in the row direction or the column direction, respectively.
  • For example, if the width of the output image is calculated to be 16 and the current q-th column of MACs has output only 4 target convolution operation results, it means that the convolution of all pixel data and weight data of the input image in the row direction has not been completed. A small sketch of this bookkeeping follows.
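  • The helpers below are an illustrative sketch of this bookkeeping (the completion check mirrors the example above, and the padding term follows the standard convolution formula; names are hypothetical):

```python
def out_size(p_in, k, stride, padding=0):
    """Standard convolution output size: floor((P_in - K + 2*padding) / S) + 1."""
    return (p_in - k + 2 * padding) // stride + 1

def row_finished(results_so_far, p_in, k, stride):
    """True once a full output row has been produced, e.g. 16 points in the example above."""
    return results_so_far >= out_size(p_in, k, stride)
```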
  • In that case, the first storage start address in the storage unit is determined according to the first storage vector corresponding to the pixel data at X[aSx] of the m-th channel and the Py-th row of the image. For the first pixel data of the Py-th row, the second pixel data at X[0], X[Sx], X[2Sx], X[3Sx], ..., X[(a-1)Sx] of the m-th channel were selected previously, so according to the stride Sx of the convolution kernel, the second pixel data to be selected next for the convolution operation start from X[aSx].
  • Determining the first storage start address corresponding to the first storage vector that contains the pixel data at X[aSx], and then reading the third pixel data from the storage unit according to the preset pixel read bit width and the first storage start address, makes it simple and fast to determine the start address from which the cache module fetches data from the storage unit; since the first storage vectors are stored in alignment in the storage unit, this also facilitates the fetching of the cache module.
  • In a possible implementation manner, the first storage vector corresponding to the pixel data at X[aSx] of the m-th channel and the Py-th row of the image can be determined by comparing aSx with nb-1, n ∈ [1, B].
  • For example, suppose each first storage vector is obtained by dividing every 16 pixel data, so that the first storage vectors cover the pixel data [0,15], [16,31], [32,47], and so on. If aSx = 12, since 12 is less than 15, the corresponding first storage vector is [0,15], and data need to be read from the storage unit starting from the 0th pixel data, as in the sketch below.
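  • The following address computation is an illustrative reading of this lookup (it assumes b = 16 pixels per first storage vector and 1-byte pixels, so one first storage vector occupies 16 bytes; the names are hypothetical):

```python
def first_storage_start(row_base_addr, a, Sx, b=16, bytes_per_pixel=1):
    """Locate the first storage vector containing pixel X[a*Sx] of the current channel and row:
    the n-th vector covers pixels [(n-1)*b, n*b - 1], so its zero-based index is a*Sx // b."""
    vector_index = (a * Sx) // b
    return row_base_addr + vector_index * b * bytes_per_pixel

# e.g. a*Sx = 12 falls inside the vector covering pixels [0, 15], so reading restarts at the
# row base address, matching the example above.
```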
  • In a possible implementation manner, according to the stride Sx of the convolution kernel, a pixel data corresponding to the convolution kernel position T may likewise be selected from the third pixel data as the second pixel data, with the pixel data at X[aSx] of the Py-th row determined as the starting position of the selection.
  • Determining the first storage start address in the storage unit from the first storage vector corresponding to the pixel data at X[aSx] is one implementation provided by the embodiments of the present disclosure, but those skilled in the art can understand that the present disclosure should not be limited thereto. Inspired by the embodiments of the present disclosure, those skilled in the art can also, for example, determine the first storage vector corresponding to the pixel data at X[2aSx] of the m-th channel and the Py-th row of the image, determine the first storage start address in the storage unit corresponding to that first storage vector, and so on; the embodiments of the present disclosure are not exhaustive.
  • the third pixel data is equivalent to the first pixel data in step 11.
  • The data processing described in steps 11 to 16 of the above embodiments of the present disclosure can then be used to obtain another group of a target convolution operation results output by each column of MACs, so that the convolution of all pixel data and weight data of the image in the row direction can be completed, that is, all values in the same row of the output map can be obtained.
  • In a possible implementation manner, the data processing method may further include: after completing the convolution operation of the k convolution kernels and the K rows of pixel data, determining, according to the stride Sy of the convolution kernel in the column direction, the second storage start address of the first pixel data of the row that is Sy-1 rows apart from the first of the K rows of pixel data; and reading fourth pixel data from the storage unit according to the preset pixel read bit width and the second storage start address, the fourth pixel data including M consecutive pixel data read from the second storage start address, so that the operation unit can continue the operation. A sketch of this column-direction jump follows.
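  • A sketch of the column-direction jump (illustrative; it assumes each stored image row occupies a fixed row_stride bytes after the alignment described earlier, which is an assumption of the sketch):

```python
def second_storage_start(channel_base_addr, first_row_index, Sy, row_stride):
    """Start address of the row lying Sy rows below the first of the K rows just processed
    (i.e. separated from it by Sy - 1 rows)."""
    return channel_base_addr + (first_row_index + Sy) * row_stride
```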
  • The fourth pixel data are equivalent to the first pixel data; the data processing disclosed in steps 11 to 17 of the above embodiments of the present disclosure can then be used to obtain the a target convolution operation results output by the q-th column of MACs, thereby completing the convolution operation between the convolution kernels and the image.
  • The output map in the embodiments of the present disclosure may refer to a feature map obtained by the convolution operation; the input image and the image may refer to the original image, or to a feature map on which a convolution operation has already been performed, which is not limited in this embodiment of the present disclosure.
  • Fig. 8a shows a block diagram of an artificial intelligence processor according to an embodiment of the present disclosure
  • Fig. 8b shows a block diagram of a processing core according to an embodiment of the present disclosure.
  • the artificial intelligence processor 100 includes a plurality of processing cores 101 , and as shown in FIG. 8 b , each processing core 101 includes a storage unit 102 and an operation unit 103 .
  • the storage unit 102 is used to store the pixel data of the image and the weight data of the N convolution kernels;
  • the operation unit 103 includes a multiplier-accumulator MAC array 104, which is used for performing the processing according to the pixel data and the weight data. operation.
  • In a possible implementation manner, the operation unit may further include at least one cache module 105, and the cache module is configured to read pixel data from the storage unit 102 according to a preset pixel read bit width and to read weight data from the storage unit 102 according to a preset weight read bit width.
  • the buffering module 105 may send the gated data into the MAC array for convolution operation, and output the convolution operation result to the address space specified by the address generating module 106 in the storage unit.
  • the operation unit may further include an address generation module 106 for generating an address pointer when the cache module reads data, so that the cache module 105 can implement sequential addressing and/or jump addressing according to the address pointer .
  • The MAC array 104 includes an array based on a crossbar matrix structure.
  • the MAC array 104 can be expanded into two dimensions of rows and columns, and can support multi-point parallel convolution operations.
  • the processing core 101 may perform a convolution operation by using the data processing method described in any one of the foregoing embodiments of the present disclosure.
  • The storage unit 102 can store data according to a specific storage logic for the pixel data and the weight data. The storage logic for the pixel data includes: the image of each channel is stored in sequence; the pixel data of each channel are expanded into a vector along the image width direction, with every b consecutive pixel data stored as one storage vector; each storage vector is split into multiple pieces according to b-byte alignment and stored one by one; different pixel data are stored first along the image width direction and then along the image height direction; and the entire image is stored aligned to the pixel storage bit width, with the shortfall filled with zeros to facilitate register fetching and calculation, where the pixel storage bit width is greater than or equal to the set b bytes.
  • the weight data and the pixel data may specify a storage address in the storage unit 102 .
  • the cache module 105 when reading the weight data, the cache module 105 can read the weight data from the starting address of the storage address in a manner of adding one to the address.
  • When reading pixel data, an address jump may be generated, that is, the pixel data are read by jumping between rows; for this purpose, a configurable address jump value can be set in the primitive parameters.
  • The address generation module 106 generates the target address according to the address jump value and counts using the loop clock counter built into the artificial intelligence processor; after the count meets the jump condition, the loop clock counter generates a jump signal, and through the jump signal instructs the cache module 105 to jump the address pointer according to the target address generated by the address generation module 106.
  • By using the artificial intelligence processor of the embodiments of the present disclosure, efficient convolution operations can be implemented and the operating efficiency of the artificial intelligence processor can be improved.
  • In an application example, the convolution kernels are four-dimensional data with a total of K × K × C0 × N weight data; using the number of output maps N (the number of convolution kernels, i.e., the number of output channels) as the vector length, the weight data are expanded into a weight matrix with a height of Kx × Ky × C0 and a width of N, in which the weight data are arranged along the height direction in the order of the row direction, the column direction, and the channel C0 direction.
  • Each channel of the input image is expanded into a vector along the width direction; every 16 pixel data are stored consecutively as a first storage vector, each first storage vector is split into multiple second storage vectors according to 16 B alignment, and the second storage vectors are stored one by one.
  • the storage sequence of the input images in the storage unit is in the row direction first, and then in the column direction.
  • the entire input image is aligned according to 32B in the storage unit, and the address space less than 32B is filled with zeros to facilitate register fetching and calculation.
  • a 48B shift register or three 16B registers can be used to read data from the storage unit.
  • 48 B of data can be loaded in 3 clocks: for example, the adjacent 48 B (the 0th to 47th pixels) of the first row of the first channel (usually the R channel) of the input image are loaded into the 48 B register over three clocks, 16 B at a time. If the width of the input image is less than 16 B, only 16 B of data are loaded and the shortfall is filled with zeros; if the width of the input image is less than 32 B, only 32 B of data are loaded and the shortfall is filled with zeros.
  • the reading operation can be controlled by the cyclic clock counter.
  • When data are selected from the register and output to the MAC array for operation, whenever the register has shifted out one 16 B block of data, it loads the next 16 B of data to maintain the continuity of the operation.
  • With a 4 × 32 2D MAC array, up to 4 pixel data can be multiplied simultaneously with the weight data of up to 32 convolution kernels at the same position. For example, in one operation, the 4 pixel data X[0], X[Sx], X[2Sx], X[3Sx] in the register can be gated and multiplied with the first weight data of the first row of the first channel of the 32 convolution kernels.
  • Step 1: obtain the primitive parameters.
  • Step 2: load the adjacent 48 B (pixels 0 to 47) of the first row of the R channel of the image into the 48 B register over 3 clocks, 16 B at a time. Select the 4 pixel data X[0], X[Sx], X[2Sx], X[3Sx] from the register, send them to the 2D MAC array, and multiply them with the weight data at the same position of the 32 convolution kernels; 32 convolution operation results are obtained in parallel.
  • Step 3: then shift along the row direction, gate X[Ex], X[Ex+Sx], X[Ex+2Sx], X[Ex+3Sx], and multiply them with the corresponding weight data, until the convolution operation of the K pixel data with the corresponding weight data is completed.
  • Step 4: jump to the next row, read its 48 B of pixel data, select 4 pixel data at a time along the row direction, and perform the convolution operation with the corresponding weight data.
  • Step 5: repeat the above steps 1 to 4 until the K × K convolution operations under the R channel are completed, then compute the convolution operations of the other channels, such as the G channel and the B channel, respectively; the output maps can then be obtained.
  • This yields the four adjacent points P0[0,0], P0[0,1], P0[0,2], P0[0,3] of the output maps of the first 32 channels, which need to be written back to the storage unit.
  • Repeat the above steps 1 to 5 until the four adjacent points P0[0,0], P0[0,1], P0[0,2], P0[0,3] of the output maps of all channels are obtained.
  • Step 6: determine whether the starting position of the pixel data to be read by the register for the second sliding-window pass exceeds the position of the 15th pixel. If it exceeds 15, read 48 B starting from the address of the 16th to 63rd pixels of the first row; otherwise, still read the pixel data starting from the address of the 0th to 47th pixels. Still using the 48 B register to read the data, select the pixel data at X[4Sx], X[5Sx], X[6Sx], X[7Sx] from the register and perform the convolution calculation until the K × K convolution operations are completed, obtaining the next four adjacent points in the same row of the 32 output maps: P0[0,4], P0[0,5], P0[0,6], P0[0,7].
  • Step 7: after obtaining the data of the first row of the output map, start calculating the data of the next row of the output map; at this time, read the corresponding pixel data starting from row 0+Sy of the input image and perform the operations of steps 1 to 6 above. A reference sketch of this overall loop structure is given below.
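  • As a functional reference for the loop structure of steps 1 to 7 (plain NumPy, with no registers, addressing, or 48 B loads modelled; the (C0, H0, W0) image layout and (N, C0, K, K) kernel layout are assumptions of the sketch):

```python
import numpy as np

def convolve_reference(image, kernels, Sx=1, Sy=1):
    """image: (C0, H0, W0); kernels: (N, C0, K, K). Returns the N output maps (no padding)."""
    C0, H0, W0 = image.shape
    N, _, K, _ = kernels.shape
    out_w = (W0 - K) // Sx + 1
    out_h = (H0 - K) // Sy + 1
    out = np.zeros((N, out_h, out_w))
    for oy in range(out_h):                  # step 7: next output row (jump Sy input rows)
        for ox in range(out_w):              # steps 2 and 6: slide along the row direction
            for c in range(C0):              # step 5: R channel, then G, then B
                patch = image[c, oy * Sy:oy * Sy + K, ox * Sx:ox * Sx + K]
                out[:, oy, ox] += (patch * kernels[:, c]).sum(axis=(1, 2))  # steps 2 to 4
    return out
```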
  • In this way, a maximum of 4 pixel data can be selected from the register at a time, and simultaneously one weight data at the same position of each of a maximum of 32 convolution kernels can be selected for the convolution operation.
  • FIG. 9a shows a schematic diagram of selecting pixel data according to an embodiment of the present disclosure.
  • As shown in FIG. 9a, the stride Sx of the convolution kernel is 3, and the 3 registers "Reg[0], Reg[1], Reg[3]" read the first "0-47" pixel data of the first row X[0] of the image X.
  • For the first operation, X[0], X[Sx], X[2Sx], X[3Sx] are selected in the register, that is, the 1st, 4th, 7th, and 10th pixel data "0, 3, 6, 9";
  • for the second operation, X[1], X[Sx+1], X[2Sx+1], X[3Sx+1] are selected, that is, the pixel data "1, 4, 7, A";
  • for the third operation, X[2], X[Sx+2], X[2Sx+2], X[3Sx+2] are selected, that is, the pixel data "2, 5, 8, B", and so on up to the 11th operation.
  • Since the size of the convolution kernel is 11 × 11, after pixel data have been selected from the register and sent to the MAC array 11 times, this is equivalent to having computed the convolution of the weight data of the first row of the convolution kernel with the corresponding pixel data, as reproduced in the sketch below.
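  • The gating pattern of this example can be reproduced with a few lines of Python (illustrative only; Sx = 3, a kernel row length of 11, 4 output points in parallel, and a dilation rate of 1 are taken from the example above):

```python
Sx, K, a = 3, 11, 4
for t in range(1, K + 1):                       # the t-th operation of one kernel row
    idx = [(t - 1) + i * Sx for i in range(a)]  # pixels gated from the 48 B register
    print(f"operation {t:2d}: X{idx}")
# operation  1: X[0, 3, 6, 9]; operation  2: X[1, 4, 7, 10]; operation  3: X[2, 5, 8, 11]; ...
```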
  • After the convolution of the first row is computed, the register jumps to the start address corresponding to the pixel data of the second row of image X, loads the first 48 B of the pixel data of the second row, and the gating logic for selecting the pixel data is consistent with the above; the convolution operation is then performed between the weight data of the second row of the convolution kernel and the corresponding pixel data.
  • Next, the first 48 B of the pixel data of the third row are loaded, with the same data loading and data gating logic as above.
  • The address pointer of the register jumps in this way to the storage address corresponding to the pixel data of the next row for each fetch, until the K rows of data read have been computed, which is equivalent to completing the convolution between the R channel of the image and the first layer of weight data of the convolution kernel; the register then jumps to the start address of the first row of pixel data of the G channel and, with the same reading and gating logic as for the R channel, computes the convolution of the pixel data of the G channel with the second layer of weight data of the convolution kernel, and so on for the B channel. After the convolution of the three RGB channels is completed, 4 values of the same row of the 32 output maps can be obtained in parallel.
  • Fig. 9b shows yet another schematic diagram of selecting pixel data according to an embodiment of the present disclosure.
  • The storage layout of the weight data of the convolution kernels in the storage unit is consistent with the calculation process; therefore, during the cyclic calculation it is sufficient to increment the starting address of the weights by 1.
  • The storage order of the output map and the output data order of the MAC array also follow fixed rules, so the storage order of the output map can likewise be determined directly by the hardware-fixed logic.
  • The address jump value of each loop level can be set as a configurable primitive parameter.
  • The 2D MAC array based on the crossbar structure can support multi-point parallel operation on data by expanding along the two dimensions of rows and columns.
  • Each channel is stored in sequence: each channel is expanded into a vector along the image width direction, every 16 consecutive pixels are taken as one storage vector, and the vector is split into multiple pieces aligned to 16B and stored one by one.
  • The storage order of the pixel data in the storage unit is first along the image width direction and then along the image height direction.
  • The storage of the entire image is aligned to 32B, and any shortfall is zero-padded to facilitate the address calculation for register fetches.
  • The input image is stored in the order of the row direction, the column direction, and the channel direction, and the target row of the input image is extracted by multiple shift registers designed to implement dynamic data reading and gating logic.
  • By multiplying the pixel data in the register with the weight data of the corresponding row of the convolution kernel, multi-point parallel operation on the data is realized, and the multi-point convolution results can be output in parallel through continuous operation in a row-pipelined fashion.
  • A novel convolution operation logic and data storage scheme for a neuromorphic chip based on a many-core architecture is thus implemented, improving the efficiency of both the convolution operation between images and convolution kernels and the data storage.
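The window-start decision in Step 6 above can be illustrated with a small, speculative Python sketch. The threshold of the 15th pixel, the 48B window, and the group size of 4 output points are taken from the example; the function name and everything else are assumptions rather than the actual hardware logic.

```python
# A speculative sketch of the window-start decision in Step 6: for the next group
# of 4 output points, reload the 48B register from pixel 16 when the first needed
# pixel lies beyond pixel 15, otherwise keep reading from pixel 0.

def next_register_window(group_index, Sx, group_size=4, window_bytes=48):
    """Return (load_start, window) for the group_index-th group of output points in a row."""
    first_needed = group_index * group_size * Sx      # e.g. X[4*Sx] for the second group
    load_start = 16 if first_needed > 15 else 0       # Step 6 rule from the example
    return load_start, range(load_start, load_start + window_bytes)

# Second group (points P0[0,4..7]) with stride 3: the first pixel needed is X[12],
# so the register is still loaded from pixel 0; with stride 4 it would reload from 16.
print(next_register_window(1, 3))
print(next_register_window(1, 4))
```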

Abstract

A data processing method and an artificial intelligence processor. The method comprises: reading first pixel data from a storage unit according to a preset pixel reading bit width; during a T-th operation of a Ky-th row of k convolution kernels, reading first weight data from the storage unit according to a preset weight reading bit width, wherein the first weight data comprises weight data at an m-th channel, the Ky-th row, and a convolution kernel position T of the k convolution kernels; selecting, from the first pixel data and according to the stride Sx of the convolution kernels, "a" pieces of pixel data corresponding to the convolution kernel position T as second pixel data; and when T>1, for a q-th column of MACs in a MAC array, multiplying the second pixel data by q-th weight data in the first weight data, and adding same to the result of the (T-1)-th operation to obtain "a" first convolution operation results of the q-th column of MACs in the T-th operation. The data processing method can effectively improve the efficiency of convolution operation.

Description

Data processing method and artificial intelligence processor
Technical Field
The present disclosure relates to the field of computer technology, and in particular, to a data processing method and an artificial intelligence processor.
Background Technique
Neuromorphic chips are an important platform for realizing biologically interpretable brain-like algorithms, such as spiking neural networks, based on brain-like computing. Among these, the convolution operation is one of the important logical operations for the realization of artificial neural networks by neuromorphic chips based on a many-core architecture.
How to implement convolution operations efficiently on a neuromorphic chip is the key to improving the computing efficiency of neuromorphic chips.
SUMMARY OF THE INVENTION
In view of this, the present disclosure proposes a data processing method and an artificial intelligence processor to efficiently implement convolution operations.
According to an aspect of the present disclosure, a data processing method is provided, which is applied to a processing core of an artificial intelligence processor. The artificial intelligence processor includes a plurality of processing cores, and each processing core includes a storage unit and an operation unit. The storage unit is used to store the pixel data of an image and the weight data of N convolution kernels; the operation unit includes a multiplier-accumulator (MAC) array for performing operations according to the pixel data and the weight data, where the size of the image is W0×H0×C0, the size of each convolution kernel is K×K×C0, the stride in the row direction is Sx, and W0, H0, C0, K, and Sx are positive integers. The method includes: reading first pixel data from the storage unit according to a preset pixel read bit width, where the first pixel data includes M consecutive pixel data of the m-th channel and the Py-th row of the image, 1≤m≤C0, 1≤Py≤H0, 1<M≤W0; in the T-th operation of the Ky-th row of k convolution kernels, reading first weight data from the storage unit according to a preset weight read bit width, where the first weight data includes the weight data of the k convolution kernels at the m-th channel, the Ky-th row, and convolution kernel position T, 1<k≤N, 1≤T≤K, 1≤Ky≤K; according to the stride Sx of the convolution kernels, selecting, from the first pixel data, "a" pixel data corresponding to convolution kernel position T as second pixel data, 1<a<M; and when T>1, for the q-th column of MACs in the MAC array, multiplying the second pixel data by the q-th weight data in the first weight data and adding the result of the (T-1)-th operation, to obtain "a" first convolution operation results of the T-th operation of the q-th column of MACs, 1≤q≤k.
In a possible implementation manner, the method further includes: when T=1, for the q-th column of MACs, multiplying the second pixel data by the q-th weight data in the first weight data and adding the convolution operation result of the K-th operation of the (Ky-1)-th row, to obtain "a" first convolution operation results of the first operation of the q-th column of MACs, 1≤q≤k.
In a possible implementation manner, the method further includes: for the q-th column of MACs, after completing the operations of the K rows of the k convolution kernels, obtaining "a" second convolution operation results of the m-th channel; and after the convolution operation results of the C0 channels are obtained, adding the convolution operation results of the C0 channels of each convolution kernel to obtain "a" target convolution operation results output by the q-th column of MACs.
In a possible implementation manner, the method further includes: storing the weight data of the N convolution kernels according to a weight storage bit width, where the weight storage bit width is consistent with the weight read bit width. Storing the weight data of the N convolution kernels according to the weight storage bit width includes: for each of the N convolution kernels, vertically arranging the weight data of the convolution kernel into a first weight vector in the order of the row direction, the column direction, and the C0 channels of the convolution kernel; horizontally aligning and merging the first weight vectors of the N convolution kernels into a first weight matrix; and horizontally storing the weight data in the first weight matrix according to the weight storage bit width.
In a possible implementation manner, horizontally storing the weight data in the first weight matrix according to the weight storage bit width includes: when N is greater than the number of columns Q of the MAC array, vertically splitting the first weight matrix into groups of Q columns to obtain F second weight matrices, where F = ⌈N/Q⌉, i.e., N divided by Q and rounded up; when the width of the second weight matrix is less than or equal to the weight storage bit width, storing the weight data of the f-th second weight matrix in the order of the row direction and then the column direction, 1≤f≤F, and arranging the (f-1)-th second weight matrix before the f-th second weight matrix; where the width of the second weight matrix equals Q multiplied by the first storage unit of the weight data, and the first storage unit of the weight data is determined according to the data type of the weight data.
In a possible implementation manner, the method further includes: when the width of the second weight matrix is greater than the weight storage bit width, for the f-th second weight matrix, vertically splitting the f-th second weight matrix into slices of one weight storage bit width each to obtain F0 third weight matrices, where F0 = ⌈(width of the second weight matrix) / (weight storage bit width)⌉, i.e., the width of the second weight matrix divided by the weight storage bit width and rounded up; storing the weight data of the f0-th third weight matrix in the order of the row direction and then the column direction, 1≤f0≤F0; and arranging the (f0-1)-th third weight matrix before the f0-th third weight matrix.
In a possible implementation manner, the method further includes: storing the pixel data of the image according to a pixel storage bit width, where the pixel storage bit width is consistent with the pixel read bit width. Storing the pixel data of the image according to the pixel storage bit width includes: dividing the pixel data of the m-th channel and the Py-th row of the image, by every b consecutive pixel data, into B first storage vectors, where B equals W0 divided by b and rounded up, 1≤b≤W0; for each first storage vector, dividing the first storage vector, by every b bytes, into E second storage vectors, the b bytes being less than or equal to the pixel storage bit width; sequentially storing the E second storage vectors according to the pixel storage bit width, with the address space falling short of the storage bit width padded with 0; and thereby sequentially storing the pixel data of the m-th channel and the Py-th row.
In a possible implementation manner, reading the first weight data from the storage unit according to the preset weight read bit width includes: when T=1, determining the row L, in a target weight matrix, of the weight data of the k convolution kernels at the m-th channel, the Ky-th row, and convolution kernel position T, and reading the weight data of the L-th row of the target weight matrix from the storage unit according to the weight read bit width as the first weight data read from the storage unit; when 1<T≤K, reading the weight data of the (L+T-1)-th row of the target weight matrix from the storage unit according to the preset weight read bit width as the first weight data read from the storage unit; where the target weight matrix includes the second weight matrix or the third weight matrix.
In a possible implementation manner, selecting, according to the stride Sx of the convolution kernels, "a" pixel data corresponding to convolution kernel position T from the first pixel data as the second pixel data includes: when T=1, selecting from the first pixel data "a" pixel data spaced at the stride Sx as the second pixel data, the second pixel data including the pixel data at X[0], X[Sx], X[2Sx], X[3Sx], ..., X[(a-1)Sx] of the m-th channel and the Py-th row of the image; when 1<T≤K, according to the dilation rate Ex of the convolution kernel, selecting from the first pixel data the pixel data at X[(T-1)Ex], X[Sx+(T-1)Ex], X[2Sx+(T-1)Ex], X[3Sx+(T-1)Ex], ..., X[(a-1)Sx+(T-1)Ex] of the m-th channel and the Py-th row of the image as the second pixel data.
In a possible implementation manner, after obtaining the "a" target convolution operation results output by the q-th column of MACs, the method further includes: according to the first storage vector corresponding to the pixel data at X[aSx] in the m-th channel and the Py-th row of the image, determining the first storage start address, in the storage unit, of the first storage vector corresponding to the pixel data at X[aSx] in the Py-th row; and reading third pixel data from the storage unit according to the preset pixel read bit width and the first storage start address, the third pixel data including M consecutive pixel data read starting from the first storage start address, so that the operation unit can continue the operation.
In a possible implementation manner, the method further includes: after completing the convolution operation of the k convolution kernels with K rows of pixel data, determining, according to the stride Sy of the convolution kernels in the column direction, the second storage start address of the first pixel data of the row that is Sy-1 rows apart from the first of the K rows of pixel data; and reading fourth pixel data from the storage unit according to the preset pixel read bit width and the second storage start address, the fourth pixel data including M consecutive pixel data read starting from the second storage start address, so that the operation unit can continue the operation.
In a possible implementation manner, the multiplier-accumulator MAC array includes an array based on a crossbar matrix structure; the operation unit further includes at least one cache module, and the cache module is configured to read pixel data from the storage unit according to the preset pixel read bit width and to read weight data from the storage unit according to the preset weight read bit width.
According to another aspect of the present disclosure, an artificial intelligence processor is provided. The artificial intelligence processor includes a plurality of processing cores, each processing core includes a storage unit and an operation unit, the storage unit is used to store the pixel data of an image and the weight data of N convolution kernels, and the operation unit includes a multiplier-accumulator MAC array for performing operations according to the pixel data and the weight data, where the processing core performs the convolution operation by means of any one of the data processing methods described above.
In the embodiments of the present disclosure, by reading the first pixel data and the first weight data, selecting from the first pixel data, according to the stride Sx of the convolution kernels, "a" pixel data corresponding to convolution kernel position T as the second pixel data, and, for the q-th column of MACs in the MAC array, multiplying the second pixel data by the q-th weight data in the first weight data and adding the result of the (T-1)-th operation to obtain "a" first convolution operation results of the T-th operation of the q-th column of MACs, a multi-point parallel convolution operation between multiple pixel data and the weight data of multiple corresponding convolution kernels can be realized in each operation, thereby improving the efficiency of the convolution operation and the operating efficiency of the artificial intelligence processor.
Other features and aspects of the present disclosure will become apparent from the following detailed description of exemplary embodiments with reference to the accompanying drawings.
Description of Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate exemplary embodiments, features, and aspects of the disclosure and, together with the description, serve to explain the principles of the disclosure.
FIG. 1 shows a flowchart of a data processing method according to an embodiment of the present disclosure;
FIG. 2 shows a schematic diagram of the storage of pixel data according to an embodiment of the present disclosure;
FIG. 3 shows a schematic diagram of N convolution kernels according to an embodiment of the present disclosure;
FIG. 4a shows a schematic diagram of a first weight vector according to an embodiment of the present disclosure;
FIG. 4b shows a schematic diagram of a first weight matrix according to an embodiment of the present disclosure;
FIG. 5 shows a schematic diagram of the storage of weight data according to an embodiment of the present disclosure;
FIG. 6 shows a schematic diagram of splitting a second weight matrix according to an embodiment of the present disclosure;
FIG. 7 shows a schematic structural diagram of a MAC array according to an embodiment of the present disclosure;
FIG. 8a shows a block diagram of an artificial intelligence processor according to an embodiment of the present disclosure;
FIG. 8b shows a block diagram of a processing core according to an embodiment of the present disclosure;
FIG. 9a shows a schematic diagram of selecting pixel data according to an embodiment of the present disclosure;
FIG. 9b shows yet another schematic diagram of selecting pixel data according to an embodiment of the present disclosure.
Detailed Description
Various exemplary embodiments, features, and aspects of the present disclosure will be described in detail below with reference to the accompanying drawings. The same reference numerals in the figures denote elements having the same or similar functions. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless otherwise indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, in order to better illustrate the present disclosure, numerous specific details are given in the following detailed description. Those skilled in the art will understand that the present disclosure may be practiced without certain specific details. In some instances, methods, means, components, and circuits well known to those skilled in the art have not been described in detail so as not to obscure the subject matter of the present disclosure.
In the embodiments of the present disclosure, the artificial intelligence processor may be a neuromorphic chip based on a many-core architecture. Various artificial intelligence algorithms can be implemented on the basis of the artificial intelligence processor. The artificial intelligence processor may include multiple processing cores, and each processing core may include a storage unit and an operation unit. The storage unit may be used to store the data to be operated on, and the operation unit may be used to perform logical and arithmetic operations. The present disclosure does not limit the specific type of the artificial intelligence processor.
It can be appreciated that, in the field of artificial intelligence, and especially in the field of image processing, the convolution operation accounts for a large part of the total amount of computation, and as the depth and/or breadth of a convolutional neural network increases, the efficiency of the convolution operation may have a considerable impact on the operating efficiency of the artificial intelligence processor. Therefore, improving the efficiency of the convolution operation can, to a certain extent, improve the operating efficiency of the artificial intelligence processor.
At present, when a neuromorphic chip based on a many-core structure implements a convolution operation, it generally expands the multiple input channels of the input image into a one-dimensional vector and performs the multiply-accumulate calculation of pixel data with the corresponding weight data pixel by pixel. Due to the structural limitation of the multiplier-accumulator MAC in current neuromorphic chips, each operation can only perform the product of a single pixel data with the weight data of the corresponding multiple convolution kernels and output the convolution operation result after accumulation.
Based on this, the operation unit in the embodiments of the present disclosure may include a multiplier-accumulator MAC array, and the MAC array may include an array based on a crossbar matrix structure. In a possible implementation manner, the MAC array may include A rows × Q columns of MACs. The specific values of A and Q can be set according to actual requirements; considering that the number N of convolution kernels is usually a power of 2, the MAC array may be, for example, a 4×32 MAC array. The embodiments of the present disclosure do not limit the structure of the MAC array in the operation unit. Based on the MAC array in the embodiments of the present disclosure, parallel convolution operations between multiple pixel data and the weight data of multiple corresponding convolution kernels can be implemented, thereby improving the efficiency of the convolution operation.
In a possible implementation manner, in order to implement the convolution operation of the pixel data with the weight data, the storage unit in each processing core may be used to store the pixel data of an image and the weight data of N convolution kernels. The operation unit may include a multiplier-accumulator MAC array for performing operations according to the pixel data and the weight data, where the size of the image may be width W0 × height H0 × number of channels C0, the size of each convolution kernel may be width K × height K × number of channels C0, the stride in the row direction may be Sx, and W0, H0, C0, K, and Sx are positive integers. It can be understood that the pixel data and the weight data in the embodiments of the present disclosure may be the data to be subjected to the convolution operation. The embodiments of the present disclosure do not limit the size and quantity of the pixel data and the weight data.
FIG. 1 shows a flowchart of a data processing method according to an embodiment of the present disclosure. As shown in FIG. 1, the data processing method includes:
Step 11: reading first pixel data from the storage unit according to a preset pixel read bit width, the first pixel data including M consecutive pixel data of the m-th channel and the Py-th row of the image, 1≤m≤C0, 1≤Py≤H0, 1<M≤W0;
Step 12: in the T-th operation of the Ky-th row of the k convolution kernels, reading first weight data from the storage unit according to a preset weight read bit width, the first weight data including the weight data of the k convolution kernels at the m-th channel, the Ky-th row, and convolution kernel position T, 1<k≤N, 1≤T≤K, 1≤Ky≤K;
Step 13: according to the stride Sx of the convolution kernels, selecting "a" pixel data corresponding to convolution kernel position T from the first pixel data as second pixel data, 1<a<M;
Step 14: when T>1, for the q-th column of MACs in the MAC array, multiplying the second pixel data by the q-th weight data in the first weight data and adding the result of the (T-1)-th operation, to obtain "a" first convolution operation results of the T-th operation of the q-th column of MACs, 1≤q≤k.
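To make the data flow of steps 11 to 14 concrete, the following is a minimal Python sketch of the per-operation logic for a single input channel, using small example dimensions. It is an illustrative software model of the method, not the hardware implementation, and all function and variable names are assumptions.

```python
# Illustrative model of steps 11-14 for one input channel m: accumulate "a" output
# points per kernel column of the MAC array. K, Sx, Ex, a, k below are example values.

def conv_channel(image_rows, kernels, K=3, Sx=1, Ex=1, a=4, k=32):
    """image_rows: K pixel rows of channel m covered by the kernel window;
    kernels[q][Ky][T]: weight of kernel q at row Ky, column T (channel m)."""
    acc = [[0] * a for _ in range(k)]          # one accumulator per (MAC column q, output point j)
    for Ky in range(K):                        # kernel row
        first_pixel_data = image_rows[Ky]      # step 11: M consecutive pixels of row Py
        for T in range(K):                     # T-th operation of row Ky
            # step 13: select "a" pixels spaced Sx apart, offset by T*Ex (dilation)
            second_pixel_data = [first_pixel_data[T * Ex + j * Sx] for j in range(a)]
            for q in range(k):                 # step 12/14: all k MAC columns work in parallel
                w = kernels[q][Ky][T]          # q-th weight of the first weight data for this row
                for j in range(a):
                    acc[q][j] += second_pixel_data[j] * w   # multiply and add to previous result
    return acc                                  # "a" second convolution results per kernel, channel m
```

Summing the accumulators returned for all C0 channels then yields the "a" target convolution operation results per kernel described further below.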
In a possible implementation manner, the data processing method in the embodiments of the present disclosure can be applied to an artificial intelligence processor.
In a possible implementation manner, before step 11 is performed, the parameters required for performing the convolution operation may also be obtained by obtaining primitive parameters. The primitive parameters may include the data required for performing the convolution operation; for example, the primitive parameters may include the image size W0×H0×C0, the convolution kernel size K×K×C0, the number N of convolution kernels, the stride Sx in the row direction, the stride Sy in the column direction, the dilation rate Ex, the padding parameter, the bias parameter, and other parameters. The embodiments of the present disclosure do not limit the specific form of the primitive parameters.
In the embodiments of the present disclosure, by reading the first pixel data and the first weight data, selecting from the first pixel data, according to the stride Sx of the convolution kernels, "a" pixel data corresponding to convolution kernel position T as the second pixel data, and, for the q-th column of MACs in the MAC array, multiplying the second pixel data by the q-th weight data in the first weight data and adding the result of the (T-1)-th operation to obtain "a" first convolution operation results of the T-th operation of the q-th column of MACs, a multi-point parallel convolution operation between multiple pixel data and the weight data of multiple corresponding convolution kernels can be realized in each operation, thereby improving the efficiency of the convolution operation and the operating efficiency of the artificial intelligence processor.
In a possible implementation manner, before step 11 is performed, the data processing method may further include: storing the pixel data of the image according to a pixel storage bit width, where the pixel storage bit width is consistent with the pixel read bit width, so that in step 11 the first pixel data can be read from the storage unit according to the preset pixel read bit width.
In a possible implementation manner, storing the pixel data of the image according to the pixel storage bit width may include: dividing the pixel data of the m-th channel and the Py-th row of the image, by every b consecutive pixel data, into B first storage vectors, where B equals W0 divided by b and rounded up, 1≤b≤W0; for each first storage vector, dividing the first storage vector, by every b bytes, into E second storage vectors, the b bytes being less than or equal to the pixel storage bit width; sequentially storing the E second storage vectors according to the pixel storage bit width, with the address space falling short of the storage bit width padded with 0; and thereby sequentially storing the pixel data of the m-th channel and the Py-th row.
For example, assume that the size of image X is W0×H0×C0 = 138×127×3, that is, image X has 3 channels and each channel contains 138×127 pixel data. Assume that b is 16 and that the pixel storage bit width is 32 bytes. The pixel data of the first channel and the first row of the image is divided, by every 16 consecutive pixel data, into B first storage vectors, where B = 138/16 rounded up = 9. Assuming that each pixel data occupies 1 byte, then for each first storage vector, splitting by every 16 bytes gives the second storage vectors, i.e., E = 9 in this example. The 9 second storage vectors are stored in sequence according to the pixel storage bit width of 32B; since the 9th second storage vector contains 10 pixel data, i.e., its width is 10B, which is less than 32B, the address space of this second storage vector that falls short of the storage bit width is padded with 0 in the storage unit. At this point, the pixel data of the first channel and the first row of the image has been stored.
In a possible implementation manner, after the pixel data of the first channel and the first row has been stored in sequence, the pixel data of the first channel and the second row is stored, until the pixel data of all rows of the first channel has been stored; then the pixel data of all rows of the second channel is stored, and so on until the storage of the pixel data of all channels is completed.
It can be understood that the number E of second storage vectors obtained by splitting is related to the second storage unit of the pixel data, and the second storage unit of the pixel data is determined according to the data type of the pixel data. For example, if the second storage unit of the pixel data is 2 bytes, then for each first storage vector, splitting by every 16 bytes divides the first storage vector into 8 second storage vectors.
In a possible implementation manner, the data type may include multi-precision data types such as ternary (-1, 0, 1), int8, and uint8. The embodiments of the present disclosure do not limit the data type of the pixel data.
In a possible implementation manner, the specific value of b can be set according to actual requirements. In some cases, the pixel data may be a multiple of 16, and b can then be set to a multiple of 16, for example, 16 or 32, which is not limited in the embodiments of the present disclosure. The b bytes are less than or equal to the pixel storage bit width, so that the pixel data can be stored aligned in the storage unit.
In a possible implementation manner, the pixel storage bit width may be the storage width of the pixel data in the storage unit, set according to actual requirements. To facilitate the storage of the pixel data, the pixel storage bit width may be a multiple of 16, for example, 16B, 32B, or 64B, which is not limited in the embodiments of the present disclosure. To facilitate reading the pixel data from the storage unit, the pixel storage bit width may be consistent with the pixel read bit width.
FIG. 2 shows a schematic diagram of the storage of pixel data according to an embodiment of the present disclosure. Px denotes the Px-th column of image X, Py denotes the Py-th row of image X, and RGB denotes the red, green, and blue channels of the image. As shown in FIG. 2, the first 16B of storage space stores the 0th to 15th pixel data of the first row of the R channel of image X, i.e., X[0][0:15], and so on; X[0][Px-1;0] indicates that the pixel data of the first row has been stored, the address space of this row that does not fill 16B is padded with 0, and the pixel data of the second row is stored after the pixel data of the first row has been stored.
In the embodiments of the present disclosure, by splitting the pixel data into first storage vectors and second storage vectors, the storage efficiency of the pixel data can be improved, and it becomes convenient to read, from the storage unit, the pixel data corresponding to the weight data.
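The following Python sketch illustrates one plausible reading of this layout, using the example parameters above (16-pixel storage vectors, 1-byte pixels, zero padding of short vectors, and 32B alignment of the whole image). It is only an illustration of the ordering, not the exact on-chip memory map, and the helper names are assumptions.

```python
# Illustrative sketch of the row-major pixel layout: width direction first, then
# height direction, then channel direction, with zero-padded 16-pixel vectors.

def store_channel_row(row_pixels, b=16):
    """Split one image row into b-pixel storage vectors, zero-padding the last one."""
    chunks = []
    for start in range(0, len(row_pixels), b):
        chunk = row_pixels[start:start + b]
        chunk = chunk + [0] * (b - len(chunk))   # pad the short tail vector with zeros
        chunks.append(chunk)
    return chunks

def store_image(image, align=32):
    """image[c][y][x] -> flat byte list in width, then height, then channel order."""
    flat = []
    for channel in image:                         # channel direction last
        for row in channel:                       # height direction second
            for chunk in store_channel_row(row):  # width direction first, 16-pixel vectors
                flat.extend(chunk)
    flat.extend([0] * (-len(flat) % align))       # align the whole image to 32B
    return flat

# Example: a 3-channel 5x4 toy image; each short row becomes one zero-padded 16B vector.
toy = [[[c * 100 + y * 10 + x for x in range(5)] for y in range(4)] for c in range(3)]
layout = store_image(toy)
```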
In a possible implementation manner, before step 11 is performed, the data processing method may further include: storing the weight data of the N convolution kernels according to a weight storage bit width, where the weight storage bit width is consistent with the weight read bit width, so that in step 12 the first weight data can be read from the storage unit according to the preset weight read bit width.
In a possible implementation manner, storing the weight data of the N convolution kernels according to the weight storage bit width may include: for each of the N convolution kernels, vertically arranging the weight data of the convolution kernel into a first weight vector in the order of the row direction, the column direction, and the C0 channels of the convolution kernel; horizontally aligning and merging the first weight vectors of the N convolution kernels into a first weight matrix; and horizontally storing the weight data in the first weight matrix according to the weight storage bit width.
FIG. 3 shows a schematic diagram of N convolution kernels according to an embodiment of the present disclosure. FIG. 4a shows a schematic diagram of a first weight vector according to an embodiment of the present disclosure. FIG. 4b shows a schematic diagram of a first weight matrix according to an embodiment of the present disclosure. For example, as shown in FIG. 3, for N convolution kernels of K×K×C0 = 3×3×3, convolution kernel 1 can be vertically arranged, in the order of the row direction, the column direction, and channel C0, into the first weight vector shown in FIG. 4a; the first weight vectors corresponding to the other convolution kernels are obtained analogously and are not repeated here. The first weight vectors corresponding to the N convolution kernels are horizontally aligned and merged into the first weight matrix shown in FIG. 4b, and the first weight matrix is then stored according to the weight storage bit width, thereby realizing the storage of the weight data.
In the embodiments of the present disclosure, by processing the weight data of the N convolution kernels into the first weight matrix and then storing the first weight matrix according to the weight storage bit width, the weight data can be stored sequentially and the storage efficiency of the weight data can be improved.
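A small Python sketch of this arrangement may help: each kernel is unrolled in row, column, channel order into a column vector, and the N vectors are merged side by side so that one stored row holds the weights of all kernels at the same position. The shapes and helper names are illustrative assumptions, not the exact on-chip format.

```python
# Build the first weight vector and first weight matrix described above.

def first_weight_vector(kernel):
    """kernel[c][ky][kx] (C0 x K x K) -> flat list in row, column, channel order."""
    C0, K = len(kernel), len(kernel[0])
    vec = []
    for c in range(C0):              # channel direction last
        for ky in range(K):          # column direction second
            for kx in range(K):      # row direction first
                vec.append(kernel[c][ky][kx])
    return vec

def first_weight_matrix(kernels):
    """Merge the N first weight vectors side by side: matrix[row][n] = n-th kernel's row-th weight."""
    vectors = [first_weight_vector(k) for k in kernels]
    rows = len(vectors[0])
    return [[vectors[n][r] for n in range(len(kernels))] for r in range(rows)]
```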
In a possible implementation manner, horizontally storing the weight data in the first weight matrix according to the weight storage bit width may include: when N is greater than the number of columns Q of the MAC array, vertically splitting the first weight matrix into groups of Q columns to obtain F second weight matrices, where F = ⌈N/Q⌉, i.e., F equals N/Q rounded up; when the width of the second weight matrix is less than or equal to the weight storage bit width, storing the weight data of the f-th second weight matrix in the order of the row direction and then the column direction, 1≤f≤F, with the address space falling short of the weight storage bit width padded with 0, and arranging the (f-1)-th second weight matrix before the f-th second weight matrix; where the width of the second weight matrix equals Q multiplied by the first storage unit of the weight data, and the first storage unit of the weight data is determined according to the data type of the weight data.
In a possible implementation manner, the data type may include multi-precision data types such as ternary (-1, 0, 1), int8, and uint8. The embodiments of the present disclosure do not limit the data type of the weight data.
For example, assume that the weight storage bit width is 32 bytes (B), there are 64 convolution kernels, the number of columns of the MAC array is 32, and the storage unit of the weight data is 2 bits. Then F = 64÷32 = 2, that is, the first weight matrix is split into 2 second weight matrices. The width of a second weight matrix is 2 bit × 32 = 64 bit < the weight storage bit width of 32B, so the two second weight matrices can be stored in sequence in the order of the row direction and then the column direction, with the 1st second weight matrix arranged before the 2nd second weight matrix. This can be understood as first storing the weight data of the first second weight matrix in row order and then column order, and then storing the weight data of the second second weight matrix in the same way, where each row of weight data of a second weight matrix is padded with 0 in the address space of the storage unit that falls short of the weight storage bit width; after padding, one row of weight data of the second weight matrix has been stored, and after the weight data of the current row has been stored, the weight data of the next row is stored in turn.
In a possible implementation manner, when the width of the second weight matrix is greater than the weight storage bit width, for the f-th second weight matrix, the f-th second weight matrix is split vertically into slices of one weight storage bit width each to obtain F0 third weight matrices, where F0 = ⌈(width of the second weight matrix) / (weight storage bit width)⌉, i.e., F0 equals the width of the second weight matrix divided by the weight storage bit width, rounded up; the weight data of the f0-th third weight matrix is stored in the order of the row direction and then the column direction, 1≤f0≤F0, and the (f0-1)-th third weight matrix is arranged before the f0-th third weight matrix.
For example, assume that the weight storage bit width is 32B, there are 64 convolution kernels, the number of columns of the MAC array is 32, and the storage unit of the weight data is 2B. Then F = 64÷32 = 2, that is, the first weight matrix is split into 2 second weight matrices. The width of a second weight matrix is 2B × 32 = 64B > the weight storage bit width of 32B, so each second weight matrix can be split vertically by every 32B into 2 third weight matrices, which is equivalent to splitting the first weight matrix vertically by every 32B into 4 third weight matrices, and the weight data of the third weight matrices is stored in the order of the row direction and then the column direction.
In some cases, the number N of convolution kernels may also be less than or equal to the number of columns Q of the MAC array. In a possible implementation manner, horizontally storing the weight data in the first weight matrix according to the weight storage bit width may then further include: when N is less than or equal to the number of columns Q of the MAC array and the width of the first weight matrix is greater than the weight storage bit width, splitting the first weight matrix vertically by every weight storage bit width to obtain F1 fourth weight matrices; storing the weight data of the f1-th fourth weight matrix in the order of the row direction and then the column direction, 1≤f1≤F1; and arranging the (f1-1)-th fourth weight matrix before the f1-th fourth weight matrix; where the width of the first weight matrix equals N multiplied by the first storage unit of the weight data.
For example, assume that the weight storage bit width is 32B, there are 16 convolution kernels, the number of columns of the MAC array is 32, and the storage unit of the weight data is 4B. The number of convolution kernels is then less than the number of columns of the MAC array, while the width of the first weight matrix is 16×4B = 64B > the weight storage bit width of 32B. The first weight matrix can be split vertically by every 32B into 2 fourth weight matrices, which are then stored in the order of the row direction and then the column direction, with the 1st fourth weight matrix arranged before the 2nd fourth weight matrix.
In a possible implementation manner, there may also be a case where N is less than or equal to the number of columns Q of the MAC array and the width of the first weight matrix is less than or equal to the weight storage bit width. In this case, the weight data in the first weight matrix is stored directly in the order of the row direction and then the column direction, and each row of weight data is padded with 0 in the address space of the storage unit that falls short of the weight storage bit width.
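The splitting rules above can be summarized in a short, hedged Python sketch. It assumes byte-sized weight units (so the 2-bit case is not covered) and returns only the kernel-column ranges of each stored matrix; the names and the return format are assumptions for illustration.

```python
# Sketch of the splitting rules: cut the first weight matrix into groups of at most
# Q kernel columns, then cut each group so no stored row exceeds the storage width.

def ceil_div(x, y):
    return -(-x // y)

def split_weight_matrix(n_kernels, q_columns, unit_bytes, storage_width_bytes):
    """Return the kernel-column ranges (start, end) of each matrix actually stored, in order."""
    stored = []
    for f in range(ceil_div(n_kernels, q_columns)):   # F second weight matrices
        start = f * q_columns
        end = min(start + q_columns, n_kernels)
        width = (end - start) * unit_bytes            # width of this second weight matrix
        if width <= storage_width_bytes:
            stored.append((start, end))               # stored row by row, zero-padded to the bit width
        else:
            # further split into F0 third weight matrices of one storage bit width each
            cols_per_slice = storage_width_bytes // unit_bytes
            for s in range(start, end, cols_per_slice):
                stored.append((s, min(s + cols_per_slice, end)))
    return stored

# 64 kernels, 32 MAC columns, 2-byte weights, 32B storage width -> 4 slices of 16 kernels each
print(split_weight_matrix(64, 32, 2, 32))
```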
FIG. 5 shows a schematic diagram of the storage of weight data according to an embodiment of the present disclosure. Kx denotes the Kx-th column of a convolution kernel, Ky denotes the Ky-th row of a convolution kernel, RGB denotes the three channels of the convolution kernels corresponding to the red, green, and blue channels of the image, F0 denotes the 1st target weight matrix, F1 denotes the 2nd target weight matrix, and so on. As shown in FIG. 5, "R channel_F0" indicates that what is stored is the 1st target weight matrix under the first channel of the convolution kernels, where the 1st 32B stores the first row of the 1st target weight matrix, the 2nd 32B stores the second row of the 1st target weight matrix, and so on; [0,0] denotes the first weight data of the first row under this channel, [Ky-1, Kx-1] denotes the Kx-th weight data of the Ky-th row under this channel, and so on. The 1st target weight matrix F0 is arranged before the 2nd target weight matrix F1, and the address space falling short of the weight storage bit width is padded with 0.
In a possible implementation manner, the weight storage bit width may be the storage width of the weight data in the storage unit, set according to actual requirements. In some cases, the number of convolution kernels in a convolution layer is usually a multiple of 16, for example, 32, 64, 128, or 256, so the weight storage bit width can be set to a multiple of 16, for example, 32 bytes or 64 bytes, which is not limited in the embodiments of the present disclosure.
In a possible implementation manner, the weight storage bit width and the weight read bit width may be consistent, so that in step 12 the cache module can read the first weight data from the storage unit according to the preset weight read bit width.
In the embodiments of the present disclosure, storing the weight data according to the number of columns Q of the MAC array and the weight storage bit width can improve the storage efficiency of the weight data, so that the weight data sequentially read from the storage unit in each operation corresponds to the pixel data, further improving the efficiency of the convolution operation.
In a possible implementation manner, the operation unit of each processing core of the artificial intelligence processor may further include at least one cache module, which may be used to read pixel data from the storage unit according to the preset pixel read bit width and to read weight data from the storage unit according to the preset weight read bit width. In step 11, reading the first pixel data from the storage unit according to the preset pixel read bit width may then be performed by reading the first pixel data from the storage unit through the at least one cache module, and in step 12, reading the first weight data from the storage unit according to the preset weight read bit width may be performed by reading the first weight data from the storage unit through the at least one cache module.
In a possible implementation manner, the cache module may be implemented with a register, a dual-port random access memory, a non-volatile memory, or another memory that supports shifted fetching, which is not limited in the embodiments of the present disclosure.
In a possible implementation manner, the size and number of cache modules can be set according to actual requirements. In the embodiments of the present disclosure, the cache module may be larger than the pixel read bit width and the weight read bit width; for example, if the pixel read bit width is 32B, a 48B register may be selected to ensure continuous loading of data during the operation, thereby ensuring the continuity of the operation.
In a possible implementation manner, after the size of the cache module is determined, one or more cache modules can be used according to actual requirements. For example, if a 48B register is to be used, the 48B register can be composed of three 16B registers, or a single 48B register can be used. Using multiple cache modules allows the cache modules to be reused and improves resource utilization.
In a possible implementation manner, if the width of the data loaded by the cache module is smaller than the size of the cache module, the cache module can load the data of that width, and the remaining storage space in the cache module is padded with 0. For example, for a 48B register, if the loaded pixel data or weight data is less than 16B, 16B of data is loaded and the remaining storage space in the cache module is padded with 0; if the loaded pixel data or weight data is less than 32B, 32B of data is loaded and the remaining storage space in the cache module is padded with 0.
In a possible implementation manner, the reading of the first pixel data in step 11 may be continuous. In other words, when the data in the cache module cannot meet the requirements of the current operation, the cache module can read consecutive pixel data from the storage unit to ensure the continuity of the operation. For example, if a 48B register is used to read data, whenever 16B of data has been shifted out of the register, the register loads the next 16B of data from the storage unit, thereby ensuring the continuity of the operation.
In a possible implementation manner, in step 12, reading the first weight data from the storage unit according to the preset weight read bit width may include:
when T=1, determining the row L, in a target weight matrix, of the weight data of the k convolution kernels at the m-th channel, the Ky-th row, and convolution kernel position T, and reading the weight data of the L-th row of the target weight matrix from the storage unit according to the weight read bit width, as the first weight data read from the storage unit;
when 1<T≤K, reading the weight data of the (L+T-1)-th row of the target weight matrix from the storage unit according to the preset weight read bit width, as the first weight data read from the storage unit;
where the target weight matrix may include the second weight matrix or the third weight matrix. In a possible implementation manner, the target weight matrix may also include the first weight matrix or the fourth weight matrix.
In a possible implementation manner, convolution kernel position T may refer to the T-th weight data of the Ky-th row of the m-th channel of a convolution kernel.
In a possible implementation manner, reading the weight data from the storage unit may be performed by first determining the starting address of the weight data to be read, i.e., the storage address corresponding to the weight data of the L-th row of the target weight matrix, and then reading the weight data of the (L+T-1)-th row by sequential addressing, that is, by incrementing the address by 1.
通过本公开实施例,能够在k个卷积核的第Ky行的第T次运算时,实现顺序读取k个卷积核的第m个通道、第Ky行、卷积核位置T处的权重数据。图6示出根据本公开实施例的一种第二权重矩阵的拆分示意图。举例来说,如图6所示,在第T=1次运算时,可以读取a1、e1等第二权重矩阵的第一行权重数据,即可以相当于读取32个卷积核第一个通道、第一行、第一个权重数据,在T=2时可以读取a2、e2等第二权重矩阵的第二行权重数据,即可以相当于读取k个卷积核第一个通道、第一行、第二个权重数据,依次类推。Through the embodiments of the present disclosure, during the T-th operation of the Ky-th row of the k convolution kernels, it is possible to sequentially read the m-th channel, the Ky-th row, and the convolution kernel position T of the k convolution kernels. weight data. FIG. 6 shows a schematic diagram of splitting a second weight matrix according to an embodiment of the present disclosure. For example, as shown in Figure 6, when T=1 operation, the weight data of the first row of the second weight matrix such as a1 and e1 can be read, which is equivalent to reading the first row of 32 convolution kernels. channel, the first row, the first weight data, when T=2, the weight data of the second row of the second weight matrix such as a2 and e2 can be read, which is equivalent to reading the first k convolution kernels. Channel, first row, second weight data, and so on.
在本公开实施例中,在确定m个通道、第Ky行、卷积核位置T处的权重数据在目标权重矩阵中的所在行L后,逐行读取第一权重数据,能够实现读取与像素数据对应的权重数据。In the embodiment of the present disclosure, after determining the row L of the weight data at the m channels, the Ky th row, and the convolution kernel position T in the target weight matrix, the first weight data is read row by row, and the reading can be realized. Weight data corresponding to pixel data.
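As an illustrative, non-authoritative sketch of this row addressing (a software model only; the helper name and the assumption that the target weight matrix is stored row by row at consecutive addresses are ours):

```python
def weight_row_index(L, T):
    """Row of the target weight matrix read for the T-th operation.

    For T == 1 the base row L is read; for 1 < T <= K the read simply
    advances one row per operation (sequential addressing, address + 1).
    """
    return L + T - 1

# Example: base row L = 5, kernel width K = 3 -> rows 5, 6, 7 are read.
print([weight_row_index(5, T) for T in (1, 2, 3)])  # [5, 6, 7]
```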
In a possible implementation, in step 13, selecting, according to the stride Sx of the convolution kernel, a pixel data corresponding to convolution kernel position T from the first pixel data as the second pixel data includes:
when T=1, selecting from the first pixel data a pixel data spaced apart by the stride Sx as the second pixel data, the second pixel data including the pixel data at X[0], X[Sx], X[2Sx], X[3Sx], ..., X[(a-1)Sx] of the m-th channel and the Py-th row of the image;
when 1<T≤K, selecting, according to the dilation rate Ex of the convolution kernel, the pixel data at X[(T-1)Ex], X[Sx+(T-1)Ex], X[2Sx+(T-1)Ex], X[3Sx+(T-1)Ex], ..., X[(a-1)Sx+(T-1)Ex] of the m-th channel and the Py-th row of the image from the first pixel data as the second pixel data.
It can be understood that, when the convolution operation between the convolution kernel and the image is implemented, the convolution kernel usually performs the convolution operation according to the moving strides in the row direction and the column direction. For the first pixel data of the m-th channel and the Py-th row, when T=1, selecting from the first pixel data the a second pixel data X[0], X[Sx], X[2Sx], X[3Sx], ..., X[(a-1)Sx] spaced apart by the stride Sx is equivalent to selecting the a second pixel data corresponding to the first weight datum of the m-th channel and the Ky-th row of the multiple convolution kernels.
When 1<T≤K, selecting, according to the dilation rate Ex of the convolution kernel, the pixel data at X[(T-1)Ex], X[Sx+(T-1)Ex], X[2Sx+(T-1)Ex], X[3Sx+(T-1)Ex], ..., X[(a-1)Sx+(T-1)Ex] of the m-th channel and the Py-th row of the image from the first pixel data as the second pixel data is equivalent to selecting the a second pixel data corresponding to the T-th weight datum of the m-th channel and the Ky-th row of the multiple convolution kernels. For example, when T=2, the second pixel data may include X[Ex], X[Sx+Ex], X[2Sx+Ex], X[3Sx+Ex], ..., X[(a-1)Sx+Ex]; when T=3, the second pixel data may include X[2Ex], X[Sx+2Ex], X[2Sx+2Ex], X[3Sx+2Ex], ..., X[(a-1)Sx+2Ex], and so on.
In a possible implementation, the dilation rate Ex of the convolution kernel can be set according to the requirements of the actual convolution operation. When Ex=1, an ordinary convolution operation is performed; when Ex>1, a dilated convolution operation is performed.
In a possible implementation, the value of a may be less than or equal to the number of rows A of the MAC array. For example, for a 4×32 MAC array, a may be an integer in [1, 4].
In the embodiments of the present disclosure, selecting a second pixel data from the first pixel data according to the stride Sx and the dilation rate Ex ensures that the selected pixel data correspond to the weight data of the multiple convolution kernels, so that the convolution operation between the pixel data and the weight data is performed accurately, and dilated convolution operations can also be supported.
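A minimal sketch of this index selection, assuming 0-based positions within the first pixel data (the function name is ours, not part of the disclosure):

```python
def select_second_pixels(first_pixels, a, Sx, Ex, T):
    """Pick the a pixels that pair with convolution-kernel position T.

    For T = 1 the offset is 0; for 1 < T <= K the window shifts by
    (T - 1) * Ex, so index i (0 <= i < a) reads X[i*Sx + (T-1)*Ex].
    """
    offset = (T - 1) * Ex
    return [first_pixels[i * Sx + offset] for i in range(a)]

# Example: a = 4, Sx = 3, Ex = 1 -> T = 1 picks X[0], X[3], X[6], X[9].
row = list(range(48))
print(select_second_pixels(row, a=4, Sx=3, Ex=1, T=1))  # [0, 3, 6, 9]
print(select_second_pixels(row, a=4, Sx=3, Ex=1, T=2))  # [1, 4, 7, 10]
```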
In a possible implementation, the first weight data read in step 12 and the second pixel data selected in step 13 can be fed into the MAC array through the cache module, and the multiply-accumulation of the second pixel data with the corresponding weight data is then performed in step 14, that is, the convolution operation between the weight data and the pixel data is implemented.
FIG. 7 shows a schematic structural diagram of a MAC array according to an embodiment of the present disclosure. To facilitate understanding of the process of obtaining the first convolution operation results in step 14, the MAC array shown in FIG. 7 is taken as an example. As shown in FIG. 7, each circle contains 4 MACs, so a can be 4; there are 5 columns of MACs in total, so k can be 5.
Then, when T=2, the selected second pixel data X[Ex], X[Sx+Ex], X[2Sx+Ex], X[3Sx+Ex] are fed into the MAC array from its row direction, and the first weight data of the m-th channel, the Ky-th row and convolution kernel position 2 of the 5 convolution kernels are respectively fed into the MAC array from its column direction. For the q-th column of MACs in the MAC array, the products of the 4 pixel data with the q-th first weight datum can thus be obtained. Similarly, when T=1, the products of the second pixel data (X[0], X[Sx], X[2Sx], X[3Sx]) with the first weight data (the weight data of the m-th channel, the Ky-th row and convolution kernel position 1 of the 5 convolution kernels) can be obtained.
Then, for each column of MACs, when T=2, the 4 products obtained in the T=2 operation are accumulated with the 4 products obtained in the T=1 operation respectively, to obtain the results of the T=2 operation; by analogy, in the T-th operation, the products obtained in the T-th operation are cyclically accumulated with the results of the (T-1)-th operation, to obtain the a first convolution operation results of the T-th operation of the q-th column of MACs.
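The following sketch emulates, under our own simplifying assumptions (a software model of a single MAC column, Python lists instead of hardware registers), the cyclic accumulation over T described above:

```python
def column_conv_row(second_pixels_per_T, weights_per_T):
    """Emulate one MAC column accumulating over T = 1..K for one kernel row.

    second_pixels_per_T[T-1] holds the a pixels selected for operation T;
    weights_per_T[T-1] holds this column's kernel weight at position T.
    Returns the a partial results after the K-th operation.
    """
    K = len(weights_per_T)
    a = len(second_pixels_per_T[0])
    acc = [0] * a                       # results of the (T-1)-th operation
    for T in range(1, K + 1):
        w = weights_per_T[T - 1]
        pixels = second_pixels_per_T[T - 1]
        acc = [acc[i] + pixels[i] * w for i in range(a)]  # multiply-accumulate
    return acc

# Example: a = 4 output points, K = 3 kernel positions.
pixels = [[1, 2, 3, 4], [2, 3, 4, 5], [3, 4, 5, 6]]
weights = [1, 0, -1]
print(column_conv_row(pixels, weights))  # [-2, -2, -2, -2]
```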
In the embodiments of the present disclosure, a multi-point parallel convolution operation between multiple pixel data and the weight data of the corresponding multiple convolution kernels can be performed in each operation, so that the convolution operation is carried out efficiently and the operating efficiency of the artificial intelligence processor is improved.
In the embodiments of the present disclosure, when the T=K operation is completed, for the q-th column of MACs, this is equivalent to completing the a first convolution operation results between the weight data of the m-th channel and the Ky-th row of the convolution kernel corresponding to that column and the corresponding pixel data. It can be appreciated that, to implement the convolution operation between a convolution kernel and the image, the product of each row of weight data of each channel with the pixel data usually needs to be computed.
In a possible implementation, the data processing method according to the embodiments of the present disclosure may further include: step 15, when T=1, for the q-th column of MACs, multiplying the second pixel data by the q-th weight datum in the first weight data, and adding the products to the convolution operation results of the K-th operation of the (Ky-1)-th row, to obtain the a first convolution operation results of the 1st operation of the q-th column of MACs, 1≤q≤k.
The convolution operation results of the K-th operation of the (Ky-1)-th row can be obtained by the processing disclosed in steps 11 to 14 of the above embodiments of the present disclosure, which is not repeated here.
In the embodiments of the present disclosure, the cyclic accumulation of the convolution operation results of each row of weight data with the corresponding pixel data can be implemented, so as to obtain the convolution operation result of the m-th channel.
In practical applications, when the convolution operation between the convolution kernels and the image is implemented, for each convolution kernel it is usually necessary to accumulate the convolution operation results of the weight data of each channel with the pixel data of the corresponding channel to obtain the final convolution operation result.
In a possible implementation, the data processing method may further include: step 16, for the q-th column of MACs, after completing the operations of the K rows of the k convolution kernels, obtaining the a second convolution operation results of the m-th channel; step 17, after the convolution operation results of the C0 channels are obtained, adding the convolution operation results of the C0 channels of each convolution kernel, to obtain the a target convolution operation results output by the q-th column of MACs.
Since the convolution operation result of each channel is actually obtained by accumulating the convolution operation results of each row of the convolution kernel for that channel, the operations of the K rows of the k convolution kernels in step 16 can be completed by the processing disclosed in steps 11 to 15 of the above embodiments of the present disclosure, which is not repeated here.
In the embodiments of the present disclosure, for the q-th column of MACs, by adding the convolution operation results of the C0 channels of each convolution kernel, the a target convolution operation results output by the q-th column of MACs can be obtained, which is equivalent to obtaining the values of a adjacent points in the same row of the k output maps.
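As a rough software analogue of steps 16 and 17 (assuming the per-channel results of one MAC column are already available as lists; the structure and names are ours, not the hardware interface):

```python
def target_results(per_channel_results):
    """Sum the a second convolution results over all C0 channels.

    per_channel_results[m] holds the a results of channel m for one MAC
    column, i.e. one convolution kernel; the sum gives the a target
    convolution results (a adjacent points of that kernel's output map).
    """
    a = len(per_channel_results[0])
    return [sum(channel[i] for channel in per_channel_results) for i in range(a)]

# Example: C0 = 3 channels, a = 4 points per channel.
print(target_results([[1, 2, 3, 4], [10, 10, 10, 10], [0, 1, 0, 1]]))
# [11, 13, 13, 15]
```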
In practical applications, after the a target convolution operation results output by the q-th column of MACs are obtained, that is, after the values of a adjacent points in the same row of the k output maps are obtained, the convolution of all the pixel data of the input image in the row direction with the weight data may not yet be complete. In a possible implementation, after the a target convolution operation results output by the q-th column of MACs are obtained, the data processing method may therefore further include:
determining, according to the first storage vector corresponding to the pixel data at X[aSx] in the m-th channel and the Py-th row of the image, the first storage start address corresponding, in the storage unit, to the first storage vector corresponding to the pixel data at X[aSx] in the Py-th row;
reading third pixel data from the storage unit according to the preset pixel read bit width and the first storage start address, the third pixel data including M consecutive pixel data read starting from the first storage start address, so that the operation unit can continue the operation.
In a possible implementation, after the size of the input image, the padding parameter, the size of the convolution kernel (width K × height K) and the strides of the convolution kernel (including the stride Sx in the row direction and the stride Sy in the column direction) are obtained, the size of the output map can be derived. For example, the size of the output map can be obtained by formula 1; according to the size of the output map it can be determined whether, when the a target convolution operation results output by the q-th column of MACs are obtained, the convolution of all the pixel data of the input image in the row direction with the weight data has been completed.
P_out = ⌊(P_in + 2 × padding − K) / S⌋ + 1        (formula 1)
where P_out is the width or height of the output map, P_in is the width or height of the input image, and S represents the stride in the row direction or the stride in the column direction.
For example, if the width of the output map is calculated to be 16 and the q-th column of MACs has so far output 4 target convolution operation results, this means that the convolution of all the pixel data of the input image in the row direction with the weight data has not yet been completed.
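A small sketch assuming formula 1 is the standard convolution output-size formula with floor division (the numbers in the example are ours):

```python
def output_size(p_in, k, stride, padding=0):
    """Width or height of the output map for kernel size k and stride S."""
    return (p_in + 2 * padding - k) // stride + 1

# Example: a 58-pixel-wide row, an 11x11 kernel, stride Sx = 3 and no padding
# give an output width of 16, matching the example in the text.
print(output_size(58, 11, 3))  # 16
```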
The first storage start address corresponding, in the storage unit, to the first storage vector corresponding to the pixel data at X[aSx] in the Py-th row is determined from that first storage vector because, for the first pixel data of the Py-th row, the second pixel data at X[0], X[Sx], X[2Sx], X[3Sx], ..., X[(a-1)Sx] of the m-th channel and the Py-th row of the image have already been selected; according to the stride Sx of the convolution kernel, the next second pixel data to be selected for the convolution operation therefore start from X[aSx].
In the embodiments of the present disclosure, considering that the storage address corresponding to the pixel data at X[aSx] is not easy to determine, and that reading data starting from that storage address would be relatively complicated, the first storage start address corresponding, in the storage unit, to the first storage vector corresponding to the pixel data at X[aSx] is determined first, and the third pixel data is then read from the storage unit according to the preset pixel read bit width and the first storage start address. In this way, the start address from which the cache module fetches data from the storage unit can be determined conveniently and quickly; since the first storage vectors are stored in the storage unit in an aligned manner, this also facilitates data fetching by the cache module.
In a possible implementation, the first storage vector corresponding to the pixel data at X[aSx] in the m-th channel and the Py-th row of the image can be determined by comparing aSx with nb-1, n∈[1,B].
For example, assuming that b equals 16, that is, a first storage vector is formed for every 16 pixel data, there are 3 first storage vectors containing the pixel data [0,15], [16,31] and [32,47] of the Py-th row of the image. If aSx=12, since 12 is less than 15, the corresponding first storage vector is [0,15], and data needs to be read from the storage unit starting from the 0th pixel datum; if aSx=18, since 18 is greater than 15 and less than 31, data needs to be read from the storage unit starting from the 16th pixel datum, and so on.
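A minimal sketch of that comparison, assuming b pixels per first storage vector and 0-based pixel positions (the helper name is ours):

```python
def first_storage_vector(pixel_index, b=16):
    """Index of the first storage vector that holds pixel_index, and the
    pixel position at which the cache module should start reading."""
    vector = pixel_index // b          # aSx compared against n*b - 1
    return vector, vector * b          # read starts at the vector boundary

# Example: aSx = 12 stays in vector 0 (read from pixel 0);
# aSx = 18 falls into vector 1 (read from pixel 16).
print(first_storage_vector(12))  # (0, 0)
print(first_storage_vector(18))  # (1, 16)
```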
In a possible implementation, after the third pixel data is read from the storage unit according to the preset pixel read bit width and the first storage start address, selecting, according to the stride Sx of the convolution kernel, a pixel data corresponding to convolution kernel position T from the third pixel data as the second pixel data may include:
when T=1, selecting from the third pixel data a pixel data spaced apart by the stride Sx as the second pixel data, the second pixel data including the pixel data at X[aSx], X[(a+1)Sx], X[(a+2)Sx], X[(a+3)Sx], ..., X[(2a-1)Sx] of the m-th channel and the Py-th row of the image;
when 1<T≤K, selecting, according to the dilation rate Ex of the convolution kernel, the pixel data at X[aSx+(T-1)Ex], X[(a+1)Sx+(T-1)Ex], X[(a+2)Sx+(T-1)Ex], X[(a+3)Sx+(T-1)Ex], ..., X[(2a-1)Sx+(T-1)Ex] of the m-th channel and the Py-th row of the image from the third pixel data as the second pixel data.
It should be noted that determining, according to the first storage vector corresponding to the pixel data at X[aSx] in the m-th channel and the Py-th row of the image, the first storage start address corresponding, in the storage unit, to that first storage vector is one implementation provided by the embodiments of the present disclosure, but those skilled in the art can understand that the present disclosure is not limited thereto. Inspired by the embodiments of the present disclosure, those skilled in the art can likewise determine, according to the first storage vector corresponding to the pixel data at X[2aSx] in the m-th channel and the Py-th row of the image, the first storage start address corresponding, in the storage unit, to that first storage vector, and so on. For brevity, the embodiments of the present disclosure are not listed exhaustively.
In a possible implementation, the third pixel data plays the same role as the first pixel data in step 11. After the third pixel data is read from the storage unit, the data processing method described in steps 11 to 16 of the above embodiments of the present disclosure can be applied to obtain a further a target convolution operation results output by each column of MACs, so that the convolution of all the pixel data of the image in the row direction with the weight data can be completed, that is, all the values of the same row of the output map are obtained.
In the embodiments of the present disclosure, by starting to read pixel data from the first storage start address corresponding, in the storage unit, to the first storage vectors corresponding to the pixel data at X[aSx], X[2aSx] and so on, reloading pixel data from a new start position can be implemented simply and effectively, so that the operation unit continues the operation; finally, all the values of the same row of the multiple output maps can be obtained.
In practical applications, after the data of one row of the output map is obtained, a moving cyclic operation needs to be performed in the column direction of the image according to the stride Sy of the convolution kernel in the column direction, to compute the data of the next row of the output map. In a possible implementation, the data processing method according to the embodiments of the present disclosure may therefore further include: after completing the convolution operation of the k convolution kernels with the K rows of pixel data, determining, according to the stride Sy of the convolution kernel in the column direction, the second storage start address of the first pixel datum of the row spaced Sy-1 rows from the 1st row of the K rows of pixel data; and reading fourth pixel data from the storage unit according to the preset pixel read bit width and the second storage start address, the fourth pixel data including M consecutive pixel data read starting from the second storage start address, so that the operation unit can continue the operation.
In a possible implementation, after the fourth pixel data is read, the fourth pixel data plays the same role as the first pixel data, and the a target convolution operation results output by the q-th column of MACs can then be obtained by the data processing method disclosed in steps 11 to 17 of the above embodiments of the present disclosure, so as to complete the convolution operation between the convolution kernels and the image.
In the embodiments of the present disclosure, by reading the fourth pixel data according to the moving stride Sy in the column direction, the data of each row of the output map can be computed conveniently, and the output map corresponding to each convolution kernel is finally obtained.
It should be noted that the output maps in the embodiments of the present disclosure may refer to feature maps obtained by the convolution operation, and the input image and the image may refer to an original image or to a feature map that has already undergone convolution processing, which is not limited in the embodiments of the present disclosure.
FIG. 8a shows a block diagram of an artificial intelligence processor according to an embodiment of the present disclosure, and FIG. 8b shows a block diagram of a processing core according to an embodiment of the present disclosure. As shown in FIG. 8a, the artificial intelligence processor 100 includes a plurality of processing cores 101; as shown in FIG. 8b, each processing core 101 includes a storage unit 102 and an operation unit 103.
In a possible implementation, the storage unit 102 is configured to store the pixel data of the image and the weight data of the N convolution kernels; the operation unit 103 includes a multiplier-accumulator MAC array 104 configured to perform operations according to the pixel data and the weight data.
In a possible implementation, the operation unit may further include at least one cache module 105, the cache module being configured to read pixel data from the storage unit 102 according to the preset pixel read bit width and to read weight data from the storage unit 102 according to the preset weight read bit width.
In a possible implementation, the cache module 105 can feed the gated data into the MAC array for the convolution operation, and output the convolution operation results into the address space in the storage unit specified by the address generation module 106.
In a possible implementation, the operation unit may further include an address generation module 106 configured to generate the address pointer used when the cache module reads data, so that the cache module 105 performs sequential addressing and/or jump addressing according to the address pointer.
In a possible implementation, the MAC array 104 includes an array based on a crossbar matrix structure. The MAC array 104 can be expanded in the two dimensions of rows and columns, and can support multi-point parallel convolution operations.
In a possible implementation, the processing core 101 can perform the convolution operation by the data processing method described in any one of the above embodiments of the present disclosure.
In a possible implementation, the storage unit 102 can be configured to store data according to a specific storage logic for the pixel data and the weight data, where the storage logic for the pixel data includes: the image of each channel is stored in turn; the pixel data of each channel is expanded into a vector along the image width direction, and every b consecutive pixel data are stored as one storage vector; the vector is split into multiple parts aligned to b bytes and stored one by one. The storage order of different pixel data in the storage unit is first along the image width direction and then along the image height direction. The entire image is stored aligned to the pixel storage bit width, zero-padded where it falls short, to facilitate register fetching and computation. The pixel storage bit width is greater than or equal to the configured b bytes.
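An illustrative sketch of that storage logic, assuming 1-byte pixels, b = 16 and 32-byte alignment (a software model only; the real layout is fixed by the hardware and the primitive parameters):

```python
def flatten_image(image, b=16, align=32):
    """Lay out image[channel][row][col] channel by channel, row by row,
    padding each row to a multiple of b pixels and the whole image to `align`."""
    out = []
    for channel in image:
        for row in channel:
            padded = row + [0] * (-len(row) % b)   # pad row to b-pixel vectors
            out.extend(padded)
    out.extend([0] * (-len(out) % align))          # align the whole image
    return out

# Example: 1 channel, 2 rows of 20 pixels each -> rows padded to 32 pixels,
# total length 64, already a multiple of 32.
img = [[[1] * 20, [2] * 20]]
print(len(flatten_image(img)))  # 64
```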
In a possible implementation, the storage addresses of the weight data and the pixel data in the storage unit 102 can be specified. According to the storage address of the weight data, when reading the weight data the cache module 105 can start from the start address and read the weight data by incrementing the address by one.
In a possible implementation, when the cache module 105 reads pixel data, address jumps occur, that is, pixel data is read across rows; the configured address jump value can be set in the primitive parameters. The address generation module 106 generates the target address according to the address jump value and counts by means of the loop clock counter built into the artificial intelligence processor; after the count satisfies the jump condition, the loop clock counter generates a jump signal, and the jump signal instructs the cache module 105 to jump the address pointer according to the target address generated by the address generation module 106.
In the embodiments of the present disclosure, by adopting the artificial intelligence processor of the embodiments of the present disclosure, efficient convolution operations can be implemented and the operating efficiency of the artificial intelligence processor can be improved.
In a possible implementation, the convolution kernels are four-dimensional data with K×K×C0×N weight data in total. Taking N (the number of convolution kernels, i.e. the number of output channels) per output map as the vector length, the weight data are expanded to form a weight matrix of height Kx×Ky×C0 and width N; the weight data in this weight matrix are arranged, in the height direction, first along the row direction, then along the column direction, and then along the channel C0 direction.
In a possible implementation, when this weight matrix is stored, if N is greater than 32 and each weight datum occupies at least 1B, the weight matrix is split into W_grp groups aligned to 32B, and the data of each group is arranged in the storage unit below the data of the previous group. In one case, when each weight datum is 2 bits, the weight matrix is split into W_grp groups aligned every 32 columns (that is, every 32×2 bit = 8B).
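A rough sketch of that grouping, assuming 1-byte weights so that 32 columns correspond to 32B (the function name and list-based layout are ours):

```python
def split_weight_matrix(weight_matrix, group_cols=32):
    """Split a (Kx*Ky*C0) x N weight matrix into W_grp groups of at most
    group_cols columns; the groups are stored one below the other."""
    n = len(weight_matrix[0])
    groups = []
    for start in range(0, n, group_cols):
        groups.append([row[start:start + group_cols] for row in weight_matrix])
    return groups  # len(groups) == W_grp == ceil(N / group_cols)

# Example: height 4, N = 70 kernels -> 3 groups of widths 32, 32, 6.
matrix = [[c for c in range(70)] for _ in range(4)]
print([len(g[0]) for g in split_weight_matrix(matrix)])  # [32, 32, 6]
```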
In a possible implementation, each channel of the input image is expanded into a vector along the width direction, every 16 consecutive pixel data are stored as one first storage vector, each first storage vector is split into multiple second storage vectors aligned to 16B, and the second storage vectors are stored one by one. The storage order of the input image in the storage unit is first along the row direction and then along the column direction. The entire input image is aligned to 32B in the storage unit, and the address space falling short of 32B is zero-padded, to facilitate register fetching and computation.
In a possible implementation, a 48B shift register, or three 16B registers, can be used to read data from the storage unit. When reading data from the storage unit, the 48B of data can be loaded in 3 clock cycles; for example, the adjacent 48B (the 0th to 47th) pixels of the first row of the 1st channel (usually the R channel) of the input image are loaded into the 48B register over 3 clock cycles, 16B at a time. If the width of the input image is less than 16B, only 16B of data is loaded, zero-padded where it falls short; if the width of the input image is less than 32B, only 32B of data is loaded, zero-padded where it falls short. The read operation can be controlled by the loop clock counter.
In a possible implementation, when data is selected from the register and output to the MAC array for the operation, whenever one 16B block of data has been shifted out of the register, the register loads the next 16B of data to keep the operation continuous.
In a possible implementation, based on a 4×32 2D MAC array, up to 4 pixel data can be multiplied simultaneously with the weight data at the same position of up to 32 convolution kernels. For example, in one operation, the 4 pixel data X[0], X[Sx], X[2Sx], X[3Sx] in the register can be gated and multiplied simultaneously with the first weight datum of the first row of the first channel of the 32 convolution kernels. Then, shifting along the row direction of the convolution kernel, the pixel data at X[Ex], X[Ex+Sx], X[Ex+2Sx], X[Ex+3Sx] are gated and convolved with the second weight datum of the first row of the first channel of the 32 convolution kernels, where Ex represents the dilation rate, until the convolution of the K pixel data with the corresponding weight data is completed.
In a possible implementation, the 2D MAC array is expanded in the two directions of rows and columns: it is expanded into Q groups in the column direction to provide computation in the direction of Q output channels, and into A groups in the row direction to provide computation of A pixel data in the row direction. After every A pixel data are multiplied by the corresponding weight data, accumulation is performed in the column direction of the crossbar-based MAC array, and the convolution results of A consecutive points are generated in a pipelined manner, thereby supporting parallel operations on multiple pixel data and weight data.
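The following sketch emulates, in plain Python and under our own simplifying assumptions, one cycle of an A×Q MAC array: A pixels enter along the row direction, Q per-kernel weights enter along the column direction, and each column keeps its own A accumulators:

```python
def mac_array_cycle(pixels, weights, acc):
    """One cycle of an A x Q MAC array.

    pixels:  A pixel values fed along the row direction.
    weights: Q weight values (one per kernel / output channel) fed along
             the column direction.
    acc:     Q x A accumulators; acc[q][i] += pixels[i] * weights[q].
    """
    for q, w in enumerate(weights):
        for i, x in enumerate(pixels):
            acc[q][i] += x * w
    return acc

# Example: A = 4 pixels, Q = 2 kernels, starting from zeroed accumulators.
acc = [[0] * 4 for _ in range(2)]
print(mac_array_cycle([1, 2, 3, 4], [10, -1], acc))
# [[10, 20, 30, 40], [-1, -2, -3, -4]]
```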
Taking a 4×32 2D MAC array, a convolution kernel with K of 11 and an image with RGB channels as an example, one implementation of the data processing method in the embodiments of the present disclosure is described, including the following steps.
Step 1: obtain the primitive parameters.
Step 2: load the adjacent 48B (pixels 0-47) of the first row of the R channel of the image into the 48B register over 3 clock cycles, 16B at a time. Select the 4 pixel data X[0], X[Sx], X[2Sx], X[3Sx] from the register and feed them into the 2D MAC array to be multiplied with the weight data at the same position of the 32 convolution kernels; 32 convolution operation results can be obtained in parallel.
Step 3: then shift along the row direction, gating X[Ex], X[Ex+Sx], X[Ex+2Sx], X[Ex+3Sx] and multiplying them with the corresponding weight data, until K convolutions of pixel data with the corresponding weight data are completed.
Step 4: move to the next row and read its 48B of pixel data, gating 4 pixel data at a time along the row direction and convolving them with the corresponding weight data.
Step 5: repeat steps 1 to 4 above until the K×K convolution operations for the R channel are completed, then compute the convolution operations for the other channels such as the G channel and the B channel respectively; after that, the adjacent four points of the first 32 channels of the output maps can be obtained simultaneously: P0[0,0], P0[0,1], P0[0,2], P0[0,3]. At this point, P0[0,0], P0[0,1], P0[0,2], P0[0,3] of the first 32 channels need to be written back into the storage unit. Repeat steps 1 to 5 above until the adjacent four points P0[0,0], P0[0,1], P0[0,2], P0[0,3] of the output maps of all channels are obtained.
Step 6: determine whether the start position at which the register reads pixel data for the second windowing exceeds the position of the 15th pixel. If it exceeds 15, read 48B of pixel data from the addresses of the 16th to 63rd pixels of the first row; otherwise, still read the pixel data from the addresses of the 0th to 47th pixels. Still using the 48B register to read the data, select the pixel data at X[4Sx], X[5Sx], X[6Sx], X[7Sx] from the register for the convolution computation, until the K×K convolution operations are completed, obtaining the adjacent four points of the same row of the 32 output maps: P0[0,4], P0[0,5], P0[0,6], P0[0,7].
Step 7: after the data of the first row of the output maps is obtained, start computing the data of the next row of the output maps; at this point, read the corresponding pixel data from row 0+Sy of the input image and perform steps 1 to 6 above.
In a possible implementation, with the 4×32 MAC array based on the crossbar structure, up to 4 pixel data can be selected from the register each time and convolved simultaneously with one weight datum at the same position of up to 32 convolution kernels.
In a possible implementation, when pixel data is gated from the register, the shift-and-select operation is performed according to the size of the convolution kernel, the stride Sx of the convolution kernel and the dilation rate Ex. FIG. 9a shows a schematic diagram of selecting pixel data according to an embodiment of the present disclosure. As shown in FIG. 9a, assume a K×K=11×11 convolution kernel, a dilation rate Ex=1, each pixel datum of the image X occupying 1B, the 48B read into the registers being the first 48B of pixel data of the first row of the image, and the stride Sx of the convolution kernel being 3; the three registers Reg[0], Reg[1], Reg[2] have read the first pixels "0-47" of the first row X[0] of the image X.
In the first operation, X[0], X[Sx], X[2Sx], X[3Sx] in the registers are selected, that is, the 1st, 4th, 7th and 10th pixel data "0, 3, 6, 9"; in the second operation X[1], X[Sx+1], X[2Sx+1], X[3Sx+1] are selected, that is, the pixel data "1, 4, 7, A"; in the third operation X[2], X[Sx+2], X[2Sx+2], X[3Sx+2] are selected, that is, the pixel data "2, 5, 8, B"; and so on, until in the 11th operation X[10], X[Sx+10], X[2Sx+10], X[3Sx+10] are selected, that is, the pixel data "A, D, 16, 19".
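A quick sketch reproducing those index sequences for Sx = 3, Ex = 1, K = 11, printed as plain decimal indices (the labels quoted from FIG. 9a follow the figure's own labelling convention):

```python
Sx, Ex, K, a = 3, 1, 11, 4

# Indices gated from the 48B register in each of the K operations of one
# kernel row: operation T selects X[i*Sx + (T-1)*Ex] for i = 0..a-1.
for T in range(1, K + 1):
    print(T, [i * Sx + (T - 1) * Ex for i in range(a)])
# T=1 -> [0, 3, 6, 9], T=2 -> [1, 4, 7, 10], ..., T=11 -> [10, 13, 16, 19]
```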
Since the size of the convolution kernel is 11×11, after pixel data has been selected from the register 11 times and fed into the MAC array, this is equivalent to having computed the convolution of the first row of weight data of the convolution kernel with the corresponding pixel data.
The register then jumps to the start address corresponding to the pixel data of the second row of the image X and loads the first 48B of the pixel data of the second row; the gating logic for selecting pixel data is the same as above. After the convolution of the second row of weight data of the convolution kernel with the corresponding pixel data, the first 48B of the pixel data of the third row is loaded; each data load and each data gating follows the same logic as above.
For the pixel data read by the register from the storage unit, when pixel data has been selected by shifting K times, the address pointer of the register jumps to the storage address corresponding to the pixel data of the second row of the image for the second fetch; when data has been read K times, this is equivalent to having computed the convolution of the R channel of the image with the first layer of weight data of the convolution kernel. The register then jumps to the start address of the pixel data of the first row of the G channel of the image and, following the same read-and-gate logic as for the R channel, the convolution of the pixel data of the G channel with the second layer of weight data of the convolution kernel is computed, and so on for the B channel. After the convolutions of the three RGB channels are computed, the 4 values of the same row of the 32 output maps can be obtained in parallel.
Next, to compute the following 4 values of the same row of the output maps, it is necessary to return to the storage address corresponding to the pixel data of the first row of the image and load the pixel data of the first row, that is, to perform the second windowing. In the second windowing, it is determined whether the pixel datum about to be selected, X[4Sx], that is, the start position of the second windowing, has passed 15 (0 being the start, every 16 pixel data forming one storage vector). If it has, 48B of data is read from the storage address corresponding to the 16th pixel of the first row; if not, 48B of data is read from the storage address corresponding to the first pixel. FIG. 9b shows another schematic diagram of selecting pixel data according to an embodiment of the present disclosure. As shown in FIG. 9b, when pixel data is read in the second windowing, since X[4Sx]=12 is less than 15, the three registers Reg[0], Reg[1], Reg[2] still read the first pixels "0-47" of the first row X[0] of the image X. When pixel data is then selected, the first selection is the pixel data at X[4Sx], X[5Sx], X[6Sx], X[7Sx], that is, "C, F, 18, 21"; the second selection is X[4Sx+1], X[5Sx+1], X[6Sx+1], X[7Sx+1], that is, "D, 16, 19, 22"; and so on, until the 11th selection of the pixel data at X[4Sx+10], X[5Sx+10], X[6Sx+10], X[7Sx+10].
In the third windowing, it is determined whether X[8Sx] exceeds 15; if not, data is loaded from the storage address corresponding to the first pixel. If it exceeds 15, it is then determined whether it exceeds 31: if it exceeds 31, 48B of data is read from the storage address corresponding to the 32nd pixel; if it does not exceed 31, 48B of data is read from the address corresponding to the 16th pixel. After the pixel data is read in the third windowing, pixel data is selected according to the stride Sx, the convolution kernel size K and the dilation rate Ex, which is not repeated here.
After the values of the current row of the output maps are obtained, the pixel data of row 0+Sy is read according to the stride Sy in the column direction, to compute the values of the next row of the output maps, until the values of all the rows of the output maps are obtained.
In the embodiments of the present disclosure, it can be seen from the above convolution operation flow that obtaining an output map of size Ox×Oy×N requires six nested loops in total, namely loops in the directions of the convolution kernel width Kx, the convolution kernel height Ky, the channel C0, the output channel N, the output map width Ox and the output map height Oy.
In the embodiments of the present disclosure, the storage logic of the weight data of the convolution kernels in the storage unit matches the computation flow, so during the loop computation it is sufficient to increment the address by 1 starting from the start address of the weights. The storage order of the output maps and the output data order of the MAC array also follow a fixed rule, so the storage order of the output maps can likewise be determined directly by the fixed hardware logic. When the pixel data of the input image is read, since a sliding-window fetch is involved, the address jump value of each loop level can be set as a configurable primitive parameter.
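For reference, a plain software rendering of the six-loop structure described above (a naive emulation under our own assumptions: zero padding, one possible nesting order, our variable names, and none of the tiling, alignment or pipelining of the hardware):

```python
def conv2d_reference(image, kernels, Sx, Sy, Ex=1):
    """image[c][y][x], kernels[n][c][ky][kx] -> out[n][oy][ox].

    The six nested loops correspond to those named in the text, here with
    Kx innermost and the output map height Oy outermost.
    """
    C0, H0, W0 = len(image), len(image[0]), len(image[0][0])
    N, K = len(kernels), len(kernels[0][0])
    eff = Ex * (K - 1) + 1                       # effective (dilated) kernel size
    Oy, Ox = (H0 - eff) // Sy + 1, (W0 - eff) // Sx + 1
    out = [[[0] * Ox for _ in range(Oy)] for _ in range(N)]
    for oy in range(Oy):                         # output map height
        for ox in range(Ox):                     # output map width
            for n in range(N):                   # output channel
                for c in range(C0):              # input channel
                    for ky in range(K):          # kernel height
                        for kx in range(K):      # kernel width
                            out[n][oy][ox] += (
                                image[c][oy * Sy + ky * Ex][ox * Sx + kx * Ex]
                                * kernels[n][c][ky][kx])
    return out

# Example: one channel, a single 1x1 kernel of weight 2, stride 1.
print(conv2d_reference([[[1, 2], [3, 4]]], [[[[2]]]], Sx=1, Sy=1))
# [[[2, 4], [6, 8]]]
```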
In the embodiments of the present disclosure, the crossbar-based 2D MAC array, by being expanded in the two dimensions of rows and columns, can support multi-point parallel operations on data.
In the embodiments of the present disclosure, the storage logic of the input image and the convolution kernels can improve storage efficiency: each channel is stored in turn, each channel is expanded into a vector along the image width direction, every 16 consecutive pixel data are stored as one storage vector, and the vector is split into multiple parts aligned to 16B and stored one by one. The storage order of different pixel data in the storage unit is first along the image width direction and then along the image height direction. The entire image is stored aligned to 32B, zero-padded where it falls short, to facilitate register fetching and computation.
In the embodiments of the present disclosure, by using multiple shift registers, dynamic data reading and gating can be implemented, and the operations of the pixel data with the corresponding weight data can be carried out accurately and efficiently.
In the embodiments of the present disclosure, based on the 2D MAC array, the input image is stored in the order of the row direction, the column direction and the channel direction. By designing multiple shift registers to implement dynamic data reading and gating logic, the target row of the input image is fetched into the data registers and multiplied with the corresponding row of the convolution kernel, so that multi-point parallel operation logic on the data can be implemented, and multi-point convolution operation results are output in parallel through continuous operation in a row-pipelined manner.
In the embodiments of the present disclosure, a new convolution operation logic and data storage mode for neuromorphic chips based on a many-core architecture is implemented, improving the efficiency of both the convolution operation between images and convolution kernels and the data storage.
The embodiments of the present disclosure have been described above. The foregoing description is exemplary rather than exhaustive, and is not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical application or the improvement over technologies in the market, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (13)

  1. A data processing method, applied to a processing core of an artificial intelligence processor, the artificial intelligence processor comprising a plurality of processing cores, each processing core comprising a storage unit and an operation unit,
    wherein the storage unit is configured to store pixel data of an image and weight data of N convolution kernels; the operation unit comprises a multiplier-accumulator (MAC) array configured to perform operations according to the pixel data and the weight data, wherein the size of the image is W0×H0×C0, the size of the convolution kernels is K×K×C0, the stride in the row direction is Sx, and W0, H0, C0, K and Sx are positive integers,
    the method comprising:
    reading first pixel data from the storage unit according to a preset pixel read bit width, the first pixel data comprising M consecutive pixel data of an m-th channel and a Py-th row of the image, 1≤m≤C0, 1≤Py≤H0, 1<M≤W0;
    during a T-th operation of a Ky-th row of k convolution kernels, reading first weight data from the storage unit according to a preset weight read bit width, the first weight data comprising weight data of an m-th channel, the Ky-th row and a convolution kernel position T of the k convolution kernels, 1<k≤N, 1≤T≤K, 1≤Ky≤K;
    selecting, according to the stride Sx of the convolution kernels, a pixel data corresponding to the convolution kernel position T from the first pixel data as second pixel data, 1<a<M;
    when T>1, for a q-th column of MACs in the MAC array, multiplying the second pixel data by a q-th weight datum in the first weight data and adding the products to results of a (T-1)-th operation, to obtain a first convolution operation results of the T-th operation of the q-th column of MACs, 1≤q≤k.
  2. The method according to claim 1, further comprising:
    when T=1, for the q-th column of MACs, multiplying the second pixel data by the q-th weight datum in the first weight data and adding the products to convolution operation results of a K-th operation of a (Ky-1)-th row, to obtain a first convolution operation results of the 1st operation of the q-th column of MACs, 1≤q≤k.
  3. The method according to claim 1, further comprising:
    for the q-th column of MACs, after completing operations of K rows of the k convolution kernels, obtaining a second convolution operation results of the m-th channel;
    after the convolution operation results of the C0 channels are obtained, adding the convolution operation results of the C0 channels of each convolution kernel, to obtain a target convolution operation results output by the q-th column of MACs.
  4. The method according to claim 1, further comprising: storing the weight data of the N convolution kernels according to a weight storage bit width, wherein the weight storage bit width is consistent with the weight read bit width;
    wherein storing the weight data of the N convolution kernels according to the weight storage bit width comprises:
    for each of the N convolution kernels, arranging the weight data of the convolution kernel vertically into a first weight vector, in the order of the row direction, the column direction and the channel C0 of the convolution kernel;
    horizontally aligning and merging the first weight vectors of the N convolution kernels into a first weight matrix;
    storing the weight data in the first weight matrix horizontally according to the weight storage bit width.
  5. The method according to claim 4, wherein storing the weight data in the first weight matrix horizontally according to the weight storage bit width comprises:
    when N is greater than a column number Q of the MAC array, vertically splitting the first weight matrix every Q columns to obtain F second weight matrices, wherein F is equal to N divided by Q and rounded up;
    在所述第二权重矩阵的宽度小于或等于所述权重存储位宽的情况下,依次按照行方向、列方向的顺序,存储第f个第二权重矩阵中的权重数据,1≤f≤F;In the case where the width of the second weight matrix is less than or equal to the weight storage bit width, the weight data in the f-th second weight matrix is stored in the order of row direction and column direction, 1≤f≤F ;
    将第f-1个第二权重矩阵排列在第f个第二权重矩阵之前;Arrange the f-1th second weight matrix before the fth second weight matrix;
    其中,所述第二权重矩阵的宽度等于Q乘以所述权重数据的第一存储单位,所述权重数据的第一存储单位跟据所述权重数据的数据类型确定。The width of the second weight matrix is equal to Q multiplied by the first storage unit of the weight data, and the first storage unit of the weight data is determined according to the data type of the weight data.
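The splitting in claim 5 can be pictured as below; the original formula image is not reproduced in this text, so F = ceil(N / Q) is inferred from the wording "split vertically every Q columns". Function names are placeholders.

    import math
    import numpy as np

    def split_every_q_columns(W1, Q):
        # split the first weight matrix vertically every Q columns (claim 5);
        # the last second weight matrix may be narrower than Q
        N = W1.shape[1]
        F = math.ceil(N / Q)
        return [W1[:, f * Q:(f + 1) * Q] for f in range(F)]

    def row_major_words(second):
        # store one second weight matrix row by row (row direction first,
        # then column direction), one storage word per matrix row
        return [second[r, :] for r in range(second.shape[0])]

    W1 = np.arange(18 * 10).reshape(18, 10)   # 18 weights per kernel, N = 10 kernels
    seconds = split_every_q_columns(W1, Q=4)  # Q = 4 MAC columns -> F = 3 matrices
    print([s.shape for s in seconds])         # [(18, 4), (18, 4), (18, 2)]
    layout = [w for s in seconds for w in row_major_words(s)]   # f-1 stored before f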
  6. The method according to claim 5, wherein the method further comprises:
    when the width of the second weight matrix is greater than the weight storage bit width, for the f-th second weight matrix, splitting the f-th second weight matrix vertically every weight storage bit width to obtain F0 third weight matrices, where F0 is equal to the width of the second weight matrix divided by the weight storage bit width, rounded up;
    storing the weight data in the f0-th third weight matrix in the order of the row direction and then the column direction, where 1 ≤ f0 ≤ F0;
    arranging the (f0-1)-th third weight matrix before the f0-th third weight matrix.
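Likewise, a sketch of claim 6 under the assumption that the weight storage bit width is expressed here in units of one stored weight (the first storage unit), so that F0 is the width of the second weight matrix divided by that bit width, rounded up. The original formula images are not reproduced, so this is an inferred reading with made-up names.

    import math
    import numpy as np

    def split_by_storage_width(second, width_in_weights):
        # when the second weight matrix is wider than the weight storage bit
        # width, cut it vertically into F0 third weight matrices of at most
        # width_in_weights columns each, kept in order (f0-1 before f0)
        cols = second.shape[1]
        F0 = math.ceil(cols / width_in_weights)
        return [second[:, f0 * width_in_weights:(f0 + 1) * width_in_weights]
                for f0 in range(F0)]

    second = np.arange(18 * 16).reshape(18, 16)          # a 16-column second weight matrix
    thirds = split_by_storage_width(second, width_in_weights=8)
    print([t.shape for t in thirds])                     # [(18, 8), (18, 8)]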
  7. The method according to claim 1, wherein the method further comprises: storing the pixel data of the image according to a pixel storage bit width, wherein the pixel storage bit width is consistent with the pixel read bit width;
    the storing the pixel data of the image according to the pixel storage bit width comprises:
    splitting the pixel data of the m-th channel, Py-th row of the image into B first storage vectors, every b consecutive pixel data forming one first storage vector, where B is equal to W0 divided by b, rounded up, and 1 ≤ b ≤ W0;
    for each first storage vector, splitting the first storage vector into E second storage vectors every b bytes, the b bytes being less than or equal to the pixel storage bit width;
    sequentially storing the E second storage vectors according to the pixel storage bit width, and padding the address space falling short of the pixel storage bit width with zeros;
    sequentially storing the pixel data of the m-th channel, Py-th row.
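The pixel layout of claim 7 might look like the sketch below, assuming 1-byte pixels so that b pixels occupy exactly b bytes (with wider pixel types each first storage vector would split into E > 1 second storage vectors). The helper name store_pixel_row is hypothetical.

    import math
    import numpy as np

    def store_pixel_row(row, b, pixel_storage_bytes):
        # split one channel row of W0 pixels into B = ceil(W0 / b) first storage
        # vectors of b consecutive pixels, cut each into second storage vectors
        # of at most b bytes, and zero-pad every storage word up to the pixel
        # storage bit width (claim 7)
        W0 = len(row)
        B = math.ceil(W0 / b)
        words = []
        for i in range(B):
            first_vec = row[i * b:(i + 1) * b]
            raw = first_vec.astype(np.uint8).tobytes()      # 1-byte pixels assumed
            E = math.ceil(len(raw) / b)
            for e in range(E):
                chunk = raw[e * b:(e + 1) * b]
                words.append(chunk.ljust(pixel_storage_bytes, b"\x00"))
        return words

    row = np.arange(20, dtype=np.uint8)                     # W0 = 20 pixels of one row
    words = store_pixel_row(row, b=8, pixel_storage_bytes=16)
    print(len(words), [len(w) for w in words])              # 3 words of 16 bytes each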
  8. The method according to any one of claims 4 to 6, wherein the reading the first weight data from the storage unit according to the preset weight read bit width comprises:
    when T = 1, determining the row L of the target weight matrix in which the weight data of the m-th channel, Ky-th row, convolution kernel position T of the k convolution kernels is located, and reading the weight data of the L-th row of the target weight matrix from the storage unit according to the weight read bit width, as the first weight data read from the storage unit;
    when 1 < T ≤ K, reading the weight data of the (L+T-1)-th row of the target weight matrix from the storage unit according to the preset weight read bit width, as the first weight data read from the storage unit;
    wherein the target weight matrix comprises the second weight matrix or the third weight matrix.
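In other words, the K weights of one kernel row end up on K consecutive rows of the target weight matrix, so only the row index changes between reads. A trivial sketch (names are placeholders):

    def weight_row_to_read(L, T):
        # claim 8: at kernel position T = 1 the weights sit in row L of the
        # target weight matrix; for 1 < T <= K the read advances to row L+T-1
        return L if T == 1 else L + T - 1

    print([weight_row_to_read(L=10, T=T) for T in (1, 2, 3)])   # [10, 11, 12]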
  9. The method according to claim 3, wherein the selecting, from the first pixel data according to the stride Sx of the convolution kernel, the a pixel data corresponding to the convolution kernel position T as the second pixel data comprises:
    when T = 1, selecting, from the first pixel data, the a pixel data spaced apart by the stride Sx as the second pixel data, the second pixel data comprising the pixel data at X[0], X[Sx], X[2Sx], X[3Sx], ..., X[(a-1)Sx] of the m-th channel, Py-th row of the image;
    when 1 < T ≤ K, selecting, from the first pixel data according to the dilation rate Ex of the convolution kernel, the pixel data at X[(T-1)Ex], X[Sx+(T-1)Ex], X[2Sx+(T-1)Ex], X[3Sx+(T-1)Ex], ..., X[(a-1)Sx+(T-1)Ex] of the m-th channel, Py-th row of the image as the second pixel data.
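A small sketch of the index pattern in claim 9, with select_second_pixels as a made-up helper name; the selected positions follow X[(T-1)Ex + i·Sx] for i = 0..a-1, which reduces to X[i·Sx] when T = 1.

    def select_second_pixels(first_pixels, a, Sx, Ex, T):
        # pick the a pixels facing kernel position T: stride Sx between taps,
        # dilation offset (T - 1) * Ex from the start of the first pixel data
        offset = (T - 1) * Ex
        return [first_pixels[offset + i * Sx] for i in range(a)]

    first_pixels = list(range(100, 120))      # M = 20 consecutive pixels of one row
    print(select_second_pixels(first_pixels, a=4, Sx=2, Ex=1, T=1))  # X[0], X[2], X[4], X[6]
    print(select_second_pixels(first_pixels, a=4, Sx=2, Ex=2, T=3))  # X[4], X[6], X[8], X[10]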
  10. The method according to claim 9, wherein after the a target convolution operation results output by the q-th MAC column are obtained, the method further comprises:
    determining, according to the first storage vector corresponding to the pixel data at X[aSx] in the m-th channel, Py-th row of the image, the first storage start address corresponding to that first storage vector in the storage unit;
    reading third pixel data from the storage unit according to the preset pixel read bit width and the first storage start address, the third pixel data comprising M consecutive pixel data read starting from the first storage start address, so that the operation unit continues the operation.
  11. The method according to claim 10, wherein the method further comprises:
    after the convolution operation between the k convolution kernels and the K rows of pixel data is completed, determining, according to the stride Sy of the convolution kernel in the column direction, a second storage start address of the first pixel data of the row spaced by Sy-1 rows from the first of the K rows of pixel data;
    reading fourth pixel data from the storage unit according to the preset pixel read bit width and the second storage start address, the fourth pixel data comprising M consecutive pixel data read starting from the second storage start address, so that the operation unit continues the operation.
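Claims 10 and 11 amount to address arithmetic over the storage layout of claim 7. The sketch below assumes a layout in which each first storage vector of b pixels occupies exactly one storage word and the rows of a channel are stored back to back; these layout assumptions, and both function names, are illustrative rather than taken from the application.

    def first_storage_start(base_addr, a, Sx, b, bytes_per_word):
        # claim 10: the pixel X[a*Sx] of the current row falls into
        # first-storage-vector number floor(a*Sx / b), so the next read of
        # M consecutive pixels starts at that vector's word address
        return base_addr + ((a * Sx) // b) * bytes_per_word

    def second_storage_start(row_base_addr, Sy, words_per_row, bytes_per_word):
        # claim 11: after finishing the K rows, jump to the row spaced Sy-1
        # rows below the first of them, i.e. advance by Sy stored rows
        return row_base_addr + Sy * words_per_row * bytes_per_word

    # toy check: a row of 64 one-byte pixels stored as 8 words of 16 bytes
    print(first_storage_start(base_addr=0, a=12, Sx=1, b=8, bytes_per_word=16))            # 16
    print(second_storage_start(row_base_addr=0, Sy=2, words_per_row=8, bytes_per_word=16)) # 256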
  12. The method according to any one of claims 1 to 11, wherein the multiplier-accumulator MAC array comprises an array based on a crossbar matrix structure;
    the operation unit further comprises at least one cache module, the cache module being configured to read pixel data from the storage unit according to the preset pixel read bit width, and to read weight data from the storage unit according to the preset weight read bit width.
  13. An artificial intelligence processor, wherein the artificial intelligence processor comprises a plurality of processing cores, each processing core comprising a storage unit and an operation unit, the storage unit being configured to store pixel data of an image and weight data of N convolution kernels, and the operation unit comprising a multiplier-accumulator MAC array configured to perform operations according to the pixel data and the weight data,
    wherein the processing core performs a convolution operation through the data processing method according to any one of claims 1 to 12.
PCT/CN2020/137453 2020-11-30 2020-12-18 Data processing method and artificial intelligence processor WO2022110386A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011381294.9A CN112395092B (en) 2020-11-30 2020-11-30 Data processing method and artificial intelligence processor
CN202011381294.9 2020-11-30

Publications (1)

Publication Number Publication Date
WO2022110386A1

Family

ID=74604862

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/137453 WO2022110386A1 (en) 2020-11-30 2020-12-18 Data processing method and artificial intelligence processor

Country Status (2)

Country Link
CN (1) CN112395092B (en)
WO (1) WO2022110386A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114995782A (en) * 2022-08-03 2022-09-02 上海登临科技有限公司 Data processing method, device, equipment and readable storage medium
CN116152307A (en) * 2023-04-04 2023-05-23 西安电子科技大学 SAR image registration preprocessing device based on FPGA

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112862724B (en) * 2021-03-12 2022-09-09 上海壁仞智能科技有限公司 Method for computing, computing device and computer-readable storage medium
CN112927124A (en) * 2021-03-31 2021-06-08 成都商汤科技有限公司 Data processing method, device, equipment and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108809A (en) * 2018-03-05 2018-06-01 山东领能电子科技有限公司 A kind of hardware structure and its method of work that acceleration is made inferences for convolutional Neural metanetwork
US20190164037A1 (en) * 2017-11-29 2019-05-30 Electronics And Telecommunications Research Institute Apparatus for processing convolutional neural network using systolic array and method thereof
CN111028126A (en) * 2019-11-18 2020-04-17 中国航空工业集团公司西安航空计算技术研究所 Method for realizing convolution filtering of GPU image processing
CN111897579A (en) * 2020-08-18 2020-11-06 腾讯科技(深圳)有限公司 Image data processing method, image data processing device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN112395092B (en) 2023-06-02
CN112395092A (en) 2021-02-23

Similar Documents

Publication Publication Date Title
WO2022110386A1 (en) Data processing method and artificial intelligence processor
US10997496B2 (en) Sparse convolutional neural network accelerator
CA3070972C (en) Accelerated mathematical engine
US11119765B2 (en) Processor with processing cores each including arithmetic unit array
CN106445471A (en) Processor and method for executing matrix multiplication on processor
US11487845B2 (en) Convolutional operation device with dimensional conversion
US8441492B2 (en) Methods and apparatus for image processing at pixel rate
US20210019594A1 (en) Convolutional neural network accelerating device and method
CN110188869B (en) Method and system for integrated circuit accelerated calculation based on convolutional neural network algorithm
CN110674927A (en) Data recombination method for pulse array structure
US11915118B2 (en) Method and apparatus for processing computation of zero value in processing of layers in neural network
US10402196B2 (en) Multi-dimensional sliding window operation for a vector processor, including dividing a filter into a plurality of patterns for selecting data elements from a plurality of input registers and performing calculations in parallel using groups of the data elements and coefficients
CN111767994A (en) Neuron calculation module
CN114169514B (en) Convolution hardware acceleration method and convolution hardware acceleration circuit
CN114219699B (en) Matching cost processing method and circuit and cost aggregation processing method
EP4318275A1 (en) Matrix multiplier and method for controlling matrix multiplier
CN116888591A (en) Matrix multiplier, matrix calculation method and related equipment
US11194490B1 (en) Data formatter for convolution
US11429850B2 (en) Performing consecutive mac operations on a set of data using different kernels in a MAC circuit
CN112712457A (en) Data processing method and artificial intelligence processor
Solovyev et al. Real-Time Recognition of Handwritten Digits in FPGA Based on Neural Network with Fixed Point Calculations
JP2022074442A (en) Arithmetic device and arithmetic method
Liguori A MAC-less Neural Inference Processor Supporting Compressed, Variable Precision Weights
CN108805846B (en) Method and system for optimizing binary image processing
US20220269752A1 (en) Execution method for convolution computation

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20963274; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: PCT application non-entry in European phase (Ref document number: 20963274; Country of ref document: EP; Kind code of ref document: A1)