WO2022110386A1 - Data processing method and artificial intelligence processor - Google Patents

Data processing method and artificial intelligence processor

Info

Publication number
WO2022110386A1
WO2022110386A1 (application PCT/CN2020/137453)
Authority
WO
WIPO (PCT)
Prior art keywords
weight
data
pixel data
storage
row
Prior art date
Application number
PCT/CN2020/137453
Other languages
English (en)
Chinese (zh)
Inventor
裴京
施路平
徐明坤
王冠睿
马骋
Original Assignee
清华大学
Application filed by Tsinghua University (清华大学)
Publication of WO2022110386A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/50 Allocation of resources, e.g. of the central processing unit [CPU]
    • G06F 9/5005 Allocation of resources to service a request
    • G06F 9/5027 Allocation of resources to service a request, the resource being a machine, e.g. CPUs, servers, terminals
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/15 Correlation function computation including computation of convolution operations
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present disclosure relates to the field of computer technology, and in particular, to a data processing method and an artificial intelligence processor.
  • Neuromorphic chips based on brain-like computing are important platforms for realizing biologically interpretable brain-like algorithms such as spiking neural networks.
  • the convolution operation is one of the important logical operations for realizing artificial neural networks on neuromorphic chips based on a many-core architecture.
  • the present disclosure proposes a data processing method and an artificial intelligence processor to efficiently implement convolution operations.
  • a data processing method is provided, which is applied to a processing core of an artificial intelligence processor. The artificial intelligence processor includes a plurality of processing cores, and each processing core includes a storage unit and an operation unit. The storage unit is used to store the pixel data of an image and the weight data of N convolution kernels; the operation unit includes a multiplier-accumulator (MAC) array for performing operations according to the pixel data and the weight data, wherein the size of the image is W0 × H0 × C0, the size of each convolution kernel is K × K × C0, the stride in the row direction is Sx, and W0, H0, C0, K, and Sx are positive integers. The method includes: reading first pixel data from the storage unit according to a preset pixel read bit width, where the first pixel data includes M consecutive pixel data of the mth channel, Pyth row of the image, 1 ≤ m ≤ C0, 1 ≤ Py ≤ H0, 1 ≤ M ≤ W0.
  • the method further includes: for the qth column MAC, after completing the operations of the K rows of the k convolution kernels, obtaining a second convolution operation result of the mth channel; after the convolution operation results of the C0 channels are obtained, adding the convolution operation results of the C0 channels of each convolution kernel to obtain a target convolution operation result output by the qth column MAC.
  • the method further includes: storing the weight data of the N convolution kernels according to a weight storage bit width, where the weight storage bit width is consistent with the weight read bit width. Storing the weight data of the N convolution kernels according to the weight storage bit width includes: for each of the N convolution kernels, arranging the weight data of the convolution kernel vertically into a first weight vector in the order of the row direction, the column direction, and the C0 channels; horizontally aligning and merging the first weight vectors of the N convolution kernels into a first weight matrix; and storing the weight data in the first weight matrix horizontally according to the weight storage bit width.
  • storing the weight data in the first weight matrix horizontally according to the weight storage bit width includes: when N is greater than the number of columns Q of the MAC array, splitting the first weight matrix vertically every Q columns to obtain F second weight matrices, where F = ⌈N/Q⌉; when the width of the second weight matrix is less than or equal to the weight storage bit width, storing the weight data in the fth second weight matrix in the order of the row direction and then the column direction, 1 ≤ f ≤ F; and arranging the (f-1)th second weight matrix before the fth second weight matrix. The width of the second weight matrix is equal to Q multiplied by the first storage unit of the weight data, and the first storage unit of the weight data is determined according to the data type of the weight data.
  • the method further includes: when the width of the second weight matrix is greater than the weight storage bit width, for the fth second weight matrix, splitting the fth second weight matrix vertically every weight storage bit width to obtain F0 third weight matrices, where F0 = ⌈(width of the second weight matrix)/(weight storage bit width)⌉; storing the weight data in the f0th third weight matrix in the order of the row direction and then the column direction, 1 ≤ f0 ≤ F0; and arranging the (f0-1)th third weight matrix before the f0th third weight matrix.
  • the method further includes: storing the pixel data of the image according to a pixel storage bit width, where the pixel storage bit width is consistent with the pixel read bit width. Storing the pixel data of the image according to the pixel storage bit width includes: dividing the pixel data of the mth channel, Pyth row of the image into B first storage vectors of b consecutive pixel data each, where B is equal to W0 divided by b and rounded up, 1 ≤ b ≤ W0; for each first storage vector, splitting the first storage vector into E second storage vectors of b bytes each, where b bytes is less than or equal to the pixel storage bit width; and storing the E second storage vectors in sequence according to the pixel storage bit width, filling address space short of the pixel storage bit width with 0, thereby storing the pixel data of the mth channel, Pyth row in sequence.
  • the method further includes: determining, according to the first storage vector corresponding to the pixel data at position X[aSx] in the mth channel, Pyth row of the image, a first storage start address of that first storage vector in the storage unit; and reading third pixel data from the storage unit according to the preset pixel read bit width and the first storage start address, where the third pixel data includes M consecutive pixel data read from the first storage start address, so that the operation unit can continue the operation.
  • the method further includes: after completing the convolution operation between the k convolution kernels and the K rows of pixel data, determining a second storage start address of the first pixel data of the row spaced Sy-1 rows after the first row of the K rows of pixel data; and reading fourth pixel data from the storage unit according to the preset pixel read bit width and the second storage start address, where the fourth pixel data includes M consecutive pixel data read from the second storage start address, so that the operation unit continues to operate.
  • the multiplier-accumulator MAC array includes an array based on a crossbar matrix structure; the operation unit further includes at least one buffer module, configured to read pixel data from the storage unit according to the preset pixel read bit width and to read weight data from the storage unit according to a preset weight read bit width.
  • an artificial intelligence processor includes a plurality of processing cores, each processing core includes a storage unit and an operation unit, the storage unit is used for storing the pixel data of an image and the weight data of N convolution kernels, and the operation unit includes a multiplier-accumulator MAC array for performing operations according to the pixel data and the weight data, wherein the processing core performs the convolution operation through any one of the above data processing methods.
  • according to the stride Sx of the convolution kernel, a pixel data corresponding to position T of the convolution kernel are selected from the first pixel data as second pixel data; for the qth column MAC in the MAC array, the second pixel data are multiplied by the qth weight data in the first weight data and added to the result of the (T-1)th operation, obtaining a first convolution operation results of the Tth operation of the qth column MAC. In this way, multi-point parallel convolution between multiple pixel data and the weight data corresponding to multiple convolution kernels can be realized in each operation, improving the efficiency of the convolution operation and thereby the operation efficiency of the artificial intelligence processor.
  • FIG. 1 shows a flowchart of a data processing method according to an embodiment of the present disclosure
  • FIG. 2 shows a schematic diagram of storage of pixel data according to an embodiment of the present disclosure
  • FIG. 3 shows a schematic diagram of N convolution kernels according to an embodiment of the present disclosure
  • FIG. 4a shows a schematic diagram of a first weight vector according to an embodiment of the present disclosure
  • FIG. 4b shows a schematic diagram of a first weight matrix according to an embodiment of the present disclosure
  • FIG. 5 shows a schematic diagram of storage of weight data according to an embodiment of the present disclosure
  • FIG. 6 shows a schematic diagram of splitting a second weight matrix according to an embodiment of the present disclosure
  • FIG. 7 shows a schematic structural diagram of a MAC array according to an embodiment of the present disclosure
  • FIG. 8a shows a block diagram of an artificial intelligence processor according to an embodiment of the present disclosure
  • FIG. 8b shows a block diagram of a processing core according to an embodiment of the present disclosure
  • FIG. 9a shows a schematic diagram of selecting pixel data according to an embodiment of the present disclosure.
  • FIG. 9b shows yet another schematic diagram of selecting pixel data according to an embodiment of the present disclosure.
  • the artificial intelligence processor may be a neuromorphic chip based on a many-core architecture.
  • Various artificial intelligence algorithms can be implemented based on the artificial intelligence processor.
  • the artificial intelligence processor can include multiple processing cores, and each processing core can include a storage unit and an arithmetic unit.
  • the storage unit can be used to store the data to be operated on, and the operation unit can be used to perform logical and arithmetic operations.
  • the present disclosure does not limit the specific type of the artificial intelligence processor.
  • the convolution operation occupies a large part of the total calculation amount, and as the depth and/or breadth of the convolutional neural network increases, the efficiency of the convolution operation may have a greater impact on the operating efficiency of the artificial intelligence processor; therefore, improving the efficiency of the convolution operation can improve the operating efficiency of the artificial intelligence processor to a certain extent.
  • when a neuromorphic chip based on a many-core structure implements the convolution operation, it generally expands the multiple input channels of the input image into a one-dimensional vector and performs multiply-accumulate calculations on the pixel data and the corresponding weight data one by one. Due to the structural limitation of the multiplier-accumulator MAC in current neuromorphic chips, each operation can only compute the products of a single pixel data with the weight data corresponding to multiple convolution kernels, and the convolution result is output after accumulation.
  • the operation unit in the embodiment of the present disclosure may include a multiplier-accumulator MAC array, and the MAC array may include an array based on a crossbar matrix structure.
  • the MAC array may include A rows × Q columns of MACs.
  • the specific values of A and Q can be set according to actual requirements. Considering that the number N of convolution kernels is usually a power of 2, the MAC array can be, for example, a 4 ⁇ 32 MAC array.
  • the embodiment of the present disclosure does not limit the structure of the MAC array in the operation unit. Based on the MAC array in the embodiment of the present disclosure, parallel convolution operations between multiple pixel data and weight data corresponding to multiple convolution kernels can be implemented, thereby improving the efficiency of convolution operations.
  • the storage unit in each processing core may be used to store the pixel data of the image and the weight data of the N convolution kernels.
  • the operation unit may include a multiplier-accumulator MAC array for performing operations according to pixel data and weight data, wherein the size of the image may be width W0 × height H0 × number of channels C0, the size of the convolution kernel may be width K × height K × number of channels C0, and the stride in the row direction may be Sx, where W0, H0, C0, K, and Sx are positive integers.
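For orientation, the standard convolution output-size relation implied by these parameters can be sketched in Python. This is general convolution arithmetic, not a formula stated in the text; the function name is illustrative, and padding and dilation are assumed absent:

```python
def out_width(W0: int, K: int, Sx: int) -> int:
    # Number of valid kernel placements along one row of width W0,
    # for a kernel of width K and row-direction stride Sx (no padding).
    return (W0 - K) // Sx + 1
```

With a padding parameter or dilation rate Ex (both mentioned later among the primitive parameters), the effective kernel width and row width would change accordingly.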
  • the pixel data and the weight data in the embodiments of the present disclosure may be data to be subjected to a convolution operation.
  • the embodiments of the present disclosure do not limit the size and quantity of pixel data and weight data.
  • FIG. 1 shows a flowchart of a data processing method according to an embodiment of the present disclosure. As shown in FIG. 1 , the data processing method includes:
  • Step 11: read first pixel data from the storage unit according to the preset pixel read bit width, where the first pixel data includes M consecutive pixel data of the mth channel, Pyth row of the image, 1 ≤ m ≤ C0, 1 ≤ Py ≤ H0, 1 ≤ M ≤ W0;
  • Step 12: during the Tth operation of the Kyth row of the k convolution kernels, read first weight data from the storage unit according to the preset weight read bit width, where the first weight data includes the weight data of the k convolution kernels at the mth channel, Kyth row, convolution kernel position T, 1 ≤ k ≤ N, 1 ≤ T ≤ K, 1 ≤ Ky ≤ K;
  • Step 13: according to the stride Sx of the convolution kernel, select a pixel data corresponding to position T of the convolution kernel from the first pixel data as second pixel data, 1 ≤ a ≤ M;
  • Step 14: when T > 1, for the qth column MAC in the MAC array, multiply the second pixel data by the qth weight data in the first weight data and add the result to the result of the (T-1)th operation, obtaining a first convolution operation results of the Tth operation of the qth column MAC, 1 ≤ q ≤ k.
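The four steps above can be sketched as a small Python model. This is an illustrative simulation, not the patent's hardware: it assumes a single channel and a single kernel row, computes `A` output positions in parallel across `k` kernels, and all names are inventions of the sketch:

```python
def mac_row_pass(pixels, weights, Sx, A):
    """One pass over a kernel row.

    pixels  -- one image row (list of ints), the 'first pixel data'
    weights -- k lists of K per-position weights, one list per kernel
    Sx      -- row-direction stride
    A       -- number of output positions computed in parallel
    """
    k = len(weights)        # number of convolution kernels
    K = len(weights[0])     # kernel width
    acc = [[0] * k for _ in range(A)]    # A parallel outputs x k kernels
    for T in range(K):                   # T-th operation of the kernel row
        for a in range(A):               # a-th parallel output position
            px = pixels[a * Sx + T]      # 'second pixel data' for position T
            for q in range(k):           # q-th column MAC
                acc[a][q] += px * weights[q][T]  # multiply-accumulate
    return acc
```

After K operations, `acc[a][q]` holds the partial (one-row) convolution result of output position a for kernel q, matching the per-row accumulation the steps describe.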
  • the data processing method in the embodiment of the present disclosure may be applied to an artificial intelligence processor.
  • the parameters required for performing the convolution operation may also be obtained through primitive parameters, which may include the data required for performing the convolution operation, for example: image size W0 × H0 × C0, convolution kernel size K × K × C0, number of convolution kernels N, row-direction stride Sx, column-direction stride Sy, dilation rate Ex, padding parameter, and bias parameter Bias; the specific form of the primitive parameters is not limited in this embodiment of the present disclosure.
  • according to the stride Sx of the convolution kernel, a pixel data corresponding to position T of the convolution kernel are selected from the first pixel data as the second pixel data, and for the qth column MAC, a first convolution operation results of the Tth operation are obtained. This realizes multi-point parallel convolution between multiple pixel data and the weight data corresponding to multiple convolution kernels in each operation, thereby improving the efficiency of the convolution operation and the operation efficiency of the artificial intelligence processor.
  • the data processing method may further include: storing pixel data of the image according to the pixel storage bit width, wherein the pixel storage bit width is consistent with the pixel read bit width, So that in step 11, the first pixel data is read from the storage unit according to the preset pixel read bit width.
  • storing the pixel data of the image may include: dividing the pixel data of the mth channel, Pyth row of the image into B first storage vectors of b consecutive pixel data each, where B is equal to W0 divided by b and rounded up, 1 ≤ b ≤ W0; for each first storage vector, splitting the first storage vector into E second storage vectors of b bytes each, where b bytes is less than or equal to the pixel storage bit width; and storing the E second storage vectors in sequence according to the pixel storage bit width, filling address space short of the pixel storage bit width with 0, thereby storing the pixel data of the mth channel, Pyth row in sequence.
  • the pixel storage bit width is 32 bytes.
  • since the ninth second storage vector contains 10 pixel data, i.e., its width is 10B, which is less than 32B, the address space of this second storage vector that falls short of the pixel storage bit width is filled with 0 in the storage unit, which completes the storage of the pixel data of the first channel, first row of the image.
  • after the pixel data of the first channel, first row are stored in sequence, the pixel data of the first channel, second row are stored, and so on until all rows of the first channel are stored; then the pixel data of all rows of the second channel are stored, until the pixel data of all channels are stored.
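A minimal Python sketch of the row-splitting and zero-filling described above, assuming a channel-row represented as a plain list of integers; the function name and the decision to pad to width `b` are illustrative simplifications, not the patent's storage routine:

```python
import math

def store_pixel_row(row, b):
    # Split one channel-row into B = ceil(W0 / b) first storage vectors of
    # b consecutive pixel data each; the last vector is zero-filled so that
    # every stored vector has the full width b.
    W0 = len(row)
    B = math.ceil(W0 / b)
    vectors = []
    for i in range(B):
        chunk = row[i * b:(i + 1) * b]
        vectors.append(chunk + [0] * (b - len(chunk)))  # zero-fill the tail
    return vectors
```

For example, a 10-pixel row with b = 16 would yield a single vector whose last 6 slots are zero, matching the zero-filled tail in the 16B storage example above.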
  • the number E of the second storage vectors obtained by splitting is related to the second storage unit of the pixel data, and the second storage unit of the pixel data is determined according to the data type of the pixel data. For example, if the second storage unit of pixel data is 2 bytes, for each first storage vector, the first storage vector may be divided into 8 second storage vectors every 16 bytes.
  • the data type may include multi-precision data types such as ternary (-1, 0, 1), int8, uint8, etc.
  • the embodiment of the present disclosure does not limit the data type of pixel data.
  • the specific value of b can be set according to actual needs; since the amount of pixel data per row is often a multiple of 16, b can be set to a multiple of 16, for example 16 or 32, which is not limited in this embodiment of the present disclosure.
  • the b byte is less than or equal to the pixel storage bit width, so as to align and store the pixel data in the storage unit.
  • the pixel storage bit width may be the storage width of pixel data in the storage unit, set according to actual requirements; it may be a multiple of 16, for example 16B, 32B, or 64B, which is not limited in this embodiment of the present disclosure.
  • the pixel storage bit width may be consistent with the pixel read bit width.
  • FIG. 2 shows a schematic diagram of storing pixel data according to an embodiment of the present disclosure.
  • Px represents the Px-th column of the image X
  • Py represents the Py-th row of the image X
  • RGB represents the red, green, and blue channels of the image.
  • the first 16B of storage space stores the 0th to 15th pixel data of the first row of the R channel of image X, namely X[0][0:15], and so on up to X[0][Px-1], meaning the first row is stored; the address space of this row that falls short of 16B in the storage space is filled with 0, and the pixel data of the second row are stored after the pixel data of the first row.
  • the storage efficiency of the pixel data can be improved, and it is convenient to read the pixel data corresponding to the weight data from the storage unit.
  • the data processing method may further include: storing weight data of N convolution kernels according to the weight storage bit width.
  • the weight storage bit width is consistent with the weight read bit width, so that in step 12, the first weight data is read from the storage unit according to the preset weight read bit width.
  • storing the weight data of N convolution kernels according to the weight storage bit width may include: for each of the N convolution kernels, arranging its weight data vertically into a first weight vector in the order of the row direction, the column direction, and the C0 channels; horizontally aligning and merging the first weight vectors of the N convolution kernels into a first weight matrix; and storing the weight data in the first weight matrix horizontally according to the weight storage bit width.
  • FIG. 3 shows a schematic diagram of N convolution kernels according to an embodiment of the present disclosure.
  • Fig. 4a shows a schematic diagram of a first weight vector according to an embodiment of the present disclosure.
  • FIG. 4b shows a schematic diagram of a first weight matrix according to an embodiment of the present disclosure.
  • the weight data of convolution kernel 1 can be arranged vertically in the order of the row direction, the column direction, and the channels C0 into the first weight vector shown in Fig. 4a; the first weight vectors corresponding to the other convolution kernels follow by analogy and are not repeated. The first weight vectors of the N convolution kernels are horizontally aligned and merged into the first weight matrix shown in Fig. 4b, and the first weight matrix is then stored according to the weight storage bit width, realizing the storage of the weight data.
  • the sequential storage of the weight data can be realized, and the storage efficiency of the weight data can be improved.
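The arrangement just described (row direction, then column direction, then channel, with one column per kernel) can be sketched in Python. The kernel layout `kernel[c][ky][kx]`, the function name, and the row-major loop order are assumptions of this sketch:

```python
def first_weight_matrix(kernels):
    """kernels: list of N kernels, each indexed kernel[c][ky][kx]
    with C0 channels and K x K spatial weights.

    Each kernel is flattened in row-direction, column-direction, channel
    order into one column; the N columns form the first weight matrix,
    returned as a list of (C0*K*K) rows of N entries each."""
    N = len(kernels)
    C0 = len(kernels[0])
    K = len(kernels[0][0])
    rows = []
    for c in range(C0):              # channel order last-varying
        for ky in range(K):          # column direction
            for kx in range(K):      # row direction first-varying
                rows.append([kernels[n][c][ky][kx] for n in range(N)])
    return rows
```

Each row of the result then corresponds to the k weight values read together in step 12 for one (channel, Ky, T) position.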
  • horizontally storing the weight data in the first weight matrix according to the weight storage bit width may include: when N is greater than the number of columns Q of the MAC array, splitting the first weight matrix vertically every Q columns to obtain F second weight matrices, where F = ⌈N/Q⌉ (⌈ ⌉ indicates rounding up); when the width of the second weight matrix is less than or equal to the weight storage bit width, storing the weight data in the fth second weight matrix in the order of the row direction and then the column direction, 1 ≤ f ≤ F, filling address space short of the weight storage bit width with 0; and arranging the (f-1)th second weight matrix before the fth second weight matrix. The width of the second weight matrix is equal to Q multiplied by the first storage unit of the weight data, which is determined according to the data type of the weight data.
  • the data type may include multi-precision data types such as ternary (-1, 0, 1), int8, uint8, etc.
  • the embodiment of the present disclosure does not limit the data type of the weight data.
  • for example, suppose the weight storage bit width is 32 bytes (B), there are 64 convolution kernels, the number of columns of the MAC array is 32, and the first storage unit of the weight data is 2 bits.
  • the first weight matrix is split into two second weight matrices, and the first second weight matrix is arranged before the second second weight matrix.
  • the weight data in the first second weight matrix are stored in the order of the row direction and then the column direction, followed by the weight data in the second second weight matrix in the same order; for each row of weight data, the address space in the storage unit that falls short of the weight storage bit width is filled with 0, which completes the storage of that row of the second weight matrix, after which the next row of weight data is stored.
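The split into F = ⌈N/Q⌉ second weight matrices is a column-wise partition of the first weight matrix, which can be sketched as follows; this is an illustrative helper operating on lists of rows, not the patent's storage routine:

```python
import math

def split_by_columns(matrix, Q):
    """Vertically split a weight matrix (list of rows, each with N entries)
    into F = ceil(N / Q) sub-matrices of at most Q columns each,
    preserving column order (the f-th part precedes the (f+1)-th)."""
    N = len(matrix[0])
    F = math.ceil(N / Q)
    return [[row[f * Q:(f + 1) * Q] for row in matrix] for f in range(F)]
```

The last part may be narrower than Q columns when N is not a multiple of Q, which is where the zero-filling of address space short of the weight storage bit width comes in.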
  • when the width of the second weight matrix is greater than the weight storage bit width, for the fth second weight matrix, the fth second weight matrix is split vertically every weight storage bit width to obtain F0 third weight matrices, where F0 = ⌈(width of the second weight matrix)/(weight storage bit width)⌉; the weight data in the f0th third weight matrix are stored in the order of the row direction and then the column direction, 1 ≤ f0 ≤ F0, and the (f0-1)th third weight matrix is arranged before the f0th third weight matrix.
  • for example, if the weight storage bit width is 32B, there are 64 convolution kernels, the number of columns of the MAC array is 32, and the first storage unit of the weight data is 2B, then each second weight matrix is 64B wide and can be split vertically every 32B to obtain two third weight matrices. This is equivalent to splitting the first weight matrix vertically every 32B into 4 third weight matrices, whose weight data are stored in the order of the row direction and then the column direction.
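The arithmetic of this example can be checked directly; all numbers are taken from the text, and the variable names are invented for the sketch:

```python
import math

# 64 kernels, a 32-column MAC array, 2-byte weights, 32B weight storage bit width.
N, Q = 64, 32
unit_B = 2      # first storage unit of the weight data, in bytes
bw_B = 32       # weight storage bit width, in bytes

F = math.ceil(N / Q)                    # number of second weight matrices
second_width_B = Q * unit_B             # width of each second weight matrix
F0 = math.ceil(second_width_B / bw_B)   # third matrices per second matrix
total_third = F * F0                    # third weight matrices overall
```

This reproduces the text's figures: two second weight matrices of 64B each, each split into two third weight matrices, 4 in total.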
  • the number N of convolution kernels may also be less than or equal to the number of columns Q of the MAC array. In a possible implementation, horizontally storing the weight data in the first weight matrix according to the weight storage bit width may also include: when N is less than or equal to Q and the width of the first weight matrix is greater than the weight storage bit width, splitting the first weight matrix vertically every weight storage bit width to obtain F1 fourth weight matrices; storing the weight data in the f1th fourth weight matrix in the order of the row direction and then the column direction, 1 ≤ f1 ≤ F1; and arranging the (f1-1)th fourth weight matrix before the f1th fourth weight matrix. The width of the first weight matrix is equal to N multiplied by the first storage unit of the weight data.
  • for example, if the weight storage bit width is 32B, there are 16 convolution kernels, the number of columns of the MAC array is 32, and the first storage unit of the weight data is 4B, then the number of convolution kernels is less than the number of columns of the MAC array and the first weight matrix is 64B wide; the first weight matrix can be split vertically every 32B to obtain two fourth weight matrices, which are stored in the order of the row direction and then the column direction, with the first fourth weight matrix arranged before the second fourth weight matrix.
  • when N is less than or equal to the number of columns Q of the MAC array and the width of the first weight matrix is less than or equal to the weight storage bit width, the weight data in the first weight matrix are stored in the order of the row direction and then the column direction, and for each row of weight data the address space in the storage unit that falls short of the weight storage bit width is filled with 0.
  • FIG. 5 shows a schematic diagram of storing weight data according to an embodiment of the present disclosure.
  • Kx represents the Kx column of the convolution kernel
  • Ky represents the Ky-th row of the convolution kernel
  • RGB represents the three channels of the convolution kernel corresponding to the red, green and blue channels of the image
  • F0 represents the first target weight matrix, F1 represents the second target weight matrix, and so on.
  • R channel_F0 indicates that the first target weight matrix under the first channel of the convolution kernel is stored: the first 32B stores the first row of the first target weight matrix, the second 32B stores the second row, and so on, where [0,0] represents the first weight data of the first row under this channel, [Ky-1,Kx-1] represents the Kxth weight data of the Kyth row under this channel, and so on. The first target weight matrix F0 is arranged before the second target weight matrix F1, and address space short of the weight storage bit width is filled with 0.
  • the weight storage bit width, i.e., the storage width of the weight data in the storage unit, can be set according to actual requirements.
  • the number of convolution kernels in the convolution layer is usually a multiple of 16 , for example, 32, 64, 128, 256, etc.
  • the weight storage bit width can be set to be a multiple of 16, for example, 32 bytes, 64 bytes, etc., which is not limited by this embodiment of the present disclosure.
  • the weight storage bit width and the weight read bit width may be consistent, so that the cache module reads the first weight data from the storage unit in step 12 according to the preset weight read bit width.
  • the weight data is stored according to the number of columns Q of the MAC array and the weight storage bit width, which can improve the storage efficiency of the weight data, so that the weight data sequentially read from the storage unit in each operation corresponds to the pixel data, further improving the efficiency of the convolution operation.
  • the operation unit of each processing core of the artificial intelligence processor may further include at least one cache module, which may be configured to read pixel data from the storage unit according to the preset pixel read bit width and to read weight data from the storage unit according to the preset weight read bit width. In step 11, reading the first pixel data from the storage unit according to the preset pixel read bit width may be performed through at least one cache module; likewise, in step 12, reading the first weight data from the storage unit according to the preset weight read bit width may be performed through at least one cache module.
  • the cache module may use a register, a dual-port random access memory, a non-volatile memory, or other memory that can implement shift fetching, which is not limited by this embodiment of the present disclosure.
  • the size and quantity of the cache module may be set according to actual requirements.
  • the cache module may be larger than the pixel read bit width and the weight read bit width. For example, if the pixel read bit width is 32B, a 48B register can be selected to ensure continuous loading of data during the operation, thereby ensuring the continuity of the operation.
  • one or more cache modules can be used according to actual requirements. For example, a 48B register can be built from three 16B registers or implemented as a single 48B register; using multiple cache modules allows the cache modules to be multiplexed and improves resource utilization.
  • if the pixel data or weight data to be loaded is narrower than the read width, the cache module loads the data at that width and fills the remaining storage space with 0. For a 48B register, for example, if the pixel data or weight data is less than 16B, 16B of data is loaded and the rest of the register is zero-filled; if it is less than 32B, 32B of data is loaded and the rest of the register is zero-filled.
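As a non-limiting sketch of the zero-fill rule above (the function name is illustrative, and data is modeled as a list of byte values):

```python
def load_into_register(data, reg_size=48):
    """Load pixel or weight data into a fixed-size cache register,
    filling all remaining storage space of the register with 0, as
    described above for the 48B register example."""
    reg = list(data[:reg_size])
    reg += [0] * (reg_size - len(reg))
    return reg
```

A 3-byte load thus occupies the first three positions of the 48B register, with the remaining 45 positions zero-filled.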
  • the reading of the first pixel data in step 11 may be continuous.
  • the cache module may read continuous pixel data from the storage unit to ensure the continuity of the operation. For example, if a 48B register is used to read data, whenever 16B of data has been shifted out of the register, the register loads the next 16B of data from the storage unit, thus ensuring the continuity of operations.
  • reading the first weight data from the storage unit according to the preset weight reading bit width may include:
  • the target weight matrix may include the second weight matrix or the third weight matrix.
  • the target weight matrix may further include a first weight matrix or a fourth weight matrix.
  • the convolution kernel position T may refer to the T-th weight data of the Ky-th row of the m-th channel of the convolution kernel.
  • reading the second weight data from the storage unit may be performed by first determining the starting address of the weight data to be read, that is, the storage address corresponding to the weight data in the L-th row of the target weight matrix, and then reading the weight data of the (L+T-1)-th row by sequential addressing, that is, by incrementing the address by 1.
  • FIG. 6 shows a schematic diagram of splitting a second weight matrix according to an embodiment of the present disclosure.
  • the weight data of the first row of the second weight matrix, such as a1 and e1, can be read, which is equivalent to reading the first row of the 32 convolution kernels.
  • the weight data of the second row of the second weight matrix, such as a2 and e2, can be read, which is equivalent to reading the second row of the k convolution kernels.
  • in this way the first weight data is read row by row.
  • step 13 according to the step size Sx of the convolution kernel, select a pixel data corresponding to the position T of the convolution kernel from the first pixel data as the second pixel data, including:
  • when implementing the convolution operation between the convolution kernel and the image, the convolution kernel usually performs the convolution operation according to a moving step size in the row direction and in the column direction.
  • the expansion rate Ex of the convolution kernel can be set according to the actual convolution operation requirements.
  • when the expansion rate Ex > 1, a dilated convolution operation is performed.
  • the value of a may be less than or equal to the row number A of the MAC array, for example, for a 4 ⁇ 32 MAC array, a may be an integer in [1, 4].
  • a pieces of second pixel data are selected from the first pixel data according to the step size Sx and the expansion rate Ex, so that the selected pixel data correspond to the weight data of multiple convolution kernels; thus the convolution operation of pixel data and weight data can be realized accurately, and the dilated convolution operation can also be supported.
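The gating of the second pixel data by stride and expansion rate can be sketched as follows. This is a non-limiting illustration; the offset (T-1)·Ex for position T is an assumption inferred from the X[Ex], X[Sx+Ex], ... example in this description, and all names are illustrative.

```python
def select_pixels(first_pixel_data, t, sx, ex=1, a=4):
    """Select `a` pixels for the T-th operation of a kernel row:
    position T starts at offset (t - 1) * ex, and successive MAC
    rows step by the stride sx.  ex = 1 corresponds to an ordinary
    convolution, ex > 1 to a dilated convolution."""
    base = (t - 1) * ex
    return [first_pixel_data[base + i * sx] for i in range(a)]
```

With Sx = 3 this gates pixels 0, 3, 6, 9 for T = 1 and pixels 1, 4, 7, 10 for T = 2 (when Ex = 1).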
  • the first weight data read in step 12 and the second pixel data selected in step 13 can be input into the MAC array through the cache module, and then in step 14 the multiply-accumulation of the second pixel data and the corresponding weight data, that is, the convolution operation of the weight data and the pixel data, is realized.
  • FIG. 7 shows a schematic structural diagram of a MAC array according to an embodiment of the present disclosure.
  • the schematic MAC array shown in FIG. 7 is used as an example for illustration. As shown in FIG. 7, each circle contains 4 MACs, so a can be 4; there are 5 columns of MACs, so k can be 5.
  • the selected second pixel data X[Ex], X[Sx+Ex], X[2Sx+Ex], X[3Sx+Ex] are input into the MAC array from the row direction of the MAC array; the first weight data at the m-th channel, the Ky-th row, and convolution kernel position 2 of the 5 convolution kernels are input into the MAC array from the column direction.
  • the q-th column of MACs in the MAC array can then obtain the products of the 4 pixel data and the q-th first weight data, respectively.
  • the products of the second pixel data (X[0], X[Sx], X[2Sx], X[3Sx]) and the first weight data (the weight data at the m-th channel, the Ky-th row, kernel position 1 of the 5 convolution kernels) can be obtained.
  • a multi-point parallel convolution operation between multiple pixel data and the weight data corresponding to multiple convolution kernels can be implemented in each operation, so that the convolution operation is carried out efficiently and the operating efficiency of the artificial intelligence processor is improved.
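One step of this multi-point parallel operation can be modeled as an a × k outer product with accumulation. This is a plain-Python sketch of the arithmetic only, not of the crossbar hardware; all names are illustrative.

```python
def mac_step(pixels, weights, acc=None):
    """One MAC-array step: each of the a gated pixels (row direction)
    is multiplied by each of the k per-kernel weights (column
    direction), giving a*k products in parallel; `acc` carries the
    running sums from earlier operations of the same kernel row."""
    a, k = len(pixels), len(weights)
    if acc is None:
        acc = [[0] * a for _ in range(k)]
    for q in range(k):          # the q-th column of MACs
        for i in range(a):      # the i-th gated pixel
            acc[q][i] += pixels[i] * weights[q]
    return acc
```

Calling `mac_step` repeatedly with the running `acc` mirrors the accumulation across the T operations of a kernel row.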
  • the q-th weight data is multiplied by the second pixel data and the result accumulated with the convolution operation result of the K-th operation of the (Ky-1)-th row, to obtain a first convolution operation results of the 1st operation of the q-th column MAC, 1 ≤ q ≤ k.
  • the convolution operation result of the Kth operation in the Ky-1th row can be obtained by using the processing methods disclosed in steps 11 to 14 in the above embodiments of the present disclosure, and details are not described herein again.
  • the cyclic accumulation of the convolution operation results of each row of weight data and the corresponding pixel data can be implemented, so as to obtain the convolution operation result of the mth channel.
  • the data processing method may further include: Step 16, for the q-th column MAC, after completing the operations of the K rows of the k convolution kernels, obtaining a second convolution operation result of the m-th channel; Step 17, after the convolution operation results of the C0 channels are obtained, adding the convolution operation results of the C0 channels of each convolution kernel to obtain a target convolution operation result output by the q-th column MAC.
  • the convolution operation result of each channel is actually obtained by accumulating the convolution operation results of each row of the convolution kernels under that channel; the operations of the K rows of the k convolution kernels completed in step 16 are obtained by the processing methods disclosed in steps 11 to 15, and details are not repeated here.
  • a target convolution operation result output by the q-th column MAC can thus be obtained, which is equivalent to obtaining the values of a adjacent points in the same row of the k output maps.
  • the data processing method may further include:
  • determining, according to the first storage vector corresponding to the pixel data at position X[aSx] in the Py-th row of the m-th channel of the image, a first storage start address of that first storage vector in the storage unit;
  • reading third pixel data from the storage unit, the third pixel data including consecutive M pieces of pixel data read from the first storage start address, so that the operation unit can continue to operate.
  • the size of the output map can be obtained from the size of the input image, the size of the convolution kernel, and parameters such as the step size Sx in the row direction and the step size Sy in the column direction, for example P_out = (P_in - K)/S + 1, where P_out is the width or height of the output image, P_in is the width or height of the input image, K is the kernel size, and S is the step size in the row or column direction; it can thereby be determined, when the a target convolution operation results are obtained, whether the convolution operation of all pixel data and weight data of the input image in the row direction has been completed.
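The size relation can be checked numerically. P_out = (P_in − K)/S + 1 is the standard no-padding formula matching the definitions of P_out, P_in, and S above; the dilation term below is an assumption added for the dilated case, and the function name is illustrative.

```python
def output_size(p_in, k, s, dilation=1):
    """Output width/height for a no-padding convolution: the effective
    kernel extent grows with the dilation rate, and the stride s
    divides the remaining span."""
    k_eff = dilation * (k - 1) + 1
    return (p_in - k_eff) // s + 1
```

For example, a 7-pixel row convolved with a 3-wide kernel at stride 1 yields 5 output points.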
  • if the width of the output image is calculated to be 16 and the current q-th column MAC has output 4 target convolution operation results, it means that the convolution of all pixel data and weight data in the row direction of the input image has not yet been completed.
  • determining, according to the first storage vector corresponding to the pixel data at X[aSx] in the Py-th row, the first storage start address of that storage vector in the storage unit is needed because, for the first pixel data of the Py-th row, the second pixel data at X[0], X[Sx], X[2Sx], ..., X[(a-1)Sx] of the Py-th row of the m-th channel of the image were previously selected; therefore, according to the step size Sx of the convolution kernel, the next second pixel data for the convolution operation must be selected starting from X[aSx].
  • after the first storage vector corresponding to the pixel data at X[aSx] is determined and its first storage start address in the storage unit is obtained, the third pixel data is read from the storage unit according to the preset pixel read bit width and the first storage start address. This makes it easy and fast to determine the starting address from which the cache module fetches data from the storage unit, and since the first storage vectors are stored in alignment in the storage unit, fetching by the cache module is also facilitated.
  • the first storage vector corresponding to the pixel data at X[aSx] in the Py-th row of the m-th channel of the image can be determined by comparing aSx with nb-1, n ∈ [1, B].
  • each first storage vector is obtained by grouping every 16 pixel data.
  • for example, with first storage vectors [0,15], [16,31], [32,47], ..., if aSx = 12, since 12 is less than 15, the corresponding first storage vector is [0,15], and data must be read from the storage unit starting from the 0th pixel data.
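The look-up of the first storage vector containing pixel aSx can be sketched as follows (a non-limiting illustration; the function name is assumed):

```python
def storage_vector_start(pixel_index, b=16):
    """Return the [start, end] index range of the b-pixel first
    storage vector that contains `pixel_index`; `start` is also the
    pixel index at which the cache module begins fetching."""
    start = (pixel_index // b) * b
    return start, start + b - 1
```

aSx = 12 maps to the vector [0, 15], so reading starts at pixel 0, matching the example above.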
  • selecting, according to the step size Sx of the convolution kernel, a pieces of pixel data corresponding to the convolution kernel position T from the third pixel data as the second pixel data, includes:
  • determining the first storage start address in the storage unit from the first storage vector corresponding to the pixel data at X[aSx] in the Py-th row is one implementation provided by the embodiments of the present disclosure, but those skilled in the art can understand that the present disclosure should not be limited thereto. Under the inspiration of the embodiments of the present disclosure, those skilled in the art can likewise determine the first storage vector corresponding to the pixel data at X[2aSx] in the Py-th row of the m-th channel of the image, and thereby the first storage start address of that storage vector in the storage unit, and so on.
  • the embodiments of the present disclosure are not exhaustive.
  • the third pixel data is equivalent to the first pixel data in step 11.
  • the steps 11 to 16 of the above embodiments of the present disclosure may then be applied to the third pixel data to obtain another target convolution operation result output by each column of MACs, so that the convolution of all pixel data and weight data in the row direction of the image can be completed, that is, all values in the same row of the output image can be obtained.
  • the data processing method may further include: after completing the convolution operation of the k convolution kernels and K rows of pixel data, determining, according to the step size Sy of the convolution kernel in the column direction, the second storage start address of the first pixel data of the row spaced Sy-1 rows from the first of the K rows of pixel data; and reading fourth pixel data from the storage unit according to the preset pixel read bit width and the second storage start address, the fourth pixel data including consecutive M pieces of pixel data read from the second storage start address, so that the operation unit can continue to operate.
  • the fourth pixel data is equivalent to the first pixel data, so the data processing method disclosed in steps 11 to 17 of the above embodiments of the present disclosure can be applied to it to obtain the next target convolution operation result output by the q-th column MAC, thereby completing the convolution operation between the convolution kernels and the image.
  • the output map in the embodiments of the present disclosure may refer to a feature map obtained by the convolution operation, and the input image or image may refer to the original image or to a feature map on which a convolution operation has already been performed, which is not limited by this embodiment of the present disclosure.
  • Fig. 8a shows a block diagram of an artificial intelligence processor according to an embodiment of the present disclosure
  • Fig. 8b shows a block diagram of a processing core according to an embodiment of the present disclosure.
  • the artificial intelligence processor 100 includes a plurality of processing cores 101 , and as shown in FIG. 8 b , each processing core 101 includes a storage unit 102 and an operation unit 103 .
  • the storage unit 102 is used to store the pixel data of the image and the weight data of the N convolution kernels;
  • the operation unit 103 includes a multiplier-accumulator MAC array 104, which is used for performing the processing according to the pixel data and the weight data. operation.
  • the operation unit may further include at least one cache module 105, and the cache module is configured to read pixel data from the storage unit 102 according to a preset pixel read bit width, and to read weight data from the storage unit 102 according to a preset weight read bit width.
  • the cache module 105 may send the gated data into the MAC array for the convolution operation, and output the convolution operation result to the address space in the storage unit specified by the address generation module 106.
  • the operation unit may further include an address generation module 106 for generating an address pointer when the cache module reads data, so that the cache module 105 can implement sequential addressing and/or jump addressing according to the address pointer .
  • the MAC array 104 includes an array based on a crossbar switch matrix structure.
  • the MAC array 104 can be expanded into two dimensions of rows and columns, and can support multi-point parallel convolution operations.
  • the processing core 101 may perform a convolution operation by using the data processing method described in any one of the foregoing embodiments of the present disclosure.
  • the storage unit 102 can be used to store data according to specific storage logic for pixel data and weight data, wherein the storage logic for pixel data includes: the image of each channel is stored in sequence, and within each channel the pixel data is expanded into a vector along the image width direction, with every b consecutive pixel data stored as one storage vector.
  • each such vector is divided into multiple pieces according to b-byte alignment and stored piece by piece.
  • the storage sequence of different pixel data in the storage unit is first along the image width direction, and then along the image height direction.
  • the entire image is stored aligned according to the pixel storage bit width, with zero fill where the data falls short, to facilitate register fetching for the calculation.
  • the pixel storage bit width is greater than or equal to the set b bytes.
  • the weight data and the pixel data may specify a storage address in the storage unit 102 .
  • when reading the weight data, the cache module 105 can read the weight data from the starting address of its storage address, incrementing the address by one each time.
  • when reading pixel data across rows, an address jump is generated, that is, pixel data is read with a line jump.
  • a configurable address jump value can be set in the primitive parameters.
  • the address generation module 106 generates the target address according to the address jump value, and counting is performed by the loop clock counter built into the artificial intelligence processor. When the count meets the jump condition, the loop clock counter generates a jump signal, which instructs the cache module 105 to jump the address pointer to the target address generated by the address generation module 106.
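The interplay of sequential addressing and counter-driven jumps can be sketched as follows. This is illustrative only; the parameter names and the convention that the jump replaces the normal increment are assumptions.

```python
def generate_addresses(base, step, jump_value, jump_period, count):
    """Model of the address generation: the pointer normally advances
    by `step` (sequential addressing); every `jump_period` reads the
    loop counter fires and the pointer advances by `jump_value`
    instead (e.g. a line jump to the start of the next image row)."""
    addrs, addr = [], base
    for i in range(1, count + 1):
        addrs.append(addr)
        addr += jump_value if i % jump_period == 0 else step
    return addrs
```

With a jump value of 100 every 4 reads, for instance, the pointer walks four sequential addresses and then jumps ahead, emulating a row change.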
  • by using the artificial intelligence processor of the embodiments of the present disclosure, efficient convolution operations can be implemented, and the operation efficiency of the artificial intelligence processor can be improved.
  • the convolution kernel is four-dimensional data, with a total of Kx × Ky × C0 × N weight data; with N (the number of convolution kernels, that is, the number of output channels) as the vector length, the weight data is expanded into a weight matrix with a height of Kx × Ky × C0 and a width of N.
  • in the height direction, the weight data in the weight matrix are arranged in the order of the row direction first, then the column direction, and then the channel C0 direction.
  • each channel of the input image is expanded into a vector along the width direction, every 16 pixel data are stored consecutively as a first storage vector, the first storage vector is divided into multiple second storage vectors according to 16B alignment, and the second storage vectors are stored one by one.
  • the storage sequence of the input images in the storage unit is in the row direction first, and then in the column direction.
  • the entire input image is aligned according to 32B in the storage unit, and the address space less than 32B is filled with zeros to facilitate register fetching and calculation.
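The row alignment rule can be sketched as below. This is a non-limiting illustration assuming 1-byte pixels; the 32B boundary follows the description above, and the function name is assumed.

```python
def store_image_rows(rows, align=32):
    """Lay an image channel out row-major in a flat address space:
    each row is zero-filled up to the next 32B boundary so that every
    row starts at an aligned address, easing register fetches."""
    stored = []
    for row in rows:
        padded = list(row)
        target = -(-len(padded) // align) * align  # round up to align
        padded += [0] * (target - len(padded))
        stored.extend(padded)
    return stored
```

Two 20-pixel rows thus occupy 64 bytes, with positions 20 to 31 of each stored row zero-filled.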
  • a 48B shift register or three 16B registers can be used to read data from the storage unit.
  • 48B of data can be loaded in 3 clocks; for example, the adjacent 48B (the 0th to 47th pixels) of the first row of the first channel (usually the R channel) of the input image are loaded into the 48B register in three clocks, 16B at a time. If the width of the input image is less than 16B, only 16B of data is loaded, zero-filled where the width is insufficient; if the width of the input image is less than 32B, only 32B of data is loaded, zero-filled where the width is insufficient.
  • the reading operation can be controlled by the cyclic clock counter.
  • when selecting data from the register and outputting it to the MAC array for operation, whenever one 16B block is shifted out of the register, the register loads the next 16B block to maintain the continuity of the operation.
  • with a 4 × 32 2D MAC array, up to 4 pixel data can be multiplied at the same time by the weight data at the same position of up to 32 convolution kernels. For example, in one operation, the 4 pixel data X[0], X[Sx], X[2Sx], X[3Sx] in the register can be gated and multiplied together with the first weight data of the first row of the first channel of the 32 convolution kernels.
  • Step 1 get primitive parameters.
  • Step 2 Load the adjacent 48B (pixels 0 to 47) of the first row of the image R channel into the 48B register in 3 clocks, 16B at a time. Select the 4 pixel data X[0], X[Sx], X[2Sx], X[3Sx] from the register and send them to the 2D MAC array, multiplying them with the weight data at the same position of the 32 convolution kernels, so that 32 convolution operation results are obtained in parallel.
  • Step 3 Then shift along the row direction, gating X[Ex], X[Ex+Sx], X[Ex+2Sx], X[Ex+3Sx] and multiplying by the corresponding weight data, until the convolution operation of K pixel data with the corresponding weight data is completed.
  • Step 4 Switch to a new line and read the 48B pixel data of the next row, gating 4 pixel data at a time along the row direction and performing the convolution operation with the corresponding weight data in the same way.
  • Step 5 Repeat the above steps 1 to 4 until the K × K convolution operations under the R channel are completed, then compute the convolution operations of the other channels, such as the G channel and the B channel; the four adjacent points P0[0,0], P0[0,1], P0[0,2], P0[0,3] of the output maps of the first 32 channels can then be obtained at the same time.
  • P0[0,0], P0[0,1], P0[0,2], and P0[0,3] of the first 32 channels need to be written back to the storage unit.
  • Repeat the above steps 1 to 5 until the four adjacent points P0[0,0], P0[0,1], P0[0,2], P0[0,3] of the output maps of all channels are obtained.
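Steps 1 to 5 amount to the following reference computation. This is a plain-Python sketch of the arithmetic only, not of the pipelined hardware, and all names are illustrative.

```python
def conv_points(image, kernels, sx=1, sy=1, out_row=0, out_col0=0, a=4):
    """Accumulate over channels, kernel rows and kernel columns to
    produce `a` adjacent points of one output row for every kernel.
    image: [C0][H][W] nested lists; kernels: [N][C0][Ky][Kx]."""
    n = len(kernels)
    c0, ky, kx = len(kernels[0]), len(kernels[0][0]), len(kernels[0][0][0])
    out = [[0] * a for _ in range(n)]
    for f in range(n):                       # each convolution kernel
        for c in range(c0):                  # R, G, B, ... channels
            for r in range(ky):              # kernel rows
                for t in range(kx):          # operations per row
                    w = kernels[f][c][r][t]
                    for i in range(a):       # a adjacent output points
                        px = image[c][out_row * sy + r][(out_col0 + i) * sx + t]
                        out[f][i] += w * px
    return out
```

Each call yields, for every kernel, the values of a adjacent points in one output row, matching the a × k parallel outputs of the MAC array.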
  • Step 6 Determine whether the starting position of the pixel data read by the register for the second window exceeds the position of the 15th pixel. If it exceeds 15, read 48B starting from the address of the 16th pixel (pixels 16 to 63) of the first row; otherwise, still read the pixel data from the address of the 0th pixel (pixels 0 to 47). Still using the 48B register to read the data, gate the pixel data at X[4Sx], X[5Sx], X[6Sx], X[7Sx] from the register and perform the convolution calculation until the K × K convolution operations are completed, obtaining four adjacent points in the same row of the 32 output maps: P0[0,4], P0[0,5], P0[0,6], P0[0,7].
  • Step 7 After obtaining the data of the first row of the output image, start to calculate the data of the next row of the output image. At this time, read the corresponding pixel data starting from row 0+Sy of the input image, and perform the operations of steps 1 to 6 above.
  • a maximum of 4 pixel data can be gated from the register at a time and, at the same time, one weight data at the same position of a maximum of 32 convolution kernels can be gated for the convolution operation.
  • FIG. 9a shows a schematic diagram of selecting pixel data according to an embodiment of the present disclosure.
  • the stride Sx of the convolution kernel is 3, and the 3 registers Reg[0], Reg[1], Reg[2] read the first 48 (0th to 47th) pixel data of the first row of the image X.
  • the first operation gates X[0], X[Sx], X[2Sx], X[3Sx] in the register, that is, the pixel data 0, 3, 6, 9;
  • the second operation gates X[1], X[Sx+1], X[2Sx+1], X[3Sx+1], that is, the pixel data 1, 4, 7, 10;
  • the third operation gates X[2], X[Sx+2], X[2Sx+2], X[3Sx+2], that is, the pixel data 2, 5, 8, 11; and so on up to the 11th operation.
  • since the size of the convolution kernel is 11 × 11, after pixel data has been gated from the register and sent to the MAC array 11 times, the convolution of the weight data of the first row of the convolution kernel with the corresponding pixel data has been calculated.
  • the register jumps to the start address corresponding to the pixel data of the second row of image X, loads the first 48B data of the pixel data of the second row, and the gating logic for selecting the pixel data is consistent with the above.
  • the convolution operation is performed on the weight data of the second row of the convolution kernel and the corresponding pixel data
  • the first 48B data of the pixel data of the third row is loaded, and the logic of each data loading and data gating is consistent with the above.
  • the address pointer of the register jumps to the storage address corresponding to the pixel data of the second row of the image for the second fetch, and so on until the data read K times has been calculated, which is equivalent to the convolution of the image R channel with the first layer of weight data of the convolution kernel; the pointer then jumps to the starting address of the first line of pixel data of the image G channel, and according to the same reading and gating logic as for the R channel, the convolution of the pixel data of the G channel with the second layer of weight data of the convolution kernel is calculated, and so on for the B channel. After the convolutions of the three RGB channels are completed, 4 values of the same row of each of the 32 output maps can be obtained in parallel.
  • Fig. 9b shows yet another schematic diagram of selecting pixel data according to an embodiment of the present disclosure.
  • the storage logic of the weight data of the convolution kernel in the storage unit is consistent with the calculation order; therefore, during the cyclic calculation it is sufficient to increment the weight start address by 1.
  • the storage sequence of the output graph and the output data sequence of the MAC array also follow certain rules, so the storage sequence of the output graph can also be determined directly according to the hardware solidification logic.
  • the address jump value of each layer loop can be set as a configurable primitive parameter.
  • the 2D MAC array based on the structure of Crossbar can support multi-point parallel operation of data by expanding the two dimensions of row and column.
  • each channel is stored in sequence, and each channel is expanded into a vector along its image width direction, with every 16 pixel data stored consecutively as a storage vector; the vector is split into multiple pieces according to 16B alignment and stored piece by piece.
  • the storage sequence of different pixel data in the storage unit is first along the image width direction, and then stored along the image height direction.
  • the entire image is stored aligned according to 32B, with zero fill where the data falls short, to facilitate register fetching for the calculation.
  • the input image is stored in the order of row direction first, then column direction, then channel direction, and the target rows of the input image are extracted by designing multiple shift registers to realize dynamic data reading and gating logic.
  • by multiplying the gated pixel data with the weight data of the corresponding row of the convolution kernel, multi-point parallel operation logic is realized, and multi-point convolution operation results can be output in parallel through continuous operation in a row-pipelined fashion.
  • a novel convolution operation logic and data storage mode of a neuromorphic chip based on a many-core architecture is implemented, and the convolution operation and data storage efficiency between images and convolution kernels are improved.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Algebra (AREA)
  • Databases & Information Systems (AREA)
  • Image Processing (AREA)

Abstract

A data processing method and an artificial intelligence processor. The method comprises: reading first pixel data from a storage unit according to a preset pixel read bit width; during a T-th operation of a Ky-th row of k convolution kernels, reading first weight data from the storage unit according to a preset weight read bit width, the first weight data comprising the weight data at an m-th channel, the Ky-th row and a convolution kernel position T of the k convolution kernels; selecting, from the first pixel data and according to the step size Sx of the convolution kernels, "a" pieces of pixel data corresponding to the convolution kernel position T as second pixel data; and when T>1, for a q-th column of MACs in a MAC array, multiplying the second pixel data by the q-th weight data in the first weight data and adding the result to the result of the (T-1)-th operation to obtain "a" first convolution operation results of the q-th column of MACs in the T-th operation. The data processing method can effectively improve the efficiency of a convolution operation.
PCT/CN2020/137453 2020-11-30 2020-12-18 Procédé de traitement de données et processeur d'intelligence artificielle WO2022110386A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011381294.9 2020-11-30
CN202011381294.9A CN112395092B (zh) 2020-11-30 2020-11-30 数据处理方法及人工智能处理器

Publications (1)

Publication Number Publication Date
WO2022110386A1 true WO2022110386A1 (fr) 2022-06-02

Family

ID=74604862

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/137453 WO2022110386A1 (fr) 2020-11-30 2020-12-18 Procédé de traitement de données et processeur d'intelligence artificielle

Country Status (2)

Country Link
CN (1) CN112395092B (fr)
WO (1) WO2022110386A1 (fr)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114995782A (zh) * 2022-08-03 2022-09-02 上海登临科技有限公司 数据处理方法、装置、设备和可读存储介质
CN116152307A (zh) * 2023-04-04 2023-05-23 西安电子科技大学 一种基于fpga的sar图像配准预处理装置

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112862724B (zh) * 2021-03-12 2022-09-09 上海壁仞智能科技有限公司 用于计算的方法、计算设备和计算机可读存储介质
CN112927124A (zh) * 2021-03-31 2021-06-08 成都商汤科技有限公司 一种数据处理方法、装置、设备及存储介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108809A (zh) * 2018-03-05 2018-06-01 山东领能电子科技有限公司 一种针对卷积神经元网络进行推理加速的硬件架构及其工作方法
US20190164037A1 (en) * 2017-11-29 2019-05-30 Electronics And Telecommunications Research Institute Apparatus for processing convolutional neural network using systolic array and method thereof
CN111028126A (zh) * 2019-11-18 2020-04-17 中国航空工业集团公司西安航空计算技术研究所 一种gpu图像处理卷积过滤的实现方法
CN111897579A (zh) * 2020-08-18 2020-11-06 腾讯科技(深圳)有限公司 图像数据处理方法、装置、计算机设备和存储介质

Also Published As

Publication number Publication date
CN112395092B (zh) 2023-06-02
CN112395092A (zh) 2021-02-23

Similar Documents

Publication Publication Date Title
WO2022110386A1 (fr) Data processing method and artificial intelligence processor
US10997496B2 (en) Sparse convolutional neural network accelerator
CN108171317B (zh) SoC-based data-reuse convolutional neural network accelerator
US20190095776A1 (en) Efficient data distribution for parallel processing
CN108388537B (zh) Convolutional neural network acceleration apparatus and method
US20200150958A1 (en) Processor and control method for processor
US11487845B2 (en) Convolutional operation device with dimensional conversion
US8441492B2 (en) Methods and apparatus for image processing at pixel rate
CN110188869B (zh) Method and system for integrated-circuit accelerated computation based on a convolutional neural network algorithm
CN111767994B (zh) Neuron computing apparatus
CN110674927A (zh) Data reorganization method for systolic array structures
US11915118B2 (en) Method and apparatus for processing computation of zero value in processing of layers in neural network
EP3093757B1 (fr) Multi-dimensional sliding window operation for a vector processor
CN111738433A (zh) Reconfigurable convolution hardware accelerator
Liu et al. WinoCNN: Kernel sharing Winograd systolic array for efficient convolutional neural network acceleration on FPGAs
EP4318275A1 (fr) Matrix multiplier and matrix multiplier control method
CN112712457B (zh) Data processing method and artificial intelligence processor
CN114169514B (zh) Convolution hardware acceleration method and convolution hardware acceleration circuit
CN116888591A (zh) Matrix multiplier, matrix computation method, and related device
US20230376733A1 (en) Convolutional neural network accelerator hardware
US11194490B1 (en) Data formatter for convolution
JP2021531572A (ja) Performing successive MAC operations on sets of data using different kernels in a MAC circuit
Solovyev et al. Real-Time Recognition of Handwritten Digits in FPGA Based on Neural Network with Fixed Point Calculations
Liguori A MAC-less Neural Inference Processor Supporting Compressed, Variable Precision Weights
US20220269752A1 (en) Execution method for convolution computation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20963274; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20963274; Country of ref document: EP; Kind code of ref document: A1)