WO2023024668A1 - Convolution calculation method, system, device and storage medium - Google Patents

Convolution calculation method, system, device and storage medium

Info

Publication number
WO2023024668A1
WO2023024668A1 PCT/CN2022/099246 CN2022099246W
Authority
WO
WIPO (PCT)
Prior art keywords
sub, block, processed, convolution, blocks
Prior art date
Application number
PCT/CN2022/099246
Other languages
English (en)
French (fr)
Inventor
王和国
黎立煌
蒋文
张丹
Original Assignee
深圳云天励飞技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳云天励飞技术股份有限公司
Publication of WO2023024668A1

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T1/00 General purpose image data processing
    • G06T1/60 Memory management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present application relates to the technical field of convolution computing, and in particular to a convolution computing method, system, device and storage medium.
  • The convolution operation is the most important operation in deep learning; convolutional networks have pushed deep learning to the forefront of almost all machine learning tasks. Efficiently supporting the convolution operation is vital to the operation of a neural network model, affecting the model's absolute running time, latency, throughput, power consumption, cost, and off-chip bandwidth requirements.
  • The convolution operation usually expands the multi-channel input image and the corresponding convolution kernels into two-dimensional matrices; the calculation result is then obtained by matrix multiplication.
  • To improve the computing energy-efficiency ratio, it is necessary to reduce the demand for storage space and for storage-transfer bandwidth.
  • a convolution calculation method comprising:
  • dividing each channel of the input image in the same way, according to the size of the unit storage block in the cache unit of the input image and the size of the convolution kernel, to obtain sub-blocks to be processed; wherein the maximum width of a sub-block to be processed is less than the sum of the width of the unit storage block and the width of the convolution kernel;
  • storing each sub-block to be processed in unit storage blocks; wherein each row of data in the sub-block to be processed is divided, by shifting and intercepting a fixed length, into a number of rows equal to the height of the convolution kernel, and these rows are stored in the unit storage block row by row;
  • for each sub-block to be processed, reading the data to be convolved from at least one of the unit storage blocks within the limit of the single-calculation capacity, and performing matrix multiplication, so as to sequentially obtain, for each row of the output image, the portion corresponding to that sub-block.
  • a convolution computing system comprising:
  • a block module, used to divide each channel of the input image in the same way according to the size of the unit storage block in the cache unit of the input image and the size of the convolution kernel to obtain sub-blocks to be processed; wherein the maximum width of a sub-block to be processed is less than the sum of the width of the unit storage block and the width of the convolution kernel;
  • a sub-block storage module, used to store each sub-block to be processed in unit storage blocks; wherein each row of data of the sub-block to be processed is divided, by shifting and intercepting a fixed length, into a number of rows equal to the height of the convolution kernel, and stored in the unit storage block row by row;
  • a calculation module, used to read, for each sub-block to be processed, the data to be convolved from at least one of the unit storage blocks within the limit of a single-calculation capacity, and to perform matrix multiplication, so as to sequentially obtain, in each row of the output image, the portion corresponding to that sub-block.
  • a convolution calculation device comprising a memory, a processor, and a convolution calculation program stored on the memory and operable on the processor; when the convolution calculation program is executed by the processor, it implements the steps of the convolution calculation method described above.
  • a computer-readable storage medium where a convolution calculation program is stored on the computer-readable storage medium, and when the convolution calculation program is executed by a processor, the steps of the above-mentioned convolution calculation method are implemented.
  • The above convolution calculation method, system, device, and computer-readable storage medium divide the input image into blocks; each sub-block to be processed is then shift-intercepted at a fixed length and stored in unit storage blocks row by row, the block calculations are completed step by step in units of the single-calculation capacity, and the convolution calculation of the entire input image is thereby completed.
  • The method of storing after fixed-length shifting and interception, together with the subsequent calculation method, allows a large amount of data to be reused, so storage costs and transmission bandwidth costs are also saved.
  • FIG. 1 is a schematic structural diagram of a convolution computing device in a hardware operating environment involved in an embodiment of the present application
  • Figure 2a is a schematic diagram of convolution calculation
  • Figure 2b is a schematic diagram of multi-channel convolution calculation
  • Figure 2c is a schematic diagram of converting the convolution kernel into a two-dimensional matrix
  • Figure 2d is a schematic diagram of converting an input image into a two-dimensional matrix
  • Figure 2e is a schematic diagram of the result of the convolution operation obtained by two-dimensional matrix multiplication
  • FIG. 3a is a flow chart of a convolution calculation method according to an embodiment
  • Fig. 3b is a structural diagram of a convolution calculation device according to an embodiment
  • Figure 4a is a 7 × 17 image pixel distribution map
  • Fig. 4b is a pixel distribution diagram of the feature image obtained from the image of Fig. 4a through a 3 × 3 convolution operation;
  • Fig. 4c is a schematic diagram of storing 2 channels of the 7 × 10 sub-block of Fig. 4a into unit storage blocks;
  • Figure 4d is part of the data read from the data in Figure 4c for calculating the first line of the output image
  • Figure 4e is a schematic diagram of the process of calculating each pixel of the output image through a sliding window
  • Fig. 4f is a schematic diagram of converting the embodiment of the present application into a matrix multiplication operation to obtain the same result as Fig. 4e;
  • Figure 4g is part of the data read from the data in Figure 4c for calculating the second line of the output image
  • FIG. 5 is a block diagram of a convolution computing system according to an embodiment.
  • FIG. 1 is a schematic structural diagram of a convolution computing device 100 in a hardware operating environment involved in the solution of the embodiment of the present application.
  • the convolution computing device in the embodiment of the present application may be, for example, a server, a personal computer, a smart phone, a tablet computer, a portable computer, and the like — any device with a certain general data-processing capability.
  • the convolution computing device 100 includes: a memory 104 , a processor 102 and a network interface 106 .
  • In some embodiments, the processor 102 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data-processing chip, used to run the program code stored in the memory 104 or to process data, for example to execute a convolution calculation program.
  • the memory 104 includes at least one type of readable storage medium, which includes flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), magnetic memory, magnetic disk, optical disk, and the like.
  • the memory 104 may be an internal storage unit of the convolution computing device 100 in some embodiments, such as a hard disk of the convolution computing device 100 .
  • the memory 104 may also be an external storage device of the convolution computing device 100 in other embodiments, such as a plug-in hard disk equipped on the convolution computing device 100, a smart memory card (Smart Media Card, SMC), Secure Digital (Secure Digital, SD) card, flash memory card (Flash Card) and so on.
  • the memory 104 may also include both an internal storage unit and an external storage device of the convolution computing device 100 .
  • the memory 104 can not only be used to store application software and various data installed in the convolution computing device 100, such as codes for face recognition model training, etc., but can also be used to temporarily store data that has been output or will be output.
  • the network interface 106 may optionally include standard wired interfaces and wireless interfaces (such as WI-FI interfaces), which are generally used to establish communication connections between the convolution computing device 100 and other electronic devices.
  • the network may be the Internet, a cloud network, a wireless fidelity (Wi-Fi) network, a personal area network (PAN), a local area network (LAN) and/or a metropolitan area network (MAN).
  • Various devices in a network environment can be configured to connect to the communication network according to various wired and wireless communication protocols.
  • wired and wireless communication protocols may include, but are not limited to, at least one of the following: Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), ZigBee, EDGE, IEEE 802.11, Optical Fidelity (Li-Fi), 802.16, IEEE 802.11s, IEEE 802.11g, multi-hop communication, wireless access point (AP), device-to-device communication, cellular communication protocols and/or the Bluetooth communication protocol, or a combination thereof.
  • FIG. 1 only shows a convolution computing device 100 with components 102-106. Those skilled in the art can understand that the structure shown in FIG. 1 does not constitute a limitation on the convolution computing device 100, which may include fewer or more components, combinations of certain components, or different arrangements of components.
  • the convolution operation is to use the convolution kernel 202 to perform sliding window calculation on the input image 204 one by one, so as to extract the features in the input image to obtain the output image 206 .
  • the size of the input image 204 is 8 × 8, and the size of the convolution kernel 202 is 2 × 2.
  • the position corresponding to the convolution kernel 202 and the input image 204 is moved to the right by one grid, and then the calculation is performed in the same manner as above.
  • the area corresponding to the convolution kernel 202 is gradually moved to the right with a step size of 1 and calculated, to obtain all the values of the first row of the output image 206 . It can be understood that when the area corresponding to the convolution kernel 202 reaches the 4 pixels in the upper right corner, the value in the first row and last column of the output image 206 is calculated. In the same way, all the values of the second row of the output image 206 can be calculated by moving the corresponding area of the convolution kernel 202 down by one row. It can be seen that when the input image 204 is 8 × 8 and the convolution kernel 202 is 2 × 2, the output image 206 is 7 × 7. When a different convolution kernel 202 size or sliding step is set, the size of the output image 206 changes accordingly. In addition, the output image 206 can also be pooled for further compression.
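  • The sliding-window procedure above can be sketched in code (a minimal single-channel reference under the stated assumptions — stride 1, no padding; the function name is illustrative):

```python
def conv2d_single_channel(image, kernel, stride=1):
    """Slide the kernel over the image and accumulate products (no padding)."""
    h, w = len(image), len(image[0])
    k_h, k_w = len(kernel), len(kernel[0])
    out_h = (h - k_h) // stride + 1
    out_w = (w - k_w) // stride + 1
    out = [[0] * out_w for _ in range(out_h)]
    for r in range(out_h):
        for c in range(out_w):
            acc = 0
            for i in range(k_h):
                for j in range(k_w):
                    acc += kernel[i][j] * image[r * stride + i][c * stride + j]
            out[r][c] = acc
    return out
```

With an 8 × 8 input and a 2 × 2 kernel, this yields the 7 × 7 output described above.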
  • The input image 204 is usually multi-channel.
  • the original image generally has three RGB channels, and the number of channels of an intermediate feature image can be set differently according to the situation.
  • the above convolution process is aimed at the convolution operation of one channel, and the convolution operation method of each channel is the same.
  • for different channels, the convolution kernels can be the same or different.
  • the result of operating a multi-channel (channel count C) input image 208 with one set of convolution kernels (that is, a filter 210 whose convolution kernels have the same number of channels as the input) is a one-channel output image; multiple filters (Cout of them) output a multi-channel output image 212, and each channel image in the output image 212 has size H' × W'.
  • the size of the output image 212 will be different according to the size of the convolution kernel, the sliding step size, whether to pool or not, and the pooling method.
  • each filter has C channels corresponding to the input image, and each channel has size K × K. The filters are converted to a two-dimensional matrix of size Cout × (C × K × K), that is, with height Cout and width C × K × K.
  • The Cout filters form Cout rows.
  • the input image has C channels, each of size H × W. It is converted to a two-dimensional matrix of size (H' × W') × (C × K × K), that is, with height H' × W' and width C × K × K.
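  • The two conversions above can be sketched as follows (a plain-Python im2col sketch assuming stride 1 and no padding; the function names are illustrative, not the patent's):

```python
def im2col(image, k):
    """Flatten each KxK sliding window of a CxHxW image into one row.

    Returns a matrix of shape (H'*W') x (C*K*K), matching Fig. 2d.
    """
    c, h, w = len(image), len(image[0]), len(image[0][0])
    out_h, out_w = h - k + 1, w - k + 1
    rows = []
    for r in range(out_h):
        for col in range(out_w):
            patch = []
            for ch in range(c):
                for i in range(k):
                    patch.extend(image[ch][r + i][col:col + k])
            rows.append(patch)
    return rows

def conv_as_matmul(image, filters):
    """Each filter is CxKxK; the result has (H'*W') rows and Cout columns."""
    k = len(filters[0][0])
    cols = im2col(image, k)
    # Flatten each filter into one length-(C*K*K) row, giving a Cout-row matrix.
    weights = [[v for ch in f for row in ch for v in row] for f in filters]
    return [[sum(a * b for a, b in zip(patch, wt)) for wt in weights] for patch in cols]
```

Each output row corresponds to one output pixel position, and each column to one filter, matching the (H' × W') × Cout result of the matrix multiplication.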
  • this application proposes a convolution calculation method, which can divide the calculation into multiple small calculations, and make full use of the data transmitted each time, saving bandwidth and improving calculation efficiency.
  • Fig. 3a is a flowchart of a convolution calculation method according to an embodiment. The method includes:
  • Step S302 Divide each channel of the input image in the same manner according to the size of the unit storage block in the buffer unit of the input image and the size of the convolution kernel to obtain sub-blocks to be processed.
  • the weight data is read from the data storage area 302 by the weight reading unit 304 and sent to the weight cache unit 306 ; the feature data is read by the feature data reading unit 308 and sent to the feature cache unit 310 .
  • the matrix calculation unit 312 reads the corresponding data from the weight cache unit 306 and the feature cache unit 310, then performs matrix multiplication, and outputs the operation result to the output cache unit 314, and stores it in the data storage area 302 through the output data read-write unit 316 .
  • the unit storage block may be a basic storage block in the memory, each storage block has 8 × 8 16-bit storage spaces, and each row of 8 16-bit spaces is a bank. Each 16-bit storage space can store one pixel of data of the feature image.
  • the unit storage block is also the basic storage unit in the weight cache unit 306 and the feature cache unit 310 .
  • the maximum width of the sub-block to be processed is smaller than the sum of the width of the unit storage block and the width of the convolution kernel. For example, when the size of the convolution kernel is 3 × 3 and the width of the bank is 8 pixels, the maximum width of the sub-block to be processed is 10 pixels.
  • Dividing each channel in the same way means that after one of the channels is divided in a certain way, the other channels are divided in the same way. For example, for an 11 × 16 input image with 2 channels, each channel is divided into two 3 × 8 and two 8 × 8 sub-blocks.
  • the input image is an initial input image or an intermediate feature image.
  • the initial input image is generally an RGB three-channel image
  • an intermediate feature image is an image that has already undergone convolution processing. That is, the method of the present application is applicable to any convolution calculation process.
  • Step S304 Store each sub-block to be processed in unit storage blocks. Wherein, each row of data of the sub-block to be processed is divided, by shifting and intercepting a fixed length, into the same number of rows as the height of the convolution kernel, and stored in the unit storage block row by row.
  • each row of the sub-block to be processed may be divided into 3 rows by shifting and intercepting 8 pixels. Assuming that the first row of the sub-block to be processed is a0~a9, the three intercepted rows are a0~a7, a1~a8, and a2~a9 respectively.
  • each row of the sub-block to be processed can be divided into 2 rows by shifting and intercepting 8 pixels. Assuming that the first row of the sub-block to be processed is a0~a8, the two intercepted rows are a0~a7 and a1~a8 respectively.
  • each row of the sub-block to be processed can be divided into 2 rows by shifting and intercepting 7 pixels. Assuming that the first row of the sub-block to be processed is a0~a7, the two intercepted rows are a0~a6 and a1~a7 respectively. In this case, an intercepted row cannot completely fill one row of the unit storage block; this happens when the divided sub-block is the last remaining sub-block.
  • Step S306 For each sub-block to be processed, read the data to be convolved from at least one of the unit storage blocks within the limit of the single-calculation capacity, and perform matrix multiplication, so as to obtain, in turn, the portion of each row of the output image corresponding to the sub-block to be processed.
  • a single calculation capacity means reading 8 × 8 data from a unit storage block at one time, and the data to be convolved can be obtained through multiple readings. After this data is multiplied by the weight matrix, a part of one row of the output image is obtained; after the parts of multiple sub-blocks located in the same row are multiplied by the weight matrix, the other parts of that row are obtained, thereby forming one complete row of the output image.
  • the single calculation capacity is the amount of data that can be calculated by a processor per clock cycle; the processor is a single processor or a multi-processor.
  • The input image is divided into blocks; each sub-block to be processed is then shift-intercepted at a fixed length and stored in unit storage blocks row by row, the block calculations are completed step by step in units of the single-calculation capacity, and the convolution calculation of the entire input image is thereby completed. The method of storing after fixed-length shifting and interception in step S304, together with the subsequent calculation method, allows a large amount of data to be reused, so storage costs and transmission bandwidth costs are also saved.
  • each input feature-value channel is a 7 × 17 input image
  • the 5 × 15 output image is obtained by convolution with a 3 × 3 convolution kernel, and the stride of the convolution is 1.
  • the input image is shown in Figure 4a
  • the output image is shown in Figure 4b.
  • each channel of the input image is divided into a 7 × 10 sub-block and a 7 × 7 sub-block.
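  • The figures in this example follow from the standard valid-convolution size formula, which can be checked as (a trivial sketch; the function name is illustrative):

```python
def output_size(h, w, k, stride=1):
    """Output dimensions of a valid (no-padding) convolution."""
    return (h - k) // stride + 1, (w - k) // stride + 1
```

A 7 × 17 channel with a 3 × 3 kernel and stride 1 indeed gives a 5 × 15 output, and the width 17 splits into the 10-wide and 7-wide sub-blocks above.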
  • the data in the sub-block is transferred to the feature buffer unit 310 through the feature data reading unit 308, and stored in the following format:
  • rows 0/1/2 hold the slices of a0,0 ~ a0,9 obtained through the sliding window.
  • the feature data reading unit 308 reads a0,0 ~ a0,9 at one time and outputs them to the feature cache unit 310 by shifting, then reads a1,0 ~ a1,9 of the next row and outputs them to rows 3 to 5 of the bank at address addr0 in the feature cache unit 310 .
  • the input feature value of the next channel is not read until all lines are read and stored in the feature buffer unit 310 .
  • in an embodiment, step S304 — storing each sub-block to be processed in unit storage blocks — may include:
  • after the data storage of one channel of the input image is completed, if the remaining space of the last unit storage block used by that channel's data satisfies the preset condition, the data of the next channel is stored starting from that last unit storage block; otherwise, remaining space satisfying the preset condition is reserved in a new unit storage block, and the data of the next channel is stored starting from the remaining space of the new unit storage block.
  • the preset condition is that the remaining space of the last unit storage block used by one channel's data includes (C × K × K) % 8 rows, that is, the remainder of dividing the product of the channel count C and the convolution kernel's height and width by the bank width 8.
  • For example, when the number of channels is 2 and the convolution kernel is 3 × 3, the remainder is 2, so the first row of the second channel stored in the feature cache unit 310 is bank1, at the address following the first channel's last address. If bank1 and the subsequent banks at the last address used by the first channel in the feature cache unit 310 are not occupied, the same address as the last address of the previous channel can be used directly.
  • Here, the last bank stored at the last address of the first channel is bank4 (bank4 of addr2), so the second channel needs a new address (addr3, following the first channel) and stores its first bank of data at bank1 of addr3. As shown in Figure 4c, the data of the second channel is stored from bank1 of addr3 to bank5 of addr5.
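  • The preset condition above reduces to a single modulo, sketched here (an illustrative helper, not the patent's naming; C = 2 and K = 3 reproduce the remainder of 2 in the example):

```python
def next_channel_rows(c, k, bank_width=8):
    """Rows that must remain free in the last unit storage block so the next
    channel can be packed contiguously: (C * K * K) % bank_width."""
    return (c * k * k) % bank_width
```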
  • the matrix calculation unit 312 spends 3 clock cycles reading sequentially: 1) addr0 bank0~7; 2) addr1 bank0, addr3 bank1~bank7; 3) addr4 bank0~1; it then accumulates to get the final result of the 8 pixels in the first row of the output feature-value channel.
  • the data read in the 3 clock cycles is shown in Figure 4d. Since the above matrix multiplication takes the multiplication of two 8 × 8 matrices as the basic unit, results for 8 channels, with 8 pixels per channel, are obtained at the same time.
  • each pixel of the output image is the sum of the product of the sliding window and the corresponding pixel area. For example:
  • c0,0 = b1*a0,0 + b2*a0,1 + b3*a0,2 + b4*a1,0 + b5*a1,1 + b6*a1,2 + b7*a2,0 + b8*a2,1 + b9*a2,2
  • c0,1 = b1*a0,1 + b2*a0,2 + b3*a0,3 + b4*a1,1 + b5*a1,2 + b6*a1,3 + b7*a2,1 + b8*a2,2 + b9*a2,3
  • across multiple channels, c0,0 is the sum of each channel's convolution result of its kernel over the same pixel area.
  • the above operation can be converted into a matrix multiplication operation. That is, the process shown in Figure 4f.
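  • The equivalence stated above — the sliding-window sum for one output pixel equals a dot product of flattened vectors — can be checked directly (a single-channel sketch; function names are illustrative):

```python
def pixel_via_window(image, kernel, r, c):
    """Direct sliding-window sum for output pixel (r, c)."""
    k = len(kernel)
    return sum(kernel[i][j] * image[r + i][c + j] for i in range(k) for j in range(k))

def pixel_via_dot(image, kernel, r, c):
    """Same pixel as a dot product of two flattened length-K*K vectors."""
    k = len(kernel)
    win = [image[r + i][c + j] for i in range(k) for j in range(k)]
    wts = [v for row in kernel for v in row]
    return sum(a * b for a, b in zip(win, wts))
```

Stacking one such flattened window per output position yields exactly the matrix-multiplication form of Figure 4f.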
  • the matrix calculation unit 312 will sequentially read: 1) addr0 bank3~7, addr1 bank0~2; 2) addr1 bank3, addr3 bank4~bank7, addr4 bank0~2; 3) addr4 bank3~4.
  • the data read once in 3 clock cycles is shown in Figure 4g.
  • In an embodiment, step S302 — dividing each channel of the input image in the same way, according to the size of the unit storage block in the cache unit of the input image and the size of the convolution kernel, to obtain the sub-blocks to be processed — may be: dividing into blocks at the maximum width until the width of the remaining sub-block is smaller than the maximum width, or until the division at the maximum width comes out exactly even.
  • the bank width is 8 and the maximum width is 10. If the width of the input image is 33, it is divided into 3 sub-blocks with a width of 10 and 1 sub-block with a width of 3; the remaining sub-block has width 3 and cannot be divided further. If the width of the input image is 40, it is divided into 4 sub-blocks with a width of 10, and the division comes out exactly even.
  • the height of the sub-block is not limited.
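  • The maximum-width-first division can be sketched as (an illustrative helper reproducing the two examples above):

```python
def split_widths_greedy(width, max_width=10):
    """Cut sub-blocks of max_width until the remainder is narrower than that."""
    widths = []
    while width > 0:
        widths.append(min(max_width, width))
        width -= widths[-1]
    return widths
```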
  • In another embodiment, step S302 — dividing each channel of the input image in the same way, according to the size of the unit storage block in the cache unit of the input image and the size of the convolution kernel, to obtain the sub-blocks to be processed — may be: dividing into blocks whose width is greater than the width of the convolution kernel and does not exceed the maximum width, so that any two sub-blocks to be processed have the same width or differ in width by no more than 2 pixels.
  • the bank width is 8, the maximum width is 10, and the minimum width is 4. If the width of the input image is 33, it can be divided into 3 sub-blocks with a width of 8 and 1 sub-block with a width of 9. If the width of the input image is 35, it can be divided into 5 sub-blocks with a width of 7.
  • the height of the sub-block is not limited. This division method can make the size of the sub-blocks to be processed close to each other, which facilitates the design of the calculation matrix.
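  • One way to realize the near-equal-width scheme is sketched below (an assumption-laden sketch, not the patent's exact rule: it balances widths to within one pixel, so for width 33 it returns one 9 and three 8s as in the first example, while for width 35 it returns a different but equally valid division than the 5 × 7 example):

```python
def split_widths_balanced(width, max_width=10, min_width=4):
    """Divide a width into near-equal sub-block widths within [min_width, max_width]."""
    n = -(-width // max_width)          # fewest sub-blocks that can cover the width
    while True:
        base, extra = divmod(width, n)  # 'extra' sub-blocks get one more pixel
        if base >= min_width and base + (1 if extra else 0) <= max_width:
            return [base + 1] * extra + [base] * (n - extra)
        n += 1
```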
  • the convolution calculation system 500 includes:
  • the block module 502 is used to divide each channel of the input image in the same way according to the size of the unit storage block in the cache unit of the input image and the size of the convolution kernel to obtain sub-blocks to be processed; wherein the maximum width of a sub-block to be processed is less than the sum of the width of the unit storage block and the width of the convolution kernel.
  • the weight data is read from the data storage area 302 by the weight reading unit 304 and sent to the weight cache unit 306 ; the feature data is read by the feature data reading unit 308 and sent to the feature cache unit 310 .
  • the matrix calculation unit 312 reads the corresponding data from the weight cache unit 306 and the feature cache unit 310, then performs matrix multiplication, and outputs the operation result to the output cache unit 314, and stores it in the data storage area 302 through the output data read-write unit 316 .
  • the unit storage block may be a basic storage block in the memory, each storage block has 8 × 8 16-bit storage spaces, and each row of 8 16-bit spaces is a bank. Each 16-bit storage space can store one pixel of data of the feature image.
  • the unit storage block is also the basic storage unit in the weight cache unit 306 and the feature cache unit 310 .
  • the maximum width of the sub-block to be processed is smaller than the sum of the width of the unit storage block and the width of the convolution kernel. For example, when the size of the convolution kernel is 3 × 3 and the width of the bank is 8 pixels, the maximum width of the sub-block to be processed is 10 pixels.
  • Dividing each channel in the same way means that after one of the channels is divided in a certain way, the other channels are divided in the same way. For example, for an 11 × 16 input image with 2 channels, each channel is divided into two 3 × 8 and two 8 × 8 sub-blocks.
  • the input image is an initial input image or an intermediate feature image.
  • the initial input image is generally an RGB three-channel image
  • an intermediate feature image is an image that has already undergone convolution processing. That is, the method of the present application is applicable to any convolution calculation process.
  • the sub-block storage module 504 is used to store each sub-block to be processed in unit storage blocks; wherein each row of data of the sub-block to be processed is divided, by shifting and intercepting a fixed length, into a number of rows equal to the height of the convolution kernel, and stored in the unit storage block row by row.
  • each row of the sub-block to be processed may be divided into 3 rows by shifting and intercepting 8 pixels. Assuming that the first row of the sub-block to be processed is a0~a9, the three intercepted rows are a0~a7, a1~a8, and a2~a9 respectively.
  • each row of the sub-block to be processed can be divided into 2 rows by shifting and intercepting 8 pixels. Assuming that the first row of the sub-block to be processed is a0~a8, the two intercepted rows are a0~a7 and a1~a8 respectively.
  • each row of the sub-block to be processed can be divided into 2 rows by shifting and intercepting 7 pixels. Assuming that the first row of the sub-block to be processed is a0~a7, the two intercepted rows are a0~a6 and a1~a7 respectively. In this case, an intercepted row cannot completely fill one row of the unit storage block; this happens when the divided sub-block is the last remaining sub-block.
  • the calculation module 506 is used to read, for each sub-block to be processed, the data to be convolved from at least one of the unit storage blocks within the limit of the single-calculation capacity, and to perform matrix multiplication, so as to sequentially obtain, in each row of the output image, the portion corresponding to that sub-block.
  • a single calculation capacity means reading 8 × 8 data from a unit storage block at one time, and the data to be convolved can be obtained through multiple readings. After this data is multiplied by the weight matrix, a part of one row of the output image is obtained; after the parts of multiple sub-blocks located in the same row are multiplied by the weight matrix, the other parts of that row are obtained, thereby forming one complete row of the output image.
  • the single calculation capacity is the amount of data that can be calculated by a processor per clock cycle; the processor is a single processor or a multi-processor.
  • The input image is divided into blocks; each sub-block to be processed is then shift-intercepted at a fixed length and stored in unit storage blocks row by row, the block calculations are completed step by step in units of the single-calculation capacity, and the convolution calculation of the entire input image is thereby completed. The way the sub-block storage module 504 stores after fixed-length shifting and interception, together with the subsequent calculation method, allows a large amount of data to be reused, so storage costs and transmission bandwidth costs are also saved.
  • each input feature-value channel is a 7 × 17 input image
  • the 5 × 15 output image is obtained by convolution with a 3 × 3 convolution kernel, and the stride of the convolution is 1.
  • the input image is shown in Figure 4a
  • the output image is shown in Figure 4b.
  • each channel of the input image is divided into 7 ⁇ 10 sub-blocks and 7 ⁇ 7 sub-blocks.
  • the data in the sub-block is transferred to the feature buffer unit 310 through the feature data reading unit 308 and stored in the following format:
  • rows 0/1/2 are obtained from a0,0~a0,9 through the sliding window.
  • the feature data reading unit 308 reads a0,0~a0,9 at one time and outputs them to the feature buffer unit 310 by shifting; it then reads a1,0~a1,9 of the next row and outputs them to rows 3 to 5 of the bank at address addr0 in the feature buffer unit 310.
  • the input feature values of the next channel are not read until all rows have been read and stored in the feature buffer unit 310.
  • step S304, storing each sub-block to be processed in units of unit storage blocks, may include:
  • after the data storage of one channel of the input image is completed, if the remaining space of the last unit storage block used by the data of that channel satisfies a preset condition, the data of the next channel is stored starting from the remaining space of that last unit storage block; otherwise, remaining space satisfying the preset condition is created in a new unit storage block, and the data of the next channel is stored starting from the remaining space of the new unit storage block.
  • the preset condition is that the remaining space of the last unit storage block used by the data of one channel includes (C×K×K) % 8 rows, that is, the remainder of dividing the product of the number of channels C and the height and width of the convolution kernel by the bank width 8.
  • the number of channels is 2
  • the convolution kernel is 3 ⁇ 3
  • the remainder is 2
  • the first row of the second channel stored in the feature buffer unit 310 is bank1
  • the address is the one following the first channel's last address. If bank1 and the subsequent banks of the last address used by the first channel in the feature buffer unit 310 are not occupied, the same address as the last address of the previous channel can be used directly.
  • the last bank stored at the last address of the first channel is bank4 (bank4 of addr2), so the second channel needs to start a new address (addr3, the address following the first channel) to store the data of its first bank (bank1 of addr3). As shown in Figure 4c, the data of the second channel is stored between bank1 of addr3 and bank5 of addr5.
  • the matrix calculation unit 312 spends 3 clock cycles reading in sequence: 1) addr0 bank0~7; 2) addr1 bank0, addr3 bank1~bank7; 3) addr4 bank0~1, and then accumulates to obtain the final result of the 8 pixels of the first row of the output feature value channel.
  • the data read in the 3 clock cycles is shown in Figure 4d. Since the above matrix multiplication takes the multiplication of two 8×8 matrices as its basic unit, results for 8 channels, 8 pixels per channel, are obtained at the same time.
  • each pixel of the output image is the sum of the products of the sliding window and the corresponding pixel area. For example:
  • c0,0 = b1*a0,0 + b2*a0,1 + b3*a0,2 + b4*a1,0 + b5*a1,1 + b6*a1,2 + b7*a2,0 + b8*a2,1 + b9*a2,2
  • c0,1 = b1*a0,1 + b2*a0,2 + b3*a0,3 + b4*a1,1 + b5*a1,2 + b6*a1,3 + b7*a2,1 + b8*a2,2 + b9*a2,3
  • c0,0 is the sum of the convolution results of the kernels over the same pixel area across the multiple channels.
  • the above operation can be converted into a matrix multiplication operation. That is, the process shown in Figure 4f.
  • the matrix calculation unit 312 reads in sequence: 1) addr0 bank3~7, addr1 bank0~2; 2) addr1 bank3, addr3 bank4~bank7, addr4 bank0~2; 3) addr4 bank3~4.
  • the data read in the 3 clock cycles is shown in Figure 4g.
  • the block-dividing module 502 is specifically configured to: divide with the maximum width until the width of the remaining sub-block is smaller than the maximum width, or the division with the maximum width is exactly complete.
  • or the block-dividing module 502 is specifically configured to: divide with a width greater than the width of the convolution kernel and not exceeding the maximum width, so that any two sub-blocks to be processed have the same width, or widths differing by no more than 2 pixels.
  • the bank width is 8, the maximum width is 10, and the minimum width is 4. If the width of the input image is 33, it can be divided into 3 sub-blocks with a width of 8 and 1 sub-block with a width of 9. If the width of the input image is 35, it can be divided into 5 sub-blocks with a width of 7.
  • the height of the sub-block is not limited. This division method can make the size of the sub-blocks to be processed close to each other, which facilitates the design of the calculation matrix.
  • the sub-block storage module 504 is specifically configured to: after the data storage of one channel of the input image is completed, if the remaining space of the last unit storage block used by the data of that channel satisfies the preset condition, store the data of the next channel starting from the remaining space of that last unit storage block; otherwise, create remaining space satisfying the preset condition in a new unit storage block, and store the data of the next channel starting from the remaining space of the new unit storage block.
  • Each of the above modules is a virtual device module corresponding one-to-one to the method; the specific execution process has been described in the method embodiments and is not repeated here. It can be understood that the content described in the foregoing method embodiments may be appropriately introduced into the system embodiments to support them.
  • the embodiment of the present application also proposes a computer-readable storage medium, on which the above convolution calculation program is stored; when the convolution calculation program is executed by a processor, the steps of the convolution calculation method described above are implemented.
  • the embodiments of the present invention may be provided as methods, systems, or computer program products. Accordingly, the present invention can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising instruction means, which implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Neurology (AREA)
  • Complex Calculations (AREA)

Abstract

The present application relates to a convolution calculation method, system, device, and storage medium. The method includes: dividing each channel of an input image in the same way according to the size of the unit storage blocks in the cache unit for the input image and the size of the convolution kernel, to obtain sub-blocks to be processed; storing each sub-block to be processed in units of unit storage blocks, where each row of data of a sub-block to be processed is divided, by shifting and intercepting a fixed length, into a number of rows equal to the height of the convolution kernel, and stored row by row into the unit storage blocks; and, for each sub-block to be processed, reading the data to be convolved from at least one of the unit storage blocks within the limit of the single calculation capacity and performing matrix multiplication, to sequentially obtain the portion of each row of the output image corresponding to the sub-block to be processed. The above convolution calculation method, system, device, and storage medium achieve convolution calculation with reduced storage and data transmission bandwidth requirements.

Description

Convolution calculation method, system, device and storage medium. Technical Field
The present application relates to the technical field of convolution calculation, and in particular to a convolution calculation method, system, device, and storage medium.
The present application claims priority to the Chinese patent application filed with the China Patent Office on August 27, 2021, with application number 202110997622.6 and entitled "Convolution calculation method, system, device and storage medium", the entire contents of which are incorporated herein by reference.
Background Art
The convolution operation is the most important operation in deep learning today, and convolutional networks have pushed deep learning to the forefront of almost all machine learning tasks. How efficiently convolution operations are supported plays a crucial role in the running of neural network models, including the model's absolute running time, latency, throughput, power consumption, cost, off-chip bandwidth requirements, and other aspects.
A convolution operation usually expands the multi-channel input image and the corresponding convolution kernels into two-dimensional matrices and obtains the calculation result through matrix multiplication. Due to hardware limitations, however, the demands on storage space and on storage transfer bandwidth need to be reduced in order to improve the computational energy efficiency.
Technical Problem
Based on this, it is necessary to provide a convolution calculation method, system, device, and storage medium that address the problems of saving storage space and reducing bandwidth requirements.
To achieve the purpose of the present application, the following technical solutions are adopted:
A convolution calculation method, including:
dividing each channel of an input image in the same way according to the size of the unit storage blocks in the cache unit for the input image and the size of the convolution kernel, to obtain sub-blocks to be processed; wherein the maximum width of a sub-block to be processed is smaller than the sum of the width of a unit storage block and the width of the convolution kernel;
storing each sub-block to be processed in units of unit storage blocks; wherein each row of data of a sub-block to be processed is divided, by shifting and intercepting a fixed length, into a number of rows equal to the height of the convolution kernel, and stored row by row into the unit storage blocks;
for each sub-block to be processed, reading the data to be convolved from at least one of the unit storage blocks within the limit of the single calculation capacity and performing matrix multiplication, to sequentially obtain the portion of each row of the output image corresponding to the sub-block to be processed.
A convolution calculation system, including:
a block-dividing module, configured to divide each channel of an input image in the same way according to the size of the unit storage blocks in the cache unit for the input image and the size of the convolution kernel, to obtain sub-blocks to be processed; wherein the maximum width of a sub-block to be processed is smaller than the sum of the width of a unit storage block and the width of the convolution kernel;
a sub-block storage module, configured to store each sub-block to be processed in units of unit storage blocks; wherein each row of data of a sub-block to be processed is divided, by shifting and intercepting a fixed length, into a number of rows equal to the height of the convolution kernel, and stored row by row into the unit storage blocks;
a calculation module, configured to, for each sub-block to be processed, read the data to be convolved from at least one of the unit storage blocks within the limit of the single calculation capacity and perform matrix multiplication, to sequentially obtain the portion of each row of the output image corresponding to the sub-block to be processed.
A convolution calculation device, including a memory, a processor, and a convolution calculation program stored on the memory and runnable on the processor, wherein the convolution calculation program, when executed by the processor, implements the steps of the convolution calculation method described above.
A computer-readable storage medium, on which a convolution calculation program is stored, wherein the convolution calculation program, when executed by a processor, implements the steps of the convolution calculation method described above.
With the above convolution calculation method, system, device, and computer-readable storage medium, the input image is divided into blocks; each sub-block to be processed is then intercepted at a fixed length by shifting and stored row by row into the unit storage blocks, and the calculation of the blocks is completed step by step in units of the single calculation capacity, thereby completing the convolution calculation of the entire input image. Because the fixed-length shift-and-intercept storage in the above steps, together with the subsequent calculation method, allows a large amount of data to be reused, storage cost and transmission bandwidth cost are also saved.
Brief Description of the Drawings
Fig. 1 is a schematic structural diagram of a convolution calculation device in the hardware operating environment involved in the embodiments of the present application;
Fig. 2a is a schematic diagram of the convolution calculation principle;
Fig. 2b is a schematic diagram of multi-channel convolution calculation;
Fig. 2c is a schematic diagram of converting convolution kernels into a two-dimensional matrix;
Fig. 2d is a schematic diagram of converting the input image into a two-dimensional matrix;
Fig. 2e is a schematic diagram of obtaining the convolution result through two-dimensional matrix multiplication;
Fig. 3a is a flowchart of a convolution calculation method according to an embodiment;
Fig. 3b is a structural diagram of a convolution calculation apparatus according to an embodiment;
Fig. 4a is the pixel distribution of a 7×17 image;
Fig. 4b is the pixel distribution of the feature image obtained from the image in Fig. 4a through a 3×3 convolution operation;
Fig. 4c is a schematic diagram of storing the input of the 2 channels of the 7×10 sub-block in Fig. 4a into unit storage blocks;
Fig. 4d shows the partial data read from the data in Fig. 4c for calculating the first row of the output image;
Fig. 4e is a schematic diagram of the process of calculating each pixel of the output image through a sliding window;
Fig. 4f is a schematic diagram of converting to a matrix multiplication operation to obtain the same result as Fig. 4e according to an embodiment of the present application;
Fig. 4g shows the partial data read from the data in Fig. 4c for calculating the second row of the output image;
Fig. 5 is a module diagram of a convolution calculation system according to an embodiment.
Embodiments of the Invention
To facilitate understanding of the present application, the present application is described more fully below with reference to the relevant drawings. Preferred embodiments of the present application are shown in the drawings. However, the present application may be implemented in many different forms and is not limited to the embodiments described herein. Rather, these embodiments are provided so that the disclosure of the present application will be more thorough and complete.
Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by those skilled in the technical field of the present application. The terms used in the specification of the present application are for the purpose of describing specific embodiments only and are not intended to limit the present application.
Fig. 1 is a schematic structural diagram of a convolution calculation device 100 in the hardware operating environment involved in the embodiments of the present application.
The convolution calculation device of the embodiments of the present application may be, for example, a server, a personal computer, a smartphone, a tablet, or a portable computer, as long as it has certain general data processing capabilities.
As shown in Fig. 1, the convolution calculation device 100 includes: a memory 104, a processor 102, and a network interface 106.
In some embodiments, the processor 102 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, used to run the program code stored in the memory 104 or to process data, for example to execute the convolution calculation program.
The memory 104 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), magnetic memory, magnetic disk, optical disk, etc. In some embodiments, the memory 104 may be an internal storage unit of the convolution calculation device 100, such as the hard disk of the convolution calculation device 100. In other embodiments, the memory 104 may also be an external storage device of the convolution calculation device 100, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card, or flash card equipped on the convolution calculation device 100.
Further, the memory 104 may also include an internal storage unit of the convolution calculation device 100. The memory 104 can be used not only to store the application software installed on the convolution calculation device 100 and various types of data, such as the code for face recognition model training, but also to temporarily store data that has been output or is to be output.
The network interface 106 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the convolution calculation device 100 and other electronic devices.
The network may be the Internet, a cloud network, a wireless fidelity (Wi-Fi) network, a personal area network (PAN), a local area network (LAN), and/or a metropolitan area network (MAN). Various devices in the network environment may be configured to connect to the communication network according to various wired and wireless communication protocols. Examples of such wired and wireless communication protocols may include, but are not limited to, at least one of the following: Transmission Control Protocol and Internet Protocol (TCP/IP), User Datagram Protocol (UDP), Hypertext Transfer Protocol (HTTP), File Transfer Protocol (FTP), ZigBee, EDGE, IEEE 802.11, Light Fidelity (Li-Fi), 802.16, IEEE 802.11s, IEEE 802.11g, multi-hop communication, wireless access point (AP), device-to-device communication, cellular communication protocols, and/or Bluetooth communication protocols, or combinations thereof.
Fig. 1 shows only the convolution calculation device 100 with components 102-106. Those skilled in the art will understand that the structure shown in Fig. 1 does not limit the convolution calculation device 100, which may include fewer or more components than shown, or combine certain components, or have a different arrangement of components.
In the field of machine learning, especially convolutional neural networks, convolution operations are common. Referring to Fig. 2a, a convolution operation uses a convolution kernel 202 to compute over a sliding window of the input image 204, position by position, so as to extract features from the input image and obtain an output image 206. In Fig. 2a, the size of the input image 204 is 8×8 and the size of the convolution kernel 202 is 2×2. To compute the first value of the output image 206, the convolution kernel 202 is convolved with the top-left 4 pixels of the input image 204: the value of each pixel is multiplied by the value at the corresponding position of the convolution kernel 202, and the 4 products are added, 2×0 + 5×1 + 7×1 + 4×0 = 12, giving the value of the first pixel at the top left of the output image 206. To obtain the value of the pixel in the first row, second column of the output image 206, the region of the input image 204 corresponding to the convolution kernel 202 is moved one position to the right, and the calculation is performed in the same way as above. Stepping the kernel region to the right with a stride of 1 and computing at each step yields all the values of the first row of the output image 206. It can be understood that when the kernel region reaches the 4 pixels at the top right, the value of the last column of the first row of the output image 206 is obtained. In the same way, moving the kernel region down one row yields all the values of the second row of the output image 206. It follows that when the input image 204 is 8×8 and the convolution kernel 202 is 2×2, the output image 206 is 7×7. With different kernel sizes or sliding strides, the size of the output image 206 changes accordingly. In addition, the output image 206 may be pooled to compress it further.
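The sliding-window computation described above can be illustrated with a short Python sketch. This is illustrative only: apart from the top-left 2×2 patch (2, 5 / 7, 4) and kernel values from the example, the input values are made up.

```python
# Minimal sliding-window convolution (stride 1, no padding).
def conv2d(image, kernel):
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    out = [[0] * out_w for _ in range(out_h)]
    for i in range(out_h):
        for j in range(out_w):
            # Sum of elementwise products of the window and the kernel.
            out[i][j] = sum(image[i + di][j + dj] * kernel[di][dj]
                            for di in range(kh) for dj in range(kw))
    return out

# 8x8 input whose top-left patch matches the example in the text: 2,5 / 7,4.
image = [[(r * 8 + c) % 10 for c in range(8)] for r in range(8)]
image[0][0], image[0][1], image[1][0], image[1][1] = 2, 5, 7, 4
kernel = [[0, 1], [1, 0]]
out = conv2d(image, kernel)
# 2*0 + 5*1 + 7*1 + 4*0 = 12, and an 8x8 input with a 2x2 kernel gives 7x7.
```

With a 2×2 kernel and stride 1, the output is (8-2+1)×(8-2+1) = 7×7, matching the figure.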
The input image 204 usually has multiple channels; for example, an original image generally has three RGB channels, while intermediate feature images can be given different numbers of channels as the situation requires. The convolution process above is for the convolution operation of one channel; the convolution method for each channel is the same. The kernels for the channels may be the same or different.
As shown in Fig. 2b, the result of the operation between a multi-channel input image 208 (with C channels) and one group of convolution kernels (one kernel per channel, together forming one filter 210) is one channel of the output image; multiple filters (Cout of them) output a multi-channel output image 212, where each channel of the output image 212 has size H'×W'. The size of the output image 212 differs according to the kernel size, the sliding stride, whether pooling is applied, and the pooling method.
When convolution is performed on a computer, matrix multiplication is used. This calculation method requires converting the input image and the convolution kernels into two-dimensional matrices for matrix multiplication. As shown in Fig. 2c, there are Cout filters; each filter has C channels corresponding to the input image, each channel of size K×K. They are converted into a two-dimensional matrix of size Cout×(C×K×K), i.e., height Cout and width C×K×K. For each filter, one channel is unrolled row by row into one dimension, giving one row of K×K values, and the C channels are concatenated into one row of C×K×K values. The Cout filters form Cout rows.
As shown in Fig. 2d, the input image has C channels, each of size H×W. It is converted into a two-dimensional matrix of size (H'×W')×(C×K×K), i.e., height H'×W' and width C×K×K.
Then, as shown in Fig. 2e, the two-dimensional matrix of Fig. 2d can be multiplied by the transpose of the two-dimensional matrix of Fig. 2c to obtain the result of the convolution calculation. That is:
((H'×W')×(C×K×K)) × (Cout×(C×K×K))^T = (H'×W')×Cout
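The conversion to matrix multiplication described above can be sketched as follows. This is a simplified single-channel case (C = 1); Cout and the random values are arbitrary choices for the illustration, and the result of one entry is checked against the direct sliding-window sum.

```python
import numpy as np

def im2col(image, k):
    """Unroll each KxK window of a single-channel image into one row,
    giving an (H'*W') x (K*K) matrix as in Fig. 2d (with C = 1)."""
    h_out = image.shape[0] - k + 1
    w_out = image.shape[1] - k + 1
    rows = [image[i:i + k, j:j + k].reshape(-1)
            for i in range(h_out) for j in range(w_out)]
    return np.stack(rows)

rng = np.random.default_rng(0)
H, W, K, Cout = 7, 17, 3, 4            # 7x17 and K=3 from the embodiment; Cout arbitrary
image = rng.integers(0, 10, (H, W))
filters = rng.integers(-2, 3, (Cout, K, K))

# ((H'*W') x (K*K)) @ (Cout x (K*K)).T  ->  (H'*W') x Cout
flat = im2col(image, K) @ filters.reshape(Cout, -1).T

# One entry checked directly against the sliding-window sum.
direct = (image[0:K, 0:K] * filters[0]).sum()
```

Each row of the unrolled matrix times one unrolled filter row reproduces one sliding-window dot product.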
However, the above calculation process is limited by memory size or computing power, and sometimes the multiplication of the two matrices cannot be completed in one pass. To this end, the present application proposes a convolution calculation method that can divide the calculation into several small parts and make full use of the data of each transfer, saving bandwidth and improving computational efficiency.
Fig. 3a is a flowchart of a convolution calculation method according to an embodiment. The method includes:
Step S302: dividing each channel of the input image in the same way according to the size of the unit storage blocks in the cache unit for the input image and the size of the convolution kernel, to obtain sub-blocks to be processed.
As shown in Fig. 3b, when performing the basic matrix multiplication, the weight reading unit 304 reads the weight data from the data storage area 302 and sends it to the weight buffer unit 306, and the feature data reading unit 308 reads the feature data and sends it to the feature buffer unit 310. The matrix calculation unit 312 reads the corresponding data from the weight buffer unit 306 and the feature buffer unit 310, performs the matrix multiplication, outputs the result to the output buffer unit 314, and stores it into the data storage area 302 through the output data read-write unit 316.
A unit storage block may be a basic storage block in memory; each storage block has 8×8 16-bit storage spaces, and the 8 16-bit spaces of each row form one bank. Each 16-bit storage space can hold one pixel of data of the feature image. The unit storage block is also the basic storage unit in the weight buffer unit 306 and the feature buffer unit 310.
The maximum width of a sub-block to be processed is smaller than the sum of the width of a unit storage block and the width of the convolution kernel. For example, when the size of the convolution kernel is 3×3 and the bank width is 8 pixels, the maximum width of a sub-block to be processed is 10 pixels.
Dividing every channel in the same way means that, after one channel is divided in a certain way, the other channels must be divided in the same way. For example, for a 2-channel 11×16 input image, each channel is divided into two 3×8 and two 8×8 sub-blocks.
In one embodiment, the input image is an initial input image or an intermediate feature image. The initial input image is generally an RGB three-channel image, while an intermediate feature image is an image that has undergone convolution. That is, the method of the present application is applicable to any convolution calculation process.
Step S304: storing each sub-block to be processed in units of unit storage blocks. For each row of data of a sub-block to be processed, the row is divided, by shifting and intercepting a fixed length, into a number of rows equal to the height of the convolution kernel, and stored row by row into the unit storage blocks.
For example, when the size of the convolution kernel is 3×3 and the width of the sub-block to be processed is 10 pixels, each row of the sub-block can be divided into 3 rows by shifting and intercepting 8 pixels. Assuming the first row of the sub-block is a0~a9, the three intercepted rows are a0~a7, a1~a8, and a2~a9.
As another example, when the size of the convolution kernel is 2×2 and the width of the sub-block to be processed is 9 pixels, each row of the sub-block can be divided into 2 rows by shifting and intercepting 8 pixels. Assuming the first row of the sub-block is a0~a8, the two intercepted rows are a0~a7 and a1~a8.
As another example, when the size of the convolution kernel is 2×2 and the width of the sub-block to be processed is 8 pixels, each row of the sub-block can be divided into 2 rows by shifting and intercepting 7 pixels. Assuming the first row of the sub-block is a0~a7, the two intercepted rows are a0~a6 and a1~a7. In this case, each intercepted row cannot completely fill one row of the unit storage block. This happens when the divided sub-block is the last remaining sub-block.
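The fixed-length shift-and-intercept of a sub-block row can be sketched as follows (the function and variable names are illustrative, not taken from the patent):

```python
def shift_intercept(row, kernel_height, length):
    """Split one sub-block row into `kernel_height` rows, each a window of
    `length` pixels shifted one pixel to the right of the previous one."""
    return [row[s:s + length] for s in range(kernel_height)]

# 3x3 kernel, sub-block width 10, intercept length 8: a0..a9 -> three rows.
row = [f"a{i}" for i in range(10)]
parts = shift_intercept(row, kernel_height=3, length=8)
# parts[0] = a0..a7, parts[1] = a1..a8, parts[2] = a2..a9
```

The three intercepted rows are exactly the windows the 3×3 kernel will slide over, which is why consecutive reads can reuse most of the stored data.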
When the data of a sub-block fills up one unit storage block bank, storage continues in a new unit storage block bank.
Step S306: for each sub-block to be processed, reading the data to be convolved from at least one of the unit storage blocks within the limit of the single calculation capacity and performing matrix multiplication, to sequentially obtain the portion of each row of the output image corresponding to the sub-block to be processed.
The single calculation capacity means reading the 8×8 data of one unit storage block at a time; the data to be convolved can be obtained through multiple reads. After this data is multiplied by the weight matrix, part of one row of the output image is obtained; after the portions of multiple sub-blocks located in the same row are multiplied by the weight matrix, the other parts of that row of the output image are obtained, forming one row of the output image. In one embodiment, the single calculation capacity is the amount of data the processor can compute per clock cycle; the processor is a single processor or a multi-processor.
In the above embodiment, the input image is divided into blocks; each sub-block to be processed is then intercepted at a fixed length by shifting and stored row by row into the unit storage blocks, and the calculation of the blocks is completed step by step in units of the single calculation capacity, thereby completing the convolution calculation of the entire input image. Because the fixed-length shift-and-intercept storage in step S304, together with the subsequent calculation method, allows a large amount of data to be reused, storage cost and transmission bandwidth cost are also saved.
This is illustrated below with a specific embodiment.
Assume each input feature value channel is a 7×17 input image; a 5×15 output image is obtained by convolution with a 3×3 convolution kernel, with a stride of 1. The input image is shown in Fig. 4a, and the output image in Fig. 4b.
Assume the input image has 2 channels. Each channel of the input image is divided into a 7×10 sub-block and a 7×7 sub-block. Taking the processing of the 7×10 sub-block (the bold part in Fig. 4a) as an example, with reference to Fig. 3b, the data in the sub-block is transferred to the feature buffer unit 310 through the feature data reading unit 308 and stored in the following format: as shown in Fig. 4c, in the bank at address addr0, rows 0/1/2 are obtained from a0,0~a0,9 through the sliding window. The feature data reading unit 308 reads a0,0~a0,9 at one time and outputs them to the feature buffer unit 310 by shifting; it then reads a1,0~a1,9 of the next row and outputs them by shifting to rows 3~5 of the bank at address addr0 in the feature buffer unit 310. Only after all rows have been read and stored into the feature buffer unit 310 are the input feature values of the next channel read.
To ensure that each matrix multiplication (one matrix multiplication per clock cycle) can read 8 rows of data from the banks without bank conflicts, in one embodiment, step S304, storing each sub-block to be processed in units of unit storage blocks, may include:
after the data storage of one channel of the input image is completed, if the remaining space of the last unit storage block used by the data of that channel satisfies a preset condition, storing the data of the next channel starting from the remaining space of that last unit storage block; otherwise, creating remaining space satisfying the preset condition in a new unit storage block, and storing the data of the next channel starting from the remaining space of the new unit storage block.
In this embodiment, the preset condition is that the remaining space of the last unit storage block used by the data of one channel includes (C×K×K) % 8 rows, that is, the remainder of dividing the product of the number of channels C and the height and width of the convolution kernel by the bank width 8. In this embodiment, the number of channels is 2 and the convolution kernel is 3×3, so the remainder is 2; therefore the first row of the second channel stored in the feature buffer unit 310 is bank1, at the address following the first channel's addresses. If bank1 and the subsequent banks of the last address used by the first channel in the feature buffer unit 310 are not occupied, the same address as the last address of the previous channel can be used directly. In this embodiment, the last bank stored at the last address of the first channel is bank4 (bank4 of addr2), so the second channel needs to start a new address (addr3, the address following the first channel) to store the data of its first bank (bank1 of addr3). As shown in Fig. 4c, the data of the second channel is stored between bank1 of addr3 and bank5 of addr5.
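The preset condition above reduces to a one-line computation (a sketch; the bank width of 8 is the value used in this embodiment):

```python
def leftover_rows(channels, kernel, bank_width=8):
    """Rows of remaining space required in the last unit storage block:
    (C * K * K) % bank_width."""
    return (channels * kernel * kernel) % bank_width

# 2 channels and a 3x3 kernel: 2 * 3 * 3 = 18, and 18 % 8 = 2,
# so the second channel starts at bank1 as described in the text.
r = leftover_rows(2, 3)
```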
After the feature buffer unit 310 has been filled, the matrix calculation unit 312 spends 3 clock cycles reading in sequence: 1) addr0 bank0~7; 2) addr1 bank0, addr3 bank1~bank7; 3) addr4 bank0~1, and then accumulates to obtain the final result of the 8 pixels of the first row of the output feature value channel. The data read in the 3 clock cycles is shown in Fig. 4d. Since the above matrix multiplication takes the multiplication of two 8×8 matrices as its basic unit, results for 8 channels, 8 pixels per channel, are obtained at the same time.
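The accumulation over several 8×8 reads can be sketched as tiled matrix multiplication. This is a generic illustration of the accumulate-per-read pattern, not a model of the exact bank addressing of the embodiment:

```python
import numpy as np

def tiled_matmul(a, b, tile=8):
    """Accumulate C = A @ B from 8-wide column tiles of A and row tiles of B,
    mirroring one 8x8 multiply per clock cycle with a running accumulator."""
    assert a.shape[1] == b.shape[0] and a.shape[1] % tile == 0
    c = np.zeros((a.shape[0], b.shape[1]), dtype=a.dtype)
    for k in range(0, a.shape[1], tile):
        c += a[:, k:k + tile] @ b[k:k + tile, :]   # one partial product per read
    return c

rng = np.random.default_rng(1)
a = rng.integers(0, 5, (8, 24))   # 24 = 3 reads of 8, like the 3-cycle example
b = rng.integers(0, 5, (24, 8))
c = tiled_matmul(a, b)
```

Summing the three partial products gives the same result as the full multiplication, which is why the 3-cycle read sequence can produce the final 8-pixel row.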
Referring to Fig. 4e, according to the sliding-window principle of the convolution kernel, each pixel of the output image is the sum of the products of the sliding window and the corresponding pixel area. For example:
c0,0 = b1*a0,0 + b2*a0,1 + b3*a0,2 + b4*a1,0 + b5*a1,1 + b6*a1,2 + b7*a2,0 + b8*a2,1 + b9*a2,2
c0,1 = b1*a0,1 + b2*a0,2 + b3*a0,3 + b4*a1,1 + b5*a1,2 + b6*a1,3 + b7*a2,1 + b8*a2,2 + b9*a2,3
When there are multiple channels, c0,0 is the sum of the convolution results of the kernels over the same pixel area across the channels.
In the method of this embodiment, the above operation can be converted into a matrix multiplication operation, i.e., the process shown in Fig. 4f.
Similarly, to calculate the 8 pixels of the second row of the output image, the matrix calculation unit 312 reads in sequence: 1) addr0 bank3~7, addr1 bank0~2; 2) addr1 bank3, addr3 bank4~bank7, addr4 bank0~2; 3) addr4 bank3~4. The data read in the 3 clock cycles is shown in Fig. 4g.
From the above embodiment it can be seen that: 1) in the sliding-window process, each window requires a total of 9 banks of data, and 6 banks of data are reused between two consecutive windows, which greatly reduces the storage requirement of the feature buffer unit 310, saving 6/9 ≈ 66%; 2) when transferring from the data storage area 302 to the feature buffer unit 310, each datum is transferred once and reused 9 times, reducing the bandwidth to 1/9 of the original.
In one embodiment, step S302, dividing each channel of the input image in the same way according to the size of the unit storage blocks in the cache unit for the input image and the size of the convolution kernel to obtain sub-blocks to be processed, may be: dividing with the maximum width until the width of the remaining sub-block is smaller than the maximum width, or the division with the maximum width is exactly complete.
Taking a 3×3 convolution kernel as an example, with a bank width of 8 the maximum width is 10. If the width of the input image is 33, it is divided into 3 sub-blocks of width 10 and 1 sub-block of width 3; the remaining sub-block of width 3 cannot be divided further. If the width of the input image is 40, it is divided into 4 sub-blocks of width 10, which exactly completes the division. The height of the sub-blocks is not limited.
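The maximum-width division rule can be sketched as follows (the maximum width of 10 is the kernel-derived value from the example above):

```python
def split_max_width(width, max_width=10):
    """Cut sub-blocks of `max_width` until the remainder is smaller than
    `max_width`, or the image width divides exactly."""
    widths = []
    while width >= max_width:
        widths.append(max_width)
        width -= max_width
    if width > 0:
        widths.append(width)   # last remaining sub-block, possibly narrower
    return widths

# Width 33 -> three sub-blocks of 10 plus one of 3; width 40 -> four of 10.
```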
When dividing the above 7×17 input image into blocks, dividing with the maximum width 10 yields the 7×10 sub-block and the 7×7 sub-block. This division method makes maximum use of the storage space.
In one embodiment, step S302, dividing each channel of the input image in the same way according to the size of the unit storage blocks in the cache unit for the input image and the size of the convolution kernel to obtain sub-blocks to be processed, may be: dividing with a width greater than the width of the convolution kernel and not exceeding the maximum width, so that any two sub-blocks to be processed have the same width, or widths differing by no more than 2 pixels.
Taking a 3×3 convolution kernel as an example, with a bank width of 8 the maximum width is 10 and the minimum width is 4. If the width of the input image is 33, it can be divided into 3 sub-blocks of width 8 and 1 sub-block of width 9. If the width of the input image is 35, it can be divided into 5 sub-blocks of width 7. The height of the sub-blocks is not limited. This division method makes the sizes of the sub-blocks to be processed close to each other, which facilitates the design of the calculation matrix.
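One near-equal-width division satisfying these constraints can be sketched as follows. The patent does not fix the exact algorithm, so this is one possible scheme; it may pick a different (but still valid) split than the examples in the text, e.g. widths of 8 and 9 for an image 35 pixels wide.

```python
import math

def split_even(width, kernel_width=3, max_width=10):
    """Divide `width` into sub-block widths that are as equal as possible,
    each wider than the kernel and no wider than `max_width`."""
    n = math.ceil(width / max_width)      # fewest sub-blocks that can fit
    base, extra = divmod(width, n)        # `extra` sub-blocks get one more pixel
    widths = [base + 1] * extra + [base] * (n - extra)
    assert all(kernel_width < w <= max_width for w in widths)
    return widths

# Width 33 -> widths of 8 and 9, differing by at most one pixel.
```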
In addition, based on the same inventive concept, a convolution calculation system is provided. As shown in Fig. 5, the convolution calculation system 500 includes:
a block-dividing module 502, configured to divide each channel of the input image in the same way according to the size of the unit storage blocks in the cache unit for the input image and the size of the convolution kernel, to obtain sub-blocks to be processed; wherein the maximum width of a sub-block to be processed is smaller than the sum of the width of a unit storage block and the width of the convolution kernel.
As shown in Fig. 3b, when performing the basic matrix multiplication, the weight reading unit 304 reads the weight data from the data storage area 302 and sends it to the weight buffer unit 306, and the feature data reading unit 308 reads the feature data and sends it to the feature buffer unit 310. The matrix calculation unit 312 reads the corresponding data from the weight buffer unit 306 and the feature buffer unit 310, performs the matrix multiplication, outputs the result to the output buffer unit 314, and stores it into the data storage area 302 through the output data read-write unit 316.
A unit storage block may be a basic storage block in memory; each storage block has 8×8 16-bit storage spaces, and the 8 16-bit spaces of each row form one bank. Each 16-bit storage space can hold one pixel of data of the feature image. The unit storage block is also the basic storage unit in the weight buffer unit 306 and the feature buffer unit 310.
The maximum width of a sub-block to be processed is smaller than the sum of the width of a unit storage block and the width of the convolution kernel. For example, when the size of the convolution kernel is 3×3 and the bank width is 8 pixels, the maximum width of a sub-block to be processed is 10 pixels.
Dividing every channel in the same way means that, after one channel is divided in a certain way, the other channels must be divided in the same way. For example, for a 2-channel 11×16 input image, each channel is divided into two 3×8 and two 8×8 sub-blocks.
In one embodiment, the input image is an initial input image or an intermediate feature image. The initial input image is generally an RGB three-channel image, while an intermediate feature image is an image that has undergone convolution. That is, the method of the present application is applicable to any convolution calculation process.
a sub-block storage module 504, configured to store each sub-block to be processed in units of unit storage blocks; wherein each row of data of a sub-block to be processed is divided, by shifting and intercepting a fixed length, into a number of rows equal to the height of the convolution kernel, and stored row by row into the unit storage blocks.
For example, when the size of the convolution kernel is 3×3 and the width of the sub-block to be processed is 10 pixels, each row of the sub-block can be divided into 3 rows by shifting and intercepting 8 pixels. Assuming the first row of the sub-block is a0~a9, the three intercepted rows are a0~a7, a1~a8, and a2~a9.
As another example, when the size of the convolution kernel is 2×2 and the width of the sub-block to be processed is 9 pixels, each row of the sub-block can be divided into 2 rows by shifting and intercepting 8 pixels. Assuming the first row of the sub-block is a0~a8, the two intercepted rows are a0~a7 and a1~a8.
As another example, when the size of the convolution kernel is 2×2 and the width of the sub-block to be processed is 8 pixels, each row of the sub-block can be divided into 2 rows by shifting and intercepting 7 pixels. Assuming the first row of the sub-block is a0~a7, the two intercepted rows are a0~a6 and a1~a7. In this case, each intercepted row cannot completely fill one row of the unit storage block. This happens when the divided sub-block is the last remaining sub-block.
When the data of a sub-block fills up one unit storage block bank, storage continues in a new unit storage block bank.
a calculation module 506, configured to, for each sub-block to be processed, read the data to be convolved from at least one of the unit storage blocks within the limit of the single calculation capacity and perform matrix multiplication, to sequentially obtain the portion of each row of the output image corresponding to the sub-block to be processed.
The single calculation capacity means reading the 8×8 data of one unit storage block at a time; the data to be convolved can be obtained through multiple reads. After this data is multiplied by the weight matrix, part of one row of the output image is obtained; after the portions of multiple sub-blocks located in the same row are multiplied by the weight matrix, the other parts of that row of the output image are obtained, forming one row of the output image. In one embodiment, the single calculation capacity is the amount of data the processor can compute per clock cycle; the processor is a single processor or a multi-processor.
In the above embodiment, the input image is divided into blocks; each sub-block to be processed is then intercepted at a fixed length by shifting and stored row by row into the unit storage blocks, and the calculation of the blocks is completed step by step in units of the single calculation capacity, thereby completing the convolution calculation of the entire input image. Because the fixed-length shift-and-intercept storage performed in the sub-block storage module 504, together with the subsequent calculation method, allows a large amount of data to be reused, storage cost and transmission bandwidth cost are also saved.
This is illustrated below with a specific embodiment.
Assume each input feature value channel is a 7×17 input image; a 5×15 output image is obtained by convolution with a 3×3 convolution kernel, with a stride of 1. The input image is shown in Fig. 4a, and the output image in Fig. 4b.
Assume the input image has 2 channels. Each channel of the input image is divided into a 7×10 sub-block and a 7×7 sub-block. Taking the processing of the 7×10 sub-block (the bold part in Fig. 4a) as an example, with reference to Fig. 3b, the data in the sub-block is transferred to the feature buffer unit 310 through the feature data reading unit 308 and stored in the following format: as shown in Fig. 4c, in the bank at address addr0, rows 0/1/2 are obtained from a0,0~a0,9 through the sliding window. The feature data reading unit 308 reads a0,0~a0,9 at one time and outputs them to the feature buffer unit 310 by shifting; it then reads a1,0~a1,9 of the next row and outputs them by shifting to rows 3~5 of the bank at address addr0 in the feature buffer unit 310. Only after all rows have been read and stored into the feature buffer unit 310 are the input feature values of the next channel read.
To ensure that each matrix multiplication (one matrix multiplication per clock cycle) can read 8 rows of data from the banks without bank conflicts, in one embodiment, step S304, storing each sub-block to be processed in units of unit storage blocks, may include:
after the data storage of one channel of the input image is completed, if the remaining space of the last unit storage block used by the data of that channel satisfies a preset condition, storing the data of the next channel starting from the remaining space of that last unit storage block; otherwise, creating remaining space satisfying the preset condition in a new unit storage block, and storing the data of the next channel starting from the remaining space of the new unit storage block.
In this embodiment, the preset condition is that the remaining space of the last unit storage block used by the data of one channel includes (C×K×K) % 8 rows, that is, the remainder of dividing the product of the number of channels C and the height and width of the convolution kernel by the bank width 8. In this embodiment, the number of channels is 2 and the convolution kernel is 3×3, so the remainder is 2; therefore the first row of the second channel stored in the feature buffer unit 310 is bank1, at the address following the first channel's addresses. If bank1 and the subsequent banks of the last address used by the first channel in the feature buffer unit 310 are not occupied, the same address as the last address of the previous channel can be used directly. In this embodiment, the last bank stored at the last address of the first channel is bank4 (bank4 of addr2), so the second channel needs to start a new address (addr3, the address following the first channel) to store the data of its first bank (bank1 of addr3). As shown in Fig. 4c, the data of the second channel is stored between bank1 of addr3 and bank5 of addr5.
After the feature buffer unit 310 has been filled, the matrix calculation unit 312 spends 3 clock cycles reading in sequence: 1) addr0 bank0~7; 2) addr1 bank0, addr3 bank1~bank7; 3) addr4 bank0~1, and then accumulates to obtain the final result of the 8 pixels of the first row of the output feature value channel. The data read in the 3 clock cycles is shown in Fig. 4d. Since the above matrix multiplication takes the multiplication of two 8×8 matrices as its basic unit, results for 8 channels, 8 pixels per channel, are obtained at the same time.
Referring to Fig. 4e, according to the sliding-window principle of the convolution kernel, each pixel of the output image is the sum of the products of the sliding window and the corresponding pixel area. For example:
c0,0 = b1*a0,0 + b2*a0,1 + b3*a0,2 + b4*a1,0 + b5*a1,1 + b6*a1,2 + b7*a2,0 + b8*a2,1 + b9*a2,2
c0,1 = b1*a0,1 + b2*a0,2 + b3*a0,3 + b4*a1,1 + b5*a1,2 + b6*a1,3 + b7*a2,1 + b8*a2,2 + b9*a2,3
When there are multiple channels, c0,0 is the sum of the convolution results of the kernels over the same pixel area across the channels.
In the method of this embodiment, the above operation can be converted into a matrix multiplication operation, i.e., the process shown in Fig. 4f.
Similarly, to calculate the 8 pixels of the second row of the output image, the matrix calculation unit 312 reads in sequence: 1) addr0 bank3~7, addr1 bank0~2; 2) addr1 bank3, addr3 bank4~bank7, addr4 bank0~2; 3) addr4 bank3~4. The data read in the 3 clock cycles is shown in Fig. 4g.
From the above embodiment it can be seen that: 1) in the sliding-window process, each window requires a total of 9 banks of data, and 6 banks of data are reused between two consecutive windows, which greatly reduces the storage requirement of the feature buffer unit 310, saving 6/9 ≈ 66%; 2) when transferring from the data storage area 302 to the feature buffer unit 310, each datum is transferred once and reused 9 times, reducing the bandwidth to 1/9 of the original.
The block-dividing module 502 is specifically configured to: divide with the maximum width until the width of the remaining sub-block is smaller than the maximum width, or the division with the maximum width is exactly complete.
Or, the block-dividing module 502 is specifically configured to: divide with a width greater than the width of the convolution kernel and not exceeding the maximum width, so that any two sub-blocks to be processed have the same width, or widths differing by no more than 2 pixels.
Taking a 3×3 convolution kernel as an example, with a bank width of 8 the maximum width is 10 and the minimum width is 4. If the width of the input image is 33, it can be divided into 3 sub-blocks of width 8 and 1 sub-block of width 9. If the width of the input image is 35, it can be divided into 5 sub-blocks of width 7. The height of the sub-blocks is not limited. This division method makes the sizes of the sub-blocks to be processed close to each other, which facilitates the design of the calculation matrix.
The sub-block storage module 504 is specifically configured to:
after the data storage of one channel of the input image is completed, if the remaining space of the last unit storage block used by the data of that channel satisfies the preset condition, store the data of the next channel starting from the remaining space of that last unit storage block;
otherwise, create remaining space satisfying the preset condition in a new unit storage block, and store the data of the next channel starting from the remaining space of the new unit storage block.
Each of the above modules is a virtual device module corresponding one-to-one to the method; the specific execution process has been described in the method embodiments and is not repeated here. It can be understood that the content described in the above method embodiments may be appropriately introduced into the system embodiments to support them.
In addition, an embodiment of the present application further proposes a computer-readable storage medium, on which the above convolution calculation program is stored; when the convolution calculation program is executed by a processor, the steps of the convolution calculation method described above are implemented.
The specific implementations of the computer-readable storage medium of the present application are substantially the same as the embodiments of the convolution calculation method described above and are not repeated here.
Those skilled in the art should understand that the embodiments of the present invention may be provided as a method, a system, or a computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) containing computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, which implement the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps are performed on the computer or other programmable device to produce computer-implemented processing, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
Although preferred embodiments of the present invention have been described, those skilled in the art, once they learn of the basic inventive concept, may make additional changes and modifications to these embodiments. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications falling within the scope of the present invention.
Obviously, those skilled in the art can make various changes and variations to the present invention without departing from the spirit and scope of the present invention. Thus, if these modifications and variations of the present invention fall within the scope of the claims of the present invention and their technical equivalents, the present invention is also intended to include these changes and variations.

Claims (10)

  1. A convolution calculation method, comprising:
    dividing each channel of an input image in the same way according to the size of the unit storage blocks in the cache unit for the input image and the size of the convolution kernel, to obtain sub-blocks to be processed; wherein the maximum width of a sub-block to be processed is smaller than the sum of the width of a unit storage block and the width of the convolution kernel;
    storing each sub-block to be processed into the cache unit in units of unit storage blocks; wherein each row of data of a sub-block to be processed is divided, by shifting and intercepting a fixed length, into a number of rows equal to the height of the convolution kernel, and stored row by row into the unit storage blocks;
    for each sub-block to be processed, reading each time, from at least one of the unit storage blocks, data to be convolved that is less than or equal to the single calculation capacity, and performing matrix multiplication, to sequentially obtain the portion of each row of the output image corresponding to the sub-block to be processed.
  2. The convolution calculation method according to claim 1, wherein the input image is an initial input image or an intermediate feature image.
  3. The convolution calculation method according to claim 1, wherein dividing each channel of the input image in the same way according to the size of the unit storage blocks in the cache unit for the input image and the size of the convolution kernel, to obtain sub-blocks to be processed, comprises:
    dividing with the maximum width until the width of the remaining sub-block is smaller than the maximum width, or the division with the maximum width is exactly complete.
  4. The convolution calculation method according to claim 1, wherein dividing each channel of the input image in the same way according to the size of the unit storage blocks in the cache unit for the input image and the size of the convolution kernel, to obtain sub-blocks to be processed, comprises:
    dividing with a width greater than the width of the convolution kernel and not exceeding the maximum width, so that any two sub-blocks to be processed have the same width, or widths differing by no more than 2 pixels.
  5. The convolution calculation method according to claim 1, wherein the single calculation capacity is the amount of data that the processor can calculate per clock cycle; the processor is a single processor or a multi-processor.
  6. The convolution calculation method according to claim 1, wherein storing each sub-block to be processed in units of unit storage blocks comprises:
    after the data storage of one channel of the input image is completed, if the remaining space of the last unit storage block used by the data of that channel satisfies a preset condition, storing the data of the next channel starting from the remaining space of that last unit storage block;
    otherwise, creating remaining space satisfying the preset condition in a new unit storage block, and storing the data of the next channel starting from the remaining space of the new unit storage block.
  7. The convolution calculation method according to claim 1, wherein the unit storage block has 8×8 16-bit storage spaces, each 16-bit storage space being used to store one pixel of data of the input image.
  8. A convolution calculation system, comprising:
    a block-dividing module, configured to divide each channel of an input image in the same way according to the size of the unit storage blocks in the cache unit for the input image and the size of the convolution kernel, to obtain sub-blocks to be processed; wherein the maximum width of a sub-block to be processed is smaller than the sum of the width of a unit storage block and the width of the convolution kernel;
    a sub-block storage module, configured to store each sub-block to be processed in units of unit storage blocks; wherein each row of data of a sub-block to be processed is divided, by shifting and intercepting a fixed length, into a number of rows equal to the height of the convolution kernel, and stored row by row into the unit storage blocks;
    a calculation module, configured to, for each sub-block to be processed, read the data to be convolved from at least one of the unit storage blocks within the limit of the single calculation capacity and perform matrix multiplication, to sequentially obtain the portion of each row of the output image corresponding to the sub-block to be processed.
  9. A convolution calculation device, comprising a memory, a processor, and a convolution calculation program stored on the memory and runnable on the processor, wherein the convolution calculation program, when executed by the processor, implements the steps of the convolution calculation method according to any one of claims 1 to 7.
  10. A computer-readable storage medium, wherein a convolution calculation program is stored on the computer-readable storage medium, and when the convolution calculation program is executed by a processor, the steps of the convolution calculation method according to any one of claims 1 to 7 are implemented.
PCT/CN2022/099246 2021-08-27 2022-06-16 Convolution calculation method, system, device and storage medium WO2023024668A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110997622.6 2021-08-27
CN202110997622.6A CN113870091A (zh) 2021-08-27 2021-12-31 Convolution calculation method, system, device and storage medium

Publications (1)

Publication Number Publication Date
WO2023024668A1 true WO2023024668A1 (zh) 2023-03-02

Family

ID=78988650

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/099246 WO2023024668A1 (zh) 2021-08-27 2022-06-16 卷积计算方法、系统、设备及存储介质

Country Status (2)

Country Link
CN (1) CN113870091A (zh)
WO (1) WO2023024668A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113870091A (zh) * 2021-08-27 2021-12-31 深圳云天励飞技术股份有限公司 卷积计算方法、系统、设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108229671A (zh) * 2018-01-16 2018-06-29 华南理工大学 System and method for reducing the external data storage bandwidth requirement of an accelerator
CN109447893A (zh) * 2019-01-28 2019-03-08 深兰人工智能芯片研究院(江苏)有限公司 Image pre-processing method and device for FPGA acceleration of a convolutional neural network
CN111199273A (zh) * 2019-12-31 2020-05-26 深圳云天励飞技术有限公司 Convolution calculation method, apparatus, device and storage medium
US20200265106A1 (en) * 2019-02-15 2020-08-20 Apple Inc. Two-dimensional multi-layer convolution for deep learning
CN111897579A (zh) * 2020-08-18 2020-11-06 腾讯科技(深圳)有限公司 Image data processing method and apparatus, computer device and storage medium
CN113870091A (zh) * 2021-08-27 2021-12-31 深圳云天励飞技术股份有限公司 Convolution calculation method, system, device and storage medium


Also Published As

Publication number Publication date
CN113870091A (zh) 2021-12-31

Similar Documents

Publication Publication Date Title
KR102368970B1 (ko) Intelligent high bandwidth memory apparatus
US11989638B2 (en) Convolutional neural network accelerating device and method with input data conversion
US10846364B1 (en) Generalized dot product for computer vision applications
CN111199273B (zh) Convolution calculation method, apparatus, device and storage medium
US9489342B2 (en) Systems, methods, and computer program products for performing mathematical operations
CN104952037B (zh) Image file scaling method and system
US20210096823A1 (en) Transpose operations using processing element array
WO2018139177A1 (ja) Processor, information processing device, and processor operation method
CN110147252A (zh) Parallel computing method and device for a convolutional neural network
WO2019216376A1 (ja) Arithmetic processing device
WO2023024668A1 (zh) Convolution calculation method, system, device and storage medium
WO2022151779A1 (zh) Implementation method of convolution operation, and data processing method and device
US20220253668A1 (en) Data processing method and device, storage medium and electronic device
CN106227506A (zh) Multi-channel parallel compression and decompression system and method in a memory compression system
WO2022205197A1 (zh) Matrix multiplier, matrix calculation method and related device
CN110018851A (zh) Data processing method, related device and computer-readable medium
WO2021174834A1 (zh) YUV image recognition method, system and computer device
WO2021082723A1 (zh) Computing device
JP2023547831A (ja) Image data processing method and apparatus, electronic device, and computer program
US20240303295A1 (en) Operation apparatus and related product
CN111382852B (zh) Data processing device, method, chip and electronic equipment
CN111382856B (zh) Data processing device, method, chip and electronic equipment
TW202215300A (zh) Convolutional neural network operation method and device
US10909043B2 (en) Direct memory access (DMA) controller, device and method using a write control module for reorganization of storage addresses in a shared local address space
JP4397242B2 (ja) Image processing apparatus and image processing method

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22860015

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22860015

Country of ref document: EP

Kind code of ref document: A1