WO2023231999A1 - Convolution operation method, convolution operation device, electronic device and storage medium - Google Patents

Convolution operation method, convolution operation device, electronic device and storage medium

Info

Publication number
WO2023231999A1
Authority
WO
WIPO (PCT)
Prior art keywords
data, row, convolution, address, convolution kernel
Application number
PCT/CN2023/096983
Other languages
English (en)
French (fr)
Inventor
王帅
李涛
袁航剑
施云峰
王剑
Original Assignee
北京有竹居网络技术有限公司
Application filed by 北京有竹居网络技术有限公司
Publication of WO2023231999A1


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10: Complex mathematical operations
    • G06F 17/15: Correlation function computation including computation of convolution operations
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/0464: Convolutional networks [CNN, ConvNet]
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • Embodiments of the present disclosure relate to a convolution operation method, a convolution operation device, electronic equipment and a storage medium.
  • At least one embodiment of the present disclosure provides a convolution operation method, including: determining an operation convolution kernel, where the operation convolution kernel is obtained based on an initial convolution kernel, the initial convolution kernel is expressed as [R, S, C, K], the operation convolution kernel is expressed as [1,1,(C×R×S),K], and R, S, C, and K are all integers greater than 0; adjusting the arrangement of the input data based on the number of channels of the operation convolution kernel to obtain target data, where the size and number of channels of the target data are different from those of the input data, and the number of channels of the target data is equal to the number of channels of the operation convolution kernel; and performing a convolution operation based on the target data and the operation convolution kernel to obtain a convolution operation result, where the convolution operation result of the target data and the operation convolution kernel is equal to the convolution operation result of the input data and the initial convolution kernel.
  • At least one embodiment of the present disclosure also provides a convolution operation device, including: a determining unit configured to determine an operation convolution kernel, where the operation convolution kernel is obtained based on an initial convolution kernel, the initial convolution kernel is expressed as [R, S, C, K], the operation convolution kernel is expressed as [1,1,(C×R×S),K], and R, S, C, and K are all integers greater than 0; an adjustment unit configured to adjust the arrangement of the input data based on the number of channels of the operation convolution kernel to obtain target data; and a computing unit configured to perform a convolution operation based on the target data and the operation convolution kernel to obtain a convolution operation result, where the convolution operation result of the target data and the operation convolution kernel is equal to the convolution operation result of the input data and the initial convolution kernel.
  • At least one embodiment of the present disclosure also provides an electronic device, including the convolution operation device provided by any embodiment of the present disclosure.
  • At least one embodiment of the present disclosure also provides an electronic device, including: a processor; and a memory including at least one computer program module; where the at least one computer program module is stored in the memory and configured to be executed by the processor, and the at least one computer program module is used to implement the convolution operation method provided by any embodiment of the present disclosure.
  • At least one embodiment of the present disclosure also provides a storage medium that stores non-transitory computer-readable instructions. When the non-transitory computer-readable instructions are executed by a computer, the convolution operation method provided by any embodiment of the present disclosure is implemented.
  • Figure 1 is a schematic diagram of the data flow of a convolution operation provided by some embodiments of the present disclosure;
  • Figure 2 is a schematic flowchart of a convolution operation method provided by some embodiments of the present disclosure;
  • Figure 3 is a schematic diagram of the principle of a convolution operation;
  • Figure 4 is a schematic flowchart of step S20 in Figure 2;
  • Figure 5 is a schematic flowchart of step S21 in Figure 4;
  • Figure 6 is a schematic diagram of the storage of input data in the memory in the convolution operation method provided by some embodiments of the present disclosure;
  • Figure 7 is a schematic diagram of the storage of input data in the static memory in the convolution operation method provided by some embodiments of the present disclosure;
  • Figure 8 is a schematic flowchart of step S22 in Figure 4;
  • Figure 9 is a schematic flowchart of step S23 in Figure 4;
  • Figure 10 is a schematic flowchart of step S232 in Figure 9;
  • Figure 11 is the first schematic diagram of data arrangement transformation in the convolution operation method provided by some embodiments of the present disclosure;
  • Figure 12 is the second schematic diagram of data arrangement transformation in the convolution operation method provided by some embodiments of the present disclosure;
  • Figure 13 is a schematic block diagram of a convolution operation device provided by some embodiments of the present disclosure;
  • Figure 14 is a schematic block diagram of an electronic device provided by some embodiments of the present disclosure;
  • Figure 15 is a schematic block diagram of another electronic device provided by some embodiments of the present disclosure;
  • Figure 16 is a schematic block diagram of yet another electronic device provided by some embodiments of the present disclosure;
  • Figure 17 is a schematic diagram of a storage medium provided by some embodiments of the present disclosure.
  • the term “include” and its variations are open-ended, i.e., “including but not limited to”.
  • the term “based on” means “based at least in part on.”
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one additional embodiment”; and the term “some embodiments” means “at least some embodiments”. Relevant definitions of other terms will be given in the description below.
  • the input data of a convolutional neural network is generally a 3-channel image.
  • the input image of the first-layer convolution of the residual network ResNet50 is [1,224,224,3]; that is, the input image has 3 channels, and the image size of each channel is 224×224
  • the convolution kernel shape used in the first layer convolution of the residual network ResNet50 is [7, 7, 3, 64].
  • Commonly used neural network accelerators are equipped with a matrix operation unit (Matrix).
  • the matrix operation unit is mainly responsible for accelerating matrix operations and convolution operations in neural networks. In order to speed up matrix operations, matrix operation units generally increase computing parallelism by increasing the calculation scale, such as 64×64, 128×128, etc.
  • when the number of input data channels of the first-layer convolution of a convolutional neural network is small (for example, 3 channels), the computing power of the matrix operation unit cannot be fully used, the calculation time is relatively long, and the acceleration effect is not obvious.
  • if the data arrangement is strictly aligned with the number of channels (Channel Align Tensor Layout), the storage space of the data will increase significantly and the data transmission time will increase.
  • hardware accelerators are usually mounted on the PCIe (Peripheral Component Interconnect Express) node of the host as a slave device of the host.
  • the hardware accelerator is the device side (Device).
  • the host is, for example, a central processing unit (Central Processing Unit, CPU).
  • the input data is expressed as [1,224,224,3], that is, 3-channel data with a size of 224×224; the convolution kernel is expressed as [7,7,3,64], that is, 64 groups of convolution kernels, where each group contains 3 kernels of size 7×7; the result is 64-channel data with a size of 112×112.
  • if the matrix operation unit on the hardware accelerator is 64×64, the first-layer convolution input data needs to be expanded from [1,224,224,3] to [1,224,224,64] on the host, with the redundant channel data all filled with 0s. Accordingly, the transmission time for data transmission from the host side to the hardware accelerator also increases by 21.33 times.
  • the computing power utilization rate of the matrix operation unit is only 4.68%; in terms of convolution operation time, it takes 614656 cycles for the matrix operation unit to complete the first layer convolution operation.
  • since the number of input data channels for the first-layer convolution calculation of the convolutional neural network is small while the matrix operation unit of the hardware accelerator is large, the computing requirements do not match the hardware characteristics, which in turn makes the first-layer convolution calculation of the convolutional neural network inefficient.
  • At least one embodiment of the present disclosure provides a convolution operation method, a convolution operation device, an electronic device, and a storage medium.
  • This convolution operation method can improve the utilization rate of the matrix operation unit, effectively utilize the computing power of the matrix operation unit, shorten the time of the convolution operation, improve the operation efficiency, and can save data transmission time.
  • At least one embodiment of the present disclosure provides a convolution operation method. The convolution operation method includes: determining an operation convolution kernel, which is obtained based on an initial convolution kernel, where the initial convolution kernel is expressed as [R, S, C, K], the operation convolution kernel is expressed as [1,1,(C×R×S),K], and R, S, C, and K are all integers greater than 0; adjusting the arrangement of the input data based on the number of channels of the operation convolution kernel to obtain target data, where the size and number of channels of the target data are different from those of the input data, and the number of channels of the target data is equal to the number of channels of the operation convolution kernel; and performing a convolution operation based on the target data and the operation convolution kernel to obtain the convolution operation result, where the convolution operation result of the target data and the operation convolution kernel is equal to the convolution operation result of the input data and the initial convolution kernel.
  • FIG. 2 is a schematic flowchart of a convolution operation method provided by some embodiments of the present disclosure. As shown in Figure 2, in some embodiments, the convolution operation method includes steps S10 to S30.
  • Step S10 Determine the operation convolution kernel, where the operation convolution kernel is obtained based on the initial convolution kernel, the initial convolution kernel is represented as [R, S, C, K], the operation convolution kernel is represented as [1,1,(C×R×S),K], and R, S, C, and K are all integers greater than 0;
  • Step S20 Based on the number of channels of the operation convolution kernel, adjust the arrangement of the input data to obtain target data, where the size and number of channels of the target data are different from the size and number of channels of the input data, and the number of channels of the target data is equal to the number of channels of the operation convolution kernel;
  • Step S30 Perform a convolution operation based on the target data and the operation convolution kernel to obtain the convolution operation result, where the convolution operation result between the target data and the operation convolution kernel is equal to the convolution operation result between the input data and the initial convolution kernel.
  • this convolution operation method can be used for the first layer convolution operation of a convolutional neural network.
  • the convolution operation method can be used not only for convolutional neural networks but also for convolution operations of other types of networks, and it can be used not only for first-layer convolution operations but also for convolution operations of other layers, which can be determined according to actual needs; the embodiments of the present disclosure do not limit this.
  • the initial convolution kernel is the convolution kernel required for the first layer convolution operation, and the initial convolution kernel is represented as [R, S, C, K].
  • The parameters of the initial convolution kernel are transformed to obtain the operation convolution kernel [1,1,(C×R×S),K]. For example, for the initial convolution kernel [7,7,3,64], the operation convolution kernel obtained is [1,1,147,64]. The following briefly explains the transformation principle of the convolution kernel in conjunction with Figure 3.
  • Figure 3 is a schematic diagram of the principle of a convolution operation.
  • the input data size is [1,3,3,5]
  • the convolution kernel size is [2,2,5,4]
  • the output data size is [1,2,2,4].
  • The calculation method of point M is shown in Figure 3. Since the size of the convolution kernel is 2×2 and the number of channels is 5, point M is the result of multiplying 20 points of the input data with the corresponding 20 points of the convolution kernel and then accumulating the products.
  • the convolution kernel can be transformed from R×S×C×K to 1×1×(C×R×S)×K, and the input data can be adjusted accordingly, so that the result of the entire convolution calculation remains unchanged. Through such transformation operations, the number of channels can be increased.
  • the convolution kernels can be transformed offline. This is because the convolution kernels used by a neural network model in the deployment phase are fixed and do not change as the input changes; therefore, the convolution kernels can be processed into the required arrangement in advance. For example, the convolution kernel [R, S, C, K] to be used can be set to [1,1,(C×R×S),K] during the deployment stage of the neural network model and used as the convolution kernel for subsequent operations.
  • For example, a high-level language such as Python can be used to modify the code corresponding to the initial convolution kernel [R, S, C, K] during the model compilation stage, so as to obtain the operation convolution kernel [1,1,(C×R×S),K]. Alternatively, the initial convolution kernel [R, S, C, K] to be used can be adjusted before each convolution operation is performed to obtain the operation convolution kernel actually used in that operation.
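  • As an illustration, the following is a minimal NumPy sketch (not the patent's implementation) of this offline transformation, assuming row-major arrays: flattening (R, S, C) in C order yields exactly the (row, column, channel) order used for the rearranged input data described below.

```python
import numpy as np

def make_operation_kernel(initial_kernel: np.ndarray) -> np.ndarray:
    # [R, S, C, K] -> [1, 1, C*R*S, K]; a C-order reshape keeps, for each
    # output channel k, the weights in (row, column, channel) order
    R, S, C, K = initial_kernel.shape
    return initial_kernel.reshape(1, 1, R * S * C, K)

initial = np.ones((7, 7, 3, 64), dtype=np.float16)  # [R, S, C, K]
op_kernel = make_operation_kernel(initial)
assert op_kernel.shape == (1, 1, 147, 64)           # [1, 1, C*R*S, K]
```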
  • step S20 based on the number of channels of the operation convolution kernel, the arrangement of the input data is adjusted to obtain the target data.
  • the size and number of channels of the target data are different from the size and number of channels of the input data, and the number of channels of the target data is equal to the number of channels of the operation convolution kernel.
  • the data obtained after adjusting the input data is [1,112,112,147].
  • the data obtained after adjusting the arrangement of the input data is called target data
  • the target data is the data that is finally convolved with the operation convolution kernel.
  • the size and number of channels of the target data are different from the size and number of channels of the input data due to the adjustment of the arrangement. It can be seen from the above calculation formula that the number of channels of the target data is equal to the number of channels of the operation convolution kernel (for example, both are 147 in the above example), which facilitates the convolution operation of the two.
  • the number of channels of the target data is greater than the number of channels of the input data
  • the number of channels of the operation convolution kernel is greater than the number of channels of the initial convolution kernel, thereby increasing the number of channels and fully utilizing the computing power of the matrix operation unit.
  • the number of channels of the input data and the number of channels of the initial convolution kernel are both 3
  • the number of channels of the target data and the number of channels of the operation convolution kernel are both 147, thereby increasing the number of channels.
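  • As a concrete check of this equivalence, the following NumPy sketch (an illustration under the assumptions of the ResNet50 example above, not the on-device implementation described below) rearranges a padded input into target data and verifies that a 1×1 convolution with the operation kernel matches the direct 7×7 stride-2 convolution.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((229, 229, 3)).astype(np.float32)  # extended data
k = rng.standard_normal((7, 7, 3, 64)).astype(np.float32)  # initial kernel

# Build the target data [112, 112, 147]: each output point gathers its
# 7x7x3 window in (row, column, channel) order.
ht = wt = (229 - 7) // 2 + 1                               # = 112
target = np.empty((ht, wt, 7 * 7 * 3), dtype=np.float32)
for i in range(ht):
    for j in range(wt):
        target[i, j] = x[2*i:2*i+7, 2*j:2*j+7, :].reshape(-1)

# Operation kernel [147, 64]: the same C-order flattening of (R, S, C).
op_k = k.reshape(7 * 7 * 3, 64)

# A 1x1 convolution over 147 channels is just a matrix multiplication.
y_fast = target @ op_k                                     # [112, 112, 64]

# Reference: direct 7x7 stride-2 convolution.
y_ref = np.zeros((ht, wt, 64), dtype=np.float32)
for r in range(7):
    for s in range(7):
        y_ref += np.tensordot(x[r:r+223:2, s:s+223:2, :], k[r, s],
                              axes=([2], [0]))

assert np.allclose(y_fast, y_ref, atol=1e-3)
```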
  • the conversion of the input data layout needs to be completed online; that is, in the neural network inference stage, the arrangement of the input data needs to be adjusted each time.
  • FIG. 4 is a schematic flow chart of step S20 in FIG. 2 .
  • step S20 may further include steps S21 to S23.
  • Step S21 Store the input data in the static memory in row units, where each row of the input data is stored in corresponding N storage rows in the static memory, where N is an integer greater than 0;
  • Step S22 Perform a filling operation on the input data stored in the static memory to obtain extended data
  • Step S23 Adjust the arrangement of the extended data to change the size and number of channels of the extended data to obtain target data.
  • the input data is first stored in a static memory, which is set in a hardware accelerator.
  • the static memory is, for example, a static random access memory (Static Random Access Memory, SRAM).
  • the input data is stored in row units in the static memory; that is, each row of the input data is stored in the corresponding N storage rows in the static memory, where N is an integer greater than 0.
  • the data flow shown in Figure 1 can be used to transfer input data to static memory.
  • step S21 may further include steps S211 to S212.
  • Step S211 Store the input data in the memory in a closely arranged manner, where the input data includes multiple channels.
  • the closely arranged manner means that multiple channels of the same data point are stored adjacently in the memory in sequence;
  • Step S212 Use direct memory access to transfer the input data in the memory to the static memory of the hardware accelerator, and store the first data point of each row of the input data in the first column of a different row of the static memory, so that each row of the input data is stored in corresponding N storage rows in the static memory.
  • the dense arrangement method is, for example, a channel-aligned data layout method (Channel Align Tensor Layout).
  • the input data includes multiple channels, and the tightly arranged method means that multiple channels of the same data point are stored adjacently in the memory in order.
  • a storage method that arranges data tightly is adopted. For example, the pixel in the first row and first column is stored first, with the values of its three channels stored contiguously in the memory space; then the pixel in the first row and second column is stored, again with the values of its three channels stored contiguously in the memory space; and so on.
  • each channel of each pixel is expressed in FP16 data format, which occupies 2 bytes of address space.
  • the data format adopted and the address space occupied are examples, which do not constitute a limitation on the embodiments of the present disclosure.
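  • Under this tight layout, the byte offset of any element can be computed directly. The helper below is a hypothetical illustration for the [1,224,224,3] FP16 example, not part of the patent.

```python
# Byte offset of channel c of the pixel at (row h, column w) in a tightly
# packed [1, 224, 224, 3] FP16 layout; all parameter values are assumptions
# taken from the running example.
def tight_offset(h: int, w: int, c: int,
                 width: int = 224, channels: int = 3,
                 elem_bytes: int = 2) -> int:
    return ((h * width + w) * channels + c) * elem_bytes

assert tight_offset(0, 0, 0) == 0     # first pixel, first channel
assert tight_offset(0, 1, 0) == 6     # next pixel starts 3 ch * 2 B later
assert tight_offset(1, 0, 0) == 1344  # a full row is 224 * 3 * 2 = 1344 B
```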
  • In step S212, the input data in the memory is transferred to the static memory of the hardware accelerator using Direct Memory Access (DMA).
  • the data transfer method can be as shown in Figure 1.
  • the input data of the first layer convolution is transferred from the host side to the DDR on the device side through PCIe.
  • the input data is stored in the DDR in the same way as it is stored in the host memory.
  • the input data is stored contiguously in the DDR, occupying, for example, 294KB.
  • the input data needs to be transferred to static memory, that is, the 294KB data needs to be moved from the DDR to the SRAM in the processing engine (PE, also known as the hardware accelerator).
  • the first data point of each row of the input data is stored in the first column of a different row of the static memory, so that each row of the input data is stored in corresponding N storage rows in the static memory.
  • the organization of the SRAM can be abstractly considered as a table with M rows and N columns, where each cell stores one piece of data. Since the size of the input data is 224×224, the input data is logically divided into 224 rows, and the starting position of each row is in the first column of a certain row of the SRAM. Since the number of SRAM columns is limited, it is difficult to store an entire row of input data in one row of the SRAM; therefore, one row of input data is scattered across multiple rows of the SRAM, that is, across different SRAM addresses. For the input data of [1,224,224,3], considering the data added by the subsequent filling operation, each row has 229 points and each point has 3 channels, so each row of input data occupies ceil(229×3/64) = 11 storage rows in the SRAM, where ceil means rounding up. That is, N = 11, and the entire input data occupies 224×11 = 2464 rows of the SRAM.
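  • The same row-occupancy arithmetic, written out as a small runnable check:

```python
import math

# Row-occupancy arithmetic from the example above (64 data points per SRAM
# storage row is an assumption taken from the running example).
points_per_row = 229          # 224 input points plus 5 padding points per row
channels = 3
sram_lane = 64                # data points per SRAM storage row
N = math.ceil(points_per_row * channels / sram_lane)
assert N == 11                # 11 SRAM storage rows per row of input data
assert 224 * N == 2464        # SRAM storage rows for the whole input
```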
  • the left side indicates that the input data is stored contiguously in the DDR, without distinguishing the H, W, and C dimensions.
  • the right side indicates that after being transferred to SRAM by DMA, the data is split in SRAM according to the rows of input data, and each row of data occupies a certain amount of SRAM space (for example, 11 storage rows).
  • the DMA transfer process is briefly explained as follows.
  • After completing the data transfer of the first row, the DMA needs to jump the read address from source_address to source_address + 1344 Byte, which is the DDR address at the beginning of the second row of the actual input data, and then continuously transfer 11×128 Byte into the SRAM space headed by destination_address + 11. The amount of data sent each time, 11×128 Byte, is larger than the actual data amount of each row; that is, it also includes data of the next row. However, since the first address sent each time is accurate, even if some data is sent repeatedly it has no impact on the data itself, and the redundant data does not affect subsequent processing.
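  • A hedged Python-style sketch of this transfer pattern follows; dma_copy, source_address, and destination_address are illustrative stand-ins for the real DMA engine, not an actual API.

```python
# Sketch of the row-by-row DMA pattern described above; the DMA call is
# stubbed so that only the addressing logic is shown.
FP16_BYTES = 2
ROW_BYTES = 224 * 3 * FP16_BYTES   # 1344 bytes of real data per input row
SRAM_ROW_BYTES = 128               # bytes per SRAM storage row (assumed)
N = 11                             # SRAM storage rows reserved per input row

def dma_copy(src_byte_addr: int, dst_sram_row: int, num_bytes: int) -> None:
    """Stub standing in for the hardware DMA transfer."""
    pass

source_address = 0                 # DDR byte address of the input data
destination_address = 0            # first SRAM storage row of the buffer

for row in range(224):
    # jump the read address to the exact DDR start of this input row ...
    src = source_address + row * ROW_BYTES
    # ... and write into the N SRAM rows reserved for it
    dst = destination_address + row * N
    # each burst moves N*128 bytes; the tail overlaps the next input row,
    # which is harmless because every burst restarts from an exact address
    dma_copy(src, dst, N * SRAM_ROW_BYTES)
```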
  • a filling operation is performed on the input data stored in the static memory to obtain extended data.
  • extended data refers to the data obtained after performing the filling operation.
  • the input data needs a padding operation in all four directions: 3 points are filled on the left and top of the input data (3 columns on the left and 3 rows on the top), and 2 points are filled on the right and bottom of the input data (2 columns on the right and 2 rows on the bottom), so the size of the extended data obtained after the filling operation is [1,229,229,3].
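  • In NumPy terms, this asymmetric zero filling can be sketched as follows (an illustration, not the on-device implementation described below):

```python
import numpy as np

# Filling operation for the ResNet50 example: 3 points on the left/top,
# 2 points on the right/bottom, zero-filled.
x = np.zeros((1, 224, 224, 3), dtype=np.float16)   # placeholder input
extended = np.pad(x, ((0, 0), (3, 2), (3, 2), (0, 0)), constant_values=0)
assert extended.shape == (1, 229, 229, 3)
```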
  • FIG. 8 is a schematic flow chart of step S22 in FIG. 4 . As shown in FIG. 8 , step S22 may further include steps S221 to S223.
  • Step S221 In the static memory, fill the storage rows before and after the storage location corresponding to the input data with the first preset value to obtain first intermediate data, where the first intermediate data includes the input data and the filled first preset value;
  • Step S222 Transmit the first intermediate data to the vector calculation unit, and use the shift instructions and filling instructions of the vector calculation unit to fill both ends of each row corresponding to the first intermediate data with the second preset value to obtain second intermediate data, where the second intermediate data includes the first intermediate data and the filled second preset value;
  • Step S223 Transfer the second intermediate data to the corresponding storage location in the static memory to obtain extended data, where the extended data has the same content as the second intermediate data.
  • In step S221, the storage rows before and after the storage location corresponding to the input data in the static memory are filled with the first preset value, thereby obtaining the first intermediate data; the first intermediate data includes the input data and the filled first preset value.
  • This step performs, for example, filling operations on the upper and lower edges of the input data. For example, in some examples, near the target address of the SRAM, the SRAM space required for filling above needs to be reserved, that is, several rows of data must be inserted before the first row of actual input data.
  • the vector calculation unit (Vector) in the hardware accelerator is used to perform the upper and lower padding operations (padding).
  • the first preset value of filling is usually 0, so it is necessary to write all 0 values in several addresses before and after the input data storage space in the SRAM to obtain the first intermediate data.
  • the first intermediate data is data that has been filled with top and bottom edges, and the first intermediate data has not yet been filled with left and right edges.
  • In step S222, the first intermediate data is transmitted to the vector calculation unit, and the shift instruction (for example, the vshiftri instruction) and fill instruction (for example, the SI2V instruction) of the vector calculation unit are used to fill both ends of each row corresponding to the first intermediate data with the second preset value, thereby obtaining the second intermediate data. The second intermediate data includes the first intermediate data and the filled second preset value. This step performs, for example, the padding operations on the left and right sides of the first intermediate data.
  • the data in the 2464 address spaces in the SRAM is grouped into groups of 11 rows, sent to the vector calculation unit in turn, and stored in the storage space vmem within the vector calculation unit.
  • the vector calculation unit uses the vshiftri instruction to shift the data to the right as a whole to leave space for the left padding, and then uses the SI2V instruction to write the corresponding second preset value (for example, usually 0) into these positions. Similarly, the corresponding second preset value is written after the last column of the first row of the input data. If the amount of filled data is too large, additional vmem space needs to be allocated as needed.
  • a pipeline approach can be used to perform the left and right padding operations on each group of 11 rows of data to improve processing efficiency.
  • In step S223, the second intermediate data is transferred to the corresponding storage location in the static memory to obtain the extended data; the extended data has the same content as the second intermediate data. That is, the second intermediate data that has completed the filling operation in vmem is written back to the corresponding address space in the SRAM, and the data stored in the SRAM after the filling operation is completed is called the extended data.
  • for cases where no filling operation is required, step S22 may be omitted.
  • up and down filling can be performed first, and then left and right filling, or left and right filling can be performed first, and then up and down filling is performed.
  • the specific filling order is not limited.
  • the instructions used when performing the filling operation are not limited to the vshiftri instruction and the SI2V instruction. Other applicable instructions can also be used, as long as the filling operation can be realized, and the embodiments of the present disclosure do not limit this.
  • In step S23, the arrangement of the extended data is adjusted to change the size and number of channels of the extended data, thereby obtaining the target data. That is to say, in order to match the operation convolution kernel and ensure that the operation result remains unchanged, it is necessary to adjust the arrangement of the extended data and change its size and number of channels. The number of channels of the target data obtained after adjustment is equal to the number of channels of the operation convolution kernel. For example, the target data is expressed as [1, ht, wt, (C×R×S)], where ht and wt are both integers greater than 0.
  • FIG. 9 is a schematic flow chart of step S23 in FIG. 4 . As shown in Figure 9, step S23 may further include steps S231 to S232.
  • Step S231 Read the data in R*N storage rows in the static memory each time and transfer it to the vector calculation unit;
  • Step S232 The vector calculation unit converts the data in the R*N storage rows received each time into data in wt*ceil((C×R×S)/L) storage rows to obtain the target data.
  • In step S231, the data in R*N storage rows in the static memory is read each time and transmitted to the vector calculation unit, so that the vector calculation unit can convert the data in the R*N storage rows received each time.
  • the starting address of each read is moved by str*N storage rows according to the preset skip step str.
  • the preset skip step is the skip step of the sliding window in the row and column directions required to convolve the input data with the initial convolution kernel.
  • the total number of times data is read from the static memory is equal to ht.
  • the sliding window required to convolve the input data [1,224,224,3] with the initial convolution kernel [7,7,3,64] has a skip step of 2 in the row and column directions, so the above preset skip step str is 2.
  • In step S232, the vector calculation unit converts the data in the R*N storage rows received each time into data in wt*ceil((C×R×S)/L) storage rows; the converted data is the target data. That is, the arrangement of the data is adjusted, thereby changing the size and number of channels of the data. In this calculation formula, L represents the number of data points that can be stored in each storage row of the static memory, and ceil((C×R×S)/L) means rounding (C×R×S)/L up to an integer.
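  • Written out for the running example:

```python
import math

# Storage-row arithmetic for the converted data; L = 64 data points per
# storage row is an assumption taken from the running example.
C, R, S, L = 3, 7, 7, 64
rows_per_output_point = math.ceil(C * R * S / L)   # ceil(147/64) = 3
wt = 112                                           # output width in the example
assert wt * rows_per_output_point == 336           # storage rows per group
```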
  • FIG. 10 is a schematic flow chart of step S232 in FIG. 9 .
  • the above step S232 further includes steps S2321 to S2323.
  • Step S2321 Divide the data in R*N storage rows into multiple groups of data according to preset skip steps
  • Step S2322 For each group of data, determine the initial position information parameters and target position information parameters of each row of data in the sliding window corresponding to the group of data;
  • Step S2323 The vector calculation unit stores each set of data in the converted arrangement into the corresponding location of the target memory according to the initial position information parameters and the target position information parameters to obtain the target data.
  • In step S2321, the data in the R*N storage rows is divided into multiple groups of data according to the preset skip step; each group of data corresponds to a sliding window in the row direction, and the number of groups is equal to wt. For example, the data in 7×11 storage rows corresponds to 7 rows of the 224-point-wide input data, and the divided 112 groups of data correspond to the 112 sliding windows obtained with skip step 2 over a row of 224 data points.
  • In step S2322, for each group of data, the initial position information parameters and target position information parameters of each row of data in the sliding window corresponding to the group of data are determined; the initial position information parameters are used to determine the source address of the row of data in the sliding window, and the target position information parameters are used to determine the target address to which the data is transported.
  • each sliding window (Feature Window) to which the convolution kernel slides needs to complete the transformation from [7,7,3] to [1,1,147], as shown in Figure 11.
  • the sliding window to which each convolution kernel slides corresponds to the 7 rows and 7 columns of the original data (input data or input image)
  • the sliding windows swept by the convolution kernel during the sliding process from the upper left to the lower right overlap.
  • sliding the sliding window from left to right will repeatedly read data overlapping in the row direction.
  • the data in the SRAM is divided into 112 groups, corresponding to 112 rows after conversion, and each group of data occupies 7×11 address spaces before conversion. After the vector calculation unit reads a group of data, it processes the data and outputs data occupying 112×3 address spaces, where 112 corresponds to the converted data width and 3 corresponds to the space occupied by the 147 channels (147 channels occupy 3 SRAM storage rows, that is, 3 SRAM address spaces). After the vector calculation unit obtains a group of data, the data in these 7×11 SRAM storage rows (entries) is temporarily stored in the vmem inside the vector calculation unit with its arrangement unchanged; then the instruction operations of the vector calculation unit are used to convert it into data of 112×3 vmem storage rows; afterwards, the result is written back to the SRAM. The converted data width is 112 points in the row direction, and each point has 147 channels, distributed over 3 vmem storage rows.
  • the sliding window data corresponding to the original 7×7×3 needs to be converted into 1×1×147. To do this, the corresponding 7 rows of data in each sliding window need to be found and then reorganized into a new data arrangement. For example, the original rows 0, 1, and 2 form the new row 1; the original rows 3 and 4 form the new row 2; and the original rows 5 and 6 form the new row 3. These three new rows store the 147 data points covered by the original 7×7×3 sliding window. The data arrangement continues to be converted in this way until the sliding windows corresponding to these 7 rows of data have all been converted, and then the next group of 7×11 storage rows of data is read.
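  • A small runnable check of this regrouping arithmetic, assuming the 64-lane vmem rows (vmem_lane = 64) of this example:

```python
# Regrouping arithmetic for one 7x7x3 sliding window, per the description
# above; the 64-lane vmem row width is an assumption taken from the example.
kernel_w, ch, vmem_lane = 7, 3, 64
row_values = kernel_w * ch                 # 21 values per window row
new_row_1 = 3 * row_values                 # window rows 0, 1, 2 -> 63 values
new_row_2 = 2 * row_values                 # window rows 3, 4    -> 42 values
new_row_3 = 2 * row_values                 # window rows 5, 6    -> 42 values
assert new_row_1 <= vmem_lane              # 63 fits in one 64-lane row
assert new_row_1 + new_row_2 + new_row_3 == 147   # = C*R*S per output point
```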
  • the initial position information parameters include a first starting boundary coordinate, a first ending boundary coordinate, a first starting address, a first ending address, a first starting sequence number, and a first ending sequence number.
  • the first starting boundary coordinate represents the relative coordinate of the corresponding starting boundary of the sliding window in the row direction of the extended data
  • the first end boundary coordinate represents the relative coordinate of the corresponding ending boundary of the sliding window in the row direction of the extended data.
  • the corresponding starting boundary of the sliding window and the corresponding ending boundary of the sliding window are located at different positions in the row direction of the extended data.
  • the data obtained after performing the filling operation is extended data, so these coordinates and parameters are defined for the extended data. For other cases where padding is not required, these coordinates and parameters can be defined directly with respect to the input data.
  • the starting boundary is, for example, the left boundary of the sliding window, and the first starting boundary coordinate is the relative coordinate of the left boundary of the sliding window in the row direction of the 229×3 extended data; the ending boundary is, for example, the right boundary of the sliding window, and the first ending boundary coordinate is the relative coordinate of the right boundary of the sliding window in the row direction of the 229×3 extended data.
  • For example, src_row_start_index = i*str*ch, where src_row_start_index represents the first starting boundary coordinate; i represents the sequence number of the corresponding sliding window within the size wt of the target data (for example, it indicates which of the 112 sliding windows in a row the window is, that is, which point of the output row it produces); str represents the skip step of the sliding window; and ch represents the number of channels of the input data (for example, 3).
  • For example, src_row_end_index = src_row_start_index + kernel_w*ch - 1, where src_row_end_index represents the first ending boundary coordinate; kernel_w represents the width of the sliding window (for example, 7); and the size of the sliding window is equal to the size of the initial convolution kernel (for example, both are 7×7).
  • the first start address represents the address of the first starting boundary coordinate in the memory of the vector calculation unit (for example, vmem), and the first end address represents the address of the first end boundary coordinate in the memory of the vector calculation unit (for example, vmem).
  • the first starting sequence number represents the sequence number of the data point corresponding to the first starting boundary coordinate at the first starting address, and the first ending sequence number represents the sequence number of the data point corresponding to the first ending boundary coordinate at the first ending address. Since vmem is stored in rows, a certain storage row in vmem can be located according to the first starting address or the first ending address, and the first starting sequence number or the first ending sequence number indicates which data in that storage row is the corresponding data.
  • For example, src_row_start_address = src_row_start_index/vmem_lane + j*N, where src_row_start_address represents the first starting address; vmem_lane represents the number of data points that can be stored in each storage row in the memory of the vector calculation unit; and j represents the row number of the corresponding data in the sliding window (for example, a value from 1 to 7).
  • For example, src_row_end_address = src_row_end_index/vmem_lane + j*N, where src_row_end_address represents the first ending address.
  • For example, src_row_start_lane = src_row_start_index%vmem_lane and src_row_end_lane = src_row_end_index%vmem_lane, where src_row_start_lane represents the first starting sequence number and src_row_end_lane represents the first ending sequence number.
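  • A sketch gathering the source-side parameters into one helper; the boundary-coordinate formulas follow the pattern of the dst_* formulas given below and should be read as assumptions rather than the patent's exact expressions.

```python
def src_position_params(i: int, j: int, str_: int, ch: int,
                        kernel_w: int, vmem_lane: int, N: int):
    # boundary coordinates of sliding window i, window row j (assumed forms)
    src_row_start_index = i * str_ * ch
    src_row_end_index = src_row_start_index + kernel_w * ch - 1
    # address and lane within vmem, per the formulas in the text
    src_row_start_address = src_row_start_index // vmem_lane + j * N
    src_row_end_address = src_row_end_index // vmem_lane + j * N
    src_row_start_lane = src_row_start_index % vmem_lane
    src_row_end_lane = src_row_end_index % vmem_lane
    return (src_row_start_address, src_row_start_lane,
            src_row_end_address, src_row_end_lane)

# window i = 1, window row j = 0, skip step 2, 3 channels, 7-wide kernel,
# 64 lanes per vmem row, N = 11 storage rows per data row
assert src_position_params(1, 0, 2, 3, 7, 64, 11) == (0, 6, 0, 26)
```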
  • the target location information parameters include a second starting boundary coordinate, a second ending boundary coordinate, a second starting address, a second ending address, a second starting sequence number, and a second ending sequence number.
  • the second starting boundary coordinate represents the relative coordinate of the corresponding starting boundary of the sliding window in the data size of [1,1,(C×R×S)], and the second ending boundary coordinate represents the relative coordinate of the corresponding ending boundary of the sliding window in the data size of [1,1,(C×R×S)]. For example, the corresponding starting boundary and ending boundary of the sliding window are located at different positions in the row direction of the extended data. For example, the target data is expressed as [1,ht,wt,(C×R×S)]; in the above example, the target data is [1,112,112,147], and the data corresponding to each sliding window needs to be converted from [7,7,3] to [1,1,147].
  • the starting boundary is, for example, the left boundary of the sliding window, and the second starting boundary coordinate is the relative coordinate of the left boundary of the sliding window in the data size of [1,1,147]; the ending boundary is, for example, the right boundary of the sliding window, and the second ending boundary coordinate is the relative coordinate of the right boundary of the sliding window in the data size of [1,1,147].
  • For example, dst_row_start_index = j*kernel_w*ch, where dst_row_start_index represents the second starting boundary coordinate; j represents the row number of the corresponding data in the sliding window (for example, a value from 1 to 7); kernel_w represents the width of the sliding window (for example, 7), and the size of the sliding window is equal to the size of the initial convolution kernel (for example, both are 7×7); ch represents the number of channels of the input data (for example, 3).
  • For example, dst_row_end_index = dst_row_start_index + (kernel_w*ch - 1), where dst_row_end_index represents the second ending boundary coordinate.
  • the second start address represents the address of the second starting boundary coordinate in the memory of the vector calculation unit (for example, vmem), and the second end address represents the address of the second end boundary coordinate in the memory of the vector calculation unit (for example, vmem).
  • the second starting sequence number represents the sequence number of the data point corresponding to the second starting boundary coordinate at the second starting address, and the second ending sequence number represents the sequence number of the data point corresponding to the second ending boundary coordinate at the second ending address. Since vmem is stored in rows, a certain storage row in vmem can be located according to the second starting address or the second ending address, and the second starting sequence number or the second ending sequence number indicates which data in that storage row is the corresponding data.
  • For example, dst_row_start_address = dst_row_start_index/vmem_lane, where dst_row_start_address represents the second starting address and vmem_lane represents the number of data points that can be stored in each storage row in the memory of the vector calculation unit.
  • For example, dst_row_end_address = dst_row_end_index/vmem_lane, where dst_row_end_address represents the second ending address.
  • For example, dst_row_start_lane = dst_row_start_index%vmem_lane and dst_row_end_lane = dst_row_end_index%vmem_lane, where dst_row_start_lane represents the second starting sequence number and dst_row_end_lane represents the second ending sequence number.
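  • A companion sketch for the target-side parameters, following the dst_* formulas above directly:

```python
def dst_position_params(j: int, ch: int, kernel_w: int, vmem_lane: int):
    # boundary coordinates of window row j within the [1, 1, C*R*S] layout
    dst_row_start_index = j * kernel_w * ch
    dst_row_end_index = dst_row_start_index + (kernel_w * ch - 1)
    # address and lane within vmem
    dst_row_start_address = dst_row_start_index // vmem_lane
    dst_row_end_address = dst_row_end_index // vmem_lane
    dst_row_start_lane = dst_row_start_index % vmem_lane
    dst_row_end_lane = dst_row_end_index % vmem_lane
    return (dst_row_start_address, dst_row_start_lane,
            dst_row_end_address, dst_row_end_lane)

# window row j = 2 of a 7-wide, 3-channel window: values 42..62 of the
# 147-channel output point, all within the first 64-lane vmem row
assert dst_position_params(2, 3, 7, 64) == (0, 42, 0, 62)
```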
  • Based on the above parameters, the source address and target address required for the data transfer can be determined, and the source data is then moved to the target address. For example, in step S2323, the vector calculation unit stores each group of data in the converted arrangement into the corresponding position of the target memory, where the target address indicated by the target position information parameters is an address in the target memory, thereby obtaining the target data. For example, the target memory stores data in units of rows, and the data transferred to and stored in the target memory is the target data. For example, the target memory may be the aforementioned static memory (in this case, the data before conversion and the data after conversion are stored at different addresses of the static memory), or it may be another storage device different from the aforementioned static memory; the embodiments of the present disclosure are not limited in this respect.
  • step S2323 may further include: according to the initial position information parameters and the target position information parameters, the vector calculation unit uses a circular shift instruction and, according to the preset enable signal in the predicate register, splices each group of data in the converted arrangement and stores it to the corresponding location in the target memory to obtain the target data.
  • the vshiftri instruction in the instruction set architecture (Vector ISA) of the vector calculation unit can be used to circularly shift the data at the source address to the right by several positions and then write it to the destination address according to the write enable signal in the predicate register (Vector Predicate Register, VPR, also known as the VP register).
  • the aforementioned preset enable signal is, for example, a write enable signal.
  • the VP register used needs to be determined based on the second start sequence number dst_row_start_lane and the second end sequence number dst_row_end_lane.
  • for the vshiftri instruction and the VP register, please refer to conventional designs; they are not described in detail here.
  • the vector calculation unit is used to complete the conversion from a two-dimensional tensor (2D Tensor) to a three-dimensional tensor (3D Tensor).
  • the input data [1,224,224,3] is converted into the target data [1,112,112,147], and the operation convolution kernel determined according to the initial convolution kernel [7,7,3,64] is [1,1,147,64], thereby increasing the number of channels from 3 to 147.
  • In step S30, a convolution operation is performed based on the target data and the operation convolution kernel to obtain the convolution operation result.
  • the convolution operation result of the target data and the operation convolution kernel is equal to the convolution operation result of the input data and the initial convolution kernel.
  • step S30 may further include: using the matrix operation unit (Matrix) to perform the convolution operation on the target data and the operation convolution kernel. For example, the operation convolution kernel obtained based on the initial convolution kernel [7,7,3,64] is [1,1,147,64], and the target data obtained after adjusting the arrangement of the input data [1,224,224,3] is [1,112,112,147]; performing the convolution operation on the target data and the operation convolution kernel yields a result consistent with the original convolution operation result.
  • the computing power of the matrix operation unit can be fully utilized, the utilization rate of the matrix operation unit is improved, the time of convolution operation is shortened, and the operation efficiency is improved.
  • since the input data does not need to be re-arranged by the host CPU and the number of channels does not need to be expanded on the host, the occupied data space does not increase significantly and the amount of data transferred from the host to the device does not increase; therefore, the Host2Device PCIe transmission time does not increase, which saves data transmission time.
  • the convolution operation method provided by the embodiments of the present disclosure helps to achieve hardware acceleration and can accelerate the first-layer convolution calculation of a convolutional neural network (CNN), featuring small storage space, short transmission time, high hardware module utilization, and short calculation time. For example, the time required to perform the first-layer convolution of the residual network ResNet50 using the usual convolution operation method is 614656 cycles, while the theoretical time required using the convolution operation method provided by the embodiments of the present disclosure is 37632 cycles, a reduction to 6.1% of the former, which greatly shortens the first-layer calculation time of the convolutional neural network (CNN).
  • the convolution operation method provided by the above-mentioned embodiments of the present disclosure may include more or less operations, and these operations may be performed sequentially or in parallel.
  • although the flow of the convolution operation method described above includes multiple operations occurring in a specific order, it should be clearly understood that the order of the multiple operations is not limited.
  • the convolution operation method described above can be executed once or multiple times according to predetermined conditions.
  • the above description takes the first layer convolution of the residual network ResNet50 as an example, but this does not constitute a limitation on the embodiments of the present disclosure.
  • the convolution operation method provided by the embodiments of the present disclosure can be applied to any applicable convolution operation.
  • the size and number of channels of various types of data, and the size and number of channels of various types of convolution kernels, can be determined according to actual needs and are not limited to the specific numerical values described above.
  • At least one embodiment of the present disclosure also provides a convolution operation device.
  • the convolution operation device can improve the utilization rate of the matrix operation unit, effectively utilize the computing power of the matrix operation unit, shorten the time of the convolution operation, improve the operation efficiency, and can save data transmission time.
  • Figure 13 is a schematic block diagram of a convolution operation device provided by some embodiments of the present disclosure. As shown in FIG. 13 , in some embodiments, the convolution operation device 100 includes a determination unit 110 , an adjustment unit 120 , and a calculation unit 130 .
  • the determining unit 110 is configured to determine the operation convolution kernel.
  • the operation convolution kernel is obtained based on the initial convolution kernel.
  • the initial convolution kernel is expressed as [R, S, C, K], the operation convolution kernel is expressed as [1,1,(C×R×S),K], and R, S, C, and K are all integers greater than 0.
  • the determination unit 110 may perform step S10 of the convolution operation method shown in FIG. 2 .
  • the adjustment unit 120 is configured to adjust the arrangement of the input data based on the number of channels of the operation convolution kernel to obtain target data. For example, the size and number of channels of the target data are different from the size and number of channels of the input data, and the number of channels of the target data is equal to the number of channels of the operation convolution kernel. For example, the adjustment unit 120 may perform step S20 of the convolution operation method shown in FIG. 2 .
  • the computing unit 130 is configured to perform a convolution operation based on the target data and the operation convolution kernel to obtain a convolution operation result.
  • the convolution operation result of the target data and the operation convolution kernel is equal to the convolution operation result of the input data and the initial convolution kernel.
  • the calculation unit 130 may perform step S30 of the convolution operation method shown in FIG. 2 .
  • the determining unit 110, the adjusting unit 120, and the calculating unit 130 may be hardware, software, firmware, or any feasible combination thereof.
  • the determination unit 110, the adjustment unit 120, and the calculation unit 130 may be dedicated or general-purpose circuits, chips, or devices, or may be a combination of a processor and a memory.
  • the embodiments of the present disclosure do not limit this.
  • each unit of the convolution operation device 100 corresponds to each step of the aforementioned convolution operation method.
  • for the specific functions of the convolution operation device 100, please refer to the relevant descriptions of the above convolution operation method, which are not repeated here.
  • the components and structures of the convolution operation device 100 shown in FIG. 13 are only exemplary and not restrictive.
  • the convolution operation device 100 may also include other components and structures as needed.
  • At least one embodiment of the present disclosure also provides an electronic device.
  • This electronic device can improve the utilization rate of the matrix operation unit, effectively utilize the computing power of the matrix operation unit, shorten the time of the convolution operation, improve the operation efficiency, and save data transmission time.
  • FIG. 14 is a schematic block diagram of an electronic device provided by some embodiments of the present disclosure.
  • the electronic device 200 includes a convolution operation device 210 .
  • the convolution operation device 210 may be a convolution operation device provided in any embodiment of the present disclosure, such as the aforementioned convolution operation device 100 .
  • the electronic device 200 can be any device with computing functions, such as a server, a terminal device, a personal computer, etc., and the embodiments of the present disclosure are not limited thereto.
  • Figure 15 is a schematic block diagram of another electronic device provided by some embodiments of the present disclosure.
  • the electronic device 300 includes a processor 310 and a memory 320, which can be used to implement a client or a server.
  • the memory 320 is used to non-transitorily store computer-executable instructions (for example, one or more computer program modules).
  • the processor 310 is configured to run the computer-executable instructions; when the computer-executable instructions are run by the processor 310, one or more steps of the convolution operation method described above can be performed, thereby implementing the convolution operation method described above. The memory 320 and the processor 310 may be interconnected by a bus system and/or other forms of connection mechanisms (not shown).
  • the processor 310 may be a central processing unit (CPU), a graphics processing unit (GPU), or other forms of processing units with data processing capabilities and/or program execution capabilities.
  • the central processing unit (CPU) may be of X86 or ARM architecture.
  • the processor 310 may be a general-purpose processor or a special-purpose processor that may control other components in the electronic device 300 to perform desired functions.
  • memory 320 may include any combination of one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory.
  • Volatile memory may include, for example, random access memory (RAM) and/or cache memory (cache), etc.
  • Non-volatile memory may include, for example, read-only memory (ROM), hard disk, erasable programmable read-only memory (EPROM), portable compact disk read-only memory (CD-ROM), USB memory, flash memory, and the like.
  • One or more computer program modules may be stored on the computer-readable storage medium, and the processor 310 may run the one or more computer program modules to implement various functions of the electronic device 300.
  • Various application programs and various data, as well as various data used and/or generated by the application programs, etc. can also be stored in the computer-readable storage medium.
  • FIG. 16 is a schematic block diagram of another electronic device provided by some embodiments of the present disclosure.
  • the electronic device 400 is, for example, suitable for implementing the convolution operation method provided by the embodiment of the present disclosure.
  • the electronic device 400 may be a terminal device or the like, and may be used to implement a client or a server.
  • the electronic device 400 may include, but is not limited to, mobile terminals such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (Personal Digital Assistant), a PAD (tablet computer), a PMP (Portable Multimedia Player), a vehicle-mounted terminal (such as a vehicle-mounted navigation terminal), and wearable electronic devices, as well as fixed terminals such as digital TVs, desktop computers, and smart home devices.
  • the electronic device 400 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 410, which can perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 420 or a program loaded from a storage device 480 into a random access memory (RAM) 430.
  • in the RAM 430, various programs and data required for the operation of the electronic device 400 are also stored.
  • the processing device 410, ROM 420 and RAM 430 are connected to each other through a bus 440.
  • An input/output (I/O) interface 450 is also connected to bus 440.
  • the following devices may be connected to the I/O interface 450: input devices 460 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; output devices 470 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; storage devices 480 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 490.
  • the communication device 490 may allow the electronic device 400 to communicate wirelessly or wiredly with other electronic devices to exchange data.
  • although FIG. 16 illustrates the electronic device 400 having various means, it should be understood that implementing or providing all of the illustrated means is not required, and the electronic device 400 may alternatively implement or be provided with more or fewer means.
  • the above-mentioned convolution operation method may be implemented as a computer software program.
  • embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer-readable medium.
  • the computer program includes program code for executing the above convolution operation method.
  • the computer program may be downloaded and installed from the network through the communication device 490, or installed from the storage device 480, or installed from the ROM 420.
  • when the computer program is executed by the processing device 410, the functions defined in the convolution operation method provided by the embodiments of the present disclosure can be implemented.
  • At least one embodiment of the present disclosure also provides a storage medium. Using this storage medium can improve the utilization rate of the matrix operation unit, effectively utilize the computing power of the matrix operation unit, shorten the time of convolution operation, improve the operation efficiency, and save data transmission time.
  • Figure 17 is a schematic diagram of a storage medium provided by some embodiments of the present disclosure.
  • the storage medium 500 may be a non-transitory computer-readable storage medium storing non-transitory computer-readable instructions 510 .
  • when the non-transitory computer-readable instructions 510 are executed by a processor, the convolution operation method described in the embodiments of the present disclosure may be implemented; for example, one or more steps of the convolution operation method described above may be executed.
  • the storage medium 500 can be applied in the above-mentioned electronic device.
  • the storage medium 500 can include the memory 320 in the electronic device 300 .
  • the storage medium may include a memory card of a smartphone, a storage component of a tablet computer, a hard disk of a personal computer, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM), portable compact disk read-only memory (CD-ROM), flash memory, or any combination of the above storage media, and may also be other suitable storage media.
  • for the description of the storage medium 500, reference may be made to the description of the memory in the embodiments of the electronic device, and repeated content will not be described again.
  • for the specific functions and technical effects of the storage medium 500, please refer to the above description of the convolution operation method, which will not be repeated here.
  • the convolution operation method provided by the embodiments of the present disclosure can be used for the first-layer convolution operation of a convolutional neural network; by adjusting the data arrangement to increase the number of channels, target data with more channels is convolved with an operation convolution kernel with more channels, which can improve the utilization of the matrix operation unit, effectively utilize the computing power of the matrix operation unit, shorten the time of the convolution operation, improve operation efficiency, and save data transmission time.
  • a computer-readable medium may be a tangible medium that may contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium or any combination of the two.
  • the computer-readable storage medium may be, for example, but is not limited to: an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof.
  • computer-readable storage media may include, but are not limited to: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code therein.
  • Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to: wire, optical cable, RF (radio frequency), etc., or any suitable combination of the above.
  • the client and the server can communicate using any currently known or future-developed network protocol, such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication in any form or medium (e.g., a communication network); examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (e.g., the Internet), and a peer-to-peer network (e.g., an ad hoc peer-to-peer network), as well as any currently known or future-developed network.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device; it may also exist independently without being assembled into the electronic device.
  • computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
  • the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure.
  • each block in the flowchart or block diagrams may represent a module, program segment, or portion of code that contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures; for example, two blocks shown in succession may actually be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functionality involved.
  • each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by special-purpose hardware-based systems that perform the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.
  • the units involved in the embodiments of the present disclosure can be implemented in software or hardware; the name of a unit does not, under certain circumstances, constitute a limitation on the unit itself.
  • the functions described above herein may be performed, at least in part, by one or more hardware logic components; for example, exemplary types of hardware logic components that can be used include Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
  • According to one or more embodiments of the present disclosure, a convolution operation method includes: determining an operation convolution kernel, wherein the operation convolution kernel is obtained based on an initial convolution kernel, the initial convolution kernel is expressed as [R,S,C,K], the operation convolution kernel is expressed as [1,1,(C×R×S),K], and R, S, C, K are all integers greater than 0; adjusting the arrangement of input data based on the number of channels of the operation convolution kernel to obtain target data, wherein the size and number of channels of the target data are different from those of the input data, and the number of channels of the target data is equal to the number of channels of the operation convolution kernel; and performing a convolution operation based on the target data and the operation convolution kernel to obtain a convolution operation result, wherein the convolution operation result of the target data and the operation convolution kernel is equal to the convolution operation result of the input data and the initial convolution kernel.
  • According to one or more embodiments of the present disclosure, the number of channels of the target data is greater than the number of channels of the input data, and the number of channels of the operation convolution kernel is greater than the number of channels of the initial convolution kernel.
  • According to one or more embodiments of the present disclosure, adjusting the arrangement of the input data based on the number of channels of the operation convolution kernel to obtain the target data includes: storing the input data in a static memory in row units, wherein each row of the input data is stored in corresponding N storage rows in the static memory, and N is an integer greater than 0; performing a filling operation on the input data stored in the static memory to obtain expanded data; and adjusting the arrangement of the expanded data to change the size and number of channels of the expanded data, to obtain the target data.
  • According to one or more embodiments of the present disclosure, storing the input data in the static memory in row units includes: storing the input data in a memory in a tightly arranged manner, wherein the input data includes multiple channels, and the tightly arranged manner means that the multiple channels of the same data point are stored sequentially and adjacently in the memory; and transferring the input data in the memory to the static memory of the hardware accelerator by direct memory access, and storing the first data point of each row of the input data in the first column of a different row of the static memory, so that each row of the input data is stored in corresponding N storage rows in the static memory.
  • According to one or more embodiments of the present disclosure, performing the filling operation on the input data stored in the static memory to obtain the expanded data includes: in the static memory, filling the storage rows before and after the storage location corresponding to the input data with a first preset value to obtain first intermediate data, wherein the first intermediate data includes the input data and the filled first preset value; transmitting the first intermediate data to the vector calculation unit, and using the shift instructions and filling instructions of the vector calculation unit to fill both ends of each row corresponding to the first intermediate data with a second preset value to obtain second intermediate data, wherein the second intermediate data includes the first intermediate data and the filled second preset value; and transferring the second intermediate data to the corresponding storage location in the static memory to obtain the expanded data, wherein the expanded data has the same content as the second intermediate data.
  • According to one or more embodiments of the present disclosure, the target data is expressed as [1,ht,wt,(C×R×S)], where ht and wt are both integers greater than 0; adjusting the arrangement of the expanded data to change the size and number of channels of the expanded data to obtain the target data includes: successively reading the data in R*N storage rows in the static memory and transmitting it to the vector calculation unit, wherein the starting address of each read moves by str*N storage rows according to a preset skip step str, the preset skip step is the stride, in the row direction and the column direction, of the sliding window required to perform the convolution operation between the input data and the initial convolution kernel, and the total number of reads from the static memory equals ht; and the vector calculation unit converting the data in the R*N storage rows received each time into data of wt*ceil((C×R×S)/L) storage rows to obtain the target data, where L denotes the number of data points that each storage row in the static memory can store, ceil((C×R×S)/L) denotes rounding (C×R×S)/L up, and the converted data is the target data.
  • According to one or more embodiments of the present disclosure, the vector calculation unit converting the data in the R*N storage rows received each time into data of wt*ceil((C×R×S)/L) storage rows to obtain the target data includes: dividing the data in the R*N storage rows into multiple groups of data according to the preset skip step, wherein each group of data corresponds to one sliding window in the row direction, and the number of groups of the multiple groups of data equals wt; for each group of data, determining the initial position information parameters and the target position information parameters of each row of data in the sliding window corresponding to the group of data; and the vector calculation unit storing each group of data in the converted arrangement into the corresponding position of the target memory according to the initial position information parameters and the target position information parameters, to obtain the target data, wherein the target memory stores data in row units, and the data transferred to the target memory and stored on the target memory is the target data.
  • According to one or more embodiments of the present disclosure, the initial position information parameters include a first starting boundary coordinate, a first ending boundary coordinate, a first starting address, a first ending address, a first starting sequence number, and a first ending sequence number; the first starting boundary coordinate represents the relative coordinate of the starting boundary of the corresponding sliding window in the row direction of the expanded data, the first ending boundary coordinate represents the relative coordinate of the ending boundary of the corresponding sliding window in the row direction of the expanded data, and the starting boundary and the ending boundary of the corresponding sliding window are located at different positions in the row direction of the expanded data; the first starting address represents the address of the first starting boundary coordinate in the memory of the vector calculation unit, and the first ending address represents the address of the first ending boundary coordinate in the memory of the vector calculation unit; the first starting sequence number represents the sequence number of the data point corresponding to the first starting boundary coordinate at the first starting address, and the first ending sequence number represents the sequence number of the data point corresponding to the first ending boundary coordinate at the first ending address.
  • According to one or more embodiments of the present disclosure, the target position information parameters include a second starting boundary coordinate, a second ending boundary coordinate, a second starting address, a second ending address, a second starting sequence number, and a second ending sequence number; the second starting boundary coordinate represents the relative coordinate of the starting boundary of the corresponding sliding window in the data size of [1,1,(C×R×S)], the second ending boundary coordinate represents the relative coordinate of the ending boundary of the corresponding sliding window in the data size of [1,1,(C×R×S)], and the starting boundary and the ending boundary of the corresponding sliding window are located at different positions in the row direction of the expanded data; the second starting address represents the address of the second starting boundary coordinate in the memory of the vector calculation unit, and the second ending address represents the address of the second ending boundary coordinate in the memory of the vector calculation unit; the second starting sequence number represents the sequence number of the data point corresponding to the second starting boundary coordinate at the second starting address, and the second ending sequence number represents the sequence number of the data point corresponding to the second ending boundary coordinate at the second ending address.
  • According to one or more embodiments of the present disclosure, the first starting boundary coordinate is calculated as src_row_start_index=i*str*ch, where i denotes the sequence number of the data point corresponding to the sliding window in the size wt of the target data, str denotes the stride of the sliding window in the row direction, and ch denotes the number of channels of the input data; the first ending boundary coordinate is calculated as src_row_end_index=src_row_start_index+(kernel_w*ch-1), where kernel_w denotes the width of the sliding window, and the size of the sliding window equals the size of the initial convolution kernel; the first starting address is calculated as src_row_start_address=src_row_start_index/vmem_lane+j*N, where vmem_lane denotes the number of data points that each storage row in the memory of the vector calculation unit can store, and j denotes the row number of the corresponding data in the sliding window; the first ending address is calculated as src_row_end_address=src_row_end_index/vmem_lane+j*N; the first starting sequence number is calculated as src_row_start_lane=src_row_start_index%vmem_lane, where % denotes the modulo operation; and the first ending sequence number is calculated as src_row_end_lane=src_row_end_index%vmem_lane.
  • According to one or more embodiments of the present disclosure, the second starting boundary coordinate is calculated as dst_row_start_index=j*kernel_w*ch; the second ending boundary coordinate is calculated as dst_row_end_index=dst_row_start_index+(kernel_w*ch-1); the second starting address is calculated as dst_row_start_address=dst_row_start_index/vmem_lane; the second ending address is calculated as dst_row_end_address=dst_row_end_index/vmem_lane; the second starting sequence number is calculated as dst_row_start_lane=dst_row_start_index%vmem_lane; and the second ending sequence number is calculated as dst_row_end_lane=dst_row_end_index%vmem_lane.
  • According to one or more embodiments of the present disclosure, the vector calculation unit storing each group of data in the converted arrangement into the corresponding position of the target memory according to the initial position information parameters and the target position information parameters to obtain the target data includes: according to the initial position information parameters and the target position information parameters, the vector calculation unit using a circular shift instruction and, according to a preset enable signal in the predicate register, splicing each group of data in the converted arrangement and storing it into the corresponding position of the target memory to obtain the target data.
  • According to one or more embodiments of the present disclosure, performing a convolution operation based on the target data and the operation convolution kernel includes: using a matrix operation unit to perform the convolution operation on the target data and the operation convolution kernel.
  • the convolution operation method is used for the first layer convolution operation of the convolutional neural network.
  • According to one or more embodiments of the present disclosure, a convolution operation device includes: a determination unit configured to determine an operation convolution kernel, wherein the operation convolution kernel is obtained based on an initial convolution kernel, the initial convolution kernel is expressed as [R,S,C,K], the operation convolution kernel is expressed as [1,1,(C×R×S),K], and R, S, C, K are all integers greater than 0; an adjustment unit configured to adjust the arrangement of input data based on the number of channels of the operation convolution kernel to obtain target data, wherein the size and number of channels of the target data are different from those of the input data, and the number of channels of the target data is equal to the number of channels of the operation convolution kernel; and a calculation unit configured to perform a convolution operation based on the target data and the operation convolution kernel to obtain a convolution operation result, wherein the convolution operation result of the target data and the operation convolution kernel is equal to the convolution operation result of the input data and the initial convolution kernel.
  • According to one or more embodiments of the present disclosure, the number of channels of the target data is greater than the number of channels of the input data, and the number of channels of the operation convolution kernel is greater than the number of channels of the initial convolution kernel.
  • According to one or more embodiments of the present disclosure, the adjustment unit includes a first adjustment subunit, a second adjustment subunit, and a third adjustment subunit.
  • the first adjustment subunit is configured to store the input data in a static memory in row units, wherein each row of the input data is stored in corresponding N storage rows in the static memory, and N is an integer greater than 0.
  • the second adjustment subunit is configured to perform a filling operation on the input data stored in the static memory to obtain extended data.
  • the third adjustment subunit is configured to adjust the arrangement of the extended data to change the size and channel number of the extended data to obtain the target data.
  • the first adjustment sub-unit includes a first storage unit and a second storage unit.
  • the first storage unit is configured to store the input data in the memory in a tightly arranged manner, wherein the input data includes multiple channels, and the tightly arranged manner means that the multiple channels of the same data point are stored sequentially and adjacently in the memory.
  • the second storage unit is configured to transfer the input data in the memory to the static memory of the hardware accelerator by direct memory access, and to store the first data point of each row of the input data in the first column of a different row of the static memory, so that each row of the input data is stored in corresponding N storage rows in the static memory.
  • the second adjustment sub-unit includes a first filling unit, a second filling unit, and a third filling unit.
  • the first filling unit is configured to fill the first preset value in the storage rows before and after the storage location corresponding to the input data in the static memory to obtain first intermediate data, wherein the first intermediate data Includes the input data and the filled first preset value.
  • the second filling unit is configured to transmit the first intermediate data to the vector calculation unit, and to use the shift instructions and filling instructions of the vector calculation unit to fill both ends of each row corresponding to the first intermediate data with the second preset value to obtain second intermediate data, wherein the second intermediate data includes the first intermediate data and the filled second preset value.
  • the third filling unit is configured to transfer the second intermediate data to the corresponding storage location in the static memory to obtain the extended data, wherein the extended data has the same content as the second intermediate data.
  • the target data is expressed as [1, ht, wt, (C ⁇ R ⁇ S)], and both ht and wt are integers greater than 0.
  • the third adjustment sub-unit includes a first changing unit and a second changing unit.
  • the first change unit is configured to successively read the data in R*N storage rows in the static memory and transmit it to the vector calculation unit, wherein the starting address of each read moves by str*N storage rows according to the preset skip step str, the preset skip step is the stride, in the row direction and the column direction, of the sliding window required to perform the convolution operation between the input data and the initial convolution kernel, and the total number of reads from the static memory equals ht.
  • the second change unit is configured to use the vector calculation unit to convert the data in the R*N storage rows received each time into data of wt*ceil((C×R×S)/L) storage rows to obtain the target data, where L denotes the number of data points that each storage row in the static memory can store, ceil((C×R×S)/L) denotes rounding (C×R×S)/L up, and the converted data is the target data.
  • the second changing unit includes a grouping unit, a parameter determination unit, and a vector calculation unit.
  • the grouping unit is configured to divide the data in R*N storage rows into multiple groups of data according to the preset skip step, wherein each group of data corresponds to a sliding window in the row direction, and the number of groups of the multiple groups of data equal to wt.
  • the parameter determination unit is configured to, for each group of data, determine the initial position information parameters and target position information parameters of each row of data in the sliding window corresponding to the group of data.
  • the vector calculation unit is configured to store each group of data in the converted arrangement into the corresponding position of the target memory according to the initial position information parameters and the target position information parameters, to obtain the target data, wherein the target memory stores data in row units, and the data transferred to the target memory and stored on the target memory is the target data.
  • According to one or more embodiments of the present disclosure, the initial position information parameters include a first starting boundary coordinate, a first ending boundary coordinate, a first starting address, a first ending address, a first starting sequence number, and a first ending sequence number; the first starting boundary coordinate represents the relative coordinate of the starting boundary of the corresponding sliding window in the row direction of the expanded data, the first ending boundary coordinate represents the relative coordinate of the ending boundary of the corresponding sliding window in the row direction of the expanded data, and the starting boundary and the ending boundary of the corresponding sliding window are located at different positions in the row direction of the expanded data; the first starting address represents the address of the first starting boundary coordinate in the memory of the vector calculation unit, and the first ending address represents the address of the first ending boundary coordinate in the memory of the vector calculation unit; the first starting sequence number represents the sequence number of the data point corresponding to the first starting boundary coordinate at the first starting address, and the first ending sequence number represents the sequence number of the data point corresponding to the first ending boundary coordinate at the first ending address.
  • According to one or more embodiments of the present disclosure, the target position information parameters include a second starting boundary coordinate, a second ending boundary coordinate, a second starting address, a second ending address, a second starting sequence number, and a second ending sequence number; the second starting boundary coordinate represents the relative coordinate of the starting boundary of the corresponding sliding window in the data size of [1,1,(C×R×S)], the second ending boundary coordinate represents the relative coordinate of the ending boundary of the corresponding sliding window in the data size of [1,1,(C×R×S)], and the starting boundary and the ending boundary of the corresponding sliding window are located at different positions in the row direction of the expanded data; the second starting address represents the address of the second starting boundary coordinate in the memory of the vector calculation unit, and the second ending address represents the address of the second ending boundary coordinate in the memory of the vector calculation unit; the second starting sequence number represents the sequence number of the data point corresponding to the second starting boundary coordinate at the second starting address, and the second ending sequence number represents the sequence number of the data point corresponding to the second ending boundary coordinate at the second ending address.
  • According to one or more embodiments of the present disclosure, the second starting boundary coordinate is calculated as dst_row_start_index=j*kernel_w*ch, where dst_row_start_index denotes the second starting boundary coordinate, j denotes the row number of the corresponding data in the sliding window, kernel_w denotes the width of the sliding window, the size of the sliding window equals the size of the initial convolution kernel, and ch denotes the number of channels of the input data; the second ending boundary coordinate is calculated as dst_row_end_index=dst_row_start_index+(kernel_w*ch-1); the second starting address is calculated as dst_row_start_address=dst_row_start_index/vmem_lane, where vmem_lane denotes the number of data points that each storage row in the memory of the vector calculation unit can store; the second ending address is calculated as dst_row_end_address=dst_row_end_index/vmem_lane; the second starting sequence number is calculated as dst_row_start_lane=dst_row_start_index%vmem_lane; and the second ending sequence number is calculated as dst_row_end_lane=dst_row_end_index%vmem_lane.
  • According to one or more embodiments of the present disclosure, the vector calculation unit is further configured to, according to the initial position information parameters and the target position information parameters, use a circular shift instruction and, according to a preset enable signal in the predicate register, splice each group of data in the converted arrangement and store it into the corresponding position of the target memory to obtain the target data.
  • the calculation unit includes a calculation sub-unit configured to perform a convolution operation on the target data and the operation convolution kernel using a matrix operation unit.
  • the convolution operation device is used for the first layer convolution operation of the convolutional neural network.
  • an electronic device includes the convolution operation device provided by any embodiment of the present disclosure.
  • According to one or more embodiments of the present disclosure, an electronic device includes: a processor; and a memory including at least one computer program module; wherein the at least one computer program module is stored in the memory and configured to be executed by the processor, and the at least one computer program module is used to implement the convolution operation method provided by any embodiment of the present disclosure.
  • According to one or more embodiments of the present disclosure, a storage medium stores non-transitory computer-readable instructions; when the non-transitory computer-readable instructions are executed by a computer, the convolution operation method provided by any embodiment of the present disclosure is implemented.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • Computational Mathematics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Algebra (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Complex Calculations (AREA)

Abstract

A convolution operation method, a convolution operation device, an electronic device, and a storage medium. The convolution operation method includes: determining an operation convolution kernel, where the operation convolution kernel is obtained based on an initial convolution kernel, the initial convolution kernel is expressed as [R,S,C,K], the operation convolution kernel is expressed as [1,1,(C×R×S),K], and R, S, C, K are all integers greater than 0; adjusting the arrangement of input data based on the number of channels of the operation convolution kernel to obtain target data, where the size and number of channels of the target data are different from those of the input data, and the number of channels of the target data is equal to the number of channels of the operation convolution kernel; and performing a convolution operation based on the target data and the operation convolution kernel to obtain a convolution operation result. The convolution operation result of the target data and the operation convolution kernel is equal to the convolution operation result of the input data and the initial convolution kernel. The convolution operation method can improve the utilization of the matrix operation unit, effectively utilize the computing power of the matrix operation unit, and shorten the time of the convolution operation.

Description

Convolution operation method, convolution operation device, electronic device, and storage medium
This application claims priority to Chinese Patent Application No. 202210610935.6, filed on May 31, 2022, the entire disclosure of which is incorporated herein by reference as part of this application.
Technical Field
Embodiments of the present disclosure relate to a convolution operation method, a convolution operation device, an electronic device, and a storage medium.
Background
With the development of technology, artificial intelligence (AI) has been widely applied in many fields. Deep learning is one of the important techniques of artificial intelligence; deep learning based on artificial neural networks has made great progress in fields such as object classification, text processing, image search, and human-machine dialogue. The convolutional neural network (CNN) is a widely used deep learning technique that can take image data directly as input without complex preprocessing, and therefore has significant advantages in image processing and related areas.
Summary
At least one embodiment of the present disclosure provides a convolution operation method, including: determining an operation convolution kernel, wherein the operation convolution kernel is obtained based on an initial convolution kernel, the initial convolution kernel is expressed as [R,S,C,K], the operation convolution kernel is expressed as [1,1,(C×R×S),K], and R, S, C, K are all integers greater than 0; adjusting the arrangement of input data based on the number of channels of the operation convolution kernel to obtain target data, wherein the size and number of channels of the target data are different from those of the input data, and the number of channels of the target data is equal to the number of channels of the operation convolution kernel; and performing a convolution operation based on the target data and the operation convolution kernel to obtain a convolution operation result, wherein the convolution operation result of the target data and the operation convolution kernel is equal to the convolution operation result of the input data and the initial convolution kernel.
At least one embodiment of the present disclosure further provides a convolution operation device, including: a determination unit configured to determine an operation convolution kernel, wherein the operation convolution kernel is obtained based on an initial convolution kernel, the initial convolution kernel is expressed as [R,S,C,K], the operation convolution kernel is expressed as [1,1,(C×R×S),K], and R, S, C, K are all integers greater than 0; an adjustment unit configured to adjust the arrangement of input data based on the number of channels of the operation convolution kernel to obtain target data, wherein the size and number of channels of the target data are different from those of the input data, and the number of channels of the target data is equal to the number of channels of the operation convolution kernel; and a calculation unit configured to perform a convolution operation based on the target data and the operation convolution kernel to obtain a convolution operation result, wherein the convolution operation result of the target data and the operation convolution kernel is equal to the convolution operation result of the input data and the initial convolution kernel.
At least one embodiment of the present disclosure further provides an electronic device including the convolution operation device provided by any embodiment of the present disclosure.
At least one embodiment of the present disclosure further provides an electronic device, including: a processor; and a memory including at least one computer program module; wherein the at least one computer program module is stored in the memory and configured to be executed by the processor, and the at least one computer program module is used to implement the convolution operation method provided by any embodiment of the present disclosure.
At least one embodiment of the present disclosure further provides a storage medium storing non-transitory computer-readable instructions; when the non-transitory computer-readable instructions are executed by a computer, the convolution operation method provided by any embodiment of the present disclosure is implemented.
Brief Description of the Drawings
In order to explain the technical solutions of the embodiments of the present disclosure more clearly, the drawings of the embodiments are briefly introduced below. Obviously, the drawings described below relate only to some embodiments of the present disclosure and are not limiting of the present disclosure.
Figure 1 is a schematic diagram of the data flow of a convolution operation provided by some embodiments of the present disclosure;
Figure 2 is a schematic flowchart of a convolution operation method provided by some embodiments of the present disclosure;
Figure 3 is a schematic diagram of the principle of a convolution operation;
Figure 4 is a schematic flowchart of step S20 in Figure 2;
Figure 5 is a schematic flowchart of step S21 in Figure 4;
Figure 6 is a schematic diagram of how the input data is stored in memory in the convolution operation method provided by some embodiments of the present disclosure;
Figure 7 is a schematic diagram of how the input data is stored in the static memory in the convolution operation method provided by some embodiments of the present disclosure;
Figure 8 is a schematic flowchart of step S22 in Figure 4;
Figure 9 is a schematic flowchart of step S23 in Figure 4;
Figure 10 is a schematic flowchart of step S232 in Figure 9;
Figure 11 is the first schematic diagram of data rearrangement in the convolution operation method provided by some embodiments of the present disclosure;
Figure 12 is the second schematic diagram of data rearrangement in the convolution operation method provided by some embodiments of the present disclosure;
Figure 13 is a schematic block diagram of a convolution operation device provided by some embodiments of the present disclosure;
Figure 14 is a schematic block diagram of an electronic device provided by some embodiments of the present disclosure;
Figure 15 is a schematic block diagram of another electronic device provided by some embodiments of the present disclosure;
Figure 16 is a schematic block diagram of another electronic device provided by some embodiments of the present disclosure; and
Figure 17 is a schematic diagram of a storage medium provided by some embodiments of the present disclosure.
Detailed Description
Embodiments of the present disclosure will be described in more detail below with reference to the drawings. Although some embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure can be implemented in various forms and should not be construed as limited to the embodiments set forth here; rather, these embodiments are provided for a more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for illustrative purposes only and are not intended to limit the protection scope of the present disclosure.
It should be understood that the steps described in the method implementations of the present disclosure can be performed in different orders and/or in parallel. In addition, method implementations may include additional steps and/or omit the steps shown. The scope of the present disclosure is not limited in this respect.
The term "including" and its variants as used herein are open-ended, i.e., "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions of other terms will be given in the following description.
It should be noted that concepts such as "first" and "second" mentioned in the present disclosure are only used to distinguish different devices, modules, or units, and are not used to limit the order or interdependence of the functions performed by these devices, modules, or units.
It should be noted that the modifiers "one" and "multiple" mentioned in the present disclosure are illustrative rather than restrictive; those skilled in the art should understand that, unless the context clearly indicates otherwise, they should be understood as "one or more".
The names of messages or information exchanged between multiple devices in the embodiments of the present disclosure are for illustrative purposes only and are not used to limit the scope of these messages or information.
The input data of a convolutional neural network is generally a 3-channel image. For example, the input image of the first-layer convolution of the residual network ResNet50 is [1,224,224,3]; that is, the input image has 3 channels and the image size of each channel is 224×224, and the convolution kernel used in the first-layer convolution of ResNet50 has the shape [7,7,3,64]. Commonly used neural network accelerators are equipped with a matrix operation unit (Matrix), which is mainly responsible for accelerating matrix operations and convolution operations in neural networks. To speed up matrix operations, the matrix operation unit generally improves computational parallelism by enlarging the computation scale, for example to 64×64 or 128×128. However, because the number of channels of the input data of the first-layer convolution of a convolutional neural network is small (for example, 3 channels), the utilization of the computing power of the matrix operation unit on the hardware accelerator is low; moreover, the computation time of the first-layer convolution is relatively long, so the acceleration effect is not obvious. In addition, a data arrangement strictly aligned to the channel number (Channel Align Tensor Layout) significantly increases the storage space of the data and the data transmission time.
As shown in Figure 1, a hardware accelerator is usually mounted as a slave device on a PCIe (Peripheral Component Interconnect Express) node of the host; PCIe is a high-speed serial computer expansion bus standard that enables high-speed data transmission. Relative to the host side, the hardware accelerator is the device side. When performing a convolution operation, the input data of the first-layer convolution needs to be sent from the host to the hardware accelerator through PCIe; this process is called Host2Device. For example, the central processing unit (CPU) reads data from memory, transmits it to the hardware accelerator on the device side through PCIe, and stores it in the memory on the hardware accelerator (for example, in DDR). The hardware accelerator can then use these data to perform the convolution operation.
Taking the first-layer convolution of ResNet50 as an example, the first-layer convolution needs to implement the operation between the input data and the convolution kernel, expressed as: [1,224,224,3]×[7,7,3,64]=[1,112,112,64]. Here, the input data is expressed as [1,224,224,3], i.e., 3-channel data of size 224×224; the convolution kernel is expressed as [7,7,3,64], i.e., 64 groups of 3 convolution kernels each, with a kernel size of 7×7; the result obtained is 64-channel data of size 112×112.
Assuming that the matrix operation unit on the hardware accelerator is 64×64, due to the restriction of the channel-aligned data arrangement, the first-layer convolution input data needs to be expanded on the host from [1,224,224,3] to [1,224,224,64], with the redundant channel data all filled with 0. The storage space grows by a factor of 21.33; likewise, the time for transmitting the data from the host side to the hardware accelerator also grows by a factor of 21.33. In this case, the computing power utilization of the matrix operation unit is only 4.68%; as for the convolution operation time, the matrix operation unit needs 614,656 cycles to complete the first-layer convolution.
Because the number of channels of the input data of the first-layer convolution of a convolutional neural network is small while the scale of the matrix operation unit of the hardware accelerator is large, the computational requirements do not match the hardware characteristics, which leads to the following problems in the first-layer convolution of convolutional neural networks. First, the input data needs to be rearranged using the host CPU, which increases the occupied storage space and consumes CPU time. Second, the amount of input data after rearrangement increases, so the Host2Device PCIe transmission time increases. Third, the utilization of the matrix operation unit of the hardware accelerator is low and its full computing power cannot be exploited, resulting in a waste of hardware resources. Fourth, the matrix operation unit of the hardware accelerator takes a long time to perform the first-layer convolution, failing to achieve the purpose of hardware acceleration.
At least one embodiment of the present disclosure provides a convolution operation method, a convolution operation device, an electronic device, and a storage medium. The convolution operation method can improve the utilization of the matrix operation unit, effectively utilize the computing power of the matrix operation unit, shorten the time of the convolution operation, improve operation efficiency, and save data transmission time.
Embodiments of the present disclosure will be described in detail below with reference to the drawings. It should be noted that the same reference numerals in different drawings are used to refer to the same elements that have already been described.
At least one embodiment of the present disclosure provides a convolution operation method, including: determining an operation convolution kernel, wherein the operation convolution kernel is obtained based on an initial convolution kernel, the initial convolution kernel is expressed as [R,S,C,K], and the operation convolution kernel is expressed as [1,1,(C×R×S),K], with R, S, C, K all being integers greater than 0; adjusting the arrangement of input data based on the number of channels of the operation convolution kernel to obtain target data, wherein the size and number of channels of the target data are different from those of the input data, and the number of channels of the target data is equal to the number of channels of the operation convolution kernel; and performing a convolution operation based on the target data and the operation convolution kernel to obtain a convolution operation result. The convolution operation result of the target data and the operation convolution kernel is equal to the convolution operation result of the input data and the initial convolution kernel.
Figure 2 is a schematic flowchart of a convolution operation method provided by some embodiments of the present disclosure. As shown in Figure 2, in some embodiments, the convolution operation method includes steps S10 to S30.
Step S10: determining an operation convolution kernel, wherein the operation convolution kernel is obtained based on an initial convolution kernel, the initial convolution kernel is expressed as [R,S,C,K], the operation convolution kernel is expressed as [1,1,(C×R×S),K], and R, S, C, K are all integers greater than 0;
Step S20: adjusting the arrangement of the input data based on the number of channels of the operation convolution kernel to obtain target data, wherein the size and number of channels of the target data are different from those of the input data, and the number of channels of the target data is equal to the number of channels of the operation convolution kernel;
Step S30: performing a convolution operation based on the target data and the operation convolution kernel to obtain a convolution operation result, wherein the convolution operation result of the target data and the operation convolution kernel is equal to the convolution operation result of the input data and the initial convolution kernel.
For example, the convolution operation method can be used for the first-layer convolution operation of a convolutional neural network. Of course, the embodiments of the present disclosure are not limited thereto: the convolution operation method can be used not only for convolutional neural networks but also for convolution operations of other types of networks, and not only for the first-layer convolution operation but also for convolution operations of other layers, which can be determined according to actual requirements and is not limited by the embodiments of the present disclosure.
For example, in step S10, the initial convolution kernel is the convolution kernel required for the first-layer convolution operation and is expressed as [R,S,C,K]. Taking the first-layer convolution of ResNet50 as an example, the operation to be implemented is expressed as: [1,224,224,3]×[7,7,3,64]=[1,112,112,64], so the initial convolution kernel [R,S,C,K] is [7,7,3,64]; that is, in this example, R=7, S=7, C=3, K=64. The parameters of the initial convolution kernel are transformed to obtain the operation convolution kernel [1,1,(C×R×S),K]. In the above example, the operation convolution kernel obtained from the initial convolution kernel is [1,1,147,64]. The transformation principle of the convolution kernel is briefly explained below with reference to Figure 3.
Figure 3 is a schematic diagram of the principle of a convolution operation. As shown in Figure 3, the input data size is [1,3,3,5] and the convolution kernel size is [2,2,5,4], so the output data size is [1,2,2,4]. For example, for point M, the calculation is as shown in Figure 3: since the convolution kernel size is 2×2 with 5 channels, point M is the result of multiplying the corresponding 20 points of the input data and the 20 points of the convolution kernel and then accumulating the products. Using this property of the convolution calculation, the convolution kernel can be transformed from R×S×C×K into 1×1×(C×R×S)×K, with the input data adjusted accordingly, so that the result of the whole convolution remains unchanged. Through such a transformation, the number of channels can be increased. For the first layer of a convolutional neural network, the convolution kernel is adjusted from [7,7,3,64] to [1,1,147,64], and the number of channels is expanded from 3 to 3×7×7=147, so that the computing power of the matrix operation unit (Matrix) can be fully utilized. Therefore, in step S10, the initial convolution kernel [R,S,C,K] can be transformed into the operation convolution kernel [1,1,(C×R×S),K], thereby changing the arrangement of the convolution kernel.
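For illustration, this equivalence can be checked with the following minimal NumPy sketch. It is an illustration under small assumed shapes with hypothetical helper names, not the disclosed hardware flow; the window-flattening order is assumed to be row by row with channels unrolled:

```python
import numpy as np

def conv2d_direct(x, w, stride):
    """x: [H, W, C]; w: [R, S, C, K] (initial kernel) -> out: [ht, wt, K]."""
    H, W, C = x.shape
    R, S, _, K = w.shape
    ht, wt = (H - R) // stride + 1, (W - S) // stride + 1
    out = np.zeros((ht, wt, K))
    for i in range(ht):
        for j in range(wt):
            win = x[i*stride:i*stride+R, j*stride:j*stride+S, :]
            out[i, j] = np.tensordot(win, w, axes=([0, 1, 2], [0, 1, 2]))
    return out

def conv2d_flattened(x, w, stride):
    """Flatten every R x S x C window into one (R*S*C)-channel point
    (the target data), then apply the kernel reshaped to [R*S*C, K]
    as a 1x1 convolution, i.e., a plain matrix multiply."""
    H, W, C = x.shape
    R, S, _, K = w.shape
    ht, wt = (H - R) // stride + 1, (W - S) // stride + 1
    target = np.zeros((ht, wt, R * S * C))
    for i in range(ht):
        for j in range(wt):
            target[i, j] = x[i*stride:i*stride+R,
                             j*stride:j*stride+S, :].reshape(-1)
    return target @ w.reshape(R * S * C, K)  # operation kernel [1,1,(C*R*S),K]

x = np.random.rand(9, 9, 3)
w = np.random.rand(7, 7, 3, 4)
assert np.allclose(conv2d_direct(x, w, 2), conv2d_flattened(x, w, 2))
```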
For example, the arrangement of the convolution kernel can be changed offline, because the convolution kernel used by a neural network model in the deployment stage is fixed and does not change with the input, so the convolution kernel can be preprocessed into the required arrangement. In the embodiments of the present disclosure, the convolution kernel [R,S,C,K] to be used can be set to [1,1,(C×R×S),K] in the deployment stage of the neural network model and used as the convolution kernel for subsequent operations. For example, in the model compilation stage, a high-level language (for example, Python) can be used to modify the code corresponding to the initial convolution kernel [R,S,C,K] so as to obtain the operation convolution kernel [1,1,(C×R×S),K]. Of course, the embodiments of the present disclosure are not limited thereto; the initial convolution kernel [R,S,C,K] may also be adjusted before each convolution operation to obtain the operation convolution kernel [1,1,(C×R×S),K] actually used in that operation.
Returning to Figure 2, in step S20, the arrangement of the input data is adjusted based on the number of channels of the operation convolution kernel to obtain the target data. For example, the size and number of channels of the target data are different from those of the input data, and the number of channels of the target data is equal to the number of channels of the operation convolution kernel. Taking the first-layer convolution of ResNet50 as an example, the calculation that should be performed is [1,224,224,3]×[7,7,3,64]=[1,112,112,64]. Since the convolution kernel has been adjusted to [1,1,147,64], in order to keep the calculation result unchanged, the arrangement of the input data needs to be adjusted so that the operation becomes [1,112,112,147]×[1,1,147,64]=[1,112,112,64]. It follows that the data obtained after adjusting the input data is [1,112,112,147]. The data obtained after adjusting the arrangement of the input data is called the target data; the target data is the data that is finally convolved with the operation convolution kernel. Because of the rearrangement, the size and number of channels of the target data differ from those of the input data. From the above formula, the number of channels of the target data is equal to the number of channels of the operation convolution kernel (for example, both are 147 in the above example), which facilitates the convolution operation between the two.
For example, the number of channels of the target data is greater than that of the input data, and the number of channels of the operation convolution kernel is greater than that of the initial convolution kernel, thereby increasing the number of channels and making full use of the computing power of the matrix operation unit. For example, in the above example, the numbers of channels of the input data and of the initial convolution kernel are both 3, while the numbers of channels of the target data and of the operation convolution kernel are both 147, thus achieving the increase in the number of channels. For example, the transformation of the arrangement of the input data needs to be completed online; that is, in the neural network inference stage, the arrangement needs to be adjusted for every input.
Figure 4 is a schematic flowchart of step S20 in Figure 2. For example, in some examples, as shown in Figure 4, step S20 may further include steps S21 to S23.
Step S21: storing the input data in a static memory in row units, wherein each row of the input data is stored in corresponding N storage rows in the static memory, and N is an integer greater than 0;
Step S22: performing a filling operation on the input data stored in the static memory to obtain expanded data;
Step S23: adjusting the arrangement of the expanded data to change the size and number of channels of the expanded data, to obtain the target data.
For example, in step S21, the input data is first stored in the static memory; the static memory is provided in the hardware accelerator and is, for example, a static random access memory (SRAM). The input data is stored in the static memory in row units; that is, each row of the input data is stored in corresponding N storage rows in the static memory, where N is an integer greater than 0. For example, the input data can be transferred to the static memory using the data flow shown in Figure 1.
As shown in Figure 5, step S21 may further include steps S211 to S212.
Step S211: storing the input data in memory in a tightly arranged manner, wherein the input data includes multiple channels, and the tightly arranged manner means that the multiple channels of the same data point are stored sequentially and adjacently in the memory;
Step S212: transferring the input data in the memory to the static memory of the hardware accelerator by direct memory access, and storing the first data point of each row of the input data in the first column of a different row of the static memory, so that each row of the input data is stored in corresponding N storage rows in the static memory.
For example, in step S211, the tightly arranged manner is, for example, the channel-aligned data arrangement (Channel Align Tensor Layout). The input data includes multiple channels, and the tightly arranged manner means that the multiple channels of the same data point are stored sequentially and adjacently in the memory. As shown in Figure 6, in some examples, the first-layer convolution input data is [1,224,224,3], where the number 1 indicates Batch Size=1, one of the numbers 224 indicates Height=224, the other 224 indicates Width=224, and the number 3 indicates Channel=3; that is, the input data is a 3-channel image of size 224×224. To reduce the storage space of the data on the host, a tightly packed storage scheme is used. For example, for the pixel in the first row and first column of the input data, the values of its 3 channels are stored consecutively in the memory space; then the pixel in the first row and second column is stored, with the values of its 3 channels stored consecutively, and so on.
For example, the value of each channel of each pixel is represented in the FP16 data format, which occupies 2 bytes of address space. The first-layer input image occupies a total of 224×224×3×2 Byte=301056 Byte, i.e., 294 KB. Here, the data format used and the address space occupied are exemplary and do not constitute a limitation on the embodiments of the present disclosure.
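As a small illustration of this tightly packed (HWC) layout and the size calculation, the following sketch uses hypothetical helper names and the byte counts from the example above:

```python
import numpy as np

img = np.zeros((224, 224, 3), dtype=np.float16)   # [H, W, C], FP16
assert img.nbytes == 224 * 224 * 3 * 2 == 301056  # 294 KB on the host

flat = img.reshape(-1)  # the 1D tensor as it sits in host memory / DDR

def element_index(h, w, c, W=224, C=3):
    # Channels of one pixel are adjacent; pixels follow row by row.
    return (h * W + w) * C + c

# The byte address of channel c of pixel (h, w) is element_index(...) * 2.
assert flat[element_index(0, 1, 0)] == img[0, 1, 0]
```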
For example, in step S212, the input data in the memory is transferred to the static memory of the hardware accelerator by direct memory access (DMA). The data transfer can follow the scheme shown in Figure 1: the input data of the first-layer convolution is moved from the host side to the DDR on the device side through PCIe, and the storage format of the input data in the DDR is the same as that in the host memory, thereby completing the Host2Device process for the one-dimensional tensor (1D Tensor). For example, the input data is stored contiguously in the DDR, occupying, for example, 294 KB. Then, the input data needs to be transferred to the static memory; that is, these 294 KB of data need to be moved from the DDR to the SRAM in the processing engine (PE, also called the hardware accelerator).
For example, during storage, the first data point of each row of the input data is stored in the first column of a different row of the static memory, so that each row of the input data is stored in corresponding N storage rows in the static memory.
As shown in Figure 7, in some examples, the organization of the SRAM can be abstractly regarded as a table of M rows and N columns, with one datum stored in each cell. Since the size of the input data is 224×224, the input data is logically divided into 224 rows, and the start of each row is at the first column of some row of the SRAM. Since the number of SRAM columns is limited, one row of the input data can hardly be stored entirely in one row of the SRAM; therefore, one row of the input data is spread over multiple rows of the SRAM, i.e., over different SRAM addresses. For the input data of [1,224,224,3], taking into account the data to be added by the subsequent filling operation, each row has 229 points and each point has 3 channels. For an SRAM whose storage row is 1024 bits, the number of SRAM rows needed to store one row of data is ceil(229×3/64)=11, where ceil denotes rounding up. That is, one row of data points of the input data is stored in 11 corresponding storage rows of the SRAM; in this example N=11, and the entire input data occupies 224×11=2464 SRAM rows.
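The row-splitting arithmetic above can be sketched as follows, with hypothetical constants matching the example (128-byte SRAM rows holding 64 FP16 lanes):

```python
import math

SRAM_ROW_BYTES = 128
ELEM_BYTES = 2                      # FP16
L = SRAM_ROW_BYTES // ELEM_BYTES    # 64 data points per storage row

points_per_row = 229 * 3            # padded row length x channels
N = math.ceil(points_per_row / L)   # = 11 storage rows per image row
assert N == 11 and 224 * N == 2464  # SRAM rows used by the raw input
```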
As shown in Figure 7, the left side shows the input data stored contiguously in the DDR, without the concepts of H, W, and C; the right side shows that, after being moved to the SRAM by DMA, the data in the SRAM is split according to the rows of the input data, with each row of data occupying a certain amount of SRAM space (for example, 11 storage rows). In this way, the data transfer from the DDR to the SRAM in the PE is realized, completing the conversion from a one-dimensional tensor (1D Tensor) to a two-dimensional tensor (2D Tensor).
The DMA transfer process is briefly described as follows.
Assume that the input data is stored in a contiguous piece of DDR space starting at the source address (source_address). First, the 224×3×2 Byte=1344 Byte of the first row need to be moved to a contiguous piece of SRAM space starting at the destination address (destiny_address). Since one SRAM row is 128 Byte, these 1344 Byte need to be stored into ceil(1344/128)=11 SRAM rows; that is, the DMA needs to send 11×128 Byte of data contiguously. After completing the transfer of the first row, the DMA needs to jump the read address from source_address to source_address+1344 Byte, i.e., the DDR address of the beginning of the second row of the actual input data, and then move another 11×128 Byte contiguously into the SRAM space starting at destiny_address+11. By analogy, after 224 transfers, all the input data has been moved from the DDR into the SRAM in the processing engine; that is, the transformation from a one-dimensional tensor to a two-dimensional tensor is completed.
It should be noted that the 11×128 Byte sent each time is larger than the actual amount of data in each row, i.e., it also contains data of the next row; however, since the start address of each transfer is accurate, even if data are sent repeatedly, the data themselves are not affected, and the redundant data do not affect subsequent processing.
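A minimal software model of this per-row DMA pattern is given below (byte-addressable buffers and function names are assumed; source_address/destiny_address follow the naming above):

```python
def dma_1d_to_2d(ddr, sram_rows, source_address, destiny_address,
                 height=224, row_payload=1344, sram_row=128, n=11):
    """Move each image row (1344 payload bytes) as n contiguous 128-byte
    bursts; the read pointer jumps by the payload size each row, so the
    redundant tail of every burst is simply overwritten or ignored."""
    for r in range(height):
        src = source_address + r * row_payload      # start of image row r
        dst = destiny_address + r * n               # first SRAM row for row r
        for k in range(n):
            sram_rows[dst + k] = ddr[src + k * sram_row:
                                     src + (k + 1) * sram_row]

ddr = bytes(224 * 1344 + 11 * 128)   # toy DDR buffer with slack at the end
sram_rows = {}
dma_1d_to_2d(ddr, sram_rows, 0, 0)
assert len(sram_rows) == 224 * 11    # 2464 SRAM rows written
```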
Returning to Figure 4, for example, in step S22, a filling operation is performed on the input data stored in the static memory to obtain the expanded data. Here, the expanded data refers to the data obtained after the filling operation. For example, in some examples, assuming that the convolution calculation that actually needs to be completed is [1,224,224,3]×[7,7,3,64]=[1,112,112,64], the input data needs to be padded in all directions (up, down, left, and right). When padding, 3 points need to be filled on the left and on the top of the input data respectively (3 columns on the left, 3 rows on the top), and 2 points need to be filled on the right and on the bottom respectively (2 columns on the right, 2 rows on the bottom); the size of the expanded data obtained after the filling operation is [1,229,229,3].
Figure 8 is a schematic flowchart of step S22 in Figure 4. As shown in Figure 8, step S22 may further include steps S221 to S223.
Step S221: in the static memory, filling the storage rows before and after the storage location corresponding to the input data with a first preset value to obtain first intermediate data, wherein the first intermediate data includes the input data and the filled first preset value;
Step S222: transmitting the first intermediate data to the vector calculation unit, and using the shift instructions and filling instructions of the vector calculation unit to fill both ends of each row corresponding to the first intermediate data with a second preset value to obtain second intermediate data, wherein the second intermediate data includes the first intermediate data and the filled second preset value;
Step S223: transferring the second intermediate data to the corresponding storage location in the static memory to obtain the expanded data, wherein the expanded data has the same content as the second intermediate data.
For example, in step S221, the storage rows before and after the storage location of the input data in the static memory are filled with the first preset value, thereby obtaining the first intermediate data, which includes the input data and the filled first preset value. This step performs, for example, the filling operation for the top and bottom of the input data. For example, in some examples, near the destination address of the SRAM, the SRAM space required for the top padding needs to be reserved; that is, several rows of data are to be inserted before the first row of the actual input data. The vector calculation unit (Vector) in the hardware accelerator is used to perform the top and bottom padding. For example, the filled first preset value is usually 0, so all-zero values need to be written at several addresses before and after the storage space of the input data in the SRAM, thereby obtaining the first intermediate data. The first intermediate data is the data with top and bottom padding completed; its left and right padding has not yet been performed.
For example, in step S222, the first intermediate data is transmitted to the vector calculation unit, and the shift instruction (for example, the vshiftri instruction) and the filling instruction (for example, the SI2V instruction) of the vector calculation unit are used to fill both ends of each row corresponding to the first intermediate data with the second preset value to obtain the second intermediate data. The second intermediate data includes the first intermediate data and the filled second preset value. This step performs, for example, the left and right filling operations on the first intermediate data.
For example, in some examples, the data at the 2464 addresses in the SRAM is grouped by 11 rows, sent in turn to the vector calculation unit, and stored in the storage space vmem within the vector calculation unit. The vector calculation unit then uses the vshiftri instruction to shift the data right as a whole, leaving space for the left padding, and then uses the SI2V instruction to write the corresponding second preset value (for example, usually set to 0) at these positions. For the right padding, after the data has been shifted right as a whole, the corresponding second preset value is written after the last column of the first row of the input data. If the amount of filled data is too large, additional vmem space needs to be added as required. For example, the left and right filling operations can be performed on multiple groups of 11 rows of data in a pipelined manner to improve processing efficiency.
For example, in step S223, the second intermediate data is transferred to the corresponding storage location in the static memory to obtain the expanded data. The expanded data has the same content as the second intermediate data. That is, the second intermediate data on which the filling operation has been completed in vmem is written back to the corresponding address space in the SRAM, and the data stored in the SRAM on which the filling operation has been completed is called the expanded data.
It should be noted that, in the case where no filling operation needs to be performed on the input data, step S22 can be omitted. Moreover, in the embodiments of the present disclosure, when a filling operation is required, the top-bottom padding can be performed first and then the left-right padding, or the left-right padding first and then the top-bottom padding; the specific padding order is not limited. The instructions used to perform the filling operation are not limited to the vshiftri instruction and the SI2V instruction; other applicable instructions can also be used as long as the filling operation can be realized, which is not limited by the embodiments of the present disclosure.
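For reference, a NumPy model of the two-stage padding is sketched below (rows first, then row ends, matching one of the orders allowed above; the vshiftri/SI2V instructions are replaced by array copies, and the 3/2 pad split is the example's):

```python
import numpy as np

def pad_rows_then_cols(x, top=3, bottom=2, left=3, right=2, value=0.0):
    """Stage 1: write constant rows before/after the data (top/bottom,
    done in SRAM). Stage 2: shift every row and fill both ends
    (left/right, done by the vector unit's shift and fill instructions,
    modeled here with slice assignments)."""
    h, w, c = x.shape
    first = np.full((h + top + bottom, w, c), value, dtype=x.dtype)
    first[top:top + h] = x                          # first intermediate data
    second = np.full((h + top + bottom, w + left + right, c), value,
                     dtype=x.dtype)
    second[:, left:left + w] = first                # second intermediate data
    return second                                   # the expanded data

expanded = pad_rows_then_cols(np.zeros((224, 224, 3), dtype=np.float16))
assert expanded.shape == (229, 229, 3)
```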
Returning to Figure 4, for example, in step S23, the arrangement of the expanded data is adjusted to change its size and number of channels, thereby obtaining the target data. That is, in order to match the operation convolution kernel and keep the operation result unchanged, the arrangement of the expanded data needs to be adjusted to change its size and number of channels. For example, the number of channels of the target data obtained after the adjustment is equal to the number of channels of the operation convolution kernel, and the target data is expressed as [1,ht,wt,(C×R×S)], where ht and wt are both integers greater than 0.
Figure 9 is a schematic flowchart of step S23 in Figure 4. As shown in Figure 9, step S23 may further include steps S231 to S232.
Step S231: successively reading the data in R*N storage rows in the static memory and transmitting it to the vector calculation unit;
Step S232: the vector calculation unit converting the data in the R*N storage rows received each time into data of wt*ceil((C×R×S)/L) storage rows, to obtain the target data.
For example, in step S231, the data in R*N storage rows in the static memory is read each time and transmitted to the vector calculation unit, so that the vector calculation unit can convert the data in the R*N storage rows received each time. For example, the starting address of each read moves by str*N storage rows according to the preset skip step str. The preset skip step is the stride, in the row direction and the column direction, of the sliding window required to perform the convolution operation between the input data and the initial convolution kernel; the total number of reads from the static memory equals ht.
For example, in some examples, still taking the first-layer convolution of ResNet50 as an example, the initial convolution kernel is [R,S,C,K]=[7,7,3,64], so R=7. The input data is [1,224,224,3], and one row of the input data is stored in N storage rows of the SRAM; if one SRAM row is 128 Byte, then N=11. Therefore, data in R*N=77 storage rows of the static memory is read each time and transmitted to the vector calculation unit; the data stored in these 77 storage rows corresponds to one row of the 224×224 input data. For example, the stride of the sliding window in the row and column directions required for the convolution of the input data [1,224,224,3] with the initial convolution kernel [7,7,3,64] is 2, so the preset skip step str is 2. The starting address of each read moves by str*N (i.e., 2×11=22) storage rows, so that the data read is consistent with the data covered by the sliding window in the original convolution operation. From the formula [1,224,224,3]×[7,7,3,64]=[1,112,112,147]×[1,1,147,64]=[1,112,112,64], it can be seen that the convolution kernel is transformed into [1,1,147,64] and the target data [1,ht,wt,(C×R×S)] to be obtained is [1,112,112,147], so ht=112 and wt=112. For example, the total number of reads from the static memory equals ht (for example, 112), and the data read each time corresponds, after conversion, to one row of the target data.
For example, in step S232, the vector calculation unit converts the data in the R*N storage rows received each time into data of wt*ceil((C×R×S)/L) storage rows; the converted data is the target data. That is, the arrangement of the data is adjusted to change the size and number of channels of the data. For example, L in this formula denotes the number of data points that each storage row of the static memory can store, and ceil((C×R×S)/L) denotes rounding (C×R×S)/L up. For example, in some examples, the vector calculation unit receives data of 7×11 storage rows each time; if one SRAM row is 128 Byte, the number L of data points that each storage row can store is 64; for the initial convolution kernel [7,7,3,64], R=7, S=7, C=3; for the target data [1,ht,wt,(C×R×S)]=[1,112,112,147], wt=112. Therefore, wt*ceil((C×R×S)/L)=112×3; that is, the vector calculation unit converts the data in the 7×11 storage rows received each time into data of 112×3 storage rows.
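The read-and-convert bookkeeping can be summarized in a self-contained sketch (constants from the running example; only the address arithmetic is modeled):

```python
import math

R, N, L = 7, 11, 64          # kernel height, SRAM rows per image row, lanes
str_, ht, wt, crs = 2, 112, 112, 147   # stride, target sizes, C*R*S

def read_addresses(t):
    """SRAM rows fetched for target row t: R*N rows starting at t*str*N."""
    base = t * str_ * N
    return list(range(base, base + R * N))

rows_out_per_read = wt * math.ceil(crs / L)          # 112 * 3 = 336 rows
total_reads = ht                                      # 112 reads in total
assert read_addresses(1)[0] - read_addresses(0)[0] == str_ * N   # jump of 22
```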
Figure 10 is a schematic flowchart of step S232 in Figure 9. For example, in some examples, the above step S232 further includes steps S2321 to S2323.
Step S2321: dividing the data in the R*N storage rows into multiple groups of data according to the preset skip step;
Step S2322: for each group of data, determining the initial position information parameters and the target position information parameters of each row of data in the sliding window corresponding to the group of data;
Step S2323: the vector calculation unit storing each group of data in the converted arrangement into the corresponding position of the target memory according to the initial position information parameters and the target position information parameters, to obtain the target data.
For example, in step S2321, the data in the R*N storage rows is divided into multiple groups of data according to the preset skip step; each group of data corresponds to one sliding window in the row direction, and the number of groups equals wt. For example, in some examples, the data in the 7×11 storage rows is divided into 112 groups of data according to the preset skip step str=2, with wt=112. The data in the 7×11 storage rows corresponds to one row of the 224×224 input data, and the 112 groups correspond to the 112 sliding windows that a row of 224 data makes with a stride of 2.
For example, in step S2322, for each group of data, the initial position information parameters and target position information parameters of each row of data in the sliding window corresponding to that group are determined. The initial position information parameters are used to determine the source address of that row of data within the sliding window, and the target position information parameters are used to determine the destination address to which these data are moved.
The operation of the above steps S2321 to S2322 is illustrated below with an example.
For example, after the padding operation in all directions is performed on the input data [1,224,224,3], the whole input data becomes [1,229,229,3], occupying 229×11 address spaces in the SRAM. Next, the shape of the input data needs to be converted into [1,112,112,147]. In essence, for the convolution operation, every sliding window (Feature Window) that the convolution kernel slides to needs to complete the transformation from [7×7×3] to [1,1,147] shown in Figure 11.
Since the sliding window that each convolution kernel slides to corresponds to 7 rows and 7 columns of the original data (the input data or input image), the sliding windows swept by the kernel from the top-left to the bottom-right overlap. For the overlap in the column direction, in order to avoid reading data from the SRAM repeatedly, the data of 7×11=77 address spaces is read from the SRAM each time and handed to the vector calculation unit for processing. For the overlap in the row direction, the sliding window sliding from left to right repeatedly reads the overlapping data in the row direction. Overall, the data in the SRAM is divided into 112 groups of data, corresponding to the 112 converted rows. Each group of data occupies 7×11 address spaces before conversion. After the vector calculation unit reads a group of data and processes it, it outputs data of the size of 112×3 address spaces, where 112 corresponds to the converted data width and 3 corresponds to the space occupied by the 147 channels (147 channels need to occupy 3 SRAM storage rows, i.e., 3 SRAM address spaces).
After the vector calculation unit obtains a group of data, the data in these 7×11 SRAM storage rows (entries) is temporarily stored in the vmem inside the vector calculation unit, with its arrangement unchanged. Then, using the instruction operations of the vector calculation unit, it is converted into data of 112×3 vmem storage rows. Afterwards, the result is written back to the SRAM.
The converted data width is 112 points in the row direction; each point has 147 channels, distributed over 3 vmem storage rows. For each point in the row direction, the sliding window originally corresponding to 7×7×3 needs to be converted into 1×1×147. To this end, the 7 rows of data corresponding to each sliding window need to be found and then recombined into the new data arrangement. As shown in Figure 12, for the first sliding window, the data width of the sliding window is 7×3=21 channels, and there are 7 rows of data in total (correspondingly stored in 7×11 storage rows); the storage addresses of these 7 rows of data need to be determined, and then their arrangement is converted. Referring to the right side of Figure 12, these 7 rows of data are rearranged into 3 rows: the original rows 0, 1, and 2 form the new row 1, the original rows 3 and 4 form the new row 2, and the original rows 5 and 6 form the new row 3; these new 3 rows together store the 147 data points covered by the original 7×7×3 sliding window. By analogy, for the next sliding window in the row direction, the arrangement of the data is converted in a similar manner until the whole row of sliding windows corresponding to these 7 rows of data has been converted; then the next group of data of 7×11 storage rows is read.
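A NumPy sketch of the per-window gather is given below (one padded image row is modeled as 7 arrays of 229×3 unrolled values; the function name is hypothetical, and it only shows the 7×7×3 → 1×1×147 repacking, with the split across 64-lane storage rows left to the address computation described next):

```python
import numpy as np

def flatten_window(window_rows, i, str_=2, kernel_w=7, ch=3):
    """window_rows: 7 arrays, each one padded image row with channels
    unrolled (length 229*3). Gather the i-th sliding window in the row
    direction and repack it as 147 consecutive values, i.e., one
    [1,1,147] point of the target data."""
    start = i * str_ * ch                    # left boundary, in elements
    span = kernel_w * ch                     # 21 elements per window row
    return np.concatenate([r[start:start + span] for r in window_rows])

rows = [np.arange(229 * 3, dtype=np.float32) for _ in range(7)]
assert flatten_window(rows, 0).shape == (147,)
```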
In order to determine the initial position and target position of each row of data in the sliding window, initial position information parameters and target position information parameters need to be defined for each row of data in the sliding window.
The initial position information parameters include a first starting boundary coordinate, a first ending boundary coordinate, a first starting address, a first ending address, a first starting sequence number, and a first ending sequence number.
The first starting boundary coordinate represents the relative coordinate of the starting boundary of the corresponding sliding window in the row direction of the expanded data, and the first ending boundary coordinate represents the relative coordinate of the ending boundary of the corresponding sliding window in the row direction of the expanded data. The starting boundary and the ending boundary of the corresponding sliding window are located at different positions in the row direction of the expanded data. The data obtained after the filling operation is the expanded data, so these coordinates and parameters are all defined with respect to the expanded data. In other cases where no filling operation is required, these coordinates and parameters can be defined directly with respect to the input data. As shown in Figure 12, the starting boundary is, for example, the left boundary of the sliding window, and the first starting boundary coordinate is the relative coordinate of the left boundary of the sliding window in the row direction of the 229×3 expanded data; the ending boundary is, for example, the right boundary of the sliding window, and the first ending boundary coordinate is the relative coordinate of the right boundary of the sliding window in the row direction of the 229×3 expanded data.
The formula for the first starting boundary coordinate is: src_row_start_index=i*str*ch, where src_row_start_index denotes the first starting boundary coordinate, i denotes the sequence number of the data point corresponding to the sliding window in the size wt of the target data (for example, indicating which of the 112 sliding windows in a row this window is, i.e., which of the wt=112 positions in the output width), str denotes the stride of the sliding window in the row direction (for example, 2), and ch denotes the number of channels of the input data (for example, 3).
The formula for the first ending boundary coordinate is: src_row_end_index=src_row_start_index+(kernel_w*ch-1), where src_row_end_index denotes the first ending boundary coordinate and kernel_w denotes the width of the sliding window (for example, 7); the size of the sliding window equals the size of the initial convolution kernel (for example, both are 7×7).
The first starting address represents the address of the first starting boundary coordinate in the memory (for example, vmem) of the vector calculation unit, and the first ending address represents the address of the first ending boundary coordinate in the memory (for example, vmem) of the vector calculation unit. The first starting sequence number represents the sequence number of the data point corresponding to the first starting boundary coordinate at the first starting address, and the first ending sequence number represents the sequence number of the data point corresponding to the first ending boundary coordinate at the first ending address. Since vmem is stored row by row, a certain storage row in vmem can be located according to the first starting address or the first ending address, and the first starting sequence number or first ending sequence number indicates which datum in that storage row the corresponding datum is.
The formula for the first starting address is: src_row_start_address=src_row_start_index/vmem_lane+j*N, where src_row_start_address denotes the first starting address, vmem_lane denotes the number of data points that each storage row in the memory of the vector calculation unit can store, and j denotes the row number of the corresponding data in the sliding window (for example, a value from 1 to 7).
The formula for the first ending address is: src_row_end_address=src_row_end_index/vmem_lane+j*N, where src_row_end_address denotes the first ending address.
The formula for the first starting sequence number is: src_row_start_lane=src_row_start_index%vmem_lane, where src_row_start_lane denotes the first starting sequence number. For example, % denotes the modulo operation.
The formula for the first ending sequence number is: src_row_end_lane=src_row_end_index%vmem_lane, where src_row_end_lane denotes the first ending sequence number.
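These formulas translate directly into integer arithmetic, as in the following sketch (a hypothetical helper; the division in the address formulas is taken as integer, i.e., floor, division, which matches the row-addressed vmem, and the window-row index j is taken 0-based here although the text above numbers the rows 1 to 7):

```python
def src_position(i, j, str_=2, ch=3, kernel_w=7, vmem_lane=64, N=11):
    """Source-side parameters for sliding window i, window row j."""
    start_index = i * str_ * ch                       # src_row_start_index
    end_index = start_index + (kernel_w * ch - 1)     # src_row_end_index
    return {
        "src_row_start_address": start_index // vmem_lane + j * N,
        "src_row_end_address":   end_index // vmem_lane + j * N,
        "src_row_start_lane":    start_index % vmem_lane,
        "src_row_end_lane":      end_index % vmem_lane,
    }
```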
After the above parameters are determined, the positions in vmem of the source data needed to convert one 7×7×3 window can be determined. In order to move these source data to their destination addresses in vmem, the corresponding destination addresses and related parameters also need to be determined; that is, the target position information parameters need to be determined.
The target position information parameters include a second starting boundary coordinate, a second ending boundary coordinate, a second starting address, a second ending address, a second starting sequence number, and a second ending sequence number.
The second starting boundary coordinate represents the relative coordinate of the starting boundary of the corresponding sliding window in the data size of [1,1,(C×R×S)], and the second ending boundary coordinate represents the relative coordinate of the ending boundary of the corresponding sliding window in the data size of [1,1,(C×R×S)]. The starting boundary and the ending boundary of the corresponding sliding window are located at different positions in the row direction of the expanded data. For example, the target data is expressed as [1,ht,wt,(C×R×S)]. In some examples, the target data is [1,112,112,147], and the data size corresponding to each sliding window needs to be converted from [7,7,3] to [1,1,147]. As shown in Figure 12, the starting boundary is, for example, the left boundary of the sliding window, and the second starting boundary coordinate is the relative coordinate of the left boundary of the sliding window in the data size of [1,1,147]; the ending boundary is, for example, the right boundary of the sliding window, and the second ending boundary coordinate is the relative coordinate of the right boundary of the sliding window in the data size of [1,1,147].
The formula for the second starting boundary coordinate is: dst_row_start_index=j*kernel_w*ch, where dst_row_start_index denotes the second starting boundary coordinate, j denotes the row number of the corresponding data in the sliding window (for example, a value from 1 to 7), kernel_w denotes the width of the sliding window (for example, 7), the size of the sliding window equals the size of the initial convolution kernel (for example, both are 7×7), and ch denotes the number of channels of the input data (for example, 3).
The formula for the second ending boundary coordinate is: dst_row_end_index=dst_row_start_index+(kernel_w*ch-1), where dst_row_end_index denotes the second ending boundary coordinate.
The second starting address represents the address of the second starting boundary coordinate in the memory (for example, vmem) of the vector calculation unit, and the second ending address represents the address of the second ending boundary coordinate in the memory (for example, vmem) of the vector calculation unit. The second starting sequence number represents the sequence number of the data point corresponding to the second starting boundary coordinate at the second starting address, and the second ending sequence number represents the sequence number of the data point corresponding to the second ending boundary coordinate at the second ending address. Since vmem is stored row by row, a certain storage row in vmem can be located according to the second starting address or the second ending address, and the second starting sequence number or second ending sequence number indicates which datum in that storage row the corresponding datum is.
The formula for the second starting address is: dst_row_start_address=dst_row_start_index/vmem_lane, where dst_row_start_address denotes the second starting address and vmem_lane denotes the number of data points that each storage row in the memory of the vector calculation unit can store.
The formula for the second ending address is: dst_row_end_address=dst_row_end_index/vmem_lane, where dst_row_end_address denotes the second ending address.
The formula for the second starting sequence number is: dst_row_start_lane=dst_row_start_index%vmem_lane, where dst_row_start_lane denotes the second starting sequence number.
The formula for the second ending sequence number is: dst_row_end_lane=dst_row_end_index%vmem_lane, where dst_row_end_lane denotes the second ending sequence number.
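The destination-side parameters follow the same pattern (same floor-division and 0-based-j assumptions as the sketch above). For example, window row j=2 (the third row) of a 7×3 window lands at lanes 42 to 62 of the first destination storage row, consistent with the regrouping of rows 0, 1, and 2 into the first new row described with reference to Figure 12:

```python
def dst_position(j, ch=3, kernel_w=7, vmem_lane=64):
    """Destination-side parameters for window row j in [1,1,147]."""
    start_index = j * kernel_w * ch                   # dst_row_start_index
    end_index = start_index + (kernel_w * ch - 1)     # dst_row_end_index
    return {
        "dst_row_start_address": start_index // vmem_lane,
        "dst_row_end_address":   end_index // vmem_lane,
        "dst_row_start_lane":    start_index % vmem_lane,
        "dst_row_end_lane":      end_index % vmem_lane,
    }

p = dst_position(2)
assert p["dst_row_start_lane"] == 42 and p["dst_row_end_lane"] == 62
```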
After the initial position information parameters and the target position information parameters are determined, the source addresses and destination addresses required for the data transfer can be determined; the source data is then moved to the destination addresses according to these parameters.
For example, in step S2323, after the initial position information parameters and the target position information parameters are determined, the vector calculation unit stores each group of data in the converted arrangement into the corresponding position of the target memory; the destination address indicated by the target position information parameters is an address in the target memory, thereby obtaining the target data. For example, the target memory stores data in row units, and the data transferred to and stored in the target memory is the target data. For example, the target memory may be the static memory described above (in which case the data before conversion and the data after conversion are stored at different addresses of the static memory), or it may be another storage device different from the static memory described above; the embodiments of the present disclosure do not limit this.
For example, step S2323 may further include: according to the initial position information parameters and the target position information parameters, the vector calculation unit uses a circular shift instruction and, according to the preset enable signal in the predicate register, splices each group of data in the converted arrangement and stores it into the corresponding position of the target memory to obtain the target data. For example, in some examples, the vshiftri instruction in the instruction set architecture (Vector ISA) of the vector calculation unit can be used to circularly shift the data at the source address right by several positions and then write it to the destination address according to the write-enable signal in the predicate register (Vector Predicate Register, VPR, also called the VP register); the aforementioned preset enable signal is, for example, the write-enable signal. In the splicing process of converting the data corresponding to one 7×7×3 sliding window into 1×1×147 data, the VP register to be used needs to be determined according to the second starting sequence number dst_row_start_lane and the second ending sequence number dst_row_end_lane. For the usage of the vshiftri instruction and the VP register, reference can be made to conventional designs, which will not be detailed here.
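A toy model of the rotate-then-masked-write step is sketched below (NumPy stand-ins for vshiftri and the VP register's write-enable bits; this is not the actual Vector ISA semantics):

```python
import numpy as np

def rotate_splice(dst_row, src_row, shift, write_enable):
    """Rotate the 64 source lanes right by `shift`, then overwrite only
    the destination lanes whose predicate (write-enable) bit is set."""
    rotated = np.roll(src_row, shift)
    return np.where(write_enable, rotated, dst_row)

dst = np.zeros(64)
src = np.arange(64.0)                      # a window row sits at lanes 0..20
mask = np.zeros(64, dtype=bool)
mask[42:63] = True                         # destination lanes 42..62
out = rotate_splice(dst, src, 42, mask)    # splice one window row into place
assert out[42] == src[0] and out[62] == src[20]
```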
In the above manner, the conversion from a two-dimensional tensor (2D Tensor) to a three-dimensional tensor (3D Tensor) is completed using the vector calculation unit.
After the processing of the above steps, the input data [1,224,224,3] is converted into the target data [1,112,112,147], and the operation convolution kernel determined from the initial convolution kernel [7,7,3,64] is [1,1,147,64], so that the number of channels increases from 3 to 147.
Returning to Figure 2, in step S30, a convolution operation is performed based on the target data and the operation convolution kernel to obtain the convolution result. The convolution operation result of the target data and the operation convolution kernel is equal to the convolution operation result of the input data and the initial convolution kernel. For example, step S30 may further include: performing the convolution operation on the target data and the operation convolution kernel using the matrix operation unit (Matrix).
For example, in some examples, taking the first-layer convolution of ResNet50 as an example, the operation between the input data and the initial convolution kernel that needs to be implemented is [1,224,224,3]×[7,7,3,64]=[1,112,112,64]; since there are only 3 channels, the computing power of the matrix operation unit is not fully utilized. With the convolution operation method provided by the embodiments of the present disclosure, the operation convolution kernel obtained from the initial convolution kernel [7,7,3,64] is [1,1,147,64], and the target data obtained after adjusting the arrangement of the input data [1,224,224,3] is [1,112,112,147]; therefore, the operation actually performed between the target data and the operation convolution kernel is [1,112,112,147]×[1,1,147,64]=[1,112,112,64], and this convolution operation result is consistent with the convolution operation result originally required. Since the number of channels increases to 147, the computing power of the matrix operation unit can be fully utilized, the utilization of the matrix operation unit can be improved, the time of the convolution operation can be shortened, and the operation efficiency can be improved. Moreover, since the input data does not need to be rearranged by the host CPU and the number of channels does not need to be expanded on the host, the occupied data space does not increase significantly and the amount of data transmitted from the host to the device does not increase; therefore, the Host2Device PCIe transmission time does not increase, and data transmission time can be saved.
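Because the operation convolution kernel is 1×1, the convolution performed by the matrix operation unit reduces to one large matrix multiplication over the channel dimension; the following NumPy sketch only illustrates the shape bookkeeping (it is not the accelerator implementation):

```python
import numpy as np

target = np.zeros((1, 112, 112, 147), dtype=np.float32)  # target data
w_op = np.zeros((1, 1, 147, 64), dtype=np.float32)       # operation kernel

out = target.reshape(-1, 147) @ w_op.reshape(147, 64)    # [112*112, 64]
out = out.reshape(1, 112, 112, 64)                       # [1,112,112,64]
```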
The convolution operation method provided by the embodiments of the present disclosure helps achieve the purpose of hardware acceleration and can accelerate the first-layer convolution calculation of a convolutional neural network (CNN), featuring small storage space, short transmission time, high utilization of hardware modules, and short computation time. For example, performing the first-layer convolution of ResNet50 with the usual convolution method takes 614,656 cycles, while the theoretical time for performing the first-layer convolution of ResNet50 with the convolution operation method provided by the embodiments of the present disclosure is 37,632 cycles, a reduction to 6.1% of the former, which greatly shortens the first-layer convolution computation time of a convolutional neural network (CNN).
It should be noted that, in the embodiments of the present disclosure, the convolution operation methods provided by the above embodiments of the present disclosure may include more or fewer operations, and these operations may be performed sequentially or in parallel. Although the flow of the convolution operation method described above includes multiple operations occurring in a specific order, it should be clearly understood that the order of the multiple operations is not limited. The convolution operation method described above may be performed once, or multiple times according to predetermined conditions.
It should be noted that the first-layer convolution of ResNet50 is used as an example above, but this does not constitute a limitation on the embodiments of the present disclosure. The convolution operation method provided by the embodiments of the present disclosure can be applied to any applicable convolution operation; the sizes and channel numbers of the various data and of the various convolution kernels can be determined according to actual requirements and are not limited to the specific values described above.
At least one embodiment of the present disclosure further provides a convolution operation device. The convolution operation device can improve the utilization of the matrix operation unit, effectively utilize the computing power of the matrix operation unit, shorten the time of the convolution operation, improve operation efficiency, and save data transmission time.
Figure 13 is a schematic block diagram of a convolution operation device provided by some embodiments of the present disclosure. As shown in Figure 13, in some embodiments, the convolution operation device 100 includes a determination unit 110, an adjustment unit 120, and a calculation unit 130.
The determination unit 110 is configured to determine the operation convolution kernel. For example, the operation convolution kernel is obtained based on the initial convolution kernel, the initial convolution kernel is expressed as [R,S,C,K], the operation convolution kernel is expressed as [1,1,(C×R×S),K], and R, S, C, K are all integers greater than 0. For example, the determination unit 110 can perform step S10 of the convolution operation method shown in Figure 2.
The adjustment unit 120 is configured to adjust the arrangement of the input data based on the number of channels of the operation convolution kernel to obtain the target data. For example, the size and number of channels of the target data are different from those of the input data, and the number of channels of the target data is equal to the number of channels of the operation convolution kernel. For example, the adjustment unit 120 can perform step S20 of the convolution operation method shown in Figure 2.
The calculation unit 130 is configured to perform a convolution operation based on the target data and the operation convolution kernel to obtain the convolution operation result. For example, the convolution operation result of the target data and the operation convolution kernel is equal to the convolution operation result of the input data and the initial convolution kernel. For example, the calculation unit 130 can perform step S30 of the convolution operation method shown in Figure 2.
For example, the determination unit 110, the adjustment unit 120, and the calculation unit 130 may be hardware, software, firmware, or any feasible combination thereof. For example, they may be dedicated or general-purpose circuits, chips, or devices, or a combination of a processor and a memory. The embodiments of the present disclosure do not limit the specific implementation forms of the determination unit 110, the adjustment unit 120, and the calculation unit 130.
It should be noted that, in the embodiments of the present disclosure, the units of the convolution operation device 100 correspond to the steps of the aforementioned convolution operation method; for the specific functions of the convolution operation device 100, reference can be made to the relevant description of the convolution operation method above, which will not be repeated here. The components and structure of the convolution operation device 100 shown in Figure 13 are only exemplary and not restrictive; the convolution operation device 100 may also include other components and structures as required.
At least one embodiment of the present disclosure further provides an electronic device. The electronic device can improve the utilization of the matrix operation unit, effectively utilize the computing power of the matrix operation unit, shorten the time of the convolution operation, improve operation efficiency, and save data transmission time.
图14为本公开一些实施例提供的一种电子设备的示意框图。如图14所示,电子设备200包括卷积运算装置210,卷积运算装置210可以为本公开任一实施例提供的卷积运算装置,例如为前述的卷积运算装置100。该电子设备200可以为任意的具有计算功能的设备,例如为服务器、终端设备、个人计算机等,本公开的实施例对此不作限制。
图15为本公开一些实施例提供的另一种电子设备的示意框图。如图15所示,该电子设备300包括处理器310和存储器320,可以用于实现客户端或服务器。存储器320用于非瞬时性地存储有计算机可执行指令(例如至少一个(一个或多个)计算机程序模块)。处理器310用于运行该计算机可执行指令,该计算机可执行指令被处理器310运行时可以执行上文所述的卷积运算方法中的一个或多个步骤,进而实现上文所述的卷积运算方法。存储器320和处理器310可以通过总线系统和/或其它形式的连接机构(未示出)互连。
例如,处理器310可以是中央处理单元(CPU)、图形处理单元(GPU)或者具有数据处理能力和/或程序执行能力的其它形式的处理单元。例如,中央处理单元(CPU)可以为X86或ARM架构等。处理器310可以为通用处理器或专用处理器,可以控制电子设备300中的其它组件以执行期望的功能。
例如,存储器320可以包括至少一个(例如一个或多个)计算机程序产品的任意组合,计算机程序产品可以包括各种形式的计算机可读存储介质,例如易失性存储器和/或非易失性存储器。易失性存储器例如可以包括随机存取存储器(RAM)和/或高速缓冲存储器(cache)等。非易失性存储器例如可以包括只读存储器(ROM)、硬盘、可擦除可编程只读存储器(EPROM)、便携式紧致盘只读存储器(CD-ROM)、USB存储器、闪存等。在计算机可读存储介质上可以存储至少一个(例如一个或多个)计算机程序模块,处理器310可以运行至少一个(例如一个或多个)计算机程序模块,以实现电子设备300的各种功能。在计算机可读存储介质中还可以存储各种应用程序和各种数据以及应用程序使用和/或产生的各种数据等。
需要说明的是,本公开的实施例中,电子设备300的具体功能和技术效果可以参考上文中关于卷积运算方法的描述,此处不再赘述。
图16为本公开一些实施例提供的另一种电子设备的示意框图。该电子 设备400例如适于用来实施本公开实施例提供的卷积运算方法。电子设备400可以是终端设备等,可以用于实现客户端或服务器。电子设备400可以包括但不限于诸如移动电话、笔记本电脑、数字广播接收器、PDA(个人数字助理)、PAD(平板电脑)、PMP(便携式多媒体播放器)、车载终端(例如车载导航终端)、可穿戴电子设备等等的移动终端以及诸如数字TV、台式计算机、智能家居设备等等的固定终端。需要注意的是,图16示出的电子设备400仅仅是一个示例,其不会对本公开实施例的功能和使用范围带来任何限制。
如图16所示,电子设备400可以包括处理装置(例如中央处理器、图形处理器等)410,其可以根据存储在只读存储器(ROM)420中的程序或者从存储装置480加载到随机访问存储器(RAM)430中的程序而执行各种适当的动作和处理。在RAM 430中,还存储有电子设备400操作所需的各种程序和数据。处理装置410、ROM 420以及RAM 430通过总线440彼此相连。输入/输出(I/O)接口450也连接至总线440。
通常,以下装置可以连接至I/O接口450:包括例如触摸屏、触摸板、键盘、鼠标、摄像头、麦克风、加速度计、陀螺仪等的输入装置460;包括例如液晶显示器(LCD)、扬声器、振动器等的输出装置470;包括例如磁带、硬盘等的存储装置480;以及通信装置490。通信装置490可以允许电子设备400与其他电子设备进行无线或有线通信以交换数据。虽然图16示出了具有各种装置的电子设备400,但应理解的是,并不要求实施或具备所有示出的装置,电子设备400可以替代地实施或具备更多或更少的装置。
例如,根据本公开的实施例,上述卷积运算方法可以被实现为计算机软件程序。例如,本公开的实施例包括一种计算机程序产品,其包括承载在非暂态计算机可读介质上的计算机程序,该计算机程序包括用于执行上述卷积运算方法的程序代码。在这样的实施例中,该计算机程序可以通过通信装置490从网络上被下载和安装,或者从存储装置480安装,或者从ROM 420安装。在该计算机程序被处理装置410执行时,可以实现本公开实施例提供的卷积运算方法中限定的功能。
本公开至少一个实施例还提供一种存储介质。利用该存储介质,可以提高矩阵运算单元的利用率,有效利用矩阵运算单元的算力,缩短卷积运算的时间,提高运算效率,并且可以节省数据传输时间。
图17为本公开一些实施例提供的一种存储介质的示意图。例如,如图17所示,存储介质500可以为非暂时性计算机可读存储介质,存储有非暂时性计算机可读指令510。当非暂时性计算机可读指令510由处理器执行时可以实现本公开实施例所述的卷积运算方法,例如,当非暂时性计算机可读指令510由处理器执行时,可以执行根据上文所述的卷积运算方法中的一个或多个步骤。
例如,该存储介质500可以应用于上述电子设备中,例如,该存储介质500可以包括电子设备300中的存储器320。
例如,存储介质可以包括智能电话的存储卡、平板电脑的存储部件、个人计算机的硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦除可编程只读存储器(EPROM)、便携式紧致盘只读存储器(CD-ROM)、闪存、或者上述存储介质的任意组合,也可以为其他适用的存储介质。
例如,关于存储介质500的说明可以参考电子设备的实施例中对于存储器的描述,重复之处不再赘述。存储介质500的具体功能和技术效果可以参考上文中关于卷积运算方法的描述,此处不再赘述。
在上文中,结合图1至图17描述了本公开实施例提供的卷积运算方法、卷积运算装置、电子设备及存储介质。本公开实施例提供的卷积运算方法可以用于卷积神经网络的首层卷积运算,通过调整数据的排布方式增大通道数,使具有较多通道的目标数据与具有较多通道的运算卷积核进行卷积运算,从而可以提高矩阵运算单元的利用率,有效利用矩阵运算单元的算力,缩短卷积运算的时间,提高运算效率,并且可以节省数据传输时间。
需要说明的是,在本公开的上下文中,计算机可读介质可以是有形的介质,其可以包含或存储以供指令执行系统、装置或设备使用或与指令执行系统、装置或设备结合地使用的程序。计算机可读介质可以是计算机可读信号介质或者计算机可读存储介质或者是上述两者的任意组合。计算机可读存储介质例如可以是,但不限于:电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。计算机可读存储介质的更具体的例子可以包括但不限于:具有一个或多个导线的电连接、便携式计算机磁盘、硬盘、随机访问存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑磁盘只读存储器(CD-ROM)、光存 储器件、磁存储器件、或者上述的任意合适的组合。在本公开中,计算机可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。而在本公开中,计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了计算机可读的程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。计算机可读信号介质还可以是计算机可读存储介质以外的任何计算机可读介质,该计算机可读信号介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。计算机可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于:电线、光缆、RF(射频)等等,或者上述的任意合适的组合。
在一些实施方式中,客户端、服务器可以利用诸如HTTP(HyperText Transfer Protocol,超文本传输协议)之类的任何当前已知或未来研发的网络协议进行通信,并且可以与任意形式或介质的数字数据通信(例如,通信网络)互连。通信网络的示例包括局域网(“LAN”),广域网(“WAN”),网际网(例如,互联网)以及端对端网络(例如,ad hoc端对端网络),以及任何当前已知或未来研发的网络。
上述计算机可读介质可以是上述电子设备中所包含的;也可以是单独存在,而未装配入该电子设备中。
可以以一种或多种程序设计语言或其组合来编写用于执行本公开的操作的计算机程序代码,上述程序设计语言包括但不限于面向对象的程序设计语言,诸如Java、Smalltalk、C++,还包括常规的过程式程序设计语言,诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算机上执行、部分地在用户计算机上执行、作为一个独立的软件包执行、部分在用户计算机上部分在远程计算机上执行、或者完全在远程计算机或服务器上执行。在涉及远程计算机的情形中,远程计算机可以通过任意种类的网络(,包括局域网(LAN)或广域网(WAN))连接到用户计算机,或者,可以连接到外部计算机(例如利用因特网服务提供商来通过因特网连接)。
附图中的流程图和框图,图示了按照本公开各种实施例的系统、方法和计算机程序产品的可能实现的体系架构、功能和操作。在这点上,流程图或 框图中的每个方框可以代表一个模块、程序段、或代码的一部分,该模块、程序段、或代码的一部分包含一个或多个用于实现规定的逻辑功能的可执行指令。也应当注意,在有些作为替换的实现中,方框中所标注的功能也可以以不同于附图中所标注的顺序发生。例如,两个接连地表示的方框实际上可以基本并行地执行,它们有时也可以按相反的顺序执行,这依所涉及的功能而定。也要注意的是,框图和/或流程图中的每个方框、以及框图和/或流程图中的方框的组合,可以用执行规定的功能或操作的专用的基于硬件的系统来实现,或者可以用专用硬件与计算机指令的组合来实现。
描述于本公开实施例中所涉及到的单元可以通过软件的方式实现,也可以通过硬件的方式来实现。其中,单元的名称在某种情况下并不构成对该单元本身的限定。
本文中以上描述的功能可以至少部分地由一个或多个硬件逻辑部件来执行。例如,非限制性地,可以使用的示范类型的硬件逻辑部件包括:现场可编程门阵列(FPGA)、专用集成电路(ASIC)、专用标准产品(ASSP)、片上系统(SOC)、复杂可编程逻辑设备(CPLD)等等。
根据本公开的一个或多个实施例,一种卷积运算方法包括:确定运算卷积核,其中,所述运算卷积核基于初始卷积核得到,所述初始卷积核表示为[R,S,C,K],所述运算卷积核表示为[1,1,(C×R×S),K],R、S、C、K均为大于0的整数;基于所述运算卷积核的通道数,调整输入数据的排布方式,得到目标数据,其中,所述目标数据的尺寸和通道数不同于所述输入数据的尺寸和通道数,所述目标数据的通道数等于所述运算卷积核的通道数;基于所述目标数据和所述运算卷积核进行卷积运算,以得到卷积运算结果,其中,所述目标数据与所述运算卷积核的卷积运算结果等于所述输入数据与所述初始卷积核的卷积运算结果。
根据本公开的一个或多个实施例,所述目标数据的通道数大于所述输入数据的通道数,所述运算卷积核的通道数大于所述初始卷积核的通道数。
根据本公开的一个或多个实施例,基于所述运算卷积核的通道数,调整所述输入数据的排布方式,得到目标数据,包括:将所述输入数据以行为单位存储在静态存储器中,其中,所述输入数据的每一行存储在所述静态存储器中对应的N个存储行中,N为大于0的整数;对所述静态存储器中存储的 所述输入数据执行填充操作,得到扩充数据;调整所述扩充数据的排布方式以改变所述扩充数据的尺寸和通道数,得到所述目标数据。
根据本公开的一个或多个实施例,将所述输入数据以行为单位存储在所述静态存储器中,包括:采用紧密排列的方式将所述输入数据存储在内存中,其中,所述输入数据包括多个通道,所述紧密排列的方式指同一个数据点的多个通道在所述内存中依序相邻存储;采用直接存储访问的方式将所述内存中的所述输入数据传输至硬件加速器的所述静态存储器中,并且将所述输入数据的各行的第一个数据点存储在所述静态存储器的不同行的第一列,使得所述输入数据的每一行存储在所述静态存储器中对应的N个存储行中。
根据本公开的一个或多个实施例,对所述静态存储器中存储的所述输入数据执行所述填充操作,得到所述扩充数据,包括:在所述静态存储器中,在对应于所述输入数据的存储位置之前和之后的存储行填充第一预设值,得到第一中间数据,其中,所述第一中间数据包括所述输入数据和填充的第一预设值;将所述第一中间数据传输至矢量计算单元,利用所述矢量计算单元的移位指令和填充指令对所述第一中间数据对应的每一行的两端填充第二预设值,得到第二中间数据,其中,所述第二中间数据包括所述第一中间数据和填充的第二预设值;将所述第二中间数据传输至所述静态存储器中对应的存储位置,得到所述扩充数据,其中,所述扩充数据与所述第二中间数据的内容相同。
根据本公开的一个或多个实施例,所述目标数据表示为[1,ht,wt,(C×R×S)],ht、wt均为大于0的整数;调整所述扩充数据的排布方式以改变所述扩充数据的尺寸和通道数,得到所述目标数据,包括:逐次读取所述静态存储器中的R*N个存储行中的数据并传输至所述矢量计算单元,其中,每次读取的起始地址按照预设跳步str移动str*N个存储行,所述预设跳步是将所述输入数据与所述初始卷积核进行卷积运算所需的滑动窗在行方向和列方向上的跳步,从所述静态存储器中读取数据的次数总和等于ht;所述矢量计算单元将每次接收的R*N个存储行中的数据转换为wt*ceil((C×R×S)/L)个存储行的数据,得到所述目标数据,其中,L表示所述静态存储器中每个存储行能够存储的数据点的数量,ceil((C×R×S)/L)表示对(C×R×S)/L进行向上取整,转换后的数据为所述目标数据。
根据本公开的一个或多个实施例,所述矢量计算单元将每次接收的R*N个存储行中的数据转换为wt*ceil((C×R×S)/L)个存储行的数据,得到所述目标数据,包括:将R*N个存储行中的数据按照所述预设跳步划分为多组数据,其中,每组数据对应于一个行方向上的滑动窗,所述多组数据的组的数量等于wt;对于每组数据,确定该组数据所对应的滑动窗的每一行数据的初始位置信息参数和目标位置信息参数;所述矢量计算单元根据所述初始位置信息参数和所述目标位置信息参数,将每组数据以转换后的排布方式存储到目标存储器的对应位置,以得到所述目标数据,其中,所述目标存储器以行为单位进行存储,传输至所述目标存储器并存储在所述目标存储器上的数据为所述目标数据。
According to one or more embodiments of the present disclosure, the initial position information parameters include a first start boundary coordinate, a first end boundary coordinate, a first start address, a first end address, a first start index, and a first end index; the first start boundary coordinate represents the relative coordinate of the start boundary of the corresponding sliding window in the row direction of the expanded data, the first end boundary coordinate represents the relative coordinate of the end boundary of the corresponding sliding window in the row direction of the expanded data, and the start boundary of the corresponding sliding window and the end boundary of the corresponding sliding window are located at different positions in the row direction of the expanded data; the first start address represents the address of the first start boundary coordinate in the memory of the vector computing unit, and the first end address represents the address of the first end boundary coordinate in the memory of the vector computing unit; the first start index represents the index of the data point corresponding to the first start boundary coordinate at the first start address, and the first end index represents the index of the data point corresponding to the first end boundary coordinate at the first end address.
According to one or more embodiments of the present disclosure, the first start boundary coordinate is calculated as: src_row_start_index=i*str*ch, where src_row_start_index represents the first start boundary coordinate, i represents the index of the data point corresponding to the sliding window within the dimension wt of the target data, str represents the stride of the sliding window in the row direction, and ch represents the number of channels of the input data; the first end boundary coordinate is calculated as: src_row_end_index=src_row_start_index+(kernel_w*ch-1), where src_row_end_index represents the first end boundary coordinate, kernel_w represents the width of the sliding window, and the size of the sliding window is equal to the size of the initial convolution kernel; the first start address is calculated as: src_row_start_address=src_row_start_index/vmem_lane+j*N, where src_row_start_address represents the first start address, vmem_lane represents the number of data points that each storage row of the memory of the vector computing unit can store, and j represents the row index of the corresponding data within the sliding window; the first end address is calculated as: src_row_end_address=src_row_end_index/vmem_lane+j*N, where src_row_end_address represents the first end address; the first start index is calculated as: src_row_start_lane=src_row_start_index%vmem_lane, where src_row_start_lane represents the first start index and % represents the modulo operation; the first end index is calculated as: src_row_end_lane=src_row_end_index%vmem_lane, where src_row_end_lane represents the first end index.
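Transcribed directly into Python, the source-side parameters look as follows; the only liberty taken is reading the address divisions as integer (floor) division, which the lane/modulo formulas imply. The function name and the example values are ours.

```python
# Source-position parameters for window i, window row j (floor division assumed).
def src_position(i, j, str_, ch, kernel_w, vmem_lane, N):
    start_index = i * str_ * ch                      # src_row_start_index
    end_index = start_index + kernel_w * ch - 1      # src_row_end_index
    return {
        "start_address": start_index // vmem_lane + j * N,
        "end_address": end_index // vmem_lane + j * N,
        "start_lane": start_index % vmem_lane,
        "end_lane": end_index % vmem_lane,
    }

# e.g. second window (i=1), first kernel row (j=0), stride 2, 3 channels,
# kernel width 7, 128 lanes per storage row, N=6:
print(src_position(1, 0, 2, 3, 7, 128, 6))
# {'start_address': 0, 'end_address': 0, 'start_lane': 6, 'end_lane': 26}
```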
According to one or more embodiments of the present disclosure, the target position information parameters include a second start boundary coordinate, a second end boundary coordinate, a second start address, a second end address, a second start index, and a second end index; the second start boundary coordinate represents the relative coordinate of the start boundary of the corresponding sliding window within the data size of [1,1,(C×R×S)], the second end boundary coordinate represents the relative coordinate of the end boundary of the corresponding sliding window within the data size of [1,1,(C×R×S)], and the start boundary of the corresponding sliding window and the end boundary of the corresponding sliding window are located at different positions in the row direction of the expanded data; the second start address represents the address of the second start boundary coordinate in the memory of the vector computing unit, and the second end address represents the address of the second end boundary coordinate in the memory of the vector computing unit; the second start index represents the index of the data point corresponding to the second start boundary coordinate at the second start address, and the second end index represents the index of the data point corresponding to the second end boundary coordinate at the second end address.
According to one or more embodiments of the present disclosure, the second start boundary coordinate is calculated as: dst_row_start_index=j*kernel_w*ch, where dst_row_start_index represents the second start boundary coordinate, j represents the row index of the corresponding data within the sliding window, kernel_w represents the width of the sliding window, the size of the sliding window is equal to the size of the initial convolution kernel, and ch represents the number of channels of the input data; the second end boundary coordinate is calculated as: dst_row_end_index=dst_row_start_index+(kernel_w*ch-1), where dst_row_end_index represents the second end boundary coordinate; the second start address is calculated as: dst_row_start_address=dst_row_start_index/vmem_lane, where dst_row_start_address represents the second start address, and vmem_lane represents the number of data points that each storage row of the memory of the vector computing unit can store; the second end address is calculated as: dst_row_end_address=dst_row_end_index/vmem_lane, where dst_row_end_address represents the second end address; the second start index is calculated as: dst_row_start_lane=dst_row_start_index%vmem_lane, where dst_row_start_lane represents the second start index and % represents the modulo operation; the second end index is calculated as: dst_row_end_lane=dst_row_end_index%vmem_lane, where dst_row_end_lane represents the second end index.
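The destination-side parameters admit the same transcription; again, integer division is assumed for the address formulas, and the function name and values are ours. Successive window rows j land back to back in the [1,1,(C×R×S)] layout.

```python
# Destination-position parameters for window row j (floor division assumed).
def dst_position(j, ch, kernel_w, vmem_lane):
    start_index = j * kernel_w * ch                  # dst_row_start_index
    end_index = start_index + kernel_w * ch - 1      # dst_row_end_index
    return {
        "start_address": start_index // vmem_lane,
        "end_address": end_index // vmem_lane,
        "start_lane": start_index % vmem_lane,
        "end_lane": end_index % vmem_lane,
    }

# Rows j=0..2 of a 7-wide, 3-channel window occupy offsets 0, 21, 42, ...
for j in range(3):
    print(dst_position(j, 3, 7, 128))
```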
According to one or more embodiments of the present disclosure, the vector computing unit storing, according to the initial position information parameter and the target position information parameter, each group of data in the converted arrangement at the corresponding location of the target memory to obtain the target data includes: according to the initial position information parameter and the target position information parameter, the vector computing unit concatenating each group of data in the converted arrangement using cyclic shift instructions and according to preset enable signals in a predicate register, and storing the result at the corresponding location of the target memory to obtain the target data.
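The cyclic-shift-plus-predicate splicing can be emulated in NumPy as below. This is a toy model of the idea, not the accelerator's instruction set, and all lane positions are assumed values: the source row is rotated so the window start lines up with its destination lane, and a boolean mask plays the role of the enable signals in the predicate register.

```python
import numpy as np

# Toy emulation of predicated cyclic-shift splicing (assumptions throughout).
vmem_lane = 8
src = np.arange(8)                         # one source storage row
dst = np.full(8, -1)                       # destination storage row
src_lane, dst_lane, count = 5, 0, 3        # copy src[5:8] to dst[0:3]
rotated = np.roll(src, dst_lane - src_lane)        # the cyclic shift instruction
predicate = np.zeros(8, bool)
predicate[dst_lane:dst_lane + count] = True        # preset enable signals
dst[predicate] = rotated[predicate]                # enable-masked write
print(dst)                                 # [ 5  6  7 -1 -1 -1 -1 -1]
```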
According to one or more embodiments of the present disclosure, performing the convolution operation based on the target data and the operation convolution kernel includes: performing the convolution operation on the target data and the operation convolution kernel using a matrix operation unit.
According to one or more embodiments of the present disclosure, the convolution operation method is used for the first-layer convolution operation of a convolutional neural network.
According to one or more embodiments of the present disclosure, a convolution operation device includes: a determination unit configured to determine an operation convolution kernel, wherein the operation convolution kernel is obtained based on an initial convolution kernel, the initial convolution kernel is expressed as [R,S,C,K], the operation convolution kernel is expressed as [1,1,(C×R×S),K], and R, S, C, and K are all integers greater than 0; an adjustment unit configured to adjust the arrangement of input data based on the number of channels of the operation convolution kernel to obtain target data, wherein the size and the number of channels of the target data are different from the size and the number of channels of the input data, and the number of channels of the target data is equal to the number of channels of the operation convolution kernel; and a computing unit configured to perform a convolution operation based on the target data and the operation convolution kernel to obtain a convolution operation result, wherein the convolution operation result of the target data and the operation convolution kernel is equal to the convolution operation result of the input data and the initial convolution kernel.
According to one or more embodiments of the present disclosure, the number of channels of the target data is greater than the number of channels of the input data, and the number of channels of the operation convolution kernel is greater than the number of channels of the initial convolution kernel.
According to one or more embodiments of the present disclosure, the adjustment unit includes a first adjustment subunit, a second adjustment subunit, and a third adjustment subunit. The first adjustment subunit is configured to store the input data row by row in a static memory, wherein each row of the input data is stored in N corresponding storage rows of the static memory, and N is an integer greater than 0. The second adjustment subunit is configured to perform a padding operation on the input data stored in the static memory to obtain expanded data. The third adjustment subunit is configured to adjust the arrangement of the expanded data to change the size and the number of channels of the expanded data, thereby obtaining the target data.
According to one or more embodiments of the present disclosure, the first adjustment subunit includes a first storage unit and a second storage unit. The first storage unit is configured to store the input data in memory in a tightly packed manner, wherein the input data includes a plurality of channels, and the tightly packed manner means that the plurality of channels of the same data point are stored adjacently and in sequence in the memory. The second storage unit is configured to transfer the input data in the memory to the static memory of a hardware accelerator by direct memory access, and to store the first data point of each row of the input data in the first column of a different row of the static memory, so that each row of the input data is stored in N corresponding storage rows of the static memory.
According to one or more embodiments of the present disclosure, the second adjustment subunit includes a first padding unit, a second padding unit, and a third padding unit. The first padding unit is configured to fill, in the static memory, the storage rows before and after the storage location corresponding to the input data with a first preset value to obtain first intermediate data, wherein the first intermediate data includes the input data and the padded first preset value. The second padding unit is configured to transfer the first intermediate data to a vector computing unit, and to pad both ends of each row corresponding to the first intermediate data with a second preset value using shift instructions and fill instructions of the vector computing unit to obtain second intermediate data, wherein the second intermediate data includes the first intermediate data and the padded second preset value. The third padding unit is configured to transfer the second intermediate data to the corresponding storage location in the static memory to obtain the expanded data, wherein the content of the expanded data is the same as that of the second intermediate data.
According to one or more embodiments of the present disclosure, the target data is expressed as [1,ht,wt,(C×R×S)], where ht and wt are both integers greater than 0. The third adjustment subunit includes a first change unit and a second change unit. The first change unit is configured to read the data in R*N storage rows of the static memory batch by batch and transfer the data to the vector computing unit, wherein the start address of each read moves by str*N storage rows according to a preset stride str, the preset stride is the stride, in the row direction and the column direction, of the sliding window required for the convolution operation of the input data and the initial convolution kernel, and the total number of reads from the static memory is equal to ht. The second change unit is configured to convert, using the vector computing unit, the data in the R*N storage rows received each time into data of wt*ceil((C×R×S)/L) storage rows to obtain the target data, wherein L represents the number of data points that each storage row of the static memory can store, ceil((C×R×S)/L) represents rounding (C×R×S)/L up to an integer, and the converted data is the target data.
According to one or more embodiments of the present disclosure, the second change unit includes a grouping unit, a parameter determination unit, and a vector computing unit. The grouping unit is configured to divide the data in the R*N storage rows into a plurality of groups of data according to the preset stride, wherein each group of data corresponds to one sliding window in the row direction, and the number of groups of the plurality of groups of data is equal to wt. The parameter determination unit is configured to determine, for each group of data, an initial position information parameter and a target position information parameter of each row of data of the sliding window corresponding to the group of data. The vector computing unit is configured to store, according to the initial position information parameter and the target position information parameter, each group of data in the converted arrangement at the corresponding location of a target memory to obtain the target data, wherein the target memory performs storage row by row, and the data transferred to and stored in the target memory is the target data.
According to one or more embodiments of the present disclosure, the initial position information parameters include a first start boundary coordinate, a first end boundary coordinate, a first start address, a first end address, a first start index, and a first end index; the first start boundary coordinate represents the relative coordinate of the start boundary of the corresponding sliding window in the row direction of the expanded data, the first end boundary coordinate represents the relative coordinate of the end boundary of the corresponding sliding window in the row direction of the expanded data, and the start boundary of the corresponding sliding window and the end boundary of the corresponding sliding window are located at different positions in the row direction of the expanded data; the first start address represents the address of the first start boundary coordinate in the memory of the vector computing unit, and the first end address represents the address of the first end boundary coordinate in the memory of the vector computing unit; the first start index represents the index of the data point corresponding to the first start boundary coordinate at the first start address, and the first end index represents the index of the data point corresponding to the first end boundary coordinate at the first end address.
According to one or more embodiments of the present disclosure, the first start boundary coordinate is calculated as: src_row_start_index=i*str*ch, where src_row_start_index represents the first start boundary coordinate, i represents the index of the data point corresponding to the sliding window within the dimension wt of the target data, str represents the stride of the sliding window in the row direction, and ch represents the number of channels of the input data; the first end boundary coordinate is calculated as: src_row_end_index=src_row_start_index+(kernel_w*ch-1), where src_row_end_index represents the first end boundary coordinate, kernel_w represents the width of the sliding window, and the size of the sliding window is equal to the size of the initial convolution kernel; the first start address is calculated as: src_row_start_address=src_row_start_index/vmem_lane+j*N, where src_row_start_address represents the first start address, vmem_lane represents the number of data points that each storage row of the memory of the vector computing unit can store, and j represents the row index of the corresponding data within the sliding window; the first end address is calculated as: src_row_end_address=src_row_end_index/vmem_lane+j*N, where src_row_end_address represents the first end address; the first start index is calculated as: src_row_start_lane=src_row_start_index%vmem_lane, where src_row_start_lane represents the first start index and % represents the modulo operation; the first end index is calculated as: src_row_end_lane=src_row_end_index%vmem_lane, where src_row_end_lane represents the first end index.
According to one or more embodiments of the present disclosure, the target position information parameters include a second start boundary coordinate, a second end boundary coordinate, a second start address, a second end address, a second start index, and a second end index; the second start boundary coordinate represents the relative coordinate of the start boundary of the corresponding sliding window within the data size of [1,1,(C×R×S)], the second end boundary coordinate represents the relative coordinate of the end boundary of the corresponding sliding window within the data size of [1,1,(C×R×S)], and the start boundary of the corresponding sliding window and the end boundary of the corresponding sliding window are located at different positions in the row direction of the expanded data; the second start address represents the address of the second start boundary coordinate in the memory of the vector computing unit, and the second end address represents the address of the second end boundary coordinate in the memory of the vector computing unit; the second start index represents the index of the data point corresponding to the second start boundary coordinate at the second start address, and the second end index represents the index of the data point corresponding to the second end boundary coordinate at the second end address.
According to one or more embodiments of the present disclosure, the second start boundary coordinate is calculated as: dst_row_start_index=j*kernel_w*ch, where dst_row_start_index represents the second start boundary coordinate, j represents the row index of the corresponding data within the sliding window, kernel_w represents the width of the sliding window, the size of the sliding window is equal to the size of the initial convolution kernel, and ch represents the number of channels of the input data; the second end boundary coordinate is calculated as: dst_row_end_index=dst_row_start_index+(kernel_w*ch-1), where dst_row_end_index represents the second end boundary coordinate; the second start address is calculated as: dst_row_start_address=dst_row_start_index/vmem_lane, where dst_row_start_address represents the second start address, and vmem_lane represents the number of data points that each storage row of the memory of the vector computing unit can store; the second end address is calculated as: dst_row_end_address=dst_row_end_index/vmem_lane, where dst_row_end_address represents the second end address; the second start index is calculated as: dst_row_start_lane=dst_row_start_index%vmem_lane, where dst_row_start_lane represents the second start index and % represents the modulo operation; the second end index is calculated as: dst_row_end_lane=dst_row_end_index%vmem_lane, where dst_row_end_lane represents the second end index.
According to one or more embodiments of the present disclosure, the vector computing unit is further configured to concatenate, according to the initial position information parameter and the target position information parameter, each group of data in the converted arrangement using cyclic shift instructions and according to preset enable signals in a predicate register, and to store the result at the corresponding location of the target memory to obtain the target data.
According to one or more embodiments of the present disclosure, the computing unit includes a computing subunit, and the computing subunit is configured to perform the convolution operation on the target data and the operation convolution kernel using a matrix operation unit.
According to one or more embodiments of the present disclosure, the convolution operation device is used for the first-layer convolution operation of a convolutional neural network.
According to one or more embodiments of the present disclosure, an electronic device includes the convolution operation device provided by any embodiment of the present disclosure.
According to one or more embodiments of the present disclosure, an electronic device includes: a processor; and a memory including at least one computer program module; wherein the at least one computer program module is stored in the memory and configured to be executed by the processor, and the at least one computer program module is used to implement the convolution operation method provided by any embodiment of the present disclosure.
According to one or more embodiments of the present disclosure, a storage medium stores non-transitory computer-readable instructions, and when the non-transitory computer-readable instructions are executed by a computer, the convolution operation method provided by any embodiment of the present disclosure is implemented.
The above description is merely a description of the preferred embodiments of the present disclosure and of the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combinations of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with technical features having similar functions disclosed in (but not limited to) the present disclosure.
Furthermore, although the operations are depicted in a particular order, this should not be understood as requiring that these operations be performed in the particular order shown or in sequential order. In certain circumstances, multitasking and parallel processing may be advantageous. Likewise, although several specific implementation details are included in the above discussion, these should not be construed as limiting the scope of the present disclosure. Certain features described in the context of separate embodiments may also be implemented in combination in a single embodiment. Conversely, various features described in the context of a single embodiment may also be implemented in multiple embodiments separately or in any suitable sub-combination.
Although the subject matter has been described in language specific to structural features and/or methodological logical actions, it should be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or actions described above. Rather, the specific features and actions described above are merely example forms of implementing the claims.
With respect to the present disclosure, the following points also need to be noted:
(1) The drawings of the embodiments of the present disclosure relate only to the structures involved in the embodiments of the present disclosure; for other structures, reference may be made to common designs.
(2) Without conflict, the embodiments of the present disclosure and the features in the embodiments may be combined with each other to obtain new embodiments.
The above are merely specific implementations of the present disclosure, but the protection scope of the present disclosure is not limited thereto; the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (18)

  1. A convolution operation method, comprising:
    determining an operation convolution kernel, wherein the operation convolution kernel is obtained based on an initial convolution kernel, the initial convolution kernel is expressed as [R,S,C,K], the operation convolution kernel is expressed as [1,1,(C×R×S),K], and R, S, C, and K are all integers greater than 0;
    adjusting the arrangement of input data based on the number of channels of the operation convolution kernel to obtain target data, wherein the size and the number of channels of the target data are different from the size and the number of channels of the input data, and the number of channels of the target data is equal to the number of channels of the operation convolution kernel; and
    performing a convolution operation based on the target data and the operation convolution kernel to obtain a convolution operation result, wherein the convolution operation result of the target data and the operation convolution kernel is equal to the convolution operation result of the input data and the initial convolution kernel.
  2. The convolution operation method according to claim 1, wherein the number of channels of the target data is greater than the number of channels of the input data, and the number of channels of the operation convolution kernel is greater than the number of channels of the initial convolution kernel.
  3. The convolution operation method according to claim 1 or 2, wherein adjusting the arrangement of the input data based on the number of channels of the operation convolution kernel to obtain the target data comprises:
    storing the input data row by row in a static memory, wherein each row of the input data is stored in N corresponding storage rows of the static memory, and N is an integer greater than 0;
    performing a padding operation on the input data stored in the static memory to obtain expanded data; and
    adjusting the arrangement of the expanded data to change the size and the number of channels of the expanded data, thereby obtaining the target data.
  4. The convolution operation method according to claim 3, wherein storing the input data row by row in the static memory comprises:
    storing the input data in memory in a tightly packed manner, wherein the input data comprises a plurality of channels, and the tightly packed manner means that the plurality of channels of the same data point are stored adjacently and in sequence in the memory; and
    transferring the input data in the memory to the static memory of a hardware accelerator by direct memory access, and storing the first data point of each row of the input data in the first column of a different row of the static memory, so that each row of the input data is stored in N corresponding storage rows of the static memory.
  5. The convolution operation method according to claim 3, wherein performing the padding operation on the input data stored in the static memory to obtain the expanded data comprises:
    in the static memory, filling the storage rows before and after the storage location corresponding to the input data with a first preset value to obtain first intermediate data, wherein the first intermediate data comprises the input data and the padded first preset value;
    transferring the first intermediate data to a vector computing unit, and padding both ends of each row corresponding to the first intermediate data with a second preset value using shift instructions and fill instructions of the vector computing unit to obtain second intermediate data, wherein the second intermediate data comprises the first intermediate data and the padded second preset value; and
    transferring the second intermediate data to the corresponding storage location in the static memory to obtain the expanded data, wherein the content of the expanded data is the same as that of the second intermediate data.
  6. The convolution operation method according to claim 5, wherein the target data is expressed as [1,ht,wt,(C×R×S)], and ht and wt are both integers greater than 0;
    adjusting the arrangement of the expanded data to change the size and the number of channels of the expanded data, thereby obtaining the target data, comprises:
    reading the data in R*N storage rows of the static memory batch by batch and transferring the data to the vector computing unit, wherein the start address of each read moves by str*N storage rows according to a preset stride str, the preset stride is the stride, in the row direction and the column direction, of the sliding window required for the convolution operation of the input data and the initial convolution kernel, and the total number of reads from the static memory is equal to ht; and
    the vector computing unit converting the data in the R*N storage rows received each time into data of wt*ceil((C×R×S)/L) storage rows to obtain the target data, wherein L represents the number of data points that each storage row of the static memory can store, ceil((C×R×S)/L) represents rounding (C×R×S)/L up to an integer, and the converted data is the target data.
  7. The convolution operation method according to claim 6, wherein the vector computing unit converting the data in the R*N storage rows received each time into data of wt*ceil((C×R×S)/L) storage rows to obtain the target data comprises:
    dividing the data in the R*N storage rows into a plurality of groups of data according to the preset stride, wherein each group of data corresponds to one sliding window in the row direction, and the number of groups of the plurality of groups of data is equal to wt;
    for each group of data, determining an initial position information parameter and a target position information parameter of each row of data of the sliding window corresponding to the group of data; and
    the vector computing unit storing, according to the initial position information parameter and the target position information parameter, each group of data in the converted arrangement at the corresponding location of a target memory to obtain the target data, wherein the target memory performs storage row by row, and the data transferred to and stored in the target memory is the target data.
  8. The convolution operation method according to claim 7, wherein the initial position information parameters comprise a first start boundary coordinate, a first end boundary coordinate, a first start address, a first end address, a first start index, and a first end index;
    the first start boundary coordinate represents the relative coordinate of the start boundary of the corresponding sliding window in the row direction of the expanded data, the first end boundary coordinate represents the relative coordinate of the end boundary of the corresponding sliding window in the row direction of the expanded data, and the start boundary of the corresponding sliding window and the end boundary of the corresponding sliding window are located at different positions in the row direction of the expanded data;
    the first start address represents the address of the first start boundary coordinate in the memory of the vector computing unit, and the first end address represents the address of the first end boundary coordinate in the memory of the vector computing unit; and
    the first start index represents the index of the data point corresponding to the first start boundary coordinate at the first start address, and the first end index represents the index of the data point corresponding to the first end boundary coordinate at the first end address.
  9. The convolution operation method according to claim 8, wherein
    the first start boundary coordinate is calculated as: src_row_start_index=i*str*ch, where src_row_start_index represents the first start boundary coordinate, i represents the index of the data point corresponding to the sliding window within the dimension wt of the target data, str represents the stride of the sliding window in the row direction, and ch represents the number of channels of the input data;
    the first end boundary coordinate is calculated as: src_row_end_index=src_row_start_index+(kernel_w*ch-1), where src_row_end_index represents the first end boundary coordinate, kernel_w represents the width of the sliding window, and the size of the sliding window is equal to the size of the initial convolution kernel;
    the first start address is calculated as: src_row_start_address=src_row_start_index/vmem_lane+j*N, where src_row_start_address represents the first start address, vmem_lane represents the number of data points that each storage row of the memory of the vector computing unit can store, and j represents the row index of the corresponding data within the sliding window;
    the first end address is calculated as: src_row_end_address=src_row_end_index/vmem_lane+j*N, where src_row_end_address represents the first end address;
    the first start index is calculated as: src_row_start_lane=src_row_start_index%vmem_lane, where src_row_start_lane represents the first start index and % represents the modulo operation; and
    the first end index is calculated as: src_row_end_lane=src_row_end_index%vmem_lane, where src_row_end_lane represents the first end index.
  10. The convolution operation method according to claim 7, wherein the target position information parameters comprise a second start boundary coordinate, a second end boundary coordinate, a second start address, a second end address, a second start index, and a second end index;
    the second start boundary coordinate represents the relative coordinate of the start boundary of the corresponding sliding window within the data size of [1,1,(C×R×S)], the second end boundary coordinate represents the relative coordinate of the end boundary of the corresponding sliding window within the data size of [1,1,(C×R×S)], and the start boundary of the corresponding sliding window and the end boundary of the corresponding sliding window are located at different positions in the row direction of the expanded data;
    the second start address represents the address of the second start boundary coordinate in the memory of the vector computing unit, and the second end address represents the address of the second end boundary coordinate in the memory of the vector computing unit; and
    the second start index represents the index of the data point corresponding to the second start boundary coordinate at the second start address, and the second end index represents the index of the data point corresponding to the second end boundary coordinate at the second end address.
  11. The convolution operation method according to claim 10, wherein
    the second start boundary coordinate is calculated as: dst_row_start_index=j*kernel_w*ch, where dst_row_start_index represents the second start boundary coordinate, j represents the row index of the corresponding data within the sliding window, kernel_w represents the width of the sliding window, the size of the sliding window is equal to the size of the initial convolution kernel, and ch represents the number of channels of the input data;
    the second end boundary coordinate is calculated as: dst_row_end_index=dst_row_start_index+(kernel_w*ch-1), where dst_row_end_index represents the second end boundary coordinate;
    the second start address is calculated as: dst_row_start_address=dst_row_start_index/vmem_lane, where dst_row_start_address represents the second start address, and vmem_lane represents the number of data points that each storage row of the memory of the vector computing unit can store;
    the second end address is calculated as: dst_row_end_address=dst_row_end_index/vmem_lane, where dst_row_end_address represents the second end address;
    the second start index is calculated as: dst_row_start_lane=dst_row_start_index%vmem_lane, where dst_row_start_lane represents the second start index and % represents the modulo operation; and
    the second end index is calculated as: dst_row_end_lane=dst_row_end_index%vmem_lane, where dst_row_end_lane represents the second end index.
  12. The convolution operation method according to claim 7, wherein the vector computing unit storing, according to the initial position information parameter and the target position information parameter, each group of data in the converted arrangement at the corresponding location of the target memory to obtain the target data comprises:
    according to the initial position information parameter and the target position information parameter, the vector computing unit concatenating each group of data in the converted arrangement using cyclic shift instructions and according to preset enable signals in a predicate register, and storing the result at the corresponding location of the target memory to obtain the target data.
  13. The convolution operation method according to any one of claims 1-12, wherein performing the convolution operation based on the target data and the operation convolution kernel comprises:
    performing the convolution operation on the target data and the operation convolution kernel using a matrix operation unit.
  14. The convolution operation method according to any one of claims 1-13, wherein the convolution operation method is used for the first-layer convolution operation of a convolutional neural network.
  15. A convolution operation device, comprising:
    a determination unit configured to determine an operation convolution kernel, wherein the operation convolution kernel is obtained based on an initial convolution kernel, the initial convolution kernel is expressed as [R,S,C,K], the operation convolution kernel is expressed as [1,1,(C×R×S),K], and R, S, C, and K are all integers greater than 0;
    an adjustment unit configured to adjust the arrangement of input data based on the number of channels of the operation convolution kernel to obtain target data, wherein the size and the number of channels of the target data are different from the size and the number of channels of the input data, and the number of channels of the target data is equal to the number of channels of the operation convolution kernel; and
    a computing unit configured to perform a convolution operation based on the target data and the operation convolution kernel to obtain a convolution operation result, wherein the convolution operation result of the target data and the operation convolution kernel is equal to the convolution operation result of the input data and the initial convolution kernel.
  16. An electronic device, comprising the convolution operation device according to claim 15.
  17. An electronic device, comprising:
    a processor; and
    a memory including at least one computer program module;
    wherein the at least one computer program module is stored in the memory and configured to be executed by the processor, and the at least one computer program module is used to implement the convolution operation method according to any one of claims 1-14.
  18. A storage medium storing non-transitory computer-readable instructions, wherein when the non-transitory computer-readable instructions are executed by a computer, the convolution operation method according to any one of claims 1-14 is implemented.
Kind code of ref document: A1