WO2019109795A1 - Convolution operation processing method and related products - Google Patents

Convolution operation processing method and related products

Info

Publication number
WO2019109795A1
Authority
WO
WIPO (PCT)
Prior art keywords
convolution
segmentation
input data
data
kernel
Prior art date
Application number
PCT/CN2018/116086
Other languages
English (en)
French (fr)
Inventor
章恒
张阳明
Original Assignee
Tencent Technology (Shenzhen) Company Limited
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology (Shenzhen) Company Limited
Publication of WO2019109795A1
Priority to US16/678,004 (published as US11449576B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 17/10 Complex mathematical operations
    • G06F 17/15 Correlation function computation including computation of convolution operations
    • G06F 17/153 Multidimensional correlation or convolution
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/06 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
    • G06N 3/063 Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means

Definitions

  • the present application relates to the field of computer technology, and in particular, to a convolution operation processing method and related products.
  • CNN refers to a Convolutional Neural Network.
  • the integrated chip is generally used for CNN calculation.
  • the core of CNN calculation is convolution operation.
  • the integrated chip expands the convolution kernel into a convolution kernel matrix, expands the convolution input data into a convolution input matrix, and performs a matrix product operation on one row of the convolution kernel matrix and one column of the convolution input matrix.
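To make this expansion concrete, here is a minimal NumPy sketch (a hypothetical helper name, not from the patent): the kernel is flattened into one row of a kernel matrix, each sliding window of the input becomes one column of an input matrix, and the convolution reduces to a matrix product.

```python
import numpy as np

def conv_as_matrix_product(kernel, data):
    """Valid 3-D convolution via matrix expansion: the kernel becomes a
    1 x (K*K*D) row matrix, each sliding window of the input becomes one
    column of a (K*K*D) x (out*out) input matrix."""
    K, _, D = kernel.shape            # kernel is K x K x D
    M = data.shape[0]                 # data is M x M x D
    out = M - K + 1                   # output plane is out x out

    kernel_row = kernel.reshape(1, -1)
    columns = [data[i:i + K, j:j + K, :].reshape(-1)
               for i in range(out) for j in range(out)]
    input_matrix = np.stack(columns, axis=1)

    return (kernel_row @ input_matrix).reshape(out, out)

# e.g. a 2x2x2 kernel over 4x4x2 input yields a 3x3 result, as in FIG. 3
print(conv_as_matrix_product(np.random.rand(2, 2, 2), np.random.rand(4, 4, 2)))
```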
  • the embodiment of the present application discloses a convolution operation processing method and related products, which can reduce the number of circuits.
  • the embodiment of the present application provides an integrated chip, where the integrated chip includes: a controller, a convolution processor, an input buffer, and an output buffer;
  • the controller loads the segmentation convolution kernel and the segmentation convolution input data into the input buffer;
  • the segmentation convolution kernel is data obtained by segmenting the convolution kernel;
  • the segmentation convolution input data is data obtained by segmenting the convolution input data;
  • the convolution processor performs a segmentation convolution operation on the segmentation convolution kernel and the segmentation convolution input data to obtain a segmentation convolution result, and stores the segmentation convolution result in the output buffer.
  • the embodiment of the present application provides a convolution operation processing method, including:
  • the segmentation convolution kernel is data obtained by segmenting the convolution kernel;
  • the segmentation convolution input data is data obtained by segmenting the convolution input data;
  • An embodiment of the present application provides a convolution operation processing apparatus, including a memory and a processor, where the memory is used to store program instructions, and the program instructions are adapted to be loaded by the processor;
  • the processor is configured to load the program instruction and execute the convolution operation processing method according to the second aspect of the embodiment of the present application.
  • the embodiment of the present application provides a storage medium, where the storage medium stores a plurality of program instructions, and the program instructions are adapted to be loaded by a processor and execute the convolution operation processing method of the embodiment of the present application.
  • An embodiment of the present application provides a computer program product comprising a non-transitory computer readable storage medium storing a computer program operable to cause a computer to perform the convolution operation processing method of the embodiment of the present application.
  • the segmentation convolution operation is performed on the segmentation convolution kernel and the segmentation convolution input data to obtain the segmentation convolution result.
  • the segmentation convolution kernel is the data obtained by segmenting the convolution kernel;
  • the segmentation convolution input data is the data obtained by segmenting the convolution input data;
  • because the segmentation convolution kernel and the segmentation convolution input data are both smaller than the original kernel and input data, the segmentation convolution operation can be implemented with fewer circuits, reducing the number of circuits required for the convolution operation.
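The saving rests on convolution being linear along the depth (width) dimension: convolving each depth segment separately and summing the partial results gives the same answer as convolving the whole block at once. A minimal NumPy sketch of that equivalence (the dimensions are illustrative assumptions):

```python
import numpy as np

def conv3d_valid(kernel, data):
    """Direct 'valid' 3-D convolution: slide the kernel over the input and
    sum elementwise products at each position."""
    K = kernel.shape[0]
    out = data.shape[0] - K + 1
    result = np.zeros((out, out))
    for i in range(out):
        for j in range(out):
            result[i, j] = np.sum(kernel * data[i:i + K, j:j + K, :])
    return result

K, M, D, Y = 3, 8, 16, 4
kernel = np.random.rand(K, K, D)
data = np.random.rand(M, M, D)

# Segment kernel and input into Y pieces along the depth axis, run the
# smaller segmentation convolutions, and accumulate the partial results.
segmented = sum(conv3d_valid(kernel[:, :, y * D // Y:(y + 1) * D // Y],
                             data[:, :, y * D // Y:(y + 1) * D // Y])
                for y in range(Y))
assert np.allclose(segmented, conv3d_valid(kernel, data))
```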
  • FIG. 1A is a schematic diagram of a hardware architecture of an embodiment of the present application.
  • FIG. 1B is a schematic diagram of another hardware architecture of an embodiment of the present application.
  • FIG. 1C is a schematic diagram of a convolutional neural network algorithm model according to an embodiment of the present application.
  • FIG. 2A is a schematic structural diagram of an integrated chip according to an embodiment of the present application.
  • FIG. 2B is a schematic structural diagram of another integrated chip according to an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a calculation process of a conventional convolution operation according to an embodiment of the present application.
  • FIG. 4A is a schematic diagram of segmentation processing of a convolution kernel and convolution input data according to an embodiment of the present application.
  • FIG. 4B is a schematic flowchart of a piecewise convolution operation according to an embodiment of the present application.
  • FIG. 4C is a schematic flowchart of a matrix product operation corresponding to a piecewise convolution operation according to an embodiment of the present application.
  • FIG. 5 is a schematic diagram showing the operation of a logic circuit for performing a matrix product operation on a row of a piecewise convolution input matrix and a piecewise convolution kernel matrix according to an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a convolution operation processing method according to an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of another convolution operation processing method according to an embodiment of the present application.
  • FIG. 8 is a schematic structural diagram of a convolution operation processing apparatus according to an embodiment of the present application.
  • FIG. 9 is a schematic structural diagram of another convolution operation processing apparatus according to an embodiment of the present application.
  • the embodiment of the present application provides a convolution operation processing method and apparatus, which segment the convolution kernel and the convolution input data, and can reduce the bandwidth between the memory and the input buffer when performing convolution operations.
  • the convolution operation processing method of the embodiment of the present application can be executed by various computing platforms.
  • the computing platform can be a computing platform operated solely by a central processing unit (CPU), or a heterogeneous computing platform that includes a processor and an integrated chip.
  • the convolution operation can be performed by the CPU.
  • the controller of the computing platform can pass the matrix product operations required by the convolutional neural network algorithm to an integrated chip (for example, an FPGA/ASIC) for execution; other operations of the convolutional neural network algorithm, such as activation function, pooling, and normalization calculations, can also be passed to the integrated chip for execution.
  • the convolution operation processing method of each embodiment may include:
  • the segmentation convolution kernel is data obtained by segmenting the convolution kernel;
  • the segmentation convolution input data is data obtained by segmenting the convolution input data;
  • by segmenting, the amount of convolution kernel data and input data used for each convolution operation can be reduced, thereby reducing the size of the input buffer of the integrated chip and reducing the bandwidth requirements and input cache requirements that the convolutional neural network algorithm places on the device.
  • the computing platform can determine the amount of data of the segmentation convolution kernel based on the size of the input buffer. For example, a computing platform with limited memory resources can determine the size of the input buffer used for the convolution operation according to the usage of the memory, then determine the size of the segmentation convolution kernel according to the size of the allocated input buffer, and segment the convolution kernel accordingly.
  • the above method may further include:
  • the convolution kernel comprises a plurality of convolution elements arranged in an N-dimensional space, where N is a positive integer;
  • each segmentation convolution kernel includes a plurality of convolution elements that are adjacent in the N-dimensional space;
  • the amount of data in each segmentation convolution kernel is smaller than the size of the input buffer.
  • the size of the segmentation convolution kernel can be adjusted according to the actual cache usage, so that the convolution calculation can adapt to the actual cache usage of the computing platform.
  • the size of the segmentation convolution kernel is also related to the number of multiplication resources (e.g., multiplication units, multiplication circuits, etc.) available for the convolution operation in the computing platform.
  • the convolution kernel can be partitioned as follows:
  • Segment the convolution kernel in the N-dimensional space to obtain a plurality of segmentation convolution kernels, where the number of convolution elements in each segmentation convolution kernel is the smaller of the first number and the second number.
  • In this way the convolution calculation can be adapted simultaneously to the computing resources and the caching resources in the computing platform, so that the technical solution can be applied to computing platforms that are sensitive to both computing resource and cache resource occupancy.
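A sketch of how such a bound might be computed (the names and the two limit formulas are illustrative assumptions, not from the patent): the first number could come from the available multipliers and the second from the input buffer capacity, and each segment takes the smaller of the two.

```python
def elements_per_segment(available_multipliers: int,
                         buffer_bytes: int,
                         bytes_per_element: int) -> int:
    """Number of convolution elements per segmentation convolution kernel:
    the smaller of a compute-driven limit (the first number) and a
    cache-driven limit (the second number)."""
    first_number = available_multipliers               # one multiplier per element
    second_number = buffer_bytes // bytes_per_element  # elements that fit the buffer
    return min(first_number, second_number)

# e.g. 512 multipliers and a 1 KiB input buffer of 4-byte elements -> 256
print(elements_per_segment(512, 1024, 4))
```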
  • segmenting the convolution kernel can be done as follows:
  • the amount of data transferred during the convolution operation can be reduced, thereby reducing the read and write bandwidth.
  • the input buffer can be used to cache the required segmented convolution kernel and segmented convolution input data.
  • a separate buffer space may be allocated for the convolution operation of each segmentation convolution kernel, or a shared buffer space may be allocated for the convolution operations of segmentation convolution kernels performed in parallel.
  • the convolution input data can be segmented according to the information of the segmentation convolution kernel to obtain segmentation convolution input data.
  • the method of each embodiment may further include:
  • the convolution input data comprises a plurality of data elements arranged in an N-dimensional space, where N is a positive integer;
  • segment the convolution input data to obtain a plurality of pieces of segmentation convolution input data, where the number of data elements in each piece of segmentation convolution input data and their arrangement in the N-dimensional space are the same as the number and arrangement of the convolution elements in each segmentation convolution kernel; the plurality of pieces of segmentation convolution input data comprise a set of segmentation convolution input data respectively corresponding to each segmentation convolution kernel;
  • after the convolution processor performs a segmentation convolution operation on the first segmentation convolution input data and the first segmentation convolution kernel, a second segmentation convolution input data is loaded into the input buffer to replace the first segmentation convolution input data, causing the convolution processor to perform a segmentation convolution operation on the second segmentation convolution input data and the first segmentation convolution kernel.
  • a plurality of segmentation convolution operations can be performed in parallel in the computing platform; in this case, the segmentation convolution input data may be loaded into the input buffer corresponding to each segmentation convolution operation separately, or the segmentation convolution input data may be loaded into a shared input buffer.
  • alternatively, only the portion of the next segmentation convolution input data that differs from the currently cached segmentation convolution input data may be loaded into the input buffer.
  • for example, the portion of the second segmentation convolution input data that differs from the first segmentation convolution input data may be loaded into the input buffer to form the second segmentation convolution input data in the input buffer.
  • the convolution operation of the plurality of segment convolution kernels corresponding to the same component segment convolution input data may be performed in parallel by a plurality of convolution circuits, or may be performed in batches using one or a plurality of convolution circuits.
  • the result of the segmentation convolution can be directly superimposed in the output buffer, thereby further saving the space of the output buffer.
  • the same component segment convolution input data refers to data elements of the same row or rows in the input data matrix, such as data corresponding to one row or several rows of pixels in an image; the plurality of segmentation convolution kernels corresponding to the same component segment convolution input data may be the segmentation convolution kernels corresponding to the same column or columns in the convolution kernel matrix.
  • the method of each embodiment may further include:
  • after the convolution processor completes the segmentation convolution operation of the first segmentation convolution kernel and the first component segment convolution input data, a second segmentation convolution kernel and a third segmentation convolution input data (the data in the second component segment convolution input data that corresponds to the second segmentation convolution kernel) are loaded into the input buffer, causing the convolution processor to perform a segmentation convolution operation on the second segmentation convolution kernel and the third segmentation convolution input data;
  • the convolution processor is configured to superimpose the segmentation convolution result of the second segmentation convolution kernel and the third segmentation convolution input data onto a second segmentation convolution result stored in the output buffer, the second segmentation convolution result being the segmentation convolution result corresponding to data elements of the same row in the convolution input data.
  • similarly, only the portion of the new segmentation convolution kernel that differs from the segmentation convolution kernel currently stored in the input buffer may be loaded into the input buffer.
  • for example, the portion of the second segmentation convolution kernel that differs from the first segmentation convolution kernel may be loaded into the input buffer to form the second segmentation convolution kernel.
  • when the computing platform includes on-chip memory (i.e., memory on a chip such as a CPU or an FPGA), the on-chip memory can be used to temporarily store the convolution input data required for multiple convolution operations, thereby reducing reads from off-chip memory.
  • specifically, data for performing a plurality of convolution operations may be extracted from the original convolution input data stored in the off-chip memory as the convolution input data, and the convolution input data is loaded into the on-chip memory of the computing platform.
  • further, second data for performing a plurality of convolution operations may be extracted from the original convolution input data, the portion of the second data that differs from the convolution input data currently stored in the on-chip memory is loaded into the on-chip memory to form the second data, and the second data in the on-chip memory is used as the convolution input data for the convolution operation.
  • moreover, the next convolution input data may be loaded ahead of time: before the convolution operation of the convolution input data currently stored in the on-chip memory is completed, the second data for performing a plurality of convolution operations is extracted from the original convolution input data, and the portion of the second data that differs from the currently stored convolution input data is loaded into the on-chip memory to form the second data;
  • the second data in the on-chip memory is then used as the convolution input data for the convolution operation.
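A sketch of this incremental loading, under the assumption that consecutive convolution input blocks overlap by a fixed number of rows (hypothetical helper names):

```python
import numpy as np

def load_incremental(on_chip, next_block, overlap_rows):
    """Form the next convolution input block in on-chip memory by keeping the
    rows it shares with the current block and loading from off-chip memory
    only the rows that differ."""
    kept = on_chip[-overlap_rows:]           # rows shared with the next block
    new_rows = next_block[overlap_rows:]     # only these cross the memory bus
    return np.concatenate([kept, new_rows], axis=0)

current = np.arange(24).reshape(6, 4)        # rows 0..5 of the original data
nxt = np.arange(8, 32).reshape(6, 4)         # rows 2..7: overlaps rows 2..5
assert (load_incremental(current, nxt, 4) == nxt).all()
```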
  • the controller of the computing platform can pass the matrix product operations required by the convolutional neural network algorithm to an integrated chip (eg, an FPGA/ASIC) for execution.
  • the controller of the computing platform can then segment the convolution kernel and the convolution input data and load it into the integrated chip at the appropriate time, that is, perform the methods of the above embodiments to implement the convolution operation.
  • FIG. 1A is a schematic diagram of a hardware architecture of an embodiment of the present application.
  • the hardware architecture includes a server 11 and an integrated chip 12.
  • the server 11 includes a central processing unit (CPU) 111 and an external memory 112.
  • the server 11 and the integrated chip 12 are connected through a bus and a Peripheral Component Interconnect Express (PCIE) interface 13.
  • the integrated chip 12 may include a controller 122, a convolution processor (processing element, PE) 123, an input buffer 124, an on-chip memory 125, and an output buffer 126.
  • the controller 122 is bidirectionally connected to the convolution processor 123, the input buffer 124, and the output buffer 126. The on-chip memory 125 is connected to the input buffer 124 and can input data to the input buffer 124; the input buffer 124 is connected to the convolution processor 123 and can input data to the convolution processor 123; the convolution processor 123 is connected to the output buffer 126 and can output data to the output buffer 126; and the output buffer 126 is connected to the on-chip memory 125 and can output data to the on-chip memory 125.
  • the integrated chip 12 may be a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC).
  • the integrated chip 12 acts as a coprocessor for the server 11; it fetches data from the external memory 112 of the server 11, places the fetched data into the on-chip memory 125, and notifies the controller 122 that data has been obtained from the external memory 112.
  • the controller 122 is configured to retrieve the data to be calculated from the on-chip memory 125 into the input buffer 124, and the convolution processor 123 calculates the data to be calculated.
  • the convolution processor 123 is configured to place the calculated calculation result into the output buffer 126.
  • the output buffer 126 is for outputting the calculation result to the on-chip memory 125, and the server 11 can read the calculation result from the on-chip memory 125 of the integrated chip 12.
  • FIG. 1B is a schematic diagram of another hardware architecture provided by an embodiment of the present application.
  • FIG. 1B is further improved on the basis of FIG. 1A, and the difference from FIG. 1A is that a direct memory access (DMA) 121 is added to the integrated chip 12.
  • the direct memory access 121 is bidirectionally connected to the controller 122 and the on-chip memory 125, and the integrated chip 12 can acquire data from the external memory 112 of the server 11 through the direct memory access 121.
  • Direct memory access 121 is used to place the acquired data into on-chip memory 125 and notify controller 122 that data has been obtained from external storage 112.
  • the server 11 can also read the calculation result from the on-chip memory 125 of the integrated chip 12 through the direct memory access 121.
  • the direct memory access 121 does not need to interrupt the controller 122 and the convolution processor 123 when reading data from the server 11, and the data transmission efficiency between the integrated chip 12 and the server 11 can be improved by using the direct memory access 121.
  • FIG. 1C is a schematic diagram of a convolutional neural network algorithm model provided by an embodiment of the present application.
  • the convolutional neural network algorithm model can process image data.
  • the input image data is subjected to operations such as convolution, pooling, and normalization, and finally, after the fully connected layer and the softmax operation, the image processing result is judged to be an image of a "dog".
  • the convolution operation needs to be performed over multiple layers, and the convolution operation accounts for the largest amount of calculation in the entire convolutional neural network algorithm.
  • the convolutional neural network algorithm in the embodiment of the present application can be used to perform various image recognition, such as image classification, image filtering, and the like.
  • a service scenario applicable to the embodiment of the present application may be, for example, a business scenario of detecting and filtering pornographic images.
  • the convolution operation in the embodiment of the present application can be implemented by a deep learning platform, which may include Convolutional Architecture for Fast Feature Embedding (Caffe), a second-generation artificial intelligence learning system (for example, TensorFlow), and the like.
  • the deep learning platform can call the Basic Linear Algebra Subprograms (BLAS) for matrix product operations.
  • the convolution processor 123 in the integrated chip 12 can be used to process convolution operations; there can be a plurality of convolution processors 123, which can process multiple convolution operations in parallel.
  • convolution operations refer to convolution operations on convolution kernels and convolutional input data.
  • the number of convolution kernels and the size of the convolution kernel are related to the convolutional neural network algorithm.
  • the convolution kernel of each layer convolution operation is different, and the convolution input data of each layer convolution operation is also different.
  • the convolution input data is generally directly expanded into a convolution input matrix, and the convolution kernel is directly expanded into a convolution kernel matrix.
  • Convolution operations are thus transformed into matrix product operations (multiplications and additions) that are easy to implement with logic circuits.
  • one row of the convolution kernel matrix and one column of the convolution input matrix are read from the on-chip memory 125 into the input buffer 124 each time, and the convolution processor 123 performs a matrix product operation on the row of the convolution kernel matrix and the column of the convolution input matrix. Since the data amount of the convolution input matrix is usually large, the amount of data and the amount of calculation required for the matrix product operation are very large, so the number of logic circuits of the integrated chip 12 required for the matrix product operation is very large; the logic circuit can include at least one of a multiplier, an adder, and a multiplier-adder.
  • FIG. 2A is a schematic structural diagram of an integrated chip according to an embodiment of the present application.
  • integrated chip 12 includes controller 122, convolution processor 123, input buffer 124, and output buffer 126.
  • the integrated chip 12 can also be based on the hardware architecture shown in FIG. 1B.
  • the integrated chip 12 can also include a direct memory access 121, which can be bidirectionally coupled to the controller 122.
  • the direct memory access 121 does not need to interrupt the controller 122 and the convolution processor 123 when reading data from an external device of the integrated chip, and the direct memory access 121 can improve the data transmission efficiency between the integrated chip and the external device.
  • the number of convolutional layers of the convolution operation and the size and number of convolution kernels for each convolution operation are related to the convolutional neural network algorithm.
  • for different convolutional neural network algorithms, the number of layers that need convolution operations is not necessarily the same, the size of the convolution kernel of each convolution operation is not necessarily the same, and the number of convolution kernels of each layer's convolution operation is not necessarily the same. Whether each layer needs post-processing after the convolution operation is also not necessarily the same.
  • the post-processing operation includes at least one of an activation function calculation, a pooling calculation, and a normalization calculation, and whether the post-processing calculation is performed is determined according to a convolutional neural network algorithm.
  • once the convolutional neural network algorithm is determined, the number of layers of the convolution operation and the size and number of convolution kernels for each layer of convolution operations are determined.
  • the convolution operation is performed layer by layer. First, the first layer convolution operation is performed. After the first layer convolution operation result is obtained, if the neural network algorithm requires post-processing calculation, the post-processing calculation is performed, and the post-processed result of the first layer convolution operation is used as the convolution input data of the second layer convolution operation; if the neural network algorithm does not require post-processing calculation, the first layer convolution operation result is used as the convolution input data of the second layer convolution operation.
  • a second layer convolution operation is performed, and so on, until the end of the last layer convolution operation, the multi-layer convolution operation can be completed.
  • N convolution kernels can be convoluted separately with convolution input data.
  • the convolution kernel may be learned from the training data through the deep learning platform; the convolution input data of the first layer convolution operation is the initial data for the convolution operation, for example, image data to be processed.
  • the result of each layer's convolution operation may need a rearrangement operation before serving as the convolution input data of the next layer. In the case where a matrix is used as the convolution operation input, the rearrangement can usually be implemented by transposition.
  • the convolution kernel and the convolution input data are three-dimensional, and the convolution kernel and the convolution input data can be understood as three-dimensional data blocks.
  • the convolution result of one convolution kernel with the convolution input data is two-dimensional; since the number of convolution kernels is often more than one, the result of convolving multiple convolution kernels with the convolution input data is three-dimensional.
  • the convolution operation of the convolution kernel and the convolution input data can be understood as: the small data block corresponding to the convolution kernel slides in the large data block corresponding to the convolution input data, and each time it is slid, the small data block and the large data block are The part of the data that coincides with the small data block is multiplied, and the result obtained by multiplying each coincident data is added to obtain a convolution result.
  • FIG. 3 is a schematic diagram of a calculation process of a convolution operation according to an embodiment of the present application.
  • the convolution kernel is a 2×2×2 data block, and the length, height, and width of the convolution kernel are all 2.
  • the convolution kernel comprises a total of 8 data (a11, a12, a21, a22, a31, a32, a41, a42), and the convolution kernel is composed of 8 small cubes, each of which represents one data in the convolution kernel.
  • the convolution input data is a 4×4×2 data block; the length and height of the convolution input data are 4 and the width is 2.
  • the convolution input data comprises a total of 32 data (b11, b12, b13, b14, b21, b22, b23, b24, b31, b32, b33, b34, b41, b42, b43, b44, b51, b52, b53, b54, b61, b62, b63, b64, b71, b72, b73, b74, b81, b82, b83, b84).
  • the convolution input data consists of 32 small cubes, each representing one of the data in the convolution input data.
  • Controller 122 loads the convolution kernel and convolution input data into input buffer 124 prior to the convolution operation.
  • the controller 122 may load the convolution kernel and the convolution input data into the input buffer 124 as follows: the controller 122 sends a read instruction to the input buffer 124, and in response the input buffer 124 reads the convolution kernel and the convolution input data from the memory and loads them into the input buffer 124.
  • the convolution processor 123 performs a convolution operation on the convolution kernel and the convolution input data. Specifically, in the first step, the convolution processor 123 places the convolution kernel into the convolution input data so that the data block of the convolution kernel coincides with one 2×2×2 data block in the convolution input data, multiplies the data represented by the coinciding small cubes to obtain multiplication results, and adds all the multiplication results together to obtain the convolution operation result of the first step.
  • as can be seen from FIG. 3, a11, a12, a21, a22, a31, a32, a41, a42 coincide with b11, b12, b21, b22, b51, b52, b61, b62, respectively, so the convolution operation result of the first step is: c11 = a11×b11 + a12×b12 + a21×b21 + a22×b22 + a31×b51 + a32×b52 + a41×b61 + a42×b62.
  • the convolution operation of the first step can be understood as a matrix product of a row matrix composed of a11, a12, a21, a22, a31, a32, a41, a42 with a column matrix composed of b11, b12, b21, b22, b51, b52, b61, b62.
  • in the second step, the convolution kernel is slid by one data unit along the length direction of the convolution input data, so that the data block of the convolution kernel coincides with another 2×2×2 data block in the convolution input data; as in the first step, the data represented by the coinciding small cubes are multiplied and all multiplication results are added to obtain the convolution operation result of the second step.
  • as can be seen from FIG. 3, a11, a12, a21, a22, a31, a32, a41, a42 coincide with b12, b13, b22, b23, b52, b53, b62, b63, respectively, and the result of the convolution operation of the second step is c12.
  • in the third step, the convolution kernel is slid along the length direction of the convolution input data by one data unit, and the data represented by the coinciding small cubes are multiplied and added to obtain the convolution operation result c13 of the third step.
  • in the fourth step, the convolution kernel is slid along the height direction of the convolution input data by one data unit, and the data represented by the coinciding small cubes are multiplied and added to obtain the convolution operation result c23 of the fourth step.
  • in the fifth step, the convolution kernel is slid along the length direction of the convolution input data by one data unit, and the data represented by the coinciding small cubes are multiplied and added to obtain the convolution operation result c22 of the fifth step.
  • in the sixth step, the convolution kernel is slid along the length direction of the convolution input data by one data unit, and the data represented by the coinciding small cubes are multiplied and added to obtain the convolution operation result c21 of the sixth step.
  • in the seventh step, the convolution kernel is slid along the height direction of the convolution input data by one data unit, and the data represented by the coinciding small cubes are multiplied and added to obtain the convolution operation result c31 of the seventh step.
  • in the eighth step, the convolution kernel is slid along the length direction of the convolution input data by one data unit, and the data represented by the coinciding small cubes are multiplied and added to obtain the convolution operation result c32 of the eighth step.
  • in the ninth step, the convolution kernel is slid along the length direction of the convolution input data by one data unit, and the data represented by the coinciding small cubes are multiplied and added to obtain the convolution operation result c33 of the ninth step.
  • the final convolution result consists of nine data c11, c12, c13, c21, c22, c23, c31, c32, c33, and these nine data can be mapped onto a 3×3 data plane. If there are N convolution kernels in FIG. 3, the convolution operation result of the N convolution kernels and the convolution input data is a 3×3×N data block (length and height are 3, and the width is N).
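The nine-step sliding computation above can be reproduced directly; a minimal NumPy sketch (random values stand in for the a/b data of FIG. 3):

```python
import numpy as np

kernel = np.random.rand(2, 2, 2)   # the 2 x 2 x 2 kernel (a11 ... a42)
data = np.random.rand(4, 4, 2)     # the 4 x 4 x 2 input data (b11 ... b84)

# Slide the kernel over every 2 x 2 x 2 block of the input; each position
# gives one of the nine results c11 ... c33 on the 3 x 3 output plane.
result = np.zeros((3, 3))
for i in range(3):
    for j in range(3):
        result[i, j] = np.sum(kernel * data[i:i + 2, j:j + 2, :])
print(result)
```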
  • alternatively, the convolution processor 123 performs the convolution operation on the convolution kernel and the convolution input data as a matrix product. Specifically, the convolution processor 123 expands the convolution kernel into a convolution kernel matrix of 1 row and 8 columns, and expands the convolution input data into a convolution input matrix of 8 rows and 9 columns. The number of columns of the convolution input matrix is determined by the sizes of the convolution kernel and the convolution input data: if the size of the convolution kernel is K×K×D and the size of the convolution input data is M×M×D, then the number of columns of the convolution input matrix is (M - K + 1)×(M - K + 1).
  • the length, height, and width of the convolution kernel and convolution input data in FIG. 3 are all described by taking a small numerical value as an example.
  • the length, height, and width of the convolution kernel and the convolution input data are often very large.
  • the number of logic circuits of the integrated chip 12 used for convolution operations therefore needs to be set very large, which can result in wasted circuitry.
  • each convolution kernel needs to be convoluted with the convolution input data, which causes the convolution kernel or the convolution input data to be repeatedly loaded from the memory into the input buffer, so the bandwidth requirement between the memory and the input buffer is high.
  • the integrated chip 12 shown in FIG. 2A can perform segmentation processing on the convolution kernel to obtain a segmentation convolution kernel; segmentation processing the convolution input data to obtain segmentation convolution input data.
  • after the segmentation convolution kernel and the segmentation convolution input data are loaded into the input buffer 124, the convolution processor 123 performs a segmentation convolution operation on the segmentation convolution input data and the segmentation convolution kernel to obtain a segmentation convolution result, and stores the segmentation convolution result in the output buffer 126.
  • the segmented convolution kernel is data obtained by convolution kernel segmentation; the segmentation convolution input data is data obtained by convolution input data segmentation.
  • the above process is one layer of the convolution operation; the next layer of the convolution operation will use the above segmentation convolution result as its convolution input data.
  • for the specific convolution process, refer to the previous description of the convolution operation; details are not described herein again.
  • the segmentation convolution kernel is data obtained by segmenting the convolution kernel; specifically, the segmentation convolution kernel includes the pieces of data located in the same spatial region of the N convolution kernels segmented in the same manner.
  • the segmentation convolution input data is data obtained by segmenting the convolution input data; specifically, the segmentation convolution input data is a piece of the convolution input data obtained according to the segmentation manner of the segmentation convolution kernel.
  • FIG. 4A is taken as an example below to describe the segmentation manner of the convolution kernel and the segmentation manner of the convolution input data.
  • FIG. 4A is a schematic diagram of segmentation processing of a convolution kernel and convolution input data according to an embodiment of the present application; FIG. 4A shows a scenario in which convolution operations are performed on N convolution kernels and one piece of convolution input data.
  • the size of each convolution kernel is K×K×D, where K is the length and height of the convolution kernel and D is its width; the size of the convolution input data is M×M×D, where M is the length and height of the convolution input data and D is its width.
  • the convolution kernel and the convolution input data can be understood as data blocks in three dimensions; each convolution kernel comprises K×K×D data, and the convolution input data comprises M×M×D data.
  • when the convolution kernel is split, the K×K×D data block is cut in the width direction and divided into Y sub-blocks of K×K×(D/Y); similarly, the M×M×D data block of the convolution input data is also cut in the width direction and divided into Y sub-blocks of M×M×(D/Y).
  • the sub-blocks of the N convolution kernels with width in the 0 to D/Y interval are used as the first segmentation convolution kernel, the sub-blocks of the N convolution kernels with width in the D/Y to 2D/Y interval are used as the second segmentation convolution kernel, and so on; the sub-blocks of the N convolution kernels in the (Y-1)D/Y to D interval are used as the Y-th segmentation convolution kernel.
  • likewise, the sub-block of the convolution input data with width in the 0 to D/Y interval is used as the first segmentation convolution input data, the sub-block with width in the D/Y to 2D/Y interval is used as the second segmentation convolution input data, and so on; the sub-block of the convolution input data in the (Y-1)D/Y to D interval is used as the Y-th segmentation convolution input data.
  • in other words, the first segmentation convolution kernel includes the pieces of data of the N convolution kernels in the 0 to D/Y width interval obtained by segmenting the width in units of D/Y; the second segmentation convolution kernel includes the pieces of data of the N convolution kernels in the D/Y to 2D/Y width interval; and the Y-th segmentation convolution kernel includes the pieces of data of the N convolution kernels in the (Y-1)D/Y to D width interval.
  • similarly, the first segmentation convolution input data is the piece of the convolution input data in the 0 to D/Y width interval obtained by segmenting the width in units of D/Y; the second segmentation convolution input data is the piece of the convolution input data in the D/Y to 2D/Y width interval; and the Y-th segmentation convolution input data is the piece of the convolution input data in the (Y-1)D/Y to D width interval.
  • the first segmentation convolution kernel corresponds to the first segmentation convolution input data, the second segmentation convolution kernel corresponds to the second segmentation convolution input data, and so on; the Y-th segmentation convolution kernel corresponds to the Y-th segmentation convolution input data.
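Under the assumption that D is divisible by Y, this width-direction segmentation is a partition of the depth axis, with the y-th kernel piece paired with the y-th input piece; a minimal NumPy sketch (dimensions illustrative):

```python
import numpy as np

N, K, M, D, Y = 10, 3, 8, 16, 4
kernels = np.random.rand(N, K, K, D)   # N convolution kernels of K x K x D
data = np.random.rand(M, M, D)         # convolution input data of M x M x D

# Cut both in the width (last) direction into Y sub-blocks; the y-th kernel
# piece pairs with the y-th input piece for the y-th segmentation convolution.
kernel_segments = np.split(kernels, Y, axis=-1)  # Y pieces, N x K x K x (D/Y)
data_segments = np.split(data, Y, axis=-1)       # Y pieces, M x M x (D/Y)
```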
  • the controller 122 loads the first segmentation convolution kernel and the corresponding first segment convolution input data into the input buffer 124 prior to performing the segmentation convolution operation.
  • the convolution processor 123 performs a convolution operation on the first segmentation convolution kernel and the corresponding first segment convolution input data, respectively, to obtain a segmentation convolution result of the first segment.
  • the segmented convolution result of the first segment is stored in output buffer 126.
  • next, the controller 122 loads the second segmentation convolution kernel and the corresponding second segmentation convolution input data into the input buffer 124, and the convolution processor 123 performs a convolution operation on the second segmentation convolution kernel and the second segmentation convolution input data to obtain the segmentation convolution result of the second segment, which is stored in the output buffer 126.
  • and so on, until the controller 122 loads the Y-th segmentation convolution kernel and the corresponding Y-th segmentation convolution input data into the input buffer 124, and the convolution processor 123 performs a convolution operation on the Y-th segmentation convolution kernel and the Y-th segmentation convolution input data to obtain the segmentation convolution result of the Y-th segment, which is stored in the output buffer 126.
  • FIG. 4B shows the process of the convolution operation of N convolution kernels of size K×K×D with convolution input data of size M×M×D.
  • each convolution kernel is split into Y segmentation convolution kernels of size K×K×(D/Y), and the convolution input data is split into Y pieces of segmentation convolution input data of size M×M×(D/Y).
  • the convolution operation can thus be split into Y segmentation convolution operations: a first segmentation convolution operation, a second segmentation convolution operation, ..., a Y-th segmentation convolution operation.
  • the first segmentation convolution operation in FIG. 4B performs a convolution operation on the first segmentation convolution kernel and the first segmentation convolution input data; the second segmentation convolution operation in FIG. 4B performs a convolution operation on the second segmentation convolution kernel and the second segmentation convolution input data; and so on, the Y-th segmentation convolution operation in FIG. 4B performs a convolution operation on the Y-th segmentation convolution kernel and the Y-th segmentation convolution input data.
  • the detailed process of the convolution operation of a K×K×(D/Y) segmentation convolution kernel with M×M×(D/Y) segmentation convolution input data is similar to the convolution operation of the 2×2×2 convolution kernel with the 4×4×2 convolution input data in FIG. 3, and will not be described here.
  • the first segmentation convolution kernel may be expanded into a matrix of U rows and N columns, and the first segmentation convolution input data may be expanded into a matrix of P rows and U columns; the second segmentation convolution kernel may also be expanded into a matrix of U rows and N columns, and the second segmentation convolution input data may also be expanded into a matrix of P rows and U columns; and so on, the Y-th segmentation convolution kernel may also be expanded into a matrix of U rows and N columns, and the Y-th segmentation convolution input data may also be expanded into a matrix of P rows and U columns.
  • the results of the Y matrix product operations of the P-row, U-column matrices with the U-row, N-column matrices are accumulated to obtain a convolution result of P rows and N columns.
  • where U = K×K×(D/Y) and P = (M - K + 1)×(M - K + 1).
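A sketch checking these shapes (assuming D divisible by Y): each segment contributes one P x U by U x N matrix product, and the Y partial products accumulate into the P x N result.

```python
import numpy as np

K, M, D, Y, N = 3, 8, 16, 4, 10
U = K * K * (D // Y)              # rows of each segmentation kernel matrix
P = (M - K + 1) * (M - K + 1)     # rows of each segmentation input matrix

result = np.zeros((P, N))
for _ in range(Y):                        # one matrix product per segment
    seg_input = np.random.rand(P, U)      # P x U segmentation input matrix
    seg_kernel = np.random.rand(U, N)     # U x N segmentation kernel matrix
    result += seg_input @ seg_kernel      # accumulate the partial product
# result is the P x N convolution result (random stand-in data here)
```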
  • the integrated chip 12 may further include an on-chip memory 125 that receives the convolution input data and the segmentation convolution kernels, or that receives the convolution input data and the convolution kernel.
  • On-chip memory 125 can be coupled to input buffer 124 and can be coupled to output buffer 126.
  • the convolution kernel may be segmented by the server 11 in FIG. 1A or FIG. 1B, or may be segmented by the integrated chip 12.
  • the convolution input data can be segmented by the integrated chip 12.
  • in one case, the central processing unit 111 of the server 11 segments the N convolution kernels of each layer corresponding to the convolutional neural network algorithm to obtain a plurality of segmentation convolution kernels and stores them in the external memory 112; the central processing unit 111 of the server 11 then inputs the plurality of segmentation convolution kernels and the convolution input data for the convolution operation from the external memory 112 to the on-chip memory 125 of the integrated chip 12.
  • the on-chip memory 125 can receive the convolution input data and the plurality of segmentation convolution kernels transmitted by the server 11.
  • the controller 122 splits the convolution input data in the on-chip memory 125 into a plurality of pieces of segmentation convolution input data; the controller 122 loads one segmentation convolution kernel and one piece of segmentation convolution input data into the input buffer 124, or loads two segmentation convolution kernels and two pieces of segmentation convolution input data into the input buffer 124.
  • when the convolution kernel is segmented by the server 11 as in FIG. 1A, the integrated chip does not need to segment the convolution kernel and can perform the segmentation convolution operation quickly, thereby improving the processing efficiency of the convolution operation.
  • the server 11 inputs the convolution kernel and convolution input data in the external memory 112 to the on-chip memory 125 of the integrated chip 12, and the on-chip memory 125 receives The convolution input data and convolution kernel sent by the server 11.
  • the controller 122 splits the convolution input data in the on-chip memory 125 into a plurality of pieces of segmentation convolution input data, and splits the convolution kernel in the on-chip memory 125 into a plurality of segmentation convolution kernels.
  • the controller 122 loads one segmentation convolution kernel and one piece of segmentation convolution input data into the input buffer 124, or loads two segmentation convolution kernels and two pieces of segmentation convolution input data into the input buffer 124.
  • the server 11 does not need to segment the convolution kernel, and the burden on the server 11 can be alleviated.
  • the operation of the integrated chip 12 will be described in conjunction with FIGS. 4A, 4B, and 4C.
  • the integrated chip 12 acquires a segmentation convolution kernel and segmentation convolution input data, wherein the segmentation convolution kernel may be split by the integrated chip 12 or may be split by the server 11.
  • the controller 122 reads one piece of piecewise convolution input data and the corresponding N piecewise convolution kernels from the on-chip memory 125 into the input buffer 124; the convolution processor 123 expands the N piecewise convolution kernels into a piecewise convolution kernel matrix of U rows and N columns, expands the piece of piecewise convolution input data into a piecewise convolution input matrix of P rows and U columns, and performs a matrix product operation on the piecewise convolution input matrix of P rows and U columns and the piecewise convolution kernel matrix of U rows and N columns to obtain a segmentation convolution result.
  • where N is the number of convolution kernels and U = K×K×(D/Y).
  • after the convolution processor 123 completes the piecewise convolution operation of the piecewise convolution input data with the corresponding N piecewise convolution kernels, the resulting segmentation convolution result is stored in the output buffer 126, and the controller 122 reads another piece of piecewise convolution input data and the corresponding further N piecewise convolution kernels from the on-chip memory 125 into the input buffer 124 for the next piecewise convolution operation.
  • after the Y segmentation convolution operations are completed, all the segmentation convolution results in the output buffer 126 can be accumulated to obtain the convolution operation result, that is, one layer of the convolution operation is completed.
  • the segmentation convolution operation process shown in FIG. 4A to FIG. 4C performs segmentation convolution operations on the segmentation convolution input data and the segmentation convolution kernels by splitting the convolution kernel and the convolution input data.
  • in this way it is not necessary to repeatedly read the segmentation convolution kernel and the segmentation convolution input data from the on-chip memory into the input buffer, which can reduce the bandwidth between the on-chip memory and the input buffer during the convolution operation.
  • the number of rows of the segmentation convolution kernel matrix corresponding to the segmentation convolution kernel and the number of columns of the segmentation convolution input matrix corresponding to the segmentation convolution input data correspond to the number of logic circuits of the convolution processor 123.
  • the logic circuit may include a multiplier, an adder, or other device having logic operation capability.
  • specifically, the correspondence between these matrix dimensions and the number of logic circuits of the convolution processor includes: the number of rows of the segmentation convolution kernel matrix corresponding to the segmentation convolution kernel is less than or equal to the number of multipliers of the convolution processor 123, and the number of columns of the segmentation convolution input matrix corresponding to the segmentation convolution input data is less than or equal to the number of multipliers of the convolution processor 123.
  • the size of each piecewise convolution kernel is K×K×(D/Y).
  • where the logic circuit may include a multiplier, the correspondence between the number of rows of the matrix corresponding to the piecewise convolution kernel, the number of columns of the matrix corresponding to the piecewise convolution input data, and the number of logic circuits of the convolution processor includes: the number of rows of the matrix corresponding to the piecewise convolution kernel is less than or equal to the number of multipliers of the convolution processor; and the number of columns of the matrix corresponding to the piecewise convolution input data is less than or equal to the number of multipliers of the convolution processor.
  • the size of each piecewise convolution kernel is K×K×(D/Y).
  • the value of Y can be determined according to the number of available logic circuits of the convolution processor 123 and the size of the convolution kernel.
  • for example, if the logic circuit includes a multiplier and an adder, the number of available multipliers of the convolution processor 123 is X, and the size of the convolution kernel is K×K×D, then Y needs to satisfy: Y > (K×K×D)/X.
  • likewise, if the logic circuit includes a multiplier, the number of available multipliers of the convolution processor 123 is X, and the size of the convolution kernel is K×K×D, then Y needs to satisfy: Y > (K×K×D)/X.
  • for example, the convolution kernel size is K×K×D, where K is the length and height and D is the width; with K = 10, D = 100, and Y = 15, the convolution kernel is split into 15 piecewise convolution kernels: fourteen of size 10×10×7 and one of size 10×10×2.
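A sketch of picking Y from the bound above (hypothetical helper; the 700-multiplier figure is an assumption chosen so that a 10×10×100 kernel gives Y = 15, matching the fifteen-way split just described):

```python
import math

def choose_segments(K: int, D: int, X: int) -> int:
    """Smallest integer Y with Y > (K*K*D)/X, so each K x K x (D/Y)
    segmentation kernel needs at most X multipliers."""
    return math.floor(K * K * D / X) + 1

print(choose_segments(10, 100, 700))   # -> 15, matching the split above
```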
  • the input buffer 124 includes a first storage space and a second storage space, the first storage space and the second storage space are respectively used for storing data; and the controller 122 loading the segmentation convolution kernel into the input buffer 124 includes:
  • the controller 122 moves the data in the first storage space to the second storage space, and then loads the segmentation convolution kernel into the first storage space.
  • the controller 122 moves the data in the first storage space to the second storage space, and the original data in the second storage space can be overwritten.
  • the controller 122 loads the segmentation convolution kernel into the first storage space to overwrite the original data in the first storage space.
  • the above data may be a segmentation convolution kernel.
  • the segmentation convolution kernels are stored in the input buffer 124 in a "ping-pong storage" manner, which ensures that two segmentation convolution kernels are always stored in the input buffer 124.
  • thus, after completing one segmentation convolution operation, the convolution processor 123 can perform the next segmentation convolution operation without waiting for the segmentation convolution kernel of the next segmentation convolution operation to be loaded, which can improve the processing efficiency of the segmentation convolution operation.
  • the controller 122 loads the segmentation convolution kernel into the input buffer 124, including:
  • the controller 122 loads the segmentation convolution kernel into the second storage space
  • in this case, the controller 122 is also operative to load, into the first storage space, another segmentation convolution kernel other than the segmentation convolution kernel.
  • when the first storage space and the second storage space of the input buffer 124 are both empty, it indicates that data is being loaded into the input buffer 124 for the first time; in this case, two segmentation convolution kernels may be loaded into the input buffer 124 at once.
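A simplified sketch of this ping-pong scheme (a hypothetical structure; strings stand in for kernel data):

```python
class PingPongBuffer:
    """Input buffer with two storage spaces used in a ping-pong manner: the
    convolution processor reads one space while the controller loads the next
    segmentation convolution kernel into the other."""
    def __init__(self):
        self.first = None    # first storage space
        self.second = None   # second storage space

    def load(self, kernel):
        # Move the data in the first space into the second space
        # (overwriting it), then load the new kernel into the first space.
        self.second = self.first
        self.first = kernel

buf = PingPongBuffer()
buf.load("segmentation kernel 1")   # first load
buf.load("segmentation kernel 2")   # two kernels are now resident
# compute on buf.second while buf.first already holds the next kernel
```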
  • the convolution processor 123 performs a segmentation convolution operation on the segmentation convolution kernel and the segmentation convolution input data, including:
  • the convolution processor 123 converts the piecewise convolution kernel into a piecewise convolution kernel matrix and converts the piecewise convolution input data into a piecewise convolution input matrix; the convolution processor 123 uses the piecewise convolution input matrix as the multiplier and the piecewise convolution kernel matrix as the multiplicand for the product operation.
  • the piecewise convolution input matrix is used as the multiplier, and the piecewise convolution kernel matrix is used as the multiplicand.
  • in a matrix product operation, the matrix corresponding to the multiplier is generally multiplied row by row with the matrix corresponding to the multiplicand.
  • the size of the convolution kernel is: 2 ⁇ 2 ⁇ 100
  • the size of the convolution input data is 100 ⁇ 100 ⁇ 100
  • each convolution kernel can be split into 50 segmentation convolution kernels of size 2 × 2 × 2
  • the convolution input data can be split into 50 pieces of segmentation convolution input data of size 100 × 100 × 2.
  • with a conventional segmentation convolution operation, the convolution processor 123 converts the segmented convolution kernel into a segmented convolution kernel matrix and the segmented convolution input data into a segmented convolution input matrix, and uses the piecewise convolution input matrix as the multiplicand and the piecewise convolution kernel matrix as the multiplier.
  • the piecewise convolution kernel matrix is 100 rows and 8 columns, and the segmented convolution input matrix is 8 rows and 9801 columns.
  • the 100-row, 8-column matrix must be multiplied row by row with the 8-row, 9801-column matrix, so each matrix product operation multiplies a 1-row, 8-column matrix by an 8-row, 9801-column matrix, and each matrix product operation needs to occupy a large storage space of the input buffer 124.
  • with the segmentation convolution operation of this embodiment, the convolution processor 123 uses the piecewise convolution input matrix as the multiplier and the piecewise convolution kernel matrix as the multiplicand; the piecewise convolution kernel matrix is 8 rows and 100 columns.
  • the segmented convolution input matrix is 9801 rows and 8 columns.
  • the 9801-row, 8-column matrix is multiplied row by row with the 8-row, 100-column matrix, so each matrix product operation multiplies a 1-row, 8-column matrix by an 8-row, 100-column matrix, and each matrix product operation occupies only a small storage space of the input buffer 124.
  • the piecewise convolution input matrix is often much larger than the piecewise convolution kernel matrix
  • using the piecewise convolution input matrix as the multiplier
  • and the piecewise convolution kernel matrix as the multiplicand can reduce the input buffer 124 storage space occupied by each matrix product operation.
  • since both the convolution kernel and the convolution input data are segmented, the number of rows of the segmented convolution kernel matrix is significantly smaller, and the convolution processor only needs available logic circuits equal to or greater than that number of rows to complete the piecewise convolution operation. By implementing the embodiments of the present application, the convolution operation can be completed with limited available logic circuits, saving logic circuits for convolution operations.
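A quick numeric check of the buffer-footprint argument, under the 2 × 2 × 100 kernel / 100 × 100 × 100 input example above; the variable names are illustrative:

```python
# Conventional ordering: the 100 x 8 kernel matrix is the multiplier, so one
# product step holds a 1 x 8 row plus the whole 8 x 9801 input matrix.
conventional_step = 1 * 8 + 8 * 9801

# Embodiment ordering: the 9801 x 8 input matrix is the multiplier, so one
# product step holds a 1 x 8 row plus only the 8 x 100 kernel matrix.
segmented_step = 1 * 8 + 8 * 100

print(conventional_step, segmented_step)   # 78416 vs 808 elements per step
```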
  • FIG. 5 is a schematic diagram of the operation of a logic circuit for performing a matrix product operation on a row of a piecewise convolution input matrix and a piecewise convolution kernel matrix provided in the embodiment of the present application.
  • the piecewise convolution input matrix is P rows and U columns, and the piecewise convolution kernel matrix is U rows and N columns; assume P = 2000, U = 5, N = 100.
  • the first row of the segmented convolution input matrix is used below to illustrate the implementation in a specific logic circuit.
  • the logic circuit for the convolution operation in the convolution processor 123 of the integrated chip 12 may be a multiplier and an adder, or may be a multiplier.
  • the multiplier can multiply at least two data and output a multiplication result.
  • the adder can add at least two data and output the addition result.
  • a multiply-accumulator (also referred to as a multiply-adder) is used for the convolution operation
  • the result obtained by the multiplication operation can be added to another operand, and the multiplication and addition can be performed in one clock cycle, to reduce the execution delay of the entire multiply-add operation.
  • FIG. 5 uses multiply-accumulators as an example.
  • only five multiply-accumulators (multiply-accumulator 0, multiply-accumulator 1, multiply-accumulator 2, multiply-accumulator 3, and multiply-accumulator 4) are needed to realize all the segmented convolution operations.
  • for example, if the clock frequency of the convolution processor 123 of the integrated chip 12 is 1 GHz, the clock period is 1 ns; assume that a multiply-accumulator can perform one multiply-accumulate operation in one clock cycle.
  • the first row of the piecewise convolution input matrix is X00 X01 X02 X03 X04
  • the piecewise convolution kernel matrix is the U-row, N-column matrix of weights Wij (U = 5, N = 100); then the matrix product operation of the first row of the segmented convolution input matrix with the segmentation convolution kernel matrix is implemented in the logic circuits of the convolution processor as follows.
  • in the first clock cycle (T0), multiply-accumulator 0 performs the multiplication X00 × W00, obtaining the operation result (X00×W00).
  • the multiplier 0 transfers the operation result (X00 ⁇ W00) obtained in the previous cycle to the multiplier 1 , and then the multiplier 0 performs the multiplication of X00 ⁇ W01 to obtain an operation.
  • Result (X00 ⁇ W01) the multiplier 1 performs multiplication of X01 ⁇ W10 and adds the obtained operation result (X01 ⁇ W10) and the operation result (X00 ⁇ W00) transmitted by the multiplier adder 0 to obtain an operation result. (X00 ⁇ W00+X01 ⁇ W10).
  • the multiplier 0 transmits the operation result (X00 ⁇ W01) obtained in the previous cycle to the adder 1, and then the multiplier 0 performs the multiplication of X00 ⁇ W02 to obtain the operation result.
  • the multiplier 1 transmits the operation result (X00 ⁇ W00+X01 ⁇ W10) obtained in the previous cycle to the multiplier adder 2, and then the multiplier 1 performs multiplication of X01 ⁇ W11 and obtains
  • the operation result (X01 ⁇ W11) and the operation result (X00 ⁇ W01) transmitted by the multiplier adder 0 are added to obtain an operation result (X00 ⁇ W01+X01 ⁇ W11);
  • the multiplier 2 performs multiplication of X02 ⁇ W20 and
  • the obtained calculation result (X02 ⁇ W20) and the calculation result (X00 ⁇ W00+X01 ⁇ W10) transmitted from the multiplier 1 are added to obtain an operation result (X00 ⁇ W00+X01 ⁇ W10+X02 ⁇ W20).
  • the multiplier 0 transmits the operation result (X00 ⁇ W02) obtained in the previous cycle to the adder 1, and then the multiplier operation of X00 ⁇ W03 is performed by the adder 0 to obtain the operation result.
  • the multiplier 1 transmits the operation result (X00 ⁇ W01+X01 ⁇ W11) obtained in the previous cycle to the multiplier adder 2, and then the multiplier 1 performs multiplication of X01 ⁇ W12 and obtains
  • the operation result (X01 ⁇ W12) and the operation result (X00 ⁇ W02) transmitted by the multiplier adder 0 are added to obtain an operation result (X00 ⁇ W02+X01 ⁇ W12); the multiplier 2 obtains the operation result of the previous cycle.
  • the multiplier 0 transfers the operation result (X00 ⁇ W03) obtained in the previous cycle to the adder 1, and then multiplies the multiplier by X0 ⁇ W04 by the adder 0 to obtain the operation result. (X00 ⁇ W04); the multiplier 1 transmits the operation result (X00 ⁇ W02+X01 ⁇ W12) obtained in the previous cycle to the multiplier adder 2, and then the multiplier 1 performs multiplication of X01 ⁇ W13 and obtains The operation result (X01 ⁇ W13) and the operation result (X00 ⁇ W03) transmitted by the multiplier adder 0 are added to obtain an operation result (X00 ⁇ W03+X01 ⁇ W13); the multiplier 2 obtains the operation result of the previous cycle.
  • the calculation result (X00 ⁇ W02+X01 ⁇ W12) is added to obtain the operation result (X00 ⁇ W02+X01 ⁇ W12+X02 ⁇ W22); the multiplier 3 calculates the operation result obtained in the previous cycle (X00 ⁇ W00+X01) ⁇ W10+X02 ⁇ W20+X03 ⁇ W30) is transmitted to the multiplier adder 4, and then the multiplier 3 performs multiplication of X03 ⁇ W31 and the obtained operation result (X03 ⁇ W31) and the operation result transmitted by the multiplier adder 2 (X00 ⁇ W01+X01 ⁇ W11+X02 ⁇ W21) The addition result is obtained after the addition (X00 ⁇ W01+X01 ⁇ W11+X02 ⁇ W21+X03 ⁇ W31).
  • the multiplier 4 performs multiplication of X04 ⁇ W40 and adds the obtained operation result (X04 ⁇ W40) and the operation result transmitted by the multiplier 3 (X00 ⁇ W00+X01 ⁇ W10+X02 ⁇ W20+X03 ⁇ W30).
  • the operation result is obtained after the calculation (X00 ⁇ W00+X01 ⁇ W10+X02 ⁇ W20+X03 ⁇ W30+X04 ⁇ W40).
  • the multiplier 0 transfers the operation result (X00 ⁇ W04) obtained in the previous cycle to the adder 1, and then multiplies the multiplier by X0 ⁇ W05 by the adder 0 to obtain the operation result.
  • (X00 ⁇ W05) the multiplier 1 transmits the operation result (X00 ⁇ W03+X01 ⁇ W13) obtained in the previous cycle to the multiplier 2, and then the multiplier 1 performs multiplication of X01 ⁇ W14 and obtains
  • the operation result (X01 ⁇ W14) and the operation result (X00 ⁇ W04) transmitted by the multiplier adder 0 are added to obtain an operation result (X00 ⁇ W04+X01 ⁇ W14);
  • the multiplier 2 obtains the operation result obtained in the previous cycle ( X00 ⁇ W02+X01 ⁇ W12+X02 ⁇ W22) is transmitted to the multiplier adder 3, and then the multiplier 2 performs multiplication of X02 ⁇ W23 and the obtained operation result (X02 ⁇ W23) and the operation of the multiplier 1 are transmitted.
  • the result (X00 ⁇ W03+X01 ⁇ W13) is added to obtain the operation result (X00 ⁇ W03+X01 ⁇ W13+X02 ⁇ W23); the multiplier 3 calculates the operation result obtained in the previous cycle (X00 ⁇ W01+X01 ⁇ W11+X02 ⁇ W21+X03 ⁇ W31) is transmitted to the multiplier adder 4, and then the multiplier 3 performs multiplication of X03 ⁇ W32 and the obtained operation result (X03 ⁇ W32) and the operation result transmitted by the multiplier 2 ( X00 ⁇ W02+X0 1 ⁇ W12+X02 ⁇ W22) The addition result is obtained after the addition (X00 ⁇ W02+X01 ⁇ W12+X02 ⁇ W22+X03 ⁇ W32).
  • the multiplier 4 outputs the operation result (X00 ⁇ W00+X01 ⁇ W10+X02 ⁇ W20+X03 ⁇ W30+X04 ⁇ W40) obtained in the previous cycle to the output buffer for storage, and then the multiplier 4 executes X04 ⁇ W41.
  • the multiplication operation adds the obtained operation result (X04 ⁇ W41) and the operation result (X00 ⁇ W01+X01 ⁇ W11+X02 ⁇ W21+X03 ⁇ W31) transmitted by the multiplier 3 to obtain the operation result (X00). ⁇ W01+X01 ⁇ W11+X02 ⁇ W21+X03 ⁇ W31+X04 ⁇ W41).
  • at the 101st clock cycle (T100), multiply-accumulator 0 has completed the product of the first row of the piecewise convolution input matrix with the piecewise convolution kernel matrix and can begin multiplying the second row of the segmented convolution input matrix with the piecewise convolution kernel matrix.
  • if the second row of the piecewise convolution input matrix is X10 X11 X12 X13 X14, then in the 101st clock cycle (T100) multiply-accumulator 0 performs the multiplication X10 × W00, obtaining the result (X10×W00).
  • multiply-accumulator 0, multiply-accumulator 1, multiply-accumulator 2, multiply-accumulator 3, and multiply-accumulator 4 work as a pipeline until all the rows of the segmented convolution input matrix have completed the matrix product operation; the operation of the segmented convolution input matrix with the segmentation convolution kernel matrix is then complete, and the operation of the next piecewise convolution input matrix with its piecewise convolution kernel matrix is performed.
  • starting from the sixth clock cycle (T5), multiply-accumulator 4 outputs the result obtained in the previous cycle to the output buffer for storage every cycle. After the end of the 104th cycle (T103), multiply-accumulator 4 has completed the last multiply-accumulate of the product of the first row of the segmented convolution input matrix with the convolution kernel matrix, and in the 105th cycle it outputs the last result to the output buffer for storage, thereby completing the operation of the first row of the segmented convolution input matrix with the segmented convolution kernel matrix.
  • the multiplication of the second row of the piecewise convolution input matrix with the piecewise convolution kernel matrix begins at the 101st clock cycle (T100)
  • and proceeds in a manner similar to the multiplication of the first row of the segmentation convolution input matrix with the segmentation convolution kernel matrix, so the details are not described herein again.
  • the operations of the other rows of the segmented convolution input matrix with the segmentation convolution kernel matrix can also be completed by the five multiply-accumulators, thereby completing the operation of the segmented convolution input matrix with the segmentation convolution kernel matrix. Further, the operations of the other segmented convolution input matrices with their segmentation convolution kernel matrices can use the same logic circuits for the multiply-add operations.
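As a rough functional model of this pipeline, the sketch below simulates five chained multiply-accumulators computing one row of the input matrix against a 5 × 100 kernel matrix; it is a behavioral approximation of FIG. 5, not the hardware implementation, and all names are assumptions:

```python
def mac_pipeline(x, W):
    """Behavioral sketch of five chained multiply-accumulators computing one
    row of the piecewise convolution input matrix times a U x N kernel
    matrix: each cycle, MAC k multiplies x[k] by one weight and adds the
    partial sum handed over by MAC k-1 in the previous cycle."""
    U, N = len(W), len(W[0])              # U = 5 rows, N = 100 columns
    out, partial = [], [None] * U
    for t in range(N + U - 1):            # pipeline drains after N+U-1 cycles
        nxt = [None] * U
        for k in range(U):
            j = t - k                     # kernel-matrix column MAC k works on
            if 0 <= j < N:
                carry = partial[k - 1] if k > 0 else 0.0
                nxt[k] = x[k] * W[k][j] + carry
        if partial[U - 1] is not None:    # MAC 4 writes last cycle's result
            out.append(partial[U - 1])
        partial = nxt
    out.append(partial[U - 1])            # final write after the last cycle
    return out

x = [1.0, 2.0, 3.0, 4.0, 5.0]
W = [[float(r * 10 + c) for c in range(100)] for r in range(5)]
assert mac_pipeline(x, W) == [sum(x[r] * W[r][c] for r in range(5))
                              for c in range(100)]
```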
  • the number of rows of the segmented convolution kernel matrix is the minimum number of multiply-accumulators required.
  • the split size of the convolution kernel can be determined based on the available logic circuits of the convolution processor. For example, if the data volume of the convolution kernel is Q and the convolution kernel is evenly split into Y piecewise convolution kernels, the data volume of each piecewise convolution kernel is Q/Y. If the number of available logic circuits (for example, multiply-accumulators) of the convolution processor is large, Y can be set smaller, so that the number of rows (Q/Y) of the segmentation convolution kernel matrix is larger, satisfying faster processing needs.
  • when there are multiple convolution processors, they can also compute different segmentation convolution kernel matrices at the same time, further increasing the speed of the convolution operation. If there are fewer logic circuits available in the integrated chip, Y can be set larger, so that the number of rows (Q/Y) of the segmented convolution kernel matrix is smaller, saving logic circuits; this makes it possible to implement a multi-layer convolution operation while occupying fewer logic circuits.
  • convolution processor 123 stores the segmented convolution result into output buffer 126, including:
  • the convolution processor 123 accumulates the segmentation convolution result and the data stored in the output buffer 126 and writes it to the output buffer 126.
  • the convolution processor 123 performs a convolution operation on the first segmentation convolution kernel and the first segmentation convolution input data to obtain the segmentation convolution result of the first segment, and stores the first segment's segmentation convolution result in the output buffer 126;
  • the convolution processor 123 performs a convolution operation on the second segmentation convolution kernel and the second segmentation convolution input data to obtain the segmentation convolution result of the second segment,
  • accumulates the second segment's segmentation convolution result with the first segment's segmentation convolution result stored in the output buffer 126, and writes the sum to the output buffer 126.
  • the convolution processor 123 performs a convolution operation on the Y-th segmentation convolution kernel and the Y-th segmentation convolution input data to obtain the segmentation convolution result of the Y-th segment,
  • accumulates the Y-th segment's segmentation convolution result with the accumulated result of the segmentation convolution results of the first to (Y−1)-th segments stored in the output buffer 126, and then writes the sum to the output buffer 126.
  • after the convolution processor 123 completes each segmentation convolution operation, it accumulates the obtained segmentation convolution result with the data previously stored in the output buffer 126.
  • the segmentation convolution result accumulation is performed immediately after each segmentation convolution operation, so no accumulation pass is needed after all the segmentation convolution operations are completed, which can improve the processing efficiency of the entire convolution operation.
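A minimal sketch of this accumulate-as-you-go scheme, assuming each segmentation convolution is expressed as a matrix product; the names and sizes are illustrative:

```python
import numpy as np

def segmented_conv_accumulate(seg_inputs, seg_kernels):
    """Accumulate each segment's matrix product into the output buffer
    immediately after it is computed, so only one P x N buffer is held."""
    output_buffer = None
    for x_seg, w_seg in zip(seg_inputs, seg_kernels):
        partial = x_seg @ w_seg               # one segmentation convolution
        if output_buffer is None:
            output_buffer = partial           # first segment: store as-is
        else:
            output_buffer += partial          # later segments: accumulate
    return output_buffer

rng = np.random.default_rng(0)
xs = [rng.standard_normal((9, 4)) for _ in range(5)]   # Y = 5, P = 9, U = 4
ws = [rng.standard_normal((4, 3)) for _ in range(5)]   # U = 4, N = 3
assert np.allclose(segmented_conv_accumulate(xs, ws),
                   sum(x @ w for x, w in zip(xs, ws)))
```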
  • after all data of the convolution kernel has participated in the segmentation convolution operations, the convolution processor 123 performs post-processing calculation on the data stored in the output buffer 126 to obtain a post-processed convolution result;
  • the matrix corresponding to the post-processed convolution result is subjected to a transpose operation to obtain the transposed matrix of the matrix corresponding to the post-processed convolution result;
  • alternatively, the convolution processor 123 transposes the matrix corresponding to the data stored in the output buffer 126 to obtain the transposed matrix of the matrix corresponding to the data stored in the output buffer;
  • the controller 122 stores the convolution result corresponding to the above transposed matrix into the on-chip memory 125 as convolution input data.
  • the post-processing calculation includes at least one of an activation function calculation, a pooling calculation, and a normalization calculation, and whether the post-processing calculation is performed is determined according to the convolutional neural network algorithm. Since the piecewise convolution operation in the embodiment of the present application uses the piecewise convolution input data as the multiplier and the piecewise convolution kernel as the multiplicand, unlike the conventional approach of using the convolution kernel as the multiplier and the convolution input data as the multiplicand, the rows and columns of the data obtained by the segmentation convolution operation are interchanged, and the matrix obtained by the segmentation convolution operation needs to be transposed.
  • the convolution result obtained in the prior art has N rows and P columns
  • the convolution result obtained in the embodiment of the present application has P rows and N columns
  • the convolution result obtained in the embodiment of the present application is subjected to a matrix transpose operation to obtain the convolution result of N rows and P columns.
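The row/column interchange and the transpose that restores it can be verified with a small sketch (sizes follow the N = 100, P = 9801 example earlier; the NumPy formulation is an assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
kernel_nxu = rng.standard_normal((100, 8))    # conventional kernel matrix, N x U
input_uxp = rng.standard_normal((8, 9801))    # conventional input matrix,  U x P

prior_art = kernel_nxu @ input_uxp            # N rows, P columns

# This embodiment multiplies input (P x U) by kernel (U x N) instead,
# yielding P rows and N columns; one transpose restores the N x P layout.
embodiment = input_uxp.T @ kernel_nxu.T       # P x N
assert np.allclose(embodiment.T, prior_art)
```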
  • the integrated chip 12 in FIG. 2A can segment the convolution kernel and the convolution input data; by splitting the convolution kernel and the convolution input data, it does not need to repeatedly read the segmented convolution kernel and segmented convolution input data into the input buffer when performing one piecewise convolution operation, which can reduce the bandwidth between the memory and the input buffer 124 when performing convolution operations.
  • an input buffer 124 with a small storage space can be used, reducing the requirement of the convolutional neural network algorithm on the input buffer 124; and, because both the segmentation convolution kernel and the segmented convolution data are small, the convolution operation can still be performed when the convolution processor 123 has a limited number of logic circuits.
  • the bandwidth of a device can refer to the data transfer capability between two devices.
  • the bandwidth between the memory and the input buffer 124 can be understood as the speed at which the input buffer 124 reads data from the memory and the speed at which the memory reads data from the input buffer 124. If the bandwidth is larger, the read speed is faster.
  • the unit of bandwidth can be Gb/s.
  • the convolution operation in the embodiment of the present application can be applied to the field of image processing, for example, to application scenarios such as image recognition, image classification, and image filtering.
  • the input image data is subjected to operations such as convolution operation, pooling, normalization, and finally, after the full connection layer and the regression operation, the image processing result is output.
  • the convolution operation can have multiple layers, and each layer of convolution operation is a convolution operation of this layer of convolution input data with this layer of convolution kernel. The result of the convolution operation of each layer can be used as the convolution input data of the next layer convolution operation.
  • the convolution input data of the first layer convolution operation is input image data, and the first layer convolution operation performs convolution operation on the input image data and the convolution kernel of the first layer.
  • the input image data may be data of all pixels in one image (for example, gray value, RGB value, etc.), such as 1000 ⁇ 600 ⁇ 3 (3 is RGB value) composed of 1000 ⁇ 600 pixels.
  • the convolution kernel and the convolution input data of each layer can be split to obtain a plurality of segmentation convolution kernels and a corresponding plurality of pieces of segmentation convolution input data.
  • FIG. 6 is a schematic flowchart of a convolution operation processing method according to an embodiment of the present application. As shown in FIG. 6, the convolution operation processing method includes the following steps.
  • 601: load the segmentation convolution kernel and the segmentation convolution input data into the input buffer; the segmentation convolution kernel is data obtained by segmenting the convolution kernel, and the segmentation convolution input data is data obtained by segmenting the convolution input data.
  • 602: perform a segmentation convolution operation on the segmentation convolution kernel and the segmentation convolution input data to obtain a segmentation convolution result, and store the segmentation convolution result in the output buffer.
  • the execution body of step 601 and step 602 may be the integrated chip shown in FIG. 1A or FIG. 1B.
  • the integrated chip is used for convolution operations, and the integrated chip includes a controller for convolution operations, a convolution processor, an input buffer, an output buffer, an on-chip memory, and the like.
  • the execution body of step 601 and step 602 may also be a device including a central processing unit, an input/output device, a memory, and a communication bus, wherein the central processing unit, the input/output device, and the memory are connected through a communication bus.
  • the central processing unit is used for convolution operations and includes logic circuits for performing convolution operations, such as multipliers, adders, and multiply-accumulators; the memory is used to store the segmentation convolution kernel and the segmentation convolution input data.
  • the execution body of the convolution operation processing method in FIG. 6 will be described by taking an integrated chip shown in FIG. 1A or FIG. 1B as an example.
  • before performing step 601, the following steps may also be performed:
  • the convolution input data and the convolution kernel are received, the convolution input data is segmented to obtain segmented convolution data, and the convolution kernel is segmented to obtain a segmentation convolution kernel.
  • the convolution kernel may be segmented by the server 11 in FIG. 1A or FIG. 1B, or may be segmented by the integrated chip 12.
  • the convolution input data can be segmented by the integrated chip 12.
  • when the convolution kernel is segmented by the server, the integrated chip does not need to segment the convolution kernel and can quickly perform the segmentation convolution operation, thereby improving the processing efficiency of the convolution operation.
  • when the convolution kernel is segmented by the integrated chip, the server 11 does not need to segment the convolution kernel, and the burden on the server 11 can be alleviated.
  • the number of rows of the matrix corresponding to the segmentation convolution kernel and the number of columns of the matrix corresponding to the segmentation convolution input data correspond to the number of logic circuits performing convolution operations.
  • the number of rows of the matrix corresponding to the segmentation convolution kernel and the number of columns of the matrix corresponding to the segmentation convolution input data are corresponding to the number of logic circuits performing convolution operations, including:
  • the number of rows of the matrix corresponding to the segmentation convolution kernel is less than or equal to the number of multipliers performing the convolution operation; the number of columns of the matrix corresponding to the segmentation convolution input data is less than or equal to the number of multipliers performing the convolution operation.
  • when the convolution operation is performed by multiply-accumulators, the result of a multiplication can be added to another operand, and the multiplication and addition can be performed in one clock cycle to reduce the execution delay of the entire multiply-add operation.
  • performing the segmentation convolution operation on the segmentation convolution kernel and the segmentation convolution input data includes:
  • performing a product operation with the piecewise convolution input matrix as the multiplier and the piecewise convolution kernel matrix as the multiplicand.
  • using the piecewise convolution input matrix as the multiplier and the piecewise convolution kernel matrix as the multiplicand can reduce the storage space of the input buffer 124 occupied by each matrix product operation. Since both the convolution kernel and the convolution input data are segmented, the number of rows of the segmented convolution kernel matrix is significantly smaller than the number of columns of the matrix expanded from the unsplit convolution kernel, and the convolution processor only needs available logic circuits equal to or greater than the number of rows of the segmented convolution kernel matrix to accomplish a segmentation convolution operation. By implementing the embodiments of the present application, the convolution operation can be completed with limited available logic circuits, saving logic circuits for convolution operations.
  • storing the segmentation convolution result described above to the output buffer 126 includes:
  • the segmentation convolution result is accumulated with the data stored in the output buffer, and then written into the output buffer 126.
  • the segmentation convolution result is accumulated immediately after each segmentation convolution operation, and the segmentation convolution result accumulation is not required after all the segmentation convolution operations are completed, thereby improving the processing efficiency of the entire convolution operation.
  • the input buffer includes a first storage space and a second storage space, the first storage space and the second storage space being respectively used to store data; loading the segmentation convolution kernel into the input buffer in step 601 includes:
  • after the data in the second storage space has finished participating in a segmentation convolution operation, moving the data in the first storage space to the second storage space, and then loading the segmentation convolution kernel into the first storage space.
  • the data in the first storage space is moved to the second storage space, and the original data in the second storage space can be overwritten.
  • Loading the segmentation convolution kernel into the first storage space can overwrite the original data in the first storage space.
  • the above data may be a segmentation convolution kernel; the segmentation convolution kernels are stored in the input buffer 124 in a "ping-pong storage" manner, which always ensures that the input buffer 124 stores two segmentation convolution kernels. After completing one piecewise convolution operation, the convolution processor 123 can quickly perform the next piecewise convolution operation without waiting for the loading of the next operation's segmentation convolution kernel, which can improve the processing efficiency of the convolution operation.
  • if the first storage space and the second storage space are empty, loading the segmentation convolution kernel into the input buffer in step 601 includes:
  • loading the segmentation convolution kernel into the second storage space, and loading another segmentation convolution kernel other than the segmentation convolution kernel into the first storage space.
  • when the first storage space and the second storage space are empty, data is being loaded into the input buffer 124 for the first time
  • and two segmentation convolution kernels can be loaded into the input buffer 124 at one time.
  • FIG. 7 is a schematic flowchart diagram of another convolution operation processing method provided by an embodiment of the present application.
  • FIG. 7 is further optimized on the basis of FIG. 6.
  • the convolution operation processing method includes:
  • 701: load the segmentation convolution kernel and the segmentation convolution input data into the input buffer; the segmentation convolution kernel is data obtained by segmenting the convolution kernel
  • and the segmentation convolution input data is data obtained by segmenting the convolution input data.
  • the post-processing calculation includes at least one of an activation function calculation, a pooling calculation, and a normalization calculation, and whether the post-processing calculation is performed is determined according to a convolutional neural network algorithm.
  • pooling calculations are used to reduce the data in the convolution result, removing some redundant data. For example, for 24 × 24 original image data convolved with a 5 × 5 convolution kernel, the convolution result is 20 × 20, and after 2 × 2 pooling, the final result becomes 10 × 10.
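The dimensions in this example can be checked with a few lines (a sketch; the function names are illustrative):

```python
def conv_out(n, k):      # output side length of a valid K x K convolution
    return n - k + 1

def pool_out(n, p):      # output side length of non-overlapping p x p pooling
    return n // p

side = conv_out(24, 5)
print(side, pool_out(side, 2))   # 20 10
```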
  • the piecewise convolution operation in the embodiment of the present application uses the piecewise convolution input data as the multiplier and the piecewise convolution kernel as the multiplicand, unlike the conventional approach of using the convolution kernel as the multiplier and the convolution input data as the multiplicand;
  • the rows and columns of the data obtained by the segmentation convolution operation are therefore interchanged, and the matrix obtained by the segmentation convolution operation needs to be transposed.
  • the convolution result obtained in the prior art has N rows and P columns
  • the convolution result obtained in the embodiment of the present application has P rows and N columns
  • the convolution result obtained in the embodiment of the present application is subjected to a matrix transpose operation to obtain the convolution result of N rows and P columns.
  • Step 701 to step 702 can refer to step 601 to step 602 shown in FIG. 6 , and details are not described herein again.
  • FIG. 8 is a schematic structural diagram of a convolution operation processing apparatus according to an embodiment of the present application.
  • the convolution operation processing device includes a loading unit 801, a segmentation convolution unit 802, and a storage unit 803, wherein:
  • the loading unit 801 is configured to load the segmentation convolution kernel and the segmentation convolution input data into the input buffer; the segmentation convolution kernel is data obtained by convolution kernel segmentation, and the segmentation convolution input data is Convolution input data segmentation data.
  • the segmentation convolution unit 802 is configured to perform a segmentation convolution operation on the segmentation convolution kernel and the segmentation convolution input data to obtain a segmentation convolution result.
  • the storage unit 803 is configured to store the segmentation convolution result to the output buffer.
  • with the convolution operation processing device shown in FIG. 8, when performing one segmentation convolution operation, it is not necessary to repeatedly read the segmentation convolution kernel and the segmentation convolution input data from the on-chip memory into the input buffer,
  • which can reduce the bandwidth between the on-chip memory and the input buffer when performing convolution operations.
  • FIG. 9 is a schematic structural diagram of another convolution operation processing apparatus according to an embodiment of the present application.
  • the convolution operation processing apparatus may include a memory 901, a processor 902, and an input/output device 903.
  • the memory 901, the processor 902, and the input/output device 903 may be connected through a communication bus 904.
  • the memory 901 is for storing program instructions, and the program instructions are adapted to be loaded by the processor 902; the input and output device 903 can be used to receive convolutional input data and for outputting convolution processing results.
  • the processor 902 is configured to load program instructions and perform some or all of the method steps in Figures 6-7 above.
  • with the convolution operation processing device shown in FIG. 9, when performing one segmentation convolution operation, it is not necessary to repeatedly read the segmentation convolution kernel and the segmentation convolution input data from the on-chip memory into the input buffer,
  • which can reduce the bandwidth between the on-chip memory and the input buffer when performing convolution operations.
  • the embodiment of the present application further provides a computer storage medium, wherein the computer storage medium stores a plurality of program instructions, the program instructions being adapted to be loaded by a processor to execute some or all of the steps of any convolution operation processing method described in the foregoing method embodiments.
  • the units in the convolution operation processing apparatus of the embodiment of the present application can be combined, divided, and deleted according to actual needs.
  • ROM Read-Only Memory
  • RAM Random Access Memory
  • PROM Programmable Read-Only Memory
  • EPROM Erasable Programmable Read Only Memory
  • OTPROM One-Time Programmable Read-Only Memory
  • EEPROM Electronically-Erasable Programmable Read-Only Memory
  • CD-ROM Compact Disc Read-Only Memory


Abstract

A convolution operation processing method and related products are provided. An integrated chip includes a controller, a convolution processor, an input buffer, and an output buffer. The controller loads a segmentation convolution kernel and segmentation convolution input data into the input buffer; the segmentation convolution kernel is data obtained by segmenting a convolution kernel, and the segmentation convolution input data is data obtained by segmenting convolution input data. The convolution processor performs a segmentation convolution operation on the segmentation convolution kernel and the segmentation convolution input data to obtain a segmentation convolution result, and stores the segmentation convolution result in the output buffer.

Description

Convolution Operation Processing Method and Related Products
This application claims priority to Chinese Patent Application No. 201711283173.9, entitled "Convolution operation processing method and related products" and filed with the Chinese Patent Office on December 6, 2017, the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of computer technology, and in particular to a convolution operation processing method and related products.
Background
In the field of artificial intelligence, the mainstream artificial intelligence algorithm at present is deep learning. Within deep learning, the Convolutional Neural Network (CNN) has become one of the most widely applied neural networks because of its outstanding results in the image domain. Like other deep learning methods, CNN requires a large amount of computation. To improve the computational efficiency of CNN, an integrated chip is generally used for CNN calculation, and the core of the CNN calculation is the convolution operation. When performing the convolution operation, the integrated chip expands the convolution kernel into a convolution kernel matrix, expands the convolution input data into a convolution input matrix, and performs a matrix product operation on one row of the convolution kernel matrix and one column of the convolution input matrix.
Summary
The embodiments of the present application disclose a convolution operation processing method and related products, which can reduce the number of circuits required.
An embodiment of the present application provides an integrated chip, the integrated chip including a controller, a convolution processor, an input buffer, and an output buffer;
the controller loads a segmentation convolution kernel and segmentation convolution input data into the input buffer; the segmentation convolution kernel is data obtained by segmenting a convolution kernel; the segmentation convolution input data is data obtained by segmenting convolution input data;
the convolution processor performs a segmentation convolution operation on the segmentation convolution kernel and the segmentation convolution input data to obtain a segmentation convolution result, and stores the segmentation convolution result in the output buffer.
An embodiment of the present application provides a convolution operation processing method, including:
loading a segmentation convolution kernel and segmentation convolution input data into an input buffer, where the segmentation convolution kernel is data obtained by segmenting a convolution kernel and the segmentation convolution input data is data obtained by segmenting the convolution input data; and
performing a segmentation convolution operation on the segmentation convolution kernel and the segmentation convolution input data to obtain a segmentation convolution result, and storing the segmentation convolution result in an output buffer.
An embodiment of the present application provides a convolution operation processing apparatus, including a memory and a processor, where the memory is configured to store program instructions adapted to be loaded by the processor;
the processor is configured to load the program instructions and execute the convolution operation processing method described in the second aspect of the embodiments of the present application.
An embodiment of the present application provides a storage medium storing a plurality of program instructions adapted to be loaded by a processor to execute the convolution operation processing method of the embodiments of the present application.
An embodiment of the present application provides a computer program product including a non-transitory computer-readable storage medium storing a computer program, the computer program being operable to cause a computer to execute the convolution operation processing method of the embodiments of the present application.
In the embodiments of the present application, when the integrated chip performs a convolution operation, a segmentation convolution operation is adopted: a segmentation convolution operation is performed on the segmentation convolution kernel and the segmentation convolution input data to obtain a segmentation convolution result. Because the segmentation convolution kernel is data obtained by segmenting the convolution kernel and the segmentation convolution input data is data obtained by segmenting the convolution input data, both become smaller, so the segmentation convolution operation can be implemented with fewer circuits, which reduces the number of circuits required for the convolution operation.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the accompanying drawings used in the embodiments are briefly introduced below.
FIG. 1A is a schematic diagram of a hardware architecture according to an embodiment of the present application;
FIG. 1B is a schematic diagram of another hardware architecture according to an embodiment of the present application;
FIG. 1C is a schematic diagram of a convolutional neural network algorithm model according to an embodiment of the present application;
FIG. 2A is a schematic structural diagram of an integrated chip according to an embodiment of the present application;
FIG. 2B is a schematic structural diagram of another integrated chip according to an embodiment of the present application;
FIG. 3 is a schematic diagram of the calculation process of a conventional convolution operation according to an embodiment of the present application;
FIG. 4A is a schematic diagram of the segmentation processing of a convolution kernel and convolution input data according to an embodiment of the present application;
FIG. 4B is a schematic flowchart of a segmentation convolution operation according to an embodiment of the present application;
FIG. 4C is a schematic flowchart of the matrix product operations corresponding to a segmentation convolution operation according to an embodiment of the present application;
FIG. 5 is a schematic diagram of the operation of the logic circuits performing a matrix product operation on one row of a segmentation convolution input matrix and a segmentation convolution kernel matrix according to an embodiment of the present application;
FIG. 6 is a schematic flowchart of a convolution operation processing method according to an embodiment of the present application;
FIG. 7 is a schematic flowchart of another convolution operation processing method according to an embodiment of the present application;
FIG. 8 is a schematic structural diagram of a convolution operation processing apparatus according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of another convolution operation processing apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
The embodiments of the present application provide a convolution operation processing method and apparatus that segment the convolution kernel and the convolution input data, which can reduce the bandwidth between the memory and the input buffer when performing convolution operations.
The convolution operation processing method of the embodiments of the present application can be executed by various computing platforms. A computing platform may perform its computation purely with a central processing unit (CPU), or it may be a heterogeneous computing platform including a processor and an integrated chip. On a CPU-only computing platform, the convolution operation can be executed by the CPU. On a heterogeneous computing platform, the controller of the computing platform can hand the matrix product operations required by the convolutional neural network algorithm to the integrated chip (for example, an FPGA/ASIC), and can also hand other operations of the convolutional neural network algorithm, such as the activation function, pooling, and normalization calculations, to the integrated chip. The convolution operation processing method of the embodiments may include:
loading a segmentation convolution kernel and segmentation convolution input data into an input buffer, where the segmentation convolution kernel is data obtained by segmenting the convolution kernel and the segmentation convolution input data is data obtained by segmenting the convolution input data; and
performing a segmentation convolution operation on the segmentation convolution kernel and the segmentation convolution input data to obtain a segmentation convolution result, and storing the segmentation convolution result in an output buffer.
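Read as a control skeleton, the two steps might look like the following sketch; the Buffer class and all names are assumptions made for illustration, not the claimed implementation:

```python
import numpy as np

class Buffer:
    """Toy stand-in for the input/output buffers."""
    def __init__(self):
        self.data = None
    def load(self, *items):
        self.data = items
    def accumulate(self, partial):
        self.data = partial if self.data is None else self.data + partial

def convolution_layer(seg_kernels, seg_inputs):
    input_buffer, output_buffer = Buffer(), Buffer()
    for w, x in zip(seg_kernels, seg_inputs):
        input_buffer.load(w, x)            # step 1: load the segment pair
        partial = x @ w                    # segmentation convolution (matmul)
        output_buffer.accumulate(partial)  # step 2: store / accumulate
    return output_buffer.data

ws = [np.ones((4, 3)) for _ in range(2)]   # two U x N kernel segments
xs = [np.ones((5, 4)) for _ in range(2)]   # two P x U input segments
print(convolution_layer(ws, xs).shape)     # (5, 3)
```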
By segmenting the convolution kernel, the amount of kernel data and input data used in each convolution operation can be reduced, which reduces the size of the integrated chip's input buffer and lowers the bandwidth and input buffer requirements that the convolutional neural network algorithm places on the device.
In the embodiments, the computing platform may determine the data volume of the segmentation convolution kernel according to the size of the input buffer. For example, a computing platform with limited memory resources may determine the size of the input buffer used for convolution operations according to memory usage, then determine the size of the segmentation convolution kernel according to the allocated input buffer size, and segment the convolution kernel accordingly. In this case, the above method may further include:
obtaining a convolution kernel, the convolution kernel including a plurality of convolution elements arranged in an N-dimensional space, N being a positive integer;
splitting the convolution kernel in the N-dimensional space according to the size of the input buffer to obtain a plurality of segmentation convolution kernels, where each segmentation convolution kernel includes a plurality of adjacent convolution elements in the N-dimensional space and the data volume of each segmentation convolution kernel is smaller than the size of the input buffer.
By determining the size of the segmentation convolution kernel according to the buffer size, the size can be adjusted to the actual buffer usage, so that the convolution calculation can adapt to the computing platform's actual buffer usage.
The size of the segmentation convolution kernel is also related to the number of multiplication resources (for example, multiplication units or multiplication circuits) available for the convolution operation on the computing platform. In some embodiments, the convolution kernel may be split as follows:
obtaining a preset number of multiplication units used for the segmentation convolution operation, and determining a first number of convolution elements in each segmentation convolution kernel according to the number of multiplication units;
obtaining the size of the input buffer, and determining a second number of convolution elements in each segmentation convolution kernel according to the size of the input buffer;
splitting the convolution kernel in the N-dimensional space to obtain a plurality of segmentation convolution kernels, where the number of convolution elements in each segmentation convolution kernel is the smaller of the first number and the second number.
By jointly considering computation resources and buffer resources when segmenting the convolution kernel, the convolution calculation can adapt to both the computation resources and the buffer resources of the computing platform, so that the technical solution can be applied on computing platforms that are sensitive to the occupation of computation and buffer resources.
In some embodiments, the convolution kernel may be segmented as follows:
determining a first direction, among the N directions of the N-dimensional space, in which the convolution input data has the fewest data elements; and cutting the convolution kernel along the first direction to obtain the plurality of segmentation convolution kernels.
By cutting the convolution kernel along the dimension in which the convolution input data has the fewest data elements, the amount of data transferred during the convolution operation can be reduced, thereby lowering the read/write bandwidth.
In some embodiments, the input buffer may be used to buffer the required segmentation convolution kernels and segmentation convolution input data. When the computing platform performs the convolution operations of multiple segmentation convolution kernels on the same segmentation convolution data in parallel, an independent buffer space may be allocated to the convolution operation of each segmentation convolution kernel, or a shared buffer space may be allocated to the parallel convolution operations of the segmentation convolution kernels.
After the segmentation convolution kernels are obtained, the convolution input data may be segmented according to the information of the segmentation convolution kernels to obtain segmentation convolution input data. For example, the method of the embodiments may further include:
obtaining convolution input data, the convolution input data including a plurality of data elements arranged in an N-dimensional space, N being a positive integer;
segmenting the convolution input data to obtain a plurality of pieces of segmentation convolution input data, where the number and the N-dimensional arrangement of the data elements in each piece of segmentation convolution input data are the same as the number and arrangement of the convolution elements in each segmentation convolution kernel, and the plurality of pieces of segmentation convolution input data include a group of segmentation convolution input data corresponding to each of the segmentation convolution kernels;
loading a first segmentation convolution kernel of the segmentation convolution kernels and first segmentation convolution input data in a first group of segmentation convolution input data corresponding to the first segmentation convolution kernel into the input buffer; and
after the convolution processor performs a segmentation convolution operation on the first segmentation convolution input data and the first segmentation convolution kernel, loading second segmentation convolution input data in the first group into the input buffer to replace the first segmentation convolution input data, so that the convolution processor performs a segmentation convolution operation on the second segmentation convolution input data and the first segmentation convolution kernel.
In some embodiments, a plurality of segmentation convolution operations may be executed in parallel on the computing platform; it is possible to:
load the first segmentation convolution input data of the plurality of pieces of segmentation convolution input data into the input buffer corresponding to each convolution processor, and load the plurality of first segmentation convolution kernels corresponding to the first segmentation convolution input data into the input buffer corresponding to each convolution processor, so that each convolution processor performs a convolution operation on the first segmentation convolution input data and its own first segmentation convolution kernel.
When the parallel convolution operations of the plurality of segmentation convolution kernels share the input buffer space, loading the segmentation convolution input data into the input buffer corresponding to each segmentation convolution operation in fact loads the segmentation convolution input data into the shared input buffer.
By performing the convolution operations of multiple segmentation convolution kernels with the same segmentation convolution input data in parallel when executing a plurality of segmentation convolution operations, repeated data exchanges can be reduced, achieving parallel and efficient convolution operations.
In some embodiments, after the convolution operation of the current segmentation convolution input data is completed, only the part of the next segmentation convolution input data that differs from the currently buffered segmentation convolution input data may be loaded into the input buffer, further reducing the amount of data loaded and improving efficiency. For example, the part of the second segmentation convolution input data of the plurality of pieces of segmentation convolution input data that differs from the first segmentation convolution input data may be loaded into the input buffer to form the second segmentation convolution input data in the input buffer.
In some embodiments, the convolution operations of multiple segmentation convolution kernels corresponding to the same group of segmentation convolution input data may be executed in parallel by multiple convolution circuits, or carried out in batches by one or more convolution circuits, and the segmentation convolution results may be accumulated directly in the output buffer, further saving output buffer space. Here, the same group of segmentation convolution input data refers to the data elements of the same row or rows in the input data matrix, for example the data corresponding to one or several rows of pixels in an image; the plurality of segmentation convolution kernels corresponding to the same group of segmentation convolution input data may be the segmentation convolution kernels corresponding to the same column or columns of the convolution kernel matrix. For example, the method of the embodiments may further include:
after the convolution processor completes the segmentation convolution operations of the first segmentation convolution kernel and the first group of segmentation convolution input data, loading a second segmentation convolution kernel of the segmentation convolution kernels and third segmentation convolution input data in a second group of segmentation convolution input data corresponding to the second segmentation convolution kernel into the input buffer, so that the convolution processor performs a segmentation convolution operation on the second segmentation convolution kernel and the third segmentation convolution input data;
the convolution processor is configured to: accumulate the segmentation convolution result of the second segmentation convolution kernel and the third segmentation convolution input data into a second segmentation convolution result stored in the output buffer, the second segmentation convolution result being the segmentation convolution result corresponding to the data elements of the same row in the convolution input data.
In some embodiments, when replacing the segmentation convolution kernel in the input buffer, only the part of the new segmentation convolution kernel that differs from the segmentation convolution kernel currently stored in the input buffer may be loaded into the input buffer, reducing the amount of data transferred. That is, in the above method, the part of the second segmentation convolution kernel that differs from the first segmentation convolution kernel may be loaded into the input buffer to form the second segmentation convolution kernel.
In some embodiments, when the computing platform includes on-chip memory (that is, the on-chip memory of a microprocessor such as a CPU or FPGA), the on-chip memory may be used to temporarily store the convolution input data required for a plurality of convolution operations, reducing the number of interactions between the processor and the external memory. In this case, the method may include: extracting, from original convolution input data stored in off-chip memory, the data used for a plurality of convolution operations as the convolution input data, and loading the convolution input data into the on-chip memory embedded in the computing platform.
In some embodiments, after the convolution operation of the current segmentation convolution input data is completed, only the part of the next segmentation convolution input data that differs from the segmentation convolution input data currently stored in the on-chip memory may be loaded into the on-chip memory, further reducing the amount of data exchanged and improving efficiency. In this case, the method may include: extracting second data used for a plurality of convolution operations from the original convolution input data, loading the part of the second data that differs from the convolution input data currently stored in the on-chip memory into the on-chip memory to form the second data, and using the second data in the on-chip memory as the convolution input data for the convolution operation.
In some embodiments, before the convolution operation of the current segmentation convolution input data is completed, the part of the next segmentation convolution input data that differs from the segmentation convolution input data currently stored in the on-chip memory may be loaded into the on-chip memory in advance, reducing the time spent waiting for data loading and improving processing efficiency. In this case, the method may include: before the convolution operation of the convolution input data currently stored in the on-chip memory finishes executing, extracting second data used for a plurality of convolution operations from the original convolution input data, and loading the part of the second data that differs from the convolution input data currently stored in the on-chip memory into the on-chip memory to form the second data;
after the convolution operation of the convolution input data currently stored in the on-chip memory finishes executing, using the second data in the on-chip memory as the convolution input data for the convolution operation.
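A toy sketch of this "load only the differing part" idea, using lists of row tags in place of real memory blocks; the function and comparison scheme are assumptions, and a real controller would track address ranges rather than comparing values:

```python
def load_delta(on_chip, next_block):
    """Keep the overlapping part already on chip and transfer only the rows
    of the next block that are not present yet."""
    kept = [row for row in on_chip if row in next_block]
    new = [row for row in next_block if row not in on_chip]
    return kept + new, len(new)

current = ["row0", "row1", "row2", "row3"]
incoming = ["row2", "row3", "row4", "row5"]     # overlapping sliding window
loaded, transferred = load_delta(current, incoming)
print(loaded, transferred)    # ['row2', 'row3', 'row4', 'row5'] 2
```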
When the methods of the embodiments are implemented on a heterogeneous computing platform, the controller of the computing platform can hand the matrix product operations required by the convolutional neural network algorithm to the integrated chip (for example, an FPGA/ASIC). The controller of the computing platform can then segment the convolution kernel and the convolution input data and load them into the integrated chip at suitable times, that is, execute the methods of the above embodiments, thereby implementing the convolution operation.
In some embodiments, the heterogeneous computing platform may be as shown in FIG. 1A. FIG. 1A is a schematic diagram of a hardware architecture according to an embodiment of the present application. As shown in FIG. 1A, the hardware architecture includes a server 11 and an integrated chip 12. The server 11 includes a central processing unit (CPU) 111 and an external memory 112. The server 11 and the integrated chip 12 are connected through a Peripheral Component Interconnect Express (PCIE) interface 13. The integrated chip 12 may include a controller (control unit) 122, a convolution processor (processing element, PE) 123, an input buffer 124, an on-chip memory 125, and an output buffer 126. The controller 122 is bidirectionally connected with the convolution processor 123, the input buffer 124, and the output buffer 126; the on-chip memory 125 is connected to the input buffer 124 and can feed data into the input buffer 124; the input buffer 124 is connected to the convolution processor 123 and can feed data into the convolution processor 123; the convolution processor 123 is connected to the output buffer 126 and can output data to the output buffer 126; and the output buffer 126 is connected to the on-chip memory 125 and can output data to the on-chip memory 125.
In the embodiments of the present application, the integrated chip 12 may be a Field Programmable Gate Array (FPGA) or an Application Specific Integrated Circuit (ASIC). As a coprocessor of the server 11, the integrated chip 12 obtains data from the external memory 112 of the server 11, places the obtained data into the on-chip memory 125, and notifies the controller 122 that the data has been obtained from the external memory 112. The controller 122 causes the input buffer 124 to fetch the data to be calculated from the on-chip memory 125, and the convolution processor 123 performs calculations on that data. The convolution processor 123 places the calculation results into the output buffer 126. The output buffer 126 outputs the calculation results to the on-chip memory 125, from which the server 11 can read the calculation results.
Referring to FIG. 1B, FIG. 1B is a schematic diagram of another hardware architecture provided by an embodiment of the present application. FIG. 1B is a further improvement on FIG. 1A; the difference from FIG. 1A is that a Direct Memory Access (DMA) 121 is added to the integrated chip 12. The DMA 121 is bidirectionally connected to the controller 122 and the on-chip memory 125, and the integrated chip 12 can obtain data from the external memory 112 of the server 11 through the DMA 121. The DMA 121 places the obtained data into the on-chip memory 125 and notifies the controller 122 that the data has been obtained from the external memory 112. The server 11 can also read calculation results from the on-chip memory 125 of the integrated chip 12 through the DMA 121.
With the DMA 121 adopted in FIG. 1B, the controller 122 and the convolution processor 123 need not be interrupted when the DMA 121 reads data from the server 11, so using the DMA 121 can improve the data transfer efficiency between the integrated chip 12 and the server 11.
Referring to FIG. 1C, FIG. 1C is a schematic diagram of a convolutional neural network algorithm model provided by an embodiment of the present application. As shown in FIG. 1C, the convolutional neural network algorithm model can process image data. The input image data undergoes convolution, pooling, normalization, and other operations, and finally, after the fully connected layer and regression (softmax) operations, the image processing result judged to be a "dog" image is output. In the calculation process of the whole convolutional neural network algorithm, the convolution operation requires multiple layers of computation and accounts for the largest portion of the total computation. The convolutional neural network algorithm in the embodiments of the present application can be used for various kinds of image recognition, such as image classification and image filtering. For example, an applicable business scenario of the embodiments may be the target scenario of detecting and filtering pornographic images. The convolution operation in the embodiments may be implemented by a deep learning platform, which may include Convolutional Architecture for Fast Feature Embedding (Caffe), second-generation artificial intelligence learning systems (for example, TensorFlow), and the like. The deep learning platform may call the Basic Linear Algebra Subprograms (BLAS) library to perform matrix product operations.
In the hardware architectures shown in FIG. 1A and FIG. 1B, the convolution processor 123 in the integrated chip 12 can be used to handle convolution operations; there may be multiple convolution processors 123, which can process multiple convolution operations in parallel at the same time.
In a convolutional neural network, a convolution operation means convolving a convolution kernel with convolution input data. The number and size of the convolution kernels are determined by the convolutional neural network algorithm. In general, for the same convolutional neural network algorithm, the convolution kernels of each convolution layer differ, as does the convolution input data of each layer. In a logic circuit implementation, to make it convenient for the logic circuits to convolve the convolution input data with the convolution kernel, the convolution input data is generally expanded directly into a convolution input matrix and the convolution kernel directly into a convolution kernel matrix, turning the convolution into a matrix product operation (multiplications and additions) that logic circuits can easily implement. During the matrix product operation, one row of the convolution kernel matrix and one column of the convolution input matrix are read from the on-chip memory 125 into the input buffer 124 each time, and the convolution processor 123 performs a matrix product on that row of the convolution kernel matrix and that column of the convolution input matrix. Because the data volume of the convolution input matrix is usually large, the data volume and computation required by the matrix product are very large, so the number of logic circuits of the integrated chip 12 needed for the matrix product is enormous; the logic circuits may include at least one of multipliers, adders, and multiply-accumulators. Meanwhile, when the convolution kernel matrix and the convolution input matrix perform the matrix product, every row of the kernel matrix must be multiplied with every column of the input matrix, causing repeated reading of the convolution input matrix data and placing high bandwidth demands between the on-chip memory 125 and the input buffer 124.
Based on the hardware architecture shown in FIG. 1A and the convolutional neural network algorithm model shown in FIG. 1C, an integrated chip is provided. Referring to FIG. 2A, FIG. 2A is a schematic structural diagram of an integrated chip provided by an embodiment of the present application. As shown in FIG. 2A, the integrated chip 12 includes a controller 122, a convolution processor 123, an input buffer 124, and an output buffer 126.
In some embodiments, the integrated chip 12 may also be based on the hardware architecture shown in FIG. 1B, and may further include a DMA 121 bidirectionally connected to the controller 122. The DMA 121 need not interrupt the controller 122 and the convolution processor 123 when reading data from a device external to the integrated chip, so using the DMA 121 can improve the data transfer efficiency between the integrated chip and external devices.
During the convolution operation, the number of convolution layers and the size and number of the convolution kernels of each layer are determined by the convolutional neural network algorithm. For different convolutional neural network algorithms, the number of convolution layers is not necessarily the same, nor is the size of the kernels of each layer, nor the number of kernels of each layer, nor whether post-processing is required after each layer. The post-processing calculation includes at least one of activation function calculation, pooling calculation, and normalization calculation, and whether it is performed is determined by the convolutional neural network algorithm. Once the convolutional neural network algorithm is determined, the number of convolution layers and the size and number of the kernels of each layer are all determined. In a multi-layer convolution operation, the convolution is performed layer by layer: first the first layer is performed; after the first layer's result is obtained, if the neural network algorithm requires post-processing, post-processing is performed and its result serves as the convolution input data of the second layer's convolution operation; if the algorithm does not require post-processing, the first layer's result itself serves as the second layer's convolution input data. The second layer is then performed, and so on, until the last layer finishes, completing the multi-layer convolution. Within one layer's convolution operation, if there are N convolution kernels, each of the N kernels may be convolved with the convolution input data. Note that the convolution kernels may be learned from training data by the deep learning platform; the convolution input data of the first layer is the initial data for the convolution operation, for example image data to be processed. In the multi-layer convolution above, the result of each layer may undergo a rearrangement operation before serving as the next layer's convolution input data; when matrices are used as the convolution input, the rearrangement can usually be implemented with a transpose operation.
In the embodiments of the present application, the convolution kernel and the convolution input data are three-dimensional and can be understood as three-dimensional data blocks. The convolution result of one kernel with the convolution input data is two-dimensional; since there are often multiple kernels, the result of convolving multiple kernels with the convolution input data is three-dimensional.
The convolution of a kernel with convolution input data can be understood as follows: the small data block corresponding to the kernel slides within the large data block corresponding to the input data; at each sliding position, the data of the small block is multiplied with the portion of the large block that coincides with the small block, and the products of all coinciding data are summed to yield one convolution result.
To facilitate understanding of the convolution operation, the specific process is illustrated below with FIG. 3, a schematic diagram of the calculation process of a convolution operation provided by an embodiment of the present application. As shown in FIG. 3, the convolution kernel is a 2×2×2 data block whose length, height, and width are all 2. The kernel contains 8 values in total (a11, a12, a21, a22, a31, a32, a41, a42); it is composed of 8 small cubes, each representing one value of the kernel. The convolution input data is a 4×4×2 data block whose length and height are 4 and width is 2, containing 32 values in total (b11, b12, b13, b14, b21, b22, b23, b24, b31, b32, b33, b34, b41, b42, b43, b44, b51, b52, b53, b54, b61, b62, b63, b64, b71, b72, b73, b74, b81, b82, b83, b84). The input data is composed of 32 small cubes, each representing one value. Before the convolution, the controller 122 loads the kernel and the input data into the input buffer 124. The loading may proceed as follows: the controller 122 sends a read instruction to the input buffer 124, and the input buffer 124, in response to the read instruction, reads the kernel and the input data from the memory and loads them into the input buffer 124.
From the perspective of the convolution operation, the convolution processor 123 convolves the kernel with the input data as follows. Step 1: the convolution processor 123 places the kernel into the input data so that the kernel's data block coincides with one 2×2×2 data block of the input data, multiplies the data represented by the coinciding small cubes, and sums all products to obtain the step-1 convolution result. As seen in FIG. 3, in step 1, a11, a12, a21, a22, a31, a32, a41, a42 coincide respectively with b11, b12, b21, b22, b51, b52, b61, b62; the step-1 convolution is a11×b11+a12×b12+a21×b21+a22×b22+a31×b51+a32×b52+a41×b61+a42×b62, and its result is c11 = a11×b11+a12×b12+a21×b21+a22×b22+a31×b51+a32×b52+a41×b61+a42×b62. The step-1 convolution can thus be understood as the matrix product of the one-row matrix formed by a11, a12, a21, a22, a31, a32, a41, a42 with the one-column matrix formed by b11, b12, b21, b22, b51, b52, b61, b62. Step 2: slide the kernel one data unit along the length of the input data so that it coincides with another 2×2×2 block; as in step 1, multiply the coinciding values and sum the products to obtain the step-2 result. In step 2, a11 through a42 coincide respectively with b12, b13, b22, b23, b52, b53, b62, b63, and the result is c12. Step 3: slide one unit along the length direction; multiplying and adding the coinciding values gives c13. Step 4: slide one unit along the height direction, giving c23. Step 5: slide one unit along the length direction, giving c22. Step 6: slide one unit along the length direction, giving c21. Step 7: slide one unit along the height direction, giving c31. Step 8: slide one unit along the length direction, giving c32. Step 9: slide one unit along the length direction, giving c33. At this point the convolution of the kernel with the input data is complete and the final convolution result is obtained. The final result consists of the nine values c11, c12, c13, c21, c22, c23, c31, c32, c33, which can be mapped to a 3×3 data plane:
    c11  c12  c13
    c21  c22  c23
    c31  c32  c33
If there are N convolution kernels in FIG. 3, the convolution result of the N kernels with the input data is a 3×3×N data block (length and height 3, width N).
From the perspective of matrix operations, the convolution processor 123 convolves the kernel with the input data as follows: the convolution processor 123 converts the kernel into a 1-row, 8-column convolution kernel matrix and converts the input data into an 8-row, 9-column convolution input matrix. The number of columns of the convolution input matrix is determined by the sizes of the kernel and the input data: if the kernel size is K×K×D and the input size is M×M×D, the number of columns is (M−K+1)×(M−K+1). Multiplying the 1-row, 8-column kernel matrix by the 8-row, 9-column input matrix yields a 1-row, 9-column matrix, which can be mapped to a 3×3 data plane. If there are N kernels in FIG. 3, the convolution result of the N kernels with the input data is a 3×3×N data block (length and height 3, width N).
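A sketch of this matrix-form (im2col-style) expansion for the FIG. 3 sizes follows; the expansion order and function name are assumptions for illustration, not the patent's exact memory layout:

```python
import numpy as np

def im2col_conv(inp, kernel):
    """Expand the kernel into a 1 x (K*K*D) row matrix and the input into a
    (K*K*D) x (M-K+1)^2 column matrix, then take their product."""
    m = inp.shape[0]
    k = kernel.shape[0]
    cols = [inp[i:i + k, j:j + k, :].ravel()
            for i in range(m - k + 1) for j in range(m - k + 1)]
    col_matrix = np.stack(cols, axis=1)        # 8 rows x 9 columns for FIG. 3
    row = kernel.ravel()[None, :]              # 1 row  x 8 columns
    return (row @ col_matrix).reshape(m - k + 1, m - k + 1)

inp = np.random.default_rng(2).standard_normal((4, 4, 2))   # M = 4, D = 2
ker = np.random.default_rng(3).standard_normal((2, 2, 2))   # K = 2
print(im2col_conv(inp, ker).shape)             # (3, 3) data plane
```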
For ease of understanding, the length, height, and width of the kernel and the input data in FIG. 3 use small values as the example. In actual convolution operations, however, these dimensions are often very large; to ensure that the integrated chip 12 is suitable for all convolution operations, the number of logic circuits used for convolution in the integrated chip 12 would have to be set very large, which wastes circuits. Moreover, since there are multiple kernels and every kernel must be convolved with the input data, the kernels or the input data are repeatedly loaded from the memory into the input buffer, placing high bandwidth demands between the memory and the input buffer.
The integrated chip 12 shown in FIG. 2A can segment the convolution kernel to obtain segmentation convolution kernels, and segment the convolution input data to obtain segmentation convolution input data. After the segmentation convolution kernel and segmentation convolution input data are loaded into the input buffer 124, the convolution processor 123 performs a segmentation convolution operation on the segmentation convolution input data and the segmentation convolution kernel to obtain a segmentation convolution result, which it stores in the output buffer 126. The segmentation convolution kernel is data obtained by segmenting the convolution kernel; the segmentation convolution input data is data obtained by segmenting the convolution input data. The above flow is one convolution layer; the next convolution layer will use the above segmentation convolution results as its convolution input data. For the specific convolution process, see the description of the convolution operation above, which is not repeated here.
If the number of convolution kernels of one convolution layer is N, the segmentation convolution kernel being data obtained by segmenting the convolution kernels may specifically mean: the segmentation convolution kernel includes the segment of data occupying the same spatial region in each of the N kernels, obtained with the same segmentation scheme.
The segmentation convolution input data being data obtained by segmenting the convolution input data may specifically mean: the segmentation convolution input data is the segment of the convolution input data, obtained with the segmentation scheme of the above segmentation convolution kernel, that can coincide spatially with the segmentation convolution kernel data during the convolution operation.
The segmentation of the convolution kernel and of the convolution input data is described below with FIG. 4A, a schematic diagram of the segmentation processing of a convolution kernel and convolution input data provided by an embodiment of the present application; FIG. 4A shows a scenario in which N convolution kernels are convolved with one set of convolution input data. Each kernel has size K×K×D, where K is the kernel's length and height and D is its width; the input data has size M×M×D, where M is its length and height and D is its width. The kernel and the input data can be understood as data blocks in three-dimensional space: each kernel includes K×K×D values, and the input data includes M×M×D values. When splitting a kernel, the K×K×D data block is cut along the width direction into Y sub-blocks of K×K×(D/Y); similarly, when splitting the input data, the M×M×D data block is cut along the width direction into Y sub-blocks of M×M×(D/Y). For example, when segmenting the N kernels, the sub-blocks of the N kernels whose width lies in the interval 0 to D/Y form the first segmentation convolution kernel, the sub-blocks whose width lies in D/Y to 2D/Y form the second segmentation convolution kernel, and so on; the sub-blocks whose width lies in (Y−1)D/Y to D form the Y-th segmentation convolution kernel. When segmenting the input data, the sub-block of the input data whose width lies in 0 to D/Y is the first segmentation convolution input data, the sub-block whose width lies in D/Y to 2D/Y is the second segmentation convolution input data, and so on; the sub-block whose width lies in (Y−1)D/Y to D is the Y-th segmentation convolution input data.
It can be seen that the first segmentation convolution kernel includes the segment of the N kernels, obtained by segmenting in units of D/Y along the width, whose width interval lies in 0 to D/Y; the second segmentation convolution kernel includes the segment whose width interval lies in D/Y to 2D/Y; and the Y-th segmentation convolution kernel includes the segment whose width interval lies in (Y−1)D/Y to D. Similarly, the first segmentation convolution input data is the segment of the input data whose width interval lies in 0 to D/Y; the second is the segment whose width interval lies in D/Y to 2D/Y; and the Y-th is the segment whose width interval lies in (Y−1)D/Y to D. The first segmentation convolution kernel corresponds to the first segmentation convolution input data, the second segmentation convolution kernel corresponds to the second segmentation convolution input data, and the Y-th segmentation convolution kernel corresponds to the Y-th segmentation convolution input data.
Before the segmentation convolution operation, the controller 122 loads the first segmentation convolution kernel and the corresponding first segmentation convolution input data into the input buffer 124. During the segmentation convolution operation, the convolution processor 123 convolves the first segmentation convolution kernel with the corresponding first segmentation convolution input data to obtain the first segment's segmentation convolution result and stores it in the output buffer 126. Similarly, the controller 122 loads the second segmentation convolution kernel and the corresponding second segmentation convolution input data into the input buffer 124, and the convolution processor 123 convolves them to obtain the second segment's segmentation convolution result, which is stored in the output buffer 126. By analogy, the controller 122 loads the Y-th segmentation convolution kernel and the corresponding Y-th segmentation convolution input data into the input buffer 124, and the convolution processor 123 convolves them to obtain the Y-th segment's segmentation convolution result, which is stored in the output buffer 126.
The segmentation convolution process is described below with FIGS. 4A and 4B. FIG. 4B shows the process of convolving N kernels of size K×K×D with convolution input data of size M×M×D. First, in the manner of FIG. 4A, each kernel is split into Y segmentation convolution kernels of size K×K×(D/Y) and the input data is split into Y pieces of segmentation convolution input data of size M×M×(D/Y), with the first segmentation convolution kernel corresponding to the first segmentation convolution input data, the second to the second, and the Y-th to the Y-th. In FIG. 4B, the convolution operation can be split into Y segment convolutions: the first segmentation convolution operation, the second segmentation convolution operation, ..., the Y-th segmentation convolution operation. The first segmentation convolution operation in FIG. 4B convolves the first segmentation convolution kernel with the first segmentation convolution input data; the second convolves the second segmentation convolution kernel with the second segmentation convolution input data; and by analogy, the Y-th convolves the Y-th segmentation convolution kernel with the Y-th segmentation convolution input data. After the Y segment convolutions are completed, their results can be accumulated to obtain the final convolution result. The detailed process of convolving a K×K×(D/Y) segmentation convolution kernel with M×M×(D/Y) segmentation convolution input data is similar to the convolution of the 2×2×2 kernel with the 4×4×2 input data in FIG. 3 and is not repeated here.
The matrix product process corresponding to the convolution of the segmentation convolution kernels with the segmentation convolution input data in FIG. 4B is described below with FIG. 4C. As shown in FIG. 4C, the first segmentation convolution kernel can be expanded into a matrix of U rows and N columns, and the first segmentation convolution input data into a matrix of P rows and U columns; the second segmentation convolution kernel can likewise be expanded into a U-row, N-column matrix and the second segmentation convolution input data into a P-row, U-column matrix; and by analogy, the Y-th segmentation convolution kernel into a U-row, N-column matrix and the Y-th segmentation convolution input data into a P-row, U-column matrix. Accumulating the results of the respective matrix products of the Y P-row, U-column matrices with the Y U-row, N-column matrices yields the P-row, N-column convolution result, where U = K×K×(D/Y) and P = (M−K+1)×(M−K+1).
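The shape bookkeeping of FIG. 4C can be illustrated with a short sketch (the concrete sizes are assumptions chosen so that U = 8 and P = 9):

```python
import numpy as np

K, M, D, Y, N = 2, 4, 100, 50, 3     # illustrative sizes, not from FIG. 4C
U = K * K * (D // Y)                 # rows of each segmentation kernel matrix
P = (M - K + 1) * (M - K + 1)        # rows of each segmentation input matrix

rng = np.random.default_rng(4)
seg_inputs = [rng.standard_normal((P, U)) for _ in range(Y)]    # P x U each
seg_kernels = [rng.standard_normal((U, N)) for _ in range(Y)]   # U x N each

# Accumulating the Y partial matrix products yields the P x N result.
result = sum(x @ w for x, w in zip(seg_inputs, seg_kernels))
print(U, P, result.shape)            # 8 9 (9, 3)
```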
In some embodiments, as shown in FIG. 2B, the integrated chip 12 may further include an on-chip memory 125; the on-chip memory 125 receives the convolution input data and the segmentation convolution kernels, or receives the convolution input data and the convolution kernels. The on-chip memory 125 may be connected to the input buffer 124 and to the output buffer 126.
In the embodiments of the present application, the convolution kernel may be segmented by the server 11 in FIG. 1A or FIG. 1B, or by the integrated chip 12. The convolution input data may be segmented by the integrated chip 12.
If the convolution kernels are segmented by the server 11 in FIG. 1A, the central processing unit 111 of the server 11 segments the N convolution kernels of each layer of the convolutional neural network algorithm to obtain multiple segmentation convolution kernels and stores them in the external storage unit 112, and the central processing unit 111 inputs the multiple segmentation convolution kernels in the external memory 112 together with the convolution input data to be convolved into the on-chip memory 125 of the integrated chip 12, which receives the convolution input data and the multiple segmentation convolution kernels sent by the server 11. The controller 122 splits the convolution input data in the on-chip memory 125 into multiple pieces of sub convolution input data, and loads one segmentation convolution kernel and one piece of segmentation convolution input data, or two segmentation convolution kernels and two pieces of segmentation convolution input data, into the input buffer 124. When the convolution kernels are segmented by the server 11 of FIG. 1A, the integrated chip need not segment them and can quickly perform the segmentation convolution operation, improving the processing efficiency of the convolution operation.
If the convolution kernels are segmented by the integrated chip 12 in FIG. 1A, the server 11 inputs the kernels and the convolution input data in the external memory 112 into the on-chip memory 125 of the integrated chip 12, which receives the convolution input data and the kernels sent by the server 11. The controller 122 splits the convolution input data in the on-chip memory 125 into multiple pieces of sub convolution input data and splits the kernels in the on-chip memory 125 into multiple segmentation convolution kernels. The controller 122 loads one segmentation convolution kernel and one piece of segmentation convolution input data, or two segmentation convolution kernels and two pieces of segmentation convolution input data, into the input buffer 124. When the kernels are segmented by the integrated chip 12 of FIG. 1A, the server 11 need not segment them, which lightens the burden on the server 11.
The working process of the integrated chip 12 is described with FIGS. 4A, 4B, and 4C. The integrated chip 12 obtains the segmentation convolution kernels and segmentation convolution input data, where the segmentation convolution kernels may be split by the integrated chip 12 or by the server 11. The controller 122 reads one piece of segmentation convolution input data and the corresponding N segmentation convolution kernels from the on-chip memory 125 into the input buffer 124; the convolution processor 123 expands the N segmentation convolution kernels into a segmentation convolution kernel matrix of U rows and N columns, expands the segmentation convolution input data into a segmentation convolution input matrix of P rows and U columns, and performs the matrix product of the P-row, U-column input matrix with the U-row, N-column kernel matrix to obtain the segmentation convolution result. Here N is the number of kernels, U = K×K×(D/Y), and P = (M−K+1)×(M−K+1); if K = 10, D = 20, M = 100, Y = 20, then U = 100 and P = 8281. After the convolution processor 123 completes the segmentation convolution of one piece of segmentation convolution input data with the corresponding N segmentation convolution kernels, it stores the resulting segmentation convolution result in the output buffer 126, and the controller 122 reads another piece of segmentation convolution input data and the corresponding other N segmentation convolution kernels from the on-chip memory 125 into the input buffer 124 for the next segmentation convolution operation. After all segmentation convolution input data and segmentation convolution kernels have undergone the segmentation convolution operation, all segmentation convolution results in the output buffer 126 can be accumulated to obtain one layer's convolution result, completing one convolution layer. In the segmentation convolution process of FIGS. 4A to 4C, by splitting the kernels and the input data and convolving the segmentation convolution input data with the segmentation convolution kernels segment by segment, there is no need to repeatedly read the segmentation convolution kernel and segmentation convolution input data from the on-chip memory into the input buffer during one segmentation convolution operation, which reduces the bandwidth between the on-chip memory and the input buffer when performing convolution operations.
In some embodiments, the number of rows of the segmentation convolution kernel matrix corresponding to the segmentation convolution kernel and the number of columns of the segmentation convolution input matrix corresponding to the segmentation convolution input data correspond to the number of logic circuits of the convolution processor 123.
In the embodiments of the present application, the logic circuits may include multipliers, adders, or other devices with logic operation capability. The correspondence between the number of rows of the segmentation convolution kernel matrix and the number of columns of the segmentation convolution input matrix and the number of logic circuits of the convolution processor includes:
the number of rows of the segmentation convolution kernel matrix corresponding to the segmentation convolution kernel is less than or equal to the number of multipliers of the convolution processor 123, and the number of columns of the segmentation convolution input matrix corresponding to the segmentation convolution input data is less than or equal to the number of multipliers of the convolution processor 123.
For example, referring to the embodiment shown in FIG. 4C, if each segmentation convolution kernel has size K×K×(D/Y), the segmentation convolution kernel matrix corresponding to the N segmentation convolution kernels has U rows and N columns, and the segmentation convolution input matrix corresponding to the segmentation convolution input data has P rows and U columns, then U = K×K×(D/Y), and the number of multipliers of the convolution processor 123 is greater than U.
In some embodiments, the logic circuits may include multiply-accumulators, and the correspondence between the number of rows of the matrix corresponding to the segmentation convolution kernel and the number of columns of the matrix corresponding to the segmentation convolution input data and the number of logic circuits of the convolution processor includes:
the number of rows of the matrix corresponding to the segmentation convolution kernel is less than or equal to the number of multiply-accumulators of the convolution processor; the number of columns of the matrix corresponding to the segmentation convolution input data is less than or equal to the number of multiply-accumulators of the convolution processor.
For example, referring to the embodiment shown in FIG. 4C, if each segmentation convolution kernel has size K×K×(D/Y), the segmentation convolution kernel matrix corresponding to the N segmentation convolution kernels has U rows and N columns, and the segmentation convolution input matrix has P rows and U columns, then U = K×K×(D/Y), and the number of multiply-accumulators of the convolution processor 123 is greater than U.
In the above embodiments, the value of Y may be determined according to the number of available logic circuits of the convolution processor 123 and the size of the convolution kernel.
If the logic circuits include multipliers and adders, the number of available multipliers of the convolution processor 123 is X, and the size of the convolution kernel is K×K×D, then Y must satisfy: Y > (K×K×D)/X.
If the logic circuits include multiply-accumulators, the number of available multiply-accumulators of the convolution processor 123 is X, and the size of the convolution kernel is K×K×D, then Y must satisfy: Y > (K×K×D)/X.
In a matrix product operation, to multiply a one-row, U-column matrix with a U-row, one-column matrix requires at least U multipliers and at least one adder, or at least U multiply-accumulators. After a kernel of size K×K×D is split into Y segments, each segmentation convolution kernel has size K×K×(D/Y); if each segmentation convolution kernel is expanded into U rows and one column, then X > U = K×K×(D/Y).
In the above embodiments, if the size of the convolution kernel cannot be divided evenly by X (X being the number of available multiply-accumulators of the convolution processor 123), the indivisible remainder can be used as another segmentation convolution kernel. For example, the kernel size is K×K×D, with K the length and height and D the width; if K = 10, D = 100, and X = 7, and the kernel is split along the width direction, it can be split into 15 segmentation convolution kernels: fourteen of size 10×10×7 and one of size 10×10×2.
In some embodiments, the input buffer 124 includes a first storage space and a second storage space, the first storage space and the second storage space being respectively used to store data; the controller 122 loading the segmentation convolution kernel into the input buffer 124 includes:
after the data in the second storage space has finished participating in a segmentation convolution operation, the controller 122 moving the data in the first storage space to the second storage space, and then loading the segmentation convolution kernel into the first storage space.
When the controller 122 moves the data in the first storage space to the second storage space after the data in the second storage space has finished participating in a segmentation convolution operation, the original data in the second storage space can be overwritten; loading the segmentation convolution kernel into the first storage space can overwrite the original data in the first storage space. The above data may be segmentation convolution kernels. With the embodiments of the present application, the segmentation convolution kernels are stored in the input buffer 124 in a "ping-pong storage" manner, which always ensures that two segmentation convolution kernels are stored in the input buffer 124; after completing one segmentation convolution operation, the convolution processor 123 can quickly perform the next segmentation convolution operation without waiting for the loading of the next operation's segmentation convolution kernel, which can improve the processing efficiency of the convolution operation.
In some embodiments, if the first storage space and the second storage space are empty, the controller 122 loading the segmentation convolution kernel into the input buffer 124 includes:
the controller 122 loading the segmentation convolution kernel into the second storage space;
the controller 122 being further configured to load another segmentation convolution kernel, other than the segmentation convolution kernel, into the first storage space.
In the embodiments of the present application, when the first storage space and the second storage space are empty, data is being loaded into the input buffer 124 for the first time; at the first load, two segmentation convolution kernels can be loaded into the input buffer 124 at one time.
In some embodiments, the convolution processor 123 performing the segmentation convolution operation on the segmentation convolution kernel and the segmentation convolution input data includes:
the convolution processor 123 converting the segmentation convolution kernel into a segmentation convolution kernel matrix and the segmentation convolution input data into a segmentation convolution input matrix; the convolution processor 123 performing a product operation with the segmentation convolution input matrix as the multiplier and the segmentation convolution kernel matrix as the multiplicand.
In the embodiments of the present application, the segmentation convolution input matrix is used as the multiplier and the segmentation convolution kernel matrix as the multiplicand. In a matrix product operation, the matrix corresponding to the multiplier is generally multiplied row by row with the matrix corresponding to the multiplicand; see the matrix product process of FIG. 5 for details. If the kernel size is 2×2×100, the number of kernels N = 100, and the input data size is 100×100×100, then each kernel can be split into fifty 2×2×2 segmentation convolution kernels, and the input data can be split into fifty pieces of 100×100×2 segmentation convolution input data. With a conventional segmentation convolution operation, the convolution processor 123 would convert the segmentation convolution kernel into a segmentation convolution kernel matrix and the segmentation convolution input data into a segmentation convolution input matrix, and perform the product with the input matrix as the multiplicand and the kernel matrix as the multiplier; the segmentation convolution kernel matrix would be 100 rows by 8 columns and the segmentation convolution input matrix 8 rows by 9801 columns. Multiplying a 100-row, 8-column matrix with an 8-row, 9801-column matrix requires multiplying the 100-row, 8-column matrix row by row with the 8-row, 9801-column matrix; each matrix product multiplies a 1-row, 8-column matrix by an 8-row, 9801-column matrix and occupies a large amount of the storage space of the input buffer 124.
With the segmentation convolution operation of the embodiments of the present application, the convolution processor 123 uses the segmentation convolution input matrix as the multiplier and the segmentation convolution kernel matrix as the multiplicand; the segmentation convolution kernel matrix is 8 rows by 100 columns and the segmentation convolution input matrix 9801 rows by 8 columns. Multiplying a 9801-row, 8-column matrix with an 8-row, 100-column matrix requires multiplying the 9801-row, 8-column matrix row by row with the 8-row, 100-column matrix; each matrix product multiplies a 1-row, 8-column matrix by an 8-row, 100-column matrix and occupies only a small amount of the storage space of the input buffer 124. Since the segmentation convolution input matrix is usually far larger than the segmentation convolution kernel matrix, using the input matrix as the multiplier and the kernel matrix as the multiplicand can reduce the input buffer 124 storage space occupied by each matrix product operation.
Because both the convolution kernel and the convolution input data are segmented, the number of rows of the segmentation convolution kernel matrix becomes significantly smaller, and the convolution processor 123 only needs available logic circuits equal to or greater than the number of rows of the segmentation convolution kernel matrix to complete the segmentation convolution operation. By implementing the embodiments of the present application, the convolution operation can be completed with limited available logic circuits, saving the logic circuits used for convolution operations.
The specific process of the matrix product operation of the segmentation convolution input matrix and the segmentation convolution kernel matrix is described with FIG. 5.
For example, as shown in FIG. 5, FIG. 5 is a schematic diagram of the operation of the logic circuits performing a matrix product operation on one row of the segmentation convolution input matrix and the segmentation convolution kernel matrix provided in the embodiments of the present application. Suppose the segmentation convolution input matrix has P rows and U columns and the segmentation convolution kernel matrix has U rows and N columns, with P = 2000, U = 5, N = 100. The first row of the segmentation convolution input matrix is used below to illustrate the specific logic circuit implementation.
The logic circuits used for the convolution operation in the convolution processor 123 of the integrated chip 12 may be multipliers and adders, or multiply-accumulators. A multiplier can multiply at least two data values and output the multiplication result. An adder can add at least two data values and output the addition result. The embodiments of the present application use multiply-accumulators (also called multiply-adders) for the convolution operation: the result of a multiplication can be added to another operand, and the multiplication and addition can be performed within one clock cycle, reducing the execution latency of the whole multiply-add operation.
FIG. 5 uses multiply-accumulators as the example. In FIG. 5, only five multiply-accumulators (multiply-accumulator 0, multiply-accumulator 1, multiply-accumulator 2, multiply-accumulator 3, and multiply-accumulator 4, as shown in FIG. 5) are needed to realize all the segmentation convolution operations. For example, if the clock frequency of the convolution processor 123 of the integrated chip 12 is 1 GHz, the clock period is 1 ns; assume a multiply-accumulator can perform one multiply-accumulate operation within one clock cycle.
As shown in FIG. 5, the first row of the segmentation convolution input matrix is X00 X01 X02 X03 X04, and the segmentation convolution kernel matrix is the U-row, N-column array of weights (U = 5, N = 100; only the recoverable layout of the original figure is reproduced here):
    W00  W01  W02  ...  W0,99
    W10  W11  W12  ...  W1,99
    W20  W21  W22  ...  W2,99
    W30  W31  W32  ...  W3,99
    W40  W41  W42  ...  W4,99
Then the matrix product operation of the first row of the segmentation convolution input matrix with the segmentation convolution kernel matrix is implemented in the logic circuits of the convolution processor as follows.
In the first clock cycle (T0), multiply-accumulator 0 performs the multiplication X00×W00, obtaining the operation result (X00×W00).
In the second clock cycle (T1), multiply-accumulator 0 transfers the result (X00×W00) obtained in the previous cycle to multiply-accumulator 1, then performs the multiplication X00×W01, obtaining (X00×W01); multiply-accumulator 1 performs the multiplication X01×W10 and adds the obtained result (X01×W10) to the result (X00×W00) transferred by multiply-accumulator 0, obtaining (X00×W00+X01×W10).
In the third clock cycle (T2), multiply-accumulator 0 transfers the result (X00×W01) obtained in the previous cycle to multiply-accumulator 1, then performs X00×W02, obtaining (X00×W02); multiply-accumulator 1 transfers the result (X00×W00+X01×W10) obtained in the previous cycle to multiply-accumulator 2, then performs X01×W11 and adds the obtained result (X01×W11) to the (X00×W01) transferred by multiply-accumulator 0, obtaining (X00×W01+X01×W11); multiply-accumulator 2 performs X02×W20 and adds the obtained result (X02×W20) to the (X00×W00+X01×W10) transferred by multiply-accumulator 1, obtaining (X00×W00+X01×W10+X02×W20).
In the fourth clock cycle (T3), multiply-accumulator 0 transfers the result (X00×W02) obtained in the previous cycle to multiply-accumulator 1, then performs X00×W03, obtaining (X00×W03); multiply-accumulator 1 transfers the result (X00×W01+X01×W11) obtained in the previous cycle to multiply-accumulator 2, then performs X01×W12 and adds the obtained result (X01×W12) to the (X00×W02) transferred by multiply-accumulator 0, obtaining (X00×W02+X01×W12); multiply-accumulator 2 transfers the result (X00×W00+X01×W10+X02×W20) obtained in the previous cycle to multiply-accumulator 3, then performs X02×W21 and adds the obtained result (X02×W21) to the (X00×W01+X01×W11) transferred by multiply-accumulator 1, obtaining (X00×W01+X01×W11+X02×W21); multiply-accumulator 3 performs X03×W30 and adds the obtained result (X03×W30) to the (X00×W00+X01×W10+X02×W20) transferred by multiply-accumulator 2, obtaining (X00×W00+X01×W10+X02×W20+X03×W30).
In the fifth clock cycle (T4), multiply-accumulator 0 transfers the result (X00×W03) obtained in the previous cycle to multiply-accumulator 1, then performs X00×W04, obtaining (X00×W04); multiply-accumulator 1 transfers the result (X00×W02+X01×W12) obtained in the previous cycle to multiply-accumulator 2, then performs X01×W13 and adds the obtained result (X01×W13) to the (X00×W03) transferred by multiply-accumulator 0, obtaining (X00×W03+X01×W13); multiply-accumulator 2 transfers the result (X00×W01+X01×W11+X02×W21) obtained in the previous cycle to multiply-accumulator 3, then performs X02×W22 and adds the obtained result (X02×W22) to the (X00×W02+X01×W12) transferred by multiply-accumulator 1, obtaining (X00×W02+X01×W12+X02×W22); multiply-accumulator 3 transfers the result (X00×W00+X01×W10+X02×W20+X03×W30) obtained in the previous cycle to multiply-accumulator 4, then performs X03×W31 and adds the obtained result (X03×W31) to the (X00×W01+X01×W11+X02×W21) transferred by multiply-accumulator 2, obtaining (X00×W01+X01×W11+X02×W21+X03×W31). Multiply-accumulator 4 performs X04×W40 and adds the obtained result (X04×W40) to the (X00×W00+X01×W10+X02×W20+X03×W30) transferred by multiply-accumulator 3, obtaining (X00×W00+X01×W10+X02×W20+X03×W30+X04×W40).
In the sixth clock cycle (T5), multiply-accumulator 0 transfers the result (X00×W04) obtained in the previous cycle to multiply-accumulator 1, then performs X00×W05, obtaining (X00×W05); multiply-accumulator 1 transfers the result (X00×W03+X01×W13) obtained in the previous cycle to multiply-accumulator 2, then performs X01×W14 and adds the obtained result (X01×W14) to the (X00×W04) transferred by multiply-accumulator 0, obtaining (X00×W04+X01×W14); multiply-accumulator 2 transfers the result (X00×W02+X01×W12+X02×W22) obtained in the previous cycle to multiply-accumulator 3, then performs X02×W23 and adds the obtained result (X02×W23) to the (X00×W03+X01×W13) transferred by multiply-accumulator 1, obtaining (X00×W03+X01×W13+X02×W23); multiply-accumulator 3 transfers the result (X00×W01+X01×W11+X02×W21+X03×W31) obtained in the previous cycle to multiply-accumulator 4, then performs X03×W32 and adds the obtained result (X03×W32) to the (X00×W02+X01×W12+X02×W22) transferred by multiply-accumulator 2, obtaining (X00×W02+X01×W12+X02×W22+X03×W32). Multiply-accumulator 4 outputs the result (X00×W00+X01×W10+X02×W20+X03×W30+X04×W40) obtained in the previous cycle to the output buffer for storage, then performs X04×W41 and adds the obtained result (X04×W41) to the (X00×W01+X01×W11+X02×W21+X03×W31) transferred by multiply-accumulator 3, obtaining (X00×W01+X01×W11+X02×W21+X03×W31+X04×W41).
From the fifth clock cycle (T4) onward, all five multiply-accumulators (MAC 0 through MAC 4) are performing multiplications. By the 101st clock cycle (T100), MAC 0 has finished the product of the first row of the segmented convolution input matrix with the segmented convolution kernel matrix and can begin the multiplications for the second row. If the second row of the segmented convolution input matrix is X10 X11 X12 X13 X14, then in the 101st clock cycle (T100) MAC 0 performs the multiplication X10×W00, obtaining (X10×W00). It can be seen that when the multiply-accumulators of the embodiments of the present application perform the matrix product, MAC 0 through MAC 4 operate as a pipeline; once every row of the segmented convolution input matrix has completed the matrix product, the operation on this segmented convolution input matrix and segmented convolution kernel matrix is complete, and the operation on the next segmented convolution input matrix and segmented convolution kernel matrix begins.
From the sixth clock cycle (T5) onward, MAC 4 outputs the result obtained in the previous cycle to the output buffer for storage in every cycle. After the 104th cycle (T103) ends, MAC 4 has performed the last multiply-accumulate of the product of the first row of the segmented convolution input matrix with the segmented convolution kernel matrix, and in the 105th cycle it outputs the last result to the output buffer for storage, completing the operation on the first row of the segmented convolution input matrix and the segmented convolution kernel matrix. It should be noted that the multiplications for the second row of the segmented convolution input matrix already started in the 101st clock cycle (T100); they proceed in the same manner as those for the first row described above and are not repeated here.
By analogy, the operations on the other rows of the segmented convolution input matrix and the segmented convolution kernel matrix can likewise be completed with the five multiply-accumulators, thereby completing the operation on the segmented convolution input matrix and the segmented convolution kernel matrix. Furthermore, the operations on all the other segmented convolution input matrices and segmented convolution kernel matrices can use the same logic circuits for their multiply-accumulate operations.
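To make the pipeline schedule above easier to follow, here is a minimal Python simulation of the five-MAC arrangement of FIG. 5, assuming one multiply-accumulate per MAC per cycle. The function `pipelined_row_times_matrix` and its structure are illustrative, not taken from the patent; the final assertion checks the schedule against an ordinary row-by-matrix product.

```python
import numpy as np

def pipelined_row_times_matrix(x, W):
    """Simulate the MAC pipeline: in cycle t, MAC k multiplies x[k] by
    kernel element W[k, t-k] and adds the partial sum that MAC k-1
    produced in the previous cycle; MAC U-1 emits one finished dot
    product per cycle once the pipeline is full."""
    U, N = W.shape               # U = 5 rows, N = 100 columns in the example
    assert len(x) == U
    partial = [None] * U         # value each MAC produced last cycle
    out = []
    for t in range(N + U - 1):   # the pipeline drains after N+U-1 cycles
        new_partial = [None] * U
        for k in range(U):
            j = t - k            # kernel column MAC k works on this cycle
            if 0 <= j < N:
                carried = partial[k - 1] if k > 0 else 0.0
                new_partial[k] = x[k] * W[k, j] + carried
        if new_partial[U - 1] is not None:
            out.append(new_partial[U - 1])  # MAC 4 outputs one result
        partial = new_partial
    return np.array(out)

x = np.arange(5, dtype=float)
W = np.arange(500, dtype=float).reshape(5, 100)
assert np.allclose(pipelined_row_times_matrix(x, W), x @ W)
```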
As the example above shows, the number of rows of the segmented convolution kernel matrix equals the minimum number of multiply-accumulators required, so the split size of the convolution kernel can be determined from the convolution processor's available logic circuits. For instance, if the data volume of the convolution kernel is Q and the convolution kernel is split evenly into Y segmented convolution kernels, each segmented convolution kernel has a data volume of Q/Y. When the convolution processor has many available logic circuits (for example, multiply-accumulators), Y can be set small so that the row count (Q/Y) of the segmented convolution kernel matrix is large, meeting the demand for faster processing. When there are multiple convolution processors, several of them can also operate on different segmented convolution kernel matrices at the same time, further raising the speed of the convolution operation. When the integrated chip has few available logic circuits, Y can be set large so that the row count (Q/Y) of the segmented convolution kernel matrix is small, saving logic circuits; multi-layer convolution can then be carried out while occupying fewer logic circuits.
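The relationship between the kernel data volume Q, the segment count Y and the available multiply-accumulators can be sketched as follows; `choose_segment_count` is a hypothetical helper chosen for illustration, not an interface defined in the patent.

```python
import math

def choose_segment_count(kernel_volume: int, available_macs: int) -> int:
    """Pick the number of segments Y so that each segment's row count
    Q/Y fits within the multiply-accumulators the chip actually has."""
    return math.ceil(kernel_volume / available_macs)

# Q = 500 kernel elements with 5 MACs -> Y = 100 segments of 5 rows each.
print(choose_segment_count(500, 5))    # 100
# With 100 MACs available, the same kernel needs only 5 segments.
print(choose_segment_count(500, 100))  # 5
```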
In some embodiments, the convolution processor 123 storing the segmented convolution result in the output buffer 126 includes:
the convolution processor 123 accumulating the segmented convolution result with the data stored in the output buffer 126 and writing the sum into the output buffer 126.
For example, the convolution processor 123 convolves the first segmented convolution kernel with the first segmented convolution input data to obtain the first segmented convolution result and stores it in the output buffer 126; it then convolves the second segmented convolution kernel with the second segmented convolution input data to obtain the second segmented convolution result, accumulates it with the first segmented convolution result stored in the output buffer 126, and writes the sum into the output buffer 126. By analogy, the convolution processor 123 convolves the Y-th segmented convolution kernel with the Y-th segmented convolution input data to obtain the Y-th segmented convolution result, accumulates it with the accumulated first through (Y-1)-th segmented convolution results stored in the output buffer 126, and writes the sum into the output buffer 126. In the embodiments of the present application, after completing each segmented convolution operation the convolution processor 123 accumulates the new segmented convolution result with the data previously stored in the output buffer 126; the accumulation happens immediately after each segmented convolution operation rather than after all segmented convolution operations have finished, which improves the processing efficiency of the whole convolution operation.
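The correctness of this accumulate-as-you-go scheme can be illustrated with a minimal NumPy sketch, assuming the expanded product splits along the shared dimension U into Y segments; the variable names are illustrative only.

```python
import numpy as np

P, U, N, Y = 8, 20, 6, 4
X = np.random.rand(P, U)          # expanded convolution input data
W = np.random.rand(U, N)          # expanded convolution kernel

seg = U // Y
output_buffer = np.zeros((P, N))  # plays the role of output buffer 126
for y in range(Y):
    X_seg = X[:, y*seg:(y+1)*seg]  # y-th segmented convolution input data
    W_seg = W[y*seg:(y+1)*seg, :]  # y-th segmented convolution kernel
    partial = X_seg @ W_seg        # one segmented convolution operation
    output_buffer += partial       # accumulate immediately

# The running accumulation matches the unsegmented product.
assert np.allclose(output_buffer, X @ W)
```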
In some embodiments, after all the data of the convolution kernel has taken part in segmented convolution operations, the convolution processor 123 performs post-processing computation on the data stored in the output buffer 126 to obtain a post-processed convolution result, and transposes the matrix corresponding to the post-processed convolution result to obtain the transpose of that matrix;
or, after all the data of the convolution kernel has taken part in segmented convolution operations, the convolution processor 123 transposes the matrix corresponding to the data stored in the output buffer 126 to obtain the transpose of that matrix;
the controller 122 stores the convolution result corresponding to the transposed matrix in the on-chip memory 125 as convolution input data.
In the embodiments of the present application, the post-processing computation includes at least one of activation function computation, pooling computation and normalization computation; whether post-processing is performed is decided by the convolutional neural network algorithm. Because the segmented convolution operation of these embodiments uses the segmented convolution input data as the multiplier and the segmented convolution kernel as the multiplicand, the opposite of the conventional arrangement in which the convolution kernel is the multiplier and the convolution input data the multiplicand, the rows and columns of the data produced by the segmented convolution operation are swapped, and the resulting matrix must be transposed. For example, where the prior art obtains a convolution result with N rows and P columns, the embodiments of the present application obtain a result with P rows and N columns; transposing the result of these embodiments yields the N-row, P-column convolution result.
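A minimal sketch of this final step follows, with ReLU standing in for whichever activation the post-processing actually applies; `finish_convolution` is an illustrative name, not the patent's interface.

```python
import numpy as np

def finish_convolution(buffer_PxN: np.ndarray, apply_post: bool = True) -> np.ndarray:
    """Optional post-processing (here ReLU as one possible activation),
    then a transpose turning the P-row, N-column result of the segmented
    scheme into the conventional N-row, P-column layout."""
    data = np.maximum(buffer_PxN, 0.0) if apply_post else buffer_PxN
    return data.T  # P x N -> N x P

result = finish_convolution(np.array([[1.0, -2.0], [3.0, 4.0], [-5.0, 6.0]]))
print(result.shape)  # (2, 3): N rows, P columns
```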
The integrated chip 12 in FIG. 2A can segment the convolution kernel and the convolution input data. Because the kernel and the input data are split, a single segmented convolution operation does not require repeatedly reading the segmented convolution kernel and the segmented convolution input data into the input buffer, which lowers the bandwidth needed between the memory and the input buffer 124 during convolution. At the same time, because both the segmented convolution kernel and the segmented convolution input data are small, an input buffer 124 with less storage space can be used, reducing the demands the convolutional neural network algorithm places on the input buffer 124; and because both are small, convolution can still be performed when the logic circuits of the convolution processor 123 are limited. It should be noted that the bandwidth of a device can refer to the data transfer capability between two devices. For example, the bandwidth between the memory and the input buffer 124 can be understood as the speed at which the input buffer 124 reads data from the memory and at which data is read back from the input buffer 124 to the memory; the larger the bandwidth, the faster the reads. Bandwidth can be measured in Gb/s.
The convolution operation in the embodiments of the present application can be applied in the field of image processing, for example in application scenarios such as image recognition, image classification and image filtering. As shown in FIG. 1C, input image data passes through convolution, pooling, normalization and other operations, and finally through fully connected layers and a regression operation, after which the image processing result is output. There can be multiple convolution layers; each layer convolves that layer's convolution input data with that layer's convolution kernel, and each layer's convolution result can serve as the convolution input data of the next layer. The convolution input data of the first layer is the input image data, so the first layer convolves the input image data with the first layer's convolution kernel. The input image data can be the data of all pixels of one image (for example, grayscale or RGB values), such as 1000×600 pixels forming 1000×600×3 data (3 for the RGB values). In each convolution layer, the embodiments of the present application can split that layer's convolution kernel and convolution input data to obtain multiple segmented convolution kernels and corresponding segmented convolution input data; a single segmented convolution operation then does not require repeatedly reading the segmented convolution kernel and the segmented convolution input data into the input buffer, lowering the bandwidth needed between the on-chip memory 125 and the input buffer 124 during convolution. At the same time, because both the segmented convolution kernel and the segmented convolution input data are small, an input buffer 124 with less storage space can be used, reducing the algorithm's demands on the input buffer 124, and convolution can still be performed when the logic circuits of the convolution processor 123 are limited.
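The layer chaining described above can be sketched as follows; `run_network` and `conv_layer` are illustrative stand-ins for the chip's segmented convolution pipeline, not names from the patent.

```python
def run_network(image_data, kernels, conv_layer):
    """Chain the layers: each layer's convolution result becomes the
    next layer's convolution input data, as described above."""
    conv_input = image_data            # layer 1 consumes the raw image data
    for kernel in kernels:
        conv_input = conv_layer(conv_input, kernel)
    return conv_input                  # then pooling / fully connected layers

# Toy demo with scalar "layers", purely to show the chaining:
print(run_network(2, [3, 4], lambda x, k: x * k))  # 24
```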
Referring to FIG. 6, FIG. 6 is a schematic flowchart of a convolution operation processing method provided by an embodiment of the present application. As shown in FIG. 6, the method includes the following steps.
601. Load a segmented convolution kernel and segmented convolution input data into an input buffer; the segmented convolution kernel is data obtained by segmenting a convolution kernel, and the segmented convolution input data is data obtained by segmenting the convolution input data.
602. Perform a segmented convolution operation on the segmented convolution kernel and the segmented convolution input data to obtain a segmented convolution result, and store the segmented convolution result in an output buffer.
In the embodiments of the present application, steps 601 and 602 may be executed by the integrated chip shown in FIG. 1A or FIG. 1B. The integrated chip performs the convolution operation and includes a controller, a convolution processor, an input buffer, an output buffer, an on-chip memory and so on. Steps 601 and 602 may also be executed by an apparatus comprising a central processing unit, an input/output device and a memory connected by a communication bus, where the central processing unit performs the convolution and contains logic circuits for convolution (for example, multipliers, adders and multiply-accumulators), and the memory stores the segmented convolution kernel and the segmented convolution input data. For ease of description, the method of FIG. 6 is explained with the integrated chip of FIG. 1A or FIG. 1B as the executing entity.
For the specific implementation of the method shown in FIG. 6, refer to the embodiments of the integrated chip in FIGS. 2A to 5 above, which are not repeated here.
By implementing the method shown in FIG. 6 and splitting the convolution kernel, a single segmented convolution operation does not require repeatedly reading the segmented convolution kernel and the segmented convolution input data from the on-chip memory into the input buffer, lowering the bandwidth needed between the on-chip memory and the input buffer during convolution. Because both the segmented convolution kernel and the segmented convolution input data are small, an input buffer with less storage space can be used, reducing the convolutional neural network algorithm's demands on device bandwidth and on the input buffer.
In some embodiments, before step 601 the following steps may also be performed:
receiving convolution input data and a segmented convolution kernel, and segmenting the convolution input data to obtain segmented convolution input data;
or, receiving convolution input data and a convolution kernel, segmenting the convolution input data to obtain segmented convolution input data, and segmenting the convolution kernel to obtain segmented convolution kernels.
In the embodiments of the present application, the convolution kernel may be segmented by the server 11 in FIG. 1A or FIG. 1B, or by the integrated chip 12; the convolution input data may be segmented by the integrated chip 12. When the convolution kernel is segmented by the server 11 of FIG. 1A, the integrated chip need not segment it again and can start the segmented convolution quickly, improving the processing efficiency of the convolution operation. When the convolution kernel is segmented by the integrated chip 12 of FIG. 1A, the server 11 need not segment it, lightening the load on the server 11.
In some embodiments, the number of rows of the matrix corresponding to the segmented convolution kernel and the number of columns of the matrix corresponding to the segmented convolution input data correspond to the number of logic circuits performing the convolution operation.
In some embodiments, this correspondence includes:
the number of rows of the matrix corresponding to the segmented convolution kernel being less than or equal to the number of multiply-accumulators performing the convolution operation, and the number of columns of the matrix corresponding to the segmented convolution input data being less than or equal to the number of multiply-accumulators performing the convolution operation.
Using multiply-accumulators for the convolution operation, the result of the multiplication can be added to another operand, and the multiplication and the addition can be performed within one clock cycle, reducing the execution latency of the whole multiply-add operation.
In some embodiments, performing the segmented convolution operation on the segmented convolution kernel and the segmented convolution input data includes:
converting the segmented convolution kernel into a segmented convolution kernel matrix and converting the segmented convolution input data into a segmented convolution input matrix;
performing the product with the segmented convolution input matrix as the multiplier and the segmented convolution kernel matrix as the multiplicand.
In the embodiments of the present application, because the segmented convolution input matrix is usually far larger than the segmented convolution kernel matrix, using the segmented convolution input matrix as the multiplier and the segmented convolution kernel matrix as the multiplicand reduces the storage space of the input buffer 124 occupied by each matrix product. Because both the convolution kernel and the convolution input data are segmented, the row count of the segmented convolution kernel matrix is markedly smaller than the column count of the matrix an unsplit kernel would expand into, so the convolution processor needs only available logic circuits equal to or greater in number than the rows of the segmented convolution kernel matrix to complete the segmented convolution operation. Implementing these embodiments, the convolution operation can be completed with limited available logic circuits, saving the logic circuits used for convolution.
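For illustration, the following sketch uses a common unfold-to-matrix ("im2col") conversion to show the input matrix acting as the multiplier and the kernel matrix as the multiplicand; the patent does not prescribe this particular conversion routine, and `im2col` is an illustrative name.

```python
import numpy as np

def im2col(image: np.ndarray, kh: int, kw: int) -> np.ndarray:
    """Unfold a 2-D image into a matrix whose rows are the receptive
    fields of each output position, one common way of turning a
    convolution into a matrix product."""
    H, W = image.shape
    rows = []
    for i in range(H - kh + 1):
        for j in range(W - kw + 1):
            rows.append(image[i:i+kh, j:j+kw].ravel())
    return np.array(rows)                  # P rows, U columns

image = np.arange(16, dtype=float).reshape(4, 4)
kernel = np.ones((3, 3))
X = im2col(image, 3, 3)                    # input matrix: the multiplier
Wm = kernel.reshape(-1, 1)                 # kernel matrix: the multiplicand
print((X @ Wm).reshape(2, 2))              # P x N result, here 4 x 1 -> 2 x 2
```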
In some embodiments, storing the segmented convolution result in the output buffer 126 includes:
accumulating the segmented convolution result with the data stored in the output buffer and writing the sum into the output buffer 126.
In the embodiments of the present application, the segmented convolution results are accumulated immediately after each segmented convolution operation rather than after all segmented convolution operations have finished, which improves the processing efficiency of the whole convolution operation.
In some embodiments, the input buffer contains a first storage space and a second storage space, each used to store data; loading the segmented convolution kernel into the input buffer in step 601 includes:
after the data in the second storage space has finished taking part in the segmented convolution operation, moving the data in the first storage space into the second storage space and then loading the segmented convolution kernel into the first storage space.
Here, moving the data of the first storage space into the second storage space after the data in the second storage space has finished its segmented convolution may overwrite the data originally in the second storage space, and loading the segmented convolution kernel into the first storage space may overwrite the data originally in the first storage space. The data in question may be segmented convolution kernels: storing segmented convolution kernels in the input buffer 124 in this "ping-pong storage" manner guarantees that the input buffer 124 always holds two segmented convolution kernels, so after finishing one segmented convolution operation the convolution processor 123 can start the next one immediately, without waiting for the next segmented convolution kernel to load, improving the processing efficiency of the convolution operation.
In some embodiments, if the first storage space and the second storage space are empty, loading the segmented convolution kernel into the input buffer in step 601 includes:
loading the segmented convolution kernel into the second storage space;
loading another segmented convolution kernel, different from that segmented convolution kernel, into the first storage space.
This is the initial store of the "ping-pong storage": when data is first loaded into the input buffer 124, two segmented convolution kernels can be loaded into the input buffer 124 at once.
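A minimal Python sketch of the ping-pong scheme follows; the class `PingPongBuffer` and its method names are illustrative, not from the patent.

```python
class PingPongBuffer:
    """Two-slot kernel cache: the second space feeds the current
    segmented convolution while the first space preloads the next
    segmented kernel, so the processor never waits on a load."""
    def __init__(self):
        self.first = None    # first storage space (preload slot)
        self.second = None   # second storage space (active slot)

    def initial_load(self, kernel_a, kernel_b):
        self.second = kernel_a   # first segmented kernel, used first
        self.first = kernel_b    # second segmented kernel, on deck

    def advance(self, next_kernel):
        # After the data in the second space finishes its segmented
        # convolution, move the first space's data into the second space
        # (overwriting it), then load the new kernel into the first space.
        self.second = self.first
        self.first = next_kernel

buf = PingPongBuffer()
buf.initial_load("kernel0", "kernel1")
for k in ["kernel2", "kernel3"]:
    # ... run the segmented convolution with buf.second here ...
    buf.advance(k)
```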
Referring to FIG. 7, FIG. 7 is a schematic flowchart of another convolution operation processing method provided by an embodiment of the present application. FIG. 7 is obtained by further optimization on the basis of FIG. 6. As shown in FIG. 7, the method includes:
701. Load a segmented convolution kernel and segmented convolution input data into an input buffer; the segmented convolution kernel is data obtained by segmenting a convolution kernel, and the segmented convolution input data is data obtained by segmenting the convolution input data.
702. Perform a segmented convolution operation on the segmented convolution kernel and the segmented convolution input data to obtain a segmented convolution result, and store the segmented convolution result in an output buffer.
703. After all the data of the convolution kernel has taken part in segmented convolution operations, perform post-processing computation on the data stored in the output buffer to obtain a post-processed convolution result, and transpose the matrix corresponding to the post-processed convolution result to obtain the transpose of that matrix; or, after all the data of the convolution kernel has taken part in segmented convolution operations, transpose the matrix corresponding to the data stored in the output buffer to obtain the transpose of that matrix.
704. Store the convolution result corresponding to the transposed matrix in an on-chip memory as convolution input data.
Here, the post-processing computation includes at least one of activation function computation, pooling computation and normalization computation; whether to perform post-processing is decided by the convolutional neural network algorithm. In the image field, because neighboring data are correlated, pooling computation prunes the data in the convolution result, removing some redundancy. For example, for 24×24 original image data convolved with a 5×5 kernel, the convolution result is 20×20; after 2×2 pooling, the final result becomes 10×10.
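The sizes quoted above can be checked with a short NumPy sketch, assuming max pooling as the pooling flavor (the patent does not fix one):

```python
import numpy as np

# A 24x24 image convolved with a 5x5 kernel gives a 20x20 result;
# 2x2 pooling then halves each dimension to 10x10.
conv_out = np.random.rand(24 - 5 + 1, 24 - 5 + 1)    # 20 x 20
pooled = conv_out.reshape(10, 2, 10, 2).max(axis=(1, 3))
print(conv_out.shape, pooled.shape)                  # (20, 20) (10, 10)
```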
Because the segmented convolution operation of the embodiments of the present application uses the segmented convolution input data as the multiplier and the segmented convolution kernel as the multiplicand, the opposite of the conventional arrangement in which the convolution kernel is the multiplier and the convolution input data the multiplicand, the rows and columns of the data produced by the segmented convolution operation are swapped, and the resulting matrix must be transposed. For example, where the prior art obtains a convolution result with N rows and P columns, these embodiments obtain a result with P rows and N columns; transposing the result of these embodiments yields the N-row, P-column convolution result.
For steps 701 and 702, refer to steps 601 and 602 shown in FIG. 6, which are not repeated here.
For the specific implementation of the method shown in FIG. 7, refer to the embodiments of the integrated chip in FIGS. 2A to 5 above, which are not repeated here.
Referring to FIG. 8, FIG. 8 is a schematic structural diagram of a convolution operation processing apparatus provided by an embodiment of the present application. The apparatus includes a loading unit 801, a segmented convolution unit 802 and a storage unit 803, where:
the loading unit 801 is configured to load a segmented convolution kernel and segmented convolution input data into an input buffer, the segmented convolution kernel being data obtained by segmenting a convolution kernel and the segmented convolution input data being data obtained by segmenting convolution input data;
the segmented convolution unit 802 is configured to perform a segmented convolution operation on the segmented convolution kernel and the segmented convolution input data to obtain a segmented convolution result; and
the storage unit 803 is configured to store the segmented convolution result in an output buffer.
For the implementation of the convolution operation processing apparatus shown in FIG. 8, refer to the method embodiments shown in FIGS. 6 to 7; overlapping details are not repeated here.
With the convolution operation processing apparatus shown in FIG. 8, a single segmented convolution operation does not require repeatedly reading the segmented convolution kernel and the segmented convolution input data from the on-chip memory into the input buffer, which lowers the bandwidth needed between the on-chip memory and the input buffer during convolution.
Referring to FIG. 9, FIG. 9 is a schematic structural diagram of another convolution operation processing apparatus provided by an embodiment of the present application. As shown in FIG. 9, the apparatus may include a memory 901, a processor 902 and an input/output device 903, connected by a communication bus 904. The memory 901 stores program instructions suitable for loading by the processor 902; the input/output device 903 may receive convolution input data and output convolution processing results.
The processor 902 is configured to load the program instructions and perform some or all of the method steps in FIGS. 6 to 7 above.
With the convolution operation processing apparatus shown in FIG. 9, a single segmented convolution operation does not require repeatedly reading the segmented convolution kernel and the segmented convolution input data from the on-chip memory into the input buffer, which lowers the bandwidth needed between the on-chip memory and the input buffer during convolution.
An embodiment of the present application further provides a computer storage medium storing a plurality of program instructions suitable for loading by a processor to perform some or all of the steps of any convolution operation processing method described in the method embodiments above.
The steps of the methods in the embodiments of the present application may be reordered, combined or deleted according to actual needs.
The units in the convolution operation processing apparatus of the embodiments of the present application may be combined, divided or deleted according to actual needs.
A person of ordinary skill in the art will understand that all or some of the steps of the various methods in the above embodiments may be completed by a program instructing the related hardware; the program may be stored in a computer-readable storage medium, which includes read-only memory (ROM), random access memory (RAM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical disc storage, magnetic disk storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.
In summary, the scope of the claims should not be limited to the implementations in the examples described above; the description should be taken as a whole and given the broadest interpretation.

Claims (39)

  1. An integrated chip, comprising: a controller, a convolution processor, an input buffer and an output buffer;
    the controller loads a segmented convolution kernel and segmented convolution input data into the input buffer; the segmented convolution kernel is data obtained by segmenting a convolution kernel; the segmented convolution input data is data obtained by segmenting convolution input data;
    the convolution processor performs a segmented convolution operation on the segmented convolution kernel and the segmented convolution input data to obtain a segmented convolution result, and stores the segmented convolution result in the output buffer.
  2. The integrated chip according to claim 1, wherein the controller is configured to:
    obtain a convolution kernel, the convolution kernel comprising a plurality of convolution elements arranged in an N-dimensional space, N being a positive integer;
    obtain the size of the input buffer;
    divide the convolution kernel in the N-dimensional space according to the size of the input buffer to obtain a plurality of segmented convolution kernels, wherein each segmented convolution kernel comprises a plurality of adjacent convolution elements in the N-dimensional space, and the data volume of each segmented convolution kernel is smaller than the size of the input buffer.
  3. The integrated chip according to claim 2, wherein the controller is configured to:
    obtain the number of logic circuits in the convolution processor, and determine a first number of convolution elements in each segmented convolution kernel according to the number of logic circuits;
    obtain the size of the input buffer, and determine a second number of convolution elements in each segmented convolution kernel according to the size of the input buffer;
    divide the convolution kernel in the N-dimensional space to obtain a plurality of segmented convolution kernels, wherein the number of convolution elements in each segmented convolution kernel is the smaller of the first number and the second number.
  4. The integrated chip according to claim 2, wherein the controller is configured to:
    determine a first direction, among the N directions of the N-dimensional space, in which the convolution input data has the fewest data elements;
    cut the convolution kernel along the first direction to obtain the plurality of segmented convolution kernels.
  5. The integrated chip according to any one of claims 1-4, wherein the controller is configured to:
    obtain convolution input data, the convolution input data comprising a plurality of data elements arranged in an N-dimensional space, N being a positive integer;
    segment the convolution input data to obtain a plurality of pieces of segmented convolution input data, wherein the number and N-dimensional arrangement of the data elements in each piece of segmented convolution input data are the same as the number and arrangement of the convolution elements in each segmented convolution kernel, and the plurality of pieces of segmented convolution input data comprise a group of segmented convolution input data corresponding to each of the segmented convolution kernels respectively;
    load a first segmented convolution kernel of the segmented convolution kernels and first segmented convolution input data of a first group of segmented convolution input data corresponding to the first segmented convolution kernel into the input buffer;
    after the convolution processor performs a segmented convolution operation on the first segmented convolution input data and the first segmented convolution kernel, load second segmented convolution input data of the first group into the input buffer to replace the first segmented convolution input data, so that the convolution processor performs a segmented convolution operation on the second segmented convolution input data and the first segmented convolution kernel.
  6. The integrated chip according to claim 5, wherein the integrated chip comprises a plurality of convolution processors; and
    the controller is configured to load first segmented convolution input data of the plurality of pieces of segmented convolution input data into the input buffer corresponding to each convolution processor, and to load the plurality of first segmented convolution kernels corresponding to the first segmented convolution input data into the input buffer corresponding to each convolution processor respectively, so that each convolution processor convolves the first segmented convolution input data with its own first segmented convolution kernel.
  7. The integrated chip according to claim 5, wherein the controller is configured to:
    load the part of the second segmented convolution input data that differs from the first segmented convolution input data into the input buffer, so as to form the second segmented convolution input data in the input buffer.
  8. The integrated chip according to claim 5, wherein the controller is configured to:
    after the convolution processor completes the segmented convolution operations of the first segmented convolution kernel and the first group of segmented convolution input data, load a second segmented convolution kernel of the segmented convolution kernels and third segmented convolution input data of a second group of segmented convolution input data corresponding to the second segmented convolution kernel into the input buffer, so that the convolution processor performs a segmented convolution operation on the second segmented convolution kernel and the third segmented convolution input data; and
    the convolution processor is configured to superimpose the segmented convolution result of the second segmented convolution kernel and the third segmented convolution input data onto a second segmented convolution result stored in the output buffer, the second segmented convolution result being the segmented convolution result corresponding to data elements of the same row of the convolution input data.
  9. The integrated chip according to claim 8, wherein the controller is configured to:
    load the part of the second segmented convolution kernel that differs from the first segmented convolution kernel into the input buffer to form the second segmented convolution kernel.
  10. The integrated chip according to claim 1, further comprising: an on-chip memory;
    wherein the controller is configured to extract, from original convolution input data stored in an off-chip memory, data for performing a plurality of convolution operations as the convolution input data, and to load the convolution input data into the on-chip memory.
  11. The integrated chip according to claim 10, wherein the controller is configured to:
    extract second data for performing a plurality of convolution operations from the original convolution input data, load the part of the second data that differs from the convolution input data currently in the on-chip memory into the on-chip memory so as to form the second data in the on-chip memory, and use the second data in the on-chip memory as the convolution input data for convolution operations.
  12. The integrated chip according to claim 11, wherein the controller is configured to:
    before the convolution operations on the convolution input data currently stored in the on-chip memory are finished, extract second data for performing a plurality of convolution operations from the original convolution input data, and load the part of the second data that differs from the convolution input data currently stored in the on-chip memory into the on-chip memory to form the second data; and
    after the convolution operations on the convolution input data currently stored in the on-chip memory are finished, use the second data in the on-chip memory as the convolution input data for convolution operations.
  13. The integrated chip according to claim 1, wherein
    the number of rows of the matrix corresponding to the segmented convolution kernel and the number of columns of the matrix corresponding to the segmented convolution input data correspond to the number of logic circuits of the convolution processor.
  14. The integrated chip according to claim 13, wherein
    the number of rows of the matrix corresponding to the segmented convolution kernel is less than or equal to the number of multiply-accumulators of the convolution processor, and the number of columns of the matrix corresponding to the segmented convolution input data is less than or equal to the number of multiply-accumulators of the convolution processor.
  15. The integrated chip according to any one of claims 1 and 13-14, wherein the convolution processor is configured to:
    convert the segmented convolution kernel into a segmented convolution kernel matrix, and convert the segmented convolution input data into a segmented convolution input matrix; and
    perform the product with the segmented convolution input matrix as the multiplier and the segmented convolution kernel matrix as the multiplicand.
  16. The integrated chip according to claim 15, wherein the convolution processor is configured to:
    accumulate the segmented convolution result with the data stored in the output buffer and write the sum into the output buffer.
  17. The integrated chip according to claim 16, wherein the convolution processor is configured to:
    after all the data of the convolution kernel has taken part in segmented convolution operations, perform post-processing computation on the data stored in the output buffer to obtain a post-processed convolution result, and transpose the matrix corresponding to the post-processed convolution result to obtain the transpose of that matrix;
    or, after all the data of the convolution kernel has taken part in segmented convolution operations, transpose the matrix corresponding to the data stored in the output buffer to obtain the transpose of that matrix; and
    the controller is configured to store the convolution result corresponding to the transposed matrix in an on-chip memory as convolution input data.
  18. The integrated chip according to any one of claims 1 and 13-14, wherein the input buffer contains a first storage space and a second storage space, each used to store data; and the controller loading the segmented convolution kernel into the input buffer comprises:
    after the data in the second storage space has finished taking part in a segmented convolution operation, the controller moving the data in the first storage space into the second storage space and then loading the segmented convolution kernel into the first storage space.
  19. The integrated chip according to claim 1, further comprising: an on-chip memory;
    wherein the on-chip memory receives the convolution input data and the segmented convolution kernel, or the on-chip memory receives the convolution input data and the convolution kernel.
  20. A convolution operation processing method, applied to a computing platform, comprising:
    loading a segmented convolution kernel and segmented convolution input data into an input buffer, the segmented convolution kernel being data obtained by segmenting a convolution kernel and the segmented convolution input data being data obtained by segmenting convolution input data; and
    performing a segmented convolution operation on the segmented convolution kernel and the segmented convolution input data to obtain a segmented convolution result, and storing the segmented convolution result in an output buffer.
  21. The method according to claim 20, further comprising:
    obtaining a convolution kernel, the convolution kernel comprising a plurality of convolution elements arranged in an N-dimensional space, N being a positive integer; and
    dividing the convolution kernel in the N-dimensional space according to the size of the input buffer to obtain a plurality of segmented convolution kernels, wherein each segmented convolution kernel comprises a plurality of adjacent convolution elements in the N-dimensional space, and the data volume of each segmented convolution kernel is smaller than the size of the input buffer.
  22. The method according to claim 21, wherein dividing the convolution kernel in the N-dimensional space according to the size of the input buffer to obtain a plurality of segmented convolution kernels comprises:
    obtaining a preset number of multiplication units used for the segmented convolution operation, and determining a first number of convolution elements in each segmented convolution kernel according to the number of multiplication units;
    obtaining the size of the input buffer, and determining a second number of convolution elements in each segmented convolution kernel according to the size of the input buffer; and
    dividing the convolution kernel in the N-dimensional space to obtain the plurality of segmented convolution kernels, wherein the number of convolution elements in each segmented convolution kernel is the smaller of the first number and the second number.
  23. The method according to claim 21, wherein dividing the convolution kernel in the N-dimensional space according to the size of the input buffer to obtain a plurality of segmented convolution kernels comprises:
    determining a first direction, among the N directions of the N-dimensional space, in which the convolution input data has the fewest data elements; and
    dividing the convolution kernel along the first direction to obtain the plurality of segmented convolution kernels.
  24. The method according to any one of claims 20-23, further comprising:
    obtaining convolution input data, the convolution input data comprising a plurality of data elements arranged in an N-dimensional space, N being a positive integer;
    segmenting the convolution input data to obtain a plurality of pieces of segmented convolution input data, wherein the number and N-dimensional arrangement of the data elements in each piece of segmented convolution input data are the same as the number and arrangement of the convolution elements in each segmented convolution kernel, and the plurality of pieces of segmented convolution input data comprise a group of segmented convolution input data corresponding to each of the segmented convolution kernels respectively;
    loading a first segmented convolution kernel of the segmented convolution kernels and first segmented convolution input data of a first group of segmented convolution input data corresponding to the first segmented convolution kernel into the input buffer; and
    after a convolution processor performs a segmented convolution operation on the first segmented convolution input data and the first segmented convolution kernel, loading second segmented convolution input data of the first group into the input buffer to replace the first segmented convolution input data, so that the convolution processor performs a segmented convolution operation on the second segmented convolution input data and the first segmented convolution kernel.
  25. The method according to claim 24, wherein, when a plurality of segmented convolution operations are performed in parallel,
    first segmented convolution input data of the plurality of pieces of segmented convolution input data is loaded into the input buffer corresponding to each convolution processor, and the plurality of first segmented convolution kernels corresponding to the first segmented convolution input data are loaded into the input buffer corresponding to each convolution processor respectively, so that each convolution processor convolves the first segmented convolution input data with its own first segmented convolution kernel.
  26. The method according to claim 24, wherein loading the second segmented convolution input data of the first group into the input buffer comprises:
    loading the part of the second segmented convolution input data that differs from the first segmented convolution input data into the input buffer, so as to form the second segmented convolution input data in the input buffer.
  27. The method according to claim 24, further comprising:
    after the convolution processor completes the segmented convolution operations of the first segmented convolution kernel and the first group of segmented convolution input data, loading a second segmented convolution kernel of the segmented convolution kernels and third segmented convolution input data of a second group of segmented convolution input data corresponding to the second segmented convolution kernel into the input buffer, so that the convolution processor performs a segmented convolution operation on the second segmented convolution kernel and the third segmented convolution input data;
    wherein the convolution processor is configured to superimpose the segmented convolution result of the second segmented convolution kernel and the third segmented convolution input data onto a second segmented convolution result stored in the output buffer, the second segmented convolution result being the segmented convolution result corresponding to data elements of the same row of the convolution input data.
  28. The method according to claim 27, wherein loading the second segmented convolution kernel of the segmented convolution kernels into the input buffer comprises:
    loading the part of the second segmented convolution kernel that differs from the first segmented convolution kernel into the input buffer to form the second segmented convolution kernel.
  29. The method according to claim 20, further comprising:
    extracting, from original convolution input data stored in an off-chip memory, data for performing a plurality of convolution operations as the convolution input data, and loading the convolution input data into an on-chip memory embedded in the computing platform.
  30. The method according to claim 29, further comprising:
    extracting second data for performing a plurality of convolution operations from the original convolution input data, loading the part of the second data that differs from the convolution input data currently stored in the on-chip memory into the on-chip memory to form the second data, and using the second data in the on-chip memory as the convolution input data for convolution operations.
  31. The method according to claim 30, wherein loading the part of the second data that differs from the convolution input data currently stored in the on-chip memory into the on-chip memory to form the second data, and using the second data in the on-chip memory as the convolution input data for convolution operations comprises:
    before the convolution operations on the convolution input data currently stored in the on-chip memory are finished, loading the differing part into the on-chip memory so as to form the second data in the on-chip memory; and
    after the convolution operations on the convolution input data currently stored in the on-chip memory are finished, using the second data in the on-chip memory as the convolution input data for convolution operations.
  32. The method according to claim 20, wherein
    the number of rows of the matrix corresponding to the segmented convolution kernel and the number of columns of the matrix corresponding to the segmented convolution input data correspond to the number of logic circuits performing the convolution operation.
  33. The method according to claim 32, wherein
    the number of rows of the matrix corresponding to the segmented convolution kernel is less than or equal to the number of multiply-accumulators performing the convolution operation, and the number of columns of the matrix corresponding to the segmented convolution input data is less than or equal to the number of multiply-accumulators performing the convolution operation.
  34. The method according to any one of claims 20 and 32-33, wherein performing the segmented convolution operation on the segmented convolution kernel and the segmented convolution input data comprises:
    converting the segmented convolution kernel into a segmented convolution kernel matrix, and converting the segmented convolution input data into a segmented convolution input matrix; and
    performing the product with the segmented convolution input matrix as the multiplier and the segmented convolution kernel matrix as the multiplicand.
  35. The method according to claim 34, wherein storing the segmented convolution result in the output buffer comprises:
    accumulating the segmented convolution result with the data stored in the output buffer and writing the sum into the output buffer.
  36. The method according to claim 35, further comprising:
    after all the data of the convolution kernel has taken part in segmented convolution operations, performing post-processing computation on the data stored in the output buffer to obtain a post-processed convolution result, and transposing the matrix corresponding to the post-processed convolution result to obtain the transpose of that matrix;
    or, after all the data of the convolution kernel has taken part in segmented convolution operations, transposing the matrix corresponding to the data stored in the output buffer to obtain the transpose of that matrix; and
    storing the convolution result corresponding to the transposed matrix in an on-chip memory as convolution input data.
  37. The method according to claim 20, further comprising:
    receiving the convolution input data and the segmented convolution kernel, and segmenting the convolution input data to obtain the segmented convolution input data;
    or, receiving the convolution input data and the convolution kernel, segmenting the convolution input data to obtain the segmented convolution input data, and segmenting the convolution kernel to obtain the segmented convolution kernel.
  38. A convolution operation processing apparatus, comprising a memory and a processor, the memory storing computer-readable instructions that cause the processor to perform the convolution operation processing method according to any one of claims 20-37.
  39. A storage medium storing computer-readable instructions that cause a processor to perform the convolution operation processing method according to any one of claims 20-37.
PCT/CN2018/116086 2017-12-06 2018-11-19 Convolution operation processing method and related product WO2019109795A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US16/678,004 US11449576B2 (en) 2017-12-06 2019-11-08 Convolution operation processing method and related product

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201711283173.9A CN108304923B (zh) 2017-12-06 2017-12-06 Convolution operation processing method and related product
CN201711283173.9 2017-12-06

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/678,004 Continuation US11449576B2 (en) 2017-12-06 2019-11-08 Convolution operation processing method and related product

Publications (1)

Publication Number Publication Date
WO2019109795A1

Family

ID=62869725

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2018/116086 WO2019109795A1 (zh) 2017-12-06 2018-11-19 卷积运算处理方法及相关产品

Country Status (3)

Country Link
US (1) US11449576B2 (zh)
CN (1) CN108304923B (zh)
WO (1) WO2019109795A1 (zh)


Also Published As

Publication number Publication date
US11449576B2 (en) 2022-09-20
CN108304923A (zh) 2018-07-20
US20200074288A1 (en) 2020-03-05
CN108304923B (zh) 2022-01-18
