CN115293978A - Convolution operation circuit and method, image processing apparatus

Info

Publication number: CN115293978A
Authority: CN (China)
Prior art keywords: convolution, data, input, result, intermediate data
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202210815191.1A
Other languages: Chinese (zh)
Inventors: 郑军, 韩军, 段旭阳
Current Assignee: Shanghai Weijing Technology Co., Ltd. (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original Assignee: Shanghai Weijing Technology Co., Ltd.
Application filed by Shanghai Weijing Technology Co., Ltd.
Priority: CN202210815191.1A; Publication: CN115293978A (legal status: pending)


Classifications

    • G06T5/70
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00: Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10: Complex mathematical operations
    • G06F17/15: Correlation function computation including computation of convolution operations
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00: Indexing scheme for image analysis or image enhancement
    • G06T2207/10: Image acquisition modality
    • G06T2207/10004: Still image; Photographic image

Abstract

The application discloses a convolution operation circuit, a convolution operation method, and an image processing device. The convolution operation circuit comprises a plurality of operation modules, each performing the operation of a different convolution layer. Each operation module comprises an intermediate memory for caching intermediate data and an operation unit for performing the corresponding convolution operation, with the operation unit and its intermediate memory arranged on the same chip. The operation unit receives input data, performs the corresponding convolution operation on it, caches the intermediate data generated during the operation in the intermediate memory, reads the current intermediate data back from the intermediate memory for the corresponding convolution operation, updates the cached intermediate data with the latest intermediate data, and outputs the convolution result of its operation module. The convolution operation circuit and method can improve the circuit's operation speed and reduce the corresponding execution power consumption.

Description

Convolution operation circuit and method, image processing apparatus
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a convolution operation circuit and method, and an image processing apparatus.
Background
A convolutional neural network algorithm for image denoising must produce output features at the same resolution as the input feature map, so the volume of intermediate result data is extremely large. In a typical neural-network hardware design, the intermediate data generated by one network layer is transferred off-chip and then reloaded into the chip before the next layer's operation. Because on-chip/off-chip transfers have high latency and the image denoising algorithm produces a large amount of intermediate data, the corresponding algorithm tends to run slowly.
Disclosure of Invention
In view of the above, the present application provides a convolution operation circuit and method, and an image processing apparatus, so as to improve the operation speed of the convolution operation process.
The convolution operation circuit comprises a plurality of operation modules for performing different convolution operation layers;
each operation module comprises an intermediate memory for caching intermediate data and an operation unit for performing corresponding convolution operation, and the operation unit and the corresponding intermediate memory are arranged in the same chip;
the operation unit is used for receiving input data, performing corresponding convolution operation on the input data, caching intermediate data in the convolution operation process to the intermediate storage, reading current intermediate data from the intermediate storage for corresponding convolution operation, updating the intermediate data cached by the intermediate storage according to the latest intermediate data, and outputting a convolution operation result of a corresponding operation module.
Optionally, the plurality of operation modules includes an input layer convolution module, at least one depthwise (DW) convolution module, at least one pointwise (PW) convolution module, and an output layer convolution module.
Optionally, the input layer convolution module comprises TM1 first multipliers, TM1 first adders and first intermediate memories, one corresponding to each first multiplier, and a first nonlinear activation unit, where TM1 represents the number of convolution kernels of the input layer convolution module. The TM1 first multipliers each multiply the data of the corresponding channel in the image feature map to obtain first products. Each first adder reads the current first intermediate data from the corresponding first intermediate memory and adds the current first intermediate data to the corresponding first product; if the addition result is intermediate data, it updates the corresponding first intermediate data according to the addition result and stores the updated first intermediate data in the corresponding first intermediate memory. The first nonlinear activation unit activates the addition result that represents the final convolution result of the current layer to obtain the first operation result.
Optionally, the input layer convolution module further includes R1×S1 first input memories, where R1×S1 represents the window size of a single convolution calculation in the input layer convolution module. The R1×S1 first input memories correspond to the R1×S1 positions of the convolution window and store the first input activation data within the window needed in subsequent operations, so that the TM1 first multipliers can read the first input activation data. The first input activation data is taken from the image feature map.
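The input-layer structure above (TM1 kernels sharing one R1×S1 window, each with its own accumulator standing in for the first intermediate memory) can be sketched in software. This is an illustrative model only, not the patent's hardware; the function name and shapes are invented for the sketch.

```python
import numpy as np

def input_layer_convolution(feature_map, kernels):
    """Sketch of the input-layer module: TM1 kernels slide one R1 x S1
    window over a single-channel feature map. Each kernel keeps its own
    accumulator (modelling the per-multiplier first intermediate memory),
    collecting one product per cycle until the window is exhausted."""
    TM1, R1, S1 = kernels.shape
    H, W = feature_map.shape
    out = np.zeros((TM1, H - R1 + 1, W - S1 + 1))
    for y in range(H - R1 + 1):
        for x in range(W - S1 + 1):
            window = feature_map[y:y + R1, x:x + S1]
            acc = np.zeros(TM1)  # one accumulator per kernel
            for r in range(R1):
                for s in range(S1):
                    # TM1 multipliers fire in parallel on one window pixel.
                    acc += kernels[:, r, s] * window[r, s]
            out[:, y, x] = acc
    return out

fm = np.arange(9, dtype=float).reshape(3, 3)
kernels = np.ones((2, 2, 2))  # TM1 = 2 kernels of size 2 x 2
out = input_layer_convolution(fm, kernels)
# out[0, 0, 0] = 0 + 1 + 3 + 4 = 8
```

The nonlinear activation stage is omitted here; in the circuit it would follow the final accumulation of each window.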
Optionally, the DW convolution module comprises TM2 second multipliers, TM2 second adders and second intermediate memories, one corresponding to each second multiplier, and a second nonlinear activation unit, where TM2 represents the number of convolution kernels of the DW convolution module. The TM2 second multipliers each multiply the data of the corresponding channel in the first operation result to obtain second products. Each second adder reads the current second intermediate data from the corresponding second intermediate memory and adds it to the corresponding second product; if the addition result is intermediate data, it updates the second intermediate data according to the addition result and stores the updated second intermediate data in the second intermediate memory. The second nonlinear activation unit activates the addition result representing the final convolution result of the current layer to obtain the second operation result.
Optionally, the PW convolution module comprises TM3×TN3 third multipliers, a third adder, a third intermediate memory, and a third nonlinear activation unit, where TM3 represents the number of convolution kernels of the PW convolution module and TN3 represents the tile size of the PW convolution module on the input-channel scale. The TM3×TN3 third multipliers each multiply the data of the corresponding channel in the second operation result to obtain third products. The third adder adds the third products output by the TM3 groups of third multipliers, reads the current third intermediate data from the third intermediate memory, accumulates the current third intermediate data with the addition result of the corresponding third products, updates the third intermediate data according to the accumulation result, and stores the updated third intermediate data in the third intermediate memory. The third nonlinear activation unit activates the accumulation result representing the final convolution result of the current layer to obtain the third operation result.
Optionally, the PW convolution module further includes TN3 second input memories. The TN3 second input memories store the second input activation data in the convolution window needed in subsequent operations, so that the TM3×TN3 third multipliers can read the second input activation data. The second input activation data is taken from the second operation result.
Optionally, the output layer convolution module comprises TN4 fourth multipliers, a fourth adder, a fourth intermediate memory, and a fourth nonlinear activation unit, where TN4 represents the tile size on the input-channel scale in the output layer convolution module. The TN4 fourth multipliers each multiply the data of the corresponding channel in the third operation result to obtain fourth products. The fourth adder adds the fourth products output by the TN4 fourth multipliers, reads the current fourth intermediate data from the fourth intermediate memory, adds the current fourth intermediate data to the addition result of the corresponding fourth products, updates the fourth intermediate data according to the addition result, and stores the updated fourth intermediate data in the fourth intermediate memory. The fourth nonlinear activation unit activates the addition result representing the final convolution result of the current layer to obtain the fourth operation result.
Optionally, the first, second, third, and fourth nonlinear activation units each include an adder group and a comparator group. The adder group applies an accumulated bias to the addition result representing the final convolution result of the current layer; the comparator group applies nonlinear activation processing to the bias-accumulated result to obtain the corresponding operation result.
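The adder-group/comparator-group activation unit can be sketched as follows. The choice of a ReLU-style compare-against-zero is an assumption; the patent only specifies that a comparator group performs the nonlinearity, and the function name is invented for the sketch.

```python
def activation_unit(conv_sum, bias):
    """Model of a nonlinear activation unit: the adder group accumulates
    the bias onto the final convolution sum; the comparator group then
    applies a compare-and-select against zero (a ReLU-style nonlinearity,
    one plausible realization of a comparator-based activation)."""
    biased = conv_sum + bias             # adder group
    return biased if biased > 0 else 0   # comparator group

r1 = activation_unit(2.5, -1.0)  # 2.5 - 1.0 = 1.5, positive, passes through
r2 = activation_unit(-2.0, 1.0)  # -2.0 + 1.0 = -1.0, clipped to 0
```

Comparator-based activations are attractive in hardware because they need no multipliers: a single compare-and-select per output suffices.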
Optionally, the intermediate memory comprises a static random access memory.
The present application further provides a convolution operation method applied to any one of the convolution operation circuits, including:
receiving input data, performing corresponding convolution operation on the input data, and caching intermediate data in the convolution operation process to an intermediate memory;
and reading the current intermediate data from the intermediate memory for corresponding convolution operation, updating the intermediate data cached by the intermediate memory according to the latest intermediate data, and outputting the convolution operation result of the corresponding operation module.
The present application also provides an image processing apparatus including any one of the convolution operation circuits described above.
In the convolution operation circuit, the method, and the image processing device above, each of the plurality of operation modules includes an intermediate memory for caching intermediate data, so that the corresponding intermediate data is stored for later steps of the convolution process to fetch. The corresponding operation module therefore does not need to load intermediate data from outside the chip repeatedly during operation, which saves the energy and transmission time of moving data between the inside and outside of the chip, improves the operation speed of the convolution process, and reduces the corresponding execution power consumption.
Further, each operation module is suitable for a convolutional neural network hardware accelerator implemented on a fused-layer architecture and can efficiently execute convolutional neural network operations under such an architecture. Both underlying operators of depthwise separable convolution, DW convolution and PW convolution, can be accelerated in the accelerator. The design meets the data reuse requirements of a fused-layer architecture, accelerates workloads that combine several operator types, completes convolutional neural network tasks at high speed, and, because the operation process is pipelined, achieves high hardware utilization.
Furthermore, the computation of each operation module is unrolled for hardware mapping, with different operation modules responsible for accelerating different operators. During convolutional neural network operation, every operation module remains consistently at full load, which further improves hardware utilization.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present application; those skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a schematic diagram of a convolution operation circuit according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an operation module according to an embodiment of the present application;
FIGS. 3a and 3b are schematic diagrams of the structure of an input layer convolution module according to an embodiment of the present application;
FIG. 4 is a diagram illustrating the DW convolution module according to an embodiment of the present application;
FIG. 5 is a schematic structural diagram of a PW convolution module according to an embodiment of the present application;
fig. 6 is a schematic structural diagram of an output layer convolution module according to an embodiment of the present application.
Detailed Description
The inventor's research found that hardware accelerating a convolutional-neural-network-based image denoising algorithm can adopt a fused-layer architecture: intermediate data is kept on-chip by exploiting the data reuse opportunity in the overlap between convolution windows, which increases operation speed and reduces power consumption to some extent. A lightweight convolutional neural network algorithm can equivalently convert a conventional convolution into a depthwise separable convolution. A conventional convolution kernel has four dimensions, and typically all four are greater than 1. In the first step, the kernel of the DW convolution is single-channel, while the other three dimensions are the same as in conventional convolution. The second step performs a PW (pointwise) convolution on the feature map produced by the DW convolution, where the PW kernel size is 1×1. Compared with conventional convolution, depthwise separable convolution therefore greatly reduces the parameter count, making hardware mapping easier and requiring fewer operations. However, few current hardware accelerators are dedicated to accelerating both DW convolution and PW convolution.
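The parameter reduction described above follows directly from the kernel shapes: a conventional K×K convolution from C_in to C_out channels needs K·K·C_in·C_out weights, while the DW step needs K·K·C_in and the PW step C_in·C_out. A quick check (illustrative sizes, not from the patent):

```python
def conventional_params(k, c_in, c_out):
    # One k x k kernel spanning all input channels, per output channel.
    return k * k * c_in * c_out

def separable_params(k, c_in, c_out):
    depthwise = k * k * c_in   # DW: one single-channel k x k kernel per input channel
    pointwise = c_in * c_out   # PW: 1 x 1 kernels across all channels
    return depthwise + pointwise

# Example: a 3x3 convolution with 64 input and 64 output channels.
conv = conventional_params(3, 64, 64)  # 36864
sep = separable_params(3, 64, 64)      # 576 + 4096 = 4672, roughly 7.9x fewer
```

The same factor applies to multiply operations per output pixel, which is why the split eases both hardware mapping and operation count.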
The inventor further found that, among hardware devices for accelerating CNN-based image denoising, execution substrates such as the central processing unit (CPU) and graphics processing unit (GPU) cannot, because of their structure, meet the high-speed, low-power requirements of a convolutional neural network (CNN) image denoising algorithm. A CNN-based image denoising network involves a large amount of computation and an extremely large amount of intermediate data to store, so the hardware accelerator must exploit data reuse more fully. The usual CNN accelerator design places convolutional layers into the operation array one by one, writes the intermediate results off-chip, and sets the computation granularity to a whole convolutional layer. For a CNN-based image denoising task, fully reusing data requires keeping all intermediate data on-chip, reducing the computation granularity, and designing a pipelined computation architecture. Depthwise separable convolution also has its own computational characteristics compared with conventional convolution, yet general CNN accelerators rarely adopt a dedicated hardware acceleration scheme for it; operation performance is therefore lost, and depthwise separable convolution cannot fully use computing resources designed for conventional convolution.
Based on the above research, in the convolution operation circuit provided by this application, the operation modules (the input layer convolution module, the DW convolution module, the PW convolution module, and the output layer convolution module) each include an intermediate memory that caches intermediate data for later steps of the convolution operation. The modules therefore need not load intermediate data from off-chip repeatedly, saving the energy and transmission time of moving data on and off the chip, improving operation speed, and reducing execution power consumption. Each operation module suits a convolutional neural network hardware accelerator built on a fused-layer architecture and can efficiently execute convolutional neural network operations under it; both underlying operators of depthwise separable convolution can be accelerated in the accelerator. The design meets the data reuse requirements of a fused-layer architecture, accelerates workloads combining several operators, completes convolutional neural network tasks at high speed, and, with a pipelined operation process, achieves high hardware utilization.
The technical solutions in the embodiments of the present application are clearly and completely described below with reference to the accompanying drawings, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application. The following embodiments and their technical features may be combined with each other without conflict.
A first aspect of the present application provides a convolution operation circuit comprising a plurality of operation modules, each performing the operation of a different convolution layer. Each operation module comprises an intermediate memory for caching intermediate data and an operation unit for performing the corresponding convolution operation, with the operation unit and its intermediate memory arranged on the same chip. The operation unit receives input data, performs the corresponding convolution operation on it, caches intermediate data generated during the operation in the intermediate memory, reads the current intermediate data back from the intermediate memory for the corresponding convolution operation, updates the cached intermediate data with the latest intermediate data, and outputs the convolution result of its operation module.
Specifically, the operation unit may include a multiplier and/or an adder for performing the convolution operation. Optionally, the multiplier multiplies the data of the corresponding channel in the input data to obtain the current product. The adder reads the current intermediate data from the intermediate memory and adds it to the current product. If the addition result is intermediate data, the adder updates the intermediate data according to the addition result and stores the updated intermediate data in the intermediate memory for subsequent steps. If the addition result is the convolution result of the operation module, the module's convolution operation result is determined from it and is either fed to the operation module of the next convolution layer or taken as the final result.
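The read-accumulate-writeback behavior described above can be modelled in a few lines. This is a software sketch, not the patent's circuit; the class name and the single-slot intermediate memory are simplifications invented for illustration.

```python
class OperationUnit:
    """Software model of one operation module: a multiplier feeding an
    adder that keeps its partial sum in an on-chip intermediate memory,
    so no partial result ever leaves the chip."""

    def __init__(self, num_taps):
        self.intermediate = 0.0   # intermediate memory (one partial-sum slot)
        self.num_taps = num_taps  # multiply-accumulate steps per output

    def step(self, activation, weight, tap_index):
        product = activation * weight            # multiplier
        total = self.intermediate + product      # adder: read + add
        if tap_index < self.num_taps - 1:
            self.intermediate = total            # still intermediate: write back
            return None
        self.intermediate = 0.0                  # final tap: emit the result
        return total

unit = OperationUnit(num_taps=3)
acts, wts = [1.0, 2.0, 3.0], [0.5, 0.5, 0.5]
results = [unit.step(a, w, i) for i, (a, w) in enumerate(zip(acts, wts))]
# Only the last step emits a result: (1 + 2 + 3) * 0.5 = 3.0
```

The `tap_index` check stands in for the circuit's decision of whether an addition result is intermediate data or the layer's final convolution result.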
The convolution operation circuit caches intermediate data generated during the convolution operation in the intermediate memory, reads the current intermediate data back for the corresponding convolution operation, and updates the cached intermediate data with the latest values. It therefore does not need to load intermediate data from off-chip repeatedly during operation, saving the energy and transmission time of on-/off-chip transfers, improving the operation speed of the convolution process, and reducing the corresponding execution power consumption.
In one embodiment, the convolution operation circuit may include an input convolution layer, at least one DW convolution layer, at least one PW convolution layer, and an output convolution layer. The input data of the input convolution layer is an image feature map representing the image to be processed; the input data of every other convolution layer is the output feature map of the previous layer. Accordingly, the plurality of operation modules includes an input layer convolution module, at least one DW convolution module, at least one PW convolution module, and an output layer convolution module. The input-output relationship among the modules may be as shown in Fig. 1: input layer convolution module → DW convolution module → PW convolution module → …… → DW convolution module → PW convolution module → output layer convolution module. The operation modules of this embodiment form a neural network accelerator computation module oriented to image denoising, suitable for a convolutional neural network hardware accelerator implemented on a fused-layer architecture, and can efficiently execute convolutional neural network operations under it. The accelerated operators include convolution and depthwise separable convolution; both underlying operators of depthwise separable convolution, DW convolution and PW convolution, can be accelerated.
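The module chain of Fig. 1 can be sketched as a simple function composition, each stage consuming the previous stage's feature map. The stages below are toy stand-ins (invented for this sketch), not models of the actual modules; the point is only the input → (DW → PW)… → output chaining.

```python
def build_pipeline(stages):
    """Chain operation modules so each layer's output feature map feeds
    the next, mirroring Fig. 1:
    input conv -> (DW -> PW) x N -> output conv."""
    def run(feature_map):
        for stage in stages:
            feature_map = stage(feature_map)
        return feature_map
    return run

# Toy stand-in stages for the four module types (hypothetical):
double = lambda x: [2 * v for v in x]        # "input layer conv"
inc = lambda x: [v + 1 for v in x]           # "DW conv"
clip = lambda x: [max(v, 0) for v in x]      # "PW conv" + activation
ident = lambda x: x                          # "output layer conv"

pipeline = build_pipeline([double, inc, clip, ident])
out = pipeline([1, -3])
# [1, -3] -> double [2, -6] -> inc [3, -5] -> clip [3, 0] -> [3, 0]
```

In the actual circuit all stages run concurrently on different windows of the image, which is what makes the pipeline's full-load operation possible.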
The operation modules fit the data reuse requirements of a fused-layer architecture, accelerate workloads combining several operators, and complete convolutional neural network tasks at high speed; the operation process is pipelined, giving high hardware utilization. In the specific operation process, the computation of each convolution layer is unrolled for hardware mapping, different operation modules accelerate different operators, and each operation module can stay at full load, achieving extremely high hardware utilization.
Specifically, the relevant parameters of each operation module are defined in Fig. 2. The input layer convolution module is an operation module for computing a convolution: it takes the pixel information of the image to be denoised by the convolutional neural network as the input feature map, performs the convolution operation to obtain the features extracted by the network's input layer, and passes the corresponding output feature map to the network's subsequent convolution operations. The input layer convolution module is designed for a kernel size of R1×S1 and a tile size of TM1 on the output-channel scale; the window size of a single convolution calculation is R1×S1. In each span of R1×S1 cycles, the module performs the part of a single convolution needed to obtain TM1 output channels, that is, the multiplications of TM1 convolution kernels with the R1×S1 pixels of the feature map window; the total number of multiplier units is TM1×1. The DW convolution module is an operation module for computing the DW convolution: the convolution operations of the middle layers of the image denoising convolutional neural network are equivalently converted into lightweight depthwise separable convolutions, reducing the convolution parameters and easing hardware mapping. This module takes the output of the previous module as the input feature map, performs the DW convolution, and passes the output features to the following PW convolution module.
The DW module is designed for a kernel size of R2×S2 and a tile size of TM2 on the output-channel scale; the window size of a single DW convolution calculation is R2×S2. In each span of R2×S2 cycles, the module performs the part of a single DW convolution needed to obtain TM2 output channels, that is, the multiplications of TM2 DW convolution kernels with the R2×S2 pixels of the feature map window; the total number of multiplier units is TM2×1. The PW convolution module is an operation module for computing the PW convolution: it takes the output of the DW convolution module of the same layer as the input feature map, performs the PW convolution, and passes the output features to the next layer's operation module. It is designed for a kernel size of 1×1, a tile size of TN3 on the input-channel scale, and a tile size of TM3 on the output-channel scale; the window size of a single PW convolution calculation is 1×1. In each cycle, the module performs the part of a single PW convolution that takes TN3 input channels and produces TM3 output channels, that is, the multiplications of TM3 PW convolution kernels with the corresponding 1×1 pixels in TN3 channels of the feature map; the total number of multiplier units is TM3×TN3×1. The output layer convolution module is an operation module for computing a convolution: it takes the output of the previous module as the input feature map and performs the convolution to output the pixel features of the final denoised image produced by the image denoising convolutional neural network.
The output layer module is designed for a kernel size of R4×S4 and a tile size of TN4 on the input-channel scale; the window size of a single convolution calculation is R4×S4. In each span of R4×S4 cycles, the module performs the part of a single convolution that takes TN4 input channels and produces the output features of one channel, that is, the multiplications of the convolution kernel with the R4×S4 pixels in the corresponding TN4 channels of the feature map; the total number of multiplier units is TN4×1×1.
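The per-module multiplier counts above can be collected in one place. The tile sizes in the example call are illustrative values chosen for this sketch, not figures from the patent.

```python
def multiplier_counts(TM1, TM2, TM3, TN3, TN4):
    """Multipliers instantiated per module, per the tiling described
    above: the input-layer and DW modules use one multiplier per output
    tile channel, the PW module a TM3 x TN3 array (the only 2-D array,
    since its 1x1 kernels span TN3 input channels at once), and the
    output layer one multiplier per input tile channel."""
    return {
        "input_layer": TM1 * 1,
        "dw": TM2 * 1,
        "pw": TM3 * TN3,
        "output_layer": TN4 * 1,
    }

# With illustrative tile sizes TM1 = TM2 = TM3 = TN3 = TN4 = 8:
counts = multiplier_counts(8, 8, 8, 8, 8)
# the PW array dominates: 8, 8, 64, 8 multipliers
```

This asymmetry reflects the operators themselves: DW convolution reuses a window per channel, while PW convolution reduces across channels and so benefits from a 2-D multiplier array.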
In one example, the input layer convolution module includes a first intermediate memory configured to cache first intermediate data. The module receives an image feature map, multiplies the data of the corresponding channels in the map to obtain first products, reads the current first intermediate data from the first intermediate memory, and adds it to the first products. If the addition result is intermediate data, the module updates the first intermediate data according to the addition result and stores the updated first intermediate data in the first intermediate memory; if the addition result represents the final convolution result of the current layer, the module determines the first operation result of the input layer convolution module from the addition result.
In one example, the DW convolution module includes a second intermediate memory configured to cache second intermediate data. The module receives the first operation result, multiplies the data of the corresponding channels in it to obtain second products, reads the current second intermediate data from the second intermediate memory, and adds it to each second product to perform the DW convolution operation. If the addition result is intermediate data, the module updates the second intermediate data according to the addition result and stores the updated second intermediate data in the second intermediate memory; if the addition result represents the final convolution result of the current layer, the module determines the second operation result of the DW convolution module from the addition result.
In an example, the PW convolution module includes a third intermediate memory configured to cache third intermediate data, and is configured to receive the second operation result, perform multiplication operations on data of corresponding channels in the second operation result to obtain third products, read current third intermediate data from the third intermediate memory, add the current third intermediate data to the third products to perform the PW convolution operation, update the third intermediate data according to the addition result if the addition result is intermediate data, store the updated third intermediate data in the third intermediate memory, and determine a third operation result of the PW convolution module according to the addition result if the addition result represents a final convolution result of the current layer.
In an example, the output layer convolution module includes a fourth intermediate memory configured to buffer fourth intermediate data, and is configured to receive a third operation result, perform multiplication operations on data of corresponding channels in the third operation result to obtain a fourth product, read current fourth intermediate data from the fourth intermediate memory, add the current fourth intermediate data to the fourth product, update the fourth intermediate data according to an addition result if the addition result is intermediate data, store the updated fourth intermediate data in the fourth intermediate memory, and determine a fourth operation result of the output layer convolution module according to the addition result if the addition result represents a final convolution result of a current layer.
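The four operation modules share the same partial-sum accumulation pattern: multiply, add the product to the cached intermediate value, and either write the sum back to the intermediate memory or emit it as the final result of the layer. A minimal Python sketch of that pattern follows; the function and variable names are illustrative and not taken from the patent:

```python
def accumulate_partial_sum(intermediate_mem, idx, product, is_final):
    """Add a new product to the cached partial sum at position idx.

    Returns the final convolution result when is_final is True,
    otherwise updates the intermediate memory and returns None.
    """
    total = intermediate_mem[idx] + product
    if is_final:
        intermediate_mem[idx] = 0.0   # clear the memory for the next window
        return total                  # pass the value to the activation unit
    intermediate_mem[idx] = total     # cache the partial sum on-chip
    return None

mem = [0.0]
# three partial products for one output pixel; only the last is final
assert accumulate_partial_sum(mem, 0, 1.5, False) is None
assert accumulate_partial_sum(mem, 0, 2.5, False) is None
assert accumulate_partial_sum(mem, 0, 1.0, True) == 5.0
assert mem[0] == 0.0
```

Keeping the partial sums on-chip in this way is what lets each module avoid reloading intermediate data from off-chip between cycles.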
Convolution and depthwise separable convolution are mapped onto the corresponding operation modules, and this pipelined, operator-specific computing architecture supports a fused-layer architecture with a high degree of data reuse, so that extremely high hardware utilization and operation speed can be achieved.
In one embodiment, as shown with reference to FIG. 3a, the input layer convolution module includes TM1 first multipliers 111, TM1 first adders 112 and first intermediate memories 113 respectively corresponding to the first multipliers 111, and a first nonlinear activation unit 114, where TM1 represents the number of convolution kernels of the input layer convolution module. The TM1 first multipliers 111, the TM1 first adders 112 and first intermediate memories 113, and the first nonlinear activation unit 114 may form a first operation unit of the input layer convolution module.
The TM1 first multipliers 111 are configured to respectively perform multiplication operations on data of corresponding channels in the image feature map to obtain first products, so as to realize parallel operation among the first multipliers 111 and improve operation efficiency. The first adder 112 is configured to read current first intermediate data from the corresponding first intermediate memory 113, add the current first intermediate data to the corresponding first product, update the corresponding first intermediate data according to the addition result if the addition result is intermediate data, and store the updated first intermediate data in the corresponding first intermediate memory 113. The first nonlinear activation unit 114 is configured to activate an addition result representing a final convolution result of the current layer to obtain the first operation result, which may be output through a corresponding output channel. Optionally, the TM1 first multipliers 111 may respectively obtain corresponding weights from an external control device or a storage device, so as to perform the multiplication operations on the data of the corresponding channels in the image feature map by using the obtained weights, thereby ensuring the accuracy of the obtained first products. Optionally, the first nonlinear activation unit 114 may activate the corresponding addition result using a ReLU function (rectified linear unit).
In one example, as shown with reference to FIG. 3b, the input layer convolution module further includes R1×S1 first input memories 115, where R1×S1 represents the window size of a single convolution calculation in the input layer convolution module. The R1×S1 first input memories 115 correspond to the R1×S1 data in the convolution window and are used for storing the first input activation data in the convolution window required in subsequent operations, so that the TM1 first multipliers 111 can read the first input activation data and multiplex the first input activation data stored in the first input memories 115, thereby improving the utilization rate of the first input memories 115. The first input activation data are derived from the image feature map.
In the specific operation process, the input layer convolution module performs, in each cycle, the convolution calculation of the partial sums corresponding to 1 datum within a single R1×S1 convolution window for one output-channel slice (the input layer convolution module includes TM1 output channels); the corresponding structure can be seen with reference to FIG. 3b. The operation of the input layer convolution module shown in FIG. 3b can be divided into the following four stages. In the first stage, because of the temporal reusability of input activations, all R1×S1 data in a single-channel R1×S1 convolution window of the image feature map are loaded from off-chip to on-chip and stored in the on-chip first input memories 115, to avoid repeated loading during subsequent operations. In the second stage, because of the spatial reusability of input activations, 1 datum within the single-channel R1×S1 convolution window of the input image feature map is sent over a broadcast network to the TM1 first multipliers 111, and each of the R1×S1 convolution windows in the TM1 weights provides one weight datum, i.e. a total of TM1 weight data are sent over a unicast network to the corresponding first multipliers 111; the TM1 first multipliers 111 operate simultaneously to obtain the output partial sums. In the third stage, the partial sums output by the TM1 multiplication units of the previous stage are added to the first intermediate data previously stored in the first intermediate memories 113. If the value obtained in a memory after the addition is the final convolution calculation result after the partial sums are fully accumulated (i.e. all corresponding data in the R1×S1 convolution windows of all convolution kernels have been operated on), the value in the first intermediate memory 113 is transferred to the first nonlinear activation unit 114 at the next stage, a signal may also be sent to inform the first nonlinear activation unit 114 that the currently transmitted value is a convolution operation result, and the first intermediate memory 113 is cleared; otherwise, the accumulated value is still a partial sum rather than a final result, is stored in the first intermediate memory 113 to await the output of the first multiplier 111 in the next cycle, is not transferred to the next stage, and no signal is sent to the next stage. In the fourth stage, the output data from the first intermediate memories 113 of the previous stage are activated, and the result of the fourth-stage operation is directly used as the output of the input layer convolution module. Optionally, the activation process may include accumulating a bias, nonlinear activation function processing, and the like.
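Under the dataflow described above (one broadcast activation per cycle, one unicast weight per kernel, R1×S1 cycles to finish one output pixel, then ReLU), the four stages of the input layer module can be modeled as a short Python sketch. TM1, the window size, and all data values below are illustrative assumptions, not figures from the patent:

```python
def input_layer_pixel(window, kernels):
    """Compute one output pixel per kernel for a single R1*S1 window.

    window  : flattened list of R1*S1 input activations (single channel),
              assumed already loaded on-chip (stage 1)
    kernels : TM1 lists of R1*S1 weights, one list per convolution kernel
    """
    tm1 = len(kernels)
    intermediate = [0.0] * tm1                    # first intermediate memories 113
    for pos, activation in enumerate(window):     # stage 2: broadcast one activation
        for m in range(tm1):                      # TM1 multipliers work in parallel
            intermediate[m] += activation * kernels[m][pos]   # stage 3: accumulate
    return [max(0.0, v) for v in intermediate]    # stage 4: ReLU activation

window = [1.0, 2.0, -1.0, 0.5]                # a 2x2 window, loaded once (stage 1)
kernels = [[1, 1, 1, 1], [-1, 0, 0, 0]]       # TM1 = 2 kernels
print(input_layer_pixel(window, kernels))     # [2.5, 0.0]
```

Note how every multiplier reuses the same broadcast activation in a cycle, which is the spatial reuse the second stage relies on.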
In one embodiment, as shown with reference to FIG. 4, the DW convolution module includes TM2 second multipliers 121, TM2 second adders 122 and second intermediate memories 123 respectively corresponding to the second multipliers 121, and a second nonlinear activation unit 124, where TM2 represents the number of convolution kernels of the DW convolution module. The TM2 second multipliers 121, the TM2 second adders 122 and second intermediate memories 123, and the second nonlinear activation unit 124 may form a second operation unit of the DW convolution module.
The TM2 second multipliers 121 are configured to perform multiplication operations on data of corresponding channels in the first operation result to obtain second products, so as to realize parallel operation among the second multipliers 121 and improve operation efficiency. The second adder 122 is configured to read current second intermediate data from the corresponding second intermediate memory 123, add the current second intermediate data to the second product, update the second intermediate data according to the addition result if the addition result is intermediate data, and store the updated second intermediate data in the second intermediate memory 123. The second nonlinear activation unit 124 is configured to activate an addition result representing a final convolution result of the current layer to obtain the second operation result. Optionally, the TM2 second multipliers 121 may respectively obtain corresponding weights from an external control device or a storage device, so as to perform the multiplication operations on the data of the corresponding channels in the input data by using the obtained weights, thereby ensuring the accuracy of the obtained second products. Optionally, the second nonlinear activation unit 124 may activate the corresponding addition result using the ReLU function. Optionally, the TM2 second multipliers 121 may each be connected to the first operation result output by one output channel of the input layer convolution module, with the connected first operation result serving as an input.
In the specific operation process, the DW convolution module performs, in each cycle, the convolution calculation of the partial sums corresponding to 1 datum within a single R2×S2 convolution window for one output-channel slice (the DW convolution module includes TM2 output channels); the corresponding structure can be seen with reference to FIG. 4. The operation process of the DW convolution module shown in FIG. 4 can be divided into the following three stages. In the first stage, each of the R2×S2 convolution windows under the TM2 channels of the corresponding input feature map provides one input activation datum, i.e. a total of TM2 input activation data are sent over a unicast network to the TM2 second multipliers 121, and each of the R2×S2 convolution windows in the weights corresponding to the TM2 second multipliers 121 provides one weight datum, i.e. a total of TM2 weight data are sent over the unicast network to the second multipliers 121; the TM2 second multipliers 121 operate simultaneously to obtain the output partial sums. In the second stage, the partial sums output by the TM2 second multipliers 121 of the previous stage are added to the values previously stored in the second intermediate memories 123 (the second intermediate data). If the value obtained in the second intermediate memory 123 after the addition is the final convolution calculation result after the partial sums are fully accumulated (i.e. all corresponding data in the R2×S2 convolution windows of all convolution kernels have been operated on), the value in the second intermediate memory 123 is transferred to the second nonlinear activation unit 124 arranged at the next stage, a signal is sent to inform the next stage that the currently transmitted value is a convolution operation result, and the second intermediate memory 123 is cleared; otherwise, the accumulated value is still a partial sum rather than a final result, is stored in the second intermediate memory 123 to await the output of the second multiplier 121 in the next cycle, is not transferred to the next stage, and no signal is sent to the next stage. In the third stage, the output data from the second intermediate memories 123 of the previous stage are processed, and the result of the third-stage operation is directly used as the output of the DW convolution module. Optionally, the second nonlinear activation unit 124 may activate an addition result representing the final convolution result of the current layer by accumulating a bias, nonlinear activation function processing, and the like.
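Unlike the input layer's broadcast, in DW convolution each multiplier receives its own unicast activation, one per channel, since each channel is convolved independently with its own kernel. A Python sketch of one cycle of this channel-wise multiply-accumulate follows; TM2 and the data values are illustrative assumptions:

```python
def dw_partial_sums(activations, weights, intermediate):
    """One cycle of the DW module: TM2 channel-wise multiply-accumulates.

    activations : TM2 input activations, one per channel (unicast, stage 1)
    weights     : TM2 weights, one per channel's convolution-window position
    intermediate: TM2 cached partial sums (second intermediate memories 123)
    """
    for m in range(len(intermediate)):
        intermediate[m] += activations[m] * weights[m]   # stage 2: accumulate

mem = [0.0, 0.0]                                 # TM2 = 2 channels
dw_partial_sums([1.0, 2.0], [3.0, -1.0], mem)    # cycle 1 (window position 0)
dw_partial_sums([0.5, 1.0], [2.0,  4.0], mem)    # cycle 2 (window position 1)
print(mem)   # [4.0, 2.0]
```

After R2×S2 such cycles the values in `mem` would represent final convolution results and be handed to the activation unit.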
In one embodiment, the PW convolution module includes TM3×TN3 third multipliers 131, third adders 132, third intermediate memories 133, and a third nonlinear activation unit 134, where TM3 represents the number of convolution kernels of the PW convolution module and TN3 represents the size of the slice on the input-channel scale in the PW convolution module. Specifically, as shown in FIG. 5, the TM3×TN3 third multipliers 131 can be divided into TM3 groups of third multipliers, each group including TN3 multipliers, and each of the TM3 groups has a corresponding third adder 132. Specifically, the TM3×TN3 third multipliers 131, the third adders 132, the third intermediate memories 133, and the third nonlinear activation unit 134 may form a third operation unit of the PW convolution module.
The TM3×TN3 third multipliers 131 are configured to perform multiplication operations on data of corresponding channels in the second operation result to obtain third products, so as to realize parallel operation among the third multipliers 131 and improve operation efficiency. The third adder 132 is configured to add the third products output by each of the TM3 groups of third multipliers, read current third intermediate data from the third intermediate memory 133, accumulate the current third intermediate data with the addition result corresponding to the third products, update the third intermediate data according to the accumulation result, and store the updated third intermediate data in the third intermediate memory 133. The third nonlinear activation unit 134 is configured to activate an accumulation result representing a final convolution result of the current layer to obtain the third operation result. Optionally, the TM3×TN3 third multipliers 131 may respectively obtain corresponding weights from an external control device or a storage device, so as to perform the multiplication operations on the data of the corresponding channels in the input data by using the obtained weights, thereby ensuring the accuracy of the obtained third products. Optionally, the third nonlinear activation unit 134 may activate the corresponding addition result using a ReLU function. Optionally, the third adder 132 may include a first-stage adder and a second-stage adder; the first-stage adder is used for adding the third products output by each of the TM3 groups of third multipliers, and the second-stage adder is connected between the first-stage adder and the third intermediate memory 133 and is used for reading the current third intermediate data from the third intermediate memory 133 and accumulating the current third intermediate data with the addition result corresponding to the third products.
In one example, referring to FIG. 5 (in which Input represents input and Output represents output), the PW convolution module further includes TN3 second input memories 135. The TN3 second input memories 135 are used for storing second input activation data in the convolution window required in subsequent operations, so that the TM3×TN3 third multipliers 131 can read the second input activation data and multiplex the second input activation data stored in the second input memories 135, thereby improving the utilization rate of the second input memories 135. The second input activation data are derived from the second operation result.
In the specific operation process, the PW convolution module performs, in each cycle, the convolution calculation of the partial sums corresponding to the data within a single 1×1 convolution window for one input-channel slice (the PW convolution module includes TN3 input channels) and one output-channel slice (including TM3 output channels); the corresponding structure can be seen with reference to FIG. 5. The working process of the PW convolution module shown in FIG. 5 may include the following five stages. In the first stage, because of the temporal reusability of input activations, the module stores the output of the preceding DW convolution module into the second input memories 135; the stored data are the total of TN3 data within each 1×1 convolution window of a single input-channel slice of the module's input feature map, to avoid repeated loading in subsequent operations. In the second stage, because of the spatial reusability of input activations, each 1×1 convolution window of the TN3 input channels of the module's input feature map provides one input activation datum, i.e. a total of TN3 input activation data are sent over a multicast network to the multiplier units, and each 1×1 convolution window under each of the TN3 channels of each of the TM3 weights of the module provides one weight datum, i.e. a total of TM3×TN3 weight data are sent over a unicast network to the third multipliers 131; the TM3×TN3 third multipliers 131 operate simultaneously to obtain the output partial sums. In the third stage, the output results of the TM3×TN3 third multipliers 131, grouped in TN3s, are passed through TM3 addition trees, whose outputs are the partial sums corresponding to the TM3 output channels.
In the fourth stage, the TM3 partial sums obtained in the previous stage are added to the values previously stored in the third intermediate memories 133. If the added value obtained in the third intermediate memory 133 is the final convolution calculation result after the partial sums are fully accumulated (i.e. the corresponding data in the 1×1 convolution windows under all the input channels have been operated on), the value in the third intermediate memory 133 is transferred to the third nonlinear activation unit 134 of the next stage, a signal is sent to inform the next stage that the currently transmitted value is a convolution operation result, and the third intermediate memory 133 is cleared; otherwise, the accumulated value is still a partial sum rather than a final result, is stored in the third intermediate memory 133 to await the output of the addition tree in the next cycle, is not transferred to the next stage, and no signal is sent to the next stage. In the fifth stage, the output data from the third intermediate memories 133 of the previous stage are activated, and the result of the fifth-stage operation is directly used as the output of the PW convolution module. Optionally, the activation process includes accumulating a bias, nonlinear activation function processing, and the like.
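The PW dataflow (TN3 multicast activations shared across TM3 weight groups, one addition tree per output channel, accumulation across input-channel slices) can be sketched in Python as follows; TM3, TN3, and the data values are illustrative assumptions:

```python
def pw_cycle(activations, weight_groups, intermediate):
    """One cycle of the PW module for one input-channel slice.

    activations   : TN3 input activations, one per input channel (multicast)
    weight_groups : TM3 lists of TN3 weights, one list per output channel
    intermediate  : TM3 cached partial sums (third intermediate memories 133)
    """
    for m, weights in enumerate(weight_groups):
        # TN3 multipliers followed by one addition tree per output channel,
        # then accumulation with the cached partial sum
        intermediate[m] += sum(a * w for a, w in zip(activations, weights))

mem = [0.0, 0.0]                                       # TM3 = 2 output channels
pw_cycle([1.0, 2.0], [[1.0, 1.0], [2.0, 0.0]], mem)    # input-channel slice 0
pw_cycle([3.0, 1.0], [[0.0, 1.0], [1.0, 1.0]], mem)    # input-channel slice 1
print(mem)   # [4.0, 6.0]
```

After all input-channel slices have been processed, `mem` holds the final 1×1 convolution results and would be forwarded to the third nonlinear activation unit.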
In one embodiment, as shown with reference to FIG. 6, the output layer convolution module includes TN4 fourth multipliers 141, a fourth adder 142, a fourth intermediate memory 143, and a fourth nonlinear activation unit 144, where TN4 represents the size of the slice on the input-channel scale in the output layer convolution module. Specifically, the TN4 fourth multipliers 141, the fourth adder 142, the fourth intermediate memory 143, and the fourth nonlinear activation unit 144 may form a fourth operation unit of the output layer convolution module.
The TN4 fourth multipliers 141 are configured to perform multiplication operations on data of corresponding channels in the third operation result to obtain fourth products, so as to realize parallel operation among the fourth multipliers 141 and improve operation efficiency. The fourth adder 142 is configured to add the fourth products output by the TN4 fourth multipliers 141, read current fourth intermediate data from the fourth intermediate memory 143, accumulate the current fourth intermediate data with the addition result corresponding to the fourth products, update the fourth intermediate data according to the accumulation result, and store the updated fourth intermediate data in the fourth intermediate memory 143. The fourth nonlinear activation unit 144 is configured to activate an addition result representing a final convolution result of the current layer to obtain the fourth operation result. Optionally, the TN4 fourth multipliers 141 may respectively obtain corresponding weights from an external control device or a storage device, so as to perform the multiplication operations on the data of the corresponding channels in the input data by using the obtained weights, thereby ensuring the accuracy of the obtained fourth products. Optionally, the fourth nonlinear activation unit 144 may activate the corresponding addition result using a ReLU function. Optionally, the fourth adder 142 may include a third-stage adder and a fourth-stage adder; the third-stage adder is used for adding the fourth products output by the TN4 fourth multipliers 141, and the fourth-stage adder is connected between the third-stage adder and the fourth intermediate memory 143 and is used for reading the current fourth intermediate data from the fourth intermediate memory 143 and accumulating the current fourth intermediate data with the addition result corresponding to the fourth products.
In the specific operation process, the output layer convolution module performs, in each cycle, the convolution calculation of the partial sums corresponding to 1 datum within a single R4×S4 convolution window for one input-channel slice (the output layer convolution module includes TN4 input channels); the corresponding structure can be seen with reference to FIG. 6. The operation of the output layer convolution module shown in FIG. 6 may include the following four stages. In the first stage, each of the R4×S4 convolution windows under the TN4 channels of the corresponding input feature map provides one input activation datum, i.e. a total of TN4 input activation data are sent over a unicast network to the multiplier units, and each of the R4×S4 convolution windows under the TN4 channels of the weights provides one weight datum, i.e. a total of TN4 weight data are sent over the unicast network to the fourth multipliers 141; the TN4 fourth multipliers 141 operate simultaneously to obtain the output partial sums. In the second stage, the output results of the TN4 fourth multipliers 141 are passed through the addition tree to obtain the accumulated partial sum.
In the third stage, the partial sum obtained in the previous stage is added to the value previously stored in the fourth intermediate memory 143. If the added value obtained in the fourth intermediate memory 143 is the final convolution result after the partial sums are fully accumulated (i.e. the corresponding data in the R4×S4 convolution windows of all input channels have been operated on), the value in the fourth intermediate memory 143 is transferred to the fourth nonlinear activation unit 144 of the next stage, a signal is sent to inform the fourth nonlinear activation unit 144 of the next stage that the currently transmitted value is a convolution operation result, and the fourth intermediate memory 143 is cleared; otherwise, the accumulated value is still a partial sum rather than a final result, is stored in the fourth intermediate memory 143 to await the output of the addition tree in the next cycle, is not transferred to the next stage, and no signal is sent to the next stage. In the fourth stage, the output data from the fourth intermediate memory 143 of the previous stage are processed, and the result of the fourth-stage operation is directly used as the output of the module. Optionally, the activation process includes accumulating a bias, nonlinear activation function processing, and the like.
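Putting the output layer's stages together (TN4 unicast activations per cycle, an addition tree across input channels, accumulation across the R4×S4 window positions, then activation), the full accumulation for one output pixel can be sketched in Python. TN4, the window size, and the data values are illustrative assumptions:

```python
def output_layer_pixel(windows, kernel):
    """Full accumulation for one output pixel of the output layer module.

    windows : TN4 lists of R4*S4 activations, one list per input channel
    kernel  : TN4 lists of R4*S4 weights (a single convolution kernel)
    """
    partial = 0.0                        # fourth intermediate memory 143
    for pos in range(len(windows[0])):   # one window position per cycle
        # TN4 multipliers followed by the addition tree (stages 1-2)
        partial += sum(windows[c][pos] * kernel[c][pos]
                       for c in range(len(windows)))
    return max(0.0, partial)             # stage 4: ReLU activation

windows = [[1.0, 2.0], [0.0, 1.0]]       # TN4 = 2 channels, a 1x2 window
kernel  = [[1.0, 1.0], [2.0, 3.0]]
print(output_layer_pixel(windows, kernel))   # 6.0
```

The partial sum stays in the (simulated) intermediate memory between cycles; only the fully accumulated value passes to the activation unit.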
In one example, the first nonlinear activation unit 114, the second nonlinear activation unit 124, the third nonlinear activation unit 134, and the fourth nonlinear activation unit 144 each include an adder group and a comparator group. The adder group is used for performing accumulation bias processing on an addition result representing the final convolution result of the current layer, and the comparator group is used for performing nonlinear activation processing on the result of the accumulation bias processing to obtain the corresponding operation result. The nonlinear activation unit provided by this example can stably perform nonlinear activation processing on the addition result, improving the reliability of the convolution operation circuit.
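The adder-group-plus-comparator-group structure amounts to a bias addition followed by a ReLU realized as a compare-against-zero. A minimal sketch, assuming a single scalar result and an externally supplied bias (both names illustrative):

```python
def activation_unit(conv_result, bias):
    """Model of the nonlinear activation unit: accumulate bias, then ReLU.

    The adder group adds the bias; the comparator group implements ReLU
    by comparing the biased value against zero.
    """
    biased = conv_result + bias              # adder group: accumulation bias
    return biased if biased > 0.0 else 0.0   # comparator group: ReLU

print(activation_unit(3.0, -1.0))   # 2.0
print(activation_unit(-3.0, 1.0))   # 0.0
```

A comparator suffices here because ReLU only needs a sign test, which is why no further multipliers are required in the activation units.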
In one embodiment, the intermediate memories (e.g., the first to fourth intermediate memories, etc.) include Static Random Access Memories (SRAMs) to improve the efficiency of the intermediate memories in storing, updating, or erasing the corresponding intermediate data, thereby improving the efficiency of the corresponding convolution operations.
In the convolution operation circuit, each operation module includes an intermediate memory for caching intermediate data, so that the corresponding intermediate data are stored for use in subsequent operations. The operation modules therefore do not need to load intermediate data from off-chip multiple times during operation, which saves the energy and transmission time of data transfer between on-chip and off-chip, improves the operation speed of the convolution operation, and reduces the corresponding execution power consumption. The operation modules are suited to a convolutional neural network hardware accelerator implemented on a fused-layer architecture and can efficiently execute convolutional neural network operations under such an architecture; both of the underlying operators of depthwise separable convolution, DW convolution and PW convolution, can be accelerated in the accelerator. The circuit can meet the data reuse requirements of a fused-layer architecture, accelerates well the combined application of multiple operators, and can complete convolutional neural network operation tasks at high speed; the operation process is designed as a pipeline, giving high hardware utilization. The calculation of each operation module is unrolled for hardware mapping, with different operation modules responsible for accelerating different operators; during convolutional neural network operation each operation module remains at full load throughout, which further improves hardware utilization.
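The fused-layer idea is that the four operation modules are chained so that each module's output feeds the next directly, with no intermediate feature map written off-chip. A minimal sketch of that chaining, with simple scalar stand-ins for the four modules (all stage functions and values are illustrative assumptions):

```python
def fused_layer_pipeline(x, stages):
    """Pass data through a chain of fused-layer operators; intermediate
    results flow module-to-module and are never stored off-chip."""
    for stage in stages:
        x = stage(x)
    return x

def relu(v):
    return max(0.0, v)

stages = [
    lambda v: relu(2.0 * v),    # input layer convolution (stand-in)
    lambda v: relu(v - 1.0),    # DW convolution (stand-in)
    lambda v: relu(3.0 * v),    # PW convolution (stand-in)
    lambda v: relu(v + 0.5),    # output layer convolution (stand-in)
]
print(fused_layer_pipeline(1.0, stages))   # 3.5
```

In hardware all four stages run concurrently on different data, which is what keeps every module loaded and yields the pipeline's throughput advantage.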
A second aspect of the present application provides a convolution operation method, which is applied to the convolution operation circuit provided in any of the above embodiments, and includes:
receiving input data, performing corresponding convolution operation on the input data, and caching intermediate data in the convolution operation process to an intermediate memory;
and reading the current intermediate data from the intermediate memory for corresponding convolution operation, updating the intermediate data cached by the intermediate memory according to the latest intermediate data, and outputting the convolution operation result of the corresponding operation module.
The convolution operation method is applied to the convolution operation circuit provided in any of the above embodiments, has all the technical effects of the convolution operation circuit provided in any of the above embodiments, and is not described herein again.
A third aspect of the present application provides an image processing apparatus, including the convolution operation circuit described in any of the above embodiments, where the image processing apparatus can perform processing such as image denoising by using the convolution operation circuit, so as to improve a denoising effect, thereby improving a corresponding image processing effect.
Although the application has been shown and described with respect to one or more implementations, equivalent alterations and modifications will occur to others skilled in the art based upon a reading and understanding of this specification and the annexed drawings. The present application includes all such modifications and alterations, and is limited only by the scope of the appended claims. In particular regard to the various functions performed by the above described components, the terms used to describe such components are intended to correspond, unless otherwise indicated, to any component which performs the specified function of the described component (e.g., that is functionally equivalent), even though not structurally equivalent to the disclosed structure which performs the function in the herein illustrated exemplary implementations of the specification.
That is, the above description is only an embodiment of the present application, and not intended to limit the scope of the present application, and all equivalent structures or equivalent flow transformations made by using the contents of the specification and the drawings of the present application, such as the combination of technical features between various embodiments, or the direct or indirect application to other related technical fields, are all included in the scope of the present application.
In addition, structural elements having the same or similar characteristics may be identified by the same or different reference numerals. Furthermore, the terms "first", "second", "third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, features defined as "first", "second", "third" may explicitly or implicitly include one or more features. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
In this application, the word "exemplary" is used to mean "serving as an example, instance, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments. The previous description is provided to enable any person skilled in the art to make and use the present application. In the foregoing description, various details have been set forth for the purpose of explanation. It will be apparent to one of ordinary skill in the art that the present application may be practiced without these specific details. In other instances, well-known structures and processes are not shown in detail to avoid obscuring the description of the present application with unnecessary detail. Thus, the present application is not intended to be limited to the embodiments shown, but is to be accorded the widest scope consistent with the principles and features disclosed herein.

Claims (12)

1. A convolution operation circuit is characterized by comprising a plurality of operation modules for performing different convolution operation layers;
each operation module comprises an intermediate memory for caching intermediate data and an operation unit for performing corresponding convolution operation, and the operation unit and the corresponding intermediate memory are arranged in the same chip;
the operation unit is used for receiving input data, performing corresponding convolution operation on the input data, caching intermediate data in the convolution operation process to the intermediate memory, reading the current intermediate data from the intermediate memory for the corresponding convolution operation, updating the intermediate data cached in the intermediate memory according to the latest intermediate data, and outputting the convolution operation result of the corresponding operation module.
2. The convolution operation circuit of claim 1 wherein the plurality of operation modules includes an input layer convolution module, at least one DW convolution module, at least one PW convolution module, and an output layer convolution module.
3. The convolution operation circuit according to claim 2, characterized in that the input layer convolution module comprises TM₁ first multipliers, first adders and first intermediate memories respectively corresponding to the first multipliers, and a first nonlinear activation unit, TM₁ representing the number of convolution kernels of the input layer convolution module;
the TM₁ first multipliers are used for respectively performing multiplication operations on data of corresponding channels in the image feature map to obtain first products;
the first adder is used for reading current first intermediate data from the corresponding first intermediate memory, adding the current first intermediate data and the corresponding first product, if the addition result is intermediate data, updating the corresponding first intermediate data according to the addition result, and storing the updated first intermediate data into the corresponding first intermediate memory;
the first nonlinear activation unit is used for activating an addition result of the final convolution result of the representation current layer to obtain the first operation result.
4. The convolution operation circuit of claim 3, wherein the input layer convolution module further includes R₁×S₁ first input memories, R₁×S₁ representing the window size of a single convolution calculation in the input layer convolution module;
the R₁×S₁ first input memories correspond to the R₁×S₁ positions of the convolution window and are used for storing the first input activation data in the convolution window required in subsequent operations, so that the TM₁ first multipliers can read the first input activation data; the first input activation data is derived from the image feature map.
5. The convolution operation circuit of claim 2, wherein the DW convolution module comprises TM₂ second multipliers, second adders and second intermediate memories respectively corresponding to the second multipliers, and a second nonlinear activation unit, TM₂ representing the number of convolution kernels of the DW convolution module;
the TM₂ second multipliers are used for respectively performing multiplication operations on the data of the corresponding channels in the first operation result to obtain second products;
the second adder is configured to read current second intermediate data from the corresponding second intermediate memory, add the current second intermediate data to the second product, update the second intermediate data according to an addition result if the addition result is intermediate data, and store the updated second intermediate data in the second intermediate memory;
and the second nonlinear activation unit is used for activating an addition result representing a final convolution result of the current layer to obtain the second operation result.
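The DW (depthwise) stage of claim 5 convolves each channel with its own kernel and never sums across channels, which is why one multiplier-adder-memory triple per kernel suffices. A minimal NumPy sketch under that reading (ReLU is assumed for the second nonlinear activation unit; names are illustrative):

```python
# Hedged model of the depthwise (DW) stage in claim 5.
import numpy as np

def dw_conv(x, kernels):
    """x: (C, H, W) first operation result; kernels: (C, k, k), one per channel."""
    C, H, W = x.shape
    k = kernels.shape[1]
    out = np.zeros((C, H - k + 1, W - k + 1))
    for c in range(C):                  # channels stay independent in DW conv
        for i in range(out.shape[1]):
            for j in range(out.shape[2]):
                out[c, i, j] = np.sum(x[c, i:i + k, j:j + k] * kernels[c])
    return np.maximum(out, 0)           # second nonlinear activation (assumed ReLU)

x = np.ones((2, 4, 4))
kernels = np.ones((2, 3, 3))
print(dw_conv(x, kernels)[0, 0, 0])  # 9.0
```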
6. The convolution operation circuit of claim 2, wherein the PW convolution module comprises TM₃×TN₃ third multipliers, a third adder, a third intermediate memory, and a third nonlinear activation unit, TM₃ representing the number of convolution kernels of the PW convolution module and TN₃ representing the tile size along the input-channel dimension in the PW convolution module;
the TM₃×TN₃ third multipliers are used for respectively performing multiplication operations on the data of the corresponding channels in the second operation result to obtain third products;
the third adder is used for respectively adding the third products output by the TM₃ groups of third multipliers, reading the current third intermediate data from the third intermediate memory, adding the current third intermediate data to the addition result of the corresponding third products, updating the third intermediate data according to the addition result, and storing the updated third intermediate data into the third intermediate memory;
and the third nonlinear activation unit is used for activating an accumulation result representing a final convolution result of the current layer to obtain the third operation result.
7. The convolution operation circuit according to claim 6, characterized in that the PW convolution module further comprises TN₃ second input memories;
the TN₃ second input memories are used for storing the second input activation data in the convolution window for use in subsequent operations, so that the TM₃×TN₃ third multipliers can read the second input activation data; the second input activation data is derived from the second operation result.
8. The convolution operation circuit of claim 2, wherein the output layer convolution module comprises TN₄ fourth multipliers, a fourth adder, a fourth intermediate memory, and a fourth nonlinear activation unit, TN₄ representing the tile size along the input-channel dimension in the output layer convolution module;
the TN₄ fourth multipliers are used for respectively performing multiplication operations on the data of the corresponding channels in the third operation result to obtain fourth products;
the fourth adder is configured to add the fourth products output by the TN₄ fourth multipliers, read the current fourth intermediate data from the fourth intermediate memory, add the current fourth intermediate data to the addition result of the corresponding fourth products, update the fourth intermediate data according to the addition result, and store the updated fourth intermediate data in the fourth intermediate memory;
the fourth nonlinear activation unit is configured to activate an addition result representing a final convolution result of the current layer to obtain the fourth operation result.
9. The convolution operation circuit according to any one of claims 3 to 8, wherein the first nonlinear activation unit, the second nonlinear activation unit, the third nonlinear activation unit, and the fourth nonlinear activation unit respectively include: an adder group and a comparator group;
the adder group is used for carrying out accumulation bias processing on an addition result representing a final convolution result of a current layer;
the comparator group is used for carrying out nonlinear activation processing on the structure subjected to accumulation bias processing to obtain a corresponding operation result.
10. The convolution operation circuit of any one of claims 1 to 9, wherein the intermediate memory comprises a static random access memory.
11. A convolution operation method applied to the convolution operation circuit according to any one of claims 1 to 10, comprising:
receiving input data, performing corresponding convolution operation on the input data, and caching intermediate data in the convolution operation process to an intermediate memory;
and reading the current intermediate data from the intermediate memory for the corresponding convolution operation, updating the intermediate data cached by the intermediate memory according to the latest intermediate data, and outputting the convolution operation result of the corresponding operation module.
12. An image processing apparatus characterized by comprising the convolution operation circuit according to any one of claims 1 to 10.
CN202210815191.1A 2022-07-11 2022-07-11 Convolution operation circuit and method, image processing apparatus Pending CN115293978A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210815191.1A CN115293978A (en) 2022-07-11 2022-07-11 Convolution operation circuit and method, image processing apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210815191.1A CN115293978A (en) 2022-07-11 2022-07-11 Convolution operation circuit and method, image processing apparatus

Publications (1)

Publication Number Publication Date
CN115293978A true CN115293978A (en) 2022-11-04

Family

ID=83822010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210815191.1A Pending CN115293978A (en) 2022-07-11 2022-07-11 Convolution operation circuit and method, image processing apparatus

Country Status (1)

Country Link
CN (1) CN115293978A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117270811A (en) * 2023-11-21 2023-12-22 上海为旌科技有限公司 Nonlinear operator approximation calculation method, device and neural network processor
CN117270811B (en) * 2023-11-21 2024-02-02 上海为旌科技有限公司 Nonlinear operator approximation calculation method, device and neural network processor

Similar Documents

Publication Publication Date Title
US20220012593A1 (en) Neural network accelerator and neural network acceleration method based on structured pruning and low-bit quantization
CN106445471A (en) Processor and method for executing matrix multiplication on processor
CN111414994B (en) FPGA-based Yolov3 network computing acceleration system and acceleration method thereof
CN111667051A (en) Neural network accelerator suitable for edge equipment and neural network acceleration calculation method
CN108733348B (en) Fused vector multiplier and method for performing operation using the same
CN112465110B (en) Hardware accelerator for convolution neural network calculation optimization
CN106951395A (en) Towards the parallel convolution operations method and device of compression convolutional neural networks
CN109146067B (en) Policy convolution neural network accelerator based on FPGA
CN115880132B (en) Graphics processor, matrix multiplication task processing method, device and storage medium
CN112668708B (en) Convolution operation device for improving data utilization rate
CN110555516A (en) FPGA-based YOLOv2-tiny neural network low-delay hardware accelerator implementation method
CN109657794B (en) Instruction queue-based distributed deep neural network performance modeling method
CN111008691B (en) Convolutional neural network accelerator architecture with weight and activation value both binarized
CN110147252A (en) A kind of parallel calculating method and device of convolutional neural networks
Xiao et al. FPGA-based scalable and highly concurrent convolutional neural network acceleration
CN115293978A (en) Convolution operation circuit and method, image processing apparatus
CN113313252B (en) Depth separable convolution implementation method based on pulse array
US20210044303A1 (en) Neural network acceleration device and method
CN110490308B (en) Design method of acceleration library, terminal equipment and storage medium
CN116888591A (en) Matrix multiplier, matrix calculation method and related equipment
CN113052299A (en) Neural network memory computing device based on lower communication bound and acceleration method
CN109800867B (en) Data calling method based on FPGA off-chip memory
CN116227599A (en) Inference model optimization method and device, electronic equipment and storage medium
CN116090530A (en) Systolic array structure and method capable of configuring convolution kernel size and parallel calculation number
CN116090518A (en) Feature map processing method and device based on systolic operation array and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination