WO2023137710A1

WO2023137710A1 - Neural network training method, image processing method and device, system, and medium

Info

Publication number: WO2023137710A1
Application number: PCT/CN2022/073246
Authority: WO
Inventors: 聂谷洪; 肖立睿; 陈铂
Original assignee: 深圳市大疆创新科技有限公司
Priority date: 2022-01-21
Filing date: 2022-01-21
Publication date: 2023-07-27

Abstract

A neural network training method, an image processing method and device, a system, and a storage medium. The neural network training method comprises: obtaining an image sample carrying a label; and training a preset neural network according to the image sample carrying the label, wherein the neural network comprises at least one residual block, a convolutional layer in the residual block performs binary convolution operation, and the quantization bit number of an input value of the residual block and the quantization bit number of an output value of the residual block are both 1 bit. The operation bandwidth and the storage capacity of the neural network are reduced.

Description

Neural network training method, image processing method, device, system and medium

technical field

The present application relates to the technical field of image processing, in particular, to a neural network training method, image processing method, device, system and storage medium.

Background technique

With the development of technology, neural network technology is applied to all aspects of life, such as using neural network technology for image recognition (such as face recognition or expression recognition, etc.) tasks, or image classification tasks.

However, the operation of a neural network is a computationally and memory intensive process. In order to reduce the amount of network storage and improve operating efficiency, one of the improvement directions is quantization acceleration, that is, by quantizing the floating-point values in the neural network, the redundant precision of the data is cut out, so that floating-point calculations are converted into bit operations (or small integer calculations), which can not only reduce network storage, but also greatly accelerate.

Among them, the 8-bit quantized neural network (that is, the floating-point value in the neural network is quantized to 8 bits) is currently a more commonly used quantization model. However, for small devices with limited operating resources and storage resources (such as wearable devices, mobile terminals, or small drones, etc.), the 8-bit quantized neural network still needs to occupy most of the resources during the inference process, thereby affecting small devices to perform other tasks.

Contents of the invention

In view of this, one of the objectives of the present application is to provide a neural network training method, image processing method, device, system and storage medium.

In the first aspect, the embodiment of the present application provides a training method of a neural network, the neural network is used to process computer vision tasks, the method includes:

Obtain image samples with labels;

Training a preset neural network according to the labeled image samples;

Wherein, the neural network includes at least one residual block; the convolution layer in the residual block performs a binary convolution operation, and the number of quantized bits of the input value and the output value of the residual block is 1 bit.

In a second aspect, the embodiment of the present application provides an image processing method, including:

Get the image to be processed;

The image to be processed is input into a pre-trained neural network to obtain image processing results; wherein the neural network includes at least one residual block; the convolution layer in the residual block performs binary convolution operation, and the quantized bit numbers of the input value and the output value of the residual block are 1 bit.

In a third aspect, the embodiment of the present application provides a neural network training device, the neural network is used to process computer vision tasks, and the device includes:

memory for storing executable instructions;

one or more processors;

Wherein, when the one or more processors execute the executable instructions, they are individually or collectively configured to execute the method described in the first aspect.

In a fourth aspect, the embodiment of the present application provides an image processing device, including:

memory for storing executable instructions;

one or more processors;

Wherein, when the one or more processors execute the executable instructions, they are individually or collectively configured to execute the method described in the second aspect.

In the fifth aspect, the embodiment of the present application provides an image processing system, including the image processing device and the mobile platform described in the fourth aspect;

The movable platform is provided with a photographing device, and the movable platform is used to send the image captured by the photographing device to the image processing device.

In a sixth aspect, an embodiment of the present application provides a computer-readable storage medium, the computer-readable storage medium stores executable instructions, and when the executable instructions are executed by a processor, the method described in the first aspect or the second aspect is implemented.

The embodiments of the present application provide a neural network training method, image processing method, device, system, and storage medium. A neural network is obtained by training image samples with labels and the neural network is used to process computer vision tasks; wherein, the neural network includes at least one residual block; the convolutional layer in the residual block performs binary convolution operation, which reduces the computational complexity, and the quantization bits of the input value and output value of the residual block are 1 bit, which helps to reduce the computing bandwidth and the storage capacity of the neural network, and has universal applicability. The amount is reduced, the delay is lower, and it can meet the real-time requirements in some scenarios.

Description of drawings

In order to more clearly illustrate the technical solutions in the embodiments of the present application, the accompanying drawings that need to be used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings in the following description are only some embodiments of the present application. For those of ordinary skill in the art, other drawings can also be obtained based on these drawings without paying creative labor.

Figure 1, Figure 2 and Figure 3 are schematic diagrams of three different application scenarios of neural networks for processing different computer vision tasks provided by the embodiment of the present application;

Fig. 4 is a schematic flow chart of a neural network training method provided by an embodiment of the present application;

FIG. 5 is a schematic structural diagram of a residual block provided by an embodiment of the present application;

FIG. 6A is a schematic structural diagram of a residual block including an addition layer provided by an embodiment of the present application;

FIG. 6B is a schematic structural diagram of a residual block including a fusion layer provided by an embodiment of the present application;

Fig. 7 is a schematic diagram of the structure of the auxiliary operator introduced in the training process of the residual block provided by the embodiment of the present application;

FIG. 8A is a schematic structural diagram of introducing auxiliary operators and auxiliary parameters into the training process of the convolutional layer provided by the embodiment of the present application;

Fig. 8B is a schematic diagram of the structure of the convolutional layer provided by the embodiment of the present application, after the training is completed, the auxiliary operators and auxiliary parameters are eliminated and/or fused;

FIG. 9 is a schematic structural diagram of a 1-bit systolic array provided by an embodiment of the present application;

FIG. 10 is a schematic flowchart of an image processing method provided in an embodiment of the present application;

FIG. 11 is a schematic structural diagram of a neural network training device provided by an embodiment of the present application.

Detailed ways

The following will clearly and completely describe the technical solutions in the embodiments of the application with reference to the drawings in the embodiments of the application. Apparently, the described embodiments are only some of the embodiments of the application, not all of them. Based on the embodiments in this application, all other embodiments obtained by persons of ordinary skill in the art without making creative efforts belong to the scope of protection of this application.

Considering that the 8-bit quantized neural network in the related technology still needs to occupy most of the resources of the small device during the inference process, thus affecting the performance of other tasks on the small device. The embodiment of the present application trains to obtain a neural network, which is used to process computer vision tasks. The neural network includes at least one residual block, the convolution layer in the residual block performs binary convolution operation, and the quantized bit numbers of the input value and the output value of the residual block are both 1 bit. In this embodiment, the number of bits of data in the neural network is further quantified. The convolutional layer performs binary convolution operations. The input value and output value of the residual block are both 1-bit values, which helps to reduce the computing bandwidth and the storage capacity of the neural network, and has universal applicability.

The neural network trained in the embodiment of the present application includes a residual block, which can be used to replace the residual block of the non-binarization operation in the related art, so it can be transferred to most computer vision tasks without adjustment or with a small amount of adjustment, and has strong portability. That is to say, the neural network trained in the embodiment of the present application can be used to perform one or more of the following computer vision tasks: image classification task, image localization task, target detection task, target tracking task, semantic segmentation task, instance segmentation task or super-resolution reconstruction task, etc.

Then the neural network can be applied to different scenarios or different devices based on the computer vision tasks it handles.

In an exemplary embodiment, the neural network trained in the present application can be installed on a mobile platform, which includes but not limited to drones, unmanned vehicles, mobile robots, unmanned ships or cloud platforms. The movable platform includes a photographing device, and after the movable platform acquires the image taken by the photographing device, the image can be input into the neural network, so that the neural network can perform computer vision tasks based on the input image; the computer vision tasks include but are not limited to image positioning tasks, target detection tasks, or target tracking tasks.

In one example, taking the movable platform as an unmanned aerial vehicle as an example, the unmanned aerial vehicle is equipped with a photographing device, and the photographing device at least includes a photosensitive element, such as a complementary metal oxide semiconductor (Complementary Metal Oxide Semiconductor, CMOS) sensor or a charge-coupled device (Charge-coupled Device, CCD) sensor; the images captured by the photographing device include but are not limited to color images, grayscale images, infrared images, or depth images.

Please refer to FIG. 1 , taking the neural network used to perform target tracking tasks as an example. In the following shooting scene of the UAV 100 , the shooting device 200 is used to shoot the following target object 300 .

Please refer to FIG. 2 , taking the neural network as an example for performing a target detection task. During the flight mission of the UAV 100, the current flight environment is photographed by a photographing device. After the UAV 100 obtains the image taken by the photographing device, the neural network can perform target detection based on the image to obtain obstacle information, so that the UAV replans the flight path 400 based on the obstacle information, realizes obstacle-avoiding flight, and ensures the flight safety of the UAV.

In another exemplary embodiment, the neural network trained in this application can be installed in the terminal device. For example, after the terminal device acquires the image to be processed, it can input the image to be processed into the neural network, so that the neural network can perform computer vision tasks based on the input image; the computer vision tasks include but are not limited to semantic segmentation tasks, instance segmentation tasks, or super-resolution reconstruction tasks. The terminal devices include, but not limited to, but not limited to smart phone/mobile phone, tablet computer, personal digital assistant (PDA), knee computer, desktop computer, media content player, video game station/system, virtual reality system, augmented reality system, wearable device (for example, watches, glasses, gloves, headdresses (such as hats, helmets, virtual reality headphones, wearing headphones, wearing headphones, wearing headphones, wearing headphones, wearing headphones, wearing headphones, wearing headphones, wearing headphones, wearing headphones, wearing headphones, wearing headphones, wearing headphones, wearing headphones, wearing headphones, wearing headphones, wearing headphones, wearing headphones, wearing headphones, wearing headphones, wearing headphones, wearing headphones, wearing headphones. Augmented reality headphones, headwear -type devices (HMD), headbands), pendants, armbands, legs, shoes, vests), remote control, or any other type of device.

In an example, referring to FIG. 3 , the terminal device is connected in communication with a mobile platform (such as a UAV), and the UAV 100 can transmit the captured image to the terminal device 500. Taking the neural network as an example for performing a super-resolution reconstruction task, the terminal device 500 can input the received image into the neural network for resolution reconstruction processing, so as to obtain an image with better image quality (such as an image with a higher resolution).

In some embodiments, a neural network training method provided in the embodiments of the present application may be applied to a training device. Exemplarily, the training device may be an electronic device with data processing capabilities, such as a computer, server, cloud server or terminal, etc.; it may also be a computer chip or an integrated circuit with data processing capabilities, such as a central processing unit (Central Processing Unit, CPU), a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC) or a ready-made programmable gate array (Field-Programmable Gate Array, FPGA), etc.

Next, the training method provided by the embodiment of the present application will be described: as shown in FIG. 4 , which is a schematic flowchart of a neural network training method provided by the embodiment of the present application. The neural network is used to process computer vision tasks, and the method can be performed by training a device, and the method includes:

In step S101, an image sample carrying a label is obtained.

In step S102, a preset neural network is trained according to the image sample carrying the label; wherein, the neural network includes at least one residual block; the convolutional layer in the residual block performs binary convolution operation, and the quantized bit numbers of the input value and the output value of the residual block are both 1 bit.

In this embodiment, the convolutional layer in the residual block performs a binary convolution operation. The input value and output value of the residual block are both 1-bit values, which helps to reduce the computing bandwidth and the storage capacity of the neural network, and has universal applicability. For example, it is suitable for small devices with limited operating resources and storage resources (such as wearable devices, mobile terminals or small drones, etc.), and the amount of data involved in the calculation is reduced, and the delay is lower, which can meet the real-time requirements in some scenarios.

In some embodiments, referring to FIG. 5 , the number of quantized bits of the input value and the output value of the residual block is 1 bit (bit); the residual block includes a fusion layer 20, at least one convolutional layer 10 located in the main branch, and at least one convolutional layer 10 located in the jumper branch; wherein, the number of convolutional layers 10 in the main branch is more than the number of convolutional layers 10 in the jumper branch. The convolution layer 10 performs a binary convolution operation, and the quantization bits of the input value and the weight value of the convolution layer 10 are both 1 bit, which helps to reduce the operation bandwidth and improve the operation efficiency. The fusion layer 20 is used to fuse the output value of the convolutional layer 10 in the main branch and the output value of the convolutional layer 10 in the jumper branch.

Wherein, if the next layer of the convolutional layer 10 is not the fusion layer 20, the quantized bit number of the output value of the convolutional layer 10 is 1 bit, so that the input value of the convolutional layer 10 is 1 bit, without PReLU, Sign or other nonlinear function processing, which saves frequent precision conversion and reduces the overall data movement amount. It is also beneficial to improve computing efficiency and reduce computing resources.

If the next layer of the convolution layer 10 is the fusion layer 20, the quantization bit number of the output value of the convolution layer 10 is greater than 1 bit, such as the quantization bit number of the output value of the convolution layer 10 is any of the following: 2 bits, 4 bits, 8 bits or 16 bits, and the quantization bit number of the output value of the convolution layer 10 can be set according to the actual application scene. In addition, the fusion layer 20 can convert the fusion result obtained by fusing the output value of the convolution layer 10 in the main branch and the output value of the convolution layer 10 in the jumper branch into 1 bit, that is, the quantization bit number of the output value of the fusion layer 20 is 1 bit. In this embodiment, when performing data fusion, the fusion layer 20 uses data whose quantization bit number is greater than 1 bit for fusion, which is beneficial to improve fusion accuracy.

Exemplarily, please refer to FIG. 6A and FIG. 6B , the fusion layer 20 includes an addition layer 21 or a splicing layer 22 . In FIG. 6A, the addition layer 21 is used to add the output values of at least two convolutional layers 10 connected to the addition layer 21, for example, the output value of the convolutional layer 10 of the main branch and the output value of the convolutional layer 10 of the jumper branch can be added bit by bit. In FIG. 6B, the splicing layer 22 is used to splice the output values of at least two convolutional layers 10 connected to the splicing layer 22, for example, the output values corresponding to the convolutional layer 10 of the main branch and the convolutional layer 10 of the jumper branch can be spliced in cascade. The residual block provided by this embodiment has a simple structure. The input value of the convolutional layer 10 in the residual block and the number of quantized bits of the weight value are both 1 bit, that is, the convolutional layer 10 performs a binary convolution operation, and only fusion processing of more than 1 bit is performed in the fusion layer 20. This simplicity helps to reduce data movement and bandwidth, and has lower delay.

For example, please refer to FIG. 6A, if the fusion layer 20 is an addition layer 21, the output value of the residual block is output by the addition layer 21, and the quantization bit number of the output value of the residual block is 1 bit. Please refer to FIG. 6B, the residual block also includes at least one convolutional layer 10 after the splicing layer 22, if the fusion layer 20 is a splicing layer 22, the output value of the residual block is output by the next convolutional layer 10 of the splicing layer 22, and the quantization bit number of the output value of the residual block is 1 bit.

The residual block provided by this embodiment can be used to replace the residual block of the non-binarization operation in the related art, and it can be transferred to most computer vision tasks without adjustment or with a small amount of adjustment, and has strong portability. The neural network obtained by training in this embodiment and including the residual block with the structure shown in FIG. 6A or FIG. 6B can be used to perform different computer vision tasks, such as image classification tasks, image positioning tasks, target detection tasks, target tracking tasks, semantic segmentation tasks, instance segmentation tasks or super-resolution reconstruction tasks, etc.

Based on the needs of actual application scenarios, during the training process of the neural network, corresponding labels can be determined based on expected computer vision tasks, and the labels corresponding to different computer vision tasks are different. In an example, such as in a semantic segmentation task, the label is the category to which each pixel in the image sample belongs. In another example, such as in an object detection task, the label is position information of an object in an image sample. Then, during the training process, the preset neural network processes the image samples to obtain predicted values, and then the test device adjusts the parameters of the neural network based on the difference between the predicted values and the labels of the image samples to obtain a trained neural network.

In some embodiments, in order to expand the network capacity and improve the network accuracy, the embodiment of the present application realizes that during the training process of the neural network, auxiliary parameters and auxiliary operators are introduced to assist the training of the convolutional layer, thereby helping to improve network performance.

For the weight of the convolutional layer, during the training process, the weight of the convolutional layer corresponds to a first auxiliary parameter and a second auxiliary parameter; wherein, the first auxiliary parameter is used to control the degree of quantization of the floating-point weight of the convolutional layer into 1 bit; the second auxiliary parameter is used to indicate the degree of scaling of the quantized weight.

In the neural network training process, a first auxiliary operator and a second auxiliary operator are introduced, and the floating-point weight of the convolution layer is quantized by using the first auxiliary operator and the second auxiliary operator; wherein, the first auxiliary operator is used to quantize the floating-point weight of the convolution layer into 1 bit according to the first auxiliary parameter in the forward transfer process; the second auxiliary operator is used to determine the sign of the quantized weight. Exemplarily, the first auxiliary operator includes a Tanh function; the second auxiliary operator includes a sign function.

For the output value of the neural network, the output value of the convolutional layer corresponds to a third auxiliary parameter, a fourth auxiliary parameter, and at least two fifth auxiliary parameters; the third auxiliary parameter is used to control the quantization degree of quantizing the output value of the convolutional layer into 1 bit; the fourth auxiliary parameter is used to indicate the scaling degree of the quantized output value; the at least two fifth auxiliary parameters are different offsets of the output value.

During the training process of the neural network, please refer to Fig. 7, the training device uses the third auxiliary operator, the fourth auxiliary operator and the second auxiliary operator to quantize the floating-point output value of the convolution layer 10; wherein, the third auxiliary operator is used to perform nonlinear processing on the floating-point output value of the convolution layer 10 according to the third auxiliary parameter and one of the fifth auxiliary parameters; Used to determine the sign of the quantized output value. Exemplarily, the third auxiliary operator includes an activation function, the fourth auxiliary operator includes a hard-tanh function, and the second auxiliary operator includes a sign function.

In an exemplary embodiment, it is assumed that the weight of the convolutional layer is W, the input value is X, and the output value is A. Assume that there is a weight W∈R ^N×C×K×K of the convolutional layer, the input value is X∈R ^B×H×W×C , and the output value A=Conv(W,X), that is, the output value is the result of the convolution operation between the weight of the convolutional layer and the input value; where N, C, H, W, B, and K are the output channel, input channel, height, width, batch size, and kernel size, respectively.

In one example, it is assumed that the weight of the convolutional layer is W, the floating-point weight is W _f , the quantized weight is W _b , the first auxiliary parameter is α, and the second auxiliary parameter is λ. During training, the floating-point weights W _f are used for training, and Tanh function is used to approximate binarization during the forward pass, see Figure 8A, then:

W _b =Sign(Tanh(α·W _f )) (1);

W _f ≈λ·W _b (2).

Exemplarily, α can adopt different ranges (such as ^∈RN , or RN ^×C , or RN ^×C×K×K ) to adjust the required degree of overparameterization, and the size of each α coefficient controls the sharpness of the approximation to binarization. λ ∈ R ^N is a scaling factor for each output channel to compensate for the magnitude difference between W _f and W _b in {−1,1}. α and λ can be adjusted during training. In this embodiment, auxiliary parameters and auxiliary operators are introduced into the weight of the convolutional layer during the training process, which expands the network capacity and helps to improve network performance.

In one example, it is assumed that the output value of the convolutional layer is A, the floating-point output value is A _f , the quantized output value is A _b , the third auxiliary parameter is τ, the fourth auxiliary parameter is κ, and at least two fifth auxiliary parameters include b ₀ and b ₁ . During the training process, since the approximate weights and convolution operations are both floating-point numbers, the output is real-valued and must be binarized before being input to the next layer. The binarized activation A _b is used to approximate the real-valued output A _f through the following series of transformations. See Figure 8A, then:

A _b =Sign(Htanh(PReLU(τA _f +b ₀ )+b ₁ )) (3);

A _f ≈ κ·A _b (4).

Among them, Htanh is the hard-tanh function, using the hard-tanh function can clamp the input at [-1,1] during the forward pass, and use the sinusoid in the backward pass. The PReLU function and the Sign function use Straight-Through-Estimator (STE) to calculate the gradient. The transformation process described above can reshape the input distribution and help regulate the training process of the neural network. In this embodiment, auxiliary parameters and auxiliary operators are introduced into the output value of the convolutional layer during the training process, which expands the network capacity and helps to improve network performance.

And in order to speed up the convergence speed, please refer to FIG. 8A , use batch norm processing (BatchNorms) during the training process to process (such as whitening processing) the floating-point output value A _f obtained by converting the scaling factor to solve the problem of data distribution changes.

In some embodiments, after the training of the neural network is completed, the auxiliary parameters and auxiliary operators can be absorbed into the regular parameters of the convolutional layer, so that the operation efficiency will not be reduced while ensuring the accuracy of the neural network. Exemplarily, please refer to FIG. 8B. During the training process, the embodiment shown in FIG. 8B introduces auxiliary parameters and auxiliary operators to expand the network capacity. After the training of the neural network is completed, the auxiliary parameters and auxiliary operators can be absorbed into the regular parameters of the convolution layer, so that the convolution layer shown in FIG. 8B can be obtained. The convolution layer performs binary convolution operations. The input value of the convolution layer and the quantization bit number of the weight value are both 1 bit.

For the weight of the convolutional layer, after the training of the neural network is completed, the training device may eliminate and/or fuse auxiliary parameters and auxiliary operators corresponding to the weight value of the convolutional layer; wherein, the quantized weight of the convolutional layer can be simplified as follows: determined according to the sign of the first auxiliary parameter and the sign of the floating-point weight.

In an example, the auxiliary parameters and auxiliary operators corresponding to the weights in the training process can be absorbed into a simple form during the inference process. Essentially, binarization only cares about the relative value of two numbers, regardless of their magnitude. Utilizing this property, equation (1), ie, W _b =Sign(Tanh(α·W _f )), can be simplified to W _b =Sign(α)·Sign(W _f ). That is to say, only the sign of the first auxiliary parameter and the sign of the floating-point weight need to be known, and then the quantized weight (ie, the binarized weight) can be determined.

For the output value of the convolutional layer, after the training of the neural network is completed, the training device may eliminate and/or fuse auxiliary parameters and auxiliary operators corresponding to the output value of the convolutional layer. For example, please refer to the three auxiliary operators shown in the dashed box in FIG. 7 and the auxiliary parameters not shown in FIG. 7 , which can be fused or eliminated after the neural network training is completed to obtain the structure shown in FIG. 5 . Wherein, the quantized output value of the convolutional layer can be simplified as: determined according to the sign of the third auxiliary parameter and the sign of the preset difference; the preset difference is the difference between the floating-point output value and the preset parameter; the preset parameter is determined according to the third auxiliary parameter and the at least two fifth auxiliary parameters.

In an example, the auxiliary parameters and auxiliary operators corresponding to the output values during the training process can be absorbed into a simple form during the reasoning process. Essentially, binarization only cares about the relative value of two numbers, regardless of their magnitude. Using this property, for each output channel n, Equation (3), ie, A _b =Sign(Htanh(PReLU(τA _f +b ₀ )+b ₁ )), can be simplified to A _b (n)=Sign(τ(n))Sign(A _f (n)−θ(n)). That is to say, only need to know the sign of the third auxiliary parameter, the sign of the difference between the floating-point output value and the preset parameter (ie θ(n)), the quantized output value of the convolutional layer (ie the binarized output value) can be determined.

where θ(n)∈R is a threshold depending on b ₀ (n), b ₁ (n) and τ(n) of channel n. The reason for this simplification is that given b ₀ (n), b ₁ (n) and τ(n), all transformations in equation (3) are monotonic, so only the zero point θ(n) needs to be solved to determine the symbol A _b (n). After the training of the neural network is completed, the operation process of the neural network only involves bitwise convolution and threshold (θ(n)), which simplifies the reasoning form on the basis of ensuring the network accuracy remains unchanged.

Finally, the convolution output can be approximated as: Conv(W _f ,A _f )≈(κ·λ)Conv(W _b ,A _b ). Wherein, the second auxiliary parameter (λ) corresponding to the weight of the convolutional layer and the fourth auxiliary parameter (κ) corresponding to the output value of the convolutional layer are the scaling factors of each channel, (κ·λ) ^∈RN , which can be brought into the next layer and can be absorbed during the processing of the activation function of the next convolutional layer. Taking the activation function as the PReLU function as an example, the second auxiliary parameter (λ) and the fourth auxiliary parameter (κ) can be absorbed into the PReLU operation of the next layer, so they do not need to be calculated during inference.

Exemplarily, the batch normalization process is also suitable for parameter fusion. In one example, the auxiliary parameters and auxiliary operators corresponding to the output values of the convolutional layers may be fused during the batch normalization process. For example, let γ, β∈R ^N be the scale and deviation of batch normalization processing (BatchNorm), then: PReLU(τ·BN(x)+ _{b 0} ₎ +b ₁ ＝PReLU(τ·(γ·x+β)+b 0 )+b ₁ ＝PReLU((τ·γ)·x+(τ·β+b ₀ ))+b ₁ . Among them, x is the output value A _f (n) of the convolutional layer, and the simplification principle of the PReLU function is the same as the simplification principle of (3) above.

In some embodiments, in order to expand the network capacity and improve the network accuracy, the number of channels of the input value and/or output value of the residual block can be further expanded during the training process, and since the expanded channels are binarized, the memory occupied by the 8-bit quantized neural network is less.

In some embodiments, in order to improve the accuracy of the neural network, the neural network can also include at least one network block, the quantization bit number of the input value or output value of the network block is greater than 1 bit, such as the input value or output value of the network block includes but not limited to 2 bits, 4 bits, 8 bits or 16 bits, etc.; the input value and the quantization bit number of the weight of the convolution layer in the network block are also greater than 1 bit, such as the convolution layer.

It can be understood that the network block can be located at any position of the neural network, and this embodiment does not impose any limitation on this, for example, the network block is arranged at the beginning and end of the neural network, and at least one residual block is arranged in the middle. Alternatively, the network block may also be set at an intermediate position in the neural network, which may be specifically set according to an actual application scenario.

In a possible implementation, in order to expand the network capacity and improve the network accuracy, during the training process, the number of channels of the input value and/or output value of the residual block can be set to be greater than the number of channels of the input value and/or output value of the network block. In an example, for example, the number of channels of the input value and/or output value of the residual block is 2 times or 4 times the number of channels of the input value and/or output value of the network block.

In some embodiments, the convolution operation of the convolution layer in the residual block is performed through a specified systolic array; the systolic array includes a plurality of processing units supporting 1-bit operation. Considering that the quantization bit numbers of the input value and the output value of the convolution layer are both 1 bit, the input bandwidth and the output bandwidth of the systolic array can be set to be the same. Exemplarily, the systolic array may be a square array to ensure the same bandwidth for data loading and writing.

In order to fully utilize the bandwidth, the systolic array includes multiple input lines for 1-bit data input.

Exemplarily, the weights of the convolutional layers are stored in NHWC format. Of course, the weights of the convolutional layers can also be stored in NCHW format. The utilization rate of MAC using data multiplexing and different data arrangements (NCHW and NHWC) is different (the utilization rate of NCHW arrangement depends on whether the W (width) dimension can divide the MAC size, and the NHWC arrangement utilization depends on whether the C (channel) dimension can divide the MAC size).

In an example, please refer to FIG. 9 , which shows a schematic diagram of a 1-bit systolic array, which is used to perform bit-by-bit convolution operations according to the weights and input values of the convolutional layer. The 1bit systolic array is obtained by changing the 16bit*16bit systolic array. In order to create a 1bit systolic array, 1bit PE (the smallest processing unit of the systolic array) is used to replace the 8bit PE. The 1bit systolic array includes multiple 1bit PEs. In order to fully utilize the input bandwidth, each 8bit input line is replaced with 8 1bit input lines, basically creating 8 times as many lines in the 1bit design. Since the array is square, the number of columns is also 8 times larger. Compared with 8bit, this means that the PE has been increased by 64 times, theoretically accelerated by 64 times, and the data processing efficiency has been significantly improved.

Various technical features in the above embodiments can be combined arbitrarily, as long as there is no conflict or contradiction between the combinations of features, so any combination of various technical features in the above embodiments also falls within the scope of the disclosure of this specification.

Correspondingly, referring to FIG. 10, the embodiment of the present application also provides an image processing method, including:

In step S201, an image to be processed is acquired.

In step S202, the image to be processed is input into a pre-trained neural network to obtain an image processing result; wherein, the neural network includes at least one residual block; the convolution layer in the residual block performs a binary convolution operation, and the number of quantized bits of the input value and output value of the residual block is 1 bit.

Wherein, the neural network is trained based on the training method shown in the embodiment shown in FIG. 4 . For relevant parts, refer to the description of the embodiment shown in FIG. 4 , which will not be repeated here.

In some embodiments, the neural network is used to process any one of the following computer vision tasks: image classification task, image localization task, object detection task, object tracking task, semantic segmentation task, instance segmentation task or super-resolution reconstruction task.

Correspondingly, referring to FIG. 11 , the embodiment of the present application also provides a neural network training device, the neural network is used for processing computer vision tasks, and the device includes:

memory 102 for storing executable instructions;

one or more processors 101;

Wherein, when the one or more processors 101 execute the executable instructions, they are individually or collectively configured to execute the above method.

The processor 101 executes the executable instructions included in the memory 102. The processor 101 can be a central processing unit (Central Processing Unit, CPU), and can also be other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGA) Or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.

The memory 102 stores executable instructions of the training method of the neural network, and the memory 102 may include at least one type of storage medium, the storage medium includes a flash memory, a hard disk, a multimedia card, a card-type memory (for example, SD or DX memory, etc.), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disk, etc. Also, the device may cooperate with a network storage device that performs a storage function of the memory through a network connection. The memory 102 may be an internal storage unit of the device, such as a hard disk or internal memory of the device. The memory 102 can also be an external storage device of the device, such as a plug-in hard disk equipped on the device, a smart memory card (Smart Media Card, SMC), a secure digital (Secure Digital, SD) card, a flash memory card (Flash Card) and the like. Further, the storage 102 may also include both an internal storage unit of the apparatus and an external storage device. The memory 102 is used to store computer programs and other data required by the device. The memory 102 can also be used to temporarily store data that has been output or will be output.

In some embodiments, when the one or more processors 101 execute the executable instructions, they are individually or jointly configured to:

Obtain image samples with labels;

Training a preset neural network according to the labeled image samples;

Exemplarily, the residual block includes a fusion layer, at least one convolution layer located in the main branch, and at least one convolution layer located in the jump branch; the input value of the convolution layer and the quantization bit number of the weight value are both 1 bit; the fusion layer is used to fuse the output value of the convolution layer in the main branch and the output value of the convolution layer in the jump branch; wherein, the number of convolution layers in the main branch is more than the number of convolution layers in the jump branch.

Exemplarily, if the next layer of the convolution layer is the fusion layer, the quantization bit number of the output value of the convolution layer is greater than 1 bit; if the next layer of the convolution layer is not the fusion layer, the quantization bit number of the output value of the convolution layer is 1 bit.

Exemplarily, the number of quantization bits of the output value of the fusion layer is 1 bit.

Exemplarily, if the next layer of the convolutional layer is the fusion layer, the number of quantized bits of the output value of the convolutional layer is any of the following: 2 bits, 4 bits, 8 bits or 16 bits.

Exemplarily, the fusion layer includes an addition layer or a stitching layer; the addition layer is used to add the output values of at least two convolutional layers connected to the addition layer; the stitching layer is used to stitch the output values of at least two convolutional layers connected to the stitching layer.

Exemplarily, if the fusion layer is an addition layer, the output value of the residual block is output by the addition layer; if the fusion layer is a splicing layer, the output value of the residual block is output by the next convolutional layer of the splicing layer.

Exemplarily, the tags corresponding to different computer vision tasks are different.

Exemplarily, the computer vision task includes any one or more of the following: image classification task, image localization task, object detection task, object tracking task, semantic segmentation task, instance segmentation task or super-resolution reconstruction task.

Exemplarily, during the training process, auxiliary parameters and auxiliary operators are introduced to assist the training of the convolutional layer; after the training of the neural network is completed, the auxiliary parameters and auxiliary operators are absorbed into the parameters of the convolutional layer.

Exemplarily, during the training process, the weight of the convolutional layer corresponds to a first auxiliary parameter and a second auxiliary parameter; wherein, the first auxiliary parameter is used to control the degree of quantization of the floating-point weight of the convolutional layer into 1 bit; the second auxiliary parameter is used to indicate the scaling degree of the quantized weight.

Exemplarily, the processor is further configured to: during the training process, use a first auxiliary operator and a second auxiliary operator to quantize the floating-point weight of the convolutional layer; wherein the first auxiliary operator is used to quantize the floating-point weight of the convolutional layer into 1 bit according to the first auxiliary parameter in the forward pass process; and the second auxiliary operator is used to determine the sign of the quantized weight. Exemplarily, the first auxiliary operator includes a Tanh function; the second auxiliary operator includes a sign function.

Exemplarily, the processor is further configured to: after the neural network training is completed, eliminate and/or fuse auxiliary parameters and auxiliary operators corresponding to the weight values of the convolutional layer; wherein, the quantized weight of the convolutional layer can be simplified as: determined according to the sign of the first auxiliary parameter and the sign of the floating-point weight.

Exemplarily, the output value of the convolutional layer corresponds to a third auxiliary parameter, a fourth auxiliary parameter, and at least two fifth auxiliary parameters; the third auxiliary parameter is used to control the quantization degree of quantizing the output value of the convolutional layer into 1 bit; the fourth auxiliary parameter is used to indicate the scaling degree of the quantized output value; the at least two fifth auxiliary parameters are different offsets of the output value.

Exemplarily, the processor is further configured to: during the training process, use a third auxiliary operator, a fourth auxiliary operator, and a second auxiliary operator to quantize the floating-point output value of the convolutional layer; wherein, the third auxiliary operator is used to perform nonlinear processing on the floating-point output value of the convolutional layer according to the third auxiliary parameter and one of the fifth auxiliary parameters; The sign of the output value.

Exemplarily, the third auxiliary operator includes an activation function, the fourth auxiliary operator includes a hard-tanh function, and the second auxiliary operator includes a sign function.

Exemplarily, the processor is further configured to: after the neural network training is completed, eliminate and/or fuse auxiliary parameters and auxiliary operators corresponding to the output values of the convolutional layer; wherein, the quantized output value of the convolutional layer can be simplified as follows: determined according to the sign of the third auxiliary parameter and the sign of a preset difference; the preset difference is the difference between the floating-point output value and a preset parameter; the preset parameter is determined according to the third auxiliary parameter and the at least two fifth auxiliary parameters.

Exemplarily, the second auxiliary parameter corresponding to the weight of the convolutional layer and the fourth auxiliary parameter corresponding to the output value of the convolutional layer can be absorbed during the processing of the activation function of the next convolutional layer.

Exemplarily, the neural network further includes at least one network block, and the number of quantized bits of the input value or output value of the network block is greater than 1 bit.

Exemplarily, the number of channels of the input value and/or output value of the residual block is greater than the number of channels of the input value and/or output value of the network block.

Exemplarily, the convolution operation of the convolution layer in the residual block is performed through a specified systolic array; the systolic array includes a plurality of processing units supporting 1-bit operation.

Exemplarily, the input bandwidth and the output bandwidth of the systolic array are the same.

Exemplarily, the pulsation array is a square array.

Exemplarily, the systolic array includes a plurality of input lines for 1-bit data input.

Exemplarily, the weights of the convolutional layers are stored in NHWC format.

For the implementation process of the functions and effects of each unit in the above device, please refer to the implementation process of the corresponding steps in the above method for details, and will not be repeated here.

Correspondingly, the embodiment of the present application also provides an image processing device, including:

memory for storing executable instructions;

one or more processors;

Wherein, when the one or more processors execute the executable instructions, they are individually or jointly configured to:

Get the image to be processed;

Exemplarily, the neural network is used to process any one of the following computer vision tasks: image classification task, image localization task, object detection task, object tracking task, semantic segmentation task, instance segmentation task or super-resolution reconstruction task.

Correspondingly, an embodiment of the present application also provides an image processing system, including the above-mentioned image processing device and a movable platform;

Exemplarily, the mobile platform includes, but is not limited to, a drone, an unmanned vehicle, a mobile robot, an unmanned ship, or a cloud platform.

In an example, the image processing device may be integrated in a mobile platform, as shown in FIG. 1 or FIG. 2 . In another example, the image processing apparatus may be installed in a terminal device, and the terminal device is communicatively connected to the movable platform. The terminal device may be, for example, a remote controller of the movable platform, as shown in FIG. 3 .

In an exemplary embodiment, there is also provided a non-transitory computer-readable storage medium including instructions, such as a memory including instructions, which are executable by a processor of an apparatus to perform the above method. For example, the non-transitory computer readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, and the like.

A non-transitory computer-readable storage medium, enabling the terminal to execute the above method when instructions in the storage medium are executed by a processor of the terminal.

Various implementations described herein can be implemented using a computer readable medium such as computer software, hardware, or any combination thereof. For hardware implementation, the embodiments described herein may be implemented using at least one of an Application Specific Integrated Circuit (ASIC), Digital Signal Processor (DSP), Digital Signal Processing Device (DSPD), Programmable Logic Device (PLD), Field Programmable Gate Array (FPGA), processor, controller, microcontroller, microprocessor, electronic unit designed to perform the functions described herein. For software implementation, an embodiment such as a procedure or a function may be implemented with a separate software module that allows at least one function or operation to be performed. The software codes can be implemented by a software application (or program) written in any suitable programming language, stored in memory and executed by a controller.

It should be noted that in this document, relational terms such as first and second are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. The term "comprising", "comprising" or any other variation thereof is intended to cover a non-exclusive inclusion such that a process, method, article or apparatus comprising a set of elements includes not only those elements but also other elements not expressly listed or which are inherent to such process, method, article or apparatus. Without further limitations, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or apparatus comprising said element.

The method and device provided by the embodiment of the present application have been introduced in detail above, and specific examples have been used in this paper to illustrate the principle and implementation of the application. The description of the above embodiment is only used to help understand the method and core idea of the application; meanwhile, for those of ordinary skill in the art, according to the idea of the application, there will be changes in the specific implementation and application range. In summary, the content of this specification should not be understood as limiting the application.

Claims

A training method of a neural network, characterized in that the neural network is used to process computer vision tasks, the method comprising:

Obtain image samples with labels;

Training a preset neural network according to the labeled image samples;

Wherein, the neural network includes at least one residual block; the convolution layer in the residual block performs a binary convolution operation, and the number of quantized bits of the input value and the output value of the residual block is 1 bit.
The method according to claim 1, wherein the residual block includes a fusion layer, at least one convolutional layer located in the main branch, and at least one convolutional layer located in the jumper branch; the input value of the convolutional layer and the number of quantized bits of the weight value are both 1 bit;

The fusion layer is used to fuse the output value of the convolutional layer in the main branch and the output value of the convolutional layer in the jumper branch;

Wherein, the number of convolutional layers in the main branch is greater than the number of convolutional layers in the jumper branch.
The method according to claim 2, characterized in that,

If the next layer of the convolutional layer is the fusion layer, the number of quantized bits of the output value of the convolutional layer is greater than 1 bit;

If the next layer of the convolutional layer is not the fusion layer, the number of quantized bits of the output value of the convolutional layer is 1 bit.
The method according to claim 2 or 3, characterized in that the number of quantized bits of the output value of the fusion layer is 1 bit.
The method according to claim 3, wherein if the next layer of the convolutional layer is the fusion layer, the number of quantized bits of the output value of the convolutional layer is any of the following: 2 bits, 4 bits, 8 bits or 16 bits.
The method according to claim 2, wherein the fusion layer comprises an addition layer or a stitching layer;

The addition layer is used to add the output values of at least two convolutional layers connected to the addition layer;

The splicing layer is used to concatenate output values of at least two convolutional layers connected to the splicing layer.
The method according to claim 6, wherein if the fusion layer is an addition layer, the output value of the residual block is output by the addition layer;

If the fusion layer is a splicing layer, the output value of the residual block is output by the next convolutional layer of the splicing layer.
The method according to claim 1, wherein the labels corresponding to different computer vision tasks are different.
The method according to claim 1 or 8, wherein the computer vision task comprises any one or more of the following: image classification task, image localization task, target detection task, target tracking task, semantic segmentation task, instance segmentation task or super-resolution reconstruction task.
The method according to claim 2, characterized in that, in the training process, introducing auxiliary parameters and auxiliary operators to assist the training of the convolutional layer;

After the training of the neural network is completed, the auxiliary parameters and auxiliary operators are absorbed into the parameters of the convolution layer.
The method according to claim 10, wherein, in the training process, the weight of the convolutional layer corresponds to a first auxiliary parameter and a second auxiliary parameter;

Wherein, the first auxiliary parameter is used to control the degree of quantization of the floating-point weight of the convolutional layer into 1 bit;

The second auxiliary parameter is used to represent the scaling degree of the quantized weight.
The method according to claim 11, characterized in that the method further comprises:

During the training process, using the first auxiliary operator and the second auxiliary operator to quantize the floating-point weights of the convolutional layer;

Wherein, the first auxiliary operator is used to quantize the floating-point weight of the convolutional layer into 1 bit according to the first auxiliary parameter in the forward pass process;

The second auxiliary operator is used to determine the sign of the quantized weight.
The method according to claim 12, wherein the first auxiliary operator comprises a Tanh function; the second auxiliary operator comprises a sign function.
The method according to claim 12, further comprising:

After the neural network training is completed, eliminating and/or fusing auxiliary parameters and auxiliary operators corresponding to the weight values of the convolutional layer;

Wherein, the quantized weight of the convolutional layer can be simplified as: determined according to the sign of the first auxiliary parameter and the sign of the floating-point weight.
The method according to claim 8, wherein the output value of the convolutional layer corresponds to a third auxiliary parameter, a fourth auxiliary parameter, and at least two fifth auxiliary parameters;

The third auxiliary parameter is used to control the quantization degree of quantizing the output value of the convolutional layer into 1 bit;

The fourth auxiliary parameter is used to indicate the scaling degree of the quantized output value;

The at least two fifth auxiliary parameters are different offsets of the output value.
The method according to claim 15, further comprising:

During the training process, the floating-point output value of the convolution layer is quantized by using the third auxiliary operator, the fourth auxiliary operator and the second auxiliary operator;

Wherein, the third auxiliary operator is used to perform nonlinear processing on the floating-point output value of the convolution layer according to the third auxiliary parameter and one of the fifth auxiliary parameters;

The fourth auxiliary operator is used to quantize the result output by the third auxiliary operator into 1 bit according to another fifth auxiliary parameter during forward transmission;

The second auxiliary function is used to determine the sign of the quantized output value.
The method according to claim 16, wherein the third auxiliary operator includes an activation function, the fourth auxiliary operator includes a hard-tanh function, and the second auxiliary operator includes a sign function.
The method according to claim 16, further comprising:

After the neural network training is completed, eliminating and/or fusing auxiliary parameters and auxiliary operators corresponding to the output values of the convolutional layer;

Wherein, the quantized output value of the convolutional layer can be simplified as: determined according to the sign of the third auxiliary parameter and the sign of the preset difference; the preset difference is the difference between the floating-point output value and the preset parameter; the preset parameter is determined according to the third auxiliary parameter and the at least two fifth auxiliary parameters.
The method according to claim 10, wherein the second auxiliary parameter corresponding to the weight of the convolutional layer and the fourth auxiliary parameter corresponding to the output value of the convolutional layer can be absorbed during the activation function processing of the next convolutional layer.
The method according to claim 1, wherein the neural network further comprises at least one network block, and the number of quantized bits of the input value or output value of the network block is greater than 1 bit.
The method according to claim 20, characterized in that the number of channels of the input value and/or output value of the residual block is greater than the number of channels of the input value and/or output value of the network block.
The method according to claim 1 or 2, wherein the convolution operation of the convolution layer in the residual block is performed through a specified systolic array;

The systolic array includes a plurality of processing units supporting 1-bit operation.
The method according to claim 22, wherein the input bandwidth and the output bandwidth of the systolic array are the same.
The method according to claim 22 or 23, wherein the pulsation array is a square array.
The method according to claim 22, wherein the systolic array comprises a plurality of input lines for 1-bit data input.
The method according to claim 22, wherein the weights of the convolutional layers are stored in NHWC format.
An image processing method, characterized in that, comprising:

Get the image to be processed;

The image to be processed is input into a pre-trained neural network to obtain image processing results; wherein the neural network includes at least one residual block; the convolution layer in the residual block performs binary convolution operation, and the quantized bit numbers of the input value and the output value of the residual block are 1 bit.
The method according to claim 27, wherein the neural network is used to process any of the following computer vision tasks: image classification tasks, image positioning tasks, target detection tasks, target tracking tasks, semantic segmentation tasks, instance segmentation tasks or super-resolution reconstruction tasks.
A training device for a neural network, characterized in that the neural network is used for processing computer vision tasks, and the device comprises:

memory for storing executable instructions;

one or more processors;

Wherein, when the one or more processors execute the executable instructions, they are individually or jointly configured to perform the method described in any one of claims 1-26.
An image processing device, characterized in that it comprises:

memory for storing executable instructions;

one or more processors;

Wherein, when the one or more processors execute the executable instructions, they are individually or collectively configured to perform the method described in claim 27 or 28 .
An image processing system, characterized by comprising the image processing device and a movable platform according to claim 30;

The movable platform is provided with a photographing device, and the movable platform is used to send the image captured by the photographing device to the image processing device.
A computer-readable storage medium, wherein the computer-readable storage medium stores executable instructions, and when the executable instructions are executed by a processor, the method according to any one of claims 1 to 28 is implemented.