CN117315366A - Image processing method based on double-branch parallel neural network - Google Patents
Image processing method based on double-branch parallel neural network
- Publication number
- CN117315366A (application number CN202311334063.6A)
- Authority
- CN
- China
- Prior art keywords
- module
- input
- output
- steps
- block
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06V10/764 — Image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
- G06N3/0464 — Computing arrangements based on biological models; neural networks; architecture; convolutional networks [CNN, ConvNet]
- G06V10/44 — Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
- G06V10/82 — Image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V10/86 — Image or video recognition or understanding using pattern recognition or machine learning using syntactic or structural representations of the image or video pattern, e.g. symbolic string recognition; using graph matching
Abstract
The invention relates to the technical field of computer vision image classification, and in particular to an image processing method based on a double-branch parallel neural network. The method comprises the following steps. S1: the input image is fed into a first basic convolution module and a feature embedding module, which perform feature extraction and space vector mapping respectively, yielding feature X_0 and space vector Z_0. S2: X_0 and Z_0 are input into a bidirectional bridging module Block, which outputs X_block0 and Z_block0. Step S2 comprises: S2-1: X_0 and Z_0 are fed into a first fusion module, which outputs Z_hid; S2-2: Z_hid is input into a Transformer network module, which learns and extracts low-frequency global features from the image features and outputs Z_block0; S2-3: X_0 and Z_block0 are input into a Flexible CNN, which captures local features in the image features and outputs X_hid; S2-4: X_hid and Z_block0 are input into the second fusion module, which outputs X_block0. S3: the output of each bidirectional bridging module Block is taken as the input of the next bidirectional bridging module Block, step S2 is repeated several times, and X_block and Z_block are finally output.
Description
Technical Field
The invention relates to the technical field of computer vision image classification, in particular to an image processing method based on a double-branch parallel neural network.
Background
Neural networks, as powerful machine learning models, achieve excellent results in a variety of visual tasks such as image classification, object detection, and semantic segmentation by learning complex features and functions from large amounts of data. To let a network learn sufficiently rich data features and meet the requirements of various complex scenes, the depth and width of neural networks have been continuously increased, and with them the computation and parameter counts. Lightweight neural networks aim to build models with fewer parameters and lower computational resource consumption, while still meeting performance requirements, through better module combinations, thereby addressing the heavy storage and computation cost of deep neural networks and promoting the adoption and deployment of deep learning. Most existing lightweight neural networks only optimize the convolution modules: a standard convolution is decomposed into a depthwise convolution, which changes the feature-map size, combined with a 1×1 pointwise convolution, which expands the number of channels, so as to reduce the FLOPs (floating-point operations) of the network. However, because each basic convolution operates only on the pixels in a rectangular window around the convolution kernel, feature extraction in such a network remains confined to a local range no matter how the depth and width change, and long-distance relations across the whole image cannot be established. On the other hand, the capacity of such networks is limited, so the gains from adding more data samples gradually diminish.
To overcome the limited receptive field of convolutional neural networks and their difficulty in capturing global information, the BotNet (Bottleneck Transformers for Visual Recognition) network, for example, combines a Transformer with a convolutional neural network. It integrates the self-attention mechanism of the Transformer into the backbone of a traditional convolutional network and proposes a new module named Bot BottleNeck: with ResNet-50 as the base network, the 3×3 convolution in the last three BottleNeck blocks of ResNet-50 is replaced by a Transformer-style Multi-Head Self-Attention module (abbreviated MHSA), while the remaining modules stay unchanged. The network takes an ordinary image as input, extracts features through the original backbone to obtain a 2048-channel feature map, and feeds this map into the Bot BottleNeck, where a 1×1 convolution first reduces the channels before the multi-head self-attention layer. Unlike the multi-head self-attention layer of a conventional Transformer, for the position-encoding part BotNet uses two vectors as spatial attention along the horizontal and vertical dimensions. After the two vectors are added, they are multiplied by the query matrix to obtain the Content-Position term, which is then combined with the Content term, obtained from the query and key matrices, to produce position-aware similarity features, so that the multi-head self-attention module attends to more appropriate regions. The feature map is then passed through another 1×1 convolution that expands the channels again.
However, existing schemes for combining convolutional neural networks and Transformers either use convolution only at the beginning or embed convolution into every Transformer module, so the network as a whole still follows a serial design paradigm. Because such schemes connect the two network structures in series, the modules depend strongly on one another: the connection relations and insertion positions of the different structures must be considered carefully, which makes the network difficult to modify. In addition, existing schemes do not weigh the model's computational complexity against the performance gain it brings; the resulting models reach several GFLOPs, and the deployment cost is high.
Disclosure of Invention
The invention aims to provide an image processing method based on a double-branch parallel neural network, which solves the problem of high network deployment cost when a convolutional neural network is combined with a Transformer.
The basic scheme provided by the invention is as follows: an image processing method based on a double-branch parallel neural network, comprising the following steps:
S1: the input image is fed into a first basic convolution module and a feature embedding module, which perform feature extraction and space vector mapping respectively, yielding feature X_0 and space vector Z_0;
S2: X_0 and Z_0 are input into a bidirectional bridging module Block, which comprises a parallel convolutional neural network module Flexible CNN, a parallel Transformer network module, a first fusion module and a second fusion module;
step S2 comprises the following steps:
S2-1: X_0 and Z_0 are fed into the first fusion module, which fuses the local and global features and outputs Z_hid;
S2-2: Z_hid is input into the Transformer network module, which learns and extracts low-frequency global features from the image features and outputs Z_block0;
S2-3: X_0 and Z_block0 are input into the Flexible CNN, which captures local features in the image features and outputs X_hid;
S2-4: X_hid and Z_block0 are input into the second fusion module, which fuses the local and global features again and outputs X_block0;
S3: the outputs X_block0 and Z_block0 of each bidirectional bridging module Block are taken as the input of the next bidirectional bridging module Block, step S2 is repeated several times, and X_block and Z_block are finally output.
The principle and advantages of the invention are as follows. The invention adjusts the network structure into a double-branch parallel structure, combining the relatively efficient computation of a convolutional neural network with the global feature-extraction capability of a Transformer model. The network consists of a lightweight convolution branch and a modularized Transformer branch; each branch only needs to handle its own feature extraction and module construction, which reduces the dependence between the two different networks. Meanwhile, a bidirectional bridging module, built by analogy with the self-attention mechanism, receives features from the convolution branch and the Transformer branch respectively, fuses them within the module, and outputs the result, realizing information interaction between the branches. The result is a double-branch parallel network structure that solves the problem of high network deployment cost.
The convolutional neural network and the Transformer model are constructed as parallel branches that extract input features independently, so that the two networks are decoupled from each other. Meanwhile, the bidirectional bridging module performs bidirectional fusion and interaction of the features extracted by the two branches, so the network can fully exploit the convolutional network's strength in local feature capture and the Transformer's strength in global information modeling.
In the prior art, the network structure is serial: a Transformer module is introduced at the head, middle, or tail of a convolutional neural network, which means that when each module is constructed, its relation to the preceding and following modules must be considered, leaving the network insufficiently flexible. The invention re-examines this limitation of schemes combining convolutional neural networks and Transformer networks and redesigns the network structure; in the parallel form, the modules in the network are much easier to modify.
Further, the method also comprises the following steps:
s4: x is to be block Inputting into a second basic convolution module, performing channel amplification, and outputting X tail ;
S5: x is to be tail Input into an adaptive pooling module, X tail The Input size of (a) is recorded as Input, the expected Output size is Output, and the self-adaptive pooling module calculates the pooling core size and the step of the pooling operation according to the Input and OutputSize, dimension reduction is carried out on the characteristics, and X is output pool 。
In order to ensure the consistency of output after pooling, a self-adaptive pooling operation is introduced to replace conventional pooling, the size of a pooling core and the step size can be automatically adjusted in the pooling process, the limitation of a network on the size of an input image is reduced, the input image with different sizes is adapted, and the flexibility of the network is improved.
Further, step S5 comprises the following steps:
S501: judge whether Input is divisible by Output; when it is, execute S502, otherwise execute S503;
S502: compute the pooling kernel size K and stride S (in this divisible case the computation reduces to K = S = Input / Output);
S503: compute the pooling kernel size K_i and stride S_i of the i-th pooling operation, the window boundaries being obtained from floor(i·Input/Output) and ceil((i+1)·Input/Output);
where i (counted from 0) indexes the i-th pooling operation in one feature map, ceil denotes rounding up, and floor denotes rounding down.
When Input is not divisible by Output, the stride and pooling kernel size change dynamically at each pooling operation and are computed with these rounding operations.
Further, the method also comprises the following steps:
s6: x is to be pool And Z block Splicing in channel dimension, outputting X cat ;
S7: x is to be cat Inputting into two layers of fully connected neural networks to obtain final output X out 。
Further, step S3 comprises the following steps:
S301: before the output of one bidirectional bridging module Block is fed into the next bidirectional bridging module Block, the parameters of the convolutional neural network module are adjusted; the parameters include the channels and the stride.
By adjusting the parameters, the number of channels of the feature map output at each stage gradually increases while the feature-map size gradually decreases.
Drawings
FIG. 1 is a schematic diagram of a structure of a dual-branch network according to an embodiment of an image processing method based on a dual-branch parallel neural network of the present invention;
FIG. 2 is the visualization result of the layer-by-layer class activation maps of the dual-branch network in an embodiment of the image processing method based on a dual-branch parallel neural network.
Detailed Description
The following is a further detailed description of the embodiments:
An embodiment is substantially as shown in FIGS. 1 and 2:
an image processing method based on a double-branch parallel neural network comprises the following steps:
S1: the input image is fed into a first basic convolution module and a feature embedding module, which perform feature extraction and space vector mapping respectively, yielding feature X_0 and space vector Z_0.
In this embodiment, S1 further includes the following steps:
S101: the input image is fed into a Base Conv module consisting of three pure convolution operations: the first is named Stem Conv, and the second and third are named Neck Conv. This module mainly performs basic feature extraction on the input image, producing the output X_0 of this step.
S2: X_0 and Z_0 are input into a bidirectional bridging module Block, which outputs X_block0 and Z_block0. The bidirectional bridging module Block comprises a parallel convolutional neural network module Flexible CNN, a parallel Transformer network module, a first fusion module and a second fusion module;
the step S2 comprises the following steps:
S2-1: X_0 and Z_0 are fed into the first fusion module, which fuses the local and global features and outputs Z_hid. In this embodiment, the first fusion module is the CNN to Former module in FIG. 1; it consists of several matrix multiplication and addition operations. Following the standard self-attention module of the Transformer, it is implemented with the following matrix operations:
(1) multiply Z_0 by the query matrix obtained by training;
(2) multiply the result of (1) by X_0;
(3) feed the result of (2) into a Softmax function;
(4) multiply the result of (3) by X_0;
(5) multiply the result of (4) by a transformation matrix obtained by training;
(6) add the result of (5) to Z_0 to obtain the final output Z_hid.
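The six matrix operations above can be sketched in plain Python. This is a minimal illustration only: the names `cnn_to_former`, `Wq` (query matrix) and `Wt` (transformation matrix), and all shapes are assumptions made for the sketch, since the patent specifies only the order of operations; a real implementation would use learned, batched weight tensors.

```python
import math

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def transpose(M):
    return [list(col) for col in zip(*M)]

def softmax_rows(M):
    """Row-wise softmax, numerically stabilized by subtracting the row max."""
    out = []
    for row in M:
        m = max(row)
        exps = [math.exp(v - m) for v in row]
        s = sum(exps)
        out.append([e / s for e in exps])
    return out

def cnn_to_former(Z0, X0, Wq, Wt):
    """Z0: (n_tokens, d) token matrix; X0: (h*w, d) flattened CNN features;
    Wq, Wt: (d, d) trained matrices (assumed square here for simplicity)."""
    Q = matmul(Z0, Wq)                           # (1) query projection
    A = softmax_rows(matmul(Q, transpose(X0)))   # (2)+(3) attention scores
    ctx = matmul(A, X0)                          # (4) aggregate CNN features
    out = matmul(ctx, Wt)                        # (5) output transformation
    # (6) residual connection back onto Z0
    return [[z + o for z, o in zip(zr, orow)] for zr, orow in zip(Z0, out)]
```

With Z_0 as a small token matrix and X_0 as a flattened feature map, `cnn_to_former` returns a matrix of the same shape as Z_0, each row being the original token plus an attention-weighted combination of the CNN features.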
S2-2: Z_hid is input into the Transformer network module, which learns and extracts low-frequency global features from the image features and outputs Z_block0;
S2-3: X_0 and Z_block0 are input into the Flexible CNN, which captures local features in the image features and outputs X_hid. Specifically, the local small-neighborhood context of the features is obtained through convolution and its inductive bias, producing the output X_hid of this step.
S2-4: X_hid and Z_block0 are input into the second fusion module, which fuses the local and global features again and outputs X_block0;
S3: the output of each bidirectional bridging module Block is taken as the input of the next bidirectional bridging module Block, step S2 is repeated several times, and X_block and Z_block are finally output. The second fusion module is the Former to CNN module in FIG. 1, which realizes the re-interaction and fusion of the global and local features.
Specifically, X_0 and Z_0 are input into the Block-0 module, which outputs X_block0 and Z_block0; X_block0 and Z_block0 are then input into the Block-1 module, the operation of step S2 is repeated, and X_block1 and Z_block1 are output. The process continues in this way. Before the output of one Block becomes the input of the next, the parameters of the Block module, including the channels and the stride, are adjusted: the corresponding configuration file config.py is read when the network is constructed, and different networks are built according to the per-layer parameters set in that file. This is a common practice in convolutional neural networks. Each channel can be regarded as a different feature; as the network deepens, the learned features become more abstract, so increasing the number of channels lets the network represent and cover more key features and improves classification accuracy. The purpose of shrinking the feature map is to reduce the number of parameters in the network, avoid overfitting, and speed up inference.
In this way, the number of channels of the feature map output at each stage gradually increases while the feature-map size gradually decreases. In this embodiment, the process is repeated eight times, from Block-0 to Block-7, and Block-7 finally produces the final outputs X_block7 and Z_block7. The visualization of the layer-by-layer class activation maps from Block-0 to Block-7 is shown in FIG. 2.
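The Block-0 to Block-7 stacking described above can be sketched as follows. The concrete channel counts and strides are illustrative assumptions standing in for the values read from config.py (the patent does not list them); the Block internals are stubbed, and only the chaining of outputs into inputs and the channel-growth/size-reduction pattern are shown.

```python
# (out_channels, stride) per Block -- illustrative values, not from the patent
BLOCK_CONFIG = [
    (32, 1), (32, 2), (64, 1), (64, 2),
    (128, 1), (128, 2), (256, 1), (256, 2),
]

def run_block(x_shape, z_tokens, out_channels, stride):
    """Stand-in for one bidirectional bridging Block: returns the shapes the
    real Block would produce (channels set by config, spatial size divided
    by the stride); the token count of the Transformer branch is unchanged."""
    c, h, w = x_shape
    return (out_channels, h // stride, w // stride), z_tokens

def run_backbone(x_shape, z_tokens):
    """Feed the output of each Block into the next, as in step S3."""
    for out_channels, stride in BLOCK_CONFIG:
        x_shape, z_tokens = run_block(x_shape, z_tokens, out_channels, stride)
    return x_shape, z_tokens
```

Starting from a (16, 64, 64) feature map, the four stride-2 Blocks in this sketch halve the spatial size four times while the channels grow to 256, illustrating the "channels up, size down" pattern the embodiment describes.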
S4: X_block is input into a second basic convolution module, which performs channel expansion and outputs X_tail;
S5: X_tail is input into an adaptive pooling module. The input size of X_tail is recorded as Input and the desired output size as Output; the adaptive pooling module computes the pooling kernel size and stride of each pooling operation from Input and Output, reduces the dimensionality of the features, and outputs X_pool. In the implementation, the size of the output feature map is given directly, and the stride and pooling kernel size of each pooling step are then adjusted according to the integer-division relation between the input and output feature maps; the way these two parameters are computed at each pooling step is given in steps S502 and S503.
S501: judge whether Input is divisible by Output; when it is, execute S502, otherwise execute S503;
S502: compute the pooling kernel size K and stride S (in this divisible case the computation reduces to K = S = Input / Output);
S503: when Input is not divisible by Output, the stride and pooling kernel size of each pooling operation change dynamically; the pooling kernel size K_i and stride S_i are computed with the window boundaries obtained from floor(i·Input/Output) and ceil((i+1)·Input/Output);
where i (counted from 0) indexes the i-th pooling operation in one feature map, ceil denotes rounding up, and floor denotes rounding down.
Assuming the adaptive pooling is max pooling, with input size Input = 14 and desired output size Output = 4, the adaptive pooling operation divides the pooling into four intervals:
Input: [0,1,2,3,4,5,6,7,8,9,10,11,12,13]
Output: [3,6,10,13]
Interval 1: pooling kernel size 4, [0,1,2,3] → 3, then stride 3
Interval 2: pooling kernel size 4, [3,4,5,6] → 6, then stride 4
Interval 3: pooling kernel size 4, [7,8,9,10] → 10, then stride 3
Interval 4: pooling kernel size 4, [10,11,12,13] → 13.
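The worked example above can be reproduced with the following sketch. The patent's formula images are not reproduced in this text, so the sketch assumes the standard adaptive-pooling window rule (window i spans floor(i·Input/Output) to ceil((i+1)·Input/Output)); that assumption matches the four intervals, kernel sizes, and strides listed above exactly.

```python
import math

def adaptive_intervals(inp, out):
    """Return (start, kernel_size) for each of the `out` pooling windows over
    an input of length `inp`, using start_i = floor(i*inp/out) and
    end_i = ceil((i+1)*inp/out). When inp is divisible by out this reduces
    to kernel = stride = inp // out; otherwise windows may overlap and the
    stride varies from window to window."""
    windows = []
    for i in range(out):
        start = math.floor(i * inp / out)
        end = math.ceil((i + 1) * inp / out)
        windows.append((start, end - start))
    return windows

def adaptive_max_pool1d(values, out):
    """1-D adaptive max pooling over a list of values."""
    return [max(values[s:s + k])
            for s, k in adaptive_intervals(len(values), out)]
```

Running `adaptive_max_pool1d(list(range(14)), 4)` reproduces the output [3, 6, 10, 13] of the example, with window starts 0, 3, 7, 10 (strides 3, 4, 3) and kernel size 4 throughout.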
S6: X_pool and Z_block are spliced in the channel dimension, outputting X_cat;
S7: X_cat is input into a two-layer fully connected neural network to obtain the final output X_out.
The foregoing is merely an embodiment of the present invention, and specific structures and features well known in the art are not described in detail here. Before the application date or the priority date, those skilled in the art possess the common general knowledge of the field, are able to ascertain all of the prior art in the field, and have the ability to apply routine experimental means; with the teaching of this application, they can perfect and implement the present solution, and some typical known structures or known methods should not become obstacles to its implementation. It should be noted that those skilled in the art can make modifications and improvements without departing from the structure of the present invention; these should also be regarded as falling within the protection scope of the present invention and do not affect the effect of implementing the invention or the applicability of the patent. The protection scope of this application is defined by the claims, and the description of the specific embodiments in the specification may be used to interpret the content of the claims.
Claims (5)
1. An image processing method based on a double-branch parallel neural network, characterized in that it comprises the following steps:
S1: the input image is fed into a first basic convolution module and a feature embedding module, which perform feature extraction and space vector mapping respectively, yielding feature X_0 and space vector Z_0;
S2: X_0 and Z_0 are input into a bidirectional bridging module Block, which outputs X_block0 and Z_block0; the bidirectional bridging module Block comprises a parallel convolutional neural network module Flexible CNN, a parallel Transformer network module, a first fusion module and a second fusion module;
step S2 comprises the following steps:
S2-1: X_0 and Z_0 are fed into the first fusion module, which fuses the local and global features and outputs Z_hid;
S2-2: Z_hid is input into the Transformer network module, which learns and extracts low-frequency global features from the image features and outputs Z_block0;
S2-3: X_0 and Z_block0 are input into the Flexible CNN, which captures local features in the image features and outputs X_hid;
S2-4: X_hid and Z_block0 are input into the second fusion module, which fuses the local and global features again and outputs X_block0;
S3: the output of each bidirectional bridging module Block is taken as the input of the next bidirectional bridging module Block, step S2 is repeated several times, and X_block and Z_block are finally output.
2. The image processing method based on the double-branch parallel neural network according to claim 1, characterized in that the method further comprises the following steps:
S4: X_block is input into a second basic convolution module, which performs channel expansion and outputs X_tail;
S5: X_tail is input into an adaptive pooling module; the input size of X_tail is recorded as Input and the desired output size as Output, and the adaptive pooling module computes the pooling kernel size and stride of each pooling operation from Input and Output, reduces the dimensionality of the features, and outputs X_pool.
3. The image processing method based on the double-branch parallel neural network according to claim 2, characterized in that step S5 comprises the following steps:
S501: judge whether Input is divisible by Output; when it is, execute S502, otherwise execute S503;
S502: calculate the pooling kernel size K and stride S by the corresponding formula;
S503: calculate the pooling kernel size K_i and stride S_i by the corresponding formula;
where i (counted from 0) indexes the i-th pooling operation in one feature map, ceil denotes rounding up, and floor denotes rounding down.
4. The image processing method based on the double-branch parallel neural network according to claim 3, characterized in that the method further comprises the following steps:
S6: X_pool and Z_block are spliced in the channel dimension, outputting X_cat;
S7: X_cat is input into a two-layer fully connected neural network to obtain the final output X_out.
5. The image processing method based on the double-branch parallel neural network according to claim 1, characterized in that step S3 comprises the following steps:
S301: before the output of one bidirectional bridging module Block is fed into the next bidirectional bridging module Block, the parameters of the convolutional neural network module are adjusted; the parameters include the channels and the stride.
Priority Applications (1)
CN202311334063.6A | Priority date: 2023-10-13 | Filing date: 2023-10-13 | Image processing method based on double-branch parallel neural network
Publications (1)
CN117315366A | 2023-12-29
Family
ID=89237063
Family Applications (1)
CN202311334063.6A | Image processing method based on double-branch parallel neural network | Priority date: 2023-10-13 | Filing date: 2023-10-13
CN: application CN202311334063.6A filed 2023-10-13, published as CN117315366A, status Pending
Legal Events
PB01 | Publication
SE01 | Entry into force of request for substantive examination