CN115690549A - Target detection method for realizing multi-dimensional feature fusion based on parallel interaction architecture model - Google Patents


Info

Publication number
CN115690549A
Authority
CN
China
Prior art keywords
model
sampling
window
feature
vector
Prior art date
Legal status
Pending
Application number
CN202211420718.7A
Other languages
Chinese (zh)
Inventor
杜松林
谢昊
Current Assignee
Shenzhen Institute Of Southeast University
Southeast University
Original Assignee
Shenzhen Institute Of Southeast University
Southeast University
Priority date
Filing date
Publication date
Application filed by Shenzhen Institute Of Southeast University, Southeast University filed Critical Shenzhen Institute Of Southeast University
Priority to CN202211420718.7A priority Critical patent/CN115690549A/en
Publication of CN115690549A publication Critical patent/CN115690549A/en
Pending legal-status Critical Current

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses a target detection method for realizing multi-dimensional feature fusion based on a parallel interactive architecture model. The method solves the problems of slow convergence and long training time in traditional target detection tasks, and improves both the detection precision and the detection speed of the target detection task.

Description

Target detection method for realizing multi-dimensional feature fusion based on parallel interaction architecture model
Technical Field
The invention belongs to the field of target detection in computer vision, and particularly provides a target detection method for realizing feature fusion based on a parallel interaction architecture model.
Background
Object detection is a long-standing and fundamental task in the field of computer vision whose main purpose is to predict the location and class of instances in an image. As a basis for many visual tasks, including instance segmentation and object tracking, object detection is of great research significance in the field of image vision. In recent years, with the rise of practical applications such as autonomous driving and industrial defect detection, object detection has attracted increasing attention from industry. The core challenges of object detection are how to make the detection network fully learn the spatial and semantic information of the image from the input features, and how to accurately locate and classify instances from this information. An object detector therefore needs strong feature fusion capability and sufficient spatial sensitivity. Most traditional deep learning detection models are based on Convolutional Neural Networks (CNNs). A CNN fully fuses local features in the image through convolution operations, and this sensitive local spatial perception makes CNNs among the networks best suited to object detection; however, CNNs have a clear limitation in that their feature fusion capability over the global space is deficient. Traditional CNN-based object detection models are generally divided into anchor-based and anchor-free methods according to how they locate objects; the former uses anchors to predict potential objects, while the latter typically detects objects based on center points. Anchor-based models can be further divided into one-stage and two-stage methods according to the detection procedure; classical one-stage models include the YOLO series, SSD and RetinaNet, while two-stage models are represented by the R-CNN series. The two-stage method first searches for potential target regions and then computes classification scores on those regions, i.e., it localizes first and classifies second, whereas the one-stage method directly generates detection boxes in one step to predict the category and position of the object. CNN-based models face two key issues: how to assign anchors to ground-truth labels, and how to make the model effectively learn key semantic information from features. Models designed to solve these two problems still have significant drawbacks; for example, they rely on hand-crafted design under certain prior conditions, and designing suitable priors such as anchors and thresholds for different detection methods is a difficult task. On the other hand, the global feature interaction capability of CNNs is weak due to the limited size of the convolution kernel.
In recent years, the advent of the Vision Transformer (ViT), the DEtection TRansformer (DETR) and its variants has sparked a wave of applying Transformers to object detection. These new object detection paradigms discard the traditional CNN and replace it with a carefully designed multi-layer encoder-decoder architecture; the encoder is used to fuse features, while the decoder uses object queries to decouple the rich semantics in the features. Compared with CNNs, ViT emphasizes semantic association across the global space and fuses global spatial features through a global self-attention mechanism. DETR treats object detection as a set prediction task: a fixed number of object queries are matched to the ground truth during training, which eliminates the label assignment of traditional models, and at inference time the network predicts objects directly from the object queries. In addition, for object localization, DETR enhances the position sensitivity of the model by using positional embeddings; however, DETR detectors suffer from slow network convergence and heavy dependence on computational resources.
Disclosure of Invention
To solve the above problems, and drawing on the latest ideas from other fields of deep learning, the invention provides a method for realizing feature fusion based on a parallel interactive architecture model, aiming to provide advanced feature fusion capability for the model. First, in terms of feature extraction, 3D feature space window sampling, different from traditional CNNs, is introduced to fully extract local and global spatial features; subsequently, the invention provides a multi-dimensional feature fusion network CFFN, which enables the model to deeply fuse image features in the spatial and channel dimensions, so that the model learns semantic information better, thereby achieving a better detection effect and higher detection precision.
In order to achieve the purpose, the invention provides the following technical scheme: a target detection method for realizing multi-dimensional feature fusion based on a parallel interactive architecture model comprises the following steps:
Step 1: preparing the COCO2017 data set required for model training; configuring the COCO2017 data set on a server and placing it in the training folder in the required format;
Step 2: building the model under the mmdetection framework, and configuring the PyTorch deep learning environment required for training;
Step 3: setting the training hyper-parameters, and inputting the data set into the end-to-end target detection model of the parallel interaction architecture for training;
Step 4: the model sends the input image into ResNet50 for feature extraction, outputs multi-scale feature maps, and constructs the 3D feature sampling space from the multi-scale feature maps;
Step 5: a set of prediction vector object queries is generated, each containing a content vector and a position vector; for each object query, a sampling offset is generated through a feedforward neural network, and the initial sampling points of the model are generated by taking the position vector of the object query as the initial coordinates and combining it with the sampling offset;
Step 6: forming a local sampling window from the initial sampling point and its eight adjacent points in the sampling space, interpolating the points in the window to obtain the window features, and then flattening the window;
Step 7: the obtained feature matrix is sent to the feature fusion network CFFN, which consists of the unidirectional parallel interactive structure (PSUI) and an inter-group self-attention layer; this layer realizes the full fusion of features in the spatial and channel dimensions;
Step 8: the fully fused features are sent into the Adaptive Mixing decoding layer for feature decoupling;
Step 9: the final output of the decoding layer sequentially updates the content vector and the position vector of the object query through two feed-forward neural networks (FFNs), and the content vector and the position vector then predict the category and position of the target to be detected through two further FFNs;
Step 10: after model training is finished, the precision of the model can be verified, and the trained model file can be used to generate detection boxes for an input test picture, detecting the category and position of the objects to be detected in it.

In step 4, the model performs feature pre-extraction on the input image using the classical CNN backbone network ResNet50 to obtain feature maps at four different scales. If the dimension of the input image is H_0 × W_0 × 3, the output multi-scale feature maps have progressively smaller spatial resolutions, the i-th map having D_i feature channels. When the 3D feature space is constructed, the channel number of each scale feature map is normalized to a uniform value D_feat.

In step 5, the prediction vector object query is composed of a content vector and a position vector, where the position vector holds the coordinates (x, y, z, r) of the object query.
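As a non-limiting illustration, the following PyTorch sketch shows how the initial sampling points of step 5 could be generated from the object queries; the module name QuerySampler, the dimensions d_feat and n_points, and the tensor shapes are assumptions for illustration rather than the patent's actual implementation.

```python
import torch
import torch.nn as nn

class QuerySampler(nn.Module):
    """Sketch: generate initial 3D sampling points from object queries.

    Hypothetical module; names (d_feat, n_points) and shapes are assumptions,
    not taken from the patent's actual implementation.
    """
    def __init__(self, d_feat=256, n_points=32):
        super().__init__()
        # Feed-forward network that predicts per-query (x, y, z) sampling offsets
        self.offset_ffn = nn.Linear(d_feat, n_points * 3)

    def forward(self, content_vec, pos_vec):
        # content_vec: (B, N, d_feat)  semantic part of each object query
        # pos_vec:     (B, N, 4)       (x, y, z, r) with r the aspect ratio
        B, N, _ = content_vec.shape
        offsets = self.offset_ffn(content_vec).view(B, N, -1, 3)
        # Initial sampling points: query position (x, y, z) plus predicted offsets
        base = pos_vec[..., :3].unsqueeze(2)          # (B, N, 1, 3)
        return base + offsets                         # (B, N, n_points, 3)
```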
In step 6, as shown in Fig. 2, the initial sampling point from step 5 and its eight adjacent points form a local sampling window in the feature sampling space, and the sampling points in the window are then interpolated to obtain the sampling feature matrix x ∈ R^{G×W×P×C} (taking one object query as an example), where G denotes the sampling groups, W and P denote the number of sampling windows and the number of points per window respectively, and C denotes the number of feature channels. The window features are obtained as

x_i = Interpolation(coordinate(i)), i = 1, 2, ..., S,

where S is the size of the local window, i indexes the sampling points in the local window, coordinate(i) gives the coordinates of the i-th sampling point, and Interpolation denotes the interpolation operation.
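A minimal sketch of this window sampling is given below, assuming the 3D sampling space is stored as a single (B, C, D, H, W) tensor and interpolation is done with torch.nn.functional.grid_sample; the window layout (a 3×3 neighbourhood in the x-y plane) and the offset step are assumptions.

```python
import torch
import torch.nn.functional as F

def sample_local_window(feature_space, centers, step=0.05):
    """Sketch of step 6: build a 3x3 local window (the initial point plus its
    eight neighbours) around each sampling point and interpolate features there.

    feature_space: (B, C, D, H, W) 3D feature sampling space (assumed layout)
    centers:       (B, N, 3) initial sampling points in normalized [-1, 1] coords
    step:          neighbour offset in normalized grid units (assumption)
    Returns window features of shape (B, N, 9, C).
    """
    B, C, D, H, W = feature_space.shape
    dx, dy = torch.meshgrid(torch.tensor([-step, 0.0, step]),
                            torch.tensor([-step, 0.0, step]), indexing='ij')
    offsets = torch.stack([dx.flatten(), dy.flatten(),
                           torch.zeros(9)], dim=-1).to(centers)      # (9, 3)
    pts = centers.unsqueeze(2) + offsets.view(1, 1, 9, 3)            # (B, N, 9, 3)
    # grid_sample expects a (B, d, h, w, 3) grid with (x, y, z) ordering
    grid = pts.view(B, -1, 1, 1, 3)
    feats = F.grid_sample(feature_space, grid, align_corners=False)  # (B, C, N*9, 1, 1)
    return feats.view(B, C, -1, 9).permute(0, 2, 3, 1)               # (B, N, 9, C)
```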
In step 7, the CFFN comprises a unidirectional interactive parallel structure (PSUI) and inter-group self-attention. The PSUI consists of a left branch, a right branch, and a unidirectional interaction network connecting them; details of the PSUI are shown in Fig. 3:
(1) The left branch performs self-attention among windows to realize local feature fusion. In the inter-window self-attention operation, V_w is the result of the dot product of the feature matrix with the factor carrying channel semantic weights, while Q_w and K_w are obtained from the feature matrix through feedforward neural networks; performing local self-attention over the different points in a window fully fuses the local features between adjacent points in the same window. The specific operations are:

Q_w, K_w = FFN_1(x), FFN_2(x),
V_w = x ⊙ factor,
Attention(Q_w, K_w, V_w) = softmax(Q_w K_w^T / √d_k) · V_w,

where Q_w, K_w, V_w are the three matrices of the self-attention operation, factor is the interaction factor containing channel weights generated by the unidirectional interaction network, and d_k is a scaling factor.
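A compact sketch of this modulated window self-attention follows; the class name, the per-group tensor layout (G, W, P, C) and the broadcasting of factor are assumptions used only to make the formulas above concrete.

```python
import math
import torch
import torch.nn as nn

class WindowSelfAttention(nn.Module):
    """Sketch of the PSUI left branch: self-attention over the P points of each
    sampling window, with V_w modulated by the channel-weight factor coming from
    the right branch. Shapes and names are assumptions.
    """
    def __init__(self, channels):
        super().__init__()
        self.ffn_q = nn.Linear(channels, channels)   # FFN_1 -> Q_w
        self.ffn_k = nn.Linear(channels, channels)   # FFN_2 -> K_w
        self.d_k = channels

    def forward(self, x, factor):
        # x:      (G, W, P, C) sampled window features
        # factor: channel semantic weights from the interaction net (broadcastable to x)
        q, k = self.ffn_q(x), self.ffn_k(x)
        v = x * factor                                # V_w = x ⊙ factor
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        return attn @ v                               # (G, W, P, C)
```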
(2) The right branch first performs a dimension conversion on the local-window feature matrix, so that the converted matrix is x ∈ R^{G×C×W×P}; the last two dimensions then form a feature map, in which the horizontal direction represents the points inside one window and the vertical direction represents the different windows. The right branch first applies a depthwise convolution with a 9×5 kernel to fuse features among windows, realizing global feature interaction, and then realizes semantic fusion over the channel dimension through a pointwise convolution. The final output of the right branch is converted back to the original dimensions so that it can be concatenated with the left branch.
(3) The direction of the unidirectional interactive connection is from right to left: the depthwise convolution output of the right branch is processed by the interaction network to obtain the factor containing channel semantic weights, which is fed to the left branch to participate in its self-attention operation.
(4) The final results of the left and right branches retain only the features of the initial sampling point along the dimension P; the matrix is then reshaped to x ∈ R^{G×W×C}, and after concatenation a layer of FFN keeps the dimension as x ∈ R^{G×W×C}.
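The right branch and the right-to-left interaction can be sketched as below; treating the (windows, points) layout as a 2D map for the 9×5 depthwise convolution follows the description above, while the sigmoid gating and the pooling used to produce factor are assumptions.

```python
import torch
import torch.nn as nn

class RightBranch(nn.Module):
    """Sketch of the PSUI right branch: the (W, P) window/point layout is treated
    as a 2D map, fused across windows by a 9x5 depthwise convolution and across
    channels by a pointwise convolution. The depthwise output also feeds the
    unidirectional interaction network that produces `factor` for the left branch.
    Module names and the gating form are assumptions.
    """
    def __init__(self, channels):
        super().__init__()
        self.dw = nn.Conv2d(channels, channels, kernel_size=(9, 5),
                            padding=(4, 2), groups=channels)    # depthwise, window x point axes
        self.pw = nn.Conv2d(channels, channels, kernel_size=1)  # pointwise, channel fusion
        self.interact = nn.Linear(channels, channels)           # unidirectional interaction net

    def forward(self, x):
        # x: (G, W, P, C) -> (G, C, W, P) for convolution
        g, w, p, c = x.shape
        h = self.dw(x.permute(0, 3, 1, 2))
        # factor: channel semantic weights handed to the left branch (right -> left)
        factor = torch.sigmoid(self.interact(h.mean(dim=(2, 3)))).view(g, 1, 1, c)
        out = self.pw(h).permute(0, 2, 3, 1)                     # back to (G, W, P, C)
        return out, factor
```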
In step 7, the CFFN comprises the unidirectional interactive parallel structure (PSUI) and inter-group self-attention; the details of the inter-group self-attention are as follows. To reduce the computational cost of the network and accelerate detection and training, the model splits the D_feat channels into four groups and performs interpolated sampling within each group to reduce the matrix size along that dimension. To compensate for the cross-group channel interactions lost by this operation, the model designs an inter-group self-attention component that fuses features across the different channel groups. The formulas are:

Q_g, K_g = FFN_3(x), FFN_4(x),
Attention_g(x) = softmax(Q_g K_g^T / √d_k) · x,

where Q_g and K_g are the Query and Key matrices of the self-attention operation and d_k is a scaling factor.

In step 9, the output of the decoder is converted through an FFN to the same dimension as the content vector of the object query, thereby updating the content vector; the detection head then uses the content vector and the position vector to predict the category and position of the candidate box through separate FFNs.
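A sketch of the inter-group self-attention follows; the (N, G, C) layout, which places the four channel groups on their own axis, is an assumption chosen to make attention across groups explicit.

```python
import math
import torch
import torch.nn as nn

class InterGroupAttention(nn.Module):
    """Sketch of the inter-group self-attention: D_feat channels are split into
    four groups, and attention across the group axis compensates for the
    cross-group channel interactions lost by grouping. Names are assumptions.
    """
    def __init__(self, channels, n_groups=4):
        super().__init__()
        self.ffn_q = nn.Linear(channels, channels)   # FFN_3 -> Q_g
        self.ffn_k = nn.Linear(channels, channels)   # FFN_4 -> K_g
        self.d_k, self.n_groups = channels, n_groups

    def forward(self, x):
        # x: (N, G, C) per-query features with G channel groups
        q, k = self.ffn_q(x), self.ffn_k(x)
        attn = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        return attn @ x                               # fuse features across groups
```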
During the training of the invention, the model predicts a fixed-size set of N candidate boxes from the input training image through a fixed number (N) of object queries, where N is usually much larger than the actual number of objects of interest in the image; an additional special class label ∅ is therefore used to indicate that no object is detected.
Throughout training, the parallel interactive architecture target detection model adopts one-to-one label assignment: each prediction box must be matched with a ground-truth bounding box, and the model uses the Hungarian algorithm to realize the optimal bipartite matching between real objects and predicted objects, i.e., it finds the matching σ̂ that minimizes the total matching cost L_match:

σ̂ = argmin_{σ ∈ Θ_N} Σ_{i=1}^{N} L_match(y_i, ŷ_σ(i)),

where σ is a matching between the ground truth and the prediction boxes, Θ_N denotes the set of possible matchings, y is the ground-truth set and ŷ is the set of N prediction boxes; if the number of boxes in y is less than N, it is padded with ∅. L_match(y_i, ŷ_σ(i)) is the matching cost between a ground-truth value and the prediction with index σ(i), and consists of a classification loss L_cls and a box loss L_box, where L_box in turn comprises an IoU loss L_iou and an l_1 loss L_1.
Each element y_i in the ground-truth set y consists of c_i and b_i, where c_i is the category of the object in the box and b_i is a position vector defining the center coordinates and size of the real box. For the prediction with index σ(i), the invention defines the probability that it belongs to class c_i as p̂_σ(i)(c_i) and its prediction box as b̂_σ(i). L_match can then be expressed as:

L_match(y_i, ŷ_σ(i)) = −1_{c_i≠∅} · p̂_σ(i)(c_i) + 1_{c_i≠∅} · L_box(b_i, b̂_σ(i)).
the loss function in the training process of the invention is the Hungarian loss of all pairs in the matching:
Figure BDA0003940207560000058
wherein
Figure BDA0003940207560000059
Is the optimal match.
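The one-to-one assignment described above can be sketched with SciPy's Hungarian solver; the helper below is an illustrative assumption (box_cost_fn stands in for the combined IoU + l1 box cost, whose weighting is not taken from the patent).

```python
import torch
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_logits, pred_boxes, gt_labels, gt_boxes, box_cost_fn):
    """Sketch of the one-to-one label assignment: build the matching cost
    L_match = -p(c_i) + L_box for every (prediction, ground-truth) pair and solve
    the optimal assignment with the Hungarian algorithm.

    pred_logits: (N, num_classes)  pred_boxes: (N, 4)
    gt_labels:   (M,)              gt_boxes:   (M, 4)
    """
    prob = pred_logits.softmax(-1)                      # class probabilities
    cls_cost = -prob[:, gt_labels]                      # (N, M): -p_{sigma(i)}(c_i)
    box_cost = box_cost_fn(pred_boxes, gt_boxes)        # (N, M): IoU + l1 terms
    cost = (cls_cost + box_cost).detach().cpu().numpy()
    pred_idx, gt_idx = linear_sum_assignment(cost)      # minimise total L_match
    return torch.as_tensor(pred_idx), torch.as_tensor(gt_idx)
```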
After the 12 epochs of step 10 are completed, the trained model is saved as a .pt file; the trained model file can then be used to verify the model's precision and to run detection on pictures.
The specific method for detecting the picture in the step 10 is as follows:
the trained model file can be loaded by using a network to detect the object in the image; and running a detect code, setting a detection model as a pt file after training, and setting an input picture directory as a folder where the picture to be detected is located. And starting detection, inputting data to be detected into a trained model for image recognition and positioning, and outputting a plurality of prediction frames containing the positions and the classes of the potential objects in the picture by the model.
The present invention has the following advantages over the prior art. In terms of method: first, instead of using a traditional CNN for further feature extraction, spatial features are obtained by constructing a 3D sampling space and performing window sampling within it, which enriches feature extraction in the spatial dimension and strengthens the localization ability of the model; second, to improve feature fusion quality, the CFFN network structure is designed to fuse the extracted features in the spatial and channel dimensions. The CFFN combines convolution and self-attention operations, using different methods for feature fusion along different dimensions; this design greatly enriches the semantic information and contributes to improving model precision. At the application level, the invention achieves an AP of 43.0 in the example under a training schedule of 12 epochs, which is superior to many detection methods; at the same time, it discards the prior knowledge required by traditional CNN detection networks and improves training speed, thereby alleviating the problems of slow convergence and long training time in target detection tasks and improving both the precision and the detection speed of the target detection task.
Drawings
FIG. 1 is a diagram of the network architecture of the present invention;
FIG. 2 is a schematic view of a sampling local window of the present invention;
fig. 3 is a schematic diagram of the PSUI structure of the present invention.
Detailed Description
The following detailed description of the embodiments of the invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention. All other embodiments, which can be obtained by a person skilled in the art without making any creative effort based on the embodiments in the present invention, belong to the protection scope of the present invention.
Example 1: aiming at the traditional two-stage target detection method at present, the invention provides a 2D image target detection method for realizing multi-dimensional feature fusion based on a parallel interactive architecture model.
In this embodiment, the COCO2017 data set is used as experimental data; with data augmentation, object localization and classification are realized by an end-to-end parallel interactive target detection model based on a ResNet50 backbone network, a 3D feature space window sampling scheme, the feature fusion network CFFN, and a detection head comprising two feed-forward networks.
Step 1: preparing a COCO2017 data set required by model training; configuring a COCO2017 data set in a server, and putting the COCO2017 data set into a training folder according to a required format;
step 1.1, a public data set COCO2017 is obtained in a COCO official website, and pictures and labels of a training set and a verification set which are divided by an official part are downloaded.
Step 1.2, the COCO2017 data set is composed of a training set image, an annotation file, a verification set image and an annotation file, wherein the training set image, the verification set image and the annotation are respectively placed in a train2017 folder, a val2017 folder and an indications folder.
And 2, step: building a model, and configuring a PyTorch deep learning environment required by training;
step 2.1, creating a virtual environment for the project in anaconda, and installing key packages such as the pytorch 1.11.0 and the like required by the training environment in the virtual environment. The video card used by the training server is NVIDIA RTX 3090GPU, the operating system is Ubuntu20.04, the CUDA version is 11.3, and the compiling language is Python 3.8.
And 2.2, installing and configuring an mmdetection framework and a required mmcv compiling package, and installing other dependent packages required by the training script.
And step 3: setting training hyper-parameters, and inputting a data set into an end-to-end target detection model of a parallel interaction architecture for training;
step 3.1, the hyper-parameters of the training part are as follows: the feature extraction backbone is ResNet50, the initial learning rate is 0.000125, the batch size is 4, and the epoch number is 12.
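For reference, a hypothetical excerpt of an mmdetection-style config reflecting these hyper-parameters is shown below; the field names follow common mmdetection 2.x conventions, and the exact keys used by this model's actual config file are assumptions.

```python
# Hypothetical mmdetection 2.x config excerpt matching the listed hyper-parameters.
optimizer = dict(type='AdamW', lr=0.000125, weight_decay=0.0001)
optimizer_config = dict(grad_clip=dict(max_norm=1.0, norm_type=2))
data = dict(samples_per_gpu=4, workers_per_gpu=2)      # batch size 4
lr_config = dict(policy='step', step=[8, 11])          # standard 1x schedule (assumption)
runner = dict(type='EpochBasedRunner', max_epochs=12)  # 12 epochs
```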
Step 4: the model sends the input image into ResNet50 for feature extraction, outputs multi-scale feature maps, and then constructs the 3D feature sampling space from the multi-scale feature maps;
Step 5: a set of prediction vector object queries is generated, each containing a content vector and a position vector; for each object query, a sampling offset is generated through a feedforward neural network, and the initial sampling points of the model are generated by taking the position vector of the object query as the initial coordinates and combining it with the sampling offset;
Step 6: forming a local sampling window from the initial sampling point and its eight adjacent points in the sampling space, interpolating the points in the window to obtain the window features, and then flattening the window;
Step 7: the obtained feature matrix is sent to the feature fusion network CFFN, which consists of the unidirectional parallel interactive structure (PSUI) and an inter-group self-attention layer; this layer realizes the full fusion of features in the spatial and channel dimensions;
Step 8: the fully fused features are sent into the Adaptive Mixing decoding layer for feature decoupling;
Step 9: the final output of the decoding layer sequentially updates the content vector and the position vector of the object query through two feed-forward neural networks (FFNs), and the content vector and the position vector then predict the category and position of the object to be detected through two further FFNs;
Step 10: after model training is finished, the precision of the model can be verified, and the trained model file can be used to generate detection boxes for an input test picture, detecting the category and position of the objects to be detected in it.
Step 10.1: after the 12 epochs are completed, the trained model is saved as a .pt file. The model precision can then be verified from the training weights using val.py: the val2017 data set mentioned in step 1 is input into the model to evaluate the precision of the trained model. Six common precision indices are used: AP, AP_50, AP_75, AP_S, AP_M and AP_L. In the final verification result of this embodiment, the AP is 43.0, which is superior to most methods under the same experimental conditions.
Step 10.2: the trained model file can be loaded by the detection network to detect objects in an image: run the detect code, set the detection model to the trained .pt file, and set the input picture directory to the folder containing the pictures to be detected. After this is set, detection can start: the data to be detected are fed into the trained model for image recognition and localization, and the model outputs a set of prediction boxes containing the positions and classes of the potential objects in the picture.
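As an alternative to a standalone detect script, the same step could plausibly be run with the mmdetection Python API, as in the hedged sketch below; the config and checkpoint paths are placeholders, not the patent's actual files.

```python
# Minimal sketch of step 10.2 using the mmdetection 2.x Python API.
from mmdet.apis import init_detector, inference_detector, show_result_pyplot

model = init_detector('configs/parallel_interaction.py',   # hypothetical config path
                      'work_dirs/latest.pt', device='cuda:0')
result = inference_detector(model, 'demo/test.jpg')         # prediction boxes + scores
show_result_pyplot(model, 'demo/test.jpg', result, score_thr=0.3)
```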

Claims (6)

1. A target detection method for realizing multi-dimensional feature fusion based on a parallel interactive architecture model is characterized by comprising the following steps:
Step 1: preparing the COCO2017 data set required for model training; configuring the COCO2017 data set on a server and placing it in the training folder in the required format;
Step 2: building the model under the mmdetection framework, and configuring the PyTorch deep learning environment required for training;
Step 3: setting the training hyper-parameters, and inputting the data set into the end-to-end target detection model of the parallel interaction architecture for training;
Step 4: the model sends the input image into ResNet50 for feature extraction, outputs multi-scale feature maps, and then constructs the 3D feature sampling space from the multi-scale feature maps;
Step 5: generating a set of prediction vector object queries containing content vectors and position vectors, generating for each object query a sampling offset through a feedforward neural network, and generating the initial sampling points of the model by taking the position vector as the initial coordinates and combining it with the sampling offset;
Step 6: forming a local sampling window from the initial sampling points and their eight adjacent points in the sampling space, interpolating the points in the window to obtain the window features, and then flattening the window;
Step 7: the obtained feature matrix is sent to the feature fusion network CFFN, which consists of the unidirectional parallel interactive structure (PSUI) and an inter-group self-attention layer; this layer realizes the full fusion of features in the spatial and channel dimensions;
Step 8: the fully fused features are sent into the Adaptive Mixing decoding layer for feature decoupling;
Step 9: the final output of the decoding layer sequentially updates the content vector and the position vector of the object query through two feed-forward neural networks (FFNs), and the content vector and the position vector then predict the category and position of the object to be detected through two further FFNs;
Step 10: after model training is finished, the precision of the model can be verified, and the trained model file can be used to generate detection boxes for an input test picture, detecting the category and position of the objects to be detected in it.
2. The target detection method for realizing multi-dimensional feature fusion based on the parallel interactive architecture model of claim 1, wherein in step 4: the target detection model uses the classical CNN backbone network ResNet50 to perform feature pre-extraction on the input image, obtaining four feature maps of different scales; if the dimension of the input image is H_0 × W_0 × 3, the output multi-scale feature maps have progressively smaller spatial resolutions, the i-th map having D_i feature channels. In addition, when the 3D feature space is constructed, the channel numbers of all scale feature maps are normalized to a uniform value D_feat, where D_i is the number of feature channels of the i-th layer feature map and H_0, W_0 are the input image height and width.
3. The target detection method for realizing multi-dimensional feature fusion based on the parallel interactive architecture model of claim 1, wherein in step 5: the prediction vector object query is composed of a content vector and a position vector, wherein the position vector represents the initial coordinates (x, y, z, r) of the object query, where r is the aspect ratio.
4. The target detection method for realizing multi-dimensional feature fusion based on the parallel interactive architecture model of claim 1, wherein in step 6: the initial sampling points from step 5 and their eight adjacent points form a local sampling window in the feature sampling space, and the sampling points in the window are then interpolated to obtain the sampling feature matrix x ∈ R^{G×W×P×C}, where G denotes the sampling groups, W and P denote the number of sampling windows and the number of points per window respectively, and C denotes the number of feature channels.
5. The target detection method for realizing multi-dimensional feature fusion based on the parallel interactive architecture model of claim 1, wherein in step 7: the CFFN comprises the unidirectional interactive parallel structure (PSUI) and inter-group self-attention, with the details as follows:
(1) The PSUI consists of a left branch, a right branch, and a right-to-left unidirectional interaction network connecting them;
(2) The left branch of the CFFN realizes local feature fusion using inter-window self-attention, where V_w is the result of the dot product of the window feature matrix with the factor carrying channel semantic weights, and Q_w, K_w are obtained from the feature matrix through different feedforward neural networks; local self-attention over the different points in a window fully fuses the local features between adjacent points in the same window;
(3) The right branch first performs a dimension conversion on the local-window feature matrix, so that the converted matrix is x ∈ R^{G×C×W×P}; the last two dimensions then form a feature map, in which the horizontal direction represents the points inside one window and the vertical direction represents the different windows. The right branch first applies a depthwise convolution with a 9×5 kernel to fuse features among windows, realizing global feature interaction, and then realizes semantic fusion over the channel dimension through a pointwise convolution; the final output of the right branch is converted back to the original dimensions so that it can be concatenated with the left branch;
(4) The direction of the unidirectional interactive connection is from right to left: the depthwise convolution output of the right branch is processed by the interaction network to obtain the factor containing channel semantic weights, which is fed to the left branch to participate in its self-attention operation;
(5) The final results of the left and right branches retain only the features of the initial sampling point along the dimension P; the matrix is then reshaped to x ∈ R^{G×W×C}, and after concatenation a layer of FFN keeps the dimension as x ∈ R^{G×W×C};
(6) To reduce the computational cost of the network and accelerate detection and training, the model splits the D_feat channels into four groups and performs interpolated sampling within each group to reduce the matrix size along that dimension; to compensate for the cross-group channel interactions lost by this operation, the model designs an inter-group self-attention component that fuses features across the different channel groups.
6. The target detection method for realizing multi-dimensional feature fusion based on the parallel interactive architecture model of claim 1, wherein in step 9:
(1) The output of the decoder is converted through an FFN to the same dimension as the content vector of the object query, completing the update of the content vector; the position vector is then updated through another FFN converted to the same dimension as the position vector;
(2) The detection head uses the content vector and the position vector to predict the category and position of the candidate box through separate FFNs.
CN202211420718.7A 2022-11-12 2022-11-12 Target detection method for realizing multi-dimensional feature fusion based on parallel interaction architecture model Pending CN115690549A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211420718.7A CN115690549A (en) 2022-11-12 2022-11-12 Target detection method for realizing multi-dimensional feature fusion based on parallel interaction architecture model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211420718.7A CN115690549A (en) 2022-11-12 2022-11-12 Target detection method for realizing multi-dimensional feature fusion based on parallel interaction architecture model

Publications (1)

Publication Number Publication Date
CN115690549A true CN115690549A (en) 2023-02-03

Family

ID=85052450

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211420718.7A Pending CN115690549A (en) 2022-11-12 2022-11-12 Target detection method for realizing multi-dimensional feature fusion based on parallel interaction architecture model

Country Status (1)

Country Link
CN (1) CN115690549A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116071773A (en) * 2023-03-15 2023-05-05 广东电网有限责任公司东莞供电局 Method, device, medium and equipment for detecting form in power grid construction type archive
CN117058646A (en) * 2023-10-11 2023-11-14 南京工业大学 Complex road target detection method based on multi-mode fusion aerial view
CN117058646B (en) * 2023-10-11 2024-02-27 南京工业大学 Complex road target detection method based on multi-mode fusion aerial view


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination