CN114283326A - Underwater target re-identification method combining local perception and high-order feature reconstruction

Info

Publication number: CN114283326A
Application number: CN202111582065.8A
Authority: CN (China)
Prior art keywords: target, context, characteristic, network, feature
Legal status: Pending (the legal status is an assumption and is not a legal conclusion)
Other languages: Chinese (zh)
Inventors: Fu Xianping (付先平), Jiang Guangqi (蒋广琪), Yao Mingze (姚铭泽), Peng Jinjia (彭锦佳), Wang Huibing (王辉兵)
Current and original assignee: Dalian Maritime University
Priority/filing date: 2021-12-22
Publication date: 2022-04-05

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an underwater target re-identification method combining local perception and high-order feature reconstruction. Underwater target feature processing is performed on acquired images with the cross-stage network of a target detection algorithm model to obtain an underwater target feature map. The feature map is mapped and scale-pooled to obtain a feature map carrying a feature matrix, which is fed into a path aggregation network and predicted on to generate target objects. The target objects are sampled and their image features extracted through a residual network to obtain a three-dimensional tensor and vertical stripes; the stripes are average-pooled to generate a predicted image; tensor reconstruction on the predicted image identifies context information segments, from which context features are obtained by synthesis, element-wise multiplication and weighted averaging; the context features are then average-pooled and optimized with minimized cross-entropy loss to obtain the prediction result. Tensor reconstruction extracts a feature map of higher-order feature information, which is used in prediction to obtain a re-identification result with stronger robustness and higher precision.

Description

Underwater target re-identification method combining local perception and high-order feature reconstruction
Technical Field
The invention relates to the field of underwater target identification, in particular to an underwater target re-identification method combining local perception and high-order feature reconstruction.
Background
Existing underwater target recognition methods mainly extract underwater target features for recognition and detection tasks using conventional target recognition algorithms. However, changes in an underwater robot's navigation angle, speed and track cause the photographed target object to be affected by deformation, viewing angle, contrast and the like. This poses many challenges for underwater target re-identification, in which a vision-guided underwater robot must recognize the same target object from different angles. Current target re-identification work mainly addresses recognizing the same target across different cameras on land. Existing underwater target recognition algorithms cannot perceive the robot's current position or repeatedly recognize the same target object when the robot's attitude and route change; the changing track during navigation alters the scale of the captured images, which ultimately degrades recognition accuracy.
Disclosure of Invention
The invention provides an underwater target re-identification method combining local perception and high-order feature reconstruction, aiming to solve the technical problem that existing underwater target identification methods produce inaccurate results.
In order to achieve the purpose, the technical scheme of the invention is as follows:
an underwater target re-identification method combining local perception and high-order feature reconstruction comprises the following steps:
step 1, collecting images of the area traversed by the underwater robot;
step 2, performing underwater target feature processing on the acquired images by using the cross-stage network of a YOLOv4 target detection algorithm model to obtain an underwater target feature map;
step 3, mapping the underwater target feature map with the Mish activation function to obtain a target feature result map;
step 4, performing spatial pyramid pooling on the target feature result map through the target detection algorithm and concatenating the pooled results, thereby separating out a feature map carrying a feature matrix;
step 5, inputting the feature map with the feature matrix into a path aggregation network for feature fusion to obtain a fused feature map;
step 6, inputting the fused feature map into a prediction network, setting anchor boxes, and predicting the anchor boxes with a clustering algorithm to generate prediction boxes, where the prediction boxes contain the network output images and are used to detect target objects at different scales;
step 7, after batch-sampling the network output images, extracting their features through a residual network to obtain a three-dimensional tensor T;
step 8, average-pooling the three-dimensional tensor T to divide it into p vertical stripes, i.e., block-wise re-identification processing, and obtaining the activation vectors of the p stripes along the matrix channel axis of T, defining them as column vectors;
step 9, inputting the column vectors into a classifier for prediction to obtain a predicted image;
step 10, performing tensor-reconstruction re-identification processing on the predicted image with a residual network, extracting the matrix features of the processed predicted image, and generating features from the three views of channel, width and height using the matrix features to obtain context information segments;
step 11, repeating step 10 and synthesizing the obtained context information segments to obtain a context attention map representing three-dimensional context features; activating and aggregating the context attention map with element-wise products and weighted averaging to obtain fine-grained context features in the spatial and channel dimensions;
step 12, performing global average pooling on the fine-grained context feature map through a global average pooling layer to obtain a target feature map; concatenating the target feature map with the column vectors, optimizing the concatenation with minimized cross-entropy loss to obtain a jointly optimized feature map, and predicting on the jointly optimized feature map to obtain the prediction result.
Further, in step 5, the feature pyramid structure in the path aggregation network laterally connects the top-down and bottom-up paths to partition the feature map carrying the feature matrix; an adaptive pooling operation performs pooled extraction on each such feature map to obtain target features, and a fully connected fusion operation on the extracted target features yields the fused feature map.
Further, in step 6, the anchor boxes are predicted by using the global intersection over union (CIOU) as the prediction-box regression function to generate prediction boxes, where the global intersection over union is given by:
CIOU = IOU − ρ²(b, b^gt)/c² − αv
v = (4/π²) · (arctan(w^gt/h^gt) − arctan(w/h))²
α = v / ((1 − IOU) + v)
where ρ²(b, b^gt) denotes the squared Euclidean distance between the center points of the prediction box and the ground-truth box, c denotes the diagonal length of the minimum enclosing rectangle of the prediction box and the ground-truth box, IOU denotes the intersection over union of the prediction box and the ground-truth box, α denotes the influence factor of the parameter v, v denotes a parameter measuring the consistency of the aspect ratios, w denotes the width of the corresponding box, h denotes the height of the corresponding box, the superscript gt marks the ground-truth box, and CIOU denotes the global intersection over union.
Further, step 8 includes modifying the residual network to extract the depth features of the network output image and obtain the three-dimensional tensor T; specifically, modifying the residual network includes removing the global average pooling layer, the fully connected layer and the output layer of the convolutional neural network.
Further, in step 11, the context attention map is activated and aggregated with element-wise products and weighted averaging to obtain fine-grained context features in the spatial and channel dimensions, specifically:
step 11.1, repeating step 10 to obtain context information segments in different directions;
step 11.2, performing tensor reconstruction on the different context information segments to obtain sub-attention maps A_i in different directions;
step 11.3, synthesizing the sub-attention maps A_i in different directions to obtain the context attention map;
step 11.4, activating and aggregating the context attention map with element-wise products and weighted averaging, thereby obtaining fine-grained context features in the spatial and channel dimensions.
Further, the specific method in step 11.4 is as follows:
Y = (1/CHW) · Σ_{i=1..CHW} a_i · s_i
S = {s_1, s_2, ..., s_CHW}
A = {a_1, a_2, ..., a_CHW}
where S denotes the input feature matrix, A denotes the context attention map, Y denotes the fine-grained context feature, i indexes the i-th feature, CHW denotes the total number of features, s_i denotes the i-th input feature, and a_i denotes the i-th context attention value.
Beneficial effects: the invention uses a target detection network to obtain annotation information of the underwater target object for constructing the underwater target feature map. A local perception branch network performs block-wise re-identification processing on the underwater target, improving feature extraction performance, while a high-order reconstruction branch network reconstructs high-order discriminative features of the image features by tensor reconstruction, extracting a feature map of higher-order feature information. The cross-stage network extracts a feature map with more detailed features, the high-order reconstruction branch network extracts a feature map with more high-order feature information, and the two feature maps are concatenated for prediction, yielding a re-identification result with stronger robustness and higher precision.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a schematic flowchart of the underwater target re-identification method according to the present invention;
FIG. 2 is a diagram of a model architecture of the object detection algorithm of the present invention;
FIG. 3 is a diagram of a cross-phase network architecture in accordance with the present invention;
FIG. 4 is a diagram illustrating the dual-branch re-identification network according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment provides an underwater target re-identification method combining local perception and high-order feature reconstruction, as shown in FIGS. 1 to 4, comprising the following steps:
step 1, collecting images of the area traversed by the underwater robot;
step 2, performing underwater target feature processing on the acquired images by using the cross-stage network of a YOLOv4 target detection algorithm model: the stacked residual blocks are split into two parts, which are then combined through a cross-stage hierarchical structure to obtain a more robust underwater target feature map F_g;
The cross-stage network comprises multiple standard 3 × 3 convolutions for feature extraction and uses the Mish activation function, which prevents the training network from overfitting and yields more accurate results. Its formula is:
Mish(x) = x · tanh(ln(1 + e^x))
specifically, feature blocks are extracted for marine product objects in an image of a visible area of the underwater robot.
First the cross-phase network (CSPDarknet53) consists of multiple standard 3 × 3 convolutions, the Mish function, and the CSPNet. The purpose of the multiple standard 3 x 3 convolutions is to better extract the features of the target, after the features are extracted, the CSPNet is used to divide the feature mapping of the basic layer into two parts, and then the two parts are combined through a cross-stage hierarchical structure to extract the target feature graph F with strong robustnessg. The global features are then used for input, training and final optimization of the next feature pyramid network.
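By way of illustration only, a minimal PyTorch sketch of this split-and-merge idea: a CSP-style stage with the Mish activation. The module layout, channel split and block count are assumptions for the sketch (plain conv-BN-Mish blocks stand in for the stacked residual blocks), not the patent's exact CSPDarknet53 configuration:

```python
import torch
import torch.nn as nn

class Mish(nn.Module):
    # Mish(x) = x * tanh(ln(1 + e^x)); softplus(x) = ln(1 + e^x).
    def forward(self, x):
        return x * torch.tanh(nn.functional.softplus(x))

class CSPStage(nn.Module):
    # Split the incoming feature map into two halves along the channel axis,
    # send one half through stacked conv blocks, then merge the two halves
    # through a transition convolution (cross-stage partial connection).
    def __init__(self, channels: int, num_blocks: int = 2):
        super().__init__()
        half = channels // 2
        self.blocks = nn.Sequential(*[
            nn.Sequential(
                nn.Conv2d(half, half, 3, padding=1, bias=False),
                nn.BatchNorm2d(half),
                Mish(),
            )
            for _ in range(num_blocks)
        ])
        self.transition = nn.Sequential(
            nn.Conv2d(channels, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
            Mish(),
        )

    def forward(self, x):
        part1, part2 = torch.chunk(x, 2, dim=1)   # cross-stage split
        part2 = self.blocks(part2)                # processed path
        return self.transition(torch.cat([part1, part2], dim=1))  # merge

out = CSPStage(64)(torch.randn(1, 64, 32, 32))
print(out.shape)  # torch.Size([1, 64, 32, 32])
```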
Step 3, using Mish activation function to perform characteristic diagram F on underwater targetgMapping processing is carried out to obtain a more accurate and stable target characteristic result graph;
step 4, performing spatial pyramid pooling on the target feature result map through the target detection algorithm and concatenating the pooled results, thereby separating out a feature map F_f carrying a feature matrix;
Four pooling kernels of different sizes are used to max-pool the feature map separately, giving pooled results F_m1, F_m2, F_m3 and F_m4. The four pooled results are then concatenated to improve the discriminability and comprehensiveness of the cross-stage network features and to clearly separate the more important context features, where the concatenation is:
F_f = concat(F_m1, F_m2, F_m3, F_m4)
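A minimal PyTorch sketch of this spatial pyramid pooling step. The kernel sizes (identity plus 5/9/13) follow common YOLOv4 practice and are an assumption here; the patent only specifies four pooling kernels of different sizes:

```python
import torch
import torch.nn as nn

class SPP(nn.Module):
    # Max-pool the same feature map with kernels of several sizes (stride 1,
    # padded so the spatial size is preserved) and concatenate the results
    # along the channel axis: Ff = concat(Fm1, Fm2, Fm3, Fm4).
    def __init__(self, kernel_sizes=(1, 5, 9, 13)):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in kernel_sizes
        )

    def forward(self, x):
        return torch.cat([pool(x) for pool in self.pools], dim=1)

# Example: a 512-channel map becomes 4 * 512 = 2048 channels.
f = torch.randn(1, 512, 13, 13)
print(SPP()(f).shape)  # torch.Size([1, 2048, 13, 13])
```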
step 5, inputting the feature map F_f carrying the feature matrix into the path aggregation network for feature fusion; specifically, the feature map is segmented, extracted and fused, the weights of the more informative channel features are increased and the weights of unimportant features reduced, producing the fused feature map;
step 6, inputting the fused feature map into the prediction network, setting anchor boxes, and predicting the anchor boxes with a clustering algorithm to generate prediction boxes, where the prediction boxes contain the network output images and are used to detect target objects at different scales. Specifically, the anchor selection of the clustering algorithm is configured: three anchor sizes are set so that anchor boxes of corresponding sizes are generated for prediction, and the prediction network finally outputs network output images at three different scales;
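For illustration, a sketch of the IoU-based k-means clustering commonly used to choose anchor sizes from ground-truth box dimensions; the 1 − IoU distance follows YOLO practice, and the patent does not specify its exact clustering variant:

```python
import numpy as np

def iou_wh(boxes, anchors):
    # IoU between (w, h) pairs, assuming all boxes share the same corner.
    inter = (np.minimum(boxes[:, None, 0], anchors[None, :, 0]) *
             np.minimum(boxes[:, None, 1], anchors[None, :, 1]))
    union = (boxes[:, 0] * boxes[:, 1])[:, None] + \
            anchors[None, :, 0] * anchors[None, :, 1] - inter
    return inter / union

def kmeans_anchors(boxes, k=3, iters=100, seed=0):
    # Cluster ground-truth (w, h) pairs with distance d = 1 - IoU.
    rng = np.random.default_rng(seed)
    anchors = boxes[rng.choice(len(boxes), size=k, replace=False)]
    for _ in range(iters):
        assign = np.argmax(iou_wh(boxes, anchors), axis=1)  # nearest anchor
        new = np.array([boxes[assign == i].mean(axis=0) if np.any(assign == i)
                        else anchors[i] for i in range(k)])
        if np.allclose(new, anchors):
            break
        anchors = new
    return anchors

wh = np.random.default_rng(1).uniform(10, 200, size=(500, 2))
print(kmeans_anchors(wh, k=3))  # three (width, height) anchor sizes
```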
step 7, after batch-sampling the network output images, extracting their features through a residual network to obtain a three-dimensional tensor T;
step 8, average-pooling the three-dimensional tensor T to divide it into p vertical stripes, i.e., block-wise re-identification processing, and obtaining the activation vectors of the p stripes along the matrix channel axis of T, defining them as column vectors;
step 9, average-pooling each of the p vertical stripes to generate single part-level column vectors with more detailed features, and inputting the column vectors into a classifier for prediction to obtain a predicted image; to dynamically classify all column vectors e of the three-dimensional tensor T, the classifier consists of an FC layer and a Softmax function, with the Softmax activation used as the part classifier as follows:
P(p_i | e) = softmax(Wᵀe) = exp(W_iᵀe) / Σ_{j=1..p} exp(W_jᵀe)
where P(p_i | e) is the predicted probability that the column vector e belongs to part p_i, p is the number of previously defined partitions, and W is a trainable weight matrix;
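A compact sketch of this stripe pooling and part classification, assuming a shared FC layer over the column vectors; the stripe count p = 5 and the channel width are illustrative values, not the patent's fixed configuration:

```python
import torch
import torch.nn as nn

class PartClassifier(nn.Module):
    # Split T into p horizontal stripes, average-pool each stripe into a
    # column vector e, and classify it with an FC layer plus Softmax:
    # P(p_i | e) = softmax(W^T e).
    def __init__(self, channels: int = 2048, p: int = 5):
        super().__init__()
        self.p = p
        self.fc = nn.Linear(channels, p)

    def forward(self, t: torch.Tensor):  # t: (B, C, H, W), H divisible by p
        b, c, h, w = t.shape
        stripes = t.reshape(b, c, self.p, h // self.p, w)
        cols = stripes.mean(dim=(3, 4))          # (B, C, p) column vectors
        logits = self.fc(cols.transpose(1, 2))   # (B, p, p) part scores
        return torch.softmax(logits, dim=-1)

probs = PartClassifier()(torch.randn(2, 2048, 10, 4))
print(probs.shape)  # torch.Size([2, 5, 5]): p vectors x p part probabilities
```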
step 10, performing tensor-reconstruction re-identification processing on the predicted image with a convolutional neural network, extracting the matrix features of the processed predicted image, and generating features from the three views of channel, width and height using the matrix features to obtain context information segments in the channel, width and height directions;
step 11, repeating step 10 to obtain context information segments in different directions and synthesizing them into a context attention map representing three-dimensional context features; continuing to repeat so as to reconstruct the remaining context segments, then activating and aggregating the context attention map with element-wise products and weighted averaging, thereby obtaining fine-grained context features in the spatial and channel dimensions.
step 12, performing global average pooling on the fine-grained context feature map through a global average pooling layer to obtain a higher-order target feature map; concatenating the target feature map with the column vectors, optimizing the concatenation with minimized cross-entropy loss to obtain a jointly optimized feature map, and predicting on the jointly optimized feature map to obtain the prediction result.
In a specific embodiment, in step 5, the feature pyramid structure in the path aggregation network laterally connects the top-down and bottom-up paths to segment the feature map carrying the feature matrix. To better generate the target mask, an adaptive pooling operation performs pooled extraction on each representative feature map in the path to obtain target features, and a fully connected fusion operation on the extracted target features yields the fused feature map.
In a specific embodiment, in step 6, the anchor boxes are predicted by using the global intersection over union (CIOU) as the prediction-box regression function to generate prediction boxes, where the global intersection over union is given by:
CIOU = IOU − ρ²(b, b^gt)/c² − αv
v = (4/π²) · (arctan(w^gt/h^gt) − arctan(w/h))²
α = v / ((1 − IOU) + v)
where ρ²(b, b^gt) denotes the squared Euclidean distance between the center points of the prediction box and the ground-truth box, c denotes the diagonal length of the minimum enclosing rectangle of the prediction box and the ground-truth box, IOU denotes the intersection over union of the prediction box and the ground-truth box, α denotes the influence factor of the parameter v, v denotes a parameter measuring the consistency of the aspect ratios, w denotes the width of the corresponding box, h denotes the height of the corresponding box, the superscript gt marks the ground-truth box, and CIOU denotes the global intersection over union.
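A small self-contained sketch of the CIOU computation defined above, for axis-aligned boxes given as (x1, y1, x2, y2); a minimal illustration rather than the patent's implementation:

```python
import math

def ciou(pred, gt):
    # Boxes given as (x1, y1, x2, y2) with x2 > x1 and y2 > y1.
    px1, py1, px2, py2 = pred
    gx1, gy1, gx2, gy2 = gt

    # Plain IoU between the prediction box and the ground-truth box.
    ix1, iy1 = max(px1, gx1), max(py1, gy1)
    ix2, iy2 = min(px2, gx2), min(py2, gy2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_p = (px2 - px1) * (py2 - py1)
    area_g = (gx2 - gx1) * (gy2 - gy1)
    iou = inter / (area_p + area_g - inter)

    # rho^2: squared distance between the two box centers.
    rho2 = ((px1 + px2) / 2 - (gx1 + gx2) / 2) ** 2 + \
           ((py1 + py2) / 2 - (gy1 + gy2) / 2) ** 2
    # c: diagonal of the minimum rectangle enclosing both boxes.
    c2 = (max(px2, gx2) - min(px1, gx1)) ** 2 + \
         (max(py2, gy2) - min(py1, gy1)) ** 2

    # v measures aspect-ratio consistency; alpha is its influence factor.
    w, h = px2 - px1, py2 - py1
    wgt, hgt = gx2 - gx1, gy2 - gy1
    v = (4 / math.pi ** 2) * (math.atan(wgt / hgt) - math.atan(w / h)) ** 2
    alpha = v / ((1 - iou) + v + 1e-9)

    return iou - rho2 / c2 - alpha * v

print(ciou((0, 0, 10, 10), (2, 2, 12, 12)))  # about 0.44
```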
In a specific embodiment, step 8 further includes modifying the residual network to extract the depth features of the network output image and obtain the three-dimensional tensor T; specifically, modifying the residual network includes removing the global average pooling layer, the fully connected layer and the output layer of the convolutional neural network.
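For illustration, such a truncation can be sketched with torchvision's ResNet-50; the choice of ResNet-50 is an assumption, since the patent only specifies a residual network with these layers removed:

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Keep everything up to the last residual stage; dropping avgpool and fc
# makes the backbone return the three-dimensional tensor T per image.
backbone = resnet50(weights=None)
feature_extractor = nn.Sequential(*list(backbone.children())[:-2])

images = torch.randn(4, 3, 384, 128)   # a batch of detected target crops
T = feature_extractor(images)
print(T.shape)                          # torch.Size([4, 2048, 12, 4])
```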
In a specific embodiment, in step 11, the context attention map is activated and aggregated with element-wise products and weighted averaging to obtain fine-grained context features in the spatial and channel dimensions, specifically:
step 11.1, repeating step 10 to obtain context information segments in different directions;
step 11.2, performing tensor reconstruction on the different context information segments to obtain sub-attention maps A_i in different directions;
step 11.3, synthesizing the sub-attention maps A_i in different directions to obtain the context attention map;
step 11.4, activating and aggregating the context attention map with element-wise products and weighted averaging, thereby obtaining fine-grained context features in the spatial and channel dimensions.
In a specific embodiment, the specific method in step 11.4 is:
Y = (1/CHW) · Σ_{i=1..CHW} a_i · s_i
S = {s_1, s_2, ..., s_CHW}
A = {a_1, a_2, ..., a_CHW}
where S denotes the input feature matrix, A denotes the context attention map, Y denotes the fine-grained context feature, i indexes the i-th feature, CHW denotes the total number of features, s_i denotes the i-th input feature, and a_i denotes the i-th context attention value.
Specifically, the dual-branch re-identification network proposed by the invention, as shown in FIG. 4, comprises a local perception branch network and a high-order reconstruction branch network;
(1) The local perception branch network performs the average pooling of step 8 on the three-dimensional tensor T to divide it into p vertical stripes, i.e., block-wise re-identification processing, and obtains the activation vectors of the p stripes along the matrix channel axis of T, defining them as column vectors.
Specifically, dividing the three-dimensional tensor T horizontally into p equally sized sub-feature tensors (five in this embodiment) can be described as T(i) = [T(i,1), T(i,2), ..., T(i,p)] s.t. i = 1, 2, ..., N, where N is the number of samples in the data set and p is the number of blocks in the first branch. Global average pooling is then applied to each partitioned sub-region, giving a column vector for each sub-feature tensor; these column vectors effectively capture the detailed features of the image. Each column vector is then passed through a convolutional layer, and the resulting block-wise sub-vectors are denoted H(i) = [H(i,1), H(i,2), ..., H(i,p)] s.t. i = 1, 2, ..., N. In this way the feature distribution of each local region is modeled, and the local perception features improve the network's feature extraction performance. During training, M is the number of classes in the training set. Finally, the identity of the input is predicted by feeding the column vectors into a classifier. The classifier consists of an FC layer and a Softmax function, with the Softmax activation used as the part classifier as follows:
L_softmax = −Σ_i log( exp(H(i,p)) / Σ_{j=1..M} exp(H(j,p)) )
where p is the number of previously defined partitions, M is the number of classes in the training set, L_softmax is the local perception branch loss, H(i,p) is the i-th sample obtained by convolving the p-th column vector, and H(j,p) is the j-th sample obtained by convolving the p-th column vector. By predicting each block's feature sub-vector, the prediction probability of the locally refined features of the underwater target is obtained, increasing the network's local perception performance.
(2) The high-order reconstruction branch network performs the tensor-reconstruction re-identification processing of step 10 on the predicted image with a residual network, extracts the matrix features of the processed predicted image, and generates features from the three views of channel, width and height to obtain context information segments in the three directions.
(3) The tensor reconstruction part of the high-order reconstruction branch network mainly generates features from the tensor T produced by the cross-stage network from the three views of channel, width and height, obtaining context information segments in the three directions. The context features generated in the three directions are processed into sub-attention maps, each representing part of the three-dimensional context features. The remaining context segments are reconstructed iteratively, and the sub-maps are activated and aggregated through element-wise multiplication and weighted averaging, thereby obtaining fine-grained context features in the spatial and channel dimensions.
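A rough sketch of this three-view idea: the tensor is reshaped so that a sub-attention map can be generated along each of the channel, height and width views, and the sub-maps are then averaged and applied to the features by element-wise multiplication. The single-layer projections and sigmoid activations are assumptions of the sketch, not the patent's exact reconstruction design:

```python
import torch
import torch.nn as nn

class HighOrderContext(nn.Module):
    # Build a sub-attention map from each of the three views (channel, height,
    # width) of the tensor T, average them into one context attention map A,
    # and apply A to the features S by element-wise multiplication.
    def __init__(self, c: int, h: int, w: int):
        super().__init__()
        self.proj_c = nn.Linear(c, c)   # view along the channel axis
        self.proj_h = nn.Linear(h, h)   # view along the height axis
        self.proj_w = nn.Linear(w, w)   # view along the width axis

    def forward(self, t: torch.Tensor) -> torch.Tensor:  # t: (B, C, H, W)
        # Tensor reconstruction: permute so the target axis is last, project,
        # then permute back; sigmoid turns each result into an attention map.
        a_c = torch.sigmoid(self.proj_c(t.permute(0, 2, 3, 1))).permute(0, 3, 1, 2)
        a_h = torch.sigmoid(self.proj_h(t.permute(0, 1, 3, 2))).permute(0, 1, 3, 2)
        a_w = torch.sigmoid(self.proj_w(t))
        a = (a_c + a_h + a_w) / 3.0     # weighted average of sub-attention maps
        return a * t                     # element-wise product: fine-grained context

y = HighOrderContext(2048, 12, 4)(torch.randn(2, 2048, 12, 4))
print(y.shape)  # torch.Size([2, 2048, 12, 4])
```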
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (6)

1. An underwater target re-identification method combining local perception and high-order feature reconstruction is characterized by comprising the following steps:
step 1, collecting images of the area traversed by the underwater robot;
step 2, performing underwater target feature processing on the acquired images by using the cross-stage network of a YOLOv4 target detection algorithm model to obtain an underwater target feature map;
step 3, mapping the underwater target feature map with the Mish activation function to obtain a target feature result map;
step 4, performing spatial pyramid pooling on the target feature result map through the target detection algorithm and concatenating the pooled results, thereby separating out a feature map carrying a feature matrix;
step 5, inputting the feature map with the feature matrix into a path aggregation network for feature fusion to obtain a fused feature map;
step 6, inputting the fused feature map into a prediction network, setting anchor boxes, and predicting the anchor boxes with a clustering algorithm to generate prediction boxes, where the prediction boxes contain the network output images and are used to detect target objects at different scales;
step 7, after batch-sampling the network output images, extracting their features through a residual network to obtain a three-dimensional tensor T;
step 8, average-pooling the three-dimensional tensor T to divide it into p vertical stripes, i.e., block-wise re-identification processing, and obtaining the activation vectors of the p stripes along the matrix channel axis of T, defining them as column vectors;
step 9, inputting the column vectors into a classifier for prediction to obtain a predicted image;
step 10, performing tensor-reconstruction re-identification processing on the predicted image with a residual network, extracting the matrix features of the processed predicted image, and generating features from the three views of channel, width and height using the matrix features to obtain context information segments;
step 11, repeating step 10 and synthesizing the obtained context information segments to obtain a context attention map representing three-dimensional context features; activating and aggregating the context attention map with element-wise products and weighted averaging to obtain fine-grained context features in the spatial and channel dimensions;
step 12, performing global average pooling on the fine-grained context feature map through a global average pooling layer to obtain a target feature map; concatenating the target feature map with the column vectors, optimizing the concatenation with minimized cross-entropy loss to obtain a jointly optimized feature map, and predicting on the jointly optimized feature map to obtain the prediction result.
2. The underwater target re-identification method combining local perception and high-order feature reconstruction according to claim 1, wherein: in step 5, the feature pyramid structure in the path aggregation network laterally connects the top-down and bottom-up paths to partition the feature map carrying the feature matrix; an adaptive pooling operation performs pooled extraction on each such feature map to obtain target features, and a fully connected fusion operation on the extracted target features yields the fused feature map.
3. The underwater target re-identification method combining local perception and high-order feature reconstruction according to claim 2, wherein: in step 6, the anchor boxes are predicted by using the global intersection over union (CIOU) as the prediction-box regression function to generate prediction boxes, where the global intersection over union is given by:
CIOU = IOU − ρ²(b, b^gt)/c² − αv
v = (4/π²) · (arctan(w^gt/h^gt) − arctan(w/h))²
α = v / ((1 − IOU) + v)
where ρ²(b, b^gt) denotes the squared Euclidean distance between the center points of the prediction box and the ground-truth box, c denotes the diagonal length of the minimum enclosing rectangle of the prediction box and the ground-truth box, IOU denotes the intersection over union of the prediction box and the ground-truth box, α denotes the influence factor of the parameter v, v denotes a parameter measuring the consistency of the aspect ratios, w denotes the width of the corresponding box, h denotes the height of the corresponding box, the superscript gt marks the ground-truth box, and CIOU denotes the global intersection over union.
4. The underwater target re-identification method combining local perception and high-order feature reconstruction according to claim 3, wherein: step 8 includes modifying the residual network to extract the depth features of the network output image and obtain the three-dimensional tensor T, where modifying the residual network specifically includes removing the global average pooling layer, the fully connected layer and the output layer of the convolutional neural network.
5. The underwater target re-identification method combining local perception and high-order feature reconstruction according to claim 4, wherein: in step 11, the context attention map is activated and aggregated with element-wise products and weighted averaging to obtain fine-grained context features in the spatial and channel dimensions, specifically:
step 11.1, repeating step 10 to obtain context information segments in different directions;
step 11.2, performing tensor reconstruction on the different context information segments to obtain sub-attention maps A_i in different directions;
step 11.3, synthesizing the sub-attention maps A_i in different directions to obtain the context attention map;
step 11.4, activating and aggregating the context attention map with element-wise products and weighted averaging, thereby obtaining fine-grained context features in the spatial and channel dimensions.
6. The underwater target re-identification method combining local perception and high-order feature reconstruction according to claim 5, wherein step 11.4 is specifically:
Y = (1/CHW) · Σ_{i=1..CHW} a_i · s_i
S = {s_1, s_2, ..., s_CHW}
A = {a_1, a_2, ..., a_CHW}
where S denotes the input feature matrix, A denotes the context attention map, Y denotes the fine-grained context feature, i indexes the i-th feature, CHW denotes the total number of features, s_i denotes the i-th input feature, and a_i denotes the i-th context attention value.
Application CN202111582065.8A, priority date 2021-12-22, filing date 2021-12-22: Underwater target re-identification method combining local perception and high-order feature reconstruction (Pending; published as CN114283326A (en))

Priority Applications (1)

Application Number: CN202111582065.8A | Priority Date: 2021-12-22 | Filing Date: 2021-12-22 | Title: Underwater target re-identification method combining local perception and high-order feature reconstruction

Publications (1)

Publication Number: CN114283326A | Publication Date: 2022-04-05

Family ID: 80873838

Family Applications (1)

Application Number: CN202111582065.8A | Priority Date: 2021-12-22 | Filing Date: 2021-12-22 | Title: Underwater target re-identification method combining local perception and high-order feature reconstruction

Country Status (1)

CN | CN114283326A (en)


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116596904A * 2023-04-26 2023-08-15 State Grid Jiangsu Electric Power Co., Ltd. Taizhou Power Supply Branch Power transmission detection model construction method and device based on adaptive scale sensing
CN116596904B * 2023-04-26 2024-03-26 State Grid Jiangsu Electric Power Co., Ltd. Taizhou Power Supply Branch Power transmission detection model construction method and device based on adaptive scale sensing
CN116503914A * 2023-06-27 2023-07-28 East China Jiaotong University Pedestrian re-recognition method, system, readable storage medium and computer equipment
CN116503914B * 2023-06-27 2023-09-01 East China Jiaotong University Pedestrian re-recognition method, system, readable storage medium and computer equipment

Similar Documents

Publication Title
CN110135267B (en) Large-scene SAR image fine target detection method
CN111931684B (en) Weak and small target detection method based on video satellite data identification features
CN111612008B (en) Image segmentation method based on convolution network
CN114202672A (en) Small target detection method based on attention mechanism
CN109165540B (en) Pedestrian searching method and device based on prior candidate box selection strategy
CN110852182B (en) Depth video human body behavior recognition method based on three-dimensional space time sequence modeling
CN114359851A (en) Unmanned target detection method, device, equipment and medium
CN110826462A (en) Human body behavior identification method of non-local double-current convolutional neural network model
CN114283326A (en) Underwater target re-identification method combining local perception and high-order feature reconstruction
CN105574545B (en) The semantic cutting method of street environment image various visual angles and device
CN112507861A (en) Pedestrian detection method based on multilayer convolution feature fusion
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
CN112634369A (en) Space and or graph model generation method and device, electronic equipment and storage medium
CN107610224B (en) 3D automobile object class representation algorithm based on weak supervision and definite block modeling
CN116342894A (en) GIS infrared feature recognition system and method based on improved YOLOv5
CN116091946A (en) Yolov 5-based unmanned aerial vehicle aerial image target detection method
CN115019201A (en) Weak and small target detection method based on feature refined depth network
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN115937736A (en) Small target detection method based on attention and context awareness
CN113723558A (en) Remote sensing image small sample ship detection method based on attention mechanism
CN114119669A (en) Image matching target tracking method and system based on Shuffle attention
CN113496260A (en) Grain depot worker non-standard operation detection method based on improved YOLOv3 algorithm
CN113780145A (en) Sperm morphology detection method, sperm morphology detection device, computer equipment and storage medium
CN117437691A (en) Real-time multi-person abnormal behavior identification method and system based on lightweight network
CN116934820A (en) Cross-attention-based multi-size window Transformer network cloth image registration method and system

Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination