CN116935332A - Fishing boat target detection and tracking method based on dynamic video - Google Patents
- Publication number
- CN116935332A CN116935332A CN202310328937.0A CN202310328937A CN116935332A CN 116935332 A CN116935332 A CN 116935332A CN 202310328937 A CN202310328937 A CN 202310328937A CN 116935332 A CN116935332 A CN 116935332A
- Authority
- CN
- China
- Prior art keywords
- fishing boat
- target
- training
- model
- tracking
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/54—Surveillance or monitoring of activities, e.g. for recognising suspicious objects of traffic, e.g. cars on the road, trains or boats
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/809—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention discloses a fishing boat target detection and tracking method based on dynamic video, and relates to the field of ship target detection and tracking. S1, collecting and building a fishing boat picture data set and a fishing boat video data set. S2, training a YoloV5 target detection model with the fishing boat picture data set. S3, inputting fishing boat pictures into the trained YoloV5 target detection model to obtain the region coordinates of each fishing boat target, and segmenting the pictures with the PIL library according to the region coordinate information. S4, training a target tracking model based on an improved MobileNetV3 network and an RPN sub-network with the framed fishing boat video data set. S5, testing the target tracking model: inputting the first frame of any video frame sequence of the fishing boat video data set into the trained YoloV5 target detection model to obtain the region position information and segmented picture of each fishing boat target, and inputting these together with the video frame sequence into the trained target tracking model to obtain the fishing boat target tracking result.
Description
Technical Field
The invention relates to the field of ship target detection and tracking, in particular to a fishing ship target detection and tracking method based on dynamic video.
Background
With the continuous development of computer vision in recent years, deep-learning-based target detection and tracking technology has gradually been applied to the behavior supervision of marine traffic ships. However, because ship targets in the marine environment are easily affected by ship pose, lighting changes, complex backgrounds and similar problems, ship target detection and tracking results have low robustness and poor effect.
In the field of target detection, the YoloV5 algorithm performs excellently on general targets and can accurately identify image targets. In the field of target tracking, the most mainstream tracking algorithm is SiamRPN, which greatly improved tracking accuracy while achieving faster computation; however, its network model has too many parameters and too large a computation load, making it unsuitable for target tracking in offline real-time scenarios. Meanwhile, SiamRPN generalizes poorly in multi-scale data fusion, and target feature information is easily lost.
Disclosure of Invention
The invention aims to provide a fishing boat target detection and tracking method based on dynamic video, which uses the currently mainstream YoloV5 target detection model to accurately identify fishing boat targets, uses an improved dual-branch MobileNetV3 model combined with a Transformer cross attention mechanism (Cross Attention Transformer, CAT) to extract fishing boat image features, and finally tracks the fishing boat target with an RPN (Region Proposal Network) based on the Transformer attention mechanism.
In order to achieve the above purpose, the invention provides a method for detecting and tracking a fishing boat target based on dynamic video, which comprises the following steps:
s1, constructing a data set: collecting marine fishing boat pictures, and manufacturing a fishing boat picture data set for training a YoloV5 target detection model; collecting historical monitoring videos of monitoring cameras of a fishing boat wharf, and manufacturing a dynamic video data set for training and improving a target tracking model of a MobileNet V3 network and an RPN subnetwork;
s2, training a YoloV5 target detection model: performing data enhancement processing on the fishing boat picture data set, dividing the enhanced data set samples into a training set and a test set at an 8:2 ratio, and training the YoloV5 target detection model with the training set;
s3, realizing target detection and picture segmentation of the fishing boat: inputting the test set into a trained YoloV5 target detection model to obtain region coordinates of a plurality of fishing boat targets, and dividing the pictures by using a PIL library according to the region coordinate information to obtain a plurality of pictures only containing single fishing boat targets.
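The picture segmentation of step S3 — cutting one sub-image per detected region with the PIL library — might look like the sketch below. The (x1, y1, x2, y2) box format is an assumption (the xyxy pixel coordinates YoloV5 commonly reports); the patent does not fix the exact interface.

```python
from PIL import Image

def crop_targets(image, boxes):
    """Crop one sub-image per detected fishing-boat region.

    `boxes` are (x1, y1, x2, y2) pixel coordinates; each returned
    crop contains a single fishing-boat target.
    """
    return [image.crop(tuple(map(int, box))) for box in boxes]

# Toy example: a blank 640x480 "frame" with two detected regions.
frame = Image.new("RGB", (640, 480))
crops = crop_targets(frame, [(10, 20, 110, 220), (300, 50, 500, 400)])
print([c.size for c in crops])  # [(100, 200), (200, 350)]
```

Each crop can then be saved or fed directly to the template branch of the tracking model.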
S4, training a target tracking model: framing the videos of the dynamic video data set to obtain a video frame sequence data set, dividing it into a training set and a test set at an 8:2 ratio, and training a target tracking model based on the improved MobileNetV3 network and the RPN sub-network with the training set.
S5, realizing target tracking of the dynamic video fishing boat: inputting a first frame picture of a video frame sequence of a test set into a trained YoloV5 target detection model to obtain regional position information and segmentation pictures of each fishing boat target, and inputting the regional position information and segmentation pictures of the fishing boat target and the video frame sequence into a trained fishing boat target tracking model to obtain a fishing boat target tracking result.
Preferably, in the step S1, the fishing boat picture may be obtained from a network public data set; the dynamic video of the fishing boat can be obtained from a monitoring camera erected on a wharf of the fishing boat.
Preferably, in step S2, data enhancement processing is performed on the fishing boat picture data set to enrich the data samples; the enhancement includes picture rotation, proportional scaling and background filling. The enhanced data set samples are divided into a training set and a test set, where the training set is used to train the YoloV5 target detection model and the test set is used to verify its effect.
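A minimal sketch of the three enhancement operations (picture rotation, proportional scaling, background filling) using the PIL library; all parameter values, the canvas size and the fill color are illustrative assumptions, not values fixed by the patent.

```python
from PIL import Image

def augment(img, angle=15, scale=0.8, canvas=(640, 640), fill=(114, 114, 114)):
    """One enhancement pass: rotate, scale proportionally, then paste
    onto a fixed-size background canvas (letterbox-style filling)."""
    rotated = img.rotate(angle, expand=True)                   # picture rotation
    w, h = rotated.size
    scaled = rotated.resize((int(w * scale), int(h * scale)))  # proportional scaling
    board = Image.new("RGB", canvas, fill)                     # background filling
    ox = (canvas[0] - scaled.width) // 2
    oy = (canvas[1] - scaled.height) // 2
    board.paste(scaled, (ox, oy))
    return board

sample = Image.new("RGB", (400, 300), (0, 0, 255))
out = augment(sample)
print(out.size)  # (640, 640)
```

Applying this with several angle/scale combinations per source image multiplies the number of training samples.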
Preferably, in step S3, the fishing boat picture is input into the trained YoloV5 target detection model, which outputs the identification frames of a plurality of fishing boat targets; picture cutting is then performed directly on the original fishing boat picture using the region position information of those identification frames, obtaining a plurality of pictures each containing one fishing boat target.
Preferably, in the step S4, the target tracking model is divided into two parts, i.e. the modified MobileNetV3 network and the RPN subnetwork, as shown in fig. 2. The improved MobileNet V3 network is used for extracting image features, and the RPN subnetwork is used for classifying and regressing the extracted image features to obtain a final fishing boat target tracking result.
The improved MobileNetV3 network adopts a dual-branch structure consisting of a template branch and a search branch; the two branches use the same improved MobileNetV3 model, share training parameters, and respectively extract template region image features and search region image features.
The structure of the MobileNetV3 model is shown in fig. 3; a Transformer-based cross attention mechanism (Cross Attention Transformer, CAT) is introduced to improve MobileNetV3, reducing computation cost while maintaining good feature extraction performance.
The overall structure of CAT is shown in fig. 4, and the feature extraction is divided into the following four stages, and the cross attention block (Cross Attention Block, CAB) is shown in fig. 5:
(1) In the first stage, the input image is split into patches with height H₁ = H/P and width W₁ = W/P, and the number of channels is increased to C₁; the output feature shape is F₁ = H₁ × W₁ × C₁;
(2) In the second stage, the patch projection layer performs a space-to-depth operation, rearranging each pixel block of shape 2 × 2 × C into shape 1 × 1 × 4C, after which a linear projection layer projects it to 1 × 1 × 2C; the length and width of the feature map are halved and the channel dimension is doubled, giving an output feature shape of F₂ = H₁/2 × W₁/2 × C₂;
(3) The third and fourth stages likewise execute the patch projection layer's space-to-depth operation;
(4) After the four stages, feature maps F₁, F₂, F₃, F₄ of four different scales and dimensions are obtained.
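The space-to-depth rearrangement plus linear projection of the second stage can be sketched in NumPy as follows. The spatial size, channel count and the random projection matrix are illustrative only; in CAT the projection is a learned linear layer.

```python
import numpy as np

def space_to_depth(x):
    """Rearrange each 2x2xC pixel block into 1x1x4C:
    (H, W, C) -> (H/2, W/2, 4C)."""
    h, w, c = x.shape
    x = x.reshape(h // 2, 2, w // 2, 2, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h // 2, w // 2, 4 * c)

def patch_projection(x, proj):
    """Space-to-depth followed by a linear projection 4C -> 2C,
    halving the spatial size and doubling the channel dimension."""
    return space_to_depth(x) @ proj  # proj has shape (4C, 2C)

rng = np.random.default_rng(0)
f1 = rng.standard_normal((56, 56, 96))           # e.g. H1 x W1 x C1
w_proj = rng.standard_normal((4 * 96, 2 * 96))   # stand-in for the learned layer
f2 = patch_projection(f1, w_proj)
print(f2.shape)  # (28, 28, 192)
```

Repeating the same operation in the later stages yields the four feature maps of decreasing resolution and increasing depth.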
The RPN sub-network is shown in fig. 2 and includes a classification branch and a regression branch. The classification branch distinguishes the target from the background, and the regression branch outputs a more accurate target tracking position. In the classification branch, the template branch outputs a feature map with 2k channels for the targets and backgrounds of the k anchors (where k is the number of pre-selected anchor boxes per location). In the regression branch, the template branch outputs a feature map with 4k channels, corresponding to the 4 position regression parameters of the k anchors.
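The 2k / 4k channel bookkeeping can be made concrete with a tiny helper; k = 5 is only an example value, not one fixed by the patent.

```python
def rpn_head_channels(k):
    """For k anchors per location, the classification branch needs 2k
    channels (target / background score per anchor) and the regression
    branch needs 4k channels (4 position parameters per anchor)."""
    return {"cls": 2 * k, "reg": 4 * k}

# Example with 5 anchor shapes per location:
print(rpn_head_channels(5))  # {'cls': 10, 'reg': 20}
```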
In the RPN classification branch and regression branch, multi-scale coding fusion is performed on the image features on the basis of the Transformer attention mechanism to obtain the final target tracking output response diagram. The network structure of the attention mechanism is shown in FIG. 6: the network input X is converted by the convolution-layer mapping F_tr into a feature map U of a given size. The network then performs a Squeeze operation F_sq on U, i.e., the spatial feature u_c of each channel of U is encoded into a global feature z_c:
z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j) (1)
where H and W are the height and width of the original image features, respectively. The Squeeze operation obtains a global description of each channel; an Excitation operation F_ex then learns the relations among the channels and finally obtains the adaptive weight of each channel:
s_c = F_ex(z_c, W) = σ(W₂ g(W₁ z_c)) (2)
where W₁ and W₂ are linear transformation matrices, τ is a dimension-reduction hyperparameter, σ is the Sigmoid activation function, g is the ReLU activation function, and s_c is the adaptive weight of each channel.
The final fishing boat target tracking output response diagram U′ is obtained by channel-by-channel linear weighting of U with the learned channel weights s_c through F_scale:
u′_c = F_scale(u_c, s_c) = s_c · u_c (3)
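A minimal NumPy sketch of the squeeze / excitation / reweighting chain of equations (1)–(3); the tensor layout (H, W, C), the reduction factor and the random weight matrices are illustrative assumptions standing in for the learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_reweight(u, w1, w2):
    """Channel attention: squeeze each channel to a global descriptor,
    excite through two linear maps (ReLU then Sigmoid), then rescale
    the channels of U.  u: (H, W, C); w1: (C, C//tau); w2: (C//tau, C)."""
    z = u.mean(axis=(0, 1))                  # (1) squeeze: global average per channel
    s = sigmoid(np.maximum(z @ w1, 0) @ w2)  # (2) excitation: adaptive channel weights
    return u * s                             # (3) channel-by-channel reweighting

rng = np.random.default_rng(1)
u = rng.standard_normal((8, 8, 16))
w1 = rng.standard_normal((16, 4))   # tau = 4 reduction, illustrative
w2 = rng.standard_normal((4, 16))
u_prime = se_reweight(u, w1, w2)
print(u_prime.shape)  # (8, 8, 16)
```

Note the matrices are applied in row-vector convention (z @ w1), which is the transpose of the column-vector form W₁z in equation (2).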
In summary, the specific steps for training the target tracking model in step S4 are as follows:
S4-1, inputting the first frame picture of a training-set video frame sequence into the trained YoloV5 target detection model, which outputs the region position information of all fishing boat targets in the picture and the fishing boat target pictures cut out according to that region position information;
S4-2, inputting the region position information of a single fishing boat target and the corresponding fishing boat target picture, in label order, into the template branch of the target tracking model's improved MobileNetV3 network, and inputting the training-set video frame sequence, in time order, into the search branch of the improved MobileNetV3 network; after network convolution calculation, the template region features and search region features are obtained;
S4-3, inputting the template region features and search region features into the classification branch and regression branch of the RPN sub-network respectively, and outputting the final response diagram of the fishing boat target tracking result after multi-scale coding fusion of the features by the Transformer-based attention mechanism;
S4-4, repeating steps S4-1 to S4-3, inputting the training-set video frame sequences into the target tracking model for training, with each module self-learning and adjusting the network training parameters until the response diagram output by the fishing boat target tracking model achieves the expected effect.
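The data flow of steps S4-1 to S4-3 can be sketched with stub components; every name below is a placeholder for the corresponding trained module, not an API defined by the patent.

```python
# Stubs standing in for the trained detector, the shared backbone
# branches, and the RPN head.
def detect_first_frame(frame):             # S4-1: YoloV5 on frame 1
    return [{"box": (10, 10, 60, 90), "crop": "boat_0"}]

def backbone(x):                           # shared improved-MobileNetV3 branch
    return ("feat", x)

def rpn_head(template_feat, search_feat):  # classification + regression fusion
    return {"response": (template_feat, search_feat)}

def track_sequence(frames):
    targets = detect_first_frame(frames[0])
    template_feat = backbone(targets[0]["crop"])   # S4-2: template branch
    responses = []
    for frame in frames[1:]:
        search_feat = backbone(frame)              # S4-2: search branch
        responses.append(rpn_head(template_feat, search_feat))  # S4-3
    return responses

maps = track_sequence(["frame0", "frame1", "frame2"])
print(len(maps))  # 2
```

In training (S4-4), the loop above runs over every training-set sequence while the backbone and RPN parameters are updated.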
Preferably, in step S5, a first frame of the video frame sequence of the test set is input to a trained YoloV5 target detection model to obtain region position information and a segmentation picture of each fishing boat target, and the region position information and the segmentation picture of the fishing boat target are input to a trained fishing boat target tracking model together with the video frame sequence to obtain a fishing boat target tracking result.
The fishing boat target detection and tracking method based on dynamic video has the following advantages and positive effects:
according to the method for detecting and tracking the target of the fishing boat based on the dynamic video, the detection part utilizes the existing most mainstream YoloV5 target detection model to identify the target of the fishing boat, so that the identification effect is excellent; the tracking part extracts image features by utilizing an improved lightweight MobileNet V3 network, and simultaneously uses an RPN sub-network to perform classification and regression operation after multi-scale coding fusion on the extracted features, so that the total frame greatly reduces the number of the participated calculation parameters, improves the target tracking calculation speed, and simultaneously improves the accuracy of target tracking, thereby enabling the model to be more suitable for fishing boat target detection and tracking in a real-time scene.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of a method for detecting and tracking targets of a fishing vessel based on dynamic video;
FIG. 2 is a schematic diagram of a target tracking model of an embodiment of a method for detecting and tracking a target of a fishing vessel based on dynamic video;
FIG. 3 is a schematic diagram of a MobileNet V3 model structure of an embodiment of a method for detecting and tracking a fishing vessel target based on dynamic video;
FIG. 4 is an overall block diagram of CAT of an embodiment of a method for detecting and tracking targets of a fishing vessel based on dynamic video according to the present invention;
FIG. 5 is a schematic diagram of a cross attention block CAB in a CAT structure of an embodiment of a method for detecting and tracking targets of a fishing boat based on dynamic video according to the present invention;
FIG. 6 is a schematic diagram of a network structure of an attention mechanism of an embodiment of a method for detecting and tracking targets of a fishing vessel based on dynamic video.
Detailed Description
The technical scheme of the invention is further described below through the attached drawings and the embodiments.
Examples
As shown in fig. 1, a method for detecting and tracking a target of a fishing vessel based on dynamic video comprises the following steps:
s1, constructing a data set: collecting marine fishing boat pictures, and manufacturing a fishing boat picture data set for training a YoloV5 target detection model; collecting historical monitoring videos of monitoring cameras of a fishing boat wharf, and manufacturing a dynamic video data set for training and improving a target tracking model of a MobileNet V3 network and an RPN subnetwork; the fishing boat picture can be obtained from the network public data set; the dynamic video of the fishing boat can be obtained from a monitoring camera erected on a wharf of the fishing boat.
S2, training a YoloV5 target detection model: performing data enhancement processing on the fishing boat picture data set to enrich the data samples; the enhancement includes picture rotation, proportional scaling and background filling. The enhanced data set samples are divided into a training set and a test set at an 8:2 ratio; the training set is used to train the YoloV5 target detection model and the test set is used to verify its effect.
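The 8:2 division used here (and again for the frame-sequence data set in S4) can be sketched as follows; the seed and helper name are illustrative.

```python
import random

def split_8_2(samples, seed=42):
    """Shuffle and divide samples into a training set and a test set
    at an 8:2 ratio."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * 0.8)
    return items[:cut], items[cut:]

train, test = split_8_2(range(100))
print(len(train), len(test))  # 80 20
```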
S3, realizing target detection and picture segmentation of the fishing boat: inputting the test set into a trained YoloV5 target detection model to obtain region coordinates of a plurality of fishing boat targets, and dividing the pictures by using a PIL library according to the region coordinate information to obtain a plurality of pictures only containing single fishing boat targets.
S4, training a target tracking model: framing the videos of the dynamic video data set to obtain a video frame sequence data set, dividing it into a training set and a test set at an 8:2 ratio, and training a target tracking model based on the improved MobileNetV3 network and the RPN sub-network with the training set.
The target tracking model is divided into a modified MobileNetV3 network and an RPN subnetwork, as shown in fig. 2. The improved MobileNet V3 network is used for extracting image features, and the RPN subnetwork is used for classifying and regressing the extracted image features to obtain a final fishing boat target tracking result.
The improved MobileNetV3 network adopts a dual-branch structure consisting of a template branch and a search branch; the two branches use the same improved MobileNetV3 model, share training parameters, and respectively extract template region image features and search region image features.
The structure of the MobileNetV3 model is shown in fig. 3; a Transformer-based cross attention mechanism (Cross Attention Transformer, CAT) is introduced to improve MobileNetV3, reducing computation cost while maintaining good feature extraction performance.
The overall structure of CAT is shown in fig. 4, and the feature extraction is divided into the following four stages, and the cross attention block (Cross Attention Block, CAB) is shown in fig. 5:
(1) In the first stage, the input image is split into patches with height H₁ = H/P and width W₁ = W/P, and the number of channels is increased to C₁; the output feature shape is F₁ = H₁ × W₁ × C₁;
(2) In the second stage, the patch projection layer performs a space-to-depth operation, rearranging each pixel block of shape 2 × 2 × C into shape 1 × 1 × 4C, after which a linear projection layer projects it to 1 × 1 × 2C; the length and width of the feature map are halved and the channel dimension is doubled, giving an output feature shape of F₂ = H₁/2 × W₁/2 × C₂;
(3) The third and fourth stages likewise execute the patch projection layer's space-to-depth operation;
(4) After the four stages, feature maps F₁, F₂, F₃, F₄ of four different scales and dimensions are obtained.
The RPN sub-network is shown in fig. 2 and includes a classification branch and a regression branch. The classification branch distinguishes the target from the background, and the regression branch outputs a more accurate target tracking position. In the classification branch, the template branch outputs a feature map with 2k channels for the targets and backgrounds of the k anchors (where k is the number of pre-selected anchor boxes per location). In the regression branch, the template branch outputs a feature map with 4k channels, corresponding to the 4 position regression parameters of the k anchors.
In the RPN classification branch and regression branch, multi-scale coding fusion is performed on the image features on the basis of the Transformer attention mechanism to obtain the final target tracking output response diagram. The network structure of the attention mechanism is shown in FIG. 6: the network input X is converted by the convolution-layer mapping F_tr into a feature map U of a given size. The network then performs a Squeeze operation F_sq on U, i.e., the spatial feature u_c of each channel of U is encoded into a global feature z_c:
z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j) (1)
where H and W are the height and width of the original image features, respectively. The Squeeze operation obtains a global description of each channel; an Excitation operation F_ex then learns the relations among the channels and finally obtains the adaptive weight of each channel:
s_c = F_ex(z_c, W) = σ(W₂ g(W₁ z_c)) (2)
where W₁ and W₂ are linear transformation matrices, τ is a dimension-reduction hyperparameter, σ is the Sigmoid activation function, g is the ReLU activation function, and s_c is the adaptive weight of each channel.
The final target tracking output response diagram U′ is obtained by channel-by-channel linear weighting of U with the learned channel weights s_c through F_scale:
u′_c = F_scale(u_c, s_c) = s_c · u_c (3)
In summary, the specific steps for training the target tracking model in step S4 are as follows:
S4-1, inputting the first frame picture of a training-set video frame sequence into the trained YoloV5 target detection model, which outputs the region position information of all fishing boat targets in the picture and the fishing boat target pictures cut out according to that region position information;
S4-2, inputting the region position information of a single fishing boat target and the corresponding fishing boat target picture, in label order, into the template branch of the target tracking model's improved MobileNetV3 network, and inputting the training-set video frame sequence, in time order, into the search branch of the improved MobileNetV3 network; after network convolution calculation, the template region features and search region features are obtained;
S4-3, inputting the template region features and search region features into the classification branch and regression branch of the RPN sub-network respectively, and outputting the final response diagram of the fishing boat target tracking result after multi-scale coding fusion of the features by the Transformer-based attention mechanism;
S4-4, repeating steps S4-1 to S4-3, inputting the training-set video frame sequences into the target tracking model for training, with each module self-learning and adjusting the network training parameters until the response diagram output by the fishing boat target tracking model achieves the expected effect.
S5, realizing target tracking of the dynamic video fishing boat: inputting a first frame picture of a video frame sequence of a test set into a trained YoloV5 target detection model to obtain regional position information and segmentation pictures of each fishing boat target, and inputting the regional position information and segmentation pictures of the fishing boat target and the video frame sequence into a trained target tracking model to obtain a fishing boat target tracking result.
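The detect-then-track flow of step S5 can be outlined as plain control flow. This is a minimal sketch of the orchestration only; `run_tracking`, `detect_boats`, `make_tracker`, and the stub tracker below are hypothetical stand-ins for the trained YoloV5 detector and the MobileNetV3 + RPN tracker, not the patent's implementation:

```python
def run_tracking(frames, detect_boats, make_tracker):
    """Step S5 control flow: detect boats in the first frame, then feed
    every later frame to one per-boat tracker (the search branch)."""
    tracks = []
    for box, crop in detect_boats(frames[0]):      # YoloV5 stand-in
        tracks.append((make_tracker(box, crop), [box]))
    for frame in frames[1:]:
        for tracker, history in tracks:
            history.append(tracker.update(frame))
    return [history for _, history in tracks]

# Tiny demo with a stub tracker that drifts one pixel right per frame.
class StubTracker:
    def __init__(self, box, crop):
        self.box = box
    def update(self, frame):
        x, y, w, h = self.box
        self.box = (x + 1, y, w, h)
        return self.box

paths = run_tracking(["f0", "f1", "f2"],
                     lambda f: [((0, 0, 10, 10), "crop")],
                     StubTracker)
```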
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that the technical solution of the present invention may be modified or equivalently substituted without departing from the spirit and scope of the technical solution of the present invention.
Claims (6)
1. The method for detecting and tracking the target of the fishing boat based on the dynamic video is characterized by comprising the following steps of:
S1, constructing data sets: collecting marine fishing boat pictures, and making a fishing boat picture data set for training the YoloV5 target detection model; collecting historical monitoring videos from the monitoring cameras of a fishing boat wharf, and making a dynamic video data set for training the target tracking model based on the improved MobileNetV3 network and the RPN sub-network;
S2, training the YoloV5 target detection model: carrying out data enhancement processing on the fishing boat picture data set, dividing the enhanced data set samples into a training set and a test set at a ratio of 8:2, and training the YoloV5 target detection model with the training set;
S3, realizing fishing boat target detection and picture segmentation: inputting the test set into the trained YoloV5 target detection model to obtain the region coordinates of a plurality of fishing boat targets, and cropping the pictures with the PIL library according to the region coordinate information to obtain a plurality of pictures each containing only a single fishing boat target;
S4, training the target tracking model: carrying out framing processing on the videos of the dynamic video data set to obtain a video frame sequence data set, dividing the video frame sequence data set into a training set and a test set at a ratio of 8:2, and training the target tracking model based on the improved MobileNetV3 network and the RPN sub-network with the training set.
S5, realizing target tracking of the dynamic video fishing boat: inputting a first frame picture of a video frame sequence of a test set into a trained YoloV5 target detection model to obtain regional position information and segmentation pictures of each fishing boat target, and inputting the regional position information and segmentation pictures of the fishing boat target and the video frame sequence into a trained fishing boat target tracking model to obtain a fishing boat target tracking result.
2. The method for detecting and tracking the target of the fishing boat based on the dynamic video according to claim 1, wherein the method comprises the following steps: in the step S1, a fishing boat picture can be obtained from a network public data set; the dynamic video of the fishing boat can be obtained from a monitoring camera erected on a wharf of the fishing boat.
3. The method for detecting and tracking the target of the fishing boat based on the dynamic video according to claim 1, wherein the method comprises the following steps: in the step S2, the data enhancement processing is performed on the fishing boat picture data set, the data enhancement processing includes picture rotation, scaling and background filling, the enhanced data set sample is divided into a training set and a testing set, the training set is used for training the YoloV5 target detection model, and the testing set is used for verifying the effect of the YoloV5 target detection model.
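The enhancement operations named in claim 3 and the 8:2 split can be sketched with Pillow and the standard library. This is an illustrative sketch only: the rotation angle, scale factor, padding size, and fill color are arbitrary example values, not parameters specified by the patent:

```python
import random
from PIL import Image

def augment(img):
    """One copy per enhancement named in claim 3: rotation, scaling,
    and background filling (angles/sizes here are illustrative)."""
    rotated = img.rotate(15, expand=True, fillcolor=(128, 128, 128))
    scaled = img.resize((img.width // 2, img.height // 2))
    padded = Image.new("RGB", (img.width + 40, img.height + 40),
                       (128, 128, 128))            # filled background
    padded.paste(img, (20, 20))
    return [rotated, scaled, padded]

def split_8_2(samples, seed=0):
    """Shuffle and split the enhanced samples 8:2 into train/test sets."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    cut = int(len(samples) * 0.8)
    return samples[:cut], samples[cut:]

augmented = augment(Image.new("RGB", (200, 100), "navy"))
train, test = split_8_2(range(100))
```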
4. The method for detecting and tracking the target of the fishing boat based on the dynamic video according to claim 1, wherein the method comprises the following steps: in the step S3, the fishing boat picture is input into the trained YoloV5 target detection model, the identification frames of a plurality of fishing boat targets are output, and the picture cutting is directly performed on the original fishing boat picture by utilizing the regional position information of the identification frames, so that a plurality of pictures containing only one fishing boat target can be obtained.
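The cropping step in claims 1 and 4 maps directly onto the PIL library the patent names. A minimal sketch, with made-up box coordinates for the demo; `crop_targets` is our own helper name:

```python
from PIL import Image

def crop_targets(img, boxes):
    """Cut each identification-frame region (x1, y1, x2, y2) out of the
    original picture, giving one picture per fishing boat target."""
    return [img.crop(box) for box in boxes]

# Demo on a synthetic picture standing in for a monitoring frame.
picture = Image.new("RGB", (640, 480), "navy")
crops = crop_targets(picture, [(10, 20, 110, 90), (300, 100, 420, 260)])
```

Each crop has size (x2 − x1, y2 − y1), so every output picture contains exactly one detected region.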
5. The method for detecting and tracking the target of the fishing boat based on the dynamic video according to claim 1, wherein the method comprises the following steps: in the step S4, the target tracking model is divided into two parts, namely the improved MobileNetV3 network and the RPN sub-network, as shown in fig. 2. The improved MobileNetV3 network is used for extracting image features, and the RPN sub-network classifies and regresses the extracted image features to obtain the final fishing boat target tracking result.
The improved MobileNetV3 network adopts a double-branch structure divided into a template branch and a search branch; the two branches use the same improved MobileNetV3 model, share training parameters, and extract the template region image features and the search region image features respectively.
The structure of the improved MobileNetV3 model is shown in figure 3. A Transformer-based cross attention mechanism (Cross Attention Transformer, CAT) is introduced to improve MobileNetV3, reducing the calculation cost while maintaining good feature extraction performance.
The overall structure of CAT is shown in fig. 4; feature extraction is divided into the following four stages, and the cross attention block (Cross Attention Block, CAB) is shown in fig. 5:
(1) The first stage splits the input image into patches of height H_1 = H/P and width W_1 = W/P and increases the number of channels to C_1; the output feature shape at this point is F_1 = H_1 × W_1 × C_1;
(2) In the second stage, the patch projection layer performs a space-to-depth operation, turning each pixel block of shape 2 × 2 × C into shape 1 × 1 × 4C, which the linear projection layer then projects to 1 × 1 × 2C; the length and width of the feature map are thus halved while the channel dimension is doubled, and the output feature shape is F_2 = H_1/2 × W_1/2 × C_2;
(3) The third and fourth stages likewise execute the patch projection layer to perform the space-to-depth operation;
(4) After the four stages of processing, feature maps F_1, F_2, F_3, F_4 of four different scales and dimensions are obtained.
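The space-to-depth step of the patch projection layer is a pure tensor rearrangement. The NumPy sketch below shows only that rearrangement (the follow-up linear projection from 4C down to 2C channels is omitted), and `space_to_depth` is our own helper name:

```python
import numpy as np

def space_to_depth(x):
    """Stage-two patch projection's space-to-depth step: every
    2 x 2 x C pixel block becomes 1 x 1 x 4C, halving H and W."""
    H, W, C = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, C)
    x = x.transpose(0, 2, 1, 3, 4)      # gather each 2x2 neighborhood
    return x.reshape(H // 2, W // 2, 4 * C)

f = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
out = space_to_depth(f)
```

Each output position concatenates the four neighboring input pixels channel-wise, which is exactly the 2 × 2 × C → 1 × 1 × 4C reshaping described in stage (2).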
The RPN sub-network, shown in fig. 2, includes a classification branch and a regression branch. The classification branch distinguishes the target from the background, and the regression branch outputs a more accurate target tracking position. In the classification branch, the template branch outputs a feature map with 2k channels for the targets and backgrounds of the k anchors (where k is the number of preselected anchor boxes per location). In the regression branch, the template branch output feature map has 4k channels, corresponding to the 4 position regression parameters of each of the k anchors.
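The 2k/4k channel bookkeeping above can be made explicit in a one-liner; `rpn_head_channels` and the choice k = 5 are ours, for illustration only:

```python
def rpn_head_channels(k):
    """Output channel counts of the RPN head for k anchors per location:
    2 class scores (target/background) and 4 box-regression parameters
    per anchor."""
    return {"cls": 2 * k, "reg": 4 * k}

channels = rpn_head_channels(5)   # k = 5 anchors is just an example
```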
Multi-scale coding fusion of the image features is carried out on the basis of the Transformer attention mechanism in the RPN classification branch and regression branch respectively to obtain the final target tracking output response map. The network structure of the attention mechanism is shown in FIG. 6. The input of the network is X, which the convolution-layer mapping F_tr converts into a feature map U of a given size. The network then performs a Squeeze operation F_sq on U, i.e., the spatial features u_c of each channel of U are encoded into a global feature z_c:

z_c = F_sq(u_c) = (1/(H × W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j) (1)

where H and W are the height and width of the original image features, respectively. The Squeeze operation yields a global descriptor for each channel, after which the Excitation operation F_ex learns the relations among the channels and finally produces an adaptive weight for each channel:
s_c = F_ex(z_c, W) = σ(W_2 g(W_1 z_c)) (2)
where W_1 is the dimensionality-reduction linear transformation matrix, W_2 is the corresponding dimensionality-restoration matrix, τ is the dimensionality-reduction hyperparameter, σ is the Sigmoid activation function, g is the ReLU activation function, and s_c is the adaptive weight of each channel.
The final fishing boat target tracking output response map U′ is obtained by linearly weighting U channel by channel with the learned per-channel weights s_c through F_scale:

u′_c = F_scale(u_c, s_c) = s_c · u_c (3)
In summary, the specific steps of training the target tracking model in step S4 are as follows:
S4-1, inputting the first frame picture of a training set video frame sequence into the trained YoloV5 target detection model, and outputting the regional position information of all fishing boat targets in the picture together with the fishing boat target pictures cut out according to that regional position information;
S4-2, inputting the regional position information of a single fishing boat target and the corresponding fishing boat target picture into the template branch of the improved MobileNetV3 network of the target tracking model in label order, inputting the training set video frame sequence into the search branch of the improved MobileNetV3 network in time order, and obtaining the template region features and the search region features after network convolution calculation;
S4-3, inputting the template region features and the search region features into the classification branch and the regression branch of the RPN sub-network respectively, carrying out multi-scale coding fusion of the template region features and the search region features with a Transformer-based attention mechanism, and outputting the final response map of the fishing boat target tracking result;
S4-4, repeating steps S4-1 to S4-3, inputting the training set video frame sequence into the target tracking model for training, with each module self-learning and adjusting the network training parameters, until the response map output by the fishing boat target tracking model achieves the expected effect.
6. The method for detecting and tracking the target of the fishing boat based on the dynamic video according to claim 1, wherein the method comprises the following steps: in the step S5, a first frame of picture of a video frame sequence of the test set is input into a trained YoloV5 target detection model to obtain regional position information and segmentation pictures of each fishing boat target, and the regional position information and segmentation pictures of the fishing boat target and the video frame sequence are input into a trained fishing boat target tracking model together to obtain a fishing boat target tracking result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310328937.0A CN116935332A (en) | 2023-03-30 | 2023-03-30 | Fishing boat target detection and tracking method based on dynamic video |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116935332A true CN116935332A (en) | 2023-10-24 |
Family
ID=88386805
Cited By (3)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN117437595A (*) | 2023-11-27 | 2024-01-23 | 哈尔滨航天恒星数据系统科技有限公司 | Fishing boat boundary crossing early warning method based on deep learning
CN117557785A (*) | 2024-01-11 | 2024-02-13 | 宁波海上鲜信息技术股份有限公司 | Image processing-based long-distance fishing boat plate recognition method
CN117557785B (*) | 2024-01-11 | 2024-04-02 | 宁波海上鲜信息技术股份有限公司 | Image processing-based long-distance fishing boat plate recognition method
Legal Events

Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |