CN116935332A - Fishing boat target detection and tracking method based on dynamic video - Google Patents
- Publication number
- CN116935332A CN116935332A CN202310328937.0A CN202310328937A CN116935332A CN 116935332 A CN116935332 A CN 116935332A CN 202310328937 A CN202310328937 A CN 202310328937A CN 116935332 A CN116935332 A CN 116935332A
- Authority
- CN
- China
- Prior art keywords
- fishing boat
- target
- training
- model
- tracking
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/52—Surveillance or monitoring of activities, e.g. for recognising suspicious objects
- G06V20/54—Surveillance or monitoring of activities, e.g. for recognising suspicious objects of traffic, e.g. cars on the road, trains or boats
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0464—Convolutional networks [CNN, ConvNet]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/26—Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/764—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/774—Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/806—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/77—Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
- G06V10/80—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
- G06V10/809—Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/70—Arrangements for image or video recognition or understanding using pattern recognition or machine learning
- G06V10/82—Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/41—Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V2201/00—Indexing scheme relating to image or video recognition or understanding
- G06V2201/07—Target detection
Abstract
The invention discloses a fishing boat target detection and tracking method based on dynamic video, and relates to the field of ship target detection and tracking. S1, collecting and building a fishing boat picture data set and a fishing boat video data set. S2, training a YoloV5 target detection model with the fishing boat picture data set. S3, inputting fishing boat pictures into the trained YoloV5 target detection model to obtain the region coordinates of each fishing boat target, and segmenting the pictures with the PIL library according to the region coordinate information. S4, training a target tracking model based on an improved MobileNetV3 network and an RPN sub-network with the framed fishing boat video data set. S5, testing the target tracking model: inputting the first frame of any video frame sequence of the fishing boat video data set into the trained YoloV5 target detection model to obtain the region position information and segmented picture of each fishing boat target, and inputting these together with the video frame sequence into the trained target tracking model to obtain the fishing boat target tracking result.
Description
Technical Field
The invention relates to the field of ship target detection and tracking, in particular to a fishing ship target detection and tracking method based on dynamic video.
Background
With the continuous development of computer vision in recent years, deep-learning-based target detection and tracking technology has gradually been applied to the behavior supervision of marine traffic ships. However, because ship targets in the marine environment are easily affected by ship pose, lighting changes, complex backgrounds and similar problems, ship target detection and tracking results have low robustness and poor effect.
In the field of target detection, the YoloV5 algorithm performs excellently on general targets and can accurately identify image targets. In the field of target tracking, the most mainstream tracking algorithm is SiamRPN, which greatly improved tracking accuracy while achieving faster computation; however, its network model has too many parameters and too large a computation load, making it unsuitable for target tracking in offline real-time scenarios. Meanwhile, SiamRPN generalizes poorly in multi-scale data fusion, and target feature information is easily lost.
Disclosure of Invention
The invention aims to provide a fishing boat target detection and tracking method based on dynamic video, which uses the currently mainstream YoloV5 target detection model to accurately identify fishing boat targets, uses an improved dual-branch MobileNetV3 model combined with a Transformer cross attention mechanism (Cross Attention Transformer, CAT) to extract fishing boat image features, and finally tracks the fishing boat target with an RPN (Region Proposal Network) based on the Transformer attention mechanism.
In order to achieve the above purpose, the invention provides a method for detecting and tracking a fishing boat target based on dynamic video, which comprises the following steps:
s1, constructing a data set: collecting marine fishing boat pictures, and manufacturing a fishing boat picture data set for training a YoloV5 target detection model; collecting historical monitoring videos of monitoring cameras of a fishing boat wharf, and manufacturing a dynamic video data set for training and improving a target tracking model of a MobileNet V3 network and an RPN subnetwork;
s2, training a YoloV5 target detection model: performing data enhancement processing on the fishing boat picture data set, dividing the enhanced data set samples into a training set and a test set at an 8:2 ratio, and training the YoloV5 target detection model with the training set;
s3, realizing target detection and picture segmentation of the fishing boat: inputting the test set into a trained YoloV5 target detection model to obtain region coordinates of a plurality of fishing boat targets, and dividing the pictures by using a PIL library according to the region coordinate information to obtain a plurality of pictures only containing single fishing boat targets.
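The picture segmentation of step S3 — cutting one sub-image per detected region with the PIL library — might look like the sketch below. The (x1, y1, x2, y2) box format is an assumption (the xyxy pixel coordinates YoloV5 commonly reports); the patent does not fix the exact interface.

```python
from PIL import Image

def crop_targets(image, boxes):
    """Crop one sub-image per detected fishing-boat region.

    `boxes` are (x1, y1, x2, y2) pixel coordinates; each returned
    crop contains a single fishing-boat target.
    """
    return [image.crop(tuple(map(int, box))) for box in boxes]

# Toy example: a blank 640x480 "frame" with two detected regions.
frame = Image.new("RGB", (640, 480))
crops = crop_targets(frame, [(10, 20, 110, 220), (300, 50, 500, 400)])
print([c.size for c in crops])  # [(100, 200), (200, 350)]
```

Each crop can then be saved or fed directly to the template branch of the tracking model.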
S4, training a target tracking model: framing the videos of the dynamic video data set to obtain a video frame sequence data set, dividing it into a training set and a test set at an 8:2 ratio, and training a target tracking model based on the improved MobileNetV3 network and the RPN sub-network with the training set.
S5, realizing target tracking of the dynamic video fishing boat: inputting a first frame picture of a video frame sequence of a test set into a trained YoloV5 target detection model to obtain regional position information and segmentation pictures of each fishing boat target, and inputting the regional position information and segmentation pictures of the fishing boat target and the video frame sequence into a trained fishing boat target tracking model to obtain a fishing boat target tracking result.
Preferably, in the step S1, the fishing boat picture may be obtained from a network public data set; the dynamic video of the fishing boat can be obtained from a monitoring camera erected on a wharf of the fishing boat.
Preferably, in step S2, data enhancement processing is performed on the fishing boat picture data set to enrich the data samples; the enhancement includes picture rotation, proportional scaling and background filling. The enhanced data set samples are divided into a training set and a test set, where the training set is used to train the YoloV5 target detection model and the test set is used to verify its effect.
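A minimal sketch of the three enhancement operations (picture rotation, proportional scaling, background filling) using the PIL library; all parameter values, the canvas size and the fill color are illustrative assumptions, not values fixed by the patent.

```python
from PIL import Image

def augment(img, angle=15, scale=0.8, canvas=(640, 640), fill=(114, 114, 114)):
    """One enhancement pass: rotate, scale proportionally, then paste
    onto a fixed-size background canvas (letterbox-style filling)."""
    rotated = img.rotate(angle, expand=True)                   # picture rotation
    w, h = rotated.size
    scaled = rotated.resize((int(w * scale), int(h * scale)))  # proportional scaling
    board = Image.new("RGB", canvas, fill)                     # background filling
    ox = (canvas[0] - scaled.width) // 2
    oy = (canvas[1] - scaled.height) // 2
    board.paste(scaled, (ox, oy))
    return board

sample = Image.new("RGB", (400, 300), (0, 0, 255))
out = augment(sample)
print(out.size)  # (640, 640)
```

Applying this with several angle/scale combinations per source image multiplies the number of training samples.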
Preferably, in step S3, the fishing boat picture is input into the trained YoloV5 target detection model, which outputs the identification frames of a plurality of fishing boat targets; picture cutting is then performed directly on the original fishing boat picture using the region position information of those identification frames, obtaining a plurality of pictures each containing one fishing boat target.
Preferably, in the step S4, the target tracking model is divided into two parts, i.e. the modified MobileNetV3 network and the RPN subnetwork, as shown in fig. 2. The improved MobileNet V3 network is used for extracting image features, and the RPN subnetwork is used for classifying and regressing the extracted image features to obtain a final fishing boat target tracking result.
The improved MobileNetV3 network adopts a dual-branch structure consisting of a template branch and a search branch; the two branches use the same improved MobileNetV3 model, share training parameters, and respectively extract template region image features and search region image features.
The structure of the MobileNetV3 model is shown in fig. 3; a Transformer-based cross attention mechanism (Cross Attention Transformer, CAT) is introduced to improve MobileNetV3, reducing computation cost while maintaining good feature extraction performance.
The overall structure of CAT is shown in fig. 4, and the feature extraction is divided into the following four stages, and the cross attention block (Cross Attention Block, CAB) is shown in fig. 5:
(1) In the first stage, the input image is split into patches with height H₁ = H/P and width W₁ = W/P, and the number of channels is increased to C₁; the output feature shape is F₁ = H₁ × W₁ × C₁;
(2) In the second stage, the patch projection layer performs a space-to-depth operation, rearranging each pixel block of shape 2 × 2 × C into shape 1 × 1 × 4C, after which a linear projection layer projects it to 1 × 1 × 2C; the length and width of the feature map are halved and the channel dimension is doubled, giving an output feature shape of F₂ = H₁/2 × W₁/2 × C₂;
(3) The third and fourth stages likewise execute the patch projection layer's space-to-depth operation;
(4) After the four stages, feature maps F₁, F₂, F₃, F₄ of four different scales and dimensions are obtained.
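The space-to-depth rearrangement plus linear projection of the second stage can be sketched in NumPy as follows. The spatial size, channel count and the random projection matrix are illustrative only; in CAT the projection is a learned linear layer.

```python
import numpy as np

def space_to_depth(x):
    """Rearrange each 2x2xC pixel block into 1x1x4C:
    (H, W, C) -> (H/2, W/2, 4C)."""
    h, w, c = x.shape
    x = x.reshape(h // 2, 2, w // 2, 2, c)
    x = x.transpose(0, 2, 1, 3, 4)
    return x.reshape(h // 2, w // 2, 4 * c)

def patch_projection(x, proj):
    """Space-to-depth followed by a linear projection 4C -> 2C,
    halving the spatial size and doubling the channel dimension."""
    return space_to_depth(x) @ proj  # proj has shape (4C, 2C)

rng = np.random.default_rng(0)
f1 = rng.standard_normal((56, 56, 96))           # e.g. H1 x W1 x C1
w_proj = rng.standard_normal((4 * 96, 2 * 96))   # stand-in for the learned layer
f2 = patch_projection(f1, w_proj)
print(f2.shape)  # (28, 28, 192)
```

Repeating the same operation in the later stages yields the four feature maps of decreasing resolution and increasing depth.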
The RPN sub-network is shown in fig. 2 and includes a classification branch and a regression branch. The classification branch distinguishes the target from the background, and the regression branch outputs a more accurate target tracking position. In the classification branch, the template branch outputs a feature map with 2k channels for the targets and backgrounds of the k anchors (where k is the number of pre-selected anchor boxes per location). In the regression branch, the template branch outputs a feature map with 4k channels, corresponding to the 4 position regression parameters of the k anchors.
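The 2k / 4k channel bookkeeping can be made concrete with a tiny helper; k = 5 is only an example value, not one fixed by the patent.

```python
def rpn_head_channels(k):
    """For k anchors per location, the classification branch needs 2k
    channels (target / background score per anchor) and the regression
    branch needs 4k channels (4 position parameters per anchor)."""
    return {"cls": 2 * k, "reg": 4 * k}

# Example with 5 anchor shapes per location:
print(rpn_head_channels(5))  # {'cls': 10, 'reg': 20}
```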
In the RPN classification branch and regression branch, multi-scale coding fusion is performed on the image features on the basis of the Transformer attention mechanism to obtain the final target tracking output response diagram. The network structure of the attention mechanism is shown in FIG. 6: the network input X is converted by the convolution-layer mapping F_tr into a feature map U of a given size. The network then performs a Squeeze operation F_sq on U, i.e., the spatial feature u_c of each channel of U is encoded into a global feature z_c:
z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j) (1)
where H and W are the height and width of the original image features, respectively. The Squeeze operation obtains a global description of each channel; an Excitation operation F_ex then learns the relations among the channels and finally obtains the adaptive weight of each channel:
s_c = F_ex(z_c, W) = σ(W₂ g(W₁ z_c)) (2)
where W₁ and W₂ are linear transformation matrices, τ is a dimension-reduction hyperparameter, σ is the Sigmoid activation function, g is the ReLU activation function, and s_c is the adaptive weight of each channel.
The final fishing boat target tracking output response diagram U′ is obtained by channel-by-channel linear weighting of U with the learned channel weights s_c through F_scale:
u′_c = F_scale(u_c, s_c) = s_c · u_c (3)
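A minimal NumPy sketch of the squeeze / excitation / reweighting chain of equations (1)–(3); the tensor layout (H, W, C), the reduction factor and the random weight matrices are illustrative assumptions standing in for the learned parameters.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def se_reweight(u, w1, w2):
    """Channel attention: squeeze each channel to a global descriptor,
    excite through two linear maps (ReLU then Sigmoid), then rescale
    the channels of U.  u: (H, W, C); w1: (C, C//tau); w2: (C//tau, C)."""
    z = u.mean(axis=(0, 1))                  # (1) squeeze: global average per channel
    s = sigmoid(np.maximum(z @ w1, 0) @ w2)  # (2) excitation: adaptive channel weights
    return u * s                             # (3) channel-by-channel reweighting

rng = np.random.default_rng(1)
u = rng.standard_normal((8, 8, 16))
w1 = rng.standard_normal((16, 4))   # tau = 4 reduction, illustrative
w2 = rng.standard_normal((4, 16))
u_prime = se_reweight(u, w1, w2)
print(u_prime.shape)  # (8, 8, 16)
```

Note the matrices are applied in row-vector convention (z @ w1), which is the transpose of the column-vector form W₁z in equation (2).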
In summary, the specific steps for training the target tracking model in step S4 are as follows:
S4-1, inputting the first frame picture of a training-set video frame sequence into the trained YoloV5 target detection model, which outputs the region position information of all fishing boat targets in the picture and the fishing boat target pictures cut out according to that region position information;
S4-2, inputting the region position information of a single fishing boat target and the corresponding fishing boat target picture, in label order, into the template branch of the target tracking model's improved MobileNetV3 network, and inputting the training-set video frame sequence, in time order, into the search branch of the improved MobileNetV3 network; after network convolution calculation, the template region features and search region features are obtained;
S4-3, inputting the template region features and search region features into the classification branch and regression branch of the RPN sub-network respectively, and outputting the final response diagram of the fishing boat target tracking result after multi-scale coding fusion of the features by the Transformer-based attention mechanism;
S4-4, repeating steps S4-1 to S4-3, inputting the training-set video frame sequences into the target tracking model for training, with each module self-learning and adjusting the network training parameters until the response diagram output by the fishing boat target tracking model achieves the expected effect.
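The data flow of steps S4-1 to S4-3 can be sketched with stub components; every name below is a placeholder for the corresponding trained module, not an API defined by the patent.

```python
# Stubs standing in for the trained detector, the shared backbone
# branches, and the RPN head.
def detect_first_frame(frame):             # S4-1: YoloV5 on frame 1
    return [{"box": (10, 10, 60, 90), "crop": "boat_0"}]

def backbone(x):                           # shared improved-MobileNetV3 branch
    return ("feat", x)

def rpn_head(template_feat, search_feat):  # classification + regression fusion
    return {"response": (template_feat, search_feat)}

def track_sequence(frames):
    targets = detect_first_frame(frames[0])
    template_feat = backbone(targets[0]["crop"])   # S4-2: template branch
    responses = []
    for frame in frames[1:]:
        search_feat = backbone(frame)              # S4-2: search branch
        responses.append(rpn_head(template_feat, search_feat))  # S4-3
    return responses

maps = track_sequence(["frame0", "frame1", "frame2"])
print(len(maps))  # 2
```

In training (S4-4), the loop above runs over every training-set sequence while the backbone and RPN parameters are updated.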
Preferably, in step S5, a first frame of the video frame sequence of the test set is input to a trained YoloV5 target detection model to obtain region position information and a segmentation picture of each fishing boat target, and the region position information and the segmentation picture of the fishing boat target are input to a trained fishing boat target tracking model together with the video frame sequence to obtain a fishing boat target tracking result.
The fishing boat target detection and tracking method based on dynamic video has the following advantages and positive effects:
according to the method for detecting and tracking the target of the fishing boat based on the dynamic video, the detection part utilizes the existing most mainstream YoloV5 target detection model to identify the target of the fishing boat, so that the identification effect is excellent; the tracking part extracts image features by utilizing an improved lightweight MobileNet V3 network, and simultaneously uses an RPN sub-network to perform classification and regression operation after multi-scale coding fusion on the extracted features, so that the total frame greatly reduces the number of the participated calculation parameters, improves the target tracking calculation speed, and simultaneously improves the accuracy of target tracking, thereby enabling the model to be more suitable for fishing boat target detection and tracking in a real-time scene.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of a method for detecting and tracking targets of a fishing vessel based on dynamic video;
FIG. 2 is a schematic diagram of a target tracking model of an embodiment of a method for detecting and tracking a target of a fishing vessel based on dynamic video;
FIG. 3 is a schematic diagram of a MobileNet V3 model structure of an embodiment of a method for detecting and tracking a fishing vessel target based on dynamic video;
FIG. 4 is an overall block diagram of CAT of an embodiment of a method for detecting and tracking targets of a fishing vessel based on dynamic video according to the present invention;
FIG. 5 is a schematic diagram of a cross attention block CAB in a CAT structure of an embodiment of a method for detecting and tracking targets of a fishing boat based on dynamic video according to the present invention;
FIG. 6 is a schematic diagram of a network structure of an attention mechanism of an embodiment of a method for detecting and tracking targets of a fishing vessel based on dynamic video.
Detailed Description
The technical scheme of the invention is further described below through the attached drawings and the embodiments.
Examples
As shown in fig. 1, a method for detecting and tracking a target of a fishing vessel based on dynamic video comprises the following steps:
s1, constructing a data set: collecting marine fishing boat pictures, and manufacturing a fishing boat picture data set for training a YoloV5 target detection model; collecting historical monitoring videos of monitoring cameras of a fishing boat wharf, and manufacturing a dynamic video data set for training and improving a target tracking model of a MobileNet V3 network and an RPN subnetwork; the fishing boat picture can be obtained from the network public data set; the dynamic video of the fishing boat can be obtained from a monitoring camera erected on a wharf of the fishing boat.
S2, training a YoloV5 target detection model: performing data enhancement processing on the fishing boat picture data set to enrich the data samples; the enhancement includes picture rotation, proportional scaling and background filling. The enhanced data set samples are divided into a training set and a test set at an 8:2 ratio; the training set is used to train the YoloV5 target detection model and the test set is used to verify its effect.
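The 8:2 division used here (and again for the frame-sequence data set in S4) can be sketched as follows; the seed and helper name are illustrative.

```python
import random

def split_8_2(samples, seed=42):
    """Shuffle and divide samples into a training set and a test set
    at an 8:2 ratio."""
    items = list(samples)
    random.Random(seed).shuffle(items)
    cut = int(len(items) * 0.8)
    return items[:cut], items[cut:]

train, test = split_8_2(range(100))
print(len(train), len(test))  # 80 20
```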
S3, realizing target detection and picture segmentation of the fishing boat: inputting the test set into a trained YoloV5 target detection model to obtain region coordinates of a plurality of fishing boat targets, and dividing the pictures by using a PIL library according to the region coordinate information to obtain a plurality of pictures only containing single fishing boat targets.
S4, training a target tracking model: framing the videos of the dynamic video data set to obtain a video frame sequence data set, dividing it into a training set and a test set at an 8:2 ratio, and training a target tracking model based on the improved MobileNetV3 network and the RPN sub-network with the training set.
The target tracking model is divided into a modified MobileNetV3 network and an RPN subnetwork, as shown in fig. 2. The improved MobileNet V3 network is used for extracting image features, and the RPN subnetwork is used for classifying and regressing the extracted image features to obtain a final fishing boat target tracking result.
The improved MobileNetV3 network adopts a dual-branch structure consisting of a template branch and a search branch; the two branches use the same improved MobileNetV3 model, share training parameters, and respectively extract template region image features and search region image features.
The structure of the MobileNetV3 model is shown in fig. 3; a Transformer-based cross attention mechanism (Cross Attention Transformer, CAT) is introduced to improve MobileNetV3, reducing computation cost while maintaining good feature extraction performance.
The overall structure of CAT is shown in fig. 4, and the feature extraction is divided into the following four stages, and the cross attention block (Cross Attention Block, CAB) is shown in fig. 5:
(1) In the first stage, the input image is split into patches with height H₁ = H/P and width W₁ = W/P, and the number of channels is increased to C₁; the output feature shape is F₁ = H₁ × W₁ × C₁;
(2) In the second stage, the patch projection layer performs a space-to-depth operation, rearranging each pixel block of shape 2 × 2 × C into shape 1 × 1 × 4C, after which a linear projection layer projects it to 1 × 1 × 2C; the length and width of the feature map are halved and the channel dimension is doubled, giving an output feature shape of F₂ = H₁/2 × W₁/2 × C₂;
(3) The third and fourth stages likewise execute the patch projection layer's space-to-depth operation;
(4) After the four stages, feature maps F₁, F₂, F₃, F₄ of four different scales and dimensions are obtained.
The RPN sub-network is shown in fig. 2 and includes a classification branch and a regression branch. The classification branch distinguishes the target from the background, and the regression branch outputs a more accurate target tracking position. In the classification branch, the template branch outputs a feature map with 2k channels for the targets and backgrounds of the k anchors (where k is the number of pre-selected anchor boxes per location). In the regression branch, the template branch outputs a feature map with 4k channels, corresponding to the 4 position regression parameters of the k anchors.
In the RPN classification branch and regression branch, multi-scale coding fusion is performed on the image features on the basis of the Transformer attention mechanism to obtain the final target tracking output response diagram. The network structure of the attention mechanism is shown in FIG. 6: the network input X is converted by the convolution-layer mapping F_tr into a feature map U of a given size. The network then performs a Squeeze operation F_sq on U, i.e., the spatial feature u_c of each channel of U is encoded into a global feature z_c:
z_c = F_sq(u_c) = (1 / (H × W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j) (1)
where H and W are the height and width of the original image features, respectively. The Squeeze operation obtains a global description of each channel; an Excitation operation F_ex then learns the relations among the channels and finally obtains the adaptive weight of each channel:
s_c = F_ex(z_c, W) = σ(W₂ g(W₁ z_c)) (2)
where W₁ and W₂ are linear transformation matrices, τ is a dimension-reduction hyperparameter, σ is the Sigmoid activation function, g is the ReLU activation function, and s_c is the adaptive weight of each channel.
The final target tracking output response diagram U′ is obtained by channel-by-channel linear weighting of U with the learned channel weights s_c through F_scale:
u′_c = F_scale(u_c, s_c) = s_c · u_c (3)
In summary, the specific steps for training the target tracking model in step S4 are as follows:
S4-1, inputting the first frame picture of a training-set video frame sequence into the trained YoloV5 target detection model, which outputs the region position information of all fishing boat targets in the picture and the fishing boat target pictures cut out according to that region position information;
S4-2, inputting the region position information of a single fishing boat target and the corresponding fishing boat target picture, in label order, into the template branch of the target tracking model's improved MobileNetV3 network, and inputting the training-set video frame sequence, in time order, into the search branch of the improved MobileNetV3 network; after network convolution calculation, the template region features and search region features are obtained;
S4-3, inputting the template region features and search region features into the classification branch and regression branch of the RPN sub-network respectively, and outputting the final response diagram of the fishing boat target tracking result after multi-scale coding fusion of the features by the Transformer-based attention mechanism;
S4-4, repeating steps S4-1 to S4-3, inputting the training-set video frame sequences into the target tracking model for training, with each module self-learning and adjusting the network training parameters until the response diagram output by the fishing boat target tracking model achieves the expected effect.
S5, realizing target tracking of the dynamic video fishing boat: inputting a first frame picture of a video frame sequence of a test set into a trained YoloV5 target detection model to obtain regional position information and segmentation pictures of each fishing boat target, and inputting the regional position information and segmentation pictures of the fishing boat target and the video frame sequence into a trained target tracking model to obtain a fishing boat target tracking result.
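The detect-then-track flow of step S5 can be outlined as plain control flow. This is a minimal sketch of the orchestration only; `run_tracking`, `detect_boats`, `make_tracker`, and the stub tracker below are hypothetical stand-ins for the trained YoloV5 detector and the MobileNetV3 + RPN tracker, not the patent's implementation:

```python
def run_tracking(frames, detect_boats, make_tracker):
    """Step S5 control flow: detect boats in the first frame, then feed
    every later frame to one per-boat tracker (the search branch)."""
    tracks = []
    for box, crop in detect_boats(frames[0]):      # YoloV5 stand-in
        tracks.append((make_tracker(box, crop), [box]))
    for frame in frames[1:]:
        for tracker, history in tracks:
            history.append(tracker.update(frame))
    return [history for _, history in tracks]

# Tiny demo with a stub tracker that drifts one pixel right per frame.
class StubTracker:
    def __init__(self, box, crop):
        self.box = box
    def update(self, frame):
        x, y, w, h = self.box
        self.box = (x + 1, y, w, h)
        return self.box

paths = run_tracking(["f0", "f1", "f2"],
                     lambda f: [((0, 0, 10, 10), "crop")],
                     StubTracker)
```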
Finally, it should be noted that the above embodiments are intended only to illustrate, not to limit, the technical solution of the present invention. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that the technical solution of the present invention may be modified or equivalently substituted without departing from the spirit and scope of the technical solution of the present invention.
Claims (6)
1. The method for detecting and tracking the target of the fishing boat based on the dynamic video is characterized by comprising the following steps of:
S1, constructing data sets: collecting marine fishing boat pictures, and making a fishing boat picture data set for training the YoloV5 target detection model; collecting historical monitoring videos from the monitoring cameras of a fishing boat wharf, and making a dynamic video data set for training the target tracking model based on the improved MobileNetV3 network and the RPN sub-network;
S2, training the YoloV5 target detection model: carrying out data enhancement processing on the fishing boat picture data set, dividing the enhanced data set samples into a training set and a test set at a ratio of 8:2, and training the YoloV5 target detection model with the training set;
S3, realizing fishing boat target detection and picture segmentation: inputting the test set into the trained YoloV5 target detection model to obtain the region coordinates of a plurality of fishing boat targets, and cropping the pictures with the PIL library according to the region coordinate information to obtain a plurality of pictures each containing only a single fishing boat target;
S4, training the target tracking model: carrying out framing processing on the videos of the dynamic video data set to obtain a video frame sequence data set, dividing the video frame sequence data set into a training set and a test set at a ratio of 8:2, and training the target tracking model based on the improved MobileNetV3 network and the RPN sub-network with the training set.
S5, realizing target tracking of the dynamic video fishing boat: inputting a first frame picture of a video frame sequence of a test set into a trained YoloV5 target detection model to obtain regional position information and segmentation pictures of each fishing boat target, and inputting the regional position information and segmentation pictures of the fishing boat target and the video frame sequence into a trained fishing boat target tracking model to obtain a fishing boat target tracking result.
2. The method for detecting and tracking the target of the fishing boat based on the dynamic video according to claim 1, wherein the method comprises the following steps: in the step S1, a fishing boat picture can be obtained from a network public data set; the dynamic video of the fishing boat can be obtained from a monitoring camera erected on a wharf of the fishing boat.
3. The method for detecting and tracking the target of the fishing boat based on the dynamic video according to claim 1, wherein the method comprises the following steps: in the step S2, the data enhancement processing is performed on the fishing boat picture data set, the data enhancement processing includes picture rotation, scaling and background filling, the enhanced data set sample is divided into a training set and a testing set, the training set is used for training the YoloV5 target detection model, and the testing set is used for verifying the effect of the YoloV5 target detection model.
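The enhancement operations named in claim 3 and the 8:2 split can be sketched with Pillow and the standard library. This is an illustrative sketch only: the rotation angle, scale factor, padding size, and fill color are arbitrary example values, not parameters specified by the patent:

```python
import random
from PIL import Image

def augment(img):
    """One copy per enhancement named in claim 3: rotation, scaling,
    and background filling (angles/sizes here are illustrative)."""
    rotated = img.rotate(15, expand=True, fillcolor=(128, 128, 128))
    scaled = img.resize((img.width // 2, img.height // 2))
    padded = Image.new("RGB", (img.width + 40, img.height + 40),
                       (128, 128, 128))            # filled background
    padded.paste(img, (20, 20))
    return [rotated, scaled, padded]

def split_8_2(samples, seed=0):
    """Shuffle and split the enhanced samples 8:2 into train/test sets."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    cut = int(len(samples) * 0.8)
    return samples[:cut], samples[cut:]

augmented = augment(Image.new("RGB", (200, 100), "navy"))
train, test = split_8_2(range(100))
```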
4. The method for detecting and tracking the target of the fishing boat based on the dynamic video according to claim 1, wherein the method comprises the following steps: in the step S3, the fishing boat picture is input into the trained YoloV5 target detection model, the identification frames of a plurality of fishing boat targets are output, and the picture cutting is directly performed on the original fishing boat picture by utilizing the regional position information of the identification frames, so that a plurality of pictures containing only one fishing boat target can be obtained.
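The cropping step in claims 1 and 4 maps directly onto the PIL library the patent names. A minimal sketch, with made-up box coordinates for the demo; `crop_targets` is our own helper name:

```python
from PIL import Image

def crop_targets(img, boxes):
    """Cut each identification-frame region (x1, y1, x2, y2) out of the
    original picture, giving one picture per fishing boat target."""
    return [img.crop(box) for box in boxes]

# Demo on a synthetic picture standing in for a monitoring frame.
picture = Image.new("RGB", (640, 480), "navy")
crops = crop_targets(picture, [(10, 20, 110, 90), (300, 100, 420, 260)])
```

Each crop has size (x2 − x1, y2 − y1), so every output picture contains exactly one detected region.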
5. The method for detecting and tracking the target of the fishing boat based on the dynamic video according to claim 1, wherein the method comprises the following steps: in the step S4, the target tracking model is divided into two parts, namely the improved MobileNetV3 network and the RPN sub-network, as shown in fig. 2. The improved MobileNetV3 network is used for extracting image features, and the RPN sub-network classifies and regresses the extracted image features to obtain the final fishing boat target tracking result.
The improved MobileNetV3 network adopts a double-branch structure divided into a template branch and a search branch; the two branches use the same improved MobileNetV3 model, share training parameters, and extract the template region image features and the search region image features respectively.
The structure of the improved MobileNetV3 model is shown in figure 3. A Transformer-based cross attention mechanism (Cross Attention Transformer, CAT) is introduced to improve MobileNetV3, reducing the calculation cost while maintaining good feature extraction performance.
The overall structure of CAT is shown in fig. 4; feature extraction is divided into the following four stages, and the cross attention block (Cross Attention Block, CAB) is shown in fig. 5:
(1) The first stage splits the input image into patches of height H_1 = H/P and width W_1 = W/P and increases the number of channels to C_1; the output feature shape at this point is F_1 = H_1 × W_1 × C_1;
(2) In the second stage, the patch projection layer performs a space-to-depth operation, turning each pixel block of shape 2 × 2 × C into shape 1 × 1 × 4C, which the linear projection layer then projects to 1 × 1 × 2C; the length and width of the feature map are thus halved while the channel dimension is doubled, and the output feature shape is F_2 = H_1/2 × W_1/2 × C_2;
(3) The third and fourth stages likewise execute the patch projection layer to perform the space-to-depth operation;
(4) After the four stages of processing, feature maps F_1, F_2, F_3, F_4 of four different scales and dimensions are obtained.
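The space-to-depth step of the patch projection layer is a pure tensor rearrangement. The NumPy sketch below shows only that rearrangement (the follow-up linear projection from 4C down to 2C channels is omitted), and `space_to_depth` is our own helper name:

```python
import numpy as np

def space_to_depth(x):
    """Stage-two patch projection's space-to-depth step: every
    2 x 2 x C pixel block becomes 1 x 1 x 4C, halving H and W."""
    H, W, C = x.shape
    x = x.reshape(H // 2, 2, W // 2, 2, C)
    x = x.transpose(0, 2, 1, 3, 4)      # gather each 2x2 neighborhood
    return x.reshape(H // 2, W // 2, 4 * C)

f = np.arange(8 * 8 * 3, dtype=np.float32).reshape(8, 8, 3)
out = space_to_depth(f)
```

Each output position concatenates the four neighboring input pixels channel-wise, which is exactly the 2 × 2 × C → 1 × 1 × 4C reshaping described in stage (2).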
The RPN sub-network, shown in fig. 2, includes a classification branch and a regression branch. The classification branch distinguishes the target from the background, and the regression branch outputs a more accurate target tracking position. In the classification branch, the template branch outputs a feature map with 2k channels for the targets and backgrounds of the k anchors (where k is the number of preselected anchor boxes per location). In the regression branch, the template branch output feature map has 4k channels, corresponding to the 4 position regression parameters of each of the k anchors.
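The 2k/4k channel bookkeeping above can be made explicit in a one-liner; `rpn_head_channels` and the choice k = 5 are ours, for illustration only:

```python
def rpn_head_channels(k):
    """Output channel counts of the RPN head for k anchors per location:
    2 class scores (target/background) and 4 box-regression parameters
    per anchor."""
    return {"cls": 2 * k, "reg": 4 * k}

channels = rpn_head_channels(5)   # k = 5 anchors is just an example
```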
Multi-scale coding fusion of the image features is carried out on the basis of the Transformer attention mechanism in the RPN classification branch and regression branch respectively to obtain the final target tracking output response map. The network structure of the attention mechanism is shown in FIG. 6. The input of the network is X, which the convolution-layer mapping F_tr converts into a feature map U of a given size. The network then performs a Squeeze operation F_sq on U, i.e., the spatial features u_c of each channel of U are encoded into a global feature z_c:

z_c = F_sq(u_c) = (1/(H × W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j) (1)

where H and W are the height and width of the original image features, respectively. The Squeeze operation yields a global descriptor for each channel, after which the Excitation operation F_ex learns the relations among the channels and finally produces an adaptive weight for each channel:
s_c = F_ex(z_c, W) = σ(W_2 g(W_1 z_c)) (2)
where W_1 is the dimensionality-reduction linear transformation matrix, W_2 is the corresponding dimensionality-restoration matrix, τ is the dimensionality-reduction hyperparameter, σ is the Sigmoid activation function, g is the ReLU activation function, and s_c is the adaptive weight of each channel.
The final fishing boat target tracking output response map U′ is obtained by linearly weighting U channel by channel with the learned per-channel weights s_c through F_scale:

u′_c = F_scale(u_c, s_c) = s_c · u_c (3)
In summary, the specific steps of training the target tracking model in step S4 are as follows:
S4-1, inputting the first frame picture of a training set video frame sequence into the trained YoloV5 target detection model, and outputting the regional position information of all fishing boat targets in the picture together with the fishing boat target pictures cut out according to that regional position information;
S4-2, inputting the regional position information of a single fishing boat target and the corresponding fishing boat target picture into the template branch of the improved MobileNetV3 network of the target tracking model in label order, inputting the training set video frame sequence into the search branch of the improved MobileNetV3 network in time order, and obtaining the template region features and the search region features after network convolution calculation;
S4-3, inputting the template region features and the search region features into the classification branch and the regression branch of the RPN sub-network respectively, carrying out multi-scale coding fusion of the template region features and the search region features with a Transformer-based attention mechanism, and outputting the final response map of the fishing boat target tracking result;
S4-4, repeating steps S4-1 to S4-3, inputting the training set video frame sequence into the target tracking model for training, with each module self-learning and adjusting the network training parameters, until the response map output by the fishing boat target tracking model achieves the expected effect.
6. The method for detecting and tracking the target of the fishing boat based on the dynamic video according to claim 1, wherein the method comprises the following steps: in the step S5, a first frame of picture of a video frame sequence of the test set is input into a trained YoloV5 target detection model to obtain regional position information and segmentation pictures of each fishing boat target, and the regional position information and segmentation pictures of the fishing boat target and the video frame sequence are input into a trained fishing boat target tracking model together to obtain a fishing boat target tracking result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310328937.0A CN116935332A (en) | 2023-03-30 | 2023-03-30 | Fishing boat target detection and tracking method based on dynamic video |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116935332A true CN116935332A (en) | 2023-10-24 |
Family
ID=88386805
Cited By (3)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---|
CN117437595A (*) | 2023-11-27 | 2024-01-23 | 哈尔滨航天恒星数据系统科技有限公司 | Fishing boat boundary crossing early warning method based on deep learning
CN117557785A (*) | 2024-01-11 | 2024-02-13 | 宁波海上鲜信息技术股份有限公司 | Image processing-based long-distance fishing boat plate recognition method
CN117557785B (*) | 2024-01-11 | 2024-04-02 | 宁波海上鲜信息技术股份有限公司 | Image processing-based long-distance fishing boat plate recognition method
Legal Events

Date | Code | Title | Description
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |