CN116935332A - Fishing boat target detection and tracking method based on dynamic video - Google Patents

Fishing boat target detection and tracking method based on dynamic video

Info

Publication number
CN116935332A
CN116935332A
Authority
CN
China
Prior art keywords
fishing boat
target
training
model
tracking
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310328937.0A
Other languages
Chinese (zh)
Inventor
刘永桂
黎远梅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202310328937.0A priority Critical patent/CN116935332A/en
Publication of CN116935332A publication Critical patent/CN116935332A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/54Surveillance or monitoring of activities, e.g. for recognising suspicious objects of traffic, e.g. cars on the road, trains or boats
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/809Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of classification results, e.g. where the classifiers operate on the same input data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/49Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a fishing boat target detection and tracking method based on dynamic video, and relates to the field of ship target detection and tracking. S1: collect and build a fishing boat picture dataset and a fishing boat video dataset. S2: train a YoloV5 target detection model with the fishing boat picture dataset. S3: input fishing boat pictures into the trained YoloV5 target detection model to obtain the region coordinates of each fishing boat target, and segment the pictures with the PIL library according to the region coordinate information. S4: train a target tracking model based on an improved MobileNetV3 network and an RPN subnetwork with the frame-split fishing boat video dataset. S5: test the target tracking model: input the first frame of any video frame sequence from the fishing boat video dataset into the trained YoloV5 target detection model to obtain the region position information and a segmented picture of each fishing boat target, then input these together with the video frame sequence into the trained target tracking model to obtain the fishing boat target tracking result.

Description

Fishing boat target detection and tracking method based on dynamic video
Technical Field
The invention relates to the field of ship target detection and tracking, and in particular to a fishing boat target detection and tracking method based on dynamic video.
Background
With the continuous development of computer vision in recent years, deep-learning-based target detection and tracking has gradually been applied to the behavior supervision of marine traffic vessels. However, ship targets in the marine environment are easily affected by problems such as ship pose, illumination changes and complex backgrounds, so ship detection and tracking results have low robustness and poor overall effect.
In the field of target detection, the YoloV5 algorithm performs excellently on general detection tasks and can accurately identify image targets. In the field of target tracking, the mainstream algorithm is SiamRPN, whose proposal greatly improved tracking accuracy while achieving a fast computation speed; however, its network model has too many parameters and too high a computational cost to suit target tracking in real-time scenarios. SiamRPN also generalizes poorly in multi-scale data fusion, and target feature information is easily lost.
Disclosure of Invention
The invention aims to provide a fishing boat target detection and tracking method based on dynamic video, which accurately identifies fishing boat targets with the currently mainstream YoloV5 target detection model, extracts fishing boat image features with an improved dual-branch MobileNetV3 model combined with a Transformer cross attention mechanism (Cross Attention Transformer, CAT), and finally tracks the fishing boat target with a region proposal network (RPN) based on the Transformer attention mechanism.
In order to achieve the above purpose, the invention provides a fishing boat target detection and tracking method based on dynamic video, comprising the following steps:
S1, constructing datasets: collect marine fishing boat pictures and build a fishing boat picture dataset for training the YoloV5 target detection model; collect historical surveillance videos from fishing-dock monitoring cameras and build a dynamic video dataset for training the target tracking model based on the improved MobileNetV3 network and the RPN subnetwork;
S2, training the YoloV5 target detection model: apply data enhancement to the fishing boat picture dataset, split the enhanced dataset samples 8:2 into a training set and a test set, and train the YoloV5 target detection model with the training set;
S3, fishing boat target detection and picture segmentation: input the test set into the trained YoloV5 target detection model to obtain the region coordinates of each fishing boat target, and segment the pictures with the PIL library according to the region coordinate information to obtain several pictures each containing a single fishing boat target;
S4, training the target tracking model: split the videos of the dynamic video dataset into frames to obtain a video frame sequence dataset, divide the frame sequence dataset 8:2 into a training set and a test set, and train the target tracking model based on the improved MobileNetV3 network and the RPN subnetwork with the training set;
S5, dynamic video fishing boat target tracking: input the first frame of a test-set video frame sequence into the trained YoloV5 target detection model to obtain the region position information and a segmented picture of each fishing boat target, then input these together with the video frame sequence into the trained fishing boat target tracking model to obtain the fishing boat target tracking result.
Preferably, in step S1, the fishing boat pictures may be obtained from public network datasets, and the dynamic fishing boat videos may be obtained from monitoring cameras erected at a fishing dock.
Preferably, in step S2, the data enhancement applied to the fishing boat picture dataset enriches the data samples and includes picture rotation, proportional scaling and background filling; the enhanced dataset samples are divided into a training set and a test set, where the training set is used to train the YoloV5 target detection model and the test set is used to verify its effect.
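For illustration only, the following Pillow sketch applies the three enhancement operations named above (rotation, proportional scaling, background filling); the angle, scale factor, canvas size and gray fill value are assumed values, not parameters specified by this description.

```python
from PIL import Image

def augment(img: Image.Image, angle: float = 15.0, scale: float = 0.8,
            canvas: int = 640) -> Image.Image:
    """Rotate, rescale proportionally, then paste onto a filled background."""
    rotated = img.rotate(angle, expand=True)          # picture rotation
    w, h = rotated.size
    ratio = scale * canvas / max(w, h)                # proportional scaling
    resized = rotated.resize((max(1, int(w * ratio)), max(1, int(h * ratio))))
    background = Image.new("RGB", (canvas, canvas), (114, 114, 114))  # background fill
    background.paste(resized, ((canvas - resized.width) // 2,
                               (canvas - resized.height) // 2))
    return background
```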
Preferably, in step S3, the fishing boat picture is input into the trained YoloV5 target detection model, which outputs identification frames for the detected fishing boat targets; cropping the original fishing boat picture directly with the region position information of these frames yields several pictures each containing one fishing boat target.
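As an illustrative sketch of this segmentation, the following assumes the usual YoloV5 (x1, y1, x2, y2) box format and crops one picture per fishing boat target with the PIL library:

```python
from PIL import Image

def crop_targets(image_path, boxes):
    """Return one sub-image per (x1, y1, x2, y2) detection box."""
    img = Image.open(image_path).convert("RGB")
    return [img.crop((int(x1), int(y1), int(x2), int(y2)))
            for x1, y1, x2, y2 in boxes]
```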
Preferably, in step S4, the target tracking model is divided into two parts, the improved MobileNetV3 network and the RPN subnetwork, as shown in fig. 2. The improved MobileNetV3 network extracts image features, and the RPN subnetwork classifies and regresses the extracted features to obtain the final fishing boat target tracking result.
The improved MobileNetV3 network adopts a dual-branch structure consisting of a template branch and a search branch; the two branches use the same improved MobileNetV3 model with shared training parameters, and extract the template-region image features and the search-region image features respectively.
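A minimal sketch of this parameter sharing is given below: because a single backbone instance serves both branches, the template branch and the search branch share training parameters by construction. The backbone module stands in for the improved MobileNetV3 and is an assumption of the sketch.

```python
import torch
import torch.nn as nn

class SiameseFeatures(nn.Module):
    """Dual-branch feature extractor: one backbone, applied twice."""
    def __init__(self, backbone: nn.Module):
        super().__init__()
        self.backbone = backbone  # single module => both branches share weights

    def forward(self, template: torch.Tensor, search: torch.Tensor):
        z = self.backbone(template)  # template-region image features
        x = self.backbone(search)    # search-region image features
        return z, x
```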
The structure of the MobileNetV3 model is shown in fig. 3; a Transformer-based cross attention mechanism (Cross Attention Transformer, CAT) is introduced to improve MobileNetV3, reducing the computational cost while maintaining good feature extraction performance.
The overall structure of CAT is shown in fig. 4. Feature extraction is divided into the following four stages, with the cross attention block (Cross Attention Block, CAB) shown in fig. 5:
(1) The first stage splits the input image into patches of height H_1 = H/P and width W_1 = W/P and increases the number of channels to C_1; the output feature shape is F_1 = H_1 × W_1 × C_1.
(2) The second stage executes a patch projection layer performing a space-to-depth operation: each pixel block of shape 2 × 2 × C is rearranged into shape 1 × 1 × 4C and then projected to 1 × 1 × 2C by a linear projection layer. This halves the length and width of the feature map and doubles its channel dimension, giving an output feature shape of F_2 = H_1/2 × W_1/2 × C_2.
(3) The third and fourth stages likewise execute the patch projection layer to perform the space-to-depth operation.
(4) After the four stages, four feature maps F_1, F_2, F_3 and F_4 of different scales and dimensions are obtained.
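The following PyTorch sketch illustrates one plausible reading of the patch projection layer in stages (2)-(4): a space-to-depth rearrangement of each 2 × 2 × C block into 1 × 1 × 4C followed by a linear projection to 2C, halving the spatial size and doubling the channels. The exact layer implementation is an assumption, not given by this description.

```python
import torch
import torch.nn as nn

class PatchProjection(nn.Module):
    """Space-to-depth (2x2xC -> 1x1x4C) followed by a 4C -> 2C projection."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj = nn.Linear(4 * channels, 2 * channels)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        b, c, h, w = x.shape                               # H and W must be even
        x = x.reshape(b, c, h // 2, 2, w // 2, 2)          # expose 2x2 blocks
        x = x.permute(0, 2, 4, 3, 5, 1).reshape(b, h // 2, w // 2, 4 * c)
        x = self.proj(x)                                   # linear projection to 2C
        return x.permute(0, 3, 1, 2)                       # (B, 2C, H/2, W/2)
```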
The RPN subnetwork is shown in fig. 2 and includes a classification branch and a regression branch. The classification branch distinguishes the target from the background, and the regression branch outputs a more accurate target tracking position. In the classification branch, the template branch outputs a feature map with 2k channels, covering target and background for each of the k anchors (k being the number of preselected boxes per location). In the regression branch, the template branch outputs a feature map with 4k channels, corresponding to the 4 position regression parameters of the k anchors.
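The channel layout can be illustrated with the following sketch, which keeps only the output shapes described above (2k classification channels and 4k regression channels per location); the depthwise cross-correlation of template and search features used in SiamRPN-style trackers is simplified here to plain 1×1 convolutions, an assumption made for brevity.

```python
import torch.nn as nn

class RPNHead(nn.Module):
    """Output shapes only: 2k classification and 4k regression channels."""
    def __init__(self, in_channels: int, k: int = 5):
        super().__init__()
        self.cls = nn.Conv2d(in_channels, 2 * k, kernel_size=1)  # target/background per anchor
        self.reg = nn.Conv2d(in_channels, 4 * k, kernel_size=1)  # dx, dy, dw, dh per anchor

    def forward(self, fused):
        return self.cls(fused), self.reg(fused)
```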
In both the RPN classification branch and the regression branch, the image features are fused by multi-scale encoding on the basis of the Transformer attention mechanism to obtain the final target tracking output response map. The network structure of the attention mechanism is shown in fig. 6: the network input X is converted into a feature map U of a given size by the convolution layer mapping F_tr. The network then performs a Squeeze operation F_sq on U, encoding the spatial feature u_c of each channel of U into a global feature z_c:

z_c = F_sq(u_c) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j)    (1)

where H and W are the height and width of the original image features. The Squeeze operation obtains a global description of each channel, after which an Excitation operation F_ex learns the relations among the channels, finally yielding the adaptive weight of each channel:

s_c = F_ex(z_c, W) = σ(W_2 g(W_1 z_c))    (2)

where W_1 and W_2 are linear transformation matrices (W_1 reduces the channel dimension by a factor of τ and W_2 restores it), τ is a dimension-reduction hyperparameter, σ is the Sigmoid activation function, g is the ReLU activation function, and s_c is the adaptive weight of channel c.

The final fishing boat target tracking output response map U′ is obtained by weighting U channel by channel with the learned weights s_c through F_scale:

u′_c = F_scale(u_c, s_c) = s_c · u_c    (3)
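Read literally, equations (1)-(3) describe a squeeze-and-excitation style channel attention; the sketch below implements that reading in PyTorch, with the reduction ratio τ = 16 an assumed value not given in this description.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention following equations (1)-(3)."""
    def __init__(self, channels: int, tau: int = 16):
        super().__init__()
        self.w1 = nn.Linear(channels, channels // tau)  # W1: reduce by factor tau
        self.w2 = nn.Linear(channels // tau, channels)  # W2: restore dimension
        self.g = nn.ReLU()          # g in equation (2)
        self.sigma = nn.Sigmoid()   # sigma in equation (2)

    def forward(self, u: torch.Tensor) -> torch.Tensor:
        b, c, _, _ = u.shape
        z = u.mean(dim=(2, 3))                          # eq. (1): squeeze to z_c
        s = self.sigma(self.w2(self.g(self.w1(z))))     # eq. (2): weights s_c
        return u * s.view(b, c, 1, 1)                   # eq. (3): u'_c = s_c * u_c
```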
In summary, the specific steps for training the target tracking model in step S4 are as follows:
S4-1, input the first frame of a training-set video frame sequence into the trained YoloV5 target detection model, and output the region position information of all fishing boat targets in the picture together with the fishing boat target pictures cropped according to that information;
S4-2, input the region position information of a single fishing boat target and the corresponding fishing boat target picture, in label order, into the template branch of the improved MobileNetV3 network of the target tracking model, and input the video frame sequence, in time order, into the search branch of the improved MobileNetV3 network; after the network's convolution computation, the template-region features and search-region features are obtained;
S4-3, input the template-region features and the search-region features into the classification branch and the regression branch of the RPN subnetwork respectively; after multi-scale encoding fusion of the features by the Transformer-based attention mechanism, output the final response map of the fishing boat target tracking result;
S4-4, repeat steps S4-1 to S4-3, inputting the training-set video frame sequences into the target tracking model for training, with each module self-learning and adjusting the network training parameters, until the response map output by the fishing boat target tracking model achieves the expected effect.
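For illustration, one training iteration of steps S4-1 to S4-3 could look like the following sketch; the loss functions (cross-entropy for classification, smooth L1 for regression) and the flattening of anchor channels follow common Siamese tracking practice and are assumptions, since this description does not specify them.

```python
import torch
import torch.nn.functional as F

def training_step(model, optimizer, template, search, cls_target, reg_target):
    """One S4 iteration: forward pass, joint loss, parameter update."""
    cls_out, reg_out = model(template, search)  # (B, 2k, H, W) and (B, 4k, H, W)
    b = cls_out.size(0)
    cls_logits = cls_out.view(b, 2, -1)         # fold anchors into 2-way logits
    loss = (F.cross_entropy(cls_logits, cls_target)   # cls_target: (B, k*H*W)
            + F.smooth_l1_loss(reg_out, reg_target))  # reg_target: (B, 4k, H, W)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```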
Preferably, in step S5, the first frame of a test-set video frame sequence is input into the trained YoloV5 target detection model to obtain the region position information and a segmented picture of each fishing boat target, and these are input together with the video frame sequence into the trained fishing boat target tracking model to obtain the fishing boat target tracking result.
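An end-to-end sketch of this test procedure is given below, using a pretrained YoloV5 model from torch.hub for the first-frame detection; the tracker interface is a placeholder assumption for the trained target tracking model of steps S4-1 to S4-4.

```python
import torch
from PIL import Image

# Pretrained YoloV5 from torch.hub stands in for the trained detector of S2.
detector = torch.hub.load("ultralytics/yolov5", "yolov5s")

def track_sequence(frame_paths, tracker):
    """Detect boats in frame 0, then track them through the remaining frames."""
    first = Image.open(frame_paths[0]).convert("RGB")
    boxes = detector(first).xyxy[0][:, :4].tolist()       # region position info
    templates = [first.crop(tuple(map(int, b))) for b in boxes]  # segmented pictures
    return [tracker(templates, boxes, Image.open(p).convert("RGB"))
            for p in frame_paths[1:]]                     # tracking result per frame
```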
The fishing boat target detection and tracking method based on dynamic video has the following advantages and positive effects:
according to the method for detecting and tracking the target of the fishing boat based on the dynamic video, the detection part utilizes the existing most mainstream YoloV5 target detection model to identify the target of the fishing boat, so that the identification effect is excellent; the tracking part extracts image features by utilizing an improved lightweight MobileNet V3 network, and simultaneously uses an RPN sub-network to perform classification and regression operation after multi-scale coding fusion on the extracted features, so that the total frame greatly reduces the number of the participated calculation parameters, improves the target tracking calculation speed, and simultaneously improves the accuracy of target tracking, thereby enabling the model to be more suitable for fishing boat target detection and tracking in a real-time scene.
The technical scheme of the invention is further described in detail through the drawings and the embodiments.
Drawings
FIG. 1 is a schematic flow chart of an embodiment of a method for detecting and tracking targets of a fishing vessel based on dynamic video;
FIG. 2 is a schematic diagram of a target tracking model of an embodiment of a method for detecting and tracking a target of a fishing vessel based on dynamic video;
FIG. 3 is a schematic diagram of a MobileNet V3 model structure of an embodiment of a method for detecting and tracking a fishing vessel target based on dynamic video;
FIG. 4 is an overall block diagram of CAT of an embodiment of a method for detecting and tracking targets of a fishing vessel based on dynamic video according to the present invention;
FIG. 5 is a schematic diagram of a cross attention block CAB in a CAT structure of an embodiment of a method for detecting and tracking targets of a fishing boat based on dynamic video according to the present invention;
FIG. 6 is a schematic diagram of a network structure of an attention mechanism of an embodiment of a method for detecting and tracking targets of a fishing vessel based on dynamic video.
Detailed Description
The technical scheme of the invention is further described below through the attached drawings and the embodiments.
Examples
As shown in fig. 1, a fishing boat target detection and tracking method based on dynamic video comprises the following steps:
s1, constructing a data set: collecting marine fishing boat pictures, and manufacturing a fishing boat picture data set for training a YoloV5 target detection model; collecting historical monitoring videos of monitoring cameras of a fishing boat wharf, and manufacturing a dynamic video data set for training and improving a target tracking model of a MobileNet V3 network and an RPN subnetwork; the fishing boat picture can be obtained from the network public data set; the dynamic video of the fishing boat can be obtained from a monitoring camera erected on a wharf of the fishing boat.
S2, training the YoloV5 target detection model: apply data enhancement to the fishing boat picture dataset to enrich the data samples, including picture rotation, proportional scaling and background filling; split the enhanced dataset samples 8:2 into a training set and a test set, train the YoloV5 target detection model with the training set, and verify its effect with the test set.
S3, fishing boat target detection and picture segmentation: input the test set into the trained YoloV5 target detection model to obtain the region coordinates of each fishing boat target, and segment the pictures with the PIL library according to the region coordinate information to obtain several pictures each containing a single fishing boat target.
S4, training the target tracking model: split the videos of the dynamic video dataset into frames to obtain a video frame sequence dataset, divide the frame sequence dataset 8:2 into a training set and a test set, and train the target tracking model based on the improved MobileNetV3 network and the RPN subnetwork with the training set.
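The framing treatment can be illustrated with the following OpenCV sketch, which decodes a surveillance video into an ordered frame sequence; the file naming and directory layout are assumptions made for the example.

```python
import os
import cv2

def video_to_frames(video_path: str, out_dir: str) -> int:
    """Decode a video into numbered JPEG frames; returns the frame count."""
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:                       # end of stream
            break
        cv2.imwrite(os.path.join(out_dir, f"{index:06d}.jpg"), frame)
        index += 1
    cap.release()
    return index
```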
The target tracking model is divided into two parts, the improved MobileNetV3 network and the RPN subnetwork, as shown in fig. 2. The improved MobileNetV3 network extracts image features, and the RPN subnetwork classifies and regresses the extracted features to obtain the final fishing boat target tracking result.
The improved MobileNetV3 network adopts a dual-branch structure consisting of a template branch and a search branch; the two branches use the same improved MobileNetV3 model with shared training parameters, and extract the template-region image features and the search-region image features respectively.
The structure of the MobileNetV3 model is shown in fig. 3; a Transformer-based cross attention mechanism (Cross Attention Transformer, CAT) is introduced to improve MobileNetV3, reducing the computational cost while maintaining good feature extraction performance.
The overall structure of CAT is shown in fig. 4. Feature extraction is divided into the following four stages, with the cross attention block (Cross Attention Block, CAB) shown in fig. 5:
(1) The first stage splits the input image into patches of height H_1 = H/P and width W_1 = W/P and increases the number of channels to C_1; the output feature shape is F_1 = H_1 × W_1 × C_1.
(2) The second stage executes a patch projection layer performing a space-to-depth operation: each pixel block of shape 2 × 2 × C is rearranged into shape 1 × 1 × 4C and then projected to 1 × 1 × 2C by a linear projection layer. This halves the length and width of the feature map and doubles its channel dimension, giving an output feature shape of F_2 = H_1/2 × W_1/2 × C_2.
(3) The third and fourth stages likewise execute the patch projection layer to perform the space-to-depth operation.
(4) After the four stages, four feature maps F_1, F_2, F_3 and F_4 of different scales and dimensions are obtained.
The RPN subnetwork is shown in fig. 2 and includes a classification branch and a regression branch. The classification branch distinguishes the target from the background, and the regression branch outputs a more accurate target tracking position. In the classification branch, the template branch outputs a feature map with 2k channels, covering target and background for each of the k anchors (k being the number of preselected boxes per location). In the regression branch, the template branch outputs a feature map with 4k channels, corresponding to the 4 position regression parameters of the k anchors.
In both the RPN classification branch and the regression branch, the image features are fused by multi-scale encoding on the basis of the Transformer attention mechanism to obtain the final target tracking output response map. The network structure of the attention mechanism is shown in fig. 6: the network input X is converted into a feature map U of a given size by the convolution layer mapping F_tr. The network then performs a Squeeze operation F_sq on U, encoding the spatial feature u_c of each channel of U into a global feature z_c:

z_c = F_sq(u_c) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j)    (1)

where H and W are the height and width of the original image features. The Squeeze operation obtains a global description of each channel, after which an Excitation operation F_ex learns the relations among the channels, finally yielding the adaptive weight of each channel:

s_c = F_ex(z_c, W) = σ(W_2 g(W_1 z_c))    (2)

where W_1 and W_2 are linear transformation matrices (W_1 reduces the channel dimension by a factor of τ and W_2 restores it), τ is a dimension-reduction hyperparameter, σ is the Sigmoid activation function, g is the ReLU activation function, and s_c is the adaptive weight of channel c.

The final target tracking output response map U′ is obtained by weighting U channel by channel with the learned weights s_c through F_scale:

u′_c = F_scale(u_c, s_c) = s_c · u_c    (3)
In summary, the specific steps for training the target tracking model in step S4 are as follows:
S4-1, input the first frame of a training-set video frame sequence into the trained YoloV5 target detection model, and output the region position information of all fishing boat targets in the picture together with the fishing boat target pictures cropped according to that information;
S4-2, input the region position information of a single fishing boat target and the corresponding fishing boat target picture, in label order, into the template branch of the improved MobileNetV3 network of the target tracking model, and input the video frame sequence, in time order, into the search branch of the improved MobileNetV3 network; after the network's convolution computation, the template-region features and search-region features are obtained;
S4-3, input the template-region features and the search-region features into the classification branch and the regression branch of the RPN subnetwork respectively; after multi-scale encoding fusion of the features by the Transformer-based attention mechanism, output the final response map of the fishing boat target tracking result;
S4-4, repeat steps S4-1 to S4-3, inputting the training-set video frame sequences into the target tracking model for training, with each module self-learning and adjusting the network training parameters, until the response map output by the fishing boat target tracking model achieves the expected effect.
S5, dynamic video fishing boat target tracking: input the first frame of a test-set video frame sequence into the trained YoloV5 target detection model to obtain the region position information and a segmented picture of each fishing boat target, then input these together with the video frame sequence into the trained target tracking model to obtain the fishing boat target tracking result.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that the technical solution of the invention may be modified or equivalently replaced without departing from its spirit and scope.

Claims (6)

1. A fishing boat target detection and tracking method based on dynamic video, characterized by comprising the following steps:
S1, constructing datasets: collect marine fishing boat pictures and build a fishing boat picture dataset for training the YoloV5 target detection model; collect historical surveillance videos from fishing-dock monitoring cameras and build a dynamic video dataset for training the target tracking model based on the improved MobileNetV3 network and the RPN subnetwork;
S2, training the YoloV5 target detection model: apply data enhancement to the fishing boat picture dataset, split the enhanced dataset samples 8:2 into a training set and a test set, and train the YoloV5 target detection model with the training set;
S3, fishing boat target detection and picture segmentation: input the test set into the trained YoloV5 target detection model to obtain the region coordinates of each fishing boat target, and segment the pictures with the PIL library according to the region coordinate information to obtain several pictures each containing a single fishing boat target;
S4, training the target tracking model: split the videos of the dynamic video dataset into frames to obtain a video frame sequence dataset, divide the frame sequence dataset 8:2 into a training set and a test set, and train the target tracking model based on the improved MobileNetV3 network and the RPN subnetwork with the training set;
S5, dynamic video fishing boat target tracking: input the first frame of a test-set video frame sequence into the trained YoloV5 target detection model to obtain the region position information and a segmented picture of each fishing boat target, then input these together with the video frame sequence into the trained fishing boat target tracking model to obtain the fishing boat target tracking result.
2. The fishing boat target detection and tracking method based on dynamic video according to claim 1, characterized in that: in step S1, the fishing boat pictures may be obtained from public network datasets, and the dynamic fishing boat videos may be obtained from monitoring cameras erected at a fishing dock.
3. The fishing boat target detection and tracking method based on dynamic video according to claim 1, characterized in that: in step S2, data enhancement is applied to the fishing boat picture dataset, including picture rotation, proportional scaling and background filling; the enhanced dataset samples are divided into a training set and a test set, where the training set is used to train the YoloV5 target detection model and the test set is used to verify its effect.
4. The fishing boat target detection and tracking method based on dynamic video according to claim 1, characterized in that: in step S3, the fishing boat picture is input into the trained YoloV5 target detection model, which outputs identification frames for the detected fishing boat targets; cropping the original fishing boat picture directly with the region position information of these frames yields several pictures each containing one fishing boat target.
5. The fishing boat target detection and tracking method based on dynamic video according to claim 1, characterized in that: in step S4, the target tracking model is divided into two parts, the improved MobileNetV3 network and the RPN subnetwork, as shown in fig. 2; the improved MobileNetV3 network extracts image features, and the RPN subnetwork classifies and regresses the extracted features to obtain the final fishing boat target tracking result.
The improved MobileNetV3 network adopts a dual-branch structure consisting of a template branch and a search branch; the two branches use the same improved MobileNetV3 model with shared training parameters, and extract the template-region image features and the search-region image features respectively.
The structure of the MobileNetV3 model is shown in fig. 3; a Transformer-based cross attention mechanism (Cross Attention Transformer, CAT) is introduced to improve MobileNetV3, reducing the computational cost while maintaining good feature extraction performance.
The overall structure of CAT is shown in fig. 4. Feature extraction is divided into the following four stages, with the cross attention block (Cross Attention Block, CAB) shown in fig. 5:
(1) The first stage splits the input image into patches of height H_1 = H/P and width W_1 = W/P and increases the number of channels to C_1; the output feature shape is F_1 = H_1 × W_1 × C_1.
(2) The second stage executes a patch projection layer performing a space-to-depth operation: each pixel block of shape 2 × 2 × C is rearranged into shape 1 × 1 × 4C and then projected to 1 × 1 × 2C by a linear projection layer. This halves the length and width of the feature map and doubles its channel dimension, giving an output feature shape of F_2 = H_1/2 × W_1/2 × C_2.
(3) The third and fourth stages likewise execute the patch projection layer to perform the space-to-depth operation.
(4) After the four stages, four feature maps F_1, F_2, F_3 and F_4 of different scales and dimensions are obtained.
The RPN subnetwork is shown in fig. 2 and includes a classification branch and a regression branch. The classification branch distinguishes the target from the background, and the regression branch outputs a more accurate target tracking position. In the classification branch, the template branch outputs a feature map with 2k channels, covering target and background for each of the k anchors (k being the number of preselected boxes per location). In the regression branch, the template branch outputs a feature map with 4k channels, corresponding to the 4 position regression parameters of the k anchors.
In both the RPN classification branch and the regression branch, the image features are fused by multi-scale encoding on the basis of the Transformer attention mechanism to obtain the final target tracking output response map. The network structure of the attention mechanism is shown in fig. 6: the network input X is converted into a feature map U of a given size by the convolution layer mapping F_tr. The network then performs a Squeeze operation F_sq on U, encoding the spatial feature u_c of each channel of U into a global feature z_c:

z_c = F_sq(u_c) = (1/(H×W)) Σ_{i=1..H} Σ_{j=1..W} u_c(i, j)    (1)

where H and W are the height and width of the original image features. The Squeeze operation obtains a global description of each channel, after which an Excitation operation F_ex learns the relations among the channels, finally yielding the adaptive weight of each channel:

s_c = F_ex(z_c, W) = σ(W_2 g(W_1 z_c))    (2)

where W_1 and W_2 are linear transformation matrices (W_1 reduces the channel dimension by a factor of τ and W_2 restores it), τ is a dimension-reduction hyperparameter, σ is the Sigmoid activation function, g is the ReLU activation function, and s_c is the adaptive weight of channel c.

The final fishing boat target tracking output response map U′ is obtained by weighting U channel by channel with the learned weights s_c through F_scale:

u′_c = F_scale(u_c, s_c) = s_c · u_c    (3)
In summary, the specific steps for training the target tracking model in step S4 are as follows:
S4-1, input the first frame of a training-set video frame sequence into the trained YoloV5 target detection model, and output the region position information of all fishing boat targets in the picture together with the fishing boat target pictures cropped according to that information;
S4-2, input the region position information of a single fishing boat target and the corresponding fishing boat target picture, in label order, into the template branch of the improved MobileNetV3 network of the target tracking model, and input the video frame sequence, in time order, into the search branch of the improved MobileNetV3 network; after the network's convolution computation, the template-region features and search-region features are obtained;
S4-3, input the template-region features and the search-region features into the classification branch and the regression branch of the RPN subnetwork respectively; after multi-scale encoding fusion of the features by the Transformer-based attention mechanism, output the final response map of the fishing boat target tracking result;
S4-4, repeat steps S4-1 to S4-3, inputting the training-set video frame sequences into the target tracking model for training, with each module self-learning and adjusting the network training parameters, until the response map output by the fishing boat target tracking model achieves the expected effect.
6. The fishing boat target detection and tracking method based on dynamic video according to claim 1, characterized in that: in step S5, the first frame of a test-set video frame sequence is input into the trained YoloV5 target detection model to obtain the region position information and a segmented picture of each fishing boat target, and these are input together with the video frame sequence into the trained fishing boat target tracking model to obtain the fishing boat target tracking result.
CN202310328937.0A 2023-03-30 2023-03-30 Fishing boat target detection and tracking method based on dynamic video Pending CN116935332A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310328937.0A CN116935332A (en) 2023-03-30 2023-03-30 Fishing boat target detection and tracking method based on dynamic video

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310328937.0A CN116935332A (en) 2023-03-30 2023-03-30 Fishing boat target detection and tracking method based on dynamic video

Publications (1)

Publication Number Publication Date
CN116935332A 2023-10-24

Family

ID=88386805

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310328937.0A Pending CN116935332A (en) 2023-03-30 2023-03-30 Fishing boat target detection and tracking method based on dynamic video

Country Status (1)

Country Link
CN (1) CN116935332A (en)


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117437595A (en) * 2023-11-27 2024-01-23 哈尔滨航天恒星数据系统科技有限公司 Fishing boat boundary crossing early warning method based on deep learning
CN117557785A (en) * 2024-01-11 2024-02-13 宁波海上鲜信息技术股份有限公司 Image processing-based long-distance fishing boat plate recognition method
CN117557785B (en) * 2024-01-11 2024-04-02 宁波海上鲜信息技术股份有限公司 Image processing-based long-distance fishing boat plate recognition method

Similar Documents

Publication Publication Date Title
CN113688723B (en) Infrared image pedestrian target detection method based on improved YOLOv5
CN108805070A (en) A kind of deep learning pedestrian detection method based on built-in terminal
CN116935332A (en) Fishing boat target detection and tracking method based on dynamic video
CN111753677B (en) Multi-angle remote sensing ship image target detection method based on characteristic pyramid structure
CN114612769B (en) Integrated sensing infrared imaging ship detection method integrated with local structure information
CN112183203A (en) Real-time traffic sign detection method based on multi-scale pixel feature fusion
CN114943876A (en) Cloud and cloud shadow detection method and device for multi-level semantic fusion and storage medium
CN113255589B (en) Target detection method and system based on multi-convolution fusion network
CN113888547A (en) Non-supervision domain self-adaptive remote sensing road semantic segmentation method based on GAN network
CN112818969A (en) Knowledge distillation-based face pose estimation method and system
CN113255837A (en) Improved CenterNet network-based target detection method in industrial environment
CN111310609B (en) Video target detection method based on time sequence information and local feature similarity
CN111723660A (en) Detection method for long ground target detection network
CN111126205A (en) Optical remote sensing image airplane target detection method based on rotary positioning network
CN113280820B (en) Orchard visual navigation path extraction method and system based on neural network
CN113177503A (en) Arbitrary orientation target twelve parameter detection method based on YOLOV5
CN114821326A (en) Method for detecting and identifying dense weak and small targets in wide remote sensing image
Sun et al. IRDCLNet: Instance segmentation of ship images based on interference reduction and dynamic contour learning in foggy scenes
CN115661932A (en) Fishing behavior detection method
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN109284752A (en) A kind of rapid detection method of vehicle
CN107766858A (en) A kind of method that ship detecting is carried out using diameter radar image
CN112132880A (en) Real-time dense depth estimation method based on sparse measurement and monocular RGB (red, green and blue) image
Aldabbagh et al. Classification of chili plant growth using deep learning
CN115953312A (en) Joint defogging detection method and device based on single image and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination