CN114863348A - Video target segmentation method based on self-supervision - Google Patents

Video target segmentation method based on self-supervision

Info

Publication number
CN114863348A
CN114863348A (application CN202210658263.6A)
Authority
CN
China
Prior art keywords
target
edge
network model
frame
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210658263.6A
Other languages
Chinese (zh)
Inventor
李阳阳
封星宇
赵逸群
刘睿娇
陈彦桥
焦李成
尚荣华
马文萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority: CN202210658263.6A
Publication: CN114863348A
Legal status: Pending


Classifications

    • G06V 20/49: Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video target segmentation method based on self-supervision, which mainly addresses the problems of low segmentation accuracy and strong susceptibility to target occlusion and tracking drift in the prior art. The scheme comprises the following steps: 1) acquire video sequences from a video target segmentation data set, preprocess them, and divide them to obtain training, validation and test sample sets; 2) construct and train an image reconstruction neural network model, extracting target features with a self-supervised learning method based on a multi-pixel-scale image reconstruction task; 3) construct and train a side-output edge detection network model; 4) construct and train a self-supervised edge correction network model; 5) combine the three trained models to obtain a video target segmentation model; 6) feed the test set into the video target segmentation model to obtain the target segmentation results. The method effectively improves the generalization and accuracy of video target segmentation and can be used in fields such as autonomous driving, intelligent surveillance and intelligent unmanned aerial vehicle tracking.

Description

Video target segmentation method based on self-supervision
Technical Field
The invention belongs to the technical field of computer vision and further relates to video target segmentation technology, in particular to a video target segmentation method based on self-supervision, which can be used in fields such as autonomous driving, intelligent surveillance and intelligent unmanned aerial vehicle tracking.
Background
Computer vision aims to simulate the process by which humans establish visual perception and is a key link in the development of artificial intelligence technology; computer vision algorithms seek to imitate human visual behavior as closely and as accurately as possible and to provide perceptual information for downstream tasks. In the human perceptual system, the visual input changes continuously, and under the current state of visual technology the storage format closest to human perception is video, so a computer vision algorithm that handles video tasks has the capability of simulating human visual behavior.
The video target segmentation task is an important topic within video processing; its goal is to separate the targets of interest in a video sequence from the background. In recent years, owing to the excellent performance of deep learning in computer vision tasks (such as image recognition, target tracking and action recognition), deep-learning-based video target segmentation algorithms have become the mainstream approach to the task. The performance of a deep-learning-based video target segmentation algorithm depends on the scale of the neural network it uses, the performance of the neural network depends on a large amount of training data, and the larger the training data set, the better the generalization and robustness of the trained network. Under supervised learning, producing a video target segmentation training set is expensive and time-consuming: every pixel of each image must be labeled spatially and every frame of each video sequence must be labeled temporally. The performance of a video target segmentation model is also closely related to its structure, and reasonable optimization of the model's inference process can effectively reduce errors in the segmentation process.
The research goal of self-supervised learning is to train a deep learning model without any manual labels, so that the model can extract effective visual representations from large numbers of unlabeled images or video data sets; the extracted representations are then fine-tuned and used by downstream tasks. Self-supervised video target segmentation is designed for the specific task of semi-supervised video target segmentation: the video target segmentation model is trained with a self-supervised learning method, the trained model can be used directly for the video target segmentation task, and no manually annotated data set is required at any point during training.
Research on self-supervised video target segmentation largely follows two lines: first, designing better pretext tasks for training so that the model has stronger representation extraction ability; second, introducing additional mechanisms for the semi-supervised video target segmentation problem to reduce the influence of target occlusion and tracking drift. Vondrick et al. published an article entitled "Tracking Emerges by Colorizing Videos" at the European Conference on Computer Vision in 2018, proposing a self-supervised video tracking model that exploits the natural temporal coherence of color to learn to colorize gray-scale videos, further improving self-supervised video tracking; however, because the model propagates from previous frames, it is not robust to target occlusion and tracking drift. The CorrFlow method, presented in the article "Self-supervised Learning for Video Correspondence Flow" at the British Machine Vision Conference in 2019, introduced a restricted attention mechanism to raise the resolution of the model input and improve segmentation accuracy without increasing the burden on the computing equipment; however, the method does not consider the generalization of feature extraction across targets of different scales and performs poorly when target scales differ too much.
Disclosure of Invention
The invention aims to provide a video target segmentation method based on self-supervision that addresses the shortcomings of the prior art, namely the technical problems of low segmentation accuracy and strong susceptibility to target occlusion and tracking drift.
The idea for realizing the invention is as follows: first, target features are extracted with a self-supervised learning method based on a multi-pixel-scale image reconstruction task, so that the video target segmentation model can account for the features of both large and small targets and obtain better generalization; then, to counter the error accumulation that arises when the video target segmentation model segments the target, the semantic edges of the image are used to correct the target segmentation mask; finally, a self-supervised edge fusion network is designed to obtain a more accurate target segmentation mask.
In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:
(1) acquiring a training sample set, a verification sample set and a test sample set:
obtaining video sequences from a video target segmentation data set, preprocessing them to obtain a frame sequence set V, and dividing the frame sequences in the set to obtain a training sample set V_train, a validation sample set V_val and a test sample set V_test;
(2) Constructing and training an image reconstruction neural network model R:
(2a) constructing an image reconstruction neural network model R formed by a feature extraction network, wherein the feature extraction network is a residual network comprising a plurality of convolutional layers, a plurality of pooling layers, a plurality of residual unit modules and a single fully connected layer connected in sequence;
(2b) defining a loss function of the image reconstruction neural network model R:

L_mix = α·L_cls + (1 - α)·L_reg

wherein L_cls is the cross-entropy loss function of the quantized-image reconstruction task: for the training sample set V_train, E cluster centroid points μ_1, μ_2, ..., μ_E are selected, with E ≤ 50; the class of each sample is computed from its distance to the cluster centroid points, the number of target classes contained in the frame sequence set V is set to C, and the positions of the cluster centroid points are corrected so that the same target carries the same label across frames while different targets carry different labels. Here c_i^t denotes the class to which the i-th pixel of a given frame picture I_t belongs, and ĉ_i^t denotes the prediction obtained with the K-means algorithm. L_reg is the regression loss function of the RGB image reconstruction task, defined over the real target frame pixels I_t and the reconstructed target frame pixels Î_t; α is the weight coefficient of the reconstruction terms, with 0.1 ≤ α ≤ 0.9;
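For concreteness, a minimal PyTorch-style sketch of such a mixed loss is given below; it assumes a standard per-pixel cross-entropy for the quantized reconstruction term and a mean-squared-error regression term for the RGB reconstruction, since the exact formulas are given in the original filing only as formula images.

```python
import torch.nn.functional as F

def mixed_reconstruction_loss(cls_logits, cls_target, rgb_pred, rgb_true, alpha=0.6):
    """Sketch of L_mix = alpha * L_cls + (1 - alpha) * L_reg.

    cls_logits: (B, E, H, W) scores over the E colour clusters (quantized task)
    cls_target: (B, H, W) per-pixel cluster index produced by K-means
    rgb_pred:   (B, 3, H, W) reconstructed target frame
    rgb_true:   (B, 3, H, W) real target frame
    """
    l_cls = F.cross_entropy(cls_logits, cls_target)   # quantized-image term (assumed form)
    l_reg = F.mse_loss(rgb_pred, rgb_true)            # RGB regression term (assumed MSE)
    return alpha * l_cls + (1.0 - alpha) * l_reg
```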
(2c) setting the feature extraction network parameters and the maximum number of iterations N, and iteratively training the image reconstruction neural network model R on the target frame pictures of the training sample set V_train according to the loss function of the image reconstruction neural network model R, to obtain the trained image reconstruction neural network model R;
(3) constructing and training a side output edge detection network model Q:
(3a) constructing an edge detection network model Q comprising a side-output edge detection layer SODL and a side-output edge fusion layer SOFL connected in sequence, wherein the side-output edge detection layer SODL comprises a deconvolution layer and a convolutional layer with a convolution kernel size of 1 × 1 and one output channel, and the side-output edge fusion layer SOFL is a convolutional layer with a convolution kernel size of 1 × 1 and one channel;
(3b) defining a loss function of the side-output edge detection network model Q:

L_edge = L_side + L_fuse

wherein L_side is the side-output edge detection loss function, a weighted sum of the per-side losses in which β_i is the weight coefficient of the i-th side-output edge detection network and the i-th term is the loss of that network's prediction result; in the per-side loss, e denotes the target edge ground truth of the input image, |e^-| the number of edge pixels in the ground truth, |e^+| the number of non-edge pixels in the ground truth, and ω_i the parameters of the convolutional layer; L_fuse is the edge fusion loss function computed on the fused side outputs;
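A sketch in the spirit of HED-style class-balanced edge losses is shown below; the way the balancing weights are derived from |e^+| and |e^-| and the exact form of the fusion term are assumptions, since the per-side and fusion formulas appear in the source only as formula images.

```python
import torch
import torch.nn.functional as F

def side_output_edge_loss(side_logits, fuse_logits, edge_gt, betas=None):
    """Sketch of L_edge = L_side + L_fuse with class-balanced BCE per side output.

    side_logits: list of (B, 1, H, W) logits, one per side output
    fuse_logits: (B, 1, H, W) logits of the fused prediction
    edge_gt:     (B, 1, H, W) binary edge ground truth
    """
    n_pos = edge_gt.sum()
    n_neg = edge_gt.numel() - n_pos
    # weight edge / non-edge pixels by the opposite class frequency (assumed balancing)
    pos_w = n_neg / (n_pos + n_neg)
    neg_w = n_pos / (n_pos + n_neg)
    weight = torch.where(edge_gt > 0.5, pos_w, neg_w)

    if betas is None:
        betas = [1.0] * len(side_logits)
    l_side = sum(b * F.binary_cross_entropy_with_logits(s, edge_gt, weight=weight)
                 for b, s in zip(betas, side_logits))
    l_fuse = F.binary_cross_entropy_with_logits(fuse_logits, edge_gt, weight=weight)
    return l_side + l_fuse
```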
(3c) setting the maximum number of iterations I, and iteratively training the side-output edge detection network model Q, according to its loss function, on the set of feature maps output by each structural layer of the feature extraction network in the image reconstruction neural network model R, to obtain the trained side-output edge detection network model Q;
(4) constructing and training an edge correction network model Z:
(4a) constructing an edge correction network model Z composed of a sequentially connected atrous spatial pyramid pooling (ASPP) model F_γ and a softmax activation function output layer, wherein the ASPP model F_γ consists of a plurality of convolutional layers and pooling layers connected in sequence;
(4b) defining the loss function L_corr of the edge correction network model Z, which is computed from the coarse segmentation result of the target frame output by the edge detection layer, the prediction result of the ASPP model F_γ, and the image edges obtained by the Canny algorithm, wherein M denotes the number of classes of the pixels in the mask and the loss is normalized by the total number of pixels in the mask;
(4c) setting a maximum iteration number H, performing iterative training on the edge correction network model Z according to a loss function of the edge correction network model Z and by using output results of the image reconstruction network model R and the edge detection network model Q to obtain a trained edge correction network model Z;
(5) combining the trained image reconstruction neural network R, the side output edge detection network Q and the edge correction network model Z to obtain a video target segmentation model based on an image target edge correction segmentation result;
(6) obtaining a self-supervision video target segmentation result:
the frame images in the test set V_test are used as input of the video target segmentation model and propagated forward to obtain the predicted segmentation labels of all test frame images, and the final segmentation result images are obtained according to the predicted segmentation labels of the test frame images.
Compared with the prior art, the invention has the following advantages:
Firstly, because the multi-pixel-scale image reconstruction task is adopted as the pretext task of self-supervised learning, the features extracted by the trained model generalize better to both large and small targets in the video segmentation task, and therefore the model performs better on the overall video target segmentation task.
Secondly, the invention repairs the target mask with the edges of the target in the video picture: a side-output edge detection network fuses the feature maps extracted by each layer of the feature extraction network in the video target segmentation model and predicts the candidate target edges in the target frame, and a self-supervised edge fusion model fuses the segmentation result output by the video target segmentation model with the target edges output by the side-output edge detection network, so that the segmentation mask is corrected according to the target edges and a more accurate segmentation result is obtained.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples.
The first embodiment is as follows: referring to fig. 1, the video object segmentation method based on self-supervision provided by the invention specifically includes the following steps:
step 1: acquiring a training sample set, a verification sample set and a test sample set:
video sequences are obtained from a video target segmentation data set and preprocessed to obtain a frame sequence set V, and the frame sequences in the set are divided to obtain a training sample set V_train, a validation sample set V_val and a test sample set V_test; this is realized as follows:
(1a) S multi-class video sequences are obtained from a video target segmentation data set and preprocessed to obtain a frame sequence set V = {V_1, V_2, ..., V_S}, with S ≥ 3000, where V_k denotes the k-th frame sequence consisting of preprocessed image frames, the n-th element of V_k is the n-th image frame in the k-th frame sequence, and each sequence contains at least M ≥ 30 frames;
(1b) more than half of the frame sequences are randomly extracted from the frame sequence set V to form the training sample set V_train, where S/2 < N < S; for each frame sequence in the training sample set, each target frame picture to be segmented is scaled into an image block of size p × h and the picture format is converted from RGB to Lab; half of the remaining frame sequences are extracted to form the validation sample set V_val, where J ≤ S/4, and the other half constitutes the test sample set V_test, where T ≤ S/4; their picture format is likewise converted from RGB to Lab.
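The preprocessing described above can be sketched as follows; the use of OpenCV for the resize and the RGB-to-Lab conversion, and the 256 × 256 target size, are illustrative assumptions (the second embodiment sets x = y = p = 256).

```python
import cv2
import numpy as np

def preprocess_frame(frame_bgr, size=(256, 256)):
    """Resize a frame, convert it to the Lab colour space and normalize it.

    frame_bgr: (H, W, 3) uint8 image as read by cv2.imread / cv2.VideoCapture
    returns:   (size[1], size[0], 3) float32 Lab image scaled to [0, 1]
    """
    resized = cv2.resize(frame_bgr, size, interpolation=cv2.INTER_LINEAR)
    lab = cv2.cvtColor(resized, cv2.COLOR_BGR2LAB)
    return lab.astype(np.float32) / 255.0
```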
Step 2: constructing and training an image reconstruction neural network model R:
(2a) constructing an image reconstruction neural network model R formed by a feature extraction network, wherein the feature extraction network is a residual network comprising a plurality of convolutional layers, a plurality of pooling layers, a plurality of residual unit modules and a single fully connected layer connected in sequence;
(2b) defining a loss function of the image reconstruction neural network model R:

L_mix = α·L_cls + (1 - α)·L_reg

wherein L_cls is the cross-entropy loss function of the quantized-image reconstruction task: for the training sample set V_train, E cluster centroid points μ_1, μ_2, ..., μ_E are selected, with E ≤ 50; the class of each sample is computed from its distance to the cluster centroid points, the number of target classes contained in the frame sequence set V is set to C, and the positions of the cluster centroid points are corrected so that the same target carries the same label across frames while different targets carry different labels. Here c_i^t denotes the class to which the i-th pixel of a given frame picture I_t belongs, and ĉ_i^t denotes the prediction obtained with the K-means algorithm. L_reg is the regression loss function of the RGB image reconstruction task, defined over the real target frame pixels I_t and the reconstructed target frame pixels Î_t; α is the weight coefficient of the reconstruction terms, with 0.1 ≤ α ≤ 0.9;
(2c) setting the feature extraction network parameters and the maximum number of iterations N, and iteratively training the image reconstruction neural network model R on the target frame pictures of the training sample set V_train according to its loss function, to obtain the trained image reconstruction neural network model R; this is realized as follows:
(2c1) set the hyper-parameters of the feature extraction network to θ and the maximum number of iterations to N ≥ 150000; let n denote the current iteration and initialize n = 1;
(2c2) use the target frame pictures of the training sample set V_train as input of the image reconstruction neural network model R and propagate forward: for each target frame I_t to be segmented, select the q preceding frames as reference frames {I'_0, I'_1, ..., I'_q}, with 2 ≤ q ≤ 5. The target frame I_t and its reference frame set are fed into the feature extraction network Φ(·; θ), which extracts features from I_t and from each reference frame image, yielding the target frame feature f_t = Φ(I_t; θ) and the reference frame features f'_0 = Φ(I'_0; θ), ..., f'_q = Φ(I'_q; θ). The target frames {I_t | 0 ≤ t ≤ N} of the training sample set are used as input of the K-means algorithm to obtain the quantized image reconstruction loss value L_cls, and the reconstructed target frame Î_t together with the real target frame I_t are used as input of the RGB image reconstruction task to obtain the RGB image reconstruction loss value L_reg;
(2c3) compute the loss value L_mix of the image reconstruction neural network from the cross-entropy loss L_cls and the regression loss L_reg, compute the gradient g(θ) of the network parameters by backpropagation, and update the network parameters θ by gradient descent;
(2c4) check whether n = N holds; if so, the trained image reconstruction neural network R is obtained; otherwise set n = n + 1 and return to step (2c2).
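One common way to realize the quantized reconstruction target is to cluster the colour values of the training frames with K-means and assign each pixel the index of its nearest centroid; the sketch below uses scikit-learn with 16 clusters, which follows the K = 16 setting of the second embodiment but is otherwise an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_colour_clusters(lab_frames, n_clusters=16, seed=0):
    """Fit K-means centroids mu_1..mu_E on the colour values of a set of Lab frames.

    lab_frames: (N, H, W, 3) float array of preprocessed Lab frames
    """
    pixels = lab_frames.reshape(-1, 3)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    km.fit(pixels)
    return km

def quantize_frame(km, lab_frame):
    """Assign every pixel of one frame to its nearest cluster centroid (its class label)."""
    h, w, _ = lab_frame.shape
    labels = km.predict(lab_frame.reshape(-1, 3))
    return labels.reshape(h, w)   # per-pixel targets for the cross-entropy term L_cls
```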
Step 3: constructing and training a side-output edge detection network model Q:
(3a) construct an edge detection network model Q comprising a side-output edge detection layer SODL and a side-output edge fusion layer SOFL connected in sequence, wherein the side-output edge detection layer SODL comprises a deconvolution layer and a convolutional layer with a convolution kernel size of 1 × 1 and one output channel, and the side-output edge fusion layer SOFL is a convolutional layer with a convolution kernel size of 1 × 1 and one channel;
(3b) define the loss function of the side-output edge detection network model Q:

L_edge = L_side + L_fuse

wherein L_side is the side-output edge detection loss function, a weighted sum of the per-side losses in which β_i is the weight coefficient of the i-th side-output edge detection network and the i-th term is the loss of that network's prediction result; in the per-side loss, e denotes the target edge ground truth of the input image, |e^-| the number of edge pixels in the ground truth, |e^+| the number of non-edge pixels in the ground truth, and ω_i the parameters of the convolutional layer; L_fuse is the edge fusion loss function computed on the fused side outputs;
(3c) setting the maximum number of iterations I and iteratively training the side-output edge detection network model Q, according to its loss function, on the set of feature maps output by each structural layer of the feature extraction network in the image reconstruction neural network model R, to obtain the trained side-output edge detection network model Q; this is realized as follows:
(3c1) set the maximum number of iterations to I ≥ 150000; let i denote the current iteration and initialize i = 1;
(3c2) use the set of feature maps output by each structural layer of the feature extraction network in the image reconstruction network model as input of the side-output edge detection network and propagate forward:
(3c3) the side-output edge detection layer extracts the coarse edges of the target from the feature map set, yielding the coarse edge corresponding to each feature map;
(3c4) the set of coarse edges output by the side-output edge detection layer SODL is used as input of the side-output edge fusion layer SOFL, which performs a weighted fusion of the coarse edges to obtain the final predicted edge; here the fused feature is formed by merging the coarse edges and ω_fuse denotes the parameters of the side-output edge fusion layer;
(3c5) compute the loss value L_edge of the edge detection network from the side-output edge detection loss L_side and the side-output edge fusion loss L_fuse, compute the gradient g(ω) of the network parameters by backpropagation, and update the network parameters ω by gradient descent;
(3c6) check whether i = I holds; if so, the trained side-output edge detection network model Q is obtained; otherwise set i = i + 1 and return to step (3c2).
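A minimal PyTorch sketch of such side-output layers is given below; bilinear upsampling stands in for the deconvolution layer, and the channel widths follow the conv_2 to conv_5 blocks described in the second embodiment, so both are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideOutputEdge(nn.Module):
    """Sketch of SODL + SOFL: one 1x1 edge head per backbone feature map,
    upsampled to the input resolution, then fused by a 1x1 convolution."""

    def __init__(self, in_channels=(64, 128, 256, 512)):
        super().__init__()
        # one side-output head (SODL) per feature map: 1x1 conv -> 1 channel
        self.side_heads = nn.ModuleList([nn.Conv2d(c, 1, kernel_size=1) for c in in_channels])
        # fusion layer (SOFL): 1x1 conv over the stacked side outputs -> 1 channel
        self.fuse = nn.Conv2d(len(in_channels), 1, kernel_size=1)

    def forward(self, feats, out_size):
        # feats: list of (B, C_i, H_i, W_i) feature maps from the feature extraction network
        sides = [F.interpolate(head(f), size=out_size, mode="bilinear", align_corners=False)
                 for head, f in zip(self.side_heads, feats)]
        fused = self.fuse(torch.cat(sides, dim=1))
        return sides, fused   # coarse side edges and the final predicted edge
```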
Step 4: constructing and training an edge correction network model Z:
(4a) construct an edge correction network model Z composed of a sequentially connected atrous spatial pyramid pooling (ASPP) model F_γ and a softmax activation function output layer, wherein the ASPP model F_γ consists of a plurality of convolutional layers and pooling layers connected in sequence;
(4b) define the loss function L_corr of the edge correction network model Z, which is computed from the coarse segmentation result of the target frame output by the edge detection layer, the prediction result of the ASPP model F_γ, and the image edges obtained by the Canny algorithm, wherein M denotes the number of classes of the pixels in the mask and the loss is normalized by the total number of pixels in the mask;
(4c) setting the maximum number of iterations H and iteratively training the edge correction network model Z, according to its loss function, on the outputs of the image reconstruction network model R and the edge detection network model Q, to obtain the trained edge correction network model Z; this is realized as follows:
(4c1) set the maximum number of iterations to H ≥ 150000; let h denote the current iteration and initialize h = 1;
(4c2) use the coarse segmentation result of the target frame output by the image reconstruction network model R and the edge detection result output by the edge detection network model Q as input of the edge correction network model Z and propagate forward:
(4c2.1) the edge correction network first concatenates the coarse segmentation result of the target frame and the edge detection results along the channel dimension, obtaining a feature map of size H × W × (K + 1);
(4c2.2) the feature map is used as input of the ASPP model F_γ to obtain a prediction with an enlarged receptive field;
(4c2.3) the enlarged-receptive-field prediction is used as input of the softmax activation function output layer, and the segmentation label of each pixel is determined from the probability that the pixel belongs to each class, so that after edge fusion correction of the target segmentation mask of the target frame a more accurate target segmentation mask O_t is obtained, where O_t denotes the predicted segmentation label of the target frame I_t;
(4c3) compute the loss value L_corr of the edge correction network, compute the gradient g(c) of the network parameters by backpropagation, and update the network parameters c by gradient descent;
(4c4) check whether h = H holds; if so, the trained edge correction network model Z is obtained; otherwise set h = h + 1 and return to step (4c2).
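A minimal sketch of this correction step is given below; the ASPP-style module is passed in as a generic component (see the ASPP sketch in the second embodiment), and the interface with one coarse-mask channel plus K edge channels is an assumption.

```python
import torch
import torch.nn as nn

class EdgeCorrection(nn.Module):
    """Sketch of the edge correction network Z: concatenate the coarse mask with the
    K edge maps, run an ASPP-style module, and predict per-pixel class probabilities."""

    def __init__(self, aspp: nn.Module, aspp_out_channels: int, num_classes: int):
        super().__init__()
        self.aspp = aspp                                   # F_gamma (any ASPP-style module)
        self.classifier = nn.Conv2d(aspp_out_channels, num_classes, kernel_size=1)

    def forward(self, coarse_mask, edge_maps):
        # coarse_mask: (B, 1, H, W); edge_maps: (B, K, H, W) -> (B, K + 1, H, W)
        x = torch.cat([coarse_mask, edge_maps], dim=1)
        x = self.aspp(x)                                   # enlarged-receptive-field features
        logits = self.classifier(x)
        return logits.softmax(dim=1)                       # per-pixel class probabilities
```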
Step 5: combine the trained image reconstruction neural network R, the side-output edge detection network Q and the edge correction network model Z to obtain a video target segmentation model that corrects the segmentation result with the image target edges. The combination is performed as follows: the intermediate feature maps extracted by the image reconstruction neural network R are fed into the side-output edge detection network Q to obtain the target edge prediction map, and the target segmentation mask prediction map output by the image reconstruction neural network R together with the target edge prediction map output by the side-output edge detection network Q are used as input of the edge correction network model Z, yielding the trained video target segmentation model based on correcting the segmentation result with the image target edges.
Step 6: obtaining the self-supervised video target segmentation result:
the frame images in the test set V_test are used as input of the video target segmentation model and propagated forward to obtain the predicted segmentation labels of all test frame images, and the final segmentation result images are obtained according to the predicted segmentation labels of the test frame images.
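The combination described in step 5 and the test-time forward pass of step 6 can be sketched as a single pipeline; the module interfaces (R returning its intermediate feature maps together with a coarse mask, Q and Z as in the sketches above) are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class VideoTargetSegmenter(nn.Module):
    """Sketch of the combined model: R produces features and a coarse mask,
    Q predicts target edges from R's intermediate feature maps, and Z corrects
    the coarse mask with those edges."""

    def __init__(self, reconstructor, edge_detector, edge_corrector):
        super().__init__()
        self.R = reconstructor      # image reconstruction network (feature extractor + mask head)
        self.Q = edge_detector      # side-output edge detection network
        self.Z = edge_corrector     # edge correction network

    @torch.no_grad()
    def forward(self, frame, reference_frames):
        feats, coarse_mask = self.R(frame, reference_frames)   # intermediate feature maps + coarse mask
        _, edge_map = self.Q(feats, out_size=coarse_mask.shape[-2:])
        probs = self.Z(coarse_mask, edge_map)                   # corrected per-pixel probabilities
        return probs.argmax(dim=1)                              # predicted segmentation labels
```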
Example two: the overall steps of this embodiment are the same as those of the first embodiment, and specific values are given for setting some of the parameters, so as to further describe the implementation process of the present invention:
Step 1) obtaining a training sample set, a validation sample set and a test sample set:
Step 1a) acquire S multi-class video sequences from a video target segmentation data set and preprocess them to obtain a frame sequence set V; in this embodiment the multi-class video sequences are acquired from the YouTube-VOS data set, with S = 4453 and M = 50;
Step 1b) set the number of target classes of the frame sequence set V to C = 94 and the class set to Class = {c_num | 1 ≤ num ≤ C}, where c_num denotes the num-th target class; multiple classes of targets may appear in each frame sequence;
Step 1c) randomly extract more than half of the frame sequences from the frame sequence set V to form the training sample set V_train, where S/2 < N < S; for each frame sequence in the training set, scale each target frame picture to be segmented into an image block of size p × h and convert the RGB images to Lab; extract half of the remaining frame sequences to form the validation sample set V_val, where J ≤ S/4, and let the other half form the test sample set V_test, where T ≤ S/4; these are likewise converted from RGB to Lab;
set the crop box size to x × y, crop each frame picture to be segmented in a training frame sequence to obtain the cropped frame picture, normalize the cropped frame pictures, and let the normalized frame pictures form the preprocessed training frame sequence, whose m-th element is the m-th frame sequence in the training sample set;
in this embodiment x = 256, y = 256, p = 256 and h = 3;
step 2), constructing an image reconstruction neural network model R:
step 2a) constructing a structure of an image reconstruction neural network model R:
constructing an image reconstruction neural network model formed by a feature extraction network, wherein the feature extraction network is a residual network comprising a plurality of convolutional layers, a plurality of pooling layers, a plurality of residual unit modules and a single fully connected layer connected in sequence;
The feature extraction network comprises 17 convolutional layers and 1 fully connected layer. These 18 layers are divided into 5 blocks, conv_1 to conv_5. conv_1 is a single convolutional layer with kernel size 7 × 7 and 64 channels. conv_2 comprises two convolutional layers with kernel size 3 × 3, 64 channels and stride 1. conv_3 comprises two convolutional layers with kernel size 3 × 3 and 128 channels, where the first convolutional layer has stride 2 and the second has stride 1. conv_4 comprises two convolutional layers with kernel size 3 × 3, 256 channels and stride 1. conv_5 comprises two convolutional layers with kernel size 3 × 3, 512 channels and stride 1;
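A direct PyTorch transcription of this block structure might look as follows; the stride of conv_1 and the omission of the pooling layers, residual connections and the final fully connected layer are simplifications, since they are not fully specified above.

```python
import torch.nn as nn

def _block(in_ch, out_ch, first_stride=1):
    """Two 3x3 convolutions; only the first may be strided (as in conv_3)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=first_stride, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
    )

feature_extractor = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),   # conv_1 (stride assumed)
    _block(64, 64),                                          # conv_2
    _block(64, 128, first_stride=2),                         # conv_3
    _block(128, 256),                                        # conv_4
    _block(256, 512),                                        # conv_5
)
```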
Step 2b) define the loss function of the image reconstruction neural network model:

L_mix = α·L_cls + (1 - α)·L_reg

where L_cls is the cross-entropy loss function of the quantized-image reconstruction task: for the training sample set V_train, E cluster centroid points μ_1, μ_2, ..., μ_E are selected, with E ≤ 50; the class of each sample is computed from its distance to the cluster centroid points, and the positions of the cluster centroid points are corrected so that the same target carries the same label across frames while different targets carry different labels. Here c_i^t denotes the class to which the i-th pixel of a given frame picture I_t belongs and ĉ_i^t denotes the prediction obtained with the K-means algorithm. L_reg is the regression loss function of the RGB image reconstruction task, defined over the real target frame pixels I_t and the reconstructed target frame pixels Î_t; α is the weight coefficient of the reconstruction terms, with 0.1 ≤ α ≤ 0.9.
In this embodiment, K = 16 and α = 0.6;
step 3) iterative training is carried out on the image reconstruction neural network model:
Step 3a) set the network hyper-parameters of the feature extraction network to θ and the maximum number of iterations to N ≥ 150000; initialize the current iteration n = 1;
in this embodiment N = 300000, so that the model is trained more thoroughly;
Step 3b) use the target frame pictures of the training sample set V_train as input of the image reconstruction neural network model R and propagate forward:
Step 3b1) for each target frame I_t to be segmented, select the q preceding frames as reference frames {I'_0, I'_1, ..., I'_q}, with 2 ≤ q ≤ 5. The target frame I_t and its reference frame set are fed into the feature extraction network Φ(·; θ), which extracts features from I_t and from each reference frame image, yielding the target frame feature f_t = Φ(I_t; θ) and the reference frame features f'_0 = Φ(I'_0; θ), ..., f'_q = Φ(I'_q; θ). The target frames {I_t | 0 ≤ t ≤ N} of the training sample set are used as input of the K-means algorithm to obtain the quantized image reconstruction loss value L_cls, and the reconstructed target frame Î_t together with the real target frame I_t are used as input of the RGB image reconstruction task to obtain the RGB image reconstruction loss value L_reg;
Step 3c) compute the loss value L_mix of the image reconstruction neural network from the cross-entropy loss L_cls and the regression loss L_reg, compute the gradient g(θ) of the network parameters by backpropagation, and update the network parameters θ by gradient descent with the update formula

θ' = θ_n - γ · ∂L_mix^n / ∂θ_n

where θ' is the result of updating θ_n, γ is the learning rate with 1e-6 ≤ γ ≤ 1e-3, L_mix^n is the loss function value of the image reconstruction neural network after the n-th iteration, and ∂ denotes the partial derivative.
In this embodiment the initial learning rate is γ = 0.001; it is reduced to 0.0005 at iteration 150000, to 0.00025 at iteration 200000 and to 0.000125 at iteration 250000. The optimizer is Adam; the learning rate is decayed after a certain number of iterations to keep the loss function from getting stuck in a local minimum;
Step 3d) check whether n = N holds; if so, the trained image reconstruction neural network R is obtained; otherwise set n = n + 1 and go to step (3b);
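The Adam optimizer and the staged learning-rate decay described above can be sketched as follows; MultiStepLR is used as a convenient stand-in for the manual schedule, and the model and loss are placeholders.

```python
import torch

# placeholder network standing in for the image reconstruction model R
model = torch.nn.Conv2d(3, 64, kernel_size=7, padding=3)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# 0.001 -> 0.0005 -> 0.00025 -> 0.000125 at iterations 150k, 200k and 250k
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150_000, 200_000, 250_000], gamma=0.5)

for n in range(3):                                         # N = 300000 in this embodiment; shortened here
    optimizer.zero_grad()
    loss = model(torch.randn(1, 3, 256, 256)).mean()       # stand-in for L_mix on one batch
    loss.backward()
    optimizer.step()
    scheduler.step()
```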
Step 4) constructing a side-output edge detection network model Q:
Step 4a) construct the structure of the side-output edge detection network model Q:
construct an edge detection network model Q comprising a side-output edge detection layer SODL and a side-output edge fusion layer SOFL connected in sequence, wherein the side-output edge detection layer SODL comprises a deconvolution layer and a convolutional layer with a convolution kernel size of 1 × 1 and one output channel, and the side-output edge fusion layer SOFL is a convolutional layer with a convolution kernel size of 1 × 1 and one channel;
Step 4b) define the loss function of the side-output edge detection network model:

L_edge = L_side + L_fuse

where L_side is the side-output edge detection loss function, a weighted sum of the per-side losses in which β_i is the weight coefficient of the i-th side-output edge detection network and the i-th term is the loss of that network's prediction result; in the per-side loss, e denotes the target edge ground truth of the input image, |e^-| the number of edge pixels in the ground truth, |e^+| the number of non-edge pixels in the ground truth, and ω_i the parameters of the convolutional layer; L_fuse is the edge fusion loss function computed on the fused side outputs;
step 5) performing iterative training on the side output edge detection network model Q:
Step 5a) set the maximum number of iterations to I ≥ 150000 and initialize the current iteration i = 1;
in this embodiment I = 300000, so that the model is trained more thoroughly;
Step 5b) use the set of feature maps output by each structural layer of the feature extraction network in the image reconstruction network model as input of the side-output edge detection network and propagate forward:
Step 5b1) the side-output edge detection layer extracts the coarse edges of the target from the feature map set, yielding the coarse edge corresponding to each feature map;
Step 5b2) the set of coarse edges output by the side-output edge detection layer SODL is used as input of the side-output edge fusion layer SOFL, which performs a weighted fusion of the coarse edges to obtain the final predicted edge; here the fused feature is formed by merging the coarse edges and ω_fuse denotes the parameters of the side-output edge fusion layer;
Step 5c) compute the loss value L_edge of the edge detection network from the side-output edge detection loss L_side and the side-output edge fusion loss L_fuse, compute the gradient g(ω) of the network parameters by backpropagation, and update the network parameters ω by gradient descent with the update formula

ω' = ω_i - β · ∂L_edge^i / ∂ω_i

where ω' is the result of updating ω_i, β is the learning rate with 1e-6 ≤ β ≤ 1e-3, L_edge^i is the loss function value of the side-output edge detection network after the i-th iteration, and ∂ denotes the partial derivative.
In this embodiment the initial learning rate is β = 0.001; it is reduced to 0.0005 at iteration 150000, to 0.00025 at iteration 200000 and to 0.000125 at iteration 250000. The optimizer is Adam; the learning rate is decayed after a certain number of iterations to keep the loss function from getting stuck in a local minimum;
Step 5d) check whether i = I holds; if so, the trained side-output edge detection network model Q is obtained; otherwise set i = i + 1 and go to step (5b);
step 6), constructing an edge correction network model Z:
step 6a) constructing a structure of an edge correction network model Z:
construct an edge correction network model Z comprising a sequentially connected atrous spatial pyramid pooling (ASPP) model F_γ and a softmax activation function output layer, wherein the ASPP model F_γ consists of a plurality of convolutional layers and pooling layers connected in sequence;
The ASPP model F_γ comprises a convolutional layer, a pooling pyramid and a pooling block. The convolutional layer has kernel size 1 × 1; the pooling pyramid comprises three parallel convolutional layers with kernel size 3 × 3; the pooling block comprises a 1 × 1 pooling layer, a convolutional layer with kernel size 1 × 1 and an up-sampling layer. The feature maps output by the convolutional layer, the pooling pyramid and the pooling block are concatenated and then processed by a 1 × 1 pooling layer to obtain the output of the ASPP model F_γ;
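An illustrative PyTorch module in this spirit is sketched below; the dilation rates and the output channel count are assumptions, since they are not specified above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Sketch of F_gamma: a 1x1 branch, three parallel 3x3 atrous branches,
    and a global-pooling branch, concatenated and projected back."""

    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.atrous = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r) for r in rates])
        self.pool_proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)   # conv inside the pooling block
        self.project = nn.Conv2d(out_ch * (2 + len(rates)), out_ch, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [self.branch1(x)] + [conv(x) for conv in self.atrous]
        pooled = F.adaptive_avg_pool2d(x, 1)                        # pooling block: global pool ...
        pooled = F.interpolate(self.pool_proj(pooled), size=(h, w), mode="bilinear",
                               align_corners=False)                 # ... 1x1 conv, then upsample
        feats.append(pooled)
        return self.project(torch.cat(feats, dim=1))                # concat + 1x1 projection
```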
Step 6b) define the loss function L_corr of the edge correction network model Z, which is computed from the coarse segmentation result of the target frame output by the edge detection layer, the prediction result of the ASPP model F_γ, and the image edges obtained by the Canny algorithm, wherein M denotes the number of classes of the pixels in the mask and the loss is normalized by the total number of pixels in the mask;
step 7) performing iterative training on the edge correction network model Z:
Step 7a) set the maximum number of iterations to H ≥ 150000 and initialize the current iteration h = 1;
in this embodiment H = 300000, so that the model is trained more thoroughly;
Step 7b) use the coarse segmentation result of the target frame output by the image reconstruction network model and the edge detection result output by the edge detection network model as input of the edge correction network model Z and propagate forward:
Step 7b1) the edge correction network first concatenates the coarse segmentation result of the target frame and the edge detection results along the channel dimension, obtaining a feature map of size H × W × (K + 1);
Step 7b2) the feature map is used as input of the ASPP model F_γ to obtain a prediction with an enlarged receptive field;
Step 7b3) the enlarged-receptive-field prediction is used as input of the softmax activation function output layer, and the segmentation label of each pixel is determined from the probability that the pixel belongs to each class, so that after edge fusion correction of the target segmentation mask of the target frame a more accurate target segmentation mask O_t is obtained, where O_t denotes the predicted segmentation label of the target frame I_t;
Step 7c) compute the loss value L_corr of the edge correction network, compute the gradient g(c) of the network parameters by backpropagation, and update the network parameters c by gradient descent with the update formula

c' = c_h - α · ∂L_corr^h / ∂c_h

where c' is the result of updating c_h, α is the learning rate with 1e-6 ≤ α ≤ 1e-3, L_corr^h is the loss function value of the edge correction network after the h-th iteration, and ∂ denotes the partial derivative.
In this embodiment the initial learning rate is α = 0.001; it is reduced to 0.0005 at iteration 150000, to 0.00025 at iteration 200000 and to 0.000125 at iteration 250000. The optimizer is Adam; the learning rate is decayed after a certain number of iterations to keep the loss function from getting stuck in a local minimum;
Step 7d) check whether h = H holds; if so, the trained edge correction network model Z is obtained; otherwise set h = h + 1 and go to step (7b);
Step 8) obtaining the self-supervised video target segmentation result:
the frame images of the test set V_test are used as input of the trained video target segmentation model based on correcting the segmentation result with the image target edges, which consists of the image reconstruction neural network R, the side-output edge detection network Q and the edge fusion network Z, and are propagated forward to obtain the segmentation labels of all test frame images; the segmentation result images are determined according to the test frame image segmentation labels.
The technical effects of the present invention are further explained by simulation experiments as follows:
1. simulation conditions and contents:
4453 video sequences were acquired from the YouTube-VOS dataset for use in simulation experiments;
The simulation experiment is carried out on a server with an Intel(R) Core(TM) i7-7800X CPU @ 3.5 GHz, 64 GB of memory and an NVIDIA GeForce RTX 2080 Ti GPU. The operating system is Ubuntu 16.04, the deep learning framework is PyTorch, and the programming language is Python 3.6;
The invention is compared in simulation with existing video target segmentation methods. To compare the video target segmentation results quantitatively, two evaluation indexes are adopted in the experiment, namely region similarity J and contour similarity F; the higher these two indexes are, the better the segmentation result. The simulation results are shown in Table 1.
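For reference, region similarity J is commonly computed as the intersection-over-union between the predicted mask and the ground-truth mask; a minimal sketch is given below (contour similarity F additionally requires matching boundary pixels and is omitted here).

```python
import numpy as np

def region_similarity_j(pred_mask, gt_mask):
    """Intersection-over-union between two binary masks (the usual definition of J)."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                        # both masks empty: perfect agreement
    return np.logical_and(pred, gt).sum() / union
```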
TABLE 1. Region similarity J and contour similarity F of the proposed method and the compared video target segmentation methods.
2. Simulation result analysis:
As can be seen from Table 1, both the J index and the F index are clearly improved compared with the existing video segmentation methods, which shows that the self-supervision-based video target segmentation technique constructed by the invention can effectively alleviate problems such as target occlusion and tracking drift and thereby improve video target segmentation accuracy; it therefore has important practical significance and value.
The simulation analysis proves the correctness and effectiveness of the method provided by the invention.
Parts of the invention that belong to the common general knowledge of those skilled in the art have not been described in detail.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.

Claims (6)

1. A video object segmentation method based on self-supervision is characterized by comprising the following steps:
(1) acquiring a training sample set, a verification sample set and a test sample set:
obtaining video sequences from a video target segmentation data set, preprocessing them to obtain a frame sequence set V, and dividing the frame sequences in the set to obtain a training sample set V_train, a validation sample set V_val and a test sample set V_test;
(2) Constructing and training an image reconstruction neural network model R:
(2a) constructing an image reconstruction neural network model R formed by a feature extraction network, wherein the feature extraction network is a residual network comprising a plurality of convolutional layers, a plurality of pooling layers, a plurality of residual unit modules and a single fully connected layer connected in sequence;
(2b) defining a loss function of the image reconstruction neural network model R:

L_mix = α·L_cls + (1 - α)·L_reg

wherein L_cls is the cross-entropy loss function of the quantized-image reconstruction task: for the training sample set V_train, E cluster centroid points μ_1, μ_2, ..., μ_E are selected, with E ≤ 50; the class of each sample is computed from its distance to the cluster centroid points, the number of target classes contained in the frame sequence set V is set to C, and the positions of the cluster centroid points are corrected so that the same target carries the same label across frames while different targets carry different labels. Here c_i^t denotes the class to which the i-th pixel of a given frame picture I_t belongs, and ĉ_i^t denotes the prediction obtained with the K-means algorithm. L_reg is the regression loss function of the RGB image reconstruction task, defined over the real target frame pixels I_t and the reconstructed target frame pixels Î_t; α is the weight coefficient of the reconstruction terms, with 0.1 ≤ α ≤ 0.9;
(2c) setting the feature extraction network parameters and the maximum number of iterations N, and, according to the loss function of the image reconstruction neural network model R, iteratively training the image reconstruction neural network model R with the target frame pictures of the training sample set V_train to obtain a trained image reconstruction neural network model R;
(3) constructing and training a side output edge detection network model Q:
(3a) constructing an edge detection network model Q comprising a side output edge detection layer SODL and a side output edge fusion layer SOFL connected in sequence, wherein the side output edge detection layer SODL comprises a deconvolution layer and a convolution layer with a convolution kernel size of 1 × 1 and an output channel number of 1, and the side output edge fusion layer SOFL is a convolution layer with a convolution kernel size of 1 × 1 and a channel number of 1;
(3b) defining a loss function of the side output edge detection network model Q:
L_edge = L_side + L_fuse
wherein L_side denotes the side output edge detection loss function, β_i denotes the weight coefficient of the ith side output edge detection network, and the corresponding per-side loss function evaluates the prediction result of the ith side output edge detection network; e denotes the target edge ground truth of the input image, |e−| denotes the number of pixels in the image target edge ground truth, |e+| denotes the number of non-edge pixels in the image target edge ground truth, ω_i denotes the parameters of the convolutional layer, and L_fuse denotes the edge fusion loss function [the original formula images for L_side, the per-side loss and L_fuse are not reproduced];
(3c) setting a maximum number of iterations I, and, according to the loss function of the side output edge detection network model Q, iteratively training the side output edge detection network model Q using the feature map set output by each structural layer of the feature extraction network in the image reconstruction neural network model R, to obtain a trained side output edge detection network model Q;
(4) constructing and training an edge correction network model Z:
(4a) constructing an edge correction network model Z composed of an atrous spatial pyramid pooling (ASPP) model F_γ and a softmax activation function output layer connected in sequence, wherein the atrous spatial pyramid pooling model F_γ is composed of a plurality of convolution layers and pooling layers connected in sequence;
(4b) defining a loss function of the edge correction network model Z:
[the formula image defining the loss function L_corr is not reproduced] wherein the loss involves the coarse segmentation result of the target frame output by the edge detection layer, the prediction result of the atrous spatial pyramid pooling model F_γ, the edges of the image obtained by the Canny algorithm, the number M of classes of pixels in the mask, and the total number of pixels in the mask;
(4c) setting a maximum iteration number H, performing iterative training on the edge correction network model Z according to a loss function of the edge correction network model Z and by using output results of the image reconstruction network model R and the edge detection network model Q to obtain a trained edge correction network model Z;
(5) combining the trained image reconstruction neural network R, the side output edge detection network Q and the edge correction network model Z to obtain a video target segmentation model in which the segmentation result is corrected based on the image target edges;
(6) obtaining a self-supervision video target segmentation result:
the frame images in the test sample set V_test are used as the input of the video target segmentation model for forward propagation to obtain the predicted segmentation labels of all test frame images, and the final segmentation result images are obtained from these predicted segmentation labels.
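As an illustrative sketch (not the claimed implementation) of the mixed loss in step (2b) of claim 1, the snippet below combines a cross-entropy term over K-means-quantized pixel labels with an RGB regression term, weighted by α. It assumes PyTorch; the tensor shapes and the choice of an L1 regression term are assumptions, since the exact formula images are not reproduced in the claim.

```python
import torch
import torch.nn.functional as F

def mixed_reconstruction_loss(cls_logits, cluster_labels, recon_rgb, target_rgb, alpha=0.5):
    """
    Sketch of L_mix = alpha * L_cls + (1 - alpha) * L_reg, with 0.1 <= alpha <= 0.9.

    cls_logits:     (B, E, H, W) predicted logits over E cluster centroids
    cluster_labels: (B, H, W)    per-pixel classes assigned by K-means (pseudo labels)
    recon_rgb:      (B, 3, H, W) reconstructed target frame
    target_rgb:     (B, 3, H, W) real target frame
    """
    # Cross-entropy loss of the quantized image reconstruction task.
    l_cls = F.cross_entropy(cls_logits, cluster_labels)
    # Regression loss of the RGB image reconstruction task (L1 here is an assumption).
    l_reg = F.l1_loss(recon_rgb, target_rgb)
    return alpha * l_cls + (1.0 - alpha) * l_reg
```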
2. The method of claim 1, wherein the training sample set V_train, the verification sample set V_val and the test sample set V_test in step (1) are obtained as follows:
(1a) S multi-class video sequences are obtained from a video target segmentation data set and preprocessed to obtain a frame sequence set V, with S ≥ 3000, wherein the kth element of V is the frame sequence consisting of the preprocessed image frames of the kth video, whose nth image frame belongs to a sequence of M ≥ 30 frames [the original set notation images are not reproduced];
(1b) more than half of the frame sequences are randomly extracted from the frame sequence set V to form the training sample set V_train, the number N of training sequences satisfying S/2 < N < S; for each frame sequence in the training sample set, each target frame picture to be segmented is scaled to an image block of size p × h and the picture format is converted from RGB to Lab; half of the remaining frame sequences are extracted to form the verification sample set V_val, containing J ≤ S/4 sequences; the other half constitutes the test sample set V_test, containing T ≤ S/4 sequences, and its picture format is likewise converted from RGB to Lab.
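A minimal sketch of the per-frame preprocessing described in (1b) of claim 2 (scaling each target frame to p × h and converting from RGB to Lab), assuming OpenCV; the default values of p and h are placeholders, not values fixed by the claim.

```python
import cv2
import numpy as np

def preprocess_frame(frame_rgb: np.ndarray, p: int = 256, h: int = 256) -> np.ndarray:
    """Scale a frame to p x h and convert it from RGB to Lab."""
    # cv2.resize expects (width, height).
    resized = cv2.resize(frame_rgb, (p, h), interpolation=cv2.INTER_LINEAR)
    lab = cv2.cvtColor(resized, cv2.COLOR_RGB2LAB)
    return lab
```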
3. The method of claim 1, wherein the iterative training of the image reconstruction neural network model R in step (2c) is realized as follows:
(2c1) setting the network parameters of the feature extraction network as θ, setting the maximum number of iterations N ≥ 150000, denoting the current iteration number by n, and initializing n = 1;
(2c2) using the target frame pictures in the training sample set V_train as the input of the image reconstruction neural network model R for forward propagation:
for each target frame I_t to be segmented, the q frames preceding it are selected as reference frames {I′_0, I′_1, ..., I′_q}, with 2 ≤ q ≤ 5; the target frame I_t and its corresponding reference frame set are used as the input of the feature extraction network Φ(·; θ), which extracts features from I_t and from each reference frame image, yielding the target frame feature f_t = Φ(I_t; θ) and the reference frame features f′_0 = Φ(I′_0; θ), ..., f′_q = Φ(I′_q; θ); the target frames {I_t | 0 ≤ t ≤ N} of the training sample set are used as the input of the K-means algorithm to obtain the quantized image reconstruction loss value L_cls, and the reconstructed target frame and the real target frame I_t are used as the input of the RGB image reconstruction task to obtain the RGB image reconstruction loss value L_reg;
(2c3) using the loss function L_mix, calculating the loss value of the image reconstruction neural network from the cross entropy loss L_cls and the regression loss L_reg, calculating the gradient g(θ) of the network parameters by back propagation, and then updating the network parameters θ by gradient descent;
(2c4) judging whether n = N holds; if so, obtaining the trained image reconstruction neural network R; otherwise, letting n = n + 1 and returning to step (2c2).
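A schematic sketch of the training loop in steps (2c1)-(2c4) of claim 3, assuming PyTorch. Here `model`, `kmeans_assign` and `loss_fn` are hypothetical stand-ins for the feature extraction and reconstruction network, the K-means pseudo-label assignment and the mixed loss L_mix, and plain SGD stands in for the "gradient descent method" wording of the claim.

```python
import torch

def train_image_reconstruction(model, kmeans_assign, loader, loss_fn,
                               max_iters=150000, lr=1e-3):
    """
    model:         maps (target_frame, reference_frames) -> (cls_logits, recon_rgb)
    kmeans_assign: maps a target frame to per-pixel cluster labels (pseudo labels)
    loss_fn:       the mixed loss L_mix sketched earlier
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # gradient descent update of theta
    n = 0
    while n < max_iters:
        for target_frame, reference_frames in loader:
            cls_logits, recon_rgb = model(target_frame, reference_frames)  # forward propagation
            cluster_labels = kmeans_assign(target_frame)                   # K-means pseudo labels
            loss = loss_fn(cls_logits, cluster_labels, recon_rgb, target_frame)
            optimizer.zero_grad()
            loss.backward()   # back propagation computes g(theta)
            optimizer.step()  # theta <- theta - lr * g(theta)
            n += 1
            if n >= max_iters:
                break
    return model
```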
4. The method of claim 1, wherein the iterative training of the side output edge detection network model Q in step (3c) is realized as follows:
(3c1) setting the maximum number of iterations I ≥ 150000, denoting the current iteration number by i, and initializing i = 1;
(3c2) using the feature map set output by each structural layer of the feature extraction network in the image reconstruction network model as the input of the side output edge detection network for forward propagation:
(3c3) the side output edge detection layer obtains the rough edges of the target from the feature map set, yielding the rough edge corresponding to each feature map;
(3c4) the rough edge set output by the side output edge detection layer SODL is used as the input of the side output edge fusion layer SOFL, and the rough edges are weighted and fused to obtain the final predicted edge, wherein the fused feature is formed by merging the rough edges and ω_fuse denotes the parameters of the side output edge fusion layer;
(3c5) using the loss function L_edge, calculating the loss value of the edge detection network from the side output edge detection loss L_side and the side output edge fusion loss L_fuse, calculating the gradient g(ω) of the network parameters by back propagation, and then updating the network parameters ω by gradient descent;
(3c6) judging whether i = I holds; if so, obtaining the trained side output edge detection network model Q; otherwise, letting i = i + 1 and returning to step (3c2).
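A minimal sketch of the side output edge detection layer SODL and side output edge fusion layer SOFL described in step (3a) and steps (3c3)-(3c4), assuming PyTorch. The per-level channel counts are placeholders, and bilinear upsampling stands in for the deconvolution layer of SODL.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideOutputEdgeDetector(nn.Module):
    """Sketch of SODL + SOFL: a 1x1, 1-channel conv per feature level, upsampled to a
    common resolution, then fused by a 1x1, 1-channel convolution."""

    def __init__(self, in_channels_per_level):
        super().__init__()
        # SODL: one side branch per feature map from the feature extraction network.
        self.side_convs = nn.ModuleList(
            [nn.Conv2d(c, 1, kernel_size=1) for c in in_channels_per_level]
        )
        # SOFL: 1x1 conv fusing the stacked side outputs into one edge map.
        self.fuse = nn.Conv2d(len(in_channels_per_level), 1, kernel_size=1)

    def forward(self, feature_maps, out_size):
        side_edges = []
        for conv, fmap in zip(self.side_convs, feature_maps):
            edge = conv(fmap)
            # Bilinear upsampling stands in for the deconvolution layer of SODL.
            edge = F.interpolate(edge, size=out_size, mode="bilinear", align_corners=False)
            side_edges.append(edge)
        # Weighted fusion of the rough edges into the final predicted edge.
        fused = self.fuse(torch.cat(side_edges, dim=1))
        return side_edges, fused
```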
5. The method of claim 1, wherein the iterative training of the edge correction network model Z in step (4c) is realized as follows:
(4c1) setting the maximum number of iterations H ≥ 150000, denoting the current iteration number by h, and initializing h = 1;
(4c2) using the target frame coarse segmentation result output by the image reconstruction network model R and the edge detection result output by the edge detection network model Q as the input of the edge correction network model Z for forward propagation:
(4c2.1) the edge correction network first concatenates the coarse segmentation result of the target frame and the edge detection result along the channel dimension to obtain a feature map of size H × W × (K + 1);
(4c2.2) the feature map is fed into the atrous spatial pyramid pooling model F_γ to obtain a prediction result with an expanded receptive field;
(4c2.3) the expanded receptive field prediction result is used as the input of the softmax activation function output layer, and the segmentation label of each pixel is determined according to the probability that the pixel belongs to each category, so that a more accurate target segmentation mask is obtained after edge fusion correction of the target segmentation mask of the target frame, where O_t denotes the predicted segmentation label of the target frame I_t;
(4c3) using the loss function L_corr, calculating the loss value of the edge correction network, calculating the gradient g(c) of the network parameters by back propagation, and then updating the network parameters c by gradient descent;
(4c4) judging whether h = H holds; if so, obtaining the trained edge correction network model Z; otherwise, letting h = h + 1 and returning to step (4c2).
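A minimal sketch of the forward pass of the edge correction network Z in steps (4c2.1)-(4c2.3), assuming PyTorch. The ASPP module `aspp` and its output channel count are placeholders standing in for the atrous spatial pyramid pooling model F_γ; channel counts are assumptions.

```python
import torch
import torch.nn as nn

class EdgeCorrectionHead(nn.Module):
    """Sketch of step (4c2): concatenate the coarse K-channel mask with the 1-channel
    edge map (giving K+1 channels), pass through an ASPP-style module, then softmax."""

    def __init__(self, num_classes: int, aspp: nn.Module, aspp_out_channels: int = 256):
        super().__init__()
        self.aspp = aspp  # stands in for F_gamma; must accept K+1 input channels
        self.classifier = nn.Conv2d(aspp_out_channels, num_classes, kernel_size=1)

    def forward(self, coarse_mask_logits, edge_map):
        # coarse_mask_logits: (B, K, H, W), edge_map: (B, 1, H, W)
        x = torch.cat([coarse_mask_logits, edge_map], dim=1)  # (B, K+1, H, W)
        x = self.aspp(x)                                      # expanded receptive field
        x = self.classifier(x)
        probs = torch.softmax(x, dim=1)                       # per-pixel class probabilities
        labels = probs.argmax(dim=1)                          # predicted segmentation label O_t
        return probs, labels
```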
6. The method of claim 1, wherein the video target segmentation model based on the image target edge correction of the segmentation result in step (5) is obtained as follows: the intermediate feature maps extracted by the image reconstruction neural network R are used as the input of the side output edge detection network Q to obtain a target edge prediction map, and the target segmentation mask prediction map output by the image reconstruction neural network R and the target edge prediction map output by the side output edge detection network Q are used as the input of the edge correction network model Z.
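A minimal sketch of the inference composition described in claim 6 and step (6), chaining the three trained networks; the function signatures follow the earlier sketches and are assumptions, not the claimed interfaces.

```python
import torch

@torch.no_grad()
def segment_test_frame(recon_net, edge_net, correction_net, frame, reference_frames):
    """
    recon_net:      image reconstruction network R (assumed to return coarse mask and features)
    edge_net:       side output edge detection network Q (SideOutputEdgeDetector above)
    correction_net: edge correction network Z (EdgeCorrectionHead above)
    """
    # R produces the coarse target segmentation mask and intermediate feature maps.
    coarse_mask, feature_maps = recon_net(frame, reference_frames)
    # Q turns the intermediate feature maps into a target edge prediction map.
    _, edge_map = edge_net(feature_maps, out_size=frame.shape[-2:])
    # Z fuses the coarse mask and the edges into the corrected segmentation labels.
    _, labels = correction_net(coarse_mask, edge_map)
    return labels
```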
CN202210658263.6A 2022-06-10 2022-06-10 Video target segmentation method based on self-supervision Pending CN114863348A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210658263.6A CN114863348A (en) 2022-06-10 2022-06-10 Video target segmentation method based on self-supervision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210658263.6A CN114863348A (en) 2022-06-10 2022-06-10 Video target segmentation method based on self-supervision

Publications (1)

Publication Number Publication Date
CN114863348A true CN114863348A (en) 2022-08-05

Family

ID=82624940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210658263.6A Pending CN114863348A (en) 2022-06-10 2022-06-10 Video target segmentation method based on self-supervision

Country Status (1)

Country Link
CN (1) CN114863348A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116129353A (en) * 2023-02-07 2023-05-16 佛山市顺德区福禄康电器科技有限公司 Method and system for intelligent monitoring based on image recognition
CN116129353B (en) * 2023-02-07 2024-05-07 广州融赋数智技术服务有限公司 Method and system for intelligent monitoring based on image recognition
CN116563218A (en) * 2023-03-31 2023-08-08 北京长木谷医疗科技股份有限公司 Spine image segmentation method and device based on deep learning and electronic equipment
CN116630697A (en) * 2023-05-17 2023-08-22 安徽大学 Image classification method based on biased selection pooling
CN116630697B (en) * 2023-05-17 2024-04-05 安徽大学 Image classification method based on biased selection pooling
CN117788492A (en) * 2024-02-28 2024-03-29 苏州元脑智能科技有限公司 Video object segmentation method, system, electronic device and storage medium
CN117788492B (en) * 2024-02-28 2024-04-26 苏州元脑智能科技有限公司 Video object segmentation method, system, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN110135267B (en) Large-scene SAR image fine target detection method
CN110910391B (en) Video object segmentation method for dual-module neural network structure
CN114863348A (en) Video target segmentation method based on self-supervision
CN109086811B (en) Multi-label image classification method and device and electronic equipment
CN107491734B (en) Semi-supervised polarimetric SAR image classification method based on multi-core fusion and space Wishart LapSVM
CN112668579A (en) Weak supervision semantic segmentation method based on self-adaptive affinity and class distribution
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN111428625A (en) Traffic scene target detection method and system based on deep learning
CN113159048A (en) Weak supervision semantic segmentation method based on deep learning
CN113780292A (en) Semantic segmentation network model uncertainty quantification method based on evidence reasoning
CN112613350A (en) High-resolution optical remote sensing image airplane target detection method based on deep neural network
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN114973019A (en) Deep learning-based geospatial information change detection classification method and system
CN114511785A (en) Remote sensing image cloud detection method and system based on bottleneck attention module
CN114998360A (en) Fat cell progenitor cell segmentation method based on SUnet algorithm
CN113344069B (en) Image classification method for unsupervised visual representation learning based on multi-dimensional relation alignment
CN114580501A (en) Bone marrow cell classification method, system, computer device and storage medium
CN114299291A (en) Interpretable artificial intelligent medical image semantic segmentation method
Bagwari et al. A comprehensive review on segmentation techniques for satellite images
CN116883432A (en) Method and device for segmenting focus image, electronic equipment and readable storage medium
CN111611919A (en) Road scene layout analysis method based on structured learning
CN117152427A (en) Remote sensing image semantic segmentation method and system based on diffusion model and knowledge distillation
CN116580243A (en) Cross-domain remote sensing scene classification method for mask image modeling guide domain adaptation
CN109726690B (en) Multi-region description method for learner behavior image based on DenseCap network
CN113313185A (en) Hyperspectral image classification method based on self-adaptive spatial spectral feature extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination