CN114863348A - Video target segmentation method based on self-supervision - Google Patents

Video target segmentation method based on self-supervision

Info

Publication number
CN114863348A
CN114863348A (application CN202210658263.6A)
Authority
CN
China
Prior art keywords
target
edge
network model
frame
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210658263.6A
Other languages
Chinese (zh)
Inventor
李阳阳
封星宇
赵逸群
刘睿娇
陈彦桥
焦李成
尚荣华
马文萍
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority: CN202210658263.6A
Publication: CN114863348A
Legal status: Pending


Classifications

    • G06V 20/49: Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G06N 3/045: Combinations of networks
    • G06N 3/08: Learning methods
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 20/46: Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computing Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a video target segmentation method based on self-supervision, which mainly addresses the problems of low segmentation accuracy and strong susceptibility to target occlusion and tracking drift in the prior art. The scheme comprises the following steps: 1) acquire video sequences from a video target segmentation data set, preprocess them, and divide them to obtain training, validation and test sample sets; 2) construct and train an image reconstruction neural network model, extracting target features with a self-supervised learning method based on a multi-pixel-scale image reconstruction task; 3) construct and train a side-output edge detection network model; 4) construct and train a self-supervised edge correction network model; 5) combine the three trained models to obtain a video target segmentation model; 6) feed the test set into the video target segmentation model to obtain the target segmentation results. The method effectively improves the generalization and accuracy of video target segmentation and can be used in fields such as autonomous driving, intelligent surveillance and intelligent unmanned aerial vehicle tracking.

Description

Video target segmentation method based on self-supervision
Technical Field
The invention belongs to the technical field of computer vision and further relates to video target segmentation technology, in particular to a video target segmentation method based on self-supervision, which can be used in fields such as autonomous driving, intelligent surveillance and intelligent unmanned aerial vehicle tracking.
Background
Computer vision aims to simulate the process by which humans establish visual perception and is a key link in the development of artificial intelligence technology; computer vision algorithms seek to imitate human visual behavior as closely and as accurately as possible and to provide perceptual information for downstream tasks. In the human perceptual system, the visual input changes continuously, and under the current state of visual technology the storage format closest to human perception is video, so a computer vision algorithm that handles video tasks has the capability of simulating human visual behavior.
The video target segmentation task is an important topic within video processing; its goal is to separate the targets of interest in a video sequence from the background. In recent years, owing to the excellent performance of deep learning in computer vision tasks (such as image recognition, target tracking and action recognition), deep-learning-based video target segmentation algorithms have become the mainstream approach to the task. The performance of a deep-learning-based video target segmentation algorithm depends on the scale of the neural network it uses, the performance of the neural network depends on a large amount of training data, and the larger the training data set, the better the generalization and robustness of the trained network. Under supervised learning, producing a video target segmentation training set is expensive and time-consuming: every pixel of each image must be labeled spatially and every frame of each video sequence must be labeled temporally. The performance of a video target segmentation model is also closely related to its structure, and reasonable optimization of the model's inference process can effectively reduce errors in the segmentation process.
The research goal of self-supervised learning is to train a deep learning model without any manual labels, so that the model can extract effective visual representations from large numbers of unlabeled images or video data sets; the extracted representations are then fine-tuned and used by downstream tasks. Self-supervised video target segmentation is designed for the specific task of semi-supervised video target segmentation: the video target segmentation model is trained with a self-supervised learning method, the trained model can be used directly for the video target segmentation task, and no manually annotated data set is required at any point during training.
Research on self-supervised video target segmentation largely follows two lines: first, designing better pretext tasks for training so that the model has stronger representation extraction ability; second, introducing additional mechanisms for the semi-supervised video target segmentation problem to reduce the influence of target occlusion and tracking drift. Vondrick et al. published an article entitled "Tracking Emerges by Colorizing Videos" at the European Conference on Computer Vision in 2018, proposing a self-supervised video tracking model that exploits the natural temporal coherence of color to learn to colorize gray-scale videos, further improving self-supervised video tracking; however, because the model propagates from previous frames, it is not robust to target occlusion and tracking drift. The CorrFlow method, presented in the article "Self-supervised Learning for Video Correspondence Flow" at the British Machine Vision Conference in 2019, introduced a restricted attention mechanism to raise the resolution of the model input and improve segmentation accuracy without increasing the burden on the computing equipment; however, the method does not consider the generalization of feature extraction across targets of different scales and performs poorly when target scales differ too much.
Disclosure of Invention
The invention aims to provide a video target segmentation method based on self-supervision that addresses the shortcomings of the prior art, namely the technical problems of low segmentation accuracy and strong susceptibility to target occlusion and tracking drift.
The idea for realizing the invention is as follows: first, target features are extracted with a self-supervised learning method based on a multi-pixel-scale image reconstruction task, so that the video target segmentation model can account for the features of both large and small targets and obtain better generalization; then, to counter the error accumulation that arises when the video target segmentation model segments the target, the semantic edges of the image are used to correct the target segmentation mask; finally, a self-supervised edge fusion network is designed to obtain a more accurate target segmentation mask.
In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:
(1) acquiring a training sample set, a verification sample set and a test sample set:
obtaining video sequences from a video target segmentation data set, preprocessing them to obtain a frame sequence set V, and dividing the frame sequences in the set to obtain a training sample set V_train, a validation sample set V_val and a test sample set V_test;
(2) Constructing and training an image reconstruction neural network model R:
(2a) constructing an image reconstruction neural network model R formed by a feature extraction network, wherein the feature extraction network is a residual network comprising a plurality of convolutional layers, a plurality of pooling layers, a plurality of residual unit modules and a single fully connected layer connected in sequence;
(2b) defining a loss function of the image reconstruction neural network model R:

L_mix = α·L_cls + (1 - α)·L_reg

wherein L_cls is the cross-entropy loss function of the quantized-image reconstruction task: for the training sample set V_train, E cluster centroid points μ_1, μ_2, ..., μ_E are selected, with E ≤ 50; the class of each sample is computed from its distance to the cluster centroid points, the number of target classes contained in the frame sequence set V is set to C, and the positions of the cluster centroid points are corrected so that the same target carries the same label across frames while different targets carry different labels. Here c_i^t denotes the class to which the i-th pixel of a given frame picture I_t belongs, and ĉ_i^t denotes the prediction obtained with the K-means algorithm. L_reg is the regression loss function of the RGB image reconstruction task, defined over the real target frame pixels I_t and the reconstructed target frame pixels Î_t; α is the weight coefficient of the reconstruction terms, with 0.1 ≤ α ≤ 0.9;
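For concreteness, a minimal PyTorch-style sketch of such a mixed loss is given below; it assumes a standard per-pixel cross-entropy for the quantized reconstruction term and a mean-squared-error regression term for the RGB reconstruction, since the exact formulas are given in the original filing only as formula images.

```python
import torch.nn.functional as F

def mixed_reconstruction_loss(cls_logits, cls_target, rgb_pred, rgb_true, alpha=0.6):
    """Sketch of L_mix = alpha * L_cls + (1 - alpha) * L_reg.

    cls_logits: (B, E, H, W) scores over the E colour clusters (quantized task)
    cls_target: (B, H, W) per-pixel cluster index produced by K-means
    rgb_pred:   (B, 3, H, W) reconstructed target frame
    rgb_true:   (B, 3, H, W) real target frame
    """
    l_cls = F.cross_entropy(cls_logits, cls_target)   # quantized-image term (assumed form)
    l_reg = F.mse_loss(rgb_pred, rgb_true)            # RGB regression term (assumed MSE)
    return alpha * l_cls + (1.0 - alpha) * l_reg
```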
(2c) setting the feature extraction network parameters and the maximum number of iterations N, and iteratively training the image reconstruction neural network model R on the target frame pictures of the training sample set V_train according to the loss function of the image reconstruction neural network model R, to obtain the trained image reconstruction neural network model R;
(3) constructing and training a side output edge detection network model Q:
(3a) constructing an edge detection network model Q comprising a side-output edge detection layer SODL and a side-output edge fusion layer SOFL connected in sequence, wherein the side-output edge detection layer SODL comprises a deconvolution layer and a convolutional layer with a convolution kernel size of 1 × 1 and one output channel, and the side-output edge fusion layer SOFL is a convolutional layer with a convolution kernel size of 1 × 1 and one channel;
(3b) defining a loss function of the side-output edge detection network model Q:

L_edge = L_side + L_fuse

wherein L_side is the side-output edge detection loss function, a weighted sum of the per-side losses in which β_i is the weight coefficient of the i-th side-output edge detection network and the i-th term is the loss of that network's prediction result; in the per-side loss, e denotes the target edge ground truth of the input image, |e^-| the number of edge pixels in the ground truth, |e^+| the number of non-edge pixels in the ground truth, and ω_i the parameters of the convolutional layer; L_fuse is the edge fusion loss function computed on the fused side outputs;
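A sketch in the spirit of HED-style class-balanced edge losses is shown below; the way the balancing weights are derived from |e^+| and |e^-| and the exact form of the fusion term are assumptions, since the per-side and fusion formulas appear in the source only as formula images.

```python
import torch
import torch.nn.functional as F

def side_output_edge_loss(side_logits, fuse_logits, edge_gt, betas=None):
    """Sketch of L_edge = L_side + L_fuse with class-balanced BCE per side output.

    side_logits: list of (B, 1, H, W) logits, one per side output
    fuse_logits: (B, 1, H, W) logits of the fused prediction
    edge_gt:     (B, 1, H, W) binary edge ground truth
    """
    n_pos = edge_gt.sum()
    n_neg = edge_gt.numel() - n_pos
    # weight edge / non-edge pixels by the opposite class frequency (assumed balancing)
    pos_w = n_neg / (n_pos + n_neg)
    neg_w = n_pos / (n_pos + n_neg)
    weight = torch.where(edge_gt > 0.5, pos_w, neg_w)

    if betas is None:
        betas = [1.0] * len(side_logits)
    l_side = sum(b * F.binary_cross_entropy_with_logits(s, edge_gt, weight=weight)
                 for b, s in zip(betas, side_logits))
    l_fuse = F.binary_cross_entropy_with_logits(fuse_logits, edge_gt, weight=weight)
    return l_side + l_fuse
```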
(3c) setting the maximum number of iterations I, and iteratively training the side-output edge detection network model Q, according to its loss function, on the set of feature maps output by each structural layer of the feature extraction network in the image reconstruction neural network model R, to obtain the trained side-output edge detection network model Q;
(4) constructing and training an edge correction network model Z:
(4a) constructing an edge correction network model Z composed of a sequentially connected atrous spatial pyramid pooling (ASPP) model F_γ and a softmax activation function output layer, wherein the ASPP model F_γ consists of a plurality of convolutional layers and pooling layers connected in sequence;
(4b) defining the loss function L_corr of the edge correction network model Z, which is computed from the coarse segmentation result of the target frame output by the edge detection layer, the prediction result of the ASPP model F_γ, and the image edges obtained by the Canny algorithm, wherein M denotes the number of classes of the pixels in the mask and the loss is normalized by the total number of pixels in the mask;
(4c) setting a maximum iteration number H, performing iterative training on the edge correction network model Z according to a loss function of the edge correction network model Z and by using output results of the image reconstruction network model R and the edge detection network model Q to obtain a trained edge correction network model Z;
(5) combining the trained image reconstruction neural network R, the side output edge detection network Q and the edge correction network model Z to obtain a video target segmentation model based on an image target edge correction segmentation result;
(6) obtaining a self-supervision video target segmentation result:
the frame images in the test set V_test are used as input of the video target segmentation model and propagated forward to obtain the predicted segmentation labels of all test frame images, and the final segmentation result images are obtained according to the predicted segmentation labels of the test frame images.
Compared with the prior art, the invention has the following advantages:
Firstly, because the multi-pixel-scale image reconstruction task is adopted as the pretext task of self-supervised learning, the features extracted by the trained model generalize better to both large and small targets in the video segmentation task, and therefore the model performs better on the overall video target segmentation task.
Secondly, the invention repairs the target mask with the edges of the target in the video picture: a side-output edge detection network fuses the feature maps extracted by each layer of the feature extraction network in the video target segmentation model and predicts the candidate target edges in the target frame, and a self-supervised edge fusion model fuses the segmentation result output by the video target segmentation model with the target edges output by the side-output edge detection network, so that the segmentation mask is corrected according to the target edges and a more accurate segmentation result is obtained.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples.
The first embodiment is as follows: referring to fig. 1, the video object segmentation method based on self-supervision provided by the invention specifically includes the following steps:
step 1: acquiring a training sample set, a verification sample set and a test sample set:
video sequences are obtained from a video target segmentation data set and preprocessed to obtain a frame sequence set V, and the frame sequences in the set are divided to obtain a training sample set V_train, a validation sample set V_val and a test sample set V_test; this is realized as follows:
(1a) S multi-class video sequences are obtained from a video target segmentation data set and preprocessed to obtain a frame sequence set V = {V_1, V_2, ..., V_S}, with S ≥ 3000, where V_k denotes the k-th frame sequence consisting of preprocessed image frames, the n-th element of V_k is the n-th image frame in the k-th frame sequence, and each sequence contains at least M ≥ 30 frames;
(1b) more than half of the frame sequences are randomly extracted from the frame sequence set V to form the training sample set V_train, where S/2 < N < S; for each frame sequence in the training sample set, each target frame picture to be segmented is scaled into an image block of size p × h and the picture format is converted from RGB to Lab; half of the remaining frame sequences are extracted to form the validation sample set V_val, where J ≤ S/4, and the other half constitutes the test sample set V_test, where T ≤ S/4; their picture format is likewise converted from RGB to Lab.
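The preprocessing described above can be sketched as follows; the use of OpenCV for the resize and the RGB-to-Lab conversion, and the 256 × 256 target size, are illustrative assumptions (the second embodiment sets x = y = p = 256).

```python
import cv2
import numpy as np

def preprocess_frame(frame_bgr, size=(256, 256)):
    """Resize a frame, convert it to the Lab colour space and normalize it.

    frame_bgr: (H, W, 3) uint8 image as read by cv2.imread / cv2.VideoCapture
    returns:   (size[1], size[0], 3) float32 Lab image scaled to [0, 1]
    """
    resized = cv2.resize(frame_bgr, size, interpolation=cv2.INTER_LINEAR)
    lab = cv2.cvtColor(resized, cv2.COLOR_BGR2LAB)
    return lab.astype(np.float32) / 255.0
```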
Step 2: constructing and training an image reconstruction neural network model R:
(2a) constructing an image reconstruction neural network model R formed by a feature extraction network, wherein the feature extraction network is a residual network comprising a plurality of convolutional layers, a plurality of pooling layers, a plurality of residual unit modules and a single fully connected layer connected in sequence;
(2b) defining a loss function of the image reconstruction neural network model R:

L_mix = α·L_cls + (1 - α)·L_reg

wherein L_cls is the cross-entropy loss function of the quantized-image reconstruction task: for the training sample set V_train, E cluster centroid points μ_1, μ_2, ..., μ_E are selected, with E ≤ 50; the class of each sample is computed from its distance to the cluster centroid points, the number of target classes contained in the frame sequence set V is set to C, and the positions of the cluster centroid points are corrected so that the same target carries the same label across frames while different targets carry different labels. Here c_i^t denotes the class to which the i-th pixel of a given frame picture I_t belongs, and ĉ_i^t denotes the prediction obtained with the K-means algorithm. L_reg is the regression loss function of the RGB image reconstruction task, defined over the real target frame pixels I_t and the reconstructed target frame pixels Î_t; α is the weight coefficient of the reconstruction terms, with 0.1 ≤ α ≤ 0.9;
(2c) setting the feature extraction network parameters and the maximum number of iterations N, and iteratively training the image reconstruction neural network model R on the target frame pictures of the training sample set V_train according to its loss function, to obtain the trained image reconstruction neural network model R; this is realized as follows:
(2c1) set the hyper-parameters of the feature extraction network to θ and the maximum number of iterations to N ≥ 150000; let n denote the current iteration and initialize n = 1;
(2c2) use the target frame pictures of the training sample set V_train as input of the image reconstruction neural network model R and propagate forward: for each target frame I_t to be segmented, select the q preceding frames as reference frames {I'_0, I'_1, ..., I'_q}, with 2 ≤ q ≤ 5. The target frame I_t and its reference frame set are fed into the feature extraction network Φ(·; θ), which extracts features from I_t and from each reference frame image, yielding the target frame feature f_t = Φ(I_t; θ) and the reference frame features f'_0 = Φ(I'_0; θ), ..., f'_q = Φ(I'_q; θ). The target frames {I_t | 0 ≤ t ≤ N} of the training sample set are used as input of the K-means algorithm to obtain the quantized image reconstruction loss value L_cls, and the reconstructed target frame Î_t together with the real target frame I_t are used as input of the RGB image reconstruction task to obtain the RGB image reconstruction loss value L_reg;
(2c3) compute the loss value L_mix of the image reconstruction neural network from the cross-entropy loss L_cls and the regression loss L_reg, compute the gradient g(θ) of the network parameters by backpropagation, and update the network parameters θ by gradient descent;
(2c4) check whether n = N holds; if so, the trained image reconstruction neural network R is obtained; otherwise set n = n + 1 and return to step (2c2).
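One common way to realize the quantized reconstruction target is to cluster the colour values of the training frames with K-means and assign each pixel the index of its nearest centroid; the sketch below uses scikit-learn with 16 clusters, which follows the K = 16 setting of the second embodiment but is otherwise an illustrative choice.

```python
import numpy as np
from sklearn.cluster import KMeans

def fit_colour_clusters(lab_frames, n_clusters=16, seed=0):
    """Fit K-means centroids mu_1..mu_E on the colour values of a set of Lab frames.

    lab_frames: (N, H, W, 3) float array of preprocessed Lab frames
    """
    pixels = lab_frames.reshape(-1, 3)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    km.fit(pixels)
    return km

def quantize_frame(km, lab_frame):
    """Assign every pixel of one frame to its nearest cluster centroid (its class label)."""
    h, w, _ = lab_frame.shape
    labels = km.predict(lab_frame.reshape(-1, 3))
    return labels.reshape(h, w)   # per-pixel targets for the cross-entropy term L_cls
```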
Step 3: constructing and training a side-output edge detection network model Q:
(3a) construct an edge detection network model Q comprising a side-output edge detection layer SODL and a side-output edge fusion layer SOFL connected in sequence, wherein the side-output edge detection layer SODL comprises a deconvolution layer and a convolutional layer with a convolution kernel size of 1 × 1 and one output channel, and the side-output edge fusion layer SOFL is a convolutional layer with a convolution kernel size of 1 × 1 and one channel;
(3b) define the loss function of the side-output edge detection network model Q:

L_edge = L_side + L_fuse

wherein L_side is the side-output edge detection loss function, a weighted sum of the per-side losses in which β_i is the weight coefficient of the i-th side-output edge detection network and the i-th term is the loss of that network's prediction result; in the per-side loss, e denotes the target edge ground truth of the input image, |e^-| the number of edge pixels in the ground truth, |e^+| the number of non-edge pixels in the ground truth, and ω_i the parameters of the convolutional layer; L_fuse is the edge fusion loss function computed on the fused side outputs;
(3c) setting the maximum number of iterations I and iteratively training the side-output edge detection network model Q, according to its loss function, on the set of feature maps output by each structural layer of the feature extraction network in the image reconstruction neural network model R, to obtain the trained side-output edge detection network model Q; this is realized as follows:
(3c1) set the maximum number of iterations to I ≥ 150000; let i denote the current iteration and initialize i = 1;
(3c2) use the set of feature maps output by each structural layer of the feature extraction network in the image reconstruction network model as input of the side-output edge detection network and propagate forward:
(3c3) the side-output edge detection layer extracts the coarse edges of the target from the feature map set, yielding the coarse edge corresponding to each feature map;
(3c4) the set of coarse edges output by the side-output edge detection layer SODL is used as input of the side-output edge fusion layer SOFL, which performs a weighted fusion of the coarse edges to obtain the final predicted edge; here the fused feature is formed by merging the coarse edges and ω_fuse denotes the parameters of the side-output edge fusion layer;
(3c5) compute the loss value L_edge of the edge detection network from the side-output edge detection loss L_side and the side-output edge fusion loss L_fuse, compute the gradient g(ω) of the network parameters by backpropagation, and update the network parameters ω by gradient descent;
(3c6) check whether i = I holds; if so, the trained side-output edge detection network model Q is obtained; otherwise set i = i + 1 and return to step (3c2).
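A minimal PyTorch sketch of such side-output layers is given below; bilinear upsampling stands in for the deconvolution layer, and the channel widths follow the conv_2 to conv_5 blocks described in the second embodiment, so both are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideOutputEdge(nn.Module):
    """Sketch of SODL + SOFL: one 1x1 edge head per backbone feature map,
    upsampled to the input resolution, then fused by a 1x1 convolution."""

    def __init__(self, in_channels=(64, 128, 256, 512)):
        super().__init__()
        # one side-output head (SODL) per feature map: 1x1 conv -> 1 channel
        self.side_heads = nn.ModuleList([nn.Conv2d(c, 1, kernel_size=1) for c in in_channels])
        # fusion layer (SOFL): 1x1 conv over the stacked side outputs -> 1 channel
        self.fuse = nn.Conv2d(len(in_channels), 1, kernel_size=1)

    def forward(self, feats, out_size):
        # feats: list of (B, C_i, H_i, W_i) feature maps from the feature extraction network
        sides = [F.interpolate(head(f), size=out_size, mode="bilinear", align_corners=False)
                 for head, f in zip(self.side_heads, feats)]
        fused = self.fuse(torch.cat(sides, dim=1))
        return sides, fused   # coarse side edges and the final predicted edge
```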
Step 4: constructing and training an edge correction network model Z:
(4a) construct an edge correction network model Z composed of a sequentially connected atrous spatial pyramid pooling (ASPP) model F_γ and a softmax activation function output layer, wherein the ASPP model F_γ consists of a plurality of convolutional layers and pooling layers connected in sequence;
(4b) define the loss function L_corr of the edge correction network model Z, which is computed from the coarse segmentation result of the target frame output by the edge detection layer, the prediction result of the ASPP model F_γ, and the image edges obtained by the Canny algorithm, wherein M denotes the number of classes of the pixels in the mask and the loss is normalized by the total number of pixels in the mask;
(4c) setting the maximum number of iterations H and iteratively training the edge correction network model Z, according to its loss function, on the outputs of the image reconstruction network model R and the edge detection network model Q, to obtain the trained edge correction network model Z; this is realized as follows:
(4c1) set the maximum number of iterations to H ≥ 150000; let h denote the current iteration and initialize h = 1;
(4c2) use the coarse segmentation result of the target frame output by the image reconstruction network model R and the edge detection result output by the edge detection network model Q as input of the edge correction network model Z and propagate forward:
(4c2.1) the edge correction network first concatenates the coarse segmentation result of the target frame and the edge detection results along the channel dimension, obtaining a feature map of size H × W × (K + 1);
(4c2.2) the feature map is used as input of the ASPP model F_γ to obtain a prediction with an enlarged receptive field;
(4c2.3) the enlarged-receptive-field prediction is used as input of the softmax activation function output layer, and the segmentation label of each pixel is determined from the probability that the pixel belongs to each class, so that after edge fusion correction of the target segmentation mask of the target frame a more accurate target segmentation mask O_t is obtained, where O_t denotes the predicted segmentation label of the target frame I_t;
(4c3) compute the loss value L_corr of the edge correction network, compute the gradient g(c) of the network parameters by backpropagation, and update the network parameters c by gradient descent;
(4c4) check whether h = H holds; if so, the trained edge correction network model Z is obtained; otherwise set h = h + 1 and return to step (4c2).
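A minimal sketch of this correction step is given below; the ASPP-style module is passed in as a generic component (see the ASPP sketch in the second embodiment), and the interface with one coarse-mask channel plus K edge channels is an assumption.

```python
import torch
import torch.nn as nn

class EdgeCorrection(nn.Module):
    """Sketch of the edge correction network Z: concatenate the coarse mask with the
    K edge maps, run an ASPP-style module, and predict per-pixel class probabilities."""

    def __init__(self, aspp: nn.Module, aspp_out_channels: int, num_classes: int):
        super().__init__()
        self.aspp = aspp                                   # F_gamma (any ASPP-style module)
        self.classifier = nn.Conv2d(aspp_out_channels, num_classes, kernel_size=1)

    def forward(self, coarse_mask, edge_maps):
        # coarse_mask: (B, 1, H, W); edge_maps: (B, K, H, W) -> (B, K + 1, H, W)
        x = torch.cat([coarse_mask, edge_maps], dim=1)
        x = self.aspp(x)                                   # enlarged-receptive-field features
        logits = self.classifier(x)
        return logits.softmax(dim=1)                       # per-pixel class probabilities
```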
Step 5: combine the trained image reconstruction neural network R, the side-output edge detection network Q and the edge correction network model Z to obtain a video target segmentation model that corrects the segmentation result with the image target edges. The combination is performed as follows: the intermediate feature maps extracted by the image reconstruction neural network R are fed into the side-output edge detection network Q to obtain the target edge prediction map, and the target segmentation mask prediction map output by the image reconstruction neural network R together with the target edge prediction map output by the side-output edge detection network Q are used as input of the edge correction network model Z, yielding the trained video target segmentation model based on correcting the segmentation result with the image target edges.
Step 6: obtaining the self-supervised video target segmentation result:
the frame images in the test set V_test are used as input of the video target segmentation model and propagated forward to obtain the predicted segmentation labels of all test frame images, and the final segmentation result images are obtained according to the predicted segmentation labels of the test frame images.
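The combination described in step 5 and the test-time forward pass of step 6 can be sketched as a single pipeline; the module interfaces (R returning its intermediate feature maps together with a coarse mask, Q and Z as in the sketches above) are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class VideoTargetSegmenter(nn.Module):
    """Sketch of the combined model: R produces features and a coarse mask,
    Q predicts target edges from R's intermediate feature maps, and Z corrects
    the coarse mask with those edges."""

    def __init__(self, reconstructor, edge_detector, edge_corrector):
        super().__init__()
        self.R = reconstructor      # image reconstruction network (feature extractor + mask head)
        self.Q = edge_detector      # side-output edge detection network
        self.Z = edge_corrector     # edge correction network

    @torch.no_grad()
    def forward(self, frame, reference_frames):
        feats, coarse_mask = self.R(frame, reference_frames)   # intermediate feature maps + coarse mask
        _, edge_map = self.Q(feats, out_size=coarse_mask.shape[-2:])
        probs = self.Z(coarse_mask, edge_map)                   # corrected per-pixel probabilities
        return probs.argmax(dim=1)                              # predicted segmentation labels
```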
Example two: the overall steps of this embodiment are the same as those of the first embodiment, and specific values are given for setting some of the parameters, so as to further describe the implementation process of the present invention:
Step 1) obtaining a training sample set, a validation sample set and a test sample set:
Step 1a) acquire S multi-class video sequences from a video target segmentation data set and preprocess them to obtain a frame sequence set V; in this embodiment the multi-class video sequences are acquired from the YouTube-VOS data set, with S = 4453 and M = 50;
Step 1b) set the number of target classes of the frame sequence set V to C = 94 and the class set to Class = {c_num | 1 ≤ num ≤ C}, where c_num denotes the num-th target class; multiple classes of targets may appear in each frame sequence;
Step 1c) randomly extract more than half of the frame sequences from the frame sequence set V to form the training sample set V_train, where S/2 < N < S; for each frame sequence in the training set, scale each target frame picture to be segmented into an image block of size p × h and convert the RGB images to Lab; extract half of the remaining frame sequences to form the validation sample set V_val, where J ≤ S/4, and let the other half form the test sample set V_test, where T ≤ S/4; these are likewise converted from RGB to Lab;
set the crop box size to x × y, crop each frame picture to be segmented in a training frame sequence to obtain the cropped frame picture, normalize the cropped frame pictures, and let the normalized frame pictures form the preprocessed training frame sequence, whose m-th element is the m-th frame sequence in the training sample set;
in this embodiment x = 256, y = 256, p = 256 and h = 3;
step 2), constructing an image reconstruction neural network model R:
step 2a) constructing a structure of an image reconstruction neural network model R:
constructing an image reconstruction neural network model formed by a feature extraction network, wherein the feature extraction network is a residual network comprising a plurality of convolutional layers, a plurality of pooling layers, a plurality of residual unit modules and a single fully connected layer connected in sequence;
The feature extraction network comprises 17 convolutional layers and 1 fully connected layer. These 18 layers are divided into 5 blocks, conv_1 to conv_5. conv_1 is a single convolutional layer with kernel size 7 × 7 and 64 channels. conv_2 comprises two convolutional layers with kernel size 3 × 3, 64 channels and stride 1. conv_3 comprises two convolutional layers with kernel size 3 × 3 and 128 channels, where the first convolutional layer has stride 2 and the second has stride 1. conv_4 comprises two convolutional layers with kernel size 3 × 3, 256 channels and stride 1. conv_5 comprises two convolutional layers with kernel size 3 × 3, 512 channels and stride 1;
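A direct PyTorch transcription of this block structure might look as follows; the stride of conv_1 and the omission of the pooling layers, residual connections and the final fully connected layer are simplifications, since they are not fully specified above.

```python
import torch.nn as nn

def _block(in_ch, out_ch, first_stride=1):
    """Two 3x3 convolutions; only the first may be strided (as in conv_3)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=first_stride, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, kernel_size=3, stride=1, padding=1),
        nn.ReLU(inplace=True),
    )

feature_extractor = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=7, stride=2, padding=3),   # conv_1 (stride assumed)
    _block(64, 64),                                          # conv_2
    _block(64, 128, first_stride=2),                         # conv_3
    _block(128, 256),                                        # conv_4
    _block(256, 512),                                        # conv_5
)
```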
Step 2b) define the loss function of the image reconstruction neural network model:

L_mix = α·L_cls + (1 - α)·L_reg

where L_cls is the cross-entropy loss function of the quantized-image reconstruction task: for the training sample set V_train, E cluster centroid points μ_1, μ_2, ..., μ_E are selected, with E ≤ 50; the class of each sample is computed from its distance to the cluster centroid points, and the positions of the cluster centroid points are corrected so that the same target carries the same label across frames while different targets carry different labels. Here c_i^t denotes the class to which the i-th pixel of a given frame picture I_t belongs and ĉ_i^t denotes the prediction obtained with the K-means algorithm. L_reg is the regression loss function of the RGB image reconstruction task, defined over the real target frame pixels I_t and the reconstructed target frame pixels Î_t; α is the weight coefficient of the reconstruction terms, with 0.1 ≤ α ≤ 0.9.
In this embodiment, K = 16 and α = 0.6;
step 3) iterative training is carried out on the image reconstruction neural network model:
Step 3a) set the network hyper-parameters of the feature extraction network to θ and the maximum number of iterations to N ≥ 150000; initialize the current iteration n = 1;
in this embodiment N = 300000, so that the model is trained more thoroughly;
Step 3b) use the target frame pictures of the training sample set V_train as input of the image reconstruction neural network model R and propagate forward:
Step 3b1) for each target frame I_t to be segmented, select the q preceding frames as reference frames {I'_0, I'_1, ..., I'_q}, with 2 ≤ q ≤ 5. The target frame I_t and its reference frame set are fed into the feature extraction network Φ(·; θ), which extracts features from I_t and from each reference frame image, yielding the target frame feature f_t = Φ(I_t; θ) and the reference frame features f'_0 = Φ(I'_0; θ), ..., f'_q = Φ(I'_q; θ). The target frames {I_t | 0 ≤ t ≤ N} of the training sample set are used as input of the K-means algorithm to obtain the quantized image reconstruction loss value L_cls, and the reconstructed target frame Î_t together with the real target frame I_t are used as input of the RGB image reconstruction task to obtain the RGB image reconstruction loss value L_reg;
Step 3c) compute the loss value L_mix of the image reconstruction neural network from the cross-entropy loss L_cls and the regression loss L_reg, compute the gradient g(θ) of the network parameters by backpropagation, and update the network parameters θ by gradient descent with the update formula

θ' = θ_n - γ · ∂L_mix^n / ∂θ_n

where θ' is the result of updating θ_n, γ is the learning rate with 1e-6 ≤ γ ≤ 1e-3, L_mix^n is the loss function value of the image reconstruction neural network after the n-th iteration, and ∂ denotes the partial derivative.
In this embodiment the initial learning rate is γ = 0.001; it is reduced to 0.0005 at iteration 150000, to 0.00025 at iteration 200000 and to 0.000125 at iteration 250000. The optimizer is Adam; the learning rate is decayed after a certain number of iterations to keep the loss function from getting stuck in a local minimum;
Step 3d) check whether n = N holds; if so, the trained image reconstruction neural network R is obtained; otherwise set n = n + 1 and go to step (3b);
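The Adam optimizer and the staged learning-rate decay described above can be sketched as follows; MultiStepLR is used as a convenient stand-in for the manual schedule, and the model and loss are placeholders.

```python
import torch

# placeholder network standing in for the image reconstruction model R
model = torch.nn.Conv2d(3, 64, kernel_size=7, padding=3)

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# 0.001 -> 0.0005 -> 0.00025 -> 0.000125 at iterations 150k, 200k and 250k
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[150_000, 200_000, 250_000], gamma=0.5)

for n in range(3):                                         # N = 300000 in this embodiment; shortened here
    optimizer.zero_grad()
    loss = model(torch.randn(1, 3, 256, 256)).mean()       # stand-in for L_mix on one batch
    loss.backward()
    optimizer.step()
    scheduler.step()
```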
Step 4) constructing a side-output edge detection network model Q:
Step 4a) construct the structure of the side-output edge detection network model Q:
construct an edge detection network model Q comprising a side-output edge detection layer SODL and a side-output edge fusion layer SOFL connected in sequence, wherein the side-output edge detection layer SODL comprises a deconvolution layer and a convolutional layer with a convolution kernel size of 1 × 1 and one output channel, and the side-output edge fusion layer SOFL is a convolutional layer with a convolution kernel size of 1 × 1 and one channel;
Step 4b) define the loss function of the side-output edge detection network model:

L_edge = L_side + L_fuse

where L_side is the side-output edge detection loss function, a weighted sum of the per-side losses in which β_i is the weight coefficient of the i-th side-output edge detection network and the i-th term is the loss of that network's prediction result; in the per-side loss, e denotes the target edge ground truth of the input image, |e^-| the number of edge pixels in the ground truth, |e^+| the number of non-edge pixels in the ground truth, and ω_i the parameters of the convolutional layer; L_fuse is the edge fusion loss function computed on the fused side outputs;
step 5) performing iterative training on the side output edge detection network model Q:
Step 5a) set the maximum number of iterations to I ≥ 150000 and initialize the current iteration i = 1;
in this embodiment I = 300000, so that the model is trained more thoroughly;
Step 5b) use the set of feature maps output by each structural layer of the feature extraction network in the image reconstruction network model as input of the side-output edge detection network and propagate forward:
Step 5b1) the side-output edge detection layer extracts the coarse edges of the target from the feature map set, yielding the coarse edge corresponding to each feature map;
Step 5b2) the set of coarse edges output by the side-output edge detection layer SODL is used as input of the side-output edge fusion layer SOFL, which performs a weighted fusion of the coarse edges to obtain the final predicted edge; here the fused feature is formed by merging the coarse edges and ω_fuse denotes the parameters of the side-output edge fusion layer;
Step 5c) compute the loss value L_edge of the edge detection network from the side-output edge detection loss L_side and the side-output edge fusion loss L_fuse, compute the gradient g(ω) of the network parameters by backpropagation, and update the network parameters ω by gradient descent with the update formula

ω' = ω_i - β · ∂L_edge^i / ∂ω_i

where ω' is the result of updating ω_i, β is the learning rate with 1e-6 ≤ β ≤ 1e-3, L_edge^i is the loss function value of the side-output edge detection network after the i-th iteration, and ∂ denotes the partial derivative.
In this embodiment the initial learning rate is β = 0.001; it is reduced to 0.0005 at iteration 150000, to 0.00025 at iteration 200000 and to 0.000125 at iteration 250000. The optimizer is Adam; the learning rate is decayed after a certain number of iterations to keep the loss function from getting stuck in a local minimum;
Step 5d) check whether i = I holds; if so, the trained side-output edge detection network model Q is obtained; otherwise set i = i + 1 and go to step (5b);
step 6), constructing an edge correction network model Z:
step 6a) constructing a structure of an edge correction network model Z:
construct an edge correction network model Z comprising a sequentially connected atrous spatial pyramid pooling (ASPP) model F_γ and a softmax activation function output layer, wherein the ASPP model F_γ consists of a plurality of convolutional layers and pooling layers connected in sequence;
The ASPP model F_γ comprises a convolutional layer, a pooling pyramid and a pooling block. The convolutional layer has kernel size 1 × 1; the pooling pyramid comprises three parallel convolutional layers with kernel size 3 × 3; the pooling block comprises a 1 × 1 pooling layer, a convolutional layer with kernel size 1 × 1 and an up-sampling layer. The feature maps output by the convolutional layer, the pooling pyramid and the pooling block are concatenated and then processed by a 1 × 1 pooling layer to obtain the output of the ASPP model F_γ;
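An illustrative PyTorch module in this spirit is sketched below; the dilation rates and the output channel count are assumptions, since they are not specified above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Sketch of F_gamma: a 1x1 branch, three parallel 3x3 atrous branches,
    and a global-pooling branch, concatenated and projected back."""

    def __init__(self, in_ch, out_ch=256, rates=(6, 12, 18)):
        super().__init__()
        self.branch1 = nn.Conv2d(in_ch, out_ch, kernel_size=1)
        self.atrous = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=r, dilation=r) for r in rates])
        self.pool_proj = nn.Conv2d(in_ch, out_ch, kernel_size=1)   # conv inside the pooling block
        self.project = nn.Conv2d(out_ch * (2 + len(rates)), out_ch, kernel_size=1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [self.branch1(x)] + [conv(x) for conv in self.atrous]
        pooled = F.adaptive_avg_pool2d(x, 1)                        # pooling block: global pool ...
        pooled = F.interpolate(self.pool_proj(pooled), size=(h, w), mode="bilinear",
                               align_corners=False)                 # ... 1x1 conv, then upsample
        feats.append(pooled)
        return self.project(torch.cat(feats, dim=1))                # concat + 1x1 projection
```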
Step 6b) define the loss function L_corr of the edge correction network model Z, which is computed from the coarse segmentation result of the target frame output by the edge detection layer, the prediction result of the ASPP model F_γ, and the image edges obtained by the Canny algorithm, wherein M denotes the number of classes of the pixels in the mask and the loss is normalized by the total number of pixels in the mask;
step 7) performing iterative training on the edge correction network model Z:
Step 7a) set the maximum number of iterations to H ≥ 150000 and initialize the current iteration h = 1;
in this embodiment H = 300000, so that the model is trained more thoroughly;
Step 7b) use the coarse segmentation result of the target frame output by the image reconstruction network model and the edge detection result output by the edge detection network model as input of the edge correction network model Z and propagate forward:
Step 7b1) the edge correction network first concatenates the coarse segmentation result of the target frame and the edge detection results along the channel dimension, obtaining a feature map of size H × W × (K + 1);
Step 7b2) the feature map is used as input of the ASPP model F_γ to obtain a prediction with an enlarged receptive field;
Step 7b3) the enlarged-receptive-field prediction is used as input of the softmax activation function output layer, and the segmentation label of each pixel is determined from the probability that the pixel belongs to each class, so that after edge fusion correction of the target segmentation mask of the target frame a more accurate target segmentation mask O_t is obtained, where O_t denotes the predicted segmentation label of the target frame I_t;
Step 7c) compute the loss value L_corr of the edge correction network, compute the gradient g(c) of the network parameters by backpropagation, and update the network parameters c by gradient descent with the update formula

c' = c_h - α · ∂L_corr^h / ∂c_h

where c' is the result of updating c_h, α is the learning rate with 1e-6 ≤ α ≤ 1e-3, L_corr^h is the loss function value of the edge correction network after the h-th iteration, and ∂ denotes the partial derivative.
In this embodiment the initial learning rate is α = 0.001; it is reduced to 0.0005 at iteration 150000, to 0.00025 at iteration 200000 and to 0.000125 at iteration 250000. The optimizer is Adam; the learning rate is decayed after a certain number of iterations to keep the loss function from getting stuck in a local minimum;
Step 7d) check whether h = H holds; if so, the trained edge correction network model Z is obtained; otherwise set h = h + 1 and go to step (7b);
Step 8) obtaining the self-supervised video target segmentation result:
the frame images of the test set V_test are used as input of the trained video target segmentation model based on correcting the segmentation result with the image target edges, which consists of the image reconstruction neural network R, the side-output edge detection network Q and the edge fusion network Z, and are propagated forward to obtain the segmentation labels of all test frame images; the segmentation result images are determined according to the test frame image segmentation labels.
The technical effects of the present invention are further explained by simulation experiments as follows:
1. simulation conditions and contents:
4453 video sequences were acquired from the YouTube-VOS dataset for use in simulation experiments;
The simulation experiment is carried out on a server with an Intel(R) Core(TM) i7-7800X CPU @ 3.5 GHz, 64 GB of memory and an NVIDIA GeForce RTX 2080 Ti GPU. The operating system is Ubuntu 16.04, the deep learning framework is PyTorch, and the programming language is Python 3.6;
The invention is compared in simulation with existing video target segmentation methods. To compare the video target segmentation results quantitatively, two evaluation indexes are adopted in the experiment, namely region similarity J and contour similarity F; the higher these two indexes are, the better the segmentation result. The simulation results are shown in Table 1.
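For reference, region similarity J is commonly computed as the intersection-over-union between the predicted mask and the ground-truth mask; a minimal sketch is given below (contour similarity F additionally requires matching boundary pixels and is omitted here).

```python
import numpy as np

def region_similarity_j(pred_mask, gt_mask):
    """Intersection-over-union between two binary masks (the usual definition of J)."""
    pred = pred_mask.astype(bool)
    gt = gt_mask.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                        # both masks empty: perfect agreement
    return np.logical_and(pred, gt).sum() / union
```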
TABLE 1. Region similarity J and contour similarity F of the proposed method and the compared video target segmentation methods.
2. Simulation result analysis:
As can be seen from Table 1, both the J index and the F index are clearly improved compared with the existing video segmentation methods, which shows that the self-supervision-based video target segmentation technique constructed by the invention can effectively alleviate problems such as target occlusion and tracking drift and thereby improve video target segmentation accuracy; it therefore has important practical significance and value.
The simulation analysis proves the correctness and effectiveness of the method provided by the invention.
Parts of the invention that belong to the common general knowledge of those skilled in the art have not been described in detail.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.

Claims (6)

1. A video object segmentation method based on self-supervision is characterized by comprising the following steps:
(1) acquiring a training sample set, a verification sample set and a test sample set:
obtaining video sequences from a video target segmentation data set, preprocessing them to obtain a frame sequence set V, and dividing the frame sequences in the set to obtain a training sample set V_train, a validation sample set V_val and a test sample set V_test;
(2) Constructing and training an image reconstruction neural network model R:
(2a) constructing an image reconstruction neural network model R formed by a feature extraction network, wherein the feature extraction network is a residual network comprising a plurality of convolutional layers, a plurality of pooling layers, a plurality of residual unit modules and a single fully connected layer connected in sequence;
(2b) defining a loss function of the image reconstruction neural network model R:

L_mix = α·L_cls + (1 - α)·L_reg

wherein L_cls is the cross-entropy loss function of the quantized-image reconstruction task: for the training sample set V_train, E cluster centroid points μ_1, μ_2, ..., μ_E are selected, with E ≤ 50; the class of each sample is computed from its distance to the cluster centroid points, the number of target classes contained in the frame sequence set V is set to C, and the positions of the cluster centroid points are corrected so that the same target carries the same label across frames while different targets carry different labels. Here c_i^t denotes the class to which the i-th pixel of a given frame picture I_t belongs, and ĉ_i^t denotes the prediction obtained with the K-means algorithm. L_reg is the regression loss function of the RGB image reconstruction task, defined over the real target frame pixels I_t and the reconstructed target frame pixels Î_t; α is the weight coefficient of the reconstruction terms, with 0.1 ≤ α ≤ 0.9;
(2c) setting the feature extraction network parameters and the maximum number of iterations N, and, according to the loss function of the image reconstruction neural network model R, iteratively training the image reconstruction neural network model R with the target frame pictures of the training sample set V_train to obtain a trained image reconstruction neural network model R;
(3) constructing and training a side output edge detection network model Q:
(3a) constructing an edge detection network model Q comprising a side output edge detection layer SODL and a side output edge fusion layer SOFL connected in sequence, wherein the side output edge detection layer SODL comprises a deconvolution layer and a convolution layer with a convolution kernel size of 1 × 1 and an output channel number of 1, and the side output edge fusion layer SOFL is a convolution layer with a convolution kernel size of 1 × 1 and a channel number of 1;
(3b) defining a loss function of the side output edge detection network model Q:
L_edge = L_side + L_fuse
wherein L_side denotes the side output edge detection loss function, β_i denotes the weight coefficient of the ith side output edge detection network, and the corresponding per-side loss function evaluates the prediction result of the ith side output edge detection network; e denotes the target edge ground truth of the input image, |e−| denotes the number of pixels in the image target edge ground truth, |e+| denotes the number of non-edge pixels in the image target edge ground truth, ω_i denotes the parameters of the convolutional layer, and L_fuse denotes the edge fusion loss function [the original formula images for L_side, the per-side loss and L_fuse are not reproduced];
(3c) setting a maximum number of iterations I, and, according to the loss function of the side output edge detection network model Q, iteratively training the side output edge detection network model Q using the feature map set output by each structural layer of the feature extraction network in the image reconstruction neural network model R, to obtain a trained side output edge detection network model Q;
(4) constructing and training an edge correction network model Z:
(4a) constructing an edge correction network model Z composed of an atrous spatial pyramid pooling (ASPP) model F_γ and a softmax activation function output layer connected in sequence, wherein the atrous spatial pyramid pooling model F_γ is composed of a plurality of convolution layers and pooling layers connected in sequence;
(4b) defining a loss function of the edge correction network model Z:
[the formula image defining the loss function L_corr is not reproduced] wherein the loss involves the coarse segmentation result of the target frame output by the edge detection layer, the prediction result of the atrous spatial pyramid pooling model F_γ, the edges of the image obtained by the Canny algorithm, the number M of classes of pixels in the mask, and the total number of pixels in the mask;
(4c) setting a maximum iteration number H, performing iterative training on the edge correction network model Z according to a loss function of the edge correction network model Z and by using output results of the image reconstruction network model R and the edge detection network model Q to obtain a trained edge correction network model Z;
(5) combining the trained image reconstruction neural network R, the side output edge detection network Q and the edge correction network model Z to obtain a video target segmentation model in which the segmentation result is corrected based on the image target edges;
(6) obtaining a self-supervision video target segmentation result:
the frame images in the test sample set V_test are used as the input of the video target segmentation model for forward propagation to obtain the predicted segmentation labels of all test frame images, and the final segmentation result images are obtained from these predicted segmentation labels.
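As an illustrative sketch (not the claimed implementation) of the mixed loss in step (2b) of claim 1, the snippet below combines a cross-entropy term over K-means-quantized pixel labels with an RGB regression term, weighted by α. It assumes PyTorch; the tensor shapes and the choice of an L1 regression term are assumptions, since the exact formula images are not reproduced in the claim.

```python
import torch
import torch.nn.functional as F

def mixed_reconstruction_loss(cls_logits, cluster_labels, recon_rgb, target_rgb, alpha=0.5):
    """
    Sketch of L_mix = alpha * L_cls + (1 - alpha) * L_reg, with 0.1 <= alpha <= 0.9.

    cls_logits:     (B, E, H, W) predicted logits over E cluster centroids
    cluster_labels: (B, H, W)    per-pixel classes assigned by K-means (pseudo labels)
    recon_rgb:      (B, 3, H, W) reconstructed target frame
    target_rgb:     (B, 3, H, W) real target frame
    """
    # Cross-entropy loss of the quantized image reconstruction task.
    l_cls = F.cross_entropy(cls_logits, cluster_labels)
    # Regression loss of the RGB image reconstruction task (L1 here is an assumption).
    l_reg = F.l1_loss(recon_rgb, target_rgb)
    return alpha * l_cls + (1.0 - alpha) * l_reg
```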
2. The method of claim 1, wherein the training sample set V_train, the verification sample set V_val and the test sample set V_test in step (1) are obtained as follows:
(1a) S multi-class video sequences are obtained from a video target segmentation data set and preprocessed to obtain a frame sequence set V, with S ≥ 3000, wherein the kth element of V is the frame sequence consisting of the preprocessed image frames of the kth video, whose nth image frame belongs to a sequence of M ≥ 30 frames [the original set notation images are not reproduced];
(1b) more than half of the frame sequences are randomly extracted from the frame sequence set V to form the training sample set V_train, the number N of training sequences satisfying S/2 < N < S; for each frame sequence in the training sample set, each target frame picture to be segmented is scaled to an image block of size p × h and the picture format is converted from RGB to Lab; half of the remaining frame sequences are extracted to form the verification sample set V_val, containing J ≤ S/4 sequences; the other half constitutes the test sample set V_test, containing T ≤ S/4 sequences, and its picture format is likewise converted from RGB to Lab.
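A minimal sketch of the per-frame preprocessing described in (1b) of claim 2 (scaling each target frame to p × h and converting from RGB to Lab), assuming OpenCV; the default values of p and h are placeholders, not values fixed by the claim.

```python
import cv2
import numpy as np

def preprocess_frame(frame_rgb: np.ndarray, p: int = 256, h: int = 256) -> np.ndarray:
    """Scale a frame to p x h and convert it from RGB to Lab."""
    # cv2.resize expects (width, height).
    resized = cv2.resize(frame_rgb, (p, h), interpolation=cv2.INTER_LINEAR)
    lab = cv2.cvtColor(resized, cv2.COLOR_RGB2LAB)
    return lab
```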
3. The method of claim 1, wherein the iterative training of the image reconstruction neural network model R in step (2c) is realized as follows:
(2c1) setting the network parameters of the feature extraction network as θ, setting the maximum number of iterations N ≥ 150000, denoting the current iteration number by n, and initializing n = 1;
(2c2) using the target frame pictures in the training sample set V_train as the input of the image reconstruction neural network model R for forward propagation:
for each target frame I_t to be segmented, the q frames preceding it are selected as reference frames {I′_0, I′_1, ..., I′_q}, with 2 ≤ q ≤ 5; the target frame I_t and its corresponding reference frame set are used as the input of the feature extraction network Φ(·; θ), which extracts features from I_t and from each reference frame image, yielding the target frame feature f_t = Φ(I_t; θ) and the reference frame features f′_0 = Φ(I′_0; θ), ..., f′_q = Φ(I′_q; θ); the target frames {I_t | 0 ≤ t ≤ N} of the training sample set are used as the input of the K-means algorithm to obtain the quantized image reconstruction loss value L_cls, and the reconstructed target frame and the real target frame I_t are used as the input of the RGB image reconstruction task to obtain the RGB image reconstruction loss value L_reg;
(2c3) using the loss function L_mix, calculating the loss value of the image reconstruction neural network from the cross entropy loss L_cls and the regression loss L_reg, calculating the gradient g(θ) of the network parameters by back propagation, and then updating the network parameters θ by gradient descent;
(2c4) judging whether n = N holds; if so, obtaining the trained image reconstruction neural network R; otherwise, letting n = n + 1 and returning to step (2c2).
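A schematic sketch of the training loop in steps (2c1)-(2c4) of claim 3, assuming PyTorch. Here `model`, `kmeans_assign` and `loss_fn` are hypothetical stand-ins for the feature extraction and reconstruction network, the K-means pseudo-label assignment and the mixed loss L_mix, and plain SGD stands in for the "gradient descent method" wording of the claim.

```python
import torch

def train_image_reconstruction(model, kmeans_assign, loader, loss_fn,
                               max_iters=150000, lr=1e-3):
    """
    model:         maps (target_frame, reference_frames) -> (cls_logits, recon_rgb)
    kmeans_assign: maps a target frame to per-pixel cluster labels (pseudo labels)
    loss_fn:       the mixed loss L_mix sketched earlier
    """
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)  # gradient descent update of theta
    n = 0
    while n < max_iters:
        for target_frame, reference_frames in loader:
            cls_logits, recon_rgb = model(target_frame, reference_frames)  # forward propagation
            cluster_labels = kmeans_assign(target_frame)                   # K-means pseudo labels
            loss = loss_fn(cls_logits, cluster_labels, recon_rgb, target_frame)
            optimizer.zero_grad()
            loss.backward()   # back propagation computes g(theta)
            optimizer.step()  # theta <- theta - lr * g(theta)
            n += 1
            if n >= max_iters:
                break
    return model
```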
4. The method of claim 1, wherein the iterative training of the side output edge detection network model Q in step (3c) is realized as follows:
(3c1) setting the maximum number of iterations I ≥ 150000, denoting the current iteration number by i, and initializing i = 1;
(3c2) using the feature map set output by each structural layer of the feature extraction network in the image reconstruction network model as the input of the side output edge detection network for forward propagation:
(3c3) the side output edge detection layer obtains the rough edges of the target from the feature map set, yielding the rough edge corresponding to each feature map;
(3c4) the rough edge set output by the side output edge detection layer SODL is used as the input of the side output edge fusion layer SOFL, and the rough edges are weighted and fused to obtain the final predicted edge, wherein the fused feature is formed by merging the rough edges and ω_fuse denotes the parameters of the side output edge fusion layer;
(3c5) using the loss function L_edge, calculating the loss value of the edge detection network from the side output edge detection loss L_side and the side output edge fusion loss L_fuse, calculating the gradient g(ω) of the network parameters by back propagation, and then updating the network parameters ω by gradient descent;
(3c6) judging whether i = I holds; if so, obtaining the trained side output edge detection network model Q; otherwise, letting i = i + 1 and returning to step (3c2).
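A minimal sketch of the side output edge detection layer SODL and side output edge fusion layer SOFL described in step (3a) and steps (3c3)-(3c4), assuming PyTorch. The per-level channel counts are placeholders, and bilinear upsampling stands in for the deconvolution layer of SODL.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideOutputEdgeDetector(nn.Module):
    """Sketch of SODL + SOFL: a 1x1, 1-channel conv per feature level, upsampled to a
    common resolution, then fused by a 1x1, 1-channel convolution."""

    def __init__(self, in_channels_per_level):
        super().__init__()
        # SODL: one side branch per feature map from the feature extraction network.
        self.side_convs = nn.ModuleList(
            [nn.Conv2d(c, 1, kernel_size=1) for c in in_channels_per_level]
        )
        # SOFL: 1x1 conv fusing the stacked side outputs into one edge map.
        self.fuse = nn.Conv2d(len(in_channels_per_level), 1, kernel_size=1)

    def forward(self, feature_maps, out_size):
        side_edges = []
        for conv, fmap in zip(self.side_convs, feature_maps):
            edge = conv(fmap)
            # Bilinear upsampling stands in for the deconvolution layer of SODL.
            edge = F.interpolate(edge, size=out_size, mode="bilinear", align_corners=False)
            side_edges.append(edge)
        # Weighted fusion of the rough edges into the final predicted edge.
        fused = self.fuse(torch.cat(side_edges, dim=1))
        return side_edges, fused
```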
5. The method of claim 1, wherein the iterative training of the edge correction network model Z in step (4c) is realized as follows:
(4c1) setting the maximum number of iterations H ≥ 150000, denoting the current iteration number by h, and initializing h = 1;
(4c2) using the target frame coarse segmentation result output by the image reconstruction network model R and the edge detection result output by the edge detection network model Q as the input of the edge correction network model Z for forward propagation:
(4c2.1) the edge correction network first concatenates the coarse segmentation result of the target frame and the edge detection result along the channel dimension to obtain a feature map of size H × W × (K + 1);
(4c2.2) the feature map is fed into the atrous spatial pyramid pooling model F_γ to obtain a prediction result with an expanded receptive field;
(4c2.3) the expanded receptive field prediction result is used as the input of the softmax activation function output layer, and the segmentation label of each pixel is determined according to the probability that the pixel belongs to each category, so that a more accurate target segmentation mask is obtained after edge fusion correction of the target segmentation mask of the target frame, where O_t denotes the predicted segmentation label of the target frame I_t;
(4c3) using the loss function L_corr, calculating the loss value of the edge correction network, calculating the gradient g(c) of the network parameters by back propagation, and then updating the network parameters c by gradient descent;
(4c4) judging whether h = H holds; if so, obtaining the trained edge correction network model Z; otherwise, letting h = h + 1 and returning to step (4c2).
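A minimal sketch of the forward pass of the edge correction network Z in steps (4c2.1)-(4c2.3), assuming PyTorch. The ASPP module `aspp` and its output channel count are placeholders standing in for the atrous spatial pyramid pooling model F_γ; channel counts are assumptions.

```python
import torch
import torch.nn as nn

class EdgeCorrectionHead(nn.Module):
    """Sketch of step (4c2): concatenate the coarse K-channel mask with the 1-channel
    edge map (giving K+1 channels), pass through an ASPP-style module, then softmax."""

    def __init__(self, num_classes: int, aspp: nn.Module, aspp_out_channels: int = 256):
        super().__init__()
        self.aspp = aspp  # stands in for F_gamma; must accept K+1 input channels
        self.classifier = nn.Conv2d(aspp_out_channels, num_classes, kernel_size=1)

    def forward(self, coarse_mask_logits, edge_map):
        # coarse_mask_logits: (B, K, H, W), edge_map: (B, 1, H, W)
        x = torch.cat([coarse_mask_logits, edge_map], dim=1)  # (B, K+1, H, W)
        x = self.aspp(x)                                      # expanded receptive field
        x = self.classifier(x)
        probs = torch.softmax(x, dim=1)                       # per-pixel class probabilities
        labels = probs.argmax(dim=1)                          # predicted segmentation label O_t
        return probs, labels
```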
6. The method of claim 1, wherein the video target segmentation model based on the image target edge correction of the segmentation result in step (5) is obtained as follows: the intermediate feature maps extracted by the image reconstruction neural network R are used as the input of the side output edge detection network Q to obtain a target edge prediction map, and the target segmentation mask prediction map output by the image reconstruction neural network R and the target edge prediction map output by the side output edge detection network Q are used as the input of the edge correction network model Z.
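A minimal sketch of the inference composition described in claim 6 and step (6), chaining the three trained networks; the function signatures follow the earlier sketches and are assumptions, not the claimed interfaces.

```python
import torch

@torch.no_grad()
def segment_test_frame(recon_net, edge_net, correction_net, frame, reference_frames):
    """
    recon_net:      image reconstruction network R (assumed to return coarse mask and features)
    edge_net:       side output edge detection network Q (SideOutputEdgeDetector above)
    correction_net: edge correction network Z (EdgeCorrectionHead above)
    """
    # R produces the coarse target segmentation mask and intermediate feature maps.
    coarse_mask, feature_maps = recon_net(frame, reference_frames)
    # Q turns the intermediate feature maps into a target edge prediction map.
    _, edge_map = edge_net(feature_maps, out_size=frame.shape[-2:])
    # Z fuses the coarse mask and the edges into the corrected segmentation labels.
    _, labels = correction_net(coarse_mask, edge_map)
    return labels
```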
CN202210658263.6A 2022-06-10 2022-06-10 Video target segmentation method based on self-supervision Pending CN114863348A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210658263.6A CN114863348A (en) 2022-06-10 2022-06-10 Video target segmentation method based on self-supervision

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210658263.6A CN114863348A (en) 2022-06-10 2022-06-10 Video target segmentation method based on self-supervision

Publications (1)

Publication Number Publication Date
CN114863348A true CN114863348A (en) 2022-08-05

Family

ID=82624940

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210658263.6A Pending CN114863348A (en) 2022-06-10 2022-06-10 Video target segmentation method based on self-supervision

Country Status (1)

Country Link
CN (1) CN114863348A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116129353A (en) * 2023-02-07 2023-05-16 佛山市顺德区福禄康电器科技有限公司 Method and system for intelligent monitoring based on image recognition
CN116129353B (en) * 2023-02-07 2024-05-07 广州融赋数智技术服务有限公司 Method and system for intelligent monitoring based on image recognition
CN116563218A (en) * 2023-03-31 2023-08-08 北京长木谷医疗科技股份有限公司 Spine image segmentation method and device based on deep learning and electronic equipment
CN116630697A (en) * 2023-05-17 2023-08-22 安徽大学 Image classification method based on biased selection pooling
CN116630697B (en) * 2023-05-17 2024-04-05 安徽大学 Image classification method based on biased selection pooling
CN117788492A (en) * 2024-02-28 2024-03-29 苏州元脑智能科技有限公司 Video object segmentation method, system, electronic device and storage medium
CN117788492B (en) * 2024-02-28 2024-04-26 苏州元脑智能科技有限公司 Video object segmentation method, system, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN110135267B (en) Large-scene SAR image fine target detection method
CN110910391B (en) Video object segmentation method for dual-module neural network structure
CN114863348A (en) Video target segmentation method based on self-supervision
CN109086811B (en) Multi-label image classification method and device and electronic equipment
CN107491734B (en) Semi-supervised polarimetric SAR image classification method based on multi-core fusion and space Wishart LapSVM
CN112668579A (en) Weak supervision semantic segmentation method based on self-adaptive affinity and class distribution
CN114332578A (en) Image anomaly detection model training method, image anomaly detection method and device
CN111428625A (en) Traffic scene target detection method and system based on deep learning
CN113159048A (en) Weak supervision semantic segmentation method based on deep learning
CN113780292A (en) Semantic segmentation network model uncertainty quantification method based on evidence reasoning
CN112613350A (en) High-resolution optical remote sensing image airplane target detection method based on deep neural network
CN114332473A (en) Object detection method, object detection device, computer equipment, storage medium and program product
CN114973019A (en) Deep learning-based geospatial information change detection classification method and system
CN114511785A (en) Remote sensing image cloud detection method and system based on bottleneck attention module
CN114998360A (en) Fat cell progenitor cell segmentation method based on SUnet algorithm
CN113344069B (en) Image classification method for unsupervised visual representation learning based on multi-dimensional relation alignment
CN114580501A (en) Bone marrow cell classification method, system, computer device and storage medium
CN114299291A (en) Interpretable artificial intelligent medical image semantic segmentation method
Bagwari et al. A comprehensive review on segmentation techniques for satellite images
CN116883432A (en) Method and device for segmenting focus image, electronic equipment and readable storage medium
CN111611919A (en) Road scene layout analysis method based on structured learning
CN117152427A (en) Remote sensing image semantic segmentation method and system based on diffusion model and knowledge distillation
CN116580243A (en) Cross-domain remote sensing scene classification method for mask image modeling guide domain adaptation
CN109726690B (en) Multi-region description method for learner behavior image based on DenseCap network
CN113313185A (en) Hyperspectral image classification method based on self-adaptive spatial spectral feature extraction

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination