CN114863348A - Video target segmentation method based on self-supervision - Google Patents
- Publication number
- CN114863348A CN114863348A CN202210658263.6A CN202210658263A CN114863348A CN 114863348 A CN114863348 A CN 114863348A CN 202210658263 A CN202210658263 A CN 202210658263A CN 114863348 A CN114863348 A CN 114863348A
- Authority
- CN
- China
- Prior art keywords
- target
- edge
- network model
- frame
- image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/49—Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/40—Extraction of image or video features
- G06V10/44—Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/40—Scenes; Scene-specific elements in video content
- G06V20/46—Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Multimedia (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a video target segmentation method based on self-supervision, which mainly addresses two problems of the prior art: low segmentation accuracy and high sensitivity to target occlusion and tracking drift. The scheme comprises the following steps: 1) acquiring video sequences from a video target segmentation data set, preprocessing them, and dividing them into training, validation and test sample sets; 2) constructing and training an image reconstruction neural network model, extracting target features with a self-supervised learning method based on a multi-pixel-scale image reconstruction task; 3) constructing and training a side output edge detection network model; 4) constructing and training a self-supervised edge correction network model; 5) combining the three trained models into a video target segmentation model; 6) feeding the test set into the video target segmentation model to obtain the target segmentation result. The method effectively improves the generalization and accuracy of video target segmentation and can be used in fields such as autonomous driving, intelligent surveillance and intelligent UAV tracking.
Description
Technical Field
The invention belongs to the technical field of computer vision and further relates to video target segmentation technology, in particular to a video target segmentation method based on self-supervision, which can be used in fields such as autonomous driving, intelligent surveillance and intelligent UAV tracking.
Background
Computer vision aims to emulate the process by which humans form visual perception and is a key link in the development of artificial intelligence technology; computer vision algorithms seek to simulate human visual behavior with high accuracy and to provide perceptual information for downstream tasks. In human perception, the visual scene changes continuously, and under the current state of visual technology the storage format closest to human perception is video; a computer vision algorithm that handles video tasks therefore comes closest to simulating human visual behavior.
The video target segmentation task is an important topic in video processing; its goal is to separate an object of interest from the background across a video sequence. In recent years, owing to the excellent performance of deep learning in computer vision tasks (such as image recognition, target tracking and action recognition), video target segmentation based on deep learning has become the mainstream approach to the task. The performance of a deep-learning-based video target segmentation algorithm depends on the scale of the neural network it uses, and the performance of the neural network depends on large amounts of training data: the larger the training data set, the better the generalization and robustness of the trained network. Under supervised learning, however, producing a video target segmentation training set is expensive and time-consuming, since every pixel of each image must be labeled spatially and every frame of each video sequence must be labeled temporally. The performance of a video target segmentation model is also closely related to its structure, and reasonable optimization of the model's inference process can effectively reduce errors in the video target segmentation process.
The research goal of self-supervised learning is to train a deep learning model without any manual labels, so that the model extracts effective visual representation information from large collections of unlabeled images or videos; the extracted representations are then fine-tuned and used by downstream tasks. Video target segmentation based on self-supervised learning is designed for the specific task of semi-supervised video target segmentation: the video target segmentation model is trained with a self-supervised learning method, the trained model can be used directly for the video target segmentation task, and no manually labeled data set is needed at any point of the training process.
Research on self-supervised video target segmentation follows two main lines: first, designing better pretext tasks for training, so that the model gains stronger representation extraction capability; second, introducing more mechanisms that reduce the influence of target occlusion and tracking drift in the semi-supervised video target segmentation setting. Vondrick et al. published an article entitled "Tracking Emerges by Colorizing Videos" at the European Conference on Computer Vision in 2018, proposing a self-supervised video tracking and colorization model that exploits the natural temporal coherence of color to learn to colorize gray-scale videos, further improving self-supervised video tracking; however, because the model propagates from previous frames, it is not robust to target occlusion and tracking drift. Lai et al. published an article entitled "Self-supervised learning for video correspondence flow" (CorrFlow) at the British Machine Vision Conference in 2019, introducing a restricted attention mechanism that raises the resolution of the model input and improves segmentation accuracy without increasing the burden on the computing equipment; however, the method does not consider the generalization of feature extraction across target scales and performs poorly when target scales differ greatly.
Disclosure of Invention
The invention aims to provide a video target segmentation method based on self-supervision that addresses the defects of the prior art, namely the technical problems of low segmentation accuracy and high sensitivity to target occlusion and tracking drift.
The idea for realizing the invention is as follows: first, target features are extracted with a self-supervised learning method based on a multi-pixel-scale image reconstruction task, so that the video target segmentation model accounts for the features of both large and small targets and attains better generalization; then, to counter the error accumulation that arises while the video target segmentation model segments the target, the semantic edges of the image are used to correct the target segmentation mask; finally, a self-supervised edge fusion network is designed to obtain a more accurate target segmentation mask.
In order to achieve the purpose, the technical scheme adopted by the invention comprises the following steps:
(1) acquiring a training sample set, a verification sample set and a test sample set:
obtaining video sequences from a video target segmentation data set, preprocessing them to obtain a frame sequence set V, and dividing the frame sequences in the set into a training sample set V_train, a validation sample set V_val and a test sample set V_test;
(2) Constructing and training an image reconstruction neural network model R:
(2a) constructing an image reconstruction neural network model R formed by a feature extraction network, where the feature extraction network is a residual network comprising a plurality of convolutional layers, a plurality of pooling layers, a plurality of residual unit modules and a single fully-connected layer connected in sequence;
(2b) defining the loss function of the image reconstruction neural network model R:
L_mix = α·L_cls + (1 - α)·L_reg
where L_cls is the cross-entropy loss of the quantized image reconstruction task: for the training sample set, E cluster centroid points μ_1, μ_2, ..., μ_E are selected, with E ≤ 50; the class of each sample is computed from its distance to the cluster centroids, the number of target classes contained in the frame sequence set V being C; the centroid positions are corrected so that the same target carries the same label across frames and different targets carry different labels, the class of the i-th pixel of a given frame picture I_t being predicted with the K-means algorithm; L_reg is the regression loss of the RGB image reconstruction task, measured between the reconstructed target-frame pixels and the real target-frame pixels; α is a weight coefficient with 0.1 ≤ α ≤ 0.9;
(2c) setting the feature extraction network parameters and the maximum number of iterations N; according to the loss function of the image reconstruction neural network model R, iteratively training the model with the target frame pictures of the training sample set V_train to obtain the trained image reconstruction neural network model R;
(3) constructing and training a side output edge detection network model Q:
(3a) constructing an edge detection network model Q comprising a side output edge detection layer SODL and a side output edge fusion layer SOFL connected in sequence, where the side output edge detection layer SODL comprises a deconvolution layer and a convolutional layer with 1×1 kernel and one output channel, and the side output edge fusion layer SOFL is a convolutional layer with 1×1 kernel and one channel;
(3b) defining the loss function of the side output edge detection network model Q:
L_edge = L_side + L_fuse
where L_side is the side output edge detection loss, a weighted sum over the side outputs in which β_i is the weight coefficient of the i-th side output edge detection network and the i-th term is the loss of that network's prediction; in that per-side loss, e denotes the target-edge ground truth of the input image, |e_-| the number of edge pixels in the ground truth, |e_+| the number of non-edge pixels in the ground truth, and ω_i the parameters of the convolutional layer; L_fuse is the edge fusion loss function;
(3c) setting the maximum number of iterations I; according to the loss function of the side output edge detection network model Q, iteratively training the model with the feature map sets output by the structural layers of the feature extraction network in the image reconstruction neural network model R, obtaining the trained side output edge detection network model Q;
(4) constructing and training an edge correction network model Z:
(4a) constructing an edge correction network model Z composed of an atrous spatial pyramid pooling model F_γ and a softmax activation output layer connected in sequence, where the atrous spatial pyramid pooling model F_γ consists of a plurality of convolutional layers and pooling layers connected in sequence;
(4b) defining the loss function of the edge correction network model Z:
where the coarse segmentation result of the target frame output by the edge detection layer and the prediction of the atrous spatial pyramid pooling model F_γ enter the loss together with the image edges obtained by the Canny algorithm; M denotes the number of pixel classes in the mask, and the loss is normalized by the total number of pixels in the mask;
(4c) setting a maximum iteration number H, performing iterative training on the edge correction network model Z according to a loss function of the edge correction network model Z and by using output results of the image reconstruction network model R and the edge detection network model Q to obtain a trained edge correction network model Z;
(5) combining the trained image reconstruction neural network R, side output edge detection network Q and edge correction network model Z to obtain a video target segmentation model that corrects the segmentation result with the image target edges;
(6) obtaining a self-supervision video target segmentation result:
the frame images in the test set V_test are used as input of the video target segmentation model for forward propagation to obtain predicted segmentation labels for all test frame images, and the final segmentation result images are obtained from these predicted labels.
Compared with the prior art, the invention has the following advantages:
First, since a multi-pixel-scale image reconstruction task is adopted as the pretext task of self-supervised learning, the features extracted by the trained model generalize better to both large and small targets in the video segmentation task, giving better performance on the overall video target segmentation task.
Second, the invention repairs the target mask using the target edges in the video picture: a side output edge detection network integrates the feature maps extracted by each layer of the feature extraction network in the video target segmentation model and predicts the candidate target edges in the target frame, and a self-supervised edge fusion model fuses the segmentation result output by the video target segmentation model with the target edges output by the side output edge detection network, so that the segmentation mask is corrected according to the target edges and a more accurate segmentation result is obtained.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and specific examples.
The first embodiment is as follows: referring to fig. 1, the video object segmentation method based on self-supervision provided by the invention specifically includes the following steps:
step 1: acquiring a training sample set, a verification sample set and a test sample set:
obtaining video sequences from a video target segmentation data set, preprocessing them to obtain a frame sequence set V, and dividing the frame sequences in the set into a training sample set V_train, a validation sample set V_val and a test sample set V_test; this is realized as follows:
(1a) S multi-class video sequences are obtained from a video target segmentation data set and preprocessed into a frame sequence set V = {V_1, V_2, ..., V_S}, S ≥ 3000, where V_k denotes the k-th frame sequence consisting of preprocessed image frames and each sequence contains M ≥ 30 image frames;
(1b) more than half of the frame sequences are randomly drawn from the frame sequence set V to form the training sample set V_train, where S/2 < N < S; for each frame sequence in the training sample set, every target frame picture to be segmented is scaled to an image block of size p×h and converted from RGB to Lab; half of the remaining frame sequences form the validation sample set V_val, where J ≤ S/4; the other half forms the test sample set V_test, where T ≤ S/4, likewise converted from RGB to Lab.
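The split described in step (1b) can be sketched as follows. This is a minimal illustration: the function name, the fixed 3/4 training fraction (any N with S/2 < N < S would satisfy the text) and the seeded shuffle are assumptions, not part of the patent.

```python
import random

def split_frame_sequences(sequences, seed=0):
    """Split frame sequences into train / validation / test sample sets.

    More than half of the sequences go to training; the remaining
    sequences are halved between validation and test, mirroring the
    constraints S/2 < N < S, J <= S/4, T <= S/4 in the text.
    """
    rng = random.Random(seed)
    shuffled = sequences[:]
    rng.shuffle(shuffled)
    s = len(shuffled)
    n_train = s * 3 // 4            # illustrative choice with S/2 < N < S
    train = shuffled[:n_train]
    rest = shuffled[n_train:]
    val = rest[: len(rest) // 2]    # J <= S/4 sequences
    test = rest[len(rest) // 2:]    # T <= S/4 sequences
    return train, val, test
```

Scaling to p×h blocks and the RGB-to-Lab conversion would be applied per frame afterwards, e.g. with an image library.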
Step 2: constructing and training an image reconstruction neural network model R:
(2a) constructing an image reconstruction neural network model R formed by a feature extraction network, where the feature extraction network is a residual network comprising a plurality of convolutional layers, a plurality of pooling layers, a plurality of residual unit modules and a single fully-connected layer connected in sequence;
(2b) defining the loss function of the image reconstruction neural network model R:
L_mix = α·L_cls + (1 - α)·L_reg
where L_cls is the cross-entropy loss of the quantized image reconstruction task: for the training sample set, E cluster centroid points μ_1, μ_2, ..., μ_E are selected, with E ≤ 50; the class of each sample is computed from its distance to the cluster centroids, the number of target classes contained in the frame sequence set V being C; the centroid positions are corrected so that the same target carries the same label across frames and different targets carry different labels, the class of the i-th pixel of a given frame picture I_t being predicted with the K-means algorithm; L_reg is the regression loss of the RGB image reconstruction task, measured between the reconstructed target-frame pixels and the real target-frame pixels; α is a weight coefficient with 0.1 ≤ α ≤ 0.9;
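A minimal NumPy sketch of the mixed loss L_mix = α·L_cls + (1 - α)·L_reg of step (2b), combining a softmax cross-entropy over quantized color classes with an L2 regression over reconstructed pixels; the array shapes and the mean reductions are illustrative assumptions, since the patent leaves them unspecified.

```python
import numpy as np

def mixed_reconstruction_loss(class_logits, class_targets, recon, target, alpha=0.5):
    """L_mix = alpha * L_cls + (1 - alpha) * L_reg.

    class_logits : (P, E) unnormalized scores over E cluster classes per pixel
    class_targets: (P,)   integer cluster label of each pixel (from K-means)
    recon, target: (P, 3) reconstructed and real target-frame pixels
    """
    assert 0.1 <= alpha <= 0.9          # constraint stated in the patent
    # softmax cross-entropy over the quantized color classes (L_cls)
    logits = class_logits - class_logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    l_cls = -log_probs[np.arange(len(class_targets)), class_targets].mean()
    # L2 regression between reconstructed and real pixels (L_reg)
    l_reg = ((recon - target) ** 2).mean()
    return alpha * l_cls + (1 - alpha) * l_reg
```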
(2c) setting the feature extraction network parameters and the maximum number of iterations N; according to the loss function of the image reconstruction neural network model R, iteratively training the model with the target frame pictures of the training sample set V_train to obtain the trained image reconstruction neural network model R; this is realized as follows:
(2c1) setting the hyperparameters of the feature extraction network as θ and the maximum number of iterations as N ≥ 150000, with n denoting the current iteration; the iteration counter is initialized to n = 1;
(2c2) the target frame pictures of the training sample set V_train are used as input of the image reconstruction neural network model R for forward propagation:
for each target frame I_t to be segmented, the q frames preceding it are selected as reference frames {I'_0, I'_1, ..., I'_q}, 2 ≤ q ≤ 5, and the target frame I_t together with its reference frame set is used as input of the feature extraction network Φ(·; θ); the feature extraction network extracts features from I_t and from every reference frame image, giving the target frame feature f_t = Φ(I_t; θ) and the reference frame features f'_0 = Φ(I'_0; θ), ..., f'_q = Φ(I'_q; θ); the target frames {I_t | 0 ≤ t ≤ N} of the training sample set are used as input of the K-means algorithm to obtain the quantized image reconstruction loss value L_cls, and the reconstructed target frame together with the real target frame I_t is used as input of the RGB image reconstruction task to obtain the RGB image reconstruction loss value L_reg;
(2c3) using the loss function L_mix, the loss value of the image reconstruction neural network is computed from the cross-entropy loss L_cls and the regression loss L_reg; the gradient g(θ) of the network parameters is computed by back-propagation, and the network parameters θ are then updated by gradient descent;
(2c4) judging whether n = N holds; if so, the trained image reconstruction neural network R is obtained; otherwise, let n = n + 1 and return to step (2c2).
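Step (2c2) relies on the K-means algorithm to produce quantized reconstruction targets. A toy NumPy sketch of that quantization is given below; the plain Lloyd iteration with a fixed iteration count and seeded initialization is an illustrative assumption, since the patent does not specify the K-means variant.

```python
import numpy as np

def kmeans_quantize(pixels, n_clusters, n_iter=20, seed=0):
    """Assign each pixel a cluster label via plain Lloyd's K-means.

    pixels     : (P, D) pixel colors (e.g. Lab channels)
    n_clusters : E <= 50 cluster centroid points mu_1..mu_E
    Returns (labels, centroids).
    """
    rng = np.random.default_rng(seed)
    centroids = pixels[rng.choice(len(pixels), n_clusters, replace=False)]
    for _ in range(n_iter):
        # squared distance of every pixel to every centroid -> nearest centroid
        d = ((pixels[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for k in range(n_clusters):
            if np.any(labels == k):
                centroids[k] = pixels[labels == k].mean(axis=0)
    return labels, centroids
```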
Step 3: constructing and training a side output edge detection network model Q:
(3a) constructing an edge detection network model Q comprising a side output edge detection layer SODL and a side output edge fusion layer SOFL connected in sequence, where the side output edge detection layer SODL comprises a deconvolution layer and a convolutional layer with 1×1 kernel and one output channel, and the side output edge fusion layer SOFL is a convolutional layer with 1×1 kernel and one channel;
(3b) defining the loss function of the side output edge detection network model Q:
L_edge = L_side + L_fuse
where L_side is the side output edge detection loss, a weighted sum over the side outputs in which β_i is the weight coefficient of the i-th side output edge detection network and the i-th term is the loss of that network's prediction; in that per-side loss, e denotes the target-edge ground truth of the input image, |e_-| the number of edge pixels in the ground truth, |e_+| the number of non-edge pixels in the ground truth, and ω_i the parameters of the convolutional layer; L_fuse is the edge fusion loss function;
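With its edge / non-edge pixel counts, the per-side loss of step (3b) reads as a class-balanced binary cross-entropy in the style of holistically-nested edge detection; a NumPy sketch under that reading follows (the sum reduction and the exact form of the balancing weights are assumptions, since the garbled source does not preserve the formula).

```python
import numpy as np

def balanced_edge_bce(pred, truth, eps=1e-7):
    """Class-balanced binary cross-entropy for edge maps.

    pred  : (H, W) predicted edge probabilities in (0, 1)
    truth : (H, W) binary target-edge ground truth e
    Edge pixels are re-weighted by the non-edge fraction and vice versa,
    so the rare edge class is not drowned out by the background.
    """
    n_edge = truth.sum()                  # count of edge pixels
    n_non = truth.size - n_edge           # count of non-edge pixels
    w_edge = n_non / truth.size           # weight on edge pixels
    w_non = n_edge / truth.size           # weight on non-edge pixels
    p = np.clip(pred, eps, 1.0 - eps)
    loss = -(w_edge * truth * np.log(p) + w_non * (1 - truth) * np.log(1 - p))
    return loss.sum()
```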
(3c) setting the maximum number of iterations I; according to the loss function of the side output edge detection network model Q, iteratively training the model with the feature map sets output by the structural layers of the feature extraction network in the image reconstruction neural network model R, obtaining the trained side output edge detection network model Q; this is realized as follows:
(3c1) setting the maximum number of iterations I ≥ 150000, with i denoting the current iteration; the iteration counter is initialized to i = 1;
(3c2) the feature map sets output by the structural layers of the feature extraction network in the image reconstruction network model are used as input of the side output edge detection network for forward propagation:
(3c3) the side output edge detection layer extracts the coarse edges of the target from the feature map set, obtaining the coarse edge corresponding to each feature map;
(3c4) the coarse edge set output by the side output edge detection layer SODL is used as input of the side output edge fusion layer SOFL, and the coarse edges are fused by weighting to obtain the final predicted edge, where the fused feature is formed by merging the coarse edges and ω_fuse denotes the parameters of the side output edge fusion layer;
(3c5) using the loss function L_edge, the loss value of the edge detection network is computed from the side output edge detection loss L_side and the side output edge fusion loss L_fuse; the gradient g(ω) of the network parameters is computed by back-propagation, and the network parameters ω are then updated by gradient descent;
(3c6) judging whether i = I holds; if so, the trained side output edge detection network model Q is obtained; otherwise, let i = i + 1 and return to step (3c2).
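The weighted fusion of step (3c4) is a 1×1 convolution over the stacked coarse edge maps, which amounts to a per-pixel weighted sum; a NumPy sketch follows, with a sigmoid mapping the fused score to an edge probability (the activation is an assumption, as the patent does not name one).

```python
import numpy as np

def fuse_side_outputs(side_edges, w_fuse):
    """Side output edge fusion: a 1x1 convolution over stacked coarse edges.

    side_edges : (K, H, W) coarse edge maps from the K side outputs
    w_fuse     : (K,) learned fusion weights (the 1x1 conv kernel)
    A 1x1 convolution with K input channels and 1 output channel is
    exactly a per-pixel weighted sum of the K maps.
    """
    fused = np.tensordot(w_fuse, side_edges, axes=1)   # (H, W)
    return 1.0 / (1.0 + np.exp(-fused))                # sigmoid -> probability
```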
Step 4: constructing and training an edge correction network model Z:
(4a) constructing an edge correction network model Z composed of an atrous spatial pyramid pooling model F_γ and a softmax activation output layer connected in sequence, where the atrous spatial pyramid pooling model F_γ consists of a plurality of convolutional layers and pooling layers connected in sequence;
(4b) defining the loss function of the edge correction network model Z:
where the coarse segmentation result of the target frame output by the edge detection layer and the prediction of the atrous spatial pyramid pooling model F_γ enter the loss together with the image edges obtained by the Canny algorithm; M denotes the number of pixel classes in the mask, and the loss is normalized by the total number of pixels in the mask;
(4c) setting the maximum number of iterations H; according to the loss function of the edge correction network model Z, iteratively training the model with the output results of the image reconstruction network model R and the edge detection network model Q, obtaining the trained edge correction network model Z; this is realized as follows:
(4c1) setting the maximum number of iterations as H ≥ 150000, with h denoting the current iteration; the iteration counter is initialized to h = 1;
(4c2) the coarse target-frame segmentation result output by the image reconstruction network model R and the edge detection result output by the edge detection network model Q are used as input of the edge correction network model Z for forward propagation:
(4c2.1) the edge correction network first concatenates the coarse target-frame segmentation result and the edge detection result along the channel dimension to obtain a feature map of size H×W×(K+1);
(4c2.2) the feature map is used as input of the atrous spatial pyramid pooling model F_γ to obtain a prediction with an enlarged receptive field;
(4c2.3) the enlarged-receptive-field prediction is used as input of the softmax activation output layer, and the segmentation label of each pixel is determined from the probability of the pixel belonging to each class, yielding a more accurate target segmentation mask after edge fusion correction of the target frame's segmentation mask, where O_t denotes the predicted segmentation label of target frame I_t;
(4c3) using the loss function L_corr, the loss value of the edge correction network is computed; the gradient g(c) of the network parameters is computed by back-propagation, and the network parameters c are then updated by gradient descent;
(4c4) judging whether h = H holds; if so, the trained edge correction network model Z is obtained; otherwise, let h = h + 1 and return to step (4c2).
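The forward pass of steps (4c2.1)-(4c2.3) can be sketched as follows; the atrous spatial pyramid pooling stage F_γ is abstracted as a caller-supplied function (identity by default), and the array shapes are illustrative assumptions.

```python
import numpy as np

def edge_correct(coarse_masks, edge_map, aspp=lambda x: x):
    """Edge correction forward pass.

    coarse_masks : (H, W, K) per-class coarse segmentation scores from R
    edge_map     : (H, W)    predicted target edge from Q
    aspp         : stand-in for the ASPP model F_gamma, any
                   (H, W, K+1) -> (H, W, C) map
    Returns (labels, probs): per-pixel argmax labels and softmax probabilities.
    """
    # (4c2.1) concatenate along the channel dimension -> H x W x (K+1)
    feat = np.concatenate([coarse_masks, edge_map[..., None]], axis=-1)
    # (4c2.2) enlarge the receptive field with the ASPP model
    scores = aspp(feat)
    # (4c2.3) per-pixel softmax; the label is the most probable class
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    probs = e / e.sum(axis=-1, keepdims=True)
    return probs.argmax(axis=-1), probs
```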
Step 5: combining the trained image reconstruction neural network R, side output edge detection network Q and edge correction network model Z into a video target segmentation model that corrects the segmentation result with the image target edges, specifically combined as follows: the intermediate feature maps extracted by the image reconstruction neural network R are fed into the side output edge detection network Q to obtain the target edge prediction map, and the target segmentation mask prediction map output by the image reconstruction neural network R together with the target edge prediction map output by the side output edge detection network Q is used as input of the edge correction network model Z, giving the trained video target segmentation model based on correcting the segmentation result with the image target edges.
Step 6: obtaining the self-supervised video target segmentation result:
the frame images in the test set V_test are used as input of the video target segmentation model for forward propagation to obtain predicted segmentation labels for all test frame images, and the final segmentation result images are obtained from these predicted labels.
Example two: the overall steps of this embodiment are the same as those of the first embodiment; specific values are given for some of the parameters to further describe the implementation process of the invention:
step 1) obtaining a training sample set, a verification sample set and a test sample set:
step 1a) acquires S multi-class video sequences from a video object segmentation dataset and preprocesses them to obtain a frame sequence set V; in this embodiment, the multi-class video sequences are acquired from the YouTube-VOS dataset, with S = 4453 and M = 50;
step 1b) sets the number of target classes C of the frame sequence set V to 94 and the class set Class = {c_num | 1 ≤ num ≤ C}; multiple classes of targets may appear in each frame sequence, where c_num denotes the num-th class of target;
Step 1c) randomly extracts more than half of the frame sequences from the frame sequence set V to form a training sample set V_train, where S/2 < N < S; for each frame sequence in the training set, each target frame picture to be segmented is scaled into an image block of size p × h and converted from an RGB image into a Lab image; half of the remaining frame sequences are extracted to form a verification sample set V_val, where J ≤ S/4, and the other half constitutes a test sample set V_test, where T ≤ S/4, the original images likewise being converted from RGB format to Lab format;
A crop box of size x × y is set; each target frame picture to be segmented in the training set frame sequences is cropped to obtain a cropped frame picture, and the cropped frame pictures are normalized; the normalized frame pictures form the preprocessed training frame sequence corresponding to the m-th frame sequence in the training sample set;
in this embodiment, x is 256, y is 256, p is 256, and h is 3;
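The preprocessing of step 1 can be illustrated with the following minimal sketch. The function name and the centre-crop strategy are illustrative assumptions (the embodiment only states that an x × y crop box is applied), and the RGB-to-Lab conversion is left as a stand-in because it requires an image library such as OpenCV:

```python
import numpy as np

def preprocess_frame(frame_rgb, crop_size=(256, 256)):
    """Crop a frame to x*y = 256*256, convert to Lab and normalise.

    A centre crop is used here for illustration only.  The Lab
    conversion is a stand-in; in practice a library call such as
    cv2.cvtColor(crop, cv2.COLOR_RGB2LAB) would be used.
    """
    h, w, _ = frame_rgb.shape
    cy, cx = crop_size
    top, left = (h - cy) // 2, (w - cx) // 2
    crop = frame_rgb[top:top + cy, left:left + cx]   # x*y crop box
    lab = crop.astype(np.float32)                    # stand-in for RGB -> Lab
    return lab / 255.0                               # normalisation to [0, 1]

frame = np.random.randint(0, 256, (480, 640, 3), dtype=np.uint8)
out = preprocess_frame(frame)
```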
step 2), constructing an image reconstruction neural network model R:
step 2a) constructing a structure of an image reconstruction neural network model R:
An image reconstruction neural network model formed by a feature extraction network is constructed, where the feature extraction network adopts a residual network comprising a plurality of convolution layers, a plurality of pooling layers, a plurality of residual unit modules and a single fully connected layer connected in sequence;
The feature extraction network comprises 17 convolution layers and 1 fully connected layer. The 18-layer structure is divided into 5 blocks: conv_1, conv_2, conv_3, conv_4 and conv_5. conv_1 is a single convolution layer with kernel size 7 × 7 and 64 channels; conv_2 comprises two convolution layers with kernel size 3 × 3, 64 channels and stride 1; conv_3 comprises two convolution layers with kernel size 3 × 3 and 128 channels, where the first layer has stride 2 and the second has stride 1; conv_4 comprises two convolution layers with kernel size 3 × 3, 256 channels and stride 1; conv_5 comprises two convolution layers with kernel size 3 × 3, 512 channels and stride 1;
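The five-block structure described above can be sketched in PyTorch as follows. This is an illustrative approximation only: the residual shortcuts, pooling layers and fully connected layer are omitted for brevity, and the stride of conv_1 is an assumption not specified in the text:

```python
import torch
import torch.nn as nn

def block(in_ch, out_ch, first_stride=1):
    """Two 3x3 convolution layers, as described for conv_2 .. conv_5."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, 3, stride=first_stride, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(out_ch, out_ch, 3, stride=1, padding=1),
        nn.ReLU(inplace=True),
    )

class FeatureExtractor(nn.Module):
    """Sketch of the 5-block extractor; residual shortcuts, pooling and
    the fully connected layer are omitted, stride of conv_1 assumed 2."""
    def __init__(self):
        super().__init__()
        self.conv_1 = nn.Conv2d(3, 64, 7, stride=2, padding=3)
        self.conv_2 = block(64, 64)
        self.conv_3 = block(64, 128, first_stride=2)
        self.conv_4 = block(128, 256)
        self.conv_5 = block(256, 512)

    def forward(self, x):
        feats = []
        for layer in (self.conv_1, self.conv_2, self.conv_3,
                      self.conv_4, self.conv_5):
            x = layer(x)
            feats.append(x)      # per-block feature maps, used later as
        return feats             # side inputs to the edge detection network

f = FeatureExtractor()(torch.randn(1, 3, 256, 256))
```

Returning the per-block feature maps mirrors how the side output edge detection network of step 4 consumes the outputs of each structural layer.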
step 2b) defining a loss function of the image reconstruction neural network model:
L_mix = α · L_cls + (1 − α) · L_reg
where L_cls denotes the cross-entropy loss function of the quantized image reconstruction task. For the training sample set, E cluster centroid points μ_1, μ_2, ..., μ_E are selected, with E ≤ 50; the class of each sample is computed from its distance to the cluster centroid points, and the centroid positions are corrected so that targets of the same class receive the same label across frames while different targets receive different labels. The cross-entropy is taken between the class to which the i-th pixel of a given frame picture I_t belongs and the prediction result of the K-means algorithm. L_reg denotes the regression loss function of the RGB image reconstruction task, computed between the reconstructed target frame pixels and the real target frame pixels; α denotes a weight coefficient, with 0.1 ≤ α ≤ 0.9;
in this embodiment, K is 16, α is 0.6;
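The mixed loss L_mix = αL_cls + (1 − α)L_reg with α = 0.6 can be sketched as follows. The use of mean-squared error for the regression term is an assumption; the embodiment names it only as a regression loss on the RGB reconstruction:

```python
import torch
import torch.nn.functional as F

def mixed_loss(cls_logits, kmeans_labels, recon, target, alpha=0.6):
    """L_mix = alpha * L_cls + (1 - alpha) * L_reg (step 2b, alpha = 0.6).

    cls_logits:    per-pixel class scores against the E K-means centroids
    kmeans_labels: pseudo labels produced by the K-means algorithm
    recon, target: reconstructed and real target frame tensors
    MSE stands in for the unspecified regression term (assumption).
    """
    l_cls = F.cross_entropy(cls_logits, kmeans_labels)   # quantized task
    l_reg = F.mse_loss(recon, target)                    # RGB reconstruction
    return alpha * l_cls + (1 - alpha) * l_reg

loss = mixed_loss(torch.randn(2, 16, 8, 8),
                  torch.randint(0, 16, (2, 8, 8)),
                  torch.randn(2, 3, 8, 8),
                  torch.randn(2, 3, 8, 8))
```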
step 3) iterative training is carried out on the image reconstruction neural network model:
Step 3a) sets the network hyperparameter of the feature extraction network as θ, denotes the current iteration number by n and the maximum number of iterations by N, with N ≥ 150000, and initializes n = 1;
In this embodiment, N = 300000 for more sufficient model training;
Step 3b) takes the target frame pictures in the training sample set V_train as the input of the image reconstruction neural network model R for forward propagation:
Step 3b1) for each target frame I_t to be segmented, q preceding frames are selected as reference frames {I'_0, I'_1, ..., I'_q}, where 2 ≤ q ≤ 5. The target frame I_t and its reference frame set are taken as the input of the feature extraction network Φ(·; θ), which extracts features from I_t and from each reference frame image, obtaining the target frame image feature f_t = Φ(I_t; θ) and the reference frame image features f'_0 = Φ(I'_0; θ), ..., f'_q = Φ(I'_q; θ). The target frames {I_t | 0 ≤ t ≤ N} in the training sample set are taken as the input of the K-means algorithm to obtain the quantized image reconstruction loss value L_cls, and the reconstructed target frame together with the real target frame I_t is taken as the input of the RGB image reconstruction task to obtain the RGB image reconstruction loss value L_reg;
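The K-means pseudo-labelling used in step 3b1 can be illustrated with the following minimal NumPy stand-in; the random initialization and the fixed number of refinement iterations are assumptions, and a library implementation would be used in practice:

```python
import numpy as np

def kmeans_labels(features, E=16, iters=10, seed=0):
    """Assign each pixel feature to its nearest of E cluster centroids,
    then refine the centroid positions (the 'centroid correction' step).

    features: (N, D) array of pixel features.
    Returns a pseudo label in [0, E) for every pixel feature.
    """
    rng = np.random.default_rng(seed)
    centroids = features[rng.choice(len(features), E, replace=False)]
    for _ in range(iters):
        # distance of every feature to every centroid -> nearest cluster
        d = np.linalg.norm(features[:, None] - centroids[None], axis=2)
        labels = d.argmin(axis=1)
        for e in range(E):                     # centroid position correction
            if np.any(labels == e):
                centroids[e] = features[labels == e].mean(axis=0)
    return labels

feats = np.random.rand(200, 8)
labels = kmeans_labels(feats)
```

These pseudo labels play the role of the classification targets in the cross-entropy term L_cls.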
Step 3c) uses the loss function L_mix to calculate the loss value of the image reconstruction neural network from the cross-entropy loss L_cls and the regression loss L_reg, computes the network parameter gradient g(θ) by back propagation, and then updates the network parameter θ by gradient descent, with the update formula:
θ' = θ_n − γ · ∂L_mix^n / ∂θ_n
where θ' represents the result after updating θ_n, γ represents the learning rate with 1e-6 ≤ γ ≤ 1e-3, L_mix^n represents the loss function value of the image reconstruction neural network after the n-th iteration, and ∂·/∂· represents the partial derivative calculation.
In this embodiment, the initial learning rate γ = 0.001; at the 150,000th iteration the learning rate is reduced to γ = 0.0005, at the 200,000th iteration to γ = 0.00025, and at the 250,000th iteration to γ = 0.000125. The optimizer is the Adam optimizer; the purpose of decaying the learning rate after a certain number of iterations is to prevent the loss function from falling into a local minimum;
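This schedule (initial rate 0.001, halved at the 150,000th, 200,000th and 250,000th iterations, with the Adam optimizer) can be expressed directly in PyTorch; the single dummy parameter below is a stand-in for the feature extraction network θ:

```python
import torch

# A single dummy parameter stands in for the feature extraction network θ.
params = [torch.nn.Parameter(torch.zeros(1))]
opt = torch.optim.Adam(params, lr=1e-3)        # initial learning rate 0.001
# Halve the rate at the 150 000th, 200 000th and 250 000th iteration,
# reproducing the 0.001 -> 0.0005 -> 0.00025 -> 0.000125 schedule.
sched = torch.optim.lr_scheduler.MultiStepLR(
    opt, milestones=[150_000, 200_000, 250_000], gamma=0.5)
```

During training, `sched.step()` is called once per iteration after `opt.step()`.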
Step 3d) judges whether n = N holds; if so, the trained image reconstruction neural network R is obtained; otherwise, n = n + 1 is set and step 3b) is executed;
step 4), constructing a side output edge detection network model Q:
step 4a) constructing a structure of a side output edge detection network model Q:
An edge detection network model Q is constructed comprising a side output edge detection layer SODL and a side output edge fusion layer SOFL connected in sequence, where the side output edge detection layer SODL comprises a deconvolution layer and a convolution layer with kernel size 1 × 1 and 1 output channel, and the side output edge fusion layer SOFL is a convolution layer with kernel size 1 × 1 and 1 channel;
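The SODL/SOFL structure of step 4a can be sketched as follows. Bilinear interpolation stands in for the deconvolution layer, and the per-block channel counts are taken from the feature extraction network; both are simplifying assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SideOutputEdgeNet(nn.Module):
    """Sketch of SODL + SOFL.  Each side branch is a 1x1, single-channel
    convolution; upsampling to the input resolution stands in for the
    deconvolution layer (assumption)."""
    def __init__(self, in_channels=(64, 64, 128, 256, 512)):
        super().__init__()
        # SODL: one 1x1, 1-output-channel convolution per side feature map
        self.side = nn.ModuleList(nn.Conv2d(c, 1, 1) for c in in_channels)
        # SOFL: 1x1, 1-channel convolution fusing the stacked coarse edges
        self.fuse = nn.Conv2d(len(in_channels), 1, 1)

    def forward(self, feats, out_size=(256, 256)):
        coarse = [F.interpolate(conv(f), size=out_size, mode='bilinear',
                                align_corners=False)
                  for conv, f in zip(self.side, feats)]
        return self.fuse(torch.cat(coarse, dim=1))   # weighted edge fusion

feats = [torch.randn(1, c, 256 // 2 ** i, 256 // 2 ** i)
         for i, c in enumerate((64, 64, 128, 256, 512))]
edge = SideOutputEdgeNet()(feats)
```

The 1 × 1 fusion convolution realizes the weighted merge of the coarse side edges into the final predicted edge described in step 5b2.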
step 4b) defining a loss function of the side output edge detection network model:
L_edge = L_side + L_fuse
where L_side represents the side output edge detection loss function, computed as a weighted sum over the side outputs, in which β_i represents the weight coefficient of the i-th side output edge detection network and the corresponding term represents the loss function of the prediction result of the i-th side output edge detection network; here e represents the input image target edge truth value, |e−| represents the number of edge pixels in the image target edge truth value, |e+| represents the number of non-edge pixels in the image target edge truth value, ω_i represents the parameters of the convolution layer, and L_fuse represents the edge fusion loss function;
step 5) performing iterative training on the side output edge detection network model Q:
Step 5a) denotes the current iteration number by i and the maximum number of iterations by I, with I ≥ 150000, and initializes i = 1;
In this embodiment, I = 300000 to make model training more sufficient;
Step 5b) takes the feature map set output by each structural layer of the feature extraction network in the image reconstruction network model as the input of the side output edge detection network for forward propagation:
Step 5b1) the side output edge detection layer extracts the coarse edges of the target from the feature map set, obtaining the coarse edge corresponding to each feature map;
Step 5b2) takes the coarse edge set output by the side output edge detection layer SODL as the input of the side output edge fusion layer SOFL and performs weighted fusion of the coarse edges to obtain the final predicted edge, where the fused feature is formed by merging the coarse edges and ω_fuse represents the parameters of the side output edge fusion layer;
Step 5c) uses the loss function L_edge to calculate the loss value of the edge detection network from the side output edge detection loss L_side and the side output edge fusion loss L_fuse, computes the network parameter gradient g(ω) by back propagation, and then updates the network parameter ω by gradient descent, with the update formula:
ω' = ω_i − β · ∂L_edge^i / ∂ω_i
where ω' represents the result after updating ω_i, β represents the learning rate with 1e-6 ≤ β ≤ 1e-3, L_edge^i represents the loss function value of the side output edge detection network after the i-th iteration, and ∂·/∂· represents the partial derivative calculation.
In this embodiment, the initial learning rate β = 0.001; at the 150,000th iteration the learning rate is reduced to β = 0.0005, at the 200,000th iteration to β = 0.00025, and at the 250,000th iteration to β = 0.000125. The optimizer is the Adam optimizer; the purpose of decaying the learning rate after a certain number of iterations is to prevent the loss function from falling into a local minimum;
Step 5d) judges whether i = I holds; if so, the trained side output edge detection network model Q is obtained; otherwise, i = i + 1 is set and step 5b) is executed;
step 6), constructing an edge correction network model Z:
step 6a) constructing a structure of an edge correction network model Z:
An edge correction network model Z is constructed comprising a void space convolution pooling pyramid model F_γ and a softmax activation function output layer connected in sequence, where the void space convolution pooling pyramid model F_γ is composed of a plurality of convolution layers and pooling layers connected in sequence;
The void space convolution pooling pyramid model F_γ comprises a convolution layer, a pooling pyramid and a pooling block. The convolution layer has kernel size 1 × 1; the pooling pyramid comprises three convolution layers connected in parallel, each with kernel size 3 × 3; the pooling block comprises a 1 × 1 pooling layer, a convolution layer with kernel size 1 × 1, and an up-sampling operation layer. A concat operation is performed on the feature maps output by the convolution layer, the pooling pyramid and the pooling block, and the result is then processed by a 1 × 1 pooling layer to obtain the output of the void space convolution pooling pyramid model F_γ;
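The pyramid model F_γ follows the atrous spatial pyramid pooling (ASPP) design; a sketch is given below. The dilation rates (6, 12, 18) and the output channel count are assumptions, since the embodiment specifies only the layer layout:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Sketch of the pyramid model F_gamma: a 1x1 branch, three parallel
    3x3 branches, and a global-pooling block, concatenated and projected.
    Dilation rates and out_ch are illustrative assumptions."""
    def __init__(self, in_ch, out_ch=256):
        super().__init__()
        self.conv1x1 = nn.Conv2d(in_ch, out_ch, 1)
        self.pyramid = nn.ModuleList(
            nn.Conv2d(in_ch, out_ch, 3, padding=d, dilation=d)
            for d in (6, 12, 18))
        self.pool_proj = nn.Conv2d(in_ch, out_ch, 1)
        self.project = nn.Conv2d(out_ch * 5, out_ch, 1)

    def forward(self, x):
        h, w = x.shape[2:]
        pooled = F.adaptive_avg_pool2d(x, 1)             # pooling block
        pooled = F.interpolate(self.pool_proj(pooled), size=(h, w),
                               mode='bilinear', align_corners=False)
        branches = [self.conv1x1(x)] + [c(x) for c in self.pyramid] + [pooled]
        return self.project(torch.cat(branches, dim=1))  # concat + 1x1

y = ASPP(in_ch=17)(torch.randn(1, 17, 64, 64))   # 17 = K + 1 with K = 16
```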
step 6b) defining a loss function of the edge correction network model Z:
where the inputs are the coarse segmentation result of the target frame output by the edge detection layer and the prediction result of the void space convolution pooling pyramid model F_γ; the reference edge is the image edge obtained by the Canny algorithm; M represents the number of pixel classes in the mask, and the normalization term is the total number of pixels in the mask;
step 7) performing iterative training on the edge correction network model Z:
Step 7a) denotes the current iteration number by h and the maximum number of iterations by H, with H ≥ 150000, and initializes h = 1;
In this embodiment, H = 300000 to make model training more sufficient;
Step 7b) takes the coarse target frame segmentation result output by the image reconstruction network model and the edge detection result output by the edge detection network model as the input of the edge correction network model Z for forward propagation:
Step 7b1) the edge correction network first merges the coarse target frame segmentation result and the edge detection result along the channel dimension to obtain a feature map of size H × W × (K + 1);
Step 7b2) takes the feature map as the input of the void space convolution pooling pyramid model F_γ to obtain a prediction result with an expanded receptive field;
Step 7b3) takes the expanded-receptive-field prediction result as the input of the softmax activation function output layer and determines the segmentation label of each pixel according to the probability of each class at each pixel position in the feature map, thereby obtaining a more accurate target segmentation mask after edge fusion correction of the target frame's segmentation mask, where O_t represents the predicted segmentation label of the target frame I_t;
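The forward pass of steps 7b1 to 7b3 can be sketched as follows. The toy 1 × 1 convolution standing in for F_γ and the tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def correct_edges(rough_seg, edge_map, aspp):
    """Edge-correction forward pass of steps 7b1-7b3.

    rough_seg: coarse target-frame segmentation, (B, K, H, W)
    edge_map:  predicted target edge,            (B, 1, H, W)
    aspp:      the pyramid model F_gamma, mapping K+1 channels
               to K class scores (passed in by the caller).
    """
    fused = torch.cat([rough_seg, edge_map], dim=1)  # (B, K+1, H, W), step 7b1
    logits = aspp(fused)                             # expanded receptive field
    probs = F.softmax(logits, dim=1)                 # per-pixel class probs
    return probs.argmax(dim=1)                       # segmentation labels O_t

# Toy stand-in for F_gamma: a single 1x1 convolution (illustration only).
K = 16
aspp = torch.nn.Conv2d(K + 1, K, 1)
labels = correct_edges(torch.randn(2, K, 32, 32),
                       torch.randn(2, 1, 32, 32), aspp)
```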
Step 7c) uses the loss function L_corr to calculate the loss value of the edge correction network, computes the network parameter gradient g(c) by back propagation, and then updates the network parameter c by gradient descent, with the update formula:
c' = c_h − α · ∂L_corr^h / ∂c_h
where c' represents the result after updating c_h, α represents the learning rate with 1e-6 ≤ α ≤ 1e-3, L_corr^h represents the loss function value of the edge correction neural network after the h-th iteration, and ∂·/∂· represents the partial derivative calculation.
In this embodiment, the initial learning rate α = 0.001; at the 150,000th iteration the learning rate is reduced to α = 0.0005, at the 200,000th iteration to α = 0.00025, and at the 250,000th iteration to α = 0.000125. The optimizer is the Adam optimizer; the purpose of decaying the learning rate after a certain number of iterations is to prevent the loss function from falling into a local minimum;
Step 7d) judges whether h = H holds; if so, the trained edge correction network model Z is obtained; otherwise, h = h + 1 is set and step 7b) is executed;
step 8) obtaining a self-supervision video target segmentation result:
The frame images in the test set V_test are taken as the input of the trained video target segmentation model based on the image target edge correction segmentation result for forward propagation; this model is composed of the image reconstruction neural network R, the side output edge detection network Q and the edge correction network Z. The segmentation labels of all test frame images are obtained, and the segmentation result images are determined from these labels.
The technical effects of the present invention are further explained by simulation experiments as follows:
1. simulation conditions and contents:
4453 video sequences were acquired from the YouTube-VOS dataset for use in simulation experiments;
The simulation experiments were carried out on a server with an Intel(R) Core(TM) i7-7800X CPU @ 3.5 GHz, 64 GB of memory, and an NVIDIA GeForce RTX 2080 Ti GPU. The operating system is Ubuntu 16.04, the deep learning framework is PyTorch, and the programming language is Python 3.6;
The invention is compared in simulation with existing video target segmentation methods. To quantitatively compare video target segmentation results, two evaluation indexes are adopted in the experiment: region similarity J and contour similarity F. The higher these two indexes, the better the segmentation result; the simulation results are shown in Table 1.
TABLE 1
2. Simulation result analysis:
As can be seen from Table 1, compared with existing video segmentation methods, both the J index and the F index are significantly improved. The self-supervised video target segmentation technique constructed by the invention can effectively alleviate problems such as target occlusion and tracking drift, thereby improving video target segmentation accuracy, and has important practical significance and application value.
The simulation analysis proves the correctness and the effectiveness of the method provided by the invention.
Parts of the invention not described in detail belong to the common general knowledge of those skilled in the art.
While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and detail may be made therein without departing from the spirit and scope of the invention.
Claims (6)
1. A video object segmentation method based on self-supervision is characterized by comprising the following steps:
(1) acquiring a training sample set, a verification sample set and a test sample set:
Video sequences are obtained from a video target segmentation data set and preprocessed to obtain a frame sequence set V, and the frame sequences in the set are divided to obtain a training sample set V_train, a verification sample set V_val and a test sample set V_test;
(2) Constructing and training an image reconstruction neural network model R:
(2a) An image reconstruction neural network model R formed by a feature extraction network is constructed, where the feature extraction network adopts a residual network comprising a plurality of convolution layers, a plurality of pooling layers, a plurality of residual unit modules and a single fully connected layer connected in sequence;
(2b) defining a loss function of the image reconstruction neural network model R:
L_mix = α · L_cls + (1 − α) · L_reg
where L_cls denotes the cross-entropy loss function of the quantized image reconstruction task; for the training sample set, E cluster centroid points μ_1, μ_2, ..., μ_E are selected, with E ≤ 50; the class of each sample is computed from its distance to the cluster centroid points, the number of target classes contained in the frame sequence set V is set as C, and the centroid positions are corrected so that targets of the same class receive the same label across frames while different targets receive different labels; the cross-entropy is taken between the class to which the i-th pixel of a given frame picture I_t belongs and the prediction result of the K-means algorithm; L_reg denotes the regression loss function of the RGB image reconstruction task, computed between the reconstructed and real target frame pixels; α denotes a weight coefficient with 0.1 ≤ α ≤ 0.9;
(2c) setting characteristic extraction network parameters and maximum iteration times N, reconstructing a loss function of a neural network model R according to the image, and utilizing a training sample set V train Carrying out iterative training on the image reconstruction neural network model R by the target frame picture to obtain a trained image reconstruction neural network model R;
(3) constructing and training a side output edge detection network model Q:
(3a) An edge detection network model Q is constructed comprising a side output edge detection layer SODL and a side output edge fusion layer SOFL connected in sequence, where the side output edge detection layer SODL comprises a deconvolution layer and a convolution layer with kernel size 1 × 1 and 1 output channel, and the side output edge fusion layer SOFL is a convolution layer with kernel size 1 × 1 and 1 channel;
(3b) defining a loss function of the side output edge detection network model Q:
L_edge = L_side + L_fuse
where L_side represents the side output edge detection loss function, computed as a weighted sum over the side outputs, in which β_i represents the weight coefficient of the i-th side output edge detection network and the corresponding term represents the loss function of the prediction result of the i-th side output edge detection network; here e represents the input image target edge truth value, |e−| represents the number of edge pixels in the image target edge truth value, |e+| represents the number of non-edge pixels in the image target edge truth value, ω_i represents the parameters of the convolution layer, and L_fuse represents the edge fusion loss function;
(3c) setting a maximum iteration number I, performing iterative training on the side output edge detection network model Q by utilizing a feature diagram set output by each structural layer of the feature extraction network in the image reconstruction neural network model R according to a loss function of the side output edge detection network model Q to obtain a trained side output edge detection network model Q;
(4) constructing and training an edge correction network model Z:
(4a) An edge correction network model Z is constructed comprising a void space convolution pooling pyramid model F_γ and a softmax activation function output layer connected in sequence, where the void space convolution pooling pyramid model F_γ is composed of a plurality of convolution layers and pooling layers connected in sequence;
(4b) defining a loss function of the edge correction network model Z:
where the inputs are the coarse segmentation result of the target frame output by the edge detection layer and the prediction result of the void space convolution pooling pyramid model F_γ; the reference edge is the image edge obtained by the Canny algorithm; M represents the number of pixel classes in the mask, and the normalization term is the total number of pixels in the mask;
(4c) setting a maximum iteration number H, performing iterative training on the edge correction network model Z according to a loss function of the edge correction network model Z and by using output results of the image reconstruction network model R and the edge detection network model Q to obtain a trained edge correction network model Z;
(5) combining the trained image reconstruction neural network R, the side output edge detection network Q and the edge correction network model Z to obtain a video target segmentation model based on an image target edge correction segmentation result;
(6) obtaining a self-supervision video target segmentation result:
The frame images in the test set V_test are taken as the input of the video target segmentation model for forward propagation to obtain the predicted segmentation labels of all test frames, and the final segmentation result images are obtained from these predicted segmentation labels.
2. The method of claim 1, wherein the training sample set V_train, the verification sample set V_val and the test sample set V_test in step (1) are obtained as follows:
(1a) S multi-class video sequences are obtained from a video target segmentation data set and preprocessed to obtain a frame sequence set V, with S ≥ 3000, where the k-th frame sequence consists of M preprocessed image frames, M ≥ 30;
(1b) More than half of the frame sequences are randomly extracted from the frame sequence set V to form a training sample set V_train, where S/2 < N < S; for each frame sequence in the training sample set, each target frame picture to be segmented is scaled into an image block of size p × h, and the picture format is converted from RGB to Lab; half of the remaining frame sequences are extracted to form a verification sample set V_val, where J ≤ S/4; the other half constitutes a test sample set V_test, where T ≤ S/4, the picture format likewise being converted from RGB to Lab.
3. The method of claim 1, wherein the iterative training of the image reconstruction neural network model R in step (2c) is realized as follows:
(2c1) The network hyperparameter of the feature extraction network is set as θ, the maximum number of iterations is set as N with N ≥ 150000, the current iteration number is denoted by n, and n is initialized to 1;
(2c2) The target frame pictures in the training sample set V_train are taken as the input of the image reconstruction neural network model R for forward propagation:
For each target frame I_t to be segmented, q preceding frames are selected as reference frames {I'_0, I'_1, ..., I'_q}, where 2 ≤ q ≤ 5; the target frame I_t and its reference frame set are taken as the input of the feature extraction network Φ(·; θ), which extracts features from I_t and from each reference frame image, obtaining the target frame image feature f_t = Φ(I_t; θ) and the reference frame image features f'_0 = Φ(I'_0; θ), ..., f'_q = Φ(I'_q; θ); the target frames {I_t | 0 ≤ t ≤ N} in the training sample set are taken as the input of the K-means algorithm to obtain the quantized image reconstruction loss value L_cls, and the reconstructed target frame together with the real target frame I_t is taken as the input of the RGB image reconstruction task to obtain the RGB image reconstruction loss value L_reg;
(2c3) The loss function L_mix is used to calculate the loss value of the image reconstruction neural network from the cross-entropy loss L_cls and the regression loss L_reg; the network parameter gradient g(θ) is computed by back propagation, and the network parameter θ is then updated by gradient descent;
(2c4) Whether n = N holds is judged; if so, the trained image reconstruction neural network R is obtained; otherwise, n = n + 1 is set and step (2c2) is executed again.
4. The method of claim 1, wherein the iterative training of the side output edge detection network model Q in step (3c) is realized as follows:
(3c1) The maximum number of iterations is set as I with I ≥ 150000, the current iteration number is denoted by i, and i is initialized to 1;
(3c2) The feature map set output by each structural layer of the feature extraction network in the image reconstruction network model is taken as the input of the side output edge detection network for forward propagation:
(3c3) The side output edge detection layer extracts the coarse edges of the target from the feature map set, obtaining the coarse edge corresponding to each feature map;
(3c4) The coarse edge set output by the side output edge detection layer SODL is taken as the input of the side output edge fusion layer SOFL, and the coarse edges are weighted and fused to obtain the final predicted edge, where the fused feature is formed by merging the coarse edges and ω_fuse represents the parameters of the side output edge fusion layer;
(3c5) The loss function L_edge is used to calculate the loss value of the edge detection network from the side output edge detection loss L_side and the side output edge fusion loss L_fuse; the network parameter gradient g(ω) is computed by back propagation, and the network parameter ω is then updated by gradient descent;
(3c6) Whether i = I holds is judged; if so, the trained side output edge detection network model Q is obtained; otherwise, i = i + 1 is set and step (3c2) is executed again.
5. The method of claim 1, wherein the iterative training of the edge correction network model Z in step (4c) is realized as follows:
(4c1) The maximum number of iterations is set as H with H ≥ 150000, the current iteration number is denoted by h, and h is initialized to 1;
(4c2) The coarse target frame segmentation result output by the image reconstruction network model R and the edge detection result output by the edge detection network model Q are taken as the input of the edge correction network model Z for forward propagation:
(4c2.1) The edge correction network first merges the coarse target frame segmentation result and the edge detection result along the channel dimension to obtain a feature map of size H × W × (K + 1);
(4c2.2) The feature map is taken as the input of the void space convolution pooling pyramid model F_γ to obtain a prediction result with an expanded receptive field;
(4c2.3) The expanded-receptive-field prediction result is taken as the input of the softmax activation function output layer, and the segmentation label of each pixel is determined according to the probability of each class at each pixel position in the feature map, thereby obtaining a more accurate target segmentation mask after edge fusion correction of the target frame's segmentation mask, where O_t represents the predicted segmentation label of the target frame I_t;
(4c3) The loss function L_corr is used to calculate the loss value of the edge correction network; the network parameter gradient g(c) is computed by back propagation, and the network parameter c is then updated by gradient descent;
(4c4) Whether h = H holds is judged; if so, the trained edge correction network model Z is obtained; otherwise, h = h + 1 is set and step (4c2) is executed again.
6. The method of claim 1, wherein the video target segmentation model based on the image target edge correction segmentation result in step (5) is obtained specifically as follows: the intermediate feature maps extracted by the image reconstruction neural network R are fed into the side output edge detection network Q to obtain a target edge prediction map, and the target segmentation mask prediction map output by the image reconstruction neural network R together with the target edge prediction map output by the side output edge detection network Q is taken as the input of the edge correction network model Z.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210658263.6A CN114863348A (en) | 2022-06-10 | 2022-06-10 | Video target segmentation method based on self-supervision |
Publications (1)
Publication Number | Publication Date |
---|---|
CN114863348A true CN114863348A (en) | 2022-08-05 |
Family ID: 82624940
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210658263.6A Pending CN114863348A (en) | 2022-06-10 | 2022-06-10 | Video target segmentation method based on self-supervision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114863348A (en) |
2022-06-10: CN patent application CN202210658263.6A filed (status: Pending)
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116129353A (en) * | 2023-02-07 | 2023-05-16 | 佛山市顺德区福禄康电器科技有限公司 | Method and system for intelligent monitoring based on image recognition |
CN116129353B (en) * | 2023-02-07 | 2024-05-07 | 广州融赋数智技术服务有限公司 | Method and system for intelligent monitoring based on image recognition |
CN116563218A (en) * | 2023-03-31 | 2023-08-08 | 北京长木谷医疗科技股份有限公司 | Spine image segmentation method and device based on deep learning and electronic equipment |
CN116630697A (en) * | 2023-05-17 | 2023-08-22 | 安徽大学 | Image classification method based on biased selection pooling |
CN116630697B (en) * | 2023-05-17 | 2024-04-05 | 安徽大学 | Image classification method based on biased selection pooling |
CN117788492A (en) * | 2024-02-28 | 2024-03-29 | 苏州元脑智能科技有限公司 | Video object segmentation method, system, electronic device and storage medium |
CN117788492B (en) * | 2024-02-28 | 2024-04-26 | 苏州元脑智能科技有限公司 | Video object segmentation method, system, electronic device and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110135267B (en) | Large-scene SAR image fine target detection method | |
CN110910391B (en) | Video object segmentation method for dual-module neural network structure | |
CN114863348A (en) | Video target segmentation method based on self-supervision | |
CN109086811B (en) | Multi-label image classification method and device and electronic equipment | |
CN107491734B (en) | Semi-supervised polarimetric SAR image classification method based on multi-core fusion and space Wishart LapSVM | |
CN112668579A (en) | Weak supervision semantic segmentation method based on self-adaptive affinity and class distribution | |
CN114332578A (en) | Image anomaly detection model training method, image anomaly detection method and device | |
CN111428625A (en) | Traffic scene target detection method and system based on deep learning | |
CN113159048A (en) | Weak supervision semantic segmentation method based on deep learning | |
CN113780292A (en) | Semantic segmentation network model uncertainty quantification method based on evidence reasoning | |
CN112613350A (en) | High-resolution optical remote sensing image airplane target detection method based on deep neural network | |
CN114332473A (en) | Object detection method, object detection device, computer equipment, storage medium and program product | |
CN114973019A (en) | Deep learning-based geospatial information change detection classification method and system | |
CN114511785A (en) | Remote sensing image cloud detection method and system based on bottleneck attention module | |
CN114998360A (en) | Fat cell progenitor cell segmentation method based on SUnet algorithm | |
CN113344069B (en) | Image classification method for unsupervised visual representation learning based on multi-dimensional relation alignment | |
CN114580501A (en) | Bone marrow cell classification method, system, computer device and storage medium | |
CN114299291A (en) | Interpretable artificial intelligent medical image semantic segmentation method | |
Bagwari et al. | A comprehensive review on segmentation techniques for satellite images | |
CN116883432A (en) | Method and device for segmenting focus image, electronic equipment and readable storage medium | |
CN111611919A (en) | Road scene layout analysis method based on structured learning | |
CN117152427A (en) | Remote sensing image semantic segmentation method and system based on diffusion model and knowledge distillation | |
CN116580243A (en) | Cross-domain remote sensing scene classification method for mask image modeling guide domain adaptation | |
CN109726690B (en) | Multi-region description method for learner behavior image based on DenseCap network | |
CN113313185A (en) | Hyperspectral image classification method based on self-adaptive spatial spectral feature extraction |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |