CN114429607B - Transformer-based semi-supervised video object segmentation method


Info

Publication number
CN114429607B
Authority
CN
China
Prior art keywords
segmentation
module
video
layer
attention module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210098849.1A
Other languages
Chinese (zh)
Other versions
CN114429607A (en)
Inventor
阳春华
周玮
赵于前
张帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202210098849.1A priority Critical patent/CN114429607B/en
Publication of CN114429607A publication Critical patent/CN114429607A/en
Application granted granted Critical
Publication of CN114429607B publication Critical patent/CN114429607B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a Transformer-based semi-supervised video object segmentation method, implemented as follows: 1) acquiring a data set and segmentation labels; 2) expanding and processing the data; 3) constructing a segmentation model; 4) constructing a loss function; 5) training the segmentation model; 6) performing video object segmentation. The invention compresses spatio-temporal information by designing a space-time integration module, introduces a multi-scale layer to generate cross-scale input features, and constructs a dual-branch cross-attention module that attends to multiple characteristics of the target information. The method effectively improves the segmentation accuracy of small-scale targets and similar targets while reducing the computational cost.

Description

Transformer-based semi-supervised video object segmentation method
Technical Field
The invention relates to the technical field of image processing, in particular to a Transformer-based semi-supervised video object segmentation method.
Background
Video object segmentation is an important prerequisite for video understanding, with many potential applications such as video retrieval, video editing and autonomous driving. Given the segmentation of the target in the first frame of a video (i.e., the segmentation label), semi-supervised video object segmentation aims to segment that target in all other frames of the video sequence.
Because of the strong performance of the Transformer architecture in computer vision tasks such as image classification, object detection, semantic segmentation and object tracking, many studies now apply it to video object segmentation. The Transformer architecture has excellent long-range dependency modelling capability and can effectively mine the spatio-temporal information in a given video, thereby improving segmentation accuracy. However, most Transformer-based methods feed the features of all frames in the storage pool directly into the multi-head attention module, so the computational cost grows rapidly as the number of segmented frames increases; moreover, the classical Transformer architecture lacks an inherent inductive bias, which leads to poor segmentation accuracy on small-scale targets and similar targets.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art and provides a semi-supervised video target segmentation method based on a Transformer.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a semi-supervised video target segmentation method based on a transducer comprises the following steps:
(1) Acquiring a data set and segmentation labels:
acquiring a video object segmentation dataset, a static image dataset and the segmentation labels corresponding to the two datasets, and forming an image pair from each image in the datasets and its corresponding segmentation label;
(2) The data expansion and processing method specifically comprises the following steps:
(2-a) after normalizing each image pair formed from the static image dataset obtained in step (1) and its corresponding segmentation label, repeating the following procedure to obtain a synthesized video training sample corresponding to each image, wherein the set of synthesized video training samples forms a synthesized video training set:
I. the short side of the image pair is reduced to w pixels and the long side is reduced by the same ratio, and the resulting image pair is randomly cropped to h × w pixels, wherein w is the width of the cropped image, h is the height of the cropped image, and w and h are positive integers with value range [10, 3000];
II. random scaling, random horizontal flipping, random color jittering and random grayscale conversion are sequentially applied to the cropped image pair to obtain an enhanced image pair corresponding to the image pair;
III. process II is repeated three times to obtain three enhanced image pairs corresponding to the image pair, and the three enhanced image pairs form one synthesized video training sample (a minimal sketch of this pipeline is given below);
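The synthesized-video construction above maps naturally onto torchvision transforms. The following is a minimal, hedged sketch of steps I-III, assuming PIL inputs and a crop size of w = h = 384 (the value used in Example 1); the jitter strengths, grayscale probability and exact operation order are illustrative choices, not taken from the patent text.

```python
import random
import torchvision.transforms.functional as TF
from torchvision import transforms
from torchvision.transforms import InterpolationMode

W, H = 384, 384  # crop size; 384 x 384 follows Example 1

def resize_short_side(img, mask, w=W):
    """Step I: scale so the short side becomes w pixels, keeping the aspect ratio."""
    ow, oh = img.size
    scale = w / min(ow, oh)
    size = [round(oh * scale), round(ow * scale)]
    return (TF.resize(img, size),
            TF.resize(mask, size, interpolation=InterpolationMode.NEAREST))

def enhance(img, mask):
    """Step II: random scaling, shared crop, flip, color jitter, grayscale (illustrative parameters)."""
    s = random.uniform(0.9, 1.1)                                    # random scaling
    new_size = [max(H, round(img.size[1] * s)), max(W, round(img.size[0] * s))]
    img = TF.resize(img, new_size)
    mask = TF.resize(mask, new_size, interpolation=InterpolationMode.NEAREST)
    i, j, th, tw = transforms.RandomCrop.get_params(img, (H, W))    # same crop for image and label
    img, mask = TF.crop(img, i, j, th, tw), TF.crop(mask, i, j, th, tw)
    if random.random() < 0.5:                                       # random horizontal flip
        img, mask = TF.hflip(img), TF.hflip(mask)
    img = transforms.ColorJitter(0.3, 0.3, 0.3, 0.05)(img)          # random color jitter
    if random.random() < 0.1:                                       # random grayscale conversion
        img = TF.rgb_to_grayscale(img, num_output_channels=3)
    return img, mask

def make_synthetic_sample(image, label):
    """Steps I-III: one static image/label pair -> three enhanced pairs (one training sample)."""
    image, label = resize_short_side(image, label)
    return [enhance(image, label) for _ in range(3)]
```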
(2-b) after normalizing each video in the video object segmentation dataset obtained in step (1) and its corresponding segmentation labels, repeating the following procedure to obtain a real video training sample corresponding to each video, wherein the set of real video training samples forms a real video training set:
I. three image pairs are randomly extracted from the video and its corresponding segmentation labels;
II. the short sides of the three image pairs are reduced to w pixels and the long sides are reduced by the same ratio, and the resulting image pairs are randomly cropped to h × w pixels, wherein the meanings and values of w and h are the same as in step (2-a);
III. random cropping, color jittering and random grayscale conversion are sequentially applied to the three image pairs to obtain three enhanced image pairs, and the three enhanced image pairs form one real video training sample;
(3) The method for constructing the segmentation model specifically comprises the following steps:
(3-a) construction of a query encoder
using a convolutional neural network as the query encoder, and sequentially passing the frame to be segmented through the first four layers of the encoder, wherein the output of the second layer is f_C2, the output of the third layer is f_C3, and the output of the fourth layer is f_C;
(3-b) building a storage pool
the 1st frame, the (τ+1)-th frame, the (2τ+1)-th frame, ……, the (Nτ+1)-th frame of the video sequence and their corresponding segmentation labels are put into a storage pool, wherein τ is a positive integer with value range [1, 200], N is determined from the relative position τ_C of the frame to be segmented, and the symbol ⌊·⌋ denotes rounding down;
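A minimal sketch of the storage-pool bookkeeping described in step (3-b), assuming the pool simply keeps every τ-th frame together with its segmentation label; the class name and interface are illustrative, and τ = 5 follows the preferred value stated further below.

```python
import torch

class StoragePool:
    """Keeps frames 1, τ+1, 2τ+1, ... of a video and their segmentation labels."""
    def __init__(self, tau: int = 5):
        self.tau = tau
        self.entries = []            # list of (frame_tensor, label_tensor)

    def maybe_add(self, frame_index: int, frame: torch.Tensor, label: torch.Tensor):
        # frame_index is the 1-based relative position of the frame in the video
        if (frame_index - 1) % self.tau == 0:
            self.entries.append((frame, label))

    def as_batches(self):
        frames = torch.stack([f for f, _ in self.entries])   # (N+1, C, H, W)
        labels = torch.stack([l for _, l in self.entries])   # (N+1, 1, H, W)
        return frames, labels
```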
(3-c) construction of a storage encoder
using a convolutional neural network as the storage encoder, wherein all images in the storage pool and their corresponding segmentation labels pass through the encoder to obtain f_M;
(3-d) construction of a Transformer module
The module consists of a Transformer encoder and a Transformer decoder; the Transformer encoder comprises a space-time integration module, a convolution layer, a multi-scale layer and a self-attention module; the Transformer decoder comprises two convolution layers, a multi-scale layer, a self-attention module and a dual-branch cross-attention module, wherein the dual-branch cross-attention module consists of a query branch and a storage branch, the two branches have identical structures, each consisting of a multi-head attention module, two residual-and-layer-normalization modules and a fully connected feed-forward network, and the fully connected feed-forward network consists of two linear layers and a ReLU activation layer; the self-attention module in the Transformer encoder and the self-attention module in the Transformer decoder have identical structures, each consisting of a multi-head attention module and a residual-and-layer-normalization module; all multi-head attention modules have the same structure;
(3-e) construction of a segmentation decoder
The segmentation decoder consists of a residual module, two upsampling modules and a prediction convolution module; the residual module consists of three convolution layers and two ReLU activation layers, the first upsampling module consists of four convolution layers, two ReLU activation layers and a bilinear interpolation, the second upsampling module consists of three convolution layers, two ReLU activation layers and a bilinear interpolation, and the prediction convolution module consists of one convolution layer and a bilinear interpolation;
(3-f) inputting f_C obtained in step (3-a) and f_M obtained in step (3-c) into the space-time integration module of the Transformer encoder constructed in step (3-d) to obtain f_M', wherein the specific calculation process is as follows:
f_M' = f_M · softmax(c(ConvKey(f_M), ConvKey(f_C)))
wherein ConvKey(·) is a key projection layer consisting of one convolution layer, c(·) denotes the negative squared Euclidean distance, and softmax(·) denotes the activation function;
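A hedged PyTorch sketch of the space-time integration step above: the memory features f_M are aggregated to the spatial size of the query frame through a softmax-normalised negative-squared-Euclidean affinity between conv-projected keys. The 64-channel, 3×3 key projection follows Example 1; the feature channel widths and the tensor layout are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpaceTimeIntegration(nn.Module):
    """f_M' = f_M · softmax(c(ConvKey(f_M), ConvKey(f_C))), c = negative squared Euclidean distance."""
    def __init__(self, mem_dim=512, query_dim=1024, key_dim=64):
        super().__init__()
        self.key_m = nn.Conv2d(mem_dim, key_dim, kernel_size=3, padding=1)    # ConvKey for memory
        self.key_q = nn.Conv2d(query_dim, key_dim, kernel_size=3, padding=1)  # ConvKey for query

    def forward(self, f_m, f_c):
        # f_m: (T, Cm, H, W) memory features; f_c: (1, Cq, H, W) query-frame features
        T, Cm, H, W = f_m.shape
        k_m = self.key_m(f_m).flatten(2).permute(1, 0, 2).flatten(1)   # (Ck, T*H*W)
        k_q = self.key_q(f_c).flatten(2).squeeze(0)                    # (Ck, H*W)
        # squared Euclidean distance between every memory key and every query key
        dist = ((k_m * k_m).sum(0).unsqueeze(1)        # (T*H*W, 1)
                - 2 * k_m.t() @ k_q                    # (T*H*W, H*W)
                + (k_q * k_q).sum(0).unsqueeze(0))     # (1, H*W)
        affinity = F.softmax(-dist, dim=0)             # softmax of the negative distance over memory positions
        v_m = f_m.permute(1, 0, 2, 3).reshape(Cm, T * H * W)           # memory values
        return (v_m @ affinity).view(1, Cm, H, W)      # compressed memory features f_M'
```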
(3-g) inputting f_M' obtained in step (3-f) sequentially into the convolution layer, the multi-scale layer and the self-attention module of the Transformer encoder to obtain M_1, M_2 and M_3, respectively; inputting f_C obtained in step (3-a) sequentially into the first convolution layer, the second convolution layer, the multi-scale layer and the self-attention module of the Transformer decoder constructed in step (3-d) to obtain C_0, C_1, C_2 and C_3, respectively; inputting M_3 and C_3 into the query branch of the dual-branch cross-attention module of the Transformer decoder constructed in step (3-d) to obtain C_4; and inputting M_3 and C_3 into the storage branch of the dual-branch cross-attention module of the Transformer decoder constructed in step (3-d) to obtain M_4;
(3-h) splicing C_0, C_4 and M_4 obtained in step (3-g) with f_M' obtained in step (3-f) and inputting the result into the residual module of the segmentation decoder constructed in step (3-e) to obtain f_O1; inputting f_O1 and f_C3 obtained in step (3-a) into the first upsampling module of the segmentation decoder constructed in step (3-e) to obtain f_O2; inputting f_O2 and f_C2 obtained in step (3-a) into the second upsampling module of the segmentation decoder constructed in step (3-e) to obtain f_O3; and inputting f_O3 into the prediction convolution module of the segmentation decoder constructed in step (3-e) to obtain the predicted segmentation label f_O4, completing the construction of the segmentation model;
(4) Constructing a loss function:
the loss function L uses cross entropy loss, defined as follows:
L = -(1/(H_Y·W_Y)) Σ_{i=1}^{H_Y} Σ_{j=1}^{W_Y} [ Y_ij·log(Ŷ_ij) + (1 - Y_ij)·log(1 - Ŷ_ij) ]
where Y is the real segmentation label, Ŷ is the predicted segmentation label, H_Y and W_Y are respectively the height and width of the real segmentation label, Y_ij is the pixel value of the pixel in the i-th row and j-th column of Y, Ŷ_ij is the pixel value of the pixel in the i-th row and j-th column of Ŷ, i = 1, 2, …, H_Y, j = 1, 2, …, W_Y, and log(·) denotes the natural logarithm;
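A hedged sketch of the loss in step (4): per-pixel binary cross entropy between the predicted label and the ground-truth label, averaged over the H_Y × W_Y pixels as in the formula above. The clamping for numerical stability and the tensor shapes are implementation assumptions.

```python
import torch

def segmentation_loss(y_pred: torch.Tensor, y_true: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """y_pred, y_true: (H_Y, W_Y) tensors with values in [0, 1]."""
    y_pred = y_pred.clamp(eps, 1.0 - eps)   # numerical safety; not part of the formula itself
    loss = -(y_true * torch.log(y_pred) + (1.0 - y_true) * torch.log(1.0 - y_pred))
    return loss.mean()                       # average over all H_Y * W_Y pixels
```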
(5) Training a segmentation model:
training the segmentation model constructed in the step (3) by utilizing the synthesized video training set obtained in the step (2-a), obtaining a loss value according to the loss function constructed in the step (4), and updating parameters in the segmentation model by using a random gradient descent method to obtain a pre-training model; training the pre-training model by using the real video training set obtained in the step (2-b), obtaining a loss value according to the loss function constructed in the step (4), and updating parameters in the segmentation model by using a random gradient descent method to obtain a trained segmentation model;
(6) Video object segmentation:
acquiring a video to be segmented and the segmentation label corresponding to its first frame, sequentially inputting the video, starting from its second frame, into the trained segmentation model obtained in step (5), and outputting the segmentation labels.
The multi-scale layer in the step (3-d) consists of t convolution layers with different convolution kernel sizes, and the output calculation process is as follows:
MSL = Concat(S_1, …, S_i, …, S_t)
S_i = Conv_i(X_i; r_i)
where MSL denotes the output of the multi-scale layer, Concat(·) denotes the concatenation operation, t denotes the number of convolution layers in the multi-scale layer, t is a positive integer with value range [1, 50], Conv_i(·) denotes the i-th convolution layer of the multi-scale layer, r_i denotes the convolution kernel size of the i-th convolution layer, r_i is a positive integer with value range [1, 100], X_i denotes the input of the i-th convolution layer of the multi-scale layer, and S_i denotes the output of the i-th convolution layer; for the multi-scale layer in the Transformer encoder, X_i = M_1; for the multi-scale layer in the Transformer decoder, X_i = C_1, i = 1, 2, …, t.
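A hedged sketch of the multi-scale layer: t parallel convolutions with different kernel sizes act on the same input, and their outputs are concatenated along the channel dimension, matching MSL = Concat(S_1, …, S_t) with S_i = Conv_i(X_i; r_i). The setting t = 2 with kernel sizes 2 and 4 follows the preferred values given later in the text; the channel counts and the cropping used to align outputs of even-sized kernels are assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleLayer(nn.Module):
    """MSL = Concat(S_1, ..., S_t), S_i = Conv_i(X; r_i)."""
    def __init__(self, in_channels=256, out_channels_each=128, kernel_sizes=(2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_channels, out_channels_each, kernel_size=r, padding=r // 2)
            for r in kernel_sizes
        )

    def forward(self, x):
        outs = [branch(x) for branch in self.branches]
        # even kernel sizes shift the output grid by one pixel, so crop to a common size
        h = min(o.shape[-2] for o in outs)
        w = min(o.shape[-1] for o in outs)
        return torch.cat([o[..., :h, :w] for o in outs], dim=1)
```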
The output of the multi-head attention module in step (3-d) is calculated as:
MultiHead(q, k, v) = Concat(A_1, …, A_i, …, A_s)·U_o
A_i = softmax((q·U_i^q)·(k·U_i^k)^T / √d)·(v·U_i^v)
where MultiHead(q, k, v) denotes the output of the multi-head attention module, softmax(·) denotes the activation function, and Concat(·) denotes the concatenation operation; q, k and v are the inputs of the multi-head attention module: for the multi-head attention module inside the self-attention module of the Transformer encoder, q = k = v = M_2; for the multi-head attention module inside the self-attention module of the Transformer decoder, q = k = v = C_2; for the multi-head attention module inside the storage branch of the dual-branch cross-attention module, q = M_3 and k = v = C_3; for the multi-head attention module inside the query branch, q = C_3 and k = v = M_3; s is the number of attention heads in the multi-head attention module, s is a positive integer with value range [1, 16]; A_i denotes the output of the i-th attention head, i = 1, 2, …, s; U_i^q, U_i^k and U_i^v denote the q, k and v parameter matrices of the i-th attention head, and U_o is a parameter matrix used to adjust the final output; T denotes the transpose operator; d is a hyperparameter, a positive integer with value range [1, 1000].
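The attention computation itself is standard multi-head attention; what the text fixes is which features feed q, k and v at each of the four places it is used. The sketch below illustrates that wiring with torch.nn.MultiheadAttention, using s = 8 heads (the preferred value below) and a 256-dimensional embedding as in Example 1. Reusing a single attention instance on random feature tensors is purely for brevity; in the full model each module has its own parameters, residual connections and layer normalisation.

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 256, 8
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)   # stand-in for every instance

def mha(q, k, v):
    out, _ = attn(q, k, v)      # (B, N, embed_dim)
    return out

B, N = 1, 24 * 24               # illustrative token count (H*W spatial positions)
M2 = torch.randn(B, N, embed_dim)   # encoder features after its multi-scale layer
C2 = torch.randn(B, N, embed_dim)   # decoder features after its multi-scale layer

M3 = mha(M2, M2, M2)            # encoder self-attention:         q = k = v = M2
C3 = mha(C2, C2, C2)            # decoder self-attention:         q = k = v = C2
M4 = mha(M3, C3, C3)            # storage branch cross-attention: q = M3, k = v = C3
C4 = mha(C3, M3, M3)            # query branch cross-attention:   q = C3, k = v = M3
```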
In step (3-b), τ is preferably 5.
In step (3-d), the number of convolution layers t in the multi-scale layer is preferably 2, the number of attention heads s in the multi-head attention module is preferably 8, and the super-parameter d is preferably 32.
Compared with the prior art, the invention has the following advantages:
(1) The space-time integration module provided by the invention can integrate and compress space-time information and reduce the calculation cost.
(2) The multi-scale layer introduced by the invention provides cross-scale input features for the Transformer, which helps the network learn scale-invariant feature representations and improves the segmentation accuracy for small-scale targets.
(3) The two branches of the dual-branch cross-attention module constructed by the invention attend to different characteristics of the target information, which improves the segmentation accuracy on target details.
Drawings
FIG. 1 is a flowchart of a Transformer-based semi-supervised video object segmentation method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a segmentation model according to an embodiment of the present invention;
FIG. 3 is a structural diagram of a Transformer module according to an embodiment of the present invention;
FIG. 4 compares the segmentation results of an embodiment of the present invention on a small-scale target with the segmentation results of other methods;
FIG. 5 compares the segmentation results of an embodiment of the present invention on similar targets with the segmentation results of other methods.
Detailed Description
The following describes specific embodiments of the present invention;
example 1
Fig. 1 is a flowchart of a Transformer-based semi-supervised video object segmentation method according to an embodiment of the present invention, which specifically includes the following steps:
step 1, acquiring a data set and a segmentation label
Acquire a video object segmentation dataset, a static image dataset and the segmentation labels corresponding to the two datasets, and form an image pair from each image in the datasets and its corresponding segmentation label.
Step 2, data expansion and processing
(2-a) after normalizing each image pair formed from the static image dataset obtained in step (1) and its corresponding segmentation label, repeating the following procedure to obtain a synthesized video training sample corresponding to each image, wherein the set of synthesized video training samples forms a synthesized video training set:
I. the short side of the image pair is reduced to w pixels and the long side is reduced by the same ratio, and the resulting image pair is randomly cropped to h × w pixels, wherein w is the width of the cropped image, h is the height of the cropped image, and w and h are positive integers with value range [10, 3000]; in this embodiment, w is 384 and h is 384;
II. random scaling, random horizontal flipping, random color jittering and random grayscale conversion are sequentially applied to the cropped image pair to obtain an enhanced image pair corresponding to the image pair;
III. process II is repeated three times to obtain three enhanced image pairs corresponding to the image pair, and the three enhanced image pairs form one synthesized video training sample;
(2-b) after normalizing each video in the video object segmentation dataset obtained in step (1) and its corresponding segmentation labels, repeating the following procedure to obtain a real video training sample corresponding to each video, wherein the set of real video training samples forms a real video training set:
I. three image pairs are randomly extracted from the video and its corresponding segmentation labels;
II. the short sides of the three image pairs are reduced to w pixels and the long sides are reduced by the same ratio, and the resulting image pairs are randomly cropped to h × w pixels, wherein the meanings and values of w and h are the same as in step (2-a);
III. random cropping, color jittering and random grayscale conversion are sequentially applied to the three image pairs to obtain three enhanced image pairs, and the three enhanced image pairs form one real video training sample.
Step 3, constructing a segmentation model
Fig. 2 is a structural diagram of the segmentation model according to an embodiment of the present invention; the construction specifically includes the following steps:
(3-a) construction of a query encoder
Using the convolutional neural network ResNet50 as the query encoder, the frame to be segmented passes through the first four layers of the encoder in turn, where the output of the second layer is f_C2, the output of the third layer is f_C3, and the output of the fourth layer is f_C;
(3-b) building a storage pool
the 1st frame, the (τ+1)-th frame, the (2τ+1)-th frame, ……, the (Nτ+1)-th frame of the video sequence and their corresponding segmentation labels are put into a storage pool, wherein τ is a positive integer with value range [1, 200], N is determined from the relative position τ_C of the frame to be segmented, and the symbol ⌊·⌋ denotes rounding down; this embodiment takes τ as 5;
(3-c) construction of a storage encoder
Using the convolutional neural network ResNet18 as the storage encoder, all images in the storage pool and their corresponding segmentation labels pass through the encoder to obtain f_M;
(3-d) construction of a Transformer module
Fig. 3 is a structural diagram of the Transformer module according to an embodiment of the present invention. The module consists of a Transformer encoder and a Transformer decoder; the Transformer encoder comprises a space-time integration module, a convolution layer with 256 convolution kernels of size 1, a multi-scale layer and a self-attention module; the Transformer decoder comprises a convolution layer with 512 convolution kernels of size 3, a convolution layer with 256 convolution kernels of size 1, a multi-scale layer, a self-attention module and a dual-branch cross-attention module, wherein the dual-branch cross-attention module consists of a query branch and a storage branch, the two branches have identical structures, each consisting of a multi-head attention module, two residual-and-layer-normalization modules and a fully connected feed-forward network, and the fully connected feed-forward network consists of two linear layers with input dimension 256 and hidden layer size 2048 and a ReLU activation layer; the self-attention module in the Transformer encoder and the self-attention module in the Transformer decoder have identical structures, each consisting of a multi-head attention module and a residual-and-layer-normalization module; all multi-head attention modules have the same structure;
The multi-scale layer in step (3-d) consists of t convolution layers with different convolution kernel sizes, and its output is calculated as:
MSL = Concat(S_1, …, S_i, …, S_t)
S_i = Conv_i(X_i; r_i)
where MSL denotes the output of the multi-scale layer, Concat(·) denotes the concatenation operation, t denotes the number of convolution layers in the multi-scale layer, t is a positive integer with value range [1, 50] and is taken as 2 in this embodiment, Conv_i(·) denotes the i-th convolution layer of the multi-scale layer, r_i denotes the convolution kernel size of the i-th convolution layer, r_i is a positive integer with value range [1, 100] and this embodiment takes r_1 as 2 and r_2 as 4, X_i denotes the input of the i-th convolution layer of the multi-scale layer, and S_i denotes the output of the i-th convolution layer; for the multi-scale layer in the Transformer encoder, X_i = M_1; for the multi-scale layer in the Transformer decoder, X_i = C_1, i = 1, 2, …, t.
The output of the multi-head attention module in step (3-d) is calculated as:
MultiHead(q, k, v) = Concat(A_1, …, A_i, …, A_s)·U_o
A_i = softmax((q·U_i^q)·(k·U_i^k)^T / √d)·(v·U_i^v)
where MultiHead(q, k, v) denotes the output of the multi-head attention module, softmax(·) denotes the activation function, and Concat(·) denotes the concatenation operation; q, k and v are the inputs of the multi-head attention module: for the multi-head attention module inside the self-attention module of the Transformer encoder, q = k = v = M_2; for the multi-head attention module inside the self-attention module of the Transformer decoder, q = k = v = C_2; for the multi-head attention module inside the storage branch of the dual-branch cross-attention module, q = M_3 and k = v = C_3; for the multi-head attention module inside the query branch, q = C_3 and k = v = M_3; s is the number of attention heads in the multi-head attention module, s is a positive integer with value range [1, 16] and is taken as 8 in this embodiment; A_i denotes the output of the i-th attention head, i = 1, 2, …, s; U_i^q, U_i^k and U_i^v denote the q, k and v parameter matrices of the i-th attention head, and U_o is a parameter matrix used to adjust the final output; T denotes the transpose operator; d is a hyperparameter, a positive integer with value range [1, 1000], taken as 32 in this embodiment;
(3-e) construction of a partition decoder
The segmentation decoder consists of a residual module, two upsampling modules and a prediction convolution module; the residual module consists of three convolution layers with 512 convolution kernels of size 3 and two ReLU activation layers, the first upsampling module consists of one convolution layer with 512 convolution kernels of size 3, three convolution layers with 256 convolution kernels of size 3, two ReLU activation layers and a bilinear interpolation with an expansion factor of 2, the second upsampling module consists of three convolution layers with 256 convolution kernels of size 3, two ReLU activation layers and a bilinear interpolation with an expansion factor of 2, and the prediction convolution module consists of one convolution layer with 256 convolution kernels and a bilinear interpolation with an expansion factor of 2;
(3-f) inputting f_C obtained in step (3-a) and f_M obtained in step (3-c) into the space-time integration module of the Transformer encoder constructed in step (3-d) to obtain f_M', wherein the specific calculation process is as follows:
f_M' = f_M · softmax(c(ConvKey(f_M), ConvKey(f_C)))
wherein ConvKey(·) is a key projection layer consisting of one convolution layer with 64 convolution kernels of size 3, c(·) denotes the negative squared Euclidean distance, and softmax(·) denotes the activation function;
(3-g) inputting f_M' obtained in step (3-f) sequentially into the convolution layer, the multi-scale layer and the self-attention module of the Transformer encoder to obtain M_1, M_2 and M_3, respectively; inputting f_C obtained in step (3-a) sequentially into the first convolution layer, the second convolution layer, the multi-scale layer and the self-attention module of the Transformer decoder constructed in step (3-d) to obtain C_0, C_1, C_2 and C_3, respectively; inputting M_3 and C_3 into the query branch of the dual-branch cross-attention module of the Transformer decoder constructed in step (3-d) to obtain C_4; and inputting M_3 and C_3 into the storage branch of the dual-branch cross-attention module of the Transformer decoder constructed in step (3-d) to obtain M_4;
(3-h) splicing C_0, C_4 and M_4 obtained in step (3-g) with f_M' obtained in step (3-f) and inputting the result into the residual module of the segmentation decoder constructed in step (3-e) to obtain f_O1; inputting f_O1 and f_C3 obtained in step (3-a) into the first upsampling module of the segmentation decoder constructed in step (3-e) to obtain f_O2; inputting f_O2 and f_C2 obtained in step (3-a) into the second upsampling module of the segmentation decoder constructed in step (3-e) to obtain f_O3; and inputting f_O3 into the prediction convolution module of the segmentation decoder constructed in step (3-e) to obtain the predicted segmentation label f_O4, completing the construction of the segmentation model.
(4) Constructing a loss function:
the loss function L uses cross entropy loss, defined as follows:
L = -(1/(H_Y·W_Y)) Σ_{i=1}^{H_Y} Σ_{j=1}^{W_Y} [ Y_ij·log(Ŷ_ij) + (1 - Y_ij)·log(1 - Ŷ_ij) ]
where Y is the real segmentation label, Ŷ is the predicted segmentation label, H_Y and W_Y are respectively the height and width of the real segmentation label, Y_ij is the pixel value of the pixel in the i-th row and j-th column of Y, Ŷ_ij is the pixel value of the pixel in the i-th row and j-th column of Ŷ, i = 1, 2, …, H_Y, j = 1, 2, …, W_Y, and log(·) denotes the natural logarithm.
(5) Training a segmentation model:
training the segmentation model constructed in the step (3) by utilizing the synthesized video training set obtained in the step (2-a), obtaining a loss value according to the loss function constructed in the step (4), and updating parameters in the segmentation model by using a random gradient descent method to obtain a pre-training model; training the pre-training model by using the real video training set obtained in the step (2-b), obtaining a loss value according to the loss function constructed in the step (4), and updating parameters in the segmentation model by using a random gradient descent method to obtain a trained segmentation model.
(6) Video object segmentation:
Acquire the video to be segmented and the segmentation label corresponding to its first frame, sequentially input the video, starting from its second frame, into the trained segmentation model obtained in step (5), and output the segmentation labels.
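A hedged sketch of the inference procedure in step (6): the first frame and its given label seed the storage pool, the remaining frames are segmented in order, and every τ-th frame is written back into the pool together with its prediction. Here `model` stands for the trained segmentation model built in step (3); its exact call signature is an assumption made for illustration.

```python
import torch

@torch.no_grad()
def segment_video(model, frames, first_label, tau=5):
    """frames: list of frame tensors; first_label: segmentation label of frames[0]."""
    pool = [(frames[0], first_label)]                  # storage pool seeded with frame 1
    outputs = [first_label]
    for t, frame in enumerate(frames[1:], start=2):    # segment from the second frame onward
        mem_frames = torch.stack([f for f, _ in pool])
        mem_labels = torch.stack([l for _, l in pool])
        pred = model(frame, mem_frames, mem_labels)    # predicted segmentation label
        outputs.append(pred)
        if (t - 1) % tau == 0:                         # keep frames 1, τ+1, 2τ+1, ...
            pool.append((frame, pred))
    return outputs
```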
Example 2
Video object segmentation experiments were performed on the public datasets YoutubeVOS2018, YoutubeVOS2019, DAVIS2016 and DAVIS2017 using the method of Example 1. The operating system was Linux Ubuntu 18.04; the method was implemented with the PyTorch 1.8.1 framework based on CUDA 10.1 and cuDNN 7.6.5, and two NVIDIA 2080Ti 11G GPUs were used for training and testing.
This embodiment uses the region similarity J, the contour accuracy F and their average J&F to evaluate the performance of the invention. The region similarity J is the average intersection-over-union between the predicted labels and the corresponding real labels:
J = |M ∩ G| / |M ∪ G|
where M is the predicted segmentation label, G is the real segmentation label, and the symbols ∩ and ∪ denote the intersection and union of two sets, respectively.
The contour precision F represents the average boundary similarity between the boundary of the estimated label and the real label boundary, and the calculation formula is as follows:
wherein P is c Is the accuracy between l (M) and l (G), R c Is the recall rate between l (M) and l (G), P c And R is c Obtained using bipartite graph matching (bipartite graph matching) calculation; l (M) represents the set of closed contours within the scope of the predictive segmentation label M, and l (G) represents the set of closed contours within the scope of the true segmentation label G.
Tables 1, 2, 3 and 4 compare the J, F and J&F scores of the method of the present invention with those of other methods on the test sets of YoutubeVOS2018, YoutubeVOS2019, DAVIS2016 and DAVIS2017, respectively; the method of the present invention achieves the highest J&F score on all four datasets.
FIG. 4 compares the segmentation results of an embodiment of the present invention on a small-scale target with those of other methods. The first, second, third and fourth rows show the segmentation results of TransVOS, CFBI+, STCN and the method of the present invention, respectively, on selected frames of video d79729a354 in YoutubeVOS2018; the number in the upper-left corner of each image is the frame number of the image in the video sequence, the image in each dashed box is an enlarged view of the small region in the corresponding solid box, and the first column is the first frame of the video with its real segmentation label. From the first row, second column, it can be seen that TransVOS is disturbed by the background and mistakes clutter for the distant person; from the second row, second and third columns, CFBI+ fails to segment the person's legs and arms correctly; from the third row, second, third, fourth and fifth columns, STCN confuses the distant person with vehicles in the background. From the fourth row, the method of the present invention segments the small-scale target accurately, and its result is superior to those of the other methods.
FIG. 5 compares the segmentation results of an embodiment of the present invention on similar targets with those of other methods. The first, second, third and fourth rows show the segmentation results of TransVOS, CFBI+, STCN and the method of the present invention, respectively, on selected frames of video 5d2020eff8 in YoutubeVOS2018; the number in the upper-left corner of each image is the frame number of the image in the video sequence, the targets marked with solid boxes are erroneous segmentation results, and the first column is the first frame of the video with its real segmentation label. It can be seen that STCN, CFBI+ and TransVOS confuse the fish to be segmented with similar-looking fish in the background, whereas the method of the present invention successfully distinguishes the segmentation target from its look-alikes in the background.
The above embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention, so variations according to the principles of the present invention should be covered.
TABLE 1
(Note: in the table, J_S and F_S denote the values of J and F for categories that appear in both the YoutubeVOS2018 training set and test set, and J_U and F_U denote the values of J and F for categories that appear only in the YoutubeVOS2018 test set.)
TABLE 2
(Note: in the table, J_S and F_S denote the values of J and F for categories that appear in both the YoutubeVOS2019 training set and test set, and J_U and F_U denote the values of J and F for categories that appear only in the YoutubeVOS2019 test set.)
TABLE 3
TABLE 4

Claims (3)

1. A Transformer-based semi-supervised video object segmentation method, characterized by comprising the following steps:
(1) Acquiring a data set and segmentation labels:
acquiring a video object segmentation dataset, a static image dataset and the segmentation labels corresponding to the two datasets, and forming an image pair from each image in the datasets and its corresponding segmentation label;
(2) The data expansion and processing method specifically comprises the following steps:
(2-a) after normalizing each image pair formed from the static image dataset obtained in step (1) and its corresponding segmentation label, repeating the following procedure to obtain a synthesized video training sample corresponding to each image, wherein the set of synthesized video training samples forms a synthesized video training set:
I. the short side of the image pair is reduced to w pixels and the long side is reduced by the same ratio, and the resulting image pair is randomly cropped to h × w pixels, wherein w is the width of the cropped image, h is the height of the cropped image, and w and h are positive integers with value range [10, 3000];
II. random scaling, random horizontal flipping, random color jittering and random grayscale conversion are sequentially applied to the cropped image pair to obtain an enhanced image pair corresponding to the image pair;
III. process II is repeated three times to obtain three enhanced image pairs corresponding to the image pair, and the three enhanced image pairs form one synthesized video training sample;
(2-b) after normalizing each video in the video object segmentation dataset obtained in step (1) and its corresponding segmentation labels, repeating the following procedure to obtain a real video training sample corresponding to each video, wherein the set of real video training samples forms a real video training set:
I. three image pairs are randomly extracted from the video and its corresponding segmentation labels;
II. the short sides of the three image pairs are reduced to w pixels and the long sides are reduced by the same ratio, and the resulting image pairs are randomly cropped to h × w pixels, wherein the meanings and values of w and h are the same as in step (2-a);
III. random cropping, color jittering and random grayscale conversion are sequentially applied to the three image pairs to obtain three enhanced image pairs, and the three enhanced image pairs form one real video training sample;
(3) The method for constructing the segmentation model specifically comprises the following steps:
(3-a) construction of a query encoder
using a convolutional neural network as the query encoder, and sequentially passing the frame to be segmented through the first four layers of the encoder, wherein the output of the second layer is f_C2, the output of the third layer is f_C3, and the output of the fourth layer is f_C;
(3-b) building a storage pool
the 1st frame, the (τ+1)-th frame, the (2τ+1)-th frame, ……, the (Nτ+1)-th frame of the video sequence and their corresponding segmentation labels are put into a storage pool, wherein τ is a positive integer with value range [1, 200], N is determined from the relative position τ_C of the frame to be segmented, and the symbol ⌊·⌋ denotes rounding down;
(3-c) construction of a storage encoder
using a convolutional neural network as the storage encoder, wherein all images in the storage pool and their corresponding segmentation labels pass through the encoder to obtain f_M;
(3-d) construction of a Transformer module
The module consists of a Transformer encoder and a Transformer decoder; the Transformer encoder comprises a space-time integration module, a convolution layer, a multi-scale layer and a self-attention module; the Transformer decoder comprises two convolution layers, a multi-scale layer, a self-attention module and a dual-branch cross-attention module, wherein the dual-branch cross-attention module consists of a query branch and a storage branch, the two branches have identical structures, each consisting of a multi-head attention module, two residual-and-layer-normalization modules and a fully connected feed-forward network, and the fully connected feed-forward network consists of two linear layers and a ReLU activation layer; the self-attention module in the Transformer encoder and the self-attention module in the Transformer decoder have identical structures, each consisting of a multi-head attention module and a residual-and-layer-normalization module; all multi-head attention modules have the same structure;
(3-e) construction of a segmentation decoder
The segmentation decoder consists of a residual module, two groups of upsampling modules and a prediction convolution module; the residual module consists of three convolution layers and two ReLU activation layers, the first upsampling module consists of four convolution layers, two ReLU activation layers and a bilinear interpolation, the second upsampling module consists of three convolution layers, two ReLU activation layers and a bilinear interpolation, and the prediction convolution module consists of one convolution layer and a bilinear interpolation;
(3-f) inputting f_C obtained in step (3-a) and f_M obtained in step (3-c) into the space-time integration module of the Transformer encoder constructed in step (3-d) to obtain f_M', wherein the specific calculation process is as follows:
f_M' = f_M · softmax(c(ConvKey(f_M), ConvKey(f_C)))
wherein ConvKey(·) is a key projection layer consisting of one convolution layer, c(·) denotes the negative squared Euclidean distance, and softmax(·) denotes the activation function;
(3-g) inputting f_M' obtained in step (3-f) sequentially into the convolution layer, the multi-scale layer and the self-attention module of the Transformer encoder to obtain M_1, M_2 and M_3, respectively; inputting f_C obtained in step (3-a) sequentially into the first convolution layer, the second convolution layer, the multi-scale layer and the self-attention module of the Transformer decoder constructed in step (3-d) to obtain C_0, C_1, C_2 and C_3, respectively; inputting M_3 and C_3 into the query branch of the dual-branch cross-attention module of the Transformer decoder constructed in step (3-d) to obtain C_4; and inputting M_3 and C_3 into the storage branch of the dual-branch cross-attention module of the Transformer decoder constructed in step (3-d) to obtain M_4;
(3-h) splicing C_0, C_4 and M_4 obtained in step (3-g) with f_M' obtained in step (3-f) and inputting the result into the residual module of the segmentation decoder constructed in step (3-e) to obtain f_O1; inputting f_O1 and f_C3 obtained in step (3-a) into the first upsampling module of the segmentation decoder constructed in step (3-e) to obtain f_O2; inputting f_O2 and f_C2 obtained in step (3-a) into the second upsampling module of the segmentation decoder constructed in step (3-e) to obtain f_O3; and inputting f_O3 into the prediction convolution module of the segmentation decoder constructed in step (3-e) to obtain the predicted segmentation label f_O4, completing the construction of the segmentation model;
(4) Constructing a loss function:
the loss function L uses cross entropy loss, defined as follows:
L = -(1/(H_Y·W_Y)) Σ_{i=1}^{H_Y} Σ_{j=1}^{W_Y} [ Y_ij·log(Ŷ_ij) + (1 - Y_ij)·log(1 - Ŷ_ij) ]
where Y is the real segmentation label, Ŷ is the predicted segmentation label, H_Y and W_Y are respectively the height and width of the real segmentation label, Y_ij is the pixel value of the pixel in the i-th row and j-th column of Y, Ŷ_ij is the pixel value of the pixel in the i-th row and j-th column of Ŷ, i = 1, 2, …, H_Y, j = 1, 2, …, W_Y, and log(·) denotes the natural logarithm;
(5) Training a segmentation model:
training the segmentation model constructed in the step (3) by utilizing the synthesized video training set obtained in the step (2-a), obtaining a loss value according to the loss function constructed in the step (4), and updating parameters in the segmentation model by using a random gradient descent method to obtain a pre-training model; training the pre-training model by using the real video training set obtained in the step (2-b), obtaining a loss value according to the loss function constructed in the step (4), and updating parameters in the segmentation model by using a random gradient descent method to obtain a trained segmentation model;
(6) Video object segmentation:
acquiring a video to be segmented and the segmentation label corresponding to its first frame, sequentially inputting the video, starting from its second frame, into the trained segmentation model obtained in step (5), and outputting the segmentation labels.
2. The Transformer-based semi-supervised video object segmentation method according to claim 1, wherein the multi-scale layer in step (3-d) consists of t convolution layers with different convolution kernel sizes, and its output is calculated as:
MSL = Concat(S_1, …, S_i, …, S_t)
S_i = Conv_i(X_i; r_i)
where MSL denotes the output of the multi-scale layer, Concat(·) denotes the concatenation operation, t denotes the number of convolution layers in the multi-scale layer, t is a positive integer with value range [1, 50], Conv_i(·) denotes the i-th convolution layer of the multi-scale layer, r_i denotes the convolution kernel size of the i-th convolution layer, r_i is a positive integer with value range [1, 100], X_i denotes the input of the i-th convolution layer of the multi-scale layer, and S_i denotes the output of the i-th convolution layer; for the multi-scale layer in the Transformer encoder, X_i = M_1; for the multi-scale layer in the Transformer decoder, X_i = C_1, i = 1, 2, …, t.
3. The Transformer-based semi-supervised video object segmentation method according to claim 1, wherein the output of the multi-head attention module in step (3-d) is calculated as:
MultiHead(q, k, v) = Concat(A_1, …, A_i, …, A_s)·U_o
A_i = softmax((q·U_i^q)·(k·U_i^k)^T / √d)·(v·U_i^v)
where MultiHead(q, k, v) denotes the output of the multi-head attention module, softmax(·) denotes the activation function, and Concat(·) denotes the concatenation operation; q, k and v are the inputs of the multi-head attention module: for the multi-head attention module inside the self-attention module of the Transformer encoder, q = k = v = M_2; for the multi-head attention module inside the self-attention module of the Transformer decoder, q = k = v = C_2; for the multi-head attention module inside the storage branch of the dual-branch cross-attention module, q = M_3 and k = v = C_3; for the multi-head attention module inside the query branch, q = C_3 and k = v = M_3; s is the number of attention heads in the multi-head attention module, s is a positive integer with value range [1, 16]; A_i denotes the output of the i-th attention head, i = 1, 2, …, s; U_i^q, U_i^k and U_i^v denote the q, k and v parameter matrices of the i-th attention head, and U_o is a parameter matrix used to adjust the final output; T denotes the transpose operator; d is a hyperparameter, a positive integer with value range [1, 1000].
CN202210098849.1A 2022-01-24 2022-01-24 Transformer-based semi-supervised video object segmentation method Active CN114429607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210098849.1A CN114429607B (en) 2022-01-24 2022-01-24 Transformer-based semi-supervised video object segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210098849.1A CN114429607B (en) 2022-01-24 2022-01-24 Transformer-based semi-supervised video object segmentation method

Publications (2)

Publication Number Publication Date
CN114429607A CN114429607A (en) 2022-05-03
CN114429607B true CN114429607B (en) 2024-03-29

Family

ID=81313102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210098849.1A Active CN114429607B (en) 2022-01-24 2022-01-24 Transformer-based semi-supervised video object segmentation method

Country Status (1)

Country Link
CN (1) CN114429607B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116012388B (en) * 2023-03-28 2023-06-13 中南大学 Three-dimensional medical image segmentation method and imaging method for acute ischemic cerebral apoplexy

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685066A (en) * 2018-12-24 2019-04-26 中国矿业大学(北京) A kind of mine object detection and recognition method based on depth convolutional neural networks
CN110378288A (en) * 2019-07-19 2019-10-25 合肥工业大学 A kind of multistage spatiotemporal motion object detection method based on deep learning
WO2019218136A1 (en) * 2018-05-15 2019-11-21 深圳大学 Image segmentation method, computer device, and storage medium
CN111523410A (en) * 2020-04-09 2020-08-11 哈尔滨工业大学 Video saliency target detection method based on attention mechanism
CN111860162A (en) * 2020-06-17 2020-10-30 上海交通大学 Video crowd counting system and method
CN112036300A (en) * 2020-08-31 2020-12-04 合肥工业大学 Moving target detection method based on multi-scale space-time propagation layer
CN113570610A (en) * 2021-07-26 2021-10-29 北京百度网讯科技有限公司 Method and device for performing target segmentation on video by adopting semantic segmentation model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20210061072A (en) * 2019-11-19 2021-05-27 삼성전자주식회사 Video segmentation method and apparatus

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019218136A1 (en) * 2018-05-15 2019-11-21 深圳大学 Image segmentation method, computer device, and storage medium
CN109685066A (en) * 2018-12-24 2019-04-26 中国矿业大学(北京) A kind of mine object detection and recognition method based on depth convolutional neural networks
CN110378288A (en) * 2019-07-19 2019-10-25 合肥工业大学 A kind of multistage spatiotemporal motion object detection method based on deep learning
CN111523410A (en) * 2020-04-09 2020-08-11 哈尔滨工业大学 Video saliency target detection method based on attention mechanism
CN111860162A (en) * 2020-06-17 2020-10-30 上海交通大学 Video crowd counting system and method
CN112036300A (en) * 2020-08-31 2020-12-04 合肥工业大学 Moving target detection method based on multi-scale space-time propagation layer
CN113570610A (en) * 2021-07-26 2021-10-29 北京百度网讯科技有限公司 Method and device for performing target segmentation on video by adopting semantic segmentation model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zongxin Yang et al. Collaborative Video Object Segmentation by Multi-Scale Foreground-Background Integration. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2021, 4701-4712. *
Research on video action recognition and detection methods based on spatio-temporal correlation; Li Dong; China Doctoral Dissertations Full-text Database (Information Science and Technology); 2021-09-15; I138-48 *

Also Published As

Publication number Publication date
CN114429607A (en) 2022-05-03

Similar Documents

Publication Publication Date Title
CN111489358B (en) Three-dimensional point cloud semantic segmentation method based on deep learning
CN109871798B (en) Remote sensing image building extraction method based on convolutional neural network
CN111898439A (en) Deep learning-based traffic scene joint target detection and semantic segmentation method
CN110599502B (en) Skin lesion segmentation method based on deep learning
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN114359130A (en) Road crack detection method based on unmanned aerial vehicle image
CN106874879A (en) Handwritten Digit Recognition method based on multiple features fusion and deep learning network extraction
CN114429607B (en) Transformer-based semi-supervised video object segmentation method
CN115953408B (en) YOLOv 7-based lightning arrester surface defect detection method
CN115035508A (en) Topic-guided remote sensing image subtitle generation method based on Transformer
CN114399533B (en) Single-target tracking method based on multi-level attention mechanism
CN114283120B (en) Domain-adaptive-based end-to-end multisource heterogeneous remote sensing image change detection method
CN111695455A (en) Low-resolution face recognition method based on coupling discrimination manifold alignment
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN109670506A (en) Scene Segmentation and system based on Kronecker convolution
CN109543724B (en) Multilayer identification convolution sparse coding learning method
CN115017366B (en) Unsupervised video hash retrieval method based on multi-granularity contextualization and multi-structure preservation
CN116597183A (en) Multi-mode image feature matching method based on space and channel bi-dimensional attention
CN116363361A (en) Automatic driving method based on real-time semantic segmentation network
CN114821631A (en) Pedestrian feature extraction method based on attention mechanism and multi-scale feature fusion
CN114495163A (en) Pedestrian re-identification generation learning method based on category activation mapping
CN113436198A (en) Remote sensing image semantic segmentation method for collaborative image super-resolution reconstruction
CN115909045B (en) Two-stage landslide map feature intelligent recognition method based on contrast learning
CN114972746B (en) Medical image segmentation method based on multi-resolution overlapping attention mechanism
CN113393521B (en) High-precision flame positioning method and system based on dual semantic attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant