CN114429607A - Transformer-based semi-supervised video object segmentation method - Google Patents
- Publication number: CN114429607A
- Application number: CN202210098849.1A
- Authority
- CN
- China
- Prior art keywords
- segmentation
- module
- layer
- video
- attention module
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06F18/2155 — Generating training patterns; bootstrap methods characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling (G06F — Electric digital data processing; Pattern recognition)
- G06N3/045 — Combinations of networks (G06N — Computing arrangements based on specific computational models; Neural networks)
- G06N3/047 — Probabilistic or stochastic networks
- G06N3/08 — Learning methods
Abstract
The invention discloses a Transformer-based semi-supervised video object segmentation method comprising the following implementation scheme: 1) acquiring a data set and segmentation labels; 2) data expansion and processing; 3) constructing a segmentation model; 4) constructing a loss function; 5) training the segmentation model; 6) segmenting the video object. The invention compresses spatio-temporal information with a space-time integration module, introduces multi-scale layers to generate cross-scale input features, and constructs a double-branch cross attention module to capture multiple characteristics of the target information. The method effectively improves the segmentation accuracy on small-scale targets and similar targets while reducing the computational cost.
Description
Technical Field
The invention relates to the technical field of image processing, and in particular to a Transformer-based semi-supervised video object segmentation method.
Background
Video object segmentation is an important prerequisite for video understanding, with many potential applications such as video retrieval, video editing and autonomous driving. The goal of semi-supervised video object segmentation is, given the target (i.e. its segmentation label) in the first frame of a video, to segment that target in all other frames of the sequence.
Owing to the strong performance of the Transformer architecture on computer vision tasks such as image classification, object detection, semantic segmentation and object tracking, much current research applies the Transformer architecture to video object segmentation. The Transformer architecture has excellent long-range dependency modeling capability and can effectively mine the spatio-temporal information in a given video, improving segmentation accuracy. However, most Transformer-based methods feed the features of all frames in the storage pool directly into the multi-head attention module, which becomes computationally expensive as the number of segmented frames grows; moreover, the classical Transformer architecture lacks intrinsic inductive bias, so its segmentation accuracy on small-scale targets and similar targets is poor.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art by providing a Transformer-based semi-supervised video object segmentation method.
In order to achieve the purpose, the invention provides the following technical scheme:
a semi-supervised video object segmentation method based on a Transformer comprises the following steps:
(1) acquiring a data set and a segmentation label:
acquiring a video target segmentation data set, a static image data set and segmentation labels corresponding to the two data sets, and forming an image pair by each image in the data sets and the corresponding segmentation labels;
(2) the data expansion and processing method specifically comprises the following steps:
(2-a) after normalization processing is carried out on each image pair consisting of the static image data set obtained in the step (1) and the corresponding segmentation labels, repeating the following processes to obtain a synthesized video training sample corresponding to each image pair, wherein the synthesized video training set is formed by a set of synthesized video training samples:
I. reducing the short edge of the image pair to w pixels, scaling the long edge by the same ratio, and randomly cropping the result to h × w pixels, where w is the width and h the height of the cropped image; w and h are positive integers in the range [10, 3000];
II. sequentially applying random scaling, random horizontal flipping, random color jittering and random grayscale conversion to the cropped image pair to obtain an enhanced image pair corresponding to the image pair;
III. repeating process II three times to obtain three enhanced image pairs corresponding to the image pair, which together form one synthetic video training sample;
(2-b) after normalization processing is carried out on each video and the corresponding segmentation labels in the video target segmentation data set obtained in the step (1), repeating the following processes to obtain a real video training sample corresponding to each video, wherein the real video training set is formed by a set of real video training samples:
I. randomly extracting three image pairs from the video and the corresponding segmentation labels;
II. reducing the short edges of the three image pairs to w pixels, scaling the long edges by the same ratio, and randomly cropping the results to h × w pixels; the meanings and values of w and h are the same as in step (2-a);
III. sequentially applying random cropping, color jittering and random grayscale conversion to the three image pairs to obtain three enhanced image pairs, which together form one real video training sample;
(3) constructing a segmentation model, which specifically comprises the following steps:
(3-a) construction of query coders
Using a convolutional neural network as the query encoder, the frame to be segmented passes sequentially through the first four layers of the encoder; the output of the second layer is f_C2, the output of the third layer is f_C3, and the output of the fourth layer is f_C;
(3-b) construction of storage pool
Putting the 1st frame, the (τ+1)-th frame, the (2τ+1)-th frame, ……, the (Nτ+1)-th frame of a video sequence and their corresponding segmentation labels into a storage pool, where τ is a positive integer in the range [1, 200], N = ⌊(τ_C − 1)/τ⌋, τ_C is the relative position of the frame to be segmented, and ⌊·⌋ denotes rounding down;
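The storage-pool sampling rule can be sketched in pure Python. The reconstruction N = ⌊(τ_C − 1)/τ⌋ is an assumption recovered from the garbled formula, and the helper name is ours:

```python
def storage_pool_frames(tau_c, tau=5):
    """Return the 1-based indices of the frames kept in the storage pool
    when segmenting frame tau_c: frames 1, tau+1, 2*tau+1, ..., N*tau+1,
    with N = floor((tau_c - 1) / tau).  Hypothetical helper; only the
    sampling rule itself comes from the text."""
    n = (tau_c - 1) // tau        # // is Python's floor division (rounding down)
    return [i * tau + 1 for i in range(n + 1)]
```

For example, with τ = 5 and τ_C = 12 the pool holds frames 1, 6 and 11.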
(3-c) construction of storage encoder
Using a convolutional neural network as the storage encoder, all images in the storage pool and their corresponding segmentation labels are passed through the encoder to obtain f_M;
(3-d) construction of Transformer Module
The module consists of a Transformer encoder and a Transformer decoder. The Transformer encoder comprises a space-time integration module, a convolutional layer, a multi-scale layer and a self-attention module. The Transformer decoder comprises two convolutional layers, a multi-scale layer, a self-attention module and a double-branch cross attention module. The double-branch cross attention module consists of a query branch and a storage branch; the two branches have the same structure, each consisting of a multi-head attention module, two residual-and-layer-normalization modules and a fully connected feedforward network, the latter consisting of two linear layers and a ReLU activation layer. The self-attention modules in the Transformer encoder and the Transformer decoder are identical in structure, each consisting of a multi-head attention module and a residual-and-layer-normalization module. All multi-head attention modules share the same structure;
(3-e) construction of a segmented decoder
The segmentation decoder consists of a residual module, two up-sampling modules and a prediction convolution module. The residual module consists of three convolutional layers and two ReLU activation layers; the first up-sampling module consists of four convolutional layers, two ReLU activation layers and one bilinear interpolation; the second up-sampling module consists of three convolutional layers, two ReLU activation layers and one bilinear interpolation; and the prediction convolution module consists of one convolutional layer and one bilinear interpolation;
(3-f) inputting f_C obtained in step (3-a) and f_M obtained in step (3-c) into the space-time integration module of the Transformer encoder constructed in step (3-d) to obtain f_M'; the specific calculation is:
f_M' = f_M·softmax(c(ConvKey(f_M), ConvKey(f_C)))
where ConvKey(·) is a key projection layer consisting of one convolutional layer, c(·,·) denotes the negative squared Euclidean distance, and softmax(·) denotes the activation function;
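As a concrete illustration of step (3-f), the pure-Python sketch below computes the affinity between memory and query keys with the negative squared Euclidean distance and uses it to compress the memory features. The ConvKey projections are replaced by precomputed key vectors, and taking the softmax over the memory dimension is our assumption:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def spatiotemporal_integration(f_m, k_m, k_c):
    """Sketch of f_M' = f_M . softmax(c(Key(f_M), Key(f_C))).

    f_m : list of N memory feature vectors (the values)
    k_m : list of N memory key vectors   (stand-in for ConvKey(f_M))
    k_c : list of L query key vectors    (stand-in for ConvKey(f_C))

    Each query location aggregates the memory features with weights
    given by a softmax over negative squared Euclidean distances."""
    d = len(f_m[0])
    out = []
    for q in k_c:
        # negative squared Euclidean distance to every memory key
        scores = [-sum((a - b) ** 2 for a, b in zip(m, q)) for m in k_m]
        w = softmax(scores)
        out.append([sum(w[n] * f_m[n][j] for n in range(len(f_m)))
                    for j in range(d)])
    return out
```

With one query key equal to the first memory key, the output is dominated by the first memory feature, showing how the module compresses the memory along the spatio-temporal dimension.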
(3-g) inputting f_M' obtained in step (3-f) sequentially into the convolutional layer, the multi-scale layer and the self-attention module of the Transformer encoder, obtaining M_1, M_2 and M_3 respectively; inputting f_C obtained in step (3-a) sequentially into the first convolutional layer, the second convolutional layer, the multi-scale layer and the self-attention module of the Transformer decoder constructed in step (3-d), obtaining C_0, C_1, C_2 and C_3 respectively; inputting M_3 and C_3 into the query branch of the double-branch cross attention module of the Transformer decoder constructed in step (3-d) to obtain C_4; inputting M_3 and C_3 into the storage branch of the double-branch cross attention module of the Transformer decoder constructed in step (3-d) to obtain M_4;
(3-h) concatenating C_0, C_4 and M_4 obtained in step (3-g) with f_M' obtained in step (3-f) and inputting the result into the residual module of the segmentation decoder constructed in step (3-e) to obtain f_O1; inputting f_O1 and f_C3 obtained in step (3-a) into the first up-sampling module of the segmentation decoder constructed in step (3-e) to obtain f_O2; inputting f_O2 and f_C2 obtained in step (3-a) into the second up-sampling module of the segmentation decoder constructed in step (3-e) to obtain f_O3; inputting f_O3 into the prediction convolution module of the segmentation decoder constructed in step (3-e) to obtain the predicted segmentation label f_O4, completing the construction of the segmentation model;
(4) constructing a loss function:
the loss function L uses the cross-entropy loss, defined as follows:
L = −(1/(H_Y·W_Y)) Σ_{i=1..H_Y} Σ_{j=1..W_Y} [ Y_ij·log(Ŷ_ij) + (1 − Y_ij)·log(1 − Ŷ_ij) ]
where Y is the true segmentation label, Ŷ is the predicted segmentation label, H_Y and W_Y are the height and width of the true segmentation label, Y_ij is the value of the pixel in row i and column j of Y, Ŷ_ij is the corresponding value in Ŷ, i = 1, 2, …, H_Y, j = 1, 2, …, W_Y, and log(·) denotes the natural logarithm;
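The pixel-averaged binary cross-entropy of step (4) can be sketched in a few lines; the 1/(H_Y·W_Y) averaging and the clamping constant are our assumptions:

```python
import math

def cross_entropy_loss(y, y_hat, eps=1e-12):
    """Binary cross-entropy between a ground-truth mask y and a predicted
    mask y_hat (both H x W nested lists of values in [0, 1]), averaged
    over all pixels.  Minimal sketch of the loss in step (4)."""
    h, w = len(y), len(y[0])
    total = 0.0
    for i in range(h):
        for j in range(w):
            p = min(max(y_hat[i][j], eps), 1 - eps)  # clamp so log() is finite
            total += y[i][j] * math.log(p) + (1 - y[i][j]) * math.log(1 - p)
    return -total / (h * w)
```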
(5) training a segmentation model:
training the segmentation model constructed in step (3) with the synthetic video training set obtained in step (2-a), computing the loss value with the loss function constructed in step (4), and updating the parameters of the segmentation model by stochastic gradient descent to obtain a pre-trained model; then training the pre-trained model with the real video training set obtained in step (2-b), again computing the loss value with the loss function of step (4) and updating the parameters of the segmentation model by stochastic gradient descent, to obtain the trained segmentation model;
(6) video object segmentation:
acquiring the video to be segmented and the segmentation label of its first frame, then inputting the frames, starting from the second, sequentially into the trained segmentation model obtained in step (5), which outputs their segmentation labels.
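The inference procedure of step (6) can be sketched as a loop that seeds the storage pool with the first frame and its given label, then segments the remaining frames in order. Here `model` is a hypothetical callable standing in for the trained segmentation model, and writing every τ-th prediction back into the pool follows the storage-pool rule of step (3-b):

```python
def segment_video(frames, first_label, model, tau=5):
    """Segment every frame after the first while maintaining the storage
    pool.  `model(frame, pool)` is a stand-in for the trained segmentation
    model; it must return the predicted label of `frame` given the pool."""
    pool = [(frames[0], first_label)]     # frame 1 and its given label
    labels = [first_label]
    for t, frame in enumerate(frames[1:], start=1):
        pred = model(frame, pool)
        labels.append(pred)
        if t % tau == 0:                  # frames tau+1, 2*tau+1, ... (1-based)
            pool.append((frame, pred))
    return labels
```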
The multi-scale layer in the step (3-d) is composed of t convolution layers with different convolution kernel sizes, and the output calculation process is as follows:
MSL = Concat(S_1, …, S_i, …, S_t)
S_i = Conv_i(X_i; r_i)
where MSL denotes the output of the multi-scale layer, Concat(·) denotes the concatenation operation, t is the number of convolutional layers in the multi-scale layer (a positive integer in the range [1, 50]), Conv_i denotes the i-th convolutional layer of the multi-scale layer, r_i is the kernel size of the i-th convolutional layer (a positive integer in the range [1, 100]), X_i is the input and S_i the output of the i-th convolutional layer; for the multi-scale layer in the Transformer encoder X_i = M_1, for the multi-scale layer in the Transformer decoder X_i = C_1, i = 1, 2, …, t.
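A 1-D toy version of the multi-scale layer: t convolutions with different kernel sizes are applied to the same input and their outputs concatenated channel-wise. The kernel values below are hypothetical; the patent only fixes the kernel sizes r_i:

```python
def conv1d_same(x, kernel):
    """'Same'-padded single-channel 1-D convolution, a stand-in for the
    2-D convolutional layers of the multi-scale layer."""
    r = len(kernel)
    pad = r // 2
    xp = [0.0] * pad + list(x) + [0.0] * pad
    return [sum(kernel[k] * xp[i + k] for k in range(r))
            for i in range(len(x))]

def multi_scale_layer(x, kernels):
    """MSL = Concat(S_1, ..., S_t) with S_i = Conv_i(x; r_i): run t
    convolutions of different kernel sizes over the same input and
    concatenate the results channel-wise (here: a list of t channels)."""
    return [conv1d_same(x, k) for k in kernels]
```

Because every branch sees the same input at a different receptive field, the concatenated output carries cross-scale information to the following attention module.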
The output of the multi-head attention module in step (3-d) is computed as:
MultiHead(q, k, v) = Concat(A_1, …, A_i, …, A_s)U_o
A_i = softmax((q·U_i^q)(k·U_i^k)^T / √d)(v·U_i^v)
where MultiHead(q, k, v) denotes the output of the multi-head attention module, softmax(·) denotes the activation function, and Concat(·) denotes the concatenation operation; q, k and v are the inputs of the multi-head attention module: for the multi-head attention module inside the self-attention module of the Transformer encoder, q = k = v = M_2; for the one inside the self-attention module of the Transformer decoder, q = k = v = C_2; for the multi-head attention module inside the storage branch of the double-branch cross attention module, q = M_3 and k = v = C_3; for the one inside the query branch, q = C_3 and k = v = M_3. s is the number of attention heads in the multi-head attention module, a positive integer in the range [1, 16]; A_i denotes the output of the i-th attention head, i = 1, 2, …, s; U_i^q, U_i^k and U_i^v are the q, k and v parameter matrices of the i-th attention head; U_o is the parameter matrix that adjusts the final output; T denotes the transpose; and d is a hyperparameter, a positive integer in the range [1, 1000].
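A single attention head on plain Python lists; the learned projection matrices U_i^q, U_i^k, U_i^v and U_o are omitted for brevity, so this only illustrates the softmax(q·k^T/√d)·v core of each head:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def attention_head(q, k, v, d):
    """One attention head: for each query vector, score all keys by the
    scaled dot product, softmax the scores, and return the weighted sum
    of the value vectors.  q is a list of query vectors; k and v are
    aligned lists of key/value vectors; d is the scaling hyperparameter."""
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d)
                  for kj in k]
        w = softmax(scores)
        out.append([sum(w[n] * v[n][j] for n in range(len(v)))
                    for j in range(len(v[0]))])
    return out
```

In the full module, s such heads run in parallel on projected inputs and their outputs are concatenated and mixed by U_o.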
In step (3-b), preferably τ is 5.
In step (3-d), the number t of convolutional layers in the multi-scale layer is preferably 2, the number s of attention heads in the multi-head attention module is preferably 8, and the hyper-parameter d is preferably 32.
Compared with the prior art, the invention has the following advantages:
(1) The space-time integration module of the invention integrates and compresses spatio-temporal information, reducing the computational cost.
(2) The multi-scale layer introduced by the invention provides cross-scale input features for the Transformer, helping the network learn scale-invariant feature representations and improving its segmentation accuracy on small-scale targets.
(3) The two branches of the double-branch cross attention module constructed by the invention attend to different characteristics of the target information, improving the segmentation accuracy of target details.
Drawings
FIG. 1 is a flowchart of a Transformer-based semi-supervised video object segmentation method according to an embodiment of the present invention;
FIG. 2 is a diagram of a segmentation model architecture in accordance with an embodiment of the present invention;
FIG. 3 is a diagram of a Transformer module architecture according to an embodiment of the present invention;
FIG. 4 compares the segmentation results of an embodiment of the present invention on a small-scale target with those of other methods;
FIG. 5 compares the segmentation results of an embodiment of the present invention on similar targets with those of other methods.
Detailed Description
The following describes specific embodiments of the present invention;
example 1
Fig. 1 is a flowchart of the Transformer-based semi-supervised video object segmentation method according to an embodiment of the present invention, with the following specific steps:
step 1, acquiring a data set and a segmentation label
The method comprises the steps of obtaining a video target segmentation data set, a static image data set and segmentation labels corresponding to the two data sets, and enabling each image in the data sets and the corresponding segmentation labels to form an image pair.
Step 2, data expansion and processing
(2-a) after normalization processing is carried out on each image pair consisting of the static image data set obtained in the step (1) and the corresponding segmentation labels, repeating the following processes to obtain a synthesized video training sample corresponding to each image pair, wherein the synthesized video training set is formed by a set of synthesized video training samples:
I. reducing the short edge of the image pair to w pixels, scaling the long edge by the same ratio, and randomly cropping the result to h × w pixels, where w is the width and h the height of the cropped image; w and h are positive integers in the range [10, 3000], and this embodiment takes w = 384 and h = 384;
II. sequentially applying random scaling, random horizontal flipping, random color jittering and random grayscale conversion to the cropped image pair to obtain an enhanced image pair corresponding to the image pair;
III. repeating process II three times to obtain three enhanced image pairs corresponding to the image pair, which together form one synthetic video training sample;
(2-b) after normalization processing is carried out on each video and the corresponding segmentation labels in the video target segmentation data set obtained in the step (1), repeating the following processes to obtain a real video training sample corresponding to each video, wherein the real video training set is formed by a set of real video training samples:
I. randomly extracting three image pairs from the video and the corresponding segmentation labels;
II. reducing the short edges of the three image pairs to w pixels, scaling the long edges by the same ratio, and randomly cropping the results to h × w pixels; the meanings and values of w and h are the same as in step (2-a);
III. sequentially applying random cropping, color jittering and random grayscale conversion to the three image pairs to obtain three enhanced image pairs, which together form one real video training sample.
Step 3, constructing a segmentation model
Fig. 2 is a diagram illustrating a segmentation model structure according to an embodiment of the present invention, which includes the following steps:
(3-a) construction of query coders
Using the convolutional neural network ResNet50 as the query encoder, the frame to be segmented passes sequentially through the first four layers of the encoder; the output of the second layer is f_C2, the output of the third layer is f_C3, and the output of the fourth layer is f_C;
(3-b) construction of storage pool
Putting the 1st frame, the (τ+1)-th frame, the (2τ+1)-th frame, ……, the (Nτ+1)-th frame of a video sequence and their corresponding segmentation labels into a storage pool, where τ is a positive integer in the range [1, 200], N = ⌊(τ_C − 1)/τ⌋, τ_C is the relative position of the frame to be segmented, and ⌊·⌋ denotes rounding down; this embodiment takes τ = 5;
(3-c) construction of storage encoder
Using the convolutional neural network ResNet18 as the storage encoder, all images in the storage pool and their corresponding segmentation labels are passed through the encoder to obtain f_M;
(3-d) construction of Transformer Module
FIG. 3 shows the structure of the Transformer module according to an embodiment of the present invention. The module consists of a Transformer encoder and a Transformer decoder. The Transformer encoder comprises a space-time integration module, a convolutional layer with 256 kernels of size 1, a multi-scale layer and a self-attention module. The Transformer decoder comprises a convolutional layer with 512 kernels of size 3, a convolutional layer with 256 kernels of size 1, a multi-scale layer, a self-attention module and a double-branch cross attention module. The double-branch cross attention module consists of a query branch and a storage branch; the two branches have the same structure, each consisting of a multi-head attention module, two residual-and-layer-normalization modules and a fully connected feedforward network, the latter consisting of two linear layers (input dimension 256, hidden size 2048) and a ReLU activation layer. The self-attention modules in the Transformer encoder and the Transformer decoder are identical in structure, each consisting of a multi-head attention module and a residual-and-layer-normalization module. All multi-head attention modules share the same structure;
the multi-scale layer in the step (3-d) is composed of t convolution layers with different convolution kernel sizes, and the output calculation process is as follows:
MSL = Concat(S_1, …, S_i, …, S_t)
S_i = Conv_i(X_i; r_i)
where MSL denotes the output of the multi-scale layer, Concat(·) denotes the concatenation operation, t is the number of convolutional layers in the multi-scale layer (a positive integer in the range [1, 50]; in this embodiment t = 2), Conv_i denotes the i-th convolutional layer of the multi-scale layer, r_i is the kernel size of the i-th convolutional layer (a positive integer in the range [1, 100]; in this embodiment r_1 = 2 and r_2 = 4), X_i is the input and S_i the output of the i-th convolutional layer; for the multi-scale layer in the Transformer encoder X_i = M_1, for the multi-scale layer in the Transformer decoder X_i = C_1, i = 1, 2, …, t.
The output of the multi-head attention module in step (3-d) is computed as:
MultiHead(q, k, v) = Concat(A_1, …, A_i, …, A_s)U_o
A_i = softmax((q·U_i^q)(k·U_i^k)^T / √d)(v·U_i^v)
where MultiHead(q, k, v) denotes the output of the multi-head attention module, softmax(·) denotes the activation function, and Concat(·) denotes the concatenation operation; q, k and v are the inputs of the multi-head attention module: for the multi-head attention module inside the self-attention module of the Transformer encoder, q = k = v = M_2; for the one inside the self-attention module of the Transformer decoder, q = k = v = C_2; for the multi-head attention module inside the storage branch of the double-branch cross attention module, q = M_3 and k = v = C_3; for the one inside the query branch, q = C_3 and k = v = M_3. s is the number of attention heads in the multi-head attention module, a positive integer in the range [1, 16], and this embodiment takes s = 8; A_i denotes the output of the i-th attention head, i = 1, 2, …, s; U_i^q, U_i^k and U_i^v are the q, k and v parameter matrices of the i-th attention head; U_o is the parameter matrix that adjusts the final output; T denotes the transpose; and d is a hyperparameter, a positive integer in the range [1, 1000], taken as 32 in this embodiment;
(3-e) construction of a segmented decoder
The segmentation decoder consists of a residual module, two up-sampling modules and a prediction convolution module. The residual module consists of convolutional layers with 512 kernels of size 3 and two ReLU activation layers; the first up-sampling module consists of a convolutional layer with 512 kernels of size 3, a convolutional layer with 256 kernels of size 3, two ReLU activation layers and a bilinear interpolation with upscaling factor 2; the second up-sampling module consists of convolutional layers with 256 kernels of size 3, two ReLU activation layers and a bilinear interpolation with upscaling factor 2; and the prediction convolution module consists of a convolutional layer with 256 kernels of size 1 and a bilinear interpolation with upscaling factor 2;
(3-f) inputting f_C obtained in step (3-a) and f_M obtained in step (3-c) into the space-time integration module of the Transformer encoder constructed in step (3-d) to obtain f_M'; the specific calculation is:
f_M' = f_M·softmax(c(ConvKey(f_M), ConvKey(f_C)))
where ConvKey(·) is a key projection layer consisting of a convolutional layer with 64 kernels of size 3, c(·,·) denotes the negative squared Euclidean distance, and softmax(·) denotes the activation function;
(3-g) inputting f_M' obtained in step (3-f) sequentially into the convolutional layer, the multi-scale layer and the self-attention module of the Transformer encoder, obtaining M_1, M_2 and M_3 respectively; inputting f_C obtained in step (3-a) sequentially into the first convolutional layer, the second convolutional layer, the multi-scale layer and the self-attention module of the Transformer decoder constructed in step (3-d), obtaining C_0, C_1, C_2 and C_3 respectively; inputting M_3 and C_3 into the query branch of the double-branch cross attention module of the Transformer decoder constructed in step (3-d) to obtain C_4; inputting M_3 and C_3 into the storage branch of the double-branch cross attention module of the Transformer decoder constructed in step (3-d) to obtain M_4;
(3-h) concatenating C_0, C_4 and M_4 obtained in step (3-g) with f_M' obtained in step (3-f) and inputting the result into the residual module of the segmentation decoder constructed in step (3-e) to obtain f_O1; inputting f_O1 and f_C3 obtained in step (3-a) into the first up-sampling module of the segmentation decoder constructed in step (3-e) to obtain f_O2; inputting f_O2 and f_C2 obtained in step (3-a) into the second up-sampling module of the segmentation decoder constructed in step (3-e) to obtain f_O3; inputting f_O3 into the prediction convolution module of the segmentation decoder constructed in step (3-e) to obtain the predicted segmentation label f_O4, completing the construction of the segmentation model.
(4) Constructing a loss function:
The loss function L uses the cross-entropy loss, defined as follows:
L = −(1/(H_Y·W_Y)) Σ_{i=1..H_Y} Σ_{j=1..W_Y} [ Y_ij·log(Ŷ_ij) + (1 − Y_ij)·log(1 − Ŷ_ij) ]
where Y is the true segmentation label, Ŷ is the predicted segmentation label, H_Y and W_Y are the height and width of the true segmentation label, Y_ij is the value of the pixel in row i and column j of Y, Ŷ_ij is the corresponding value in Ŷ, i = 1, 2, …, H_Y, j = 1, 2, …, W_Y, and log(·) denotes the natural logarithm.
(5) Training a segmentation model:
Training the segmentation model constructed in step (3) with the synthetic video training set obtained in step (2-a), computing the loss value with the loss function constructed in step (4), and updating the parameters of the segmentation model by stochastic gradient descent to obtain a pre-trained model; then training the pre-trained model with the real video training set obtained in step (2-b), again computing the loss value with the loss function of step (4) and updating the parameters of the segmentation model by stochastic gradient descent, to obtain the trained segmentation model.
(6) Video object segmentation:
Acquiring the video to be segmented and the segmentation label of its first frame, then inputting the frames, starting from the second, sequentially into the trained segmentation model obtained in step (5), which outputs their segmentation labels.
Example 2
Video object segmentation experiments were performed on the public data sets YouTube-VOS 2018, YouTube-VOS 2019, DAVIS 2016 and DAVIS 2017 using the method of Example 1. The experiments ran on Linux Ubuntu 18.04; the implementation used PyTorch 1.8.1 with CUDA 10.1 and cuDNN 7.6.5, and training and testing used two NVIDIA 2080Ti 11 GB GPUs.
The present embodiment uses the region similarity J, the contour accuracy F, and their mean J&F to evaluate the performance of the invention. The region similarity J is the average intersection-over-union between the predicted label and the corresponding true label:

J = |M ∩ G| / |M ∪ G|

where M is the predicted segmentation label, G is the true segmentation label, and ∩ and ∪ denote set intersection and union, respectively.
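A minimal sketch of the region similarity J for binary masks (the convention for two empty masks is an assumption, as the text does not define that edge case):

```python
import numpy as np

def region_similarity(M, G):
    """J = |M ∩ G| / |M ∪ G| for binary masks M (prediction) and G (ground truth)."""
    M, G = M.astype(bool), G.astype(bool)
    union = np.logical_or(M, G).sum()
    if union == 0:
        return 1.0  # convention: two empty masks match perfectly
    return float(np.logical_and(M, G).sum() / union)
```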
The contour accuracy F is the average boundary similarity between the boundary of the predicted label and the boundary of its true label:

F = 2·P_c·R_c / (P_c + R_c)

where P_c is the precision between l(M) and l(G) and R_c is the recall between l(M) and l(G), both computed by bipartite graph matching; l(M) denotes the set of closed contours delimiting the predicted segmentation label M, and l(G) the set of closed contours delimiting the true segmentation label G.
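Given P_c and R_c from the bipartite contour-matching step (not reproduced here), F is simply their harmonic mean:

```python
def contour_accuracy(precision, recall):
    """F = 2*Pc*Rc / (Pc + Rc); Pc and Rc come from matching the closed
    contours of the predicted and true labels (matching step omitted)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```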
Tables 1, 2, 3 and 4 report the J, F and J&F results on the test sample sets of YouTube-VOS 2018, YouTube-VOS 2019, DAVIS 2016 and DAVIS 2017, respectively, compared with other methods; the method of the present invention obtains the highest J&F score on all four data sets.
Fig. 4 compares the segmentation of a small-scale object by the embodiment of the present invention with the results of other methods. The first, second, third and fourth rows show, respectively, the segmentation by TransVOS, CFBI+, STCN and the method of the present invention on arbitrary frames of video d79729a354 in YouTube-VOS 2018; the number in the upper-left corner of each image is its frame number in the video sequence, the image in each dashed box is an enlargement of the small region in the corresponding solid box, and the first column shows the first frame of the video and its true segmentation label. The first row, second column shows that TransVOS is disturbed by the background and misidentifies clutter as the distant person; the second row, second and third columns show that CFBI+ fails to segment the person's legs and arms correctly; the third row, second through fifth columns show that STCN confuses the distant person with a vehicle in the background. The fourth row shows that the method of the present invention segments the small-scale object correctly, with a result superior to those of the other methods.
Fig. 5 compares the segmentation of similar objects by the embodiment of the present invention with the results of other methods. The first, second, third and fourth rows show, respectively, the segmentation by TransVOS, CFBI+, STCN and the method of the present invention on arbitrary frames of video 5d2020eff8 in YouTube-VOS 2018; the number in the upper-left corner of each image is its frame number in the video sequence, targets in solid boxes are erroneous segmentation results, and the first column shows the first frame of the video and its true segmentation label. STCN, CFBI+ and TransVOS all confuse the target fish with similar-looking fish in the background, whereas the method of the present invention successfully distinguishes the segmentation target from similar objects in the background.
The above-mentioned embodiments are merely preferred embodiments of the present invention, and the scope of the present invention is not limited thereby, and all changes made based on the principle of the present invention should be covered within the scope of the present invention.
TABLE 1
(Note: J_S and F_S in the table denote the J and F values for classes that appear in both the YouTube-VOS 2018 training and test sets; J_U and F_U denote the J and F values for classes that appear only in the YouTube-VOS 2018 test set.)
TABLE 2
(Note: J_S and F_S in the table denote the J and F values for classes that appear in both the YouTube-VOS 2019 training and test sets; J_U and F_U denote the J and F values for classes that appear only in the YouTube-VOS 2019 test set.)

TABLE 3
TABLE 4
Claims (3)
1. A semi-supervised video object segmentation method based on a Transformer is characterized by comprising the following steps:
(1) acquiring a data set and a segmentation label:
acquiring a video object segmentation data set, a static image data set and the segmentation labels corresponding to the two data sets; each image in the data sets and its corresponding segmentation label form an image pair;
(2) the data expansion and processing method specifically comprises the following steps:
(2-a) after normalizing each image pair formed from the static image data set obtained in step (1) and its corresponding segmentation label, repeating the following process to obtain a synthetic video training sample for each image pair, the set of synthetic video training samples forming the synthetic video training set:
I. scaling the short side of the image pair to w pixels and the long side proportionally, then randomly cropping the resulting image pair to h × w pixels, where w is the width and h the height of the cropped image; w and h are positive integers in the range [10, 3000];
II. applying random scaling, random horizontal flipping, random color jittering and random grayscale conversion in sequence to the cropped image pair to obtain an enhanced image pair corresponding to the image pair;
III. repeating process II three times to obtain three enhanced image pairs corresponding to the image pair; the three enhanced image pairs form one synthetic video training sample;
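Steps I and II above (short-side resize to w, then a random h × w crop) can be sketched with NumPy. Nearest-neighbour sampling and the RNG interface are assumptions; the patent does not fix the interpolation method:

```python
import numpy as np

def resize_short_side(img, w):
    """Scale so the short side becomes w pixels, long side proportionally
    (nearest-neighbour index sampling, for illustration only)."""
    h0, w0 = img.shape[:2]
    scale = w / min(h0, w0)
    new_h, new_w = max(w, round(h0 * scale)), max(w, round(w0 * scale))
    rows = (np.arange(new_h) * h0 / new_h).astype(int)
    cols = (np.arange(new_w) * w0 / new_w).astype(int)
    return img[rows][:, cols]

def random_crop(img, h, w, rng):
    """Cut a random h x w window out of img."""
    h0, w0 = img.shape[:2]
    top = rng.integers(0, h0 - h + 1)
    left = rng.integers(0, w0 - w + 1)
    return img[top:top + h, left:left + w]
```

Running the stochastic step II three times on one image pair then yields the three enhanced pairs that make up a synthetic video training sample.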
(2-b) after normalizing each video in the video object segmentation data set obtained in step (1) and its corresponding segmentation labels, repeating the following process to obtain a real video training sample for each video, the set of real video training samples forming the real video training set:
I. randomly extracting three image pairs from the video and its corresponding segmentation labels;
II. scaling the short sides of the three image pairs to w pixels and the long sides proportionally, then randomly cropping the three image pairs to h × w pixels; the meanings and value ranges of w and h are the same as in step (2-a);
III. applying random cropping, color jittering and random grayscale conversion in sequence to the three image pairs to obtain three enhanced image pairs, which form one real video training sample;
(3) constructing a segmentation model, which specifically comprises the following steps:
(3-a) construction of query coders
Using a convolutional neural network as the query encoder, the frame to be segmented passes through the first four layers of the encoder in sequence, where the output of the second layer is f_C2, the output of the third layer is f_C3, and the output of the fourth layer is f_C;
(3-b) construction of storage pool
Putting the 1st, (τ+1)th, (2τ+1)th, …, (Nτ+1)th frames of the video sequence and their corresponding segmentation labels into the storage pool, where τ is a positive integer in the range [1, 200], N = ⌊(τ_C − 1)/τ⌋, τ_C is the relative position of the frame to be segmented, and the symbol ⌊·⌋ denotes rounding down;
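The storage-pool frame selection can be sketched as follows. The definition N = ⌊(τ_C − 1)/τ⌋ is reconstructed from the claim fragments (the explicit formula was lost in extraction) and should be treated as an assumption:

```python
def storage_pool_indices(tau_c, tau):
    """1-based frame indices placed in the storage pool for the frame at
    relative position tau_c: frames 1, tau+1, 2*tau+1, ..., N*tau+1
    with N = floor((tau_c - 1) / tau)."""
    n = (tau_c - 1) // tau          # rounding down, per the claim
    return [k * tau + 1 for k in range(n + 1)]
```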
(3-c) construction of storage encoder
Using a convolutional neural network as the storage encoder, all images in the storage pool and their corresponding segmentation labels pass through this encoder to obtain f_M;
(3-d) construction of Transformer Module
The module consists of a Transformer encoder and a Transformer decoder. The Transformer encoder comprises a space-time integration module, a convolutional layer, a multi-scale layer and a self-attention module. The Transformer decoder comprises two convolutional layers, a multi-scale layer, a self-attention module and a double-branch cross attention module; the double-branch cross attention module consists of a query branch and a storage branch of identical structure, each composed of a multi-head attention module, two residual-and-layer-normalization modules and a fully connected feed-forward network, the fully connected feed-forward network consisting of two linear layers and a ReLU activation layer. The self-attention modules in the Transformer encoder and the Transformer decoder are identical in structure, each consisting of a multi-head attention module and a residual-and-layer-normalization module; all multi-head attention modules share the same structure;
(3-e) construction of a segmented decoder
The segmentation decoder consists of a residual module, two up-sampling modules and a prediction convolution module. The residual module consists of three convolutional layers and two ReLU activation layers; the first up-sampling module consists of four convolutional layers, two ReLU activation layers and a bilinear interpolation layer; the second up-sampling module consists of three convolutional layers, two ReLU activation layers and a bilinear interpolation layer; and the prediction convolution module consists of one convolutional layer and a bilinear interpolation layer;
(3-f) inputting f_C obtained in step (3-a) and f_M obtained in step (3-c) into the space-time integration module of the Transformer encoder constructed in step (3-d) to obtain f_M'; the specific calculation process is as follows:
f_M' = f_M · softmax(c(ConvKey(f_M), ConvKey(f_C)))
wherein ConvKey(·) is a key projection layer consisting of one convolutional layer, c(·,·) denotes the negative squared Euclidean distance, and softmax(·) denotes the activation function;
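The space-time integration step can be sketched with flattened features. Here `key_proj` stands in for the convolutional key-projection layer, and normalising the softmax over memory locations is an assumption (the text does not state the axis):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spacetime_integration(f_m, f_c, key_proj):
    """f_M' = f_M . softmax(c(ConvKey(f_M), ConvKey(f_C))) with
    c = negative squared Euclidean distance between key vectors.
    f_m: (N_m, C) memory features, f_c: (N_c, C) query features."""
    k_m = key_proj(f_m)                               # (N_m, d) memory keys
    k_c = key_proj(f_c)                               # (N_c, d) query keys
    d2 = ((k_m[:, None, :] - k_c[None, :, :]) ** 2).sum(-1)   # ||k_m - k_c||^2
    affinity = softmax(-d2, axis=0)                   # over memory locations
    return f_m.T @ affinity                           # readout: (C, N_c)
```

With a single memory entry the affinity collapses to 1 and every query location simply reads back that entry, which is a useful sanity check.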
(3-g) inputting f_M' obtained in step (3-f) sequentially into the convolutional layer, the multi-scale layer and the self-attention module of the Transformer encoder to obtain M_1, M_2 and M_3, respectively; inputting f_C obtained in step (3-a) sequentially into the first convolutional layer, the second convolutional layer, the multi-scale layer and the self-attention module of the Transformer decoder constructed in step (3-d) to obtain C_0, C_1, C_2 and C_3, respectively; inputting M_3 and C_3 into the query branch of the double-branch cross attention module of the Transformer decoder constructed in step (3-d) to obtain C_4; inputting M_3 and C_3 into the storage branch of the double-branch cross attention module of the Transformer decoder constructed in step (3-d) to obtain M_4;
(3-h) concatenating C_0, C_4 and M_4 obtained in step (3-g) with f_M' obtained in step (3-f) and inputting the result into the residual module of the segmentation decoder constructed in step (3-e) to obtain f_O1; inputting f_O1 together with f_C3 obtained in step (3-a) into the first up-sampling module of the segmentation decoder constructed in step (3-e) to obtain f_O2; inputting f_O2 together with f_C2 obtained in step (3-a) into the second up-sampling module of the segmentation decoder constructed in step (3-e) to obtain f_O3; inputting f_O3 into the prediction convolution module of the segmentation decoder constructed in step (3-e) to obtain the predicted segmentation label f_O4, completing the construction of the segmentation model;
(4) constructing a loss function:
the loss function L uses cross-entropy loss, defined as follows:
where Y is the true segmentation label, Ŷ is the predicted segmentation label, H_Y and W_Y are the height and width of the true segmentation label, Y_ij is the pixel value of the pixel in row i and column j of Y, Ŷ_ij is the pixel value of the pixel in row i and column j of Ŷ, i = 1, 2, …, H_Y, j = 1, 2, …, W_Y, and log(·) denotes the natural logarithm;
(5) training a segmentation model:
training the segmentation model constructed in step (3) with the synthetic video training set obtained in step (2-a), computing the loss value with the loss function constructed in step (4), and updating the parameters of the segmentation model by stochastic gradient descent to obtain a pre-trained model; then training the pre-trained model with the real video training set obtained in step (2-b), again computing the loss value with the loss function of step (4), and updating the parameters by stochastic gradient descent to obtain the trained segmentation model;
(6) video object segmentation:
acquiring the video to be segmented and the segmentation label of its first frame; starting from the second frame, inputting the frames of the video in sequence into the trained segmentation model obtained in step (5), and outputting the predicted segmentation labels.
2. The Transformer-based semi-supervised video object segmentation method of claim 1, wherein the multi-scale layer in step (3-d) consists of t convolutional layers with different convolution kernel sizes, and the output of the multi-scale layer is calculated as:
MSL = Concat(S_1, …, S_i, …, S_t)
S_i = Conv_i(X_i; r_i)
wherein MSL denotes the output of the multi-scale layer, Concat(·) denotes the concatenation operation, t denotes the number of convolutional layers in the multi-scale layer (t is a positive integer in the range [1, 50]), Conv_i denotes the ith convolutional layer of the multi-scale layer, r_i denotes the convolution kernel size of the ith convolutional layer (r_i is a positive integer in the range [1, 100]), X_i denotes the input and S_i the output of the ith convolutional layer; for the multi-scale layer in the Transformer encoder X_i = M_1, and for the multi-scale layer in the Transformer decoder X_i = C_1, i = 1, 2, …, t.
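The multi-scale layer of claim 2 (t parallel convolutions of different kernel sizes applied to the same input, outputs concatenated) can be sketched for a single channel. The naive loop convolution is for illustration only; symmetric all-ones kernels make correlation and convolution coincide:

```python
import numpy as np

def conv2d_same(x, kernel):
    """Naive single-channel 2-D convolution with 'same' zero padding."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw)))
    out = np.zeros(x.shape, dtype=float)
    for i in range(x.shape[0]):
        for j in range(x.shape[1]):
            out[i, j] = (xp[i:i + kh, j:j + kw] * kernel).sum()
    return out

def multi_scale_layer(x, kernels):
    """MSL = Concat(S_1, ..., S_t) with S_i = Conv_i(x; r_i): the same input
    passes through t convolutions of different kernel sizes, and the
    results are stacked (concatenated) along a new channel axis."""
    return np.stack([conv2d_same(x, k) for k in kernels])
```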
3. The Transformer-based semi-supervised video object segmentation method of claim 1, wherein the output of the multi-head attention module in step (3-d) is calculated as:
MultiHead(q, k, v) = Concat(A_1, …, A_i, …, A_s)·U_o
A_i = softmax((q·U_i^q)·(k·U_i^k)^T / √d)·(v·U_i^v)

wherein MultiHead(q, k, v) denotes the output of the multi-head attention module, softmax(·) denotes the activation function, and Concat(·) denotes the concatenation operation; q, k and v are the inputs of the multi-head attention module: for the multi-head attention module inside the self-attention module of the Transformer encoder, q = k = v = M_2; for the multi-head attention module inside the self-attention module of the Transformer decoder, q = k = v = C_2; for the multi-head attention module inside the storage branch of the double-branch cross attention module, q = M_3, k = v = C_3; for the multi-head attention module inside the query branch, q = C_3, k = v = M_3; s is the number of attention heads in the multi-head attention module (s is a positive integer in the range [1, 16]); A_i denotes the output of the ith attention head, i = 1, 2, …, s; U_i^q, U_i^k and U_i^v denote the q, k and v parameter matrices of the ith attention head; U_o is the parameter matrix adjusting the final output; T denotes the transpose operator; and d is a hyperparameter, a positive integer in the range [1, 1000].
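The multi-head attention of claim 3 can be sketched as scaled dot-product attention. The per-head formula A_i was lost in extraction and is reconstructed here from the symbols the claim lists (softmax, the U matrices, the transpose T and the hyperparameter d); the 1/√d scaling is an assumption:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, Uq, Uk, Uv, Uo, s, d):
    """MultiHead(q,k,v) = Concat(A_1,...,A_s) Uo with
    A_i = softmax((q Uq_i)(k Uk_i)^T / sqrt(d)) (v Uv_i).
    q, k, v: (n, m); Uq, Uk, Uv: (s, m, d); Uo: (s*d, m)."""
    heads = []
    for i in range(s):
        qi, ki, vi = q @ Uq[i], k @ Uk[i], v @ Uv[i]   # per-head projections
        A = softmax(qi @ ki.T / np.sqrt(d)) @ vi        # scaled dot-product
        heads.append(A)
    return np.concatenate(heads, axis=-1) @ Uo          # merge heads
```

For the self-attention modules all three inputs are the same tensor (M_2 or C_2); the cross-attention branches swap q between M_3 and C_3 as the claim specifies.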
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210098849.1A CN114429607B (en) | 2022-01-24 | 2022-01-24 | Transformer-based semi-supervised video object segmentation method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210098849.1A CN114429607B (en) | 2022-01-24 | 2022-01-24 | Transformer-based semi-supervised video object segmentation method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114429607A true CN114429607A (en) | 2022-05-03 |
CN114429607B CN114429607B (en) | 2024-03-29 |
Family
ID=81313102
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210098849.1A Active CN114429607B (en) | 2022-01-24 | 2022-01-24 | Transformer-based semi-supervised video object segmentation method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114429607B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116012388A (en) * | 2023-03-28 | 2023-04-25 | 中南大学 | Three-dimensional medical image segmentation method and imaging method for acute ischemic cerebral apoplexy |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109685066A (en) * | 2018-12-24 | 2019-04-26 | 中国矿业大学(北京) | A kind of mine object detection and recognition method based on depth convolutional neural networks |
CN110378288A (en) * | 2019-07-19 | 2019-10-25 | 合肥工业大学 | A kind of multistage spatiotemporal motion object detection method based on deep learning |
WO2019218136A1 (en) * | 2018-05-15 | 2019-11-21 | 深圳大学 | Image segmentation method, computer device, and storage medium |
CN111523410A (en) * | 2020-04-09 | 2020-08-11 | 哈尔滨工业大学 | Video saliency target detection method based on attention mechanism |
CN111860162A (en) * | 2020-06-17 | 2020-10-30 | 上海交通大学 | Video crowd counting system and method |
CN112036300A (en) * | 2020-08-31 | 2020-12-04 | 合肥工业大学 | Moving target detection method based on multi-scale space-time propagation layer |
US20210150727A1 (en) * | 2019-11-19 | 2021-05-20 | Samsung Electronics Co., Ltd. | Method and apparatus with video segmentation |
CN113570610A (en) * | 2021-07-26 | 2021-10-29 | 北京百度网讯科技有限公司 | Method and device for performing target segmentation on video by adopting semantic segmentation model |
Patent Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019218136A1 (en) * | 2018-05-15 | 2019-11-21 | 深圳大学 | Image segmentation method, computer device, and storage medium |
CN109685066A (en) * | 2018-12-24 | 2019-04-26 | 中国矿业大学(北京) | A kind of mine object detection and recognition method based on depth convolutional neural networks |
CN110378288A (en) * | 2019-07-19 | 2019-10-25 | 合肥工业大学 | A kind of multistage spatiotemporal motion object detection method based on deep learning |
US20210150727A1 (en) * | 2019-11-19 | 2021-05-20 | Samsung Electronics Co., Ltd. | Method and apparatus with video segmentation |
CN111523410A (en) * | 2020-04-09 | 2020-08-11 | 哈尔滨工业大学 | Video saliency target detection method based on attention mechanism |
CN111860162A (en) * | 2020-06-17 | 2020-10-30 | 上海交通大学 | Video crowd counting system and method |
CN112036300A (en) * | 2020-08-31 | 2020-12-04 | 合肥工业大学 | Moving target detection method based on multi-scale space-time propagation layer |
CN113570610A (en) * | 2021-07-26 | 2021-10-29 | 北京百度网讯科技有限公司 | Method and device for performing target segmentation on video by adopting semantic segmentation model |
Non-Patent Citations (2)
Title |
---|
ZONGXIN YANG et al.: "Collaborative Video Object Segmentation by Multi-Scale Foreground-Background Integration", IEEE Transactions on Pattern Analysis and Machine Intelligence, 18 May 2021 (2021-05-18), pages 4701, XP011916332, DOI: 10.1109/TPAMI.2021.3081597 *
LI Dong: "Research on Video Action Recognition and Detection Methods Based on Spatio-Temporal Correlation", China Doctoral Dissertations Full-text Database (Information Science and Technology), 15 September 2021 (2021-09-15), pages 138-48 *
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116012388A (en) * | 2023-03-28 | 2023-04-25 | 中南大学 | Three-dimensional medical image segmentation method and imaging method for acute ischemic cerebral apoplexy |
Also Published As
Publication number | Publication date |
---|---|
CN114429607B (en) | 2024-03-29 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111489358B (en) | Three-dimensional point cloud semantic segmentation method based on deep learning | |
Johnson et al. | Image generation from scene graphs | |
CN107273800B (en) | Attention mechanism-based motion recognition method for convolutional recurrent neural network | |
CN111783705B (en) | Character recognition method and system based on attention mechanism | |
CN113011427A (en) | Remote sensing image semantic segmentation method based on self-supervision contrast learning | |
CN104517103A (en) | Traffic sign classification method based on deep neural network | |
CN110599502B (en) | Skin lesion segmentation method based on deep learning | |
Hou et al. | Object detection in high-resolution panchromatic images using deep models and spatial template matching | |
CN110349229A (en) | A kind of Image Description Methods and device | |
Nguyen et al. | Hybrid deep learning-Gaussian process network for pedestrian lane detection in unstructured scenes | |
CN115035508A (en) | Topic-guided remote sensing image subtitle generation method based on Transformer | |
CN111696136A (en) | Target tracking method based on coding and decoding structure | |
CN115690152A (en) | Target tracking method based on attention mechanism | |
CN113177503A (en) | Arbitrary orientation target twelve parameter detection method based on YOLOV5 | |
CN115131313A (en) | Hyperspectral image change detection method and device based on Transformer | |
CN112905828A (en) | Image retriever, database and retrieval method combined with significant features | |
CN114429607A (en) | Transformer-based semi-supervised video object segmentation method | |
CN115512096A (en) | CNN and Transformer-based low-resolution image classification method and system | |
CN110503090A (en) | Character machining network training method, character detection method and character machining device based on limited attention model | |
Zhao et al. | Recognition and Classification of Concrete Cracks under Strong Interference Based on Convolutional Neural Network. | |
CN114821631A (en) | Pedestrian feature extraction method based on attention mechanism and multi-scale feature fusion | |
Li et al. | Bagging R-CNN: Ensemble for Object Detection in Complex Traffic Scenes | |
Shi et al. | A welding defect detection method based on multiscale feature enhancement and aggregation | |
Park et al. | A 2-D HMM method for offline handwritten character recognition | |
CN117876931A (en) | Global feature enhanced semi-supervised video target segmentation method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||