CN114429607A - Transformer-based semi-supervised video object segmentation method - Google Patents

Transformer-based semi-supervised video object segmentation method

Info

Publication number
CN114429607A
Authority
CN
China
Prior art keywords
segmentation
module
layer
video
attention module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210098849.1A
Other languages
Chinese (zh)
Other versions
CN114429607B (en)
Inventor
阳春华
周玮
赵于前
张帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202210098849.1A priority Critical patent/CN114429607B/en
Publication of CN114429607A publication Critical patent/CN114429607A/en
Application granted granted Critical
Publication of CN114429607B publication Critical patent/CN114429607B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a Transformer-based semi-supervised video object segmentation method, whose implementation comprises: 1) acquiring a data set and segmentation labels; 2) data expansion and processing; 3) constructing a segmentation model; 4) constructing a loss function; 5) training the segmentation model; 6) segmenting the video object. The invention compresses spatio-temporal information by designing a space-time integration module, introduces multi-scale layers to generate cross-scale input features, and constructs a dual-branch cross attention module that attends to multiple characteristics of the target information. The method can effectively improve the segmentation accuracy for small-scale targets and similar targets while reducing the computational cost.

Description

Transformer-based semi-supervised video object segmentation method
Technical Field
The invention relates to the technical field of image processing, and in particular to a Transformer-based semi-supervised video object segmentation method.
Background
Video object segmentation is an important prerequisite for video understanding, with many potential applications such as video retrieval, video editing and autonomous driving. The goal of semi-supervised video object segmentation is, given an object in the first frame of a video (i.e. its segmentation label), to segment that object in all other frames of the video sequence.
Owing to the strong performance of the Transformer architecture on computer vision tasks such as image classification, object detection, semantic segmentation and object tracking, much current research applies the Transformer architecture to video object segmentation. The Transformer architecture has excellent long-range dependency modeling capability and can effectively mine the spatio-temporal information in a given video, thereby improving segmentation accuracy. However, most Transformer-based methods feed the features of all frames in the storage pool directly into the multi-head attention module, which becomes computationally expensive as the number of segmented frames grows; moreover, the classical Transformer architecture lacks intrinsic inductive bias, so its segmentation accuracy on small-scale targets and similar targets is poor.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art and provides a Transformer-based semi-supervised video object segmentation method.
In order to achieve the purpose, the invention provides the following technical scheme:
a semi-supervised video object segmentation method based on a Transformer comprises the following steps:
(1) acquiring a data set and a segmentation label:
acquiring a video target segmentation data set, a static image data set and segmentation labels corresponding to the two data sets, and forming an image pair by each image in the data sets and the corresponding segmentation labels;
(2) data expansion and processing, which specifically comprises the following steps:
(2-a) after normalization processing is carried out on each image pair consisting of the static image data set obtained in the step (1) and the corresponding segmentation labels, repeating the following processes to obtain a synthesized video training sample corresponding to each image pair, wherein the synthesized video training set is formed by a set of synthesized video training samples:
I. reducing the short edge of the image pair to w pixels and scaling the long edge by the same ratio, then randomly cropping the obtained image pair to a size of h × w pixels, where w is the width of the cropped image, h is the height of the cropped image, and w and h are positive integers with value range [10, 3000];
II. sequentially applying random scaling, random horizontal flipping, random color jittering and random grayscale conversion to the cropped image pair to obtain an enhanced image pair corresponding to the image pair;
III. repeating process II three times to obtain three enhanced image pairs corresponding to the image pair, where the three enhanced image pairs form one synthetic video training sample;
(2-b) after normalization processing is carried out on each video and the corresponding segmentation labels in the video target segmentation data set obtained in the step (1), repeating the following processes to obtain a real video training sample corresponding to each video, wherein the real video training set is formed by a set of real video training samples:
I. randomly extracting three image pairs from the video and the corresponding segmentation labels;
II. reducing the short edges of the three image pairs to w pixels and scaling the long edges by the same ratio, then randomly cropping the obtained three image pairs to a size of h × w pixels, where the meanings and values of w and h are the same as in step (2-a);
III. sequentially applying random cropping, color jittering and random grayscale conversion to the three image pairs to obtain three enhanced image pairs, where the three enhanced image pairs form one real video training sample (an illustrative augmentation sketch follows);
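As an illustration of how one training sample could be produced, the following minimal Python sketch builds a synthetic video training sample from a single image/label pair according to step (2-a); the use of torchvision helpers and the concrete zoom, flip, jitter and grayscale probabilities and ranges are assumptions made only for this example.

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def make_synthetic_sample(image, label, w=384, h=384):
    """Build one synthetic video training sample (three enhanced image/label pairs)
    from a single static image pair, following step (2-a). All augmentation
    parameters below are illustrative assumptions."""
    width, height = image.size                          # PIL images: size = (width, height)
    scale = w / min(width, height)                      # reduce the short edge to w pixels
    new_size = [round(height * scale), round(width * scale)]
    image = TF.resize(image, new_size)
    label = TF.resize(label, new_size, interpolation=InterpolationMode.NEAREST)
    top = random.randint(0, new_size[0] - h)            # one random h x w crop shared by image and label
    left = random.randint(0, new_size[1] - w)
    image, label = TF.crop(image, top, left, h, w), TF.crop(label, top, left, h, w)

    sample = []
    for _ in range(3):                                  # process II repeated three times
        img, lab = image, label
        s = random.uniform(1.0, 1.2)                    # random zoom, then crop back to h x w
        img = TF.resize(img, [round(h * s), round(w * s)])
        lab = TF.resize(lab, [round(h * s), round(w * s)], interpolation=InterpolationMode.NEAREST)
        img, lab = TF.center_crop(img, [h, w]), TF.center_crop(lab, [h, w])
        if random.random() < 0.5:                       # random horizontal flip
            img, lab = TF.hflip(img), TF.hflip(lab)
        img = TF.adjust_brightness(img, random.uniform(0.8, 1.2))   # simple colour jitter
        if random.random() < 0.2:                       # random grayscale conversion
            img = TF.rgb_to_grayscale(img, num_output_channels=3)
        sample.append((img, lab))
    return sample                                       # three enhanced pairs = one training sample
```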
(3) constructing a segmentation model, which specifically comprises the following steps:
(3-a) construction of the query encoder
Using a convolutional neural network as the query encoder, the frame to be segmented passes sequentially through the first four layers of the encoder, where the output of the second layer is f_C2, the output of the third layer is f_C3, and the output of the fourth layer is f_C;
(3-b) construction of storage pool
Putting the 1st frame, the (τ+1)-th frame, the (2τ+1)-th frame, ……, the (Nτ+1)-th frame of the video sequence and their corresponding segmentation labels into the storage pool, where τ is a positive integer with value range [1, 200],
N = ⌊(τ_C − 1)/τ⌋, τ_C is the relative position of the frame to be segmented, and ⌊·⌋ denotes rounding down;
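Under this sampling rule, the indices of the frames kept in the storage pool for a given frame to be segmented follow the pattern sketched below; the helper name and the default τ = 5 are illustrative only.

```python
def memory_frame_indices(tau_c, tau=5):
    """Return the 1-based indices 1, tau+1, 2*tau+1, ..., N*tau+1 of the frames
    kept in the storage pool when segmenting the frame at relative position tau_c,
    with N given by the floor expression above."""
    n = (tau_c - 1) // tau          # N = floor((tau_c - 1) / tau)
    return [k * tau + 1 for k in range(n + 1)]

# e.g. memory_frame_indices(13, tau=5) -> [1, 6, 11]
```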
(3-c) construction of storage encoder
Using a convolutional neural network as the storage encoder, all images in the storage pool and their corresponding segmentation labels are passed through the encoder to obtain f_M;
(3-d) construction of Transformer Module
The module consists of a Transformer encoder and a Transformer decoder. The Transformer encoder comprises a space-time integration module, a convolution layer, a multi-scale layer and a self-attention module. The Transformer decoder comprises two convolution layers, a multi-scale layer, a self-attention module and a dual-branch cross attention module; the dual-branch cross attention module consists of a query branch and a storage branch, the two branches have the same structure, each consisting of a multi-head attention module, two residual-and-layer-normalization modules and a fully connected feed-forward network, and the fully connected feed-forward network consists of two linear layers and a ReLU activation layer. The self-attention modules in the Transformer encoder and the Transformer decoder have exactly the same structure and consist of a multi-head attention module and a residual-and-layer-normalization module; all multi-head attention modules have the same structure;
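As an illustration of this structure, the sketch below shows how one branch of the dual-branch cross attention module could be realized; the use of PyTorch's nn.MultiheadAttention, the post-norm ordering of the residual-and-layer-normalization blocks, and the channel width, head count and feed-forward size (256, 8 and 2048, taken from the embodiment described later) are assumptions rather than the exact patented implementation.

```python
import torch.nn as nn

class CrossAttentionBranch(nn.Module):
    """One branch of the dual-branch cross attention module: a multi-head attention
    module, two residual-and-layer-normalization blocks and a fully connected
    feed-forward network (two linear layers with a ReLU in between).
    Inputs follow nn.MultiheadAttention's default (sequence_length, batch, channels) layout."""
    def __init__(self, dim=256, heads=8, ffn_dim=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))

    def forward(self, q, kv):
        # residual + layer normalization around the cross attention, then around the FFN
        x = self.norm1(q + self.attn(q, kv, kv)[0])
        return self.norm2(x + self.ffn(x))

# query branch:   C4 = query_branch(C3, M3)   (q = C3, k = v = M3)
# storage branch: M4 = store_branch(M3, C3)   (q = M3, k = v = C3)
```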
(3-e) construction of the segmentation decoder
The segmentation decoder consists of a residual module, two up-sampling modules and a prediction convolution module; the residual module consists of three convolution layers and two ReLU activation layers, the first up-sampling module consists of four convolution layers, two ReLU activation layers and one bilinear interpolation, the second up-sampling module consists of three convolution layers, two ReLU activation layers and one bilinear interpolation, and the prediction convolution module consists of one convolution layer and one bilinear interpolation;
(3-f) inputting f_C obtained in step (3-a) and f_M obtained in step (3-c) into the space-time integration module of the Transformer encoder constructed in step (3-d) to obtain f_M'; the specific calculation process is as follows:
f_M' = f_M · softmax(c(ConvKey(f_M), ConvKey(f_C)))
where ConvKey(·) is a key projection layer composed of one convolution layer, c(·,·) denotes the negative squared Euclidean distance, and softmax(·) denotes the activation function;
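A minimal sketch of this space-time integration step is given below; the flattened (batch, channels, positions) tensor layout and the generic key-projection callable are assumptions made only to keep the example self-contained.

```python
import torch
import torch.nn.functional as F

def space_time_integration(f_m, f_c, key_proj):
    """Compress the memory feature f_m against the query feature f_c.
    f_m: (B, C, N_m) flattened memory feature, f_c: (B, C, N_q) flattened query feature,
    key_proj: the ConvKey projection, any callable mapping (B, C, N) -> (B, C_k, N)."""
    k_m, k_c = key_proj(f_m), key_proj(f_c)
    # negative squared Euclidean distance between every memory position and every query position
    affinity = -(k_m.pow(2).sum(1).unsqueeze(2)                 # (B, N_m, 1)
                 - 2 * torch.bmm(k_m.transpose(1, 2), k_c)      # (B, N_m, N_q)
                 + k_c.pow(2).sum(1).unsqueeze(1))              # (B, 1, N_q)
    weights = F.softmax(affinity, dim=1)                        # normalise over memory positions
    return torch.bmm(f_m, weights)                              # f_M' with the query's spatial size
```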
(3-g) inputting f_M' obtained in step (3-f) sequentially into the convolution layer, the multi-scale layer and the self-attention module of the Transformer encoder to obtain M_1, M_2 and M_3, respectively; inputting f_C obtained in step (3-a) sequentially into the first convolution layer, the second convolution layer, the multi-scale layer and the self-attention module of the Transformer decoder constructed in step (3-d) to obtain C_0, C_1, C_2 and C_3, respectively; inputting M_3 and C_3 into the query branch of the dual-branch cross attention module in the Transformer decoder constructed in step (3-d) to obtain C_4; inputting M_3 and C_3 into the storage branch of the dual-branch cross attention module in the Transformer decoder constructed in step (3-d) to obtain M_4;
(3-h) concatenating C_0, C_4 and M_4 obtained in step (3-g) with f_M' obtained in step (3-f) and inputting the result into the residual module of the segmentation decoder constructed in step (3-e) to obtain f_O1; inputting f_O1 together with f_C3 obtained in step (3-a) into the first up-sampling module of the segmentation decoder constructed in step (3-e) to obtain f_O2; inputting f_O2 together with f_C2 obtained in step (3-a) into the second up-sampling module of the segmentation decoder constructed in step (3-e) to obtain f_O3; inputting f_O3 into the prediction convolution module of the segmentation decoder constructed in step (3-e) to obtain the predicted segmentation label f_O4, completing the construction of the segmentation model;
(4) constructing a loss function:
the loss function L uses cross-entropy loss, defined as follows:
L = −(1/(H_Y·W_Y)) Σ_{i=1}^{H_Y} Σ_{j=1}^{W_Y} [ Y_ij·log(Ŷ_ij) + (1 − Y_ij)·log(1 − Ŷ_ij) ]
where Y is the true segmentation label, Ŷ is the predicted segmentation label, H_Y and W_Y are the height and width of the true segmentation label, Y_ij is the pixel value of the pixel in row i and column j of Y, Ŷ_ij is the pixel value of the pixel in row i and column j of Ŷ, i = 1, 2, …, H_Y, j = 1, 2, …, W_Y, and log(·) denotes taking the natural logarithm;
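Assuming the prediction is already a foreground probability map of the same size as the label, this pixel-wise cross-entropy could be written as in the following sketch:

```python
import torch

def segmentation_loss(pred, target, eps=1e-7):
    """Pixel-wise binary cross-entropy between the predicted segmentation map and
    the true label. pred, target: tensors of shape (H_Y, W_Y) with values in [0, 1]."""
    pred = pred.clamp(eps, 1.0 - eps)                   # avoid log(0)
    loss = -(target * torch.log(pred) + (1.0 - target) * torch.log(1.0 - pred))
    return loss.mean()                                  # average over the H_Y * W_Y pixels
```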
(5) training a segmentation model:
training the segmentation model constructed in step (3) with the synthesized video training set obtained in step (2-a), obtaining the loss value from the loss function constructed in step (4), and updating the parameters of the segmentation model by stochastic gradient descent to obtain a pre-trained model; then training the pre-trained model with the real video training set obtained in step (2-b), obtaining the loss value from the loss function constructed in step (4), and updating the parameters of the segmentation model by stochastic gradient descent to obtain the trained segmentation model;
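The two-stage training procedure could be organized as in the sketch below; the model's calling convention, the learning rate, momentum and epoch count are assumptions for illustration, and segmentation_loss refers to the sketch above.

```python
import torch

def train_stage(model, loader, epochs=10, lr=1e-5):
    """One training stage with stochastic gradient descent. Each batch is assumed to
    yield three frames with labels; the first pair serves as the reference."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for frames, labels in loader:
            pred = model(frames, labels[:, 0])          # assumed model interface
            loss = segmentation_loss(pred, labels[:, 1:])
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# stage 1: pre-train on the synthesized video training set of step (2-a)
# stage 2: fine-tune the pre-trained model on the real video training set of step (2-b)
# model = train_stage(model, synthetic_loader); model = train_stage(model, real_loader)
```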
(6) video object segmentation:
acquiring the video to be segmented and the segmentation label corresponding to its first frame, inputting the frames of the video one by one, starting from the second frame, into the trained segmentation model obtained in step (5), and outputting the segmentation labels.
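The frame-by-frame inference and the periodic update of the storage pool described in steps (3-b) and (6) could be organized as in the following sketch, where the model's calling convention is again an assumption:

```python
import torch

@torch.no_grad()
def segment_video(model, frames, first_label, tau=5):
    """frames: list of video frames; first_label: segmentation label of the first frame.
    Every tau-th frame (the 1st, (tau+1)-th, (2*tau+1)-th, ...) joins the storage pool."""
    memory_frames, memory_labels = [frames[0]], [first_label]
    outputs = [first_label]
    for t in range(1, len(frames)):                     # start from the second frame
        pred = model(frames[t], memory_frames, memory_labels)   # assumed model interface
        outputs.append(pred)
        if t % tau == 0:                                # frame number t+1 = k*tau + 1
            memory_frames.append(frames[t])
            memory_labels.append(pred)
    return outputs
```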
The multi-scale layer in the step (3-d) is composed of t convolution layers with different convolution kernel sizes, and the output calculation process is as follows:
MSL = Concat(S_1, …, S_i, …, S_t)
S_i = Conv_i(X_i; r_i)
where MSL denotes the output of the multi-scale layer, Concat(·) denotes the concatenation operation, t denotes the number of convolution layers in the multi-scale layer and is a positive integer with value range [1, 50], Conv_i denotes the i-th convolution layer in the multi-scale layer, r_i denotes the convolution kernel size of the i-th convolution layer and is a positive integer with value range [1, 100], X_i denotes the input of the i-th convolution layer of the multi-scale layer, and S_i denotes the output of the i-th convolution layer of the multi-scale layer; for the multi-scale layer in the Transformer encoder, X_i = M_1, and for the multi-scale layer in the Transformer decoder, X_i = C_1, i = 1, 2, …, t.
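A minimal sketch of such a multi-scale layer is given below; the per-branch output channel count, the odd kernel sizes and the symmetric padding used to keep all branches at the same spatial size are assumptions (the even kernel sizes 2 and 4 used in the embodiment would require asymmetric padding instead).

```python
import torch
import torch.nn as nn

class MultiScaleLayer(nn.Module):
    """t parallel convolution layers with different kernel sizes applied to the same
    input; their outputs S_1..S_t are concatenated along the channel dimension."""
    def __init__(self, in_channels, out_channels, kernel_sizes=(3, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_channels, out_channels, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):                # x = M_1 in the encoder, C_1 in the decoder
        return torch.cat([conv(x) for conv in self.convs], dim=1)
```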
The output of the multi-head attention module in step (3-d) is calculated as:
MultiHead(q, k, v) = Concat(A_1, …, A_i, …, A_s)·U_o
A_i = softmax((q·U_i^q)·(k·U_i^k)^T / √d)·(v·U_i^v)
where MultiHead(q, k, v) denotes the output of the multi-head attention module, softmax(·) denotes the activation function, and Concat(·) denotes the concatenation operation; q, k and v are the inputs of the multi-head attention module: for the multi-head attention module inside the self-attention module of the Transformer encoder, q = k = v = M_2; for the multi-head attention module inside the self-attention module of the Transformer decoder, q = k = v = C_2; for the multi-head attention module inside the storage branch of the dual-branch cross attention module, q = M_3 and k = v = C_3; for the multi-head attention module inside the query branch, q = C_3 and k = v = M_3; s is the number of attention heads in the multi-head attention module, a positive integer with value range [1, 16]; A_i denotes the output of the i-th attention head, i = 1, 2, …, s; U_i^q, U_i^k and U_i^v denote the q, k and v parameter matrices of the i-th attention head, U_o is the parameter matrix that adjusts the final output, T denotes the transpose operator, and d is a hyperparameter, a positive integer with value range [1, 1000].
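The two formulas above can be written out directly as in the following sketch, where the per-head projection matrices U_i^q, U_i^k, U_i^v and the output matrix U_o are passed in as plain tensors:

```python
import math
import torch

def multi_head_attention(q, k, v, U_q, U_k, U_v, U_o, d):
    """q: (N_q, D), k, v: (N_kv, D); U_q, U_k, U_v: lists of s matrices of shape (D, d);
    U_o: matrix of shape (s * d, D). Implements A_i = softmax(q U_i^q (k U_i^k)^T / sqrt(d)) v U_i^v
    and MultiHead(q, k, v) = Concat(A_1, ..., A_s) U_o."""
    heads = []
    for Uq_i, Uk_i, Uv_i in zip(U_q, U_k, U_v):
        attn = torch.softmax((q @ Uq_i) @ (k @ Uk_i).t() / math.sqrt(d), dim=-1)
        heads.append(attn @ (v @ Uv_i))                 # A_i, shape (N_q, d)
    return torch.cat(heads, dim=-1) @ U_o               # output of shape (N_q, D)
```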
In step (3-b), preferably τ is 5.
In step (3-d), the number t of convolutional layers in the multi-scale layer is preferably 2, the number s of attention heads in the multi-head attention module is preferably 8, and the hyper-parameter d is preferably 32.
Compared with the prior art, the invention has the following advantages:
(1) the space-time integration module provided by the invention can integrate and compress space-time information and reduce the calculation cost.
(2) The multi-scale layer introduced by the invention provides cross-scale input features for the Transformer, helping the network learn scale-invariant feature representations and improving the segmentation accuracy of the network for small-scale targets.
(3) The two branches in the dual-branch cross attention module constructed by the invention attend to different characteristics of the target information, thereby improving the segmentation accuracy of target details.
Drawings
FIG. 1 is a flowchart of the Transformer-based semi-supervised video object segmentation method according to an embodiment of the present invention;
FIG. 2 is a diagram of a segmentation model architecture in accordance with an embodiment of the present invention;
FIG. 3 is a diagram of a Transformer module architecture according to an embodiment of the present invention;
FIG. 4 compares the segmentation results of an embodiment of the present invention on a small-scale target with those of other methods;
FIG. 5 compares the segmentation results of an embodiment of the present invention on similar targets with those of other methods.
Detailed Description
The following describes specific embodiments of the present invention;
example 1
Fig. 1 is a flowchart of the Transformer-based semi-supervised video object segmentation method according to an embodiment of the present invention, which includes the following specific steps:
step 1, acquiring a data set and a segmentation label
Obtain a video target segmentation data set, a static image data set and the segmentation labels corresponding to the two data sets, and form an image pair from each image in the data sets and its corresponding segmentation label.
Step 2, data expansion and processing
(2-a) after normalization processing is carried out on each image pair consisting of the static image data set obtained in the step (1) and the corresponding segmentation labels, repeating the following processes to obtain a synthesized video training sample corresponding to each image pair, wherein the synthesized video training set is formed by a set of synthesized video training samples:
I. reducing the short edge of the image pair to w pixels and scaling the long edge by the same ratio, then randomly cropping the obtained image pair to a size of h × w pixels, where w is the width of the cropped image, h is the height of the cropped image, and both w and h are positive integers with value range [10, 3000]; in this embodiment, w = 384 and h = 384;
II. sequentially applying random scaling, random horizontal flipping, random color jittering and random grayscale conversion to the cropped image pair to obtain an enhanced image pair corresponding to the image pair;
III. repeating process II three times to obtain three enhanced image pairs corresponding to the image pair, where the three enhanced image pairs form one synthetic video training sample;
(2-b) after normalization processing is carried out on each video and the corresponding segmentation labels in the video target segmentation data set obtained in the step (1), repeating the following processes to obtain a real video training sample corresponding to each video, wherein the real video training set is formed by a set of real video training samples:
I. randomly extracting three image pairs from the video and the corresponding segmentation labels;
II. reducing the short edges of the three image pairs to w pixels and scaling the long edges by the same ratio, then randomly cropping the obtained three image pairs to a size of h × w pixels, where the meanings and values of w and h are the same as in step (2-a);
III. sequentially applying random cropping, color jittering and random grayscale conversion to the three image pairs to obtain three enhanced image pairs, where the three enhanced image pairs form one real video training sample.
Step 3, constructing a segmentation model
Fig. 2 is a diagram illustrating a segmentation model structure according to an embodiment of the present invention, which includes the following steps:
(3-a) construction of the query encoder
Using the convolutional neural network ResNet50 as the query encoder, the frame to be segmented passes sequentially through the first four layers of the encoder, where the output of the second layer is f_C2, the output of the third layer is f_C3, and the output of the fourth layer is f_C;
(3-b) construction of storage pool
Putting the 1st frame, the (τ+1)-th frame, the (2τ+1)-th frame, ……, the (Nτ+1)-th frame of the video sequence and their corresponding segmentation labels into the storage pool, where τ is a positive integer with value range [1, 200],
N = ⌊(τ_C − 1)/τ⌋, τ_C is the relative position of the frame to be segmented, and ⌊·⌋ denotes rounding down; this embodiment takes τ = 5;
(3-c) construction of storage encoder
Using the convolutional neural network ResNet18 as the storage encoder, all images in the storage pool and their corresponding segmentation labels are passed through the encoder to obtain f_M;
(3-d) construction of Transformer Module
FIG. 3 shows the structure of the Transformer module according to an embodiment of the present invention. The module consists of a Transformer encoder and a Transformer decoder. The Transformer encoder comprises a space-time integration module, a convolution layer with 256 convolution kernels of size 1, a multi-scale layer and a self-attention module. The Transformer decoder comprises a convolution layer with 512 convolution kernels of size 3, a convolution layer with 256 convolution kernels of size 1, a multi-scale layer, a self-attention module and a dual-branch cross attention module; the dual-branch cross attention module consists of a query branch and a storage branch, the two branches have the same structure, each consisting of a multi-head attention module, two residual-and-layer-normalization modules and a fully connected feed-forward network, and the fully connected feed-forward network consists of two linear layers with input dimension 256 and hidden-layer size 2048 and a ReLU activation layer. The self-attention modules in the Transformer encoder and the Transformer decoder have exactly the same structure and consist of a multi-head attention module and a residual-and-layer-normalization module; all multi-head attention modules have the same structure;
the multi-scale layer in the step (3-d) is composed of t convolution layers with different convolution kernel sizes, and the output calculation process is as follows:
MSL = Concat(S_1, …, S_i, …, S_t)
S_i = Conv_i(X_i; r_i)
where MSL denotes the output of the multi-scale layer, Concat(·) denotes the concatenation operation, t denotes the number of convolution layers in the multi-scale layer and is a positive integer with value range [1, 50], in this embodiment t = 2; Conv_i denotes the i-th convolution layer in the multi-scale layer, r_i denotes the convolution kernel size of the i-th convolution layer and is a positive integer with value range [1, 100], in this embodiment r_1 = 2 and r_2 = 4; X_i denotes the input of the i-th convolution layer of the multi-scale layer, and S_i denotes the output of the i-th convolution layer of the multi-scale layer; for the multi-scale layer in the Transformer encoder, X_i = M_1, and for the multi-scale layer in the Transformer decoder, X_i = C_1, i = 1, 2, …, t.
The output of the multi-head attention module in step (3-d) is calculated as:
MultiHead(q, k, v) = Concat(A_1, …, A_i, …, A_s)·U_o
A_i = softmax((q·U_i^q)·(k·U_i^k)^T / √d)·(v·U_i^v)
where MultiHead(q, k, v) denotes the output of the multi-head attention module, softmax(·) denotes the activation function, and Concat(·) denotes the concatenation operation; q, k and v are the inputs of the multi-head attention module: for the multi-head attention module inside the self-attention module of the Transformer encoder, q = k = v = M_2; for the multi-head attention module inside the self-attention module of the Transformer decoder, q = k = v = C_2; for the multi-head attention module inside the storage branch of the dual-branch cross attention module, q = M_3 and k = v = C_3; for the multi-head attention module inside the query branch, q = C_3 and k = v = M_3; s is the number of attention heads in the multi-head attention module, a positive integer with value range [1, 16], in this embodiment s = 8; A_i denotes the output of the i-th attention head, i = 1, 2, …, s; U_i^q, U_i^k and U_i^v denote the q, k and v parameter matrices of the i-th attention head, U_o is the parameter matrix that adjusts the final output, T denotes the transpose operator, and d is a hyperparameter, a positive integer with value range [1, 1000], in this embodiment d = 32;
(3-e) construction of the segmentation decoder
The segmentation decoder consists of a residual module, two up-sampling modules and a prediction convolution module; the residual module consists of convolution layers with 512 convolution kernels of size 3 and two ReLU activation layers, the first up-sampling module consists of convolution layers with 512 convolution kernels of size 3, convolution layers with 256 convolution kernels of size 3, two ReLU activation layers and a bilinear interpolation with an upscaling factor of 2, the second up-sampling module consists of convolution layers with 256 convolution kernels of size 3, two ReLU activation layers and a bilinear interpolation with an upscaling factor of 2, and the prediction convolution module consists of a convolution layer with 256 convolution kernels of size 1 and a bilinear interpolation with an upscaling factor of 2;
(3-f) inputting f_C obtained in step (3-a) and f_M obtained in step (3-c) into the space-time integration module of the Transformer encoder constructed in step (3-d) to obtain f_M'; the specific calculation process is as follows:
f_M' = f_M · softmax(c(ConvKey(f_M), ConvKey(f_C)))
where ConvKey(·) is a key projection layer composed of a convolution layer with 64 convolution kernels of size 3, c(·,·) denotes the negative squared Euclidean distance, and softmax(·) denotes the activation function;
(3-g) inputting f_M' obtained in step (3-f) sequentially into the convolution layer, the multi-scale layer and the self-attention module of the Transformer encoder to obtain M_1, M_2 and M_3, respectively; inputting f_C obtained in step (3-a) sequentially into the first convolution layer, the second convolution layer, the multi-scale layer and the self-attention module of the Transformer decoder constructed in step (3-d) to obtain C_0, C_1, C_2 and C_3, respectively; inputting M_3 and C_3 into the query branch of the dual-branch cross attention module in the Transformer decoder constructed in step (3-d) to obtain C_4; inputting M_3 and C_3 into the storage branch of the dual-branch cross attention module in the Transformer decoder constructed in step (3-d) to obtain M_4;
(3-h) concatenating C_0, C_4 and M_4 obtained in step (3-g) with f_M' obtained in step (3-f) and inputting the result into the residual module of the segmentation decoder constructed in step (3-e) to obtain f_O1; inputting f_O1 together with f_C3 obtained in step (3-a) into the first up-sampling module of the segmentation decoder constructed in step (3-e) to obtain f_O2; inputting f_O2 together with f_C2 obtained in step (3-a) into the second up-sampling module of the segmentation decoder constructed in step (3-e) to obtain f_O3; inputting f_O3 into the prediction convolution module of the segmentation decoder constructed in step (3-e) to obtain the predicted segmentation label f_O4, completing the construction of the segmentation model.
(4) Constructing a loss function:
the loss function L uses cross-entropy loss, defined as follows:
L = −(1/(H_Y·W_Y)) Σ_{i=1}^{H_Y} Σ_{j=1}^{W_Y} [ Y_ij·log(Ŷ_ij) + (1 − Y_ij)·log(1 − Ŷ_ij) ]
where Y is the true segmentation label, Ŷ is the predicted segmentation label, H_Y and W_Y are the height and width of the true segmentation label, Y_ij is the pixel value of the pixel in row i and column j of Y, Ŷ_ij is the pixel value of the pixel in row i and column j of Ŷ, i = 1, 2, …, H_Y, j = 1, 2, …, W_Y, and log(·) denotes taking the natural logarithm.
(5) Training a segmentation model:
Training the segmentation model constructed in step (3) with the synthesized video training set obtained in step (2-a), obtaining the loss value from the loss function constructed in step (4), and updating the parameters of the segmentation model by stochastic gradient descent to obtain a pre-trained model; then training the pre-trained model with the real video training set obtained in step (2-b), obtaining the loss value from the loss function constructed in step (4), and updating the parameters of the segmentation model by stochastic gradient descent to obtain the trained segmentation model.
(6) Video object segmentation:
Acquiring the video to be segmented and the segmentation label corresponding to its first frame, inputting the frames of the video one by one, starting from the second frame, into the trained segmentation model obtained in step (5), and outputting the segmentation labels.
Example 2
Video object segmentation experiments were performed on the public data sets YouTubeVOS 2018, YouTubeVOS 2019, DAVIS 2016 and DAVIS 2017 using the method of Example 1. The experiments ran on Linux Ubuntu 18.04; the method was implemented with the PyTorch 1.8.1 framework based on CUDA 10.1 and cuDNN 7.6.5, and was trained and tested on two NVIDIA 2080Ti 11 GB GPUs.
This embodiment uses the region similarity J, the contour accuracy F and their average J&F to evaluate the performance of the invention. The region similarity J is the mean intersection-over-union between the estimated label and the corresponding true label, calculated as:
J = |M ∩ G| / |M ∪ G|
where M is the predicted segmentation label, G is the true segmentation label, and the symbols ∩ and ∪ denote the intersection and union of the two sets, respectively.
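For a single frame, computing J amounts to the intersection-over-union of two binary masks, as in this short sketch:

```python
import numpy as np

def region_similarity(pred, gt):
    """Region similarity J between a predicted and a true binary segmentation mask."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:                       # both masks empty: define J as 1
        return 1.0
    return np.logical_and(pred, gt).sum() / union
```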
The contour accuracy F measures the average boundary similarity between the boundary of the estimated label and the boundary of its true label, calculated as:
F = 2·P_c·R_c / (P_c + R_c)
where P_c is the precision between l(M) and l(G), R_c is the recall between l(M) and l(G), and P_c and R_c are obtained by bipartite graph matching; l(M) denotes the set of closed contours of the predicted segmentation label M, and l(G) denotes the set of closed contours of the true segmentation label G.
Table 1, Table 2, Table 3 and Table 4 show the J, F and J&F results on the test sample sets of YouTubeVOS 2018, YouTubeVOS 2019, DAVIS 2016 and DAVIS 2017, respectively, compared with other methods; the method of the present invention obtains the highest J&F score on all four data sets.
Fig. 4 compares the segmentation results of the embodiment of the present invention on a small-scale target with those of other methods. The first, second, third and fourth rows in the figure show the segmentation results of TransVOS, CFBI+, STCN and the method of the present invention, respectively, on selected frames of video d79729a354 in YouTubeVOS 2018; the number in the upper-left corner of each image is the frame number of the image in the video sequence, the image in each dashed box is an enlarged view of the small region in the corresponding solid box, and the first column is the first frame of the video and its true segmentation label. The first row, second column shows that the TransVOS method is disturbed by the background and wrongly identifies clutter as the distant person; the second row, second and third columns show that the CFBI+ method fails to segment the person's legs and arms correctly; the third row, second, third, fourth and fifth columns show that the STCN method confuses the distant person with a vehicle in the background. The fourth row shows that the method of the present invention segments the small-scale target correctly, and its segmentation result is superior to those of the other methods.
Fig. 5 compares the segmentation results of the embodiment of the present invention on similar targets with those of other methods. The first, second, third and fourth rows in the figure show the segmentation results of TransVOS, CFBI+, STCN and the method of the present invention, respectively, on selected frames of video 5d2020eff8 in YouTubeVOS 2018; the number in the upper-left corner of each image is the frame number of the image in the video sequence, the targets marked by solid boxes are erroneous segmentation results, and the first column is the first frame of the video and its true segmentation label. It can be seen that STCN, CFBI+ and TransVOS all confuse the fish to be segmented with fish of similar appearance in the background, whereas the method of the present invention successfully distinguishes the segmentation target from similar objects in the background.
The above-mentioned embodiments are merely preferred embodiments of the present invention, and the scope of the present invention is not limited thereby, and all changes made based on the principle of the present invention should be covered within the scope of the present invention.
TABLE 1
[J, F and J&F comparison with other methods on the YouTubeVOS 2018 test set; the table is reproduced as an image in the original document.]
(Note: J_S and F_S in the table denote the J and F values for categories that appear in both the YouTubeVOS 2018 training set and test set; J_U and F_U denote the J and F values for categories that appear only in the YouTubeVOS 2018 test set but not in the YouTubeVOS 2018 training set.)
TABLE 2
[J, F and J&F comparison with other methods on the YouTubeVOS 2019 test set; the table is reproduced as an image in the original document.]
(Note: J_S and F_S in the table denote the J and F values for categories that appear in both the YouTubeVOS 2019 training set and test set; J_U and F_U denote the J and F values for categories that appear only in the YouTubeVOS 2019 test set but not in the YouTubeVOS 2019 training set.)
TABLE 3
[J, F and J&F comparison with other methods on the DAVIS 2016 test set; the table is reproduced as an image in the original document.]
TABLE 4
[J, F and J&F comparison with other methods on the DAVIS 2017 test set; the table is reproduced as an image in the original document.]

Claims (3)

1. A semi-supervised video object segmentation method based on a Transformer is characterized by comprising the following steps:
(1) acquiring a data set and a segmentation label:
acquiring a video target segmentation data set, a static image data set and segmentation labels corresponding to the two data sets, and forming an image pair by each image in the data sets and the corresponding segmentation labels;
(2) data expansion and processing, which specifically comprises the following steps:
(2-a) after normalizing each image pair consisting of the static image data set obtained in the step (1) and the corresponding segmentation label, repeating the following processes to obtain a synthesized video training sample corresponding to each image pair, wherein the synthesized video training set is formed by a set of synthesized video training samples:
I. reducing the short edge of the image pair to w pixels and scaling the long edge by the same ratio, then randomly cropping the obtained image pair to a size of h × w pixels, where w is the width of the cropped image, h is the height of the cropped image, and w and h are positive integers with value range [10, 3000];
II. sequentially applying random scaling, random horizontal flipping, random color jittering and random grayscale conversion to the cropped image pair to obtain an enhanced image pair corresponding to the image pair;
III. repeating process II three times to obtain three enhanced image pairs corresponding to the image pair, where the three enhanced image pairs form one synthetic video training sample;
(2-b) after normalization processing is carried out on each video and the corresponding segmentation labels in the video target segmentation data set obtained in the step (1), repeating the following processes to obtain a real video training sample corresponding to each video, wherein the real video training set is formed by a set of real video training samples:
I. randomly extracting three image pairs from the video and the corresponding segmentation labels;
II. reducing the short edges of the three image pairs to w pixels and scaling the long edges by the same ratio, then randomly cropping the obtained three image pairs to a size of h × w pixels, where the meanings and values of w and h are the same as in step (2-a);
III. sequentially applying random cropping, color jittering and random grayscale conversion to the three image pairs to obtain three enhanced image pairs, where the three enhanced image pairs form one real video training sample;
(3) constructing a segmentation model, which specifically comprises the following steps:
(3-a) construction of the query encoder
Using a convolutional neural network as the query encoder, the frame to be segmented passes sequentially through the first four layers of the encoder, where the output of the second layer is f_C2, the output of the third layer is f_C3, and the output of the fourth layer is f_C;
(3-b) construction of storage pool
Putting the 1st frame, the (τ+1)-th frame, the (2τ+1)-th frame, ……, the (Nτ+1)-th frame of the video sequence and their corresponding segmentation labels into the storage pool, where τ is a positive integer with value range [1, 200],
N = ⌊(τ_C − 1)/τ⌋, τ_C is the relative position of the frame to be segmented, and ⌊·⌋ denotes rounding down;
(3-c) construction of storage encoder
Using a convolutional neural network as the storage encoder, all images in the storage pool and their corresponding segmentation labels are passed through the encoder to obtain f_M;
(3-d) construction of Transformer Module
The module consists of a Transformer encoder and a Transformer decoder. The Transformer encoder comprises a space-time integration module, a convolution layer, a multi-scale layer and a self-attention module. The Transformer decoder comprises two convolution layers, a multi-scale layer, a self-attention module and a dual-branch cross attention module; the dual-branch cross attention module consists of a query branch and a storage branch, the two branches have the same structure, each consisting of a multi-head attention module, two residual-and-layer-normalization modules and a fully connected feed-forward network, and the fully connected feed-forward network consists of two linear layers and a ReLU activation layer. The self-attention modules in the Transformer encoder and the Transformer decoder have exactly the same structure and consist of a multi-head attention module and a residual-and-layer-normalization module; all multi-head attention modules have the same structure;
(3-e) construction of the segmentation decoder
The segmentation decoder consists of a residual module, two up-sampling modules and a prediction convolution module; the residual module consists of three convolution layers and two ReLU activation layers, the first up-sampling module consists of four convolution layers, two ReLU activation layers and one bilinear interpolation, the second up-sampling module consists of three convolution layers, two ReLU activation layers and one bilinear interpolation, and the prediction convolution module consists of one convolution layer and one bilinear interpolation;
(3-f) inputting f_C obtained in step (3-a) and f_M obtained in step (3-c) into the space-time integration module of the Transformer encoder constructed in step (3-d) to obtain f_M'; the specific calculation process is as follows:
f_M' = f_M · softmax(c(ConvKey(f_M), ConvKey(f_C)))
where ConvKey(·) is a key projection layer composed of one convolution layer, c(·,·) denotes the negative squared Euclidean distance, and softmax(·) denotes the activation function;
(3-g) inputting f_M' obtained in step (3-f) sequentially into the convolution layer, the multi-scale layer and the self-attention module of the Transformer encoder to obtain M_1, M_2 and M_3, respectively; inputting f_C obtained in step (3-a) sequentially into the first convolution layer, the second convolution layer, the multi-scale layer and the self-attention module of the Transformer decoder constructed in step (3-d) to obtain C_0, C_1, C_2 and C_3, respectively; inputting M_3 and C_3 into the query branch of the dual-branch cross attention module in the Transformer decoder constructed in step (3-d) to obtain C_4; inputting M_3 and C_3 into the storage branch of the dual-branch cross attention module in the Transformer decoder constructed in step (3-d) to obtain M_4;
(3-h) concatenating C_0, C_4 and M_4 obtained in step (3-g) with f_M' obtained in step (3-f) and inputting the result into the residual module of the segmentation decoder constructed in step (3-e) to obtain f_O1; inputting f_O1 together with f_C3 obtained in step (3-a) into the first up-sampling module of the segmentation decoder constructed in step (3-e) to obtain f_O2; inputting f_O2 together with f_C2 obtained in step (3-a) into the second up-sampling module of the segmentation decoder constructed in step (3-e) to obtain f_O3; inputting f_O3 into the prediction convolution module of the segmentation decoder constructed in step (3-e) to obtain the predicted segmentation label f_O4, completing the construction of the segmentation model;
(4) constructing a loss function:
the loss function L uses cross-entropy loss, defined as follows:
L = −(1/(H_Y·W_Y)) Σ_{i=1}^{H_Y} Σ_{j=1}^{W_Y} [ Y_ij·log(Ŷ_ij) + (1 − Y_ij)·log(1 − Ŷ_ij) ]
where Y is the true segmentation label, Ŷ is the predicted segmentation label, H_Y and W_Y are the height and width of the true segmentation label, Y_ij is the pixel value of the pixel in row i and column j of Y, Ŷ_ij is the pixel value of the pixel in row i and column j of Ŷ, i = 1, 2, …, H_Y, j = 1, 2, …, W_Y, and log(·) denotes taking the natural logarithm;
(5) training a segmentation model:
training the segmentation model constructed in step (3) with the synthesized video training set obtained in step (2-a), obtaining the loss value from the loss function constructed in step (4), and updating the parameters of the segmentation model by stochastic gradient descent to obtain a pre-trained model; then training the pre-trained model with the real video training set obtained in step (2-b), obtaining the loss value from the loss function constructed in step (4), and updating the parameters of the segmentation model by stochastic gradient descent to obtain the trained segmentation model;
(6) video object segmentation:
acquiring the video to be segmented and the segmentation label corresponding to its first frame, inputting the frames of the video one by one, starting from the second frame, into the trained segmentation model obtained in step (5), and outputting the segmentation labels.
2. The Transformer-based semi-supervised video object segmentation method of claim 1, wherein the multi-scale layer in step (3-d) is composed of t convolution layers with different convolution kernel sizes, and the output of the multi-scale layer is calculated as:
MSL = Concat(S_1, …, S_i, …, S_t)
S_i = Conv_i(X_i; r_i)
where MSL denotes the output of the multi-scale layer, Concat(·) denotes the concatenation operation, t denotes the number of convolution layers in the multi-scale layer and is a positive integer with value range [1, 50], Conv_i denotes the i-th convolution layer in the multi-scale layer, r_i denotes the convolution kernel size of the i-th convolution layer and is a positive integer with value range [1, 100], X_i denotes the input of the i-th convolution layer of the multi-scale layer, and S_i denotes the output of the i-th convolution layer of the multi-scale layer; for the multi-scale layer in the Transformer encoder, X_i = M_1, and for the multi-scale layer in the Transformer decoder, X_i = C_1, i = 1, 2, …, t.
3. The Transformer-based semi-supervised video object segmentation method of claim 1, wherein the output of the multi-head attention module in step (3-d) is calculated as:
MultiHead(q, k, v) = Concat(A_1, …, A_i, …, A_s)·U_o
A_i = softmax((q·U_i^q)·(k·U_i^k)^T / √d)·(v·U_i^v)
where MultiHead(q, k, v) denotes the output of the multi-head attention module, softmax(·) denotes the activation function, and Concat(·) denotes the concatenation operation; q, k and v are the inputs of the multi-head attention module: for the multi-head attention module inside the self-attention module of the Transformer encoder, q = k = v = M_2; for the multi-head attention module inside the self-attention module of the Transformer decoder, q = k = v = C_2; for the multi-head attention module inside the storage branch of the dual-branch cross attention module, q = M_3 and k = v = C_3; for the multi-head attention module inside the query branch, q = C_3 and k = v = M_3; s is the number of attention heads in the multi-head attention module, a positive integer with value range [1, 16]; A_i denotes the output of the i-th attention head, i = 1, 2, …, s; U_i^q, U_i^k and U_i^v denote the q, k and v parameter matrices of the i-th attention head, U_o is the parameter matrix that adjusts the final output, T denotes the transpose operator, and d is a hyperparameter, a positive integer with value range [1, 1000].
CN202210098849.1A 2022-01-24 2022-01-24 Transformer-based semi-supervised video object segmentation method Active CN114429607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210098849.1A CN114429607B (en) 2022-01-24 2022-01-24 Transformer-based semi-supervised video object segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210098849.1A CN114429607B (en) 2022-01-24 2022-01-24 Transformer-based semi-supervised video object segmentation method

Publications (2)

Publication Number Publication Date
CN114429607A true CN114429607A (en) 2022-05-03
CN114429607B CN114429607B (en) 2024-03-29

Family

ID=81313102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210098849.1A Active CN114429607B (en) 2022-01-24 2022-01-24 Transformer-based semi-supervised video object segmentation method

Country Status (1)

Country Link
CN (1) CN114429607B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116012388A (en) * 2023-03-28 2023-04-25 中南大学 Three-dimensional medical image segmentation method and imaging method for acute ischemic cerebral apoplexy

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685066A (en) * 2018-12-24 2019-04-26 中国矿业大学(北京) A kind of mine object detection and recognition method based on depth convolutional neural networks
CN110378288A (en) * 2019-07-19 2019-10-25 合肥工业大学 A kind of multistage spatiotemporal motion object detection method based on deep learning
WO2019218136A1 (en) * 2018-05-15 2019-11-21 深圳大学 Image segmentation method, computer device, and storage medium
CN111523410A (en) * 2020-04-09 2020-08-11 哈尔滨工业大学 Video saliency target detection method based on attention mechanism
CN111860162A (en) * 2020-06-17 2020-10-30 上海交通大学 Video crowd counting system and method
CN112036300A (en) * 2020-08-31 2020-12-04 合肥工业大学 Moving target detection method based on multi-scale space-time propagation layer
US20210150727A1 (en) * 2019-11-19 2021-05-20 Samsung Electronics Co., Ltd. Method and apparatus with video segmentation
CN113570610A (en) * 2021-07-26 2021-10-29 北京百度网讯科技有限公司 Method and device for performing target segmentation on video by adopting semantic segmentation model

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019218136A1 (en) * 2018-05-15 2019-11-21 深圳大学 Image segmentation method, computer device, and storage medium
CN109685066A (en) * 2018-12-24 2019-04-26 中国矿业大学(北京) A kind of mine object detection and recognition method based on depth convolutional neural networks
CN110378288A (en) * 2019-07-19 2019-10-25 合肥工业大学 A kind of multistage spatiotemporal motion object detection method based on deep learning
US20210150727A1 (en) * 2019-11-19 2021-05-20 Samsung Electronics Co., Ltd. Method and apparatus with video segmentation
CN111523410A (en) * 2020-04-09 2020-08-11 哈尔滨工业大学 Video saliency target detection method based on attention mechanism
CN111860162A (en) * 2020-06-17 2020-10-30 上海交通大学 Video crowd counting system and method
CN112036300A (en) * 2020-08-31 2020-12-04 合肥工业大学 Moving target detection method based on multi-scale space-time propagation layer
CN113570610A (en) * 2021-07-26 2021-10-29 北京百度网讯科技有限公司 Method and device for performing target segmentation on video by adopting semantic segmentation model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZONGXIN YANG et al.: "Collaborative Video Object Segmentation by Multi-Scale Foreground-Background Integration", IEEE Transactions on Pattern Analysis and Machine Intelligence, 18 May 2021 (2021-05-18), pages 4701, XP011916332, DOI: 10.1109/TPAMI.2021.3081597 *
LI Dong: "Research on Video Action Recognition and Detection Methods Based on Spatio-temporal Correlation", China Doctoral Dissertations Full-text Database (Information Science and Technology), 15 September 2021 (2021-09-15), pages 138-48 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116012388A (en) * 2023-03-28 2023-04-25 中南大学 Three-dimensional medical image segmentation method and imaging method for acute ischemic cerebral apoplexy

Also Published As

Publication number Publication date
CN114429607B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN111489358B (en) Three-dimensional point cloud semantic segmentation method based on deep learning
Johnson et al. Image generation from scene graphs
CN107273800B (en) Attention mechanism-based motion recognition method for convolutional recurrent neural network
CN111783705B (en) Character recognition method and system based on attention mechanism
CN113011427A (en) Remote sensing image semantic segmentation method based on self-supervision contrast learning
CN104517103A (en) Traffic sign classification method based on deep neural network
CN110599502B (en) Skin lesion segmentation method based on deep learning
Hou et al. Object detection in high-resolution panchromatic images using deep models and spatial template matching
CN110349229A (en) A kind of Image Description Methods and device
Nguyen et al. Hybrid deep learning-Gaussian process network for pedestrian lane detection in unstructured scenes
CN115035508A (en) Topic-guided remote sensing image subtitle generation method based on Transformer
CN111696136A (en) Target tracking method based on coding and decoding structure
CN115690152A (en) Target tracking method based on attention mechanism
CN113177503A (en) Arbitrary orientation target twelve parameter detection method based on YOLOV5
CN115131313A (en) Hyperspectral image change detection method and device based on Transformer
CN112905828A (en) Image retriever, database and retrieval method combined with significant features
CN114429607A (en) Transformer-based semi-supervised video object segmentation method
CN115512096A (en) CNN and Transformer-based low-resolution image classification method and system
CN110503090A (en) Character machining network training method, character detection method and character machining device based on limited attention model
Zhao et al. Recognition and Classification of Concrete Cracks under Strong Interference Based on Convolutional Neural Network.
CN114821631A (en) Pedestrian feature extraction method based on attention mechanism and multi-scale feature fusion
Li et al. Bagging R-CNN: Ensemble for Object Detection in Complex Traffic Scenes
Shi et al. A welding defect detection method based on multiscale feature enhancement and aggregation
Park et al. A 2-D HMM method for offline handwritten character recognition
CN117876931A (en) Global feature enhanced semi-supervised video target segmentation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant