CN114429607A - Transformer-based semi-supervised video object segmentation method - Google Patents

Transformer-based semi-supervised video object segmentation method

Info

Publication number
CN114429607A
Authority
CN
China
Prior art keywords
segmentation
module
layer
video
attention module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210098849.1A
Other languages
Chinese (zh)
Other versions
CN114429607B (en)
Inventor
阳春华
周玮
赵于前
张帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202210098849.1A priority Critical patent/CN114429607B/en
Publication of CN114429607A publication Critical patent/CN114429607A/en
Application granted granted Critical
Publication of CN114429607B publication Critical patent/CN114429607B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a Transformer-based semi-supervised video object segmentation method, whose implementation comprises: 1) acquiring a data set and segmentation labels; 2) data expansion and processing; 3) constructing a segmentation model; 4) constructing a loss function; 5) training the segmentation model; 6) segmenting the video object. The invention compresses spatio-temporal information by designing a space-time integration module, introduces multi-scale layers to generate cross-scale input features, and constructs a dual-branch cross attention module that attends to multiple characteristics of the target information. The method can effectively improve the segmentation accuracy for small-scale targets and similar targets while reducing the computational cost.

Description

Transformer-based semi-supervised video object segmentation method
Technical Field
The invention relates to the technical field of image processing, and in particular to a Transformer-based semi-supervised video object segmentation method.
Background
Video object segmentation is an important prerequisite for video understanding, with many potential applications such as video retrieval, video editing and autonomous driving. The goal of semi-supervised video object segmentation is, given an object in the first frame of a video (i.e. its segmentation label), to segment that object in all other frames of the video sequence.
Owing to the strong performance of the Transformer architecture on computer vision tasks such as image classification, object detection, semantic segmentation and object tracking, much current research applies the Transformer architecture to video object segmentation. The Transformer architecture has excellent long-range dependency modeling capability and can effectively mine the spatio-temporal information in a given video, thereby improving segmentation accuracy. However, most Transformer-based methods feed the features of all frames in the storage pool directly into the multi-head attention module, which becomes computationally expensive as the number of segmented frames grows; moreover, the classical Transformer architecture lacks intrinsic inductive bias, so its segmentation accuracy on small-scale targets and similar targets is poor.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art and provides a Transformer-based semi-supervised video object segmentation method.
In order to achieve the purpose, the invention provides the following technical scheme:
a semi-supervised video object segmentation method based on a Transformer comprises the following steps:
(1) acquiring a data set and a segmentation label:
acquiring a video target segmentation data set, a static image data set and segmentation labels corresponding to the two data sets, and forming an image pair by each image in the data sets and the corresponding segmentation labels;
(2) data expansion and processing, which specifically comprises the following steps:
(2-a) after normalization processing is carried out on each image pair consisting of the static image data set obtained in the step (1) and the corresponding segmentation labels, repeating the following processes to obtain a synthesized video training sample corresponding to each image pair, wherein the synthesized video training set is formed by a set of synthesized video training samples:
I. reducing the short edge of the image pair to w pixels and scaling the long edge by the same ratio, then randomly cropping the obtained image pair to a size of h × w pixels, where w is the width of the cropped image, h is the height of the cropped image, and w and h are positive integers with value range [10, 3000];
II. sequentially applying random scaling, random horizontal flipping, random color jittering and random grayscale conversion to the cropped image pair to obtain an enhanced image pair corresponding to the image pair;
III. repeating process II three times to obtain three enhanced image pairs corresponding to the image pair, where the three enhanced image pairs form one synthetic video training sample;
(2-b) after normalization processing is carried out on each video and the corresponding segmentation labels in the video target segmentation data set obtained in the step (1), repeating the following processes to obtain a real video training sample corresponding to each video, wherein the real video training set is formed by a set of real video training samples:
I. randomly extracting three image pairs from the video and the corresponding segmentation labels;
II. reducing the short edges of the three image pairs to w pixels and scaling the long edges by the same ratio, then randomly cropping the obtained three image pairs to a size of h × w pixels, where the meanings and values of w and h are the same as in step (2-a);
III. sequentially applying random cropping, color jittering and random grayscale conversion to the three image pairs to obtain three enhanced image pairs, where the three enhanced image pairs form one real video training sample (an illustrative augmentation sketch follows);
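As an illustration of how one training sample could be produced, the following minimal Python sketch builds a synthetic video training sample from a single image/label pair according to step (2-a); the use of torchvision helpers and the concrete zoom, flip, jitter and grayscale probabilities and ranges are assumptions made only for this example.

```python
import random
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def make_synthetic_sample(image, label, w=384, h=384):
    """Build one synthetic video training sample (three enhanced image/label pairs)
    from a single static image pair, following step (2-a). All augmentation
    parameters below are illustrative assumptions."""
    width, height = image.size                          # PIL images: size = (width, height)
    scale = w / min(width, height)                      # reduce the short edge to w pixels
    new_size = [round(height * scale), round(width * scale)]
    image = TF.resize(image, new_size)
    label = TF.resize(label, new_size, interpolation=InterpolationMode.NEAREST)
    top = random.randint(0, new_size[0] - h)            # one random h x w crop shared by image and label
    left = random.randint(0, new_size[1] - w)
    image, label = TF.crop(image, top, left, h, w), TF.crop(label, top, left, h, w)

    sample = []
    for _ in range(3):                                  # process II repeated three times
        img, lab = image, label
        s = random.uniform(1.0, 1.2)                    # random zoom, then crop back to h x w
        img = TF.resize(img, [round(h * s), round(w * s)])
        lab = TF.resize(lab, [round(h * s), round(w * s)], interpolation=InterpolationMode.NEAREST)
        img, lab = TF.center_crop(img, [h, w]), TF.center_crop(lab, [h, w])
        if random.random() < 0.5:                       # random horizontal flip
            img, lab = TF.hflip(img), TF.hflip(lab)
        img = TF.adjust_brightness(img, random.uniform(0.8, 1.2))   # simple colour jitter
        if random.random() < 0.2:                       # random grayscale conversion
            img = TF.rgb_to_grayscale(img, num_output_channels=3)
        sample.append((img, lab))
    return sample                                       # three enhanced pairs = one training sample
```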
(3) constructing a segmentation model, which specifically comprises the following steps:
(3-a) construction of the query encoder
Using a convolutional neural network as the query encoder, the frame to be segmented passes sequentially through the first four layers of the encoder, where the output of the second layer is f_C2, the output of the third layer is f_C3, and the output of the fourth layer is f_C;
(3-b) construction of storage pool
Putting the 1st frame, the (τ+1)-th frame, the (2τ+1)-th frame, ……, the (Nτ+1)-th frame of the video sequence and their corresponding segmentation labels into the storage pool, where τ is a positive integer with value range [1, 200],
N = ⌊(τ_C − 1)/τ⌋, τ_C is the relative position of the frame to be segmented, and ⌊·⌋ denotes rounding down;
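Under this sampling rule, the indices of the frames kept in the storage pool for a given frame to be segmented follow the pattern sketched below; the helper name and the default τ = 5 are illustrative only.

```python
def memory_frame_indices(tau_c, tau=5):
    """Return the 1-based indices 1, tau+1, 2*tau+1, ..., N*tau+1 of the frames
    kept in the storage pool when segmenting the frame at relative position tau_c,
    with N given by the floor expression above."""
    n = (tau_c - 1) // tau          # N = floor((tau_c - 1) / tau)
    return [k * tau + 1 for k in range(n + 1)]

# e.g. memory_frame_indices(13, tau=5) -> [1, 6, 11]
```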
(3-c) construction of storage encoder
Using a convolutional neural network as the storage encoder, all images in the storage pool and their corresponding segmentation labels are passed through the encoder to obtain f_M;
(3-d) construction of Transformer Module
The module consists of a Transformer encoder and a Transformer decoder. The Transformer encoder comprises a space-time integration module, a convolution layer, a multi-scale layer and a self-attention module. The Transformer decoder comprises two convolution layers, a multi-scale layer, a self-attention module and a dual-branch cross attention module; the dual-branch cross attention module consists of a query branch and a storage branch, the two branches have the same structure, each consisting of a multi-head attention module, two residual-and-layer-normalization modules and a fully connected feed-forward network, and the fully connected feed-forward network consists of two linear layers and a ReLU activation layer. The self-attention modules in the Transformer encoder and the Transformer decoder have exactly the same structure and consist of a multi-head attention module and a residual-and-layer-normalization module; all multi-head attention modules have the same structure;
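As an illustration of this structure, the sketch below shows how one branch of the dual-branch cross attention module could be realized; the use of PyTorch's nn.MultiheadAttention, the post-norm ordering of the residual-and-layer-normalization blocks, and the channel width, head count and feed-forward size (256, 8 and 2048, taken from the embodiment described later) are assumptions rather than the exact patented implementation.

```python
import torch.nn as nn

class CrossAttentionBranch(nn.Module):
    """One branch of the dual-branch cross attention module: a multi-head attention
    module, two residual-and-layer-normalization blocks and a fully connected
    feed-forward network (two linear layers with a ReLU in between).
    Inputs follow nn.MultiheadAttention's default (sequence_length, batch, channels) layout."""
    def __init__(self, dim=256, heads=8, ffn_dim=2048):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, ffn_dim), nn.ReLU(), nn.Linear(ffn_dim, dim))

    def forward(self, q, kv):
        # residual + layer normalization around the cross attention, then around the FFN
        x = self.norm1(q + self.attn(q, kv, kv)[0])
        return self.norm2(x + self.ffn(x))

# query branch:   C4 = query_branch(C3, M3)   (q = C3, k = v = M3)
# storage branch: M4 = store_branch(M3, C3)   (q = M3, k = v = C3)
```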
(3-e) construction of the segmentation decoder
The segmentation decoder consists of a residual module, two up-sampling modules and a prediction convolution module; the residual module consists of three convolution layers and two ReLU activation layers, the first up-sampling module consists of four convolution layers, two ReLU activation layers and one bilinear interpolation, the second up-sampling module consists of three convolution layers, two ReLU activation layers and one bilinear interpolation, and the prediction convolution module consists of one convolution layer and one bilinear interpolation;
(3-f) inputting f_C obtained in step (3-a) and f_M obtained in step (3-c) into the space-time integration module of the Transformer encoder constructed in step (3-d) to obtain f_M'; the specific calculation process is as follows:
f_M' = f_M · softmax(c(ConvKey(f_M), ConvKey(f_C)))
where ConvKey(·) is a key projection layer composed of one convolution layer, c(·,·) denotes the negative squared Euclidean distance, and softmax(·) denotes the activation function;
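A minimal sketch of this space-time integration step is given below; the flattened (batch, channels, positions) tensor layout and the generic key-projection callable are assumptions made only to keep the example self-contained.

```python
import torch
import torch.nn.functional as F

def space_time_integration(f_m, f_c, key_proj):
    """Compress the memory feature f_m against the query feature f_c.
    f_m: (B, C, N_m) flattened memory feature, f_c: (B, C, N_q) flattened query feature,
    key_proj: the ConvKey projection, any callable mapping (B, C, N) -> (B, C_k, N)."""
    k_m, k_c = key_proj(f_m), key_proj(f_c)
    # negative squared Euclidean distance between every memory position and every query position
    affinity = -(k_m.pow(2).sum(1).unsqueeze(2)                 # (B, N_m, 1)
                 - 2 * torch.bmm(k_m.transpose(1, 2), k_c)      # (B, N_m, N_q)
                 + k_c.pow(2).sum(1).unsqueeze(1))              # (B, 1, N_q)
    weights = F.softmax(affinity, dim=1)                        # normalise over memory positions
    return torch.bmm(f_m, weights)                              # f_M' with the query's spatial size
```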
(3-g) inputting f_M' obtained in step (3-f) sequentially into the convolution layer, the multi-scale layer and the self-attention module of the Transformer encoder to obtain M_1, M_2 and M_3, respectively; inputting f_C obtained in step (3-a) sequentially into the first convolution layer, the second convolution layer, the multi-scale layer and the self-attention module of the Transformer decoder constructed in step (3-d) to obtain C_0, C_1, C_2 and C_3, respectively; inputting M_3 and C_3 into the query branch of the dual-branch cross attention module in the Transformer decoder constructed in step (3-d) to obtain C_4; inputting M_3 and C_3 into the storage branch of the dual-branch cross attention module in the Transformer decoder constructed in step (3-d) to obtain M_4;
(3-h) concatenating C_0, C_4 and M_4 obtained in step (3-g) with f_M' obtained in step (3-f) and inputting the result into the residual module of the segmentation decoder constructed in step (3-e) to obtain f_O1; inputting f_O1 together with f_C3 obtained in step (3-a) into the first up-sampling module of the segmentation decoder constructed in step (3-e) to obtain f_O2; inputting f_O2 together with f_C2 obtained in step (3-a) into the second up-sampling module of the segmentation decoder constructed in step (3-e) to obtain f_O3; inputting f_O3 into the prediction convolution module of the segmentation decoder constructed in step (3-e) to obtain the predicted segmentation label f_O4, completing the construction of the segmentation model;
(4) constructing a loss function:
the loss function L uses cross-entropy loss, defined as follows:
L = −(1/(H_Y·W_Y)) Σ_{i=1}^{H_Y} Σ_{j=1}^{W_Y} [ Y_ij·log(Ŷ_ij) + (1 − Y_ij)·log(1 − Ŷ_ij) ]
where Y is the true segmentation label, Ŷ is the predicted segmentation label, H_Y and W_Y are the height and width of the true segmentation label, Y_ij is the pixel value of the pixel in row i and column j of Y, Ŷ_ij is the pixel value of the pixel in row i and column j of Ŷ, i = 1, 2, …, H_Y, j = 1, 2, …, W_Y, and log(·) denotes taking the natural logarithm;
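Assuming the prediction is already a foreground probability map of the same size as the label, this pixel-wise cross-entropy could be written as in the following sketch:

```python
import torch

def segmentation_loss(pred, target, eps=1e-7):
    """Pixel-wise binary cross-entropy between the predicted segmentation map and
    the true label. pred, target: tensors of shape (H_Y, W_Y) with values in [0, 1]."""
    pred = pred.clamp(eps, 1.0 - eps)                   # avoid log(0)
    loss = -(target * torch.log(pred) + (1.0 - target) * torch.log(1.0 - pred))
    return loss.mean()                                  # average over the H_Y * W_Y pixels
```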
(5) training a segmentation model:
training the segmentation model constructed in step (3) with the synthesized video training set obtained in step (2-a), obtaining the loss value from the loss function constructed in step (4), and updating the parameters of the segmentation model by stochastic gradient descent to obtain a pre-trained model; then training the pre-trained model with the real video training set obtained in step (2-b), obtaining the loss value from the loss function constructed in step (4), and updating the parameters of the segmentation model by stochastic gradient descent to obtain the trained segmentation model;
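The two-stage training procedure could be organized as in the sketch below; the model's calling convention, the learning rate, momentum and epoch count are assumptions for illustration, and segmentation_loss refers to the sketch above.

```python
import torch

def train_stage(model, loader, epochs=10, lr=1e-5):
    """One training stage with stochastic gradient descent. Each batch is assumed to
    yield three frames with labels; the first pair serves as the reference."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for frames, labels in loader:
            pred = model(frames, labels[:, 0])          # assumed model interface
            loss = segmentation_loss(pred, labels[:, 1:])
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model

# stage 1: pre-train on the synthesized video training set of step (2-a)
# stage 2: fine-tune the pre-trained model on the real video training set of step (2-b)
# model = train_stage(model, synthetic_loader); model = train_stage(model, real_loader)
```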
(6) video object segmentation:
acquiring the video to be segmented and the segmentation label corresponding to its first frame, inputting the frames of the video one by one, starting from the second frame, into the trained segmentation model obtained in step (5), and outputting the segmentation labels.
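The frame-by-frame inference and the periodic update of the storage pool described in steps (3-b) and (6) could be organized as in the following sketch, where the model's calling convention is again an assumption:

```python
import torch

@torch.no_grad()
def segment_video(model, frames, first_label, tau=5):
    """frames: list of video frames; first_label: segmentation label of the first frame.
    Every tau-th frame (the 1st, (tau+1)-th, (2*tau+1)-th, ...) joins the storage pool."""
    memory_frames, memory_labels = [frames[0]], [first_label]
    outputs = [first_label]
    for t in range(1, len(frames)):                     # start from the second frame
        pred = model(frames[t], memory_frames, memory_labels)   # assumed model interface
        outputs.append(pred)
        if t % tau == 0:                                # frame number t+1 = k*tau + 1
            memory_frames.append(frames[t])
            memory_labels.append(pred)
    return outputs
```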
The multi-scale layer in the step (3-d) is composed of t convolution layers with different convolution kernel sizes, and the output calculation process is as follows:
MSL = Concat(S_1, …, S_i, …, S_t)
S_i = Conv_i(X_i; r_i)
where MSL denotes the output of the multi-scale layer, Concat(·) denotes the concatenation operation, t denotes the number of convolution layers in the multi-scale layer and is a positive integer with value range [1, 50], Conv_i denotes the i-th convolution layer in the multi-scale layer, r_i denotes the convolution kernel size of the i-th convolution layer and is a positive integer with value range [1, 100], X_i denotes the input of the i-th convolution layer of the multi-scale layer, and S_i denotes the output of the i-th convolution layer of the multi-scale layer; for the multi-scale layer in the Transformer encoder, X_i = M_1, and for the multi-scale layer in the Transformer decoder, X_i = C_1, i = 1, 2, …, t.
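A minimal sketch of such a multi-scale layer is given below; the per-branch output channel count, the odd kernel sizes and the symmetric padding used to keep all branches at the same spatial size are assumptions (the even kernel sizes 2 and 4 used in the embodiment would require asymmetric padding instead).

```python
import torch
import torch.nn as nn

class MultiScaleLayer(nn.Module):
    """t parallel convolution layers with different kernel sizes applied to the same
    input; their outputs S_1..S_t are concatenated along the channel dimension."""
    def __init__(self, in_channels, out_channels, kernel_sizes=(3, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(in_channels, out_channels, k, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):                # x = M_1 in the encoder, C_1 in the decoder
        return torch.cat([conv(x) for conv in self.convs], dim=1)
```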
The output of the multi-head attention module in step (3-d) is calculated as:
MultiHead(q, k, v) = Concat(A_1, …, A_i, …, A_s)·U_o
A_i = softmax((q·U_i^q)·(k·U_i^k)^T / √d)·(v·U_i^v)
where MultiHead(q, k, v) denotes the output of the multi-head attention module, softmax(·) denotes the activation function, and Concat(·) denotes the concatenation operation; q, k and v are the inputs of the multi-head attention module: for the multi-head attention module inside the self-attention module of the Transformer encoder, q = k = v = M_2; for the multi-head attention module inside the self-attention module of the Transformer decoder, q = k = v = C_2; for the multi-head attention module inside the storage branch of the dual-branch cross attention module, q = M_3 and k = v = C_3; for the multi-head attention module inside the query branch, q = C_3 and k = v = M_3; s is the number of attention heads in the multi-head attention module, a positive integer with value range [1, 16]; A_i denotes the output of the i-th attention head, i = 1, 2, …, s; U_i^q, U_i^k and U_i^v denote the q, k and v parameter matrices of the i-th attention head, U_o is the parameter matrix that adjusts the final output, T denotes the transpose operator, and d is a hyperparameter, a positive integer with value range [1, 1000].
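The two formulas above can be written out directly as in the following sketch, where the per-head projection matrices U_i^q, U_i^k, U_i^v and the output matrix U_o are passed in as plain tensors:

```python
import math
import torch

def multi_head_attention(q, k, v, U_q, U_k, U_v, U_o, d):
    """q: (N_q, D), k, v: (N_kv, D); U_q, U_k, U_v: lists of s matrices of shape (D, d);
    U_o: matrix of shape (s * d, D). Implements A_i = softmax(q U_i^q (k U_i^k)^T / sqrt(d)) v U_i^v
    and MultiHead(q, k, v) = Concat(A_1, ..., A_s) U_o."""
    heads = []
    for Uq_i, Uk_i, Uv_i in zip(U_q, U_k, U_v):
        attn = torch.softmax((q @ Uq_i) @ (k @ Uk_i).t() / math.sqrt(d), dim=-1)
        heads.append(attn @ (v @ Uv_i))                 # A_i, shape (N_q, d)
    return torch.cat(heads, dim=-1) @ U_o               # output of shape (N_q, D)
```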
In step (3-b), preferably τ is 5.
In step (3-d), the number t of convolutional layers in the multi-scale layer is preferably 2, the number s of attention heads in the multi-head attention module is preferably 8, and the hyper-parameter d is preferably 32.
Compared with the prior art, the invention has the following advantages:
(1) the space-time integration module provided by the invention can integrate and compress space-time information and reduce the calculation cost.
(2) The multi-scale layer introduced by the invention provides cross-scale input features for the Transformer, helping the network learn scale-invariant feature representations and improving the segmentation accuracy of the network for small-scale targets.
(3) The two branches in the dual-branch cross attention module constructed by the invention attend to different characteristics of the target information, thereby improving the segmentation accuracy of target details.
Drawings
FIG. 1 is a flowchart of the Transformer-based semi-supervised video object segmentation method according to an embodiment of the present invention;
FIG. 2 is a diagram of a segmentation model architecture in accordance with an embodiment of the present invention;
FIG. 3 is a diagram of a Transformer module architecture according to an embodiment of the present invention;
FIG. 4 compares the segmentation results of an embodiment of the present invention on a small-scale target with those of other methods;
FIG. 5 compares the segmentation results of an embodiment of the present invention on similar targets with those of other methods.
Detailed Description
The following describes specific embodiments of the present invention;
example 1
Fig. 1 is a flowchart of the Transformer-based semi-supervised video object segmentation method according to an embodiment of the present invention, which includes the following specific steps:
step 1, acquiring a data set and a segmentation label
Obtain a video target segmentation data set, a static image data set and the segmentation labels corresponding to the two data sets, and form an image pair from each image in the data sets and its corresponding segmentation label.
Step 2, data expansion and processing
(2-a) after normalization processing is carried out on each image pair consisting of the static image data set obtained in the step (1) and the corresponding segmentation labels, repeating the following processes to obtain a synthesized video training sample corresponding to each image pair, wherein the synthesized video training set is formed by a set of synthesized video training samples:
I. reducing the short edge of the image pair to w pixels and scaling the long edge by the same ratio, then randomly cropping the obtained image pair to a size of h × w pixels, where w is the width of the cropped image, h is the height of the cropped image, and both w and h are positive integers with value range [10, 3000]; in this embodiment, w = 384 and h = 384;
II. sequentially applying random scaling, random horizontal flipping, random color jittering and random grayscale conversion to the cropped image pair to obtain an enhanced image pair corresponding to the image pair;
III. repeating process II three times to obtain three enhanced image pairs corresponding to the image pair, where the three enhanced image pairs form one synthetic video training sample;
(2-b) after normalization processing is carried out on each video and the corresponding segmentation labels in the video target segmentation data set obtained in the step (1), repeating the following processes to obtain a real video training sample corresponding to each video, wherein the real video training set is formed by a set of real video training samples:
I. randomly extracting three image pairs from the video and the corresponding segmentation labels;
II. reducing the short edges of the three image pairs to w pixels and scaling the long edges by the same ratio, then randomly cropping the obtained three image pairs to a size of h × w pixels, where the meanings and values of w and h are the same as in step (2-a);
III. sequentially applying random cropping, color jittering and random grayscale conversion to the three image pairs to obtain three enhanced image pairs, where the three enhanced image pairs form one real video training sample.
Step 3, constructing a segmentation model
Fig. 2 is a diagram illustrating a segmentation model structure according to an embodiment of the present invention, which includes the following steps:
(3-a) construction of the query encoder
Using the convolutional neural network ResNet50 as the query encoder, the frame to be segmented passes sequentially through the first four layers of the encoder, where the output of the second layer is f_C2, the output of the third layer is f_C3, and the output of the fourth layer is f_C;
(3-b) construction of storage pool
Putting the 1st frame, the (τ+1)-th frame, the (2τ+1)-th frame, ……, the (Nτ+1)-th frame of the video sequence and their corresponding segmentation labels into the storage pool, where τ is a positive integer with value range [1, 200],
N = ⌊(τ_C − 1)/τ⌋, τ_C is the relative position of the frame to be segmented, and ⌊·⌋ denotes rounding down; this embodiment takes τ = 5;
(3-c) construction of storage encoder
Using the convolutional neural network ResNet18 as the storage encoder, all images in the storage pool and their corresponding segmentation labels are passed through the encoder to obtain f_M;
(3-d) construction of Transformer Module
FIG. 3 shows the structure of the Transformer module according to an embodiment of the present invention. The module consists of a Transformer encoder and a Transformer decoder. The Transformer encoder comprises a space-time integration module, a convolution layer with 256 convolution kernels of size 1, a multi-scale layer and a self-attention module. The Transformer decoder comprises a convolution layer with 512 convolution kernels of size 3, a convolution layer with 256 convolution kernels of size 1, a multi-scale layer, a self-attention module and a dual-branch cross attention module; the dual-branch cross attention module consists of a query branch and a storage branch, the two branches have the same structure, each consisting of a multi-head attention module, two residual-and-layer-normalization modules and a fully connected feed-forward network, and the fully connected feed-forward network consists of two linear layers with input dimension 256 and hidden-layer size 2048 and a ReLU activation layer. The self-attention modules in the Transformer encoder and the Transformer decoder have exactly the same structure and consist of a multi-head attention module and a residual-and-layer-normalization module; all multi-head attention modules have the same structure;
the multi-scale layer in the step (3-d) is composed of t convolution layers with different convolution kernel sizes, and the output calculation process is as follows:
MSL = Concat(S_1, …, S_i, …, S_t)
S_i = Conv_i(X_i; r_i)
where MSL denotes the output of the multi-scale layer, Concat(·) denotes the concatenation operation, t denotes the number of convolution layers in the multi-scale layer and is a positive integer with value range [1, 50], in this embodiment t = 2; Conv_i denotes the i-th convolution layer in the multi-scale layer, r_i denotes the convolution kernel size of the i-th convolution layer and is a positive integer with value range [1, 100], in this embodiment r_1 = 2 and r_2 = 4; X_i denotes the input of the i-th convolution layer of the multi-scale layer, and S_i denotes the output of the i-th convolution layer of the multi-scale layer; for the multi-scale layer in the Transformer encoder, X_i = M_1, and for the multi-scale layer in the Transformer decoder, X_i = C_1, i = 1, 2, …, t.
The output of the multi-head attention module in step (3-d) is calculated as:
MultiHead(q, k, v) = Concat(A_1, …, A_i, …, A_s)·U_o
A_i = softmax((q·U_i^q)·(k·U_i^k)^T / √d)·(v·U_i^v)
where MultiHead(q, k, v) denotes the output of the multi-head attention module, softmax(·) denotes the activation function, and Concat(·) denotes the concatenation operation; q, k and v are the inputs of the multi-head attention module: for the multi-head attention module inside the self-attention module of the Transformer encoder, q = k = v = M_2; for the multi-head attention module inside the self-attention module of the Transformer decoder, q = k = v = C_2; for the multi-head attention module inside the storage branch of the dual-branch cross attention module, q = M_3 and k = v = C_3; for the multi-head attention module inside the query branch, q = C_3 and k = v = M_3; s is the number of attention heads in the multi-head attention module, a positive integer with value range [1, 16], in this embodiment s = 8; A_i denotes the output of the i-th attention head, i = 1, 2, …, s; U_i^q, U_i^k and U_i^v denote the q, k and v parameter matrices of the i-th attention head, U_o is the parameter matrix that adjusts the final output, T denotes the transpose operator, and d is a hyperparameter, a positive integer with value range [1, 1000], in this embodiment d = 32;
(3-e) construction of the segmentation decoder
The segmentation decoder consists of a residual module, two up-sampling modules and a prediction convolution module; the residual module consists of convolution layers with 512 convolution kernels of size 3 and two ReLU activation layers, the first up-sampling module consists of convolution layers with 512 convolution kernels of size 3, convolution layers with 256 convolution kernels of size 3, two ReLU activation layers and a bilinear interpolation with an upscaling factor of 2, the second up-sampling module consists of convolution layers with 256 convolution kernels of size 3, two ReLU activation layers and a bilinear interpolation with an upscaling factor of 2, and the prediction convolution module consists of a convolution layer with 256 convolution kernels of size 1 and a bilinear interpolation with an upscaling factor of 2;
(3-f) inputting f_C obtained in step (3-a) and f_M obtained in step (3-c) into the space-time integration module of the Transformer encoder constructed in step (3-d) to obtain f_M'; the specific calculation process is as follows:
f_M' = f_M · softmax(c(ConvKey(f_M), ConvKey(f_C)))
where ConvKey(·) is a key projection layer composed of a convolution layer with 64 convolution kernels of size 3, c(·,·) denotes the negative squared Euclidean distance, and softmax(·) denotes the activation function;
(3-g) inputting f_M' obtained in step (3-f) sequentially into the convolution layer, the multi-scale layer and the self-attention module of the Transformer encoder to obtain M_1, M_2 and M_3, respectively; inputting f_C obtained in step (3-a) sequentially into the first convolution layer, the second convolution layer, the multi-scale layer and the self-attention module of the Transformer decoder constructed in step (3-d) to obtain C_0, C_1, C_2 and C_3, respectively; inputting M_3 and C_3 into the query branch of the dual-branch cross attention module in the Transformer decoder constructed in step (3-d) to obtain C_4; inputting M_3 and C_3 into the storage branch of the dual-branch cross attention module in the Transformer decoder constructed in step (3-d) to obtain M_4;
(3-h) concatenating C_0, C_4 and M_4 obtained in step (3-g) with f_M' obtained in step (3-f) and inputting the result into the residual module of the segmentation decoder constructed in step (3-e) to obtain f_O1; inputting f_O1 together with f_C3 obtained in step (3-a) into the first up-sampling module of the segmentation decoder constructed in step (3-e) to obtain f_O2; inputting f_O2 together with f_C2 obtained in step (3-a) into the second up-sampling module of the segmentation decoder constructed in step (3-e) to obtain f_O3; inputting f_O3 into the prediction convolution module of the segmentation decoder constructed in step (3-e) to obtain the predicted segmentation label f_O4, completing the construction of the segmentation model.
(4) Constructing a loss function:
the loss function L uses cross-entropy loss, defined as follows:
L = −(1/(H_Y·W_Y)) Σ_{i=1}^{H_Y} Σ_{j=1}^{W_Y} [ Y_ij·log(Ŷ_ij) + (1 − Y_ij)·log(1 − Ŷ_ij) ]
where Y is the true segmentation label, Ŷ is the predicted segmentation label, H_Y and W_Y are the height and width of the true segmentation label, Y_ij is the pixel value of the pixel in row i and column j of Y, Ŷ_ij is the pixel value of the pixel in row i and column j of Ŷ, i = 1, 2, …, H_Y, j = 1, 2, …, W_Y, and log(·) denotes taking the natural logarithm.
(5) Training a segmentation model:
Training the segmentation model constructed in step (3) with the synthesized video training set obtained in step (2-a), obtaining the loss value from the loss function constructed in step (4), and updating the parameters of the segmentation model by stochastic gradient descent to obtain a pre-trained model; then training the pre-trained model with the real video training set obtained in step (2-b), obtaining the loss value from the loss function constructed in step (4), and updating the parameters of the segmentation model by stochastic gradient descent to obtain the trained segmentation model.
(6) Video object segmentation:
Acquiring the video to be segmented and the segmentation label corresponding to its first frame, inputting the frames of the video one by one, starting from the second frame, into the trained segmentation model obtained in step (5), and outputting the segmentation labels.
Example 2
Video object segmentation experiments were performed on the public data sets YouTubeVOS 2018, YouTubeVOS 2019, DAVIS 2016 and DAVIS 2017 using the method of Example 1. The experiments ran on Linux Ubuntu 18.04; the method was implemented with the PyTorch 1.8.1 framework based on CUDA 10.1 and cuDNN 7.6.5, and was trained and tested on two NVIDIA 2080Ti 11 GB GPUs.
This embodiment uses the region similarity J, the contour accuracy F and their average J&F to evaluate the performance of the invention. The region similarity J is the mean intersection-over-union between the estimated label and the corresponding true label, calculated as:
J = |M ∩ G| / |M ∪ G|
where M is the predicted segmentation label, G is the true segmentation label, and the symbols ∩ and ∪ denote the intersection and union of the two sets, respectively.
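For a single frame, computing J amounts to the intersection-over-union of two binary masks, as in this short sketch:

```python
import numpy as np

def region_similarity(pred, gt):
    """Region similarity J between a predicted and a true binary segmentation mask."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:                       # both masks empty: define J as 1
        return 1.0
    return np.logical_and(pred, gt).sum() / union
```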
The contour accuracy F measures the average boundary similarity between the boundary of the estimated label and the boundary of its true label, calculated as:
F = 2·P_c·R_c / (P_c + R_c)
where P_c is the precision between l(M) and l(G), R_c is the recall between l(M) and l(G), and P_c and R_c are obtained by bipartite graph matching; l(M) denotes the set of closed contours of the predicted segmentation label M, and l(G) denotes the set of closed contours of the true segmentation label G.
Table 1, Table 2, Table 3 and Table 4 show the J, F and J&F results on the test sample sets of YouTubeVOS 2018, YouTubeVOS 2019, DAVIS 2016 and DAVIS 2017, respectively, compared with other methods; the method of the present invention obtains the highest J&F score on all four data sets.
Fig. 4 compares the segmentation results of the embodiment of the present invention on a small-scale target with those of other methods. The first, second, third and fourth rows in the figure show the segmentation results of TransVOS, CFBI+, STCN and the method of the present invention, respectively, on selected frames of video d79729a354 in YouTubeVOS 2018; the number in the upper-left corner of each image is the frame number of the image in the video sequence, the image in each dashed box is an enlarged view of the small region in the corresponding solid box, and the first column is the first frame of the video and its true segmentation label. The first row, second column shows that the TransVOS method is disturbed by the background and wrongly identifies clutter as the distant person; the second row, second and third columns show that the CFBI+ method fails to segment the person's legs and arms correctly; the third row, second, third, fourth and fifth columns show that the STCN method confuses the distant person with a vehicle in the background. The fourth row shows that the method of the present invention segments the small-scale target correctly, and its segmentation result is superior to those of the other methods.
Fig. 5 compares the segmentation results of the embodiment of the present invention on similar targets with those of other methods. The first, second, third and fourth rows in the figure show the segmentation results of TransVOS, CFBI+, STCN and the method of the present invention, respectively, on selected frames of video 5d2020eff8 in YouTubeVOS 2018; the number in the upper-left corner of each image is the frame number of the image in the video sequence, the targets marked by solid boxes are erroneous segmentation results, and the first column is the first frame of the video and its true segmentation label. It can be seen that STCN, CFBI+ and TransVOS all confuse the fish to be segmented with fish of similar appearance in the background, whereas the method of the present invention successfully distinguishes the segmentation target from similar objects in the background.
The above-mentioned embodiments are merely preferred embodiments of the present invention, and the scope of the present invention is not limited thereby, and all changes made based on the principle of the present invention should be covered within the scope of the present invention.
TABLE 1
[J, F and J&F comparison with other methods on the YouTubeVOS 2018 test set; the table is reproduced as an image in the original document.]
(Note: J_S and F_S in the table denote the J and F values for categories that appear in both the YouTubeVOS 2018 training set and test set; J_U and F_U denote the J and F values for categories that appear only in the YouTubeVOS 2018 test set but not in the YouTubeVOS 2018 training set.)
TABLE 2
[J, F and J&F comparison with other methods on the YouTubeVOS 2019 test set; the table is reproduced as an image in the original document.]
(Note: J_S and F_S in the table denote the J and F values for categories that appear in both the YouTubeVOS 2019 training set and test set; J_U and F_U denote the J and F values for categories that appear only in the YouTubeVOS 2019 test set but not in the YouTubeVOS 2019 training set.)
TABLE 3
[J, F and J&F comparison with other methods on the DAVIS 2016 test set; the table is reproduced as an image in the original document.]
TABLE 4
[J, F and J&F comparison with other methods on the DAVIS 2017 test set; the table is reproduced as an image in the original document.]

Claims (3)

1. A semi-supervised video object segmentation method based on a Transformer is characterized by comprising the following steps:
(1) acquiring a data set and a segmentation label:
acquiring a video target segmentation data set, a static image data set and segmentation labels corresponding to the two data sets, and forming an image pair by each image in the data sets and the corresponding segmentation labels;
(2) data expansion and processing, which specifically comprises the following steps:
(2-a) after normalizing each image pair consisting of the static image data set obtained in the step (1) and the corresponding segmentation label, repeating the following processes to obtain a synthesized video training sample corresponding to each image pair, wherein the synthesized video training set is formed by a set of synthesized video training samples:
I. reducing the short edge of the image pair to w pixels and scaling the long edge by the same ratio, then randomly cropping the obtained image pair to a size of h × w pixels, where w is the width of the cropped image, h is the height of the cropped image, and w and h are positive integers with value range [10, 3000];
II. sequentially applying random scaling, random horizontal flipping, random color jittering and random grayscale conversion to the cropped image pair to obtain an enhanced image pair corresponding to the image pair;
III. repeating process II three times to obtain three enhanced image pairs corresponding to the image pair, where the three enhanced image pairs form one synthetic video training sample;
(2-b) after normalization processing is carried out on each video and the corresponding segmentation labels in the video target segmentation data set obtained in the step (1), repeating the following processes to obtain a real video training sample corresponding to each video, wherein the real video training set is formed by a set of real video training samples:
I. randomly extracting three image pairs from the video and the corresponding segmentation labels;
II. reducing the short edges of the three image pairs to w pixels and scaling the long edges by the same ratio, then randomly cropping the obtained three image pairs to a size of h × w pixels, where the meanings and values of w and h are the same as in step (2-a);
III. sequentially applying random cropping, color jittering and random grayscale conversion to the three image pairs to obtain three enhanced image pairs, where the three enhanced image pairs form one real video training sample;
(3) constructing a segmentation model, which specifically comprises the following steps:
(3-a) construction of the query encoder
Using a convolutional neural network as the query encoder, the frame to be segmented passes sequentially through the first four layers of the encoder, where the output of the second layer is f_C2, the output of the third layer is f_C3, and the output of the fourth layer is f_C;
(3-b) construction of storage pool
Putting the 1st frame, the (τ+1)-th frame, the (2τ+1)-th frame, ……, the (Nτ+1)-th frame of the video sequence and their corresponding segmentation labels into the storage pool, where τ is a positive integer with value range [1, 200],
N = ⌊(τ_C − 1)/τ⌋, τ_C is the relative position of the frame to be segmented, and ⌊·⌋ denotes rounding down;
(3-c) construction of storage encoder
Using a convolutional neural network as the storage encoder, all images in the storage pool and their corresponding segmentation labels are passed through the encoder to obtain f_M;
(3-d) construction of Transformer Module
The module consists of a Transformer encoder and a Transformer decoder. The Transformer encoder comprises a space-time integration module, a convolution layer, a multi-scale layer and a self-attention module. The Transformer decoder comprises two convolution layers, a multi-scale layer, a self-attention module and a dual-branch cross attention module; the dual-branch cross attention module consists of a query branch and a storage branch, the two branches have the same structure, each consisting of a multi-head attention module, two residual-and-layer-normalization modules and a fully connected feed-forward network, and the fully connected feed-forward network consists of two linear layers and a ReLU activation layer. The self-attention modules in the Transformer encoder and the Transformer decoder have exactly the same structure and consist of a multi-head attention module and a residual-and-layer-normalization module; all multi-head attention modules have the same structure;
(3-e) construction of the segmentation decoder
The segmentation decoder consists of a residual module, two up-sampling modules and a prediction convolution module; the residual module consists of three convolution layers and two ReLU activation layers, the first up-sampling module consists of four convolution layers, two ReLU activation layers and one bilinear interpolation, the second up-sampling module consists of three convolution layers, two ReLU activation layers and one bilinear interpolation, and the prediction convolution module consists of one convolution layer and one bilinear interpolation;
(3-f) inputting f_C obtained in step (3-a) and f_M obtained in step (3-c) into the space-time integration module of the Transformer encoder constructed in step (3-d) to obtain f_M'; the specific calculation process is as follows:
f_M' = f_M · softmax(c(ConvKey(f_M), ConvKey(f_C)))
where ConvKey(·) is a key projection layer composed of one convolution layer, c(·,·) denotes the negative squared Euclidean distance, and softmax(·) denotes the activation function;
(3-g) inputting f_M' obtained in step (3-f) sequentially into the convolution layer, the multi-scale layer and the self-attention module of the Transformer encoder to obtain M_1, M_2 and M_3, respectively; inputting f_C obtained in step (3-a) sequentially into the first convolution layer, the second convolution layer, the multi-scale layer and the self-attention module of the Transformer decoder constructed in step (3-d) to obtain C_0, C_1, C_2 and C_3, respectively; inputting M_3 and C_3 into the query branch of the dual-branch cross attention module in the Transformer decoder constructed in step (3-d) to obtain C_4; inputting M_3 and C_3 into the storage branch of the dual-branch cross attention module in the Transformer decoder constructed in step (3-d) to obtain M_4;
(3-h) concatenating C_0, C_4 and M_4 obtained in step (3-g) with f_M' obtained in step (3-f) and inputting the result into the residual module of the segmentation decoder constructed in step (3-e) to obtain f_O1; inputting f_O1 together with f_C3 obtained in step (3-a) into the first up-sampling module of the segmentation decoder constructed in step (3-e) to obtain f_O2; inputting f_O2 together with f_C2 obtained in step (3-a) into the second up-sampling module of the segmentation decoder constructed in step (3-e) to obtain f_O3; inputting f_O3 into the prediction convolution module of the segmentation decoder constructed in step (3-e) to obtain the predicted segmentation label f_O4, completing the construction of the segmentation model;
(4) constructing a loss function:
the loss function L uses cross-entropy loss, defined as follows:
L = −(1/(H_Y·W_Y)) Σ_{i=1}^{H_Y} Σ_{j=1}^{W_Y} [ Y_ij·log(Ŷ_ij) + (1 − Y_ij)·log(1 − Ŷ_ij) ]
where Y is the true segmentation label, Ŷ is the predicted segmentation label, H_Y and W_Y are the height and width of the true segmentation label, Y_ij is the pixel value of the pixel in row i and column j of Y, Ŷ_ij is the pixel value of the pixel in row i and column j of Ŷ, i = 1, 2, …, H_Y, j = 1, 2, …, W_Y, and log(·) denotes taking the natural logarithm;
(5) training a segmentation model:
training the segmentation model constructed in step (3) with the synthesized video training set obtained in step (2-a), obtaining the loss value from the loss function constructed in step (4), and updating the parameters of the segmentation model by stochastic gradient descent to obtain a pre-trained model; then training the pre-trained model with the real video training set obtained in step (2-b), obtaining the loss value from the loss function constructed in step (4), and updating the parameters of the segmentation model by stochastic gradient descent to obtain the trained segmentation model;
(6) video object segmentation:
acquiring the video to be segmented and the segmentation label corresponding to its first frame, inputting the frames of the video one by one, starting from the second frame, into the trained segmentation model obtained in step (5), and outputting the segmentation labels.
2. The Transformer-based semi-supervised video object segmentation method of claim 1, wherein the multi-scale layer in step (3-d) is composed of t convolution layers with different convolution kernel sizes, and the output of the multi-scale layer is calculated as:
MSL = Concat(S_1, …, S_i, …, S_t)
S_i = Conv_i(X_i; r_i)
where MSL denotes the output of the multi-scale layer, Concat(·) denotes the concatenation operation, t denotes the number of convolution layers in the multi-scale layer and is a positive integer with value range [1, 50], Conv_i denotes the i-th convolution layer in the multi-scale layer, r_i denotes the convolution kernel size of the i-th convolution layer and is a positive integer with value range [1, 100], X_i denotes the input of the i-th convolution layer of the multi-scale layer, and S_i denotes the output of the i-th convolution layer of the multi-scale layer; for the multi-scale layer in the Transformer encoder, X_i = M_1, and for the multi-scale layer in the Transformer decoder, X_i = C_1, i = 1, 2, …, t.
3. The Transformer-based semi-supervised video object segmentation method of claim 1, wherein the output of the multi-head attention module in step (3-d) is calculated as:
MultiHead(q, k, v) = Concat(A_1, …, A_i, …, A_s)·U_o
A_i = softmax((q·U_i^q)·(k·U_i^k)^T / √d)·(v·U_i^v)
where MultiHead(q, k, v) denotes the output of the multi-head attention module, softmax(·) denotes the activation function, and Concat(·) denotes the concatenation operation; q, k and v are the inputs of the multi-head attention module: for the multi-head attention module inside the self-attention module of the Transformer encoder, q = k = v = M_2; for the multi-head attention module inside the self-attention module of the Transformer decoder, q = k = v = C_2; for the multi-head attention module inside the storage branch of the dual-branch cross attention module, q = M_3 and k = v = C_3; for the multi-head attention module inside the query branch, q = C_3 and k = v = M_3; s is the number of attention heads in the multi-head attention module, a positive integer with value range [1, 16]; A_i denotes the output of the i-th attention head, i = 1, 2, …, s; U_i^q, U_i^k and U_i^v denote the q, k and v parameter matrices of the i-th attention head, U_o is the parameter matrix that adjusts the final output, T denotes the transpose operator, and d is a hyperparameter, a positive integer with value range [1, 1000].
CN202210098849.1A 2022-01-24 2022-01-24 Transformer-based semi-supervised video object segmentation method Active CN114429607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210098849.1A CN114429607B (en) 2022-01-24 2022-01-24 Transformer-based semi-supervised video object segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210098849.1A CN114429607B (en) 2022-01-24 2022-01-24 Transformer-based semi-supervised video object segmentation method

Publications (2)

Publication Number Publication Date
CN114429607A true CN114429607A (en) 2022-05-03
CN114429607B CN114429607B (en) 2024-03-29

Family

ID=81313102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210098849.1A Active CN114429607B (en) 2022-01-24 2022-01-24 Transformer-based semi-supervised video object segmentation method

Country Status (1)

Country Link
CN (1) CN114429607B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116012388A (en) * 2023-03-28 2023-04-25 中南大学 Three-dimensional medical image segmentation method and imaging method for acute ischemic cerebral apoplexy

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685066A (en) * 2018-12-24 2019-04-26 中国矿业大学(北京) A kind of mine object detection and recognition method based on depth convolutional neural networks
CN110378288A (en) * 2019-07-19 2019-10-25 合肥工业大学 A kind of multistage spatiotemporal motion object detection method based on deep learning
WO2019218136A1 (en) * 2018-05-15 2019-11-21 深圳大学 Image segmentation method, computer device, and storage medium
CN111523410A (en) * 2020-04-09 2020-08-11 哈尔滨工业大学 Video saliency target detection method based on attention mechanism
CN111860162A (en) * 2020-06-17 2020-10-30 上海交通大学 Video crowd counting system and method
CN112036300A (en) * 2020-08-31 2020-12-04 合肥工业大学 Moving target detection method based on multi-scale space-time propagation layer
US20210150727A1 (en) * 2019-11-19 2021-05-20 Samsung Electronics Co., Ltd. Method and apparatus with video segmentation
CN113570610A (en) * 2021-07-26 2021-10-29 北京百度网讯科技有限公司 Method and device for performing target segmentation on video by adopting semantic segmentation model

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019218136A1 (en) * 2018-05-15 2019-11-21 深圳大学 Image segmentation method, computer device, and storage medium
CN109685066A (en) * 2018-12-24 2019-04-26 中国矿业大学(北京) A kind of mine object detection and recognition method based on depth convolutional neural networks
CN110378288A (en) * 2019-07-19 2019-10-25 合肥工业大学 A kind of multistage spatiotemporal motion object detection method based on deep learning
US20210150727A1 (en) * 2019-11-19 2021-05-20 Samsung Electronics Co., Ltd. Method and apparatus with video segmentation
CN111523410A (en) * 2020-04-09 2020-08-11 哈尔滨工业大学 Video saliency target detection method based on attention mechanism
CN111860162A (en) * 2020-06-17 2020-10-30 上海交通大学 Video crowd counting system and method
CN112036300A (en) * 2020-08-31 2020-12-04 合肥工业大学 Moving target detection method based on multi-scale space-time propagation layer
CN113570610A (en) * 2021-07-26 2021-10-29 北京百度网讯科技有限公司 Method and device for performing target segmentation on video by adopting semantic segmentation model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
ZONGXIN YANG et al.: "Collaborative Video Object Segmentation by Multi-Scale Foreground-Background Integration", IEEE Transactions on Pattern Analysis and Machine Intelligence, 18 May 2021 (2021-05-18), pages 4701, XP011916332, DOI: 10.1109/TPAMI.2021.3081597 *
LI Dong: "Research on Video Action Recognition and Detection Methods Based on Spatio-temporal Correlation", China Doctoral Dissertations Full-text Database (Information Science and Technology), 15 September 2021 (2021-09-15), pages 138-48 *

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116012388A (en) * 2023-03-28 2023-04-25 中南大学 Three-dimensional medical image segmentation method and imaging method for acute ischemic cerebral apoplexy

Also Published As

Publication number Publication date
CN114429607B (en) 2024-03-29

Similar Documents

Publication Publication Date Title
CN111489358B (en) Three-dimensional point cloud semantic segmentation method based on deep learning
Johnson et al. Image generation from scene graphs
CN107273800B (en) Attention mechanism-based motion recognition method for convolutional recurrent neural network
CN111783705B (en) Character recognition method and system based on attention mechanism
CN113011427A (en) Remote sensing image semantic segmentation method based on self-supervision contrast learning
CN104517103A (en) Traffic sign classification method based on deep neural network
CN110599502B (en) Skin lesion segmentation method based on deep learning
Hou et al. Object detection in high-resolution panchromatic images using deep models and spatial template matching
CN110349229A (en) A kind of Image Description Methods and device
Nguyen et al. Hybrid deep learning-Gaussian process network for pedestrian lane detection in unstructured scenes
CN115035508A (en) Topic-guided remote sensing image subtitle generation method based on Transformer
CN111696136A (en) Target tracking method based on coding and decoding structure
CN115690152A (en) Target tracking method based on attention mechanism
CN113177503A (en) Arbitrary orientation target twelve parameter detection method based on YOLOV5
CN115131313A (en) Hyperspectral image change detection method and device based on Transformer
CN112905828A (en) Image retriever, database and retrieval method combined with significant features
CN114429607A (en) Transformer-based semi-supervised video object segmentation method
CN115512096A (en) CNN and Transformer-based low-resolution image classification method and system
CN110503090A (en) Character machining network training method, character detection method and character machining device based on limited attention model
Zhao et al. Recognition and Classification of Concrete Cracks under Strong Interference Based on Convolutional Neural Network.
CN114821631A (en) Pedestrian feature extraction method based on attention mechanism and multi-scale feature fusion
Li et al. Bagging R-CNN: Ensemble for Object Detection in Complex Traffic Scenes
Shi et al. A welding defect detection method based on multiscale feature enhancement and aggregation
Park et al. A 2-D HMM method for offline handwritten character recognition
CN117876931A (en) Global feature enhanced semi-supervised video target segmentation method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant