CN114429607B - Transformer-based semi-supervised video object segmentation method


Info

Publication number
CN114429607B
Authority
CN
China
Prior art keywords
segmentation
module
video
layer
attention module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210098849.1A
Other languages
Chinese (zh)
Other versions
CN114429607A (en)
Inventor
阳春华
周玮
赵于前
张帆
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Central South University filed Critical Central South University
Priority to CN202210098849.1A priority Critical patent/CN114429607B/en
Publication of CN114429607A publication Critical patent/CN114429607A/en
Application granted granted Critical
Publication of CN114429607B publication Critical patent/CN114429607B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/2155Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Abstract

The invention discloses a Transformer-based semi-supervised video object segmentation method, implemented as follows: 1) acquiring a data set and segmentation labels; 2) expanding and processing the data; 3) constructing a segmentation model; 4) constructing a loss function; 5) training the segmentation model; 6) performing video object segmentation. The invention compresses spatio-temporal information by designing a space-time integration module, introduces a multi-scale layer to generate cross-scale input features, and constructs a dual-branch cross-attention module that attends to multiple characteristics of the target information. The method effectively improves the segmentation accuracy of small-scale targets and similar targets while reducing the computational cost.

Description

Transformer-based semi-supervised video object segmentation method
Technical Field
The invention relates to the technical field of image processing, in particular to a Transformer-based semi-supervised video object segmentation method.
Background
Video object segmentation is an important prerequisite for video understanding, with many potential applications such as video retrieval, video editing and autonomous driving. Given the segmentation of the target in the first frame of a video (i.e., the segmentation label), semi-supervised video object segmentation aims to segment that target in all other frames of the video sequence.
Because of the strong performance of the Transformer architecture in computer vision tasks such as image classification, object detection, semantic segmentation and object tracking, many studies now apply it to video object segmentation. The Transformer architecture has excellent long-range dependency modelling capability and can effectively mine the spatio-temporal information in a given video, thereby improving segmentation accuracy. However, most Transformer-based methods feed the features of all frames in the storage pool directly into the multi-head attention module, so the computational cost grows rapidly as the number of segmented frames increases; moreover, the classical Transformer architecture lacks an inherent inductive bias, which leads to poor segmentation accuracy on small-scale targets and similar targets.
Disclosure of Invention
The invention aims to overcome the defects and shortcomings of the prior art and provides a semi-supervised video target segmentation method based on a Transformer.
In order to achieve the above purpose, the present invention provides the following technical solutions:
a semi-supervised video target segmentation method based on a transducer comprises the following steps:
(1) Acquiring a data set and segmentation labels:
acquiring a video object segmentation dataset, a static image dataset and the segmentation labels corresponding to the two datasets, and forming an image pair from each image in the datasets and its corresponding segmentation label;
(2) The data expansion and processing method specifically comprises the following steps:
(2-a) after normalizing each image pair formed from the static image dataset obtained in step (1) and its corresponding segmentation label, repeating the following procedure to obtain a synthesized video training sample corresponding to each image, wherein the set of synthesized video training samples forms a synthesized video training set:
I. the short side of the image pair is reduced to w pixels and the long side is reduced by the same ratio, and the resulting image pair is randomly cropped to h × w pixels, wherein w is the width of the cropped image, h is the height of the cropped image, and w and h are positive integers with value range [10, 3000];
II. random scaling, random horizontal flipping, random color jittering and random grayscale conversion are sequentially applied to the cropped image pair to obtain an enhanced image pair corresponding to the image pair;
III. process II is repeated three times to obtain three enhanced image pairs corresponding to the image pair, and the three enhanced image pairs form one synthesized video training sample (a minimal sketch of this pipeline is given below);
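The synthesized-video construction above maps naturally onto torchvision transforms. The following is a minimal, hedged sketch of steps I-III, assuming PIL inputs and a crop size of w = h = 384 (the value used in Example 1); the jitter strengths, grayscale probability and exact operation order are illustrative choices, not taken from the patent text.

```python
import random
import torchvision.transforms.functional as TF
from torchvision import transforms
from torchvision.transforms import InterpolationMode

W, H = 384, 384  # crop size; 384 x 384 follows Example 1

def resize_short_side(img, mask, w=W):
    """Step I: scale so the short side becomes w pixels, keeping the aspect ratio."""
    ow, oh = img.size
    scale = w / min(ow, oh)
    size = [round(oh * scale), round(ow * scale)]
    return (TF.resize(img, size),
            TF.resize(mask, size, interpolation=InterpolationMode.NEAREST))

def enhance(img, mask):
    """Step II: random scaling, shared crop, flip, color jitter, grayscale (illustrative parameters)."""
    s = random.uniform(0.9, 1.1)                                    # random scaling
    new_size = [max(H, round(img.size[1] * s)), max(W, round(img.size[0] * s))]
    img = TF.resize(img, new_size)
    mask = TF.resize(mask, new_size, interpolation=InterpolationMode.NEAREST)
    i, j, th, tw = transforms.RandomCrop.get_params(img, (H, W))    # same crop for image and label
    img, mask = TF.crop(img, i, j, th, tw), TF.crop(mask, i, j, th, tw)
    if random.random() < 0.5:                                       # random horizontal flip
        img, mask = TF.hflip(img), TF.hflip(mask)
    img = transforms.ColorJitter(0.3, 0.3, 0.3, 0.05)(img)          # random color jitter
    if random.random() < 0.1:                                       # random grayscale conversion
        img = TF.rgb_to_grayscale(img, num_output_channels=3)
    return img, mask

def make_synthetic_sample(image, label):
    """Steps I-III: one static image/label pair -> three enhanced pairs (one training sample)."""
    image, label = resize_short_side(image, label)
    return [enhance(image, label) for _ in range(3)]
```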
(2-b) after normalizing each video in the video object segmentation dataset obtained in step (1) and its corresponding segmentation labels, repeating the following procedure to obtain a real video training sample corresponding to each video, wherein the set of real video training samples forms a real video training set:
I. three image pairs are randomly extracted from the video and its corresponding segmentation labels;
II. the short sides of the three image pairs are reduced to w pixels and the long sides are reduced by the same ratio, and the resulting image pairs are randomly cropped to h × w pixels, wherein the meanings and values of w and h are the same as in step (2-a);
III. random cropping, color jittering and random grayscale conversion are sequentially applied to the three image pairs to obtain three enhanced image pairs, and the three enhanced image pairs form one real video training sample;
(3) The method for constructing the segmentation model specifically comprises the following steps:
(3-a) construction of a query encoder
using a convolutional neural network as the query encoder, and sequentially passing the frame to be segmented through the first four layers of the encoder, wherein the output of the second layer is f_C2, the output of the third layer is f_C3, and the output of the fourth layer is f_C;
(3-b) building a storage pool
the 1st frame, the (τ+1)-th frame, the (2τ+1)-th frame, ……, the (Nτ+1)-th frame of the video sequence and their corresponding segmentation labels are put into a storage pool, wherein τ is a positive integer with value range [1, 200], N is determined from the relative position τ_C of the frame to be segmented, and the symbol ⌊·⌋ denotes rounding down;
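A minimal sketch of the storage-pool bookkeeping described in step (3-b), assuming the pool simply keeps every τ-th frame together with its segmentation label; the class name and interface are illustrative, and τ = 5 follows the preferred value stated further below.

```python
import torch

class StoragePool:
    """Keeps frames 1, τ+1, 2τ+1, ... of a video and their segmentation labels."""
    def __init__(self, tau: int = 5):
        self.tau = tau
        self.entries = []            # list of (frame_tensor, label_tensor)

    def maybe_add(self, frame_index: int, frame: torch.Tensor, label: torch.Tensor):
        # frame_index is the 1-based relative position of the frame in the video
        if (frame_index - 1) % self.tau == 0:
            self.entries.append((frame, label))

    def as_batches(self):
        frames = torch.stack([f for f, _ in self.entries])   # (N+1, C, H, W)
        labels = torch.stack([l for _, l in self.entries])   # (N+1, 1, H, W)
        return frames, labels
```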
(3-c) construction of a storage encoder
using a convolutional neural network as the storage encoder, wherein all images in the storage pool and their corresponding segmentation labels pass through the encoder to obtain f_M;
(3-d) construction of a Transformer module
The module consists of a Transformer encoder and a Transformer decoder; the Transformer encoder comprises a space-time integration module, a convolution layer, a multi-scale layer and a self-attention module; the Transformer decoder comprises two convolution layers, a multi-scale layer, a self-attention module and a dual-branch cross-attention module, wherein the dual-branch cross-attention module consists of a query branch and a storage branch, the two branches have identical structures, each consisting of a multi-head attention module, two residual-and-layer-normalization modules and a fully connected feed-forward network, and the fully connected feed-forward network consists of two linear layers and a ReLU activation layer; the self-attention module in the Transformer encoder and the self-attention module in the Transformer decoder have identical structures, each consisting of a multi-head attention module and a residual-and-layer-normalization module; all multi-head attention modules have the same structure;
(3-e) construction of a segmentation decoder
The segmentation decoder consists of a residual module, two upsampling modules and a prediction convolution module; the residual module consists of three convolution layers and two ReLU activation layers, the first upsampling module consists of four convolution layers, two ReLU activation layers and a bilinear interpolation, the second upsampling module consists of three convolution layers, two ReLU activation layers and a bilinear interpolation, and the prediction convolution module consists of one convolution layer and a bilinear interpolation;
(3-f) inputting f_C obtained in step (3-a) and f_M obtained in step (3-c) into the space-time integration module of the Transformer encoder constructed in step (3-d) to obtain f_M', wherein the specific calculation process is as follows:
f_M' = f_M · softmax(c(ConvKey(f_M), ConvKey(f_C)))
wherein ConvKey(·) is a key projection layer consisting of one convolution layer, c(·) denotes the negative squared Euclidean distance, and softmax(·) denotes the activation function;
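A hedged PyTorch sketch of the space-time integration step above: the memory features f_M are aggregated to the spatial size of the query frame through a softmax-normalised negative-squared-Euclidean affinity between conv-projected keys. The 64-channel, 3×3 key projection follows Example 1; the feature channel widths and the tensor layout are assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpaceTimeIntegration(nn.Module):
    """f_M' = f_M · softmax(c(ConvKey(f_M), ConvKey(f_C))), c = negative squared Euclidean distance."""
    def __init__(self, mem_dim=512, query_dim=1024, key_dim=64):
        super().__init__()
        self.key_m = nn.Conv2d(mem_dim, key_dim, kernel_size=3, padding=1)    # ConvKey for memory
        self.key_q = nn.Conv2d(query_dim, key_dim, kernel_size=3, padding=1)  # ConvKey for query

    def forward(self, f_m, f_c):
        # f_m: (T, Cm, H, W) memory features; f_c: (1, Cq, H, W) query-frame features
        T, Cm, H, W = f_m.shape
        k_m = self.key_m(f_m).flatten(2).permute(1, 0, 2).flatten(1)   # (Ck, T*H*W)
        k_q = self.key_q(f_c).flatten(2).squeeze(0)                    # (Ck, H*W)
        # squared Euclidean distance between every memory key and every query key
        dist = ((k_m * k_m).sum(0).unsqueeze(1)        # (T*H*W, 1)
                - 2 * k_m.t() @ k_q                    # (T*H*W, H*W)
                + (k_q * k_q).sum(0).unsqueeze(0))     # (1, H*W)
        affinity = F.softmax(-dist, dim=0)             # softmax of the negative distance over memory positions
        v_m = f_m.permute(1, 0, 2, 3).reshape(Cm, T * H * W)           # memory values
        return (v_m @ affinity).view(1, Cm, H, W)      # compressed memory features f_M'
```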
(3-g) inputting f_M' obtained in step (3-f) sequentially into the convolution layer, the multi-scale layer and the self-attention module of the Transformer encoder to obtain M_1, M_2 and M_3, respectively; inputting f_C obtained in step (3-a) sequentially into the first convolution layer, the second convolution layer, the multi-scale layer and the self-attention module of the Transformer decoder constructed in step (3-d) to obtain C_0, C_1, C_2 and C_3, respectively; inputting M_3 and C_3 into the query branch of the dual-branch cross-attention module of the Transformer decoder constructed in step (3-d) to obtain C_4; and inputting M_3 and C_3 into the storage branch of the dual-branch cross-attention module of the Transformer decoder constructed in step (3-d) to obtain M_4;
(3-h) splicing C_0, C_4 and M_4 obtained in step (3-g) with f_M' obtained in step (3-f) and inputting the result into the residual module of the segmentation decoder constructed in step (3-e) to obtain f_O1; inputting f_O1 and f_C3 obtained in step (3-a) into the first upsampling module of the segmentation decoder constructed in step (3-e) to obtain f_O2; inputting f_O2 and f_C2 obtained in step (3-a) into the second upsampling module of the segmentation decoder constructed in step (3-e) to obtain f_O3; and inputting f_O3 into the prediction convolution module of the segmentation decoder constructed in step (3-e) to obtain the predicted segmentation label f_O4, completing the construction of the segmentation model;
(4) Constructing a loss function:
the loss function L uses cross entropy loss, defined as follows:
L = -(1/(H_Y·W_Y)) Σ_{i=1}^{H_Y} Σ_{j=1}^{W_Y} [ Y_ij·log(Ŷ_ij) + (1 - Y_ij)·log(1 - Ŷ_ij) ]
where Y is the real segmentation label, Ŷ is the predicted segmentation label, H_Y and W_Y are respectively the height and width of the real segmentation label, Y_ij is the pixel value of the pixel in the i-th row and j-th column of Y, Ŷ_ij is the pixel value of the pixel in the i-th row and j-th column of Ŷ, i = 1, 2, …, H_Y, j = 1, 2, …, W_Y, and log(·) denotes the natural logarithm;
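A hedged sketch of the loss in step (4): per-pixel binary cross entropy between the predicted label and the ground-truth label, averaged over the H_Y × W_Y pixels as in the formula above. The clamping for numerical stability and the tensor shapes are implementation assumptions.

```python
import torch

def segmentation_loss(y_pred: torch.Tensor, y_true: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """y_pred, y_true: (H_Y, W_Y) tensors with values in [0, 1]."""
    y_pred = y_pred.clamp(eps, 1.0 - eps)   # numerical safety; not part of the formula itself
    loss = -(y_true * torch.log(y_pred) + (1.0 - y_true) * torch.log(1.0 - y_pred))
    return loss.mean()                       # average over all H_Y * W_Y pixels
```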
(5) Training a segmentation model:
training the segmentation model constructed in the step (3) by utilizing the synthesized video training set obtained in the step (2-a), obtaining a loss value according to the loss function constructed in the step (4), and updating parameters in the segmentation model by using a random gradient descent method to obtain a pre-training model; training the pre-training model by using the real video training set obtained in the step (2-b), obtaining a loss value according to the loss function constructed in the step (4), and updating parameters in the segmentation model by using a random gradient descent method to obtain a trained segmentation model;
(6) Video object segmentation:
acquiring a video to be segmented and the segmentation label corresponding to its first frame, sequentially inputting the video, starting from its second frame, into the trained segmentation model obtained in step (5), and outputting the segmentation labels.
The multi-scale layer in the step (3-d) consists of t convolution layers with different convolution kernel sizes, and the output calculation process is as follows:
MSL = Concat(S_1, …, S_i, …, S_t)
S_i = Conv_i(X_i; r_i)
where MSL denotes the output of the multi-scale layer, Concat(·) denotes the concatenation operation, t denotes the number of convolution layers in the multi-scale layer, t is a positive integer with value range [1, 50], Conv_i(·) denotes the i-th convolution layer of the multi-scale layer, r_i denotes the convolution kernel size of the i-th convolution layer, r_i is a positive integer with value range [1, 100], X_i denotes the input of the i-th convolution layer of the multi-scale layer, and S_i denotes the output of the i-th convolution layer; for the multi-scale layer in the Transformer encoder, X_i = M_1; for the multi-scale layer in the Transformer decoder, X_i = C_1, i = 1, 2, …, t.
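A hedged sketch of the multi-scale layer: t parallel convolutions with different kernel sizes act on the same input, and their outputs are concatenated along the channel dimension, matching MSL = Concat(S_1, …, S_t) with S_i = Conv_i(X_i; r_i). The setting t = 2 with kernel sizes 2 and 4 follows the preferred values given later in the text; the channel counts and the cropping used to align outputs of even-sized kernels are assumptions.

```python
import torch
import torch.nn as nn

class MultiScaleLayer(nn.Module):
    """MSL = Concat(S_1, ..., S_t), S_i = Conv_i(X; r_i)."""
    def __init__(self, in_channels=256, out_channels_each=128, kernel_sizes=(2, 4)):
        super().__init__()
        self.branches = nn.ModuleList(
            nn.Conv2d(in_channels, out_channels_each, kernel_size=r, padding=r // 2)
            for r in kernel_sizes
        )

    def forward(self, x):
        outs = [branch(x) for branch in self.branches]
        # even kernel sizes shift the output grid by one pixel, so crop to a common size
        h = min(o.shape[-2] for o in outs)
        w = min(o.shape[-1] for o in outs)
        return torch.cat([o[..., :h, :w] for o in outs], dim=1)
```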
The output of the multi-head attention module in step (3-d) is calculated as:
MultiHead(q, k, v) = Concat(A_1, …, A_i, …, A_s)·U_o
A_i = softmax((q·U_i^q)·(k·U_i^k)^T / √d)·(v·U_i^v)
where MultiHead(q, k, v) denotes the output of the multi-head attention module, softmax(·) denotes the activation function, and Concat(·) denotes the concatenation operation; q, k and v are the inputs of the multi-head attention module: for the multi-head attention module inside the self-attention module of the Transformer encoder, q = k = v = M_2; for the multi-head attention module inside the self-attention module of the Transformer decoder, q = k = v = C_2; for the multi-head attention module inside the storage branch of the dual-branch cross-attention module, q = M_3 and k = v = C_3; for the multi-head attention module inside the query branch, q = C_3 and k = v = M_3; s is the number of attention heads in the multi-head attention module, s is a positive integer with value range [1, 16]; A_i denotes the output of the i-th attention head, i = 1, 2, …, s; U_i^q, U_i^k and U_i^v denote the q, k and v parameter matrices of the i-th attention head, and U_o is a parameter matrix used to adjust the final output; T denotes the transpose operator; d is a hyperparameter, a positive integer with value range [1, 1000].
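The attention computation itself is standard multi-head attention; what the text fixes is which features feed q, k and v at each of the four places it is used. The sketch below illustrates that wiring with torch.nn.MultiheadAttention, using s = 8 heads (the preferred value below) and a 256-dimensional embedding as in Example 1. Reusing a single attention instance on random feature tensors is purely for brevity; in the full model each module has its own parameters, residual connections and layer normalisation.

```python
import torch
import torch.nn as nn

embed_dim, num_heads = 256, 8
attn = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)   # stand-in for every instance

def mha(q, k, v):
    out, _ = attn(q, k, v)      # (B, N, embed_dim)
    return out

B, N = 1, 24 * 24               # illustrative token count (H*W spatial positions)
M2 = torch.randn(B, N, embed_dim)   # encoder features after its multi-scale layer
C2 = torch.randn(B, N, embed_dim)   # decoder features after its multi-scale layer

M3 = mha(M2, M2, M2)            # encoder self-attention:         q = k = v = M2
C3 = mha(C2, C2, C2)            # decoder self-attention:         q = k = v = C2
M4 = mha(M3, C3, C3)            # storage branch cross-attention: q = M3, k = v = C3
C4 = mha(C3, M3, M3)            # query branch cross-attention:   q = C3, k = v = M3
```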
In step (3-b), τ is preferably 5.
In step (3-d), the number of convolution layers t in the multi-scale layer is preferably 2, the number of attention heads s in the multi-head attention module is preferably 8, and the super-parameter d is preferably 32.
Compared with the prior art, the invention has the following advantages:
(1) The space-time integration module provided by the invention can integrate and compress space-time information and reduce the calculation cost.
(2) The multi-scale layer introduced by the invention provides cross-scale input features for the Transformer, which helps the network learn scale-invariant feature representations and improves the segmentation accuracy for small-scale targets.
(3) The two branches of the dual-branch cross-attention module constructed by the invention attend to different characteristics of the target information, which improves the segmentation accuracy on target details.
Drawings
FIG. 1 is a flowchart of a Transformer-based semi-supervised video object segmentation method according to an embodiment of the present invention;
FIG. 2 is a block diagram of a segmentation model according to an embodiment of the present invention;
FIG. 3 is a structural diagram of a Transformer module according to an embodiment of the present invention;
FIG. 4 compares the segmentation results of an embodiment of the present invention on a small-scale target with the segmentation results of other methods;
FIG. 5 compares the segmentation results of an embodiment of the present invention on similar targets with the segmentation results of other methods.
Detailed Description
The following describes specific embodiments of the present invention;
example 1
Fig. 1 is a flowchart of a Transformer-based semi-supervised video object segmentation method according to an embodiment of the present invention, which specifically includes the following steps:
step 1, acquiring a data set and a segmentation label
Acquire a video object segmentation dataset, a static image dataset and the segmentation labels corresponding to the two datasets, and form an image pair from each image in the datasets and its corresponding segmentation label.
Step 2, data expansion and processing
(2-a) after normalizing each image pair formed from the static image dataset obtained in step (1) and its corresponding segmentation label, repeating the following procedure to obtain a synthesized video training sample corresponding to each image, wherein the set of synthesized video training samples forms a synthesized video training set:
I. the short side of the image pair is reduced to w pixels and the long side is reduced by the same ratio, and the resulting image pair is randomly cropped to h × w pixels, wherein w is the width of the cropped image, h is the height of the cropped image, and w and h are positive integers with value range [10, 3000]; in this embodiment, w is 384 and h is 384;
II. random scaling, random horizontal flipping, random color jittering and random grayscale conversion are sequentially applied to the cropped image pair to obtain an enhanced image pair corresponding to the image pair;
III. process II is repeated three times to obtain three enhanced image pairs corresponding to the image pair, and the three enhanced image pairs form one synthesized video training sample;
(2-b) after normalizing each video in the video object segmentation dataset obtained in step (1) and its corresponding segmentation labels, repeating the following procedure to obtain a real video training sample corresponding to each video, wherein the set of real video training samples forms a real video training set:
I. three image pairs are randomly extracted from the video and its corresponding segmentation labels;
II. the short sides of the three image pairs are reduced to w pixels and the long sides are reduced by the same ratio, and the resulting image pairs are randomly cropped to h × w pixels, wherein the meanings and values of w and h are the same as in step (2-a);
III. random cropping, color jittering and random grayscale conversion are sequentially applied to the three image pairs to obtain three enhanced image pairs, and the three enhanced image pairs form one real video training sample.
Step 3, constructing a segmentation model
Fig. 2 is a structural diagram of the segmentation model according to an embodiment of the present invention; the construction specifically includes the following steps:
(3-a) construction of a query encoder
Using the convolutional neural network ResNet50 as the query encoder, the frame to be segmented passes through the first four layers of the encoder in turn, where the output of the second layer is f_C2, the output of the third layer is f_C3, and the output of the fourth layer is f_C;
(3-b) building a storage pool
the 1st frame, the (τ+1)-th frame, the (2τ+1)-th frame, ……, the (Nτ+1)-th frame of the video sequence and their corresponding segmentation labels are put into a storage pool, wherein τ is a positive integer with value range [1, 200], N is determined from the relative position τ_C of the frame to be segmented, and the symbol ⌊·⌋ denotes rounding down; this embodiment takes τ as 5;
(3-c) construction of a storage encoder
Using the convolutional neural network ResNet18 as the storage encoder, all images in the storage pool and their corresponding segmentation labels pass through the encoder to obtain f_M;
(3-d) construction of a Transformer module
Fig. 3 is a structural diagram of the Transformer module according to an embodiment of the present invention. The module consists of a Transformer encoder and a Transformer decoder; the Transformer encoder comprises a space-time integration module, a convolution layer with 256 convolution kernels of size 1, a multi-scale layer and a self-attention module; the Transformer decoder comprises a convolution layer with 512 convolution kernels of size 3, a convolution layer with 256 convolution kernels of size 1, a multi-scale layer, a self-attention module and a dual-branch cross-attention module, wherein the dual-branch cross-attention module consists of a query branch and a storage branch, the two branches have identical structures, each consisting of a multi-head attention module, two residual-and-layer-normalization modules and a fully connected feed-forward network, and the fully connected feed-forward network consists of two linear layers with input dimension 256 and hidden layer size 2048 and a ReLU activation layer; the self-attention module in the Transformer encoder and the self-attention module in the Transformer decoder have identical structures, each consisting of a multi-head attention module and a residual-and-layer-normalization module; all multi-head attention modules have the same structure;
The multi-scale layer in step (3-d) consists of t convolution layers with different convolution kernel sizes, and its output is calculated as:
MSL = Concat(S_1, …, S_i, …, S_t)
S_i = Conv_i(X_i; r_i)
where MSL denotes the output of the multi-scale layer, Concat(·) denotes the concatenation operation, t denotes the number of convolution layers in the multi-scale layer, t is a positive integer with value range [1, 50] and is taken as 2 in this embodiment, Conv_i(·) denotes the i-th convolution layer of the multi-scale layer, r_i denotes the convolution kernel size of the i-th convolution layer, r_i is a positive integer with value range [1, 100] and this embodiment takes r_1 as 2 and r_2 as 4, X_i denotes the input of the i-th convolution layer of the multi-scale layer, and S_i denotes the output of the i-th convolution layer; for the multi-scale layer in the Transformer encoder, X_i = M_1; for the multi-scale layer in the Transformer decoder, X_i = C_1, i = 1, 2, …, t.
The output of the multi-head attention module in step (3-d) is calculated as:
MultiHead(q, k, v) = Concat(A_1, …, A_i, …, A_s)·U_o
A_i = softmax((q·U_i^q)·(k·U_i^k)^T / √d)·(v·U_i^v)
where MultiHead(q, k, v) denotes the output of the multi-head attention module, softmax(·) denotes the activation function, and Concat(·) denotes the concatenation operation; q, k and v are the inputs of the multi-head attention module: for the multi-head attention module inside the self-attention module of the Transformer encoder, q = k = v = M_2; for the multi-head attention module inside the self-attention module of the Transformer decoder, q = k = v = C_2; for the multi-head attention module inside the storage branch of the dual-branch cross-attention module, q = M_3 and k = v = C_3; for the multi-head attention module inside the query branch, q = C_3 and k = v = M_3; s is the number of attention heads in the multi-head attention module, s is a positive integer with value range [1, 16] and is taken as 8 in this embodiment; A_i denotes the output of the i-th attention head, i = 1, 2, …, s; U_i^q, U_i^k and U_i^v denote the q, k and v parameter matrices of the i-th attention head, and U_o is a parameter matrix used to adjust the final output; T denotes the transpose operator; d is a hyperparameter, a positive integer with value range [1, 1000], taken as 32 in this embodiment;
(3-e) construction of a partition decoder
The segmentation decoder consists of a residual module, two upsampling modules and a prediction convolution module; the residual module consists of three convolution layers with 512 convolution kernels of size 3 and two ReLU activation layers, the first upsampling module consists of one convolution layer with 512 convolution kernels of size 3, three convolution layers with 256 convolution kernels of size 3, two ReLU activation layers and a bilinear interpolation with an expansion factor of 2, the second upsampling module consists of three convolution layers with 256 convolution kernels of size 3, two ReLU activation layers and a bilinear interpolation with an expansion factor of 2, and the prediction convolution module consists of one convolution layer with 256 convolution kernels and a bilinear interpolation with an expansion factor of 2;
(3-f) inputting f_C obtained in step (3-a) and f_M obtained in step (3-c) into the space-time integration module of the Transformer encoder constructed in step (3-d) to obtain f_M', wherein the specific calculation process is as follows:
f_M' = f_M · softmax(c(ConvKey(f_M), ConvKey(f_C)))
wherein ConvKey(·) is a key projection layer consisting of one convolution layer with 64 convolution kernels of size 3, c(·) denotes the negative squared Euclidean distance, and softmax(·) denotes the activation function;
(3-g) inputting f_M' obtained in step (3-f) sequentially into the convolution layer, the multi-scale layer and the self-attention module of the Transformer encoder to obtain M_1, M_2 and M_3, respectively; inputting f_C obtained in step (3-a) sequentially into the first convolution layer, the second convolution layer, the multi-scale layer and the self-attention module of the Transformer decoder constructed in step (3-d) to obtain C_0, C_1, C_2 and C_3, respectively; inputting M_3 and C_3 into the query branch of the dual-branch cross-attention module of the Transformer decoder constructed in step (3-d) to obtain C_4; and inputting M_3 and C_3 into the storage branch of the dual-branch cross-attention module of the Transformer decoder constructed in step (3-d) to obtain M_4;
(3-h) splicing C_0, C_4 and M_4 obtained in step (3-g) with f_M' obtained in step (3-f) and inputting the result into the residual module of the segmentation decoder constructed in step (3-e) to obtain f_O1; inputting f_O1 and f_C3 obtained in step (3-a) into the first upsampling module of the segmentation decoder constructed in step (3-e) to obtain f_O2; inputting f_O2 and f_C2 obtained in step (3-a) into the second upsampling module of the segmentation decoder constructed in step (3-e) to obtain f_O3; and inputting f_O3 into the prediction convolution module of the segmentation decoder constructed in step (3-e) to obtain the predicted segmentation label f_O4, completing the construction of the segmentation model.
(4) Constructing a loss function:
the loss function L uses cross entropy loss, defined as follows:
L = -(1/(H_Y·W_Y)) Σ_{i=1}^{H_Y} Σ_{j=1}^{W_Y} [ Y_ij·log(Ŷ_ij) + (1 - Y_ij)·log(1 - Ŷ_ij) ]
where Y is the real segmentation label, Ŷ is the predicted segmentation label, H_Y and W_Y are respectively the height and width of the real segmentation label, Y_ij is the pixel value of the pixel in the i-th row and j-th column of Y, Ŷ_ij is the pixel value of the pixel in the i-th row and j-th column of Ŷ, i = 1, 2, …, H_Y, j = 1, 2, …, W_Y, and log(·) denotes the natural logarithm.
(5) Training a segmentation model:
training the segmentation model constructed in the step (3) by utilizing the synthesized video training set obtained in the step (2-a), obtaining a loss value according to the loss function constructed in the step (4), and updating parameters in the segmentation model by using a random gradient descent method to obtain a pre-training model; training the pre-training model by using the real video training set obtained in the step (2-b), obtaining a loss value according to the loss function constructed in the step (4), and updating parameters in the segmentation model by using a random gradient descent method to obtain a trained segmentation model.
(6) Video object segmentation:
Acquire the video to be segmented and the segmentation label corresponding to its first frame, sequentially input the video, starting from its second frame, into the trained segmentation model obtained in step (5), and output the segmentation labels.
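A hedged sketch of the inference procedure in step (6): the first frame and its given label seed the storage pool, the remaining frames are segmented in order, and every τ-th frame is written back into the pool together with its prediction. Here `model` stands for the trained segmentation model built in step (3); its exact call signature is an assumption made for illustration.

```python
import torch

@torch.no_grad()
def segment_video(model, frames, first_label, tau=5):
    """frames: list of frame tensors; first_label: segmentation label of frames[0]."""
    pool = [(frames[0], first_label)]                  # storage pool seeded with frame 1
    outputs = [first_label]
    for t, frame in enumerate(frames[1:], start=2):    # segment from the second frame onward
        mem_frames = torch.stack([f for f, _ in pool])
        mem_labels = torch.stack([l for _, l in pool])
        pred = model(frame, mem_frames, mem_labels)    # predicted segmentation label
        outputs.append(pred)
        if (t - 1) % tau == 0:                         # keep frames 1, τ+1, 2τ+1, ...
            pool.append((frame, pred))
    return outputs
```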
Example 2
Video object segmentation experiments were performed on the public datasets YoutubeVOS2018, YoutubeVOS2019, DAVIS2016 and DAVIS2017 using the method of Example 1. The operating system was Linux Ubuntu 18.04; the method was implemented with the PyTorch 1.8.1 framework based on CUDA 10.1 and cuDNN 7.6.5, and two NVIDIA 2080Ti 11G GPUs were used for training and testing.
This embodiment uses the region similarity J, the contour accuracy F and their average J&F to evaluate the performance of the invention. The region similarity J is the average intersection-over-union between the predicted labels and the corresponding real labels:
J = |M ∩ G| / |M ∪ G|
where M is the predicted segmentation label, G is the real segmentation label, and the symbols ∩ and ∪ denote the intersection and union of two sets, respectively.
The contour precision F represents the average boundary similarity between the boundary of the estimated label and the real label boundary, and the calculation formula is as follows:
wherein P is c Is the accuracy between l (M) and l (G), R c Is the recall rate between l (M) and l (G), P c And R is c Obtained using bipartite graph matching (bipartite graph matching) calculation; l (M) represents the set of closed contours within the scope of the predictive segmentation label M, and l (G) represents the set of closed contours within the scope of the true segmentation label G.
Tables 1, 2, 3 and 4 compare the J, F and J&F scores of the method of the present invention with those of other methods on the test sets of YoutubeVOS2018, YoutubeVOS2019, DAVIS2016 and DAVIS2017, respectively; the method of the present invention achieves the highest J&F score on all four datasets.
FIG. 4 compares the segmentation results of an embodiment of the present invention on a small-scale target with those of other methods. The first, second, third and fourth rows show the segmentation results of TransVOS, CFBI+, STCN and the method of the present invention, respectively, on selected frames of video d79729a354 in YoutubeVOS2018; the number in the upper-left corner of each image is the frame number of the image in the video sequence, the image in each dashed box is an enlarged view of the small region in the corresponding solid box, and the first column is the first frame of the video with its real segmentation label. From the first row, second column, it can be seen that TransVOS is disturbed by the background and mistakes clutter for the distant person; from the second row, second and third columns, CFBI+ fails to segment the person's legs and arms correctly; from the third row, second, third, fourth and fifth columns, STCN confuses the distant person with vehicles in the background. From the fourth row, the method of the present invention segments the small-scale target accurately, and its result is superior to those of the other methods.
FIG. 5 compares the segmentation results of an embodiment of the present invention on similar targets with those of other methods. The first, second, third and fourth rows show the segmentation results of TransVOS, CFBI+, STCN and the method of the present invention, respectively, on selected frames of video 5d2020eff8 in YoutubeVOS2018; the number in the upper-left corner of each image is the frame number of the image in the video sequence, the targets marked with solid boxes are erroneous segmentation results, and the first column is the first frame of the video with its real segmentation label. It can be seen that STCN, CFBI+ and TransVOS confuse the fish to be segmented with similar-looking fish in the background, whereas the method of the present invention successfully distinguishes the segmentation target from its look-alikes in the background.
The above embodiments are only preferred embodiments of the present invention, and are not intended to limit the scope of the present invention, so variations according to the principles of the present invention should be covered.
TABLE 1
(Note: in the table, J_S and F_S denote the values of J and F for categories that appear in both the YoutubeVOS2018 training set and test set, and J_U and F_U denote the values of J and F for categories that appear only in the YoutubeVOS2018 test set.)
TABLE 2
(Note: in the table, J_S and F_S denote the values of J and F for categories that appear in both the YoutubeVOS2019 training set and test set, and J_U and F_U denote the values of J and F for categories that appear only in the YoutubeVOS2019 test set.)
TABLE 3
TABLE 4

Claims (3)

1. A Transformer-based semi-supervised video object segmentation method, characterized by comprising the following steps:
(1) Acquiring a data set and segmentation labels:
acquiring a video object segmentation dataset, a static image dataset and the segmentation labels corresponding to the two datasets, and forming an image pair from each image in the datasets and its corresponding segmentation label;
(2) The data expansion and processing method specifically comprises the following steps:
(2-a) after normalizing each image pair formed from the static image dataset obtained in step (1) and its corresponding segmentation label, repeating the following procedure to obtain a synthesized video training sample corresponding to each image, wherein the set of synthesized video training samples forms a synthesized video training set:
I. the short side of the image pair is reduced to w pixels and the long side is reduced by the same ratio, and the resulting image pair is randomly cropped to h × w pixels, wherein w is the width of the cropped image, h is the height of the cropped image, and w and h are positive integers with value range [10, 3000];
II. random scaling, random horizontal flipping, random color jittering and random grayscale conversion are sequentially applied to the cropped image pair to obtain an enhanced image pair corresponding to the image pair;
III. process II is repeated three times to obtain three enhanced image pairs corresponding to the image pair, and the three enhanced image pairs form one synthesized video training sample;
(2-b) after normalizing each video in the video object segmentation dataset obtained in step (1) and its corresponding segmentation labels, repeating the following procedure to obtain a real video training sample corresponding to each video, wherein the set of real video training samples forms a real video training set:
I. three image pairs are randomly extracted from the video and its corresponding segmentation labels;
II. the short sides of the three image pairs are reduced to w pixels and the long sides are reduced by the same ratio, and the resulting image pairs are randomly cropped to h × w pixels, wherein the meanings and values of w and h are the same as in step (2-a);
III. random cropping, color jittering and random grayscale conversion are sequentially applied to the three image pairs to obtain three enhanced image pairs, and the three enhanced image pairs form one real video training sample;
(3) The method for constructing the segmentation model specifically comprises the following steps:
(3-a) construction of a query encoder
using a convolutional neural network as the query encoder, and sequentially passing the frame to be segmented through the first four layers of the encoder, wherein the output of the second layer is f_C2, the output of the third layer is f_C3, and the output of the fourth layer is f_C;
(3-b) building a storage pool
the 1st frame, the (τ+1)-th frame, the (2τ+1)-th frame, ……, the (Nτ+1)-th frame of the video sequence and their corresponding segmentation labels are put into a storage pool, wherein τ is a positive integer with value range [1, 200], N is determined from the relative position τ_C of the frame to be segmented, and the symbol ⌊·⌋ denotes rounding down;
(3-c) construction of a storage encoder
using a convolutional neural network as the storage encoder, wherein all images in the storage pool and their corresponding segmentation labels pass through the encoder to obtain f_M;
(3-d) construction of a Transformer module
The module consists of a Transformer encoder and a Transformer decoder; the Transformer encoder comprises a space-time integration module, a convolution layer, a multi-scale layer and a self-attention module; the Transformer decoder comprises two convolution layers, a multi-scale layer, a self-attention module and a dual-branch cross-attention module, wherein the dual-branch cross-attention module consists of a query branch and a storage branch, the two branches have identical structures, each consisting of a multi-head attention module, two residual-and-layer-normalization modules and a fully connected feed-forward network, and the fully connected feed-forward network consists of two linear layers and a ReLU activation layer; the self-attention module in the Transformer encoder and the self-attention module in the Transformer decoder have identical structures, each consisting of a multi-head attention module and a residual-and-layer-normalization module; all multi-head attention modules have the same structure;
(3-e) construction of a segmentation decoder
The segmentation decoder consists of a residual module, two groups of upsampling modules and a prediction convolution module; the residual module consists of three convolution layers and two ReLU activation layers, the first upsampling module consists of four convolution layers, two ReLU activation layers and a bilinear interpolation, the second upsampling module consists of three convolution layers, two ReLU activation layers and a bilinear interpolation, and the prediction convolution module consists of one convolution layer and a bilinear interpolation;
(3-f) inputting f_C obtained in step (3-a) and f_M obtained in step (3-c) into the space-time integration module of the Transformer encoder constructed in step (3-d) to obtain f_M', wherein the specific calculation process is as follows:
f_M' = f_M · softmax(c(ConvKey(f_M), ConvKey(f_C)))
wherein ConvKey(·) is a key projection layer consisting of one convolution layer, c(·) denotes the negative squared Euclidean distance, and softmax(·) denotes the activation function;
(3-g) inputting f_M' obtained in step (3-f) sequentially into the convolution layer, the multi-scale layer and the self-attention module of the Transformer encoder to obtain M_1, M_2 and M_3, respectively; inputting f_C obtained in step (3-a) sequentially into the first convolution layer, the second convolution layer, the multi-scale layer and the self-attention module of the Transformer decoder constructed in step (3-d) to obtain C_0, C_1, C_2 and C_3, respectively; inputting M_3 and C_3 into the query branch of the dual-branch cross-attention module of the Transformer decoder constructed in step (3-d) to obtain C_4; and inputting M_3 and C_3 into the storage branch of the dual-branch cross-attention module of the Transformer decoder constructed in step (3-d) to obtain M_4;
(3-h) splicing C_0, C_4 and M_4 obtained in step (3-g) with f_M' obtained in step (3-f) and inputting the result into the residual module of the segmentation decoder constructed in step (3-e) to obtain f_O1; inputting f_O1 and f_C3 obtained in step (3-a) into the first upsampling module of the segmentation decoder constructed in step (3-e) to obtain f_O2; inputting f_O2 and f_C2 obtained in step (3-a) into the second upsampling module of the segmentation decoder constructed in step (3-e) to obtain f_O3; and inputting f_O3 into the prediction convolution module of the segmentation decoder constructed in step (3-e) to obtain the predicted segmentation label f_O4, completing the construction of the segmentation model;
(4) Constructing a loss function:
the loss function L uses cross entropy loss, defined as follows:
L = -(1/(H_Y·W_Y)) Σ_{i=1}^{H_Y} Σ_{j=1}^{W_Y} [ Y_ij·log(Ŷ_ij) + (1 - Y_ij)·log(1 - Ŷ_ij) ]
where Y is the real segmentation label, Ŷ is the predicted segmentation label, H_Y and W_Y are respectively the height and width of the real segmentation label, Y_ij is the pixel value of the pixel in the i-th row and j-th column of Y, Ŷ_ij is the pixel value of the pixel in the i-th row and j-th column of Ŷ, i = 1, 2, …, H_Y, j = 1, 2, …, W_Y, and log(·) denotes the natural logarithm;
(5) Training a segmentation model:
training the segmentation model constructed in the step (3) by utilizing the synthesized video training set obtained in the step (2-a), obtaining a loss value according to the loss function constructed in the step (4), and updating parameters in the segmentation model by using a random gradient descent method to obtain a pre-training model; training the pre-training model by using the real video training set obtained in the step (2-b), obtaining a loss value according to the loss function constructed in the step (4), and updating parameters in the segmentation model by using a random gradient descent method to obtain a trained segmentation model;
(6) Video object segmentation:
acquiring a video to be segmented and the segmentation label corresponding to its first frame, sequentially inputting the video, starting from its second frame, into the trained segmentation model obtained in step (5), and outputting the segmentation labels.
2. The Transformer-based semi-supervised video object segmentation method according to claim 1, wherein the multi-scale layer in step (3-d) consists of t convolution layers with different convolution kernel sizes, and its output is calculated as:
MSL = Concat(S_1, …, S_i, …, S_t)
S_i = Conv_i(X_i; r_i)
where MSL denotes the output of the multi-scale layer, Concat(·) denotes the concatenation operation, t denotes the number of convolution layers in the multi-scale layer, t is a positive integer with value range [1, 50], Conv_i(·) denotes the i-th convolution layer of the multi-scale layer, r_i denotes the convolution kernel size of the i-th convolution layer, r_i is a positive integer with value range [1, 100], X_i denotes the input of the i-th convolution layer of the multi-scale layer, and S_i denotes the output of the i-th convolution layer; for the multi-scale layer in the Transformer encoder, X_i = M_1; for the multi-scale layer in the Transformer decoder, X_i = C_1, i = 1, 2, …, t.
3. The Transformer-based semi-supervised video object segmentation method according to claim 1, wherein the output of the multi-head attention module in step (3-d) is calculated as:
MultiHead(q, k, v) = Concat(A_1, …, A_i, …, A_s)·U_o
A_i = softmax((q·U_i^q)·(k·U_i^k)^T / √d)·(v·U_i^v)
where MultiHead(q, k, v) denotes the output of the multi-head attention module, softmax(·) denotes the activation function, and Concat(·) denotes the concatenation operation; q, k and v are the inputs of the multi-head attention module: for the multi-head attention module inside the self-attention module of the Transformer encoder, q = k = v = M_2; for the multi-head attention module inside the self-attention module of the Transformer decoder, q = k = v = C_2; for the multi-head attention module inside the storage branch of the dual-branch cross-attention module, q = M_3 and k = v = C_3; for the multi-head attention module inside the query branch, q = C_3 and k = v = M_3; s is the number of attention heads in the multi-head attention module, s is a positive integer with value range [1, 16]; A_i denotes the output of the i-th attention head, i = 1, 2, …, s; U_i^q, U_i^k and U_i^v denote the q, k and v parameter matrices of the i-th attention head, and U_o is a parameter matrix used to adjust the final output; T denotes the transpose operator; d is a hyperparameter, a positive integer with value range [1, 1000].
CN202210098849.1A 2022-01-24 2022-01-24 Transformer-based semi-supervised video object segmentation method Active CN114429607B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210098849.1A CN114429607B (en) 2022-01-24 2022-01-24 Transformer-based semi-supervised video object segmentation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210098849.1A CN114429607B (en) 2022-01-24 2022-01-24 Transformer-based semi-supervised video object segmentation method

Publications (2)

Publication Number Publication Date
CN114429607A CN114429607A (en) 2022-05-03
CN114429607B true CN114429607B (en) 2024-03-29

Family

ID=81313102

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210098849.1A Active CN114429607B (en) 2022-01-24 2022-01-24 Transformer-based semi-supervised video object segmentation method

Country Status (1)

Country Link
CN (1) CN114429607B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116012388B (en) * 2023-03-28 2023-06-13 中南大学 Three-dimensional medical image segmentation method and imaging method for acute ischemic cerebral apoplexy

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685066A (en) * 2018-12-24 2019-04-26 中国矿业大学(北京) A kind of mine object detection and recognition method based on depth convolutional neural networks
CN110378288A (en) * 2019-07-19 2019-10-25 合肥工业大学 A kind of multistage spatiotemporal motion object detection method based on deep learning
WO2019218136A1 (en) * 2018-05-15 2019-11-21 深圳大学 Image segmentation method, computer device, and storage medium
CN111523410A (en) * 2020-04-09 2020-08-11 哈尔滨工业大学 Video saliency target detection method based on attention mechanism
CN111860162A (en) * 2020-06-17 2020-10-30 上海交通大学 Video crowd counting system and method
CN112036300A (en) * 2020-08-31 2020-12-04 合肥工业大学 Moving target detection method based on multi-scale space-time propagation layer
CN113570610A (en) * 2021-07-26 2021-10-29 北京百度网讯科技有限公司 Method and device for performing target segmentation on video by adopting semantic segmentation model

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20210061072A (en) * 2019-11-19 2021-05-27 삼성전자주식회사 Video segmentation method and apparatus

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019218136A1 (en) * 2018-05-15 2019-11-21 深圳大学 Image segmentation method, computer device, and storage medium
CN109685066A (en) * 2018-12-24 2019-04-26 中国矿业大学(北京) A kind of mine object detection and recognition method based on depth convolutional neural networks
CN110378288A (en) * 2019-07-19 2019-10-25 合肥工业大学 A kind of multistage spatiotemporal motion object detection method based on deep learning
CN111523410A (en) * 2020-04-09 2020-08-11 哈尔滨工业大学 Video saliency target detection method based on attention mechanism
CN111860162A (en) * 2020-06-17 2020-10-30 上海交通大学 Video crowd counting system and method
CN112036300A (en) * 2020-08-31 2020-12-04 合肥工业大学 Moving target detection method based on multi-scale space-time propagation layer
CN113570610A (en) * 2021-07-26 2021-10-29 北京百度网讯科技有限公司 Method and device for performing target segmentation on video by adopting semantic segmentation model

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Zongxin Yang et al. Collaborative Video Object Segmentation by Multi-Scale Foreground-Background Integration. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2021, 4701-4712. *
Research on video action recognition and detection methods based on spatio-temporal correlation; Li Dong; China Doctoral Dissertations Full-text Database (Information Science and Technology); 2021-09-15; I138-48 *

Also Published As

Publication number Publication date
CN114429607A (en) 2022-05-03

Similar Documents

Publication Publication Date Title
CN111489358B (en) Three-dimensional point cloud semantic segmentation method based on deep learning
CN109871798B (en) Remote sensing image building extraction method based on convolutional neural network
CN111898439A (en) Deep learning-based traffic scene joint target detection and semantic segmentation method
CN110599502B (en) Skin lesion segmentation method based on deep learning
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN114359130A (en) Road crack detection method based on unmanned aerial vehicle image
CN106874879A (en) Handwritten Digit Recognition method based on multiple features fusion and deep learning network extraction
CN114429607B (en) Transformer-based semi-supervised video object segmentation method
CN115953408B (en) YOLOv 7-based lightning arrester surface defect detection method
CN115035508A (en) Topic-guided remote sensing image subtitle generation method based on Transformer
CN114399533B (en) Single-target tracking method based on multi-level attention mechanism
CN114283120B (en) Domain-adaptive-based end-to-end multisource heterogeneous remote sensing image change detection method
CN111695455A (en) Low-resolution face recognition method based on coupling discrimination manifold alignment
CN113066089B (en) Real-time image semantic segmentation method based on attention guide mechanism
CN109670506A (en) Scene Segmentation and system based on Kronecker convolution
CN109543724B (en) Multilayer identification convolution sparse coding learning method
CN115017366B (en) Unsupervised video hash retrieval method based on multi-granularity contextualization and multi-structure preservation
CN116597183A (en) Multi-mode image feature matching method based on space and channel bi-dimensional attention
CN116363361A (en) Automatic driving method based on real-time semantic segmentation network
CN114821631A (en) Pedestrian feature extraction method based on attention mechanism and multi-scale feature fusion
CN114495163A (en) Pedestrian re-identification generation learning method based on category activation mapping
CN113436198A (en) Remote sensing image semantic segmentation method for collaborative image super-resolution reconstruction
CN115909045B (en) Two-stage landslide map feature intelligent recognition method based on contrast learning
CN114972746B (en) Medical image segmentation method based on multi-resolution overlapping attention mechanism
CN113393521B (en) High-precision flame positioning method and system based on dual semantic attention mechanism

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant