CN116402833A

CN116402833A - Knowledge distillation-based semi-supervised video target segmentation method

Info

Publication number: CN116402833A
Application number: CN202310677219.4A
Authority: CN
Inventors: 余锋; 李会引; 姜明华; 周鑫磊; 刘莉; 周昌龙; 宋坤芳
Original assignee: Wuhan Textile University
Current assignee: Wuhan Textile University
Priority date: 2023-06-08
Filing date: 2023-06-08
Publication date: 2023-07-07
Anticipated expiration: 2043-06-08
Also published as: CN116402833B

Abstract

The invention discloses a knowledge distillation-based semi-supervised video target segmentation method, which comprises the following steps of: firstly, designing a semi-supervised video target segmentation network, then designing a knowledge distillation-based network training method, determining a strategy of loss function design and adjustment parameters, and finally training to obtain a knowledge distillation-based semi-supervised video target segmentation model which can rapidly segment moving targets in an input video. The invention can obviously improve the video target segmentation performance and realize light weight by improving the deep learning algorithm, provides a semi-supervised video target segmentation scheme, greatly saves manpower, and can quickly segment targets in video by only marking a small amount of information.

Description

Knowledge distillation-based semi-supervised video target segmentation method

Technical Field

The invention relates to the field of semi-supervised video object segmentation, and more particularly to a semi-supervised video object segmentation method based on knowledge distillation.

Background

Video object segmentation is the localization and segmentation of the object of interest to the user. Video object segmentation techniques are generally classified into three categories, unsupervised video object segmentation, interactive video object segmentation, and semi-supervised video object segmentation, depending on the degree of human intervention in the segmentation process. The method comprises the steps of performing non-supervision video object segmentation on a given video sequence without specifying a specific segmentation object, automatically segmenting out a remarkable object in the video according to the appearance saliency and the motion salience of a foreground object in the video, performing manual intervention on the object in the segmentation process by a user, correcting a segmentation result, such as recalibrating error pixels in the segmentation process, and the like, wherein the semi-supervision video object segmentation is to provide a segmentation mask of the specific object of interest in a first frame, and automatically segmenting out the given object in a subsequent frame of the video.

Semi-supervised video object segmentation is mainstream, a user gives a segmentation mask of an object of interest in a first frame, and given object objects are automatically segmented in subsequent frames. In recent years, due to the development of deep learning, a full convolution neural network has made a great progress in the field of image segmentation. Under the driving of deep learning, the semi-supervised video target segmentation method mainly depends on three strategies, namely an online learning strategy, a mask propagation-based strategy and a feature matching-based strategy.

The Chinese patent with publication number of CN113344932A discloses a semi-supervised single-target video segmentation method, which realizes accurate segmentation of video images through an improved Unet network, but is easy to lose information by only extracting the characteristics of two adjacent pictures in the case of complex scenes.

Disclosure of Invention

Aiming at the defects or improvement demands of the prior art, the invention provides a knowledge distillation-based semi-supervised video target segmentation method, which aims to realize accurate segmentation of a video target under the condition that only first frame label information is adopted.

To achieve the above object, according to one aspect of the present invention, there is provided a knowledge distillation-based semi-supervised video object segmentation method, comprising the steps of:

step 1, constructing two semi-supervised video target segmentation network architectures with different sizes;

step 2, training a large semi-supervised video target segmentation network to obtain a network model with higher accuracy;

step 3, guiding a small video target segmentation network to train through a large semi-supervised video target segmentation network by using a knowledge distillation method;

and step 4, obtaining a knowledge-based distillation semi-supervised video target segmentation model with prediction accuracy similar to that of the large-scale semi-supervised video target segmentation network.

Furthermore, in the step 1, the two semi-supervised video target division network frameworks with different sizes have three branches; the 1 st branch is used for extracting the spatial characteristics of the 1 st frame and the previous video frame, the 2 nd branch is used for generating an optical flow information graph of the current frame and the previous video frame, generating a mask of the current frame through motion projection according to a prediction mask of the previous frame and the generated optical flow information graph, wherein during training, other frames are used except for a real mask of the first frame which is used for manual annotation, the generated mask replaces the real mask, and the 3 rd branch is used for extracting the spatial characteristics of the current frame picture;

the structure of the 1 st branch is the same as that of the 3 rd branch, and the 1 st branch consists of K layers of characteristic coding layers, wherein each characteristic coding layer comprises a plurality of characteristic coding convolution modules; the 2 nd branch adopts the existing optical flow generating network FlowNet, wherein the two semi-supervised video target division networks with different sizes are embodied in the 1 st branch and the 3 rd branch, and the number of the feature coding convolution modules in the feature coding layer is different;

the feature diagrams output by the 1 st branch and the 3 rd branch are 1/16 of the input video frame, and the specific operations of the three branches are as follows: the output of the 1 st branch is spliced with the output of the 4 th optical flow characteristic coding layer of the 2 nd branch, then the output of the 1 st branch and the output of the 3 rd branch are respectively sent into a characteristic coding convolution module, the obtained two output characteristic diagrams are added and sent into a decoder, and finally a prediction result is output.

Further, the specific processing procedure of the feature encoding convolution module is as follows;

when the feature images are sent to a feature code convolution module, firstly, carrying out 1X 1 convolution and a Tanh activation layer to obtain a first feature image, then carrying out 5X 5 DW convolution, processing the feature images generated by the 5X 5 DW convolution by three branches, wherein the first branch carries out 7X 1 DW convolution and 1X 7 DW convolution, the second branch carries out 11X 1 DW convolution and 1X 11 DW convolution, the third branch carries out 21X 1 DW convolution and 1X 21 DW convolution, and carrying out feature image addition operation on the feature images obtained by the three branches and the feature images generated by the 5X 5 DW convolution to output an intermediate process feature image;

and carrying out 1X 1 convolution on the intermediate process feature map to adjust the channel number to 1, carrying out feature map multiplication operation on the intermediate process feature map and the obtained first feature map to obtain a weighted feature map, and finally carrying out 1X 1 convolution, a Tanh activation layer, 1X 1 convolution and a Tanh activation layer on the weighted feature map in sequence to obtain an output feature map of the feature coding convolution module.

Further, the decoder is composed of two up-sampling convolution layers and a result output layer, wherein the up-sampling convolution layers are sequentially composed of a 1×1 convolution, a Tanh activation layer, a 3×3 convolution and a quadruple up-sampling layer, and the result output layer is sequentially composed of a 3×3 convolution and a 1×1 convolution.

Further, the value of K is 4, the characteristic coding convolution modules of the 4-layer characteristic coding layers in the large-scale semi-supervised video target division network are 3,5,27 and 3 respectively, and the characteristic coding convolution modules of the 4-layer characteristic coding layers in the small-scale semi-supervised video target division network are 2,4 and 2 respectively; the step length of the first 1×1 convolution in the first feature coding convolution module in each feature coding layer is 2, so as to reduce the height and width of the feature map and increase the channel dimension of the feature map.

Further, training a large-scale semi-supervised video target segmentation network in the step 2, and losing consistency of used time content

The following are provided:

wherein P (t) represents the video frame prediction mask at the current time t, G (t) represents the real mask of the video frame at the current time t, lambda _t1 Is the loss weight at time t, lambda _t2 Is the loss weight of the moment t-1, and epsilon is the content consistency weight of the moment t-1 and the moment t;

wherein L is _c The target consistency loss is represented as follows:

where N represents the number of pixels in the video frame mask, C represents the number of categories, p _ij A prediction probability, g, representing the j-th class of the i-th pixel _ij True label, gamma, representing the j-th class of the i-th pixel _j The controllable parameters are used for adjusting the importance of the easy classification category and the difficult classification category.

Further, the loss function L used in the knowledge distillation in step 3 _s The following are provided:

l in the formula _c Is the target consistency loss described above, f _L (t) a prediction feature map representing a large model at time t, f _S (t) represents a prediction feature map of a small model at a time t, g (t) represents a true mask at the time t, and alpha is a super parameter for controlling L _s The weight lost by the two parts in the (a).

Further, the specific strategy of knowledge distillation employed is:

s3-1, training to obtain a large model by combining the large semi-supervised video target segmentation network constructed in the step 1 with the time content consistency loss function in the step 2;

s3-2, training the large model obtained through training and the small semi-supervised video target segmentation network constructed in the step 1 at the same time, wherein a loss function adopted by the small semi-supervised video target segmentation network is also loss of consistency of time content;

s3-3, when two networks are trained simultaneously, firstly calculating errors of the time content consistency loss function in the step 2 respectively, and then substituting prediction results of the large model and the small model into the knowledge distillation loss function calculation errors in the step 3 simultaneously;

s3-4, adjusting parameters of the large model according to the time content consistency loss calculated by the large model, adjusting parameters of the small model according to the time content consistency loss calculated by the small model, and finally adjusting parameters of the large model and the small model according to the knowledge distillation loss.

In general, the above technical solutions conceived by the present invention, compared with the prior art, enable the following beneficial effects to be obtained:

(1) The network merges multi-scale convolution to obtain characteristic representation of each frame under different scales and inter-frame optical flow information, and the learning capacity of the network is enhanced.

(2) Through a knowledge distillation method, a small model is guided to train through a large model to obtain a small model with model segmentation accuracy similar to that of the large model, and the model is light.

(3) The method improves a plurality of loss functions to pay attention to interframe prediction loss and difficult-to-classify type prediction loss, and further improves stability and accuracy of a model.

Drawings

Fig. 1 is a flowchart of a technical scheme of a method for dividing a semi-supervised video target based on knowledge distillation according to an embodiment of the present invention.

Fig. 2 is a schematic diagram of a feature coding convolution module of a semi-supervised video object segmentation method based on knowledge distillation according to an embodiment of the present invention.

Fig. 3 is a schematic diagram of a network framework of a method for dividing a semi-supervised video object based on knowledge distillation according to an embodiment of the present invention.

Detailed Description

The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not collide with each other.

Referring to fig. 1, fig. 1 is a flowchart of a technical scheme of a knowledge distillation-based semi-supervised video object segmentation method, which specifically includes the following steps:

(1) Constructing two semi-supervised video target segmentation network architectures with different sizes;

specifically, the two semi-supervised video object segmentation network frameworks with different sizes have three branches, please refer to fig. 3, and fig. 3 is a network framework schematic diagram for implementing the semi-supervised video object segmentation method based on knowledge distillation provided by the embodiment. The invention uses the optical flow generating network to be the FlowNet, then uses the predicted mask of the FlowNet and the previous frame to carry on the motion projection to generate the mask of the current frame, besides the actual mask of the first frame which is used by the manual marking, the mask generated here is used to replace the actual mask, the 3 rd branch is used to extract the spatial feature of the current frame picture.

In particular, the three branches are interrelated. The 1 st branch and the 3 rd branch are composed of 4 layers of feature coding layers in the same structure, and the 2 nd branch is an existing optical flow generation network FlowNet. Wherein the two different sizes of semi-supervised video object segmentation networks are embodied in different numbers of feature encoding convolution modules in the 4-layer feature encoding layer in the 1 st branch and the 3 rd branch.

The decoder is composed of two up-sampling convolution layers and a result output layer, wherein the up-sampling convolution layers sequentially comprise a 1×1 convolution, a Tanh activation layer, a 3×3 convolution and a quadruple up-sampling layer, and the result output layer sequentially comprises a 3×3 convolution and a 1×1 convolution.

Specifically, referring to fig. 2, fig. 2 is a schematic diagram of a feature coding convolution module of a knowledge distillation-based semi-supervised video object segmentation method according to an embodiment of the present invention.

When the feature map is sent to the feature encoding convolution module, the 1×1 convolution and the Tanh activation layer are firstly performed to obtain a first feature map, then the 5×5 DW convolution (depth convolution) is performed, the feature map generated by the 5×5 DW convolution is processed by three branches, the first branch performs 7×1 DW convolution and 1×7 DW convolution, the second branch performs 11×1 DW convolution and 1×11 DW convolution, the third branch performs 21×1 DW convolution and 1×21 DW convolution, and the feature map obtained by the three branches and the feature map generated by the 5×5 DW convolution are subjected to feature map addition operation to output an intermediate process feature map.

Specifically, the feature coding convolution module in the feature coding layer is arranged in the network framework. The 1 st branch and the 3 rd branch in the large model and the small model network architecture are provided with 4 layers of characteristic coding layers, but the characteristic coding convolution modules of the 4 layers of characteristic coding layers in the large model are 3,5,27 and 3 respectively, and the characteristic coding convolution modules of the 4 layers of characteristic coding layers in the small model are 2,4 and 2 respectively. The step length of the first 1×1 convolution in the first feature coding convolution module in each feature coding layer is 2, so as to reduce the height and width of the feature map and increase the channel dimension of the feature map.

(2) Training a large semi-supervised video target segmentation network to obtain a network model with higher accuracy;

specifically, the loss function used in training the large model and the small model of the constructed semi-supervised video object segmentation network is the same, and the consistency of the used time content is lost

The following are provided:

in the formula, P (t) represents a video frame prediction feature map at time t, G (t) represents a video frame true mask at time t, G (t) is used for generating G (t) substitution except that the first frame is a manually calibrated true label, and other frames are all generated by using the optical flow map and the previous frame mask, and it should be noted that the previous frame mask used here is not generated by a network but is generated by using the optical flow map, wherein the true mask is single-channel, and the pixel value is 0 to the classification number minus one, if 16 foreground objects are added to the background class on the DAVIS 2017 data set, the pixel value range of G (t) is 0 to 16. For example, if there is a foreground object in the video frame, the real label of all pixels of the vehicle is 3, and in substituting the above-mentioned loss function, we usually convert G (t) into a form of a single thermal vector, for example, 3 can be represented as 00010000000000000 by a single thermal vector, the single thermal vector has 17 bits to represent 17 categories, subscripts in the vector are from 0 to 16, and the corresponding lower label of the position of the single thermal vector with 1 is the corresponding imageThe type of the pixel, wherein G (t) is H×W×1, and P (t) is H×W×N _classes It is the value in the one-to-one correspondence of each pixel in the subsequent G (t) to the one-hot vector pair P (t). Lambda (lambda) _t1 Is the loss weight at time t, typically set to 0.6, lambda _t2 Is the loss weight at time t-1, typically set to 0.25, epsilon is the content consistency weight at time t-1 and time t, typically set to 0.15. This formula is used to calculate the error between the predicted result of the current frame and the true value, the error between the predicted result of the previous frame and the true value, and the error between the predicted result of the previous frame and the current frame, where L _c The target consistency loss is represented as follows:

substituted into formula L _c P in (b) _ij Represents the value of a single characteristic point in P (t), and substitutes the value into a formula L _c G of (3) _ij Represents the single feature point value in G (t), N represents the number of pixels in the video frame mask, C represents the number of categories, p _ij A prediction probability, g, representing the j-th class of the i-th pixel _ij True label, gamma, representing the j-th class of the i-th pixel _j The controllable parameters are used for adjusting the importance of the easy classification category and the difficult classification category.

(3) Guiding a small video target segmentation network to train through a large semi-supervised video target segmentation network by using a knowledge distillation method;

specifically, a loss function L for knowledge distillation _s The following are provided:

l in the formula _c Is the target consistency loss described above, f _L (t) represents a prediction feature map of a large model (namely obtained after training of a large semi-supervised video target segmentation network) at the moment t, f _S (t) represents a prediction feature map of a small model (namely obtained by training a small video target segmentation network) at the moment t, and g (t) represents a real mask at the moment tAlpha is a super parameter, typically set to 0.4, for controlling L _s The weight lost by the two parts in the (a).

Specifically, we adopt knowledge distillation specific strategies as follows:

s3-1, training to obtain a large model by combining the large semi-supervised video target segmentation network constructed in the step (1) with the time content consistency loss function in the step (2);

s3-2, training the large model obtained by training and implementing the small semi-supervised video target segmentation network constructed in the step (1) at the same time;

s3-3, when two networks are trained simultaneously, firstly calculating errors of the time content consistency loss function in the implementation step (2), and then substituting the prediction results of the large model and the small model into the knowledge distillation loss function calculation errors in the implementation step (3) simultaneously.

(4) And obtaining a knowledge-based distillation semi-supervised video target segmentation model with similar prediction accuracy to the large model.

In particular, the final stored model is a small model that achieves stability and accuracy comparable to a large model through knowledge distillation algorithms, but with a much smaller model volume than the large model.

The invention provides a knowledge distillation-based semi-supervised video target segmentation method, which is characterized in that a semi-supervised video segmentation model is constructed through the implementation example, and high-accuracy video segmentation can be realized through operations such as multi-frame video frame feature extraction, optical flow generation and the like. The experimental effect of the invention is verified on a DAVIS 2017 data set, the average cross-over ratio of the method is improved by 1.8% under the condition that the knowledge distillation-based ratio is realized and the knowledge distillation is not used, and the small model integral average cross-over ratio constructed by the method reaches 82.2% which is an advanced level of the current semi-supervised video target segmentation.

Various modifications and alterations to this application may be made by those skilled in the art without departing from the spirit and scope of this application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims

1. The knowledge distillation-based semi-supervised video target segmentation method is characterized by comprising the following steps of:

and step 4, obtaining a knowledge-based distillation semi-supervised video target segmentation model with prediction accuracy similar to that of the large-scale semi-supervised video target segmentation network, and realizing video target segmentation by using the knowledge-based distillation semi-supervised video target segmentation model.

2. The knowledge distillation based semi-supervised video object segmentation method as set forth in claim 1, wherein: in the step 1, two semi-supervised video target segmentation network frameworks with different sizes are provided with three branches; the 1 st branch is used for extracting the spatial characteristics of the 1 st frame and the previous video frame, the 2 nd branch is used for generating an optical flow information graph of the current frame and the previous video frame, generating a mask of the current frame through motion projection according to a prediction mask of the previous frame and the generated optical flow information graph, wherein during training, other frames are used except for a real mask of the first frame which is used for manual annotation, the generated mask replaces the real mask, and the 3 rd branch is used for extracting the spatial characteristics of the current frame picture;

3. The knowledge distillation based semi-supervised video object segmentation method as set forth in claim 2, wherein: the specific processing procedure of the characteristic coding convolution module is as follows;

when the feature images are sent to a feature code convolution module, firstly, carrying out 1X 1 convolution and a Tanh activation layer to obtain a first feature image, then carrying out 5X 5 DW convolution, processing the feature images generated by the 5X 5 DW convolution by three branches, wherein the first branch carries out 7X 1 DW convolution and 1X 7 DW convolution, the second branch carries out 11X 1 DW convolution and 1X 11 DW convolution, the third branch carries out 21X 1 DW convolution and 1X 21 DW convolution, and carrying out feature image addition operation on the feature images obtained by the three branches and the feature images generated by the 5X 5 DW convolution to output an intermediate process feature image; DW convolution is depth convolution;

4. The knowledge distillation based semi-supervised video object segmentation method as set forth in claim 2, wherein: the decoder is composed of two up-sampling convolution layers and a result output layer, wherein the up-sampling convolution layers are sequentially composed of a 1×1 convolution, a Tanh activation layer, a 3×3 convolution and a quadruple up-sampling layer, and the result output layer is sequentially composed of a 3×3 convolution and a 1×1 convolution.

5. The knowledge distillation based semi-supervised video object segmentation method as set forth in claim 2, wherein: the value of K is 4, the characteristic coding convolution modules of the 4-layer characteristic coding layers in the large-scale semi-supervised video target division network are 3,5,27 and 3 respectively, and the characteristic coding convolution modules of the 4-layer characteristic coding layers in the small-scale semi-supervised video target division network are 2,4 and 2 respectively; the step length of the first 1×1 convolution in the first feature coding convolution module in each feature coding layer is 2, so as to reduce the height and width of the feature map and increase the channel dimension of the feature map.

6. The knowledge distillation based semi-supervised video object segmentation method as set forth in claim 1, wherein: training a large semi-supervised video target segmentation network in step 2, and losing consistency of used time content

The following are provided:

;

wherein L is _c The target consistency loss is represented by the following calculation formula:

;

where N represents an image in a video frame maskThe number of elements, C represents the number of classes, p _ij A prediction probability, g, representing the j-th class of the i-th pixel _ij True label, gamma, representing the j-th class of the i-th pixel _j The controllable parameters are used for adjusting the importance of the easy classification category and the difficult classification category.

7. The knowledge distillation based semi-supervised video object segmentation method as set forth in claim 5, wherein: loss function L for knowledge distillation in step 3 _s The following are provided:

;

l in the formula _c Is the target consistency loss described above, f _L (t) a prediction feature map representing a large model at time t, f _S (t) represents a prediction feature map of a small model at the time t, g (t) represents a true mask at the time t, and alpha is a super parameter for controlling L _s The weight lost by the two parts in the (a).

8. The knowledge distillation based semi-supervised video object segmentation method as set forth in claim 7, wherein: the specific strategy of knowledge distillation adopted is as follows: