CN116402833A - Knowledge distillation-based semi-supervised video target segmentation method - Google Patents

Knowledge distillation-based semi-supervised video target segmentation method

Info

Publication number
CN116402833A
CN116402833A (application CN202310677219.4A)
Authority
CN
China
Prior art keywords
convolution
semi-supervised
branch
feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310677219.4A
Other languages
Chinese (zh)
Other versions
CN116402833B (en)
Inventor
余锋
李会引
姜明华
周鑫磊
刘莉
周昌龙
宋坤芳
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuhan Textile University
Original Assignee
Wuhan Textile University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Wuhan Textile University
Priority to CN202310677219.4A
Publication of CN116402833A
Application granted
Publication of CN116402833B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/10 Segmentation; Edge detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/0895 Weakly supervised learning, e.g. semi-supervised or self-supervised learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/096 Transfer learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00 Road transport of goods or passengers
    • Y02T 10/10 Internal combustion engine [ICE] based vehicles
    • Y02T 10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a knowledge distillation-based semi-supervised video target segmentation method comprising the following steps: first, a semi-supervised video target segmentation network is designed; next, a knowledge distillation-based network training method is designed, and the loss functions and the parameter-adjustment strategy are determined; finally, training yields a knowledge distillation-based semi-supervised video target segmentation model that can rapidly segment moving targets in an input video. By improving the deep learning algorithm, the invention significantly improves video target segmentation performance while keeping the model lightweight. The provided semi-supervised video target segmentation scheme greatly saves labor: only a small amount of annotation is needed to quickly segment targets in a video.

Description

Knowledge distillation-based semi-supervised video target segmentation method
Technical Field
The invention relates to the field of semi-supervised video object segmentation, and more particularly to a semi-supervised video object segmentation method based on knowledge distillation.
Background
Video target segmentation is the localization and segmentation of an object of interest to the user. Depending on the degree of human intervention in the segmentation process, video target segmentation techniques are generally divided into three categories: unsupervised, interactive, and semi-supervised video target segmentation. Unsupervised video target segmentation requires no specific target to be designated for a given video sequence; salient objects in the video are segmented automatically according to the appearance saliency and motion saliency of the foreground. In interactive video target segmentation, the user intervenes manually during segmentation and corrects the result, for example by recalibrating mislabeled pixels. In semi-supervised video target segmentation, a segmentation mask of the specific object of interest is provided in the first frame, and the given object is segmented automatically in the subsequent frames of the video.
Semi-supervised video target segmentation is the mainstream setting: the user provides a segmentation mask of the object of interest in the first frame, and the given object is segmented automatically in the subsequent frames. In recent years, driven by the development of deep learning, fully convolutional neural networks have made great progress in image segmentation. Deep learning-based semi-supervised video target segmentation methods mainly rely on three strategies: online learning, mask propagation, and feature matching.
The Chinese patent with publication number CN113344932A discloses a semi-supervised single-target video segmentation method that achieves accurate segmentation of video images through an improved Unet network; however, because it extracts features only from two adjacent frames, it easily loses information in complex scenes.
Disclosure of Invention
In view of the above defects or improvement demands of the prior art, the invention provides a knowledge distillation-based semi-supervised video target segmentation method, which aims to achieve accurate segmentation of a video target using only first-frame annotation information.
To achieve the above object, according to one aspect of the present invention, there is provided a knowledge distillation-based semi-supervised video object segmentation method, comprising the steps of:
Step 1: construct two semi-supervised video target segmentation network architectures of different sizes;
Step 2: train the large semi-supervised video target segmentation network to obtain a network model with high accuracy;
Step 3: use knowledge distillation so that the large semi-supervised video target segmentation network guides the training of the small video target segmentation network;
Step 4: obtain a knowledge distillation-based semi-supervised video target segmentation model whose prediction accuracy is close to that of the large semi-supervised video target segmentation network.
Furthermore, in step 1 the two semi-supervised video target segmentation network frameworks of different sizes each have three branches. Branch 1 extracts the spatial features of the 1st frame and the previous video frame. Branch 2 generates an optical flow information map between the current frame and the previous video frame, and generates a mask for the current frame through motion projection from the predicted mask of the previous frame and the generated optical flow map; during training, except for the manually annotated real mask of the first frame, the other frames use this generated mask in place of a real mask. Branch 3 extracts the spatial features of the current frame.
Branch 1 has the same structure as branch 3 and consists of K feature encoding layers, each containing several feature encoding convolution modules. Branch 2 adopts the existing optical flow generation network FlowNet. The two network sizes differ only in the number of feature encoding convolution modules per feature encoding layer in branches 1 and 3.
The feature maps output by branches 1 and 3 are 1/16 the size of the input video frame. The three branches are combined as follows: the output of branch 1 is concatenated with the output of the 4th optical flow feature encoding layer of branch 2; this result and the output of branch 3 are each fed into a feature encoding convolution module; the two resulting feature maps are added and fed into a decoder, which finally outputs the prediction result.
Further, the specific processing procedure of the feature encoding convolution module is as follows:
When a feature map enters the feature encoding convolution module, it first passes through a 1×1 convolution and a Tanh activation layer to obtain a first feature map, then through a 5×5 DW (depthwise) convolution. The feature map produced by the 5×5 DW convolution is processed by three branches: the first applies a 7×1 DW convolution and a 1×7 DW convolution, the second an 11×1 DW convolution and a 1×11 DW convolution, and the third a 21×1 DW convolution and a 1×21 DW convolution. The feature maps from the three branches and the feature map produced by the 5×5 DW convolution are added elementwise to output an intermediate feature map.
A 1×1 convolution then adjusts the channel number of the intermediate feature map to 1, and this map is multiplied elementwise with the first feature map to obtain a weighted feature map. Finally, the weighted feature map passes in sequence through a 1×1 convolution, a Tanh activation layer, a 1×1 convolution, and a Tanh activation layer to produce the output feature map of the feature encoding convolution module.
Further, the decoder consists of two up-sampling convolution layers and a result output layer. Each up-sampling convolution layer consists in sequence of a 1×1 convolution, a Tanh activation layer, a 3×3 convolution, and a four-fold up-sampling layer; the result output layer consists in sequence of a 3×3 convolution and a 1×1 convolution.
Further, the value of K is 4. The four feature encoding layers of the large semi-supervised video target segmentation network contain 3, 5, 27 and 3 feature encoding convolution modules respectively, and those of the small semi-supervised video target segmentation network contain 2, 4 and 2 respectively. The first 1×1 convolution of the first feature encoding convolution module in each feature encoding layer has a stride of 2, which reduces the height and width of the feature map and increases its channel dimension.
Further, the large semi-supervised video target segmentation network in step 2 is trained with the temporal content consistency loss $L_{tc}$, given by:

$$L_{tc} = \lambda_{t1} L_c\big(P(t), G(t)\big) + \lambda_{t2} L_c\big(P(t-1), G(t-1)\big) + \varepsilon L_c\big(P(t), P(t-1)\big)$$
wherein P(t) represents the video frame prediction mask at the current time t, G(t) represents the real mask of the video frame at the current time t, $\lambda_{t1}$ is the loss weight at time t, $\lambda_{t2}$ is the loss weight at time t-1, and $\varepsilon$ is the content consistency weight between time t-1 and time t;
wherein $L_c$ represents the target consistency loss, calculated as follows:

$$L_c = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} g_{ij}\,(1-p_{ij})^{\gamma_j}\log(p_{ij})$$
where N represents the number of pixels in the video frame mask, C represents the number of categories, $p_{ij}$ represents the prediction probability of the j-th class for the i-th pixel, $g_{ij}$ represents the true label of the j-th class for the i-th pixel, and $\gamma_j$ is a controllable parameter used to adjust the relative importance of easy and hard categories.
Further, the loss function $L_s$ used in the knowledge distillation in step 3 is as follows:

$$L_s = \alpha\,L_c\big(f_S(t), f_L(t)\big) + (1-\alpha)\,L_c\big(f_S(t), g(t)\big)$$
where $L_c$ is the target consistency loss described above, $f_L(t)$ represents the prediction feature map of the large model at time t, $f_S(t)$ represents the prediction feature map of the small model at time t, $g(t)$ represents the real mask at time t, and $\alpha$ is a hyperparameter controlling the weight of the two parts of $L_s$.
Further, the specific strategy of knowledge distillation employed is:
S3-1: train the large semi-supervised video target segmentation network constructed in step 1 with the temporal content consistency loss of step 2 to obtain the large model;
S3-2: train the resulting large model and the small semi-supervised video target segmentation network constructed in step 1 simultaneously, the small network also using the temporal content consistency loss;
S3-3: while the two networks are trained simultaneously, first compute the temporal content consistency loss of step 2 for each network separately, then substitute the prediction results of the large and small models into the knowledge distillation loss of step 3 to compute its error;
S3-4: adjust the parameters of the large model according to its temporal content consistency loss, adjust the parameters of the small model according to its temporal content consistency loss, and finally adjust the parameters of both models according to the knowledge distillation loss.
In general, compared with the prior art, the above technical solutions conceived by the present invention achieve the following beneficial effects:
(1) The network fuses multi-scale convolutions to obtain feature representations of each frame at different scales together with inter-frame optical flow information, strengthening the learning capability of the network.
(2) Through knowledge distillation, the large model guides the training of the small model, yielding a small model whose segmentation accuracy is similar to that of the large model and thereby making the model lightweight.
(3) The method improves several loss functions to attend to inter-frame prediction loss and to the prediction loss of hard-to-classify categories, further improving the stability and accuracy of the model.
Drawings
Fig. 1 is a flowchart of the technical scheme of the knowledge distillation-based semi-supervised video target segmentation method according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of the feature encoding convolution module of the knowledge distillation-based semi-supervised video target segmentation method according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of the network framework of the knowledge distillation-based semi-supervised video target segmentation method according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the drawings and embodiments, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
Referring to fig. 1, fig. 1 is a flowchart of the technical scheme of the knowledge distillation-based semi-supervised video target segmentation method, which specifically includes the following steps:
(1) Constructing two semi-supervised video target segmentation network architectures with different sizes;
specifically, the two semi-supervised video object segmentation network frameworks with different sizes have three branches, please refer to fig. 3, and fig. 3 is a network framework schematic diagram for implementing the semi-supervised video object segmentation method based on knowledge distillation provided by the embodiment. The invention uses the optical flow generating network to be the FlowNet, then uses the predicted mask of the FlowNet and the previous frame to carry on the motion projection to generate the mask of the current frame, besides the actual mask of the first frame which is used by the manual marking, the mask generated here is used to replace the actual mask, the 3 rd branch is used to extract the spatial feature of the current frame picture.
In particular, the three branches are interrelated. Branches 1 and 3 have the same structure, each composed of 4 feature encoding layers, while branch 2 is the existing optical flow generation network FlowNet. The two network sizes differ only in the number of feature encoding convolution modules in the 4 feature encoding layers of branches 1 and 3.
The feature maps output by branches 1 and 3 are 1/16 the size of the input video frame. The three branches are combined as follows: the output of branch 1 is concatenated with the output of the 4th optical flow feature encoding layer of branch 2; this result and the output of branch 3 are each fed into a feature encoding convolution module; the two resulting feature maps are added and fed into a decoder, which finally outputs the prediction result.
The decoder consists of two up-sampling convolution layers and a result output layer. Each up-sampling convolution layer consists in sequence of a 1×1 convolution, a Tanh activation layer, a 3×3 convolution, and a four-fold up-sampling layer; the result output layer consists in sequence of a 3×3 convolution and a 1×1 convolution.
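The patent describes the branch fusion and the decoder only in text. The following PyTorch sketch is one way to realize them; the channel widths, the up-sampling mode, and the class count (17, matching the DAVIS 2017 example later in the description) are illustrative assumptions:

```python
import torch
import torch.nn as nn

class UpsamplingConvLayer(nn.Module):
    """1x1 conv -> Tanh -> 3x3 conv -> 4x up-sampling, as described for the decoder."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(in_ch, in_ch, 1),
            nn.Tanh(),
            nn.Conv2d(in_ch, out_ch, 3, padding=1),
            nn.Upsample(scale_factor=4, mode="bilinear", align_corners=False),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.block(x)

class Decoder(nn.Module):
    """Two up-sampling convolution layers (4x each, 16x overall) and a result output layer."""
    def __init__(self, in_ch: int = 256, num_classes: int = 17):
        super().__init__()
        self.up1 = UpsamplingConvLayer(in_ch, in_ch // 2)
        self.up2 = UpsamplingConvLayer(in_ch // 2, in_ch // 4)
        self.out = nn.Sequential(
            nn.Conv2d(in_ch // 4, in_ch // 4, 3, padding=1),  # 3x3 convolution
            nn.Conv2d(in_ch // 4, num_classes, 1),            # 1x1 convolution
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.out(self.up2(self.up1(x)))

def fuse_and_decode(branch1_out, flow_feat4, branch3_out, enc_mod_a, enc_mod_b, decoder):
    """Concatenate branch 1 with the 4th optical-flow feature layer, encode both paths,
    add the two feature maps, and decode to the prediction result."""
    x = enc_mod_a(torch.cat([branch1_out, flow_feat4], dim=1))
    y = enc_mod_b(branch3_out)
    return decoder(x + y)
```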
Specifically, referring to fig. 2, fig. 2 is a schematic diagram of the feature encoding convolution module of the knowledge distillation-based semi-supervised video target segmentation method according to an embodiment of the present invention.
When a feature map enters the feature encoding convolution module, it first passes through a 1×1 convolution and a Tanh activation layer to obtain a first feature map, then through a 5×5 DW convolution (depthwise convolution). The feature map produced by the 5×5 DW convolution is processed by three branches: the first applies a 7×1 DW convolution and a 1×7 DW convolution, the second an 11×1 DW convolution and a 1×11 DW convolution, and the third a 21×1 DW convolution and a 1×21 DW convolution. The feature maps from the three branches and the feature map produced by the 5×5 DW convolution are added elementwise to output an intermediate feature map.
A 1×1 convolution then adjusts the channel number of the intermediate feature map to 1, and this map is multiplied elementwise with the first feature map to obtain a weighted feature map. Finally, the weighted feature map passes in sequence through a 1×1 convolution, a Tanh activation layer, a 1×1 convolution, and a Tanh activation layer to produce the output feature map of the feature encoding convolution module.
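A minimal PyTorch sketch of this feature encoding convolution module follows. Treating "addition" and "multiplication" as elementwise operations and letting the entry 1×1 convolution change the channel count are assumptions, since the text leaves these details open:

```python
import torch
import torch.nn as nn

class FeatureEncodingConvModule(nn.Module):
    """Multi-scale depthwise (DW) convolutions plus a 1-channel spatial weighting map."""
    def __init__(self, in_ch: int, channels: int, stride: int = 1):
        super().__init__()
        # 1x1 convolution + Tanh -> "first feature map" (stride 2 in the first module of a layer).
        self.entry = nn.Sequential(nn.Conv2d(in_ch, channels, 1, stride=stride), nn.Tanh())
        self.dw5 = nn.Conv2d(channels, channels, 5, padding=2, groups=channels)

        def strip(k: int) -> nn.Sequential:  # k x 1 followed by 1 x k DW convolutions
            return nn.Sequential(
                nn.Conv2d(channels, channels, (k, 1), padding=(k // 2, 0), groups=channels),
                nn.Conv2d(channels, channels, (1, k), padding=(0, k // 2), groups=channels),
            )

        self.branch7, self.branch11, self.branch21 = strip(7), strip(11), strip(21)
        self.to_weight = nn.Conv2d(channels, 1, 1)  # adjust the channel number to 1
        self.exit = nn.Sequential(
            nn.Conv2d(channels, channels, 1), nn.Tanh(),
            nn.Conv2d(channels, channels, 1), nn.Tanh(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        first = self.entry(x)                                            # first feature map
        d = self.dw5(first)                                              # 5x5 DW convolution
        mid = d + self.branch7(d) + self.branch11(d) + self.branch21(d)  # intermediate map
        weighted = self.to_weight(mid) * first                           # elementwise weighting
        return self.exit(weighted)
```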
Specifically, the feature encoding convolution modules are arranged in the feature encoding layers of the network framework. Branches 1 and 3 in both the large-model and small-model architectures have 4 feature encoding layers, but the four feature encoding layers of the large model contain 3, 5, 27 and 3 feature encoding convolution modules respectively, while those of the small model contain 2, 4 and 2 respectively. The first 1×1 convolution of the first feature encoding convolution module in each feature encoding layer has a stride of 2, which reduces the height and width of the feature map and increases its channel dimension.
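Stacking the module above into a branch then follows the stated depths; the channel widths here are assumptions (the patent only says the stride-2 convolution increases the channel dimension):

```python
import torch.nn as nn

def make_branch(depths=(3, 5, 27, 3), widths=(64, 128, 256, 512), in_ch=3):
    """Build branch 1 or branch 3 as K=4 feature encoding layers.

    Large network: depths (3, 5, 27, 3). The first module of each layer uses
    stride 2, giving the overall 16x spatial reduction stated in the text.
    """
    layers = []
    for depth, width in zip(depths, widths):
        mods = [FeatureEncodingConvModule(in_ch, width, stride=2)]
        mods += [FeatureEncodingConvModule(width, width) for _ in range(depth - 1)]
        layers.append(nn.Sequential(*mods))
        in_ch = width
    return nn.Sequential(*layers)
```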
(2) Training a large semi-supervised video target segmentation network to obtain a network model with higher accuracy;
Specifically, the same loss function is used when training the large model and the small model of the constructed semi-supervised video target segmentation network, namely the temporal content consistency loss $L_{tc}$:

$$L_{tc} = \lambda_{t1} L_c\big(P(t), G(t)\big) + \lambda_{t2} L_c\big(P(t-1), G(t-1)\big) + \varepsilon L_c\big(P(t), P(t-1)\big)$$
In the formula, P(t) represents the prediction feature map of the video frame at time t and G(t) represents the real mask of the video frame at time t. Except for the first frame, whose real label is manually calibrated, the G(t) of every other frame is generated from the optical flow map and the previous-frame mask; note that the previous-frame mask used here is not the network output but the mask generated with the optical flow map. The real mask is single-channel, with pixel values ranging from 0 to the number of classes minus one; for example, on the DAVIS 2017 dataset, with 16 foreground objects plus the background class, the pixel values of G(t) range from 0 to 16. If a foreground object in the video frame, say a vehicle, has class 3, the real label of all of the vehicle's pixels is 3. When substituting into the above loss function, G(t) is usually converted to one-hot form: 3 is represented by the one-hot vector 00010000000000000, whose 17 bits represent the 17 categories with indices 0 to 16, the index of the 1 giving the class of the corresponding pixel. G(t) has shape H×W×1 and P(t) has shape H×W×N_classes; each pixel of the one-hot-encoded G(t) is compared with the corresponding values of P(t). $\lambda_{t1}$ is the loss weight at time t, typically set to 0.6; $\lambda_{t2}$ is the loss weight at time t-1, typically set to 0.25; $\varepsilon$ is the content consistency weight between time t-1 and time t, typically set to 0.15. The formula computes the error between the prediction of the current frame and the ground truth, the error between the prediction of the previous frame and the ground truth, and the error between the predictions of the previous and current frames, where $L_c$ represents the target consistency loss, calculated as follows:
$$L_c = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} g_{ij}\,(1-p_{ij})^{\gamma_j}\log(p_{ij})$$
When substituting into $L_c$, $p_{ij}$ takes the value of a single feature point in P(t) and $g_{ij}$ the value of a single feature point in G(t). N represents the number of pixels in the video frame mask, C represents the number of categories, $p_{ij}$ represents the prediction probability of the j-th class for the i-th pixel, $g_{ij}$ represents the true label of the j-th class for the i-th pixel, and $\gamma_j$ is a controllable parameter used to adjust the relative importance of easy and hard categories.
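As an illustrative sketch only: the analytic form of $L_c$ is reconstructed here as a focal-style weighted cross-entropy, which is an assumption consistent with the role described for $\gamma_j$, and the three terms of $L_{tc}$ follow the weights given above.

```python
import torch

def target_consistency_loss(pred, target, gamma):
    """L_c over N pixels and C classes.

    pred:   (N, C) per-pixel class probabilities, e.g. P(t) flattened
    target: (N, C) one-hot ground truth G(t), or another prediction for the
            inter-frame consistency term
    gamma:  (C,) per-class parameters gamma_j weighting easy vs. hard categories
    """
    eps = 1e-7
    focal = (1.0 - pred).clamp(min=eps) ** gamma        # down-weight easy pixels
    return -(target * focal * pred.clamp(min=eps).log()).sum(dim=1).mean()

def temporal_content_consistency_loss(p_t, p_prev, g_t, g_prev, gamma,
                                      lam_t1=0.6, lam_t2=0.25, eps_w=0.15):
    """L_tc: current-frame error + previous-frame error + inter-frame consistency."""
    return (lam_t1 * target_consistency_loss(p_t, g_t, gamma)
            + lam_t2 * target_consistency_loss(p_prev, g_prev, gamma)
            + eps_w * target_consistency_loss(p_t, p_prev, gamma))
```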
(3) Guiding a small video target segmentation network to train through a large semi-supervised video target segmentation network by using a knowledge distillation method;
Specifically, the loss function $L_s$ used for knowledge distillation is as follows:

$$L_s = \alpha\,L_c\big(f_S(t), f_L(t)\big) + (1-\alpha)\,L_c\big(f_S(t), g(t)\big)$$
where $L_c$ is the target consistency loss described above; $f_L(t)$ represents the prediction feature map at time t of the large model (i.e., obtained by training the large semi-supervised video target segmentation network); $f_S(t)$ represents the prediction feature map at time t of the small model (i.e., obtained by training the small video target segmentation network); $g(t)$ represents the real mask at time t; and $\alpha$ is a hyperparameter, typically set to 0.4, controlling the weight of the two parts of $L_s$.
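Under the same reconstruction, $L_s$ pairs a distillation term (small-model vs. large-model prediction) with a supervised term (small-model prediction vs. real mask); the exact pairing is an assumption consistent with the variable definitions:

```python
def knowledge_distillation_loss(f_small, f_large, g_true, gamma, alpha=0.4):
    """L_s: alpha-weighted sum of the distillation and supervised terms.

    Reuses target_consistency_loss from the sketch above. The large model's
    prediction is not detached because, per S3-4 below, both models are
    adjusted according to this loss.
    """
    distill = target_consistency_loss(f_small, f_large, gamma)
    supervised = target_consistency_loss(f_small, g_true, gamma)
    return alpha * distill + (1.0 - alpha) * supervised
```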
Specifically, the knowledge distillation strategy adopted is as follows:
S3-1: train the large semi-supervised video target segmentation network constructed in implementation step (1) with the temporal content consistency loss of implementation step (2) to obtain the large model;
S3-2: train the resulting large model and the small semi-supervised video target segmentation network constructed in implementation step (1) simultaneously;
S3-3: while the two networks are trained simultaneously, first compute the temporal content consistency loss of implementation step (2) for each network separately, then substitute the prediction results of the large and small models into the knowledge distillation loss of implementation step (3) to compute its error;
S3-4: adjust the parameters of the large model according to its temporal content consistency loss, adjust the parameters of the small model according to its temporal content consistency loss, and finally adjust the parameters of both models according to the knowledge distillation loss. A schematic training step combining these sub-steps is sketched below.
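Putting S3-1 to S3-4 together, one simultaneous training step might look like the following sketch (simplified: one optimizer per model, a single frame pair per step, models assumed to return predictions for frames t and t-1; all names illustrative):

```python
def distillation_training_step(large_model, small_model, opt_large, opt_small,
                               frames, masks, gamma):
    """One joint training step over the large (teacher) and small (student) networks."""
    # S3-3: compute the temporal content consistency loss for each network separately.
    p_large_t, p_large_prev = large_model(frames)
    p_small_t, p_small_prev = small_model(frames)
    loss_large = temporal_content_consistency_loss(
        p_large_t, p_large_prev, masks["t"], masks["t-1"], gamma)
    loss_small = temporal_content_consistency_loss(
        p_small_t, p_small_prev, masks["t"], masks["t-1"], gamma)
    # S3-3: substitute both models' predictions into the knowledge distillation loss.
    loss_kd = knowledge_distillation_loss(p_small_t, p_large_t, masks["t"], gamma)
    # S3-4: each model is adjusted by its own consistency loss, and both by the
    # distillation loss (here folded into a single backward pass for brevity).
    opt_large.zero_grad()
    opt_small.zero_grad()
    (loss_large + loss_small + loss_kd).backward()
    opt_large.step()
    opt_small.step()
```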
(4) Obtain a knowledge distillation-based semi-supervised video target segmentation model with prediction accuracy similar to that of the large model.
In particular, the final stored model is the small model, which through the knowledge distillation algorithm achieves stability and accuracy comparable to the large model's while having a much smaller model volume.
The invention provides a knowledge distillation-based semi-supervised video target segmentation method. Through this embodiment a semi-supervised video segmentation model is constructed, and high-accuracy video segmentation is achieved through operations such as multi-frame feature extraction and optical flow generation. The experimental effect of the invention was verified on the DAVIS 2017 dataset: compared with the same model trained without knowledge distillation, the method improves the mean intersection-over-union by 1.8%, and the overall mean intersection-over-union of the small model constructed by the method reaches 82.2%, an advanced level for current semi-supervised video target segmentation.
Various modifications and alterations to this application may be made by those skilled in the art without departing from the spirit and scope of this application. Thus, if such modifications and variations of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to cover such modifications and variations.

Claims (8)

1. A knowledge distillation-based semi-supervised video target segmentation method, characterized by comprising the following steps:
Step 1: construct two semi-supervised video target segmentation network architectures of different sizes;
Step 2: train the large semi-supervised video target segmentation network to obtain a network model with high accuracy;
Step 3: use knowledge distillation so that the large semi-supervised video target segmentation network guides the training of the small video target segmentation network;
Step 4: obtain a knowledge distillation-based semi-supervised video target segmentation model whose prediction accuracy is close to that of the large semi-supervised video target segmentation network, and use this model to realize video target segmentation.
2. The knowledge distillation-based semi-supervised video target segmentation method as set forth in claim 1, wherein: in step 1 the two semi-supervised video target segmentation network frameworks of different sizes each have three branches; branch 1 extracts the spatial features of the 1st frame and the previous video frame; branch 2 generates an optical flow information map between the current frame and the previous video frame, and generates a mask for the current frame through motion projection from the predicted mask of the previous frame and the generated optical flow map, wherein during training, except for the manually annotated real mask of the first frame, the other frames use this generated mask in place of a real mask; and branch 3 extracts the spatial features of the current frame;
branch 1 has the same structure as branch 3 and consists of K feature encoding layers, each containing several feature encoding convolution modules; branch 2 adopts the existing optical flow generation network FlowNet; the two network sizes differ only in the number of feature encoding convolution modules per feature encoding layer in branches 1 and 3;
the feature maps output by branches 1 and 3 are 1/16 the size of the input video frame, and the three branches are combined as follows: the output of branch 1 is concatenated with the output of the 4th optical flow feature encoding layer of branch 2; this result and the output of branch 3 are each fed into a feature encoding convolution module; the two resulting feature maps are added and fed into a decoder, which finally outputs the prediction result.
3. The knowledge distillation-based semi-supervised video target segmentation method as set forth in claim 2, wherein the specific processing procedure of the feature encoding convolution module is as follows:
when a feature map enters the feature encoding convolution module, it first passes through a 1×1 convolution and a Tanh activation layer to obtain a first feature map, then through a 5×5 DW convolution; the feature map produced by the 5×5 DW convolution is processed by three branches, the first applying a 7×1 DW convolution and a 1×7 DW convolution, the second an 11×1 DW convolution and a 1×11 DW convolution, and the third a 21×1 DW convolution and a 1×21 DW convolution; the feature maps from the three branches and the feature map produced by the 5×5 DW convolution are added elementwise to output an intermediate feature map; DW convolution denotes depthwise convolution;
a 1×1 convolution then adjusts the channel number of the intermediate feature map to 1, and this map is multiplied elementwise with the first feature map to obtain a weighted feature map; finally, the weighted feature map passes in sequence through a 1×1 convolution, a Tanh activation layer, a 1×1 convolution, and a Tanh activation layer to produce the output feature map of the feature encoding convolution module.
4. The knowledge distillation-based semi-supervised video target segmentation method as set forth in claim 2, wherein: the decoder consists of two up-sampling convolution layers and a result output layer, each up-sampling convolution layer consisting in sequence of a 1×1 convolution, a Tanh activation layer, a 3×3 convolution and a four-fold up-sampling layer, and the result output layer consisting in sequence of a 3×3 convolution and a 1×1 convolution.
5. The knowledge distillation-based semi-supervised video target segmentation method as set forth in claim 2, wherein: the value of K is 4; the four feature encoding layers of the large semi-supervised video target segmentation network contain 3, 5, 27 and 3 feature encoding convolution modules respectively, and those of the small semi-supervised video target segmentation network contain 2, 4 and 2 respectively; the first 1×1 convolution of the first feature encoding convolution module in each feature encoding layer has a stride of 2, which reduces the height and width of the feature map and increases its channel dimension.
6. The knowledge distillation-based semi-supervised video target segmentation method as set forth in claim 1, wherein: the large semi-supervised video target segmentation network in step 2 is trained with the temporal content consistency loss $L_{tc}$, given by:

$$L_{tc} = \lambda_{t1} L_c\big(P(t), G(t)\big) + \lambda_{t2} L_c\big(P(t-1), G(t-1)\big) + \varepsilon L_c\big(P(t), P(t-1)\big)$$
wherein P(t) represents the video frame prediction mask at the current time t, G(t) represents the real mask of the video frame at the current time t, $\lambda_{t1}$ is the loss weight at time t, $\lambda_{t2}$ is the loss weight at time t-1, and $\varepsilon$ is the content consistency weight between time t-1 and time t;
wherein $L_c$ represents the target consistency loss, calculated as follows:

$$L_c = -\frac{1}{N}\sum_{i=1}^{N}\sum_{j=1}^{C} g_{ij}\,(1-p_{ij})^{\gamma_j}\log(p_{ij})$$
where N represents the number of pixels in the video frame mask, C represents the number of categories, $p_{ij}$ represents the prediction probability of the j-th class for the i-th pixel, $g_{ij}$ represents the true label of the j-th class for the i-th pixel, and $\gamma_j$ is a controllable parameter used to adjust the relative importance of easy and hard categories.
7. The knowledge distillation-based semi-supervised video target segmentation method as set forth in claim 5, wherein: the loss function $L_s$ used for knowledge distillation in step 3 is as follows:

$$L_s = \alpha\,L_c\big(f_S(t), f_L(t)\big) + (1-\alpha)\,L_c\big(f_S(t), g(t)\big)$$
where $L_c$ is the target consistency loss described above, $f_L(t)$ represents the prediction feature map of the large model at time t, $f_S(t)$ represents the prediction feature map of the small model at time t, $g(t)$ represents the real mask at time t, and $\alpha$ is a hyperparameter controlling the weight of the two parts of $L_s$.
8. The knowledge distillation-based semi-supervised video target segmentation method as set forth in claim 7, wherein the specific strategy of knowledge distillation adopted is as follows:
S3-1: train the large semi-supervised video target segmentation network constructed in step 1 with the temporal content consistency loss of step 2 to obtain the large model;
S3-2: train the resulting large model and the small semi-supervised video target segmentation network constructed in step 1 simultaneously, the small network also using the temporal content consistency loss;
S3-3: while the two networks are trained simultaneously, first compute the temporal content consistency loss of step 2 for each network separately, then substitute the prediction results of the large and small models into the knowledge distillation loss of step 3 to compute its error;
S3-4: adjust the parameters of the large model according to its temporal content consistency loss, adjust the parameters of the small model according to its temporal content consistency loss, and finally adjust the parameters of both models according to the knowledge distillation loss.
CN202310677219.4A 2023-06-08 2023-06-08 Knowledge distillation-based semi-supervised video target segmentation method Active CN116402833B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310677219.4A CN116402833B (en) 2023-06-08 2023-06-08 Knowledge distillation-based semi-supervised video target segmentation method


Publications (2)

Publication Number | Publication Date
CN116402833A | 2023-07-07
CN116402833B | 2023-08-22

Family

Family ID: 87020270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310677219.4A Active CN116402833B (en) 2023-06-08 2023-06-08 Knowledge distillation-based semi-supervised video target segmentation method

Country Status (1)

Country Link
CN (1) CN116402833B (en)

Patent Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200160065A1 (en) * 2018-08-10 2020-05-21 Naver Corporation Method for training a convolutional recurrent neural network and for semantic segmentation of inputted video using the trained convolutional recurrent neural network
CN109858462A (en) * 2019-02-21 2019-06-07 武汉纺织大学 A kind of Fabric Recognition Method and system based on convolutional neural networks
CN109886225A (en) * 2019-02-27 2019-06-14 浙江理工大学 A kind of image gesture motion on-line checking and recognition methods based on deep learning
CN111968123A (en) * 2020-08-28 2020-11-20 北京交通大学 Semi-supervised video target segmentation method
CN112949529A (en) * 2021-03-12 2021-06-11 杭州电子科技大学 Loss function-based video image segmentation stability improving method
CN113283438A (en) * 2021-03-25 2021-08-20 北京工业大学 Weak surveillance video target segmentation method based on multi-source significance and space-time sample adaptation
CN113191995A (en) * 2021-04-30 2021-07-30 东北大学 Video image automatic exposure correction method based on deep learning
CN113344932A (en) * 2021-06-01 2021-09-03 电子科技大学 Semi-supervised single-target video segmentation method
WO2023071531A1 (en) * 2021-10-25 2023-05-04 之江实验室 Liver ct automatic segmentation method based on deep shape learning
CN115311307A (en) * 2022-07-21 2022-11-08 复旦大学 Semi-supervised video polyp segmentation system based on time sequence consistency and context independence
CN116188486A (en) * 2022-12-29 2023-05-30 复旦大学 Video segmentation method and system for laparoscopic liver operation
CN116129310A (en) * 2023-01-06 2023-05-16 北京交通大学 Video target segmentation system, method, electronic equipment and medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Gensheng Pei et al.: "Hierarchical Feature Alignment Network for Unsupervised Video Object Segmentation", Computer Vision - ECCV 2022, pages 231-233 *
Qian Mingyang: "Research on the Stability of Video Segmentation", China Masters' Theses Full-text Database, Information Science and Technology series, page 4 *

Also Published As

Publication number Publication date
CN116402833B (en) 2023-08-22

Similar Documents

Publication Publication Date Title
CN109711413B (en) Image semantic segmentation method based on deep learning
Wang et al. Esrgan: Enhanced super-resolution generative adversarial networks
CN109064507B (en) Multi-motion-stream deep convolution network model method for video prediction
CN108596024B (en) Portrait generation method based on face structure information
CN115049936B (en) High-resolution remote sensing image-oriented boundary enhanced semantic segmentation method
CN105787948B Fast image segmentation method based on shape-variable resolution
CN110287777B (en) Golden monkey body segmentation algorithm in natural scene
CN111539290B (en) Video motion recognition method and device, electronic equipment and storage medium
CN112287941B (en) License plate recognition method based on automatic character region perception
CN111310609B (en) Video target detection method based on time sequence information and local feature similarity
CN111968123A (en) Semi-supervised video target segmentation method
CN108053420A Segmentation method for class-independent attribute dynamic scenes with limited spatio-temporal resolution
CN114549574A (en) Interactive video matting system based on mask propagation network
CN113807340B (en) Attention mechanism-based irregular natural scene text recognition method
CN115620010A (en) Semantic segmentation method for RGB-T bimodal feature fusion
CN112070114A (en) Scene character recognition method and system based on Gaussian constraint attention mechanism network
CN113392711A (en) Smoke semantic segmentation method and system based on high-level semantics and noise suppression
CN111428727A (en) Natural scene text recognition method based on sequence transformation correction and attention mechanism
CN114943888B (en) Sea surface small target detection method based on multi-scale information fusion
CN116645592A (en) Crack detection method based on image processing and storage medium
CN113378812A (en) Digital dial plate identification method based on Mask R-CNN and CRNN
CN112417752A (en) Cloud layer track prediction method and system based on convolution LSTM neural network
CN114973071A (en) Unsupervised video target segmentation method and system based on long-term and short-term time sequence characteristics
CN114708615A (en) Human body detection method based on image enhancement in low-illumination environment, electronic equipment and storage medium
CN117522903A (en) SF-Unet model-based high-resolution cultivated land remote sensing image segmentation method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant