CN114638839B - Small sample video target segmentation method based on dynamic prototype learning - Google Patents
Small sample video target segmentation method based on dynamic prototype learning Download PDFInfo
- Publication number
- CN114638839B CN114638839B CN202210536170.6A CN202210536170A CN114638839B CN 114638839 B CN114638839 B CN 114638839B CN 202210536170 A CN202210536170 A CN 202210536170A CN 114638839 B CN114638839 B CN 114638839B
- Authority
- CN
- China
- Prior art keywords
- video frame
- prototype
- matrix
- support
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 230000011218 segmentation Effects 0.000 title claims abstract description 107
- 238000000034 method Methods 0.000 title claims abstract description 67
- 238000012545 processing Methods 0.000 claims abstract description 31
- 230000005540 biological transmission Effects 0.000 claims abstract description 11
- 239000011159 matrix material Substances 0.000 claims description 111
- 239000013598 vector Substances 0.000 claims description 40
- 230000006870 function Effects 0.000 claims description 31
- 238000005065 mining Methods 0.000 claims description 18
- 238000012549 training Methods 0.000 claims description 13
- 238000000605 extraction Methods 0.000 claims description 12
- 238000013528 artificial neural network Methods 0.000 claims description 8
- 238000004422 calculation algorithm Methods 0.000 claims description 8
- 238000004364 calculation method Methods 0.000 claims description 6
- 238000005457 optimization Methods 0.000 claims description 4
- 238000011176 pooling Methods 0.000 claims description 4
- 238000009499 grossing Methods 0.000 claims description 3
- 230000000873 masking effect Effects 0.000 claims description 3
- 238000004590 computer program Methods 0.000 abstract description 3
- 239000000284 extract Substances 0.000 abstract description 3
- 230000008569 process Effects 0.000 description 12
- 238000010586 diagram Methods 0.000 description 9
- 230000015654 memory Effects 0.000 description 9
- 238000004891 communication Methods 0.000 description 3
- 230000003190 augmentative effect Effects 0.000 description 2
- 238000005259 measurement Methods 0.000 description 2
- 230000003287 optical effect Effects 0.000 description 2
- 230000007704 transition Effects 0.000 description 2
- 230000004913 activation Effects 0.000 description 1
- 238000009412 basement excavation Methods 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 238000005516 engineering process Methods 0.000 description 1
- 238000002372 labelling Methods 0.000 description 1
- 239000004973 liquid crystal related substance Substances 0.000 description 1
- 238000013507 mapping Methods 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 230000008447 perception Effects 0.000 description 1
- 239000004065 semiconductor Substances 0.000 description 1
- 238000012546 transfer Methods 0.000 description 1
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Software Systems (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- Biophysics (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Computational Linguistics (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a small sample video target segmentation method based on dynamic prototype learning, which comprises the following steps: acquiring a video target to be segmented; and processing the video target to be segmented by using a small sample video target segmentation model based on dynamic prototype learning to obtain a video target segmentation result. According to the small sample video target segmentation method, the optimal transmission method is used for self-adaptive learning of the dynamic prototype, noise attention is effectively reduced, and meanwhile, a guiding mode is adopted for matching the multi-level feature map, so that the calculated amount is greatly reduced; the method can fully extract the target information in a small number of support set samples, and remarkably improves the segmentation performance on the video of the challenge set. The invention also discloses electronic equipment, a storage medium and a computer program product for executing the small sample video object segmentation method based on the dynamic prototype learning.
Description
Technical Field
The invention relates to the field of computer vision, in particular to a training method of a small sample video target segmentation model and a video target segmentation method.
Background
Video target segmentation is a technology for predicting a foreground target mask in each frame of a video, and has wide application in the aspects of augmented reality, automatic driving, video editing and the like.
The prior art segmentation of video objects is typically based on semi-supervised and unsupervised. The semi-supervision method needs to give target information of a first frame of each video, then carries out dense association on targets in subsequent frames of the video, and the process seriously depends on a large amount of densely segmented and labeled data, so that time and labor are consumed; the unsupervised method has low performance due to the lack of the labeling data, and cannot meet the requirements of practical application. In addition, the above two methods cannot be well generalized to new target classes, and the segmentation capability on classes not seen in the training phase is sharply reduced, which limits the extensibility and practicability of video target recognition.
Disclosure of Invention
In view of the above, it is a primary object of the present invention to provide a small sample video object segmentation method based on dynamic prototype learning, an electronic device, a storage medium and a computer program product, which are intended to at least partially solve at least one of the above-mentioned technical problems.
According to a first aspect of the present invention, there is provided a small sample video object segmentation method based on dynamic prototype learning, including:
acquiring a video target to be segmented;
processing a video target to be segmented by using a small sample video target segmentation model based on dynamic prototype learning to obtain a video target segmentation result, wherein the small sample video target segmentation model based on dynamic prototype learning is obtained by training according to the following method:
processing the video frame images of the challenge set and the video frame images of the support set by utilizing a part of neural network layers of a feature extraction module of a small sample video object segmentation model to obtain low-level features of the challenge video frame and low-level features of the support video frame;
processing the video frame images of the challenge set by using all neural network layers of a feature extraction module of the small sample video object segmentation model to obtain the features of the challenge video frame;
carrying out mask operation on the low-level features of the support video frame to obtain foreground features of the support video frame;
processing foreground characteristics of the support video frame and characteristics of the challenge video frame by utilizing a mining module of a small sample video object segmentation model to obtain a corresponding relation matrix;
processing the low-level features of the support video frames, the low-level features of the inquiry video frames and the corresponding relation matrix by using a guide module of the small sample video target segmentation model to obtain a low-level corresponding relation matrix;
processing the corresponding relation matrix and the low-level corresponding relation matrix by using a segmentation module of the small sample video target segmentation model to obtain a video target segmentation result, and optimizing the small sample video target segmentation model by using a loss function of the small sample video target segmentation model;
and (4) performing feature extraction operation, masking operation, mining operation, guiding operation, segmentation operation and optimization operation in an iterative manner until the value of the loss function meets a preset condition to obtain a trained small sample video target segmentation model.
According to the embodiment of the present invention, the processing of the foreground feature of the support video frame and the feature of the challenge video frame by the mining module of the small sample video object segmentation model to obtain the corresponding relationship matrix includes:
processing foreground characteristics of the support video frame by utilizing a prototype generator of the mining module to obtain dynamic prototype characteristics;
calculating the dynamic prototype characteristics and the foreground characteristics of the support video frame to obtain a support corresponding relation matrix;
calculating the dynamic prototype characteristics and the characteristics of the inquiry video frame to obtain an inquiry corresponding relation matrix;
and operating the support corresponding relation matrix and the inquiry corresponding relation matrix to obtain a corresponding relation matrix.
According to an embodiment of the present invention, the processing of the foreground feature of the support video frame by the prototype generator of the mining module to obtain the dynamic prototype feature includes:
carrying out global average pooling on foreground features of the support video frame to obtain video target prototype features;
calculating foreground characteristics of the support video frame and prototype characteristics of the video target by using a prototype generator to obtain an attention matrix;
processing the attention matrix by using an optimal transmission algorithm to obtain an optimal distribution matrix;
and calculating the foreground characteristics of the support video frame and the optimal distribution matrix, and calculating the calculation result and the video target prototype characteristics to obtain the dynamic prototype characteristics.
According to an embodiment of the present invention, the above-mentioned attention matrix is determined by equation (1):
wherein,is the firstThe foreground feature vector of each support video frame,index representing foreground feature vector of support video frame, for length ofThe foreground feature vector of the supporting video frame,is in a range of values
,Is the firstThe characteristics of each prototype are characterized in that,index representing prototype features forThe characteristics of each prototype are characterized in that,is in the value range of, Is a matrix of support focus forces,is a support attention force matrixTo (1) aGo to the firstColumn value for indicating the secondIndividual prototype features andsimilarity of foreground feature vectors of the individual support video frames;
wherein the dynamic prototype feature is determined by equation (2):
wherein,is a sequence of foreground feature vectors that support video frames,is the firstIs characterized by a prototypeThe obtained dynamic prototype characteristics are updated according to the dynamic prototype characteristics,representing the optimized support set attention force matrix,is the first of the optimized support attention force matrixA row vector.
According to an embodiment of the present invention, the processing, by the guidance module of the small sample video object segmentation model, the low-level feature of the support video frame, the low-level feature of the inquiry video frame, and the correspondence matrix to obtain the low-level correspondence matrix includes:
selecting a preset row number and a preset column number of the corresponding relation matrix to obtain an intermediate corresponding relation matrix;
calculating the low-level feature of the support video frame and the intermediate corresponding relation matrix to obtain a reconstructed feature matrix;
and performing operation on the reconstructed feature matrix and the low-level features of the inquiry video frame to obtain a low-level corresponding relation matrix.
According to an embodiment of the present invention, the above guidance module is determined by formula (3) and formula (4):
wherein,is a temperature factor, for controlling the degree of smoothing of the output probability distribution,the length of the modulus of the vector is represented,is the firstAn individual challenge video frame feature vector is generated,index representing the feature vector of the challenge video frame, for height and width respectivelyAndthe image of the video frame of the challenge,is in the value range of,Is an allocation matrix of dynamic prototype features and challenge video frame features,first of the allocation matrix representing dynamic prototype features and challenging video frame featuresGo to the firstThe value of the column is such that,an assignment matrix representing the optimized dynamic prototype features and the foreground features of the support video frame,a correspondence matrix representing the features of the challenge video frame and the foreground features of the support video frame,softmaxrepresenting a normalized exponential function.
According to an embodiment of the present invention, the loss function of the small sample video object segmentation model includes a cross-over ratio loss function and a cross-entropy loss function;
wherein the cross entropy loss function is determined by equation (5):
wherein,andrespectively representing the height and width of the incoming challenge video frame image or support video frame image,which represents the product of the height and the width,is the result of the real segmentation,representing the second in the real segmentation resultGo to the firstThe value of the column is such that,is the result of the segmentation predicted by the model,representation of the segmentation results of model predictionsGo to the firstThe value of the column;
wherein the cross-over ratio loss function is determined by equation (6):
According to a second aspect of the present invention, there is provided an electronic apparatus comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the above dynamic prototype learning-based small sample video object segmentation method.
According to a third aspect of the present invention, there is provided a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-mentioned small-sample video object segmentation method based on dynamic prototype learning.
The small sample video target segmentation method based on dynamic prototype learning provided by the invention has the advantages that the optimal transmission method is used for adaptively learning the dynamic prototype, the noise attention is effectively reduced, meanwhile, the multi-level feature maps are matched in a guiding mode, the calculated amount is greatly reduced, meanwhile, the video segmentation method provided by the invention fully extracts the target information in a small number of support set samples, and the segmentation performance on the video of an inquiry set is obviously improved.
Drawings
FIG. 1 is a flow chart of a small sample video object segmentation method based on dynamic prototype learning according to an embodiment of the present invention;
FIG. 2 is a flowchart of a training method of a small sample video object segmentation model based on dynamic prototype learning according to an embodiment of the present invention;
FIG. 3 is a flow chart of obtaining a correspondence matrix according to an embodiment of the present invention;
FIG. 4 is a flow diagram for obtaining dynamic prototype features according to an embodiment of the present invention;
FIG. 5 is a flow diagram of obtaining a low-level correspondence matrix according to an embodiment of the invention;
FIG. 6 is a small sample video object segmentation model framework diagram based on dynamic prototype learning according to an embodiment of the present invention;
fig. 7 schematically illustrates a block diagram of an electronic device adapted to implement a small-sample video object segmentation method of base dynamic prototype learning, in accordance with an embodiment of the present invention.
Detailed Description
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
The invention provides a small sample video target segmentation method based on dynamic prototype learning, which aims to reduce the dependence on data, improve the expansibility and the practicability and achieve better video target segmentation performance by using a small amount of data with labels.
In the current method, the method of dense matching by using multi-level features achieves the leading performance. However, dense matching of pixel-by-pixel features introduces a large amount of correspondence noise, and further processing on multiple scales increases the computational effort. The method provided by the invention can learn the target prototype in a self-adaptive manner, realize robust multi-level dense matching in a mode of an intermediate bridge, and effectively alleviate the problems of noise and large calculation amount.
The video segmentation method provided by the invention can be applied to an application system related to video object segmentation; the target in the input video is segmented according to the information provided by a small number of support set images, and the method can be widely applied to scenes such as augmented reality, automatic driving, video editing and the like. In a specific embodiment, the method can be embedded into a mobile device in a software form, and provides a real-time segmentation result of a recorded video; or the method can be installed in a background server to provide the processing result of the video in a large batch.
Fig. 1 is a flowchart of a small sample video object segmentation method based on dynamic prototype learning according to an embodiment of the present invention.
As shown in FIG. 1, the method includes operations S110 to S120.
In operation S110, a video object to be segmented is acquired.
In operation S120, a video object to be segmented is processed by using a small-sample video object segmentation model based on dynamic prototype learning, and a video object segmentation result is obtained.
Fig. 2 is a flowchart of a training method of a small sample video object segmentation model based on dynamic prototype learning according to an embodiment of the present invention.
As shown in FIG. 2, the method includes operations S210 to S270.
In operation S210, the challenge set video frame image and the support set video frame image are processed by using a part of the neural network layer of the feature extraction module of the small sample video object segmentation model, so as to obtain a low-level feature of the challenge video frame and a low-level feature of the support video frame.
The low-level features are processed by a part of neural network layers of the feature extraction module, so that the resolution is high, more detail information is included, but the low-level features are lower in semantic and more in noise. High-level features (or features) as opposed to low-level features, which traverse more layers of the neural network than low-level features, have stronger semantic information, but have a lower resolution and a poorer perception of detail.
In operation S220, all the neural network layers of the feature extraction module of the small sample video object segmentation model are used to process the video frame images of the challenge set, so as to obtain the features of the challenge video frame.
For input support set video frame images and inquiry set video frame images belonging to the same category, the feature extraction module is utilized to perform multi-level feature extraction based on a ResNet-50 network, and then a 1x1 convolutional layer is mapped to a common measurement space.
In operation S230, a mask operation is performed on the low-level features of the support video frame to obtain foreground features of the support video frame.
In operation S240, the mining module of the small sample video object segmentation model is used to process the foreground feature of the support video frame and the feature of the challenge video frame, so as to obtain a corresponding relationship matrix.
In operation S250, the low-level feature of the support video frame, the low-level feature of the challenge video frame, and the corresponding relationship matrix are processed by using the guidance module of the small-sample video object segmentation model to obtain a low-level corresponding relationship matrix.
In operation S260, the correspondence matrix and the low-level correspondence matrix are processed by using a segmentation module of the small-sample video object segmentation model to obtain a video object segmentation result, and the small-sample video object segmentation model is optimized by using a loss function of the small-sample video object segmentation model.
In operation S270, the feature extraction operation, the masking operation, the mining operation, the guiding operation, the segmentation operation, and the optimization operation are performed iteratively until the value of the loss function satisfies the preset condition, so as to obtain a trained small sample video target segmentation model.
According to the training method provided by the invention, a small sample video target segmentation model with reliability, generalization and high efficiency can be obtained by utilizing a dynamic prototype mining module and a multi-level dynamic guide module based on an optimal transmission algorithm; the trained small sample video target segmentation model is used for segmenting the video target, an optimal transmission method can be used for adaptively learning a dynamic prototype, noise attention is effectively reduced, meanwhile, a multi-level characteristic diagram is matched in a guiding mode, the calculated amount is greatly reduced, meanwhile, the video segmentation method provided by the invention fully extracts the target information in a small number of support set samples, and the segmentation performance on the challenge set video is obviously improved.
Fig. 3 is a flowchart of obtaining a correspondence matrix according to an embodiment of the present invention.
As shown in fig. 3, the mining module of the small sample video object segmentation model is used for processing foreground features of the support video frame and features of the inquiry video frame, and obtaining a corresponding relation matrix includes operations S310 to S340.
In operation S310, the foreground features of the support video frame are processed by using a prototype generator of the mining module to obtain dynamic prototype features.
In operation S320, the dynamic prototype feature and the foreground feature of the support video frame are calculated to obtain a support correspondence matrix.
In operation S330, the dynamic prototype feature and the feature of the challenge video frame are operated to obtain a challenge correspondence matrix.
In operation S340, the support correspondence matrix and the challenge correspondence matrix are operated to obtain a correspondence matrix.
In the process of acquiring the corresponding relation matrix, the feature points which support the incidence relation between the foreground features of the video frame and the features of the video frame to be inquired can be fully excavated by utilizing the excavation module of the dynamic prototype based on the optimal transmission algorithm, so that more firm data support is provided for the training of the subsequent model.
FIG. 4 is a flow diagram for obtaining dynamic prototype features according to an embodiment of the present invention.
As shown in fig. 4, processing the foreground feature of the support video frame by using the prototype generator of the mining module to obtain the dynamic prototype feature includes operations S410 to S440.
In operation S410, global average pooling is performed on the foreground features of the support video frame to obtain video target prototype features.
In operation S420, the prototype generator is used to perform operations on the foreground features of the support video frame and the prototype features of the video target, so as to obtain an attention matrix.
In operation S430, the attention matrix is processed using an optimal transfer algorithm to obtain an optimal allocation matrix.
In operation S440, the foreground feature of the supported video frame and the optimal allocation matrix are operated, and the operation result and the video target prototype feature are operated to obtain a dynamic prototype feature.
The processing process for acquiring the dynamic prototype features can effectively reduce the noise attention in the original video frame image, thereby improving the segmentation performance of the trained model.
According to an embodiment of the present invention, the above-mentioned attention matrix is determined by equation (1):
wherein,is the firstiThe foreground feature vectors of each of the support video frames,index representing foreground feature vector of support video frame, for length ofThe foreground feature vector of the supported video frame,is in the value range of
,Is the firstThe characteristics of each prototype are characterized in that,index representing prototype features forThe characteristics of each prototype are characterized in that,is in the value range of,Is a matrix of support concentration attention forces,is the first to support the attention-focusing force matrixGo to the firstColumn value for indicating the secondIndividual prototype characteristics andsimilarity of foreground feature vectors of the individual support video frames;
wherein the dynamic prototype feature is determined by equation (2):
wherein,is a sequence of foreground feature vectors supporting video frames,is the firstCharacter of individual prototypeThe obtained dynamic prototype characteristics are updated according to the dynamic prototype characteristics,the optimized support set attention matrix is shown, namely the support set attention matrix optimized by using the optimal transmission algorithm,is the first of the optimized support set attention force matrixRow vector, representingThe line prototype feature vector pairs support the similarity of the foreground feature vectors of the video frames.
FIG. 5 is a flow diagram of obtaining a low-level correspondence matrix according to an embodiment of the invention.
As shown in fig. 5, the obtaining of the low-level correspondence matrix by using the guidance module of the small sample video object segmentation model to process the low-level features of the support video frame, the low-level features of the inquiry video frame and the correspondence matrix includes operations S510 to S530.
In operation S510, a preset row number and a preset column number of the corresponding relationship matrix are selected to obtain an intermediate corresponding relationship matrix.
In operation S520, the low-level feature of the support video frame and the intermediate correspondence matrix are operated to obtain a reconstructed feature matrix.
In operation S530, the reconstructed feature matrix and the low-level features of the challenge video frame are operated to obtain a low-level corresponding relationship matrix.
According to an embodiment of the present invention, the above guidance module is determined by formula (3) and formula (4):
wherein,as a temperature factor, useIn controlling the degree of smoothing of the output probability distribution,the length of the modulus of the vector is represented,is the firstAn individual challenge video frame feature vector is generated,index representing the feature vector of the challenge video frame, for height and width respectivelyAndthe image of the video frame of the challenge,is in the value range of,Is an allocation matrix of dynamic prototype features and challenge video frame features,a first of the allocation matrices representing the dynamic prototype features and the features of the challenge video frameGo to the firstThe value of the column is such that,an assignment matrix representing the optimized dynamic prototype features and the foreground features of the support video frame,a correspondence matrix representing the features of the challenge video frame and the foreground features of the support video frame,softmaxrepresenting a normalized exponential function.
Allocation matrices for dynamic prototype features and supporting video frame featuresThen it is determined by:,allocation matrix representing dynamic prototype features and supporting video frame featuresTo (1) aGo to the firstThe value of the column is such that,is the firstiThe foreground feature vectors of each of the support video frames,is the firstIs characterized by a prototypeIs updated toTo dynamic prototype features.
According to an embodiment of the present invention, the loss function of the small sample video object segmentation model includes a cross-over ratio loss function and a cross-entropy loss function;
wherein the cross entropy loss function is determined by equation (5):
wherein,andrespectively representing the height and width of the incoming challenge video frame image or the support video frame image,represents the product of said height and said width,is the result of the true segmentation of the image,representing the first in the real segmentation resultGo to the firstThe value of the column is such that,is the result of the segmentation predicted by the model,representing the result of the segmentation predicted by the modelGo to the firstThe value of the column;
wherein the cross-over ratio loss function is determined by equation (6):
where a norm of the matrix is represented.
Since the segmentation task is similar to the pixel-by-pixel classification task, intensive cross-entropy losses are used as constraints, while, in order to improve the final segmentation resultAnd label maskThe coincidence degree index of the invention is additionally added with an intersection ratio loss, and finally the loss function of the invention is formed by combining the intersection ratio loss function and the cross entropy loss function according to a certain weight coefficient; the loss function of the small sample video object segmentation model of the invention is shown in formula (7):
By using the loss function as the constraint of the training method, the training effect of the small sample video target model can be improved, and a small sample video target segmentation model which has robustness and effectively reduces noise and is based on dynamic prototype learning is obtained.
Fig. 6 is a frame diagram of a small sample video object segmentation model based on dynamic prototype learning according to an embodiment of the present invention.
The training process of the model provided by the embodiment of the present invention is further described in detail with reference to fig. 6.
As shown in FIG. 6, the model training framework provided by the present invention comprises a dynamic prototype mining module based on an optimal transmission algorithm and a multi-level dynamic boot module. In a dynamic prototype mining module based on an optimal transmission algorithm, for input supporting set and inquiry set images belonging to the same category, multi-level features are extracted through a ResNet-50 network, and then a 1x1 convolutional layer is mapped to a common measurement space. Will support the featureFlattening and extracting the inclusion using the corresponding maskA sequence of foreground feature vectors supporting the video framesSending it to a prototype generator to obtainAn object prototype, as shown in equations (8) and (9):
GAP(Global Average Polean, GAP, global average pooling) is used to average a sequence of input supporting video frame foreground feature vectors,representing a target global feature vector, prototype generatorComposed of a full connection layer and an activation function, which can generate prototype features according to the target features input by the current support set,Is shown in whichkA generator for generating a prototype of the object,is composed ofCan be based on the attention moment matrixForeground pixel features are assigned to these prototypes as shown in equation (1):
in order to allocate a group of semantically consistent pixel features to the same prototype, an optimal allocation matrix is obtained based on an optimal transmission theory for adjusting the mapping relationship between the pixel features and the prototype, and this process mainly solves the optimization problem shown in formulas (10) and (11):
wherein,the vector is a vector of all 1 s,representing the transition matrix to be solved for,represents the optimal solution of the transition matrix to be solved,the weighting operation is performed on the attention moment array, and is a weighting matrix, Tr represents the trace operation of the matrix,which represents a constant coefficient of the constant,the entropy function of the information is represented,for transferring matricesThe space of possible solutions of (a) is,the dimension of expression isCan ultimately be based onThe updating results in a robust dynamic prototype, as shown in equation (12) and equation (2):
The above process can optimize the prototype vector through multiple iterations, and simultaneously purify the distribution matrix of the support set。
In the multi-level dynamic guiding module, for a video frame of a challenge set to be segmented, a pseudo label can be allocated to each pixel feature by using a self-adaptively generated dynamic prototype, and meanwhile, a huge calculation amount generated in a dense matching process is reduced by using a calculation mode of an intermediate bridge, as shown in formulas (3) and (4):
wherein,is the temperature factor. High-level features at low resolution may use pairsCorrespondence matrixAnd (4) reconstructing the characteristics of the support video frame, and inputting the reconstructed support video frame into a decoder to predict a segmentation result. For the low-level features with high resolution, a guiding method which can suppress noise by using a dynamic prototype and has less calculation amount is used for feature reconstruction. The concrete method is according toSelecting position indexes of similarity from the support set characteristics to obtain corresponding characteristic vectorsObtaining dense matching results for low-level features in an indirect guided mannerAs shown in equation (13):
wherein,is to challenge the low-level features of the video frameTo (1)jThe number of feature vectors is determined by the number of feature vectors,is selected to correspond toThe product of the two forms the low-level feature vector of the support video frameRepresenting low-level feature dense matching resultsTo (1)jA value.
The small sample video target segmentation model obtained through the training process can realize that a small amount of marked images are input as supports to segment targets of the same category in video frames.
Fig. 7 schematically illustrates a block diagram of an electronic device adapted to implement a small-sample video object segmentation method of base dynamic prototype learning, in accordance with an embodiment of the present invention.
As shown in fig. 7, an electronic device 700 according to an embodiment of the present invention includes a processor 701, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. The processor 701 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 701 may also include on-board memory for caching purposes. The processor 701 may comprise a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present invention.
In the RAM 703, various programs and data necessary for the operation of the electronic apparatus 700 are stored. The processor 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. The processor 701 performs various operations of the method flow according to the embodiment of the present invention by executing programs in the ROM 702 and/or the RAM 703. It is noted that the programs may also be stored in one or more memories other than the ROM 702 and RAM 703. The processor 701 may also perform various operations of method flows according to embodiments of the present invention by executing programs stored in the one or more memories.
The present invention also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the present invention.
According to embodiments of the present invention, the computer readable storage medium may be a non-volatile computer readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to embodiments of the invention, a computer-readable storage medium may include the ROM 702 and/or the RAM 703 and/or one or more memories other than the ROM 702 and the RAM 703 described above.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (8)
1. A small sample video object segmentation method based on dynamic prototype learning comprises the following steps:
acquiring a video target to be segmented;
processing the video target to be segmented by using a small sample video target segmentation model based on dynamic prototype learning to obtain a video target segmentation result, wherein the small sample video target segmentation model based on dynamic prototype learning is obtained by training according to the following method:
processing the video frame images of the challenge set and the video frame images of the support set by utilizing a part of neural network layers of a feature extraction module of the small sample video object segmentation model to obtain low-level features of the challenge video frame and low-level features of the support video frame;
processing the video frame images of the challenge set by using all the neural network layers of the feature extraction module of the small sample video object segmentation model to obtain the features of the challenge video frame;
carrying out mask operation on the low-level features of the support video frame to obtain foreground features of the support video frame;
processing the foreground characteristics of the support video frame and the characteristics of the challenge video frame by utilizing a mining module of the small sample video target segmentation model to obtain a corresponding relation matrix;
processing the low-level features of the support video frame, the low-level features of the challenge video frame and the corresponding relation matrix by utilizing a guide module of the small sample video target segmentation model to obtain a low-level corresponding relation matrix;
processing the corresponding relation matrix and the low-level corresponding relation matrix by utilizing a segmentation module of the small sample video target segmentation model to obtain a video target segmentation result, and optimizing the small sample video target segmentation model by utilizing a loss function of the small sample video target segmentation model;
iteratively performing feature extraction operation, masking operation, mining operation, guiding operation, segmentation operation and optimization operation until the value of the loss function meets a preset condition to obtain a trained small sample video target segmentation model;
wherein, the processing the foreground characteristics of the support video frame and the characteristics of the challenge video frame by the mining module of the small sample video object segmentation model to obtain a corresponding relation matrix comprises:
processing the foreground characteristics of the support video frame by utilizing a prototype generator of the mining module to obtain dynamic prototype characteristics;
calculating the dynamic prototype characteristics and the foreground characteristics of the support video frame to obtain a support corresponding relation matrix;
calculating the dynamic prototype characteristics and the characteristics of the challenge video frame to obtain a challenge corresponding relation matrix;
and operating the support corresponding relation matrix and the inquiry corresponding relation matrix to obtain a corresponding relation matrix.
2. The method of claim 1, wherein said processing the support video frame foreground features with a prototype generator of the mining module to obtain dynamic prototype features comprises:
performing global average pooling on the foreground features of the support video frames to obtain video target prototype features;
calculating the foreground characteristic of the support video frame and the prototype characteristic of the video target by using the prototype generator to obtain an attention matrix;
processing the attention matrix by using an optimal transmission algorithm to obtain an optimal distribution matrix;
and calculating the foreground characteristics of the support video frame and the optimal distribution matrix, and calculating the calculation result and the video target prototype characteristics to obtain dynamic prototype characteristics.
3. The method of claim 2, wherein the attention matrix is determined by equation (1):
wherein,is the firstThe foreground feature vectors of each of the support video frames,an index representing the foreground feature vector of the support video frame, for a length ofThe foreground feature vector of the support video frame,is in the value range of,Is the firstThe characteristics of each prototype are characterized in that,an index representing the prototype feature forAn instituteThe prototype feature is described as being a feature of the prototype,is in the value range of, Is a matrix of support focus forces,is the support attention force matrixTo (1)Go to the firstColumn value for indicating the secondThe prototype features andsimilarity of foreground feature vectors of the support video frames;
wherein the dynamic prototype feature is determined by equation (2):
wherein,is the support video frame foreground featureThe sequence of the feature vectors is then,is the firstCharacter of individual prototypeThe obtained dynamic prototype characteristics are updated according to the dynamic prototype characteristics,representing the optimized support set attention force matrix,is the first of the optimized support attention force matrixA row vector.
4. The method of claim 1, wherein the processing the support video frame low-level features, the challenge video frame low-level features, and the correspondence matrix with a bootstrap module of the small sample video object segmentation model to obtain a low-level correspondence matrix comprises:
selecting a preset row number and a preset column number of the corresponding relation matrix to obtain an intermediate corresponding relation matrix;
calculating the low-level characteristics of the support video frame and the intermediate corresponding relation matrix to obtain a reconstructed characteristic matrix;
and operating the reconstructed feature matrix and the low-level features of the inquiry video frame to obtain the low-level corresponding relation matrix.
5. The method of claim 1, wherein the guidance module is determined by formula (3) and formula (4):
wherein,is a temperature factor, for controlling the degree of smoothing of the output probability distribution,the length of the modulus of the vector is represented,is the firstAn individual challenge video frame feature vector is generated,an index representing the feature vector of the challenge video frame, for height and width respectivelyAndof the challenge video frame image of (a),is in the value range of,Is an allocation matrix of dynamic prototype features and challenge video frame features,a first of the allocation matrices representing the dynamic prototype features and the features of the challenge video frameGo to the firstThe value of the column is such that,an assignment matrix representing the optimized dynamic prototype features and the support video frame foreground features,and the softmax represents a normalized exponential function.
6. The method of claim 1, wherein the loss function of the small sample video object segmentation model comprises a cross-over ratio loss function and a cross-entropy loss function;
wherein the cross entropy loss function is determined by equation (5):
wherein,andrespectively representing the height and width of the incoming challenge video frame image or the support video frame image,representing the product of said height and said width,is the result of the real segmentation,representing the first in the real segmentation resultGo to the firstThe value of the column is such that,is the result of the segmentation predicted by the model,representing the result of the segmentation predicted by the modelGo to the firstThe value of the column;
wherein the cross-over ratio loss function is determined by equation (6):
7. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-6.
8. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210536170.6A CN114638839B (en) | 2022-05-18 | 2022-05-18 | Small sample video target segmentation method based on dynamic prototype learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210536170.6A CN114638839B (en) | 2022-05-18 | 2022-05-18 | Small sample video target segmentation method based on dynamic prototype learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114638839A CN114638839A (en) | 2022-06-17 |
CN114638839B true CN114638839B (en) | 2022-09-30 |
Family
ID=81953301
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210536170.6A Active CN114638839B (en) | 2022-05-18 | 2022-05-18 | Small sample video target segmentation method based on dynamic prototype learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114638839B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110942463A (en) * | 2019-10-30 | 2020-03-31 | 杭州电子科技大学 | Video target segmentation method based on generation countermeasure network |
CN111210446A (en) * | 2020-01-08 | 2020-05-29 | 中国科学技术大学 | Video target segmentation method, device and equipment |
CN113177549A (en) * | 2021-05-11 | 2021-07-27 | 中国科学技术大学 | Few-sample target detection method and system based on dynamic prototype feature fusion |
CN113240039A (en) * | 2021-05-31 | 2021-08-10 | 西安电子科技大学 | Small sample target detection method and system based on spatial position characteristic reweighting |
CN113706487A (en) * | 2021-08-17 | 2021-11-26 | 西安电子科技大学 | Multi-organ segmentation method based on self-supervision characteristic small sample learning |
CN113763385A (en) * | 2021-05-28 | 2021-12-07 | 华南理工大学 | Video object segmentation method, device, equipment and medium |
CN113920127A (en) * | 2021-10-27 | 2022-01-11 | 华南理工大学 | Single sample image segmentation method and system with independent training data set |
EP3961502A1 (en) * | 2020-08-31 | 2022-03-02 | Sap Se | Weakly supervised one-shot image segmentation |
CN114266977A (en) * | 2021-12-27 | 2022-04-01 | 青岛澎湃海洋探索技术有限公司 | Multi-AUV underwater target identification method based on super-resolution selectable network |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11556666B2 (en) * | 2018-10-16 | 2023-01-17 | Immuta, Inc. | Data access policy management |
CN111583284B (en) * | 2020-04-22 | 2021-06-22 | 中国科学院大学 | Small sample image semantic segmentation method based on hybrid model |
CN114240965A (en) * | 2021-12-13 | 2022-03-25 | 江南大学 | Small sample learning tumor segmentation method driven by graph attention model |
-
2022
- 2022-05-18 CN CN202210536170.6A patent/CN114638839B/en active Active
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110942463A (en) * | 2019-10-30 | 2020-03-31 | 杭州电子科技大学 | Video target segmentation method based on generation countermeasure network |
CN111210446A (en) * | 2020-01-08 | 2020-05-29 | 中国科学技术大学 | Video target segmentation method, device and equipment |
EP3961502A1 (en) * | 2020-08-31 | 2022-03-02 | Sap Se | Weakly supervised one-shot image segmentation |
CN113177549A (en) * | 2021-05-11 | 2021-07-27 | 中国科学技术大学 | Few-sample target detection method and system based on dynamic prototype feature fusion |
CN113763385A (en) * | 2021-05-28 | 2021-12-07 | 华南理工大学 | Video object segmentation method, device, equipment and medium |
CN113240039A (en) * | 2021-05-31 | 2021-08-10 | 西安电子科技大学 | Small sample target detection method and system based on spatial position characteristic reweighting |
CN113706487A (en) * | 2021-08-17 | 2021-11-26 | 西安电子科技大学 | Multi-organ segmentation method based on self-supervision characteristic small sample learning |
CN113920127A (en) * | 2021-10-27 | 2022-01-11 | 华南理工大学 | Single sample image segmentation method and system with independent training data set |
CN114266977A (en) * | 2021-12-27 | 2022-04-01 | 青岛澎湃海洋探索技术有限公司 | Multi-AUV underwater target identification method based on super-resolution selectable network |
Non-Patent Citations (4)
Title |
---|
Dynamic Prototype Convolution Network for Few-Shot Semantic Segmentation;Jie Liu 等;《ICCV 2021 open access》;20220303;全文 * |
Motion-Modulated Temporal Fragment Alignment Network for Few-Shot Action Recognition;Jiamin Wu 等;《ICCV 2021 open access》;20220303;全文 * |
Uncertainty-Aware Semi-Supervised Few Shot Segmentation;Soopil Kim 等;《https://arxiv.org/abs/2110.08954》;20211018;全文 * |
金字塔原型对齐的轻量级小样本语义分割网络;贾熹滨 等;《北京工业大学学报》;20210528;第47卷(第5期);全文 * |
Also Published As
Publication number | Publication date |
---|---|
CN114638839A (en) | 2022-06-17 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112668579A (en) | Weak supervision semantic segmentation method based on self-adaptive affinity and class distribution | |
US11030750B2 (en) | Multi-level convolutional LSTM model for the segmentation of MR images | |
CN112287144B (en) | Picture retrieval method, equipment and storage medium | |
CN112016512A (en) | Remote sensing image small target detection method based on feedback type multi-scale training | |
CN115147598A (en) | Target detection segmentation method and device, intelligent terminal and storage medium | |
CN116645592B (en) | Crack detection method based on image processing and storage medium | |
CN113128478A (en) | Model training method, pedestrian analysis method, device, equipment and storage medium | |
CN113780326A (en) | Image processing method and device, storage medium and electronic equipment | |
CN116982089A (en) | Method and system for image semantic enhancement | |
CN112668608A (en) | Image identification method and device, electronic equipment and storage medium | |
CN114170558A (en) | Method, system, device, medium and article for video processing | |
CN117437423A (en) | Weak supervision medical image segmentation method and device based on SAM collaborative learning and cross-layer feature aggregation enhancement | |
CN113610016A (en) | Training method, system, equipment and storage medium of video frame feature extraction model | |
CN111476226B (en) | Text positioning method and device and model training method | |
CN112907750A (en) | Indoor scene layout estimation method and system based on convolutional neural network | |
CN115222859A (en) | Image animation | |
CN114638839B (en) | Small sample video target segmentation method based on dynamic prototype learning | |
CN112861940A (en) | Binocular disparity estimation method, model training method and related equipment | |
CN115841596B (en) | Multi-label image classification method and training method and device for model thereof | |
CN116980541A (en) | Video editing method, device, electronic equipment and storage medium | |
CN117011515A (en) | Interactive image segmentation model based on attention mechanism and segmentation method thereof | |
CN114741697B (en) | Malicious code classification method and device, electronic equipment and medium | |
CN113807354B (en) | Image semantic segmentation method, device, equipment and storage medium | |
CN115049546A (en) | Sample data processing method and device, electronic equipment and storage medium | |
CN115082778A (en) | Multi-branch learning-based homestead identification method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |