CN114638839A - Small sample video target segmentation method based on dynamic prototype learning - Google Patents
Small sample video target segmentation method based on dynamic prototype learning
- Publication number
- CN114638839A CN114638839A CN202210536170.6A CN202210536170A CN114638839A CN 114638839 A CN114638839 A CN 114638839A CN 202210536170 A CN202210536170 A CN 202210536170A CN 114638839 A CN114638839 A CN 114638839A
- Authority
- CN
- China
- Prior art keywords
- video frame
- matrix
- prototype
- features
- support
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/00—Image analysis
- G06T7/10—Segmentation; Edge detection
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/10—Image acquisition modality
- G06T2207/10016—Video; Image sequence
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20084—Artificial neural networks [ANN]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- General Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Image Analysis (AREA)
Abstract
The invention discloses a small sample video target segmentation method based on dynamic prototype learning, which comprises the following steps: acquiring a video target to be segmented; and processing the video target to be segmented by using a small sample video target segmentation model based on dynamic prototype learning to obtain a video target segmentation result. In the method, an optimal transport approach is used to adaptively learn dynamic prototypes, which effectively suppresses attention noise, and the multi-level feature maps are matched in a guided manner, which greatly reduces the amount of computation; the method fully extracts the target information contained in a small number of support set samples and remarkably improves the segmentation performance on query set videos. The invention also discloses an electronic device, a storage medium and a computer program product for executing the small sample video object segmentation method based on dynamic prototype learning.
Description
Technical Field
The invention relates to the field of computer vision, in particular to a training method of a small sample video target segmentation model and a video target segmentation method.
Background
Video target segmentation is a technology for predicting a foreground target mask in each frame of a video, and has wide application in the aspects of augmented reality, automatic driving, video editing and the like.
In the prior art, video object segmentation is typically performed in a semi-supervised or unsupervised manner. Semi-supervised methods require the target information of the first frame of each video to be given and then densely associate the target across the subsequent frames; this process relies heavily on a large amount of densely segmented and annotated data, which is time-consuming and labor-intensive. Unsupervised methods suffer from low performance due to the lack of annotated data and cannot meet the requirements of practical applications. In addition, neither kind of method generalizes well to new target classes, and the segmentation capability on classes unseen in the training phase drops sharply, which limits the extensibility and practicability of video target segmentation.
Disclosure of Invention
In view of the above, it is a primary object of the present invention to provide a small sample video object segmentation method based on dynamic prototype learning, an electronic device, a storage medium and a computer program product, which are intended to at least partially solve at least one of the above-mentioned technical problems.
According to a first aspect of the present invention, there is provided a small sample video object segmentation method based on dynamic prototype learning, including:
acquiring a video target to be segmented;
processing a video target to be segmented by using a small sample video target segmentation model based on dynamic prototype learning to obtain a video target segmentation result, wherein the small sample video target segmentation model based on dynamic prototype learning is obtained by training according to the following method:
processing the video frame images of the query set and the video frame images of the support set by utilizing a part of the neural network layers of a feature extraction module of the small sample video target segmentation model to obtain low-level features of the query video frame and low-level features of the support video frame;
processing the video frame images of the query set by using all the neural network layers of the feature extraction module of the small sample video target segmentation model to obtain the features of the query video frame;
carrying out a mask operation on the low-level features of the support video frame to obtain foreground features of the support video frame;
processing the foreground features of the support video frame and the features of the query video frame by utilizing a mining module of the small sample video target segmentation model to obtain a correspondence matrix;
processing the low-level features of the support video frame, the low-level features of the query video frame and the correspondence matrix by using a guidance module of the small sample video target segmentation model to obtain a low-level correspondence matrix;
processing the correspondence matrix and the low-level correspondence matrix by using a segmentation module of the small sample video target segmentation model to obtain a video target segmentation result, and optimizing the small sample video target segmentation model by using a loss function of the small sample video target segmentation model;
and iterating the feature extraction operation, the masking operation, the mining operation, the guiding operation, the segmentation operation and the optimization operation until the value of the loss function meets a preset condition, so as to obtain a trained small sample video target segmentation model.
According to an embodiment of the present invention, the processing of the foreground features of the support video frame and the features of the query video frame by the mining module of the small sample video target segmentation model to obtain the correspondence matrix includes:
processing the foreground features of the support video frame by utilizing a prototype generator of the mining module to obtain dynamic prototype features;
calculating the dynamic prototype features and the foreground features of the support video frame to obtain a support correspondence matrix;
calculating the dynamic prototype features and the features of the query video frame to obtain a query correspondence matrix;
and calculating the support correspondence matrix and the query correspondence matrix to obtain the correspondence matrix.
According to an embodiment of the present invention, the processing of the foreground feature of the support video frame by the prototype generator of the mining module to obtain the dynamic prototype feature includes:
carrying out global average pooling on foreground features of the support video frame to obtain video target prototype features;
calculating foreground characteristics of the support video frame and prototype characteristics of the video target by using a prototype generator to obtain an attention matrix;
processing the attention matrix by using an optimal transport algorithm to obtain an optimal allocation matrix;
and calculating the foreground features of the support video frame and the optimal allocation matrix, and calculating the calculation result and the video target prototype features to obtain the dynamic prototype features.
According to an embodiment of the present invention, the above-mentioned attention matrix is determined by equation (1):
wherein f_i^s is the foreground feature vector of the i-th support video frame, i is the index of the support video frame foreground feature vectors, and for a support foreground feature sequence of length N the value of i ranges from 1 to N; p_k is the k-th prototype feature, k is the index of the prototype features, and for K prototype features the value of k ranges from 1 to K; A is the support attention matrix, and A_{k,i}, the value in the k-th row and i-th column of the support attention matrix A, indicates the similarity between the k-th prototype feature and the foreground feature vector of the i-th support video frame;
wherein the dynamic prototype feature is determined by equation (2):
wherein F^s is the sequence of foreground feature vectors of the support video frames, p̂_k is the dynamic prototype feature obtained by updating the k-th prototype feature p_k, Â denotes the optimized support attention matrix, and Â_k is the k-th row vector of the optimized support attention matrix.
According to an embodiment of the present invention, the processing, by the guidance module of the small sample video target segmentation model, of the low-level features of the support video frame, the low-level features of the query video frame and the correspondence matrix to obtain the low-level correspondence matrix includes:
selecting a preset number of rows and a preset number of columns of the correspondence matrix to obtain an intermediate correspondence matrix;
calculating the low-level features of the support video frame and the intermediate correspondence matrix to obtain a reconstructed feature matrix;
and calculating the reconstructed feature matrix and the low-level features of the query video frame to obtain the low-level correspondence matrix.
According to an embodiment of the present invention, the above guidance module is determined by formula (3) and formula (4):
wherein τ is a temperature factor for controlling the smoothing degree of the output probability distribution, ||·|| represents the modulus (length) of a vector, f_j^q is the j-th query video frame feature vector, j is the index of the query video frame feature vectors, and for a query video frame image of height H and width W the value of j ranges from 1 to H·W; C^q is the allocation matrix of the dynamic prototype features and the query video frame features, C^q_{k,j} represents the value in the k-th row and j-th column of the allocation matrix of the dynamic prototype features and the query video frame features, C^s represents the optimized allocation matrix of the dynamic prototype features and the foreground features of the support video frame, M represents the correspondence matrix of the query video frame features and the foreground features of the support video frame, and softmax represents the normalized exponential function.
According to an embodiment of the present invention, the loss function of the small sample video object segmentation model includes an intersection-over-union (IoU) loss function and a cross-entropy loss function;
wherein the cross-entropy loss function is determined by equation (5):
wherein H and W respectively represent the height and width of the input query video frame image or support video frame image, H·W represents the product of the height and the width, Y is the real segmentation result, Y_{x,y} represents the value in the x-th row and y-th column of the real segmentation result, Ŷ is the segmentation result predicted by the model, and Ŷ_{x,y} represents the value in the x-th row and y-th column of the segmentation result predicted by the model;
wherein the intersection-over-union loss function is determined by equation (6):
According to a second aspect of the present invention, there is provided an electronic apparatus comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the above dynamic prototype learning-based small sample video object segmentation method.
According to a third aspect of the present invention, there is provided a computer-readable storage medium having stored thereon executable instructions that, when executed by a processor, cause the processor to perform the above-mentioned small-sample video object segmentation method based on dynamic prototype learning.
The small sample video target segmentation method based on dynamic prototype learning provided by the invention uses an optimal transport approach to adaptively learn dynamic prototypes, which effectively suppresses attention noise; at the same time, the multi-level feature maps are matched in a guided manner, which greatly reduces the amount of computation. In addition, the video segmentation method provided by the invention fully extracts the target information contained in a small number of support set samples and significantly improves the segmentation performance on query set videos.
Drawings
FIG. 1 is a flow chart of a small sample video object segmentation method based on dynamic prototype learning according to an embodiment of the present invention;
FIG. 2 is a flowchart of a training method of a small sample video object segmentation model based on dynamic prototype learning according to an embodiment of the present invention;
FIG. 3 is a flow chart of obtaining a correspondence matrix according to an embodiment of the present invention;
FIG. 4 is a flow diagram for obtaining dynamic prototype features according to an embodiment of the present invention;
FIG. 5 is a flow diagram of obtaining a low-level correspondence matrix according to an embodiment of the invention;
FIG. 6 is a small sample video object segmentation model framework diagram based on dynamic prototype learning according to an embodiment of the present invention;
Fig. 7 schematically illustrates a block diagram of an electronic device adapted to implement the small sample video object segmentation method based on dynamic prototype learning, in accordance with an embodiment of the present invention.
Detailed Description
In order that the objects, technical solutions and advantages of the present invention will become more apparent, the present invention will be further described in detail with reference to the accompanying drawings in conjunction with the following specific embodiments.
The invention provides a small sample video target segmentation method based on dynamic prototype learning, which aims to reduce the dependence on data, improve the expansibility and the practicability and achieve better video target segmentation performance by using a small amount of data with labels.
Among current methods, dense matching over multi-level features achieves leading performance. However, dense matching of pixel-by-pixel features introduces a large amount of correspondence noise, and further processing at multiple scales increases the computational load. The method provided by the invention adaptively learns target prototypes and realizes robust multi-level dense matching through an intermediate bridge, effectively alleviating the problems of noise and heavy computation.
The video segmentation method provided by the invention can be applied to an application system related to video object segmentation; the target in the input video is segmented according to the information provided by a small number of support set images, and the method can be widely applied to scenes such as augmented reality, automatic driving, video editing and the like. In a specific embodiment, the method can be embedded into a mobile device in a software form, and provides a real-time segmentation result of a recorded video; or the method can be installed in a background server to provide a processing result of a large batch of videos.
Fig. 1 is a flowchart of a small sample video object segmentation method based on dynamic prototype learning according to an embodiment of the present invention.
As shown in FIG. 1, the method includes operations S110 to S120.
In operation S110, a video object to be segmented is acquired.
In operation S120, a video object to be segmented is processed using a small sample video object segmentation model based on dynamic prototype learning, and a video object segmentation result is obtained.
Fig. 2 is a flowchart of a training method of a small sample video object segmentation model based on dynamic prototype learning according to an embodiment of the present invention.
As shown in FIG. 2, the method includes operations S210 to S270.
In operation S210, a part of the neural network layers of the feature extraction module of the small sample video object segmentation model is used to process the video frame images of the query set and the video frame images of the support set, so as to obtain low-level features of the query video frame and low-level features of the support video frame.
The low-level features are processed by only a part of the neural network layers of the feature extraction module, so they have a high resolution and contain more detail information, but they are semantically weaker and noisier. High-level features (or simply features), as opposed to low-level features, traverse more neural network layers than the low-level features and therefore carry stronger semantic information, but their resolution is lower and their perception of detail is poorer.
In operation S220, all the neural network layers of the feature extraction module of the small sample video object segmentation model are used to process the video frame images of the query set, so as to obtain the features of the query video frame.
For input support set video frame images and query set video frame images belonging to the same category, the feature extraction module performs multi-level feature extraction based on a ResNet-50 network, and the extracted features are then mapped to a common metric space through a 1x1 convolutional layer.
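By way of illustration, the following sketch (hypothetical PyTorch code, not the patent's reference implementation) extracts multi-level ResNet-50 features and projects them into a common metric space with 1x1 convolutions. The class name FeatureExtractor, the choice of layer2 and layer4 as the low-level and high-level stages, and the 256-dimensional metric space are assumptions introduced only for illustration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class FeatureExtractor(nn.Module):
    """Multi-level ResNet-50 features projected to a common metric space.

    A minimal sketch: layer2 output is treated as the "low-level" feature and
    layer4 output as the "high-level" feature; dim=256 is an assumed size of
    the common metric space.
    """
    def __init__(self, dim=256):
        super().__init__()
        backbone = resnet50()  # pretrained ImageNet weights would normally be loaded here
        self.stem = nn.Sequential(backbone.conv1, backbone.bn1, backbone.relu,
                                  backbone.maxpool, backbone.layer1)
        self.layer2 = backbone.layer2          # low-level stage (higher resolution)
        self.layer3 = backbone.layer3
        self.layer4 = backbone.layer4          # high-level stage (stronger semantics)
        self.proj_low = nn.Conv2d(512, dim, kernel_size=1)   # 1x1 conv to metric space
        self.proj_high = nn.Conv2d(2048, dim, kernel_size=1)

    def forward(self, img, low_only=False):
        x = self.stem(img)
        x2 = self.layer2(x)
        low = self.proj_low(x2)                # partial network -> low-level features
        if low_only:
            return low
        high = self.proj_high(self.layer4(self.layer3(x2)))  # full network -> features
        return low, high
```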
In operation S230, a mask operation is performed on the low-level features of the support video frame to obtain foreground features of the support video frame.
In operation S240, the mining module of the small sample video object segmentation model is used to process the foreground features of the support video frame and the features of the query video frame, so as to obtain a correspondence matrix.
In operation S250, the low-level features of the support video frame, the low-level features of the query video frame and the correspondence matrix are processed by using the guidance module of the small sample video object segmentation model to obtain a low-level correspondence matrix.
In operation S260, the correspondence matrix and the low-level correspondence matrix are processed by using a segmentation module of the small-sample video object segmentation model to obtain a video object segmentation result, and the small-sample video object segmentation model is optimized by using a loss function of the small-sample video object segmentation model.
In operation S270, the feature extraction operation, the masking operation, the mining operation, the guiding operation, the segmentation operation, and the optimization operation are performed iteratively until the value of the loss function satisfies a preset condition, so as to obtain a trained small sample video object segmentation model.
According to the training method provided by the invention, a reliable, generalizable and efficient small sample video target segmentation model can be obtained by utilizing a dynamic prototype mining module based on an optimal transport algorithm and a multi-level dynamic guidance module. When the trained small sample video target segmentation model is used to segment a video target, the optimal transport approach adaptively learns dynamic prototypes, which effectively suppresses attention noise, and the multi-level feature maps are matched in a guided manner, which greatly reduces the amount of computation; at the same time, the video segmentation method provided by the invention fully extracts the target information contained in a small number of support set samples and remarkably improves the segmentation performance on query set videos.
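The sketch below, which reuses the hypothetical FeatureExtractor interface from above, shows how one training iteration could compose the modules described in operations S210 to S270. The argument names (mining, guidance, seg_head, criterion) and the nearest-neighbour mask resizing are assumptions, not the patent's prescribed implementation.

```python
import torch

def training_step(extractor, mining, guidance, seg_head, criterion, optimizer,
                  support_img, support_mask, query_img, query_mask):
    """One hypothetical training iteration for the few-shot segmentation model.

    support_mask: (B, 1, H, W) binary foreground mask of the support frame.
    """
    # S210/S220: multi-level feature extraction
    sup_low = extractor(support_img, low_only=True)
    qry_low, qry_feat = extractor(query_img)
    # S230: mask operation -> support foreground features
    mask = torch.nn.functional.interpolate(support_mask, size=sup_low.shape[-2:],
                                           mode="nearest")
    sup_fg = sup_low * mask
    # S240: mining module -> correspondence matrix
    corr = mining(sup_fg, qry_feat)
    # S250: guidance module -> low-level correspondence matrix
    corr_low = guidance(sup_low, qry_low, corr)
    # S260: segmentation module + loss optimization
    pred = seg_head(corr, corr_low)
    loss = criterion(pred, query_mask)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```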
Fig. 3 is a flowchart of obtaining a correspondence matrix according to an embodiment of the present invention.
As shown in fig. 3, the processing of the foreground features of the support video frame and the features of the query video frame by the mining module of the small sample video object segmentation model to obtain the correspondence matrix includes operations S310 to S340.
In operation S310, the foreground features of the support video frame are processed by a prototype generator of the mining module to obtain dynamic prototype features.
In operation S320, the dynamic prototype feature and the foreground feature of the support video frame are calculated to obtain a support correspondence matrix.
In operation S330, the dynamic prototype features and the query video frame features are calculated to obtain a query correspondence matrix.
In operation S340, the support correspondence matrix and the query correspondence matrix are calculated to obtain the correspondence matrix.
In the process of acquiring the correspondence matrix, the dynamic prototype mining module based on the optimal transport algorithm can fully mine the feature points that associate the foreground features of the support video frame with the features of the query video frame, providing a more solid data basis for the training of the subsequent model.
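As an illustration of operations S320 to S340, the following sketch computes the support and query correspondence matrices against the dynamic prototypes and combines them through the prototypes as an intermediate bridge; the use of cosine similarity and of a matrix product for the combination are assumptions, since the patent text does not fix the exact operations.

```python
import torch
import torch.nn.functional as F

def correspondence_matrices(prototypes, support_fg, query_feat):
    """Sketch of operations S320-S340 (assumed cosine similarity).

    prototypes: (K, C) dynamic prototype features
    support_fg: (N, C) support foreground feature vectors
    query_feat: (M, C) query video frame feature vectors
    """
    p = F.normalize(prototypes, dim=-1)
    s = F.normalize(support_fg, dim=-1)
    q = F.normalize(query_feat, dim=-1)
    corr_support = p @ s.t()        # (K, N) support correspondence matrix
    corr_query = p @ q.t()          # (K, M) query correspondence matrix
    # combine the two through the prototypes acting as an intermediate bridge
    corr = corr_query.t() @ corr_support   # (M, N) correspondence matrix
    return corr_support, corr_query, corr
```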
FIG. 4 is a flow diagram for obtaining dynamic prototype features according to an embodiment of the present invention.
As shown in fig. 4, the processing of the foreground feature of the support video frame by using the prototype generator of the mining module to obtain the dynamic prototype feature includes operations S410 to S440.
In operation S410, global average pooling is performed on the foreground features of the support video frame to obtain video target prototype features.
In operation S420, the prototype generator is used to perform operations on the foreground features of the support video frame and the prototype features of the video target, so as to obtain an attention matrix.
In operation S430, the attention matrix is processed using an optimal transport algorithm to obtain an optimal allocation matrix.
In operation S440, the foreground features of the support video frame and the optimal allocation matrix are calculated, and the calculation result and the video target prototype features are calculated to obtain the dynamic prototype features.
This process of acquiring the dynamic prototype features can effectively suppress the attention noise coming from the original video frame image, thereby improving the segmentation performance of the trained model.
According to an embodiment of the present invention, the above-mentioned attention matrix is determined by equation (1):
wherein f_i^s is the foreground feature vector of the i-th support video frame, i is the index of the support video frame foreground feature vectors, and for a support foreground feature sequence of length N the value of i ranges from 1 to N; p_k is the k-th prototype feature, k is the index of the prototype features, and for K prototype features the value of k ranges from 1 to K; A is the support attention matrix, and A_{k,i}, the value in the k-th row and i-th column of the support attention matrix, indicates the similarity between the k-th prototype feature and the foreground feature vector of the i-th support video frame;
wherein the dynamic prototype feature is determined by equation (2):
wherein F^s is the sequence of foreground feature vectors of the support video frames, p̂_k is the dynamic prototype feature obtained by updating the k-th prototype feature p_k, Â denotes the optimized support attention matrix, namely the support attention matrix optimized by the optimal transport algorithm, and Â_k is the k-th row vector of the optimized support attention matrix, representing the similarity of the k-th prototype feature to the foreground feature vectors of the support video frames.
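Because equations (1) and (2) are not reproduced in this text, the following sketch adopts one common instantiation as an assumption: cosine similarity for the support attention matrix A, and an attention-weighted update of each prototype from the support foreground features.

```python
import torch
import torch.nn.functional as F

def support_attention(prototypes, support_fg):
    """Equation (1), assumed here to be cosine similarity.

    prototypes: (K, C), support_fg: (N, C) -> A: (K, N)
    """
    return F.normalize(prototypes, dim=-1) @ F.normalize(support_fg, dim=-1).t()

def update_prototypes(prototypes, support_fg, attn_opt):
    """Equation (2), assumed here to refine each prototype with the
    attention-weighted sum of support foreground features (one plausible form).

    attn_opt: (K, N) optimized support attention matrix (e.g. after optimal transport)
    """
    weights = attn_opt / attn_opt.sum(dim=1, keepdim=True).clamp(min=1e-8)
    return prototypes + weights @ support_fg   # updated dynamic prototypes
```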
Fig. 5 is a flow chart of obtaining a low-level correspondence matrix according to an embodiment of the invention.
As shown in fig. 5, the processing of the low-level features of the support video frame, the low-level features of the query video frame and the correspondence matrix by the guidance module of the small sample video object segmentation model to obtain the low-level correspondence matrix includes operations S510 to S530.
In operation S510, a preset row number and a preset column number of the corresponding relationship matrix are selected to obtain an intermediate corresponding relationship matrix.
In operation S520, the low-level feature of the support video frame and the intermediate corresponding relationship matrix are operated to obtain a reconstructed feature matrix.
In operation S530, the reconstructed feature matrix and the low-level features of the query video frame are calculated to obtain the low-level correspondence matrix.
According to an embodiment of the present invention, the above guidance module is determined by formula (3) and formula (4):
wherein τ is a temperature factor for controlling the smoothing degree of the output probability distribution, ||·|| represents the modulus (length) of a vector, f_j^q is the j-th query video frame feature vector, j is the index of the query video frame feature vectors, and for a query video frame image of height H and width W the value of j ranges from 1 to H·W; C^q is the allocation matrix of the dynamic prototype features and the query video frame features, C^q_{k,j} represents the value in the k-th row and j-th column of the allocation matrix of the dynamic prototype features and the query video frame features, C^s represents the optimized allocation matrix of the dynamic prototype features and the foreground features of the support video frame, M represents the correspondence matrix of the query video frame features and the foreground features of the support video frame, and softmax represents the normalized exponential function.
The allocation matrix C^s of the dynamic prototype features and the support video frame features is determined analogously, wherein C^s_{k,i} represents the value in the k-th row and i-th column of the allocation matrix of the dynamic prototype features and the support video frame features, f_i^s is the foreground feature vector of the i-th support video frame, and p̂_k is the dynamic prototype feature obtained by updating the k-th prototype feature.
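Equations (3) and (4) are likewise not reproduced here; the sketch below assumes a temperature-scaled cosine softmax for the prototype assignment matrices C^q and C^s, and composes them into the query-support correspondence matrix M through the prototypes as an intermediate bridge.

```python
import torch
import torch.nn.functional as F

def prototype_assignment(prototypes, feats, tau=0.1):
    """Assumed form of equations (3)/(4): temperature-scaled cosine softmax.

    prototypes: (K, C), feats: (M, C) -> assignment matrix (K, M)
    """
    sim = F.normalize(prototypes, dim=-1) @ F.normalize(feats, dim=-1).t()
    return F.softmax(sim / tau, dim=0)   # softmax over prototypes for each feature

def bridged_correspondence(prototypes, support_fg, query_feat, tau=0.1):
    """Correspondence between query features and support foreground features,
    computed through the prototypes instead of dense pixel-to-pixel matching."""
    c_query = prototype_assignment(prototypes, query_feat, tau)    # (K, M)
    c_support = prototype_assignment(prototypes, support_fg, tau)  # (K, N)
    return c_query.t() @ c_support                                 # (M, N)
```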
According to an embodiment of the present invention, the loss function of the small sample video object segmentation model includes an intersection-over-union (IoU) loss function and a cross-entropy loss function;
wherein the cross entropy loss function is determined by equation (5):
wherein H and W respectively represent the height and width of the input query video frame image or support video frame image, H·W represents the product of the height and the width, Y is the real segmentation result, Y_{x,y} represents the value in the x-th row and y-th column of the real segmentation result, Ŷ is the segmentation result predicted by the model, and Ŷ_{x,y} represents the value in the x-th row and y-th column of the segmentation result predicted by the model;
wherein the intersection-over-union loss function is determined by equation (6):
wherein ||·|| represents a norm of the matrix.
Since the segmentation task is similar to a pixel-by-pixel classification task, a dense cross-entropy loss is used as a constraint; at the same time, in order to improve the overlap between the final segmentation result Ŷ and the label mask Y, an intersection-over-union loss is additionally added. The final loss function of the invention is formed by combining the intersection-over-union loss function and the cross-entropy loss function according to a certain weight coefficient; the loss function of the small sample video object segmentation model of the invention is shown in formula (7):
By using this loss function as the constraint of the training method, the training effect of the small sample video target model can be improved, yielding a robust, noise-suppressing small sample video target segmentation model based on dynamic prototype learning.
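The following sketch assumes the standard dense binary cross-entropy for equation (5), the standard soft intersection-over-union loss for equation (6), and a single weighting coefficient for equation (7); these concrete forms are assumptions consistent with the variable definitions above, not a reproduction of the patent's formulas.

```python
import torch

def segmentation_loss(pred, target, lambda_iou=1.0, eps=1e-6):
    """Assumed combined loss: dense binary cross-entropy plus a soft IoU term.

    pred:   (B, H, W) predicted foreground probabilities in [0, 1]
    target: (B, H, W) ground-truth masks in {0, 1}
    lambda_iou: assumed weight coefficient combining the two terms (equation (7))
    """
    pred = pred.clamp(eps, 1.0 - eps)
    # equation (5): pixel-wise cross-entropy averaged over all H*W positions (assumed form)
    ce = -(target * pred.log() + (1 - target) * (1 - pred).log()).mean()
    # equation (6): soft intersection-over-union loss (assumed form)
    inter = (pred * target).sum(dim=(1, 2))
    union = (pred + target - pred * target).sum(dim=(1, 2))
    iou_loss = (1.0 - inter / union.clamp(min=eps)).mean()
    return ce + lambda_iou * iou_loss
```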
Fig. 6 is a small sample video object segmentation model framework diagram based on dynamic prototype learning according to an embodiment of the present invention.
The training process of the model provided by the embodiment of the present invention is further described in detail with reference to fig. 6.
As shown in FIG. 6, the model training framework provided by the present invention comprises a dynamic prototype mining module based on an optimal transport algorithm and a multi-level dynamic guidance module. In the dynamic prototype mining module based on the optimal transport algorithm, for input support set and query set images belonging to the same category, multi-level features are extracted through a ResNet-50 network and then mapped to a common metric space through a 1x1 convolutional layer. The support features are flattened and, with the mask applied, a sequence F^s of N foreground feature vectors of the support video frames is extracted and sent to the prototype generators to obtain K object prototypes, as shown in equations (8) and (9).
Here GAP (Global Average Pooling) is used to average the input sequence of support video frame foreground feature vectors, g represents the target global feature vector, and each prototype generator φ_k, composed of a fully connected layer and an activation function, generates a prototype feature p_k from the target feature input by the current support set, where k indexes the K prototype generators. The foreground pixel features can then be assigned to these prototypes based on the attention matrix A, as shown in equation (1).
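As an illustration of equations (8) and (9), the sketch below applies the mask, performs global average pooling to obtain the target global feature vector g, and feeds it to K generators, each a fully connected layer followed by an activation; the layer sizes, the ReLU activation and the number of prototypes are assumptions.

```python
import torch
import torch.nn as nn

class PrototypeGenerator(nn.Module):
    """Masked GAP of support foreground features followed by K generators
    (each a fully connected layer + activation), sketching equations (8)-(9)."""
    def __init__(self, dim=256, num_prototypes=5):
        super().__init__()
        self.generators = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU()) for _ in range(num_prototypes)])

    def forward(self, support_feat, support_mask):
        # support_feat: (C, H, W); support_mask: (1, H, W) binary foreground mask
        fg = support_feat * support_mask                       # mask operation
        fg_vectors = fg.flatten(1).t()                         # (H*W, C) flattened features
        fg_vectors = fg_vectors[support_mask.flatten() > 0]    # keep the N foreground vectors
        g = fg_vectors.mean(dim=0)                             # global average pooling -> g
        prototypes = torch.stack([gen(g) for gen in self.generators])  # (K, C)
        return prototypes, fg_vectors
```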
In order to allocate a group of semantically consistent pixel features to the same prototype, an optimal allocation matrix is obtained based on optimal transport theory to adjust the mapping relationship between the pixel features and the prototypes; this process mainly solves the optimization problem shown in formulas (10) and (11):
wherein 1 is an all-ones vector, T represents the transition matrix to be solved, T* represents the optimal solution of the transition matrix to be solved, W is a weighting matrix obtained by weighting the attention matrix, Tr represents the trace operation of a matrix, ε is a constant coefficient, H represents the information entropy function, and the constraint set is the space of feasible solutions of the transition matrix of the stated dimensions; a robust dynamic prototype can ultimately be obtained by updating based on T*, as shown in equation (12) and equation (2).
The above process can optimize the prototype vectors through multiple iterations while purifying the allocation matrix of the support set.
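The optimization in formulas (10) and (11) is an entropy-regularized optimal transport problem, which is commonly solved by Sinkhorn iteration; the sketch below assumes uniform row and column marginals and a fixed number of iterations, both of which are assumptions rather than details fixed by the patent text.

```python
import math
import torch

def sinkhorn(attn, eps=0.05, iters=50):
    """Entropy-regularized optimal transport via Sinkhorn iteration (a sketch).

    attn: (K, N) attention matrix between prototypes and support foreground
          features; returns an optimized allocation matrix of the same shape
          whose row/column sums approach uniform marginals.
    """
    K, N = attn.shape
    log_T = attn / eps                       # initialize log-kernel from similarities (cost = -similarity)
    log_r = torch.full((K,), -math.log(K))   # uniform row marginal, one per prototype
    log_c = torch.full((N,), -math.log(N))   # uniform column marginal, one per pixel feature
    for _ in range(iters):
        log_T = log_T - torch.logsumexp(log_T, dim=1, keepdim=True) + log_r[:, None]
        log_T = log_T - torch.logsumexp(log_T, dim=0, keepdim=True) + log_c[None, :]
    return log_T.exp()
```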
In the multi-level dynamic guidance module, for a query set video frame to be segmented, a pseudo label can be allocated to each pixel feature by using the adaptively generated dynamic prototypes, and at the same time the huge amount of computation generated by the dense matching process is reduced by using the prototypes as an intermediate bridge, as shown in formula (3) and formula (4):
wherein τ is the temperature factor. For the high-level features at low resolution, the correspondence matrix M may be used to reconstruct the support video frame features, and the reconstructed features are input into a decoder to predict the segmentation result. For the low-level features at high resolution, a guided method that uses the dynamic prototypes to suppress noise and requires less computation is used for feature reconstruction. Specifically, according to M, the position indexes of maximum similarity are selected from the support set features to obtain the corresponding feature vectors, and the dense matching result of the low-level features is obtained in an indirectly guided manner, as shown in equation (13):
wherein f_j^{q,l} is the j-th feature vector of the low-level features of the query video frame, f_{idx(j)}^{s,l} is the low-level feature vector of the support video frame selected at the corresponding position index, and their product forms M^l_j, the j-th value of the low-level feature dense matching result M^l.
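A minimal sketch of the guided low-level matching around equation (13); the argmax-based index selection, the dot-product similarity, and the assumption that the correspondence matrix has already been brought to the low-level resolution by the guidance module's row and column selection are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def guided_low_level_matching(corr, support_low, query_low):
    """Guided dense matching of high-resolution low-level features (a sketch).

    corr:        (M, N) query-support correspondence matrix at the low-level resolution
    support_low: (N, C) low-level support feature vectors
    query_low:   (M, C) low-level query feature vectors
    """
    idx = corr.argmax(dim=1)                  # best-matching support position per query pixel
    selected = support_low[idx]               # (M, C) reconstructed support features
    q = F.normalize(query_low, dim=-1)
    s = F.normalize(selected, dim=-1)
    return (q * s).sum(dim=-1)                # (M,) low-level dense matching result
```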
The small sample video target segmentation model obtained through the above training process can take a small number of annotated images as support input and segment targets of the same category in video frames.
Fig. 7 schematically illustrates a block diagram of an electronic device adapted to implement the small sample video object segmentation method based on dynamic prototype learning, in accordance with an embodiment of the present invention.
As shown in fig. 7, an electronic device 700 according to an embodiment of the present invention includes a processor 701, which can perform various appropriate actions and processes according to a program stored in a Read Only Memory (ROM) 702 or a program loaded from a storage section 708 into a Random Access Memory (RAM) 703. The processor 701 may include, for example, a general purpose microprocessor (e.g., a CPU), an instruction set processor and/or associated chipset, and/or a special purpose microprocessor (e.g., an Application Specific Integrated Circuit (ASIC)), among others. The processor 701 may also include on-board memory for caching purposes. The processor 701 may comprise a single processing unit or a plurality of processing units for performing the different actions of the method flows according to embodiments of the present invention.
In the RAM 703, various programs and data necessary for the operation of the electronic apparatus 700 are stored. The processor 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. The processor 701 performs various operations of the method flow according to the embodiment of the present invention by executing programs in the ROM 702 and/or the RAM 703. It is noted that the programs may also be stored in one or more memories other than the ROM 702 and RAM 703. The processor 701 may also perform various operations of method flows according to embodiments of the present invention by executing programs stored in the one or more memories.
The present invention also provides a computer-readable storage medium, which may be contained in the apparatus/device/system described in the above embodiments; or may exist separately and not be assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to an embodiment of the present invention.
According to embodiments of the present invention, the computer readable storage medium may be a non-volatile computer readable storage medium, which may include, for example but is not limited to: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present invention, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. For example, according to an embodiment of the present invention, a computer-readable storage medium may include the above-described ROM 702 and/or RAM 703 and/or one or more memories other than the ROM 702 and RAM 703.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are only exemplary embodiments of the present invention and are not intended to limit the present invention, and any modifications, equivalents, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (9)
1. A small sample video object segmentation method based on dynamic prototype learning comprises the following steps:
acquiring a video target to be segmented;
processing the video target to be segmented by using a small sample video target segmentation model based on dynamic prototype learning to obtain a video target segmentation result, wherein the small sample video target segmentation model based on dynamic prototype learning is obtained by training according to the following method:
processing the video frame images of the query set and the video frame images of the support set by utilizing a part of the neural network layers of a feature extraction module of the small sample video target segmentation model to obtain low-level features of the query video frame and low-level features of the support video frame;
processing the video frame images of the query set by using all the neural network layers of the feature extraction module of the small sample video target segmentation model to obtain the features of the query video frame;
carrying out a mask operation on the low-level features of the support video frame to obtain foreground features of the support video frame;
processing the foreground features of the support video frame and the features of the query video frame by utilizing a mining module of the small sample video target segmentation model to obtain a correspondence matrix;
processing the low-level features of the support video frame, the low-level features of the query video frame and the correspondence matrix by utilizing a guidance module of the small sample video target segmentation model to obtain a low-level correspondence matrix;
processing the correspondence matrix and the low-level correspondence matrix by utilizing a segmentation module of the small sample video target segmentation model to obtain a video target segmentation result, and optimizing the small sample video target segmentation model by utilizing a loss function of the small sample video target segmentation model;
and iterating the feature extraction operation, the masking operation, the mining operation, the guiding operation, the segmentation operation and the optimization operation until the value of the loss function meets a preset condition to obtain a trained small sample video target segmentation model.
2. The method of claim 1, wherein the processing of the support video frame foreground features and the query video frame features with the mining module of the small sample video object segmentation model to obtain the correspondence matrix comprises:
processing the foreground features of the support video frame by utilizing a prototype generator of the mining module to obtain dynamic prototype features;
calculating the dynamic prototype features and the foreground features of the support video frame to obtain a support correspondence matrix;
calculating the dynamic prototype features and the features of the query video frame to obtain a query correspondence matrix;
and calculating the support correspondence matrix and the query correspondence matrix to obtain the correspondence matrix.
3. The method of claim 2, wherein said processing the support video frame foreground features with a prototype generator of the mining module to obtain dynamic prototype features comprises:
carrying out global average pooling on the foreground features of the support video frames to obtain video target prototype features;
calculating the foreground characteristics of the support video frame and the prototype characteristics of the video target by using the prototype generator to obtain an attention matrix;
processing the attention matrix by using an optimal transport algorithm to obtain an optimal allocation matrix;
and calculating the foreground features of the support video frame and the optimal allocation matrix, and calculating the calculation result and the video target prototype features to obtain the dynamic prototype features.
4. The method of claim 3, wherein the attention matrix is determined by equation (1):
wherein f_i^s is the foreground feature vector of the i-th support video frame, i is the index of the support video frame foreground feature vectors, and for a support foreground feature sequence of length N the value of i ranges from 1 to N; p_k is the k-th prototype feature, k is the index of the prototype features, and for K prototype features the value of k ranges from 1 to K; A is the support attention matrix, and A_{k,i}, the value in the k-th row and i-th column of the support attention matrix A, indicates the similarity between the k-th prototype feature and the foreground feature vector of the i-th support video frame;
wherein the dynamic prototype feature is determined by equation (2):
wherein F^s is the sequence of foreground feature vectors of the support video frames, p̂_k is the dynamic prototype feature obtained by updating the k-th prototype feature p_k, Â denotes the optimized support attention matrix, and Â_k is the k-th row vector of the optimized support attention matrix.
5. The method of claim 1, wherein the processing of the support video frame low-level features, the query video frame low-level features and the correspondence matrix with the guidance module of the small sample video object segmentation model to obtain the low-level correspondence matrix comprises:
selecting a preset number of rows and a preset number of columns of the correspondence matrix to obtain an intermediate correspondence matrix;
calculating the low-level features of the support video frame and the intermediate correspondence matrix to obtain a reconstructed feature matrix;
and calculating the reconstructed feature matrix and the low-level features of the query video frame to obtain the low-level correspondence matrix.
6. The method of claim 1, wherein the guidance module is determined by formula (3) and formula (4):
wherein τ is a temperature factor for controlling the smoothing degree of the output probability distribution, ||·|| represents the modulus (length) of a vector, f_j^q is the j-th query video frame feature vector, j is the index of the query video frame feature vectors, and for a query video frame image of height H and width W the value of j ranges from 1 to H·W; C^q is the allocation matrix of the dynamic prototype features and the query video frame features, C^q_{k,j} represents the value in the k-th row and j-th column of the allocation matrix of the dynamic prototype features and the query video frame features, C^s represents the optimized allocation matrix of the dynamic prototype features and the support video frame foreground features, M represents the correspondence matrix of the query video frame features and the support video frame foreground features, and softmax represents the normalized exponential function.
7. The method of claim 1, wherein the loss function of the small sample video object segmentation model comprises an intersection-over-union (IoU) loss function and a cross-entropy loss function;
wherein the cross-entropy loss function is determined by equation (5):
wherein H and W respectively represent the height and width of the input query video frame image or support video frame image, H·W represents the product of the height and the width, Y is the real segmentation result, Y_{x,y} represents the value in the x-th row and y-th column of the real segmentation result, Ŷ is the segmentation result predicted by the model, and Ŷ_{x,y} represents the value in the x-th row and y-th column of the segmentation result predicted by the model;
wherein the intersection-over-union loss function is determined by equation (6):
8. An electronic device, comprising:
one or more processors;
a storage device for storing one or more programs,
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to perform the method of any of claims 1-7.
9. A computer readable storage medium having stored thereon executable instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210536170.6A CN114638839B (en) | 2022-05-18 | 2022-05-18 | Small sample video target segmentation method based on dynamic prototype learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210536170.6A CN114638839B (en) | 2022-05-18 | 2022-05-18 | Small sample video target segmentation method based on dynamic prototype learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114638839A true CN114638839A (en) | 2022-06-17 |
CN114638839B CN114638839B (en) | 2022-09-30 |
Family
ID=81953301
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210536170.6A Active CN114638839B (en) | 2022-05-18 | 2022-05-18 | Small sample video target segmentation method based on dynamic prototype learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114638839B (en) |
Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110942463A (en) * | 2019-10-30 | 2020-03-31 | 杭州电子科技大学 | Video target segmentation method based on generation countermeasure network |
US20200117826A1 (en) * | 2018-10-16 | 2020-04-16 | Immuta, Inc. | Data access policy management |
CN111210446A (en) * | 2020-01-08 | 2020-05-29 | 中国科学技术大学 | Video target segmentation method, device and equipment |
CN111583284A (en) * | 2020-04-22 | 2020-08-25 | 中国科学院大学 | Small sample image semantic segmentation method based on hybrid model |
CN113177549A (en) * | 2021-05-11 | 2021-07-27 | 中国科学技术大学 | Few-sample target detection method and system based on dynamic prototype feature fusion |
CN113240039A (en) * | 2021-05-31 | 2021-08-10 | 西安电子科技大学 | Small sample target detection method and system based on spatial position characteristic reweighting |
CN113706487A (en) * | 2021-08-17 | 2021-11-26 | 西安电子科技大学 | Multi-organ segmentation method based on self-supervision characteristic small sample learning |
CN113763385A (en) * | 2021-05-28 | 2021-12-07 | 华南理工大学 | Video object segmentation method, device, equipment and medium |
CN113920127A (en) * | 2021-10-27 | 2022-01-11 | 华南理工大学 | Single sample image segmentation method and system with independent training data set |
EP3961502A1 (en) * | 2020-08-31 | 2022-03-02 | Sap Se | Weakly supervised one-shot image segmentation |
CN114240965A (en) * | 2021-12-13 | 2022-03-25 | 江南大学 | Small sample learning tumor segmentation method driven by graph attention model |
CN114266977A (en) * | 2021-12-27 | 2022-04-01 | 青岛澎湃海洋探索技术有限公司 | Multi-AUV underwater target identification method based on super-resolution selectable network |
Patent Citations (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20200117826A1 (en) * | 2018-10-16 | 2020-04-16 | Immuta, Inc. | Data access policy management |
CN110942463A (en) * | 2019-10-30 | 2020-03-31 | 杭州电子科技大学 | Video target segmentation method based on generation countermeasure network |
CN111210446A (en) * | 2020-01-08 | 2020-05-29 | 中国科学技术大学 | Video target segmentation method, device and equipment |
CN111583284A (en) * | 2020-04-22 | 2020-08-25 | 中国科学院大学 | Small sample image semantic segmentation method based on hybrid model |
EP3961502A1 (en) * | 2020-08-31 | 2022-03-02 | Sap Se | Weakly supervised one-shot image segmentation |
CN113177549A (en) * | 2021-05-11 | 2021-07-27 | 中国科学技术大学 | Few-sample target detection method and system based on dynamic prototype feature fusion |
CN113763385A (en) * | 2021-05-28 | 2021-12-07 | 华南理工大学 | Video object segmentation method, device, equipment and medium |
CN113240039A (en) * | 2021-05-31 | 2021-08-10 | 西安电子科技大学 | Small sample target detection method and system based on spatial position characteristic reweighting |
CN113706487A (en) * | 2021-08-17 | 2021-11-26 | 西安电子科技大学 | Multi-organ segmentation method based on self-supervision characteristic small sample learning |
CN113920127A (en) * | 2021-10-27 | 2022-01-11 | 华南理工大学 | Single sample image segmentation method and system with independent training data set |
CN114240965A (en) * | 2021-12-13 | 2022-03-25 | 江南大学 | Small sample learning tumor segmentation method driven by graph attention model |
CN114266977A (en) * | 2021-12-27 | 2022-04-01 | 青岛澎湃海洋探索技术有限公司 | Multi-AUV underwater target identification method based on super-resolution selectable network |
Non-Patent Citations (4)
Title |
---|
JIAMIN WU 等: "Motion-Modulated Temporal Fragment Alignment Network for Few-Shot Action Recognition", 《ICCV 2021 OPEN ACCESS》 * |
JIE LIU 等: "Dynamic Prototype Convolution Network for Few-Shot Semantic Segmentation", 《ICCV 2021 OPEN ACCESS》 * |
SOOPIL KIM 等: "Uncertainty-Aware Semi-Supervised Few Shot Segmentation", 《HTTPS://ARXIV.ORG/ABS/2110.08954》 * |
贾熹滨 等: "金字塔原型对齐的轻量级小样本语义分割网络", 《北京工业大学学报》 * |
Also Published As
Publication number | Publication date |
---|---|
CN114638839B (en) | 2022-09-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Li et al. | Box-supervised instance segmentation with level set evolution | |
US11030750B2 (en) | Multi-level convolutional LSTM model for the segmentation of MR images | |
CN111860504A (en) | Visual multi-target tracking method and device based on deep learning | |
CN112801103B (en) | Text direction recognition and text direction recognition model training method and device | |
CN114998595B (en) | Weak supervision semantic segmentation method, semantic segmentation method and readable storage medium | |
CN114596566A (en) | Text recognition method and related device | |
CN112990331A (en) | Image processing method, electronic device, and storage medium | |
CN112668608B (en) | Image recognition method and device, electronic equipment and storage medium | |
CN113128478A (en) | Model training method, pedestrian analysis method, device, equipment and storage medium | |
CN113780326A (en) | Image processing method and device, storage medium and electronic equipment | |
CN116982089A (en) | Method and system for image semantic enhancement | |
CN113762327A (en) | Machine learning method, machine learning system and non-transitory computer readable medium | |
CN110717405B (en) | Face feature point positioning method, device, medium and electronic equipment | |
CN114170558A (en) | Method, system, device, medium and article for video processing | |
CN113807354B (en) | Image semantic segmentation method, device, equipment and storage medium | |
CN116980541B (en) | Video editing method, device, electronic equipment and storage medium | |
CN117437423A (en) | Weak supervision medical image segmentation method and device based on SAM collaborative learning and cross-layer feature aggregation enhancement | |
CN115841596B (en) | Multi-label image classification method and training method and device for model thereof | |
CN112907750A (en) | Indoor scene layout estimation method and system based on convolutional neural network | |
CN112861940A (en) | Binocular disparity estimation method, model training method and related equipment | |
CN114638839B (en) | Small sample video target segmentation method based on dynamic prototype learning | |
CN115082778B (en) | Multi-branch learning-based homestead identification method and system | |
CN114842330B (en) | Multi-scale background perception pooling weak supervision building extraction method | |
CN115049546A (en) | Sample data processing method and device, electronic equipment and storage medium | |
CN113792653A (en) | Method, system, equipment and storage medium for cloud detection of remote sensing image |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |