CN115035437A - Video target segmentation method based on mask feature aggregation and target enhancement - Google Patents

Video target segmentation method based on mask feature aggregation and target enhancement

Info

Publication number
CN115035437A
CN115035437A
Authority
CN
China
Prior art keywords
feature
target
mask
query
image
Prior art date
Legal status
Pending
Application number
CN202210569043.6A
Other languages
Chinese (zh)
Inventor
刘勇
梅剑标
王蒙蒙
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202210569043.6A priority Critical patent/CN115035437A/en
Publication of CN115035437A publication Critical patent/CN115035437A/en
Pending legal-status Critical Current

Classifications

    • G06V20/49 — Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
    • G06V10/761 — Image or video pattern matching; proximity, similarity or dissimilarity measures in feature spaces
    • G06V10/806 — Fusion of extracted features, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82 — Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V20/46 — Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention relates to the field of computer vision, and discloses a video target segmentation method based on mask feature aggregation and target enhancement, which comprises the following steps: s1, designing and obtaining an optimized multi-scale mask feature aggregation unit; s2, obtaining a target enhanced feature matching unit by using a target enhanced attention mechanism; s3, training a network model by using a server, optimizing network parameters by reducing a network loss function until the network converges, and obtaining a video target segmentation method based on multi-scale mask feature aggregation and target enhancement; and S4, segmenting a given target for the new video sequence by utilizing the multi-scale mask feature aggregation and target enhancement-based video target segmentation method. The method can fully utilize the edge contour information in the target mask to enhance the learning of the appearance representation of the target, so that the segmentation result has better contour accuracy, and the target can be accurately segmented in a complex environment.

Description

Video target segmentation method based on mask feature aggregation and target enhancement
Technical Field
The invention relates to the technical field of computer vision, in particular to a video target segmentation method based on mask feature aggregation and target enhancement.
Background
In recent years, video object segmentation (VOS) has received much attention because of its wide application in video manipulation and editing. The semi-supervised video object segmentation task is of particular interest: given an initial object mask, the corresponding objects are segmented throughout the video sequence. This greatly simplifies video manipulation and editing applications, since a user only needs to provide an object mask in the first frame, after which the specified objects in the remaining frames are segmented automatically without the user having to laboriously process the entire video.
The reference frame and the corresponding reference target mask are two vital pieces of reference information in video target segmentation; they are mainly used to memorize historical target information and to match the current target features. The reference frame is the original RGB image of a historical frame that has already been segmented, and it contains the complete information of the target and the background environment. The reference target mask is the target mask corresponding to a reference frame (the ground truth for the first frame and the predicted masks for the remaining frames); it contains the edge and contour features of the target and clearly delineates the area and boundary of the target within the background environment.
Although the reference target mask helps an algorithm segment the target accurately, how to properly utilize the reference target mask and effectively fuse it with the reference frame to better memorize and match the target remains an open question. Most previous methods only apply simple auxiliary processing to the reference target mask: they neither mine the features within the target mask further, nor explore how to effectively fuse the target mask features with the reference frame image features, and they ignore the influence of the target mask on the feature matcher. For example, MaskTrack and RGMP simply concatenate the reference frame and the target mask along the channel dimension as the input of the network, while FEELVOS directly uses the target mask to distinguish foreground from background pixels. Only recently has the use of the reference target mask in video object segmentation begun to attract attention. For example, SwiftNet generates target mask features through convolution and sub-pixel modules and fuses them with the reference frame image features to achieve efficient reference feature encoding. However, besides using target masks, these methods usually have other specific designs and different experimental settings to improve their performance, such as network structure, training and inference configuration, hyper-parameters and other special modules, which makes it difficult to determine the most effective way of using the reference target mask, or whether a better way exists. In addition, previous research on the use of the target mask has focused mainly on the feature encoding part and has overlooked its application in the feature matching process, even though the feature matching unit is a very critical link in video target segmentation.
Disclosure of Invention
Aiming at the problems, the invention provides a video target segmentation method based on mask feature aggregation and target enhancement, which can accurately and quickly segment targets in a plurality of difficult actual scenes.
In order to achieve the above object, the present invention provides a video object segmentation method based on mask feature aggregation and object enhancement, comprising the following steps:
s1, designing and obtaining an optimized multi-scale mask feature aggregation unit;
s2, obtaining a target enhanced feature matching unit by using a target enhanced attention mechanism;
s3, training a network model by using a server, optimizing network parameters by reducing a network loss function until the network converges, and obtaining a video target segmentation method based on multi-scale mask feature aggregation and target enhancement;
and S4, segmenting a given target for the new video sequence by using the multi-scale mask feature aggregation and target enhancement-based video target segmentation method.
Preferably, the step S1 specifically includes the following steps:
S11, designing a low-level fused mask feature aggregation unit I1: a query encoding unit and a reference encoding unit that share the same backbone network respectively extract the query image key-value encoding pair (k^Q, v^Q) and the reference key-value encoding pair (k^R, v^R), where the superscript Q refers to the query image and the superscript R refers to the reference set; the query encoding unit is an image feature encoder with 3 input channels, whose end is followed by two parallel convolutions that generate the query image key-value encoding pair (k^Q, v^Q); the reference encoding unit is an image feature encoder with 4 input channels, whose end is followed by two parallel convolutions that generate the reference key-value encoding pair (k^R, v^R); the reference encoding unit first concatenates the reference frame and the reference target mask along the channel dimension and then feeds them together into the 4-channel image feature encoder, with the mathematical expression:
R = {Concate(I_i, M_i)}^N
where I_i and M_i respectively represent the reference frame RGB image and the reference target mask of the i-th frame in the reference set R, N is the reference set size, and Concate represents the concatenation operation along the channel dimension;
S12, designing three high-level fused mask feature aggregation units I2, I3 and I4, which aggregate the target mask (or target mask features) with the image features after the feature extraction stage for foreground discovery; I2 consists of a reference encoding unit and a query encoding unit, the query encoding unit being an image feature encoder whose end is followed by two parallel convolutions that generate the query image key-value encoding pair (k^Q, v^Q), and the reference encoding unit comprising an image feature encoder shared with the query encoder and a mask feature aggregation module that generate the reference key-value encoding pair (k^R, v^R); the reference encoding unit directly downsamples the original target mask and fuses it with the reference frame features output by the image feature encoder using the feature aggregation module; the reference encoding unit of I3 instead uses an independent mask feature encoder to extract features from the target mask, which are then fused with the reference frame features output by the shared image feature encoder using the feature aggregation module; I4 further shares the mask feature encoder and the image feature encoder on the basis of I3;
S13, designing four multi-scale fused mask feature aggregation units I5, I6, I7 and I8, each comprising a reference encoding unit and a query encoding unit and outputting the query image key-value encoding pair (k^Q, v^Q), the reference key-value encoding pair (k^R, v^R) and the target mask feature F^M; I5 adopts the structure of SwiftNet, the image feature encoders in the reference encoding unit and the query encoding unit share features, and the reference encoding unit fuses down-sampled target mask information into the reference frame features extracted by the image feature encoder after the 1st and 4th stages of the backbone network; the reference encoding unit of I6 uses a separate mask feature encoder to extract the reference target mask feature F^M instead of simple down-sampling, and the AFC module fuses it with the reference frame features extracted by the image feature encoder in the first four stages of the backbone network, the image feature encoder being the main branch, with the image feature encoders in the query encoding unit and the reference encoding unit not shared; I7 has basically the same structure as I6 but takes the mask encoder as the main branch and shares the parameters of the image feature encoders in the query encoding unit and the reference encoding unit; I8 differs from I6 only in that the parameters of the image feature encoders in the query encoding unit and the reference encoding unit are shared;
s14, comparing the effect of each type of feature extraction unit after the step S3 and the step S4 by using a default feature matching unit and a default decoding unit to obtain an optimal multi-scale mask feature aggregation unit I8.
Preferably, the feature aggregation module in step S12 consists of two parallel convolution branches, wherein one branch consists of a 1 × 7 convolution and a 7 × 1 convolution connected in series, and the other branch consists of a 7 × 1 convolution and a 1 × 7 convolution connected in series.
preferably, the step S2 specifically includes the following steps:
S21, using the reference mask feature F^M generated by the mask encoder to generate a target attention map w^R; then multiplying the target attention map w^R with the reference value encoding feature v^R to obtain the target-enhanced reference value encoding feature v̂^R;
S22, according to the similarity between the query frame and the previous frame, transferring the target attention map w^R_{-1} corresponding to the previous frame onto the query frame to obtain the target attention map w^Q corresponding to the query frame; then multiplying w^Q with the query value encoding feature v^Q to obtain the target-enhanced query value encoding feature v̂^Q;
S23, using the target-enhanced query image key-value encoding pair (k^Q, v̂^Q) to retrieve the reference key-value encoding pair (k^R, v̂^R), and concatenating the retrieved feature with the query value encoding feature v̂^Q to obtain the final matching feature.
Preferably, the information retrieval process in step S23 specifically includes: first, the similarity between the query key features and the reference key features is calculated and normalized along the reference-frame dimension; the normalized similarity is then used as weights for a weighted summation of the reference value features, and the result is concatenated with the query value encoding features, namely:
y_p = [ Σ_q σ(k^Q_p · k^R_q) v̂^R_q , v̂^Q_p ]
where p and q represent pixels in the query key encoding feature and the reference key encoding feature, respectively, [ ] represents concatenation, σ represents the Softmax function, and y is the output of the feature matching unit.
Preferably, the step S3 specifically includes the following steps:
s31, executing a training video clip generating unit by using a server to generate a training video clip with the length of T, wherein T is more than or equal to 2;
S32, using the server to execute the feature encoding unit to extract the query image key-value encoding pair (k^Q, v^Q), the reference key-value encoding pair (k^R, v^R) and the reference target mask feature F^M;
S33, using the server to execute the target-enhanced feature matching unit in step S2, and retrieving information from the reference key-value encoding pair (k^R, v^R) according to the query image key-value encoding pair (k^Q, v^Q) and the reference target mask feature F^M to obtain the final matching feature;
s34, a server is used for executing a decoding unit and outputting a final segmentation result of the query frame;
S35, performing network training with the server in an end-to-end manner; the mathematical expression of the segmentation loss function L is:
L(Y, M) = L_ce(Y, M) + α · L_IoU(Y, M)
where L_ce(Y, M) denotes the cross-entropy loss and L_IoU(Y, M) denotes the mask intersection-over-union (IoU) loss; Y denotes the target mask ground truth; M denotes the target mask prediction result; Ω denotes the set of all pixels in the target mask; T denotes the length of the training video clip; and α is a hyper-parameter;
and S36, optimizing the objective function by using the server to obtain the local optimal network parameters.
Preferably, the step S31 specifically includes the following steps:
s311, randomly extracting T images from any video of a plurality of video data sets at intervals;
and S312, respectively applying different affine transformations to the T images to form a training video clip, wherein the affine transformations comprise translation, scaling, flipping, rotation and shearing.
Preferably, the step S32 specifically includes: for the reference set, performing feature extraction on the input reference frame images and the reference-frame target mask prediction results with a shared image feature encoder and a mask feature encoder, respectively; then passing the features of each stage of the mask feature encoder through a squeeze-excitation fusion module and adding them to the features of the corresponding stage of the image feature encoder; then injecting the added features back into the image feature encoder; finally, the image feature encoder outputs the reference key-value encoding pair (k^R, v^R) and the mask feature encoder outputs the mask feature F^M; the reference key-value encoding pair (k^R, v^R) is stored directly in memory; for the query frame, the image feature encoder is used directly to obtain the query image encoding feature pair (k^Q, v^Q).
Preferably, the step S34 is specifically: using several residual blocks as the decoder, taking as input the matching feature from step S33 and the query image encoding feature pair (k^Q, v^Q) from step S32 introduced through skip connections, performing 2× upsampling at each stage, and finally outputting the final segmentation result.
Preferably, the step S4 specifically includes the following steps:
s41, initializing a segmentation target, giving a mask of the target to be segmented in a first frame of a new video sequence, and initializing a reference set by using the first frame and the target mask thereof; the segmentation starts from a second frame of the video sequence;
S42, passing the current frame image and the reference set through the feature extraction unit to obtain the query image encoding feature pair (k^Q, v^Q) and the reference key-value encoding pair (k^R, v^R), with the mask feature encoder outputting the mask feature F^M;
S43, executing a target enhanced matching unit to obtain matching characteristics;
S44, inputting the matching feature from step S43 and the query image encoding feature pair (k^Q, v^Q) from step S42 into the decoding unit to obtain the current-frame target mask prediction result;
s45, putting the current frame and its target mask prediction result into the reference set every 5 frames.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a video target segmentation method based on mask feature aggregation and target enhancement, which gives out the optimal feature encoder configuration by designing and comparing eight different feature encoder designs: the multi-scale mask feature aggregation encoder fully utilizes edge contour information in a target mask through multi-scale mask feature aggregation to enhance the learning of target appearance representation, so that a segmentation result has better contour accuracy; by means of the target enhanced attention, given targets in a first frame are paid more attention, interference of targets with similar appearance characteristics and similar colors in a background is weakened, robustness of a method for challenges of rapid movement, deformation, shielding and the like of the targets is enhanced, the targets can be accurately segmented by the system in a complex environment, the targets can be accurately and rapidly segmented in a plurality of difficult actual scenes, J & F values reach 91.1% in a DAVIS2016 verification set, J & F values reach 85.5% in a DAVIS2017 verification set, overgrade scores reach 81.9% in a YouTube-VOS 2018 verification set, and a very good effect is achieved.
Drawings
FIG. 1 is a diagram of eight feature extraction units according to the present invention;
FIG. 2 is a diagram of a target enhanced feature matching unit according to the present invention;
FIG. 3 is an algorithm framework diagram of a video object segmentation method based on multi-scale mask feature aggregation and object enhancement according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Aiming at the problems and the defects in the prior art, the invention provides a video target segmentation method based on mask feature aggregation and target enhancement, which is mainly realized by four stages of multi-scale target mask feature aggregation encoder design, target enhancement type feature matching unit design, model training and model inference.
The invention provides a video target segmentation method based on mask feature aggregation and target enhancement, which comprises the following steps
S1, designing and obtaining an optimized multi-scale mask feature aggregation unit;
s2, obtaining a target enhanced feature matching unit by using a target enhanced attention mechanism;
s3, training a network model by using a server, optimizing network parameters by reducing a network loss function until the network converges, and obtaining a video target segmentation method based on multi-scale mask feature aggregation and target enhancement;
and S4, segmenting a given target for the new video sequence by utilizing the multi-scale mask feature aggregation and target enhancement-based video target segmentation method.
Each step is described in detail below.
And step S1, designing and obtaining an optimized multi-scale mask feature aggregation unit. Eight kinds of feature extraction units are designed and compared, as shown in fig. 1, which illustrates the eight feature extraction unit designs of the invention; from these, the feature extraction unit configuration with the best effect, namely the multi-scale target mask feature aggregation unit, is summarized.
The specific implementation process is as follows:
S11, designing a low-level fused mask feature aggregation unit I1: the query image key-value encoding pair (k^Q, v^Q) and the reference key-value encoding pair (k^R, v^R) are respectively extracted by a query encoding unit and a reference encoding unit that share the same backbone network, where the superscript Q refers to the query image and the superscript R refers to the reference set; the query encoding unit consists of an image feature encoder with 3 input channels (with two parallel convolutions at its end) generating the query feature encoding pair (k^Q, v^Q), and the reference encoding unit consists of an image feature encoder with 4 input channels (with two parallel convolutions at its end) generating the reference feature encoding pair (k^R, v^R); the reference encoding unit first concatenates the reference frame and the reference target mask along the channel dimension and then feeds them together into the 4-channel image feature encoder, with the mathematical expression:
R = {Concate(I_i, M_i)}^N
where I_i and M_i respectively represent the reference frame RGB image and the reference target mask of the i-th frame in the reference set R, N is the reference set size, and Concate represents the concatenation operation along the channel dimension;
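For concreteness, the following is a minimal PyTorch-style sketch of the I1 reference encoding unit described in S11; the backbone, channel dimensions and layer names are illustrative assumptions rather than the exact configuration of the invention.

```python
import torch
import torch.nn as nn

class ReferenceEncoderI1(nn.Module):
    """Low-level fusion (I1): the RGB reference frame and its target mask are
    concatenated along the channel dimension and fed to a 4-channel image feature
    encoder; two parallel convolutions at the end produce the key/value pair."""

    def __init__(self, backbone_4ch: nn.Module, feat_dim: int = 1024,
                 key_dim: int = 128, val_dim: int = 512):
        super().__init__()
        self.backbone = backbone_4ch                       # image feature encoder, 4 input channels
        self.key_conv = nn.Conv2d(feat_dim, key_dim, kernel_size=3, padding=1)
        self.value_conv = nn.Conv2d(feat_dim, val_dim, kernel_size=3, padding=1)

    def forward(self, frame: torch.Tensor, mask: torch.Tensor):
        # frame: (B, 3, H, W) reference RGB image I_i; mask: (B, 1, H, W) reference target mask M_i
        x = torch.cat([frame, mask], dim=1)                # R = Concate(I_i, M_i) along the channel dim
        feat = self.backbone(x)                            # backbone feature map
        return self.key_conv(feat), self.value_conv(feat)  # reference key/value pair (k^R, v^R)
```

The query encoding unit of I1 is identical except that its backbone takes the 3-channel RGB frame alone.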
S12, designing three high-level fused mask feature aggregation units I2, I3 and I4, which aggregate the target mask (or target mask features) with the image features after the feature extraction stage for foreground discovery. I2 is composed of a reference encoding unit and a query encoding unit: the query encoding unit consists of an image feature encoder (with two parallel convolutions at its end) generating the query feature encoding pair (k^Q, v^Q), and the reference encoding unit consists of an image feature encoder shared with the query encoder and a mask feature aggregation module, outputting the reference feature encoding pair (k^R, v^R); the reference encoding unit directly downsamples the original target mask and fuses it with the reference frame features output by the image feature encoder using the feature aggregation module. Unlike I2, the reference encoding unit of I3 uses an independent mask feature encoder to extract features from the target mask, which are then fused with the reference frame features output by the shared image feature encoder using the feature aggregation module. I4 further shares the mask feature encoder and the image feature encoder on the basis of I3. Specifically, the feature aggregation module is composed of two parallel convolution branches: one branch consists of a 1 × 7 convolution followed by a 7 × 1 convolution in series, and the other branch consists of a 7 × 1 convolution followed by a 1 × 7 convolution in series (a minimal implementation sketch of this module is given after the discussion below);
S13, designing four multi-scale fused mask feature aggregation units I5, I6, I7 and I8, each comprising a reference encoding unit and a query encoding unit and outputting the reference feature encoding pair (k^R, v^R), the query feature encoding pair (k^Q, v^Q) and the target mask feature F^M. I5 basically adopts the structure of SwiftNet: the image feature encoders in the reference encoding unit and the query encoding unit share features, and the reference encoding unit fuses down-sampled target mask information into the reference frame features extracted by the image feature encoder after the 1st and 4th stages of the backbone network. The reference encoding unit of I6 uses a separate mask feature encoder to extract the reference target mask feature F^M instead of simple down-sampling, and then the AFC module fuses it with the reference frame features extracted by the image feature encoder in the first four stages of the backbone network (the image feature encoder is the main branch); the image feature encoders in the query encoding unit and the reference encoding unit are not shared. I7 has basically the same structure as I6 but uses the mask encoder as the main branch and shares the parameters of the image feature encoders in the query encoding unit and the reference encoding unit. In contrast to I6, I8 only shares the parameters of the image feature encoders in the query encoding unit and the reference encoding unit.
S14, comparing the effects of various feature extraction units after the steps S3 and S4 by using a default feature matching unit and a default decoding unit to obtain an optimal multi-scale mask feature aggregation unit I8.
The invention designs eight different feature extraction units in order to find an effective way of using the target mask in the feature extraction unit. To test their effectiveness, a unified benchmark is provided: apart from the feature extraction unit, the same architecture (feature matching unit and decoder), the same hyper-parameters and the same training/inference configuration are kept. Two key findings are summarized empirically from the comparison: (i) using a separate encoder to extract the target mask features independently is necessary and more beneficial than using the original mask or simple down-sampling; (ii) multi-scale aggregation of the target mask features and the reference frame image features improves performance, indicating that both low-level and high-level mask features are useful. The invention finally selects the multi-scale target mask feature aggregation unit (I8 in fig. 1) as the final feature extraction unit.
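As an illustration of the two-branch feature aggregation module described in S12 above, the following is a minimal PyTorch-style sketch; the channel width, padding choices and the final additive fusion with the reference frame feature are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class FeatureAggregation(nn.Module):
    """Two parallel convolution branches (1x7 followed by 7x1, and 7x1 followed by 1x7)
    applied to the mask (or mask-feature) input; the branch outputs are combined and
    fused with the reference frame feature of the same resolution."""

    def __init__(self, channels: int):
        super().__init__()
        self.branch_a = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=(1, 7), padding=(0, 3)),
            nn.Conv2d(channels, channels, kernel_size=(7, 1), padding=(3, 0)),
        )
        self.branch_b = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=(7, 1), padding=(3, 0)),
            nn.Conv2d(channels, channels, kernel_size=(1, 7), padding=(0, 3)),
        )

    def forward(self, mask_feat: torch.Tensor, frame_feat: torch.Tensor) -> torch.Tensor:
        agg = self.branch_a(mask_feat) + self.branch_b(mask_feat)  # combine the two branches
        return frame_feat + agg                                    # fuse with the reference frame feature
```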
And step S2, obtaining a target enhanced feature matching unit by using the target enhanced attention mechanism. Fig. 2 shows a diagram of a target enhanced feature matching unit designed by the present invention, which includes the following specific steps:
S21, using the reference mask feature F^M generated by the mask encoder to generate a target attention map w^R; then multiplying the target attention map w^R with the reference value encoding feature v^R to obtain the target-enhanced reference value encoding feature v̂^R, specifically:
w^R = Conv(F^M)
v̂^R = w^R ⊙ v^R
where Conv denotes a 1 × 1 convolution and ⊙ denotes the Hadamard product;
S22, according to the similarity between the query frame and the previous frame, transferring the target attention map w^R_{-1} corresponding to the previous frame onto the query frame to obtain the target attention map w^Q corresponding to the query frame; then multiplying w^Q with the query value encoding feature v^Q to obtain the target-enhanced query value encoding feature v̂^Q, specifically:
w^Q = σ(k^Q × (k^R_{-1})^T) × w^R_{-1}
v̂^Q = w^Q ⊙ v^Q
where ⊙ denotes the Hadamard product, × denotes matrix multiplication, σ denotes the Softmax function, and the subscript -1 denotes the last element of the reference set, i.e. the previous frame of the current frame;
S23, using the target-enhanced query key-value encoding pair (k^Q, v̂^Q) to retrieve the reference key-value encoding pair (k^R, v̂^R), and concatenating the retrieved feature with the query value encoding feature v̂^Q to obtain the final matching feature, specifically:
y_p = [ Σ_q σ(k^Q_p · k^R_q) v̂^R_q , v̂^Q_p ]
where p and q represent pixels in the query key encoding feature and the reference key encoding feature respectively, [ ] represents concatenation, σ represents the Softmax function, and y is the output of the feature matching unit.
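The following is a minimal PyTorch-style sketch of steps S21–S23 for a single object, assuming the key/value features have already been flattened over spatial positions and that the last entry of the reference set corresponds to the previous frame; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def target_enhanced_matching(kQ, vQ, kR, vR, fM, attn_conv):
    """Target-enhanced feature matching (S21-S23), single object.
    kQ: (Ck, HWq)   query key feature        vQ: (Cv, HWq)   query value feature
    kR: (Ck, N*HW)  reference key features   vR: (Cv, N*HW)  reference value features
    fM: (N, Cm, H, W) reference mask features; attn_conv: 1x1 conv with one output channel."""
    N, _, H, W = fM.shape
    hw = H * W
    # S21: target attention map from the mask feature, Hadamard product with v^R
    wR = attn_conv(fM).reshape(1, N * hw)               # w^R = Conv(F^M)
    vR_hat = wR * vR                                    # v̂^R = w^R ⊙ v^R
    # S22: transfer the previous frame's attention map onto the query frame
    sim_prev = F.softmax(kQ.t() @ kR[:, -hw:], dim=1)   # σ(k^Q × (k^R_-1)^T)
    wQ = (sim_prev @ wR[:, -hw:].t()).t()               # w^Q for the query frame
    vQ_hat = wQ * vQ                                    # v̂^Q = w^Q ⊙ v^Q
    # S23: retrieve the reference values with the query keys, then concatenate with v̂^Q
    affinity = F.softmax(kQ.t() @ kR, dim=1)            # σ(k^Q_p · k^R_q), normalised over q
    read = (affinity @ vR_hat.t()).t()                  # Σ_q σ(·) v̂^R_q
    return torch.cat([read, vQ_hat], dim=0)             # y = [read, v̂^Q]
```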
The invention explores how to use the target mask in the feature matching unit. This was often ignored in previous methods, but the invention finds it helpful for eliminating background interference. A common feature matching unit in video object segmentation uses non-local attention. However, the attention between the query frame (current frame) and the reference set (reference frames and reference target masks) in such a feature matching unit involves a large number of unnecessary feature pairs (such as relations between background pixels) and therefore contains excessive background noise and interference. The invention addresses this problem simply and efficiently by explicitly introducing the reference target mask information into the feature matching unit. Unlike a conventional feature matching unit, the proposed target-enhanced feature matching unit uses target-enhanced attention: it first generates a mask attention map from the target mask features and then uses this attention map to enhance the target area and suppress the background (see fig. 3).
And step S3, training the network model by using a server, and optimizing network parameters by reducing a network loss function until the network converges to obtain the video target segmentation method based on multi-scale mask feature aggregation and target enhancement. Fig. 3 is a diagram of an algorithm framework of a video object segmentation method based on multi-scale mask feature aggregation and object enhancement, which includes the following specific steps:
S31, using the server to execute the training video clip generation unit to generate a training video clip of length T, where T ≥ 2; specifically, T images are randomly extracted at intervals from any video in several video datasets and a different random affine transformation (translation, scaling, flipping, rotation and shearing) is applied to each of the T images to form a training video clip; alternatively, a single image is extracted from an image dataset and T different affine transformations are applied to it to form a training video clip;
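A sketch of one way the training-clip generation of S31 could be implemented with torchvision; the sampling interval and transform magnitudes are illustrative assumptions.

```python
import random
import torchvision.transforms.functional as TF

def make_training_clip(frames, masks, T=3, max_gap=5):
    """S31: sample T frame/mask pairs from one video at random intervals and apply a
    different random affine transform (plus optional flip) to each sampled pair."""
    start = random.randint(0, max(0, len(frames) - T * max_gap))
    clip, idx = [], start
    for _ in range(T):
        i = min(idx, len(frames) - 1)
        angle = random.uniform(-15, 15)                                  # rotation
        translate = [random.randint(-20, 20), random.randint(-20, 20)]   # translation (pixels)
        scale = random.uniform(0.9, 1.1)                                 # scaling
        shear = [random.uniform(-10, 10)]                                # shearing
        img = TF.affine(frames[i], angle, translate, scale, shear)
        msk = TF.affine(masks[i], angle, translate, scale, shear)
        if random.random() < 0.5:                                        # horizontal flip
            img, msk = TF.hflip(img), TF.hflip(msk)
        clip.append((img, msk))
        idx += random.randint(1, max_gap)
    return clip
```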
S32, using the server to execute the feature encoding unit to extract the query image key-value encoding pair (k^Q, v^Q), the reference key-value encoding pair (k^R, v^R) and the reference target mask feature F^M, where the superscript Q denotes the query image, the superscript R denotes the reference set and the superscript M denotes the reference target mask. Specifically, for the reference set, feature extraction is performed on the input reference frame images and the reference-frame target mask prediction results with a shared image feature encoder and a mask feature encoder, respectively; the features of each stage of the mask feature encoder are then passed through a squeeze-excitation fusion module (AFC module) and added to the features of the corresponding stage of the image feature encoder; the summed features are then injected back into the image feature encoder; finally, the image feature encoder outputs the reference key-value encoding pair (k^R, v^R) and the mask feature encoder outputs the mask feature F^M; the reference key-value encoding pair (k^R, v^R) is stored directly in memory. For the query frame, the image feature encoder is used directly to obtain the query image encoding feature pair (k^Q, v^Q);
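The squeeze-excitation (AFC) fusion module is only named above; the following sketch shows one plausible form in which a channel gate computed from the mask-branch feature re-weights it before it is added into the corresponding image-encoder stage — all layer choices here are assumptions.

```python
import torch
import torch.nn as nn

class SqueezeExciteFusion(nn.Module):
    """Assumed form of the squeeze-excitation (AFC) fusion: a channel gate computed
    from the mask-branch feature re-weights it before it is added to the image-encoder
    feature of the corresponding stage."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                   # squeeze: global average pooling
        self.excite = nn.Sequential(                          # excitation: per-channel gates
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, mask_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        gate = self.excite(self.pool(mask_feat))              # (B, C, 1, 1) channel attention
        return image_feat + gate * mask_feat                  # inject the re-weighted mask feature
```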
S33, using the server to execute the target-enhanced feature matching unit of step S2; according to the query image key-value encoding pair (k^Q, v^Q) and the reference target mask feature F^M, information is retrieved from the reference key-value encoding pair (k^R, v^R) to obtain the final matching feature;
S34, using the server to execute the decoding unit and output the final segmentation result of the query frame; specifically, several residual blocks are used as the decoder, the matching feature from step S33 and the query encoding features from step S32 introduced through skip connections are used as inputs, 2× upsampling is performed at each stage, and the final segmentation result is output.
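A minimal sketch of the decoding unit of S34 — residual blocks, skip connections from the query encoder and 2× upsampling per stage; the block composition, channel widths and final upsampling back to input resolution are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.skip(x) + self.conv2(F.relu(self.conv1(F.relu(x))))

class Decoder(nn.Module):
    """S34 decoder: refine the matching feature with residual blocks, fuse skip features
    from the query encoder, upsample 2x per stage and predict the single-object mask."""

    def __init__(self, match_ch: int, skip_chs=(512, 256), mid_ch: int = 256):
        super().__init__()
        self.stage0 = ResBlock(match_ch, mid_ch)
        self.skip_convs = nn.ModuleList(nn.Conv2d(c, mid_ch, 1) for c in skip_chs)
        self.stages = nn.ModuleList(ResBlock(mid_ch, mid_ch) for _ in skip_chs)
        self.head = nn.Conv2d(mid_ch, 1, 3, padding=1)        # mask logits

    def forward(self, match_feat, skips):
        x = self.stage0(match_feat)
        for conv, block, s in zip(self.skip_convs, self.stages, skips):
            x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
            x = block(x + conv(s))                            # skip connection from the query encoder
        return F.interpolate(self.head(x), scale_factor=4,    # back to input resolution (assumed)
                             mode='bilinear', align_corners=False)
```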
S35, performing network training with the server in an end-to-end manner; the mathematical expression of the segmentation loss function L is:
L(Y, M) = L_ce(Y, M) + α · L_IoU(Y, M)
where L_ce(Y, M) denotes the cross-entropy loss and L_IoU(Y, M) denotes the mask intersection-over-union (IoU) loss; Y denotes the target mask ground truth; M denotes the target mask prediction result; Ω denotes the set of all pixels in the target mask; T denotes the length of the training video clip; and α is a hyper-parameter;
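The exact forms of L_ce and L_IoU appear only as images in the original publication; the sketch below therefore uses the standard binary cross-entropy and a soft IoU loss as assumed stand-ins, averaged over the pixels Ω and the T frames of the clip.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(pred, gt, alpha=0.5, eps=1e-6):
    """L(Y, M) = L_ce(Y, M) + alpha * L_IoU(Y, M) for one training clip.
    pred, gt: (T, 1, H, W) predicted mask probabilities in [0, 1] and ground truth.
    The BCE / soft-IoU forms and alpha's value are assumptions, not taken from the patent."""
    l_ce = F.binary_cross_entropy(pred, gt)                     # averaged over T frames and pixels
    inter = (pred * gt).flatten(1).sum(dim=1)                   # per-frame soft intersection
    union = (pred + gt - pred * gt).flatten(1).sum(dim=1)       # per-frame soft union
    l_iou = (1.0 - inter / (union + eps)).mean()                # 1 - IoU, averaged over T frames
    return l_ce + alpha * l_iou
```

During training (S36), this loss would be minimised with the AdamW optimizer until convergence to a local optimum.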
s36, optimizing the objective function by using the server to obtain local optimal network parameters; specifically, the loss function L in step S35 is used as a target function, and an AdamW optimizer is used to iteratively update network parameters, so that the target loss function is reduced until the target loss function converges to a local optimum, and the training is ended to obtain the trained network weight of video target segmentation based on multi-scale mask feature aggregation and target enhanced attention.
And step S4, segmenting a given target for a new video sequence by using the multi-scale mask feature aggregation and target enhancement based video target segmentation method. The method comprises the following specific steps:
s41, initializing a segmentation target, giving a mask of the target to be segmented in a first frame of a new video sequence, and initializing a reference set by using the first frame and the target mask thereof; the segmentation starts from a second frame of the video sequence;
S42, passing the current frame (query frame) image and the reference set through the feature extraction unit to obtain the query key-value encoding pair (k^Q, v^Q) and the reference key-value encoding pair (k^R, v^R), with the mask feature encoder outputting the mask feature F^M;
S43, executing a target enhanced matching unit to obtain matching characteristics;
S44, inputting the matching feature from step S43 and the query encoding features from step S42 into the decoding unit to obtain the current-frame target mask prediction result;
s45, putting the current frame and its target mask prediction result into the reference set every 5 frames.
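A sketch of the inference procedure S41–S45, assuming the encoder, matcher and decoder objects expose the interfaces used below (the method names such as encode_reference, encode_query and query_skips are illustrative assumptions).

```python
import torch

def segment_video(frames, first_mask, encoder, matcher, decoder, update_every=5):
    """S41-S45: initialise the reference set with the first frame and its given mask,
    segment every subsequent frame, and extend the reference set every 5 frames."""
    ref_keys, ref_vals, ref_mask_feats = [], [], []

    def add_reference(frame, mask):
        kR, vR, fM = encoder.encode_reference(frame, mask)      # (k^R, v^R) and F^M for one frame
        ref_keys.append(kR); ref_vals.append(vR); ref_mask_feats.append(fM)

    add_reference(frames[0], first_mask)                        # S41: initialise the reference set
    results = [first_mask]
    for t in range(1, len(frames)):                             # segmentation starts from frame 2
        kQ, vQ = encoder.encode_query(frames[t])                # S42: query key/value pair
        match = matcher(kQ, vQ,                                 # S43: target-enhanced matching
                        torch.cat(ref_keys, dim=-1),
                        torch.cat(ref_vals, dim=-1),
                        torch.cat(ref_mask_feats, dim=0))
        pred = decoder(match, encoder.query_skips)              # S44: decode the current-frame mask
        results.append(pred)
        if t % update_every == 0:                               # S45: every 5 frames, add a reference
            add_reference(frames[t], pred)
    return results
```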
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that various dependent claims and the features described herein may be combined in ways different from those described in the original claims. It is also to be understood that features described in connection with individual embodiments may be used in other described embodiments.

Claims (10)

1. The video target segmentation method based on mask feature aggregation and target enhancement is characterized by comprising the following steps of:
s1, designing and obtaining an optimized multi-scale mask feature aggregation unit;
s2, obtaining a target enhanced feature matching unit by using a target enhanced attention mechanism;
s3, training a network model by using a server, optimizing network parameters by reducing a network loss function until the network converges, and obtaining a video target segmentation method based on multi-scale mask feature aggregation and target enhancement;
and S4, segmenting a given target for the new video sequence by utilizing the multi-scale mask feature aggregation and target enhancement-based video target segmentation method.
2. The method for video object segmentation based on mask feature aggregation and object enhancement as claimed in claim 1, wherein the step S1 specifically includes the steps of:
S11, designing a low-level fused mask feature aggregation unit I1: a query encoding unit and a reference encoding unit that share the same backbone network respectively extract the query image key-value encoding pair (k^Q, v^Q) and the reference key-value encoding pair (k^R, v^R), where the superscript Q refers to the query image and the superscript R refers to the reference set; the query encoding unit is an image feature encoder with 3 input channels, whose end is followed by two parallel convolutions that generate the query image key-value encoding pair (k^Q, v^Q); the reference encoding unit is an image feature encoder with 4 input channels, whose end is followed by two parallel convolutions that generate the reference key-value encoding pair (k^R, v^R); the reference encoding unit first concatenates the reference frame and the reference target mask along the channel dimension and then feeds them together into the 4-channel image feature encoder, with the mathematical expression:
R = {Concate(I_i, M_i)}^N
where I_i and M_i respectively represent the reference frame RGB image and the reference target mask of the i-th frame in the reference set R, N is the reference set size, and Concate represents the concatenation operation along the channel dimension;
S12, designing three high-level fused mask feature aggregation units I2, I3 and I4, which aggregate the target mask (or target mask features) with the image features after the feature extraction stage for foreground discovery; I2 consists of a reference encoding unit and a query encoding unit, the query encoding unit being an image feature encoder whose end is followed by two parallel convolutions that generate the query image key-value encoding pair (k^Q, v^Q), and the reference encoding unit comprising an image feature encoder shared with the query encoder and a mask feature aggregation module that generate the reference key-value encoding pair (k^R, v^R); the reference encoding unit directly downsamples the original target mask and fuses it with the reference frame features output by the image feature encoder using the feature aggregation module; the reference encoding unit of I3 instead uses an independent mask feature encoder to extract features from the target mask, which are then fused with the reference frame features output by the shared image feature encoder using the feature aggregation module; I4 further shares the mask feature encoder and the image feature encoder on the basis of I3;
S13, designing four multi-scale fused mask feature aggregation units I5, I6, I7 and I8, each comprising a reference encoding unit and a query encoding unit and outputting the query image key-value encoding pair (k^Q, v^Q), the reference key-value encoding pair (k^R, v^R) and the target mask feature F^M; I5 adopts the structure of SwiftNet, the image feature encoders in the reference encoding unit and the query encoding unit share features, and the reference encoding unit fuses down-sampled target mask information into the reference frame features extracted by the image feature encoder after the 1st and 4th stages of the backbone network; the reference encoding unit of I6 uses a separate mask feature encoder to extract the reference target mask feature F^M instead of simple down-sampling, and the AFC module fuses it with the reference frame features extracted by the image feature encoder in the first four stages of the backbone network, the image feature encoder being the main branch, with the image feature encoders in the query encoding unit and the reference encoding unit not shared; I7 has basically the same structure as I6 but takes the mask encoder as the main branch and shares the parameters of the image feature encoders in the query encoding unit and the reference encoding unit; I8 differs from I6 only in that the parameters of the image feature encoders in the query encoding unit and the reference encoding unit are shared;
s14, comparing the effects of various feature extraction units after the steps S3 and S4 by using a default feature matching unit and a default decoding unit to obtain an optimal multi-scale mask feature aggregation unit I8.
3. The method for video object segmentation based on mask feature aggregation and object enhancement according to claim 2, wherein the feature aggregation module in step S12 is composed of two parallel convolution branches, one of which consists of a 1 x 7 convolution and a 7 x 1 convolution connected in series; the other branch consists of a 7 x 1 convolution and a 1 x 7 convolution connected in series.
4. The method for video object segmentation based on mask feature aggregation and object enhancement according to claim 1, wherein the step S2 specifically includes the following steps:
S21, using the reference mask feature F^M generated by the mask encoder to generate a target attention map w^R; then multiplying the target attention map w^R with the reference value encoding feature v^R to obtain the target-enhanced reference value encoding feature v̂^R;
S22, according to the similarity between the query frame and the previous frame, transferring the target attention map w^R_{-1} corresponding to the previous frame onto the query frame to obtain the target attention map w^Q corresponding to the query frame; then multiplying w^Q with the query value encoding feature v^Q to obtain the target-enhanced query value encoding feature v̂^Q;
S23, using the target-enhanced query image key-value encoding pair (k^Q, v̂^Q) to retrieve the reference key-value encoding pair (k^R, v̂^R), and concatenating the retrieved feature with the query value encoding feature v̂^Q to obtain the final matching feature.
5. The method for video object segmentation based on mask feature aggregation and object enhancement according to claim 4, wherein the information retrieval process in step S23 specifically comprises: first, the similarity between the query key features and the reference key features is calculated and normalized along the reference-frame dimension; the normalized similarity is then used as weights for a weighted summation of the reference value features, and the result is concatenated with the query value encoding features, namely:
y_p = [ Σ_q σ(k^Q_p · k^R_q) v̂^R_q , v̂^Q_p ]
where p and q represent pixels in the query key encoding feature and the reference key encoding feature, respectively, [ ] represents concatenation, σ represents the Softmax function, and y is the output of the feature matching unit.
6. The method for video object segmentation based on mask feature aggregation and object enhancement according to claim 1, wherein the step S3 specifically includes the following steps:
s31, executing a training video clip generating unit by using a server to generate a training video clip with the length of T, wherein T is more than or equal to 2;
S32, using the server to execute the feature encoding unit to extract the query image key-value encoding pair (k^Q, v^Q), the reference key-value encoding pair (k^R, v^R) and the reference target mask feature F^M;
S33, using the server to execute the target-enhanced feature matching unit in step S2, and retrieving information from the reference key-value encoding pair (k^R, v^R) according to the query image key-value encoding pair (k^Q, v^Q) and the reference target mask feature F^M to obtain the final matching feature;
s34, a server is used for executing a decoding unit and outputting a final segmentation result of the query frame;
S35, performing network training with the server in an end-to-end manner; the mathematical expression of the segmentation loss function L is:
L(Y, M) = L_ce(Y, M) + α · L_IoU(Y, M)
where L_ce(Y, M) denotes the cross-entropy loss and L_IoU(Y, M) denotes the mask intersection-over-union (IoU) loss; Y denotes the target mask ground truth; M denotes the target mask prediction result; Ω denotes the set of all pixels in the target mask; T denotes the length of the training video clip; and α is a hyper-parameter;
and S36, optimizing the objective function by using the server to obtain the local optimal network parameters.
7. The method for video object segmentation based on mask feature aggregation and object enhancement according to claim 6, wherein the step S31 specifically includes the following steps:
s311, randomly extracting T images from any video of a plurality of video data sets at intervals;
and S312, respectively applying different affine transformations to the T images to form a training video clip, wherein the affine transformations comprise translation, scaling, flipping, rotation and shearing.
8. The method for video object segmentation based on mask feature aggregation and object enhancement according to claim 6, wherein the step S32 specifically comprises: for the reference set, performing feature extraction on the input reference frame images and the reference-frame target mask prediction results with a shared image feature encoder and a mask feature encoder, respectively; then passing the features of each stage of the mask feature encoder through a squeeze-excitation fusion module and adding them to the features of the corresponding stage of the image feature encoder; then injecting the added features back into the image feature encoder; finally, the image feature encoder outputs the reference key-value encoding pair (k^R, v^R) and the mask feature encoder outputs the mask feature F^M; the reference key-value encoding pair (k^R, v^R) is stored directly in memory; for the query frame, the image feature encoder is used directly to obtain the query image encoding feature pair (k^Q, v^Q).
9. The method for video object segmentation based on mask feature aggregation and object enhancement according to claim 6, wherein the step S34 specifically comprises: using several residual blocks as the decoder, taking as input the matching feature from step S33 and the query image encoding feature pair (k^Q, v^Q) from step S32 introduced through skip connections, performing 2× upsampling at each stage, and finally outputting the final segmentation result.
10. The method for video object segmentation based on mask feature aggregation and object enhancement according to claim 1, wherein the step S4 specifically includes the following steps:
s41, initializing a segmentation target, giving a mask of the target to be segmented in a first frame of a new video sequence, and initializing a reference set by using the first frame and the target mask thereof; the segmentation starts from a second frame of the video sequence;
S42, passing the current frame image and the reference set through the feature extraction unit to obtain the query image encoding feature pair (k^Q, v^Q) and the reference key-value encoding pair (k^R, v^R), with the mask feature encoder outputting the mask feature F^M;
S43, executing a target enhanced matching unit to obtain matching characteristics;
S44, inputting the matching feature from step S43 and the query image encoding feature pair (k^Q, v^Q) from step S42 into the decoding unit to obtain the current-frame target mask prediction result;
s45, putting the current frame and its target mask prediction result into the reference set every 5 frames.
CN202210569043.6A 2022-05-24 2022-05-24 Video target segmentation method based on mask feature aggregation and target enhancement Pending CN115035437A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210569043.6A CN115035437A (en) 2022-05-24 2022-05-24 Video target segmentation method based on mask feature aggregation and target enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210569043.6A CN115035437A (en) 2022-05-24 2022-05-24 Video target segmentation method based on mask feature aggregation and target enhancement

Publications (1)

Publication Number Publication Date
CN115035437A true CN115035437A (en) 2022-09-09

Family

ID=83121690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210569043.6A Pending CN115035437A (en) 2022-05-24 2022-05-24 Video target segmentation method based on mask feature aggregation and target enhancement

Country Status (1)

Country Link
CN (1) CN115035437A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452600A (en) * 2023-06-15 2023-07-18 上海蜜度信息技术有限公司 Instance segmentation method, system, model training method, medium and electronic equipment
CN116452600B (en) * 2023-06-15 2023-10-03 上海蜜度信息技术有限公司 Instance segmentation method, system, model training method, medium and electronic equipment
CN116630869A (en) * 2023-07-26 2023-08-22 北京航空航天大学 Video target segmentation method
CN116630869B (en) * 2023-07-26 2023-11-07 北京航空航天大学 Video target segmentation method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination