CN115035437A - Video target segmentation method based on mask feature aggregation and target enhancement - Google Patents

Video target segmentation method based on mask feature aggregation and target enhancement

Info

Publication number
CN115035437A
CN115035437A
Authority
CN
China
Prior art keywords
feature
target
mask
query
image
Prior art date
Legal status
Pending
Application number
CN202210569043.6A
Other languages
Chinese (zh)
Inventor
刘勇
梅剑标
王蒙蒙
Current Assignee
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202210569043.6A priority Critical patent/CN115035437A/en
Publication of CN115035437A publication Critical patent/CN115035437A/en
Pending legal-status Critical Current

Classifications

    • G06V20/49 — Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G06N3/08 — Computing arrangements based on biological models; neural networks; learning methods
    • G06V10/761 — Image or video pattern matching; proximity, similarity or dissimilarity measures in feature spaces
    • G06V10/806 — Fusion of extracted features, i.e. combining data from various sources at the sensor, preprocessing, feature extraction or classification level
    • G06V10/82 — Image or video recognition or understanding using pattern recognition or machine learning, using neural networks
    • G06V20/46 — Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention relates to the field of computer vision, and discloses a video target segmentation method based on mask feature aggregation and target enhancement, which comprises the following steps: s1, designing and obtaining an optimized multi-scale mask feature aggregation unit; s2, obtaining a target enhanced feature matching unit by using a target enhanced attention mechanism; s3, training a network model by using a server, optimizing network parameters by reducing a network loss function until the network converges, and obtaining a video target segmentation method based on multi-scale mask feature aggregation and target enhancement; and S4, segmenting a given target for the new video sequence by utilizing the multi-scale mask feature aggregation and target enhancement-based video target segmentation method. The method can fully utilize the edge contour information in the target mask to enhance the learning of the appearance representation of the target, so that the segmentation result has better contour accuracy, and the target can be accurately segmented in a complex environment.

Description

Video target segmentation method based on mask feature aggregation and target enhancement
Technical Field
The invention relates to the technical field of computer vision, in particular to a video target segmentation method based on mask feature aggregation and target enhancement.
Background
In recent years, video object segmentation (VOS) has received much attention because of its wide application in video manipulation and editing. The semi-supervised video object segmentation task is of particular interest: given an initial object mask, the corresponding objects are segmented throughout the video sequence. This greatly simplifies video manipulation and editing applications, since a user only needs to provide an object mask in the first frame, after which the specified objects in the remaining frames are segmented automatically without the user having to laboriously process the entire video.
The reference frame and the corresponding reference target mask are two vital pieces of reference information in video target segmentation; they are mainly used to memorize historical target information and to match the current target features. The reference frame is the original RGB image of a historical frame that has already been segmented, and it contains the complete information of the target and the background environment. The reference target mask is the target mask corresponding to a reference frame (the ground truth for the first frame and the predicted masks for the remaining frames); it contains the edge and contour features of the target and clearly delineates the area and boundary of the target within the background environment.
Although the reference target mask helps an algorithm segment the target accurately, how to properly utilize the reference target mask and effectively fuse it with the reference frame to better memorize and match the target remains an open question. Most previous methods only apply simple auxiliary processing to the reference target mask: they neither mine the features within the target mask further, nor explore how to effectively fuse the target mask features with the reference frame image features, and they ignore the influence of the target mask on the feature matcher. For example, MaskTrack and RGMP simply concatenate the reference frame and the target mask along the channel dimension as the input of the network, while FEELVOS directly uses the target mask to distinguish foreground from background pixels. Only recently has the use of the reference target mask in video object segmentation begun to attract attention. For example, SwiftNet generates target mask features through convolution and sub-pixel modules and fuses them with the reference frame image features to achieve efficient reference feature encoding. However, besides using target masks, these methods usually have other specific designs and different experimental settings to improve their performance, such as network structure, training and inference configuration, hyper-parameters and other special modules, which makes it difficult to determine the most effective way of using the reference target mask, or whether a better way exists. In addition, previous research on the use of the target mask has focused mainly on the feature encoding part and has overlooked its application in the feature matching process, even though the feature matching unit is a very critical link in video target segmentation.
Disclosure of Invention
Aiming at the problems, the invention provides a video target segmentation method based on mask feature aggregation and target enhancement, which can accurately and quickly segment targets in a plurality of difficult actual scenes.
In order to achieve the above object, the present invention provides a video object segmentation method based on mask feature aggregation and object enhancement, comprising the following steps:
s1, designing and obtaining an optimized multi-scale mask feature aggregation unit;
s2, obtaining a target enhanced feature matching unit by using a target enhanced attention mechanism;
s3, training a network model by using a server, optimizing network parameters by reducing a network loss function until the network converges, and obtaining a video target segmentation method based on multi-scale mask feature aggregation and target enhancement;
and S4, segmenting a given target for the new video sequence by using the multi-scale mask feature aggregation and target enhancement-based video target segmentation method.
Preferably, the step S1 specifically includes the following steps:
S11, designing a low-level fused mask feature aggregation unit I1: a query encoding unit and a reference encoding unit that share the same backbone network respectively extract the query image key-value encoding pair (k^Q, v^Q) and the reference key-value encoding pair (k^R, v^R), where the superscript Q refers to the query image and the superscript R refers to the reference set; the query encoding unit is an image feature encoder with 3 input channels, whose end is followed by two parallel convolutions that generate the query image key-value encoding pair (k^Q, v^Q); the reference encoding unit is an image feature encoder with 4 input channels, whose end is followed by two parallel convolutions that generate the reference key-value encoding pair (k^R, v^R); the reference encoding unit first concatenates the reference frame and the reference target mask along the channel dimension and then feeds them together into the 4-channel image feature encoder, with the mathematical expression:
R = {Concate(I_i, M_i)}^N
where I_i and M_i respectively represent the reference frame RGB image and the reference target mask of the i-th frame in the reference set R, N is the reference set size, and Concate represents the concatenation operation along the channel dimension;
S12, designing three high-level fused mask feature aggregation units I2, I3 and I4, which aggregate the target mask (or target mask features) with the image features after the feature extraction stage for foreground discovery; I2 consists of a reference encoding unit and a query encoding unit, the query encoding unit being an image feature encoder whose end is followed by two parallel convolutions that generate the query image key-value encoding pair (k^Q, v^Q), and the reference encoding unit comprising an image feature encoder shared with the query encoder and a mask feature aggregation module that generate the reference key-value encoding pair (k^R, v^R); the reference encoding unit directly downsamples the original target mask and fuses it with the reference frame features output by the image feature encoder using the feature aggregation module; the reference encoding unit of I3 instead uses an independent mask feature encoder to extract features from the target mask, which are then fused with the reference frame features output by the shared image feature encoder using the feature aggregation module; I4 further shares the mask feature encoder and the image feature encoder on the basis of I3;
S13, designing four multi-scale fused mask feature aggregation units I5, I6, I7 and I8, each comprising a reference encoding unit and a query encoding unit and outputting the query image key-value encoding pair (k^Q, v^Q), the reference key-value encoding pair (k^R, v^R) and the target mask feature F^M; I5 adopts the structure of SwiftNet, the image feature encoders in the reference encoding unit and the query encoding unit share features, and the reference encoding unit fuses down-sampled target mask information into the reference frame features extracted by the image feature encoder after the 1st and 4th stages of the backbone network; the reference encoding unit of I6 uses a separate mask feature encoder to extract the reference target mask feature F^M instead of simple down-sampling, and the AFC module fuses it with the reference frame features extracted by the image feature encoder in the first four stages of the backbone network, the image feature encoder being the main branch, with the image feature encoders in the query encoding unit and the reference encoding unit not shared; I7 has basically the same structure as I6 but takes the mask encoder as the main branch and shares the parameters of the image feature encoders in the query encoding unit and the reference encoding unit; I8 differs from I6 only in that the parameters of the image feature encoders in the query encoding unit and the reference encoding unit are shared;
s14, comparing the effect of each type of feature extraction unit after the step S3 and the step S4 by using a default feature matching unit and a default decoding unit to obtain an optimal multi-scale mask feature aggregation unit I8.
Preferably, the feature aggregation module in step S12 consists of two parallel convolution branches, wherein one branch consists of a 1 × 7 convolution and a 7 × 1 convolution connected in series, and the other branch consists of a 7 × 1 convolution and a 1 × 7 convolution connected in series.
preferably, the step S2 specifically includes the following steps:
S21, using the reference mask feature F^M generated by the mask encoder to generate a target attention map w^R; then multiplying the target attention map w^R with the reference value encoding feature v^R to obtain the target-enhanced reference value encoding feature v̂^R;
S22, according to the similarity between the query frame and the previous frame, transferring the target attention map w^R_{-1} corresponding to the previous frame onto the query frame to obtain the target attention map w^Q corresponding to the query frame; then multiplying w^Q with the query value encoding feature v^Q to obtain the target-enhanced query value encoding feature v̂^Q;
S23, using the target-enhanced query image key-value encoding pair (k^Q, v̂^Q) to retrieve the reference key-value encoding pair (k^R, v̂^R), and concatenating the retrieved feature with the query value encoding feature v̂^Q to obtain the final matching feature.
Preferably, the information retrieval process in step S23 specifically includes: first, the similarity between the query key features and the reference key features is calculated and normalized along the reference-frame dimension; the normalized similarity is then used as weights for a weighted summation of the reference value features, and the result is concatenated with the query value encoding features, namely:
y_p = [ Σ_q σ(k^Q_p · k^R_q) v̂^R_q , v̂^Q_p ]
where p and q represent pixels in the query key encoding feature and the reference key encoding feature, respectively, [ ] represents concatenation, σ represents the Softmax function, and y is the output of the feature matching unit.
Preferably, the step S3 specifically includes the following steps:
s31, executing a training video clip generating unit by using a server to generate a training video clip with the length of T, wherein T is more than or equal to 2;
S32, using the server to execute the feature encoding unit to extract the query image key-value encoding pair (k^Q, v^Q), the reference key-value encoding pair (k^R, v^R) and the reference target mask feature F^M;
S33, using the server to execute the target-enhanced feature matching unit in step S2, and retrieving information from the reference key-value encoding pair (k^R, v^R) according to the query image key-value encoding pair (k^Q, v^Q) and the reference target mask feature F^M to obtain the final matching feature;
s34, a server is used for executing a decoding unit and outputting a final segmentation result of the query frame;
S35, performing network training with the server in an end-to-end manner; the mathematical expression of the segmentation loss function L is:
L(Y, M) = L_ce(Y, M) + α · L_IoU(Y, M)
where L_ce(Y, M) denotes the cross-entropy loss and L_IoU(Y, M) denotes the mask intersection-over-union (IoU) loss; Y denotes the target mask ground truth; M denotes the target mask prediction result; Ω denotes the set of all pixels in the target mask; T denotes the length of the training video clip; and α is a hyper-parameter;
and S36, optimizing the objective function by using the server to obtain the local optimal network parameters.
Preferably, the step S31 specifically includes the following steps:
s311, randomly extracting T images from any video of a plurality of video data sets at intervals;
and S312, respectively applying different affine transformations to the T images to form a training video clip, wherein the affine transformations comprise translation, scaling, flipping, rotation and shearing.
Preferably, the step S32 specifically includes: for the reference set, performing feature extraction on the input reference frame images and the reference-frame target mask prediction results with a shared image feature encoder and a mask feature encoder, respectively; then passing the features of each stage of the mask feature encoder through a squeeze-excitation fusion module and adding them to the features of the corresponding stage of the image feature encoder; then injecting the added features back into the image feature encoder; finally, the image feature encoder outputs the reference key-value encoding pair (k^R, v^R) and the mask feature encoder outputs the mask feature F^M; the reference key-value encoding pair (k^R, v^R) is stored directly in memory; for the query frame, the image feature encoder is used directly to obtain the query image encoding feature pair (k^Q, v^Q).
Preferably, the step S34 is specifically: using several residual blocks as the decoder, taking as input the matching feature from step S33 and the query image encoding feature pair (k^Q, v^Q) from step S32 introduced through skip connections, performing 2× upsampling at each stage, and finally outputting the final segmentation result.
Preferably, the step S4 specifically includes the following steps:
s41, initializing a segmentation target, giving a mask of the target to be segmented in a first frame of a new video sequence, and initializing a reference set by using the first frame and the target mask thereof; the segmentation starts from a second frame of the video sequence;
S42, passing the current frame image and the reference set through the feature extraction unit to obtain the query image encoding feature pair (k^Q, v^Q) and the reference key-value encoding pair (k^R, v^R), with the mask feature encoder outputting the mask feature F^M;
S43, executing a target enhanced matching unit to obtain matching characteristics;
S44, inputting the matching feature from step S43 and the query image encoding feature pair (k^Q, v^Q) from step S42 into the decoding unit to obtain the current-frame target mask prediction result;
s45, putting the current frame and its target mask prediction result into the reference set every 5 frames.
Compared with the prior art, the invention has the beneficial effects that:
the invention provides a video target segmentation method based on mask feature aggregation and target enhancement, which gives out the optimal feature encoder configuration by designing and comparing eight different feature encoder designs: the multi-scale mask feature aggregation encoder fully utilizes edge contour information in a target mask through multi-scale mask feature aggregation to enhance the learning of target appearance representation, so that a segmentation result has better contour accuracy; by means of the target enhanced attention, given targets in a first frame are paid more attention, interference of targets with similar appearance characteristics and similar colors in a background is weakened, robustness of a method for challenges of rapid movement, deformation, shielding and the like of the targets is enhanced, the targets can be accurately segmented by the system in a complex environment, the targets can be accurately and rapidly segmented in a plurality of difficult actual scenes, J & F values reach 91.1% in a DAVIS2016 verification set, J & F values reach 85.5% in a DAVIS2017 verification set, overgrade scores reach 81.9% in a YouTube-VOS 2018 verification set, and a very good effect is achieved.
Drawings
FIG. 1 is a diagram of eight feature extraction units according to the present invention;
FIG. 2 is a diagram of a target enhanced feature matching unit according to the present invention;
FIG. 3 is an algorithm framework diagram of a video object segmentation method based on multi-scale mask feature aggregation and object enhancement according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Aiming at the problems and the defects in the prior art, the invention provides a video target segmentation method based on mask feature aggregation and target enhancement, which is mainly realized by four stages of multi-scale target mask feature aggregation encoder design, target enhancement type feature matching unit design, model training and model inference.
The invention provides a video target segmentation method based on mask feature aggregation and target enhancement, which comprises the following steps
S1, designing and obtaining an optimized multi-scale mask feature aggregation unit;
s2, obtaining a target enhanced feature matching unit by using a target enhanced attention mechanism;
s3, training a network model by using a server, optimizing network parameters by reducing a network loss function until the network converges, and obtaining a video target segmentation method based on multi-scale mask feature aggregation and target enhancement;
and S4, segmenting a given target for the new video sequence by utilizing the multi-scale mask feature aggregation and target enhancement-based video target segmentation method.
Each step is described in detail below.
And step S1, designing and obtaining an optimized multi-scale mask feature aggregation unit. Eight kinds of feature extraction units are designed and compared, as shown in fig. 1, which illustrates the eight feature extraction unit designs of the invention; from these, the feature extraction unit configuration with the best effect, namely the multi-scale target mask feature aggregation unit, is summarized.
The specific implementation process is as follows:
S11, designing a low-level fused mask feature aggregation unit I1: the query image key-value encoding pair (k^Q, v^Q) and the reference key-value encoding pair (k^R, v^R) are respectively extracted by a query encoding unit and a reference encoding unit that share the same backbone network, where the superscript Q refers to the query image and the superscript R refers to the reference set; the query encoding unit consists of an image feature encoder with 3 input channels (with two parallel convolutions at its end) generating the query feature encoding pair (k^Q, v^Q), and the reference encoding unit consists of an image feature encoder with 4 input channels (with two parallel convolutions at its end) generating the reference feature encoding pair (k^R, v^R); the reference encoding unit first concatenates the reference frame and the reference target mask along the channel dimension and then feeds them together into the 4-channel image feature encoder, with the mathematical expression:
R = {Concate(I_i, M_i)}^N
where I_i and M_i respectively represent the reference frame RGB image and the reference target mask of the i-th frame in the reference set R, N is the reference set size, and Concate represents the concatenation operation along the channel dimension;
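For concreteness, the following is a minimal PyTorch-style sketch of the I1 reference encoding unit described in S11; the backbone, channel dimensions and layer names are illustrative assumptions rather than the exact configuration of the invention.

```python
import torch
import torch.nn as nn

class ReferenceEncoderI1(nn.Module):
    """Low-level fusion (I1): the RGB reference frame and its target mask are
    concatenated along the channel dimension and fed to a 4-channel image feature
    encoder; two parallel convolutions at the end produce the key/value pair."""

    def __init__(self, backbone_4ch: nn.Module, feat_dim: int = 1024,
                 key_dim: int = 128, val_dim: int = 512):
        super().__init__()
        self.backbone = backbone_4ch                       # image feature encoder, 4 input channels
        self.key_conv = nn.Conv2d(feat_dim, key_dim, kernel_size=3, padding=1)
        self.value_conv = nn.Conv2d(feat_dim, val_dim, kernel_size=3, padding=1)

    def forward(self, frame: torch.Tensor, mask: torch.Tensor):
        # frame: (B, 3, H, W) reference RGB image I_i; mask: (B, 1, H, W) reference target mask M_i
        x = torch.cat([frame, mask], dim=1)                # R = Concate(I_i, M_i) along the channel dim
        feat = self.backbone(x)                            # backbone feature map
        return self.key_conv(feat), self.value_conv(feat)  # reference key/value pair (k^R, v^R)
```

The query encoding unit of I1 is identical except that its backbone takes the 3-channel RGB frame alone.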
S12, designing three high-level fused mask feature aggregation units I2, I3 and I4, which aggregate the target mask (or target mask features) with the image features after the feature extraction stage for foreground discovery. I2 is composed of a reference encoding unit and a query encoding unit: the query encoding unit consists of an image feature encoder (with two parallel convolutions at its end) generating the query feature encoding pair (k^Q, v^Q), and the reference encoding unit consists of an image feature encoder shared with the query encoder and a mask feature aggregation module, outputting the reference feature encoding pair (k^R, v^R); the reference encoding unit directly downsamples the original target mask and fuses it with the reference frame features output by the image feature encoder using the feature aggregation module. Unlike I2, the reference encoding unit of I3 uses an independent mask feature encoder to extract features from the target mask, which are then fused with the reference frame features output by the shared image feature encoder using the feature aggregation module. I4 further shares the mask feature encoder and the image feature encoder on the basis of I3. Specifically, the feature aggregation module is composed of two parallel convolution branches: one branch consists of a 1 × 7 convolution followed by a 7 × 1 convolution in series, and the other branch consists of a 7 × 1 convolution followed by a 1 × 7 convolution in series (a minimal implementation sketch of this module is given after the discussion below);
S13, designing four multi-scale fused mask feature aggregation units I5, I6, I7 and I8, each comprising a reference encoding unit and a query encoding unit and outputting the reference feature encoding pair (k^R, v^R), the query feature encoding pair (k^Q, v^Q) and the target mask feature F^M. I5 basically adopts the structure of SwiftNet: the image feature encoders in the reference encoding unit and the query encoding unit share features, and the reference encoding unit fuses down-sampled target mask information into the reference frame features extracted by the image feature encoder after the 1st and 4th stages of the backbone network. The reference encoding unit of I6 uses a separate mask feature encoder to extract the reference target mask feature F^M instead of simple down-sampling, and then the AFC module fuses it with the reference frame features extracted by the image feature encoder in the first four stages of the backbone network (the image feature encoder is the main branch); the image feature encoders in the query encoding unit and the reference encoding unit are not shared. I7 has basically the same structure as I6 but uses the mask encoder as the main branch and shares the parameters of the image feature encoders in the query encoding unit and the reference encoding unit. In contrast to I6, I8 only shares the parameters of the image feature encoders in the query encoding unit and the reference encoding unit.
S14, comparing the effects of various feature extraction units after the steps S3 and S4 by using a default feature matching unit and a default decoding unit to obtain an optimal multi-scale mask feature aggregation unit I8.
The invention designs eight different feature extraction units in order to find an effective way of using the target mask in the feature extraction unit. To test their effectiveness, a unified benchmark is provided: apart from the feature extraction unit, the same architecture (feature matching unit and decoder), the same hyper-parameters and the same training/inference configuration are kept. Two key findings are summarized empirically from the comparison: (i) using a separate encoder to extract the target mask features independently is necessary and more beneficial than using the original mask or simple down-sampling; (ii) multi-scale aggregation of the target mask features and the reference frame image features improves performance, indicating that both low-level and high-level mask features are useful. The invention finally selects the multi-scale target mask feature aggregation unit (I8 in fig. 1) as the final feature extraction unit.
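As an illustration of the two-branch feature aggregation module described in S12 above, the following is a minimal PyTorch-style sketch; the channel width, padding choices and the final additive fusion with the reference frame feature are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class FeatureAggregation(nn.Module):
    """Two parallel convolution branches (1x7 followed by 7x1, and 7x1 followed by 1x7)
    applied to the mask (or mask-feature) input; the branch outputs are combined and
    fused with the reference frame feature of the same resolution."""

    def __init__(self, channels: int):
        super().__init__()
        self.branch_a = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=(1, 7), padding=(0, 3)),
            nn.Conv2d(channels, channels, kernel_size=(7, 1), padding=(3, 0)),
        )
        self.branch_b = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=(7, 1), padding=(3, 0)),
            nn.Conv2d(channels, channels, kernel_size=(1, 7), padding=(0, 3)),
        )

    def forward(self, mask_feat: torch.Tensor, frame_feat: torch.Tensor) -> torch.Tensor:
        agg = self.branch_a(mask_feat) + self.branch_b(mask_feat)  # combine the two branches
        return frame_feat + agg                                    # fuse with the reference frame feature
```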
And step S2, obtaining a target enhanced feature matching unit by using the target enhanced attention mechanism. Fig. 2 shows a diagram of a target enhanced feature matching unit designed by the present invention, which includes the following specific steps:
S21, using the reference mask feature F^M generated by the mask encoder to generate a target attention map w^R; then multiplying the target attention map w^R with the reference value encoding feature v^R to obtain the target-enhanced reference value encoding feature v̂^R, specifically:
w^R = Conv(F^M)
v̂^R = w^R ⊙ v^R
where Conv denotes a 1 × 1 convolution and ⊙ denotes the Hadamard product;
S22, according to the similarity between the query frame and the previous frame, transferring the target attention map w^R_{-1} corresponding to the previous frame onto the query frame to obtain the target attention map w^Q corresponding to the query frame; then multiplying w^Q with the query value encoding feature v^Q to obtain the target-enhanced query value encoding feature v̂^Q, specifically:
w^Q = σ(k^Q × (k^R_{-1})^T) × w^R_{-1}
v̂^Q = w^Q ⊙ v^Q
where ⊙ denotes the Hadamard product, × denotes matrix multiplication, σ denotes the Softmax function, and the subscript -1 denotes the last element of the reference set, i.e. the previous frame of the current frame;
S23, using the target-enhanced query key-value encoding pair (k^Q, v̂^Q) to retrieve the reference key-value encoding pair (k^R, v̂^R), and concatenating the retrieved feature with the query value encoding feature v̂^Q to obtain the final matching feature, specifically:
y_p = [ Σ_q σ(k^Q_p · k^R_q) v̂^R_q , v̂^Q_p ]
where p and q represent pixels in the query key encoding feature and the reference key encoding feature respectively, [ ] represents concatenation, σ represents the Softmax function, and y is the output of the feature matching unit.
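The following is a minimal PyTorch-style sketch of steps S21–S23 for a single object, assuming the key/value features have already been flattened over spatial positions and that the last entry of the reference set corresponds to the previous frame; the function and variable names are illustrative.

```python
import torch
import torch.nn.functional as F

def target_enhanced_matching(kQ, vQ, kR, vR, fM, attn_conv):
    """Target-enhanced feature matching (S21-S23), single object.
    kQ: (Ck, HWq)   query key feature        vQ: (Cv, HWq)   query value feature
    kR: (Ck, N*HW)  reference key features   vR: (Cv, N*HW)  reference value features
    fM: (N, Cm, H, W) reference mask features; attn_conv: 1x1 conv with one output channel."""
    N, _, H, W = fM.shape
    hw = H * W
    # S21: target attention map from the mask feature, Hadamard product with v^R
    wR = attn_conv(fM).reshape(1, N * hw)               # w^R = Conv(F^M)
    vR_hat = wR * vR                                    # v̂^R = w^R ⊙ v^R
    # S22: transfer the previous frame's attention map onto the query frame
    sim_prev = F.softmax(kQ.t() @ kR[:, -hw:], dim=1)   # σ(k^Q × (k^R_-1)^T)
    wQ = (sim_prev @ wR[:, -hw:].t()).t()               # w^Q for the query frame
    vQ_hat = wQ * vQ                                    # v̂^Q = w^Q ⊙ v^Q
    # S23: retrieve the reference values with the query keys, then concatenate with v̂^Q
    affinity = F.softmax(kQ.t() @ kR, dim=1)            # σ(k^Q_p · k^R_q), normalised over q
    read = (affinity @ vR_hat.t()).t()                  # Σ_q σ(·) v̂^R_q
    return torch.cat([read, vQ_hat], dim=0)             # y = [read, v̂^Q]
```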
The invention explores how to use the target mask in the feature matching unit. This was often ignored in previous methods, but the invention finds it helpful for eliminating background interference. A common feature matching unit in video object segmentation uses non-local attention. However, the attention between the query frame (current frame) and the reference set (reference frames and reference target masks) in such a feature matching unit involves a large number of unnecessary feature pairs (such as relations between background pixels) and therefore contains excessive background noise and interference. The invention addresses this problem simply and efficiently by explicitly introducing the reference target mask information into the feature matching unit. Unlike a conventional feature matching unit, the proposed target-enhanced feature matching unit uses target-enhanced attention: it first generates a mask attention map from the target mask features and then uses this attention map to enhance the target area and suppress the background (see fig. 3).
And step S3, training the network model by using a server, and optimizing network parameters by reducing a network loss function until the network converges to obtain the video target segmentation method based on multi-scale mask feature aggregation and target enhancement. Fig. 3 is a diagram of an algorithm framework of a video object segmentation method based on multi-scale mask feature aggregation and object enhancement, which includes the following specific steps:
S31, using the server to execute the training video clip generation unit to generate a training video clip of length T, where T ≥ 2; specifically, T images are randomly extracted at intervals from any video in several video datasets and a different random affine transformation (translation, scaling, flipping, rotation and shearing) is applied to each of the T images to form a training video clip; alternatively, a single image is extracted from an image dataset and T different affine transformations are applied to it to form a training video clip;
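A sketch of one way the training-clip generation of S31 could be implemented with torchvision; the sampling interval and transform magnitudes are illustrative assumptions.

```python
import random
import torchvision.transforms.functional as TF

def make_training_clip(frames, masks, T=3, max_gap=5):
    """S31: sample T frame/mask pairs from one video at random intervals and apply a
    different random affine transform (plus optional flip) to each sampled pair."""
    start = random.randint(0, max(0, len(frames) - T * max_gap))
    clip, idx = [], start
    for _ in range(T):
        i = min(idx, len(frames) - 1)
        angle = random.uniform(-15, 15)                                  # rotation
        translate = [random.randint(-20, 20), random.randint(-20, 20)]   # translation (pixels)
        scale = random.uniform(0.9, 1.1)                                 # scaling
        shear = [random.uniform(-10, 10)]                                # shearing
        img = TF.affine(frames[i], angle, translate, scale, shear)
        msk = TF.affine(masks[i], angle, translate, scale, shear)
        if random.random() < 0.5:                                        # horizontal flip
            img, msk = TF.hflip(img), TF.hflip(msk)
        clip.append((img, msk))
        idx += random.randint(1, max_gap)
    return clip
```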
S32, using the server to execute the feature encoding unit to extract the query image key-value encoding pair (k^Q, v^Q), the reference key-value encoding pair (k^R, v^R) and the reference target mask feature F^M, where the superscript Q denotes the query image, the superscript R denotes the reference set and the superscript M denotes the reference target mask. Specifically, for the reference set, feature extraction is performed on the input reference frame images and the reference-frame target mask prediction results with a shared image feature encoder and a mask feature encoder, respectively; the features of each stage of the mask feature encoder are then passed through a squeeze-excitation fusion module (AFC module) and added to the features of the corresponding stage of the image feature encoder; the summed features are then injected back into the image feature encoder; finally, the image feature encoder outputs the reference key-value encoding pair (k^R, v^R) and the mask feature encoder outputs the mask feature F^M; the reference key-value encoding pair (k^R, v^R) is stored directly in memory. For the query frame, the image feature encoder is used directly to obtain the query image encoding feature pair (k^Q, v^Q);
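The squeeze-excitation (AFC) fusion module is only named above; the following sketch shows one plausible form in which a channel gate computed from the mask-branch feature re-weights it before it is added into the corresponding image-encoder stage — all layer choices here are assumptions.

```python
import torch
import torch.nn as nn

class SqueezeExciteFusion(nn.Module):
    """Assumed form of the squeeze-excitation (AFC) fusion: a channel gate computed
    from the mask-branch feature re-weights it before it is added to the image-encoder
    feature of the corresponding stage."""

    def __init__(self, channels: int, reduction: int = 4):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)                   # squeeze: global average pooling
        self.excite = nn.Sequential(                          # excitation: per-channel gates
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, mask_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        gate = self.excite(self.pool(mask_feat))              # (B, C, 1, 1) channel attention
        return image_feat + gate * mask_feat                  # inject the re-weighted mask feature
```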
S33, using the server to execute the target-enhanced feature matching unit of step S2; according to the query image key-value encoding pair (k^Q, v^Q) and the reference target mask feature F^M, information is retrieved from the reference key-value encoding pair (k^R, v^R) to obtain the final matching feature;
S34, using the server to execute the decoding unit and output the final segmentation result of the query frame; specifically, several residual blocks are used as the decoder, the matching feature from step S33 and the query encoding features from step S32 introduced through skip connections are used as inputs, 2× upsampling is performed at each stage, and the final segmentation result is output.
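A minimal sketch of the decoding unit of S34 — residual blocks, skip connections from the query encoder and 2× upsampling per stage; the block composition, channel widths and final upsampling back to input resolution are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ResBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, 3, padding=1)
        self.skip = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x):
        return self.skip(x) + self.conv2(F.relu(self.conv1(F.relu(x))))

class Decoder(nn.Module):
    """S34 decoder: refine the matching feature with residual blocks, fuse skip features
    from the query encoder, upsample 2x per stage and predict the single-object mask."""

    def __init__(self, match_ch: int, skip_chs=(512, 256), mid_ch: int = 256):
        super().__init__()
        self.stage0 = ResBlock(match_ch, mid_ch)
        self.skip_convs = nn.ModuleList(nn.Conv2d(c, mid_ch, 1) for c in skip_chs)
        self.stages = nn.ModuleList(ResBlock(mid_ch, mid_ch) for _ in skip_chs)
        self.head = nn.Conv2d(mid_ch, 1, 3, padding=1)        # mask logits

    def forward(self, match_feat, skips):
        x = self.stage0(match_feat)
        for conv, block, s in zip(self.skip_convs, self.stages, skips):
            x = F.interpolate(x, scale_factor=2, mode='bilinear', align_corners=False)
            x = block(x + conv(s))                            # skip connection from the query encoder
        return F.interpolate(self.head(x), scale_factor=4,    # back to input resolution (assumed)
                             mode='bilinear', align_corners=False)
```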
S35, performing network training with the server in an end-to-end manner; the mathematical expression of the segmentation loss function L is:
L(Y, M) = L_ce(Y, M) + α · L_IoU(Y, M)
where L_ce(Y, M) denotes the cross-entropy loss and L_IoU(Y, M) denotes the mask intersection-over-union (IoU) loss; Y denotes the target mask ground truth; M denotes the target mask prediction result; Ω denotes the set of all pixels in the target mask; T denotes the length of the training video clip; and α is a hyper-parameter;
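The exact forms of L_ce and L_IoU appear only as images in the original publication; the sketch below therefore uses the standard binary cross-entropy and a soft IoU loss as assumed stand-ins, averaged over the pixels Ω and the T frames of the clip.

```python
import torch
import torch.nn.functional as F

def segmentation_loss(pred, gt, alpha=0.5, eps=1e-6):
    """L(Y, M) = L_ce(Y, M) + alpha * L_IoU(Y, M) for one training clip.
    pred, gt: (T, 1, H, W) predicted mask probabilities in [0, 1] and ground truth.
    The BCE / soft-IoU forms and alpha's value are assumptions, not taken from the patent."""
    l_ce = F.binary_cross_entropy(pred, gt)                     # averaged over T frames and pixels
    inter = (pred * gt).flatten(1).sum(dim=1)                   # per-frame soft intersection
    union = (pred + gt - pred * gt).flatten(1).sum(dim=1)       # per-frame soft union
    l_iou = (1.0 - inter / (union + eps)).mean()                # 1 - IoU, averaged over T frames
    return l_ce + alpha * l_iou
```

During training (S36), this loss would be minimised with the AdamW optimizer until convergence to a local optimum.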
s36, optimizing the objective function by using the server to obtain local optimal network parameters; specifically, the loss function L in step S35 is used as a target function, and an AdamW optimizer is used to iteratively update network parameters, so that the target loss function is reduced until the target loss function converges to a local optimum, and the training is ended to obtain the trained network weight of video target segmentation based on multi-scale mask feature aggregation and target enhanced attention.
And step S4, segmenting a given target for a new video sequence by using the multi-scale mask feature aggregation and target enhancement based video target segmentation method. The method comprises the following specific steps:
s41, initializing a segmentation target, giving a mask of the target to be segmented in a first frame of a new video sequence, and initializing a reference set by using the first frame and the target mask thereof; the segmentation starts from a second frame of the video sequence;
S42, passing the current frame (query frame) image and the reference set through the feature extraction unit to obtain the query key-value encoding pair (k^Q, v^Q) and the reference key-value encoding pair (k^R, v^R), with the mask feature encoder outputting the mask feature F^M;
S43, executing a target enhanced matching unit to obtain matching characteristics;
S44, inputting the matching feature from step S43 and the query encoding features from step S42 into the decoding unit to obtain the current-frame target mask prediction result;
s45, putting the current frame and its target mask prediction result into the reference set every 5 frames.
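A sketch of the inference procedure S41–S45, assuming the encoder, matcher and decoder objects expose the interfaces used below (the method names such as encode_reference, encode_query and query_skips are illustrative assumptions).

```python
import torch

def segment_video(frames, first_mask, encoder, matcher, decoder, update_every=5):
    """S41-S45: initialise the reference set with the first frame and its given mask,
    segment every subsequent frame, and extend the reference set every 5 frames."""
    ref_keys, ref_vals, ref_mask_feats = [], [], []

    def add_reference(frame, mask):
        kR, vR, fM = encoder.encode_reference(frame, mask)      # (k^R, v^R) and F^M for one frame
        ref_keys.append(kR); ref_vals.append(vR); ref_mask_feats.append(fM)

    add_reference(frames[0], first_mask)                        # S41: initialise the reference set
    results = [first_mask]
    for t in range(1, len(frames)):                             # segmentation starts from frame 2
        kQ, vQ = encoder.encode_query(frames[t])                # S42: query key/value pair
        match = matcher(kQ, vQ,                                 # S43: target-enhanced matching
                        torch.cat(ref_keys, dim=-1),
                        torch.cat(ref_vals, dim=-1),
                        torch.cat(ref_mask_feats, dim=0))
        pred = decoder(match, encoder.query_skips)              # S44: decode the current-frame mask
        results.append(pred)
        if t % update_every == 0:                               # S45: every 5 frames, add a reference
            add_reference(frames[t], pred)
    return results
```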
Although the invention herein has been described with reference to particular embodiments, it is to be understood that these embodiments are merely illustrative of the principles and applications of the present invention. It is therefore to be understood that numerous modifications may be made to the illustrative embodiments and that other arrangements may be devised without departing from the spirit and scope of the present invention as defined by the appended claims. It should be understood that various dependent claims and the features described herein may be combined in ways different from those described in the original claims. It is also to be understood that features described in connection with individual embodiments may be used in other described embodiments.

Claims (10)

1. The video target segmentation method based on mask feature aggregation and target enhancement is characterized by comprising the following steps of:
s1, designing and obtaining an optimized multi-scale mask feature aggregation unit;
s2, obtaining a target enhanced feature matching unit by using a target enhanced attention mechanism;
s3, training a network model by using a server, optimizing network parameters by reducing a network loss function until the network converges, and obtaining a video target segmentation method based on multi-scale mask feature aggregation and target enhancement;
and S4, segmenting a given target for the new video sequence by utilizing the multi-scale mask feature aggregation and target enhancement-based video target segmentation method.
2. The method for video object segmentation based on mask feature aggregation and object enhancement as claimed in claim 1, wherein the step S1 specifically includes the steps of:
S11, designing a low-level fused mask feature aggregation unit I1: a query encoding unit and a reference encoding unit that share the same backbone network respectively extract the query image key-value encoding pair (k^Q, v^Q) and the reference key-value encoding pair (k^R, v^R), where the superscript Q refers to the query image and the superscript R refers to the reference set; the query encoding unit is an image feature encoder with 3 input channels, whose end is followed by two parallel convolutions that generate the query image key-value encoding pair (k^Q, v^Q); the reference encoding unit is an image feature encoder with 4 input channels, whose end is followed by two parallel convolutions that generate the reference key-value encoding pair (k^R, v^R); the reference encoding unit first concatenates the reference frame and the reference target mask along the channel dimension and then feeds them together into the 4-channel image feature encoder, with the mathematical expression:
R = {Concate(I_i, M_i)}^N
where I_i and M_i respectively represent the reference frame RGB image and the reference target mask of the i-th frame in the reference set R, N is the reference set size, and Concate represents the concatenation operation along the channel dimension;
S12, designing three high-level fused mask feature aggregation units I2, I3 and I4, which aggregate the target mask (or target mask features) with the image features after the feature extraction stage for foreground discovery; I2 consists of a reference encoding unit and a query encoding unit, the query encoding unit being an image feature encoder whose end is followed by two parallel convolutions that generate the query image key-value encoding pair (k^Q, v^Q), and the reference encoding unit comprising an image feature encoder shared with the query encoder and a mask feature aggregation module that generate the reference key-value encoding pair (k^R, v^R); the reference encoding unit directly downsamples the original target mask and fuses it with the reference frame features output by the image feature encoder using the feature aggregation module; the reference encoding unit of I3 instead uses an independent mask feature encoder to extract features from the target mask, which are then fused with the reference frame features output by the shared image feature encoder using the feature aggregation module; I4 further shares the mask feature encoder and the image feature encoder on the basis of I3;
S13, designing four multi-scale fused mask feature aggregation units I5, I6, I7 and I8, each comprising a reference encoding unit and a query encoding unit and outputting the query image key-value encoding pair (k^Q, v^Q), the reference key-value encoding pair (k^R, v^R) and the target mask feature F^M; I5 adopts the structure of SwiftNet, the image feature encoders in the reference encoding unit and the query encoding unit share features, and the reference encoding unit fuses down-sampled target mask information into the reference frame features extracted by the image feature encoder after the 1st and 4th stages of the backbone network; the reference encoding unit of I6 uses a separate mask feature encoder to extract the reference target mask feature F^M instead of simple down-sampling, and the AFC module fuses it with the reference frame features extracted by the image feature encoder in the first four stages of the backbone network, the image feature encoder being the main branch, with the image feature encoders in the query encoding unit and the reference encoding unit not shared; I7 has basically the same structure as I6 but takes the mask encoder as the main branch and shares the parameters of the image feature encoders in the query encoding unit and the reference encoding unit; I8 differs from I6 only in that the parameters of the image feature encoders in the query encoding unit and the reference encoding unit are shared;
s14, comparing the effects of various feature extraction units after the steps S3 and S4 by using a default feature matching unit and a default decoding unit to obtain an optimal multi-scale mask feature aggregation unit I8.
3. The method for video object segmentation based on mask feature aggregation and object enhancement according to claim 2, wherein the feature aggregation module in step S12 is composed of two parallel convolution branches, one of which consists of a 1 x 7 convolution and a 7 x 1 convolution connected in series; the other branch consists of a 7 x 1 convolution and a 1 x 7 convolution connected in series.
4. The method for video object segmentation based on mask feature aggregation and object enhancement according to claim 1, wherein the step S2 specifically includes the following steps:
S21, using the reference mask feature F^M generated by the mask encoder to generate a target attention map w^R; then multiplying the target attention map w^R with the reference value encoding feature v^R to obtain the target-enhanced reference value encoding feature v̂^R;
S22, according to the similarity between the query frame and the previous frame, transferring the target attention map w^R_{-1} corresponding to the previous frame onto the query frame to obtain the target attention map w^Q corresponding to the query frame; then multiplying w^Q with the query value encoding feature v^Q to obtain the target-enhanced query value encoding feature v̂^Q;
S23, using the target-enhanced query image key-value encoding pair (k^Q, v̂^Q) to retrieve the reference key-value encoding pair (k^R, v̂^R), and concatenating the retrieved feature with the query value encoding feature v̂^Q to obtain the final matching feature.
5. The method for video object segmentation based on mask feature aggregation and object enhancement according to claim 4, wherein the information retrieval process in step S23 specifically comprises: first, the similarity between the query key features and the reference key features is calculated and normalized along the reference-frame dimension; the normalized similarity is then used as weights for a weighted summation of the reference value features, and the result is concatenated with the query value encoding features, namely:
y_p = [ Σ_q σ(k^Q_p · k^R_q) v̂^R_q , v̂^Q_p ]
where p and q represent pixels in the query key encoding feature and the reference key encoding feature, respectively, [ ] represents concatenation, σ represents the Softmax function, and y is the output of the feature matching unit.
6. The method for video object segmentation based on mask feature aggregation and object enhancement according to claim 1, wherein the step S3 specifically includes the following steps:
s31, executing a training video clip generating unit by using a server to generate a training video clip with the length of T, wherein T is more than or equal to 2;
S32, using the server to execute the feature encoding unit to extract the query image key-value encoding pair (k^Q, v^Q), the reference key-value encoding pair (k^R, v^R) and the reference target mask feature F^M;
S33, using the server to execute the target-enhanced feature matching unit in step S2, and retrieving information from the reference key-value encoding pair (k^R, v^R) according to the query image key-value encoding pair (k^Q, v^Q) and the reference target mask feature F^M to obtain the final matching feature;
s34, a server is used for executing a decoding unit and outputting a final segmentation result of the query frame;
S35, performing network training with the server in an end-to-end manner; the mathematical expression of the segmentation loss function L is:
L(Y, M) = L_ce(Y, M) + α · L_IoU(Y, M)
where L_ce(Y, M) denotes the cross-entropy loss and L_IoU(Y, M) denotes the mask intersection-over-union (IoU) loss; Y denotes the target mask ground truth; M denotes the target mask prediction result; Ω denotes the set of all pixels in the target mask; T denotes the length of the training video clip; and α is a hyper-parameter;
and S36, optimizing the objective function by using the server to obtain the local optimal network parameters.
7. The method for video object segmentation based on mask feature aggregation and object enhancement according to claim 6, wherein the step S31 specifically includes the following steps:
s311, randomly extracting T images from any video of a plurality of video data sets at intervals;
and S312, respectively applying different affine transformations to the T images to form a training video clip, wherein the affine transformations comprise translation, scaling, flipping, rotation and shearing.
8. The method for video object segmentation based on mask feature aggregation and object enhancement according to claim 6, wherein the step S32 specifically comprises: for the reference set, performing feature extraction on the input reference frame images and the reference-frame target mask prediction results with a shared image feature encoder and a mask feature encoder, respectively; then passing the features of each stage of the mask feature encoder through a squeeze-excitation fusion module and adding them to the features of the corresponding stage of the image feature encoder; then injecting the added features back into the image feature encoder; finally, the image feature encoder outputs the reference key-value encoding pair (k^R, v^R) and the mask feature encoder outputs the mask feature F^M; the reference key-value encoding pair (k^R, v^R) is stored directly in memory; for the query frame, the image feature encoder is used directly to obtain the query image encoding feature pair (k^Q, v^Q).
9. The method for video object segmentation based on mask feature aggregation and object enhancement according to claim 6, wherein the step S34 specifically comprises: using several residual blocks as the decoder, taking as input the matching feature from step S33 and the query image encoding feature pair (k^Q, v^Q) from step S32 introduced through skip connections, performing 2× upsampling at each stage, and finally outputting the final segmentation result.
10. The method for video object segmentation based on mask feature aggregation and object enhancement according to claim 1, wherein the step S4 specifically includes the following steps:
s41, initializing a segmentation target, giving a mask of the target to be segmented in a first frame of a new video sequence, and initializing a reference set by using the first frame and the target mask thereof; the segmentation starts from a second frame of the video sequence;
S42, passing the current frame image and the reference set through the feature extraction unit to obtain the query image encoding feature pair (k^Q, v^Q) and the reference key-value encoding pair (k^R, v^R), with the mask feature encoder outputting the mask feature F^M;
S43, executing a target enhanced matching unit to obtain matching characteristics;
S44, inputting the matching feature from step S43 and the query image encoding feature pair (k^Q, v^Q) from step S42 into the decoding unit to obtain the current-frame target mask prediction result;
s45, putting the current frame and its target mask prediction result into the reference set every 5 frames.
CN202210569043.6A 2022-05-24 2022-05-24 Video target segmentation method based on mask feature aggregation and target enhancement Pending CN115035437A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210569043.6A CN115035437A (en) 2022-05-24 2022-05-24 Video target segmentation method based on mask feature aggregation and target enhancement

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210569043.6A CN115035437A (en) 2022-05-24 2022-05-24 Video target segmentation method based on mask feature aggregation and target enhancement

Publications (1)

Publication Number Publication Date
CN115035437A true CN115035437A (en) 2022-09-09

Family

ID=83121690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210569043.6A Pending CN115035437A (en) 2022-05-24 2022-05-24 Video target segmentation method based on mask feature aggregation and target enhancement

Country Status (1)

Country Link
CN (1) CN115035437A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116452600A (en) * 2023-06-15 2023-07-18 上海蜜度信息技术有限公司 Instance segmentation method, system, model training method, medium and electronic equipment
CN116452600B (en) * 2023-06-15 2023-10-03 上海蜜度信息技术有限公司 Instance segmentation method, system, model training method, medium and electronic equipment
CN116630869A (en) * 2023-07-26 2023-08-22 北京航空航天大学 Video target segmentation method
CN116630869B (en) * 2023-07-26 2023-11-07 北京航空航天大学 Video target segmentation method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination