CN110910391A - Video object segmentation method with dual-module neural network structure - Google Patents

Video object segmentation method with dual-module neural network structure

Info

Publication number
CN110910391A
CN110910391A (application CN201911125917.3A)
Authority
CN
China
Prior art keywords
frame
target
network
mask
image
Prior art date
Legal status
Granted
Application number
CN201911125917.3A
Other languages
Chinese (zh)
Other versions
CN110910391B (en)
Inventor
汪粼波
陈彬彬
方贤勇
Current Assignee
Anhui University
Original Assignee
Anhui University
Priority date
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN201911125917.3A priority Critical patent/CN110910391B/en
Publication of CN110910391A publication Critical patent/CN110910391A/en
Application granted granted Critical
Publication of CN110910391B publication Critical patent/CN110910391B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T7/10 Segmentation; Edge detection
    • G06T7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/20104 Interactive definition of region of interest [ROI]
    • G06T2207/20221 Image fusion; Image merging
    • Y02T10/40 Engine management systems

Abstract

The invention provides a video object segmentation method with a dual-module neural network structure, which addresses the problem that noise interference during video object segmentation leads to unsatisfactory segmentation results. The method comprises the following steps: the first frame image and the mask of the first frame are input into a transformation network to generate image pairs; a target proposal box is generated for each image pair to determine whether the image pair is a region of interest; the region of interest, together with a tracker, is input into the RoI segmentation network to train a learning model and output its prediction; the feature map output by the last convolutional layer of the RoI segmentation network is fed separately into a spatial attention module and a channel attention module; finally, the feature maps output by the two attention modules are fused, and the final segmentation mask is output after a convolutional layer operation. The invention obtains good segmentation results on the DAVIS video dataset.

Description

Video object segmentation method with dual-module neural network structure
Technical Field
The invention belongs to the field of computer vision and relates to the segmentation of video objects that undergo large scale changes and unpredictable dynamic appearance changes, and in particular to a video object segmentation method with a dual-module neural network structure.
Background
With the rapid development of computer vision technology in recent years, convolutional neural networks in deep learning have attracted great attention in many research fields, and video object segmentation has become an important topic for researchers. Video segmentation techniques are increasingly showing their importance: they are applied to scene understanding, video annotation, autonomous vehicles, object detection and the like, and are developing rapidly. It can be said that progress in video segmentation drives the development of computer vision technology as a whole. However, video object segmentation is not only a research hotspot but also a research difficulty. The goal of segmentation is to find an accurate position for each object in a video, but the process is limited by many factors, such as motion speed, object deformation, occlusion between instances and cluttered backgrounds, which may come from different camera devices and different scenes. This makes video object segmentation very challenging, and results on real-world scenes are still poor; such images undoubtedly pose a great challenge to video object segmentation techniques.
In recent years, many scholars have conducted extensive research on video segmentation and achieved good academic results. For unsupervised video object segmentation, the unsupervised methods mainly segment moving objects from the background without any prior knowledge of the target; they aim to automatically find and separate salient objects from the background. These methods are based on probabilistic models, motion and object proposals. Existing methods typically rely on visual cues (such as superpixels, saliency maps or optical flow) to acquire initial object regions and require processing the entire video in batch mode to produce the segmentation. Furthermore, generating and processing thousands of candidate regions in each frame usually consumes a significant amount of time. These unsupervised methods cannot segment a particular object when the motion of different instances is confused with a dynamic background. Many semi-supervised video object segmentation methods rely on fine-tuning with the first-frame ground truth: a convolutional network is trained for foreground and background segmentation and adapted to the first frame of the target video under test (for example, online adaptation mechanisms and the semantic information of instance segmentation networks). The first frame provides key visual cues for the target, so these methods can handle multi-instance cases and generally perform better than unsupervised methods. However, many semi-supervised approaches rely heavily on the segmentation mask of the first frame: they generally use the first frame for data enhancement, and their model adaptation depends heavily on a fine-tuned model, so efficient segmentation cannot be achieved for videos with complex backgrounds, occlusion, fast motion or camera shake.
Disclosure of Invention
In view of the problems of existing video segmentation methods, the invention provides a video segmentation method based on the spatial and channel information of a dual attention module structure. Compared with the prior art, the method can flexibly exploit the spatial and channel information in the feature map, simplifies the amount of computation in the optimization process, and greatly improves the accuracy of video target object segmentation.
The purpose of the invention is as follows: the invention aims to overcome the defects of existing video object segmentation methods and provides a video object segmentation method with a dual-module neural network structure to solve several problems in video object segmentation.
The technical scheme is as follows: the invention discloses a video object segmentation method with a dual-module neural network structure, which ensures a sufficient amount of training data in the near-target domain and customizes the training data for the pixel-level video object segmentation scene.
First, the first frame and its annotation mask are input into a transformation network to generate possible future image pairs, which solves the problem of the extra processing time required for data preparation and training-data augmentation at an earlier stage. A training set of reasonably realistic images is generated, capturing the expected trajectory and appearance that the target may have in future video frames. Second, the image pairs are passed to a target proposal operation, which determines the candidate regions of interest; this screening removes unneeded image pairs so that the RoI segmentation network avoids unnecessary computational overhead. Then, the region of interest, together with a tracker, is input into the RoI segmentation network to train a model and output segmentation results. Interference while tracking the target across video frames makes tracking-based segmentation inaccurate. Finally, a dual attention module method is designed: the spatial attention module captures spatial dependencies between any two spatial locations, and the channel attention module captures channel dependencies between any two channel maps. The outputs of the two attention modules are fused, which enhances the discriminative power of the feature representation in video object segmentation and suppresses interference and noise in video segmentation. A convolutional layer operation is then applied and the final segmentation mask is output. The experimental results shown in FIG. 4 and FIG. 5 demonstrate that the method achieves effective segmentation. The method specifically comprises the following steps:
Step 1, the first frame of the video is denoted I_0 and the mask of the first frame is denoted M_0. The known first frame I_0 and first-frame mask M_0 are input into a transformation network, through which a number of different image pairs can be generated. An image pair is an image and its corresponding mask. The transformation network applies operations such as rotation, translation, flipping and scaling. The different images serve as mask training data for objects that may appear in future video frames; the dataset is taken from the DAVIS public video image segmentation dataset. The method of the invention processes the data using the video frames and their corresponding auxiliary masks, obtaining a large number of image pairs that compensate for the shortage of video training data. Enough data can therefore be obtained for training, and the video result can be predicted accurately.
Step 2, from the first frame I_0 of step 1 and the first-frame mask M_0, different image pairs are generated by the transformation network, and the regions of interest are obtained using the target proposal. The target proposal, a typical fully convolutional network, takes an image of arbitrary size as input and outputs a set of rectangular target proposal boxes. A target proposal is generated around the target in the first frame; target proposals are also generated at random around the target in the images produced by the transformation network in step 1, and an IoU score is computed between an image's target proposal and the first-frame target proposal, or between an image's mask and the initial mask. Image pairs whose IoU ratio exceeds 0.35 are selected and referred to as regions of interest (RoI). The initial mask is the first-frame mask M_0. IoU, the intersection over union, is the ratio of the intersection to the union of the predicted region and the actual region. A tracker is then added to the region of interest, which is an effective way to locate the target in the next frame. The tracker takes the current-frame mask and the next-frame image as input and predicts the position of the target mask in the next frame. The tracker is used to acquire the mask region of the next frame image, providing temporal consistency for the regions of interest of subsequent frames.
Step 3, once the region of interest is located in the next frame, the region of interest together with the tracker is input into the RoI segmentation network (RoISeg) of the invention to train the prediction of the target. The RoI segmentation network RoISeg is based on a deep convolutional neural network (CNN); the network framework is built on the ResNet101 backbone network and is called the RoI segmentation network, denoted RoISeg hereinafter. CNN refers to a convolutional neural network in deep learning. ResNet101 is a deep residual learning framework that addresses the accuracy degradation problem and has low training and test errors. The region of interest with the added tracker is input into RoISeg to train the model, and the output is a rough target recognition position and segmentation mask.
Step 4, the target result predicted by RoISeg from the region of interest and tracker of step 3 still has a large error; to reduce the noisy segmentation, the invention constructs a "dual attention module" method. The feature map output by the last convolutional layer of RoISeg is input into the dual attention module, which consists of a spatial attention module and a channel attention module.
The spatial attention module introduces a spatial attention mechanism to capture the spatial dependency between any two spatial positions. The spatial attention mechanism is a set of function operations in the spatial attention module. For the target location features in a frame, each feature is updated by aggregating features with a weighted sum, where the weights are determined by the feature similarity between the corresponding two positions. In other words, any two positions with similar features promote each other's refinement regardless of their distance in the spatial dimension.
The channel attention module captures the channel dependencies between any two channel maps through a channel attention mechanism and updates each channel map with a weighted sum of all channel maps. The channel attention mechanism is a set of function operations in the channel attention module.
Finally, the two attention modules are fused. The fusion operation is a parallel strategy that combines the two feature vectors into a composite vector. Fusing them enriches the information between consecutive frames of the target object, yielding better features for video object segmentation. Capturing the dependencies between the spatial-dimension information and the channel-dimension information in the dual attention model enhances the discriminative power of the feature representation in video object segmentation and suppresses interference and noise in video segmentation. After one more convolutional layer, the final segmentation mask is output.
The detailed specific steps are as follows:
Step 1, a video is input into the computer; each frame of the video is a picture. The picture is in RGB format and denoted RGB image I. The target label in this image is denoted mask M. The mask is the binary foreground/background of the image.
First, a video and the mask of its first frame are input: the first frame I_0 and the first mask M_0 are fed into the transformation network G, producing a large number of transformed image pairs D. The specific expression is as follows:
D_n = G(I_0, M_0)
where G represents the transformation network, which applies operations such as rotation, translation, flipping and scaling. D_n = {d_1 m_1, d_2 m_2, ..., d_n m_n}, where D_n denotes that there are n image pairs, d_i m_i denotes the i-th image pair, d_i is the image generated by the i-th transformation and m_i is the mask generated by the i-th transformation. The image pairs generated by the transformation network are then screened to decide whether each pair serves as a region of interest.
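The patent does not give code for the transformation network G; the following PyTorch-style sketch shows one plausible way to generate image/mask pairs by random rotation, translation, flipping and scaling, applying the same random transform to the image and its mask. The function names, parameter ranges and n_pairs value are illustrative assumptions, not part of the original disclosure.

```python
# Hypothetical sketch of the transformation network G: jointly augments the
# first frame I_0 and its mask M_0 with rotation, translation, flip and scale.
import random
import torchvision.transforms.functional as TF

def random_pair_transform(image, mask):
    # Sample one set of parameters and apply it to both the image and the mask
    angle = random.uniform(-30, 30)                      # rotation in degrees (assumed range)
    translate = [random.randint(-20, 20), random.randint(-20, 20)]
    scale = random.uniform(0.8, 1.2)
    flip = random.random() < 0.5
    image = TF.affine(image, angle=angle, translate=translate, scale=scale, shear=0)
    mask = TF.affine(mask, angle=angle, translate=translate, scale=scale, shear=0)
    if flip:
        image, mask = TF.hflip(image), TF.hflip(mask)
    return image, mask

def generate_image_pairs(I0, M0, n_pairs=100):
    """D_n = G(I_0, M_0): return n transformed (image, mask) pairs."""
    return [random_pair_transform(I0, M0) for _ in range(n_pairs)]
```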
The specific steps of step 2 are as follows:
the image pair generated by the transformation network is screened to determine whether the image pair is a sensitive region. The perceived area is obtained using the goal proposition. The goal-proposition, which is a typical full convolution network, takes an arbitrary size of an image as input and outputs a set of rectangular goal-proposition boxes. The target proposal operation is performed around the target in the first frame and noted gtboxSaid gtboxIs the bounding box of the real marker around the object of the first frame. A bounding box generated by the image for the target proposing operation around the target is marked as bboxSaid b isboxThe image pair is input into the goal offer and the goal offer box in the image pair is output, as shown at the mark 5 in fig. 2. Resulting in an image target proposal and first frame target proposal ratio IoU score. The specific expression is as follows:
S = IoU(b_box, gt_box)
where IoU is the intersection-over-union function, and the score S is the IoU between the target proposal box of the image pair and the target proposal box of the first frame. Image pairs with ratio S > 0.75 are identified as regions of interest. A tracker is then added to the region of interest, which is an effective way to locate the target in the next frame. The tracker takes the current-frame mask and the next-frame image as input and predicts the position of the target mask in the next frame. The tracker is used to acquire the mask region of the next frame image, providing temporal consistency for the regions of interest of subsequent frames. A video sequence R = {I_0, I_1, I_2, I_3, ..., I_t, ..., I_n} and the mask M_0 of the first frame I_0 are known, where I_t is the t-th frame in the video sequence and t ∈ {1, 2, 3, ..., n}. The masks of the remaining frames in the video sequence, {M_1, M_2, M_3, ..., M_n}, are obtained according to the tracker function expression:
M_{t+1} = f(I_{t+1}, M_t)
where f denotes the tracker function, I_{t+1} is the known image of the (t+1)-th frame, M_t is the known mask of the t-th frame image, and M_{t+1}, the mask of the (t+1)-th frame, is computed. The mask of the second frame is obtained by the tracker from the first-frame image and mask. Since the target moves smoothly in space, consecutive video frames change little from one to the next and are relatively correlated. From the mask M_t and the frame I_{t+1}, the mask M_{t+1} of frame I_{t+1} is predicted. The predicted mask of frame I_{t+1} still has a large error with respect to the real mask M_gt, where M_gt denotes the ground-truth mask. The region of interest with the added tracker is then input into the RoI segmentation network.
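As an illustration of the RoI screening and tracker propagation described above, the sketch below computes the IoU score S = IoU(b_box, gt_box) and propagates masks frame by frame with M_{t+1} = f(I_{t+1}, M_t). The tracker and proposal network are treated as black-box callables, and all names here are illustrative assumptions rather than interfaces fixed by the patent.

```python
# Minimal sketch (assumed interfaces): IoU screening of proposals and
# tracker-based mask propagation M_{t+1} = f(I_{t+1}, M_t).
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2)
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def select_rois(image_pairs, proposal_net, gt_box, thresh=0.75):
    """Keep image pairs whose proposal box has IoU > thresh with gt_box."""
    rois = []
    for image, mask in image_pairs:
        b_box = proposal_net(image)        # assumed: returns one proposal box per image
        if iou(b_box, gt_box) > thresh:
            rois.append((image, mask))
    return rois

def propagate_masks(frames, first_mask, tracker):
    """Masks for frames 1..n from M_0 via M_{t+1} = f(I_{t+1}, M_t)."""
    masks = [first_mask]
    for frame in frames[1:]:
        masks.append(tracker(frame, masks[-1]))
    return masks
```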
The specific steps of step 3 are as follows:
and inputting the information into the feeling segmentation network through the feeling region adding tracker based on the step 2. The perceptive segmentation network (RoISeg) in the present invention is trained to predict targets by inputting the perceptive region addition tracker. The Hoxist segmentation network RoISeg is based on a deep convolutional neural network CNN, and a network framework is innovated on the basis of a ResNet101 framework network and is called as a Hoxist segmentation network. The ResNet101 frame network is a network with a deep residual learning frame to solve the problem of accuracy reduction, and has a low training error and test error network. The RoISeg network is composed of convolution layers, pooling layers, activation functions, batch normalization, deconvolution and the like. With some initial parameter settings in RoISeg. The learning rate is 0.0001 and the weight attenuation term is 0.005. The RoISeg final output is constrained using a weighted cross-entropy penalty. The crossover loss expression is as follows:
L(θ) = -β Σ_{j∈X+} log P(y_j = 1; θ) - (1 - β) Σ_{j∈X-} log P(y_j = 0; θ)
where L(θ) denotes the weighted cross-entropy loss and θ, with value range [0, 1], is the weight parameter associated with the current prediction in the network. X+ and X- are the sets of positive and negative target samples, respectively, and β is a weighting term that penalizes biased sampling during training. The activation output of the convolutional layer yields the probability P, P ∈ [0, 1], computed with the commonly used nonlinear activation function Sigmoid, whose value range [0, 1] matches that of the probability. The training output layer of the RoI segmentation network is constrained by the cross-entropy loss, and the network is trained continuously by back-propagation; as the loss gradually decreases during training, convergence becomes sufficiently small and stable, and the target segmentation result is output. The output result is a segmentation map of the mask's foreground and background.
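A minimal PyTorch-style sketch of the class-balanced weighted cross-entropy described above, assuming the loss takes the reconstructed form with β weighting positive pixels and (1 - β) weighting negative pixels; estimating β from the label statistics of each batch is an assumption, not a detail stated by the patent.

```python
import torch

def weighted_cross_entropy(logits, target, eps=1e-6):
    """Class-balanced BCE: beta weights positives (X+), 1 - beta weights negatives (X-).

    logits: raw convolutional output, shape (N, 1, H, W)
    target: binary ground-truth mask, same shape, values in {0, 1}
    """
    prob = torch.sigmoid(logits)                   # P in [0, 1]
    pos = target.sum()
    neg = target.numel() - pos
    beta = neg / (pos + neg + eps)                 # assumed: balance by class frequency
    loss_pos = -beta * (target * torch.log(prob + eps))
    loss_neg = -(1.0 - beta) * ((1.0 - target) * torch.log(1.0 - prob + eps))
    return (loss_pos + loss_neg).mean()
```

With the learning rate and weight decay given above, this loss could be minimized with, for example, torch.optim.Adam(model.parameters(), lr=0.0001, weight_decay=0.005); the patent does not name the optimizer.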
The target result predicted by the RoISeg network in step 3 still has a large error; to reduce the noisy segmentation, the invention constructs a "dual attention module" method. The feature map output by the last convolutional layer of RoISeg is input into the two attention modules separately: a spatial attention module and a channel attention module.
Spatial attention module: a spatial attention mechanism is introduced to enrich the contextual feature dependencies for the target in the video frame; its operations are described in detail here. The spatial attention module is shown at mark 11 in FIG. 3. The convolutional-layer output feature map from RoISeg is denoted A, A ∈ R^{C×H×W}, where R denotes the set of real numbers, the shape of A is C×H×W, C is the number of channels, H the height and W the width. First, three new feature maps B, D and F are generated from the shared feature map A, where {B, D, F} ∈ R^{C×H×W}. Their shapes are then changed to R^{C×N}, where N = H×W is the product of height and width. Matrix multiplication is performed between the transpose of B and D, and a softmax layer is applied to compute the spatial attention feature map S ∈ R^{N×N}. The specific expression is as follows:
S_ji = exp(B_i · D_j) / Σ_{k=1..N} exp(B_k · D_j)
where S_ji measures the influence of the i-th spatial position on the j-th spatial position; the exponential acts on a measure between two positions, and the more similar the features of two positions, the greater their mutual correlation. As stated above, this captures the spatial dependency between any two spatial positions: the more similar the features of two positions, the more they contribute to each other's representation. The shape of F is R^{C×N}. Matrix multiplication is then performed between F and the transpose of S; the resulting feature map of shape R^{C×N} is reshaped to R^{C×H×W}. Finally, it is multiplied by a scale parameter α, and an element-wise sum is performed with feature A to obtain the output feature map E_1. The specific expression is as follows:
E_1j = α Σ_{i=1..N} (S_ji F_i) + A_j
where α is a weight coefficient initialized to 0, α ∈ [0, 1], and is gradually assigned more weight. The resulting feature map E_1 of the sum operation has shape E_1 ∈ R^{C×H×W}. For a target feature position in the video frame, the feature is updated by aggregating positions with a weighted sum, where the weights are determined by the feature similarity between the corresponding two positions. In other words, any two positions with similar features promote each other's refinement regardless of their distance in the spatial dimension. Contextual feature representations are selectively aggregated according to the spatial attention map, which improves the information interdependence within the same class.
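The following PyTorch sketch shows one way the spatial attention module described above could be implemented: three 1×1 convolutions producing B, D and F, a softmax over the N = H×W positions, and a learnable scale α initialized to 0. Obtaining B, D and F by 1×1 convolutions is an assumption; the patent only says they are generated from the shared feature map A.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Position attention: position j aggregates F over all positions i with
    weights softmax_i(B_j . D_i); output E_1 = alpha * aggregate + A."""
    def __init__(self, channels):
        super().__init__()
        # Assumed: B, D, F are produced from A by 1x1 convolutions
        self.conv_b = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_d = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_f = nn.Conv2d(channels, channels, kernel_size=1)
        self.alpha = nn.Parameter(torch.zeros(1))      # scale, initialized to 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):                              # a: (batch, C, H, W)
        n, c, h, w = a.size()
        b = self.conv_b(a).view(n, c, h * w).permute(0, 2, 1)    # (batch, HW, C)
        d = self.conv_d(a).view(n, c, h * w)                     # (batch, C, HW)
        s = self.softmax(torch.bmm(b, d))                        # (batch, HW, HW) attention
        f = self.conv_f(a).view(n, c, h * w)                     # (batch, C, HW)
        out = torch.bmm(f, s.permute(0, 2, 1)).view(n, c, h, w)  # weighted aggregation
        return self.alpha * out + a                              # element-wise sum with A
```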
Channel attention module: the channel dependencies between any two channel maps are captured through the operations of a channel attention mechanism. The convolutional-layer output feature map from RoISeg is again denoted A, A ∈ R^{C×H×W}, where R denotes the set of real numbers, the shape of A is C×H×W, C is the number of channels, H the height and W the width. Two new feature maps M and N are generated from the shared feature map A, where {M, N} ∈ R^{C×H×W}. Their shapes are then changed to R^{C×N}. Matrix multiplication is performed between M and the transpose of N, directly computing the channel feature map X ∈ R^{C×C}. A softmax layer is applied to obtain the channel attention feature map X ∈ R^{C×C}. The specific expression is as follows:
X_ji = exp(M_i · N_j) / Σ_{k=1..C} exp(M_k · N_j)
where X_ji measures the influence of the i-th channel on the j-th channel; as mentioned above, the channel attention module captures the channel dependency between any two channel maps. In addition, matrix multiplication is performed between X and the feature map A reshaped to R^{C×N}; the result of the matrix multiplication is reshaped to R^{C×H×W}, multiplied by a scale weight parameter β, and an element-wise sum is performed with A to obtain the output feature map E_2. The specific expression is as follows:
E_2j = β Σ_{i=1..C} (X_ji A_i) + A_j
where β is a weight coefficient initialized to 0.3, β ∈ [0, 1]. The resulting feature map E_2 of the sum operation has shape E_2 ∈ R^{C×H×W}. The channel dependencies between feature-map channel maps are modelled, which helps improve the discriminability of the model's features. The channel target features enhanced by the channel attention module are more prominent, so the network can identify the target in the video frame.
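A corresponding PyTorch sketch of the channel attention module: a channel-by-channel similarity X ∈ R^{C×C}, a softmax, and a learnable scale β. Taking M and N directly as reshaped copies of A is one reasonable reading of the description; initializing β to 0.3 follows the text, while treating it as a learnable parameter is an assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: channel i aggregates A over all channels j with
    weights softmax_j(M_i . N_j); output E_2 = beta * aggregate + A."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.full((1,), 0.3))   # scale, initialized to 0.3
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):                                 # a: (batch, C, H, W)
        n, c, h, w = a.size()
        m = a.view(n, c, h * w)                           # M reshaped to (C, HW)
        nt = a.view(n, c, h * w).permute(0, 2, 1)         # N transposed, (HW, C)
        x = self.softmax(torch.bmm(m, nt))                # (C, C) channel attention map
        out = torch.bmm(x, a.view(n, c, h * w)).view(n, c, h, w)
        return self.beta * out + a                        # element-wise sum with A
```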
The two attention modules are then fused. The fusion operation combines the two feature vectors into a composite vector. The feature map E_1 output by the spatial attention module and the feature map E_2 output by the channel attention module are fused to obtain a new feature map O. The specific expression is as follows:
O = f(E_1, E_2)
where O is the output of the fused feature maps, with size O ∈ R^{C×H×W}, and the function f denotes the fusion operation. The feature map E_1 has size E_1 ∈ R^{C×H×W} and E_2 has size E_2 ∈ R^{C×H×W}. After fusion, the feature information between consecutive frames of the target object becomes more evident, yielding better features for video target object segmentation.
Dependencies are captured through feature fusion between the spatial-dimension information and the channel-dimension information of the attention modules, making full use of the contextual feature information between space and channels. Specifically, the convolutional-layer output of the RoI segmentation network is fed into each of the two attention modules. Through their respective attention mechanisms, the spatial attention module obtains salient spatial information features and the channel attention module obtains salient channel information features. Fusing the features of the two attention modules enhances the discriminative power of the feature representation in video object segmentation and suppresses interference and noise in video segmentation. A convolutional layer operation is then applied and the final segmentation mask is output. The experimental results shown in FIG. 4 and FIG. 5 demonstrate that the video object segmentation method achieves effective results.
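The fusion f(E_1, E_2) and the final convolution could be realized, for example, by an element-wise sum of the two attention outputs followed by a 1×1 convolution producing the mask logits, reusing the SpatialAttention and ChannelAttention classes sketched above. This concrete choice (sum fusion, 1×1 output convolution) is an assumption consistent with the parallel fusion strategy described here, not a detail fixed by the patent.

```python
import torch.nn as nn

class DualAttentionHead(nn.Module):
    """Fuse spatial and channel attention outputs, then predict the mask."""
    def __init__(self, channels, num_classes=1):
        super().__init__()
        self.spatial = SpatialAttention(channels)
        self.channel = ChannelAttention()
        # Assumed: fusion by element-wise sum, then a 1x1 conv to mask logits
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, a):                 # a: last RoISeg convolutional feature map
        e1 = self.spatial(a)              # E_1
        e2 = self.channel(a)              # E_2
        o = e1 + e2                       # O = f(E_1, E_2), parallel fusion
        return self.classifier(o)         # segmentation mask logits
```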
Advantageous technical effects
The invention provides a convolutional neural network video object segmentation method with a dual attention module, which addresses insufficient data, high processing overhead, complex backgrounds, fast motion, jitter, oscillation and other interference during video object segmentation. To overcome these interferences, the invention designs a transformation network, a region of interest with an added tracker, and a dual attention module. The transformation network addresses the poor generalization of the learned model caused by insufficient data during network training. The region of interest is determined by the target proposal, and a tracker added to the region of interest predicts the position where the target may appear in the next frame; this addresses the jitter and oscillation caused by fast motion or camera movement and finds the possible positions of the target. The region of interest with the added tracker is input into the RoI segmentation network (RoISeg) designed by the invention to train the model and output segmentation results. Interference while tracking the target across video frames makes tracking-based segmentation inaccurate; a dual attention module method is designed for this purpose. The spatial attention module captures spatial dependencies between any two spatial locations and the channel attention module captures channel dependencies between any two channel maps. The feature maps output by the two attention modules are fused, which enhances the discriminative power of the feature representation in video object segmentation and suppresses interference and noise in video segmentation. A convolutional layer operation is then applied and the final segmentation mask is output.
Drawings
FIG. 1 is a basic flow diagram of the process of the present invention
FIG. 2 is a diagram of a network architecture of the present invention
FIG. 3 is a diagram of the network structure of the dual attention module
FIG. 4 and FIG. 5 are graphs of experimental results
In FIG. 2, reference numeral 1 denotes the first frame image; 2 is the mask corresponding to the first frame image; 3 is the transformation network operation; 4 is the image pairs generated by the transformation network; 5 is the target proposal generation used to determine the regions of interest; 6 is the RoISeg network framework of the invention; 7 is the feature map output by the RoISeg network; 8 and 9 denote the channel attention module; 11 is the spatial attention module; 12 is the fusion of the output feature maps; and 13 is the final experimental segmentation result.
Detailed description of the invention
Technical features of the present invention will now be described in detail with reference to the accompanying drawings.
Referring to FIG. 1, a video object segmentation method with a dual-module neural network structure, which collects and processes data for pixel-level video object segmentation scenes to ensure a sufficient amount of training data in the near-target domain, comprises:
the first frame of the video and the annotation mask thereof are obtained and used for generating the mask of the future video frame and generating a training set of reasonable and vivid images, and further the expected appearance change in the future video frame is obtained to obtain the approximate target domain.
Furthermore, an attention module mechanism is introduced to capture the target feature dependencies in a spatial attention module and a channel attention module, respectively. The attention module mechanism adds two parallel modules to the architecture of the expanded fully convolutional neural network: one is a spatial position dimension module and the other is a channel information dimension module. After the two parallel modules are processed, the spatial position dimension module obtains accurate position information dependencies, and the channel dimension module obtains the dependencies between channel maps.
Finally, the output feature maps from the two dimension modules are fused to obtain a better feature representation for pixel-level prediction, and the segmentation result is output after a convolutional layer. The segmentation result consists of 1s and 0s for foreground and background, respectively.
Further, the method of the invention comprises the following steps:
Step 1, in the video, the first frame is denoted I_0 and the mask of the first frame image is denoted M_0. The known first frame I_0 and first-frame mask M_0 are input into the transformation network, and image pairs are generated through the transformation network. An image pair is an image and a mask. The transformation network is a network that contains rotation, translation, flipping and/or scaling operations. The image pairs address the data shortage when video frames are fed into the network to train the model. In this step, the input video comes from a dataset, which may be the DAVIS public video image segmentation dataset. The method of the invention processes the data using the video frames and their corresponding auxiliary masks, obtaining a large number of image pairs that compensate for the shortage of video training data. Enough data can therefore be obtained for training, and the video result can be predicted accurately.
Step 2, from the pixels of the first frame image I_0 of step 1 and the first-frame mask M_0, more than one set of image pairs is generated by the transformation network; the image pairs are not identical, and the regions of interest are obtained through the target proposal.
The target proposal is a fully convolutional network that takes an image of arbitrary size as input and outputs a set of rectangular image target proposal boxes. The target proposal obtains the region of interest by scoring the candidate boxes. The specific steps are as follows:
A target proposal is generated around the target in the first frame image I_0.
IoU is obtained in the following two ways. IoU, the intersection over union, is the ratio of the intersection to the union of the predicted region and the actual region.
First: target proposals are randomly generated around the target in the images produced by the transformation network in step 1, and the IoU between a generated image target proposal and the first frame image I_0 target proposal is obtained.
Second: the IoU score between an image mask and the initial mask is obtained. The initial mask is the first-frame mask M_0.
Image pairs with an IoU score greater than 0.75 are selected and referred to as regions of interest (RoI).
A tracker is then added to the region of interest, which effectively locates the target in the next frame. The tracker takes the current-frame mask and the next-frame image as input and predicts the position of the target mask in the next frame. The tracker is used to acquire the mask region of the next frame image, providing temporal consistency for the regions of interest of subsequent frames.
Step 3, once the region of interest is located in the next frame, the region of interest together with the tracker is input into the RoI segmentation network (RoISeg) of the invention to train the prediction of the target. RoISeg is based on a deep convolutional neural network (CNN); the network framework is built on the ResNet101 backbone network and is referred to as RoISeg for short. CNN refers to a convolutional neural network in deep learning. ResNet101 is a deep residual learning framework that addresses the accuracy degradation problem and has low training and test errors. The region of interest with the added tracker is input into RoISeg to train the model, and the output is a rough target recognition position and segmentation mask.
Step 4, the target result predicted by RoISeg from the region of interest and tracker of step 3 still has a large error; to reduce the noisy segmentation, the invention constructs a dual attention module method: the feature map output by the last convolutional layer of RoISeg is input into the dual attention module. The dual attention module includes a spatial attention module and a channel attention module, detailed in FIG. 1 and FIG. 3.
The spatial attention module introduces a spatial attention mechanism to capture the spatial dependency between any two spatial positions. The spatial attention mechanism is a function operation in the spatial attention module. For the target location features in a frame, each feature is updated by aggregating features with a weighted sum, where the weights are determined by the feature similarity between the corresponding two positions. In other words, any two positions with similar features promote each other's refinement regardless of their distance in the spatial dimension.
The channel attention module captures the channel dependencies between any two channel maps through a channel attention mechanism and updates each channel map with a weighted sum of all channel maps. The channel attention mechanism is a function operation in the channel attention module.
Finally, the two attention modules are fused. The fusion operation is a parallel strategy that combines the two feature vectors into a composite vector. Fusing them enriches the information between consecutive frames of the target object, yielding better features for video object segmentation. Capturing the dependencies between the spatial-dimension information and the channel-dimension information in the dual attention model enhances the discriminative power of the feature representation in video object segmentation and suppresses interference and noise in video segmentation. After one more convolutional layer, the final segmentation mask is output.
Further, in step 1, a video is input into the computer; each frame of the video is a picture. The picture is in RGB format and denoted RGB image I. The target label in this image is denoted mask M. The mask is the binary foreground/background of the image.
First, a video and the mask of its first frame are input: the first frame I_0 and the first mask M_0 are fed into the transformation network G to obtain the transformed image pairs D. The specific expression is as follows:
D_n = G(I_0, M_0)
where G denotes the transformation network. The set of image pairs D_n = {d_1 m_1, d_2 m_2, ..., d_n m_n}, where D_n denotes that there are n image pairs, d_i m_i denotes the i-th image pair, d_i is the image generated by the i-th transformation and m_i is the mask generated by the i-th transformation. The image pairs generated by the transformation network are then screened to decide whether each pair serves as a region of interest.
Further, the specific steps of step 2 are:
the image pair generated by the transformation network is screened to determine whether the image pair is a sensitive region. The perceived area is obtained using the goal proposition. The target proposal, which is a typical full convolution network, inputs images of arbitrary size and outputs a set of image target rectangular proposal boxes. The target proposal operation is performed around the target in the first frame and noted gtboxSaid gtboxIs the bounding box of the real marker around the object of the first frame. A bounding box generated by the image for the target proposing operation around the target is marked as bboxSaid b isboxThe image pair is input into the goal offer and the goal offer box in the image pair is output, as shown at the mark 5 in fig. 2. Resulting in an image target proposal and first frame target proposal ratio IoU score. The specific expression is as follows:
S = IoU(b_box, gt_box)
where IoU is the intersection-over-union function, and the score S is the IoU between the target proposal box of the image pair and the target proposal box of the first frame. Image pairs with ratio S > 0.75 are identified as regions of interest. A tracker is then added to the region of interest, which is an effective way to locate the target in the next frame. The tracker takes the current-frame mask and the next-frame image as input and predicts the position of the target mask in the next frame. The tracker is used to acquire the mask region of the next frame image, providing temporal consistency for the regions of interest of subsequent frames. A video sequence R = {I_0, I_1, I_2, I_3, ..., I_t, ..., I_n} and the mask M_0 of the first frame I_0 are known, where I_t is the t-th frame in the video sequence and t ∈ {1, 2, 3, ..., n}. The masks of the remaining frames in the video sequence, {M_1, M_2, M_3, ..., M_n}, are obtained according to the tracker function expression:
M_{t+1} = f(I_{t+1}, M_t)
where f denotes the tracker function, I_{t+1} is the known image of the (t+1)-th frame, M_t is the known mask of the t-th frame image, and M_{t+1}, the mask of the (t+1)-th frame, is computed. The mask of the second frame is obtained by the tracker from the first-frame image and mask. Since the target moves smoothly in space, consecutive video frames change little from one to the next and are relatively correlated. From the mask M_t and the frame I_{t+1}, the mask M_{t+1} of frame I_{t+1} is predicted. The predicted mask of frame I_{t+1} still has a large error with respect to the real mask M_gt, where M_gt denotes the ground-truth mask. The region of interest with the added tracker is then input into the RoI segmentation network.
Further, the specific steps of step 3 are:
and inputting the information into the feeling segmentation network through the feeling region adding tracker based on the step 2. The perceptive segmentation network (RoISeg) in the present invention is trained to predict targets by inputting the perceptive region addition tracker. The Hoxist segmentation network RoISeg is based on a deep convolutional neural network CNN, and a network framework is innovated on the basis of a ResNet101 framework network and is called as a Hoxist segmentation network. The ResNet101 frame network is a network with a deep residual learning frame to solve the problem of accuracy reduction, and has a low training error and test error network. The RoISeg network is composed of convolution layers, pooling layers, activation functions, batch normalization, deconvolution and the like. Wherein the initial parameters in the RoISeg are set as: the learning rate is 0.0001 and the weight attenuation term is 0.005. The RoISeg final output is constrained using a weighted cross-entropy penalty. The crossover loss expression is as follows:
L(θ) = -β Σ_{j∈X+} log P(y_j = 1; θ) - (1 - β) Σ_{j∈X-} log P(y_j = 0; θ)
where L(θ) denotes the weighted cross-entropy loss and θ, with value range [0, 1], is the weight parameter associated with the current prediction in the network. X+ and X- are the sets of positive and negative target samples, respectively, and β is a weighting term that penalizes biased sampling during training. The activation output of the convolutional layer yields the probability P, P ∈ [0, 1], computed with the commonly used nonlinear activation function Sigmoid, whose value range [0, 1] matches that of the probability. The training output layer of the RoI segmentation network is constrained by the cross-entropy loss, and the network is trained continuously by back-propagation; as the loss gradually decreases during training, convergence becomes sufficiently small and stable, and the target segmentation result is output. The output result is a segmentation map of the mask's foreground and background.
Further, the specific steps of step 4 are:
the output target result is predicted by the RoISeg network in step 3 to have a large error in order to reduce the noise-divided portion. Thus, the present invention constructs a "dual module of interest" approach. The signature graph output at the last convolutional layer of the RoISeg is input to two attention modules, respectively. Two focus modules are divided: the system comprises a space attention module and a channel attention module, and specifically comprises the following steps:
a spatial attention module: introducing a spatial attention mechanism to enrich dependency of contextual features for targets in video framesIs described. The operation of the spatial attention mechanism is introduced for detailed description. The space focus module in fig. 3 shows s. The convolution layer output characteristic diagram from RoISeg is marked as A, and A belongs to RC×H×WR represents a set, the shape size of A is C × H × W, C represents the number of channels, H represents the height, and W represents the width. First, three new feature maps B, D and F are generated from feature map A sharing, respectively, where { B, D, F }. epsilon.RC ×H×W. Then re-change their shape and size to RC×NWhere N ═ H × W, N denotes the product of height and width. And then B performs matrix transposition and D performs matrix multiplication, and applies a softmax layer to calculate a spatial dimension information attention feature map S e RN×NThe specific expression is as follows:
S_ji = exp(B_i · D_j) / Σ_{k=1..N} exp(B_k · D_j)
where S_ji measures the influence of the i-th spatial position on the j-th spatial position; the exponential acts on a measure between two positions, and the more similar the features of two positions, the greater their mutual correlation. As stated above, this captures the spatial dependency between any two spatial positions: the more similar the features of two positions, the more they contribute to each other's representation. The shape of F is R^{C×N}. Matrix multiplication is then performed between F and the transpose of S; the resulting feature map of shape R^{C×N} is reshaped to R^{C×H×W}. Finally, it is multiplied by a scale parameter α, and an element-wise sum is performed with feature A to obtain the output feature map E_1. The specific expression is as follows:
E_1j = α Σ_{i=1..N} (S_ji F_i) + A_j
where α is a weight coefficient initialized to 0, α ∈ [0, 1], and is gradually assigned more weight. The resulting feature map E_1 of the sum operation has shape E_1 ∈ R^{C×H×W}. For a target feature position in the video frame, the feature is updated by aggregating positions with a weighted sum, where the weights are determined by the feature similarity between the corresponding two positions. In other words, any two positions with similar features promote each other's refinement regardless of their distance in the spatial dimension. Contextual feature representations are selectively aggregated according to the spatial attention map, which improves the information interdependence within the same class.
Channel attention module: the channel dependencies between any two channel maps are captured by a channel attention mechanism operation. The convolutional-layer output feature map from RoISeg is again denoted A, A ∈ R^{C×H×W}, where R denotes the set of real numbers, the shape of A is C×H×W, C is the number of channels, H the height and W the width. Two new feature maps M and N are generated from the shared feature map A, where {M, N} ∈ R^{C×H×W}. Their shapes are then changed to R^{C×N}. Matrix multiplication is performed between M and the transpose of N, directly computing the channel feature map X ∈ R^{C×C}. A softmax layer is applied to obtain the channel attention feature map X ∈ R^{C×C}. The specific expression is as follows:
X_ji = exp(M_i · N_j) / Σ_{k=1..C} exp(M_k · N_j)
where X_ji measures the influence of the i-th channel on the j-th channel; as mentioned above, the channel attention module captures the channel dependency between any two channel maps. In addition, the X matrix and the feature map A reshaped to R^{C×N} are multiplied; the result of the matrix multiplication has shape R^{C×N} and is reshaped to R^{C×H×W}, then multiplied by a scale weight parameter β, and an element-wise sum is performed with A to obtain the output feature map E_2. The specific expression is as follows:
E_2j = β Σ_{i=1..C} (X_ji A_i) + A_j
where β is a weight coefficient initialized to 0.3, β ∈ [0, 1]. The resulting feature map E_2 of the sum operation has shape E_2 ∈ R^{C×H×W}. The channel dependencies between feature-map channel maps are modelled, which helps improve the discriminability of the model's features. The channel target features enhanced by the channel attention module are more prominent, so the network can identify the target in the video frame.
The two attention modules are then fused. The fusion operation combines the two feature vectors into a composite vector. The feature map E_1 output by the spatial attention module and the feature map E_2 output by the channel attention module are fused to obtain a new feature map O. The specific expression is as follows:
O = f(E_1, E_2)
where O is the output of the fused feature maps, with size O ∈ R^{C×H×W}, and the function f denotes the fusion operation. The feature map E_1 has size E_1 ∈ R^{C×H×W} and E_2 has size E_2 ∈ R^{C×H×W}. After fusion, the feature information between consecutive frames of the target object becomes more evident, yielding better features for video target object segmentation.
Dependencies are captured through feature fusion between the spatial-dimension information and the channel-dimension information of the attention modules, making full use of the contextual feature information between space and channels. Specifically, the convolutional-layer output of the RoI segmentation network is fed into each of the two attention modules. Through their respective attention mechanisms, the spatial attention module obtains salient spatial information features and the channel attention module obtains salient channel information features. Fusing the features of the two attention modules enhances the discriminative power of the feature representation in video object segmentation and suppresses interference and noise in video segmentation. A convolutional layer operation is then applied and the final segmentation mask is output.
The experimental results shown in FIG. 4 and FIG. 5 demonstrate that the video object segmentation method achieves effective results and is of practical significance.
Examples
The experimental hardware environment of the invention is as follows: the system is implemented on a PC with a 3.4 GHz Intel(R) Core(TM) i5-7500 CPU, a GTX 1080Ti GPU and 16 GB of memory, under the Ubuntu 18.04 operating system, and is built on the open-source PyTorch deep learning framework. The image size used for training and testing is 854x480. The test results (FIG. 4 and FIG. 5) are obtained on the DAVIS public video image segmentation dataset.
First, a given first frame and the mask of the first frame are provided (shown at 1 and 2 in FIG. 2). 1-100 image pairs are generated by the transformation network (shown at 4 in FIG. 2). The candidate regions of interest are selected by the target proposal boxes (shown at 5 in FIG. 2). After the tracker is added, the region of interest is trained in the RoISeg network (shown at 6 in FIG. 2). The feature map output by the last convolutional layer of the RoISeg network (shown at 7 in FIG. 2) is fed separately into the spatial attention module and the channel attention module. Finally, the feature maps output by the spatial attention module and the channel attention module are fused (12 in FIG. 2) and the segmentation result map is output. The experimental results shown in FIG. 4 and FIG. 5 demonstrate that the video object segmentation method achieves effective results.
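To tie the pieces together, the following sketch outlines how the example pipeline above could be orchestrated, reusing the hypothetical helpers sketched earlier (generate_image_pairs, select_rois, DualAttentionHead) together with assumed tracker, proposal-network and RoISeg callables; none of these interfaces are specified by the patent, so every name and signature here is an assumption.

```python
import torch

def segment_video(frames, first_mask, proposal_net, gt_box, tracker, roiseg, head):
    """End-to-end sketch: augment, screen RoIs, track, segment with dual attention."""
    # 1) Offline: augment the first frame and keep only high-IoU pairs
    #    (fine-tuning of roiseg on these pairs is not shown here).
    pairs = generate_image_pairs(frames[0], first_mask, n_pairs=100)
    rois = select_rois(pairs, proposal_net, gt_box, thresh=0.75)

    # 2) Online: propagate the mask with the tracker, refine with RoISeg + attention.
    masks = [first_mask]
    for frame in frames[1:]:
        coarse = tracker(frame, masks[-1])        # M_{t+1} = f(I_{t+1}, M_t)
        features = roiseg(frame, coarse)          # assumed: last conv feature map A
        logits = head(features)                   # dual attention + 1x1 conv
        masks.append((torch.sigmoid(logits) > 0.5).float())
    return masks
```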

Claims (7)

1. A method for segmenting a video object with a dual-module neural network structure is characterized by comprising the following steps: the method ensures that a near target domain has enough training data by collecting processing data for pixel-level video object segmentation scenes, and comprises the following steps:
acquiring a first frame of a video and an annotation mask thereof, generating a mask of a future video frame, generating a training set of a reasonable and vivid image, and further acquiring an expected appearance change in the future video frame to obtain a near target domain;
in addition, a module-of-interest mechanism is introduced to capture the target feature dependencies in the spatial and channel modules of interest, respectively; the injection molding mechanism is to add two parallel modules to the architecture of the expanded full convolution neural network: one is a spatial position dimension module, and the other is a channel information dimension module; after the two parallel modules are processed, the spatial position dimension module obtains an accurate position information dependency relationship, and the channel dimension module obtains a dependency relationship between channel mappings;
finally, the output feature maps of the two dimension modules are fused to obtain a better feature representation for pixel-level prediction, and the segmentation result is output after one convolutional layer; in the segmentation result, foreground and background are represented by 1 and 0 respectively.
2. The video object segmentation method with a dual-module neural network structure according to claim 1, characterized in that the method comprises the following steps:
step 1, in the video, the first frame is denoted I_0 and the mask of the first frame image is denoted M_0; the known first frame I_0 and first-frame mask M_0 are input into a transformation network, and the transformation network generates image pairs; an image pair is an image and its corresponding mask; the transformation network is a network comprising rotation, translation, flipping and scaling operations; step 2, the pixels of the first frame image I_0 and the first-frame mask M_0 of step 1 are passed through the transformation network to generate more than one group of mutually different image pairs, and the regions of interest are obtained through the target proposal;
the target proposal is a fully convolutional network that takes an image of arbitrary size as input and outputs a set of rectangular target proposal boxes; the target proposal obtains the region of interest by scoring the candidate boxes; the specific steps are as follows:
generating target proposals around the target in the first frame image I_0;
obtaining IoU in the following manner; IoU, the intersection over union, is the overlap ratio between the predicted region and the actual region;
first: randomly generating target proposals around the target using the images generated by the transformation network in step 1, and computing the IoU between a generated-image target proposal and the target proposal of the first frame image I_0;
second: computing the IoU score between a generated image mask and the initial mask; the initial mask is the first-frame mask M_0;
selecting the image pairs whose IoU score is greater than 0.75, which are called the regions of interest;
then, a tracker is added to the region of interest so that the tracker can effectively locate the target in the next frame; the tracker takes the current frame mask and the next frame image as input and predicts the position of the target mask in the next frame; the tracker is used to acquire the mask region of the next frame image, providing temporal consistency for the regions of interest in subsequent frames;
step 3, once the region of interest is located in the next frame, the region of interest with the added tracker is input into the perceptual segmentation network of the invention to train and predict the target; the perceptual segmentation network RoISeg is based on a deep convolutional neural network (CNN) and builds a network framework, called RoISeg for short in the invention, on top of the ResNet101 backbone network; the CNN is a convolutional neural network in deep learning; the ResNet101 backbone is a deep residual learning network that solves the accuracy degradation problem; the region of interest with the added tracker is input into RoISeg to train the model, and a coarse target recognition position and segmentation mask are output;
step 4, the target result predicted by RoISeg for the region of interest with the added tracker in step 3 still has a large error, and the noisy segmentation parts need to be reduced; therefore, the invention constructs a dual attention module method: the feature map output by the last convolutional layer of RoISeg is input into the dual attention module; the dual attention module comprises a spatial attention module and a channel attention module;
the spatial attention module introduces a spatial attention mechanism to capture the spatial dependency between any two spatial positions; the spatial attention mechanism is the functional operation inside the spatial attention module; for the target position features in a frame, the features at each position are updated by aggregating the features at all positions through a weighted sum, where the weights are determined by the feature similarity between the two corresponding positions; in other words, any two positions with similar features can promote each other regardless of their distance in the spatial dimension;
the channel attention module captures the channel dependency between any two channel mappings through a channel attention mechanism, and updates each channel mapping with a weighted sum of all channel mappings; the channel attention mechanism is the functional operation inside the channel attention module;
finally, the two attention modules are fused; the fusion operation is a parallel strategy that combines the two feature vectors into one composite vector; fusing them together enriches the information between the preceding and following frames of the target object, thereby obtaining a better feature representation for video object segmentation; capturing the dependency relationships between the spatial-dimension and channel-dimension features in the dual attention module strengthens the discriminative power of the feature representation in video object segmentation and suppresses interference and noise in video segmentation; after one more convolutional layer, the final segmentation mask result is output.
3. The video object segmentation method with a dual-module neural network structure according to claim 1 or 2, characterized in that:
in step 1, a video is input to a computer, and each frame of the video is a picture; the picture is in RGB format and is denoted RGB picture I; the target label in the image is denoted mask M; the mask is the binary foreground and background of the image;
first, a video and the mask of its first frame are input; the first frame I_0 and the first mask M_0 are input into a transformation network G to obtain transformed image pairs D; the specific expression is as follows:
D_n = G(I_0, M_0)
where G denotes the transformation network; the set of image pairs is D_n = {d_1 m_1, d_2 m_2, ..., d_n m_n}, and D_n indicates that there are n image pairs; d_i m_i denotes the i-th image pair, where d_i is the image generated by the i-th transformation and m_i is the mask generated by the i-th transformation; the image pairs generated by the transformation network are screened to decide whether they serve as regions of interest.
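A minimal stand-in for the transformation network G is sketched below in PyTorch/torchvision: the same randomly drawn rotation, translation, scaling and flip are applied to I_0 and M_0 to yield n image pairs d_i m_i. The parameter ranges and the use of torchvision.transforms.functional are assumptions; the claim only names the four operation types.

```python
import random
import torch
import torchvision.transforms.functional as TF


def transform_pairs(image: torch.Tensor, mask: torch.Tensor, n: int = 100):
    # image: (3, H, W) float tensor I_0; mask: (1, H, W) binary tensor M_0.
    # Returns n synthetic (d_i, m_i) pairs with identical geometric transforms
    # applied to image and mask. Parameter ranges are illustrative only.
    pairs = []
    for _ in range(n):
        angle = random.uniform(-30.0, 30.0)                              # rotation (degrees)
        translate = [random.randint(-40, 40), random.randint(-20, 20)]   # translation (pixels)
        scale = random.uniform(0.8, 1.2)                                 # scaling factor
        flip = random.random() < 0.5                                     # horizontal flip

        d = TF.affine(image, angle=angle, translate=translate, scale=scale, shear=0.0)
        m = TF.affine(mask, angle=angle, translate=translate, scale=scale, shear=0.0)
        if flip:
            d, m = TF.hflip(d), TF.hflip(m)
        pairs.append((d, (m > 0.5).float()))                             # keep the mask binary
    return pairs
```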
4. The video object segmentation method with a dual-module neural network structure according to claim 1 or 2, characterized in that the specific steps of step 2 are as follows:
the image pairs generated by the transformation network are screened to decide whether they serve as regions of interest; the regions of interest are obtained using the target proposal; the target proposal is a fully convolutional network that takes an image of arbitrary size as input and outputs a set of rectangular target proposal boxes; the target proposal operation is performed around the target in the first frame and denoted gt_box, where gt_box is the ground-truth bounding box marked around the target of the first frame; the bounding box generated by the target proposal operation around the target in a generated image is denoted b_box, where b_box is the target proposal box output when an image pair is input into the target proposal; the IoU score between a generated-image target proposal and the first-frame target proposal is computed; the specific expression is as follows:
S = IoU(b_box, gt_box)
where IoU is the functional expression of the intersection over union; the score S is the IoU score between a target proposal box in the image pair and the target proposal box in the first frame; the image pairs whose IoU score satisfies S > 0.75 are taken as regions of interest; then a tracker is added to the region of interest, and the tracker can effectively locate the target in the next frame; the tracker takes the current frame mask and the next frame image as input and can predict the position of the target mask in the next frame; the tracker is used to acquire the mask region of the next frame image, providing temporal consistency for the regions of interest in subsequent frames; a video sequence R = {I_0, I_1, I_2, I_3, ..., I_t, ..., I_n} and the mask M_0 of the first frame I_0 are known; I_t is the t-th frame in the video sequence, t ∈ {1, 2, 3, ..., n}; the masks of the remaining frames in the video sequence are {M_1, M_2, M_3, ..., M_n}; according to the tracker function expression:
M_{t+1} = f(I_{t+1}, M_t)
where f denotes the tracker function, I_{t+1} is the known image of the (t+1)-th frame, M_t is the known mask of the t-th frame image, and M_{t+1} is the computed mask of the (t+1)-th frame; the second frame image and the mask of the first frame image are known, and the mask of the second frame image is obtained through the tracker; because the target moves smoothly in space, adjacent video frames change little and have a certain correlation; the mask M_{t+1} of frame I_{t+1} is predicted from the mask M_t and the frame I_{t+1}; the predicted mask of frame I_{t+1} still has a large error with respect to the real mask M_gt, where M_gt denotes the truly accurate mask; the region of interest with the added tracker is then input into the perceptual segmentation network.
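The two expressions in this claim translate almost directly into code. The sketch below gives a standard box IoU and the mask-propagation recursion M_{t+1} = f(I_{t+1}, M_t); the tracker itself is passed in as a callable, since the patent does not specify which tracking model f is.

```python
from typing import Callable, List, Sequence


def box_iou(a, b) -> float:
    # Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2);
    # image pairs whose proposal satisfies IoU(b_box, gt_box) > 0.75 are kept as
    # regions of interest.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def propagate_masks(frames: Sequence, first_mask, tracker: Callable) -> List:
    # Direct transcription of M_{t+1} = f(I_{t+1}, M_t): starting from the given M_0,
    # each next mask is predicted from the next frame and the current mask.
    masks = [first_mask]
    for frame in frames[1:]:
        masks.append(tracker(frame, masks[-1]))
    return masks
```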
5. The video object segmentation method with a dual-module neural network structure according to claim 1 or 2, characterized in that the specific steps of step 3 are as follows:
based on step 2, the region of interest with the added tracker is input into the perceptual segmentation network of the invention to train and predict the target; the perceptual segmentation network RoISeg is based on a deep convolutional neural network (CNN), and the network framework is built on top of the ResNet101 backbone network; the ResNet101 backbone is a deep residual learning network that solves the accuracy degradation problem and achieves lower training and test errors; the RoISeg network is composed of convolutional layers, pooling layers, activation functions, batch normalization, deconvolution and the like; the initial parameters of RoISeg are set as follows: the learning rate is 0.0001 and the weight decay term is 0.005; the final output of RoISeg is constrained with a weighted cross-entropy loss; the cross-entropy loss expression is as follows:
L(θ) = −β Σ_{x ∈ X+} log P(x; θ) − (1 − β) Σ_{x ∈ X−} log(1 − P(x; θ))
where L(θ) denotes the weighted cross-entropy loss; θ, with value range [0, 1], denotes a weight parameter associated with the current prediction in the network; X+ and X− denote the sets of pixels with positive and negative target sample labels respectively, positive samples being true correct samples and negative samples being prediction error samples, in other words the sets of positive and negative sample pixels of a video frame mask; β is a weighted decay term that penalizes biased samples during training; the probability function P, representing a probability distribution with P ∈ [0, 1], is computed from the activation output of the convolutional layer; the commonly used nonlinear activation function Sigmoid, with value range [0, 1], is used as the activation function; the training output layer of the perceptual segmentation network is constrained by the cross-entropy loss, which is back-propagated into the network for continued training; when the loss gradually decreases during training, it converges to a sufficiently small and stable value; the target segmentation result is output; the output result is a segmentation map of the foreground and background of the mask.
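A hedged PyTorch sketch of such a weighted cross-entropy constraint is given below. The balancing rule β = |X−| / |X| and the optimizer named in the comment are assumptions; the claim only states the Sigmoid activation, the weighted loss, the learning rate 0.0001 and the weight decay 0.005.

```python
import torch


def weighted_cross_entropy(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Class-balanced binary cross-entropy in the spirit of the loss above: the Sigmoid
    # turns the last convolution's output into probabilities P in [0, 1], and beta
    # weights the positive/negative pixel sets X+ / X- so the abundant background
    # pixels do not dominate. The exact weighting used by the patent is not published;
    # this follows the common class-balancing choice beta = |X-| / |X|.
    p = torch.sigmoid(logits).clamp(1e-6, 1 - 1e-6)
    pos, neg = target > 0.5, target <= 0.5
    beta = neg.float().sum() / target.numel()
    loss = -(beta * torch.log(p[pos]).sum() + (1.0 - beta) * torch.log(1.0 - p[neg]).sum())
    return loss / target.numel()


# Training setup with the stated hyper-parameters (the optimizer choice is an assumption):
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, weight_decay=5e-3)
```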
6. The video object segmentation method with a dual-module neural network structure according to claim 1 or 2, characterized in that the specific steps of step 4 are as follows:
the target result predicted by the RoISeg network in step 3 still has a large error, and the noisy segmentation parts need to be reduced; therefore, the invention constructs a dual attention module method; the feature map output by the last convolutional layer of RoISeg is input into the two attention modules respectively; the two attention modules are a spatial attention module and a channel attention module; the specific steps are as follows:
spatial attention module: a spatial attention mechanism is introduced to enrich the contextual feature dependencies of the target in a video frame; the operation of the spatial attention mechanism is described in detail as follows; the feature map output by the convolutional layer of RoISeg is denoted A, with A ∈ R^{C×H×W}, where R denotes the set of real numbers, the shape of A is C×H×W, C denotes the number of channels, H the height and W the width; first, three new feature maps B, D and F are generated from the shared feature map A, where {B, D, F} ∈ R^{C×H×W}; their shapes are then reshaped to R^{C×N}, where N = H×W is the product of height and width; then B is transposed and matrix-multiplied with D, and a softmax layer is applied to compute the spatial-dimension attention feature map S ∈ R^{N×N}; the specific expression is as follows:
S_{ji} = exp(B_i · D_j) / Σ_{i=1}^{N} exp(B_i · D_j)
where S_{ji} measures the influence of the i-th spatial position on the j-th spatial position, and exp denotes the exponential function; this captures the spatial dependency between any two spatial positions: the more similar the features of two positions, the stronger the mutual response between them; the shape of F is R^{C×N}; then a matrix multiplication is performed between F and the transpose of S, the feature map resulting from the matrix multiplication has shape R^{C×N} and is reshaped to R^{C×H×W}; finally it is multiplied by a scale parameter α and an element-wise sum with feature A is performed to obtain the output feature map E_1; the specific expression is as follows:
E_{1j} = α Σ_{i=1}^{N} (S_{ji} F_i) + A_j
where α is a weight coefficient initialized to 0, α ∈ [0, 1], and it is gradually assigned more weight; the shape of the summed output feature map E_1 is E_1 ∈ R^{C×H×W}; for the target feature positions in the video frame, the features are updated by aggregating over positions with a weighted sum, where the weights are determined by the feature similarity between the two corresponding positions; in other words, any two positions with similar features can promote each other regardless of their distance in the spatial dimension; contextual feature representations are selectively aggregated according to the spatial attention map, thereby improving the information interdependence within the same class.
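The description of this module closely follows the well-known position-attention design, so a PyTorch sketch is given below. How B, D and F are derived from A is not spelled out in the claim; 1x1 convolutions with C output channels are assumed here, and α is a learnable scalar initialized to 0 as stated.

```python
import torch
import torch.nn as nn


class SpatialAttentionModule(nn.Module):
    # Sketch of the spatial attention module: B, D and F are derived from A (here with
    # 1x1 convolutions, an assumption), the N x N affinity S is the softmax of B^T D,
    # and E1 = alpha * (F S^T, reshaped) + A with alpha initialized to 0.
    def __init__(self, channels: int):
        super().__init__()
        self.to_b = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_d = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_f = nn.Conv2d(channels, channels, kernel_size=1)
        self.alpha = nn.Parameter(torch.zeros(1))            # learnable scale, starts at 0

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        bsz, c, h, w = a.shape
        n = h * w
        b = self.to_b(a).view(bsz, c, n).permute(0, 2, 1)    # (batch, N, C)
        d = self.to_d(a).view(bsz, c, n)                     # (batch, C, N)
        s = torch.softmax(torch.bmm(b, d), dim=-1)           # (batch, N, N) position affinity
        f = self.to_f(a).view(bsz, c, n)                     # (batch, C, N)
        e1 = torch.bmm(f, s.permute(0, 2, 1)).view(bsz, c, h, w)
        return self.alpha * e1 + a                           # E1 = alpha * aggregation + A
```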
7. The video object segmentation method with a dual-module neural network structure according to claim 6, characterized in that the specific steps of step 4 further comprise:
channel attention module: the channel dependency between any two channel mappings is captured through the channel attention mechanism operation; the feature map output by the convolutional layer of RoISeg is likewise denoted A, with A ∈ R^{C×H×W}, where R denotes the set of real numbers, the shape of A is C×H×W, C denotes the number of channels, H the height and W the width; two new feature maps M and N are generated from the shared feature map A, where {M, N} ∈ R^{C×H×W}; their shapes are then reshaped to R^{C×N}; a matrix multiplication is performed between M and the transpose of N, directly computing the channel feature map X ∈ R^{C×C}; a softmax layer is applied to obtain the channel attention feature map X ∈ R^{C×C}; the specific expression is as follows:
X_{ji} = exp(M_i · N_j) / Σ_{i=1}^{C} exp(M_i · N_j)
where X_{ji} measures the influence of the i-th channel on the j-th channel, and the channel attention module captures the channel dependency between any two channel mappings; in addition, a matrix multiplication is performed between the matrix X and the feature map A reshaped to R^{C×N}, the result with shape R^{C×N} is reshaped to R^{C×H×W}, multiplied by a scale weight parameter β, and an element-wise sum with A is performed to obtain the output feature map E_2; the specific expression is as follows:
E_{2j} = β Σ_{i=1}^{C} (X_{ji} A_i) + A_j
where β is a weight coefficient whose initialization is set to 0.3, β ∈ [0, 1]; the shape of the summed output feature map E_2 is E_2 ∈ R^{C×H×W}; the channel dependency between the channel mappings of the feature map is thereby modeled, which helps to improve the discriminability of the model; the channel attention module enhances the channel-wise target features and makes them more prominent, so that the network can identify the target in the video frame;
the two attention modules are fused; the fusion operation combines the two feature vectors into one composite vector; the feature map E_1 output by the spatial attention module and the feature map E_2 output by the channel attention module are fused to obtain a new feature map O; the specific expression is as follows:
O = f(E_1, E_2)
where O is the output of the fused feature map, with shape O ∈ R^{C×H×W}; the function f denotes the fusion operation; the shape of feature map E_1 is E_1 ∈ R^{C×H×W} and the shape of feature map E_2 is E_2 ∈ R^{C×H×W}; after fusion, the feature information between the preceding and following frames of the target object becomes more distinct, thereby obtaining a better feature representation for video target object segmentation.
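A PyTorch sketch of the channel attention module and the fusion O = f(E_1, E_2) follows. The claim does not say how M and N are generated from A, so plain reshapes are used here; β is initialized to 0.3 as stated, and the element-wise sum used for f is an assumption that keeps the C×H×W shape given for O.

```python
import torch
import torch.nn as nn


class ChannelAttentionModule(nn.Module):
    # Sketch of the channel attention module: the C x C affinity X is the softmax of
    # M N^T, where M and N are taken as plain reshapes of A (an assumption), and
    # E2 = beta * (X A, reshaped) + A with beta initialized to 0.3 as stated.
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.full((1,), 0.3))

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        bsz, c, h, w = a.shape
        m = a.view(bsz, c, -1)                               # (batch, C, N)
        n = a.view(bsz, c, -1).permute(0, 2, 1)              # (batch, N, C)
        x = torch.softmax(torch.bmm(m, n), dim=-1)           # (batch, C, C) channel affinity
        e2 = torch.bmm(x, a.view(bsz, c, -1)).view(bsz, c, h, w)
        return self.beta * e2 + a                            # E2 = beta * aggregation + A


def fuse(e1: torch.Tensor, e2: torch.Tensor) -> torch.Tensor:
    # O = f(E1, E2): an element-wise sum keeps the C x H x W shape stated for O;
    # the patent does not pin down the exact fusion operation.
    return e1 + e2
```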
CN201911125917.3A 2019-11-15 2019-11-15 Video object segmentation method for dual-module neural network structure Active CN110910391B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911125917.3A CN110910391B (en) 2019-11-15 2019-11-15 Video object segmentation method for dual-module neural network structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911125917.3A CN110910391B (en) 2019-11-15 2019-11-15 Video object segmentation method for dual-module neural network structure

Publications (2)

Publication Number Publication Date
CN110910391A true CN110910391A (en) 2020-03-24
CN110910391B CN110910391B (en) 2023-08-18

Family

ID=69816867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911125917.3A Active CN110910391B (en) 2019-11-15 2019-11-15 Video object segmentation method for dual-module neural network structure

Country Status (1)

Country Link
CN (1) CN110910391B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942794A (en) * 2014-04-16 2014-07-23 南京大学 Image collaborative cutout method based on confidence level
US20190311202A1 (en) * 2018-04-10 2019-10-10 Adobe Inc. Video object segmentation by reference-guided mask propagation
CN109272530A (en) * 2018-08-08 2019-01-25 北京航空航天大学 Method for tracking target and device towards space base monitoring scene
CN110084829A (en) * 2019-03-12 2019-08-02 上海阅面网络科技有限公司 Method for tracking target, device, electronic equipment and computer readable storage medium

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
ALEKSANDR FEDOROV 等: "Traffic flow estimation with data from a video surveillance camera", 《JOURNAL OF BIG DATA》 *
JIFENG DAI 等: "BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation", 《PROCEEDINGS OF THE IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV)》 *
JUN FU 等: "Dual Attention Network for Scene Segmentation", 《2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》 *
LINBOWANG 等: "A unified two-parallel-branch deep neural network for joint gland contour and segmentation learning", 《FUTURE GENERATION COMPUTER SYSTEMS》 *
任长安 等: "融合目标空间分割的网格任务调度算法", 《控制工程》 *
崔智高 等: "动态背景下融合运动线索和颜色信息的视频目标分割算法", 《光电子.激光》 *
洪朝群 等: "感兴趣区域中的快速人脸区域分割与跟踪方法", 《厦门理工学院学报》 *
黄元捷: "结合跟踪技术的视频目标分割方法研究", 《中国优秀博硕士学位论文全文数据库(硕士) 信息科技辑(月刊)》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461772A (en) * 2020-03-27 2020-07-28 上海大学 Video advertisement integration system and method based on generation countermeasure network
CN111583288A (en) * 2020-04-21 2020-08-25 西安交通大学 Video multi-target association and segmentation method and system
CN111462140A (en) * 2020-04-30 2020-07-28 同济大学 Real-time image instance segmentation method based on block splicing
CN111462140B (en) * 2020-04-30 2023-07-07 同济大学 Real-time image instance segmentation method based on block stitching
CN111968150A (en) * 2020-08-19 2020-11-20 中国科学技术大学 Weak surveillance video target segmentation method based on full convolution neural network
CN112085717A (en) * 2020-09-04 2020-12-15 厦门大学 Video prediction method and system for laparoscopic surgery
CN112085717B (en) * 2020-09-04 2024-03-19 厦门大学 Video prediction method and system for laparoscopic surgery
CN112288755A (en) * 2020-11-26 2021-01-29 深源恒际科技有限公司 Video-based vehicle appearance component deep learning segmentation method and system
WO2022133627A1 (en) * 2020-12-21 2022-06-30 广州视源电子科技股份有限公司 Image segmentation method and apparatus, and device and storage medium
CN113033428A (en) * 2021-03-30 2021-06-25 电子科技大学 Pedestrian attribute identification method based on instance segmentation
CN113421280A (en) * 2021-05-31 2021-09-21 江苏大学 Method for segmenting reinforcement learning video object by integrating precision and speed
CN113044561A (en) * 2021-05-31 2021-06-29 山东捷瑞数字科技股份有限公司 Intelligent automatic material conveying method
CN113298036B (en) * 2021-06-17 2023-06-02 浙江大学 Method for dividing unsupervised video target
CN113298036A (en) * 2021-06-17 2021-08-24 浙江大学 Unsupervised video target segmentation method
CN116030247A (en) * 2023-03-20 2023-04-28 之江实验室 Medical image sample generation method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN110910391B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
CN110910391A (en) Video object segmentation method with dual-module neural network structure
Tokmakov et al. Learning to segment moving objects
Zhang et al. C2FDA: Coarse-to-fine domain adaptation for traffic object detection
CN107945204B (en) Pixel-level image matting method based on generation countermeasure network
CN109886121B (en) Human face key point positioning method for shielding robustness
CN106547880B (en) Multi-dimensional geographic scene identification method fusing geographic area knowledge
Li et al. Deep neural network for structural prediction and lane detection in traffic scene
Huang et al. Scribble-supervised video object segmentation
CN111507993B (en) Image segmentation method, device and storage medium based on generation countermeasure network
CN113673338B (en) Automatic labeling method, system and medium for weak supervision of natural scene text image character pixels
CN111652836A (en) Multi-scale target detection method based on clustering algorithm and neural network
Zhao et al. Bitnet: A lightweight object detection network for real-time classroom behavior recognition with transformer and bi-directional pyramid network
Li et al. AEMS: an attention enhancement network of modules stacking for lowlight image enhancement
CN111179272A (en) Rapid semantic segmentation method for road scene
Du et al. Adaptive visual interaction based multi-target future state prediction for autonomous driving vehicles
Fu et al. Purifying real images with an attention-guided style transfer network for gaze estimation
CN113096133A (en) Method for constructing semantic segmentation network based on attention mechanism
CN114882493A (en) Three-dimensional hand posture estimation and recognition method based on image sequence
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
Castro et al. AttenGait: Gait recognition with attention and rich modalities
CN116258937A (en) Small sample segmentation method, device, terminal and medium based on attention mechanism
CN113255493B (en) Video target segmentation method integrating visual words and self-attention mechanism
Musat et al. Depth-sims: Semi-parametric image and depth synthesis
Xiong et al. TFA-CNN: an efficient method for dealing with crowding and noise problems in crowd counting
Liu et al. Spatiotemporal saliency based multi-stream networks for action recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant