CN110910391B - Video object segmentation method for dual-module neural network structure - Google Patents

Video object segmentation method for dual-module neural network structure

Info

Publication number
CN110910391B
CN110910391B (application CN201911125917.3A)
Authority
CN
China
Prior art keywords
mask
frame
image
network
channel
Prior art date
Legal status
Active
Application number
CN201911125917.3A
Other languages
Chinese (zh)
Other versions
CN110910391A (en)
Inventor
汪粼波
陈彬彬
方贤勇
Current Assignee
Anhui University
Original Assignee
Anhui University
Priority date
Filing date
Publication date
Application filed by Anhui University
Priority to CN201911125917.3A
Publication of CN110910391A
Application granted
Publication of CN110910391B
Legal status: Active
Anticipated expiration


Classifications

    • G06T 7/10 Image analysis: Segmentation; Edge detection
    • G06T 7/207 Analysis of motion: motion estimation over a hierarchy of resolutions
    • G06T 2207/10016 Image acquisition modality: Video; Image sequence
    • G06T 2207/20081 Special algorithmic details: Training; Learning
    • G06T 2207/20084 Special algorithmic details: Artificial neural networks [ANN]
    • G06T 2207/20104 Special algorithmic details: Interactive definition of region of interest [ROI]
    • G06T 2207/20221 Special algorithmic details: Image fusion; Image merging
    • Y02T 10/40 Road transport: Engine management systems

Abstract

The invention provides a video object segmentation method with a dual-module neural network structure, aimed at solving the problem of unsatisfactory segmentation results caused by noise interference during video object segmentation. The method comprises the following steps: the first frame and its mask are input to a transformation network, which generates image pairs; a target proposal box is generated for each image pair to decide whether it qualifies as a region of interest; a tracker is added to each region of interest, which is then input to the region-of-interest segmentation network to train a learning model and produce an output; the feature map output by the last convolution layer of the region-of-interest segmentation network is fed separately to a spatial attention module and a channel attention module; finally, the feature maps output by the two attention modules are fused, and the final segmentation mask is obtained through one more convolution layer. The invention obtains improved segmentation results on the DAVIS video dataset.

Description

Video object segmentation method for dual-module neural network structure
Technical Field
The invention relates to the field of computer vision, and in particular to the segmentation of video objects that undergo large-scale changes and dynamic appearance changes, for which existing methods are often inaccurate; more specifically, it relates to a video object segmentation method with a dual-module neural network structure.
Background
With the rapid development of computer vision technology in recent years, convolutional neural networks have received great attention across the research fields of deep learning, and video object segmentation has become an important focus for researchers. Video segmentation technology increasingly demonstrates its importance: it is applied in scene understanding, video tagging, driverless cars, object detection, and similar tasks, and is developing rapidly. It can be said that progress in video segmentation has driven the development of computer vision technology as a whole. Video object segmentation is both a research hotspot and a research difficulty. The goal of segmentation is to find accurate positional relationships for the objects in a video; however, the process is subject to many limitations, such as movement speed, object deformation, occlusion between instances, and cluttered backgrounds, which may come from different camera devices and different scene images. This makes video object segmentation very challenging, and existing methods still perform poorly when segmenting real-world scenes, which presents a significant challenge for video object segmentation techniques.
In recent years, numerous scholars have carried out a great deal of research on video segmentation and achieved good academic results. In unsupervised video object segmentation, moving objects are segmented from the background without any prior knowledge of the object; the aim of these methods is to find and separate the salient objects from the background automatically. They are based on probabilistic models, motion, or object proposals. Existing methods typically rely on visual cues (for example superpixels, saliency maps, or optical flow) to acquire the initial object region, and require processing the entire video in batch mode to provide the object segmentation. Furthermore, generating and processing thousands of candidate regions in each frame is typically time-consuming, and these unsupervised methods cannot segment a specific object when motion between different instances is confounded with a dynamic background. Many semi-supervised video object segmentation methods rely on fine-tuning with the ground truth of the first frame: a convolutional network is trained to segment foreground and background and is then adapted to the first frame of the target video at test time (for example, through online adaptation mechanisms and semantic information for instance segmentation networks). The first-frame annotation provides key visual cues for the target, so these methods can handle multiple instances and generally perform better than unsupervised methods. However, many semi-supervised approaches depend heavily on the segmentation mask of the first frame: data augmentation is carried out only from the first frame, model adaptation depends heavily on fine-tuning, and videos with complex backgrounds, occlusion, rapid motion, and camera shake cannot be segmented effectively.
Disclosure of Invention
In view of the problems of existing video segmentation methods, the invention provides a video segmentation method based on the spatial and channel information of a dual attention module structure. Compared with the prior art, the method can flexibly exploit the spatial and channel information in the feature map, reduce the amount of computation in the optimization process, and greatly improve the accuracy of video target object segmentation.
The invention aims to: overcome the defects of existing video object segmentation methods and provide a video object segmentation method with a dual-module neural network structure that addresses several problems in video object segmentation.
The technical scheme is as follows: the invention discloses a video object segmentation method with a dual-module neural network structure which, to ensure sufficient training data in the near-target domain, builds custom training data for the pixel-level video object segmentation scene.
First, the first frame and its annotation mask are input to a transformation network that generates plausible future image pairs, which removes the extra processing time otherwise required for data preparation and training-data augmentation: a training set of reasonably realistic images is generated that captures the expected trajectories and appearance changes of the object in future video frames. Second, the image pairs are passed through a target proposal operation, which determines the candidate regions of interest and filters out unsuitable image pairs, so that the region-of-interest segmentation network avoids unnecessary computation. Then, a tracker is added to each region of interest, which is input to the region-of-interest segmentation network to train a model and output segmentation results. Because interference in the video frames disturbs the tracking of the target, the tracked segmentation alone is inaccurate. Finally, a dual attention module method is devised: the spatial attention module captures the spatial dependency between any two spatial positions, and the channel attention module captures the channel dependency between any two channel maps. The outputs of the two attention modules are then fused, which enhances the discriminative power of the feature representation for video object segmentation and suppresses the influence of interference and noise. One more convolution layer is applied, and the final segmentation mask is output. The experimental results in FIG. 4 and FIG. 5 show that the video object segmentation method is effective. The method specifically comprises the following steps:
Step 1: the first frame of the video is denoted I_0 and the mask of the first frame is denoted M_0. The known first frame I_0 and first-frame mask M_0 are input to a transformation network, through which a number of different image pairs can be generated. An image pair is an image together with its corresponding mask. The transformation network applies operations such as rotation, translation, flipping, and scaling. The different image pairs serve as mask training data for the object positions that may occur in future video frames; the dataset is derived from the DAVIS public video segmentation dataset. By processing the video frame and its corresponding mask in this way, a large number of image pairs is obtained, which alleviates the shortage of video training data. Sufficient data can therefore be obtained for training, and the video result can be predicted accurately.
Step 2: from the first frame I_0 and the first-frame mask M_0 of step 1, the transformation network generates different image pairs, and a target proposal is used to obtain the regions of interest. The target proposal is a typical fully convolutional network that takes an image of arbitrary size as input and outputs a set of rectangular target proposal boxes. Target proposals are generated around the target in the first frame; target proposals are also generated randomly around the target in each image produced by the transformation network in step 1, and the IoU score between the proposal of the generated image and the proposal of the first frame is computed, or alternatively the IoU score between the generated image mask and the initial mask. A representative image pair whose score exceeds 0.35 is selected by this IoU ratio and is referred to as a region of interest (RoI). The initial mask is the first-frame mask M_0. IoU, the intersection-over-union, is the ratio of the intersection to the union of the predicted region and the ground-truth region. A tracker is then added to the region of interest; the tracker can be used to locate the target effectively in the next frame. Given the current-frame mask and the next-frame image, the tracker predicts the position of the target mask in the next frame, so the mask region of the next frame image is obtained, providing temporal consistency for the regions of interest of subsequent frames.
Step 3: once the region of interest has been located in the next frame, the region of interest with its tracker is input to the region-of-interest segmentation network (RoISeg) of the invention, which is trained to predict the target. RoISeg is based on a deep convolutional neural network (CNN); its framework is built on the ResNet101 backbone and is referred to below as the region-of-interest segmentation network, abbreviated RoISeg. A CNN is a convolutional neural network in deep learning. The ResNet101 backbone is a deep residual learning framework designed to address the accuracy-degradation problem and has lower training and test error. The region of interest with its tracker is input to RoISeg to train the model, which outputs a coarse target position and a segmentation mask.
Step 4: the target result predicted by RoISeg from the region of interest with its tracker in step 3 still contains a large error, and the noisy part of the segmentation must be reduced. The invention therefore constructs a dual attention module method: the feature map output by the last convolution layer of RoISeg is input to the dual attention module, which consists of a spatial attention module and a channel attention module.
The spatial attention module introduces a spatial attention mechanism to capture the spatial dependency between any two spatial positions. The spatial attention mechanism consists of the function operations inside the spatial attention module. For the target position features in a frame, the features at each position are updated by a weighted sum over all positions, where the weights are determined by the feature similarity between the corresponding two positions. In other words, any two positions with similar features can reinforce each other regardless of their distance in the spatial dimension.
The channel attention module captures the channel dependency between any two channel maps through a channel attention mechanism, and updates each channel map using a weighted sum of all channel maps. The channel attention mechanism consists of the function operations inside the channel attention module.
Finally, the outputs of the two attention modules are fused. The fusion operation is a parallel strategy that combines the two feature vectors into one composite vector. It fuses together information about the object across neighboring frames, which yields better features for video object segmentation. Capturing the dependencies between the spatial-dimension information and the channel-dimension information inside the dual attention module enhances the discriminative power of the feature representation in video object segmentation and suppresses the influence of interference and noise. After one more convolution layer, the final segmentation mask is output.
The detailed steps are as follows:
Step 1: a video is input to a computer; each frame of the video is a picture. The picture is in RGB format and is denoted RGB picture I. The target annotation in this image is denoted mask M. The mask is a binary separation of the image into foreground and background.
First, the video and the mask of its first frame are input: the first frame I_0 and the first mask M_0 are fed into the transformation network G, and a large number of transformed image pairs D is obtained. The specific expression is as follows:
D_n = G(I_0, M_0)
where G denotes the transformation network, which performs operations such as rotation, translation, flipping, and scaling. D_n = {d_1 m_1, d_2 m_2, ..., d_n m_n} denotes the n generated image pairs, and d_i m_i denotes the i-th image pair, where d_i is the image generated by the i-th transformation and m_i is the corresponding generated mask. The image pairs produced by the transformation network are then screened to decide which of them qualify as regions of interest.
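By way of illustration only, the following sketch shows one way the transformation network G could be realized with standard affine augmentations in PyTorch; the parameter ranges and the helper names random_affine_pair and generate_image_pairs are assumptions of this sketch and are not specified by the patent.

```python
import random
import torchvision.transforms.functional as TF

def random_affine_pair(image, mask):
    """Apply one random rotation/translation/scale/flip to an image and its mask.

    The same geometric transform is applied to both inputs so the pair stays
    consistent. Parameter ranges are illustrative assumptions, not values from
    the patent.
    """
    angle = random.uniform(-20, 20)            # rotation in degrees
    translate = [random.randint(-30, 30),      # horizontal shift in pixels
                 random.randint(-30, 30)]      # vertical shift in pixels
    scale = random.uniform(0.8, 1.2)           # isotropic scaling
    flip = random.random() < 0.5               # horizontal flip

    image = TF.affine(image, angle=angle, translate=translate, scale=scale, shear=0.0)
    mask = TF.affine(mask, angle=angle, translate=translate, scale=scale, shear=0.0)
    if flip:
        image, mask = TF.hflip(image), TF.hflip(mask)
    return image, mask

def generate_image_pairs(first_frame, first_mask, n_pairs=100):
    """D_n = G(I_0, M_0): produce n transformed (d_i, m_i) pairs from the first frame."""
    return [random_affine_pair(first_frame, first_mask) for _ in range(n_pairs)]
```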
The specific steps of step 2 are as follows:
The image pairs generated by the transformation network are screened to decide which qualify as regions of interest. The region of interest is obtained using the target proposal. The target proposal is a typical fully convolutional network that takes an image of arbitrary size as input and outputs a set of rectangular target proposal boxes. The target proposal operation is applied around the target in the first frame, and the result is denoted gt_box; gt_box is the ground-truth bounding box around the object of the first frame. The bounding box produced by applying the target proposal operation around the object in a generated image is denoted b_box; b_box is the target proposal box output when an image pair is input to the target proposal, as shown at label 5 in FIG. 2. The IoU score between the proposal box of the generated image and the proposal box of the first frame is then computed. The specific expression is as follows:
S = IoU(b_box, gt_box)
where IoU is the intersection-over-union function and S is the IoU score between the target proposal box of the image pair and the target proposal box of the first frame. The image pairs whose IoU score satisfies S > 0.75 are kept as representative regions of interest. Then a tracker is added to the region of interest; the tracker can be used to locate the target effectively in the next frame. Given the current-frame mask and the next-frame image, the tracker predicts the position of the target mask in the next frame, so the mask region of the next frame image is obtained, providing temporal consistency for the regions of interest of subsequent frames. A video sequence R = {I_0, I_1, I_2, I_3, ..., I_t, ..., I_n} is known, together with the mask M_0 of the first frame I_0. I_t is the t-th frame in the video sequence, t ∈ {1, 2, 3, ..., n}. The masks {M_1, M_2, M_3, ..., M_n} of the remaining frames are obtained from the tracker function as follows:
M_{t+1} = f(I_{t+1}, M_t)
where f denotes the tracker function, I_{t+1} denotes the known image of frame t+1, M_t denotes the known mask of frame t, and M_{t+1} is the predicted mask of frame t+1. Given the second-frame image and the first-frame image with its mask, the mask of the second frame is obtained by the tracker. Because the object moves smoothly in space, the change between consecutive video frames is very small and the frames are correlated; from the mask M_t and the frame I_{t+1}, the mask M_{t+1} of frame I_{t+1} is predicted. The predicted mask of frame I_{t+1} still has a large error with respect to the true mask M_gt, where M_gt denotes the accurate ground-truth mask. The region of interest with its tracker is then input to the region-of-interest segmentation network.
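A minimal sketch of the two operations just described, IoU scoring of proposal boxes and frame-by-frame mask propagation with a tracker, is given below for illustration; the (x1, y1, x2, y2) box format and the tracker callable are assumptions of the sketch.

```python
def iou(box_a, box_b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def select_regions_of_interest(pairs_with_boxes, gt_box, threshold=0.75):
    """Keep the image pairs whose proposal box b_box satisfies S = IoU(b_box, gt_box) > threshold."""
    return [pair for pair, b_box in pairs_with_boxes if iou(b_box, gt_box) > threshold]

def propagate_masks(frames, first_mask, tracker):
    """M_{t+1} = f(I_{t+1}, M_t): propagate the first-frame mask through the sequence."""
    masks = [first_mask]
    for next_frame in frames[1:]:
        masks.append(tracker(next_frame, masks[-1]))
    return masks
```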
The specific steps of step 3 are as follows:
Based on step 2, the region of interest with its tracker is input to the region-of-interest segmentation network (RoISeg) of the invention, which is trained to predict the target. RoISeg is based on a deep convolutional neural network (CNN); its framework is built on the ResNet101 backbone and is referred to as the region-of-interest segmentation network. The ResNet101 backbone is a deep residual learning framework designed to address the accuracy-degradation problem and has lower training and test error. The RoISeg network is composed of convolution layers, pooling layers, activation functions, batch normalization, deconvolution, and so on. Some of the initial parameters in RoISeg are set as follows: the learning rate is 0.0001 and the weight decay term is 0.005. The final output of RoISeg is constrained using a weighted cross-entropy loss of the form
L(θ) = -β Σ_{x∈X+} log P(x) - (1 - β) Σ_{x∈X-} log(1 - P(x))
where L(θ) denotes the weighted cross-entropy loss and θ, with values in [0, 1], denotes the weight parameters of the network related to the current prediction. X+ and X- denote the sets of pixels carrying positive and negative target sample labels, respectively: the positive samples are the true correct samples and the negative samples are the mispredicted samples; in other words, the positive and negative samples of the video-frame mask are defined per pixel. β is the weighting term that penalizes biased sampling during training. The activation output of the convolution layer yields a probability P representing the probability distribution, P ∈ [0, 1]; the activation function is the common nonlinear Sigmoid with value range [0, 1]. The training output layer of the region-of-interest segmentation network is constrained by the cross-entropy loss, which is back-propagated through the network for continued training; as the training loss decreases gradually, it converges to a small and stable value. The output is the target segmentation result, a segmentation map of mask foreground and background.
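For illustration, the weighted cross-entropy described above could be computed per pixel as in the following sketch; treating β as a class-balancing weight estimated from the batch is an assumption of this sketch, and the function name is not taken from the patent.

```python
import torch

def weighted_cross_entropy(pred, target, beta=None):
    """Class-balanced binary cross-entropy over mask pixels.

    pred:   sigmoid probabilities in [0, 1], shape (B, 1, H, W)
    target: binary ground-truth mask, same shape
    beta:   positive-class weight; if None it is estimated from the batch
            (assumption: beta = fraction of negative pixels, as in class balancing)
    """
    eps = 1e-7
    pred = pred.clamp(eps, 1.0 - eps)
    if beta is None:
        beta = (target == 0).float().mean()      # weight positives by negative-pixel frequency
    pos_term = -beta * (target * torch.log(pred))
    neg_term = -(1.0 - beta) * ((1.0 - target) * torch.log(1.0 - pred))
    return (pos_term + neg_term).sum()
```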
In step 3 the RoISeg network predicts an output target result that still has a large error, and the noisy part of the segmentation must be reduced. The invention therefore constructs the dual attention module method: the feature map output by the last convolution layer of RoISeg is input separately to the two attention modules, namely a spatial attention module and a channel attention module.
Spatial attention module: a spatial attention mechanism is introduced to enrich the contextual feature dependencies for the target in the video frame; its operations are described in detail below. The spatial attention module is shown at label 11 in FIG. 3. The feature map output by the convolution layer of RoISeg is denoted A, A ∈ R^{C×H×W}, where R denotes the real values, the shape of A is C×H×W, C is the number of channels, H the height, and W the width. First, three new feature maps B, D and F are generated from the shared feature map A, where {B, D, F} ∈ R^{C×H×W}. They are then reshaped to R^{C×N}, where N = H×W is the product of height and width. The transpose of B is multiplied with D, and a softmax layer is applied to obtain the spatial attention feature map S ∈ R^{N×N}. The specific expression is as follows:
S_ij = exp(B_i · D_j) / Σ_{i=1}^{N} exp(B_i · D_j)
where S_ij measures the influence of the i-th spatial position on the j-th spatial position, and the exponential of the dot product acts as a similarity: the more similar the features of the two positions, the larger their mutual contribution. As stated above, this captures the spatial dependency between any two spatial positions. F also has shape R^{C×N}. A matrix multiplication is then performed between F and the transpose of S; the result has shape R^{C×N} and is reshaped back to R^{C×H×W}. Finally the result is multiplied by a scale parameter α and added element-wise to the feature map A, giving the output feature map E_1, whose j-th position is
E_1j = α Σ_{i=1}^{N} (S_ij F_i) + A_j
where α is a weight coefficient initialized to 0, α ∈ [0, 1], and gradually learns to assign more weight. The resulting feature map E_1 has shape E_1 ∈ R^{C×H×W}. For the target feature positions in a video frame, the features at each position are aggregated by a weighted sum, where the weights are determined by the feature similarity between the corresponding two positions; in other words, any two positions with similar features reinforce each other regardless of their distance in the spatial dimension. The contextual feature representations are thus aggregated selectively according to the spatial attention map, which promotes the interdependence of information within the same class.
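A compact PyTorch sketch of a position-attention block of this kind is shown below for illustration; deriving B, D and F from A with 1×1 convolutions is an assumption of the sketch, since the text only states that the three maps are generated from the shared feature map A.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Position attention: E_1 = alpha * (attention-weighted F) + A."""

    def __init__(self, channels):
        super().__init__()
        # Assumed 1x1 convolutions deriving B, D, F from the shared feature map A.
        self.to_b = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.to_d = nn.Conv2d(channels, channels // 8, kernel_size=1)
        self.to_f = nn.Conv2d(channels, channels, kernel_size=1)
        self.alpha = nn.Parameter(torch.zeros(1))   # initialized to 0, as stated in the text
        self.softmax = nn.Softmax(dim=1)            # normalize over i so that sum_i S_ij = 1

    def forward(self, a):                           # a: (batch, C, H, W)
        batch, c, h, w = a.shape
        n = h * w
        b = self.to_b(a).view(batch, -1, n)         # (batch, C', N)
        d = self.to_d(a).view(batch, -1, n)         # (batch, C', N)
        f = self.to_f(a).view(batch, c, n)          # (batch, C,  N)
        energy = torch.bmm(b.transpose(1, 2), d)    # (batch, N, N), energy[i, j] = B_i . D_j
        s = self.softmax(energy)                    # S_ij: influence of position i on position j
        out = torch.bmm(f, s).view(batch, c, h, w)  # out_j = sum_i S_ij * F_i
        return self.alpha * out + a                 # E_1
```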
The channel attention module captures the channel dependency between any two channel maps through the operations of a channel attention mechanism. The feature map output by the convolution layer of RoISeg is again denoted A, A ∈ R^{C×H×W}, where R denotes the real values, the shape of A is C×H×W, C is the number of channels, H the height, and W the width. Two new feature maps M and N are generated from the shared feature map A, where {M, N} ∈ R^{C×H×W}, and are then reshaped to R^{C×N}. A matrix multiplication between M and the transpose of N directly yields the channel feature map, and a softmax layer gives the channel attention feature map X ∈ R^{C×C}. The specific expression is as follows:
X_ji = exp(M_i · N_j) / Σ_{i=1}^{C} exp(M_i · N_j)
where X_ji measures the influence of the i-th channel on the j-th channel; as mentioned above, the channel attention module captures the channel dependency between any two channel maps. A matrix multiplication is then performed between X and the reshaped A; the result is reshaped back to R^{C×H×W}, multiplied by a scale weight parameter β, and added element-wise to A to obtain the output feature map E_2, whose j-th channel is
E_2j = β Σ_{i=1}^{C} (X_ji A_i) + A_j
where β is a weight coefficient initialized to 0.3, β ∈ [0, 1]. The resulting feature map E_2 has shape E_2 ∈ R^{C×H×W}. The channel dependencies between the channel maps of the feature map are thereby modeled, which helps improve the discriminability of the model's features. The channel-wise target features enhanced by the channel attention module are more prominent, so objects in the video frames can be recognized by the network.
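A matching sketch of the channel attention block follows; using the reshaped feature map A itself for both M and N (no extra projection layers) is an assumption of this sketch, since the text only states that the two maps are generated by sharing A.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: E_2 = beta * (attention-weighted A) + A."""

    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.full((1,), 0.3))  # initialized to 0.3, as stated in the text
        self.softmax = nn.Softmax(dim=-1)                # normalize over i so that sum_i X_ji = 1

    def forward(self, a):                                # a: (batch, C, H, W)
        batch, c, h, w = a.shape
        feat = a.view(batch, c, -1)                      # (batch, C, N); M and N are the reshaped A here
        energy = torch.bmm(feat, feat.transpose(1, 2))   # (batch, C, C), energy[j, i] = A_j . A_i
        x = self.softmax(energy)                         # X_ji: influence of channel i on channel j
        out = torch.bmm(x, feat).view(batch, c, h, w)    # out_j = sum_i X_ji * A_i
        return self.beta * out + a                       # E_2
```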
The two attention modules are then fused. The fusion operation combines the two feature vectors into one composite vector: the feature map E_1 output by the spatial attention module and the feature map E_2 output by the channel attention module are combined by the fusion operation into a new feature map O. The specific expression is as follows:
O = f(E_1, E_2)
where O is the fused output feature map with size O ∈ R^{C×H×W}, the function f denotes the fusion operation, E_1 has size E_1 ∈ R^{C×H×W}, and E_2 has size E_2 ∈ R^{C×H×W}. The fused feature information about the target object across neighboring frames is richer and more pronounced, which yields better features for video target object segmentation.
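The fusion and final prediction could then be assembled as in the sketch below; element-wise summation of E_1 and E_2 followed by a single 1×1 convolution is one plausible reading of the parallel fusion strategy (concatenation followed by a convolution would be another) and is an assumption of the sketch, not the patent's stated choice.

```python
import torch
import torch.nn as nn

class DualAttentionHead(nn.Module):
    """Fuse the spatial and channel attention outputs and predict the mask."""

    def __init__(self, channels, num_classes=1):
        super().__init__()
        self.spatial = SpatialAttention(channels)   # from the sketch above
        self.channel = ChannelAttention()           # from the sketch above
        # Assumed final convolution layer producing the segmentation output.
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, a):                  # a: feature map from the last RoISeg conv layer
        e1 = self.spatial(a)               # spatial attention output E_1
        e2 = self.channel(a)               # channel attention output E_2
        o = e1 + e2                        # O = f(E_1, E_2), fusion taken as element-wise sum (assumption)
        return torch.sigmoid(self.classifier(o))   # foreground probability mask
```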
Capturing the dependencies by fusing the features of the spatial-dimension information and the channel-dimension information in the attention modules makes full use of the contextual feature information between space and channels. Specifically, the convolution-layer output of the region-of-interest segmentation network is input separately to the two attention modules; through their respective attention mechanisms, the spatial attention module extracts salient spatial information features and the channel attention module extracts salient channel information features. Fusing the features of the two attention modules enhances the discriminative power of the feature representation in video object segmentation and suppresses the influence of interference and noise. One more convolution layer is applied, and the final segmentation mask is output. The experimental results in FIG. 4 and FIG. 5 show that the video object segmentation method is effective, which demonstrates the significance of the method.
Advantageous technical effects
The invention provides a convolutional neural network video object segmentation method with a dual attention module, aimed at problems encountered during video object segmentation such as insufficient data, high processing cost, complex backgrounds, rapid motion, jitter, and oscillation. The invention addresses these interference problems effectively by designing a transformation network, adding trackers to the regions of interest, and applying the dual attention module. The transformation network method solves the problem of poor generalization of the learned model caused by insufficient data during network training. The region of interest is determined by the target proposal, and a tracker is added to the region of interest to predict the likely position of the target in the next frame; this handles the jitter and oscillation caused by rapid motion or camera movement and locates the possible positions of the target. The region of interest with its tracker is input to the region-of-interest segmentation network (RoISeg) designed by the invention to train a model and output segmentation results. Because interference in the video frames disturbs the tracking of the target, the tracked segmentation alone is inaccurate; for this purpose, the dual attention module method is designed. The spatial attention module captures the spatial dependency between any two spatial positions, and the channel attention module captures the channel dependency between any two channel maps. The feature maps output by the two attention modules are fused, which enhances the discriminative power of the feature representation in video object segmentation and suppresses the influence of interference and noise. One more convolution layer is applied, and the final segmentation mask is output.
Drawings
FIG. 1 is a basic flow chart of the method of the present invention
FIG. 2 is a diagram of a network architecture according to the present invention
FIG. 3 is a diagram of the dual attention module network
FIG. 4 and FIG. 5 are graphs of experimental results
In FIG. 2, 1 is the first frame image; 2 is the mask corresponding to the first frame image; 3 is the transformation network operation; 4 is the image pairs generated by the transformation network; 5 is the target proposal generation that determines the regions of interest; 6 is the RoISeg network framework of the invention; 7 is the feature map output by the RoISeg network; 8 is a feature map; 9 is a feature map; 10 is the channel attention module; 11 is the spatial attention module; 12 is the output feature map; and 13 is the final experimental segmentation result.
Detailed description of the preferred embodiments
The technical features of the present invention will now be described in detail with reference to the accompanying drawings.
Referring to FIG. 1, a method for segmenting a video object with a dual-module neural network structure collects and processes data for the pixel-level video object segmentation scene to ensure that the near-target domain has a sufficient amount of training data. The method comprises the following steps:
The first frame of the video and its annotation mask are acquired; the first frame and the annotation mask are used to generate masks for future video frames, a training set of reasonably realistic images is generated, and the expected appearance changes in future video frames are thereby obtained, yielding the near-target domain.
In addition, an attention module mechanism is introduced to capture the target feature dependencies in the spatial and channel attention modules, respectively. The attention module mechanism adds two parallel modules to the extended fully convolutional neural network architecture: one is a spatial position dimension module and the other is a channel information dimension module. Through the processing of these two parallel modules, the spatial position dimension module obtains accurate positional dependencies, and the channel dimension module obtains the dependencies between channel maps.
Finally, the output feature maps of the two dimension modules are fused to obtain a better pixel-level prediction feature representation, and the segmentation result is output through a convolution layer. The segmentation result consists of foreground and background represented by 1 and 0, respectively.
Further, the method of the invention is carried out by a computer according to the following steps:
Step 1: in the video, the first frame image is denoted I_0 and the mask of the first frame image is denoted M_0. The known first frame image I_0 and first-frame mask M_0 are input to the transformation network, and image pairs are generated by the transformation network. An image pair is an image together with its corresponding mask. The transformation network is a network that applies rotation, translation, flipping, and/or scaling operations. The image pairs can be fed to the network training model to remedy the shortage of video-frame data. In this step, the input video comes from a dataset, which may be the DAVIS public video segmentation dataset. By processing the video frame and its corresponding mask in this way, a large number of image pairs is obtained, which alleviates the shortage of video training data. Sufficient data can therefore be obtained for training, and the video result can be predicted accurately.
Step 2: from the first frame image I_0 and the first-frame mask M_0 of step 1, more than one set of image pairs is generated by the transformation network; the image pairs are not identical, and the regions of interest are obtained by the target proposal.
The target proposal is a fully convolutional network that takes an image of arbitrary size as input and outputs a set of rectangular target proposal boxes for the image. The target proposal obtains the regions of interest by scoring candidate boxes. The specific steps are as follows:
A target proposal is generated around the target in the first frame image I_0.
IoU is obtained as follows; IoU, the intersection-over-union, is the ratio of the intersection to the union of the predicted region and the ground-truth region.
First: target proposals are generated randomly around the target in each image produced by the transformation network in step 1, and the IoU between the generated image's target proposal and the target proposal of the first frame image I_0 is obtained.
Second: the IoU score between the generated image mask and the initial mask is obtained. The initial mask is the first-frame mask M_0.
A representative image pair whose IoU score exceeds 0.75 is selected and is referred to as a region of interest (RoI).
Then a tracker is added to the region of interest; the tracker locates the target effectively in the next frame. Given the current-frame mask and the next-frame image, the tracker predicts the position of the target mask in the next frame, so the mask region of the next frame image is obtained, providing temporal consistency for the regions of interest of subsequent frames.
Step 3: once the region of interest has been located in the next frame, the region of interest with its tracker is input to the region-of-interest segmentation network (RoISeg) of the invention, which is trained to predict the target. RoISeg is based on a deep convolutional neural network (CNN) and forms its network framework on the ResNet101 backbone, abbreviated RoISeg. A CNN is a convolutional neural network in deep learning. The ResNet101 backbone is a deep residual learning framework designed to address the accuracy-degradation problem and has lower training and test error. The region of interest with its tracker is input to RoISeg to train the model, which outputs a coarse target position and a segmentation mask.
Step 4: the target result predicted by RoISeg from the region of interest with its tracker in step 3 still contains a large error, and the noisy part of the segmentation must be reduced. The invention therefore constructs the dual attention module method: the feature map output by the last convolution layer of RoISeg is input to the dual attention module, which includes a spatial attention module and a channel attention module; see FIG. 1 and FIG. 3 for details.
The spatial attention module introduces a spatial attention mechanism to capture the spatial dependency between any two spatial positions. The spatial attention mechanism consists of the function operations inside the spatial attention module. For the target position features in a frame, the features at each position are updated by a weighted sum over all positions, where the weights are determined by the feature similarity between the corresponding two positions. In other words, any two positions with similar features can reinforce each other regardless of their distance in the spatial dimension.
The channel attention module captures the channel dependency between any two channel maps through a channel attention mechanism and updates each channel map using a weighted sum of all channel maps. The channel attention mechanism consists of the function operations inside the channel attention module.
Finally, the outputs of the two attention modules are fused. The fusion operation is a parallel strategy that combines the two feature vectors into one composite vector. It fuses together information about the object across neighboring frames, which yields better features for video object segmentation. Capturing the dependencies between the spatial-dimension information and the channel-dimension information inside the dual attention module enhances the discriminative power of the feature representation in video object segmentation and suppresses the influence of interference and noise. After one more convolution layer, the final segmentation mask is output.
Further, in step 1 a video is input to the computer; each frame of the video is a picture. The picture is in RGB format and is denoted RGB picture I. The target annotation in this image is denoted mask M. The mask is a binary separation of the image into foreground and background.
First, the video and the mask of its first frame are input: the first frame I_0 and the first mask M_0 are fed into the transformation network G to obtain the transformed image pairs D. The specific expression is as follows:
D_n = G(I_0, M_0)
where G denotes the transformation network. The set of image pairs D_n = {d_1 m_1, d_2 m_2, ..., d_n m_n} denotes the n generated image pairs, and d_i m_i denotes the i-th image pair, where d_i is the image generated by the i-th transformation and m_i is the corresponding generated mask. The image pairs produced by the transformation network are then screened to decide which of them qualify as regions of interest.
Further, the specific steps of step 2 are as follows:
The image pairs generated by the transformation network are screened to decide which qualify as regions of interest. The region of interest is obtained using the target proposal. The target proposal is a typical fully convolutional network that inputs an image of arbitrary size and outputs a set of rectangular target proposal boxes for the image. The target proposal operation is applied around the target in the first frame, and the result is denoted gt_box; gt_box is the ground-truth bounding box around the object of the first frame. The bounding box produced by applying the target proposal operation around the object in a generated image is denoted b_box; b_box is the target proposal box output when an image pair is input to the target proposal, as shown at label 5 in FIG. 2. The IoU score between the proposal box of the generated image and the proposal box of the first frame is then computed. The specific expression is as follows:
S = IoU(b_box, gt_box)
where IoU is the intersection-over-union function and S is the IoU score between the target proposal box of the image pair and the target proposal box of the first frame. The image pairs whose IoU score satisfies S > 0.75 are kept as representative regions of interest. Then a tracker is added to the region of interest; the tracker can be used to locate the target effectively in the next frame. Given the current-frame mask and the next-frame image, the tracker predicts the position of the target mask in the next frame, so the mask region of the next frame image is obtained, providing temporal consistency for the regions of interest of subsequent frames. A video sequence R = {I_0, I_1, I_2, I_3, ..., I_t, ..., I_n} is known, together with the mask M_0 of the first frame I_0. I_t is the t-th frame in the video sequence, t ∈ {1, 2, 3, ..., n}. The masks {M_1, M_2, M_3, ..., M_n} of the remaining frames are obtained from the tracker function as follows:
M_{t+1} = f(I_{t+1}, M_t)
where f denotes the tracker function, I_{t+1} denotes the known image of frame t+1, M_t denotes the known mask of frame t, and M_{t+1} is the predicted mask of frame t+1. Given the second-frame image and the first-frame image with its mask, the mask of the second frame is obtained by the tracker. Because the object moves smoothly in space, the change between consecutive video frames is very small and the frames are correlated; from the mask M_t and the frame I_{t+1}, the mask M_{t+1} of frame I_{t+1} is predicted. The predicted mask of frame I_{t+1} still has a large error with respect to the true mask M_gt, where M_gt denotes the accurate ground-truth mask. The region of interest with its tracker is then input to the region-of-interest segmentation network.
Further, the specific steps of step 3 are as follows:
Based on step 2, the region of interest with its tracker is input to the region-of-interest segmentation network (RoISeg) of the invention, which is trained to predict the target. RoISeg is based on a deep convolutional neural network (CNN); its framework is built on the ResNet101 backbone and is referred to as the region-of-interest segmentation network. The ResNet101 backbone is a deep residual learning framework designed to address the accuracy-degradation problem and has lower training and test error. The RoISeg network is composed of convolution layers, pooling layers, activation functions, batch normalization, deconvolution, and so on. The initial parameters in RoISeg are set as follows: the learning rate is 0.0001 and the weight decay term is 0.005. The final output of RoISeg is constrained using a weighted cross-entropy loss of the form
L(θ) = -β Σ_{x∈X+} log P(x) - (1 - β) Σ_{x∈X-} log(1 - P(x))
where L(θ) denotes the weighted cross-entropy loss and θ, with values in [0, 1], denotes the weight parameters of the network related to the current prediction. X+ and X- denote the sets of pixels carrying positive and negative target sample labels, respectively: the positive samples are the true correct samples and the negative samples are the mispredicted samples; in other words, the positive and negative samples of the video-frame mask are defined per pixel. β is the weighting term that penalizes biased sampling during training. The activation output of the convolution layer yields a probability P representing the probability distribution, P ∈ [0, 1]; the activation function is the common nonlinear Sigmoid with value range [0, 1]. The training output layer of the region-of-interest segmentation network is constrained by the cross-entropy loss, which is back-propagated through the network for continued training; as the training loss decreases gradually, it converges to a small and stable value. The output is the target segmentation result, a segmentation map of mask foreground and background.
Further, the specific steps of step 4 are as follows:
In step 3 the RoISeg network predicts an output target result that still has a large error, and the noisy part of the segmentation must be reduced. The invention therefore constructs the dual attention module method: the feature map output by the last convolution layer of RoISeg is input separately to the two attention modules, namely a spatial attention module and a channel attention module. Specifically:
Spatial attention module: a spatial attention mechanism is introduced to enrich the contextual feature dependencies for the target in the video frame; its operations are described in detail below. The spatial attention module is shown as S in FIG. 3. The feature map output by the convolution layer of RoISeg is denoted A, A ∈ R^{C×H×W}, where R denotes the real values, the shape of A is C×H×W, C is the number of channels, H the height, and W the width. First, three new feature maps B, D and F are generated from the shared feature map A, where {B, D, F} ∈ R^{C×H×W}. They are then reshaped to R^{C×N}, where N = H×W is the product of height and width. The transpose of B is multiplied with D, and a softmax layer is applied to obtain the spatial attention feature map S ∈ R^{N×N}. The specific expression is as follows:
S_ij = exp(B_i · D_j) / Σ_{i=1}^{N} exp(B_i · D_j)
where S_ij measures the influence of the i-th spatial position on the j-th spatial position, and the exponential of the dot product acts as a similarity: the more similar the features of the two positions, the larger their mutual contribution. As stated above, this captures the spatial dependency between any two spatial positions. F also has shape R^{C×N}. A matrix multiplication is then performed between F and the transpose of S; the result has shape R^{C×N} and is reshaped back to R^{C×H×W}. Finally the result is multiplied by a scale parameter α and added element-wise to the feature map A, giving the output feature map E_1, whose j-th position is
E_1j = α Σ_{i=1}^{N} (S_ij F_i) + A_j
where α is a weight coefficient initialized to 0, α ∈ [0, 1], and gradually learns to assign more weight. The resulting feature map E_1 has shape E_1 ∈ R^{C×H×W}. For the target feature positions in a video frame, the features at each position are aggregated by a weighted sum, where the weights are determined by the feature similarity between the corresponding two positions; in other words, any two positions with similar features reinforce each other regardless of their distance in the spatial dimension. The contextual feature representations are thus aggregated selectively according to the spatial attention map, which promotes the interdependence of information within the same class.
Channel attention module: the channel dependency between any two channel maps is captured by the channel attention mechanism operation. The feature map output by the convolution layer of RoISeg is again denoted A, A ∈ R^{C×H×W}, where R denotes the real values, the shape of A is C×H×W, C is the number of channels, H the height, and W the width. Two new feature maps M and N are generated from the shared feature map A, where {M, N} ∈ R^{C×H×W}, and are then reshaped to R^{C×N}. A matrix multiplication between M and the transpose of N directly yields the channel feature map, and a softmax layer gives the channel attention feature map X ∈ R^{C×C}. The specific expression is as follows:
X_ji = exp(M_i · N_j) / Σ_{i=1}^{C} exp(M_i · N_j)
where X_ji measures the influence of the i-th channel on the j-th channel; as mentioned above, the channel attention module captures the channel dependency between any two channel maps. In addition, the matrix X and the feature map A reshaped to R^{C×N} are multiplied; the result of the matrix multiplication has shape R^{C×N} and is reshaped back to R^{C×H×W}. It is then multiplied by a scale weight parameter β and added element-wise to A to obtain the output feature map E_2, whose j-th channel is
E_2j = β Σ_{i=1}^{C} (X_ji A_i) + A_j
where β is a weight coefficient initialized to 0.3, β ∈ [0, 1]. The resulting feature map E_2 has shape E_2 ∈ R^{C×H×W}. The channel dependencies between the channel maps of the feature map are thereby modeled, which helps improve the discriminability of the model's features. The channel-wise target features enhanced by the channel attention module are more prominent, so objects in the video frames can be recognized by the network.
The two attention modules are then fused. The fusion operation combines the two feature vectors into one composite vector: the feature map E_1 output by the spatial attention module and the feature map E_2 output by the channel attention module are combined by the fusion operation into a new feature map O. The specific expression is as follows:
O = f(E_1, E_2)
where O is the fused output feature map with size O ∈ R^{C×H×W}, the function f denotes the fusion operation, E_1 has size E_1 ∈ R^{C×H×W}, and E_2 has size E_2 ∈ R^{C×H×W}. The fused feature information about the target object across neighboring frames is richer and more pronounced, which yields better features for video target object segmentation.
Capturing the dependencies by fusing the features of the spatial-dimension information and the channel-dimension information in the attention modules makes full use of the contextual feature information between space and channels. Specifically, the convolution-layer output of the region-of-interest segmentation network is input separately to the two attention modules; through their respective attention mechanisms, the spatial attention module extracts salient spatial information features and the channel attention module extracts salient channel information features. Fusing the features of the two attention modules enhances the discriminative power of the feature representation in video object segmentation and suppresses the influence of interference and noise. One more convolution layer is applied, and the final segmentation mask is output.
The experimental results in FIG. 4 and FIG. 5 show that the video object segmentation method produces effective results, which demonstrates the significance of the method.
Examples
The experimental hardware environment of the invention is: a PC with a 3.4 GHz Intel(R) Core(TM) i5-7500 CPU, a GTX 1080Ti GPU, and 16 GB of memory, running the Ubuntu 18.04 operating system, implemented on the open-source PyTorch deep learning framework. Training and testing use an image size of 854×480. The test results (FIG. 4 and FIG. 5) are obtained on the DAVIS public video segmentation dataset.
First, the first frame and the mask of the first frame are given (shown at 1 and 2 in FIG. 2). Between 1 and 100 image pairs are generated by the transformation network (shown at 4 in FIG. 2). Candidate regions of interest are selected by the target proposal boxes (shown at 5 in FIG. 2). After the trackers are added, the regions of interest are used to train the RoISeg network (shown at 6 in FIG. 2). The feature map output by the last convolution layer of the RoISeg network (shown at 7 in FIG. 2) is input to the spatial attention module and the channel attention module, respectively. Finally, the feature maps output by the spatial attention module and the channel attention module are fused (shown at 12 in FIG. 2), and the segmentation result map is output. The experimental results in FIG. 4 and FIG. 5 show that the video object segmentation method produces effective results, which demonstrates the significance of the method.
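For orientation, a hedged end-to-end driver corresponding to the pipeline above might look as follows; the tracker, the RoISeg backbone, and the dual attention head are placeholder callables, since the patent does not pin down their implementations, and the helper function refers to the earlier sketches.

```python
def segment_video(frames, first_mask, tracker, roiseg, dual_attention_head):
    """Sketch of the inference pipeline: track each frame, segment the region, refine with attention.

    tracker, roiseg and dual_attention_head are placeholder callables standing in for the
    components described above; RoISeg is assumed to have been trained beforehand on the
    image pairs produced by generate_image_pairs.
    """
    masks = propagate_masks(frames, first_mask, tracker)    # coarse masks from the tracker
    results = []
    for frame, coarse_mask in zip(frames, masks):
        features = roiseg(frame, coarse_mask)               # last-conv-layer feature map of RoISeg
        results.append(dual_attention_head(features))       # fused dual attention + final conv
    return results
```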

Claims (5)

1. A method for segmenting a video object with a dual-module neural network structure, characterized in that the method collects and processes data for a pixel-level video object segmentation scene to ensure that the near-target domain has a sufficient amount of training data, comprising:
step 1: acquiring the first frame of a video and its mask, wherein the first frame and the mask are used for generating masks of future video frames, generating a training set of reasonably realistic images, and thereby acquiring the expected appearance changes in future video frames to obtain the near-target domain; acquiring the first frame image and the mask of the first frame image, and inputting them into a transformation network to generate image pairs;
step 2: screening the image pairs generated by the transformation network to judge whether each image pair is used as a region of interest, and acquiring the regions of interest using a target proposal; the target proposal is a fully convolutional network that takes an image of arbitrary size as input and outputs a set of rectangular target proposal boxes; performing a target proposal operation around the target in the first frame image, the result being denoted gt_box, wherein gt_box is the ground-truth bounding box around the object of the first frame image; performing the target proposal operation around the object in a generated image to produce a bounding box denoted b_box, wherein b_box is the target proposal box output when the image pair is input to the target proposal; calculating the IoU score between the target proposal of the generated image and the target proposal of the first frame image; the specific expression is as follows:
S = IoU(b_box, gt_box)
wherein IoU is the intersection-over-union function and S is the IoU score between the target proposal box of the image pair and the target proposal box of the first frame image; the image pairs with IoU score S > 0.75 are taken as representative regions of interest;
step 3: adding a tracker to the region of interest and inputting it into the region-of-interest segmentation network for training: the tracker takes the current-frame mask and the next-frame image as input and predicts the position of the target mask in the next frame; the tracker is used to acquire the mask region of the next frame image and provides temporal consistency for the regions of interest of subsequent frames; given a video sequence R = {I_0, I_1, I_2, I_3, ..., I_t, ..., I_n} and the mask M_0 of the first frame image I_0, where I_t is the t-th frame image in the video sequence and t ∈ {1, 2, 3, ..., n}, the masks {M_1, M_2, M_3, ..., M_n} of the remaining frames are obtained; the tracker function expression is as follows:
M_{t+1} = f(I_{t+1}, M_t)
wherein f denotes the tracker function, I_{t+1} denotes the (t+1)-th frame image, M_t denotes the mask of the t-th frame image, and M_{t+1} denotes the mask of the (t+1)-th frame image; given the mask of the first frame image and the second frame image, the mask of the second frame image is obtained through the tracker; because the target moves smoothly in space, the change between video frames is very small and the frames are strongly correlated; from the mask M_t and the frame I_{t+1}, the mask M_{t+1} of frame I_{t+1} is predicted; the predicted mask of frame I_{t+1} may still have a considerable error with respect to the true mask M_gt, where M_gt denotes the true, accurate mask; the region of interest with the added tracker is then input into the region-of-interest segmentation network;
step 4: an attention module mechanism is introduced, formed by adding two parallel modules on top of the dilated fully convolutional neural network architecture: one is a spatial attention module and the other is a channel attention module; the feature map output by the last convolution layer of the region-of-interest segmentation network is input into the spatial attention module and the channel attention module, where the spatial attention module obtains accurate positional dependency relationships and the channel attention module obtains dependency relationships between channel maps; the feature maps output by the two attention modules are fused and, to obtain a better pixel-level prediction feature representation, the segmentation result is output through one convolution layer; the segmentation result consists of foreground and background, represented by 1 and 0 respectively.
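To illustrate the screening rule of step 2, the sketch below computes S = IoU(b_box, gt_box) for axis-aligned boxes and keeps only the image pairs whose score exceeds 0.75; the (x1, y1, x2, y2) box format and the helper names are assumptions made for the example.

```python
import torch

def iou(box_a: torch.Tensor, box_b: torch.Tensor) -> float:
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    x1, y1 = torch.max(box_a[0], box_b[0]), torch.max(box_a[1], box_b[1])
    x2, y2 = torch.min(box_a[2], box_b[2]), torch.min(box_a[3], box_b[3])
    inter = (x2 - x1).clamp(min=0) * (y2 - y1).clamp(min=0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return float(inter / (area_a + area_b - inter))

def select_regions_of_interest(gt_box, proposal_boxes, threshold=0.75):
    """Keep the indices of generated image pairs whose proposal box b_box
    satisfies S = IoU(b_box, gt_box) > threshold."""
    return [i for i, b in enumerate(proposal_boxes) if iou(b, gt_box) > threshold]

# Usage with made-up boxes:
gt = torch.tensor([50., 40., 200., 180.])
proposals = [torch.tensor([55., 45., 205., 185.]), torch.tensor([0., 0., 60., 60.])]
print(select_regions_of_interest(gt, proposals))   # -> [0]
```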
2. The video object segmentation method with a dual-module neural network structure according to claim 1, characterized in that:
the specific steps of step 1 are as follows: a video is input into a computer, and each frame of the video is a picture; the picture is in RGB format and is denoted as an RGB picture I; the target label in the image is denoted as a mask M; the mask is the binary foreground and background of the image;
First, the first frame image I_0 and the mask M_0 of the first frame image are input into a transformation network G to obtain transformed image pairs D; the specific expression is as follows:
D_n = G(I_0, M_0)
wherein G represents the transformation network; the image pair set D_n = {d_1 m_1, d_2 m_2, ..., d_n m_n}, where D_n represents n image pairs; d_i m_i represents the i-th image pair, in which d_i represents the image generated by the i-th transformation and m_i represents the mask generated by the i-th transformation.
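The claim does not disclose the internal structure of the transformation network G; purely as an illustration, the sketch below synthesizes (d_i, m_i) pairs by applying the same random affine transform to the first-frame image and its mask with torchvision, which is one plausible way such a generator could behave.

```python
import random
import torch
import torchvision.transforms.functional as TF
from torchvision.transforms import InterpolationMode

def generate_pairs(image: torch.Tensor, mask: torch.Tensor, n: int = 100):
    """Hypothetical stand-in for G: `image` and `mask` are float tensors of
    shape (C, H, W); each iteration applies one shared random geometric
    transform to both, yielding n plausible (d_i, m_i) training pairs."""
    pairs = []
    for _ in range(n):
        angle = random.uniform(-15.0, 15.0)
        translate = [random.randint(-20, 20), random.randint(-20, 20)]
        scale = random.uniform(0.9, 1.1)
        d_i = TF.affine(image, angle=angle, translate=translate, scale=scale,
                        shear=[0.0], interpolation=InterpolationMode.BILINEAR)
        m_i = TF.affine(mask, angle=angle, translate=translate, scale=scale,
                        shear=[0.0], interpolation=InterpolationMode.NEAREST)
        pairs.append((d_i, m_i))
    return pairs
```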
3. The video object segmentation method with a dual-module neural network structure according to claim 2, characterized in that the specific steps of step 3 are as follows:
the region of interest with the added tracker is input into the region-of-interest segmentation network for training; the region-of-interest segmentation network RoISeg is based on a deep convolutional neural network (CNN), and the network framework is an improvement on the ResNet101 backbone network; the ResNet101 network addresses the problem of accuracy degradation through a deep residual learning framework and has lower training and test errors; the RoISeg network is composed of convolution layers, pooling layers, activation functions, batch normalization and deconvolution; the initial parameters of RoISeg are set as follows: the learning rate is 0.0001 and the weight decay term is 0.005; the final output of RoISeg is constrained using a weighted cross-entropy loss; the cross-entropy loss expression is as follows:
wherein L(θ) represents the weighted cross-entropy loss, and θ, with value range [0, 1], represents the weight parameters associated with the current prediction in the network; X+ and X− represent the sets of pixels with positive and negative target sample labels respectively, i.e. the sets of pixels of positive and negative samples of the video frame mask, where positive samples are true correct samples and negative samples are wrongly predicted samples; β is a weight decay term that penalizes biased sampling during training; the activation function of the convolution layer computes the output P, which represents a probability distribution, P ∈ [0, 1]; the activation function is the common nonlinear Sigmoid function, with value range [0, 1]; the output layer of the region-of-interest segmentation network is constrained by the cross-entropy loss, which is then propagated back into the network for continued training; the loss gradually decreases during training until it is sufficiently small and stable; the target segmentation result is then output; the output result is a segmentation map of the mask foreground and background.
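The exact weighted cross-entropy expression appears in the patent as a formula image and is not reproduced here; the function below is a common class-balanced form that is consistent with the description (separate weighting of the positive-pixel set X+ and the negative-pixel set X−, with a Sigmoid output P in [0, 1]) and is offered only as an assumed illustration.

```python
import torch

def class_balanced_bce(pred: torch.Tensor, target: torch.Tensor,
                       beta: float = 0.5, eps: float = 1e-7) -> torch.Tensor:
    """pred is the Sigmoid output P in [0, 1], target is the binary mask;
    pixels in X+ (target == 1) and X- (target == 0) are weighted by beta."""
    pred = pred.clamp(eps, 1.0 - eps)
    pos = target * torch.log(pred)                  # contribution of X+
    neg = (1.0 - target) * torch.log(1.0 - pred)    # contribution of X-
    return -(beta * pos + (1.0 - beta) * neg).mean()

# Usage at the stated training resolution of 854x480:
p = torch.rand(1, 1, 480, 854)
m = (torch.rand(1, 1, 480, 854) > 0.5).float()
loss = class_balanced_bce(p, m)
```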
4. The video object segmentation method with a dual-module neural network structure according to claim 3, characterized in that the specific steps of step 4 are as follows:
the spatial attention module: a spatial attention mechanism is introduced to capture the spatial dependence between any two spatial positions; the feature map output from the convolution layer of RoISeg is denoted A, A ∈ R^{C×H×W}, where R denotes the set of real numbers, the shape of A is C×H×W, C denotes the number of channels, H the height and W the width; first, three new feature maps B, D and F are generated by sharing the feature map A, where {B, D, F} ∈ R^{C×H×W}; their shapes are then reshaped to R^{C×N}, where N = H×W is the product of height and width; next, matrix multiplication is performed between the transpose of B and D, and a softmax layer is applied to compute the spatial-dimension attention feature map S ∈ R^{N×N}; the specific expression is as follows:
wherein S_ij measures the influence of the i-th spatial position on the j-th spatial position; exp denotes the exponential of the similarity between two positions, and the smaller the distance between them, the more similar the two positions are; F has shape R^{C×N}; a matrix multiplication is then performed between F and the transpose of S, the resulting feature map has shape R^{C×N} and is reshaped back to R^{C×H×W}; finally, the result is multiplied by a scale parameter α and an element-wise sum with the feature A is performed to obtain the output feature map E_1; the specific expression is as follows:
wherein α is a weight coefficient initialized to 0, α ∈ [0, 1], which gradually learns to assign more weight; E_1 is the feature map resulting from the sum operation, E_1 ∈ R^{C×H×W}; each target feature position in the video frame is updated by a weighted sum of the aggregated position features, where the weights are determined by the feature similarity between the corresponding two positions; any two positions with similar features can promote mutual improvement regardless of their distance in the spatial dimension; the contextual feature representations are selectively aggregated according to the spatial attention map, thereby promoting information interdependence within the same class.
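A PyTorch-style sketch of the spatial attention module described in this claim is given below; the 1×1 convolutions used to derive B, D and F from A are an assumption, since the claim only states that A is shared into three feature maps.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Position-attention sketch: S in R^{N x N} weights every spatial position
    against every other, and the output is alpha * (attended features) + A."""
    def __init__(self, channels: int):
        super().__init__()
        self.proj_b = nn.Conv2d(channels, channels, kernel_size=1)  # assumed projection for B
        self.proj_d = nn.Conv2d(channels, channels, kernel_size=1)  # assumed projection for D
        self.proj_f = nn.Conv2d(channels, channels, kernel_size=1)  # assumed projection for F
        self.alpha = nn.Parameter(torch.zeros(1))   # scale weight, initialised to 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        batch, c, h, w = a.shape
        n = h * w
        b = self.proj_b(a).view(batch, c, n)        # B reshaped to C x N
        d = self.proj_d(a).view(batch, c, n)        # D reshaped to C x N
        f = self.proj_f(a).view(batch, c, n)        # F reshaped to C x N
        s = self.softmax(torch.bmm(b.transpose(1, 2), d))         # S: N x N
        out = torch.bmm(f, s.transpose(1, 2)).view(batch, c, h, w)
        return self.alpha * out + a                 # E1 = alpha * out + A
```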
5. The video object segmentation method with a dual-module neural network structure according to claim 4, characterized in that the specific steps of step 4 further comprise:
the channel attention module: the channel dependency relationship between any two channel maps is captured through the channel attention mechanism operation; the feature map output from the convolution layer of RoISeg is likewise denoted A, A ∈ R^{C×H×W}, where R denotes the set of real numbers, the shape of A is C×H×W, C denotes the number of channels, H the height and W the width; the feature map A is shared to generate two new feature maps M and N respectively, where {M, N} ∈ R^{C×H×W}; their shapes are then reshaped to R^{C×N}; matrix multiplication is performed between M and the transpose of N to directly compute the channel feature map X ∈ R^{C×C}; a softmax layer is applied to obtain the channel attention feature map X ∈ R^{C×C}; the specific expression is as follows:
wherein X_ji measures the influence of the i-th channel on the j-th channel; in addition, the X matrix and the A feature map, reshaped to R^{C×N}, are matrix-multiplied, and the result of shape R^{C×N} is reshaped to R^{C×H×W}; the result is then multiplied by a scale weight parameter β and an element-wise sum with A is performed to obtain the output feature map E_2; the specific expression is as follows:
wherein β is a weight coefficient initialized to 0.3, β ∈ [0, 1]; E_2 is the feature map resulting from the sum operation, E_2 ∈ R^{C×H×W}; the channel dependency relationships between the channel maps of the feature map are thereby modelled, which helps to improve the discriminability of the model features; the channel target features are made more prominent through the enhancement of the channel attention module, so that the network can better identify the targets in the video frames;
a fusion operation is performed on the two attention modules; the fusion operation combines the two feature vectors into one composite vector; the feature map E_1 output by the spatial attention module and the feature map E_2 output by the channel attention module are fused to obtain a new feature map O; the specific expression is as follows:
O = f(E_1, E_2)
wherein O is the fused output feature map, O ∈ R^{C×H×W}; the function f denotes the fusion operation; E_1 ∈ R^{C×H×W}; E_2 ∈ R^{C×H×W}.
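Finally, a sketch of the channel attention module and the fusion operation of claim 5; computing the channel affinity X directly from A and initialising β to 0.3 follow the claim, while the element-wise sum used for f(E_1, E_2) is an assumption of the sketch.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel-attention sketch: X in R^{C x C} weights every channel map
    against every other, and the output is beta * (attended features) + A."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.full((1,), 0.3))  # scale weight, initialised to 0.3
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        batch, c, h, w = a.shape
        m = a.view(batch, c, -1)                           # M reshaped to C x N
        n = a.view(batch, c, -1)                           # N reshaped to C x N
        x = self.softmax(torch.bmm(m, n.transpose(1, 2)))  # X: C x C
        out = torch.bmm(x, a.view(batch, c, -1)).view(batch, c, h, w)
        return self.beta * out + a                         # E2 = beta * out + A

def fuse(e1: torch.Tensor, e2: torch.Tensor) -> torch.Tensor:
    """Fusion f(E1, E2); an element-wise sum is assumed, since the claim only
    states that the two feature maps are combined into one."""
    return e1 + e2
```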
CN201911125917.3A 2019-11-15 2019-11-15 Video object segmentation method for dual-module neural network structure Active CN110910391B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911125917.3A CN110910391B (en) 2019-11-15 2019-11-15 Video object segmentation method for dual-module neural network structure

Publications (2)

Publication Number Publication Date
CN110910391A CN110910391A (en) 2020-03-24
CN110910391B true CN110910391B (en) 2023-08-18

Family

ID=69816867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911125917.3A Active CN110910391B (en) 2019-11-15 2019-11-15 Video object segmentation method for dual-module neural network structure

Country Status (1)

Country Link
CN (1) CN110910391B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461772A (en) * 2020-03-27 2020-07-28 上海大学 Video advertisement integration system and method based on generation countermeasure network
CN111583288B (en) * 2020-04-21 2022-12-09 西安交通大学 Video multi-target association and segmentation method and system
CN111462140B (en) * 2020-04-30 2023-07-07 同济大学 Real-time image instance segmentation method based on block stitching
CN111968150B (en) * 2020-08-19 2022-09-02 中国科学技术大学 Weak surveillance video target segmentation method based on full convolution neural network
CN112085717B (en) * 2020-09-04 2024-03-19 厦门大学 Video prediction method and system for laparoscopic surgery
CN112288755A (en) * 2020-11-26 2021-01-29 深源恒际科技有限公司 Video-based vehicle appearance component deep learning segmentation method and system
CN115349139A (en) * 2020-12-21 2022-11-15 广州视源电子科技股份有限公司 Image segmentation method, device, equipment and storage medium
CN113033428A (en) * 2021-03-30 2021-06-25 电子科技大学 Pedestrian attribute identification method based on instance segmentation
CN113044561B (en) * 2021-05-31 2021-08-27 山东捷瑞数字科技股份有限公司 Intelligent automatic material conveying method
CN113421280A (en) * 2021-05-31 2021-09-21 江苏大学 Method for segmenting reinforcement learning video object by integrating precision and speed
CN113298036B (en) * 2021-06-17 2023-06-02 浙江大学 Method for dividing unsupervised video target
CN116030247B (en) * 2023-03-20 2023-06-27 之江实验室 Medical image sample generation method and device, storage medium and electronic equipment

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942794A (en) * 2014-04-16 2014-07-23 南京大学 Image collaborative cutout method based on confidence level
US20190311202A1 (en) * 2018-04-10 2019-10-10 Adobe Inc. Video object segmentation by reference-guided mask propagation
CN109272530A (en) * 2018-08-08 2019-01-25 北京航空航天大学 Method for tracking target and device towards space base monitoring scene
CN110084829A (en) * 2019-03-12 2019-08-02 上海阅面网络科技有限公司 Method for tracking target, device, electronic equipment and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Video object segmentation algorithm fusing motion cues and color information under dynamic background; Cui Zhigao et al.; Journal of Optoelectronics·Laser (光电子·激光); 2014-08-15; Vol. 25, No. 8; pp. 1548-1557 *

Similar Documents

Publication Publication Date Title
CN110910391B (en) Video object segmentation method for dual-module neural network structure
Chen et al. Scale-aware domain adaptive faster r-cnn
CN107945204B (en) Pixel-level image matting method based on generation countermeasure network
CN110097568B (en) Video object detection and segmentation method based on space-time dual-branch network
CN106547880B (en) Multi-dimensional geographic scene identification method fusing geographic area knowledge
Žbontar et al. Stereo matching by training a convolutional neural network to compare image patches
Hu et al. Automated building extraction using satellite remote sensing imagery
Liu et al. Adversarial unsupervised domain adaptation for 3D semantic segmentation with multi-modal learning
CN112434618B (en) Video target detection method, storage medium and device based on sparse foreground priori
Li et al. Robust deep neural networks for road extraction from remote sensing images
Jiang et al. A self-attention network for smoke detection
CN111368634B (en) Human head detection method, system and storage medium based on neural network
Zhao et al. Bitnet: A lightweight object detection network for real-time classroom behavior recognition with transformer and bi-directional pyramid network
CN114863348A (en) Video target segmentation method based on self-supervision
Wu et al. STR transformer: a cross-domain transformer for scene text recognition
CN111179272A (en) Rapid semantic segmentation method for road scene
Tang et al. A Siamese network-based tracking framework for hyperspectral video
Feng et al. Local complexity difference matting based on weight map and alpha mattes
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
Cao et al. QuasiVSD: efficient dual-frame smoke detection
Fadavi Amiri et al. Improving image segmentation using artificial neural networks and evolutionary algorithms
CN116258937A (en) Small sample segmentation method, device, terminal and medium based on attention mechanism
CN112487927B (en) Method and system for realizing indoor scene recognition based on object associated attention
Liu et al. Spatiotemporal saliency based multi-stream networks for action recognition
Xiong et al. TFA-CNN: an efficient method for dealing with crowding and noise problems in crowd counting

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant