CN110910391A - Video object segmentation method with dual-module neural network structure - Google Patents

Video object segmentation method with dual-module neural network structure

Info

Publication number
CN110910391A
CN110910391A (application CN201911125917.3A)
Authority
CN
China
Prior art keywords
frame
target
network
mask
image
Prior art date
Legal status
Granted
Application number
CN201911125917.3A
Other languages
Chinese (zh)
Other versions
CN110910391B (en)
Inventor
汪粼波
陈彬彬
方贤勇
Current Assignee
Anhui University
Original Assignee
Anhui University
Priority date
Filing date
Publication date
Application filed by Anhui University filed Critical Anhui University
Priority to CN201911125917.3A priority Critical patent/CN110910391B/en
Publication of CN110910391A publication Critical patent/CN110910391A/en
Application granted granted Critical
Publication of CN110910391B publication Critical patent/CN110910391B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06T7/10 Segmentation; Edge detection
    • G06T7/207 Analysis of motion for motion estimation over a hierarchy of resolutions
    • G06T2207/10016 Video; Image sequence
    • G06T2207/20081 Training; Learning
    • G06T2207/20084 Artificial neural networks [ANN]
    • G06T2207/20104 Interactive definition of region of interest [ROI]
    • G06T2207/20221 Image fusion; Image merging
    • Y02T10/40 Engine management systems

Abstract

The invention provides a video object segmentation method with a dual-module neural network structure, which addresses the problem that noise interference during video object segmentation leads to unsatisfactory segmentation results. The method comprises the following steps: the first frame image and the mask of the first frame are input into a transformation network to generate image pairs; a target proposal box is generated for each image pair to determine whether the image pair is a region of interest; the region of interest, together with a tracker, is input into the RoI segmentation network to train a learning model and output its prediction; the feature map output by the last convolutional layer of the RoI segmentation network is fed separately into a spatial attention module and a channel attention module; finally, the feature maps output by the two attention modules are fused, and the final segmentation mask is output after a convolutional layer operation. The invention obtains good segmentation results on the DAVIS video dataset.

Description

Video object segmentation method with dual-module neural network structure
Technical Field
The invention belongs to the field of computer vision and relates to the segmentation of video objects that undergo large scale changes and unpredictable dynamic appearance changes, and in particular to a video object segmentation method with a dual-module neural network structure.
Background
With the rapid development of computer vision technology in recent years, convolutional neural networks in deep learning have attracted great attention in many research fields, and video object segmentation has become an important topic for researchers. Video segmentation techniques are increasingly showing their importance: they are applied to scene understanding, video annotation, autonomous vehicles, object detection and the like, and are developing rapidly. It can be said that progress in video segmentation drives the development of computer vision technology as a whole. However, video object segmentation is not only a research hotspot but also a research difficulty. The goal of segmentation is to find an accurate position for each object in a video, but the process is limited by many factors, such as motion speed, object deformation, occlusion between instances and cluttered backgrounds, which may come from different camera devices and different scenes. This makes video object segmentation very challenging, and results on real-world scenes are still poor; such images undoubtedly pose a great challenge to video object segmentation techniques.
In recent years, many scholars have conducted extensive research on video segmentation and achieved good academic results. For unsupervised video object segmentation, the unsupervised methods mainly segment moving objects from the background without any prior knowledge of the target; they aim to automatically find and separate salient objects from the background. These methods are based on probabilistic models, motion and object proposals. Existing methods typically rely on visual cues (such as superpixels, saliency maps or optical flow) to acquire initial object regions and require processing the entire video in batch mode to produce the segmentation. Furthermore, generating and processing thousands of candidate regions in each frame usually consumes a significant amount of time. These unsupervised methods cannot segment a particular object when the motion of different instances is confused with a dynamic background. Many semi-supervised video object segmentation methods rely on fine-tuning with the first-frame ground truth: a convolutional network is trained for foreground and background segmentation and adapted to the first frame of the target video under test (for example, online adaptation mechanisms and the semantic information of instance segmentation networks). The first frame provides key visual cues for the target, so these methods can handle multi-instance cases and generally perform better than unsupervised methods. However, many semi-supervised approaches rely heavily on the segmentation mask of the first frame: they generally use the first frame for data enhancement, and their model adaptation depends heavily on a fine-tuned model, so efficient segmentation cannot be achieved for videos with complex backgrounds, occlusion, fast motion or camera shake.
Disclosure of Invention
In view of the problems of existing video segmentation methods, the invention provides a video segmentation method based on the spatial and channel information of a dual attention module structure. Compared with the prior art, the method can flexibly exploit the spatial and channel information in the feature map, simplifies the amount of computation in the optimization process, and greatly improves the accuracy of video target object segmentation.
The purpose of the invention is as follows: the invention aims to overcome the defects of existing video object segmentation methods and provides a video object segmentation method with a dual-module neural network structure to solve several problems in video object segmentation.
The technical scheme is as follows: the invention discloses a video object segmentation method with a dual-module neural network structure, which ensures a sufficient amount of training data in the near-target domain and customizes the training data for the pixel-level video object segmentation scene.
First, the first frame and its annotation mask are input into a transformation network to generate possible future image pairs, which solves the problem of the extra processing time required for data preparation and training-data augmentation at an earlier stage. A training set of reasonably realistic images is generated, capturing the expected trajectory and appearance that the target may have in future video frames. Second, the image pairs are passed to a target proposal operation, which determines the candidate regions of interest; this screening removes unneeded image pairs so that the RoI segmentation network avoids unnecessary computational overhead. Then, the region of interest, together with a tracker, is input into the RoI segmentation network to train a model and output segmentation results. Interference while tracking the target across video frames makes tracking-based segmentation inaccurate. Finally, a dual attention module method is designed: the spatial attention module captures spatial dependencies between any two spatial locations, and the channel attention module captures channel dependencies between any two channel maps. The outputs of the two attention modules are fused, which enhances the discriminative power of the feature representation in video object segmentation and suppresses interference and noise in video segmentation. A convolutional layer operation is then applied and the final segmentation mask is output. The experimental results shown in FIG. 4 and FIG. 5 demonstrate that the method achieves effective segmentation. The method specifically comprises the following steps:
Step 1, the first frame of the video is denoted I_0 and the mask of the first frame is denoted M_0. The known first frame I_0 and first-frame mask M_0 are input into a transformation network, through which a number of different image pairs can be generated. An image pair is an image and its corresponding mask. The transformation network applies operations such as rotation, translation, flipping and scaling. The different images serve as mask training data for objects that may appear in future video frames; the dataset is taken from the DAVIS public video image segmentation dataset. The method of the invention processes the data using the video frames and their corresponding auxiliary masks, obtaining a large number of image pairs that compensate for the shortage of video training data. Enough data can therefore be obtained for training, and the video result can be predicted accurately.
Step 2, from the first frame I_0 of step 1 and the first-frame mask M_0, different image pairs are generated by the transformation network, and the regions of interest are obtained using the target proposal. The target proposal, a typical fully convolutional network, takes an image of arbitrary size as input and outputs a set of rectangular target proposal boxes. A target proposal is generated around the target in the first frame; target proposals are also generated at random around the target in the images produced by the transformation network in step 1, and an IoU score is computed between an image's target proposal and the first-frame target proposal, or between an image's mask and the initial mask. Image pairs whose IoU ratio exceeds 0.35 are selected and referred to as regions of interest (RoI). The initial mask is the first-frame mask M_0. IoU, the intersection over union, is the ratio of the intersection to the union of the predicted region and the actual region. A tracker is then added to the region of interest, which is an effective way to locate the target in the next frame. The tracker takes the current-frame mask and the next-frame image as input and predicts the position of the target mask in the next frame. The tracker is used to acquire the mask region of the next frame image, providing temporal consistency for the regions of interest of subsequent frames.
Step 3, once the region of interest is located in the next frame, the region of interest together with the tracker is input into the RoI segmentation network (RoISeg) of the invention to train the prediction of the target. The RoI segmentation network RoISeg is based on a deep convolutional neural network (CNN); the network framework is built on the ResNet101 backbone network and is called the RoI segmentation network, denoted RoISeg hereinafter. CNN refers to a convolutional neural network in deep learning. ResNet101 is a deep residual learning framework that addresses the accuracy degradation problem and has low training and test errors. The region of interest with the added tracker is input into RoISeg to train the model, and the output is a rough target recognition position and segmentation mask.
Step 4, the target result predicted by RoISeg from the region of interest and tracker of step 3 still has a large error; to reduce the noisy segmentation, the invention constructs a "dual attention module" method. The feature map output by the last convolutional layer of RoISeg is input into the dual attention module, which consists of a spatial attention module and a channel attention module.
The spatial attention module introduces a spatial attention mechanism to capture the spatial dependency between any two spatial positions. The spatial attention mechanism is a set of function operations in the spatial attention module. For the target location features in a frame, each feature is updated by aggregating features with a weighted sum, where the weights are determined by the feature similarity between the corresponding two positions. In other words, any two positions with similar features promote each other's refinement regardless of their distance in the spatial dimension.
The channel attention module captures the channel dependencies between any two channel maps through a channel attention mechanism and updates each channel map with a weighted sum of all channel maps. The channel attention mechanism is a set of function operations in the channel attention module.
Finally, the two attention modules are fused. The fusion operation is a parallel strategy that combines the two feature vectors into a composite vector. Fusing them enriches the information between consecutive frames of the target object, yielding better features for video object segmentation. Capturing the dependencies between the spatial-dimension information and the channel-dimension information in the dual attention model enhances the discriminative power of the feature representation in video object segmentation and suppresses interference and noise in video segmentation. After one more convolutional layer, the final segmentation mask is output.
The detailed specific steps are as follows:
Step 1, a video is input into the computer; each frame of the video is a picture. The picture is in RGB format and denoted RGB image I. The target label in this image is denoted mask M. The mask is the binary foreground/background of the image.
First, a video and the mask of its first frame are input: the first frame I_0 and the first mask M_0 are fed into the transformation network G, producing a large number of transformed image pairs D. The specific expression is as follows:
D_n = G(I_0, M_0)
where G represents the transformation network, which applies operations such as rotation, translation, flipping and scaling. D_n = {d_1 m_1, d_2 m_2, ..., d_n m_n}, where D_n denotes that there are n image pairs, d_i m_i denotes the i-th image pair, d_i is the image generated by the i-th transformation and m_i is the mask generated by the i-th transformation. The image pairs generated by the transformation network are then screened to decide whether each pair serves as a region of interest.
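The patent does not give code for the transformation network G; the following PyTorch-style sketch shows one plausible way to generate image/mask pairs by random rotation, translation, flipping and scaling, applying the same random transform to the image and its mask. The function names, parameter ranges and n_pairs value are illustrative assumptions, not part of the original disclosure.

```python
# Hypothetical sketch of the transformation network G: jointly augments the
# first frame I_0 and its mask M_0 with rotation, translation, flip and scale.
import random
import torchvision.transforms.functional as TF

def random_pair_transform(image, mask):
    # Sample one set of parameters and apply it to both the image and the mask
    angle = random.uniform(-30, 30)                      # rotation in degrees (assumed range)
    translate = [random.randint(-20, 20), random.randint(-20, 20)]
    scale = random.uniform(0.8, 1.2)
    flip = random.random() < 0.5
    image = TF.affine(image, angle=angle, translate=translate, scale=scale, shear=0)
    mask = TF.affine(mask, angle=angle, translate=translate, scale=scale, shear=0)
    if flip:
        image, mask = TF.hflip(image), TF.hflip(mask)
    return image, mask

def generate_image_pairs(I0, M0, n_pairs=100):
    """D_n = G(I_0, M_0): return n transformed (image, mask) pairs."""
    return [random_pair_transform(I0, M0) for _ in range(n_pairs)]
```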
The specific steps of step 2 are as follows:
the image pair generated by the transformation network is screened to determine whether the image pair is a sensitive region. The perceived area is obtained using the goal proposition. The goal-proposition, which is a typical full convolution network, takes an arbitrary size of an image as input and outputs a set of rectangular goal-proposition boxes. The target proposal operation is performed around the target in the first frame and noted gtboxSaid gtboxIs the bounding box of the real marker around the object of the first frame. A bounding box generated by the image for the target proposing operation around the target is marked as bboxSaid b isboxThe image pair is input into the goal offer and the goal offer box in the image pair is output, as shown at the mark 5 in fig. 2. Resulting in an image target proposal and first frame target proposal ratio IoU score. The specific expression is as follows:
S = IoU(b_box, gt_box)
where IoU is the intersection-over-union function, and the score S is the IoU between the target proposal box of the image pair and the target proposal box of the first frame. Image pairs with ratio S > 0.75 are identified as regions of interest. A tracker is then added to the region of interest, which is an effective way to locate the target in the next frame. The tracker takes the current-frame mask and the next-frame image as input and predicts the position of the target mask in the next frame. The tracker is used to acquire the mask region of the next frame image, providing temporal consistency for the regions of interest of subsequent frames. A video sequence R = {I_0, I_1, I_2, I_3, ..., I_t, ..., I_n} and the mask M_0 of the first frame I_0 are known, where I_t is the t-th frame in the video sequence and t ∈ {1, 2, 3, ..., n}. The masks of the remaining frames in the video sequence, {M_1, M_2, M_3, ..., M_n}, are obtained according to the tracker function expression:
M_{t+1} = f(I_{t+1}, M_t)
where f denotes the tracker function, I_{t+1} is the known image of the (t+1)-th frame, M_t is the known mask of the t-th frame image, and M_{t+1}, the mask of the (t+1)-th frame, is computed. The mask of the second frame is obtained by the tracker from the first-frame image and mask. Since the target moves smoothly in space, consecutive video frames change little from one to the next and are relatively correlated. From the mask M_t and the frame I_{t+1}, the mask M_{t+1} of frame I_{t+1} is predicted. The predicted mask of frame I_{t+1} still has a large error with respect to the real mask M_gt, where M_gt denotes the ground-truth mask. The region of interest with the added tracker is then input into the RoI segmentation network.
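As an illustration of the RoI screening and tracker propagation described above, the sketch below computes the IoU score S = IoU(b_box, gt_box) and propagates masks frame by frame with M_{t+1} = f(I_{t+1}, M_t). The tracker and proposal network are treated as black-box callables, and all names here are illustrative assumptions rather than interfaces fixed by the patent.

```python
# Minimal sketch (assumed interfaces): IoU screening of proposals and
# tracker-based mask propagation M_{t+1} = f(I_{t+1}, M_t).
def iou(box_a, box_b):
    # Boxes are (x1, y1, x2, y2)
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def select_rois(image_pairs, proposal_net, gt_box, thresh=0.75):
    """Keep image pairs whose proposal box has IoU > thresh with gt_box."""
    rois = []
    for image, mask in image_pairs:
        b_box = proposal_net(image)        # assumed: returns one proposal box per image
        if iou(b_box, gt_box) > thresh:
            rois.append((image, mask))
    return rois

def propagate_masks(frames, first_mask, tracker):
    """Masks for frames 1..n from M_0 via M_{t+1} = f(I_{t+1}, M_t)."""
    masks = [first_mask]
    for frame in frames[1:]:
        masks.append(tracker(frame, masks[-1]))
    return masks
```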
The specific steps of step 3 are as follows:
and inputting the information into the feeling segmentation network through the feeling region adding tracker based on the step 2. The perceptive segmentation network (RoISeg) in the present invention is trained to predict targets by inputting the perceptive region addition tracker. The Hoxist segmentation network RoISeg is based on a deep convolutional neural network CNN, and a network framework is innovated on the basis of a ResNet101 framework network and is called as a Hoxist segmentation network. The ResNet101 frame network is a network with a deep residual learning frame to solve the problem of accuracy reduction, and has a low training error and test error network. The RoISeg network is composed of convolution layers, pooling layers, activation functions, batch normalization, deconvolution and the like. With some initial parameter settings in RoISeg. The learning rate is 0.0001 and the weight attenuation term is 0.005. The RoISeg final output is constrained using a weighted cross-entropy penalty. The crossover loss expression is as follows:
L(θ) = -β Σ_{j∈X+} log P(y_j = 1; θ) - (1 - β) Σ_{j∈X-} log P(y_j = 0; θ)
where L(θ) denotes the weighted cross-entropy loss and θ, with value range [0, 1], is the weight parameter associated with the current prediction in the network. X+ and X- are the sets of positive and negative target samples, respectively, and β is a weighting term that penalizes biased sampling during training. The activation output of the convolutional layer yields the probability P, P ∈ [0, 1], computed with the commonly used nonlinear activation function Sigmoid, whose value range [0, 1] matches that of the probability. The training output layer of the RoI segmentation network is constrained by the cross-entropy loss, and the network is trained continuously by back-propagation; as the loss gradually decreases during training, convergence becomes sufficiently small and stable, and the target segmentation result is output. The output result is a segmentation map of the mask's foreground and background.
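A minimal PyTorch-style sketch of the class-balanced weighted cross-entropy described above, assuming the loss takes the reconstructed form with β weighting positive pixels and (1 - β) weighting negative pixels; estimating β from the label statistics of each batch is an assumption, not a detail stated by the patent.

```python
import torch

def weighted_cross_entropy(logits, target, eps=1e-6):
    """Class-balanced BCE: beta weights positives (X+), 1 - beta weights negatives (X-).

    logits: raw convolutional output, shape (N, 1, H, W)
    target: binary ground-truth mask, same shape, values in {0, 1}
    """
    prob = torch.sigmoid(logits)                   # P in [0, 1]
    pos = target.sum()
    neg = target.numel() - pos
    beta = neg / (pos + neg + eps)                 # assumed: balance by class frequency
    loss_pos = -beta * (target * torch.log(prob + eps))
    loss_neg = -(1.0 - beta) * ((1.0 - target) * torch.log(1.0 - prob + eps))
    return (loss_pos + loss_neg).mean()
```

With the learning rate and weight decay given above, this loss could be minimized with, for example, torch.optim.Adam(model.parameters(), lr=0.0001, weight_decay=0.005); the patent does not name the optimizer.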
The target result predicted by the RoISeg network in step 3 still has a large error; to reduce the noisy segmentation, the invention constructs a "dual attention module" method. The feature map output by the last convolutional layer of RoISeg is input into the two attention modules separately: a spatial attention module and a channel attention module.
Spatial attention module: a spatial attention mechanism is introduced to enrich the contextual feature dependencies for the target in the video frame; its operations are described in detail here. The spatial attention module is shown at mark 11 in FIG. 3. The convolutional-layer output feature map from RoISeg is denoted A, A ∈ R^{C×H×W}, where R denotes the set of real numbers, the shape of A is C×H×W, C is the number of channels, H the height and W the width. First, three new feature maps B, D and F are generated from the shared feature map A, where {B, D, F} ∈ R^{C×H×W}. Their shapes are then changed to R^{C×N}, where N = H×W is the product of height and width. Matrix multiplication is performed between the transpose of B and D, and a softmax layer is applied to compute the spatial attention feature map S ∈ R^{N×N}. The specific expression is as follows:
S_ji = exp(B_i · D_j) / Σ_{k=1..N} exp(B_k · D_j)
where S_ji measures the influence of the i-th spatial position on the j-th spatial position; the exponential acts on a measure between two positions, and the more similar the features of two positions, the greater their mutual correlation. As stated above, this captures the spatial dependency between any two spatial positions: the more similar the features of two positions, the more they contribute to each other's representation. The shape of F is R^{C×N}. Matrix multiplication is then performed between F and the transpose of S; the resulting feature map of shape R^{C×N} is reshaped to R^{C×H×W}. Finally, it is multiplied by a scale parameter α, and an element-wise sum is performed with feature A to obtain the output feature map E_1. The specific expression is as follows:
E_1j = α Σ_{i=1..N} (S_ji F_i) + A_j
where α is a weight coefficient initialized to 0, α ∈ [0, 1], and is gradually assigned more weight. The resulting feature map E_1 of the sum operation has shape E_1 ∈ R^{C×H×W}. For a target feature position in the video frame, the feature is updated by aggregating positions with a weighted sum, where the weights are determined by the feature similarity between the corresponding two positions. In other words, any two positions with similar features promote each other's refinement regardless of their distance in the spatial dimension. Contextual feature representations are selectively aggregated according to the spatial attention map, which improves the information interdependence within the same class.
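The following PyTorch sketch shows one way the spatial attention module described above could be implemented: three 1×1 convolutions producing B, D and F, a softmax over the N = H×W positions, and a learnable scale α initialized to 0. Obtaining B, D and F by 1×1 convolutions is an assumption; the patent only says they are generated from the shared feature map A.

```python
import torch
import torch.nn as nn

class SpatialAttention(nn.Module):
    """Position attention: position j aggregates F over all positions i with
    weights softmax_i(B_j . D_i); output E_1 = alpha * aggregate + A."""
    def __init__(self, channels):
        super().__init__()
        # Assumed: B, D, F are produced from A by 1x1 convolutions
        self.conv_b = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_d = nn.Conv2d(channels, channels, kernel_size=1)
        self.conv_f = nn.Conv2d(channels, channels, kernel_size=1)
        self.alpha = nn.Parameter(torch.zeros(1))      # scale, initialized to 0
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):                              # a: (batch, C, H, W)
        n, c, h, w = a.size()
        b = self.conv_b(a).view(n, c, h * w).permute(0, 2, 1)    # (batch, HW, C)
        d = self.conv_d(a).view(n, c, h * w)                     # (batch, C, HW)
        s = self.softmax(torch.bmm(b, d))                        # (batch, HW, HW) attention
        f = self.conv_f(a).view(n, c, h * w)                     # (batch, C, HW)
        out = torch.bmm(f, s.permute(0, 2, 1)).view(n, c, h, w)  # weighted aggregation
        return self.alpha * out + a                              # element-wise sum with A
```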
Channel attention module: the channel dependencies between any two channel maps are captured through the operations of a channel attention mechanism. The convolutional-layer output feature map from RoISeg is again denoted A, A ∈ R^{C×H×W}, where R denotes the set of real numbers, the shape of A is C×H×W, C is the number of channels, H the height and W the width. Two new feature maps M and N are generated from the shared feature map A, where {M, N} ∈ R^{C×H×W}. Their shapes are then changed to R^{C×N}. Matrix multiplication is performed between M and the transpose of N, directly computing the channel feature map X ∈ R^{C×C}. A softmax layer is applied to obtain the channel attention feature map X ∈ R^{C×C}. The specific expression is as follows:
X_ji = exp(M_i · N_j) / Σ_{k=1..C} exp(M_k · N_j)
where X_ji measures the influence of the i-th channel on the j-th channel; as mentioned above, the channel attention module captures the channel dependency between any two channel maps. In addition, matrix multiplication is performed between X and the feature map A reshaped to R^{C×N}; the result of the matrix multiplication is reshaped to R^{C×H×W}, multiplied by a scale weight parameter β, and an element-wise sum is performed with A to obtain the output feature map E_2. The specific expression is as follows:
E_2j = β Σ_{i=1..C} (X_ji A_i) + A_j
where β is a weight coefficient initialized to 0.3, β ∈ [0, 1]. The resulting feature map E_2 of the sum operation has shape E_2 ∈ R^{C×H×W}. The channel dependencies between feature-map channel maps are modelled, which helps improve the discriminability of the model's features. The channel target features enhanced by the channel attention module are more prominent, so the network can identify the target in the video frame.
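A corresponding PyTorch sketch of the channel attention module: a channel-by-channel similarity X ∈ R^{C×C}, a softmax, and a learnable scale β. Taking M and N directly as reshaped copies of A is one reasonable reading of the description; initializing β to 0.3 follows the text, while treating it as a learnable parameter is an assumption.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    """Channel attention: channel i aggregates A over all channels j with
    weights softmax_j(M_i . N_j); output E_2 = beta * aggregate + A."""
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.full((1,), 0.3))   # scale, initialized to 0.3
        self.softmax = nn.Softmax(dim=-1)

    def forward(self, a):                                 # a: (batch, C, H, W)
        n, c, h, w = a.size()
        m = a.view(n, c, h * w)                           # M reshaped to (C, HW)
        nt = a.view(n, c, h * w).permute(0, 2, 1)         # N transposed, (HW, C)
        x = self.softmax(torch.bmm(m, nt))                # (C, C) channel attention map
        out = torch.bmm(x, a.view(n, c, h * w)).view(n, c, h, w)
        return self.beta * out + a                        # element-wise sum with A
```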
The two attention modules are then fused. The fusion operation combines the two feature vectors into a composite vector. The feature map E_1 output by the spatial attention module and the feature map E_2 output by the channel attention module are fused to obtain a new feature map O. The specific expression is as follows:
O = f(E_1, E_2)
where O is the output of the fused feature maps, with size O ∈ R^{C×H×W}, and the function f denotes the fusion operation. The feature map E_1 has size E_1 ∈ R^{C×H×W} and E_2 has size E_2 ∈ R^{C×H×W}. After fusion, the feature information between consecutive frames of the target object becomes more evident, yielding better features for video target object segmentation.
Dependencies are captured through feature fusion between the spatial-dimension information and the channel-dimension information of the attention modules, making full use of the contextual feature information between space and channels. Specifically, the convolutional-layer output of the RoI segmentation network is fed into each of the two attention modules. Through their respective attention mechanisms, the spatial attention module obtains salient spatial information features and the channel attention module obtains salient channel information features. Fusing the features of the two attention modules enhances the discriminative power of the feature representation in video object segmentation and suppresses interference and noise in video segmentation. A convolutional layer operation is then applied and the final segmentation mask is output. The experimental results shown in FIG. 4 and FIG. 5 demonstrate that the video object segmentation method achieves effective results.
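The fusion f(E_1, E_2) and the final convolution could be realized, for example, by an element-wise sum of the two attention outputs followed by a 1×1 convolution producing the mask logits, reusing the SpatialAttention and ChannelAttention classes sketched above. This concrete choice (sum fusion, 1×1 output convolution) is an assumption consistent with the parallel fusion strategy described here, not a detail fixed by the patent.

```python
import torch.nn as nn

class DualAttentionHead(nn.Module):
    """Fuse spatial and channel attention outputs, then predict the mask."""
    def __init__(self, channels, num_classes=1):
        super().__init__()
        self.spatial = SpatialAttention(channels)
        self.channel = ChannelAttention()
        # Assumed: fusion by element-wise sum, then a 1x1 conv to mask logits
        self.classifier = nn.Conv2d(channels, num_classes, kernel_size=1)

    def forward(self, a):                 # a: last RoISeg convolutional feature map
        e1 = self.spatial(a)              # E_1
        e2 = self.channel(a)              # E_2
        o = e1 + e2                       # O = f(E_1, E_2), parallel fusion
        return self.classifier(o)         # segmentation mask logits
```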
Advantageous technical effects
The invention provides a convolutional neural network video object segmentation method with a dual attention module, which addresses insufficient data, high processing overhead, complex backgrounds, fast motion, jitter, oscillation and other interference during video object segmentation. To overcome these interferences, the invention designs a transformation network, a region of interest with an added tracker, and a dual attention module. The transformation network addresses the poor generalization of the learned model caused by insufficient data during network training. The region of interest is determined by the target proposal, and a tracker added to the region of interest predicts the position where the target may appear in the next frame; this addresses the jitter and oscillation caused by fast motion or camera movement and finds the possible positions of the target. The region of interest with the added tracker is input into the RoI segmentation network (RoISeg) designed by the invention to train the model and output segmentation results. Interference while tracking the target across video frames makes tracking-based segmentation inaccurate; a dual attention module method is designed for this purpose. The spatial attention module captures spatial dependencies between any two spatial locations and the channel attention module captures channel dependencies between any two channel maps. The feature maps output by the two attention modules are fused, which enhances the discriminative power of the feature representation in video object segmentation and suppresses interference and noise in video segmentation. A convolutional layer operation is then applied and the final segmentation mask is output.
Drawings
FIG. 1 is a basic flow diagram of the process of the present invention
FIG. 2 is a diagram of a network architecture of the present invention
FIG. 3 is a diagram of the network structure of the dual attention module
FIG. 4 and FIG. 5 are graphs of experimental results
In FIG. 2, reference numeral 1 denotes the first frame image; 2 is the mask corresponding to the first frame image; 3 is the transformation network operation; 4 is the image pairs generated by the transformation network; 5 is the target proposal generation used to determine the regions of interest; 6 is the RoISeg network framework of the invention; 7 is the feature map output by the RoISeg network; 8 and 9 denote the channel attention module; 11 is the spatial attention module; 12 is the fusion of the output feature maps; and 13 is the final experimental segmentation result.
Detailed description of the invention
Technical features of the present invention will now be described in detail with reference to the accompanying drawings.
Referring to FIG. 1, a video object segmentation method with a dual-module neural network structure, which collects and processes data for pixel-level video object segmentation scenes to ensure a sufficient amount of training data in the near-target domain, comprises:
the first frame of the video and the annotation mask thereof are obtained and used for generating the mask of the future video frame and generating a training set of reasonable and vivid images, and further the expected appearance change in the future video frame is obtained to obtain the approximate target domain.
Furthermore, an attention module mechanism is introduced to capture the target feature dependencies in a spatial attention module and a channel attention module, respectively. The attention module mechanism adds two parallel modules to the architecture of the expanded fully convolutional neural network: one is a spatial position dimension module and the other is a channel information dimension module. After the two parallel modules are processed, the spatial position dimension module obtains accurate position information dependencies, and the channel dimension module obtains the dependencies between channel maps.
Finally, the output feature maps from the two dimension modules are fused to obtain a better feature representation for pixel-level prediction, and the segmentation result is output after a convolutional layer. The segmentation result consists of 1s and 0s for foreground and background, respectively.
Further, the method of the invention comprises the following steps:
Step 1, in the video, the first frame is denoted I_0 and the mask of the first frame image is denoted M_0. The known first frame I_0 and first-frame mask M_0 are input into the transformation network, and image pairs are generated through the transformation network. An image pair is an image and a mask. The transformation network is a network that contains rotation, translation, flipping and/or scaling operations. The image pairs address the data shortage when video frames are fed into the network to train the model. In this step, the input video comes from a dataset, which may be the DAVIS public video image segmentation dataset. The method of the invention processes the data using the video frames and their corresponding auxiliary masks, obtaining a large number of image pairs that compensate for the shortage of video training data. Enough data can therefore be obtained for training, and the video result can be predicted accurately.
Step 2, from the pixels of the first frame image I_0 of step 1 and the first-frame mask M_0, more than one set of image pairs is generated by the transformation network; the image pairs are not identical, and the regions of interest are obtained through the target proposal.
The target proposal is a fully convolutional network that takes an image of arbitrary size as input and outputs a set of rectangular image target proposal boxes. The target proposal obtains the region of interest by scoring the candidate boxes. The specific steps are as follows:
A target proposal is generated around the target in the first frame image I_0.
IoU is obtained in the following two ways. IoU, the intersection over union, is the ratio of the intersection to the union of the predicted region and the actual region.
First: target proposals are randomly generated around the target in the images produced by the transformation network in step 1, and the IoU between a generated image target proposal and the first frame image I_0 target proposal is obtained.
Second: the IoU score between an image mask and the initial mask is obtained. The initial mask is the first-frame mask M_0.
Image pairs with an IoU score greater than 0.75 are selected and referred to as regions of interest (RoI).
A tracker is then added to the region of interest, which effectively locates the target in the next frame. The tracker takes the current-frame mask and the next-frame image as input and predicts the position of the target mask in the next frame. The tracker is used to acquire the mask region of the next frame image, providing temporal consistency for the regions of interest of subsequent frames.
Step 3, once the region of interest is located in the next frame, the region of interest together with the tracker is input into the RoI segmentation network (RoISeg) of the invention to train the prediction of the target. RoISeg is based on a deep convolutional neural network (CNN); the network framework is built on the ResNet101 backbone network and is referred to as RoISeg for short. CNN refers to a convolutional neural network in deep learning. ResNet101 is a deep residual learning framework that addresses the accuracy degradation problem and has low training and test errors. The region of interest with the added tracker is input into RoISeg to train the model, and the output is a rough target recognition position and segmentation mask.
Step 4, the target result predicted by RoISeg from the region of interest and tracker of step 3 still has a large error; to reduce the noisy segmentation, the invention constructs a dual attention module method: the feature map output by the last convolutional layer of RoISeg is input into the dual attention module. The dual attention module includes a spatial attention module and a channel attention module, detailed in FIG. 1 and FIG. 3.
The spatial attention module introduces a spatial attention mechanism to capture the spatial dependency between any two spatial positions. The spatial attention mechanism is a function operation in the spatial attention module. For the target location features in a frame, each feature is updated by aggregating features with a weighted sum, where the weights are determined by the feature similarity between the corresponding two positions. In other words, any two positions with similar features promote each other's refinement regardless of their distance in the spatial dimension.
The channel attention module captures the channel dependencies between any two channel maps through a channel attention mechanism and updates each channel map with a weighted sum of all channel maps. The channel attention mechanism is a function operation in the channel attention module.
Finally, the two attention modules are fused. The fusion operation is a parallel strategy that combines the two feature vectors into a composite vector. Fusing them enriches the information between consecutive frames of the target object, yielding better features for video object segmentation. Capturing the dependencies between the spatial-dimension information and the channel-dimension information in the dual attention model enhances the discriminative power of the feature representation in video object segmentation and suppresses interference and noise in video segmentation. After one more convolutional layer, the final segmentation mask is output.
Further, in step 1, a video is input into the computer; each frame of the video is a picture. The picture is in RGB format and denoted RGB image I. The target label in this image is denoted mask M. The mask is the binary foreground/background of the image.
First, a video and the mask of its first frame are input: the first frame I_0 and the first mask M_0 are fed into the transformation network G to obtain the transformed image pairs D. The specific expression is as follows:
D_n = G(I_0, M_0)
where G denotes the transformation network. The set of image pairs D_n = {d_1 m_1, d_2 m_2, ..., d_n m_n}, where D_n denotes that there are n image pairs, d_i m_i denotes the i-th image pair, d_i is the image generated by the i-th transformation and m_i is the mask generated by the i-th transformation. The image pairs generated by the transformation network are then screened to decide whether each pair serves as a region of interest.
Further, the specific steps of step 2 are:
the image pair generated by the transformation network is screened to determine whether the image pair is a sensitive region. The perceived area is obtained using the goal proposition. The target proposal, which is a typical full convolution network, inputs images of arbitrary size and outputs a set of image target rectangular proposal boxes. The target proposal operation is performed around the target in the first frame and noted gtboxSaid gtboxIs the bounding box of the real marker around the object of the first frame. A bounding box generated by the image for the target proposing operation around the target is marked as bboxSaid b isboxThe image pair is input into the goal offer and the goal offer box in the image pair is output, as shown at the mark 5 in fig. 2. Resulting in an image target proposal and first frame target proposal ratio IoU score. The specific expression is as follows:
S = IoU(b_box, gt_box)
where IoU is the intersection-over-union function, and the score S is the IoU between the target proposal box of the image pair and the target proposal box of the first frame. Image pairs with ratio S > 0.75 are identified as regions of interest. A tracker is then added to the region of interest, which is an effective way to locate the target in the next frame. The tracker takes the current-frame mask and the next-frame image as input and predicts the position of the target mask in the next frame. The tracker is used to acquire the mask region of the next frame image, providing temporal consistency for the regions of interest of subsequent frames. A video sequence R = {I_0, I_1, I_2, I_3, ..., I_t, ..., I_n} and the mask M_0 of the first frame I_0 are known, where I_t is the t-th frame in the video sequence and t ∈ {1, 2, 3, ..., n}. The masks of the remaining frames in the video sequence, {M_1, M_2, M_3, ..., M_n}, are obtained according to the tracker function expression:
M_{t+1} = f(I_{t+1}, M_t)
where f denotes the tracker function, I_{t+1} is the known image of the (t+1)-th frame, M_t is the known mask of the t-th frame image, and M_{t+1}, the mask of the (t+1)-th frame, is computed. The mask of the second frame is obtained by the tracker from the first-frame image and mask. Since the target moves smoothly in space, consecutive video frames change little from one to the next and are relatively correlated. From the mask M_t and the frame I_{t+1}, the mask M_{t+1} of frame I_{t+1} is predicted. The predicted mask of frame I_{t+1} still has a large error with respect to the real mask M_gt, where M_gt denotes the ground-truth mask. The region of interest with the added tracker is then input into the RoI segmentation network.
Further, the specific steps of step 3 are:
and inputting the information into the feeling segmentation network through the feeling region adding tracker based on the step 2. The perceptive segmentation network (RoISeg) in the present invention is trained to predict targets by inputting the perceptive region addition tracker. The Hoxist segmentation network RoISeg is based on a deep convolutional neural network CNN, and a network framework is innovated on the basis of a ResNet101 framework network and is called as a Hoxist segmentation network. The ResNet101 frame network is a network with a deep residual learning frame to solve the problem of accuracy reduction, and has a low training error and test error network. The RoISeg network is composed of convolution layers, pooling layers, activation functions, batch normalization, deconvolution and the like. Wherein the initial parameters in the RoISeg are set as: the learning rate is 0.0001 and the weight attenuation term is 0.005. The RoISeg final output is constrained using a weighted cross-entropy penalty. The crossover loss expression is as follows:
L(θ) = -β Σ_{j∈X+} log P(y_j = 1; θ) - (1 - β) Σ_{j∈X-} log P(y_j = 0; θ)
where L(θ) denotes the weighted cross-entropy loss and θ, with value range [0, 1], is the weight parameter associated with the current prediction in the network. X+ and X- are the sets of positive and negative target samples, respectively, and β is a weighting term that penalizes biased sampling during training. The activation output of the convolutional layer yields the probability P, P ∈ [0, 1], computed with the commonly used nonlinear activation function Sigmoid, whose value range [0, 1] matches that of the probability. The training output layer of the RoI segmentation network is constrained by the cross-entropy loss, and the network is trained continuously by back-propagation; as the loss gradually decreases during training, convergence becomes sufficiently small and stable, and the target segmentation result is output. The output result is a segmentation map of the mask's foreground and background.
Further, the specific steps of step 4 are:
the output target result is predicted by the RoISeg network in step 3 to have a large error in order to reduce the noise-divided portion. Thus, the present invention constructs a "dual module of interest" approach. The signature graph output at the last convolutional layer of the RoISeg is input to two attention modules, respectively. Two focus modules are divided: the system comprises a space attention module and a channel attention module, and specifically comprises the following steps:
a spatial attention module: introducing a spatial attention mechanism to enrich dependency of contextual features for targets in video framesIs described. The operation of the spatial attention mechanism is introduced for detailed description. The space focus module in fig. 3 shows s. The convolution layer output characteristic diagram from RoISeg is marked as A, and A belongs to RC×H×WR represents a set, the shape size of A is C × H × W, C represents the number of channels, H represents the height, and W represents the width. First, three new feature maps B, D and F are generated from feature map A sharing, respectively, where { B, D, F }. epsilon.RC ×H×W. Then re-change their shape and size to RC×NWhere N ═ H × W, N denotes the product of height and width. And then B performs matrix transposition and D performs matrix multiplication, and applies a softmax layer to calculate a spatial dimension information attention feature map S e RN×NThe specific expression is as follows:
S_ji = exp(B_i · D_j) / Σ_{k=1..N} exp(B_k · D_j)
where S_ji measures the influence of the i-th spatial position on the j-th spatial position; the exponential acts on a measure between two positions, and the more similar the features of two positions, the greater their mutual correlation. As stated above, this captures the spatial dependency between any two spatial positions: the more similar the features of two positions, the more they contribute to each other's representation. The shape of F is R^{C×N}. Matrix multiplication is then performed between F and the transpose of S; the resulting feature map of shape R^{C×N} is reshaped to R^{C×H×W}. Finally, it is multiplied by a scale parameter α, and an element-wise sum is performed with feature A to obtain the output feature map E_1. The specific expression is as follows:
E_1j = α Σ_{i=1..N} (S_ji F_i) + A_j
where α is a weight coefficient initialized to 0, α ∈ [0, 1], and is gradually assigned more weight. The resulting feature map E_1 of the sum operation has shape E_1 ∈ R^{C×H×W}. For a target feature position in the video frame, the feature is updated by aggregating positions with a weighted sum, where the weights are determined by the feature similarity between the corresponding two positions. In other words, any two positions with similar features promote each other's refinement regardless of their distance in the spatial dimension. Contextual feature representations are selectively aggregated according to the spatial attention map, which improves the information interdependence within the same class.
Channel attention module: the channel dependencies between any two channel maps are captured by a channel attention mechanism operation. The convolutional-layer output feature map from RoISeg is again denoted A, A ∈ R^{C×H×W}, where R denotes the set of real numbers, the shape of A is C×H×W, C is the number of channels, H the height and W the width. Two new feature maps M and N are generated from the shared feature map A, where {M, N} ∈ R^{C×H×W}. Their shapes are then changed to R^{C×N}. Matrix multiplication is performed between M and the transpose of N, directly computing the channel feature map X ∈ R^{C×C}. A softmax layer is applied to obtain the channel attention feature map X ∈ R^{C×C}. The specific expression is as follows:
X_ji = exp(M_i · N_j) / Σ_{k=1..C} exp(M_k · N_j)
where X_ji measures the influence of the i-th channel on the j-th channel; as mentioned above, the channel attention module captures the channel dependency between any two channel maps. In addition, the X matrix and the feature map A reshaped to R^{C×N} are multiplied; the result of the matrix multiplication has shape R^{C×N} and is reshaped to R^{C×H×W}, then multiplied by a scale weight parameter β, and an element-wise sum is performed with A to obtain the output feature map E_2. The specific expression is as follows:
E_2j = β Σ_{i=1..C} (X_ji A_i) + A_j
where β is a weight coefficient initialized to 0.3, β ∈ [0, 1]. The resulting feature map E_2 of the sum operation has shape E_2 ∈ R^{C×H×W}. The channel dependencies between feature-map channel maps are modelled, which helps improve the discriminability of the model's features. The channel target features enhanced by the channel attention module are more prominent, so the network can identify the target in the video frame.
The two attention modules are then fused. The fusion operation combines the two feature vectors into a composite vector. The feature map E_1 output by the spatial attention module and the feature map E_2 output by the channel attention module are fused to obtain a new feature map O. The specific expression is as follows:
O = f(E_1, E_2)
where O is the output of the fused feature maps, with size O ∈ R^{C×H×W}, and the function f denotes the fusion operation. The feature map E_1 has size E_1 ∈ R^{C×H×W} and E_2 has size E_2 ∈ R^{C×H×W}. After fusion, the feature information between consecutive frames of the target object becomes more evident, yielding better features for video target object segmentation.
Dependencies are captured through feature fusion between the spatial-dimension information and the channel-dimension information of the attention modules, making full use of the contextual feature information between space and channels. Specifically, the convolutional-layer output of the RoI segmentation network is fed into each of the two attention modules. Through their respective attention mechanisms, the spatial attention module obtains salient spatial information features and the channel attention module obtains salient channel information features. Fusing the features of the two attention modules enhances the discriminative power of the feature representation in video object segmentation and suppresses interference and noise in video segmentation. A convolutional layer operation is then applied and the final segmentation mask is output.
The experimental results shown in FIG. 4 and FIG. 5 demonstrate that the video object segmentation method achieves effective results and is of practical significance.
Examples
The experimental hardware environment of the invention is as follows: the system is implemented on a PC with a 3.4 GHz Intel(R) Core(TM) i5-7500 CPU, a GTX 1080Ti GPU and 16 GB of memory, under the Ubuntu 18.04 operating system, and is built on the open-source PyTorch deep learning framework. The image size used for training and testing is 854x480. The test results (FIG. 4 and FIG. 5) are obtained on the DAVIS public video image segmentation dataset.
First, a given first frame and the mask of the first frame are provided (shown at 1 and 2 in FIG. 2). 1-100 image pairs are generated by the transformation network (shown at 4 in FIG. 2). The candidate regions of interest are selected by the target proposal boxes (shown at 5 in FIG. 2). After the tracker is added, the region of interest is trained in the RoISeg network (shown at 6 in FIG. 2). The feature map output by the last convolutional layer of the RoISeg network (shown at 7 in FIG. 2) is fed separately into the spatial attention module and the channel attention module. Finally, the feature maps output by the spatial attention module and the channel attention module are fused (12 in FIG. 2) and the segmentation result map is output. The experimental results shown in FIG. 4 and FIG. 5 demonstrate that the video object segmentation method achieves effective results.
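To tie the pieces together, the following sketch outlines how the example pipeline above could be orchestrated, reusing the hypothetical helpers sketched earlier (generate_image_pairs, select_rois, DualAttentionHead) together with assumed tracker, proposal-network and RoISeg callables; none of these interfaces are specified by the patent, so every name and signature here is an assumption.

```python
import torch

def segment_video(frames, first_mask, proposal_net, gt_box, tracker, roiseg, head):
    """End-to-end sketch: augment, screen RoIs, track, segment with dual attention."""
    # 1) Offline: augment the first frame and keep only high-IoU pairs
    #    (fine-tuning of roiseg on these pairs is not shown here).
    pairs = generate_image_pairs(frames[0], first_mask, n_pairs=100)
    rois = select_rois(pairs, proposal_net, gt_box, thresh=0.75)

    # 2) Online: propagate the mask with the tracker, refine with RoISeg + attention.
    masks = [first_mask]
    for frame in frames[1:]:
        coarse = tracker(frame, masks[-1])        # M_{t+1} = f(I_{t+1}, M_t)
        features = roiseg(frame, coarse)          # assumed: last conv feature map A
        logits = head(features)                   # dual attention + 1x1 conv
        masks.append((torch.sigmoid(logits) > 0.5).float())
    return masks
```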

Claims (7)

1. A method for segmenting a video object with a dual-module neural network structure is characterized by comprising the following steps: the method ensures that a near target domain has enough training data by collecting processing data for pixel-level video object segmentation scenes, and comprises the following steps:
acquiring a first frame of a video and an annotation mask thereof, generating a mask of a future video frame, generating a training set of a reasonable and vivid image, and further acquiring an expected appearance change in the future video frame to obtain a near target domain;
in addition, a module-of-interest mechanism is introduced to capture the target feature dependencies in the spatial and channel modules of interest, respectively; the injection molding mechanism is to add two parallel modules to the architecture of the expanded full convolution neural network: one is a spatial position dimension module, and the other is a channel information dimension module; after the two parallel modules are processed, the spatial position dimension module obtains an accurate position information dependency relationship, and the channel dimension module obtains a dependency relationship between channel mappings;
finally, the output feature maps of the two dimension modules are fused to obtain a better feature representation for pixel-level prediction, and the segmentation result is output after one convolutional layer; in the segmentation result, foreground and background are represented by 1 and 0 respectively.
2. The video object segmentation method with a dual-module neural network structure according to claim 1, characterized in that the method comprises the following steps:
step 1, in the video, the first frame is denoted I_0 and the mask of the first frame image is denoted M_0; the known first frame I_0 and first-frame mask M_0 are input into a transformation network, and the transformation network generates image pairs; an image pair is an image and its corresponding mask; the transformation network is a network comprising rotation, translation, flipping and scaling operations; step 2, the pixels of the first frame image I_0 and the first-frame mask M_0 of step 1 are passed through the transformation network to generate more than one group of mutually different image pairs, and the regions of interest are obtained through the target proposal;
the target proposal is a fully convolutional network that takes an image of arbitrary size as input and outputs a set of rectangular target proposal boxes; the target proposal obtains the region of interest by scoring the candidate boxes; the specific steps are as follows:
generating target proposals around the target in the first frame image I_0;
obtaining IoU in the following manner; IoU, the intersection over union, is the overlap ratio between the predicted region and the actual region;
first: randomly generating target proposals around the target using the images generated by the transformation network in step 1, and computing the IoU between a generated-image target proposal and the target proposal of the first frame image I_0;
second: computing the IoU score between a generated image mask and the initial mask; the initial mask is the first-frame mask M_0;
selecting the image pairs whose IoU score is greater than 0.75, which are called the regions of interest;
then, a tracker is added to the region of interest so that the tracker can effectively locate the target in the next frame; the tracker takes the current frame mask and the next frame image as input and predicts the position of the target mask in the next frame; the tracker is used to acquire the mask region of the next frame image, providing temporal consistency for the regions of interest in subsequent frames;
step 3, once the region of interest is located in the next frame, the region of interest with the added tracker is input into the perceptual segmentation network of the invention to train and predict the target; the perceptual segmentation network RoISeg is based on a deep convolutional neural network (CNN) and builds a network framework, called RoISeg for short in the invention, on top of the ResNet101 backbone network; the CNN is a convolutional neural network in deep learning; the ResNet101 backbone is a deep residual learning network that solves the accuracy degradation problem; the region of interest with the added tracker is input into RoISeg to train the model, and a coarse target recognition position and segmentation mask are output;
step 4, the target result predicted by RoISeg for the region of interest with the added tracker in step 3 still has a large error, and the noisy segmentation parts need to be reduced; therefore, the invention constructs a dual attention module method: the feature map output by the last convolutional layer of RoISeg is input into the dual attention module; the dual attention module comprises a spatial attention module and a channel attention module;
the spatial attention module introduces a spatial attention mechanism to capture the spatial dependency between any two spatial positions; the spatial attention mechanism is the functional operation inside the spatial attention module; for the target position features in a frame, the features at each position are updated by aggregating the features at all positions through a weighted sum, where the weights are determined by the feature similarity between the two corresponding positions; in other words, any two positions with similar features can promote each other regardless of their distance in the spatial dimension;
the channel attention module captures the channel dependency between any two channel mappings through a channel attention mechanism, and updates each channel mapping with a weighted sum of all channel mappings; the channel attention mechanism is the functional operation inside the channel attention module;
finally, the two attention modules are fused; the fusion operation is a parallel strategy that combines the two feature vectors into one composite vector; fusing them together enriches the information between the preceding and following frames of the target object, thereby obtaining a better feature representation for video object segmentation; capturing the dependency relationships between the spatial-dimension and channel-dimension features in the dual attention module strengthens the discriminative power of the feature representation in video object segmentation and suppresses interference and noise in video segmentation; after one more convolutional layer, the final segmentation mask result is output.
3. The video object segmentation method with a dual-module neural network structure according to claim 1 or 2, characterized in that:
in step 1, a video is input to a computer, and each frame of the video is a picture; the picture is in RGB format and is denoted RGB picture I; the target label in the image is denoted mask M; the mask is the binary foreground and background of the image;
first, a video and the mask of its first frame are input; the first frame I_0 and the first mask M_0 are input into a transformation network G to obtain transformed image pairs D; the specific expression is as follows:
D_n = G(I_0, M_0)
where G denotes the transformation network; the set of image pairs is D_n = {d_1 m_1, d_2 m_2, ..., d_n m_n}, and D_n indicates that there are n image pairs; d_i m_i denotes the i-th image pair, where d_i is the image generated by the i-th transformation and m_i is the mask generated by the i-th transformation; the image pairs generated by the transformation network are screened to decide whether they serve as regions of interest.
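A minimal stand-in for the transformation network G is sketched below in PyTorch/torchvision: the same randomly drawn rotation, translation, scaling and flip are applied to I_0 and M_0 to yield n image pairs d_i m_i. The parameter ranges and the use of torchvision.transforms.functional are assumptions; the claim only names the four operation types.

```python
import random
import torch
import torchvision.transforms.functional as TF


def transform_pairs(image: torch.Tensor, mask: torch.Tensor, n: int = 100):
    # image: (3, H, W) float tensor I_0; mask: (1, H, W) binary tensor M_0.
    # Returns n synthetic (d_i, m_i) pairs with identical geometric transforms
    # applied to image and mask. Parameter ranges are illustrative only.
    pairs = []
    for _ in range(n):
        angle = random.uniform(-30.0, 30.0)                              # rotation (degrees)
        translate = [random.randint(-40, 40), random.randint(-20, 20)]   # translation (pixels)
        scale = random.uniform(0.8, 1.2)                                 # scaling factor
        flip = random.random() < 0.5                                     # horizontal flip

        d = TF.affine(image, angle=angle, translate=translate, scale=scale, shear=0.0)
        m = TF.affine(mask, angle=angle, translate=translate, scale=scale, shear=0.0)
        if flip:
            d, m = TF.hflip(d), TF.hflip(m)
        pairs.append((d, (m > 0.5).float()))                             # keep the mask binary
    return pairs
```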
4. The video object segmentation method with a dual-module neural network structure according to claim 1 or 2, characterized in that the specific steps of step 2 are as follows:
the image pairs generated by the transformation network are screened to decide whether they serve as regions of interest; the regions of interest are obtained using the target proposal; the target proposal is a fully convolutional network that takes an image of arbitrary size as input and outputs a set of rectangular target proposal boxes; the target proposal operation is performed around the target in the first frame and denoted gt_box, where gt_box is the ground-truth bounding box marked around the target of the first frame; the bounding box generated by the target proposal operation around the target in a generated image is denoted b_box, where b_box is the target proposal box output when an image pair is input into the target proposal; the IoU score between a generated-image target proposal and the first-frame target proposal is computed; the specific expression is as follows:
S = IoU(b_box, gt_box)
where IoU is the functional expression of the intersection over union; the score S is the IoU score between a target proposal box in the image pair and the target proposal box in the first frame; the image pairs whose IoU score satisfies S > 0.75 are taken as regions of interest; then a tracker is added to the region of interest, and the tracker can effectively locate the target in the next frame; the tracker takes the current frame mask and the next frame image as input and can predict the position of the target mask in the next frame; the tracker is used to acquire the mask region of the next frame image, providing temporal consistency for the regions of interest in subsequent frames; a video sequence R = {I_0, I_1, I_2, I_3, ..., I_t, ..., I_n} and the mask M_0 of the first frame I_0 are known; I_t is the t-th frame in the video sequence, t ∈ {1, 2, 3, ..., n}; the masks of the remaining frames in the video sequence are {M_1, M_2, M_3, ..., M_n}; according to the tracker function expression:
M_{t+1} = f(I_{t+1}, M_t)
where f denotes the tracker function, I_{t+1} is the known image of the (t+1)-th frame, M_t is the known mask of the t-th frame image, and M_{t+1} is the computed mask of the (t+1)-th frame; the second frame image and the mask of the first frame image are known, and the mask of the second frame image is obtained through the tracker; because the target moves smoothly in space, adjacent video frames change little and have a certain correlation; the mask M_{t+1} of frame I_{t+1} is predicted from the mask M_t and the frame I_{t+1}; the predicted mask of frame I_{t+1} still has a large error with respect to the real mask M_gt, where M_gt denotes the truly accurate mask; the region of interest with the added tracker is then input into the perceptual segmentation network.
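The two expressions in this claim translate almost directly into code. The sketch below gives a standard box IoU and the mask-propagation recursion M_{t+1} = f(I_{t+1}, M_t); the tracker itself is passed in as a callable, since the patent does not specify which tracking model f is.

```python
from typing import Callable, List, Sequence


def box_iou(a, b) -> float:
    # Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2);
    # image pairs whose proposal satisfies IoU(b_box, gt_box) > 0.75 are kept as
    # regions of interest.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def propagate_masks(frames: Sequence, first_mask, tracker: Callable) -> List:
    # Direct transcription of M_{t+1} = f(I_{t+1}, M_t): starting from the given M_0,
    # each next mask is predicted from the next frame and the current mask.
    masks = [first_mask]
    for frame in frames[1:]:
        masks.append(tracker(frame, masks[-1]))
    return masks
```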
5. The video object segmentation method with a dual-module neural network structure according to claim 1 or 2, characterized in that the specific steps of step 3 are as follows:
based on step 2, the region of interest with the added tracker is input into the perceptual segmentation network of the invention to train and predict the target; the perceptual segmentation network RoISeg is based on a deep convolutional neural network (CNN), and the network framework is built on top of the ResNet101 backbone network; the ResNet101 backbone is a deep residual learning network that solves the accuracy degradation problem and achieves lower training and test errors; the RoISeg network is composed of convolutional layers, pooling layers, activation functions, batch normalization, deconvolution and the like; the initial parameters of RoISeg are set as follows: the learning rate is 0.0001 and the weight decay term is 0.005; the final output of RoISeg is constrained with a weighted cross-entropy loss; the cross-entropy loss expression is as follows:
L(θ) = −β Σ_{x ∈ X+} log P(x; θ) − (1 − β) Σ_{x ∈ X−} log(1 − P(x; θ))
where L(θ) denotes the weighted cross-entropy loss; θ, with value range [0, 1], denotes a weight parameter associated with the current prediction in the network; X+ and X− denote the sets of pixels with positive and negative target sample labels respectively, positive samples being true correct samples and negative samples being prediction error samples, in other words the sets of positive and negative sample pixels of a video frame mask; β is a weighted decay term that penalizes biased samples during training; the probability function P, representing a probability distribution with P ∈ [0, 1], is computed from the activation output of the convolutional layer; the commonly used nonlinear activation function Sigmoid, with value range [0, 1], is used as the activation function; the training output layer of the perceptual segmentation network is constrained by the cross-entropy loss, which is back-propagated into the network for continued training; when the loss gradually decreases during training, it converges to a sufficiently small and stable value; the target segmentation result is output; the output result is a segmentation map of the foreground and background of the mask.
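A hedged PyTorch sketch of such a weighted cross-entropy constraint is given below. The balancing rule β = |X−| / |X| and the optimizer named in the comment are assumptions; the claim only states the Sigmoid activation, the weighted loss, the learning rate 0.0001 and the weight decay 0.005.

```python
import torch


def weighted_cross_entropy(logits: torch.Tensor, target: torch.Tensor) -> torch.Tensor:
    # Class-balanced binary cross-entropy in the spirit of the loss above: the Sigmoid
    # turns the last convolution's output into probabilities P in [0, 1], and beta
    # weights the positive/negative pixel sets X+ / X- so the abundant background
    # pixels do not dominate. The exact weighting used by the patent is not published;
    # this follows the common class-balancing choice beta = |X-| / |X|.
    p = torch.sigmoid(logits).clamp(1e-6, 1 - 1e-6)
    pos, neg = target > 0.5, target <= 0.5
    beta = neg.float().sum() / target.numel()
    loss = -(beta * torch.log(p[pos]).sum() + (1.0 - beta) * torch.log(1.0 - p[neg]).sum())
    return loss / target.numel()


# Training setup with the stated hyper-parameters (the optimizer choice is an assumption):
# optimizer = torch.optim.SGD(model.parameters(), lr=1e-4, weight_decay=5e-3)
```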
6. The video object segmentation method with a dual-module neural network structure according to claim 1 or 2, characterized in that the specific steps of step 4 are as follows:
the target result predicted by the RoISeg network in step 3 still has a large error, and the noisy segmentation parts need to be reduced; therefore, the invention constructs a dual attention module method; the feature map output by the last convolutional layer of RoISeg is input into the two attention modules respectively; the two attention modules are a spatial attention module and a channel attention module; the specific steps are as follows:
spatial attention module: a spatial attention mechanism is introduced to enrich the contextual feature dependencies of the target in a video frame; the operation of the spatial attention mechanism is described in detail as follows; the feature map output by the convolutional layer of RoISeg is denoted A, with A ∈ R^{C×H×W}, where R denotes the set of real numbers, the shape of A is C×H×W, C denotes the number of channels, H the height and W the width; first, three new feature maps B, D and F are generated from the shared feature map A, where {B, D, F} ∈ R^{C×H×W}; their shapes are then reshaped to R^{C×N}, where N = H×W is the product of height and width; then B is transposed and matrix-multiplied with D, and a softmax layer is applied to compute the spatial-dimension attention feature map S ∈ R^{N×N}; the specific expression is as follows:
S_{ji} = exp(B_i · D_j) / Σ_{i=1}^{N} exp(B_i · D_j)
where S_{ji} measures the influence of the i-th spatial position on the j-th spatial position, and exp denotes the exponential function; this captures the spatial dependency between any two spatial positions: the more similar the features of two positions, the stronger the mutual response between them; the shape of F is R^{C×N}; then a matrix multiplication is performed between F and the transpose of S, the feature map resulting from the matrix multiplication has shape R^{C×N} and is reshaped to R^{C×H×W}; finally it is multiplied by a scale parameter α and an element-wise sum with feature A is performed to obtain the output feature map E_1; the specific expression is as follows:
E_{1j} = α Σ_{i=1}^{N} (S_{ji} F_i) + A_j
where α is a weight coefficient initialized to 0, α ∈ [0, 1], and it is gradually assigned more weight; the shape of the summed output feature map E_1 is E_1 ∈ R^{C×H×W}; for the target feature positions in the video frame, the features are updated by aggregating over positions with a weighted sum, where the weights are determined by the feature similarity between the two corresponding positions; in other words, any two positions with similar features can promote each other regardless of their distance in the spatial dimension; contextual feature representations are selectively aggregated according to the spatial attention map, thereby improving the information interdependence within the same class.
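The description of this module closely follows the well-known position-attention design, so a PyTorch sketch is given below. How B, D and F are derived from A is not spelled out in the claim; 1x1 convolutions with C output channels are assumed here, and α is a learnable scalar initialized to 0 as stated.

```python
import torch
import torch.nn as nn


class SpatialAttentionModule(nn.Module):
    # Sketch of the spatial attention module: B, D and F are derived from A (here with
    # 1x1 convolutions, an assumption), the N x N affinity S is the softmax of B^T D,
    # and E1 = alpha * (F S^T, reshaped) + A with alpha initialized to 0.
    def __init__(self, channels: int):
        super().__init__()
        self.to_b = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_d = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_f = nn.Conv2d(channels, channels, kernel_size=1)
        self.alpha = nn.Parameter(torch.zeros(1))            # learnable scale, starts at 0

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        bsz, c, h, w = a.shape
        n = h * w
        b = self.to_b(a).view(bsz, c, n).permute(0, 2, 1)    # (batch, N, C)
        d = self.to_d(a).view(bsz, c, n)                     # (batch, C, N)
        s = torch.softmax(torch.bmm(b, d), dim=-1)           # (batch, N, N) position affinity
        f = self.to_f(a).view(bsz, c, n)                     # (batch, C, N)
        e1 = torch.bmm(f, s.permute(0, 2, 1)).view(bsz, c, h, w)
        return self.alpha * e1 + a                           # E1 = alpha * aggregation + A
```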
7. The video object segmentation method with a dual-module neural network structure according to claim 6, characterized in that the specific steps of step 4 further comprise:
channel attention module: the channel dependency between any two channel mappings is captured through the channel attention mechanism operation; the feature map output by the convolutional layer of RoISeg is likewise denoted A, with A ∈ R^{C×H×W}, where R denotes the set of real numbers, the shape of A is C×H×W, C denotes the number of channels, H the height and W the width; two new feature maps M and N are generated from the shared feature map A, where {M, N} ∈ R^{C×H×W}; their shapes are then reshaped to R^{C×N}; a matrix multiplication is performed between M and the transpose of N, directly computing the channel feature map X ∈ R^{C×C}; a softmax layer is applied to obtain the channel attention feature map X ∈ R^{C×C}; the specific expression is as follows:
X_{ji} = exp(M_i · N_j) / Σ_{i=1}^{C} exp(M_i · N_j)
where X_{ji} measures the influence of the i-th channel on the j-th channel, and the channel attention module captures the channel dependency between any two channel mappings; in addition, a matrix multiplication is performed between the matrix X and the feature map A reshaped to R^{C×N}, the result with shape R^{C×N} is reshaped to R^{C×H×W}, multiplied by a scale weight parameter β, and an element-wise sum with A is performed to obtain the output feature map E_2; the specific expression is as follows:
E_{2j} = β Σ_{i=1}^{C} (X_{ji} A_i) + A_j
where β is a weight coefficient whose initialization is set to 0.3, β ∈ [0, 1]; the shape of the summed output feature map E_2 is E_2 ∈ R^{C×H×W}; the channel dependency between the channel mappings of the feature map is thereby modeled, which helps to improve the discriminability of the model; the channel attention module enhances the channel-wise target features and makes them more prominent, so that the network can identify the target in the video frame;
the two attention modules are fused; the fusion operation combines the two feature vectors into one composite vector; the feature map E_1 output by the spatial attention module and the feature map E_2 output by the channel attention module are fused to obtain a new feature map O; the specific expression is as follows:
O = f(E_1, E_2)
where O is the output of the fused feature map, with shape O ∈ R^{C×H×W}; the function f denotes the fusion operation; the shape of feature map E_1 is E_1 ∈ R^{C×H×W} and the shape of feature map E_2 is E_2 ∈ R^{C×H×W}; after fusion, the feature information between the preceding and following frames of the target object becomes more distinct, thereby obtaining a better feature representation for video target object segmentation.
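A PyTorch sketch of the channel attention module and the fusion O = f(E_1, E_2) follows. The claim does not say how M and N are generated from A, so plain reshapes are used here; β is initialized to 0.3 as stated, and the element-wise sum used for f is an assumption that keeps the C×H×W shape given for O.

```python
import torch
import torch.nn as nn


class ChannelAttentionModule(nn.Module):
    # Sketch of the channel attention module: the C x C affinity X is the softmax of
    # M N^T, where M and N are taken as plain reshapes of A (an assumption), and
    # E2 = beta * (X A, reshaped) + A with beta initialized to 0.3 as stated.
    def __init__(self):
        super().__init__()
        self.beta = nn.Parameter(torch.full((1,), 0.3))

    def forward(self, a: torch.Tensor) -> torch.Tensor:
        bsz, c, h, w = a.shape
        m = a.view(bsz, c, -1)                               # (batch, C, N)
        n = a.view(bsz, c, -1).permute(0, 2, 1)              # (batch, N, C)
        x = torch.softmax(torch.bmm(m, n), dim=-1)           # (batch, C, C) channel affinity
        e2 = torch.bmm(x, a.view(bsz, c, -1)).view(bsz, c, h, w)
        return self.beta * e2 + a                            # E2 = beta * aggregation + A


def fuse(e1: torch.Tensor, e2: torch.Tensor) -> torch.Tensor:
    # O = f(E1, E2): an element-wise sum keeps the C x H x W shape stated for O;
    # the patent does not pin down the exact fusion operation.
    return e1 + e2
```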
CN201911125917.3A 2019-11-15 2019-11-15 Video object segmentation method for dual-module neural network structure Active CN110910391B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911125917.3A CN110910391B (en) 2019-11-15 2019-11-15 Video object segmentation method for dual-module neural network structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911125917.3A CN110910391B (en) 2019-11-15 2019-11-15 Video object segmentation method for dual-module neural network structure

Publications (2)

Publication Number Publication Date
CN110910391A true CN110910391A (en) 2020-03-24
CN110910391B CN110910391B (en) 2023-08-18

Family

ID=69816867

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911125917.3A Active CN110910391B (en) 2019-11-15 2019-11-15 Video object segmentation method for dual-module neural network structure

Country Status (1)

Country Link
CN (1) CN110910391B (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103942794A (en) * 2014-04-16 2014-07-23 南京大学 Image collaborative cutout method based on confidence level
US20190311202A1 (en) * 2018-04-10 2019-10-10 Adobe Inc. Video object segmentation by reference-guided mask propagation
CN109272530A (en) * 2018-08-08 2019-01-25 北京航空航天大学 Method for tracking target and device towards space base monitoring scene
CN110084829A (en) * 2019-03-12 2019-08-02 上海阅面网络科技有限公司 Method for tracking target, device, electronic equipment and computer readable storage medium

Non-Patent Citations (8)

* Cited by examiner, † Cited by third party
Title
ALEKSANDR FEDOROV 等: "Traffic flow estimation with data from a video surveillance camera", 《JOURNAL OF BIG DATA》 *
JIFENG DAI 等: "BoxSup: Exploiting Bounding Boxes to Supervise Convolutional Networks for Semantic Segmentation", 《PROCEEDINGS OF THE IEEE INTERNATIONAL CONFERENCE ON COMPUTER VISION (ICCV)》 *
JUN FU 等: "Dual Attention Network for Scene Segmentation", 《2019 IEEE/CVF CONFERENCE ON COMPUTER VISION AND PATTERN RECOGNITION (CVPR)》 *
LINBOWANG 等: "A unified two-parallel-branch deep neural network for joint gland contour and segmentation learning", 《FUTURE GENERATION COMPUTER SYSTEMS》 *
任长安 等: "融合目标空间分割的网格任务调度算法", 《控制工程》 *
崔智高 等: "动态背景下融合运动线索和颜色信息的视频目标分割算法", 《光电子.激光》 *
洪朝群 等: "感兴趣区域中的快速人脸区域分割与跟踪方法", 《厦门理工学院学报》 *
黄元捷: "结合跟踪技术的视频目标分割方法研究", 《中国优秀博硕士学位论文全文数据库(硕士) 信息科技辑(月刊)》 *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111461772A (en) * 2020-03-27 2020-07-28 上海大学 Video advertisement integration system and method based on generation countermeasure network
CN111583288A (en) * 2020-04-21 2020-08-25 西安交通大学 Video multi-target association and segmentation method and system
CN111462140A (en) * 2020-04-30 2020-07-28 同济大学 Real-time image instance segmentation method based on block splicing
CN111462140B (en) * 2020-04-30 2023-07-07 同济大学 Real-time image instance segmentation method based on block stitching
CN111968150A (en) * 2020-08-19 2020-11-20 中国科学技术大学 Weak surveillance video target segmentation method based on full convolution neural network
CN112085717A (en) * 2020-09-04 2020-12-15 厦门大学 Video prediction method and system for laparoscopic surgery
CN112085717B (en) * 2020-09-04 2024-03-19 厦门大学 Video prediction method and system for laparoscopic surgery
CN112288755A (en) * 2020-11-26 2021-01-29 深源恒际科技有限公司 Video-based vehicle appearance component deep learning segmentation method and system
WO2022133627A1 (en) * 2020-12-21 2022-06-30 广州视源电子科技股份有限公司 Image segmentation method and apparatus, and device and storage medium
CN113033428A (en) * 2021-03-30 2021-06-25 电子科技大学 Pedestrian attribute identification method based on instance segmentation
CN113421280A (en) * 2021-05-31 2021-09-21 江苏大学 Method for segmenting reinforcement learning video object by integrating precision and speed
CN113044561A (en) * 2021-05-31 2021-06-29 山东捷瑞数字科技股份有限公司 Intelligent automatic material conveying method
CN113298036B (en) * 2021-06-17 2023-06-02 浙江大学 Method for dividing unsupervised video target
CN113298036A (en) * 2021-06-17 2021-08-24 浙江大学 Unsupervised video target segmentation method
CN116030247A (en) * 2023-03-20 2023-04-28 之江实验室 Medical image sample generation method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
CN110910391B (en) 2023-08-18

Similar Documents

Publication Publication Date Title
CN110910391A (en) Video object segmentation method with dual-module neural network structure
Tokmakov et al. Learning to segment moving objects
Zhang et al. C2FDA: Coarse-to-fine domain adaptation for traffic object detection
CN107945204B (en) Pixel-level image matting method based on generation countermeasure network
CN109886121B (en) Human face key point positioning method for shielding robustness
CN106547880B (en) Multi-dimensional geographic scene identification method fusing geographic area knowledge
Li et al. Deep neural network for structural prediction and lane detection in traffic scene
Huang et al. Scribble-supervised video object segmentation
CN111507993B (en) Image segmentation method, device and storage medium based on generation countermeasure network
CN113673338B (en) Automatic labeling method, system and medium for weak supervision of natural scene text image character pixels
CN111652836A (en) Multi-scale target detection method based on clustering algorithm and neural network
Zhao et al. Bitnet: A lightweight object detection network for real-time classroom behavior recognition with transformer and bi-directional pyramid network
Li et al. AEMS: an attention enhancement network of modules stacking for lowlight image enhancement
CN111179272A (en) Rapid semantic segmentation method for road scene
Du et al. Adaptive visual interaction based multi-target future state prediction for autonomous driving vehicles
Fu et al. Purifying real images with an attention-guided style transfer network for gaze estimation
CN113096133A (en) Method for constructing semantic segmentation network based on attention mechanism
CN114882493A (en) Three-dimensional hand posture estimation and recognition method based on image sequence
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
Castro et al. AttenGait: Gait recognition with attention and rich modalities
CN116258937A (en) Small sample segmentation method, device, terminal and medium based on attention mechanism
CN113255493B (en) Video target segmentation method integrating visual words and self-attention mechanism
Musat et al. Depth-sims: Semi-parametric image and depth synthesis
Xiong et al. TFA-CNN: an efficient method for dealing with crowding and noise problems in crowd counting
Liu et al. Spatiotemporal saliency based multi-stream networks for action recognition

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant