CN111161306A - Video target segmentation method based on motion attention
- Publication number: CN111161306A (application CN201911402450.2A)
- Authority: CN (China)
- Prior art keywords: current frame, attention, feature map, frame, motion
- Prior art date: 2019-12-31
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G—PHYSICS; G06—COMPUTING; G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T7/215—Motion-based segmentation
- G06T5/30—Erosion or dilatation, e.g. thinning
- G06T2207/10016—Video; Image sequence
- G06T2207/20081—Training; Learning
- G06T2207/20084—Artificial neural networks [ANN]
Abstract
The invention provides a video target segmentation method based on motion attention, in which the channel feature map output by a channel attention module and the position feature map output by a motion attention module are added to obtain the segmentation result of the current frame. The input of the channel attention module is the current frame feature map Ft and the appearance feature map F0 of the target object provided by the first frame; the channel attention module computes the correlation between the channels of the input feature maps Ft and F0, and the output channel feature map reflects the object in the current frame whose appearance is closest to the target object. The input of the motion attention module is the current frame feature map Ft and the position information Ht-1 of the target object predicted by the memory module in the previous frame's motion attention network; the motion attention module computes the correlation between the positions of the input feature maps Ft and Ht-1, and the output position feature map reflects the approximate position of the target object in the current frame. The invention combines the two factors of appearance and position to achieve more accurate segmentation of the video target.
Description
Technical Field
The invention belongs to the field of image processing and computer vision, and relates to a video target segmentation method, in particular to a video target segmentation method based on motion attention.
Background
Video object segmentation is a prerequisite for many video tasks and plays a significant role in fields such as object recognition and video compression. Video object segmentation can be defined as tracking the target object and segmenting it according to the object mask. Depending on whether an initial mask is given, video object segmentation can be divided into semi-supervised and unsupervised settings: in the semi-supervised setting, the segmentation mask is manually initialized in the first frame of the video and the target object is then tracked and segmented; unsupervised methods automatically segment the target objects in a given video according to some mechanism, without any prior information.
In a video scene, background clutter, object deformation and fast object motion all affect the segmentation result. Traditional video object segmentation techniques adopt a rigid background motion model combined with scene priors to segment the target object; because of these assumptions, they have clear limitations in practical applications. Most existing video object segmentation techniques adopt convolutional neural networks, but they also have various shortcomings: for example, many of them segment the moving objects in a video by relying on the optical flow between frames, so the segmentation result is easily affected by optical flow estimation errors. In addition, these methods do not fully exploit the temporal information in the video and do not memorize the relevant features of the target object in the scene.
To address these problems, the invention studies the segmentation of moving targets in the semi-supervised setting and proposes a video target segmentation method based on motion attention with a memory module.
Disclosure of Invention
The invention aims to solve the following problems: in video object segmentation, if the target object of the current frame is determined only by the segmentation result of the previous frame, the accurate position of the target object cannot be obtained, and over-reliance on the previous frame's segmentation result can even cause the target to drift; most existing video object segmentation methods based on motion information segment the object using the optical flow between the current frame and the previous frame, which not only requires a large amount of computation but also restricts the segmentation to specific motion patterns. A new video object segmentation method based on motion information is therefore needed to improve the segmentation quality.
To solve the above problems, the present invention provides a video target segmentation method based on motion attention, which fuses the motion and temporal information in a video sequence and performs video object segmentation based on an attention mechanism. The method comprises the following steps:
1) Construct a segmentation backbone network, and input the current frame It and the first frame I0 into the backbone network to obtain the corresponding feature maps Ft and F0;
2) Construct a motion attention network. Take the current frame feature map Ft, the first frame feature map F0 and the hidden state Ht-1 of the previous frame's memory module as the inputs of the motion attention network; the output Fout of the motion attention network is the segmentation result of the current frame;
3) Construct a loss function consisting of two parts: the first part is a pixel-level loss function and the second part is a structural similarity loss function.
Further, constructing the segmentation backbone network in step 1) to obtain the feature maps Ft and F0 specifically comprises:
1.1) Modify the Resnet-50 network and incorporate dilated (atrous) convolutions. First, set the dilation factor of conv_1 in Resnet-50 to 2; second, delete the pooling layer in Resnet-50; then set the stride of the conv_3 and conv_4 layers in Resnet-50 to 1; finally, use the modified Resnet-50 as the backbone network, whose output feature map is 1/8 of the original image size;
1.2) Input the current frame It into the backbone network to obtain the feature map Ft of the current frame;
1.3) Input the first frame I0 into the backbone network to obtain the feature map F0 of the first frame.
Further, step 2) constructs the motion attention network to obtain the segmentation result of the current frame.
The motion attention network consists of a channel attention module, a motion attention module and a memory module, which are constructed as follows:
2.1) Construct the channel attention module, with Ft and F0 as its inputs. F0 provides appearance information of the target object such as color and pose. First, Ft and F0 are combined by matrix multiplication and a softmax function to obtain the channel-weight attention map Xc of the target object; Xc describes the correlation between the channels of the current frame and those of the first frame: the higher the correlation, the higher the response value and the more similar the features. Then Xc is multiplied with Ft for feature enhancement, and the result is added to Ft as a residual to obtain the channel feature map;
2.2) Construct the motion attention module, with Ft and Ht-1 as its inputs. Ht-1 provides the position information of the target object in the current frame, predicted from the previous frame's segmentation result and the temporal information. First, the feature map Ft is passed through two convolution layers with 1×1 kernels to obtain two feature maps, denoted Fa and Fb; then Fa and Ht-1 are combined by matrix multiplication and a softmax function to obtain the position-weight attention map Xs of the target object; finally, Xs is multiplied with Fb for feature enhancement, and the result is added to Ft as a residual to obtain the position feature map;
2.3) Add the channel feature map and the position feature map to obtain the final segmentation result Fout of the current frame.
2.4) Construct the memory module convLSTM. The current frame segmentation result Fout, the memory cell Ct-1 output by the previous frame's memory module and the hidden state Ht-1 of the previous frame's memory module are the inputs of this module; its outputs are the memory cell Ct and the hidden state Ht.
The convLSTM consists of an input gate, a forget gate and an output gate.
Further, constructing the memory module convLSTM in step 2.4) specifically comprises:
2.4.1) First, the forget gate discards part of the state information in the memory cell Ct-1 of the previous frame's memory module; then the input gate stores the useful information of the current frame segmentation result Fout into the previous frame's memory cell Ct-1; finally, the updated memory cell Ct of the current frame is output;
2.4.2) First, the output gate filters the current frame segmentation result Fout and the hidden state Ht-1 of the previous frame's memory module with a sigmoid function to determine the information to be output; then the tanh activation function is applied to the memory cell Ct of the current frame; finally, the information to be output is multiplied element-wise with the activated memory cell Ct of the current frame to obtain and output the hidden state Ht of the current frame.
Advantageous effects
The invention provides a video target segmentation method based on motion attention. The feature maps of the first frame and the current frame are first obtained; then the current frame feature map Ft, the appearance feature map F0 of the target object provided by the first frame, and the position information Ht-1 of the target object predicted by the memory module in the previous frame's motion attention network are input into the current frame's motion attention network to obtain the segmentation result of the current frame. The method copes with the diversity of motion patterns that other segmentation methods cannot handle. It is suitable for video object segmentation and achieves good robustness and accurate segmentation.
The invention has the following characteristics: first, instead of relying only on the segmentation result of the previous frame, it segments the target object more accurately by means of the appearance information of the target object in the first frame and the temporal information of the target object in the video sequence; second, the motion attention network greatly suppresses useless features and improves the robustness of the model.
Drawings
FIG. 1 is a flow chart of the video target segmentation method based on motion attention according to the present invention;
FIG. 2 is a network architecture diagram of the video target segmentation method based on motion attention according to the present invention;
FIG. 3 is a structure diagram of Resnet-50;
FIG. 4 is a structure diagram of the modified Resnet-50 used in the video target segmentation method based on motion attention according to the present invention.
Detailed Description
The invention provides a video target segmentation method based on motion attention: the feature maps of the first frame and the current frame are first obtained, and then the feature map of the first frame, the feature map of the current frame and the target object position information predicted by the memory module in the previous frame's motion attention network are input into the motion attention network to obtain the segmentation result of the current frame. The method is suitable for video object segmentation and achieves good robustness and accurate segmentation results.
The invention is explained in more detail below with reference to specific examples and the accompanying drawings.
The invention comprises the following steps:
1) Acquire the YouTube and DAVIS datasets, used respectively as the training set and the test set of the model;
2) Preprocess the training data. Crop each training sample (video frame) and the first-frame mask of the video sequence, resize the images to 224×224 resolution, and perform data augmentation such as rotation;
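A minimal preprocessing sketch for this step, assuming torchvision transforms and an illustrative rotation range; in practice the same geometric transform must be applied to a frame and its first-frame mask:

```python
from torchvision import transforms

# Frames and first-frame masks are resized to 224 x 224; rotation is used for
# augmentation (the 15-degree range here is an illustrative assumption).
frame_transform = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.RandomRotation(degrees=15),
    transforms.ToTensor(),
])
```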
3) Construct the segmentation backbone network; input the first frame segmentation mask of the video sequence and the current frame to obtain their segmentation feature maps;
3.1) First, set the dilation factor of conv_1 in Resnet-50 to 2; second, delete the pooling layer in Resnet-50; then set the stride of the conv_3 and conv_4 layers in Resnet-50 to 1; finally, use the modified Resnet-50 as the backbone network, whose output feature map is 1/8 of the original image size, as shown in FIG. 4;
3.2) The resolution of the first frame segmentation mask is 224×224; it is input into the backbone network to obtain the feature map F0 of the first frame segmentation mask, of size 2048×28×28;
3.3) The resolution of the current frame is 224×224; it is input into the backbone network to obtain the feature map Ft of the current frame, of size 2048×28×28.
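As an illustration of steps 3.1)-3.3), the following is a minimal PyTorch sketch of a dilated Resnet-50 backbone. It assumes torchvision's resnet50 with its replace_stride_with_dilation option; the mapping of the patent's conv_1/conv_3/conv_4 names onto torchvision layers is an assumption, and the sketch is written to reproduce the stated output size of 2048×28×28 for a 224×224 input (1/8 of the original resolution).

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class DilatedResNet50Backbone(nn.Module):
    def __init__(self):
        super().__init__()
        # Dilated (atrous) convolutions keep the last stage at 1/8 resolution.
        net = resnet50(weights=None, replace_stride_with_dilation=[False, False, True])
        self.stem = nn.Sequential(net.conv1, net.bn1, net.relu)  # the max-pool layer is deleted
        self.layer1, self.layer2 = net.layer1, net.layer2
        self.layer3, self.layer4 = net.layer3, net.layer4

    def forward(self, x):
        x = self.stem(x)    # 1/2 of the input resolution
        x = self.layer1(x)  # 1/2
        x = self.layer2(x)  # 1/4
        x = self.layer3(x)  # 1/8
        x = self.layer4(x)  # 1/8, 2048 channels
        return x

backbone = DilatedResNet50Backbone()
f_t = backbone(torch.randn(1, 3, 224, 224))  # feature map Ft of the current frame
print(f_t.shape)                             # torch.Size([1, 2048, 28, 28])
```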
4) Construct the motion attention network. Input the current frame feature map Ft, the hidden state Ht-1 of the previous frame's memory module and the first frame feature map F0 to obtain the segmentation result Fout of the current frame. The motion attention network consists of a channel attention module, a motion attention module and a memory module.
4.1) Construct the channel attention module. Input the current frame feature map Ft and the first frame feature map F0 to obtain the channel feature map Ec; the steps are as follows:
4.1.1) Call the reshape function in python to adjust the size of Ft, converting it to a feature map F't of size n×2048; call the reshape function in python to adjust the size of F0, converting it to a feature map F'0 of size 2048×n, where n represents the total number of pixels of the current frame;
4.1.2) Multiply F'0 and F't as matrices and apply the softmax function.
Matrix multiplication exploits and fuses global information, and its effect is similar to a fully connected operation. A fully connected operation can take the relations between all positions into account but destroys the spatial structure, so matrix multiplication is used instead of full connection, preserving the spatial information as much as possible while still using the global information;
4.1.3) This yields the channel-weight attention map Xc of size 2048×2048. The element x_ji in the j-th row and i-th column of Xc is:
x_ji = exp(F'0j · F'ti) / Σi exp(F'0j · F'ti), with the sum running over the C channels,
where F'ti denotes the i-th column of F't, i.e. the i-th channel of the current frame feature map Ft; F'0j denotes the j-th row of F'0, i.e. the j-th channel of the first frame feature map F0; x_ji measures the correlation between the i-th channel of the current frame feature map Ft and the j-th channel of the first frame feature map F0; and C denotes the number of channels of the current frame feature map Ft.
4.1.4) The channel-weight attention map Xc is multiplied with the feature map F't to enhance the feature map Ft of the current frame, and the result is added to the current frame feature map Ft as a residual to obtain the channel feature map Ec:
Ecj = β · Σi (x_ji · F'ti) + Ftj, with the sum running over the C channels,
where β denotes the channel attention weight, whose initial value is set to zero and to which the model assigns a larger, more reasonable weight through learning; F'ti denotes the i-th column of F't, i.e. the i-th channel of the current frame feature map Ft; Ftj denotes the j-th channel of the current frame feature map Ft; and C denotes the number of channels of the current frame feature map Ft.
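A minimal PyTorch sketch of the channel attention module described in steps 4.1.1)-4.1.4); class and variable names are illustrative assumptions, and the learnable weight beta is initialized to zero as stated above.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self):
        super().__init__()
        # beta starts at zero; the model learns a larger weight during training.
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, f_t, f_0):
        b, c, h, w = f_t.shape                      # e.g. 1 x 2048 x 28 x 28
        ft_flat = f_t.view(b, c, h * w)             # F't viewed as C x n per sample
        f0_flat = f_0.view(b, c, h * w)             # F'0 viewed as C x n per sample
        # Channel-weight attention map Xc (C x C): channel correlation between
        # the first frame feature map and the current frame feature map.
        x_c = torch.softmax(torch.bmm(f0_flat, ft_flat.transpose(1, 2)), dim=-1)
        # Enhance the current-frame features and add the residual connection.
        enhanced = torch.bmm(x_c, ft_flat).view(b, c, h, w)
        return self.beta * enhanced + f_t           # channel feature map Ec

e_c = ChannelAttention()(torch.randn(1, 2048, 28, 28), torch.randn(1, 2048, 28, 28))
```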
4.2) Construct the motion attention module. Input the current frame feature map Ft and the hidden state Ht-1 of the previous frame's memory module to obtain the position feature map Es; the steps are as follows:
4.2.1) Ft is passed through two convolution layers with 1×1 kernels to obtain two feature maps, denoted Fa and Fb, each of size 2048×28×28;
4.2.2) Call the reshape function in python to adjust the size of Fa, converting it to a feature map F'a of size 2048×n; call the reshape function in python to adjust the size of Fb, converting it to a feature map F'b of size 2048×n; call the reshape and transpose functions in python to adjust the size of Ht-1, converting it to a feature map H't-1 of size n×2048, where n represents the total number of pixels of the current frame;
4.2.3) H't-1 and F'a are multiplied as matrices and the softmax function is applied to obtain the position-weight attention map Xs of size n×n:
s_ji = exp(hj · F'ai) / Σi exp(hj · F'ai), with the sum running over the n positions,
where n denotes the number of pixels of the current frame; F'ai denotes the i-th column of F'a, i.e. the i-th position of Fa; hj denotes the j-th row of H't-1, i.e. the j-th position of Ht-1; s_ji is the element in the j-th row and i-th column of the position-weight attention map Xs, and measures the correlation between the j-th position of the hidden state Ht-1 and the i-th position of the current frame feature map Ft.
4.2.4) The position-weight attention map Xs is multiplied with the feature map F'b as matrices to enhance the feature map Ft of the current frame, and the result is added to Ft as a residual to obtain the fused position feature map Es:
Esj = α · Σi (s_ji · F'bi) + Ftj, with the sum running over the n positions,
where α denotes the position attention weight, whose initial value is set to zero and to which the model assigns a larger, more reasonable weight through learning; F'bi denotes the i-th column of F'b, i.e. the i-th position of the current frame feature map Ft; Ftj denotes the j-th position of the current frame feature map Ft; and n denotes the total number of pixels of the current frame.
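A minimal PyTorch sketch of the motion attention module described in steps 4.2.1)-4.2.4); it assumes that the hidden state Ht-1 has the same 2048-channel, 28×28 layout as Ft, and the learnable weight alpha is initialized to zero as stated above.

```python
import torch
import torch.nn as nn

class MotionAttention(nn.Module):
    def __init__(self, channels=2048):
        super().__init__()
        self.conv_a = nn.Conv2d(channels, channels, kernel_size=1)  # produces Fa
        self.conv_b = nn.Conv2d(channels, channels, kernel_size=1)  # produces Fb
        # alpha starts at zero; the model learns a larger weight during training.
        self.alpha = nn.Parameter(torch.zeros(1))

    def forward(self, f_t, h_prev):
        b, c, h, w = f_t.shape
        n = h * w
        f_a = self.conv_a(f_t).view(b, c, n)           # F'a : C x n
        f_b = self.conv_b(f_t).view(b, c, n)           # F'b : C x n
        h_flat = h_prev.view(b, c, n).transpose(1, 2)  # H't-1 : n x C
        # Position-weight attention map Xs (n x n): correlation between the
        # predicted target positions in Ht-1 and the positions of the current frame.
        x_s = torch.softmax(torch.bmm(h_flat, f_a), dim=-1)
        # Enhance the current-frame features and add the residual connection.
        enhanced = torch.bmm(f_b, x_s.transpose(1, 2)).view(b, c, h, w)
        return self.alpha * enhanced + f_t             # position feature map Es

e_s = MotionAttention()(torch.randn(1, 2048, 28, 28), torch.randn(1, 2048, 28, 28))
```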
4.3) Add the position feature map Es and the channel feature map Ec to obtain the final segmentation result Fout of the current frame.
4.4) Construct the memory module. Input the current frame segmentation result Fout, the hidden state Ht-1 of the previous frame's memory module and the memory cell Ct-1 of the previous frame's memory module. The memory module convLSTM of the current frame consists of a forget gate ft, an input gate it and an output gate ot;
4.4.1) Each value of the tensor output by the forget gate lies between 0 and 1, where 0 means completely forgetting and 1 means completely retaining; the forget gate can therefore selectively discard information in the previous frame's memory cell Ct-1. Its formula is:
ft = σ(Wxf * Fout + Whf * Ht-1 + bf)
where * denotes the convolution operation, σ denotes the sigmoid function, Fout denotes the segmentation result of the current frame, and Ht-1 denotes the hidden state of the previous frame's memory module; Wxf and Whf are weight parameters with values between 0 and 1; bf is the bias, whose initial value is set to 0.1 and to which the model assigns a more reasonable value through learning;
4.4.2) The input gate selects the content to be updated from the current frame segmentation result Fout; its formula is:
it = σ(Wxi * Fout + Whi * Ht-1 + bi)
where * denotes the convolution operation, σ denotes the sigmoid function, Fout denotes the segmentation result of the current frame, and Ht-1 denotes the hidden state of the previous frame's memory module; Wxi and Whi are weight parameters with values between 0 and 1; bi is the bias, whose initial value is set to 0.1 and to which the model assigns a more reasonable value through learning;
4.4.3) The forget gate discards part of the previous frame's memory cell Ct-1, the useful information is stored into the previous frame's memory cell Ct-1, and the updated memory cell Ct of the current frame is output; its formula is:
Ct = ft ⊙ Ct-1 + it ⊙ tanh(Wxc * Fout + Whc * Ht-1 + bc)
where * denotes the convolution operation, ⊙ is the Hadamard product, Fout denotes the segmentation result of the current frame, and Ht-1 denotes the hidden state of the previous frame's memory module; Wxc and Whc are weight parameters with values between 0 and 1; bc is the bias, whose initial value is set to 0.1 and to which the model assigns a more reasonable value through learning;
4.4.4) The hidden state Ht output by the current frame memory module is:
ot = σ(Wxo * Fout + Who * Ht-1 + bo)
Ht = ot ⊙ tanh(Ct)
where tanh is the activation function, ⊙ is the Hadamard product, ot denotes the output gate of the current frame memory module, * denotes the convolution operation, Fout denotes the segmentation result of the current frame, and Ht-1 denotes the hidden state of the previous frame's memory module; Wxo and Who are weight parameters with values between 0 and 1; bo is the bias, whose initial value is set to 0.1 and to which the model assigns a more reasonable value through learning.
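A minimal PyTorch sketch of a convLSTM cell implementing the gate equations of steps 4.4.1)-4.4.4); the 3×3 kernel size and the fused four-gate convolution are implementation assumptions, and the gate biases are initialized to 0.1 as stated above.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        pad = kernel_size // 2
        # One convolution computes all four gates (forget, input, candidate, output).
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=pad)
        nn.init.constant_(self.gates.bias, 0.1)  # biases bf, bi, bc, bo start at 0.1

    def forward(self, f_out, h_prev, c_prev):
        z = self.gates(torch.cat([f_out, h_prev], dim=1))
        f, i, g, o = torch.chunk(z, 4, dim=1)
        f_t = torch.sigmoid(f)          # forget gate ft
        i_t = torch.sigmoid(i)          # input gate it
        g_t = torch.tanh(g)             # candidate memory
        o_t = torch.sigmoid(o)          # output gate ot
        c_t = f_t * c_prev + i_t * g_t  # memory cell Ct
        h_t = o_t * torch.tanh(c_t)     # hidden state Ht
        return h_t, c_t

# Reduced channel count for the example; the feature maps in the text have 2048 channels.
cell = ConvLSTMCell(in_channels=64, hidden_channels=64)
x = torch.randn(1, 64, 28, 28)
h_t, c_t = cell(x, torch.zeros_like(x), torch.zeros_like(x))
```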
6) The loss function used by the segmentation model consists of two parts: the first part is a pixel-level loss function and the second part is a structural similarity loss function, designed as follows:
l=lcross+lssim
6.1) lcross denotes the pixel-level cross-entropy loss function:
lcross = -Σ(r,c) [ T(r,c)·log S(r,c) + (1-T(r,c))·log(1-S(r,c)) ]
where T(r,c) denotes the pixel value at row r and column c of the target mask, and S(r,c) denotes the pixel value at row r and column c of the segmentation result;
6.2) lssim denotes the structural similarity loss function, which compares the difference between the target mask and the segmentation result in terms of luminance, contrast and structure:
lssim = 1 - ((2·μx·μy + C1)·(2·σxy + C2)) / ((μx² + μy² + C1)·(σx² + σy² + C2))
where Ax and Ay denote regions of the same size cropped from the segmentation map predicted by the model and from the target mask respectively; xi denotes the pixel value of the i-th pixel in region Ax and yi the pixel value of the i-th pixel in region Ay; N denotes the total number of pixels in the cropped regions; C1 and C2 are constants that prevent the denominator from being zero, with C1 set to 6.5025 and C2 set to 58.5225; μx denotes the average luminance of Ax and μy the average luminance of Ay; σx² denotes the variance of luminance in Ax and σy² the variance of luminance in Ay; and σxy denotes the covariance describing the structural correlation between Ax and Ay.
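A minimal PyTorch sketch of the two-part loss of step 6); the window size used to estimate the local SSIM statistics and the 1 - SSIM form of the loss are assumptions, while the constants C1 = 6.5025 and C2 = 58.5225 follow the text above.

```python
import torch
import torch.nn.functional as F

def cross_entropy_loss(pred, target):
    # pred and target are probability maps in [0, 1] of shape (B, 1, H, W).
    return F.binary_cross_entropy(pred, target)

def ssim_loss(pred, target, window=11, c1=6.5025, c2=58.5225):
    # Local statistics estimated with an average-pooling window (assumed size 11).
    mu_x = F.avg_pool2d(pred, window, stride=1)
    mu_y = F.avg_pool2d(target, window, stride=1)
    var_x = F.avg_pool2d(pred * pred, window, stride=1) - mu_x ** 2
    var_y = F.avg_pool2d(target * target, window, stride=1) - mu_y ** 2
    cov_xy = F.avg_pool2d(pred * target, window, stride=1) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1.0 - ssim.mean()

def total_loss(pred, target):
    # l = lcross + lssim
    return cross_entropy_loss(pred, target) + ssim_loss(pred, target)

pred = torch.rand(1, 1, 224, 224)
target = (torch.rand(1, 1, 224, 224) > 0.5).float()
print(total_loss(pred, target))
```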
7) Train the model. Select the YouTube dataset from step 1) as the training set, set the batch size to 4 and the learning rate to 1e-4; after the first 300,000 training iterations on YouTube, reduce the learning rate to 1e-5 and train for another 100,000 iterations on YouTube; set the weight decay to 0.0005; train the model with the loss function of step 6) until it converges.
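A minimal PyTorch sketch of the training schedule of step 7); the optimizer choice (Adam) is an assumption, while batch size 4, learning rate 1e-4 dropped to 1e-5 after 300,000 iterations and weight decay 0.0005 follow the text above.

```python
import torch

model = torch.nn.Conv2d(3, 1, kernel_size=1)  # placeholder for the full segmentation network
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.0005)
# Drop the learning rate from 1e-4 to 1e-5 after 300,000 iterations.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[300_000], gamma=0.1)

# Inside the training loop (batch size 4, one scheduler step per iteration):
#   loss = total_loss(model(frames), masks)   # loss from the sketch in step 6)
#   optimizer.zero_grad(); loss.backward(); optimizer.step(); scheduler.step()
```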
The invention has wide application in video object segmentation and computer vision, for example in target tracking and image recognition. The invention is described in detail below with reference to the accompanying drawings.
(1) Construct the segmentation backbone network, and input the current frame It and the first frame I0 into the backbone network to obtain the corresponding feature maps Ft and F0;
(2) Construct the motion attention network. The current frame feature map Ft and the first frame feature map F0 are input to the channel attention module of the motion attention network to obtain the channel feature map; Ft and the hidden state Ht-1 of the previous frame's memory module are input to the motion attention module of the motion attention network to obtain the position feature map; the position feature map and the channel feature map are added to obtain the output Fout of the motion attention network, i.e. the segmentation result of the current frame. The segmentation result Fout of the current frame, the memory cell Ct-1 output by the previous frame's memory module and the hidden state Ht-1 of the previous frame's memory module are input to the memory module of the motion attention network to obtain the memory cell Ct and the hidden state Ht. The memory cell Ct stores and updates the temporal information of the target object based on the segmentation result of the current frame; Ht provides the position information of the target object in the next frame, predicted from the current segmentation result and the temporal information. The memory module convLSTM retains both the spatial information of the target object in the current frame and its temporal information, so the long-range position dependency of the target object can be obtained.
The method is implemented with the PyTorch framework and the Python language on a GTX 1080Ti GPU under the Ubuntu 14.04 64-bit operating system.
The invention provides a video target segmentation method based on motion attention which is suitable for segmenting moving objects in videos, with good robustness and accurate segmentation results. Experiments show that the method can effectively segment moving objects.
The above description is only an embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any modification or substitution that a person skilled in the art can readily conceive within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (6)
1. A video target segmentation method based on motion attention, which fuses temporal information in a video sequence to perform video target segmentation, characterized in that: the method adds the channel feature map Ec output by a channel attention module and the position feature map output by a motion attention module to obtain the segmentation result of the current frame; wherein the input of the channel attention module is the current frame feature map Ft and the appearance feature map F0 of the target object provided by the first frame, the channel attention module computes the correlation between the channels of the input feature maps Ft and F0, and the output channel feature map reflects the object in the current frame whose appearance is closest to the target object; the input of the motion attention module is the current frame feature map Ft and the position information Ht-1 of the target object predicted by the previous frame's memory module, the motion attention module computes the correlation between the positions of the input feature maps Ft and Ht-1, and the output position feature map reflects the approximate position of the target object in the current frame.
2. The video target segmentation method based on motion attention as claimed in claim 1, wherein the feature maps Ft and F0 are obtained as follows:
1.1) Construct a segmentation backbone network, specifically: modify the Resnet-50 network and incorporate dilated convolutions: first, set the dilation factor of conv_1 in Resnet-50 to 2; second, delete the pooling layer in Resnet-50; then set the stride of the conv_3 and conv_4 layers in Resnet-50 to 1; finally, use the modified Resnet-50 as the backbone network, whose output feature map is 1/8 of the original image size;
1.2) Input the current frame It into the segmentation backbone network to obtain the feature map Ft of the current frame;
1.3) Input the first frame I0 into the segmentation backbone network to obtain the feature map F0 of the first frame.
3. The video target segmentation method based on motion attention as claimed in claim 1, wherein the channel attention module works as follows: first, Ft and F0 are combined by matrix multiplication and a softmax function to obtain the channel-weight attention map Xc of the target object; then Xc is multiplied with Ft and the result is added to Ft as a residual to obtain the channel feature map Ec.
4. The video target segmentation method based on motion attention as claimed in claim 1, wherein the motion attention module works as follows: Ft and Ht-1 are the inputs of the module, where Ht-1 provides the position information of the target object in the current frame predicted from the previous frame's segmentation result and the temporal information; first, the feature map Ft is passed through two convolution layers with 1×1 kernels to obtain two feature maps, denoted Fa and Fb; then Fa and Ht-1 are combined by matrix multiplication and a softmax function to obtain the position-weight attention map Xs of the target object; finally, Xs is multiplied with Fb for feature enhancement, and the result is added to Ft as a residual to obtain the position feature map Es.
5. The video target segmentation method based on motion attention as claimed in claim 1, wherein the memory module convLSTM comprises a forget gate, an input gate and an output gate; the memory module takes the current frame segmentation result Fout, the memory cell Ct-1 output by the previous frame's memory module and the hidden state Ht-1 of the previous frame's memory module as inputs, and its outputs are the memory cell Ct and the hidden state Ht; it works as follows:
first, the forget gate discards part of the state information in the previous frame's memory cell Ct-1; then the input gate stores the useful information of the current frame segmentation result Fout into the previous frame's memory cell Ct-1; finally, the updated memory cell Ct of the current frame is output.
6. The video target segmentation method based on motion attention as claimed in claim 1, wherein the loss function constructed in step 3) consists of two parts: the first part is a pixel-level loss function; the second part is a structural similarity loss function.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911402450.2A CN111161306B (en) | 2019-12-31 | 2019-12-31 | Video target segmentation method based on motion attention |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911402450.2A CN111161306B (en) | 2019-12-31 | 2019-12-31 | Video target segmentation method based on motion attention |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111161306A true CN111161306A (en) | 2020-05-15 |
CN111161306B CN111161306B (en) | 2023-06-02 |
Family
ID=70559471
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911402450.2A (CN111161306B, Active) | Video target segmentation method based on motion attention | 2019-12-31 | 2019-12-31
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111161306B (en) |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090245571A1 (en) * | 2008-03-31 | 2009-10-01 | National Taiwan University | Digital video target moving object segmentation method and system |
WO2018128741A1 (en) * | 2017-01-06 | 2018-07-12 | Board Of Regents, The University Of Texas System | Segmenting generic foreground objects in images and videos |
CN110321761A (en) * | 2018-03-29 | 2019-10-11 | 中国科学院深圳先进技术研究院 | A kind of Activity recognition method, terminal device and computer readable storage medium |
CN109784261A (en) * | 2019-01-09 | 2019-05-21 | 深圳市烨嘉为技术有限公司 | Pedestrian's segmentation and recognition methods based on machine vision |
CN109919044A (en) * | 2019-02-18 | 2019-06-21 | 清华大学 | The video semanteme dividing method and device of feature propagation are carried out based on prediction |
CN110059662A (en) * | 2019-04-26 | 2019-07-26 | 山东大学 | A kind of deep video Activity recognition method and system |
CN110532955A (en) * | 2019-08-30 | 2019-12-03 | 中国科学院宁波材料技术与工程研究所 | Example dividing method and device based on feature attention and son up-sampling |
Non-Patent Citations (1)
Title |
---|
ZHANG Jianxing et al.: "Attention-based image segmentation combining target color features", Computer Engineering and Applications *
Cited By (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2021250811A1 (en) * | 2020-06-10 | 2021-12-16 | 日本電気株式会社 | Data processing device, data processing method, and recording medium |
JP7524946B2 (en) | 2020-06-10 | 2024-07-30 | 日本電気株式会社 | Data processing device, data processing method and recording medium |
CN111968123B (en) * | 2020-08-28 | 2024-02-02 | 北京交通大学 | Semi-supervised video target segmentation method |
CN111968123A (en) * | 2020-08-28 | 2020-11-20 | 北京交通大学 | Semi-supervised video target segmentation method |
CN112580473A (en) * | 2020-12-11 | 2021-03-30 | 北京工业大学 | Motion feature fused video super-resolution reconstruction method |
CN112580473B (en) * | 2020-12-11 | 2024-05-28 | 北京工业大学 | Video super-resolution reconstruction method integrating motion characteristics |
CN112669324B (en) * | 2020-12-31 | 2022-09-09 | 中国科学技术大学 | Rapid video target segmentation method based on time sequence feature aggregation and conditional convolution |
CN112669324A (en) * | 2020-12-31 | 2021-04-16 | 中国科学技术大学 | Rapid video target segmentation method based on time sequence feature aggregation and conditional convolution |
CN112784750B (en) * | 2021-01-22 | 2022-08-09 | 清华大学 | Fast video object segmentation method and device based on pixel and region feature matching |
CN112784750A (en) * | 2021-01-22 | 2021-05-11 | 清华大学 | Fast video object segmentation method and device based on pixel and region feature matching |
CN113570607A (en) * | 2021-06-30 | 2021-10-29 | 北京百度网讯科技有限公司 | Target segmentation method and device and electronic equipment |
CN113570607B (en) * | 2021-06-30 | 2024-02-06 | 北京百度网讯科技有限公司 | Target segmentation method and device and electronic equipment |
CN113436199B (en) * | 2021-07-23 | 2022-02-22 | 人民网股份有限公司 | Semi-supervised video target segmentation method and device |
CN113436199A (en) * | 2021-07-23 | 2021-09-24 | 人民网股份有限公司 | Semi-supervised video target segmentation method and device |
Also Published As
Publication number | Publication date |
---|---|
CN111161306B (en) | 2023-06-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111161306B (en) | Video target segmentation method based on motion attention | |
CN111476219B (en) | Image target detection method in intelligent home environment | |
CN110335290B (en) | Twin candidate region generation network target tracking method based on attention mechanism | |
WO2020238560A1 (en) | Video target tracking method and apparatus, computer device and storage medium | |
CN108932500B (en) | A kind of dynamic gesture identification method and system based on deep neural network | |
CN109949255B (en) | Image reconstruction method and device | |
CN111507993A (en) | Image segmentation method and device based on generation countermeasure network and storage medium | |
CN110910391A (en) | Video object segmentation method with dual-module neural network structure | |
CN112699958A (en) | Target detection model compression and acceleration method based on pruning and knowledge distillation | |
CN111861925A (en) | Image rain removing method based on attention mechanism and gate control circulation unit | |
CN112365514A (en) | Semantic segmentation method based on improved PSPNet | |
CN111368935B (en) | SAR time-sensitive target sample amplification method based on generation countermeasure network | |
CN113870335A (en) | Monocular depth estimation method based on multi-scale feature fusion | |
CN112862792A (en) | Wheat powdery mildew spore segmentation method for small sample image data set | |
CN113298032A (en) | Unmanned aerial vehicle visual angle image vehicle target detection method based on deep learning | |
CN111046771A (en) | Training method of network model for recovering writing track | |
CN110852199A (en) | Foreground extraction method based on double-frame coding and decoding model | |
CN113095254A (en) | Method and system for positioning key points of human body part | |
CN116758104B (en) | Multi-instance portrait matting method based on improved GCNet | |
CN114882493A (en) | Three-dimensional hand posture estimation and recognition method based on image sequence | |
CN116030498A (en) | Virtual garment running and showing oriented three-dimensional human body posture estimation method | |
CN113421276A (en) | Image processing method, device and storage medium | |
CN113538527A (en) | Efficient lightweight optical flow estimation method | |
CN116246110A (en) | Image classification method based on improved capsule network | |
CN112183602A (en) | Multi-layer feature fusion fine-grained image classification method with parallel rolling blocks |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |