CN111161306A - Video target segmentation method based on motion attention - Google Patents

Video target segmentation method based on motion attention

Info

Publication number
CN111161306A
Authority
CN
China
Prior art keywords
current frame
attention
feature map
frame
motion
Prior art date
Legal status
Granted
Application number
CN201911402450.2A
Other languages
Chinese (zh)
Other versions
CN111161306B (en)
Inventor
付利华
杨寒雪
杜宇斌
姜涵煦
Current Assignee
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date
Filing date
Publication date
Application filed by Beijing University of Technology
Priority to CN201911402450.2A
Publication of CN111161306A
Application granted
Publication of CN111161306B
Legal status: Active (granted)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/215 Motion-based segmentation
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/20 Image enhancement or restoration using local operators
    • G06T 5/30 Erosion or dilatation, e.g. thinning
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/10 Image acquisition modality
    • G06T 2207/10016 Video; Image sequence
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G06T 2207/20084 Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a video target segmentation method based on motion attention, which adds the channel feature map output by a channel attention module and the position feature map output by a motion attention module to obtain the segmentation result of the current frame. The input of the channel attention module is the current frame feature map Ft and the appearance feature map F0 of the target object provided by the first frame; by computing the correlation between the channels of the input feature maps Ft and F0, the channel attention module outputs a channel feature map that reflects the object in the current frame whose appearance is closest to the target object. The input of the motion attention module is the current frame feature map Ft and the position information Ht-1 of the target object predicted by the memory module in the motion attention network of the previous frame; by computing the correlation between the positions of Ft and Ht-1, the motion attention module outputs a position feature map that reflects the approximate position of the target object in the current frame. The invention combines the two factors of appearance and position to achieve more accurate video target segmentation.

Description

Video target segmentation method based on motion attention
Technical Field
The invention belongs to the field of image processing and computer vision, and relates to a video target segmentation method, in particular to a video target segmentation method based on motion attention.
Background
Video object segmentation is a prerequisite for many video tasks and plays a significant role in fields such as object recognition and video compression. Video object segmentation can be defined as tracking a target object and segmenting it according to an object mask. Depending on whether an initial mask is given, video object segmentation can be divided into semi-supervised and unsupervised settings: in the semi-supervised setting, a segmentation mask is manually provided in the first frame of the video, and the target object is then tracked and segmented; unsupervised methods automatically segment the target object in a given video according to some mechanism, without any prior information.
In a video scene, background clutter, object deformation and rapid motion of the object all affect the segmentation result. Traditional video object segmentation techniques adopt a rigid background motion model combined with scene priors to segment the target object; however, because they rely on such assumptions, these techniques have clear limitations in practical applications. Most existing video object segmentation techniques adopt convolutional neural networks, but they still have various shortcomings: for example, methods that segment moving objects by relying on optical flow between frames are easily affected by optical flow estimation errors. In addition, these methods do not fully exploit the temporal information in the video and do not memorize the relevant features of the target object in the scene.
In order to solve these problems, the invention studies the segmentation of moving targets in the semi-supervised setting and provides a video target segmentation method based on motion attention with a memory module.
Disclosure of Invention
The invention aims to solve the following problems: in video target segmentation, if the target object of the current frame is determined only from the segmentation result of the previous frame, the accurate position of the target object cannot be obtained, and over-reliance on the previous frame's segmentation result may even cause the target object to drift; in addition, most existing video object segmentation methods based on motion information segment the object using optical flow between the current frame and the previous frame, which is not only computationally expensive but also restricts the segmentation to specific motion patterns. A new video object segmentation method based on motion information is therefore needed to improve the segmentation effect.
In order to solve the above problems, the present invention provides a video object segmentation method based on motion attention, which fuses the motion and temporal information in the video sequence and realizes video object segmentation based on an attention mechanism. The method comprises the following steps:
1) Construct a segmentation backbone network, and input the current frame It and the first frame I0 into the backbone network respectively to obtain the corresponding feature maps Ft and F0;
2) Construct a motion attention network, taking the feature map Ft of the current frame, the feature map F0 of the first frame and the hidden state Ht-1 of the previous frame memory module as inputs; the output Fout of the motion attention network is the segmentation result of the current frame;
3) Construct a loss function consisting of two parts: the first part is a pixel-level loss function and the second part is a structural similarity loss function.
Further, constructing the segmentation backbone network in step 1) to obtain the feature maps Ft and F0 specifically comprises:
1.1) Modify the Resnet-50 network and incorporate dilated convolutions. First, the dilation factor of conv_1 in Resnet-50 is set to 2; second, the pooling layer in Resnet-50 is deleted; then the stride of the two layers conv_3 and conv_4 in Resnet-50 is set to 1; finally, the modified Resnet-50 is used as the backbone network, at which point the feature map output by the backbone network is 1/8 of the original image size;
1.2) Input the current frame It into the backbone network to obtain the feature map Ft of the current frame;
1.3) Input the first frame I0 into the backbone network to obtain the feature map F0 of the first frame.
Further, constructing the motion attention network in step 2) obtains the segmentation result of the current frame.
The motion attention network consists of a channel attention module, a motion attention module and a memory module, which are constructed as follows:
2.1) Construct the channel attention module, with Ft and F0 as its inputs. F0 provides appearance information of the target object such as color and posture. First, Ft and F0 are matrix-multiplied and the softmax function is applied to obtain the channel weight attention map Xc of the target object; Xc describes the correlation between the channels of the current frame and the first frame, where a higher correlation yields a higher response value and indicates more similar features. Then, Xc is multiplied with Ft for feature enhancement, and the result is added to Ft as a residual to obtain the channel feature map;
2.2) Construct the motion attention module, with Ft and Ht-1 as its inputs; Ht-1 provides the position information of the target object in the current frame predicted from the previous frame's segmentation result and the temporal information. First, the feature map Ft is passed through two convolution layers with 1 × 1 kernels to obtain two feature maps, denoted Fa and Fb; then, Fa and Ht-1 are matrix-multiplied and the softmax function is applied to obtain the position weight attention map Xs of the target object; finally, Xs is multiplied with Fb for feature enhancement, and the result is added to Ft as a residual to obtain the position feature map;
2.3) Add the channel feature map and the position feature map to obtain the final segmentation result Fout of the current frame;
2.4) Construct the memory module convLSTM, with the current frame segmentation result Fout, the memory cell Ct-1 output by the previous frame memory module and the hidden state Ht-1 of the previous frame memory module as its inputs; the outputs of this module are the memory cell Ct and the hidden state Ht.
The convLSTM consists of an input gate, a forget gate and an output gate.
Further, constructing the memory module convLSTM in step 2.4) specifically comprises:
2.4.1) First, the forget gate discards part of the state information of the previous frame memory cell Ct-1; then, the useful information of the current frame segmentation result Fout is stored into the previous frame memory cell Ct-1; finally, the updated memory cell Ct of the current frame is output;
2.4.2) First, the output gate filters the current frame segmentation result Fout and the hidden state Ht-1 of the previous frame memory module through the sigmoid function to determine the information to be output; then the tanh activation function is applied to scale the memory cell Ct of the current frame; finally, the information to be output and the scaled memory cell Ct of the current frame are multiplied element-wise to obtain and output the hidden state Ht of the current frame.
Advantageous effects
The invention provides a video target segmentation method based on motion attention. First, the feature maps of the first frame and the current frame are obtained; then the current frame feature map Ft, the appearance feature map F0 of the target object provided by the first frame and the position information Ht-1 of the target object predicted by the memory module in the previous frame's motion attention network are input into the current frame's motion attention network to obtain the segmentation result of the current frame. The method can handle the diversity of motion patterns that other segmentation methods cannot cope with. It is suitable for video target segmentation and has good robustness and accurate segmentation results.
The invention has the following characteristics: first, the method does not focus on the segmentation result of the previous frame, but segments the target object more accurately by means of the appearance information of the target object in the first frame and the temporal information of the target object in the video sequence; second, the use of the motion attention network greatly reduces useless features and improves the robustness of the model.
Drawings
FIG. 1 is a flow chart of the video target segmentation method based on motion attention according to the present invention.
FIG. 2 is a network architecture diagram of the video target segmentation method based on motion attention according to the present invention.
FIG. 3 is a structure diagram of Resnet-50.
FIG. 4 is a structure diagram of the modified Resnet-50 used in the video target segmentation method based on motion attention according to the present invention.
Detailed Description
The invention provides a video target segmentation method based on motion attention. First, the feature maps of the first frame and the current frame are obtained; then the feature map of the first frame, the feature map of the current frame and the target object position information predicted by the memory module in the previous frame's motion attention network are input into the motion attention network to obtain the segmentation result of the current frame. The method is suitable for video target segmentation and has good robustness and accurate segmentation results.
The invention is explained in more detail below with reference to specific examples and the accompanying drawings.
The invention comprises the following steps:
1) acquiring YouTube and Davis data sets which are respectively used as a training set and a test set of the model;
2) Pre-process the training data: crop each training sample (video frame) and the first-frame mask of the video sequence, resize the images to 224 × 224 resolution, and perform data augmentation such as rotation;
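A minimal torchvision sketch of the preprocessing above; the crop strategy, the rotation range and the interpolation choices are assumptions, since the text only names cropping, resizing to 224 × 224 and rotation-based augmentation.

```python
import torchvision.transforms as T
from torchvision.transforms import InterpolationMode

# Frame preprocessing: crop, resize to 224 x 224, random rotation, tensor conversion.
frame_transform = T.Compose([
    T.RandomResizedCrop(224),                       # crop and resize the video frame
    T.RandomRotation(degrees=15),                   # rotation-based augmentation
    T.ToTensor(),
])

# The first-frame mask goes through the same geometry, but with nearest-neighbour
# interpolation so that label values are not blended. In a full implementation the
# frame and its mask would share the same random crop/rotation parameters.
mask_transform = T.Compose([
    T.RandomResizedCrop(224, interpolation=InterpolationMode.NEAREST),
    T.RandomRotation(degrees=15, interpolation=InterpolationMode.NEAREST),
    T.ToTensor(),
])
```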
3) Construct the segmentation backbone network, input the first-frame segmentation mask of the video sequence and the current frame, and obtain their segmentation feature maps;
3.1) First, the dilation factor of conv_1 in Resnet-50 is set to 2; second, the pooling layer in Resnet-50 is deleted; then the stride of the two layers conv_3 and conv_4 in Resnet-50 is set to 1; finally, the modified Resnet-50 is used as the backbone network, and the feature map output by the backbone network is 1/8 of the original image size, as shown in FIG. 4;
3.2) The first-frame segmentation mask has a resolution of 224 × 224; it is input into the backbone network to obtain the first-frame feature map F0 of size 2048 × 28 × 28;
3.3) The current frame has a resolution of 224 × 224; it is input into the backbone network to obtain the current frame feature map Ft of size 2048 × 28 × 28.
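A minimal PyTorch sketch of the backbone in steps 3.1)-3.3), assuming torchvision's ResNet-50. Mapping the patent's conv_3/conv_4 onto torchvision's layer3/layer4, reading "delete the pooling layer" as dropping the final average pooling and classification head, and omitting the conv_1 dilation change are interpretations made to keep the sketch runnable, not a verbatim reproduction of the patent's network.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

class SegmentationBackbone(nn.Module):
    """Feature extractor sketch: ResNet-50 with the last two stages dilated so the
    output feature map is 1/8 of the input resolution (224 -> 28)."""
    def __init__(self):
        super().__init__()
        # replace_stride_with_dilation=[False, True, True] sets the stride of the last
        # two stages to 1 and compensates with dilated convolutions, matching the 1/8
        # output stride described in step 3.1).
        net = resnet50(replace_stride_with_dilation=[False, True, True])
        # Keep only the convolutional trunk; the global average pooling and the
        # classification layer are dropped.
        self.features = nn.Sequential(
            net.conv1, net.bn1, net.relu, net.maxpool,
            net.layer1, net.layer2, net.layer3, net.layer4,
        )

    def forward(self, x):
        return self.features(x)

if __name__ == "__main__":
    backbone = SegmentationBackbone()
    frame = torch.randn(1, 3, 224, 224)   # current frame I_t
    feat = backbone(frame)                # feature map F_t
    print(feat.shape)                     # torch.Size([1, 2048, 28, 28])
```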
4) Construct the motion attention network, input the current frame feature map Ft, the hidden state Ht-1 of the previous frame memory module and the first frame feature map F0, and obtain the segmentation result Fout of the current frame. The motion attention network consists of a channel attention module, a motion attention module and a memory module.
4.1) Construct the channel attention module. The current frame feature map Ft and the first frame feature map F0 are input to obtain the channel feature map Ec, as follows:
4.1.1) Call the reshape function in python to adjust the size of Ft, converting it into a feature map F't of size n × 2048; call the reshape function in python to adjust the size of F0, converting it into a feature map F'0 of size 2048 × n, where n denotes the total number of pixels of the current frame;
4.1.2) F'0 and F't are matrix-multiplied and the softmax function is applied; the softmax function is expressed as follows:
softmax(zi) = exp(zi) / Σj exp(zj)
Matrix multiplication enables the use and fusion of global information, and its effect is similar to a fully connected operation. A fully connected operation can take the relations of all positions into account but destroys the spatial structure; matrix multiplication is therefore used in place of the fully connected operation, preserving the spatial information as much as possible while still exploiting the global information;
4.1.3) A channel weight attention map Xc of size 2048 × 2048 is obtained; the element xji in the j-th row and i-th column of Xc is given by:
xji = exp(F'0j · F'ti) / Σi=1..C exp(F'0j · F'ti)
where F'ti denotes the i-th column of F't, i.e. the i-th channel of the current frame feature map Ft; F'0j denotes the j-th row of F'0, i.e. the j-th channel of the first frame feature map F0; xji measures the correlation of the j-th channel of the first frame feature map F0 with the i-th channel of the current frame feature map Ft; and C denotes the number of channels of the current frame feature map Ft.
4.1.4) The channel weight attention map Xc is multiplied with the feature map F't to enhance the current frame feature map Ft, and the result is added to the current frame feature map Ft as a residual, yielding the channel feature map Ec:
Ecj = β · Σi=1..C (xji · F'ti) + Ftj
where β is the channel attention weight, initialized to zero and assigned a larger, more reasonable value by the model through learning; F'ti denotes the i-th column of F't, i.e. the i-th channel of the current frame feature map Ft; and C denotes the number of channels of the current frame feature map Ft.
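A minimal PyTorch sketch of the channel attention module in 4.1.1)-4.1.4); the batch-dimension handling and the softmax axis are implementation assumptions made to keep the sketch runnable.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """Channel attention sketch: the attention map X_c is the channel-wise correlation
    between the first-frame features F_0 and the current-frame features F_t, and the
    output is a residually enhanced F_t (the channel feature map E_c)."""
    def __init__(self):
        super().__init__()
        # beta starts at zero and is learned, as described in 4.1.4).
        self.beta = nn.Parameter(torch.zeros(1))

    def forward(self, f_t, f_0):
        b, c, h, w = f_t.shape                       # e.g. (1, 2048, 28, 28)
        n = h * w
        f_t_flat = f_t.view(b, c, n)                 # F'_t flattened: C x n
        f_0_flat = f_0.view(b, c, n)                 # F'_0 flattened: C x n
        # Channel-to-channel similarity (C x C); softmax over the F_t channels.
        energy = torch.bmm(f_0_flat, f_t_flat.transpose(1, 2))
        x_c = F.softmax(energy, dim=-1)              # X_c, rows indexed by F_0 channels
        # Re-weight the current-frame channels and add the residual connection.
        enhanced = torch.bmm(x_c, f_t_flat).view(b, c, h, w)
        return self.beta * enhanced + f_t            # channel feature map E_c
```

In the embodiment above, both inputs are 2048 × 28 × 28 feature maps produced by the backbone.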
4.2) Construct the motion attention module. The current frame feature map Ft and the hidden state Ht-1 of the previous frame memory module are input to obtain the position feature map Es, as follows:
4.2.1) Ft is passed through two convolution layers with 1 × 1 kernels to obtain two feature maps, denoted Fa and Fb, each of size 2048 × 28 × 28;
4.2.2) Call the reshape function in python to adjust the size of Fa, converting it into a feature map F'a of size 2048 × n; call the reshape function in python to adjust the size of Fb, converting it into a feature map F'b of size 2048 × n; call the reshape and transpose functions in python to adjust the size of Ht-1, converting it into a feature map H't-1 of size n × 2048, where n denotes the total number of pixels of the current frame;
4.2.3) H't-1 and F'a are matrix-multiplied and the softmax function is applied, giving the position weight attention map Xs of size n × n (n = 28 × 28):
sji = exp(F'aj · hi) / Σi=1..N exp(F'aj · hi)
where N denotes the number of pixels of the current frame, F'aj denotes the j-th column of F'a, i.e. the j-th position of Fa, hi denotes the i-th row of H't-1, i.e. the i-th position of Ht-1, and sji, the element in the j-th row and i-th column of the position weight attention map Xs, indicates the influence of the i-th position of the hidden state Ht-1 on the j-th position of the current frame feature map Ft.
4.2.4) The position weight attention map Xs is matrix-multiplied with the feature map F'b to enhance the current frame feature map Ft, and the result is added to Ft as a residual, yielding the fused position feature map Es:
Esj = α · Σi=1..N (sji · F'bi) + Ftj
where α is the position attention weight, initialized to zero and assigned a larger, more reasonable value by the model through learning; F'bi denotes the i-th column of F'b, i.e. the i-th position of the current frame feature map Ft; and N denotes the total number of pixels of the current frame.
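A minimal PyTorch sketch of the motion attention module in 4.2.1)-4.2.4), under the assumption that the hidden state Ht-1 has the same 2048 channels and 28 × 28 spatial size as Ft; the channel count and the axis choices are implementation assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MotionAttention(nn.Module):
    """Motion attention sketch: positions of the hidden state H_{t-1} attend over the
    spatial positions of the current-frame features F_t."""
    def __init__(self, channels=2048):
        super().__init__()
        self.conv_a = nn.Conv2d(channels, channels, kernel_size=1)  # produces F_a
        self.conv_b = nn.Conv2d(channels, channels, kernel_size=1)  # produces F_b
        self.alpha = nn.Parameter(torch.zeros(1))    # learned from zero, as in 4.2.4)

    def forward(self, f_t, h_prev):
        b, c, h, w = f_t.shape
        n = h * w
        f_a = self.conv_a(f_t).view(b, c, n)         # F'_a : C x n
        f_b = self.conv_b(f_t).view(b, c, n)         # F'_b : C x n
        h_flat = h_prev.view(b, c, n)                # H_{t-1} flattened: C x n
        # Position-to-position similarity (n x n): entry (j, i) pairs position j of
        # F_a with position i of H_{t-1}; softmax runs over the H_{t-1} positions.
        energy = torch.bmm(f_a.transpose(1, 2), h_flat)
        x_s = F.softmax(energy, dim=-1)              # X_s
        # Aggregate F_b according to X_s and add the residual connection to F_t.
        enhanced = torch.bmm(f_b, x_s.transpose(1, 2)).view(b, c, h, w)
        return self.alpha * enhanced + f_t           # position feature map E_s
```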
4.3) The position feature map Es and the channel feature map Ec are added to obtain the final segmentation result Fout of the current frame.
4.4) Construct the memory module, with the current frame segmentation result Fout, the hidden state Ht-1 of the previous frame memory module and the memory cell Ct-1 of the previous frame memory module as inputs. The memory module convLSTM of the current frame consists of a forget gate ft, an input gate it and an output gate ot;
4.4.1) Each value of the output tensor of the forget gate lies between 0 and 1, where 0 means complete forgetting and 1 means complete retention; the forget gate therefore selectively discards information from the previous frame memory cell Ct-1:
ft = σ(Wxf * Fout + Whf * Ht-1 + bf)
where * denotes the convolution operation, σ denotes the sigmoid function, Fout denotes the segmentation result of the current frame, and Ht-1 denotes the hidden state of the previous frame memory module. Wxf and Whf are weight parameters with values between 0 and 1; bf is a bias initialized to 0.1, to which the model assigns a more reasonable value through learning;
4.4.2) The input gate selects the content to be updated from the current frame segmentation result Fout:
it = σ(Wxi * Fout + Whi * Ht-1 + bi)
where * denotes the convolution operation, σ denotes the sigmoid function, Fout denotes the segmentation result of the current frame, and Ht-1 denotes the hidden state of the previous frame memory module. Wxi and Whi are weight parameters with values between 0 and 1; bi is a bias initialized to 0.1, to which the model assigns a more reasonable value through learning;
4.4.3) The forget gate discards part of the state information of the previous frame memory cell Ct-1, the useful information of the current frame segmentation result Fout is stored into the memory cell, and the updated memory cell Ct of the current frame is output:
Ct = ft ⊙ Ct-1 + it ⊙ tanh(Wxc * Fout + Whc * Ht-1 + bc)
where * denotes the convolution operation, ⊙ denotes the Hadamard (element-wise) product, Fout denotes the segmentation result of the current frame, and Ht-1 denotes the hidden state of the previous frame memory module. Wxc and Whc are weight parameters with values between 0 and 1; bc is a bias initialized to 0.1, to which the model assigns a more reasonable value through learning;
4.4.4) The hidden state Ht output by the current frame memory module is computed as:
Ht = ot ⊙ tanh(Ct)
ot = σ(Wxo * Fout + Who * Ht-1 + bo)
where tanh is the activation function, ⊙ denotes the Hadamard (element-wise) product, ot is the output gate of the current frame memory module, * denotes the convolution operation, σ denotes the sigmoid function, Fout denotes the segmentation result of the current frame, and Ht-1 denotes the hidden state of the previous frame memory module. Who and Wxo are weight parameters with values between 0 and 1; bo is a bias initialized to 0.1, to which the model assigns a more reasonable value through learning.
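A minimal PyTorch sketch of the convLSTM memory module in 4.4.1)-4.4.4); computing the four gate pre-activations with a single convolution is an implementation convenience equivalent to the per-gate formulas above, and the 3 × 3 kernel size is an assumption, since the text does not state a kernel size.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """convLSTM memory module sketch: forget, input and output gates computed by
    convolutions over the concatenation of F_out and H_{t-1}."""
    def __init__(self, in_channels, hidden_channels, kernel_size=3):
        super().__init__()
        padding = kernel_size // 2
        # One convolution producing all four gate pre-activations at once.
        self.gates = nn.Conv2d(in_channels + hidden_channels, 4 * hidden_channels,
                               kernel_size, padding=padding)
        nn.init.constant_(self.gates.bias, 0.1)      # biases start at 0.1, as in 4.4)

    def forward(self, f_out, h_prev, c_prev):
        z = self.gates(torch.cat([f_out, h_prev], dim=1))
        f, i, o, g = torch.chunk(z, 4, dim=1)
        f_t = torch.sigmoid(f)                       # forget gate f_t
        i_t = torch.sigmoid(i)                       # input gate i_t
        o_t = torch.sigmoid(o)                       # output gate o_t
        c_t = f_t * c_prev + i_t * torch.tanh(g)     # memory cell update C_t
        h_t = o_t * torch.tanh(c_t)                  # hidden state H_t
        return h_t, c_t
```

A typical usage would initialize the hidden state and the memory cell to zeros at the first frame and then call the cell once per frame of the sequence.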
6) The loss function adopted by the segmentation model consists of two parts. The first part is a pixel-level loss function, the second part is a structural similarity loss function, and the specific design is as follows:
l = lcross + lssim
6.1) lcross denotes the pixel-level cross-entropy loss function:
lcross = - Σ(r,c) [ T(r,c) · log S(r,c) + (1 - T(r,c)) · log(1 - S(r,c)) ]
where T(r,c) denotes the pixel value at row r, column c of the target mask, and S(r,c) denotes the pixel value at row r, column c of the segmentation result;
6.2) lssim denotes the structural similarity loss function, which compares the target mask and the segmentation result in terms of luminance, contrast and structure:
μx = (1/N) Σi=1..N xi,  μy = (1/N) Σi=1..N yi
σx² = (1/(N-1)) Σi=1..N (xi - μx)²,  σy² = (1/(N-1)) Σi=1..N (yi - μy)²
σxy = (1/(N-1)) Σi=1..N (xi - μx)(yi - μy)
SSIM(Ax, Ay) = ((2μxμy + C1)(2σxy + C2)) / ((μx² + μy² + C1)(σx² + σy² + C2))
lssim = 1 - SSIM(Ax, Ay)
where Ax and Ay denote regions of the same size cut from the segmentation map predicted by the model and from the target mask, respectively; xi denotes the pixel value of the i-th pixel in region Ax and yi the pixel value of the i-th pixel in region Ay; N denotes the total number of pixels in the cut region; C1 and C2 are constants that prevent the denominator from being zero, with C1 set to 6.5025 and C2 set to 58.5225; μx denotes the average brightness of Ax and μy the average brightness of Ay; σx denotes the variation of brightness in Ax and σy the variation of brightness in Ay; and σxy denotes the structure-related covariance.
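A minimal PyTorch sketch of the loss in step 6); computing the SSIM statistics over local 11 × 11 windows via average pooling is an assumption, since the text only says that same-sized regions are cut from the prediction and the mask.

```python
import torch
import torch.nn.functional as F

def ssim_loss(pred, target, c1=6.5025, c2=58.5225, window=11):
    """l_ssim = 1 - SSIM, with local statistics computed by average pooling;
    C1 and C2 are the constants given in step 6.2)."""
    pad = window // 2
    mu_x = F.avg_pool2d(pred, window, stride=1, padding=pad)
    mu_y = F.avg_pool2d(target, window, stride=1, padding=pad)
    var_x = F.avg_pool2d(pred * pred, window, stride=1, padding=pad) - mu_x ** 2
    var_y = F.avg_pool2d(target * target, window, stride=1, padding=pad) - mu_y ** 2
    cov_xy = F.avg_pool2d(pred * target, window, stride=1, padding=pad) - mu_x * mu_y
    ssim = ((2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)) / \
           ((mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))
    return 1.0 - ssim.mean()

def segmentation_loss(pred, target):
    """Total loss l = l_cross + l_ssim; pred is expected to hold probabilities in [0, 1]."""
    l_cross = F.binary_cross_entropy(pred, target)
    return l_cross + ssim_loss(pred, target)

if __name__ == "__main__":
    pred = torch.rand(1, 1, 224, 224)                        # predicted segmentation map S
    target = (torch.rand(1, 1, 224, 224) > 0.5).float()      # binary target mask T
    print(segmentation_loss(pred, target))
```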
7) Train the model. The YouTube data set from step 1) is used as the training set, the batch size is set to 4 samples per batch, and the learning rate is set to 1e-4. After the first 300,000 training iterations on YouTube, the learning rate is reduced to 1e-5 and another 100,000 iterations are performed on YouTube. The weight decay rate is set to 0.0005, and the model is trained with the loss function of step 6) until convergence.
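A short sketch of the training configuration in step 7). The optimizer type is not stated in the text, so Adam is an assumption; the segmentation network and the YouTube data loader are assumed to be defined elsewhere and are replaced here by a placeholder module.

```python
import torch
import torch.nn as nn

# Placeholder model standing in for the full motion attention segmentation network.
model = nn.Conv2d(3, 1, kernel_size=1)

# Batch size 4, learning rate 1e-4 and weight decay 0.0005, as given in step 7).
batch_size = 4
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4, weight_decay=0.0005)

# Learning rate drops to 1e-5 after the first 300,000 iterations, with training
# continuing for another 100,000 iterations (400,000 in total); scheduler.step()
# would be called once per training iteration.
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[300_000], gamma=0.1)
total_iterations = 400_000
```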
The invention has wide application in the field of video object segmentation and computer vision, such as: target tracking, image recognition, etc. The present invention will now be described in detail with reference to the accompanying drawings.
(1) Construct the segmentation backbone network, and input the current frame It and the first frame I0 into the backbone network respectively to obtain the corresponding feature maps Ft and F0.
(2) Construct the motion attention network. The current frame feature map Ft and the first frame feature map F0 are used as inputs to the channel attention module in the motion attention network to obtain the channel feature map; Ft and the hidden state Ht-1 of the previous frame memory module are used as inputs to the motion attention module in the motion attention network to obtain the position feature map; the position feature map and the channel feature map are added to obtain the output Fout of the motion attention network, which is the segmentation result of the current frame. The current frame segmentation result Fout, the memory cell Ct-1 output by the previous frame memory module and the hidden state Ht-1 of the previous frame memory module are used as inputs to the memory module in the motion attention network to obtain the memory cell Ct and the hidden state Ht. The memory cell Ct stores and updates the temporal information of the target object based on the segmentation result of the current frame, and Ht provides the position information of the target object in the next frame predicted from the current segmentation result and the temporal information. The memory module convLSTM retains both the spatial information and the temporal information of the target object of the current frame, so the long-range positional dependency of the target object can be obtained.
The method is implemented with the PyTorch framework and the Python language on a GTX 1080Ti GPU under the Ubuntu 14.04 64-bit operating system.
The invention provides a video target segmentation method based on motion attention, which is suitable for segmenting moving objects in a video, and has the advantages of good robustness and accurate segmentation result. Experiments show that the method can effectively segment moving objects.
The above description is only an embodiment of the present invention, but the protection scope of the present invention is not limited thereto; any modification or substitution that can readily be conceived by a person skilled in the art within the technical scope disclosed herein shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A video target segmentation method based on motion attention, which fuses temporal information in a video sequence to perform video target segmentation, characterized in that: the method adds the channel feature map Ec output by a channel attention module and the position feature map output by a motion attention module to obtain the segmentation result of the current frame; wherein the input of the channel attention module is the current frame feature map Ft and the appearance feature map F0 of the target object provided by the first frame, the channel attention module computes the correlation between the channels of the input feature maps Ft and F0, and the output channel feature map reflects the object in the current frame whose appearance is closest to the target object; the input of the motion attention module is the current frame feature map Ft and the position information Ht-1 of the target object predicted by the previous frame memory module, the motion attention module computes the correlation between the positions of the input feature maps Ft and Ht-1, and the output position feature map reflects the approximate position of the target object in the current frame.
2. The video target segmentation method based on motion attention according to claim 1, characterized in that the feature maps Ft and F0 are obtained as follows:
1.1) Construct the segmentation backbone network, specifically: modify the Resnet-50 network and incorporate dilated convolutions: first, the dilation factor of conv_1 in Resnet-50 is set to 2; second, the pooling layer in Resnet-50 is deleted; then the stride of the two layers conv_3 and conv_4 in Resnet-50 is set to 1; finally, the modified Resnet-50 is used as the backbone network, at which point the feature map output by the backbone network is 1/8 of the original image size;
1.2) Input the current frame It into the segmentation backbone network to obtain the feature map Ft of the current frame;
1.3) Input the first frame I0 into the segmentation backbone network to obtain the feature map F0 of the first frame.
3. The video target segmentation method based on motion attention according to claim 1, characterized in that the channel attention module works as follows: first, Ft and F0 are matrix-multiplied and the softmax function is applied to obtain the channel weight attention map Xc of the target object; then, Xc is multiplied with Ft, and the result is added to Ft as a residual to obtain the channel feature map Ec.
4. The video target segmentation method based on motion attention according to claim 1, characterized in that the motion attention module works as follows: Ft and Ht-1 are the inputs of the module, where Ht-1 provides the position information of the target object in the current frame predicted from the previous frame's segmentation result and the temporal information; first, the feature map Ft is passed through two convolution layers with 1 × 1 kernels to obtain two feature maps, denoted Fa and Fb; then, Fa and Ht-1 are matrix-multiplied and the softmax function is applied to obtain the position weight attention map Xs of the target object; finally, Xs is multiplied with Fb for feature enhancement, and the result is added to Ft as a residual to obtain the position feature map Es.
5. The video target segmentation method based on motion attention according to claim 1, characterized in that the memory module convLSTM comprises a forget gate, an input gate and an output gate, the memory module takes the current frame segmentation result Fout, the memory cell Ct-1 output by the previous frame memory module and the hidden state Ht-1 of the previous frame memory module as its inputs, and the outputs of the module are the memory cell Ct and the hidden state Ht; the module works as follows:
first, the forget gate discards part of the state information of the previous frame memory cell Ct-1; then, the useful information of the current frame segmentation result Fout is stored into the previous frame memory cell Ct-1; finally, the updated memory cell Ct of the current frame is output.
6. The video target segmentation method based on motion attention according to claim 1, characterized in that the loss function constructed in step 3) consists of two parts: the first part is a pixel-level loss function and the second part is a structural similarity loss function.
CN201911402450.2A 2019-12-31 2019-12-31 Video target segmentation method based on motion attention Active CN111161306B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911402450.2A CN111161306B (en) 2019-12-31 2019-12-31 Video target segmentation method based on motion attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911402450.2A CN111161306B (en) 2019-12-31 2019-12-31 Video target segmentation method based on motion attention

Publications (2)

Publication Number Publication Date
CN111161306A true CN111161306A (en) 2020-05-15
CN111161306B CN111161306B (en) 2023-06-02

Family

ID=70559471

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911402450.2A Active CN111161306B (en) 2019-12-31 2019-12-31 Video target segmentation method based on motion attention

Country Status (1)

Country Link
CN (1) CN111161306B (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111968123A (en) * 2020-08-28 2020-11-20 北京交通大学 Semi-supervised video target segmentation method
CN112580473A (en) * 2020-12-11 2021-03-30 北京工业大学 Motion feature fused video super-resolution reconstruction method
CN112669324A (en) * 2020-12-31 2021-04-16 中国科学技术大学 Rapid video target segmentation method based on time sequence feature aggregation and conditional convolution
CN112784750A (en) * 2021-01-22 2021-05-11 清华大学 Fast video object segmentation method and device based on pixel and region feature matching
CN113436199A (en) * 2021-07-23 2021-09-24 人民网股份有限公司 Semi-supervised video target segmentation method and device
CN113570607A (en) * 2021-06-30 2021-10-29 北京百度网讯科技有限公司 Target segmentation method and device and electronic equipment
WO2021250811A1 (en) * 2020-06-10 2021-12-16 日本電気株式会社 Data processing device, data processing method, and recording medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090245571A1 (en) * 2008-03-31 2009-10-01 National Taiwan University Digital video target moving object segmentation method and system
WO2018128741A1 (en) * 2017-01-06 2018-07-12 Board Of Regents, The University Of Texas System Segmenting generic foreground objects in images and videos
CN109784261A (en) * 2019-01-09 2019-05-21 深圳市烨嘉为技术有限公司 Pedestrian's segmentation and recognition methods based on machine vision
CN109919044A (en) * 2019-02-18 2019-06-21 清华大学 The video semanteme dividing method and device of feature propagation are carried out based on prediction
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A kind of deep video Activity recognition method and system
CN110321761A (en) * 2018-03-29 2019-10-11 中国科学院深圳先进技术研究院 A kind of Activity recognition method, terminal device and computer readable storage medium
CN110532955A (en) * 2019-08-30 2019-12-03 中国科学院宁波材料技术与工程研究所 Example dividing method and device based on feature attention and son up-sampling

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090245571A1 (en) * 2008-03-31 2009-10-01 National Taiwan University Digital video target moving object segmentation method and system
WO2018128741A1 (en) * 2017-01-06 2018-07-12 Board Of Regents, The University Of Texas System Segmenting generic foreground objects in images and videos
CN110321761A (en) * 2018-03-29 2019-10-11 中国科学院深圳先进技术研究院 A kind of Activity recognition method, terminal device and computer readable storage medium
CN109784261A (en) * 2019-01-09 2019-05-21 深圳市烨嘉为技术有限公司 Pedestrian's segmentation and recognition methods based on machine vision
CN109919044A (en) * 2019-02-18 2019-06-21 清华大学 The video semanteme dividing method and device of feature propagation are carried out based on prediction
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A kind of deep video Activity recognition method and system
CN110532955A (en) * 2019-08-30 2019-12-03 中国科学院宁波材料技术与工程研究所 Example dividing method and device based on feature attention and son up-sampling

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
张建兴 et al., "Attention-based image segmentation combining target color features," Computer Engineering and Applications *

Cited By (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2021250811A1 (en) * 2020-06-10 2021-12-16 日本電気株式会社 Data processing device, data processing method, and recording medium
JP7524946B2 (en) 2020-06-10 2024-07-30 日本電気株式会社 Data processing device, data processing method and recording medium
CN111968123B (en) * 2020-08-28 2024-02-02 北京交通大学 Semi-supervised video target segmentation method
CN111968123A (en) * 2020-08-28 2020-11-20 北京交通大学 Semi-supervised video target segmentation method
CN112580473A (en) * 2020-12-11 2021-03-30 北京工业大学 Motion feature fused video super-resolution reconstruction method
CN112580473B (en) * 2020-12-11 2024-05-28 北京工业大学 Video super-resolution reconstruction method integrating motion characteristics
CN112669324B (en) * 2020-12-31 2022-09-09 中国科学技术大学 Rapid video target segmentation method based on time sequence feature aggregation and conditional convolution
CN112669324A (en) * 2020-12-31 2021-04-16 中国科学技术大学 Rapid video target segmentation method based on time sequence feature aggregation and conditional convolution
CN112784750B (en) * 2021-01-22 2022-08-09 清华大学 Fast video object segmentation method and device based on pixel and region feature matching
CN112784750A (en) * 2021-01-22 2021-05-11 清华大学 Fast video object segmentation method and device based on pixel and region feature matching
CN113570607A (en) * 2021-06-30 2021-10-29 北京百度网讯科技有限公司 Target segmentation method and device and electronic equipment
CN113570607B (en) * 2021-06-30 2024-02-06 北京百度网讯科技有限公司 Target segmentation method and device and electronic equipment
CN113436199B (en) * 2021-07-23 2022-02-22 人民网股份有限公司 Semi-supervised video target segmentation method and device
CN113436199A (en) * 2021-07-23 2021-09-24 人民网股份有限公司 Semi-supervised video target segmentation method and device

Also Published As

Publication number Publication date
CN111161306B (en) 2023-06-02

Similar Documents

Publication Publication Date Title
CN111161306B (en) Video target segmentation method based on motion attention
CN111476219B (en) Image target detection method in intelligent home environment
CN110335290B (en) Twin candidate region generation network target tracking method based on attention mechanism
WO2020238560A1 (en) Video target tracking method and apparatus, computer device and storage medium
CN108932500B (en) A kind of dynamic gesture identification method and system based on deep neural network
CN109949255B (en) Image reconstruction method and device
CN111507993A (en) Image segmentation method and device based on generation countermeasure network and storage medium
CN110910391A (en) Video object segmentation method with dual-module neural network structure
CN112699958A (en) Target detection model compression and acceleration method based on pruning and knowledge distillation
CN111861925A (en) Image rain removing method based on attention mechanism and gate control circulation unit
CN112365514A (en) Semantic segmentation method based on improved PSPNet
CN111368935B (en) SAR time-sensitive target sample amplification method based on generation countermeasure network
CN113870335A (en) Monocular depth estimation method based on multi-scale feature fusion
CN112862792A (en) Wheat powdery mildew spore segmentation method for small sample image data set
CN113298032A (en) Unmanned aerial vehicle visual angle image vehicle target detection method based on deep learning
CN111046771A (en) Training method of network model for recovering writing track
CN110852199A (en) Foreground extraction method based on double-frame coding and decoding model
CN113095254A (en) Method and system for positioning key points of human body part
CN116758104B (en) Multi-instance portrait matting method based on improved GCNet
CN114882493A (en) Three-dimensional hand posture estimation and recognition method based on image sequence
CN116030498A (en) Virtual garment running and showing oriented three-dimensional human body posture estimation method
CN113421276A (en) Image processing method, device and storage medium
CN113538527A (en) Efficient lightweight optical flow estimation method
CN116246110A (en) Image classification method based on improved capsule network
CN112183602A (en) Multi-layer feature fusion fine-grained image classification method with parallel rolling blocks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant