CN113869154B - Video actor segmentation method according to language description - Google Patents

Video actor segmentation method according to language description

Info

Publication number
CN113869154B
CN113869154B (application CN202111081527.8A)
Authority
CN
China
Prior art keywords
feature
frame
features
sentence
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111081527.8A
Other languages
Chinese (zh)
Other versions
CN113869154A (en)
Inventor
Guorong Li (李国荣)
Weidong Chen (陈伟东)
Xinfeng Zhang (张新峰)
Qingming Huang (黄庆明)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Chinese Academy of Sciences
Original Assignee
University of Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Chinese Academy of Sciences filed Critical University of Chinese Academy of Sciences
Priority to CN202111081527.8A
Publication of CN113869154A
Application granted
Publication of CN113869154B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods

Abstract

The invention discloses a video actor segmentation method according to language description. A cascaded cross-modal attention module first uses clip-level visual features to coarsely attend to the informative words of the language query, then uses frame-level visual features to refine the word attention and adjust the weight of each word for the target frame. The method can distinguish and segment the positive example from rich video content; by mining hard negative examples for contrastive learning, it learns to identify the target actor within a video and to distinguish target actors across different videos, thereby markedly improving the accuracy of intra-frame matching and segmentation.

Description

Video actor segmentation method according to language description
Technical Field
The invention relates to the technical field of video recognition, and in particular to a video actor segmentation method according to language description.
Background
In recent years, video understanding tasks have received much attention, especially those involving natural language processing. Substantial progress has been made in language-guided temporal action localization, video caption generation, and the segmentation of actors and their actions in video according to sentence descriptions. In real-world scenarios a video commonly contains multiple actors performing actions, so selectively localizing a particular actor and its action in space and time at a fine granularity through a language query has become an important task for computers to better understand video.
A framework widely used in related tasks such as video/image object grounding first generates region proposals in the video/image with a detection method and then matches the text features against the visual features of the proposals to select the best-matching object. To improve the matching of these two heterogeneous features, previous work first generated language features with a bidirectional LSTM and a self-attention mechanism, then processed the visual features with the weighted text features, and finally matched text features to visual features. However, the language attention produced by such a self-attention mechanism is in effect an average solution over the training data rather than a solution tailored to a specific video. During inference the attended language features are therefore fixed regardless of the input video, even though video is a high-level semantic space with rich content whose most discriminative cues are hard to capture. The video should determine which words of the language query are key; capturing the informative words and learning a visually aware, discriminative language representation is therefore critical to the language-guided video actor and action segmentation task.
The problem to be solved is therefore how to design a visually aware language encoder that generates a discriminative language representation for segmenting an actor and its action in a video, and how to further optimize the segmentation method to improve the accuracy of intra-frame matching and segmentation.
Disclosure of Invention
To overcome these problems, the invention provides a method for segmenting a video actor and its action according to a language description, in which a collaboratively optimized network with a cascaded cross-modal attention mechanism markedly improves the accuracy of matching and segmentation. The language query is attended from coarse to fine using visual features of two granularities, producing discriminative, visually aware language features. In addition, a contrastive learning scheme with a hard-negative mining strategy is provided, which helps the network distinguish the positive example from negative examples and further improves performance.
An object of the invention is to provide a video actor segmentation method according to language description that uses a cascaded cross-modal attention module to generate discriminative sentence query features and improve the accuracy of matching and segmentation.
The cascaded cross-modal attention module includes a clip-level feature attention unit and a frame-level feature attention unit.
The clip-level feature attention unit takes the sentence embedding s and the clip-level feature v_c of the target frame i as inputs.

The clip-level feature attention unit uses the clip-level feature v_c to coarsely weight the language features:

$$Att_1 = \sigma_{\mathrm{softmax}}\big(\phi(s)\,\tilde{\psi}(v_c)^{T}\big)$$

$$F_1 = Att_1\,\psi(v_c) + \phi(s)$$

where T denotes matrix transposition; σ_softmax is the softmax activation function; Att_1 is the attention map between the clip-level feature v_c and the sentence embedding s; F_1 is the coarsely weighted sentence feature; the clip-level feature v_c is passed through convolutional layers $\tilde{\psi}(\cdot)$ and ψ(·) to obtain $\tilde{\psi}(v_c)$ and ψ(v_c); the word embeddings e_t are combined to form the sentence embedding s, which is then fed into a convolutional layer φ(·) to generate the sentence feature φ(s).
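For illustration only, a minimal PyTorch-style sketch of this clip-level cross-modal attention is given below; the use of linear layers in place of 1 × 1 convolutions, the tensor layouts, and the dimension sizes are assumptions made for the example rather than limitations of the invention.

```python
import torch
import torch.nn as nn

class ClipLevelAttention(nn.Module):
    """Sketch of the clip-level feature attention unit (assumed shapes)."""
    def __init__(self, vis_dim=832, txt_dim=300, dim=1024):
        super().__init__()
        self.psi_tilde = nn.Linear(vis_dim, dim)  # plays the role of the conv layer psi~(.)
        self.psi = nn.Linear(vis_dim, dim)        # plays the role of the conv layer psi(.)
        self.phi = nn.Linear(txt_dim, dim)        # plays the role of the conv layer phi(.)

    def forward(self, v_c, s):
        # v_c: (N, vis_dim) clip-level visual features at N positions
        # s:   (n, txt_dim) sentence embedding, one row per word
        phi_s = self.phi(s)                                             # (n, dim)
        att1 = torch.softmax(phi_s @ self.psi_tilde(v_c).t(), dim=-1)   # (n, N) attention map Att_1
        f1 = att1 @ self.psi(v_c) + phi_s                               # (n, dim) coarsely weighted sentence feature F_1
        return f1, att1
```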
For a video V, the clip-level feature v_c is encoded by the following formula:

$$v_c = \frac{\theta_{avg}\big(\mathrm{I3D}(V)\big)}{\big\|\theta_{avg}\big(\mathrm{I3D}(V)\big)\big\|_2}$$

where $\|\cdot\|_2$ denotes the L2 norm, θ_avg is the mean-pooling operation, and I3D(·) is a two-stream I3D encoder; preferably, the output of the Mixed-4f layer of the I3D network is used as the I3D encoder output.
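A hedged sketch of this encoding step follows, assuming the I3D features have already been extracted as a tensor in (T, H, W, C) layout; the choice of pooling axes is an assumption for the example.

```python
import torch

def encode_clip_feature(i3d_feat, eps=1e-8):
    """Mean-pool I3D features over the spatial axes and L2-normalize the result.

    i3d_feat: tensor of shape (T, H, W, C), e.g. the Mixed-4f output (assumed layout).
    Returns a (T, C) clip-level feature v_c.
    """
    v = i3d_feat.mean(dim=(1, 2))                         # theta_avg: spatial mean pooling -> (T, C)
    v_c = v / (v.norm(p=2, dim=-1, keepdim=True) + eps)   # L2 normalization
    return v_c
```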
The frame-level feature attention unit processes the coarsely weighted sentence feature F_1 and the frame-level feature v_f to obtain the fine-tuned sentence feature F_2:

$$Att_2 = \sigma_{\mathrm{softmax}}\big(F_1\,\tilde{\psi}'(v_f)^{T}\big)$$

$$F_2 = Att_2\,\psi'(v_f) + F_1$$

where Att_2 is the attention map between the frame-level feature v_f and the coarsely weighted sentence feature F_1; F_2 is the fine-tuned sentence feature, each column of which represents the vector of one word; $\tilde{\psi}'(v_f)$ is the feature obtained by passing v_f through a linear layer $\tilde{\psi}'(\cdot)$; and ψ'(v_f) is the feature obtained by passing v_f through a linear layer ψ'(·).
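For illustration, a minimal sketch of the frame-level attention unit in the same style as the clip-level sketch above; the linear-layer sizes and tensor layouts are assumptions.

```python
import torch
import torch.nn as nn

class FrameLevelAttention(nn.Module):
    """Sketch of the frame-level feature attention unit (assumed shapes)."""
    def __init__(self, vis_dim=1024, dim=1024):
        super().__init__()
        self.psi_tilde = nn.Linear(vis_dim, dim)  # linear layer psi~'(.)
        self.psi = nn.Linear(vis_dim, dim)        # linear layer psi'(.)

    def forward(self, v_f, f1):
        # v_f: (M, vis_dim) frame-level visual features at M spatial positions
        # f1:  (n, dim) coarsely weighted sentence features from the clip-level unit
        att2 = torch.softmax(f1 @ self.psi_tilde(v_f).t(), dim=-1)  # (n, M) attention map Att_2
        f2 = att2 @ self.psi(v_f) + f1                              # (n, dim) fine-tuned sentence feature F_2
        return f2, att2
```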
The frame-level feature v_f is extracted with a ResNet-101 network; preferably, before extracting frame-level features, the ResNet-101 network is pre-trained on the COCO dataset and fine-tuned on the A2D dataset training split.

The frame-level feature is a linear weighted combination of the features warped from the reference frames j to the target frame i and the original feature v_i, as shown in the following formula:

$$v_f = (1-\beta)\,v_i + \frac{\beta}{2K}\sum_{j=i-K,\; j\neq i}^{i+K} v_{j\to i}$$

where v_i is the ResNet-101 encoded feature of the target frame i, β is a weight coefficient, i is the target frame, j is a reference frame, 2K is the number of reference frames (i.e. K frames before and K frames after the target frame are taken to compensate the target-frame features), and v_{j→i} is the feature warped from the reference frame j to the target frame i.

The warped feature v_{j→i} is:

$$v_{j\to i} = \mathcal{W}\big(v_j,\; OF_{j\to i}\big)$$

where v_j is the ResNet-101 encoded feature of the reference frame j, OF_{j→i} is the optical flow between the reference frame j and the target frame i, and $\mathcal{W}(\cdot,\cdot)$ is the bilinear warping function.
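Below is an illustrative sketch of the flow-guided warping and aggregation, assuming PyTorch tensors in (1, C, H, W) layout, a pre-computed optical flow in feature-map pixels, and equal averaging of the 2K warped features; these details are assumptions of the example.

```python
import torch
import torch.nn.functional as F

def warp_feature(v_j, flow):
    """Bilinearly warp a feature map v_j (1, C, H, W) toward the target frame using flow (1, 2, H, W)."""
    _, _, h, w = v_j.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid_x = (xs.float() + flow[:, 0]) / max(w - 1, 1) * 2 - 1   # normalize x to [-1, 1]
    grid_y = (ys.float() + flow[:, 1]) / max(h - 1, 1) * 2 - 1   # normalize y to [-1, 1]
    grid = torch.stack((grid_x, grid_y), dim=-1)                 # (1, H, W, 2) sampling grid
    return F.grid_sample(v_j, grid, mode="bilinear", align_corners=True)

def aggregate_frame_feature(v_i, warped_list, beta=0.1):
    """Linearly combine the target-frame feature with the 2K warped reference features."""
    # (1 - beta) weights the original feature; beta spreads over the warped features (assumed split)
    return (1 - beta) * v_i + beta * torch.stack(warped_list).mean(dim=0)
```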
The weighted word features h_t are passed through a fully connected layer to obtain the sentence query feature q:

$$m_t = FC(h_t)$$

$$\alpha_t = \sigma_{\mathrm{softmax}}(m_t)$$

$$q = \sum_{t=1}^{n} \alpha_t\, h_t$$

where h_t is the t-th column of F_2, representing the vector of the t-th word; FC(·) is a fully connected layer; m_t is the intermediate vector obtained by passing h_t through the fully connected layer; and α_t is the weighting coefficient of the t-th word.
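A minimal sketch of this attention pooling step, assuming the fully connected layer produces a single scalar score per word and that q is the weighted sum of the word features; the scalar-score design is an assumption of the example.

```python
import torch
import torch.nn as nn

class SentenceQueryPooling(nn.Module):
    """Sketch: pool the fine-tuned word features into a single sentence query q."""
    def __init__(self, dim=1024):
        super().__init__()
        self.fc = nn.Linear(dim, 1)   # FC(.) producing one score m_t per word

    def forward(self, f2):
        # f2: (n, dim) fine-tuned word features, one row per word (h_t)
        m = self.fc(f2).squeeze(-1)            # (n,) scores m_t
        alpha = torch.softmax(m, dim=0)        # (n,) word weights alpha_t
        q = (alpha.unsqueeze(-1) * f2).sum(0)  # (dim,) sentence query feature q
        return q, alpha
```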
The second aspect of the present invention also provides a computer readable storage medium storing a training program for video actor segmentation according to language description, which program, when executed by a processor, causes the processor to carry out the steps of the video actor segmentation method according to language description.
The method for segmenting the video actor according to the language description in the invention can be realized by means of software plus a necessary general hardware platform, wherein the software is stored in a computer readable storage medium (comprising a ROM/RAM, a magnetic disk and an optical disk) and comprises a plurality of instructions for enabling a terminal device (which can be a mobile phone, a computer, a server, a network device and the like) to execute the method in the invention.
The third aspect of the present invention also provides a computer device comprising a memory and a processor, the memory storing a training program for video actor segmentation according to a language description, the program, when executed by the processor, causing the processor to perform the steps of the video actor segmentation method according to a language description.
The invention has the advantages that:
(1) The method provided by the invention coarsely attends to the informative words of the language query using clip-level visual features, then refines the word attention with frame-level visual features and adjusts the word weights for the target frame, which markedly improves the accuracy of intra-frame matching and segmentation.
(2) The method provided by the invention can distinguish and segment the positive example from rich video content; by mining hard negative examples for contrastive learning, it learns to identify the target actor within a video and to distinguish target actors across different videos.
(3) Visual feature encoders are used to extract the clip-level features and frame-level features from which the sentence query feature q is obtained, as well as the visual features of the positive and negative examples; distinguishing the positive example from the negative examples through contrastive learning enhances the discriminative ability of the cascaded cross-modal attention module.
Drawings
FIG. 1 is a network architecture diagram illustrating a video actor segmentation method according to language description according to the present invention;
FIG. 2 is a diagram showing the result of applying the video actor segmentation method according to language description to the A2D dataset in Embodiment 1 of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings and embodiments. The features and advantages of the present invention will become more apparent from this description. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The invention provides a video actor segmentation method according to language description, which is carried out by utilizing a cascade cross-modal attention module to generate sentence query characteristics with discriminative power and improve the accuracy of matching and segmentation.
The cascaded cross-modal attention module includes a clip-level feature attention unit and a frame-level feature attention unit.
In the present invention, because the video clip feature is a global feature containing the information of the whole clip, the clip-level feature attention unit first coarsely attends to the informative words of the language query, such as words describing motion and temporal change. The frame-level feature attention unit then refines the word attention and adjusts the word weights for the specific frame, for example emphasizing attributes that describe appearance, which markedly improves the accuracy of intra-frame matching and segmentation. In this way the method learns a discriminative language representation conditioned on the given video using the clip-level and frame-level feature attention units.
In the cascaded cross-modal attention module, the clip-level feature attention unit takes the sentence embedding s and the clip-level feature v_c of the target frame i as inputs and produces the coarsely weighted sentence feature F_1.
In the present invention, the clip-level feature v_c is extracted with a two-stream I3D encoder; preferably, the I3D encoder is pre-trained on ImageNet and Kinetics before extracting clip-level features.
The two-stream I3D encoder is described in "Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the Kinetics dataset [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 6299-6308."
ImageNet is described in "Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge [J]. International Journal of Computer Vision, 2015, 115(3): 211-252."
Kinetics is described in "Kay W, Carreira J, Simonyan K, et al. The Kinetics human action video dataset [J]. arXiv preprint arXiv:1705.06950, 2017."
For a video V, the clip-level feature v_c is encoded by the following formula:

$$v_c = \frac{\theta_{avg}\big(\mathrm{I3D}(V)\big)}{\big\|\theta_{avg}\big(\mathrm{I3D}(V)\big)\big\|_2}$$

where $\|\cdot\|_2$ is the L2 norm (the square root of the sum of squared elements of a vector) and θ_avg is the mean-pooling operation. I3D(·) is the two-stream I3D encoder; preferably, the output of the Mixed-4f layer of the I3D network is used as the I3D encoder output.

The Mixed-4f layer of the I3D network is described in "Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the Kinetics dataset [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 6299-6308."
In the present invention, a visual feature v_j of the target frame is first fed into two convolutional layers $\tilde{\psi}(\cdot)$ and ψ(·) to generate two new visual features $\tilde{\psi}(v_j)$ and ψ(v_j). Meanwhile, the word embeddings e_t are combined to form the sentence embedding s, where e_t is the embedding feature of the t-th word, t ∈ [1, n], t is an integer and n is the total number of words; s is then fed into a convolutional layer φ(·) to generate the sentence feature φ(s). The word embedding e_t is described in "Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality [C]// Advances in Neural Information Processing Systems, 2013: 3111-3119."

The feature $\tilde{\psi}(v_j)$ is multiplied with φ(s) to generate a word-by-spatial-position attention map:

$$Att = \sigma_{\mathrm{softmax}}\big(\phi(s)\,\tilde{\psi}(v_j)^{T}\big)$$

where T is the matrix transpose, σ_softmax is the softmax function, and Att is an attention map that measures, for each word, the contribution of each spatial position.

Att is then multiplied with ψ(v_j) and added to φ(s) to obtain the visually weighted sentence features:

$$F = Att\,\psi(v_j) + \phi(s)$$

where F is the visually weighted sentence feature.
In the present invention, the clip-level feature attention unit uses the clip-level feature v_c to coarsely weight the language features:

$$Att_1 = \sigma_{\mathrm{softmax}}\big(\phi(s)\,\tilde{\psi}(v_c)^{T}\big)$$

$$F_1 = Att_1\,\psi(v_c) + \phi(s)$$

where T is the matrix transpose, σ_softmax is the softmax function, Att_1 is the attention map between the clip-level feature v_c and the sentence embedding s, and F_1 is the coarsely weighted sentence feature.
In the present invention, the frame-level feature attention unit processes the coarsely weighted sentence feature F_1 and the frame-level feature v_f to obtain the fine-tuned sentence feature F_2.

The frame-level feature v_f is extracted with a ResNet-101 network, which is preferably pre-trained on the COCO dataset and fine-tuned on the A2D dataset training split before frame-level features are extracted.

The ResNet-101 network is described in "He K, Zhang X, Ren S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778"; the COCO dataset is described in "Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: Common objects in context [C]// European Conference on Computer Vision. Springer, Cham, 2014: 740-755"; the A2D dataset is described in "Xu C, Hsieh S H, Xiong C, et al. Can humans fly? Action understanding with multiple classes of actors [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015: 2264-2273."
In the present invention, to enhance the feature map, the optical flow between the target frame i and each of the 2K nearby reference frames j is computed (the K frames before and the K frames after the target frame i serve as reference frames), and the ResNet-101 encoded features of the 2K nearby frames are warped onto the target frame based on the optical flow, according to the following formula:

$$v_{j\to i} = \mathcal{W}\big(v_j,\; OF_{j\to i}\big)$$

where v_j is the ResNet-101 encoded feature of the reference frame j, OF_{j→i} is the optical flow between the reference frame j and the target frame i, $\mathcal{W}(\cdot,\cdot)$ is the bilinear warping function, and v_{j→i} is the feature warped from the reference frame j to the target frame i. This operation is described in "Zhu X, Wang Y, Dai J, et al. Flow-guided feature aggregation for video object detection [C]// Proceedings of the IEEE International Conference on Computer Vision, 2017: 408-417."
The frame-level feature is a linear weighted combination of the features warped from frame j to frame i and the original feature v_i, as shown in the following formula:

$$v_f = (1-\beta)\,v_i + \frac{\beta}{2K}\sum_{j=i-K,\; j\neq i}^{i+K} v_{j\to i}$$

where v_i is the ResNet-101 encoded feature of the target frame i, β is a weight coefficient, i is the target frame, j is a reference frame, and 2K is the number of reference frames, i.e. K frames before and K frames after the target frame are taken to compensate the target-frame features. In this way the frame-level feature v_f is enhanced by nearby frame-level features.
In the present invention, the enhanced frame-level feature v_f is fed directly into a Region Proposal Network (RPN) to generate proposals. Because the target frame has been compensated by its neighboring frames, moving objects can be detected more easily. The Region Proposal Network (RPN) is described in "He K, Gkioxari G, Dollár P, et al. Mask R-CNN [C]// Proceedings of the IEEE International Conference on Computer Vision, 2017: 2961-2969"; preferably, the Region Proposal Network (RPN) is pre-trained on the COCO dataset and fine-tuned on the A2D dataset training split.
The frame-level feature attention unit processes the coarsely weighted sentence feature F_1 and the frame-level feature v_f to obtain the fine-tuned sentence feature F_2:

$$Att_2 = \sigma_{\mathrm{softmax}}\big(F_1\,\tilde{\psi}'(v_f)^{T}\big)$$

$$F_2 = Att_2\,\psi'(v_f) + F_1$$

where Att_2 is the attention map between the frame-level feature v_f and the coarsely weighted sentence feature F_1; F_2 is the fine-tuned sentence feature; $\tilde{\psi}'(v_f)$ is the feature obtained by passing v_f through a linear layer $\tilde{\psi}'(\cdot)$; and ψ'(v_f) is the feature obtained by passing v_f through a linear layer ψ'(·).
The weighted word features h_t are passed through a fully connected layer to obtain the sentence query feature q:

$$m_t = FC(h_t)$$

$$\alpha_t = \sigma_{\mathrm{softmax}}(m_t)$$

$$q = \sum_{t=1}^{n} \alpha_t\, h_t$$

where h_t is the t-th column of F_2, representing the vector of the t-th word; FC(·) is a fully connected layer, described in "Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks [C]// Advances in Neural Information Processing Systems, 2012"; m_t is the intermediate vector obtained by passing h_t through the fully connected layer; α_t is the weighting coefficient of the t-th word; and e_t, the embedding feature of the t-th word used to form s, is described in "Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality [C]// Advances in Neural Information Processing Systems, 2013: 3111-3119."
In a preferred embodiment of the present invention, during training the video actor segmentation method according to language description further includes contrastive learning with the sentence query feature q to distinguish the positive example from hard negative examples. The positive example is the region proposal on the target frame whose IoU (intersection over union) with the ground truth is greater than 0.5 and is the largest among all proposals. A hard negative example is a region proposal on the target frame whose IoU with the ground truth is less than 0.5. IoU is described in "He K, Gkioxari G, Dollár P, et al. Mask R-CNN [C]// Proceedings of the IEEE International Conference on Computer Vision, 2017: 2961-2969."
Notably, matching the sentence query feature q to the desired region proposal cannot be learned sufficiently using only the hard negatives within the same target frame, so a hard-negative mining strategy is designed in the present invention, divided into two parts. First, hard negatives are mined from within the video containing the target frame: from the other key frames of the video, region proposals that differ from the ground truth of the current frame (i.e. proposals with IoU less than 0.5) are taken as hard negatives. Second, hard negatives are mined from different videos: when the hard negatives are still insufficient, videos containing the same actor-action label are searched, and region proposals that differ from that actor-action label are taken from their key frames as hard negatives. During testing, the region proposal on the target frame with the highest similarity score to the obtained sentence query feature q is selected as the matching result.
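A schematic sketch of this two-part mining strategy is given below; the data structures (dictionaries of proposals with boxes and labels) and the IoU helper are assumptions made for the example, not part of the original disclosure.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def mine_hard_negatives(same_video_frames, other_videos, label, num_needed):
    """Collect hard negatives: first from other key frames of the same video,
    then, if still short, from videos sharing the same actor-action label."""
    negatives = []
    # Part 1: proposals from other key frames of the same video with IoU < 0.5 to that frame's ground truth
    for frame in same_video_frames:
        for prop in frame["proposals"]:
            if iou(prop["box"], frame["gt_box"]) < 0.5:
                negatives.append(prop)
    # Part 2: proposals from other videos with the same actor-action label,
    # excluding proposals that carry that label
    if len(negatives) < num_needed:
        for video in other_videos:
            if video["label"] != label:
                continue
            for frame in video["key_frames"]:
                for prop in frame["proposals"]:
                    if prop.get("label") != label:
                        negatives.append(prop)
    return negatives[:num_needed]
```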
In the present invention, the actor and its action that match the sentence query feature q must be separated from a video containing rich information; therefore the method performs contrastive learning with the sentence query feature q and mines hard negative examples to strengthen the matching between q and the region proposals.
In the present invention, the sentence embedding s is processed by the clip-level feature attention unit and the frame-level feature attention unit to obtain the fine-tuned sentence feature F_2, and the weighted word features finally give the sentence query feature q. Let r^+ be the visual feature of the positive example and r_l^- the visual feature of a negative example, l = 1, 2, …, L, where L is the number of negative examples and is an integer greater than or equal to 1, preferably an integer from 5 to 50, more preferably from 15 to 35, such as 25. The loss function for the sentence query feature q is:

$$Loss = -\log \frac{\exp\!\left(q^{T} r^{+}\right)}{\exp\!\left(q^{T} r^{+}\right) + \sum_{l=1}^{L} \exp\!\left(q^{T} r_{l}^{-}\right)}$$

where T is the matrix transpose.

The loss function uses the (L+1)-tuple to optimize the identification of the positive example among the L negative examples and reduces the possibility of the network converging to a poor local optimum.
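A sketch of this (L+1)-way contrastive loss is shown below, written for a single query; batching and any temperature scaling are left out, and the dot-product similarity is taken directly from the formula above.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, r_pos, r_negs):
    """(L+1)-way softmax loss: identify the positive r_pos among L negatives r_negs.

    q:      (D,) sentence query feature
    r_pos:  (D,) visual feature of the positive example
    r_negs: (L, D) visual features of the negative examples
    """
    logits = torch.cat([(q * r_pos).sum().view(1),   # q^T r^+
                        r_negs @ q])                 # q^T r_l^- for each negative
    # cross-entropy with the positive at index 0 equals -log softmax_0(logits)
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```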
The visual feature r^+ of the positive example and the visual features r_l^- of the negative examples are extracted with an RPN, i.e. a Region Proposal Network, which generates region proposals and whose backbone is the ResNet-101 network. Preferably, the ResNet-101 network and the Region Proposal Network (RPN) are pre-trained on the COCO dataset and fine-tuned on the A2D dataset training split before feature extraction. Each region proposal has a target feature o and a location feature l.

The target feature o is the feature output by the RoIAlign module of the RPN network and preferably has the same dimension as the sentence query feature q, e.g. a 1024-dimensional vector. The RoIAlign module is described in "He K, Gkioxari G, Dollár P, et al. Mask R-CNN [C]// Proceedings of the IEEE International Conference on Computer Vision, 2017: 2961-2969."
The location feature l is:

$$l = \left[\frac{x_{tl}}{W},\; \frac{y_{tl}}{H},\; \frac{x_{br}}{W},\; \frac{y_{br}}{H},\; \frac{w\,h}{W\,H}\right]$$

where (x_tl, y_tl) and (x_br, y_br) are the coordinates of the top-left and bottom-right corners of the region proposal, w and h are the width and height of the region proposal, and W and H are the width and height of the target frame.
The visual feature r^+ of the positive example or r_l^- of a negative example is obtained by concatenating the location feature l with the target feature o and passing the concatenated feature through a fully connected layer, giving a visual feature with the same dimension as the text representation:

$$r = \sigma_{\tanh}\big(W([o;\, l])\big)$$

where r is the visual feature r^+ of the positive example or r_l^- of a negative example, a C-dimensional object-level feature; [;] denotes the concatenation operation; W is the parameter matrix to be learned; and σ_tanh is the tanh activation function.

The tanh activation function is described in "LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition [J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324."
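A minimal sketch of assembling r from a proposal's RoIAlign target feature o and its normalized location feature l follows; the 1024-dimensional sizes echo the example given later, while the use of an nn.Linear layer (which adds a bias) for the matrix W is an assumption.

```python
import torch
import torch.nn as nn

class ProposalEncoder(nn.Module):
    """Sketch: build r = tanh(W([o; l])) from a proposal's target and location features."""
    def __init__(self, obj_dim=1024, loc_dim=5, out_dim=1024):
        super().__init__()
        self.fc = nn.Linear(obj_dim + loc_dim, out_dim)  # plays the role of the parameter matrix W

    @staticmethod
    def location_feature(box, frame_w, frame_h):
        # box = (x_tl, y_tl, x_br, y_br) in pixels of the target frame
        x_tl, y_tl, x_br, y_br = box
        w, h = x_br - x_tl, y_br - y_tl
        return torch.tensor([x_tl / frame_w, y_tl / frame_h,
                             x_br / frame_w, y_br / frame_h,
                             (w * h) / (frame_w * frame_h)])

    def forward(self, o, l):
        # o: (obj_dim,) RoIAlign feature; l: (loc_dim,) location feature
        return torch.tanh(self.fc(torch.cat([o, l], dim=-1)))  # r, same dimension as q
```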
In the present invention, after the trained network is obtained by contrastive learning between the sentence query feature q and the positive/negative examples, the region proposal with the highest similarity to the text feature is selected, and the visual feature r^+ of the selected region proposal is input into the segmentation branch to generate the result mask. The segmentation branch in the present invention is the Mask R-CNN segmentation branch, described in "He K, Gkioxari G, Dollár P, et al. Mask R-CNN [C]// Proceedings of the IEEE International Conference on Computer Vision, 2017: 2961-2969." FIG. 1 is a schematic diagram of the network structure of the method for segmenting a video actor and its action according to language description in the present invention.
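At test time the matching reduces to a similarity ranking over the proposals of the target frame; a hedged sketch is given below, using the dot product q^T r as the similarity score, which is an assumption consistent with the loss above.

```python
import torch

def select_best_proposal(q, proposal_feats):
    """Pick the proposal whose visual feature r has the highest similarity to q.

    q:              (D,) sentence query feature
    proposal_feats: (P, D) features r of the P region proposals on the target frame
    Returns the index of the best proposal, which is then passed to the mask branch.
    """
    scores = proposal_feats @ q          # (P,) similarity scores q^T r
    return int(torch.argmax(scores))     # index of the matched region proposal
```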
The second aspect of the present invention also provides a computer readable storage medium storing a training program for video actor segmentation according to language description, which when executed by a processor, causes the processor to perform the steps of the video actor segmentation method according to language description.
The method for segmenting the video actor according to the language description in the invention can be realized by means of software plus a necessary general hardware platform, wherein the software is stored in a computer readable storage medium (comprising a ROM/RAM, a magnetic disk and an optical disk) and comprises a plurality of instructions for enabling a terminal device (which can be a mobile phone, a computer, a server, a network device and the like) to execute the method in the invention.
The third aspect of the present invention also provides a computer device comprising a memory and a processor, the memory storing a training program for video actor segmentation according to a language description, the program, when executed by the processor, causing the processor to perform the steps of the video actor segmentation method according to a language description.
The video actor segmentation method according to language description provided by the invention first uses clip-level visual features to coarsely attend to the informative words of the language query and then uses frame-level visual features to refine the word attention, markedly improving the accuracy of intra-frame matching and segmentation. In addition, the method can identify the positive example in a video with rich content; through the hard-negative mining strategy and the introduction of contrastive learning, it learns not only to identify the target actor within a video but also to distinguish target actors across different videos, so the cross-modal retrieval has stronger discriminative ability. Furthermore, a dedicated segmentation network is used to segment the matched region, yielding a mask with better completeness and better edges.
Examples
The present invention is further described below by way of specific examples, which are merely exemplary and do not limit the scope of the present invention in any way.
Example 1
(1) The I3D network was trained on the ImageNet and Kinetics datasets, and the output of the Mixed-4f layer of the I3D network was used as the I3D encoder.
The ImageNet dataset is described in "Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al. ImageNet large scale visual recognition challenge [J]. International Journal of Computer Vision, 2015, 115(3): 211-252."
The Kinetics dataset is described in "Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Viola F, Green T, Back T, Natsev P, et al. The Kinetics human action video dataset [J]. arXiv preprint arXiv:1705.06950, 2017."
Given a video V (as shown in the video frames in FIG. 1 or FIG. 2), the clip-level feature v_c is encoded by the formula:

$$v_c = \frac{\theta_{avg}\big(\mathrm{I3D}(V)\big)}{\big\|\theta_{avg}\big(\mathrm{I3D}(V)\big)\big\|_2}$$

where $\|\cdot\|_2$ denotes the L2 norm and θ_avg denotes the mean-pooling operation; the resulting clip-level feature v_c is a 32 × 832 matrix.
(2) The ResNet-101 network was pre-trained on the COCO dataset and then fine-tuned on the A2D dataset training split. The ResNet-101 network is described in "He K, Zhang X, Ren S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778"; the COCO dataset is described in "Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: Common objects in context [C]// European Conference on Computer Vision. Springer, Cham, 2014: 740-755"; the A2D dataset is described in "Xu C, Hsieh S H, Xiong C, et al. Can humans fly? Action understanding with multiple classes of actors [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015: 2264-2273."
The optical flow between the target frame i and each of the 14 nearby reference frames j (the 7 frames before and the 7 frames after the target frame i are the reference frames j) was computed, and the ResNet-101 encoded features of the 14 reference frames j were warped onto the target frame based on the optical flow, according to the formula:

$$v_{j\to i} = \mathcal{W}\big(v_j,\; OF_{j\to i}\big)$$

where v_j is the ResNet-101 encoded feature of the reference frame j, OF_{j→i} is the optical flow between the reference frame j and the target frame i, $\mathcal{W}(\cdot,\cdot)$ is the bilinear warping function, and v_{j→i} is the feature warped from frame j to frame i, as described in "Zhu X, Wang Y, Dai J, et al. Flow-guided feature aggregation for video object detection [C]// Proceedings of the IEEE International Conference on Computer Vision, 2017: 408-417."
The frame-level feature v_f is the linear weighted combination of the features warped from frame j to frame i and the original feature v_i, where v_i is the ResNet-101 encoded feature of the target frame i:

$$v_f = (1-\beta)\,v_i + \frac{\beta}{2K}\sum_{j=i-K,\; j\neq i}^{i+K} v_{j\to i}$$

where β is a weight coefficient set to 0.1 and K is set to 7. The reference frames j are the 7 frames before and the 7 frames after the target frame i, and the video is sampled at 24 fps. By linearly weighting the features warped from frame j to frame i with the original feature v_i, the frame-level feature v_f is enhanced by nearby frame-level features. The enhanced feature is then fed directly into a Region Proposal Network (RPN) to generate proposals, as described in "Girshick R. Fast R-CNN [C]// Proceedings of the IEEE International Conference on Computer Vision, 2015: 1440-1448"; the Region Proposal Network (RPN) was pre-trained on the COCO dataset and fine-tuned on the A2D dataset training split. Because the target frame i has been compensated by its nearby frames, moving objects can be detected more easily.

Here v_f is a matrix of dimensions 16 × 16 × 1024.
(3) The clip-level feature attention unit uses the clip-level feature v_c to coarsely weight the language features. The clip-level feature v_c of the target frame i is fed into the convolutional layers $\tilde{\psi}(\cdot)$ and ψ(·) to generate two new visual features $\tilde{\psi}(v_c)$ and ψ(v_c).

The convolutional layer $\tilde{\psi}(\cdot)$ is a 1 × 1 convolution with 1024 output dimensions, and the convolutional layer ψ(·) is a 1 × 1 convolution with 1024 output dimensions.

The word embeddings e_t are combined to form the sentence embedding s, which is then fed into a 1024-dimensional fully connected layer to generate the sentence feature φ(s), where t ∈ [1, n], t is an integer and n is the total number of words. e_t, the embedding feature of the t-th word, is a 300-dimensional vector obtained from the word2vec model trained on the Google News dataset, as described in "Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality [C]// Advances in Neural Information Processing Systems, 2013: 3111-3119"; the Google News dataset is described in "Das A S, Datar M, Garg A, Rajaram S. Google news personalization: scalable online collaborative filtering [C]// Proceedings of the 16th International Conference on World Wide Web. ACM, 2007: 271-280". φ(·) is a fully connected layer that maps the 300-dimensional information to 1024 dimensions so that the sentence feature matches the dimensions of the visual features $\tilde{\psi}(v_c)$ and ψ(v_c).
The clip-level feature attention unit uses the clip-level feature v_c to coarsely weight the language features:

$$Att_1 = \sigma_{\mathrm{softmax}}\big(\phi(s)\,\tilde{\psi}(v_c)^{T}\big)$$

$$F_1 = Att_1\,\psi(v_c) + \phi(s)$$

where σ_softmax is the softmax activation function, Att_1 is the attention map between the clip-level feature v_c and the sentence embedding s, with dimensions 20 × 1024, in which each element (a, b) represents the effect of the visual feature at the b-th position on the a-th word, and F_1 is the coarsely weighted sentence feature, with dimensions 20 × 1024, representing the word features after weighting by the global features. The softmax activation function is described in "LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition [J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324."
The coarsely weighted sentence feature F_1 and the frame-level feature v_f are input into the frame-level feature attention unit to obtain the fine-tuned sentence feature F_2:

$$Att_2 = \sigma_{\mathrm{softmax}}\big(F_1\,\tilde{\psi}'(v_f)^{T}\big)$$

$$F_2 = Att_2\,\psi'(v_f) + F_1$$

where Att_2 is the attention map between the frame-level feature v_f and the coarsely weighted sentence feature F_1, and F_2 is the fine-tuned sentence feature. $\tilde{\psi}'(\cdot)$ is a 1 × 1 convolution with 1024 output dimensions, and ψ'(·) is a 1 × 1 convolution with 1024 output dimensions.
(4) The weighted word features h_t of the fine-tuned sentence feature F_2 give the sentence query feature q:

$$m_t = FC(h_t)$$

$$\alpha_t = \sigma_{\mathrm{softmax}}(m_t)$$

$$q = \sum_{t=1}^{n} \alpha_t\, h_t$$

where h_t is the t-th column of F_2, representing the vector of the t-th word; FC(·) is a fully connected layer; m_t is the intermediate vector obtained by passing h_t through the fully connected layer; and α_t is the weighting coefficient of the t-th word.
(5) For the video V, the sentence embedding is processed through steps (1)-(4) to obtain the sentence query feature q (as in step (4)); the visual feature of the single positive example is r^+ and the visual features of the 25 negative examples are r_l^-, where l = 1, 2, …, 25.

The loss function is:

$$Loss = -\log \frac{\exp\!\left(q^{T} r^{+}\right)}{\exp\!\left(q^{T} r^{+}\right) + \sum_{l=1}^{25} \exp\!\left(q^{T} r_{l}^{-}\right)}$$

where T is the matrix transpose.
(6) The visual feature r^+ of the positive example and the visual features r_l^- of the negative examples were extracted with the ResNet-101 network, described in "He K, Zhang X, Ren S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778". Before feature extraction, the network was pre-trained on the COCO dataset ("Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: Common objects in context [C]// European Conference on Computer Vision, 2014: 740-755") and fine-tuned on the A2D dataset training split (the A2D dataset is described in "Xu C, Hsieh S H, Xiong C, et al. Can humans fly? Action understanding with multiple classes of actors [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015: 2264-2273") to generate region proposals. The feature o of a region proposal is the target feature of the region output by the fine-tuned ResNet-101, and the feature l is the location feature.

The target feature o is the feature output by the RoIAlign module of the RPN network, specifically a 1024-dimensional vector (the dimension of the target feature o matches that of the sentence query feature q).
The location feature l is:

$$l = \left[\frac{x_{tl}}{W},\; \frac{y_{tl}}{H},\; \frac{x_{br}}{W},\; \frac{y_{br}}{H},\; \frac{w\,h}{W\,H}\right]$$

where (x_tl, y_tl) and (x_br, y_br) are the coordinates of the top-left and bottom-right corners of the region proposal, w and h are the width and height of the region proposal, and W and H are the width and height of the target frame.
The visual feature r^+ of the positive example or r_l^- of a negative example is obtained by concatenating the location feature l with the target feature o and passing the concatenated feature through a fully connected layer, giving a visual feature with the same dimension as the text representation:

$$r = \sigma_{\tanh}\big(W([o;\, l])\big)$$

where r is the visual feature r^+ of the positive example or r_l^- of a negative example, a C-dimensional object-level feature; [;] denotes the concatenation operation; and σ_tanh is the tanh activation function.

The tanh activation function is described in "LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition [J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324."
The visual feature r^+ of the positive example and the visual features r_l^- of the negative examples are contrasted against the sentence query feature q during learning; the proposal that matches q is the positive example to be segmented. The region proposal with the highest similarity to the text feature is obtained, and the visual feature r^+ of the selected region proposal is input into the Mask R-CNN segmentation branch to generate the result mask, giving the final segmentation result.
FIG. 2 shows the results of the method of Embodiment 1 on the A2D dataset: the text in the first row is the input sentence, the picture sequence in the second row is the input video, the third row shows the segmentation results obtained in this embodiment, and the fourth row shows the ground-truth segmentation maps provided by the dataset. As can be seen from FIG. 2, the video actor segmentation method based on language description provided by the present invention can completely segment the target actor from the video frames, and the segmentation results are very close to the ground-truth segmentation maps provided by the dataset.
The invention has been described in detail with reference to specific embodiments and/or illustrative examples and the accompanying drawings, which, however, should not be construed as limiting the invention. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the technical solution of the present invention and its embodiments without departing from the spirit and scope of the present invention, which fall within the scope of the present invention. The scope of the invention is defined by the appended claims.

Claims (10)

1. A method of video actor segmentation in accordance with a language description, the method being performed using a cascaded cross-modal attention module comprising a clip-level feature attention unit and a frame-level feature attention unit;

the clip-level feature attention unit uses the clip-level feature v_c to coarsely weight the language features:

$$Att_1 = \sigma_{\mathrm{softmax}}\big(\phi(s)\,\tilde{\psi}(v_c)^{T}\big)$$

$$F_1 = Att_1\,\psi(v_c) + \phi(s)$$

wherein T is the matrix transpose; σ_softmax is the softmax activation function; Att_1 is the attention map between the clip-level feature v_c and the sentence embedding s; F_1 is the coarsely weighted sentence feature; the clip-level feature v_c is passed through convolutional layers $\tilde{\psi}(\cdot)$ and ψ(·) to obtain $\tilde{\psi}(v_c)$ and ψ(v_c); the word embeddings e_t are combined to form the sentence embedding s, which is then fed into a convolutional layer φ(·) to generate the sentence feature φ(s); e_t is the embedding feature of the t-th word;

the frame-level feature attention unit processes the coarsely weighted sentence feature F_1 and the frame-level feature v_f to obtain the fine-tuned sentence feature F_2:

$$Att_2 = \sigma_{\mathrm{softmax}}\big(F_1\,\tilde{\psi}'(v_f)^{T}\big)$$

$$F_2 = Att_2\,\psi'(v_f) + F_1$$

wherein Att_2 is the attention map between the frame-level feature v_f and the coarsely weighted sentence feature F_1; F_2 is the fine-tuned sentence feature; $\tilde{\psi}'(v_f)$ is the feature obtained by passing v_f through a linear layer $\tilde{\psi}'(\cdot)$; and ψ'(v_f) is the feature obtained by passing v_f through a linear layer ψ'(·).
2. The method of claim 1, wherein the clip-level feature attention unit takes the sentence embedding s and the clip-level feature v_c of the target frame i as inputs.
3. The method of claim 1, wherein for a video V, the clip-level feature v_c is encoded by the following formula:

$$v_c = \frac{\theta_{avg}\big(\mathrm{I3D}(V)\big)}{\big\|\theta_{avg}\big(\mathrm{I3D}(V)\big)\big\|_2}$$

wherein $\|\cdot\|_2$ denotes the L2 norm, θ_avg is the mean-pooling operation, and I3D(·) is a two-stream I3D encoder.
4. The method of claim 1, wherein the frame-level feature v_f is extracted using a ResNet-101 network.
5. The method of claim 4, wherein, before extracting frame-level features, the ResNet-101 network is pre-trained on the COCO dataset and fine-tuned on the A2D dataset training split;

the frame-level feature is a linear weighted combination of the features warped from frame j to frame i and the original feature v_i, as shown in the following formula:

$$v_f = (1-\beta)\,v_i + \frac{\beta}{2K}\sum_{j=i-K,\; j\neq i}^{i+K} v_{j\to i}$$

wherein v_i is the ResNet-101 encoded feature of the target frame i, β is a weight coefficient, i is the target frame, j is a reference frame, 2K is the number of reference frames, with K frames taken before and K frames taken after the target frame to compensate the target-frame features, and v_{j→i} is the feature warped from the reference frame j to the target frame i;

the warped feature v_{j→i} is:

$$v_{j\to i} = \mathcal{W}\big(v_j,\; OF_{j\to i}\big)$$

wherein v_j is the ResNet-101 encoded feature of the reference frame j, OF_{j→i} is the optical flow between the reference frame j and the target frame i, and $\mathcal{W}(\cdot,\cdot)$ is the bilinear warping function.
6. The method of claim 1, wherein the weighted word features h_t are passed through a fully connected layer to obtain the sentence query feature q:

$$m_t = FC(h_t)$$

$$\alpha_t = \sigma_{\mathrm{softmax}}(m_t)$$

$$q = \sum_{t=1}^{n} \alpha_t\, h_t$$

wherein h_t is the t-th column of F_2, representing the vector of the t-th word; FC(·) is a fully connected layer; m_t is the vector obtained after passing h_t through the fully connected layer; and α_t is the weighting coefficient of the t-th word.
7. The method according to one of claims 1 to 6, further comprising performing contrastive learning using the sentence query feature q to distinguish a positive example from hard negative examples, wherein the positive example is the region proposal on the target frame whose IoU with the ground truth is greater than 0.5 and is the largest, and a hard negative example is a region proposal on the target frame whose IoU with the ground truth is less than 0.5.
8. The method according to one of claims 1 to 6, wherein in the method the loss function of the sentence query feature q is:

$$Loss = -\log \frac{\exp\!\left(q^{T} r^{+}\right)}{\exp\!\left(q^{T} r^{+}\right) + \sum_{l=1}^{L} \exp\!\left(q^{T} r_{l}^{-}\right)}$$

wherein T is the matrix transpose, r^+ is the visual feature of the positive example, r_l^- is the visual feature of a negative example, l = 1, 2, …, L, and L is the number of negative examples;

the visual feature r^+ of the positive example or r_l^- of a negative example is:

$$r = \sigma_{\tanh}\big(W([o;\, l])\big)$$

wherein r is the visual feature r^+ of the positive example or r_l^- of a negative example and is a C-dimensional object-level feature, [;] denotes the concatenation operation, W is the parameter matrix to be learned, and σ_tanh is the tanh activation function;

the visual feature r^+ of the positive example or r_l^- of a negative example is extracted using an RPN network, which generates a region proposal having a target feature o and a corresponding location feature l;

the target feature o is a feature generated by the RPN network, and the location feature l is:

$$l = \left[\frac{x_{tl}}{W},\; \frac{y_{tl}}{H},\; \frac{x_{br}}{W},\; \frac{y_{br}}{H},\; \frac{w\,h}{W\,H}\right]$$

wherein (x_tl, y_tl) and (x_br, y_br) are the coordinates of the top-left and bottom-right corners of the region proposal, w and h are the width and height of the region proposal, and W and H are the width and height of the target frame.
9. A computer-readable storage medium, in which a training program for video-actor segmentation in accordance with a language description is stored, which program, when being executed by a processor, causes the processor to carry out the steps of the video-actor segmentation in accordance with a language description method as claimed in one of claims 1 to 8.
10. Computer device, characterized in that it comprises a memory and a processor, said memory storing a training program for video actor segmentation according to a language description, said program, when executed by the processor, causing the processor to carry out the steps of the video actor segmentation method according to a language description according to one of claims 1 to 8.
CN202111081527.8A 2021-09-15 2021-09-15 Video actor segmentation method according to language description Active CN113869154B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111081527.8A CN113869154B (en) 2021-09-15 2021-09-15 Video actor segmentation method according to language description

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111081527.8A CN113869154B (en) 2021-09-15 2021-09-15 Video actor segmentation method according to language description

Publications (2)

Publication Number Publication Date
CN113869154A CN113869154A (en) 2021-12-31
CN113869154B true CN113869154B (en) 2022-09-02

Family

ID=78996032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111081527.8A Active CN113869154B (en) 2021-09-15 2021-09-15 Video actor segmentation method according to language description

Country Status (1)

Country Link
CN (1) CN113869154B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114494297B (en) * 2022-01-28 2022-12-06 杭州电子科技大学 Adaptive video target segmentation method for processing multiple priori knowledge


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10395118B2 (en) * 2015-10-29 2019-08-27 Baidu Usa Llc Systems and methods for video paragraph captioning using hierarchical recurrent neural networks

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101411195B (en) * 2003-09-07 2012-07-04 Microsoft Corporation Coding and decoding for interlaced video
CN108829677A (en) * 2018-06-05 2018-11-16 Dalian University of Technology Automatic image caption generation method based on multi-modal attention
US10839223B1 (en) * 2019-11-14 2020-11-17 Fudan University System and method for localization of activities in videos
CN112466326A (en) * 2020-12-14 2021-03-09 Jiangsu Normal University Speech emotion feature extraction method based on Transformer model encoder
CN112650886A (en) * 2020-12-28 2021-04-13 University of Electronic Science and Technology of China Cross-modal video time retrieval method based on cross-modal dynamic convolution network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
[Latest VideoQA paper reading] The first video question answering survey: Video Question Answering: a Survey of Models and Datasets; Leokadia Rothschild; https://blog.csdn.net/m0_46413065/article/details/113828037; 2021-02-16; pp. 1-36 *
CVPR 2020 | Fine-grained text-video cross-modal retrieval; AI科技评论 (AI Technology Review); https://zhuanlan.zhihu.com/p/115890991; 2020-03-24; pp. 1-6 *
Research progress on video saliency detection; Cong Runmin et al.; Journal of Software (软件学报); 2018-02-08 (No. 08); full text *

Also Published As

Publication number Publication date
CN113869154A (en) 2021-12-31

Similar Documents

Publication Publication Date Title
Han et al. A survey on visual transformer
Wu et al. Language as queries for referring video object segmentation
Zhao et al. Object detection with deep learning: A review
Hu et al. Dense relation distillation with context-aware aggregation for few-shot object detection
Ricci et al. Monocular depth estimation using multi-scale continuous CRFs as sequential deep networks
Abbas et al. A comprehensive review of recent advances on deep vision systems
Bhunia et al. Joint visual semantic reasoning: Multi-stage decoder for text recognition
CN111444889A (en) Fine-grained action detection method of convolutional neural network based on multi-stage condition influence
Sun et al. Combining feature-level and decision-level fusion in a hierarchical classifier for emotion recognition in the wild
Li et al. Learning face image super-resolution through facial semantic attribute transformation and self-attentive structure enhancement
CN112307995A (en) Semi-supervised pedestrian re-identification method based on feature decoupling learning
CN113569865A (en) Single sample image segmentation method based on class prototype learning
CN112801068B (en) Video multi-target tracking and segmenting system and method
Gao et al. Co-saliency detection with co-attention fully convolutional network
CN112819013A (en) Image description method based on intra-layer and inter-layer joint global representation
CN115019143A (en) Text detection method based on CNN and Transformer mixed model
Jiang et al. Context-integrated and feature-refined network for lightweight object parsing
CN113935435A (en) Multi-modal emotion recognition method based on space-time feature fusion
CN113869154B (en) Video actor segmentation method according to language description
US20230072445A1 (en) Self-supervised video representation learning by exploring spatiotemporal continuity
Chen et al. Y-Net: Dual-branch joint network for semantic segmentation
CN116704506A (en) Cross-environment-attention-based image segmentation method
Cai et al. Vehicle detection based on visual saliency and deep sparse convolution hierarchical model
Yan et al. MEAN: multi-element attention network for scene text recognition
US20230186600A1 (en) Method of clustering using encoder-decoder model based on attention mechanism and storage medium for image recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant