CN113869154B - Video actor segmentation method according to language description - Google Patents

Video actor segmentation method according to language description

Info

Publication number
CN113869154B
CN113869154B (application CN202111081527.8A)
Authority
CN
China
Prior art keywords
feature
frame
features
sentence
level
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111081527.8A
Other languages
Chinese (zh)
Other versions
CN113869154A (en)
Inventor
Guorong Li (李国荣)
Weidong Chen (陈伟东)
Xinfeng Zhang (张新峰)
Qingming Huang (黄庆明)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Chinese Academy of Sciences
Original Assignee
University of Chinese Academy of Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Chinese Academy of Sciences filed Critical University of Chinese Academy of Sciences
Priority to CN202111081527.8A
Publication of CN113869154A
Application granted
Publication of CN113869154B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06F 18/20: Analysing
    • G06F 18/24: Classification techniques
    • G06F 18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F 18/2415: Classification techniques relating to the classification model based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/047: Probabilistic or stochastic networks
    • G06N 3/048: Activation functions
    • G06N 3/08: Learning methods

Abstract

The invention discloses a video actor segmentation method according to language description. A cascaded cross-modal attention module first uses clip-level visual features to coarsely attend to the informative words of the language query, then uses frame-level visual features to refine the word attention and adjust the weight of each word for the target frame. The method can distinguish and segment the positive example from rich video content; by mining hard negative examples for contrastive learning, it learns to identify the target actor within a video and to distinguish target actors across different videos, thereby markedly improving the accuracy of intra-frame matching and segmentation.

Description

Video actor segmentation method according to language description
Technical Field
The invention relates to the technical field of video recognition, and in particular to a video actor segmentation method according to language description.
Background
In recent years, video understanding tasks have received much attention, especially those involving natural language processing. Substantial progress has been made in language-guided temporal action localization, video caption generation, and the segmentation of actors and their actions in video according to sentence descriptions. In real-world scenarios a video commonly contains multiple actors performing actions, so selectively localizing a particular actor and its action in space and time at a fine granularity through a language query has become an important task for computers to better understand video.
A framework widely used in related tasks such as video/image object grounding first generates region proposals in the video/image with a detection method and then matches the text features against the visual features of the proposals to select the best-matching object. To improve the matching of these two heterogeneous features, previous work first generated language features with a bidirectional LSTM and a self-attention mechanism, then processed the visual features with the weighted text features, and finally matched text features to visual features. However, the language attention produced by such a self-attention mechanism is in effect an average solution over the training data rather than a solution tailored to a specific video. During inference the attended language features are therefore fixed regardless of the input video, even though video is a high-level semantic space with rich content whose most discriminative cues are hard to capture. The video should determine which words of the language query are key; capturing the informative words and learning a visually aware, discriminative language representation is therefore critical to the language-guided video actor and action segmentation task.
The problem to be solved is therefore how to design a visually aware language encoder that generates a discriminative language representation for segmenting an actor and its action in a video, and how to further optimize the segmentation method to improve the accuracy of intra-frame matching and segmentation.
Disclosure of Invention
To overcome these problems, the invention provides a method for segmenting a video actor and its action according to a language description, in which a collaboratively optimized network with a cascaded cross-modal attention mechanism markedly improves the accuracy of matching and segmentation. The language query is attended from coarse to fine using visual features of two granularities, producing discriminative, visually aware language features. In addition, a contrastive learning scheme with a hard-negative mining strategy is provided, which helps the network distinguish the positive example from negative examples and further improves performance.
An object of the invention is to provide a video actor segmentation method according to language description that uses a cascaded cross-modal attention module to generate discriminative sentence query features and improve the accuracy of matching and segmentation.
The cascaded cross-modal attention module includes a clip-level feature attention unit and a frame-level feature attention unit.
The clip-level feature attention unit takes the sentence embedding s and the clip-level feature v_c of the target frame i as inputs.

The clip-level feature attention unit uses the clip-level feature v_c to coarsely weight the language features:

$$Att_1 = \sigma_{\mathrm{softmax}}\big(\phi(s)\,\tilde{\psi}(v_c)^{T}\big)$$

$$F_1 = Att_1\,\psi(v_c) + \phi(s)$$

where T denotes matrix transposition; σ_softmax is the softmax activation function; Att_1 is the attention map between the clip-level feature v_c and the sentence embedding s; F_1 is the coarsely weighted sentence feature; the clip-level feature v_c is passed through convolutional layers $\tilde{\psi}(\cdot)$ and ψ(·) to obtain $\tilde{\psi}(v_c)$ and ψ(v_c); the word embeddings e_t are combined to form the sentence embedding s, which is then fed into a convolutional layer φ(·) to generate the sentence feature φ(s).
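For illustration only, a minimal PyTorch-style sketch of this clip-level cross-modal attention is given below; the use of linear layers in place of 1 × 1 convolutions, the tensor layouts, and the dimension sizes are assumptions made for the example rather than limitations of the invention.

```python
import torch
import torch.nn as nn

class ClipLevelAttention(nn.Module):
    """Sketch of the clip-level feature attention unit (assumed shapes)."""
    def __init__(self, vis_dim=832, txt_dim=300, dim=1024):
        super().__init__()
        self.psi_tilde = nn.Linear(vis_dim, dim)  # plays the role of the conv layer psi~(.)
        self.psi = nn.Linear(vis_dim, dim)        # plays the role of the conv layer psi(.)
        self.phi = nn.Linear(txt_dim, dim)        # plays the role of the conv layer phi(.)

    def forward(self, v_c, s):
        # v_c: (N, vis_dim) clip-level visual features at N positions
        # s:   (n, txt_dim) sentence embedding, one row per word
        phi_s = self.phi(s)                                             # (n, dim)
        att1 = torch.softmax(phi_s @ self.psi_tilde(v_c).t(), dim=-1)   # (n, N) attention map Att_1
        f1 = att1 @ self.psi(v_c) + phi_s                               # (n, dim) coarsely weighted sentence feature F_1
        return f1, att1
```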
For a video V, the clip-level feature v_c is encoded by the following formula:

$$v_c = \frac{\theta_{avg}\big(\mathrm{I3D}(V)\big)}{\big\|\theta_{avg}\big(\mathrm{I3D}(V)\big)\big\|_2}$$

where $\|\cdot\|_2$ denotes the L2 norm, θ_avg is the mean-pooling operation, and I3D(·) is a two-stream I3D encoder; preferably, the output of the Mixed-4f layer of the I3D network is used as the I3D encoder output.
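A hedged sketch of this encoding step follows, assuming the I3D features have already been extracted as a tensor in (T, H, W, C) layout; the choice of pooling axes is an assumption for the example.

```python
import torch

def encode_clip_feature(i3d_feat, eps=1e-8):
    """Mean-pool I3D features over the spatial axes and L2-normalize the result.

    i3d_feat: tensor of shape (T, H, W, C), e.g. the Mixed-4f output (assumed layout).
    Returns a (T, C) clip-level feature v_c.
    """
    v = i3d_feat.mean(dim=(1, 2))                         # theta_avg: spatial mean pooling -> (T, C)
    v_c = v / (v.norm(p=2, dim=-1, keepdim=True) + eps)   # L2 normalization
    return v_c
```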
The frame-level feature attention unit processes the coarsely weighted sentence feature F_1 and the frame-level feature v_f to obtain the fine-tuned sentence feature F_2:

$$Att_2 = \sigma_{\mathrm{softmax}}\big(F_1\,\tilde{\psi}'(v_f)^{T}\big)$$

$$F_2 = Att_2\,\psi'(v_f) + F_1$$

where Att_2 is the attention map between the frame-level feature v_f and the coarsely weighted sentence feature F_1; F_2 is the fine-tuned sentence feature, each column of which represents the vector of one word; $\tilde{\psi}'(v_f)$ is the feature obtained by passing v_f through a linear layer $\tilde{\psi}'(\cdot)$; and ψ'(v_f) is the feature obtained by passing v_f through a linear layer ψ'(·).
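For illustration, a minimal sketch of the frame-level attention unit in the same style as the clip-level sketch above; the linear-layer sizes and tensor layouts are assumptions.

```python
import torch
import torch.nn as nn

class FrameLevelAttention(nn.Module):
    """Sketch of the frame-level feature attention unit (assumed shapes)."""
    def __init__(self, vis_dim=1024, dim=1024):
        super().__init__()
        self.psi_tilde = nn.Linear(vis_dim, dim)  # linear layer psi~'(.)
        self.psi = nn.Linear(vis_dim, dim)        # linear layer psi'(.)

    def forward(self, v_f, f1):
        # v_f: (M, vis_dim) frame-level visual features at M spatial positions
        # f1:  (n, dim) coarsely weighted sentence features from the clip-level unit
        att2 = torch.softmax(f1 @ self.psi_tilde(v_f).t(), dim=-1)  # (n, M) attention map Att_2
        f2 = att2 @ self.psi(v_f) + f1                              # (n, dim) fine-tuned sentence feature F_2
        return f2, att2
```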
The frame-level feature v_f is extracted with a ResNet-101 network; preferably, before extracting frame-level features, the ResNet-101 network is pre-trained on the COCO dataset and fine-tuned on the A2D dataset training split.

The frame-level feature is a linear weighted combination of the features warped from the reference frames j to the target frame i and the original feature v_i, as shown in the following formula:

$$v_f = (1-\beta)\,v_i + \frac{\beta}{2K}\sum_{j=i-K,\; j\neq i}^{i+K} v_{j\to i}$$

where v_i is the ResNet-101 encoded feature of the target frame i, β is a weight coefficient, i is the target frame, j is a reference frame, 2K is the number of reference frames (i.e. K frames before and K frames after the target frame are taken to compensate the target-frame features), and v_{j→i} is the feature warped from the reference frame j to the target frame i.

The warped feature v_{j→i} is:

$$v_{j\to i} = \mathcal{W}\big(v_j,\; OF_{j\to i}\big)$$

where v_j is the ResNet-101 encoded feature of the reference frame j, OF_{j→i} is the optical flow between the reference frame j and the target frame i, and $\mathcal{W}(\cdot,\cdot)$ is the bilinear warping function.
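Below is an illustrative sketch of the flow-guided warping and aggregation, assuming PyTorch tensors in (1, C, H, W) layout, a pre-computed optical flow in feature-map pixels, and equal averaging of the 2K warped features; these details are assumptions of the example.

```python
import torch
import torch.nn.functional as F

def warp_feature(v_j, flow):
    """Bilinearly warp a feature map v_j (1, C, H, W) toward the target frame using flow (1, 2, H, W)."""
    _, _, h, w = v_j.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    grid_x = (xs.float() + flow[:, 0]) / max(w - 1, 1) * 2 - 1   # normalize x to [-1, 1]
    grid_y = (ys.float() + flow[:, 1]) / max(h - 1, 1) * 2 - 1   # normalize y to [-1, 1]
    grid = torch.stack((grid_x, grid_y), dim=-1)                 # (1, H, W, 2) sampling grid
    return F.grid_sample(v_j, grid, mode="bilinear", align_corners=True)

def aggregate_frame_feature(v_i, warped_list, beta=0.1):
    """Linearly combine the target-frame feature with the 2K warped reference features."""
    # (1 - beta) weights the original feature; beta spreads over the warped features (assumed split)
    return (1 - beta) * v_i + beta * torch.stack(warped_list).mean(dim=0)
```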
The weighted word features h_t are passed through a fully connected layer to obtain the sentence query feature q:

$$m_t = FC(h_t)$$

$$\alpha_t = \sigma_{\mathrm{softmax}}(m_t)$$

$$q = \sum_{t=1}^{n} \alpha_t\, h_t$$

where h_t is the t-th column of F_2, representing the vector of the t-th word; FC(·) is a fully connected layer; m_t is the intermediate vector obtained by passing h_t through the fully connected layer; and α_t is the weighting coefficient of the t-th word.
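A minimal sketch of this attention pooling step, assuming the fully connected layer produces a single scalar score per word and that q is the weighted sum of the word features; the scalar-score design is an assumption of the example.

```python
import torch
import torch.nn as nn

class SentenceQueryPooling(nn.Module):
    """Sketch: pool the fine-tuned word features into a single sentence query q."""
    def __init__(self, dim=1024):
        super().__init__()
        self.fc = nn.Linear(dim, 1)   # FC(.) producing one score m_t per word

    def forward(self, f2):
        # f2: (n, dim) fine-tuned word features, one row per word (h_t)
        m = self.fc(f2).squeeze(-1)            # (n,) scores m_t
        alpha = torch.softmax(m, dim=0)        # (n,) word weights alpha_t
        q = (alpha.unsqueeze(-1) * f2).sum(0)  # (dim,) sentence query feature q
        return q, alpha
```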
The second aspect of the present invention also provides a computer readable storage medium storing a training program for video actor segmentation according to language description, which program, when executed by a processor, causes the processor to carry out the steps of the video actor segmentation method according to language description.
The method for segmenting the video actor according to the language description in the invention can be realized by means of software plus a necessary general hardware platform, wherein the software is stored in a computer readable storage medium (comprising a ROM/RAM, a magnetic disk and an optical disk) and comprises a plurality of instructions for enabling a terminal device (which can be a mobile phone, a computer, a server, a network device and the like) to execute the method in the invention.
The third aspect of the present invention also provides a computer device comprising a memory and a processor, the memory storing a training program for video actor segmentation according to a language description, the program, when executed by the processor, causing the processor to perform the steps of the video actor segmentation method according to a language description.
The invention has the advantages that:
(1) The method provided by the invention coarsely attends to the informative words of the language query using clip-level visual features, then refines the word attention with frame-level visual features and adjusts the word weights for the target frame, which markedly improves the accuracy of intra-frame matching and segmentation.
(2) The method provided by the invention can distinguish and segment the positive example from rich video content; by mining hard negative examples for contrastive learning, it learns to identify the target actor within a video and to distinguish target actors across different videos.
(3) Visual feature encoders are used to extract the clip-level features and frame-level features from which the sentence query feature q is obtained, as well as the visual features of the positive and negative examples; distinguishing the positive example from the negative examples through contrastive learning enhances the discriminative ability of the cascaded cross-modal attention module.
Drawings
FIG. 1 is a network architecture diagram illustrating a video actor segmentation method according to language description according to the present invention;
FIG. 2 is a diagram showing the result of applying the video actor segmentation method according to language description to the A2D dataset in Embodiment 1 of the present invention.
Detailed Description
The present invention will be described in further detail below with reference to the accompanying drawings and embodiments. The features and advantages of the present invention will become more apparent from this description. Although various aspects of the embodiments are shown in the drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The invention provides a video actor segmentation method according to language description, which is carried out by utilizing a cascade cross-modal attention module to generate sentence query characteristics with discriminative power and improve the accuracy of matching and segmentation.
The cascaded cross-modal attention module includes a clip-level feature attention unit and a frame-level feature attention unit.
In the present invention, because the video clip feature is a global feature containing the information of the whole clip, the clip-level feature attention unit first coarsely attends to the informative words of the language query, such as words describing motion and temporal change. The frame-level feature attention unit then refines the word attention and adjusts the word weights for the specific frame, for example emphasizing attributes that describe appearance, which markedly improves the accuracy of intra-frame matching and segmentation. In this way the method learns a discriminative language representation conditioned on the given video using the clip-level and frame-level feature attention units.
In the cascaded cross-modal attention module, the clip-level feature attention unit takes the sentence embedding s and the clip-level feature v_c of the target frame i as inputs and produces the coarsely weighted sentence feature F_1.
In the present invention, the clip-level feature v_c is extracted with a two-stream I3D encoder; preferably, the I3D encoder is pre-trained on ImageNet and Kinetics before extracting clip-level features.
The two-stream I3D encoder is described in "Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the Kinetics dataset [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 6299-6308."
ImageNet is described in "Russakovsky O, Deng J, Su H, et al. ImageNet large scale visual recognition challenge [J]. International Journal of Computer Vision, 2015, 115(3): 211-252."
Kinetics is described in "Kay W, Carreira J, Simonyan K, et al. The Kinetics human action video dataset [J]. arXiv preprint arXiv:1705.06950, 2017."
For a video V, the clip-level feature v_c is encoded by the following formula:

$$v_c = \frac{\theta_{avg}\big(\mathrm{I3D}(V)\big)}{\big\|\theta_{avg}\big(\mathrm{I3D}(V)\big)\big\|_2}$$

where $\|\cdot\|_2$ is the L2 norm (the square root of the sum of squared elements of a vector) and θ_avg is the mean-pooling operation. I3D(·) is the two-stream I3D encoder; preferably, the output of the Mixed-4f layer of the I3D network is used as the I3D encoder output.

The Mixed-4f layer of the I3D network is described in "Carreira J, Zisserman A. Quo vadis, action recognition? A new model and the Kinetics dataset [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017: 6299-6308."
In the present invention, a visual feature v_j of the target frame is first fed into two convolutional layers $\tilde{\psi}(\cdot)$ and ψ(·) to generate two new visual features $\tilde{\psi}(v_j)$ and ψ(v_j). Meanwhile, the word embeddings e_t are combined to form the sentence embedding s, where e_t is the embedding feature of the t-th word, t ∈ [1, n], t is an integer and n is the total number of words; s is then fed into a convolutional layer φ(·) to generate the sentence feature φ(s). The word embedding e_t is described in "Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality [C]// Advances in Neural Information Processing Systems, 2013: 3111-3119."

The feature $\tilde{\psi}(v_j)$ is multiplied with φ(s) to generate a word-by-spatial-position attention map:

$$Att = \sigma_{\mathrm{softmax}}\big(\phi(s)\,\tilde{\psi}(v_j)^{T}\big)$$

where T is the matrix transpose, σ_softmax is the softmax function, and Att is an attention map that measures, for each word, the contribution of each spatial position.

Att is then multiplied with ψ(v_j) and added to φ(s) to obtain the visually weighted sentence features:

$$F = Att\,\psi(v_j) + \phi(s)$$

where F is the visually weighted sentence feature.
In the present invention, the clip-level feature attention unit uses the clip-level feature v_c to coarsely weight the language features:

$$Att_1 = \sigma_{\mathrm{softmax}}\big(\phi(s)\,\tilde{\psi}(v_c)^{T}\big)$$

$$F_1 = Att_1\,\psi(v_c) + \phi(s)$$

where T is the matrix transpose, σ_softmax is the softmax function, Att_1 is the attention map between the clip-level feature v_c and the sentence embedding s, and F_1 is the coarsely weighted sentence feature.
In the present invention, the frame-level feature attention unit processes the coarsely weighted sentence feature F_1 and the frame-level feature v_f to obtain the fine-tuned sentence feature F_2.

The frame-level feature v_f is extracted with a ResNet-101 network, which is preferably pre-trained on the COCO dataset and fine-tuned on the A2D dataset training split before frame-level features are extracted.

The ResNet-101 network is described in "He K, Zhang X, Ren S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778"; the COCO dataset is described in "Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: Common objects in context [C]// European Conference on Computer Vision. Springer, Cham, 2014: 740-755"; the A2D dataset is described in "Xu C, Hsieh S H, Xiong C, et al. Can humans fly? Action understanding with multiple classes of actors [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015: 2264-2273."
In the present invention, to enhance the feature map, the optical flow between the target frame i and each of the 2K nearby reference frames j is computed (the K frames before and the K frames after the target frame i serve as reference frames), and the ResNet-101 encoded features of the 2K nearby frames are warped onto the target frame based on the optical flow, according to the following formula:

$$v_{j\to i} = \mathcal{W}\big(v_j,\; OF_{j\to i}\big)$$

where v_j is the ResNet-101 encoded feature of the reference frame j, OF_{j→i} is the optical flow between the reference frame j and the target frame i, $\mathcal{W}(\cdot,\cdot)$ is the bilinear warping function, and v_{j→i} is the feature warped from the reference frame j to the target frame i. This operation is described in "Zhu X, Wang Y, Dai J, et al. Flow-guided feature aggregation for video object detection [C]// Proceedings of the IEEE International Conference on Computer Vision, 2017: 408-417."
The frame-level feature is a linear weighted combination of the features warped from frame j to frame i and the original feature v_i, as shown in the following formula:

$$v_f = (1-\beta)\,v_i + \frac{\beta}{2K}\sum_{j=i-K,\; j\neq i}^{i+K} v_{j\to i}$$

where v_i is the ResNet-101 encoded feature of the target frame i, β is a weight coefficient, i is the target frame, j is a reference frame, and 2K is the number of reference frames, i.e. K frames before and K frames after the target frame are taken to compensate the target-frame features. In this way the frame-level feature v_f is enhanced by nearby frame-level features.
In the present invention, the enhanced frame-level feature v_f is fed directly into a Region Proposal Network (RPN) to generate proposals. Because the target frame has been compensated by its neighboring frames, moving objects can be detected more easily. The Region Proposal Network (RPN) is described in "He K, Gkioxari G, Dollár P, et al. Mask R-CNN [C]// Proceedings of the IEEE International Conference on Computer Vision, 2017: 2961-2969"; preferably, the Region Proposal Network (RPN) is pre-trained on the COCO dataset and fine-tuned on the A2D dataset training split.
The frame-level feature attention unit processes the coarsely weighted sentence feature F_1 and the frame-level feature v_f to obtain the fine-tuned sentence feature F_2:

$$Att_2 = \sigma_{\mathrm{softmax}}\big(F_1\,\tilde{\psi}'(v_f)^{T}\big)$$

$$F_2 = Att_2\,\psi'(v_f) + F_1$$

where Att_2 is the attention map between the frame-level feature v_f and the coarsely weighted sentence feature F_1; F_2 is the fine-tuned sentence feature; $\tilde{\psi}'(v_f)$ is the feature obtained by passing v_f through a linear layer $\tilde{\psi}'(\cdot)$; and ψ'(v_f) is the feature obtained by passing v_f through a linear layer ψ'(·).
The weighted word features h_t are passed through a fully connected layer to obtain the sentence query feature q:

$$m_t = FC(h_t)$$

$$\alpha_t = \sigma_{\mathrm{softmax}}(m_t)$$

$$q = \sum_{t=1}^{n} \alpha_t\, h_t$$

where h_t is the t-th column of F_2, representing the vector of the t-th word; FC(·) is a fully connected layer, described in "Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks [C]// Advances in Neural Information Processing Systems, 2012"; m_t is the intermediate vector obtained by passing h_t through the fully connected layer; α_t is the weighting coefficient of the t-th word; and e_t, the embedding feature of the t-th word used to form s, is described in "Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality [C]// Advances in Neural Information Processing Systems, 2013: 3111-3119."
In a preferred embodiment of the present invention, during training the video actor segmentation method according to language description further includes contrastive learning with the sentence query feature q to distinguish the positive example from hard negative examples. The positive example is the region proposal on the target frame whose IoU (intersection over union) with the ground truth is greater than 0.5 and is the largest among all proposals. A hard negative example is a region proposal on the target frame whose IoU with the ground truth is less than 0.5. IoU is described in "He K, Gkioxari G, Dollár P, et al. Mask R-CNN [C]// Proceedings of the IEEE International Conference on Computer Vision, 2017: 2961-2969."
Notably, matching the sentence query feature q to the desired region proposal cannot be learned sufficiently using only the hard negatives within the same target frame, so a hard-negative mining strategy is designed in the present invention, divided into two parts. First, hard negatives are mined from within the video containing the target frame: from the other key frames of the video, region proposals that differ from the ground truth of the current frame (i.e. proposals with IoU less than 0.5) are taken as hard negatives. Second, hard negatives are mined from different videos: when the hard negatives are still insufficient, videos containing the same actor-action label are searched, and region proposals that differ from that actor-action label are taken from their key frames as hard negatives. During testing, the region proposal on the target frame with the highest similarity score to the obtained sentence query feature q is selected as the matching result.
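A schematic sketch of this two-part mining strategy is given below; the data structures (dictionaries of proposals with boxes and labels) and the IoU helper are assumptions made for the example, not part of the original disclosure.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def mine_hard_negatives(same_video_frames, other_videos, label, num_needed):
    """Collect hard negatives: first from other key frames of the same video,
    then, if still short, from videos sharing the same actor-action label."""
    negatives = []
    # Part 1: proposals from other key frames of the same video with IoU < 0.5 to that frame's ground truth
    for frame in same_video_frames:
        for prop in frame["proposals"]:
            if iou(prop["box"], frame["gt_box"]) < 0.5:
                negatives.append(prop)
    # Part 2: proposals from other videos with the same actor-action label,
    # excluding proposals that carry that label
    if len(negatives) < num_needed:
        for video in other_videos:
            if video["label"] != label:
                continue
            for frame in video["key_frames"]:
                for prop in frame["proposals"]:
                    if prop.get("label") != label:
                        negatives.append(prop)
    return negatives[:num_needed]
```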
In the present invention, the actor and its action that match the sentence query feature q must be separated from a video containing rich information; therefore the method performs contrastive learning with the sentence query feature q and mines hard negative examples to strengthen the matching between q and the region proposals.
In the present invention, the sentence embedding s is processed by the clip-level feature attention unit and the frame-level feature attention unit to obtain the fine-tuned sentence feature F_2, and the weighted word features finally give the sentence query feature q. Let r^+ be the visual feature of the positive example and r_l^- the visual feature of a negative example, l = 1, 2, …, L, where L is the number of negative examples and is an integer greater than or equal to 1, preferably an integer from 5 to 50, more preferably from 15 to 35, such as 25. The loss function for the sentence query feature q is:

$$Loss = -\log \frac{\exp\!\left(q^{T} r^{+}\right)}{\exp\!\left(q^{T} r^{+}\right) + \sum_{l=1}^{L} \exp\!\left(q^{T} r_{l}^{-}\right)}$$

where T is the matrix transpose.

The loss function uses the (L+1)-tuple to optimize the identification of the positive example among the L negative examples and reduces the possibility of the network converging to a poor local optimum.
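A sketch of this (L+1)-way contrastive loss is shown below, written for a single query; batching and any temperature scaling are left out, and the dot-product similarity is taken directly from the formula above.

```python
import torch
import torch.nn.functional as F

def contrastive_loss(q, r_pos, r_negs):
    """(L+1)-way softmax loss: identify the positive r_pos among L negatives r_negs.

    q:      (D,) sentence query feature
    r_pos:  (D,) visual feature of the positive example
    r_negs: (L, D) visual features of the negative examples
    """
    logits = torch.cat([(q * r_pos).sum().view(1),   # q^T r^+
                        r_negs @ q])                 # q^T r_l^- for each negative
    # cross-entropy with the positive at index 0 equals -log softmax_0(logits)
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))
```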
The visual feature r^+ of the positive example and the visual features r_l^- of the negative examples are extracted with an RPN, i.e. a Region Proposal Network, which generates region proposals and whose backbone is the ResNet-101 network. Preferably, the ResNet-101 network and the Region Proposal Network (RPN) are pre-trained on the COCO dataset and fine-tuned on the A2D dataset training split before feature extraction. Each region proposal has a target feature o and a location feature l.

The target feature o is the feature output by the RoIAlign module of the RPN network and preferably has the same dimension as the sentence query feature q, e.g. a 1024-dimensional vector. The RoIAlign module is described in "He K, Gkioxari G, Dollár P, et al. Mask R-CNN [C]// Proceedings of the IEEE International Conference on Computer Vision, 2017: 2961-2969."
The location feature l is:

$$l = \left[\frac{x_{tl}}{W},\; \frac{y_{tl}}{H},\; \frac{x_{br}}{W},\; \frac{y_{br}}{H},\; \frac{w\,h}{W\,H}\right]$$

where (x_tl, y_tl) and (x_br, y_br) are the coordinates of the top-left and bottom-right corners of the region proposal, w and h are the width and height of the region proposal, and W and H are the width and height of the target frame.
The visual feature r^+ of the positive example or r_l^- of a negative example is obtained by concatenating the location feature l with the target feature o and passing the concatenated feature through a fully connected layer, giving a visual feature with the same dimension as the text representation:

$$r = \sigma_{\tanh}\big(W([o;\, l])\big)$$

where r is the visual feature r^+ of the positive example or r_l^- of a negative example, a C-dimensional object-level feature; [;] denotes the concatenation operation; W is the parameter matrix to be learned; and σ_tanh is the tanh activation function.

The tanh activation function is described in "LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition [J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324."
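A minimal sketch of assembling r from a proposal's RoIAlign target feature o and its normalized location feature l follows; the 1024-dimensional sizes echo the example given later, while the use of an nn.Linear layer (which adds a bias) for the matrix W is an assumption.

```python
import torch
import torch.nn as nn

class ProposalEncoder(nn.Module):
    """Sketch: build r = tanh(W([o; l])) from a proposal's target and location features."""
    def __init__(self, obj_dim=1024, loc_dim=5, out_dim=1024):
        super().__init__()
        self.fc = nn.Linear(obj_dim + loc_dim, out_dim)  # plays the role of the parameter matrix W

    @staticmethod
    def location_feature(box, frame_w, frame_h):
        # box = (x_tl, y_tl, x_br, y_br) in pixels of the target frame
        x_tl, y_tl, x_br, y_br = box
        w, h = x_br - x_tl, y_br - y_tl
        return torch.tensor([x_tl / frame_w, y_tl / frame_h,
                             x_br / frame_w, y_br / frame_h,
                             (w * h) / (frame_w * frame_h)])

    def forward(self, o, l):
        # o: (obj_dim,) RoIAlign feature; l: (loc_dim,) location feature
        return torch.tanh(self.fc(torch.cat([o, l], dim=-1)))  # r, same dimension as q
```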
In the present invention, after the trained network is obtained by contrastive learning between the sentence query feature q and the positive/negative examples, the region proposal with the highest similarity to the text feature is selected, and the visual feature r^+ of the selected region proposal is input into the segmentation branch to generate the result mask. The segmentation branch in the present invention is the Mask R-CNN segmentation branch, described in "He K, Gkioxari G, Dollár P, et al. Mask R-CNN [C]// Proceedings of the IEEE International Conference on Computer Vision, 2017: 2961-2969." FIG. 1 is a schematic diagram of the network structure of the method for segmenting a video actor and its action according to language description in the present invention.
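At test time the matching reduces to a similarity ranking over the proposals of the target frame; a hedged sketch is given below, using the dot product q^T r as the similarity score, which is an assumption consistent with the loss above.

```python
import torch

def select_best_proposal(q, proposal_feats):
    """Pick the proposal whose visual feature r has the highest similarity to q.

    q:              (D,) sentence query feature
    proposal_feats: (P, D) features r of the P region proposals on the target frame
    Returns the index of the best proposal, which is then passed to the mask branch.
    """
    scores = proposal_feats @ q          # (P,) similarity scores q^T r
    return int(torch.argmax(scores))     # index of the matched region proposal
```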
The second aspect of the present invention also provides a computer readable storage medium storing a training program for video actor segmentation according to language description, which when executed by a processor, causes the processor to perform the steps of the video actor segmentation method according to language description.
The method for segmenting the video actor according to the language description in the invention can be realized by means of software plus a necessary general hardware platform, wherein the software is stored in a computer readable storage medium (comprising a ROM/RAM, a magnetic disk and an optical disk) and comprises a plurality of instructions for enabling a terminal device (which can be a mobile phone, a computer, a server, a network device and the like) to execute the method in the invention.
The third aspect of the present invention also provides a computer device comprising a memory and a processor, the memory storing a training program for video actor segmentation according to a language description, the program, when executed by the processor, causing the processor to perform the steps of the video actor segmentation method according to a language description.
The video actor segmentation method according to language description provided by the invention first uses clip-level visual features to coarsely attend to the informative words of the language query and then uses frame-level visual features to refine the word attention, markedly improving the accuracy of intra-frame matching and segmentation. In addition, the method can identify the positive example in a video with rich content; through the hard-negative mining strategy and the introduction of contrastive learning, it learns not only to identify the target actor within a video but also to distinguish target actors across different videos, so the cross-modal retrieval has stronger discriminative ability. Furthermore, a dedicated segmentation network is used to segment the matched region, yielding a mask with better completeness and better edges.
Examples
The present invention is further described below by way of specific examples, which are merely exemplary and do not limit the scope of the present invention in any way.
Example 1
(1) The I3D network was trained on the ImageNet and Kinetics datasets, and the output of the Mixed-4f layer of the I3D network was used as the I3D encoder.
The ImageNet dataset is described in "Russakovsky O, Deng J, Su H, Krause J, Satheesh S, Ma S, Huang Z, Karpathy A, Khosla A, Bernstein M, et al. ImageNet large scale visual recognition challenge [J]. International Journal of Computer Vision, 2015, 115(3): 211-252."
The Kinetics dataset is described in "Kay W, Carreira J, Simonyan K, Zhang B, Hillier C, Vijayanarasimhan S, Viola F, Green T, Back T, Natsev P, et al. The Kinetics human action video dataset [J]. arXiv preprint arXiv:1705.06950, 2017."
Given a video V (as shown in the video frames in FIG. 1 or FIG. 2), the clip-level feature v_c is encoded by the formula:

$$v_c = \frac{\theta_{avg}\big(\mathrm{I3D}(V)\big)}{\big\|\theta_{avg}\big(\mathrm{I3D}(V)\big)\big\|_2}$$

where $\|\cdot\|_2$ denotes the L2 norm and θ_avg denotes the mean-pooling operation; the resulting clip-level feature v_c is a 32 × 832 matrix.
(2) The ResNet-101 network was pre-trained on the COCO dataset and then fine-tuned on the A2D dataset training split. The ResNet-101 network is described in "He K, Zhang X, Ren S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778"; the COCO dataset is described in "Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: Common objects in context [C]// European Conference on Computer Vision. Springer, Cham, 2014: 740-755"; the A2D dataset is described in "Xu C, Hsieh S H, Xiong C, et al. Can humans fly? Action understanding with multiple classes of actors [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015: 2264-2273."
The optical flow between the target frame i and each of the 14 nearby reference frames j (the 7 frames before and the 7 frames after the target frame i are the reference frames j) was computed, and the ResNet-101 encoded features of the 14 reference frames j were warped onto the target frame based on the optical flow, according to the formula:

$$v_{j\to i} = \mathcal{W}\big(v_j,\; OF_{j\to i}\big)$$

where v_j is the ResNet-101 encoded feature of the reference frame j, OF_{j→i} is the optical flow between the reference frame j and the target frame i, $\mathcal{W}(\cdot,\cdot)$ is the bilinear warping function, and v_{j→i} is the feature warped from frame j to frame i, as described in "Zhu X, Wang Y, Dai J, et al. Flow-guided feature aggregation for video object detection [C]// Proceedings of the IEEE International Conference on Computer Vision, 2017: 408-417."
The frame-level feature v_f is the linear weighted combination of the features warped from frame j to frame i and the original feature v_i, where v_i is the ResNet-101 encoded feature of the target frame i:

$$v_f = (1-\beta)\,v_i + \frac{\beta}{2K}\sum_{j=i-K,\; j\neq i}^{i+K} v_{j\to i}$$

where β is a weight coefficient set to 0.1 and K is set to 7. The reference frames j are the 7 frames before and the 7 frames after the target frame i, and the video is sampled at 24 fps. By linearly weighting the features warped from frame j to frame i with the original feature v_i, the frame-level feature v_f is enhanced by nearby frame-level features. The enhanced feature is then fed directly into a Region Proposal Network (RPN) to generate proposals, as described in "Girshick R. Fast R-CNN [C]// Proceedings of the IEEE International Conference on Computer Vision, 2015: 1440-1448"; the Region Proposal Network (RPN) was pre-trained on the COCO dataset and fine-tuned on the A2D dataset training split. Because the target frame i has been compensated by its nearby frames, moving objects can be detected more easily.

Here v_f is a matrix of dimensions 16 × 16 × 1024.
(3) The clip-level feature attention unit uses the clip-level feature v_c to coarsely weight the language features. The clip-level feature v_c of the target frame i is fed into the convolutional layers $\tilde{\psi}(\cdot)$ and ψ(·) to generate two new visual features $\tilde{\psi}(v_c)$ and ψ(v_c).

The convolutional layer $\tilde{\psi}(\cdot)$ is a 1 × 1 convolution with 1024 output dimensions, and the convolutional layer ψ(·) is a 1 × 1 convolution with 1024 output dimensions.

The word embeddings e_t are combined to form the sentence embedding s, which is then fed into a 1024-dimensional fully connected layer to generate the sentence feature φ(s), where t ∈ [1, n], t is an integer and n is the total number of words. e_t, the embedding feature of the t-th word, is a 300-dimensional vector obtained from the word2vec model trained on the Google News dataset, as described in "Mikolov T, Sutskever I, Chen K, et al. Distributed representations of words and phrases and their compositionality [C]// Advances in Neural Information Processing Systems, 2013: 3111-3119"; the Google News dataset is described in "Das A S, Datar M, Garg A, Rajaram S. Google news personalization: scalable online collaborative filtering [C]// Proceedings of the 16th International Conference on World Wide Web. ACM, 2007: 271-280". φ(·) is a fully connected layer that maps the 300-dimensional information to 1024 dimensions so that the sentence feature matches the dimensions of the visual features $\tilde{\psi}(v_c)$ and ψ(v_c).
The clip-level feature attention unit uses the clip-level feature v_c to coarsely weight the language features:

$$Att_1 = \sigma_{\mathrm{softmax}}\big(\phi(s)\,\tilde{\psi}(v_c)^{T}\big)$$

$$F_1 = Att_1\,\psi(v_c) + \phi(s)$$

where σ_softmax is the softmax activation function, Att_1 is the attention map between the clip-level feature v_c and the sentence embedding s, with dimensions 20 × 1024, in which each element (a, b) represents the effect of the visual feature at the b-th position on the a-th word, and F_1 is the coarsely weighted sentence feature, with dimensions 20 × 1024, representing the word features after weighting by the global features. The softmax activation function is described in "LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition [J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324."
The coarsely weighted sentence feature F_1 and the frame-level feature v_f are input into the frame-level feature attention unit to obtain the fine-tuned sentence feature F_2:

$$Att_2 = \sigma_{\mathrm{softmax}}\big(F_1\,\tilde{\psi}'(v_f)^{T}\big)$$

$$F_2 = Att_2\,\psi'(v_f) + F_1$$

where Att_2 is the attention map between the frame-level feature v_f and the coarsely weighted sentence feature F_1, and F_2 is the fine-tuned sentence feature. $\tilde{\psi}'(\cdot)$ is a 1 × 1 convolution with 1024 output dimensions, and ψ'(·) is a 1 × 1 convolution with 1024 output dimensions.
(4) The weighted word features h_t of the fine-tuned sentence feature F_2 give the sentence query feature q:

$$m_t = FC(h_t)$$

$$\alpha_t = \sigma_{\mathrm{softmax}}(m_t)$$

$$q = \sum_{t=1}^{n} \alpha_t\, h_t$$

where h_t is the t-th column of F_2, representing the vector of the t-th word; FC(·) is a fully connected layer; m_t is the intermediate vector obtained by passing h_t through the fully connected layer; and α_t is the weighting coefficient of the t-th word.
(5) For the video V, the sentence embedding is processed through steps (1)-(4) to obtain the sentence query feature q (as in step (4)); the visual feature of the single positive example is r^+ and the visual features of the 25 negative examples are r_l^-, where l = 1, 2, …, 25.

The loss function is:

$$Loss = -\log \frac{\exp\!\left(q^{T} r^{+}\right)}{\exp\!\left(q^{T} r^{+}\right) + \sum_{l=1}^{25} \exp\!\left(q^{T} r_{l}^{-}\right)}$$

where T is the matrix transpose.
(6) The visual feature r^+ of the positive example and the visual features r_l^- of the negative examples were extracted with the ResNet-101 network, described in "He K, Zhang X, Ren S, et al. Deep residual learning for image recognition [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016: 770-778". Before feature extraction, the network was pre-trained on the COCO dataset ("Lin T Y, Maire M, Belongie S, et al. Microsoft COCO: Common objects in context [C]// European Conference on Computer Vision, 2014: 740-755") and fine-tuned on the A2D dataset training split (the A2D dataset is described in "Xu C, Hsieh S H, Xiong C, et al. Can humans fly? Action understanding with multiple classes of actors [C]// Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015: 2264-2273") to generate region proposals. The feature o of a region proposal is the target feature of the region output by the fine-tuned ResNet-101, and the feature l is the location feature.

The target feature o is the feature output by the RoIAlign module of the RPN network, specifically a 1024-dimensional vector (the dimension of the target feature o matches that of the sentence query feature q).
The location feature l is:

$$l = \left[\frac{x_{tl}}{W},\; \frac{y_{tl}}{H},\; \frac{x_{br}}{W},\; \frac{y_{br}}{H},\; \frac{w\,h}{W\,H}\right]$$

where (x_tl, y_tl) and (x_br, y_br) are the coordinates of the top-left and bottom-right corners of the region proposal, w and h are the width and height of the region proposal, and W and H are the width and height of the target frame.
The visual feature r^+ of the positive example or r_l^- of a negative example is obtained by concatenating the location feature l with the target feature o and passing the concatenated feature through a fully connected layer, giving a visual feature with the same dimension as the text representation:

$$r = \sigma_{\tanh}\big(W([o;\, l])\big)$$

where r is the visual feature r^+ of the positive example or r_l^- of a negative example, a C-dimensional object-level feature; [;] denotes the concatenation operation; and σ_tanh is the tanh activation function.

The tanh activation function is described in "LeCun Y, Bottou L, Bengio Y, et al. Gradient-based learning applied to document recognition [J]. Proceedings of the IEEE, 1998, 86(11): 2278-2324."
The visual feature r^+ of the positive example and the visual features r_l^- of the negative examples are contrasted against the sentence query feature q during learning; the proposal that matches q is the positive example to be segmented. The region proposal with the highest similarity to the text feature is obtained, and the visual feature r^+ of the selected region proposal is input into the Mask R-CNN segmentation branch to generate the result mask, giving the final segmentation result.
FIG. 2 shows the results of the method of Embodiment 1 on the A2D dataset: the text in the first row is the input sentence, the picture sequence in the second row is the input video, the third row shows the segmentation results obtained in this embodiment, and the fourth row shows the ground-truth segmentation maps provided by the dataset. As can be seen from FIG. 2, the video actor segmentation method based on language description provided by the present invention can completely segment the target actor from the video frames, and the segmentation results are very close to the ground-truth segmentation maps provided by the dataset.
The invention has been described in detail with reference to specific embodiments and/or illustrative examples and the accompanying drawings, which, however, should not be construed as limiting the invention. Those skilled in the art will appreciate that various equivalent substitutions, modifications or improvements may be made to the technical solution of the present invention and its embodiments without departing from the spirit and scope of the present invention, which fall within the scope of the present invention. The scope of the invention is defined by the appended claims.

Claims (10)

1. A method of video actor segmentation in accordance with a language description, the method being performed using a cascaded cross-modal attention module comprising a clip-level feature attention unit and a frame-level feature attention unit;

the clip-level feature attention unit uses the clip-level feature v_c to coarsely weight the language features:

$$Att_1 = \sigma_{\mathrm{softmax}}\big(\phi(s)\,\tilde{\psi}(v_c)^{T}\big)$$

$$F_1 = Att_1\,\psi(v_c) + \phi(s)$$

wherein T is the matrix transpose; σ_softmax is the softmax activation function; Att_1 is the attention map between the clip-level feature v_c and the sentence embedding s; F_1 is the coarsely weighted sentence feature; the clip-level feature v_c is passed through convolutional layers $\tilde{\psi}(\cdot)$ and ψ(·) to obtain $\tilde{\psi}(v_c)$ and ψ(v_c); the word embeddings e_t are combined to form the sentence embedding s, which is then fed into a convolutional layer φ(·) to generate the sentence feature φ(s); e_t is the embedding feature of the t-th word;

the frame-level feature attention unit processes the coarsely weighted sentence feature F_1 and the frame-level feature v_f to obtain the fine-tuned sentence feature F_2:

$$Att_2 = \sigma_{\mathrm{softmax}}\big(F_1\,\tilde{\psi}'(v_f)^{T}\big)$$

$$F_2 = Att_2\,\psi'(v_f) + F_1$$

wherein Att_2 is the attention map between the frame-level feature v_f and the coarsely weighted sentence feature F_1; F_2 is the fine-tuned sentence feature; $\tilde{\psi}'(v_f)$ is the feature obtained by passing v_f through a linear layer $\tilde{\psi}'(\cdot)$; and ψ'(v_f) is the feature obtained by passing v_f through a linear layer ψ'(·).
2. The method of claim 1, wherein the clip-level feature attention unit takes the sentence embedding s and the clip-level feature v_c of the target frame i as inputs.
3. The method of claim 1, wherein for a video V, the clip-level feature v_c is encoded by the following formula:

$$v_c = \frac{\theta_{avg}\big(\mathrm{I3D}(V)\big)}{\big\|\theta_{avg}\big(\mathrm{I3D}(V)\big)\big\|_2}$$

wherein $\|\cdot\|_2$ denotes the L2 norm, θ_avg is the mean-pooling operation, and I3D(·) is a two-stream I3D encoder.
4. The method of claim 1, wherein the frame-level feature v_f is extracted using a ResNet-101 network.
5. The method of claim 4, wherein, before extracting frame-level features, the ResNet-101 network is pre-trained on the COCO dataset and fine-tuned on the A2D dataset training split;

the frame-level feature is a linear weighted combination of the features warped from frame j to frame i and the original feature v_i, as shown in the following formula:

$$v_f = (1-\beta)\,v_i + \frac{\beta}{2K}\sum_{j=i-K,\; j\neq i}^{i+K} v_{j\to i}$$

wherein v_i is the ResNet-101 encoded feature of the target frame i, β is a weight coefficient, i is the target frame, j is a reference frame, 2K is the number of reference frames, with K frames taken before and K frames taken after the target frame to compensate the target-frame features, and v_{j→i} is the feature warped from the reference frame j to the target frame i;

the warped feature v_{j→i} is:

$$v_{j\to i} = \mathcal{W}\big(v_j,\; OF_{j\to i}\big)$$

wherein v_j is the ResNet-101 encoded feature of the reference frame j, OF_{j→i} is the optical flow between the reference frame j and the target frame i, and $\mathcal{W}(\cdot,\cdot)$ is the bilinear warping function.
6. The method of claim 1, wherein the weighted word features h_t are passed through a fully connected layer to obtain the sentence query feature q:

$$m_t = FC(h_t)$$

$$\alpha_t = \sigma_{\mathrm{softmax}}(m_t)$$

$$q = \sum_{t=1}^{n} \alpha_t\, h_t$$

wherein h_t is the t-th column of F_2, representing the vector of the t-th word; FC(·) is a fully connected layer; m_t is the vector obtained after passing h_t through the fully connected layer; and α_t is the weighting coefficient of the t-th word.
7. The method according to one of claims 1 to 6, further comprising performing contrastive learning using the sentence query feature q to distinguish a positive example from hard negative examples, wherein the positive example is the region proposal on the target frame whose IoU with the ground truth is greater than 0.5 and is the largest, and a hard negative example is a region proposal on the target frame whose IoU with the ground truth is less than 0.5.
8. The method according to one of claims 1 to 6, wherein in the method the loss function of the sentence query feature q is:

$$Loss = -\log \frac{\exp\!\left(q^{T} r^{+}\right)}{\exp\!\left(q^{T} r^{+}\right) + \sum_{l=1}^{L} \exp\!\left(q^{T} r_{l}^{-}\right)}$$

wherein T is the matrix transpose, r^+ is the visual feature of the positive example, r_l^- is the visual feature of a negative example, l = 1, 2, …, L, and L is the number of negative examples;

the visual feature r^+ of the positive example or r_l^- of a negative example is:

$$r = \sigma_{\tanh}\big(W([o;\, l])\big)$$

wherein r is the visual feature r^+ of the positive example or r_l^- of a negative example and is a C-dimensional object-level feature, [;] denotes the concatenation operation, W is the parameter matrix to be learned, and σ_tanh is the tanh activation function;

the visual feature r^+ of the positive example or r_l^- of a negative example is extracted using an RPN network, which generates a region proposal having a target feature o and a corresponding location feature l;

the target feature o is a feature generated by the RPN network, and the location feature l is:

$$l = \left[\frac{x_{tl}}{W},\; \frac{y_{tl}}{H},\; \frac{x_{br}}{W},\; \frac{y_{br}}{H},\; \frac{w\,h}{W\,H}\right]$$

wherein (x_tl, y_tl) and (x_br, y_br) are the coordinates of the top-left and bottom-right corners of the region proposal, w and h are the width and height of the region proposal, and W and H are the width and height of the target frame.
9. A computer-readable storage medium, in which a training program for video-actor segmentation in accordance with a language description is stored, which program, when being executed by a processor, causes the processor to carry out the steps of the video-actor segmentation in accordance with a language description method as claimed in one of claims 1 to 8.
10. Computer device, characterized in that it comprises a memory and a processor, said memory storing a training program for video actor segmentation according to a language description, said program, when executed by the processor, causing the processor to carry out the steps of the video actor segmentation method according to a language description according to one of claims 1 to 8.
CN202111081527.8A 2021-09-15 2021-09-15 Video actor segmentation method according to language description Active CN113869154B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111081527.8A CN113869154B (en) 2021-09-15 2021-09-15 Video actor segmentation method according to language description

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111081527.8A CN113869154B (en) 2021-09-15 2021-09-15 Video actor segmentation method according to language description

Publications (2)

Publication Number Publication Date
CN113869154A CN113869154A (en) 2021-12-31
CN113869154B true CN113869154B (en) 2022-09-02

Family

ID=78996032

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111081527.8A Active CN113869154B (en) 2021-09-15 2021-09-15 Video actor segmentation method according to language description

Country Status (1)

Country Link
CN (1) CN113869154B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114494297B (en) * 2022-01-28 2022-12-06 杭州电子科技大学 Adaptive video target segmentation method for processing multiple priori knowledge


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10395118B2 (en) * 2015-10-29 2019-08-27 Baidu Usa Llc Systems and methods for video paragraph captioning using hierarchical recurrent neural networks

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101411195B (en) * 2003-09-07 2012-07-04 Microsoft Corporation Coding and decoding for interlaced video
CN108829677A (en) * 2018-06-05 2018-11-16 Dalian University of Technology Automatic image caption generation method based on multi-modal attention
US10839223B1 (en) * 2019-11-14 2020-11-17 Fudan University System and method for localization of activities in videos
CN112466326A (en) * 2020-12-14 2021-03-09 Jiangsu Normal University Speech emotion feature extraction method based on Transformer model encoder
CN112650886A (en) * 2020-12-28 2021-04-13 University of Electronic Science and Technology of China Cross-modal video time retrieval method based on cross-modal dynamic convolution network

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
[Latest VideoQA paper reading] The first video question answering survey: Video Question Answering: a Survey of Models and Datasets; Leokadia Rothschild; https://blog.csdn.net/m0_46413065/article/details/113828037; 2021-02-16; pp. 1-36 *
CVPR 2020 | Fine-grained text-video cross-modal retrieval; AI科技评论 (AI Technology Review); https://zhuanlan.zhihu.com/p/115890991; 2020-03-24; pp. 1-6 *
Research progress on video saliency detection; Cong Runmin et al.; Journal of Software (软件学报); 2018-02-08 (No. 08); full text *

Also Published As

Publication number Publication date
CN113869154A (en) 2021-12-31

Similar Documents

Publication Publication Date Title
Han et al. A survey on visual transformer
Wu et al. Language as queries for referring video object segmentation
Zhao et al. Object detection with deep learning: A review
Hu et al. Dense relation distillation with context-aware aggregation for few-shot object detection
Ricci et al. Monocular depth estimation using multi-scale continuous CRFs as sequential deep networks
Abbas et al. A comprehensive review of recent advances on deep vision systems
Bhunia et al. Joint visual semantic reasoning: Multi-stage decoder for text recognition
CN111444889A (en) Fine-grained action detection method of convolutional neural network based on multi-stage condition influence
Sun et al. Combining feature-level and decision-level fusion in a hierarchical classifier for emotion recognition in the wild
Li et al. Learning face image super-resolution through facial semantic attribute transformation and self-attentive structure enhancement
CN112307995A (en) Semi-supervised pedestrian re-identification method based on feature decoupling learning
CN113569865A (en) Single sample image segmentation method based on class prototype learning
CN112801068B (en) Video multi-target tracking and segmenting system and method
Gao et al. Co-saliency detection with co-attention fully convolutional network
CN112819013A (en) Image description method based on intra-layer and inter-layer joint global representation
CN115019143A (en) Text detection method based on CNN and Transformer mixed model
Jiang et al. Context-integrated and feature-refined network for lightweight object parsing
CN113935435A (en) Multi-modal emotion recognition method based on space-time feature fusion
CN113869154B (en) Video actor segmentation method according to language description
US20230072445A1 (en) Self-supervised video representation learning by exploring spatiotemporal continuity
Chen et al. Y-Net: Dual-branch joint network for semantic segmentation
CN116704506A (en) Cross-environment-attention-based image segmentation method
Cai et al. Vehicle detection based on visual saliency and deep sparse convolution hierarchical model
Yan et al. MEAN: multi-element attention network for scene text recognition
US20230186600A1 (en) Method of clustering using encoder-decoder model based on attention mechanism and storage medium for image recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant