CN108647255A - Video temporal sentence localization method and device based on attention regression - Google Patents

Video temporal sentence localization method and device based on attention regression

Info

Publication number
CN108647255A
Authority
CN
China
Prior art keywords
sentence
attention
video
content
sequential
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810367989.8A
Other languages
Chinese (zh)
Inventor
Wenwu Zhu (朱文武)
Yitian Yuan (袁艺天)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201810367989.8A
Publication of CN108647255A
Priority to PCT/CN2018/113805 (published as WO2019205562A1)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a video temporal sentence localization method and device based on attention regression. The method comprises the following steps: encoding video clips and sentences with a bidirectional long short-term memory network, built on a three-dimensional convolutional neural network and GloVe word vectors, so as to characterize the video clip content and the sentence content; establishing a symmetric association between the video and the sentence through a multi-modal attention mechanism according to the video clip content and the sentence content, so as to obtain attention weight vectors and attention-weighted features for the video and the sentence; and outputting the video temporal sentence localization result through a regression mechanism based on attention weights or a regression mechanism based on attention-weighted features, according to the attention weight vectors or attention-weighted features of the video and the sentence. The method preserves the contextual information in the video and the sentence and improves the efficiency of the sentence localization process, thereby improving sentence localization speed, accuracy, and robustness.

Description

Video temporal sentence localization method and device based on attention regression
Technical field
The present invention relates to the technical field of computer vision, and more particularly to a video temporal sentence localization method and device based on attention regression.
Background art
In the prior art, video temporal sentence localization methods mainly take one of two forms. The first builds a unified representation space between video and sentence, scans the video to generate several candidate video segments, and projects the sentence and the candidate segments into the unified representation space for comparison and localization. The second scans the video to generate several candidate video segments, fuses the visual features of the candidate segments with the text features of the sentence to produce multi-modal features, performs temporal regression on the basis of the multi-modal features to predict the offset between each candidate segment and the predicted segment, and moves the candidate segment to the predicted position.
The prior-art methods have the following disadvantages. Scanning the video to generate candidate segments is computationally expensive and cannot cope with long videos, so the scalability of the above video temporal sentence localization methods is poor. Treating candidate segments as processes independent of the global video blocks the interaction between specific video content and the video's contextual information, even though video context is crucial for sentence localization; the accuracy of the above methods is therefore limited. Moreover, all of the above methods extract sentence features directly with a generic long short-term memory network, ignoring the information in the sentence that is key to temporal localization, so their mining of sentence information is insufficient.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, one object of the present invention is to propose a video temporal sentence localization method based on attention regression that improves sentence localization speed, accuracy, and robustness.
Another object of the present invention is to propose a video temporal sentence localization device based on attention regression.
To achieve the above objects, an embodiment of one aspect of the present invention proposes a video temporal sentence localization method based on attention regression, comprising the following steps: encoding video clips and sentences with a bidirectional long short-term memory network on the basis of a three-dimensional convolutional neural network and GloVe word vectors, so as to characterize the video clip content and the sentence content; establishing a symmetric association between the video and the sentence through a multi-modal attention mechanism according to the video clip content and the sentence content, so as to obtain attention weight vectors and attention-weighted features for the video and the sentence; and outputting the video temporal sentence localization result through a regression mechanism based on attention weights or a regression mechanism based on attention-weighted features, according to the attention weight vectors or attention-weighted features of the video and the sentence.
In the video temporal sentence localization method based on attention regression of the embodiment of the present invention, the contextual information of the video and the sentence is preserved when characterizing the video clip content and the sentence content, the association between the video and the sentence is established through the multi-modal attention mechanism, and the video temporal sentence localization result is then regressed from the obtained attention weight vectors and attention-weighted features of the video and the sentence, thereby improving sentence localization speed, accuracy, and robustness.
In addition, the video temporal sentence localization method based on attention regression according to the above embodiment of the present invention may further have the following additional technical features:
Further, in one embodiment of the present invention, encoding the video clips and sentences with the bidirectional long short-term memory network on the basis of the three-dimensional convolutional neural network and GloVe word vectors to characterize the video clip content and the sentence content further comprises: fusing the contextual information of the global video when characterizing the video clip content, and characterizing each word of the sentence according to the sentence's contextual information using GloVe word vectors and the bidirectional long short-term memory network.
Further, in one embodiment of the present invention, the multi-modal attention mechanism comprises: generating the video attention weight vector and the attention-weighted video feature under the guidance of the sentence features, so as to obtain the key video content closely associated with the sentence semantics; and generating the sentence attention weight vector and the attention-weighted sentence feature under the guidance of the video clip content, so as to obtain the key cues in the sentence related to temporal localization.
Further, in one embodiment of the present invention, outputting the video temporal sentence localization result through the regression mechanism based on attention weights or the regression mechanism based on attention-weighted features according to the attention weight vectors or attention-weighted features of the video and the sentence further comprises: in the regression based on attention weights, taking the video attention weight vector as input and regressing the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers; and, in the regression based on attention-weighted features, first fusing the attention-weighted video feature and the attention-weighted sentence feature to obtain a multi-modal attention-weighted feature, then taking the multi-modal attention-weighted feature as input and regressing the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers.
Further, in one embodiment of the present invention, the video temporal sentence localization method based on attention regression further comprises: iteratively training the model parameters through a back-propagation algorithm according to an attention regression loss function and an attention calibration loss function, so as to obtain the model of the video temporal sentence localization method based on attention regression.
To achieve the above objects, an embodiment of another aspect of the present invention proposes a video temporal sentence localization device based on attention regression, comprising: a characterization module, configured to encode video clips and sentence content through a bidirectional long short-term memory network on the basis of a three-dimensional convolutional neural network and GloVe word vectors, so as to characterize the video clip content and the sentence content; an acquisition module, configured to establish a symmetric association between the video and the sentence through a multi-modal attention mechanism according to the video clip content and the sentence content, so as to obtain the attention weight vectors and attention-weighted features of the video and the sentence; and a localization module, configured to output the video temporal sentence localization result through a regression mechanism based on attention weights or a regression mechanism based on attention-weighted features according to the attention weight vectors or attention-weighted features of the video and the sentence.
In the video temporal sentence localization device based on attention regression of the embodiment of the present invention, the contextual information of the video and the sentence is preserved when characterizing the video clip content and the sentence content, the association between the video and the sentence is established through the multi-modal attention mechanism, and the video temporal sentence localization result is then regressed from the obtained attention weight vectors and attention-weighted features of the video and the sentence, thereby improving sentence localization speed, accuracy, and robustness.
In addition, the video temporal sentence localization device based on attention regression according to the above embodiment of the present invention may further have the following additional technical features:
Further, in one embodiment of the present invention, the characterization module is further configured to: fuse the contextual information of the global video when characterizing the video clip content, and characterize each word of the sentence according to the sentence's contextual information using GloVe word vectors and the bidirectional long short-term memory network.
Further, in one embodiment of the present invention, the acquisition module is further configured to: generate the video attention weight vector and the attention-weighted video feature under the guidance of the sentence features, so as to obtain the key video content closely associated with the sentence semantics; and generate the sentence attention weight vector and the attention-weighted sentence feature according to the video clip content, so as to obtain the key cues in the sentence related to temporal localization.
Further, in one embodiment of the present invention, the localization module is further configured to: in the regression based on attention weights, take the video attention weight vector as input and regress the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers; and, in the regression based on attention-weighted features, first fuse the attention-weighted video feature and the attention-weighted sentence feature to obtain a multi-modal attention-weighted feature, then take the multi-modal attention-weighted feature as input and regress the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers.
Further, in one embodiment of the present invention, the video temporal sentence localization device based on attention regression further comprises a training module, configured to: iteratively train the model parameters through a back-propagation algorithm according to the attention regression loss function and the attention calibration loss function, so as to obtain the model of the video temporal sentence localization method based on attention regression.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the following description or may be learned through practice of the present invention.
Description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of the video temporal sentence localization method based on attention regression according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the model structure of the video temporal sentence localization device based on attention regression according to one embodiment of the present invention; and
Fig. 3 is a schematic structural diagram of the video temporal sentence localization device based on attention regression according to an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, examples of which are shown in the accompanying drawings, wherein the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary and intended to explain the present invention, and should not be construed as limiting the present invention.
The video temporal sentence localization method and device based on attention regression proposed according to embodiments of the present invention are described below with reference to the accompanying drawings, starting with the method.
Fig. 1 is a flowchart of the video temporal sentence localization method based on attention regression according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps.
In step S101, video clips and sentences are encoded with a bidirectional long short-term memory network on the basis of a three-dimensional convolutional neural network and GloVe word vectors, so as to characterize the video clip content and the sentence content.
It will be appreciated that the contextual information of the global video is fused while characterizing the video clip content, and each word of the sentence is characterized according to the sentence's contextual information using GloVe word vectors and a bidirectional long short-term memory network; in this way, the obtained video clip content and sentence content are more comprehensive and more robust.
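For concreteness, the following is a minimal PyTorch sketch of the context-fusing encoders described above; it is an illustration rather than the patented implementation, and the feature dimensions (4096 for the three-dimensional convolutional clip features, 300 for the GloVe word vectors), the hidden size, and all variable names are assumptions not fixed by this text.

import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Bidirectional LSTM that fuses global context into each element's code."""
    def __init__(self, in_dim, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):
        # x: (batch, seq_len, in_dim), clip features or word embeddings
        out, _ = self.lstm(x)  # (batch, seq_len, 2 * hidden)
        return out

# One encoder per modality: clip features from a 3D CNN, word vectors from GloVe.
video_encoder = ContextEncoder(in_dim=4096)
sentence_encoder = ContextEncoder(in_dim=300)

clips = torch.randn(2, 20, 4096)   # 2 videos, M = 20 clips each
words = torch.randn(2, 12, 300)    # 2 sentences, 12 words each
v = video_encoder(clips)           # context-aware clip representations
s = sentence_encoder(words)        # context-aware word representations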
In step S102, a symmetric association between the video and the sentence is established through a multi-modal attention mechanism according to the video clip content and the sentence content, so as to obtain the attention weight vectors and attention-weighted features of the video and the sentence.
It will be appreciated that the multi-modal attention mechanism comprises: generating the video attention weight vector and the attention-weighted video feature under the guidance of the sentence features, so as to obtain the key video content closely associated with the sentence semantics; and generating the sentence attention weight vector and the attention-weighted sentence feature according to the video clip content, so as to obtain the key cues in the sentence related to temporal localization.
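A hedged sketch of this symmetric attention follows: one direction weights video clips under sentence guidance, the other weights words under video guidance. The additive (tanh) scoring form and the mean-pooled guidance vectors are assumptions; the text above only fixes the inputs and outputs of each direction.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedAttention(nn.Module):
    """Weights one modality's sequence under the other modality's guidance."""
    def __init__(self, dim):
        super().__init__()
        self.proj_x = nn.Linear(dim, dim)
        self.proj_g = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, x, guide):
        # x: (batch, n, dim) sequence to weight; guide: (batch, dim) summary
        e = self.score(torch.tanh(self.proj_x(x) + self.proj_g(guide).unsqueeze(1)))
        w = F.softmax(e.squeeze(-1), dim=1)      # attention weight vector
        feat = (w.unsqueeze(-1) * x).sum(dim=1)  # attention-weighted feature
        return w, feat

dim = 512
video_att = GuidedAttention(dim)        # sentence-guided video attention
sent_att = GuidedAttention(dim)         # video-guided sentence attention

v = torch.randn(2, 20, dim)             # encoded clips
s = torch.randn(2, 12, dim)             # encoded words
w_v, f_v = video_att(v, s.mean(dim=1))  # key video content for the sentence
w_s, f_s = sent_att(s, v.mean(dim=1))   # key temporal cues in the sentence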
In step S103, the video temporal sentence localization result is output through a regression mechanism based on attention weights or a regression mechanism based on attention-weighted features, according to the attention weight vectors or attention-weighted features of the video and the sentence.
It will be appreciated that the location regression network includes two regression strategies: regression based on attention weights and regression based on attention-weighted features. The regression based on attention weights takes the video attention weight vector as input and regresses the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers. The regression based on attention-weighted features first fuses the attention-weighted video feature and the attention-weighted sentence feature to obtain a multi-modal attention-weighted feature, then takes the multi-modal attention-weighted feature as input and regresses the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers.
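The two strategies can be sketched as follows, again as an assumption-laden illustration: the layer widths, the number of fully connected layers, and the final sigmoid (chosen here because the targets are coordinates normalized to [0, 1]) are not prescribed by the text above.

import torch
import torch.nn as nn

class WeightRegressor(nn.Module):
    """Regression from the video attention weight vector (length M)."""
    def __init__(self, m_clips):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(m_clips, 128), nn.ReLU(),
                                 nn.Linear(128, 2), nn.Sigmoid())

    def forward(self, w_v):      # w_v: (batch, M)
        return self.mlp(w_v)     # normalized (start, end) in [0, 1]

class FeatureRegressor(nn.Module):
    """Regression from fused attention-weighted video/sentence features."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(),
                                 nn.Linear(256, 2), nn.Sigmoid())

    def forward(self, f_v, f_s):               # (batch, dim) each
        fused = torch.cat([f_v, f_s], dim=-1)  # multi-modal weighted feature
        return self.mlp(fused)

start_end = WeightRegressor(m_clips=20)(torch.randn(2, 20))
start_end2 = FeatureRegressor(dim=512)(torch.randn(2, 512), torch.randn(2, 512))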
As shown in Fig. 2, in one embodiment of the present invention, the model of the video temporal sentence localization method based on attention regression is divided into three modules: context-aware feature encoding, the multi-modal attention mechanism, and the attention-based location regression network. The training steps are as follows:
The training set is expressed as $\{(V_i, S_i, t_i^s, t_i^e)\}_{i=1}^{K}$, where $V_i$ denotes the $i$-th video in the training set, whose duration is $\tau_i$; $S_i$ is a sentence describing content of video $V_i$; $t_i^s$ and $t_i^e$ are the start and end time coordinates, within the video, of the content described by $S_i$; and $K$ is the number of training samples;
Each video is evenly divided into M video clips, and each sentence is represented as a sequence of words. The start and end time coordinates of each sentence are normalized by the video duration to give the ground-truth sentence coordinates, which are the prediction targets of the location regression network: $(g_i^s, g_i^e) = (t_i^s/\tau_i,\; t_i^e/\tau_i)$;
This scheme designs two loss functions to guide the learning of the overall model: the attention regression loss function and the attention calibration loss function. Inputting the video and the sentence into the video temporal sentence localization model based on attention regression outputs the predicted sentence coordinates $(\hat{g}_i^s, \hat{g}_i^e)$. The attention regression loss is defined through the smooth L1 distance $R(t)$ between the predicted and ground-truth sentence coordinates, $L_{reg} = R(\hat{g}_i^s - g_i^s) + R(\hat{g}_i^e - g_i^e)$, where $R(t) = 0.5\,t^2$ if $|t| < 1$ and $R(t) = |t| - 0.5$ otherwise. The attention calibration loss constrains the video clips located within the ground-truth time window $[t_i^s, t_i^e]$ of the sentence to have attention weights that are as large as possible: if the $j$-th video clip of video $V_i$ lies within the time window $[t_i^s, t_i^e]$, then $m_{i,j} = 1$; otherwise $m_{i,j} = 0$;
The attention regression loss function and the attention calibration loss function jointly guide the learning of the model, and the model parameters are trained iteratively through the classic back-propagation algorithm.
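A minimal sketch of the two losses under the definitions above: the regression term is the smooth L1 distance between predicted and ground-truth normalized coordinates; for the calibration term the exact normalization is not given here, so a masked negative logarithm of the in-window attention weights is assumed.

import torch
import torch.nn.functional as F

def attention_regression_loss(pred, target):
    # pred, target: (batch, 2) normalized (start, end) coordinates
    return F.smooth_l1_loss(pred, target)

def attention_calibration_loss(w_v, mask, eps=1e-8):
    # w_v:  (batch, M) softmax-normalized video attention weights
    # mask: (batch, M), mask[i, j] = 1 iff clip j of video i lies inside the
    #       ground-truth window [t_i^s, t_i^e], else 0
    neg_log = -torch.log(w_v + eps)
    per_sample = (neg_log * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return per_sample.mean()

# Joint objective minimized by back-propagation:
# loss = attention_regression_loss(pred, target) + attention_calibration_loss(w_v, mask)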
It will be appreciated that the model of the video temporal sentence localization method based on attention regression is trained by joint optimization in an end-to-end framework, which reduces redundant computation cost and improves sentence localization accuracy. This scheme solves the video temporal sentence localization problem and thus better serves various online video applications; it suits scenarios such as fast sentence-based video content localization, video retrieval, and video summarization.
In the video temporal sentence localization method based on attention regression of the embodiment of the present invention, the contextual information of the video and the sentence is preserved when characterizing the video clip content and the sentence content, the association between the video and the sentence is established through the multi-modal attention mechanism, and the video temporal sentence localization result is then regressed from the obtained attention weight vectors and attention-weighted features of the video and the sentence, thereby improving sentence localization speed, accuracy, and robustness.
The video temporal sentence localization device based on attention regression proposed according to embodiments of the present invention is described next with reference to the accompanying drawings.
Fig. 3 is a schematic structural diagram of the video temporal sentence localization device based on attention regression according to an embodiment of the present invention. As shown in Fig. 3, the video temporal sentence localization device 10 based on attention regression comprises: a characterization module 100, configured to encode video clips and sentence content through a bidirectional long short-term memory network on the basis of a three-dimensional convolutional neural network and GloVe word vectors, so as to characterize the video clip content and the sentence content; an acquisition module 200, configured to establish a symmetric association between the video and the sentence through a multi-modal attention mechanism according to the video clip content and the sentence content, so as to obtain the attention weight vectors and attention-weighted features of the video and the sentence; and a localization module 300, configured to output the video temporal sentence localization result through a regression mechanism based on attention weights or a regression mechanism based on attention-weighted features according to the attention weight vectors or attention-weighted features of the video and the sentence.
Further, in one embodiment of the present invention, the characterization module 100 is further configured to: fuse the contextual information of the global video when characterizing the video clip content, and characterize each word of the sentence according to the sentence's contextual information using GloVe word vectors and the bidirectional long short-term memory network.
Further, in one embodiment of the present invention, the acquisition module 200 is further configured to: generate the video attention weight vector and the attention-weighted video feature under the guidance of the sentence features, so as to obtain the key video content closely associated with the sentence semantics; and generate the sentence attention weight vector and the attention-weighted sentence feature according to the video clip content, so as to obtain the key cues in the sentence related to temporal localization.
Further, in one embodiment of the present invention, the localization module 300 is further configured to: in the regression based on attention weights, take the video attention weight vector as input and regress the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers; and, in the regression based on attention-weighted features, first fuse the attention-weighted video feature and the attention-weighted sentence feature to obtain a multi-modal attention-weighted feature, then take the multi-modal attention-weighted feature as input and regress the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers.
Further, in one embodiment of the present invention, the video temporal sentence localization device 10 based on attention regression further comprises a training module, configured to: iteratively train the model parameters through a back-propagation algorithm according to the attention regression loss function and the attention calibration loss function, so as to obtain the model of the video temporal sentence localization method based on attention regression.
In the video temporal sentence localization device based on attention regression of the embodiment of the present invention, the contextual information of the video and the sentence is preserved when characterizing the video clip content and the sentence content, the association between the video and the sentence is established through the multi-modal attention mechanism, and the video temporal sentence localization result is then regressed from the obtained attention weight vectors and attention-weighted features of the video and the sentence, thereby improving sentence localization speed, accuracy, and robustness.
It should be noted that the foregoing explanation of the embodiment of the video temporal sentence localization method based on attention regression also applies to the device of this embodiment, and details are not repeated here.
In the description of the present invention, it should be understood that the orientations or positional relationships indicated by terms such as "center", "longitudinal", "transverse", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", "axial", "radial", and "circumferential" are based on the orientations or positional relationships shown in the drawings; they are merely for convenience and simplicity of description and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present invention.
In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means at least two, for example two or three, unless otherwise specifically defined.
In the present invention, unless otherwise expressly specified and limited, terms such as "mounted", "connected", "coupled", and "fixed" shall be understood broadly; for example, a connection may be a fixed connection, a detachable connection, or an integral connection; it may be a mechanical connection or an electrical connection; it may be a direct connection or an indirect connection through an intermediary; and it may be an internal communication between two elements or an interaction between two elements, unless otherwise expressly limited. For those of ordinary skill in the art, the specific meanings of the above terms in the present invention can be understood according to the specific circumstances.
In the present invention, unless otherwise expressly specified and limited, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that the first and second features are in indirect contact through an intermediary. Moreover, a first feature being "on", "above", or "over" a second feature may mean that the first feature is directly above or obliquely above the second feature, or simply that the first feature is at a higher level than the second feature. A first feature being "under", "below", or "beneath" a second feature may mean that the first feature is directly below or obliquely below the second feature, or simply that the first feature is at a lower level than the second feature.
In the description of this specification, descriptions referring to the terms "one embodiment", "some embodiments", "example", "specific example", "some examples", and the like mean that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine different embodiments or examples and the features of different embodiments or examples described in this specification, provided they do not contradict each other.
Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.

Claims (10)

1. A video temporal sentence localization method based on attention regression, characterized by comprising the following steps:
encoding video clips and sentences with a bidirectional long short-term memory network on the basis of a three-dimensional convolutional neural network and GloVe word vectors, so as to characterize video clip content and sentence content;
establishing a symmetric association between the video and the sentence through a multi-modal attention mechanism according to the video clip content and the sentence content, so as to obtain attention weight vectors and attention-weighted features of the video and the sentence; and
outputting a video temporal sentence localization result through a regression mechanism based on attention weights or a regression mechanism based on attention-weighted features, according to the attention weight vectors or attention-weighted features of the video and the sentence.
2. The video temporal sentence localization method based on attention regression according to claim 1, characterized in that encoding the video clips and sentences with the bidirectional long short-term memory network on the basis of the three-dimensional convolutional neural network and GloVe word vectors to characterize the video clip content and the sentence content further comprises:
fusing the contextual information of the global video when characterizing the video clip content, and characterizing each word of the sentence according to the sentence's contextual information using GloVe word vectors and the bidirectional long short-term memory network.
3. The video temporal sentence localization method based on attention regression according to claim 1, characterized in that the multi-modal attention mechanism comprises:
generating the video attention weight vector and the attention-weighted video feature under the guidance of sentence features, so as to obtain the key video content closely associated with the sentence semantics; and
generating the sentence attention weight vector and the attention-weighted sentence feature under the guidance of the video clip content, so as to obtain the key cues in the sentence related to temporal localization.
4. The video temporal sentence localization method based on attention regression according to claim 1, characterized in that outputting the video temporal sentence localization result through the regression mechanism based on attention weights or the regression mechanism based on attention-weighted features according to the attention weight vectors or attention-weighted features of the video and the sentence further comprises:
in the regression based on attention weights, taking the video attention weight vector as input and regressing the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers; and
in the regression based on attention-weighted features, first fusing the attention-weighted video feature and the attention-weighted sentence feature to obtain a multi-modal attention-weighted feature, then taking the multi-modal attention-weighted feature as input and regressing the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers.
5. The video temporal sentence localization method based on attention regression according to any one of claims 1-4, characterized by further comprising:
iteratively training model parameters through a back-propagation algorithm according to an attention regression loss function and an attention calibration loss function, so as to obtain a model of the video temporal sentence localization method based on attention regression.
6. A video temporal sentence localization device based on attention regression, characterized by comprising:
a characterization module, configured to encode video clips and sentence content through a bidirectional long short-term memory network on the basis of a three-dimensional convolutional neural network and GloVe word vectors, so as to characterize video clip content and sentence content;
an acquisition module, configured to establish a symmetric association between the video and the sentence through a multi-modal attention mechanism according to the video clip content and the sentence content, so as to obtain attention weight vectors and attention-weighted features of the video and the sentence; and
a localization module, configured to output a video temporal sentence localization result through a regression mechanism based on attention weights or a regression mechanism based on attention-weighted features according to the attention weight vectors or attention-weighted features of the video and the sentence.
7. The video temporal sentence localization device based on attention regression according to claim 6, characterized in that the characterization module is further configured to:
fuse the contextual information of the global video when characterizing the video clip content, and characterize each word of the sentence according to the sentence's contextual information using GloVe word vectors and the bidirectional long short-term memory network.
8. The video temporal sentence localization device based on attention regression according to claim 6, characterized in that the acquisition module is further configured to:
generate the video attention weight vector and the attention-weighted video feature under the guidance of sentence features, so as to obtain the key video content closely associated with the sentence semantics; and
generate the sentence attention weight vector and the attention-weighted sentence feature according to the video clip content, so as to obtain the key cues in the sentence related to temporal localization.
9. The video temporal sentence localization device based on attention regression according to claim 6, characterized in that the localization module is further configured to:
in the regression based on attention weights, take the video attention weight vector as input and regress the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers; and
in the regression based on attention-weighted features, first fuse the attention-weighted video feature and the attention-weighted sentence feature to obtain a multi-modal attention-weighted feature, then take the multi-modal attention-weighted feature as input and regress the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers.
10. The video temporal sentence localization device based on attention regression according to any one of claims 6-9, characterized in that the device further comprises a training module, configured to:
iteratively train model parameters through a back-propagation algorithm according to an attention regression loss function and an attention calibration loss function, so as to obtain a model of the video temporal sentence localization method based on attention regression.
CN201810367989.8A 2018-04-23 2018-04-23 Video temporal sentence localization method and device based on attention regression Pending CN108647255A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810367989.8A CN108647255A (en) 2018-04-23 2018-04-23 Video temporal sentence localization method and device based on attention regression
PCT/CN2018/113805 WO2019205562A1 (en) 2018-04-23 2018-11-02 Attention regression-based method and device for positioning sentence in video timing sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810367989.8A CN108647255A (en) 2018-04-23 2018-04-23 Video temporal sentence localization method and device based on attention regression

Publications (1)

Publication Number Publication Date
CN108647255A true CN108647255A (en) 2018-10-12

Family

ID=63747336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810367989.8A Pending CN108647255A (en) 2018-04-23 2018-04-23 Video temporal sentence localization method and device based on attention regression

Country Status (2)

Country Link
CN (1) CN108647255A (en)
WO (1) WO2019205562A1 (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109889923A (en) * 2019-02-28 2019-06-14 杭州一知智能科技有限公司 Method for summarizing videos by utilizing layered self-attention network combined with video description
CN109948691A (en) * 2019-03-14 2019-06-28 齐鲁工业大学 Image description generation method and device based on depth residual error network and attention
CN110188360A (en) * 2019-06-06 2019-08-30 北京百度网讯科技有限公司 Model training method and device
CN110225368A (en) * 2019-06-27 2019-09-10 腾讯科技(深圳)有限公司 A kind of video locating method, device and electronic equipment
WO2019205562A1 (en) * 2018-04-23 2019-10-31 清华大学 Attention regression-based method and device for positioning sentence in video timing sequence
CN110688446A (en) * 2019-08-23 2020-01-14 重庆兆光科技股份有限公司 Sentence meaning mathematical space representation method, system, medium and equipment
CN110717054A (en) * 2019-09-16 2020-01-21 清华大学 Method and system for generating video by crossing modal characters based on dual learning
CN110933518A (en) * 2019-12-11 2020-03-27 浙江大学 Method for generating query-oriented video abstract by using convolutional multi-layer attention network mechanism
WO2020113468A1 (en) * 2018-12-05 2020-06-11 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for grounding a target video clip in a video
CN111368870A (en) * 2019-10-31 2020-07-03 杭州电子科技大学 Video time sequence positioning method based on intra-modal collaborative multi-linear pooling
CN111836111A (en) * 2019-04-17 2020-10-27 微软技术许可有限责任公司 Technique for generating barrage
CN112015955A (en) * 2020-09-01 2020-12-01 清华大学 Multi-mode data association method and device
CN112560811A (en) * 2021-02-19 2021-03-26 中国科学院自动化研究所 End-to-end automatic detection research method for audio-video depression

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866938B (en) * 2019-11-21 2021-04-27 北京理工大学 Full-automatic video moving object segmentation method
CN112200250A (en) * 2020-10-14 2021-01-08 重庆金山医疗器械有限公司 Digestive tract segmentation identification method, device and equipment of capsule endoscope image
CN113762322B (en) * 2021-04-22 2024-06-25 腾讯科技(北京)有限公司 Video classification method, device and equipment based on multi-modal representation and storage medium
CN116363817B (en) * 2023-02-02 2024-01-02 淮阴工学院 Chemical plant dangerous area invasion early warning method and system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199933B (en) * 2014-09-04 2017-07-07 华中科技大学 The football video event detection and semanteme marking method of a kind of multimodal information fusion
US11409791B2 (en) * 2016-06-10 2022-08-09 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
CN107038221B (en) * 2017-03-22 2020-11-17 杭州电子科技大学 Video content description method based on semantic information guidance
CN107066973B (en) * 2017-04-17 2020-07-21 杭州电子科技大学 Video content description method using space-time attention model
CN108647255A (en) Video temporal sentence localization method and device based on attention regression

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YUAN, YITIAN et al.: "To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression", 《HTTPS://ARXIV.ORG/ABS/1804.07014V1》 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019205562A1 (en) * 2018-04-23 2019-10-31 清华大学 Attention regression-based method and device for positioning sentence in video timing sequence
WO2020113468A1 (en) * 2018-12-05 2020-06-11 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for grounding a target video clip in a video
CN111480166A (en) * 2018-12-05 2020-07-31 北京百度网讯科技有限公司 Method and device for positioning target video clip from video
US11410422B2 (en) 2018-12-05 2022-08-09 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for grounding a target video clip in a video
CN109889923A (en) * 2019-02-28 2019-06-14 杭州一知智能科技有限公司 Method for summarizing videos by utilizing layered self-attention network combined with video description
CN109889923B (en) * 2019-02-28 2021-03-26 杭州一知智能科技有限公司 Method for summarizing videos by utilizing layered self-attention network combined with video description
CN109948691A (en) * 2019-03-14 2019-06-28 Image description generation method and device based on depth residual error network and attention
US11877016B2 (en) 2019-04-17 2024-01-16 Microsoft Technology Licensing, Llc Live comments generating
CN111836111A (en) * 2019-04-17 2020-10-27 微软技术许可有限责任公司 Technique for generating barrage
CN110188360B (en) * 2019-06-06 2023-04-25 北京百度网讯科技有限公司 Model training method and device
CN110188360A (en) * 2019-06-06 2019-08-30 北京百度网讯科技有限公司 Model training method and device
CN110225368B (en) * 2019-06-27 2020-07-10 腾讯科技(深圳)有限公司 Video positioning method and device and electronic equipment
CN110225368A (en) * 2019-06-27 2019-09-10 腾讯科技(深圳)有限公司 A kind of video locating method, device and electronic equipment
CN110688446A (en) * 2019-08-23 2020-01-14 重庆兆光科技股份有限公司 Sentence meaning mathematical space representation method, system, medium and equipment
CN110717054B (en) * 2019-09-16 2022-07-15 清华大学 Method and system for generating video by crossing modal characters based on dual learning
CN110717054A (en) * 2019-09-16 2020-01-21 清华大学 Method and system for generating video by crossing modal characters based on dual learning
CN111368870A (en) * 2019-10-31 2020-07-03 杭州电子科技大学 Video time sequence positioning method based on intra-modal collaborative multi-linear pooling
CN111368870B (en) * 2019-10-31 2023-09-05 杭州电子科技大学 Video time sequence positioning method based on inter-modal cooperative multi-linear pooling
CN110933518B (en) * 2019-12-11 2020-10-02 浙江大学 Method for generating query-oriented video abstract by using convolutional multi-layer attention network mechanism
CN110933518A (en) * 2019-12-11 2020-03-27 浙江大学 Method for generating query-oriented video abstract by using convolutional multi-layer attention network mechanism
CN112015955A (en) * 2020-09-01 2020-12-01 清华大学 Multi-mode data association method and device
CN112015955B (en) * 2020-09-01 2021-07-30 清华大学 Multi-mode data association method and device
CN112560811A (en) * 2021-02-19 2021-03-26 中国科学院自动化研究所 End-to-end automatic detection research method for audio-video depression
US11963771B2 (en) 2021-02-19 2024-04-23 Institute Of Automation, Chinese Academy Of Sciences Automatic depression detection method based on audio-video

Also Published As

Publication number Publication date
WO2019205562A1 (en) 2019-10-31

Similar Documents

Publication Publication Date Title
CN108647255A (en) Video temporal sentence localization method and device based on attention regression
CN108052937B (en) Based on Weakly supervised character machining device training method, device, system and medium
CN111738111B (en) Road extraction method of high-resolution remote sensing image based on multi-branch cascade cavity space pyramid
CN109271646A (en) Text interpretation method, device, readable storage medium storing program for executing and computer equipment
CN108665506A (en) Image processing method, device, computer storage media and server
CN109635204A (en) Online recommender system based on collaborative filtering and length memory network
CN109671102A (en) A kind of composite type method for tracking target based on depth characteristic fusion convolutional neural networks
US20220092297A1 (en) Facial beauty prediction method and device based on multi-task migration
CN105740984A (en) Product concept performance evaluation method based on performance prediction
CN112199600A (en) Target object identification method and device
CN114925238B (en) Federal learning-based video clip retrieval method and system
CN110442741A (en) A kind of mutual search method of cross-module state picture and text for merging and reordering based on tensor
US20230368500A1 (en) Time-series image description method for dam defects based on local self-attention
CN113870312B (en) Single target tracking method based on twin network
CN113255701B (en) Small sample learning method and system based on absolute-relative learning framework
Huang et al. An incremental SAR target recognition framework via memory-augmented weight alignment and enhancement discrimination
CN117213470A (en) Multi-machine fragment map aggregation updating method and system
CN111651577A (en) Cross-media data association analysis model training method, data association analysis method and system
CN117453949A (en) Video positioning method and device
CN115017377B (en) Method, device and computing equipment for searching target model
Chen et al. Application of Data‐Driven Iterative Learning Algorithm in Transmission Line Defect Detection
Chaalal et al. Mobility prediction for aerial base stations for a coverage extension in 5G networks
Zha et al. [Retracted] Research on the Prediction of Port Economic Synergy Development Trend Based on Deep Neural Networks
CN113239219A (en) Image retrieval method, system, medium and equipment based on multi-modal query
CN115935001A (en) Frame-level fine-grained natural language video time positioning method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20181012)