CN108647255A - Video temporal sentence localization method and device based on attention regression - Google Patents

Video temporal sentence localization method and device based on attention regression

Info

Publication number
CN108647255A
Authority
CN
China
Prior art keywords
sentence
attention
video
content
sequential
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810367989.8A
Other languages
Chinese (zh)
Inventor
Wenwu Zhu (朱文武)
Yitian Yuan (袁艺天)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN201810367989.8A
Publication of CN108647255A
Priority to PCT/CN2018/113805 (published as WO2019205562A1)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a video temporal sentence localization method and device based on attention regression. The method comprises the following steps: encoding video clips and sentences with a bidirectional long short-term memory network, built on a three-dimensional convolutional neural network and GloVe word vectors, so as to characterize the video clip content and the sentence content; establishing a symmetric association between the video and the sentence through a multi-modal attention mechanism according to the video clip content and the sentence content, so as to obtain attention weight vectors and attention-weighted features for the video and the sentence; and outputting the video temporal sentence localization result through a regression mechanism based on attention weights or a regression mechanism based on attention-weighted features, according to the attention weight vectors or attention-weighted features of the video and the sentence. The method preserves the contextual information in the video and the sentence and improves the efficiency of the sentence localization process, thereby improving sentence localization speed, accuracy, and robustness.

Description

Video temporal sentence localization method and device based on attention regression
Technical field
The present invention relates to the technical field of computer vision, and more particularly to a video temporal sentence localization method and device based on attention regression.
Background art
In the prior art, video temporal sentence localization methods mainly take one of two forms. The first builds a unified representation space between video and sentence, scans the video to generate several candidate video segments, and projects the sentence and the candidate segments into the unified representation space for comparison and localization. The second scans the video to generate several candidate video segments, fuses the visual features of the candidate segments with the text features of the sentence to produce multi-modal features, performs temporal regression on the basis of the multi-modal features to predict the offset between each candidate segment and the predicted segment, and moves the candidate segment to the predicted position.
The prior-art methods have the following disadvantages. Scanning the video to generate candidate segments is computationally expensive and cannot cope with long videos, so the scalability of the above video temporal sentence localization methods is poor. Treating candidate segments as processes independent of the global video blocks the interaction between specific video content and the video's contextual information, even though video context is crucial for sentence localization; the accuracy of the above methods is therefore limited. Moreover, all of the above methods extract sentence features directly with a generic long short-term memory network, ignoring the information in the sentence that is key to temporal localization, so their mining of sentence information is insufficient.
Summary of the invention
The present invention aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, one object of the present invention is to propose a video temporal sentence localization method based on attention regression that improves sentence localization speed, accuracy, and robustness.
Another object of the present invention is to propose a video temporal sentence localization device based on attention regression.
To achieve the above objects, an embodiment of one aspect of the present invention proposes a video temporal sentence localization method based on attention regression, comprising the following steps: encoding video clips and sentences with a bidirectional long short-term memory network on the basis of a three-dimensional convolutional neural network and GloVe word vectors, so as to characterize the video clip content and the sentence content; establishing a symmetric association between the video and the sentence through a multi-modal attention mechanism according to the video clip content and the sentence content, so as to obtain attention weight vectors and attention-weighted features for the video and the sentence; and outputting the video temporal sentence localization result through a regression mechanism based on attention weights or a regression mechanism based on attention-weighted features, according to the attention weight vectors or attention-weighted features of the video and the sentence.
In the video temporal sentence localization method based on attention regression of the embodiment of the present invention, the contextual information of the video and the sentence is preserved when characterizing the video clip content and the sentence content, the association between the video and the sentence is established through the multi-modal attention mechanism, and the video temporal sentence localization result is then regressed from the obtained attention weight vectors and attention-weighted features of the video and the sentence, thereby improving sentence localization speed, accuracy, and robustness.
In addition, the video temporal sentence localization method based on attention regression according to the above embodiment of the present invention may further have the following additional technical features:
Further, in one embodiment of the present invention, encoding the video clips and sentences with the bidirectional long short-term memory network on the basis of the three-dimensional convolutional neural network and GloVe word vectors to characterize the video clip content and the sentence content further comprises: fusing the contextual information of the global video when characterizing the video clip content, and characterizing each word of the sentence according to the sentence's contextual information using GloVe word vectors and the bidirectional long short-term memory network.
Further, in one embodiment of the present invention, the multi-modal attention mechanism comprises: generating the video attention weight vector and the attention-weighted video feature under the guidance of the sentence features, so as to obtain the key video content closely associated with the sentence semantics; and generating the sentence attention weight vector and the attention-weighted sentence feature under the guidance of the video clip content, so as to obtain the key cues in the sentence related to temporal localization.
Further, in one embodiment of the present invention, outputting the video temporal sentence localization result through the regression mechanism based on attention weights or the regression mechanism based on attention-weighted features according to the attention weight vectors or attention-weighted features of the video and the sentence further comprises: in the regression based on attention weights, taking the video attention weight vector as input and regressing the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers; and, in the regression based on attention-weighted features, first fusing the attention-weighted video feature and the attention-weighted sentence feature to obtain a multi-modal attention-weighted feature, then taking the multi-modal attention-weighted feature as input and regressing the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers.
Further, in one embodiment of the present invention, the video temporal sentence localization method based on attention regression further comprises: iteratively training the model parameters through a back-propagation algorithm according to an attention regression loss function and an attention calibration loss function, so as to obtain the model of the video temporal sentence localization method based on attention regression.
To achieve the above objects, an embodiment of another aspect of the present invention proposes a video temporal sentence localization device based on attention regression, comprising: a characterization module, configured to encode video clips and sentence content through a bidirectional long short-term memory network on the basis of a three-dimensional convolutional neural network and GloVe word vectors, so as to characterize the video clip content and the sentence content; an acquisition module, configured to establish a symmetric association between the video and the sentence through a multi-modal attention mechanism according to the video clip content and the sentence content, so as to obtain the attention weight vectors and attention-weighted features of the video and the sentence; and a localization module, configured to output the video temporal sentence localization result through a regression mechanism based on attention weights or a regression mechanism based on attention-weighted features according to the attention weight vectors or attention-weighted features of the video and the sentence.
In the video temporal sentence localization device based on attention regression of the embodiment of the present invention, the contextual information of the video and the sentence is preserved when characterizing the video clip content and the sentence content, the association between the video and the sentence is established through the multi-modal attention mechanism, and the video temporal sentence localization result is then regressed from the obtained attention weight vectors and attention-weighted features of the video and the sentence, thereby improving sentence localization speed, accuracy, and robustness.
In addition, the video temporal sentence localization device based on attention regression according to the above embodiment of the present invention may further have the following additional technical features:
Further, in one embodiment of the present invention, the characterization module is further configured to: fuse the contextual information of the global video when characterizing the video clip content, and characterize each word of the sentence according to the sentence's contextual information using GloVe word vectors and the bidirectional long short-term memory network.
Further, in one embodiment of the present invention, the acquisition module is further configured to: generate the video attention weight vector and the attention-weighted video feature under the guidance of the sentence features, so as to obtain the key video content closely associated with the sentence semantics; and generate the sentence attention weight vector and the attention-weighted sentence feature according to the video clip content, so as to obtain the key cues in the sentence related to temporal localization.
Further, in one embodiment of the present invention, the localization module is further configured to: in the regression based on attention weights, take the video attention weight vector as input and regress the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers; and, in the regression based on attention-weighted features, first fuse the attention-weighted video feature and the attention-weighted sentence feature to obtain a multi-modal attention-weighted feature, then take the multi-modal attention-weighted feature as input and regress the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers.
Further, in one embodiment of the present invention, the video temporal sentence localization device based on attention regression further comprises a training module, configured to: iteratively train the model parameters through a back-propagation algorithm according to the attention regression loss function and the attention calibration loss function, so as to obtain the model of the video temporal sentence localization method based on attention regression.
Additional aspects and advantages of the present invention will be set forth in part in the following description, and in part will become apparent from the following description or may be learned through practice of the present invention.
Description of the drawings
The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flowchart of the video temporal sentence localization method based on attention regression according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the model structure of the video temporal sentence localization device based on attention regression according to one embodiment of the present invention; and
Fig. 3 is a schematic structural diagram of the video temporal sentence localization device based on attention regression according to an embodiment of the present invention.
Detailed description of the embodiments
Embodiments of the present invention are described in detail below, examples of which are shown in the accompanying drawings, wherein the same or similar reference numerals throughout denote the same or similar elements or elements having the same or similar functions. The embodiments described below with reference to the drawings are exemplary and intended to explain the present invention, and should not be construed as limiting the present invention.
The video temporal sentence localization method and device based on attention regression proposed according to embodiments of the present invention are described below with reference to the accompanying drawings, starting with the method.
Fig. 1 is a flowchart of the video temporal sentence localization method based on attention regression according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps.
In step S101, video clips and sentences are encoded with a bidirectional long short-term memory network on the basis of a three-dimensional convolutional neural network and GloVe word vectors, so as to characterize the video clip content and the sentence content.
It will be appreciated that the contextual information of the global video is fused while characterizing the video clip content, and each word of the sentence is characterized according to the sentence's contextual information using GloVe word vectors and a bidirectional long short-term memory network; in this way, the obtained video clip content and sentence content are more comprehensive and more robust.
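For concreteness, the following is a minimal PyTorch sketch of the context-fusing encoders described above; it is an illustration rather than the patented implementation, and the feature dimensions (4096 for the three-dimensional convolutional clip features, 300 for the GloVe word vectors), the hidden size, and all variable names are assumptions not fixed by this text.

import torch
import torch.nn as nn

class ContextEncoder(nn.Module):
    """Bidirectional LSTM that fuses global context into each element's code."""
    def __init__(self, in_dim, hidden=256):
        super().__init__()
        self.lstm = nn.LSTM(in_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, x):
        # x: (batch, seq_len, in_dim), clip features or word embeddings
        out, _ = self.lstm(x)  # (batch, seq_len, 2 * hidden)
        return out

# One encoder per modality: clip features from a 3D CNN, word vectors from GloVe.
video_encoder = ContextEncoder(in_dim=4096)
sentence_encoder = ContextEncoder(in_dim=300)

clips = torch.randn(2, 20, 4096)   # 2 videos, M = 20 clips each
words = torch.randn(2, 12, 300)    # 2 sentences, 12 words each
v = video_encoder(clips)           # context-aware clip representations
s = sentence_encoder(words)        # context-aware word representations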
In step S102, a symmetric association between the video and the sentence is established through a multi-modal attention mechanism according to the video clip content and the sentence content, so as to obtain the attention weight vectors and attention-weighted features of the video and the sentence.
It will be appreciated that the multi-modal attention mechanism comprises: generating the video attention weight vector and the attention-weighted video feature under the guidance of the sentence features, so as to obtain the key video content closely associated with the sentence semantics; and generating the sentence attention weight vector and the attention-weighted sentence feature according to the video clip content, so as to obtain the key cues in the sentence related to temporal localization.
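A hedged sketch of this symmetric attention follows: one direction weights video clips under sentence guidance, the other weights words under video guidance. The additive (tanh) scoring form and the mean-pooled guidance vectors are assumptions; the text above only fixes the inputs and outputs of each direction.

import torch
import torch.nn as nn
import torch.nn.functional as F

class GuidedAttention(nn.Module):
    """Weights one modality's sequence under the other modality's guidance."""
    def __init__(self, dim):
        super().__init__()
        self.proj_x = nn.Linear(dim, dim)
        self.proj_g = nn.Linear(dim, dim)
        self.score = nn.Linear(dim, 1)

    def forward(self, x, guide):
        # x: (batch, n, dim) sequence to weight; guide: (batch, dim) summary
        e = self.score(torch.tanh(self.proj_x(x) + self.proj_g(guide).unsqueeze(1)))
        w = F.softmax(e.squeeze(-1), dim=1)      # attention weight vector
        feat = (w.unsqueeze(-1) * x).sum(dim=1)  # attention-weighted feature
        return w, feat

dim = 512
video_att = GuidedAttention(dim)        # sentence-guided video attention
sent_att = GuidedAttention(dim)         # video-guided sentence attention

v = torch.randn(2, 20, dim)             # encoded clips
s = torch.randn(2, 12, dim)             # encoded words
w_v, f_v = video_att(v, s.mean(dim=1))  # key video content for the sentence
w_s, f_s = sent_att(s, v.mean(dim=1))   # key temporal cues in the sentence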
In step S103, the video temporal sentence localization result is output through a regression mechanism based on attention weights or a regression mechanism based on attention-weighted features, according to the attention weight vectors or attention-weighted features of the video and the sentence.
It will be appreciated that the location regression network includes two regression strategies: regression based on attention weights and regression based on attention-weighted features. The regression based on attention weights takes the video attention weight vector as input and regresses the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers. The regression based on attention-weighted features first fuses the attention-weighted video feature and the attention-weighted sentence feature to obtain a multi-modal attention-weighted feature, then takes the multi-modal attention-weighted feature as input and regresses the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers.
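The two strategies can be sketched as follows, again as an assumption-laden illustration: the layer widths, the number of fully connected layers, and the final sigmoid (chosen here because the targets are coordinates normalized to [0, 1]) are not prescribed by the text above.

import torch
import torch.nn as nn

class WeightRegressor(nn.Module):
    """Regression from the video attention weight vector (length M)."""
    def __init__(self, m_clips):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(m_clips, 128), nn.ReLU(),
                                 nn.Linear(128, 2), nn.Sigmoid())

    def forward(self, w_v):      # w_v: (batch, M)
        return self.mlp(w_v)     # normalized (start, end) in [0, 1]

class FeatureRegressor(nn.Module):
    """Regression from fused attention-weighted video/sentence features."""
    def __init__(self, dim):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * dim, 256), nn.ReLU(),
                                 nn.Linear(256, 2), nn.Sigmoid())

    def forward(self, f_v, f_s):               # (batch, dim) each
        fused = torch.cat([f_v, f_s], dim=-1)  # multi-modal weighted feature
        return self.mlp(fused)

start_end = WeightRegressor(m_clips=20)(torch.randn(2, 20))
start_end2 = FeatureRegressor(dim=512)(torch.randn(2, 512), torch.randn(2, 512))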
As shown in Fig. 2, in one embodiment of the present invention, the model of the video temporal sentence localization method based on attention regression is divided into three modules: context-aware feature encoding, the multi-modal attention mechanism, and the attention-based location regression network. The training steps are as follows:
The training set is expressed as $\{(V_i, S_i, t_i^s, t_i^e)\}_{i=1}^{K}$, where $V_i$ denotes the $i$-th video in the training set, whose duration is $\tau_i$; $S_i$ is a sentence describing content of video $V_i$; $t_i^s$ and $t_i^e$ are the start and end time coordinates, within the video, of the content described by $S_i$; and $K$ is the number of training samples;
Each video is evenly divided into M video clips, and each sentence is represented as a sequence of words. The start and end time coordinates of each sentence are normalized by the video duration to give the ground-truth sentence coordinates, which are the prediction targets of the location regression network: $(g_i^s, g_i^e) = (t_i^s/\tau_i,\; t_i^e/\tau_i)$;
This scheme designs two loss functions to guide the learning of the overall model: the attention regression loss function and the attention calibration loss function. Inputting the video and the sentence into the video temporal sentence localization model based on attention regression outputs the predicted sentence coordinates $(\hat{g}_i^s, \hat{g}_i^e)$. The attention regression loss is defined through the smooth L1 distance $R(t)$ between the predicted and ground-truth sentence coordinates, $L_{reg} = R(\hat{g}_i^s - g_i^s) + R(\hat{g}_i^e - g_i^e)$, where $R(t) = 0.5\,t^2$ if $|t| < 1$ and $R(t) = |t| - 0.5$ otherwise. The attention calibration loss constrains the video clips located within the ground-truth time window $[t_i^s, t_i^e]$ of the sentence to have attention weights that are as large as possible: if the $j$-th video clip of video $V_i$ lies within the time window $[t_i^s, t_i^e]$, then $m_{i,j} = 1$; otherwise $m_{i,j} = 0$;
The attention regression loss function and the attention calibration loss function jointly guide the learning of the model, and the model parameters are trained iteratively through the classic back-propagation algorithm.
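A minimal sketch of the two losses under the definitions above: the regression term is the smooth L1 distance between predicted and ground-truth normalized coordinates; for the calibration term the exact normalization is not given here, so a masked negative logarithm of the in-window attention weights is assumed.

import torch
import torch.nn.functional as F

def attention_regression_loss(pred, target):
    # pred, target: (batch, 2) normalized (start, end) coordinates
    return F.smooth_l1_loss(pred, target)

def attention_calibration_loss(w_v, mask, eps=1e-8):
    # w_v:  (batch, M) softmax-normalized video attention weights
    # mask: (batch, M), mask[i, j] = 1 iff clip j of video i lies inside the
    #       ground-truth window [t_i^s, t_i^e], else 0
    neg_log = -torch.log(w_v + eps)
    per_sample = (neg_log * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return per_sample.mean()

# Joint objective minimized by back-propagation:
# loss = attention_regression_loss(pred, target) + attention_calibration_loss(w_v, mask)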
It will be appreciated that the model of the video temporal sentence localization method based on attention regression is trained by joint optimization in an end-to-end framework, which reduces redundant computation cost and improves sentence localization accuracy. This scheme solves the video temporal sentence localization problem and thus better serves various online video applications; it suits scenarios such as fast sentence-based video content localization, video retrieval, and video summarization.
In the video temporal sentence localization method based on attention regression of the embodiment of the present invention, the contextual information of the video and the sentence is preserved when characterizing the video clip content and the sentence content, the association between the video and the sentence is established through the multi-modal attention mechanism, and the video temporal sentence localization result is then regressed from the obtained attention weight vectors and attention-weighted features of the video and the sentence, thereby improving sentence localization speed, accuracy, and robustness.
The video temporal sentence localization device based on attention regression proposed according to embodiments of the present invention is described next with reference to the accompanying drawings.
Fig. 3 is a schematic structural diagram of the video temporal sentence localization device based on attention regression according to an embodiment of the present invention. As shown in Fig. 3, the video temporal sentence localization device 10 based on attention regression comprises: a characterization module 100, configured to encode video clips and sentence content through a bidirectional long short-term memory network on the basis of a three-dimensional convolutional neural network and GloVe word vectors, so as to characterize the video clip content and the sentence content; an acquisition module 200, configured to establish a symmetric association between the video and the sentence through a multi-modal attention mechanism according to the video clip content and the sentence content, so as to obtain the attention weight vectors and attention-weighted features of the video and the sentence; and a localization module 300, configured to output the video temporal sentence localization result through a regression mechanism based on attention weights or a regression mechanism based on attention-weighted features according to the attention weight vectors or attention-weighted features of the video and the sentence.
Further, in one embodiment of the present invention, the characterization module 100 is further configured to: fuse the contextual information of the global video when characterizing the video clip content, and characterize each word of the sentence according to the sentence's contextual information using GloVe word vectors and the bidirectional long short-term memory network.
Further, in one embodiment of the present invention, the acquisition module 200 is further configured to: generate the video attention weight vector and the attention-weighted video feature under the guidance of the sentence features, so as to obtain the key video content closely associated with the sentence semantics; and generate the sentence attention weight vector and the attention-weighted sentence feature according to the video clip content, so as to obtain the key cues in the sentence related to temporal localization.
Further, in one embodiment of the present invention, the localization module 300 is further configured to: in the regression based on attention weights, take the video attention weight vector as input and regress the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers; and, in the regression based on attention-weighted features, first fuse the attention-weighted video feature and the attention-weighted sentence feature to obtain a multi-modal attention-weighted feature, then take the multi-modal attention-weighted feature as input and regress the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers.
Further, in one embodiment of the present invention, the video temporal sentence localization device 10 based on attention regression further comprises a training module, configured to: iteratively train the model parameters through a back-propagation algorithm according to the attention regression loss function and the attention calibration loss function, so as to obtain the model of the video temporal sentence localization method based on attention regression.
In the video temporal sentence localization device based on attention regression of the embodiment of the present invention, the contextual information of the video and the sentence is preserved when characterizing the video clip content and the sentence content, the association between the video and the sentence is established through the multi-modal attention mechanism, and the video temporal sentence localization result is then regressed from the obtained attention weight vectors and attention-weighted features of the video and the sentence, thereby improving sentence localization speed, accuracy, and robustness.
It should be noted that the foregoing explanation of the embodiment of the video temporal sentence localization method based on attention regression also applies to the device of this embodiment, and details are not repeated here.
In the description of the present invention, it should be understood that the orientations or positional relationships indicated by terms such as "center", "longitudinal", "transverse", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise", "counterclockwise", "axial", "radial", and "circumferential" are based on the orientations or positional relationships shown in the drawings; they are merely for convenience and simplicity of description and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation, and therefore should not be construed as limiting the present invention.
In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present invention, "plurality" means at least two, for example two or three, unless otherwise specifically defined.
In the present invention, unless otherwise expressly specified and limited, terms such as "mounted", "connected", "coupled", and "fixed" shall be understood broadly; for example, a connection may be a fixed connection, a detachable connection, or an integral connection; it may be a mechanical connection or an electrical connection; it may be a direct connection or an indirect connection through an intermediary; and it may be an internal communication between two elements or an interaction between two elements, unless otherwise expressly limited. For those of ordinary skill in the art, the specific meanings of the above terms in the present invention can be understood according to the specific circumstances.
In the present invention, unless otherwise expressly specified and limited, a first feature being "on" or "under" a second feature may mean that the first and second features are in direct contact, or that the first and second features are in indirect contact through an intermediary. Moreover, a first feature being "on", "above", or "over" a second feature may mean that the first feature is directly above or obliquely above the second feature, or simply that the first feature is at a higher level than the second feature. A first feature being "under", "below", or "beneath" a second feature may mean that the first feature is directly below or obliquely below the second feature, or simply that the first feature is at a lower level than the second feature.
In the description of this specification, descriptions referring to the terms "one embodiment", "some embodiments", "example", "specific example", "some examples", and the like mean that a specific feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. In addition, those skilled in the art may combine different embodiments or examples and the features of different embodiments or examples described in this specification, provided they do not contradict each other.
Although embodiments of the present invention have been shown and described above, it should be understood that the above embodiments are exemplary and should not be construed as limiting the present invention; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present invention.

Claims (10)

1. A video temporal sentence localization method based on attention regression, characterized by comprising the following steps:
encoding video clips and sentences with a bidirectional long short-term memory network on the basis of a three-dimensional convolutional neural network and GloVe word vectors, so as to characterize video clip content and sentence content;
establishing a symmetric association between the video and the sentence through a multi-modal attention mechanism according to the video clip content and the sentence content, so as to obtain attention weight vectors and attention-weighted features of the video and the sentence; and
outputting a video temporal sentence localization result through a regression mechanism based on attention weights or a regression mechanism based on attention-weighted features, according to the attention weight vectors or attention-weighted features of the video and the sentence.
2. The video temporal sentence localization method based on attention regression according to claim 1, characterized in that encoding the video clips and sentences with the bidirectional long short-term memory network on the basis of the three-dimensional convolutional neural network and GloVe word vectors to characterize the video clip content and the sentence content further comprises:
fusing the contextual information of the global video when characterizing the video clip content, and characterizing each word of the sentence according to the sentence's contextual information using GloVe word vectors and the bidirectional long short-term memory network.
3. The video temporal sentence localization method based on attention regression according to claim 1, characterized in that the multi-modal attention mechanism comprises:
generating the video attention weight vector and the attention-weighted video feature under the guidance of sentence features, so as to obtain the key video content closely associated with the sentence semantics; and
generating the sentence attention weight vector and the attention-weighted sentence feature under the guidance of the video clip content, so as to obtain the key cues in the sentence related to temporal localization.
4. The video temporal sentence localization method based on attention regression according to claim 1, characterized in that outputting the video temporal sentence localization result through the regression mechanism based on attention weights or the regression mechanism based on attention-weighted features according to the attention weight vectors or attention-weighted features of the video and the sentence further comprises:
in the regression based on attention weights, taking the video attention weight vector as input and regressing the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers; and
in the regression based on attention-weighted features, first fusing the attention-weighted video feature and the attention-weighted sentence feature to obtain a multi-modal attention-weighted feature, then taking the multi-modal attention-weighted feature as input and regressing the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers.
5. The video temporal sentence localization method based on attention regression according to any one of claims 1-4, characterized by further comprising:
iteratively training model parameters through a back-propagation algorithm according to an attention regression loss function and an attention calibration loss function, so as to obtain a model of the video temporal sentence localization method based on attention regression.
6. A video temporal sentence localization device based on attention regression, characterized by comprising:
a characterization module, configured to encode video clips and sentence content through a bidirectional long short-term memory network on the basis of a three-dimensional convolutional neural network and GloVe word vectors, so as to characterize video clip content and sentence content;
an acquisition module, configured to establish a symmetric association between the video and the sentence through a multi-modal attention mechanism according to the video clip content and the sentence content, so as to obtain attention weight vectors and attention-weighted features of the video and the sentence; and
a localization module, configured to output a video temporal sentence localization result through a regression mechanism based on attention weights or a regression mechanism based on attention-weighted features according to the attention weight vectors or attention-weighted features of the video and the sentence.
7. The video temporal sentence localization device based on attention regression according to claim 6, characterized in that the characterization module is further configured to:
fuse the contextual information of the global video when characterizing the video clip content, and characterize each word of the sentence according to the sentence's contextual information using GloVe word vectors and the bidirectional long short-term memory network.
8. The video temporal sentence localization device based on attention regression according to claim 6, characterized in that the acquisition module is further configured to:
generate the video attention weight vector and the attention-weighted video feature under the guidance of sentence features, so as to obtain the key video content closely associated with the sentence semantics; and
generate the sentence attention weight vector and the attention-weighted sentence feature according to the video clip content, so as to obtain the key cues in the sentence related to temporal localization.
9. The video temporal sentence localization device based on attention regression according to claim 6, characterized in that the localization module is further configured to:
in the regression based on attention weights, take the video attention weight vector as input and regress the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers; and
in the regression based on attention-weighted features, first fuse the attention-weighted video feature and the attention-weighted sentence feature to obtain a multi-modal attention-weighted feature, then take the multi-modal attention-weighted feature as input and regress the relative position, within the global video, of the video content indicated by the sentence using multiple fully connected layers.
10. The video temporal sentence localization device based on attention regression according to any one of claims 6-9, characterized in that the device further comprises a training module, configured to:
iteratively train model parameters through a back-propagation algorithm according to an attention regression loss function and an attention calibration loss function, so as to obtain a model of the video temporal sentence localization method based on attention regression.
CN201810367989.8A 2018-04-23 2018-04-23 Video temporal sentence localization method and device based on attention regression Pending CN108647255A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201810367989.8A CN108647255A (en) 2018-04-23 2018-04-23 Video temporal sentence localization method and device based on attention regression
PCT/CN2018/113805 WO2019205562A1 (en) 2018-04-23 2018-11-02 Attention regression-based method and device for positioning sentence in video timing sequence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810367989.8A CN108647255A (en) 2018-04-23 2018-04-23 Video temporal sentence localization method and device based on attention regression

Publications (1)

Publication Number Publication Date
CN108647255A true CN108647255A (en) 2018-10-12

Family

ID=63747336

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810367989.8A Pending CN108647255A (en) 2018-04-23 2018-04-23 Video temporal sentence localization method and device based on attention regression

Country Status (2)

Country Link
CN (1) CN108647255A (en)
WO (1) WO2019205562A1 (en)

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109889923A (en) * 2019-02-28 2019-06-14 杭州一知智能科技有限公司 Method for summarizing videos by utilizing layered self-attention network combined with video description
CN109948691A (en) * 2019-03-14 2019-06-28 齐鲁工业大学 Image description generation method and device based on depth residual error network and attention
CN110188360A (en) * 2019-06-06 2019-08-30 北京百度网讯科技有限公司 Model training method and device
CN110225368A (en) * 2019-06-27 2019-09-10 腾讯科技(深圳)有限公司 A kind of video locating method, device and electronic equipment
WO2019205562A1 (en) * 2018-04-23 2019-10-31 清华大学 Attention regression-based method and device for positioning sentence in video timing sequence
CN110688446A (en) * 2019-08-23 2020-01-14 重庆兆光科技股份有限公司 Sentence meaning mathematical space representation method, system, medium and equipment
CN110717054A (en) * 2019-09-16 2020-01-21 清华大学 Method and system for generating video by crossing modal characters based on dual learning
CN110933518A (en) * 2019-12-11 2020-03-27 浙江大学 Method for generating query-oriented video abstract by using convolutional multi-layer attention network mechanism
WO2020113468A1 (en) * 2018-12-05 2020-06-11 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for grounding a target video clip in a video
CN111368870A (en) * 2019-10-31 2020-07-03 杭州电子科技大学 Video time sequence positioning method based on intra-modal collaborative multi-linear pooling
CN111836111A (en) * 2019-04-17 2020-10-27 微软技术许可有限责任公司 Technique for generating barrage
CN112015955A (en) * 2020-09-01 2020-12-01 清华大学 Multi-mode data association method and device
CN112560811A (en) * 2021-02-19 2021-03-26 中国科学院自动化研究所 End-to-end automatic detection research method for audio-video depression

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110866938B (en) * 2019-11-21 2021-04-27 北京理工大学 Full-automatic video moving object segmentation method
CN112200250A (en) * 2020-10-14 2021-01-08 重庆金山医疗器械有限公司 Digestive tract segmentation identification method, device and equipment of capsule endoscope image
CN113762322B (en) * 2021-04-22 2024-06-25 腾讯科技(北京)有限公司 Video classification method, device and equipment based on multi-modal representation and storage medium
CN116363817B (en) * 2023-02-02 2024-01-02 淮阴工学院 Chemical plant dangerous area invasion early warning method and system

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104199933B (en) * 2014-09-04 2017-07-07 华中科技大学 The football video event detection and semanteme marking method of a kind of multimodal information fusion
US11409791B2 (en) * 2016-06-10 2022-08-09 Disney Enterprises, Inc. Joint heterogeneous language-vision embeddings for video tagging and search
CN107038221B (en) * 2017-03-22 2020-11-17 杭州电子科技大学 Video content description method based on semantic information guidance
CN107066973B (en) * 2017-04-17 2020-07-21 杭州电子科技大学 Video content description method using space-time attention model
CN108647255A (en) Video temporal sentence localization method and device based on attention regression

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YUAN, YITIAN et al.: "To Find Where You Talk: Temporal Sentence Localization in Video with Attention Based Location Regression", 《HTTPS://ARXIV.ORG/ABS/1804.07014V1》 *

Cited By (24)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019205562A1 (en) * 2018-04-23 2019-10-31 清华大学 Attention regression-based method and device for positioning sentence in video timing sequence
WO2020113468A1 (en) * 2018-12-05 2020-06-11 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for grounding a target video clip in a video
CN111480166A (en) * 2018-12-05 2020-07-31 北京百度网讯科技有限公司 Method and device for positioning target video clip from video
US11410422B2 (en) 2018-12-05 2022-08-09 Beijing Baidu Netcom Science And Technology Co., Ltd. Method and apparatus for grounding a target video clip in a video
CN109889923A (en) * 2019-02-28 2019-06-14 杭州一知智能科技有限公司 Method for summarizing videos by utilizing layered self-attention network combined with video description
CN109889923B (en) * 2019-02-28 2021-03-26 杭州一知智能科技有限公司 Method for summarizing videos by utilizing layered self-attention network combined with video description
CN109948691A (en) * 2019-03-14 2019-06-28 Image description generation method and device based on depth residual error network and attention
US11877016B2 (en) 2019-04-17 2024-01-16 Microsoft Technology Licensing, Llc Live comments generating
CN111836111A (en) * 2019-04-17 2020-10-27 微软技术许可有限责任公司 Technique for generating barrage
CN110188360B (en) * 2019-06-06 2023-04-25 北京百度网讯科技有限公司 Model training method and device
CN110188360A (en) * 2019-06-06 2019-08-30 北京百度网讯科技有限公司 Model training method and device
CN110225368B (en) * 2019-06-27 2020-07-10 腾讯科技(深圳)有限公司 Video positioning method and device and electronic equipment
CN110225368A (en) * 2019-06-27 2019-09-10 腾讯科技(深圳)有限公司 A kind of video locating method, device and electronic equipment
CN110688446A (en) * 2019-08-23 2020-01-14 重庆兆光科技股份有限公司 Sentence meaning mathematical space representation method, system, medium and equipment
CN110717054B (en) * 2019-09-16 2022-07-15 清华大学 Method and system for generating video by crossing modal characters based on dual learning
CN110717054A (en) * 2019-09-16 2020-01-21 清华大学 Method and system for generating video by crossing modal characters based on dual learning
CN111368870A (en) * 2019-10-31 2020-07-03 杭州电子科技大学 Video time sequence positioning method based on intra-modal collaborative multi-linear pooling
CN111368870B (en) * 2019-10-31 2023-09-05 杭州电子科技大学 Video time sequence positioning method based on inter-modal cooperative multi-linear pooling
CN110933518B (en) * 2019-12-11 2020-10-02 浙江大学 Method for generating query-oriented video abstract by using convolutional multi-layer attention network mechanism
CN110933518A (en) * 2019-12-11 2020-03-27 浙江大学 Method for generating query-oriented video abstract by using convolutional multi-layer attention network mechanism
CN112015955A (en) * 2020-09-01 2020-12-01 清华大学 Multi-mode data association method and device
CN112015955B (en) * 2020-09-01 2021-07-30 清华大学 Multi-mode data association method and device
CN112560811A (en) * 2021-02-19 2021-03-26 中国科学院自动化研究所 End-to-end automatic detection research method for audio-video depression
US11963771B2 (en) 2021-02-19 2024-04-23 Institute Of Automation, Chinese Academy Of Sciences Automatic depression detection method based on audio-video

Also Published As

Publication number Publication date
WO2019205562A1 (en) 2019-10-31

Similar Documents

Publication Publication Date Title
CN108647255A (en) Video temporal sentence localization method and device based on attention regression
CN108052937B (en) Based on Weakly supervised character machining device training method, device, system and medium
CN111738111B (en) Road extraction method of high-resolution remote sensing image based on multi-branch cascade cavity space pyramid
CN109271646A (en) Text interpretation method, device, readable storage medium storing program for executing and computer equipment
CN108665506A (en) Image processing method, device, computer storage media and server
CN109635204A (en) Online recommender system based on collaborative filtering and length memory network
CN109671102A (en) A kind of composite type method for tracking target based on depth characteristic fusion convolutional neural networks
US20220092297A1 (en) Facial beauty prediction method and device based on multi-task migration
CN105740984A (en) Product concept performance evaluation method based on performance prediction
CN112199600A (en) Target object identification method and device
CN114925238B (en) Federal learning-based video clip retrieval method and system
CN110442741A (en) A kind of mutual search method of cross-module state picture and text for merging and reordering based on tensor
US20230368500A1 (en) Time-series image description method for dam defects based on local self-attention
CN113870312B (en) Single target tracking method based on twin network
CN113255701B (en) Small sample learning method and system based on absolute-relative learning framework
Huang et al. An incremental SAR target recognition framework via memory-augmented weight alignment and enhancement discrimination
CN117213470A (en) Multi-machine fragment map aggregation updating method and system
CN111651577A (en) Cross-media data association analysis model training method, data association analysis method and system
CN117453949A (en) Video positioning method and device
CN115017377B (en) Method, device and computing equipment for searching target model
Chen et al. Application of Data‐Driven Iterative Learning Algorithm in Transmission Line Defect Detection
Chaalal et al. Mobility prediction for aerial base stations for a coverage extension in 5G networks
Zha et al. [Retracted] Research on the Prediction of Port Economic Synergy Development Trend Based on Deep Neural Networks
CN113239219A (en) Image retrieval method, system, medium and equipment based on multi-modal query
CN115935001A (en) Frame-level fine-grained natural language video time positioning method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication (application publication date: 20181012)