CN114064967A - Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network - Google Patents

Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network

Info

Publication number
CN114064967A
Authority
CN
China
Prior art keywords
video
cross
representation
loss
query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210052687.8A
Other languages
Chinese (zh)
Other versions
CN114064967B (en)
Inventor
王聪
鲍虎军
宋明黎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Lab
Original Assignee
Zhejiang Lab
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Lab filed Critical Zhejiang Lab
Priority to CN202210052687.8A priority Critical patent/CN114064967B/en
Publication of CN114064967A publication Critical patent/CN114064967A/en
Application granted granted Critical
Publication of CN114064967B publication Critical patent/CN114064967B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/73 Querying
    • G06F16/735 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/7867 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/134 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
    • H04N19/146 Data rate or code amount at the encoder output
    • H04N19/149 Data rate or code amount at the encoder output by estimating the code amount by means of a model, e.g. mathematical model or statistical model
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/20 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
    • H04N19/21 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding with binary alpha-plane coding for video objects, e.g. context-based arithmetic encoding [CAE]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Mathematical Analysis (AREA)
  • Mathematical Optimization (AREA)
  • Pure & Applied Mathematics (AREA)
  • Algebra (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a cross-modal time sequence behavior positioning method and device of a multi-granularity cascade interactive network, which are used to solve the problem of temporal behavior localization based on a given text query in an untrimmed video. The invention implements a novel multi-granularity cascade cross-modal interaction network in which cascaded cross-modal interaction is carried out in a coarse-to-fine manner to improve the cross-modal alignment capability of the model. In addition, the invention introduces a local-global context-aware video encoder to improve the contextual temporal dependency modeling capability of the video encoder. The method is simple and flexible, excels at improving vision-language cross-modal alignment precision, and markedly improves the temporal localization accuracy of the trained model on paired video-query test data.

Description

Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network
Technical Field
The invention relates to the field of vision-language cross-modal learning, in particular to a cross-modal time sequence behavior positioning method and device.
Background
With the rapid development of multimedia and network technologies and the growing deployment of large-scale video surveillance in traffic, campus, shopping-mall and similar settings, the volume of video data is growing geometrically, and video understanding has become an important and urgent problem. Temporal behavior localization is a foundation and key component of video understanding. Research on temporal behavior localization based on the visual modality alone restricts the behaviors to be localized to a predefined behavior set; however, behaviors in the real world are complex and diverse, and a predefined behavior set can hardly satisfy real-world requirements. As shown in fig. 1, the visual-language cross-modal temporal behavior localization task takes a text description of a certain behavior in a video as a query and temporally localizes the corresponding behavior segment in the video. Visual-language cross-modal temporal behavior localization is a very natural mode of human-computer interaction, and the technology has broad application prospects in fields such as short-video content retrieval and production, intelligent video surveillance, and human-computer interaction.
With the advance of deep learning, the visual-language cross-modal temporal behavior localization task has attracted great attention from industry and academia. Because a significant semantic gap exists between the heterogeneous text and visual modalities, how to achieve semantic alignment between the modalities is the core problem of cross-modal temporal behavior localization from text to video. Existing visual-language cross-modal temporal behavior localization methods fall mainly into three categories: methods based on candidate segment proposals, proposal-free methods, and methods based on sequential decision making. Visual-language cross-modal alignment is an indispensable link in all three. However, existing methods neither fully exploit multi-granularity text query information in the visual-language cross-modal interaction stage nor fully model the local contextual temporal dependency characteristics of the video in the video representation coding stage.
Disclosure of Invention
In order to solve the defects of the prior art and achieve the purpose of improving the visual-language cross-modal alignment precision in the visual-language cross-modal time sequence behavior positioning task, the invention adopts the following technical scheme:
a cross-modal time sequence behavior positioning method of a multi-granularity cascade interactive network comprises the following steps:
step S1: giving an unclipped video sample, performing initial extraction of video representation by using a visual pre-training model, and performing context-aware time sequence dependent coding on the initially extracted video representation in a local-global mode to obtain a final video representation, so that the context time sequence dependent modeling capability of the video representation is improved;
step S2: for text query corresponding to an untrimmed video, performing word embedding initialization on each word in a query text by adopting a pre-trained word embedding model, and then performing context coding by adopting a multi-layer bidirectional long-time memory network to obtain a word-level representation and a global-level representation of the text query;
step S3: for the extracted video representation and the text query representation, a multi-granularity cascade interaction network is adopted to carry out interaction between a video modality and a text query modality, so that an enhanced video representation guided by query is obtained, and the cross-modality alignment precision is improved;
step S4: for the video representation obtained after multi-granularity cascade interaction, predicting the time sequence position of a text query corresponding target video fragment by adopting an attention-based time sequence position regression module;
step S5: for the cross-modal time sequence behavior positioning model based on the multi-granularity cascade interaction network formed in the steps S1-S4, training of the model is carried out by utilizing a training sample set, and a total loss function adopted in the training comprises attention alignment loss and boundary loss, wherein the boundary loss comprises attention alignment loss and boundary lossInvolving smoothing
Figure 645443DEST_PATH_IMAGE001
The loss and the time sequence generalized intersection are better adapted to the evaluation criterion of the time sequence positioning task than the loss, and the training sample set is composed of a plurality of { video, query, target video segment time sequence position mark } triple samples.
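For orientation, the following is a minimal PyTorch-style sketch of the forward pass described in steps S1-S4. It is illustrative only: the local shifted-window blocks of step S1 are replaced by a standard transformer encoder, the cross-modal decoding blocks of step S3 by standard transformer decoder layers, the global-level query token is obtained by mean pooling instead of the BLSTM hidden-state concatenation of step S2, and all module names and dimensions are assumptions rather than the implementation of the invention.

```python
import torch
import torch.nn as nn

class MGCIPipelineSketch(nn.Module):
    """Simplified end-to-end sketch of steps S1-S4 (illustrative, not the patented implementation)."""
    def __init__(self, video_dim=4096, word_dim=300, d_model=256, heads=8):
        super().__init__()
        # S1: linear projection of off-line frame features + context encoder stand-in
        self.video_proj = nn.Linear(video_dim, d_model)
        self.video_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, heads, batch_first=True), num_layers=2)
        # S2: bidirectional LSTM over word embeddings
        self.query_lstm = nn.LSTM(word_dim, d_model // 2, num_layers=2,
                                  bidirectional=True, batch_first=True)
        # S3: video-guided query decoding and cascaded fusion (decoder stand-ins)
        self.query_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, heads, batch_first=True), num_layers=2)
        self.fusion_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, heads, batch_first=True), num_layers=2)
        # S4: attention-based temporal position regression
        self.attn_mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))
        self.reg_mlp = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 2))

    def forward(self, frames, words):
        # frames: (B, T, video_dim) off-line frame features; words: (B, N, word_dim) query embeddings
        v = self.video_encoder(self.video_proj(frames))                 # S1: context-aware video repr.
        w, _ = self.query_lstm(words)                                   # S2: word-level query repr.
        q_global = w.mean(dim=1, keepdim=True)                          # global-level query (stand-in)
        q = self.query_decoder(torch.cat([q_global, w], dim=1), v)      # S3.1: video-guided query
        coarse = v * q[:, :1]                                           # S3.2: coarse element-wise fusion
        enhanced = self.fusion_decoder(coarse, q[:, 1:])                # S3.2: fine-grained fusion
        a = torch.softmax(self.attn_mlp(enhanced).squeeze(-1), dim=1)   # S4: temporal attention scores
        f = (a.unsqueeze(-1) * enhanced).sum(dim=1)                     # attention pooling
        center, duration = self.reg_mlp(f).sigmoid().unbind(dim=-1)     # normalized center and duration
        return center, duration, a

# usage: c, d, a = MGCIPipelineSketch()(torch.randn(2, 256, 4096), torch.randn(2, 12, 300))
```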
Further, in step S1, based on the visual pre-training model, the video frame features are extracted off-line and $T$ frames are uniformly sampled; a set of video representations $V=\{v_1, v_2, \dots, v_T\}$ is then obtained through a linear transformation layer, where $v_i$ is the representation of the i-th frame of the video, and context-aware temporal dependency coding is performed on the video representation $V$ in a local-global manner.
Further, in the local-global context-aware coding in step S1, local context-aware coding is first performed on the video representation $V$ to obtain the video representation $V^{L}$; global context-aware coding is then performed on $V^{L}$ to obtain the video representation $V^{G}$.
Further, the local context-aware coding and the global context-aware coding in step S1 are respectively implemented as follows:
Step S1.1: local context-aware coding adopts a group of successive local transformer blocks equipped with one-dimensional shifted windows. The video representation $V$ is taken as the initial representation and fed into the first block, the result obtained is fed into the second block, and so on, and the output of the last block is taken as the video representation $V^{L}$ output by local context-aware coding. The operations inside a successive local transformer block with one-dimensional shifted windows are as follows:
the input video representation $z^{l-1}$ is layer-normalized, passed through a one-dimensional window multi-head self-attention module, and added to $z^{l-1}$ to obtain the video representation $\hat{z}^{l}$; $\hat{z}^{l}$ is layer-normalized, passed through a multi-layer perceptron, and added to $\hat{z}^{l}$ to obtain the video representation $z^{l}$; $z^{l}$ is layer-normalized, passed through a one-dimensional shifted-window multi-head self-attention module, and added to $z^{l}$ to obtain the video representation $\hat{z}^{l+1}$; $\hat{z}^{l+1}$ is layer-normalized, passed through a multi-layer perceptron, and added to $\hat{z}^{l+1}$, yielding the video representation $z^{l+1}$ as the output of the block, where $l$ indexes the successive local transformer blocks equipped with one-dimensional shifted windows.
Specifically, the $l$-th successive local transformer block equipped with one-dimensional shifted windows is expressed as:
$$\hat{z}^{l} = \text{1D-W-MSA}(\text{LN}(z^{l-1})) + z^{l-1}$$
$$z^{l} = \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l}$$
$$\hat{z}^{l+1} = \text{1D-SW-MSA}(\text{LN}(z^{l})) + z^{l}$$
$$z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}$$
where $\text{LN}$ denotes layer normalization, $\text{1D-W-MSA}$ is the one-dimensional window multi-head self-attention module, $\text{MLP}$ is the multi-layer perceptron, and $\text{1D-SW-MSA}$ is the one-dimensional shifted-window multi-head self-attention module.
Step S1.2: global context-aware coding comprises a group of conventional transformer blocks. The video representation $V^{L}$ is taken as the initial representation and fed into the first conventional transformer block, the result obtained is fed into the second conventional transformer block, and so on, and the output of the last conventional transformer block is taken as the video representation $V^{G}$ output by global context-aware coding. The operations inside a conventional transformer block are as follows:
the input video representation $u^{k-1}$ is passed through a conventional multi-head self-attention module, added to $u^{k-1}$, and then layer-normalized to obtain the video representation $\hat{u}^{k}$; $\hat{u}^{k}$ is passed through a multi-layer perceptron, added to $\hat{u}^{k}$, and then layer-normalized to obtain the video representation $u^{k}$ as the output of the conventional transformer block, where $k$ indexes the conventional transformer blocks.
Specifically, the $k$-th conventional transformer block is expressed as:
$$\hat{u}^{k} = \text{LN}(\text{MSA}(u^{k-1}) + u^{k-1})$$
$$u^{k} = \text{LN}(\text{MLP}(\hat{u}^{k}) + \hat{u}^{k})$$
where $\text{MSA}$ is the conventional multi-head self-attention module, $\text{LN}$ denotes layer normalization, and $\text{MLP}$ is the multi-layer perceptron.
Further, in step S2, each word in the query text corresponds to a learnable word embedding vector, which is initialized with the pre-trained word embedding model, giving the embedding vector sequence of the text query $E=\{e_1, e_2, \dots, e_N\}$, where $e_i$ is the embedding of the i-th word of the query. The embedding vector sequence $E$ is context-encoded with a multi-layer bidirectional long short-term memory network (BLSTM) to obtain the word-level text query representation $Q^{W}$; the forward and backward final hidden state vectors of the BLSTM are concatenated to obtain the global-level text query representation $q^{G}$, and the text query representation $Q=\{Q^{W}, q^{G}\}$ is finally obtained. The specific implementation is:
$$Q^{W} = \text{BLSTM}(E)$$
$$q^{G} = [\overrightarrow{h} ; \overleftarrow{h}]$$
where $[\overrightarrow{h} ; \overleftarrow{h}]$ denotes the concatenation of the forward hidden state vector $\overrightarrow{h}$ and the backward hidden state vector $\overleftarrow{h}$ of the BLSTM.
Further, in the multi-granularity cascade interaction network in step S3, the video representation $V^{G}$ and the text query representation $Q$ first pass through video-guided query decoding to obtain the video-guided query representation $\tilde{Q}=\{\tilde{q}^{G}, \tilde{Q}^{W}\}$, where $\tilde{q}^{G}$ denotes the global-level video-guided query representation and $\tilde{Q}^{W}$ denotes the word-level video-guided query representation; the video-guided query representation $\tilde{Q}$ and the video modality representation $V^{G}$ then pass through cascaded cross-modal fusion to finally obtain the enhanced video representation. Video-guided query decoding serves to narrow the semantic gap between the video representation $V^{G}$ and the text query representation $Q$.
Further, step S3 comprises the following steps:
Step S3.1: video-guided query decoding adopts a group of cross-modal decoding blocks. The text query representation $Q$ is taken as the initial representation and fed into the first cross-modal decoding block, the result obtained is fed into the second cross-modal decoding block, and so on, and the output of the last cross-modal decoding block is taken as the video-guided query representation $\tilde{Q}$. The internal operations of a cross-modal decoding block in step S3.1 are as follows:
the input text query representation $x^{m-1}$ passes through a multi-head self-attention module to obtain the text query representation $\hat{x}^{m}$; with $\hat{x}^{m}$ as the query and the video representation $V^{G}$ as the keys and values, a multi-head cross-attention module yields the text query representation $\bar{x}^{m}$; $\bar{x}^{m}$ passes through a conventional feed-forward network to yield the text query representation $x^{m}$ as the output of the cross-modal decoding block, where $m$ indexes the cross-modal decoding blocks.
Specifically, the $m$-th cross-modal decoding block is expressed as:
$$\hat{x}^{m} = \text{MSA}(x^{m-1})$$
$$\bar{x}^{m} = \text{MCA}(\hat{x}^{m}, V^{G}, V^{G})$$
$$x^{m} = \text{FFN}(\bar{x}^{m})$$
where $\text{MSA}$ and $\text{MCA}$ are the multi-head self-attention module and the multi-head cross-attention module respectively, and $\text{FFN}$ is a conventional feed-forward network.
Step S3.2: cascaded cross-modal fusion. First, the global-level video-guided query representation $\tilde{q}^{G}$ and the video modality representation $V^{G}$ are fused at the coarse-grained level through element-wise multiplication to obtain the coarse-grained fused video representation $\bar{V}$; the word-level video-guided query representation $\tilde{Q}^{W}$ and the coarse-grained fused video representation $\bar{V}$ are then fused at the fine-grained level through another group of cross-modal decoding blocks: $\bar{V}$ is taken as the initial representation and fed into the first cross-modal decoding block, the result obtained is fed into the second cross-modal decoding block, and so on, and the output of the last cross-modal decoding block is taken as the enhanced video representation $\tilde{V}$. The internal operations of a cross-modal decoding block in step S3.2 are as follows:
the input video representation $y^{n-1}$ passes through a multi-head self-attention module to obtain the video representation $\hat{y}^{n}$; with $\hat{y}^{n}$ as the query and the word-level video-guided query representation $\tilde{Q}^{W}$ as the keys and values, a multi-head cross-attention module yields the video representation $\bar{y}^{n}$; $\bar{y}^{n}$ passes through a conventional feed-forward network to yield the video representation $y^{n}$ as the output of the cross-modal decoding block, where $n$ indexes the cross-modal decoding blocks. Cross-modal fusion at the coarse-grained level serves to suppress background video frames and emphasize foreground video frames, and can be expressed as $\bar{V} = V^{G} \odot \tilde{q}^{G}$, where $\odot$ denotes element-wise multiplication.
The $n$-th cross-modal decoding block is expressed as:
$$\hat{y}^{n} = \text{MSA}(y^{n-1})$$
$$\bar{y}^{n} = \text{MCA}(\hat{y}^{n}, \tilde{Q}^{W}, \tilde{Q}^{W})$$
$$y^{n} = \text{FFN}(\bar{y}^{n})$$
where $\text{MSA}$ and $\text{MCA}$ are the multi-head self-attention module and the multi-head cross-attention module respectively, and $\text{FFN}$ is a conventional feed-forward network.
Further, the attention-based temporal position regression module in step S4 passes the video sequence representation $\tilde{V}$ obtained after multi-granularity cascade interaction through a multi-layer perceptron and a SoftMax activation layer to obtain the temporal attention scores $a$ of the video; the enhanced video representation $\tilde{V}$ and the temporal attention scores $a$ then pass through an attention pooling layer to obtain the representation $f$ of the target segment; finally, the normalized temporal center coordinate $\hat{c}$ and the segment duration $\hat{d}$ of the target segment are directly regressed from the target segment representation $f$ through a multi-layer perceptron.
Specifically, the attention-based temporal position regression is expressed as:
$$a = \text{SoftMax}(\text{MLP}(\tilde{V}))$$
$$f = \sum_{i=1}^{T} a_i \tilde{v}_i$$
$$(\hat{c}, \hat{d}) = \text{MLP}(f)$$
where $\tilde{V}=\{\tilde{v}_1, \dots, \tilde{v}_T\}$ is the enhanced video representation, i.e., the video sequence representation output after multi-granularity cascade interaction, and the attention pooling layer (the second equation) aggregates the video sequence representation.
Further, the training of the model in step S5 comprises the following steps:
Step S5.1: calculate the attention alignment loss $L_{att}$. The logarithm of the temporal attention score of the i-th frame, weighted by the indicator value $m_i$, is accumulated over the sampled frames, and the accumulated result is divided by the accumulation of $m_i$ over the sampled frames to compute the loss $L_{att}$, where $m_i=1$ indicates that the i-th frame of the video lies inside the annotated temporal segment and $m_i=0$ otherwise. The attention alignment loss $L_{att}$ encourages the video frames inside the annotated temporal segment to receive higher attention scores; the specific calculation process can be expressed as:
$$L_{att} = -\frac{\sum_{i=1}^{T} m_i \log a_i}{\sum_{i=1}^{T} m_i}$$
where $T$ denotes the number of sampled frames, $a_i$ denotes the temporal attention score of the i-th frame, and $m_i=1$ indicates that the i-th frame of the video lies inside the annotated temporal segment, otherwise $m_i=0$.
Step S5.2: calculate the boundary loss $L_{b}$, which combines the smooth $\ell_1$ loss $L_{reg}$ and the temporal generalized IoU loss $L_{giou}$. A first smooth $\ell_1$ loss is computed between the normalized temporal center coordinate $\hat{c}$ of the predicted segment and the normalized temporal center coordinate $c$ of the annotated temporal segment; a second smooth $\ell_1$ loss is computed between the segment duration $\hat{d}$ of the predicted segment and the segment duration $d$ of the annotated temporal segment; the sum of the first and second smooth $\ell_1$ losses gives the loss $L_{reg}$. The negative of the generalized IoU between the regressed segment $\hat{s}$ and the corresponding annotated segment $s$, plus 1, is taken as the temporal generalized IoU loss $L_{giou}$. The sum of the loss $L_{reg}$ and the temporal generalized IoU loss $L_{giou}$ is taken as the boundary loss $L_{b}$; the specific calculation process of the boundary loss $L_{b}$ can be expressed as follows:
$$L_{reg} = \text{Smooth}_{\ell_1}(\hat{c} - c) + \text{Smooth}_{\ell_1}(\hat{d} - d)$$
$$L_{giou} = 1 - \left(\text{IoU}(\hat{s}, s) - \frac{|C \setminus (\hat{s} \cup s)|}{|C|}\right)$$
$$L_{b} = L_{reg} + L_{giou}$$
where $\text{Smooth}_{\ell_1}(\cdot)$ denotes the smooth $\ell_1$ loss function, $\text{IoU}(\hat{s}, s)$ denotes the intersection-over-union of the two segments, and $C$ denotes the minimal time span covering the regressed segment $\hat{s}$ and the corresponding annotated segment $s$.
Step S5.3: the weighted sum of the attention alignment loss $L_{att}$ and the boundary loss $L_{b}$ is taken as the total loss of model training.
Specifically, the total loss function $L$ is:
$$L = \lambda_1 L_{att} + \lambda_2 L_{b}$$
where $\lambda_1$ and $\lambda_2$ are weighting hyper-parameters, and the model parameters are updated with the optimizer during the training phase.
The cross-modal time sequence behavior positioning device of the multi-granularity cascade interactive network comprises one or more processors and is used for realizing the cross-modal time sequence behavior positioning method of the multi-granularity cascade interactive network.
The invention has the advantages and beneficial effects that:
the invention discloses a cross-modal time sequence behavior positioning method and device of a multi-granularity cascade interaction network, which fully utilize multi-granularity text query information in a coarse-to-fine mode in a vision-language cross-modal interaction link, fully model the local-global context time sequence dependence characteristic of a video in a video representation coding link, and solve the problem of time sequence behavior positioning based on text query in an untrimmed video. For given untrimmed videos and text queries, the method can improve the vision-language cross-modal alignment precision, and further improve the positioning accuracy of a cross-modal time sequence behavior positioning task.
Drawings
FIG. 1 is an exemplary diagram of a visual-language cross-modal temporal behavior localization task.
FIG. 2 is a block diagram of the cross-modal temporal behavior localization of the multi-granularity cascading interactive network of the present invention.
FIG. 3 is a flowchart of a cross-modal timing behavior localization method of a multi-granularity cascading interactive network according to the present invention.
Fig. 4 is a structural diagram of a cross-mode timing behavior positioning device of a multi-granularity cascade interaction network in the invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
The invention discloses a cross-modal time sequence behavior positioning method and device of a multi-granularity cascade interactive network, which perform vision-language cross-modal temporal behavior localization with a multi-granularity cascade interaction network to solve the problem of localizing, in an untrimmed video, the temporal segment corresponding to a given text query. The method provides a simple and effective multi-granularity cascade cross-modal interaction network to improve the cross-modal alignment capability of the model, and introduces a local-global context-aware video encoder to improve the contextual temporal dependency modeling capability of the video encoder. As a result, the temporal localization accuracy of the trained model can be markedly improved on paired video-query test data.
In the experiments, the cross-modal time sequence behavior positioning method of the multi-granularity cascade interactive network is implemented on the PyTorch framework: video frame features are extracted off-line with a pre-trained C3D network, each video is uniformly sampled into 256 frames, and the number of heads of all self-attention and cross-attention sub-modules is set to 8. The model is trained with the Adam optimizer, with the learning rate fixed at 0.0004 and each batch consisting of 100 video-query pairs. In addition, performance is evaluated with the "R@n, IoU=m" criterion, which measures the percentage of correctly localized queries in the evaluation data set; a query is considered correctly localized if the maximum intersection-over-union (IoU) between its n highest-confidence predicted segments and the ground-truth annotation is greater than m.
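The "R@n, IoU=m" criterion can be computed as in the following sketch, here specialized to n = 1 (one predicted segment per query, as in the attention-based regression of step S4); the (start, end) segment format in seconds is an assumption.

```python
def temporal_iou(pred, gt):
    """Temporal IoU of two segments given as (start, end) pairs."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_1(predictions, ground_truths, m=0.5):
    """Percentage of queries whose predicted segment reaches IoU > m with the annotation."""
    hits = sum(temporal_iou(p, g) > m for p, g in zip(predictions, ground_truths))
    return 100.0 * hits / len(ground_truths)

# e.g. recall_at_1([(4.2, 9.8)], [(5.0, 10.0)], m=0.5) -> 100.0
```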
In a particular embodiment, an untrimmed video $\mathcal{V}$ is uniformly sampled into a sequence of $T$ video frames, and a text description $S$ of a behavior segment in the video $\mathcal{V}$ is given. The visual-language cross-modal temporal behavior localization task is to predict the start time $\tau_s$ and the end time $\tau_e$ of the video segment in $\mathcal{V}$ corresponding to the text description $S$. The training data set for the task may be defined as $\{(\mathcal{V}_j, S_j, \tau_s^{j}, \tau_e^{j})\}_{j=1}^{M}$, where $\tau_s^{j}$ and $\tau_e^{j}$ are the ground-truth annotations of the start and end times of the target video segment.
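The triplet structure of a training sample and the relation between the (start, end) annotation and the normalized (center, duration) regression targets used in steps S4-S5 can be sketched as follows; the field names and the normalization by the video duration are illustrative assumptions.

```python
import torch
from dataclasses import dataclass

@dataclass
class TrainingSample:
    video_features: torch.Tensor   # (T, D) off-line frame features of the untrimmed video
    query: str                     # text description S of one behavior segment
    start: float                   # ground-truth start time tau_s (seconds)
    end: float                     # ground-truth end time tau_e (seconds)
    duration: float                # total duration of the untrimmed video (seconds)

def to_center_duration(s: TrainingSample):
    """Normalized temporal center c and segment duration d used as regression targets."""
    return 0.5 * (s.start + s.end) / s.duration, (s.end - s.start) / s.duration

def to_start_end(c: float, d: float, video_duration: float):
    """Invert a (center, duration) prediction back to a (start, end) segment in seconds."""
    return (c - 0.5 * d) * video_duration, (c + 0.5 * d) * video_duration
```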
As shown in fig. 2 and fig. 3, the cross-modal time sequence behavior positioning method of the multi-granularity cascade interaction network includes the following steps:
Step S1: given an untrimmed video sample, initial extraction of the video representation is performed with a visual pre-training model, and context-aware temporal dependency coding is performed on the initially extracted video representation in a local-global manner to obtain the final video representation, thereby improving the contextual temporal dependency modeling capability of the video representation.
In step S1, based on the visual pre-training model, the video frame features are extracted off-line and $T$ frames are uniformly sampled; a set of video representations $V=\{v_1, v_2, \dots, v_T\}$ is then obtained through a linear transformation layer, where $v_i$ is the representation of the i-th frame of the video, and context-aware temporal dependency coding is performed on the video representation $V$ in a local-global manner.
In the local-global context-aware coding in S1, local context-aware coding is first performed on the video representation $V$ to obtain the video representation $V^{L}$; global context-aware coding is then performed on $V^{L}$ to obtain the video representation $V^{G}$.
The local context-aware coding and the global context-aware coding in step S1 are implemented as follows:
Step S1.1: local context-aware coding adopts a group of successive local transformer blocks equipped with one-dimensional shifted windows. The video representation $V$ is taken as the initial representation and fed into the first block, the result obtained is fed into the second block, and so on, and the output of the last block is taken as the video representation $V^{L}$ output by local context-aware coding. The operations inside a successive local transformer block with one-dimensional shifted windows are as follows:
the input video representation $z^{l-1}$ is layer-normalized, passed through a one-dimensional window multi-head self-attention module, and added to $z^{l-1}$ to obtain the video representation $\hat{z}^{l}$; $\hat{z}^{l}$ is layer-normalized, passed through a multi-layer perceptron, and added to $\hat{z}^{l}$ to obtain the video representation $z^{l}$; $z^{l}$ is layer-normalized, passed through a one-dimensional shifted-window multi-head self-attention module, and added to $z^{l}$ to obtain the video representation $\hat{z}^{l+1}$; $\hat{z}^{l+1}$ is layer-normalized, passed through a multi-layer perceptron, and added to $\hat{z}^{l+1}$, yielding the video representation $z^{l+1}$ as the output of the block, where $l$ indexes the successive local transformer blocks equipped with one-dimensional shifted windows.
Specifically, the $l$-th successive local transformer block equipped with one-dimensional shifted windows is expressed as:
$$\hat{z}^{l} = \text{1D-W-MSA}(\text{LN}(z^{l-1})) + z^{l-1}$$
$$z^{l} = \text{MLP}(\text{LN}(\hat{z}^{l})) + \hat{z}^{l}$$
$$\hat{z}^{l+1} = \text{1D-SW-MSA}(\text{LN}(z^{l})) + z^{l}$$
$$z^{l+1} = \text{MLP}(\text{LN}(\hat{z}^{l+1})) + \hat{z}^{l+1}$$
where $\text{LN}$ denotes layer normalization, $\text{1D-W-MSA}$ is the one-dimensional window multi-head self-attention module, $\text{MLP}$ is the multi-layer perceptron, and $\text{1D-SW-MSA}$ is the one-dimensional shifted-window multi-head self-attention module.
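A PyTorch sketch of one such block is given below. It is an assumption-laden illustration: the window size is not specified in the text and is set to 16 here, and the attention mask that would normally exclude frames wrapped around by the cyclic shift is omitted for brevity.

```python
import torch
import torch.nn as nn

class Local1DShiftedWindowBlock(nn.Module):
    """Sketch of a successive local transformer block with 1D shifted windows (step S1.1)."""
    def __init__(self, dim=256, heads=8, window=16):
        super().__init__()
        self.window = window
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm3, self.norm4 = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.w_msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.sw_msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp1 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp2 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def _window_attention(self, attn, x):
        b, t, d = x.shape                                        # T assumed divisible by the window size
        xw = x.reshape(b * (t // self.window), self.window, d)   # partition frames into 1D windows
        out, _ = attn(xw, xw, xw)                                # self-attention inside each window
        return out.reshape(b, t, d)

    def forward(self, z):
        # z: (B, T, dim) video representation, e.g. T = 256 uniformly sampled frames
        z = self._window_attention(self.w_msa, self.norm1(z)) + z         # 1D-W-MSA branch
        z = self.mlp1(self.norm2(z)) + z
        shifted = torch.roll(z, shifts=-self.window // 2, dims=1)         # shift the window partition
        attended = self._window_attention(self.sw_msa, self.norm3(shifted))
        z = torch.roll(attended, shifts=self.window // 2, dims=1) + z     # 1D-SW-MSA branch
        z = self.mlp2(self.norm4(z)) + z
        return z

# y = Local1DShiftedWindowBlock()(torch.randn(2, 256, 256))
```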
Step S1.2, global context-aware coding includes a set of conventional transformer blocks, characterizing the video
$V^{L}$ as the initial representation: $V^{L}$ is fed into the first conventional transformer block, the result obtained is fed into the second conventional transformer block, and so on, and the output of the last conventional transformer block is taken as the video representation $V^{G}$ output by global context-aware coding. The operations inside a conventional transformer block are as follows:
the input video representation $u^{k-1}$ is passed through a conventional multi-head self-attention module, added to $u^{k-1}$, and then layer-normalized to obtain the video representation $\hat{u}^{k}$; $\hat{u}^{k}$ is passed through a multi-layer perceptron, added to $\hat{u}^{k}$, and then layer-normalized to obtain the video representation $u^{k}$ as the output of the conventional transformer block, where $k$ indexes the conventional transformer blocks.
Specifically, the $k$-th conventional transformer block is expressed as:
$$\hat{u}^{k} = \text{LN}(\text{MSA}(u^{k-1}) + u^{k-1})$$
$$u^{k} = \text{LN}(\text{MLP}(\hat{u}^{k}) + \hat{u}^{k})$$
where $\text{MSA}$ is the conventional multi-head self-attention module, $\text{LN}$ denotes layer normalization, and $\text{MLP}$ is the multi-layer perceptron.
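Note that the ordering "residual addition followed by layer normalization" in the two equations above is the post-norm transformer block; the sketch below writes it out explicitly (hidden sizes are assumptions), and a stack of nn.TransformerEncoderLayer modules with the default norm_first=False setting follows the same ordering.

```python
import torch
import torch.nn as nn

class GlobalTransformerBlock(nn.Module):
    """u_hat = LN(MSA(u) + u); u = LN(MLP(u_hat) + u_hat) -- a standard post-norm block (step S1.2)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, u):
        u = self.norm1(self.msa(u, u, u)[0] + u)   # global multi-head self-attention over all frames
        return self.norm2(self.mlp(u) + u)

# v_global = nn.Sequential(GlobalTransformerBlock(), GlobalTransformerBlock())(torch.randn(2, 256, 256))
```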
Step S2: for text query corresponding to an untrimmed video, performing word embedding initialization on each word in a query text by adopting a pre-trained word embedding model, and then performing context coding by adopting a multi-layer bidirectional long-time memory network to obtain a word-level representation and a global-level representation of the text query;
in step S2, the learnable word embedded vector corresponding to each word in the text is queried, and the word embedded model is initialized using the pre-trained word embedded model to obtain an embedded vector sequence of the text query
$E=\{e_1, e_2, \dots, e_N\}$, where $e_i$ is the embedding of the i-th word of the query. The embedding vector sequence $E$ is context-encoded with a multi-layer bidirectional long short-term memory network (BLSTM) to obtain the word-level text query representation $Q^{W}$; the forward and backward final hidden state vectors of the BLSTM are concatenated to obtain the global-level text query representation $q^{G}$, and the text query representation $Q=\{Q^{W}, q^{G}\}$ is finally obtained. The specific implementation is:
$$Q^{W} = \text{BLSTM}(E)$$
$$q^{G} = [\overrightarrow{h} ; \overleftarrow{h}]$$
where $[\overrightarrow{h} ; \overleftarrow{h}]$ denotes the concatenation of the forward hidden state vector $\overrightarrow{h}$ and the backward hidden state vector $\overleftarrow{h}$ of the BLSTM.
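A sketch of the step S2 query encoder follows. The assumption made here is that the global-level representation concatenates the last layer's final forward and backward hidden states of the BLSTM; vocabulary size, embedding size, and hidden size are illustrative.

```python
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    """Word-level and global-level text query representations from a BLSTM (step S2 sketch)."""
    def __init__(self, vocab_size=10000, embed_dim=300, hidden=128, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # initialized from a pre-trained word embedding model
        self.blstm = nn.LSTM(embed_dim, hidden, num_layers=layers,
                             bidirectional=True, batch_first=True)

    def forward(self, tokens):
        # tokens: (B, N) word indices of the text query
        e = self.embed(tokens)                       # (B, N, embed_dim) embedding vector sequence E
        word_level, (h_n, _) = self.blstm(e)         # word-level representation Q^W: (B, N, 2*hidden)
        # h_n: (num_layers * 2, B, hidden); last two entries are the final forward/backward states
        global_level = torch.cat([h_n[-2], h_n[-1]], dim=-1)   # global-level representation q^G: (B, 2*hidden)
        return word_level, global_level

# w, g = QueryEncoder()(torch.randint(0, 10000, (2, 12)))
```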
Step S3: for the extracted video representation and the text query representation, a multi-granularity cascade interaction network is adopted to carry out interaction between a video modality and a text query modality, so that an enhanced video representation guided by query is obtained, and the cross-modality alignment precision is improved;
in the multi-granularity cascade interactive network in the step S3, firstly, the video is characterized
by $V^{G}$; the video representation $V^{G}$ and the text query representation $Q$ first pass through video-guided query decoding to obtain the video-guided query representation $\tilde{Q}=\{\tilde{q}^{G}, \tilde{Q}^{W}\}$, where $\tilde{q}^{G}$ denotes the global-level video-guided query representation and $\tilde{Q}^{W}$ denotes the word-level video-guided query representation; the video-guided query representation $\tilde{Q}$ and the video modality representation $V^{G}$ then pass through cascaded cross-modal fusion to finally obtain the enhanced video representation. Video-guided query decoding serves to narrow the semantic gap between the video representation $V^{G}$ and the text query representation $Q$.
Step S3 specifically comprises the following steps:
Step S3.1: video-guided query decoding adopts a group of cross-modal decoding blocks. The text query representation $Q$ is taken as the initial representation and fed into the first cross-modal decoding block, the result obtained is fed into the second cross-modal decoding block, and so on, and the output of the last cross-modal decoding block is taken as the video-guided query representation $\tilde{Q}$. The internal operations of a cross-modal decoding block in step S3.1 are as follows:
the input text query representation $x^{m-1}$ passes through a multi-head self-attention module to obtain the text query representation $\hat{x}^{m}$; with $\hat{x}^{m}$ as the query and the video representation $V^{G}$ as the keys and values, a multi-head cross-attention module yields the text query representation $\bar{x}^{m}$; $\bar{x}^{m}$ passes through a conventional feed-forward network to yield the text query representation $x^{m}$ as the output of the cross-modal decoding block, where $m$ indexes the cross-modal decoding blocks.
Specifically, the $m$-th cross-modal decoding block is expressed as:
$$\hat{x}^{m} = \text{MSA}(x^{m-1})$$
$$\bar{x}^{m} = \text{MCA}(\hat{x}^{m}, V^{G}, V^{G})$$
$$x^{m} = \text{FFN}(\bar{x}^{m})$$
where $\text{MSA}$ and $\text{MCA}$ are the multi-head self-attention module and the multi-head cross-attention module respectively, and $\text{FFN}$ is a conventional feed-forward network.
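The following sketch follows the three equations of the cross-modal decoding block literally; any residual connections or layer normalization used in the actual implementation are not spelled out in the text and are therefore omitted here, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

class CrossModalDecodingBlock(nn.Module):
    """x_hat = MSA(x); x_bar = MCA(x_hat, V, V); x = FFN(x_bar) -- step S3.1 sketch."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.msa = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mca = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

    def forward(self, query_repr, video_repr):
        # query_repr: (B, 1+N, dim) global + word-level query tokens; video_repr: (B, T, dim) V^G
        x, _ = self.msa(query_repr, query_repr, query_repr)   # multi-head self-attention over query tokens
        x, _ = self.mca(x, video_repr, video_repr)            # video representation as keys and values
        return self.ffn(x)                                    # video-guided query representation

# q_guided = CrossModalDecodingBlock()(torch.randn(2, 13, 256), torch.randn(2, 256, 256))
```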
Step S3.2: cascaded cross-modal fusion. First, the global-level video-guided query representation $\tilde{q}^{G}$ and the video modality representation $V^{G}$ are fused at the coarse-grained level through element-wise multiplication to obtain the coarse-grained fused video representation $\bar{V}$; the word-level video-guided query representation $\tilde{Q}^{W}$ and the coarse-grained fused video representation $\bar{V}$ are then fused at the fine-grained level through another group of cross-modal decoding blocks: $\bar{V}$ is taken as the initial representation and fed into the first cross-modal decoding block, the result obtained is fed into the second cross-modal decoding block, and so on, and the output of the last cross-modal decoding block is taken as the enhanced video representation $\tilde{V}$. The internal operations of a cross-modal decoding block in step S3.2 are as follows:
the input video representation $y^{n-1}$ passes through a multi-head self-attention module to obtain the video representation $\hat{y}^{n}$; with $\hat{y}^{n}$ as the query and the word-level video-guided query representation $\tilde{Q}^{W}$ as the keys and values, a multi-head cross-attention module yields the video representation $\bar{y}^{n}$; $\bar{y}^{n}$ passes through a conventional feed-forward network to yield the video representation $y^{n}$ as the output of the cross-modal decoding block, where $n$ indexes the cross-modal decoding blocks. Cross-modal fusion at the coarse-grained level serves to suppress background video frames and emphasize foreground video frames, and can be expressed as $\bar{V} = V^{G} \odot \tilde{q}^{G}$, where $\odot$ denotes element-wise multiplication.
The $n$-th cross-modal decoding block is expressed as:
$$\hat{y}^{n} = \text{MSA}(y^{n-1})$$
$$\bar{y}^{n} = \text{MCA}(\hat{y}^{n}, \tilde{Q}^{W}, \tilde{Q}^{W})$$
$$y^{n} = \text{FFN}(\bar{y}^{n})$$
where $\text{MSA}$ and $\text{MCA}$ are the multi-head self-attention module and the multi-head cross-attention module respectively, and $\text{FFN}$ is a conventional feed-forward network.
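A sketch of the cascaded fusion of step S3.2 is given below. Standard nn.TransformerDecoderLayer modules are used as stand-ins for the fine-grained cross-modal decoding blocks (they perform self-attention, cross-attention with the given memory as keys and values, and a feed-forward network, plus residuals and normalization that the text does not specify).

```python
import torch
import torch.nn as nn

class CascadedCrossModalFusion(nn.Module):
    """Coarse-grained element-wise fusion followed by fine-grained cross-modal decoding (step S3.2 sketch)."""
    def __init__(self, dim=256, heads=8, blocks=2):
        super().__init__()
        self.fine_blocks = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(dim, heads, batch_first=True), num_layers=blocks)

    def forward(self, video_repr, q_global, q_words):
        # video_repr: (B, T, dim) V^G; q_global: (B, dim) global-level guide; q_words: (B, N, dim) word-level guide
        coarse = video_repr * q_global.unsqueeze(1)     # suppress background frames, emphasize foreground frames
        return self.fine_blocks(coarse, q_words)        # (B, T, dim) enhanced video representation

# out = CascadedCrossModalFusion()(torch.randn(2, 256, 256), torch.randn(2, 256), torch.randn(2, 12, 256))
```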
Step S4: for the video representation obtained after multi-granularity cascade interaction, predicting the time sequence position of a text query corresponding target video fragment by adopting an attention-based time sequence position regression module;
the attention-based time sequence position regression module in the step S4 characterizes the video sequence subjected to the multi-granularity cascade interaction
as $\tilde{V}$ and passes it through a multi-layer perceptron and a SoftMax activation layer to obtain the temporal attention scores $a$ of the video; the enhanced video representation $\tilde{V}$ and the temporal attention scores $a$ then pass through an attention pooling layer to obtain the representation $f$ of the target segment; finally, the normalized temporal center coordinate $\hat{c}$ and the segment duration $\hat{d}$ of the target segment are directly regressed from the target segment representation $f$ through a multi-layer perceptron.
The attention-based temporal position regression is expressed as:
$$a = \text{SoftMax}(\text{MLP}(\tilde{V}))$$
$$f = \sum_{i=1}^{T} a_i \tilde{v}_i$$
$$(\hat{c}, \hat{d}) = \text{MLP}(f)$$
where $\tilde{V}=\{\tilde{v}_1, \dots, \tilde{v}_T\}$ is the enhanced video representation, i.e., the video sequence representation output after multi-granularity cascade interaction, and the attention pooling layer (the second equation) aggregates the video sequence representation.
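A sketch of the attention-based temporal position regression head follows; the hidden sizes and the sigmoid used to keep the normalized center and duration in [0, 1] are assumptions.

```python
import torch
import torch.nn as nn

class AttentionRegressionHead(nn.Module):
    """Temporal attention scores, attention pooling, and direct (center, duration) regression (step S4 sketch)."""
    def __init__(self, dim=256):
        super().__init__()
        self.attn_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.reg_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, enhanced):
        # enhanced: (B, T, dim) video representation after multi-granularity cascade interaction
        a = torch.softmax(self.attn_mlp(enhanced).squeeze(-1), dim=1)   # temporal attention scores (B, T)
        f = torch.sum(a.unsqueeze(-1) * enhanced, dim=1)                # attention pooling -> target segment repr.
        center, duration = torch.sigmoid(self.reg_mlp(f)).unbind(dim=-1)
        return center, duration, a

# c, d, a = AttentionRegressionHead()(torch.randn(2, 256, 256))
```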
step S5: and for the cross-modal time sequence behavior positioning model based on the multi-granularity cascade interaction network formed in the steps S1-S4, training the model by utilizing a training sample set, wherein a total loss function adopted in the training comprises attention alignment loss and boundary loss, and the boundary loss comprises smooth attention alignment loss and boundary loss
$\ell_1$ loss and temporal generalized IoU loss, which better fit the evaluation criterion of the temporal localization task; the training sample set is composed of a number of {video, query, target video segment temporal position annotation} triplet samples.
The training of the model in step S5 comprises the following steps:
Step S5.1: calculate the attention alignment loss $L_{att}$. The logarithm of the temporal attention score of the i-th frame, weighted by the indicator value $m_i$, is accumulated over the sampled frames, and the accumulated result is divided by the accumulation of $m_i$ over the sampled frames to compute the loss $L_{att}$, where $m_i=1$ indicates that the i-th frame of the video lies inside the annotated temporal segment and $m_i=0$ otherwise. The attention alignment loss $L_{att}$ encourages the video frames inside the annotated temporal segment to receive higher attention scores; the specific calculation process can be expressed as:
$$L_{att} = -\frac{\sum_{i=1}^{T} m_i \log a_i}{\sum_{i=1}^{T} m_i}$$
where $T$ denotes the number of sampled frames, $a_i$ denotes the temporal attention score of the i-th frame, and $m_i=1$ indicates that the i-th frame of the video lies inside the annotated temporal segment, otherwise $m_i=0$.
Step S5.2: calculate the boundary loss $L_{b}$, which combines the smooth $\ell_1$ loss $L_{reg}$ and the temporal generalized IoU loss $L_{giou}$. A first smooth $\ell_1$ loss is computed between the normalized temporal center coordinate $\hat{c}$ of the predicted segment and the normalized temporal center coordinate $c$ of the annotated temporal segment; a second smooth $\ell_1$ loss is computed between the segment duration $\hat{d}$ of the predicted segment and the segment duration $d$ of the annotated temporal segment; the sum of the first and second smooth $\ell_1$ losses gives the loss $L_{reg}$. The negative of the generalized IoU between the regressed segment $\hat{s}$ and the corresponding annotated segment $s$, plus 1, is taken as the temporal generalized IoU loss $L_{giou}$. The sum of the loss $L_{reg}$ and the temporal generalized IoU loss $L_{giou}$ is taken as the boundary loss $L_{b}$; the specific calculation process of the boundary loss $L_{b}$ can be expressed as follows:
$$L_{reg} = \text{Smooth}_{\ell_1}(\hat{c} - c) + \text{Smooth}_{\ell_1}(\hat{d} - d)$$
$$L_{giou} = 1 - \left(\text{IoU}(\hat{s}, s) - \frac{|C \setminus (\hat{s} \cup s)|}{|C|}\right)$$
$$L_{b} = L_{reg} + L_{giou}$$
where $\text{Smooth}_{\ell_1}(\cdot)$ denotes the smooth $\ell_1$ loss function, $\text{IoU}(\hat{s}, s)$ denotes the intersection-over-union of the two segments, and $C$ denotes the minimal time span covering the regressed segment $\hat{s}$ and the corresponding annotated segment $s$.
Step S5.3: the weighted sum of the attention alignment loss $L_{att}$ and the boundary loss $L_{b}$ is taken as the total loss of model training.
The specific total loss function $L$ is:
$$L = \lambda_1 L_{att} + \lambda_2 L_{b}$$
where $\lambda_1$ and $\lambda_2$ are weighting hyper-parameters, and the model parameters are updated with the optimizer during the training phase.
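The training losses of steps S5.1-S5.3 can be written directly from the formulas above; the sketch below assumes segments are represented by normalized (center, duration) pairs and treats the weights λ1 and λ2 as plain scalars.

```python
import torch
import torch.nn.functional as F

def attention_alignment_loss(a, mask):
    """L_att: a is (B, T) temporal attention scores, mask is (B, T) with 1.0 inside the annotated segment."""
    return -(mask * torch.log(a + 1e-8)).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)

def boundary_loss(pred_c, pred_d, gt_c, gt_d):
    """L_b: smooth-L1 on center and duration plus the temporal generalized IoU loss."""
    reg = F.smooth_l1_loss(pred_c, gt_c) + F.smooth_l1_loss(pred_d, gt_d)
    ps, pe = pred_c - pred_d / 2, pred_c + pred_d / 2            # predicted segment boundaries
    gs, ge = gt_c - gt_d / 2, gt_c + gt_d / 2                    # annotated segment boundaries
    inter = (torch.min(pe, ge) - torch.max(ps, gs)).clamp(min=0)
    union = (pe - ps) + (ge - gs) - inter
    hull = torch.max(pe, ge) - torch.min(ps, gs)                 # minimal span covering both segments
    giou = inter / union.clamp(min=1e-8) - (hull - union) / hull.clamp(min=1e-8)
    return reg + (1.0 - giou).mean()

def total_loss(a, mask, pred_c, pred_d, gt_c, gt_d, lam1=1.0, lam2=1.0):
    """L = lam1 * L_att + lam2 * L_b, with lam1 and lam2 as weighting hyper-parameters."""
    return lam1 * attention_alignment_loss(a, mask).mean() + lam2 * boundary_loss(pred_c, pred_d, gt_c, gt_d)
```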
The accuracy of the method of the present invention is compared with other representative methods on the TACoS test set, as shown in Table 1, using the "R@n, IoU=m" evaluation criterion with n = 1 and m = {0.1, 0.3, 0.5}.
TABLE 1
[Table 1: accuracy comparison with representative methods on the TACoS test set under "R@1, IoU=m" for m = {0.1, 0.3, 0.5}; the table is provided as an image in the original publication.]
Corresponding to the embodiment of the cross-modal time sequence behavior positioning method, the invention also provides an embodiment of a cross-modal time sequence behavior positioning device of the multi-granularity cascade interactive network.
Referring to fig. 4, the cross-modal timing behavior positioning apparatus of the multi-granularity cascading interactive network provided in the embodiment of the present invention includes one or more processors, and is configured to implement the cross-modal timing behavior positioning method of the multi-granularity cascading interactive network in the embodiment.
The cross-modal time sequence behavior positioning device of the multi-granularity cascade interactive network can be applied to any device with data processing capability, such as a computer. The device embodiments may be implemented by software, by hardware, or by a combination of hardware and software. Taking a software implementation as an example, the device, as a logical device, is formed by the processor of the device with data processing capability reading the corresponding computer program instructions from the non-volatile memory into the memory and running them. In terms of hardware, fig. 4 shows a hardware structure diagram of the device with data processing capability in which the cross-modal time sequence behavior positioning device of the multi-granularity cascade interactive network is located; in addition to the processor, memory, network interface and non-volatile memory shown in fig. 4, the device may further include other hardware according to its actual function, which is not described here again.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
Since the device embodiments substantially correspond to the method embodiments, reference may be made to the description of the method embodiments for relevant details. The device embodiments described above are merely illustrative: the units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present invention, which can be understood and implemented by those of ordinary skill in the art without inventive effort.
The embodiment of the present invention further provides a computer-readable storage medium, on which a program is stored, where the program, when executed by a processor, implements the cross-modal time series behavior positioning method of the multi-granularity cascading interactive network in the above embodiments.
The computer-readable storage medium may be an internal storage unit of the device with data processing capability described in any of the foregoing embodiments, such as a hard disk or a memory. The computer-readable storage medium may also be an external storage device of the device with data processing capability, such as a plug-in hard disk, a Smart Media Card (SMC), an SD card, or a Flash memory card (Flash Card) provided on the device. Further, the computer-readable storage medium may include both an internal storage unit and an external storage device of the device with data processing capability. The computer-readable storage medium is used to store the computer program and the other programs and data required by the device with data processing capability, and may also be used to temporarily store data that has been output or is to be output.
The above embodiments are only intended to illustrate the technical solution of the present invention and not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and such modifications or substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A cross-modal time sequence behavior positioning method of a multi-granularity cascade interactive network is characterized by comprising the following steps:
step S1: giving an unclipped video sample, performing initial extraction of video representation by using a visual pre-training model, and performing context-aware time sequence dependent coding on the initially extracted video representation in a local-global mode to obtain a final video representation;
step S2: for the text query corresponding to the untrimmed video, performing word embedding initialization on each word in the query text by adopting a pre-trained word embedding model, and then performing context coding by adopting a multi-layer bidirectional long short-term memory network to obtain a word-level representation and a global-level representation of the text query;
step S3: for the extracted video representation and text query representation, performing interaction between a video modality and a text query modality by adopting a multi-granularity cascade interaction network to obtain an enhanced video representation guided by query;
step S4: for the enhanced video representation obtained after multi-granularity cascade interaction, predicting the time sequence position of a corresponding target video fragment of a text query by adopting an attention-based time sequence position regression module;
step S5: for the cross-modal time sequence behavior positioning model based on the multi-granularity cascade interaction network formed in steps S1-S4, training the model by using a training sample set, wherein the total loss function adopted in the training comprises an attention alignment loss and a boundary loss, and the boundary loss comprises a smooth L1 loss and a temporal generalized intersection-over-union loss.
2. The method according to claim 1, wherein in step S1, video frame features are extracted in an off-line manner based on the visual pre-training model, T frames are uniformly sampled, and a set of video representations V = {v_1, v_2, ..., v_T} is then obtained through a linear transformation layer, where v_i is the representation of the i-th frame of the video; context-aware time sequence dependent coding is then performed on the video representation V in a local-global mode.
3. The method as claimed in claim 2, wherein the local-global context-aware coding in step S1 is implemented as follows: local context-aware coding is first performed on the video representation V to obtain the video representation V_l; global context-aware coding is then performed on the video representation V_l to obtain the video representation V_g.
4. The cross-modal time sequence behavior positioning method of the multi-granularity cascade interactive network as claimed in claim 3, wherein the local context-aware coding and the global context-aware coding in step S1 are respectively implemented as follows:

step S1.1, the local context-aware coding adopts a group of consecutive local transformer blocks provided with one-dimensional shifted windows: the video representation V is input as the initial representation into the first consecutive local transformer block with a one-dimensional shifted window, the obtained result is input into the second consecutive local transformer block with a one-dimensional shifted window, and so on, and the output of the last consecutive local transformer block with a one-dimensional shifted window is taken as the video representation V_l output by the local context-aware coding; the operations inside the k-th consecutive local transformer block with a one-dimensional shifted window are as follows:

the acquired video representation X is layer-normalized and passed through a one-dimensional window multi-head self-attention module, and the obtained result is added to the video representation X to obtain the video representation X_1; the video representation X_1 is layer-normalized and passed through a multilayer perceptron, and the obtained result is added to the video representation X_1 to obtain the video representation X_2; the video representation X_2 is layer-normalized and passed through a one-dimensional shifted-window multi-head self-attention module, and the obtained result is added to the video representation X_2 to obtain the video representation X_3; the video representation X_3 is layer-normalized and passed through a multilayer perceptron, and the obtained result is added to the video representation X_3 to give the video representation X_4, which is output as the result of the k-th consecutive local transformer block with a one-dimensional shifted window;

step S1.2, the global context-aware coding comprises a group of conventional transformer blocks: the video representation V_l is input as the initial representation into the first conventional transformer block, the obtained result is input into the second conventional transformer block, and so on, and the output of the last conventional transformer block is taken as the video representation V_g output by the global context-aware coding; the operations inside the j-th conventional transformer block are as follows:

the acquired video representation Y is passed through a conventional multi-head self-attention module, the obtained result is added to the video representation Y, and layer normalization is performed after the addition to obtain the video representation Y_1; the video representation Y_1 is passed through a multilayer perceptron, the obtained result is added to the video representation Y_1, and layer normalization is performed after the addition to obtain the video representation Y_2, which is taken as the output of the j-th conventional transformer block.
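As an illustration of the block structures recited in claim 4, the following PyTorch-style sketch shows one consecutive local transformer block with a one-dimensional shifted window (pre-norm, with the attention mask for wrapped-around positions omitted for brevity) and one conventional post-norm transformer block; all dimensions, window sizes and class names are illustrative assumptions.

import torch
import torch.nn as nn

class LocalSwinBlock1D(nn.Module):
    # Window attention -> MLP -> shifted-window attention -> MLP,
    # each with pre-layer-norm and a residual connection.
    def __init__(self, dim=512, heads=8, window=16):
        super().__init__()
        self.window = window
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(4)])
        self.attn1 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp1 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp2 = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def _window_attn(self, attn, x):
        # Split the T-frame sequence into non-overlapping 1-D windows and
        # apply multi-head self-attention inside each window.
        b, t, d = x.shape
        w = self.window
        xw = x.view(b * (t // w), w, d)
        out, _ = attn(xw, xw, xw)
        return out.view(b, t, d)

    def forward(self, x):                      # x: (batch, T, dim), T divisible by window
        x = x + self._window_attn(self.attn1, self.norms[0](x))
        x = x + self.mlp1(self.norms[1](x))
        # Shift by half a window so the second attention crosses window borders.
        shift = self.window // 2
        s = torch.roll(self.norms[2](x), -shift, dims=1)
        s = torch.roll(self._window_attn(self.attn2, s), shift, dims=1)
        x = x + s
        x = x + self.mlp2(self.norms[3](x))
        return x

class GlobalBlock(nn.Module):
    # Conventional post-norm transformer block for global context-aware coding.
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1, self.norm2 = nn.LayerNorm(dim), nn.LayerNorm(dim)

    def forward(self, x):
        x = self.norm1(x + self.attn(x, x, x)[0])
        return self.norm2(x + self.mlp(x))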
5. The method according to claim 1, wherein in step S2, the learnable word embedding vector corresponding to each word in the query text is initialized by the pre-trained word embedding model to obtain the embedded vector sequence Q = {q_1, q_2, ..., q_N} of the text query, where q_i is the representation of the i-th word of the query; context coding of the embedded vector sequence Q of the text query is carried out through the multi-layer bidirectional long short-term memory network to obtain the word-level text query representation S = {s_1, s_2, ..., s_N} of the query; the final forward hidden state vector and the final backward hidden state vector of the bidirectional network are spliced to obtain the global-level text query representation s_g; finally, the text query representation {S, s_g} is obtained.
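A minimal sketch of the text-query encoding recited in claim 5 is given below, assuming a pretrained word-embedding table loaded into the embedding layer; the class name and dimensions are illustrative.

import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    # Pretrained word embeddings followed by a multi-layer bidirectional LSTM;
    # returns word-level representations and a global-level representation.
    def __init__(self, vocab_size=10000, emb_dim=300, hid_dim=256, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)   # initialize from a pretrained model
        self.lstm = nn.LSTM(emb_dim, hid_dim, num_layers=layers,
                            bidirectional=True, batch_first=True)

    def forward(self, tokens):                           # tokens: (batch, N) word indices
        q = self.embed(tokens)                           # (batch, N, emb_dim)
        words, (h_n, _) = self.lstm(q)                   # words: (batch, N, 2*hid_dim)
        # Splice the final forward and backward hidden states of the top layer
        # into the global-level query representation.
        global_rep = torch.cat([h_n[-2], h_n[-1]], dim=-1)   # (batch, 2*hid_dim)
        return words, global_rep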
6. The method as claimed in claim 1, wherein in step S3 the multi-granularity cascade interactive network first performs video-guided query decoding on the video representation and the text query representation {S, s_g} to obtain the video-guided query representation {S', s'_g}, where s'_g denotes the global-level video-guided query representation and S' = {s'_1, s'_2, ..., s'_N} denotes the word-level video-guided query representations; the video-guided query representation {S', s'_g} is then subjected to cascade cross-modal fusion with the video modality representation to obtain the final enhanced video representation.
7. The cross-modal time sequence behavior positioning method of the multi-granularity cascade interactive network according to claim 6, wherein the video-guided query decoding and the cascade cross-modal fusion in step S3 are respectively implemented as follows:

step S3.1, the video-guided query decoding adopts a group of cross-modal decoding blocks: the text query representation {S, s_g} is input as the initial representation into the first cross-modal decoding block, the obtained result is input into the second cross-modal decoding block, and so on, and the output of the last cross-modal decoding block is taken as the video-guided query representation {S', s'_g}; the operations inside the l-th cross-modal decoding block in step S3.1 are as follows:

the obtained text query representation is passed through a multi-head self-attention module to obtain an updated text query representation; this text query representation is taken as the query, the video representation V_g is taken as the key and the value, and a further text query representation is obtained through a multi-head cross-attention module; this text query representation is passed through a conventional feed-forward network, and the result is taken as the output of the l-th cross-modal decoding block;

step S3.2, the cascade cross-modal fusion first performs cross-modal fusion at the coarse-grained level between the global-level video-guided query representation s'_g and the video modality representation V_g by element-wise multiplication to obtain the coarse-grained fused video representation V_c; the word-level video-guided query representation S' and the coarse-grained fused video representation V_c are then fused at the fine-grained level through another group of cross-modal decoding blocks: the coarse-grained fused video representation V_c is input as the initial representation into the first cross-modal decoding block, the obtained result is input into the second cross-modal decoding block, and so on, and the output of the last cross-modal decoding block is taken as the enhanced video representation V_e; the operations inside the l-th cross-modal decoding block in step S3.2 are as follows:

the acquired video representation is passed through a multi-head self-attention module to obtain an updated video representation; this video representation is taken as the query, the word-level video-guided query representation S' is taken as the key and the value, and a further video representation is obtained through a multi-head cross-attention module; this video representation is passed through a conventional feed-forward network, and the result is taken as the output of the l-th cross-modal decoding block.
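The following sketch illustrates one possible reading of the cross-modal decoding block and the cascade fusion recited in claim 7; the residual and layer-normalization placement is an assumption, since the claim only names the self-attention, cross-attention and feed-forward sub-modules, and all names and dimensions are illustrative.

import torch
import torch.nn as nn

class CrossModalDecoderBlock(nn.Module):
    # Self-attention on the input stream, cross-attention with the other
    # modality as key/value, then a feed-forward network.
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x, mem):
        x = self.n1(x + self.self_attn(x, x, x)[0])
        x = self.n2(x + self.cross_attn(x, mem, mem)[0])   # x queries, mem as key/value
        return self.n3(x + self.ffn(x))

def cascade_fusion(video, word_guided, global_guided, fine_blocks):
    # Coarse-grained fusion by element-wise multiplication with the global-level
    # video-guided query representation, then fine-grained fusion through a
    # stack of cross-modal decoding blocks attending to the word-level one.
    fused = video * global_guided.unsqueeze(1)        # (batch, T, dim) * (batch, 1, dim)
    for blk in fine_blocks:                           # e.g. nn.ModuleList of CrossModalDecoderBlock
        fused = blk(fused, word_guided)
    return fused                                      # enhanced video representation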
8. The method according to claim 1, wherein the attention-based time sequence position regression module in step S4 passes the enhanced video representation V_e output by the multi-granularity cascade interaction through a multilayer perceptron and a SoftMax activation layer to obtain the temporal attention scores a = {a_1, a_2, ..., a_T} of the video; the enhanced video representation V_e and the temporal attention scores a are then passed through an attention pooling layer to obtain the representation f of the target segment; finally, the normalized time sequence center coordinate t_c and the segment duration t_w of the target segment are directly regressed from the target segment representation f through a multilayer perceptron.
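A minimal sketch of the attention-based time sequence position regression module recited in claim 8 is given below; the sigmoid used to keep the regressed center and duration in [0, 1] is an assumption, as are the class name and dimensions.

import torch
import torch.nn as nn

class AttentionRegressionHead(nn.Module):
    # MLP + softmax -> per-frame temporal attention scores; attention pooling
    # -> target-segment representation; second MLP -> (center, duration).
    def __init__(self, dim=512):
        super().__init__()
        self.score_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 1))
        self.reg_mlp = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, enhanced_video):                    # (batch, T, dim)
        scores = self.score_mlp(enhanced_video).squeeze(-1)
        attn = torch.softmax(scores, dim=1)               # temporal attention scores a_i
        segment = (attn.unsqueeze(-1) * enhanced_video).sum(dim=1)   # attention pooling
        center, width = torch.sigmoid(self.reg_mlp(segment)).unbind(-1)
        return attn, center, width                        # a, t_c, t_w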
9. The cross-modal time sequence behavior positioning method of the multi-granularity cascade interactive network according to claim 1, wherein the training of the model in step S5 comprises the following steps:

step S5.1, calculating the attention alignment loss L_att: the logarithm of the temporal attention score a_i corresponding to the i-th frame, weighted by the indicator value m_i, is accumulated over the number of sampled frames, and the accumulated result is divided by the accumulation of m_i over the sampled frames to calculate the loss L_att; m_i = 1 indicates that the i-th frame of the video lies within the time sequence annotation segment, and m_i = 0 otherwise; the specific calculation process of the attention alignment loss L_att can be expressed as:

\[ L_{att} = - \frac{\sum_{i=1}^{T} m_i \log a_i}{\sum_{i=1}^{T} m_i} \]

step S5.2, calculating the boundary loss L_b: boundary loss measurement is performed by combining a smooth L1 loss L_reg with a temporal generalized intersection-over-union loss L_gIoU; a first smooth L1 loss is computed between the normalized time sequence center coordinate t_c of the predicted segment and the normalized time sequence center coordinate of the time sequence annotation segment, a second smooth L1 loss is computed between the segment duration t_w of the predicted segment and the segment duration of the time sequence annotation segment, and the sum of the first and second smooth L1 losses is taken as the loss L_reg; the generalized intersection-over-union between the regressed segment P and the corresponding annotated segment G is computed, and its negative value plus 1 is taken as the temporal generalized intersection-over-union loss L_gIoU; the sum of the loss L_reg and the temporal generalized intersection-over-union loss L_gIoU is taken as the boundary loss L_b; the specific calculation process of the boundary loss L_b can be expressed as follows:

\[ L_{reg} = R\left(t_c - \hat{t}_c\right) + R\left(t_w - \hat{t}_w\right) \]

\[ L_{gIoU} = 1 - \left( \mathrm{IoU}(P, G) - \frac{|C \setminus (P \cup G)|}{|C|} \right) \]

\[ L_b = L_{reg} + L_{gIoU} \]

where R(·) denotes the smooth L1 loss function, IoU(P, G) denotes the intersection-over-union of the two segments, and C denotes the smallest time span covering both the regressed segment P of the model and the corresponding annotated segment G;

step S5.3, taking the weighted sum of the attention alignment loss L_att and the boundary loss L_b as the total loss of model training, and updating the model parameters in conjunction with the optimizer.
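A minimal sketch of one training step combining the two losses with an optimizer, as recited in step S5.3, is shown below; it reuses the attention_alignment_loss and boundary_loss sketches given earlier, and the model interface, argument names and default weights are illustrative assumptions.

def train_step(model, optimizer, video_feats, query_tokens,
               gt_center, gt_width, inside_mask, alpha=1.0, beta=1.0):
    # Hypothetical model returning attention scores and the regressed (center, width).
    attn, pred_center, pred_width = model(video_feats, query_tokens)
    l_att = attention_alignment_loss(attn, inside_mask)          # from the earlier sketch
    l_b = boundary_loss(pred_center, pred_width, gt_center, gt_width)
    loss = alpha * l_att + beta * l_b                            # weighted total loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()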
10. A cross-modal time series behavior positioning apparatus of a multi-granularity cascading interactive network, comprising one or more processors, configured to implement the cross-modal time series behavior positioning method of the multi-granularity cascading interactive network according to any one of claims 1 to 9.
CN202210052687.8A 2022-01-18 2022-01-18 Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network Active CN114064967B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210052687.8A CN114064967B (en) 2022-01-18 2022-01-18 Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210052687.8A CN114064967B (en) 2022-01-18 2022-01-18 Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network

Publications (2)

Publication Number Publication Date
CN114064967A true CN114064967A (en) 2022-02-18
CN114064967B CN114064967B (en) 2022-05-06

Family

ID=80231249

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210052687.8A Active CN114064967B (en) 2022-01-18 2022-01-18 Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network

Country Status (1)

Country Link
CN (1) CN114064967B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107346328A (en) * 2017-05-25 2017-11-14 北京大学 A kind of cross-module state association learning method based on more granularity hierarchical networks
CN109858032A (en) * 2019-02-14 2019-06-07 程淑玉 Merge more granularity sentences interaction natural language inference model of Attention mechanism
CN111309971A (en) * 2020-01-19 2020-06-19 浙江工商大学 Multi-level coding-based text-to-video cross-modal retrieval method
CN111310676A (en) * 2020-02-21 2020-06-19 重庆邮电大学 Video motion recognition method based on CNN-LSTM and attention
CN111782871A (en) * 2020-06-18 2020-10-16 湖南大学 Cross-modal video time positioning method based on space-time reinforcement learning
CN111930999A (en) * 2020-07-21 2020-11-13 山东省人工智能研究院 Method for implementing text query and positioning video clip by frame-by-frame cross-modal similarity correlation
CN112115849A (en) * 2020-09-16 2020-12-22 中国石油大学(华东) Video scene identification method based on multi-granularity video information and attention mechanism
EP3933686A2 (en) * 2020-11-27 2022-01-05 Beijing Baidu Netcom Science Technology Co., Ltd. Video processing method, apparatus, electronic device, storage medium, and program product
CN113111837A (en) * 2021-04-25 2021-07-13 山东省人工智能研究院 Intelligent monitoring video early warning method based on multimedia semantic analysis
CN113204675A (en) * 2021-07-07 2021-08-03 成都考拉悠然科技有限公司 Cross-modal video time retrieval method based on cross-modal object inference network
CN113934887A (en) * 2021-12-20 2022-01-14 成都考拉悠然科技有限公司 No-proposal time sequence language positioning method based on semantic decoupling

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
JONGHWAN MUN: "Local-Global Video-Text Interactions for Temporal Grounding", 《ARXIV》 *
SHIZHE CHEN: "Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning", 《ARXIV》 *
ZHENZHI WANG: "Negative Sample Matters:A Renaissance of Metric Learning for Temporal Groounding", 《ARXIV》 *
戴思达: "深度多模态融合技术及时间序列分析算法研究", 《中国优秀硕士学位论文全文数据库》 *
赵才荣,齐鼎等: "智能视频监控关键技术: 行人再识别研究综述", 《中国科学:信息科学》 *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581821A (en) * 2022-02-23 2022-06-03 腾讯科技(深圳)有限公司 Video detection method, system, storage medium and server
CN114581821B (en) * 2022-02-23 2024-11-08 腾讯科技(深圳)有限公司 Video detection method, system, storage medium and server
CN114357124A (en) * 2022-03-18 2022-04-15 成都考拉悠然科技有限公司 Video paragraph positioning method based on language reconstruction and graph mechanism
CN114792424A (en) * 2022-05-30 2022-07-26 北京百度网讯科技有限公司 Document image processing method and device and electronic equipment
CN114925232A (en) * 2022-05-31 2022-08-19 杭州电子科技大学 Cross-modal time domain video positioning method under text segment question-answering framework
CN115131655A (en) * 2022-09-01 2022-09-30 浙江啄云智能科技有限公司 Training method and device of target detection model and target detection method
CN115187783A (en) * 2022-09-09 2022-10-14 之江实验室 Multi-task hybrid supervision medical image segmentation method and system based on federal learning
CN115223086A (en) * 2022-09-20 2022-10-21 之江实验室 Cross-modal action positioning method and system based on interactive attention guidance and correction
CN115223086B (en) * 2022-09-20 2022-12-06 之江实验室 Cross-modal action positioning method and system based on interactive attention guidance and correction
CN115238130B (en) * 2022-09-21 2022-12-06 之江实验室 Time sequence language positioning method and device based on modal customization collaborative attention interaction
CN115238130A (en) * 2022-09-21 2022-10-25 之江实验室 Time sequence language positioning method and device based on modal customization cooperative attention interaction
CN116385070A (en) * 2023-01-18 2023-07-04 中国科学技术大学 Multi-target prediction method, system, equipment and storage medium for short video advertisement of E-commerce
CN116385070B (en) * 2023-01-18 2023-10-03 中国科学技术大学 Multi-target prediction method, system, equipment and storage medium for short video advertisement of E-commerce
CN116246213A (en) * 2023-05-08 2023-06-09 腾讯科技(深圳)有限公司 Data processing method, device, equipment and medium
CN116824461B (en) * 2023-08-30 2023-12-08 山东建筑大学 Question understanding guiding video question answering method and system
CN117076712B (en) * 2023-10-16 2024-02-23 中国科学技术大学 Video retrieval method, system, device and storage medium
CN117076712A (en) * 2023-10-16 2023-11-17 中国科学技术大学 Video retrieval method, system, device and storage medium
CN117724153A (en) * 2023-12-25 2024-03-19 北京孚梅森石油科技有限公司 Lithology recognition method based on multi-window cascading interaction
CN117724153B (en) * 2023-12-25 2024-05-14 北京孚梅森石油科技有限公司 Lithology recognition method based on multi-window cascading interaction
CN117876929A (en) * 2024-01-12 2024-04-12 天津大学 Sequential target positioning method for progressive multi-scale context learning
CN117876929B (en) * 2024-01-12 2024-06-21 天津大学 Sequential target positioning method for progressive multi-scale context learning
CN117609553A (en) * 2024-01-23 2024-02-27 江南大学 Video retrieval method and system based on local feature enhancement and modal interaction
CN117609553B (en) * 2024-01-23 2024-03-22 江南大学 Video retrieval method and system based on local feature enhancement and modal interaction

Also Published As

Publication number Publication date
CN114064967B (en) 2022-05-06

Similar Documents

Publication Publication Date Title
CN114064967B (en) Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network
WO2021121198A1 (en) Semantic similarity-based entity relation extraction method and apparatus, device and medium
CN107832476B (en) Method, device, equipment and storage medium for understanding search sequence
CN113987169A (en) Text abstract generation method, device and equipment based on semantic block and storage medium
JP2023022845A (en) Method of processing video, method of querying video, method of training model, device, electronic apparatus, storage medium and computer program
CN112541125B (en) Sequence annotation model training method and device and electronic equipment
CN115983271B (en) Named entity recognition method and named entity recognition model training method
CN113128431B (en) Video clip retrieval method, device, medium and electronic equipment
CN111353311A (en) Named entity identification method and device, computer equipment and storage medium
CN118132803B (en) Zero sample video moment retrieval method, system, equipment and medium
CN113420212A (en) Deep feature learning-based recommendation method, device, equipment and storage medium
CN114420107A (en) Speech recognition method based on non-autoregressive model and related equipment
CN112446209A (en) Method, equipment and device for setting intention label and storage medium
JP2023017759A (en) Training method and training apparatus for image recognition model based on semantic enhancement
CN112199954A (en) Disease entity matching method and device based on voice semantics and computer equipment
US20230326178A1 (en) Concept disambiguation using multimodal embeddings
CN114241411B (en) Counting model processing method and device based on target detection and computer equipment
CN113723077B (en) Sentence vector generation method and device based on bidirectional characterization model and computer equipment
CN114882874A (en) End-to-end model training method and device, computer equipment and storage medium
CN112528040B (en) Detection method for guiding drive corpus based on knowledge graph and related equipment thereof
CN117874234A (en) Text classification method and device based on semantics, computer equipment and storage medium
CN113297525A (en) Webpage classification method and device, electronic equipment and storage medium
CN116022154A (en) Driving behavior prediction method, device, computer equipment and storage medium
CN115062136A (en) Event disambiguation method based on graph neural network and related equipment thereof
CN114091451A (en) Text classification method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant