CN114064967A - Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network - Google Patents
Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network
- Publication number
- CN114064967A (application number CN202210052687.8A)
- Authority
- CN
- China
- Prior art keywords
- video
- cross
- representation
- loss
- query
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/73—Querying
- G06F16/735—Filtering based on additional data, e.g. user or group profiles
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
- G06F16/7844—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/7867—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/134—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the element, parameter or criterion affecting or controlling the adaptive coding
- H04N19/146—Data rate or code amount at the encoder output
- H04N19/149—Data rate or code amount at the encoder output by estimating the code amount by means of a model, e.g. mathematical model or statistical model
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/20—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding
- H04N19/21—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video object coding with binary alpha-plane coding for video objects, e.g. context-based arithmetic encoding [CAE]
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Multimedia (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Library & Information Science (AREA)
- Databases & Information Systems (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Computing Systems (AREA)
- Biomedical Technology (AREA)
- Artificial Intelligence (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Life Sciences & Earth Sciences (AREA)
- Biophysics (AREA)
- Molecular Biology (AREA)
- Evolutionary Computation (AREA)
- Signal Processing (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Algebra (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a cross-modal time sequence behavior positioning method and device based on a multi-granularity cascade interaction network, which address the problem of localizing a behavior in an untrimmed video given a text query. The invention implements a novel multi-granularity cascade cross-modal interaction network that performs cascaded cross-modal interaction in a coarse-to-fine manner to improve the cross-modal alignment capability of the model. In addition, the invention introduces a local-global context-aware video encoder to improve the modeling of contextual temporal dependencies in the video representation. The method is simple and flexible, improves vision-language cross-modal alignment precision, and markedly raises the temporal localization accuracy of the trained model on paired video-query test data.
Description
Technical Field
The invention relates to the field of vision-language cross-modal learning, and in particular to a cross-modal time sequence behavior positioning method and device.
Background
With the rapid development of multimedia and network technologies and the growing deployment of large-scale video surveillance in traffic, campus, shopping-mall and similar settings, the volume of video data is growing geometrically, and video understanding has become an important and urgent problem. Temporal behavior localization is a foundation and key component of video understanding. Research on temporal behavior localization based on the visual modality alone restricts the behaviors to be localized to a predefined behavior set; in the real world, however, behaviors are complex and diverse, and a predefined behavior set can hardly meet practical needs. As shown in fig. 1, the visual-language cross-modal time sequence behavior positioning task takes a text description of a certain behavior in a video as the query and temporally localizes the corresponding behavior segment in the video. Visual-language cross-modal time sequence behavior positioning is a very natural mode of human-computer interaction, and the technology has broad application prospects in short-video content retrieval and production, intelligent video surveillance, human-computer interaction and other fields.
Driven by deep learning, the visual-language cross-modal time sequence behavior positioning task has attracted great attention from industry and academia. Because a significant semantic gap exists between the heterogeneous text modality and visual modality, achieving semantic alignment between the modalities is the core problem of this task. Existing visual-language cross-modal time sequence behavior positioning methods mainly fall into three categories: candidate-segment-nomination-based methods, nomination-free methods, and sequential-decision-based methods. Visual-language cross-modal alignment is an indispensable link in all three. However, existing methods neither fully exploit multi-granularity text query information in the visual-language cross-modal interaction stage nor fully model the local contextual temporal dependencies of the video in the video representation encoding stage.
Disclosure of Invention
In order to overcome the shortcomings of the prior art and improve the vision-language cross-modal alignment precision in the vision-language cross-modal time sequence behavior positioning task, the invention adopts the following technical scheme:
a cross-modal time sequence behavior positioning method of a multi-granularity cascade interactive network comprises the following steps:
step S1: giving an unclipped video sample, performing initial extraction of video representation by using a visual pre-training model, and performing context-aware time sequence dependent coding on the initially extracted video representation in a local-global mode to obtain a final video representation, so that the context time sequence dependent modeling capability of the video representation is improved;
step S2: for text query corresponding to an untrimmed video, performing word embedding initialization on each word in a query text by adopting a pre-trained word embedding model, and then performing context coding by adopting a multi-layer bidirectional long-time memory network to obtain a word-level representation and a global-level representation of the text query;
step S3: for the extracted video representation and the text query representation, a multi-granularity cascade interaction network is adopted to carry out interaction between a video modality and a text query modality, so that an enhanced video representation guided by query is obtained, and the cross-modality alignment precision is improved;
step S4: for the video representation obtained after multi-granularity cascade interaction, predicting the time sequence position of a text query corresponding target video fragment by adopting an attention-based time sequence position regression module;
step S5: for the cross-modal time sequence behavior positioning model based on the multi-granularity cascade interaction network formed in the steps S1-S4, training of the model is carried out by utilizing a training sample set, and a total loss function adopted in the training comprises attention alignment loss and boundary loss, wherein the boundary loss comprises attention alignment loss and boundary lossInvolving smoothingThe loss and the time sequence generalized intersection are better adapted to the evaluation criterion of the time sequence positioning task than the loss, and the training sample set is composed of a plurality of { video, query, target video segment time sequence position mark } triple samples.
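As a reading aid, the following minimal PyTorch-style skeleton sketches how steps S1-S4 compose into a single forward pass. It is an illustrative assumption rather than the patented implementation: the module names (LocalGlobalVideoEncoder, QueryEncoder, CascadeInteraction, AttentionRegressor), argument names and tensor shapes are hypothetical placeholders for the components detailed below.

```python
import torch.nn as nn

class CrossModalGroundingModel(nn.Module):
    """Sketch of the overall pipeline: video encoding (S1), query encoding (S2),
    multi-granularity cascade interaction (S3) and temporal regression (S4)."""

    def __init__(self, video_encoder, query_encoder, interaction, regressor):
        super().__init__()
        self.video_encoder = video_encoder   # local-global context-aware video encoder (step S1)
        self.query_encoder = query_encoder   # word embedding + BLSTM query encoder (step S2)
        self.interaction = interaction       # multi-granularity cascade interaction (step S3)
        self.regressor = regressor           # attention-based temporal position regression (step S4)

    def forward(self, frame_feats, word_ids):
        video = self.video_encoder(frame_feats)              # (B, T, d) encoded frame features
        words, global_q = self.query_encoder(word_ids)       # (B, L, d) word-level, (B, d) global-level
        enhanced = self.interaction(video, words, global_q)  # (B, T, d) query-guided video representation
        center, width, attn = self.regressor(enhanced)       # normalized center, duration, attention
        return center, width, attn
```

Concrete sketches of the four sub-modules, together with the loss terms of step S5, are given alongside the corresponding steps in the detailed description.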
Further, in step S1, video frame features are extracted offline with the visual pre-training model and T frames are uniformly sampled; a set of video representations is then obtained through a linear transformation layer, in which each element is the representation of the i-th frame, and context-aware temporal dependency coding is performed on this video representation in a local-global manner.
Further, in the local-global context-aware coding of step S1, local context-aware coding is first applied to the video representation to obtain a locally encoded video representation, and global context-aware coding is then applied to obtain the globally encoded video representation.
Further, the local context-aware coding and the global context-aware coding in step S1 are respectively implemented as follows:
step S1.1, local context-aware coding adopts a stack of consecutive local transformer blocks equipped with one-dimensional shifted windows: the video representation is taken as the initial input of the first block, the output of each block is fed into the next, and the output of the last block is taken as the video representation output by local context-aware coding; the operations inside a consecutive local transformer block with a one-dimensional shifted window are as follows:
the input video representation is layer-normalized, passed through a one-dimensional window multi-head self-attention module and added to the block input; the result is layer-normalized, passed through a multi-layer perceptron and added back; the result is layer-normalized, passed through a one-dimensional shifted-window multi-head self-attention module and added back; finally, the result is layer-normalized, passed through a multi-layer perceptron and added back, and the outcome is output as the video representation of this block and serves as the input of the next block.
Specifically, denoting layer normalization by LN, the one-dimensional window multi-head self-attention module by 1D-W-MSA, the one-dimensional shifted-window multi-head self-attention module by 1D-SW-MSA and the multi-layer perceptron by MLP, a consecutive local transformer block with a one-dimensional shifted window maps its input X through X1 = 1D-W-MSA(LN(X)) + X, X2 = MLP(LN(X1)) + X1, X3 = 1D-SW-MSA(LN(X2)) + X2 and X4 = MLP(LN(X3)) + X3, and outputs X4.
step S1.2, global context-aware coding comprises a stack of conventional transformer blocks: the locally encoded video representation is taken as the initial input of the first conventional transformer block, the output of each block is fed into the next, and the output of the last conventional transformer block is taken as the video representation output by global context-aware coding; the operations inside a conventional transformer block are as follows:
the input video representation is passed through a conventional multi-head self-attention module, added to the block input and layer-normalized; the result is passed through a multi-layer perceptron, added to its input and layer-normalized, and the outcome is the output of the conventional transformer block.
Specifically, denoting the conventional multi-head self-attention module by MSA, a conventional transformer block maps its input Y through Y1 = LN(MSA(Y) + Y) and Y2 = LN(MLP(Y1) + Y1), and outputs Y2.
Further, in step S2, the learnable word embedding vector of each word in the query text is initialized with the pre-trained word embedding model to obtain the embedding vector sequence of the text query, in which each element corresponds to one word of the query. The embedding vector sequence is context-encoded by a multi-layer bidirectional long short-term memory network (BLSTM) to obtain the word-level text query representation; the final forward hidden state vector and the final backward hidden state vector of the BLSTM are concatenated to obtain the global-level text query representation, and the word-level and global-level representations together form the text query representation.
Further, in the multi-granularity cascade interaction network of step S3, the video representation and the text query representation are first passed through video-guided query decoding to obtain a video-guided query representation, which comprises a global-level video-guided query representation and a word-level video-guided query representation; the video-guided query representation and the video modality representation then undergo cascaded cross-modal fusion to obtain the final enhanced video representation. Video-guided query decoding serves to narrow the semantic gap between the video representation and the text query representation.
Further, the step S3 includes the following steps:
step S3.1, video-guided query decoding adopts a stack of cross-modal decoding blocks: the text query representation is taken as the initial input of the first cross-modal decoding block, the output of each block is fed into the next, and the output of the last cross-modal decoding block is taken as the video-guided query representation; the operations inside a cross-modal decoding block of step S3.1 are as follows:
the input text query representation is passed through a multi-head self-attention module; the result is used as the query of a multi-head cross-attention module, with the video representation as the keys and values; the output is passed through a conventional feed-forward network, and the result is the output of the cross-modal decoding block.
step S3.2, cascaded cross-modal fusion: first, the global-level video-guided query representation and the video modality representation are fused at the coarse-grained level by element-wise multiplication to obtain a coarse-level fused video representation; this coarse-level fusion serves to suppress background video frames and emphasize foreground video frames. Then, the word-level video-guided query representation and the coarse-level fused video representation are fused at the fine-grained level through another stack of cross-modal decoding blocks: the coarse-level fused video representation is taken as the initial input of the first cross-modal decoding block, the output of each block is fed into the next, and the output of the last cross-modal decoding block is taken as the enhanced video representation; the operations inside a cross-modal decoding block of step S3.2 are as follows:
the input video representation is passed through a multi-head self-attention module; the result is used as the query of a multi-head cross-attention module, with the word-level video-guided query representation as the keys and values; the output is passed through a conventional feed-forward network, and the result is the output of the cross-modal decoding block.
Further, the attention-based temporal position regression module of step S4 first passes the enhanced video sequence representation produced by the multi-granularity cascade interaction through a multi-layer perceptron and a SoftMax activation layer to obtain the temporal attention score of each video frame; the enhanced video representation and the temporal attention scores are then aggregated by an attention pooling layer to obtain the representation of the target segment; finally, the normalized temporal center coordinate and the normalized duration of the target segment are directly regressed from the target segment representation through a multi-layer perceptron. In other words, the attention pooling layer condenses the enhanced video sequence representation output by the multi-granularity cascade interaction into a single segment representation, weighted by the temporal attention scores.
Further, the training of the model in step S5 includes the following steps:
step S5.1, calculating the attention alignment loss: for each frame i, the logarithm of its temporal attention score is multiplied by an indicator value that equals 1 if the i-th frame of the video lies inside the annotated temporal segment and 0 otherwise; these products are accumulated over the sampled frames, and the accumulated value is divided by the accumulated indicator values to compute the loss. The attention alignment loss encourages the video frames inside the annotated temporal segment to receive higher attention scores; its calculation can be expressed as
L_align = -( Σ_{i=1..T} y_i · log a_i ) / ( Σ_{i=1..T} y_i ),
where T is the number of sampled frames, a_i is the temporal attention score of the i-th frame, and y_i equals 1 if the i-th frame of the video lies inside the annotated temporal segment and 0 otherwise.
step S5.2, calculating the boundary loss by combining a smooth L1 loss and a temporal generalized IoU loss: a first smooth L1 loss is computed between the normalized temporal center coordinate of the predicted segment and that of the annotated segment, a second smooth L1 loss is computed between the duration of the predicted segment and that of the annotated segment, and the sum of the two is the smooth L1 term; the temporal generalized IoU loss is one minus the generalized IoU between the regressed segment and the corresponding annotated segment, where the generalized IoU is computed from the intersection-over-union of the two segments and the minimum time window covering both of them; the smooth L1 term and the temporal generalized IoU loss are summed to give the boundary loss.
The loss terms are balanced by weight hyperparameters, and the model parameters are updated with the optimizer during the training phase.
The cross-modal time sequence behavior positioning device of the multi-granularity cascade interactive network comprises one or more processors and is used for realizing the cross-modal time sequence behavior positioning method of the multi-granularity cascade interactive network.
The invention has the advantages and beneficial effects that:
the invention discloses a cross-modal time sequence behavior positioning method and device of a multi-granularity cascade interaction network, which fully utilize multi-granularity text query information in a coarse-to-fine mode in a vision-language cross-modal interaction link, fully model the local-global context time sequence dependence characteristic of a video in a video representation coding link, and solve the problem of time sequence behavior positioning based on text query in an untrimmed video. For given untrimmed videos and text queries, the method can improve the vision-language cross-modal alignment precision, and further improve the positioning accuracy of a cross-modal time sequence behavior positioning task.
Drawings
FIG. 1 is an exemplary diagram of a visual-language cross-modal temporal behavior localization task.
FIG. 2 is a block diagram of the cross-modal temporal behavior localization of the multi-granularity cascading interactive network of the present invention.
FIG. 3 is a flowchart of a cross-modal timing behavior localization method of a multi-granularity cascading interactive network according to the present invention.
Fig. 4 is a structural diagram of a cross-mode timing behavior positioning device of a multi-granularity cascade interaction network in the invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings. It should be understood that the detailed description and specific examples, while indicating the present invention, are given by way of illustration and explanation only, not limitation.
The invention discloses a cross-modal time sequence behavior positioning method and device of a multi-granularity cascade interaction network, used to solve the problem of temporal behavior localization in untrimmed video given a text query. The method provides a simple and effective multi-granularity cascade cross-modal interaction network to improve the cross-modal alignment capability of the model, and introduces a local-global context-aware video encoder to improve the contextual temporal dependency modeling capability of the video encoder. As a result, the temporal localization accuracy of the trained model is markedly improved on paired video-query test data.
In the experiments, the cross-modal time sequence behavior positioning method of the multi-granularity cascade interaction network is implemented on the PyTorch framework. Video frame features are extracted offline with a pre-trained C3D network, each video is uniformly sampled to 256 frames, and the number of heads of all self-attention and cross-attention sub-modules is set to 8. The model is trained with the Adam optimizer at a fixed learning rate of 0.0004, and each batch consists of 100 video-query pairs. Performance is measured with the "R@n, IoU=m" criterion, which reports the percentage of correctly localized queries in the evaluation data set: a query is considered correctly localized if, among the n predicted segments with the highest confidence, the maximum temporal intersection-over-union (IoU) with the ground-truth annotation exceeds m.
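As a reading aid, the "R@n, IoU=m" criterion can be sketched as follows, assuming each prediction and annotation is a (start, end) pair in seconds; the function names and the data layout are illustrative assumptions, not part of the patent.

```python
def temporal_iou(pred, gt):
    """Temporal intersection-over-union of two (start, end) segments."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0

def recall_at_n(predictions, ground_truths, n=1, m=0.5):
    """Percentage of queries whose best of the top-n predictions exceeds IoU m."""
    hits = sum(
        1 for preds, gt in zip(predictions, ground_truths)
        if max(temporal_iou(p, gt) for p in preds[:n]) > m
    )
    return 100.0 * hits / len(ground_truths)

# Example: one of the two queries is localized correctly at IoU > 0.5.
preds = [[(4.2, 9.8)], [(0.0, 3.0)]]          # top-1 predicted segment per query
gts = [(5.0, 10.0), (6.0, 9.0)]
print(recall_at_n(preds, gts, n=1, m=0.5))    # 50.0
```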
In a specific embodiment, an untrimmed video is uniformly sampled into a sequence of video frames, and a text description of a behavior segment in the video is given as the query. The visual-language cross-modal time sequence behavior positioning task is to predict the start time and end time of the video segment corresponding to the text description. The training data set of the task can be defined as a set of triples, each consisting of a video, a query, and the ground-truth start time and end time of the target video segment.
As shown in fig. 2 and fig. 3, the cross-modal time sequence behavior positioning method of the multi-granularity cascade interaction network includes the following steps:
step S1: given an untrimmed video sample, initially extract a video representation with a visual pre-training model, and perform context-aware temporal dependency coding on the initial representation in a local-global manner to obtain the final video representation, thereby improving the contextual temporal dependency modeling capability of the video representation;
In step S1, video frame features are extracted offline with the visual pre-training model and T frames are uniformly sampled; a set of video representations is then obtained through a linear transformation layer, in which each element is the representation of the i-th frame, and context-aware temporal dependency coding is performed on this video representation in a local-global manner.
In the local-global context-aware coding of step S1, local context-aware coding is first applied to the video representation to obtain a locally encoded video representation, and global context-aware coding is then applied to obtain the globally encoded video representation.
The local context-aware coding and the global context-aware coding in step S1 are implemented as follows:
step S1.1, local context-aware coding adopts a stack of consecutive local transformer blocks equipped with one-dimensional shifted windows: the video representation is taken as the initial input of the first block, the output of each block is fed into the next, and the output of the last block is taken as the video representation output by local context-aware coding; the operations inside a consecutive local transformer block with a one-dimensional shifted window are as follows:
the input video representation is layer-normalized, passed through a one-dimensional window multi-head self-attention module and added to the block input; the result is layer-normalized, passed through a multi-layer perceptron and added back; the result is layer-normalized, passed through a one-dimensional shifted-window multi-head self-attention module and added back; finally, the result is layer-normalized, passed through a multi-layer perceptron and added back, and the outcome is output as the video representation of this block and serves as the input of the next block.
Specifically, denoting layer normalization by LN, the one-dimensional window multi-head self-attention module by 1D-W-MSA, the one-dimensional shifted-window multi-head self-attention module by 1D-SW-MSA and the multi-layer perceptron by MLP, a consecutive local transformer block with a one-dimensional shifted window maps its input X through X1 = 1D-W-MSA(LN(X)) + X, X2 = MLP(LN(X1)) + X1, X3 = 1D-SW-MSA(LN(X2)) + X2 and X4 = MLP(LN(X3)) + X3, and outputs X4.
step S1.2, global context-aware coding comprises a stack of conventional transformer blocks: the locally encoded video representation is taken as the initial input of the first conventional transformer block, the output of each block is fed into the next, and the output of the last conventional transformer block is taken as the video representation output by global context-aware coding; the operations inside a conventional transformer block are as follows:
the input video representation is passed through a conventional multi-head self-attention module, added to the block input and layer-normalized; the result is passed through a multi-layer perceptron, added to its input and layer-normalized, and the outcome is the output of the conventional transformer block.
Specifically, denoting the conventional multi-head self-attention module by MSA, a conventional transformer block maps its input Y through Y1 = LN(MSA(Y) + Y) and Y2 = LN(MLP(Y1) + Y1), and outputs Y2.
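The local-global context-aware video encoder of step S1 can be sketched in PyTorch as follows. It is a minimal sketch under several assumptions: the 4096-dimensional C3D input, the window size of 16, the block counts, and the emulation of one-dimensional window attention by reshaping frames into non-overlapping windows (and rolling the sequence for the shifted window, without cross-boundary masking) are illustrative choices, not details taken from the patent.

```python
import torch
import torch.nn as nn

class WindowBlock(nn.Module):
    """Consecutive local transformer block with a 1-D shifted window (step S1.1):
    LN -> window MSA -> add, LN -> MLP -> add, then the same pair with a shifted window."""
    def __init__(self, d=512, heads=8, window=16):
        super().__init__()
        self.window = window
        self.norms = nn.ModuleList([nn.LayerNorm(d) for _ in range(4)])
        self.attn1 = nn.MultiheadAttention(d, heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp1 = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.mlp2 = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def _windowed(self, attn, x):
        b, t, d = x.shape                                  # assumes t divisible by the window size
        w = x.reshape(b * (t // self.window), self.window, d)
        out, _ = attn(w, w, w)
        return out.reshape(b, t, d)

    def forward(self, x):
        x = x + self._windowed(self.attn1, self.norms[0](x))   # 1-D window multi-head self-attention
        x = x + self.mlp1(self.norms[1](x))
        shift = self.window // 2
        s = torch.roll(x, -shift, dims=1)                       # 1-D shifted window
        s = s + self._windowed(self.attn2, self.norms[2](s))
        x = torch.roll(s, shift, dims=1)
        x = x + self.mlp2(self.norms[3](x))
        return x

class GlobalBlock(nn.Module):
    """Conventional (post-norm) transformer block for global context (step S1.2)."""
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.n1, self.n2 = nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x):
        x = self.n1(x + self.attn(x, x, x)[0])
        return self.n2(x + self.mlp(x))

class LocalGlobalVideoEncoder(nn.Module):
    def __init__(self, in_dim=4096, d=512, local_blocks=2, global_blocks=2):
        super().__init__()
        self.proj = nn.Linear(in_dim, d)                         # linear transformation layer
        self.local = nn.Sequential(*[WindowBlock(d) for _ in range(local_blocks)])
        self.glob = nn.Sequential(*[GlobalBlock(d) for _ in range(global_blocks)])

    def forward(self, frame_feats):                              # (B, T, in_dim)
        return self.glob(self.local(self.proj(frame_feats)))
```

Under these assumptions, LocalGlobalVideoEncoder()(torch.randn(2, 256, 4096)) yields a (2, 256, 512) video representation.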
Step S2: for text query corresponding to an untrimmed video, performing word embedding initialization on each word in a query text by adopting a pre-trained word embedding model, and then performing context coding by adopting a multi-layer bidirectional long-time memory network to obtain a word-level representation and a global-level representation of the text query;
in step S2, the learnable word embedded vector corresponding to each word in the text is queried, and the word embedded model is initialized using the pre-trained word embedded model to obtain an embedded vector sequence of the text query,For characterization of ith word of video, embedded vector sequence of text query is processed by multilayer bidirectional long-and-short memory network (BLSTM)Context coding is carried out to obtain the word-level text query representation of the queryBy passingForward hidden state vector sum ofSplicing the backward hidden state vectors to obtain a global level text query representationFinally, the text query representation is obtained。
The specific implementation mode is as follows:
whereinIs composed ofForward hidden state vector sum ofAnd (4) splicing the backward hidden state vectors.
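The text query encoder of step S2 can be sketched as follows; the vocabulary size, embedding dimension and number of BLSTM layers are assumptions, and loading actual pre-trained word vectors (for example GloVe) into the embedding table is omitted for brevity.

```python
import torch
import torch.nn as nn

class QueryEncoder(nn.Module):
    """Word embedding + multi-layer bidirectional LSTM (step S2).
    Returns word-level representations and a global-level representation."""
    def __init__(self, vocab_size=10000, emb_dim=300, d=512, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)    # initialize from pre-trained vectors in practice
        self.blstm = nn.LSTM(emb_dim, d // 2, num_layers=layers,
                             bidirectional=True, batch_first=True)

    def forward(self, word_ids):                          # (B, L) integer word indices
        h, (h_n, _) = self.blstm(self.embed(word_ids))    # h: (B, L, d) word-level representation
        # global-level representation: concatenate the final forward and backward hidden states
        global_q = torch.cat([h_n[-2], h_n[-1]], dim=-1)  # (B, d)
        return h, global_q
```

For a batch of index sequences, QueryEncoder()(torch.randint(0, 10000, (2, 12))) returns a (2, 12, 512) word-level representation and a (2, 512) global-level representation.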
Step S3: for the extracted video representation and the text query representation, a multi-granularity cascade interaction network is adopted to carry out interaction between a video modality and a text query modality, so that an enhanced video representation guided by query is obtained, and the cross-modality alignment precision is improved;
in the multi-granularity cascade interactive network in the step S3, firstly, the video is characterizedAnd text query characterizationObtaining a video-guided query representation by video-guided query decoding,A query token representing a global level video guide,representing a query characterization of a word-level video guide, and then characterizing the video-guided queryAnd video modality characterizationAnd finally obtaining the enhanced video representation through cascade cross-modal fusion. Video-guided query decoding to narrow down video representationsAnd text query characterizationThe semantic gap between modalities.
The step S3 specifically includes the following steps:
step S3.1, video-guided query decoding adopts a stack of cross-modal decoding blocks: the text query representation is taken as the initial input of the first cross-modal decoding block, the output of each block is fed into the next, and the output of the last cross-modal decoding block is taken as the video-guided query representation; the operations inside a cross-modal decoding block of step S3.1 are as follows:
the input text query representation is passed through a multi-head self-attention module; the result is used as the query of a multi-head cross-attention module, with the video representation as the keys and values; the output is passed through a conventional feed-forward network, and the result is the output of the cross-modal decoding block.
step S3.2, cascaded cross-modal fusion: first, the global-level video-guided query representation and the video modality representation are fused at the coarse-grained level by element-wise multiplication to obtain a coarse-level fused video representation; this coarse-level fusion serves to suppress background video frames and emphasize foreground video frames. Then, the word-level video-guided query representation and the coarse-level fused video representation are fused at the fine-grained level through another stack of cross-modal decoding blocks: the coarse-level fused video representation is taken as the initial input of the first cross-modal decoding block, the output of each block is fed into the next, and the output of the last cross-modal decoding block is taken as the enhanced video representation; the operations inside a cross-modal decoding block of step S3.2 are as follows:
the input video representation is passed through a multi-head self-attention module; the result is used as the query of a multi-head cross-attention module, with the word-level video-guided query representation as the keys and values; the output is passed through a conventional feed-forward network, and the result is the output of the cross-modal decoding block.
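A condensed PyTorch sketch of the multi-granularity cascade interaction of step S3 follows. The decoder block mirrors a standard transformer decoder layer (self-attention, cross-attention, feed-forward); the post-norm residual placement, the block counts, and the handling of the global-level query as a token prepended to the word sequence during video-guided decoding are assumptions made for brevity rather than details taken from the patent.

```python
import torch
import torch.nn as nn

class CrossModalDecoderBlock(nn.Module):
    """Self-attention -> cross-attention (input attends to memory) -> feed-forward."""
    def __init__(self, d=512, heads=8):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.ReLU(), nn.Linear(4 * d, d))
        self.n1, self.n2, self.n3 = nn.LayerNorm(d), nn.LayerNorm(d), nn.LayerNorm(d)

    def forward(self, x, memory):
        x = self.n1(x + self.self_attn(x, x, x)[0])
        x = self.n2(x + self.cross_attn(x, memory, memory)[0])
        return self.n3(x + self.ffn(x))

class CascadeInteraction(nn.Module):
    """Step S3: video-guided query decoding, coarse fusion by element-wise
    multiplication with the global query, then fine-grained cross-modal decoding."""
    def __init__(self, d=512, blocks=2):
        super().__init__()
        self.query_decoder = nn.ModuleList([CrossModalDecoderBlock(d) for _ in range(blocks)])
        self.video_decoder = nn.ModuleList([CrossModalDecoderBlock(d) for _ in range(blocks)])

    def forward(self, video, words, global_q):
        # S3.1: video-guided query decoding (queries attend to the video)
        q = torch.cat([global_q.unsqueeze(1), words], dim=1)   # prepend the global-level token
        for blk in self.query_decoder:
            q = blk(q, video)
        global_guided, word_guided = q[:, 0], q[:, 1:]
        # S3.2: coarse-level fusion suppresses background frames ...
        v = video * global_guided.unsqueeze(1)
        # ... then fine-grained fusion with the word-level guided queries
        for blk in self.video_decoder:
            v = blk(v, word_guided)
        return v
```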
Step S4: for the video representation obtained after multi-granularity cascade interaction, predicting the time sequence position of a text query corresponding target video fragment by adopting an attention-based time sequence position regression module;
the attention-based time sequence position regression module in the step S4 characterizes the video sequence subjected to the multi-granularity cascade interactionObtaining the time sequence attention score of the video through a multilayer perceptron and a SoftMax active layer(ii) a Then the enhanced video is characterizedAnd time series attention pointsObtaining a representation of the target segment by means of an attention pooling layer(ii) a Finally, the characterization of the target fragmentNormalizing the time sequence center coordinates of the target segment by the multilayer perceptronAnd segment durationDirect regression was performed.
The particular attention-based time series position regression is represented as:
wherein,in order to enhance video characterization, i.e., video sequence characterization through multi-granularity cascade interaction, the attention pooling layer is used for converging video sequence characterization,
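The attention-based temporal position regression of step S4 can be sketched as follows; the two-layer perceptrons and the use of a sigmoid to keep the regressed center and duration normalized to [0, 1] are assumptions.

```python
import torch
import torch.nn as nn

class AttentionRegressor(nn.Module):
    """Step S4: temporal attention scores -> attention pooling -> regression of
    the normalized segment center and duration."""
    def __init__(self, d=512):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 1))
        self.regress = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, 2))

    def forward(self, enhanced):                                         # (B, T, d)
        attn = torch.softmax(self.score(enhanced).squeeze(-1), dim=1)    # (B, T) attention scores
        segment = torch.bmm(attn.unsqueeze(1), enhanced).squeeze(1)      # attention pooling -> (B, d)
        center, width = torch.sigmoid(self.regress(segment)).unbind(-1)  # normalized center / duration
        return center, width, attn
```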
step S5: for the cross-modal time sequence behavior positioning model based on the multi-granularity cascade interaction network formed in steps S1-S4, train the model on a training sample set; the total loss function used in training comprises an attention alignment loss and a boundary loss, where the boundary loss combines a smooth L1 loss with a temporal generalized IoU loss so as to better match the evaluation criterion of the temporal localization task, and the training sample set consists of a number of {video, query, target video segment temporal position annotation} triple samples.
The training of the model in step S5 includes the following steps:
step S5.1, calculating the attention alignment loss: for each frame i, the logarithm of its temporal attention score is multiplied by an indicator value that equals 1 if the i-th frame of the video lies inside the annotated temporal segment and 0 otherwise; these products are accumulated over the sampled frames, and the accumulated value is divided by the accumulated indicator values to compute the loss. The attention alignment loss encourages the video frames inside the annotated temporal segment to receive higher attention scores; its calculation can be expressed as
L_align = -( Σ_{i=1..T} y_i · log a_i ) / ( Σ_{i=1..T} y_i ),
where T is the number of sampled frames, a_i is the temporal attention score of the i-th frame, and y_i equals 1 if the i-th frame of the video lies inside the annotated temporal segment and 0 otherwise.
step S5.2, calculating the boundary loss by combining a smooth L1 loss and a temporal generalized IoU loss: a first smooth L1 loss is computed between the normalized temporal center coordinate of the predicted segment and that of the annotated segment, a second smooth L1 loss is computed between the duration of the predicted segment and that of the annotated segment, and the sum of the two is the smooth L1 term; the temporal generalized IoU loss is one minus the generalized IoU between the regressed segment and the corresponding annotated segment, where the generalized IoU is computed from the intersection-over-union of the two segments and the minimum time window covering both of them; the smooth L1 term and the temporal generalized IoU loss are summed to give the boundary loss.
The loss terms are balanced by weight hyperparameters, and the model parameters are updated with the optimizer during the training phase.
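Assuming predictions and annotations are expressed as normalized (center, duration) pairs, the two loss terms of step S5 can be sketched as below; the exact weighting of the terms is an assumption, since the weight hyperparameters are only named, not valued, in the description.

```python
import torch
import torch.nn.functional as F

def attention_alignment_loss(attn, inside):
    """attn: (B, T) temporal attention scores; inside: (B, T) with 1 for frames that
    lie in the annotated segment, else 0. Encourages high attention inside the segment."""
    per_sample = -(inside * torch.log(attn + 1e-8)).sum(dim=1) / inside.sum(dim=1).clamp(min=1)
    return per_sample.mean()

def boundary_loss(pred_c, pred_w, gt_c, gt_w, giou_weight=1.0):
    """Smooth L1 on center and duration plus a 1-D generalized IoU term."""
    l1 = F.smooth_l1_loss(pred_c, gt_c) + F.smooth_l1_loss(pred_w, gt_w)
    p0, p1 = pred_c - pred_w / 2, pred_c + pred_w / 2          # predicted (start, end)
    g0, g1 = gt_c - gt_w / 2, gt_c + gt_w / 2                  # annotated (start, end)
    inter = (torch.min(p1, g1) - torch.max(p0, g0)).clamp(min=0)
    union = ((p1 - p0) + (g1 - g0) - inter).clamp(min=1e-8)
    hull = (torch.max(p1, g1) - torch.min(p0, g0)).clamp(min=1e-8)  # smallest window covering both
    giou = inter / union - (hull - union) / hull
    return l1 + giou_weight * (1 - giou).mean()
```

The total training loss would then be a weighted sum of attention_alignment_loss and boundary_loss, with the weights treated as hyperparameters and the parameters updated by the Adam optimizer as described above.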
The accuracy of the method of the present invention is compared with other representative methods on the TACoS test set, as shown in Table 1, using the "R@n, IoU=m" evaluation criterion with n = 1 and m = {0.1, 0.3, 0.5}.
TABLE 1
Corresponding to the embodiment of the cross-modal time sequence behavior positioning method, the invention also provides an embodiment of a cross-modal time sequence behavior positioning device of the multi-granularity cascade interactive network.
Referring to fig. 4, the cross-modal timing behavior positioning apparatus of the multi-granularity cascading interactive network provided in the embodiment of the present invention includes one or more processors, and is configured to implement the cross-modal timing behavior positioning method of the multi-granularity cascading interactive network in the embodiment.
The cross-modal time sequence behavior positioning device of the multi-granularity cascade interactive network can be applied to any equipment with data processing capability, such as a computer. The device embodiments may be implemented by software, by hardware, or by a combination of the two. Taking a software implementation as an example, the device, as a logical device, is formed by the processor of the equipment reading the corresponding computer program instructions from non-volatile memory into memory and running them. In terms of hardware, fig. 4 shows the hardware structure of the equipment in which the cross-modal time sequence behavior positioning device of the multi-granularity cascade interactive network is located; besides the processor, memory, network interface and non-volatile memory shown in fig. 4, the equipment may also include other hardware according to its actual function, which is not described here again.
The implementation process of the functions and actions of each unit in the above device is specifically described in the implementation process of the corresponding step in the above method, and is not described herein again.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules can be selected according to actual needs to achieve the purpose of the scheme of the invention. One of ordinary skill in the art can understand and implement it without inventive effort.
The embodiment of the present invention further provides a computer-readable storage medium, on which a program is stored, where the program, when executed by a processor, implements the cross-modal time series behavior positioning method of the multi-granularity cascading interactive network in the above embodiments.
The computer readable storage medium may be an internal storage unit, such as a hard disk or a memory, of any data processing capability device described in any of the foregoing embodiments. The computer readable storage medium may also be any external storage device of a device with data processing capabilities, such as a plug-in hard disk, a Smart Media Card (SMC), an SD Card, a Flash memory Card (Flash Card), etc. provided on the device. Further, the computer readable storage medium may include both an internal storage unit and an external storage device of any data processing capable device. The computer-readable storage medium is used for storing the computer program and other programs and data required by the arbitrary data processing-capable device, and may also be used for temporarily storing data that has been output or is to be output.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A cross-modal time sequence behavior positioning method of a multi-granularity cascade interactive network is characterized by comprising the following steps:
step S1: giving an untrimmed video sample, performing initial extraction of a video representation by using a visual pre-training model, and performing context-aware temporal dependency coding on the initially extracted video representation in a local-global manner to obtain a final video representation;
step S2: for the text query corresponding to the untrimmed video, performing word embedding initialization on each word in the query text by adopting a pre-trained word embedding model, and then performing context coding by adopting a multi-layer bidirectional long short-term memory network to obtain a word-level representation and a global-level representation of the text query;
step S3: for the extracted video representation and text query representation, performing interaction between the video modality and the text query modality by adopting a multi-granularity cascade interaction network to obtain a query-guided enhanced video representation;
step S4: for the enhanced video representation obtained after multi-granularity cascade interaction, predicting the temporal position of the target video segment corresponding to the text query by adopting an attention-based temporal position regression module;
step S5: for the cross-modal time sequence behavior positioning model based on the multi-granularity cascade interaction network formed in steps S1-S4, training the model by utilizing a training sample set, wherein the total loss function adopted in the training comprises an attention alignment loss and a boundary loss, and the boundary loss comprises a smooth L1 loss and a temporal generalized IoU loss.
2. The method according to claim 1, wherein in step S1, video frame features are extracted in an off-line manner based on the visual pre-training model and T frames are uniformly sampled; a set of video representations is then obtained through a linear transformation layer, in which each element is the representation of the i-th frame, and context-aware temporal dependency coding is performed on the video representation in a local-global manner.
3. The method as claimed in claim 2, wherein in the local-global context-aware coding of step S1, local context-aware coding is first applied to the video representation to obtain a locally encoded video representation, and global context-aware coding is then applied to obtain the globally encoded video representation.
4. The cross-modal time sequence behavior positioning method of the multi-granularity cascade interactive network as claimed in claim 3, wherein the local context-aware coding and the global context-aware coding in step S1 are implemented as follows:
step S1.1, local context-aware coding adopts a stack of consecutive local transformer blocks equipped with one-dimensional shifted windows: the video representation is taken as the initial input of the first block, the output of each block is fed into the next, and the output of the last block is taken as the video representation output by local context-aware coding; the operations inside a consecutive local transformer block with a one-dimensional shifted window are as follows:
the input video representation is layer-normalized, passed through a one-dimensional window multi-head self-attention module and added to the block input; the result is layer-normalized, passed through a multi-layer perceptron and added back; the result is layer-normalized, passed through a one-dimensional shifted-window multi-head self-attention module and added back; finally, the result is layer-normalized, passed through a multi-layer perceptron and added back, and the outcome is output as the video representation of this consecutive local transformer block with a one-dimensional shifted window;
step S1.2, global context-aware coding comprises a stack of conventional transformer blocks: the locally encoded video representation is taken as the initial input of the first conventional transformer block, the output of each block is fed into the next, and the output of the last conventional transformer block is taken as the video representation output by global context-aware coding; the operations inside a conventional transformer block are as follows:
the input video representation is passed through a conventional multi-head self-attention module, added to the block input and layer-normalized; the result is passed through a multi-layer perceptron, added to its input and layer-normalized, and the outcome is the output of the conventional transformer block.
5. The method according to claim 1, wherein in step S2, the learnable word embedding vector of each word in the query text is initialized with the pre-trained word embedding model to obtain the embedding vector sequence of the text query, in which each element corresponds to one word of the query; the embedding vector sequence of the text query is context-encoded by the multi-layer bidirectional long short-term memory network to obtain the word-level text query representation, the final forward hidden state vector and the final backward hidden state vector are concatenated to obtain the global-level text query representation, and the word-level and global-level representations together form the text query representation.
6. The method as claimed in claim 1, wherein in step S3, the video representation and the text query representation are first passed through video-guided query decoding to obtain a video-guided query representation comprising a global-level video-guided query representation and a word-level video-guided query representation, and the video-guided query representation and the video modality representation then undergo cascaded cross-modal fusion to obtain the final enhanced video representation.
7. The method according to claim 6, wherein the video-guided query decoding and the cascaded cross-modal fusion in step S3 are respectively implemented as follows:
step S3.1, the video-guided query decoding adopts a group of cross-modal decoding blocks: the text query representation is taken as the initial representation and input into the first cross-modal decoding block, the result obtained is input into the second cross-modal decoding block, and so on; the output of the last cross-modal decoding block is taken as the video-guided query representation; the internal operation of a cross-modal decoding block in step S3.1 is as follows:
the acquired text query representation is passed through a multi-head self-attention module to obtain an updated text query representation; taking the updated text query representation as the query and the video representation as the keys and values, a multi-head cross-attention module produces a further text query representation; this representation is passed through a conventional feed-forward network, and the result is taken as the output of the corresponding cross-modal decoding block;
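A minimal sketch of one cross-modal decoding block follows; the residual connections and layer normalization shown here are assumptions (the claim only specifies the self-attention, cross-attention, and feed-forward stages), and the dimensions are illustrative.

```python
import torch.nn as nn

class CrossModalDecoderBlock(nn.Module):
    """Self-attention on the query-side sequence, cross-attention against the
    other modality (keys/values), then a feed-forward network. The same block
    structure is reused in step S3.1 (text attends to video) and in step S3.2
    (video attends to the word-level video-guided query representation)."""
    def __init__(self, dim=256, heads=4, mlp_ratio=4):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
                                 nn.Linear(dim * mlp_ratio, dim))
        self.n1, self.n2, self.n3 = (nn.LayerNorm(dim) for _ in range(3))

    def forward(self, x, mem):        # x: (B, Lx, C) query side; mem: (B, Lm, C)
        x = self.n1(x + self.self_attn(x, x, x)[0])
        x = self.n2(x + self.cross_attn(x, mem, mem)[0])   # x as query, mem as key/value
        x = self.n3(x + self.ffn(x))
        return x
```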
step S3.2, the cascaded cross-modal fusion first performs coarse-grained cross-modal fusion by element-wise multiplication of the global-level video-guided query representation and the video modality representation, obtaining a coarse-grained fused video representation; the word-level video-guided query representation and the coarse-grained fused video representation are then fused at the fine-grained level through another group of cross-modal decoding blocks: the coarse-grained fused video representation is taken as the initial representation and input into the first cross-modal decoding block, the result obtained is input into the second cross-modal decoding block, and so on; the output of the last cross-modal decoding block is taken as the enhanced video representation; the internal operation of a cross-modal decoding block in step S3.2 is as follows:
the acquired video representation is passed through a multi-head self-attention module to obtain an updated video representation; taking the updated video representation as the query and the word-level video-guided query representation as the keys and values, a multi-head cross-attention module produces a further video representation; this representation is passed through a conventional feed-forward network, and the result is taken as the output of the corresponding cross-modal decoding block.
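An illustrative sketch of the cascaded fusion, reusing the CrossModalDecoderBlock sketched above; the number of fine-grained decoding blocks is an assumed hyperparameter.

```python
import torch.nn as nn

class CascadedFusion(nn.Module):
    """Coarse-grained fusion by element-wise multiplication with the global-level
    video-guided query representation, followed by fine-grained fusion through a
    stack of cross-modal decoder blocks attending to the word-level
    video-guided query representation."""
    def __init__(self, dim=256, num_blocks=2):
        super().__init__()
        self.blocks = nn.ModuleList(CrossModalDecoderBlock(dim)
                                    for _ in range(num_blocks))

    def forward(self, video, q_global, q_words):
        # video: (B, T, C); q_global: (B, C); q_words: (B, L, C)
        fused = video * q_global.unsqueeze(1)      # coarse-grained, element-wise
        for blk in self.blocks:                    # fine-grained, video attends to words
            fused = blk(fused, q_words)
        return fused
```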
8. The method according to claim 1, wherein in step S4 the attention-based temporal position regression module takes the enhanced video representation output by the multi-granularity cascade interaction, obtains a temporal attention score for each frame through a multilayer perceptron and a SoftMax activation layer, pools the enhanced video representation with the temporal attention scores through an attention pooling layer to obtain the representation of the target segment, and finally regresses the normalized temporal center coordinate and the segment duration of the target segment directly from the target segment representation through a multilayer perceptron.
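A sketch of such a regression head is shown below; the two-layer perceptrons and the Sigmoid used to keep the regressed center and duration in [0, 1] are assumptions, not details fixed by the claim.

```python
import torch
import torch.nn as nn

class AttentiveRegressionHead(nn.Module):
    """Predicts a temporal attention score per frame, pools the enhanced video
    representation with those scores, and regresses the normalized segment
    center and duration from the pooled vector."""
    def __init__(self, dim=256):
        super().__init__()
        self.score = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                   nn.Linear(dim, 1))
        self.regress = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(),
                                     nn.Linear(dim, 2), nn.Sigmoid())

    def forward(self, video):                      # video: (B, T, C)
        a = torch.softmax(self.score(video).squeeze(-1), dim=1)   # (B, T) attention
        pooled = torch.bmm(a.unsqueeze(1), video).squeeze(1)      # (B, C) attention pooling
        center, duration = self.regress(pooled).unbind(-1)        # normalized (c, d)
        return a, center, duration
```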
9. The method according to claim 1, wherein the training of the model in step S5 comprises the following steps:
step S5.1, calculating the attention alignment loss: for each sampled frame i, the logarithm of the corresponding temporal attention score is multiplied by an indicator value, which equals 1 when the i-th frame of the video lies inside the temporally annotated segment and 0 otherwise; the products are accumulated over the sampled frames, the indicator values are likewise accumulated over the sampled frames, and the loss is computed from the ratio of the two accumulated results; the calculation of the attention alignment loss can be expressed as:
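(The formula itself is not reproduced in this text; the following is a plausible reconstruction from the description above, using assumed notation: $a_i$ for the temporal attention score of the $i$-th sampled frame, $m_i \in \{0,1\}$ for the indicator, and $T$ for the number of sampled frames.)

$$
\mathcal{L}_{att} \;=\; -\,\frac{\sum_{i=1}^{T} m_i \,\log a_i}{\sum_{i=1}^{T} m_i}
$$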
step S5.2, calculating the boundary loss, which combines a smooth L1 loss and a temporal generalized intersection-over-union loss: a first smooth L1 loss is computed between the normalized temporal center coordinate of the predicted segment and that of the temporally annotated segment, a second smooth L1 loss is computed between the segment duration of the predicted segment and that of the annotated segment, and the two are summed to form the smooth L1 loss term; the temporal generalized intersection-over-union loss is obtained by adding 1 to the negative generalized intersection-over-union between the regressed segment and the corresponding annotated segment; the sum of the smooth L1 loss term and the temporal generalized intersection-over-union loss is taken as the boundary loss; the calculation of the boundary loss can be expressed as follows:
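(The formula itself is not reproduced in this text; the following is a plausible reconstruction from the description above, using assumed notation: predicted center $\hat{c}$ and duration $\hat{d}$, annotated center $c$ and duration $d$, regressed segment $\hat{s}$, annotated segment $s$, $\mathrm{SL}_1$ the smooth L1 loss, and $E$ the minimal time span covering both segments.)

$$
\mathcal{L}_{b} \;=\; \mathrm{SL}_1(\hat{c}-c) \;+\; \mathrm{SL}_1(\hat{d}-d) \;+\; 1 \;-\; \mathrm{IoU}(\hat{s},s) \;+\; \frac{|E|-|\hat{s}\cup s|}{|E|}
$$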
wherein the quantities appearing in the expression are the smooth L1 loss function, the intersection-over-union of the two segments, and the minimum time span covering both the regressed segment and the corresponding annotated segment;
10. A cross-modal time sequence behavior positioning apparatus of a multi-granularity cascade interactive network, comprising one or more processors configured to implement the cross-modal time sequence behavior positioning method of the multi-granularity cascade interactive network according to any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210052687.8A CN114064967B (en) | 2022-01-18 | 2022-01-18 | Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114064967A true CN114064967A (en) | 2022-02-18 |
CN114064967B CN114064967B (en) | 2022-05-06 |
Family
ID=80231249
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210052687.8A Active CN114064967B (en) | 2022-01-18 | 2022-01-18 | Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114064967B (en) |
Patent Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107346328A (en) * | 2017-05-25 | 2017-11-14 | 北京大学 | A kind of cross-module state association learning method based on more granularity hierarchical networks |
CN109858032A (en) * | 2019-02-14 | 2019-06-07 | 程淑玉 | Merge more granularity sentences interaction natural language inference model of Attention mechanism |
CN111309971A (en) * | 2020-01-19 | 2020-06-19 | 浙江工商大学 | Multi-level coding-based text-to-video cross-modal retrieval method |
CN111310676A (en) * | 2020-02-21 | 2020-06-19 | 重庆邮电大学 | Video motion recognition method based on CNN-LSTM and attention |
CN111782871A (en) * | 2020-06-18 | 2020-10-16 | 湖南大学 | Cross-modal video time positioning method based on space-time reinforcement learning |
CN111930999A (en) * | 2020-07-21 | 2020-11-13 | 山东省人工智能研究院 | Method for implementing text query and positioning video clip by frame-by-frame cross-modal similarity correlation |
CN112115849A (en) * | 2020-09-16 | 2020-12-22 | 中国石油大学(华东) | Video scene identification method based on multi-granularity video information and attention mechanism |
EP3933686A2 (en) * | 2020-11-27 | 2022-01-05 | Beijing Baidu Netcom Science Technology Co., Ltd. | Video processing method, apparatus, electronic device, storage medium, and program product |
CN113111837A (en) * | 2021-04-25 | 2021-07-13 | 山东省人工智能研究院 | Intelligent monitoring video early warning method based on multimedia semantic analysis |
CN113204675A (en) * | 2021-07-07 | 2021-08-03 | 成都考拉悠然科技有限公司 | Cross-modal video time retrieval method based on cross-modal object inference network |
CN113934887A (en) * | 2021-12-20 | 2022-01-14 | 成都考拉悠然科技有限公司 | No-proposal time sequence language positioning method based on semantic decoupling |
Non-Patent Citations (5)
Title |
---|
JONGHWAN MUN: "Local-Global Video-Text Interactions for Temporal Grounding", 《ARXIV》 * |
SHIZHE CHEN: "Fine-grained Video-Text Retrieval with Hierarchical Graph Reasoning", 《ARXIV》 * |
ZHENZHI WANG: "Negative Sample Matters: A Renaissance of Metric Learning for Temporal Grounding", 《ARXIV》 *
DAI SIDA (戴思达): "Research on Deep Multimodal Fusion Technology and Time-Series Analysis Algorithms", China Master's Theses Full-text Database *
ZHAO CAIRONG (赵才荣), QI DING (齐鼎), et al.: "Key Technologies of Intelligent Video Surveillance: A Survey of Person Re-identification", SCIENTIA SINICA Informationis *
Cited By (23)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114581821A (en) * | 2022-02-23 | 2022-06-03 | 腾讯科技(深圳)有限公司 | Video detection method, system, storage medium and server |
CN114581821B (en) * | 2022-02-23 | 2024-11-08 | 腾讯科技(深圳)有限公司 | Video detection method, system, storage medium and server |
CN114357124A (en) * | 2022-03-18 | 2022-04-15 | 成都考拉悠然科技有限公司 | Video paragraph positioning method based on language reconstruction and graph mechanism |
CN114792424A (en) * | 2022-05-30 | 2022-07-26 | 北京百度网讯科技有限公司 | Document image processing method and device and electronic equipment |
CN114925232A (en) * | 2022-05-31 | 2022-08-19 | 杭州电子科技大学 | Cross-modal time domain video positioning method under text segment question-answering framework |
CN115131655A (en) * | 2022-09-01 | 2022-09-30 | 浙江啄云智能科技有限公司 | Training method and device of target detection model and target detection method |
CN115187783A (en) * | 2022-09-09 | 2022-10-14 | 之江实验室 | Multi-task hybrid supervision medical image segmentation method and system based on federal learning |
CN115223086A (en) * | 2022-09-20 | 2022-10-21 | 之江实验室 | Cross-modal action positioning method and system based on interactive attention guidance and correction |
CN115223086B (en) * | 2022-09-20 | 2022-12-06 | 之江实验室 | Cross-modal action positioning method and system based on interactive attention guidance and correction |
CN115238130B (en) * | 2022-09-21 | 2022-12-06 | 之江实验室 | Time sequence language positioning method and device based on modal customization collaborative attention interaction |
CN115238130A (en) * | 2022-09-21 | 2022-10-25 | 之江实验室 | Time sequence language positioning method and device based on modal customization cooperative attention interaction |
CN116385070A (en) * | 2023-01-18 | 2023-07-04 | 中国科学技术大学 | Multi-target prediction method, system, equipment and storage medium for short video advertisement of E-commerce |
CN116385070B (en) * | 2023-01-18 | 2023-10-03 | 中国科学技术大学 | Multi-target prediction method, system, equipment and storage medium for short video advertisement of E-commerce |
CN116246213A (en) * | 2023-05-08 | 2023-06-09 | 腾讯科技(深圳)有限公司 | Data processing method, device, equipment and medium |
CN116824461B (en) * | 2023-08-30 | 2023-12-08 | 山东建筑大学 | Question understanding guiding video question answering method and system |
CN117076712B (en) * | 2023-10-16 | 2024-02-23 | 中国科学技术大学 | Video retrieval method, system, device and storage medium |
CN117076712A (en) * | 2023-10-16 | 2023-11-17 | 中国科学技术大学 | Video retrieval method, system, device and storage medium |
CN117724153A (en) * | 2023-12-25 | 2024-03-19 | 北京孚梅森石油科技有限公司 | Lithology recognition method based on multi-window cascading interaction |
CN117724153B (en) * | 2023-12-25 | 2024-05-14 | 北京孚梅森石油科技有限公司 | Lithology recognition method based on multi-window cascading interaction |
CN117876929A (en) * | 2024-01-12 | 2024-04-12 | 天津大学 | Sequential target positioning method for progressive multi-scale context learning |
CN117876929B (en) * | 2024-01-12 | 2024-06-21 | 天津大学 | Sequential target positioning method for progressive multi-scale context learning |
CN117609553A (en) * | 2024-01-23 | 2024-02-27 | 江南大学 | Video retrieval method and system based on local feature enhancement and modal interaction |
CN117609553B (en) * | 2024-01-23 | 2024-03-22 | 江南大学 | Video retrieval method and system based on local feature enhancement and modal interaction |
Also Published As
Publication number | Publication date |
---|---|
CN114064967B (en) | 2022-05-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114064967B (en) | Cross-modal time sequence behavior positioning method and device of multi-granularity cascade interactive network | |
WO2021121198A1 (en) | Semantic similarity-based entity relation extraction method and apparatus, device and medium | |
CN107832476B (en) | Method, device, equipment and storage medium for understanding search sequence | |
CN113987169A (en) | Text abstract generation method, device and equipment based on semantic block and storage medium | |
JP2023022845A (en) | Method of processing video, method of querying video, method of training model, device, electronic apparatus, storage medium and computer program | |
CN112541125B (en) | Sequence annotation model training method and device and electronic equipment | |
CN115983271B (en) | Named entity recognition method and named entity recognition model training method | |
CN113128431B (en) | Video clip retrieval method, device, medium and electronic equipment | |
CN111353311A (en) | Named entity identification method and device, computer equipment and storage medium | |
CN118132803B (en) | Zero sample video moment retrieval method, system, equipment and medium | |
CN113420212A (en) | Deep feature learning-based recommendation method, device, equipment and storage medium | |
CN114420107A (en) | Speech recognition method based on non-autoregressive model and related equipment | |
CN112446209A (en) | Method, equipment and device for setting intention label and storage medium | |
JP2023017759A (en) | Training method and training apparatus for image recognition model based on semantic enhancement | |
CN112199954A (en) | Disease entity matching method and device based on voice semantics and computer equipment | |
US20230326178A1 (en) | Concept disambiguation using multimodal embeddings | |
CN114241411B (en) | Counting model processing method and device based on target detection and computer equipment | |
CN113723077B (en) | Sentence vector generation method and device based on bidirectional characterization model and computer equipment | |
CN114882874A (en) | End-to-end model training method and device, computer equipment and storage medium | |
CN112528040B (en) | Detection method for guiding drive corpus based on knowledge graph and related equipment thereof | |
CN117874234A (en) | Text classification method and device based on semantics, computer equipment and storage medium | |
CN113297525A (en) | Webpage classification method and device, electronic equipment and storage medium | |
CN116022154A (en) | Driving behavior prediction method, device, computer equipment and storage medium | |
CN115062136A (en) | Event disambiguation method based on graph neural network and related equipment thereof | |
CN114091451A (en) | Text classification method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||