CN116127132A - Time sequence language positioning method based on cross-modal text related attention - Google Patents

Time sequence language positioning method based on cross-modal text related attention

Info

Publication number
CN116127132A
CN116127132A (application CN202310199160.2A)
Authority
CN
China
Prior art keywords: layer, attention, text, video, self
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310199160.2A
Other languages
Chinese (zh)
Inventor
何立火
邓夏迪
黄子涵
唐杰浩
王笛
高新波
路文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202310199160.2A priority Critical patent/CN116127132A/en
Publication of CN116127132A publication Critical patent/CN116127132A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844 Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Library & Information Science (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a time sequence language positioning method based on cross-modal text related attention, which mainly solves the problem that cross-modal fusion of text and video in the prior art lacks semantic relevance. The scheme is as follows: acquire a training data set and a test data set, and extract the video and text features of the training data set; construct a time sequence language positioning model based on cross-modal text related attention, which fuses the text features and the video features to obtain fusion features and uses the interaction between text semantic information and the attention of the fusion features to realize temporal localization in the video; train the time sequence language positioning model with the video and text features of the training data set; input the test data set into the trained model to obtain the time sequence language positioning result of cross-modal text related attention. The invention can retrieve rich relevant feature information in a variety of complex cross-modal videos, improves retrieval precision, and can be used to retrieve the segment of a video corresponding to a given text.

Description

Time sequence language positioning method based on cross-modal text related attention
Technical Field
The invention belongs to the technical field of multimodal video processing, and particularly relates to a cross-modal time sequence language positioning method that can be used to retrieve the video segment corresponding to a given text.
Background
With the rapid development of internet technology in recent years, video data, as an important component of multimedia data, has grown exponentially. Existing video understanding technology can already achieve a preliminary understanding of video content, but as demand increases, time sequence language positioning is becoming an important and urgent problem in the video understanding field. Given a text query, the time sequence language positioning task finds, in a long video, the segment that best matches the meaning of the text and returns the start time and end time of that segment. The task has broad application prospects and has attracted wide attention in both industry and academia.
However, the time sequence language positioning task also presents certain challenges. First, the large modal gap between the query text and the video greatly increases the difficulty of aligning the text with video segments. Second, overlapping video clips often have similar video features, which strongly interferes with distinguishing how similar different video clips are to the query text. Third, different people have different understandings of when an action occurs, which leads to inaccurate data annotation.
The patent document with application publication number CN 115238130 A discloses a time sequence language positioning method and device based on modality-customized collaborative attention interaction. The method first obtains paired untrimmed video-text query data and constructs a data set for the time sequence language positioning task; it then extracts video representations from the video and combines word-level and sentence-level query representations extracted from the text into a multi-granularity query representation; the video representations and the multi-granularity query representations are fed together into a modality-customized collaborative attention interaction module to obtain semantically aligned video representations after cross-modal video-text fusion; finally, multi-branch tasks comprising dense temporal boundary regression, semantic matching score prediction and intersection-over-union regression produce the corresponding time sequence language positioning results from the fused representations. This method improves the performance of the time sequence language positioning task by using collaborative attention, but the text query representation used in the collaborative attention interaction is semantically limited, so it cannot correspond well to the video content, the fused video-text representation cannot acquire enough cross-modal information, and the video retrieval precision is low.
Disclosure of Invention
The invention aims to overcome the defects in the prior art and provides a time sequence language positioning method based on cross-modal text related attention so as to acquire enough cross-modal information and improve the video retrieval precision.
In order to achieve the above purpose, the technical scheme of the invention comprises the following steps:
(1) Acquire raw, untrimmed video and text query data and divide them at a ratio of 3:1 into a training data set and a test data set;
(2) Extract video features V corresponding to the videos in the training data set through a video encoder, and extract text features S corresponding to the texts in the training data set through word segmentation and a word-embedding encoder;
(3) Construct a time sequence language positioning model based on cross-modal text related attention:
(3a) Construct fully connected layers L_v and L_S and self-attention feature extractors N_v and N_s for the video features V and the text features S respectively; pass the video features V through L_v and N_v in turn to obtain video self-attention features V_t, and pass the text features S through L_S and N_s in turn to obtain text self-attention features S_t;
(3b) Calculate a cross-modal text related attention matrix E from the video self-attention features V_t and the text self-attention features S_t;
(3c) Build a fully connected layer L_E of size 300×384 and a cross-modal fusion self-attention feature extractor N_E; pass the cross-modal text related attention matrix E through L_E and N_E in turn to obtain the cross-modal related attention feature code E_t;
(3d) Input the cross-modal related attention feature code E_t into an existing bidirectional gated recurrent unit network to obtain the temporal localization feature E_tg;
(3e) Establish a start-time localization fully connected layer L_Q and an end-time localization fully connected layer L_J, each of size 768×1;
(3f) Pass the temporal localization feature E_tg through L_Q and L_J respectively to obtain the corresponding start-time localization feature E_Q and end-time localization feature E_J, then select the corresponding start time and end time through normalization, forming the time sequence language positioning model based on cross-modal text related attention;
(4) Set the Kullback-Leibler divergence as the loss function of the time sequence language positioning model constructed in step (3), input the training data set into the constructed model, and update the model parameters with an optimizer until the loss function converges, obtaining a trained time sequence language positioning model based on cross-modal text related attention;
(5) Input the test data set into the trained time sequence language positioning model based on cross-modal text related attention for testing, and output the target segment temporal boundary regression value with the highest confidence as the time sequence language positioning result of the test data set.
Compared with the prior art, the invention has the following advantages:
according to the invention, the text features and the video features are fused to obtain the fusion features, the interaction between the text semantic information and the attention of the fusion features is used to realize the time sequence positioning of the video, and the start frame and the end frame are generated, so that rich relevant feature information can be obtained in various complex cross-mode video retrieval, and the retrieval precision is improved.
Drawings
FIG. 1 is a flow chart of an implementation of the present invention.
FIG. 2 is a block diagram of a temporal language localization model of cross-modal text-dependent attention of the present invention.
Fig. 3 is a block diagram of a self-attention feature extractor in the present invention.
Detailed Description
The invention will now be described in further detail with reference to the drawings and to specific embodiments.
Referring to fig. 1, this example includes the following implementation steps:
Step 1: acquire original untrimmed video and text query data, and divide them into a training data set and a test data set.
In this embodiment, the video and text inputs come from the Charades-STA dataset, which contains 6672 videos shot in daily life and 16128 video-text annotation pairs.
The video-text annotation pairs are divided into a training data set and a test data set at a ratio of about 3:1, with 12408 pairs in the training set and 3720 pairs in the test set.
Step 2: extract the video features V corresponding to the videos and the text features S corresponding to the texts in the training data set.
(2.1) Input the videos in the training data set into an existing video encoder and extract features at a rate of 4 feature blocks per second, obtaining video features V of dimension 1024 whose length v_l matches the video length.
(2.2) First segment the text query sentences in the training data set into words, then input the resulting words into an existing word-embedding encoder, obtaining text features S of dimension 300 whose length s_l matches the text length.
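For concreteness, the tensor shapes produced by step 2 are illustrated below with random stand-in values; the concrete video encoder and word-embedding encoder are not named in this disclosure, only the output dimensions 1024 and 300 and the rate of 4 feature blocks per second.

```python
import math
import torch

duration_s = 30.0                      # an example video length in seconds (stand-in)
v_l = math.ceil(4 * duration_s)        # 4 feature blocks per second -> v_l = 120 blocks
V = torch.randn(v_l, 1024)             # video features V: one 1024-dimensional vector per block
s_l = 9                                # e.g. a 9-word query after word segmentation
S = torch.randn(s_l, 300)              # text features S: one 300-dimensional embedding per word
```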
Step 3: construct the time sequence language positioning model based on cross-modal text related attention.
Referring to fig. 2, the present step is specifically implemented as follows:
(3.1) Construct the fully connected layers L_v and L_S and the self-attention feature extractors N_v and N_s for the video features V and the text features S respectively.
The self-attention feature extractor N_v of the video features and the self-attention feature extractor N_s of the text features have the same structural parameters: each self-attention feature extractor comprises a positional encoding layer, a dropout layer, a multi-head self-attention encoder P and a normalization layer connected in sequence.
The multi-head self-attention encoder P consists of 6 sequentially connected self-attention encoding layers T. Each self-attention encoding layer T comprises a first normalization layer, a self-attention module F, a first stochastic depth layer, a second normalization layer and a feed-forward layer Y connected in sequence, where the input of the first normalization layer is connected by a residual connection to the output of the first stochastic depth layer, and the output of the second normalization layer is connected by a residual connection to the output of the feed-forward layer, as shown in fig. 3.
The self-attention module F comprises a first linear layer, an attention dropout layer, a second linear layer and a first dropout layer connected in sequence.
The feed-forward layer Y comprises a third linear layer, a GELU activation function, a second dropout layer, a fourth linear layer, a third dropout layer and a second stochastic depth layer connected in sequence.
In this embodiment, the drop rate of the first stochastic depth layer is 0.3; the fully connected layer L_v of the video features V has size 1024×128 and the fully connected layer L_S of the text features S has size 300×128; the first linear layer has size 128×384, the attention dropout layer has drop rate 0.3, the second linear layer has size 128×128, and the first dropout layer has drop rate 0.3; the third linear layer has size 128×128, the second dropout layer has drop rate 0.3, the fourth linear layer has size 128×128, the third dropout layer has drop rate 0.3, and the second stochastic depth layer has drop rate 0.2.
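A minimal PyTorch sketch of one self-attention encoding layer T with the video-branch parameters quoted above is given below. The number of attention heads is not stated and is assumed to be 4, the residual placement follows the usual pre-norm convention, and interpreting the 128×384 first linear layer as a joint query-key-value projection is an assumption; this is a sketch under those assumptions, not the exact disclosed implementation.

```python
import torch
import torch.nn as nn


class DropPath(nn.Module):
    """Stochastic ("random") depth: randomly zeroes the whole residual branch during training."""
    def __init__(self, p=0.0):
        super().__init__()
        self.p = p

    def forward(self, x):
        if not self.training or self.p == 0.0:
            return x
        keep = 1.0 - self.p
        mask = x.new_empty(x.shape[0], 1, 1).bernoulli_(keep) / keep
        return x * mask


class SelfAttentionModuleF(nn.Module):
    """Self-attention module F: qkv projection, attention dropout, output projection, dropout."""
    def __init__(self, dim=128, heads=4, attn_drop=0.3, drop=0.3):
        super().__init__()
        self.heads = heads
        self.qkv = nn.Linear(dim, dim * 3)       # first linear layer (128 x 384 for the video branch)
        self.attn_drop = nn.Dropout(attn_drop)   # attention dropout layer
        self.proj = nn.Linear(dim, dim)          # second linear layer (128 x 128)
        self.drop = nn.Dropout(drop)             # first dropout layer

    def forward(self, x):
        b, n, d = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(b, n, self.heads, d // self.heads).transpose(1, 2) for t in (q, k, v))
        attn = self.attn_drop(((q @ k.transpose(-2, -1)) / (d // self.heads) ** 0.5).softmax(dim=-1))
        out = (attn @ v).transpose(1, 2).reshape(b, n, d)
        return self.drop(self.proj(out))


class SelfAttentionEncodingLayerT(nn.Module):
    """One self-attention encoding layer T: pre-norm attention and feed-forward blocks with residuals."""
    def __init__(self, dim=128, attn_drop=0.3, drop=0.3, path_drop=(0.3, 0.2)):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)                      # first normalization layer
        self.attn = SelfAttentionModuleF(dim, attn_drop=attn_drop, drop=drop)
        self.path1 = DropPath(path_drop[0])                 # first stochastic depth layer
        self.norm2 = nn.LayerNorm(dim)                      # second normalization layer
        self.ffn = nn.Sequential(                           # feed-forward layer Y
            nn.Linear(dim, dim), nn.GELU(), nn.Dropout(drop),
            nn.Linear(dim, dim), nn.Dropout(drop))
        self.path2 = DropPath(path_drop[1])                 # second stochastic depth layer

    def forward(self, x):
        x = x + self.path1(self.attn(self.norm1(x)))        # residual around the attention branch
        x = x + self.path2(self.ffn(self.norm2(x)))         # residual around the feed-forward branch
        return x


# A full extractor N_v or N_s would stack 6 such layers behind a positional encoding layer and a
# dropout layer, followed by a final normalization layer.
```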
(3.2) Pass the video features V through L_v and N_v in turn to obtain the video self-attention features V_t, and pass the text features S through L_S and N_s in turn to obtain the text self-attention features S_t.
(3.2.1) Pass the video features V through the fully connected layer L_v to compress the feature dimension to 128, then through the video self-attention feature extractor N_v to obtain the video self-attention feature code V_t of size 128×v_l.
(3.2.2) Pass the text features S through the fully connected layer L_S to compress the feature dimension to 128, then through the text self-attention feature extractor N_s to obtain the text self-attention feature code S_t of size 128×s_l.
(3.3) Calculate the cross-modal text related attention matrix E from the video self-attention features V_t and the text self-attention features S_t:
(3.3.1) Compute the transition matrix M from the video self-attention features V_t and the text self-attention features S_t:
M = S_t × V_t^T
(3.3.2) Compute the cross-modal text related attention matrix E from the transition matrix M and the text features S:
E = (S^T × M)^T
where E has size v_l × 300, v_l is the video length, and T denotes matrix transposition.
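The two matrix products of (3.3) can be checked with the following minimal sketch (batch-first tensors with random stand-in values; the lengths are illustrative):

```python
import torch

B, v_l, s_l = 1, 120, 9                       # batch size, video length, text length (stand-ins)
V_t = torch.randn(B, v_l, 128)                # video self-attention features V_t
S_t = torch.randn(B, s_l, 128)                # text self-attention features S_t
S = torch.randn(B, s_l, 300)                  # original text features S

M = S_t @ V_t.transpose(1, 2)                 # transition matrix M = S_t x V_t^T, shape (B, s_l, v_l)
E = (S.transpose(1, 2) @ M).transpose(1, 2)   # E = (S^T x M)^T, shape (B, v_l, 300)
```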
(3.4) Build the fully connected layer L_E and the cross-modal fusion self-attention feature extractor N_E for the cross-modal text related attention matrix E.
The fully connected layer L_E of the cross-modal text related attention matrix E has size 300×384.
The cross-modal fusion self-attention feature extractor N_E comprises a positional encoding layer, a dropout layer, a multi-head self-attention encoder P' and a normalization layer connected in sequence. The multi-head self-attention encoder P' consists of 6 sequentially connected self-attention encoding layers T'. Each self-attention encoding layer T' comprises a first normalization layer, a self-attention module F', a first stochastic depth layer, a second normalization layer and a feed-forward layer Y' connected in sequence, where the input of the first normalization layer is connected by a residual connection to the output of the first stochastic depth layer, and the output of the second normalization layer is connected by a residual connection to the output of the feed-forward layer, as shown in fig. 3.
The self-attention module F' comprises a first linear layer, an attention dropout layer, a second linear layer and a first dropout layer connected in sequence.
The feed-forward layer Y' comprises a third linear layer, a GELU activation function, a second dropout layer, a fourth linear layer, a third dropout layer and a second stochastic depth layer connected in sequence.
In this embodiment, the drop rate of the first stochastic depth layer is 0.1; the first linear layer has size 384×1152, the attention dropout layer has drop rate 0.2, the second linear layer has size 384×384, and the first dropout layer has drop rate 0.2; the third linear layer has size 384×384, the second dropout layer has drop rate 0.2, the fourth linear layer has size 1152×384, the third dropout layer has drop rate 0.2, and the second stochastic depth layer has drop rate 0.1.
(3.5) Pass the cross-modal text related attention matrix E through the fully connected layer L_E and the cross-modal fusion self-attention feature extractor N_E in turn to obtain the cross-modal related attention feature code E_t.
The cross-modal text related attention matrix E is first passed through the fully connected layer L_E, compressing the feature dimension to 384, and then through the cross-modal fusion self-attention feature extractor N_E to obtain the cross-modal related attention feature code E_t of size 384×v_l.
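A hedged stand-in for L_E and N_E using PyTorch's built-in transformer encoder is sketched below; the stated fusion-branch sizes (384 model dimension, 1152 feed-forward dimension, GELU, drop rate 0.2, 6 layers) are used, while the positional encoding layer and the stochastic depth layers are omitted and the number of heads is an assumption.

```python
import torch
import torch.nn as nn

fc_E = nn.Linear(300, 384)                        # fully connected layer L_E (300 x 384)
n_E = nn.TransformerEncoder(                      # simplified stand-in for N_E
    nn.TransformerEncoderLayer(d_model=384, nhead=8, dim_feedforward=1152,
                               dropout=0.2, activation='gelu',
                               batch_first=True, norm_first=True),
    num_layers=6)

E = torch.randn(1, 120, 300)                      # cross-modal text related attention matrix from (3.3)
E_t = n_E(fc_E(E))                                # cross-modal related attention feature code, (1, 120, 384)
```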
(3.6) Input the cross-modal related attention feature code E_t into an existing bidirectional gated recurrent unit network to obtain the temporal localization feature E_tg.
(3.6.1) Input the cross-modal related attention feature code E_t into the existing bidirectional gated recurrent unit network. The current gated recurrent unit G_t obtains the reset gate r and the update gate z from the previous gated recurrent unit G_{t-1} and the input x_t of the current node:
r = σ(W_r · [h_{t-1}, x_t])
z = σ(W_z · [h_{t-1}, x_t])
where W_r is the reset gate weight, W_z is the update gate weight, and σ is the activation function.
(3.6.2) Obtain the reset state h_{t-1}' from the reset gate r and the hidden state h_{t-1} of the previous moment:
h_{t-1}' = h_{t-1} ⊙ r
(3.6.3) Concatenate the reset state h_{t-1}' with the current input x_t and scale the data into the range [-1, 1] with the tanh activation function to obtain the intermediate quantity h':
h' = tanh(W · [h_{t-1}', x_t])
(3.6.4) Obtain the output h_t of the current gated recurrent unit from the update gate z and the intermediate quantity h':
h_t = (1 - z) ⊙ h_{t-1} + z ⊙ h'
(3.6.5) Combine the outputs h_t linearly to obtain the temporal localization feature E_tg. In this embodiment, the temporal localization feature E_tg has size 768×v_l.
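These update equations are what a standard GRU cell computes, so step (3.6) can in practice be realized with PyTorch's built-in bidirectional GRU; a hidden size of 384 per direction is assumed here so that the concatenated output is 768-dimensional, matching the stated size of E_tg.

```python
import torch
import torch.nn as nn

E_t = torch.randn(1, 120, 384)        # cross-modal related attention feature code (B, v_l, 384), stand-in

bigru = nn.GRU(input_size=384, hidden_size=384, num_layers=1,
               batch_first=True, bidirectional=True)
E_tg, _ = bigru(E_t)                  # temporal localization feature (B, v_l, 768)
print(E_tg.shape)                     # torch.Size([1, 120, 768])
```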
(3.7) Establish the start-time localization fully connected layer L_Q and the end-time localization fully connected layer L_J. In this embodiment, both L_Q and L_J have size 768×1.
(3.8) Pass the temporal localization feature E_tg through the start-time localization fully connected layer L_Q and the end-time localization fully connected layer L_J to obtain the corresponding start-time localization feature E_Q and end-time localization feature E_J.
Passing the temporal localization feature E_tg through L_Q compresses the feature dimension and yields the start-time localization feature E_Q of size v_l; passing E_tg through L_J compresses the feature dimension and yields the end-time localization feature E_J of size v_l.
(3.9) The start-time localization feature E_Q and the end-time localization feature E_J are normalized, and the corresponding start time and end time are selected, forming the time sequence language positioning model based on cross-modal text related attention.
In this embodiment, the normalization function is the Softmax function:
Softmax(g_i) = exp(g_i) / Σ_{j=1}^{n} exp(g_j)
where i denotes one of the n classes and g_i is the value of that class. The Softmax function converts the multi-class output values into a probability distribution over the range [0, 1] whose components sum to 1.
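A minimal sketch of steps (3.7)-(3.9) is given below, producing the start and end probability distributions and selecting the corresponding feature-block indices; the conversion of indices to seconds is an assumption noted in the comments.

```python
import torch
import torch.nn as nn

E_tg = torch.randn(1, 120, 768)                   # temporal localization feature (B, v_l, 768), stand-in

L_Q = nn.Linear(768, 1)                           # start-time localization fully connected layer
L_J = nn.Linear(768, 1)                           # end-time localization fully connected layer

p_start = L_Q(E_tg).squeeze(-1).softmax(dim=-1)   # start probability distribution over the v_l blocks
p_end = L_J(E_tg).squeeze(-1).softmax(dim=-1)     # end probability distribution over the v_l blocks

start_idx = p_start.argmax(dim=-1)                # block index with the highest start probability
end_idx = p_end.argmax(dim=-1)                    # block index with the highest end probability
# Converting indices to seconds by dividing by the 4 feature blocks per second of step 2 is an
# assumption; the disclosure only states that the start and end times are selected by normalization.
start_time, end_time = start_idx / 4.0, end_idx / 4.0
```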
Step 4: train the constructed time sequence language positioning model based on cross-modal text related attention.
(4.1) Set the Kullback-Leibler divergence as the loss function of the time sequence language positioning model. The loss is defined by three equations (rendered as images in the original filing): the final positioning loss L is computed with the Kullback-Leibler divergence function D_KL between the actual starting probability distribution and the predicted starting probability distribution P_start, and between the actual ending probability distribution and the predicted ending probability distribution P_end.
(4.2) Input the training data set into the constructed time sequence language positioning model and update the model parameters with an optimizer until the loss function converges, obtaining the trained time sequence language positioning model based on cross-modal text related attention.
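A common PyTorch realization of such a Kullback-Leibler divergence loss is sketched below; because the loss equations are rendered as images in the original filing, the summation of the start-side and end-side terms is an assumption.

```python
import torch
import torch.nn.functional as F

def kl_localization_loss(p_start, p_end, gt_start, gt_end, eps=1e-12):
    """KL-divergence localization loss sketch: D_KL(ground truth || prediction) for start and end.

    p_start, p_end : predicted (B, v_l) probability distributions (already Softmax-normalized)
    gt_start, gt_end : actual (ground-truth) (B, v_l) probability distributions
    Summing the two terms is an assumption; the exact combination in the filing is not recoverable.
    """
    loss_start = F.kl_div((p_start + eps).log(), gt_start, reduction='batchmean')
    loss_end = F.kl_div((p_end + eps).log(), gt_end, reduction='batchmean')
    return loss_start + loss_end
```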
Step 5: obtain the time sequence language positioning result of cross-modal text related attention.
Input the test data set into the trained time sequence language positioning model based on cross-modal text related attention, and output the target segment temporal boundary regression value with the highest confidence as the time sequence language positioning result of the test data set.
The technical effects of the invention can be further illustrated by the following simulation experiments:
1. Simulation conditions
The hardware platform of the simulation experiment is: an Intel(R) Core(TM) i7-7800X CPU with a base frequency of 3.50 GHz, 48 GB of memory, and two NVIDIA GeForce TITAN XP graphics cards.
The software platform of the simulation experiment is: Ubuntu 16.04, PyTorch 1.12.0, Python 3.9.
The video and text inputs used in the simulation experiment come from the Charades-STA dataset, which contains 6672 videos shot in daily life, most of them showing indoor activities; the average video duration is 29.76 seconds, each video has about 2.4 annotated target segments, and the average target segment duration is 8.2 seconds. The dataset also contains 16128 video-text annotation pairs, divided into a training portion of 12408 pairs and a test portion of 3720 pairs.
2. Simulation experiment contents
The invention and four existing time sequence language positioning methods (DRN, LGI, BPNet and ACRM) are used to perform time sequence language positioning on the Charades-STA data set. Under the Recall@1 evaluation metric, the percentage of videos whose temporal intersection-over-union exceeds 0.5 and 0.7 is counted and compared for each method to evaluate its time sequence language positioning ability. The results are shown in Table 1:
table 1 comparative table of evaluation results of the present invention and the prior art
Figure BDA0004108434350000081
The DRN method is the dense regression network proposed by R. Zeng et al. in "Dense Regression Network for Video Grounding", 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR);
the LGI method is a Local-global video-text interaction method proposed by MUN J et al in Local-global video-text interactions for temporal grounding, in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition;
the BPnet method suggests a network for the boundaries proposed by shaoming Xiao et al in "Boundary Proposal Network for Two-Stage Natural Language Video Localization, in Proceedings of the AAAI Conference on Artificial Intelligence";
the ACRM method is a Frame-by-Frame Cross-modal matching method proposed by H.Tang et al in Frame-Wise Cross-Modal Matching for Video Moment Retrieval, in IEEE Transactions on Multimedia;
as can be seen from table 1, when the result with the highest confidence is selected under the recall@1 evaluation index, the video quantity of the invention with the cross ratio greater than 0.5 and 0.7 respectively accounts for higher percentages than those of the prior art DRN, LGI, BPNet and ACRM, which shows that the invention has better time sequence language positioning capability on the Charades-STA data set.

Claims (12)

1. A time sequence language positioning method based on cross-modal text related attention, characterized by comprising the following steps:
(1) Acquiring raw, untrimmed video and text query data and dividing them at a ratio of 3:1 into a training data set and a test data set;
(2) Extracting video features V corresponding to the videos in the training data set through a video encoder, and extracting text features S corresponding to the texts in the training data set through word segmentation and a word-embedding encoder;
(3) Constructing a time sequence language positioning model based on cross-modal text related attention:
(3a) Constructing fully connected layers L_v and L_S and self-attention feature extractors N_v and N_s for the video features V and the text features S respectively; passing the video features V through L_v and N_v in turn to obtain video self-attention features V_t, and passing the text features S through L_S and N_s in turn to obtain text self-attention features S_t;
(3b) Calculating a cross-modal text related attention matrix E from the video self-attention features V_t and the text self-attention features S_t;
(3c) Building a fully connected layer L_E of size 300×384 and a cross-modal fusion self-attention feature extractor N_E, and passing the cross-modal text related attention matrix E through L_E and N_E in turn to obtain the cross-modal related attention feature code E_t;
(3d) Inputting the cross-modal related attention feature code E_t into an existing bidirectional gated recurrent unit network to obtain the temporal localization feature E_tg;
(3e) Establishing a start-time localization fully connected layer L_Q and an end-time localization fully connected layer L_J, each of size 768×1;
(3f) Passing the temporal localization feature E_tg through L_Q and L_J respectively to obtain the corresponding start-time localization feature E_Q and end-time localization feature E_J, then selecting the corresponding start time and end time through normalization, forming the time sequence language positioning model based on cross-modal text related attention;
(4) Setting the Kullback-Leibler divergence as the loss function of the time sequence language positioning model constructed in step (3), inputting the training data set into the constructed model, and updating the model parameters with an optimizer until the loss function converges, obtaining a trained time sequence language positioning model based on cross-modal text related attention;
(5) Inputting the test data set into the trained time sequence language positioning model based on cross-modal text related attention for testing, and outputting the target segment temporal boundary regression value with the highest confidence as the time sequence language positioning result of the test data set.
2. The method according to claim 1, wherein extracting the video features V corresponding to the videos in the training data set through a video encoder in step (2) comprises inputting the untrimmed videos into the video encoder and extracting features at a rate of 4 feature blocks per second, obtaining video features V of dimension 1024 whose length v_l matches the video length.
3. The method according to claim 1, wherein extracting the text features S corresponding to the texts in the training data set through word segmentation and a word-embedding encoder comprises first segmenting the text query sentence into words and then inputting the resulting words into the word-embedding encoder, obtaining text features S of dimension 300 whose length s_l matches the text length.
4. The method according to claim 1, wherein the fully connected layer L_v of the video features and the fully connected layer L_S of the text features in step (3a) have the following parameters:
the fully connected layer L_v of the video features V has size 1024×128;
the fully connected layer L_S of the text features S has size 300×128.
5. The method according to claim 1, wherein the self-attention feature extractor N_v of the video features and the self-attention feature extractor N_s of the text features in step (3a) have the same structural parameters: each self-attention feature extractor comprises a positional encoding layer, a dropout layer, a multi-head self-attention encoder P and a normalization layer connected in sequence;
the multi-head self-attention encoder P consists of 6 sequentially connected self-attention encoding layers T, each self-attention encoding layer T comprising a first normalization layer, a self-attention module F, a first stochastic depth layer, a second normalization layer and a feed-forward layer Y connected in sequence, wherein the input of the first normalization layer is connected by a residual connection to the output of the first stochastic depth layer, and the output of the second normalization layer is connected by a residual connection to the output of the feed-forward layer;
the drop rate of the first stochastic depth layer is 0.3.
6. The method according to claim 5, wherein:
the self-attention module F comprises a first linear layer, an attention dropout layer, a second linear layer and a first dropout layer connected in sequence; the first linear layer has size 128×384, the attention dropout layer has drop rate 0.3, the second linear layer has size 128×128, and the first dropout layer has drop rate 0.3;
the feed-forward layer Y comprises a third linear layer, a GELU activation function, a second dropout layer, a fourth linear layer, a third dropout layer and a second stochastic depth layer connected in sequence; the third linear layer has size 128×128, the second dropout layer has drop rate 0.3, the fourth linear layer has size 128×128, the third dropout layer has drop rate 0.3, and the second stochastic depth layer has drop rate 0.2.
7. The method according to claim 1, wherein the video self-attention features V_t and the text self-attention features S_t obtained in step (3a) have the following parameters:
the video self-attention features V_t have size 128×v_l;
the text self-attention features S_t have size 128×s_l.
8. The method according to claim 1, wherein the cross-modal text related attention matrix E in step (3b) is calculated as follows:
3b1) Obtaining the transition matrix M from the video self-attention features V_t and the text self-attention features S_t:
M = S_t × V_t^T
3b2) Obtaining the cross-modal text related attention matrix E from the transition matrix M and the text features S:
E = (S^T × M)^T
where E has size v_l × 300, v_l is the video length, and T denotes matrix transposition.
9. The method according to claim 1, wherein the cross-modal fusion self-attention feature extractor N_E in step (3c) comprises a positional encoding layer, a dropout layer, a multi-head self-attention encoder P' and a normalization layer connected in sequence; the multi-head self-attention encoder P' consists of 6 sequentially connected self-attention encoding layers T', each self-attention encoding layer T' comprising a first normalization layer, a self-attention module F', a first stochastic depth layer, a second normalization layer and a feed-forward layer Y' connected in sequence, wherein the input of the first normalization layer is connected by a residual connection to the output of the first stochastic depth layer, and the output of the second normalization layer is connected by a residual connection to the output of the feed-forward layer; the drop rate of the first stochastic depth layer is 0.1.
10. The method according to claim 9, wherein:
the self-attention module F' comprises a first linear layer, an attention dropout layer, a second linear layer and a first dropout layer connected in sequence; the first linear layer has size 384×1152, the attention dropout layer has drop rate 0.2, the second linear layer has size 384×384, and the first dropout layer has drop rate 0.2;
the feed-forward layer Y' comprises a third linear layer, a GELU activation function, a second dropout layer, a fourth linear layer, a third dropout layer and a second stochastic depth layer connected in sequence; the third linear layer has size 384×384, the second dropout layer has drop rate 0.2, the fourth linear layer has size 1152×384, the third dropout layer has drop rate 0.2, and the second stochastic depth layer has drop rate 0.1.
11. The method according to claim 1, wherein inputting the cross-modal related attention feature code E_t into an existing bidirectional gated recurrent unit network in step (3d) to obtain the temporal localization feature E_tg is implemented as follows:
3d1) Inputting the cross-modal related attention feature code E_t into the existing bidirectional gated recurrent unit network, where the current gated recurrent unit G_t obtains the reset gate r and the update gate z from the previous gated recurrent unit G_{t-1} and the input x_t of the current node:
r = σ(W_r · [h_{t-1}, x_t])
z = σ(W_z · [h_{t-1}, x_t])
where W_r is the reset gate weight, W_z is the update gate weight, and σ is the activation function;
3d2) Obtaining the reset state h_{t-1}' from the reset gate r and the hidden state h_{t-1} of the previous moment:
h_{t-1}' = h_{t-1} ⊙ r
3d3) Concatenating the reset state h_{t-1}' with the current input x_t and scaling the data into the range [-1, 1] with the tanh activation function to obtain the intermediate quantity h':
h' = tanh(W · [h_{t-1}', x_t])
3d4) Obtaining the output h_t of the current gated recurrent unit from the update gate z and the intermediate quantity h':
h_t = (1 - z) ⊙ h_{t-1} + z ⊙ h'
3d5) Combining the outputs h_t linearly to obtain the temporal localization feature E_tg.
12. The method according to claim 1, wherein the Kullback-Leibler divergence set in step (4) defines the final positioning loss L (the corresponding equations are rendered as images in the original filing) with the Kullback-Leibler divergence function D_KL between the actual starting probability distribution and the predicted starting probability distribution P_start, and between the actual ending probability distribution and the predicted ending probability distribution P_end.
CN202310199160.2A 2023-03-03 2023-03-03 Time sequence language positioning method based on cross-modal text related attention Pending CN116127132A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310199160.2A CN116127132A (en) 2023-03-03 2023-03-03 Time sequence language positioning method based on cross-modal text related attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310199160.2A CN116127132A (en) 2023-03-03 2023-03-03 Time sequence language positioning method based on cross-modal text related attention

Publications (1)

Publication Number Publication Date
CN116127132A 2023-05-16

Family

ID=86302898

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310199160.2A Pending CN116127132A (en) 2023-03-03 2023-03-03 Time sequence language positioning method based on cross-modal text related attention

Country Status (1)

Country Link
CN (1) CN116127132A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117171712A (en) * 2023-11-03 2023-12-05 中关村科学城城市大脑股份有限公司 Auxiliary information generation method, auxiliary information generation device, electronic equipment and computer readable medium
CN117171712B (en) * 2023-11-03 2024-02-02 中关村科学城城市大脑股份有限公司 Auxiliary information generation method, auxiliary information generation device, electronic equipment and computer readable medium

Similar Documents

Publication Publication Date Title
CN110119765B (en) Keyword extraction method based on Seq2Seq framework
CN111581961B (en) Automatic description method for image content constructed by Chinese visual vocabulary
CN110348016A (en) Text snippet generation method based on sentence association attention mechanism
CN111666764B (en) Automatic abstracting method and device based on XLNet
CN110807324A (en) Video entity identification method based on IDCNN-crf and knowledge graph
CN113434636B (en) Semantic-based approximate text searching method, semantic-based approximate text searching device, computer equipment and medium
CN116702091B (en) Multi-mode ironic intention recognition method, device and equipment based on multi-view CLIP
EP4390725A1 (en) Video retrieval method and apparatus, device, and storage medium
Zhong et al. Deep semantic and attentive network for unsupervised video summarization
CN114003682A (en) Text classification method, device, equipment and storage medium
CN116127132A (en) Time sequence language positioning method based on cross-modal text related attention
CN113806554A (en) Knowledge graph construction method for massive conference texts
CN113626584A (en) Automatic text abstract generation method, system, computer equipment and storage medium
CN116186328A (en) Video text cross-modal retrieval method based on pre-clustering guidance
CN111325033A (en) Entity identification method, entity identification device, electronic equipment and computer readable storage medium
EP3703061A1 (en) Image retrieval
CN116680420B (en) Low-resource cross-language text retrieval method and device based on knowledge representation enhancement
CN116186562B (en) Encoder-based long text matching method
CN117057349A (en) News text keyword extraction method, device, computer equipment and storage medium
Jiang et al. A hierarchical bidirectional LSTM sequence model for extractive text summarization in electric power systems
CN112926340B (en) Semantic matching model for knowledge point positioning
CN113157914B (en) Document abstract extraction method and system based on multilayer recurrent neural network
CN115017404A (en) Target news topic abstracting method based on compressed space sentence selection
CN114266249A (en) Mass text clustering method based on birch clustering
Yang et al. Multimodal short video rumor detection system based on contrastive learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination