CN111897913A - Semantic tree enhancement based cross-modal retrieval method for searching video from complex text - Google Patents
- Publication number
- CN111897913A (application number CN202010686024.2A)
- Authority
- CN
- China
- Prior art keywords
- node
- video
- text query
- complex text
- representing
- Prior art date
- Legal status: Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/783—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention discloses a semantic tree enhancement based cross-modal retrieval method for complex text query to video. For a complex text query sentence, each word is converted into a leaf node representation, the relations between child nodes are mined, the two child nodes with the highest dependency are merged, and the semantic tree structure of the query sentence is built recursively, yielding a semantic tree enhanced query representation. For encoding the candidate videos, preliminary video features are obtained through a CNN, and the temporal and semantic dependencies within each video are captured with a GRU and a self-attention module to obtain a robust video feature representation. The complex text query representation and the video feature representation are mapped into a common space, and the matching relationship between them is learned automatically, thereby realizing cross-modal retrieval from complex text query to video. The method of the invention not only interprets the information components in a complex text query sentence and better understands the user intention, but also improves retrieval performance to a great extent.
Description
Technical Field
The invention relates to the field of cross-modal retrieval from text query to video, and in particular to a semantic tree enhancement based cross-modal retrieval method from complex text query to video.
Background
With the exponential growth of user-generated videos on the Internet, uploading videos and searching for videos of interest have become indispensable activities in people's daily life. Cross-modal retrieval from text query to video is one of the techniques for obtaining videos of interest. Early text-to-video cross-modal retrieval methods were based on text keywords and were extensively studied and developed, but such methods only allow the user to enter a few keywords as the query. As demands on Internet video search capability have grown, keyword-based queries can hardly express the user's search intention fully, which degrades the search experience. In response to this problem, video retrieval supporting complex text queries is being developed. How to understand the richer semantics conveyed by complex text queries and capture the user's intent has therefore become one of the difficult challenges in cross-modal retrieval.
Existing cross-modal retrieval methods from text query to video generally fall into two categories. The first category comprises concept-based methods, which use a large number of visual concepts to describe the video content while converting the text query into a set of basic visual concepts, so that the query is represented with visual concepts; cross-modal retrieval is finally realized through concept matching between the two modalities (text and video). However, such methods have the following disadvantages. First, they are generally not very effective for complex text queries, because the semantic content of a complex text query is often difficult to describe adequately with a handful of visual concepts, causing information loss; moreover, the semantics of a complex query are not merely an aggregation of extracted concepts. Second, effectively training the concept classifiers and selecting the relevant concepts is itself a very challenging problem. The second category learns a joint embedding space of text queries and videos to support video retrieval: the text query is converted into word vector representations, the video is represented as temporally aggregated features, and both are mapped to a common space, so that matching text queries and videos are close in the common space and non-matching ones are far apart. Although this direction handles longer text queries better than concept-based approaches, it has the following disadvantages. First, representing the user's text query with word vectors cannot effectively capture the user's intention, so the retrieval effect on complex text queries is poor. Second, such methods lack interpretability of the retrieval process.
Disclosure of Invention
Aiming at the defects of the prior art, the invention adopts a modelling method oriented to complex text query to video retrieval and provides a cross-modal retrieval method based on semantic tree enhancement: the complex text query is first encoded with a tree structure, and representation learning is performed on the complex text query and the video; the encoded features are mapped to a common space, where their similarity is computed, realizing cross-modal retrieval from complex text queries to videos.
The purpose of the invention is realized by the following technical scheme: a semantic tree enhancement based cross-modal retrieval method for complex text query to video comprises the following steps:
(1) extracting features of the complex text query sentence to obtain leaf node features of the complex text query sentence;
(2) encoding a tree structure of semantic tree enhancement on leaf node characteristics of the complex text query sentence obtained in the step (1);
(3) expressing the codes of the semantic tree structures of the complex text query sentences obtained in the step (2), and mining the importance of each node component forming the tree structures by using an attention mechanism to obtain the expression of the complex text query sentences capable of perceiving the intentions of the user;
(4) performing feature extraction on the video frame to obtain initial visual feature representation of the video;
(5) extracting the time dependence of continuous frames along the sequence direction from the initial visual feature representation obtained in the step (4), and extracting the semantic correlation between the frames;
(6) applying an attention mechanism to the video representation obtained in the step (5), and distinguishing the importance degree of the information to enable the useful information to occupy a larger proportion in the final video visual feature representation;
(7) respectively mapping the complex text query sentence representation and the video visual feature representation processed by the attention mechanism in the steps (3) and (6) into a common space, learning the correlation degree between the two modes by using a common space learning algorithm, and training a model in an end-to-end mode;
(8) using the model trained in step (7) to realize semantic tree based cross-modal retrieval from complex text queries to videos.
Further, the method for extracting leaf node features of the complex text query statement in step (1) comprises the following substeps:
(1-1) encoding each word in the complex text query sentence with one-hot encoding to obtain a one-hot vector sequence; multiplying each one-hot vector by a word embedding matrix to obtain the word vector sequence representation of the complex text query sentence;
(1-2) modelling the word vector sequence representation with an LSTM, an RNN variant, converting the word vector representations into leaf node features.
Further, the encoding of the tree structure with semantic tree enhancement on the leaf node feature of the complex text query statement in the step (2) includes the following sub-steps:
(2-1) generating a father node by using an LSTM method of a tree structure, and recursively forming a semantic tree structure from bottom to top; the semantic tree consists of two types of nodes: the system comprises child nodes and parent nodes, wherein the child nodes represent words in a complex text query sentence, and the parent nodes represent combinations of word components; expressing the leaf node characteristics obtained in the step (1) as first-layer nodes of a semantic tree, and combining two adjacent child nodes in all the child nodes by using an LSTM method of a tree structure to obtain candidate father nodes;
(2-2) selecting the best father node from the candidate father nodes of each layer as a next-layer node according to the memory-enhanced node scoring module, and directly copying unselected child nodes to the next layer to be used as the representation of the next-layer node; the above process is repeated recursively until only one node remains.
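The recursion in steps (2-1) and (2-2) can be sketched schematically. In this minimal illustration (not the patent's implementation), nodes are plain numbers, `merge` stands in for the TreeLSTM composition, and `score` stands in for the memory-enhanced node scoring module:

```python
def build_tree(leaves, merge, score):
    """Bottom-up construction: at each layer, merge every adjacent pair into
    a candidate parent, keep only the best-scoring candidate, copy the other
    children up unchanged, and repeat until a single root remains."""
    layer = list(leaves)
    all_nodes = list(leaves)
    while len(layer) > 1:
        candidates = [merge(layer[i], layer[i + 1]) for i in range(len(layer) - 1)]
        best = max(range(len(candidates)), key=lambda i: score(candidates[i]))
        # replace the chosen pair by its parent; unselected children are copied up
        layer = layer[:best] + [candidates[best]] + layer[best + 2:]
        all_nodes.append(candidates[best])
    return layer[0], all_nodes

# Toy run: merging adds the two children, scoring prefers the smallest sum.
root, nodes = build_tree([1, 2, 3, 4], merge=lambda a, b: a + b,
                         score=lambda x: -x)
```

With N leaves the loop runs N−1 times, so the finished tree contains 2N−1 nodes in total, matching the recursive process described above.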
Further, in step (2-1), given two adjacent child nodes (h_i, c_i) and (h_{i+1}, c_{i+1}) as input, where h_i denotes the hidden state of the i-th node and c_i the memory state of the i-th node, the parent node may be computed as:
h_p = o ⊙ tanh(c_p)
c_p = f_l ⊙ c_i + f_r ⊙ c_{i+1} + τ ⊙ g
where h_p denotes the hidden state of the parent node, with dimension d_t×1; c_p denotes the memory state of the parent node, with dimension d_t×1; ⊙ denotes element-wise multiplication between features; the gates τ, f_l, f_r, o and the candidate g can be expressed as:
[τ; f_l; f_r; g; o] = W_p [h_i; h_{i+1}] + b_p, with σ applied to τ, f_l, f_r, o and tanh applied to g,
where W_p denotes a trainable transformation matrix with dimension 5d_t×2d_t; b_p denotes a trainable bias vector with dimension 5d_t×1; σ denotes the sigmoid nonlinear activation function, and tanh denotes the hyperbolic tangent nonlinear transformation function;
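The parent-node composition above can be sketched in a few lines. This is a hedged illustration, not the patent's code; the gate ordering inside `W_p` and all dimensions are assumptions consistent with the stated shapes (W_p: 5d_t×2d_t, b_p: 5d_t×1):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def treelstm_parent(h_i, c_i, h_j, c_j, W_p, b_p):
    """Compose a parent node from two adjacent children: one affine map of the
    concatenated hidden states is split into gates tau, f_l, f_r, o and a
    candidate g, then c_p = f_l*c_i + f_r*c_j + tau*g, h_p = o*tanh(c_p)."""
    d = h_i.shape[0]
    z = W_p @ np.concatenate([h_i, h_j]) + b_p        # 5*d_t pre-activations
    tau, f_l, f_r, o = (sigmoid(z[k * d:(k + 1) * d]) for k in range(4))
    g = np.tanh(z[4 * d:5 * d])
    c_p = f_l * c_i + f_r * c_j + tau * g             # parent memory state
    h_p = o * np.tanh(c_p)                            # parent hidden state
    return h_p, c_p

rng = np.random.default_rng(0)
d_t = 3
W_p = rng.standard_normal((5 * d_t, 2 * d_t))
b_p = rng.standard_normal(5 * d_t)
h1, c1, h2, c2 = (rng.standard_normal(d_t) for _ in range(4))
h_p, c_p = treelstm_parent(h1, c1, h2, c2, W_p, b_p)
```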
suppose a t-th level semantic tree is composed of NtThe node of the t-th layer can be expressed as:
whereinRepresents the ith node of the t-th layer,represents the i +1 th node of the t-th layer,representing the ith node of the t +1 th layer, wherein TreeLSTM represents the LSTM method of the tree structure;
in the step (2-2), a module f is scored according to the nodes with enhanced memoryscore(.;Θscore) Determining the likelihood that the best parent node, the ith candidate parent node, is selectedExpressed as:
wherein Θ isscoreTrainable parameters representing a node scoring module;for the context semantic vector, judging the importance degree of the hidden state of each node through the query memory M, and aggregating the importance degrees according to the hidden state of each node in M to obtain the context semantic vectorThe memory M is represented as:
whereinIndicating a hidden state of an nth node of layer 1; degree of importance of node hidden state in memory MExpressed as:
whereinRepresenting the importance degree of the ith node of the t-th layer of the semantic tree to the hidden state of the jth node in the memory M; wmRepresenting a trainable transformation matrix with dimensions set to dt*dt;bmRepresenting a trainable bias vector with dimension set to dt*1;Representing the Relu nonlinear activation function; softmax represents a non-linear function; according to the degree of importanceAggregating the information in the memory M to obtain a context semantic vector Expressed as:
WhereinIs a Relu nonlinear activation function; w is asRepresenting trainable transformation vectors with dimensions set to 2dt*1;bsRepresenting trainable bias vectors with dimensions set to 2dt*1;WsRepresenting a trainable transformation matrix with dimensions set to 2dt*2dt。
Further, in step (3), on the basis of the semantic tree enhanced representation of the complex text query sentence, a neural network with an attention mechanism is introduced to mine the importance of each node component; the importance score β_i of node e_i is expressed as:
β_i = softmax_i( u_ta^T φ(W_ta e_i + b_ta) )
where φ is the ReLU nonlinear activation function; W_ta denotes a trainable transformation matrix with dimension d_ta×d_t; b_ta denotes a trainable bias vector with dimension d_ta×1; u_ta denotes a trainable transformation vector with dimension d_ta×1. The importance scores are used as the weights of the nodes, and all node components are aggregated to obtain the user intention aware representation of the complex text query sentence:
q̃ = Σ_{i=1}^{N-1} β_i e_i
where N−1 denotes the number of nodes of the semantic tree structure, and β_i denotes the importance score of node e_i.
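The attention-weighted aggregation of node components can be sketched as follows; the shapes and random parameters are illustrative, not values from the patent:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def aggregate_nodes(E, W_ta, b_ta, u_ta):
    """Score each node e_i with a small attention net (ReLU then a projection),
    normalise the scores with softmax, and return the weighted sum of the node
    representations as the query representation."""
    scores = np.array([u_ta @ np.maximum(W_ta @ e + b_ta, 0.0) for e in E])
    beta = softmax(scores)                  # importance weight per node
    return beta @ np.stack(E), beta         # aggregated representation, weights

rng = np.random.default_rng(1)
d_t, d_ta, n = 4, 3, 5                      # illustrative dimensions
E = [rng.standard_normal(d_t) for _ in range(n)]
q_tilde, beta = aggregate_nodes(E,
                                rng.standard_normal((d_ta, d_t)),
                                rng.standard_normal(d_ta),
                                rng.standard_normal(d_ta))
```

The same pattern reappears in step (6) for weighting video frames; only the inputs differ.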
Further, in the step (4), feature extraction is performed on the input video frame by using a pre-trained deep Convolutional Neural Network (CNN), and a deep visual feature of each frame is extracted as an initial visual feature.
Further, in step (5), extracting the temporal dependency of consecutive frames along the sequence direction comprises: encoding the initial visual features of the video obtained in step (4) with a GRU (gated recurrent unit); at each time step the GRU takes the feature vector of the current frame and the hidden state of the previous frame as input and outputs the hidden state of the current frame, effectively capturing the temporal dependency between consecutive frames;
Extracting the semantic relevance between frames across the whole video comprises: applying a self-attention mechanism with scaled dot-product attention, i.e. projecting the representations of the video sequence frames into multiple attention spaces, taking the dot product of each frame's query projection with the key projections of the remaining frames, obtaining weights on the value frames through a softmax operation, and multiplying the obtained weights by the value frames; the outputs of the multiple attention spaces are concatenated and normalized to obtain the final video representation.
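A single-head sketch of the scaled dot-product self-attention described above (the patent uses multiple attention spaces plus concatenation and normalization; this illustration, with made-up dimensions, shows one head only):

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Single-head scaled dot-product attention over a frame sequence X
    (m frames x d features): each frame's query is dotted with every frame's
    key, the softmax of the scaled scores weights the value frames."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    logits = Q @ K.T / np.sqrt(K.shape[1])           # scaled dot products
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)    # row-wise softmax
    return weights @ V, weights

rng = np.random.default_rng(2)
m, d = 6, 4                                          # illustrative sizes
X = rng.standard_normal((m, d))                      # GRU-encoded frame features
out, A = self_attention(X,
                        rng.standard_normal((d, d)),
                        rng.standard_normal((d, d)),
                        rng.standard_normal((d, d)))
```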
Further, in step (6), an attention neural network model with three parameters is designed to distinguish the importance of the video frames; the importance η_t of the t-th frame is expressed as:
η_t = softmax_t( u_va^T φ(W_va h'_t + b_va) )
where u_va is a trainable transformation vector with dimension d_va×1, b_va is a trainable bias vector with dimension d_va×1, W_va is a trainable transformation matrix with dimension d_va×d_v, and h'_t is the representation of the corresponding video frame, with dimension d_v×1;
The importance of each frame is used as a weight on the representation of the corresponding video frame, and the weighted frames are accumulated to form the final video representation:
ṽ = Σ_{t=1}^{m} η_t h'_t
where η_t denotes the importance of the t-th frame, and ṽ, with dimension d_v×1, is the aggregated representation of all components of the video frames.
Further, in step (7), the steps of learning the correlation between the two modalities and training the model with a common space learning algorithm are as follows:
(7-1) mapping the complex text query sentence representation and the video visual feature representation obtained through the attention mechanisms of steps (3) and (6) into a unified common space through two linear projection models; to obtain the same dimensionality, a nonlinear activation function is applied to the resulting features, followed by a batch normalization (BN) layer;
(7-2) training the model end-to-end with the defined triplet ranking loss, so that the model automatically learns the correlation between the two modalities.
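A minimal sketch of a triplet (ternary) ranking loss over cosine similarities in the common space; the margin value and vectors are illustrative, and the patent does not specify the exact loss formula:

```python
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def triplet_loss(q, v_pos, v_neg, margin=0.2):
    """Hinge loss that pushes the matching video v_pos to be more similar to
    the query q than a non-matching video v_neg, by at least `margin`
    (margin value is illustrative)."""
    return max(0.0, margin + cos(q, v_neg) - cos(q, v_pos))

q = np.array([1.0, 0.0])          # query embedding in the common space
v_pos = np.array([1.0, 0.1])      # nearly aligned with the query
v_neg = np.array([0.0, 1.0])      # orthogonal to the query
```

When the positive is already closer than the negative by more than the margin, the loss is zero, so training focuses on violated triplets.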
Further, in step (8), given a complex text query sentence, the videos related to it are found in a candidate video set and used as the retrieval result, as follows:
(8-1) mapping the input complex text query sentence and all candidate videos into the common space through the model trained in step (7);
(8-2) computing the similarity between the complex text query sentence and all candidate videos in the common space, sorting all candidate videos in descending order of similarity, and returning the top-ranked videos as the retrieval result, thereby realizing cross-modal retrieval from complex text query sentences to videos.
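Steps (8-1) and (8-2) can be sketched as a similarity ranking in the common space; cosine similarity is an assumption here, since the patent only says "similarity":

```python
import numpy as np

def retrieve(query_vec, video_vecs, top_k=3):
    """Rank candidate videos by cosine similarity to the query in the common
    space; return the indices of the top-k candidates, best first."""
    V = np.stack(video_vecs)
    sims = (V @ query_vec) / (np.linalg.norm(V, axis=1) * np.linalg.norm(query_vec))
    order = np.argsort(-sims)              # descending similarity
    return order[:top_k].tolist(), sims

# Toy common-space embeddings (illustrative values).
query = np.array([1.0, 0.0])
videos = [np.array([0.9, 0.1]),            # close to the query
          np.array([0.0, 1.0]),            # unrelated
          np.array([0.5, 0.5])]            # partially related
top, sims = retrieve(query, videos, top_k=2)
```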
The invention has the following beneficial effects: it provides a novel cross-modal retrieval framework from complex text query to video that automatically forms a flexible tree structure to model complex text query sentences, and designs a memory-enhanced node scoring module to mine the linguistic context of the query's tree structure. Attention mechanisms are introduced into both the complex text query representation and the video visual feature representation, deeply mining the combinations of node components in the complex text query and the importance of each video frame. The invention can interpret the information components in a complex text query sentence, better understand the user intention, and improve retrieval performance to a great extent.
Drawings
FIG. 1 is a schematic diagram of an implementation of a semantic tree enhancement-based cross-modal search method for complex text query to video;
FIG. 2 is an example of a complex text query to video search of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
In order to solve the problem of cross-modal retrieval from complex text query to video, the invention provides a semantic tree enhancement-based cross-modal retrieval method from complex text query to video, which comprises the following specific steps:
(1) extracting the features of the complex text query sentence with a feature extraction method to obtain its leaf node features.
(1-1) Given a complex text query sentence Q of length N, Q can be represented as:
Q = {w_1, w_2, …, w_N}
where w_1 denotes the first word in the complex text query sentence. Each word is first encoded with one-hot encoding to obtain a sequence of one-hot vectors {w'_1, w'_2, …, w'_N}, where w'_t denotes the one-hot vector of the t-th word. Multiplying each one-hot vector by a word embedding matrix yields the word vector sequence representation {q_1, q_2, …, q_N} of the complex text query sentence Q.
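The one-hot multiplication in step (1-1) reduces to a row lookup in the embedding matrix; a minimal sketch (vocabulary size, dimensions, and token ids are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, embed_dim = 10, 4
W_embed = rng.standard_normal((vocab_size, embed_dim))   # word embedding matrix

def one_hot(idx, size):
    v = np.zeros(size)
    v[idx] = 1.0
    return v

# Multiplying a one-hot vector by the embedding matrix selects exactly one row,
# so the word vector sequence is simply a per-token row lookup.
token_ids = [3, 7, 1]                                    # toy query of 3 words
q_matmul = np.stack([one_hot(t, vocab_size) @ W_embed for t in token_ids])
q_lookup = W_embed[token_ids]                            # equivalent lookup form
```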
(1-2) LSTM (long short-term memory network), a variant of the RNN (recurrent neural network), is used as the basic sequence modelling module. To maintain structural consistency, the N word vectors in the sequence from step (1-1) are converted into N leaf nodes using the LSTM. At the i-th time step, the i-th word vector q_i in {q_1, q_2, …, q_N} is converted into a leaf node by the LSTM unit; the i-th leaf node is represented as:
(h_i, c_i) = LSTM(q_i, h_{i-1}, c_{i-1})
where h_{i-1} denotes the hidden state of the (i-1)-th node, c_{i-1} denotes the memory state of the (i-1)-th node, and (h_i, c_i) denotes the i-th leaf node feature into which the i-th word vector is converted.
(2) Encoding the leaf node features of the complex text query sentence obtained in step (1) with the semantic tree enhanced tree structure. To better understand the complex text query sentence, tree-structured LSTM (TreeLSTM) modelling is applied to the leaf node features obtained in step (1), and the TreeLSTM method is used to generate parent nodes. Given two adjacent child nodes (h_i, c_i) and (h_{i+1}, c_{i+1}) as input, where h_i denotes the hidden state of the i-th node and c_i the memory state of the i-th node, the parent node may be computed as:
h_p = o ⊙ tanh(c_p)
c_p = f_l ⊙ c_i + f_r ⊙ c_{i+1} + τ ⊙ g
where h_p denotes the hidden state of the parent node, with dimension d_t×1; c_p denotes the memory state of the parent node, with dimension d_t×1; ⊙ denotes element-wise multiplication between features; the parameters τ, f_l, f_r, o, g are obtained from h_i and h_{i+1} through sigmoid and tanh functions and can be expressed as:
[τ; f_l; f_r; g; o] = W_p [h_i; h_{i+1}] + b_p, with σ applied to τ, f_l, f_r, o and tanh applied to g,
where W_p denotes a trainable transformation matrix with dimension 5d_t×2d_t; b_p denotes a trainable bias vector with dimension 5d_t×1; σ denotes the sigmoid nonlinear activation function, and tanh denotes the hyperbolic tangent nonlinear transformation function. Parent nodes are generated with the TreeLSTM, recursively forming the semantic tree structure bottom-up. The semantic tree consists of two types of nodes: child nodes, which represent words in the complex text query sentence, and parent nodes, which represent combinations of word components and can describe more complex semantic information than child nodes.
(2-1) The leaf node feature sequence obtained in step (1) serves as the first layer of the semantic tree. Suppose the t-th layer of the semantic tree consists of N_t nodes; the t-th layer is expressed as:
E^t = {e_1^t, e_2^t, …, e_{N_t}^t}
If the adjacent t-th layer nodes e_i^t and e_{i+1}^t are chosen for merging, the parent node is computed with the TreeLSTM and can be represented as:
e_i^{t+1} = TreeLSTM(e_i^t, e_{i+1}^t)
where e_i^t denotes the i-th node of the t-th layer, e_{i+1}^t denotes the (i+1)-th node of the t-th layer, and e_i^{t+1} denotes the i-th node of the (t+1)-th layer. Every two adjacent child nodes among all child nodes are combined with the tree-structured LSTM (TreeLSTM) to obtain the candidate parent nodes.
(2-2) The key step in building the semantic tree structure is accurately selecting the best parent node from the candidate parent nodes at each layer, which requires designing a node scoring module. Owing to the ambiguity of language and the limited ability of node hidden states to remember historical inputs, it is difficult for a plain scoring module to efficiently determine the best parent node when the given query is a complex text query. Therefore, a memory-enhanced node scoring module f_score(·; Θ_score) is specially designed for complex text query sentences to determine the best parent node; the likelihood s_i^t that the i-th candidate parent node is selected can be expressed as:
s_i^t = f_score(e_i^t; Θ_score)
where Θ_score denotes the trainable parameters of the node scoring module, and c̃_i^t is a context semantic vector. To obtain c̃_i^t, the importance of each node hidden state is judged by querying a memory M, and the context semantic vector is obtained by aggregating the hidden states in M according to their importance. The memory M can be represented as:
M = {h_1^1, h_2^1, …, h_N^1}
where h_1^1 denotes the hidden state of the 1st node of layer 1, h_2^1 the hidden state of the 2nd node of layer 1, and h_N^1 the hidden state of the N-th node of layer 1. The importance a_{i,j} of a node hidden state in the memory M can be expressed as:
a_{i,j} = softmax_j( (h_i^t)^T φ(W_m h_j^1 + b_m) )
where a_{i,j} denotes the importance of the i-th node of the t-th layer of the semantic tree with respect to the hidden state of the j-th node in the memory M; W_m denotes a trainable transformation matrix with dimension d_t×d_t; b_m denotes a trainable bias vector with dimension d_t×1; φ denotes the ReLU nonlinear activation function; softmax denotes the normalizing nonlinear function. Aggregating the information in the memory M according to the importances gives the context semantic vector, which can be expressed as:
c̃_i^t = Σ_{j=1}^{N} a_{i,j} h_j^1
a vector normalized after applying the attention mechanism. With the context semantic vector c̃_i^t, the score of the candidate parent node is obtained by the following formula:
s_i^t = w_s^T φ(W_s [h_i^t; c̃_i^t] + b_s)
where φ is the ReLU nonlinear activation function; w_s denotes a trainable transformation vector with dimension 2d_t×1; b_s denotes a trainable bias vector with dimension 2d_t×1; W_s denotes a trainable transformation matrix with dimension 2d_t×2d_t.
The memory-enhanced node scoring module fuses contextual semantic information, injecting semantic context into each choice to better select parent nodes. In this recursive process, every two adjacent child nodes among all the child nodes are combined to obtain candidate parent nodes, and the candidate with the largest score is selected as a next-layer node. Only the representation of the selected node is updated; unselected child nodes are copied directly to the next layer as the representations of next-layer nodes. The process is repeated recursively until only one node remains. Through this process a semantic tree structure is composed, and its encoding can be expressed as:
{e_1, e_2, …, e_{N-1}} = LSTree({q_1, q_2, …, q_N})
where LSTree denotes the overall construction process of the semantic tree, and e_i ∈ R^{d_t} denotes the representation of the i-th node. The encoded representation of the semantic tree structure automatically extracts semantic components that may match the user's search intention, and can better understand complex text query sentences without any syntactic annotation.
(3) The semantic tree structure encoding of the complex text query sentence obtained in step (2) is represented, and the importance of each node component forming the tree structure is mined with an attention mechanism to obtain a representation of the complex text query that can perceive the user's intention.
Complex text query sentences typically consist of referents and their descriptions across multiple videos, and some concepts or descriptions in the query may not be clearly shown in a video or may span only a short time. Therefore, on top of the semantic tree enhanced query representation, an attention network is introduced to mine the importance of each node component: scoring the importance of the node components of the semantic tree enhanced query makes the more important components perceptible, and using the scores as weights to aggregate the nodes yields a complex text query sentence representation that effectively perceives the user's intention. The concrete implementation is as follows:
Using the attention mechanism, a neural network is introduced to learn the importance of each node component; the importance score β_i of node e_i can be expressed as:
β_i = softmax_i( u_ta^T φ(W_ta e_i + b_ta) )
where φ is the ReLU nonlinear activation function; W_ta denotes a trainable transformation matrix with dimension d_ta×d_t; b_ta denotes a trainable bias vector with dimension d_ta×1; u_ta denotes a trainable transformation vector with dimension d_ta×1. The importance scores are used as the weights of the nodes, and all node components are aggregated to obtain the user intention aware representation of the complex text query sentence:
q̃ = Σ_{i=1}^{N-1} β_i e_i
where N−1 denotes the number of nodes of the semantic tree structure, and β_i denotes the importance score of node e_i.
(4) Extracting video features by using a feature extraction method to obtain initial visual feature representation of the video;
Specifically, a pre-trained deep convolutional neural network (CNN) may be used to extract features from the input video frames. For a given video, frames are uniformly sampled from the video every 0.5 seconds; assuming m frames are extracted, the video is described by a sequence of feature vectors {v_1, v_2, …, v_m}. The deep visual features of each frame are extracted with a deep CNN model trained on the ImageNet dataset, such as the ResNet model. The video can thus be represented as:
V = {v_1, v_2, …, v_m}
where v_t denotes the feature vector of the t-th extracted frame. The feature extraction above yields the initial visual features of the video frames; however, these are only simple initial features extracted by the CNN model, whose content information is relatively rough, and they are further encoded to obtain a more refined feature representation.
(5) The temporal and semantic dependencies of the initial visual feature representations obtained in step (4) are further mined: first, the temporal dependency of consecutive frames is extracted along the sequence direction; second, the semantic correlation between frames is extracted.
(5-1) Extract the temporal dependency of successive frames along the sequence direction. Since a video is composed of a sequence of images with a fixed front-to-back order, it carries temporal information, and capturing this temporal information is important. To extract the temporal dependencies of consecutive frames along the sequence direction, we use a GRU (Gated Recurrent Unit) to encode the initial visual features of the video obtained in step (4), modeling the temporal dependencies between consecutive frames. At each time step, the GRU takes the feature vector of the current frame and the hidden state of the previous frame as input, and outputs the hidden state of the current frame. The hidden state of the t-th frame is represented as:
h′t = GRU(vt, h′t−1)
where vt represents the t-th frame feature vector extracted by the CNN network and h′t−1 represents the hidden state of the (t−1)-th frame. Through the above operation, we effectively capture the dependency relationships between consecutive frames. The GRU-processed video sequence V′ can be represented as:

V′ = {h′1, h′2, …, h′m}

where h′t represents the hidden state of the t-th frame and m represents the number of video frames uniformly extracted from the video.
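A single GRU step and the sequence encoding can be sketched as follows. This is a plain-NumPy illustration of the standard GRU recurrence under assumed weight shapes; a real implementation would use a deep-learning framework's GRU layer.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(v_t, h_prev, params):
    """One GRU time step: h'_t = GRU(v_t, h'_{t-1}).
    params = (Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh), where each W_* is
    (d_h, d_v), each U_* is (d_h, d_h) and each b_* is (d_h,)."""
    Wz, Uz, bz, Wr, Ur, br, Wh, Uh, bh = params
    z = sigmoid(Wz @ v_t + Uz @ h_prev + bz)          # update gate
    r = sigmoid(Wr @ v_t + Ur @ h_prev + br)          # reset gate
    h_tilde = np.tanh(Wh @ v_t + Uh @ (r * h_prev) + bh)  # candidate
    return (1 - z) * h_prev + z * h_tilde

def encode_sequence(V, params, d_h):
    """Run the GRU over frame features V (m, d_v); returns the
    hidden-state sequence V' as an (m, d_h) array."""
    h = np.zeros(d_h)
    out = []
    for v_t in V:
        h = gru_step(v_t, h, params)
        out.append(h)
    return np.stack(out)
```

Because the candidate state is tanh-bounded and the hidden state starts at zero, every hidden value stays inside (−1, 1).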
And (5-2) extracting semantic correlation between frames in the whole video.
To enhance the video sequence feature representation, building on the video representation of step (5-1), semantic correlations between frames across the whole video are exploited: a self-attention mechanism projects the video sequence frame representations into multiple attention spaces; in each space, the query projection of every frame is dotted with the remaining key frames, a Softmax operation yields the weight on the current value frame, and the obtained weight is multiplied with the value frame. The outputs from the multiple attention spaces are finally aggregated to obtain the final video representation. The specific implementation is as follows:
We exploit the semantic correlation between video frames by first performing scaled dot-product attention through a self-attention mechanism, projecting the video sequence frame representations into multiple attention spaces. The query projection of each frame is dotted with the remaining key frames, and the weight on the value frames is obtained through a Softmax operation. Treating V′ as an m×dv matrix, the output of the i-th attention space is:

head_i = softmax(Qi·Ki^T / √di)·Vi,  with Qi = V′·(Wi^Q)^T, Ki = V′·(Wi^K)^T, Vi = V′·(Wi^V)^T
wherein Wi^Q, Wi^K and Wi^V are trainable transformation matrices, each with dimension set to di×dv; through these three parameters, the initial input V′ is projected into the query, key and value matrix spaces of the i-th attention space. The dot product of each frame's query projection with the remaining key frames, followed by a Softmax operation, yields the weight on the current value frame; the query, key and value projections of each frame in the i-th attention space have dimension di×1. The obtained weight on the value frame is multiplied with the value frame. Finally, the outputs of the multiple attention spaces are concatenated and normalized to obtain the final video representation:
V̂ = Norm(Concat(head_1, head_2, …, head_z)·Wp)

where Concat(·) represents the concatenation operation; head_1, head_2, …, head_z are the outputs of the 1st, 2nd, …, z-th attention spaces; Wp is a trainable transformation matrix with dimension dv×dv that projects the concatenated features back into the original space; Norm(·) represents the layer normalization operation. V̂ is the video sequence representation enhanced by the self-attention mechanism, expressed as:

V̂ = {v̂1, v̂2, …, v̂m}
where m represents the number of video frames originally extracted uniformly from the video, and v̂t represents the t-th frame representation after enhancement by the above self-attention mechanism. The enhanced video representation effectively captures both the temporal dependencies between consecutive frames and the semantic correlations between frames.
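The multi-head self-attention step can be sketched as follows. This NumPy sketch assumes, as the dimension definitions suggest, that the z heads of size d_i concatenate back to d_v (z·d_i = d_v); the per-row layer normalization stands in for the Norm(·) operation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d)) V for one attention space."""
    d = Q.shape[-1]
    A = softmax(Q @ K.T / np.sqrt(d), axis=-1)   # (m, m) frame weights
    return A @ V

def multi_head_self_attention(frames, heads, W_p):
    """frames: (m, d_v) GRU-encoded sequence V'.
    heads: list of (Wq, Wk, Wv) triples, each matrix (d_i, d_v).
    W_p: (d_v, z*d_i) output projection back to the original space."""
    outs = []
    for Wq, Wk, Wv in heads:
        Q, K, V = frames @ Wq.T, frames @ Wk.T, frames @ Wv.T  # (m, d_i)
        outs.append(scaled_dot_attention(Q, K, V))
    concat = np.concatenate(outs, axis=-1)       # (m, z*d_i)
    out = concat @ W_p.T                         # (m, d_v)
    # layer normalization per frame
    mu = out.mean(axis=-1, keepdims=True)
    sd = out.std(axis=-1, keepdims=True) + 1e-6
    return (out - mu) / sd
```

Each row of the result is a self-attention-enhanced frame representation v̂t, normalized to zero mean across its d_v features.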
(6) Apply an attention mechanism to the video representation obtained in step (5) to distinguish the importance of the information, so that useful information takes a greater proportion in the final video representation.
Specifically, an attention neural network model with three parameters is designed to distinguish the importance of the video frames. The importance of the t-th frame can be expressed as:

ηt = softmax(uva^T tanh(Wva·v̂t + bva))
where uva is a trainable transformation vector with dimension set to dva×1, bva is a trainable bias vector with dimension set to dva×1, and Wva is a trainable transformation matrix with dimension set to dva×dv. The importance of each frame, used as a weight, is multiplied with the representation of the corresponding video frame v̂t, and the m frames are finally accumulated into the final video representation:

ṽ = Σ_{t=1}^{m} ηt·v̂t

where ηt indicates the importance of the t-th frame; ṽ, with dimension set to dv×1, is the aggregate representation of all components of the video frames.
(7) Map the complex text query sentence representation of step (3) and the attention-processed video visual feature representation of step (6) into a common space, learn the degree of correlation between the two modalities with a common-space learning algorithm, and finally train the model in an end-to-end manner. Specifically, the method of learning the correlation between the two modalities and training the model with the common-space learning algorithm is as follows:
(7-1) Map the complex text query sentence representation and the video visual feature representation obtained through the attention mechanisms in step (3) and step (6) into a unified common space via two linear projection models. To obtain the same dimensionality, a nonlinear activation function is applied to the resulting features, followed by a batch normalization (BN) layer. The specific implementation is as follows:
Step (3) and step (6) yield the final complex text query sentence representation, with dimension dt×1, and the final video visual feature representation, with dimension dv×1. We project the complex text query sentence representation and the video visual feature representation into a joint embedding space through two linear projection models.
The projected complex text query sentence is represented as:

f(Q) = BN(Wq·q + bq)

where q is the complex text query sentence representation obtained in step (3); Wq is a trainable transformation matrix with dimension set to d*×dt; bq is a trainable bias vector with dimension set to d*×1; BN(·) represents a batch normalization layer, which contributes to the performance of the model.
The projected video visual feature is represented as:

f(V) = BN(Wv·ṽ + bv)

where ṽ is the video visual feature representation obtained in step (6); Wv is a trainable transformation matrix with dimension set to d*×dv; bv is a trainable bias vector with dimension set to d*×1; BN(·) represents a batch normalization layer, which contributes to the performance of the model.
The cosine similarity between the projected complex text query sentence representation and the projected video visual feature representation is used as the cross-modal matching score:

s(Q, V) = f(Q)^T·f(V) / (‖f(Q)‖·‖f(V)‖)

where Q denotes the complex text query sentence, V denotes the initially input video features, f(Q) denotes the complex text query sentence feature representation finally projected into the common space, and f(V) denotes the video visual feature representation finally projected into the common space. We denote the cosine similarity of the query sentence Q and the video V by s(Q, V).
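The projection and matching-score computation can be sketched as follows; this single-vector NumPy sketch omits batch normalization (which operates over a training batch), and the function names are illustrative.

```python
import numpy as np

def project(x, W, b):
    """Linear projection of one modality feature into the common space.
    x: (d_in,), W: (d_star, d_in), b: (d_star,). Batch normalization,
    applied over a batch in the described model, is omitted here."""
    return W @ x + b

def cosine_score(q_emb, v_emb):
    """Cross-modal matching score s(Q, V): cosine similarity of the
    projected query and video embeddings."""
    denom = np.linalg.norm(q_emb) * np.linalg.norm(v_emb) + 1e-12
    return float(q_emb @ v_emb / denom)
```

A matched pair projected to the same direction scores close to 1, orthogonal embeddings score 0, and opposed embeddings score −1.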
(7-2) Train the model in an end-to-end manner through the defined triplet ranking loss, so that the model automatically learns the correlation between the two modalities. The specific steps are as follows:
To train the model, we use a triplet ranking loss to optimize the network, penalizing the model with a hardest-negative-sample mining strategy. When training the model, we sample a batch of complex text query sentence and video pairs, which can be represented as:

{(Qi, Vi)}, i = 1, …, B
where B denotes the number of sampled complex text query sentence and video pairs. Through a margin constant, we require that for any positive pair (Qi, Vi), the similarity s(Qi, Vi) between the complex text query sentence Qi and its matched video Vi be larger than the similarity s(Qi, Vj) of any negative pair (Qi, Vj), where Vj is a video not matched with Qi. The loss function for a batch is expressed as:

L = (1/B) Σ_{i=1}^{B} (1/|Nh|) Σ_{Vj ∈ Nh} max(0, margin − s(Qi, Vi) + s(Qi, Vj))
The margin constant lies in (0, 1). |Nh| denotes the number of hardest negative videos considered in a batch. We find that penalizing the model with only the single hardest negative sample can make training unstable, while averaging over all negative samples makes training slow; we therefore use a balancing strategy, averaging the loss over the top |Nh| hardest negatives, which ensures stable and effective training.
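The batch loss with the balanced hardest-negative strategy can be sketched as follows; the similarity matrix layout (matched pairs on the diagonal) and the parameter names are illustrative assumptions.

```python
import numpy as np

def batch_triplet_loss(S, margin=0.2, n_hard=2):
    """S: (B, B) similarity matrix with S[i, j] = s(Q_i, V_j);
    diagonal entries are the matched (positive) pairs. For each
    query, average the hinge violations of the n_hard most-violating
    negative videos, then average over the batch."""
    B = S.shape[0]
    total = 0.0
    for i in range(B):
        pos = S[i, i]
        negs = np.delete(S[i], i)                    # s(Q_i, V_j), j != i
        viol = np.maximum(0.0, margin - pos + negs)  # hinge per negative
        hardest = np.sort(viol)[::-1][:n_hard]       # top-|N_h| violations
        total += hardest.mean()
    return total / B
```

When every positive pair already beats every negative by at least the margin, the loss is zero; when positives and negatives are indistinguishable, each query contributes exactly the margin.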
(8) Use the model trained in step (7) to realize semantic-tree-based cross-modal retrieval from complex text queries to videos.
Specifically, through the training of the model in step (7), the model has learned the mutual connection between the video and the complex text query sentence. Given a complex text query sentence, the model finds out the relevant videos of the complex text query sentence from a candidate video set and uses the relevant videos as the retrieval result, and the steps are as follows:
(8-1) Map the input complex text query sentence and all candidate videos into the common space through the model trained in step (7), so that the complex text query sentence Q and each candidate video V are represented by their common-space embeddings.
(8-2) Compute the cosine similarity between the complex text query sentence and all candidate videos in the common space, sort all candidate videos in descending order of cosine similarity, and return the top-ranked videos as the retrieval result, thereby realizing cross-modal retrieval from complex text query sentences to videos.
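The retrieval step can be sketched as follows; the function name and `top_k` parameter are illustrative, and the embeddings are assumed to already lie in the learned common space.

```python
import numpy as np

def retrieve(query_emb, video_embs, top_k=3):
    """Rank candidate video embeddings by cosine similarity to the
    query embedding; return candidate indices in descending order
    of similarity, truncated to the top_k results."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    sims = [cos(query_emb, v) for v in video_embs]
    order = sorted(range(len(sims)), key=sims.__getitem__, reverse=True)
    return order[:top_k]
```

The returned index list is the retrieval result: the first index is the candidate video most similar to the query in the common space.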
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the invention, so that any modifications, equivalents, improvements, etc. made within the spirit and principle of the present invention should be included in the scope of the present invention.
Claims (10)
1. A semantic tree enhancement based cross-modal retrieval method for complex text query to video is characterized by comprising the following steps:
(1) Extract features of the complex text query sentence to obtain the leaf node features of the complex text query sentence.
(2) Perform semantic-tree-enhanced tree-structure encoding on the leaf node features of the complex text query sentence obtained in step (1).
(3) On the semantic tree structure encoding of the complex text query sentence obtained in step (2), mine the importance of each node component constituting the tree structure with an attention mechanism, to obtain a representation of the complex text query sentence that perceives the user's intention.
(4) Extract features of the video frames to obtain an initial visual feature representation of the video.
(5) From the initial visual feature representation obtained in step (4), extract the temporal dependencies of consecutive frames along the sequence direction, and extract the semantic correlations between frames.
(6) Apply an attention mechanism to the video representation obtained in step (5) to distinguish the importance of the information, so that useful information takes a greater proportion in the final video visual feature representation.
(7) Map the complex text query sentence representation and the video visual feature representation processed by the attention mechanisms in step (3) and step (6) into a common space, learn the degree of correlation between the two modalities with a common-space learning algorithm, and train the model in an end-to-end manner.
(8) Use the model trained in step (7) to realize semantic-tree-based cross-modal retrieval from complex text queries to videos.
2. The semantic tree enhancement-based cross-modal search method for complex text query to video according to claim 1, wherein the method for extracting leaf node features of complex text query sentences in step (1) comprises the following sub-steps:
(1-1) coding each word in the complex text query sentence by using one-hot coding to obtain one-hot coding vector sequence; multiplying the one-hot coded vector by a word embedding matrix to obtain a word vector sequence representation of the complex text query statement;
(1-2) modeling the word vector sequence representation using the LSTM in the RNN, converting the word vector sequence representation into leaf node features.
3. The semantic tree enhancement based cross-modal search method for complex text query to video according to claim 1, wherein the step (2) of encoding the tree structure of semantic tree enhancement on the leaf node feature of the complex text query sentence comprises the following sub-steps:
(2-1) generating a father node by using an LSTM method of a tree structure, and recursively forming a semantic tree structure in a bottom-up manner; the semantic tree consists of two types of nodes: the system comprises child nodes and parent nodes, wherein the child nodes represent words in a complex text query sentence, and the parent nodes represent combinations of word components; expressing the leaf node characteristics obtained in the step (1) as a first-layer child node of a semantic tree, and combining two adjacent child nodes in all the child nodes by using an LSTM method of a tree structure to obtain a candidate father node;
(2-2) selecting the best father node from the candidate father nodes of each layer as a next-layer node according to the memory-enhanced node scoring module, and directly copying unselected child nodes to the next layer to be used as the representation of the next-layer node; the above process is repeated recursively until only one node remains.
4. The method for cross-modal search for complex text query to video based on semantic tree enhancement as claimed in claim 3, wherein in the step (2-1), two adjacent child nodes (h) are giveni,ci) And (h)i+1,ci+1) As input, hiRepresenting the hidden state of the ith node, ciRepresenting the memory state of the ith node, the parent node may be computed as:
hp=o⊙tanh(cp)
cp=fl⊙ci+fr⊙ci+1+τ⊙g
wherein hp represents the hidden state of the parent node, with dimension set to dt×1; cp represents the memory state of the parent node, with dimension set to dt×1; ⊙ represents element-wise multiplication between features; τ, fl, fr, o, g can be expressed as:

(τ, fl, fr, o, g) = split(Wp·[hi; hi+1] + bp), with σ applied to τ, fl, fr and o, and tanh applied to g
wherein Wp represents a trainable transformation matrix with dimension set to 5dt×2dt; bp represents a trainable bias vector with dimension set to 5dt×1; σ represents the sigmoid nonlinear activation function, and tanh represents the tanh nonlinear transformation function;
suppose the t-th level of the semantic tree consists of Nt nodes; the nodes of the t-th layer can be expressed as {e1^t, e2^t, …, e_{Nt}^t}, and a candidate parent node of the (t+1)-th layer is computed as:

ê_i^{t+1} = treeLSTM(e_i^t, e_{i+1}^t)

where e_i^t represents the i-th node of the t-th layer, e_{i+1}^t represents the (i+1)-th node of the t-th layer, ê_i^{t+1} represents the i-th candidate node of the (t+1)-th layer, and treeLSTM represents the LSTM method of the tree structure;
in step (2-2), a memory-enhanced node scoring module f_score(·; Θ_score) determines the likelihood that each candidate parent node is selected as the best parent node; the score of the i-th candidate parent node ê_i^{t+1} is expressed as:

ŝ_i = f_score(ê_i^{t+1}; Θ_score)

wherein Θ_score represents the trainable parameters of the node scoring module; the importance of each node hidden state is judged by querying the memory M, and a context semantic vector m̂_i is obtained by aggregating the hidden states in M according to their importance; the memory M is represented as:

M = {h_1^1, h_2^1, …, h_{N1}^1}
where h_n^1 indicates the hidden state of the n-th node of layer 1. The importance of a node hidden state in the memory M is expressed as:

a_{ij} = softmax_j((h_i^t)^T ReLU(Wm·h_j^1 + bm))

where a_{ij} represents the importance of the hidden state of the j-th node in the memory M to the i-th node of the t-th layer of the semantic tree; Wm represents a trainable transformation matrix with dimension set to dt×dt; bm represents a trainable bias vector with dimension set to dt×1; ReLU represents the ReLU nonlinear activation function; softmax represents the softmax nonlinear function. According to the importance a_{ij}, the information in the memory M is aggregated to obtain the context semantic vector:

m̂_i = Σ_j a_{ij}·h_j^1
ŝ_i = ws^T ReLU(Ws·[ê_i^{t+1}; m̂_i] + bs)

wherein ReLU is the ReLU nonlinear activation function; ws represents a trainable transformation vector with dimension set to 2dt×1; bs represents a trainable bias vector with dimension set to 2dt×1; Ws represents a trainable transformation matrix with dimension set to 2dt×2dt.
5. The method according to claim 1, wherein in step (3), on the basis of the semantic-tree-enhanced complex text query sentence representation, an attention neural network is introduced to mine the importance of each node component; the importance score βi of node ei is expressed as:

βi = softmax(uta^T ReLU(Wta·ei + bta))

wherein ReLU is the ReLU nonlinear activation function; Wta represents a trainable transformation matrix with dimension set to dta×dt; bta represents a trainable bias vector with dimension set to dta×1; uta represents a trainable transformation vector with dimension set to dta×1; using the importance scores as the weights of the nodes, all node components are aggregated to obtain the representation of the complex text query sentence that perceives the user's intention:

q = Σ_{i=1}^{N-1} βi·ei

where N-1 represents the number of nodes of the semantic tree structure and βi represents the importance score of node ei.
6. The method for cross-modal search of complex text query to video based on semantic tree enhancement as claimed in claim 1, wherein in the step (4), a pre-trained deep Convolutional Neural Network (CNN) is used to perform feature extraction on an input video frame, and a deep visual feature of each frame is extracted as an initial visual feature.
7. The method for cross-modal search of complex text query to video based on semantic tree enhancement as claimed in claim 1, wherein the step (5) of extracting the temporal dependency of the consecutive frames along the sequence direction comprises: encoding the initial visual features of the video obtained in step (4) with a GRU (Gated Recurrent Unit), wherein at each time step the GRU takes the feature vector of the current frame and the hidden state of the previous frame as input and outputs the hidden state of the current frame; through the GRU operation, the temporal dependency between continuous frames is effectively captured;
the extraction of semantic relevance between frames in the whole video comprises the following steps: by the self-attention mechanism, firstly, scaled dot product attention is carried out, namely, representations of video sequence frames are projected into a plurality of attention spaces, dot products are carried out on a query frame projected by each frame and the rest key frames, a weight value on a current value frame is obtained through Softmax operation, the obtained weight value is multiplied by the value frame, and finally, final video representations are obtained from outputs of the attention spaces through splicing operation and normalization.
8. The method for cross-modal search for complex text query to video based on semantic tree enhancement as claimed in claim 1, wherein in the step (6), an attention neural network model with three parameters is designed to distinguish the importance degree of video frames, and the importance degree of the t-th frame is expressed as:
ηt = softmax(uva^T tanh(Wva·v̂t + bva))

wherein uva is a trainable transformation vector with dimension set to dva×1, bva is a trainable bias vector with dimension set to dva×1, Wva is a trainable transformation matrix with dimension set to dva×dv, and v̂t is the representation of the corresponding video frame with dimension set to dv×1;
multiplying the importance of each frame, as a weight, by the representation of the corresponding video frame, and finally accumulating the m frames into the final video representation:

ṽ = Σ_{t=1}^{m} ηt·v̂t
9. The method for searching the cross-modal from the complex text query to the video based on the semantic tree enhancement as claimed in claim 1, wherein in the step (7), the steps of learning the correlation between two modalities and training the model by using the common space learning algorithm are as follows:
(7-1) mapping the complex text query statement and the video visual feature representation obtained in the step (3) and the step (6) through an attention mechanism to a uniform public space through two linear projection models for expression; in order to obtain the same dimensionality, applying a nonlinear activation function to the obtained features, and then applying a Batch Normalization (BN) layer for processing;
(7-2) training the model in an end-to-end manner through the defined ternary ordering loss, so that the model automatically learns the correlation between the two modalities.
10. The method for cross-modal search of complex text query to video based on semantic tree enhancement as claimed in claim 1, wherein in the step (8), given a complex text query sentence, a video related to the complex text query sentence is found from a candidate video set and is used as a search result, and the steps are as follows:
(8-1) mapping the input complex text query sentence and all candidate videos to a public space through the model trained in the step (7);
(8-2) calculating the similarity of the complex text query sentence and all candidate videos in a public space, then sorting all the candidate videos in a descending order according to the similarity, and returning the videos with the top order as a retrieval result, thereby realizing the cross-modal retrieval from the complex text query sentence to the videos.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010686024.2A CN111897913B (en) | 2020-07-16 | 2020-07-16 | Semantic tree enhancement based cross-modal retrieval method for searching video from complex text |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111897913A true CN111897913A (en) | 2020-11-06 |
CN111897913B CN111897913B (en) | 2022-06-03 |
Family
ID=73189400
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010686024.2A Active CN111897913B (en) | 2020-07-16 | 2020-07-16 | Semantic tree enhancement based cross-modal retrieval method for searching video from complex text |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111897913B (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110175266A (en) * | 2019-05-28 | 2019-08-27 | 复旦大学 | A method of it is retrieved for multistage video cross-module state |
CN110490946A (en) * | 2019-07-15 | 2019-11-22 | 同济大学 | Text generation image method based on cross-module state similarity and generation confrontation network |
US20200104318A1 (en) * | 2017-03-07 | 2020-04-02 | Selerio Limited | Multi-modal image search |
CN111191075A (en) * | 2019-12-31 | 2020-05-22 | 华南师范大学 | Cross-modal retrieval method, system and storage medium based on dual coding and association |
CN111309971A (en) * | 2020-01-19 | 2020-06-19 | 浙江工商大学 | Multi-level coding-based text-to-video cross-modal retrieval method |
- 2020-07-16: CN application CN202010686024.2A filed; granted as patent CN111897913B, status Active
Cited By (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112241468A (en) * | 2020-07-23 | 2021-01-19 | 哈尔滨工业大学(深圳) | Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium |
CN113762007B (en) * | 2020-11-12 | 2023-08-01 | 四川大学 | Abnormal behavior detection method based on appearance and action feature double prediction |
CN113762007A (en) * | 2020-11-12 | 2021-12-07 | 四川大学 | Abnormal behavior detection method based on appearance and action characteristic double prediction |
CN112380385A (en) * | 2020-11-18 | 2021-02-19 | 湖南大学 | Video time positioning method and device based on multi-modal relational graph |
CN112380385B (en) * | 2020-11-18 | 2023-12-29 | 湖南大学 | Video time positioning method and device based on multi-mode relation diagram |
CN112883229A (en) * | 2021-03-09 | 2021-06-01 | 中国科学院信息工程研究所 | Video-text cross-modal retrieval method and device based on multi-feature-map attention network model |
CN113111837A (en) * | 2021-04-25 | 2021-07-13 | 山东省人工智能研究院 | Intelligent monitoring video early warning method based on multimedia semantic analysis |
CN113111836A (en) * | 2021-04-25 | 2021-07-13 | 山东省人工智能研究院 | Video analysis method based on cross-modal Hash learning |
CN113204666B (en) * | 2021-05-26 | 2022-04-05 | 杭州联汇科技股份有限公司 | Method for searching matched pictures based on characters |
CN113204666A (en) * | 2021-05-26 | 2021-08-03 | 杭州联汇科技股份有限公司 | Method for searching matched pictures based on characters |
CN113590881A (en) * | 2021-08-09 | 2021-11-02 | 北京达佳互联信息技术有限公司 | Video clip retrieval method, and training method and device of video clip retrieval model |
CN113590881B (en) * | 2021-08-09 | 2024-03-19 | 北京达佳互联信息技术有限公司 | Video clip retrieval method, training method and device for video clip retrieval model |
CN114048350A (en) * | 2021-11-08 | 2022-02-15 | 湖南大学 | Text-video retrieval method based on fine-grained cross-modal alignment model |
CN113934887B (en) * | 2021-12-20 | 2022-03-15 | 成都考拉悠然科技有限公司 | No-proposal time sequence language positioning method based on semantic decoupling |
CN113934887A (en) * | 2021-12-20 | 2022-01-14 | 成都考拉悠然科技有限公司 | No-proposal time sequence language positioning method based on semantic decoupling |
CN114429119A (en) * | 2022-01-18 | 2022-05-03 | 重庆大学 | Video and subtitle fragment retrieval method based on multi-cross attention |
CN114429119B (en) * | 2022-01-18 | 2024-05-28 | 重庆大学 | Video and subtitle fragment retrieval method based on multiple cross attentions |
CN114579803A (en) * | 2022-03-09 | 2022-06-03 | 北方工业大学 | Video retrieval model based on dynamic convolution and shortcut |
CN114579803B (en) * | 2022-03-09 | 2024-04-12 | 北方工业大学 | Video retrieval method, device and storage medium based on dynamic convolution and shortcuts |
CN114612748A (en) * | 2022-03-24 | 2022-06-10 | 北京工业大学 | Cross-modal video clip retrieval method based on feature decoupling |
CN114612748B (en) * | 2022-03-24 | 2024-06-07 | 北京工业大学 | Cross-modal video segment retrieval method based on feature decoupling |
CN114896450A (en) * | 2022-04-15 | 2022-08-12 | 中山大学 | Video time retrieval method and system based on deep learning |
CN114896450B (en) * | 2022-04-15 | 2024-05-10 | 中山大学 | Video moment retrieval method and system based on deep learning |
CN115099855A (en) * | 2022-06-23 | 2022-09-23 | 广州华多网络科技有限公司 | Method for preparing advertising pattern creation model and device, equipment, medium and product thereof |
CN115223086A (en) * | 2022-09-20 | 2022-10-21 | 之江实验室 | Cross-modal action positioning method and system based on interactive attention guidance and correction |
CN115223086B (en) * | 2022-09-20 | 2022-12-06 | 之江实验室 | Cross-modal action positioning method and system based on interactive attention guidance and correction |
Also Published As
Publication number | Publication date |
---|---|
CN111897913B (en) | 2022-06-03 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111897913B (en) | 2022-06-03 | Semantic tree enhancement based cross-modal retrieval method for searching video from complex text |
CN111309971B (en) | | Multi-level coding-based text-to-video cross-modal retrieval method |
CN110298037B (en) | | Convolutional neural network matching text recognition method based on enhanced attention mechanism |
CN110083705B (en) | | Multi-hop attention depth model, method, storage medium and terminal for target emotion classification |
CN111914067B (en) | | Chinese text matching method and system |
CN112100351A (en) | | Method and equipment for constructing intelligent question-answering system through question generation data set |
CN111581510A (en) | | Shared content processing method and device, computer equipment and storage medium |
CN111414461B (en) | | Intelligent question-answering method and system fusing knowledge base and user modeling |
CN111159485B (en) | | Tail entity linking method, device, server and storage medium |
CN110704601A (en) | | Method for solving video question-answering task requiring common knowledge by using problem-knowledge guided progressive space-time attention network |
CN111274398A (en) | | Method and system for analyzing comment emotion of aspect-level user product |
CN113297370B (en) | | End-to-end multi-modal question-answering method and system based on multi-interaction attention |
CN114693397B (en) | | Attention neural network-based multi-view multi-mode commodity recommendation method |
CN112328900A (en) | | Deep learning recommendation method integrating scoring matrix and comment text |
CN111400494B (en) | | Emotion analysis method based on GCN-Attention |
CN113220891B (en) | | Method for generating confrontation network image description based on unsupervised concept-to-sentence |
CN116204674B (en) | | Image description method based on visual concept word association structural modeling |
CN112131345B (en) | | Text quality recognition method, device, equipment and storage medium |
CN112651940A (en) | | Collaborative visual saliency detection method based on dual-encoder generation type countermeasure network |
CN112115253A (en) | | Depth text ordering method based on multi-view attention mechanism |
CN113806554A (en) | | Knowledge graph construction method for massive conference texts |
CN114417851A (en) | | Emotion analysis method based on keyword weighted information |
CN114298055B (en) | | Retrieval method and device based on multilevel semantic matching, computer equipment and storage medium |
CN114356990A (en) | | Base named entity recognition system and method based on transfer learning |
CN114239730A (en) | | Cross-modal retrieval method based on neighbor sorting relation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |