CN111897913A - Semantic tree enhancement based cross-modal retrieval method for searching video from complex text - Google Patents

Semantic tree enhancement based cross-modal retrieval method for searching video from complex text

Info

Publication number
CN111897913A
Authority
CN
China
Prior art keywords
node
video
text query
complex text
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010686024.2A
Other languages
Chinese (zh)
Other versions
CN111897913B (en)
Inventor
董建锋
彭敬伟
杨勋
郑琪
王勋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University filed Critical Zhejiang Gongshang University
Priority to CN202010686024.2A priority Critical patent/CN111897913B/en
Publication of CN111897913A publication Critical patent/CN111897913A/en
Application granted granted Critical
Publication of CN111897913B publication Critical patent/CN111897913B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a semantic tree enhancement based cross-modal retrieval method from complex text queries to video. For a complex text query statement, each word is converted into a leaf node representation, the relations between child nodes are mined, the two child nodes with the highest dependency are combined, and the semantic tree structure of the query statement is built recursively, yielding a semantic tree enhanced query representation. For the encoding of candidate videos, preliminary video features are obtained through a CNN, and the temporal dependency and semantic dependency between video frames are captured with a GRU and a self-attention module to obtain a robust video feature representation. The complex text query representation and the video feature representation are mapped into a common space, and the matching relationship between them is learned automatically, thereby realizing cross-modal retrieval from complex text queries to videos. The method of the invention can not only interpret the informative components in a complex text query sentence and better understand the user intention, but can also improve retrieval performance to a great extent.

Description

Semantic tree enhancement based cross-modal retrieval method for searching video from complex text
Technical Field
The invention relates to the field of cross-modal retrieval from text query to video, in particular to a cross-modal retrieval method from complex text query to video based on semantic tree enhancement.
Background
With the exponential growth of user-generated videos on the Internet, uploading videos and searching for videos of interest have become indispensable activities in people's daily life. Cross-modal retrieval from text query to video is one of the techniques for finding videos of interest. Early text-to-video cross-modal retrieval methods were based on text keywords and were extensively studied and developed, but such methods only allow the user to enter a few keywords as the query. As people's demands on Internet video search capability continue to grow, keyword-based queries can hardly express the user's search intention fully, which affects the search experience. In response to this problem, video retrieval supporting complex text queries has been developed. Therefore, how to understand the richer semantics conveyed by complex text queries and to capture the user's intent has become one of the difficult challenges in the field of cross-modal retrieval.
Existing cross-modal retrieval methods from text queries to videos generally fall into two categories. The first category consists of concept-based methods that use a large number of visual concepts to describe the video content while converting the text query into a set of basic visual concepts, so that the text query is represented with visual concepts; cross-modal retrieval is finally realized through concept matching between the different modalities (text and video). However, such methods have the following disadvantages. First, they are generally not very effective for complex text queries, because the semantic content of a complex text query is often difficult to describe adequately with a few visual concepts, which leads to information loss, and the semantics of a complex text query is not just an aggregation of extracted concepts. Second, how to effectively train the concept classifiers and select the relevant concepts is itself a very challenging problem. The second category learns a joint embedding space of text queries and videos to support video retrieval: the text query is converted into a word vector representation, the video is represented as a temporally aggregated feature, and the two are mapped into a common space so that similar text queries and videos are close in the common space while dissimilar ones are far apart. Although this direction handles longer text queries better than concept-based approaches, such methods also have disadvantages: first, representing the user's text query merely with word vectors cannot effectively capture the user's intention, so the video retrieval effect for complex text queries is poor; second, such methods lack interpretability of the retrieval process.
Disclosure of Invention
To address the defects of the prior art, the invention adopts a modeling method oriented to retrieval of videos from complex text queries and provides a cross-modal retrieval method based on semantic tree enhancement. The complex text query is first encoded with a tree structure, and representation learning is performed on both the complex text query and the video; the encoded features are mapped into a common space where their similarity is computed, thereby achieving cross-modal retrieval from complex text queries to videos.
The purpose of the invention is realized by the following technical scheme: a semantic tree enhancement based cross-modal retrieval method for complex text query to video comprises the following steps:
(1) extracting features of the complex text query sentence to obtain leaf node features of the complex text query sentence;
(2) encoding a tree structure of semantic tree enhancement on leaf node characteristics of the complex text query sentence obtained in the step (1);
(3) expressing the codes of the semantic tree structures of the complex text query sentences obtained in the step (2), and mining the importance of each node component forming the tree structures by using an attention mechanism to obtain the expression of the complex text query sentences capable of perceiving the intentions of the user;
(4) performing feature extraction on the video frame to obtain initial visual feature representation of the video;
(5) extracting the time dependence of continuous frames along the sequence direction from the initial visual feature representation obtained in the step (4), and extracting the semantic correlation between the frames;
(6) applying an attention mechanism to the video representation obtained in the step (5), and distinguishing the importance degree of the information to enable the useful information to occupy a larger proportion in the final video visual feature representation;
(7) respectively mapping the complex text query sentence representation and the video visual feature representation processed by the attention mechanism in the steps (3) and (6) into a common space, learning the correlation degree between the two modes by using a common space learning algorithm, and training a model in an end-to-end mode;
(8) realizing semantic tree based cross-modal retrieval from complex text query to video by using the model obtained by training in step (7).
Further, the method for extracting leaf node features of the complex text query statement in step (1) comprises the following substeps:
(1-1) encoding each word in the complex text query sentence with one-hot encoding to obtain a one-hot encoded vector sequence; multiplying the one-hot encoded vectors by a word embedding matrix to obtain a word vector sequence representation of the complex text query statement;
(1-2) modeling the word vector sequence representation using the LSTM in the RNN, converting the word vector representation into leaf node features.
Further, the encoding of the tree structure with semantic tree enhancement on the leaf node feature of the complex text query statement in the step (2) includes the following sub-steps:
(2-1) generating a father node by using an LSTM method of a tree structure, and recursively forming a semantic tree structure from bottom to top; the semantic tree consists of two types of nodes: the system comprises child nodes and parent nodes, wherein the child nodes represent words in a complex text query sentence, and the parent nodes represent combinations of word components; expressing the leaf node characteristics obtained in the step (1) as first-layer nodes of a semantic tree, and combining two adjacent child nodes in all the child nodes by using an LSTM method of a tree structure to obtain candidate father nodes;
(2-2) selecting the best father node from the candidate father nodes of each layer as a next-layer node according to the memory-enhanced node scoring module, and directly copying unselected child nodes to the next layer to be used as the representation of the next-layer node; the above process is repeated recursively until only one node remains.
Further, in the step (2-1), two adjacent child nodes (h_i, c_i) and (h_{i+1}, c_{i+1}) are given as input, where h_i represents the hidden state of the i-th node and c_i represents the memory state of the i-th node; the parent node is computed as:
h_p = o ⊙ tanh(c_p)
c_p = f_l ⊙ c_i + f_r ⊙ c_{i+1} + τ ⊙ g
where h_p represents the hidden state of the parent node, with dimension d_t×1; c_p represents the memory state of the parent node, with dimension d_t×1; ⊙ represents element-wise multiplication between features; the gates τ, f_l, f_r, o, g are expressed as:
[τ; f_l; f_r; o; g] = [σ; σ; σ; σ; tanh]( W_p [h_i; h_{i+1}] + b_p )
where W_p represents a trainable transformation matrix with dimension 5d_t×2d_t; b_p represents a trainable bias vector with dimension 5d_t×1; σ represents the sigmoid nonlinear activation function, and tanh represents the tanh nonlinear transformation function;
suppose the t-th layer of the semantic tree consists of N_t nodes; the nodes of the t-th layer are expressed as:
H^t = {(h_1^t, c_1^t), (h_2^t, c_2^t), …, (h_{N_t}^t, c_{N_t}^t)}
if the adjacent t-th-layer nodes (h_i^t, c_i^t) and (h_{i+1}^t, c_{i+1}^t) are selected to be merged, the parent node can be represented as:
(h_i^{t+1}, c_i^{t+1}) = TreeLSTM((h_i^t, c_i^t), (h_{i+1}^t, c_{i+1}^t))
where (h_i^t, c_i^t) represents the i-th node of the t-th layer, (h_{i+1}^t, c_{i+1}^t) represents the (i+1)-th node of the t-th layer, (h_i^{t+1}, c_i^{t+1}) represents the i-th node of the (t+1)-th layer, and TreeLSTM represents the LSTM method of the tree structure;
in the step (2-2), the best parent node is determined according to the memory-enhanced node scoring module f_score(·; Θ_score); the likelihood s_i^t that the i-th candidate parent node is selected is expressed as:
s_i^t = f_score(h_i^{t+1}, m_i^t; Θ_score)
where Θ_score represents the trainable parameters of the node scoring module and m_i^t is the context semantic vector; the importance degree of each node hidden state is judged by querying the memory M, and the context semantic vector m_i^t is obtained by aggregating the node hidden states in M according to their importance degrees; the memory M is represented as:
M = {h_1^1, h_2^1, …, h_N^1}
where h_N^1 indicates the hidden state of the N-th node of layer 1; the importance degree α_{i,j}^t of a node hidden state in the memory M is expressed as:
α_{i,j}^t = softmax( φ_r( W_m h_j^1 + b_m )^T h_i^{t+1} )
where α_{i,j}^t represents the importance degree of the i-th node of the t-th layer of the semantic tree with respect to the hidden state of the j-th node in the memory M; W_m represents a trainable transformation matrix with dimension d_t×d_t; b_m represents a trainable bias vector with dimension d_t×1; φ_r represents the ReLU nonlinear activation function; softmax represents the softmax nonlinear function; according to the importance degrees α_{i,j}^t, the information in the memory M is aggregated to obtain the context semantic vector m_i^t, expressed as:
m_i^t = Σ_{j=1}^{N} α_{i,j}^t · h_j^1
where α_{i,j}^t is the weight normalized after applying the attention mechanism;
the score s_i^t of a candidate parent node is obtained by the following formula:
s_i^t = w_s^T · φ_r( W_s [h_i^{t+1}; m_i^t] + b_s )
where φ_r is the ReLU nonlinear activation function; w_s represents a trainable transformation vector with dimension 2d_t×1; b_s represents a trainable bias vector with dimension 2d_t×1; W_s represents a trainable transformation matrix with dimension 2d_t×2d_t;
the candidate parent node with the largest score s_i^t is selected from the candidate parent nodes as the best parent node.
Further, in the step (3), on the basis of the complex text query statement representation based on semantic tree enhancement, an attention neural network is introduced to mine the importance of each node component; the importance score β_i of node e_i is expressed as:
β_i = softmax( u_ta^T · φ_r( W_ta e_i + b_ta ) )
where φ_r is the ReLU nonlinear activation function; W_ta represents a trainable transformation matrix with dimension d_ta×d_t; b_ta represents a trainable bias vector with dimension d_ta×1; u_ta represents a trainable transformation vector with dimension d_ta×1; the importance scores are used as the weights of the nodes, and all node components are aggregated to obtain the user-intention-aware representation f_q of the complex text query sentence:
f_q = Σ_{i=1}^{N-1} β_i · e_i
where N−1 represents the number of nodes of the semantic tree structure and β_i represents the importance score of node e_i.
Further, in the step (4), feature extraction is performed on the input video frame by using a pre-trained deep Convolutional Neural Network (CNN), and a deep visual feature of each frame is extracted as an initial visual feature.
Further, in the step (5), extracting the temporal dependency of the consecutive frames along the sequence direction includes: encoding the initial visual features of the video obtained in step (4) with a GRU (Gated Recurrent Unit), where at each time step the GRU takes the feature vector of the current frame and the hidden state of the previous frame as input and outputs the hidden state of the current frame; through the GRU operation, the temporal dependency between consecutive frames is effectively captured;
the extraction of semantic relevance between frames in the whole video includes: through a self-attention mechanism, first performing scaled dot-product attention, namely projecting the representations of the video sequence frames into a plurality of attention spaces, dot-multiplying the query frame projected from each frame with the other key frames, obtaining a weight on the current value frame through a Softmax operation, and multiplying the obtained weight by the value frame; the final video representation is obtained from the outputs of the multiple attention spaces through a concatenation operation and normalization.
Further, in the step (6), an attention neural network model with three parameters is designed to distinguish the importance degree of the video frames; the importance degree η_t of the t-th frame is expressed as:
η_t = softmax( u_va^T · ( W_va ṽ_t + b_va ) )
where u_va is a trainable transformation vector with dimension d_va×1, b_va is a trainable bias vector with dimension d_va×1, W_va is a trainable transformation matrix with dimension d_va×d_v, and ṽ_t is the representation of the corresponding video frame with dimension d_v×1;
the importance degree of each frame is used as a weight and multiplied by the representation of the corresponding video frame, and the frames are finally accumulated to form the final video representation:
f_v = Σ_{t=1}^{m} η_t · ṽ_t
where η_t denotes the importance degree of the t-th frame, and f_v, with dimension d_v×1, is the aggregate representation of all components of the video frames.
Further, in the step (7), the step of learning the correlation between the two modalities and training the model by using a common space learning algorithm is as follows:
(7-1) mapping the complex text query statement and the video visual feature representation obtained in the step (3) and the step (6) through an attention mechanism to a uniform public space through two linear projection models for expression; in order to obtain the same dimensionality, applying a nonlinear activation function to the obtained features, and then applying a Batch Normalization (BN) layer for processing;
(7-2) training the model in an end-to-end manner through a defined triplet ranking loss, so that the model automatically learns the correlation between the two modalities.
Further, in the step (8), given a complex text query sentence, finding out a video related to the complex text query sentence from a candidate video set, and using the video as a retrieval result, the steps are as follows:
(8-1) mapping the input complex text query sentence and all candidate videos to a public space through the model trained in the step (7);
(8-2) calculating the similarity of the complex text query sentence and all candidate videos in a public space, then sorting all the candidate videos in a descending order according to the similarity, and returning the videos with the top order as a retrieval result, thereby realizing the cross-modal retrieval from the complex text query sentence to the videos.
The invention has the following beneficial effects: the invention provides a novel cross-modal retrieval framework from complex text queries to video, which can automatically form a flexible tree structure to model complex text query sentences, and designs a memory-enhanced node scoring module to mine the linguistic context of the tree structure of the complex text query sentence. An attention mechanism is introduced into both the complex text query sentence representation and the video visual feature representation to deeply mine the combination of node components in the complex text query and the importance degree of each video frame. The invention can interpret the informative components in a complex text query sentence, better understand the user intention, and improve retrieval performance to a great extent.
Drawings
FIG. 1 is a schematic diagram of an implementation of a semantic tree enhancement-based cross-modal search method for complex text query to video;
FIG. 2 is an example of a complex text query to video search of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
In order to solve the problem of cross-modal retrieval from complex text query to video, the invention provides a semantic tree enhancement-based cross-modal retrieval method from complex text query to video, which comprises the following specific steps:
(1) Extracting features of the complex text query sentence with a feature extraction method to obtain the leaf node features of the complex text query sentence.
(1-1) Given a complex text query sentence Q of length N, it can be represented as:
Q = {w_1, w_2, …, w_N}
where w_1 represents the first word in the complex text query statement. Each word in the complex text query statement is first encoded with one-hot encoding to obtain a sequence of one-hot encoded vectors {w'_1, w'_2, …, w'_N}, where w'_t represents the one-hot encoded vector of the t-th word. The one-hot encoded vectors are multiplied by a word embedding matrix to obtain the word vector sequence representation {q_1, q_2, …, q_N} of the complex text query statement Q.
(1-2) The LSTM (long short-term memory network), a type of RNN (recurrent neural network), is used as the basic sequence modeling module. To maintain structural consistency, the N word vectors in the word vector sequence of step (1-1) are converted into N leaf nodes with the LSTM. At the i-th time step, the i-th word vector q_i in the word vector sequence {q_1, q_2, …, q_N} is converted into a leaf node by the LSTM unit, and the i-th leaf node is represented as:
(h_i, c_i) = LSTM(q_i, h_{i-1}, c_{i-1})
where h_{i-1} represents the hidden state of the (i-1)-th node, c_{i-1} represents the memory state of the (i-1)-th node, and (h_i, c_i) represents the i-th leaf node feature into which the i-th word vector is converted.
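For illustration, steps (1-1) and (1-2) can be sketched in PyTorch roughly as follows; the vocabulary size, embedding dimension and hidden dimension d_t are illustrative assumptions and not values prescribed by the invention:

```python
import torch
import torch.nn as nn

class LeafEncoder(nn.Module):
    """Sketch of step (1): one-hot words -> word embeddings -> LSTM leaf nodes (h_i, c_i)."""
    def __init__(self, vocab_size=10000, embed_dim=300, d_t=512):
        super().__init__()
        # An embedding lookup is equivalent to multiplying a one-hot vector by the embedding matrix.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim, d_t)

    def forward(self, word_ids):                      # word_ids: (batch, N) token indices
        q = self.embedding(word_ids)                  # (batch, N, embed_dim) word vector sequence
        batch, n, _ = q.shape
        h = q.new_zeros(batch, self.cell.hidden_size)
        c = q.new_zeros(batch, self.cell.hidden_size)
        leaves = []
        for i in range(n):                            # (h_i, c_i) = LSTM(q_i, h_{i-1}, c_{i-1})
            h, c = self.cell(q[:, i], (h, c))
            leaves.append((h, c))
        return leaves                                 # N leaf-node (hidden, memory) pairs

encoder = LeafEncoder()
leaves = encoder(torch.randint(0, 10000, (2, 6)))     # a batch of two 6-word queries
print(len(leaves), leaves[0][0].shape)                # 6 torch.Size([2, 512])
```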
(2) Encoding the semantic tree enhanced tree structure on the leaf node features of the complex text query sentence obtained in step (1). To better understand the complex text query statement, tree-structured LSTM (TreeLSTM) modeling is carried out on the leaf node features obtained in step (1), where the TreeLSTM method is used to generate a parent node. Given two adjacent child nodes (h_i, c_i) and (h_{i+1}, c_{i+1}) as input, where h_i represents the hidden state of the i-th node and c_i represents the memory state of the i-th node, the parent node is computed as:
h_p = o ⊙ tanh(c_p)
c_p = f_l ⊙ c_i + f_r ⊙ c_{i+1} + τ ⊙ g
where h_p represents the hidden state of the parent node, with dimension d_t×1; c_p represents the memory state of the parent node, with dimension d_t×1; ⊙ represents element-wise multiplication between features; the gates τ, f_l, f_r, o and g are obtained from h_i and h_{i+1} after the sigmoid and tanh functions, and are expressed as:
[τ; f_l; f_r; o; g] = [σ; σ; σ; σ; tanh]( W_p [h_i; h_{i+1}] + b_p )
where W_p represents a trainable transformation matrix with dimension 5d_t×2d_t; b_p represents a trainable bias vector with dimension 5d_t×1; σ represents the sigmoid nonlinear activation function, and tanh represents the tanh nonlinear transformation function. Parent nodes are generated with the TreeLSTM, and the semantic tree structure is formed recursively in a bottom-up manner; the semantic tree consists of two types of nodes: child nodes, which represent the words in the complex text query sentence, and parent nodes, which represent combinations of word components and can describe more complex semantic information than the child nodes.
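The parent-node composition above can be sketched as a binary TreeLSTM cell; this is an illustrative PyTorch implementation that follows the 5d_t×2d_t parameterization of W_p, with the batch size and d_t chosen arbitrarily:

```python
import torch
import torch.nn as nn

class BinaryTreeLSTM(nn.Module):
    """Sketch of the TreeLSTM composition: two adjacent child nodes -> one candidate parent node."""
    def __init__(self, d_t=512):
        super().__init__()
        self.proj = nn.Linear(2 * d_t, 5 * d_t)       # plays the role of W_p (5d_t x 2d_t) and b_p

    def forward(self, left, right):
        h_i, c_i = left                               # each (batch, d_t)
        h_j, c_j = right
        gates = self.proj(torch.cat([h_i, h_j], dim=-1))
        tau, f_l, f_r, o, g = gates.chunk(5, dim=-1)
        tau, f_l, f_r, o = map(torch.sigmoid, (tau, f_l, f_r, o))
        g = torch.tanh(g)
        c_p = f_l * c_i + f_r * c_j + tau * g         # memory state of the parent node
        h_p = o * torch.tanh(c_p)                     # hidden state of the parent node
        return h_p, c_p

tree_cell = BinaryTreeLSTM()
left = (torch.randn(2, 512), torch.randn(2, 512))
right = (torch.randn(2, 512), torch.randn(2, 512))
h_p, c_p = tree_cell(left, right)                     # candidate parent for two adjacent children
```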
(2-1) The leaf node feature sequence obtained in step (1) is taken as the first-layer nodes of the semantic tree. Suppose the t-th layer of the semantic tree consists of N_t nodes; the t-th layer nodes are expressed as:
H^t = {(h_1^t, c_1^t), (h_2^t, c_2^t), …, (h_{N_t}^t, c_{N_t}^t)}
If we choose to merge the adjacent t-th-layer nodes (h_i^t, c_i^t) and (h_{i+1}^t, c_{i+1}^t), the parent node is computed with the TreeLSTM and can be represented as:
(h_i^{t+1}, c_i^{t+1}) = TreeLSTM((h_i^t, c_i^t), (h_{i+1}^t, c_{i+1}^t))
where (h_i^t, c_i^t) represents the i-th node of the t-th layer, (h_{i+1}^t, c_{i+1}^t) represents the (i+1)-th node of the t-th layer, and (h_i^{t+1}, c_i^{t+1}) represents the i-th node of the (t+1)-th layer. Every two adjacent child nodes among all the child nodes are combined with the tree-structured LSTM (TreeLSTM) method to obtain the candidate parent nodes.
(2-2) The key step in building the semantic tree structure is how to accurately select the best parent node from the candidate parent nodes at each layer, which requires a node scoring module. For the node scoring module, when the given query is a complex text query, it is difficult to determine the best parent node effectively because of the ambiguity of language and the limited ability of a node's hidden state to remember historical inputs. Therefore, a memory-enhanced node scoring module f_score(·; Θ_score) is specially designed for complex text query sentences to determine the best parent node. The likelihood s_i^t that the i-th candidate parent node is selected can be expressed as:
s_i^t = f_score(h_i^{t+1}, m_i^t; Θ_score)
where Θ_score represents the trainable parameters of the node scoring module and m_i^t is the context semantic vector. To obtain m_i^t, the importance degree of each node hidden state is judged by querying a memory M, and the context semantic vector m_i^t is obtained by aggregating the node hidden states in M according to their importance degrees. The memory M can be represented as:
M = {h_1^1, h_2^1, …, h_N^1}
where h_1^1 represents the hidden state of the 1st node of layer 1, h_2^1 represents the hidden state of the 2nd node of layer 1, and h_N^1 represents the hidden state of the N-th node of layer 1. The importance degree α_{i,j}^t of a node hidden state in the memory M can be expressed as:
α_{i,j}^t = softmax( φ_r( W_m h_j^1 + b_m )^T h_i^{t+1} )
where α_{i,j}^t represents the importance degree of the i-th node of the t-th layer of the semantic tree with respect to the hidden state of the j-th node in the memory M; W_m represents a trainable transformation matrix with dimension d_t×d_t; b_m represents a trainable bias vector with dimension d_t×1; φ_r represents the ReLU nonlinear activation function; and softmax represents the softmax nonlinear function. According to the importance degrees α_{i,j}^t, the information in the memory M is aggregated to obtain the context semantic vector m_i^t, which can be expressed as:
m_i^t = Σ_{j=1}^{N} α_{i,j}^t · h_j^1
where α_{i,j}^t is the weight normalized after applying the attention mechanism. After obtaining the context semantic vector m_i^t, the score s_i^t of the candidate parent node is obtained by the following formula:
s_i^t = w_s^T · φ_r( W_s [h_i^{t+1}; m_i^t] + b_s )
where φ_r is the ReLU nonlinear activation function; w_s represents a trainable transformation vector with dimension 2d_t×1; b_s represents a trainable bias vector with dimension 2d_t×1; and W_s represents a trainable transformation matrix with dimension 2d_t×2d_t.
The memory-enhanced node scoring module fuses the contextual semantic information, injecting semantic context into each choice so as to better select parent nodes. In this recursive process, every two adjacent child nodes among all the child nodes are combined to obtain candidate parent nodes, and the candidate parent node with the largest score s_i^t is selected as the next-layer node. Only the representation of the selected node is updated, and the unselected child nodes are copied directly to the next layer as representations of next-layer nodes. The above process is repeated recursively until only one node remains. Through this process a semantic tree structure is composed, and its encoding can be expressed as:
{e_1, e_2, …, e_{N-1}} = LSTree({q_1, q_2, …, q_N})
where LSTree represents the overall construction process of the semantic tree and e_i ∈ R^{d_t} represents the representation of the i-th node. The coded representation of the semantic tree structure automatically extracts semantic components that may match the search intention of the user, and can better understand complex text query sentences without any syntactic annotations.
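The memory-enhanced node scoring module can be sketched as below; the module name and the exact form of f_score follow the reconstruction given above (attention over the leaf-node memory M, then a scored fusion of the candidate node and the context vector), so they should be read as an assumption rather than the definitive implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryScorer(nn.Module):
    """Sketch of f_score: builds the context vector m from the memory M and scores a candidate."""
    def __init__(self, d_t=512):
        super().__init__()
        self.key = nn.Linear(d_t, d_t)                    # W_m, b_m
        self.fuse = nn.Linear(2 * d_t, 2 * d_t)           # W_s, b_s
        self.w_s = nn.Linear(2 * d_t, 1, bias=False)      # w_s

    def forward(self, h_candidate, memory):
        # h_candidate: (batch, d_t) hidden state of one candidate parent node
        # memory:      (batch, N, d_t) hidden states of the first-layer (leaf) nodes
        keys = F.relu(self.key(memory))                          # (batch, N, d_t)
        alpha = F.softmax(torch.einsum('bnd,bd->bn', keys, h_candidate), dim=-1)
        context = torch.einsum('bn,bnd->bd', alpha, memory)      # context semantic vector m
        fused = torch.cat([h_candidate, context], dim=-1)        # (batch, 2*d_t)
        return self.w_s(F.relu(self.fuse(fused))).squeeze(-1)    # scalar score per candidate

# Usage: score three candidate parent nodes and keep the best one.
scorer = MemoryScorer()
memory = torch.randn(2, 6, 512)
candidates = [torch.randn(2, 512) for _ in range(3)]
scores = torch.stack([scorer(h, memory) for h in candidates], dim=-1)  # (batch, 3)
best = scores.argmax(dim=-1)                                           # index of the best parent
```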
(3) The encoding of the semantic tree structure of the complex text query sentence obtained in step (2) is represented, and the importance of each node component forming the tree structure is mined with an attention mechanism to obtain a representation of the complex text query that can perceive the user's intention.
A complex text query sentence usually consists of several referents and their associated descriptions across videos, and some concepts or descriptions in the complex text query sentence may not be clearly represented in the video or may span only a short time. Therefore, on the basis of the semantic tree enhanced representation of the complex text query sentence, an attention network is introduced to mine the importance of each node component: by scoring the importance of the node components of the semantic tree enhanced complex text query sentence, the more important node components can be perceived, and the scores are used as weights to aggregate the nodes of the semantic tree enhanced complex text query sentence, so that a representation of the complex text query sentence that effectively perceives the user's intention is obtained. The concrete implementation is as follows:
Using the attention mechanism, a neural network is introduced to learn the importance of each node component. The importance score β_i of node e_i can be expressed as:
β_i = softmax( u_ta^T · φ_r( W_ta e_i + b_ta ) )
where φ_r is the ReLU nonlinear activation function; W_ta represents a trainable transformation matrix with dimension d_ta×d_t; b_ta represents a trainable bias vector with dimension d_ta×1; and u_ta represents a trainable transformation vector with dimension d_ta×1. The importance scores are used as the weights of the nodes, and all node components are aggregated to obtain the user-intention-aware representation f_q of the complex text query sentence:
f_q = Σ_{i=1}^{N-1} β_i · e_i
where N−1 represents the number of nodes of the semantic tree structure and β_i represents the importance score of node e_i.
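A minimal sketch of this attention-based aggregation over the tree nodes is given below; the dimensions are illustrative, and the same weighted-pooling pattern (with parameters W_va, b_va, u_va) is reused for the video frames in step (6):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NodeAttentionPool(nn.Module):
    """Sketch of step (3): score every tree node e_i, then aggregate into the query vector f_q."""
    def __init__(self, d_t=512, d_ta=256):
        super().__init__()
        self.proj = nn.Linear(d_t, d_ta)              # W_ta, b_ta
        self.u = nn.Linear(d_ta, 1, bias=False)       # u_ta

    def forward(self, nodes):                         # nodes: (batch, N-1, d_t)
        beta = F.softmax(self.u(F.relu(self.proj(nodes))).squeeze(-1), dim=-1)  # importance scores
        return torch.einsum('bn,bnd->bd', beta, nodes)                          # f_q: (batch, d_t)

pool = NodeAttentionPool()
f_q = pool(torch.randn(2, 5, 512))                    # aggregate 5 node representations per query
```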
(4) Extracting video features with a feature extraction method to obtain the initial visual feature representation of the video.
Specifically, a pre-trained deep Convolutional Neural Network (CNN) may be used to extract features from the input video frames: for a given video, video frames are extracted uniformly every 0.5 seconds; assuming there are m extracted video frames, the video is described by a series of feature vectors {v_1, v_2, …, v_m}. The deep visual features of each frame are extracted with a deep CNN model, such as a ResNet model trained on the ImageNet dataset. The video can then be represented as:
V = {v_1, v_2, …, v_m}
where v_t represents the feature vector of the extracted t-th frame. The initial visual features of the video frames are obtained through the above feature extraction, but these are only simple initial visual features extracted by a CNN model and the content information they contain is relatively coarse; the features are therefore further encoded to obtain a more refined representation.
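For illustration, the initial frame features could be obtained roughly as follows with a torchvision ResNet pre-trained on ImageNet, as mentioned above; the concrete backbone, preprocessing and frame-sampling code are assumptions of this sketch and are not prescribed by the invention:

```python
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

# ResNet-152 with its classifier removed, so the forward pass returns the pooled 2048-d feature.
weights = models.ResNet152_Weights.IMAGENET1K_V1
resnet = models.resnet152(weights=weights)
resnet.fc = nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def encode_frames(frames):
    """frames: list of m PIL images sampled from the video every 0.5 seconds."""
    batch = torch.stack([preprocess(f) for f in frames])   # (m, 3, 224, 224)
    with torch.no_grad():
        return resnet(batch)                               # (m, 2048) initial features v_1..v_m
```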
(5) The initial visual feature representation obtained in step (4) is further mined for its temporal and semantic dependencies: first, the temporal dependency of consecutive frames is extracted along the sequence direction; second, the semantic correlation between frames is extracted.
(5-1) Extracting the temporal dependency of consecutive frames along the sequence direction. Since a video is composed of a sequence of images with a definite order, i.e., the video is temporal, acquiring the temporal information of the video is also important. To extract the temporal dependencies of consecutive frames along the sequence direction, a GRU (Gated Recurrent Unit) is used to encode the initial visual features of the video obtained in step (4) and model the temporal dependencies between consecutive frames. At each time step, the GRU takes the feature vector of the current frame and the hidden state of the previous frame as input and outputs the hidden state of the current frame. The hidden state of the t-th frame is represented as:
h'_t = GRU(v_t, h'_{t-1})
where v_t represents the t-th frame feature vector extracted by the CNN network and h'_{t-1} represents the hidden state of the (t-1)-th frame. Through the above operation, the dependency relationship between consecutive frames is effectively captured. The GRU-processed video sequence V' can be represented as:
V' = {h'_1, h'_2, …, h'_m}
where h'_t represents the hidden state of the t-th frame and m represents the number of video frames uniformly extracted from the video.
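A minimal sketch of this GRU encoding of step (5-1), with illustrative feature dimensions:

```python
import torch
import torch.nn as nn

d_in, d_v = 2048, 1024                       # illustrative: CNN feature size and GRU hidden size
gru = nn.GRU(d_in, d_v, batch_first=True)

frame_feats = torch.randn(1, 30, d_in)       # (batch, m, d_in): the m frame vectors v_1..v_m
hidden_seq, _ = gru(frame_feats)             # (batch, m, d_v): h'_t = GRU(v_t, h'_{t-1})
```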
(5-2) Extracting the semantic correlation between frames in the whole video.
To enhance the representation of the video sequence features, the semantic correlation between frames in the whole video is exploited on the basis of the video representation of step (5-1): through a self-attention mechanism, the representations of the video sequence frames are projected into a plurality of attention spaces, the query frame projected from each frame is dot-multiplied with the other key frames, a weight on the current value frame is obtained through a Softmax operation, and the obtained weight is multiplied by the value frame; the outputs of the multiple attention spaces are finally aggregated to obtain the final video representation. The specific implementation process is as follows:
The semantic correlation between video frames is exploited by first performing scaled dot-product attention through the self-attention mechanism, i.e., projecting the representation of the video sequence frames into multiple attention spaces. The query frame projected from each frame is dot-multiplied with the other key frames, and the weight on the current value frame is obtained through the Softmax operation. The weights on the value frames in the i-th attention space are expressed as:
A^i = softmax( (W_q^i V')^T (W_k^i V') / √d_i )
where W_q^i is a trainable transformation matrix with dimension d_i×d_v, W_k^i is a trainable transformation matrix with dimension d_i×d_v, and W_v^i is a trainable transformation matrix with dimension d_i×d_v; through these three parameters, the initial input V' is projected into the query, key and value spaces of the i-th attention space, whose per-frame dimension is set to d_i×1. The obtained weights are multiplied by the value frames, giving the output of the i-th attention space head_i = (W_v^i V') · A^i. Finally, the outputs of the multiple attention spaces are concatenated and normalized to obtain the final video representation:
Ṽ = Norm( W_p · Concat(head_1, head_2, …, head_z) )
where Concat(·) represents the concatenation operation; head_1 is the output of the 1st attention space, head_2 is the output of the 2nd attention space, and head_z is the output of the z-th attention space; W_p is a trainable transformation matrix with dimension d_v×d_v that projects the concatenated features back into the original space; and Norm(·) represents the layer normalization operation. Ṽ is the video sequence representation enhanced by the self-attention model and is expressed as:
Ṽ = {ṽ_1, ṽ_2, …, ṽ_m}
where m represents the number of video frames uniformly extracted from the video, and ṽ_t represents the representation of the t-th frame after enhancement by the self-attention model described above. The video representation enhanced by the self-attention model can effectively capture both the temporal dependency between consecutive frames and the semantic correlation between frames.
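For illustration, the multi-head self-attention over the GRU outputs can be sketched with PyTorch's built-in module, which internally performs the query/key/value projections, the scaled dot products, the concatenation of the z heads and an output projection playing the role of W_p; whether a residual connection is added before the layer normalization is an implementation choice not specified here:

```python
import torch
import torch.nn as nn

d_v, z = 1024, 8                                        # illustrative hidden size and number of heads
self_attn = nn.MultiheadAttention(embed_dim=d_v, num_heads=z, batch_first=True)
norm = nn.LayerNorm(d_v)

hidden_seq = torch.randn(1, 30, d_v)                    # V' = {h'_1, ..., h'_m} from the GRU
attended, _ = self_attn(hidden_seq, hidden_seq, hidden_seq)  # scaled dot-product attention, z heads
enhanced = norm(attended)                               # V~ = Norm(W_p * Concat(head_1..head_z))
```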
(6) The attention mechanism is applied to the video representation obtained in step (5) to distinguish the importance of the information, so that useful information accounts for a larger proportion of the final video representation.
Specifically, an attention neural network model with three parameters is designed to distinguish the importance degree of the video frames; the importance degree η_t of the t-th frame can be expressed as:
η_t = softmax( u_va^T · ( W_va ṽ_t + b_va ) )
where u_va is a trainable transformation vector with dimension d_va×1, b_va is a trainable bias vector with dimension d_va×1, and W_va is a trainable transformation matrix with dimension d_va×d_v. The importance degree of each frame is used as a weight and multiplied by the representation of the corresponding video frame, and the m frames are finally accumulated as the final video representation:
f_v = Σ_{t=1}^{m} η_t · ṽ_t
where η_t denotes the importance degree of the t-th frame, and f_v, with dimension d_v×1, is the aggregate representation of all components of the video frames.
(7) Respectively mapping the complex text query sentence representation and the video visual feature representation processed by the attention mechanism in the step (3) and the step (6) into a common space, learning the correlation degree between the two modes by using a common space learning algorithm, and finally training the model in an end-to-end mode. Specifically, the method of learning the correlation between two modalities and training the model using a common space learning algorithm is as follows:
and (7-1) mapping the complex text query sentence obtained in the step (3) and the step (6) through the attention mechanism and the video visual feature representation to a uniform public space through two linear projection models for expression. To get the same dimensions, we apply a non-linear activation function to the resulting features, followed by a Batch Normalization (BN) layer de-processing. The specific implementation process is as follows:
the final complex text query statement representation is obtained through the step (3) and the step (6)
Figure BDA0002587592230000114
And video visual feature representation
Figure BDA0002587592230000115
Figure BDA0002587592230000116
Is d in the dimension oft*1,
Figure BDA0002587592230000117
Is d in the dimension ofv*1. We pass through two linesThe sexual projection model projects the complex text query sentence representation and the video visual feature representation into a joint embedding space.
The projected complex text query statement is represented as:
q* = BN( W_1 f_q + b_1 )
where W_1 is a trainable transformation matrix with dimension d*×d_t and b_1 is a trainable bias vector with dimension d*×1; BN(·) represents a batch normalization layer, which contributes to the performance of the model.
The projected video visual features are represented as:
v* = BN( W_2 f_v + b_2 )
where W_2 is a trainable transformation matrix with dimension d*×d_v and b_2 is a trainable bias vector with dimension d*×1; BN(·) represents a batch normalization layer, which contributes to the performance of the model.
The cosine similarity between the projected complex text query statement representation and the projected video visual feature representation is used as the cross-modal matching score, expressed as:
s(Q, V) = cos(q*, v*) = (q* · v*) / (||q*|| · ||v*||)
where Q denotes the complex text query statement, V denotes the initially input video features, q* denotes the complex text query sentence feature representation finally projected into the common space, and v* denotes the video visual feature representation finally projected into the common space. s(Q, V) denotes the cosine similarity of the query statement Q and the video V.
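A compact sketch of the common-space projection and cosine matching score of step (7-1); the common-space dimension d* and the omission of an extra nonlinearity before batch normalization are assumptions of this sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpace(nn.Module):
    """Sketch of step (7-1): project query and video representations into a shared d*-dim space."""
    def __init__(self, d_t=512, d_v=1024, d_star=1536):
        super().__init__()
        self.text_proj = nn.Sequential(nn.Linear(d_t, d_star), nn.BatchNorm1d(d_star))
        self.video_proj = nn.Sequential(nn.Linear(d_v, d_star), nn.BatchNorm1d(d_star))

    def forward(self, f_q, f_v):
        return self.text_proj(f_q), self.video_proj(f_v)    # q*, v*

model = CommonSpace()
q_star, v_star = model(torch.randn(8, 512), torch.randn(8, 1024))
sim = F.cosine_similarity(q_star, v_star, dim=-1)           # s(Q, V) for the matched pairs
```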
(7-2) The model is trained in an end-to-end manner with the defined triplet ranking loss, so that the model automatically learns the correlation between the two modalities. The specific steps are as follows:
To train the model, a triplet ranking loss is used to optimize the network, penalizing the model with a hardest-negative sampling strategy. During training, a batch of complex text query sentence and video pairs is sampled, which can be represented as:
{(Q_i, V_i)}_{i=1}^{B}
where B denotes the number of sampled complex text query sentence and video pairs. Through a margin constant, we require that for any positive sample pair (Q_i, V_i) the similarity s(Q_i, V_i) between the complex text query statement Q_i and its matched video V_i is larger than the similarity s(Q_i, V_j) of any negative sample pair (Q_i, V_j), where V_j is a video not matched with Q_i. The loss function for a batch is expressed as:
L = (1/B) · Σ_{i=1}^{B} (1/|N_h|) · Σ_{V_j ∈ N_h} max(0, margin + s(Q_i, V_j) − s(Q_i, V_i))
where the margin constant lies in (0, 1) and |N_h| denotes the number of hardest negative videos selected for a query within a batch. Penalizing the model with only the single hardest negative sample may lead to unstable training, while averaging over all negative samples leads to slow training; therefore a balancing strategy is used, averaging the loss over the top |N_h| hardest negative samples, which ensures stable and effective training.
(8) Realizing semantic tree based cross-modal retrieval from complex text query to video by using the model obtained by training in step (7).
Specifically, through the training of the model in step (7), the model has learned the mutual connection between the video and the complex text query sentence. Given a complex text query sentence, the model finds out the relevant videos of the complex text query sentence from a candidate video set and uses the relevant videos as the retrieval result, and the steps are as follows:
(8-1) The input complex text query sentence and all candidate videos are mapped into the common space through the model trained in step (7); the complex text query sentence Q is expressed as q* and the video V is expressed as v*.
(8-2) calculating cosine similarity of the complex text query sentence and all candidate videos in a public space, then sorting all the candidate videos in a descending order according to the cosine similarity, and returning the videos with the top order as a retrieval result, thereby realizing cross-modal retrieval from the complex text query sentence to the videos.
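For illustration, the retrieval of step (8-2) reduces to ranking the candidate videos by cosine similarity in the common space; the top-k value below is an arbitrary choice:

```python
import torch
import torch.nn.functional as F

def retrieve(query_vec, video_vecs, top_k=10):
    """Rank candidate videos for one query by cosine similarity in the common space (step 8-2)."""
    sims = F.cosine_similarity(query_vec.unsqueeze(0), video_vecs, dim=-1)   # (num_videos,)
    scores, indices = sims.sort(descending=True)
    return indices[:top_k], scores[:top_k]

top_idx, top_scores = retrieve(torch.randn(1536), torch.randn(1000, 1536))
```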
The above description covers only preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A semantic tree enhancement based cross-modal retrieval method for complex text query to video is characterized by comprising the following steps:
(1) extracting features of the complex text query sentence to obtain leaf node features of the complex text query sentence;
(2) encoding the leaf node features of the complex text query sentence obtained in step (1) with the semantic tree enhanced tree structure;
(3) representing the encoding of the semantic tree structure of the complex text query sentence obtained in step (2), and mining the importance of each node component forming the tree structure with an attention mechanism to obtain a representation of the complex text query sentence that can perceive the user's intention;
(4) performing feature extraction on the video frames to obtain an initial visual feature representation of the video;
(5) extracting, from the initial visual feature representation obtained in step (4), the temporal dependency of consecutive frames along the sequence direction, and extracting the semantic correlation between frames;
(6) applying an attention mechanism to the video representation obtained in step (5) to distinguish the importance degree of the information, so that useful information accounts for a larger proportion in the final video visual feature representation;
(7) respectively mapping the complex text query sentence representation and the video visual feature representation processed by the attention mechanism in steps (3) and (6) into a common space, learning the correlation degree between the two modalities with a common space learning algorithm, and training the model in an end-to-end manner;
(8) realizing semantic tree based cross-modal retrieval from complex text query to video by using the model obtained by training in step (7).
2. The semantic tree enhancement-based cross-modal search method for complex text query to video according to claim 1, wherein the method for extracting leaf node features of complex text query sentences in step (1) comprises the following sub-steps:
(1-1) encoding each word in the complex text query sentence with one-hot encoding to obtain a one-hot encoded vector sequence; multiplying the one-hot encoded vectors by a word embedding matrix to obtain a word vector sequence representation of the complex text query statement;
(1-2) modeling the word vector sequence representation using the LSTM in the RNN, converting the word vector sequence representation into leaf node features.
3. The semantic tree enhancement based cross-modal search method for complex text query to video according to claim 1, wherein the step (2) of encoding the tree structure of semantic tree enhancement on the leaf node feature of the complex text query sentence comprises the following sub-steps:
(2-1) generating a father node by using an LSTM method of a tree structure, and recursively forming a semantic tree structure in a bottom-up manner; the semantic tree consists of two types of nodes: the system comprises child nodes and parent nodes, wherein the child nodes represent words in a complex text query sentence, and the parent nodes represent combinations of word components; expressing the leaf node characteristics obtained in the step (1) as a first-layer child node of a semantic tree, and combining two adjacent child nodes in all the child nodes by using an LSTM method of a tree structure to obtain a candidate father node;
(2-2) selecting the best father node from the candidate father nodes of each layer as a next-layer node according to the memory-enhanced node scoring module, and directly copying unselected child nodes to the next layer to be used as the representation of the next-layer node; the above process is repeated recursively until only one node remains.
4. The method for cross-modal search for complex text query to video based on semantic tree enhancement as claimed in claim 3, wherein in the step (2-1), two adjacent child nodes (h_i, c_i) and (h_{i+1}, c_{i+1}) are given as input, h_i representing the hidden state of the i-th node and c_i representing the memory state of the i-th node, and the parent node is computed as:
h_p = o ⊙ tanh(c_p)
c_p = f_l ⊙ c_i + f_r ⊙ c_{i+1} + τ ⊙ g
wherein h_p represents the hidden state of the parent node, with dimension d_t×1; c_p represents the memory state of the parent node, with dimension d_t×1; ⊙ represents element-wise multiplication between features; τ, f_l, f_r, o, g are expressed as:
[τ; f_l; f_r; o; g] = [σ; σ; σ; σ; tanh]( W_p [h_i; h_{i+1}] + b_p )
wherein W_p represents a trainable transformation matrix with dimension 5d_t×2d_t; b_p represents a trainable bias vector with dimension 5d_t×1; σ represents the sigmoid nonlinear activation function, and tanh represents the tanh nonlinear transformation function;
suppose the t-th layer of the semantic tree consists of N_t nodes; the t-th layer nodes are expressed as:
H^t = {(h_1^t, c_1^t), (h_2^t, c_2^t), …, (h_{N_t}^t, c_{N_t}^t)}
if the t-th-layer nodes (h_i^t, c_i^t) and (h_{i+1}^t, c_{i+1}^t) are selected to be merged, the parent node is represented as:
(h_i^{t+1}, c_i^{t+1}) = TreeLSTM((h_i^t, c_i^t), (h_{i+1}^t, c_{i+1}^t))
wherein (h_i^t, c_i^t) represents the i-th node of the t-th layer, (h_{i+1}^t, c_{i+1}^t) represents the (i+1)-th node of the t-th layer, (h_i^{t+1}, c_i^{t+1}) represents the i-th node of the (t+1)-th layer, and TreeLSTM represents the LSTM method of the tree structure;
in the step (2-2), the best parent node is determined according to the memory-enhanced node scoring module f_score(·; Θ_score); the likelihood s_i^t that the i-th candidate parent node is selected is expressed as:
s_i^t = f_score(h_i^{t+1}, m_i^t; Θ_score)
wherein Θ_score represents the trainable parameters of the node scoring module and m_i^t is the context semantic vector; the importance degree of each node hidden state is judged by querying the memory M, and the context semantic vector m_i^t is obtained by aggregating the node hidden states in M according to their importance degrees; the memory M is represented as:
M = {h_1^1, h_2^1, …, h_N^1}
wherein h_N^1 indicates the hidden state of the N-th node of layer 1; the importance degree α_{i,j}^t of a node hidden state in the memory M is expressed as:
α_{i,j}^t = softmax( φ_r( W_m h_j^1 + b_m )^T h_i^{t+1} )
wherein α_{i,j}^t represents the importance degree of the i-th node of the t-th layer of the semantic tree with respect to the hidden state of the j-th node in the memory M; W_m represents a trainable transformation matrix with dimension d_t×d_t; b_m represents a trainable bias vector with dimension d_t×1; φ_r represents the ReLU nonlinear activation function; softmax represents the softmax nonlinear function; according to the importance degrees α_{i,j}^t, the information in the memory M is aggregated to obtain the context semantic vector m_i^t, expressed as:
m_i^t = Σ_{j=1}^{N} α_{i,j}^t · h_j^1
wherein α_{i,j}^t is the weight normalized after applying the attention mechanism;
the score s_i^t of a candidate parent node is obtained by the following formula:
s_i^t = w_s^T · φ_r( W_s [h_i^{t+1}; m_i^t] + b_s )
wherein φ_r is the ReLU nonlinear activation function; w_s represents a trainable transformation vector with dimension 2d_t×1; b_s represents a trainable bias vector with dimension 2d_t×1; W_s represents a trainable transformation matrix with dimension 2d_t×2d_t;
the candidate parent node with the largest score s_i^t is selected from the candidate parent nodes as the best parent node.
5. The method according to claim 1, wherein in the step (3), on the basis of the complex text query sentence representation based on semantic tree enhancement, a neural network is introduced to mine the importance of each node component by using an attention mechanism; the importance score β_i of node e_i is expressed as:
β_i = softmax( u_ta^T · φ_r( W_ta e_i + b_ta ) )
wherein φ_r is the ReLU nonlinear activation function; W_ta represents a trainable transformation matrix with dimension d_ta×d_t; b_ta represents a trainable bias vector with dimension d_ta×1; u_ta represents a trainable transformation vector with dimension d_ta×1; the importance scores are used as the weights of the nodes, and all node components are aggregated to obtain the user-intention-aware representation f_q of the complex text query sentence:
f_q = Σ_{i=1}^{N-1} β_i · e_i
wherein N−1 represents the number of nodes of the semantic tree structure and β_i represents the importance score of node e_i.
6. The method for cross-modal search of complex text query to video based on semantic tree enhancement as claimed in claim 1, wherein in the step (4), a pre-trained deep Convolutional Neural Network (CNN) is used to perform feature extraction on an input video frame, and a deep visual feature of each frame is extracted as an initial visual feature.
7. The method for cross-modal retrieval from complex text query to video based on semantic tree enhancement according to claim 1, wherein the step (5) of extracting the temporal dependency of consecutive frames along the sequence direction comprises: encoding the initial visual features of the video obtained in the step (4) with a gated recurrent unit (GRU), wherein at each time step the GRU takes the feature vector of the current frame and the hidden state of the previous frame as input and outputs the hidden state of the current frame; through the GRU operation, the temporal dependency between consecutive frames is effectively captured;
the extraction of the semantic relevance between frames across the whole video comprises: applying a self-attention mechanism based on scaled dot-product attention, namely projecting the representations of the video sequence frames into a plurality of attention spaces; in each attention space, the dot product between the query projection of a frame and the key projections of the remaining frames is computed, a weight over the corresponding value frames is obtained through a Softmax operation, and the weights are multiplied by the value frames; finally, the final video representation is obtained by concatenating the outputs of the attention spaces and normalizing the result.
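The two steps above can be sketched as follows. Using PyTorch's nn.MultiheadAttention (which performs the per-head scaled dot products and concatenates the heads internally) and a LayerNorm as the final normalization are assumptions about the exact realization, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """GRU for frame-to-frame temporal dependency, then multi-head self-attention
    for sequence-wide semantic relevance between frames."""
    def __init__(self, d_in, d_hidden, num_heads=8):
        super().__init__()
        # d_hidden must be divisible by num_heads
        self.gru = nn.GRU(d_in, d_hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(d_hidden, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_hidden)

    def forward(self, frames):
        # frames: (batch, m, d_in) initial per-frame CNN features
        h, _ = self.gru(frames)          # (batch, m, d_hidden) temporal hidden states
        ctx, _ = self.attn(h, h, h)      # scaled dot-product attention over all frames
        return self.norm(ctx)            # normalized final frame representations
```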
8. The method for cross-modal retrieval from complex text query to video based on semantic tree enhancement according to claim 1, wherein in the step (6), an attention neural network model with three trainable parameters is designed to distinguish the importance degree of the video frames; the importance degree η_t of the t-th frame is expressed as:
η_t = softmax(u_va^T σ(W_va v_t + b_va))
wherein σ denotes a nonlinear activation function; u_va is a trainable transformation vector with dimension d_va × 1; b_va is a trainable bias vector with dimension d_va × 1; W_va is a trainable transformation matrix with dimension d_va × d_v; v_t is the representation of the corresponding video frame with dimension d_v × 1;
the importance degree of each frame is multiplied as a weight by the representation of the corresponding video frame, and the m frames are finally accumulated to form the final video representation:
v = Σ_{t=1}^{m} η_t v_t
wherein η_t indicates the importance degree of the t-th frame, and v, with dimension d_v × 1, is the aggregated representation of all the video frame components.
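A minimal sketch of this frame-attention aggregation is shown below; the tanh nonlinearity and the module name FrameAttentionPool are assumptions for illustration, since the claim only names the three trainable parameters u_va, b_va and W_va.

```python
import torch
import torch.nn as nn

class FrameAttentionPool(nn.Module):
    """Weights each frame by a learned importance score and sums the m frames."""
    def __init__(self, d_v, d_va):
        super().__init__()
        self.W_va = nn.Linear(d_v, d_va)            # d_va x d_v transform + d_va bias (W_va, b_va)
        self.u_va = nn.Linear(d_va, 1, bias=False)  # d_va x 1 scoring vector (u_va)

    def forward(self, frames):
        # frames: (m, d_v) frame representations from the temporal encoder
        logits = self.u_va(torch.tanh(self.W_va(frames)))  # nonlinearity is an assumption
        eta = torch.softmax(logits, dim=0)                  # (m, 1) frame importance degrees
        return (eta * frames).sum(dim=0)                    # (d_v,) final video representation
```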
9. The method for cross-modal retrieval from complex text query to video based on semantic tree enhancement according to claim 1, wherein in the step (7), the steps of learning the correlation between the two modalities and training the model with the common space learning algorithm are as follows:
(7-1) the complex text query sentence representation and the video visual representation obtained through the attention mechanisms in the step (3) and the step (6) are mapped by two linear projection models into a unified common space of the same dimensionality; a nonlinear activation function is then applied to the projected features, followed by a batch normalization (BN) layer;
(7-2) the model is trained end-to-end with a defined triplet ranking loss, so that it automatically learns the correlation between the two modalities.
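A compact sketch of step (7) follows. The Tanh activation, the placement of BatchNorm1d, and the bidirectional hinge-based form of the triplet ranking loss are common choices assumed here for illustration; the claim itself does not fix these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpace(nn.Module):
    """Projects text and video features into one common space of equal dimensionality."""
    def __init__(self, d_text, d_video, d_common):
        super().__init__()
        self.text_proj = nn.Sequential(nn.Linear(d_text, d_common), nn.Tanh(), nn.BatchNorm1d(d_common))
        self.video_proj = nn.Sequential(nn.Linear(d_video, d_common), nn.Tanh(), nn.BatchNorm1d(d_common))

    def forward(self, q, v):
        return self.text_proj(q), self.video_proj(v)

def triplet_ranking_loss(q, v, margin=0.2):
    # q, v: (batch, d_common); matched text/video pairs sit on the diagonal
    sims = F.normalize(q, dim=1) @ F.normalize(v, dim=1).t()   # cosine similarity matrix
    pos = sims.diag().unsqueeze(1)
    cost_v = (margin + sims - pos).clamp(min=0)      # negative videos ranked above the true one
    cost_q = (margin + sims - pos.t()).clamp(min=0)  # negative queries ranked above the true one
    mask = torch.eye(sims.size(0), device=sims.device).bool()
    return cost_v.masked_fill(mask, 0).sum() + cost_q.masked_fill(mask, 0).sum()
```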
10. The method for cross-modal retrieval from complex text query to video based on semantic tree enhancement according to claim 1, wherein in the step (8), given a complex text query sentence, the videos related to it are found from a candidate video set and returned as the retrieval result, with the following steps:
(8-1) the input complex text query sentence and all candidate videos are mapped into the common space through the model trained in the step (7);
(8-2) the similarity between the complex text query sentence and every candidate video is computed in the common space, all candidate videos are then sorted in descending order of similarity, and the top-ranked videos are returned as the retrieval result, thereby realizing the cross-modal retrieval from complex text query sentence to video.
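The ranking step reduces to a similarity search in the common space; the cosine similarity and the hypothetical helper name retrieve below are illustrative assumptions consistent with the loss sketched above.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve(query_vec, video_vecs, topk=10):
    # query_vec:  (d,) common-space embedding of the complex text query
    # video_vecs: (num_videos, d) common-space embeddings of all candidate videos
    sims = F.normalize(video_vecs, dim=1) @ F.normalize(query_vec, dim=0)  # cosine similarities
    scores, indices = sims.topk(topk)           # descending order of similarity
    return indices.tolist(), scores.tolist()    # top-ranked videos as the retrieval result
```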
CN202010686024.2A 2020-07-16 2020-07-16 Semantic tree enhancement based cross-modal retrieval method for searching video from complex text Active CN111897913B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010686024.2A CN111897913B (en) 2020-07-16 2020-07-16 Semantic tree enhancement based cross-modal retrieval method for searching video from complex text

Publications (2)

Publication Number Publication Date
CN111897913A true CN111897913A (en) 2020-11-06
CN111897913B CN111897913B (en) 2022-06-03

Family

ID=73189400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010686024.2A Active CN111897913B (en) 2020-07-16 2020-07-16 Semantic tree enhancement based cross-modal retrieval method for searching video from complex text

Country Status (1)

Country Link
CN (1) CN111897913B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200104318A1 (en) * 2017-03-07 2020-04-02 Selerio Limited Multi-modal image search
CN110175266A (en) * 2019-05-28 2019-08-27 复旦大学 A method of it is retrieved for multistage video cross-module state
CN110490946A (en) * 2019-07-15 2019-11-22 同济大学 Text generation image method based on cross-module state similarity and generation confrontation network
CN111191075A (en) * 2019-12-31 2020-05-22 华南师范大学 Cross-modal retrieval method, system and storage medium based on dual coding and association
CN111309971A (en) * 2020-01-19 2020-06-19 浙江工商大学 Multi-level coding-based text-to-video cross-modal retrieval method

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112241468A (en) * 2020-07-23 2021-01-19 哈尔滨工业大学(深圳) Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium
CN113762007B (en) * 2020-11-12 2023-08-01 四川大学 Abnormal behavior detection method based on appearance and action feature double prediction
CN113762007A (en) * 2020-11-12 2021-12-07 四川大学 Abnormal behavior detection method based on appearance and action characteristic double prediction
CN112380385A (en) * 2020-11-18 2021-02-19 湖南大学 Video time positioning method and device based on multi-modal relational graph
CN112380385B (en) * 2020-11-18 2023-12-29 湖南大学 Video time positioning method and device based on multi-mode relation diagram
CN112883229A (en) * 2021-03-09 2021-06-01 中国科学院信息工程研究所 Video-text cross-modal retrieval method and device based on multi-feature-map attention network model
CN113111837A (en) * 2021-04-25 2021-07-13 山东省人工智能研究院 Intelligent monitoring video early warning method based on multimedia semantic analysis
CN113111836A (en) * 2021-04-25 2021-07-13 山东省人工智能研究院 Video analysis method based on cross-modal Hash learning
CN113204666B (en) * 2021-05-26 2022-04-05 杭州联汇科技股份有限公司 Method for searching matched pictures based on characters
CN113204666A (en) * 2021-05-26 2021-08-03 杭州联汇科技股份有限公司 Method for searching matched pictures based on characters
CN113590881A (en) * 2021-08-09 2021-11-02 北京达佳互联信息技术有限公司 Video clip retrieval method, and training method and device of video clip retrieval model
CN113590881B (en) * 2021-08-09 2024-03-19 北京达佳互联信息技术有限公司 Video clip retrieval method, training method and device for video clip retrieval model
CN114048350A (en) * 2021-11-08 2022-02-15 湖南大学 Text-video retrieval method based on fine-grained cross-modal alignment model
CN113934887B (en) * 2021-12-20 2022-03-15 成都考拉悠然科技有限公司 No-proposal time sequence language positioning method based on semantic decoupling
CN113934887A (en) * 2021-12-20 2022-01-14 成都考拉悠然科技有限公司 No-proposal time sequence language positioning method based on semantic decoupling
CN114429119A (en) * 2022-01-18 2022-05-03 重庆大学 Video and subtitle fragment retrieval method based on multi-cross attention
CN114429119B (en) * 2022-01-18 2024-05-28 重庆大学 Video and subtitle fragment retrieval method based on multiple cross attentions
CN114579803A (en) * 2022-03-09 2022-06-03 北方工业大学 Video retrieval model based on dynamic convolution and shortcut
CN114579803B (en) * 2022-03-09 2024-04-12 北方工业大学 Video retrieval method, device and storage medium based on dynamic convolution and shortcuts
CN114612748A (en) * 2022-03-24 2022-06-10 北京工业大学 Cross-modal video clip retrieval method based on feature decoupling
CN114612748B (en) * 2022-03-24 2024-06-07 北京工业大学 Cross-modal video segment retrieval method based on feature decoupling
CN114896450A (en) * 2022-04-15 2022-08-12 中山大学 Video time retrieval method and system based on deep learning
CN114896450B (en) * 2022-04-15 2024-05-10 中山大学 Video moment retrieval method and system based on deep learning
CN115099855A (en) * 2022-06-23 2022-09-23 广州华多网络科技有限公司 Method for preparing advertising pattern creation model and device, equipment, medium and product thereof
CN115223086A (en) * 2022-09-20 2022-10-21 之江实验室 Cross-modal action positioning method and system based on interactive attention guidance and correction
CN115223086B (en) * 2022-09-20 2022-12-06 之江实验室 Cross-modal action positioning method and system based on interactive attention guidance and correction

Also Published As

Publication number Publication date
CN111897913B (en) 2022-06-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant