CN111897913A - Semantic tree enhancement based cross-modal retrieval method for searching video from complex text - Google Patents

Semantic tree enhancement based cross-modal retrieval method for searching video from complex text

Info

Publication number
CN111897913A
Authority
CN
China
Prior art keywords
node
video
text query
complex text
representing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010686024.2A
Other languages
Chinese (zh)
Other versions
CN111897913B (en)
Inventor
董建锋
彭敬伟
杨勋
郑琪
王勋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Gongshang University
Original Assignee
Zhejiang Gongshang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Gongshang University filed Critical Zhejiang Gongshang University
Priority to CN202010686024.2A priority Critical patent/CN111897913B/en
Publication of CN111897913A publication Critical patent/CN111897913A/en
Application granted granted Critical
Publication of CN111897913B publication Critical patent/CN111897913B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Library & Information Science (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a semantic tree enhancement based cross-modal retrieval method from complex text queries to video. For a complex text query statement, each word is converted into a leaf node representation, the relations between child nodes are mined, the two child nodes with the highest dependency are combined, and the semantic tree structure of the query statement is built recursively, yielding a semantic tree enhanced query representation. For the encoding of candidate videos, preliminary video features are obtained through a CNN, and the temporal dependency and semantic dependency between video frames are captured with a GRU and a self-attention module to obtain a robust video feature representation. The complex text query representation and the video feature representation are mapped into a common space, and the matching relationship between them is learned automatically, thereby realizing cross-modal retrieval from complex text queries to videos. The method of the invention can not only interpret the informative components in a complex text query sentence and better understand the user intention, but can also improve retrieval performance to a great extent.

Description

Semantic tree enhancement based cross-modal retrieval method for searching video from complex text
Technical Field
The invention relates to the field of cross-modal retrieval from text query to video, in particular to a cross-modal retrieval method from complex text query to video based on semantic tree enhancement.
Background
With the exponential growth of user-generated videos on the Internet, uploading videos and searching for videos of interest have become indispensable activities in people's daily life. Cross-modal retrieval from text query to video is one of the techniques for finding videos of interest. Early text-to-video cross-modal retrieval methods were based on text keywords and were extensively studied and developed, but such methods only allow the user to enter a few keywords as the query. As people's demands on Internet video search capability continue to grow, keyword-based queries can hardly express the user's search intention fully, which affects the search experience. In response to this problem, video retrieval supporting complex text queries has been developed. Therefore, how to understand the richer semantics conveyed by complex text queries and to capture the user's intent has become one of the difficult challenges in the field of cross-modal retrieval.
Existing cross-modal retrieval methods from text queries to videos generally fall into two categories. The first category consists of concept-based methods that use a large number of visual concepts to describe the video content while converting the text query into a set of basic visual concepts, so that the text query is represented with visual concepts; cross-modal retrieval is finally realized through concept matching between the different modalities (text and video). However, such methods have the following disadvantages. First, they are generally not very effective for complex text queries, because the semantic content of a complex text query is often difficult to describe adequately with a few visual concepts, which leads to information loss, and the semantics of a complex text query is not just an aggregation of extracted concepts. Second, how to effectively train the concept classifiers and select the relevant concepts is itself a very challenging problem. The second category learns a joint embedding space of text queries and videos to support video retrieval: the text query is converted into a word vector representation, the video is represented as a temporally aggregated feature, and the two are mapped into a common space so that similar text queries and videos are close in the common space while dissimilar ones are far apart. Although this direction handles longer text queries better than concept-based approaches, such methods also have disadvantages: first, representing the user's text query merely with word vectors cannot effectively capture the user's intention, so the video retrieval effect for complex text queries is poor; second, such methods lack interpretability of the retrieval process.
Disclosure of Invention
To address the defects of the prior art, the invention adopts a modeling method oriented to retrieval of videos from complex text queries and provides a cross-modal retrieval method based on semantic tree enhancement. The complex text query is first encoded with a tree structure, and representation learning is performed on both the complex text query and the video; the encoded features are mapped into a common space where their similarity is computed, thereby achieving cross-modal retrieval from complex text queries to videos.
The purpose of the invention is realized by the following technical scheme: a semantic tree enhancement based cross-modal retrieval method for complex text query to video comprises the following steps:
(1) extracting features of the complex text query sentence to obtain leaf node features of the complex text query sentence;
(2) encoding a tree structure of semantic tree enhancement on leaf node characteristics of the complex text query sentence obtained in the step (1);
(3) expressing the codes of the semantic tree structures of the complex text query sentences obtained in the step (2), and mining the importance of each node component forming the tree structures by using an attention mechanism to obtain the expression of the complex text query sentences capable of perceiving the intentions of the user;
(4) performing feature extraction on the video frame to obtain initial visual feature representation of the video;
(5) extracting the time dependence of continuous frames along the sequence direction from the initial visual feature representation obtained in the step (4), and extracting the semantic correlation between the frames;
(6) applying an attention mechanism to the video representation obtained in the step (5), and distinguishing the importance degree of the information to enable the useful information to occupy a larger proportion in the final video visual feature representation;
(7) respectively mapping the complex text query sentence representation and the video visual feature representation processed by the attention mechanism in the steps (3) and (6) into a common space, learning the correlation degree between the two modes by using a common space learning algorithm, and training a model in an end-to-end mode;
(8) realizing semantic tree based cross-modal retrieval from complex text query to video by using the model obtained by training in step (7).
Further, the method for extracting leaf node features of the complex text query statement in step (1) comprises the following substeps:
(1-1) encoding each word in the complex text query sentence with one-hot encoding to obtain a one-hot encoded vector sequence; multiplying the one-hot encoded vectors by a word embedding matrix to obtain a word vector sequence representation of the complex text query statement;
(1-2) modeling the word vector sequence representation using the LSTM in the RNN, converting the word vector representation into leaf node features.
Further, the encoding of the tree structure with semantic tree enhancement on the leaf node feature of the complex text query statement in the step (2) includes the following sub-steps:
(2-1) generating a father node by using an LSTM method of a tree structure, and recursively forming a semantic tree structure from bottom to top; the semantic tree consists of two types of nodes: the system comprises child nodes and parent nodes, wherein the child nodes represent words in a complex text query sentence, and the parent nodes represent combinations of word components; expressing the leaf node characteristics obtained in the step (1) as first-layer nodes of a semantic tree, and combining two adjacent child nodes in all the child nodes by using an LSTM method of a tree structure to obtain candidate father nodes;
(2-2) selecting the best father node from the candidate father nodes of each layer as a next-layer node according to the memory-enhanced node scoring module, and directly copying unselected child nodes to the next layer to be used as the representation of the next-layer node; the above process is repeated recursively until only one node remains.
Further, in the step (2-1), two adjacent child nodes (h_i, c_i) and (h_{i+1}, c_{i+1}) are given as input, where h_i represents the hidden state of the i-th node and c_i represents the memory state of the i-th node; the parent node is computed as:
h_p = o ⊙ tanh(c_p)
c_p = f_l ⊙ c_i + f_r ⊙ c_{i+1} + τ ⊙ g
where h_p represents the hidden state of the parent node, with dimension d_t×1; c_p represents the memory state of the parent node, with dimension d_t×1; ⊙ represents element-wise multiplication between features; the gates τ, f_l, f_r, o, g are expressed as:
[τ; f_l; f_r; o; g] = [σ; σ; σ; σ; tanh]( W_p [h_i; h_{i+1}] + b_p )
where W_p represents a trainable transformation matrix with dimension 5d_t×2d_t; b_p represents a trainable bias vector with dimension 5d_t×1; σ represents the sigmoid nonlinear activation function, and tanh represents the tanh nonlinear transformation function;
suppose the t-th layer of the semantic tree consists of N_t nodes; the nodes of the t-th layer are expressed as:
H^t = {(h_1^t, c_1^t), (h_2^t, c_2^t), …, (h_{N_t}^t, c_{N_t}^t)}
if the adjacent t-th-layer nodes (h_i^t, c_i^t) and (h_{i+1}^t, c_{i+1}^t) are selected to be merged, the parent node can be represented as:
(h_i^{t+1}, c_i^{t+1}) = TreeLSTM((h_i^t, c_i^t), (h_{i+1}^t, c_{i+1}^t))
where (h_i^t, c_i^t) represents the i-th node of the t-th layer, (h_{i+1}^t, c_{i+1}^t) represents the (i+1)-th node of the t-th layer, (h_i^{t+1}, c_i^{t+1}) represents the i-th node of the (t+1)-th layer, and TreeLSTM represents the LSTM method of the tree structure;
in the step (2-2), the best parent node is determined according to the memory-enhanced node scoring module f_score(·; Θ_score); the likelihood s_i^t that the i-th candidate parent node is selected is expressed as:
s_i^t = f_score(h_i^{t+1}, m_i^t; Θ_score)
where Θ_score represents the trainable parameters of the node scoring module and m_i^t is the context semantic vector; the importance degree of each node hidden state is judged by querying the memory M, and the context semantic vector m_i^t is obtained by aggregating the node hidden states in M according to their importance degrees; the memory M is represented as:
M = {h_1^1, h_2^1, …, h_N^1}
where h_N^1 indicates the hidden state of the N-th node of layer 1; the importance degree α_{i,j}^t of a node hidden state in the memory M is expressed as:
α_{i,j}^t = softmax( φ_r( W_m h_j^1 + b_m )^T h_i^{t+1} )
where α_{i,j}^t represents the importance degree of the i-th node of the t-th layer of the semantic tree with respect to the hidden state of the j-th node in the memory M; W_m represents a trainable transformation matrix with dimension d_t×d_t; b_m represents a trainable bias vector with dimension d_t×1; φ_r represents the ReLU nonlinear activation function; softmax represents the softmax nonlinear function; according to the importance degrees α_{i,j}^t, the information in the memory M is aggregated to obtain the context semantic vector m_i^t, expressed as:
m_i^t = Σ_{j=1}^{N} α_{i,j}^t · h_j^1
where α_{i,j}^t is the weight normalized after applying the attention mechanism;
the score s_i^t of a candidate parent node is obtained by the following formula:
s_i^t = w_s^T · φ_r( W_s [h_i^{t+1}; m_i^t] + b_s )
where φ_r is the ReLU nonlinear activation function; w_s represents a trainable transformation vector with dimension 2d_t×1; b_s represents a trainable bias vector with dimension 2d_t×1; W_s represents a trainable transformation matrix with dimension 2d_t×2d_t;
the candidate parent node with the largest score s_i^t is selected from the candidate parent nodes as the best parent node.
Further, in the step (3), on the basis of the complex text query statement representation based on semantic tree enhancement, an attention neural network is introduced to mine the importance of each node component; the importance score β_i of node e_i is expressed as:
β_i = softmax( u_ta^T · φ_r( W_ta e_i + b_ta ) )
where φ_r is the ReLU nonlinear activation function; W_ta represents a trainable transformation matrix with dimension d_ta×d_t; b_ta represents a trainable bias vector with dimension d_ta×1; u_ta represents a trainable transformation vector with dimension d_ta×1; the importance scores are used as the weights of the nodes, and all node components are aggregated to obtain the user-intention-aware representation f_q of the complex text query sentence:
f_q = Σ_{i=1}^{N-1} β_i · e_i
where N−1 represents the number of nodes of the semantic tree structure and β_i represents the importance score of node e_i.
Further, in the step (4), feature extraction is performed on the input video frame by using a pre-trained deep Convolutional Neural Network (CNN), and a deep visual feature of each frame is extracted as an initial visual feature.
Further, in the step (5), extracting the temporal dependency of the consecutive frames along the sequence direction includes: encoding the initial visual features of the video obtained in step (4) with a GRU (Gated Recurrent Unit), where at each time step the GRU takes the feature vector of the current frame and the hidden state of the previous frame as input and outputs the hidden state of the current frame; through the GRU operation, the temporal dependency between consecutive frames is effectively captured;
the extraction of semantic relevance between frames in the whole video includes: through a self-attention mechanism, first performing scaled dot-product attention, namely projecting the representations of the video sequence frames into a plurality of attention spaces, dot-multiplying the query frame projected from each frame with the other key frames, obtaining a weight on the current value frame through a Softmax operation, and multiplying the obtained weight by the value frame; the final video representation is obtained from the outputs of the multiple attention spaces through a concatenation operation and normalization.
Further, in the step (6), an attention neural network model with three parameters is designed to distinguish the importance degree of the video frames; the importance degree η_t of the t-th frame is expressed as:
η_t = softmax( u_va^T · ( W_va ṽ_t + b_va ) )
where u_va is a trainable transformation vector with dimension d_va×1, b_va is a trainable bias vector with dimension d_va×1, W_va is a trainable transformation matrix with dimension d_va×d_v, and ṽ_t is the representation of the corresponding video frame with dimension d_v×1;
the importance degree of each frame is used as a weight and multiplied by the representation of the corresponding video frame, and the frames are finally accumulated to form the final video representation:
f_v = Σ_{t=1}^{m} η_t · ṽ_t
where η_t denotes the importance degree of the t-th frame, and f_v, with dimension d_v×1, is the aggregate representation of all components of the video frames.
Further, in the step (7), the step of learning the correlation between the two modalities and training the model by using a common space learning algorithm is as follows:
(7-1) mapping the complex text query statement and the video visual feature representation obtained in the step (3) and the step (6) through an attention mechanism to a uniform public space through two linear projection models for expression; in order to obtain the same dimensionality, applying a nonlinear activation function to the obtained features, and then applying a Batch Normalization (BN) layer for processing;
(7-2) training the model in an end-to-end manner through a defined triplet ranking loss, so that the model automatically learns the correlation between the two modalities.
Further, in the step (8), given a complex text query sentence, finding out a video related to the complex text query sentence from a candidate video set, and using the video as a retrieval result, the steps are as follows:
(8-1) mapping the input complex text query sentence and all candidate videos to a public space through the model trained in the step (7);
(8-2) calculating the similarity of the complex text query sentence and all candidate videos in a public space, then sorting all the candidate videos in a descending order according to the similarity, and returning the videos with the top order as a retrieval result, thereby realizing the cross-modal retrieval from the complex text query sentence to the videos.
The invention has the following beneficial effects: the invention provides a novel cross-modal retrieval framework from complex text queries to video, which can automatically form a flexible tree structure to model complex text query sentences, and designs a memory-enhanced node scoring module to mine the linguistic context of the tree structure of the complex text query sentence. An attention mechanism is introduced into both the complex text query sentence representation and the video visual feature representation to deeply mine the combination of node components in the complex text query and the importance degree of each video frame. The invention can interpret the informative components in a complex text query sentence, better understand the user intention, and improve retrieval performance to a great extent.
Drawings
FIG. 1 is a schematic diagram of an implementation of a semantic tree enhancement-based cross-modal search method for complex text query to video;
FIG. 2 is an example of a complex text query to video search of the present invention.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings and specific embodiments.
In order to solve the problem of cross-modal retrieval from complex text query to video, the invention provides a semantic tree enhancement-based cross-modal retrieval method from complex text query to video, which comprises the following specific steps:
(1) Extracting features of the complex text query sentence with a feature extraction method to obtain the leaf node features of the complex text query sentence.
(1-1) Given a complex text query sentence Q of length N, it can be represented as:
Q = {w_1, w_2, …, w_N}
where w_1 represents the first word in the complex text query statement. Each word in the complex text query statement is first encoded with one-hot encoding to obtain a sequence of one-hot encoded vectors {w'_1, w'_2, …, w'_N}, where w'_t represents the one-hot encoded vector of the t-th word. The one-hot encoded vectors are multiplied by a word embedding matrix to obtain the word vector sequence representation {q_1, q_2, …, q_N} of the complex text query statement Q.
(1-2) The LSTM (long short-term memory network), a type of RNN (recurrent neural network), is used as the basic sequence modeling module. To maintain structural consistency, the N word vectors in the word vector sequence of step (1-1) are converted into N leaf nodes with the LSTM. At the i-th time step, the i-th word vector q_i in the word vector sequence {q_1, q_2, …, q_N} is converted into a leaf node by the LSTM unit, and the i-th leaf node is represented as:
(h_i, c_i) = LSTM(q_i, h_{i-1}, c_{i-1})
where h_{i-1} represents the hidden state of the (i-1)-th node, c_{i-1} represents the memory state of the (i-1)-th node, and (h_i, c_i) represents the i-th leaf node feature into which the i-th word vector is converted.
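For illustration, steps (1-1) and (1-2) can be sketched in PyTorch roughly as follows; the vocabulary size, embedding dimension and hidden dimension d_t are illustrative assumptions and not values prescribed by the invention:

```python
import torch
import torch.nn as nn

class LeafEncoder(nn.Module):
    """Sketch of step (1): one-hot words -> word embeddings -> LSTM leaf nodes (h_i, c_i)."""
    def __init__(self, vocab_size=10000, embed_dim=300, d_t=512):
        super().__init__()
        # An embedding lookup is equivalent to multiplying a one-hot vector by the embedding matrix.
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.cell = nn.LSTMCell(embed_dim, d_t)

    def forward(self, word_ids):                      # word_ids: (batch, N) token indices
        q = self.embedding(word_ids)                  # (batch, N, embed_dim) word vector sequence
        batch, n, _ = q.shape
        h = q.new_zeros(batch, self.cell.hidden_size)
        c = q.new_zeros(batch, self.cell.hidden_size)
        leaves = []
        for i in range(n):                            # (h_i, c_i) = LSTM(q_i, h_{i-1}, c_{i-1})
            h, c = self.cell(q[:, i], (h, c))
            leaves.append((h, c))
        return leaves                                 # N leaf-node (hidden, memory) pairs

encoder = LeafEncoder()
leaves = encoder(torch.randint(0, 10000, (2, 6)))     # a batch of two 6-word queries
print(len(leaves), leaves[0][0].shape)                # 6 torch.Size([2, 512])
```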
(2) Encoding the semantic tree enhanced tree structure on the leaf node features of the complex text query sentence obtained in step (1). To better understand the complex text query statement, tree-structured LSTM (TreeLSTM) modeling is carried out on the leaf node features obtained in step (1), where the TreeLSTM method is used to generate a parent node. Given two adjacent child nodes (h_i, c_i) and (h_{i+1}, c_{i+1}) as input, where h_i represents the hidden state of the i-th node and c_i represents the memory state of the i-th node, the parent node is computed as:
h_p = o ⊙ tanh(c_p)
c_p = f_l ⊙ c_i + f_r ⊙ c_{i+1} + τ ⊙ g
where h_p represents the hidden state of the parent node, with dimension d_t×1; c_p represents the memory state of the parent node, with dimension d_t×1; ⊙ represents element-wise multiplication between features; the gates τ, f_l, f_r, o and g are obtained from h_i and h_{i+1} after the sigmoid and tanh functions, and are expressed as:
[τ; f_l; f_r; o; g] = [σ; σ; σ; σ; tanh]( W_p [h_i; h_{i+1}] + b_p )
where W_p represents a trainable transformation matrix with dimension 5d_t×2d_t; b_p represents a trainable bias vector with dimension 5d_t×1; σ represents the sigmoid nonlinear activation function, and tanh represents the tanh nonlinear transformation function. Parent nodes are generated with the TreeLSTM, and the semantic tree structure is formed recursively in a bottom-up manner; the semantic tree consists of two types of nodes: child nodes, which represent the words in the complex text query sentence, and parent nodes, which represent combinations of word components and can describe more complex semantic information than the child nodes.
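The parent-node composition above can be sketched as a binary TreeLSTM cell; this is an illustrative PyTorch implementation that follows the 5d_t×2d_t parameterization of W_p, with the batch size and d_t chosen arbitrarily:

```python
import torch
import torch.nn as nn

class BinaryTreeLSTM(nn.Module):
    """Sketch of the TreeLSTM composition: two adjacent child nodes -> one candidate parent node."""
    def __init__(self, d_t=512):
        super().__init__()
        self.proj = nn.Linear(2 * d_t, 5 * d_t)       # plays the role of W_p (5d_t x 2d_t) and b_p

    def forward(self, left, right):
        h_i, c_i = left                               # each (batch, d_t)
        h_j, c_j = right
        gates = self.proj(torch.cat([h_i, h_j], dim=-1))
        tau, f_l, f_r, o, g = gates.chunk(5, dim=-1)
        tau, f_l, f_r, o = map(torch.sigmoid, (tau, f_l, f_r, o))
        g = torch.tanh(g)
        c_p = f_l * c_i + f_r * c_j + tau * g         # memory state of the parent node
        h_p = o * torch.tanh(c_p)                     # hidden state of the parent node
        return h_p, c_p

tree_cell = BinaryTreeLSTM()
left = (torch.randn(2, 512), torch.randn(2, 512))
right = (torch.randn(2, 512), torch.randn(2, 512))
h_p, c_p = tree_cell(left, right)                     # candidate parent for two adjacent children
```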
(2-1) The leaf node feature sequence obtained in step (1) is taken as the first-layer nodes of the semantic tree. Suppose the t-th layer of the semantic tree consists of N_t nodes; the t-th layer nodes are expressed as:
H^t = {(h_1^t, c_1^t), (h_2^t, c_2^t), …, (h_{N_t}^t, c_{N_t}^t)}
If we choose to merge the adjacent t-th-layer nodes (h_i^t, c_i^t) and (h_{i+1}^t, c_{i+1}^t), the parent node is computed with the TreeLSTM and can be represented as:
(h_i^{t+1}, c_i^{t+1}) = TreeLSTM((h_i^t, c_i^t), (h_{i+1}^t, c_{i+1}^t))
where (h_i^t, c_i^t) represents the i-th node of the t-th layer, (h_{i+1}^t, c_{i+1}^t) represents the (i+1)-th node of the t-th layer, and (h_i^{t+1}, c_i^{t+1}) represents the i-th node of the (t+1)-th layer. Every two adjacent child nodes among all the child nodes are combined with the tree-structured LSTM (TreeLSTM) method to obtain the candidate parent nodes.
(2-2) The key step in building the semantic tree structure is how to accurately select the best parent node from the candidate parent nodes at each layer, which requires a node scoring module. For the node scoring module, when the given query is a complex text query, it is difficult to determine the best parent node effectively because of the ambiguity of language and the limited ability of a node's hidden state to remember historical inputs. Therefore, a memory-enhanced node scoring module f_score(·; Θ_score) is specially designed for complex text query sentences to determine the best parent node. The likelihood s_i^t that the i-th candidate parent node is selected can be expressed as:
s_i^t = f_score(h_i^{t+1}, m_i^t; Θ_score)
where Θ_score represents the trainable parameters of the node scoring module and m_i^t is the context semantic vector. To obtain m_i^t, the importance degree of each node hidden state is judged by querying a memory M, and the context semantic vector m_i^t is obtained by aggregating the node hidden states in M according to their importance degrees. The memory M can be represented as:
M = {h_1^1, h_2^1, …, h_N^1}
where h_1^1 represents the hidden state of the 1st node of layer 1, h_2^1 represents the hidden state of the 2nd node of layer 1, and h_N^1 represents the hidden state of the N-th node of layer 1. The importance degree α_{i,j}^t of a node hidden state in the memory M can be expressed as:
α_{i,j}^t = softmax( φ_r( W_m h_j^1 + b_m )^T h_i^{t+1} )
where α_{i,j}^t represents the importance degree of the i-th node of the t-th layer of the semantic tree with respect to the hidden state of the j-th node in the memory M; W_m represents a trainable transformation matrix with dimension d_t×d_t; b_m represents a trainable bias vector with dimension d_t×1; φ_r represents the ReLU nonlinear activation function; and softmax represents the softmax nonlinear function. According to the importance degrees α_{i,j}^t, the information in the memory M is aggregated to obtain the context semantic vector m_i^t, which can be expressed as:
m_i^t = Σ_{j=1}^{N} α_{i,j}^t · h_j^1
where α_{i,j}^t is the weight normalized after applying the attention mechanism. After obtaining the context semantic vector m_i^t, the score s_i^t of the candidate parent node is obtained by the following formula:
s_i^t = w_s^T · φ_r( W_s [h_i^{t+1}; m_i^t] + b_s )
where φ_r is the ReLU nonlinear activation function; w_s represents a trainable transformation vector with dimension 2d_t×1; b_s represents a trainable bias vector with dimension 2d_t×1; and W_s represents a trainable transformation matrix with dimension 2d_t×2d_t.
The memory-enhanced node scoring module fuses the contextual semantic information, injecting semantic context into each choice so as to better select parent nodes. In this recursive process, every two adjacent child nodes among all the child nodes are combined to obtain candidate parent nodes, and the candidate parent node with the largest score s_i^t is selected as the next-layer node. Only the representation of the selected node is updated, and the unselected child nodes are copied directly to the next layer as representations of next-layer nodes. The above process is repeated recursively until only one node remains. Through this process a semantic tree structure is composed, and its encoding can be expressed as:
{e_1, e_2, …, e_{N-1}} = LSTree({q_1, q_2, …, q_N})
where LSTree represents the overall construction process of the semantic tree and e_i ∈ R^{d_t} represents the representation of the i-th node. The coded representation of the semantic tree structure automatically extracts semantic components that may match the search intention of the user, and can better understand complex text query sentences without any syntactic annotations.
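The memory-enhanced node scoring module can be sketched as below; the module name and the exact form of f_score follow the reconstruction given above (attention over the leaf-node memory M, then a scored fusion of the candidate node and the context vector), so they should be read as an assumption rather than the definitive implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MemoryScorer(nn.Module):
    """Sketch of f_score: builds the context vector m from the memory M and scores a candidate."""
    def __init__(self, d_t=512):
        super().__init__()
        self.key = nn.Linear(d_t, d_t)                    # W_m, b_m
        self.fuse = nn.Linear(2 * d_t, 2 * d_t)           # W_s, b_s
        self.w_s = nn.Linear(2 * d_t, 1, bias=False)      # w_s

    def forward(self, h_candidate, memory):
        # h_candidate: (batch, d_t) hidden state of one candidate parent node
        # memory:      (batch, N, d_t) hidden states of the first-layer (leaf) nodes
        keys = F.relu(self.key(memory))                          # (batch, N, d_t)
        alpha = F.softmax(torch.einsum('bnd,bd->bn', keys, h_candidate), dim=-1)
        context = torch.einsum('bn,bnd->bd', alpha, memory)      # context semantic vector m
        fused = torch.cat([h_candidate, context], dim=-1)        # (batch, 2*d_t)
        return self.w_s(F.relu(self.fuse(fused))).squeeze(-1)    # scalar score per candidate

# Usage: score three candidate parent nodes and keep the best one.
scorer = MemoryScorer()
memory = torch.randn(2, 6, 512)
candidates = [torch.randn(2, 512) for _ in range(3)]
scores = torch.stack([scorer(h, memory) for h in candidates], dim=-1)  # (batch, 3)
best = scores.argmax(dim=-1)                                           # index of the best parent
```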
(3) The encoding of the semantic tree structure of the complex text query sentence obtained in step (2) is represented, and the importance of each node component forming the tree structure is mined with an attention mechanism to obtain a representation of the complex text query that can perceive the user's intention.
A complex text query sentence usually consists of several referents and their associated descriptions across videos, and some concepts or descriptions in the complex text query sentence may not be clearly represented in the video or may span only a short time. Therefore, on the basis of the semantic tree enhanced representation of the complex text query sentence, an attention network is introduced to mine the importance of each node component: by scoring the importance of the node components of the semantic tree enhanced complex text query sentence, the more important node components can be perceived, and the scores are used as weights to aggregate the nodes of the semantic tree enhanced complex text query sentence, so that a representation of the complex text query sentence that effectively perceives the user's intention is obtained. The concrete implementation is as follows:
Using the attention mechanism, a neural network is introduced to learn the importance of each node component. The importance score β_i of node e_i can be expressed as:
β_i = softmax( u_ta^T · φ_r( W_ta e_i + b_ta ) )
where φ_r is the ReLU nonlinear activation function; W_ta represents a trainable transformation matrix with dimension d_ta×d_t; b_ta represents a trainable bias vector with dimension d_ta×1; and u_ta represents a trainable transformation vector with dimension d_ta×1. The importance scores are used as the weights of the nodes, and all node components are aggregated to obtain the user-intention-aware representation f_q of the complex text query sentence:
f_q = Σ_{i=1}^{N-1} β_i · e_i
where N−1 represents the number of nodes of the semantic tree structure and β_i represents the importance score of node e_i.
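A minimal sketch of this attention-based aggregation over the tree nodes is given below; the dimensions are illustrative, and the same weighted-pooling pattern (with parameters W_va, b_va, u_va) is reused for the video frames in step (6):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NodeAttentionPool(nn.Module):
    """Sketch of step (3): score every tree node e_i, then aggregate into the query vector f_q."""
    def __init__(self, d_t=512, d_ta=256):
        super().__init__()
        self.proj = nn.Linear(d_t, d_ta)              # W_ta, b_ta
        self.u = nn.Linear(d_ta, 1, bias=False)       # u_ta

    def forward(self, nodes):                         # nodes: (batch, N-1, d_t)
        beta = F.softmax(self.u(F.relu(self.proj(nodes))).squeeze(-1), dim=-1)  # importance scores
        return torch.einsum('bn,bnd->bd', beta, nodes)                          # f_q: (batch, d_t)

pool = NodeAttentionPool()
f_q = pool(torch.randn(2, 5, 512))                    # aggregate 5 node representations per query
```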
(4) Extracting video features with a feature extraction method to obtain the initial visual feature representation of the video.
Specifically, a pre-trained deep Convolutional Neural Network (CNN) may be used to extract features from the input video frames: for a given video, video frames are extracted uniformly every 0.5 seconds; assuming there are m extracted video frames, the video is described by a series of feature vectors {v_1, v_2, …, v_m}. The deep visual features of each frame are extracted with a deep CNN model, such as a ResNet model trained on the ImageNet dataset. The video can then be represented as:
V = {v_1, v_2, …, v_m}
where v_t represents the feature vector of the extracted t-th frame. The initial visual features of the video frames are obtained through the above feature extraction, but these are only simple initial visual features extracted by a CNN model and the content information they contain is relatively coarse; the features are therefore further encoded to obtain a more refined representation.
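For illustration, the initial frame features could be obtained roughly as follows with a torchvision ResNet pre-trained on ImageNet, as mentioned above; the concrete backbone, preprocessing and frame-sampling code are assumptions of this sketch and are not prescribed by the invention:

```python
import torch
import torch.nn as nn
import torchvision.models as models
import torchvision.transforms as T

# ResNet-152 with its classifier removed, so the forward pass returns the pooled 2048-d feature.
weights = models.ResNet152_Weights.IMAGENET1K_V1
resnet = models.resnet152(weights=weights)
resnet.fc = nn.Identity()
resnet.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def encode_frames(frames):
    """frames: list of m PIL images sampled from the video every 0.5 seconds."""
    batch = torch.stack([preprocess(f) for f in frames])   # (m, 3, 224, 224)
    with torch.no_grad():
        return resnet(batch)                               # (m, 2048) initial features v_1..v_m
```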
(5) The initial visual feature representation obtained in step (4) is further mined for its temporal and semantic dependencies: first, the temporal dependency of consecutive frames is extracted along the sequence direction; second, the semantic correlation between frames is extracted.
(5-1) Extracting the temporal dependency of consecutive frames along the sequence direction. Since a video is composed of a sequence of images with a definite order, i.e., the video is temporal, acquiring the temporal information of the video is also important. To extract the temporal dependencies of consecutive frames along the sequence direction, a GRU (Gated Recurrent Unit) is used to encode the initial visual features of the video obtained in step (4) and model the temporal dependencies between consecutive frames. At each time step, the GRU takes the feature vector of the current frame and the hidden state of the previous frame as input and outputs the hidden state of the current frame. The hidden state of the t-th frame is represented as:
h'_t = GRU(v_t, h'_{t-1})
where v_t represents the t-th frame feature vector extracted by the CNN network and h'_{t-1} represents the hidden state of the (t-1)-th frame. Through the above operation, the dependency relationship between consecutive frames is effectively captured. The GRU-processed video sequence V' can be represented as:
V' = {h'_1, h'_2, …, h'_m}
where h'_t represents the hidden state of the t-th frame and m represents the number of video frames uniformly extracted from the video.
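A minimal sketch of this GRU encoding of step (5-1), with illustrative feature dimensions:

```python
import torch
import torch.nn as nn

d_in, d_v = 2048, 1024                       # illustrative: CNN feature size and GRU hidden size
gru = nn.GRU(d_in, d_v, batch_first=True)

frame_feats = torch.randn(1, 30, d_in)       # (batch, m, d_in): the m frame vectors v_1..v_m
hidden_seq, _ = gru(frame_feats)             # (batch, m, d_v): h'_t = GRU(v_t, h'_{t-1})
```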
(5-2) Extracting the semantic correlation between frames in the whole video.
To enhance the representation of the video sequence features, the semantic correlation between frames in the whole video is exploited on the basis of the video representation of step (5-1): through a self-attention mechanism, the representations of the video sequence frames are projected into a plurality of attention spaces, the query frame projected from each frame is dot-multiplied with the other key frames, a weight on the current value frame is obtained through a Softmax operation, and the obtained weight is multiplied by the value frame; the outputs of the multiple attention spaces are finally aggregated to obtain the final video representation. The specific implementation process is as follows:
The semantic correlation between video frames is exploited by first performing scaled dot-product attention through the self-attention mechanism, i.e., projecting the representation of the video sequence frames into multiple attention spaces. The query frame projected from each frame is dot-multiplied with the other key frames, and the weight on the current value frame is obtained through the Softmax operation. The weights on the value frames in the i-th attention space are expressed as:
A^i = softmax( (W_q^i V')^T (W_k^i V') / √d_i )
where W_q^i is a trainable transformation matrix with dimension d_i×d_v, W_k^i is a trainable transformation matrix with dimension d_i×d_v, and W_v^i is a trainable transformation matrix with dimension d_i×d_v; through these three parameters, the initial input V' is projected into the query, key and value spaces of the i-th attention space, whose per-frame dimension is set to d_i×1. The obtained weights are multiplied by the value frames, giving the output of the i-th attention space head_i = (W_v^i V') · A^i. Finally, the outputs of the multiple attention spaces are concatenated and normalized to obtain the final video representation:
Ṽ = Norm( W_p · Concat(head_1, head_2, …, head_z) )
where Concat(·) represents the concatenation operation; head_1 is the output of the 1st attention space, head_2 is the output of the 2nd attention space, and head_z is the output of the z-th attention space; W_p is a trainable transformation matrix with dimension d_v×d_v that projects the concatenated features back into the original space; and Norm(·) represents the layer normalization operation. Ṽ is the video sequence representation enhanced by the self-attention model and is expressed as:
Ṽ = {ṽ_1, ṽ_2, …, ṽ_m}
where m represents the number of video frames uniformly extracted from the video, and ṽ_t represents the representation of the t-th frame after enhancement by the self-attention model described above. The video representation enhanced by the self-attention model can effectively capture both the temporal dependency between consecutive frames and the semantic correlation between frames.
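For illustration, the multi-head self-attention over the GRU outputs can be sketched with PyTorch's built-in module, which internally performs the query/key/value projections, the scaled dot products, the concatenation of the z heads and an output projection playing the role of W_p; whether a residual connection is added before the layer normalization is an implementation choice not specified here:

```python
import torch
import torch.nn as nn

d_v, z = 1024, 8                                        # illustrative hidden size and number of heads
self_attn = nn.MultiheadAttention(embed_dim=d_v, num_heads=z, batch_first=True)
norm = nn.LayerNorm(d_v)

hidden_seq = torch.randn(1, 30, d_v)                    # V' = {h'_1, ..., h'_m} from the GRU
attended, _ = self_attn(hidden_seq, hidden_seq, hidden_seq)  # scaled dot-product attention, z heads
enhanced = norm(attended)                               # V~ = Norm(W_p * Concat(head_1..head_z))
```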
(6) The attention mechanism is applied to the video representation obtained in step (5) to distinguish the importance of the information, so that useful information accounts for a larger proportion of the final video representation.
Specifically, an attention neural network model with three parameters is designed to distinguish the importance degree of the video frames; the importance degree η_t of the t-th frame can be expressed as:
η_t = softmax( u_va^T · ( W_va ṽ_t + b_va ) )
where u_va is a trainable transformation vector with dimension d_va×1, b_va is a trainable bias vector with dimension d_va×1, and W_va is a trainable transformation matrix with dimension d_va×d_v. The importance degree of each frame is used as a weight and multiplied by the representation of the corresponding video frame, and the m frames are finally accumulated as the final video representation:
f_v = Σ_{t=1}^{m} η_t · ṽ_t
where η_t denotes the importance degree of the t-th frame, and f_v, with dimension d_v×1, is the aggregate representation of all components of the video frames.
(7) Respectively mapping the complex text query sentence representation and the video visual feature representation processed by the attention mechanism in the step (3) and the step (6) into a common space, learning the correlation degree between the two modes by using a common space learning algorithm, and finally training the model in an end-to-end mode. Specifically, the method of learning the correlation between two modalities and training the model using a common space learning algorithm is as follows:
and (7-1) mapping the complex text query sentence obtained in the step (3) and the step (6) through the attention mechanism and the video visual feature representation to a uniform public space through two linear projection models for expression. To get the same dimensions, we apply a non-linear activation function to the resulting features, followed by a Batch Normalization (BN) layer de-processing. The specific implementation process is as follows:
the final complex text query statement representation is obtained through the step (3) and the step (6)
Figure BDA0002587592230000114
And video visual feature representation
Figure BDA0002587592230000115
Figure BDA0002587592230000116
Is d in the dimension oft*1,
Figure BDA0002587592230000117
Is d in the dimension ofv*1. We pass through two linesThe sexual projection model projects the complex text query sentence representation and the video visual feature representation into a joint embedding space.
The projected complex text query statement is represented as:
q* = BN( W_1 f_q + b_1 )
where W_1 is a trainable transformation matrix with dimension d*×d_t and b_1 is a trainable bias vector with dimension d*×1; BN(·) represents a batch normalization layer, which contributes to the performance of the model.
The projected video visual features are represented as:
v* = BN( W_2 f_v + b_2 )
where W_2 is a trainable transformation matrix with dimension d*×d_v and b_2 is a trainable bias vector with dimension d*×1; BN(·) represents a batch normalization layer, which contributes to the performance of the model.
The cosine similarity between the projected complex text query statement representation and the projected video visual feature representation is used as the cross-modal matching score, expressed as:
s(Q, V) = cos(q*, v*) = (q* · v*) / (||q*|| · ||v*||)
where Q denotes the complex text query statement, V denotes the initially input video features, q* denotes the complex text query sentence feature representation finally projected into the common space, and v* denotes the video visual feature representation finally projected into the common space. s(Q, V) denotes the cosine similarity of the query statement Q and the video V.
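A compact sketch of the common-space projection and cosine matching score of step (7-1); the common-space dimension d* and the omission of an extra nonlinearity before batch normalization are assumptions of this sketch:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpace(nn.Module):
    """Sketch of step (7-1): project query and video representations into a shared d*-dim space."""
    def __init__(self, d_t=512, d_v=1024, d_star=1536):
        super().__init__()
        self.text_proj = nn.Sequential(nn.Linear(d_t, d_star), nn.BatchNorm1d(d_star))
        self.video_proj = nn.Sequential(nn.Linear(d_v, d_star), nn.BatchNorm1d(d_star))

    def forward(self, f_q, f_v):
        return self.text_proj(f_q), self.video_proj(f_v)    # q*, v*

model = CommonSpace()
q_star, v_star = model(torch.randn(8, 512), torch.randn(8, 1024))
sim = F.cosine_similarity(q_star, v_star, dim=-1)           # s(Q, V) for the matched pairs
```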
(7-2) The model is trained in an end-to-end manner with the defined triplet ranking loss, so that the model automatically learns the correlation between the two modalities. The specific steps are as follows:
To train the model, a triplet ranking loss is used to optimize the network, penalizing the model with a hardest-negative sampling strategy. During training, a batch of complex text query sentence and video pairs is sampled, which can be represented as:
{(Q_i, V_i)}_{i=1}^{B}
where B denotes the number of sampled complex text query sentence and video pairs. Through a margin constant, we require that for any positive sample pair (Q_i, V_i) the similarity s(Q_i, V_i) between the complex text query statement Q_i and its matched video V_i is larger than the similarity s(Q_i, V_j) of any negative sample pair (Q_i, V_j), where V_j is a video not matched with Q_i. The loss function for a batch is expressed as:
L = (1/B) · Σ_{i=1}^{B} (1/|N_h|) · Σ_{V_j ∈ N_h} max(0, margin + s(Q_i, V_j) − s(Q_i, V_i))
where the margin constant lies in (0, 1) and |N_h| denotes the number of hardest negative videos selected for a query within a batch. Penalizing the model with only the single hardest negative sample may lead to unstable training, while averaging over all negative samples leads to slow training; therefore a balancing strategy is used, averaging the loss over the top |N_h| hardest negative samples, which ensures stable and effective training.
(8) Realizing semantic tree based cross-modal retrieval from complex text query to video by using the model obtained by training in step (7).
Specifically, through the training of the model in step (7), the model has learned the mutual connection between the video and the complex text query sentence. Given a complex text query sentence, the model finds out the relevant videos of the complex text query sentence from a candidate video set and uses the relevant videos as the retrieval result, and the steps are as follows:
(8-1) The input complex text query sentence and all candidate videos are mapped into the common space through the model trained in step (7); the complex text query sentence Q is expressed as q* and the video V is expressed as v*.
(8-2) calculating cosine similarity of the complex text query sentence and all candidate videos in a public space, then sorting all the candidate videos in a descending order according to the cosine similarity, and returning the videos with the top order as a retrieval result, thereby realizing cross-modal retrieval from the complex text query sentence to the videos.
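For illustration, the retrieval of step (8-2) reduces to ranking the candidate videos by cosine similarity in the common space; the top-k value below is an arbitrary choice:

```python
import torch
import torch.nn.functional as F

def retrieve(query_vec, video_vecs, top_k=10):
    """Rank candidate videos for one query by cosine similarity in the common space (step 8-2)."""
    sims = F.cosine_similarity(query_vec.unsqueeze(0), video_vecs, dim=-1)   # (num_videos,)
    scores, indices = sims.sort(descending=True)
    return indices[:top_k], scores[:top_k]

top_idx, top_scores = retrieve(torch.randn(1536), torch.randn(1000, 1536))
```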
The above description covers only preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention shall be included in the protection scope of the present invention.

Claims (10)

1. A semantic tree enhancement based cross-modal retrieval method for complex text query to video is characterized by comprising the following steps:
(1) extracting features of the complex text query sentence to obtain leaf node features of the complex text query sentence;
(2) encoding the leaf node features of the complex text query sentence obtained in step (1) with the semantic tree enhanced tree structure;
(3) representing the encoding of the semantic tree structure of the complex text query sentence obtained in step (2), and mining the importance of each node component forming the tree structure with an attention mechanism to obtain a representation of the complex text query sentence that can perceive the user's intention;
(4) performing feature extraction on the video frames to obtain an initial visual feature representation of the video;
(5) extracting, from the initial visual feature representation obtained in step (4), the temporal dependency of consecutive frames along the sequence direction, and extracting the semantic correlation between frames;
(6) applying an attention mechanism to the video representation obtained in step (5) to distinguish the importance degree of the information, so that useful information accounts for a larger proportion in the final video visual feature representation;
(7) respectively mapping the complex text query sentence representation and the video visual feature representation processed by the attention mechanism in steps (3) and (6) into a common space, learning the correlation degree between the two modalities with a common space learning algorithm, and training the model in an end-to-end manner;
(8) realizing semantic tree based cross-modal retrieval from complex text query to video by using the model obtained by training in step (7).
2. The semantic tree enhancement-based cross-modal search method for complex text query to video according to claim 1, wherein the method for extracting leaf node features of complex text query sentences in step (1) comprises the following sub-steps:
(1-1) encoding each word in the complex text query sentence with one-hot encoding to obtain a one-hot encoded vector sequence; multiplying the one-hot encoded vectors by a word embedding matrix to obtain a word vector sequence representation of the complex text query statement;
(1-2) modeling the word vector sequence representation using the LSTM in the RNN, converting the word vector sequence representation into leaf node features.
3. The semantic tree enhancement based cross-modal search method for complex text query to video according to claim 1, wherein the step (2) of encoding the tree structure of semantic tree enhancement on the leaf node feature of the complex text query sentence comprises the following sub-steps:
(2-1) generating a father node by using an LSTM method of a tree structure, and recursively forming a semantic tree structure in a bottom-up manner; the semantic tree consists of two types of nodes: the system comprises child nodes and parent nodes, wherein the child nodes represent words in a complex text query sentence, and the parent nodes represent combinations of word components; expressing the leaf node characteristics obtained in the step (1) as a first-layer child node of a semantic tree, and combining two adjacent child nodes in all the child nodes by using an LSTM method of a tree structure to obtain a candidate father node;
(2-2) selecting the best father node from the candidate father nodes of each layer as a next-layer node according to the memory-enhanced node scoring module, and directly copying unselected child nodes to the next layer to be used as the representation of the next-layer node; the above process is repeated recursively until only one node remains.
4. The method for cross-modal search for complex text query to video based on semantic tree enhancement as claimed in claim 3, wherein in the step (2-1), two adjacent child nodes (h_i, c_i) and (h_{i+1}, c_{i+1}) are given as input, h_i representing the hidden state of the i-th node and c_i representing the memory state of the i-th node, and the parent node is computed as:
h_p = o ⊙ tanh(c_p)
c_p = f_l ⊙ c_i + f_r ⊙ c_{i+1} + τ ⊙ g
wherein h_p represents the hidden state of the parent node, with dimension d_t×1; c_p represents the memory state of the parent node, with dimension d_t×1; ⊙ represents element-wise multiplication between features; τ, f_l, f_r, o, g are expressed as:
[τ; f_l; f_r; o; g] = [σ; σ; σ; σ; tanh]( W_p [h_i; h_{i+1}] + b_p )
wherein W_p represents a trainable transformation matrix with dimension 5d_t×2d_t; b_p represents a trainable bias vector with dimension 5d_t×1; σ represents the sigmoid nonlinear activation function, and tanh represents the tanh nonlinear transformation function;
suppose the t-th layer of the semantic tree consists of N_t nodes; the t-th layer nodes are expressed as:
H^t = {(h_1^t, c_1^t), (h_2^t, c_2^t), …, (h_{N_t}^t, c_{N_t}^t)}
if the t-th-layer nodes (h_i^t, c_i^t) and (h_{i+1}^t, c_{i+1}^t) are selected to be merged, the parent node is represented as:
(h_i^{t+1}, c_i^{t+1}) = TreeLSTM((h_i^t, c_i^t), (h_{i+1}^t, c_{i+1}^t))
wherein (h_i^t, c_i^t) represents the i-th node of the t-th layer, (h_{i+1}^t, c_{i+1}^t) represents the (i+1)-th node of the t-th layer, (h_i^{t+1}, c_i^{t+1}) represents the i-th node of the (t+1)-th layer, and TreeLSTM represents the LSTM method of the tree structure;
in the step (2-2), the best parent node is determined according to the memory-enhanced node scoring module f_score(·; Θ_score); the likelihood s_i^t that the i-th candidate parent node is selected is expressed as:
s_i^t = f_score(h_i^{t+1}, m_i^t; Θ_score)
wherein Θ_score represents the trainable parameters of the node scoring module and m_i^t is the context semantic vector; the importance degree of each node hidden state is judged by querying the memory M, and the context semantic vector m_i^t is obtained by aggregating the node hidden states in M according to their importance degrees; the memory M is represented as:
M = {h_1^1, h_2^1, …, h_N^1}
wherein h_N^1 indicates the hidden state of the N-th node of layer 1; the importance degree α_{i,j}^t of a node hidden state in the memory M is expressed as:
α_{i,j}^t = softmax( φ_r( W_m h_j^1 + b_m )^T h_i^{t+1} )
wherein α_{i,j}^t represents the importance degree of the i-th node of the t-th layer of the semantic tree with respect to the hidden state of the j-th node in the memory M; W_m represents a trainable transformation matrix with dimension d_t×d_t; b_m represents a trainable bias vector with dimension d_t×1; φ_r represents the ReLU nonlinear activation function; softmax represents the softmax nonlinear function; according to the importance degrees α_{i,j}^t, the information in the memory M is aggregated to obtain the context semantic vector m_i^t, expressed as:
m_i^t = Σ_{j=1}^{N} α_{i,j}^t · h_j^1
wherein α_{i,j}^t is the weight normalized after applying the attention mechanism;
the score s_i^t of a candidate parent node is obtained by the following formula:
s_i^t = w_s^T · φ_r( W_s [h_i^{t+1}; m_i^t] + b_s )
wherein φ_r is the ReLU nonlinear activation function; w_s represents a trainable transformation vector with dimension 2d_t×1; b_s represents a trainable bias vector with dimension 2d_t×1; W_s represents a trainable transformation matrix with dimension 2d_t×2d_t;
the candidate parent node with the largest score s_i^t is selected from the candidate parent nodes as the best parent node.
5. The method according to claim 1, wherein in the step (3), on the basis of the complex text query sentence representation based on semantic tree enhancement, a neural network is introduced to mine the importance of each node component by using an attention mechanism; the importance score β_i of node e_i is expressed as:
β_i = softmax( u_ta^T · φ_r( W_ta e_i + b_ta ) )
wherein φ_r is the ReLU nonlinear activation function; W_ta represents a trainable transformation matrix with dimension d_ta×d_t; b_ta represents a trainable bias vector with dimension d_ta×1; u_ta represents a trainable transformation vector with dimension d_ta×1; the importance scores are used as the weights of the nodes, and all node components are aggregated to obtain the user-intention-aware representation f_q of the complex text query sentence:
f_q = Σ_{i=1}^{N-1} β_i · e_i
wherein N−1 represents the number of nodes of the semantic tree structure and β_i represents the importance score of node e_i.
6. The method for cross-modal search of complex text query to video based on semantic tree enhancement as claimed in claim 1, wherein in the step (4), a pre-trained deep Convolutional Neural Network (CNN) is used to perform feature extraction on an input video frame, and a deep visual feature of each frame is extracted as an initial visual feature.
7. The method for cross-modal retrieval from complex text query to video based on semantic tree enhancement according to claim 1, wherein the step (5) of extracting the temporal dependency of consecutive frames along the sequence direction comprises: encoding the initial visual features of the video obtained in the step (4) with a gated recurrent unit (GRU), wherein at each time step the GRU takes the feature vector of the current frame and the hidden state of the previous frame as input and outputs the hidden state of the current frame; through the GRU operation, the temporal dependency between consecutive frames is effectively captured;
the extraction of the semantic relevance between frames across the whole video comprises: applying a self-attention mechanism based on scaled dot-product attention, namely projecting the representations of the video sequence frames into a plurality of attention spaces; in each attention space, the dot product between the query projection of a frame and the key projections of the remaining frames is computed, a weight over the corresponding value frames is obtained through a Softmax operation, and the weights are multiplied by the value frames; finally, the final video representation is obtained by concatenating the outputs of the attention spaces and normalizing the result.
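The two steps above can be sketched as follows. Using PyTorch's nn.MultiheadAttention (which performs the per-head scaled dot products and concatenates the heads internally) and a LayerNorm as the final normalization are assumptions about the exact realization, not the patent's reference implementation.

```python
import torch
import torch.nn as nn

class TemporalEncoder(nn.Module):
    """GRU for frame-to-frame temporal dependency, then multi-head self-attention
    for sequence-wide semantic relevance between frames."""
    def __init__(self, d_in, d_hidden, num_heads=8):
        super().__init__()
        # d_hidden must be divisible by num_heads
        self.gru = nn.GRU(d_in, d_hidden, batch_first=True)
        self.attn = nn.MultiheadAttention(d_hidden, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(d_hidden)

    def forward(self, frames):
        # frames: (batch, m, d_in) initial per-frame CNN features
        h, _ = self.gru(frames)          # (batch, m, d_hidden) temporal hidden states
        ctx, _ = self.attn(h, h, h)      # scaled dot-product attention over all frames
        return self.norm(ctx)            # normalized final frame representations
```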
8. The method for cross-modal retrieval from complex text query to video based on semantic tree enhancement according to claim 1, wherein in the step (6), an attention neural network model with three trainable parameters is designed to distinguish the importance degree of the video frames; the importance degree η_t of the t-th frame is expressed as:
η_t = softmax(u_va^T σ(W_va v_t + b_va))
wherein σ denotes a nonlinear activation function; u_va is a trainable transformation vector with dimension d_va × 1; b_va is a trainable bias vector with dimension d_va × 1; W_va is a trainable transformation matrix with dimension d_va × d_v; v_t is the representation of the corresponding video frame with dimension d_v × 1;
the importance degree of each frame is multiplied as a weight by the representation of the corresponding video frame, and the m frames are finally accumulated to form the final video representation:
v = Σ_{t=1}^{m} η_t v_t
wherein η_t indicates the importance degree of the t-th frame, and v, with dimension d_v × 1, is the aggregated representation of all the video frame components.
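A minimal sketch of this frame-attention aggregation is shown below; the tanh nonlinearity and the module name FrameAttentionPool are assumptions for illustration, since the claim only names the three trainable parameters u_va, b_va and W_va.

```python
import torch
import torch.nn as nn

class FrameAttentionPool(nn.Module):
    """Weights each frame by a learned importance score and sums the m frames."""
    def __init__(self, d_v, d_va):
        super().__init__()
        self.W_va = nn.Linear(d_v, d_va)            # d_va x d_v transform + d_va bias (W_va, b_va)
        self.u_va = nn.Linear(d_va, 1, bias=False)  # d_va x 1 scoring vector (u_va)

    def forward(self, frames):
        # frames: (m, d_v) frame representations from the temporal encoder
        logits = self.u_va(torch.tanh(self.W_va(frames)))  # nonlinearity is an assumption
        eta = torch.softmax(logits, dim=0)                  # (m, 1) frame importance degrees
        return (eta * frames).sum(dim=0)                    # (d_v,) final video representation
```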
9. The method for cross-modal retrieval from complex text query to video based on semantic tree enhancement according to claim 1, wherein in the step (7), the steps of learning the correlation between the two modalities and training the model with the common space learning algorithm are as follows:
(7-1) the complex text query sentence representation and the video visual representation obtained through the attention mechanisms in the step (3) and the step (6) are mapped by two linear projection models into a unified common space of the same dimensionality; a nonlinear activation function is then applied to the projected features, followed by a batch normalization (BN) layer;
(7-2) the model is trained end-to-end with a defined triplet ranking loss, so that it automatically learns the correlation between the two modalities.
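A compact sketch of step (7) follows. The Tanh activation, the placement of BatchNorm1d, and the bidirectional hinge-based form of the triplet ranking loss are common choices assumed here for illustration; the claim itself does not fix these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CommonSpace(nn.Module):
    """Projects text and video features into one common space of equal dimensionality."""
    def __init__(self, d_text, d_video, d_common):
        super().__init__()
        self.text_proj = nn.Sequential(nn.Linear(d_text, d_common), nn.Tanh(), nn.BatchNorm1d(d_common))
        self.video_proj = nn.Sequential(nn.Linear(d_video, d_common), nn.Tanh(), nn.BatchNorm1d(d_common))

    def forward(self, q, v):
        return self.text_proj(q), self.video_proj(v)

def triplet_ranking_loss(q, v, margin=0.2):
    # q, v: (batch, d_common); matched text/video pairs sit on the diagonal
    sims = F.normalize(q, dim=1) @ F.normalize(v, dim=1).t()   # cosine similarity matrix
    pos = sims.diag().unsqueeze(1)
    cost_v = (margin + sims - pos).clamp(min=0)      # negative videos ranked above the true one
    cost_q = (margin + sims - pos.t()).clamp(min=0)  # negative queries ranked above the true one
    mask = torch.eye(sims.size(0), device=sims.device).bool()
    return cost_v.masked_fill(mask, 0).sum() + cost_q.masked_fill(mask, 0).sum()
```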
10. The method for cross-modal retrieval from complex text query to video based on semantic tree enhancement according to claim 1, wherein in the step (8), given a complex text query sentence, the videos related to it are found from a candidate video set and returned as the retrieval result, with the following steps:
(8-1) the input complex text query sentence and all candidate videos are mapped into the common space through the model trained in the step (7);
(8-2) the similarity between the complex text query sentence and every candidate video is computed in the common space, all candidate videos are then sorted in descending order of similarity, and the top-ranked videos are returned as the retrieval result, thereby realizing the cross-modal retrieval from complex text query sentence to video.
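The ranking step reduces to a similarity search in the common space; the cosine similarity and the hypothetical helper name retrieve below are illustrative assumptions consistent with the loss sketched above.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def retrieve(query_vec, video_vecs, topk=10):
    # query_vec:  (d,) common-space embedding of the complex text query
    # video_vecs: (num_videos, d) common-space embeddings of all candidate videos
    sims = F.normalize(video_vecs, dim=1) @ F.normalize(query_vec, dim=0)  # cosine similarities
    scores, indices = sims.topk(topk)           # descending order of similarity
    return indices.tolist(), scores.tolist()    # top-ranked videos as the retrieval result
```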
CN202010686024.2A 2020-07-16 2020-07-16 Semantic tree enhancement based cross-modal retrieval method for searching video from complex text Active CN111897913B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010686024.2A CN111897913B (en) 2020-07-16 2020-07-16 Semantic tree enhancement based cross-modal retrieval method for searching video from complex text

Publications (2)

Publication Number Publication Date
CN111897913A true CN111897913A (en) 2020-11-06
CN111897913B CN111897913B (en) 2022-06-03

Family

ID=73189400

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010686024.2A Active CN111897913B (en) 2020-07-16 2020-07-16 Semantic tree enhancement based cross-modal retrieval method for searching video from complex text

Country Status (1)

Country Link
CN (1) CN111897913B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200104318A1 (en) * 2017-03-07 2020-04-02 Selerio Limited Multi-modal image search
CN110175266A (en) * 2019-05-28 2019-08-27 复旦大学 A method of it is retrieved for multistage video cross-module state
CN110490946A (en) * 2019-07-15 2019-11-22 同济大学 Text generation image method based on cross-module state similarity and generation confrontation network
CN111191075A (en) * 2019-12-31 2020-05-22 华南师范大学 Cross-modal retrieval method, system and storage medium based on dual coding and association
CN111309971A (en) * 2020-01-19 2020-06-19 浙江工商大学 Multi-level coding-based text-to-video cross-modal retrieval method

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112241468A (en) * 2020-07-23 2021-01-19 哈尔滨工业大学(深圳) Cross-modal video retrieval method and system based on multi-head self-attention mechanism and storage medium
CN113762007B (en) * 2020-11-12 2023-08-01 四川大学 Abnormal behavior detection method based on appearance and action feature double prediction
CN113762007A (en) * 2020-11-12 2021-12-07 四川大学 Abnormal behavior detection method based on appearance and action characteristic double prediction
CN112380385A (en) * 2020-11-18 2021-02-19 湖南大学 Video time positioning method and device based on multi-modal relational graph
CN112380385B (en) * 2020-11-18 2023-12-29 湖南大学 Video time positioning method and device based on multi-mode relation diagram
CN112883229A (en) * 2021-03-09 2021-06-01 中国科学院信息工程研究所 Video-text cross-modal retrieval method and device based on multi-feature-map attention network model
CN113111837A (en) * 2021-04-25 2021-07-13 山东省人工智能研究院 Intelligent monitoring video early warning method based on multimedia semantic analysis
CN113111836A (en) * 2021-04-25 2021-07-13 山东省人工智能研究院 Video analysis method based on cross-modal Hash learning
CN113204666B (en) * 2021-05-26 2022-04-05 杭州联汇科技股份有限公司 Method for searching matched pictures based on characters
CN113204666A (en) * 2021-05-26 2021-08-03 杭州联汇科技股份有限公司 Method for searching matched pictures based on characters
CN113590881A (en) * 2021-08-09 2021-11-02 北京达佳互联信息技术有限公司 Video clip retrieval method, and training method and device of video clip retrieval model
CN113590881B (en) * 2021-08-09 2024-03-19 北京达佳互联信息技术有限公司 Video clip retrieval method, training method and device for video clip retrieval model
CN114048350A (en) * 2021-11-08 2022-02-15 湖南大学 Text-video retrieval method based on fine-grained cross-modal alignment model
CN113934887B (en) * 2021-12-20 2022-03-15 成都考拉悠然科技有限公司 No-proposal time sequence language positioning method based on semantic decoupling
CN113934887A (en) * 2021-12-20 2022-01-14 成都考拉悠然科技有限公司 No-proposal time sequence language positioning method based on semantic decoupling
CN114429119A (en) * 2022-01-18 2022-05-03 重庆大学 Video and subtitle fragment retrieval method based on multi-cross attention
CN114429119B (en) * 2022-01-18 2024-05-28 重庆大学 Video and subtitle fragment retrieval method based on multiple cross attentions
CN114579803A (en) * 2022-03-09 2022-06-03 北方工业大学 Video retrieval model based on dynamic convolution and shortcut
CN114579803B (en) * 2022-03-09 2024-04-12 北方工业大学 Video retrieval method, device and storage medium based on dynamic convolution and shortcuts
CN114612748A (en) * 2022-03-24 2022-06-10 北京工业大学 Cross-modal video clip retrieval method based on feature decoupling
CN114612748B (en) * 2022-03-24 2024-06-07 北京工业大学 Cross-modal video segment retrieval method based on feature decoupling
CN114896450A (en) * 2022-04-15 2022-08-12 中山大学 Video time retrieval method and system based on deep learning
CN114896450B (en) * 2022-04-15 2024-05-10 中山大学 Video moment retrieval method and system based on deep learning
CN115099855A (en) * 2022-06-23 2022-09-23 广州华多网络科技有限公司 Method for preparing advertising pattern creation model and device, equipment, medium and product thereof
CN115223086A (en) * 2022-09-20 2022-10-21 之江实验室 Cross-modal action positioning method and system based on interactive attention guidance and correction
CN115223086B (en) * 2022-09-20 2022-12-06 之江实验室 Cross-modal action positioning method and system based on interactive attention guidance and correction

Also Published As

Publication number Publication date
CN111897913B (en) 2022-06-03


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant