CN111988673A - Video description statement generation method and related equipment - Google Patents

Publication number
CN111988673A
Authority
CN
China
Prior art keywords
vector
time
video
neural network
target
Prior art date
Legal status
Granted
Application number
CN202010764613.8A
Other languages
Chinese (zh)
Other versions
CN111988673B (en)
Inventor
袁艺天
马林
朱文武
Current Assignee
Tsinghua University
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tsinghua University
Tencent Technology Shenzhen Co Ltd
Priority date
Filing date
Publication date
Application filed by Tsinghua University and Tencent Technology Shenzhen Co Ltd
Priority to CN202010764613.8A
Publication of CN111988673A
Application granted
Publication of CN111988673B
Legal status: Active


Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04N: PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 21/00: Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N 21/80: Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N 21/83: Generation or processing of protective or descriptive data associated with content; Content structuring
    • H04N 21/84: Generation or processing of descriptive data, e.g. content descriptors
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks
    • G06N 3/049: Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G06N 3/08: Learning methods

Abstract

An embodiment of this application provides a video description sentence generation method and related device. The method includes: obtaining a syntactic feature vector of a target example sentence; determining the syntax of a video description sentence to be generated according to the syntactic feature vector to obtain syntax information; determining, according to the syntax information and a video semantic feature vector of a target video, the semantics of the video description sentence to be generated corresponding to that syntax, to obtain semantic information; and generating a video description sentence of the target video according to the semantic information. In this way, video description sentences with different syntactic structures can be generated by selecting different target example sentences, which alleviates the problem that generated video description sentences have a single syntax.

Description

Video description statement generation method and related equipment
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a method for generating a video description statement and related equipment.
Background
Video description (video captioning) refers to generating, for a given video, a sentence that can be used to describe the content of the video; the generated sentence is referred to as a video description sentence. With a video description sentence generated for a video, a user can quickly learn what the video is about from the sentence alone, without watching the video. In the related art, the generated video description sentences suffer from the problem of a single, fixed syntax.
Disclosure of Invention
Embodiments of this application provide a video description sentence generation method and related device, which at least to some extent alleviate the problem that generated video description sentences have a single syntax.
Other features and advantages of the present application will be apparent from the following detailed description, or may be learned by practice of the application.
According to an aspect of an embodiment of the present application, there is provided a method for generating a video description sentence, the method including: obtaining a syntactic characteristic vector of a target example sentence; determining the syntax of a video description sentence to be generated according to the syntax feature vector to obtain syntax information; determining the semantics of the video description sentence to be generated corresponding to the syntax according to the syntax information and the video semantic feature vector of the target video to obtain semantic information; and generating a video description statement of the target video according to the semantic information.
According to an aspect of the embodiments of the present application, there is provided an apparatus for generating a video description sentence, the apparatus including: the obtaining module is used for obtaining the syntactic characteristic vector of the target example sentence; the syntax determining module is used for determining the syntax of the video description sentence to be generated according to the syntax feature vector to obtain syntax information; the semantic determining module is used for determining the semantic of the video description sentence to be generated corresponding to the syntax according to the syntax information and the video semantic feature vector of the target video to obtain semantic information; and the video description statement determining module is used for generating the video description statement of the target video according to the semantic information.
In some embodiments of the present application, the syntax determination module is configured to: generating, by a first neural network included in a description generation model, a first hidden vector from the syntactic feature vector, the first hidden vector indicating the syntactic information, the description generation model further including a second neural network cascaded with the first neural network, the first and second neural networks being gate-based recurrent neural networks.
In this embodiment, the semantic determination module is configured to: generating, by the second neural network, a second hidden vector from the first hidden vector and the video semantic feature vector, the second hidden vector being used to indicate the semantic information.
In some embodiments of the application, the video description statement determination module is configured to: determine a word vector at time t according to the second hidden vector generated by the second neural network at time t; and generate the video description sentence according to the word vectors output at the respective times.
In this embodiment, the syntax determining module includes a first hidden vector generating unit, configured to output, by the first neural network, the first hidden vector at time t according to the syntactic feature vector, the word vector at time t-1, and the first hidden vector at time t-1 generated by the first neural network.
In this embodiment, the semantic determining module includes a second hidden vector generating unit, configured to output, by the second neural network, the second hidden vector at time t according to the video semantic feature vector, the first hidden vector at time t, and the second hidden vector at time t-1 generated by the second neural network.
In some embodiments of the present application, the first hidden vector generation unit includes: a first soft attention weighting unit, configured to perform soft attention weighting on the syntactic feature vector according to the first hidden vector at time t-1 to obtain the target syntactic feature vector corresponding to time t; a first splicing unit, configured to splice the target syntactic feature vector corresponding to time t with the word vector at time t-1 to obtain the first splicing vector corresponding to time t; and a first output unit, configured to take the first splicing vector corresponding to time t as the input of the first neural network and correspondingly output the first hidden vector at time t.
In some embodiments of the present application, the first neural network includes a first input gate, a first forgetting gate, and a first output gate, and the first output unit includes: a first forgetting gate vector calculation unit, configured to calculate, by the first forgetting gate, a first forgetting gate vector at time t according to the first splicing vector corresponding to time t; a first input gate vector calculation unit, configured to calculate, by the first input gate, a first input gate vector at time t according to the first splicing vector corresponding to time t; a first cell unit vector calculation unit, configured to calculate a first cell unit vector at time t according to the first forgetting gate vector at time t, the first input gate vector at time t, the first unit vector at time t, and the first cell unit vector at time t-1 corresponding to the first neural network, where the first unit vector at time t is obtained by performing a hyperbolic tangent calculation on the first splicing vector corresponding to time t; and a first hidden vector calculation unit, configured to calculate the first hidden vector at time t according to the first cell unit vector at time t and the first output gate vector at time t, where the first output gate vector at time t is calculated by the first output gate according to the first splicing vector corresponding to time t.
In some embodiments of the present application, the syntax determination module further comprises: the first normalization unit is used for respectively normalizing a first input gate vector, a first forgetting gate vector, a first output gate vector and a first unit vector in the first neural network; the first transformation unit is used for respectively transforming the normalized first input gate vector, the normalized first forgetting gate vector, the normalized first output gate vector and the normalized first unit vector according to a first offset vector and a first scaling vector to obtain a target first input gate vector, a target first forgetting gate vector, a target first output gate vector and a target first unit vector, wherein the first offset vector is output by the first multilayer perceptron according to the target syntactic characteristic vector corresponding to the moment t, the first scaling vector is output by the second multilayer perceptron according to the target syntactic characteristic vector corresponding to the moment t, and the first multilayer perceptron and the second multilayer perceptron are independent.
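For illustration only, the normalize-then-transform step described above can be sketched with standard building blocks as follows; the use of layer normalization, the multilayer perceptron sizes and the tensor shapes are assumptions, since the patent does not fix these details.

# Illustrative sketch: a gate vector is normalized, then shifted and scaled
# element-wise by an offset vector and a scaling vector produced by two
# independent multilayer perceptrons from the target syntactic feature vector.
# Sizes, the LayerNorm choice and the placeholder inputs are assumptions.
import torch
import torch.nn as nn

hidden_dim, syn_dim = 512, 512                              # assumed sizes
norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)   # plain normalization of a gate vector
mlp_offset = nn.Sequential(nn.Linear(syn_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, hidden_dim))  # first multilayer perceptron
mlp_scale = nn.Sequential(nn.Linear(syn_dim, hidden_dim), nn.Tanh(), nn.Linear(hidden_dim, hidden_dim))   # second, independent multilayer perceptron

gate_vec = torch.randn(1, hidden_dim)   # e.g. the first input gate vector at time t (placeholder values)
phi_t = torch.randn(1, syn_dim)         # target syntactic feature vector corresponding to time t (placeholder)

# normalize, then scale and shift element-wise with vectors conditioned on phi_t
target_gate_vec = mlp_scale(phi_t) * norm(gate_vec) + mlp_offset(phi_t)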
In this embodiment, the first cell unit vector calculation unit is further configured to: calculate the first cell unit vector at time t according to the target first forgetting gate vector, the target first input gate vector, the target first unit vector and the first cell unit vector at time t-1.
In this embodiment, the first hidden vector calculation unit is further configured to: calculate the first hidden vector at time t according to the first cell unit vector at time t and the target first output gate vector.
In some embodiments of the present application, the second hidden vector generation unit includes: a second soft attention weighting unit, configured to perform soft attention weighting on the video semantic feature vector according to the second hidden vector at time t-1 to obtain the target video semantic vector corresponding to time t; a second splicing unit, configured to splice the target video semantic vector corresponding to time t with the first hidden vector corresponding to time t to obtain the second splicing vector corresponding to time t; and a second output unit, configured to take the second splicing vector corresponding to time t as the input of the second neural network and correspondingly output the second hidden vector at time t.
In some embodiments of the present application, the second neural network includes a second input gate, a second forgetting gate, and a second output gate, and the second output unit includes: a second forgetting gate vector calculation unit, configured to calculate, by the second forgetting gate, a second forgetting gate vector at time t according to the second splicing vector corresponding to time t; a second input gate vector calculation unit, configured to calculate, by the second input gate, a second input gate vector at time t according to the second splicing vector corresponding to time t; a second cell unit vector calculation unit, configured to calculate a second cell unit vector at time t according to the second forgetting gate vector at time t, the second input gate vector at time t, the second unit vector at time t, and the second cell unit vector at time t-1 corresponding to the second neural network, where the second unit vector at time t is obtained by performing a hyperbolic tangent calculation on the second splicing vector corresponding to time t; and a second hidden vector calculation unit, configured to calculate the second hidden vector at time t according to the second cell unit vector at time t and the second output gate vector at time t, where the second output gate vector at time t is calculated by the second output gate according to the second splicing vector corresponding to time t.
In some embodiments of the present application, the semantic determination module further comprises: and the second normalization unit is used for respectively normalizing the second input gate vector, the second forgetting gate vector, the second output gate vector and the second unit vector in the second neural network. And the second transformation unit is used for respectively transforming the normalized second input gate vector, the normalized second forgetting gate vector, the normalized second output gate vector and the normalized second unit vector according to a second offset vector and a second scaling vector to obtain a target second input gate vector, a target second forgetting gate vector, a target second output gate vector and a target second unit vector, wherein the second offset vector is output by the third multilayer perceptron according to the target video semantic vector corresponding to the time t, the second scaling vector is output by the fourth multilayer perceptron according to the target video semantic vector corresponding to the time t, and the third multilayer perceptron is independent from the fourth multilayer perceptron.
In this embodiment, the second cell unit vector calculation unit is further configured to: calculate the second cell unit vector at time t according to the target second forgetting gate vector, the target second input gate vector, the target second unit vector and the second cell unit vector at time t-1.
In this embodiment, the second hidden vector calculation unit is further configured to: calculate the second hidden vector at time t according to the second cell unit vector at time t and the target second output gate vector.
In some embodiments of the present application, the generating means of the video description sentence further comprises: the training data acquisition module is used for acquiring training data, and the training data comprises a plurality of sample videos and sample video description sentences corresponding to the sample videos. The semantic feature extraction module is used for extracting semantic features of the sample video to obtain a sample video semantic feature vector of the sample video; and the syntactic feature extraction module is used for carrying out syntactic feature extraction on the sample video description statement corresponding to the sample video to obtain a sample syntactic feature vector of the sample video description statement. A first syntactic loss determining module, configured to output, by the first neural network, a first implicit vector sequence according to the sample syntactic feature vector, and calculate a first syntactic loss by the first implicit vector sequence; and the first semantic loss determining module is used for outputting a second hidden vector sequence by the second neural network according to the first hidden vector sequence and the sample video semantic feature vector of the sample video, and calculating the first semantic loss through the second hidden vector sequence. The first target loss calculation module is used for calculating to obtain a first target loss according to the first syntax loss and the first semantic loss; a first adjustment module to adjust parameters of the description generative model based on the first target loss.
In some embodiments of the present application, the first syntax loss determination module comprises: a syntax tree prediction unit, configured to predict, by a sixth neural network, a syntax tree for the sample description statement according to the first implicit vector sequence, where the sixth neural network is a gate-controlled cyclic neural network; a first syntax loss calculation unit configured to calculate the first syntax loss according to the predicted syntax tree and an actual syntax tree of the sample description sentence.
In some embodiments of the present application, the first semantic loss determination module comprises: and the first description statement output unit is used for outputting a first description statement for the sample video according to the second hidden vector sequence through a fifth multilayer perceptron. And the first semantic loss calculating unit is used for calculating the first semantic loss according to the first description statement and the sample video description statement corresponding to the sample video.
In some embodiments of the present application, the generating means of the video description sentence further comprises: the system comprises a first sample syntax feature vector acquisition module, a second sample syntax feature vector acquisition module and a third sample syntax feature vector acquisition module, wherein the first sample syntax feature vector acquisition module is used for acquiring sample syntax feature vectors of sample statements, and the sample statements comprise sample example sentences and sample video description sentences corresponding to sample videos; the second syntax loss calculation module is used for outputting a first hidden vector sequence by the first neural network according to the sample syntax feature vector of the sample statement and calculating a second syntax loss through the first hidden vector sequence corresponding to the sample statement; the second semantic loss calculation module is used for outputting a second hidden vector sequence by the second neural network according to a sample semantic feature vector of the sample statement and the first hidden vector sequence corresponding to the sample statement, wherein the sample semantic feature vector is obtained by performing semantic feature extraction on the sample statement, and the second semantic loss is calculated through the second hidden vector sequence corresponding to the sample statement; the second target loss calculation module is used for calculating to obtain a second target loss according to the second syntax loss and the second semantic loss; a second adjustment module to adjust parameters of the description generative model based on the second target loss.
In some embodiments of the present application, the training data further includes a plurality of sample example sentences, and the apparatus for generating video description sentences further includes: the second sample syntax feature vector acquisition module is used for acquiring the sample syntax feature vector of the sample example sentence; a first hidden vector sequence output module, configured to output, by the first neural network, a first hidden vector sequence according to the sample syntax feature vector of the sample example sentence; a second hidden vector sequence output module, configured to output, by the second neural network, a second hidden vector sequence according to the first hidden vector sequence corresponding to the sample example sentence and the sample video semantic feature vector of the sample video; a second description sentence determination module for determining a second description sentence according to a second hidden vector sequence corresponding to the sample video; a third target loss calculation module, configured to calculate a third target loss according to the syntax tree corresponding to the sample example sentence and the syntax tree corresponding to the second description sentence; a third adjustment module to adjust parameters of the description generative model based on the third target loss.
In some embodiments of the present application, the obtaining module comprises: a character feature vector acquisition unit, configured to obtain the character feature vector of each character included in each word of the target example sentence, where a character feature vector is obtained by encoding the character; a third hidden vector output unit, configured to output, by the third neural network, a third hidden vector corresponding to each character according to the character feature vector of that character; an average calculation unit, configured to, for each word in the target example sentence, compute an average over the third hidden vectors corresponding to the characters in the word to obtain the feature vector of the word; and a fourth hidden vector output unit, configured to output, by the fourth neural network, a fourth hidden vector according to the feature vector of each word in the target example sentence, where the fourth hidden vectors are used as the syntactic feature vector, and the third neural network and the fourth neural network are gate-based recurrent neural networks.
In some embodiments of the present application, the generating means of the video description sentence further comprises: a video frame sequence acquisition module, configured to acquire a video frame sequence obtained by framing the target video; the semantic extraction module is used for performing semantic extraction on each video frame in the video frame sequence through the convolutional neural network to obtain a semantic vector of each video frame; a fifth hidden vector output module, configured to output a fifth hidden vector according to a semantic vector of each video frame in the sequence of video frames through the fifth neural network, where the fifth hidden vector is used as the video semantic feature vector, and the fifth neural network is a gate-controlled cyclic neural network.
According to an aspect of an embodiment of the present application, there is provided an electronic device including: a processor; a memory having computer readable instructions stored thereon which, when executed by the processor, implement the method described above.
According to an aspect of embodiments of the present application, there is provided a computer-readable storage medium having stored thereon computer-readable instructions which, when executed by a processor, implement the method as described above.
In the technical solutions provided in some embodiments of the present application, syntax information for guiding a syntax structure of a video description sentence to be generated is obtained based on a syntax feature vector of a target example sentence, then semantic information corresponding to the syntax structure indicated by the syntax feature vector of the video description sentence to be generated is determined according to the syntax information and a video semantic feature vector of the target video, and finally a video description sentence is generated for the target video according to the semantic information.
Because the syntax of the generated video description statement is controlled by the syntax feature vector of the target example sentence, for the same target video, if the target example sentences with different syntax structures are selected to constrain the syntax structures of the video description statement, the video description statements with different syntax structures can be generated, so that the video description statements with different syntax structures can be generated for the same target video by changing the target example sentences, thereby realizing the generation of diversified video description statements for the target video, and effectively solving the problem of single syntax of the video description statement in the prior art.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present application and together with the description, serve to explain the principles of the application. It is obvious that the drawings in the following description are only some embodiments of the application, and that for a person skilled in the art, other drawings can be derived from them without inventive effort. In the drawings:
FIG. 1 shows a schematic diagram of an exemplary system architecture to which aspects of embodiments of the present application may be applied;
FIG. 2 is a flow diagram of a method of generating a video description statement according to an embodiment of the present application;
FIG. 3 shows a schematic diagram of an episodic memory neural network;
fig. 4 is a diagram illustrating video description sentences generated under different target example sentences for the same target video according to a specific embodiment;
FIG. 5 is a flow diagram illustrating the output of a first hidden vector according to one embodiment;
FIG. 6 is a flow diagram illustrating outputting a second hidden vector according to one embodiment;
FIG. 7 is a flow diagram illustrating training a description generative model according to one embodiment;
FIG. 8 is a flow diagram illustrating training a description generative model according to another embodiment;
FIG. 9 is a flow diagram illustrating training a description generative model according to another embodiment;
FIG. 10 is a schematic diagram illustrating the generation of a video description statement, according to one embodiment;
FIG. 11 is a block diagram illustrating an apparatus for generating a video description statement in accordance with one embodiment;
FIG. 12 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art.
Furthermore, the described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments. In the following description, numerous specific details are provided to give a thorough understanding of embodiments of the application. One skilled in the relevant art will recognize, however, that the subject matter of the present application can be practiced without one or more of the specific details, or with other methods, components, devices, steps, and so forth. In other instances, well-known methods, devices, implementations, or operations have not been shown or described in detail to avoid obscuring aspects of the application.
The block diagrams shown in the figures are functional entities only and do not necessarily correspond to physically separate entities. I.e. these functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor means and/or microcontroller means.
The flow charts shown in the drawings are merely illustrative and do not necessarily include all of the contents and operations/steps, nor do they necessarily have to be performed in the order described. For example, some operations/steps may be decomposed, and some operations/steps may be combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive branch of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. Artificial intelligence is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, Natural Language Processing (NLP) and machine learning/deep learning.
Fig. 1 shows a schematic diagram of an exemplary system architecture to which the technical solution of the embodiments of the present application can be applied.
As shown in fig. 1, the system architecture may include a terminal device (e.g., one or more of a smartphone 101, a tablet computer 102, and a portable computer 103 shown in fig. 1, but may also be a desktop computer, etc.), a network 104, and a server 105. The network 104 serves as a medium for providing communication links between terminal devices and the server 105. Network 104 may include various connection types, such as wired communication links, wireless communication links, and so forth.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. For example, server 105 may be a server cluster comprised of multiple servers, or the like.
In an embodiment of the application, the server may obtain a target example sentence and a target video uploaded on the terminal device, then perform syntactic feature extraction on the target example sentence to obtain a syntactic feature vector of the target example sentence, and perform video semantic extraction on the target video to obtain a video semantic feature vector of the target video.
In an embodiment of the application, a server may further store a set of example sentences, and the server may receive an example sentence selection instruction sent by a terminal device, determine a target example sentence according to the example sentence selection instruction, and further perform syntactic feature extraction on the target example sentence.
In other embodiments, of course, the server may also store a plurality of videos for the user to select, that is, the server may receive a video selection instruction sent by the terminal device, and use the video selected by the video selection instruction as the target video of the video description statement to be generated.
In an embodiment of the application, after obtaining the syntactic feature vector of the target example sentence and the video semantic feature vector of the target video, the server generates a video description sentence for the target video based on the syntactic feature vector and the video semantic feature vector, so that the generated video description sentence is the same as or similar to the syntax of the target example sentence on one hand, and the semantics of the video description sentence are ensured to be related to the video content in the target video, that is, related to the semantics of the target video.
In one embodiment of the application, after the server generates the video description sentence for the target video, the generated video description sentence is fed back to the terminal device, so that the terminal device presents the video description sentence to the user.
It should be noted that the method for generating a video description sentence provided in the embodiment of the present application is generally executed by the server 105, and accordingly, the means for generating a video description sentence is generally disposed in the server 105. However, in other embodiments of the present application, the terminal device may also have a similar function as the server, so as to execute the method for generating the video description sentence provided in the embodiments of the present application.
The implementation details of the technical solution of the embodiment of the present application are set forth in detail below:
fig. 2 shows a flowchart of a method for generating a video description sentence according to an embodiment of the present application, which may be performed by a device having a calculation processing function, such as the server 105 shown in fig. 1. Referring to fig. 2, the method for generating the video description sentence at least includes steps 210 to 240, which are described in detail as follows:
Step 210: obtaining the syntactic feature vector of the target example sentence.
The target example sentence is an example sentence used for restricting the syntactic structure of the video description sentence to be generated.
In some embodiments of the present application, the example sentence set may be pre-constructed, so that a user selects an example sentence from the example sentence set as a target example sentence, thereby constraining the syntax structure of the video description sentence to be generated. Of course, in other embodiments, the target example sentence may also be an example sentence uploaded by the user through the terminal device.
The syntactic feature vector of the target example sentence is used to describe the syntactic structure of the target example sentence; it reflects the dependency relationships between the words in the target example sentence and its syntactic structure information (for example, subject-predicate-object or attributive-adverbial-complement structures).
The syntactic feature vector of the target example sentence can be obtained by performing syntactic analysis on the target example sentence. The syntactic analysis may be syntactic structure analysis (also called phrase structure analysis or constituency analysis), dependency analysis (also called dependency syntactic analysis or dependency parsing), or deep-grammar syntactic analysis.
In some embodiments of the present application, the target example sentence may be parsed by means of a syntactic analysis tool, and the syntactic feature vector may be generated from the parsing result. Syntactic analysis tools include, for example, StanfordCoreNLP, HanLP, SpaCy and FudanNLP. The parsing result may be a constituency parse tree generated for the target example sentence, and the syntactic feature vector of the target example sentence is obtained by serializing the constituency parse tree.
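As an illustration only (the patent does not prescribe a particular tool or API), the following sketch shows how one of the listed tools, SpaCy, could be used to parse an example sentence and serialize the result; the model name en_core_web_sm and the triple-based serialization are assumptions.

# Hedged sketch: dependency parsing of a target example sentence with spaCy.
# The choice of spaCy and the "en_core_web_sm" model are assumptions for
# illustration; the patent only requires some form of syntactic analysis.
import spacy

nlp = spacy.load("en_core_web_sm")          # small English pipeline with a parser
sentence = "A group of dogs play on a grass field"
doc = nlp(sentence)

# One simple serialization of the parse: (word, dependency label, head word) triples.
parse_triples = [(token.text, token.dep_, token.head.text) for token in doc]
print(parse_triples)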
In some embodiments of the present application, syntactic feature extraction may be performed by two cascaded layers of gate-based recurrent neural networks, so as to obtain a syntactic feature vector of a target example sentence.
Specifically, the character feature vector of each character included in each word of the target example sentence is obtained first, where a character feature vector is obtained by encoding the character; then a third neural network outputs a third hidden vector corresponding to each character according to the character feature vector of that character; next, for each word in the target example sentence, an average is computed over the third hidden vectors corresponding to the characters in the word to obtain the feature vector of the word; finally, a fourth neural network obtains a fourth hidden vector sequence according to the feature vectors of the words in the target example sentence, and the fourth hidden vector sequence is used as the syntactic feature vector. The third neural network and the fourth neural network are gated recurrent neural networks.
The gated recurrent neural network may be a Long Short-Term Memory network (LSTM) or a Gated Recurrent Unit network (GRU).
The GRU is a variant of the LSTM. Compared with the LSTM, the GRU merges the forgetting gate and the input gate into a single gate, called the update gate, and has one other gate, called the reset gate. The GRU also does not distinguish, as the LSTM does, between a cell unit vector serving as the internal state and a hidden vector serving as the external state; instead, a linear dependency is added directly between the state of the network at the current time (h_t) and the state of the network at the previous time (h_{t-1}).
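For reference, the standard GRU update equations (which the patent text does not reproduce) take roughly the following form, where z_t denotes the update gate and r_t the reset gate:

% Standard GRU cell, given for reference; notation follows common usage, not the patent
\begin{aligned}
z_t &= \sigma(W_z \cdot [h_{t-1}, x_t] + b_z) \\                      % update gate
r_t &= \sigma(W_r \cdot [h_{t-1}, x_t] + b_r) \\                      % reset gate
\tilde{h}_t &= \tanh(W_h \cdot [r_t \odot h_{t-1}, x_t] + b_h) \\     % candidate state
h_t &= (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t                % linear interpolation of states
\end{aligned}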
Next, the process of generating the syntactic feature vector for the target example sentence is described, taking as an example the case where both the third neural network and the fourth neural network are long short-term memory networks. Before the detailed description, the structure of the long short-term memory network and the processing involved in it are explained first.
Fig. 3 shows a schematic diagram of the long short-term memory network. As shown in Fig. 3, at any time t, the long short-term memory network has three inputs: the input vector x_t at time t, the hidden vector h_{t-1} output at the previous time, and the cell unit vector c_{t-1} of the previous time. The cell unit vector reflects the state of the cell unit at the corresponding time, and the hidden vector serves as the output of the LSTM at the corresponding time.
As shown in Fig. 3, the LSTM includes a forgetting gate, an input gate and an output gate. The forgetting gate determines how much of the cell unit vector c_{t-1} at the previous time is retained in the cell unit vector c_t at the current time; the input gate determines how much of the input vector x_t at the current time is stored into the cell unit vector c_t at the current time; the output gate controls how much of the cell unit vector c_t is output to the current output h_t of the LSTM.
In the LSTM, the hidden vector and the cell unit vector at each time are determined based on the calculations of the forgetting gate, the input gate and the output gate. For convenience of description, the vector directly obtained by the calculation of the input gate in the LSTM is referred to as the input gate vector, the vector obtained by the calculation of the output gate as the output gate vector, and the vector obtained by the calculation of the forgetting gate as the forgetting gate vector.
For any time t, the forgetting gate vector f_t, the input gate vector i_t, the unit vector g_t, the cell unit vector c_t, the output gate vector o_t and the hidden vector h_t corresponding to time t are calculated as follows.
The forgetting gate vector f_t is:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f),   (1)
where σ is the sigmoid function, whose value range is (0, 1); W_f is the weight matrix of the forgetting gate; [h_{t-1}, x_t] denotes the concatenation of the two vectors; b_f is the bias term of the forgetting gate; W_f and b_f can be determined by training.
The input gate vector i_t is:
i_t = σ(W_i · [h_{t-1}, x_t] + b_i),   (2)
where W_i is the weight matrix of the input gate and b_i is the bias term of the input gate; W_i and b_i are determined by training.
The LSTM also involves the calculation of a unit vector g_t, which describes the current input:
g_t = tanh(W_c · [h_{t-1}, x_t] + b_c),   (3)
where tanh denotes the hyperbolic tangent function, W_c is a weight matrix and b_c is a bias term; W_c and b_c can be determined by training.
The unit vector g_t is used to calculate the cell unit vector c_t:
c_t = f_t ⊙ c_{t-1} + i_t ⊙ g_t,   (4)
where the symbol ⊙ denotes element-wise multiplication.
The output gate vector o_t is:
o_t = σ(W_o · [h_{t-1}, x_t] + b_o),   (5)
and the hidden vector h_t is:
h_t = o_t ⊙ tanh(c_t).   (6)
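The gate calculations in equations (1) to (6) can be written out directly. The following minimal NumPy sketch of a single LSTM step is an illustration under assumed weight shapes, not the patent's implementation:

# Minimal single-step LSTM cell following equations (1)-(6) above.
# This is an illustrative sketch; weight shapes and the sigmoid helper are assumptions.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, b_f, W_i, b_i, W_c, b_c, W_o, b_o):
    z = np.concatenate([h_prev, x_t])          # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)               # forgetting gate vector, eq. (1)
    i_t = sigmoid(W_i @ z + b_i)               # input gate vector, eq. (2)
    g_t = np.tanh(W_c @ z + b_c)               # unit vector, eq. (3)
    c_t = f_t * c_prev + i_t * g_t             # cell unit vector, eq. (4)
    o_t = sigmoid(W_o @ z + b_o)               # output gate vector, eq. (5)
    h_t = o_t * np.tanh(c_t)                   # hidden vector, eq. (6)
    return h_t, c_t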
proceed back to the process of generating syntactic feature vectors for the target exemplar sentence. For example, for the word "applet", the character sequence is: firstly, inputting the character feature vector of each character in the character sequence into a third neural network LSTMc in sequence to obtain a third implicit vector corresponding to each character. Specifically, assume that the character feature vector of the ith character in the nth word in the target example sentence is
Figure BDA00026134048800001314
It is input into the third neural network LSTMc, and its calculation process in the third neural network LSTMc network can be described as:
Figure BDA0002613404880000132
wherein the content of the first and second substances,
Figure BDA0002613404880000133
a hidden vector output by the third neural network for the l-1 character in the nth word,
Figure BDA0002613404880000134
cell unit vectors obtained for the l-1 character in the nth word for the third neural network,
Figure BDA0002613404880000135
cell unit vectors derived for the 1 st character in the nth word for the third neural network.
Figure BDA0002613404880000136
The hidden vector output by the third neural network is called a third hidden vector for the convenience of distinguishing, wherein the hidden vector is output by the 1 st character in the nth word of the third neural network.
After the third implicit vectors corresponding to the characters are obtained, average calculation is carried out according to the third implicit vectors corresponding to the characters in the nth word, and the obtained average vectors are used as the characteristic vectors w of the nth wordn
Figure BDA0002613404880000137
Finally, the feature vectors corresponding to all words in the target example sentence are input to a fourth neural network LSTM in sequencewWherein the feature vector w of the nth wordnThe processing in the fourth neural network may be described as:
Figure BDA0002613404880000138
wherein the content of the first and second substances,
Figure BDA0002613404880000139
a hidden vector output by the fourth neural network for the feature vector of the nth word;
Figure BDA00026134048800001310
a cell unit vector output by the fourth neural network for the feature vector of the nth word;
Figure BDA00026134048800001311
a hidden vector output by the fourth neural network aiming at the feature vector of the (n-1) th word;
Figure BDA00026134048800001312
a cell unit vector output by the fourth neural network for the feature vector of the (n-1) th word; for the convenience of distinguishing, the fourth neural network is used for inputtingThe hidden vector is called the fourth hidden vector.
Therefore, the fourth implicit vectors output by each time are combined to obtain a fourth implicit vector sequence
Figure BDA00026134048800001313
The fourth implicit vector sequence HsAnd the syntactic characteristic vector serving as the target example sentence is used for controlling the syntax of the video description sentence to be generated.
Through the above process, the feature vector of each word in the target example sentence is obtained on the basis of character-level encoding; starting the encoding from the character level ensures that the feature vector of a word fully reflects the characteristics of that word.
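One possible way to realize this character-level-then-word-level syntactic encoder with standard building blocks is sketched below; the framework choice (PyTorch), the dimensions and the batching are assumptions made for illustration.

# Illustrative sketch of the syntactic encoder: a character-level LSTM (the
# "third neural network"), per-word averaging, and a word-level LSTM (the
# "fourth neural network"). Dimensions and framework choice are assumptions.
import torch
import torch.nn as nn

class SyntacticEncoder(nn.Module):
    def __init__(self, num_chars, char_dim=64, hidden_dim=256):
        super().__init__()
        self.char_embed = nn.Embedding(num_chars, char_dim)              # character feature vectors
        self.lstm_c = nn.LSTM(char_dim, hidden_dim, batch_first=True)    # third neural network
        self.lstm_w = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)  # fourth neural network

    def forward(self, char_ids_per_word):
        # char_ids_per_word: list of 1-D LongTensors, one tensor of character ids per word
        word_vecs = []
        for char_ids in char_ids_per_word:
            chars = self.char_embed(char_ids).unsqueeze(0)   # (1, L_n, char_dim)
            h_c, _ = self.lstm_c(chars)                      # third hidden vectors
            word_vecs.append(h_c.mean(dim=1))                # average over characters -> word feature vector
        words = torch.stack(word_vecs, dim=1)                # (1, N, hidden_dim)
        h_s, _ = self.lstm_w(words)                          # fourth hidden vectors
        return h_s                                           # syntactic feature vector H^s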
Continuing with fig. 2, at step 220, the syntax of the video description sentence to be generated is determined according to the syntax feature vector, and syntax information is obtained.
The video description sentence is generated word by word: at each time, a word is predicted and output, and the words output at different times play different syntactic roles in the video description sentence. For example, if the target example sentence has a subject-predicate-object structure, the corresponding words are output in the order of subject, predicate and object to form the video description sentence.
Because the syntactic feature vector of the target example sentence describes the dependency relationships between the words in the target example sentence and its syntactic structure information, the syntactic structure of the target example sentence can be determined from the syntactic feature vector. This syntactic structure is taken as the syntactic structure of the video description sentence to be generated, and on the basis of this syntactic structure, the sentence component to be output at each time is guided by the syntax information.
That is, the syntax information indicates the sentence component of the video description sentence to be output at each time, such as subject, predicate or object. The syntactic structure of the video description sentence to be generated is the same as that of the target example sentence, so the sentence component to be output at each time is determined according to the syntactic structure indicated by the syntactic feature vector, which ensures that the syntactic structure of the output video description sentence is consistent with that of the target example sentence.
Step 230: determining, according to the syntax information and the video semantic feature vector of the target video, the semantics of the video description sentence to be generated corresponding to the syntax, to obtain semantic information.
Here, "target video" does not refer to one particular video, but to whichever video a video description sentence needs to be generated for.
The video semantic feature vector of the target video is used for describing the content of the video, and the content of the video can be understood as the semantic of the video. The content in the video may include objects (e.g., people, animals, objects, scenes, etc.) in the video, behaviors of the objects, etc.
For objects in the video, an object recognition determination may be made based on video frames in the video; for the behavior of an object in a video, motion recognition of the object may be performed based on several consecutive video frames to determine the behavior of the object.
In some embodiments of the present application, to obtain the video semantic feature vector of the target video, the target video may be firstly framed, and object recognition, motion recognition, and the like are performed based on image features in each video frame, so as to generate the video semantic feature vector of the target video.
In some embodiments of the present application, the video semantic feature vector of the target video may be extracted through the following process. Specifically, a video frame sequence obtained by framing a target video is obtained; then, semantic extraction is carried out on each video frame in the video frame sequence through a convolutional neural network to obtain a semantic vector of each video frame; and outputting a fifth hidden vector sequence by a fifth neural network according to the semantic vector of each video frame in the video frame sequence, wherein the fifth hidden vector sequence is used as a video semantic feature vector, and the fifth neural network is a gate-controlled cyclic neural network.
The hidden vectors output by the fifth neural network are called as fifth hidden vectors, and the fifth hidden vector sequence is obtained by combining the fifth hidden vectors correspondingly output by the fifth neural network for each video frame.
Suppose the semantic vectors of the video frames are combined into a video semantic sequence V = [v_1, ..., v_m, ..., v_M], where v_m is the semantic vector of the m-th video frame and M is the number of video frames.
The video semantic sequence is then input into the fifth neural network (for example, a long short-term memory network LSTM_v) for encoding, yielding a feature sequence H^v = [h^v_1, ..., h^v_M] that contains the video context. The processing of the semantic vector v_m of the m-th video frame in the fifth neural network LSTM_v can be described as:
(h^v_m, c^v_m) = LSTM_v(v_m, h^v_{m-1}, c^v_{m-1}),
where h^v_{m-1} is the hidden vector output by the fifth neural network LSTM_v for the semantic vector of the (m-1)-th video frame; c^v_{m-1} is the cell unit vector output by LSTM_v for the semantic vector of the (m-1)-th video frame; h^v_m is the hidden vector output by LSTM_v for the semantic vector of the m-th video frame; and c^v_m is the cell unit vector output by LSTM_v for the semantic vector of the m-th video frame. For ease of distinction, the hidden vectors output by the fifth neural network LSTM_v are referred to as fifth hidden vectors.
The fifth hidden vectors output by the fifth neural network LSTM_v for the video frames of the target video are combined in the chronological order of the video frames to obtain the fifth hidden vector sequence.
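A corresponding sketch of the video-side encoding is given below; the choice of a pretrained ResNet-50 as the per-frame convolutional network, the frozen backbone and the feature dimensions are assumptions, since the patent only requires a convolutional neural network followed by a gated recurrent network.

# Illustrative sketch of video semantic feature extraction: a CNN produces a
# semantic vector per frame, and the "fifth neural network" (an LSTM) encodes
# the frame sequence into the fifth hidden vector sequence H^v.
# The ResNet-50 backbone, weights and dimensions are assumptions for illustration.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class VideoSemanticEncoder(nn.Module):
    def __init__(self, hidden_dim=512):
        super().__init__()
        backbone = resnet50(weights="IMAGENET1K_V1")                # assumed pretrained backbone
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])   # drop the classifier head
        self.lstm_v = nn.LSTM(2048, hidden_dim, batch_first=True)   # fifth neural network

    def forward(self, frames):
        # frames: (M, 3, 224, 224), the sampled video frames in chronological order
        with torch.no_grad():                         # backbone kept frozen here for simplicity
            feats = self.cnn(frames).flatten(1)       # (M, 2048) semantic vector per frame
        h_v, _ = self.lstm_v(feats.unsqueeze(0))      # (1, M, hidden_dim)
        return h_v                                    # fifth hidden vector sequence H^v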
As described above, the syntax information is used to indicate sentence components to be output at respective time instants. Therefore, in order to ensure that the output video description sentence accurately expresses the content in the video, the semantics is given to each sentence component by combining the video semantic feature vector and the syntax information of the target video, and the semantic information corresponding to each sentence component is obtained.
It can be understood that if the syntax structure of the target example sentence changes, the sentence components that need to be sequentially output at each time also correspondingly change, and therefore, the semantic information output for the target video is controlled by the syntax structure defined by the target example sentence.
Step 240: generating the video description sentence of the target video according to the semantic information.
In some embodiments of the present application, after obtaining semantic information corresponding to each syntax component, words are predicted according to the semantic information, so that the predicted words at each time are sequentially combined to obtain a video description sentence of a target video.
In some embodiments of the present application, a word list is pre-deployed for word prediction. Based on the obtained semantic information, the probability of each word in the word list given the semantic information is predicted, and the word corresponding to the semantic information is then determined according to the predicted probabilities; for example, the word with the highest probability is taken as the word corresponding to the semantic information.
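The word-prediction step described here amounts to scoring every entry of the word list and normalizing the scores into probabilities; a minimal sketch, with an assumed linear projection of the semantic hidden vector and a toy word list, could look like this:

# Illustrative sketch of word prediction from semantic information at one time
# step: project the second hidden vector onto the word list and take the most
# probable word. The projection layer, toy vocabulary and greedy choice are assumptions.
import torch
import torch.nn as nn

vocab = ["<eos>", "a", "man", "is", "cutting", "an", "egg", "on", "board"]  # toy word list
hidden_dim = 512
proj = nn.Linear(hidden_dim, len(vocab))      # maps the semantic hidden vector to vocabulary scores

h_sem_t = torch.randn(1, hidden_dim)          # second hidden vector at time t (placeholder)
probs = torch.softmax(proj(h_sem_t), dim=-1)  # probability of each word in the word list
word = vocab[int(probs.argmax(dim=-1))]       # greedy: take the most probable word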
According to the scheme, the syntax information used for guiding the syntax structure of the video description sentence to be generated is obtained based on the syntax feature vector of the target example sentence, then the semantic information of the video description sentence to be generated, which corresponds to the syntax structure indicated by the syntax feature vector, is determined according to the syntax information and the video semantic feature vector of the target video, and finally the video description sentence is generated for the target video according to the semantic information.
It can be understood that, for the same target video, if the target example sentences with different syntax structures are selected to constrain the syntax structures of the video description sentences, video description sentences with different syntax structures can be generated, so that the video description sentences with different syntax structures can be generated for the same target video by changing the target example sentences, thereby realizing the generation of diversified video description sentences for the target video.
Referring to fig. 4, for the same target video, if the target example sentence is "advanced view of a group of videos on a grass field", the video description sentence generated for the target video according to the method of the present application based on the target example sentence is "viewing video of a recording with entries in a glass bowl"; if the target example sentence is "multimedia partial watching TV and remoting control in hand in host bed", the video description sentence generated for the target video according to the method of the present application based on the target example sentence is "Woman book sleeping and mixing in bed in kitchen table"; if the target example sentence is "Water videos where a core dropped in a glass", a video description sentence generated for the target video according to the method of the present application based on the target example sentence is "Egg watches disks where a knife cut at board".
In some embodiments of the present application, the processes of step 220 and step 230 are implemented by a gate-based recurrent neural network, respectively.
In this embodiment, step 220 includes: generating, by a first neural network included in a description generation model, a first hidden vector according to the syntactic feature vector, the first hidden vector being used to indicate the syntax information; the description generation model further includes a second neural network cascaded with the first neural network, and the first neural network and the second neural network are gate-based recurrent neural networks.
In some embodiments of the present application, a first implicit vector at time t is output by the first neural network based on the syntactic feature vector, the word vector at time t-1, and a first implicit vector at time t-1 generated by the first neural network.
Specifically, the first hidden vector can be generated through steps 510 to 530 shown in FIG. 5, described as follows:
Step 510, performing soft attention weighting on the syntactic feature vector according to the first hidden vector at time t-1 to obtain the target syntactic feature vector corresponding to time t.
Step 520, splicing the target syntactic feature vector corresponding to time t with the word vector at time t-1 to obtain the first splicing vector corresponding to time t.
Step 530, taking the first splicing vector corresponding to time t as the input of the first neural network, which correspondingly outputs the first hidden vector at time t.
Soft attention weighting, also known as a soft attention mechanism, performs a re-weighted aggregation calculation of the remaining information by selectively ignoring portions of the information.
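As an illustration of such a soft attention mechanism, the following sketch (in PyTorch) uses an additive scoring function; the patent does not fix the exact scoring form, so the `SoftAttention` module and its dimensions are assumptions.

```python
import torch
import torch.nn as nn

class SoftAttention(nn.Module):
    """Additive soft attention: weights a set of feature vectors by a query and sums them."""
    def __init__(self, feat_dim, query_dim, hidden_dim=256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, hidden_dim)
        self.w_query = nn.Linear(query_dim, hidden_dim)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, feats, query):
        # feats: (L, feat_dim), query: (query_dim,)
        scores = self.score(torch.tanh(self.w_feat(feats) + self.w_query(query)))  # (L, 1)
        weights = torch.softmax(scores, dim=0)       # how much of each feature vector is kept
        return (weights * feats).sum(dim=0)          # re-weighted aggregation of the remaining information

att = SoftAttention(feat_dim=512, query_dim=512)
context = att(torch.randn(10, 512), torch.randn(512))   # e.g. 10 feature vectors -> one context vector
```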
Continuing with the above example, the first neural network is taken to be a long short-term memory neural network in order to describe steps 510-530.
Suppose the first hidden vector output by the first neural network at time t-1 is h_{t-1}^{syn}, and the syntactic feature vector of the target example sentence is H^E. Performing soft attention weighting on H^E through the first hidden vector at time t-1 can be described as:

φ_t^{syn} = SoftAtt(H^E, h_{t-1}^{syn})

where φ_t^{syn} is the target syntactic feature vector corresponding to time t.

The target syntactic feature vector φ_t^{syn} at time t is spliced with the word vector e_{t-1} at time t-1 to obtain the first splicing vector x_t^{syn} = [φ_t^{syn}; e_{t-1}] corresponding to time t. The first splicing vector x_t^{syn} corresponding to time t is then taken as the input of the first neural network LSTM_syn at time t, and LSTM_syn correspondingly outputs the first hidden vector at time t. The process can be described as:

h_t^{syn}, c_t^{syn} = LSTM_syn(x_t^{syn}, h_{t-1}^{syn}, c_{t-1}^{syn})

where c_t^{syn} is the cell unit vector of the first neural network LSTM_syn at time t, h_t^{syn} is the first hidden vector of the first neural network LSTM_syn at time t, and c_{t-1}^{syn} is the cell unit vector of the first neural network LSTM_syn at time t-1.
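For illustration, the following sketch puts steps 510-530 together for one time step, with a simple dot-product form of soft attention and illustrative dimensions; names such as `lstm_syn`, `H_E` and `e_prev` are placeholders rather than the patent's identifiers.

```python
import torch
import torch.nn as nn

def soft_attend(feats, query):
    # Dot-product soft attention: one simple instantiation of the soft attention weighting.
    weights = torch.softmax(feats @ query, dim=0)       # weight of each feature vector
    return (weights.unsqueeze(-1) * feats).sum(dim=0)   # re-weighted aggregation

syn_dim, word_dim, hid_dim = 512, 300, 512
lstm_syn = nn.LSTMCell(input_size=syn_dim + word_dim, hidden_size=hid_dim)

H_E = torch.randn(12, syn_dim)          # syntactic feature vectors of the target example sentence
h_prev = torch.zeros(hid_dim)           # first hidden vector at time t-1
c_prev = torch.zeros(hid_dim)           # cell unit vector at time t-1
e_prev = torch.randn(word_dim)          # word vector at time t-1

phi_t = soft_attend(H_E, h_prev)                      # step 510: target syntactic feature vector
x_t = torch.cat([phi_t, e_prev]).unsqueeze(0)         # step 520: first splicing vector
h_t, c_t = lstm_syn(x_t, (h_prev.unsqueeze(0), c_prev.unsqueeze(0)))  # step 530: first hidden vector at time t
```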
When the first neural network is the long short-term memory neural network LSTM_syn, its specific structure can be seen in fig. 3. The first neural network includes a first input gate, a first forgetting gate and a first output gate, and step 530 includes: calculating, by the first forgetting gate, a first forgetting gate vector at time t according to the first splicing vector corresponding to time t; calculating, by the first input gate, a first input gate vector at time t according to the first splicing vector corresponding to time t; then calculating a first cell unit vector at time t according to the first forgetting gate vector at time t, the first input gate vector at time t, the first unit vector at time t and the first cell unit vector at time t-1 corresponding to the first neural network, where the first unit vector at time t is obtained by hyperbolic tangent calculation from the first splicing vector corresponding to time t; and finally calculating the first hidden vector at time t according to the first cell unit vector at time t and the first output gate vector at time t, where the first output gate vector at time t is calculated by the first output gate according to the first splicing vector corresponding to time t.
In this embodiment, the first forgetting gate vector refers to a forgetting gate vector in the first neural network, and similarly, the first input gate vector, the first output gate vector, and the first cell unit vector refer to an input gate vector, an output gate vector, and a cell unit vector in the first neural network, respectively.
The first input gate vector, the first forgetting gate vector, the first output gate vector, the first unit vector, the first cell unit vector and the first hidden vector at time t are calculated with reference to equations (1)-(6) above, and the calculation is not repeated here.
In some embodiments of the present application, step 230 is implemented by the second neural network, which is a gate-based recurrent neural network, as follows: the second neural network generates a second hidden vector according to the first hidden vector and the video semantic feature vector, where the second hidden vector is used to indicate the semantic information.
In some embodiments of the present application, the second hidden vector at the time t is output by the second neural network according to the semantic feature vector of the video, the first hidden vector at the time t and the second hidden vector at the time t-1 generated by the second neural network.
In the case that the second neural network is a long short-term memory neural network, the generation process of the second hidden vector at time t may include steps 610 to 630 shown in fig. 6, described as follows:
Step 610, performing soft attention weighting on the video semantic feature vector according to the second hidden vector at time t-1 to obtain the target video semantic vector corresponding to time t.
Step 620, splicing the target video semantic vector corresponding to time t with the first hidden vector corresponding to time t to obtain the second splicing vector corresponding to time t.
Step 630, taking the second splicing vector corresponding to time t as the input of the second neural network, which correspondingly outputs the second hidden vector at time t.
Continuing with the above example, the video semantic feature vector of the target video is V. Performing soft attention weighting on V through the second hidden vector h_{t-1}^{sem} of the second neural network at time t-1 can be described as:

φ_t^{sem} = SoftAtt(V, h_{t-1}^{sem})

where φ_t^{sem} is the target video semantic vector corresponding to time t.

Then the target video semantic vector φ_t^{sem} at time t is spliced with the first hidden vector h_t^{syn} output by the first neural network at time t to obtain the second splicing vector x_t^{sem} = [φ_t^{sem}; h_t^{syn}] corresponding to time t. The second splicing vector x_t^{sem} corresponding to time t is taken as the input of the second neural network LSTM_sem at time t, and LSTM_sem correspondingly outputs the second hidden vector at time t. The process can be described as:

h_t^{sem}, c_t^{sem} = LSTM_sem(x_t^{sem}, h_{t-1}^{sem}, c_{t-1}^{sem})

where c_t^{sem} is the cell unit vector of the second neural network LSTM_sem at time t, h_t^{sem} is the second hidden vector of the second neural network LSTM_sem at time t, and c_{t-1}^{sem} is the cell unit vector of the second neural network LSTM_sem at time t-1.
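Analogously, a minimal sketch of steps 610-630 for one time step of the second neural network follows; the dot-product attention, the dimensions and names such as `lstm_sem` are assumptions for the example.

```python
import torch
import torch.nn as nn

def soft_attend(feats, query):
    # Same simple dot-product soft attention helper as in the previous sketch.
    weights = torch.softmax(feats @ query, dim=0)
    return (weights.unsqueeze(-1) * feats).sum(dim=0)

vid_dim, hid_dim = 512, 512
lstm_sem = nn.LSTMCell(input_size=vid_dim + hid_dim, hidden_size=hid_dim)

V = torch.randn(30, vid_dim)          # video semantic feature vectors of the target video (e.g. 30 frames)
h_sem_prev = torch.zeros(hid_dim)     # second hidden vector at time t-1
c_sem_prev = torch.zeros(hid_dim)     # cell unit vector of LSTM_sem at time t-1
h_syn_t = torch.randn(hid_dim)        # first hidden vector at time t, produced by LSTM_syn

psi_t = soft_attend(V, h_sem_prev)                  # step 610: target video semantic vector
x_t = torch.cat([psi_t, h_syn_t]).unsqueeze(0)      # step 620: second splicing vector
h_sem_t, c_sem_t = lstm_sem(x_t, (h_sem_prev.unsqueeze(0), c_sem_prev.unsqueeze(0)))  # step 630
```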
In the case that the second neural network is a long short-term memory neural network, the second neural network includes a second input gate, a second forgetting gate and a second output gate. In this embodiment, step 630 includes: calculating, by the second forgetting gate, a second forgetting gate vector at time t according to the second splicing vector corresponding to time t; calculating, by the second input gate, a second input gate vector at time t according to the second splicing vector corresponding to time t; calculating a second cell unit vector at time t according to the second forgetting gate vector at time t, the second input gate vector at time t, the second unit vector at time t and the second cell unit vector at time t-1 corresponding to the second neural network, where the second unit vector at time t is obtained by hyperbolic tangent calculation from the second splicing vector corresponding to time t; and calculating the second hidden vector at time t according to the second cell unit vector at time t and the second output gate vector at time t, where the second output gate vector at time t is calculated by the second output gate according to the second splicing vector corresponding to time t.
In this embodiment, the second forgetting gate vector refers to a forgetting gate vector in the second neural network, and similarly, the second input gate vector, the second output gate vector, and the second cell unit vector refer to an input gate vector, an output gate vector, and a cell unit vector in the second neural network, respectively.
The calculation of the second input gate vector, the second output gate vector, the second cell unit vector, the second unit vector and the second hidden vector at time t is described in equations (1) - (6) above, and is not described herein again.
After a second implicit vector output by the second neural network at the time t is obtained, determining a word vector at the time t according to the second implicit vector generated by the second neural network at the time t; and generating a video description sentence according to the word vector output at each moment.
And the word vector is used as the vector code of the word to be output, and the word vector at the time t is predicted according to a second implicit vector generated by the second neural network at the time t, so that words corresponding to the predicted word vector at each time are combined to obtain the video description sentence of the target video.
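For illustration, the following sketch shows how the per-time-step predictions could be combined greedily into a sentence; the word list, the `word_head` projection and the end-of-sentence handling are assumptions, and the random `h_sem_t` merely stands in for the second hidden vectors produced by the second neural network.

```python
import torch
import torch.nn as nn

vocab = ["<eos>", "a", "woman", "is", "mixing", "ingredients", "in", "the", "bowl"]  # illustrative word list
hid_dim = 512
word_head = nn.Linear(hid_dim, len(vocab))     # maps the second hidden vector to word scores
embed = nn.Embedding(len(vocab), 300)          # word vectors fed back to LSTM_syn at the next step

def decode_step(h_sem_t):
    # Predict the word (and its word vector) at time t from the second hidden vector at time t.
    scores = word_head(h_sem_t)
    idx = scores.softmax(-1).argmax(-1, keepdim=True)   # highest-probability word index, shape (1,)
    return vocab[idx.item()], embed(idx)[0]             # the word, and the word vector e_t

words = []
for t in range(20):                            # cap the sentence length
    h_sem_t = torch.randn(hid_dim)             # stand-in for the LSTM_sem output at time t
    w, e_t = decode_step(h_sem_t)
    if w == "<eos>":                           # stop once an end-of-sentence word is predicted
        break
    words.append(w)
print(" ".join(words))                         # the words at each time combined into the description sentence
```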
In a model with a multilayer neural network, the output of a previous layer is the input of the next layer; because the input passes through linear transformations, activation functions and other calculations inside the neural network, the value ranges of the inputs of different layers may differ greatly. If the distribution of the input of a certain layer changes, its parameters need to be learned again; this phenomenon is called internal covariate shift.
Therefore, in order to avoid internal covariate shift in the description generation model, a Conditional Layer Normalization (CLN) operation is performed on the first input gate vector, the first forgetting gate vector, the first output gate vector and the first unit vector in the first neural network, and on the second input gate vector, the second forgetting gate vector, the second output gate vector and the second unit vector in the second neural network, respectively.
The conditional layer normalization operation is defined as:

CLN(x, y) = f_γ(y) ⊙ ((x - μ(x)) / σ(x)) + f_β(y)

where x is the variable to be subjected to conditional layer normalization, μ(x) is the mean of the variable x, and σ(x) is the standard deviation of the variable x; f_γ(y) is a vector (referred to as the first vector) output by a multilayer perceptron that takes the condition vector y as input, and f_β(y) is a vector (referred to as the second vector) output by another multilayer perceptron that takes the condition vector y as input. The multilayer perceptron outputting the first vector is independent of the multilayer perceptron outputting the second vector.
As can be seen from the above, in order to implement the normalization operation of the condition layer, the normalization operation needs to be performed on the variable first, and then the first vector is used as a scaling vector for scaling and transforming the normalized variable; and the second vector is used as an offset vector and is used for carrying out offset transformation on the normalized variable.
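For illustration, a minimal sketch of such a conditional layer normalization module follows; the internal structure of the two multilayer perceptrons is an assumption, since the description only requires that they be independent and take the condition vector as input.

```python
import torch
import torch.nn as nn

class ConditionalLayerNorm(nn.Module):
    """Conditional layer normalization: normalize x, then scale and shift it with two
    vectors produced from the condition y by two independent multilayer perceptrons."""
    def __init__(self, dim, cond_dim, eps=1e-5):
        super().__init__()
        self.eps = eps
        self.gamma_mlp = nn.Sequential(nn.Linear(cond_dim, dim), nn.Tanh(), nn.Linear(dim, dim))  # f_gamma
        self.beta_mlp = nn.Sequential(nn.Linear(cond_dim, dim), nn.Tanh(), nn.Linear(dim, dim))   # f_beta

    def forward(self, x, y):
        normed = (x - x.mean(-1, keepdim=True)) / (x.std(-1, keepdim=True) + self.eps)  # (x - mu) / sigma
        return self.gamma_mlp(y) * normed + self.beta_mlp(y)   # scaling vector * normed + offset vector

cln = ConditionalLayerNorm(dim=512, cond_dim=512)
out = cln(torch.randn(512), torch.randn(512))   # e.g. condition a gate vector on the target syntactic feature
```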
The conditional layer normalization operation on the first input gate vector, the first forgetting gate vector, the first output gate vector and the first unit vector in the first neural network proceeds as follows: first, the first input gate vector, the first forgetting gate vector, the first output gate vector and the first unit vector in the first neural network are each normalized; then the normalized first input gate vector, first forgetting gate vector, first output gate vector and first unit vector are respectively transformed according to a first offset vector and a first scaling vector to obtain a target first input gate vector, a target first forgetting gate vector, a target first output gate vector and a target first unit vector, where the first offset vector is output by a first multilayer perceptron according to the target syntactic feature vector corresponding to time t, the first scaling vector is output by a second multilayer perceptron according to the target syntactic feature vector corresponding to time t, and the first multilayer perceptron and the second multilayer perceptron are independent.
In this embodiment, calculating the first cell unit vector at time t according to the first forgetting gate vector at time t, the first input gate vector at time t, the first unit vector at time t and the first cell unit vector at time t-1 corresponding to the first neural network includes: calculating the first cell unit vector at time t according to the target first forgetting gate vector, the target first input gate vector, the target first unit vector and the first cell unit vector at time t-1.
In this embodiment, calculating the first hidden vector at time t according to the first cell unit vector at time t and the first output gate vector at time t includes: calculating the first hidden vector at time t according to the first cell unit vector at time t and the target first output gate vector.
For the conditional layer normalization operation on the first input gate vector, the first forgetting gate vector, the first output gate vector and the first unit vector in the first neural network, the condition vector y is the calculated target syntactic feature vector corresponding to time t.
The above conditional layer normalization of the first input gate vector, the first forgetting gate vector, the first output gate vector and the first unit vector in the first neural network may be represented as:

f_t^{syn}, i_t^{syn}, o_t^{syn}, g_t^{syn} = CLN(W^{syn} x_t^{syn} + W_i^{syn} h_{t-1}^{syn} + b^{syn}, φ_t^{syn})    (16)

where W^{syn} and W_i^{syn} are weight matrices, b^{syn} is a bias term, and W^{syn}, W_i^{syn} and b^{syn} are determined by training. The left side of equation (16), f_t^{syn}, i_t^{syn}, o_t^{syn} and g_t^{syn}, denotes the target first forgetting gate vector, the target first input gate vector, the target first output gate vector and the target first unit vector obtained after the conditional layer normalization operation.
After the target first forgetting gate vector, the target first input gate vector, the target first output gate vector and the target first unit vector corresponding to time t are obtained, they participate in the calculation of the first hidden vector and the first cell unit vector in the long short-term memory neural network. Specifically, the first cell unit vector c_t^{syn} at time t is calculated according to the following equation (17), and the first hidden vector h_t^{syn} at time t is calculated according to the following equation (18):

c_t^{syn} = f_t^{syn} ⊙ c_{t-1}^{syn} + i_t^{syn} ⊙ g_t^{syn}    (17)

h_t^{syn} = o_t^{syn} ⊙ tanh(c_t^{syn})    (18)
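For illustration, the following sketch applies conditional layer normalization to the gate pre-activations of one LSTM step and then performs the cell update of equations (17) and (18); the joint weight matrix, the placement of the activation functions and the simplified `cln` helper are assumptions of this sketch rather than the exact formulation.

```python
import torch
import torch.nn as nn

hid_dim, in_dim, cond_dim = 512, 812, 512

# One joint linear map over [x_t; h_{t-1}], later split into forget / input / output / unit parts.
W = nn.Linear(in_dim + hid_dim, 4 * hid_dim)
gamma_mlp = nn.Linear(cond_dim, 4 * hid_dim)   # produces the scaling vector from the condition
beta_mlp = nn.Linear(cond_dim, 4 * hid_dim)    # produces the offset vector from an independent perceptron

def cln(x, y, eps=1e-5):
    # Simplified conditional layer normalization: normalize, then scale and shift by the condition.
    normed = (x - x.mean(-1, keepdim=True)) / (x.std(-1, keepdim=True) + eps)
    return gamma_mlp(y) * normed + beta_mlp(y)

x_t = torch.randn(in_dim)        # first splicing vector at time t
h_prev = torch.zeros(hid_dim)    # first hidden vector at time t-1
c_prev = torch.zeros(hid_dim)    # first cell unit vector at time t-1
phi_t = torch.randn(cond_dim)    # condition: target syntactic feature vector at time t

pre = cln(W(torch.cat([x_t, h_prev])), phi_t)          # conditional layer normalization of the gate vectors
f_t, i_t, o_t, g_t = pre.chunk(4, dim=-1)              # target forget / input / output / unit vectors
f_t, i_t, o_t, g_t = f_t.sigmoid(), i_t.sigmoid(), o_t.sigmoid(), g_t.tanh()
c_t = f_t * c_prev + i_t * g_t                         # equation (17): first cell unit vector at time t
h_t = o_t * torch.tanh(c_t)                            # equation (18): first hidden vector at time t
```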
A process for conditional-layer normalization operations for a second input gate vector, a second forgetting gate vector, a second output gate vector, and a second unit vector in a second neural network, comprising: firstly, respectively normalizing a second input gate vector, a second forgetting gate vector, a second output gate vector and a second unit vector in a second neural network; and then, respectively transforming the normalized second input gate vector, the normalized second forgetting gate vector, the normalized second output gate vector and the normalized second unit vector according to a second offset vector and a second scaling vector to obtain a target second input gate vector, a target second forgetting gate vector, a target second output gate vector and a target second unit vector, wherein the second offset vector is output by a third multilayer perceptron according to a target video semantic vector corresponding to the moment t, the second scaling vector is output by a fourth multilayer perceptron according to the target video semantic vector corresponding to the moment t, and the third multilayer perceptron and the fourth multilayer perceptron are independent.
In this embodiment, calculating the second cell unit vector at time t according to the second forgetting gate vector at time t, the second input gate vector at time t, the second unit vector at time t and the second cell unit vector at time t-1 corresponding to the second neural network includes: calculating the second cell unit vector at time t according to the target second forgetting gate vector, the target second input gate vector, the target second unit vector and the second cell unit vector at time t-1.
In this embodiment, calculating a second implicit vector at time t according to the second cell unit vector at time t and the second output gate vector at time t includes: and calculating to obtain a second implicit vector at the time t according to the second cell unit vector at the time t and the target second output gate vector.
The calculation process for the conditional layer normalization operation on the second input gate vector, the second forgetting gate vector, the second output gate vector and the second unit vector in the second neural network is similar to that in the first neural network; refer to the calculation process for the conditional layer normalization operation in the first neural network described above, and the details are not repeated here.
The gate-based recurrent neural network adjusts the structure of a simple recurrent neural network by adding a gating mechanism for controlling the transmission of information in the neural network. The gating mechanism controls how much information in the memory unit needs to be retained, how much needs to be discarded, how much new state information needs to be stored in the memory unit, and so on, so that the gate-based recurrent neural network can learn dependencies with relatively long spans without suffering from gradient vanishing or gradient explosion.
Moreover, the gate-based recurrent neural network retains the characteristic of a simple recurrent neural network, namely the ability to process data streams with dependency relationships, such as target example sentences, target videos, syntactic feature vectors of target example sentences and video semantic feature vectors of target videos.
It should be noted that, although the scheme of the present application is described by taking the first neural network and the second neural network as long short-term memory neural networks, the scheme of the present application is not limited to being implemented with long short-term memory neural networks; it may also be implemented with a gated recurrent unit network, and the specific process can refer to the process of implementing the scheme with the long short-term memory neural network.
In some embodiments of the present application, in order to ensure the accuracy of the video description sentence output by the description generation model, the description generation model also needs to be trained. The specific training process may include steps 710 to 760 shown in fig. 7, described as follows:
step 710, obtaining training data, where the training data includes a plurality of sample videos and sample video description sentences corresponding to the sample videos.
The training data may use the existing video description generation data sets MSRVTT and ActivityNet. Of course, in other embodiments, the training data may also be constructed as needed.
Step 720, extracting semantic features of the sample video to obtain a sample video semantic feature vector of the sample video; and performing syntactic feature extraction on the sample video description sentence corresponding to the sample video to obtain a sample syntactic feature vector of the sample video description sentence.
The process of extracting semantic features from the sample video to obtain the sample video semantic feature vector may be implemented by using the convolutional neural network and the fifth neural network, and the specific process is described above and is not described herein again.
The syntactic feature extraction performed on the sample description sentence to obtain the sample syntactic feature vector can be implemented by using the third neural network and the fourth neural network, and the specific process is described above and is not repeated here.
Step 730, outputting a first hidden vector sequence by the first neural network according to the sample syntactic feature vector, and calculating a first syntactic loss through the first hidden vector sequence.
In some embodiments of the present application, to calculate the first syntactic loss by the first implicit vector sequence, a syntax tree is first predicted for the sample description statement by a sixth neural network according to the first implicit vector sequence, where the sixth neural network is a gate-based recurrent neural network; a first syntax loss is then calculated based on the predicted syntax tree and the actual syntax tree of the sample description statement.
In some embodiments of the present application, the description generation model may be syntactically supervised and trained with a negative log-likelihood loss function. The first syntactic loss function L_{v,c}^{syn} is defined as:

L_{v,c}^{syn} = -log P(C^{syn} | H^{syn}; V, C)    (19)

where P(C^{syn} | H^{syn}; V, C) is the probability that the similarity between the syntax tree predicted from the first hidden vector sequence H^{syn} (obtained from the sample video V and the sample video description sentence C corresponding to the sample video V) and the actual syntax tree C^{syn} of the sample description sentence C satisfies the first preset condition.
The first preset condition may be set according to a first syntax tree similarity threshold, for example, if the predicted similarity between the syntax tree and the actual syntax tree of the sample description sentence is greater than or equal to the first syntax tree similarity threshold, the first preset condition is deemed to be satisfied.
Thus, the first syntax loss for the sample video and the sample video description sentence corresponding to the sample video are calculated according to the first syntax loss function described above.
Step 740, outputting, by the second neural network, a second hidden vector sequence according to the first hidden vector sequence and the sample video semantic feature vector of the sample video, and calculating the first semantic loss through the second hidden vector sequence.
In some embodiments of the present application, to calculate the first semantic loss by using the second hidden vector sequence, a first description sentence is output for the sample video by the fifth multi-layered perceptron according to the second hidden vector sequence; and then, calculating to obtain a first semantic loss according to the first description statement and the sample video description statement corresponding to the sample video.
In some embodiments of the present application, the description generation model is semantically supervised and trained with a negative log-likelihood loss function, where the first semantic loss function L_{v,c}^{sem} is defined as:

L_{v,c}^{sem} = -log P(C | H^{sem}; V, C)    (20)

where P(C | H^{sem}; V, C) is the probability, obtained based on the sample video V and the sample video description sentence C corresponding to the sample video V, that the semantic similarity between the first description sentence and the sample description sentence satisfies the second preset condition.
The second preset condition may be set according to a first semantic similarity threshold, for example, if the predicted semantic similarity between the first descriptive statement and the sample descriptive statement is greater than or equal to the first semantic similarity threshold, the second preset condition is considered to be satisfied.
It can be understood that, in order to calculate the semantic similarity between the first description statement and the sample description statement, the semantic vector of the first description statement and the semantic vector of the sample description statement need to be respectively constructed, so that the similarity calculation is performed according to the semantic vector of the first description statement and the semantic vector of the sample description statement, and the semantic similarity is correspondingly obtained.
Thus, the first semantic loss for the sample video and the sample video description sentence corresponding to the sample video are calculated according to the first semantic loss function.
Step 750, calculating a first target loss according to the first syntactic loss and the first semantic loss.
In some embodiments of the present application, the first syntactic loss and the first semantic loss may be weighted, and the weighted result is taken as the first target loss.
In a specific embodiment, the first syntactic loss function and the first semantic loss function are added, and the sum is taken as the first target loss function, i.e., the first target loss function L_{v,c} is:

L_{v,c} = L_{v,c}^{syn} + L_{v,c}^{sem}    (21)
the first target loss may be obtained by substituting the first syntax loss calculated in step 740 and the first semantic loss calculated in step 750.
Step 760, adjusting the parameters of the description generation model based on the first target loss.
Thus, the parameters of the description generation model are adjusted according to the calculated first target loss until the first target loss function converges.
In the training process, the performance of the description generative model may be limited due to limited training data, so to avoid this situation, on the basis of the training mode corresponding to fig. 7, a training mode corresponding to at least one of fig. 8 and 9 described below is introduced, and the description generative model is further trained to assist the training mode shown in fig. 7.
In some embodiments of the present application, as shown in fig. 8, the method further comprises:
step 810, obtaining a sample syntax feature vector of a sample statement, where the sample statement includes a sample example sentence and a sample video description statement corresponding to a sample video.
The sample syntactic feature vector of the sample sentence can be obtained by using the third neural network and the fourth neural network described above, and the specific process is not repeated here.
Step 820, outputting, by the first neural network, a first hidden vector sequence according to the sample syntactic feature vector of the sample sentence, and calculating a second syntactic loss through the first hidden vector sequence corresponding to the sample sentence.
In some embodiments of the present application, to calculate the second syntax loss through the first implicit vector sequence corresponding to the sample statement, a syntax tree is first obtained by predicting the sample statement through a sixth neural network according to the first implicit vector sequence corresponding to the sample statement, where the sixth neural network is a gate-controlled cyclic neural network; and then calculating a second syntax loss according to the predicted syntax tree and the actual syntax tree of the sample sentence.
In some embodiments of the present application, the description generation model is syntactically supervised and trained with a negative log-likelihood loss function. The second syntactic loss function L_{s,s}^{syn} is defined as:

L_{s,s}^{syn} = -log P(S^{syn} | H^{syn}; S, S)    (22)

where P(S^{syn} | H^{syn}; S, S) represents the probability that the similarity between the syntax tree predicted from the first hidden vector sequence H^{syn} obtained from the sample sentence S and the actual syntax tree S^{syn} of the sample sentence S satisfies the third preset condition.
The third preset condition may be set according to a second syntax tree similarity threshold, for example, if the similarity between the syntax tree predicted for the sample statement and the actual syntax tree of the sample statement is greater than or equal to the second syntax tree similarity threshold, the third preset condition is deemed to be satisfied.
Thus, second syntax losses for the sample sentences are calculated according to the second syntax loss functions described above, respectively.
Step 830, outputting, by the second neural network, a second hidden vector sequence according to the sample semantic feature vector of the sample sentence and the first hidden vector sequence corresponding to the sample sentence, where the sample semantic feature vector is obtained by performing semantic feature extraction on the sample sentence, and calculating a second semantic loss through the second hidden vector sequence corresponding to the sample sentence.
In some embodiments of the present application, a sentence semantic coding module may be constructed in advance to perform semantic feature extraction on a sample sentence. Specifically, for a sample sentence, each word in the sample sentence is encoded with a GloVe word vector, and the encoded word vector sequence is then input into a long short-term memory network, whose output is the sample semantic feature vector of the sample sentence.
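A minimal sketch of such a sentence semantic coding module follows; the embedding layer stands in for pretrained GloVe vectors, and the dimensions are assumptions.

```python
import torch
import torch.nn as nn

# Embed each word (standing in for GloVe vectors) and run an LSTM over the sentence;
# the final hidden state serves as the sample semantic feature vector.
vocab_size, glove_dim, sem_dim = 10000, 300, 512
embed = nn.Embedding(vocab_size, glove_dim)          # placeholder for pretrained GloVe word vectors
lstm = nn.LSTM(glove_dim, sem_dim, batch_first=True)

sample_sentence = torch.randint(0, vocab_size, (1, 9))   # word indices of one sample sentence
outputs, (h_n, c_n) = lstm(embed(sample_sentence))
sentence_semantic_vec = h_n[-1]                          # sample semantic feature vector, shape (1, sem_dim)
```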
In some embodiments of the present application, to calculate the second semantic loss through the second hidden vector sequence corresponding to the sample statement, a multilayer perceptron outputs a third description statement for the sample statement according to the second hidden vector sequence corresponding to the sample statement; and then calculating to obtain a second semantic loss according to the third description statement and the sample statement.
In some embodiments of the present application, the description generation model is semantically supervised and trained with a negative log-likelihood loss function, where the second semantic loss function L_{s,s}^{sem} is defined as:

L_{s,s}^{sem} = -log P(S | H^{sem}; S, S)    (23)

where P(S | H^{sem}; S, S) is the probability that the semantic similarity between the third description sentence, predicted based on the second hidden vector sequence H^{sem} of the sample sentence S, and the sample sentence S satisfies the fourth preset condition.
The fourth preset condition may be set according to a second semantic similarity threshold, for example, if the predicted semantic similarity between the third description sentence and the sample sentence is greater than or equal to the second semantic similarity threshold, the fourth preset condition is considered to be satisfied.
And similarly, performing semantic similarity calculation according to the semantic vector of the third description statement and the semantic vector of the sample statement.
Thus, a second semantic loss can be calculated for the sample sentence according to the second semantic loss function L_{s,s}^{sem} described above.
Step 840, calculating a second target loss according to the second syntactic loss and the second semantic loss.
In some embodiments of the present application, the second syntactic loss function and the second semantic loss function are added, and the sum is taken as the second target loss function, i.e., the second target loss function L_{s,s} is:

L_{s,s} = L_{s,s}^{syn} + L_{s,s}^{sem}    (24)
Thus, after the second syntactic loss and the second semantic loss are calculated, the second target loss for the sample sentence is calculated according to the above expression (24).
Step 850, adjusting the parameters of the description generation model based on the second target loss.
In some embodiments of the present application, the training data further includes a number of sample example sentences, as shown in fig. 9, further including:
step 910, obtaining a sample syntactic characteristic vector of the sample example sentence.
Step 920, outputting, by the first neural network, a first hidden vector sequence according to the sample syntactic feature vector of the sample example sentence.
Step 930, outputting, by the second neural network, a second hidden vector sequence according to the first hidden vector sequence corresponding to the sample example sentence and the sample video semantic feature vector of the sample video.
At step 940, a second description statement is determined according to a second hidden vector sequence corresponding to the sample video.
In some embodiments of the present application, a second description statement may be output for the sample video by a multi-layered perceptron according to a second hidden vector sequence of the sample video.
Step 950, calculating a third target loss according to the syntax tree corresponding to the sample example sentence and the syntax tree corresponding to the second description sentence.
In some embodiments of the present application, the syntax tree of the sample example sentence and the syntax tree of the second descriptive sentence may be obtained by a syntax analysis tool, such as the tools listed above; the syntax trees of the sample example sentences and the second description sentences can also be determined by a gate-based recurrent neural network according to the method described above.
In some embodiments of the present application, the third target loss function L_{v,e} is defined as:

L_{v,e} = -log P(E^{syn} | H^{syn}; V, E)    (25)

where P(E^{syn} | H^{syn}; V, E) represents the probability that the similarity between the syntax tree of the second description sentence, obtained based on the sample video V and the sample example sentence E, and the syntax tree E^{syn} of the sample example sentence E satisfies the fifth preset condition.
The fifth preset condition may be set according to a third syntax tree similarity threshold, for example, if the similarity between the syntax tree of the second descriptive sentence and the syntax tree of the sample example sentence is greater than or equal to the third syntax tree similarity threshold, the fifth preset condition is deemed to be satisfied.
Thus, the third target loss for the sample video and the sample example sentence is correspondingly calculated according to the third target loss function.
Step 960, adjusting the parameters of the description generation model based on the third target loss.
In some embodiments of the present application, the description generation model may be trained jointly with the three training modes of figs. 7-9, in which case the total loss function L of the description generation model may be defined as the sum of the first target loss function, the second target loss function and the third target loss function, i.e.:

L = L_{v,c} + L_{s,s} + L_{v,e}    (26)
of course, in other embodiments, the description generation model may also be trained only with the training manners of fig. 7 and fig. 8, or only with the training manners of fig. 7 and fig. 9, according to actual needs.
Fig. 10 is a schematic diagram of generating a video description sentence according to an embodiment, and as shown in fig. 10, a video semantic feature vector of an input video is extracted by a video semantic coding module and a syntax feature vector of a sample sentence is extracted by a sentence syntax coding module with a hierarchical structure, and then a video description sentence is output for the input video by a description generation model according to the video semantic feature vector of the input video and the syntax feature vector of the sample sentence.
Specifically, as shown in fig. 10, the video semantic coding module includes a Convolutional Neural Network (CNN) and a long short-term memory neural network (LSTM). After a video is input, feature extraction is performed on each video frame through the convolutional neural network to obtain a semantic vector of each video frame; the semantic vectors of the video frames are then input into the long short-term memory neural network according to the time sequence of the video frames, the hidden vectors output by the long short-term memory neural network for each video frame are obtained, and the video semantic feature vector of the input video is further obtained by combining the hidden vectors corresponding to the video frames.
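For illustration, a minimal sketch of this video semantic coding module follows; the linear layer stands in for the convolutional feature extractor, and the frame count and dimensions are assumptions.

```python
import torch
import torch.nn as nn

# A CNN produces a semantic vector per frame (represented here by a stand-in linear layer over
# precomputed frame features), and an LSTM consumes the frame vectors in temporal order;
# its hidden states are combined into the video semantic feature vector.
frame_feat_dim, sem_dim = 2048, 512
cnn_head = nn.Linear(frame_feat_dim, sem_dim)          # stand-in for the CNN feature extractor
lstm = nn.LSTM(sem_dim, sem_dim, batch_first=True)

frames = torch.randn(1, 30, frame_feat_dim)            # 30 frames of one input video
frame_vecs = cnn_head(frames)                          # semantic vector of each video frame
hidden_seq, _ = lstm(frame_vecs)                       # hidden vector for each frame
video_semantic_feature = hidden_seq                    # combined per-frame hidden vectors, (1, 30, sem_dim)
```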
The sentence syntax coding module with a hierarchical structure includes two layers of long short-term memory neural networks, LSTM_c and LSTM_w. The long short-term memory neural network LSTM_c is used to encode the character feature vector of each character of each word in the example sentence and outputs a hidden vector for each character; then, for each word in the example sentence, the hidden vectors of the characters in the word are averaged to obtain the feature vector of the word; finally, the long short-term memory neural network LSTM_w outputs a hidden vector sequence according to the feature vectors of the words in the example sentence, and the output hidden vector sequence is used as the syntactic feature vector of the example sentence.
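For illustration, a minimal sketch of this hierarchical sentence syntax coding module follows; the character vocabulary, the dimensions and the example sentence are assumptions.

```python
import torch
import torch.nn as nn

# LSTM_c encodes the characters of each word, the character hidden vectors of a word are averaged
# into a word feature vector, and LSTM_w runs over the word feature vectors; its hidden vector
# sequence serves as the syntactic feature vector of the sentence.
char_dim, word_dim, syn_dim = 64, 256, 512
char_embed = nn.Embedding(128, char_dim)               # character feature vectors (e.g. ASCII codes)
lstm_c = nn.LSTM(char_dim, word_dim, batch_first=True)
lstm_w = nn.LSTM(word_dim, syn_dim, batch_first=True)

sentence = ["aerial", "view", "of", "people"]           # an illustrative example sentence
word_feats = []
for word in sentence:
    chars = torch.tensor([[ord(c) for c in word]])      # (1, num_chars) character ids
    char_hidden, _ = lstm_c(char_embed(chars))          # hidden vector for every character
    word_feats.append(char_hidden.mean(dim=1))          # average over characters -> word feature vector
syntax_feature, _ = lstm_w(torch.stack(word_feats, dim=1))   # (1, num_words, syn_dim) syntactic feature vector
```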
The description generation model includes two cascaded layers of long short-term memory neural networks, LSTM_syn (the first neural network) and LSTM_sem (the second neural network), where the long short-term memory neural network LSTM_syn is used to control the syntax of the video description sentence to be generated and the long short-term memory neural network LSTM_sem is used to give the video description sentence its semantics. Specifically, the syntactic feature vector of the example sentence is input into LSTM_syn, and the hidden vector output by LSTM_syn is used as the input of LSTM_sem; the hidden vector output by LSTM_syn performs syntactic guidance on the video description sentence to be generated. Then, LSTM_sem outputs a hidden vector according to the hidden vector output by LSTM_syn and the video semantic feature vector of the input video. Finally, a word vector is determined according to the hidden vector output by LSTM_sem, and the video description sentence is generated according to the words corresponding to the determined word vectors, so that the generated video description sentence can describe the content of the video while being syntactically similar to the example sentence.
As shown in fig. 10, a sample sentence is "analog view of group of pictures on a grass field", and a video description sentence generated for an input video based on the sample sentence is "browsing video of a recording with entries in a glass bowl".
In some embodiments of the present application, after the video description sentence is generated, it can also be output through speech technology.
Embodiments of the apparatus of the present application are described below, which may be used to perform the methods of the above-described embodiments of the present application. For details which are not disclosed in the embodiments of the apparatus of the present application, reference is made to the embodiments of the method described above in the present application.
The present application provides a video descriptive sentence generating apparatus 1100, the video descriptive sentence generating apparatus 1100 may be configured in a server shown in fig. 1, as shown in fig. 11, the video descriptive sentence generating apparatus 1100 includes:
an obtaining module 1110, configured to obtain a syntactic feature vector of the target example sentence.
And a syntax determining module 1120, configured to determine a syntax of the video description sentence to be generated according to the syntax feature vector, so as to obtain syntax information.
The semantic determining module 1130 is configured to determine, according to the syntax information and the video semantic feature vector of the target video, a semantic corresponding to syntax of the video description sentence to be generated, so as to obtain semantic information.
And a video description sentence determining module 1140, configured to generate a video description sentence of the target video according to the semantic information.
In some embodiments of the present application, the syntax determination module is configured to: generating a first hidden vector by a first neural network contained in the description generation model according to the syntactic characteristic vector, wherein the first hidden vector is used for indicating syntactic information, and the description generation model further comprises a second neural network cascaded with the first neural network, and the first neural network and the second neural network are gate control-based cyclic neural networks.
In this embodiment, the semantic determination module is configured to: and generating a second hidden vector by the second neural network according to the first hidden vector and the video semantic feature vector, wherein the second hidden vector is used for indicating semantic information.
In some embodiments of the application, the video description statement determination module is configured to: determining a word vector at the t moment according to a second hidden vector generated by the second neural network at the t moment; and generating a video description sentence according to the word vector output at each moment.
In this embodiment, the syntax determining module includes a first hidden vector generating unit, which is configured to output, by the first neural network, the first hidden vector at time t according to the syntactic feature vector, the word vector at time t-1, and the first hidden vector at time t-1 generated by the first neural network.
In this embodiment, the semantic determining module includes a second hidden vector generating unit, configured to output, by the second neural network, a second hidden vector at time t according to the video semantic feature vector, the first hidden vector at time t, and a second hidden vector at time t-1 generated by the second neural network.
In some embodiments of the present application, the first hidden vector generation unit includes: and the first soft attention weighting unit is used for carrying out soft attention weighting on the syntactic characteristic vector according to the first implicit vector at the time t-1 to obtain a target syntactic characteristic vector corresponding to the time t. And the first splicing unit is used for splicing the target syntactic characteristic vector corresponding to the time t with the word vector at the time t-1 to obtain a first splicing vector corresponding to the time t. And the first output unit is used for correspondingly outputting a first hidden vector at the t moment by taking the first splicing vector corresponding to the t moment as an input through the first neural network.
In some embodiments of the present application, the first neural network includes a first input gate, a first forgetting gate, and a first output gate, and the first output unit includes: a first forgetting gate vector calculating unit, configured to calculate, by the first forgetting gate, a first forgetting gate vector at time t according to the first splicing vector corresponding to time t; a first input gate vector calculating unit, configured to calculate, by the first input gate, a first input gate vector at time t according to the first splicing vector corresponding to time t; a first cell unit vector calculating unit, configured to calculate a first cell unit vector at time t according to the first forgetting gate vector at time t, the first input gate vector at time t, the first unit vector at time t and the first cell unit vector at time t-1 corresponding to the first neural network, where the first unit vector at time t is obtained by hyperbolic tangent calculation from the first splicing vector corresponding to time t; and a first hidden vector calculating unit, configured to calculate a first hidden vector at time t according to the first cell unit vector at time t and the first output gate vector at time t, where the first output gate vector at time t is calculated by the first output gate according to the first splicing vector corresponding to time t.
In some embodiments of the present application, the syntax determination module further comprises: the first normalization unit is used for respectively normalizing a first input gate vector, a first forgetting gate vector, a first output gate vector and a first unit vector in the first neural network; the first transformation unit is used for respectively transforming the normalized first input gate vector, the normalized first forgetting gate vector, the normalized first output gate vector and the normalized first unit vector according to a first offset vector and a first scaling vector to obtain a target first input gate vector, a target first forgetting gate vector, a target first output gate vector and a target first unit vector, wherein the first offset vector is output by the first multilayer perceptron according to a target syntactic characteristic vector corresponding to the moment t, the first scaling vector is output by the second multilayer perceptron according to the target syntactic characteristic vector corresponding to the moment t, and the first multilayer perceptron and the second multilayer perceptron are independent.
In this embodiment, the first cell unit vector calculating unit is further configured to: calculate the first cell unit vector at time t according to the target first forgetting gate vector, the target first input gate vector, the target first unit vector and the first cell unit vector at time t-1;
in this embodiment, the first hidden vector calculating unit is further configured to: calculate the first hidden vector at time t according to the first cell unit vector at time t and the target first output gate vector.
In some embodiments of the present application, the second hidden vector generation unit includes: and the second soft attention weighting unit is used for carrying out soft attention weighting on the video semantic feature vector according to the second hidden vector at the time t-1 to obtain a target video semantic vector corresponding to the time t. And the second splicing unit is used for splicing the target video semantic vector corresponding to the time t with the first hidden vector corresponding to the time t to obtain a second splicing vector corresponding to the time t. And the second output unit is used for correspondingly outputting a second implicit vector at the t moment by taking the second splicing vector corresponding to the t moment as an input through the second neural network.
In some embodiments of the present application, the second neural network includes a second input gate, a second forgetting gate, and a second output gate, and the second output unit includes: a second forgetting gate vector calculating unit, configured to calculate, by the second forgetting gate, a second forgetting gate vector at time t according to the second splicing vector corresponding to time t; a second input gate vector calculating unit, configured to calculate, by the second input gate, a second input gate vector at time t according to the second splicing vector corresponding to time t; a second cell unit vector calculating unit, configured to calculate a second cell unit vector at time t according to the second forgetting gate vector at time t, the second input gate vector at time t, the second unit vector at time t and the second cell unit vector at time t-1 corresponding to the second neural network, where the second unit vector at time t is obtained by hyperbolic tangent calculation from the second splicing vector corresponding to time t; and a second hidden vector calculating unit, configured to calculate a second hidden vector at time t according to the second cell unit vector at time t and the second output gate vector at time t, where the second output gate vector at time t is calculated by the second output gate according to the second splicing vector corresponding to time t.
In some embodiments of the present application, the semantic determination module further comprises: and the second normalization unit is used for respectively normalizing the second input gate vector, the second forgetting gate vector, the second output gate vector and the second unit vector in the second neural network. And the second transformation unit is used for respectively transforming the normalized second input gate vector, the normalized second forgetting gate vector, the normalized second output gate vector and the normalized second unit vector according to a second offset vector and a second scaling vector to obtain a target second input gate vector, a target second forgetting gate vector, a target second output gate vector and a target second unit vector, wherein the second offset vector is output by the third multilayer perceptron according to the target video semantic vector corresponding to the moment t, the second scaling vector is output by the fourth multilayer perceptron according to the target video semantic vector corresponding to the moment t, and the third multilayer perceptron and the fourth multilayer perceptron are independent.
In this embodiment, the second cell unit vector calculating unit is further configured to: calculate the second cell unit vector at time t according to the target second forgetting gate vector, the target second input gate vector, the target second unit vector and the second cell unit vector at time t-1;
in this embodiment, the second hidden vector calculation unit is further configured to: and calculating to obtain a second implicit vector at the time t according to the second cell unit vector at the time t and the target second output gate vector.
In some embodiments of the present application, the generating means of the video description sentence further comprises: and the training data acquisition module is used for acquiring training data, and the training data comprises a plurality of sample videos and sample video description sentences corresponding to the sample videos. The semantic feature extraction module is used for extracting semantic features of the sample video to obtain a sample video semantic feature vector of the sample video; and the syntactic feature extraction module is used for carrying out syntactic feature extraction on the sample video description statement corresponding to the sample video to obtain a sample syntactic feature vector of the sample video description statement. And the first syntax loss determining module is used for outputting a first hidden vector sequence by the first neural network according to the sample syntax feature vector and calculating the first syntax loss through the first hidden vector sequence. And the first semantic loss determining module is used for outputting a second hidden vector sequence by the second neural network according to the first hidden vector sequence and the sample video semantic feature vector of the sample video and calculating the first semantic loss through the second hidden vector sequence. And the first target loss calculation module is used for calculating to obtain a first target loss according to the first syntax loss and the first semantic loss. A first adjustment module to adjust parameters describing the generative model based on the first target loss.
In some embodiments of the present application, the first syntax loss determination module comprises: and the syntax tree prediction unit is used for predicting the syntax tree for the sample description statement through a sixth neural network according to the first hidden vector sequence, and the sixth neural network is a gate-controlled cyclic neural network. And a first syntax loss calculation unit for calculating a first syntax loss according to the predicted syntax tree and the actual syntax tree of the sample description sentence.
In some embodiments of the present application, the first semantic loss determination module comprises: and the first description statement output unit is used for outputting the first description statement for the sample video according to the second hidden vector sequence through a fifth multilayer perceptron. And the first semantic loss calculating unit is used for calculating to obtain a first semantic loss according to the first description statement and the sample video description statement corresponding to the sample video.
In some embodiments of the present application, the generating means of the video description sentence further comprises: the first sample syntax feature vector obtaining module is used for obtaining sample syntax feature vectors of sample statements, wherein the sample statements comprise sample example sentences and sample video description sentences corresponding to sample videos. And the second syntax loss calculation module is used for outputting a first hidden vector sequence by the first neural network according to the sample syntax feature vector of the sample statement and calculating second syntax loss through the first hidden vector sequence corresponding to the sample statement. And the second semantic loss calculation module is used for outputting a second implicit vector sequence by the second neural network according to the sample semantic feature vector of the sample statement and the first implicit vector sequence corresponding to the sample statement, wherein the sample semantic feature vector is obtained by performing semantic feature extraction on the sample statement, and the second semantic loss is calculated through the second implicit vector sequence corresponding to the sample statement. And the second target loss calculation module is used for calculating to obtain a second target loss according to the second syntactic loss and the second semantic loss. A second adjustment module to adjust parameters describing the generative model based on a second target loss.
In some embodiments of the present application, the training data further includes a plurality of sample example sentences, and the apparatus for generating video description sentences further includes: and the second sample syntax feature vector acquisition module is used for acquiring the sample syntax feature vector of the sample example sentence. And the first hidden vector sequence output module is used for outputting a first hidden vector sequence by the first neural network according to the sample syntax characteristic vector of the sample example sentence. And the second hidden vector sequence output module is used for outputting a second hidden vector sequence by the second neural network according to the first hidden vector sequence corresponding to the sample example sentence and the sample video semantic feature vector of the sample video. And the second description statement determining module is used for determining a second description statement according to a second hidden vector sequence corresponding to the sample video. And the third target loss calculation module is used for calculating to obtain a third target loss according to the syntax tree corresponding to the sample example sentence and the syntax tree corresponding to the second description sentence. A third adjustment module to adjust parameters describing the generative model based on a third target loss.
In some embodiments of the present application, the syntactic feature vector corresponding to the target example sentence is obtained through a syntactic model, the syntactic model comprises a third neural network and a fourth neural network which are cascaded, and the third neural network and the fourth neural network are gated recurrent neural networks. In this embodiment, the obtaining module comprises: a character feature vector acquisition unit, configured to acquire the character feature vector of each character included in each word of the target example sentence, the character feature vectors being obtained by encoding the characters; a third hidden vector output unit, configured to output, by the third neural network, a third hidden vector corresponding to each character according to the character feature vector of the character; an averaging unit, configured to, for each word in the target example sentence, average the third hidden vectors corresponding to the characters in the word to obtain the feature vector of the word; and a fourth hidden vector output unit, configured to output, by the fourth neural network, a fourth hidden vector sequence according to the feature vectors of the words in the target example sentence, the fourth hidden vector sequence being used as the syntactic feature vector.
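A sketch of this character-to-word syntactic encoder, assuming GRUs for the third and fourth neural networks and illustrative dimension names:

```python
import torch
import torch.nn as nn

class SyntaxEncoder(nn.Module):
    """Character-level GRU -> per-word averaging -> word-level GRU over the example sentence."""
    def __init__(self, num_chars, char_dim, hidden_dim):
        super().__init__()
        self.char_emb = nn.Embedding(num_chars, char_dim)                 # character feature vectors
        self.char_gru = nn.GRU(char_dim, hidden_dim, batch_first=True)    # "third neural network"
        self.word_gru = nn.GRU(hidden_dim, hidden_dim, batch_first=True)  # "fourth neural network"

    def forward(self, words):
        # words: one LongTensor of character IDs per word of the target example sentence
        word_vecs = []
        for chars in words:
            char_states, _ = self.char_gru(self.char_emb(chars).unsqueeze(0))  # third hidden vectors
            word_vecs.append(char_states.mean(dim=1))                          # average over the word's characters
        word_seq = torch.cat(word_vecs, dim=0).unsqueeze(0)   # (1, num_words, hidden_dim)
        syntax_feats, _ = self.word_gru(word_seq)             # fourth hidden vector sequence
        return syntax_feats                                   # used as the syntactic feature vector
```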
In some embodiments of the present application, the video semantic feature vector corresponding to the target video is obtained through a video semantic model, the video semantic model comprises a cascaded convolutional neural network and a fifth neural network, the fifth neural network is a gated recurrent neural network, and the apparatus for generating a video description sentence further comprises: a video frame sequence acquisition module, configured to acquire a video frame sequence obtained by framing the target video; a semantic extraction module, configured to perform semantic extraction on each video frame in the video frame sequence through the convolutional neural network to obtain a semantic vector of each video frame; and a fifth hidden vector output module, configured to output, through the fifth neural network, a fifth hidden vector sequence according to the semantic vectors of the video frames in the video frame sequence, the fifth hidden vector sequence being used as the video semantic feature vector.
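A comparable sketch for the video side, with a ResNet-18 backbone standing in for the convolutional neural network (the backbone choice and the 512-dimensional frame feature are assumptions):

```python
import torch
import torch.nn as nn
from torchvision import models

class VideoSemanticEncoder(nn.Module):
    """Per-frame CNN features followed by a GRU over the frame sequence."""
    def __init__(self, hidden_dim):
        super().__init__()
        backbone = models.resnet18(weights=None)                   # any frame-level CNN would do here
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop the classification layer
        self.gru = nn.GRU(512, hidden_dim, batch_first=True)       # "fifth neural network"

    def forward(self, frames):                        # frames: (num_frames, 3, H, W) from framing the video
        feats = self.cnn(frames).flatten(1)           # (num_frames, 512) per-frame semantic vectors
        video_feats, _ = self.gru(feats.unsqueeze(0))
        return video_feats                            # fifth hidden vector sequence = video semantic feature vector
```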
FIG. 12 illustrates a schematic structural diagram of a computer system suitable for use in implementing the electronic device of an embodiment of the present application.
It should be noted that the computer system 1200 of the electronic device shown in Fig. 12 is only an example, and should not impose any limitation on the functions or the scope of use of the embodiments of the present application.
As shown in fig. 12, the computer system 1200 includes a Central Processing Unit (CPU)1201, which can perform various appropriate actions and processes, such as executing the methods in the above-described embodiments, according to a program stored in a Read-Only Memory (ROM) 1202 or a program loaded from a storage section 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data necessary for system operation are also stored. The CPU 1201, ROM 1202, and RAM 1203 are connected to each other by a bus 1204. An Input/Output (I/O) interface 1205 is also connected to bus 1204.
The following components are connected to the I/O interface 1205: an input section 1206 including a keyboard, a mouse, and the like; an output section 1207 including a display device such as a Cathode Ray Tube (CRT) or a Liquid Crystal Display (LCD), and a speaker; a storage section 1208 including a hard disk and the like; and a communication section 1209 including a network interface card such as a LAN (Local Area Network) card, a modem, and the like. The communication section 1209 performs communication processing via a network such as the Internet. A drive 1210 is also connected to the I/O interface 1205 as needed. A removable medium 1211, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1210 as necessary, so that a computer program read therefrom is installed into the storage section 1208 as needed.
In particular, according to embodiments of the application, the processes described above with reference to the flow diagrams may be implemented as computer software programs. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated by the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1209, and/or installed from the removable medium 1211. The computer program executes various functions defined in the system of the present application when executed by a Central Processing Unit (CPU) 1201.
It should be noted that the computer readable medium shown in the embodiments of the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM), a flash Memory, an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. Each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams or flowchart illustration, and combinations of blocks in the block diagrams or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present application may be implemented by software or by hardware, and the described units may also be disposed in a processor. The names of these units do not, in some cases, constitute a limitation on the units themselves.
In one aspect of the embodiments of the present application, there is provided an electronic device comprising: a processor; and a memory for storing computer-readable instructions which, when executed by the processor, implement the method for generating a video description sentence in any of the above embodiments.
As another aspect, the present application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments; or may exist separately without being assembled into the electronic device. The computer readable storage medium has stored thereon computer readable instructions which, when executed by a processor, implement a method as in any of the above embodiments.
It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to the embodiments of the present application, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which can be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which can be a personal computer, a server, a touch terminal, a network device, etc.) to execute the method according to the embodiments of the present application.
Other embodiments of the present application will be apparent to those skilled in the art from consideration of the specification and practice of the embodiments disclosed herein. This application is intended to cover any variations, uses, or adaptations of the invention following, in general, the principles of the application and including such departures from the present disclosure as come within known or customary practice within the art to which the invention pertains.
It will be understood that the present application is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the application is limited only by the appended claims.

Claims (17)

1. A method for generating a video description sentence, the method comprising:
obtaining a syntactic feature vector of a target example sentence;
determining the syntax of a video description sentence to be generated according to the syntactic feature vector to obtain syntax information;
determining the semantics of the video description sentence to be generated corresponding to the syntax according to the syntax information and the video semantic feature vector of the target video to obtain semantic information;
and generating a video description sentence of the target video according to the semantic information.
2. The method of claim 1, wherein the determining the syntax of the video description sentence to be generated according to the syntactic feature vector to obtain the syntax information comprises:
generating a first hidden vector according to the syntactic feature vector by a first neural network contained in a description generation model, wherein the first hidden vector is used for indicating the syntax information, the description generation model further comprises a second neural network, and the first neural network and the second neural network are gated recurrent neural networks;
and the determining the semantics of the video description sentence to be generated corresponding to the syntax according to the syntax information and the video semantic feature vector of the target video to obtain the semantic information comprises:
generating, by the second neural network, a second hidden vector from the first hidden vector and the video semantic feature vector, the second hidden vector being used to indicate the semantic information.
3. The method of claim 2, wherein the generating a video description sentence of the target video according to the semantic information comprises:
determining a word vector at time t according to the second hidden vector generated by the second neural network at time t;
generating the video description sentence according to the word vector output at each time;
wherein the generating, by the first neural network contained in the description generation model, a first hidden vector according to the syntactic feature vector comprises:
outputting, by the first neural network, the first hidden vector at time t according to the syntactic feature vector, the word vector at time t-1, and the first hidden vector at time t-1 generated by the first neural network;
and the generating, by the second neural network, a second hidden vector according to the first hidden vector and the video semantic feature vector comprises:
outputting, by the second neural network, the second hidden vector at time t according to the video semantic feature vector, the first hidden vector at time t, and the second hidden vector at time t-1 generated by the second neural network.
4. The method of claim 3, wherein the outputting, by the first neural network, the first hidden vector at time t according to the syntactic feature vector, the word vector at time t-1, and the first hidden vector at time t-1 generated by the first neural network comprises:
performing soft attention weighting on the syntactic feature vector according to the first hidden vector at time t-1 to obtain a target syntactic feature vector corresponding to time t;
splicing the target syntactic feature vector corresponding to time t with the word vector at time t-1 to obtain a first spliced vector corresponding to time t;
and outputting, by the first neural network, the first hidden vector at time t by taking the first spliced vector corresponding to time t as an input.
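Purely to illustrate the per-step flow recited in this claim (attention over the syntactic feature vectors with the previous hidden vector, splicing with the previous word vector, then one recurrent step), a sketch in which an off-the-shelf LSTM cell stands in for the first neural network and all dimension names are assumptions:

```python
import torch
import torch.nn as nn

class SyntaxDecoderStep(nn.Module):
    def __init__(self, feat_dim, word_dim, hidden_dim):
        super().__init__()
        self.attn = nn.Linear(hidden_dim + feat_dim, 1)
        self.cell = nn.LSTMCell(feat_dim + word_dim, hidden_dim)  # stands in for the first neural network

    def forward(self, syntax_feats, prev_word, prev_h, prev_c):
        # syntax_feats: (num_words, feat_dim); prev_word: (word_dim,); prev_h, prev_c: (hidden_dim,)
        # Soft attention weighting of the syntactic feature vectors by the first hidden vector at time t-1.
        scores = self.attn(torch.cat([prev_h.expand(syntax_feats.size(0), -1), syntax_feats], dim=1))
        weights = torch.softmax(scores, dim=0)
        target_syntax = (weights * syntax_feats).sum(dim=0)       # target syntactic feature vector at time t
        # Splice with the word vector at time t-1, then run one recurrent step.
        spliced = torch.cat([target_syntax, prev_word]).unsqueeze(0)
        h_t, c_t = self.cell(spliced, (prev_h.unsqueeze(0), prev_c.unsqueeze(0)))
        return h_t.squeeze(0), c_t.squeeze(0)                     # first hidden vector (and cell state) at time t
```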
5. The method of claim 4, wherein the first neural network comprises a first input gate, a first forgetting gate, and a first output gate, and wherein the outputting, by the first neural network, the first hidden vector at time t by taking the first spliced vector corresponding to time t as an input comprises:
calculating, by the first forgetting gate, a first forgetting gate vector at time t according to the first spliced vector corresponding to time t; and calculating, by the first input gate, a first input gate vector at time t according to the first spliced vector corresponding to time t;
calculating a first cell unit vector at time t according to the first forgetting gate vector at time t, the first input gate vector at time t, a first unit vector at time t, and the first cell unit vector at time t-1 corresponding to the first neural network, wherein the first unit vector at time t is obtained by performing a hyperbolic tangent calculation on the first spliced vector corresponding to time t;
and calculating the first hidden vector at time t according to the first cell unit vector at time t and a first output gate vector at time t, wherein the first output gate vector at time t is calculated by the first output gate according to the first spliced vector corresponding to time t.
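Written in the usual notation, the steps in this claim match the standard LSTM cell update with the first spliced vector $x_t$ as input (a reconstruction; the claim states the gates only in terms of the spliced vector, so no explicit recurrent term $h_{t-1}$ appears in the gate equations, and the weight matrices and biases are not named in the claim):

$$
\begin{aligned}
f_t &= \sigma(W_f x_t + b_f), \quad i_t = \sigma(W_i x_t + b_i), \quad o_t = \sigma(W_o x_t + b_o),\\
g_t &= \tanh(W_g x_t + b_g), \quad c_t = f_t \odot c_{t-1} + i_t \odot g_t, \quad h_t = o_t \odot \tanh(c_t),
\end{aligned}
$$

where $f_t$, $i_t$, and $o_t$ are the first forgetting, input, and output gate vectors, $g_t$ is the first unit vector (the hyperbolic-tangent term), $c_t$ is the first cell unit vector, and $h_t$ is the first hidden vector at time t.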
6. The method of claim 5, wherein before the calculating a first cell unit vector at time t according to the first forgetting gate vector at time t, the first input gate vector at time t, the first unit vector at time t, and the first cell unit vector at time t-1 corresponding to the first neural network, the method further comprises:
respectively normalizing the first input gate vector, the first forgetting gate vector, the first output gate vector, and the first unit vector in the first neural network;
respectively transforming the normalized first input gate vector, the normalized first forgetting gate vector, the normalized first output gate vector, and the normalized first unit vector according to a first offset vector and a first scaling vector to obtain a target first input gate vector, a target first forgetting gate vector, a target first output gate vector, and a target first unit vector, wherein the first offset vector is output by a first multilayer perceptron according to the target syntactic feature vector corresponding to time t, the first scaling vector is output by a second multilayer perceptron according to the target syntactic feature vector corresponding to time t, and the first multilayer perceptron and the second multilayer perceptron are independent of each other;
the calculating a first cell unit vector at time t according to the first forgetting gate vector at time t, the first input gate vector at time t, the first unit vector at time t, and the first cell unit vector at time t-1 corresponding to the first neural network comprises:
calculating the first cell unit vector at time t according to the target first forgetting gate vector, the target first input gate vector, the target first unit vector, and the first cell unit vector at time t-1;
and the calculating the first hidden vector at time t according to the first cell unit vector at time t and the first output gate vector at time t comprises:
calculating the first hidden vector at time t according to the first cell unit vector at time t and the target first output gate vector.
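The normalize-then-transform step in this claim resembles conditional layer normalization: each gate (or unit) vector is normalized, then scaled and shifted using vectors produced by two separate MLPs from the target syntactic feature vector. A sketch under that reading (the layer sizes and tanh hidden layer are assumptions):

```python
import torch
import torch.nn as nn

class SyntaxConditionedGate(nn.Module):
    """Normalizes a gate (or unit) vector, then shifts and scales it with the target syntactic feature vector."""
    def __init__(self, hidden_dim, feat_dim):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.offset_mlp = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Tanh(),
                                        nn.Linear(hidden_dim, hidden_dim))  # "first multilayer perceptron"
        self.scale_mlp = nn.Sequential(nn.Linear(feat_dim, hidden_dim), nn.Tanh(),
                                       nn.Linear(hidden_dim, hidden_dim))   # "second multilayer perceptron"

    def forward(self, gate_vec, target_syntax):
        beta = self.offset_mlp(target_syntax)       # first offset vector
        gamma = self.scale_mlp(target_syntax)       # first scaling vector
        return gamma * self.norm(gate_vec) + beta   # target gate (or unit) vector
```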
7. The method of claim 3, wherein the outputting, by the second neural network, the second hidden vector at time t according to the video semantic feature vector, the first hidden vector at time t, and the second hidden vector at time t-1 generated by the second neural network comprises:
performing soft attention weighting on the video semantic feature vector according to the second hidden vector at time t-1 to obtain a target video semantic vector corresponding to time t;
splicing the target video semantic vector corresponding to time t with the first hidden vector at time t to obtain a second spliced vector corresponding to time t;
and outputting, by the second neural network, the second hidden vector at time t by taking the second spliced vector corresponding to time t as an input.
8. The method of claim 7, wherein the second neural network comprises a second input gate, a second forgetting gate, and a second output gate, and wherein the outputting, by the second neural network, the second hidden vector at time t by taking the second spliced vector corresponding to time t as an input comprises:
calculating, by the second forgetting gate, a second forgetting gate vector at time t according to the second spliced vector corresponding to time t; and calculating, by the second input gate, a second input gate vector at time t according to the second spliced vector corresponding to time t;
calculating a second cell unit vector at time t according to the second forgetting gate vector at time t, the second input gate vector at time t, a second unit vector at time t, and the second cell unit vector at time t-1 corresponding to the second neural network, wherein the second unit vector at time t is obtained by performing a hyperbolic tangent calculation on the second spliced vector corresponding to time t;
and calculating the second hidden vector at time t according to the second cell unit vector at time t and a second output gate vector at time t, wherein the second output gate vector at time t is calculated by the second output gate according to the second spliced vector corresponding to time t.
9. The method of claim 8, wherein before the calculating a second cell unit vector at time t according to the second forgetting gate vector at time t, the second input gate vector at time t, the second unit vector at time t, and the second cell unit vector at time t-1 corresponding to the second neural network, the method further comprises:
respectively normalizing the second input gate vector, the second forgetting gate vector, the second output gate vector, and the second unit vector in the second neural network;
respectively transforming the normalized second input gate vector, the normalized second forgetting gate vector, the normalized second output gate vector, and the normalized second unit vector according to a second offset vector and a second scaling vector to obtain a target second input gate vector, a target second forgetting gate vector, a target second output gate vector, and a target second unit vector, wherein the second offset vector is output by a third multilayer perceptron according to the target video semantic vector corresponding to time t, the second scaling vector is output by a fourth multilayer perceptron according to the target video semantic vector corresponding to time t, and the third multilayer perceptron and the fourth multilayer perceptron are independent of each other;
the calculating a second cell unit vector at time t according to the second forgetting gate vector at time t, the second input gate vector at time t, the second unit vector at time t, and the second cell unit vector at time t-1 corresponding to the second neural network comprises:
calculating the second cell unit vector at time t according to the target second forgetting gate vector, the target second input gate vector, the target second unit vector, and the second cell unit vector at time t-1;
and the calculating the second hidden vector at time t according to the second cell unit vector at time t and the second output gate vector at time t comprises:
calculating the second hidden vector at time t according to the second cell unit vector at time t and the target second output gate vector.
10. The method of claim 2, further comprising:
acquiring training data, wherein the training data comprises a plurality of sample videos and sample video description sentences corresponding to the sample videos;
performing semantic feature extraction on the sample video to obtain a sample video semantic feature vector of the sample video, and performing syntactic feature extraction on the sample video description sentence corresponding to the sample video to obtain a sample syntactic feature vector of the sample video description sentence;
outputting, by the first neural network, a first hidden vector sequence according to the sample syntactic feature vector, and calculating a first syntax loss through the first hidden vector sequence;
outputting, by the second neural network, a second hidden vector sequence according to the first hidden vector sequence and the sample video semantic feature vector of the sample video, and calculating a first semantic loss through the second hidden vector sequence;
calculating a first target loss according to the first syntax loss and the first semantic loss;
and adjusting parameters of the description generation model based on the first target loss.
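A short sketch of how the first target loss could drive a parameter update, assuming the two losses are simply summed (the claim only says the target loss is calculated according to them) and passed to a standard optimizer built over the description generation model's parameters:

```python
def train_step(optimizer, syntax_loss, semantic_loss):
    """One update of the description generation model from the first target loss."""
    target_loss = syntax_loss + semantic_loss   # equal weighting is an assumption
    optimizer.zero_grad()
    target_loss.backward()
    optimizer.step()
    return target_loss.item()
```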
11. The method of claim 10, wherein the calculating a first syntax loss through the first hidden vector sequence comprises:
predicting a syntax tree for the sample video description sentence according to the first hidden vector sequence through a sixth neural network, wherein the sixth neural network is a gated recurrent neural network;
and calculating the first syntax loss according to the predicted syntax tree and the actual syntax tree of the sample video description sentence.
12. The method of claim 10, wherein the calculating a first semantic loss through the second hidden vector sequence comprises:
outputting, by a fifth multilayer perceptron, a first description sentence for the sample video according to the second hidden vector sequence;
and calculating the first semantic loss according to the first description sentence and the sample video description sentence corresponding to the sample video.
13. The method of claim 10, further comprising:
obtaining a sample syntactic feature vector of a sample statement, wherein the sample statement comprises a sample example sentence and a sample video description sentence corresponding to a sample video;
outputting, by the first neural network, a first hidden vector sequence according to the sample syntactic feature vector of the sample statement, and calculating a second syntax loss through the first hidden vector sequence corresponding to the sample statement;
outputting, by the second neural network, a second hidden vector sequence according to a sample semantic feature vector of the sample statement and the first hidden vector sequence corresponding to the sample statement, wherein the sample semantic feature vector is obtained by performing semantic feature extraction on the sample statement, and calculating a second semantic loss through the second hidden vector sequence corresponding to the sample statement;
calculating a second target loss according to the second syntax loss and the second semantic loss;
and adjusting parameters of the description generation model based on the second target loss.
14. The method of claim 10 or 13, wherein the training data further comprises a plurality of sample example sentences, and the method further comprises:
obtaining a sample syntactic feature vector of the sample example sentence;
outputting, by the first neural network, a first hidden vector sequence according to the sample syntactic feature vector of the sample example sentence;
outputting, by the second neural network, a second hidden vector sequence according to the first hidden vector sequence corresponding to the sample example sentence and a sample video semantic feature vector of the sample video;
determining a second description sentence according to the second hidden vector sequence corresponding to the sample video;
calculating a third target loss according to the syntax tree corresponding to the sample example sentence and the syntax tree corresponding to the second description sentence;
and adjusting parameters of the description generation model based on the third target loss.
15. The method of claim 1, wherein the obtaining a syntactic feature vector of a target example sentence comprises:
acquiring a character feature vector of each character included in each word of the target example sentence, wherein the character feature vectors are obtained by encoding the characters;
outputting, by a third neural network, a third hidden vector corresponding to each character according to the character feature vector of the character;
for each word in the target example sentence, averaging the third hidden vectors corresponding to the characters in the word to obtain a feature vector of the word;
and outputting, by a fourth neural network, a fourth hidden vector sequence according to the feature vectors of the words in the target example sentence, wherein the fourth hidden vector sequence is used as the syntactic feature vector, and the third neural network and the fourth neural network are gated recurrent neural networks.
16. The method according to claim 1, wherein before the determining the semantics of the video description sentence to be generated corresponding to the syntax according to the syntax information and the video semantic feature vector of the target video to obtain the semantic information, the method further comprises:
acquiring a video frame sequence obtained by framing the target video;
performing semantic extraction on each video frame in the video frame sequence through a convolutional neural network to obtain a semantic vector of each video frame;
and outputting, through a fifth neural network, a fifth hidden vector sequence according to the semantic vectors of the video frames in the video frame sequence, wherein the fifth hidden vector sequence is used as the video semantic feature vector, and the fifth neural network is a gated recurrent neural network.
17. An apparatus for generating a video description sentence, the apparatus comprising:
an obtaining module, configured to obtain a syntactic feature vector of a target example sentence;
a syntax determining module, configured to determine the syntax of a video description sentence to be generated according to the syntactic feature vector to obtain syntax information;
a semantic determining module, configured to determine the semantics of the video description sentence to be generated corresponding to the syntax according to the syntax information and the video semantic feature vector of a target video to obtain semantic information;
and a video description sentence determining module, configured to generate the video description sentence of the target video according to the semantic information.
CN202010764613.8A 2020-07-31 2020-07-31 Method and related equipment for generating video description sentences Active CN111988673B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010764613.8A CN111988673B (en) 2020-07-31 2020-07-31 Method and related equipment for generating video description sentences

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010764613.8A CN111988673B (en) 2020-07-31 2020-07-31 Method and related equipment for generating video description sentences

Publications (2)

Publication Number Publication Date
CN111988673A true CN111988673A (en) 2020-11-24
CN111988673B CN111988673B (en) 2023-05-23

Family

ID=73444934

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010764613.8A Active CN111988673B (en) 2020-07-31 2020-07-31 Method and related equipment for generating video description sentences

Country Status (1)

Country Link
CN (1) CN111988673B (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6941325B1 (en) * 1999-02-01 2005-09-06 The Trustees Of Columbia University Multimedia archive description scheme
US20160117954A1 (en) * 2014-10-24 2016-04-28 Lingualeo, Inc. System and method for automated teaching of languages based on frequency of syntactic models
US20170150235A1 (en) * 2015-11-20 2017-05-25 Microsoft Technology Licensing, Llc Jointly Modeling Embedding and Translation to Bridge Video and Language
US20180285744A1 (en) * 2017-04-04 2018-10-04 Electronics And Telecommunications Research Institute System and method for generating multimedia knowledge base
WO2019000293A1 (en) * 2017-06-29 2019-01-03 Intel Corporation Techniques for dense video descriptions
WO2019144856A1 (en) * 2018-01-24 2019-08-01 腾讯科技(深圳)有限公司 Video description generation method and device, video playing method and device, and storage medium
CN108765383A (en) * 2018-03-22 2018-11-06 山西大学 Video presentation method based on depth migration study
CN109359214A (en) * 2018-10-15 2019-02-19 平安科技(深圳)有限公司 Video presentation generation method, storage medium and terminal device neural network based
CN109960747A (en) * 2019-04-02 2019-07-02 腾讯科技(深圳)有限公司 The generation method of video presentation information, method for processing video frequency, corresponding device
CN111291221A (en) * 2020-01-16 2020-06-16 腾讯科技(深圳)有限公司 Method and device for generating semantic description for data source and electronic device

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
YITIAN YUAN等: "Sentence Specified Dynamic Video Thumbnail Generation" *
YUECONG XU 等: "Semantic-filtered Soft-Split-Aware video captioning with audio-augmented feature" *

Also Published As

Publication number Publication date
CN111988673B (en) 2023-05-23

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant