CN111079601A - Video content description method, system and device based on multi-mode attention mechanism - Google Patents

Video content description method, system and device based on multi-mode attention mechanism

Info

Publication number
CN111079601A
CN111079601A (application CN201911243331.7A)
Authority
CN
China
Prior art keywords
video
semantic attribute
feature
sequence
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911243331.7A
Other languages
Chinese (zh)
Inventor
胡卫明
孙亮
李兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201911243331.7A priority Critical patent/CN111079601A/en
Publication of CN111079601A publication Critical patent/CN111079601A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the fields of computer vision and natural language processing, and particularly relates to a video content description method, system and device based on a multi-modal attention mechanism, aiming at solving the problem that existing video content description methods consider only video features and ignore high-level semantic attribute information, which leads to low accuracy of the generated description sentences. The method comprises the following steps: acquiring a video frame sequence of the video to be described; extracting multi-modal feature vectors of the video frame sequence, constructing multi-modal feature vector sequences, and obtaining the feature representation corresponding to each modal feature vector sequence through a recurrent neural network; obtaining the semantic attribute vector corresponding to each feature representation through a semantic attribute detection network; and, based on the vector obtained by concatenating the feature representations corresponding to the modal feature vector sequences and on the semantic attribute vectors, generating a description sentence of the video to be described through an attention-based LSTM network. The invention integrates visual features with high-level semantic attributes and improves the accuracy of the generated video description sentences.

Description

Video content description method, system and device based on multi-mode attention mechanism
Technical Field
The invention belongs to the field of computer vision and natural language processing, and particularly relates to a video content description method, system and device based on a multi-mode attention mechanism.
Background
Artificial intelligence can be roughly divided into two research directions: perceptual intelligence and cognitive intelligence. Perceptual intelligence, such as image classification and natural language translation, has progressed rapidly, whereas cognitive intelligence, such as describing pictures in words and visual description, has developed more slowly. Research that combines natural language with computer vision helps build a bridge for communication between humans and machines and promotes the study of cognitive intelligence.
Unlike label-based coarse-grained visual understanding tasks such as video classification and object detection, video content description requires a fluent and accurate sentence to describe the video content. This requires not only identifying the objects in the video but also understanding the interrelationships between them. Meanwhile, video content can be described in many styles, such as abstract descriptions of scenes, descriptions of the relationships among objects, and descriptions of the behavior and motion of objects in the video, which poses great challenges for video content description research. Traditional video content description algorithms mainly adopt language-template-based or retrieval-based methods. Because of the limitation of fixed language templates, template-based methods can only generate sentences of a single form that lack flexibility. Retrieval-based methods depend too heavily on the size of the retrieval video library: when the database lacks a video similar to the video to be described, the generated description sentence deviates greatly from the video content. Moreover, both kinds of methods require a complicated video preprocessing stage and insufficiently optimize the language-sequence part at the back end, so the quality of the generated sentences is poor.
With the advancement of deep learning techniques, sequence learning models based on the encoder-decoder architecture have made breakthroughs on the video content description problem. The present method is also based on an encoder-decoder model: it requires no complex early-stage video processing, is trained end to end directly through the network, and learns the mapping between video and language directly from a large amount of training data, thereby generating video descriptions that are more accurate in content, diverse in form and flexible in grammar.
The key to the video content description problem lies in the extraction of video features; because the different modalities of information in a video can complement one another, encoding the multi-modal information of the video helps mine more semantic information. Meanwhile, because general video content description algorithms consider only the video features and ignore the high-level semantic attribute information of the video, in order to improve the quality of the generated description sentences, the invention also discusses how to extract high-level semantic attributes and apply them to the video content description task. The invention further analyzes the insufficient optimization of the language generation part at the decoder end. Most current video content description algorithms model the language sequence with maximum likelihood and train with a cross-entropy loss, which brings two obvious drawbacks. The first is the exposure bias problem: during training, the decoder input at each time step comes from the ground-truth words of the training set, whereas during testing the input at each time step comes from the word predicted at the previous step; if one word is predicted inaccurately, the error propagates and the quality of the subsequently generated words degrades. The second is the mismatch between the training objective and the evaluation criteria: the training stage maximizes the posterior probability with a cross-entropy loss, while the evaluation stage uses objective criteria such as BLEU, METEOR and CIDEr, and this inconsistency prevents the model from fully optimizing the evaluation metrics of video content description.
Disclosure of Invention
In order to solve the above-mentioned problem in the prior art, that is, to solve the problem that the accuracy of the generated description statement is low due to the fact that the existing video content description method only considers the video features and ignores the high-level semantic attribute information of the video, a first aspect of the present invention provides a video content description method based on a multi-modal attention mechanism, which includes:
step S100, acquiring a video frame sequence of a video to be described as an input sequence;
step S200, extracting multi-modal feature vectors of the input sequence, constructing a multi-modal feature vector sequence, and obtaining feature representations corresponding to the modal feature vector sequences through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence;
step S300, based on the feature representation corresponding to each modal feature vector sequence, respectively obtaining semantic attribute vectors corresponding to each feature representation through a semantic attribute detection network;
step S400, cascading the feature representations corresponding to the modal feature vector sequences to obtain initial coding vectors; obtaining a description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation;
wherein,
the semantic attribute detection network is constructed based on a multilayer perceptron and is trained based on training samples, and the training samples comprise feature representation samples and corresponding semantic attribute vector labels.
In some preferred embodiments, in step S200, "extracting the multi-modal feature vectors of the input sequence, and constructing a multi-modal feature vector sequence", the method includes:
extracting features from each RGB frame in the input sequence with a deep residual network to obtain the video frame feature vector sequence;
obtaining an optical flow sequence from the input sequence through the Lucas-Kanade algorithm, and extracting features from the optical flow sequence with a deep residual network to obtain the optical flow frame feature vector sequence;
dividing the input sequence evenly into T segments, and extracting the feature vectors of each segment with a three-dimensional convolutional deep neural network to obtain the video segment feature vector sequence.
In some preferred embodiments, the training method of the semantic attribute detection network is as follows:
acquiring a training data set, wherein the training data set comprises videos and corresponding description sentences;
extracting words describing sentences in the training data set, sequencing the words according to the occurrence frequency, and selecting the first K words as high-level semantic attribute vectors; acquiring a real semantic attribute vector label of the video according to whether the description statement contains the high-level semantic attribute vector;
acquiring feature representation corresponding to a multi-modal feature vector sequence of the video in the training data set;
and training the semantic attribute detection network based on the feature representation and the real semantic attribute vector labels.
In some preferred embodiments, the loss function loss_1 of the semantic attribute detection network during training is:

loss_1(W_encoder) = -(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} [ y_ik·log(s_ik) + (1 - y_ik)·log(1 - s_ik) ]

where N is the number of description sentences in the training dataset, K is the dimension of the predicted semantic attribute vector label output by the semantic attribute detection network, s_ik is the predicted semantic attribute vector label output by the semantic attribute detection network, y_ik is the real semantic attribute vector label, i and k are subscripts, α is the weight, and W_encoder is the set of all weight-matrix and bias-matrix parameters of the recurrent neural network and the semantic attribute detection network.
In some preferred embodiments, in step S400, "obtaining the sentence description of the video to be described through the LSTM network based on the attention mechanism based on the initial encoding vector and the semantic attribute vector corresponding to each feature representation" includes:
weighting the semantic attribute vectors corresponding to the feature representations through an attention mechanism to obtain multi-modal semantic attribute vectors;
and generating statement description of the video to be described through an LSTM network based on the initial coding vector and the multi-mode semantic attribute vector.
In some preferred embodiments, the attention-based LSTM network performs the calculation of the weight matrix in a training process by using a factorization method.
A second aspect of the present invention provides a video content description system based on a multi-modal attention mechanism, which comprises an acquisition module, a feature representation extraction module, a semantic attribute detection module and a video description generation module;
the acquisition module is configured to acquire a video frame sequence of a video to be described as an input sequence;
the extracted feature representation module is configured to extract multi-modal feature vectors of the input sequence, construct a multi-modal feature vector sequence, and obtain feature representations corresponding to the feature vector sequences of each modality through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence;
the semantic attribute detection module is configured to obtain semantic attribute vectors corresponding to the feature representations respectively through a semantic attribute detection network based on the feature representations corresponding to the modal feature vector sequences;
the video generation description module is configured to cascade feature representations corresponding to the modal feature vector sequences to obtain initial coding vectors; obtaining a description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation;
wherein,
the semantic attribute detection network is constructed based on a multilayer perceptron and is trained based on training samples, and the training samples comprise feature representation samples and corresponding semantic attribute vector labels.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being loaded and executed by a processor to implement the above-mentioned video content description method based on the multi-modal attention mechanism.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described method for video content description based on a multi-modal attention mechanism.
The invention has the beneficial effects that:
the invention integrates visual features and high-level semantic attributes, and improves the accuracy of generating the video description sentences. The invention starts from multi-modal information, obtains a video characteristic vector sequence by adopting a method of combining a video frame, a light stream frame and a video fragment, and simultaneously detects and generates a semantic attribute vector label of the video. In order to obtain more effective visual characteristics and semantic attributes, the auxiliary classification loss and LSTM network loss in the generation stage of the semantic attribute vector labels are optimized simultaneously, and the context relationship in sentences can be ensured. In the decoding stage, an attention mechanism algorithm combined with semantic attributes is provided, the semantic attribute vectors are fused into a traditional recurrent neural network weight matrix, and in each moment of generating a sentence word, the attention mechanism is adopted to pay attention to the specific semantic attributes, so that the accuracy of video content description is improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
FIG. 1 is a flow chart of a method for describing video content based on a multi-modal attention mechanism according to an embodiment of the present invention;
FIG. 2 is a block diagram of a multi-modal attention mechanism based video content description system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a training process of a video content description method based on a multi-modal attention mechanism according to an embodiment of the present invention;
fig. 4 is a schematic network structure diagram of a semantic attribute detection network according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The video content description method based on the multi-modal attention mechanism, as shown in fig. 1, includes the following steps:
step S100, acquiring a video frame sequence of a video to be described as an input sequence;
step S200, extracting multi-modal feature vectors of the input sequence, constructing a multi-modal feature vector sequence, and obtaining feature representations corresponding to the modal feature vector sequences through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence;
step S300, based on the feature representation corresponding to each modal feature vector sequence, respectively obtaining semantic attribute vectors corresponding to each feature representation through a semantic attribute detection network;
step S400, cascading the feature representations corresponding to the modal feature vector sequences to obtain initial coding vectors; obtaining a description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation;
wherein,
the semantic attribute detection network is constructed based on a multilayer perceptron and is trained based on training samples, and the training samples comprise feature representation samples and corresponding semantic attribute vector labels.
In order to more clearly explain the video content description method based on the multi-modal attention mechanism, the following describes the steps in an embodiment of the method in detail with reference to the accompanying drawings.
The programming language used to implement the method of the present invention is not limited; the method can be written in any language. In this embodiment, a working program of the video content description method based on the multi-modal attention mechanism was written in Python and run on a server with four Titan Xp GPUs, each with 12 GB of memory, to implement the method of the invention. The specific steps are as follows:
step S100, acquiring a video frame sequence of a video to be described as an input sequence.
In this embodiment, the video to be described may be a video captured in real time; for example, in intelligent surveillance and behavior analysis scenarios, the video captured by a camera in real time needs to be described, and in this case the video to be described is the real-time camera video. The video to be described may also be a video obtained from the network; for example, in a video content preview scenario, a video obtained from the network needs to be described in natural language so that the user can preview its content, and in this case the video to be described is the network video to be previewed. The video to be described may also be a locally stored video; for example, in a classified video storage scenario, videos need to be described and then classified and stored according to the description information, and in this case the video to be described is the locally stored video to be classified and stored. A video frame sequence is extracted from the acquired video to be described and used as input.
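As an illustrative sketch of step S100 (not part of the original disclosure), the frame sequence could be extracted with OpenCV as follows; the sampling stride and frame size are assumed values chosen for illustration:

import cv2


def extract_frame_sequence(video_path, stride=8, size=(224, 224)):
    """Read a video file and return a list of RGB frames sampled every `stride` frames."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        if index % stride == 0:
            frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
            frames.append(cv2.resize(frame_rgb, size))
        index += 1
    cap.release()
    return frames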
Step S200, extracting multi-modal feature vectors of the input sequence, constructing a multi-modal feature vector sequence, and obtaining feature representations corresponding to the modal feature vector sequences through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence.
In the present example, the multi-modal video features of the video to be described are extracted, namely video frames, optical flow frames and video clips. The specific steps are as follows:
step S201, using a pre-trained depth residual error network to extract the features of each frame of a video frame sequence of a video to be described, using the output of the ith layer of the network as the feature representation of the frame, and obtaining a video frame feature vector sequence
Figure BDA0002306852780000091
Inputting the video frame feature vector sequence into the recurrent neural network LSTM in sequence, and hiding the hidden state h at the last moment of the networktThe feature representation, denoted v, of the video frame feature vector as a videof
Step S202, an optical flow sequence of the video is generated from the video frame sequence of the video to be described with the Lucas-Kanade algorithm; features are extracted from each optical flow frame with a pre-trained deep residual network, and the output of the i-th layer of the network is taken as the feature representation of that frame, giving the optical flow frame feature vector sequence {v1^o, v2^o, ..., vn^o}. The optical flow frame feature vectors are fed into the recurrent neural network LSTM in order, and the hidden state h_t of the network at the last time step is taken as the feature representation of the optical flow frames, denoted v_o.
Step S203, the video frame sequence of the video to be described is divided evenly into T segments; features are extracted from each segment with a three-dimensional convolutional deep neural network, and the output of the i-th layer of the network is taken as the feature representation of the t-th segment of the video, giving the video segment feature vector sequence {v1^c, v2^c, ..., vT^c}. The video segment feature vectors are fed into the recurrent neural network LSTM in order, and the hidden state h_t of the network at the last time step is taken as the feature representation of the video segments, denoted v_c.
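The multi-modal feature extraction of step S200 could be sketched in Python/PyTorch as follows. This is an illustrative sketch rather than the patented implementation: ResNet-152 stands in for the deep residual network, torchvision's r3d_18 stands in for the three-dimensional convolutional network (Fig. 3 refers to a C3D feature), pretrained weights are omitted to keep the sketch self-contained, and feeding three-channel optical-flow frames to the 2D backbone is a simplifying assumption:

import torch
import torch.nn as nn
from torchvision import models


class ModalityEncoder(nn.Module):
    """Encode a sequence of per-frame (or per-segment) feature vectors with an LSTM
    and return the hidden state at the last time step as the video-level feature."""

    def __init__(self, feat_dim, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, feats):                 # feats: (batch, seq_len, feat_dim)
        _, (h_n, _) = self.lstm(feats)
        return h_n[-1]                        # (batch, hidden_dim), i.e. v_f / v_o / v_c


# 2D backbone for RGB frames and optical-flow frames (classification layer removed).
cnn2d = models.resnet152()
cnn2d.fc = nn.Identity()

# 3D backbone for video segments (hypothetical stand-in for the 3D CNN).
cnn3d = models.video.r3d_18()
cnn3d.fc = nn.Identity()

frame_encoder = ModalityEncoder(feat_dim=2048)    # ResNet-152 feature size
flow_encoder = ModalityEncoder(feat_dim=2048)
clip_encoder = ModalityEncoder(feat_dim=512)      # r3d_18 feature size

# Example shapes: 30 sampled frames, 30 flow frames, T=10 clips of 16 frames each.
rgb = torch.randn(1, 30, 3, 224, 224)
flow = torch.randn(1, 30, 3, 224, 224)
clips = torch.randn(1, 10, 3, 16, 112, 112)

with torch.no_grad():
    frame_feats = cnn2d(rgb.flatten(0, 1)).view(1, 30, -1)
    flow_feats = cnn2d(flow.flatten(0, 1)).view(1, 30, -1)
    clip_feats = cnn3d(clips.flatten(0, 1)).view(1, 10, -1)

v_f = frame_encoder(frame_feats)   # feature representation of the video frames
v_o = flow_encoder(flow_feats)     # feature representation of the optical flow frames
v_c = clip_encoder(clip_feats)     # feature representation of the video segments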
In the process of extracting the multi-modal feature representations of the video in the above steps, as shown in Fig. 3, the input Video is divided into video frames (Frame), optical flow frames (Optical flow) and video clips (Video clip): the video frames yield the Frame Feature (the static feature), the video clips yield the C3D Feature (the 3D-CNN feature of the video clip), and the optical flow frames yield the Motion Feature (the dynamic feature). The other steps in Fig. 3 are described below.
And step S300, respectively obtaining semantic attribute vectors corresponding to the feature representations through a semantic attribute detection network based on the feature representations corresponding to the modal feature vector sequences.
In this embodiment, training of the semantic attribute detection network is introduced first, and then semantic attribute vectors corresponding to the feature representations obtained by the semantic attribute detection network are introduced.
The semantic attribute detection network is constructed based on a multilayer perceptron; its structure is shown in Fig. 4 and comprises an input layer, a hidden layer and an output layer. A video (input video) and its corresponding description sentence ("A Small child playing the guitar") are input, the multi-modal feature vector sequence (v_i1, v_i2, ..., v_in) is obtained through the recurrent neural network (LSTM), and the semantic attribute detection network outputs the semantic attribute vector s_i1, s_i2, ..., s_iK. The specific training process of the semantic attribute detection network is as follows:
step A301, a training data set is obtained, wherein the training data set comprises videos and corresponding description sentences.
Step A302, extracting words describing sentences in the training data set, sequencing the words according to the occurrence frequency, and selecting the first K words as high-level semantic attribute vectors; and acquiring a real semantic attribute vector label of the video according to whether the description statement contains the high-level semantic attribute vector.
The words of the description sentences in the training dataset are extracted and sorted by frequency of occurrence; after function words are removed, the K words with the highest occurrence probability are selected as the high-level semantic attribute values.
Suppose the training dataset has N sentences and y_i = [y_i1, y_i2, ..., y_il, ..., y_iK] is the true semantic attribute vector label of the i-th video, where y_il = 1 if the description sentence corresponding to video i contains the attribute word l, and y_il = 0 otherwise.
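Steps A301-A302 could be sketched as follows; the whitespace tokenization, the stop-word list and the value of K are illustrative assumptions rather than choices fixed by the invention:

from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}


def build_attribute_vocab(sentences, K=300):
    """Return the K most frequent non-stop words over all description sentences."""
    counts = Counter()
    for sent in sentences:
        counts.update(w for w in sent.lower().split() if w not in STOP_WORDS)
    return [w for w, _ in counts.most_common(K)]


def attribute_labels(sentence, vocab):
    """Binary label vector y_i: y_il = 1 iff attribute word l occurs in the sentence."""
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocab]


vocab = build_attribute_vocab(["a small child is playing the guitar",
                               "a man is playing a guitar on stage"], K=5)
print(vocab)                                                     # top-K attribute words
print(attribute_labels("a small child is playing the guitar", vocab))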
Step a303 is to obtain feature representations corresponding to the multi-modal feature vector sequences of the video in the training dataset by the method in step S200.
Step A304, training the semantic attribute detection network based on the feature representation and the real semantic attribute vector labels.
Let v_i ∈ {v_f, v_o, v_c} denote the feature representation learned for the i-th video; the training sample is then {v_i, y_i}. In the invention, the semantic attribute detection network constructed from a multilayer perceptron learns a function f(·): R^m → R^K, i.e. a mapping from an m-dimensional space to a K-dimensional space, where R^m is the m-dimensional real space and R^K likewise, m is the dimension of the input feature representation, and K is the dimension of the output semantic attribute vector, equal to the number of extracted semantic attribute values (the dimension of the high-level semantic attribute vector). The multilayer perceptron outputs the vector s_i = [s_i1, ..., s_iK], the predicted semantic attribute vector label of the i-th video. The classification loss function loss_1 of the semantic attribute detection network is shown in equation (1):

loss_1(W_encoder) = -(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} [ y_ik·log(s_ik) + (1 - y_ik)·log(1 - s_ik) ]    (1)

where W_encoder denotes the set of all weight-matrix and bias-matrix parameters of the recurrent neural network and the semantic attribute detection network, α is the weight, s_i = α(f(v_i)) is the learned K-dimensional vector, α(·) denotes the sigmoid function, and f(·) is implemented by the multilayer perceptron.
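A minimal sketch of the semantic attribute detection network and the classification loss of equation (1), assuming a single hidden layer of illustrative size; PyTorch's binary cross-entropy corresponds to the per-attribute terms of loss_1:

import torch
import torch.nn as nn


class SemanticAttributeDetector(nn.Module):
    """Multilayer perceptron f(.): R^m -> R^K followed by a sigmoid, giving s_i."""

    def __init__(self, m, K, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(m, hidden), nn.ReLU(), nn.Linear(hidden, K))

    def forward(self, v):                     # v: (batch, m) feature representation
        return torch.sigmoid(self.mlp(v))     # s: (batch, K) predicted attribute probabilities


detector = SemanticAttributeDetector(m=512, K=300)
v_i = torch.randn(8, 512)                     # feature representations of 8 videos
y_i = torch.randint(0, 2, (8, 300)).float()   # ground-truth attribute labels

s_i = detector(v_i)
loss1 = nn.functional.binary_cross_entropy(s_i, y_i)   # mean over N and K, as in equation (1)
loss1.backward()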
After training of the semantic attribute detection network is completed, in an actual application process, semantic attribute vectors corresponding to the feature representations are obtained through the semantic attribute detection network respectively based on the feature representations corresponding to the modal feature vector sequences. Such as the Multimodal Semantic Detector module in fig. 3.
Step S400, cascading the feature representations corresponding to the modal feature vector sequences to obtain initial coding vectors; and obtaining the description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation.
In this embodiment, {v_f, v_o, v_c} are concatenated to form the initial encoding vector v, as in the concatenation step in Fig. 3, and the Attention Fusion step is based on the attention mechanism module.
The following description first introduces the training process of the attention-based LSTM network, and then introduces the method for obtaining the descriptive statement of the video to be described through the attention-based LSTM network.
The attention-based LSTM network is trained with the input description sentence "A Small child playing the guitar"; the specific training process is as follows:
when the output is a sentence, the LSTM network is used as a decoder, where the long-term dependence of the sentence can be captured. Suppose that the word input at the present time is wtThe hidden state at a moment on the LSTM network is ht-1The last memory state of the cell is ct-1Then, the update rule of the LSTM network at time t is as shown in equations (2) (3) (4) (5) (6) (7) (8):
i_t = σ(W_i·w_t + U_hi·h_{t-1} + z)    (2)
f_t = σ(W_f·w_t + U_hf·h_{t-1} + z)    (3)
o_t = σ(W_o·w_t + U_ho·h_{t-1} + z)    (4)
c̃_t = tanh(W_c·w_t + U_hc·h_{t-1} + z)    (5)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t    (6)
h_t = o_t ⊙ tanh(c_t)    (7)
z = 1(t=1)·C·v    (8)
In the above formulas the subscript * ∈ {i, f, o, c}; W_*, U_h* and C are all weight matrices, and i_t, f_t, o_t, c_t, c̃_t denote the states of the input gate, forget gate, output gate, memory cell and compressed input at time t, respectively. tanh(·) denotes the hyperbolic tangent function, 1(t=1) denotes the indicator function, the initial encoding vector v is input at the initial time step of the LSTM, and z indicates that the video vector is input only at the initial time step t=1. For simplicity, the bias terms in the above formulas are omitted.
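One decoder step of equations (2)-(8) could be sketched as follows; the dimensions are illustrative, bias terms are omitted as in the text, and the gate weight matrices here are hypothetical placeholders:

import torch
import torch.nn as nn

n_x, n_h, n_v = 300, 512, 1536                  # word, hidden and video-vector sizes (illustrative)
W = nn.ModuleDict({g: nn.Linear(n_x, n_h, bias=False) for g in "ifoc"})
U = nn.ModuleDict({g: nn.Linear(n_h, n_h, bias=False) for g in "ifoc"})
C = nn.Linear(n_v, n_h, bias=False)             # maps the initial encoding vector v into z


def lstm_step(w_t, h_prev, c_prev, v, t):
    z = C(v) if t == 1 else torch.zeros_like(h_prev)         # eq. (8)
    i = torch.sigmoid(W["i"](w_t) + U["i"](h_prev) + z)       # eq. (2)
    f = torch.sigmoid(W["f"](w_t) + U["f"](h_prev) + z)       # eq. (3)
    o = torch.sigmoid(W["o"](w_t) + U["o"](h_prev) + z)       # eq. (4)
    c_tilde = torch.tanh(W["c"](w_t) + U["c"](h_prev) + z)    # eq. (5)
    c = f * c_prev + i * c_tilde                              # eq. (6)
    h = o * torch.tanh(c)                                     # eq. (7)
    return h, c


h, c = torch.zeros(1, n_h), torch.zeros(1, n_h)
h, c = lstm_step(torch.randn(1, n_x), h, c, torch.randn(1, n_v), t=1)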
To better exploit the side information from the semantic attributes of the multiple modalities, an attention mechanism is proposed to compute the weight matrices W_* and U_h* in combination with the semantic attributes: each weight matrix of the traditional LSTM is expanded into a set of K attribute-related weight matrices for mining the meaning of individual words, so as to generate the final description sentence. The initial weight matrices W_* / U_h* are replaced by W_*(S_t) / U_h*(S_t), where S_t ∈ R^K is the multi-modal semantic attribute vector and changes dynamically over time. Specifically, two weight tensors are defined, W_τ ∈ R^{K×n_h×n_x} and U_τ ∈ R^{K×n_h×n_h}, where n_h is the number of hidden units and n_x is the dimension of the word embedding vector. W_*(S_t) / U_h*(S_t) are given by equations (9) and (10):

W_*(S_t) = Σ_{k=1}^{K} S_t[k]·W_τ[k]    (9)
U_h*(S_t) = Σ_{k=1}^{K} S_t[k]·U_τ[k]    (10)

where W_τ[k] and U_τ[k] are the k-th 2D slices of the weight tensors W_τ and U_τ, each associated with the probability value S_t[k], and S_t[k] is the k-th element of the multi-modal semantic attribute vector S_t. Since W_τ is a three-dimensional tensor, a 2D slice refers to a two-dimensional matrix of W_τ.
The calculation of S_t is given by equations (11), (12) and (13):

S_t = Σ_{i=1}^{l} a_ti·s_i    (11)
a_ti = exp(e_ti) / Σ_{j=1}^{l} exp(e_tj)    (12)
e_ti = w^T·tanh(W_a·h_{t-1} + U_a·s_i)    (13)

where l = 3 denotes the three learned semantic attribute vectors {s_f, s_o, s_c}, and the attention weight a_ti reflects the importance of the i-th semantic attribute of the video at the current time step. It can be seen that the semantic attribute vector S_t differs for different time steps t, so that the model selectively focuses on different semantic attribute parts of the video each time a word is generated; j denotes a subscript, e_ti denotes the unnormalized attention weight, and w^T denotes a transformation matrix.
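Equations (11)-(13) could be sketched as follows; the layer sizes are illustrative assumptions:

import torch
import torch.nn as nn

K, n_h, n_a = 300, 512, 256
W_a = nn.Linear(n_h, n_a, bias=False)
U_a = nn.Linear(K, n_a, bias=False)
w = nn.Linear(n_a, 1, bias=False)               # plays the role of w^T in eq. (13)


def multimodal_attribute(h_prev, attrs):
    """attrs: (l=3, K) stacked semantic attribute vectors; returns S_t of shape (K,)."""
    e = torch.stack([w(torch.tanh(W_a(h_prev) + U_a(s))).squeeze() for s in attrs])  # eq. (13)
    a = torch.softmax(e, dim=0)                                                      # eq. (12)
    return (a.unsqueeze(1) * attrs).sum(dim=0)                                       # eq. (11)


s_f, s_o, s_c = torch.rand(K), torch.rand(K), torch.rand(K)
S_t = multimodal_attribute(torch.randn(n_h), torch.stack([s_f, s_o, s_c]))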
The training of the attention-based LSTM network is equivalent to jointly training K LSTMs, so the number of network parameters is proportional to the value of K; when K is large, the network can hardly be trained. The following factorization is therefore adopted, as shown in equations (14) and (15):

W_*(S_t) = W_a·diag(W_b·S_t)·W_c    (14)
U_h*(S_t) = U_a·diag(U_b·S_t)·U_c    (15)
where W_a ∈ R^{n_h×n_f}, W_b ∈ R^{n_f×K}, W_c ∈ R^{n_f×n_x}, U_a ∈ R^{n_h×n_f}, U_b ∈ R^{n_f×K}, U_c ∈ R^{n_f×n_h}, and n_f denotes the relevant hyper-parameter of the factorization.
The factorization greatly reduces the number of network parameters and thus solves the problem that the parameter count of the original network is proportional to K, as the following analysis shows. In equations (9) and (10), the total parameter count is K·n_h·(n_x+n_h), which is proportional to K. In equations (14) and (15), W_*(S_t) has n_f·(n_h+K+n_x) parameters and U_h*(S_t) has n_f·(2n_h+K) parameters, for a total of n_f·(3n_h+2K+n_x). When n_f = n_h and K is large, n_f·(3n_h+2K+n_x) is much smaller than K·n_h·(n_x+n_h).
Substituting the factorized equations (14) and (15) into the LSTM update rules yields equations (16), (17) and (18):

x̂_t = (W_b·S_t) ⊙ (W_c·w_t)    (16)
ĥ_{t-1} = (U_b·S_t) ⊙ (U_c·h_{t-1})    (17)
i_t = σ(W_a·x̂_t + U_a·ĥ_{t-1} + z)    (18)

where ⊙ denotes the element-wise multiplication operator. For every element value of S_t, the parameter matrices W_a and U_a are shared, which effectively captures the common language patterns in videos, while the diagonal matrices diag(W_b·S_t) and diag(U_b·S_t) attend to the specific semantic attribute parts of different videos; x̂_t denotes the input merged with the semantic attribute vector and ĥ_{t-1} denotes the hidden state merged with the semantic attribute vector. In the same way, it can be shown that the expressions for f_t, o_t and c_t are analogous to the above formulas.
From the above formulas it can be seen that, after the network is fully trained, it can not only effectively capture the common language pattern parts in videos but also focus on the specific semantic attribute parts of each video; meanwhile, because factorization is adopted, the number of network parameters is greatly reduced, solving the problem that the parameter count of the original network is proportional to the value of K.
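The factorization of equations (14)-(18) and the parameter comparison above could be sketched as follows; all sizes are illustrative:

import torch

K, n_h, n_x = 300, 512, 300
n_f = n_h                                              # n_f = n_h, as suggested in the text

W_a, W_b, W_c = torch.randn(n_h, n_f), torch.randn(n_f, K), torch.randn(n_f, n_x)
U_a, U_b, U_c = torch.randn(n_h, n_f), torch.randn(n_f, K), torch.randn(n_f, n_h)


def factorized_input_gate(w_t, h_prev, S_t, z):
    x_hat = (W_b @ S_t) * (W_c @ w_t)                     # eq. (16): input merged with S_t
    h_hat = (U_b @ S_t) * (U_c @ h_prev)                  # eq. (17): hidden state merged with S_t
    return torch.sigmoid(W_a @ x_hat + U_a @ h_hat + z)   # eq. (18); f_t, o_t, c_t are analogous


gate = factorized_input_gate(torch.randn(n_x), torch.randn(n_h),
                             torch.rand(K), torch.zeros(n_h))

# Parameter comparison from the text: K*n_h*(n_x + n_h) vs n_f*(3*n_h + 2*K + n_x).
print("unfactorized:", K * n_h * (n_x + n_h))          # ~1.25e8 per gate
print("factorized:  ", n_f * (3 * n_h + 2 * K + n_x))  # ~1.25e6 per gate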
In the process of generating the description sentence of the video, greedy search is adopted, and the word output at each time step is given by equation (19):

w_t = softmax(W·h_t)    (19)

where W is the transformation matrix.
Accordingly, the loss function loss_2 for the sentences generated by the network is designed as shown in equation (20):

loss_2 = -log P(Y | v, s_f, s_c, s_o) = -Σ_t log P(w_t | w_0~t-1)    (20)

where Y = {w_1, w_2, ..., w_N} denotes a sentence of N words and w_0~t-1 denotes the words generated before time t.
The classification loss loss_1 for generating the high-level semantic attributes and the loss loss_2 for generating the description sentence are added together and optimized simultaneously, which ensures the contextual relationships within the sentence. Based on the resulting loss value, the network is trained with the back-propagation algorithm; this corresponds to the Classification Loss module and the Captioning Loss module in Fig. 3, which are added to give the Total Loss module.
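The captioning loss of equations (19)-(20) and the joint objective could be sketched as follows; the vocabulary size and dimensions are illustrative, and loss_1 is represented by a placeholder scalar standing in for the attribute classification loss of equation (1):

import torch
import torch.nn as nn

vocab_size, n_h = 10000, 512
W_out = nn.Linear(n_h, vocab_size)                 # the transformation matrix W in eq. (19)

hidden_states = torch.randn(12, n_h)               # decoder hidden states h_t for a 12-word sentence
targets = torch.randint(0, vocab_size, (12,))      # ground-truth word indices

logits = W_out(hidden_states)                      # softmax(W·h_t) is applied inside the loss
loss2 = nn.functional.cross_entropy(logits, targets)   # -sum log P(w_t | w_0~t-1), averaged

loss1 = torch.tensor(0.7, requires_grad=True)      # placeholder for the classification loss of eq. (1)
total_loss = loss1 + loss2                         # Total Loss in Fig. 3
total_loss.backward()                              # back-propagation through the whole network

# At test time, greedy search picks w_t = argmax softmax(W·h_t) at each step.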
And after the LSTM network based on the attention mechanism is trained, obtaining a description statement of the video to be described through the LSTM network based on the attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation.
A second embodiment of the present invention is a video content description system based on a multi-modal attention mechanism, as shown in fig. 2, including: the system comprises an acquisition module 100, an extracted feature representation module 200, a semantic attribute detection module 300 and a video description generation module 400;
the acquiring module 100 is configured to acquire a video frame sequence of a video to be described as an input sequence;
the extracted feature representation module 200 is configured to extract the multi-modal feature vectors of the input sequence, construct a multi-modal feature vector sequence, and obtain feature representations corresponding to the feature vector sequences of each modality through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence;
the semantic attribute detection module 300 is configured to obtain semantic attribute vectors corresponding to the feature representations respectively through a semantic attribute detection network based on the feature representations corresponding to the feature vector sequences of the respective modalities;
the generated video description module 400 is configured to cascade feature representations corresponding to the feature vector sequences of each modality to obtain an initial coding vector; obtaining a description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation;
wherein,
the semantic attribute detection network is constructed based on a multilayer perceptron and is trained based on training samples, and the training samples comprise feature representation samples and corresponding semantic attribute vector labels.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
It should be noted that, the video content description system based on the multi-modal attention mechanism provided in the foregoing embodiment is only illustrated by the division of the above functional modules, and in practical applications, the functions may be allocated to different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the present invention stores therein a plurality of programs, which are adapted to be loaded by a processor and to implement the above-described video content description method based on the multi-modal attention mechanism.
A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described method for video content description based on a multi-modal attention mechanism.
It can be clearly understood by those skilled in the art that, for convenience and brevity, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method examples, and are not described herein again.
Those of skill in the art would appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules and method steps may be located in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (9)

1. A method for describing video content based on a multi-modal attention mechanism, the method comprising:
step S100, acquiring a video frame sequence of a video to be described as an input sequence;
step S200, extracting multi-modal feature vectors of the input sequence, constructing a multi-modal feature vector sequence, and obtaining feature representations corresponding to the modal feature vector sequences through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence;
step S300, based on the feature representation corresponding to each modal feature vector sequence, respectively obtaining semantic attribute vectors corresponding to each feature representation through a semantic attribute detection network;
step S400, cascading the feature representations corresponding to the modal feature vector sequences to obtain initial coding vectors; obtaining a description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation;
wherein,
the semantic attribute detection network is constructed based on a multilayer perceptron and is trained based on training samples, and the training samples comprise feature representation samples and corresponding semantic attribute vector labels.
2. The method for describing video content based on multi-modal attention mechanism according to claim 1, wherein in step S200, "extracting multi-modal feature vectors of the input sequence to construct a multi-modal feature vector sequence" comprises:
extracting features from each RGB frame in the input sequence with a deep residual network to obtain the video frame feature vector sequence;
obtaining an optical flow sequence from the input sequence through the Lucas-Kanade algorithm, and extracting features from the optical flow sequence with a deep residual network to obtain the optical flow frame feature vector sequence;
dividing the input sequence evenly into T segments, and extracting the feature vectors of each segment with a three-dimensional convolutional deep neural network to obtain the video segment feature vector sequence.
3. The method for describing video content based on multi-modal attention mechanism according to claim 1, wherein the semantic attribute detection network is trained by:
acquiring a training data set, wherein the training data set comprises videos and corresponding description sentences;
extracting words describing sentences in the training data set, sequencing the words according to the occurrence frequency, and selecting the first K words as high-level semantic attribute vectors; acquiring a real semantic attribute vector label of the video according to whether the description statement contains the high-level semantic attribute vector;
acquiring feature representation corresponding to a multi-modal feature vector sequence of the video in the training data set;
and training the semantic attribute detection network based on the feature representation and the real semantic attribute vector labels.
4. The method according to claim 3, wherein the loss function loss_1 of the semantic attribute detection network during training is:

loss_1(W_encoder) = -(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} [ y_ik·log(s_ik) + (1 - y_ik)·log(1 - s_ik) ]

where N is the number of description sentences in the training dataset, K is the dimension of the predicted semantic attribute vector label output by the semantic attribute detection network, s_ik is the predicted semantic attribute vector label output by the semantic attribute detection network, y_ik is the real semantic attribute vector label, i and k are subscripts, α is the weight, and W_encoder is the set of all weight-matrix and bias-matrix parameters of the recurrent neural network and the semantic attribute detection network.
5. The method according to claim 1, wherein in step S400, "obtaining the description sentence of the video to be described through an attention-based LSTM network based on the initial coding vector and the semantic attribute vector corresponding to each feature representation" includes:
weighting the semantic attribute vectors corresponding to the feature representations through an attention mechanism to obtain multi-modal semantic attribute vectors;
and generating statement description of the video to be described through an LSTM network based on the initial coding vector and the multi-mode semantic attribute vector.
6. The method for video content description based on multi-modal attention mechanism according to claim 1, wherein the LSTM network based on attention mechanism adopts a factorization method for weight matrix calculation during training.
7. A video content description system based on a multi-mode attention mechanism is characterized by comprising an acquisition module, an extracted feature representation module, a semantic attribute detection module and a generated video description module;
the acquisition module is configured to acquire a video frame sequence of a video to be described as an input sequence;
the extracted feature representation module is configured to extract multi-modal feature vectors of the input sequence, construct a multi-modal feature vector sequence, and obtain feature representations corresponding to the feature vector sequences of each modality through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence;
the semantic attribute detection module is configured to obtain semantic attribute vectors corresponding to the feature representations respectively through a semantic attribute detection network based on the feature representations corresponding to the modal feature vector sequences;
the video generation description module is configured to cascade feature representations corresponding to the modal feature vector sequences to obtain initial coding vectors; obtaining a description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation;
wherein,
the semantic attribute detection network is constructed based on a multilayer perceptron and is trained based on training samples, and the training samples comprise feature representation samples and corresponding semantic attribute vector labels.
8. A storage device having a plurality of programs stored therein, wherein the program applications are loaded and executed by a processor to implement the method for multi-modal attention mechanism based video content description according to any one of claims 1-6.
9. A processing device comprising a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; characterized in that the program is adapted to be loaded and executed by a processor to implement the method for multi-modal attention mechanism based video content description according to any of claims 1-6.
CN201911243331.7A 2019-12-06 2019-12-06 Video content description method, system and device based on multi-mode attention mechanism Pending CN111079601A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911243331.7A CN111079601A (en) 2019-12-06 2019-12-06 Video content description method, system and device based on multi-mode attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911243331.7A CN111079601A (en) 2019-12-06 2019-12-06 Video content description method, system and device based on multi-mode attention mechanism

Publications (1)

Publication Number Publication Date
CN111079601A true CN111079601A (en) 2020-04-28

Family

ID=70313089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911243331.7A Pending CN111079601A (en) 2019-12-06 2019-12-06 Video content description method, system and device based on multi-mode attention mechanism

Country Status (1)

Country Link
CN (1) CN111079601A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038221A (en) * 2017-03-22 2017-08-11 杭州电子科技大学 A kind of video content description method guided based on semantic information
CN110110145A (en) * 2018-01-29 2019-08-09 腾讯科技(深圳)有限公司 Document creation method and device are described
CN109344288A (en) * 2018-09-19 2019-02-15 电子科技大学 A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism
CN110333774A (en) * 2019-03-20 2019-10-15 中国科学院自动化研究所 A kind of remote user's attention appraisal procedure and system based on multi-modal interaction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIANG SUN ET AL.: "Multimodal Semantic Attention Network for Video Captioning", https://arxiv.org/abs/1905.02963v1 *
DAI GUOQIANG ET AL.: "Science and Technology Big Data" (《科技大数据》), 31 August 2018, Scientific and Technical Documentation Press *

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723649A (en) * 2020-05-08 2020-09-29 天津大学 Short video event detection method based on semantic decomposition
CN111783709B (en) * 2020-07-09 2022-09-06 中国科学技术大学 Information prediction method and device for education video
CN111783709A (en) * 2020-07-09 2020-10-16 中国科学技术大学 Information prediction method and device for education video
CN114268846A (en) * 2020-09-16 2022-04-01 镇江多游网络科技有限公司 Video description generation model based on attention mechanism
CN112801017A (en) * 2021-02-09 2021-05-14 成都视海芯图微电子有限公司 Visual scene description method and system
CN112801017B (en) * 2021-02-09 2023-08-04 成都视海芯图微电子有限公司 Visual scene description method and system
CN113191263A (en) * 2021-04-29 2021-07-30 桂林电子科技大学 Video description method and device
CN113673535A (en) * 2021-05-24 2021-11-19 重庆师范大学 Image description generation method of multi-modal feature fusion network
CN113673535B (en) * 2021-05-24 2023-01-10 重庆师范大学 Image description generation method of multi-modal feature fusion network
CN113269253B (en) * 2021-05-26 2023-08-22 大连民族大学 Visual feature fusion semantic detection method and system in video description
CN113269253A (en) * 2021-05-26 2021-08-17 大连民族大学 Method and system for detecting fusion semantics of visual features in video description
CN113269093B (en) * 2021-05-26 2023-08-22 大连民族大学 Visual feature segmentation semantic detection method and system in video description
CN113269093A (en) * 2021-05-26 2021-08-17 大连民族大学 Method and system for detecting visual characteristic segmentation semantics in video description
CN113312923A (en) * 2021-06-18 2021-08-27 广东工业大学 Method for generating text explanation of ball game
CN113641854A (en) * 2021-07-28 2021-11-12 上海影谱科技有限公司 Method and system for converting characters into video
CN113641854B (en) * 2021-07-28 2023-09-26 上海影谱科技有限公司 Method and system for converting text into video
CN113553445B (en) * 2021-07-28 2022-03-29 北京理工大学 Method for generating video description
CN113553445A (en) * 2021-07-28 2021-10-26 北京理工大学 Method for generating video description
CN113705402A (en) * 2021-08-18 2021-11-26 中国科学院自动化研究所 Video behavior prediction method, system, electronic device and storage medium
CN113792183B (en) * 2021-09-17 2023-09-08 咪咕数字传媒有限公司 Text generation method and device and computing equipment
CN113792183A (en) * 2021-09-17 2021-12-14 咪咕数字传媒有限公司 Text generation method and device and computing equipment
WO2023050295A1 (en) * 2021-09-30 2023-04-06 中远海运科技股份有限公司 Multimodal heterogeneous feature fusion-based compact video event description method
CN114627413B (en) * 2022-03-11 2022-09-13 电子科技大学 Video intensive event content understanding method
CN114339450B (en) * 2022-03-11 2022-07-15 中国科学技术大学 Video comment generation method, system, device and storage medium
CN114627413A (en) * 2022-03-11 2022-06-14 电子科技大学 Video intensive event content understanding method
CN114339450A (en) * 2022-03-11 2022-04-12 中国科学技术大学 Video comment generation method, system, device and storage medium
CN115311595A (en) * 2022-06-30 2022-11-08 中国科学院自动化研究所 Video feature extraction method and device and electronic equipment
CN115311595B (en) * 2022-06-30 2023-11-03 中国科学院自动化研究所 Video feature extraction method and device and electronic equipment
CN115359383A (en) * 2022-07-07 2022-11-18 北京百度网讯科技有限公司 Cross-modal feature extraction, retrieval and model training method, device and medium
CN116743609B (en) * 2023-08-14 2023-10-17 清华大学 QoE evaluation method and device for video streaming media based on semantic communication
CN116743609A (en) * 2023-08-14 2023-09-12 清华大学 QoE evaluation method and device for video streaming media based on semantic communication
CN118135452A (en) * 2024-02-02 2024-06-04 广州像素数据技术股份有限公司 Physical and chemical experiment video description method and related equipment based on large-scale video-language model
CN117789099A (en) * 2024-02-26 2024-03-29 北京搜狐新媒体信息技术有限公司 Video feature extraction method and device, storage medium and electronic equipment
CN117789099B (en) * 2024-02-26 2024-05-28 北京搜狐新媒体信息技术有限公司 Video feature extraction method and device, storage medium and electronic equipment
CN118132803A (en) * 2024-05-10 2024-06-04 成都考拉悠然科技有限公司 Zero sample video moment retrieval method, system, equipment and medium
CN118132803B (en) * 2024-05-10 2024-08-13 成都考拉悠然科技有限公司 Zero sample video moment retrieval method, system, equipment and medium

Similar Documents

Publication Publication Date Title
CN111079601A (en) Video content description method, system and device based on multi-mode attention mechanism
US11657230B2 (en) Referring image segmentation
CN110532996B (en) Video classification method, information processing method and server
CN108804530B (en) Subtitling areas of an image
CN111488807B (en) Video description generation system based on graph rolling network
CN111741330B (en) Video content evaluation method and device, storage medium and computer equipment
CN111507378A (en) Method and apparatus for training image processing model
CN109947912A (en) A kind of model method based on paragraph internal reasoning and combined problem answer matches
CN111079532A (en) Video content description method based on text self-encoder
CN111294646A (en) Video processing method, device, equipment and storage medium
CN111860235A (en) Method and system for generating high-low-level feature fused attention remote sensing image description
CN109919221B (en) Image description method based on bidirectional double-attention machine
CN108563624A (en) A kind of spatial term method based on deep learning
CN108536784B (en) Comment information sentiment analysis method and device, computer storage medium and server
CN109543112A (en) A kind of sequence of recommendation method and device based on cyclic convolution neural network
CN114339450B (en) Video comment generation method, system, device and storage medium
CN116524593A (en) Dynamic gesture recognition method, system, equipment and medium
CN115311598A (en) Video description generation system based on relation perception
CN115130591A (en) Cross supervision-based multi-mode data classification method and device
CN112668608A (en) Image identification method and device, electronic equipment and storage medium
CN112529149A (en) Data processing method and related device
CN117541668A (en) Virtual character generation method, device, equipment and storage medium
Taylor Composable, distributed-state models for high-dimensional time series
CN111161238A (en) Image quality evaluation method and device, electronic device, and storage medium
CN112115744A (en) Point cloud data processing method and device, computer storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication Application publication date: 20200428
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication