CN113095435A - Video description generation method, device, equipment and computer readable storage medium - Google Patents
- Publication number
- CN113095435A (application CN202110470037.0A)
- Authority
- CN
- China
- Prior art keywords
- features
- auditory
- visual
- coding
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS; G06—COMPUTING, CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/253—Pattern recognition; Analysing; Fusion techniques of extracted features
- G06F16/783—Information retrieval of video data; Retrieval characterised by using metadata automatically derived from the content
- G06F40/284—Handling natural language data; Lexical analysis, e.g. tokenisation or collocates
Abstract
The application belongs to the technical field of intelligent decision making and provides a video description generation method, apparatus, device, and computer-readable storage medium. The method comprises the following steps: acquiring a video to be described, and extracting visual features, auditory features, and word features of the video to be described; encoding the visual features and the auditory features respectively through a multi-modal attention mechanism main body model of the video description generation system to obtain visual coding features and auditory coding features; processing the visual coding features and the auditory coding features through an auxiliary model of the video description generation system to generate target auxiliary features; decoding the visual coding features, the auditory coding features, the target auxiliary features, and the word features through the multi-modal attention mechanism main body model to obtain the posterior probability of each keyword, and selecting decoding words from the keywords according to these posterior probabilities; and generating the video description of the video to be described according to the decoding words. The method and apparatus can improve the accuracy of video description.
Description
Technical Field
The present application relates to the field of intelligent decision making technologies, and in particular, to a video description generation method, apparatus, device, and computer-readable storage medium.
Background
Video description is a technique for automatically generating a content description for a video. With the continued development of the mobile internet, short video has become the most popular form of dissemination, and automatically generating video descriptions for short videos has important application value in providing reference for users, optimizing short-video recommendation algorithms and search engines, and improving the efficiency of short-video content review. Unlike a standalone image description or audio description, a video contains complex spatio-temporal relationships between objects, e.g., "footsteps coming from a wooden ladder, two people slowly walking closer"; how to automatically generate a video description is therefore a challenge in the field of computer vision.
In the related art, a classical attention-based encoder-decoder algorithm is usually adopted to generate a video description. However, this algorithm exploits only the visual features of the video; with such a single feature type, the quality of the generated description is low and the video content cannot be described accurately.
Disclosure of Invention
The present application mainly aims to provide a video description generation method, device, apparatus and computer readable storage medium, and aims to solve the technical problem that the accuracy of video description generated by the existing automatic video description generation method is not high.
In a first aspect, the present application provides a video description generation method, including:
acquiring a video to be described, and extracting visual features, auditory features and word features of the video to be described;
respectively coding the visual features and the auditory features through a multi-mode attention mechanism main body model of a video description generation system to obtain visual coding features and auditory coding features;
processing the visual coding features and the auditory coding features through an auxiliary model of the video description generation system to generate target auxiliary features;
decoding the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-mode attention mechanism main body model to obtain the posterior probability of each keyword, and selecting decoding words from each keyword according to the posterior probability of each keyword;
and generating the video description of the video to be described according to the decoding words.
In a second aspect, the present application further provides a video description generation apparatus, including:
the extraction module is used for acquiring a video to be described and extracting visual features, auditory features and word features of the video to be described;
the coding module is used for coding the visual features and the auditory features respectively through a multi-mode attention mechanism main body model of the video description generation system to obtain visual coding features and auditory coding features;
a target assistant feature generation module, configured to process the visual coding features and the auditory coding features through an assistant model of the video description generation system to generate target assistant features;
the decoding module is used for decoding the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-mode attention mechanism main body model to obtain the posterior probability of each keyword, and selecting a decoding word from each keyword according to the posterior probability of each keyword;
and the video description generation module is used for generating the video description of the video to be described according to the decoding words.
In a third aspect, the present application further provides a computer device comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the video description generation method as described above.
In a fourth aspect, the present application further provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the video description generation method as described above.
The application discloses a video description generation method, apparatus, computer device, and computer-readable storage medium. The method first acquires a video to be described and extracts its visual features, auditory features, and word features. The visual and auditory features are then encoded by the multi-modal attention mechanism main body model of the video description generation system to obtain visual coding features and auditory coding features; the visual and auditory coding features are processed by an auxiliary model of the system to generate target auxiliary features; the visual coding features, auditory coding features, target auxiliary features, and word features are further decoded by the multi-modal attention mechanism main body model to obtain the posterior probability of each keyword, and decoding words are selected from the keywords according to these posterior probabilities; finally, the video description of the video to be described is generated from the decoding words. The video description generation system fuses visual and auditory features through the multi-modal attention mechanism main body model and adds auxiliary features through the auxiliary models, providing rich features for video description generation and laying a foundation for accurately selecting words that fit the video's scenes and events, thereby improving description accuracy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a video description generation method according to an embodiment of the present application;
fig. 2 is a schematic architecture diagram of a video description system according to an embodiment of the present application;
fig. 3 is a schematic diagram of an architecture of a scene classification assistance model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a keyword evaluation support model according to an embodiment of the present application;
fig. 5 is a schematic block diagram of a video description generation apparatus provided in an embodiment of the present application;
fig. 6 is a block diagram schematically illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiment of the application provides a video description generation method, a video description generation device, video description generation equipment and a computer-readable storage medium. The video description generation method is mainly applied to video description generation equipment, and can be equipment with a data processing function, such as a mobile terminal, a Personal Computer (PC), a portable computer and a server. The video description generation device carries a video description generation system thereon. The video description generation system may be implemented as part of a multimedia application.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic flowchart of a video description generation method according to an embodiment of the present application.
As shown in fig. 1, the video description generation method includes steps S101 to S105.
Step S101, obtaining a video to be described, and extracting visual features, auditory features and word features of the video to be described.
As shown in fig. 2, fig. 2 is a schematic structural diagram of the video description generation system. The system is a video description generation model comprising three parts: a main model and two auxiliary models. The main model is an encoder-decoder model based on a multi-modal attention mechanism (defined as the multi-modal attention mechanism main body model; see the dashed-box part of fig. 2), and the two auxiliary models are a scene classification auxiliary model and a keyword evaluation auxiliary model. The multi-modal attention mechanism main body model introduces multi-modal attention into the traditional self-attention-based encoder-decoder algorithm and can jointly extract and fuse the visual and auditory features of the video to be described. By further introducing a scene classification auxiliary model driven by visual features and a keyword evaluation auxiliary model driven by auditory features, the system supplements the fused audio-visual features and word features, accurately selects words that fit the current scene and events, and thereby improves the description accuracy for the video to be described.
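A minimal sketch of the overall dataflow just described — a main encoder-decoder model plus two auxiliary models whose outputs feed the decoder — may look as follows. All function and parameter names here are illustrative stand-ins, not taken from the patent:

```python
def generate_description(phi_v, phi_a, visual_encoder, auditory_encoder,
                         scene_model, keyword_model, decoder, max_len=20):
    """Toy pipeline: encode both modalities, derive auxiliary features,
    then greedily decode words one at a time until an end token."""
    v = visual_encoder(phi_v)    # visual coding features
    a = auditory_encoder(phi_a)  # auditory coding features

    m_v = scene_model(v)         # first auxiliary feature (scene scores)
    m_a = keyword_model(a)       # second auxiliary feature (keyword indices)
    m = (m_v, m_a)               # target auxiliary feature

    words = []
    for _ in range(max_len):
        probs = decoder(v, a, m, words)   # posterior over candidate words
        word = max(probs, key=probs.get)  # pick the most probable word
        if word == "<eos>":
            break
        words.append(word)
    return " ".join(words)
```

The greedy `max` selection is one simple reading of "selecting a decoding word according to the posterior probability"; a beam search would slot in at the same point.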
As shown in the dashed box of FIG. 2, the dashed box is an architecture diagram of the multi-modal attention mechanism main body model, which includes a visual feature encoder (denoted VE_θv), an auditory feature encoder (denoted AE_θa), and a text decoder (denoted D). The visual and auditory feature encoders fuse and extract the visual and auditory features, and the text decoder decodes words based on the visual coding features, the auditory coding features, and the word features.
First, the video to be described is acquired, and its visual features (denoted φ_v), auditory features (denoted φ_a), and word features (denoted w_{n-1}) are extracted.
In one embodiment, the visual features are obtained by feature extraction on the visual information in the video to be described using an Inflated 3D convolutional network (I3D ConvNet, I3D) pre-trained on the behavior data set Kinetics-600. The visual features form a T_v × d_v feature sequence φ_v = (φ_{v,1}, ..., φ_{v,T_v}), where T_v denotes the length of the input sequence, d_v denotes the feature dimension, and φ_{v,t} represents the visual features at time point t.
In one embodiment, the auditory features are obtained by feature extraction on the auditory information in the video to be described using a VGGish model pre-trained on the Google AudioSet data set. When the VGGish model extracts features from the auditory information, the audio of the video to be described is first resampled to monaural audio, illustratively 16 kHz mono. A short-time Fourier transform is then applied with a 25 ms Hanning window and a 10 ms frame shift to obtain a spectrogram, which is mapped onto a 64-band mel filter bank; taking the logarithm yields a stable log-mel spectrum. The features are framed into examples of 0.96 s duration, each example comprising 96 frames (one per 10 ms) of 64 mel bands. Similarly to visual feature extraction, the VGGish model outputs a T_a × d_a feature sequence φ_a = (φ_{a,1}, ..., φ_{a,T_a}), where T_a denotes the length of the input sequence, i.e., audio duration / 0.96, and d_a denotes the feature dimension, which may be 128.
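The audio front end described above can be approximated in a few lines of NumPy. This is a simplified stand-in for the reference VGGish preprocessing, not the reference implementation: resampling to 16 kHz mono is assumed already done, and the 125–7500 Hz mel-edge frequencies and triangular filterbank construction are assumptions not spelled out in the text.

```python
import numpy as np

def log_mel_frames(audio, sr=16000, win_ms=25, hop_ms=10, n_mels=64,
                   example_secs=0.96):
    """25 ms Hann window, 10 ms hop STFT -> 64-band mel filterbank ->
    log -> non-overlapping 0.96 s examples of 96 frames x 64 bands."""
    win = int(sr * win_ms / 1000)   # 400 samples
    hop = int(sr * hop_ms / 1000)   # 160 samples
    window = np.hanning(win)
    n_frames = 1 + (len(audio) - win) // hop
    frames = np.stack([audio[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))     # (n_frames, win//2 + 1)

    # Triangular mel filterbank (HTK-style mel scale), 125-7500 Hz assumed.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(125), hz_to_mel(7500),
                                    n_mels + 2))
    bins = np.floor((win // 2) * mel_pts / (sr / 2)).astype(int)
    fbank = np.zeros((n_mels, win // 2 + 1))
    for m in range(n_mels):
        l, c, r = bins[m], bins[m + 1], bins[m + 2]
        fbank[m, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[m, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    logmel = np.log(spec @ fbank.T + 1e-6)         # stabilized log-mel

    per_ex = int(example_secs * 1000 / hop_ms)     # 96 frames per example
    n_ex = n_frames // per_ex
    return logmel[:n_ex * per_ex].reshape(n_ex, per_ex, n_mels)
```

For two seconds of 16 kHz audio this yields two examples of shape 96 × 64, matching the framing described in the text.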
In one embodiment, extracting the word features of the video to be described means obtaining, for the words generated at previous time steps, w_{n-1} = (w_1, ..., w_{n-1}), their embeddings from a lookup table pre-trained with fastText on the public Common Crawl data set, so that each previous word can be represented by a d_w-dimensional vector.
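As a toy stand-in for the pretrained lookup table (a real system would load fastText vectors trained on Common Crawl; the 300-dimensional width and random initialization here are illustrative assumptions):

```python
import numpy as np

def build_lookup(vocab, d_w=300, seed=0):
    """Hypothetical lookup table: one d_w-dimensional vector per word."""
    rng = np.random.default_rng(seed)
    return {w: rng.standard_normal(d_w) for w in vocab}

def word_features(prev_words, table):
    """Map each previously generated word w_1..w_{n-1} to its vector."""
    return np.stack([table[w] for w in prev_words])
```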
And S102, coding the visual characteristics and the auditory characteristics through a multi-mode attention mechanism main body model of the video description generation system to obtain visual coding characteristics and auditory coding characteristics.
The extracted visual features φ_v and auditory features φ_a are encoded by the multi-modal attention mechanism main body model to obtain the visual coding features (denoted v = VE_θv(φ_v)) and the auditory coding features (denoted a = AE_θa(φ_a)).
In an embodiment, the multi-modal attention mechanism main body model includes a visual feature encoder and an auditory feature encoder, and encoding the visual features and the auditory features respectively to obtain visual coding features and auditory coding features specifically comprises: performing multi-head attention calculation on the visual features through the visual feature encoder to obtain visual multi-head attention features, and performing multi-head attention calculation on the auditory features through the auditory feature encoder to obtain auditory multi-head attention features; performing multi-modal attention calculation on the visual and auditory multi-head attention features through the visual feature encoder to obtain visual features fused with auditory attention, and performing multi-modal attention calculation on the auditory and visual multi-head attention features through the auditory feature encoder to obtain auditory features fused with visual attention; and applying, in each encoder in turn, a first sublayer regularization, a feed-forward calculation, and a second sublayer regularization to the fused features, thereby obtaining the visual coding features output by the visual feature encoder and the auditory coding features output by the auditory feature encoder.
With continued reference to fig. 2, the left dashed-line frame of fig. 2 is a schematic structural diagram of the visual and auditory feature encoders of the multi-modal attention mechanism main body model. The encoding layer of each encoder comprises five sublayers: the first layer is Multi-head Attention, the second layer is Multi-modal Attention (a variant of multi-head attention), the third layer is a first layer regularization (layer normalization) layer, the fourth layer is a feed-forward neural network, and the fifth layer is a second layer regularization layer.
The extracted visual features φ_v and auditory features φ_a are encoded by the visual feature encoder VE_θv and the auditory feature encoder AE_θa respectively, giving the visual coding features v = VE_θv(φ_v) and the auditory coding features a = AE_θa(φ_a).
Specifically, the visual features φ_v and the auditory features φ_a are input into the visual and auditory feature encoders. In the corresponding encoding layer of each encoder, the input features first pass through the multi-head attention layer, whose outputs are the visual multi-head attention features V_self and the auditory multi-head attention features A_self (these serve as the queries Q of the next layer). These outputs are then passed to the corresponding multi-modal attention layer, which takes the other modality as keys K and values V, giving the visual features fused with auditory attention, V_mm = MultiHeadAttention(V_self, A_self, A_self), and the auditory features fused with visual attention, A_mm = MultiHeadAttention(A_self, V_self, V_self). Next, the multi-modal attention outputs are passed through the corresponding first layer regularization layer; the result is fed to the corresponding feed-forward neural network layer for feed-forward calculation; and finally the feed-forward output is passed through the corresponding second layer regularization layer. The complete encoder layer is stacked N times, so that the final outputs are the visual coding features v = VE_θv(φ_v) and the auditory coding features a = AE_θa(φ_a), where VE and AE denote the visual and auditory feature encoders and θ_v, θ_a denote their respective parameter spaces. It should be noted that, to alleviate problems such as vanishing gradients, residual connections are added between the input of the encoding layer and the first layer regularization layer, and between the first layer regularization layer and the second layer regularization layer.
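The encoder-layer ordering in the text — self-attention, then cross-modal attention with the other modality supplying keys and values, then two layer-regularization sublayers around a feed-forward network, with residuals — can be sketched with single-head attention and weight-free toy sublayers. A real encoder would use learned projection matrices and H heads; here ReLU stands in for the feed-forward network and both modalities are assumed to share one feature dimension:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention stand-in.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_layer(phi_v, phi_a):
    """One visual/auditory encoder layer pair: self-attention ->
    cross-modal attention -> layer regularization -> feed-forward ->
    layer regularization, with residual connections."""
    V_self = attention(phi_v, phi_v, phi_v)   # visual self-attention
    A_self = attention(phi_a, phi_a, phi_a)   # auditory self-attention
    V_mm = attention(V_self, A_self, A_self)  # fuses auditory attention
    A_mm = attention(A_self, V_self, V_self)  # fuses visual attention
    v = layer_norm(V_self + V_mm)             # residual + first layer norm
    a = layer_norm(A_self + A_mm)
    v = layer_norm(v + np.maximum(v, 0))      # FFN stand-in + second norm
    a = layer_norm(a + np.maximum(a, 0))
    return v, a
```

Stacking this function N times over its own outputs mirrors the stacked encoder described above.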
Among the coding layers of the visual and auditory feature encoders, multi-head attention is the most important transformation mapping. Multi-head attention is built from Scaled Dot-Product Attention, whose formula is:
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
where √d_k is a scaling factor and Q, K, V are the sequences of queries, keys, and values, respectively.
Multi-head attention first passes the queries Q, keys K, and values V through linear transformations W_i^Q, W_i^K, W_i^V and feeds the results into scaled dot-product attention. This is repeated H times (H being the number of heads), each time with different linear transformation parameter matrices. The H scaled dot-product attention results are concatenated and subjected to one more linear transformation W_out, whose result is the output of multi-head attention. The specific formulas are:
head_i(Q, K, V) = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHeadAttention(Q, K, V) = [head_1(Q, K, V), ..., head_H(Q, K, V)] W_out
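The two formulas can be written directly in NumPy. The head count, matrix dimensions, and the `head_weights` container are illustrative choices, not prescribed by the text:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, head_weights, W_out):
    """H heads, each with its own projections (W_q, W_k, W_v); the H
    scaled-dot-product results are concatenated and projected once
    more by W_out. `head_weights` is a list of (W_q, W_k, W_v)."""
    heads = [scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv)
             for Wq, Wk, Wv in head_weights]
    return np.concatenate(heads, axis=-1) @ W_out
```

With the keys and values taken from the other modality, this same function realizes the multi-modal attention variant V_mm = MultiHeadAttention(V_self, A_self, A_self) described earlier.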
step S103, processing the visual coding features and the auditory coding features through an auxiliary model of the video description generation system to generate target auxiliary features.
The visual coding features and the auditory coding features may then be processed through an auxiliary model of the video description generation system to generate target auxiliary features.
In an embodiment, the auxiliary models include a scene classification auxiliary model and a keyword evaluation auxiliary model; the processing the visual coding features and the auditory coding features through the auxiliary model of the video description generation system to generate target auxiliary features specifically includes: inputting the visual coding features into the scene classification auxiliary model for processing to obtain first auxiliary features output by the scene classification auxiliary model, and inputting the auditory coding features into the keyword evaluation auxiliary model for processing to obtain second auxiliary features output by the keyword evaluation auxiliary model; and generating a target auxiliary feature according to the first auxiliary feature and the second auxiliary feature.
With continued reference to FIG. 2, the auxiliary models of the video description generation system comprise a scene classification auxiliary model and a keyword evaluation auxiliary model. The visual coding features v output by the visual feature encoder are input into the scene classification auxiliary model, which outputs the first auxiliary features m_v; the auditory coding features a output by the auditory feature encoder are input into the keyword evaluation auxiliary model, which outputs the second auxiliary features m_a. The target auxiliary features m are then generated from the first and second auxiliary features.
In an embodiment, the inputting the visual coding feature into the scene classification auxiliary model for processing to obtain a first auxiliary feature output by the scene classification auxiliary model specifically includes: inputting the visual coding features into the scene classification auxiliary model, and performing linear transformation on the visual coding features; carrying out nonlinear mapping on the vision coding features after linear transformation through a linear rectification function to obtain vision coding feature mapping; performing linear transformation on the visual coding feature mapping; and performing softmax logistic regression calculation on the vision coding feature mapping after the linear transformation to obtain a first auxiliary feature output by the scene classification auxiliary model.
As shown in fig. 3, fig. 3 is an architecture diagram of the scene classification auxiliary model, which comprises four sublayers: the first is a linear transformation layer (Linear), the second a linear rectification function (ReLU activation), the third a second linear transformation layer, and the fourth a Softmax logistic regression layer.
After the visual coding features v output by the visual feature encoder are input into the scene classification auxiliary model, the model first passes v through the first linear transformation layer to obtain its output; this output is fed to the linear rectification function for nonlinear mapping, yielding the visual coding feature mapping; the mapping is fed to the second linear transformation layer for linear transformation; and finally the output of the second linear transformation layer is passed through the Softmax logistic regression layer, which produces the probability scores m_v over the preset scene classes. These scores are the final output of the scene classification auxiliary model.
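The four sublayers amount to a small classification head. A hedged NumPy sketch follows; the weight shapes and the treatment of v as a single pooled vector are assumptions, since the text does not specify how the feature sequence is reduced:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scene_classification_head(v, W1, b1, W2, b2):
    """Linear -> ReLU -> Linear -> Softmax over preset scene classes,
    mirroring the four sublayers above. `v` is a (pooled) visual
    coding feature vector."""
    h = np.maximum(v @ W1 + b1, 0)   # linear transform + ReLU mapping
    logits = h @ W2 + b2             # second linear transform
    return softmax(logits)           # probability score per scene class
```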
in an embodiment, the inputting the auditory coding features into the keyword assessment assistant model for processing to obtain second assistant features output by the keyword assessment assistant model specifically includes: inputting the auditory coding features into the keyword evaluation auxiliary model, and performing linear transformation on the auditory coding features; carrying out nonlinear mapping on the hearing coding features after linear transformation through a linear rectification function to obtain hearing coding feature mapping; performing a linear transformation on the auditory coding feature map;
calculating the auditory coding feature mapping after linear transformation through a Sigmoid function to obtain the posterior probability of each keyword in a dictionary; performing maximum pooling on the posterior probability of each keyword to obtain a score of each keyword; ranking the scores of the keywords, and selecting a preset number of keywords according to the order of the scores from large to small so as to search the indexes of the selected keywords in a dictionary; and combining the searched indexes to obtain a second auxiliary characteristic output by the keyword evaluation auxiliary module.
As shown in fig. 4, fig. 4 is a schematic diagram of a keyword evaluation auxiliary model, where the keyword evaluation auxiliary model includes six sublayers, a first layer is a first linear transformation layer, a second layer is an activation function (a linear rectification function), a third layer is a second linear transformation layer, a fourth layer is a Sigmoid function, a fifth layer is a maximum pooling layer, and a sixth layer is a sorting & selecting layer.
After the auditory coding feature a output by the auditory feature encoder is input into the keyword evaluation auxiliary model, the model first passes a through the first linear transformation layer to obtain that layer's output; this output is fed to the activation function for calculation, i.e. the linearly transformed auditory coding features undergo nonlinear mapping through the linear rectification function to obtain the auditory coding feature mapping; the feature mapping is then passed through the second linear transformation layer to obtain the output of the second linear transformation layer; that output is fed into the Sigmoid function, which yields the posterior probability Z of each keyword in the dictionary; the posterior probabilities are then input to the maximum pooling layer for keyword evaluation, giving the keyword score P(Z_C|a) output by the maximum pooling layer, where P(Z_C|a) = max_t P(Z_{C,t}|a); finally, the keyword scores P(Z_C|a) are input to the sorting & selecting layer, which sorts them and takes the dictionary indexes of the first K keywords in descending order of score to form the output m_a of the keyword evaluation auxiliary model, where K represents a preset number that can be set flexibly according to the actual situation.
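The Sigmoid, maximum pooling and sorting & selecting steps above can be sketched as follows (a hypothetical toy example; the per-step keyword logits stand in for the output of the second linear transformation layer):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def keyword_scores(logits_per_step):
    """logits_per_step: one list of per-keyword logits per time step.
    Sigmoid gives P(Z_{C,t}|a); max pooling over t gives P(Z_C|a)."""
    probs = [[sigmoid(z) for z in step] for step in logits_per_step]
    n_kw = len(probs[0])
    return [max(step[c] for step in probs) for c in range(n_kw)]

def top_k_indices(scores, k):
    """Sorting & selecting layer: dictionary indexes of the k highest-scoring
    keywords, in descending order of score (this forms m_a)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
```

For example, with two time steps and three dictionary keywords, the highest pooled score wins the first slot of the index list.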
In an embodiment, the generating a target assist feature from the first assist feature and the second assist feature comprises: carrying out keyword embedding processing and linear transformation on the second auxiliary features in sequence to obtain second auxiliary features with reduced feature dimensions; and splicing the second auxiliary features with the reduced feature dimensions with the first auxiliary features to obtain target auxiliary features.
With continued reference to fig. 2, the first auxiliary feature m_v output by the scene classification auxiliary model and the second auxiliary feature m_a output by the keyword evaluation auxiliary model are spliced: the second auxiliary feature m_a first undergoes keyword embedding processing and linear transformation in sequence to reduce its feature dimension, and the dimension-reduced second auxiliary feature is then spliced with the first auxiliary feature m_v to obtain the target auxiliary feature m.
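The splicing described above might look like the following sketch (the embedding table and reduction matrix are hypothetical placeholders for the learned parameters):

```python
def embed_and_reduce(indices, embedding_table, W):
    """Keyword embedding followed by a linear transformation that reduces
    the feature dimension of the second auxiliary feature m_a."""
    emb = [embedding_table[i] for i in indices]   # one embedding per keyword index
    flat = [v for e in emb for v in e]            # flatten the K embeddings
    # W is a list of output columns; each column has len(flat) weights
    return [sum(x * w for x, w in zip(flat, col)) for col in W]

def target_auxiliary_feature(m_v, m_a, embedding_table, W):
    """Splice (concatenate) the reduced m_a with m_v to form m, as in fig. 2."""
    reduced = embed_and_reduce(m_a, embedding_table, W)
    return m_v + reduced
```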
And step S104, decoding the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-modal attention mechanism main body model to obtain the posterior probability of each keyword, and selecting a decoding word from the keywords according to these posterior probabilities.
In the decoding stage, when decoding the nth word, the visual coding features, the auditory coding features, the target auxiliary features and the word features are decoded through the multi-modal attention mechanism main body model to obtain the posterior probability P(W_n | v, a, m, W_{n−1}) of the nth word finally output by the model, where D denotes the text decoder and θ_d represents its parameter space.
In an embodiment, the multi-modal attention mechanism main body model includes a text decoder, and the decoding of the visual coding features, the auditory coding features, the target auxiliary features and the word features by the multi-modal attention mechanism main body model to obtain the posterior probability of each keyword specifically includes: sequentially performing multi-head attention calculation and layer regularization on the word features through the text decoder to obtain word layer regularization features; performing multi-modal attention calculation on the word layer regularization features and the visual coding features to obtain word features fused with visual attention, and performing multi-modal attention calculation on the word layer regularization features and the auditory coding features to obtain word features fused with auditory attention; bridging the word features fused with visual attention and the word features fused with auditory attention to obtain bridging word features; performing layer regularization on the bridging word features, and performing multi-head attention calculation on the layer-regularized bridging word features and the target auxiliary features to obtain word features fused with the target auxiliary features; sequentially performing first-layer regularization, feedforward calculation and second-layer regularization on the word features fused with the target auxiliary features to obtain the output of the text decoder; and sequentially performing linear transformation and Softmax logistic regression calculation on the output of the text decoder to obtain the posterior probability of each keyword.
As shown in the dashed box on the right of fig. 2, this dashed box is a structural diagram of the text decoder, which includes nine sub-layers: the first layer is a first multi-head attention layer, the second layer is a first regularization layer, the third layer is two different multi-modal attention layers, MultiHeadAttention(W_self, v, v) and MultiHeadAttention(W_self, a, a), the fourth layer is a bridging layer, the fifth layer is a second regularization layer, the sixth layer is a second multi-head attention layer MultiHeadAttention(W_norm, m, m), the seventh layer is a third regularization layer, the eighth layer is a feedforward neural network layer, and the ninth layer is a fourth regularization layer.
For the word features, the video description generation system inputs them into the text decoder. At the decoding layer, the word features first enter the first multi-head attention layer for multi-head attention calculation, producing the output of the first multi-head attention layer; this output is input to the first regularization layer for layer regularization, giving the output W_self of the first regularization layer, i.e. the word layer regularization features. W_self is then input into the two different multi-modal attention layers to be fused with the visual coding features and the auditory coding features respectively: multi-modal attention calculation of W_self with the visual coding features yields the word features fused with visual attention, and multi-modal attention calculation of W_self with the auditory coding features yields the word features fused with auditory attention. The outputs of the two multi-modal attention layers are then input into the bridging layer to be bridged, producing the bridging word features (the shape is converted from 2d_w × (n−1) to d_w × (n−1)); the bridging layer's output is input to the second regularization layer for layer regularization, giving the output W_norm of the second regularization layer. W_norm is further input to the second multi-head attention layer MultiHeadAttention(W_norm, m, m), where the word features and the target auxiliary features are fused: multi-head attention calculation of W_norm with the target auxiliary features yields the word features fused with the target auxiliary features. The output of the second multi-head attention layer is input to the third regularization layer for layer regularization; the result is input to the corresponding feedforward neural network layer for feedforward calculation; and the feedforward output is input to the fourth regularization layer for layer regularization, whose output is the output of the text decoder and serves as the output of the multi-modal attention mechanism main body model. It should be noted that a residual connection is added between the input layer of the decoding layer and the first regularization layer, between the first and second regularization layers, between the second and third regularization layers, and between the third and fourth regularization layers.
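The multi-modal attention layers described above are instances of scaled dot-product attention; a minimal single-head sketch (hypothetical toy vectors, without the learned query/key/value projections of a full multi-head layer) is:

```python
import math

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def attention(queries, keys, values):
    """Scaled dot-product attention. With word-feature queries and visual (or
    auditory) coding features as keys and values, each output row is a word
    feature fused with visual (or auditory) attention."""
    d = len(keys[0])
    out = []
    for q in queries:
        # similarity of the query to every key, scaled by sqrt(d)
        scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                          for k in keys])
        # attention-weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(scores, values))
                    for j in range(len(values[0]))])
    return out
```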
Continuing to refer to fig. 2, the output of the multi-modal attention mechanism main body model is subjected to linear transformation, and the result of the linear transformation is passed through Softmax logistic regression calculation, finally yielding the output of the video description generation system, namely the posterior probability P(W_n | v, a, m, W_{n−1}) of the nth keyword. It can be understood that the higher the posterior probability, the better the corresponding keyword matches the video content to be described, and the keyword with the highest posterior probability is determined as the decoding word.
And step S105, generating the video description of the video to be described according to the decoding words.
Because the video description is natural language formed by a sequence of decoding words, steps S101 to S104 are repeated at each decoding step: decoding words are generated in turn to form the decoding-word sequence, and the video description of the video to be described is generated from this sequence.
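The repeated decoding described above amounts to greedy autoregressive generation, which can be sketched as follows (`step_fn` is a hypothetical stand-in for one full pass through the multi-modal model, mapping the words generated so far to posterior probabilities over the vocabulary):

```python
def generate_description(step_fn, vocabulary, max_len=20, eos="<eos>"):
    """Greedy autoregressive decoding: repeatedly run the decode step, append
    the highest-posterior word, and stop at an end-of-sequence token or a
    length limit."""
    words = []
    for _ in range(max_len):
        posteriors = step_fn(words)
        word = vocabulary[max(range(len(posteriors)), key=lambda i: posteriors[i])]
        if word == eos:
            break
        words.append(word)
    return " ".join(words)
```

A scripted `step_fn` that always puts all probability mass on the next scripted word reproduces that word sequence exactly.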
In summary, based on the visual features, the auditory features, and the word features (W_{n−1} = (w_1, …, w_{n−1})), the current word w_n is generated, so as to produce a complete word sequence (w_1, …, w_n) that describes the content of the video to be described.
The video description generation method provided by this embodiment first obtains a video to be described and extracts its visual features, auditory features and word features; then encodes the visual features and the auditory features through the multi-modal attention mechanism main body model of the video description generation system to obtain visual coding features and auditory coding features; processes the visual coding features and the auditory coding features through the auxiliary models of the video description generation system to generate target auxiliary features; further decodes the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-modal attention mechanism main body model to obtain the posterior probability of each keyword, selecting decoding words according to those posterior probabilities; and finally generates the video description of the video to be described from the decoding words. The video description generation system realizes the fusion of visual and auditory features through the multi-modal attention mechanism main body model and the addition of auxiliary features through the auxiliary models, providing rich features for video description generation and laying a foundation for accurately selecting words that match the video's scenes and events, thereby improving video description accuracy.
Referring to fig. 5, fig. 5 is a schematic block diagram of a video description generating apparatus according to an embodiment of the present application.
As shown in fig. 5, the video description generating apparatus 400 includes: an extraction module 401, an encoding module 402, a target assistant feature generation module 403, a decoding module 404, and a video description generation module 405.
The extraction module 401 is configured to acquire a video to be described, and extract visual features, auditory features, and word features of the video to be described;
the encoding module 402 is configured to encode the visual features and the auditory features through a multi-modal attention mechanism main body model of the video description generation system, so as to obtain visual encoding features and auditory encoding features;
a target assistant feature generation module 403, configured to process the visual coding features and the auditory coding features through an assistant model of the video description generation system to generate target assistant features;
a decoding module 404, configured to decode the visual coding features, the auditory coding features, the target auxiliary features, and the word features through the multi-modal attention mechanism main body model to obtain posterior probabilities of the keywords, and select a decoded word from the keywords according to the posterior probabilities of the keywords;
a video description generating module 405, configured to generate a video description of the video to be described according to the decoded word.
It should be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the apparatus and each module and unit described above may refer to the corresponding processes in the foregoing video description generation method embodiment, and are not described herein again.
The apparatus provided by the above embodiments may be implemented in the form of a computer program, which can be run on a computer device as shown in fig. 6.
Referring to fig. 6, fig. 6 is a schematic block diagram illustrating a structure of a computer device according to an embodiment of the present disclosure. The computer device may be a Personal Computer (PC), a server, or the like having a data processing function.
As shown in fig. 6, the computer device includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any of the video description generation methods.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for running a computer program in the non-volatile storage medium, which when executed by the processor causes the processor to perform any of the video description generation methods.
The network interface is used for network communication, such as sending assigned tasks and the like. Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It should be understood that the processor may be a Central Processing Unit (CPU) or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
acquiring a video to be described, and extracting visual features, auditory features and word features of the video to be described; respectively coding the visual features and the auditory features through a multi-mode attention mechanism main body model of a video description generation system to obtain visual coding features and auditory coding features; processing the visual coding features and the auditory coding features through an auxiliary model of the video description generation system to generate target auxiliary features; decoding the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-mode attention mechanism main body model to obtain the posterior probability of each keyword, and selecting decoding words from each keyword according to the posterior probability of each keyword; and generating the video description of the video to be described according to the decoding words.
In some embodiments, the multi-modal attention mechanism principal model includes a visual feature encoder and an auditory feature encoder, and the processor implements the encoding of the visual features and the auditory features by the multi-modal attention mechanism principal model of the video description generation system, respectively, to obtain visual coding features and auditory coding features, including:
performing multi-head attention calculation on the visual features through the visual feature encoder to obtain visual multi-head attention features, and performing multi-head attention calculation on the auditory features through the auditory feature encoder to obtain auditory multi-head attention features;
performing multi-modal attention calculation on the visual multi-head attention feature and the auditory multi-head attention feature through the visual feature encoder to obtain a visual feature fused with auditory attention, and performing multi-modal attention calculation on the auditory multi-head attention feature and the visual multi-head attention feature through the auditory feature encoder to obtain an auditory feature fused with visual attention;
performing, through the visual feature encoder, first sublayer regularization, feedforward calculation and second sublayer regularization in sequence on the visual features fused with auditory attention, to obtain the visual coding features output by the visual feature encoder; and performing, through the auditory feature encoder, first sublayer regularization, feedforward calculation and second sublayer regularization in sequence on the auditory features fused with visual attention, to obtain the auditory coding features output by the auditory feature encoder.
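The layer regularization applied throughout the encoder and decoder normalizes each feature vector; a minimal sketch (omitting the learned gain and bias usually present in layer normalization) is:

```python
import math

def layer_norm(x, eps=1e-5):
    """Layer regularization: normalize a feature vector to zero mean and
    (approximately) unit variance, stabilizing the sublayer outputs."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]
```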
In some embodiments, the auxiliary models include a scene classification auxiliary model and a keyword evaluation auxiliary model, and the processor implements the auxiliary models by the video description generation system to process the visual coding features and the auditory coding features to generate target auxiliary features, including:
inputting the visual coding features into the scene classification auxiliary model for processing to obtain first auxiliary features output by the scene classification auxiliary model, and inputting the auditory coding features into the keyword evaluation auxiliary model for processing to obtain second auxiliary features output by the keyword evaluation auxiliary model;
and generating a target auxiliary feature according to the first auxiliary feature and the second auxiliary feature.
In some embodiments, the inputting the visual coding features into the scene classification assistant model for processing by the processor to obtain the first assistant features output by the scene classification assistant model includes:
inputting the visual coding features into the scene classification auxiliary model, and performing linear transformation on the visual coding features;
carrying out nonlinear mapping on the vision coding features after linear transformation through a linear rectification function to obtain vision coding feature mapping;
performing linear transformation on the visual coding feature mapping;
and performing softmax logistic regression calculation on the vision coding feature mapping after the linear transformation to obtain a first auxiliary feature output by the scene classification auxiliary model.
In some embodiments, the processor implements the inputting of the auditory coding features into the keyword assessment assistant model for processing to obtain second assistant features output by the keyword assessment assistant model, and the method includes:
inputting the auditory coding features into the keyword evaluation auxiliary model, and performing linear transformation on the auditory coding features;
carrying out nonlinear mapping on the hearing coding features after linear transformation through a linear rectification function to obtain hearing coding feature mapping;
performing a linear transformation on the auditory coding feature map;
calculating the auditory coding feature mapping after linear transformation through a Sigmoid function to obtain the posterior probability of each keyword in a dictionary;
performing maximum pooling on the posterior probability of each keyword to obtain a score of each keyword;
ranking the scores of the keywords, and selecting a preset number of keywords according to the order of the scores from large to small so as to search the indexes of the selected keywords in a dictionary;
and combining the searched indexes to obtain a second auxiliary characteristic output by the keyword evaluation auxiliary module.
In some embodiments, the processor implements the generating a target assist feature from the first assist feature and the second assist feature, including:
carrying out keyword embedding processing and linear transformation on the second auxiliary features in sequence to obtain second auxiliary features with reduced feature dimensions;
and splicing the second auxiliary features with the reduced feature dimensions with the first auxiliary features to obtain target auxiliary features.
In some embodiments, the multi-modal attention mechanism principal model comprises a text decoder, and the processor implements the decoding of the visual coding features, the auditory coding features, the target assist features, and the word features by the multi-modal attention mechanism principal model to obtain a posterior probability of each keyword, including:
sequentially performing multi-head attention calculation and layer regularization on the word features through the text decoder to obtain word layer regularization features;
performing multi-mode attention calculation on the word layer regularization features and the visual coding features to obtain word features fused with visual attention, and performing multi-mode attention calculation on the word layer regularization features and the auditory coding features to obtain word features fused with auditory attention;
bridging the word features fused with the visual attention and the word features fused with the auditory attention to obtain bridging word features;
performing layer regularization on the bridge word features, and performing multi-head attention calculation on the bridge word features after the layer regularization and the target auxiliary features to obtain word features fused with the target auxiliary features;
sequentially carrying out first-layer regularization, feedforward calculation and second-layer regularization on the word features fused with the target auxiliary features to obtain the output of the text decoder;
and sequentially carrying out linear transformation and Softmax logistic regression calculation on the output of the text decoder to obtain the posterior probability of each keyword.
Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, where the computer program includes program instructions, and a method implemented when the program instructions are executed may refer to various embodiments of a video description generation method of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments. While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A method of video description generation, the method comprising the steps of:
acquiring a video to be described, and extracting visual features, auditory features and word features of the video to be described;
respectively coding the visual features and the auditory features through a multi-mode attention mechanism main body model of a video description generation system to obtain visual coding features and auditory coding features;
processing the visual coding features and the auditory coding features through an auxiliary model of the video description generation system to generate target auxiliary features;
decoding the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-mode attention mechanism main body model to obtain the posterior probability of each keyword, and selecting decoding words from each keyword according to the posterior probability of each keyword;
and generating the video description of the video to be described according to the decoding words.
2. The video description generation method according to claim 1, wherein the multi-modal attention mechanism principal model includes a visual feature encoder and an auditory feature encoder;
the method for obtaining the visual coding features and the auditory coding features by respectively coding the visual features and the auditory features through a multi-mode attention mechanism main body model of a video description generation system comprises the following steps:
performing multi-head attention calculation on the visual features through the visual feature encoder to obtain visual multi-head attention features, and performing multi-head attention calculation on the auditory features through the auditory feature encoder to obtain auditory multi-head attention features;
performing multi-modal attention calculation on the visual multi-head attention feature and the auditory multi-head attention feature through the visual feature encoder to obtain a visual feature fused with auditory attention, and performing multi-modal attention calculation on the auditory multi-head attention feature and the visual multi-head attention feature through the auditory feature encoder to obtain an auditory feature fused with visual attention;
performing, through the visual feature encoder, first sublayer regularization, feedforward calculation and second sublayer regularization in sequence on the visual features fused with auditory attention, to obtain the visual coding features output by the visual feature encoder; and performing, through the auditory feature encoder, first sublayer regularization, feedforward calculation and second sublayer regularization in sequence on the auditory features fused with visual attention, to obtain the auditory coding features output by the auditory feature encoder.
3. The video description generation method according to claim 1, wherein the auxiliary models include a scene classification auxiliary model and a keyword evaluation auxiliary model;
the processing, by an assistant model of the video description generation system, the visually encoded features and the aurally encoded features to generate target assistant features, comprising:
inputting the visual coding features into the scene classification auxiliary model for processing to obtain first auxiliary features output by the scene classification auxiliary model, and inputting the auditory coding features into the keyword evaluation auxiliary model for processing to obtain second auxiliary features output by the keyword evaluation auxiliary model;
and generating a target auxiliary feature according to the first auxiliary feature and the second auxiliary feature.
4. The method according to claim 3, wherein the inputting the visual coding features into the scene classification assistant model for processing to obtain the first assistant features output by the scene classification assistant model comprises:
inputting the visual coding features into the scene classification auxiliary model, and performing linear transformation on the visual coding features;
performing nonlinear mapping on the linearly transformed visual coding features through a linear rectification function to obtain a visual coding feature mapping;
performing a linear transformation on the visual coding feature mapping;
and performing softmax logistic regression calculation on the linearly transformed visual coding feature mapping to obtain the first auxiliary feature output by the scene classification auxiliary model.
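The scene classification auxiliary model of claim 4 is a two-layer perceptron head: linear transformation, ReLU nonlinear mapping, a second linear transformation, then softmax logistic regression. A minimal numpy sketch; the sizes (`d`, `hidden`, `n_scenes`) and weights are hypothetical:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def scene_head(vis_enc, W1, b1, W2, b2):
    # Linear transformation of the visual coding features.
    h = vis_enc @ W1 + b1
    # Nonlinear mapping via the linear rectification (ReLU) function.
    h = np.maximum(h, 0)
    # Second linear transformation down to the number of scene classes.
    logits = h @ W2 + b2
    # Softmax logistic regression: per-scene probabilities
    # serve as the first auxiliary feature.
    return softmax(logits)

rng = np.random.default_rng(1)
d, hidden, n_scenes = 8, 16, 4        # hypothetical sizes
W1, b1 = rng.standard_normal((d, hidden)) * 0.1, np.zeros(hidden)
W2, b2 = rng.standard_normal((hidden, n_scenes)) * 0.1, np.zeros(n_scenes)
vis_enc = rng.standard_normal(d)      # a pooled visual coding feature
probs = scene_head(vis_enc, W1, b1, W2, b2)
print(probs.shape)                    # (4,)
```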
5. The method of claim 3, wherein the inputting the auditory coding features into the keyword assessment assistant model for processing to obtain the second assistant features output by the keyword assessment assistant model comprises:
inputting the auditory coding features into the keyword evaluation auxiliary model, and performing linear transformation on the auditory coding features;
performing nonlinear mapping on the linearly transformed auditory coding features through a linear rectification function to obtain an auditory coding feature mapping;
performing a linear transformation on the auditory coding feature map;
calculating the auditory coding feature mapping after linear transformation through a Sigmoid function to obtain the posterior probability of each keyword in a dictionary;
performing maximum pooling on the posterior probability of each keyword to obtain a score of each keyword;
ranking the scores of the keywords, selecting a preset number of keywords in descending order of score, and looking up the indexes of the selected keywords in the dictionary;
and combining the looked-up indexes to obtain the second auxiliary feature output by the keyword evaluation auxiliary model.
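The keyword evaluation auxiliary model of claim 5 chains a linear transformation, ReLU nonlinear mapping, a linear projection to dictionary size, Sigmoid posteriors per keyword, max pooling over time, and top-k index selection. A hedged numpy sketch; the dictionary size, hidden width and `k` are invented for illustration:

```python
import numpy as np

def keyword_head(aud_enc, W1, W2, k=3):
    # Linear transformation, then ReLU nonlinear mapping.
    h = np.maximum(aud_enc @ W1, 0)
    # Second linear transformation to dictionary size, then Sigmoid:
    # per-frame posterior probability of each keyword.
    post = 1.0 / (1.0 + np.exp(-(h @ W2)))   # (frames, vocab)
    # Max pooling over time yields one score per keyword.
    scores = post.max(axis=0)                # (vocab,)
    # Rank the scores and take the dictionary indexes of the top-k keywords.
    top_idx = np.argsort(scores)[::-1][:k]
    return scores, top_idx

rng = np.random.default_rng(2)
d, hidden, vocab = 8, 16, 10                 # hypothetical sizes
W1 = rng.standard_normal((d, hidden)) * 0.1
W2 = rng.standard_normal((hidden, vocab)) * 0.1
aud_enc = rng.standard_normal((7, d))        # 7 auditory coding frames
scores, top_idx = keyword_head(aud_enc, W1, W2, k=3)
print(top_idx.shape)                         # (3,)
```

The Sigmoid (rather than softmax) treats each keyword as an independent binary detection, so several keywords can score highly at once; max pooling keeps the best evidence from any frame.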
6. The video description generation method according to claim 3, wherein the generating a target assist feature from the first assist feature and the second assist feature comprises:
carrying out keyword embedding processing and linear transformation on the second auxiliary features in sequence to obtain second auxiliary features with reduced feature dimensions;
and splicing the second auxiliary features with the reduced feature dimensions with the first auxiliary features to obtain target auxiliary features.
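Claim 6 reduces the second auxiliary feature by keyword embedding plus a linear transformation, then splices it with the first auxiliary feature. A rough numpy sketch; the embedding table, projection and all dimensions are hypothetical:

```python
import numpy as np

def target_auxiliary(first_aux, keyword_idx, embed_table, W_proj):
    # Keyword embedding: look up each selected index in an embedding table.
    emb = embed_table[keyword_idx]            # (k, embed_dim)
    # Linear transformation reduces the feature dimension; flatten to a vector.
    reduced = (emb @ W_proj).reshape(-1)      # (k * reduced_dim,)
    # Splice (concatenate) with the first auxiliary feature.
    return np.concatenate([first_aux, reduced])

rng = np.random.default_rng(3)
vocab, embed_dim, reduced_dim = 10, 16, 4     # hypothetical sizes
embed_table = rng.standard_normal((vocab, embed_dim))
W_proj = rng.standard_normal((embed_dim, reduced_dim)) * 0.1
first_aux = rng.standard_normal(4)            # e.g. 4 scene probabilities
target = target_auxiliary(first_aux, np.array([2, 5, 7]), embed_table, W_proj)
print(target.shape)                           # (16,) = 4 + 3 * 4
```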
7. The video description generation method of claim 1, wherein the multi-modal attention mechanism main body model comprises a text decoder;
the decoding of the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-modal attention mechanism main body model to obtain the posterior probability of each keyword comprises:
sequentially performing multi-head attention calculation and layer regularization on the word features through the text decoder to obtain word layer regularization features;
performing multi-modal attention calculation on the word layer regularization features and the visual coding features to obtain word features fused with visual attention, and performing multi-modal attention calculation on the word layer regularization features and the auditory coding features to obtain word features fused with auditory attention;
bridging the word features fused with the visual attention and the word features fused with the auditory attention to obtain bridging word features;
performing layer regularization on the bridge word features, and performing multi-head attention calculation on the bridge word features after the layer regularization and the target auxiliary features to obtain word features fused with the target auxiliary features;
sequentially carrying out first-layer regularization, feedforward calculation and second-layer regularization on the word features fused with the target auxiliary features to obtain the output of the text decoder;
and sequentially carrying out linear transformation and Softmax logistic regression calculation on the output of the text decoder to obtain the posterior probability of each keyword.
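Claim 7 ends with a linear transformation and Softmax logistic regression over the text decoder output to obtain the per-keyword posterior, from which the decoding word is selected. A minimal sketch; greedy argmax selection, the toy dictionary and all weights are assumptions (the claim does not fix the selection rule, and beam search would be an alternative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def select_word(decoder_out, W_out, b_out, dictionary):
    # Linear transformation of the decoder output to dictionary size,
    # then Softmax logistic regression: posterior probability per keyword.
    posterior = softmax(decoder_out @ W_out + b_out)
    # Greedy selection: the keyword with the highest posterior
    # becomes the next decoding word.
    return dictionary[int(np.argmax(posterior))], posterior

rng = np.random.default_rng(4)
d = 8                                         # hypothetical decoder width
dictionary = ["a", "dog", "runs", "on", "grass", "<eos>"]
W_out = rng.standard_normal((d, len(dictionary))) * 0.1
b_out = np.zeros(len(dictionary))
decoder_out = rng.standard_normal(d)          # one text decoder output step
word, posterior = select_word(decoder_out, W_out, b_out, dictionary)
print(len(posterior))                         # 6
```

Repeating this step, feeding each selected word back in as the next word feature until an end token appears, yields the decoding words from which the video description is assembled.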
8. A video description generation apparatus, characterized in that the video description generation apparatus comprises:
the extraction module is used for acquiring a video to be described and extracting visual features, auditory features and word features of the video to be described;
the coding module is used for coding the visual features and the auditory features respectively through a multi-modal attention mechanism main body model of the video description generation system to obtain visual coding features and auditory coding features;
a target assistant feature generation module, configured to process the visual coding features and the auditory coding features through an assistant model of the video description generation system to generate target assistant features;
the decoding module is used for decoding the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-modal attention mechanism main body model to obtain the posterior probability of each keyword, and selecting a decoding word from the keywords according to their posterior probabilities;
and the video description generation module is used for generating the video description of the video to be described according to the decoding words.
9. A computer device, characterized in that the computer device comprises a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the video description generation method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, wherein the computer program, when executed by a processor, implements the steps of the video description generation method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110470037.0A CN113095435B (en) | 2021-04-28 | 2021-04-28 | Video description generation method, device, equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113095435A (en) | 2021-07-09 |
CN113095435B (en) | 2024-06-04 |
Family ID: 76681011
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110470037.0A Active CN113095435B (en) | 2021-04-28 | 2021-04-28 | Video description generation method, device, equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113095435B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023201990A1 (en) * | 2022-04-19 | 2023-10-26 | 苏州浪潮智能科技有限公司 | Visual positioning method and apparatus, device, and medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10699129B1 (en) * | 2019-11-15 | 2020-06-30 | Fudan University | System and method for video captioning |
CN111541910A (en) * | 2020-04-21 | 2020-08-14 | 华中科技大学 | Video barrage comment automatic generation method and system based on deep learning |
WO2020190112A1 (en) * | 2019-03-21 | 2020-09-24 | Samsung Electronics Co., Ltd. | Method, apparatus, device and medium for generating captioning information of multimedia data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11948066B2 (en) | Processing sequences using convolutional neural networks | |
WO2021232746A1 (en) | Speech recognition method, apparatus and device, and storage medium | |
WO2021037113A1 (en) | Image description method and apparatus, computing device, and storage medium | |
CN110489567B (en) | Node information acquisition method and device based on cross-network feature mapping | |
CN112712813B (en) | Voice processing method, device, equipment and storage medium | |
CN110781306B (en) | English text aspect layer emotion classification method and system | |
CN112417855A (en) | Text intention recognition method and device and related equipment | |
CN109344242B (en) | Dialogue question-answering method, device, equipment and storage medium | |
CN115083435B (en) | Audio data processing method and device, computer equipment and storage medium | |
CN114091450B (en) | Judicial domain relation extraction method and system based on graph convolution network | |
CN116543768A (en) | Model training method, voice recognition method and device, equipment and storage medium | |
CN114360502A (en) | Processing method of voice recognition model, voice recognition method and device | |
CN113450765A (en) | Speech synthesis method, apparatus, device and storage medium | |
CN115831105A (en) | Speech recognition method and device based on improved Transformer model | |
CN115203372A (en) | Text intention classification method and device, computer equipment and storage medium | |
CN113095435B (en) | Video description generation method, device, equipment and computer readable storage medium | |
CN111898363B (en) | Compression method, device, computer equipment and storage medium for long and difficult text sentence | |
CN112489651B (en) | Voice recognition method, electronic device and storage device | |
CN117648469A (en) | Cross double-tower structure answer selection method based on contrast learning | |
CN111563161B (en) | Statement identification method, statement identification device and intelligent equipment | |
CN116775873A (en) | Multi-mode dialogue emotion recognition method | |
CN115017900B (en) | Conversation emotion recognition method based on multi-mode multi-prejudice | |
CN113515617B (en) | Method, device and equipment for generating model through dialogue | |
CN114822509A (en) | Speech recognition method, speech recognition device, computer equipment and storage medium | |
CN113889130A (en) | Voice conversion method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |