CN113095435A - Video description generation method, device, equipment and computer readable storage medium - Google Patents
- Publication number
- CN113095435A (application CN202110470037.0A)
- Authority
- CN
- China
- Prior art keywords
- features
- auditory
- visual
- coding
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS; G06—COMPUTING, CALCULATING OR COUNTING; G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/253—Pattern recognition; Analysing; Fusion techniques of extracted features
- G06F16/783—Information retrieval of video data; Retrieval characterised by using metadata automatically derived from the content
- G06F40/284—Handling natural language data; Lexical analysis, e.g. tokenisation or collocates
Abstract
The application belongs to the technical field of intelligent decision making and provides a video description generation method, apparatus, device, and computer-readable storage medium. The method comprises the following steps: acquiring a video to be described, and extracting visual features, auditory features, and word features of the video to be described; encoding the visual features and the auditory features respectively through a multi-modal attention mechanism main body model of the video description generation system to obtain visual coding features and auditory coding features; processing the visual coding features and the auditory coding features through an auxiliary model of the video description generation system to generate target auxiliary features; decoding the visual coding features, the auditory coding features, the target auxiliary features, and the word features through the multi-modal attention mechanism main body model to obtain the posterior probability of each keyword, and selecting decoding words from the keywords according to these posterior probabilities; and generating the video description of the video to be described according to the decoding words. The method and apparatus can improve the accuracy of video description.
Description
Technical Field
The present application relates to the field of intelligent decision making technologies, and in particular, to a video description generation method, apparatus, device, and computer-readable storage medium.
Background
Video description is a technique for automatically generating a content description for a video. With the continued development of the mobile internet, short video has become the most popular form of dissemination, and automatically generating video descriptions for short videos has important application value in providing reference for users, optimizing short-video recommendation algorithms and search engines, and improving the efficiency of short-video content review. Unlike a standalone image description or audio description, a video contains complex spatio-temporal relationships between objects, e.g., "footsteps coming from a wooden ladder, two people slowly walking closer"; how to automatically generate a video description is therefore a challenge in the field of computer vision.
In the related art, a classical attention-based encoder-decoder algorithm is usually adopted to generate a video description. However, this algorithm exploits only the visual features of the video; with such a single feature type, the quality of the generated description is low and the video content cannot be described accurately.
Disclosure of Invention
The present application mainly aims to provide a video description generation method, device, apparatus and computer readable storage medium, and aims to solve the technical problem that the accuracy of video description generated by the existing automatic video description generation method is not high.
In a first aspect, the present application provides a video description generation method, including:
acquiring a video to be described, and extracting visual features, auditory features and word features of the video to be described;
respectively coding the visual features and the auditory features through a multi-mode attention mechanism main body model of a video description generation system to obtain visual coding features and auditory coding features;
processing the visual coding features and the auditory coding features through an auxiliary model of the video description generation system to generate target auxiliary features;
decoding the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-mode attention mechanism main body model to obtain the posterior probability of each keyword, and selecting decoding words from each keyword according to the posterior probability of each keyword;
and generating the video description of the video to be described according to the decoding words.
In a second aspect, the present application further provides a video description generation apparatus, including:
the extraction module is used for acquiring a video to be described and extracting visual features, auditory features and word features of the video to be described;
the coding module is used for coding the visual features and the auditory features respectively through a multi-mode attention mechanism main body model of the video description generation system to obtain visual coding features and auditory coding features;
a target assistant feature generation module, configured to process the visual coding features and the auditory coding features through an assistant model of the video description generation system to generate target assistant features;
the decoding module is used for decoding the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-mode attention mechanism main body model to obtain the posterior probability of each keyword, and selecting a decoding word from each keyword according to the posterior probability of each keyword;
and the video description generation module is used for generating the video description of the video to be described according to the decoding words.
In a third aspect, the present application further provides a computer device comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the video description generation method as described above.
In a fourth aspect, the present application further provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the video description generation method as described above.
The application discloses a video description generation method, apparatus, computer device, and computer-readable storage medium. The method first acquires a video to be described and extracts its visual features, auditory features, and word features. The visual and auditory features are then encoded by the multi-modal attention mechanism main body model of the video description generation system to obtain visual coding features and auditory coding features; the visual and auditory coding features are processed by an auxiliary model of the system to generate target auxiliary features; the visual coding features, auditory coding features, target auxiliary features, and word features are further decoded by the multi-modal attention mechanism main body model to obtain the posterior probability of each keyword, and decoding words are selected from the keywords according to these posterior probabilities; finally, the video description of the video to be described is generated from the decoding words. The video description generation system fuses visual and auditory features through the multi-modal attention mechanism main body model and adds auxiliary features through the auxiliary models, providing rich features for video description generation and laying a foundation for accurately selecting words that fit the video's scenes and events, thereby improving description accuracy.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a video description generation method according to an embodiment of the present application;
fig. 2 is a schematic architecture diagram of a video description system according to an embodiment of the present application;
fig. 3 is a schematic diagram of an architecture of a scene classification assistance model according to an embodiment of the present application;
FIG. 4 is a schematic diagram of a keyword evaluation support model according to an embodiment of the present application;
fig. 5 is a schematic block diagram of a video description generation apparatus provided in an embodiment of the present application;
fig. 6 is a block diagram schematically illustrating a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative and do not necessarily include all of the elements and operations/steps, nor do they necessarily have to be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so that the actual execution sequence may be changed according to the actual situation.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
The embodiment of the application provides a video description generation method, a video description generation device, video description generation equipment and a computer-readable storage medium. The video description generation method is mainly applied to video description generation equipment, and can be equipment with a data processing function, such as a mobile terminal, a Personal Computer (PC), a portable computer and a server. The video description generation device carries a video description generation system thereon. The video description generation system may be implemented as part of a multimedia application.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic flowchart of a video description generation method according to an embodiment of the present application.
As shown in fig. 1, the video description generation method includes steps S101 to S105.
Step S101, obtaining a video to be described, and extracting visual features, auditory features and word features of the video to be described.
As shown in fig. 2, fig. 2 is a schematic structural diagram of the video description generation system. The system is a video description generation model comprising three parts: a main model and two auxiliary models. The main model is an encoder-decoder model based on a multi-modal attention mechanism (defined as the multi-modal attention mechanism main body model; see the dashed-box part of fig. 2), and the two auxiliary models are a scene classification auxiliary model and a keyword evaluation auxiliary model. The multi-modal attention mechanism main body model introduces multi-modal attention into the traditional self-attention-based encoder-decoder algorithm and can jointly extract and fuse the visual and auditory features of the video to be described. By further introducing a scene classification auxiliary model driven by visual features and a keyword evaluation auxiliary model driven by auditory features, the system supplements the fused audio-visual features and word features, accurately selects words that fit the current scene and events, and thereby improves the description accuracy for the video to be described.
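A minimal sketch of the overall dataflow just described — a main encoder-decoder model plus two auxiliary models whose outputs feed the decoder — may look as follows. All function and parameter names here are illustrative stand-ins, not taken from the patent:

```python
def generate_description(phi_v, phi_a, visual_encoder, auditory_encoder,
                         scene_model, keyword_model, decoder, max_len=20):
    """Toy pipeline: encode both modalities, derive auxiliary features,
    then greedily decode words one at a time until an end token."""
    v = visual_encoder(phi_v)    # visual coding features
    a = auditory_encoder(phi_a)  # auditory coding features

    m_v = scene_model(v)         # first auxiliary feature (scene scores)
    m_a = keyword_model(a)       # second auxiliary feature (keyword indices)
    m = (m_v, m_a)               # target auxiliary feature

    words = []
    for _ in range(max_len):
        probs = decoder(v, a, m, words)   # posterior over candidate words
        word = max(probs, key=probs.get)  # pick the most probable word
        if word == "<eos>":
            break
        words.append(word)
    return " ".join(words)
```

The greedy `max` selection is one simple reading of "selecting a decoding word according to the posterior probability"; a beam search would slot in at the same point.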
As shown in the dashed box of FIG. 2, the dashed box is an architecture diagram of the multi-modal attention mechanism main body model, which includes a visual feature encoder (denoted VE_θv), an auditory feature encoder (denoted AE_θa), and a text decoder (denoted D). The visual and auditory feature encoders fuse and extract the visual and auditory features, and the text decoder decodes words based on the visual coding features, the auditory coding features, and the word features.
First, the video to be described is acquired, and its visual features (denoted φ_v), auditory features (denoted φ_a), and word features (denoted w_{n-1}) are extracted.
In one embodiment, the visual features are obtained by feature extraction on the visual information in the video to be described using an Inflated 3D convolutional network (I3D ConvNet, I3D) pre-trained on the behavior data set Kinetics-600. The visual features form a T_v × d_v feature sequence φ_v = (φ_{v,1}, ..., φ_{v,T_v}), where T_v denotes the length of the input sequence, d_v denotes the feature dimension, and φ_{v,t} represents the visual features at time point t.
In one embodiment, the auditory features are obtained by feature extraction on the auditory information in the video to be described using a VGGish model pre-trained on the Google AudioSet data set. When the VGGish model extracts features from the auditory information, the audio of the video to be described is first resampled to monaural audio, illustratively 16 kHz mono. A short-time Fourier transform is then applied with a 25 ms Hanning window and a 10 ms frame shift to obtain a spectrogram, which is mapped onto a 64-band mel filter bank; taking the logarithm yields a stable log-mel spectrum. The features are framed into examples of 0.96 s duration, each example comprising 96 frames (one per 10 ms) of 64 mel bands. Similarly to visual feature extraction, the VGGish model outputs a T_a × d_a feature sequence φ_a = (φ_{a,1}, ..., φ_{a,T_a}), where T_a denotes the length of the input sequence, i.e., audio duration / 0.96, and d_a denotes the feature dimension, which may be 128.
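The audio front end described above can be approximated in a few lines of NumPy. This is a simplified stand-in for the reference VGGish preprocessing, not the reference implementation: resampling to 16 kHz mono is assumed already done, and the 125–7500 Hz mel-edge frequencies and triangular filterbank construction are assumptions not spelled out in the text.

```python
import numpy as np

def log_mel_frames(audio, sr=16000, win_ms=25, hop_ms=10, n_mels=64,
                   example_secs=0.96):
    """25 ms Hann window, 10 ms hop STFT -> 64-band mel filterbank ->
    log -> non-overlapping 0.96 s examples of 96 frames x 64 bands."""
    win = int(sr * win_ms / 1000)   # 400 samples
    hop = int(sr * hop_ms / 1000)   # 160 samples
    window = np.hanning(win)
    n_frames = 1 + (len(audio) - win) // hop
    frames = np.stack([audio[i * hop:i * hop + win] * window
                       for i in range(n_frames)])
    spec = np.abs(np.fft.rfft(frames, axis=1))     # (n_frames, win//2 + 1)

    # Triangular mel filterbank (HTK-style mel scale), 125-7500 Hz assumed.
    def hz_to_mel(f): return 2595.0 * np.log10(1.0 + f / 700.0)
    def mel_to_hz(m): return 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    mel_pts = mel_to_hz(np.linspace(hz_to_mel(125), hz_to_mel(7500),
                                    n_mels + 2))
    bins = np.floor((win // 2) * mel_pts / (sr / 2)).astype(int)
    fbank = np.zeros((n_mels, win // 2 + 1))
    for m in range(n_mels):
        l, c, r = bins[m], bins[m + 1], bins[m + 2]
        fbank[m, l:c] = (np.arange(l, c) - l) / max(c - l, 1)  # rising edge
        fbank[m, c:r] = (r - np.arange(c, r)) / max(r - c, 1)  # falling edge
    logmel = np.log(spec @ fbank.T + 1e-6)         # stabilized log-mel

    per_ex = int(example_secs * 1000 / hop_ms)     # 96 frames per example
    n_ex = n_frames // per_ex
    return logmel[:n_ex * per_ex].reshape(n_ex, per_ex, n_mels)
```

For two seconds of 16 kHz audio this yields two examples of shape 96 × 64, matching the framing described in the text.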
In one embodiment, extracting the word features of the video to be described means obtaining, for the words generated at previous time steps, w_{n-1} = (w_1, ..., w_{n-1}), their embeddings from a lookup table pre-trained with fastText on the public Common Crawl data set, so that each previous word can be represented by a d_w-dimensional vector.
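As a toy stand-in for the pretrained lookup table (a real system would load fastText vectors trained on Common Crawl; the 300-dimensional width and random initialization here are illustrative assumptions):

```python
import numpy as np

def build_lookup(vocab, d_w=300, seed=0):
    """Hypothetical lookup table: one d_w-dimensional vector per word."""
    rng = np.random.default_rng(seed)
    return {w: rng.standard_normal(d_w) for w in vocab}

def word_features(prev_words, table):
    """Map each previously generated word w_1..w_{n-1} to its vector."""
    return np.stack([table[w] for w in prev_words])
```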
And S102, coding the visual characteristics and the auditory characteristics through a multi-mode attention mechanism main body model of the video description generation system to obtain visual coding characteristics and auditory coding characteristics.
The extracted visual features φ_v and auditory features φ_a are encoded by the multi-modal attention mechanism main body model to obtain the visual coding features (denoted v = VE_θv(φ_v)) and the auditory coding features (denoted a = AE_θa(φ_a)).
In an embodiment, the multi-modal attention mechanism main body model includes a visual feature encoder and an auditory feature encoder, and encoding the visual features and the auditory features respectively to obtain visual coding features and auditory coding features specifically comprises: performing multi-head attention calculation on the visual features through the visual feature encoder to obtain visual multi-head attention features, and performing multi-head attention calculation on the auditory features through the auditory feature encoder to obtain auditory multi-head attention features; performing multi-modal attention calculation on the visual and auditory multi-head attention features through the visual feature encoder to obtain visual features fused with auditory attention, and performing multi-modal attention calculation on the auditory and visual multi-head attention features through the auditory feature encoder to obtain auditory features fused with visual attention; and applying, in each encoder in turn, a first sublayer regularization, a feed-forward calculation, and a second sublayer regularization to the fused features, thereby obtaining the visual coding features output by the visual feature encoder and the auditory coding features output by the auditory feature encoder.
With continued reference to fig. 2, the left dashed-line frame of fig. 2 is a schematic structural diagram of the visual and auditory feature encoders of the multi-modal attention mechanism main body model. The encoding layer of each encoder comprises five sublayers: the first layer is Multi-head Attention, the second layer is Multi-modal Attention (a variant of multi-head attention), the third layer is a first layer regularization (layer normalization) layer, the fourth layer is a feed-forward neural network, and the fifth layer is a second layer regularization layer.
The extracted visual features φ_v and auditory features φ_a are encoded by the visual feature encoder VE_θv and the auditory feature encoder AE_θa respectively, giving the visual coding features v = VE_θv(φ_v) and the auditory coding features a = AE_θa(φ_a).
Specifically, the visual features φ_v and the auditory features φ_a are input into the visual and auditory feature encoders. In the corresponding encoding layer of each encoder, the input features first pass through the multi-head attention layer, whose outputs are the visual multi-head attention features V_self and the auditory multi-head attention features A_self (these serve as the queries Q of the next layer). These outputs are then passed to the corresponding multi-modal attention layer, which takes the other modality as keys K and values V, giving the visual features fused with auditory attention, V_mm = MultiHeadAttention(V_self, A_self, A_self), and the auditory features fused with visual attention, A_mm = MultiHeadAttention(A_self, V_self, V_self). Next, the multi-modal attention outputs are passed through the corresponding first layer regularization layer; the result is fed to the corresponding feed-forward neural network layer for feed-forward calculation; and finally the feed-forward output is passed through the corresponding second layer regularization layer. The complete encoder layer is stacked N times, so that the final outputs are the visual coding features v = VE_θv(φ_v) and the auditory coding features a = AE_θa(φ_a), where VE and AE denote the visual and auditory feature encoders and θ_v, θ_a denote their respective parameter spaces. It should be noted that, to alleviate problems such as vanishing gradients, residual connections are added between the input of the encoding layer and the first layer regularization layer, and between the first layer regularization layer and the second layer regularization layer.
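The encoder-layer ordering in the text — self-attention, then cross-modal attention with the other modality supplying keys and values, then two layer-regularization sublayers around a feed-forward network, with residuals — can be sketched with single-head attention and weight-free toy sublayers. A real encoder would use learned projection matrices and H heads; here ReLU stands in for the feed-forward network and both modalities are assumed to share one feature dimension:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(q, k, v):
    # Single-head scaled dot-product attention stand-in.
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def encoder_layer(phi_v, phi_a):
    """One visual/auditory encoder layer pair: self-attention ->
    cross-modal attention -> layer regularization -> feed-forward ->
    layer regularization, with residual connections."""
    V_self = attention(phi_v, phi_v, phi_v)   # visual self-attention
    A_self = attention(phi_a, phi_a, phi_a)   # auditory self-attention
    V_mm = attention(V_self, A_self, A_self)  # fuses auditory attention
    A_mm = attention(A_self, V_self, V_self)  # fuses visual attention
    v = layer_norm(V_self + V_mm)             # residual + first layer norm
    a = layer_norm(A_self + A_mm)
    v = layer_norm(v + np.maximum(v, 0))      # FFN stand-in + second norm
    a = layer_norm(a + np.maximum(a, 0))
    return v, a
```

Stacking this function N times over its own outputs mirrors the stacked encoder described above.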
Among the coding layers of the visual and auditory feature encoders, multi-head attention is the most important transformation mapping. Multi-head attention is built from Scaled Dot-Product Attention, whose formula is:
Attention(Q, K, V) = softmax(Q K^T / √d_k) V
where √d_k is a scaling factor and Q, K, V are the sequences of queries, keys, and values, respectively.
Multi-head attention first passes the queries Q, keys K, and values V through linear transformations W_i^Q, W_i^K, W_i^V and feeds the results into scaled dot-product attention. This is repeated H times (H being the number of heads), each time with different linear transformation parameter matrices. The H scaled dot-product attention results are concatenated and subjected to one more linear transformation W_out, whose result is the output of multi-head attention. The specific formulas are:
head_i(Q, K, V) = Attention(Q W_i^Q, K W_i^K, V W_i^V)
MultiHeadAttention(Q, K, V) = [head_1(Q, K, V), ..., head_H(Q, K, V)] W_out
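The two formulas can be written directly in NumPy. The head count, matrix dimensions, and the `head_weights` container are illustrative choices, not prescribed by the text:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

def multi_head_attention(Q, K, V, head_weights, W_out):
    """H heads, each with its own projections (W_q, W_k, W_v); the H
    scaled-dot-product results are concatenated and projected once
    more by W_out. `head_weights` is a list of (W_q, W_k, W_v)."""
    heads = [scaled_dot_product_attention(Q @ Wq, K @ Wk, V @ Wv)
             for Wq, Wk, Wv in head_weights]
    return np.concatenate(heads, axis=-1) @ W_out
```

With the keys and values taken from the other modality, this same function realizes the multi-modal attention variant V_mm = MultiHeadAttention(V_self, A_self, A_self) described earlier.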
step S103, processing the visual coding features and the auditory coding features through an auxiliary model of the video description generation system to generate target auxiliary features.
The visual coding features and the auditory coding features may then be processed through an auxiliary model of the video description generation system to generate target auxiliary features.
In an embodiment, the auxiliary models include a scene classification auxiliary model and a keyword evaluation auxiliary model; the processing the visual coding features and the auditory coding features through the auxiliary model of the video description generation system to generate target auxiliary features specifically includes: inputting the visual coding features into the scene classification auxiliary model for processing to obtain first auxiliary features output by the scene classification auxiliary model, and inputting the auditory coding features into the keyword evaluation auxiliary model for processing to obtain second auxiliary features output by the keyword evaluation auxiliary model; and generating a target auxiliary feature according to the first auxiliary feature and the second auxiliary feature.
With continued reference to FIG. 2, the auxiliary models of the video description generation system comprise a scene classification auxiliary model and a keyword evaluation auxiliary model. The visual coding features v output by the visual feature encoder are input into the scene classification auxiliary model, which outputs the first auxiliary features m_v; the auditory coding features a output by the auditory feature encoder are input into the keyword evaluation auxiliary model, which outputs the second auxiliary features m_a. The target auxiliary features m are then generated from the first and second auxiliary features.
In an embodiment, the inputting the visual coding feature into the scene classification auxiliary model for processing to obtain a first auxiliary feature output by the scene classification auxiliary model specifically includes: inputting the visual coding features into the scene classification auxiliary model, and performing linear transformation on the visual coding features; carrying out nonlinear mapping on the vision coding features after linear transformation through a linear rectification function to obtain vision coding feature mapping; performing linear transformation on the visual coding feature mapping; and performing softmax logistic regression calculation on the vision coding feature mapping after the linear transformation to obtain a first auxiliary feature output by the scene classification auxiliary model.
As shown in fig. 3, fig. 3 is an architecture diagram of the scene classification auxiliary model, which comprises four sublayers: the first is a linear transformation layer (Linear), the second a linear rectification function (ReLU activation), the third a second linear transformation layer, and the fourth a Softmax logistic regression layer.
After the visual coding features v output by the visual feature encoder are input into the scene classification auxiliary model, the model first passes v through the first linear transformation layer to obtain its output; this output is fed to the linear rectification function for nonlinear mapping, yielding the visual coding feature mapping; the mapping is fed to the second linear transformation layer for linear transformation; and finally the output of the second linear transformation layer is passed through the Softmax logistic regression layer, which produces the probability scores m_v over the preset scene classes. These scores are the final output of the scene classification auxiliary model.
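The four sublayers amount to a small classification head. A hedged NumPy sketch follows; the weight shapes and the treatment of v as a single pooled vector are assumptions, since the text does not specify how the feature sequence is reduced:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def scene_classification_head(v, W1, b1, W2, b2):
    """Linear -> ReLU -> Linear -> Softmax over preset scene classes,
    mirroring the four sublayers above. `v` is a (pooled) visual
    coding feature vector."""
    h = np.maximum(v @ W1 + b1, 0)   # linear transform + ReLU mapping
    logits = h @ W2 + b2             # second linear transform
    return softmax(logits)           # probability score per scene class
```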
in an embodiment, the inputting the auditory coding features into the keyword assessment assistant model for processing to obtain second assistant features output by the keyword assessment assistant model specifically includes: inputting the auditory coding features into the keyword evaluation auxiliary model, and performing linear transformation on the auditory coding features; carrying out nonlinear mapping on the hearing coding features after linear transformation through a linear rectification function to obtain hearing coding feature mapping; performing a linear transformation on the auditory coding feature map;
calculating the auditory coding feature mapping after linear transformation through a Sigmoid function to obtain the posterior probability of each keyword in a dictionary; performing maximum pooling on the posterior probability of each keyword to obtain a score of each keyword; ranking the scores of the keywords, and selecting a preset number of keywords according to the order of the scores from large to small so as to search the indexes of the selected keywords in a dictionary; and combining the searched indexes to obtain a second auxiliary characteristic output by the keyword evaluation auxiliary module.
As shown in fig. 4, fig. 4 is a schematic diagram of a keyword evaluation auxiliary model, where the keyword evaluation auxiliary model includes six sublayers, a first layer is a first linear transformation layer, a second layer is an activation function (a linear rectification function), a third layer is a second linear transformation layer, a fourth layer is a Sigmoid function, a fifth layer is a maximum pooling layer, and a sixth layer is a sorting & selecting layer.
After the auditory coding feature a output by the auditory feature encoder is input into the keyword evaluation auxiliary model, the model first passes a through the first linear transformation layer to obtain that layer's output; this output is fed to the activation function for calculation, i.e. the linearly transformed auditory coding features undergo nonlinear mapping through the linear rectification function to obtain the auditory coding feature mapping; the feature mapping is then passed through the second linear transformation layer to obtain the output of the second linear transformation layer; that output is fed into the Sigmoid function, which yields the posterior probability Z of each keyword in the dictionary; the posterior probabilities are then input to the maximum pooling layer for keyword evaluation, giving the keyword score P(Z_C|a) output by the maximum pooling layer, where P(Z_C|a) = max_t P(Z_{C,t}|a); finally, the keyword scores P(Z_C|a) are input to the sorting & selecting layer, which sorts them and takes the dictionary indexes of the first K keywords in descending order of score to form the output m_a of the keyword evaluation auxiliary model, where K represents a preset number that can be set flexibly according to the actual situation.
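The Sigmoid, maximum pooling and sorting & selecting steps above can be sketched as follows (a hypothetical toy example; the per-step keyword logits stand in for the output of the second linear transformation layer):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def keyword_scores(logits_per_step):
    """logits_per_step: one list of per-keyword logits per time step.
    Sigmoid gives P(Z_{C,t}|a); max pooling over t gives P(Z_C|a)."""
    probs = [[sigmoid(z) for z in step] for step in logits_per_step]
    n_kw = len(probs[0])
    return [max(step[c] for step in probs) for c in range(n_kw)]

def top_k_indices(scores, k):
    """Sorting & selecting layer: dictionary indexes of the k highest-scoring
    keywords, in descending order of score (this forms m_a)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])[:k]
```

For example, with two time steps and three dictionary keywords, the highest pooled score wins the first slot of the index list.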
In an embodiment, the generating a target assist feature from the first assist feature and the second assist feature comprises: carrying out keyword embedding processing and linear transformation on the second auxiliary features in sequence to obtain second auxiliary features with reduced feature dimensions; and splicing the second auxiliary features with the reduced feature dimensions with the first auxiliary features to obtain target auxiliary features.
With continued reference to fig. 2, the first auxiliary feature m_v output by the scene classification auxiliary model and the second auxiliary feature m_a output by the keyword evaluation auxiliary model are spliced: the second auxiliary feature m_a first undergoes keyword embedding processing and linear transformation in sequence to reduce its feature dimension, and the dimension-reduced second auxiliary feature is then spliced with the first auxiliary feature m_v to obtain the target auxiliary feature m.
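The splicing described above might look like the following sketch (the embedding table and reduction matrix are hypothetical placeholders for the learned parameters):

```python
def embed_and_reduce(indices, embedding_table, W):
    """Keyword embedding followed by a linear transformation that reduces
    the feature dimension of the second auxiliary feature m_a."""
    emb = [embedding_table[i] for i in indices]   # one embedding per keyword index
    flat = [v for e in emb for v in e]            # flatten the K embeddings
    # W is a list of output columns; each column has len(flat) weights
    return [sum(x * w for x, w in zip(flat, col)) for col in W]

def target_auxiliary_feature(m_v, m_a, embedding_table, W):
    """Splice (concatenate) the reduced m_a with m_v to form m, as in fig. 2."""
    reduced = embed_and_reduce(m_a, embedding_table, W)
    return m_v + reduced
```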
And step S104, decoding the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-modal attention mechanism main body model to obtain the posterior probability of each keyword, and selecting a decoding word from the keywords according to these posterior probabilities.
In the decoding stage, when decoding the nth word, the visual coding features, the auditory coding features, the target auxiliary features and the word features are decoded through the multi-modal attention mechanism main body model to obtain the posterior probability P(W_n | v, a, m, W_{n−1}) of the nth word finally output by the model, where D denotes the text decoder and θ_d represents its parameter space.
In an embodiment, the multi-modal attention mechanism main body model includes a text decoder, and the decoding of the visual coding features, the auditory coding features, the target auxiliary features and the word features by the multi-modal attention mechanism main body model to obtain the posterior probability of each keyword specifically includes: sequentially performing multi-head attention calculation and layer regularization on the word features through the text decoder to obtain word layer regularization features; performing multi-modal attention calculation on the word layer regularization features and the visual coding features to obtain word features fused with visual attention, and performing multi-modal attention calculation on the word layer regularization features and the auditory coding features to obtain word features fused with auditory attention; bridging the word features fused with visual attention and the word features fused with auditory attention to obtain bridging word features; performing layer regularization on the bridging word features, and performing multi-head attention calculation on the layer-regularized bridging word features and the target auxiliary features to obtain word features fused with the target auxiliary features; sequentially performing first-layer regularization, feedforward calculation and second-layer regularization on the word features fused with the target auxiliary features to obtain the output of the text decoder; and sequentially performing linear transformation and Softmax logistic regression calculation on the output of the text decoder to obtain the posterior probability of each keyword.
As shown in the dashed box on the right of fig. 2, this dashed box is a structural diagram of the text decoder, which includes nine sub-layers: the first layer is a first multi-head attention layer, the second layer is a first regularization layer, the third layer is two different multi-modal attention layers, MultiHeadAttention(W_self, v, v) and MultiHeadAttention(W_self, a, a), the fourth layer is a bridging layer, the fifth layer is a second regularization layer, the sixth layer is a second multi-head attention layer MultiHeadAttention(W_norm, m, m), the seventh layer is a third regularization layer, the eighth layer is a feedforward neural network layer, and the ninth layer is a fourth regularization layer.
For the word features, the video description generation system inputs them into the text decoder. At the decoding layer, the word features first enter the first multi-head attention layer for multi-head attention calculation, producing the output of the first multi-head attention layer; this output is input to the first regularization layer for layer regularization, giving the output W_self of the first regularization layer, i.e. the word layer regularization features. W_self is then input into the two different multi-modal attention layers to be fused with the visual coding features and the auditory coding features respectively: multi-modal attention calculation of W_self with the visual coding features yields the word features fused with visual attention, and multi-modal attention calculation of W_self with the auditory coding features yields the word features fused with auditory attention. The outputs of the two multi-modal attention layers are then input into the bridging layer to be bridged, producing the bridging word features (the shape is converted from 2d_w × (n−1) to d_w × (n−1)); the bridging layer's output is input to the second regularization layer for layer regularization, giving the output W_norm of the second regularization layer. W_norm is further input to the second multi-head attention layer MultiHeadAttention(W_norm, m, m), where the word features and the target auxiliary features are fused: multi-head attention calculation of W_norm with the target auxiliary features yields the word features fused with the target auxiliary features. The output of the second multi-head attention layer is input to the third regularization layer for layer regularization; the result is input to the corresponding feedforward neural network layer for feedforward calculation; and the feedforward output is input to the fourth regularization layer for layer regularization, whose output is the output of the text decoder and serves as the output of the multi-modal attention mechanism main body model. It should be noted that a residual connection is added between the input layer of the decoding layer and the first regularization layer, between the first and second regularization layers, between the second and third regularization layers, and between the third and fourth regularization layers.
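The multi-modal attention layers described above are instances of scaled dot-product attention; a minimal single-head sketch (hypothetical toy vectors, without the learned query/key/value projections of a full multi-head layer) is:

```python
import math

def softmax(x):
    m = max(x)
    e = [math.exp(v - m) for v in x]
    s = sum(e)
    return [v / s for v in e]

def attention(queries, keys, values):
    """Scaled dot-product attention. With word-feature queries and visual (or
    auditory) coding features as keys and values, each output row is a word
    feature fused with visual (or auditory) attention."""
    d = len(keys[0])
    out = []
    for q in queries:
        # similarity of the query to every key, scaled by sqrt(d)
        scores = softmax([sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                          for k in keys])
        # attention-weighted sum of the value vectors
        out.append([sum(w * v[j] for w, v in zip(scores, values))
                    for j in range(len(values[0]))])
    return out
```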
Continuing to refer to fig. 2, the output of the multi-modal attention mechanism main body model is subjected to linear transformation, and the result of the linear transformation is passed through Softmax logistic regression calculation, finally yielding the output of the video description generation system, namely the posterior probability P(W_n | v, a, m, W_{n−1}) of the nth keyword. It can be understood that the higher the posterior probability, the better the corresponding keyword matches the video content to be described, and the keyword with the highest posterior probability is determined as the decoding word.
And step S105, generating the video description of the video to be described according to the decoding words.
Because the video description is natural language formed by a sequence of decoding words, steps S101 to S104 are repeated at each decoding step: decoding words are generated in turn to form the decoding-word sequence, and the video description of the video to be described is generated from this sequence.
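The repeated decoding described above amounts to greedy autoregressive generation, which can be sketched as follows (`step_fn` is a hypothetical stand-in for one full pass through the multi-modal model, mapping the words generated so far to posterior probabilities over the vocabulary):

```python
def generate_description(step_fn, vocabulary, max_len=20, eos="<eos>"):
    """Greedy autoregressive decoding: repeatedly run the decode step, append
    the highest-posterior word, and stop at an end-of-sequence token or a
    length limit."""
    words = []
    for _ in range(max_len):
        posteriors = step_fn(words)
        word = vocabulary[max(range(len(posteriors)), key=lambda i: posteriors[i])]
        if word == eos:
            break
        words.append(word)
    return " ".join(words)
```

A scripted `step_fn` that always puts all probability mass on the next scripted word reproduces that word sequence exactly.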
In summary, based on the visual features, the auditory features, and the word features (W_{n−1} = (w_1, …, w_{n−1})), the current word w_n is generated, so as to produce a complete word sequence (w_1, …, w_n) that describes the content of the video to be described.
The video description generation method provided by this embodiment first obtains a video to be described and extracts its visual features, auditory features and word features; then encodes the visual features and the auditory features through the multi-modal attention mechanism main body model of the video description generation system to obtain visual coding features and auditory coding features; processes the visual coding features and the auditory coding features through the auxiliary models of the video description generation system to generate target auxiliary features; further decodes the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-modal attention mechanism main body model to obtain the posterior probability of each keyword, selecting decoding words according to those posterior probabilities; and finally generates the video description of the video to be described from the decoding words. The video description generation system realizes the fusion of visual and auditory features through the multi-modal attention mechanism main body model and the addition of auxiliary features through the auxiliary models, providing rich features for video description generation and laying a foundation for accurately selecting words that match the video's scenes and events, thereby improving video description accuracy.
Referring to fig. 5, fig. 5 is a schematic block diagram of a video description generating apparatus according to an embodiment of the present application.
As shown in fig. 5, the video description generating apparatus 400 includes: an extraction module 401, an encoding module 402, a target assistant feature generation module 403, a decoding module 404, and a video description generation module 405.
The extraction module 401 is configured to acquire a video to be described, and extract visual features, auditory features, and word features of the video to be described;
the encoding module 402 is configured to encode the visual features and the auditory features through a multi-modal attention mechanism main body model of the video description generation system, so as to obtain visual encoding features and auditory encoding features;
a target assistant feature generation module 403, configured to process the visual coding features and the auditory coding features through an assistant model of the video description generation system to generate target assistant features;
a decoding module 404, configured to decode the visual coding features, the auditory coding features, the target auxiliary features, and the word features through the multi-modal attention mechanism main body model to obtain posterior probabilities of the keywords, and select a decoded word from the keywords according to the posterior probabilities of the keywords;
a video description generating module 405, configured to generate a video description of the video to be described according to the decoded word.
It should be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the apparatus and each module and unit described above may refer to the corresponding processes in the foregoing video description generation method embodiment, and are not described herein again.
The apparatus provided by the above embodiments may be implemented in the form of a computer program, which can be run on a computer device as shown in fig. 6.
Referring to fig. 6, fig. 6 is a schematic block diagram illustrating a structure of a computer device according to an embodiment of the present disclosure. The computer device may be a Personal Computer (PC), a server, or the like having a data processing function.
As shown in fig. 6, the computer device includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any of the video description generation methods.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for running a computer program in the non-volatile storage medium, which when executed by the processor causes the processor to perform any of the video description generation methods.
The network interface is used for network communication, such as sending assigned tasks and the like. Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and is not intended to limit the computing devices to which the disclosed aspects apply, as particular computing devices may include more or less components than those shown, or may combine certain components, or have a different arrangement of components.
It should be understood that the processor may be a Central Processing Unit (CPU) or another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
acquiring a video to be described, and extracting visual features, auditory features and word features of the video to be described; respectively coding the visual features and the auditory features through a multi-mode attention mechanism main body model of a video description generation system to obtain visual coding features and auditory coding features; processing the visual coding features and the auditory coding features through an auxiliary model of the video description generation system to generate target auxiliary features; decoding the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-mode attention mechanism main body model to obtain the posterior probability of each keyword, and selecting decoding words from each keyword according to the posterior probability of each keyword; and generating the video description of the video to be described according to the decoding words.
In some embodiments, the multi-modal attention mechanism principal model includes a visual feature encoder and an auditory feature encoder, and the processor implements the encoding of the visual features and the auditory features by the multi-modal attention mechanism principal model of the video description generation system, respectively, to obtain visual coding features and auditory coding features, including:
performing multi-head attention calculation on the visual features through the visual feature encoder to obtain visual multi-head attention features, and performing multi-head attention calculation on the auditory features through the auditory feature encoder to obtain auditory multi-head attention features;
performing multi-modal attention calculation on the visual multi-head attention feature and the auditory multi-head attention feature through the visual feature encoder to obtain a visual feature fused with auditory attention, and performing multi-modal attention calculation on the auditory multi-head attention feature and the visual multi-head attention feature through the auditory feature encoder to obtain an auditory feature fused with visual attention;
performing, through the visual feature encoder, first sublayer regularization, feedforward calculation and second sublayer regularization in sequence on the visual features fused with auditory attention, to obtain the visual coding features output by the visual feature encoder; and performing, through the auditory feature encoder, first sublayer regularization, feedforward calculation and second sublayer regularization in sequence on the auditory features fused with visual attention, to obtain the auditory coding features output by the auditory feature encoder.
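The layer regularization applied throughout the encoder and decoder normalizes each feature vector; a minimal sketch (omitting the learned gain and bias usually present in layer normalization) is:

```python
import math

def layer_norm(x, eps=1e-5):
    """Layer regularization: normalize a feature vector to zero mean and
    (approximately) unit variance, stabilizing the sublayer outputs."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]
```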
In some embodiments, the auxiliary models include a scene classification auxiliary model and a keyword evaluation auxiliary model, and the processor implements the auxiliary models by the video description generation system to process the visual coding features and the auditory coding features to generate target auxiliary features, including:
inputting the visual coding features into the scene classification auxiliary model for processing to obtain first auxiliary features output by the scene classification auxiliary model, and inputting the auditory coding features into the keyword evaluation auxiliary model for processing to obtain second auxiliary features output by the keyword evaluation auxiliary model;
and generating a target auxiliary feature according to the first auxiliary feature and the second auxiliary feature.
In some embodiments, the inputting the visual coding features into the scene classification assistant model for processing by the processor to obtain the first assistant features output by the scene classification assistant model includes:
inputting the visual coding features into the scene classification auxiliary model, and performing linear transformation on the visual coding features;
carrying out nonlinear mapping on the vision coding features after linear transformation through a linear rectification function to obtain vision coding feature mapping;
performing linear transformation on the visual coding feature mapping;
and performing softmax logistic regression calculation on the vision coding feature mapping after the linear transformation to obtain a first auxiliary feature output by the scene classification auxiliary model.
In some embodiments, the processor implements the inputting of the auditory coding features into the keyword assessment assistant model for processing to obtain second assistant features output by the keyword assessment assistant model, and the method includes:
inputting the auditory coding features into the keyword evaluation auxiliary model, and performing linear transformation on the auditory coding features;
carrying out nonlinear mapping on the hearing coding features after linear transformation through a linear rectification function to obtain hearing coding feature mapping;
performing a linear transformation on the auditory coding feature map;
calculating the auditory coding feature mapping after linear transformation through a Sigmoid function to obtain the posterior probability of each keyword in a dictionary;
performing maximum pooling on the posterior probability of each keyword to obtain a score of each keyword;
ranking the scores of the keywords, and selecting a preset number of keywords according to the order of the scores from large to small so as to search the indexes of the selected keywords in a dictionary;
and combining the searched indexes to obtain a second auxiliary characteristic output by the keyword evaluation auxiliary module.
In some embodiments, the processor implements the generating a target assist feature from the first assist feature and the second assist feature, including:
carrying out keyword embedding processing and linear transformation on the second auxiliary features in sequence to obtain second auxiliary features with reduced feature dimensions;
and splicing the second auxiliary features with the reduced feature dimensions with the first auxiliary features to obtain target auxiliary features.
In some embodiments, the multi-modal attention mechanism principal model comprises a text decoder, and the processor implements the decoding of the visual coding features, the auditory coding features, the target assist features, and the word features by the multi-modal attention mechanism principal model to obtain a posterior probability of each keyword, including:
sequentially performing multi-head attention calculation and layer regularization on the word features through the text decoder to obtain word layer regularization features;
performing multi-mode attention calculation on the word layer regularization features and the visual coding features to obtain word features fused with visual attention, and performing multi-mode attention calculation on the word layer regularization features and the auditory coding features to obtain word features fused with auditory attention;
bridging the word features fused with the visual attention and the word features fused with the auditory attention to obtain bridging word features;
performing layer regularization on the bridge word features, and performing multi-head attention calculation on the bridge word features after the layer regularization and the target auxiliary features to obtain word features fused with the target auxiliary features;
sequentially carrying out first-layer regularization, feedforward calculation and second-layer regularization on the word features fused with the target auxiliary features to obtain the output of the text decoder;
and sequentially carrying out linear transformation and Softmax logistic regression calculation on the output of the text decoder to obtain the posterior probability of each keyword.
Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, where the computer program includes program instructions, and a method implemented when the program instructions are executed may refer to various embodiments of a video description generation method of the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device.
Further, the computer-readable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created according to the use of the blockchain node, and the like.
The block chain is a novel application mode of computer technologies such as distributed data storage, point-to-point transmission, a consensus mechanism, an encryption algorithm and the like. A block chain (Blockchain), which is essentially a decentralized database, is a series of data blocks associated by using a cryptographic method, and each data block contains information of a batch of network transactions, so as to verify the validity (anti-counterfeiting) of the information and generate a next block. The blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or system that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or system. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other like elements in a process, method, article, or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments. While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A method of video description generation, the method comprising the steps of:
acquiring a video to be described, and extracting visual features, auditory features and word features of the video to be described;
respectively coding the visual features and the auditory features through a multi-mode attention mechanism main body model of a video description generation system to obtain visual coding features and auditory coding features;
processing the visual coding features and the auditory coding features through an auxiliary model of the video description generation system to generate target auxiliary features;
decoding the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-mode attention mechanism main body model to obtain the posterior probability of each keyword, and selecting decoding words from each keyword according to the posterior probability of each keyword;
and generating the video description of the video to be described according to the decoding words.
2. The video description generation method according to claim 1, wherein the multi-modal attention mechanism principal model includes a visual feature encoder and an auditory feature encoder;
the method for obtaining the visual coding features and the auditory coding features by respectively coding the visual features and the auditory features through a multi-mode attention mechanism main body model of a video description generation system comprises the following steps:
performing multi-head attention calculation on the visual features through the visual feature encoder to obtain visual multi-head attention features, and performing multi-head attention calculation on the auditory features through the auditory feature encoder to obtain auditory multi-head attention features;
performing multi-modal attention calculation on the visual multi-head attention feature and the auditory multi-head attention feature through the visual feature encoder to obtain a visual feature fused with auditory attention, and performing multi-modal attention calculation on the auditory multi-head attention feature and the visual multi-head attention feature through the auditory feature encoder to obtain an auditory feature fused with visual attention;
performing, through the visual feature encoder, first sublayer regularization, feedforward calculation and second sublayer regularization in sequence on the visual features fused with auditory attention, to obtain the visual coding features output by the visual feature encoder; and performing, through the auditory feature encoder, first sublayer regularization, feedforward calculation and second sublayer regularization in sequence on the auditory features fused with visual attention, to obtain the auditory coding features output by the auditory feature encoder.
3. The video description generation method according to claim 1, wherein the auxiliary models include a scene classification auxiliary model and a keyword evaluation auxiliary model;
the processing, by an assistant model of the video description generation system, the visually encoded features and the aurally encoded features to generate target assistant features, comprising:
inputting the visual coding features into the scene classification auxiliary model for processing to obtain first auxiliary features output by the scene classification auxiliary model, and inputting the auditory coding features into the keyword evaluation auxiliary model for processing to obtain second auxiliary features output by the keyword evaluation auxiliary model;
and generating a target auxiliary feature according to the first auxiliary feature and the second auxiliary feature.
4. The method according to claim 3, wherein the inputting the visual coding features into the scene classification assistant model for processing to obtain the first assistant features output by the scene classification assistant model comprises:
inputting the visual coding features into the scene classification auxiliary model, and performing linear transformation on the visual coding features;
performing nonlinear mapping on the linearly transformed visual coding features through a linear rectification function to obtain a visual coding feature mapping;
performing a linear transformation on the visual coding feature mapping;
and performing softmax logistic regression calculation on the linearly transformed visual coding feature mapping to obtain the first auxiliary feature output by the scene classification auxiliary model.
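The scene classification auxiliary model of claim 4 is a two-layer perceptron head: linear transformation, ReLU nonlinear mapping, a second linear transformation, then softmax logistic regression. A minimal numpy sketch; the sizes (`d`, `hidden`, `n_scenes`) and weights are hypothetical:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def scene_head(vis_enc, W1, b1, W2, b2):
    # Linear transformation of the visual coding features.
    h = vis_enc @ W1 + b1
    # Nonlinear mapping via the linear rectification (ReLU) function.
    h = np.maximum(h, 0)
    # Second linear transformation down to the number of scene classes.
    logits = h @ W2 + b2
    # Softmax logistic regression: per-scene probabilities
    # serve as the first auxiliary feature.
    return softmax(logits)

rng = np.random.default_rng(1)
d, hidden, n_scenes = 8, 16, 4        # hypothetical sizes
W1, b1 = rng.standard_normal((d, hidden)) * 0.1, np.zeros(hidden)
W2, b2 = rng.standard_normal((hidden, n_scenes)) * 0.1, np.zeros(n_scenes)
vis_enc = rng.standard_normal(d)      # a pooled visual coding feature
probs = scene_head(vis_enc, W1, b1, W2, b2)
print(probs.shape)                    # (4,)
```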
5. The method of claim 3, wherein the inputting the auditory coding features into the keyword assessment assistant model for processing to obtain the second assistant features output by the keyword assessment assistant model comprises:
inputting the auditory coding features into the keyword evaluation auxiliary model, and performing linear transformation on the auditory coding features;
performing nonlinear mapping on the linearly transformed auditory coding features through a linear rectification function to obtain an auditory coding feature mapping;
performing a linear transformation on the auditory coding feature map;
calculating the auditory coding feature mapping after linear transformation through a Sigmoid function to obtain the posterior probability of each keyword in a dictionary;
performing maximum pooling on the posterior probability of each keyword to obtain a score of each keyword;
ranking the scores of the keywords, selecting a preset number of keywords in descending order of score, and looking up the indexes of the selected keywords in the dictionary;
and combining the looked-up indexes to obtain the second auxiliary feature output by the keyword evaluation auxiliary model.
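The keyword evaluation auxiliary model of claim 5 chains a linear transformation, ReLU nonlinear mapping, a linear projection to dictionary size, Sigmoid posteriors per keyword, max pooling over time, and top-k index selection. A hedged numpy sketch; the dictionary size, hidden width and `k` are invented for illustration:

```python
import numpy as np

def keyword_head(aud_enc, W1, W2, k=3):
    # Linear transformation, then ReLU nonlinear mapping.
    h = np.maximum(aud_enc @ W1, 0)
    # Second linear transformation to dictionary size, then Sigmoid:
    # per-frame posterior probability of each keyword.
    post = 1.0 / (1.0 + np.exp(-(h @ W2)))   # (frames, vocab)
    # Max pooling over time yields one score per keyword.
    scores = post.max(axis=0)                # (vocab,)
    # Rank the scores and take the dictionary indexes of the top-k keywords.
    top_idx = np.argsort(scores)[::-1][:k]
    return scores, top_idx

rng = np.random.default_rng(2)
d, hidden, vocab = 8, 16, 10                 # hypothetical sizes
W1 = rng.standard_normal((d, hidden)) * 0.1
W2 = rng.standard_normal((hidden, vocab)) * 0.1
aud_enc = rng.standard_normal((7, d))        # 7 auditory coding frames
scores, top_idx = keyword_head(aud_enc, W1, W2, k=3)
print(top_idx.shape)                         # (3,)
```

The Sigmoid (rather than softmax) treats each keyword as an independent binary detection, so several keywords can score highly at once; max pooling keeps the best evidence from any frame.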
6. The video description generation method according to claim 3, wherein the generating a target assist feature from the first assist feature and the second assist feature comprises:
carrying out keyword embedding processing and linear transformation on the second auxiliary features in sequence to obtain second auxiliary features with reduced feature dimensions;
and splicing the second auxiliary features with the reduced feature dimensions with the first auxiliary features to obtain target auxiliary features.
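Claim 6 reduces the second auxiliary feature by keyword embedding plus a linear transformation, then splices it with the first auxiliary feature. A rough numpy sketch; the embedding table, projection and all dimensions are hypothetical:

```python
import numpy as np

def target_auxiliary(first_aux, keyword_idx, embed_table, W_proj):
    # Keyword embedding: look up each selected index in an embedding table.
    emb = embed_table[keyword_idx]            # (k, embed_dim)
    # Linear transformation reduces the feature dimension; flatten to a vector.
    reduced = (emb @ W_proj).reshape(-1)      # (k * reduced_dim,)
    # Splice (concatenate) with the first auxiliary feature.
    return np.concatenate([first_aux, reduced])

rng = np.random.default_rng(3)
vocab, embed_dim, reduced_dim = 10, 16, 4     # hypothetical sizes
embed_table = rng.standard_normal((vocab, embed_dim))
W_proj = rng.standard_normal((embed_dim, reduced_dim)) * 0.1
first_aux = rng.standard_normal(4)            # e.g. 4 scene probabilities
target = target_auxiliary(first_aux, np.array([2, 5, 7]), embed_table, W_proj)
print(target.shape)                           # (16,) = 4 + 3 * 4
```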
7. The video description generation method of claim 1, wherein the multi-modal attention mechanism main body model comprises a text decoder;
the decoding of the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-modal attention mechanism main body model to obtain the posterior probability of each keyword comprises:
sequentially performing multi-head attention calculation and layer regularization on the word features through the text decoder to obtain word layer regularization features;
performing multi-modal attention calculation on the word layer regularization features and the visual coding features to obtain word features fused with visual attention, and performing multi-modal attention calculation on the word layer regularization features and the auditory coding features to obtain word features fused with auditory attention;
bridging the word features fused with the visual attention and the word features fused with the auditory attention to obtain bridging word features;
performing layer regularization on the bridge word features, and performing multi-head attention calculation on the bridge word features after the layer regularization and the target auxiliary features to obtain word features fused with the target auxiliary features;
sequentially carrying out first-layer regularization, feedforward calculation and second-layer regularization on the word features fused with the target auxiliary features to obtain the output of the text decoder;
and sequentially carrying out linear transformation and Softmax logistic regression calculation on the output of the text decoder to obtain the posterior probability of each keyword.
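Claim 7 ends with a linear transformation and Softmax logistic regression over the text decoder output to obtain the per-keyword posterior, from which the decoding word is selected. A minimal sketch; greedy argmax selection, the toy dictionary and all weights are assumptions (the claim does not fix the selection rule, and beam search would be an alternative):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def select_word(decoder_out, W_out, b_out, dictionary):
    # Linear transformation of the decoder output to dictionary size,
    # then Softmax logistic regression: posterior probability per keyword.
    posterior = softmax(decoder_out @ W_out + b_out)
    # Greedy selection: the keyword with the highest posterior
    # becomes the next decoding word.
    return dictionary[int(np.argmax(posterior))], posterior

rng = np.random.default_rng(4)
d = 8                                         # hypothetical decoder width
dictionary = ["a", "dog", "runs", "on", "grass", "<eos>"]
W_out = rng.standard_normal((d, len(dictionary))) * 0.1
b_out = np.zeros(len(dictionary))
decoder_out = rng.standard_normal(d)          # one text decoder output step
word, posterior = select_word(decoder_out, W_out, b_out, dictionary)
print(len(posterior))                         # 6
```

Repeating this step, feeding each selected word back in as the next word feature until an end token appears, yields the decoding words from which the video description is assembled.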
8. A video description generation apparatus, characterized in that the video description generation apparatus comprises:
the extraction module is used for acquiring a video to be described and extracting visual features, auditory features and word features of the video to be described;
the coding module is used for coding the visual features and the auditory features respectively through a multi-modal attention mechanism main body model of the video description generation system to obtain visual coding features and auditory coding features;
a target assistant feature generation module, configured to process the visual coding features and the auditory coding features through an assistant model of the video description generation system to generate target assistant features;
the decoding module is used for decoding the visual coding features, the auditory coding features, the target auxiliary features and the word features through the multi-modal attention mechanism main body model to obtain the posterior probability of each keyword, and selecting a decoding word from the keywords according to their posterior probabilities;
and the video description generation module is used for generating the video description of the video to be described according to the decoding words.
9. A computer device, characterized in that the computer device comprises a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the video description generation method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, wherein the computer program, when executed by a processor, implements the steps of the video description generation method according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110470037.0A CN113095435B (en) | 2021-04-28 | 2021-04-28 | Video description generation method, device, equipment and computer readable storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113095435A (en) | 2021-07-09 |
CN113095435B (en) | 2024-06-04 |
Family ID: 76681011
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110470037.0A Active CN113095435B (en) | 2021-04-28 | 2021-04-28 | Video description generation method, device, equipment and computer readable storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113095435B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2023201990A1 (en) * | 2022-04-19 | 2023-10-26 | 苏州浪潮智能科技有限公司 | Visual positioning method and apparatus, device, and medium |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10699129B1 (en) * | 2019-11-15 | 2020-06-30 | Fudan University | System and method for video captioning |
CN111541910A (en) * | 2020-04-21 | 2020-08-14 | 华中科技大学 | Video barrage comment automatic generation method and system based on deep learning |
WO2020190112A1 (en) * | 2019-03-21 | 2020-09-24 | Samsung Electronics Co., Ltd. | Method, apparatus, device and medium for generating captioning information of multimedia data |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11948066B2 (en) | Processing sequences using convolutional neural networks | |
WO2021232746A1 (en) | Speech recognition method, apparatus and device, and storage medium | |
WO2021037113A1 (en) | Image description method and apparatus, computing device, and storage medium | |
CN110489567B (en) | Node information acquisition method and device based on cross-network feature mapping | |
CN112712813B (en) | Voice processing method, device, equipment and storage medium | |
CN110781306B (en) | English text aspect layer emotion classification method and system | |
CN112417855A (en) | Text intention recognition method and device and related equipment | |
CN109344242B (en) | Dialogue question-answering method, device, equipment and storage medium | |
CN115083435B (en) | Audio data processing method and device, computer equipment and storage medium | |
CN114091450B (en) | Judicial domain relation extraction method and system based on graph convolution network | |
CN116543768A (en) | Model training method, voice recognition method and device, equipment and storage medium | |
CN114360502A (en) | Processing method of voice recognition model, voice recognition method and device | |
CN113450765A (en) | Speech synthesis method, apparatus, device and storage medium | |
CN115831105A (en) | Speech recognition method and device based on improved Transformer model | |
CN115203372A (en) | Text intention classification method and device, computer equipment and storage medium | |
CN113095435B (en) | Video description generation method, device, equipment and computer readable storage medium | |
CN111898363B (en) | Compression method, device, computer equipment and storage medium for long and difficult text sentence | |
CN112489651B (en) | Voice recognition method, electronic device and storage device | |
CN117648469A (en) | Cross double-tower structure answer selection method based on contrast learning | |
CN111563161B (en) | Statement identification method, statement identification device and intelligent equipment | |
CN116775873A (en) | Multi-mode dialogue emotion recognition method | |
CN115017900B (en) | Conversation emotion recognition method based on multi-mode multi-prejudice | |
CN113515617B (en) | Method, device and equipment for generating model through dialogue | |
CN114822509A (en) | Speech recognition method, speech recognition device, computer equipment and storage medium | |
CN113889130A (en) | Voice conversion method, device, equipment and medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| GR01 | Patent grant | |