CN114494980B - Diversified video comment generation method, system, equipment and storage medium - Google Patents
- Publication number
- CN114494980B (application CN202210352708.8A)
- Authority
- CN
- China
- Prior art keywords
- comment
- emotion
- vocabulary
- text
- video frame
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Abstract
The invention discloses a method, system, device and storage medium for generating diversified video comments. Aiming at the problem that comments generated by current video comment generation models are monotonous and one-sided, and starting from the aspect of emotion diversity, emotion category weights are introduced as labels; drawing on the idea of the variational auto-encoder, an emotion hidden vector is modeled and controlled to guide the generation of diversified video comments with controllable emotion, so that high-quality real-time video comment generation can be realized and the communication experience of users can be enhanced.
Description
Technical Field
The invention relates to the technical field of video comment generation, in particular to a method, a system, equipment and a storage medium for generating diversified video comments.
Background
With the development of the times, video bullet-screen (barrage) systems have successively been launched on popular video platforms such as Bilibili, iQiyi and Youku. The wide application of the bullet-screen system has created a two-way communication mode between users and videos and enhanced users' sense of real-time participation while watching. Real-time bullet-screen comments provide richer viewpoints, arouse users' attention and discussion, and enhance the communication experience. Therefore, realizing high-quality real-time video comment (bullet-screen) generation has great application value.
Most current real-time video comment generation methods adopt a traditional end-to-end model that combines a video clip with adjacent bullet-screen comments to generate a real-time comment. However, following the generation logic of comments, comments on the same video clip are influenced by the viewpoint, emotional tendency and way of thinking of the commenter, and are therefore inherently diverse. Current real-time video comment generation methods mainly optimize comment quality, but neglect this diversity and only generate a single video comment. For the same input video clip and adjacent comments, the reference comments used as the ground truth (annotation information) often cover multiple types; generating a single comment is therefore not only unfavorable for performance evaluation and model optimization, but also inconsistent with the logical characteristics of commenting.
Disclosure of Invention
The invention aims to provide a method, a system, equipment and a storage medium for generating diversified video comments, which can control emotional tendency to generate diversified video comments.
The purpose of the invention is realized by the following technical scheme:
a diversified video comment generation method includes:
constructing a video frame image set by using the video frame image at the current moment and a plurality of nearest video frame images thereof, extracting comments in the video frame image at the current moment as reference comments, and extracting the comments in all the nearest video frame images to form a comment text;
visual features are extracted from the video frame image set, text features are extracted from the comment text in combination with the visual features, emotion category weights corresponding to reference comments are combined, emotion hidden vectors are generated and coded into emotion hidden vector coding features;
interacting the input vocabulary with the generated vocabulary of the previous time step, the visual characteristics, the text characteristics and the emotional hidden vector coding characteristics in sequence to obtain the vocabulary probability distribution of the current time step, determining the generated vocabulary of the current time step according to the vocabulary probability distribution of the current time step, and synthesizing the generated vocabularies of all time steps to form the video comment of the current time; the input vocabulary is the vocabulary in the reference comment or the vocabulary in the generated vocabulary of the previous time step.
A diverse video comment generation system comprising:
the information acquisition unit is used for constructing a video frame image set by using the video frame image at the current moment and a plurality of nearest video frame images thereof, extracting comments in the video frame image at the current moment as reference comments, and extracting all the comments in the nearest video frame images to form a comment text;
a visual encoder for extracting visual features from the set of video frame images;
a text encoder for extracting text features from the comment text in conjunction with the visual features;
the hidden vector encoder is used for generating an emotion hidden vector by combining the emotion category weights corresponding to the reference comments and encoding the emotion hidden vector into emotion hidden vector encoding characteristics;
the comment decoder is used for sequentially interacting the input vocabulary with the generated vocabulary of the previous time step, the visual characteristics, the text characteristics and the emotional hidden vector coding characteristics to obtain the probability distribution of the vocabulary of the current time step; the input vocabulary is the vocabulary in the reference comment or the vocabulary in the generated vocabulary of the previous time step;
and the video comment generating unit is used for determining the generated vocabulary of the current time step according to the vocabulary probability distribution of the current time step and synthesizing the generated vocabularies of all the time steps to form the video comment at the current moment.
A processing device, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the aforementioned methods.
A readable storage medium, storing a computer program, characterized in that the computer program realizes the aforementioned method when executed by a processor.
According to the technical scheme provided by the invention, aiming at the problem that comments generated by current video comment generation models are monotonous and one-sided, and starting from the aspect of emotion diversity, emotion category weights are introduced as emotion labels; drawing on the idea of the variational auto-encoder, an emotion hidden vector is modeled and controlled to guide the generation of emotion-controllable, diversified video comments, so that high-quality real-time video comment generation can be realized and the communication experience of users can be enhanced.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings based on the drawings without creative efforts.
Fig. 1 is a flowchart of a method for generating diversified video comments according to an embodiment of the present invention;
fig. 2 is a schematic overall structure diagram of a diversified video comment generation model according to an embodiment of the present invention;
fig. 3 is a schematic diagram of a diversified video comment generating system according to an embodiment of the present invention;
fig. 4 is a schematic diagram of a processing apparatus according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention are clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.
The terms that may be used herein are first described as follows:
the terms "comprising," "including," "containing," "having," or other similar terms of meaning should be construed as non-exclusive inclusions. For example: including a feature (e.g., material, component, ingredient, carrier, formulation, material, dimension, part, component, mechanism, device, step, process, method, reaction condition, processing condition, parameter, algorithm, signal, data, product, or article, etc.) that is not specifically recited, should be interpreted to include not only the specifically recited feature but also other features not specifically recited and known in the art.
The following describes a method, a system, a device and a storage medium for generating diversified video comments, which are provided by the present invention, in detail. Details which are not described in detail in the embodiments of the invention belong to the prior art which is known to a person skilled in the art. Those not specifically mentioned in the examples of the present invention were carried out according to the conventional conditions in the art or conditions suggested by the manufacturer.
Example one
As shown in fig. 1, a method for generating diversified video comments mainly includes the following steps:
Step 1, constructing a video frame image set from the video frame image at the current moment and a plurality of its nearest video frame images, extracting the comment in the video frame image at the current moment as the reference comment, and extracting the comments in all the nearest video frame images to form the comment text.
Step 2, extracting visual features from the video frame image set, extracting text features from the comment text in combination with the visual features, generating an emotion hidden vector in combination with the emotion category weights corresponding to the reference comment, and encoding the emotion hidden vector into emotion hidden vector coding features.
Step 3, interacting the input vocabulary in sequence with the generated vocabulary of the previous time step, the visual features, the text features and the emotion hidden vector coding features to obtain the vocabulary probability distribution of the current time step, determining the generated vocabulary of the current time step according to this distribution, and synthesizing the generated vocabularies of all time steps to form the video comment at the current moment; the input vocabulary is the vocabulary of the reference comment or the generated vocabulary of the previous time step.
In the above scheme of the embodiment of the invention, visual features are extracted from the video frame image set by a visual encoder; text features are extracted from the comment text in combination with the visual features by a text encoder; the emotion category weights corresponding to the reference comment are obtained, an emotion hidden vector is generated, and the emotion hidden vector is encoded into emotion hidden vector coding features by a hidden vector encoder; the input vocabulary is interacted in sequence with the generated vocabulary of the previous time step, the visual features, the text features and the emotion hidden vector coding features by a comment decoder, which outputs the vocabulary probability distribution of the current time step. The visual encoder, the text encoder, the hidden vector encoder and the comment decoder together form the diversified video comment generation model; fig. 2 shows its overall structure. The working principle of each part and the loss function used during training are introduced in detail below with reference to fig. 2.
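To make the overall data flow concrete, the following is a minimal sketch of how the four components could be assembled, assuming a PyTorch implementation; the class and method names are illustrative assumptions, not the patent's reference implementation.

```python
# Illustrative skeleton of the diversified video comment generation model.
import torch
import torch.nn as nn

class DiverseCommentModel(nn.Module):
    def __init__(self, visual_encoder, text_encoder, latent_encoder, comment_decoder):
        super().__init__()
        self.visual_encoder = visual_encoder    # CNN + Transformer over video frames
        self.text_encoder = text_encoder        # Transformer over surrounding comments
        self.latent_encoder = latent_encoder    # variational encoder of the emotion latent z
        self.comment_decoder = comment_decoder  # Transformer decoder over words

    def forward(self, frames, context_tokens, reference_tokens, emotion_weights):
        W_F = self.visual_encoder(frames)                     # visual features
        W_e = self.text_encoder(context_tokens, W_F)          # text features
        W_z, mu, logvar = self.latent_encoder(reference_tokens, W_e, emotion_weights)
        logits = self.comment_decoder(reference_tokens, W_F, W_e, W_z)
        return logits, mu, logvar
```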
1. Visual encoder.
As shown in fig. 2, the visual encoder (Video Encoder) mainly includes a convolutional neural network (CNN) and a first Transformer model; the first Transformer model mainly includes a multi-head attention sub-layer and a fully-connected feed-forward network (position-wise feed-forward network).
In the embodiment of the invention, the visual features are extracted from the video frame image set through the visual encoder. The video frame image set is recorded as F = {F_1, F_2, …, F_J}, where F_j denotes the j-th video frame image, j = 1, 2, …, J, and J is the number of video frame images; each video frame image corresponds to a moment. The video frame image set F is the set of video frames of a specified video at a specified moment and the nearest moments: F_1 is the video frame image at the current moment, and F_2, …, F_J are the J−1 video frame images nearest in time to it.
Firstly, the features of each video frame image are extracted through the convolutional neural network, expressed as:
V_j = CNN(F_j)
In the above formula, V_j denotes the extracted features of the j-th video frame image F_j. Illustratively, the convolutional neural network may be a ResNet network.
The feature set corresponding to the video frame image set F is recorded as V = {V_1, V_2, …, V_J}, and the feature set V is encoded through the first Transformer model as follows:
W_j = FNN_F(MultiHead-Atten_F(V_j, V, V))
In the above formula, MultiHead-Atten_F and FNN_F respectively denote the multi-head attention module and the fully-connected feed-forward network in the first Transformer model; W_j denotes the encoded visual feature of the j-th video frame image.
Finally, the visual features of the video frame image set F are recorded as W_F = {W_1, W_2, …, W_J}.
Those skilled in the art will understand that in (V_j, V, V), the inputs of the multi-head attention module correspond in order to the query (query matrix), key (key matrix) and value (value matrix) of multi-head attention, i.e. the query takes V_j while the key and value take V. The multi-head attention modules in the subsequent formulas have similar meanings; the expressions in the formulas are the general form in the field and are not described again.
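The following is a minimal sketch of such a visual encoder, assuming PyTorch with a torchvision ResNet-18 backbone; the dimensions, backbone choice and single attention layer are illustrative assumptions.

```python
# Per-frame CNN features followed by one self-attention + feed-forward layer,
# mirroring W_j = FNN_F(MultiHead-Atten_F(V_j, V, V)).
import torch
import torch.nn as nn
import torchvision.models as models

class VisualEncoder(nn.Module):
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        resnet = models.resnet18(weights=None)
        self.cnn = nn.Sequential(*list(resnet.children())[:-1])  # globally pooled frame features
        self.proj = nn.Linear(512, d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model * 4), nn.ReLU(),
                                 nn.Linear(d_model * 4, d_model))

    def forward(self, frames):                              # frames: (B, J, 3, H, W)
        B, J = frames.shape[:2]
        V = self.cnn(frames.flatten(0, 1)).flatten(1)       # (B*J, 512) per-frame CNN features
        V = self.proj(V).view(B, J, -1)                     # (B, J, d_model)
        attn_out, _ = self.attn(V, V, V)                    # query = V_j, key = value = V
        W_F = self.ffn(attn_out)                            # (B, J, d_model) visual features
        return W_F
```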
2. Text encoder.
In the embodiment of the invention, the text encoder (Text Encoder) comprises a first linear encoding layer and a second Transformer model; the second Transformer model comprises two multi-head attention modules and a fully-connected feed-forward network. In fig. 2, the linear encoding layers contained in the text encoder and the other parts are denoted as Embedding; similarly, all multi-head attention modules are denoted as Multi-head Attention, and all fully-connected feed-forward networks are denoted as Feed Forward.
In the embodiment of the invention, text features are extracted from the comment text (context) by combining the visual features through the text encoder.
Firstly, the comment text is linearly encoded through the first linear encoding layer to obtain the corresponding word embedding vector set e = {e_1, e_2, …, e_M}, where e_m denotes the word embedding vector of the m-th word in the comment text, m = 1, 2, …, M, and M is the total number of words in the comment text. In the embodiment of the present invention, the comment text may be bullet-screen text, generally the bullet-screen comments adjacent to the specified video at the specified moment; the adjacency range may be set by a person skilled in the art according to actual needs or experience, and the larger the range, the more words. Referring to the introduction of the visual encoder, if the specified moment is the current moment, then F_1 is the video frame at the current moment and the comment text contains the comments in the video frame images F_2, …, F_J.
Then, the word embedding vector set e is processed through the first multi-head attention module MultiHead-Atten_e1 of the second Transformer model, and the processing result of the first multi-head attention module is interacted with the visual features through the second multi-head attention module MultiHead-Atten_e2 and the fully-connected feed-forward network FNN_e to obtain the text features. The process is expressed as:
e_m' = MultiHead-Atten_e1(e_m, e, e)
E_m = FNN_e(MultiHead-Atten_e2(e_m', W_F, W_F))
where e_m' denotes the result of processing the word embedding vector e_m of the m-th word by the first multi-head attention module, W_F denotes the visual features, and E_m denotes the text feature corresponding to the m-th word.
Finally, the text features of the comment text are recorded as W_e = {E_1, E_2, …, E_M}.
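A minimal sketch of this text encoder, again assuming PyTorch, is given below; vocabulary size, dimensions and the single-layer structure are illustrative assumptions.

```python
# Self-attention over comment-word embeddings, then cross-attention to the visual
# features W_F, then a feed-forward network: E_m = FNN_e(MultiHead-Atten_e2(e_m', W_F, W_F)).
import torch.nn as nn

class TextEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_heads=8):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)        # first linear encoding layer
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model * 4), nn.ReLU(),
                                 nn.Linear(d_model * 4, d_model))

    def forward(self, context_tokens, W_F):                  # context_tokens: (B, M); W_F: (B, J, d)
        e = self.embed(context_tokens)                        # word embedding vectors
        e_prime, _ = self.self_attn(e, e, e)                  # MultiHead-Atten_e1(e_m, e, e)
        fused, _ = self.cross_attn(e_prime, W_F, W_F)         # attend to the visual features
        W_e = self.ffn(fused)                                 # (B, M, d) text features
        return W_e
```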
3. Hidden vector encoder.
In the embodiment of the invention, the hidden vector encoder (Latent Vector Encoder) introduces the encoding principle of the variational auto-encoder on the basis of a Transformer model: a Gaussian mixture distribution is trained, from which the emotion hidden vector z is sampled to guide the generation of diversified comments. The hidden vector encoder generates the emotion hidden vector z based on the reference comment (comment), the text features W_e and the emotion category weights, and then encodes the emotion hidden vector z into the emotion hidden vector coding features. The main principle is as follows.
The probability distribution p(z | c, W_e) is modeled as a Gaussian mixture model weighted by the emotion category weights c_k, represented as:
p(z | c, W_e) = Σ_{k=1}^{K} c_k · N_k(μ_k, σ_k² · I)
where c_k denotes the k-th emotion category weight, K denotes the number of emotion category weights, c denotes the set of emotion category weights, c = {c_k}_K; N_k denotes the k-th Gaussian distribution model, μ_k and σ_k² respectively denote the mean and variance of the Gaussian distribution model defined by the modeling, I denotes the standard identity matrix, W_e denotes the text features, and z denotes the emotion hidden vector.
As shown in fig. 2, the hidden vector encoder includes two linear encoding layers, a third Transformer model, a multi-layer perceptron (MLP) and a sampling layer (sample).
The two linear encoding layers (Embedding) are called the second linear encoding layer and the third linear encoding layer respectively, and are located at the head and tail ends of the hidden vector encoder. The reference comment is linearly encoded through the second linear encoding layer to obtain the corresponding word embedding vector set d = {d_1, d_2, …, d_L}, where d_l denotes the word embedding vector of the l-th word in the reference comment, l = 1, 2, …, L, and L is the total number of words in the reference comment.
Referring also to fig. 2, the third Transformer model includes two multi-head attention modules and a fully-connected feed-forward network. The word embedding vector set d is processed through the first multi-head attention module MultiHead-Atten_z1, and the processing result of the first multi-head attention module is then interacted with the text features through the second multi-head attention module MultiHead-Atten_z2 and the fully-connected feed-forward network FNN_z to obtain the intermediate hidden vector set h. The process is expressed as:
d_l' = MultiHead-Atten_z1(d_l, d, d)
h_l = FNN_z(MultiHead-Atten_z2(d_l', W_e, W_e))
where d_l' denotes the processing result of the l-th layer of the first multi-head attention module on the word embedding vector d_l of the l-th word in the reference comment; when l = 2, …, L, the input of the l-th layer of the first multi-head attention module also contains the intermediate hidden vector h_{l−1} corresponding to the (l−1)-th word in the reference comment; h_l denotes the intermediate hidden vector obtained by processing the l-th layer output of the second multi-head attention module through the fully-connected feed-forward network FNN_z, and h_l corresponds to the l-th word in the reference comment. Finally, the intermediate hidden vector set h = {h_1, h_2, …, h_L} is obtained, and h_L is called the last-layer hidden vector.
As will be appreciated by those skilled in the art, when l = 2, …, L, in computing d_l' the first multi-head attention module MultiHead-Atten_z1 has two types of inputs: one is d_l and d, and the other is h_{l−1}; that is, the first multi-head attention module MultiHead-Atten_z1 processes the word embedding vector set d together with the second multi-head attention module MultiHead-Atten_z2 and the fully-connected feed-forward network FNN_z. This mechanism is not shown explicitly in the equations, since it is already covered within the multi-head attention module.
The last-layer hidden vector h_L is processed through the multi-layer perceptron and encoded into the means and variances of the Gaussian distribution models, expressed as:
μ_k, σ_k² = MLP(h_L), k = 1, 2, …, K
where MLP denotes the multi-layer perceptron.
The emotion category weights c_k obtained from the reference comment, together with the means and variances of the Gaussian distribution models obtained by the multi-layer perceptron encoding, are substituted into the equation for p(z | c, W_e) to obtain the probability distribution of the emotion hidden vector z; the emotion hidden vector z is then obtained by the sampling layer, and is encoded into the emotion hidden vector coding feature W_z through the third linear encoding layer.
The above corresponds to the inference stage: by controlling the encoded means and variances so that they stay as close as possible to those of the prior distribution, the model can effectively learn a direct mapping between the selected hidden vector space and the generated comment; in the testing stage, different emotion category weights c_k are selected to realize diversified comment generation.
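A minimal sketch of the hidden vector encoder follows, assuming PyTorch; parameterizing all K components with a single MLP and drawing z with the reparameterization trick are assumptions for illustration (the patent only specifies an MLP and a sampling layer).

```python
# Encode the reference comment with attention to the text features, map the last hidden
# state to K Gaussian components, select the component picked by the one-hot emotion
# weights c, and sample the emotion latent vector z.
import torch
import torch.nn as nn

class LatentVectorEncoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_heads=8, z_dim=64, K=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)           # second linear encoding layer
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_model * 4), nn.ReLU(),
                                 nn.Linear(d_model * 4, d_model))
        self.mlp = nn.Linear(d_model, 2 * K * z_dim)              # K means and K log-variances
        self.z_proj = nn.Linear(z_dim, d_model)                   # third linear encoding layer
        self.K, self.z_dim = K, z_dim

    def forward(self, ref_tokens, W_e, c):                        # c: (B, K) one-hot weights
        d = self.embed(ref_tokens)
        d_prime, _ = self.self_attn(d, d, d)
        h, _ = self.cross_attn(d_prime, W_e, W_e)
        h_L = self.ffn(h)[:, -1]                                  # last-layer hidden vector
        stats = self.mlp(h_L).view(-1, self.K, 2 * self.z_dim)
        mu, logvar = stats.chunk(2, dim=-1)                       # (B, K, z_dim) each
        mu_k = (c.unsqueeze(-1) * mu).sum(1)                      # component with c_k = 1
        logvar_k = (c.unsqueeze(-1) * logvar).sum(1)
        z = mu_k + torch.randn_like(mu_k) * (0.5 * logvar_k).exp()   # sample z
        W_z = self.z_proj(z)                                      # emotion latent coding feature
        return W_z, mu_k, logvar_k
```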
In the embodiment of the invention, the existing library SnowNLP can be used for emotion analysis (Emotion Analysis) of the reference comment s. SnowNLP(s) outputs an evaluation value in the interval [0, 1] describing the emotional tendency of the reference comment s; the larger the value, the more positive the emotion of the sentence. The emotional tendency can be divided into three directions, positive, objective and negative (K = 3), so that the corresponding emotion category weights c_k can be obtained:
c_1 = 1, if T2 < SnowNLP(s) ≤ T1;
c_2 = 1, if T3 < SnowNLP(s) ≤ T2;
c_3 = 1, otherwise.
The emotion category weights are c = {c_1, c_2, c_3}. The above expression shows that, in the case of K = 3, exactly one of the emotion category weights equals 1. In the above expression, T1, T2 and T3 are set thresholds satisfying T1 > T2 > T3; for example, T1 = 1, T2 = 0.7 and T3 = 0.3 may be used.
In the embodiment of the invention, the subscript k of the emotion category weight c_k indicates the emotion category. The emotion category k can be determined from the current reference comment; its corresponding emotion category weight is c_k = 1 and the remaining emotion category weights are all 0, so in the expression for p(z | c, W_e) only the mean and variance of the Gaussian distribution model corresponding to emotion category k are needed.
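The labelling step above can be sketched as follows, using SnowNLP's real sentiments score; the thresholds follow the example values above and the function name is an illustrative assumption.

```python
# Score the reference comment with SnowNLP and turn the score into a one-hot weight
# over K = 3 emotion categories (positive / objective / negative).
from snownlp import SnowNLP

T1, T2, T3 = 1.0, 0.7, 0.3   # example thresholds with T1 > T2 > T3

def emotion_weights(reference_comment: str):
    score = SnowNLP(reference_comment).sentiments   # value in [0, 1]; larger = more positive
    if T2 < score <= T1:
        return [1.0, 0.0, 0.0]   # c_1 = 1: positive
    if T3 < score <= T2:
        return [0.0, 1.0, 0.0]   # c_2 = 1: objective
    return [0.0, 0.0, 1.0]       # c_3 = 1: negative
```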
4. Comment decoder.
As shown in fig. 2, the comment decoder (Comment Decoder) mainly includes a fourth linear encoding layer, a fourth Transformer model, a linear layer (not shown) and a softmax layer.
The fourth linear encoding layer encodes the input vocabulary to obtain the word embedding vector corresponding to the input vocabulary, denoted y'. The inference (training) phase directly uses the reference comment (comment) vocabulary as the input vocabulary, while the testing phase uses the vocabulary generated at the previous time step as the input vocabulary; fig. 2 gives an example in which the reference comment vocabulary is used as the input vocabulary.
Referring also to fig. 2, the fourth Transformer model includes four multi-head attention modules and a fully-connected feed-forward network. The word embedding vector y' is interacted with the generated vocabulary of the previous time step through the first multi-head attention module MultiHead-Atten_o1; the interaction result of the first multi-head attention module is interacted with the visual features through the second multi-head attention module MultiHead-Atten_o2; the interaction result of the second multi-head attention module is interacted with the text features through the third multi-head attention module MultiHead-Atten_o3; the interaction result of the third multi-head attention module is interacted with the emotion hidden vector coding features through the fourth multi-head attention module MultiHead-Atten_o4; and the fully-connected feed-forward network FNN_o outputs the final decoded feature. The process is expressed as:
y^1 = MultiHead-Atten_o1(y', y, y)
y^2 = MultiHead-Atten_o2(y^1, W_F, W_F)
y^3 = MultiHead-Atten_o3(y^2, W_e, W_e)
s_t = FNN_o(MultiHead-Atten_o4(y^3, W_z, W_z))
where y, W_F, W_e and W_z respectively denote the generated vocabulary of the previous time step, the visual features, the text features and the emotion hidden vector coding features; s_t denotes the final decoded feature.
The final decoded feature s_t passes through the linear layer and the softmax layer in turn to obtain the vocabulary probability distribution of the current time step, expressed as:
p(y_t | y_0, …, y_{t−1}, W_z, W_F, W_e) = Softmax(W · s_t)
where y_0, …, y_{t−1} denote the vocabulary generated from the initial time step 0 to the previous time step t−1, i.e. the generated vocabulary y of the previous time step; y_t denotes the vocabulary generated at the current time step; and W denotes the parameters of the linear layer.
In the embodiment of the invention, the multi-head attention mechanism used in the multi-head attention modules of each part of the diversified video comment generation model can refer to conventional technology; the above expressions show the relevant processing, and the principle of processing the various features and intermediate hidden vectors can likewise refer to conventional technology, which is not described in detail here. In addition, the introduction of the comment decoder uses the generated vocabulary y of the previous time step; in the actual calculation flow it needs to be converted into the corresponding embedding vector set y. Considering that this is conventional technology and the formulas mainly show the required data, a skilled person can understand the meaning expressed by the formulas in their current form.
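The autoregressive generation at test time can be sketched as follows, assuming a decoder with the signature used in the earlier sketches; greedy selection and the stop criterion are assumptions for illustration (other sampling strategies would also fit the described scheme).

```python
# Feed the words generated so far, obtain the softmax vocabulary distribution for the
# current time step, pick the next word, and stop at the end-of-sequence token.
import torch

@torch.no_grad()
def generate_comment(decoder, W_F, W_e, W_z, bos_id, eos_id, max_len=30):
    generated = [bos_id]
    for _ in range(max_len):
        y = torch.tensor([generated])                   # vocabulary generated so far
        logits = decoder(y, W_F, W_e, W_z)              # (1, t, vocab) per-step logits
        probs = torch.softmax(logits[:, -1], dim=-1)    # vocabulary distribution at this step
        next_word = probs.argmax(-1).item()             # greedy choice of the current word
        if next_word == eos_id:
            break
        generated.append(next_word)
    return generated[1:]                                # the video comment at the current moment
```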
5. Loss function.
A conventional encoder–decoder model generates comments by maximizing the log-likelihood of the generated comment. The invention instead defines a generation model controlled by the emotion hidden vector z:
p(ŷ) = ∫ p(ŷ | z) p(z) dz
where the generated comment ŷ = {y_0, y_1, …} is the video comment at the current moment and contains the generated vocabulary of all time steps at the current moment.
Because it is impossible to traverse all emotion hidden vectors z for the integration, drawing on the mathematical derivation of the variational auto-encoder, a variational lower bound (ELBO) of the log-likelihood of the video comment ŷ at the current moment can be obtained:
log p(ŷ) ≥ E_z[ log p(ŷ | z) ] − D_KL( q(z | ŷ) || p(z) )
where p(ŷ) denotes the probability distribution of the video comment ŷ generated at the current moment; p(ŷ | z) denotes the probability distribution of generating the video comment ŷ at the current moment conditioned on the emotion hidden vector z; p(z) is the distribution of the emotion hidden vector z; E_z[·] denotes the mathematical expectation with respect to the emotion hidden vector z; D_KL denotes the KL distance (relative entropy); and q(z | ŷ) corresponds to the probability distribution obtained by the hidden vector encoder, used to approximate the posterior probability distribution p(z | ŷ) of the comment decoder. The optimization goal of the invention is therefore to maximize this variational lower bound of the log-likelihood.
Taking further into account the influence of the video frame image set F and the word embedding vector set e corresponding to the comment text, the objective function L is:
L = E_{q(z | ŷ, F, e)}[ log p(ŷ | z, F, e) ] − D_KL( q(z | ŷ, F, e) || p(z | F, e) )
where p(ŷ | z, F, e) denotes the probability distribution of generating the video comment ŷ at the current moment conditioned on the emotion hidden vector z, the video frame image set F and the word embedding vector set e corresponding to the comment text; E_{q(z | ŷ, F, e)}[·] denotes the mathematical expectation with respect to q(z | ŷ, F, e); q(z | ŷ, F, e) denotes the probability distribution of the emotion hidden vector z conditioned on the video comment ŷ at the current moment, the video frame image set F and the word embedding vector set e corresponding to the comment text; and p(z | F, e) denotes the probability distribution of the emotion hidden vector z conditioned on the video frame image set F and the word embedding vector set e corresponding to the comment text. The first term is similar to the log-likelihood of a traditional model and encourages the generation of higher-quality comments, i.e. the reconstruction loss. The second term encourages the trained distribution of the emotion hidden vector z to be as close as possible to the prior distribution p(z | F, e); the prior probability is set to the standard normal distribution N(0, 1), which prevents the model from tending toward assimilation during training and losing diversity.
In the embodiment of the present invention, p and q are both probability distributions; the two notations are mainly used to distinguish different probability distributions. As described above, q(z | ŷ) is the probability distribution of the emotion hidden vector z obtained by the hidden vector encoder and is used to approximate the posterior probability distribution p(z | ŷ) of the comment decoder; since the posterior probability distribution p(z | ŷ) cannot be calculated, different notations are used to differentiate the two.
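A minimal sketch of this objective is given below, assuming a diagonal Gaussian posterior against the standard-normal prior stated above; the closed-form KL is the standard one for diagonal Gaussians, and the relative weighting of the two terms is an assumption.

```python
# Reconstruction (negative log-likelihood of the reference comment) plus a KL term
# pulling q(z | y, F, e) toward the prior N(0, I); minimizing this maximizes the ELBO.
import torch
import torch.nn.functional as F

def elbo_loss(logits, target_tokens, mu, logvar, pad_id=0, kl_weight=1.0):
    # Reconstruction term: cross-entropy over the vocabulary at every time step.
    recon = F.cross_entropy(logits.transpose(1, 2), target_tokens, ignore_index=pad_id)
    # KL( q(z|·) || N(0, I) ) in closed form for a diagonal Gaussian.
    kl = -0.5 * torch.mean(torch.sum(1 + logvar - mu.pow(2) - logvar.exp(), dim=-1))
    return recon + kl_weight * kl
```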
According to the scheme of the embodiment of the invention, aiming at the problem that comments generated by current video comment generation models are monotonous and one-sided, an emotion analysis module is introduced for labelling from the aspect of emotion diversity, and, drawing on the idea of the variational auto-encoder, an emotion hidden vector z is modeled and controlled on the basis of the Transformer model to guide the generation of emotion-controllable, diversified video comments. It is worth noting that the introduced emotion analysis module (Emotion Analysis in fig. 2) is independent of the whole model, so it can be replaced by other emotion analyzers to realize finer-grained controllable emotional comment generation, or replaced by a topic analysis module to realize topic-controllable comment generation, and so on.
Example two
The invention further provides a diversified video comment generating system, which is implemented mainly based on the method provided by the first embodiment, as shown in fig. 3, the system mainly includes:
the information acquisition unit is used for constructing a video frame image set by using the video frame image at the current moment and a plurality of nearest video frame images thereof, extracting comments in the video frame image at the current moment as reference comments, and extracting all the comments in the nearest video frame images to form a comment text;
a visual encoder for extracting visual features from the set of video frame images;
a text encoder for extracting text features from the comment text in conjunction with the visual features;
the hidden vector encoder is used for generating an emotion hidden vector by combining the emotion category weights corresponding to the reference comments and encoding the emotion hidden vector into emotion hidden vector encoding characteristics;
the comment decoder is used for sequentially interacting the input vocabulary with the generated vocabulary of the previous time step, the visual characteristics, the text characteristics and the emotional hidden vector coding characteristics to obtain the vocabulary probability distribution of the current time step; the input vocabulary is the vocabulary in the reference comment or the vocabulary in the generated vocabulary of the previous time step;
and the video comment generating unit is used for determining the generated vocabulary of the current time step according to the vocabulary probability distribution of the current time step and synthesizing the generated vocabularies of all the time steps to form the video comment at the current moment.
It is obvious to those skilled in the art that, for convenience and simplicity of description, the above division of each functional module is only used for illustration, and in practical applications, the above function distribution may be completed by different functional modules according to needs, that is, the internal structure of the system is divided into different functional modules to complete all or part of the above described functions.
It should be noted that, the main principles of the various parts of the system have been described in detail in the first embodiment, and therefore are not described in detail.
EXAMPLE III
The present invention also provides a processing apparatus, as shown in fig. 4, which mainly includes: one or more processors; a memory for storing one or more programs; wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the methods provided by the foregoing embodiments.
Further, the processing device further comprises at least one input device and at least one output device; in the processing device, a processor, a memory, an input device and an output device are connected through a bus.
In the embodiment of the present invention, the specific types of the memory, the input device, and the output device are not limited; for example:
the input device can be a touch screen, an image acquisition device, a physical key or a mouse and the like;
the output device may be a display terminal;
the Memory may be a Random Access Memory (RAM) or a non-volatile Memory (non-volatile Memory), such as a disk Memory.
Example four
The present invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the method provided by the foregoing embodiments.
The readable storage medium in the embodiment of the present invention may be provided in the foregoing processing device as a computer readable storage medium, for example, as a memory in the processing device. The readable storage medium may be various media that can store program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a magnetic disk, or an optical disk.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (5)
1. A method for generating diversified video comments, comprising:
constructing a video frame image set by using the video frame image at the current moment and a plurality of nearest video frame images thereof, extracting comments in the video frame image at the current moment as reference comments, and extracting the comments in all the nearest video frame images to form a comment text;
visual features are extracted from the video frame image set, text features are extracted from the comment text in combination with the visual features, emotion category weights corresponding to reference comments are combined, emotion hidden vectors are generated and coded into emotion hidden vector coding features;
interacting the input vocabulary with the generated vocabulary of the previous time step, the visual characteristics, the text characteristics and the emotional hidden vector coding characteristics in sequence to obtain the vocabulary probability distribution of the current time step, determining the generated vocabulary of the current time step according to the vocabulary probability distribution of the current time step, and synthesizing the generated vocabularies of all time steps to form the video comment of the current time; the input vocabulary is the vocabulary in the reference comment or the vocabulary in the generated vocabulary of the previous time step;
wherein, extracting visual features from the video frame image set is realized by a visual encoder; extracting text features from the comment text in combination with the visual features by a text encoder; generating an emotion hidden vector by combining the emotion category weights corresponding to the reference comments, and coding the emotion hidden vector into emotion hidden vector coding characteristics through a hidden vector coder; interacting the input vocabulary with the generated vocabulary of the previous time step, the visual characteristics, the text characteristics and the emotional hidden vector coding characteristics in sequence, and obtaining the vocabulary probability distribution of the current time step through a comment decoder;
extracting visual features from the set of video frame images comprises: extracting visual features from the set of video frame images using a visual encoder comprising a convolutional neural network and a first transform model;
recording the video frame image set as F = {F_1, F_2, …, F_J}, where F_j denotes the j-th video frame image, j = 1, 2, …, J, and J denotes the number of video frame images; each video frame image corresponds to a moment, the video frame image at the current moment is F_1, and F_2, …, F_J are the J−1 video frame images nearest to the video frame image F_1 corresponding to the current moment;
extracting the characteristics of each video frame image through a convolutional neural network, and expressing the characteristics as follows:
V_j = CNN(F_j)
in the above equation, CNN denotes the convolutional neural network, and V_j denotes the extracted features of the j-th video frame image F_j;
recording the feature set corresponding to the video frame image set F as V = {V_1, V_2, …, V_J}, and encoding the feature set V corresponding to the video frame image set F through the first Transformer model as follows:
W_j = FNN_F(MultiHead-Atten_F(V_j, V, V))
in the above formula, MultiHead-Atten_F and FNN_F respectively denote the multi-head attention module and the fully-connected feed-forward network in the first Transformer model; W_j denotes the encoded visual feature of the j-th video frame image;
recording the visual features of the video frame image set F as W_F = {W_1, W_2, …, W_J};
The extracting text features from the comment text in combination with the visual features comprises: extracting text features from the comment text in conjunction with the visual features using a text encoder comprising a first linear encoding layer and a second transform model;
linearly encoding the comment text through the first linear encoding layer to obtain the corresponding word embedding vector set e = {e_1, e_2, …, e_M}, where e_m denotes the word embedding vector of the m-th word in the comment text, m = 1, 2, …, M, and M is the total number of words in the comment text;
the second Transformer model comprises two multi-head attention modules and a fully-connected feed-forward network; the word embedding vector set e is processed through the first multi-head attention module MultiHead-Atten_e1, and the processing result of the first multi-head attention module is interacted with the visual features through the second multi-head attention module MultiHead-Atten_e2 and the fully-connected feed-forward network FNN_e to obtain the text features, the process being expressed as:
e_m' = MultiHead-Atten_e1(e_m, e, e)
E_m = FNN_e(MultiHead-Atten_e2(e_m', W_F, W_F))
where e_m' denotes the result of processing the word embedding vector e_m of the m-th word by the first multi-head attention module, W_F denotes the visual features, and E_m denotes the text feature corresponding to the m-th word;
recording the text features of the comment text as W_e = {E_1, E_2, …, E_M};
The generating of the emotion hidden vector by combining the emotion category weight corresponding to the reference comment and the encoding of the emotion hidden vector into the emotion hidden vector encoding feature comprises the following steps: determining emotion category weight by performing emotion analysis on the reference comment, generating an emotion hidden vector by combining the reference comment, the text feature and the emotion category weight through a hidden vector encoder, and encoding the emotion hidden vector into emotion hidden vector encoding features;
wherein the probability distribution p(z | c, W_e) is modeled as a Gaussian mixture model weighted by the emotion category weights c_k, represented as:
p(z | c, W_e) = Σ_{k=1}^{K} c_k · N_k(μ_k, σ_k² · I)
where c_k denotes the k-th emotion category weight, K denotes the number of emotion category weights, c denotes the set of emotion category weights, c = {c_k}_K, N_k denotes the k-th Gaussian distribution model, μ_k and σ_k² respectively denote the mean and variance of the Gaussian distribution model defined by the modeling, I denotes the standard identity matrix, W_e denotes the text features, and z denotes the emotion hidden vector;
the hidden vector encoder includes: two linear encoding layers, a third Transformer model, a multi-layer perceptron and a sampling layer;
the two linear encoding layers are called the second linear encoding layer and the third linear encoding layer respectively; the reference comment is linearly encoded through the second linear encoding layer to obtain the corresponding word embedding vector set d = {d_1, d_2, …, d_L}, where d_l denotes the word embedding vector of the l-th word in the reference comment, l = 1, 2, …, L, and L is the total number of words in the reference comment;
the third Transformer model comprises two multi-head attention modules and a fully-connected feed-forward network; the word embedding vector set d is processed through the first multi-head attention module MultiHead-Atten_z1, and the processing result of the first multi-head attention module is interacted with the text features through the second multi-head attention module MultiHead-Atten_z2 and the fully-connected feed-forward network FNN_z to obtain the intermediate hidden vector set h, the process being expressed as:
d_l' = MultiHead-Atten_z1(d_l, d, d)
h_l = FNN_z(MultiHead-Atten_z2(d_l', W_e, W_e))
where d_l' denotes the processing result of the l-th layer of the first multi-head attention module on the word embedding vector d_l of the l-th word in the reference comment; when l = 2, …, L, the input of the l-th layer of the first multi-head attention module also contains the intermediate hidden vector h_{l−1} corresponding to the (l−1)-th word in the reference comment; h_l denotes the intermediate hidden vector obtained by processing the l-th layer output of the second multi-head attention module through the fully-connected feed-forward network FNN_z, and h_l corresponds to the l-th word in the reference comment; finally, the intermediate hidden vector set h = {h_1, h_2, …, h_L} is obtained, and h_L is called the last-layer hidden vector;
the last-layer hidden vector h_L is processed through the multi-layer perceptron and encoded into the means and variances of the Gaussian distribution models, expressed as:
μ_k, σ_k² = MLP(h_L), k = 1, 2, …, K
where MLP denotes the multi-layer perceptron;
the emotion category weights c_k of the reference comment, together with the means and variances of the Gaussian distribution models obtained from the multi-layer perceptron encoding, are substituted into the equation for p(z | c, W_e) to obtain the probability distribution of the emotion hidden vector z; the emotion hidden vector z is obtained by the sampling of the sampling layer and is encoded into the emotion hidden vector coding feature W_z through the third linear encoding layer;
Interacting the input vocabulary with the generated vocabulary of the previous time step, the visual characteristics, the text characteristics and the emotional hidden vector coding characteristics in sequence to obtain the vocabulary probability distribution of the current time step, and determining the generated vocabulary of the current time step according to the vocabulary probability distribution of the current time step to realize through a comment decoder;
the comment decoder comprises a fourth linear coding layer, a fourth transform model, a linear layer and a softmax layer;
the fourth linear coding layer is used for coding the input vocabulary to obtain a word embedding vector corresponding to the input vocabulary, and the word embedding vector is marked as y';
the fourth Transformer model comprises four multi-head attention modules and a fully-connected feed-forward network; the word embedding vector y' is interacted with the generated vocabulary of the previous time step through the first multi-head attention module MultiHead-Atten_o1, the interaction result of the first multi-head attention module is interacted with the visual features through the second multi-head attention module MultiHead-Atten_o2, the interaction result of the second multi-head attention module is interacted with the text features through the third multi-head attention module MultiHead-Atten_o3, the interaction result of the third multi-head attention module is interacted with the emotion hidden vector coding features through the fourth multi-head attention module MultiHead-Atten_o4, and the fully-connected feed-forward network FNN_o outputs the final decoded feature, the process being expressed as:
y^1 = MultiHead-Atten_o1(y', y, y)
y^2 = MultiHead-Atten_o2(y^1, W_F, W_F)
y^3 = MultiHead-Atten_o3(y^2, W_e, W_e)
s_t = FNN_o(MultiHead-Atten_o4(y^3, W_z, W_z))
where y, W_F, W_e and W_z respectively denote the generated vocabulary of the previous time step, the visual features, the text features and the emotion hidden vector coding features; s_t denotes the final decoded feature;
the final decoded feature s_t passes through the linear layer and the softmax layer in turn to obtain the vocabulary probability distribution of the current time step, expressed as:
p(y_t | y_0, …, y_{t−1}, W_z, W_F, W_e) = Softmax(W · s_t)
where y_0, …, y_{t−1} denote the vocabulary generated from the initial time step 0 to the previous time step t−1, i.e. the generated vocabulary y of the previous time step, y_t denotes the vocabulary generated at the current time step t, and W denotes the parameters of the linear layer.
2. The method according to claim 1, wherein the visual encoder, the text encoder, the hidden vector encoder and the comment decoder form a diversified video comment generation model, and the objective function L of the training phase is expressed as:
L = E_{q(z | ŷ, F, e)}[ log p(ŷ | z, F, e) ] − D_KL( q(z | ŷ, F, e) || p(z | F, e) )
where z denotes the emotion hidden vector obtained by the hidden vector encoder, F denotes the video frame image set, e denotes the word embedding vector set corresponding to the comment text, and ŷ denotes the video comment at the current moment; p(ŷ | z, F, e) denotes the probability distribution of generating the video comment ŷ at the current moment conditioned on the emotion hidden vector z, the video frame image set F and the word embedding vector set e corresponding to the comment text; E_{q(z | ŷ, F, e)}[·] denotes the mathematical expectation with respect to q(z | ŷ, F, e); q(z | ŷ, F, e) denotes the probability distribution of the emotion hidden vector z conditioned on the video comment ŷ at the current moment, the video frame image set F and the word embedding vector set e corresponding to the comment text; p(z | F, e) denotes the probability distribution of the emotion hidden vector z conditioned on the video frame image set F and the word embedding vector set e corresponding to the comment text; and D_KL denotes the KL distance.
3. A diversified video comment generation system implemented based on the method of any one of claims 1-2, the system comprising:
the information acquisition unit is used for constructing a video frame image set by utilizing the video frame image at the current moment and a plurality of nearest video frame images thereof, extracting comments in the video frame image at the current moment as reference comments, and extracting all the comments in the nearest video frame images to form a comment text;
a visual encoder for extracting visual features from the set of video frame images;
a text encoder for extracting text features from the comment text in conjunction with the visual features;
the hidden vector encoder is used for generating an emotion hidden vector by combining the emotion category weights corresponding to the reference comments and encoding the emotion hidden vector into emotion hidden vector encoding characteristics;
the comment decoder is used for sequentially interacting the input vocabulary with the generated vocabulary of the previous time step, the visual characteristics, the text characteristics and the emotional hidden vector coding characteristics to obtain the vocabulary probability distribution of the current time step; the input vocabulary is the vocabulary in the reference comment or the vocabulary in the generated vocabulary of the previous time step;
and the video comment generating unit is used for determining the generated vocabulary of the current time step according to the vocabulary probability distribution of the current time step and synthesizing the generated vocabularies of all the time steps to form the video comment at the current moment.
4. A processing device, comprising: one or more processors; a memory for storing one or more programs;
wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method of any of claims 1-2.
5. A readable storage medium, storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any of claims 1-2.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210352708.8A CN114494980B (en) | 2022-04-06 | 2022-04-06 | Diversified video comment generation method, system, equipment and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114494980A (en) | 2022-05-13
CN114494980B (en) | 2022-07-15
Family
ID=81488043
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210352708.8A Active CN114494980B (en) | 2022-04-06 | 2022-04-06 | Diversified video comment generation method, system, equipment and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114494980B (en) |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10303768B2 (en) * | 2015-05-04 | 2019-05-28 | Sri International | Exploiting multi-modal affect and semantics to assess the persuasiveness of a video |
US10049106B2 (en) * | 2017-01-18 | 2018-08-14 | Xerox Corporation | Natural language generation through character-based recurrent neural networks with finite-state prior knowledge |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108133038A (en) * | 2018-01-10 | 2018-06-08 | 重庆邮电大学 | A kind of entity level emotional semantic classification system and method based on dynamic memory network |
CN109800390A (en) * | 2018-12-21 | 2019-05-24 | 北京石油化工学院 | A kind of calculation method and device of individualized emotion abstract |
CN111079532A (en) * | 2019-11-13 | 2020-04-28 | 杭州电子科技大学 | Video content description method based on text self-encoder |
CN111696535A (en) * | 2020-05-22 | 2020-09-22 | 百度在线网络技术(北京)有限公司 | Information verification method, device, equipment and computer storage medium based on voice interaction |
WO2021232725A1 (en) * | 2020-05-22 | 2021-11-25 | 百度在线网络技术(北京)有限公司 | Voice interaction-based information verification method and apparatus, and device and computer storage medium |
CN111858914A (en) * | 2020-07-27 | 2020-10-30 | 湖南大学 | Text abstract generation method and system based on sentence-level evaluation |
CN112329474A (en) * | 2020-11-02 | 2021-02-05 | 山东师范大学 | Attention-fused aspect-level user comment text emotion analysis method and system |
CN113918764A (en) * | 2020-12-31 | 2022-01-11 | 浙江大学 | Film recommendation system based on cross modal fusion |
CN113704393A (en) * | 2021-04-13 | 2021-11-26 | 腾讯科技(深圳)有限公司 | Keyword extraction method, device, equipment and medium |
CN113807222A (en) * | 2021-09-07 | 2021-12-17 | 中山大学 | Video question-answering method and system for end-to-end training based on sparse sampling |
Non-Patent Citations (2)
Title |
---|
Automatic facial expression analysis: a survey; B. Fasel et al.; Pattern Recognition; 2003-12-31; pp. 259-275 *
Research on learning resource recommendation methods based on user interest degree; Cao Yulong; China Master's Theses Full-text Database (Information Science and Technology); 2020-09-15; pp. I138-155 *
Also Published As
Publication number | Publication date |
---|---|
CN114494980A (en) | 2022-05-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107608943B (en) | Image subtitle generating method and system fusing visual attention and semantic attention | |
CN109472031B (en) | Aspect level emotion classification model and method based on double memory attention | |
Afouras et al. | Deep lip reading: a comparison of models and an online application | |
CN111652066A (en) | Medical behavior identification method based on multi-self-attention mechanism deep learning | |
CN114612891B (en) | Image description generation method and medium based on contrast learning and self-adaptive attention | |
CN114567815B (en) | Pre-training-based adaptive learning system construction method and device for lessons | |
Yang et al. | Real-time steganalysis for streaming media based on multi-channel convolutional sliding windows | |
CN112307760A (en) | Deep learning-based financial report emotion analysis method and device and terminal | |
CN112561718A (en) | Case microblog evaluation object emotion tendency analysis method based on BilSTM weight sharing | |
CN116579347A (en) | Comment text emotion analysis method, system, equipment and medium based on dynamic semantic feature fusion | |
CN116341562A (en) | Similar problem generation method based on Unilm language model | |
CN115935975A (en) | Controllable-emotion news comment generation method | |
CN117591648A (en) | Power grid customer service co-emotion dialogue reply generation method based on emotion fine perception | |
CN113554040B (en) | Image description method and device based on condition generation countermeasure network | |
CN114494980B (en) | Diversified video comment generation method, system, equipment and storage medium | |
CN114936723B (en) | Social network user attribute prediction method and system based on data enhancement | |
CN115422388B (en) | Visual dialogue method and system | |
CN113688204B (en) | Multi-person session emotion prediction method utilizing similar scenes and mixed attention | |
CN115171878A (en) | Depression detection method based on BiGRU and BiLSTM | |
CN115758218A (en) | Three-modal emotion analysis method based on long-time and short-time feature and decision fusion | |
CN115309894A (en) | Text emotion classification method and device based on confrontation training and TF-IDF | |
CN115422945A (en) | Rumor detection method and system integrating emotion mining | |
CN115270917A (en) | Two-stage processing multi-mode garment image generation method | |
Wu et al. | A DCGAN image generation algorithm based on AE feature extraction | |
Zhao et al. | MSA-HCL: Multimodal sentiment analysis model with hybrid contrastive learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |