CN111079601A - Video content description method, system and device based on multi-mode attention mechanism - Google Patents
Video content description method, system and device based on multi-mode attention mechanism
- Publication number: CN111079601A
- Application number: CN201911243331.7A
- Authority: CN (China)
- Prior art keywords: video, semantic attribute, feature, sequence, modal
- Prior art date: 2019-12-06
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V20/46 — Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
- G06F18/214 — Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N3/044 — Recurrent networks, e.g. Hopfield networks
- G06N3/045 — Combinations of networks
- G06N3/084 — Backpropagation, e.g. using gradient descent
Abstract
The invention belongs to the field of computer vision and natural language processing, and particularly relates to a video content description method, system and device based on a multi-modal attention mechanism, aiming at solving the problem that existing video content description methods generate description sentences of low accuracy because they consider only video features and ignore high-level semantic attribute information. The method comprises the following steps: acquiring a video frame sequence of a video to be described; extracting multi-modal feature vectors of the video frame sequence, constructing a multi-modal feature vector sequence, and obtaining the feature representation corresponding to each modal feature vector sequence through a recurrent neural network; obtaining the semantic attribute vector corresponding to each feature representation through a semantic attribute detection network; and concatenating the feature representations corresponding to the modal feature vector sequences and, based on the concatenated vector and the semantic attribute vectors, obtaining a description sentence of the video to be described through an attention-based LSTM network. The invention integrates visual features and high-level semantic attributes, and improves the accuracy of the generated video description sentences.
Description
Technical Field
The invention belongs to the field of computer vision and natural language processing, and particularly relates to a video content description method, system and device based on a multi-modal attention mechanism.
Background
Artificial intelligence can be roughly divided into two research directions: perceptual intelligence and cognitive intelligence. Research on perceptual intelligence, such as image classification and natural language translation, has progressed rapidly, whereas progress on cognitive intelligence, such as describing pictures in language and visual description in general, remains limited. Research that combines natural language and computer vision helps build a bridge for communication between humans and machines and promotes the study of cognitive intelligence.
Video content description differs from label-based coarse-grained visual understanding tasks such as video classification and object detection: it requires a fluent and accurate sentence that describes the video content. This demands not only identifying the objects in the video but also understanding the interrelationships between them. Moreover, video content admits a variety of description styles, such as abstract descriptions of scenes, descriptions of the relationships among objects, and descriptions of object behavior and motion, which poses great challenges for video content description research. Traditional video content description algorithms mainly adopt either language-template-based methods or retrieval-based methods. Language-template-based methods, constrained by fixed language templates, can only generate sentences of a single form that lack flexibility. Retrieval-based methods depend too heavily on the size of the retrieval video library; when the database lacks a video similar to the video to be described, the generated description sentence deviates substantially from the video content. Furthermore, both kinds of methods require complicated video preprocessing in the early stage and insufficiently optimize the language sequence at the back end, so the quality of the generated sentences is poor.
With advances in deep learning, encoder-decoder-based sequence learning models have achieved breakthroughs on the video content description problem. The present method is likewise based on an encoder-decoder model: it requires no complex early-stage video processing, supports end-to-end training of the network, and can learn the mapping between video and language directly from a large amount of training data, thereby generating video descriptions with more accurate content, diverse forms and flexible grammar.
The key to the video content description problem lies in the extraction of video features; because different modalities of information in a video can complement each other, encoding the multi-modal information of the video helps mine more semantic information. Meanwhile, because general video content description algorithms consider only video features and ignore the high-level semantic attribute information of the video, the invention also discusses, in order to improve the quality of the generated description sentences, how to extract high-level semantic attributes and apply them to the video content description task. The invention further analyzes the insufficient optimization of the language generation part at the decoder end. Most current video content description algorithms model the language sequence by maximum likelihood and train with a cross-entropy loss, which brings two obvious drawbacks. The first is the exposure bias problem: during training, the input to the decoder at each time step comes from the ground-truth words of the training set, whereas during testing, the input at each time step comes from the word predicted at the previous time step; if one word is predicted inaccurately, the error is propagated onward, and the quality of subsequently generated words keeps degrading. The second is the mismatch between the training objective and the evaluation criteria: the training stage maximizes the posterior probability with a cross-entropy loss function, while the evaluation stage uses objective criteria such as BLEU, METEOR and CIDEr, and this inconsistency prevents the model from fully optimizing the evaluation metrics of video content description.
Disclosure of Invention
In order to solve the above problem in the prior art, namely that existing video content description methods generate description sentences of low accuracy because they consider only the video features and ignore the high-level semantic attribute information of the video, a first aspect of the present invention provides a video content description method based on a multi-modal attention mechanism, which includes:
step S100, acquiring a video frame sequence of a video to be described as an input sequence;
step S200, extracting multi-modal feature vectors of the input sequence, constructing a multi-modal feature vector sequence, and obtaining feature representations corresponding to the modal feature vector sequences through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence;
step S300, based on the feature representation corresponding to each modal feature vector sequence, respectively obtaining semantic attribute vectors corresponding to each feature representation through a semantic attribute detection network;
step S400, cascading the feature representations corresponding to the modal feature vector sequences to obtain initial coding vectors; obtaining a description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation;
wherein,
the semantic attribute detection network is constructed based on a multilayer perceptron and is trained based on training samples, and the training samples comprise feature representation samples and corresponding semantic attribute vector labels.
In some preferred embodiments, in step S200, "extracting the multi-modal feature vectors of the input sequence, and constructing a multi-modal feature vector sequence", the method includes:
extracting the features of each frame of RGB image in the input sequence based on a depth residual error network to obtain a video frame feature vector sequence;
based on the input sequence, obtaining an optical flow sequence through a Lucas-Kanade algorithm; extracting the features of the optical flow sequence through a depth residual error network to obtain an optical flow frame feature vector sequence;
and (3) dividing the input sequence into T sections in average, and extracting the characteristic vectors of the sequences of each section respectively through a three-dimensional convolution deep neural network to obtain a video segment characteristic vector sequence.
In some preferred embodiments, the training method of the semantic attribute detection network is as follows:
acquiring a training data set, wherein the training data set comprises videos and corresponding description sentences;
extracting words describing sentences in the training data set, sequencing the words according to the occurrence frequency, and selecting the first K words as high-level semantic attribute vectors; acquiring a real semantic attribute vector label of the video according to whether the description statement contains the high-level semantic attribute vector;
acquiring feature representation corresponding to a multi-modal feature vector sequence of the video in the training data set;
and training the semantic attribute detection network based on the feature representation and the real semantic attribute vector labels.
In some preferred embodiments, the loss function loss_1 of the semantic attribute detection network during training is:
wherein N is the number of description sentences in the training data set, K is the dimension of the predicted semantic attribute vector label output by the semantic attribute detection network, s_ik is the predicted semantic attribute vector label output by the semantic attribute detection network, y_ik is the real semantic attribute vector label, i and k are subscripts, α is a weight, and W_encoder is the set of all weight matrix and bias matrix parameters of the recurrent neural network and the semantic attribute detection network.
In some preferred embodiments, in step S400, "obtaining the sentence description of the video to be described through the LSTM network based on the attention mechanism based on the initial encoding vector and the semantic attribute vector corresponding to each feature representation" includes:
weighting the semantic attribute vectors corresponding to the feature representations through an attention mechanism to obtain multi-modal semantic attribute vectors;
and generating statement description of the video to be described through an LSTM network based on the initial coding vector and the multi-mode semantic attribute vector.
In some preferred embodiments, the attention-based LSTM network performs the calculation of the weight matrix in a training process by using a factorization method.
In a second aspect, the invention provides a video content description system based on a multi-modal attention mechanism, which comprises an acquisition module, an extracted feature representation module, a semantic attribute detection module and a video description generation module;
the acquisition module is configured to acquire a video frame sequence of a video to be described as an input sequence;
the extracted feature representation module is configured to extract multi-modal feature vectors of the input sequence, construct a multi-modal feature vector sequence, and obtain feature representations corresponding to the feature vector sequences of each modality through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence;
the semantic attribute detection module is configured to obtain semantic attribute vectors corresponding to the feature representations respectively through a semantic attribute detection network based on the feature representations corresponding to the modal feature vector sequences;
the video generation description module is configured to cascade feature representations corresponding to the modal feature vector sequences to obtain initial coding vectors; obtaining a description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation;
wherein,
the semantic attribute detection network is constructed based on a multilayer perceptron and is trained based on training samples, and the training samples comprise feature representation samples and corresponding semantic attribute vector labels.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being loaded and executed by a processor to implement the above-mentioned video content description method based on the multi-modal attention mechanism.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described method for video content description based on a multi-modal attention mechanism.
The invention has the beneficial effects that:
the invention integrates visual features and high-level semantic attributes, and improves the accuracy of generating the video description sentences. The invention starts from multi-modal information, obtains a video characteristic vector sequence by adopting a method of combining a video frame, a light stream frame and a video fragment, and simultaneously detects and generates a semantic attribute vector label of the video. In order to obtain more effective visual characteristics and semantic attributes, the auxiliary classification loss and LSTM network loss in the generation stage of the semantic attribute vector labels are optimized simultaneously, and the context relationship in sentences can be ensured. In the decoding stage, an attention mechanism algorithm combined with semantic attributes is provided, the semantic attribute vectors are fused into a traditional recurrent neural network weight matrix, and in each moment of generating a sentence word, the attention mechanism is adopted to pay attention to the specific semantic attributes, so that the accuracy of video content description is improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
FIG. 1 is a flow chart of a method for describing video content based on a multi-modal attention mechanism according to an embodiment of the present invention;
FIG. 2 is a block diagram of a multi-modal attention mechanism based video content description system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a training process of a video content description method based on a multi-modal attention mechanism according to an embodiment of the present invention;
fig. 4 is a schematic network structure diagram of a semantic attribute detection network according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The video content description method based on the multi-modal attention mechanism, as shown in fig. 1, includes the following steps:
step S100, acquiring a video frame sequence of a video to be described as an input sequence;
step S200, extracting multi-modal feature vectors of the input sequence, constructing a multi-modal feature vector sequence, and obtaining feature representations corresponding to the modal feature vector sequences through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence;
step S300, based on the feature representation corresponding to each modal feature vector sequence, respectively obtaining semantic attribute vectors corresponding to each feature representation through a semantic attribute detection network;
step S400, cascading the feature representations corresponding to the modal feature vector sequences to obtain initial coding vectors; obtaining a description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation;
wherein,
the semantic attribute detection network is constructed based on a multilayer perceptron and is trained based on training samples, and the training samples comprise feature representation samples and corresponding semantic attribute vector labels.
In order to more clearly explain the video content description method based on the multi-modal attention mechanism, the following describes the steps in an embodiment of the method in detail with reference to the accompanying drawings.
The programming language in which the method of the present invention is implemented is not specifically limited, and the method can be implemented in any programming language. In this embodiment, a 4-card Titan Xp GPU server with 12 GB of video memory is used, and a working program of the video content description method based on the multi-modal attention mechanism is written in Python, thereby realizing the method of the invention. The concrete steps are as follows:
step S100, acquiring a video frame sequence of a video to be described as an input sequence.
In this embodiment, the video to be described may be a video shot in real time, for example, in an intelligent monitoring and behavior analysis scene, a video shot by a camera in real time needs to be described, and at this time, the video to be described may be a video shot by the camera in real time; or the video to be described may be a video acquired from a network, for example, in a video content preview scene, the video acquired from the network needs to be described through a natural language, so as to implement the preview of the video content by the user, and at this time, the video to be described may be a video acquired from the network and needing to be previewed; or the video to be described may be a locally stored video, for example, in a video classified storage scene, the video needs to be described, and is classified and stored according to the description information, and at this time, the video to be described may be a locally stored video that needs to be classified and stored. And extracting a video frame sequence as input based on the acquired video to be described.
Step S200, extracting multi-modal feature vectors of the input sequence, constructing a multi-modal feature vector sequence, and obtaining feature representations corresponding to the modal feature vector sequences through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence.
In the present embodiment, the multi-modal video features of the video to be described are extracted, namely video frames, optical flow frames and video clips. The specific steps are as follows:
step S201, using a pre-trained depth residual error network to extract the features of each frame of a video frame sequence of a video to be described, using the output of the ith layer of the network as the feature representation of the frame, and obtaining a video frame feature vector sequence
Inputting the video frame feature vector sequence into the recurrent neural network LSTM in sequence, and hiding the hidden state h at the last moment of the networktThe feature representation, denoted v, of the video frame feature vector as a videof。
Step S202, generating an optical flow sequence of a video by a video frame sequence of a video to be described through a Lucas-Kanade algorithm, performing feature extraction on each frame through a pre-trained depth residual error network, and taking the output of the ith layer of the network as feature representation of the frame to obtain an optical flow frame feature vector sequence
Inputting the light stream frame feature vector sequence into a recurrent neural network (LSTM) in sequence, and hiding the hidden state h of the network at the last momenttThe characteristic representation of an optical flow frame as a video is denoted vo
Step S203, the video frame sequence of the video to be described is averagely divided into T sections, each section uses a three-dimensional convolution depth neural network to carry out feature extraction, the output of the ith layer of the network is used as the feature representation of the T section of the video, and the feature vector sequence of the video segment is obtained
Inputting the video segment feature vector sequence into the recurrent neural network LSTM in sequence, and hiding the hidden state h at the last moment of the networktThe characteristic representation of a video segment as a video, denoted vc。
In the multi-modal feature extraction of the above steps, as shown in fig. 3, the input Video is divided into video frames (Frame), optical flow frames (Optical flow) and video clips (Video clip); the video frames yield the Frame Feature (the static feature), the video clips yield the C3D Feature (the 3D-CNN feature of the video clips), and the optical flow frames yield the Motion Feature (the dynamic feature). The other modules in fig. 3 are described below.
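For illustration, a minimal sketch of one modality branch of this encoding stage is given below. It assumes PyTorch and torchvision; the ResNet-50 backbone, the 224×224 input size and the hidden dimension are illustrative choices rather than values specified by this embodiment. Per-frame features are extracted by a pre-trained residual network, the feature vector sequence is fed to an LSTM, and the last hidden state serves as the modality-level representation.

```python
# A minimal sketch (PyTorch/torchvision assumed) of one modality branch of steps S201-S203:
# per-frame features from a pretrained deep residual network, encoded by an LSTM whose
# last hidden state is used as the modality-level representation (v_f / v_o / v_c).
import torch
import torch.nn as nn
import torchvision.models as models

class ModalityEncoder(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512):
        super().__init__()
        resnet = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the final fc layer
        self.lstm = nn.LSTM(input_size=feat_dim, hidden_size=hidden_dim, batch_first=True)

    def forward(self, frames):                      # frames: (B, N, 3, 224, 224)
        b, n = frames.shape[:2]
        x = self.backbone(frames.flatten(0, 1))     # (B*N, feat_dim, 1, 1)
        x = x.flatten(1).view(b, n, -1)             # frame feature vector sequence
        _, (h_t, _) = self.lstm(x)                  # hidden state at the last time step
        return h_t[-1]                              # modality feature representation, e.g. v_f

# The optical-flow branch (v_o) and the clip branch (v_c, e.g. a 3D-CNN over T segments)
# follow the same pattern; v = concat([v_f, v_o, v_c]) is the initial encoding vector.
```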
And step S300, respectively obtaining semantic attribute vectors corresponding to the feature representations through a semantic attribute detection network based on the feature representations corresponding to the modal feature vector sequences.
In this embodiment, training of the semantic attribute detection network is introduced first, and then semantic attribute vectors corresponding to the feature representations obtained by the semantic attribute detection network are introduced.
The semantic attribute detection network is constructed based on a multilayer perceptron; its structure is shown in fig. 4 and comprises an input layer, a hidden layer and an output layer. A video (input video) and the corresponding description sentence ("A Small child playing the guitar") are input, the multi-modal feature vector sequence (v_i1, v_i2, ..., v_in) is obtained through a recurrent neural network (LSTM), and the semantic attribute detection network outputs the semantic attribute vector s_i1, s_i2, ..., s_iK. The specific training process of the semantic attribute detection network is as follows:
step A301, a training data set is obtained, wherein the training data set comprises videos and corresponding description sentences.
Step A302, extracting words describing sentences in the training data set, sequencing the words according to the occurrence frequency, and selecting the first K words as high-level semantic attribute vectors; and acquiring a real semantic attribute vector label of the video according to whether the description statement contains the high-level semantic attribute vector.
Extracting the words of the description sentences in the training data set, sorting the words according to their occurrence frequency, removing function words, and then selecting the K words with the highest occurrence probability as the high-level semantic attribute values.
Suppose the training data set has N sentences, and y_i = [y_i1, y_i2, ..., y_il, ..., y_iK] is the real semantic attribute vector label of the i-th video, where y_il = 1 if the description sentence corresponding to video i contains the attribute word l, and y_il = 0 otherwise.
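A small sketch of this labelling step is given below for illustration; the value of K, the stop-word list and the whitespace tokenisation are placeholder assumptions, not values fixed by the embodiment.

```python
# A small sketch (assumed helpers, not from the patent text) of building the top-K
# semantic attribute vocabulary and the binary label vectors y_i described above.
from collections import Counter

def build_attribute_vocab(captions, K=300, stop_words=frozenset({"a", "the", "is", "of"})):
    counts = Counter(w for caption in captions for w in caption.lower().split()
                     if w not in stop_words)
    return [w for w, _ in counts.most_common(K)]        # high-level attribute words

def attribute_labels(caption, vocab):
    words = set(caption.lower().split())
    return [1 if w in words else 0 for w in vocab]      # y_i, a binary vector of length K

captions = ["a small child playing the guitar", "a man is riding a horse"]
vocab = build_attribute_vocab(captions, K=10)
y_0 = attribute_labels(captions[0], vocab)
```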
Step a303 is to obtain feature representations corresponding to the multi-modal feature vector sequences of the video in the training dataset by the method in step S200.
Step A304, training the semantic attribute detection network based on the feature representation and the real semantic attribute vector labels.
Let v_i ∈ {v_f, v_o, v_c} denote the feature representation learned for the i-th video; the training sample is then {v_i, y_i}. In the invention, the semantic attribute detection network constructed with a multilayer perceptron learns a function f(·): R^m → R^K, i.e. a mapping from an m-dimensional space to a K-dimensional space, where R^m is the m-dimensional real space and R^K likewise, m is the dimension of the input feature representation, and K is the dimension of the output semantic attribute vector, equal to the number of extracted semantic attribute values (the dimension of the high-level semantic attribute vector). The multilayer perceptron outputs the vector s_i = [s_i1, ..., s_iK], the predicted semantic attribute vector label of the i-th video. The classification loss function loss_1 of the semantic attribute detection network is shown in equation (1):
wherein W_encoder denotes the set of all weight matrix and bias matrix parameters of the recurrent neural network and the semantic attribute detection network, α is a weight, s_i = α(f(v_i)) is the learned K-dimensional vector, α(·) denotes the sigmoid function, and f(·) is implemented by the multilayer perceptron.
After training of the semantic attribute detection network is completed, in an actual application process, semantic attribute vectors corresponding to the feature representations are obtained through the semantic attribute detection network respectively based on the feature representations corresponding to the modal feature vector sequences. Such as the Multimodal Semantic Detector module in fig. 3.
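For illustration, a hedged sketch of such a semantic attribute detector is given below, assuming PyTorch. Since equation (1) is not reproduced in this text, the standard multi-label cross-entropy is used as an assumed stand-in for the classification loss, and all dimensions are illustrative.

```python
# A hedged sketch of the semantic attribute detector: a multilayer perceptron mapping an
# m-dimensional feature representation to K attribute probabilities, trained with a
# multi-label cross-entropy. The exact form of equation (1) is not reproduced in the text,
# so the loss below is the standard multi-label formulation, stated as an assumption.
import torch
import torch.nn as nn

class SemanticAttributeDetector(nn.Module):
    def __init__(self, m=512, hidden=1024, K=300):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(m, hidden), nn.ReLU(), nn.Linear(hidden, K))

    def forward(self, v):                  # v: (B, m) feature representation v_i
        return torch.sigmoid(self.mlp(v))  # s_i = sigmoid(f(v_i)), K attribute probabilities

detector = SemanticAttributeDetector()
v = torch.randn(8, 512)                    # feature representations of a batch of videos
y = torch.randint(0, 2, (8, 300)).float()  # real semantic attribute vector labels y_i
s = detector(v)
loss1 = nn.functional.binary_cross_entropy(s, y)  # assumed multi-label classification loss
```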
Step S400, cascading the feature representations corresponding to the modal feature vector sequences to obtain initial coding vectors; and obtaining the description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation.
In this embodiment, {v_f, v_o, v_c} are concatenated as the initial encoding vector v, corresponding to the concatenation module in fig. 3, and the attention-based fusion corresponds to the Attention Fusion module.
The following description first introduces the training process of the attention-based LSTM network, and then introduces the method for obtaining the descriptive statement of the video to be described through the attention-based LSTM network.
The attention-based LSTM network is trained with the input description sentence "A Small child playing the guitar"; the specific training process is as follows:
when the output is a sentence, the LSTM network is used as a decoder, where the long-term dependence of the sentence can be captured. Suppose that the word input at the present time is wtThe hidden state at a moment on the LSTM network is ht-1The last memory state of the cell is ct-1Then, the update rule of the LSTM network at time t is as shown in equations (2) (3) (4) (5) (6) (7) (8):
it=σ(Wiwt+Uhiht-1+z) (2)
ft=σ(Wfwt+Uhfht-1+z) (3)
ot=σ(Wowt+Uhoht-1+z) (4)
ht=ot⊙tanh(ct) (7)
z=1(t=1)·Cv (8)
a subscript of the above formula { i, f, o, c }, wherein W*、Uh*And C are both weight matrices, it,ft,ot,ct,The states of the input gate, the forgetting gate, the output gate, the memory unit, and the compression input at time t are respectively represented, tanh (·) represents a hyperbolic tangent function, 1(t ═ 1) represents an indication function, an initial encoding vector v is input at the initial time of LSTM, and z represents that a video vector is input at the initial time when t ═ 1. For simplicity, the bias terms in the above equations are omitted.
To better exploit the side information provided by the semantic attributes of multiple modalities, we propose an attention mechanism that computes the weight matrices W_* and U_h* in combination with the semantic attributes: each weight matrix of the conventional LSTM is expanded into a set of K attribute-related weight matrices, so as to mine the meaning of individual words when generating the final description sentence. The initial weight matrices W_*/U_h* are replaced by W_*(S_t)/U_h*(S_t), where S_t ∈ R^K is the multi-modal semantic attribute vector and changes dynamically at each moment. Specifically, two weight tensors W_τ ∈ R^(K×n_h×n_x) and U_τ ∈ R^(K×n_h×n_h) are defined, where n_h is the number of hidden units and n_x is the dimension of the word embedding vector. W_*(S_t)/U_h*(S_t) are given by equations (9) and (10):
W_*(S_t) = Σ_{k=1}^{K} S_t[k]·W_τ[k] (9)
U_h*(S_t) = Σ_{k=1}^{K} S_t[k]·U_τ[k] (10)
where W_τ[k] and U_τ[k] denote the k-th 2D slices of the weight tensors W_τ and U_τ respectively, each associated with the probability value S_t[k], and S_t[k] is the k-th element of the multi-modal semantic attribute vector S_t. Since W_τ is a three-dimensional tensor, a 2D slice of W_τ is a two-dimensional matrix.
S_t is computed as shown in equations (11), (12) and (13):
S_t = Σ_{i=1}^{l} a_ti·s_i (11)
a_ti = exp(e_ti) / Σ_{j=1}^{l} exp(e_tj) (12)
e_ti = w^T·tanh(W_a·h_{t-1} + U_a·s_i) (13)
where l = 3 denotes the three learned semantic attribute vectors {s_f, s_o, s_c}, and the attention weight a_ti reflects the importance of the i-th semantic attribute of the video at the current moment. It can be seen that the semantic attribute vector S_t differs for different time steps t, so that the model selectively focuses on different semantic attribute parts of the video each time it generates a word; j denotes a subscript over the semantic attributes, e_ti denotes the unnormalized attention weight, and w^T denotes a transformation matrix.
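A minimal sketch of this attention step is given below; PyTorch is assumed and the dimensions are illustrative. It scores each modality's semantic attribute vector against the previous hidden state (equation (13)), normalizes the scores (equation (12)), and forms the multi-modal semantic attribute vector S_t as the weighted sum (equation (11)).

```python
# A minimal sketch (PyTorch assumed; dimensions illustrative) of equations (11)-(13):
# score each modality's semantic attribute vector against the previous hidden state,
# softmax the scores, and form the multi-modal semantic attribute vector S_t.
import torch
import torch.nn as nn

class SemanticAttention(nn.Module):
    def __init__(self, hidden_dim=512, K=300, att_dim=256):
        super().__init__()
        self.W_a = nn.Linear(hidden_dim, att_dim, bias=False)
        self.U_a = nn.Linear(K, att_dim, bias=False)
        self.w = nn.Linear(att_dim, 1, bias=False)

    def forward(self, h_prev, attrs):          # attrs: (B, l=3, K) = {s_f, s_o, s_c}
        e = self.w(torch.tanh(self.W_a(h_prev).unsqueeze(1) + self.U_a(attrs)))  # (B, 3, 1)
        a = torch.softmax(e, dim=1)             # attention weights a_ti
        return (a * attrs).sum(dim=1)           # S_t: (B, K)

att = SemanticAttention()
S_t = att(torch.randn(2, 512), torch.randn(2, 3, 300))
```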
During training of the attention-based LSTM network, which is equivalent to jointly training K LSTMs, the number of network parameters is proportional to the value of K; when K is large, the network can hardly be trained, so the following factorization is adopted, as shown in equations (14) and (15):
W_*(S_t) = W_a·diag(W_b·S_t)·W_c (14)
U_h*(S_t) = U_a·diag(U_b·S_t)·U_c (15)
The factorization greatly reduces the number of network parameters and thus solves the problem that the parameter count of the original network is proportional to K, as the following analysis shows. In equations (9) and (10), the total number of parameters is K·n_h·(n_x+n_h), which is proportional to K. In equations (14) and (15), the parameter count of W_*(S_t) is n_f·(n_h+K+n_x) and that of U_h*(S_t) is n_f·(2n_h+K), so the total of the two is n_f·(3n_h+2K+n_x). When n_f = n_h is chosen, for larger values of K, n_f·(3n_h+2K+n_x) is much smaller than K·n_h·(n_x+n_h).
Substituting the factorized equations (14) and (15) into the LSTM network update rules yields equations (16), (17) and (18):
x̂_t = W_a·(W_b·S_t ⊙ W_c·w_t) (16)
ĥ_{t-1} = U_a·(U_b·S_t ⊙ U_c·h_{t-1}) (17)
i_t = σ(x̂_t + ĥ_{t-1} + z) (18)
where ⊙ denotes the element-wise multiplication operator. For every element value of S_t, the parameter matrices W_a and U_a are shared, which effectively captures the common language patterns in videos, while the diagonal matrices diag(W_b·S_t) and diag(U_b·S_t) attend to the specific semantic attribute parts of different videos; x̂_t denotes the input merged with the semantic attribute vector, and ĥ_{t-1} denotes the hidden state merged with the semantic attribute vector. In the same way, it can be shown that the expressions for f_t, o_t and c_t are similar to the above.
From the above formulas it can be seen that, after the network is fully trained, it can not only effectively capture the common language patterns in videos but also focus on the specific semantic attribute parts of a video; meanwhile, because factorization is adopted, the number of network parameters is greatly reduced, which solves the problem that the parameter count of the original network is proportional to the value of K.
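The following sketch illustrates the factorized, attribute-dependent transform of equations (14)-(18) as it would be used inside one gate; PyTorch is assumed, the dimensions n_h, n_x, n_f and K are illustrative, and the z term of the initial moment is omitted.

```python
# A sketch of the factorized, attribute-dependent weight transform: W_*(S_t)·w_t is
# computed as W_a · (W_b S_t ⊙ W_c w_t) without materializing a K-dependent weight tensor.
# Dimensions (n_h, n_x, K, n_f) are illustrative assumptions.
import torch
import torch.nn as nn

class FactorizedTransform(nn.Module):
    def __init__(self, in_dim, out_dim, K, n_f):
        super().__init__()
        self.W_c = nn.Linear(in_dim, n_f, bias=False)   # n_f x n_x
        self.W_b = nn.Linear(K, n_f, bias=False)        # n_f x K
        self.W_a = nn.Linear(n_f, out_dim, bias=False)  # n_h x n_f

    def forward(self, x, S_t):
        # diag(W_b S_t) applied as an element-wise gate on W_c x
        return self.W_a(self.W_b(S_t) * self.W_c(x))

n_h, n_x, K, n_f = 512, 300, 300, 512
gate_x = FactorizedTransform(n_x, n_h, K, n_f)   # plays the role of W_i(S_t)·w_t
gate_h = FactorizedTransform(n_h, n_h, K, n_f)   # plays the role of U_hi(S_t)·h_{t-1}
x_hat = gate_x(torch.randn(2, n_x), torch.randn(2, K))
h_hat = gate_h(torch.randn(2, n_h), torch.randn(2, K))
i_t = torch.sigmoid(x_hat + h_hat)               # input-gate form, omitting the z term
```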
In the process of generating the description sentence of the video, greedy search is adopted, and the word output at each moment is given by equation (19):
w_t = softmax(W·h_t) (19)
where W is the transformation matrix.
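A hedged sketch of this greedy decoding loop follows; decoder_step, the special token ids and the maximum length are illustrative assumptions about the surrounding implementation, not details given in this embodiment.

```python
# A hedged sketch of greedy decoding with equation (19): at each step the word distribution
# is softmax(W h_t) and the argmax token is fed back as the next input. `decoder_step`
# (one attention-LSTM update returning the new states) is an assumed interface.
import torch

def greedy_decode(decoder_step, W, h0, c0, bos_id=1, eos_id=2, max_len=20):
    words, (h, c), w_prev = [], (h0, c0), torch.tensor([bos_id])
    for _ in range(max_len):
        h, c = decoder_step(w_prev, h, c)        # one LSTM update as in eqs. (2)-(8)/(16)-(18)
        probs = torch.softmax(h @ W.T, dim=-1)   # equation (19): w_t = softmax(W h_t)
        w_prev = probs.argmax(dim=-1)            # greedy choice of the next word id
        if w_prev.item() == eos_id:
            break
        words.append(w_prev.item())
    return words
```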
Accordingly, the loss function loss_2 of the sentences generated by the network is designed as shown in equation (20):
loss_2 = -log P(Y | v, s_f, s_c, s_o) = -Σ_t log P(w_t | w_{0~t-1}) (20)
where Y = {w_1, w_2, ..., w_N} denotes a sentence of N words, and w_{0~t-1} denotes the words generated before time t.
The classification loss loss_1 for generating the high-level semantic attributes and the loss loss_2 for generating the description sentence are added and optimized simultaneously, which ensures the contextual relationships within the sentence. The network is trained with a back-propagation algorithm based on the obtained loss value; this corresponds to the Classification Loss and Captioning Loss modules in fig. 3, which are added to obtain the Total Loss module.
After the attention-based LSTM network is trained, the description statement of the video to be described is obtained through the attention-based LSTM network based on the initial coding vector and the semantic attribute vector corresponding to each feature representation.
A second embodiment of the present invention is a video content description system based on a multi-modal attention mechanism, as shown in fig. 2, including: the system comprises an acquisition module 100, an extracted feature representation module 200, a semantic attribute detection module 300 and a video description generation module 400;
the acquiring module 100 is configured to acquire a video frame sequence of a video to be described as an input sequence;
the extracted feature representation module 200 is configured to extract the multi-modal feature vectors of the input sequence, construct a multi-modal feature vector sequence, and obtain feature representations corresponding to the feature vector sequences of each modality through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence;
the semantic attribute detection module 300 is configured to obtain semantic attribute vectors corresponding to the feature representations respectively through a semantic attribute detection network based on the feature representations corresponding to the feature vector sequences of the respective modalities;
the generated video description module 400 is configured to cascade feature representations corresponding to the feature vector sequences of each modality to obtain an initial coding vector; obtaining a description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation;
wherein,
the semantic attribute detection network is constructed based on a multilayer perceptron and is trained based on training samples, and the training samples comprise feature representation samples and corresponding semantic attribute vector labels.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
It should be noted that, the video content description system based on the multi-modal attention mechanism provided in the foregoing embodiment is only illustrated by the division of the above functional modules, and in practical applications, the functions may be allocated to different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the present invention stores therein a plurality of programs, which are adapted to be loaded by a processor and to implement the above-described video content description method based on the multi-modal attention mechanism.
A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described method for video content description based on a multi-modal attention mechanism.
It can be clearly understood by those skilled in the art that, for convenience and brevity, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method examples, and are not described herein again.
Those of skill in the art would appreciate that the various illustrative modules, method steps, and modules described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules, method steps may be located in Random Access Memory (RAM), memory, Read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.
Claims (9)
1. A method for describing video content based on a multi-modal attention mechanism, the method comprising:
step S100, acquiring a video frame sequence of a video to be described as an input sequence;
step S200, extracting multi-modal feature vectors of the input sequence, constructing a multi-modal feature vector sequence, and obtaining feature representations corresponding to the modal feature vector sequences through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence;
step S300, based on the feature representation corresponding to each modal feature vector sequence, respectively obtaining semantic attribute vectors corresponding to each feature representation through a semantic attribute detection network;
step S400, cascading the feature representations corresponding to the modal feature vector sequences to obtain initial coding vectors; obtaining a description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation;
wherein,
the semantic attribute detection network is constructed based on a multilayer perceptron and is trained based on training samples, and the training samples comprise feature representation samples and corresponding semantic attribute vector labels.
2. The method for describing video content based on multi-modal attention mechanism according to claim 1, wherein in step S200, "extracting multi-modal feature vectors of the input sequence to construct a multi-modal feature vector sequence" comprises:
extracting the features of each frame of RGB image in the input sequence based on a depth residual error network to obtain a video frame feature vector sequence;
based on the input sequence, obtaining an optical flow sequence through a Lucas-Kanade algorithm; extracting the features of the optical flow sequence through a depth residual error network to obtain an optical flow frame feature vector sequence;
and (3) dividing the input sequence into T sections in average, and extracting the characteristic vectors of the sequences of each section respectively through a three-dimensional convolution deep neural network to obtain a video segment characteristic vector sequence.
3. The method for describing video content based on multi-modal attention mechanism according to claim 1, wherein the semantic attribute detection network is trained by:
acquiring a training data set, wherein the training data set comprises videos and corresponding description sentences;
extracting words describing sentences in the training data set, sequencing the words according to the occurrence frequency, and selecting the first K words as high-level semantic attribute vectors; acquiring a real semantic attribute vector label of the video according to whether the description statement contains the high-level semantic attribute vector;
acquiring feature representation corresponding to a multi-modal feature vector sequence of the video in the training data set;
and training the semantic attribute detection network based on the feature representation and the real semantic attribute vector labels.
4. The method according to claim 3, wherein the loss function loss_1 of the semantic attribute detection network during training is:
wherein N is the number of description sentences in the training data set, K is the dimension of the predicted semantic attribute vector label output by the semantic attribute detection network, s_ik is the predicted semantic attribute vector label output by the semantic attribute detection network, y_ik is the real semantic attribute vector label, i and k are subscripts, α is a weight, and W_encoder is the set of all weight matrix and bias matrix parameters of the recurrent neural network and the semantic attribute detection network.
5. The method according to claim 1, wherein in step S400, "obtaining the description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation" includes:
weighting the semantic attribute vectors corresponding to the feature representations through an attention mechanism to obtain multi-modal semantic attribute vectors;
and generating statement description of the video to be described through an LSTM network based on the initial coding vector and the multi-mode semantic attribute vector.
6. The method for video content description based on multi-modal attention mechanism according to claim 1, wherein the LSTM network based on attention mechanism adopts a factorization method for weight matrix calculation during training.
7. A video content description system based on a multi-mode attention mechanism is characterized by comprising an acquisition module, an extracted feature representation module, a semantic attribute detection module and a generated video description module;
the acquisition module is configured to acquire a video frame sequence of a video to be described as an input sequence;
the extracted feature representation module is configured to extract multi-modal feature vectors of the input sequence, construct a multi-modal feature vector sequence, and obtain feature representations corresponding to the feature vector sequences of each modality through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence;
the semantic attribute detection module is configured to obtain semantic attribute vectors corresponding to the feature representations respectively through a semantic attribute detection network based on the feature representations corresponding to the modal feature vector sequences;
the video generation description module is configured to cascade feature representations corresponding to the modal feature vector sequences to obtain initial coding vectors; obtaining a description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation;
wherein,
the semantic attribute detection network is constructed based on a multilayer perceptron and is trained based on training samples, and the training samples comprise feature representation samples and corresponding semantic attribute vector labels.
8. A storage device having a plurality of programs stored therein, wherein the programs are adapted to be loaded and executed by a processor to implement the method for multi-modal attention mechanism based video content description according to any one of claims 1-6.
9. A processing device comprising a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; characterized in that the program is adapted to be loaded and executed by a processor to implement the method for multi-modal attention mechanism based video content description according to any of claims 1-6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911243331.7A CN111079601A (en) | 2019-12-06 | 2019-12-06 | Video content description method, system and device based on multi-mode attention mechanism |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111079601A true CN111079601A (en) | 2020-04-28 |
Family
ID=70313089
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911243331.7A Pending CN111079601A (en) | 2019-12-06 | 2019-12-06 | Video content description method, system and device based on multi-mode attention mechanism |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111079601A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107038221A (en) * | 2017-03-22 | 2017-08-11 | 杭州电子科技大学 | A kind of video content description method guided based on semantic information |
CN110110145A (en) * | 2018-01-29 | 2019-08-09 | 腾讯科技(深圳)有限公司 | Document creation method and device are described |
CN109344288A (en) * | 2018-09-19 | 2019-02-15 | 电子科技大学 | A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism |
CN110333774A (en) * | 2019-03-20 | 2019-10-15 | 中国科学院自动化研究所 | A kind of remote user's attention appraisal procedure and system based on multi-modal interaction |
Non-Patent Citations (2)
Title |
---|
Liang Sun et al.: "Multimodal Semantic Attention Network for Video Captioning", https://arxiv.org/abs/1905.02963v1 *
Dai Guoqiang et al.: "Science and Technology Big Data" (《科技大数据》), 31 August 2018, Scientific and Technical Documentation Press *
Cited By (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111723649A (en) * | 2020-05-08 | 2020-09-29 | 天津大学 | Short video event detection method based on semantic decomposition |
CN111783709B (en) * | 2020-07-09 | 2022-09-06 | 中国科学技术大学 | Information prediction method and device for education video |
CN111783709A (en) * | 2020-07-09 | 2020-10-16 | 中国科学技术大学 | Information prediction method and device for education video |
CN114268846A (en) * | 2020-09-16 | 2022-04-01 | 镇江多游网络科技有限公司 | Video description generation model based on attention mechanism |
CN112801017A (en) * | 2021-02-09 | 2021-05-14 | 成都视海芯图微电子有限公司 | Visual scene description method and system |
CN112801017B (en) * | 2021-02-09 | 2023-08-04 | 成都视海芯图微电子有限公司 | Visual scene description method and system |
CN113191263A (en) * | 2021-04-29 | 2021-07-30 | 桂林电子科技大学 | Video description method and device |
CN113673535A (en) * | 2021-05-24 | 2021-11-19 | 重庆师范大学 | Image description generation method of multi-modal feature fusion network |
CN113673535B (en) * | 2021-05-24 | 2023-01-10 | 重庆师范大学 | Image description generation method of multi-modal feature fusion network |
CN113269093B (en) * | 2021-05-26 | 2023-08-22 | 大连民族大学 | Visual feature segmentation semantic detection method and system in video description |
CN113269093A (en) * | 2021-05-26 | 2021-08-17 | 大连民族大学 | Method and system for detecting visual characteristic segmentation semantics in video description |
CN113269253B (en) * | 2021-05-26 | 2023-08-22 | 大连民族大学 | Visual feature fusion semantic detection method and system in video description |
CN113269253A (en) * | 2021-05-26 | 2021-08-17 | 大连民族大学 | Method and system for detecting fusion semantics of visual features in video description |
CN113312923A (en) * | 2021-06-18 | 2021-08-27 | 广东工业大学 | Method for generating text explanation of ball game |
CN113641854A (en) * | 2021-07-28 | 2021-11-12 | 上海影谱科技有限公司 | Method and system for converting characters into video |
CN113641854B (en) * | 2021-07-28 | 2023-09-26 | 上海影谱科技有限公司 | Method and system for converting text into video |
CN113553445B (en) * | 2021-07-28 | 2022-03-29 | 北京理工大学 | Method for generating video description |
CN113553445A (en) * | 2021-07-28 | 2021-10-26 | 北京理工大学 | Method for generating video description |
CN113705402A (en) * | 2021-08-18 | 2021-11-26 | 中国科学院自动化研究所 | Video behavior prediction method, system, electronic device and storage medium |
CN113792183B (en) * | 2021-09-17 | 2023-09-08 | 咪咕数字传媒有限公司 | Text generation method and device and computing equipment |
CN113792183A (en) * | 2021-09-17 | 2021-12-14 | 咪咕数字传媒有限公司 | Text generation method and device and computing equipment |
WO2023050295A1 (en) * | 2021-09-30 | 2023-04-06 | 中远海运科技股份有限公司 | Multimodal heterogeneous feature fusion-based compact video event description method |
CN114627413B (en) * | 2022-03-11 | 2022-09-13 | 电子科技大学 | Video intensive event content understanding method |
CN114339450B (en) * | 2022-03-11 | 2022-07-15 | 中国科学技术大学 | Video comment generation method, system, device and storage medium |
CN114627413A (en) * | 2022-03-11 | 2022-06-14 | 电子科技大学 | Video intensive event content understanding method |
CN114339450A (en) * | 2022-03-11 | 2022-04-12 | 中国科学技术大学 | Video comment generation method, system, device and storage medium |
CN115311595A (en) * | 2022-06-30 | 2022-11-08 | 中国科学院自动化研究所 | Video feature extraction method and device and electronic equipment |
CN115311595B (en) * | 2022-06-30 | 2023-11-03 | 中国科学院自动化研究所 | Video feature extraction method and device and electronic equipment |
CN115359383A (en) * | 2022-07-07 | 2022-11-18 | 北京百度网讯科技有限公司 | Cross-modal feature extraction, retrieval and model training method, device and medium |
CN116743609B (en) * | 2023-08-14 | 2023-10-17 | 清华大学 | QoE evaluation method and device for video streaming media based on semantic communication |
CN116743609A (en) * | 2023-08-14 | 2023-09-12 | 清华大学 | QoE evaluation method and device for video streaming media based on semantic communication |
CN118135452A (en) * | 2024-02-02 | 2024-06-04 | 广州像素数据技术股份有限公司 | Physical and chemical experiment video description method and related equipment based on large-scale video-language model |
CN117789099A (en) * | 2024-02-26 | 2024-03-29 | 北京搜狐新媒体信息技术有限公司 | Video feature extraction method and device, storage medium and electronic equipment |
CN117789099B (en) * | 2024-02-26 | 2024-05-28 | 北京搜狐新媒体信息技术有限公司 | Video feature extraction method and device, storage medium and electronic equipment |
CN118132803A (en) * | 2024-05-10 | 2024-06-04 | 成都考拉悠然科技有限公司 | Zero sample video moment retrieval method, system, equipment and medium |
CN118132803B (en) * | 2024-05-10 | 2024-08-13 | 成都考拉悠然科技有限公司 | Zero sample video moment retrieval method, system, equipment and medium |
CN118658104A (en) * | 2024-08-16 | 2024-09-17 | 厦门立马耀网络科技有限公司 | Video segmentation method and system based on cross attention and sequence attention |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111079601A (en) | Video content description method, system and device based on multi-mode attention mechanism | |
US11657230B2 (en) | Referring image segmentation | |
CN110532996B (en) | Video classification method, information processing method and server | |
CN108804530B (en) | Subtitling areas of an image | |
CN111488807B (en) | Video description generation system based on graph convolutional network | |
CN111860235B (en) | Method and system for generating high-low-level feature fused attention remote sensing image description | |
CN111741330B (en) | Video content evaluation method and device, storage medium and computer equipment | |
CN109947912A (en) | Model method based on intra-paragraph reasoning and combined question-answer matching | |
CN111079532A (en) | Video content description method based on text autoencoder | |
CN108563624A (en) | Spatial term method based on deep learning | |
CN109919221B (en) | Image description method based on bidirectional dual-attention mechanism | |
CN108536784B (en) | Comment information sentiment analysis method and device, computer storage medium and server | |
CN109543112A (en) | Sequence recommendation method and device based on recurrent convolutional neural network | |
CN111079658A (en) | Video-based multi-target continuous behavior analysis method, system and device | |
CN114339450B (en) | Video comment generation method, system, device and storage medium | |
CN114443899A (en) | Video classification method, device, equipment and medium | |
CN115311598A (en) | Video description generation system based on relation perception | |
CN115130591A (en) | Cross supervision-based multi-mode data classification method and device | |
CN112115744A (en) | Point cloud data processing method and device, computer storage medium and electronic equipment | |
CN112115131A (en) | Data denoising method, device and equipment and computer readable storage medium | |
CN112668608A (en) | Image identification method and device, electronic equipment and storage medium | |
CN112529149A (en) | Data processing method and related device | |
CN117541668A (en) | Virtual character generation method, device, equipment and storage medium | |
Taylor | Composable, distributed-state models for high-dimensional time series | |
CN117407557B (en) | Zero sample instance segmentation method, system, readable storage medium and computer |
Legal Events
Code | Title | Description
---|---|---
PB01 | Publication |
SE01 | Entry into force of request for substantive examination |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20200428