CN111079601A - Video content description method, system and device based on multi-mode attention mechanism - Google Patents

Video content description method, system and device based on multi-mode attention mechanism

Info

Publication number
CN111079601A
CN111079601A (application CN201911243331.7A)
Authority
CN
China
Prior art keywords
video
semantic attribute
feature
sequence
modal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911243331.7A
Other languages
Chinese (zh)
Inventor
胡卫明
孙亮
李兵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science filed Critical Institute of Automation of Chinese Academy of Science
Priority to CN201911243331.7A priority Critical patent/CN111079601A/en
Publication of CN111079601A publication Critical patent/CN111079601A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/084Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the fields of computer vision and natural language processing, and particularly relates to a video content description method, system and device based on a multi-modal attention mechanism, aiming at solving the problem that existing video content description methods consider only video features and ignore high-level semantic attribute information, which leads to low accuracy of the generated description sentences. The method comprises the following steps: acquiring a video frame sequence of the video to be described; extracting multi-modal feature vectors of the video frame sequence, constructing multi-modal feature vector sequences, and obtaining the feature representation corresponding to each modal feature vector sequence through a recurrent neural network; obtaining the semantic attribute vector corresponding to each feature representation through a semantic attribute detection network; and, based on the vector obtained by concatenating the feature representations corresponding to the modal feature vector sequences and on the semantic attribute vectors, generating a description sentence of the video to be described through an attention-based LSTM network. The invention integrates visual features with high-level semantic attributes and improves the accuracy of the generated video description sentences.

Description

Video content description method, system and device based on multi-mode attention mechanism
Technical Field
The invention belongs to the field of computer vision and natural language processing, and particularly relates to a video content description method, system and device based on a multi-mode attention mechanism.
Background
Artificial intelligence can be roughly divided into two research directions: perceptual intelligence and cognitive intelligence. Perceptual intelligence, such as image classification and natural language translation, has progressed rapidly, whereas cognitive intelligence, such as describing pictures in words and visual description, has developed more slowly. Research that combines natural language with computer vision helps build a bridge for communication between humans and machines and promotes the study of cognitive intelligence.
Unlike label-based coarse-grained visual understanding tasks such as video classification and object detection, video content description requires a fluent and accurate sentence to describe the video content. This requires not only identifying the objects in the video but also understanding the interrelationships between them. Meanwhile, video content can be described in many styles, such as abstract descriptions of scenes, descriptions of the relationships among objects, and descriptions of the behavior and motion of objects in the video, which poses great challenges for video content description research. Traditional video content description algorithms mainly adopt language-template-based or retrieval-based methods. Because of the limitation of fixed language templates, template-based methods can only generate sentences of a single form that lack flexibility. Retrieval-based methods depend too heavily on the size of the retrieval video library: when the database lacks a video similar to the video to be described, the generated description sentence deviates greatly from the video content. Moreover, both kinds of methods require a complicated video preprocessing stage and insufficiently optimize the language-sequence part at the back end, so the quality of the generated sentences is poor.
With the advancement of deep learning techniques, sequence learning models based on the encoder-decoder architecture have made breakthroughs on the video content description problem. The present method is also based on an encoder-decoder model: it requires no complex early-stage video processing, is trained end to end directly through the network, and learns the mapping between video and language directly from a large amount of training data, thereby generating video descriptions that are more accurate in content, diverse in form and flexible in grammar.
The key to the video content description problem lies in the extraction of video features; because the different modalities of information in a video can complement one another, encoding the multi-modal information of the video helps mine more semantic information. Meanwhile, because general video content description algorithms consider only the video features and ignore the high-level semantic attribute information of the video, in order to improve the quality of the generated description sentences, the invention also discusses how to extract high-level semantic attributes and apply them to the video content description task. The invention further analyzes the insufficient optimization of the language generation part at the decoder end. Most current video content description algorithms model the language sequence with maximum likelihood and train with a cross-entropy loss, which brings two obvious drawbacks. The first is the exposure bias problem: during training, the decoder input at each time step comes from the ground-truth words of the training set, whereas during testing the input at each time step comes from the word predicted at the previous step; if one word is predicted inaccurately, the error propagates and the quality of the subsequently generated words degrades. The second is the mismatch between the training objective and the evaluation criteria: the training stage maximizes the posterior probability with a cross-entropy loss, while the evaluation stage uses objective criteria such as BLEU, METEOR and CIDEr, and this inconsistency prevents the model from fully optimizing the evaluation metrics of video content description.
Disclosure of Invention
In order to solve the above-mentioned problem in the prior art, that is, to solve the problem that the accuracy of the generated description statement is low due to the fact that the existing video content description method only considers the video features and ignores the high-level semantic attribute information of the video, a first aspect of the present invention provides a video content description method based on a multi-modal attention mechanism, which includes:
step S100, acquiring a video frame sequence of a video to be described as an input sequence;
step S200, extracting multi-modal feature vectors of the input sequence, constructing a multi-modal feature vector sequence, and obtaining feature representations corresponding to the modal feature vector sequences through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence;
step S300, based on the feature representation corresponding to each modal feature vector sequence, respectively obtaining semantic attribute vectors corresponding to each feature representation through a semantic attribute detection network;
step S400, cascading the feature representations corresponding to the modal feature vector sequences to obtain initial coding vectors; obtaining a description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation;
wherein,
the semantic attribute detection network is constructed based on a multilayer perceptron and is trained based on training samples, and the training samples comprise feature representation samples and corresponding semantic attribute vector labels.
In some preferred embodiments, in step S200, "extracting the multi-modal feature vectors of the input sequence, and constructing a multi-modal feature vector sequence", the method includes:
extracting features from each RGB frame in the input sequence with a deep residual network to obtain the video frame feature vector sequence;
obtaining an optical flow sequence from the input sequence through the Lucas-Kanade algorithm, and extracting features from the optical flow sequence with a deep residual network to obtain the optical flow frame feature vector sequence;
dividing the input sequence evenly into T segments, and extracting the feature vectors of each segment with a three-dimensional convolutional deep neural network to obtain the video segment feature vector sequence.
In some preferred embodiments, the training method of the semantic attribute detection network is as follows:
acquiring a training data set, wherein the training data set comprises videos and corresponding description sentences;
extracting words describing sentences in the training data set, sequencing the words according to the occurrence frequency, and selecting the first K words as high-level semantic attribute vectors; acquiring a real semantic attribute vector label of the video according to whether the description statement contains the high-level semantic attribute vector;
acquiring feature representation corresponding to a multi-modal feature vector sequence of the video in the training data set;
and training the semantic attribute detection network based on the feature representation and the real semantic attribute vector labels.
In some preferred embodiments, the loss function loss_1 of the semantic attribute detection network during training is:

loss_1(W_encoder) = -(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} [ y_ik·log(s_ik) + (1 - y_ik)·log(1 - s_ik) ]

where N is the number of description sentences in the training dataset, K is the dimension of the predicted semantic attribute vector label output by the semantic attribute detection network, s_ik is the predicted semantic attribute vector label output by the semantic attribute detection network, y_ik is the real semantic attribute vector label, i and k are subscripts, α is the weight, and W_encoder is the set of all weight-matrix and bias-matrix parameters of the recurrent neural network and the semantic attribute detection network.
In some preferred embodiments, in step S400, "obtaining the sentence description of the video to be described through the LSTM network based on the attention mechanism based on the initial encoding vector and the semantic attribute vector corresponding to each feature representation" includes:
weighting the semantic attribute vectors corresponding to the feature representations through an attention mechanism to obtain multi-modal semantic attribute vectors;
and generating statement description of the video to be described through an LSTM network based on the initial coding vector and the multi-mode semantic attribute vector.
In some preferred embodiments, the attention-based LSTM network performs the calculation of the weight matrix in a training process by using a factorization method.
A second aspect of the present invention provides a video content description system based on a multi-modal attention mechanism, which comprises an acquisition module, a feature representation extraction module, a semantic attribute detection module and a video description generation module;
the acquisition module is configured to acquire a video frame sequence of a video to be described as an input sequence;
the extracted feature representation module is configured to extract multi-modal feature vectors of the input sequence, construct a multi-modal feature vector sequence, and obtain feature representations corresponding to the feature vector sequences of each modality through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence;
the semantic attribute detection module is configured to obtain semantic attribute vectors corresponding to the feature representations respectively through a semantic attribute detection network based on the feature representations corresponding to the modal feature vector sequences;
the video generation description module is configured to cascade feature representations corresponding to the modal feature vector sequences to obtain initial coding vectors; obtaining a description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation;
wherein,
the semantic attribute detection network is constructed based on a multilayer perceptron and is trained based on training samples, and the training samples comprise feature representation samples and corresponding semantic attribute vector labels.
In a third aspect of the present invention, a storage device is provided, in which a plurality of programs are stored, the programs being loaded and executed by a processor to implement the above-mentioned video content description method based on the multi-modal attention mechanism.
In a fourth aspect of the present invention, a processing apparatus is provided, which includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described method for video content description based on a multi-modal attention mechanism.
The invention has the beneficial effects that:
the invention integrates visual features and high-level semantic attributes, and improves the accuracy of generating the video description sentences. The invention starts from multi-modal information, obtains a video characteristic vector sequence by adopting a method of combining a video frame, a light stream frame and a video fragment, and simultaneously detects and generates a semantic attribute vector label of the video. In order to obtain more effective visual characteristics and semantic attributes, the auxiliary classification loss and LSTM network loss in the generation stage of the semantic attribute vector labels are optimized simultaneously, and the context relationship in sentences can be ensured. In the decoding stage, an attention mechanism algorithm combined with semantic attributes is provided, the semantic attribute vectors are fused into a traditional recurrent neural network weight matrix, and in each moment of generating a sentence word, the attention mechanism is adopted to pay attention to the specific semantic attributes, so that the accuracy of video content description is improved.
Drawings
Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings.
FIG. 1 is a flow chart of a method for describing video content based on a multi-modal attention mechanism according to an embodiment of the present invention;
FIG. 2 is a block diagram of a multi-modal attention mechanism based video content description system according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a training process of a video content description method based on a multi-modal attention mechanism according to an embodiment of the present invention;
fig. 4 is a schematic network structure diagram of a semantic attribute detection network according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings, and it is apparent that the described embodiments are some, but not all embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The present application will be described in further detail with reference to the following drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the relevant invention and not restrictive of the invention. It should be noted that, for convenience of description, only the portions related to the related invention are shown in the drawings.
It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The video content description method based on the multi-modal attention mechanism, as shown in fig. 1, includes the following steps:
step S100, acquiring a video frame sequence of a video to be described as an input sequence;
step S200, extracting multi-modal feature vectors of the input sequence, constructing a multi-modal feature vector sequence, and obtaining feature representations corresponding to the modal feature vector sequences through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence;
step S300, based on the feature representation corresponding to each modal feature vector sequence, respectively obtaining semantic attribute vectors corresponding to each feature representation through a semantic attribute detection network;
step S400, cascading the feature representations corresponding to the modal feature vector sequences to obtain initial coding vectors; obtaining a description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation;
wherein,
the semantic attribute detection network is constructed based on a multilayer perceptron and is trained based on training samples, and the training samples comprise feature representation samples and corresponding semantic attribute vector labels.
In order to more clearly explain the video content description method based on the multi-modal attention mechanism, the following describes the steps in an embodiment of the method in detail with reference to the accompanying drawings.
The programming language used to implement the method of the present invention is not limited; the method can be written in any language. In this embodiment, a working program of the video content description method based on the multi-modal attention mechanism was written in Python and run on a server with four Titan Xp GPUs, each with 12 GB of memory, to implement the method of the invention. The specific steps are as follows:
step S100, acquiring a video frame sequence of a video to be described as an input sequence.
In this embodiment, the video to be described may be a video captured in real time; for example, in intelligent surveillance and behavior analysis scenarios, the video captured by a camera in real time needs to be described, and in this case the video to be described is the real-time camera video. The video to be described may also be a video obtained from the network; for example, in a video content preview scenario, a video obtained from the network needs to be described in natural language so that the user can preview its content, and in this case the video to be described is the network video to be previewed. The video to be described may also be a locally stored video; for example, in a classified video storage scenario, videos need to be described and then classified and stored according to the description information, and in this case the video to be described is the locally stored video to be classified and stored. A video frame sequence is extracted from the acquired video to be described and used as input.
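As an illustrative sketch of step S100 (not part of the original disclosure), the frame sequence could be extracted with OpenCV as follows; the sampling stride and frame size are assumed values chosen for illustration:

import cv2


def extract_frame_sequence(video_path, stride=8, size=(224, 224)):
    """Read a video file and return a list of RGB frames sampled every `stride` frames."""
    cap = cv2.VideoCapture(video_path)
    frames, index = [], 0
    while True:
        ok, frame_bgr = cap.read()
        if not ok:
            break
        if index % stride == 0:
            frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
            frames.append(cv2.resize(frame_rgb, size))
        index += 1
    cap.release()
    return frames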
Step S200, extracting multi-modal feature vectors of the input sequence, constructing a multi-modal feature vector sequence, and obtaining feature representations corresponding to the modal feature vector sequences through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence.
In the present example, the multi-modal video features of the video to be described are extracted, namely video frames, optical flow frames and video clips. The specific steps are as follows:
step S201, using a pre-trained depth residual error network to extract the features of each frame of a video frame sequence of a video to be described, using the output of the ith layer of the network as the feature representation of the frame, and obtaining a video frame feature vector sequence
Figure BDA0002306852780000091
Inputting the video frame feature vector sequence into the recurrent neural network LSTM in sequence, and hiding the hidden state h at the last moment of the networktThe feature representation, denoted v, of the video frame feature vector as a videof
Step S202, an optical flow sequence of the video is generated from the video frame sequence of the video to be described with the Lucas-Kanade algorithm; features are extracted from each optical flow frame with a pre-trained deep residual network, and the output of the i-th layer of the network is taken as the feature representation of that frame, giving the optical flow frame feature vector sequence {v1^o, v2^o, ..., vn^o}. The optical flow frame feature vectors are fed into the recurrent neural network LSTM in order, and the hidden state h_t of the network at the last time step is taken as the feature representation of the optical flow frames, denoted v_o.
Step S203, the video frame sequence of the video to be described is divided evenly into T segments; features are extracted from each segment with a three-dimensional convolutional deep neural network, and the output of the i-th layer of the network is taken as the feature representation of the t-th segment of the video, giving the video segment feature vector sequence {v1^c, v2^c, ..., vT^c}. The video segment feature vectors are fed into the recurrent neural network LSTM in order, and the hidden state h_t of the network at the last time step is taken as the feature representation of the video segments, denoted v_c.
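The multi-modal feature extraction of step S200 could be sketched in Python/PyTorch as follows. This is an illustrative sketch rather than the patented implementation: ResNet-152 stands in for the deep residual network, torchvision's r3d_18 stands in for the three-dimensional convolutional network (Fig. 3 refers to a C3D feature), pretrained weights are omitted to keep the sketch self-contained, and feeding three-channel optical-flow frames to the 2D backbone is a simplifying assumption:

import torch
import torch.nn as nn
from torchvision import models


class ModalityEncoder(nn.Module):
    """Encode a sequence of per-frame (or per-segment) feature vectors with an LSTM
    and return the hidden state at the last time step as the video-level feature."""

    def __init__(self, feat_dim, hidden_dim=512):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)

    def forward(self, feats):                 # feats: (batch, seq_len, feat_dim)
        _, (h_n, _) = self.lstm(feats)
        return h_n[-1]                        # (batch, hidden_dim), i.e. v_f / v_o / v_c


# 2D backbone for RGB frames and optical-flow frames (classification layer removed).
cnn2d = models.resnet152()
cnn2d.fc = nn.Identity()

# 3D backbone for video segments (hypothetical stand-in for the 3D CNN).
cnn3d = models.video.r3d_18()
cnn3d.fc = nn.Identity()

frame_encoder = ModalityEncoder(feat_dim=2048)    # ResNet-152 feature size
flow_encoder = ModalityEncoder(feat_dim=2048)
clip_encoder = ModalityEncoder(feat_dim=512)      # r3d_18 feature size

# Example shapes: 30 sampled frames, 30 flow frames, T=10 clips of 16 frames each.
rgb = torch.randn(1, 30, 3, 224, 224)
flow = torch.randn(1, 30, 3, 224, 224)
clips = torch.randn(1, 10, 3, 16, 112, 112)

with torch.no_grad():
    frame_feats = cnn2d(rgb.flatten(0, 1)).view(1, 30, -1)
    flow_feats = cnn2d(flow.flatten(0, 1)).view(1, 30, -1)
    clip_feats = cnn3d(clips.flatten(0, 1)).view(1, 10, -1)

v_f = frame_encoder(frame_feats)   # feature representation of the video frames
v_o = flow_encoder(flow_feats)     # feature representation of the optical flow frames
v_c = clip_encoder(clip_feats)     # feature representation of the video segments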
In the process of extracting the multi-modal feature representations of the video in the above steps, as shown in Fig. 3, the input Video is divided into video frames (Frame), optical flow frames (Optical flow) and video clips (Video clip): the video frames yield the Frame Feature (the static feature), the video clips yield the C3D Feature (the 3D-CNN feature of the video clip), and the optical flow frames yield the Motion Feature (the dynamic feature). The other steps in Fig. 3 are described below.
And step S300, respectively obtaining semantic attribute vectors corresponding to the feature representations through a semantic attribute detection network based on the feature representations corresponding to the modal feature vector sequences.
In this embodiment, training of the semantic attribute detection network is introduced first, and then semantic attribute vectors corresponding to the feature representations obtained by the semantic attribute detection network are introduced.
The semantic attribute detection network is constructed based on a multilayer perceptron; its structure is shown in Fig. 4 and comprises an input layer, a hidden layer and an output layer. A video (input video) and its corresponding description sentence ("A Small child playing the guitar") are input, the multi-modal feature vector sequence (v_i1, v_i2, ..., v_in) is obtained through the recurrent neural network (LSTM), and the semantic attribute detection network outputs the semantic attribute vector s_i1, s_i2, ..., s_iK. The specific training process of the semantic attribute detection network is as follows:
step A301, a training data set is obtained, wherein the training data set comprises videos and corresponding description sentences.
Step A302, extracting words describing sentences in the training data set, sequencing the words according to the occurrence frequency, and selecting the first K words as high-level semantic attribute vectors; and acquiring a real semantic attribute vector label of the video according to whether the description statement contains the high-level semantic attribute vector.
The words of the description sentences in the training dataset are extracted and sorted by frequency of occurrence; after function words are removed, the K words with the highest occurrence probability are selected as the high-level semantic attribute values.
Suppose the training dataset has N sentences and y_i = [y_i1, y_i2, ..., y_il, ..., y_iK] is the true semantic attribute vector label of the i-th video, where y_il = 1 if the description sentence corresponding to video i contains the attribute word l, and y_il = 0 otherwise.
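Steps A301-A302 could be sketched as follows; the whitespace tokenization, the stop-word list and the value of K are illustrative assumptions rather than choices fixed by the invention:

from collections import Counter

STOP_WORDS = {"a", "an", "the", "is", "are", "of", "in", "on", "and", "to"}


def build_attribute_vocab(sentences, K=300):
    """Return the K most frequent non-stop words over all description sentences."""
    counts = Counter()
    for sent in sentences:
        counts.update(w for w in sent.lower().split() if w not in STOP_WORDS)
    return [w for w, _ in counts.most_common(K)]


def attribute_labels(sentence, vocab):
    """Binary label vector y_i: y_il = 1 iff attribute word l occurs in the sentence."""
    words = set(sentence.lower().split())
    return [1 if w in words else 0 for w in vocab]


vocab = build_attribute_vocab(["a small child is playing the guitar",
                               "a man is playing a guitar on stage"], K=5)
print(vocab)                                                     # top-K attribute words
print(attribute_labels("a small child is playing the guitar", vocab))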
Step a303 is to obtain feature representations corresponding to the multi-modal feature vector sequences of the video in the training dataset by the method in step S200.
Step A304, training the semantic attribute detection network based on the feature representation and the real semantic attribute vector labels.
Let v_i ∈ {v_f, v_o, v_c} denote the feature representation learned for the i-th video; the training sample is then {v_i, y_i}. In the invention, the semantic attribute detection network constructed from a multilayer perceptron learns a function f(·): R^m → R^K, i.e. a mapping from an m-dimensional space to a K-dimensional space, where R^m is the m-dimensional real space and R^K likewise, m is the dimension of the input feature representation, and K is the dimension of the output semantic attribute vector, equal to the number of extracted semantic attribute values (the dimension of the high-level semantic attribute vector). The multilayer perceptron outputs the vector s_i = [s_i1, ..., s_iK], the predicted semantic attribute vector label of the i-th video. The classification loss function loss_1 of the semantic attribute detection network is shown in equation (1):

loss_1(W_encoder) = -(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} [ y_ik·log(s_ik) + (1 - y_ik)·log(1 - s_ik) ]    (1)

where W_encoder denotes the set of all weight-matrix and bias-matrix parameters of the recurrent neural network and the semantic attribute detection network, α is the weight, s_i = α(f(v_i)) is the learned K-dimensional vector, α(·) denotes the sigmoid function, and f(·) is implemented by the multilayer perceptron.
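A minimal sketch of the semantic attribute detection network and the classification loss of equation (1), assuming a single hidden layer of illustrative size; PyTorch's binary cross-entropy corresponds to the per-attribute terms of loss_1:

import torch
import torch.nn as nn


class SemanticAttributeDetector(nn.Module):
    """Multilayer perceptron f(.): R^m -> R^K followed by a sigmoid, giving s_i."""

    def __init__(self, m, K, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(m, hidden), nn.ReLU(), nn.Linear(hidden, K))

    def forward(self, v):                     # v: (batch, m) feature representation
        return torch.sigmoid(self.mlp(v))     # s: (batch, K) predicted attribute probabilities


detector = SemanticAttributeDetector(m=512, K=300)
v_i = torch.randn(8, 512)                     # feature representations of 8 videos
y_i = torch.randint(0, 2, (8, 300)).float()   # ground-truth attribute labels

s_i = detector(v_i)
loss1 = nn.functional.binary_cross_entropy(s_i, y_i)   # mean over N and K, as in equation (1)
loss1.backward()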
After training of the semantic attribute detection network is completed, in an actual application process, semantic attribute vectors corresponding to the feature representations are obtained through the semantic attribute detection network respectively based on the feature representations corresponding to the modal feature vector sequences. Such as the Multimodal Semantic Detector module in fig. 3.
Step S400, cascading the feature representations corresponding to the modal feature vector sequences to obtain initial coding vectors; and obtaining the description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation.
In this embodiment, {v_f, v_o, v_c} are concatenated to form the initial encoding vector v, as in the concatenation step in Fig. 3, and the Attention Fusion step is based on the attention mechanism module.
The following description first introduces the training process of the attention-based LSTM network, and then introduces the method for obtaining the descriptive statement of the video to be described through the attention-based LSTM network.
The attention-based LSTM network is trained with the input description sentence "A Small child playing the guitar"; the specific training process is as follows:
when the output is a sentence, the LSTM network is used as a decoder, where the long-term dependence of the sentence can be captured. Suppose that the word input at the present time is wtThe hidden state at a moment on the LSTM network is ht-1The last memory state of the cell is ct-1Then, the update rule of the LSTM network at time t is as shown in equations (2) (3) (4) (5) (6) (7) (8):
i_t = σ(W_i·w_t + U_hi·h_{t-1} + z)    (2)
f_t = σ(W_f·w_t + U_hf·h_{t-1} + z)    (3)
o_t = σ(W_o·w_t + U_ho·h_{t-1} + z)    (4)
c̃_t = tanh(W_c·w_t + U_hc·h_{t-1} + z)    (5)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t    (6)
h_t = o_t ⊙ tanh(c_t)    (7)
z = 1(t=1)·C·v    (8)
In the above formulas the subscript * ∈ {i, f, o, c}; W_*, U_h* and C are all weight matrices, and i_t, f_t, o_t, c_t, c̃_t denote the states of the input gate, forget gate, output gate, memory cell and compressed input at time t, respectively. tanh(·) denotes the hyperbolic tangent function, 1(t=1) denotes the indicator function, the initial encoding vector v is input at the initial time step of the LSTM, and z indicates that the video vector is input only at the initial time step t=1. For simplicity, the bias terms in the above formulas are omitted.
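One decoder step of equations (2)-(8) could be sketched as follows; the dimensions are illustrative, bias terms are omitted as in the text, and the gate weight matrices here are hypothetical placeholders:

import torch
import torch.nn as nn

n_x, n_h, n_v = 300, 512, 1536                  # word, hidden and video-vector sizes (illustrative)
W = nn.ModuleDict({g: nn.Linear(n_x, n_h, bias=False) for g in "ifoc"})
U = nn.ModuleDict({g: nn.Linear(n_h, n_h, bias=False) for g in "ifoc"})
C = nn.Linear(n_v, n_h, bias=False)             # maps the initial encoding vector v into z


def lstm_step(w_t, h_prev, c_prev, v, t):
    z = C(v) if t == 1 else torch.zeros_like(h_prev)         # eq. (8)
    i = torch.sigmoid(W["i"](w_t) + U["i"](h_prev) + z)       # eq. (2)
    f = torch.sigmoid(W["f"](w_t) + U["f"](h_prev) + z)       # eq. (3)
    o = torch.sigmoid(W["o"](w_t) + U["o"](h_prev) + z)       # eq. (4)
    c_tilde = torch.tanh(W["c"](w_t) + U["c"](h_prev) + z)    # eq. (5)
    c = f * c_prev + i * c_tilde                              # eq. (6)
    h = o * torch.tanh(c)                                     # eq. (7)
    return h, c


h, c = torch.zeros(1, n_h), torch.zeros(1, n_h)
h, c = lstm_step(torch.randn(1, n_x), h, c, torch.randn(1, n_v), t=1)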
To better exploit the side information from the semantic attributes of the multiple modalities, an attention mechanism is proposed to compute the weight matrices W_* and U_h* in combination with the semantic attributes: each weight matrix of the traditional LSTM is expanded into a set of K attribute-related weight matrices for mining the meaning of individual words, so as to generate the final description sentence. The initial weight matrices W_* / U_h* are replaced by W_*(S_t) / U_h*(S_t), where S_t ∈ R^K is the multi-modal semantic attribute vector and changes dynamically over time. Specifically, two weight tensors are defined, W_τ ∈ R^{K×n_h×n_x} and U_τ ∈ R^{K×n_h×n_h}, where n_h is the number of hidden units and n_x is the dimension of the word embedding vector. W_*(S_t) / U_h*(S_t) are given by equations (9) and (10):

W_*(S_t) = Σ_{k=1}^{K} S_t[k]·W_τ[k]    (9)
U_h*(S_t) = Σ_{k=1}^{K} S_t[k]·U_τ[k]    (10)

where W_τ[k] and U_τ[k] are the k-th 2D slices of the weight tensors W_τ and U_τ, each associated with the probability value S_t[k], and S_t[k] is the k-th element of the multi-modal semantic attribute vector S_t. Since W_τ is a three-dimensional tensor, a 2D slice refers to a two-dimensional matrix of W_τ.
The calculation of S_t is given by equations (11), (12) and (13):

S_t = Σ_{i=1}^{l} a_ti·s_i    (11)
a_ti = exp(e_ti) / Σ_{j=1}^{l} exp(e_tj)    (12)
e_ti = w^T·tanh(W_a·h_{t-1} + U_a·s_i)    (13)

where l = 3 denotes the three learned semantic attribute vectors {s_f, s_o, s_c}, and the attention weight a_ti reflects the importance of the i-th semantic attribute of the video at the current time step. It can be seen that the semantic attribute vector S_t differs for different time steps t, so that the model selectively focuses on different semantic attribute parts of the video each time a word is generated; j denotes a subscript, e_ti denotes the unnormalized attention weight, and w^T denotes a transformation matrix.
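Equations (11)-(13) could be sketched as follows; the layer sizes are illustrative assumptions:

import torch
import torch.nn as nn

K, n_h, n_a = 300, 512, 256
W_a = nn.Linear(n_h, n_a, bias=False)
U_a = nn.Linear(K, n_a, bias=False)
w = nn.Linear(n_a, 1, bias=False)               # plays the role of w^T in eq. (13)


def multimodal_attribute(h_prev, attrs):
    """attrs: (l=3, K) stacked semantic attribute vectors; returns S_t of shape (K,)."""
    e = torch.stack([w(torch.tanh(W_a(h_prev) + U_a(s))).squeeze() for s in attrs])  # eq. (13)
    a = torch.softmax(e, dim=0)                                                      # eq. (12)
    return (a.unsqueeze(1) * attrs).sum(dim=0)                                       # eq. (11)


s_f, s_o, s_c = torch.rand(K), torch.rand(K), torch.rand(K)
S_t = multimodal_attribute(torch.randn(n_h), torch.stack([s_f, s_o, s_c]))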
The training of the attention-based LSTM network is equivalent to jointly training K LSTMs, so the number of network parameters is proportional to the value of K; when K is large, the network can hardly be trained. The following factorization is therefore adopted, as shown in equations (14) and (15):

W_*(S_t) = W_a·diag(W_b·S_t)·W_c    (14)
U_h*(S_t) = U_a·diag(U_b·S_t)·U_c    (15)
where W_a ∈ R^{n_h×n_f}, W_b ∈ R^{n_f×K}, W_c ∈ R^{n_f×n_x}, U_a ∈ R^{n_h×n_f}, U_b ∈ R^{n_f×K}, U_c ∈ R^{n_f×n_h}, and n_f denotes the relevant hyper-parameter of the factorization.
The factorization greatly reduces the number of network parameters and thus solves the problem that the parameter count of the original network is proportional to K, as the following analysis shows. In equations (9) and (10), the total parameter count is K·n_h·(n_x+n_h), which is proportional to K. In equations (14) and (15), W_*(S_t) has n_f·(n_h+K+n_x) parameters and U_h*(S_t) has n_f·(2n_h+K) parameters, for a total of n_f·(3n_h+2K+n_x). When n_f = n_h and K is large, n_f·(3n_h+2K+n_x) is much smaller than K·n_h·(n_x+n_h).
Substituting the factorized equations (14) and (15) into the LSTM update rules yields equations (16), (17) and (18):

x̂_t = (W_b·S_t) ⊙ (W_c·w_t)    (16)
ĥ_{t-1} = (U_b·S_t) ⊙ (U_c·h_{t-1})    (17)
i_t = σ(W_a·x̂_t + U_a·ĥ_{t-1} + z)    (18)

where ⊙ denotes the element-wise multiplication operator. For every element value of S_t, the parameter matrices W_a and U_a are shared, which effectively captures the common language patterns in videos, while the diagonal matrices diag(W_b·S_t) and diag(U_b·S_t) attend to the specific semantic attribute parts of different videos; x̂_t denotes the input merged with the semantic attribute vector and ĥ_{t-1} denotes the hidden state merged with the semantic attribute vector. In the same way, it can be shown that the expressions for f_t, o_t and c_t are analogous to the above formulas.
From the above formulas it can be seen that, after the network is fully trained, it can not only effectively capture the common language pattern parts in videos but also focus on the specific semantic attribute parts of each video; meanwhile, because factorization is adopted, the number of network parameters is greatly reduced, solving the problem that the parameter count of the original network is proportional to the value of K.
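The factorization of equations (14)-(18) and the parameter comparison above could be sketched as follows; all sizes are illustrative:

import torch

K, n_h, n_x = 300, 512, 300
n_f = n_h                                              # n_f = n_h, as suggested in the text

W_a, W_b, W_c = torch.randn(n_h, n_f), torch.randn(n_f, K), torch.randn(n_f, n_x)
U_a, U_b, U_c = torch.randn(n_h, n_f), torch.randn(n_f, K), torch.randn(n_f, n_h)


def factorized_input_gate(w_t, h_prev, S_t, z):
    x_hat = (W_b @ S_t) * (W_c @ w_t)                     # eq. (16): input merged with S_t
    h_hat = (U_b @ S_t) * (U_c @ h_prev)                  # eq. (17): hidden state merged with S_t
    return torch.sigmoid(W_a @ x_hat + U_a @ h_hat + z)   # eq. (18); f_t, o_t, c_t are analogous


gate = factorized_input_gate(torch.randn(n_x), torch.randn(n_h),
                             torch.rand(K), torch.zeros(n_h))

# Parameter comparison from the text: K*n_h*(n_x + n_h) vs n_f*(3*n_h + 2*K + n_x).
print("unfactorized:", K * n_h * (n_x + n_h))          # ~1.25e8 per gate
print("factorized:  ", n_f * (3 * n_h + 2 * K + n_x))  # ~1.25e6 per gate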
In the process of generating the description sentence of the video, greedy search is adopted, and the word output at each time step is given by equation (19):

w_t = softmax(W·h_t)    (19)

where W is the transformation matrix.
Accordingly, the loss function loss_2 for the sentences generated by the network is designed as shown in equation (20):

loss_2 = -log P(Y | v, s_f, s_c, s_o) = -Σ_t log P(w_t | w_0~t-1)    (20)

where Y = {w_1, w_2, ..., w_N} denotes a sentence of N words and w_0~t-1 denotes the words generated before time t.
The classification loss loss_1 for generating the high-level semantic attributes and the loss loss_2 for generating the description sentence are added together and optimized simultaneously, which ensures the contextual relationships within the sentence. Based on the resulting loss value, the network is trained with the back-propagation algorithm; this corresponds to the Classification Loss module and the Captioning Loss module in Fig. 3, which are added to give the Total Loss module.
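The captioning loss of equations (19)-(20) and the joint objective could be sketched as follows; the vocabulary size and dimensions are illustrative, and loss_1 is represented by a placeholder scalar standing in for the attribute classification loss of equation (1):

import torch
import torch.nn as nn

vocab_size, n_h = 10000, 512
W_out = nn.Linear(n_h, vocab_size)                 # the transformation matrix W in eq. (19)

hidden_states = torch.randn(12, n_h)               # decoder hidden states h_t for a 12-word sentence
targets = torch.randint(0, vocab_size, (12,))      # ground-truth word indices

logits = W_out(hidden_states)                      # softmax(W·h_t) is applied inside the loss
loss2 = nn.functional.cross_entropy(logits, targets)   # -sum log P(w_t | w_0~t-1), averaged

loss1 = torch.tensor(0.7, requires_grad=True)      # placeholder for the classification loss of eq. (1)
total_loss = loss1 + loss2                         # Total Loss in Fig. 3
total_loss.backward()                              # back-propagation through the whole network

# At test time, greedy search picks w_t = argmax softmax(W·h_t) at each step.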
And after the LSTM network based on the attention mechanism is trained, obtaining a description statement of the video to be described through the LSTM network based on the attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation.
A second embodiment of the present invention is a video content description system based on a multi-modal attention mechanism, as shown in fig. 2, including: the system comprises an acquisition module 100, an extracted feature representation module 200, a semantic attribute detection module 300 and a video description generation module 400;
the acquiring module 100 is configured to acquire a video frame sequence of a video to be described as an input sequence;
the extracted feature representation module 200 is configured to extract the multi-modal feature vectors of the input sequence, construct a multi-modal feature vector sequence, and obtain feature representations corresponding to the feature vector sequences of each modality through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence;
the semantic attribute detection module 300 is configured to obtain semantic attribute vectors corresponding to the feature representations respectively through a semantic attribute detection network based on the feature representations corresponding to the feature vector sequences of the respective modalities;
the generated video description module 400 is configured to cascade feature representations corresponding to the feature vector sequences of each modality to obtain an initial coding vector; obtaining a description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation;
wherein,
the semantic attribute detection network is constructed based on a multilayer perceptron and is trained based on training samples, and the training samples comprise feature representation samples and corresponding semantic attribute vector labels.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process and related description of the system described above may refer to the corresponding process in the foregoing method embodiment, and details are not described herein again.
It should be noted that, the video content description system based on the multi-modal attention mechanism provided in the foregoing embodiment is only illustrated by the division of the above functional modules, and in practical applications, the functions may be allocated to different functional modules according to needs, that is, the modules or steps in the embodiment of the present invention are further decomposed or combined, for example, the modules in the foregoing embodiment may be combined into one module, or may be further split into multiple sub-modules, so as to complete all or part of the functions described above. The names of the modules and steps involved in the embodiments of the present invention are only for distinguishing the modules or steps, and are not to be construed as unduly limiting the present invention.
A storage device according to a third embodiment of the present invention stores therein a plurality of programs, which are adapted to be loaded by a processor and to implement the above-described video content description method based on the multi-modal attention mechanism.
A processing apparatus according to a fourth embodiment of the present invention includes a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; the program is adapted to be loaded and executed by a processor to implement the above-described method for video content description based on a multi-modal attention mechanism.
It can be clearly understood by those skilled in the art that, for convenience and brevity, the specific working processes and related descriptions of the storage device and the processing device described above may refer to the corresponding processes in the foregoing method examples, and are not described herein again.
Those of skill in the art would appreciate that the various illustrative modules and method steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both, and that programs corresponding to the software modules and method steps may be located in random access memory (RAM), memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. To clearly illustrate this interchangeability of electronic hardware and software, various illustrative components and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as electronic hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
The terms "first," "second," and the like are used for distinguishing between similar elements and not necessarily for describing or implying a particular order or sequence.
The terms "comprises," "comprising," or any other similar term are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
So far, the technical solutions of the present invention have been described in connection with the preferred embodiments shown in the drawings, but it is easily understood by those skilled in the art that the scope of the present invention is obviously not limited to these specific embodiments. Equivalent changes or substitutions of related technical features can be made by those skilled in the art without departing from the principle of the invention, and the technical scheme after the changes or substitutions can fall into the protection scope of the invention.

Claims (9)

1. A method for describing video content based on a multi-modal attention mechanism, the method comprising:
step S100, acquiring a video frame sequence of a video to be described as an input sequence;
step S200, extracting multi-modal feature vectors of the input sequence, constructing a multi-modal feature vector sequence, and obtaining feature representations corresponding to the modal feature vector sequences through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence;
step S300, based on the feature representation corresponding to each modal feature vector sequence, respectively obtaining semantic attribute vectors corresponding to each feature representation through a semantic attribute detection network;
step S400, cascading the feature representations corresponding to the modal feature vector sequences to obtain initial coding vectors; obtaining a description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation;
wherein,
the semantic attribute detection network is constructed based on a multilayer perceptron and is trained based on training samples, and the training samples comprise feature representation samples and corresponding semantic attribute vector labels.
2. The method for describing video content based on multi-modal attention mechanism according to claim 1, wherein in step S200, "extracting multi-modal feature vectors of the input sequence to construct a multi-modal feature vector sequence" comprises:
extracting features from each RGB frame in the input sequence with a deep residual network to obtain the video frame feature vector sequence;
obtaining an optical flow sequence from the input sequence through the Lucas-Kanade algorithm, and extracting features from the optical flow sequence with a deep residual network to obtain the optical flow frame feature vector sequence;
dividing the input sequence evenly into T segments, and extracting the feature vectors of each segment with a three-dimensional convolutional deep neural network to obtain the video segment feature vector sequence.
3. The method for describing video content based on multi-modal attention mechanism according to claim 1, wherein the semantic attribute detection network is trained by:
acquiring a training data set, wherein the training data set comprises videos and corresponding description sentences;
extracting words describing sentences in the training data set, sequencing the words according to the occurrence frequency, and selecting the first K words as high-level semantic attribute vectors; acquiring a real semantic attribute vector label of the video according to whether the description statement contains the high-level semantic attribute vector;
acquiring feature representation corresponding to a multi-modal feature vector sequence of the video in the training data set;
and training the semantic attribute detection network based on the feature representation and the real semantic attribute vector labels.
4. The method according to claim 3, wherein the loss function loss_1 of the semantic attribute detection network during training is:

loss_1(W_encoder) = -(1/N) Σ_{i=1}^{N} Σ_{k=1}^{K} [ y_ik·log(s_ik) + (1 - y_ik)·log(1 - s_ik) ]

where N is the number of description sentences in the training dataset, K is the dimension of the predicted semantic attribute vector label output by the semantic attribute detection network, s_ik is the predicted semantic attribute vector label output by the semantic attribute detection network, y_ik is the real semantic attribute vector label, i and k are subscripts, α is the weight, and W_encoder is the set of all weight-matrix and bias-matrix parameters of the recurrent neural network and the semantic attribute detection network.
5. The method according to claim 1, wherein in step S400, "obtaining the description sentence of the video to be described through an attention-based LSTM network based on the initial coding vector and the semantic attribute vector corresponding to each feature representation" includes:
weighting the semantic attribute vectors corresponding to the feature representations through an attention mechanism to obtain multi-modal semantic attribute vectors;
and generating statement description of the video to be described through an LSTM network based on the initial coding vector and the multi-mode semantic attribute vector.
6. The method for video content description based on multi-modal attention mechanism according to claim 1, wherein the LSTM network based on attention mechanism adopts a factorization method for weight matrix calculation during training.
7. A video content description system based on a multi-mode attention mechanism is characterized by comprising an acquisition module, an extracted feature representation module, a semantic attribute detection module and a generated video description module;
the acquisition module is configured to acquire a video frame sequence of a video to be described as an input sequence;
the extracted feature representation module is configured to extract multi-modal feature vectors of the input sequence, construct a multi-modal feature vector sequence, and obtain feature representations corresponding to the feature vector sequences of each modality through a recurrent neural network; the multi-modal feature vector sequence comprises a video frame feature vector sequence, an optical flow frame feature vector sequence and a video segment feature vector sequence;
the semantic attribute detection module is configured to obtain semantic attribute vectors corresponding to the feature representations respectively through a semantic attribute detection network based on the feature representations corresponding to the modal feature vector sequences;
the video generation description module is configured to cascade feature representations corresponding to the modal feature vector sequences to obtain initial coding vectors; obtaining a description statement of the video to be described through an LSTM network based on an attention mechanism based on the initial coding vector and the semantic attribute vector corresponding to each feature representation;
wherein,
the semantic attribute detection network is constructed based on a multilayer perceptron and is trained based on training samples, and the training samples comprise feature representation samples and corresponding semantic attribute vector labels.
8. A storage device having a plurality of programs stored therein, wherein the program applications are loaded and executed by a processor to implement the method for multi-modal attention mechanism based video content description according to any one of claims 1-6.
9. A processing device comprising a processor, a storage device; a processor adapted to execute various programs; a storage device adapted to store a plurality of programs; characterized in that the program is adapted to be loaded and executed by a processor to implement the method for multi-modal attention mechanism based video content description according to any of claims 1-6.
CN201911243331.7A 2019-12-06 2019-12-06 Video content description method, system and device based on multi-mode attention mechanism Pending CN111079601A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911243331.7A CN111079601A (en) 2019-12-06 2019-12-06 Video content description method, system and device based on multi-mode attention mechanism

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911243331.7A CN111079601A (en) 2019-12-06 2019-12-06 Video content description method, system and device based on multi-mode attention mechanism

Publications (1)

Publication Number Publication Date
CN111079601A true CN111079601A (en) 2020-04-28

Family

ID=70313089

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911243331.7A Pending CN111079601A (en) 2019-12-06 2019-12-06 Video content description method, system and device based on multi-mode attention mechanism

Country Status (1)

Country Link
CN (1) CN111079601A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107038221A (en) * 2017-03-22 2017-08-11 杭州电子科技大学 A kind of video content description method guided based on semantic information
CN110110145A (en) * 2018-01-29 2019-08-09 腾讯科技(深圳)有限公司 Document creation method and device are described
CN109344288A (en) * 2018-09-19 2019-02-15 电子科技大学 A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism
CN110333774A (en) * 2019-03-20 2019-10-15 中国科学院自动化研究所 A kind of remote user's attention appraisal procedure and system based on multi-modal interaction

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LIANG SUN ET AL.: "Multimodal Semantic Attention Network for Video Captioning", https://arxiv.org/abs/1905.02963v1 *
DAI GUOQIANG ET AL.: "Science and Technology Big Data" (《科技大数据》), 31 August 2018, Scientific and Technical Documentation Press *

Cited By (36)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723649A (en) * 2020-05-08 2020-09-29 天津大学 Short video event detection method based on semantic decomposition
CN111783709B (en) * 2020-07-09 2022-09-06 中国科学技术大学 Information prediction method and device for education video
CN111783709A (en) * 2020-07-09 2020-10-16 中国科学技术大学 Information prediction method and device for education video
CN114268846A (en) * 2020-09-16 2022-04-01 镇江多游网络科技有限公司 Video description generation model based on attention mechanism
CN112801017A (en) * 2021-02-09 2021-05-14 成都视海芯图微电子有限公司 Visual scene description method and system
CN112801017B (en) * 2021-02-09 2023-08-04 成都视海芯图微电子有限公司 Visual scene description method and system
CN113191263A (en) * 2021-04-29 2021-07-30 桂林电子科技大学 Video description method and device
CN113673535A (en) * 2021-05-24 2021-11-19 重庆师范大学 Image description generation method of multi-modal feature fusion network
CN113673535B (en) * 2021-05-24 2023-01-10 重庆师范大学 Image description generation method of multi-modal feature fusion network
CN113269253B (en) * 2021-05-26 2023-08-22 大连民族大学 Visual feature fusion semantic detection method and system in video description
CN113269253A (en) * 2021-05-26 2021-08-17 大连民族大学 Method and system for detecting fusion semantics of visual features in video description
CN113269093B (en) * 2021-05-26 2023-08-22 大连民族大学 Visual feature segmentation semantic detection method and system in video description
CN113269093A (en) * 2021-05-26 2021-08-17 大连民族大学 Method and system for detecting visual characteristic segmentation semantics in video description
CN113312923A (en) * 2021-06-18 2021-08-27 广东工业大学 Method for generating text explanation of ball game
CN113641854A (en) * 2021-07-28 2021-11-12 上海影谱科技有限公司 Method and system for converting characters into video
CN113641854B (en) * 2021-07-28 2023-09-26 上海影谱科技有限公司 Method and system for converting text into video
CN113553445B (en) * 2021-07-28 2022-03-29 北京理工大学 Method for generating video description
CN113553445A (en) * 2021-07-28 2021-10-26 北京理工大学 Method for generating video description
CN113705402A (en) * 2021-08-18 2021-11-26 中国科学院自动化研究所 Video behavior prediction method, system, electronic device and storage medium
CN113792183B (en) * 2021-09-17 2023-09-08 咪咕数字传媒有限公司 Text generation method and device and computing equipment
CN113792183A (en) * 2021-09-17 2021-12-14 咪咕数字传媒有限公司 Text generation method and device and computing equipment
WO2023050295A1 (en) * 2021-09-30 2023-04-06 中远海运科技股份有限公司 Multimodal heterogeneous feature fusion-based compact video event description method
CN114627413B (en) * 2022-03-11 2022-09-13 电子科技大学 Video intensive event content understanding method
CN114339450B (en) * 2022-03-11 2022-07-15 中国科学技术大学 Video comment generation method, system, device and storage medium
CN114627413A (en) * 2022-03-11 2022-06-14 电子科技大学 Video intensive event content understanding method
CN114339450A (en) * 2022-03-11 2022-04-12 中国科学技术大学 Video comment generation method, system, device and storage medium
CN115311595A (en) * 2022-06-30 2022-11-08 中国科学院自动化研究所 Video feature extraction method and device and electronic equipment
CN115311595B (en) * 2022-06-30 2023-11-03 中国科学院自动化研究所 Video feature extraction method and device and electronic equipment
CN115359383A (en) * 2022-07-07 2022-11-18 北京百度网讯科技有限公司 Cross-modal feature extraction, retrieval and model training method, device and medium
CN116743609B (en) * 2023-08-14 2023-10-17 清华大学 QoE evaluation method and device for video streaming media based on semantic communication
CN116743609A (en) * 2023-08-14 2023-09-12 清华大学 QoE evaluation method and device for video streaming media based on semantic communication
CN118135452A (en) * 2024-02-02 2024-06-04 广州像素数据技术股份有限公司 Physical and chemical experiment video description method and related equipment based on large-scale video-language model
CN117789099A (en) * 2024-02-26 2024-03-29 北京搜狐新媒体信息技术有限公司 Video feature extraction method and device, storage medium and electronic equipment
CN117789099B (en) * 2024-02-26 2024-05-28 北京搜狐新媒体信息技术有限公司 Video feature extraction method and device, storage medium and electronic equipment
CN118132803A (en) * 2024-05-10 2024-06-04 成都考拉悠然科技有限公司 Zero sample video moment retrieval method, system, equipment and medium
CN118132803B (en) * 2024-05-10 2024-08-13 成都考拉悠然科技有限公司 Zero sample video moment retrieval method, system, equipment and medium

Similar Documents

Publication Publication Date Title
CN111079601A (en) Video content description method, system and device based on multi-mode attention mechanism
US11657230B2 (en) Referring image segmentation
CN110532996B (en) Video classification method, information processing method and server
CN108804530B (en) Subtitling areas of an image
CN111488807B (en) Video description generation system based on graph rolling network
CN111741330B (en) Video content evaluation method and device, storage medium and computer equipment
CN111507378A (en) Method and apparatus for training image processing model
CN109947912A (en) A kind of model method based on paragraph internal reasoning and combined problem answer matches
CN111079532A (en) Video content description method based on text self-encoder
CN111294646A (en) Video processing method, device, equipment and storage medium
CN111860235A (en) Method and system for generating high-low-level feature fused attention remote sensing image description
CN109919221B (en) Image description method based on bidirectional double-attention machine
CN108563624A (en) A kind of spatial term method based on deep learning
CN108536784B (en) Comment information sentiment analysis method and device, computer storage medium and server
CN109543112A (en) A kind of sequence of recommendation method and device based on cyclic convolution neural network
CN114339450B (en) Video comment generation method, system, device and storage medium
CN116524593A (en) Dynamic gesture recognition method, system, equipment and medium
CN115311598A (en) Video description generation system based on relation perception
CN115130591A (en) Cross supervision-based multi-mode data classification method and device
CN112668608A (en) Image identification method and device, electronic equipment and storage medium
CN112529149A (en) Data processing method and related device
CN117541668A (en) Virtual character generation method, device, equipment and storage medium
Taylor Composable, distributed-state models for high-dimensional time series
CN111161238A (en) Image quality evaluation method and device, electronic device, and storage medium
CN112115744A (en) Point cloud data processing method and device, computer storage medium and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication Application publication date: 20200428
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication