CN113099228B - Video encoding and decoding method and system - Google Patents
- Publication number
- CN113099228B CN113099228B CN202110483437.5A CN202110483437A CN113099228B CN 113099228 B CN113099228 B CN 113099228B CN 202110483437 A CN202110483437 A CN 202110483437A CN 113099228 B CN113099228 B CN 113099228B
- Authority
- CN
- China
- Prior art keywords
- features
- video
- fusion
- feature
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 25
- 230000004927 fusion Effects 0.000 claims abstract description 53
- 230000006399 behavior Effects 0.000 claims abstract description 31
- 230000006870 function Effects 0.000 claims abstract description 11
- 230000006403 short-term memory Effects 0.000 claims abstract description 6
- 238000000605 extraction Methods 0.000 claims description 18
- 238000013527 convolutional neural network Methods 0.000 claims description 14
- 238000009826 distribution Methods 0.000 claims description 9
- 239000011159 matrix material Substances 0.000 claims description 8
- 230000003542 behavioural effect Effects 0.000 claims description 4
- 230000007787 long-term memory Effects 0.000 claims description 3
- 238000012545 processing Methods 0.000 claims description 3
- 230000015654 memory Effects 0.000 claims description 2
- 230000007246 mechanism Effects 0.000 abstract description 7
- 230000003068 static effect Effects 0.000 abstract description 6
- 230000001737 promoting effect Effects 0.000 abstract 1
- 238000011161 development Methods 0.000 description 5
- 230000008569 process Effects 0.000 description 5
- 230000001364 causal effect Effects 0.000 description 4
- 238000001914 filtration Methods 0.000 description 3
- 230000002123 temporal effect Effects 0.000 description 3
- 230000000007 visual effect Effects 0.000 description 3
- 230000008901 benefit Effects 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 2
- 239000000284 extract Substances 0.000 description 2
- 238000012549 training Methods 0.000 description 2
- 238000013473 artificial intelligence Methods 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 230000007613 environmental effect Effects 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 230000007774 longterm Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
- 238000000844 transformation Methods 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
- H04N19/172—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/42—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/44—Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
The invention discloses a video coding and decoding method and system. First, the 2D features and the processed 3D features are superposed in time sequence to realize a deep fusion of static and dynamic information. Then, an attention mechanism is introduced to encode the fusion features at each time t; normalized weights are obtained through a softmax function and assigned to the fusion features to obtain new fusion features, so that human-centred features are learned, which benefits the final language description of human behavior. Finally, the new fusion features are input into a long short-term memory (LSTM) network and decoded over time to obtain the video description sentence. The video descriptions obtained by the method are more logical and fluent, with consistent and clear semantics.
Description
Technical Field
The invention relates to the field of machine learning, in particular to a video coding and decoding method and a video coding and decoding system.
Background
Currently, deep learning algorithms in artificial intelligence are capable of performing video description, readily converting video information into language content. For example, by forming a precise text abstract of massive video information before a user views it, the user can quickly grasp how an event developed and what influence it had, saving a great deal of time. Likewise, extracting the highlights of a two-hour movie and converting them into a text outline summarizing the film brings the user a better recommendation experience. However, such indiscriminate description of video information does not embody the imagination, curiosity and wisdom with which humans understand things. Although text information can be extracted from a large amount of video, the high-value knowledge available to people is very limited. An excellent machine understanding algorithm should therefore describe the events taking place in a human way of thinking and understand the development laws of things from a human first-person perspective, so that machines can understand video at a more intelligent level.
In general, the events occurring in a video are closely connected and causally related, and they are the source material for understanding tasks. The transition from the end of one event to a new event is mostly driven by human behavior. Human behavior, it may be said, dominates the development of events and the causes and effects between them, so it is necessary to follow human behavior when exploring the development laws of events and strengthening the understanding of their causal relationships. Traditional video understanding methods find it difficult to fully consider the temporal relevance of human behavior across the frames of a video and the causal relationships between events. The extracted global temporal features contain a large number of redundant frame features, which consumes enormous computing power and makes the model converge too slowly during training; as a result, the development laws of things cannot be well understood from a human perspective with behavior as the clue, and the machine cannot understand video more intelligently.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a video coding and decoding method and system that improve the logicality and accuracy of video understanding tasks.
In order to solve the technical problems, the invention adopts the following technical scheme: a video encoding and decoding method comprising the steps of:
s1, respectively extracting 3D features and 2D features of a video frame sequence;
s2, processing the 3D features to obtain key features; overlapping the key features and the 2D features according to time sequence to construct fusion features;
s3, encoding the fusion feature at a time t, obtaining a normalized weight through a softmax function, and multiplying the fusion feature by the normalized weight to obtain a new fusion feature;
s4, inputting the new fusion features into a long short-term memory (LSTM) network to obtain a description sentence about the video frame sequence.
In order to construct a strong feature representation of video, the invention considers not only static image information but also dynamic information with time as a clue. The invention therefore proposes a hybrid 2D/3D convolutional network. The 2D convolutional network and the 3D convolutional network extract the 2D and 3D features of the video frames, which represent static and dynamic video information respectively. The 2D features cover single-frame information such as the environment, objects and human behavior; the 3D features not only compensate for the lack of context information when decoding single-frame features but also form event feature representations over long time intervals, which both encode the temporal relationships of events in the visual features and enhance the logical coherence of the final descriptive sentences. The 2D features and the processed 3D features are superposed in time sequence to realize a deep fusion of static and dynamic information. At time t, the fusion features are encoded and normalized weights are obtained through a softmax function; assigning different weights to the fusion features allows human-centred features to be learned, which benefits the final language description of human behavior. An LSTM can learn long-term dependencies and is well suited to strongly time-series-dependent problems, so the invention uses an LSTM network to decode the features carrying human behavior information and then render them as text.
In step S1, 3D features of the video frame sequence are extracted using a 3D convolutional neural network. The 3D convolution is more suitable for learning the space-time characteristics than the 2D convolution, and the 3D convolution neural network can capture the time relation between video frames.
In step S1, 2D features of the video frame sequence are extracted using a 2D convolutional neural network. The 2D convolutional neural network can extract feature information such as environment, objects, and human behavior, which helps fully mine behavior features in the video.
The 2D convolutional neural network adopts an ImageNet-pre-trained Inception v3 network as the backbone. The asymmetric convolution structure introduced in Inception v3 processes more and richer spatial features and increases feature diversity compared with a symmetric convolution structure, while also reducing the amount of computation.
The video frame sequence acquisition method comprises the following steps: a fixed number of frames are sampled from the whole video, constituting the sequence of video frames. This may cover the long timing structure used to understand the video, i.e. the sampled frames will cover the entire video, regardless of the length of the video.
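As an illustration of the sampling strategy above, the following minimal Python sketch chooses a fixed number of frame indices spread uniformly over the whole video. The function name and the centre-of-segment placement are assumptions; the patent only fixes that a constant number of frames spans the entire video:

```python
def sample_frame_indices(num_video_frames, t):
    """Return t frame indices spread uniformly over [0, num_video_frames),
    so the sampled frames cover the whole video regardless of its length."""
    if num_video_frames <= 0 or t <= 0:
        return []
    step = num_video_frames / t
    # take the centre of each of the t equal-length segments
    return [min(num_video_frames - 1, int(step * (i + 0.5))) for i in range(t)]
```

Note that when the video is shorter than t frames, some indices repeat, which still yields a constant-length sequence as the method requires.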
In step S2, a behavior filter K_C composed of multiple groups of N Cauchy distributions is matrix-multiplied with the 3D feature v_t to obtain the key feature S_ct. Filtering the 3D global features with Cauchy-distribution filters lets the model implicitly understand which timestamps in the entire video frame sequence are more important for a description based on human behavior.
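The Cauchy-filter step can be sketched as follows in plain Python. The filter centres, the bandwidth gamma, and the per-filter normalisation over time are illustrative assumptions — the patent specifies only that a bank of Cauchy-distribution filters is matrix-multiplied with the 3D features:

```python
import math

def cauchy_weights(num_steps, center, gamma):
    """Temporal weights from a Cauchy (Lorentzian) density centred at
    `center`, normalised to sum to 1 over the num_steps time steps."""
    w = [gamma / (math.pi * ((t - center) ** 2 + gamma ** 2))
         for t in range(num_steps)]
    s = sum(w)
    return [x / s for x in w]

def key_features(v, centers, gamma=2.0):
    """Matrix-multiply a bank of Cauchy filters (one row per filter centre)
    with the 3D features v (T x D lists) to obtain key features (M x D):
    each output row is a Cauchy-weighted temporal average of v."""
    num_steps, dim = len(v), len(v[0])
    out = []
    for c in centers:
        w = cauchy_weights(num_steps, c, gamma)
        out.append([sum(w[t] * v[t][d] for t in range(num_steps))
                    for d in range(dim)])
    return out
```

Because each filter's weights sum to one, a filter centred near an important timestamp emphasises the 3D features around that moment while suppressing the rest of the sequence.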
The invention also provides a video coding and decoding system, which comprises:
a first extraction unit for extracting 3D features of a video frame sequence;
a second extraction unit for extracting 2D features of the video frame sequence;
a third extraction unit, configured to extract key information of the 3D feature;
the first fusion unit is used for superposing the key features and the 2D features in time sequence to construct fusion features;
a second fusion unit for encoding the fusion feature at time t, obtaining a normalized weight through a softmax function, and multiplying the fusion feature by the normalized weight to obtain a new fusion feature;
and the long-term and short-term memory network is used for outputting a description sentence related to the video frame sequence after the new fusion characteristic is input.
The first extraction unit is a 3D convolutional neural network.
The second extraction unit is a 2D convolutional neural network.
The third extraction unit specifically performs the following operations: a behavior filter K_C consisting of multiple groups of N Cauchy distributions is matrix-multiplied with the 3D feature v_t to obtain the key feature S_ct.
Compared with the prior art, the invention has the following beneficial effects: the video descriptions obtained by the method are more logical and fluent, with consistent and clear semantics. Compared with the human reference sentences, the model provided by the invention closely follows human behavior information when generating descriptions, so that not only is the event information in the video captured, but the antecedents and outcomes of events are also linked through human actions. These advantages benefit not only from the guidance based on human behavior but also from the dynamic information combined by the hybrid 2D/3D convolution, which gives the descriptive sentences temporal consistency and causal relevance with respect to the video information.
Drawings
FIG. 1 is an overview of a hybrid 2D/3D convolutional network under human behavior guidance;
FIG. 2 is an evaluation of the method of the present invention on a Charades dataset;
fig. 3 is an example of machine description of video. (Example one: the reference sentence is that a person wearing a plaid shirt stands in a bathroom, takes off the shirt, and then picks up a yellow broom; the description of the invention is that a person walks into the room, takes off the shirt and places it on a shelf, and then the person picks up a broom. Example two: the reference sentence is that a person sits in a hallway, picks up a book, and then stands up again; the description of the invention is that a person sits on the floor, then picks up a book and places it on the floor.)
Detailed Description
The present invention proposes a hybrid 2D/3D convolutional network (Mixed 2D/3D Convolutional Networks, MCN). A dual-branch network structure is constructed: the first branch uses a 2D convolutional network to generate per-frame features, while the second branch uses a 3D convolutional network to refine the global feature information in all frames of the video.
Constructing the deep fusion of static and dynamic video information: a fixed number of frames is first sampled from the entire video to cover the long-range temporal structure used to understand the video. The sampled frames span the entire video regardless of its length. We therefore input a frame sequence of constant length T frame-by-frame into the 2D convolution branch to generate the single-frame visual features f_t, where T denotes the number of sampled frames. The 2D convolution branch adopts an ImageNet-pre-trained Inception v3 network as the backbone and extracts all single-frame visual features.
Since 3D convolution is better suited to spatio-temporal feature learning than 2D convolution, a 3D convolution network is introduced to capture the temporal relationships between frames. The sampled frame sequence is input into the 3D convolution branch to generate the global feature representation v_t of the video clip. The output global feature not only compensates for the lack of context information when decoding single-frame features, but also forms event feature representations over long time intervals; associated frames are closely linked, which helps the final descriptive sentence achieve logical coherence. Then, temporal filtering and a soft attention mechanism are applied to the 3D feature v_t to extract the key features. Finally, the key features and the 2D features f_t are superposed in time sequence to construct the fusion features. In the decoding process, we first apply an attention mechanism to the fusion features to extract a new fusion feature at each moment. The new fusion features are then input into a long short-term memory (LSTM) network, decoded over time, and a descriptive sentence about the behavior in the video is finally output.
Since the global features obtained by 3D convolution contain environmental information, objects, human behavior and so on, treating them all equally would leave the features insensitive to changes in human behavior. Considering that the 2D convolution network has already fully extracted the feature information of the environment, objects and human behavior, we wish to construct a set of behavior filters. These filters follow human behavior at different moments to refine the 3D convolution features, so that the filtered features not only fully mine the behavior features but also explore the causal relationships between behaviors at different moments. The invention therefore provides a set of temporal filters based on the Cauchy distribution to capture the key information in the global features and mine the correlations between the frames of the video. Filtering the global features obtained by the 3D convolution branch forms an implicit state vector containing the key behavior information.
The inventive method first selectively encodes temporal features of video information according to constraints of human behavior. The language description is then decoded by combining the static image features and the dynamic time series features, and the specific flow is as follows:
in the encoding phase:
the first step: firstly, respectively extracting 3D features and 2D features by using two branches of a 3D convolution network and a 2D convolution network;
and a second step of: cauchy filter filtering is performed on the 3D features and key features are extracted using a soft attention mechanism. In particular, a behavioural filter consisting of a plurality of sets of N-Kouchy distributionsFor performing the feature of the frame sequence->Matrix multiplication of (2) to obtain->Vitamin key feature->I.e. +.>Wherein, the method comprises the steps of, wherein,Mthe number of filters is indicated and the number of filters is indicated,cthe category of behavior is indicated and,Trepresenting the duration of the video +.>Representation oftTime of daymA filter (L)>Representing soft attention coefficients, realized by softmax function, calculated as +.>Wherein->Representing information about behavior categoriescIs the first of (2)iThe weights of the filters are automatically learned by the convolutional neural network during the training process. The soft-attention mechanism is based on the global features of the cauchy distribution filter, thus making the model implicitly understand which timestamps in the entire video frame sequence are more important for human behavior-based descriptions;
and a third step of: superposing the key features and the 2D features according to time sequences to construct fusion features;
in the decoding stage:
fourth step: first, new fusion features are extracted at each moment by using the attention mechanism to the fusion features. I.e. the mechanism of attention is introduced in timetFor fusion featuresREncoding and obtaining normalized weights through softmax function, and then fusing the featuresRMultiplied by the attention weight to obtain a new fusion feature, the formula of which is as follows:
wherein,representing attention vector, ++>Representing the hidden state of the LSTM output,Rrepresenting fusion features->And->The weight matrix is represented by a matrix of weights,bis biased (is->Representing new fusion features.
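The attention re-weighting of the fourth step can be sketched as follows; for brevity the raw energies e_t are passed in directly instead of being computed from the LSTM hidden state and the weight matrices, which is the part a trained model would supply:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def attend(R, energies):
    """Normalise raw attention energies with softmax and return the
    alpha-weighted sum of the fusion features R (T x D) — the new
    fusion feature R' for the current decoding step."""
    alpha = softmax(energies)
    dim = len(R[0])
    return [sum(alpha[t] * R[t][d] for t in range(len(R)))
            for d in range(dim)]
```

With uniform energies every time step contributes equally; a peaked energy vector concentrates the new fusion feature on the time steps most relevant to the human behavior being described.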
Fifth step: the new fusion features are input into a long short-term memory (LSTM) network and decoded over time to obtain a descriptive sentence about the behavior in the video. Specifically, at each instant the LSTM decodes the video features to obtain the hidden state h_t and the cell state c_t. One word is obtained at each moment by decoding the feature information, and finally a complete descriptive sentence is obtained. For each LSTM cell, the input x_t is the new fusion feature and the output is a word sequence (y_1, y_2, …, y_n).
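A single LSTM decoding step of the fifth step, written out in plain Python with the standard gate equations; the weight values here are placeholders that a real decoder would learn during training:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W):
    """One LSTM decoding step: input x (the new fusion feature), previous
    hidden state h and cell state c. W maps each gate name ('i', 'f',
    'o', 'g') to a tuple (Wx, Wh, b) of row-major weight matrices and a
    bias vector. Returns the new hidden and cell states (h_t, c_t)."""
    def lin(Wx, Wh, b):
        return [sum(Wx[k][j] * x[j] for j in range(len(x))) +
                sum(Wh[k][j] * h[j] for j in range(len(h))) + b[k]
                for k in range(len(b))]
    i = [sigmoid(v) for v in lin(*W['i'])]    # input gate
    f = [sigmoid(v) for v in lin(*W['f'])]    # forget gate
    o = [sigmoid(v) for v in lin(*W['o'])]    # output gate
    g = [math.tanh(v) for v in lin(*W['g'])]  # candidate cell state
    c_new = [f[k] * c[k] + i[k] * g[k] for k in range(len(c))]
    h_new = [o[k] * math.tanh(c_new[k]) for k in range(len(c))]
    return h_new, c_new
```

In the full decoder, h_t would be projected through a softmax over the vocabulary to emit one word per step, yielding the complete descriptive sentence.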
Another embodiment of the present invention provides a video encoding and decoding system, corresponding to the method of the above embodiment, including:
a first extraction unit for extracting 3D features of a video frame sequence;
in this embodiment, the first extraction unit is a 3D convolutional neural network;
a second extraction unit for extracting 2D features of the video frame sequence;
in this embodiment, the second extraction unit is a 2D convolutional neural network;
a third extraction unit, configured to extract key information of the 3D feature;
the third extraction unit specifically performs the following operations: by using a plurality of sets of N Kexil distribution structuresFinished behavior filterAnd 3D feature->Performing matrix multiplication to obtain key feature +.>(the specific implementation process is the same as that of the above embodiment, and will not be repeated here).
The first fusion unit is used for superposing the key features and the 2D features in time sequence to construct fusion features;
the second fusion unit is used for encoding the fusion characteristics at the moment t, obtaining normalized weights through softmax functions, and multiplying the fusion characteristics by the normalized weights to obtain new fusion characteristics;
and the long-term and short-term memory network is used for outputting a description sentence related to the video frame sequence after the new fusion characteristic is input.
The implementation process of each unit in the coding and decoding system of the present invention is the same as that of the foregoing embodiment.
The coding and decoding system of the invention can be configured in computer equipment; the computer equipment can be a microprocessor, a host computer, and the like.
Claims (8)
1. A video encoding and decoding method, comprising the steps of:
s1, respectively extracting 3D features and 2D features of a video frame sequence;
s2, processing the 3D features to obtain key features; overlapping the key features and the 2D features according to time sequence to construct fusion features;
s3, encoding the fusion feature at a time t, obtaining a normalized weight through a softmax function, and multiplying the fusion feature by the normalized weight to obtain a new fusion feature;
s4, inputting the new fusion features into a long short-term memory network to obtain a description sentence about the video frame sequence;
in step S2, a behavior filter K_C composed of a plurality of sets of N Cauchy distributions and the 3D feature v_t undergo a matrix multiplication operation to obtain the key feature S_ct.
2. The video coding method according to claim 1, characterized in that in step S1, 3D features of the sequence of video frames are extracted using a 3D convolutional neural network.
3. The video coding method according to claim 1, characterized in that in step S1, 2D features of the sequence of video frames are extracted using a 2D convolutional neural network.
4. A video codec method according to claim 3, wherein the 2D convolutional neural network employs an ImageNet-pre-trained Inception v3 network as a backbone network.
5. The video encoding and decoding method according to claim 1, wherein the video frame sequence acquisition method is: a fixed number of frames are sampled from the whole video, constituting the sequence of video frames.
6. A video codec system, comprising:
a first extraction unit for extracting 3D features of a video frame sequence;
a second extraction unit for extracting 2D features of the video frame sequence;
a third extraction unit, configured to extract key information of the 3D feature; the third extraction unit specifically performs the following operations: using a behavior filter K_C consisting of a plurality of sets of N Cauchy distributions and the 3D feature v_t, performing a matrix multiplication operation to obtain the key feature S_ct;
The first fusion unit is used for superposing the key features and the 2D features in time sequence to construct fusion features;
the second fusion unit is used for encoding the fusion characteristics at the moment t, obtaining normalized weights through softmax functions, and multiplying the fusion characteristics by the normalized weights to obtain new fusion characteristics;
and the long-term and short-term memory network is used for outputting a description sentence related to the video frame sequence after the new fusion characteristic is input.
7. The system of claim 6, wherein the first extraction unit is a 3D convolutional neural network.
8. The system of claim 6, wherein the second extraction unit is a 2D convolutional neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110483437.5A CN113099228B (en) | 2021-04-30 | 2021-04-30 | Video encoding and decoding method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110483437.5A CN113099228B (en) | 2021-04-30 | 2021-04-30 | Video encoding and decoding method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113099228A CN113099228A (en) | 2021-07-09 |
CN113099228B true CN113099228B (en) | 2024-04-05 |
Family
ID=76681265
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110483437.5A Active CN113099228B (en) | 2021-04-30 | 2021-04-30 | Video encoding and decoding method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113099228B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108388900A (en) * | 2018-02-05 | 2018-08-10 | 华南理工大学 | The video presentation method being combined based on multiple features fusion and space-time attention mechanism |
CN109344288A (en) * | 2018-09-19 | 2019-02-15 | 电子科技大学 | A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism |
CN109874029A (en) * | 2019-04-22 | 2019-06-11 | 腾讯科技(深圳)有限公司 | Video presentation generation method, device, equipment and storage medium |
WO2020037965A1 (en) * | 2018-08-21 | 2020-02-27 | 北京大学深圳研究生院 | Method for multi-motion flow deep convolutional network model for video prediction |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010033642A2 (en) * | 2008-09-16 | 2010-03-25 | Realnetworks, Inc. | Systems and methods for video/multimedia rendering, composition, and user-interactivity |
US10049279B2 (en) * | 2016-03-11 | 2018-08-14 | Qualcomm Incorporated | Recurrent networks with motion-based attention for video understanding |
-
2021
- 2021-04-30 CN CN202110483437.5A patent/CN113099228B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108388900A (en) * | 2018-02-05 | 2018-08-10 | 华南理工大学 | The video presentation method being combined based on multiple features fusion and space-time attention mechanism |
WO2020037965A1 (en) * | 2018-08-21 | 2020-02-27 | 北京大学深圳研究生院 | Method for multi-motion flow deep convolutional network model for video prediction |
CN109344288A (en) * | 2018-09-19 | 2019-02-15 | 电子科技大学 | A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism |
CN109874029A (en) * | 2019-04-22 | 2019-06-11 | 腾讯科技(深圳)有限公司 | Video presentation generation method, device, equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
Behavior recognition model with multi-feature fusion; Tan Dengtai, Li Shichao, et al.; Journal of Image and Graphics (中国图象图形学报); 20201216; 2541-2552 *
Also Published As
Publication number | Publication date |
---|---|
CN113099228A (en) | 2021-07-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112668671B (en) | Method and device for acquiring pre-training model | |
CN109947912B (en) | Model method based on intra-paragraph reasoning and joint question answer matching | |
CN108388900A (en) | The video presentation method being combined based on multiple features fusion and space-time attention mechanism | |
CN108416065A (en) | Image based on level neural network-sentence description generates system and method | |
CN109871736A (en) | The generation method and device of natural language description information | |
Yu et al. | Spae: Semantic pyramid autoencoder for multimodal generation with frozen llms | |
CN110457661A (en) | Spatial term method, apparatus, equipment and storage medium | |
CN116977457A (en) | Data processing method, device and computer readable storage medium | |
CN110889505B (en) | Cross-media comprehensive reasoning method and system for image-text sequence matching | |
CN111191461B (en) | Remote supervision relation extraction method based on course learning | |
CN112364148A (en) | Deep learning method-based generative chat robot | |
CN115908991A (en) | Image description model method, system, device and medium based on feature fusion | |
CN113657272B (en) | Micro video classification method and system based on missing data completion | |
Yan et al. | Intra-agent speech permits zero-shot task acquisition | |
CN109710787A (en) | Image Description Methods based on deep learning | |
CN113947074A (en) | Deep collaborative interaction emotion reason joint extraction method | |
CN113240714A (en) | Human motion intention prediction method based on context-aware network | |
CN110826397B (en) | Video description method based on high-order low-rank multi-modal attention mechanism | |
CN116644759B (en) | Method and system for extracting aspect category and semantic polarity in sentence | |
CN113420179A (en) | Semantic reconstruction video description method based on time sequence Gaussian mixture hole convolution | |
CN113099228B (en) | Video encoding and decoding method and system | |
CN116958324A (en) | Training method, device, equipment and storage medium of image generation model | |
CN111898576B (en) | Behavior identification method based on human skeleton space-time relationship | |
Xu et al. | Isolated Word Sign Language Recognition Based on Improved SKResNet‐TCN Network | |
CN113741759A (en) | Comment information display method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |