CN113099228B - Video encoding and decoding method and system - Google Patents

Video encoding and decoding method and system

Info

Publication number
CN113099228B
CN113099228B (application CN202110483437.5A)
Authority
CN
China
Prior art keywords
features
video
fusion
feature
sequence
Prior art date
Legal status
Active
Application number
CN202110483437.5A
Other languages
Chinese (zh)
Other versions
CN113099228A (en
Inventor
郭克华
申长春
奎晓燕
刘斌
王凌风
刘超
Current Assignee
Hand In Hand Information Technology Co ltd
Central South University
Original Assignee
Hand In Hand Information Technology Co ltd
Central South University
Priority date
Filing date
Publication date
Application filed by Hand In Hand Information Technology Co ltd, Central South University filed Critical Hand In Hand Information Technology Co ltd
Priority to CN202110483437.5A priority Critical patent/CN113099228B/en
Publication of CN113099228A publication Critical patent/CN113099228A/en
Application granted granted Critical
Publication of CN113099228B publication Critical patent/CN113099228B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/172Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/44Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Multimedia (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a video encoding and decoding method and system. First, the 2D features and the processed 3D features are superposed in temporal order to achieve a deep fusion of static and dynamic information. Then, an attention mechanism is introduced to encode the fusion features at each time t: normalized weights are obtained through a softmax function and assigned to the fusion features, yielding new fusion features that learn human-centered features and thereby promote a final language description related to human behavior. Finally, the new fusion features are input into a long short-term memory (LSTM) network and decoded over time to obtain the video description sentence. The video descriptions obtained by this method are more logical and fluent, with consistent and clear semantics.

Description

Video encoding and decoding method and system
Technical Field
The invention relates to the field of machine learning, in particular to a video coding and decoding method and a video coding and decoding system.
Background
At present, deep learning algorithms in artificial intelligence can already perform the video description function, readily converting video information into language content. For example, before a user browses massive amounts of video information, a precise text summary of that video lets the user quickly grasp how an event developed and what impact it had, saving a great deal of time. Likewise, extracting the highlights of a two-hour movie and converting them into a text outline that summarizes the film gives the user a better recommendation experience. However, performing the description function indiscriminately on video information does not embody the imagination, curiosity and wisdom with which humans understand things, qualities that lie at the heart of human understanding. Although text information can be extracted from massive video, the high-value knowledge it yields is very limited. Therefore, an excellent machine understanding algorithm should describe the events that take place in a human way of thinking and understand the development of things from a human first-person perspective, so that the machine can understand video at a more intelligent level.
In general, the events occurring in a video are closely connected and causally related, and they are the source material for the understanding task. The transition from the end of one event to a new event is mostly driven by human behavior. Human behavior dominates the development of events and the causes and effects between them, so it is necessary to follow human behavior when exploring how events develop and when reasoning about their causal relationships. Traditional video understanding methods find it difficult to fully account for the temporal relevance of human behavior across the frames of a video and for the causality of the events that occur; the global temporal features they extract contain a large number of redundant frame features, which consumes enormous computing power and makes the model converge too slowly during training, so the development of things cannot be well understood from a human point of view that takes behavior as the clue, and the machine cannot understand the video more intelligently.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a video encoding and decoding method and system that improve the logical coherence and accuracy of video understanding tasks.
In order to solve the technical problems, the invention adopts the following technical scheme: a video encoding and decoding method comprising the steps of:
s1, respectively extracting 3D features and 2D features of a video frame sequence;
s2, processing the 3D features to obtain key features; overlapping the key features and the 2D features according to time sequence to construct fusion features;
s3, encoding the fusion feature at a time t, obtaining a normalized weight through a softmax function, and multiplying the fusion feature by the normalized weight to obtain a new fusion feature;
s4, inputting the new fusion characteristics into a long-short-period memory network to obtain a description sentence about the video frame sequence.
To construct a strong feature representation of video, the invention considers not only static image information but also dynamic information with time as a cue, and therefore proposes a hybrid 2D/3D convolutional network. A 2D convolutional network and a 3D convolutional network are used to extract the 2D features and 3D features of the video frames, which represent static and dynamic video information respectively. The 2D features cover single-frame information such as the environment, objects and human behavior; the 3D features not only compensate for the lack of context information when single-frame features are decoded, but also form event feature representations over long time intervals, which capture the temporal relations of events in the visual features and strengthen the logic of the finally output description sentences. The 2D features and the processed 3D features are superposed in temporal order to achieve a deep fusion of static and dynamic information. At time t the fusion features are encoded, normalized weights are obtained through a softmax function, and different weights are assigned to the fusion features so as to learn human-centered features, thereby promoting a final language description related to human behavior. An LSTM can learn long-term dependencies and is well suited to problems that depend strongly on time sequence, so the invention uses an LSTM network to decode the features that carry human behavior information and then describes them in text.
In step S1, 3D features of the video frame sequence are extracted using a 3D convolutional neural network. The 3D convolution is more suitable for learning the space-time characteristics than the 2D convolution, and the 3D convolution neural network can capture the time relation between video frames.
In step S1, 2D features of the video frame sequence are extracted using a 2D convolutional neural network. The 2D convolutional neural network can extract feature information such as environment, objects, and human behavior, which helps fully mine behavior features in the video.
The 2D convolutional neural network adopts an ImageNet pre-trained Inception v3 network as its backbone. The asymmetric convolution structure introduced in Inception v3 processes more and richer spatial features and increases feature diversity compared with a symmetric convolution structure, while also reducing the amount of computation.
The video frame sequence is acquired as follows: a fixed number of frames are sampled from the whole video to constitute the video frame sequence. This covers the long temporal structure used to understand the video, i.e. the sampled frames span the entire video regardless of its length.
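As an illustration only (the sample count of 32 and the helper name are assumptions, not values taken from the patent), spreading a fixed number of frame indices evenly over a video of arbitrary length can be sketched in Python as follows:

```python
import numpy as np

def sample_fixed_frames(total_frames: int, num_samples: int = 32) -> np.ndarray:
    """Pick `num_samples` frame indices spread evenly over the whole video,
    so the sampled frames span the entire clip regardless of its length."""
    if total_frames <= num_samples:
        # Short video: keep every frame and pad by repeating the last index.
        idx = np.arange(total_frames)
        pad = np.full(num_samples - total_frames, total_frames - 1)
        return np.concatenate([idx, pad])
    # Evenly spaced positions over the full duration, rounded to valid indices.
    return np.linspace(0, total_frames - 1, num_samples).round().astype(int)

# A one-hour clip at 25 fps and a ten-second clip both yield 32 indices.
print(sample_fixed_frames(90_000)[:5])  # e.g. [0 2903 5806 8710 11613]
print(sample_fixed_frames(250)[:5])
```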
In step S2, a behavior filter K_C composed of a plurality of sets of N Cauchy distributions is matrix-multiplied with the 3D features v_t to obtain the key features S_ct. Because the 3D global features are filtered by the Cauchy-distribution filter, the model can implicitly learn which timestamps in the whole video frame sequence are more important for a human-behavior-based description.
The invention also provides a video coding and decoding system, which comprises:
a first extraction unit for extracting 3D features of a video frame sequence;
a second extraction unit for extracting 2D features of the video frame sequence;
a third extraction unit, configured to extract key information of the 3D feature;
the first fusion unit is used for superposing the key features and the 2D features in time sequence to construct fusion features;
a second fusion unit for encoding the fusion features at time t, obtaining normalized weights through a softmax function, and multiplying the fusion features by the normalized weights to obtain new fusion features;
and the long short-term memory network is used for outputting a description sentence about the video frame sequence after the new fusion features are input.
The first extraction unit is a 3D convolutional neural network.
The second extraction unit is a 2D convolutional neural network.
The third extraction unit specifically performs the following operation: a behavior filter K_C composed of a plurality of sets of N Cauchy distributions is matrix-multiplied with the 3D features v_t to obtain the key features S_ct.
Compared with the prior art, the invention has the following beneficial effects: the video descriptions obtained by the method are more logical and fluent, with consistent and clear semantics. Compared with the human reference sentences, the proposed model closely follows human behavior information when generating descriptions, so that it not only captures the event information in the video but also links the causes and consequences of events through human actions. These advantages benefit not only from the guidance based on human behavior but also from the dynamic information combined by the hybrid 2D/3D convolution, which gives the description sentences temporal consistency and causal relevance with the video information.
Drawings
FIG. 1 is an overview of a hybrid 2D/3D convolutional network under human behavior guidance;
FIG. 2 is an evaluation of the method of the present invention on a Charades dataset;
Fig. 3 shows examples of machine-generated video descriptions. (Example 1: the reference sentence is that a person wearing a checked shirt stands in a bathroom, takes off the shirt, and then picks up a yellow broom; the description generated by the invention is that a person walks into the room, takes off the shirt, places it on a shelf, and then picks up a broom. Example 2: the reference sentence is that a person sits in a hallway, picks up a book, and then stands up again; the description generated by the invention is that a person sits on the floor, then picks up a book and places it on the floor.)
Detailed Description
The present invention proposes a hybrid 2D/3D convolutional network (Mixed 2D/3D Convolutional Networks, MCN). It is built as a dual-branch network in which the first branch uses a 2D convolutional network to generate per-frame features and the second branch uses a 3D convolutional network to refine the global feature information across all frames of the video.
Constructing the deep fusion of static and dynamic video information: a fixed number of frames are first sampled over the entire video to cover the long-range temporal structure used to understand it; the sampled frames span the whole video regardless of its length. This fixed-length frame sequence is then fed frame by frame into the 2D convolution branch to generate single-frame visual features, one per sampled frame. The 2D convolution branch adopts an ImageNet pre-trained Inception v3 network as its backbone and extracts all the single-frame visual features.
Since 3D convolution is better suited to spatio-temporal feature learning than 2D convolution, a 3D convolution branch is introduced to capture the temporal relationship between frames. The sampled frame sequence is fed into the 3D convolution branch to generate a global feature representation of the video clip. This global feature not only compensates for the lack of context information when single-frame features are decoded, but also forms event feature representations over long time intervals, closely linking the associated frames and giving the final description sentence a coherent logic. Temporal filtering and a soft attention mechanism are then applied to the 3D features to extract the key features. Finally, the key features and the 2D features are superposed in temporal order to construct the fusion features. In the decoding stage, an attention mechanism is first applied to the fusion features to extract a new fusion feature at each time step; the new fusion features are then input into a long short-term memory (LSTM) network, decoded over time, and a description sentence about the behavior in the video is finally output.
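Purely as an illustration of this dual-branch encoding (not the patent's exact implementation), the sketch below uses the Inception v3 backbone named in the text for the 2D branch, while the `r3d_18` video backbone, the feature dimensions and the concatenation used as the "superposition" are assumptions; the Cauchy behavior filtering of the 3D features, sketched further below, is replaced here by simply broadcasting the global feature over time.

```python
import torch
import torch.nn as nn
from torchvision import models

class DualBranchEncoder(nn.Module):
    """2D branch: per-frame static features. 3D branch: global dynamic features.
    A placeholder key feature from the 3D branch is superposed with the 2D
    features in temporal order to build the fusion features."""
    def __init__(self, feat_dim: int = 512):
        super().__init__()
        # 2D branch: ImageNet pre-trained Inception v3 with the classifier removed.
        inception = models.inception_v3(weights="IMAGENET1K_V1")
        inception.fc = nn.Identity()
        self.branch2d = inception
        self.proj2d = nn.Linear(2048, feat_dim)
        # 3D branch: a generic video CNN standing in for the patent's 3D ConvNet.
        r3d = models.video.r3d_18(weights="KINETICS400_V1")
        r3d.fc = nn.Identity()
        self.branch3d = r3d
        self.proj3d = nn.Linear(512, feat_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, 3, 299, 299) sampled frame sequence.
        b, t = frames.shape[:2]
        out = self.branch2d(frames.flatten(0, 1))             # (B*T, 2048)
        out = out.logits if hasattr(out, "logits") else out   # Inception returns a namedtuple in train mode
        f2d = self.proj2d(out).view(b, t, -1)                 # (B, T, D) static per-frame features
        clip = frames.transpose(1, 2)                         # (B, 3, T, H, W) layout for 3D convolution
        g3d = self.proj3d(self.branch3d(clip))                # (B, D) global dynamic feature
        key = g3d.unsqueeze(1).expand(-1, t, -1)              # broadcast over the T time steps
        return torch.cat([f2d, key], dim=-1)                  # (B, T, 2D) fusion features
```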
The global features obtained by the 3D convolution contain environmental information, objects, human behavior and so on; if they were all treated equally, the features would not be specific to changes in human behavior. Considering that the 2D convolutional network has already fully extracted feature information about the environment, objects and human behavior, we wish to construct a set of behavior filters that follow human behavior at different moments to refine the 3D convolutional features, so that the filtered features not only fully mine the behavior features but also explore the causal relationships of the behavior at different moments. The invention therefore provides a set of temporal filters based on the Cauchy distribution to capture the key information in the global features and to mine the correlation between frames in the video. Filtering the global features obtained by the 3D convolution branch forms an implicit state vector that contains the key behavior information.
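One way to realize such a Cauchy-distribution temporal filter bank is sketched below; the learnable location/scale parameterization and the normalization over time are assumptions made for illustration, since the text does not spell out how the filters are constructed.

```python
import torch
import torch.nn as nn

class CauchyFilterBank(nn.Module):
    """M temporal filters per behavior category, each a normalized Cauchy
    (Lorentzian) profile over the T frame positions; the location mu and
    scale gamma of every filter are learned during training."""
    def __init__(self, num_categories: int, num_filters: int, seq_len: int):
        super().__init__()
        self.seq_len = seq_len
        # One (mu, gamma) pair per category and filter, initialised so the
        # filters start out spread evenly across the video.
        mu0 = torch.linspace(0.1, 0.9, num_filters).repeat(num_categories, 1)
        self.mu = nn.Parameter(mu0)                                        # (C, M)
        self.log_gamma = nn.Parameter(torch.zeros(num_categories, num_filters))

    def forward(self) -> torch.Tensor:
        t = torch.linspace(0.0, 1.0, self.seq_len, device=self.mu.device)  # (T,)
        mu = self.mu.unsqueeze(-1)                   # (C, M, 1)
        gamma = self.log_gamma.exp().unsqueeze(-1)   # (C, M, 1), gamma > 0
        # Cauchy density up to a constant: 1 / (1 + ((t - mu) / gamma)^2)
        k = 1.0 / (1.0 + ((t - mu) / gamma) ** 2)    # (C, M, T)
        return k / k.sum(dim=-1, keepdim=True)       # normalise each filter over time
```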
The method of the invention first selectively encodes the temporal features of the video information according to the constraints of human behavior, and then decodes the language description by combining the static image features with the dynamic temporal features. The specific flow is as follows:
in the encoding phase:
the first step: firstly, respectively extracting 3D features and 2D features by using two branches of a 3D convolution network and a 2D convolution network;
and a second step of: cauchy filter filtering is performed on the 3D features and key features are extracted using a soft attention mechanism. In particular, a behavioural filter consisting of a plurality of sets of N-Kouchy distributionsFor performing the feature of the frame sequence->Matrix multiplication of (2) to obtain->Vitamin key feature->I.e. +.>Wherein, the method comprises the steps of, wherein,Mthe number of filters is indicated and the number of filters is indicated,cthe category of behavior is indicated and,Trepresenting the duration of the video +.>Representation oftTime of daymA filter (L)>Representing soft attention coefficients, realized by softmax function, calculated as +.>Wherein->Representing information about behavior categoriescIs the first of (2)iThe weights of the filters are automatically learned by the convolutional neural network during the training process. The soft-attention mechanism is based on the global features of the cauchy distribution filter, thus making the model implicitly understand which timestamps in the entire video frame sequence are more important for human behavior-based descriptions;
and a third step of: superposing the key features and the 2D features according to time sequences to construct fusion features;
In the decoding stage:
fourth step: first, new fusion features are extracted at each moment by using the attention mechanism to the fusion features. I.e. the mechanism of attention is introduced in timetFor fusion featuresREncoding and obtaining normalized weights through softmax function, and then fusing the featuresRMultiplied by the attention weight to obtain a new fusion feature, the formula of which is as follows:
wherein,representing attention vector, ++>Representing the hidden state of the LSTM output,Rrepresenting fusion features->And->The weight matrix is represented by a matrix of weights,bis biased (is->Representing new fusion features.
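A sketch of this decoding-time attention step under an assumed additive parameterization (the scoring vector, the tanh nonlinearity and the exact placement of W_h, W_R and the bias b are illustrative; the text only states that R and the LSTM hidden state are combined through weight matrices and a bias, normalized by a softmax, and used to re-weight R):

```python
import torch
import torch.nn as nn

class FusionAttention(nn.Module):
    """At time t, scores every time step of the fusion features R against the
    previous LSTM hidden state, normalizes the scores with a softmax, and
    re-weights R to obtain the new fusion feature fed to the decoder."""
    def __init__(self, feat_dim: int, hidden_dim: int):
        super().__init__()
        self.w_h = nn.Linear(hidden_dim, feat_dim, bias=False)  # W_h
        self.w_r = nn.Linear(feat_dim, feat_dim, bias=True)     # W_R and bias b
        self.score = nn.Linear(feat_dim, 1, bias=False)

    def forward(self, r: torch.Tensor, h_prev: torch.Tensor):
        # r: (B, T, D) fusion features, h_prev: (B, H) previous LSTM hidden state.
        e = self.score(torch.tanh(self.w_r(r) + self.w_h(h_prev).unsqueeze(1)))  # (B, T, 1)
        alpha = torch.softmax(e, dim=1)     # normalized attention weights over time
        r_new = (alpha * r).sum(dim=1)      # new fusion feature, (B, D)
        return r_new, alpha.squeeze(-1)
```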
Fifth step: the new fusion features are input into a long short-term memory (LSTM) network and decoded over time to obtain a description sentence about the behavior in the video. Specifically, at each time step the LSTM decodes the video features to obtain a hidden state and a memory cell state; one word is obtained at each time step by decoding the feature information, and a complete description sentence is finally obtained. For each LSTM cell, the input is the new fusion feature and the output is the word sequence.
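A minimal greedy-decoding sketch of this final step; the vocabulary handling, the embedding size, the greedy word choice and the reuse of the FusionAttention module from the previous sketch are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    """LSTM decoder: at each time step it attends over the fusion features,
    consumes the previous word, and emits one word until the end token."""
    def __init__(self, vocab_size: int, feat_dim: int = 1024,
                 hidden_dim: int = 512, embed_dim: int = 300):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attend = FusionAttention(feat_dim, hidden_dim)  # from the previous sketch
        self.lstm = nn.LSTMCell(feat_dim + embed_dim, hidden_dim)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    @torch.no_grad()
    def greedy_decode(self, r: torch.Tensor, bos: int, eos: int, max_len: int = 20):
        # r: (1, T, D) fusion features for one video.
        h = torch.zeros(1, self.lstm.hidden_size)   # hidden state
        c = torch.zeros(1, self.lstm.hidden_size)   # memory (cell) state
        word = torch.tensor([bos])
        sentence = []
        for _ in range(max_len):
            r_new, _ = self.attend(r, h)                              # new fusion feature at this step
            h, c = self.lstm(torch.cat([r_new, self.embed(word)], dim=-1), (h, c))
            word = self.to_vocab(h).argmax(dim=-1)                    # one word per time step
            if word.item() == eos:
                break
            sentence.append(word.item())
        return sentence
```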
Another embodiment of the present invention provides a video encoding and decoding system, corresponding to the method of the above embodiment, including:
a first extraction unit for extracting 3D features of a video frame sequence;
in this embodiment, the first extraction unit is a 3D convolutional neural network;
a second extraction unit for extracting 2D features of the video frame sequence;
in this embodiment, the second extraction unit is a 2D convolutional neural network;
a third extraction unit, configured to extract key information of the 3D feature;
the third extraction unit specifically performs the following operations: by using a plurality of sets of N Kexil distribution structuresFinished behavior filterAnd 3D feature->Performing matrix multiplication to obtain key feature +.>(the specific implementation process is the same as that of the above embodiment, and will not be repeated here).
The first fusion unit is used for superposing the key features and the 2D features in time sequence to construct fusion features;
the second fusion unit is used for encoding the fusion characteristics at the moment t, obtaining normalized weights through softmax functions, and multiplying the fusion characteristics by the normalized weights to obtain new fusion characteristics;
and the long short-term memory network is used for outputting a description sentence about the video frame sequence after the new fusion features are input.
The implementation process of each unit in the coding and decoding system of the present invention is the same as that of the foregoing embodiment.
The encoding and decoding system of the invention can be configured on computer equipment such as a microprocessor or a host computer.

Claims (8)

1. A video encoding and decoding method, comprising the steps of:
s1, respectively extracting 3D features and 2D features of a video frame sequence;
s2, processing the 3D features to obtain key features; overlapping the key features and the 2D features according to time sequence to construct fusion features;
s3, encoding the fusion feature at a time t, obtaining a normalized weight through a softmax function, and multiplying the fusion feature by the normalized weight to obtain a new fusion feature;
s4, inputting the new fusion characteristics into a long-period memory network to obtain a description sentence about the video frame sequence;
in step S2, a behavior filter K composed of a plurality of sets of N Kexil distributions is used C And 3D feature v t Performing matrix multiplication operation to obtain key features S ct
2. The video encoding and decoding method according to claim 1, wherein in step S1, the 3D features of the video frame sequence are extracted using a 3D convolutional neural network.
3. The video encoding and decoding method according to claim 1, wherein in step S1, the 2D features of the video frame sequence are extracted using a 2D convolutional neural network.
4. The video encoding and decoding method according to claim 3, wherein the 2D convolutional neural network employs an ImageNet pre-trained Inception v3 network as a backbone network.
5. The video encoding and decoding method according to claim 1, wherein the video frame sequence acquisition method is: a fixed number of frames are sampled from the whole video, constituting the sequence of video frames.
6. A video codec system, comprising:
a first extraction unit for extracting 3D features of a video frame sequence;
a second extraction unit for extracting 2D features of the video frame sequence;
a third extraction unit, configured to extract key information of the 3D features; the third extraction unit specifically performs the following operation: a behavior filter K_C composed of a plurality of sets of N Cauchy distributions is matrix-multiplied with the 3D features v_t to obtain the key features S_ct;
The first fusion unit is used for superposing the key features and the 2D features in time sequence to construct fusion features;
the second fusion unit is used for encoding the fusion characteristics at the moment t, obtaining normalized weights through softmax functions, and multiplying the fusion characteristics by the normalized weights to obtain new fusion characteristics;
and the long short-term memory network is used for outputting a description sentence about the video frame sequence after the new fusion features are input.
7. The system of claim 6, wherein the first extraction unit is a 3D convolutional neural network.
8. The system of claim 6, wherein the second extraction unit is a 2D convolutional neural network.
CN202110483437.5A 2021-04-30 2021-04-30 Video encoding and decoding method and system Active CN113099228B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110483437.5A CN113099228B (en) 2021-04-30 2021-04-30 Video encoding and decoding method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110483437.5A CN113099228B (en) 2021-04-30 2021-04-30 Video encoding and decoding method and system

Publications (2)

Publication Number Publication Date
CN113099228A CN113099228A (en) 2021-07-09
CN113099228B true CN113099228B (en) 2024-04-05

Family

ID=76681265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110483437.5A Active CN113099228B (en) 2021-04-30 2021-04-30 Video encoding and decoding method and system

Country Status (1)

Country Link
CN (1) CN113099228B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN109344288A (en) * 2018-09-19 2019-02-15 电子科技大学 A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism
CN109874029A (en) * 2019-04-22 2019-06-11 腾讯科技(深圳)有限公司 Video presentation generation method, device, equipment and storage medium
WO2020037965A1 (en) * 2018-08-21 2020-02-27 北京大学深圳研究生院 Method for multi-motion flow deep convolutional network model for video prediction

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2010033642A2 (en) * 2008-09-16 2010-03-25 Realnetworks, Inc. Systems and methods for video/multimedia rendering, composition, and user-interactivity
US10049279B2 (en) * 2016-03-11 2018-08-14 Qualcomm Incorporated Recurrent networks with motion-based attention for video understanding

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
WO2020037965A1 (en) * 2018-08-21 2020-02-27 北京大学深圳研究生院 Method for multi-motion flow deep convolutional network model for video prediction
CN109344288A (en) * 2018-09-19 2019-02-15 电子科技大学 A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism
CN109874029A (en) * 2019-04-22 2019-06-11 腾讯科技(深圳)有限公司 Video presentation generation method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Behavior recognition model with multi-feature fusion; 谭等泰, 李世超, et al.; Journal of Image and Graphics (中国图象图形学报); 2020-12-16; 2541-2552 *

Also Published As

Publication number Publication date
CN113099228A (en) 2021-07-09

Similar Documents

Publication Publication Date Title
CN112668671B (en) Method and device for acquiring pre-training model
CN109947912B (en) Model method based on intra-paragraph reasoning and joint question answer matching
CN108388900A (en) The video presentation method being combined based on multiple features fusion and space-time attention mechanism
CN108416065A (en) Image based on level neural network-sentence description generates system and method
CN109871736A (en) The generation method and device of natural language description information
Yu et al. Spae: Semantic pyramid autoencoder for multimodal generation with frozen llms
CN110457661A (en) Spatial term method, apparatus, equipment and storage medium
CN116977457A (en) Data processing method, device and computer readable storage medium
CN110889505B (en) Cross-media comprehensive reasoning method and system for image-text sequence matching
CN111191461B (en) Remote supervision relation extraction method based on course learning
CN112364148A (en) Deep learning method-based generative chat robot
CN115908991A (en) Image description model method, system, device and medium based on feature fusion
CN113657272B (en) Micro video classification method and system based on missing data completion
Yan et al. Intra-agent speech permits zero-shot task acquisition
CN109710787A (en) Image Description Methods based on deep learning
CN113947074A (en) Deep collaborative interaction emotion reason joint extraction method
CN113240714A (en) Human motion intention prediction method based on context-aware network
CN110826397B (en) Video description method based on high-order low-rank multi-modal attention mechanism
CN116644759B (en) Method and system for extracting aspect category and semantic polarity in sentence
CN113420179A (en) Semantic reconstruction video description method based on time sequence Gaussian mixture hole convolution
CN113099228B (en) Video encoding and decoding method and system
CN116958324A (en) Training method, device, equipment and storage medium of image generation model
CN111898576B (en) Behavior identification method based on human skeleton space-time relationship
Xu et al. Isolated Word Sign Language Recognition Based on Improved SKResNet‐TCN Network
CN113741759A (en) Comment information display method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant