CN113099228A - Video coding and decoding method and system - Google Patents

Video coding and decoding method and system

Info

Publication number
CN113099228A
Authority
CN
China
Prior art keywords
features
video
fusion
sequence
decoding method
Prior art date
Legal status
Granted
Application number
CN202110483437.5A
Other languages
Chinese (zh)
Other versions
CN113099228B (en)
Inventor
郭克华
申长春
奎晓燕
刘斌
王凌风
刘超
Current Assignee
Hand In Hand Information Technology Co ltd
Central South University
Original Assignee
Hand In Hand Information Technology Co ltd
Central South University
Priority date
Filing date
Publication date
Application filed by Hand In Hand Information Technology Co ltd, Central South University
Priority to CN202110483437.5A
Publication of CN113099228A
Application granted
Publication of CN113099228B
Legal status: Active
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/172Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/44Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder

Abstract

The invention discloses a video coding and decoding method and system. First, the 2D features and the processed 3D features are stacked in temporal order, achieving a deep fusion of static and dynamic information. Then, an attention mechanism is introduced to encode the fusion features at each time t; normalized weights are obtained through a softmax function and assigned to the fusion features to produce new fusion features, so that human-oriented features are learned and the final language description related to human behavior is improved. Finally, the new fusion features are input into a long short-term memory (LSTM) network and decoded over time to obtain the video description sentence. The video descriptions obtained by the invention are more logical, fluent, coherent and clear.

Description

Video coding and decoding method and system
Technical Field
The invention relates to the field of machine learning, in particular to a video coding and decoding method and a video coding and decoding system.
Background
At present, deep learning algorithms in artificial intelligence can already perform video description, easily converting video information into language. For example, an accurate text summary of a large volume of video lets a user quickly grasp how an event developed and what impact it had, saving considerable time; likewise, converting the highlight segments of a two-hour film into a text outline of the film can give users a better recommendation experience. However, such undifferentiated description of video information does not reflect the imagination, curiosity and intelligence with which humans understand things, and these qualities have always been at the core of being human. Although text can be extracted from massive amounts of video, very little of it constitutes high-value knowledge for people. An excellent machine understanding algorithm should therefore describe the events that occur in a human way of thinking and understand the development of things from a human-first perspective, so that machines can understand video at a more intelligent level.
Generally, the events occurring in a video are closely related and causally linked, and these events are the source material for understanding tasks. The transition from the end of one event to a new event is mostly driven by human behavior. Human behavior thus dominates the development of events and the causes and effects between them, so it is necessary to follow human behavior when exploring how events develop and to strengthen the understanding of causal relationships. Traditional video understanding methods struggle to fully account for the temporal relevance of human behavior across the frames of a video and for the causal relationships between events; the extracted global temporal features contain a large number of redundant frame features, which consumes enormous computing power and makes the model converge too slowly during training. As a result, such methods cannot understand the development of things from a human perspective with behavior as the clue, nor allow a machine to understand video more intelligently.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the defects of the prior art, a video encoding and decoding method and system that improve the logical coherence and accuracy of video understanding tasks.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows: a video coding and decoding method comprises the following steps:
s1, respectively extracting 3D features and 2D features of the video frame sequence;
s2, processing the 3D features to obtain key features; overlapping the key features and the 2D features according to a time sequence to construct fusion features;
s3, encoding the fusion features at the moment t, obtaining normalization weight through a softmax function, and multiplying the fusion features by the normalization weight to obtain new fusion features;
and S4, inputting the new fusion characteristics into the long-term and short-term memory network to obtain the description sentences about the video frame sequence.
In order to construct a strong feature representation of the video, the invention considers not only static image information but also dynamic information with time as the clue. The invention therefore proposes a mixed 2D/3D convolutional network: a 2D convolutional network and a 3D convolutional network extract the 2D features and the 3D features of the video frames, which represent the static and dynamic video information respectively. The 2D features cover single-frame feature information such as environment, objects and human behavior, while the 3D features make up for the context information that is missing when single-frame features are decoded and form an event feature representation over long time intervals; this representation both captures the temporal relation of events in the visual features and strengthens the logic of the finally output descriptive sentences. The 2D features and the processed 3D features are stacked in temporal order to achieve a deep fusion of static and dynamic information. The fusion features are encoded at each time t, normalized weights are obtained through a softmax function, and different weights are assigned to the fusion features so that human-oriented features are learned, which benefits the final language description related to human behavior. An LSTM can learn long-term dependencies and is well suited to problems that are highly time-dependent, so the invention uses an LSTM network to decode the features carrying human behavior information, which are then described in text.
In step S1, a 3D convolutional neural network is used to extract the 3D features of the video frame sequence. 3D convolution is better suited than 2D convolution to learning spatio-temporal features, and a 3D convolutional neural network can capture the temporal relations between video frames.
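As a concrete illustration of this step, the sketch below keeps the temporal dimension of the 3D features so that later filtering over timestamps remains possible. It assumes PyTorch/torchvision; the r3d_18 backbone and the spatial-pooling choice are illustrative assumptions, since the patent does not name a specific 3D network.

```python
import torch
from torchvision.models.video import r3d_18

def extract_3d_features(clip: torch.Tensor) -> torch.Tensor:
    """clip: (B, 3, T, H, W) sampled video clip -> per-timestep global features (B, T', 512)."""
    backbone = r3d_18(weights="KINETICS400_V1")   # pretrained 3D CNN (illustrative choice)
    backbone.eval()
    with torch.no_grad():
        x = backbone.stem(clip)
        x = backbone.layer1(x)
        x = backbone.layer2(x)
        x = backbone.layer3(x)
        x = backbone.layer4(x)          # (B, 512, T', H', W')
        x = x.mean(dim=[3, 4])          # pool space, keep the temporal axis
    return x.transpose(1, 2)            # (B, T', 512)
```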
In step S1, a 2D convolutional neural network is used to extract the 2D features of the video frame sequence. The 2D convolutional neural network can extract feature information such as environment, objects and human behavior, which helps fully mine the behavior features in the video.
The 2D convolutional neural network adopts an ImageNet-pretrained Inception v3 network as its backbone. The Inception v3 network introduces asymmetric convolution structures, which handle richer spatial features and increase feature diversity better than symmetric convolutions, while also reducing the amount of computation.
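For the 2D branch, a minimal sketch using torchvision's ImageNet-pretrained Inception v3 is shown below; taking the pooled 2048-dimensional output as the per-frame feature is an assumption, since the patent only names the backbone.

```python
import torch
from torchvision.models import inception_v3

def extract_2d_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, 299, 299), resized and ImageNet-normalized -> (T, 2048) per-frame features."""
    net = inception_v3(weights="IMAGENET1K_V1")
    net.fc = torch.nn.Identity()     # keep the pooled 2048-d features instead of class logits
    net.eval()
    with torch.no_grad():
        return net(frames)           # (T, 2048)
```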
The video frame sequence is obtained as follows: a fixed number of frames is sampled from the entire video to form the sequence. This covers the long-range temporal structure needed to understand the video, i.e. the sampled frames span the whole video regardless of its length.
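A minimal sketch of the fixed-count sampling described here, assuming OpenCV for decoding; the choice of 32 frames is illustrative only, not a value specified by the patent.

```python
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 32) -> np.ndarray:
    """Uniformly sample a fixed number of frames spanning the whole video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)          # (num_frames, H, W, 3)
```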
In step S2, a behavior filter composed of multiple groups of N Cauchy distributions is matrix-multiplied with the 3D features to obtain the key features. Filtering the 3D global features with the Cauchy-distribution filter lets the model implicitly learn which timestamps in the whole video frame sequence matter more for a description based on human behavior.
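Since the patent does not spell out how the Cauchy-distribution filters are parameterized, the sketch below shows one plausible construction under assumed names: each of M filters is a Cauchy density over the T sampled timestamps with a learnable location and scale, normalized so each filter sums to one.

```python
import math
import torch
import torch.nn as nn

class CauchyBehaviorFilter(nn.Module):
    """A bank of M temporal filters, each a Cauchy density over T timestamps (a sketch)."""
    def __init__(self, num_filters: int, num_steps: int):
        super().__init__()
        # Learnable location (center) and scale per filter -- an assumed parameterization.
        self.loc = nn.Parameter(torch.linspace(0.0, 1.0, num_filters))
        self.log_scale = nn.Parameter(torch.zeros(num_filters))
        self.register_buffer("t", torch.linspace(0.0, 1.0, num_steps))

    def forward(self) -> torch.Tensor:
        scale = self.log_scale.exp().unsqueeze(1)            # (M, 1)
        diff = self.t.unsqueeze(0) - self.loc.unsqueeze(1)   # (M, T)
        density = 1.0 / (math.pi * scale * (1.0 + (diff / scale) ** 2))
        return density / density.sum(dim=1, keepdim=True)    # each row sums to 1
```

The resulting M by T filter bank can then be matrix-multiplied with the T by D per-timestep 3D features to produce the key features, as described above.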
The invention also provides a video coding and decoding system, which comprises:
a first extraction unit for extracting 3D features of a sequence of video frames;
a second extraction unit for extracting 2D features of the sequence of video frames;
a third extraction unit, configured to extract key information of the 3D feature;
the first fusion unit is used for overlapping the key features and the 2D features according to a time sequence to construct fusion features;
a second fusion unit for encoding the fusion features at time t, obtaining normalized weights through a softmax function, and multiplying the fusion features by the normalized weights to obtain new fusion features;
and the long-short term memory network is used for outputting description sentences related to the video frame sequence after inputting the new fusion characteristics.
The first extraction unit is a 3D convolutional neural network.
The second extraction unit is a 2D convolutional neural network.
The third extraction unit specifically performs the following operation: a behavior filter composed of groups of N Cauchy distributions is matrix-multiplied with the 3D features to obtain the key features.
Compared with the prior art, the invention has the following beneficial effects: the video descriptions obtained by the invention are more logical, fluent, coherent and clear. Compared with standard human sentences, the proposed model keeps a tight grip on descriptions driven by human behavior information; it not only captures the event information in the video but also links the causes and consequences of events to human actions. These advantages stem both from the guidance of human behavior and from the dynamic information contributed by the mixed 2D/3D convolution, which gives the descriptive sentences the temporal consistency and causal relevance of the video information.
Drawings
FIG. 1 is an overview of a hybrid 2D/3D convolutional network under human behavior guidance;
FIG. 2 shows the evaluation results of the method of the present invention on the Charades dataset;
Fig. 3 is an example of a machine description of a video. (Example one: reference sentence: a person wearing a plaid shirt stands in a bathroom, takes off the shirt, and then picks up a yellow broom. Description by the present invention: a person walks into a room, takes off the shirt, places it on a shelf, and then the person picks up a broom.)
Detailed Description
The invention provides a Mixed 2D/3D Convolutional Network (MCN). It constructs a two-branch network structure in which the first branch uses a 2D convolutional network to generate frame features and the second branch uses a 3D convolutional network to refine the global feature information across all frames of the video.
Constructing a deep fusion of static and dynamic video information: a fixed number of frames is first sampled across the whole video to cover the long-range temporal structure used to understand it; the sampled frames span the entire video regardless of its length. This constant-length frame sequence is fed frame by frame into the 2D convolutional network branch to generate the single-frame visual features, one per sampled frame. The 2D convolutional network adopts an ImageNet-pretrained Inception v3 network as its backbone to extract all single-frame visual features.
Since 3D convolution is better suited than 2D convolution to spatio-temporal feature learning, a 3D convolutional network is introduced to capture the temporal relations between frames. The sampled frame sequence is fed into the 3D convolutional network branch to generate a global feature representation of the video segment. The output global features both make up for the context information that is missing when single-frame features are decoded and form an event feature representation over long time intervals; related frames are tightly linked, which helps the final descriptive sentence follow a coherent beginning-to-end logic. Temporal filtering and a soft attention mechanism are then applied to the 3D features to extract the key features. Finally, the key features and the 2D features are stacked in temporal order to construct the fusion features. In the decoding process, an attention mechanism is first applied to the fusion features to extract a new fusion feature at each time step; the new fusion features are then input into a long short-term memory (LSTM) network, decoded over time, and a description of the behavior in the video is finally output.
Since the global features obtained by 3D convolution include environmental information, objects, human behavior and so on, treating them all equally would prevent the features from reflecting what is unique about changes in human behavior. Considering that the 2D convolutional network has already extracted the feature information of environment, objects and human behavior sufficiently, it is desirable to construct a set of behavior filters. These filters refine the 3D convolutional features according to human behavior at different moments, so that the filtered features both fully mine the behavior features and explore the causal relations between behaviors at different moments in the temporal sequence. The invention therefore proposes a group of sequential filters based on the Cauchy distribution to capture the key information in the global features and mine the correlation between frames in the video. Filtering the global features produced by the 3D convolutional branch yields an implicit state vector containing the key behavior information.
The method first selectively encodes the temporal features of the video information according to the constraints imposed by human behavior, and then decodes the language description by combining the static image features with the dynamic temporal features. The specific flow is as follows:
in the encoding stage:
the first step is as follows: firstly, extracting 3D features and 2D features respectively by using two branches of a 3D convolutional network and a 2D convolutional network;
the second step is that: filtering a Cauchy filter aiming at the 3D features and extracting key features by utilizing a soft attention mechanism. In particular, a behavioral filter consisting of multiple sets of N Cauchy distributions
Figure 600504DEST_PATH_IMAGE009
For performing features related to frame sequence
Figure 295927DEST_PATH_IMAGE010
By matrix multiplication of (a) to obtain
Figure 6394DEST_PATH_IMAGE011
Dimension key feature
Figure 484780DEST_PATH_IMAGE012
I.e. by
Figure 347825DEST_PATH_IMAGE013
Wherein, in the step (A),Mindicating the number of filters,cThe category of the behavior is represented by,Tthe duration of the video is represented as,
Figure 948571DEST_PATH_IMAGE014
to representtAt the first momentmA filter for filtering the received signal,
Figure 880755DEST_PATH_IMAGE015
expressing the soft attention coefficient, and is realized by a softmax function with the calculation formula of
Figure 225148DEST_PATH_IMAGE016
Wherein, in the step (A),
Figure 191967DEST_PATH_IMAGE017
representing categories about behaviorcTo (1) aiThe weights of the filters are automatically learned by the convolutional neural network in the training process. The soft attention mechanism is based on the global features of the cauchy distribution filter, so that the model implicitly understands which timestamps are more important for human behavior-based descriptions throughout the sequence of video frames;
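As a sketch of how the filtering and soft attention of this step can be combined (the exact formula is given only as figures in the original, so the shapes and the aggregation over filters here are assumptions):

```python
import torch

def extract_key_features(filters: torch.Tensor,        # (M, T) Cauchy behavior filter bank
                         feats_3d: torch.Tensor,       # (T, D) per-timestep 3D features
                         class_weights: torch.Tensor,  # (M,) learned weights for one behavior class c
                         ) -> torch.Tensor:
    filtered = filters @ feats_3d                      # (M, D) matrix multiplication with the 3D features
    alpha = torch.softmax(class_weights, dim=0)        # soft-attention coefficients over the M filters
    return (alpha.unsqueeze(1) * filtered).sum(dim=0)  # (D,) key feature emphasizing behavior-relevant timestamps
```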
the third step: overlapping the key features and the 2D features according to a time sequence to construct fusion features;
in the decoding stage:
the fourth step: first, a new fusion feature is extracted at each time instant, using the attention mechanism for the fusion feature. I.e. the attention mechanism is introduced in timetFor the fusion characteristicsREncoding is carried out, normalized weights are obtained through a softmax function, and then the features are fusedRMultiplying by the attention weight to obtain a new fusion feature, which is formulated as follows:
Figure 166877DEST_PATH_IMAGE018
wherein the content of the first and second substances,
Figure 835624DEST_PATH_IMAGE019
a vector of attention is represented, and,
Figure 983709DEST_PATH_IMAGE020
representing a hidden state of the LSTM output,Rthe fused features are represented as a result of the fusion,
Figure 539455DEST_PATH_IMAGE021
and
Figure 950845DEST_PATH_IMAGE022
a matrix of weights is represented by a matrix of weights,bis a bias that is a function of the bias,
Figure 654359DEST_PATH_IMAGE023
representing the new fusion feature.
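Because the attention formula of this step appears only as figures in the original, the sketch below implements a standard additive attention built from the quantities the text names (the previous LSTM hidden state, the fused features R, two weight matrices and a bias); the patent's exact form may differ.

```python
import torch
import torch.nn as nn

class FusionAttention(nn.Module):
    """Additive attention over the fused features R at decoding step t (a sketch)."""
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim, bias=False)     # weight matrix applied to R
        self.w_hidden = nn.Linear(hidden_dim, attn_dim, bias=True)  # weight matrix plus bias for h_{t-1}
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, fused: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        # fused: (T, feat_dim) fusion features R, h_prev: (hidden_dim,) previous LSTM hidden state
        scores = self.v(torch.tanh(self.w_feat(fused) + self.w_hidden(h_prev)))  # (T, 1)
        alpha = torch.softmax(scores, dim=0)               # normalized weights
        return (alpha * fused).sum(dim=0)                  # new fusion feature, (feat_dim,)
```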
The fifth step: the new fused features are input into a long-short term memory (LSTM) network and decoded over time to obtain a description sentence about the behavior in the video. In particular, the LSTM decodes the video features at each instant to obtain the hidden state
Figure 278238DEST_PATH_IMAGE024
And storage state
Figure 970382DEST_PATH_IMAGE025
. By decoding the characteristic information, a word is obtained at each moment, and finally, a complete descriptive sentence is obtained. For each LSTM cell, its input
Figure 349410DEST_PATH_IMAGE023
Is a new fusion feature whose output is a word sequence
Figure 743483DEST_PATH_IMAGE026
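To make the decoding loop concrete, a greedy sketch is given below. It assumes PyTorch's nn.LSTMCell, reuses the FusionAttention sketch from the previous step, and omits vocabulary handling, start/end tokens and beam search; all names are illustrative.

```python
import torch
import torch.nn as nn

def decode_description(attn: "FusionAttention", fused: torch.Tensor,
                       lstm: nn.LSTMCell, word_head: nn.Linear,
                       max_len: int = 20) -> list[int]:
    """Decode one word per step from the attended fusion features (greedy sketch).

    lstm must be constructed with input_size equal to the fused feature dimension.
    """
    h = torch.zeros(1, lstm.hidden_size)   # hidden state
    c = torch.zeros(1, lstm.hidden_size)   # memory (cell) state
    words = []
    for _ in range(max_len):
        r_t = attn(fused, h.squeeze(0)).unsqueeze(0)     # new fusion feature for step t
        h, c = lstm(r_t, (h, c))                         # update hidden and memory state
        words.append(int(word_head(h).argmax(dim=1)))    # one word id per time step
    return words
```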
Corresponding to the method of the above embodiment, another embodiment of the present invention further provides a video encoding and decoding system, which includes:
a first extraction unit for extracting 3D features of a sequence of video frames;
in this embodiment, the first extraction unit is a 3D convolutional neural network;
a second extraction unit for extracting 2D features of the sequence of video frames;
in this embodiment, the second extraction unit is a 2D convolutional neural network;
a third extraction unit, configured to extract key information of the 3D feature;
the third extraction unit specifically performs the following operations: using a behavioral filter consisting of groups of N Cauchy distributions
Figure 171053DEST_PATH_IMAGE027
And 3D features
Figure 701391DEST_PATH_IMAGE028
Performing matrix multiplication to obtain key features
Figure 251321DEST_PATH_IMAGE012
(the specific implementation process is the same as the above embodiment, and is not described here again).
The first fusion unit is used for overlapping the key features and the 2D features according to a time sequence to construct fusion features;
the second fusion unit is used for coding the fusion features at the moment t, obtaining normalization weight through a softmax function, and multiplying the fusion features by the normalization weight to obtain new fusion features;
and the long-short term memory network is used for outputting description sentences related to the video frame sequence after inputting the new fusion characteristics.
The implementation process of each unit in the coding and decoding system of the present invention is the same as the implementation process of the foregoing embodiment.
The coding and decoding system can be deployed on computer equipment, and the computer equipment can be, for example, a microprocessor or a host computer.

Claims (10)

1. A video encoding and decoding method, comprising the steps of:
s1, respectively extracting 3D features and 2D features of the video frame sequence;
s2, processing the 3D features to obtain key features; overlapping the key features and the 2D features according to a time sequence to construct fusion features;
s3, at the momenttEncoding the fusion features, obtaining normalization weight through a softmax function, and multiplying the fusion features by the normalization weight to obtain new fusion features;
and S4, inputting the new fusion characteristics into the long-term and short-term memory network to obtain the description sentences about the video frame sequence.
2. The video coding/decoding method according to claim 1, wherein in step S1, a 3D convolutional neural network is used to extract 3D features of the sequence of video frames.
3. The video coding/decoding method according to claim 1, wherein in step S1, 2D features of the video frame sequence are extracted by using a 2D convolutional neural network.
4. The video coding and decoding method according to claim 3, wherein the 2D convolutional neural network adopts an Inception v3 network pre-trained on ImageNet as a backbone network.
5. The video coding and decoding method of claim 1, wherein the video frame sequence obtaining method comprises: a fixed number of frames are sampled from the entire video to form the sequence of video frames.
6. The video encoding/decoding method of claim 1, wherein in step S2, a behavior filter consisting of a plurality of sets of N Cauchy distributions is matrix-multiplied with the 3D features to obtain the key features.
7. A video coding/decoding system, comprising:
a first extraction unit for extracting 3D features of a sequence of video frames;
a second extraction unit for extracting 2D features of the sequence of video frames;
a third extraction unit, configured to extract key information of the 3D feature;
the first fusion unit is used for overlapping the key features and the 2D features according to a time sequence to construct fusion features;
a second fusion unit for encoding the fusion features at time t, obtaining normalized weights through a softmax function, and multiplying the fusion features by the normalized weights to obtain new fusion features;
and the long-short term memory network is used for outputting description sentences related to the video frame sequence after inputting the new fusion characteristics.
8. The system of claim 7, wherein the first extraction unit is a 3D convolutional neural network.
9. The system of claim 7, wherein the second extraction unit is a 2D convolutional neural network.
10. The system according to claim 7, wherein the third extraction unit performs the following operation: a behavior filter consisting of groups of N Cauchy distributions is matrix-multiplied with the 3D features to obtain the key features.
CN202110483437.5A 2021-04-30 2021-04-30 Video encoding and decoding method and system Active CN113099228B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110483437.5A CN113099228B (en) 2021-04-30 2021-04-30 Video encoding and decoding method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110483437.5A CN113099228B (en) 2021-04-30 2021-04-30 Video encoding and decoding method and system

Publications (2)

Publication Number Publication Date
CN113099228A true CN113099228A (en) 2021-07-09
CN113099228B CN113099228B (en) 2024-04-05

Family

ID=76681265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110483437.5A Active CN113099228B (en) 2021-04-30 2021-04-30 Video encoding and decoding method and system

Country Status (1)

Country Link
CN (1) CN113099228B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100158099A1 (en) * 2008-09-16 2010-06-24 Realnetworks, Inc. Systems and methods for video/multimedia rendering, composition, and user interactivity
US20170262705A1 (en) * 2016-03-11 2017-09-14 Qualcomm Incorporated Recurrent networks with motion-based attention for video understanding
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
WO2020037965A1 (en) * 2018-08-21 2020-02-27 北京大学深圳研究生院 Method for multi-motion flow deep convolutional network model for video prediction
CN109344288A (en) * 2018-09-19 2019-02-15 电子科技大学 A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism
CN109874029A (en) * 2019-04-22 2019-06-11 腾讯科技(深圳)有限公司 Video presentation generation method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
谭等泰, 李世超, et al., "多特征融合的行为识别模型" (A behavior recognition model based on multi-feature fusion), 《中国图象图形学报》 (Journal of Image and Graphics), 16 December 2020, pp. 2541-2552 *

Also Published As

Publication number Publication date
CN113099228B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
CN109063164A (en) A kind of intelligent answer method based on deep learning
CN108549658A (en) A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN108960063A (en) It is a kind of towards event relation coding video in multiple affair natural language description algorithm
CN115687687B (en) Video segment searching method and system for open domain query
CN112800203B (en) Question-answer matching method and system fusing text representation and knowledge representation
CN113504906B (en) Code generation method and device, electronic equipment and readable storage medium
CN110427629A (en) Semi-supervised text simplified model training method and system
CN116204674B (en) Image description method based on visual concept word association structural modeling
CN112733043B (en) Comment recommendation method and device
CN112364148B (en) Deep learning method-based generative chat robot
CN113807222A (en) Video question-answering method and system for end-to-end training based on sparse sampling
CN112699310A (en) Cold start cross-domain hybrid recommendation method and system based on deep neural network
CN111949886A (en) Sample data generation method and related device for information recommendation
CN114648032B (en) Training method and device of semantic understanding model and computer equipment
Wang et al. Self-information loss compensation learning for machine-generated text detection
CN115525744A (en) Dialog recommendation system based on prompt learning method
CN116484024A (en) Multi-level knowledge base construction method based on knowledge graph
Zhao et al. Shared-private memory networks for multimodal sentiment analysis
CN114240697A (en) Method and device for generating broker recommendation model, electronic equipment and storage medium
Tang et al. Predictive modelling of student behaviour using granular large-scale action data
CN109710787A (en) Image Description Methods based on deep learning
CN112668481A (en) Semantic extraction method for remote sensing image
CN114579869B (en) Model training method and related product
CN113099228A (en) Video coding and decoding method and system
CN110852066A (en) Multi-language entity relation extraction method and system based on confrontation training mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant