CN113099228A - Video coding and decoding method and system - Google Patents
- Publication number
- CN113099228A (Application No. CN202110483437.5A)
- Authority
- CN
- China
- Prior art keywords
- features
- video
- fusion
- sequence
- decoding method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
- H04N19/172—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/42—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/44—Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
Abstract
The invention discloses a video encoding and decoding method and system. First, the 2D features and the processed 3D features are superimposed in temporal order, achieving a deep fusion of static and dynamic information. Next, an attention mechanism encodes the fusion features at each time step t; normalized weights obtained through a softmax function assign different importance to the fusion features, yielding new fusion features that emphasize human-oriented information and thereby improve the final language description of human behavior. Finally, the new fusion features are fed into a long short-term memory (LSTM) network and decoded over time to obtain the video description sentence. The video descriptions produced by the invention are more logical, fluent, coherent and clear.
Description
Technical Field
The invention relates to the field of machine learning, and in particular to a video encoding and decoding method and system.
Background
At present, deep learning algorithms in artificial intelligence can already perform video description, readily converting video information into language. For example, an accurate text summary of a large volume of video lets a user quickly grasp how an event unfolded and what its impact was before watching, saving considerable time. Likewise, converting the highlight segments of a two-hour movie into a text outline of the film can give users a better recommendation experience. However, such undifferentiated description of video information does not reflect the imagination, curiosity and intelligence with which humans understand things, qualities that have always been at the core of being human. Although text can be extracted from massive video collections, very little of it constitutes knowledge of high value to people. An excellent machine understanding algorithm should therefore describe the events that occur in a human way of thinking and understand how things develop from a human-first perspective, so that machines understand video at a more intelligent level.
In general, the events occurring in a video are closely related and causally linked, and these events are the raw material of any understanding task. The transition from the end of one event to a new one is usually driven by human behavior. Human behavior, one might say, dominates how events unfold and the causes and effects connecting them, so it is necessary to follow human behavior in order to explore how events develop and to strengthen the understanding of causal relationships. Traditional video understanding methods struggle to fully account for the temporal relevance of human behavior across the frames of a video and the causal relationships between events; the global temporal features they extract contain many redundant frame features, which consumes enormous compute and makes the model converge too slowly during training. As a result, they cannot understand, from a human perspective with behavior as the clue, how things develop, nor let a machine understand video more intelligently.
Disclosure of Invention
The technical problem solved by the invention is to overcome the above shortcomings of the prior art by providing a video encoding and decoding method and system that improve the logic and accuracy of video understanding tasks.
To solve this technical problem, the invention adopts the following scheme. A video encoding and decoding method comprises the following steps:
s1, extracting 3D features and 2D features of a video frame sequence, respectively;
s2, processing the 3D features to obtain key features, and superimposing the key features and the 2D features in temporal order to construct fusion features;
s3, encoding the fusion features at time t, obtaining normalized weights through a softmax function, and multiplying the fusion features by the normalized weights to obtain new fusion features;
s4, inputting the new fusion features into a long short-term memory network to obtain a description sentence about the video frame sequence.
To construct a strong feature representation of video, the invention considers not only static image information but also dynamic information with time as the clue. It therefore proposes a hybrid 2D/3D convolutional network: a 2D convolutional network and a 3D convolutional network are used to extract the 2D and 3D features of the video frames, representing static and dynamic video information respectively. The 2D features cover single-frame information such as environment, objects and human behavior; the 3D features compensate for the context information that single-frame features lack during decoding and form event representations over long time spans, which both capture the temporal relations of events in the visual features and strengthen the logic of the final output sentence. Superimposing the 2D features and the processed 3D features in temporal order achieves a deep fusion of static and dynamic information. The fusion features are encoded at each time step t, normalized weights are obtained by a softmax function, and different weights can be assigned to the fusion features to learn human-oriented features, thereby improving the final language description of human behavior. An LSTM can learn long-term dependencies and is well suited to strongly time-dependent problems, so the invention uses an LSTM network to decode the features carrying human behavior information, which are then rendered as text.
In step S1, a 3D convolutional neural network is used to extract the 3D features of the video frame sequence. 3D convolution is better suited to spatio-temporal feature learning than 2D convolution, and a 3D convolutional neural network can capture the temporal relations between video frames.
In step S1, the 2D features of the video frame sequence are extracted using a 2D convolutional neural network, which can extract feature information such as environment, objects and human behavior and helps to fully mine the behavioral features in the video.
The 2D convolutional neural network adopts an ImageNet-pretrained Inception v3 network as its backbone. Inception v3 introduces asymmetric convolution structures, which handle richer spatial features and increase feature diversity better than symmetric convolutions, while also reducing the amount of computation.
The video frame sequence is obtained as follows: a fixed number of frames are sampled from the entire video to form the sequence. This covers the long-range temporal structure needed to understand the video, i.e. the sampled frames span the whole video regardless of its length.
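The fixed-count sampling described above can be sketched as follows. The function name and the midpoint-of-segment strategy are illustrative assumptions, since the patent only specifies that a fixed number of frames must span the whole video.

```python
def sample_frame_indices(num_frames, num_samples):
    """Pick `num_samples` frame indices spread evenly over a video of
    `num_frames` frames, so the samples span the video whatever its length."""
    if num_frames <= 0 or num_samples <= 0:
        raise ValueError("num_frames and num_samples must be positive")
    step = num_frames / num_samples
    # Take the midpoint of each of the `num_samples` equal-length segments.
    return [min(num_frames - 1, int(step * (i + 0.5))) for i in range(num_samples)]
```

For a 100-frame video and 10 samples this yields indices 5, 15, ..., 95; a video shorter than the sample count simply repeats nearby frames.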
In step S2, a behavior filter composed of multiple groups of N Cauchy distributions is matrix-multiplied with the 3D features to obtain the key features. Filtering the 3D global features with Cauchy distributions lets the model implicitly understand which timestamps in the whole video frame sequence matter more for a description based on human behavior.
The invention also provides a video coding and decoding system, which comprises:
a first extraction unit for extracting 3D features of a sequence of video frames;
a second extraction unit for extracting 2D features of the sequence of video frames;
a third extraction unit, configured to extract key information of the 3D feature;
the first fusion unit is used for overlapping the key features and the 2D features according to a time sequence to construct fusion features;
a second fusion unit, for encoding the fusion features at time t, obtaining normalized weights through a softmax function, and multiplying the fusion features by the normalized weights to obtain new fusion features;
and the long-short term memory network is used for outputting description sentences related to the video frame sequence after inputting the new fusion characteristics.
The first extraction unit is a 3D convolutional neural network.
The second extraction unit is a 2D convolutional neural network.
The third extraction unit specifically performs the following operation: a behavior filter composed of multiple groups of N Cauchy distributions is matrix-multiplied with the 3D features to obtain the key features.
Compared with the prior art, the invention has the following beneficial effects: the video descriptions it produces are more logical, fluent, coherent and clear. Measured against standard human sentences, the proposed model keeps its descriptions tightly anchored to human behavior information, capturing not only the event information in the video but also linking the causes and consequences of events to human actions. These advantages come both from the guidance of human behavior and from the dynamic information combined by the hybrid 2D/3D convolution, which gives the descriptive sentences the temporal consistency and causal relevance of the video information.
Drawings
FIG. 1 is an overview of a hybrid 2D/3D convolutional network under human behavior guidance;
FIG. 2 shows the evaluation results of the method of the present invention on the Charades dataset;
fig. 3 is an example of machine description of video. (Example one: reference sentence: a person wearing a plaid shirt stands in a bathroom, takes off the shirt, and then picks up a yellow broom. Description by the invention: a person walks into a room, takes off the shirt and places it on a shelf, and then the person picks up a broom.)
Detailed Description
The invention provides a Mixed 2D/3D Convolutional Network (MCN) built as a two-branch structure: the first branch uses a 2D convolutional network to generate frame features, and the second branch uses a 3D convolutional network to distill global feature information from all frames of the video.
Constructing a deep fusion of static and dynamic video information: a fixed number of frames are first sampled across the whole video to cover the long-range temporal structure used to understand it; the sampled frames span the entire video regardless of its length. A constant-length frame sequence is then fed frame by frame into the 2D convolutional network branch to generate single-frame visual features, the number of which equals the number of sampled frames. The 2D convolutional network adopts an ImageNet-pretrained Inception v3 network as its backbone to extract all single-frame visual features.
Since 3D convolution is better suited to spatio-temporal feature learning than 2D convolution, a 3D convolutional network branch is introduced to capture the temporal relations between frames. The sampled frame sequence is input into the 3D convolutional network branch to generate a global feature representation of the video segment. The output global features both supply the context information missing when single-frame features are decoded and form event representations over long time spans; related frames are tied closely together, which helps the final descriptive sentence hold together from beginning to end. Temporal filtering and a soft attention mechanism are then applied to the 3D features to extract the key features. Finally, the key features and the 2D features are superimposed in temporal order to construct the fusion features. During decoding, an attention mechanism over the fusion features first extracts a new fusion feature at each time step; the new fusion features are then input into a long short-term memory (LSTM) network, decoded over time, and a description of the behavior in the video is finally output.
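Under assumed shapes, per-frame 2D features of shape (T, d2) and time-aligned key features from the 3D branch of shape (T, d3), the "superimposing in temporal order" step can be read as per-time-step concatenation, sketched below. The function name and the concatenation reading are our assumptions, not the patent's wording.

```python
import numpy as np

def fuse_features(feat2d, key3d):
    """Superimpose 2D features and 3D-derived key features along time:
    concatenate the two feature vectors at each of the T time steps."""
    assert feat2d.shape[0] == key3d.shape[0], "time axes must align"
    return np.concatenate([feat2d, key3d], axis=1)  # shape (T, d2 + d3)
```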
The global features obtained by 3D convolution include environmental information, objects, human behavior and so on; if these are weighted equally, the features cannot reflect what is distinctive about changes in human behavior. Since the 2D convolutional network has already extracted ample feature information about environment, objects and human behavior, the invention constructs a set of behavior filters that refine the 3D convolutional features according to human behavior at different moments, so that the filtered features not only fully mine the behavioral characteristics but also explore the causal relations between behaviors at different moments in the temporal sequence. The invention therefore proposes a group of sequential filters based on the Cauchy distribution to capture the key information in the global features and mine the correlation between frames in the video; filtering the global features produced by the 3D convolution branch yields an implicit state vector containing the key behavior information.
The method first selectively encodes the temporal features of the video information under the constraints imposed by human behavior, and then decodes the language description by combining the static image features with the dynamic temporal features. The specific flow is as follows:
in the encoding stage:
the first step is as follows: firstly, extracting 3D features and 2D features respectively by using two branches of a 3D convolutional network and a 2D convolutional network;
the second step is that: the 3D features are filtered with a Cauchy filter and the key features are extracted using a soft attention mechanism. Specifically, a behavior filter composed of multiple groups of N Cauchy distributions is matrix-multiplied with the frame-sequence features V to obtain the key features, i.e.

$$A_t^c = \sum_{m=1}^{M} \alpha_m^c \, f_t^m V,$$

wherein M indicates the number of filters, c the behavior category, T the duration of the video, $f_t^m$ the m-th filter at time t, and $\alpha$ the soft attention coefficient, realized by a softmax function with the formula

$$\alpha_i^c = \frac{\exp(w_i^c)}{\sum_{j=1}^{M} \exp(w_j^c)},$$

wherein $w_i^c$ represents the weight of the i-th filter for behavior category c, learned automatically by the convolutional neural network during training. Because the soft attention operates on global features filtered by the Cauchy distributions, the model implicitly understands which timestamps in the whole video frame sequence matter more for a description based on human behavior;
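A minimal NumPy sketch of the behavior filtering, under our own assumptions: each of the M filters is a normalized Cauchy (Lorentzian) profile over the T frames, the filter bank is matrix-multiplied with the frame-sequence features V of shape (T, d), and softmax attention over the filters yields one key feature vector. The learned centers, scales and attention logits are replaced here by illustrative constants.

```python
import numpy as np

def cauchy_filter_bank(T, centers, scales):
    """Build M temporal filters, each a Cauchy density over the T frames,
    normalized so that every filter's weights sum to 1."""
    t = np.arange(T)[None, :]                 # (1, T) frame timestamps
    c = np.asarray(centers, float)[:, None]   # (M, 1) filter centers
    s = np.asarray(scales, float)[:, None]    # (M, 1) filter scales
    f = 1.0 / (np.pi * s * (1.0 + ((t - c) / s) ** 2))  # Cauchy pdf values
    return f / f.sum(axis=1, keepdims=True)   # (M, T)

def key_features(V, filters, logits):
    """Matrix-multiply the filter bank with frame features V (T, d) and
    combine the M filtered results with softmax attention weights."""
    alpha = np.exp(logits - logits.max())
    alpha = alpha / alpha.sum()               # soft attention over the M filters
    A = filters @ V                           # (M, d): each filter pools over time
    return alpha @ A                          # (d,) attention-weighted key feature
```

Because each filter row and the attention weights both sum to 1, the key feature is a convex combination of the frame features, i.e. a behavior-guided temporal pooling.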
the third step: overlapping the key features and the 2D features according to a time sequence to construct fusion features;
in the decoding stage:
the fourth step: first, a new fusion feature is extracted at each time step by applying an attention mechanism to the fusion features. That is, the attention mechanism encodes the fusion features R at time t, normalized weights are obtained through a softmax function, and the fusion features R are multiplied by the attention weights to obtain the new fusion feature, formulated as follows:

$$e_i = w^\top \tanh(W h_{t-1} + U r_i + b), \qquad \beta_i = \frac{\exp(e_i)}{\sum_j \exp(e_j)}, \qquad R' = \sum_i \beta_i r_i,$$

wherein w represents the attention vector, $h_{t-1}$ the hidden state output by the LSTM, R the fusion features (with rows $r_i$), W and U weight matrices, b a bias, and $R'$ the new fusion feature.
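The attention step of the fourth step can be sketched in NumPy as below. The shapes and parameter names (W, U, w, b) are our assumptions standing in for the learned weight matrices, attention vector and bias named in the description: each fusion feature r_i (a row of R) is scored against the previous hidden state, the scores are softmax-normalized, and the weighted sum is the new fusion feature.

```python
import numpy as np

def attend(R, h_prev, W, U, w, b):
    """Score each fusion feature r_i (row of R) against the LSTM hidden
    state, softmax-normalize, and return the weighted sum as the new
    fusion feature.  R: (T, dr), h_prev: (dh,), W: (dh, da), U: (dr, da),
    w: (da,), b: (da,)."""
    scores = np.tanh(h_prev @ W + R @ U + b) @ w   # (T,) one score per step
    beta = np.exp(scores - scores.max())
    beta = beta / beta.sum()                       # softmax weights
    return beta @ R                                # (dr,) new fusion feature
```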
The fifth step: the new fusion features are input into a long short-term memory (LSTM) network and decoded over time to obtain the description sentence of the behavior in the video. Specifically, at each time step the LSTM decodes the video features to obtain a hidden state and a storage (cell) state; by decoding the feature information, one word is produced per time step, and finally a complete descriptive sentence is obtained. The input of each LSTM cell is a new fusion feature, and its output is a word sequence.
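To make the fifth step concrete, here is a bare-bones greedy decoding loop with a NumPy LSTM cell. Everything here (the weight layout, the vocabulary projection Wout, greedy argmax) is an illustrative assumption; the patent only specifies that each step consumes a new fusion feature, updates the hidden and storage states, and emits one word.

```python
import numpy as np

def lstm_step(x, h, c, Wx, Wh, b):
    """One LSTM step: the four gates are computed jointly, then the
    storage (cell) state c and hidden state h are updated."""
    H = h.shape[0]
    z = x @ Wx + h @ Wh + b                       # (4H,) all four gates
    sig = lambda a: 1.0 / (1.0 + np.exp(-a))
    i, f, o = sig(z[:H]), sig(z[H:2*H]), sig(z[2*H:3*H])
    g = np.tanh(z[3*H:])
    c_new = f * c + i * g                         # storage (cell) state
    h_new = o * np.tanh(c_new)                    # hidden state
    return h_new, c_new

def decode_words(fusion_feats, Wx, Wh, b, Wout):
    """Greedy decode: one word id per time step from the fusion features."""
    H = Wh.shape[0]
    h, c = np.zeros(H), np.zeros(H)
    words = []
    for x in fusion_feats:                        # fusion_feats: (T, d)
        h, c = lstm_step(x, h, c, Wx, Wh, b)
        words.append(int(np.argmax(h @ Wout)))    # assumed vocab projection
    return words
```

With Wx of shape (d, 4H), Wh of shape (H, 4H), b of shape (4H,) and Wout of shape (H, vocab_size), the loop emits one word id per fusion feature, matching the "one word per moment" description above.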
Corresponding to the method of the above embodiment, another embodiment of the present invention further provides a video encoding and decoding system, which includes:
a first extraction unit for extracting 3D features of a sequence of video frames;
in this embodiment, the first extraction unit is a 3D convolutional neural network;
a second extraction unit for extracting 2D features of the sequence of video frames;
in this embodiment, the second extraction unit is a 2D convolutional neural network;
a third extraction unit, configured to extract key information of the 3D feature;
the third extraction unit specifically performs the following operation: a behavior filter composed of multiple groups of N Cauchy distributions is matrix-multiplied with the 3D features to obtain the key features (the specific implementation is the same as in the above embodiment and is not repeated here).
The first fusion unit is used for overlapping the key features and the 2D features according to a time sequence to construct fusion features;
the second fusion unit encodes the fusion features at time t, obtains normalized weights through a softmax function, and multiplies the fusion features by the normalized weights to obtain new fusion features;
and the long-short term memory network is used for outputting description sentences related to the video frame sequence after inputting the new fusion characteristics.
Each unit of the encoding and decoding system is implemented in the same way as in the foregoing method embodiment.
The encoding and decoding system can be deployed on computer equipment such as a microprocessor or a host computer.
Claims (10)
1. A video encoding and decoding method, comprising the steps of:
s1, respectively extracting 3D features and 2D features of the video frame sequence;
s2, processing the 3D features to obtain key features; overlapping the key features and the 2D features according to a time sequence to construct fusion features;
s3, encoding the fusion features at time t, obtaining normalized weights through a softmax function, and multiplying the fusion features by the normalized weights to obtain new fusion features;
s4, inputting the new fusion features into a long short-term memory network to obtain a description sentence about the video frame sequence.
2. The video coding/decoding method according to claim 1, wherein in step S1, a 3D convolutional neural network is used to extract 3D features of the sequence of video frames.
3. The video coding/decoding method according to claim 1, wherein in step S1, 2D features of the video frame sequence are extracted by using a 2D convolutional neural network.
4. The video coding and decoding method according to claim 3, wherein the 2D convolutional neural network adopts an Inception v3 network pre-trained on ImageNet as a backbone network.
5. The video coding and decoding method of claim 1, wherein the video frame sequence obtaining method comprises: a fixed number of frames are sampled from the entire video to form the sequence of video frames.
7. A video coding/decoding system, comprising:
a first extraction unit for extracting 3D features of a sequence of video frames;
a second extraction unit for extracting 2D features of the sequence of video frames;
a third extraction unit, configured to extract key information of the 3D feature;
the first fusion unit is used for overlapping the key features and the 2D features according to a time sequence to construct fusion features;
a second fusion unit, for encoding the fusion features at time t, obtaining normalized weights through a softmax function, and multiplying the fusion features by the normalized weights to obtain new fusion features;
and the long-short term memory network is used for outputting description sentences related to the video frame sequence after inputting the new fusion characteristics.
8. The system of claim 7, wherein the first extraction unit is a 3D convolutional neural network.
9. The system of claim 7, wherein the second extraction unit is a 2D convolutional neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110483437.5A CN113099228B (en) | 2021-04-30 | 2021-04-30 | Video encoding and decoding method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110483437.5A CN113099228B (en) | 2021-04-30 | 2021-04-30 | Video encoding and decoding method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113099228A true CN113099228A (en) | 2021-07-09 |
CN113099228B CN113099228B (en) | 2024-04-05 |
Family
ID=76681265
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110483437.5A Active CN113099228B (en) | 2021-04-30 | 2021-04-30 | Video encoding and decoding method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113099228B (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100158099A1 (en) * | 2008-09-16 | 2010-06-24 | Realnetworks, Inc. | Systems and methods for video/multimedia rendering, composition, and user interactivity |
US20170262705A1 (en) * | 2016-03-11 | 2017-09-14 | Qualcomm Incorporated | Recurrent networks with motion-based attention for video understanding |
CN108388900A (en) * | 2018-02-05 | 2018-08-10 | 华南理工大学 | The video presentation method being combined based on multiple features fusion and space-time attention mechanism |
CN109344288A (en) * | 2018-09-19 | 2019-02-15 | 电子科技大学 | A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism |
CN109874029A (en) * | 2019-04-22 | 2019-06-11 | 腾讯科技(深圳)有限公司 | Video presentation generation method, device, equipment and storage medium |
WO2020037965A1 (en) * | 2018-08-21 | 2020-02-27 | 北京大学深圳研究生院 | Method for multi-motion flow deep convolutional network model for video prediction |
- 2021-04-30: CN application CN202110483437.5A granted as patent CN113099228B (Active)
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100158099A1 (en) * | 2008-09-16 | 2010-06-24 | Realnetworks, Inc. | Systems and methods for video/multimedia rendering, composition, and user interactivity |
US20170262705A1 (en) * | 2016-03-11 | 2017-09-14 | Qualcomm Incorporated | Recurrent networks with motion-based attention for video understanding |
CN108388900A (en) * | 2018-02-05 | 2018-08-10 | 华南理工大学 | The video presentation method being combined based on multiple features fusion and space-time attention mechanism |
WO2020037965A1 (en) * | 2018-08-21 | 2020-02-27 | 北京大学深圳研究生院 | Method for multi-motion flow deep convolutional network model for video prediction |
CN109344288A (en) * | 2018-09-19 | 2019-02-15 | 电子科技大学 | A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism |
CN109874029A (en) * | 2019-04-22 | 2019-06-11 | 腾讯科技(深圳)有限公司 | Video presentation generation method, device, equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
谭等泰, 李世超, et al.: "Behavior recognition model with multi-feature fusion" (多特征融合的行为识别模型), Journal of Image and Graphics (《中国图象图形学报》), 16 December 2020 (2020-12-16), pages 2541-2552 *
Also Published As
Publication number | Publication date |
---|---|
CN113099228B (en) | 2024-04-05 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109063164A (en) | A kind of intelligent answer method based on deep learning | |
CN108549658A (en) | A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree | |
CN108960063A (en) | It is a kind of towards event relation coding video in multiple affair natural language description algorithm | |
CN115687687B (en) | Video segment searching method and system for open domain query | |
CN112800203B (en) | Question-answer matching method and system fusing text representation and knowledge representation | |
CN113504906B (en) | Code generation method and device, electronic equipment and readable storage medium | |
CN110427629A (en) | Semi-supervised text simplified model training method and system | |
CN116204674B (en) | Image description method based on visual concept word association structural modeling | |
CN112733043B (en) | Comment recommendation method and device | |
CN112364148B (en) | Deep learning method-based generative chat robot | |
CN113807222A (en) | Video question-answering method and system for end-to-end training based on sparse sampling | |
CN112699310A (en) | Cold start cross-domain hybrid recommendation method and system based on deep neural network | |
CN111949886A (en) | Sample data generation method and related device for information recommendation | |
CN114648032B (en) | Training method and device of semantic understanding model and computer equipment | |
Wang et al. | Self-information loss compensation learning for machine-generated text detection | |
CN115525744A (en) | Dialog recommendation system based on prompt learning method | |
CN116484024A (en) | Multi-level knowledge base construction method based on knowledge graph | |
Zhao et al. | Shared-private memory networks for multimodal sentiment analysis | |
CN114240697A (en) | Method and device for generating broker recommendation model, electronic equipment and storage medium | |
Tang et al. | Predictive modelling of student behaviour using granular large-scale action data | |
CN109710787A (en) | Image Description Methods based on deep learning | |
CN112668481A (en) | Semantic extraction method for remote sensing image | |
CN114579869B (en) | Model training method and related product | |
CN113099228A (en) | Video coding and decoding method and system | |
CN110852066A (en) | Multi-language entity relation extraction method and system based on confrontation training mechanism |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||