CN113099228B - Video encoding and decoding method and system - Google Patents
- Publication number
- CN113099228B CN113099228B CN202110483437.5A CN202110483437A CN113099228B CN 113099228 B CN113099228 B CN 113099228B CN 202110483437 A CN202110483437 A CN 202110483437A CN 113099228 B CN113099228 B CN 113099228B
- Authority
- CN
- China
- Prior art keywords
- features
- video
- fusion
- feature
- sequence
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Links
- 238000000034 method Methods 0.000 title claims abstract description 25
- 230000004927 fusion Effects 0.000 claims abstract description 53
- 230000006399 behavior Effects 0.000 claims abstract description 31
- 230000006870 function Effects 0.000 claims abstract description 11
- 230000006403 short-term memory Effects 0.000 claims abstract description 6
- 238000000605 extraction Methods 0.000 claims description 18
- 238000013527 convolutional neural network Methods 0.000 claims description 14
- 238000009826 distribution Methods 0.000 claims description 9
- 239000011159 matrix material Substances 0.000 claims description 8
- 230000003542 behavioural effect Effects 0.000 claims description 4
- 230000007787 long-term memory Effects 0.000 claims description 3
- 238000012545 processing Methods 0.000 claims description 3
- 230000015654 memory Effects 0.000 claims description 2
- 230000007246 mechanism Effects 0.000 abstract description 7
- 230000003068 static effect Effects 0.000 abstract description 6
- 230000001737 promoting effect Effects 0.000 abstract 1
- 238000011161 development Methods 0.000 description 5
- 230000008569 process Effects 0.000 description 5
- 230000001364 causal effect Effects 0.000 description 4
- 238000001914 filtration Methods 0.000 description 3
- 230000002123 temporal effect Effects 0.000 description 3
- 230000000007 visual effect Effects 0.000 description 3
- 230000008901 benefit Effects 0.000 description 2
- 238000004364 calculation method Methods 0.000 description 2
- 239000000284 extract Substances 0.000 description 2
- 238000012549 training Methods 0.000 description 2
- 238000013473 artificial intelligence Methods 0.000 description 1
- 238000013528 artificial neural network Methods 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 238000013135 deep learning Methods 0.000 description 1
- 230000007547 defect Effects 0.000 description 1
- 230000001419 dependent effect Effects 0.000 description 1
- 230000000694 effects Effects 0.000 description 1
- 230000007613 environmental effect Effects 0.000 description 1
- 238000011156 evaluation Methods 0.000 description 1
- 230000007774 longterm Effects 0.000 description 1
- 238000010801 machine learning Methods 0.000 description 1
- 238000005070 sampling Methods 0.000 description 1
- 230000009466 transformation Effects 0.000 description 1
- 238000000844 transformation Methods 0.000 description 1
- 230000007704 transition Effects 0.000 description 1
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/10—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
- H04N19/169—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
- H04N19/17—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
- H04N19/172—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/42—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04N—PICTORIAL COMMUNICATION, e.g. TELEVISION
- H04N19/00—Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
- H04N19/44—Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- General Health & Medical Sciences (AREA)
- Life Sciences & Earth Sciences (AREA)
- Artificial Intelligence (AREA)
- Biomedical Technology (AREA)
- Biophysics (AREA)
- Computational Linguistics (AREA)
- Signal Processing (AREA)
- Evolutionary Computation (AREA)
- Health & Medical Sciences (AREA)
- Molecular Biology (AREA)
- Computing Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Multimedia (AREA)
- Compression Or Coding Systems Of Tv Signals (AREA)
Abstract
The invention discloses a video coding and decoding method and system. First, the 2D features and the processed 3D features are superposed in time sequence to realize a deep fusion of static and dynamic information. Then, an attention mechanism is introduced to encode the fusion features at each time t; normalized weights are obtained through a softmax function and assigned to the fusion features to obtain new fusion features, so that human-centred features are learned, which benefits the final language description of human behavior. Finally, the new fusion features are input into a long short-term memory (LSTM) network and decoded over time to obtain the video description sentence. The video descriptions obtained by the method are more logical and fluent, with consistent and clear semantics.
Description
Technical Field
The invention relates to the field of machine learning, in particular to a video coding and decoding method and a video coding and decoding system.
Background
Currently, deep learning algorithms in artificial intelligence are capable of performing video description, readily converting video information into language content. For example, by forming a precise text abstract of massive video information before a user views it, the user can quickly grasp how an event developed and what influence it had, saving a great deal of time. Likewise, extracting the highlights of a two-hour movie and converting them into a text outline summarizing the film brings the user a better recommendation experience. However, such indiscriminate description of video information does not embody the imagination, curiosity and wisdom with which humans understand things. Although text information can be extracted from a large amount of video, the high-value knowledge available to people is very limited. An excellent machine understanding algorithm should therefore describe the events taking place in a human way of thinking and understand the development laws of things from a human first-person perspective, so that machines can understand video at a more intelligent level.
In general, the events occurring in a video are closely connected and causally related, and they are the source material for understanding tasks. The transition from the end of one event to a new event is mostly driven by human behavior. Human behavior, it may be said, dominates the development of events and the causes and effects between them, so it is necessary to follow human behavior when exploring the development laws of events and strengthening the understanding of their causal relationships. Traditional video understanding methods find it difficult to fully consider the temporal relevance of human behavior across the frames of a video and the causal relationships between events. The extracted global temporal features contain a large number of redundant frame features, which consumes enormous computing power and makes the model converge too slowly during training; as a result, the development laws of things cannot be well understood from a human perspective with behavior as the clue, and the machine cannot understand video more intelligently.
Disclosure of Invention
Aiming at the defects of the prior art, the invention provides a video coding and decoding method and system that improve the logicality and accuracy of video understanding tasks.
In order to solve the technical problems, the invention adopts the following technical scheme: a video encoding and decoding method comprising the steps of:
s1, respectively extracting 3D features and 2D features of a video frame sequence;
s2, processing the 3D features to obtain key features; overlapping the key features and the 2D features according to time sequence to construct fusion features;
s3, encoding the fusion feature at a time t, obtaining a normalized weight through a softmax function, and multiplying the fusion feature by the normalized weight to obtain a new fusion feature;
s4, inputting the new fusion features into a long short-term memory (LSTM) network to obtain a description sentence about the video frame sequence.
In order to construct a strong feature representation of video, the invention considers not only static image information but also dynamic information with time as a clue. The invention therefore proposes a hybrid 2D/3D convolutional network. The 2D convolutional network and the 3D convolutional network extract the 2D and 3D features of the video frames, which represent static and dynamic video information respectively. The 2D features cover single-frame information such as the environment, objects and human behavior; the 3D features not only compensate for the lack of context information when decoding single-frame features but also form event feature representations over long time intervals, which both encode the temporal relationships of events in the visual features and enhance the logical coherence of the final descriptive sentences. The 2D features and the processed 3D features are superposed in time sequence to realize a deep fusion of static and dynamic information. At time t, the fusion features are encoded and normalized weights are obtained through a softmax function; assigning different weights to the fusion features allows human-centred features to be learned, which benefits the final language description of human behavior. An LSTM can learn long-term dependencies and is well suited to strongly time-series-dependent problems, so the invention uses an LSTM network to decode the features carrying human behavior information and then render them as text.
In step S1, 3D features of the video frame sequence are extracted using a 3D convolutional neural network. The 3D convolution is more suitable for learning the space-time characteristics than the 2D convolution, and the 3D convolution neural network can capture the time relation between video frames.
In step S1, 2D features of the video frame sequence are extracted using a 2D convolutional neural network. The 2D convolutional neural network can extract feature information such as environment, objects, and human behavior, which helps fully mine behavior features in the video.
The 2D convolutional neural network adopts an ImageNet-pre-trained Inception v3 network as the backbone. The asymmetric convolution structure introduced in Inception v3 processes more and richer spatial features and increases feature diversity compared with a symmetric convolution structure, while also reducing the amount of computation.
The video frame sequence acquisition method comprises the following steps: a fixed number of frames are sampled from the whole video, constituting the sequence of video frames. This may cover the long timing structure used to understand the video, i.e. the sampled frames will cover the entire video, regardless of the length of the video.
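As an illustration of the sampling strategy above, the following minimal Python sketch chooses a fixed number of frame indices spread uniformly over the whole video. The function name and the centre-of-segment placement are assumptions; the patent only fixes that a constant number of frames spans the entire video:

```python
def sample_frame_indices(num_video_frames, t):
    """Return t frame indices spread uniformly over [0, num_video_frames),
    so the sampled frames cover the whole video regardless of its length."""
    if num_video_frames <= 0 or t <= 0:
        return []
    step = num_video_frames / t
    # take the centre of each of the t equal-length segments
    return [min(num_video_frames - 1, int(step * (i + 0.5))) for i in range(t)]
```

Note that when the video is shorter than t frames, some indices repeat, which still yields a constant-length sequence as the method requires.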
In step S2, a behavior filter K_C composed of multiple groups of N Cauchy distributions is matrix-multiplied with the 3D feature v_t to obtain the key feature S_ct. Filtering the 3D global features with Cauchy-distribution filters lets the model implicitly understand which timestamps in the entire video frame sequence are more important for a description based on human behavior.
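The Cauchy-filter step can be sketched as follows in plain Python. The filter centres, the bandwidth gamma, and the per-filter normalisation over time are illustrative assumptions — the patent specifies only that a bank of Cauchy-distribution filters is matrix-multiplied with the 3D features:

```python
import math

def cauchy_weights(num_steps, center, gamma):
    """Temporal weights from a Cauchy (Lorentzian) density centred at
    `center`, normalised to sum to 1 over the num_steps time steps."""
    w = [gamma / (math.pi * ((t - center) ** 2 + gamma ** 2))
         for t in range(num_steps)]
    s = sum(w)
    return [x / s for x in w]

def key_features(v, centers, gamma=2.0):
    """Matrix-multiply a bank of Cauchy filters (one row per filter centre)
    with the 3D features v (T x D lists) to obtain key features (M x D):
    each output row is a Cauchy-weighted temporal average of v."""
    num_steps, dim = len(v), len(v[0])
    out = []
    for c in centers:
        w = cauchy_weights(num_steps, c, gamma)
        out.append([sum(w[t] * v[t][d] for t in range(num_steps))
                    for d in range(dim)])
    return out
```

Because each filter's weights sum to one, a filter centred near an important timestamp emphasises the 3D features around that moment while suppressing the rest of the sequence.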
The invention also provides a video coding and decoding system, which comprises:
a first extraction unit for extracting 3D features of a video frame sequence;
a second extraction unit for extracting 2D features of the video frame sequence;
a third extraction unit, configured to extract key information of the 3D feature;
the first fusion unit is used for superposing the key features and the 2D features in time sequence to construct fusion features;
a second fusion unit for encoding the fusion feature at time t, obtaining a normalized weight through a softmax function, and multiplying the fusion feature by the normalized weight to obtain a new fusion feature;
and the long-term and short-term memory network is used for outputting a description sentence related to the video frame sequence after the new fusion characteristic is input.
The first extraction unit is a 3D convolutional neural network.
The second extraction unit is a 2D convolutional neural network.
The third extraction unit specifically performs the following operations: a behavior filter K_C consisting of multiple groups of N Cauchy distributions is matrix-multiplied with the 3D feature v_t to obtain the key feature S_ct.
Compared with the prior art, the invention has the following beneficial effects: the video descriptions obtained by the method are more logical and fluent, with consistent and clear semantics. Compared with the human reference sentences, the model provided by the invention closely follows human behavior information when generating descriptions, so that not only is the event information in the video captured, but the antecedents and outcomes of events are also linked through human actions. These advantages benefit not only from the guidance based on human behavior but also from the dynamic information combined by the hybrid 2D/3D convolution, which gives the descriptive sentences temporal consistency and causal relevance with respect to the video information.
Drawings
FIG. 1 is an overview of a hybrid 2D/3D convolutional network under human behavior guidance;
FIG. 2 is an evaluation of the method of the present invention on a Charades dataset;
fig. 3 is an example of machine description of video. (Example one: the reference sentence is that a person wearing a plaid shirt stands in a bathroom, takes off the shirt, and then picks up a yellow broom; the description of the invention is that a person walks into the room, takes off the shirt and places it on a shelf, and then the person picks up a broom. Example two: the reference sentence is that a person sits in a hallway, picks up a book, and then stands up again; the description of the invention is that a person sits on the floor, then picks up a book and places it on the floor.)
Detailed Description
The present invention proposes a hybrid 2D/3D convolutional network (Mixed 2D/3D Convolutional Networks, MCN). A dual-branch network structure is constructed: the first branch uses a 2D convolutional network to generate per-frame features, while the second branch uses a 3D convolutional network to refine the global feature information in all frames of the video.
Constructing the deep fusion of static and dynamic video information: a fixed number of frames is first sampled from the entire video to cover the long-range temporal structure used to understand the video. The sampled frames span the entire video regardless of its length. We therefore input a frame sequence of constant length T frame-by-frame into the 2D convolution branch to generate the single-frame visual features f_t, where T denotes the number of sampled frames. The 2D convolution branch adopts an ImageNet-pre-trained Inception v3 network as the backbone and extracts all single-frame visual features.
Since 3D convolution is better suited to spatio-temporal feature learning than 2D convolution, a 3D convolution network is introduced to capture the temporal relationships between frames. The sampled frame sequence is input into the 3D convolution branch to generate the global feature representation v_t of the video clip. The output global feature not only compensates for the lack of context information when decoding single-frame features, but also forms event feature representations over long time intervals; associated frames are closely linked, which helps the final descriptive sentence achieve logical coherence. Then, temporal filtering and a soft attention mechanism are applied to the 3D feature v_t to extract the key features. Finally, the key features and the 2D features f_t are superposed in time sequence to construct the fusion features. In the decoding process, we first apply an attention mechanism to the fusion features to extract a new fusion feature at each moment. The new fusion features are then input into a long short-term memory (LSTM) network, decoded over time, and a descriptive sentence about the behavior in the video is finally output.
Since the global features obtained by 3D convolution contain environmental information, objects, human behavior and so on, treating them all equally would leave the features insensitive to changes in human behavior. Considering that the 2D convolution network has already fully extracted the feature information of the environment, objects and human behavior, we wish to construct a set of behavior filters. These filters follow human behavior at different moments to refine the 3D convolution features, so that the filtered features not only fully mine the behavior features but also explore the causal relationships between behaviors at different moments. The invention therefore provides a set of temporal filters based on the Cauchy distribution to capture the key information in the global features and mine the correlations between the frames of the video. Filtering the global features obtained by the 3D convolution branch forms an implicit state vector containing the key behavior information.
The inventive method first selectively encodes temporal features of video information according to constraints of human behavior. The language description is then decoded by combining the static image features and the dynamic time series features, and the specific flow is as follows:
in the encoding phase:
the first step: firstly, respectively extracting 3D features and 2D features by using two branches of a 3D convolution network and a 2D convolution network;
and a second step of: cauchy filter filtering is performed on the 3D features and key features are extracted using a soft attention mechanism. In particular, a behavioural filter consisting of a plurality of sets of N-Kouchy distributionsFor performing the feature of the frame sequence->Matrix multiplication of (2) to obtain->Vitamin key feature->I.e. +.>Wherein, the method comprises the steps of, wherein,Mthe number of filters is indicated and the number of filters is indicated,cthe category of behavior is indicated and,Trepresenting the duration of the video +.>Representation oftTime of daymA filter (L)>Representing soft attention coefficients, realized by softmax function, calculated as +.>Wherein->Representing information about behavior categoriescIs the first of (2)iThe weights of the filters are automatically learned by the convolutional neural network during the training process. The soft-attention mechanism is based on the global features of the cauchy distribution filter, thus making the model implicitly understand which timestamps in the entire video frame sequence are more important for human behavior-based descriptions;
and a third step of: superposing the key features and the 2D features according to time sequences to construct fusion features;
in the decoding stage:
fourth step: first, new fusion features are extracted at each moment by using the attention mechanism to the fusion features. I.e. the mechanism of attention is introduced in timetFor fusion featuresREncoding and obtaining normalized weights through softmax function, and then fusing the featuresRMultiplied by the attention weight to obtain a new fusion feature, the formula of which is as follows:
wherein,representing attention vector, ++>Representing the hidden state of the LSTM output,Rrepresenting fusion features->And->The weight matrix is represented by a matrix of weights,bis biased (is->Representing new fusion features.
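The attention re-weighting of the fourth step can be sketched as follows; for brevity the raw energies e_t are passed in directly instead of being computed from the LSTM hidden state and the weight matrices, which is the part a trained model would supply:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

def attend(R, energies):
    """Normalise raw attention energies with softmax and return the
    alpha-weighted sum of the fusion features R (T x D) — the new
    fusion feature R' for the current decoding step."""
    alpha = softmax(energies)
    dim = len(R[0])
    return [sum(alpha[t] * R[t][d] for t in range(len(R)))
            for d in range(dim)]
```

With uniform energies every time step contributes equally; a peaked energy vector concentrates the new fusion feature on the time steps most relevant to the human behavior being described.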
Fifth step: the new fusion features are input into a long short-term memory (LSTM) network and decoded over time to obtain a descriptive sentence about the behavior in the video. Specifically, at each instant the LSTM decodes the video features to obtain the hidden state h_t and the cell state c_t. One word is obtained at each moment by decoding the feature information, and finally a complete descriptive sentence is obtained. For each LSTM cell, the input x_t is the new fusion feature and the output is a word sequence (y_1, y_2, …, y_n).
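A single LSTM decoding step of the fifth step, written out in plain Python with the standard gate equations; the weight values here are placeholders that a real decoder would learn during training:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h, c, W):
    """One LSTM decoding step: input x (the new fusion feature), previous
    hidden state h and cell state c. W maps each gate name ('i', 'f',
    'o', 'g') to a tuple (Wx, Wh, b) of row-major weight matrices and a
    bias vector. Returns the new hidden and cell states (h_t, c_t)."""
    def lin(Wx, Wh, b):
        return [sum(Wx[k][j] * x[j] for j in range(len(x))) +
                sum(Wh[k][j] * h[j] for j in range(len(h))) + b[k]
                for k in range(len(b))]
    i = [sigmoid(v) for v in lin(*W['i'])]    # input gate
    f = [sigmoid(v) for v in lin(*W['f'])]    # forget gate
    o = [sigmoid(v) for v in lin(*W['o'])]    # output gate
    g = [math.tanh(v) for v in lin(*W['g'])]  # candidate cell state
    c_new = [f[k] * c[k] + i[k] * g[k] for k in range(len(c))]
    h_new = [o[k] * math.tanh(c_new[k]) for k in range(len(c))]
    return h_new, c_new
```

In the full decoder, h_t would be projected through a softmax over the vocabulary to emit one word per step, yielding the complete descriptive sentence.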
Another embodiment of the present invention provides a video encoding and decoding system, corresponding to the method of the above embodiment, including:
a first extraction unit for extracting 3D features of a video frame sequence;
in this embodiment, the first extraction unit is a 3D convolutional neural network;
a second extraction unit for extracting 2D features of the video frame sequence;
in this embodiment, the second extraction unit is a 2D convolutional neural network;
a third extraction unit, configured to extract key information of the 3D feature;
the third extraction unit specifically performs the following operations: by using a plurality of sets of N Kexil distribution structuresFinished behavior filterAnd 3D feature->Performing matrix multiplication to obtain key feature +.>(the specific implementation process is the same as that of the above embodiment, and will not be repeated here).
The first fusion unit is used for superposing the key features and the 2D features in time sequence to construct fusion features;
the second fusion unit is used for encoding the fusion characteristics at the moment t, obtaining normalized weights through softmax functions, and multiplying the fusion characteristics by the normalized weights to obtain new fusion characteristics;
and the long-term and short-term memory network is used for outputting a description sentence related to the video frame sequence after the new fusion characteristic is input.
The implementation process of each unit in the coding and decoding system of the present invention is the same as that of the foregoing embodiment.
The coding and decoding system of the invention can be configured in computer equipment; the computer equipment can be a microprocessor, a host computer, and the like.
Claims (8)
1. A video encoding and decoding method, comprising the steps of:
s1, respectively extracting 3D features and 2D features of a video frame sequence;
s2, processing the 3D features to obtain key features; overlapping the key features and the 2D features according to time sequence to construct fusion features;
s3, encoding the fusion feature at a time t, obtaining a normalized weight through a softmax function, and multiplying the fusion feature by the normalized weight to obtain a new fusion feature;
s4, inputting the new fusion features into a long short-term memory network to obtain a description sentence about the video frame sequence;
in step S2, a behavior filter K_C composed of a plurality of sets of N Cauchy distributions and the 3D feature v_t undergo a matrix multiplication operation to obtain the key feature S_ct.
2. The video coding method according to claim 1, characterized in that in step S1, 3D features of the sequence of video frames are extracted using a 3D convolutional neural network.
3. The video coding method according to claim 1, characterized in that in step S1, 2D features of the sequence of video frames are extracted using a 2D convolutional neural network.
4. A video codec method according to claim 3, wherein the 2D convolutional neural network employs an ImageNet-pre-trained Inception v3 network as a backbone network.
5. The video encoding and decoding method according to claim 1, wherein the video frame sequence acquisition method is: a fixed number of frames are sampled from the whole video, constituting the sequence of video frames.
6. A video codec system, comprising:
a first extraction unit for extracting 3D features of a video frame sequence;
a second extraction unit for extracting 2D features of the video frame sequence;
a third extraction unit, configured to extract key information of the 3D feature; the third extraction unit specifically performs the following operations: using a behavior filter K_C consisting of a plurality of sets of N Cauchy distributions and the 3D feature v_t, performing a matrix multiplication operation to obtain the key feature S_ct;
The first fusion unit is used for superposing the key features and the 2D features in time sequence to construct fusion features;
the second fusion unit is used for encoding the fusion characteristics at the moment t, obtaining normalized weights through softmax functions, and multiplying the fusion characteristics by the normalized weights to obtain new fusion characteristics;
and the long-term and short-term memory network is used for outputting a description sentence related to the video frame sequence after the new fusion characteristic is input.
7. The system of claim 6, wherein the first extraction unit is a 3D convolutional neural network.
8. The system of claim 6, wherein the second extraction unit is a 2D convolutional neural network.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110483437.5A CN113099228B (en) | 2021-04-30 | 2021-04-30 | Video encoding and decoding method and system |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110483437.5A CN113099228B (en) | 2021-04-30 | 2021-04-30 | Video encoding and decoding method and system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113099228A CN113099228A (en) | 2021-07-09 |
CN113099228B true CN113099228B (en) | 2024-04-05 |
Family
ID=76681265
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110483437.5A Active CN113099228B (en) | 2021-04-30 | 2021-04-30 | Video encoding and decoding method and system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113099228B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108388900A (en) * | 2018-02-05 | 2018-08-10 | 华南理工大学 | The video presentation method being combined based on multiple features fusion and space-time attention mechanism |
CN109344288A (en) * | 2018-09-19 | 2019-02-15 | 电子科技大学 | A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism |
CN109874029A (en) * | 2019-04-22 | 2019-06-11 | 腾讯科技(深圳)有限公司 | Video presentation generation method, device, equipment and storage medium |
WO2020037965A1 (en) * | 2018-08-21 | 2020-02-27 | 北京大学深圳研究生院 | Method for multi-motion flow deep convolutional network model for video prediction |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2010033642A2 (en) * | 2008-09-16 | 2010-03-25 | Realnetworks, Inc. | Systems and methods for video/multimedia rendering, composition, and user-interactivity |
US10049279B2 (en) * | 2016-03-11 | 2018-08-14 | Qualcomm Incorporated | Recurrent networks with motion-based attention for video understanding |
-
2021
- 2021-04-30 CN CN202110483437.5A patent/CN113099228B/en active Active
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108388900A (en) * | 2018-02-05 | 2018-08-10 | 华南理工大学 | The video presentation method being combined based on multiple features fusion and space-time attention mechanism |
WO2020037965A1 (en) * | 2018-08-21 | 2020-02-27 | 北京大学深圳研究生院 | Method for multi-motion flow deep convolutional network model for video prediction |
CN109344288A (en) * | 2018-09-19 | 2019-02-15 | 电子科技大学 | A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism |
CN109874029A (en) * | 2019-04-22 | 2019-06-11 | 腾讯科技(深圳)有限公司 | Video presentation generation method, device, equipment and storage medium |
Non-Patent Citations (1)
Title |
---|
Behavior recognition model with multi-feature fusion; Tan Dengtai, Li Shichao, et al.; Journal of Image and Graphics (中国图象图形学报); 20201216; 2541-2552 *
Also Published As
Publication number | Publication date |
---|---|
CN113099228A (en) | 2021-07-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN112668671B (en) | Method and device for acquiring pre-training model | |
CN109947912B (en) | Model method based on intra-paragraph reasoning and joint question answer matching | |
CN108388900A (en) | The video presentation method being combined based on multiple features fusion and space-time attention mechanism | |
CN108416065A (en) | Image based on level neural network-sentence description generates system and method | |
CN109871736A (en) | The generation method and device of natural language description information | |
Yu et al. | Spae: Semantic pyramid autoencoder for multimodal generation with frozen llms | |
CN110457661A (en) | Spatial term method, apparatus, equipment and storage medium | |
CN116977457A (en) | Data processing method, device and computer readable storage medium | |
CN110889505B (en) | Cross-media comprehensive reasoning method and system for image-text sequence matching | |
CN111191461B (en) | Remote supervision relation extraction method based on course learning | |
CN112364148A (en) | Deep learning method-based generative chat robot | |
CN115908991A (en) | Image description model method, system, device and medium based on feature fusion | |
CN113657272B (en) | Micro video classification method and system based on missing data completion | |
Yan et al. | Intra-agent speech permits zero-shot task acquisition | |
CN109710787A (en) | Image Description Methods based on deep learning | |
CN113947074A (en) | Deep collaborative interaction emotion reason joint extraction method | |
CN113240714A (en) | Human motion intention prediction method based on context-aware network | |
CN110826397B (en) | Video description method based on high-order low-rank multi-modal attention mechanism | |
CN116644759B (en) | Method and system for extracting aspect category and semantic polarity in sentence | |
CN113420179A (en) | Semantic reconstruction video description method based on time sequence Gaussian mixture hole convolution | |
CN113099228B (en) | Video encoding and decoding method and system | |
CN116958324A (en) | Training method, device, equipment and storage medium of image generation model | |
CN111898576B (en) | Behavior identification method based on human skeleton space-time relationship | |
Xu et al. | Isolated Word Sign Language Recognition Based on Improved SKResNet‐TCN Network | |
CN113741759A (en) | Comment information display method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |