CN113099228A - Video coding and decoding method and system - Google Patents

Video coding and decoding method and system

Info

Publication number
CN113099228A
Authority
CN
China
Prior art keywords
features
video
fusion
sequence
decoding method
Prior art date
Legal status
Granted
Application number
CN202110483437.5A
Other languages
Chinese (zh)
Other versions
CN113099228B (en)
Inventor
郭克华
申长春
奎晓燕
刘斌
王凌风
刘超
Current Assignee
Hand In Hand Information Technology Co ltd
Central South University
Original Assignee
Hand In Hand Information Technology Co ltd
Central South University
Priority date
Filing date
Publication date
Application filed by Hand In Hand Information Technology Co ltd, Central South University
Priority to CN202110483437.5A
Publication of CN113099228A
Application granted
Publication of CN113099228B
Legal status: Active
Anticipated expiration

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/10Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding
    • H04N19/169Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding
    • H04N19/17Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object
    • H04N19/172Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using adaptive coding characterised by the coding unit, i.e. the structural portion or semantic portion of the video signal being the object or the subject of the adaptive coding the unit being an image region, e.g. an object the region being a picture, frame or field
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/42Methods or arrangements for coding, decoding, compressing or decompressing digital video signals characterised by implementation details or hardware specially adapted for video compression or decompression, e.g. dedicated software implementation
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N19/00Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
    • H04N19/44Decoders specially adapted therefor, e.g. video decoders which are asymmetric with respect to the encoder

Abstract

The invention discloses a video coding and decoding method and system. First, the 2D features and the processed 3D features are stacked in temporal order, achieving a deep fusion of static and dynamic information. Then, an attention mechanism is introduced to encode the fusion features at each time t; normalized weights are obtained through a softmax function and assigned to the fusion features to produce new fusion features, so that human-oriented features are learned and the final language description related to human behavior is improved. Finally, the new fusion features are input into a long short-term memory (LSTM) network and decoded over time to obtain the video description sentence. The video descriptions obtained by the invention are more logical, fluent, coherent and clear.

Description

Video coding and decoding method and system
Technical Field
The invention relates to the field of machine learning, in particular to a video coding and decoding method and a video coding and decoding system.
Background
At present, deep learning algorithms in artificial intelligence can already perform video description, easily converting video information into language. For example, an accurate text summary of a large volume of video lets a user quickly grasp how an event developed and what impact it had, saving considerable time; likewise, converting the highlight segments of a two-hour film into a text outline of the film can give users a better recommendation experience. However, such undifferentiated description of video information does not reflect the imagination, curiosity and intelligence with which humans understand things, and these qualities have always been at the core of being human. Although text can be extracted from massive amounts of video, very little of it constitutes high-value knowledge for people. An excellent machine understanding algorithm should therefore describe the events that occur in a human way of thinking and understand the development of things from a human-first perspective, so that machines can understand video at a more intelligent level.
Generally, the events occurring in a video are closely related and causally linked, and these events are the source material for understanding tasks. The transition from the end of one event to a new event is mostly driven by human behavior. Human behavior thus dominates the development of events and the causes and effects between them, so it is necessary to follow human behavior when exploring how events develop and to strengthen the understanding of causal relationships. Traditional video understanding methods struggle to fully account for the temporal relevance of human behavior across the frames of a video and for the causal relationships between events; the extracted global temporal features contain a large number of redundant frame features, which consumes enormous computing power and makes the model converge too slowly during training. As a result, such methods cannot understand the development of things from a human perspective with behavior as the clue, nor allow a machine to understand video more intelligently.
Disclosure of Invention
The technical problem to be solved by the invention is to provide, in view of the defects of the prior art, a video encoding and decoding method and system that improve the logical coherence and accuracy of video understanding tasks.
In order to solve the technical problems, the technical scheme adopted by the invention is as follows: a video coding and decoding method comprises the following steps:
s1, respectively extracting 3D features and 2D features of the video frame sequence;
s2, processing the 3D features to obtain key features; overlapping the key features and the 2D features according to a time sequence to construct fusion features;
s3, encoding the fusion features at the moment t, obtaining normalization weight through a softmax function, and multiplying the fusion features by the normalization weight to obtain new fusion features;
and S4, inputting the new fusion characteristics into the long-term and short-term memory network to obtain the description sentences about the video frame sequence.
In order to construct a strong feature representation of the video, the invention considers not only static image information but also dynamic information with time as the clue. The invention therefore proposes a mixed 2D/3D convolutional network: a 2D convolutional network and a 3D convolutional network extract the 2D features and the 3D features of the video frames, which represent the static and dynamic video information respectively. The 2D features cover single-frame feature information such as environment, objects and human behavior, while the 3D features make up for the context information that is missing when single-frame features are decoded and form an event feature representation over long time intervals; this representation both captures the temporal relation of events in the visual features and strengthens the logic of the finally output descriptive sentences. The 2D features and the processed 3D features are stacked in temporal order to achieve a deep fusion of static and dynamic information. The fusion features are encoded at each time t, normalized weights are obtained through a softmax function, and different weights are assigned to the fusion features so that human-oriented features are learned, which benefits the final language description related to human behavior. An LSTM can learn long-term dependencies and is well suited to problems that are highly time-dependent, so the invention uses an LSTM network to decode the features carrying human behavior information, which are then described in text.
In step S1, a 3D convolutional neural network is used to extract the 3D features of the video frame sequence. 3D convolution is better suited than 2D convolution to learning spatio-temporal features, and a 3D convolutional neural network can capture the temporal relations between video frames.
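As a concrete illustration of this step, the sketch below keeps the temporal dimension of the 3D features so that later filtering over timestamps remains possible. It assumes PyTorch/torchvision; the r3d_18 backbone and the spatial-pooling choice are illustrative assumptions, since the patent does not name a specific 3D network.

```python
import torch
from torchvision.models.video import r3d_18

def extract_3d_features(clip: torch.Tensor) -> torch.Tensor:
    """clip: (B, 3, T, H, W) sampled video clip -> per-timestep global features (B, T', 512)."""
    backbone = r3d_18(weights="KINETICS400_V1")   # pretrained 3D CNN (illustrative choice)
    backbone.eval()
    with torch.no_grad():
        x = backbone.stem(clip)
        x = backbone.layer1(x)
        x = backbone.layer2(x)
        x = backbone.layer3(x)
        x = backbone.layer4(x)          # (B, 512, T', H', W')
        x = x.mean(dim=[3, 4])          # pool space, keep the temporal axis
    return x.transpose(1, 2)            # (B, T', 512)
```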
In step S1, a 2D convolutional neural network is used to extract the 2D features of the video frame sequence. The 2D convolutional neural network can extract feature information such as environment, objects and human behavior, which helps fully mine the behavior features in the video.
The 2D convolutional neural network adopts an ImageNet-pretrained Inception v3 network as its backbone. The Inception v3 network introduces asymmetric convolution structures, which handle richer spatial features and increase feature diversity better than symmetric convolutions, while also reducing the amount of computation.
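For the 2D branch, a minimal sketch using torchvision's ImageNet-pretrained Inception v3 is shown below; taking the pooled 2048-dimensional output as the per-frame feature is an assumption, since the patent only names the backbone.

```python
import torch
from torchvision.models import inception_v3

def extract_2d_features(frames: torch.Tensor) -> torch.Tensor:
    """frames: (T, 3, 299, 299), resized and ImageNet-normalized -> (T, 2048) per-frame features."""
    net = inception_v3(weights="IMAGENET1K_V1")
    net.fc = torch.nn.Identity()     # keep the pooled 2048-d features instead of class logits
    net.eval()
    with torch.no_grad():
        return net(frames)           # (T, 2048)
```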
The video frame sequence is obtained as follows: a fixed number of frames is sampled from the entire video to form the sequence. This covers the long-range temporal structure needed to understand the video, i.e. the sampled frames span the whole video regardless of its length.
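A minimal sketch of the fixed-count sampling described here, assuming OpenCV for decoding; the choice of 32 frames is illustrative only, not a value specified by the patent.

```python
import cv2
import numpy as np

def sample_frames(video_path: str, num_frames: int = 32) -> np.ndarray:
    """Uniformly sample a fixed number of frames spanning the whole video."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    indices = np.linspace(0, max(total - 1, 0), num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame = cap.read()
        if ok:
            frames.append(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    cap.release()
    return np.stack(frames)          # (num_frames, H, W, 3)
```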
In step S2, a behavior filter composed of multiple groups of N Cauchy distributions is matrix-multiplied with the 3D features to obtain the key features. Filtering the 3D global features with the Cauchy-distribution filter lets the model implicitly learn which timestamps in the whole video frame sequence matter more for a description based on human behavior.
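Since the patent does not spell out how the Cauchy-distribution filters are parameterized, the sketch below shows one plausible construction under assumed names: each of M filters is a Cauchy density over the T sampled timestamps with a learnable location and scale, normalized so each filter sums to one.

```python
import math
import torch
import torch.nn as nn

class CauchyBehaviorFilter(nn.Module):
    """A bank of M temporal filters, each a Cauchy density over T timestamps (a sketch)."""
    def __init__(self, num_filters: int, num_steps: int):
        super().__init__()
        # Learnable location (center) and scale per filter -- an assumed parameterization.
        self.loc = nn.Parameter(torch.linspace(0.0, 1.0, num_filters))
        self.log_scale = nn.Parameter(torch.zeros(num_filters))
        self.register_buffer("t", torch.linspace(0.0, 1.0, num_steps))

    def forward(self) -> torch.Tensor:
        scale = self.log_scale.exp().unsqueeze(1)            # (M, 1)
        diff = self.t.unsqueeze(0) - self.loc.unsqueeze(1)   # (M, T)
        density = 1.0 / (math.pi * scale * (1.0 + (diff / scale) ** 2))
        return density / density.sum(dim=1, keepdim=True)    # each row sums to 1
```

The resulting M by T filter bank can then be matrix-multiplied with the T by D per-timestep 3D features to produce the key features, as described above.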
The invention also provides a video coding and decoding system, which comprises:
a first extraction unit for extracting 3D features of a sequence of video frames;
a second extraction unit for extracting 2D features of the sequence of video frames;
a third extraction unit, configured to extract key information of the 3D feature;
the first fusion unit is used for overlapping the key features and the 2D features according to a time sequence to construct fusion features;
a second fusion unit for encoding the fusion features at time t, obtaining normalized weights through a softmax function, and multiplying the fusion features by the normalized weights to obtain new fusion features;
and the long-short term memory network is used for outputting description sentences related to the video frame sequence after inputting the new fusion characteristics.
The first extraction unit is a 3D convolutional neural network.
The second extraction unit is a 2D convolutional neural network.
The third extraction unit specifically performs the following operation: a behavior filter composed of groups of N Cauchy distributions is matrix-multiplied with the 3D features to obtain the key features.
Compared with the prior art, the invention has the following beneficial effects: the video descriptions obtained by the invention are more logical, fluent, coherent and clear. Compared with standard human sentences, the proposed model keeps a tight grip on descriptions driven by human behavior information; it not only captures the event information in the video but also links the causes and consequences of events to human actions. These advantages stem both from the guidance of human behavior and from the dynamic information contributed by the mixed 2D/3D convolution, which gives the descriptive sentences the temporal consistency and causal relevance of the video information.
Drawings
FIG. 1 is an overview of a hybrid 2D/3D convolutional network under human behavior guidance;
FIG. 2 shows the evaluation results of the method of the present invention on the Charades dataset;
Fig. 3 is an example of a machine description of a video. (Example one: reference sentence: a person wearing a plaid shirt stands in a bathroom, takes off the shirt, and then picks up a yellow broom. Description by the present invention: a person walks into a room, takes off the shirt, places it on a shelf, and then the person picks up a broom.)
Detailed Description
The invention provides a Mixed 2D/3D Convolutional Network (MCN). It constructs a two-branch network structure in which the first branch uses a 2D convolutional network to generate frame features and the second branch uses a 3D convolutional network to refine the global feature information across all frames of the video.
Constructing a deep fusion of static and dynamic video information: a fixed number of frames is first sampled across the whole video to cover the long-range temporal structure used to understand it; the sampled frames span the entire video regardless of its length. This constant-length frame sequence is fed frame by frame into the 2D convolutional network branch to generate the single-frame visual features, one per sampled frame. The 2D convolutional network adopts an ImageNet-pretrained Inception v3 network as its backbone to extract all single-frame visual features.
Since 3D convolution is better suited than 2D convolution to spatio-temporal feature learning, a 3D convolutional network is introduced to capture the temporal relations between frames. The sampled frame sequence is fed into the 3D convolutional network branch to generate a global feature representation of the video segment. The output global features both make up for the context information that is missing when single-frame features are decoded and form an event feature representation over long time intervals; related frames are tightly linked, which helps the final descriptive sentence follow a coherent beginning-to-end logic. Temporal filtering and a soft attention mechanism are then applied to the 3D features to extract the key features. Finally, the key features and the 2D features are stacked in temporal order to construct the fusion features. In the decoding process, an attention mechanism is first applied to the fusion features to extract a new fusion feature at each time step; the new fusion features are then input into a long short-term memory (LSTM) network, decoded over time, and a description of the behavior in the video is finally output.
Since the global features obtained by 3D convolution include environmental information, objects, human behavior and so on, treating them all equally would prevent the features from reflecting what is unique about changes in human behavior. Considering that the 2D convolutional network has already extracted the feature information of environment, objects and human behavior sufficiently, it is desirable to construct a set of behavior filters. These filters refine the 3D convolutional features according to human behavior at different moments, so that the filtered features both fully mine the behavior features and explore the causal relations between behaviors at different moments in the temporal sequence. The invention therefore proposes a group of sequential filters based on the Cauchy distribution to capture the key information in the global features and mine the correlation between frames in the video. Filtering the global features produced by the 3D convolutional branch yields an implicit state vector containing the key behavior information.
The method first selectively encodes the temporal features of the video information according to the constraints imposed by human behavior, and then decodes the language description by combining the static image features with the dynamic temporal features. The specific flow is as follows:
in the encoding stage:
the first step is as follows: firstly, extracting 3D features and 2D features respectively by using two branches of a 3D convolutional network and a 2D convolutional network;
the second step is that: filtering a Cauchy filter aiming at the 3D features and extracting key features by utilizing a soft attention mechanism. In particular, a behavioral filter consisting of multiple sets of N Cauchy distributions
Figure 600504DEST_PATH_IMAGE009
For performing features related to frame sequence
Figure 295927DEST_PATH_IMAGE010
By matrix multiplication of (a) to obtain
Figure 6394DEST_PATH_IMAGE011
Dimension key feature
Figure 484780DEST_PATH_IMAGE012
I.e. by
Figure 347825DEST_PATH_IMAGE013
Wherein, in the step (A),Mindicating the number of filters,cThe category of the behavior is represented by,Tthe duration of the video is represented as,
Figure 948571DEST_PATH_IMAGE014
to representtAt the first momentmA filter for filtering the received signal,
Figure 880755DEST_PATH_IMAGE015
expressing the soft attention coefficient, and is realized by a softmax function with the calculation formula of
Figure 225148DEST_PATH_IMAGE016
Wherein, in the step (A),
Figure 191967DEST_PATH_IMAGE017
representing categories about behaviorcTo (1) aiThe weights of the filters are automatically learned by the convolutional neural network in the training process. The soft attention mechanism is based on the global features of the cauchy distribution filter, so that the model implicitly understands which timestamps are more important for human behavior-based descriptions throughout the sequence of video frames;
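As a sketch of how the filtering and soft attention of this step can be combined (the exact formula is given only as figures in the original, so the shapes and the aggregation over filters here are assumptions):

```python
import torch

def extract_key_features(filters: torch.Tensor,        # (M, T) Cauchy behavior filter bank
                         feats_3d: torch.Tensor,       # (T, D) per-timestep 3D features
                         class_weights: torch.Tensor,  # (M,) learned weights for one behavior class c
                         ) -> torch.Tensor:
    filtered = filters @ feats_3d                      # (M, D) matrix multiplication with the 3D features
    alpha = torch.softmax(class_weights, dim=0)        # soft-attention coefficients over the M filters
    return (alpha.unsqueeze(1) * filtered).sum(dim=0)  # (D,) key feature emphasizing behavior-relevant timestamps
```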
the third step: overlapping the key features and the 2D features according to a time sequence to construct fusion features;
in the decoding stage:
the fourth step: first, a new fusion feature is extracted at each time instant, using the attention mechanism for the fusion feature. I.e. the attention mechanism is introduced in timetFor the fusion characteristicsREncoding is carried out, normalized weights are obtained through a softmax function, and then the features are fusedRMultiplying by the attention weight to obtain a new fusion feature, which is formulated as follows:
Figure 166877DEST_PATH_IMAGE018
wherein the content of the first and second substances,
Figure 835624DEST_PATH_IMAGE019
a vector of attention is represented, and,
Figure 983709DEST_PATH_IMAGE020
representing a hidden state of the LSTM output,Rthe fused features are represented as a result of the fusion,
Figure 539455DEST_PATH_IMAGE021
and
Figure 950845DEST_PATH_IMAGE022
a matrix of weights is represented by a matrix of weights,bis a bias that is a function of the bias,
Figure 654359DEST_PATH_IMAGE023
representing the new fusion feature.
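Because the attention formula of this step appears only as figures in the original, the sketch below implements a standard additive attention built from the quantities the text names (the previous LSTM hidden state, the fused features R, two weight matrices and a bias); the patent's exact form may differ.

```python
import torch
import torch.nn as nn

class FusionAttention(nn.Module):
    """Additive attention over the fused features R at decoding step t (a sketch)."""
    def __init__(self, feat_dim: int, hidden_dim: int, attn_dim: int = 256):
        super().__init__()
        self.w_feat = nn.Linear(feat_dim, attn_dim, bias=False)     # weight matrix applied to R
        self.w_hidden = nn.Linear(hidden_dim, attn_dim, bias=True)  # weight matrix plus bias for h_{t-1}
        self.v = nn.Linear(attn_dim, 1, bias=False)

    def forward(self, fused: torch.Tensor, h_prev: torch.Tensor) -> torch.Tensor:
        # fused: (T, feat_dim) fusion features R, h_prev: (hidden_dim,) previous LSTM hidden state
        scores = self.v(torch.tanh(self.w_feat(fused) + self.w_hidden(h_prev)))  # (T, 1)
        alpha = torch.softmax(scores, dim=0)               # normalized weights
        return (alpha * fused).sum(dim=0)                  # new fusion feature, (feat_dim,)
```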
The fifth step: the new fused features are input into a long-short term memory (LSTM) network and decoded over time to obtain a description sentence about the behavior in the video. In particular, the LSTM decodes the video features at each instant to obtain the hidden state
Figure 278238DEST_PATH_IMAGE024
And storage state
Figure 970382DEST_PATH_IMAGE025
. By decoding the characteristic information, a word is obtained at each moment, and finally, a complete descriptive sentence is obtained. For each LSTM cell, its input
Figure 349410DEST_PATH_IMAGE023
Is a new fusion feature whose output is a word sequence
Figure 743483DEST_PATH_IMAGE026
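To make the decoding loop concrete, a greedy sketch is given below. It assumes PyTorch's nn.LSTMCell, reuses the FusionAttention sketch from the previous step, and omits vocabulary handling, start/end tokens and beam search; all names are illustrative.

```python
import torch
import torch.nn as nn

def decode_description(attn: "FusionAttention", fused: torch.Tensor,
                       lstm: nn.LSTMCell, word_head: nn.Linear,
                       max_len: int = 20) -> list[int]:
    """Decode one word per step from the attended fusion features (greedy sketch).

    lstm must be constructed with input_size equal to the fused feature dimension.
    """
    h = torch.zeros(1, lstm.hidden_size)   # hidden state
    c = torch.zeros(1, lstm.hidden_size)   # memory (cell) state
    words = []
    for _ in range(max_len):
        r_t = attn(fused, h.squeeze(0)).unsqueeze(0)     # new fusion feature for step t
        h, c = lstm(r_t, (h, c))                         # update hidden and memory state
        words.append(int(word_head(h).argmax(dim=1)))    # one word id per time step
    return words
```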
Corresponding to the method of the above embodiment, another embodiment of the present invention further provides a video encoding and decoding system, which includes:
a first extraction unit for extracting 3D features of a sequence of video frames;
in this embodiment, the first extraction unit is a 3D convolutional neural network;
a second extraction unit for extracting 2D features of the sequence of video frames;
in this embodiment, the second extraction unit is a 2D convolutional neural network;
a third extraction unit, configured to extract key information of the 3D feature;
the third extraction unit specifically performs the following operations: using a behavioral filter consisting of groups of N Cauchy distributions
Figure 171053DEST_PATH_IMAGE027
And 3D features
Figure 701391DEST_PATH_IMAGE028
Performing matrix multiplication to obtain key features
Figure 251321DEST_PATH_IMAGE012
(the specific implementation process is the same as the above embodiment, and is not described here again).
The first fusion unit is used for overlapping the key features and the 2D features according to a time sequence to construct fusion features;
the second fusion unit is used for coding the fusion features at the moment t, obtaining normalization weight through a softmax function, and multiplying the fusion features by the normalization weight to obtain new fusion features;
and the long-short term memory network is used for outputting description sentences related to the video frame sequence after inputting the new fusion characteristics.
The implementation process of each unit in the coding and decoding system of the present invention is the same as the implementation process of the foregoing embodiment.
The coding and decoding system can be deployed on computer equipment, and the computer equipment can be, for example, a microprocessor or a host computer.

Claims (10)

1. A video encoding and decoding method, comprising the steps of:
s1, respectively extracting 3D features and 2D features of the video frame sequence;
s2, processing the 3D features to obtain key features; overlapping the key features and the 2D features according to a time sequence to construct fusion features;
s3, at the momenttEncoding the fusion features, obtaining normalization weight through a softmax function, and multiplying the fusion features by the normalization weight to obtain new fusion features;
and S4, inputting the new fusion characteristics into the long-term and short-term memory network to obtain the description sentences about the video frame sequence.
2. The video coding/decoding method according to claim 1, wherein in step S1, a 3D convolutional neural network is used to extract 3D features of the sequence of video frames.
3. The video coding/decoding method according to claim 1, wherein in step S1, 2D features of the video frame sequence are extracted by using a 2D convolutional neural network.
4. The video coding and decoding method according to claim 3, wherein the 2D convolutional neural network adopts an Inception v3 network pre-trained on ImageNet as a backbone network.
5. The video coding and decoding method of claim 1, wherein the video frame sequence obtaining method comprises: a fixed number of frames are sampled from the entire video to form the sequence of video frames.
6. The video encoding/decoding method of claim 1, wherein in step S2, a behavior filter consisting of a plurality of sets of N Cauchy distributions is matrix-multiplied with the 3D features to obtain the key features.
7. A video coding/decoding system, comprising:
a first extraction unit for extracting 3D features of a sequence of video frames;
a second extraction unit for extracting 2D features of the sequence of video frames;
a third extraction unit, configured to extract key information of the 3D feature;
the first fusion unit is used for overlapping the key features and the 2D features according to a time sequence to construct fusion features;
a second fusion unit for encoding the fusion features at time t, obtaining normalized weights through a softmax function, and multiplying the fusion features by the normalized weights to obtain new fusion features;
and the long-short term memory network is used for outputting description sentences related to the video frame sequence after inputting the new fusion characteristics.
8. The system of claim 7, wherein the first extraction unit is a 3D convolutional neural network.
9. The system of claim 7, wherein the second extraction unit is a 2D convolutional neural network.
10. The system according to claim 7, wherein the third extraction unit performs the following operation: a behavior filter consisting of groups of N Cauchy distributions is matrix-multiplied with the 3D features to obtain the key features.
CN202110483437.5A 2021-04-30 2021-04-30 Video encoding and decoding method and system Active CN113099228B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110483437.5A CN113099228B (en) 2021-04-30 2021-04-30 Video encoding and decoding method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110483437.5A CN113099228B (en) 2021-04-30 2021-04-30 Video encoding and decoding method and system

Publications (2)

Publication Number Publication Date
CN113099228A true CN113099228A (en) 2021-07-09
CN113099228B CN113099228B (en) 2024-04-05

Family

ID=76681265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110483437.5A Active CN113099228B (en) 2021-04-30 2021-04-30 Video encoding and decoding method and system

Country Status (1)

Country Link
CN (1) CN113099228B (en)


Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100158099A1 (en) * 2008-09-16 2010-06-24 Realnetworks, Inc. Systems and methods for video/multimedia rendering, composition, and user interactivity
US20170262705A1 (en) * 2016-03-11 2017-09-14 Qualcomm Incorporated Recurrent networks with motion-based attention for video understanding
CN108388900A (en) * 2018-02-05 2018-08-10 华南理工大学 The video presentation method being combined based on multiple features fusion and space-time attention mechanism
WO2020037965A1 (en) * 2018-08-21 2020-02-27 北京大学深圳研究生院 Method for multi-motion flow deep convolutional network model for video prediction
CN109344288A (en) * 2018-09-19 2019-02-15 电子科技大学 A kind of combination video presentation method based on multi-modal feature combination multilayer attention mechanism
CN109874029A (en) * 2019-04-22 2019-06-11 腾讯科技(深圳)有限公司 Video presentation generation method, device, equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
谭等泰, 李世超, et al., "多特征融合的行为识别模型" (A behavior recognition model based on multi-feature fusion), 《中国图象图形学报》 (Journal of Image and Graphics), 16 December 2020, pp. 2541-2552 *

Also Published As

Publication number Publication date
CN113099228B (en) 2024-04-05

Similar Documents

Publication Publication Date Title
CN109063164A (en) A kind of intelligent answer method based on deep learning
CN108549658A (en) A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN108960063A (en) It is a kind of towards event relation coding video in multiple affair natural language description algorithm
CN115687687B (en) Video segment searching method and system for open domain query
CN112800203B (en) Question-answer matching method and system fusing text representation and knowledge representation
CN113504906B (en) Code generation method and device, electronic equipment and readable storage medium
CN110427629A (en) Semi-supervised text simplified model training method and system
CN116204674B (en) Image description method based on visual concept word association structural modeling
CN112733043B (en) Comment recommendation method and device
CN112364148B (en) Deep learning method-based generative chat robot
CN113807222A (en) Video question-answering method and system for end-to-end training based on sparse sampling
CN112699310A (en) Cold start cross-domain hybrid recommendation method and system based on deep neural network
CN111949886A (en) Sample data generation method and related device for information recommendation
CN114648032B (en) Training method and device of semantic understanding model and computer equipment
Wang et al. Self-information loss compensation learning for machine-generated text detection
CN115525744A (en) Dialog recommendation system based on prompt learning method
CN116484024A (en) Multi-level knowledge base construction method based on knowledge graph
Zhao et al. Shared-private memory networks for multimodal sentiment analysis
CN114240697A (en) Method and device for generating broker recommendation model, electronic equipment and storage medium
Tang et al. Predictive modelling of student behaviour using granular large-scale action data
CN109710787A (en) Image Description Methods based on deep learning
CN112668481A (en) Semantic extraction method for remote sensing image
CN114579869B (en) Model training method and related product
CN113099228A (en) Video coding and decoding method and system
CN110852066A (en) Multi-language entity relation extraction method and system based on confrontation training mechanism

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant