CN112364690A - Video spatio-temporal information characterization method based on multi-scale intensive time sequence pooling - Google Patents

Video spatio-temporal information characterization method based on multi-scale intensive time sequence pooling

Info

Publication number
CN112364690A
Authority
CN
China
Prior art keywords
pooling
video
scale
sequence
time sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202011074043.6A
Other languages
Chinese (zh)
Inventor
侯高泽 (Hou Gaoze)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202011074043.6A
Publication of CN112364690A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A video spatio-temporal information characterization method based on multi-scale dense temporal pooling comprises the following specific steps. Step one: input a video sequence. Step two: obtain basic video features through a deep convolutional neural network, and obtain a per-channel feature representation for each frame through a global average pooling layer. Step three: apply multi-scale dense pooling to the obtained basic features. Step four: concatenate all the scales and the basic features along the channel dimension, and apply average pooling over the temporal dimension to obtain the final video spatio-temporal feature representation. For videos of different sequence lengths, the method can add or remove scales without affecting model size or parameter count, and achieves better performance gains than alternatives such as multi-scale dilated convolution. For video sequences of longer temporal length, short-to-long spatio-temporal cues can be captured through lightweight multi-scale pooling, the effect is better, and the performance gain grows with sequence duration. The dense pooling scheme can capture periodic features between arbitrary positions in the video.

Description

Video spatio-temporal information characterization method based on multi-scale intensive time sequence pooling
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a video spatio-temporal information characterization method based on multi-scale dense temporal pooling.
Background
The video spatio-temporal information characterization technology used in existing video analysis has the following defects:
1. Existing video representations are deficient in temporal length and multi-scale coverage. Because motion variation and appearance continuity in video often span multiple consecutive frames, they constitute video cues of different durations. In the prior art, multi-scale feature extraction along the temporal dimension is mostly achieved by designing convolutional or recurrent networks over time, after which a segment-level feature representation is obtained by temporally pooling a fixed number of consecutive frames. The drawback of these characterization methods is that CNN- and RNN-based multi-scale feature extraction methods, such as dilated convolution, can only extract short local cues; for long video segments they lack the ability to extract long cues, so the performance gain on long segments is not obvious. If the scale is increased, the complexity cost paid by the model is large and no appreciable performance gain is achieved. Multi-scale temporal pooling fundamentally addresses the time-span problem of local cues: variable pooling kernel sizes yield local cues of different spans, and for long video segments suitable large-scale pooling can be added, so representations of long-span cues can be obtained without adding complexity, which is a better performance trade-off.
2. Existing video representations use too simple a temporal pooling operation. Existing segment-level spatio-temporal characterization techniques pool video segment features at the temporal level over a generally fixed length, which means that at the temporal pooling level they lack not only multi-scale considerations but also consideration of changes in the period and position of cues. A local cue may not only span a variable length of time but also have variable start and stop positions. To solve this problem, the present invention captures as many varying local cues as possible through dense pooling at each scale.
3. The model size of existing video characterization algorithms needs to be optimized. Extracting long-scale cues by capturing cues of various scales with CNNs and RNNs requires more model structure, so the model is larger and the running time is longer, and such methods cannot be applied well to long-sequence inputs or mobile deployment environments.
Disclosure of Invention
Addressing the defects of the prior art, the invention provides a video spatio-temporal information characterization method based on multi-scale dense temporal pooling. The biggest characteristic of multi-scale dense pooling is that it is lightweight: the long-and-short-cue problem is solved with a multi-scale method, and apart from the two modules common to video analysis models, namely the basic feature extraction module and the final fully connected classification module, the model needs no other modules that add size or complexity. The model is therefore extremely lightweight and can meet industrial requirements for an extremely light model while delivering high performance. The specific technical scheme is as follows:
A video spatio-temporal information characterization method based on multi-scale dense temporal pooling comprises the following specific steps:
Step one: input a video sequence;
Step two: obtain basic video features through a deep convolutional neural network, and obtain a per-channel feature representation for each frame through a global average pooling layer;
Step three: apply multi-scale dense pooling to the obtained basic features;
Step four: concatenate all the scales and the basic features along the channel dimension, and apply average pooling over the temporal dimension to obtain the final video spatio-temporal feature representation.
In order to better realize the invention, as an optimization: in step one, the input video sequence is specifically
X ∈ R^(T×3×H×W),
where T is the number of frames in the video sequence; the channel dimension is 3 because the input is an RGB image sequence; and H and W are the spatial dimensions of the model input after data preprocessing.
In step two, basic video features are obtained through a deep convolutional neural network, and for each frame a per-channel feature representation is obtained through a global average pooling layer,
X′ ∈ R^(T×C),
that is, each frame is represented by a feature vector with C channels.
In step three, the obtained basic features undergo multi-scale dense pooling; each scale is implemented with a one-dimensional average pooling layer along the temporal dimension. If there are K scales, i.e. K dense pooling layers, the output feature map of every pooling layer can be kept at the same size by controlling the stride and padding of the pooling layers,
X″_k ∈ R^(T×C), k = 1, …, K.
In step four, the pooling outputs of the K scales, X″ = {X″_1, X″_2, …, X″_K}, are concatenated with the basic features along the channel dimension, and average pooling over the temporal dimension yields the final video spatio-temporal feature representation
X_final ∈ R^((K+1)·C).
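For concreteness, the dimension bookkeeping of the four steps can be sketched as follows; the numeric values (T = 8 frames, C = 2048 backbone channels, K = 4 scales) are illustrative assumptions, not part of the claimed method.

# Illustrative shape walk-through of the four steps (values are example assumptions).
T, H, W = 8, 256, 128        # frames per clip and preprocessed frame size
C = 2048                     # channels of the backbone output (e.g. a ResNet-50 backbone)
K = 4                        # number of dense temporal pooling scales

# Step 1: input video sequence                 X     : (T, 3, H, W)
# Step 2: backbone + global average pooling    X'    : (T, C)
# Step 3: K dense temporal pooling layers      X''_k : (T, C) for k = 1..K
# Step 4: concatenate along channels, then average over time
#         [X', X''_1, ..., X''_K]              : (T, (K + 1) * C)
final_dim = (K + 1) * C      # final clip-level feature dimension, 10240 in this example
print(final_dim)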
The invention has the following beneficial effects: 1. For videos of different sequence lengths, the method can add or remove scales without affecting model size or parameter count, and achieves better performance gains than alternatives such as multi-scale dilated convolution. 2. For video sequences of longer temporal length, short-to-long spatio-temporal cues can be captured through lightweight multi-scale pooling, the effect is better, and the performance gain grows with sequence duration. 3. The dense pooling scheme can capture periodic features between arbitrary positions in the video. 4. The module is highly transferable and reusable: the multi-scale dense pooling method can be nested in any video-related research and applied to other fields of video analysis. 5. The extremely lightweight design makes the benefits and performance especially appreciable in mobile-end, speed-sensitive video applications.
Drawings
Fig. 1 is a framework diagram of the present invention.
Detailed Description
The following detailed description of the preferred embodiments of the present invention, taken in conjunction with the accompanying drawings, will make the advantages and features of the invention easier for those skilled in the art to understand, and thus more clearly defines the scope of protection of the invention.
Interpretation of terms:
video analysis: the method is applied to the research field of modeling and analyzing video content by using characteristics such as space time information, time sequence context information and the like, and video action recognition, video pedestrian re-recognition and the like are common.
Video motion recognition: the method judges the motion type through analyzing the video motion sequence.
And (3) pedestrian re-identification: the method refers to accurately inquiring and matching the same pedestrian under a plurality of cameras. The pedestrian re-identification method mainly comprises image-based pedestrian re-identification and video-based pedestrian re-identification, and has the characteristics of continuous frames and complex motion aiming at the video-based pedestrian re-identification.
CNN: the convolutional neural network, a deep laminar neural network commonly used for processing grid data, is widely used in the field of computer vision, including image video classification, image video feature extraction, image video synthesis, model parameter optimization, and the like.
RNN: a Recurrent Neural Network (Recurrent Neural Network) is a type of Neural Network used to process sequence data, reflecting the changing state of an object over a time frame.
Sequential pooling: the feature weighting of input features of a plurality of continuous frames in a time dimension forms a feature representation of a segment of video.
ResNet: a network for obtaining a better basic feature extraction effect by utilizing a residual error neural network can solve the problem that the depth of the network is deepened and the gradient is dispersed.
Residual neural network: a CNN that greatly expands the depth of a network by cross-layer connection is suitable for high-resolution image processing.
Mars data set is a common data set for video pedestrian re-identification.
opencv: a computer vision standard function library that encompasses almost all classical modules of the computer vision field, and which can be used by most computer vision tasks to process.
numpy: a python language expansion program library provides a large number of dimensionality arrays, matrix operation and mathematical function realization of array operation, and is the most common matrix operation library.
Pythrch: an open-source machine learning framework provides seamless connection of dynamic graph patterns and training of graph patterns and product production, provides rich scripts and libraries, and realizes distributed training and cloud support.
As shown in fig. 1: a video spatio-temporal information characterization method based on multi-scale dense temporal pooling comprises two parts, namely a basic feature extraction network and a multi-scale dense pooling module. The basic feature extraction network, i.e. ConvNet in fig. 1, may employ any of many existing mature feature extraction networks pre-trained on ImageNet, such as the ResNet series, the GoogLeNet series, or the VGG series. The basic features of each frame of the video are obtained through the basic feature extraction network; fusing these per-frame features along the temporal dimension into a video-level representation is the main purpose of multi-scale dense temporal pooling.
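As an illustration of this stage, the per-frame basic features can be obtained with an ImageNet-pretrained backbone followed by global average pooling. The sketch below is a minimal example only; the choice of torchvision's ResNet-50 and the concrete tensor sizes are assumptions for illustration and are not fixed by the invention.

import torch
import torchvision

# Basic feature extraction network (ConvNet in fig. 1): any mature ImageNet-pretrained
# backbone can be used; ResNet-50 is shown here as one example.
backbone = torchvision.models.resnet50(pretrained=True)
backbone = torch.nn.Sequential(*list(backbone.children())[:-2])   # keep only the convolutional trunk
backbone.eval()

def frame_features(video):
    # video: (T, 3, H, W) RGB clip -> per-frame basic features X' of shape (T, C)
    fmap = backbone(video)             # (T, C, h, w) spatial feature maps, one per frame
    return fmap.mean(dim=(2, 3))       # global average pooling over space -> (T, C)

clip = torch.randn(8, 3, 256, 128)     # dummy clip with T = 8 frames
print(frame_features(clip).shape)      # torch.Size([8, 2048])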
The main function of the multi-scale dense pooling module is to replace temporal-context cue extraction methods in the lightest possible way: fusing features over different numbers of consecutive frames represents video features of different lengths, and cues at multiple scales take into account more local features from different temporal views. Prior-art approaches mostly adopt multi-scale dilated convolution along the temporal dimension; their drawback is that dilated convolution is a skipping selection scheme that easily loses information from consecutive frames, and increasing the scale adds considerable computation. The present method replaces dilated or ordinary convolution with dense pooling, achieving equally effective multi-scale video cue extraction at extremely small computation and complexity.
The key to multi-scale dense pooling is the design of the various pooling kernels and the dense pooling strategy, both of which can be realized with one-dimensional pooling layers along the temporal dimension. The pooling kernel corresponds to the extracted local cue: a small pooling kernel corresponds to a local cue spanning a shorter time, while a large pooling kernel captures a local cue spanning a longer time. Fusing local cues of various lengths allows the representation of various kinds of context information and forms a more discriminative video representation. More importantly, the multi-scale design of the invention can be extended at low cost: more and larger one-dimensional pooling layers can be added, which works better for actions of long duration that span many frames and for longer video sequence inputs.
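The implementation detail that makes the pooling "dense" is that each scale is a one-dimensional average pooling layer whose stride and padding are chosen so that the output keeps the full temporal length T, i.e. every temporal position starts a pooling window. A minimal sketch follows; the odd kernel sizes 3, 5, 7, 9 match the example configuration given later, and stride 1 with padding k // 2 is one assumed way to satisfy the length-preserving constraint.

import torch
import torch.nn as nn

T, C = 8, 2048
x = torch.randn(1, C, T)   # per-frame features arranged as (batch, channels, time) for 1-D pooling

# One AvgPool1d layer per scale; with stride 1 and padding k // 2 (k odd) the output length stays T,
# so local cues starting at every position and spanning k frames are all captured.
pools = [nn.AvgPool1d(kernel_size=k, stride=1, padding=k // 2) for k in (3, 5, 7, 9)]
for pool in pools:
    assert pool(x).shape == (1, C, T)   # every scale preserves the temporal length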
The overall scheme is shown in fig. 1. A video sequence X ∈ R^(T×3×H×W) is input, where T is the number of frames in the video sequence, the channel dimension is 3 because the input is an RGB image sequence, and H and W are the spatial dimensions of the model input after data preprocessing. Basic video features are obtained through a deep convolutional neural network, and for each frame a per-channel feature representation X′ ∈ R^(T×C) is obtained through a global average pooling layer; that is, each frame is represented by a feature vector with C channels. The obtained basic features then undergo multi-scale dense pooling; each scale is implemented with a one-dimensional average pooling layer along the temporal dimension. If there are K scales, i.e. K dense pooling layers, the output feature map of every pooling layer can be kept at the same size, X″_k ∈ R^(T×C), k = 1, …, K, by controlling the stride and padding of the pooling layers. Finally, the pooling outputs of the K scales, X″ = {X″_1, X″_2, …, X″_K}, are concatenated with the basic features along the channel dimension, and average pooling over the temporal dimension yields the final video spatio-temporal feature representation X_final ∈ R^((K+1)·C).
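Putting these pieces together, the multi-scale dense pooling module of fig. 1 can be sketched as a single PyTorch module. This is a minimal illustrative implementation under the assumptions already noted (per-frame features X′ of shape (T, C) as input, odd pooling kernels, stride 1 and padding k // 2), not the literal code of the embodiment.

import torch
import torch.nn as nn

class MultiScaleDensePooling(nn.Module):
    """Fuse per-frame features X' of shape (T, C) into a clip-level feature of size (K + 1) * C."""

    def __init__(self, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.AvgPool1d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):
        x = x.t().unsqueeze(0)                          # (T, C) -> (1, C, T) for 1-D pooling over time
        outs = [x] + [pool(x) for pool in self.pools]   # basic features plus K dense-pooled scales
        fused = torch.cat(outs, dim=1)                  # concatenate along the channel dimension
        return fused.mean(dim=2).squeeze(0)             # average pooling over time -> ((K + 1) * C,)

feats = torch.randn(8, 2048)                            # X' for T = 8 frames, C = 2048 channels
print(MultiScaleDensePooling()(feats).shape)            # torch.Size([10240])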
A specific example network architecture and parameter design is shown in Table 1. Table 1 adopts a multi-scale dense pooling design with 4 scales, whose pooling kernel sizes are 3, 5, 7 and 9, representing feature pooling and extraction over different time spans. In complex industrial environments and video research scenarios, the number and size of the scales can be increased with little impact on model size and parameter count. Note that the invention uses the AvgPool1d layer as the implementation of dense pooling, since pooling only needs to be performed along the temporal dimension and this layer meets the design requirements of dense temporal pooling. T in the table denotes the length of each input video segment; a longer length places higher demands on model performance and implies local temporal cues with longer spans. The number of outputs of the last fully connected layer in the table differs between industrial and research fields; 625 is the number of category labels of the video pedestrian re-identification dataset Mars.
Table 1 network architecture and parameter design table
The code of the embodiment is implemented in Python; the numpy and opencv libraries are used for image processing, while model construction, training and testing are carried out with the PyTorch toolkit. The model is applicable to a variety of tools, environments and platforms.
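As an end-to-end usage sketch matching the example configuration (a ResNet-50 backbone, the four pooling scales 3, 5, 7, 9, and a 625-way classifier for the Mars dataset), and reusing frame_features and MultiScaleDensePooling from the sketches above; everything other than the 625 class count and the kernel sizes is an illustrative assumption.

import torch
import torch.nn as nn

num_classes = 625                                    # category count of the Mars re-identification dataset
pooling = MultiScaleDensePooling(kernel_sizes=(3, 5, 7, 9))
classifier = nn.Linear((4 + 1) * 2048, num_classes)  # fully connected layer over the (K + 1) * C feature

def classify_clip(video):
    # video: (T, 3, H, W) RGB clip -> class logits over the 625 Mars identities
    per_frame = frame_features(video)                # (T, 2048) per-frame basic features
    clip_feature = pooling(per_frame)                # ((4 + 1) * 2048,) clip-level spatio-temporal feature
    return classifier(clip_feature)

logits = classify_clip(torch.randn(8, 3, 256, 128))
print(logits.shape)                                  # torch.Size([625])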

Claims (2)

1. A video spatio-temporal information characterization method based on multi-scale dense temporal pooling, characterized by comprising the following specific steps:
Step one: input a video sequence;
Step two: obtain basic video features through a deep convolutional neural network, and obtain a per-channel feature representation for each frame through a global average pooling layer;
Step three: apply multi-scale dense pooling to the obtained basic features;
Step four: concatenate all the scales and the basic features along the channel dimension, and apply average pooling over the temporal dimension to obtain the final video spatio-temporal feature representation.
2. The video spatio-temporal information characterization method based on multi-scale dense temporal pooling of claim 1, characterized in that: in step one, the input video sequence is specifically
X ∈ R^(T×3×H×W),
where T is the number of frames in the video sequence; the channel dimension is 3 because the input is an RGB image sequence; and H and W are the spatial dimensions of the model input after data preprocessing;
in step two, basic video features are obtained through a deep convolutional neural network, and for each frame a per-channel feature representation is obtained through a global average pooling layer,
X′ ∈ R^(T×C),
that is, each frame is represented by a feature vector with C channels;
in step three, the obtained basic features undergo multi-scale dense pooling; each scale is implemented with a one-dimensional average pooling layer along the temporal dimension; if there are K scales, i.e. K dense pooling layers, the output feature map of every pooling layer can be kept at the same size by controlling the stride and padding of the pooling layers,
X″_k ∈ R^(T×C), k = 1, …, K;
in step four, the pooling outputs of the K scales, X″ = {X″_1, X″_2, …, X″_K}, are concatenated with the basic features along the channel dimension, and average pooling over the temporal dimension yields the final video spatio-temporal feature representation
X_final ∈ R^((K+1)·C).
CN202011074043.6A 2020-10-09 2020-10-09 Video spatio-temporal information characterization method based on multi-scale intensive time sequence pooling Withdrawn CN112364690A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011074043.6A CN112364690A (en) 2020-10-09 2020-10-09 Video spatio-temporal information characterization method based on multi-scale intensive time sequence pooling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011074043.6A CN112364690A (en) 2020-10-09 2020-10-09 Video spatio-temporal information characterization method based on multi-scale intensive time sequence pooling

Publications (1)

Publication Number Publication Date
CN112364690A true CN112364690A (en) 2021-02-12

Family

ID=74508313

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011074043.6A Withdrawn CN112364690A (en) 2020-10-09 2020-10-09 Video spatio-temporal information characterization method based on multi-scale intensive time sequence pooling

Country Status (1)

Country Link
CN (1) CN112364690A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115376052A (en) * 2022-10-26 2022-11-22 山东百盟信息技术有限公司 Long video classification method based on key frame sampling and multi-scale dense network

Similar Documents

Publication Publication Date Title
Jiang et al. Self-supervised relative depth learning for urban scene understanding
CN107229757B (en) Video retrieval method based on deep learning and Hash coding
CN111667399B (en) Training method of style migration model, video style migration method and device
WO2021022521A1 (en) Method for processing data, and method and device for training neural network model
CN110059598B (en) Long-term fast-slow network fusion behavior identification method based on attitude joint points
CN108399380A (en) A kind of video actions detection method based on Three dimensional convolution and Faster RCNN
CN108537824B (en) Feature map enhanced network structure optimization method based on alternating deconvolution and convolution
CN110472604B (en) Pedestrian and crowd behavior identification method based on video
CN112016406B (en) Video key frame extraction method based on full convolution network
CN111046821A (en) Video behavior identification method and system and electronic equipment
CN114596520A (en) First visual angle video action identification method and device
Wei et al. Novel video prediction for large-scale scene using optical flow
CN114781629A (en) Hardware accelerator of convolutional neural network based on parallel multiplexing and parallel multiplexing method
CN112906520A (en) Gesture coding-based action recognition method and device
Ouchra et al. Object detection approaches in images: A weighted scoring model based comparative study
Lv et al. An inverted residual based lightweight network for object detection in sweeping robots
Farrajota et al. Human action recognition in videos with articulated pose information by deep networks
CN112364690A (en) Video spatio-temporal information characterization method based on multi-scale intensive time sequence pooling
CN114170657A (en) Facial emotion recognition method integrating attention mechanism and high-order feature representation
CN113469238A (en) Self-supervision learning method for solving puzzle task based on CRNN
Chandrapala et al. Invariant feature extraction from event based stimuli
CN112069979A (en) Real-time action recognition man-machine interaction system
Zhai et al. 3D dual-stream convolutional neural networks with simple recurrent unit network: A new framework for action recognition
Hao et al. Architecture self-attention mechanism: Nonlinear optimization for neural architecture search
Zhang Behaviour Detection and Recognition of College Basketball Players Based on Multimodal Sequence Matching and Deep Neural Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20210212)