CN112364690A - Video spatio-temporal information characterization method based on multi-scale intensive time sequence pooling - Google Patents

Video spatio-temporal information characterization method based on multi-scale intensive time sequence pooling

Info

Publication number
CN112364690A
Authority
CN
China
Prior art keywords
pooling
video
scale
sequence
time sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202011074043.6A
Other languages
Chinese (zh)
Inventor
侯高泽 (Hou Gaoze)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual
Priority to CN202011074043.6A
Publication of CN112364690A
Legal status: Withdrawn

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

A video spatio-temporal information characterization method based on multi-scale dense temporal pooling comprises the following specific steps. Step one: input a video sequence. Step two: obtain basic video features through a deep convolutional neural network, and obtain a per-channel feature representation for each frame through a global average pooling layer. Step three: apply multi-scale dense pooling to the obtained basic features. Step four: concatenate all the scales and the basic features along the channel dimension, and apply average pooling over the temporal dimension to obtain the final video spatio-temporal feature representation. For videos of different sequence lengths, the method can add or remove scales without affecting model size or parameter count, and achieves better performance gains than alternatives such as multi-scale dilated convolution. For video sequences of longer temporal length, short-to-long spatio-temporal cues can be captured through lightweight multi-scale pooling, the effect is better, and the performance gain grows with sequence duration. The dense pooling scheme can capture periodic features between arbitrary positions in the video.

Description

Video spatio-temporal information characterization method based on multi-scale intensive time sequence pooling
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a video spatio-temporal information characterization method based on multi-scale dense temporal pooling.
Background
The video spatio-temporal information characterization technology used in existing video analysis has the following defects:
1. Existing video representations are deficient in temporal length and multi-scale coverage. Because motion variation and appearance continuity in video often span multiple consecutive frames, they constitute video cues of different durations. In the prior art, multi-scale feature extraction along the temporal dimension is mostly achieved by designing convolutional or recurrent networks over time, after which a segment-level feature representation is obtained by temporally pooling a fixed number of consecutive frames. The drawback of these characterization methods is that CNN- and RNN-based multi-scale feature extraction methods, such as dilated convolution, can only extract short local cues; for long video segments they lack the ability to extract long cues, so the performance gain on long segments is not obvious. If the scale is increased, the complexity cost paid by the model is large and no appreciable performance gain is achieved. Multi-scale temporal pooling fundamentally addresses the time-span problem of local cues: variable pooling kernel sizes yield local cues of different spans, and for long video segments suitable large-scale pooling can be added, so representations of long-span cues can be obtained without adding complexity, which is a better performance trade-off.
2. Existing video representations use too simple a temporal pooling operation. Existing segment-level spatio-temporal characterization techniques pool video segment features at the temporal level over a generally fixed length, which means that at the temporal pooling level they lack not only multi-scale considerations but also consideration of changes in the period and position of cues. A local cue may not only span a variable length of time but also have variable start and stop positions. To solve this problem, the present invention captures as many varying local cues as possible through dense pooling at each scale.
3. The model size of existing video characterization algorithms needs to be optimized. Extracting long-scale cues by capturing cues of various scales with CNNs and RNNs requires more model structure, so the model is larger and the running time is longer, and such methods cannot be applied well to long-sequence inputs or mobile deployment environments.
Disclosure of Invention
Addressing the defects of the prior art, the invention provides a video spatio-temporal information characterization method based on multi-scale dense temporal pooling. The biggest characteristic of multi-scale dense pooling is that it is lightweight: the long-and-short-cue problem is solved with a multi-scale method, and apart from the two modules common to video analysis models, namely the basic feature extraction module and the final fully connected classification module, the model needs no other modules that add size or complexity. The model is therefore extremely lightweight and can meet industrial requirements for an extremely light model while delivering high performance. The specific technical scheme is as follows:
A video spatio-temporal information characterization method based on multi-scale dense temporal pooling comprises the following specific steps:
Step one: input a video sequence;
Step two: obtain basic video features through a deep convolutional neural network, and obtain a per-channel feature representation for each frame through a global average pooling layer;
Step three: apply multi-scale dense pooling to the obtained basic features;
Step four: concatenate all the scales and the basic features along the channel dimension, and apply average pooling over the temporal dimension to obtain the final video spatio-temporal feature representation.
In order to better realize the invention, as an optimization: in step one, the input video sequence is specifically
X ∈ R^(T×3×H×W),
where T is the number of frames in the video sequence; the channel dimension is 3 because the input is an RGB image sequence; and H and W are the spatial dimensions of the model input after data preprocessing.
In step two, basic video features are obtained through a deep convolutional neural network, and for each frame a per-channel feature representation is obtained through a global average pooling layer,
X′ ∈ R^(T×C),
that is, each frame is represented by a feature vector with C channels.
In step three, the obtained basic features undergo multi-scale dense pooling; each scale is implemented with a one-dimensional average pooling layer along the temporal dimension. If there are K scales, i.e. K dense pooling layers, the output feature map of every pooling layer can be kept at the same size by controlling the stride and padding of the pooling layers,
X″_k ∈ R^(T×C), k = 1, …, K.
In step four, the pooling outputs of the K scales, X″ = {X″_1, X″_2, …, X″_K}, are concatenated with the basic features along the channel dimension, and average pooling over the temporal dimension yields the final video spatio-temporal feature representation
X_final ∈ R^((K+1)·C).
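For concreteness, the dimension bookkeeping of the four steps can be sketched as follows; the numeric values (T = 8 frames, C = 2048 backbone channels, K = 4 scales) are illustrative assumptions, not part of the claimed method.

# Illustrative shape walk-through of the four steps (values are example assumptions).
T, H, W = 8, 256, 128        # frames per clip and preprocessed frame size
C = 2048                     # channels of the backbone output (e.g. a ResNet-50 backbone)
K = 4                        # number of dense temporal pooling scales

# Step 1: input video sequence                 X     : (T, 3, H, W)
# Step 2: backbone + global average pooling    X'    : (T, C)
# Step 3: K dense temporal pooling layers      X''_k : (T, C) for k = 1..K
# Step 4: concatenate along channels, then average over time
#         [X', X''_1, ..., X''_K]              : (T, (K + 1) * C)
final_dim = (K + 1) * C      # final clip-level feature dimension, 10240 in this example
print(final_dim)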
The invention has the following beneficial effects: 1. For videos of different sequence lengths, the method can add or remove scales without affecting model size or parameter count, and achieves better performance gains than alternatives such as multi-scale dilated convolution. 2. For video sequences of longer temporal length, short-to-long spatio-temporal cues can be captured through lightweight multi-scale pooling, the effect is better, and the performance gain grows with sequence duration. 3. The dense pooling scheme can capture periodic features between arbitrary positions in the video. 4. The module is highly transferable and reusable: the multi-scale dense pooling method can be nested in any video-related research and applied to other fields of video analysis. 5. The extremely lightweight design makes the benefits and performance especially appreciable in mobile-end, speed-sensitive video applications.
Drawings
Fig. 1 is a framework diagram of the present invention.
Detailed Description
The following detailed description of the preferred embodiments of the present invention, taken in conjunction with the accompanying drawings, will make the advantages and features of the invention easier for those skilled in the art to understand, and thus more clearly defines the scope of protection of the invention.
Interpretation of terms:
video analysis: the method is applied to the research field of modeling and analyzing video content by using characteristics such as space time information, time sequence context information and the like, and video action recognition, video pedestrian re-recognition and the like are common.
Video motion recognition: the method judges the motion type through analyzing the video motion sequence.
And (3) pedestrian re-identification: the method refers to accurately inquiring and matching the same pedestrian under a plurality of cameras. The pedestrian re-identification method mainly comprises image-based pedestrian re-identification and video-based pedestrian re-identification, and has the characteristics of continuous frames and complex motion aiming at the video-based pedestrian re-identification.
CNN: the convolutional neural network, a deep laminar neural network commonly used for processing grid data, is widely used in the field of computer vision, including image video classification, image video feature extraction, image video synthesis, model parameter optimization, and the like.
RNN: a Recurrent Neural Network (Recurrent Neural Network) is a type of Neural Network used to process sequence data, reflecting the changing state of an object over a time frame.
Sequential pooling: the feature weighting of input features of a plurality of continuous frames in a time dimension forms a feature representation of a segment of video.
ResNet: a network for obtaining a better basic feature extraction effect by utilizing a residual error neural network can solve the problem that the depth of the network is deepened and the gradient is dispersed.
Residual neural network: a CNN that greatly expands the depth of a network by cross-layer connection is suitable for high-resolution image processing.
Mars data set is a common data set for video pedestrian re-identification.
opencv: a computer vision standard function library that encompasses almost all classical modules of the computer vision field, and which can be used by most computer vision tasks to process.
numpy: a python language expansion program library provides a large number of dimensionality arrays, matrix operation and mathematical function realization of array operation, and is the most common matrix operation library.
Pythrch: an open-source machine learning framework provides seamless connection of dynamic graph patterns and training of graph patterns and product production, provides rich scripts and libraries, and realizes distributed training and cloud support.
As shown in fig. 1: a video spatio-temporal information characterization method based on multi-scale dense temporal pooling comprises two parts, namely a basic feature extraction network and a multi-scale dense pooling module. The basic feature extraction network, i.e. ConvNet in fig. 1, may employ any of many existing mature feature extraction networks pre-trained on ImageNet, such as the ResNet series, the GoogLeNet series, or the VGG series. The basic features of each frame of the video are obtained through the basic feature extraction network; fusing these per-frame features along the temporal dimension into a video-level representation is the main purpose of multi-scale dense temporal pooling.
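As an illustration of this stage, the per-frame basic features can be obtained with an ImageNet-pretrained backbone followed by global average pooling. The sketch below is a minimal example only; the choice of torchvision's ResNet-50 and the concrete tensor sizes are assumptions for illustration and are not fixed by the invention.

import torch
import torchvision

# Basic feature extraction network (ConvNet in fig. 1): any mature ImageNet-pretrained
# backbone can be used; ResNet-50 is shown here as one example.
backbone = torchvision.models.resnet50(pretrained=True)
backbone = torch.nn.Sequential(*list(backbone.children())[:-2])   # keep only the convolutional trunk
backbone.eval()

def frame_features(video):
    # video: (T, 3, H, W) RGB clip -> per-frame basic features X' of shape (T, C)
    fmap = backbone(video)             # (T, C, h, w) spatial feature maps, one per frame
    return fmap.mean(dim=(2, 3))       # global average pooling over space -> (T, C)

clip = torch.randn(8, 3, 256, 128)     # dummy clip with T = 8 frames
print(frame_features(clip).shape)      # torch.Size([8, 2048])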
The main function of the multi-scale dense pooling module is to replace temporal-context cue extraction methods in the lightest possible way: fusing features over different numbers of consecutive frames represents video features of different lengths, and cues at multiple scales take into account more local features from different temporal views. Prior-art approaches mostly adopt multi-scale dilated convolution along the temporal dimension; their drawback is that dilated convolution is a skipping selection scheme that easily loses information from consecutive frames, and increasing the scale adds considerable computation. The present method replaces dilated or ordinary convolution with dense pooling, achieving equally effective multi-scale video cue extraction at extremely small computation and complexity.
The key to multi-scale dense pooling is the design of the various pooling kernels and the dense pooling strategy, both of which can be realized with one-dimensional pooling layers along the temporal dimension. The pooling kernel corresponds to the extracted local cue: a small pooling kernel corresponds to a local cue spanning a shorter time, while a large pooling kernel captures a local cue spanning a longer time. Fusing local cues of various lengths allows the representation of various kinds of context information and forms a more discriminative video representation. More importantly, the multi-scale design of the invention can be extended at low cost: more and larger one-dimensional pooling layers can be added, which works better for actions of long duration that span many frames and for longer video sequence inputs.
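The implementation detail that makes the pooling "dense" is that each scale is a one-dimensional average pooling layer whose stride and padding are chosen so that the output keeps the full temporal length T, i.e. every temporal position starts a pooling window. A minimal sketch follows; the odd kernel sizes 3, 5, 7, 9 match the example configuration given later, and stride 1 with padding k // 2 is one assumed way to satisfy the length-preserving constraint.

import torch
import torch.nn as nn

T, C = 8, 2048
x = torch.randn(1, C, T)   # per-frame features arranged as (batch, channels, time) for 1-D pooling

# One AvgPool1d layer per scale; with stride 1 and padding k // 2 (k odd) the output length stays T,
# so local cues starting at every position and spanning k frames are all captured.
pools = [nn.AvgPool1d(kernel_size=k, stride=1, padding=k // 2) for k in (3, 5, 7, 9)]
for pool in pools:
    assert pool(x).shape == (1, C, T)   # every scale preserves the temporal length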
The overall scheme is shown in fig. 1. A video sequence X ∈ R^(T×3×H×W) is input, where T is the number of frames in the video sequence, the channel dimension is 3 because the input is an RGB image sequence, and H and W are the spatial dimensions of the model input after data preprocessing. Basic video features are obtained through a deep convolutional neural network, and for each frame a per-channel feature representation X′ ∈ R^(T×C) is obtained through a global average pooling layer; that is, each frame is represented by a feature vector with C channels. The obtained basic features then undergo multi-scale dense pooling; each scale is implemented with a one-dimensional average pooling layer along the temporal dimension. If there are K scales, i.e. K dense pooling layers, the output feature map of every pooling layer can be kept at the same size, X″_k ∈ R^(T×C), k = 1, …, K, by controlling the stride and padding of the pooling layers. Finally, the pooling outputs of the K scales, X″ = {X″_1, X″_2, …, X″_K}, are concatenated with the basic features along the channel dimension, and average pooling over the temporal dimension yields the final video spatio-temporal feature representation X_final ∈ R^((K+1)·C).
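Putting these pieces together, the multi-scale dense pooling module of fig. 1 can be sketched as a single PyTorch module. This is a minimal illustrative implementation under the assumptions already noted (per-frame features X′ of shape (T, C) as input, odd pooling kernels, stride 1 and padding k // 2), not the literal code of the embodiment.

import torch
import torch.nn as nn

class MultiScaleDensePooling(nn.Module):
    """Fuse per-frame features X' of shape (T, C) into a clip-level feature of size (K + 1) * C."""

    def __init__(self, kernel_sizes=(3, 5, 7, 9)):
        super().__init__()
        self.pools = nn.ModuleList(
            [nn.AvgPool1d(kernel_size=k, stride=1, padding=k // 2) for k in kernel_sizes]
        )

    def forward(self, x):
        x = x.t().unsqueeze(0)                          # (T, C) -> (1, C, T) for 1-D pooling over time
        outs = [x] + [pool(x) for pool in self.pools]   # basic features plus K dense-pooled scales
        fused = torch.cat(outs, dim=1)                  # concatenate along the channel dimension
        return fused.mean(dim=2).squeeze(0)             # average pooling over time -> ((K + 1) * C,)

feats = torch.randn(8, 2048)                            # X' for T = 8 frames, C = 2048 channels
print(MultiScaleDensePooling()(feats).shape)            # torch.Size([10240])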
A specific example network architecture and parameter design is shown in Table 1. Table 1 adopts a multi-scale dense pooling design with 4 scales, whose pooling kernel sizes are 3, 5, 7 and 9, representing feature pooling and extraction over different time spans. In complex industrial environments and video research scenarios, the number and size of the scales can be increased with little impact on model size and parameter count. Note that the invention uses the AvgPool1d layer as the implementation of dense pooling, since pooling only needs to be performed along the temporal dimension and this layer meets the design requirements of dense temporal pooling. T in the table denotes the length of each input video segment; a longer length places higher demands on model performance and implies local temporal cues with longer spans. The number of outputs of the last fully connected layer in the table differs between industrial and research fields; 625 is the number of category labels of the video pedestrian re-identification dataset Mars.
Table 1 network architecture and parameter design table
The code of the embodiment is implemented in Python; the numpy and opencv libraries are used for image processing, while model construction, training and testing are carried out with the PyTorch toolkit. The model is applicable to a variety of tools, environments and platforms.
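As an end-to-end usage sketch matching the example configuration (a ResNet-50 backbone, the four pooling scales 3, 5, 7, 9, and a 625-way classifier for the Mars dataset), and reusing frame_features and MultiScaleDensePooling from the sketches above; everything other than the 625 class count and the kernel sizes is an illustrative assumption.

import torch
import torch.nn as nn

num_classes = 625                                    # category count of the Mars re-identification dataset
pooling = MultiScaleDensePooling(kernel_sizes=(3, 5, 7, 9))
classifier = nn.Linear((4 + 1) * 2048, num_classes)  # fully connected layer over the (K + 1) * C feature

def classify_clip(video):
    # video: (T, 3, H, W) RGB clip -> class logits over the 625 Mars identities
    per_frame = frame_features(video)                # (T, 2048) per-frame basic features
    clip_feature = pooling(per_frame)                # ((4 + 1) * 2048,) clip-level spatio-temporal feature
    return classifier(clip_feature)

logits = classify_clip(torch.randn(8, 3, 256, 128))
print(logits.shape)                                  # torch.Size([625])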

Claims (2)

1. A video spatio-temporal information characterization method based on multi-scale dense temporal pooling, characterized by comprising the following specific steps:
Step one: input a video sequence;
Step two: obtain basic video features through a deep convolutional neural network, and obtain a per-channel feature representation for each frame through a global average pooling layer;
Step three: apply multi-scale dense pooling to the obtained basic features;
Step four: concatenate all the scales and the basic features along the channel dimension, and apply average pooling over the temporal dimension to obtain the final video spatio-temporal feature representation.
2. The video spatio-temporal information characterization method based on multi-scale dense temporal pooling of claim 1, characterized in that: in step one, the input video sequence is specifically
X ∈ R^(T×3×H×W),
where T is the number of frames in the video sequence; the channel dimension is 3 because the input is an RGB image sequence; and H and W are the spatial dimensions of the model input after data preprocessing;
in step two, basic video features are obtained through a deep convolutional neural network, and for each frame a per-channel feature representation is obtained through a global average pooling layer,
X′ ∈ R^(T×C),
that is, each frame is represented by a feature vector with C channels;
in step three, the obtained basic features undergo multi-scale dense pooling; each scale is implemented with a one-dimensional average pooling layer along the temporal dimension; if there are K scales, i.e. K dense pooling layers, the output feature map of every pooling layer can be kept at the same size by controlling the stride and padding of the pooling layers,
X″_k ∈ R^(T×C), k = 1, …, K;
in step four, the pooling outputs of the K scales, X″ = {X″_1, X″_2, …, X″_K}, are concatenated with the basic features along the channel dimension, and average pooling over the temporal dimension yields the final video spatio-temporal feature representation
X_final ∈ R^((K+1)·C).
CN202011074043.6A 2020-10-09 2020-10-09 Video spatio-temporal information characterization method based on multi-scale intensive time sequence pooling Withdrawn CN112364690A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011074043.6A CN112364690A (en) 2020-10-09 2020-10-09 Video spatio-temporal information characterization method based on multi-scale intensive time sequence pooling

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011074043.6A CN112364690A (en) 2020-10-09 2020-10-09 Video spatio-temporal information characterization method based on multi-scale intensive time sequence pooling

Publications (1)

Publication Number Publication Date
CN112364690A true CN112364690A (en) 2021-02-12

Family

ID=74508313

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011074043.6A Withdrawn CN112364690A (en) 2020-10-09 2020-10-09 Video spatio-temporal information characterization method based on multi-scale intensive time sequence pooling

Country Status (1)

Country Link
CN (1) CN112364690A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115376052A (en) * 2022-10-26 2022-11-22 山东百盟信息技术有限公司 Long video classification method based on key frame sampling and multi-scale dense network

Similar Documents

Publication Publication Date Title
Jiang et al. Self-supervised relative depth learning for urban scene understanding
CN107229757B (en) Video retrieval method based on deep learning and Hash coding
CN111667399B (en) Training method of style migration model, video style migration method and device
WO2021022521A1 (en) Method for processing data, and method and device for training neural network model
CN110059598B (en) Long-term fast-slow network fusion behavior identification method based on attitude joint points
CN108399380A (en) A kind of video actions detection method based on Three dimensional convolution and Faster RCNN
CN108537824B (en) Feature map enhanced network structure optimization method based on alternating deconvolution and convolution
CN110472604B (en) Pedestrian and crowd behavior identification method based on video
CN112016406B (en) Video key frame extraction method based on full convolution network
CN111046821A (en) Video behavior identification method and system and electronic equipment
CN114596520A (en) First visual angle video action identification method and device
Wei et al. Novel video prediction for large-scale scene using optical flow
CN114781629A (en) Hardware accelerator of convolutional neural network based on parallel multiplexing and parallel multiplexing method
CN112906520A (en) Gesture coding-based action recognition method and device
Ouchra et al. Object detection approaches in images: A weighted scoring model based comparative study
Lv et al. An inverted residual based lightweight network for object detection in sweeping robots
Farrajota et al. Human action recognition in videos with articulated pose information by deep networks
CN112364690A (en) Video spatio-temporal information characterization method based on multi-scale intensive time sequence pooling
CN114170657A (en) Facial emotion recognition method integrating attention mechanism and high-order feature representation
CN113469238A (en) Self-supervision learning method for solving puzzle task based on CRNN
Chandrapala et al. Invariant feature extraction from event based stimuli
CN112069979A (en) Real-time action recognition man-machine interaction system
Zhai et al. 3D dual-stream convolutional neural networks with simple recurrent unit network: A new framework for action recognition
Hao et al. Architecture self-attention mechanism: Nonlinear optimization for neural architecture search
Zhang Behaviour Detection and Recognition of College Basketball Players Based on Multimodal Sequence Matching and Deep Neural Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication (application publication date: 20210212)