WO2022134655A1 - End-to-end video action detection and positioning system - Google Patents

End-to-end video action detection and positioning system

Info

Publication number
WO2022134655A1
WO2022134655A1 (PCT/CN2021/116771; CN2021116771W)
Authority
WO
WIPO (PCT)
Prior art keywords
information
data
feature map
module
feature
Prior art date
Application number
PCT/CN2021/116771
Other languages
French (fr)
Chinese (zh)
Inventor
席道亮
许野平
刘辰飞
陈英鹏
张朝瑞
高朋
Original Assignee
神思电子技术股份有限公司
Priority date
Filing date
Publication date
Application filed by 神思电子技术股份有限公司
Publication of WO2022134655A1

Classifications

    • G PHYSICS
        • G06 COMPUTING; CALCULATING OR COUNTING
            • G06F ELECTRIC DIGITAL DATA PROCESSING
                • G06F 18/00 Pattern recognition
                    • G06F 18/20 Analysing
                        • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
                            • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
                            • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
                        • G06F 18/23 Clustering techniques
                        • G06F 18/25 Fusion techniques
                            • G06F 18/253 Fusion techniques of extracted features
            • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
                • G06V 20/00 Scenes; Scene-specific elements
                    • G06V 20/40 Scenes; Scene-specific elements in video content
                        • G06V 20/46 Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
                • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
                    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • H ELECTRICITY
        • H04 ELECTRIC COMMUNICATION TECHNIQUE
            • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
                • H04N 19/00 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals
                    • H04N 19/40 Methods or arrangements for coding, decoding, compressing or decompressing digital video signals using video transcoding, i.e. partial or full decoding of a coded input stream followed by re-encoding of the decoded output stream


Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Signal Processing (AREA)
  • Image Analysis (AREA)

Abstract

An end-to-end video action detection and positioning system, which relates to the field of human action recognition. The positioning process of the system comprises: decoding a video; reorganizing data, i.e. setting a data sampling frequency, reading a video clip of fixed length, and recombining the data into an inputtable format for the next module; performing a computation operation on the input data; extracting spatial key information, i.e. processing the feature information extracted by a spatiotemporal information parsing unit module so that the features extracted by the network focus on the more useful spatial information in the image, filtering out background information, and enhancing the features of the positions in the image where actions occur; integrating and mining channel information, i.e. performing channel-level information integration on the data features obtained by the spatiotemporal information parsing unit module, mining motion information, focusing on motion information between frames and on the type of behavior that occurs; and outputting a prediction result, using a 1×1 convolution to output a feature map with the corresponding number of channels.

Description

An end-to-end video action detection and positioning system
Technical Field
The present invention relates to the technical field of human action recognition, and in particular to an end-to-end video action detection and positioning system.
Background Art
The statements herein merely provide background related to the present invention and do not necessarily constitute prior art.
Behavior recognition analyzes multiple consecutive frames of a given video clip to recognize its content, usually human actions such as fighting or falling to the ground. In practical applications it can identify dangerous behavior occurring in a scene, has a wide range of application scenarios, and is a long-standing research hotspot in computer vision. Current deep-learning-based behavior recognition algorithms can not only identify the type of action but also locate the spatial position where the action occurs, and they achieve high accuracy in multi-target, complex scenes.
In the paper "Learning Spatiotemporal Features with 3D Convolutional Networks", Du Tran et al. proposed a simple and effective method that trains deep 3D convolutional networks (3D ConvNets) on large-scale supervised video datasets. Compared with 2D ConvNets, this method is better suited to learning spatiotemporal features and to expressing the continuity between frames; on the UCF101 dataset it matched the accuracy of the best contemporary methods with fewer dimensions. Its simple 3D convolutional architecture is computationally efficient, fast in forward propagation, and easy to train and use. Its drawbacks are that the recognition target is a single person in a simple scene; in complex scenes the recognition accuracy is low, the false alarm rate is high, and the model has essentially no generalization ability, so it cannot be deployed in real complex environments, nor can it locate the position in the frame where the action occurs.
The paper "Two-Stream Convolutional Networks for Action Recognition in Videos" proposes a two-stream network for action classification. The method uses two parallel networks, a spatial stream ConvNet and a temporal stream ConvNet. The former is a classification network whose input is a static image and which captures appearance information; the latter takes dense optical flow over multiple consecutive frames as input and captures motion information. The classification scores of the two networks are finally fused through softmax. The method is accurate and can be applied to complex multi-person scenes, but its drawback is that the optical flow of the video clip to be detected must be computed in advance, so real-time detection is not possible, and the position where the action occurs still cannot be located.
Chinese patent 201810292563 discloses a video action classification model training method, device, and video action classification method. Its advantage is that it can obtain training image frames from multiple labeled training videos and, building on features learned from easier training video frames, learn the discriminative features between harder training image frames and easier ones, so that training videos can be classified more accurately. However, the method still cannot locate the spatial position and the starting time of the action in the frame.
Chinese patent 201810707711 discloses a video-based behavior recognition method, behavior recognition device, and terminal equipment. Its innovation lies in using a convolutional neural network together with a long short-term memory (LSTM) network for temporal modeling, adding temporal information between frames, which effectively addresses the complex background information and insufficient temporal modeling ability of existing behavior recognition methods. However, the method cannot be trained end to end and detects single RGB image frames separately, so its recognition accuracy is low in scenes with complex backgrounds.
Chinese patent 201210345589.X discloses a behavior recognition method based on an action subspace and a weighted behavior recognition model. Its advantages are that the input is the video sequence to be detected, the temporal information of the action is extracted, and background subtraction is used to remove the influence of background noise on the foreground; it can accurately identify human behaviors that change over time and with people entering or leaving the area, and it is robust to noise and other influencing factors. However, the method cannot make accurate judgments when multiple behaviors are present in the same scene.
Summary of the Invention
In view of the deficiencies of the prior art, the purpose of the present invention is to provide an end-to-end video action detection and positioning system that can locate the spatial position where an action occurs once the video sequence to be detected is input.
The present invention specifically adopts the following technical scheme:
An end-to-end video action detection and positioning system comprises a video decoding module and a data reorganization module, and the positioning process includes the following steps (a schematic sketch of the overall pipeline is given after this list):
(1) Video decoding: the video decoding module feeds the network video stream over the network line into the video decoding unit, which decodes the stream into individual RGB image frames on an SOC (system-on-chip) and passes them to the data reorganization module for data preprocessing;
(2) Data reorganization: a data sampling frequency is set, video clips of fixed length are read, and the data are recombined into an inputtable format for the next module;
(3) A computation operation is performed on the input data;
(4) Spatial key information extraction: the feature information extracted by the spatiotemporal information parsing unit module is processed so that the features extracted by the network focus on the more useful spatial information in the image, background information is filtered out, and the features at the positions where actions occur are enhanced;
(5) Channel information integration and mining: the data features produced by the spatiotemporal information parsing unit module undergo channel-level information integration to mine motion information, focusing on motion information between frames and on the type of behavior that occurs;
(6) Prediction result output: a 1x1 convolution outputs a feature map with the corresponding number of channels.
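The following minimal PyTorch-style sketch shows one way the six steps above could be wired together. The module interfaces, layer shapes, and the use of a generic 3D CNN as the spatiotemporal information parsing unit are illustrative assumptions, not the patent's reference implementation, and only a single output scale is shown.

```python
import torch
import torch.nn as nn

class ActionDetector(nn.Module):
    """Hypothetical wiring of steps (1)-(6); module internals are placeholders."""
    def __init__(self, backbone3d, spatial_module, channel_module,
                 head_in_channels, num_class, num_anchors=3):
        super().__init__()
        self.backbone3d = backbone3d          # step (3): spatiotemporal information parsing unit (e.g. a 3D CNN)
        self.spatial_module = spatial_module  # step (4): spatial key information extraction
        self.channel_module = channel_module  # step (5): channel information integration mining
        # step (6): 1x1 convolution giving 3*(NumClass+5) channels per location
        # (one scale shown; the description mentions four output layers)
        self.head = nn.Conv2d(head_in_channels, num_anchors * (num_class + 5), kernel_size=1)

    def forward(self, clip):                  # clip: (B, 3, D, H, W) with D = 8 or 16 decoded frames
        feat = self.backbone3d(clip)          # (B, C1, D'=1, H1, W1)
        feat = feat.squeeze(2)                # 4-D -> 3-D feature map per sample
        x_out = self.spatial_module(feat)     # spatially re-weighted key-information features
        fused = self.channel_module(x_out, feat)
        return self.head(fused)               # (B, 3*(NumClass+5), H1, W1)
```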
Preferably, the specific process of data reorganization is:
When prediction starts, a video clip of fixed length n is taken and processed into unit data Ydst, which is input to the spatiotemporal information parsing unit module; n equals 8 or 16. Before being input to the spatiotemporal information parsing unit module, every RGB image in the unit data Ydst must be resized to a fixed size.
Let a single image of the source video clip be denoted Xsrc and the fixed-size image input to the spatiotemporal information parsing unit module be denoted Xdst. After scaling, each pixel of Xdst is computed as follows:
(1) For each pixel in Xdst, let the floating-point coordinates obtained by the inverse transform be (i+u, j+v), where i and j are the integer parts of the coordinates and u and v are the fractional parts, i.e. floating-point numbers in the interval [0, 1);
(2) The pixel value f(i+u, j+v) is determined by the four surrounding pixels of the original image at coordinates (i, j), (i+1, j), (i, j+1), and (i+1, j+1), i.e.
f(i+u, j+v) = (1-u)(1-v)·f(i, j) + (1-u)v·f(i, j+1) + u(1-v)·f(i+1, j) + uv·f(i+1, j+1),
where f(i, j) denotes the pixel value of the source image at (i, j).
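A minimal NumPy sketch of this bilinear resampling step, assuming a single-channel image and a simple scaling inverse transform; the function name and boundary handling are assumptions for illustration.

```python
import numpy as np

def bilinear_resize(src: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Resize a single-channel image with the bilinear formula above."""
    in_h, in_w = src.shape
    dst = np.zeros((out_h, out_w), dtype=np.float64)
    for y in range(out_h):
        for x in range(out_w):
            # inverse transform of the destination coordinate into the source image
            fy = y * (in_h - 1) / max(out_h - 1, 1)
            fx = x * (in_w - 1) / max(out_w - 1, 1)
            i, j = int(fy), int(fx)                      # integer parts
            u, v = fy - i, fx - j                        # fractional parts in [0, 1)
            i1, j1 = min(i + 1, in_h - 1), min(j + 1, in_w - 1)
            dst[y, x] = ((1 - u) * (1 - v) * src[i, j]
                         + (1 - u) * v * src[i, j1]
                         + u * (1 - v) * src[i1, j]
                         + u * v * src[i1, j1])
    return dst
```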
Preferably, the computation operation on the input data includes the following processes:
(1) The video unit data Ydst is input to the spatiotemporal information parsing unit module as a series of RGB image frames R^(C×D×H×W), where C = 3 is the number of channels of each RGB frame, D is the number of images in each group of unit data Ydst (at most 16), and H and W are the width and height of each image in the group. The spatiotemporal information parsing unit module outputs a feature map (formula PCTCN2021116771-appb-000001), where C1, H1, and W1 denote the number of channels, the width, and the height of the output feature map. To match the output dimensions of the spatial key information extraction module, D' = 1 is enforced, and the four-dimensional output of the spatiotemporal information parsing unit module is then transformed into three-dimensional data by a dimension transform; the output feature map is expressed as (formula PCTCN2021116771-appb-000002).
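A short PyTorch sketch of the dimension handling described here, assuming a generic 3D CNN stands in for the spatiotemporal information parsing unit; the backbone layers and output sizes are placeholders, not the patent's architecture.

```python
import torch
import torch.nn as nn

# Placeholder backbone: any 3D CNN whose temporal output depth D' is 1.
backbone3d = nn.Sequential(
    nn.Conv3d(3, 64, kernel_size=3, stride=(1, 2, 2), padding=1),
    nn.ReLU(inplace=True),
    nn.AdaptiveAvgPool3d((1, 56, 56)),   # force D' = 1, H1 = W1 = 56 (illustrative sizes)
)

clip = torch.randn(1, 3, 16, 224, 224)   # Ydst as R^(C x D x H x W) with C = 3, D = 16
feat4d = backbone3d(clip)                # (1, C1, D'=1, H1, W1)
feat3d = feat4d.squeeze(2)               # dimension transform: 4-D -> 3-D feature map R^(C1 x H1 x W1)
print(feat3d.shape)                      # torch.Size([1, 64, 56, 56])
```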
(2) A spatial key information extraction module is added so that the network pays more attention to the features of the object where the behavior occurs. The input of this module is (formula PCTCN2021116771-appb-000003) and the output feature map is (formula PCTCN2021116771-appb-000004).
Preferably, the spatial key information extraction includes the following processes:
(1) Let the output feature map of the spatiotemporal information parsing unit module have size (formula PCTCN2021116771-appb-000005). The feature map is input to the spatial key information extraction module to obtain R_f1 and R_f2 (formulas PCTCN2021116771-appb-000006 and PCTCN2021116771-appb-000007),
where f1() denotes the averaging operation on the feature matrix and f2() denotes the feature extraction operation on the matrix;
(2) R_f1 and R_f2 are added along the first dimension to obtain the merged spatial feature information R_f = R_f1 + R_f2;
(3) R_f undergoes spatial feature fusion: R_f is input to the fused-feature normalization unit, which enhances the spatial features, and normalizing the enhanced features makes the subsequent computation more efficient:
X = f_fuse(R_f)
X_out = f_normalize(X)
where X denotes the fused feature map, the fusion function f_fuse() integrates the information of the feature R_f, and the normalization function f_normalize() normalizes the enhanced features to the range 0 to 1.
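A sketch of the spatial key information extraction module under stated assumptions: the exact definitions of f1(), f2() and f_fuse() are given in formula images not reproduced in this text, so the code below assumes a channel-wise mean for f1(), a channel-wise max for f2(), a 7x7 convolution for f_fuse(), and a sigmoid for f_normalize(), which is one common pattern for spatial attention.

```python
import torch
import torch.nn as nn

class SpatialKeyInfo(nn.Module):
    """Sketch of spatial key information extraction (assumed f1/f2/f_fuse choices)."""
    def __init__(self, out_channels: int = 1):
        super().__init__()
        self.fuse = nn.Conv2d(1, out_channels, kernel_size=7, padding=3)

    def forward(self, x):                       # x: (B, C1, H1, W1) from the parsing unit
        r_f1 = x.mean(dim=1, keepdim=True)      # f1(): averaging over the feature matrix
        r_f2, _ = x.max(dim=1, keepdim=True)    # f2(): assumed max-based feature extraction
        r_f = r_f1 + r_f2                       # merge along the first dimension
        x_fused = self.fuse(r_f)                # f_fuse(): integrate the information of R_f
        x_out = torch.sigmoid(x_fused)          # f_normalize(): squash to the range 0..1
        return x_out                            # X_out, later merged by channel downstream
```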
Preferably, the channel information integration and mining includes the following steps:
(1) The data features obtained by the spatial key information extraction module are expressed as (formula PCTCN2021116771-appb-000008), and the features of the spatiotemporal information parsing unit module are expressed as (formula PCTCN2021116771-appb-000009). To reduce the information loss of the channel information integration mining module, X_out and (formula PCTCN2021116771-appb-000010) are input and their feature information is merged by channel, outputting a feature map Y;
(2) A channel compression unit vectorizes the feature map Y into Z (formula PCTCN2021116771-appb-000011), where f_vector() denotes the vectorization function and Z denotes the vectorized representation of the feature map; C3 denotes the sum of the channel counts, C3 = C1 + C2, and N denotes the vectorized size of each feature map, N = H1 * W1.
From the feature matrix Z and its transpose Z^T (T denotes the matrix transpose), the feature matrix I is generated; each element of this matrix is the value of the inner product of Z and Z^T, and the generated matrix I has dimension C3 x C3. The formula for generating matrix I is (formula PCTCN2021116771-appb-000012), where the parameters i and j index the rows and columns of the matrix Z, and n runs from zero up to a maximum of N. The following operation is then performed on this matrix to generate the feature map (formula PCTCN2021116771-appb-000013); the formula for computing matrix E is (formula PCTCN2021116771-appb-000014).
Each value in the feature map (formula PCTCN2021116771-appb-000015) lies between 0 and 1 and indicates the degree to which the j-th channel influences the i-th channel;
(3) To further account for the influence of the feature map E on the original feature map Z, Z* must be computed. First, matrix E is transposed and multiplied with Z:
Z* = E^T * Z
Z* is then restored to a three-dimensional output by a dimension transform (formula PCTCN2021116771-appb-000016), where the function f_reshape() expands the dimensions. The final output of the feature map is (formula PCTCN2021116771-appb-000017), computed as O = Z* + X_out.
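The channel integration in steps (1)-(3) resembles a Gram-matrix style channel self-attention block. The sketch below is one plausible reading: it assumes E is a row-wise softmax normalization of I (the text only states that its values lie between 0 and 1) and takes the residual term to be the concatenated input, since the channel counts must match for the addition O = Z* + X_out.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelIntegration(nn.Module):
    """Sketch of channel information integration mining (one possible reading)."""
    def forward(self, x_out, feat):
        # (1) merge the two feature maps by channel: Y has C3 = C1 + C2 channels
        y = torch.cat([x_out, feat], dim=1)          # (B, C3, H1, W1)
        b, c3, h, w = y.shape
        z = y.view(b, c3, h * w)                     # f_vector(): Z with N = H1 * W1
        # (2) channel affinity I = Z Z^T, then normalize to E with values in 0..1
        i_mat = torch.bmm(z, z.transpose(1, 2))      # (B, C3, C3)
        e_mat = F.softmax(i_mat, dim=-1)             # assumed normalization
        # (3) re-weight channels: Z* = E^T Z, reshape back, residual addition
        z_star = torch.bmm(e_mat.transpose(1, 2), z) # (B, C3, N)
        z_star = z_star.view(b, c3, h, w)            # f_reshape(): restore 3-D layout
        return z_star + y                            # O = Z* + (concatenated input)
```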
Preferably, the prediction result output includes the following steps:
Three prediction boxes are generated for each feature point in the image, and the whole network is designed with four output layers; therefore, before network training, a clustering algorithm is applied to all bounding boxes (bboxes) in the dataset to generate 12 anchor boxes. The coordinate regression generates the final output size of each layer of the model according to the number of predicted categories, [(3 × (NumClass + 5)) × H × W], where NumClass is the number of predicted categories. During training, to adapt to the categories in the current dataset, the following loss function is used for category prediction, with the loss value loss_c computed as:
loss_c = -Σ a' * ln(a)
where a' denotes the true category value in the label and a denotes the category output value predicted by the model. The coordinate loss value loss_coord is computed as:
loss_coord = -y' * log(y) - (1 - y') * log(1 - y)
where y' denotes the real coordinate value in the label and y denotes the model's predicted coordinate output.
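A small PyTorch sketch of the two loss terms as written above: a cross-entropy term over class scores and a binary-cross-entropy-style term over normalized coordinates. The sum reduction and the assumption that the predictions are already probabilities in (0, 1) are illustrative choices.

```python
import torch

def class_loss(a_true: torch.Tensor, a_pred: torch.Tensor) -> torch.Tensor:
    # loss_c = -sum(a' * ln(a)); a_pred assumed to be probabilities (e.g. after softmax)
    return -(a_true * torch.log(a_pred.clamp_min(1e-7))).sum()

def coord_loss(y_true: torch.Tensor, y_pred: torch.Tensor) -> torch.Tensor:
    # loss_coord = -y' * log(y) - (1 - y') * log(1 - y); y_pred assumed in (0, 1), e.g. after sigmoid
    y_pred = y_pred.clamp(1e-7, 1 - 1e-7)
    return (-(y_true * torch.log(y_pred)) - (1 - y_true) * torch.log(1 - y_pred)).sum()
```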
The present invention has the following beneficial effects:
The spatial key information extraction module and the channel information integration mining module improve the accuracy of behavior recognition and allow multiple behaviors to be recognized simultaneously in complex scenes.
Combining the idea of bounding-box regression from object detection networks with video classification increases the generalization ability of the model and improves the robustness of recognition across different scenes.
Description of the Drawings
The accompanying drawings, which form a part of the present invention, are provided for further understanding of the present invention; the exemplary embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of the present invention.
Figure 1 is a structural diagram of the end-to-end video action detection and positioning system.
Detailed Description of the Embodiments
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the present invention. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It should also be noted that the terminology used herein is for the purpose of describing specific embodiments only and is not intended to limit the exemplary embodiments of the present invention. As used herein, unless the context clearly indicates otherwise, the singular forms are intended to include the plural forms as well; furthermore, when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.
The specific embodiments of the present invention are further described below with reference to the accompanying drawings and specific examples:
With reference to Figure 1, an end-to-end video action detection and positioning system comprises a video decoding module and a data reorganization module, and the positioning process includes the following steps:
(1) Video decoding: the video decoding module feeds the network video stream over the network line into the video decoding unit, which decodes the stream into individual RGB image frames on an SOC (system-on-chip) and passes them to the data reorganization module for data preprocessing.
(2) Data reorganization: a data sampling frequency is set, video clips of fixed length are read, and the data are recombined into an inputtable format for the next module.
(3) A computation operation is performed on the input data.
(4) Spatial key information extraction: the feature information extracted by the spatiotemporal information parsing unit module is processed so that the features extracted by the network focus on the more useful spatial information in the image, background information is filtered out, and the features at the positions where actions occur are enhanced.
(5) Channel information integration and mining: the data features produced by the spatiotemporal information parsing unit module undergo channel-level information integration to mine motion information, focusing on motion information between frames and on the type of behavior that occurs.
(6) Prediction result output: a 1x1 convolution outputs a feature map with the corresponding number of channels.
The specific process of data reorganization is as follows:
When prediction starts, a video clip of fixed length n is taken and processed into unit data Ydst, which is input to the spatiotemporal information parsing unit module; n equals 8 or 16. Before being input to the spatiotemporal information parsing unit module, every RGB image in the unit data Ydst must be resized to a fixed size.
Let a single image of the source video clip be denoted Xsrc and the fixed-size image input to the spatiotemporal information parsing unit module be denoted Xdst. After scaling, each pixel of Xdst is computed as follows:
(1) For each pixel in Xdst, let the floating-point coordinates obtained by the inverse transform be (i+u, j+v), where i and j are the integer parts of the coordinates and u and v are the fractional parts, i.e. floating-point numbers in the interval [0, 1);
(2) The pixel value f(i+u, j+v) is determined by the four surrounding pixels of the original image at coordinates (i, j), (i+1, j), (i, j+1), and (i+1, j+1), i.e.
f(i+u, j+v) = (1-u)(1-v)·f(i, j) + (1-u)v·f(i, j+1) + u(1-v)·f(i+1, j) + uv·f(i+1, j+1),
where f(i, j) denotes the pixel value of the source image at (i, j).
Preferably, the computation operation on the input data includes the following processes:
(1) The video unit data Ydst is input to the spatiotemporal information parsing unit module as a series of RGB image frames R^(C×D×H×W), where C = 3 is the number of channels of each RGB frame, D is the number of images in each group of unit data Ydst (at most 16), and H and W are the width and height of each image in the group. The spatiotemporal information parsing unit module outputs a feature map (formula PCTCN2021116771-appb-000018), where C1, H1, and W1 denote the number of channels, the width, and the height of the output feature map. To match the output dimensions of the spatial key information extraction module, D' = 1 is enforced, and the four-dimensional output of the spatiotemporal information parsing unit module is then transformed into three-dimensional data by a dimension transform; the output feature map is expressed as (formula PCTCN2021116771-appb-000019).
(2) A spatial key information extraction module is added so that the network pays more attention to the features of the object where the behavior occurs. The input of this module is (formula PCTCN2021116771-appb-000020) and the output feature map is (formula PCTCN2021116771-appb-000021).
Spatial key information extraction includes the following processes:
(1) Let the output feature map of the spatiotemporal information parsing unit module have size (formula PCTCN2021116771-appb-000022). The feature map is input to the spatial key information extraction module to obtain R_f1 and R_f2 (formulas PCTCN2021116771-appb-000023 and PCTCN2021116771-appb-000024),
where f1() denotes the averaging operation on the feature matrix and f2() denotes the feature extraction operation on the matrix;
(2) R_f1 and R_f2 are added along the first dimension to obtain the merged spatial feature information R_f = R_f1 + R_f2;
(3) R_f undergoes spatial feature fusion: R_f is input to the fused-feature normalization unit, which enhances the spatial features, and normalizing the enhanced features makes the subsequent computation more efficient:
X = f_fuse(R_f)
X_out = f_normalize(X)
where X denotes the fused feature map, the fusion function f_fuse() integrates the information of the feature R_f, and the normalization function f_normalize() normalizes the enhanced features to the range 0 to 1.
Channel information integration and mining includes the following steps:
(1) The data features obtained by the spatial key information extraction module are expressed as (formula PCTCN2021116771-appb-000025), and the features of the spatiotemporal information parsing unit module are expressed as (formula PCTCN2021116771-appb-000026). To reduce the information loss of the channel information integration mining module, X_out and (formula PCTCN2021116771-appb-000027) are input and their feature information is merged by channel, outputting a feature map Y;
(2) A channel compression unit vectorizes the feature map Y into Z (formula PCTCN2021116771-appb-000028), where f_vector() denotes the vectorization function and Z denotes the vectorized representation of the feature map; C3 denotes the sum of the channel counts, C3 = C1 + C2, and N denotes the vectorized size of each feature map, N = H1 * W1.
From the feature matrix Z and its transpose Z^T (T denotes the matrix transpose), the feature matrix I is generated; each element of this matrix is the value of the inner product of Z and Z^T, and the generated matrix I has dimension C3 x C3. The formula for generating matrix I is (formula PCTCN2021116771-appb-000029), where the parameters i and j index the rows and columns of the matrix Z, and n runs from zero up to a maximum of N. The following operation is then performed on this matrix to generate the feature map (formula PCTCN2021116771-appb-000030); the formula for computing matrix E is (formula PCTCN2021116771-appb-000031).
Each value in the feature map (formula PCTCN2021116771-appb-000032) lies between 0 and 1 and indicates the degree to which the j-th channel influences the i-th channel;
(3) To further account for the influence of the feature map E on the original feature map Z, Z* must be computed. First, matrix E is transposed and multiplied with Z:
Z* = E^T * Z
Z* is then restored to a three-dimensional output by a dimension transform (formula PCTCN2021116771-appb-000033), where the function f_reshape() expands the dimensions. The final output of the feature map is (formula PCTCN2021116771-appb-000034), computed as O = Z* + X_out.
The prediction result output includes the following steps:
Three prediction boxes are generated for each feature point in the image, and the whole network is designed with four output layers; therefore, before network training, a clustering algorithm is applied to all bounding boxes (bboxes) in the dataset to generate 12 anchor boxes. The coordinate regression generates the final output size of each layer of the model according to the number of predicted categories, [(3 × (NumClass + 5)) × H × W], where NumClass is the number of predicted categories. During training, to adapt to the categories in the current dataset, the following loss function is used for category prediction, with the loss value loss_c computed as:
loss_c = -Σ a' * ln(a)
where a' denotes the true category value in the label and a denotes the category output value predicted by the model. The coordinate loss value loss_coord is computed as:
loss_coord = -y' * log(y) - (1 - y') * log(1 - y)
where y' denotes the real coordinate value in the label and y denotes the model's predicted coordinate output.
The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall be included within the protection scope of the present invention.

Claims (6)

  1. An end-to-end video action detection and positioning system, comprising a video decoding module and a data reorganization module, characterized in that the positioning process includes the following steps:
    (1) Video decoding: the video decoding module feeds the network video stream over the network line into the video decoding unit, which decodes the stream into individual RGB image frames on an SOC (system-on-chip) and passes them to the data reorganization module for data preprocessing;
    (2) Data reorganization: a data sampling frequency is set, video clips of fixed length are read, and the data are recombined into an inputtable format for the next module;
    (3) A computation operation is performed on the input data;
    (4) Spatial key information extraction: the feature information extracted by the spatiotemporal information parsing unit module is processed so that the features extracted by the network focus on the more useful spatial information in the image, background information is filtered out, and the features at the positions where actions occur are enhanced;
    (5) Channel information integration and mining: the data features produced by the spatiotemporal information parsing unit module undergo channel-level information integration to mine motion information, focusing on motion information between frames and on the type of behavior that occurs;
    (6) Prediction result output: a 1x1 convolution outputs a feature map with the corresponding number of channels.
  2. The end-to-end video action detection and positioning system according to claim 1, characterized in that the specific process of data reorganization is:
    When prediction starts, a video clip of fixed length n is taken and processed into unit data Ydst, which is input to the spatiotemporal information parsing unit module; n equals 8 or 16. Before being input to the spatiotemporal information parsing unit module, every RGB image in the unit data Ydst must be resized to a fixed size;
    Let a single image of the source video clip be denoted Xsrc and the fixed-size image input to the spatiotemporal information parsing unit module be denoted Xdst; after scaling, each pixel of Xdst is computed as follows:
    (1) For each pixel in Xdst, let the floating-point coordinates obtained by the inverse transform be (i+u, j+v), where i and j are the integer parts of the coordinates and u and v are the fractional parts, i.e. floating-point numbers in the interval [0, 1);
    (2) The pixel value f(i+u, j+v) is determined by the four surrounding pixels of the original image at coordinates (i, j), (i+1, j), (i, j+1), and (i+1, j+1), i.e.
    f(i+u, j+v) = (1-u)(1-v)·f(i, j) + (1-u)v·f(i, j+1) + u(1-v)·f(i+1, j) + uv·f(i+1, j+1),
    where f(i, j) denotes the pixel value of the source image at (i, j).
  3. The end-to-end video action detection and positioning system according to claim 1, characterized in that the computation operation on the input data includes the following processes:
    (1) The video unit data Ydst is input to the spatiotemporal information parsing unit module as a series of RGB image frames R^(C×D×H×W), where C = 3 is the number of channels of each RGB frame, D is the number of images in each group of unit data Ydst (at most 16), and H and W are the width and height of each image in the group; the spatiotemporal information parsing unit module outputs a feature map (formula PCTCN2021116771-appb-100001), where C1, H1, and W1 denote the number of channels, the width, and the height of the output feature map; to match the output dimensions of the spatial key information extraction module, D' = 1 is enforced, and the four-dimensional output of the spatiotemporal information parsing unit module is then transformed into three-dimensional data by a dimension transform; the output feature map is expressed as (formula PCTCN2021116771-appb-100002);
    (2) A spatial key information extraction module is added so that the network pays more attention to the features of the object where the behavior occurs; the input of this module is (formula PCTCN2021116771-appb-100003) and the output feature map is (formula PCTCN2021116771-appb-100004).
  4. 如权利要求1所述的一种端到端的视频动作检测定位系统,其特征在于,空间关键信息提取包括以下过程:An end-to-end video action detection and positioning system as claimed in claim 1, wherein the extraction of key spatial information comprises the following process:
    (1) Let the output feature map of the spatiotemporal information parsing unit module have size R^(C1×H1×W1); this feature map is input to the spatial key information extraction module to obtain R_f1 and R_f2:
    R_f1 = f_1(R^(C1×H1×W1))
    R_f2 = f_2(R^(C1×H1×W1))
    where f_1() denotes the averaging operation on the feature matrix and f_2() denotes the feature extraction operation on the matrix;
    (2) R_f1 and R_f2 are added along the first dimension to obtain the merged spatial feature information
    R_f = R_f1 + R_f2
    (3) Spatial feature fusion is applied to R_f: R_f is input to the fused-feature normalization unit, which enhances the spatial features and normalizes the enhanced features so that subsequent computation is more efficient:
    X = f_fuse(R_f)
    X_out = f_normalize(X)
    where X denotes the fused feature map, the fusion function f_fuse() integrates the feature information, and the normalization function f_normalize() normalizes the enhanced features to the range 0 to 1.
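A minimal sketch of this spatial key information extraction, with concrete stand-ins for the operations the claim leaves abstract: f_1 is taken as a channel-wise mean, f_2 as a 1×1 convolution, f_fuse as a 3×3 convolution and f_normalize as a sigmoid. These specific layers, and the class and parameter names, are illustrative assumptions only.

```python
import torch
import torch.nn as nn

class SpatialKeyInfoExtraction(nn.Module):
    """Sketch of R_f1 = f_1(X), R_f2 = f_2(X), R_f = R_f1 + R_f2,
    X = f_fuse(R_f), X_out = f_normalize(X). The concrete layers are assumed."""

    def __init__(self, c1: int, c2: int):
        super().__init__()
        self.f2 = nn.Conv2d(c1, 1, kernel_size=1)                  # assumed feature extraction
        self.f_fuse = nn.Conv2d(1, c2, kernel_size=3, padding=1)   # assumed fusion
        self.f_normalize = nn.Sigmoid()                            # squashes values to 0..1

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, C1, H1, W1), the reshaped spatiotemporal feature map
        r_f1 = x.mean(dim=1, keepdim=True)    # f_1: averaging -> (batch, 1, H1, W1)
        r_f2 = self.f2(x)                     # f_2 -> (batch, 1, H1, W1)
        r_f = r_f1 + r_f2                     # merge along the first (channel) dimension
        fused = self.f_fuse(r_f)              # (batch, C2, H1, W1)
        return self.f_normalize(fused)        # X_out with values in [0, 1]
```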
  5. The end-to-end video action detection and positioning system according to claim 1, wherein the channel information integration mining comprises the following steps:
    (1) The data feature obtained by the spatial key information extraction module is denoted X_out ∈ R^(C2×H1×W1), and the feature of the spatiotemporal information parsing unit module is denoted R^(C1×H1×W1); to reduce the information loss of the channel information integration mining module, X_out and R^(C1×H1×W1) are input and their feature information is merged by channel, producing the output feature map Y;
    (2) The channel compression unit vectorizes the feature map Y into Z, Z = f_vector(Y), where the function f_vector() denotes the vectorization function and Z ∈ R^(C3×N) is the vectorized representation of the feature map; C3 denotes the sum of the channel counts, C3 = C1 + C2, and N denotes the vectorized size of each feature map, N = H1*W1;
    Multiplying the feature matrix Z by its transpose Z^T (T denotes matrix transposition) generates a feature matrix in which each element is the value of an inner product of rows of Z; the matrix I has dimensions C3×C3 and is computed as
    I(i, j) = Σ_n Z(i, n)·Z(j, n)
    where the parameters i and j index the rows and columns of the matrix Z and n runs from zero up to a maximum of N; the following operation is then applied to this matrix to generate the feature map E ∈ R^(C3×C3), computed according to the formula shown in image PCTCN2021116771-appb-100014;
    Each value of the feature map E ∈ R^(C3×C3) lies between 0 and 1, and it indicates the degree to which the j-th channel influences the i-th channel;
    (3) To further account for the influence of the feature map E on the original feature map Z, Z′ must be computed; first the matrix E is transposed and multiplied with Z:
    Z′ = E^T * Z
    Z′ is then restored to a three-dimensional output by a dimension transform:
    Z″ = f_reshape(Z′), Z″ ∈ R^(C3×H1×W1)
    where the function f_reshape() mainly expands the dimensions; the final output feature map is O ∈ R^(C3×H1×W1), computed as O = Z″ + X_out.
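The claim describes a channel-affinity mechanism: concatenate by channel, vectorize to Z ∈ R^(C3×N), form I = Z·Z^T, normalize it to E with values in [0, 1], compute Z′ = E^T·Z, reshape and add a residual. The normalization producing E appears only as a formula image in the published text, so a row-wise softmax is assumed below, and the residual is taken over the channel-concatenated map so that shapes line up; both choices are assumptions of this sketch.

```python
import torch
import torch.nn.functional as F

def channel_info_integration(x_out: torch.Tensor, feat: torch.Tensor) -> torch.Tensor:
    """Sketch of the channel information integration mining step.
    x_out: (batch, C2, H1, W1) from the spatial key information extraction module.
    feat:  (batch, C1, H1, W1) from the spatiotemporal information parsing unit."""
    b, _, h1, w1 = x_out.shape
    y = torch.cat([x_out, feat], dim=1)        # merge by channel: (batch, C3, H1, W1)
    c3 = y.shape[1]
    z = y.view(b, c3, h1 * w1)                 # f_vector: (batch, C3, N), N = H1*W1
    i_mat = torch.bmm(z, z.transpose(1, 2))    # I = Z Z^T: (batch, C3, C3), inner products
    e = F.softmax(i_mat, dim=-1)               # assumed normalization; values in [0, 1]
    z_prime = torch.bmm(e.transpose(1, 2), z)  # Z' = E^T Z: (batch, C3, N)
    z_dprime = z_prime.view(b, c3, h1, w1)     # f_reshape back to 3-D feature maps
    return z_dprime + y                        # residual; the claim writes O = Z'' + X_out
```

Because E has shape C3×C3, its size does not grow with the spatial resolution H1×W1, which keeps this channel-wise step comparatively cheap.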
  6. The end-to-end video action detection and positioning system according to claim 1, wherein the prediction result output comprises the following steps:
    For each feature point in the image, 3 prediction boxes are generated, and the whole network model is designed with four output layers; therefore, before network training, a clustering algorithm is applied to all bounding boxes in the dataset to generate 12 preset boxes. The coordinate regression generates the final output size of each layer of the model according to the number of predicted categories, [(3×(NumClass+5))×H×W], where NumClass is the number of predicted categories. During training, to adapt to the categories in the current dataset, the following loss function is adopted for category prediction, with loss value loss_c computed as:
    loss_c = -∑ a′*ln(a)
    where a′ represents the true value in the label and a represents the category output value predicted by the model; the loss value of the coordinate loss function, loss_coord, is computed as:
    loss_coord = -y′*log(y) - (1-y′)*log(1-y)
    where y′ represents the true coordinate value in the label and y represents the output value of the coordinate predicted by the model.
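The two losses named in the claim can be written out directly. The sketch below assumes the predicted class scores have already been converted to probabilities and the predicted coordinates squashed into (0, 1); the tensor shapes, function names and the eps clamp are illustrative assumptions, not taken from the application.

```python
import torch

def category_loss(a: torch.Tensor, a_true: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """loss_c = -sum(a' * ln(a)); a are predicted class probabilities,
    a_true the one-hot labels (assumed shape: (num_boxes, NumClass))."""
    return -(a_true * torch.log(a.clamp_min(eps))).sum()

def coordinate_loss(y: torch.Tensor, y_true: torch.Tensor, eps: float = 1e-7) -> torch.Tensor:
    """loss_coord = -y'*log(y) - (1-y')*log(1-y); y are predicted coordinates
    in (0, 1), y_true the encoded ground-truth coordinates."""
    y = y.clamp(eps, 1 - eps)
    return (-(y_true * torch.log(y) + (1 - y_true) * torch.log(1 - y))).sum()
```

The 12 preset boxes mentioned in the claim (3 prediction boxes per feature point across four output layers) would typically be produced by clustering the training bounding boxes beforehand; that clustering step is not shown here.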
PCT/CN2021/116771 2020-12-25 2021-09-06 End-to-end video action detection and positioning system WO2022134655A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011560837.3 2020-12-25
CN202011560837.3A CN113158723B (en) 2020-12-25 2020-12-25 End-to-end video motion detection positioning system

Publications (1)

Publication Number Publication Date
WO2022134655A1 true WO2022134655A1 (en) 2022-06-30

Family

ID=76878004

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/116771 WO2022134655A1 (en) 2020-12-25 2021-09-06 End-to-end video action detection and positioning system

Country Status (2)

Country Link
CN (1) CN113158723B (en)
WO (1) WO2022134655A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113158723B (en) * 2020-12-25 2022-06-07 神思电子技术股份有限公司 End-to-end video motion detection positioning system
CN115719508A (en) * 2021-08-23 2023-02-28 香港大学 Video motion detection method based on end-to-end framework and electronic equipment

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10200666B2 (en) * 2015-03-04 2019-02-05 Dolby Laboratories Licensing Corporation Coherent motion estimation for stereoscopic video
CN111079646B (en) * 2019-12-16 2023-06-06 中山大学 Weak supervision video time sequence action positioning method and system based on deep learning

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180285692A1 (en) * 2017-03-28 2018-10-04 Ulsee Inc. Target Tracking with Inter-Supervised Convolutional Networks
CN108830252A (en) * 2018-06-26 2018-11-16 哈尔滨工业大学 A kind of convolutional neural networks human motion recognition method of amalgamation of global space-time characteristic
CN109447014A (en) * 2018-11-07 2019-03-08 东南大学-无锡集成电路技术研究所 A kind of online behavioral value method of video based on binary channels convolutional neural networks
CN110032942A (en) * 2019-03-15 2019-07-19 中山大学 Action identification method based on Time Domain Piecewise and signature differential
CN110059598A (en) * 2019-04-08 2019-07-26 南京邮电大学 The Activity recognition method of the long time-histories speed network integration based on posture artis
CN111259779A (en) * 2020-01-13 2020-06-09 南京大学 Video motion detection method based on central point trajectory prediction
CN113158723A (en) * 2020-12-25 2021-07-23 神思电子技术股份有限公司 End-to-end video motion detection positioning system

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115131710A (en) * 2022-07-05 2022-09-30 福州大学 Real-time action detection method based on multi-scale feature fusion attention
CN115580564A (en) * 2022-11-09 2023-01-06 深圳桥通物联科技有限公司 Dynamic calling device for communication gateway of Internet of things
CN115580564B (en) * 2022-11-09 2023-04-18 深圳桥通物联科技有限公司 Dynamic calling device for communication gateway of Internet of things
CN116189281A (en) * 2022-12-13 2023-05-30 北京交通大学 End-to-end human behavior classification method and system based on space-time self-adaptive fusion
CN116189281B (en) * 2022-12-13 2024-04-02 北京交通大学 End-to-end human behavior classification method and system based on space-time self-adaptive fusion
CN116030189A (en) * 2022-12-20 2023-04-28 中国科学院空天信息创新研究院 Target three-dimensional reconstruction method based on single-view remote sensing image
CN116030189B (en) * 2022-12-20 2023-07-04 中国科学院空天信息创新研究院 Target three-dimensional reconstruction method based on single-view remote sensing image
CN116503406A (en) * 2023-06-28 2023-07-28 中铁水利信息科技有限公司 Hydraulic engineering information management system based on big data
CN116503406B (en) * 2023-06-28 2023-09-19 中铁水利信息科技有限公司 Hydraulic engineering information management system based on big data
CN117788302A (en) * 2024-02-26 2024-03-29 山东全维地信科技有限公司 Mapping graphic processing system
CN117788302B (en) * 2024-02-26 2024-05-14 山东全维地信科技有限公司 Mapping graphic processing system

Also Published As

Publication number Publication date
CN113158723A (en) 2021-07-23
CN113158723B (en) 2022-06-07

Similar Documents

Publication Publication Date Title
WO2022134655A1 (en) End-to-end video action detection and positioning system
CN109389055B (en) Video classification method based on mixed convolution and attention mechanism
Yuan et al. High-order local ternary patterns with locality preserving projection for smoke detection and image classification
Ahmad et al. Human action recognition using shape and CLG-motion flow from multi-view image sequences
Charfi et al. Definition and performance evaluation of a robust SVM based fall detection solution
Avgerinakis et al. Recognition of activities of daily living for smart home environments
CN112434608B (en) Human behavior identification method and system based on double-current combined network
CN110826447A (en) Restaurant kitchen staff behavior identification method based on attention mechanism
CN111488805B (en) Video behavior recognition method based on salient feature extraction
Chenarlogh et al. A multi-view human action recognition system in limited data case using multi-stream CNN
CN111199212B (en) Pedestrian attribute identification method based on attention model
Luo et al. Traffic analytics with low-frame-rate videos
WO2023159898A1 (en) Action recognition system, method, and apparatus, model training method and apparatus, computer device, and computer readable storage medium
CN106326851B (en) A kind of method of number of people detection
Hermina et al. A Novel Approach to Detect Social Distancing Among People in College Campus
Sahoo et al. DISNet: A sequential learning framework to handle occlusion in human action recognition with video acquisition sensors
Islam et al. Representation for action recognition with motion vector termed as: SDQIO
Huo et al. 3DVSD: An end-to-end 3D convolutional object detection network for video smoke detection
Tong et al. D3-LND: A two-stream framework with discriminant deep descriptor, linear CMDT and nonlinear KCMDT descriptors for action recognition
Ming Hand fine-motion recognition based on 3D Mesh MoSIFT feature descriptor
Pavlov et al. Application for video analysis based on machine learning and computer vision algorithms
Su et al. A multiattribute sparse coding approach for action recognition from a single unknown viewpoint
Hatipoglu et al. A gender recognition system from facial images using SURF based BoW method
CN116824641A (en) Gesture classification method, device, equipment and computer storage medium
Xia et al. Human action recognition using high-order feature of optical flows

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21908658

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 06.11.2023)