CN112766177B - Behavior identification method based on feature mapping and multi-layer time interaction attention - Google Patents

Behavior identification method based on feature mapping and multi-layer time interaction attention

Info

Publication number
CN112766177B
CN112766177B (application number CN202110086627.3A)
Authority
CN
China
Prior art keywords
video
matrix
feature
generating
attention
Prior art date
Legal status
Active
Application number
CN202110086627.3A
Other languages
Chinese (zh)
Other versions
CN112766177A (en)
Inventor
同鸣
金磊
董秋宇
边放
Current Assignee
Xidian University
Original Assignee
Xidian University
Priority date
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN202110086627.3A priority Critical patent/CN112766177B/en
Publication of CN112766177A publication Critical patent/CN112766177A/en
Application granted granted Critical
Publication of CN112766177B publication Critical patent/CN112766177B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/23 Recognition of whole body movements, e.g. for sport training
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a behavior recognition method based on feature mapping and multi-layer time interaction attention, which solves the problem that the prior art models temporal dynamic information insufficiently and ignores the interdependence between different frames, resulting in insufficient behavior recognition capability. The implementation steps are: (1) generating a training set; (2) acquiring a depth feature map; (3) constructing a feature mapping matrix; (4) generating a time interaction attention matrix; (5) generating a time interaction attention weighted feature matrix; (6) generating a multi-layer time interaction attention weighted feature matrix; (7) acquiring a feature vector of the video; and (8) performing behavior recognition on the video. Because the invention constructs a feature mapping matrix and proposes multi-layer time interaction attention, it can improve the accuracy of behavior recognition in videos.

Description

Behavior identification method based on feature mapping and multi-layer time interaction attention
Technical Field
The invention belongs to the technical field of video processing, and further relates to a behavior identification method based on feature mapping and multilayer time interaction attention in the technical field of computer vision. The method can be used for human behavior recognition in videos.
Background
The video-based human behavior recognition task plays an important role in the field of computer vision and has broad application prospects; it is currently applied in fields such as autonomous driving, human-computer interaction, and video surveillance. The goal of human behavior recognition is to judge the category of human behavior in a video, which is essentially a classification problem. In recent years, with the development of deep learning, behavior recognition methods based on deep learning have been widely studied.
South China University of Technology discloses a human behavior recognition method in its patent application "Human behavior recognition method based on time attention mechanism and LSTM" (application No. CN201910271178.2, publication No. CN110135249A). The method mainly comprises the following implementation steps: 1. acquiring video data from an RGB monocular vision sensor; 2. extracting 2D skeleton joint point data; 3. extracting joint point combined structure features; 4. constructing an LSTM (long short-term memory) network; 5. adding a time attention mechanism to the LSTM network; 6. performing human behavior recognition with a softmax classifier. The time attention mechanism proposed by this method explores the importance of each frame in the video separately and assigns large weights to the features of important frames, but it still has the defect that it ignores the interdependence between different frames in the video, so part of the global information is lost, causing behavior recognition errors.
Limin Wang et al. disclose a behavior recognition method in the published article "Temporal segment networks for action recognition in videos" (IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 2740-2755). The method mainly comprises the following implementation steps: 1. uniformly dividing the video into 7 video segments; 2. randomly sampling one RGB frame in each video segment to obtain 7 RGB frames; 3. inputting each sampled RGB frame into a convolutional neural network to obtain a classification score for each frame; 4. combining a segment consensus function and a prediction function with the classification scores of the 7 RGB frames to obtain the behavior recognition result of the video. The method has the defect that, for a longer video, only 7 RGB frames are sampled, so information in the video is lost and more complete temporal dynamic information cannot be modeled, resulting in lower behavior recognition accuracy.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a behavior recognition method based on feature mapping and multi-layer time interaction attention, so as to solve the problems that the prior art models temporal dynamic information insufficiently and ignores the interdependence between different frames, resulting in poor behavior recognition capability.
In order to achieve this purpose, the idea of the invention is to construct a feature mapping matrix that embeds the temporal and spatial information of the video; to obtain time interaction attention by exploring the mutual influence among different frames in the video; and to mine the complex temporal dynamic information in the video with multi-layer time interaction attention.
In order to achieve the purpose, the method comprises the following specific steps:
(1) Generating a training set:
(1a) Selecting RGB videos containing N behavior categories in a video data set to form a sample set, wherein each category contains at least 100 videos, each video has a determined behavior category, and N is greater than 50;
(1b) Preprocessing each video in the sample set to obtain RGB images corresponding to the video, and forming the RGB images of all preprocessed videos into a training set;
(2) Generating a depth feature map:
sequentially inputting each frame of RGB image of each video in the training set into an Inception-v2 network, and sequentially outputting a depth feature map $X_k$ of size 7 × 7 × 1024 for each frame of image in each video, wherein k represents the sequence number of the sampled image in the video, k = 1, 2, ..., 60;
(3) Constructing a feature mapping matrix:
(3a) Encoding each depth feature map into a 1024-dimensional low-dimensional vector $f_k$, k = 1, 2, ..., 60, using a spatial vectorization function;
(3b) Arranging the low-dimensional vectors corresponding to 60 frame sampling images of each video in a row according to the time sequence of the frames to obtain a two-dimensional feature mapping matrix
Figure BDA0002911074420000021
Wherein T represents a transpose operation;
(4) Generating a temporal interaction attention matrix:
(4a) Using the formula $B = M^{T} M$, generating the correlation matrix B of M, wherein the value in the i-th row and j-th column of B represents the degree of correlation between the two low-dimensional vectors corresponding to the i-th and j-th sampled images in the video;
(4b) Normalizing the correlation matrix B to obtain a time interaction attention matrix A with a size of 60 × 60;
(5) Generating a time interaction attention weighted feature matrix:
using the formula $\hat{M} = \gamma M A + M$ to generate the time interaction attention weighted feature matrix $\hat{M}$, wherein γ represents a scale parameter initialized to 0 for balancing the two terms $MA$ and $M$;
(6) Generating a multi-layer time interaction attention weighted feature matrix:
(6a) Using the formula $\hat{B} = \hat{M}^{T} \hat{M}$, generating the correlation matrix $\hat{B}$ of $\hat{M}$; normalizing $\hat{B}$ to obtain a multi-layer time interaction attention matrix $\hat{A}$ with a size of 60 × 60;
(6b) Using the formula $\tilde{M} = \hat{\gamma} \hat{M} \hat{A} + \hat{M}$, generating the multi-layer time interaction attention weighted feature matrix $\tilde{M}$, wherein $\hat{\gamma}$ represents a scale parameter initialized to 0 for balancing the two terms $\hat{M} \hat{A}$ and $\hat{M}$;
(7) Acquiring a feature vector of a video:
inputting the multi-layer time interactive attention weighted feature matrix of each video into a full-connection layer, and outputting the feature vector of the video;
(8) Performing behavior recognition on the video:
(8a) Inputting the feature vector of each video into a softmax classifier, and iteratively updating the parameters γ and $\hat{\gamma}$, the parameters of the full connection layer, and the parameters of the softmax classifier by the back-propagation gradient descent method until the cross-entropy loss function converges, obtaining the trained parameters;
(8b) Sampling 60 frames of RGB images at equal intervals from each video to be recognized, scaling each frame to 256 × 340, then performing center cropping to obtain 60 RGB frames of size 224 × 224, inputting each RGB frame into the Inception-v2 network, and outputting the depth feature map of the video to be recognized;
(8c) And (4) processing the depth feature map of each video to be recognized by adopting the same processing method as the steps (3) to (7) to obtain feature vectors of the video, inputting each feature vector into a trained softmax classifier, and outputting a behavior recognition result of each video.
Compared with the prior art, the invention has the following advantages:
Firstly, the feature mapping matrix constructed by the invention contains the temporal information of the 60 sampled images of the video and the spatial information of each sampled image. This overcomes the problem in the prior art that sampling only 7 RGB frames loses information in the video and prevents more complete temporal dynamic information from being modeled, so the invention can retain temporal information more fully and obtain more expressive features.
Secondly, the invention proposes the time interaction attention matrix, which is obtained by calculating the degree of correlation between the low-dimensional features of different sampled images in the feature mapping matrix. This overcomes the problem that the prior art ignores the interdependence between different frames in the video and therefore loses part of the global information, so the proposed technique can fully explore global information and improve the accuracy of behavior recognition.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The specific steps of the present invention will be further described with reference to fig. 1.
Step 1, generating a training set.
Selecting RGB videos containing N behavior categories in a video data set to form a sample set, wherein each category contains at least 100 videos, each video has a determined behavior category, and N is greater than 50. Preprocessing each video in the sample set to obtain an RGB image corresponding to the video, and forming the RGB images of all preprocessed videos into a training set. The preprocessing is to sample 60 frames of RGB images at equal intervals for each video in the sample set, scale the size of each frame of RGB image to 256 × 340, and then crop the RGB images to obtain 60 frames of RGB images with the size of 224 × 224 for the video.
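By way of illustration only (not part of the claimed method), the following Python sketch shows one plausible realization of this preprocessing step: 60 frames are sampled at equal intervals, rescaled to 256 × 340, and center-cropped to 224 × 224. The function name sample_and_crop and the use of OpenCV are assumptions of this sketch.

```python
import cv2
import numpy as np

def sample_and_crop(video_path, num_frames=60, resize_hw=(256, 340), crop=224):
    """Sample frames at equal intervals, resize to 256x340, center-crop to 224x224."""
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    # Equally spaced frame indices over the whole video.
    indices = np.linspace(0, total - 1, num_frames).astype(int)
    frames = []
    for idx in indices:
        cap.set(cv2.CAP_PROP_POS_FRAMES, int(idx))
        ok, frame_bgr = cap.read()
        if not ok:
            frame_bgr = np.zeros((resize_hw[0], resize_hw[1], 3), dtype=np.uint8)
        frame_rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
        # cv2.resize expects (width, height).
        frame_rgb = cv2.resize(frame_rgb, (resize_hw[1], resize_hw[0]))
        h0 = (resize_hw[0] - crop) // 2
        w0 = (resize_hw[1] - crop) // 2
        frames.append(frame_rgb[h0:h0 + crop, w0:w0 + crop, :])
    cap.release()
    return np.stack(frames)  # shape: (60, 224, 224, 3)
```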
And 2, acquiring a depth characteristic map.
Each frame of RGB image of each video in the training set is sequentially input into an Inception-v2 network, which sequentially outputs a depth feature map $X_k$ of size 7 × 7 × 1024 for each frame of image in each video, where k denotes the sequence number of the sampled image in the video, k = 1, 2, ..., 60.
And 3, constructing a characteristic mapping matrix.
Due to the high dimensionality of the feature maps, jointly analyzing the information of the densely sampled images in a video is challenging; mapping each feature map to a low-dimensional vector reduces the amount of computation and facilitates joint analysis of the densely sampled images. Taking the k-th sampled image of the r-th video as an example, the depth feature map of a sampled image is encoded into a 1024-dimensional low-dimensional vector as follows:
$f_{r,k} = V(X_{r,k}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{r,k,ij}$
wherein $f_{r,k}$ represents the low-dimensional vector corresponding to the k-th sampled image of the r-th video, $V(\cdot)$ represents the spatial vectorization function, $X_{r,k}$ represents the depth feature map corresponding to the k-th sampled image of the r-th video, $X_{r,k,ij}$ represents the element in the i-th row and j-th column of $X_{r,k}$, Σ represents the summation operation, and H and W represent the total number of rows and the total number of columns of $X_{r,k}$, respectively.
Arranging the low-dimensional vectors corresponding to the 60 sampled frames of each video in a row according to the temporal order of the frames gives the two-dimensional feature mapping matrix $M = [f_{1}^{T}, f_{2}^{T}, \ldots, f_{60}^{T}]$, wherein $f_{k}$ represents the low-dimensional vector of the k-th sampled image, k = 1, 2, ..., 60, and T represents the transpose operation.
The number of columns of the matrix M is equal to the total number of sampled images corresponding to each video, and the number of rows is equal to the dimension of the low-dimensional vector.
The feature mapping matrix contains the time information of the video and the spatial information of each sampling image, so that the method can perform joint analysis on the densely sampled images in the video.
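As a purely illustrative sketch (the averaging inside spatial_vectorize and the function names are assumptions of this description, not the patented implementation), the following Python code builds the feature mapping matrix from the 60 depth feature maps of one video, assumed to be given as an array of shape (60, 7, 7, 1024):

```python
import numpy as np

def spatial_vectorize(X):
    """Map a depth feature map X of shape (H, W, C) to a C-dimensional vector
    by averaging over the H x W spatial positions (the spatial vectorization V)."""
    H, W, _ = X.shape
    return X.reshape(H * W, -1).sum(axis=0) / (H * W)

def build_feature_mapping_matrix(feature_maps):
    """feature_maps: array of shape (60, 7, 7, 1024), one depth feature map per sampled frame.
    Returns M of shape (1024, 60): column k holds the low-dimensional vector f_k."""
    F = np.stack([spatial_vectorize(X) for X in feature_maps])  # (60, 1024), row k is f_k
    return F.T  # (1024, 60), columns ordered by frame index
```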
And 4, generating a time interaction attention matrix.
Generate the correlation matrix $B = M^{T} M$ of M; the value in the i-th row and j-th column of B expresses the degree of correlation between the two low-dimensional vectors corresponding to the i-th and j-th sampled images in the video. Normalizing B gives the time interaction attention matrix A with a size of 60 × 60.
Taking the i-th and j-th sampled frames as an example, the element $A_{ij}$ in the i-th row and j-th column of the time interaction attention matrix A is calculated from the degree of correlation between the two frames:
$A_{ij} = \frac{\exp(M_{i}^{T} M_{j})}{\sum_{n=1}^{60} \exp(M_{i}^{T} M_{n})}$
wherein $A_{ij}$ measures the degree of correlation between the i-th and j-th sampled frames, and $M_{i}$ and $M_{j}$ are the column vectors formed by the i-th and j-th columns of the feature mapping matrix M, whose physical meanings are the transposes of the low-dimensional vectors of the i-th and j-th sampled images of the video. The more similar the low-dimensional vectors of the two frames are, the larger $A_{ij}$ is and the stronger the correlation between the two frames.
All elements of the time interaction attention matrix A are calculated in the same way, and the i-th row of A represents the degree of correlation between the i-th sampled frame and all sampled frames of the video. The time interaction attention matrix therefore models the correlation between video frames and helps explore the global information in the video more fully.
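A minimal sketch of this step follows, assuming the row-wise softmax normalization described above; M is the 1024 × 60 feature mapping matrix built in the previous sketch, and the function name is illustrative only.

```python
import numpy as np

def temporal_interaction_attention(M):
    """M: (1024, 60) feature mapping matrix.
    Returns A: (60, 60), where A[i, j] reflects how correlated frames i and j are."""
    B = M.T @ M                           # correlation matrix, B[i, j] = M_i^T M_j
    B = B - B.max(axis=1, keepdims=True)  # numerical stability for the exponentials
    E = np.exp(B)
    return E / E.sum(axis=1, keepdims=True)  # row-wise normalization
```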
And 5, generating a time interaction attention weighted feature matrix.
Using the formula $\hat{M} = \gamma M A + M$, generate the time interaction attention weighted feature matrix $\hat{M}$, wherein γ represents a scale parameter initialized to 0 for balancing the two terms $MA$ and $M$.
And 6, generating a multilayer time interaction attention weighted feature matrix.
Using the formula $\hat{B} = \hat{M}^{T} \hat{M}$, generate the correlation matrix $\hat{B}$ of $\hat{M}$; normalizing $\hat{B}$ gives the multi-layer time interaction attention matrix $\hat{A}$ with a size of 60 × 60. Then, using the formula $\tilde{M} = \hat{\gamma} \hat{M} \hat{A} + \hat{M}$, generate the multi-layer time interaction attention weighted feature matrix $\tilde{M}$, wherein $\hat{\gamma}$ represents a scale parameter initialized to 0 for balancing the two terms $\hat{M} \hat{A}$ and $\hat{M}$.
Multi-layer time interaction attention applies time interaction attention again to the time interaction attention weighted feature matrix, thereby exploring richer temporal dynamics.
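The two weighting stages can be sketched as below, reusing the temporal_interaction_attention helper from the sketch after step 4; gamma and gamma_hat stand in for the scale parameters initialized to 0 (which are learned during training), and the function names are assumptions of this illustration rather than the patented implementation.

```python
import numpy as np

def attention_weighted(M, A, gamma):
    """One layer of time interaction attention weighting: M_hat = gamma * M A + M."""
    return gamma * (M @ A) + M                     # (1024, 60)

def multilayer_attention_weighted(M, gamma=0.0, gamma_hat=0.0):
    """Apply time interaction attention twice to obtain the multi-layer weighted matrix."""
    A = temporal_interaction_attention(M)          # first-layer attention, (60, 60)
    M_hat = attention_weighted(M, A, gamma)        # (1024, 60)
    A_hat = temporal_interaction_attention(M_hat)  # second-layer attention from M_hat
    return attention_weighted(M_hat, A_hat, gamma_hat)  # (1024, 60)
```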
And 7, acquiring a feature vector of the video.
And inputting the multi-layer time interactive attention weighted feature matrix of each video into a full-connection layer with 1024 output neurons to obtain the feature vector of the video.
And 8, performing behavior recognition on the video.
Inputting the feature vector of each video into a softmax classifier, and iteratively updating γ, $\hat{\gamma}$, the parameters of the full connection layer, and the parameters of the softmax classifier by the back-propagation gradient descent method until the cross-entropy loss function converges.
Sampling 60 frames of RGB images at equal intervals from each video to be recognized, scaling each frame to 256 × 340, then performing center cropping to obtain 60 RGB frames of size 224 × 224, inputting each RGB frame into the Inception-v2 network, and outputting the depth feature map of the video to be recognized.
And processing the depth feature map of each video to be recognized by the same processing method as steps 3 to 7 to obtain the feature vector of the video to be recognized, inputting each feature vector into the trained softmax classifier, and outputting the behavior recognition result of each video.
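As a final, purely illustrative sketch, the following PyTorch code ties steps 3 to 8 together in one trainable module. The module and method names, the flattening of the 1024 × 60 matrix before the full connection layer, and the default of 101 classes are assumptions of this sketch; the two scale parameters are held as nn.Parameter values initialized to 0 and trained jointly with the full connection layer and softmax classifier via cross-entropy, as the description above states.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiLayerTemporalInteractionAttention(nn.Module):
    """Illustrative sketch of steps 3-8, given per-frame depth feature maps."""
    def __init__(self, feat_dim=1024, num_frames=60, num_classes=101):
        super().__init__()
        self.gamma = nn.Parameter(torch.zeros(1))      # scale of the first attention layer
        self.gamma_hat = nn.Parameter(torch.zeros(1))  # scale of the second attention layer
        self.fc = nn.Linear(feat_dim * num_frames, feat_dim)  # full connection layer, 1024 outputs
        self.classifier = nn.Linear(feat_dim, num_classes)    # logits for the softmax classifier

    def _attention(self, M):
        # M: (batch, feat_dim, num_frames); A[i, j] reflects the correlation of frames i and j.
        B = torch.bmm(M.transpose(1, 2), M)            # correlation matrices, (batch, 60, 60)
        return F.softmax(B, dim=-1)                    # row-wise normalization

    def forward(self, feature_maps):
        # feature_maps: (batch, 60, 7, 7, 1024) depth feature maps from the backbone.
        f = feature_maps.mean(dim=(2, 3))              # spatial vectorization -> (batch, 60, 1024)
        M = f.transpose(1, 2)                          # feature mapping matrix -> (batch, 1024, 60)
        A = self._attention(M)
        M_hat = self.gamma * torch.bmm(M, A) + M       # time interaction attention weighting
        A_hat = self._attention(M_hat)
        M_tilde = self.gamma_hat * torch.bmm(M_hat, A_hat) + M_hat  # multi-layer weighting
        v = self.fc(M_tilde.flatten(1))                # feature vector of the video
        return self.classifier(v)                      # train with nn.CrossEntropyLoss
```

A training step would then follow the usual pattern: logits = model(maps); loss = F.cross_entropy(logits, labels); loss.backward(); optimizer.step().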

Claims (4)

1. A behavior identification method based on feature mapping and multi-layer time interaction attention, characterized in that a feature mapping matrix containing the temporal information of a video and the spatial information of each sampled image is constructed, time interaction attention is proposed, and a time interaction attention matrix is obtained by calculating the degree of correlation between the low-dimensional vectors of different sampled images in the feature mapping matrix; the method specifically comprises the following steps:
(1) Generating a training set:
(1a) Selecting RGB videos containing N behavior categories in a video data set to form a sample set, wherein each category contains at least 100 videos, each video has a determined behavior category, and N is greater than 50;
(1b) Preprocessing each video in the sample set to obtain RGB images corresponding to the video, and forming the RGB images of all preprocessed videos into a training set;
(2) Generating a depth feature map:
sequentially inputting each frame of RGB image of each video in the training set into an Inception-v2 network, and sequentially outputting a depth feature map $X_k$ of size 7 × 7 × 1024 for each frame of image in each video, wherein k represents the sequence number of the sampled image in the video, k = 1, 2, ..., 60;
(3) Constructing a feature mapping matrix:
(3a) Encoding each depth feature map into a 1024-dimensional low-dimensional vector $f_k$, k = 1, 2, ..., 60, using a spatial vectorization function;
(3b) Arranging the low-dimensional vectors corresponding to the 60 sampled frames of each video in a row according to the temporal order of the frames to obtain a two-dimensional feature mapping matrix $M = [f_{1}^{T}, f_{2}^{T}, \ldots, f_{60}^{T}]$, wherein T represents the transpose operation;
(4) Generating a temporal interaction attention matrix:
(4a) Using the formula $B = M^{T} M$, generating the correlation matrix B of M, wherein the value in the i-th row and j-th column of B represents the degree of correlation between the two low-dimensional vectors corresponding to the i-th and j-th sampled images in the video;
(4b) Normalizing the correlation matrix B to obtain a time interaction attention matrix A with a size of 60 × 60;
(5) Generating a time interaction attention weighted feature matrix:
using the formula $\hat{M} = \gamma M A + M$ to generate the time interaction attention weighted feature matrix $\hat{M}$, wherein γ represents a scale parameter initialized to 0 for balancing the two terms $MA$ and $M$;
(6) Generating a multi-layer time interactive attention weighted feature matrix:
(6a) Using the formula $\hat{B} = \hat{M}^{T} \hat{M}$, generating the correlation matrix $\hat{B}$ of $\hat{M}$; normalizing $\hat{B}$ to obtain a multi-layer time interaction attention matrix $\hat{A}$ with a size of 60 × 60;
(6b) Using the formula $\tilde{M} = \hat{\gamma} \hat{M} \hat{A} + \hat{M}$, generating the multi-layer time interaction attention weighted feature matrix $\tilde{M}$, wherein $\hat{\gamma}$ represents a scale parameter initialized to 0 for balancing the two terms $\hat{M} \hat{A}$ and $\hat{M}$;
(7) Acquiring a feature vector of a video:
inputting the multilayer time interactive attention weighted feature matrix of each video into a full connection layer, and outputting the feature vector of the video;
(8) Performing behavior recognition on the video:
(8a) Inputting the feature vector of each video into a softmax classifier, and iteratively updating the parameters γ and $\hat{\gamma}$, the parameters of the full connection layer, and the parameters of the softmax classifier by the back-propagation gradient descent method until the cross-entropy loss function converges, obtaining the trained parameters;
(8b) Sampling 60 frames of RGB images at equal intervals from each video to be recognized, scaling each frame to 256 × 340, then performing center cropping to obtain 60 RGB frames of size 224 × 224, inputting each RGB frame into the Inception-v2 network, and outputting the depth feature map of the video to be recognized;
(8c) And (4) processing the depth feature map of each video to be recognized by adopting the same processing method as the steps (3) to (7) to obtain feature vectors of the video, inputting each feature vector into a trained softmax classifier, and outputting a behavior recognition result of each video.
2. The method according to claim 1, wherein the preprocessing of each video in the sample set in step (1 b) comprises sampling 60 frames of RGB images at equal intervals for each video in the sample set, scaling the RGB images to 256 × 340, and cropping to obtain 60 frames of RGB images with a size of 224 × 224 for the video.
3. The method for recognizing behavior based on feature mapping and multi-layer temporal interaction attention according to claim 1, wherein the spatial vectorization function in step (3a) is as follows:
$f_{r,k} = V(X_{r,k}) = \frac{1}{H \times W} \sum_{i=1}^{H} \sum_{j=1}^{W} X_{r,k,ij}$
wherein $f_{r,k}$ represents the low-dimensional vector corresponding to the k-th sampled frame of the r-th video, $V(\cdot)$ represents the spatial vectorization function, $X_{r,k}$ represents the depth feature map corresponding to the k-th sampled frame of the r-th video, $X_{r,k,ij}$ represents the element in the i-th row and j-th column of $X_{r,k}$, Σ represents the summation operation, and H and W represent the total number of rows and the total number of columns of $X_{r,k}$, respectively.
4. The method for behavior recognition based on feature mapping and multi-layer temporal interaction attention of claim 1, wherein the number of output neurons of the fully-connected layer in step (7) is set to 1024.
CN202110086627.3A 2021-01-22 2021-01-22 Behavior identification method based on feature mapping and multi-layer time interaction attention Active CN112766177B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110086627.3A CN112766177B (en) 2021-01-22 2021-01-22 Behavior identification method based on feature mapping and multi-layer time interaction attention

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110086627.3A CN112766177B (en) 2021-01-22 2021-01-22 Behavior identification method based on feature mapping and multi-layer time interaction attention

Publications (2)

Publication Number Publication Date
CN112766177A CN112766177A (en) 2021-05-07
CN112766177B true CN112766177B (en) 2022-12-02

Family

ID=75702700

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110086627.3A Active CN112766177B (en) 2021-01-22 2021-01-22 Behavior identification method based on feature mapping and multi-layer time interaction attention

Country Status (1)

Country Link
CN (1) CN112766177B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109389055A (en) * 2018-09-21 2019-02-26 西安电子科技大学 Video classification methods based on mixing convolution sum attention mechanism
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A kind of deep video Activity recognition method and system
EP3625727A1 (en) * 2017-11-14 2020-03-25 Google LLC Weakly-supervised action localization by sparse temporal pooling network
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network
CN111627052A (en) * 2020-04-30 2020-09-04 沈阳工程学院 Action identification method based on double-flow space-time attention mechanism

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200175281A1 (en) * 2018-11-30 2020-06-04 International Business Machines Corporation Relation attention module for temporal action localization

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3625727A1 (en) * 2017-11-14 2020-03-25 Google LLC Weakly-supervised action localization by sparse temporal pooling network
CN109389055A (en) * 2018-09-21 2019-02-26 西安电子科技大学 Video classification methods based on mixing convolution sum attention mechanism
CN110059662A (en) * 2019-04-26 2019-07-26 山东大学 A kind of deep video Activity recognition method and system
CN111325099A (en) * 2020-01-21 2020-06-23 南京邮电大学 Sign language identification method and system based on double-current space-time diagram convolutional neural network
CN111627052A (en) * 2020-04-30 2020-09-04 沈阳工程学院 Action identification method based on double-flow space-time attention mechanism

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"A new framework of action recognition with discriminative parts,spatio-temporal and causal interaction descriptors";Ming Tong 等;《ELSEVIER》;20180904;116–130 *
基于通道注意力机制的视频人体行为识别;解怀奇等;《电子技术与软件工程》;20200215(第04期);146-148 *
融合空间-时间双网络流和视觉注意的人体行为识别;刘天亮等;《电子与信息学报》;20180815(第10期);114-120 *

Also Published As

Publication number Publication date
CN112766177A (en) 2021-05-07

Similar Documents

Publication Publication Date Title
CN110119703B (en) Human body action recognition method fusing attention mechanism and spatio-temporal graph convolutional neural network in security scene
CN108537742B (en) Remote sensing image panchromatic sharpening method based on generation countermeasure network
CN113469094A (en) Multi-mode remote sensing data depth fusion-based earth surface coverage classification method
CN112070078B (en) Deep learning-based land utilization classification method and system
CN111639719B (en) Footprint image retrieval method based on space-time motion and feature fusion
CN107688856B (en) Indoor robot scene active identification method based on deep reinforcement learning
CN113936339A (en) Fighting identification method and device based on double-channel cross attention mechanism
CN112926396A (en) Action identification method based on double-current convolution attention
CN107451565B (en) Semi-supervised small sample deep learning image mode classification and identification method
CN108648197A (en) A kind of object candidate area extracting method based on image background mask
CN109817276A (en) A kind of secondary protein structure prediction method based on deep neural network
CN114782694B (en) Unsupervised anomaly detection method, system, device and storage medium
CN109376589A (en) ROV deformation target and Small object recognition methods based on convolution kernel screening SSD network
CN113887517B (en) Crop remote sensing image semantic segmentation method based on parallel attention mechanism
CN113344045B (en) Method for improving SAR ship classification precision by combining HOG characteristics
CN113935249B (en) Upper-layer ocean thermal structure inversion method based on compression and excitation network
CN114565594A (en) Image anomaly detection method based on soft mask contrast loss
CN115757919A (en) Symmetric deep network and dynamic multi-interaction human resource post recommendation method
CN115032602A (en) Radar target identification method based on multi-scale convolution capsule network
CN117593666B (en) Geomagnetic station data prediction method and system for aurora image
CN115019132A (en) Multi-target identification method for complex background ship
CN114170657A (en) Facial emotion recognition method integrating attention mechanism and high-order feature representation
CN113516232A (en) Training method of neural network model based on self-attention mechanism
CN112766177B (en) Behavior identification method based on feature mapping and multi-layer time interaction attention
CN110782503B (en) Face image synthesis method and device based on two-branch depth correlation network

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant