CN110826462A - Human behavior recognition method of a non-local dual-stream convolutional neural network model - Google Patents

Human behavior recognition method of a non-local dual-stream convolutional neural network model

Info

Publication number
CN110826462A
CN110826462A (application number CN201911053686.XA)
Authority
CN
China
Prior art keywords
layer
local
neural network
behavior recognition
double
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911053686.XA
Other languages
Chinese (zh)
Inventor
周云
陈淑荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Maritime University
Priority to CN201911053686.XA
Publication of CN110826462A
Legal status: Pending

Classifications

    • G06V40/20 — Recognition of biometric, human-related or animal-related patterns in image or video data: movements or behaviour, e.g. gesture recognition
    • G06F18/241 — Pattern recognition: classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 — Classification techniques based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06N3/045 — Neural network architectures: combinations of networks
    • G06N3/047 — Probabilistic or stochastic networks
    • G06N3/08 — Neural networks: learning methods

Abstract

The invention relates to a human behavior recognition method based on a non-local dual-stream convolutional neural network model. Starting from the dual-stream convolutional neural network model, both branch networks are improved: a non-local feature extraction module is added to the spatial-stream CNN and the temporal-stream CNN to extract more comprehensive and cleaner feature maps. This deepens the network to a certain extent, effectively mitigates overfitting, extracts non-local features of the samples, and denoises the input feature maps, addressing the low recognition accuracy caused by complex backgrounds, diverse human behaviors, and highly similar actions in behavior videos. The invention also trains the loss layer with an A-softmax loss function, which adds an m-fold constraint on the classification angle to the standard softmax function and constrains the fully connected layer's weight W and bias b (‖W‖ = 1, b = 0), so that inter-class distances of the samples become larger and intra-class distances smaller. This yields better recognition accuracy and, finally, a deep learning model with stronger discriminative power.

Description

Human behavior recognition method of a non-local dual-stream convolutional neural network model
Technical Field
The invention relates to computer-vision image and video processing technology, and in particular to a human behavior recognition method based on a non-local dual-stream convolutional neural network model.
Background
Research on human behavior recognition aims to give computers vision-like abilities, so that they can acquire information through a visual system much as humans do: analyzing human actions in video, and classifying and understanding human behavior by automatically tracking its global and local information. Because behavior videos feature complex backgrounds, diverse human behaviors, and highly similar actions, two similar actions may be assigned to the same class, lowering the accuracy of human behavior recognition. Human behavior recognition therefore remains a challenging task in computer vision. Current academic research on video behavior recognition falls into two main directions: traditional behavior recognition methods based on machine learning, and behavior recognition methods based on deep learning. Traditional methods rely on hand-crafted features, which introduce large errors; deep learning methods, represented by the convolutional neural network (CNN), improve recognition accuracy and have become a popular research direction in recent years. By learning features automatically, deep networks show strong feature-extraction ability and can learn highly adaptive, discriminative features for a given task. Deep learning has achieved good results in human action recognition.
Disclosure of Invention
To solve the problem of low recognition accuracy caused by complex backgrounds, diverse human behaviors, and highly similar actions in behavior videos, the invention provides a human behavior recognition method based on a non-local dual-stream convolutional neural network model. It designs a non-local dual-stream convolutional neural network that combines a traditional CNN with non-local feature extraction modules, and adopts the A-softmax loss function so that, in the final classification of the two branch models, inter-class distances become larger and intra-class distances smaller.
The invention provides a human behavior recognition method based on a non-local dual-stream convolutional neural network model, which trains data samples with a network structure that combines a CNN model with non-local neural network modules: in every processing stage except the last, features are extracted by a convolutional layer and reduced in dimension by a pooling layer, then passed through the non-local neural network module for non-local feature extraction and feature denoising before entering the next convolutional and pooling layers; after the final convolutional and pooling layers, the fully connected layer aggregates the sample features into a feature vector, which the loss layer then classifies and normalizes.
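The stage ordering described above can be sketched in Python (the three-stage depth and the stage names are illustrative assumptions; the patent does not fix the number of convolution/pooling stages):

```python
def build_stream(num_stages=3):
    """Order the layers of one stream as described: every stage except the
    last is Conv -> Pool -> NL block; the final stage is Conv -> Pool,
    followed by the fully connected layer and the loss layer."""
    stages = []
    for i in range(1, num_stages + 1):
        stages += [f"conv{i}", f"pool{i}"]
        if i < num_stages:  # no NL block after the final pooling layer
            stages.append(f"nl_block{i}")
    stages += ["fc", "a_softmax_loss"]
    return stages

print(build_stream())
# ['conv1', 'pool1', 'nl_block1', 'conv2', 'pool2', 'nl_block2',
#  'conv3', 'pool3', 'fc', 'a_softmax_loss']
```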
Preferably, an A-softmax loss function is adopted in the loss layer, an m-fold constraint is imposed on the classification angle, and the constraints ‖W‖ = 1 and ‖b‖ = 0 are imposed on the weight W and bias b of the fully connected layer.
Preferably, a dual-stream convolutional neural network model is adopted, extracting the spatial appearance information and temporal motion information of the video samples through a spatial-stream CNN model and a temporal-stream CNN model;
the input video sample set is preprocessed into RGB frames and optical-flow images, which are split into a training set and a test set and fed to the spatial-stream and temporal-stream CNN models for training and testing, respectively;
and the loss-layer outputs of the spatial-stream and temporal-stream CNN models are fused by weighting to obtain the behavior recognition result of the dual-stream convolutional neural network model.
The invention has the beneficial effects that:
① The invention designs non-local dual-stream convolutional neural network training. Compared with the original dual-stream convolutional network, the NL-CNN extracts more comprehensive and cleaner feature maps and therefore trains better. Adding a non-local module (NL block) to an ordinary CNN deepens the network to a certain extent and effectively mitigates overfitting; more importantly, the non-local module extracts the non-local features of the sample and denoises the input feature maps, which better addresses the low recognition accuracy caused by complex backgrounds, diverse human behaviors, highly similar actions, and similar factors in behavior videos.
② When classifying in the softmax layer, the invention uses the A-softmax loss function, which not only adds an m-fold constraint on the classification angle to the original softmax function but also imposes the two constraints ‖W‖ = 1 and b = 0 on the weight W and bias b of the fully connected layer preceding the A-softmax layer. This enlarges the distance between different classes and reduces the distance within the same class, so that inter-class distances of the samples become larger and intra-class distances smaller, yielding better classification and recognition results.
Drawings
FIG. 1 is a diagram of an overall non-local dual-stream convolutional neural network architecture.
Fig. 2 is an NL block configuration diagram.
Fig. 3a and Fig. 3b are structure diagrams of the NL-CNN combining a CNN with an NL block, where Fig. 3a shows an ordinary CNN connected to an NL block and Fig. 3b shows a residual network connected to an NL block.
Fig. 4 is a schematic diagram of softmax classification.
Fig. 5a and 5b are geometric illustrations of A-softmax, corresponding to the 2D and 3D hypersphere manifold, respectively.
Fig. 6a and 6b show a classification process and the result of A-softmax for two classes.
Detailed Description
The invention provides a human behavior recognition method based on a non-local dual-stream convolutional neural network model. The overall network adopts the dual-stream architecture shown schematically in Fig. 1: the framework consists mainly of a spatial-stream and a temporal-stream CNN model, used to extract the spatial appearance information and temporal motion information of the video samples. The two streams have identical convolutional-layer and fully-connected-layer structures and settings, and share weight parameters.
In Fig. 1, the input video sample set is preprocessed into RGB frames and optical-flow images (a single RGB frame and consecutive optical-flow frames), split into a training set and a test set, and fed to the spatial-stream CNN and temporal-stream CNN for training and testing, respectively. A suitable input size is chosen for the RGB frames and optical-flow images, and a suitable network model is chosen as the backbone of the dual-stream CNN for further feature extraction and training.
In general, a CNN consists mainly of convolutional layers (Conv), pooling layers (Pool), fully connected layers (FC), and a loss layer. In such a model, the convolutional layers extract features, the pooling layers reduce dimensionality to obtain salient feature regions, and the fully connected layer integrates the local information extracted by the convolutional layers into global information, which introduces a large number of parameters. To address this and let the deep network fuse non-local information better, a non-local neural network module (NL block) is inserted before a convolutional layer of the CNN to extract non-local features of the sample and denoise the features; classification and normalization are finally performed in the softmax layer. The softmax outputs of the spatial-stream and temporal-stream CNNs are fused by weighting to obtain the final behavior recognition result of the dual-stream CNN model.
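The final weighted fusion of the two streams' softmax outputs can be sketched with NumPy as follows (the equal 0.5/0.5 stream weights and the three-class logits are illustrative assumptions, not values fixed by the patent):

```python
import numpy as np

def softmax(logits):
    # numerically stable softmax over the last axis
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def fuse_two_stream(spatial_logits, temporal_logits,
                    w_spatial=0.5, w_temporal=0.5):
    """Weighted fusion of the softmax outputs of the spatial-stream and
    temporal-stream CNNs; returns the predicted class and fused probabilities."""
    fused = w_spatial * softmax(spatial_logits) + w_temporal * softmax(temporal_logits)
    return fused.argmax(axis=-1), fused

# toy example with 3 behavior classes
pred, probs = fuse_two_stream(np.array([2.0, 0.5, 0.1]),
                              np.array([1.5, 0.2, 0.3]))
```

As long as the stream weights sum to 1, the fused scores remain a probability distribution over the classes.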
The invention also trains the final loss layer of the network with an A-softmax loss function, thereby obtaining better recognition accuracy.
Following the use of non-local means in image denoising, the non-local feature extraction module (NL block) relates local features to feature points across the whole image by the non-local-mean method, establishing a connection between two pixels that are some distance apart in the image.
Fig. 2 is the NL block structure diagram; in the figure, h and d denote the height and width (the size) of the feature map. The NL block is abstractly described as

y_i = (1/C(s)) Σ_{∀j} f(s_i, s_j) g(s_j)

where s denotes the input signal (the feature map) and y the output signal, which has the same size as s, and C(s) is a normalization factor. The pairwise term, here written in the embedded-Gaussian form consistent with the weights W_θ and W_φ below,

f(s_i, s_j) = e^(θ(s_i)ᵀ φ(s_j))

computes the correlation between pixel s_i and all pixels s_j, and g(s_j) = W_g s_j computes the feature value of the input signal at position j. The final output of the NL block is z_i = W_z y_i + s_i. Here W_θ, W_φ, W_g, and W_z are learnable weight matrices, implemented in practice as 1×1 convolutions.
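A minimal NumPy sketch of this computation on a feature map flattened to N positions (the embedded-Gaussian form of f and plain matrix multiplies standing in for the 1×1 convolutions are modeling assumptions; all shapes are illustrative):

```python
import numpy as np

def nl_block(s, W_theta, W_phi, W_g, W_z):
    """Non-local block: y_i = (1/C(s)) * sum_j f(s_i, s_j) g(s_j),
    followed by the residual output z_i = W_z y_i + s_i.
    s has shape (N, C): N spatial positions, C channels."""
    theta, phi, g = s @ W_theta, s @ W_phi, s @ W_g   # the 1x1-conv embeddings
    f = np.exp(theta @ phi.T)                         # pairwise correlations f(s_i, s_j)
    f = f / f.sum(axis=1, keepdims=True)              # normalize by C(s)
    y = f @ g                                         # aggregate features from all positions
    return y @ W_z + s                                # residual connection keeps input size

rng = np.random.default_rng(0)
N, C, Ce = 6, 4, 2                                    # positions, channels, embedding dim
s = rng.standard_normal((N, C))
z = nl_block(s,
             rng.standard_normal((C, Ce)), rng.standard_normal((C, Ce)),
             rng.standard_normal((C, Ce)), rng.standard_normal((Ce, C)))
# z has the same shape as s, so the block can sit between Pool and the next Conv
```

Because the output shape equals the input shape, the block can be dropped between existing layers without changing the rest of the network.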
In the invention, a traditional CNN model is connected to NL blocks as follows: after a feature map extracted by a CNN convolutional layer is reduced in dimension by the pooling layer, the data sample is fed into the NL block, which outputs the sum of the original feature map and the non-local feature map computed by the operations above; the result is then fed into the next convolutional layer. The combinations of CNN and NL block are shown in Figs. 3a and 3b.
Principle of the loss function:
Fig. 4 is a schematic of softmax classification. The network takes an input layer (Input), passes it through two feature layers (Features I and II), and a softmax classifier finally outputs the probability of each case; in the three-class example shown, these are the probability values of y = 0, y = 1, and y = 2.
Principle of the A-softmax loss:
The A-softmax loss function is expressed as

L_ang = (1/N) Σ_i −log( e^(‖x_i‖·ψ(θ_{y_i,i})) / ( e^(‖x_i‖·ψ(θ_{y_i,i})) + Σ_{j≠y_i} e^(‖x_i‖·cos θ_{j,i}) ) )

where ψ(θ_{y_i,i}) = (−1)^k cos(m·θ_{y_i,i}) − 2k for θ_{y_i,i} ∈ [kπ/m, (k+1)π/m], k ∈ [0, m−1].
θ_{j,i} denotes the angle between x_i and the weight W_j of each other class; θ_{y_i,i} denotes the angle between x_i and the weight W_{y_i} of class y_i, with j ∈ [1, K] and K the total number of classes.
The A-softmax loss function not only adds an m-fold constraint on the classification angle to the original softmax function, but also imposes two constraints, ‖W‖ = 1 and b = 0, on the weight W and bias b of the fully connected layer preceding the A-softmax layer.
Here m ≥ 2 and m ∈ N. The larger m is, the more discriminative the learned features are, but the harder they are to learn, so the most suitable value can be chosen by a limited number of experiments; preferably m = 4, which gives the best results.
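To make the margin concrete, here is a NumPy sketch of ψ and the per-sample A-softmax loss (the feature norm ‖x‖ = 2 and the angles are illustrative values; the class weights are assumed already normalized to ‖W‖ = 1 with b = 0, as required above):

```python
import numpy as np

def psi(theta, m):
    """psi(theta) = (-1)^k * cos(m*theta) - 2k
    for theta in [k*pi/m, (k+1)*pi/m], k = 0..m-1."""
    k = np.minimum(np.floor(theta * m / np.pi), m - 1).astype(int)
    return (-1.0) ** k * np.cos(m * theta) - 2 * k

def a_softmax_loss(x_norm, thetas, target, m):
    """A-softmax loss for one sample.
    x_norm: ||x_i||; thetas: angles between x_i and every class weight W_j."""
    logits = x_norm * np.cos(thetas)
    logits[target] = x_norm * psi(np.asarray(thetas[target]), m)  # margin on true class
    z = logits - logits.max()                                     # numerical stability
    return float(-z[target] + np.log(np.exp(z).sum()))

angles = np.array([0.3, 1.0, 1.2])  # sample closest to class 0
loss_m1 = a_softmax_loss(2.0, angles, target=0, m=1)  # m = 1: plain softmax loss
loss_m4 = a_softmax_loss(2.0, angles, target=0, m=4)
# the m-fold margin shrinks the true-class score, so the loss grows with m
```

With m = 1, ψ reduces to cos(θ), i.e. the ordinary softmax logit, so the two calls differ only by the angular margin.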
With these constraints, the classification process of A-softmax depends only on the angle between W and x; see the geometric illustrations of A-softmax in Figs. 5a and 5b, where x denotes a sample (the general form of x_i in the formula).
Taking binary classification as an example: in both the 2-dimensional plane and 3-dimensional space, the A-softmax loss produces two separated decision planes, and the size of the separation is positively correlated with m; a decision plane is a plane in Euclidean space that separates the different classes. Suppose x is a sample of class 1 and W_1, W_2 are the weights of the two classes, with θ_1 and θ_2 the angles between x and W_1, W_2. The A-softmax classification criterion is cos(m·θ_1) > cos(θ_2), equivalent to m·θ_1 < θ_2. On the unit sphere, θ_1 and θ_2 correspond to the lengths ω_1 and ω_2 of their arcs; comparing θ_1 and θ_2, x belongs to the class with the smaller angle.
Figs. 6a and 6b show a classification process and result of A-softmax for two classes: Fig. 6a shows A-softmax using weight normalization to place the two classes on unit circles, and Fig. 6b shows the classes divided by an angular interval, which makes the final behavior classification of the invention more accurate and yields higher recognition accuracy. The margin in Fig. 6b is the interval between class 1 and class 2; the larger m is, the larger the inter-class interval during training, and the harder the learning.
The CNN model can be trained with networks such as VGGNet-16, ResNet-50, and GoogLeNet, and tested on several public databases (e.g., the KTH, UCF101, and HMDB51 datasets). The proposed method is implemented with the deep learning framework PyTorch. To verify its effectiveness, a VGGNet-16 network was chosen as an example for experiments on the UCF101 dataset, with the following hardware: GPU, Nvidia GeForce GTX 1080 Ti (12 GB video memory); memory, 128 GB; CPU, Intel Core eight-core i7 processor (3.60 GHz base clock). The experiments further show that, under the same conditions, the method improves the recognition accuracy of human behavior recognition by 1.1%.
In summary, the invention improves both branch networks of the original two-stream convolutional neural network (Two-stream CNN) model, designing a non-local CNN model that adds a non-local block (NL block) before the convolutional layers of the spatial-stream and temporal-stream CNNs. This remedies the drawback that CNN convolutional layers extract only local features while global features can be fused only in the final fully connected layer, at the cost of an excessive number of parameters, and it also denoises features inside the network, allowing the deep network to fuse non-local information better.
After the convolutional layers extract the sample features, the network aggregates them in the last fully connected layer to obtain a feature vector, which is sent to the softmax layer for classification. The conventional softmax loss function is easy to optimize, but it does not maximize the distance between different classes or minimize the distance within the same class. The invention instead adopts the A-softmax loss function, which adds an m-fold constraint on the classification angle to the original softmax function and imposes the constraints ‖W‖ = 1 and b = 0 on the weight W and bias b of the fully connected layer preceding the A-softmax layer, so that inter-class distances of the samples become larger and intra-class distances smaller. A deep learning model with stronger discriminative power is finally obtained.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be determined from the following claims.

Claims (6)

1. A human behavior recognition method of a non-local dual-stream convolutional neural network model, characterized in that
data samples are trained with a network structure combining a CNN model and non-local neural network modules: in every processing stage except the last, features are extracted by a convolutional layer and reduced in dimension by a pooling layer, then passed through the non-local neural network module for non-local feature extraction and feature denoising before entering the next convolutional and pooling layers; after the final convolutional and pooling layers, the fully connected layer aggregates the sample features into a feature vector, which the loss layer then classifies and normalizes.
2. The human behavior recognition method according to claim 1,
wherein an A-softmax loss function is adopted in the loss layer, an m-fold constraint is imposed on the classification angle, and the constraints ‖W‖ = 1 and ‖b‖ = 0 are imposed on the weight W and bias b of the fully connected layer.
3. The human behavior recognition method according to claim 2,
the m multiple of the restricted classification angle is m > 2, and m belongs to N.
4. The human behavior recognition method according to claim 1 or 2,
wherein a dual-stream convolutional neural network model is adopted, extracting the spatial appearance information and temporal motion information of the video samples through a spatial-stream CNN model and a temporal-stream CNN model;
the input video sample set is preprocessed into RGB frames and optical-flow images, which are split into a training set and a test set and fed to the spatial-stream and temporal-stream CNN models for training and testing, respectively;
and the loss-layer outputs of the spatial-stream and temporal-stream CNN models are fused by weighting to obtain the behavior recognition result of the dual-stream convolutional neural network model.
5. The human behavior recognition method according to claim 1,
wherein the non-local feature extraction module, following the definition of the non-local mean, is abstractly described as

y_i = (1/C(s)) Σ_{∀j} f(s_i, s_j) g(s_j)

where s denotes the input signal (the feature map) and y the output signal, of the same size as s;
C(s) is a normalization factor; f(s_i, s_j) computes the correlation between pixel s_i and pixel s_j;
g(s_j) = W_g s_j computes the feature value of the input signal at position j;
the output of the non-local feature extraction module is z_i = W_z y_i + s_i;
and W_θ, W_φ, W_g, W_z are learnable weight matrices, implemented by 1×1 convolutions.
6. The human behavior recognition method according to claim 2,
the A-softmax loss function is expressed as

L_ang = (1/N) Σ_i −log( e^(‖x_i‖·ψ(θ_{y_i,i})) / ( e^(‖x_i‖·ψ(θ_{y_i,i})) + Σ_{j≠y_i} e^(‖x_i‖·cos θ_{j,i}) ) )

where ψ(θ_{y_i,i}) = (−1)^k cos(m·θ_{y_i,i}) − 2k for θ_{y_i,i} ∈ [kπ/m, (k+1)π/m], k ∈ [0, m−1];
θ_{j,i} denotes the angle between x_i and the weight W_j of each other class;
θ_{y_i,i} denotes the angle between x_i and the weight W_{y_i} of class y_i, with j ∈ [1, K] and K the total number of classes.
CN201911053686.XA 2019-10-31 2019-10-31 Human body behavior identification method of non-local double-current convolutional neural network model Pending CN110826462A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911053686.XA CN110826462A (en) 2019-10-31 2019-10-31 Human body behavior identification method of non-local double-current convolutional neural network model


Publications (1)

Publication Number Publication Date
CN110826462A true CN110826462A (en) 2020-02-21

Family

ID=69551797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911053686.XA Pending CN110826462A (en) 2019-10-31 2019-10-31 Human body behavior identification method of non-local double-current convolutional neural network model

Country Status (1)

Country Link
CN (1) CN110826462A (en)


Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109711277A (en) * 2018-12-07 2019-05-03 中国科学院自动化研究所 Behavioural characteristic extracting method, system, device based on space-time frequency domain blended learning
CN110334589A (en) * 2019-05-23 2019-10-15 中国地质大学(武汉) A kind of action identification method of the high timing 3D neural network based on empty convolution


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Christoph Feichtenhofer et al.: "Convolutional Two-Stream Network Fusion for Video Action Recognition", arXiv:1604.06573v2 [cs.CV] *
Weiyang Liu et al.: "SphereFace: Deep Hypersphere Embedding for Face Recognition", 2017 IEEE Conference on Computer Vision and Pattern Recognition *
Xiaolong Wang et al.: "Non-local Neural Networks", Computer Vision Foundation *
Lin Jiayue: "Research and Implementation of Behavior Recognition and Localization Based on Deep Learning", China Master's Theses Full-text Database, Information Science and Technology *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20220058396A1 (en) * 2019-11-19 2022-02-24 Tencent Technology (Shenzhen) Company Limited Video Classification Model Construction Method and Apparatus, Video Classification Method and Apparatus, Device, and Medium
US11967152B2 (en) * 2019-11-19 2024-04-23 Tencent Technology (Shenzhen) Company Limited Video classification model construction method and apparatus, video classification method and apparatus, device, and medium
CN111666813A (en) * 2020-04-29 2020-09-15 浙江工业大学 Subcutaneous sweat gland extraction method based on three-dimensional convolutional neural network of non-local information
CN111666813B (en) * 2020-04-29 2023-06-30 浙江工业大学 Subcutaneous sweat gland extraction method of three-dimensional convolutional neural network based on non-local information
CN111626197A (en) * 2020-05-27 2020-09-04 陕西理工大学 Human behavior recognition network model and recognition method
CN111626197B (en) * 2020-05-27 2023-03-10 陕西理工大学 Recognition method based on human behavior recognition network model
CN112487229A (en) * 2020-11-27 2021-03-12 北京邮电大学 Fine-grained image classification method and system and prediction model training method
CN112733953A (en) * 2021-01-19 2021-04-30 福州大学 Lung CT image arteriovenous vessel separation method based on Non-local CNN-GCN and topological subgraph
CN112949460A (en) * 2021-02-26 2021-06-11 陕西理工大学 Human body behavior network model based on video and identification method
CN112949460B (en) * 2021-02-26 2024-02-13 陕西理工大学 Human behavior network model based on video and identification method
CN113674753A (en) * 2021-08-11 2021-11-19 河南理工大学 New speech enhancement method

Similar Documents

Publication Publication Date Title
CN110826462A (en) Human body behavior identification method of non-local double-current convolutional neural network model
CN110135366B (en) Shielded pedestrian re-identification method based on multi-scale generation countermeasure network
Jiao et al. New generation deep learning for video object detection: A survey
Du et al. Skeleton based action recognition with convolutional neural network
CN110188239B (en) Double-current video classification method and device based on cross-mode attention mechanism
CN111612008B (en) Image segmentation method based on convolution network
CN110826389B (en) Gait recognition method based on attention 3D frequency convolution neural network
CN110082821B (en) Label-frame-free microseism signal detection method and device
CN113762138B (en) Identification method, device, computer equipment and storage medium for fake face pictures
Guo et al. JointPruning: Pruning networks along multiple dimensions for efficient point cloud processing
CN111145145B (en) Image surface defect detection method based on MobileNet
CN112905828B (en) Image retriever, database and retrieval method combining significant features
Tang et al. Integrated feature pyramid network with feature aggregation for traffic sign detection
Liu et al. Attentive cross-modal fusion network for RGB-D saliency detection
CN110046544A (en) Digital gesture identification method based on convolutional neural networks
CN106529441B (en) Depth motion figure Human bodys' response method based on smeared out boundary fragment
Wu et al. Pose-aware multi-feature fusion network for driver distraction recognition
CN114492634A (en) Fine-grained equipment image classification and identification method and system
CN114283326A (en) Underwater target re-identification method combining local perception and high-order feature reconstruction
CN113239866B (en) Face recognition method and system based on space-time feature fusion and sample attention enhancement
CN114782979A (en) Training method and device for pedestrian re-recognition model, storage medium and terminal
CN113850182A (en) Action identification method based on DAMR-3 DNet
CN117333908A (en) Cross-modal pedestrian re-recognition method based on attitude feature alignment
CN113887509B (en) Rapid multi-modal video face recognition method based on image set
Hassan et al. Enhanced dynamic sign language recognition using slowfast networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (application publication date: 20200221)