CN111523361B - Human behavior recognition method

Human behavior recognition method

Info

Publication number
CN111523361B
Authority
CN
China
Prior art keywords
information
representing
image
sparse
matrix
Prior art date
Legal status
Active
Application number
CN201911366634.8A
Other languages
Chinese (zh)
Other versions
CN111523361A (en)
Inventor
张信明
郑辉
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China (USTC)
Priority to CN201911366634.8A
Publication of CN111523361A
Application granted
Publication of CN111523361B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/004 Artificial life, i.e. computing arrangements simulating life
    • G06N 3/006 Artificial life, i.e. computing arrangements simulating life based on simulated virtual individual or collective life forms, e.g. social simulations or particle swarm optimisation [PSO]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Human Computer Interaction (AREA)
  • Multimedia (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The application provides a human behavior recognition method. The method first extracts information of two modalities from the video data to be recognized, a first image representing static information and a second image representing dynamic information. The information of the first image and the second image is implicitly aligned by a convolutional neural network with an attention mechanism, the implicitly aligned features are then mapped into a common subspace for explicit alignment, a sparse contractive deep autoencoder performs deep fusion on the aligned features of the different modalities, and finally the highly robust and strongly discriminative fused features are used to train a deep belief network, realizing high-accuracy human behavior recognition. The invention can fully mine and fuse the information of the two different modalities, temporal and spatial, learn high-level semantic features that represent the essential information of the video, and ultimately recognize human behaviors accurately.

Description

Human behavior recognition method
Technical Field
The application relates to the technical field of image analysis, and in particular to a human behavior recognition method based on cross-modal learning.
Background
In recent years, with the popularization of consumer electronic devices and the improvement of network performance, the volume of video generated by various electronic devices has been growing rapidly.
Against the background of big data and smart city (Smart City) construction, a better understanding of video, and of human-centered visual understanding tasks in particular, has a profound influence on fields such as public security, intelligent healthcare, and autonomous driving. Human behavior recognition therefore has important application value.
At present, mainstream human behavior recognition methods are mainly divided into model-driven and data-driven methods, but in practical applications the recognition accuracy of both is relatively low.
Disclosure of Invention
In order to solve the technical problem, the application provides a human behavior recognition method so as to achieve the purpose of improving the accuracy of human behavior recognition.
In order to achieve the technical purpose, the embodiment of the application provides the following technical scheme:
a human behavior recognition method comprises the following steps:
extracting a first image and a second image from video data to be identified; the first image comprises static information of the video data to be identified, and the second image comprises dynamic information of the video data to be identified;
carrying out implicit alignment and feature learning on information of two different modes by utilizing a convolutional neural network with an attention mechanism;
mapping the implicitly aligned features to a common subspace for explicit alignment processing;
carrying out deep fusion on the aligned different modal information by using a sparse shrinkage depth automatic encoder;
and training a classifier deep belief network by using the fused features so as to realize accurate recognition of human behaviors.
Optionally, the mapping of the implicitly aligned features into a common subspace for explicit alignment processing includes:
inputting the first modality information and the second modality information into the convolutional neural network with the attention mechanism for implicit alignment;
and performing explicit alignment on the implicitly aligned first modality information and second modality information, mapping the different modality information into a common subspace by performing subspace learning on it.
Optionally, the explicitly aligning the implicitly aligned first modality information and the second modality information includes:
substituting the first modality information and the second modality information into a first formula;
the first formula includes:
min_{W_x, W_y, V_x, V_y} || W_x^T X V_x - W_y^T Y V_y ||_F^2
s.t. X V_x 1 = Y V_y 1 = 0,
W_x^T X V_x V_x^T X^T W_x = W_y^T Y V_y V_y^T Y^T W_y = I,
W_x^T X V_x V_y^T Y^T W_y = Δ,
wherein X represents the first modality information and Y represents the second modality information; X is a d_x × T_1 matrix and Y is a d_y × T_2 matrix; W_x is the mapping matrix of the first modality information and has dimension d_x × d, W_y is the mapping matrix of the second modality information and has dimension d_y × d; V_x and V_y represent binary selection (warping) matrices; Δ represents a diagonal matrix; 1 represents a vector whose entries are all 1; and I denotes the identity matrix.
Optionally, the method for obtaining the sparse contractive deep autoencoder network includes:
replacing the L2 norm of the weight matrix in the loss function of a conventional autoencoder network with the Frobenius norm of the Jacobian matrix;
introducing a sparsity term into the loss function;
and determining the number of hidden nodes and the sparsity parameter by applying a particle swarm optimization algorithm, thereby improving the conventional autoencoder network into the sparse contractive deep autoencoder network.
Optionally, the loss function of the sparse contractive deep autoencoder network is:
J_SCAE = Σ_{x∈D} [ L(x, y) + λ ||J(x)||_F^2 ] + β Σ_{j=1}^{s_2} KL(ρ || ρ̂_j),
wherein J_SCAE represents the loss function of the sparse contractive deep autoencoder network, D represents the training data set, L(x, y) represents the cross-entropy loss function, λ represents the coefficient controlling the contraction (attenuation) term, s_2 represents the number of neurons of the hidden layer, β represents the sparsity term coefficient, j represents the j-th hidden-layer node with 1 ≤ j ≤ s_2, J(x) represents the Jacobian matrix, ρ represents the sparsity parameter, ρ̂_j represents the average activation of hidden neuron j, and KL(·||·) represents the relative entropy.
It can be seen from the foregoing technical solutions that the human behavior recognition method provided in the embodiments of the present application extracts two different kinds of modality information from the video data to be recognized, namely a first image containing static information and a second image containing dynamic information, performs implicit alignment and feature learning on the first image and the second image with a convolutional neural network with an attention mechanism, maps the implicitly aligned features of the different modalities into a common subspace for explicit alignment, and then fuses the aligned features of the different modalities with a sparse contractive deep autoencoder, so that feature learning is carried out simultaneously with the fusion process; finally, the highly robust and strongly discriminative fused features are used to train a deep belief network, realizing high-accuracy human behavior recognition. As described above, the human behavior recognition method provided by the embodiments of the present application can fully mine and fuse the information of the static and dynamic modalities, fully learn high-level semantic features that represent the essential information of the video, and ultimately recognize human behaviors accurately, thereby improving the accuracy of human behavior recognition.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, it is obvious that the drawings in the following description are only embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the provided drawings without creative efforts.
Fig. 1 is a schematic flowchart of a human behavior recognition method according to an embodiment of the present application;
Fig. 2 is a diagram illustrating several frames of images of video data to be recognized according to an embodiment of the present application;
Fig. 3 is a schematic diagram of a first image extracted from video data to be recognized according to an embodiment of the present application;
Fig. 4 is a schematic diagram of a second image extracted from video data to be recognized according to an embodiment of the present application;
Fig. 5 is a flowchart illustrating a method for obtaining a sparse contractive deep autoencoder network according to an embodiment of the present application.
Detailed Description
As described in the background art, the mainstream human behavior recognition methods in the prior art have low recognition accuracy. The model-driven and data-driven methods are analyzed in detail below.
(1) The model-driven method first obtains manually designed features such as HOG (Histogram of Oriented Gradients), HOF (Histogram of Oriented Optical Flow), and MBH (Motion Boundary Histogram) from a video sequence, and then inputs them into a common classifier such as a Bayes classifier, a support vector machine, or a decision tree for classification and recognition. On the one hand, manually extracting features is time-consuming and labor-intensive; on the other hand, the extracted features are based on prior knowledge and often cannot sufficiently reflect the most essential information of the data.
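For illustration only (this is not part of the patent, and the feature settings and stand-in data are arbitrary), such a model-driven baseline, hand-crafted HOG descriptors followed by a support vector machine, can be sketched as follows:

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import SVC

def clip_descriptor(frames):
    """Hand-crafted HOG descriptor per grayscale frame, averaged over the clip."""
    descs = [hog(f, orientations=9, pixels_per_cell=(8, 8),
                 cells_per_block=(2, 2)) for f in frames]
    return np.mean(descs, axis=0)

# Illustrative stand-in data: 20 clips of 10 grayscale 64x64 frames, 2 classes.
rng = np.random.default_rng(0)
clips = rng.random((20, 10, 64, 64))
labels = rng.integers(0, 2, size=20)

X = np.stack([clip_descriptor(clip) for clip in clips])
clf = SVC(kernel="rbf").fit(X, labels)   # the "common classifier" stage
print(clf.predict(X[:3]))
```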
(2) The data-driven method has become popular with the deep learning wave of the big data era; instead of relying on traditional prior knowledge and physical models, it learns features from the raw data with a deep neural network that simulates the human brain. The high-level abstract features learned by the deep neural network reflect the essential information of the data, are strongly discriminative and robust, and in recent years have gradually replaced traditional feature engineering.
However, on the one hand, some methods ignore the complementary information of the different modalities contained in a video; on the other hand, even methods that do exploit dynamic and static modality information assume by default that the modalities are aligned on the spatio-temporal scale and thus ignore the differences between the modalities at that scale, which affects the understanding of the high-level semantic features of the video.
In view of this, an embodiment of the present application provides a human behavior recognition method, comprising:
extracting a first image and a second image from the video data to be recognized, wherein the first image contains static information of the video data to be recognized and the second image contains dynamic information of the video data to be recognized;
performing implicit alignment and feature learning on the information of the two different modalities by using a convolutional neural network with an attention mechanism;
mapping the implicitly aligned features into a common subspace for explicit alignment processing;
performing deep fusion on the aligned information of the different modalities by using a sparse contractive deep autoencoder;
and training a deep belief network classifier with the fused features, so as to accurately recognize human behaviors.
In this embodiment, the human behavior recognition method first extracts two different kinds of modality information from the video data to be recognized, namely a first image containing static information and a second image containing dynamic information, performs implicit alignment and feature learning on these two kinds of modality information with a convolutional neural network with an attention mechanism, maps the implicitly aligned feature information into a common subspace for explicit alignment, and then fuses the aligned features of the different modalities with a sparse contractive deep autoencoder network, so that feature learning is carried out simultaneously with the fusion process; finally, the highly robust and strongly discriminative fused features are used to train a deep belief network, realizing high-accuracy human behavior recognition. As described above, the human behavior recognition method provided by the embodiments of the present application can fully mine and fuse the information of the static and dynamic modalities, fully learn high-level semantic features that represent the essential information of the video, and ultimately recognize human behaviors accurately, thereby improving the accuracy of human behavior recognition.
The technical solutions in the embodiments of the present application will be described clearly and completely with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only some embodiments of the present application, and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
An embodiment of the present application provides a human behavior recognition method, as shown in fig. 1, comprising:
S101: extracting a first image and a second image from the video data to be recognized; the first image contains static information of the video data to be recognized, and the second image contains dynamic information of the video data to be recognized.
Optionally, the first image may be an image capable of representing static information, such as an RGB image, and the second image may be an image capable of representing dynamic information, such as an optical flow image.
The optical flow (Optical Flow) method is an important method for motion image analysis; optical flow refers to the velocity of pattern motion in a time-varying image, because when an object moves, the brightness pattern of its corresponding points on the image moves as well. The apparent motion (Apparent Motion) of this image brightness pattern is the optical flow. Optical flow expresses the change of the image and, since it contains information on the motion of objects, it can be used by an observer to determine their movement.
Referring to fig. 2 to 4, fig. 2 shows several frames of original images in the video data to be recognized, fig. 3 shows an RGB image extracted from the video data to be recognized as a first image, and fig. 4 shows an optical flow image calculated from the video data to be recognized as the second image.
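As an illustration only (not part of the patent text), the two modalities could be extracted roughly as follows; the Farneback optical flow estimator and the frame-sampling step are assumptions, and any dense optical flow method could be substituted:

```python
import cv2
import numpy as np

def extract_modalities(video_path, step=5):
    """Sample RGB frames (static modality) and dense optical flow fields
    (dynamic modality) from a video file."""
    cap = cv2.VideoCapture(video_path)
    rgb_frames, flow_frames = [], []
    ok, prev = cap.read()
    if not ok:
        return rgb_frames, flow_frames
    prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        if idx % step == 0:
            # First modality: the RGB frame itself (static appearance).
            rgb_frames.append(frame)
            # Second modality: dense optical flow between consecutive frames
            # (Farneback is used purely as an example estimator).
            flow = cv2.calcOpticalFlowFarneback(
                prev_gray, gray, None, 0.5, 3, 15, 3, 5, 1.2, 0)
            flow_frames.append(flow)
        prev_gray = gray
        idx += 1
    cap.release()
    return rgb_frames, flow_frames
```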
S102: performing implicit alignment and feature learning on the information of the two different modalities by using a convolutional neural network with an attention mechanism.
The attention (Attention) mechanism (CAM) is a mechanism embedded in a convolutional neural network that mimics human visual behavior; here, an attention mechanism is added to a ResNet-50 convolutional neural network. In general, the first image and the second image are two-dimensional image signals, and after size cropping and transformation they can be input directly into the convolutional neural network model with the attention mechanism for feature learning.
The ResNet-50 network model is a powerful 50-layer convolutional neural network trained on over a million images from the ImageNet database to classify images into 1000 object categories. Applying this pre-trained network, after fine-tuning, to the video human behavior recognition task exploits its strength in feature learning and effectively mines the essential information of each modality.
In addition, adding an attention mechanism to the convolutional neural network makes it possible to further capture the more discriminative and salient regions of the two-dimensional image while ignoring the interference of unimportant regions.
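The patent does not spell out the exact form of the attention module, so the following PyTorch sketch, a pretrained ResNet-50 backbone with a simple 1x1-convolution spatial attention gate inserted before global pooling, is only one possible illustrative realisation (the weights argument assumes a recent torchvision version):

```python
import torch
import torch.nn as nn
from torchvision import models

class SpatialAttention(nn.Module):
    """Simple spatial attention gate: a 1x1 convolution scores each location."""
    def __init__(self, channels):
        super().__init__()
        self.score = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, x):
        attn = torch.sigmoid(self.score(x))   # (N, 1, H, W) attention map
        return x * attn                        # re-weight the feature map

class AttentiveResNet50(nn.Module):
    """Pretrained ResNet-50 backbone with an attention gate before pooling."""
    def __init__(self, num_classes):
        super().__init__()
        backbone = models.resnet50(weights="IMAGENET1K_V1")
        self.features = nn.Sequential(*list(backbone.children())[:-2])
        self.attention = SpatialAttention(2048)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Linear(2048, num_classes)

    def forward(self, x):
        f = self.attention(self.features(x))
        return self.fc(self.pool(f).flatten(1))

# Example: one modality stream classifying 224x224 crops into 101 actions.
model = AttentiveResNet50(num_classes=101)
logits = model(torch.randn(2, 3, 224, 224))
```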
S103: mapping the implicitly aligned features into a common subspace for explicit alignment processing.
S104: performing deep fusion on the aligned information of the different modalities by using a sparse contractive deep autoencoder.
The fused feature information obtained by the deep fusion can represent the essential spatio-temporal semantic information of the video data to be processed and is highly robust.
S105: training a deep belief network classifier with the fused features, so as to accurately recognize human behaviors.
The deep belief network (DBN) adopts a layer-by-layer training scheme to solve the optimization problem of a deep neural network; the layer-by-layer training gives the whole network good initial weights, so that only fine-tuning is needed for the network to reach a good solution.
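The deep belief network itself is a standard component; as a rough illustration of the layer-by-layer idea (not the patented training procedure), each layer can be pre-trained as a Bernoulli restricted Boltzmann machine with one-step contrastive divergence, and the resulting weights then initialise the network before supervised fine-tuning. The layer sizes and learning rate below are arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

class RBM:
    """Bernoulli RBM trained with 1-step contrastive divergence (CD-1)."""
    def __init__(self, n_visible, n_hidden, lr=0.05, seed=0):
        self.rng = np.random.default_rng(seed)
        self.W = 0.01 * self.rng.standard_normal((n_visible, n_hidden))
        self.b_v = np.zeros(n_visible)
        self.b_h = np.zeros(n_hidden)
        self.lr = lr

    def hidden_probs(self, v):
        return sigmoid(v @ self.W + self.b_h)

    def fit(self, data, epochs=10):
        for _ in range(epochs):
            h0 = self.hidden_probs(data)
            h0_sample = (self.rng.random(h0.shape) < h0).astype(float)
            v1 = sigmoid(h0_sample @ self.W.T + self.b_v)   # reconstruction
            h1 = self.hidden_probs(v1)
            # CD-1 gradient estimates.
            self.W += self.lr * (data.T @ h0 - v1.T @ h1) / len(data)
            self.b_v += self.lr * (data - v1).mean(axis=0)
            self.b_h += self.lr * (h0 - h1).mean(axis=0)
        return self

def pretrain_dbn(data, layer_sizes=(512, 256, 128)):
    """Greedy layer-by-layer pre-training: each RBM is trained on the hidden
    activities of the previous one; the stacked weights initialise the DBN."""
    rbms, x = [], data
    for n_hidden in layer_sizes:
        rbm = RBM(x.shape[1], n_hidden).fit(x)
        rbms.append(rbm)
        x = rbm.hidden_probs(x)
    return rbms
```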
In this embodiment, the human behavior recognition method first extracts two different kinds of modality information from the video data to be recognized, namely a first image containing static information and a second image containing dynamic information, performs implicit alignment and feature learning on the first image and the second image with a convolutional neural network with an attention mechanism, maps the implicitly aligned features of the different modalities into a common subspace for explicit alignment, and then fuses the aligned features of the different modalities with a sparse contractive deep autoencoder, so that feature learning is carried out simultaneously with the fusion process; finally, the highly robust and strongly discriminative fused features are used to train a deep belief network, realizing high-accuracy human behavior recognition. As described above, the human behavior recognition method provided by the embodiments of the present application can fully mine and fuse the information of the static and dynamic modalities, fully learn high-level semantic features that represent the essential information of the video, and ultimately recognize human behaviors accurately, thereby improving the accuracy of human behavior recognition.
On the basis of the foregoing embodiment, in an embodiment of the present application, the mapping of the implicitly aligned features into a common subspace for explicit alignment processing includes:
inputting the first modality information and the second modality information into the convolutional neural network with the attention mechanism for implicit alignment;
and performing explicit alignment on the implicitly aligned first modality information and second modality information, mapping the different modality information into a common subspace by performing subspace learning on it (i.e., mapping the first modality information and the second modality information into a common low-dimensional subspace), and aligning the first modality information and the second modality information in that common subspace.
Specifically, the explicitly aligning the implicitly aligned first modality information and the second modality information includes:
substituting the first modality information and the second modality information into a first formula;
the first formula includes:
min_{W_x, W_y, V_x, V_y} || W_x^T X V_x - W_y^T Y V_y ||_F^2
s.t. X V_x 1 = Y V_y 1 = 0,
W_x^T X V_x V_x^T X^T W_x = W_y^T Y V_y V_y^T Y^T W_y = I,
W_x^T X V_x V_y^T Y^T W_y = Δ,
wherein X represents the first modality information and Y represents the second modality information; X is a d_x × T_1 matrix and Y is a d_y × T_2 matrix; W_x is the mapping matrix of the first modality information and has dimension d_x × d, W_y is the mapping matrix of the second modality information and has dimension d_y × d (d_x × d and d_y × d denote the dimensions of the two mapping matrices); V_x and V_y represent binary selection (warping) matrices; Δ represents a diagonal matrix; 1 represents a vector of appropriate dimension whose entries are all 1; and I denotes the identity matrix.
As mentioned above, adding the attention mechanism to the convolutional neural network makes it possible to capture the more discriminative and salient regions of the two-dimensional image while ignoring the interference of unimportant regions. For a given two-dimensional image, with the attention mechanism introduced into the convolutional neural network, the input to the softmax layer for class c is
S_c = Σ_k w_k^c Σ_{x,y} f_k(x, y),
where k denotes the k-th unit of the last convolutional layer, f_k(x, y) refers to the activation of unit k at spatial location (x, y), w_k^c is the weight of unit k for class c, and the corresponding output is:
P_c = exp(S_c) / Σ_{c'} exp(S_{c'}).
the output result of the implicit alignment network needs to be further input into the explicit alignment network, that is, the alignment on the time-space scale is represented by mapping the information of different modalities into a common subspace through an explicit alignment method.
On the basis of the above embodiments, in an embodiment of the present application, with reference to fig. 5, the method for obtaining the sparse contractive deep autoencoder network includes:
S201: replacing the L2 norm of the weight matrix in the loss function of a conventional autoencoder network with the Frobenius norm of the Jacobian matrix;
S202: introducing a sparsity term into the loss function, so as to impose a sparsity constraint on the hidden layer of the conventional autoencoder network;
S203: determining the number of hidden nodes and the sparsity parameter by applying a particle swarm optimization algorithm, thereby improving the conventional autoencoder network into the sparse contractive deep autoencoder network.
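For step S203, a plain particle swarm search over the two hyperparameters could look like the sketch below; the fitness function (here a placeholder) and the search ranges are assumptions, and in practice the fitness would train a small autoencoder and return its validation loss:

```python
import numpy as np

def pso_search(fitness, bounds, n_particles=15, iters=30, seed=0):
    """Minimise `fitness` over a continuous box `bounds` with plain PSO."""
    rng = np.random.default_rng(seed)
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    pos = lo + rng.random((n_particles, len(bounds))) * (hi - lo)
    vel = np.zeros_like(pos)
    pbest, pbest_val = pos.copy(), np.array([fitness(p) for p in pos])
    gbest = pbest[pbest_val.argmin()].copy()
    w, c1, c2 = 0.7, 1.5, 1.5                 # inertia / acceleration terms
    for _ in range(iters):
        r1, r2 = rng.random((2, *pos.shape))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = np.clip(pos + vel, lo, hi)
        vals = np.array([fitness(p) for p in pos])
        better = vals < pbest_val
        pbest[better], pbest_val[better] = pos[better], vals[better]
        gbest = pbest[pbest_val.argmin()].copy()
    return gbest, pbest_val.min()

# Particle = (number of hidden nodes, sparsity parameter rho).
def fitness(particle):
    n_hidden, rho = int(round(particle[0])), particle[1]
    # Placeholder objective; a real run would train an SCAE with these
    # hyperparameters and return its validation reconstruction loss.
    return abs(n_hidden - 256) / 256 + abs(rho - 0.05)

best, best_val = pso_search(fitness, bounds=[(64, 1024), (0.01, 0.2)])
print(best, best_val)
```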
Specifically, the loss function of the sparse contractive deep autoencoder network is:
J_SCAE = Σ_{x∈D} [ L(x, y) + λ ||J(x)||_F^2 ] + β Σ_{j=1}^{s_2} KL(ρ || ρ̂_j),
wherein J_SCAE represents the loss function of the sparse contractive deep autoencoder network, D represents the training data set, L(x, y) represents the cross-entropy loss function, λ represents the coefficient controlling the contraction (attenuation) term, s_2 represents the number of neurons of the hidden layer, β represents the sparsity term coefficient, j represents the j-th hidden-layer node with 1 ≤ j ≤ s_2, J(x) represents the Jacobian matrix, ρ represents the sparsity parameter, ρ̂_j represents the average activation of hidden neuron j, and KL(·||·) represents the relative entropy.
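For illustration, one layer of this loss can be written down directly in PyTorch; the sketch below assumes a sigmoid encoder (for which ||J(x)||_F^2 has the closed form used in the comment) and inputs scaled to [0, 1], and the coefficients lam, beta and rho are placeholder values rather than the patent's settings:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseContractiveAE(nn.Module):
    """One layer of a sparse contractive autoencoder (illustrative sizes)."""
    def __init__(self, n_in, n_hidden):
        super().__init__()
        self.enc = nn.Linear(n_in, n_hidden)
        self.dec = nn.Linear(n_hidden, n_in)

    def forward(self, x):
        h = torch.sigmoid(self.enc(x))
        x_hat = torch.sigmoid(self.dec(h))
        return h, x_hat

def scae_loss(model, x, lam=1e-4, beta=0.1, rho=0.05):
    h, x_hat = model(x)
    # Reconstruction term L(x, y): cross-entropy, inputs assumed in [0, 1].
    recon = F.binary_cross_entropy(x_hat, x, reduction="mean")
    # Contraction term: for a sigmoid encoder,
    # ||J(x)||_F^2 = sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2.
    w_sq = (model.enc.weight ** 2).sum(dim=1)            # per hidden unit
    contractive = (((h * (1 - h)) ** 2) * w_sq).sum(dim=1).mean()
    # Sparsity term: sum_j KL(rho || rho_hat_j) over the hidden units.
    rho_hat = h.mean(dim=0).clamp(1e-6, 1 - 1e-6)
    kl = (rho * torch.log(rho / rho_hat)
          + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))).sum()
    return recon + lam * contractive + beta * kl

# Example: fusing two 32-d aligned modality features into one 64-d input.
model = SparseContractiveAE(n_in=64, n_hidden=256)
x = torch.rand(128, 64)                  # batch of aligned, concatenated features
loss = scae_loss(model, x)
loss.backward()
```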
An autoencoder (Autoencoder) network is an unsupervised neural network model; the construction and specific structure of a conventional autoencoder network in the prior art are well known to those skilled in the art and are not described here in detail.
To sum up, the embodiments of the present application provide a human behavior recognition method which first extracts two different kinds of modality information from the video data to be recognized, namely a first image containing static information and a second image containing dynamic information, performs implicit alignment and feature learning on the first image and the second image with a convolutional neural network with an attention mechanism, maps the implicitly aligned features of the different modalities into a common subspace for explicit alignment, and then fuses the aligned features of the different modalities with a sparse contractive deep autoencoder, so that feature learning is carried out simultaneously with the fusion process; finally, the highly robust and strongly discriminative fused features are used to train a deep belief network, realizing high-accuracy human behavior recognition. As described above, the human behavior recognition method provided by the embodiments of the present application can fully mine and fuse the information of the static and dynamic modalities, fully learn high-level semantic features that represent the essential information of the video, and ultimately recognize human behaviors accurately, thereby improving the accuracy of human behavior recognition.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the application. Thus, the present application is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (3)

1. A human behavior recognition method, characterized by comprising the following steps:
extracting a first image and a second image from the video data to be recognized, wherein the first image contains static information of the video data to be recognized and the second image contains dynamic information of the video data to be recognized;
performing implicit alignment and feature learning on the information of the two different modalities by using a convolutional neural network with an attention mechanism;
mapping the implicitly aligned features into a common subspace for explicit alignment processing;
performing deep fusion on the aligned information of the different modalities by using a sparse contractive deep autoencoder;
and training a deep belief network classifier with the fused features, so as to accurately recognize human behaviors;
the mapping the implicitly aligned features to a common subspace for explicit alignment processing includes:
inputting the first mode information and the second mode information into a convolutional neural network with an attention mechanism for implicit alignment;
performing explicit alignment on the implicitly aligned first modality information and second modality information, and mapping different modality information into a common subspace by performing subspace learning on the different modality information;
the performing explicit alignment on the implicitly aligned first modality information and second modality information includes:
substituting the first modality information and the second modality information into a first formula;
the first formula includes:
min_{W_x, W_y, V_x, V_y} || W_x^T X V_x - W_y^T Y V_y ||_F^2
s.t. X V_x 1 = Y V_y 1 = 0,
W_x^T X V_x V_x^T X^T W_x = W_y^T Y V_y V_y^T Y^T W_y = I,
W_x^T X V_x V_y^T Y^T W_y = Δ,
wherein X represents the first modality information and Y represents the second modality information; X is a d_x × T_1 matrix and Y is a d_y × T_2 matrix; W_x is the mapping matrix of the first modality information and has dimension d_x × d, W_y is the mapping matrix of the second modality information and has dimension d_y × d; V_x and V_y represent binary selection (warping) matrices; Δ represents a diagonal matrix; 1 represents a vector in which all values are 1; and I denotes the identity matrix.
2. The human behavior recognition method according to claim 1, wherein the method for obtaining the sparse contractive deep autoencoder network comprises:
replacing the L2 norm of the weight matrix in the loss function of a conventional autoencoder network with the Frobenius norm of the Jacobian matrix;
introducing a sparsity term into the loss function;
and determining the number of hidden nodes and the sparsity parameter by applying a particle swarm optimization algorithm, thereby improving the conventional autoencoder network into the sparse contractive deep autoencoder network.
3. The human behavior recognition method according to claim 2, wherein the loss function of the sparse contractive deep autoencoder network is:
J_SCAE = Σ_{x∈D} [ L(x, y) + λ ||J(x)||_F^2 ] + β Σ_{j=1}^{s_2} KL(ρ || ρ̂_j),
wherein J_SCAE represents the loss function of the sparse contractive deep autoencoder network, D represents the training data set, L(x, y) represents the cross-entropy loss function, λ represents the coefficient controlling the contraction (attenuation) term, s_2 represents the number of neurons of the hidden layer, β represents the sparsity term coefficient, j represents the j-th hidden-layer node with 1 ≤ j ≤ s_2, J(x) represents the Jacobian matrix, ρ represents the sparsity parameter, ρ̂_j represents the average activation of hidden neuron j, and KL(·||·) represents the relative entropy.
CN201911366634.8A 2019-12-26 2019-12-26 Human behavior recognition method Active CN111523361B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911366634.8A CN111523361B (en) 2019-12-26 2019-12-26 Human behavior recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911366634.8A CN111523361B (en) 2019-12-26 2019-12-26 Human behavior recognition method

Publications (2)

Publication Number Publication Date
CN111523361A CN111523361A (en) 2020-08-11
CN111523361B true CN111523361B (en) 2022-09-06

Family

ID=71900387

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911366634.8A Active CN111523361B (en) 2019-12-26 2019-12-26 Human behavior recognition method

Country Status (1)

Country Link
CN (1) CN111523361B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112001437B (en) * 2020-08-19 2022-06-14 四川大学 Modal non-complete alignment-oriented data clustering method
CN112487937B (en) * 2020-11-26 2022-12-06 北京有竹居网络技术有限公司 Video identification method and device, storage medium and electronic equipment
CN115116124B (en) * 2022-05-13 2024-07-19 大连海事大学 Action representation and recognition method based on vision and wireless bimodal joint perception

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104523266A (en) * 2015-01-07 2015-04-22 河北大学 Automatic classification method for electrocardiogram signals
CN105678216A (en) * 2015-12-21 2016-06-15 中国石油大学(华东) Spatio-temporal data stream video behavior recognition method based on deep learning
CN106845351A (en) * 2016-05-13 2017-06-13 苏州大学 It is a kind of for Activity recognition method of the video based on two-way length mnemon in short-term
CN109389055A (en) * 2018-09-21 2019-02-26 西安电子科技大学 Video classification methods based on mixing convolution sum attention mechanism
CN109508686A (en) * 2018-11-26 2019-03-22 南京邮电大学 A kind of Human bodys' response method based on the study of stratification proper subspace
CN109615019A (en) * 2018-12-25 2019-04-12 吉林大学 Anomaly detection method based on space-time autocoder
KR20190055632A (en) * 2017-11-15 2019-05-23 전자부품연구원 Object reconstruction apparatus using motion information and object reconstruction method using thereof
CN110135345A (en) * 2019-05-15 2019-08-16 武汉纵横智慧城市股份有限公司 Activity recognition method, apparatus, equipment and storage medium based on deep learning

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104523266A (en) * 2015-01-07 2015-04-22 河北大学 Automatic classification method for electrocardiogram signals
CN105678216A (en) * 2015-12-21 2016-06-15 中国石油大学(华东) Spatio-temporal data stream video behavior recognition method based on deep learning
CN106845351A (en) * 2016-05-13 2017-06-13 苏州大学 It is a kind of for Activity recognition method of the video based on two-way length mnemon in short-term
KR20190055632A (en) * 2017-11-15 2019-05-23 전자부품연구원 Object reconstruction apparatus using motion information and object reconstruction method using thereof
CN109389055A (en) * 2018-09-21 2019-02-26 西安电子科技大学 Video classification methods based on mixing convolution sum attention mechanism
CN109508686A (en) * 2018-11-26 2019-03-22 南京邮电大学 A kind of Human bodys' response method based on the study of stratification proper subspace
CN109615019A (en) * 2018-12-25 2019-04-12 吉林大学 Anomaly detection method based on space-time autocoder
CN110135345A (en) * 2019-05-15 2019-08-16 武汉纵横智慧城市股份有限公司 Activity recognition method, apparatus, equipment and storage medium based on deep learning

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
A sparse auto-encoder-based deep neural network approach for induction motor faults classification; Wenjun Sun et al.; Measurement; 2016-12-31; pp. 171-178 *
Two-stream Flow-guided Convolutional Attention Networks for Action Recognition; An Tran, Loong-Fah Cheong; 2017 IEEE International Conference on Computer Vision Workshops; 2017-12-31; pp. 3110-3119 *
Human behavior recognition based on attention mechanism and multi-modal feature fusion; 吴汉卿; China Excellent Master's Theses Full-text Database, Information Science and Technology series; 2019-07-15; I138-1150 *

Also Published As

Publication number Publication date
CN111523361A (en) 2020-08-11

Similar Documents

Publication Publication Date Title
CN107506712B (en) Human behavior identification method based on 3D deep convolutional network
CN111523361B (en) Human behavior recognition method
CN108537136B (en) Pedestrian re-identification method based on attitude normalization image generation
Zhang et al. Deep hierarchical guidance and regularization learning for end-to-end depth estimation
CN109409222A (en) A kind of multi-angle of view facial expression recognizing method based on mobile terminal
Cherabier et al. Learning priors for semantic 3d reconstruction
Jaswanth et al. A novel based 3D facial expression detection using recurrent neural network
CN107862275A (en) Human bodys' response model and its construction method and Human bodys' response method
CN111191583A (en) Space target identification system and method based on convolutional neural network
CN105787458A (en) Infrared behavior identification method based on adaptive fusion of artificial design feature and depth learning feature
CN106709482A (en) Method for identifying genetic relationship of figures based on self-encoder
CN113343974B (en) Multi-modal fusion classification optimization method considering inter-modal semantic distance measurement
CN112418032B (en) Human behavior recognition method and device, electronic equipment and storage medium
CN112507943B (en) Visual positioning navigation method, system and medium based on multitasking neural network
US11223782B2 (en) Video processing using a spectral decomposition layer
CN114692732B (en) Method, system, device and storage medium for updating online label
WO2021243947A1 (en) Object re-identification method and apparatus, and terminal and storage medium
CN112464844A (en) Human behavior and action recognition method based on deep learning and moving target detection
Pavel et al. Object class segmentation of RGB-D video using recurrent convolutional neural networks
US20220301311A1 (en) Efficient self-attention for video processing
Rybchak et al. Analysis of computer vision and image analysis technics
CN113255602A (en) Dynamic gesture recognition method based on multi-modal data
CN110111365A (en) Training method and device and method for tracking target and device based on deep learning
CN114399661A (en) Instance awareness backbone network training method
AU2020102476A4 (en) A method of Clothing Attribute Prediction with Auto-Encoding Transformations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant