CN113221693A - Action recognition method - Google Patents


Info

Publication number
CN113221693A
CN113221693A (application CN202110472752.8A)
Authority
CN
China
Prior art keywords
feature
occurrence
layer
action
image
Prior art date
Legal status (assumed; not a legal conclusion)
Granted
Application number
CN202110472752.8A
Other languages
Chinese (zh)
Other versions
CN113221693B (en)
Inventor
杨剑宇
黄瑶
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Priority date
Filing date
Publication date
Application filed by Suzhou University
Priority to CN202110472752.8A
Publication of CN113221693A
Application granted
Publication of CN113221693B
Legal status: Active


Classifications

    • G06V 40/20: Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/23: Pattern recognition; clustering techniques
    • G06N 3/04: Neural networks; architecture, e.g. interconnection topology
    • G06N 3/08: Neural networks; learning methods
    • G06V 20/46: Scenes; scene-specific elements in video content; extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention provides an action recognition method, which comprises: calculating a dynamic image for each action video sample; inputting the dynamic image of the action video sample into a feature extractor to obtain the feature vectors of the dynamic image; constructing a feature center group; inputting all feature vectors into the feature centers and accumulating all outputs at each feature center to obtain a histogram representation; inputting the histogram representation into a multilayer perceptron to form a feature quantization network; training the feature quantization network to convergence and finding the co-occurrence feature center groups of each action category; constructing an image feature co-occurrence layer; constructing an action recognition network based on co-occurrence image features, training it to convergence, and finding the co-occurrence image feature neuron groups of each action category; constructing a semantic feature co-occurrence layer; and constructing an action recognition network based on hierarchical co-occurrence features, training it to convergence, calculating the dynamic image of a test action video sample, and inputting it into the trained action recognition network based on hierarchical co-occurrence features to achieve action recognition.

Description

Action recognition method
Technical Field
The invention relates to an action recognition method and belongs to the technical field of action recognition.
Background
Action recognition is an important topic in the field of computer vision and is widely applied in video surveillance, behavior analysis, human-computer interaction and other fields. With the development of social networks and the popularization of RGB devices, action recognition based on RGB videos has attracted increasing attention from researchers. Compared with skeleton-based action recognition, action recognition based on RGB videos acquires data more easily and is more reliable.
Most existing methods extract the deep features contained in a video by designing deep convolutional neural networks with different structures and then perform action recognition. These methods typically ignore the interpretability of the extracted features and feed stacked static frame images into a complex convolutional neural network, which places high demands on the performance of the computer equipment.
Therefore, an action recognition method is proposed to address these problems.
Disclosure of Invention
The invention is provided to solve the above problems in the prior art. The technical solution is as follows:
An action recognition method comprising the following steps:
step one, calculating the dynamic image of an action video sample;
step two, inputting the dynamic image of the action video sample into a feature extractor to obtain the feature vectors of the dynamic image;
step three, constructing a feature center group, inputting all feature vectors of the dynamic image of the action video sample into the feature centers, and accumulating all outputs at each feature center to obtain a histogram representation;
step four, inputting the histogram representation into a multilayer perceptron to form a feature quantization network;
step five, inputting the dynamic images of all training action video samples into the feature quantization network, training the feature quantization network to convergence, and finding the co-occurrence feature center groups of each action category;
step six, constructing an image feature co-occurrence layer;
step seven, inputting the output of the image feature co-occurrence layer into a multilayer perceptron to construct an action recognition network based on co-occurrence image features;
step eight, inputting the dynamic images of all training action video samples into the action recognition network based on co-occurrence image features, training it to convergence, and finding the co-occurrence image feature neuron groups of each action category;
step nine, constructing a semantic feature co-occurrence layer;
step ten, inputting the output of the semantic feature co-occurrence layer into a multilayer perceptron to construct an action recognition network based on hierarchical co-occurrence features;
step eleven, training the action recognition network based on hierarchical co-occurrence features to convergence, calculating the dynamic image of a test action video sample, and inputting it into the trained action recognition network based on hierarchical co-occurrence features to achieve action recognition.
Preferably, in the first step, the method for calculating the dynamic image of the motion video sample includes:
Each motion video sample consists of all the frames in the video. For any motion video sample A:
A = {I_t | t ∈ [1, T]},
where t denotes the time index and T is the total number of frames of motion video sample A;
$I_t \in \mathbb{R}^{R \times C \times 3}$
is the matrix representation of the t-th frame image of motion video sample A, where R, C and 3 correspond to the numbers of rows, columns and channels of that matrix representation, respectively, and $\mathbb{R}$ denotes the set of real numbers; each element of I_t represents a pixel value of the t-th frame image.
For any motion video sample A, first vectorize I_t, i.e. connect all the row vectors of the three channels of I_t into a new row vector i_t.
Take the arithmetic square root of each element of the row vector i_t to obtain a new vector w_t, namely:
$w_t = \sqrt{i_t}$,
where $\sqrt{i_t}$ denotes taking the arithmetic square root of each element of the row vector i_t; w_t is called the frame vector of the t-th frame image of motion video sample A.
Calculate the feature vector v_t of the t-th frame image of motion video sample A as follows:
$v_t = \frac{1}{t}\sum_{\tau=1}^{t} w_\tau$,
where $\sum_{\tau=1}^{t} w_\tau$ denotes the summation of the frame vectors of the 1st to the t-th frame images of motion video sample A.
Calculate the score B_t of the t-th frame image I_t of motion video sample A as follows:
B_t = u^T · v_t,
where u is a vector of dimension f, with f = R × C × 3; u^T denotes the transpose of the vector u, and u^T · v_t denotes the dot product of the transposed vector u and the feature vector v_t.
The value of u is computed so that frame images appearing later in the motion video sample receive higher scores, i.e. the larger t is, the higher the score B_t. The vector u can be computed with RankSVM as follows:
$u^* = \underset{u}{\arg\min}\; E(u)$,
$E(u) = \frac{\lambda}{2}\|u\|^2 + \frac{2}{T(T-1)}\sum_{i>j}\max\{0,\, 1 - B_i + B_j\}$,
where $u^*$ denotes the u that minimizes the value of E(u), λ is a constant, and $\|u\|^2$ denotes the sum of the squares of the elements of the vector u; B_i and B_j denote the scores of the i-th and j-th frame images of motion video sample A, respectively, and max{0, 1 - B_i + B_j} means taking the larger of 0 and 1 - B_i + B_j.
After computing the vector u with RankSVM, rearrange u into an image of the same size as I_t to obtain
$u' \in \mathbb{R}^{R \times C \times 3}$;
u' is the dynamic image of motion video sample A.
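As an illustration, the dynamic-image computation described above can be sketched in Python as follows. This is a minimal, non-limiting sketch: the function and variable names are illustrative, and the RankSVM step is approximated by training a linear SVM on pairwise difference vectors, which is one common way to solve the ranking objective (the solver and regularization constant are assumptions, not specified by the patent).

```python
import numpy as np
from sklearn.svm import LinearSVC

def dynamic_image(frames, svm_c=1.0):
    """Compute a dynamic image for one video by rank pooling.

    frames: array of shape (T, rows, cols, 3) holding the frame images.
    Returns an array of shape (rows, cols, 3), the dynamic image u'.
    """
    T = frames.shape[0]
    # i_t: vectorized frame; w_t: element-wise arithmetic square root
    w = np.sqrt(frames.reshape(T, -1).astype(np.float64))
    # v_t: time-averaged sum of the frame vectors w_1..w_t
    v = np.cumsum(w, axis=0) / np.arange(1, T + 1)[:, None]
    # RankSVM via pairwise differences: for every i > j, require u . (v_i - v_j) > 0
    diffs = np.array([v[i] - v[j] for i in range(T) for j in range(i)])
    X = np.vstack([diffs, -diffs])                 # mirrored pairs give two classes
    y = np.hstack([np.ones(len(diffs)), -np.ones(len(diffs))])
    svm = LinearSVC(C=svm_c, fit_intercept=False, max_iter=10000)
    svm.fit(X, y)
    u = svm.coef_.ravel()
    return u.reshape(frames.shape[1:])             # rearranged to the frame size
```

A video of 40 frames, as in the embodiment below, yields 40 × 39 / 2 = 780 ordered frame pairs for the ranking constraint.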
Further, in the second step, the feature extractor is composed of a series of convolutional layers and pooling layers. The dynamic image of each motion video sample is input into the feature extractor, which outputs a feature map
$F \in \mathbb{R}^{K_1 \times K_2 \times D}$,
where K_1, K_2 and D denote the height, width and number of channels of the output feature map, respectively. The feature map F has K_1 × K_2 pixel points in total, and the feature vector x_y of each pixel point has dimension D, i.e. the number of channels of the feature map F, with y = 1, 2, ..., K_1 × K_2. The feature vectors of the dynamic image are represented by the set of feature vectors X = {x_y | y = 1, 2, ..., K_1 × K_2}.
Further, in the third step, the feature center group contains N_K feature centers, each feature center corresponds to a scaling coefficient, and the initial values of each feature center and its scaling coefficient are computed as follows:
Compute the feature vectors of the dynamic images of all training motion video samples and cluster all the feature vectors; the number of clusters equals the number of feature centers, i.e. N_K, each cluster has a cluster center, and the cluster center values obtained by clustering are used as the initial values of the feature centers. For the k-th cluster, the set of all feature vectors in the cluster is denoted E_k and contains N_k feature vectors:
$E_k = \{e_1, e_2, \ldots, e_{N_k}\}$.
Compute the Euclidean distance d_{q,τ} between feature vectors:
$d_{q,\tau} = \sqrt{\sum_{d} ([e_q]_d - [e_\tau]_d)^2}$,
where [e_q]_d denotes the d-th dimension of the feature vector e_q, q ∈ [1, N_k - 1], τ ∈ [q + 1, N_k]. The initial value of the scaling coefficient σ_k of the k-th feature center is the mean of the pairwise distances within the cluster:
$\sigma_k = \frac{2}{N_k (N_k - 1)} \sum_{q=1}^{N_k - 1} \sum_{\tau=q+1}^{N_k} d_{q,\tau}$.
For a feature vector x_y of the dynamic image, compute its distance to the k-th feature center c_k and use it as the output of x_y at the k-th feature center c_k; the calculation formula is:
W_k(x_y) = exp(-||x_y - c_k||^2 / σ_k).
The output obtained by inputting the feature vector x_y into the k-th feature center is normalized:
$\bar{W}_k(x_y) = \frac{W_k(x_y)}{\sum_{k'=1}^{N_K} W_{k'}(x_y)}$.
All feature vectors of the dynamic image of each motion video sample are input into every feature center of the feature center group, and all outputs at each feature center are accumulated. The accumulated output h_k of the k-th feature center is computed as:
$h_k = \sum_{y=1}^{K_1 \times K_2} \bar{W}_k(x_y)$.
The accumulated values of all feature centers are concatenated to obtain the histogram representation H of the dynamic image:
$H = (h_1, h_2, \ldots, h_{N_K})$.
The feature center group and the accumulation layer that accumulates its outputs form a feature soft quantizer. The input of the feature soft quantizer is the feature vectors of the dynamic image of each motion video sample, and its output is the histogram representation of the dynamic image.
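As an illustration, the feature soft quantizer described above can be sketched as follows; the array names and shapes are illustrative and not part of the patent.

```python
import numpy as np

def soft_quantize(X, centers, sigmas):
    """Feature soft quantizer sketch: soft-assign every feature vector to all
    feature centers and accumulate the normalized responses into a histogram.
    X: (P, D) feature vectors of one dynamic image; centers: (N_K, D);
    sigmas: (N_K,). Returns the histogram H of length N_K."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (P, N_K)
    W = np.exp(-sq_dist / sigmas[None, :])                           # W_k(x_y)
    W_bar = W / W.sum(axis=1, keepdims=True)                         # normalize over centers
    return W_bar.sum(axis=0)                                         # accumulate over pixels
```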
Further, in the fourth step, the feature quantization network comprises the feature extractor, the feature soft quantizer and a multilayer perceptron.
The multilayer perceptron comprises an input layer, a hidden layer and an output layer. The input layer is connected to the output of the histogram concatenation layer, and the output Input_1 of the input layer equals the output H of the histogram concatenation layer, i.e. Input_1 = H; the input layer has r_1 = N_K neurons in total. The hidden layer has z_1 neurons and is fully connected to all output units of the input layer. The output layer of the multilayer perceptron has o neurons, each representing one action category. Denote the weights between the input layer and the hidden layer by
$U_1 \in \mathbb{R}^{r_1 \times z_1}$,
and the weights between the hidden layer and the output layer by
$V_1 \in \mathbb{R}^{z_1 \times o}$.
The output Q_1 of the hidden-layer neurons is computed as:
$Q_1 = \phi_{elu}(Input_1 \cdot U_1 + \beta_1)$,
where φ_elu is the ELU activation function and $\beta_1 \in \mathbb{R}^{z_1}$ is the bias vector of the hidden layer.
The output O_1 of the output layer of the multilayer perceptron is:
$O_1 = \phi_{softmax}(Q_1 \cdot V_1 + \gamma_1)$,
where φ_softmax is the softmax activation function and $\gamma_1 \in \mathbb{R}^{o}$ is the bias vector of the output layer.
The loss function L_1 of the feature quantization network is:
$L_1 = -\sum_{g=1}^{G} \sum_{c=1}^{o} Y_c^g \log [O_1^g]_c$,
where $O_1^g$ is the output vector of the multilayer perceptron for the g-th sample and $Y^g = (Y_1^g, \ldots, Y_o^g)$ is the desired output vector for the g-th sample, defined in terms of the label l_g as:
$Y_c^g = \begin{cases} 1, & c = l_g \\ 0, & c \neq l_g \end{cases}$,
where G is the total number of samples and l_g is the label value of the g-th sample.
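As an illustration, the multilayer perceptron of the feature quantization network can be sketched in PyTorch as follows. The layer sizes follow the embodiment below (128 feature centers, 64 hidden neurons, 10 action categories), and the use of a cross-entropy criterion is an assumption consistent with the softmax output described above.

```python
import torch
import torch.nn as nn

class QuantizationHead(nn.Module):
    """MLP that maps the histogram representation H to action-category scores."""
    def __init__(self, n_centers=128, hidden=64, n_classes=10):
        super().__init__()
        self.hidden = nn.Linear(n_centers, hidden)   # weights U_1 and bias beta_1
        self.out = nn.Linear(hidden, n_classes)      # weights V_1 and bias gamma_1
        self.act = nn.ELU()

    def forward(self, H):
        Q1 = self.act(self.hidden(H))                # hidden-layer output Q_1
        return self.out(Q1)                          # logits; softmax is applied inside the loss

# training sketch (assumed cross-entropy loss over the label values l_g):
# criterion = nn.CrossEntropyLoss()
# loss = criterion(QuantizationHead()(H_batch), labels)
```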
Further, in the fifth step, the method for finding the co-occurrence feature center groups of each action category is as follows:
The dynamic image of each category's training action video samples is input into the feature extractor and the feature soft quantizer of the trained feature quantization network to obtain the histogram representations.
For each action category, the a-th training action video sample has the histogram representation
$H^a = (h_1^a, h_2^a, \ldots, h_{N_K}^a)$.
Each dimension of the histogram representation corresponds to one feature center, and its value is the response value of that training action video sample to that feature center. For each action category, the covariance between feature centers within that category is computed.
For any action category, the covariance Cov(k_1, k_2) between the k_1-th feature center and the k_2-th feature center in that category is computed as:
$\mathrm{Cov}(k_1, k_2) = \frac{1}{N_\theta} \sum_{a=1}^{N_\theta} (h_{k_1}^a - \bar{h}_{k_1})(h_{k_2}^a - \bar{h}_{k_2})$,
where $h_{k_1}^a$ is the response value of the a-th training motion video sample of that category to the k_1-th feature center, $\bar{h}_{k_1}$ is the average of the response values of all training motion video samples of that category to the k_1-th feature center, N_θ is the total number of training action video samples in that category, k_1 ∈ [1, N_K - 1], and k_2 ∈ [k_1 + 1, N_K].
For each action category, the covariances between all feature centers are computed in the above manner. Then, starting from the first action category, the computed covariance values are sorted and the K_1 largest covariances are selected, each covariance corresponding to a group of feature centers. The larger the covariance, the higher the probability that the features represented by that group of feature centers co-occur in the action category. Each group of feature centers represents image features that co-occur in the dynamic images of that category of motion video samples. If a feature center group found for a later-ranked action category duplicates a feature center group found for an earlier-ranked action category, the duplicated group is not counted among the groups found for that category, and the feature center group corresponding to the next largest covariance is selected instead. Thus K_1 groups of feature centers are found for each action category; with o action categories in total, K_1 × o groups of feature centers are found.
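As an illustration, the covariance-based selection of co-occurrence feature center groups can be sketched as follows. The function is a sketch of the selection rule above; whether the covariance is normalized by N_θ or N_θ - 1 does not affect the ranking.

```python
import numpy as np

def cooccurring_pairs(hists, k_top, exclude=frozenset()):
    """Select the k_top feature-center pairs with the largest covariance for one
    action category, skipping pairs already chosen for earlier categories.
    hists: (N_theta, N_K) histogram representations of that category.
    Returns a list of (k1, k2) index pairs with k1 < k2."""
    cov = np.cov(hists, rowvar=False, bias=True)   # (N_K, N_K) covariance over samples
    iu = np.triu_indices(cov.shape[0], k=1)        # all pairs k1 < k2
    order = np.argsort(cov[iu])[::-1]              # largest covariance first
    pairs = []
    for idx in order:
        pair = (int(iu[0][idx]), int(iu[1][idx]))
        if pair in exclude:                        # duplicate of an earlier category
            continue
        pairs.append(pair)
        if len(pairs) == k_top:
            break
    return pairs
```

Calling this once per action category, while accumulating the previously selected pairs of earlier categories into exclude, yields the K_1 groups per category and K_1 × o groups in total described above.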
Further, in the sixth step, the image feature co-occurrence layer is constructed as follows:
Based on the K_1 × o groups of feature centers found, a new layer is constructed after the feature soft quantizer, called the image feature co-occurrence layer. This layer has K_1 × o neurons in total; each neuron corresponds to one of the found groups of feature centers, its value is the product of the response values of that group of feature centers in the histogram of the motion video sample, and it is called a co-occurrence image feature neuron. The output of the image feature co-occurrence layer is
$S = (s_1, s_2, \ldots, s_{K_1 \times o})$,
where the output s_b of the b-th co-occurrence image feature neuron is computed as:
$s_b = h_{b_1} \cdot h_{b_2}$,
where $h_{b_1}$ and $h_{b_2}$ are the response values of the motion video sample to the two feature centers corresponding to the b-th co-occurrence image feature neuron.
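As an illustration, the image feature co-occurrence layer can be sketched as follows (names are illustrative, not taken from the patent):

```python
import numpy as np

def image_cooccurrence_layer(H, center_pairs):
    """Image feature co-occurrence layer sketch: each co-occurrence image feature
    neuron multiplies the histogram responses of one selected pair of feature
    centers. H: (N_K,) histogram of a dynamic image; center_pairs: K1*o pairs (k1, k2)."""
    return np.array([H[k1] * H[k2] for k1, k2 in center_pairs])
```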
Further, in the seventh step, the action recognition network based on co-occurrence image features comprises the feature extractor, the feature soft quantizer, the image feature co-occurrence layer and a multilayer perceptron.
The multilayer perceptron comprises an input layer, a hidden layer and an output layer. The input layer is connected to the output of the image feature co-occurrence layer, and the output Input_2 of the input layer equals the output S of the image feature co-occurrence layer, i.e. Input_2 = S; the input layer has r_2 = K_1 × o neurons in total. The hidden layer has z_2 neurons and is fully connected to all output units of the input layer. The output layer of the multilayer perceptron has o neurons, each representing one action category. Denote the weights between the input layer and the hidden layer by
$U_2 \in \mathbb{R}^{r_2 \times z_2}$,
and the weights between the hidden layer and the output layer by
$V_2 \in \mathbb{R}^{z_2 \times o}$.
The output Q_2 of the hidden-layer neurons is computed as:
$Q_2 = \phi_{elu}(Input_2 \cdot U_2 + \beta_2)$,
where φ_elu is the ELU activation function and $\beta_2 \in \mathbb{R}^{z_2}$ is the bias vector of the hidden layer.
The output O_2 of the output layer of the multilayer perceptron is:
$O_2 = \phi_{softmax}(Q_2 \cdot V_2 + \gamma_2)$,
where φ_softmax is the softmax activation function and $\gamma_2 \in \mathbb{R}^{o}$ is the bias vector of the output layer.
The loss function L_2 of the action recognition network based on co-occurrence image features is:
$L_2 = -\sum_{g=1}^{G} \sum_{c=1}^{o} Y_c^g \log [O_2^g]_c$,
where $O_2^g$ is the output vector of the multilayer perceptron for the g-th sample.
Further, in the step eight, the method for finding the co-occurrence image feature neuron groups of each action category is as follows:
The dynamic image of each action category's training action video samples is input into the trained action recognition network based on co-occurrence image features to obtain the output S of the image feature co-occurrence layer.
For each action category, the output of the a-th training action video sample at the image feature co-occurrence layer is
$S^a = (s_1^a, s_2^a, \ldots, s_{K_1 \times o}^a)$.
Each dimension of this output corresponds to one co-occurrence image feature neuron in the image feature co-occurrence layer. The covariance between co-occurrence image feature neurons within each action category is computed.
For any action category, the covariance Cov(d_1, d_2) between the d_1-th co-occurrence image feature neuron and the d_2-th co-occurrence image feature neuron in that category is computed as:
$\mathrm{Cov}(d_1, d_2) = \frac{1}{N_\theta} \sum_{a=1}^{N_\theta} (s_{d_1}^a - \bar{s}_{d_1})(s_{d_2}^a - \bar{s}_{d_2})$,
where $s_{d_1}^a$ is the output of the a-th training motion video sample of that category at the d_1-th co-occurrence image feature neuron, $\bar{s}_{d_1}$ is the average of the outputs of all training motion video samples of that category at the d_1-th co-occurrence image feature neuron, d_1 ∈ [1, K_1 × o - 1], and d_2 ∈ [d_1 + 1, K_1 × o].
For each action category, the covariances between all co-occurrence image feature neurons are computed in the above manner. Then, starting from the first action category, the computed covariance values are sorted and the K_2 largest covariances are selected, each covariance corresponding to a group of co-occurrence image feature neurons. The larger the covariance, the higher the probability that that group of co-occurrence image feature neurons co-occurs in the category. Each group of co-occurrence image feature neurons represents semantic features that co-occur in the dynamic images of that category of motion video samples. If a co-occurrence image feature neuron group found for a later-ranked action category duplicates a group found for an earlier-ranked category, the duplicated group is not counted among the groups found for that category, and the group corresponding to the next largest covariance is selected instead. Thus K_2 groups of co-occurrence image feature neurons are found for each action category; with o action categories in total, K_2 × o groups of co-occurrence image feature neurons are found.
Further, in the ninth step, the semantic feature co-occurrence layer is constructed as follows:
Based on the K_2 × o groups of co-occurrence image feature neurons found, a new layer is constructed after the image feature co-occurrence layer, called the semantic feature co-occurrence layer. This layer has K_2 × o neurons in total, called co-occurrence semantic feature neurons; each neuron corresponds to one of the found groups of co-occurrence image feature neurons, and its value is the product of the outputs of the action video sample at that group of co-occurrence image feature neurons. The output of the semantic feature co-occurrence layer is
$M = (m_1, m_2, \ldots, m_{K_2 \times o})$,
where the output m_χ of the χ-th co-occurrence semantic feature neuron is computed as:
$m_\chi = s_{\chi_1} \cdot s_{\chi_2}$,
where $s_{\chi_1}$ and $s_{\chi_2}$ are the output values of the motion video sample at the two co-occurrence image feature neurons corresponding to the χ-th co-occurrence semantic feature neuron.
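As an illustration, the semantic feature co-occurrence layer mirrors the image feature co-occurrence layer one level up; a sketch with illustrative names is:

```python
import numpy as np

def semantic_cooccurrence_layer(S, neuron_pairs):
    """Semantic feature co-occurrence layer sketch: each co-occurrence semantic
    feature neuron multiplies the outputs of one selected pair of co-occurrence
    image feature neurons. S: (K1*o,) output of the image feature co-occurrence
    layer; neuron_pairs: K2*o pairs (d1, d2)."""
    return np.array([S[d1] * S[d2] for d1, d2 in neuron_pairs])
```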
Further, in the step ten, the action recognition network based on hierarchical co-occurrence features comprises the feature extractor, the feature soft quantizer, the image feature co-occurrence layer, the semantic feature co-occurrence layer and a multilayer perceptron.
The multilayer perceptron comprises an input layer, a hidden layer and an output layer. The input layer is connected to the output of the semantic feature co-occurrence layer, and the output Input_3 of the input layer equals the output M of the semantic feature co-occurrence layer, i.e. Input_3 = M; the input layer has r_3 = K_2 × o neurons in total. The hidden layer has z_3 neurons and is fully connected to all output units of the input layer. The output layer of the multilayer perceptron has o neurons, each representing one action category. Denote the weights between the input layer and the hidden layer by
$U_3 \in \mathbb{R}^{r_3 \times z_3}$,
and the weights between the hidden layer and the output layer by
$V_3 \in \mathbb{R}^{z_3 \times o}$.
The output Q_3 of the hidden-layer neurons is computed as:
$Q_3 = \phi_{elu}(Input_3 \cdot U_3 + \beta_3)$,
where φ_elu is the ELU activation function and $\beta_3 \in \mathbb{R}^{z_3}$ is the bias vector of the hidden layer.
The output O_3 of the output layer of the multilayer perceptron is:
$O_3 = \phi_{softmax}(Q_3 \cdot V_3 + \gamma_3)$,
where φ_softmax is the softmax activation function and $\gamma_3 \in \mathbb{R}^{o}$ is the bias vector of the output layer.
The loss function L_3 of the action recognition network based on hierarchical co-occurrence features is:
$L_3 = -\sum_{g=1}^{G} \sum_{c=1}^{o} Y_c^g \log [O_3^g]_c$,
where $O_3^g$ is the output vector of the multilayer perceptron for the g-th sample.
The input of the action recognition network based on hierarchical co-occurrence features is the dynamic image of an action video sample, and its output is the probability that the current action video sample belongs to each action category.
Further, in the eleventh step, action recognition is performed as follows:
The dynamic image of each test action video sample is calculated and input into the trained action recognition network based on hierarchical co-occurrence features to obtain the predicted probabilities of the current test action video sample belonging to each action category; the action category with the largest probability is the finally predicted action category of the current test action video sample, thereby achieving action recognition.
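As an illustration, test-time recognition can be sketched by chaining the sketches given above; all function and parameter names are illustrative, and `mlp` stands for the trained multilayer perceptron that maps M to class probabilities.

```python
import numpy as np

def predict_action(video_frames, feature_extractor, centers, sigmas,
                   center_pairs, neuron_pairs, mlp):
    """End-to-end inference sketch for the hierarchical co-occurrence network,
    assuming the earlier sketch functions are in scope."""
    di = dynamic_image(video_frames)                   # step one
    X = feature_extractor(di)                          # (P, D) feature vectors
    H = soft_quantize(X, centers, sigmas)              # histogram representation
    S = image_cooccurrence_layer(H, center_pairs)      # image-level co-occurrence
    M = semantic_cooccurrence_layer(S, neuron_pairs)   # semantic-level co-occurrence
    probs = mlp(M)                                     # probabilities per action category
    return int(np.argmax(probs))                       # predicted action category
```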
The action recognition network based on hierarchical co-occurrence features can learn the co-occurring features in action videos and effectively increases the discriminability of the action sample representation; when training the network, only a single dynamic image is used as a compact representation of the action video and input into the network, so the demands on computer hardware are low.
Drawings
FIG. 1 is a flow chart of the operation of a method of motion recognition in accordance with the present invention.
FIG. 2 is a schematic view of a dynamic image according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a feature extractor of the present invention.
Fig. 4 is a schematic diagram of grouped convolution module 1 in Fig. 3.
Fig. 5 is a schematic diagram of grouped convolution module 2 or grouped convolution module 3 in Fig. 3.
Fig. 6 is a schematic diagram of the feature quantization network of the present invention.
FIG. 7 is a schematic diagram of the action recognition network based on co-occurrence image features according to the present invention.
FIG. 8 is a schematic diagram of the action recognition network based on hierarchical co-occurrence features of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, a motion recognition method includes the following steps:
1. The motion video sample set contains 2000 motion video samples in total, covering 10 action categories with 200 motion video samples per category. Three quarters of the motion video samples of each action category are randomly selected as the training set and the remaining quarter as the test set, giving 1500 training motion video samples and 500 test motion video samples. Each motion video sample consists of all the frames of the sample video. Take any motion video sample A as an example:
A={It|t∈[1,40]},
where t represents a time index, the motion video sample has a total of 40 frames.
$I_t \in \mathbb{R}^{240 \times 320 \times 3}$
is the matrix representation of the t-th frame image of motion video sample A; the frame image has 240 rows, 320 columns and 3 channels, $\mathbb{R}$ denotes the set of real numbers, and each element of I_t represents a pixel value of the t-th frame image.
The dynamic image of the motion video sample is calculated as follows:
For any motion video sample A, first vectorize I_t, i.e. connect all the row vectors of the three channels of I_t into a new row vector i_t.
Take the arithmetic square root of each element of the row vector i_t to obtain a new vector w_t, namely:
$w_t = \sqrt{i_t}$,
where $\sqrt{i_t}$ denotes taking the arithmetic square root of each element of the row vector i_t; w_t is the frame vector of the t-th frame image of motion video sample A.
Calculate the feature vector v_t of the t-th frame image of motion video sample A as follows:
$v_t = \frac{1}{t}\sum_{\tau=1}^{t} w_\tau$,
where $\sum_{\tau=1}^{t} w_\tau$ denotes the summation of the frame vectors of the 1st to the t-th frame images of motion video sample A.
Calculate the score B_t of the t-th frame image I_t of motion video sample A as:
B_t = u^T · v_t,
where u is a vector of dimension 230400; u^T denotes the transpose of the vector u, and u^T · v_t denotes the dot product of the transposed vector u and the feature vector v_t.
The value of u is computed so that frame images appearing later in the motion video sample receive higher scores, i.e. the larger t is, the higher the score B_t. The vector u can be computed with RankSVM as follows:
$u^* = \underset{u}{\arg\min}\; E(u)$,
$E(u) = \frac{\lambda}{2}\|u\|^2 + \frac{2}{T(T-1)}\sum_{i>j}\max\{0,\, 1 - B_i + B_j\}$,
where $u^*$ denotes the u that minimizes the value of E(u), λ is a constant, and $\|u\|^2$ denotes the sum of the squares of the elements of the vector u; B_i and B_j denote the scores of the i-th and j-th frame images of motion video sample A, respectively, and max{0, 1 - B_i + B_j} means taking the larger of 0 and 1 - B_i + B_j.
After computing the vector u with RankSVM, rearrange u into an image of the same size as I_t to obtain
$u' \in \mathbb{R}^{240 \times 320 \times 3}$;
u' is the dynamic image of motion video sample A. Fig. 2 shows an example of the resulting dynamic image.
2. Input the dynamic image of the motion video sample into the feature extractor to extract the feature vectors of the dynamic image. The feature extractor consists of a series of convolutional layers and pooling layers. As shown in Fig. 3, it is composed of the first two modules of ResNeXt-50, namely convolution module 1 and convolution module 2.
Convolution module 1 contains a convolutional layer with 64 convolution kernels, each of size 7 × 7. Convolution module 2 comprises a max pooling layer and three grouped convolution modules; the pooling kernel of the max pooling layer is 3 × 3. Grouped convolution module 1 is shown in Fig. 4. Its first layer is a convolutional layer, the second layer is a grouped convolutional layer, the third layer is a convolutional layer, and the fourth layer is a residual addition layer. The first convolutional layer has 128 convolution kernels, each of size 1 × 1. The second, grouped, convolutional layer has 128 convolution kernels, each of size 3 × 3; it divides its input feature map of size W_1 × H_1 × 128 into 32 groups of size W_1 × H_1 × 4 by channel, divides the 128 convolution kernels into 32 groups of 4 kernels each, convolves each group of the feature map with the corresponding group of kernels, and finally concatenates the per-group convolution results by channel to obtain the output of the grouped convolutional layer. The third convolutional layer has 256 convolution kernels, each of size 1 × 1. The fourth, residual addition, layer passes the input of the first convolutional layer through a residual convolutional layer with 256 convolution kernels of size 1 × 1 and adds its output to the output of the third convolutional layer; the sum is the output of the residual addition layer and of the first grouped convolution module. Grouped convolution modules 2 and 3 are similar to grouped convolution module 1, as shown in Fig. 5; the only difference is that their fourth, residual addition, layer directly adds the input of the first convolutional layer to the output of the third convolutional layer, with no residual convolutional layer.
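As an illustration, grouped convolution module 1 described above can be sketched in PyTorch as follows; the placement of normalization and activation functions is an assumption, since the text does not specify it.

```python
import torch
import torch.nn as nn

class GroupedConvModule(nn.Module):
    """Sketch of grouped convolution module 1: 1x1 conv (128 kernels), 3x3 grouped
    conv (128 kernels in 32 groups), 1x1 conv (256 kernels), and a residual
    addition whose shortcut is a 1x1 conv with 256 kernels."""
    def __init__(self, in_ch=64, use_shortcut_conv=True):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, 128, kernel_size=1, bias=False)
        self.gconv = nn.Conv2d(128, 128, kernel_size=3, padding=1,
                               groups=32, bias=False)   # 32 groups of 4 kernels
        self.conv3 = nn.Conv2d(128, 256, kernel_size=1, bias=False)
        # grouped convolution modules 2 and 3 add the input directly (no residual conv layer)
        self.shortcut = (nn.Conv2d(in_ch, 256, kernel_size=1, bias=False)
                         if use_shortcut_conv else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.relu(self.gconv(out))
        out = self.conv3(out)
        return self.relu(out + self.shortcut(x))        # residual addition layer
```

Grouped convolution modules 2 and 3 would then be instantiated as GroupedConvModule(in_ch=256, use_shortcut_conv=False).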
The feature map output by the feature extractor is
$F \in \mathbb{R}^{30 \times 40 \times 256}$;
the height, width and number of channels of the feature map are 30, 40 and 256, respectively. The feature map F has 1200 pixel points in total, and the feature vector x_y of each pixel point has dimension 256, i.e. the number of channels of the feature map F, with y = 1, 2, ..., 1200. The feature vectors of the dynamic image are represented by the set of feature vectors X = {x_y | y = 1, 2, ..., 1200}.
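As an illustration, a comparable feature extractor can be assembled from torchvision's ResNeXt-50. Note that the stock stem plus first residual stage downsamples by 4 and outputs 256 channels, so reaching the 30 × 40 map stated above would require one additional 2× downsampling; the exact stride configuration is an assumption.

```python
import torch
import torch.nn as nn
import torchvision

def build_feature_extractor():
    """Feature extractor sketch: the stem (convolution module 1) and first
    residual stage (convolution module 2) of ResNeXt-50 from torchvision."""
    backbone = torchvision.models.resnext50_32x4d(weights=None)
    return nn.Sequential(
        backbone.conv1, backbone.bn1, backbone.relu,  # 7x7 conv with 64 kernels
        backbone.maxpool,                             # 3x3 max pooling
        backbone.layer1,                              # three grouped conv blocks, 256 channels
    )

# usage: a dynamic image tensor of shape (1, 3, 240, 320)
# feats = build_feature_extractor()(torch.randn(1, 3, 240, 320))
# feats has 256 channels; flatten its spatial positions to obtain the feature vectors x_y
```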
3. A feature center group is constructed, containing 128 feature centers in total, and each feature center corresponds to one scaling coefficient. The initial values of each feature center and its scaling coefficient are calculated as follows:
Extract the feature vectors of the dynamic images of all training motion video samples and cluster all the feature vectors; the number of clusters equals the number of feature centers, i.e. 128, each cluster has a cluster center, and the value of the cluster center of the first cluster is used as the initial value of the first feature center. Let the set of all feature vectors in the first cluster be E_1, which contains 300 vectors:
E_1 = {e_1, e_2, ..., e_300}.
Compute the Euclidean distance d_{q,τ} between vectors:
$d_{q,\tau} = \sqrt{\sum_{d} ([e_q]_d - [e_\tau]_d)^2}$,
where [e_q]_d denotes the d-th dimension of the vector e_q, q ∈ [1, 299], τ ∈ [q + 1, 300]. The initial value of the scaling coefficient σ_1 of the first feature center is the mean of these pairwise distances:
$\sigma_1 = \frac{2}{300 \times 299} \sum_{q=1}^{299} \sum_{\tau=q+1}^{300} d_{q,\tau}$.
The initial values of the 128 feature centers and of the corresponding scaling coefficients are all obtained in the above manner.
4. For a feature vector x_y of the dynamic image, compute its distance to the k-th feature center c_k and use it as the output of x_y at the k-th feature center c_k; the calculation formula is:
W_k(x_y) = exp(-||x_y - c_k||^2 / σ_k).
The output obtained by inputting the feature vector x_y into the k-th feature center is normalized:
$\bar{W}_k(x_y) = \frac{W_k(x_y)}{\sum_{k'=1}^{128} W_{k'}(x_y)}$.
5. All feature vectors of the dynamic image of the motion video sample are input into every feature center of the feature center group, and all outputs at each feature center are accumulated. The accumulated output h_k of the k-th feature center is computed as:
$h_k = \sum_{y=1}^{1200} \bar{W}_k(x_y)$.
The accumulated values of all feature centers are concatenated to obtain the histogram representation H of the dynamic image:
H = (h_1, h_2, ..., h_128).
The feature center group and the accumulation layer that accumulates its outputs form the feature soft quantizer. The input of the feature soft quantizer is the feature vectors of the dynamic image of each motion video sample, and its output is the histogram representation of the dynamic image.
6. The histogram representation H of the dynamic image of the motion video sample is input into a multilayer perceptron to form the feature quantization network, as shown in Fig. 6. The feature quantization network comprises the feature extractor, the feature soft quantizer and the multilayer perceptron.
The multilayer perceptron comprises an input layer, a hidden layer and an output layer. The input layer is connected to the output of the histogram concatenation layer, and the output Input_1 of the input layer equals the output H of the histogram concatenation layer, i.e. Input_1 = H; the input layer has 128 neurons in total. The hidden layer has 64 neurons and is fully connected to all output units of the input layer. The output layer of the multilayer perceptron has 10 neurons, each representing one action category. Denote the weights between the input layer and the hidden layer by
$U_1 \in \mathbb{R}^{128 \times 64}$,
and the weights between the hidden layer and the output layer by
$V_1 \in \mathbb{R}^{64 \times 10}$.
The output Q_1 of the hidden-layer neurons is computed as:
$Q_1 = \phi_{elu}(Input_1 \cdot U_1 + \beta_1)$,
where φ_elu is the ELU activation function and $\beta_1 \in \mathbb{R}^{64}$ is the bias vector of the hidden layer.
The output O_1 of the output layer of the multilayer perceptron is:
$O_1 = \phi_{softmax}(Q_1 \cdot V_1 + \gamma_1)$,
where φ_softmax is the softmax activation function and $\gamma_1 \in \mathbb{R}^{10}$ is the bias vector of the output layer.
The loss function L_1 of the feature quantization network is:
$L_1 = -\sum_{g=1}^{G} \sum_{c=1}^{10} Y_c^g \log [O_1^g]_c$,
where $O_1^g$ is the output vector of the multilayer perceptron for the g-th sample and $Y^g = (Y_1^g, \ldots, Y_{10}^g)$ is the desired output vector for the g-th sample, defined in terms of the label l_g as:
$Y_c^g = \begin{cases} 1, & c = l_g \\ 0, & c \neq l_g \end{cases}$,
where l_g is the label value of the g-th sample.
7. The dynamic images of all training samples are input into the feature quantization network, and the feature quantization network is trained to convergence. The dynamic image of each category's training action video samples is then input into the feature extractor and the feature soft quantizer of the trained feature quantization network to obtain the histogram representations.
For each action category, the a-th training action video sample has the histogram representation
$H^a = (h_1^a, h_2^a, \ldots, h_{128}^a)$.
Each dimension of the histogram representation corresponds to one feature center, and its value is the response value of that training motion video sample to that feature center. For each action category, the covariance between feature centers within that category is computed.
Taking the first action category as an example, this category has 150 training samples in total. The covariance Cov(1,2) of the 1st feature center and the 2nd feature center in this category is computed as:
$\mathrm{Cov}(1,2) = \frac{1}{150} \sum_{a=1}^{150} (h_1^a - \bar{h}_1)(h_2^a - \bar{h}_2)$,
where $h_1^a$ is the response value of the a-th training motion video sample of the first action category to the 1st feature center, and $\bar{h}_1$ is the average of the response values of all training motion video samples of the first action category to the 1st feature center, computed as:
$\bar{h}_1 = \frac{1}{150} \sum_{a=1}^{150} h_1^a$.
The covariances between all feature centers of the first action category can be calculated in the same way, giving 128 × 127 / 2 = 8128 covariance values in total. The 8 largest covariances are selected, each corresponding to a group of feature centers. The larger the covariance, the higher the probability that the features represented by that group of feature centers co-occur in the category. Each group of feature centers represents image features that co-occur in the dynamic images of that category of motion video samples. For the first action category, the 8 largest covariances are: Cov(1,10), Cov(8,35), Cov(12,23), Cov(16,79), Cov(20,95), Cov(45,64), Cov(85,112) and Cov(97,121); the corresponding feature center groups are (c_1, c_10), (c_8, c_35), (c_12, c_23), (c_16, c_79), (c_20, c_95), (c_45, c_64), (c_85, c_112) and (c_97, c_121).
For the 2nd to 10th action categories, the covariances between feature centers are calculated in the same way, and the feature center groups corresponding to the 8 largest covariances are found. If a feature center group found for a later-ranked action category duplicates a feature center group found for an earlier-ranked category, the duplicated group is not counted among the groups found for that category, and the feature center group corresponding to the next largest covariance is selected instead. Taking the 2nd action category as an example, its 8 largest covariances are: Cov(2,7), Cov(10,22), Cov(18,28), Cov(22,83), Cov(39,97), Cov(45,64), Cov(79,108) and Cov(98,125), with corresponding feature center groups (c_2, c_7), (c_10, c_22), (c_18, c_28), (c_22, c_83), (c_39, c_97), (c_45, c_64), (c_79, c_108) and (c_98, c_125). The feature center group (c_45, c_64) duplicates a group found for the 1st action category, so the feature center group (c_67, c_99) corresponding to the 9th largest covariance Cov(67,99) is taken instead. Finally, for the 2nd action category, the 8 feature center groups found are: (c_2, c_7), (c_10, c_22), (c_18, c_28), (c_22, c_83), (c_39, c_97), (c_67, c_99), (c_79, c_108) and (c_98, c_125). In total, the 10 action categories yield 80 feature center groups.
8. Based on the 80 feature center groups found, a new layer is constructed after the feature soft quantizer, called the image feature co-occurrence layer. This layer has 80 neurons in total. Each neuron corresponds to one of the found feature center groups, its value is the product of the response values of that group of feature centers in the histogram of the motion video sample, and it is called a co-occurrence image feature neuron. The output of the image feature co-occurrence layer is denoted S = (s_1, s_2, ..., s_80),
where the output s_b of the b-th co-occurrence image feature neuron is computed as:
$s_b = h_{b_1} \cdot h_{b_2}$,
where $h_{b_1}$ and $h_{b_2}$ are the response values of the motion video sample to the two feature centers corresponding to the b-th co-occurrence image feature neuron.
9. The outputs of the image feature co-occurrence layer are input into a multilayer perceptron to form the action recognition network based on co-occurrence image features, as shown in Fig. 7. This network comprises the feature extractor, the feature soft quantizer, the image feature co-occurrence layer and the multilayer perceptron.
The multilayer perceptron comprises an input layer, a hidden layer and an output layer. The input layer is connected to the output of the image feature co-occurrence layer, and the output Input_2 of the input layer equals the output S of the image feature co-occurrence layer, i.e. Input_2 = S; the input layer has 80 neurons in total. The hidden layer has 64 neurons and is fully connected to all output units of the input layer. The output layer of the multilayer perceptron has 10 neurons, each representing one action category. Denote the weights between the input layer and the hidden layer by
$U_2 \in \mathbb{R}^{80 \times 64}$,
and the weights between the hidden layer and the output layer by
$V_2 \in \mathbb{R}^{64 \times 10}$.
The output Q_2 of the hidden-layer neurons is computed as:
$Q_2 = \phi_{elu}(Input_2 \cdot U_2 + \beta_2)$,
where φ_elu is the ELU activation function and $\beta_2 \in \mathbb{R}^{64}$ is the bias vector of the hidden layer.
The output O_2 of the output layer of the multilayer perceptron is:
$O_2 = \phi_{softmax}(Q_2 \cdot V_2 + \gamma_2)$,
where φ_softmax is the softmax activation function and $\gamma_2 \in \mathbb{R}^{10}$ is the bias vector of the output layer.
The loss function L_2 of the action recognition network based on co-occurrence image features is:
$L_2 = -\sum_{g=1}^{G} \sum_{c=1}^{10} Y_c^g \log [O_2^g]_c$,
where $O_2^g$ is the output vector of the multilayer perceptron for the g-th sample.
10. The dynamic images of all training action video samples are input into the action recognition network based on co-occurrence image features, and this network is trained to convergence.
The dynamic image of each action category's training action video samples is input into the trained action recognition network based on co-occurrence image features to obtain the output S of the image feature co-occurrence layer.
For each action category, the output of the a-th training action video sample at the image feature co-occurrence layer is
$S^a = (s_1^a, s_2^a, \ldots, s_{80}^a)$.
Each dimension of this output corresponds to one co-occurrence image feature neuron in the image feature co-occurrence layer. The covariance between co-occurrence image feature neurons within each action category is computed.
Taking the first action category as an example, this category has 150 training samples in total. The covariance Cov(1,2) of the 1st co-occurrence image feature neuron and the 2nd co-occurrence image feature neuron in this category is computed as:
$\mathrm{Cov}(1,2) = \frac{1}{150} \sum_{a=1}^{150} (s_1^a - \bar{s}_1)(s_2^a - \bar{s}_2)$,
where $s_1^a$ is the output of the a-th training action video sample of the first action category at the 1st co-occurrence image feature neuron, and $\bar{s}_1$ is the average of the outputs of all training action video samples of this category at the 1st co-occurrence image feature neuron, computed as:
$\bar{s}_1 = \frac{1}{150} \sum_{a=1}^{150} s_1^a$.
The covariances between all co-occurrence image feature neurons of the first action category can be calculated in the same way, giving 80 × 79 / 2 = 3160 covariance values in total. The 4 largest covariances are selected, each corresponding to a group of co-occurrence image feature neurons. The larger the covariance, the higher the probability that that group of co-occurrence image feature neurons co-occurs in the category. Each group of co-occurrence image feature neurons represents semantic features that co-occur in the dynamic images of that category of motion video samples. For the first action category, the 4 largest covariances are: Cov(2,50), Cov(5,32), Cov(17,28) and Cov(45,78); the corresponding co-occurrence image feature neuron groups are (s_2, s_50), (s_5, s_32), (s_17, s_28) and (s_45, s_78).
For the 2nd to 10th action categories, the covariances between co-occurrence image feature neurons are calculated in the same way, and the co-occurrence image feature neuron groups corresponding to the 4 largest covariances are found. If a co-occurrence image feature neuron group found for a later-ranked action category duplicates a group found for an earlier-ranked category, the duplicated group is not counted among the groups found for that category, and the group corresponding to the next largest covariance is selected instead. Finally, the 10 action categories yield 40 groups of co-occurrence image feature neurons.
11. Based on the 40 groups of co-occurrence image feature neurons found, a new layer is constructed after the image feature co-occurrence layer, called the semantic feature co-occurrence layer. This layer has 40 neurons in total. Each neuron corresponds to one of the found groups of co-occurrence image feature neurons, its value is the product of the outputs of the motion video sample at that group of co-occurrence image feature neurons, and it is called a co-occurrence semantic feature neuron. The output of the semantic feature co-occurrence layer is M = (m_1, m_2, ..., m_40),
where the output m_χ of the χ-th co-occurrence semantic feature neuron is computed as:
$m_\chi = s_{\chi_1} \cdot s_{\chi_2}$,
where $s_{\chi_1}$ and $s_{\chi_2}$ are the output values of the motion video sample at the two co-occurrence image feature neurons corresponding to the χ-th co-occurrence semantic feature neuron.
12. The output of the semantic feature co-occurrence layer is input into a multilayer perceptron to form the action recognition network based on hierarchical co-occurrence features, as shown in Fig. 8. This network comprises the feature extractor, the feature soft quantizer, the image feature co-occurrence layer, the semantic feature co-occurrence layer and the multilayer perceptron.
The multilayer perceptron comprises an input layer, a hidden layer and an output layer. The input layer is connected to the output of the semantic feature co-occurrence layer, and the output Input_3 of the input layer equals the output M of the semantic feature co-occurrence layer, i.e. Input_3 = M; the input layer has 40 neurons in total. The hidden layer has 64 neurons and is fully connected to all output units of the input layer. The output layer of the multilayer perceptron has 10 neurons, each representing one action category. Denote the weights between the input layer and the hidden layer by
$U_3 \in \mathbb{R}^{40 \times 64}$,
and the weights between the hidden layer and the output layer by
$V_3 \in \mathbb{R}^{64 \times 10}$.
The output Q_3 of the hidden-layer neurons is computed as:
$Q_3 = \phi_{elu}(Input_3 \cdot U_3 + \beta_3)$,
where φ_elu is the ELU activation function and $\beta_3 \in \mathbb{R}^{64}$ is the bias vector of the hidden layer.
The output O_3 of the output layer of the multilayer perceptron is:
$O_3 = \phi_{softmax}(Q_3 \cdot V_3 + \gamma_3)$,
where φ_softmax is the softmax activation function and $\gamma_3 \in \mathbb{R}^{10}$ is the bias vector of the output layer.
The loss function L_3 of the action recognition network based on hierarchical co-occurrence features is:
$L_3 = -\sum_{g=1}^{G} \sum_{c=1}^{10} Y_c^g \log [O_3^g]_c$,
where $O_3^g$ is the output vector of the multilayer perceptron for the g-th sample.
The input of the action recognition network based on hierarchical co-occurrence features is the dynamic image of an action video sample, and its output is the probability that the current action video sample belongs to each action category.
13. The action recognition network based on hierarchical co-occurrence features is trained to convergence. The dynamic image of each test action video sample is calculated and input into the trained action recognition network based on hierarchical co-occurrence features to obtain the predicted probabilities of the current test action video sample belonging to each action category; the action category with the largest probability is the finally predicted action category of the current test action video sample, thereby achieving action recognition.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that various changes in the embodiments and/or modifications of the invention can be made, and equivalents and modifications of some features of the invention can be made without departing from the spirit and scope of the invention.

Claims (12)

1. A motion recognition method is characterized in that: the method comprises the following steps:
step one, calculating a dynamic image of a motion video sample;
inputting the dynamic image of the motion video sample into a feature extractor to obtain a feature vector in the dynamic image;
step three, constructing a feature center group; inputting all the feature vectors of the dynamic images of the action video samples into the feature centers, accumulating all the outputs on each feature center to obtain histogram expression;
inputting the histogram expression into a multilayer perceptron to form a characteristic quantization network;
inputting the dynamic images of all the training video samples into a characteristic quantization network, training the characteristic quantization network until convergence, and finding out a co-occurrence characteristic center group of each action type;
constructing an image characteristic co-occurrence layer;
step seven, inputting the output of the image feature co-occurrence layer into a multilayer perceptron, and constructing a motion recognition network based on the co-occurrence image features;
step eight, inputting dynamic images of all training action video samples into an action recognition network based on co-occurrence image characteristics, training the action recognition network based on the co-occurrence image characteristics to be convergent, and finding out co-occurrence image characteristic neuron groups of each action category;
step nine, constructing a semantic feature co-occurrence layer;
step ten, inputting the output of the semantic feature co-occurrence layer into a multilayer perceptron, and constructing an action recognition network based on hierarchical co-occurrence features;
step eleven, training the action recognition network based on hierarchical co-occurrence features until convergence, calculating the dynamic image of the test action video sample, and inputting it into the trained action recognition network based on hierarchical co-occurrence features to realize action recognition.
2. A motion recognition method according to claim 1, characterized in that: in the first step, the method for calculating the dynamic image of the motion video sample comprises the following steps:
each motion video sample consists of all frames in the video; for any motion video sample A:
A = {I_t | t ∈ [1, T]},
where t denotes the time index and T is the total number of frames of motion video sample A; I_t ∈ ℝ^(R×C×3) is the matrix representation of the t-th frame image of motion video sample A, with R, C and 3 respectively the number of rows, columns and channels of that matrix, and ℝ denoting the set of real numbers; each element of I_t is a pixel value of the t-th frame image;
for any motion video sample A, I_t is first vectorized, i.e., the row vectors of all three channels of I_t are concatenated into a new row vector i_t;
the arithmetic square root of each element of the row vector i_t is taken to obtain a new vector w_t, namely:
w_t = sqrt(i_t),
where sqrt(i_t) denotes taking the arithmetic square root of each element of i_t; w_t is called the frame vector of the t-th frame image of motion video sample A;
the feature vector v_t of the t-th frame image of motion video sample A is calculated as:
v_t = Σ_{τ=1}^{t} w_τ,
i.e., the summation of the frame vectors of the 1st to the t-th frame images of motion video sample A;
the score B_t of the t-th frame image I_t of motion video sample A is calculated as:
B_t = u^T · v_t,
where u is a vector of dimension f, with f = R × C × 3; u^T denotes the transpose of the vector u, and u^T · v_t denotes the dot product of the transposed vector u and the feature vector v_t;
the value of u is calculated so that frame images appearing later in the motion video sample receive higher scores, i.e., the larger t is, the higher the score B_t; u can be calculated with RankSVM as follows:
u* = argmin_u E(u),
E(u) = λ·||u||² + Σ_{i>j} max{0, 1 − B_i + B_j},
where u* denotes the u that minimizes the value of E(u), λ is a constant, ||u||² is the sum of the squares of the elements of the vector u, B_i and B_j respectively denote the score of the i-th frame image and the score of the j-th frame image of motion video sample A, max{0, 1 − B_i + B_j} selects the larger of 0 and 1 − B_i + B_j, and the summation runs over all frame pairs with i > j;
after the vector u is calculated with RankSVM, u is rearranged into an image u' of the same size as I_t, and u' is taken as the dynamic image of motion video sample A (an illustrative code sketch follows this claim).
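As an illustration of claim 2, the sketch below computes a dynamic image from a list of frames. The patent names RankSVM for solving for u; since no particular solver is prescribed beyond the hinge objective, this sketch uses plain normalised sub-gradient descent on E(u), which is only one possible choice, and the step size and iteration count are arbitrary assumptions.

```python
import numpy as np

def dynamic_image(frames, lam=1e-3, step=1e-2, iters=200):
    """Sketch of the dynamic-image computation in claim 2.

    `frames` is a list of H x W x 3 arrays. The hinge objective
    E(u) = lam*||u||^2 + sum_{i>j} max(0, 1 - B_i + B_j) is minimised here by
    normalised sub-gradient descent instead of an off-the-shelf RankSVM solver.
    """
    T = len(frames)
    H, W, _ = frames[0].shape
    # i_t: row vectors of the three channels concatenated; w_t: element-wise sqrt
    w = np.sqrt(np.stack([f.astype(np.float64).transpose(2, 0, 1).reshape(-1)
                          for f in frames]))
    v = np.cumsum(w, axis=0)            # v_t = sum of frame vectors w_1 .. w_t

    u = np.zeros(H * W * 3)
    for _ in range(iters):
        B = v @ u                       # scores B_t = u^T . v_t
        grad = 2.0 * lam * u
        for i in range(1, T):           # ranking constraint: B_i > B_j for i > j
            for j in range(i):
                if 1.0 - B[i] + B[j] > 0.0:   # active hinge term
                    grad += v[j] - v[i]
        u -= step * grad / (np.linalg.norm(grad) + 1e-12)
    # rearrange u into an image of the same size as I_t -> dynamic image u'
    return u.reshape(3, H, W).transpose(1, 2, 0)
```

The double loop over frame pairs could be vectorised, but it is kept explicit here to mirror the pairwise ranking constraint B_i > B_j for i > j.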
3. A motion recognition method according to claim 2, characterized in that: in the second step, the feature extractor is composed of a series of convolution layers and a pooling layer; the dynamic image of each motion video sample is input into the feature extractor, which outputs a feature map F ∈ ℝ^(K1×K2×D), where K1, K2 and D respectively denote the height, width and number of channels of the output feature map; the feature map F has K1×K2 pixel points in total, and the feature vector x_y of each pixel point has dimension D, i.e., the number of channels of the feature map F, with y = 1, 2, ..., K1×K2; the feature vectors in the dynamic image are represented by the set X = {x_y | y = 1, 2, ..., K1×K2} (a code sketch follows this claim).
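A minimal stand-in for the feature extractor of claim 3 is sketched below in PyTorch; the claim only states that the extractor consists of convolution and pooling layers, so the channel counts and kernel sizes here are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative feature extractor: convolution layers plus pooling.
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
)

def feature_vector_set(dynamic_image):
    """dynamic_image: (3, R, C) float tensor; returns the set X as a
    (K1*K2, D) tensor with one feature vector x_y per feature-map pixel."""
    F = feature_extractor(dynamic_image.unsqueeze(0))    # (1, D, K1, K2)
    _, D, K1, K2 = F.shape
    return F.squeeze(0).permute(1, 2, 0).reshape(K1 * K2, D)
```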
4. A motion recognition method according to claim 3, characterized in that: in the third step, the feature center group contains N_K feature centers in total, and each feature center corresponds to a scaling coefficient; the initial value of each feature center and of its scaling coefficient is obtained as follows:
the feature vectors in the dynamic images of all training motion video samples are calculated and all feature vectors are clustered, the number of clusters being the same as the number of feature centers, namely N_K; each cluster has a cluster center, and the value of the cluster center obtained by clustering is used as the initial value of the corresponding feature center; for the k-th cluster, the set of all feature vectors in the cluster is denoted E_k and contains N_k feature vectors:
E_k = {e_1, e_2, …, e_{N_k}},
the Euclidean distance d_{q,τ} between feature vectors in the cluster is calculated:
d_{q,τ} = sqrt( Σ_d ([e_q]_d − [e_τ]_d)² ),
where [e_q]_d denotes the d-th dimension of the feature vector e_q, q ∈ [1, N_k−1], τ ∈ [q+1, N_k]; the initial value of the scaling coefficient σ_k of the k-th feature center is:
σ_k = 2/(N_k(N_k−1)) · Σ_{q=1}^{N_k−1} Σ_{τ=q+1}^{N_k} d_{q,τ}, i.e., the mean of all pairwise distances within the k-th cluster;
for a feature vector x_y of the dynamic image, the distance between it and the k-th feature center c_k is calculated as its output at the k-th feature center c_k, the distance being defined as:
W_k(x_y) = exp(−||x_y − c_k||² / σ_k),
and the output obtained by inputting the feature vector x_y into the k-th feature center is normalized over all feature centers:
Ŵ_k(x_y) = W_k(x_y) / Σ_{k'=1}^{N_K} W_{k'}(x_y);
all feature vectors of the dynamic image of each motion video sample are input into every feature center of the feature center group, and all outputs at each feature center are accumulated; the accumulated output h_k of the k-th feature center is calculated as:
h_k = Σ_{y=1}^{K1×K2} Ŵ_k(x_y),
and the accumulated values of all feature centers are connected together to obtain the histogram expression H of the dynamic image:
H = [h_1, h_2, …, h_{N_K}];
the feature center group and the accumulation layer that accumulates the outputs of the feature center group form a feature soft quantizer; the input of the feature soft quantizer is the feature vectors of the dynamic image of each motion video sample, and the output is the histogram expression of the dynamic image (a code sketch follows this claim).
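The feature soft quantizer of claim 4 can be sketched as follows. The clustering-based initialisation uses scikit-learn's KMeans; initialising σ_k as the mean pairwise distance inside cluster k is one plausible reading of the claim, and the normalisation divides each response by the sum of responses over all feature centers.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_centers_and_sigmas(all_feature_vectors, num_centers):
    """all_feature_vectors: (N, D) array of feature vectors from all training
    dynamic images. Cluster centers initialise the feature centers c_k;
    sigma_k is initialised as the mean pairwise distance inside cluster k
    (an assumed reading of the claim)."""
    km = KMeans(n_clusters=num_centers, n_init=10).fit(all_feature_vectors)
    centers, sigmas = km.cluster_centers_, np.ones(num_centers)
    for k in range(num_centers):
        members = all_feature_vectors[km.labels_ == k]
        if len(members) > 1:
            diff = members[:, None, :] - members[None, :, :]
            dists = np.sqrt((diff ** 2).sum(-1))
            iu = np.triu_indices(len(members), k=1)
            sigmas[k] = dists[iu].mean()
    return centers, sigmas

def histogram_expression(X, centers, sigmas):
    """X: (K1*K2, D) feature vectors of one dynamic image -> histogram H."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # ||x_y - c_k||^2
    W = np.exp(-d2 / sigmas[None, :])                          # W_k(x_y)
    W = W / W.sum(axis=1, keepdims=True)                       # normalise over centers
    return W.sum(axis=0)                                       # accumulate -> h_k
```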
5. An action recognition method according to claim 4, characterized in that: in the fourth step, the feature quantization network comprises a feature extractor, a feature soft quantizer and a multilayer perceptron;
the multilayer perceptron comprises an input layer, a hidden layer and an output layer; the input layer is connected to the output of the histogram connection layer, and the output Input_1 of the input layer is the same as the output H of the histogram connection layer, i.e., Input_1 = H; the input layer has r_1 = N_K neurons in total; the hidden layer has z_1 neurons, which are fully connected to all output units of the input layer; the output layer of the multilayer perceptron has o neurons, each representing one action category; the weight matrix between the input layer and the hidden layer is denoted W_h1, and the weight matrix between the hidden layer and the output layer is denoted W_o1;
the output Q_1 of the hidden layer neurons is calculated as:
Q_1 = φ_elu(W_h1·Input_1 + b_h1),
where φ_elu is the elu activation function and b_h1 is the bias vector of the hidden layer;
the output O_1 of the output layer of the multilayer perceptron is:
O_1 = φ_softmax(W_o1·Q_1 + b_o1),
where φ_softmax is the softmax activation function and b_o1 is the bias vector of the output layer;
the loss function L_1 of the feature quantization network is:
L_1 = −Σ_{g=1}^{G} l^g·log(O_1^g),
where O_1^g is the output vector of the multilayer perceptron for the g-th sample and l^g is the expected output vector of the g-th sample, defined as the one-hot vector whose l_g-th component is 1 and whose remaining components are 0; G is the total number of samples and l_g is the label value of the g-th sample (a code sketch follows this claim).
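A sketch of the multilayer perceptron and loss of claim 5 in PyTorch is given below; the hidden width z_1 is a free hyper-parameter, so its value is left to the caller. The perceptrons of claims 8 and 11 share the same structure, with input widths K_1×o and K_2×o instead of N_K.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantizationMLP(nn.Module):
    """Input layer -> hidden layer (elu) -> output layer (softmax), as in claim 5."""
    def __init__(self, num_centers, hidden_width, num_classes):
        super().__init__()
        self.hidden = nn.Linear(num_centers, hidden_width)  # W_h1 and b_h1
        self.out = nn.Linear(hidden_width, num_classes)     # W_o1 and b_o1

    def forward(self, H):                         # H: (batch, N_K) histogram expressions
        Q1 = F.elu(self.hidden(H))                # Q_1 = elu(W_h1 . H + b_h1)
        return torch.softmax(self.out(Q1), dim=1) # O_1: per-class probabilities

def quantization_loss(O1, labels):
    """L_1 = -sum_g l^g . log(O_1^g), with l^g the one-hot vector of label l_g.
    `labels` is a 1-D tensor of integer class labels."""
    one_hot = F.one_hot(labels, num_classes=O1.shape[1]).float()
    return -(one_hot * torch.log(O1 + 1e-12)).sum()
```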
6. A motion recognition method according to claim 5, wherein: in the fifth step, the method for finding out the co-occurrence feature center group of each action category comprises the following steps:
inputting the dynamic image of the training action video sample of each category into a feature extractor and a feature soft quantizer in a trained feature quantization network to obtain histogram expression;
for each action category, the histogram expression of the a-th training action video sample is H^a = [h_1^a, h_2^a, …, h_{N_K}^a]; each dimension of the histogram expression corresponds to one feature center, and its value is the response of the training action video sample to that feature center; for each action category, the covariance between the feature centers within that category is calculated;
for any action category, the covariance Cov(k_1, k_2) between the k_1-th feature center and the k_2-th feature center within that category is calculated as:
Cov(k_1, k_2) = (1/N_θ) · Σ_{a=1}^{N_θ} (h_{k_1}^a − h̄_{k_1}) · (h_{k_2}^a − h̄_{k_2}),
where h_{k_1}^a denotes the response of the a-th training action video sample of the category to the k_1-th feature center, h̄_{k_1} denotes the average response of all training action video samples of the category to the k_1-th feature center, N_θ is the total number of training action video samples of the category, k_1 ∈ [1, N_K−1] and k_2 ∈ [k_1+1, N_K];
for each action category, the covariances between all feature centers are calculated in the above manner; then, starting from the first action category, the calculated covariance values are sorted and the K_1 largest covariances are selected, each covariance corresponding to a group of feature centers; the larger the covariance, the higher the probability that the features represented by that group of feature centers co-occur in the action category; each group of feature centers represents image features that co-occur in the dynamic images of that category of motion video samples; if a feature center group found for a later action category duplicates one found for an earlier action category, the duplicated group is not counted among the groups found for the later category, and the group corresponding to the next largest covariance is selected instead; K_1 groups of feature centers are thus found for each action category, and with o action categories in total, K_1 × o groups of feature centers are found (a code sketch follows this claim).
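The selection in claim 6 amounts to ranking within-class covariances and keeping the K_1 largest feature-center pairs per class while skipping pairs already taken by earlier classes. A sketch follows; the same procedure is reused in claim 9 on the outputs of the image feature co-occurrence layer.

```python
import numpy as np

def top_cooccurrence_pairs(class_histograms, K1):
    """class_histograms: dict mapping class id -> (N_theta, N_K) array holding
    the histogram expressions of that class's training samples. Returns, per
    class, the K1 feature-center pairs with the largest within-class
    covariance, skipping pairs already chosen for earlier classes."""
    used, groups = set(), {}
    for cls in sorted(class_histograms):
        H = class_histograms[cls]
        cov = np.cov(H, rowvar=False, bias=True)    # Cov(k1, k2) with 1/N_theta
        i, j = np.triu_indices(cov.shape[0], k=1)   # all pairs k1 < k2
        order = np.argsort(cov[i, j])[::-1]         # largest covariance first
        picked = []
        for idx in order:
            pair = (int(i[idx]), int(j[idx]))
            if pair in used:
                continue                            # already used by an earlier class
            picked.append(pair)
            used.add(pair)
            if len(picked) == K1:
                break
        groups[cls] = picked
    return groups
```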
7. A motion recognition method according to claim 6, characterized in that: in the sixth step, the method for constructing the image feature co-occurrence layer comprises the following steps:
according to the found K_1 × o groups of feature centers, a new layer is constructed after the feature soft quantizer, called the image feature co-occurrence layer; this layer has K_1 × o neurons in total, each corresponding to one of the found groups of feature centers, and the value of each neuron is the product of the response values of that group of feature centers in the histogram of the motion video sample; such a neuron is called a co-occurrence image feature neuron; the output of the image feature co-occurrence layer is S = [s_1, s_2, …, s_{K_1×o}],
where the output s_b of the b-th co-occurrence image feature neuron is calculated as:
s_b = h_{b1} × h_{b2},
where h_{b1} and h_{b2} are respectively the response values of the motion video sample to the two feature centers corresponding to the b-th co-occurrence image feature neuron (a code sketch follows this claim).
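The image feature co-occurrence layer of claim 7 simply multiplies, for each selected pair of feature centers, the two corresponding histogram responses. A sketch follows; the semantic feature co-occurrence layer of claim 10 applies the same product rule to pairs of co-occurrence image feature neuron outputs.

```python
import numpy as np

def image_cooccurrence_layer(H, center_pairs):
    """H: (N_K,) histogram expression of one sample; center_pairs: the flat
    list of all K1*o selected (k1, k2) feature-center pairs. Each neuron
    outputs s_b = h_{b1} * h_{b2}."""
    return np.array([H[k1] * H[k2] for (k1, k2) in center_pairs])
```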
8. A motion recognition method according to claim 7, wherein: in the seventh step, the action recognition network based on the co-occurrence image features comprises a feature extractor, a feature soft quantizer, an image feature co-occurrence layer and a multilayer perceptron;
the multilayer perceptron comprises an input layer, a hidden layer and an output layer; the input layer is connected to the output of the image feature co-occurrence layer, and the output Input_2 of the input layer is the same as the output S of the image feature co-occurrence layer, i.e., Input_2 = S; the input layer has r_2 = K_1 × o neurons; the hidden layer has z_2 neurons, which are fully connected to all output units of the input layer; the output layer of the multilayer perceptron has o neurons, each representing one action category; the weight matrix between the input layer and the hidden layer is denoted W_h2, and the weight matrix between the hidden layer and the output layer is denoted W_o2;
the output Q_2 of the hidden layer neurons is calculated as:
Q_2 = φ_elu(W_h2·Input_2 + b_h2),
where φ_elu is the elu activation function and b_h2 is the bias vector of the hidden layer;
the output O_2 of the output layer of the multilayer perceptron is:
O_2 = φ_softmax(W_o2·Q_2 + b_o2),
where φ_softmax is the softmax activation function and b_o2 is the bias vector of the output layer;
the loss function L_2 of the action recognition network based on co-occurrence image features is:
L_2 = −Σ_{g=1}^{G} l^g·log(O_2^g),
where O_2^g is the output vector of the multilayer perceptron for the g-th sample.
9. A motion recognition method according to claim 8, wherein: in the eighth step, the method for finding the co-occurrence image feature neuron groups of each action category is as follows:
the dynamic images of the training action video samples of each action category are input into the trained action recognition network based on co-occurrence image features to obtain the output S of the image feature co-occurrence layer;
for each action category, the output of the a-th training action video sample at the image feature co-occurrence layer is S^a = [s_1^a, s_2^a, …, s_{K_1×o}^a]; each output dimension corresponds to one co-occurrence image feature neuron in the image feature co-occurrence layer; the covariance between the co-occurrence image feature neurons within that action category is calculated;
for any action category, the covariance Cov(d_1, d_2) between the d_1-th co-occurrence image feature neuron and the d_2-th co-occurrence image feature neuron within that category is calculated as:
Cov(d_1, d_2) = (1/N_θ) · Σ_{a=1}^{N_θ} (s_{d_1}^a − s̄_{d_1}) · (s_{d_2}^a − s̄_{d_2}),
where s_{d_1}^a denotes the output of the d_1-th co-occurrence image feature neuron for the a-th training action video sample of the category, s̄_{d_1} denotes the average output of the d_1-th co-occurrence image feature neuron over all training action video samples of the category, d_1 ∈ [1, K_1×o−1] and d_2 ∈ [d_1+1, K_1×o];
for each action category, the covariances between all co-occurrence image feature neurons are calculated in the above manner; then, starting from the first action category, the calculated covariance values are sorted and the K_2 largest covariances are selected, each corresponding to a group of co-occurrence image feature neurons; the larger the covariance, the higher the probability that this group of co-occurrence image feature neurons co-occurs in the category; each group of co-occurrence image feature neurons represents semantic features that co-occur in the dynamic images of that category of motion video samples; if a co-occurrence image feature neuron group found for a later action category duplicates one found for an earlier action category, the duplicated group is not counted among the groups found for the later category, and the group corresponding to the next largest covariance is selected instead; K_2 groups of co-occurrence image feature neurons are thus found for each action category, and with o action categories in total, K_2 × o groups of co-occurrence image feature neurons are found.
10. A motion recognition method according to claim 9, wherein: in the ninth step, the method for constructing the semantic feature co-occurrence layer comprises the following steps:
according to the found K_2 × o groups of co-occurrence image feature neurons, a new layer is constructed after the image feature co-occurrence layer, called the semantic feature co-occurrence layer; this layer has K_2 × o neurons in total, called co-occurrence semantic feature neurons; each neuron corresponds to one of the found groups of co-occurrence image feature neurons, and its value is the product of the outputs of the action video sample at that group of co-occurrence image feature neurons; the output of the semantic feature co-occurrence layer is M = [m_1, m_2, …, m_{K_2×o}],
where the output m_χ of the χ-th co-occurrence semantic feature neuron is calculated as:
m_χ = m_{χ1} × m_{χ2},
where m_{χ1} and m_{χ2} are respectively the output values of the two co-occurrence image feature neurons corresponding to the χ-th co-occurrence semantic feature neuron of the action video sample.
11. A motion recognition method according to claim 10, wherein: in the tenth step, the action recognition network based on hierarchical co-occurrence features comprises a feature extractor, a feature soft quantizer, an image feature co-occurrence layer, a semantic feature co-occurrence layer and a multilayer perceptron;
the multilayer perceptron comprises an input layer, a hidden layer and an output layer; the input layer is connected to the output of the semantic feature co-occurrence layer, and the output Input_3 of the input layer is the same as the output M of the semantic feature co-occurrence layer, i.e., Input_3 = M; the input layer has r_3 = K_2 × o neurons; the hidden layer has z_3 neurons, which are fully connected to all output units of the input layer; the output layer of the multilayer perceptron has o neurons, each representing one action category; the weight matrix between the input layer and the hidden layer is denoted W_h3, and the weight matrix between the hidden layer and the output layer is denoted W_o3;
the output Q_3 of the hidden layer neurons is calculated as:
Q_3 = φ_elu(W_h3·Input_3 + b_h3),
where φ_elu is the elu activation function and b_h3 is the bias vector of the hidden layer;
the output O_3 of the output layer of the multilayer perceptron is:
O_3 = φ_softmax(W_o3·Q_3 + b_o3),
where φ_softmax is the softmax activation function and b_o3 is the bias vector of the output layer;
the loss function L_3 of the action recognition network based on hierarchical co-occurrence features is:
L_3 = −Σ_{g=1}^{G} l^g·log(O_3^g),
where O_3^g is the output vector of the multilayer perceptron for the g-th sample;
the input of the action recognition network based on hierarchical co-occurrence features is the dynamic image of an action video sample, and the output is the probability that the sample belongs to each action category.
12. A motion recognition method according to claim 11, wherein: in the eleventh step, the specific method for realizing action recognition is as follows:
the dynamic image of the test action video sample is calculated and input into the trained action recognition network based on hierarchical co-occurrence features to obtain the predicted probability of each action category for the current test action video sample; the action category with the largest probability is the final predicted category of the current test action video sample, thereby realizing action recognition.
CN202110472752.8A 2021-04-29 2021-04-29 Action recognition method Active CN113221693B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110472752.8A CN113221693B (en) 2021-04-29 2021-04-29 Action recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110472752.8A CN113221693B (en) 2021-04-29 2021-04-29 Action recognition method

Publications (2)

Publication Number Publication Date
CN113221693A true CN113221693A (en) 2021-08-06
CN113221693B CN113221693B (en) 2023-07-28

Family

ID=77090049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110472752.8A Active CN113221693B (en) 2021-04-29 2021-04-29 Action recognition method

Country Status (1)

Country Link
CN (1) CN113221693B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130108177A1 (en) * 2011-11-01 2013-05-02 Google Inc. Image matching using motion manifolds
CN103605989A (en) * 2013-11-20 2014-02-26 康江科技(北京)有限责任公司 Multi-view behavior identification method based on largest-interval meaning clustering
US20170161606A1 (en) * 2015-12-06 2017-06-08 Beijing University Of Technology Clustering method based on iterations of neural networks
CN106778854A (en) * 2016-12-07 2017-05-31 西安电子科技大学 Activity recognition method based on track and convolutional neural networks feature extraction
CN107341452A (en) * 2017-06-20 2017-11-10 东北电力大学 Human bodys' response method based on quaternary number space-time convolutional neural networks
US20180373985A1 (en) * 2017-06-23 2018-12-27 Nvidia Corporation Transforming convolutional neural networks for visual sequence learning
CN109446923A (en) * 2018-10-10 2019-03-08 北京理工大学 Depth based on training characteristics fusion supervises convolutional neural networks Activity recognition method
CN110119707A (en) * 2019-05-10 2019-08-13 苏州大学 A kind of human motion recognition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAOFENG ZHAO et al.: "Discriminative pose analysis for human action recognition", 2020 IEEE 6th World Forum on Internet of Things, pages 1-6 *

Also Published As

Publication number Publication date
CN113221693B (en) 2023-07-28

Similar Documents

Publication Publication Date Title
Zheng et al. PAC-Bayesian framework based drop-path method for 2D discriminative convolutional network pruning
CN112308158B (en) Multi-source field self-adaptive model and method based on partial feature alignment
CN113221694B (en) Action recognition method
CN109063719B (en) Image classification method combining structure similarity and class information
Lai et al. Form design of product image using grey relational analysis and neural network models
CN108875787A (en) A kind of image-recognizing method and device, computer equipment and storage medium
CN109165692B (en) User character prediction device and method based on weak supervised learning
Cai et al. A novel hyperspectral image classification model using bole convolution with three-direction attention mechanism: small sample and unbalanced learning
CN107766850A (en) Based on the face identification method for combining face character information
CN110119707B (en) Human body action recognition method
CN111414461A (en) Intelligent question-answering method and system fusing knowledge base and user modeling
CN109344888A (en) A kind of image-recognizing method based on convolutional neural networks, device and equipment
CN108596264A (en) A kind of community discovery method based on deep learning
CN113159171B (en) Plant leaf image fine classification method based on counterstudy
CN112256878B (en) Rice knowledge text classification method based on deep convolution
CN114625908A (en) Text expression package emotion analysis method and system based on multi-channel attention mechanism
CN112926645B (en) Electricity stealing detection method based on edge calculation
Srigurulekha et al. Food image recognition using CNN
CN110070070B (en) Action recognition method
CN113221693B (en) Action recognition method
Sharma et al. Sm2n2: A stacked architecture for multimodal data and its application to myocardial infarction detection
Kim et al. Tweaking deep neural networks
Dong et al. A biologically inspired system for classification of natural images
Lv et al. Deep convolutional network based on interleaved fusion group
Guzzi et al. Distillation of a CNN for a high accuracy mobile face recognition system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant