CN113221693A - Action recognition method - Google Patents


Info

Publication number
CN113221693A
CN113221693A (application CN202110472752.8A)
Authority
CN
China
Prior art keywords
feature
occurrence
layer
action
image
Prior art date
Legal status (assumed; not a legal conclusion)
Granted
Application number
CN202110472752.8A
Other languages
Chinese (zh)
Other versions
CN113221693B (en)
Inventor
杨剑宇
黄瑶
Current Assignee
Suzhou University
Original Assignee
Suzhou University
Priority date
Filing date
Publication date
Application filed by Suzhou University
Priority to CN202110472752.8A
Publication of CN113221693A
Application granted
Publication of CN113221693B
Legal status: Active


Classifications

    • G06V 40/20: Recognition of biometric, human-related or animal-related patterns in image or video data; movements or behaviour, e.g. gesture recognition
    • G06F 18/214: Pattern recognition; generating training patterns; bootstrap methods, e.g. bagging or boosting
    • G06F 18/23: Pattern recognition; clustering techniques
    • G06N 3/04: Neural networks; architecture, e.g. interconnection topology
    • G06N 3/08: Neural networks; learning methods
    • G06V 20/46: Scenes; scene-specific elements in video content; extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames

Abstract

The invention provides an action recognition method, which comprises: calculating a dynamic image for each action video sample; inputting the dynamic image of the action video sample into a feature extractor to obtain the feature vectors of the dynamic image; constructing a feature center group; inputting all feature vectors into the feature centers and accumulating all outputs at each feature center to obtain a histogram representation; inputting the histogram representation into a multilayer perceptron to form a feature quantization network; training the feature quantization network to convergence and finding the co-occurrence feature center groups of each action category; constructing an image feature co-occurrence layer; constructing an action recognition network based on co-occurrence image features, training it to convergence, and finding the co-occurrence image feature neuron groups of each action category; constructing a semantic feature co-occurrence layer; and constructing an action recognition network based on hierarchical co-occurrence features, training it to convergence, calculating the dynamic image of a test action video sample, and inputting it into the trained action recognition network based on hierarchical co-occurrence features to achieve action recognition.

Description

Action recognition method
Technical Field
The invention relates to an action recognition method and belongs to the technical field of action recognition.
Background
Action recognition is an important topic in the field of computer vision and is widely applied in video surveillance, behavior analysis, human-computer interaction and other fields. With the development of social networks and the popularization of RGB devices, action recognition based on RGB videos has attracted increasing attention from researchers. Compared with skeleton-based action recognition, action recognition based on RGB videos acquires data more easily and is more reliable.
Most existing methods extract the deep features contained in a video by designing deep convolutional neural networks with different structures and then perform action recognition. These methods typically ignore the interpretability of the extracted features and feed stacked static frame images into a complex convolutional neural network, which places high demands on the performance of the computer equipment.
Therefore, an action recognition method is proposed to address these problems.
Disclosure of Invention
The invention is provided to solve the above problems in the prior art. The technical solution is as follows:
An action recognition method comprising the following steps:
step one, calculating the dynamic image of an action video sample;
step two, inputting the dynamic image of the action video sample into a feature extractor to obtain the feature vectors of the dynamic image;
step three, constructing a feature center group, inputting all feature vectors of the dynamic image of the action video sample into the feature centers, and accumulating all outputs at each feature center to obtain a histogram representation;
step four, inputting the histogram representation into a multilayer perceptron to form a feature quantization network;
step five, inputting the dynamic images of all training action video samples into the feature quantization network, training the feature quantization network to convergence, and finding the co-occurrence feature center groups of each action category;
step six, constructing an image feature co-occurrence layer;
step seven, inputting the output of the image feature co-occurrence layer into a multilayer perceptron to construct an action recognition network based on co-occurrence image features;
step eight, inputting the dynamic images of all training action video samples into the action recognition network based on co-occurrence image features, training it to convergence, and finding the co-occurrence image feature neuron groups of each action category;
step nine, constructing a semantic feature co-occurrence layer;
step ten, inputting the output of the semantic feature co-occurrence layer into a multilayer perceptron to construct an action recognition network based on hierarchical co-occurrence features;
step eleven, training the action recognition network based on hierarchical co-occurrence features to convergence, calculating the dynamic image of a test action video sample, and inputting it into the trained action recognition network based on hierarchical co-occurrence features to achieve action recognition.
Preferably, in the first step, the method for calculating the dynamic image of the motion video sample includes:
Each motion video sample consists of all the frames in the video. For any motion video sample A:
A = {I_t | t ∈ [1, T]},
where t denotes the time index and T is the total number of frames of motion video sample A;
$I_t \in \mathbb{R}^{R \times C \times 3}$
is the matrix representation of the t-th frame image of motion video sample A, where R, C and 3 correspond to the numbers of rows, columns and channels of that matrix representation, respectively, and $\mathbb{R}$ denotes the set of real numbers; each element of I_t represents a pixel value of the t-th frame image.
For any motion video sample A, first vectorize I_t, i.e. connect all the row vectors of the three channels of I_t into a new row vector i_t.
Take the arithmetic square root of each element of the row vector i_t to obtain a new vector w_t, namely:
$w_t = \sqrt{i_t}$,
where $\sqrt{i_t}$ denotes taking the arithmetic square root of each element of the row vector i_t; w_t is called the frame vector of the t-th frame image of motion video sample A.
Calculate the feature vector v_t of the t-th frame image of motion video sample A as follows:
$v_t = \frac{1}{t}\sum_{\tau=1}^{t} w_\tau$,
where $\sum_{\tau=1}^{t} w_\tau$ denotes the summation of the frame vectors of the 1st to the t-th frame images of motion video sample A.
Calculate the score B_t of the t-th frame image I_t of motion video sample A as follows:
B_t = u^T · v_t,
where u is a vector of dimension f, with f = R × C × 3; u^T denotes the transpose of the vector u, and u^T · v_t denotes the dot product of the transposed vector u and the feature vector v_t.
The value of u is computed so that frame images appearing later in the motion video sample receive higher scores, i.e. the larger t is, the higher the score B_t. The vector u can be computed with RankSVM as follows:
$u^* = \underset{u}{\arg\min}\; E(u)$,
$E(u) = \frac{\lambda}{2}\|u\|^2 + \frac{2}{T(T-1)}\sum_{i>j}\max\{0,\, 1 - B_i + B_j\}$,
where $u^*$ denotes the u that minimizes the value of E(u), λ is a constant, and $\|u\|^2$ denotes the sum of the squares of the elements of the vector u; B_i and B_j denote the scores of the i-th and j-th frame images of motion video sample A, respectively, and max{0, 1 - B_i + B_j} means taking the larger of 0 and 1 - B_i + B_j.
After computing the vector u with RankSVM, rearrange u into an image of the same size as I_t to obtain
$u' \in \mathbb{R}^{R \times C \times 3}$;
u' is the dynamic image of motion video sample A.
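As an illustration, the dynamic-image computation described above can be sketched in Python as follows. This is a minimal, non-limiting sketch: the function and variable names are illustrative, and the RankSVM step is approximated by training a linear SVM on pairwise difference vectors, which is one common way to solve the ranking objective (the solver and regularization constant are assumptions, not specified by the patent).

```python
import numpy as np
from sklearn.svm import LinearSVC

def dynamic_image(frames, svm_c=1.0):
    """Compute a dynamic image for one video by rank pooling.

    frames: array of shape (T, rows, cols, 3) holding the frame images.
    Returns an array of shape (rows, cols, 3), the dynamic image u'.
    """
    T = frames.shape[0]
    # i_t: vectorized frame; w_t: element-wise arithmetic square root
    w = np.sqrt(frames.reshape(T, -1).astype(np.float64))
    # v_t: time-averaged sum of the frame vectors w_1..w_t
    v = np.cumsum(w, axis=0) / np.arange(1, T + 1)[:, None]
    # RankSVM via pairwise differences: for every i > j, require u . (v_i - v_j) > 0
    diffs = np.array([v[i] - v[j] for i in range(T) for j in range(i)])
    X = np.vstack([diffs, -diffs])                 # mirrored pairs give two classes
    y = np.hstack([np.ones(len(diffs)), -np.ones(len(diffs))])
    svm = LinearSVC(C=svm_c, fit_intercept=False, max_iter=10000)
    svm.fit(X, y)
    u = svm.coef_.ravel()
    return u.reshape(frames.shape[1:])             # rearranged to the frame size
```

A video of 40 frames, as in the embodiment below, yields 40 × 39 / 2 = 780 ordered frame pairs for the ranking constraint.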
Further, in the second step, the feature extractor is composed of a series of convolutional layers and pooling layers. The dynamic image of each motion video sample is input into the feature extractor, which outputs a feature map
$F \in \mathbb{R}^{K_1 \times K_2 \times D}$,
where K_1, K_2 and D denote the height, width and number of channels of the output feature map, respectively. The feature map F has K_1 × K_2 pixel points in total, and the feature vector x_y of each pixel point has dimension D, i.e. the number of channels of the feature map F, with y = 1, 2, ..., K_1 × K_2. The feature vectors of the dynamic image are represented by the set of feature vectors X = {x_y | y = 1, 2, ..., K_1 × K_2}.
Further, in the third step, the feature center group contains N_K feature centers, each feature center corresponds to a scaling coefficient, and the initial values of each feature center and its scaling coefficient are computed as follows:
Compute the feature vectors of the dynamic images of all training motion video samples and cluster all the feature vectors; the number of clusters equals the number of feature centers, i.e. N_K, each cluster has a cluster center, and the cluster center values obtained by clustering are used as the initial values of the feature centers. For the k-th cluster, the set of all feature vectors in the cluster is denoted E_k and contains N_k feature vectors:
$E_k = \{e_1, e_2, \ldots, e_{N_k}\}$.
Compute the Euclidean distance d_{q,τ} between feature vectors:
$d_{q,\tau} = \sqrt{\sum_{d} ([e_q]_d - [e_\tau]_d)^2}$,
where [e_q]_d denotes the d-th dimension of the feature vector e_q, q ∈ [1, N_k - 1], τ ∈ [q + 1, N_k]. The initial value of the scaling coefficient σ_k of the k-th feature center is the mean of the pairwise distances within the cluster:
$\sigma_k = \frac{2}{N_k (N_k - 1)} \sum_{q=1}^{N_k - 1} \sum_{\tau=q+1}^{N_k} d_{q,\tau}$.
For a feature vector x_y of the dynamic image, compute its distance to the k-th feature center c_k and use it as the output of x_y at the k-th feature center c_k; the calculation formula is:
W_k(x_y) = exp(-||x_y - c_k||^2 / σ_k).
The output obtained by inputting the feature vector x_y into the k-th feature center is normalized:
$\bar{W}_k(x_y) = \frac{W_k(x_y)}{\sum_{k'=1}^{N_K} W_{k'}(x_y)}$.
All feature vectors of the dynamic image of each motion video sample are input into every feature center of the feature center group, and all outputs at each feature center are accumulated. The accumulated output h_k of the k-th feature center is computed as:
$h_k = \sum_{y=1}^{K_1 \times K_2} \bar{W}_k(x_y)$.
The accumulated values of all feature centers are concatenated to obtain the histogram representation H of the dynamic image:
$H = (h_1, h_2, \ldots, h_{N_K})$.
The feature center group and the accumulation layer that accumulates its outputs form a feature soft quantizer. The input of the feature soft quantizer is the feature vectors of the dynamic image of each motion video sample, and its output is the histogram representation of the dynamic image.
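As an illustration, the feature soft quantizer described above can be sketched as follows; the array names and shapes are illustrative and not part of the patent.

```python
import numpy as np

def soft_quantize(X, centers, sigmas):
    """Feature soft quantizer sketch: soft-assign every feature vector to all
    feature centers and accumulate the normalized responses into a histogram.
    X: (P, D) feature vectors of one dynamic image; centers: (N_K, D);
    sigmas: (N_K,). Returns the histogram H of length N_K."""
    sq_dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)   # (P, N_K)
    W = np.exp(-sq_dist / sigmas[None, :])                           # W_k(x_y)
    W_bar = W / W.sum(axis=1, keepdims=True)                         # normalize over centers
    return W_bar.sum(axis=0)                                         # accumulate over pixels
```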
Further, in the fourth step, the feature quantization network comprises the feature extractor, the feature soft quantizer and a multilayer perceptron.
The multilayer perceptron comprises an input layer, a hidden layer and an output layer. The input layer is connected to the output of the histogram concatenation layer, and the output Input_1 of the input layer equals the output H of the histogram concatenation layer, i.e. Input_1 = H; the input layer has r_1 = N_K neurons in total. The hidden layer has z_1 neurons and is fully connected to all output units of the input layer. The output layer of the multilayer perceptron has o neurons, each representing one action category. Denote the weights between the input layer and the hidden layer by
$U_1 \in \mathbb{R}^{r_1 \times z_1}$,
and the weights between the hidden layer and the output layer by
$V_1 \in \mathbb{R}^{z_1 \times o}$.
The output Q_1 of the hidden-layer neurons is computed as:
$Q_1 = \phi_{elu}(Input_1 \cdot U_1 + \beta_1)$,
where φ_elu is the ELU activation function and $\beta_1 \in \mathbb{R}^{z_1}$ is the bias vector of the hidden layer.
The output O_1 of the output layer of the multilayer perceptron is:
$O_1 = \phi_{softmax}(Q_1 \cdot V_1 + \gamma_1)$,
where φ_softmax is the softmax activation function and $\gamma_1 \in \mathbb{R}^{o}$ is the bias vector of the output layer.
The loss function L_1 of the feature quantization network is:
$L_1 = -\sum_{g=1}^{G} \sum_{c=1}^{o} Y_c^g \log [O_1^g]_c$,
where $O_1^g$ is the output vector of the multilayer perceptron for the g-th sample and $Y^g = (Y_1^g, \ldots, Y_o^g)$ is the desired output vector for the g-th sample, defined in terms of the label l_g as:
$Y_c^g = \begin{cases} 1, & c = l_g \\ 0, & c \neq l_g \end{cases}$,
where G is the total number of samples and l_g is the label value of the g-th sample.
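As an illustration, the multilayer perceptron of the feature quantization network can be sketched in PyTorch as follows. The layer sizes follow the embodiment below (128 feature centers, 64 hidden neurons, 10 action categories), and the use of a cross-entropy criterion is an assumption consistent with the softmax output described above.

```python
import torch
import torch.nn as nn

class QuantizationHead(nn.Module):
    """MLP that maps the histogram representation H to action-category scores."""
    def __init__(self, n_centers=128, hidden=64, n_classes=10):
        super().__init__()
        self.hidden = nn.Linear(n_centers, hidden)   # weights U_1 and bias beta_1
        self.out = nn.Linear(hidden, n_classes)      # weights V_1 and bias gamma_1
        self.act = nn.ELU()

    def forward(self, H):
        Q1 = self.act(self.hidden(H))                # hidden-layer output Q_1
        return self.out(Q1)                          # logits; softmax is applied inside the loss

# training sketch (assumed cross-entropy loss over the label values l_g):
# criterion = nn.CrossEntropyLoss()
# loss = criterion(QuantizationHead()(H_batch), labels)
```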
Further, in the fifth step, the method for finding the co-occurrence feature center groups of each action category is as follows:
The dynamic image of each category's training action video samples is input into the feature extractor and the feature soft quantizer of the trained feature quantization network to obtain the histogram representations.
For each action category, the a-th training action video sample has the histogram representation
$H^a = (h_1^a, h_2^a, \ldots, h_{N_K}^a)$.
Each dimension of the histogram representation corresponds to one feature center, and its value is the response value of that training action video sample to that feature center. For each action category, the covariance between feature centers within that category is computed.
For any action category, the covariance Cov(k_1, k_2) between the k_1-th feature center and the k_2-th feature center in that category is computed as:
$\mathrm{Cov}(k_1, k_2) = \frac{1}{N_\theta} \sum_{a=1}^{N_\theta} (h_{k_1}^a - \bar{h}_{k_1})(h_{k_2}^a - \bar{h}_{k_2})$,
where $h_{k_1}^a$ is the response value of the a-th training motion video sample of that category to the k_1-th feature center, $\bar{h}_{k_1}$ is the average of the response values of all training motion video samples of that category to the k_1-th feature center, N_θ is the total number of training action video samples in that category, k_1 ∈ [1, N_K - 1], and k_2 ∈ [k_1 + 1, N_K].
For each action category, the covariances between all feature centers are computed in the above manner. Then, starting from the first action category, the computed covariance values are sorted and the K_1 largest covariances are selected, each covariance corresponding to a group of feature centers. The larger the covariance, the higher the probability that the features represented by that group of feature centers co-occur in the action category. Each group of feature centers represents image features that co-occur in the dynamic images of that category of motion video samples. If a feature center group found for a later-ranked action category duplicates a feature center group found for an earlier-ranked action category, the duplicated group is not counted among the groups found for that category, and the feature center group corresponding to the next largest covariance is selected instead. Thus K_1 groups of feature centers are found for each action category; with o action categories in total, K_1 × o groups of feature centers are found.
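As an illustration, the covariance-based selection of co-occurrence feature center groups can be sketched as follows. The function is a sketch of the selection rule above; whether the covariance is normalized by N_θ or N_θ - 1 does not affect the ranking.

```python
import numpy as np

def cooccurring_pairs(hists, k_top, exclude=frozenset()):
    """Select the k_top feature-center pairs with the largest covariance for one
    action category, skipping pairs already chosen for earlier categories.
    hists: (N_theta, N_K) histogram representations of that category.
    Returns a list of (k1, k2) index pairs with k1 < k2."""
    cov = np.cov(hists, rowvar=False, bias=True)   # (N_K, N_K) covariance over samples
    iu = np.triu_indices(cov.shape[0], k=1)        # all pairs k1 < k2
    order = np.argsort(cov[iu])[::-1]              # largest covariance first
    pairs = []
    for idx in order:
        pair = (int(iu[0][idx]), int(iu[1][idx]))
        if pair in exclude:                        # duplicate of an earlier category
            continue
        pairs.append(pair)
        if len(pairs) == k_top:
            break
    return pairs
```

Calling this once per action category, while accumulating the previously selected pairs of earlier categories into exclude, yields the K_1 groups per category and K_1 × o groups in total described above.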
Further, in the sixth step, the image feature co-occurrence layer is constructed as follows:
Based on the K_1 × o groups of feature centers found, a new layer is constructed after the feature soft quantizer, called the image feature co-occurrence layer. This layer has K_1 × o neurons in total; each neuron corresponds to one of the found groups of feature centers, its value is the product of the response values of that group of feature centers in the histogram of the motion video sample, and it is called a co-occurrence image feature neuron. The output of the image feature co-occurrence layer is
$S = (s_1, s_2, \ldots, s_{K_1 \times o})$,
where the output s_b of the b-th co-occurrence image feature neuron is computed as:
$s_b = h_{b_1} \cdot h_{b_2}$,
where $h_{b_1}$ and $h_{b_2}$ are the response values of the motion video sample to the two feature centers corresponding to the b-th co-occurrence image feature neuron.
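As an illustration, the image feature co-occurrence layer can be sketched as follows (names are illustrative, not taken from the patent):

```python
import numpy as np

def image_cooccurrence_layer(H, center_pairs):
    """Image feature co-occurrence layer sketch: each co-occurrence image feature
    neuron multiplies the histogram responses of one selected pair of feature
    centers. H: (N_K,) histogram of a dynamic image; center_pairs: K1*o pairs (k1, k2)."""
    return np.array([H[k1] * H[k2] for k1, k2 in center_pairs])
```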
Further, in the seventh step, the action recognition network based on co-occurrence image features comprises the feature extractor, the feature soft quantizer, the image feature co-occurrence layer and a multilayer perceptron.
The multilayer perceptron comprises an input layer, a hidden layer and an output layer. The input layer is connected to the output of the image feature co-occurrence layer, and the output Input_2 of the input layer equals the output S of the image feature co-occurrence layer, i.e. Input_2 = S; the input layer has r_2 = K_1 × o neurons in total. The hidden layer has z_2 neurons and is fully connected to all output units of the input layer. The output layer of the multilayer perceptron has o neurons, each representing one action category. Denote the weights between the input layer and the hidden layer by
$U_2 \in \mathbb{R}^{r_2 \times z_2}$,
and the weights between the hidden layer and the output layer by
$V_2 \in \mathbb{R}^{z_2 \times o}$.
The output Q_2 of the hidden-layer neurons is computed as:
$Q_2 = \phi_{elu}(Input_2 \cdot U_2 + \beta_2)$,
where φ_elu is the ELU activation function and $\beta_2 \in \mathbb{R}^{z_2}$ is the bias vector of the hidden layer.
The output O_2 of the output layer of the multilayer perceptron is:
$O_2 = \phi_{softmax}(Q_2 \cdot V_2 + \gamma_2)$,
where φ_softmax is the softmax activation function and $\gamma_2 \in \mathbb{R}^{o}$ is the bias vector of the output layer.
The loss function L_2 of the action recognition network based on co-occurrence image features is:
$L_2 = -\sum_{g=1}^{G} \sum_{c=1}^{o} Y_c^g \log [O_2^g]_c$,
where $O_2^g$ is the output vector of the multilayer perceptron for the g-th sample.
Further, in the step eight, the method for finding the co-occurrence image feature neuron groups of each action category is as follows:
The dynamic image of each action category's training action video samples is input into the trained action recognition network based on co-occurrence image features to obtain the output S of the image feature co-occurrence layer.
For each action category, the output of the a-th training action video sample at the image feature co-occurrence layer is
$S^a = (s_1^a, s_2^a, \ldots, s_{K_1 \times o}^a)$.
Each dimension of this output corresponds to one co-occurrence image feature neuron in the image feature co-occurrence layer. The covariance between co-occurrence image feature neurons within each action category is computed.
For any action category, the covariance Cov(d_1, d_2) between the d_1-th co-occurrence image feature neuron and the d_2-th co-occurrence image feature neuron in that category is computed as:
$\mathrm{Cov}(d_1, d_2) = \frac{1}{N_\theta} \sum_{a=1}^{N_\theta} (s_{d_1}^a - \bar{s}_{d_1})(s_{d_2}^a - \bar{s}_{d_2})$,
where $s_{d_1}^a$ is the output of the a-th training motion video sample of that category at the d_1-th co-occurrence image feature neuron, $\bar{s}_{d_1}$ is the average of the outputs of all training motion video samples of that category at the d_1-th co-occurrence image feature neuron, d_1 ∈ [1, K_1 × o - 1], and d_2 ∈ [d_1 + 1, K_1 × o].
For each action category, the covariances between all co-occurrence image feature neurons are computed in the above manner. Then, starting from the first action category, the computed covariance values are sorted and the K_2 largest covariances are selected, each covariance corresponding to a group of co-occurrence image feature neurons. The larger the covariance, the higher the probability that that group of co-occurrence image feature neurons co-occurs in the category. Each group of co-occurrence image feature neurons represents semantic features that co-occur in the dynamic images of that category of motion video samples. If a co-occurrence image feature neuron group found for a later-ranked action category duplicates a group found for an earlier-ranked category, the duplicated group is not counted among the groups found for that category, and the group corresponding to the next largest covariance is selected instead. Thus K_2 groups of co-occurrence image feature neurons are found for each action category; with o action categories in total, K_2 × o groups of co-occurrence image feature neurons are found.
Further, in the ninth step, the semantic feature co-occurrence layer is constructed as follows:
Based on the K_2 × o groups of co-occurrence image feature neurons found, a new layer is constructed after the image feature co-occurrence layer, called the semantic feature co-occurrence layer. This layer has K_2 × o neurons in total, called co-occurrence semantic feature neurons; each neuron corresponds to one of the found groups of co-occurrence image feature neurons, and its value is the product of the outputs of the action video sample at that group of co-occurrence image feature neurons. The output of the semantic feature co-occurrence layer is
$M = (m_1, m_2, \ldots, m_{K_2 \times o})$,
where the output m_χ of the χ-th co-occurrence semantic feature neuron is computed as:
$m_\chi = s_{\chi_1} \cdot s_{\chi_2}$,
where $s_{\chi_1}$ and $s_{\chi_2}$ are the output values of the motion video sample at the two co-occurrence image feature neurons corresponding to the χ-th co-occurrence semantic feature neuron.
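As an illustration, the semantic feature co-occurrence layer mirrors the image feature co-occurrence layer one level up; a sketch with illustrative names is:

```python
import numpy as np

def semantic_cooccurrence_layer(S, neuron_pairs):
    """Semantic feature co-occurrence layer sketch: each co-occurrence semantic
    feature neuron multiplies the outputs of one selected pair of co-occurrence
    image feature neurons. S: (K1*o,) output of the image feature co-occurrence
    layer; neuron_pairs: K2*o pairs (d1, d2)."""
    return np.array([S[d1] * S[d2] for d1, d2 in neuron_pairs])
```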
Further, in the step ten, the action recognition network based on hierarchical co-occurrence features comprises the feature extractor, the feature soft quantizer, the image feature co-occurrence layer, the semantic feature co-occurrence layer and a multilayer perceptron.
The multilayer perceptron comprises an input layer, a hidden layer and an output layer. The input layer is connected to the output of the semantic feature co-occurrence layer, and the output Input_3 of the input layer equals the output M of the semantic feature co-occurrence layer, i.e. Input_3 = M; the input layer has r_3 = K_2 × o neurons in total. The hidden layer has z_3 neurons and is fully connected to all output units of the input layer. The output layer of the multilayer perceptron has o neurons, each representing one action category. Denote the weights between the input layer and the hidden layer by
$U_3 \in \mathbb{R}^{r_3 \times z_3}$,
and the weights between the hidden layer and the output layer by
$V_3 \in \mathbb{R}^{z_3 \times o}$.
The output Q_3 of the hidden-layer neurons is computed as:
$Q_3 = \phi_{elu}(Input_3 \cdot U_3 + \beta_3)$,
where φ_elu is the ELU activation function and $\beta_3 \in \mathbb{R}^{z_3}$ is the bias vector of the hidden layer.
The output O_3 of the output layer of the multilayer perceptron is:
$O_3 = \phi_{softmax}(Q_3 \cdot V_3 + \gamma_3)$,
where φ_softmax is the softmax activation function and $\gamma_3 \in \mathbb{R}^{o}$ is the bias vector of the output layer.
The loss function L_3 of the action recognition network based on hierarchical co-occurrence features is:
$L_3 = -\sum_{g=1}^{G} \sum_{c=1}^{o} Y_c^g \log [O_3^g]_c$,
where $O_3^g$ is the output vector of the multilayer perceptron for the g-th sample.
The input of the action recognition network based on hierarchical co-occurrence features is the dynamic image of an action video sample, and its output is the probability that the current action video sample belongs to each action category.
Further, in the eleventh step, action recognition is performed as follows:
The dynamic image of each test action video sample is calculated and input into the trained action recognition network based on hierarchical co-occurrence features to obtain the predicted probabilities of the current test action video sample belonging to each action category; the action category with the largest probability is the finally predicted action category of the current test action video sample, thereby achieving action recognition.
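As an illustration, test-time recognition can be sketched by chaining the sketches given above; all function and parameter names are illustrative, and `mlp` stands for the trained multilayer perceptron that maps M to class probabilities.

```python
import numpy as np

def predict_action(video_frames, feature_extractor, centers, sigmas,
                   center_pairs, neuron_pairs, mlp):
    """End-to-end inference sketch for the hierarchical co-occurrence network,
    assuming the earlier sketch functions are in scope."""
    di = dynamic_image(video_frames)                   # step one
    X = feature_extractor(di)                          # (P, D) feature vectors
    H = soft_quantize(X, centers, sigmas)              # histogram representation
    S = image_cooccurrence_layer(H, center_pairs)      # image-level co-occurrence
    M = semantic_cooccurrence_layer(S, neuron_pairs)   # semantic-level co-occurrence
    probs = mlp(M)                                     # probabilities per action category
    return int(np.argmax(probs))                       # predicted action category
```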
The action recognition network based on hierarchical co-occurrence features can learn the co-occurring features in action videos and effectively increases the discriminability of the action sample representation; when training the network, only a single dynamic image is used as a compact representation of the action video and input into the network, so the demands on computer hardware are low.
Drawings
FIG. 1 is a flow chart of the operation of a method of motion recognition in accordance with the present invention.
FIG. 2 is a schematic view of a dynamic image according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a feature extractor of the present invention.
Fig. 4 is a schematic diagram of grouped convolution module 1 in Fig. 3.
Fig. 5 is a schematic diagram of grouped convolution module 2 or grouped convolution module 3 in Fig. 3.
Fig. 6 is a schematic diagram of the feature quantization network of the present invention.
FIG. 7 is a schematic diagram of the action recognition network based on co-occurrence image features according to the present invention.
FIG. 8 is a schematic diagram of the action recognition network based on hierarchical co-occurrence features of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, a motion recognition method includes the following steps:
1. The motion video sample set contains 2000 motion video samples in total, covering 10 action categories with 200 motion video samples per category. Three quarters of the motion video samples of each action category are randomly selected as the training set and the remaining quarter as the test set, giving 1500 training motion video samples and 500 test motion video samples. Each motion video sample consists of all the frames of the sample video. Take any motion video sample A as an example:
A={It|t∈[1,40]},
where t represents a time index, the motion video sample has a total of 40 frames.
$I_t \in \mathbb{R}^{240 \times 320 \times 3}$
is the matrix representation of the t-th frame image of motion video sample A; the frame image has 240 rows, 320 columns and 3 channels, $\mathbb{R}$ denotes the set of real numbers, and each element of I_t represents a pixel value of the t-th frame image.
The dynamic image of the motion video sample is calculated as follows:
For any motion video sample A, first vectorize I_t, i.e. connect all the row vectors of the three channels of I_t into a new row vector i_t.
Take the arithmetic square root of each element of the row vector i_t to obtain a new vector w_t, namely:
$w_t = \sqrt{i_t}$,
where $\sqrt{i_t}$ denotes taking the arithmetic square root of each element of the row vector i_t; w_t is the frame vector of the t-th frame image of motion video sample A.
Calculate the feature vector v_t of the t-th frame image of motion video sample A as follows:
$v_t = \frac{1}{t}\sum_{\tau=1}^{t} w_\tau$,
where $\sum_{\tau=1}^{t} w_\tau$ denotes the summation of the frame vectors of the 1st to the t-th frame images of motion video sample A.
Calculate the score B_t of the t-th frame image I_t of motion video sample A as:
B_t = u^T · v_t,
where u is a vector of dimension 230400; u^T denotes the transpose of the vector u, and u^T · v_t denotes the dot product of the transposed vector u and the feature vector v_t.
The value of u is computed so that frame images appearing later in the motion video sample receive higher scores, i.e. the larger t is, the higher the score B_t. The vector u can be computed with RankSVM as follows:
$u^* = \underset{u}{\arg\min}\; E(u)$,
$E(u) = \frac{\lambda}{2}\|u\|^2 + \frac{2}{T(T-1)}\sum_{i>j}\max\{0,\, 1 - B_i + B_j\}$,
where $u^*$ denotes the u that minimizes the value of E(u), λ is a constant, and $\|u\|^2$ denotes the sum of the squares of the elements of the vector u; B_i and B_j denote the scores of the i-th and j-th frame images of motion video sample A, respectively, and max{0, 1 - B_i + B_j} means taking the larger of 0 and 1 - B_i + B_j.
After computing the vector u with RankSVM, rearrange u into an image of the same size as I_t to obtain
$u' \in \mathbb{R}^{240 \times 320 \times 3}$;
u' is the dynamic image of motion video sample A. Fig. 2 shows an example of the resulting dynamic image.
2. Input the dynamic image of the motion video sample into the feature extractor to extract the feature vectors of the dynamic image. The feature extractor consists of a series of convolutional layers and pooling layers. As shown in Fig. 3, it is composed of the first two modules of ResNeXt-50, namely convolution module 1 and convolution module 2.
Convolution module 1 contains a convolutional layer with 64 convolution kernels, each of size 7 × 7. Convolution module 2 comprises a max pooling layer and three grouped convolution modules; the pooling kernel of the max pooling layer is 3 × 3. Grouped convolution module 1 is shown in Fig. 4. Its first layer is a convolutional layer, the second layer is a grouped convolutional layer, the third layer is a convolutional layer, and the fourth layer is a residual addition layer. The first convolutional layer has 128 convolution kernels, each of size 1 × 1. The second, grouped, convolutional layer has 128 convolution kernels, each of size 3 × 3; it divides its input feature map of size W_1 × H_1 × 128 into 32 groups of size W_1 × H_1 × 4 by channel, divides the 128 convolution kernels into 32 groups of 4 kernels each, convolves each group of the feature map with the corresponding group of kernels, and finally concatenates the per-group convolution results by channel to obtain the output of the grouped convolutional layer. The third convolutional layer has 256 convolution kernels, each of size 1 × 1. The fourth, residual addition, layer passes the input of the first convolutional layer through a residual convolutional layer with 256 convolution kernels of size 1 × 1 and adds its output to the output of the third convolutional layer; the sum is the output of the residual addition layer and of the first grouped convolution module. Grouped convolution modules 2 and 3 are similar to grouped convolution module 1, as shown in Fig. 5; the only difference is that their fourth, residual addition, layer directly adds the input of the first convolutional layer to the output of the third convolutional layer, with no residual convolutional layer.
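As an illustration, grouped convolution module 1 described above can be sketched in PyTorch as follows; the placement of normalization and activation functions is an assumption, since the text does not specify it.

```python
import torch
import torch.nn as nn

class GroupedConvModule(nn.Module):
    """Sketch of grouped convolution module 1: 1x1 conv (128 kernels), 3x3 grouped
    conv (128 kernels in 32 groups), 1x1 conv (256 kernels), and a residual
    addition whose shortcut is a 1x1 conv with 256 kernels."""
    def __init__(self, in_ch=64, use_shortcut_conv=True):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, 128, kernel_size=1, bias=False)
        self.gconv = nn.Conv2d(128, 128, kernel_size=3, padding=1,
                               groups=32, bias=False)   # 32 groups of 4 kernels
        self.conv3 = nn.Conv2d(128, 256, kernel_size=1, bias=False)
        # grouped convolution modules 2 and 3 add the input directly (no residual conv layer)
        self.shortcut = (nn.Conv2d(in_ch, 256, kernel_size=1, bias=False)
                         if use_shortcut_conv else nn.Identity())
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.relu(self.gconv(out))
        out = self.conv3(out)
        return self.relu(out + self.shortcut(x))        # residual addition layer
```

Grouped convolution modules 2 and 3 would then be instantiated as GroupedConvModule(in_ch=256, use_shortcut_conv=False).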
The feature map output by the feature extractor is
$F \in \mathbb{R}^{30 \times 40 \times 256}$;
the height, width and number of channels of the feature map are 30, 40 and 256, respectively. The feature map F has 1200 pixel points in total, and the feature vector x_y of each pixel point has dimension 256, i.e. the number of channels of the feature map F, with y = 1, 2, ..., 1200. The feature vectors of the dynamic image are represented by the set of feature vectors X = {x_y | y = 1, 2, ..., 1200}.
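As an illustration, a comparable feature extractor can be assembled from torchvision's ResNeXt-50. Note that the stock stem plus first residual stage downsamples by 4 and outputs 256 channels, so reaching the 30 × 40 map stated above would require one additional 2× downsampling; the exact stride configuration is an assumption.

```python
import torch
import torch.nn as nn
import torchvision

def build_feature_extractor():
    """Feature extractor sketch: the stem (convolution module 1) and first
    residual stage (convolution module 2) of ResNeXt-50 from torchvision."""
    backbone = torchvision.models.resnext50_32x4d(weights=None)
    return nn.Sequential(
        backbone.conv1, backbone.bn1, backbone.relu,  # 7x7 conv with 64 kernels
        backbone.maxpool,                             # 3x3 max pooling
        backbone.layer1,                              # three grouped conv blocks, 256 channels
    )

# usage: a dynamic image tensor of shape (1, 3, 240, 320)
# feats = build_feature_extractor()(torch.randn(1, 3, 240, 320))
# feats has 256 channels; flatten its spatial positions to obtain the feature vectors x_y
```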
3. A feature center group is constructed, containing 128 feature centers in total, and each feature center corresponds to one scaling coefficient. The initial values of each feature center and its scaling coefficient are calculated as follows:
Extract the feature vectors of the dynamic images of all training motion video samples and cluster all the feature vectors; the number of clusters equals the number of feature centers, i.e. 128, each cluster has a cluster center, and the value of the cluster center of the first cluster is used as the initial value of the first feature center. Let the set of all feature vectors in the first cluster be E_1, which contains 300 vectors:
E_1 = {e_1, e_2, ..., e_300}.
Compute the Euclidean distance d_{q,τ} between vectors:
$d_{q,\tau} = \sqrt{\sum_{d} ([e_q]_d - [e_\tau]_d)^2}$,
where [e_q]_d denotes the d-th dimension of the vector e_q, q ∈ [1, 299], τ ∈ [q + 1, 300]. The initial value of the scaling coefficient σ_1 of the first feature center is the mean of these pairwise distances:
$\sigma_1 = \frac{2}{300 \times 299} \sum_{q=1}^{299} \sum_{\tau=q+1}^{300} d_{q,\tau}$.
The initial values of the 128 feature centers and of the corresponding scaling coefficients are all obtained in the above manner.
4. For a feature vector x_y of the dynamic image, compute its distance to the k-th feature center c_k and use it as the output of x_y at the k-th feature center c_k; the calculation formula is:
W_k(x_y) = exp(-||x_y - c_k||^2 / σ_k).
The output obtained by inputting the feature vector x_y into the k-th feature center is normalized:
$\bar{W}_k(x_y) = \frac{W_k(x_y)}{\sum_{k'=1}^{128} W_{k'}(x_y)}$.
5. All feature vectors of the dynamic image of the motion video sample are input into every feature center of the feature center group, and all outputs at each feature center are accumulated. The accumulated output h_k of the k-th feature center is computed as:
$h_k = \sum_{y=1}^{1200} \bar{W}_k(x_y)$.
The accumulated values of all feature centers are concatenated to obtain the histogram representation H of the dynamic image:
H = (h_1, h_2, ..., h_128).
The feature center group and the accumulation layer that accumulates its outputs form the feature soft quantizer. The input of the feature soft quantizer is the feature vectors of the dynamic image of each motion video sample, and its output is the histogram representation of the dynamic image.
6. The histogram representation H of the dynamic image of the motion video sample is input into a multilayer perceptron to form the feature quantization network, as shown in Fig. 6. The feature quantization network comprises the feature extractor, the feature soft quantizer and the multilayer perceptron.
The multilayer perceptron comprises an input layer, a hidden layer and an output layer. The input layer is connected to the output of the histogram concatenation layer, and the output Input_1 of the input layer equals the output H of the histogram concatenation layer, i.e. Input_1 = H; the input layer has 128 neurons in total. The hidden layer has 64 neurons and is fully connected to all output units of the input layer. The output layer of the multilayer perceptron has 10 neurons, each representing one action category. Denote the weights between the input layer and the hidden layer by
$U_1 \in \mathbb{R}^{128 \times 64}$,
and the weights between the hidden layer and the output layer by
$V_1 \in \mathbb{R}^{64 \times 10}$.
The output Q_1 of the hidden-layer neurons is computed as:
$Q_1 = \phi_{elu}(Input_1 \cdot U_1 + \beta_1)$,
where φ_elu is the ELU activation function and $\beta_1 \in \mathbb{R}^{64}$ is the bias vector of the hidden layer.
The output O_1 of the output layer of the multilayer perceptron is:
$O_1 = \phi_{softmax}(Q_1 \cdot V_1 + \gamma_1)$,
where φ_softmax is the softmax activation function and $\gamma_1 \in \mathbb{R}^{10}$ is the bias vector of the output layer.
The loss function L_1 of the feature quantization network is:
$L_1 = -\sum_{g=1}^{G} \sum_{c=1}^{10} Y_c^g \log [O_1^g]_c$,
where $O_1^g$ is the output vector of the multilayer perceptron for the g-th sample and $Y^g = (Y_1^g, \ldots, Y_{10}^g)$ is the desired output vector for the g-th sample, defined in terms of the label l_g as:
$Y_c^g = \begin{cases} 1, & c = l_g \\ 0, & c \neq l_g \end{cases}$,
where l_g is the label value of the g-th sample.
7. The dynamic images of all training samples are input into the feature quantization network, and the feature quantization network is trained to convergence. The dynamic image of each category's training action video samples is then input into the feature extractor and the feature soft quantizer of the trained feature quantization network to obtain the histogram representations.
For each action category, the a-th training action video sample has the histogram representation
$H^a = (h_1^a, h_2^a, \ldots, h_{128}^a)$.
Each dimension of the histogram representation corresponds to one feature center, and its value is the response value of that training motion video sample to that feature center. For each action category, the covariance between feature centers within that category is computed.
Taking the first action category as an example, this category has 150 training samples in total. The covariance Cov(1,2) of the 1st feature center and the 2nd feature center in this category is computed as:
$\mathrm{Cov}(1,2) = \frac{1}{150} \sum_{a=1}^{150} (h_1^a - \bar{h}_1)(h_2^a - \bar{h}_2)$,
where $h_1^a$ is the response value of the a-th training motion video sample of the first action category to the 1st feature center, and $\bar{h}_1$ is the average of the response values of all training motion video samples of the first action category to the 1st feature center, computed as:
$\bar{h}_1 = \frac{1}{150} \sum_{a=1}^{150} h_1^a$.
The covariances between all feature centers of the first action category can be calculated in the same way, giving 128 × 127 / 2 = 8128 covariance values in total. The 8 largest covariances are selected, each corresponding to a group of feature centers. The larger the covariance, the higher the probability that the features represented by that group of feature centers co-occur in the category. Each group of feature centers represents image features that co-occur in the dynamic images of that category of motion video samples. For the first action category, the 8 largest covariances are: Cov(1,10), Cov(8,35), Cov(12,23), Cov(16,79), Cov(20,95), Cov(45,64), Cov(85,112) and Cov(97,121); the corresponding feature center groups are (c_1, c_10), (c_8, c_35), (c_12, c_23), (c_16, c_79), (c_20, c_95), (c_45, c_64), (c_85, c_112) and (c_97, c_121).
For the 2nd to 10th action categories, the covariances between feature centers are calculated in the same way, and the feature center groups corresponding to the 8 largest covariances are found. If a feature center group found for a later-ranked action category duplicates a feature center group found for an earlier-ranked category, the duplicated group is not counted among the groups found for that category, and the feature center group corresponding to the next largest covariance is selected instead. Taking the 2nd action category as an example, its 8 largest covariances are: Cov(2,7), Cov(10,22), Cov(18,28), Cov(22,83), Cov(39,97), Cov(45,64), Cov(79,108) and Cov(98,125), with corresponding feature center groups (c_2, c_7), (c_10, c_22), (c_18, c_28), (c_22, c_83), (c_39, c_97), (c_45, c_64), (c_79, c_108) and (c_98, c_125). The feature center group (c_45, c_64) duplicates a group found for the 1st action category, so the feature center group (c_67, c_99) corresponding to the 9th largest covariance Cov(67,99) is taken instead. Finally, for the 2nd action category, the 8 feature center groups found are: (c_2, c_7), (c_10, c_22), (c_18, c_28), (c_22, c_83), (c_39, c_97), (c_67, c_99), (c_79, c_108) and (c_98, c_125). In total, the 10 action categories yield 80 feature center groups.
8. Based on the 80 feature center groups found, a new layer is constructed after the feature soft quantizer, called the image feature co-occurrence layer. This layer has 80 neurons in total. Each neuron corresponds to one of the found feature center groups, its value is the product of the response values of that group of feature centers in the histogram of the motion video sample, and it is called a co-occurrence image feature neuron. The output of the image feature co-occurrence layer is denoted S = (s_1, s_2, ..., s_80),
where the output s_b of the b-th co-occurrence image feature neuron is computed as:
$s_b = h_{b_1} \cdot h_{b_2}$,
where $h_{b_1}$ and $h_{b_2}$ are the response values of the motion video sample to the two feature centers corresponding to the b-th co-occurrence image feature neuron.
9. The outputs of the image feature co-occurrence layer are input into a multilayer perceptron to form the action recognition network based on co-occurrence image features, as shown in Fig. 7. This network comprises the feature extractor, the feature soft quantizer, the image feature co-occurrence layer and the multilayer perceptron.
The multilayer perceptron comprises an input layer, a hidden layer and an output layer. The input layer is connected to the output of the image feature co-occurrence layer, and the output Input_2 of the input layer equals the output S of the image feature co-occurrence layer, i.e. Input_2 = S; the input layer has 80 neurons in total. The hidden layer has 64 neurons and is fully connected to all output units of the input layer. The output layer of the multilayer perceptron has 10 neurons, each representing one action category. Denote the weights between the input layer and the hidden layer by
$U_2 \in \mathbb{R}^{80 \times 64}$,
and the weights between the hidden layer and the output layer by
$V_2 \in \mathbb{R}^{64 \times 10}$.
The output Q_2 of the hidden-layer neurons is computed as:
$Q_2 = \phi_{elu}(Input_2 \cdot U_2 + \beta_2)$,
where φ_elu is the ELU activation function and $\beta_2 \in \mathbb{R}^{64}$ is the bias vector of the hidden layer.
The output O_2 of the output layer of the multilayer perceptron is:
$O_2 = \phi_{softmax}(Q_2 \cdot V_2 + \gamma_2)$,
where φ_softmax is the softmax activation function and $\gamma_2 \in \mathbb{R}^{10}$ is the bias vector of the output layer.
The loss function L_2 of the action recognition network based on co-occurrence image features is:
$L_2 = -\sum_{g=1}^{G} \sum_{c=1}^{10} Y_c^g \log [O_2^g]_c$,
where $O_2^g$ is the output vector of the multilayer perceptron for the g-th sample.
10. The dynamic images of all training action video samples are input into the action recognition network based on co-occurrence image features, and this network is trained to convergence.
The dynamic image of each action category's training action video samples is input into the trained action recognition network based on co-occurrence image features to obtain the output S of the image feature co-occurrence layer.
For each action category, the output of the a-th training action video sample at the image feature co-occurrence layer is
$S^a = (s_1^a, s_2^a, \ldots, s_{80}^a)$.
Each dimension of this output corresponds to one co-occurrence image feature neuron in the image feature co-occurrence layer. The covariance between co-occurrence image feature neurons within each action category is computed.
Taking the first action category as an example, this category has 150 training samples in total. The covariance Cov(1,2) of the 1st co-occurrence image feature neuron and the 2nd co-occurrence image feature neuron in this category is computed as:
$\mathrm{Cov}(1,2) = \frac{1}{150} \sum_{a=1}^{150} (s_1^a - \bar{s}_1)(s_2^a - \bar{s}_2)$,
where $s_1^a$ is the output of the a-th training action video sample of the first action category at the 1st co-occurrence image feature neuron, and $\bar{s}_1$ is the average of the outputs of all training action video samples of this category at the 1st co-occurrence image feature neuron, computed as:
$\bar{s}_1 = \frac{1}{150} \sum_{a=1}^{150} s_1^a$.
The covariances between all co-occurrence image feature neurons of the first action category can be calculated in the same way, giving 80 × 79 / 2 = 3160 covariance values in total. The 4 largest covariances are selected, each corresponding to a group of co-occurrence image feature neurons. The larger the covariance, the higher the probability that that group of co-occurrence image feature neurons co-occurs in the category. Each group of co-occurrence image feature neurons represents semantic features that co-occur in the dynamic images of that category of motion video samples. For the first action category, the 4 largest covariances are: Cov(2,50), Cov(5,32), Cov(17,28) and Cov(45,78); the corresponding co-occurrence image feature neuron groups are (s_2, s_50), (s_5, s_32), (s_17, s_28) and (s_45, s_78).
For the 2nd to 10th action categories, the covariances between co-occurrence image feature neurons are calculated in the same way, and the co-occurrence image feature neuron groups corresponding to the 4 largest covariances are found. If a co-occurrence image feature neuron group found for a later-ranked action category duplicates a group found for an earlier-ranked category, the duplicated group is not counted among the groups found for that category, and the group corresponding to the next largest covariance is selected instead. Finally, the 10 action categories yield 40 groups of co-occurrence image feature neurons.
11. Based on the 40 groups of co-occurrence image feature neurons found, a new layer is constructed after the image feature co-occurrence layer, called the semantic feature co-occurrence layer. This layer has 40 neurons in total. Each neuron corresponds to one of the found groups of co-occurrence image feature neurons, its value is the product of the outputs of the motion video sample at that group of co-occurrence image feature neurons, and it is called a co-occurrence semantic feature neuron. The output of the semantic feature co-occurrence layer is M = (m_1, m_2, ..., m_40),
where the output m_χ of the χ-th co-occurrence semantic feature neuron is computed as:
$m_\chi = s_{\chi_1} \cdot s_{\chi_2}$,
where $s_{\chi_1}$ and $s_{\chi_2}$ are the output values of the motion video sample at the two co-occurrence image feature neurons corresponding to the χ-th co-occurrence semantic feature neuron.
12. The output of the semantic feature co-occurrence layer is input into a multilayer perceptron to form the action recognition network based on hierarchical co-occurrence features, as shown in Fig. 8. This network comprises the feature extractor, the feature soft quantizer, the image feature co-occurrence layer, the semantic feature co-occurrence layer and the multilayer perceptron.
The multilayer perceptron comprises an input layer, a hidden layer and an output layer. The input layer is connected to the output of the semantic feature co-occurrence layer, and the output Input_3 of the input layer equals the output M of the semantic feature co-occurrence layer, i.e. Input_3 = M; the input layer has 40 neurons in total. The hidden layer has 64 neurons and is fully connected to all output units of the input layer. The output layer of the multilayer perceptron has 10 neurons, each representing one action category. Denote the weights between the input layer and the hidden layer by
$U_3 \in \mathbb{R}^{40 \times 64}$,
and the weights between the hidden layer and the output layer by
$V_3 \in \mathbb{R}^{64 \times 10}$.
The output Q_3 of the hidden-layer neurons is computed as:
$Q_3 = \phi_{elu}(Input_3 \cdot U_3 + \beta_3)$,
where φ_elu is the ELU activation function and $\beta_3 \in \mathbb{R}^{64}$ is the bias vector of the hidden layer.
The output O_3 of the output layer of the multilayer perceptron is:
$O_3 = \phi_{softmax}(Q_3 \cdot V_3 + \gamma_3)$,
where φ_softmax is the softmax activation function and $\gamma_3 \in \mathbb{R}^{10}$ is the bias vector of the output layer.
The loss function L_3 of the action recognition network based on hierarchical co-occurrence features is:
$L_3 = -\sum_{g=1}^{G} \sum_{c=1}^{10} Y_c^g \log [O_3^g]_c$,
where $O_3^g$ is the output vector of the multilayer perceptron for the g-th sample.
The input of the action recognition network based on hierarchical co-occurrence features is the dynamic image of an action video sample, and its output is the probability that the current action video sample belongs to each action category.
13. The action recognition network based on hierarchical co-occurrence features is trained to convergence. The dynamic image of each test action video sample is calculated and input into the trained action recognition network based on hierarchical co-occurrence features to obtain the predicted probabilities of the current test action video sample belonging to each action category; the action category with the largest probability is the finally predicted action category of the current test action video sample, thereby achieving action recognition.
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that various changes in the embodiments and/or modifications of the invention can be made, and equivalents and modifications of some features of the invention can be made without departing from the spirit and scope of the invention.

Claims (12)

1. A motion recognition method is characterized in that: the method comprises the following steps:
step one, calculating a dynamic image of a motion video sample;
inputting the dynamic image of the motion video sample into a feature extractor to obtain a feature vector in the dynamic image;
step three, constructing a feature center group; inputting all the feature vectors of the dynamic images of the action video samples into the feature centers, accumulating all the outputs on each feature center to obtain histogram expression;
inputting the histogram expression into a multilayer perceptron to form a characteristic quantization network;
inputting the dynamic images of all the training video samples into a characteristic quantization network, training the characteristic quantization network until convergence, and finding out a co-occurrence characteristic center group of each action type;
constructing an image characteristic co-occurrence layer;
step seven, inputting the output of the image feature co-occurrence layer into a multilayer perceptron, and constructing a motion recognition network based on the co-occurrence image features;
step eight, inputting dynamic images of all training action video samples into an action recognition network based on co-occurrence image characteristics, training the action recognition network based on the co-occurrence image characteristics to be convergent, and finding out co-occurrence image characteristic neuron groups of each action category;
step nine, constructing a semantic feature co-occurrence layer;
step ten, inputting the output of the semantic feature co-occurrence layer into a multilayer perceptron, and constructing an action recognition network based on hierarchical co-occurrence features;
step eleven, training the action recognition network based on hierarchical co-occurrence features until convergence, calculating the dynamic image of the test action video sample, and inputting it into the trained action recognition network based on hierarchical co-occurrence features to realize action recognition.
2. A motion recognition method according to claim 1, characterized in that: in the first step, the method for calculating the dynamic image of the motion video sample comprises the following steps:
each motion video sample consists of all frames in the video; for any motion video sample A:
A = {I_t | t ∈ [1, T]},
where t denotes the time index and T is the total number of frames of motion video sample A; I_t ∈ ℝ^(R×C×3) is the matrix representation of the t-th frame image of motion video sample A, with R, C and 3 respectively the number of rows, columns and channels of that matrix, and ℝ denoting the set of real numbers; each element of I_t is a pixel value of the t-th frame image;
for any motion video sample A, I_t is first vectorized, i.e., the row vectors of all three channels of I_t are concatenated into a new row vector i_t;
the arithmetic square root of each element of the row vector i_t is taken to obtain a new vector w_t, namely:
w_t = sqrt(i_t),
where sqrt(i_t) denotes taking the arithmetic square root of each element of i_t; w_t is called the frame vector of the t-th frame image of motion video sample A;
the feature vector v_t of the t-th frame image of motion video sample A is calculated as:
v_t = Σ_{τ=1}^{t} w_τ,
i.e., the summation of the frame vectors of the 1st to the t-th frame images of motion video sample A;
the score B_t of the t-th frame image I_t of motion video sample A is calculated as:
B_t = u^T · v_t,
where u is a vector of dimension f, with f = R × C × 3; u^T denotes the transpose of the vector u, and u^T · v_t denotes the dot product of the transposed vector u and the feature vector v_t;
the value of u is calculated so that frame images appearing later in the motion video sample receive higher scores, i.e., the larger t is, the higher the score B_t; u can be calculated with RankSVM as follows:
u* = argmin_u E(u),
E(u) = λ·||u||² + Σ_{i>j} max{0, 1 − B_i + B_j},
where u* denotes the u that minimizes the value of E(u), λ is a constant, ||u||² is the sum of the squares of the elements of the vector u, B_i and B_j respectively denote the score of the i-th frame image and the score of the j-th frame image of motion video sample A, max{0, 1 − B_i + B_j} selects the larger of 0 and 1 − B_i + B_j, and the summation runs over all frame pairs with i > j;
after the vector u is calculated with RankSVM, u is rearranged into an image u' of the same size as I_t, and u' is taken as the dynamic image of motion video sample A (an illustrative code sketch follows this claim).
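As an illustration of claim 2, the sketch below computes a dynamic image from a list of frames. The patent names RankSVM for solving for u; since no particular solver is prescribed beyond the hinge objective, this sketch uses plain normalised sub-gradient descent on E(u), which is only one possible choice, and the step size and iteration count are arbitrary assumptions.

```python
import numpy as np

def dynamic_image(frames, lam=1e-3, step=1e-2, iters=200):
    """Sketch of the dynamic-image computation in claim 2.

    `frames` is a list of H x W x 3 arrays. The hinge objective
    E(u) = lam*||u||^2 + sum_{i>j} max(0, 1 - B_i + B_j) is minimised here by
    normalised sub-gradient descent instead of an off-the-shelf RankSVM solver.
    """
    T = len(frames)
    H, W, _ = frames[0].shape
    # i_t: row vectors of the three channels concatenated; w_t: element-wise sqrt
    w = np.sqrt(np.stack([f.astype(np.float64).transpose(2, 0, 1).reshape(-1)
                          for f in frames]))
    v = np.cumsum(w, axis=0)            # v_t = sum of frame vectors w_1 .. w_t

    u = np.zeros(H * W * 3)
    for _ in range(iters):
        B = v @ u                       # scores B_t = u^T . v_t
        grad = 2.0 * lam * u
        for i in range(1, T):           # ranking constraint: B_i > B_j for i > j
            for j in range(i):
                if 1.0 - B[i] + B[j] > 0.0:   # active hinge term
                    grad += v[j] - v[i]
        u -= step * grad / (np.linalg.norm(grad) + 1e-12)
    # rearrange u into an image of the same size as I_t -> dynamic image u'
    return u.reshape(3, H, W).transpose(1, 2, 0)
```

The double loop over frame pairs could be vectorised, but it is kept explicit here to mirror the pairwise ranking constraint B_i > B_j for i > j.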
3. A motion recognition method according to claim 2, characterized in that: in the second step, the feature extractor is composed of a series of convolution layers and a pooling layer; the dynamic image of each motion video sample is input into the feature extractor, which outputs a feature map F ∈ ℝ^(K1×K2×D), where K1, K2 and D respectively denote the height, width and number of channels of the output feature map; the feature map F has K1×K2 pixel points in total, and the feature vector x_y of each pixel point has dimension D, i.e., the number of channels of the feature map F, with y = 1, 2, ..., K1×K2; the feature vectors in the dynamic image are represented by the set X = {x_y | y = 1, 2, ..., K1×K2} (a code sketch follows this claim).
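A minimal stand-in for the feature extractor of claim 3 is sketched below in PyTorch; the claim only states that the extractor consists of convolution and pooling layers, so the channel counts and kernel sizes here are assumptions.

```python
import torch
import torch.nn as nn

# Illustrative feature extractor: convolution layers plus pooling.
feature_extractor = nn.Sequential(
    nn.Conv2d(3, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
    nn.MaxPool2d(2),
)

def feature_vector_set(dynamic_image):
    """dynamic_image: (3, R, C) float tensor; returns the set X as a
    (K1*K2, D) tensor with one feature vector x_y per feature-map pixel."""
    F = feature_extractor(dynamic_image.unsqueeze(0))    # (1, D, K1, K2)
    _, D, K1, K2 = F.shape
    return F.squeeze(0).permute(1, 2, 0).reshape(K1 * K2, D)
```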
4. A motion recognition method according to claim 3, characterized in that: in the third step, the feature center group contains N_K feature centers in total, and each feature center corresponds to a scaling coefficient; the initial value of each feature center and of its scaling coefficient is obtained as follows:
the feature vectors in the dynamic images of all training motion video samples are calculated and all feature vectors are clustered, the number of clusters being the same as the number of feature centers, namely N_K; each cluster has a cluster center, and the value of the cluster center obtained by clustering is used as the initial value of the corresponding feature center; for the k-th cluster, the set of all feature vectors in the cluster is denoted E_k and contains N_k feature vectors:
E_k = {e_1, e_2, …, e_{N_k}},
the Euclidean distance d_{q,τ} between feature vectors in the cluster is calculated:
d_{q,τ} = sqrt( Σ_d ([e_q]_d − [e_τ]_d)² ),
where [e_q]_d denotes the d-th dimension of the feature vector e_q, q ∈ [1, N_k−1], τ ∈ [q+1, N_k]; the initial value of the scaling coefficient σ_k of the k-th feature center is:
σ_k = 2/(N_k(N_k−1)) · Σ_{q=1}^{N_k−1} Σ_{τ=q+1}^{N_k} d_{q,τ}, i.e., the mean of all pairwise distances within the k-th cluster;
for a feature vector x_y of the dynamic image, the distance between it and the k-th feature center c_k is calculated as its output at the k-th feature center c_k, the distance being defined as:
W_k(x_y) = exp(−||x_y − c_k||² / σ_k),
and the output obtained by inputting the feature vector x_y into the k-th feature center is normalized over all feature centers:
Ŵ_k(x_y) = W_k(x_y) / Σ_{k'=1}^{N_K} W_{k'}(x_y);
all feature vectors of the dynamic image of each motion video sample are input into every feature center of the feature center group, and all outputs at each feature center are accumulated; the accumulated output h_k of the k-th feature center is calculated as:
h_k = Σ_{y=1}^{K1×K2} Ŵ_k(x_y),
and the accumulated values of all feature centers are connected together to obtain the histogram expression H of the dynamic image:
H = [h_1, h_2, …, h_{N_K}];
the feature center group and the accumulation layer that accumulates the outputs of the feature center group form a feature soft quantizer; the input of the feature soft quantizer is the feature vectors of the dynamic image of each motion video sample, and the output is the histogram expression of the dynamic image (a code sketch follows this claim).
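The feature soft quantizer of claim 4 can be sketched as follows. The clustering-based initialisation uses scikit-learn's KMeans; initialising σ_k as the mean pairwise distance inside cluster k is one plausible reading of the claim, and the normalisation divides each response by the sum of responses over all feature centers.

```python
import numpy as np
from sklearn.cluster import KMeans

def init_centers_and_sigmas(all_feature_vectors, num_centers):
    """all_feature_vectors: (N, D) array of feature vectors from all training
    dynamic images. Cluster centers initialise the feature centers c_k;
    sigma_k is initialised as the mean pairwise distance inside cluster k
    (an assumed reading of the claim)."""
    km = KMeans(n_clusters=num_centers, n_init=10).fit(all_feature_vectors)
    centers, sigmas = km.cluster_centers_, np.ones(num_centers)
    for k in range(num_centers):
        members = all_feature_vectors[km.labels_ == k]
        if len(members) > 1:
            diff = members[:, None, :] - members[None, :, :]
            dists = np.sqrt((diff ** 2).sum(-1))
            iu = np.triu_indices(len(members), k=1)
            sigmas[k] = dists[iu].mean()
    return centers, sigmas

def histogram_expression(X, centers, sigmas):
    """X: (K1*K2, D) feature vectors of one dynamic image -> histogram H."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)  # ||x_y - c_k||^2
    W = np.exp(-d2 / sigmas[None, :])                          # W_k(x_y)
    W = W / W.sum(axis=1, keepdims=True)                       # normalise over centers
    return W.sum(axis=0)                                       # accumulate -> h_k
```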
5. An action recognition method according to claim 4, characterized in that: in the fourth step, the feature quantization network comprises a feature extractor, a feature soft quantizer and a multilayer perceptron;
the multilayer perceptron comprises an input layer, a hidden layer and an output layer; the input layer is connected to the output of the histogram connection layer, and the output Input_1 of the input layer is the same as the output H of the histogram connection layer, i.e., Input_1 = H; the input layer has r_1 = N_K neurons in total; the hidden layer has z_1 neurons, which are fully connected to all output units of the input layer; the output layer of the multilayer perceptron has o neurons, each representing one action category; the weight matrix between the input layer and the hidden layer is denoted W_h1, and the weight matrix between the hidden layer and the output layer is denoted W_o1;
the output Q_1 of the hidden layer neurons is calculated as:
Q_1 = φ_elu(W_h1·Input_1 + b_h1),
where φ_elu is the elu activation function and b_h1 is the bias vector of the hidden layer;
the output O_1 of the output layer of the multilayer perceptron is:
O_1 = φ_softmax(W_o1·Q_1 + b_o1),
where φ_softmax is the softmax activation function and b_o1 is the bias vector of the output layer;
the loss function L_1 of the feature quantization network is:
L_1 = −Σ_{g=1}^{G} l^g·log(O_1^g),
where O_1^g is the output vector of the multilayer perceptron for the g-th sample and l^g is the expected output vector of the g-th sample, defined as the one-hot vector whose l_g-th component is 1 and whose remaining components are 0; G is the total number of samples and l_g is the label value of the g-th sample (a code sketch follows this claim).
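A sketch of the multilayer perceptron and loss of claim 5 in PyTorch is given below; the hidden width z_1 is a free hyper-parameter, so its value is left to the caller. The perceptrons of claims 8 and 11 share the same structure, with input widths K_1×o and K_2×o instead of N_K.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantizationMLP(nn.Module):
    """Input layer -> hidden layer (elu) -> output layer (softmax), as in claim 5."""
    def __init__(self, num_centers, hidden_width, num_classes):
        super().__init__()
        self.hidden = nn.Linear(num_centers, hidden_width)  # W_h1 and b_h1
        self.out = nn.Linear(hidden_width, num_classes)     # W_o1 and b_o1

    def forward(self, H):                         # H: (batch, N_K) histogram expressions
        Q1 = F.elu(self.hidden(H))                # Q_1 = elu(W_h1 . H + b_h1)
        return torch.softmax(self.out(Q1), dim=1) # O_1: per-class probabilities

def quantization_loss(O1, labels):
    """L_1 = -sum_g l^g . log(O_1^g), with l^g the one-hot vector of label l_g.
    `labels` is a 1-D tensor of integer class labels."""
    one_hot = F.one_hot(labels, num_classes=O1.shape[1]).float()
    return -(one_hot * torch.log(O1 + 1e-12)).sum()
```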
6. A motion recognition method according to claim 5, wherein: in the fifth step, the method for finding out the co-occurrence feature center group of each action category comprises the following steps:
inputting the dynamic image of the training action video sample of each category into a feature extractor and a feature soft quantizer in a trained feature quantization network to obtain histogram expression;
for each action category, the histogram expression of the a-th training action video sample is H^a = [h_1^a, h_2^a, …, h_{N_K}^a]; each dimension of the histogram expression corresponds to one feature center, and its value is the response of the training action video sample to that feature center; for each action category, the covariance between the feature centers within that category is calculated;
for any action category, the covariance Cov(k_1, k_2) between the k_1-th feature center and the k_2-th feature center within that category is calculated as:
Cov(k_1, k_2) = (1/N_θ) · Σ_{a=1}^{N_θ} (h_{k_1}^a − h̄_{k_1}) · (h_{k_2}^a − h̄_{k_2}),
where h_{k_1}^a denotes the response of the a-th training action video sample of the category to the k_1-th feature center, h̄_{k_1} denotes the average response of all training action video samples of the category to the k_1-th feature center, N_θ is the total number of training action video samples of the category, k_1 ∈ [1, N_K−1] and k_2 ∈ [k_1+1, N_K];
for each action category, the covariances between all feature centers are calculated in the above manner; then, starting from the first action category, the calculated covariance values are sorted and the K_1 largest covariances are selected, each covariance corresponding to a group of feature centers; the larger the covariance, the higher the probability that the features represented by that group of feature centers co-occur in the action category; each group of feature centers represents image features that co-occur in the dynamic images of that category of motion video samples; if a feature center group found for a later action category duplicates one found for an earlier action category, the duplicated group is not counted among the groups found for the later category, and the group corresponding to the next largest covariance is selected instead; K_1 groups of feature centers are thus found for each action category, and with o action categories in total, K_1 × o groups of feature centers are found (a code sketch follows this claim).
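The selection in claim 6 amounts to ranking within-class covariances and keeping the K_1 largest feature-center pairs per class while skipping pairs already taken by earlier classes. A sketch follows; the same procedure is reused in claim 9 on the outputs of the image feature co-occurrence layer.

```python
import numpy as np

def top_cooccurrence_pairs(class_histograms, K1):
    """class_histograms: dict mapping class id -> (N_theta, N_K) array holding
    the histogram expressions of that class's training samples. Returns, per
    class, the K1 feature-center pairs with the largest within-class
    covariance, skipping pairs already chosen for earlier classes."""
    used, groups = set(), {}
    for cls in sorted(class_histograms):
        H = class_histograms[cls]
        cov = np.cov(H, rowvar=False, bias=True)    # Cov(k1, k2) with 1/N_theta
        i, j = np.triu_indices(cov.shape[0], k=1)   # all pairs k1 < k2
        order = np.argsort(cov[i, j])[::-1]         # largest covariance first
        picked = []
        for idx in order:
            pair = (int(i[idx]), int(j[idx]))
            if pair in used:
                continue                            # already used by an earlier class
            picked.append(pair)
            used.add(pair)
            if len(picked) == K1:
                break
        groups[cls] = picked
    return groups
```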
7. A motion recognition method according to claim 6, characterized in that: in the sixth step, the method for constructing the image feature co-occurrence layer comprises the following steps:
according to the found K_1 × o groups of feature centers, a new layer is constructed after the feature soft quantizer, called the image feature co-occurrence layer; this layer has K_1 × o neurons in total, each corresponding to one of the found groups of feature centers, and the value of each neuron is the product of the response values of that group of feature centers in the histogram of the motion video sample; such a neuron is called a co-occurrence image feature neuron; the output of the image feature co-occurrence layer is S = [s_1, s_2, …, s_{K_1×o}],
where the output s_b of the b-th co-occurrence image feature neuron is calculated as:
s_b = h_{b1} × h_{b2},
where h_{b1} and h_{b2} are respectively the response values of the motion video sample to the two feature centers corresponding to the b-th co-occurrence image feature neuron (a code sketch follows this claim).
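The image feature co-occurrence layer of claim 7 simply multiplies, for each selected pair of feature centers, the two corresponding histogram responses. A sketch follows; the semantic feature co-occurrence layer of claim 10 applies the same product rule to pairs of co-occurrence image feature neuron outputs.

```python
import numpy as np

def image_cooccurrence_layer(H, center_pairs):
    """H: (N_K,) histogram expression of one sample; center_pairs: the flat
    list of all K1*o selected (k1, k2) feature-center pairs. Each neuron
    outputs s_b = h_{b1} * h_{b2}."""
    return np.array([H[k1] * H[k2] for (k1, k2) in center_pairs])
```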
8. A motion recognition method according to claim 7, wherein: in the seventh step, the action recognition network based on the co-occurrence image features comprises a feature extractor, a feature soft quantizer, an image feature co-occurrence layer and a multilayer perceptron;
the multilayer perceptron comprises an input layer, a hidden layer and an output layer; the input layer is connected to the output of the image feature co-occurrence layer, and the output Input_2 of the input layer is the same as the output S of the image feature co-occurrence layer, i.e., Input_2 = S; the input layer has r_2 = K_1 × o neurons; the hidden layer has z_2 neurons, which are fully connected to all output units of the input layer; the output layer of the multilayer perceptron has o neurons, each representing one action category; the weight matrix between the input layer and the hidden layer is denoted W_h2, and the weight matrix between the hidden layer and the output layer is denoted W_o2;
the output Q_2 of the hidden layer neurons is calculated as:
Q_2 = φ_elu(W_h2·Input_2 + b_h2),
where φ_elu is the elu activation function and b_h2 is the bias vector of the hidden layer;
the output O_2 of the output layer of the multilayer perceptron is:
O_2 = φ_softmax(W_o2·Q_2 + b_o2),
where φ_softmax is the softmax activation function and b_o2 is the bias vector of the output layer;
the loss function L_2 of the action recognition network based on co-occurrence image features is:
L_2 = −Σ_{g=1}^{G} l^g·log(O_2^g),
where O_2^g is the output vector of the multilayer perceptron for the g-th sample.
9. A motion recognition method according to claim 8, wherein: in the eighth step, the method for finding the co-occurrence image feature neuron groups of each action category is as follows:
the dynamic images of the training action video samples of each action category are input into the trained action recognition network based on co-occurrence image features to obtain the output S of the image feature co-occurrence layer;
for each action category, the output of the a-th training action video sample at the image feature co-occurrence layer is S^a = [s_1^a, s_2^a, …, s_{K_1×o}^a]; each output dimension corresponds to one co-occurrence image feature neuron in the image feature co-occurrence layer; the covariance between the co-occurrence image feature neurons within that action category is calculated;
for any action category, the covariance Cov(d_1, d_2) between the d_1-th co-occurrence image feature neuron and the d_2-th co-occurrence image feature neuron within that category is calculated as:
Cov(d_1, d_2) = (1/N_θ) · Σ_{a=1}^{N_θ} (s_{d_1}^a − s̄_{d_1}) · (s_{d_2}^a − s̄_{d_2}),
where s_{d_1}^a denotes the output of the d_1-th co-occurrence image feature neuron for the a-th training action video sample of the category, s̄_{d_1} denotes the average output of the d_1-th co-occurrence image feature neuron over all training action video samples of the category, d_1 ∈ [1, K_1×o−1] and d_2 ∈ [d_1+1, K_1×o];
for each action category, the covariances between all co-occurrence image feature neurons are calculated in the above manner; then, starting from the first action category, the calculated covariance values are sorted and the K_2 largest covariances are selected, each corresponding to a group of co-occurrence image feature neurons; the larger the covariance, the higher the probability that this group of co-occurrence image feature neurons co-occurs in the category; each group of co-occurrence image feature neurons represents semantic features that co-occur in the dynamic images of that category of motion video samples; if a co-occurrence image feature neuron group found for a later action category duplicates one found for an earlier action category, the duplicated group is not counted among the groups found for the later category, and the group corresponding to the next largest covariance is selected instead; K_2 groups of co-occurrence image feature neurons are thus found for each action category, and with o action categories in total, K_2 × o groups of co-occurrence image feature neurons are found.
10. A motion recognition method according to claim 9, wherein: in the ninth step, the method for constructing the semantic feature co-occurrence layer comprises the following steps:
according to the found K_2 × o groups of co-occurrence image feature neurons, a new layer is constructed after the image feature co-occurrence layer, called the semantic feature co-occurrence layer; this layer has K_2 × o neurons in total, called co-occurrence semantic feature neurons; each neuron corresponds to one of the found groups of co-occurrence image feature neurons, and its value is the product of the outputs of the action video sample at that group of co-occurrence image feature neurons; the output of the semantic feature co-occurrence layer is M = [m_1, m_2, …, m_{K_2×o}],
where the output m_χ of the χ-th co-occurrence semantic feature neuron is calculated as:
m_χ = m_{χ1} × m_{χ2},
where m_{χ1} and m_{χ2} are respectively the output values of the two co-occurrence image feature neurons corresponding to the χ-th co-occurrence semantic feature neuron of the action video sample.
11. A motion recognition method according to claim 10, wherein: in the tenth step, the action recognition network based on hierarchical co-occurrence features comprises a feature extractor, a feature soft quantizer, an image feature co-occurrence layer, a semantic feature co-occurrence layer and a multilayer perceptron;
the multilayer perceptron comprises an input layer, a hidden layer and an output layer; the input layer is connected to the output of the semantic feature co-occurrence layer, and the output Input_3 of the input layer is the same as the output M of the semantic feature co-occurrence layer, i.e., Input_3 = M; the input layer has r_3 = K_2 × o neurons; the hidden layer has z_3 neurons, which are fully connected to all output units of the input layer; the output layer of the multilayer perceptron has o neurons, each representing one action category; the weight matrix between the input layer and the hidden layer is denoted W_h3, and the weight matrix between the hidden layer and the output layer is denoted W_o3;
the output Q_3 of the hidden layer neurons is calculated as:
Q_3 = φ_elu(W_h3·Input_3 + b_h3),
where φ_elu is the elu activation function and b_h3 is the bias vector of the hidden layer;
the output O_3 of the output layer of the multilayer perceptron is:
O_3 = φ_softmax(W_o3·Q_3 + b_o3),
where φ_softmax is the softmax activation function and b_o3 is the bias vector of the output layer;
the loss function L_3 of the action recognition network based on hierarchical co-occurrence features is:
L_3 = −Σ_{g=1}^{G} l^g·log(O_3^g),
where O_3^g is the output vector of the multilayer perceptron for the g-th sample;
the input of the action recognition network based on hierarchical co-occurrence features is the dynamic image of an action video sample, and the output is the probability that the sample belongs to each action category.
12. A motion recognition method according to claim 11, wherein: in the eleventh step, the specific method for realizing action recognition is as follows:
the dynamic image of the test action video sample is calculated and input into the trained action recognition network based on hierarchical co-occurrence features to obtain the predicted probability of each action category for the current test action video sample; the action category with the largest probability is the final predicted category of the current test action video sample, thereby realizing action recognition.
CN202110472752.8A 2021-04-29 2021-04-29 Action recognition method Active CN113221693B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110472752.8A CN113221693B (en) 2021-04-29 2021-04-29 Action recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110472752.8A CN113221693B (en) 2021-04-29 2021-04-29 Action recognition method

Publications (2)

Publication Number Publication Date
CN113221693A true CN113221693A (en) 2021-08-06
CN113221693B CN113221693B (en) 2023-07-28

Family

ID=77090049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110472752.8A Active CN113221693B (en) 2021-04-29 2021-04-29 Action recognition method

Country Status (1)

Country Link
CN (1) CN113221693B (en)


Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20130108177A1 (en) * 2011-11-01 2013-05-02 Google Inc. Image matching using motion manifolds
CN103605989A (en) * 2013-11-20 2014-02-26 康江科技(北京)有限责任公司 Multi-view behavior identification method based on largest-interval meaning clustering
US20170161606A1 (en) * 2015-12-06 2017-06-08 Beijing University Of Technology Clustering method based on iterations of neural networks
CN106778854A (en) * 2016-12-07 2017-05-31 西安电子科技大学 Activity recognition method based on track and convolutional neural networks feature extraction
CN107341452A (en) * 2017-06-20 2017-11-10 东北电力大学 Human bodys' response method based on quaternary number space-time convolutional neural networks
US20180373985A1 (en) * 2017-06-23 2018-12-27 Nvidia Corporation Transforming convolutional neural networks for visual sequence learning
CN109446923A (en) * 2018-10-10 2019-03-08 北京理工大学 Depth based on training characteristics fusion supervises convolutional neural networks Activity recognition method
CN110119707A (en) * 2019-05-10 2019-08-13 苏州大学 A kind of human motion recognition method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
XIAOFENG ZHAO et al.: "Discriminative pose analysis for human action recognition", 2020 IEEE 6th World Forum on Internet of Things, pages 1-6 *

Also Published As

Publication number Publication date
CN113221693B (en) 2023-07-28

Similar Documents

Publication Publication Date Title
Zheng et al. PAC-Bayesian framework based drop-path method for 2D discriminative convolutional network pruning
CN112308158B (en) Multi-source field self-adaptive model and method based on partial feature alignment
CN113221694B (en) Action recognition method
CN109063719B (en) Image classification method combining structure similarity and class information
Lai et al. Form design of product image using grey relational analysis and neural network models
CN108875787A (en) A kind of image-recognizing method and device, computer equipment and storage medium
CN109165692B (en) User character prediction device and method based on weak supervised learning
Cai et al. A novel hyperspectral image classification model using bole convolution with three-direction attention mechanism: small sample and unbalanced learning
CN107766850A (en) Based on the face identification method for combining face character information
CN110119707B (en) Human body action recognition method
CN111414461A (en) Intelligent question-answering method and system fusing knowledge base and user modeling
CN109344888A (en) A kind of image-recognizing method based on convolutional neural networks, device and equipment
CN108596264A (en) A kind of community discovery method based on deep learning
CN113159171B (en) Plant leaf image fine classification method based on counterstudy
CN112256878B (en) Rice knowledge text classification method based on deep convolution
CN114625908A (en) Text expression package emotion analysis method and system based on multi-channel attention mechanism
CN112926645B (en) Electricity stealing detection method based on edge calculation
Srigurulekha et al. Food image recognition using CNN
CN110070070B (en) Action recognition method
CN113221693B (en) Action recognition method
Sharma et al. Sm2n2: A stacked architecture for multimodal data and its application to myocardial infarction detection
Kim et al. Tweaking deep neural networks
Dong et al. A biologically inspired system for classification of natural images
Lv et al. Deep convolutional network based on interleaved fusion group
Guzzi et al. Distillation of a CNN for a high accuracy mobile face recognition system

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant