CN113221693A - Action recognition method - Google Patents
- Publication number
- CN113221693A (application CN202110472752.8A)
- Authority
- CN
- China
- Prior art keywords: feature, occurrence, layer, action, image
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G06V40/20 — Movements or behaviour, e.g. gesture recognition (under G06V40/00, recognition of biometric, human-related or animal-related patterns in image or video data)
- G06F18/214 — Generating training patterns; bootstrap methods, e.g. bagging or boosting (under G06F18/00, pattern recognition)
- G06F18/23 — Clustering techniques (under G06F18/00, pattern recognition)
- G06N3/04 — Neural networks; architecture, e.g. interconnection topology
- G06N3/08 — Neural networks; learning methods
- G06V20/46 — Extracting features or characteristics from video content, e.g. video fingerprints, representative shots or key frames (under G06V20/40, scene-specific elements in video content)
Abstract
The invention provides an action recognition method comprising: calculating a dynamic image for each action video sample; inputting the dynamic image into a feature extractor to obtain its feature vectors; constructing a feature center group; inputting all feature vectors into the feature centers and accumulating the outputs at each center to obtain a histogram representation; feeding the histogram representation into a multilayer perceptron to form a feature quantization network; training the feature quantization network to convergence and finding the co-occurrence feature center group of each action category; constructing an image feature co-occurrence layer; constructing an action recognition network based on co-occurrence image features, training it to convergence, and finding the co-occurrence image feature neuron group of each action category; constructing a semantic feature co-occurrence layer; and constructing an action recognition network based on hierarchical co-occurrence features, training it to convergence, calculating the dynamic image of a test action video sample, and inputting it into the trained network to realize action recognition.
Description
Technical Field
The invention relates to an action recognition method and belongs to the technical field of action recognition.
Background
Action recognition is an important subject in the field of computer vision and is widely applied in video surveillance, behavior analysis, human-computer interaction and related fields. With the development of social networks and the popularization of RGB devices, action recognition based on RGB video has attracted increasing attention. Compared with skeleton-based action recognition, RGB-video-based action recognition can acquire data more easily and is more reliable.
Most existing methods extract deep features from a video by designing deep convolutional neural networks of various structures and use these features for action recognition. Such methods typically ignore the interpretability of the extracted features and feed stacks of static frame images into a complex convolutional neural network, which places high demands on computing hardware.
An action recognition method is therefore proposed to address these problems.
Disclosure of Invention
To solve the above problems in the prior art, the invention adopts the following technical solution:
an action recognition method comprising the following steps:
step one, calculating a dynamic image of each action video sample;
step two, inputting the dynamic image of the action video sample into a feature extractor to obtain the feature vectors of the dynamic image;
step three, constructing a feature center group; inputting all feature vectors of the dynamic images of the action video samples into the feature centers and accumulating all outputs at each feature center to obtain a histogram representation;
step four, inputting the histogram representation into a multilayer perceptron to form a feature quantization network;
step five, inputting the dynamic images of all training video samples into the feature quantization network, training it to convergence, and finding the co-occurrence feature center group of each action category;
step six, constructing an image feature co-occurrence layer;
step seven, inputting the output of the image feature co-occurrence layer into a multilayer perceptron to construct an action recognition network based on co-occurrence image features;
step eight, inputting the dynamic images of all training action video samples into the network of step seven, training it to convergence, and finding the co-occurrence image feature neuron group of each action category;
step nine, constructing a semantic feature co-occurrence layer;
step ten, inputting the output of the semantic feature co-occurrence layer into a multilayer perceptron to construct an action recognition network based on hierarchical co-occurrence features;
step eleven, training the network of step ten to convergence, calculating the dynamic image of a test action video sample, and inputting it into the trained network to realize action recognition.
Preferably, in step one, the dynamic image of an action video sample is calculated as follows:
Each action video sample consists of all frames of its video. For any action video sample A:
A = {I_t | t ∈ [1, T]},
where t is the time index and T is the total number of frames of sample A; I_t ∈ ℝ^{R×C×3} is the matrix representation of the t-th frame, with R, C and 3 the numbers of rows, columns and channels of the matrix respectively, and ℝ denoting a real matrix; each element of I_t is a pixel value of the t-th frame.
For any action video sample A, I_t is first vectorized, i.e. all row vectors of the three channels of I_t are concatenated into a new row vector i_t.
The arithmetic square root of each element of i_t is taken, giving a new vector w_t:
w_t = √(i_t),
where √(·) takes the arithmetic square root of each element of i_t; w_t is called the frame vector of the t-th frame of sample A.
The feature vector v_t of the t-th frame of sample A is calculated as:
v_t = (1/t) Σ_{τ=1}^{t} w_τ,
where the sum runs over the frame vectors of frames 1 to t of sample A.
The score B_t of the t-th frame I_t of sample A is:
B_t = u^T · v_t,
where u is a vector of dimension f, with f = R × C × 3; u^T is the transpose of u, and u^T · v_t is the dot product of the transposed u and the feature vector v_t.
u is computed so that frames occurring later in the action video score higher, i.e. the larger t is, the higher B_t. u can be obtained with RankSVM:
u* = argmin_u E(u),  E(u) = (λ/2)‖u‖² + (2/(T(T−1))) Σ_{i>j} max{0, 1 − B_i + B_j},
where u* is the u that minimizes E(u), λ is a constant, ‖u‖² is the sum of squares of the elements of u, B_i and B_j are the scores of the i-th and j-th frames of sample A, and max{0, 1 − B_i + B_j} selects the larger of 0 and 1 − B_i + B_j.
After u is computed with RankSVM, it is rearranged into an image u′ of the same size as I_t; u′ is the dynamic image of action video sample A.
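The construction above can be sketched in numpy. As a hedge, this sketch replaces the patent's full RankSVM optimization of u with the closed-form approximate rank pooling weights commonly used for dynamic images; the function name and this substitution are illustrative, not the patent's exact procedure:

```python
import numpy as np

def dynamic_image(frames):
    """Compute a dynamic image from a list of (R, C, 3) frames.
    Uses closed-form approximate rank pooling weights as a stand-in
    for the RankSVM solve described in the patent."""
    T = len(frames)
    # i_t: each frame flattened to a row vector; w_t = element-wise sqrt
    W = np.sqrt(np.stack([f.reshape(-1).astype(np.float64) for f in frames]))
    # v_t: running mean of the frame vectors w_1..w_t
    V = np.cumsum(W, axis=0) / np.arange(1, T + 1)[:, None]
    # closed-form weights alpha_t approximating the RankSVM solution u
    harm = np.concatenate(([0.0], np.cumsum(1.0 / np.arange(1, T + 1))))
    t = np.arange(1, T + 1)
    alpha = 2 * (T - t + 1) - (T + 1) * (harm[T] - harm[t - 1])
    u = (alpha[:, None] * V).sum(axis=0)
    # rearrange u back into image form u'
    return u.reshape(frames[0].shape)
```

Because the weights alpha_t grow with t for typical videos, frames later in the sequence dominate the pooled image, matching the requirement that B_t increase with t.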
Further, in step two, the feature extractor consists of a series of convolutional layers and pooling layers. The dynamic image of each action video sample is input into the feature extractor, which outputs a feature map F ∈ ℝ^{K1×K2×D}, where K1, K2 and D are the height, width and number of channels of the output feature map respectively. The feature map F has K1×K2 pixel locations in total; the feature vector x_y of each pixel location has dimension D, i.e. the number of channels of F, with y = 1, 2, …, K1×K2. The feature vectors of the dynamic image can thus be represented by the set X = {x_y | y = 1, 2, …, K1×K2}.
Further, in step three, the feature center group contains N_K feature centers, each corresponding to a scale factor. The initial values of each feature center and its scale factor are computed as follows:
The feature vectors of the dynamic images of all training action video samples are computed and clustered; the number of clusters equals the number of feature centers, i.e. N_K. Each cluster has a cluster center, and the cluster-center values obtained by clustering are used as the initial values of the feature centers. For the k-th cluster, the set of all feature vectors in the cluster is denoted E_k = {e_1, e_2, …, e_{N_k}}, containing N_k feature vectors.
The Euclidean distance d_{q,τ} between feature vectors is:
d_{q,τ} = sqrt( Σ_d ([e_q]_d − [e_τ]_d)² ),
where [e_q]_d is the d-th dimension of e_q, q ∈ [1, N_k−1], τ ∈ [q+1, N_k]. The initial value of the scale factor σ_k of the k-th feature center is the mean of these pairwise distances:
σ_k = (2/(N_k(N_k−1))) Σ_{q=1}^{N_k−1} Σ_{τ=q+1}^{N_k} d_{q,τ}.
For a feature vector x_y of the dynamic image, its distance-based response at the k-th feature center c_k is computed as:
W_k(x_y) = exp(−‖x_y − c_k‖²/σ_k).
The output of x_y at the k-th feature center is then normalized over all centers:
W̄_k(x_y) = W_k(x_y) / Σ_{k′=1}^{N_K} W_{k′}(x_y).
All feature vectors of the dynamic image of each action video sample are input into every feature center of the group, and all outputs at each center are accumulated. The accumulated output h_k of the k-th feature center is:
h_k = Σ_{y=1}^{K1×K2} W̄_k(x_y).
Concatenating the accumulated values of all feature centers gives the histogram representation H of the dynamic image:
H = [h_1, h_2, …, h_{N_K}].
The feature center group and the accumulation layer that sums its outputs together form the feature soft quantizer; its input is the feature vectors of the dynamic image of each action video sample, and its output is the histogram representation of the dynamic image.
Further, in step four, the feature quantization network comprises the feature extractor, the feature soft quantizer and a multilayer perceptron.
The multilayer perceptron comprises an input layer, a hidden layer and an output layer. The input layer is connected to the output of the histogram layer; the output Input_1 of the input layer equals the histogram H, i.e. Input_1 = H, so the input layer has r_1 = N_K neurons. The hidden layer has z_1 neurons, fully connected to all output units of the input layer; the output layer of the multilayer perceptron has o neurons, each representing one action category. The weights between the input layer and the hidden layer are denoted W_1^{ih} ∈ ℝ^{r_1×z_1}, and the weights between the hidden layer and the output layer are denoted W_1^{ho} ∈ ℝ^{z_1×o}.
The output Q_1 of the hidden neurons is computed as:
Q_1 = φ_elu(Input_1 · W_1^{ih} + b_1^{h}),
where φ_elu is the ELU activation function and b_1^{h} is the bias vector of the hidden layer.
The output O_1 of the output layer of the multilayer perceptron is:
O_1 = φ_softmax(Q_1 · W_1^{ho} + b_1^{o}),
where φ_softmax is the softmax activation function and b_1^{o} is the bias vector of the output layer.
The loss function L_1 of the feature quantization network is the cross-entropy:
L_1 = −(1/G) Σ_{g=1}^{G} l_g^T · log(O_1^{g}),
where O_1^{g} is the output vector of the multilayer perceptron for the g-th sample and l_g is the desired output vector for the g-th sample, defined as the one-hot vector with [l_g]_c = 1 when c equals the label value of the g-th sample and 0 otherwise; G is the total number of samples.
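The classifier head shared by all three networks in the patent (ELU hidden layer, softmax output, cross-entropy loss) can be sketched as follows; the weight and bias names are illustrative:

```python
import numpy as np

def elu(x):
    # ELU activation: x for x > 0, exp(x) - 1 otherwise
    return np.where(x > 0, x, np.expm1(x))

def mlp_forward(H, W_ih, b_h, W_ho, b_o):
    """Forward pass of the classifier head: one ELU hidden layer
    followed by a softmax output layer."""
    Q = elu(H @ W_ih + b_h)            # hidden layer output Q
    z = Q @ W_ho + b_o
    e = np.exp(z - z.max())            # numerically stable softmax
    return e / e.sum()                 # output O, per-class probabilities

def cross_entropy(O, label, n_classes):
    # loss for one sample against the one-hot desired output l_g
    l = np.zeros(n_classes)
    l[label] = 1.0
    return float(-(l * np.log(O + 1e-12)).sum())
```

The same forward pass serves the feature quantization network, the co-occurrence image feature network and the hierarchical co-occurrence network; only the input vector (H, S or M) and the layer sizes change.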
Further, in step five, the co-occurrence feature center group of each action category is found as follows:
The dynamic images of the training action video samples of each category are input into the feature extractor and feature soft quantizer of the trained feature quantization network to obtain histogram representations.
For each action category, the a-th training action video sample has histogram representation H^a = [h_1^a, …, h_{N_K}^a]. Each dimension of the histogram corresponds to one feature center, and its value is the response of the training sample to that center. For each action category, the covariance between feature centers within the category is computed.
For any action category, the covariance Cov(k_1, k_2) between the k_1-th and k_2-th feature centers is:
Cov(k_1, k_2) = (1/N_θ) Σ_{a=1}^{N_θ} (h_{k_1}^a − h̄_{k_1})(h_{k_2}^a − h̄_{k_2}),
where h_{k_1}^a is the response of the a-th training action video sample of the category to the k_1-th feature center, h̄_{k_1} is the average response of all training samples of the category to the k_1-th center, N_θ is the total number of training action video samples of the category, k_1 ∈ [1, N_K−1] and k_2 ∈ [k_1+1, N_K].
For each action category, the covariances between all feature centers are computed in this way. Starting from the first action category, the computed covariance values are sorted and the K_1 largest are selected; each covariance corresponds to a group (pair) of feature centers. The larger the covariance, the higher the likelihood that the features represented by the group co-occur in the action category; each group of feature centers represents image features that co-occur in the dynamic images of that category's samples. If a group found for a later-ranked category duplicates one found for an earlier-ranked category, the duplicate is not counted for the later category and the group corresponding to the next-largest covariance is selected instead. K_1 groups of feature centers are found per action category; with o categories in total, K_1 × o groups are found.
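A minimal sketch of this selection, using numpy's population covariance (bias=True matches the 1/N_θ normalization above) and the patent's rule of skipping pairs already claimed by earlier categories; the function name is illustrative:

```python
import numpy as np
from itertools import combinations

def co_occurrence_groups(H_per_class, K1):
    """H_per_class: one (N_samples, N_K) histogram matrix per action class.
    Returns, per class, the K1 centre pairs with the largest covariance,
    skipping pairs already taken by earlier classes."""
    taken, groups = set(), []
    for Hc in H_per_class:
        cov = np.cov(Hc, rowvar=False, bias=True)       # (N_K, N_K), 1/N norm
        pairs = sorted(combinations(range(Hc.shape[1]), 2),
                       key=lambda p: cov[p], reverse=True)
        chosen = [p for p in pairs if p not in taken][:K1]
        taken.update(chosen)
        groups.append(chosen)
    return groups
```

The same routine, applied to the image feature co-occurrence layer outputs S instead of histograms H, yields the co-occurrence image feature neuron groups of step eight.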
Further, in step six, the image feature co-occurrence layer is constructed as follows:
According to the K_1 × o groups of feature centers found, a new layer is constructed after the feature soft quantizer, called the image feature co-occurrence layer. The layer has K_1 × o neurons in total, each corresponding to one found group of feature centers; the value of each neuron is the product of the responses of its group's feature centers in the histogram of the action video sample. These neurons are called co-occurrence image feature neurons. The output of the image feature co-occurrence layer is S = [s_1, s_2, …, s_{K_1×o}],
where the output s_b of the b-th co-occurrence image feature neuron is:
s_b = h_{b_1} · h_{b_2},
where h_{b_1} and h_{b_2} are the responses of the action video sample to the two feature centers corresponding to the b-th co-occurrence image feature neuron.
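The product-of-pairs layer is simple enough to state in a few lines; this sketch also serves the semantic feature co-occurrence layer of step nine, which applies the identical operation to S instead of H:

```python
import numpy as np

def co_occurrence_layer(x, groups):
    """x: 1-D response vector (histogram H, or layer output S);
    groups: list of index pairs. Each neuron outputs the product of the
    responses of its pair, i.e. s_b = x[b1] * x[b2]."""
    return np.array([x[i] * x[j] for (i, j) in groups])
```

For example, chaining it twice gives the hierarchy: S = co_occurrence_layer(H, image_groups), then M = co_occurrence_layer(S, semantic_groups).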
Further, in step seven, the action recognition network based on co-occurrence image features comprises the feature extractor, the feature soft quantizer, the image feature co-occurrence layer and a multilayer perceptron.
The multilayer perceptron comprises an input layer, a hidden layer and an output layer. The input layer is connected to the output of the image feature co-occurrence layer; the output Input_2 of the input layer equals the layer output S, i.e. Input_2 = S, so the input layer has r_2 = K_1 × o neurons. The hidden layer has z_2 neurons, fully connected to all output units of the input layer; the output layer has o neurons, each representing one action category. The weights between the input layer and the hidden layer are denoted W_2^{ih} ∈ ℝ^{r_2×z_2}, and the weights between the hidden layer and the output layer are denoted W_2^{ho} ∈ ℝ^{z_2×o}.
The output Q_2 of the hidden neurons is computed as:
Q_2 = φ_elu(Input_2 · W_2^{ih} + b_2^{h}),
where φ_elu is the ELU activation function and b_2^{h} is the bias vector of the hidden layer.
The output O_2 of the output layer of the multilayer perceptron is:
O_2 = φ_softmax(Q_2 · W_2^{ho} + b_2^{o}),
where φ_softmax is the softmax activation function and b_2^{o} is the bias vector of the output layer.
The loss function L_2 of the action recognition network based on co-occurrence image features is the same cross-entropy as L_1:
L_2 = −(1/G) Σ_{g=1}^{G} l_g^T · log(O_2^{g}),
where O_2^{g} is the output vector of the multilayer perceptron for the g-th sample.
Further, in step eight, the co-occurrence image feature neuron group of each action category is found as follows:
The dynamic images of each action category's training action video samples are input into the trained action recognition network based on co-occurrence image features to obtain the output S of the image feature co-occurrence layer.
For each action category, the output of the a-th training action video sample at the image feature co-occurrence layer is S^a = [s_1^a, …, s_{K_1×o}^a]; each output dimension corresponds to one co-occurrence image feature neuron in the layer. The covariance between co-occurrence image feature neurons within the category is computed.
For any action category, the covariance Cov(d_1, d_2) between the d_1-th and d_2-th co-occurrence image feature neurons is:
Cov(d_1, d_2) = (1/N_θ) Σ_{a=1}^{N_θ} (s_{d_1}^a − s̄_{d_1})(s_{d_2}^a − s̄_{d_2}),
where s_{d_1}^a is the output of the d_1-th co-occurrence image feature neuron for the a-th training sample of the category, s̄_{d_1} is the average output of the d_1-th neuron over all training samples of the category, d_1 ∈ [1, K_1×o−1] and d_2 ∈ [d_1+1, K_1×o].
For each action category, the covariances between all co-occurrence image feature neurons are computed in this way. Starting from the first action category, the covariance values are sorted and the K_2 largest are selected, each corresponding to a group of co-occurrence image feature neurons. The larger the covariance, the higher the probability that the group's neurons co-occur in the category; each group of co-occurrence image feature neurons represents semantic features that co-occur in the dynamic images of that category's samples. If a group found for a later-ranked category duplicates one found for an earlier-ranked category, the duplicate is not counted for the later category and the group with the next-largest covariance is selected instead. K_2 groups are found per action category; with o categories in total, K_2 × o groups of co-occurrence image feature neurons are found.
Further, in step nine, the semantic feature co-occurrence layer is constructed as follows:
According to the K_2 × o groups of co-occurrence image feature neurons found, a new layer is constructed after the image feature co-occurrence layer, called the semantic feature co-occurrence layer. The layer has K_2 × o neurons, called co-occurrence semantic feature neurons; each corresponds to one found group of co-occurrence image feature neurons, and its value is the product of the outputs of the action video sample at that group's neurons. The output of the semantic feature co-occurrence layer is M = [m_1, m_2, …, m_{K_2×o}],
where the output m_χ of the χ-th co-occurrence semantic feature neuron is:
m_χ = s_{χ_1} · s_{χ_2},
where s_{χ_1} and s_{χ_2} are the outputs of the two co-occurrence image feature neurons corresponding to the χ-th co-occurrence semantic feature neuron for the action video sample.
Further, in step ten, the action recognition network based on hierarchical co-occurrence features comprises the feature extractor, the feature soft quantizer, the image feature co-occurrence layer, the semantic feature co-occurrence layer and a multilayer perceptron.
The multilayer perceptron comprises an input layer, a hidden layer and an output layer. The input layer is connected to the output of the semantic feature co-occurrence layer; the output Input_3 of the input layer equals the layer output M, i.e. Input_3 = M, so the input layer has r_3 = K_2 × o neurons. The hidden layer has z_3 neurons, fully connected to all output units of the input layer; the output layer has o neurons, each representing one action category. The weights between the input layer and the hidden layer are denoted W_3^{ih} ∈ ℝ^{r_3×z_3}, and the weights between the hidden layer and the output layer are denoted W_3^{ho} ∈ ℝ^{z_3×o}.
The output Q_3 of the hidden neurons is computed as:
Q_3 = φ_elu(Input_3 · W_3^{ih} + b_3^{h}),
where φ_elu is the ELU activation function and b_3^{h} is the bias vector of the hidden layer.
The output O_3 of the output layer of the multilayer perceptron is:
O_3 = φ_softmax(Q_3 · W_3^{ho} + b_3^{o}),
where φ_softmax is the softmax activation function and b_3^{o} is the bias vector of the output layer.
The loss function L_3 of the action recognition network based on hierarchical co-occurrence features is:
L_3 = −(1/G) Σ_{g=1}^{G} l_g^T · log(O_3^{g}),
where O_3^{g} is the output vector of the multilayer perceptron for the g-th sample.
The input of the action recognition network based on hierarchical co-occurrence features is the dynamic image of an action video sample; the output is the probability that the sample belongs to each action category.
Further, in step eleven, action recognition is implemented as follows:
The dynamic image of the test action video sample is calculated and input into the trained action recognition network based on hierarchical co-occurrence features, yielding the predicted probability of the current test sample belonging to each action category; the category with the largest probability is the final predicted category of the current test sample, thereby realizing action recognition.
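Assuming all parameters have already been trained, inference through the hierarchy can be sketched end to end. This is a minimal numpy sketch: the hidden layer of the final multilayer perceptron is omitted for brevity, and all parameter names are illustrative:

```python
import numpy as np

def recognise(X, centers, sigmas, img_pairs, sem_pairs, Wo, bo):
    """Inference sketch for the hierarchical co-occurrence network:
    feature vectors X (N, D) -> histogram -> image co-occurrence ->
    semantic co-occurrence -> softmax head."""
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
    R = np.exp(-d2 / sigmas[None, :])
    H = (R / R.sum(1, keepdims=True)).sum(0)               # histogram H
    S = np.array([H[i] * H[j] for i, j in img_pairs])      # image co-occurrence
    M = np.array([S[i] * S[j] for i, j in sem_pairs])      # semantic co-occurrence
    z = M @ Wo + bo
    e = np.exp(z - z.max())
    p = e / e.sum()                                        # class probabilities
    return int(np.argmax(p)), p
```

The arg-max over the output probabilities implements the final prediction rule of step eleven.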
The action recognition network based on hierarchical co-occurrence features can learn the co-occurring features in action videos, effectively increasing the discriminability of action sample representations. During training, only a single dynamic image is used as the compact representation of each action video, so the demands on computing hardware are modest.
Drawings
FIG. 1 is a flow chart of the operation of a method of motion recognition in accordance with the present invention.
FIG. 2 is a schematic view of a dynamic image according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of a feature extractor of the present invention.
Fig. 4 is a schematic diagram of grouped convolution module 1 in fig. 3.
Fig. 5 is a schematic diagram of grouped convolution module 2 or grouped convolution module 3 in fig. 3.
Fig. 6 is a schematic diagram of the feature quantization network of the present invention.
FIG. 7 is a schematic diagram of a motion recognition network based on co-occurrence image features according to the present invention.
FIG. 8 is a schematic diagram of the action recognition network based on the hierarchy co-occurrence feature of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, a motion recognition method includes the following steps:
1. The action video sample set contains 2000 action video samples across 10 action categories, with 200 samples per category. Three quarters of the samples of each category are randomly selected as the training set and the remaining quarter as the test set, giving 1500 training and 500 test action video samples. Each action video sample consists of all frames of its video. Take any action video sample A as an example:
A = {I_t | t ∈ [1, 40]},
where t is the time index; the sample has 40 frames in total. I_t ∈ ℝ^{240×320×3} is the matrix representation of the t-th frame, with 240 rows, 320 columns and 3 channels; each element of I_t is a pixel value of the t-th frame.
The dynamic image of the action video sample is calculated as follows:
For any action video sample A, I_t is first vectorized, i.e. all row vectors of the three channels of I_t are concatenated into a new row vector i_t.
The arithmetic square root of each element of i_t is taken, giving a new vector w_t:
w_t = √(i_t),
where √(·) takes the arithmetic square root of each element of i_t; w_t is the frame vector of the t-th frame of sample A.
The feature vector v_t of the t-th frame of sample A is calculated as:
v_t = (1/t) Σ_{τ=1}^{t} w_τ,
where the sum runs over the frame vectors of frames 1 to t of sample A.
The score B_t of the t-th frame I_t of sample A is:
B_t = u^T · v_t,
where u is a vector of dimension 230400; u^T is the transpose of u, and u^T · v_t is the dot product of the transposed u and the feature vector v_t.
u is computed so that frames occurring later in the action video score higher, i.e. the larger t is, the higher B_t. u is obtained with RankSVM:
u* = argmin_u E(u),  E(u) = (λ/2)‖u‖² + (2/(T(T−1))) Σ_{i>j} max{0, 1 − B_i + B_j},
where u* is the u that minimizes E(u), λ is a constant, ‖u‖² is the sum of squares of the elements of u, B_i and B_j are the scores of the i-th and j-th frames of sample A, and max{0, 1 − B_i + B_j} selects the larger of the two values.
After u is computed with RankSVM, it is rearranged into an image u′ of the same size as I_t; u′ is the dynamic image of action video sample A. Fig. 2 shows an example of the obtained dynamic image.
2. And inputting the dynamic image of the motion video sample into a feature extractor, and extracting a feature vector in the dynamic image. The feature extractor consists of a series of convolutional layers and pooling layers. The feature extractor is shown in fig. 3, and is composed of the first two modules of ResNext-50, namely convolution module 1 and convolution module 2.
Convolution module 1 contains a convolutional layer with 64 convolution kernels, each of size 7 × 7. Convolution module 2 comprises a max pooling layer and three grouped convolution modules; the pooling kernel of the max pooling layer is 3 × 3. Grouped convolution module 1 is shown in fig. 4: the first layer is a convolutional layer, the second layer is a grouped convolutional layer, the third layer is a convolutional layer, and the fourth layer is a residual addition layer. The first convolutional layer has 128 kernels, each of size 1 × 1. The second, grouped convolutional layer has 128 kernels, each of size 3 × 3: it splits its input feature map of size W_1 × H_1 × 128 into 32 groups of size W_1 × H_1 × 4 along the channel dimension, likewise splits the 128 kernels into 32 groups of 4 kernels each, convolves each feature-map group with its corresponding kernel group, and finally concatenates the per-group convolution results along the channel dimension to form the output of the grouped convolutional layer. The third convolutional layer has 256 kernels, each of size 1 × 1. The fourth, residual addition layer passes the input of the first convolutional layer through a residual convolutional layer (256 kernels, each of size 1 × 1) and adds its output to the output of the third convolutional layer; the sum is the output of the residual addition layer and of the first grouped convolution module. Grouped convolution modules 2 and 3 are similar to grouped convolution module 1, as shown in fig. 5, the only difference being that their fourth-layer residual addition layer adds the input of the first convolutional layer directly to the output of the third convolutional layer, with no residual convolutional layer.
The feature map F output by the feature extractor has height 30, width 40, and 256 channels, i.e. F ∈ R^(30×40×256). F thus has 30 × 40 = 1200 pixel positions in total, and the feature vector x_y of each pixel has dimension 256, the number of channels of F, with y = 1, 2, ..., 1200. The feature vectors of the dynamic image are represented by the set X = {x_y | y = 1, 2, ..., 1200}.
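For illustration, a minimal NumPy sketch of this reshaping, with random values standing in for real extractor output:

```python
import numpy as np

# Hypothetical sketch: the extractor outputs a feature map F with height 30,
# width 40 and 256 channels; each of the 30 * 40 = 1200 pixel positions
# yields one 256-dimensional feature vector x_y.
F = np.random.rand(30, 40, 256)

# Row y of X is the feature vector of pixel position y (y = 1, ..., 1200).
X = F.reshape(-1, 256)

print(X.shape)  # (1200, 256)
```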
3. Construct a feature center group containing 128 feature centers in total, each feature center corresponding to one scale factor. The initial value of each feature center and of its scale factor is computed as follows:
Extract the feature vectors from the dynamic images of all training motion video samples and cluster them; the number of clusters equals the number of feature centers, i.e. 128. Each cluster has a cluster center, and the value of the k-th cluster center is used as the initial value of the k-th feature center. Taking the first cluster as an example, let E_1 be the set of all feature vectors in that cluster, containing 300 vectors:

E_1 = {e_1, e_2, ..., e_300}
The Euclidean distance d_{q,τ} between vectors is calculated as:

d_{q,τ} = sqrt( Σ_{d=1}^{256} ([e_q]_d − [e_τ]_d)² ),

where [e_q]_d denotes the d-th dimension of the vector e_q, q ∈ [1, 299], τ ∈ [q+1, 300]. The initial value of the scale factor σ_1 of the first feature center is the mean of these pairwise distances:

σ_1 = (2 / (300 × 299)) Σ_{q=1}^{299} Σ_{τ=q+1}^{300} d_{q,τ}.
the initial values of 128 feature centers and the initial values of the corresponding scaling factors can be obtained in the above manner.
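For illustration, a minimal NumPy sketch of this initialisation for a single cluster; the cluster data is random, and the mean-pairwise-distance form of the scale factor is an assumption reconstructed from the distance formula above:

```python
import numpy as np
from itertools import combinations

# Hypothetical sketch of step 3 for one cluster: k-means (not shown) is
# assumed to have produced `cluster`, standing in for the set E_1.
rng = np.random.default_rng(0)
cluster = rng.normal(size=(300, 256))  # E_1 = {e_1, ..., e_300}

# Initial value of the feature centre: the cluster centre (mean vector).
c1 = cluster.mean(axis=0)

# Initial scale sigma_1: mean Euclidean distance over all vector pairs.
dists = [np.linalg.norm(cluster[q] - cluster[t])
         for q, t in combinations(range(len(cluster)), 2)]
sigma1 = float(np.mean(dists))

print(c1.shape, sigma1 > 0)
```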
4. For each feature vector x_y of the dynamic image, its distance to the k-th feature center c_k is computed and taken as its output at the k-th feature center c_k. The distance is calculated as:
W_k(x_y) = exp(−||x_y − c_k||² / σ_k),
The output obtained by inputting the feature vector x_y to the k-th feature center is then normalized over all 128 centers:

W̄_k(x_y) = W_k(x_y) / Σ_{k'=1}^{128} W_{k'}(x_y).
5. All feature vectors of the dynamic image of the motion video sample are input to each feature center of the feature center group, and all outputs at each feature center are accumulated. The accumulated output h_k of the k-th feature center is:

h_k = Σ_{y=1}^{1200} W̄_k(x_y).
and connecting the accumulated values of all the feature centers together to obtain a histogram expression H of the dynamic image:
H=(h1,h2,...,h128)。
The feature center group, together with the accumulation layer that accumulates its outputs, constitutes the feature soft quantizer. Its input is the feature vectors of the dynamic image of a motion video sample; its output is the histogram expression of the dynamic image.
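A minimal NumPy sketch of the feature soft quantizer (steps 4-5), assuming random stand-ins for the feature vectors, centres, and scales; only the shapes and formulas follow the text:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(1200, 256))      # feature vectors of one dynamic image
C = rng.normal(size=(128, 256))       # feature centres c_k
sigma = np.full(128, 500.0)           # scale factors (illustrative value)

# Squared Euclidean distances ||x_y - c_k||^2, shape (1200, 128).
sq = (X ** 2).sum(1)[:, None] + (C ** 2).sum(1)[None, :] - 2 * X @ C.T

# W_k(x_y) = exp(-||x_y - c_k||^2 / sigma_k), normalised over the 128
# centres, then accumulated over all 1200 feature vectors.
W = np.exp(-sq / sigma)
W_norm = W / W.sum(axis=1, keepdims=True)
H = W_norm.sum(axis=0)                # histogram expression, shape (128,)

print(H.shape)
```

Because each vector's normalised responses sum to 1, the histogram entries sum to the number of feature vectors (1200).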
6. The histogram expression H of the dynamic image of the motion video sample is input to a multilayer perceptron, forming the feature quantization network shown in fig. 6. The feature quantization network comprises the feature extractor, the feature soft quantizer, and the multilayer perceptron.
The multilayer perceptron includes an input layer, a hidden layer, and an output layer. The input layer is connected to the output of the histogram connection layer, and the output Input_1 of the input layer equals the output H of the histogram connection layer, i.e. Input_1 = H; the input layer has 128 neurons in total. The hidden layer has 64 neurons and is fully connected to all output units of the input layer. The output layer of the multilayer perceptron has 10 neurons, each representing an action category. The weights between the input layer and the hidden layer are denoted W_1^h ∈ R^(128×64), and the weights between the hidden layer and the output layer are denoted W_1^o ∈ R^(64×10).
The output Q_1 of the hidden-layer neurons is computed as:

Q_1 = φ_elu(Input_1 · W_1^h + b_1^h),

where φ_elu is the elu activation function and b_1^h is the bias vector of the hidden layer;
The output O_1 of the output layer of the multilayer perceptron is:

O_1 = φ_softmax(Q_1 · W_1^o + b_1^o),

where φ_softmax is the softmax activation function and b_1^o is the bias vector of the output layer.
The loss function L_1 of the feature quantization network is:

L_1 = −(1/G) Σ_{g=1}^{G} Σ_{i=1}^{10} [y_g]_i log([O_1^g]_i),

where O_1^g is the output vector of the multilayer perceptron for the g-th sample, y_g is the desired output vector for the g-th sample, and G is the total number of samples. The desired output y_g is defined componentwise as [y_g]_i = 1 if i = l_g and [y_g]_i = 0 otherwise, where l_g is the label value of the g-th sample.
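A minimal NumPy sketch of this perceptron for one sample, assuming random weights and a cross-entropy loss with a one-hot desired output (the exact loss form in the patent figures is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(2)

def elu(x):
    # elu activation: identity for x > 0, exp(x) - 1 otherwise.
    return np.where(x > 0, x, np.exp(x) - 1)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

H = rng.random(128)                     # histogram expression of one sample
W_ih, b_h = rng.normal(0, 0.1, (128, 64)), np.zeros(64)
W_ho, b_o = rng.normal(0, 0.1, (64, 10)), np.zeros(10)

Q1 = elu(H @ W_ih + b_h)                # hidden layer output, shape (64,)
O1 = softmax(Q1 @ W_ho + b_o)           # class probabilities, shape (10,)

l = np.zeros(10); l[3] = 1.0            # one-hot desired output, label 3
loss = -np.sum(l * np.log(O1))          # cross-entropy for one sample

print(O1.sum(), loss > 0)
```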
7. Input the dynamic images of all training samples into the feature quantization network and train it until convergence. Then input the dynamic image of each category's training action video samples into the feature extractor and feature soft quantizer of the trained feature quantization network to obtain their histogram expressions.
For each action category, denote by H^a the histogram expression of the a-th training action video sample. Each dimension of the histogram expression corresponds to one feature center, and its value is the response of the training motion video sample to that feature center. For each action category, the covariance between feature centers within the category is computed.
Taking the first action category as an example, with 150 training samples in total, the covariance Cov(1,2) of the 1st and 2nd feature centers in this category is calculated as:

Cov(1,2) = (1/150) Σ_{a=1}^{150} (H_1^a − H̄_1)(H_2^a − H̄_2),

where H_1^a is the response of the a-th training motion video sample to the 1st feature center in the first action category, and H̄_1 is the mean response of all the category's training motion video samples to the 1st feature center:

H̄_1 = (1/150) Σ_{a=1}^{150} H_1^a.
The covariance between all feature centers of the first action category can be calculated as above, giving 128 × 127 / 2 = 8128 covariance values in total. The 8 largest covariances are selected, each corresponding to a group of feature centers. The larger the covariance, the higher the probability that the features represented by the group co-occur in the category; each group of feature centers thus represents co-occurring image features in the dynamic images of the category's motion video samples. For the first action category, the 8 largest covariances are: Cov(1,10), Cov(8,35), Cov(12,23), Cov(16,79), Cov(20,95), Cov(45,64), Cov(85,112) and Cov(97,121), corresponding to the feature center groups (c_1, c_10), (c_8, c_35), (c_12, c_23), (c_16, c_79), (c_20, c_95), (c_45, c_64), (c_85, c_112) and (c_97, c_121).
For action categories 2 to 10, the covariances between feature centers are calculated in the same way, and the feature center groups of the 8 largest covariances are found. If a feature center group found in a later category duplicates one found in an earlier category, it is not counted for the later category, and the group corresponding to the next-largest covariance is selected instead. Taking the 2nd action category as an example, the 8 largest covariances are: Cov(2,7), Cov(10,22), Cov(18,28), Cov(22,83), Cov(39,97), Cov(45,64), Cov(79,108) and Cov(98,125), corresponding to the feature center groups (c_2, c_7), (c_10, c_22), (c_18, c_28), (c_22, c_83), (c_39, c_97), (c_45, c_64), (c_79, c_108) and (c_98, c_125). The group (c_45, c_64) duplicates a group found for the 1st action category, so the group (c_67, c_99) corresponding to the 9th-largest covariance Cov(67,99) is taken instead. The 8 feature center groups finally found for the 2nd action category are therefore: (c_2, c_7), (c_10, c_22), (c_18, c_28), (c_22, c_83), (c_39, c_97), (c_67, c_99), (c_79, c_108) and (c_98, c_125). In total, 80 feature center groups are found over the 10 action categories.
8. Based on the 80 feature center groups found, a new layer, called the image feature co-occurrence layer, is constructed after the feature soft quantizer. This layer has 80 neurons in total, each corresponding to one found feature center group; the value of a neuron is the product of the responses of its group's feature centers in the histogram of the motion video sample. These neurons are called co-occurrence image feature neurons. The output of the image feature co-occurrence layer is denoted S = (s_1, s_2, ..., s_80).
Wherein, the output s of the b-th co-occurrence image characteristic neuronbThe calculation method is as follows:
wherein the content of the first and second substances,the response values of the motion video sample to two feature centers corresponding to the b-th co-occurrence image feature neuron are respectively.
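A minimal NumPy sketch of steps 7-8 for one action class, with random histograms standing in for real quantizer outputs; note that `np.cov` uses the unbiased (n − 1) denominator rather than the sample count, which does not affect the ranking of pairs:

```python
import numpy as np

rng = np.random.default_rng(3)
H_class = rng.random((150, 128))       # histograms of the class's 150 samples

# Covariance matrix over the 128 histogram dimensions.
cov = np.cov(H_class, rowvar=False)    # shape (128, 128)

# All 128 * 127 / 2 = 8128 unordered pairs of feature centres.
iu = np.triu_indices(128, k=1)
order = np.argsort(cov[iu])[::-1]      # pairs sorted by covariance, descending
top_pairs = [(iu[0][i], iu[1][i]) for i in order[:8]]

# Image feature co-occurrence layer output for one sample:
# s_b is the product of the sample's responses at the pair's two centres.
h = H_class[0]
s = np.array([h[p] * h[q] for p, q in top_pairs])

print(len(top_pairs), s.shape)
```

The cross-class deduplication described in the text would be a further filtering step over `top_pairs`, skipping pairs already claimed by earlier classes.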
9. The output of the image feature co-occurrence layer is input to a multilayer perceptron, forming the action recognition network based on co-occurrence image features shown in fig. 7. This network comprises the feature extractor, the feature soft quantizer, the image feature co-occurrence layer, and the multilayer perceptron.
The multilayer perceptron includes an input layer, a hidden layer, and an output layer. The input layer is connected to the output of the image feature co-occurrence layer, and the output Input_2 of the input layer equals the output S of the image feature co-occurrence layer, i.e. Input_2 = S; the input layer has 80 neurons in total. The hidden layer has 64 neurons and is fully connected to all output units of the input layer. The output layer of the multilayer perceptron has 10 neurons, each representing an action category. The weights between the input layer and the hidden layer are denoted W_2^h ∈ R^(80×64), and the weights between the hidden layer and the output layer are denoted W_2^o ∈ R^(64×10).
The output Q_2 of the hidden-layer neurons is computed as:

Q_2 = φ_elu(Input_2 · W_2^h + b_2^h),

where φ_elu is the elu activation function and b_2^h is the bias vector of the hidden layer;
The output O_2 of the output layer of the multilayer perceptron is:

O_2 = φ_softmax(Q_2 · W_2^o + b_2^o),

where φ_softmax is the softmax activation function and b_2^o is the bias vector of the output layer.
The loss function L_2 of the action recognition network based on co-occurrence image features is:

L_2 = −(1/G) Σ_{g=1}^{G} Σ_{i=1}^{10} [y_g]_i log([O_2^g]_i),

where O_2^g is the output vector of the multilayer perceptron for the g-th sample and y_g is the desired output vector for the g-th sample, as in step 6.
10. Input the dynamic images of all training action video samples into the action recognition network based on co-occurrence image features and train it until convergence.
And inputting the dynamic image of the training action video sample of each action type into a trained action recognition network based on the co-occurrence image characteristics to obtain the output S of the image characteristic co-occurrence layer.
For each action category, denote by S^a the output of the a-th training action video sample at the image feature co-occurrence layer. Each dimension of this output corresponds to one co-occurrence image feature neuron in the image feature co-occurrence layer. The covariance between co-occurrence image feature neurons within the category is then computed.
Taking the first action category as an example, with 150 training samples in total, the covariance Cov(1,2) of the 1st and 2nd co-occurrence image feature neurons in this category is calculated as:

Cov(1,2) = (1/150) Σ_{a=1}^{150} (S_1^a − S̄_1)(S_2^a − S̄_2),

where S_1^a is the output of the a-th training motion video sample at the 1st co-occurrence image feature neuron in the first action category, and S̄_1 is the mean output of all the category's training motion video samples at the 1st co-occurrence image feature neuron:

S̄_1 = (1/150) Σ_{a=1}^{150} S_1^a.
The covariance between all co-occurrence image feature neurons of the first action category can be calculated as above, giving 80 × 79 / 2 = 3160 covariances in total. The 4 largest covariances are selected, each corresponding to a group of co-occurrence image feature neurons. The larger the covariance, the higher the probability that the group of co-occurrence image feature neurons co-occurs in the category; each group represents co-occurring semantic features in the dynamic images of the category's motion video samples. For the first action category, the 4 largest covariances are: Cov(2,50), Cov(5,32), Cov(17,28) and Cov(45,78), corresponding to the co-occurrence image feature neuron groups (s_2, s_50), (s_5, s_32), (s_17, s_28) and (s_45, s_78).
For action categories 2 to 10, the covariances between co-occurrence image feature neurons are calculated in the same way, and the neuron groups of the 4 largest covariances are found. If a co-occurrence image feature neuron group found in a later category duplicates one found in an earlier category, it is not counted for the later category, and the group corresponding to the next-largest covariance is selected instead. In total, 40 co-occurrence image feature neuron groups are found over the 10 action categories.
11. Based on the 40 co-occurrence image feature neuron groups found, a new layer, called the semantic feature co-occurrence layer, is constructed after the image feature co-occurrence layer. This layer has 40 neurons in total, each corresponding to one found co-occurrence image feature neuron group; the value of a neuron is the product of the motion video sample's outputs at the group's two co-occurrence image feature neurons. These neurons are called co-occurrence semantic feature neurons. The output of the semantic feature co-occurrence layer is M = (m_1, m_2, ..., m_40).
Wherein, the output m of the chi-th co-occurrence semantic feature neuronχThe calculation method is as follows:
wherein the content of the first and second substances,and respectively outputting values of two co-occurrence image characteristic neurons corresponding to the chi-th co-occurrence semantic characteristic neuron of the motion video sample.
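A minimal NumPy sketch of the semantic feature co-occurrence layer, with a hypothetical list of 40 found neuron pairs and a random stand-in for the image feature co-occurrence layer output S:

```python
import numpy as np

rng = np.random.default_rng(4)
S = rng.random(80)                     # image feature co-occurrence output

# Hypothetical 40 found pairs of co-occurrence image feature neurons.
pairs = [tuple(p) for p in rng.choice(80, size=(40, 2), replace=False)]

# m_chi is the product of the sample's outputs at the pair's two neurons.
M = np.array([S[i] * S[j] for i, j in pairs])

print(M.shape)
```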
12. The output of the semantic feature co-occurrence layer is input to a multilayer perceptron, forming the action recognition network based on hierarchical co-occurrence features shown in fig. 8. This network comprises the feature extractor, the feature soft quantizer, the image feature co-occurrence layer, the semantic feature co-occurrence layer, and the multilayer perceptron.
The multilayer perceptron includes an input layer, a hidden layer, and an output layer. The input layer is connected to the output of the semantic feature co-occurrence layer, and the output Input_3 of the input layer equals the output M of the semantic feature co-occurrence layer, i.e. Input_3 = M; the input layer has 40 neurons in total. The hidden layer has 64 neurons and is fully connected to all output units of the input layer. The output layer of the multilayer perceptron has 10 neurons, each representing an action category. The weights between the input layer and the hidden layer are denoted W_3^h ∈ R^(40×64), and the weights between the hidden layer and the output layer are denoted W_3^o ∈ R^(64×10).
The output Q_3 of the hidden-layer neurons is computed as:

Q_3 = φ_elu(Input_3 · W_3^h + b_3^h),

where φ_elu is the elu activation function and b_3^h is the bias vector of the hidden layer;
The output O_3 of the output layer of the multilayer perceptron is:

O_3 = φ_softmax(Q_3 · W_3^o + b_3^o),

where φ_softmax is the softmax activation function and b_3^o is the bias vector of the output layer.
The loss function L_3 of the action recognition network based on hierarchical co-occurrence features is:

L_3 = −(1/G) Σ_{g=1}^{G} Σ_{i=1}^{10} [y_g]_i log([O_3^g]_i),

where O_3^g is the output vector of the multilayer perceptron for the g-th sample and y_g is the desired output vector for the g-th sample, as in step 6.
The input of the action recognition network based on hierarchical co-occurrence features is the dynamic image of an action video sample; the output is the probability that the current action video sample belongs to each action category.
13. Train the action recognition network based on hierarchical co-occurrence features until convergence. Then compute the dynamic image of a test action video sample and input it into the trained network to obtain the predicted probability of each action category; the category with the largest probability is the final predicted category of the test sample, thereby achieving action recognition.
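A minimal sketch of this final prediction step, with a made-up probability vector standing in for the network output O_3:

```python
import numpy as np

# Hypothetical softmax output over the 10 action categories for one
# test dynamic image; the prediction is the argmax.
O3 = np.array([0.02, 0.05, 0.10, 0.61, 0.03, 0.04, 0.05, 0.04, 0.03, 0.03])
predicted_class = int(np.argmax(O3))

print(predicted_class)  # 3
```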
Although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes and modifications may be made to the embodiments, and that some features may be replaced by equivalents, without departing from the spirit and scope of the invention.
Claims (12)
1. A motion recognition method is characterized in that: the method comprises the following steps:
step one, calculating a dynamic image of a motion video sample;
inputting the dynamic image of the motion video sample into a feature extractor to obtain a feature vector in the dynamic image;
step three, constructing a feature center group; inputting all the feature vectors of the dynamic images of the action video samples into the feature centers, accumulating all the outputs on each feature center to obtain histogram expression;
inputting the histogram expression into a multilayer perceptron to form a characteristic quantization network;
inputting the dynamic images of all the training video samples into a characteristic quantization network, training the characteristic quantization network until convergence, and finding out a co-occurrence characteristic center group of each action type;
constructing an image characteristic co-occurrence layer;
step seven, inputting the output of the image feature co-occurrence layer into a multilayer perceptron, and constructing a motion recognition network based on the co-occurrence image features;
step eight, inputting dynamic images of all training action video samples into an action recognition network based on co-occurrence image characteristics, training the action recognition network based on the co-occurrence image characteristics to be convergent, and finding out co-occurrence image characteristic neuron groups of each action category;
constructing a semantic feature co-occurrence layer;
step ten, inputting the output of the semantic feature co-occurrence layer into a multilayer perceptron, and constructing an action recognition network based on hierarchical co-occurrence features;
and step eleven, training the action recognition network based on hierarchical co-occurrence features until convergence, calculating the dynamic image of the test action video sample, and inputting it into the trained action recognition network based on hierarchical co-occurrence features to realize action recognition.
2. A motion recognition method according to claim 1, characterized in that: in the first step, the method for calculating the dynamic image of the motion video sample comprises the following steps:
each motion video sample consists of all frames in the video, for any motion video sample a:
A={It|t∈[1,T]},
wherein t represents the time index and T is the total number of frames of the motion video sample A; I_t ∈ R^(R×C×3) is the matrix representation of the t-th frame image of the motion video sample A, where R, C and 3 correspond to the number of rows, columns and channels of the matrix representation, and R denotes that the matrix is a real matrix; each element of I_t represents a pixel value of the t-th frame image;
for any motion video sample A, first vectorize I_t, i.e. concatenate all row vectors of the three channels of I_t into a new row vector i_t;
taking the arithmetic square root of each element of the row vector i_t gives a new vector w_t, namely:

w_t = sqrt(i_t),

where sqrt(i_t) denotes taking the arithmetic square root of each element of i_t; w_t is called the frame vector of the t-th frame image of the motion video sample A;
the feature vector v_t of the t-th frame image of the motion video sample A is calculated as:

v_t = (1/t) Σ_{j=1}^{t} w_j,

where Σ_{j=1}^{t} w_j denotes the summation of the frame vectors of the 1st to t-th frame images of the motion video sample A;
the score B_t of the t-th frame image I_t of the motion video sample A is calculated as:

B_t = u^T · v_t,

where u is a vector of dimension f, with f = R × C × 3; u^T denotes the transpose of the vector u, and u^T · v_t denotes the dot product of the transposed vector u and the feature vector v_t;
the value of u is calculated such that frame images appearing later in the action video sample receive higher scores, i.e. the larger t is, the higher the score B_t; u can be computed with RankSVM as follows:

u* = argmin_u E(u), with E(u) = λ||u||² + (2/(T(T−1))) Σ_{i>j} max{0, 1 − B_i + B_j},

where u* denotes the u that minimizes E(u), λ is a constant, ||u||² denotes the sum of the squares of the elements of the vector u, B_i and B_j denote the scores of the i-th and j-th frame images of the motion video sample A respectively, and max{0, 1 − B_i + B_j} means choosing the larger of 0 and 1 − B_i + B_j; the resulting vector u, rearranged into the shape R × C × 3, is the dynamic image of the motion video sample A.
3. A motion recognition method according to claim 2, characterized in that: in the second step, the feature extractor consists of a series of convolutional layers and pooling layers; the dynamic image of each motion video sample is input to the feature extractor, which outputs a feature map F ∈ R^(K_1×K_2×D), where K_1, K_2 and D respectively denote the height, width and number of channels of the output feature map; the feature map F has K_1 × K_2 pixel positions in total, and the feature vector x_y of each pixel has dimension D, the number of channels of F, with y = 1, 2, ..., K_1 × K_2; the feature vectors of the dynamic image are represented by the set X = {x_y | y = 1, 2, ..., K_1 × K_2}.
4. A motion recognition method according to claim 3, characterized in that: in the third step, the feature center group contains N_K feature centers in total, each feature center corresponding to one scale factor; the initial value of each feature center and of its scale factor is obtained by the following method:
extracting the feature vectors from the dynamic images of all training motion video samples and clustering them, the number of clusters being the same as the number of feature centers, i.e. N_K; each cluster has a cluster center, and the cluster center values obtained by clustering are used as the initial values of the feature centers; for the k-th cluster, the set of all feature vectors in the cluster is denoted E_k, containing N_k feature vectors:

E_k = {e_1, e_2, ..., e_{N_k}},
the Euclidean distance d_{q,τ} between feature vectors is calculated as:

d_{q,τ} = sqrt( Σ_{d=1}^{D} ([e_q]_d − [e_τ]_d)² ),

where [e_q]_d denotes the d-th dimension of the feature vector e_q, q ∈ [1, N_k − 1], τ ∈ [q + 1, N_k]; the initial value of the scale factor σ_k of the k-th feature center is the mean of these pairwise distances:

σ_k = (2 / (N_k(N_k − 1))) Σ_{q=1}^{N_k−1} Σ_{τ=q+1}^{N_k} d_{q,τ};
for each feature vector x_y of the dynamic image, its distance to the k-th feature center c_k is computed and taken as its output at the k-th feature center c_k:

W_k(x_y) = exp(−||x_y − c_k||² / σ_k),

and the output obtained by inputting the feature vector x_y to the k-th feature center is normalized over all N_K centers:

W̄_k(x_y) = W_k(x_y) / Σ_{k'=1}^{N_K} W_{k'}(x_y);
all feature vectors of the dynamic image of each motion video sample are input to each feature center of the feature center group, and all outputs at each feature center are accumulated; the accumulated output h_k of the k-th feature center is:

h_k = Σ_{y=1}^{K_1×K_2} W̄_k(x_y);

the accumulated values of all feature centers are connected together to obtain the histogram expression H of the dynamic image:

H = (h_1, h_2, ..., h_{N_K});
the feature center group and the accumulation layer accumulating its outputs form the feature soft quantizer; the input of the feature soft quantizer is the feature vectors of the dynamic image of each motion video sample, and the output is the histogram expression of the dynamic image.
5. An action recognition method according to claim 4, characterized in that: in the fourth step, the feature quantization network comprises a feature extractor, a feature soft quantizer and a multilayer perceptron;
the multilayer perceptron comprises an input layer, a hidden layer and an output layer; the input layer is connected to the output of the histogram connection layer, and the output Input_1 of the input layer equals the output H of the histogram connection layer, i.e. Input_1 = H; the input layer has r_1 = N_K neurons in total; the hidden layer has z_1 neurons and is fully connected to all output units of the input layer; the output layer of the multilayer perceptron has o neurons, each representing an action category; the weights between the input layer and the hidden layer are denoted W_1^h ∈ R^(r_1×z_1), and the weights between the hidden layer and the output layer are denoted W_1^o ∈ R^(z_1×o);
the output Q_1 of the hidden-layer neurons is calculated as:

Q_1 = φ_elu(Input_1 · W_1^h + b_1^h),

where φ_elu is the elu activation function and b_1^h is the bias vector of the hidden layer;
the output O_1 of the output layer of the multilayer perceptron is:

O_1 = φ_softmax(Q_1 · W_1^o + b_1^o),

where φ_softmax is the softmax activation function and b_1^o is the bias vector of the output layer;
the loss function L_1 of the feature quantization network is:

L_1 = −(1/G) Σ_{g=1}^{G} Σ_{i=1}^{o} [y_g]_i log([O_1^g]_i),

where O_1^g is the output vector of the multilayer perceptron for the g-th sample and y_g is the desired output vector for the g-th sample, defined as [y_g]_i = 1 if i = l_g and [y_g]_i = 0 otherwise; G is the total number of samples and l_g is the label value of the g-th sample.
6. A motion recognition method according to claim 5, wherein: in the fifth step, the method for finding out the co-occurrence feature center group of each action category comprises the following steps:
inputting the dynamic image of the training action video sample of each category into a feature extractor and a feature soft quantizer in a trained feature quantization network to obtain histogram expression;
for each action category, the a-th training action video sample has a histogram expression H^a; each dimension of the histogram expression corresponds to one feature center, and the value of each dimension is the response of the training motion video sample to that feature center; for each action category, the covariance between feature centers within the category is calculated;
for any action category, the covariance Cov(k_1, k_2) of the k_1-th and k_2-th feature centers within the category is calculated as:

Cov(k_1, k_2) = (1/N_θ) Σ_{a=1}^{N_θ} (H_{k_1}^a − H̄_{k_1})(H_{k_2}^a − H̄_{k_2}),

where H_{k_1}^a is the response of the a-th training motion video sample of the category to the k_1-th feature center, H̄_{k_1} = (1/N_θ) Σ_{a=1}^{N_θ} H_{k_1}^a is the mean response of all the category's training motion video samples to the k_1-th feature center, N_θ is the total number of training action video samples of the category, k_1 ∈ [1, N_K − 1], and k_2 ∈ [k_1 + 1, N_K];
for each action category, the covariances between all feature centers are calculated in the above manner; then, starting from the first action category, the calculated covariance values are sorted and the K_1 largest are selected, each corresponding to a group of feature centers; the larger the covariance, the higher the probability that the features represented by the group of feature centers co-occur in the action category; each group of feature centers represents co-occurring image features in the dynamic images of the category's motion video samples; if a feature center group found in a later action category duplicates one found in an earlier action category, the duplicated group is not counted among that category's groups, and the group corresponding to the next-largest covariance is selected instead; K_1 feature center groups are found per action category, and with o action categories in total, K_1 × o feature center groups are found.
7. A motion recognition method according to claim 6, characterized in that: in the sixth step, the method for constructing the image feature co-occurrence layer comprises the following steps:
according to the found K1A new layer is constructed behind the characteristic soft quantizer at the x o group characteristic center, and the layer is called an image characteristic co-occurrence layer; the layer has a total of K1The method comprises the following steps that x o neurons correspond to a group of found feature centers, the value of each neuron is the product of response values of the group of feature centers in a histogram of a motion video sample, and the neuron is called as a co-occurrence image feature neuron; the output of the image feature co-occurrence layer is
Wherein, the neuron output s of the b-th co-occurrence image characteristic neuronbThe calculation method is as follows:
8. A motion recognition method according to claim 7, wherein: in the seventh step, the action recognition network based on the co-occurrence image features comprises a feature extractor, a feature soft quantizer, an image feature co-occurrence layer and a multilayer perceptron;
the multilayer perceptron comprises an input layer, a hidden layer and an output layer; the input layer is connected to the output of the image feature co-occurrence layer, and the output Input_2 of the input layer equals the output S of the image feature co-occurrence layer, i.e. Input_2 = S; the input layer has r_2 = K_1 × o neurons; the hidden layer has z_2 neurons and is fully connected to all output units of the input layer; the output layer of the multilayer perceptron has o neurons, each representing an action category; the weights between the input layer and the hidden layer are denoted W_2^h ∈ R^(r_2×z_2), and the weights between the hidden layer and the output layer are denoted W_2^o ∈ R^(z_2×o);
the output Q_2 of the hidden-layer neurons is calculated as:

Q_2 = φ_elu(Input_2 · W_2^h + b_2^h),

where φ_elu is the elu activation function and b_2^h is the bias vector of the hidden layer;
the output O_2 of the output layer of the multilayer perceptron is:

O_2 = φ_softmax(W_2^(2) · Q_2 + c_2^(2)),

where φ_softmax is the softmax activation function and c_2^(2) is the bias vector of the output layer;
the loss function L_2 of the action recognition network based on co-occurrence image features is the cross-entropy loss:

L_2 = −Σ_g y^(g) · log O_2^(g),

where O_2^(g) is the output vector of the multilayer perceptron for the g-th training sample and y^(g) is its one-hot action category label vector.
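The perceptron described in this claim (elu hidden layer, softmax output layer, cross-entropy loss) can be sketched as below; the function names and the single-sample loss form are assumptions for illustration:

```python
import numpy as np

def elu(x, alpha=1.0):
    # elu(x) = x for x > 0, alpha * (exp(x) - 1) otherwise.
    return np.where(x > 0, x, alpha * (np.exp(x) - 1.0))

def softmax(x):
    # Numerically stable softmax over the class scores.
    e = np.exp(x - x.max())
    return e / e.sum()

def mlp_forward(S, W1, c1, W2, c2):
    """Two-layer perceptron on the co-occurrence layer output S."""
    Q2 = elu(W1 @ S + c1)          # hidden layer
    return softmax(W2 @ Q2 + c2)   # output layer: class probabilities

def cross_entropy(O2, y):
    """Cross-entropy loss for one sample; y is a one-hot label vector."""
    return -float(np.sum(y * np.log(O2 + 1e-12)))
```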
9. A motion recognition method according to claim 8, wherein: in step eight, the method for finding the co-occurrence image feature neuron groups of each action category is as follows:
the dynamic images of the training action video samples of each action category are input into the trained action recognition network based on co-occurrence image features to obtain the output S of the image feature co-occurrence layer;
for each action category, the output of the a-th training action video sample at the image feature co-occurrence layer is denoted S^(a); each dimension of the output corresponds to one co-occurrence image feature neuron in the image feature co-occurrence layer; the covariance between the co-occurrence image feature neurons within the action category is then calculated;
for any action category, the covariance Cov(d1, d2) between the d1-th and the d2-th co-occurrence image feature neurons in that category is calculated as:

Cov(d1, d2) = (1/N) Σ_a (s_{d1}^(a) − s̄_{d1})(s_{d2}^(a) − s̄_{d2}),

where s_{d1}^(a) denotes the output of the d1-th co-occurrence image feature neuron for the a-th training action video sample of the action category, s̄_{d1} denotes the average of the outputs of the d1-th co-occurrence image feature neuron over all N training action video samples of the category, and d1 ∈ [1, K1×o−1], d2 ∈ [d1+1, K1×o];
for each action category, the covariance between all co-occurrence image feature neurons is calculated in the above manner; the calculated covariance values are then sorted, starting from the first action category, and the K2 covariances with the largest values are selected, each covariance corresponding to a group of co-occurrence image feature neurons; the larger the covariance, the higher the probability that the group of co-occurrence image feature neurons co-occur in the category; each group of co-occurrence image feature neurons represents semantic features that co-occur in the dynamic images of that category's action video samples; if a co-occurrence image feature neuron group found in a later-ranked action category duplicates one found in an earlier-ranked action category, the duplicated group is not counted among the groups found for that category, and the group corresponding to the next-largest remaining covariance is selected instead; K2 groups of co-occurrence image feature neurons can thus be found for each action category; with o action categories in total, K2×o groups of co-occurrence image feature neurons can be found.
10. A motion recognition method according to claim 9, wherein: in the ninth step, the method for constructing the semantic feature co-occurrence layer comprises the following steps:
According to the found K2×o groups of co-occurrence image feature neurons, a new layer is constructed after the image feature co-occurrence layer; this layer is called the semantic feature co-occurrence layer. The layer has K2×o neurons in total, each called a co-occurrence semantic feature neuron; each neuron corresponds to one found group of co-occurrence image feature neurons, and its value is the product of the outputs of the action video sample at that group of co-occurrence image feature neurons. The output of the semantic feature co-occurrence layer is denoted M,
wherein the output m_χ of the χ-th co-occurrence semantic feature neuron is calculated as:

m_χ = m_{χ1} × m_{χ2},

where m_{χ1} and m_{χ2} are the output values, for the action video sample, of the two co-occurrence image feature neurons corresponding to the χ-th co-occurrence semantic feature neuron.
11. A motion recognition method according to claim 10, wherein: in step ten, the action recognition network based on hierarchical co-occurrence features comprises a feature extractor, a feature soft quantizer, an image feature co-occurrence layer, a semantic feature co-occurrence layer and a multilayer perceptron;
the multilayer perceptron comprises an input layer, a hidden layer and an output layer; the input layer is connected to the output of the semantic feature co-occurrence layer, and the output Input_3 of the input layer is the same as the output M of the semantic feature co-occurrence layer, i.e. Input_3 = M; the input layer has r_3 = K2×o neurons; the hidden layer has z_3 neurons, fully connected to all units of the input layer; the output layer of the multilayer perceptron has o neurons, each representing one action category; the weights between the input layer and the hidden layer are denoted W_1^(3), and the weights between the hidden layer and the output layer are denoted W_2^(3);
the output Q_3 of the hidden layer neurons is calculated as:

Q_3 = φ_elu(W_1^(3) · Input_3 + c_1^(3)),

where φ_elu is the elu activation function and c_1^(3) is the bias vector of the hidden layer;
the output O_3 of the output layer of the multilayer perceptron is:

O_3 = φ_softmax(W_2^(3) · Q_3 + c_2^(3)),

where φ_softmax is the softmax activation function and c_2^(3) is the bias vector of the output layer;
the loss function L_3 of the action recognition network based on hierarchical co-occurrence features is:

L_3 = −Σ_g y^(g) · log O_3^(g),

where O_3^(g) is the output vector of the multilayer perceptron for the g-th sample and y^(g) is its one-hot action category label vector;
the input of the action recognition network based on hierarchical co-occurrence features is the dynamic image of an action video sample, and the output is the probability that the current action video sample belongs to each action category.
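Putting claims 6–11 together, a forward pass of the hierarchical network, after the feature extractor and feature soft quantizer have produced the histogram h of a dynamic image, might look like the following sketch; all names and shapes are illustrative assumptions:

```python
import numpy as np

def hierarchy_cooccurrence_forward(h, center_pairs, neuron_pairs,
                                   W1, c1, W2, c2):
    """From soft-quantized histogram h to action class probabilities."""
    # Image feature co-occurrence layer: products over feature-center pairs.
    S = np.array([h[i] * h[j] for (i, j) in center_pairs])
    # Semantic feature co-occurrence layer: products over neuron pairs.
    M = np.array([S[i] * S[j] for (i, j) in neuron_pairs])
    # Multilayer perceptron: elu hidden layer, softmax output layer.
    x = W1 @ M + c1
    Q3 = np.where(x > 0, x, np.exp(x) - 1.0)
    z = W2 @ Q3 + c2
    e = np.exp(z - z.max())
    return e / e.sum()
```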
12. A motion recognition method according to claim 11, wherein: in the eleventh step, a specific method for realizing the action recognition is as follows:
the dynamic image of the test action video sample is calculated and input into the trained action recognition network based on hierarchical co-occurrence features to obtain the predicted probability that the current test action video sample belongs to each action category; the action category with the largest probability value is the action category finally predicted for the current test action video sample, thereby realizing action recognition.
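For completeness, the final prediction step is a simple argmax over the network's class probabilities (the probability values below are illustrative only):

```python
import numpy as np

# Hypothetical class-probability output of the trained network for one
# test action video sample (three action categories).
probs = np.array([0.1, 0.7, 0.2])

# The predicted action category is the index with the largest probability.
predicted_class = int(np.argmax(probs))
```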
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110472752.8A CN113221693B (en) | 2021-04-29 | 2021-04-29 | Action recognition method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113221693A true CN113221693A (en) | 2021-08-06 |
CN113221693B CN113221693B (en) | 2023-07-28 |
Family
ID=77090049
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110472752.8A Active CN113221693B (en) | 2021-04-29 | 2021-04-29 | Action recognition method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113221693B (en) |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20130108177A1 (en) * | 2011-11-01 | 2013-05-02 | Google Inc. | Image matching using motion manifolds |
CN103605989A (en) * | 2013-11-20 | 2014-02-26 | 康江科技(北京)有限责任公司 | Multi-view behavior identification method based on largest-interval meaning clustering |
CN106778854A (en) * | 2016-12-07 | 2017-05-31 | 西安电子科技大学 | Activity recognition method based on track and convolutional neural networks feature extraction |
US20170161606A1 (en) * | 2015-12-06 | 2017-06-08 | Beijing University Of Technology | Clustering method based on iterations of neural networks |
CN107341452A (en) * | 2017-06-20 | 2017-11-10 | 东北电力大学 | Human bodys' response method based on quaternary number space-time convolutional neural networks |
US20180373985A1 (en) * | 2017-06-23 | 2018-12-27 | Nvidia Corporation | Transforming convolutional neural networks for visual sequence learning |
CN109446923A (en) * | 2018-10-10 | 2019-03-08 | 北京理工大学 | Depth based on training characteristics fusion supervises convolutional neural networks Activity recognition method |
CN110119707A (en) * | 2019-05-10 | 2019-08-13 | 苏州大学 | A kind of human motion recognition method |
Non-Patent Citations (1)
Title |
---|
XIAOFENG ZHAO et al.: "Discriminative pose analysis for human action recognition", 2020 IEEE 6TH WORLD FORUM ON INTERNET OF THINGS, pages 1 - 6 *
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||