CN109190524A - Human action recognition method based on a generative adversarial network - Google Patents
Human action recognition method based on a generative adversarial network
- Publication number
- CN109190524A (application number CN201810941368.6A)
- Authority
- CN
- China
- Prior art keywords
- image
- generator
- discriminator
- classification
- original
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
Abstract
The invention proposes a human action recognition method based on a generative adversarial network (GAN). The method first designs a multi-step generation-and-recognition network model: a classifier is built on top of the adversarial network, so that the model realizes both image generation and image classification. Second, structural similarity is introduced into the discriminator; the added constraint improves the quality of the generated images. Finally, image generation and recognition are carried out on a library of everyday human action images. By combining natural image generation with recognition, the invention addresses the low recognition rate that arises when training samples are scarce; in image augmentation and recognition, the method produces natural-looking augmented samples, achieves a high recognition rate, and is robust.
Description
Technical field
The invention belongs to the technical field of computer vision, and in particular relates to a human action recognition method based on a generative adversarial network.
Background art
At present, human action recognition technology is applied in many fields such as health care, smart homes, and interactive entertainment. Existing research methods fall mainly into two classes: template matching and machine learning. Liu et al. [Document 1] (Liu L, Ma S, Fu Q. Human action recognition based on locality constrained linear coding and two-dimensional spatial-temporal templates [C]. Chinese Automation Congress. USA: IEEE, 2018. 1879-1883.) proposed a locality-constrained linear coding method based on two-dimensional spatio-temporal templates: human action information is described by computing two-dimensional spatio-temporal templates, which serve as global features, and patch descriptors are then encoded with locality-constrained linear coding to recognize human actions. Sharma et al. [Document 2] (Sharma G, Jurie F, Schmid C. Expanded parts model for semantic description of humans in still images [J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2016, 39(1): 87-101.) describe certain spatial regions with templates of human body parts, automatically localize the different parts, and learn the corresponding templates from a large number of candidate regions to perform classification. Although template matching algorithms are simple, their recognition accuracy is not high.
Among machine learning algorithms, support vector machines, naive Bayes, Markov models, back-propagation networks, and the like are widely applied in action recognition research. Zhang Guoliang et al. [Document 3] (Zhang Guoliang, Jia Songmin, Zhang Xiangyin, et al. Activity recognition using an SVM optimized by adaptive-mutation particle swarm optimization [J]. Optics and Precision Engineering, 2017, 25(6): 1669-1678.) build a framework from important local features, describing spatio-temporal interest points with gradient histograms and coding vectors; a particle swarm optimization method is applied to the support vector machine classifier to adjust parameters dynamically and find the best action classification. Zhang Siliang [Document 4] (Zhang Siliang. Research on human action recognition in somatosensory interaction systems [D]. Guangzhou: South China University of Technology, 2015. 30-49.) obtains the information of each human joint with a human-computer interaction device and preprocesses the data; space vectors are then formed from the coordinates of each joint and normalized, and a Gaussian-mixture hidden Markov model is established for the changes of each joint; finally the collected samples are fed into a trained neural network to classify the different actions.
Deep feature learning is an important branch of machine learning theory, and its rise has further advanced action recognition research. Wang et al. [Document 5] (Wang Zhongmin, Cao Hongjiang, Fan Meiyu. A human behavior recognition method based on deep learning with convolutional neural networks [J]. Computer Science, 2016, 43(s2): 56-58.) use a convolutional neural network to obtain binarized human action features from video and classify the features with long short-term memory units to achieve action recognition; however, when samples are few, the recognition rate is low. Sun et al. [Document 6] (Sun Y, Wu X, Yu W, et al. Action Recognition with Motion Map 3D Network [J]. Neurocomputing, 2018, 297: 33-39.) propose a three-dimensional convolutional motion map: a generation network iteratively integrates the current video frame into the features of earlier frames, and a discrimination network classifies the learned features. Tang Xianlun et al. [Document 7] (Tang Xianlun, Du Yiming, Liu Yuwei, Li Jiaxin, Ma Yiwei. An image recognition method based on a conditional deep convolutional generative adversarial network [J]. Acta Automatica Sinica, 2018, 44(5): 855-864.) combine the advantages of deep convolutional GANs and conditional GANs to build a conditional deep convolutional GAN model, using the strong feature-extraction ability of convolutional neural networks with conditional guidance to generate samples for image recognition. Zheng et al. [Document 8] (Zheng Z, Zheng L, Yang Y. Unlabeled samples generated by GAN improve the person re-identification baseline in vitro [A]. In: Proceedings of the IEEE ICCV 2017 [C]. Venice: IEEE, 2018. 3774-3782.) use a GAN framework to generate labeled samples, including face information, and apply them to person re-identification, improving the recognition rate.
Summary of the invention
The object of the invention is to overcome the shortcomings of the above prior art. On the basis of a generative adversarial network, a classifier is added to establish a multi-step generation-and-recognition model; structural similarity is used to improve the quality of the generated images, and a constraint term is used to optimize the network model. The proposed human action recognition method based on a generative adversarial network is realized by the following technical scheme:
The human action recognition method based on a generative adversarial network comprises the following steps:
Step 1) Construct a generator G: taking noise and a label indicating the action class as the input of the generator G, gradually fit the conditional distribution pg(x|y) of the generator G to the conditional distribution p(x|y) of the original images, generating new samples G(z|y), which form the generated images.
Step 2) Construct a discriminator D: compute the probability that an input image comes from the original image set, and judge from this probability whether a generated image belongs to the original image set.
Step 3) Construct a classifier C: using the ability of a convolutional neural network to extract features automatically, classify the mixed set of generated and original images and determine the class of each image.
Step 4) Define the objective function: in each iteration, compute the objective values of the generator G, the discriminator D, and the classifier C separately; the generator G minimizes the objective while the discriminator D and the classifier C maximize it, and the optimization operations together form the objective function V(D, G, C).
Step 5) Compute structural similarity: introduce structural similarity into the discriminator D, comparing images by the three factors of luminance, contrast, and structure, so that the generated images fit the original images more quickly.
Step 6) Optimize the objective function: add a structural similarity function to the original objectives of the generator G, the discriminator D, and the classifier C, judging the similarity between a generated image and the original images of the current training batch, to improve the quality of the generated images.
Step 7) According to the computed objective values, continually update the generator G, the discriminator D, and the classifier C during the iterations, seek the best generated images, and extract features to realize classification.
In a further refinement of the human action recognition method based on a generative adversarial network, in Step 1): the generator combines the input noise with the corresponding label and reshapes the dimensions; the generator produces an action of the corresponding class according to the label; and the generated images are labeled according to their original classes.
In a further refinement of the human action recognition method based on a generative adversarial network, the generator G comprises 3 deconvolution layers, each with a kernel size of 5 × 5 and a stride of 2 × 2; the dimension of the label is 64 × 1 × 1 × 10. After each deconvolution operation, the generator G applies Leaky ReLU activation and normalization to output the first-layer features; the output of each layer serves as the input of the next layer, finally completing the generation of the image.
In a further refinement of the human action recognition method based on a generative adversarial network, in Step 2) the discriminator takes the original images as reference and computes the probability k that a generated image belongs to the original image set. When the probability value k lies in [0, 0.5), the discriminator decides that the generated image does not belong to the original image set; when k lies in [0.5, 1], the discriminator decides that the generated image belongs to the original image set.
In a further refinement of the human action recognition method based on a generative adversarial network, the discriminator comprises 3 convolution layers with a kernel size of 5 × 5 and a stride of 2 × 2; the discriminator normalizes the input image after convolution and applies a ReLU activation function, and the output of the ReLU activation, combined with the label, is used to judge whether a generated image matches its class.
In a further refinement of the human action recognition method based on a generative adversarial network, in Step 4) the objective function V(D, G, C) is formed according to formula (1), where the prior distribution of the input noise z in the generator is pz(z), the learned distribution of the original images is pdata(x), and the generator finally produces new samples G(z|y); D(x|y) denotes the probability that an image x of class y comes from the original images; the classifier combines the generated-image distribution pg(x) and the original-image distribution pdata(x) for classification, and Cy(x) denotes the probability that the classifier predicts x to be of the correct class y.
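Formula (1) itself does not survive in this text. As a hedged reconstruction only: the description (generator minimizing, discriminator and classifier maximizing, with D(x|y) and Cy(x) as defined above) is consistent with a conditional GAN minimax objective extended by a classification term, along the lines of:

```latex
\min_{G}\max_{D,C} V(D,G,C)
  = \mathbb{E}_{x\sim p_{data}(x)}\!\left[\log D(x\mid y)\right]
  + \mathbb{E}_{z\sim p_{z}(z)}\!\left[\log\!\left(1-D\!\left(G(z\mid y)\mid y\right)\right)\right]
  + \mathbb{E}_{x\sim p_{data}(x),\,p_{g}(x)}\!\left[\log C_{y}(x)\right]
```

The exact weighting of the classifier term in the patent's formula (1) cannot be recovered from this text.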
In a further refinement of the human action recognition method based on a generative adversarial network, in Step 4) the optimization of the objective function V(D, G, C) is defined in terms of loss values; the optimization goal is to minimize the losses of the generator, the discriminator, and the classifier. Let L be the probability matrix with which the discriminator judges that the input images belong to the original samples. The loss of the generator is defined as the cross entropy between L and an all-ones matrix. The loss of the discriminator is defined as the sum of the loss on generated samples and the loss on original samples: the loss on generated samples is the cross entropy between L and a zero matrix, and the loss on original samples is the cross entropy between L and an all-ones matrix. The loss of the classifier is defined as the cross entropy between the predicted class and the true class of an image; meanwhile, the classification accuracy is predicted during training, and the classifier optimizes itself according to the accuracy value.
In a further refinement of the human action recognition method based on a generative adversarial network, in Step 5) the three components of structural similarity are luminance L, contrast C, and structure S, given by formulas (2), (3), and (4) respectively, where ux and uG(z) are the means of the images x and G(z), σxG(z) is the covariance of x and G(z), and σx², σG(z)² are the variances; C1, C2, and C3 are constants; S(x, G(z)) denotes the structure similarity function of x and G(z), with 0 ≤ S(x, G(z)) ≤ 1; C(x, G(z)) denotes the contrast similarity function of x and G(z); and L(x, G(z)) denotes the luminance similarity function of x and G(z).
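Formulas (2), (3), and (4) are not reproduced in this text. Assuming the patent follows the standard SSIM component definitions of Wang et al., which match the variables defined above, they would read:

```latex
L(x,G(z)) = \frac{2\mu_{x}\mu_{G(z)} + C_{1}}{\mu_{x}^{2} + \mu_{G(z)}^{2} + C_{1}} \qquad (2)

C(x,G(z)) = \frac{2\sigma_{x}\sigma_{G(z)} + C_{2}}{\sigma_{x}^{2} + \sigma_{G(z)}^{2} + C_{2}} \qquad (3)

S(x,G(z)) = \frac{\sigma_{xG(z)} + C_{3}}{\sigma_{x}\sigma_{G(z)} + C_{3}} \qquad (4)
```

These are the usual definitions and agree with the product form of formula (5); whether the patent uses exactly these constants and normalizations is an assumption.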
Combining the three quantities L, C, and S, the structural similarity function of the images x and G(z) is obtained according to formula (5):
SSIM(x, G(z)) = L(x, G(z)) · C(x, G(z)) · S(x, G(z))   (5)
In a further refinement of the human action recognition method based on a generative adversarial network, in Step 6) the objective function is optimized according to formula (6), where SSIM(x, G(z)) satisfies 0 ≤ SSIM(x, G(z)) ≤ 1; D(x|y) denotes the probability that an image x of class y comes from the original images; Cy(x) denotes the probability that the classifier predicts x to be of the correct class y; and G(z|y) denotes the new samples produced by the generator.
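Formula (6) is likewise missing from this text. Given Step 6 of the description (the original objective augmented with a structural similarity term), one plausible reconstruction, with the sign and weighting of the SSIM term an open assumption, is:

```latex
\min_{G}\max_{D,C} V'(D,G,C) = V(D,G,C)
  + \lambda\,\mathbb{E}_{z\sim p_{z}(z)}\!\left[\mathrm{SSIM}\!\left(x,\,G(z\mid y)\right)\right]
```

where λ is a hypothetical weight not given in this text.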
Beneficial effects:
The human action recognition method of the invention, based on a generative adversarial network, achieves high recognition efficiency and good robustness. In this method the generator and the discriminator compete with each other, which not only generates a large number of natural images but also resolves the deep network's need for a large number of samples. The method introduces structural similarity, improving the quality of the generated images and accelerating image generation. It combines image generation with recognition, avoiding the time-consuming and laborious large-scale collection and annotation of images; training the network jointly on generated and original images shortens the training time and improves the recognition rate and robustness.
Description of the drawings
Fig. 1 is the structure of the network model of the invention.
Fig. 2 shows some original images and generated images of the invention.
Fig. 3 shows the SSIM values between generated images and all original images.
Fig. 4 shows the iteration trends of the generation loss and the discrimination loss.
Specific embodiments
The technical scheme of the invention is further explained below with reference to the drawings and specific embodiments.
As shown in Fig. 1, the human action recognition method based on a generative adversarial network provided by the invention takes the generative adversarial network as its basis and adds a classifier to establish a multi-step generation-and-recognition model, building two modules, image generation and classification; structural similarity improves the quality of the generated images, and classification is realized with automatically extracted features. The implementation comprises the following steps:
Step 1): Generator design. Uniformly distributed noise z and the corresponding label y are input; the noise is combined with its label and the dimensions are reshaped. Image generation uses 3 deconvolution layers, each with a kernel size of 5 × 5 and a stride of 2 × 2. Since the batch size is 64 and the database contains 10 action classes to be classified, the dimension of the label is 64 × 1 × 1 × 10. After each deconvolution operation, Leaky ReLU activation and normalization produce the first-layer features; the output of each layer serves as the input of the next layer, finally completing the generation of the image. The original images and the generated images carry only action-class labels, with no label indicating the image source; as shown in Fig. 2, the left side shows original images and the right side shows generated images.
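As a sketch of how three stride-2 deconvolution layers upsample a small seed feature map, the following pure-Python helper walks the spatial sizes through the generator. The 4 × 4 seed resolution and "same" padding are assumptions for illustration; the patent specifies only the 5 × 5 kernels, the 2 × 2 stride, and the 64 × 1 × 1 × 10 label tensor.

```python
def deconv_out(size, stride=2):
    """Output spatial size of a 'same'-padded transposed convolution."""
    return size * stride

def generator_shapes(seed_hw=(4, 4), layers=3, batch=64, classes=10):
    """Walk the spatial sizes through the 3 deconvolution layers.

    The label tensor is batch x 1 x 1 x classes (64 x 1 x 1 x 10 in the
    patent); the 4x4 seed is a hypothetical starting resolution.
    """
    label_shape = (batch, 1, 1, classes)
    h, w = seed_hw
    shapes = [(h, w)]
    for _ in range(layers):  # each layer: 5x5 kernel, stride 2x2
        h, w = deconv_out(h), deconv_out(w)
        shapes.append((h, w))
    return label_shape, shapes

label_shape, shapes = generator_shapes()
print(label_shape)  # (64, 1, 1, 10)
print(shapes)       # [(4, 4), (8, 8), (16, 16), (32, 32)]
```

Each stride-2 layer doubles both spatial dimensions, which is why three layers suffice to grow a small seed into an image-sized output.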
Step 2): Discriminator design. The discriminator comprises 3 convolution layers with a kernel size of 5 × 5 and a stride of 2 × 2. The input image is convolved and then normalized, and a ReLU activation function is applied; each output result is combined with the label to play the game against the generator, judging whether a generated image matches its class. Taking the original images as reference, the discriminator computes the probability k that a generated image belongs to the original image set: when k lies in [0, 0.5), the discriminator decides that the generated image does not belong to the original image set; when k lies in [0.5, 1], it decides that the generated image belongs to the original image set.
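The [0, 0.5) versus [0.5, 1] decision rule above can be sketched directly; the function name is illustrative, not from the patent:

```python
def judge(k: float) -> bool:
    """Discriminator decision on probability k: True if the image is
    judged to belong to the original image set (k in [0.5, 1]),
    False if not (k in [0, 0.5))."""
    if not 0.0 <= k <= 1.0:
        raise ValueError("probability k must lie in [0, 1]")
    return k >= 0.5

print(judge(0.3))   # False: judged not to belong to the original set
print(judge(0.75))  # True: judged to belong to the original image set
```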
Step 3): Classifier design. The network mainly comprises two convolution layers, and the batch size is 8. The two convolution layers share the same parameter settings: both use padded convolution, with 16 kernels of size 3 × 3, a stride of 1, and a ReLU activation function. Two pooling layers are used, each followed by local response normalization, which, similar to the Dropout function, improves the generalization ability of the structure. Two fully connected layers then extract a 1024-dimensional feature vector; after the fully connected layers, a Softmax classifier classifies the images according to the extracted features, completing the construction of the model.
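A minimal numpy sketch of the local response normalization applied after each pooling layer; the radius and the k, alpha, beta hyperparameters are common AlexNet-style defaults, since the patent does not specify them:

```python
import numpy as np

def local_response_norm(x, radius=2, k=1.0, alpha=1e-4, beta=0.75):
    """Local response normalization across the channel axis.

    x has shape (batch, height, width, channels); each channel is divided
    by a power of the summed squares of its neighboring channels.
    """
    channels = x.shape[-1]
    out = np.empty_like(x, dtype=float)
    for c in range(channels):
        lo, hi = max(0, c - radius), min(channels, c + radius + 1)
        denom = (k + alpha * np.sum(x[..., lo:hi] ** 2, axis=-1)) ** beta
        out[..., c] = x[..., c] / denom
    return out

a = np.ones((1, 2, 2, 16))   # 16 channels, matching the 16 kernels above
b = local_response_norm(a)
print(b.shape)  # (1, 2, 2, 16)
```

Each activation is damped in proportion to the activity of nearby channels, which is the competition effect the embodiment relies on for generalization.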
Step 4): Objective function definition. In each iteration, the objective values of the generator, the discriminator, and the classifier are computed separately according to formula (7). The optimization of the objective function is defined in terms of loss values; the goal of the optimizer is to minimize the losses of the generator, the discriminator, and the classifier. Since the hyperparameters of an adaptive optimizer generally need no tuning, the learning rate is adjusted automatically, and it suits optimization with very noisy gradients, both the generator and the discriminator use an adaptive optimizer. The learning rate of the optimizer is set to 0.0002, and the exponential decay rate β is 0.5.
Let L be the probability matrix with which the discriminator judges that the input images belong to the original samples. The loss of the generator is defined as the cross entropy between L and an all-ones matrix. The loss of the discriminator is defined as the sum of the loss on generated samples and the loss on original samples: the loss on generated samples is the cross entropy between L and a zero matrix, and the loss on original samples is the cross entropy between L and an all-ones matrix. The loss of the classifier is defined as the cross entropy between the predicted class and the true class of an image; meanwhile, the classification accuracy is predicted during training, and the classifier optimizes itself according to the accuracy value.
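The cross-entropy losses described above can be sketched in numpy as follows. This is a minimal binary cross-entropy version against all-ones and all-zeros targets, as the text specifies; the classifier's categorical cross entropy over 10 classes is analogous but not shown. The example score matrices are illustrative values, not patent data.

```python
import numpy as np

def cross_entropy(p, target):
    """Mean binary cross entropy between a probability matrix p and a
    target matrix of the same shape (all-ones or all-zeros here)."""
    eps = 1e-12
    p = np.clip(p, eps, 1.0 - eps)
    return float(np.mean(-(target * np.log(p) + (1 - target) * np.log(1 - p))))

def generator_loss(L_fake):
    # generator wants generated samples judged as original: target all ones
    return cross_entropy(L_fake, np.ones_like(L_fake))

def discriminator_loss(L_fake, L_real):
    # generated samples against a zero matrix plus originals against ones
    return cross_entropy(L_fake, np.zeros_like(L_fake)) + \
           cross_entropy(L_real, np.ones_like(L_real))

L_fake = np.full((4, 1), 0.2)  # discriminator scores for generated samples
L_real = np.full((4, 1), 0.9)  # discriminator scores for original samples
print(round(generator_loss(L_fake), 4))            # 1.6094 (= -ln 0.2)
print(round(discriminator_loss(L_fake, L_real), 4))  # 0.3285
```

As the generator fools the discriminator (L_fake rising toward 1), its loss falls while the discriminator's loss on generated samples rises, which is the adversarial game the patent relies on.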
Step 5): Structural similarity computation. Structural similarity is introduced into the discriminator, comparing images by the three factors of luminance, contrast, and structure, so that the generated images fit the original images more quickly. Fig. 3 shows the structural similarity values between generated images and all original images for the different action libraries.
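A single-window numpy sketch of the SSIM comparison, following the standard luminance-contrast-structure decomposition; the constants assume an 8-bit pixel range with the usual K1 = 0.01, K2 = 0.03, and C3 is taken as C2/2 (which merges the contrast and structure factors, a common simplification), since the patent does not give these values:

```python
import numpy as np

def ssim(x, y, c1=6.5025, c2=58.5225):
    """Global (single-window) SSIM between two images x and y."""
    x = x.astype(float)
    y = y.astype(float)
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    lum = (2 * mx * my + c1) / (mx**2 + my**2 + c1)      # luminance L
    con_struct = (2 * cov + c2) / (vx + vy + c2)          # C * S, C3 = c2/2
    return lum * con_struct

img = np.arange(64, dtype=float).reshape(8, 8)
print(round(ssim(img, img), 6))          # 1.0 for identical images
print(ssim(img, np.zeros((8, 8))) < 1)   # True: dissimilar images score lower
```

Identical images score 1, and the score drops as the generated image diverges from the original, which is what lets the discriminator push the generator toward the original distribution.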
Step 6): Optimize the objective function as in formula (8). The optimization of the classifier in the network is unchanged; the loss of the generator adds, on top of its original optimization, the cross entropy between the structural similarity value and an all-ones matrix; the loss of the discriminator adds the cross entropy between the mean structural similarity and a zero matrix. The loss functions are optimized with the Adam optimizer. Fig. 4 shows how the discrimination loss and the generation loss change as the number of iterations increases.
Step 7): According to the computed objective values, the generator, the discriminator, and the classifier are continually updated during the iterations; the best generated images are sought, and features are extracted to realize classification.
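The seven-step procedure amounts to an alternating update loop. The skeleton below is purely illustrative: the three update stubs return dummy losses, standing in for one Adam step per network (learning rate 0.0002, β = 0.5, per the embodiment); none of these function names come from the patent.

```python
import random

def train(iterations=6000, seed=0):
    """Skeleton of the alternating GAN-with-classifier training loop."""
    rng = random.Random(seed)

    def update_discriminator():  # maximize V(D, G, C); real step: Adam
        return rng.random()

    def update_generator():      # minimize V(D, G, C) plus the SSIM term
        return rng.random()

    def update_classifier():     # minimize the classification cross entropy
        return rng.random()

    history = []
    for _ in range(iterations):
        d_loss = update_discriminator()
        g_loss = update_generator()
        c_loss = update_classifier()
        history.append((d_loss, g_loss, c_loss))
    return history

history = train(iterations=10)
print(len(history))  # 10
```

The embodiment's 6000 iterations correspond to 6000 passes through this loop, with Fig. 4 plotting the d_loss and g_loss curves.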
The effect of the method of the invention is verified experimentally below. The classifier of the model is trained with the generated image set, the original image set, and the mixed set of generated and original images respectively; each image set contains 10 action classes, and the image size is 120 × 68. A single variable is set in training, i.e., only the image source differs, and the number of iterations is 6000. With a training set of 2560 images and a test set of 640 images, the recognition results are shown in Table 1:
Table 1. Recognition results with a small number of test samples
Table 1 shows that although the ratio of original to generated images in the training data differs, the overall recognition rate of the model is high; meanwhile, using the mixed image set improves the convergence speed of the model. To improve the robustness of the network model, with the total number of training and test samples unchanged, the number of training samples is reduced and the number of test samples is increased, i.e., the ratio of training-set to test-set images becomes 1:15; the experimental results are shown in Table 2. Because the training data are few, the model converges faster; this also shows that when the original training data are scarce, generated samples can serve as part of the original samples to train the network model.
Table 2. Recognition results with a small number of training samples
The experimental data show that a model trained only on the original image set and one trained only on the generated image set converge at similar rates, but in testing the recognition rate of the model trained only on generated images is somewhat lower. When the mixed set of original and generated images is used for training, the model converges noticeably faster, and its recognition rate is comparable to that of training on the original image set. Image generation can thus effectively augment small samples and improve recognition efficiency. The method of the invention not only generates a large number of natural images but also improves the recognition rate and recognition efficiency. The experimental configuration is: Windows 10 64-bit operating system, quad-core Xeon i5 CPU at 2.2 GHz, 32 GB RAM, NVIDIA GeForce GTX 1080Ti graphics card, supporting environment Anaconda 3, TensorFlow 0.14.1, CUDA 9.0, Cudnn 7.0, and a Windows Server 2012 or later network environment.
The method of the invention is based on a generative adversarial network, in which the generator and the discriminator compete with each other; it not only generates a large number of natural images but also resolves the deep network's need for a large number of samples. Introducing structural similarity improves the quality of the generated images and accelerates image generation. Combining image generation with recognition avoids the time-consuming and laborious large-scale collection and annotation of images; training the network jointly on generated and original images shortens the training time and improves the recognition efficiency and recognition rate of the algorithm.
The above is only a preferred embodiment of the invention, but the scope of protection of the invention is not limited thereto. Any change or substitution that can easily be conceived by a person skilled in the art within the technical scope disclosed by the invention shall be covered by the scope of protection of the invention. Therefore, the scope of protection of the invention shall be subject to the scope of protection of the claims.
Claims (9)
1. A human action recognition method based on a generative adversarial network, characterized by comprising the following steps:
Step 1) constructing a generator G, taking noise and a label indicating the action class as the input of the generator G, gradually fitting the conditional distribution pg(x|y) of the generator G to the conditional distribution p(x|y) of the original images, and generating new samples G(z|y) to form generated images;
Step 2) constructing a discriminator D, computing the probability that an input image comes from the original image set, and judging from this probability whether a generated image belongs to the original image set;
Step 3) constructing a classifier C, using the ability of a convolutional neural network to extract features automatically to classify the mixed set of generated and original images and determine the class of each image;
Step 4) defining the objective function: in each iteration, computing the objective values of the generator G, the discriminator D, and the classifier C separately, the generator G minimizing the objective while the discriminator D and the classifier C maximize it, and performing the optimization operations to form the objective function V(D, G, C);
Step 5) computing structural similarity: introducing structural similarity into the discriminator D, comparing images by the three factors of luminance, contrast, and structure, so that the generated images fit the original images more quickly;
Step 6) optimizing the objective function: adding a structural similarity function to the original objectives of the generator G, the discriminator D, and the classifier C, judging the similarity between the generated images and the original images, and improving the quality of the generated images;
Step 7) according to the computed objective values, continually updating the generator G, the discriminator D, and the classifier C during the iterations, seeking the best generated images, and extracting features to realize classification.
2. The human action recognition method based on a generative adversarial network according to claim 1, characterized in that in Step 1): the generator G combines the input noise with the corresponding label and reshapes the dimensions; the generator G produces an action of the corresponding class according to the label; and the generated images are labeled according to their original classes.
3. The human action recognition method based on a generative adversarial network according to claim 2, characterized in that the generator G comprises 3 deconvolution layers, each with a kernel size of 5 × 5 and a stride of 2 × 2; the dimension of the label is 64 × 1 × 1 × 10; after each deconvolution operation the generator G applies Leaky ReLU activation and normalization to output the first-layer features; and the output of each layer serves as the input of the next layer, finally completing the generation of the image.
4. The human action recognition method based on a generative adversarial network according to claim 1, characterized in that in Step 2) the discriminator, taking the original images as reference, computes the probability k that a generated image belongs to the original image set; when the probability value k lies in [0, 0.5), the discriminator decides that the generated image does not belong to the original image set; and when k lies in [0.5, 1], the discriminator decides that the generated image belongs to the original image set.
5. The human action recognition method based on a generative adversarial network according to claim 4, characterized in that the discriminator comprises 3 convolution layers with a kernel size of 5 × 5 and a stride of 2 × 2; the discriminator normalizes the input image after convolution and applies a ReLU activation function; and the output of the ReLU activation, combined with the label, judges whether a generated image matches its class.
6. The human action recognition method based on a generative adversarial network according to claim 1, characterized in that in Step 4) the objective function V(D, G, C) is optimized according to formula (1), wherein pz(z) is the prior distribution of the input noise z in the generator G, pdata(x) is the learned distribution of the original images, G(z|y) is the new sample finally generated by the generator G, D(x|y) denotes the probability that an image x of class y comes from the original images, the classifier C combines the generated-image distribution pg(x) and the original-image distribution pdata(x) for classification, and Cy(x) denotes the probability that the classifier predicts the image x to be of the correct class y.
7. The human motion recognition method based on a generative adversarial network according to claim 6, characterized in that in step 4) the optimization of the objective function V(D, G, C) is expressed in terms of loss values, the optimization objective being to minimize the loss values of the generator, the discriminator and the classifier; letting L denote the probability matrix with which the discriminator judges the input image to belong to the original samples, the loss value of the generator is defined as the cross entropy between L and an all-ones matrix; the loss value of the discriminator is defined as the sum of the loss on generated samples and the loss on original samples, the loss on generated samples being the cross entropy between L and a zero matrix and the loss on original samples being the cross entropy between L and an all-ones matrix; the loss value of the classifier is defined as the cross entropy between the predicted class of an image and its true class, and during training the classifier additionally optimizes itself according to the accuracy of its class predictions.
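The loss definitions in claim 7 reduce to binary and categorical cross entropies. A minimal NumPy sketch, assuming a mean reduction and illustrative score values (neither is specified in the claim):

```python
import numpy as np

def bce(p, target, eps=1e-12):
    """Elementwise binary cross entropy with mean reduction (assumed)."""
    p = np.clip(p, eps, 1 - eps)
    return float(-np.mean(target * np.log(p) + (1 - target) * np.log(1 - p)))

L_fake = np.full((4, 1), 0.3)  # discriminator scores on generated samples
L_real = np.full((4, 1), 0.8)  # discriminator scores on original samples
ones, zeros = np.ones_like(L_fake), np.zeros_like(L_fake)

# Generator loss: cross entropy of L with an all-ones matrix.
gen_loss = bce(L_fake, ones)
# Discriminator loss: generated samples vs. zeros plus original samples vs. ones.
disc_loss = bce(L_fake, zeros) + bce(L_real, ones)

# Classifier loss: cross entropy of predicted class distribution vs. true class.
probs = np.array([[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]])  # illustrative predictions
labels = np.array([0, 1])
cls_loss = float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))
```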
8. The human motion recognition method based on a generative adversarial network according to claim 1, characterized in that in step 5) the three components of the structural similarity are the luminance L, the contrast C and the image structure S, given by formulas (2), (3) and (4) respectively, wherein u_x and u_G(z) are the means of the images x and G(z), σ_xG(z) is the covariance of x and G(z), σ² denotes variance, and C1, C2 and C3 are constants; S(x, G(z)) denotes the structure similarity function of the images x and G(z), C(x, G(z)) denotes their contrast similarity function, and L(x, G(z)) denotes their luminance similarity function; combining L(x, G(z)), C(x, G(z)) and S(x, G(z)) according to formula (5) yields the structural similarity function of the original image x and the generated image G(z):
SSIM(x, G(z)) = L(x, G(z)) · C(x, G(z)) · S(x, G(z))   (5).
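Formulas (2)–(5) match the standard global SSIM of Wang et al. The sketch below implements them under the common convention C1 = (0.01·255)², C2 = (0.03·255)² and C3 = C2/2 — an assumption, since the claim only states that C1, C2 and C3 are constants:

```python
import numpy as np

def ssim(x, gz, c1=(0.01 * 255) ** 2, c2=(0.03 * 255) ** 2):
    """Global SSIM between image x and generated image G(z); C3 = C2/2 assumed."""
    x, gz = x.astype(float), gz.astype(float)
    ux, ug = x.mean(), gz.mean()          # means u_x, u_G(z)
    vx, vg = x.var(), gz.var()            # variances
    cov = ((x - ux) * (gz - ug)).mean()   # covariance σ_xG(z)
    c3 = c2 / 2
    L = (2 * ux * ug + c1) / (ux ** 2 + ug ** 2 + c1)          # luminance, formula (2)
    C = (2 * np.sqrt(vx * vg) + c2) / (vx + vg + c2)           # contrast, formula (3)
    S = (cov + c3) / (np.sqrt(vx * vg) + c3)                   # structure, formula (4)
    return L * C * S                                            # formula (5)
```

For identical images the luminance, contrast and structure terms each equal 1, so SSIM(x, x) = 1, consistent with the bound 0 ≤ SSIM ≤ 1 cited in claim 9.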
9. The human motion recognition method based on a generative adversarial network according to claim 8, characterized in that in step 6) the objective function is optimized according to formula (6), wherein SSIM(x, G(z)) satisfies 0 ≤ SSIM(x, G(z)) ≤ 1; D(x|y) represents the probability that an image x of class y comes from the original images; C_y(x) represents the probability that the classifier predicts x as its correct class y; and G(z|y) denotes the new sample generated by the generator.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810941368.6A CN109190524B (en) | 2018-08-17 | 2018-08-17 | Human body action recognition method based on generation of confrontation network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109190524A true CN109190524A (en) | 2019-01-11 |
CN109190524B CN109190524B (en) | 2021-08-13 |
Family
ID=64918222
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810941368.6A Active CN109190524B (en) | 2018-08-17 | 2018-08-17 | Human body action recognition method based on generation of confrontation network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109190524B (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU2018100325A4 (en) * | 2018-03-15 | 2018-04-26 | Nian, Xilai MR | A New Method For Fast Images And Videos Coloring By Using Conditional Generative Adversarial Networks |
CN108171320A (en) * | 2017-12-06 | 2018-06-15 | 西安工业大学 | A kind of image area switching network and conversion method based on production confrontation network |
Non-Patent Citations (3)
Title |
---|
CHONGXUAN LI et al.: "Triple Generative Adversarial Nets", arXiv *
ZHEDONG ZHENG et al.: "Unlabeled Samples Generated by GAN Improve the Person Re-identification Baseline in vitro", arXiv *
WANG Wanliang et al.: "Research Progress of Generative Adversarial Networks", Journal on Communications *
Cited By (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109934117B (en) * | 2019-02-18 | 2021-04-27 | 北京联合大学 | Pedestrian re-identification detection method based on generation of countermeasure network |
CN109934117A (en) * | 2019-02-18 | 2019-06-25 | 北京联合大学 | Based on the pedestrian's weight recognition detection method for generating confrontation network |
CN110097185A (en) * | 2019-03-29 | 2019-08-06 | 北京大学 | A kind of Optimized model method and application based on generation confrontation network |
CN109978807B (en) * | 2019-04-01 | 2020-07-14 | 西北工业大学 | Shadow removing method based on generating type countermeasure network |
CN109978807A (en) * | 2019-04-01 | 2019-07-05 | 西北工业大学 | A kind of shadow removal method based on production confrontation network |
CN110147797A (en) * | 2019-04-12 | 2019-08-20 | 中国科学院软件研究所 | A kind of sketch completion and recognition methods and device based on production confrontation network |
CN110147797B (en) * | 2019-04-12 | 2021-06-01 | 中国科学院软件研究所 | Sketch complementing and identifying method and device based on generating type confrontation network |
CN110210429A (en) * | 2019-06-06 | 2019-09-06 | 山东大学 | A method of network is generated based on light stream, image, movement confrontation and improves anxiety, depression, angry facial expression recognition correct rate |
CN110210429B (en) * | 2019-06-06 | 2022-11-29 | 山东大学 | Method for generating network based on optical flow, image and motion confrontation to improve recognition accuracy rate of anxiety, depression and angry expression |
CN110309861A (en) * | 2019-06-10 | 2019-10-08 | 浙江大学 | A kind of multi-modal mankind's activity recognition methods based on generation confrontation network |
CN110309861B (en) * | 2019-06-10 | 2021-05-25 | 浙江大学 | Multi-modal human activity recognition method based on generation of confrontation network |
CN110414367B (en) * | 2019-07-04 | 2022-03-29 | 华中科技大学 | Time sequence behavior detection method based on GAN and SSN |
CN110414367A (en) * | 2019-07-04 | 2019-11-05 | 华中科技大学 | A kind of timing behavioral value method based on GAN and SSN |
CN110610082A (en) * | 2019-09-04 | 2019-12-24 | 笵成科技南京有限公司 | DNN-based system and method for passport to resist fuzzy attack |
CN110826059A (en) * | 2019-09-19 | 2020-02-21 | 浙江工业大学 | Method and device for defending black box attack facing malicious software image format detection model |
CN110826059B (en) * | 2019-09-19 | 2021-10-15 | 浙江工业大学 | Method and device for defending black box attack facing malicious software image format detection model |
CN110942001A (en) * | 2019-11-14 | 2020-03-31 | 五邑大学 | Method, device, equipment and storage medium for identifying finger vein data after expansion |
CN111126218B (en) * | 2019-12-12 | 2023-09-26 | 北京工业大学 | Human behavior recognition method based on zero sample learning |
CN111126218A (en) * | 2019-12-12 | 2020-05-08 | 北京工业大学 | Human behavior recognition method based on zero sample learning |
CN111612759A (en) * | 2020-05-19 | 2020-09-01 | 佛山科学技术学院 | Printed matter defect identification method based on deep convolution generation type countermeasure network |
CN111612759B (en) * | 2020-05-19 | 2023-11-07 | 佛山科学技术学院 | Printed matter defect identification method based on deep convolution generation type countermeasure network |
CN111860507A (en) * | 2020-07-20 | 2020-10-30 | 中国科学院重庆绿色智能技术研究院 | Compound image molecular structural formula extraction method based on counterstudy |
CN111860507B (en) * | 2020-07-20 | 2022-09-20 | 中国科学院重庆绿色智能技术研究院 | Compound image molecular structural formula extraction method based on counterstudy |
CN112149645A (en) * | 2020-11-10 | 2020-12-29 | 西北工业大学 | Human body posture key point identification method based on generation of confrontation learning and graph neural network |
CN112613445A (en) * | 2020-12-29 | 2021-04-06 | 深圳威富优房客科技有限公司 | Face image generation method and device, computer equipment and storage medium |
CN112613445B (en) * | 2020-12-29 | 2024-04-30 | 深圳威富优房客科技有限公司 | Face image generation method, device, computer equipment and storage medium |
CN112801998A (en) * | 2021-02-05 | 2021-05-14 | 展讯通信(上海)有限公司 | Printed circuit board detection method and device, computer equipment and storage medium |
CN113159047A (en) * | 2021-04-16 | 2021-07-23 | 南通大学 | Power transformation equipment infrared image temperature value identification method based on CGAN image amplification |
CN113052865B (en) * | 2021-04-16 | 2023-12-19 | 南通大学 | Power transmission line small sample temperature image amplification method based on image similarity |
CN113159047B (en) * | 2021-04-16 | 2024-02-06 | 南通大学 | Substation equipment infrared image temperature value identification method based on CGAN image amplification |
CN113052865A (en) * | 2021-04-16 | 2021-06-29 | 南通大学 | Power transmission line small sample temperature image amplification method based on image similarity |
CN113111803B (en) * | 2021-04-20 | 2022-03-22 | 复旦大学 | Small sample character and hand-drawn sketch identification method and device |
CN113111803A (en) * | 2021-04-20 | 2021-07-13 | 复旦大学 | Small sample character and hand-drawn sketch identification method and device |
CN113378718A (en) * | 2021-06-10 | 2021-09-10 | 中国石油大学(华东) | Action identification method based on generation of countermeasure network in WiFi environment |
CN113435297A (en) * | 2021-06-23 | 2021-09-24 | 佛山弘视智能信息科技有限公司 | Human body twitch detection method and device based on combination of monitoring camera and depth camera |
CN113688912A (en) * | 2021-08-26 | 2021-11-23 | 平安国际智慧城市科技股份有限公司 | Confrontation sample generation method, device, equipment and medium based on artificial intelligence |
CN113688912B (en) * | 2021-08-26 | 2024-01-05 | 平安国际智慧城市科技股份有限公司 | Method, device, equipment and medium for generating countermeasure sample based on artificial intelligence |
Also Published As
Publication number | Publication date |
---|---|
CN109190524B (en) | 2021-08-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109190524A (en) | A kind of human motion recognition method based on generation confrontation network | |
CN110287800B (en) | Remote sensing image scene classification method based on SGSE-GAN | |
CN109993102B (en) | Similar face retrieval method, device and storage medium | |
CN102831474B (en) | Improved fuzzy C-mean clustering method based on quantum particle swarm optimization | |
CN110163131B (en) | Human body action classification method based on hybrid convolutional neural network and ecological niche wolf optimization | |
CN110059625B (en) | Face training and recognition method based on mixup | |
CN111860278B (en) | Human behavior recognition algorithm based on deep learning | |
CN113255514B (en) | Behavior identification method based on local scene perception graph convolutional network | |
CN113032613B (en) | Three-dimensional model retrieval method based on interactive attention convolution neural network | |
CN113065520B (en) | Multi-mode data-oriented remote sensing image classification method | |
CN112733602B (en) | Relation-guided pedestrian attribute identification method | |
CN110334656A (en) | Multi-source Remote Sensing Images Clean water withdraw method and device based on information source probability weight | |
CN114049621A (en) | Cotton center identification and detection method based on Mask R-CNN | |
CN115690549A (en) | Target detection method for realizing multi-dimensional feature fusion based on parallel interaction architecture model | |
Yan et al. | MM-HAT: Transformer for millimeter-wave sensing based human activity recognition | |
CN114283326A (en) | Underwater target re-identification method combining local perception and high-order feature reconstruction | |
CN114299279A (en) | Unmarked group rhesus monkey motion amount estimation method based on face detection and recognition | |
CN109241315A (en) | A kind of fast face search method based on deep learning | |
Wei et al. | FS-Depth: Focal-and-Scale Depth Estimation from a Single Image in Unseen Indoor Scene | |
CN114998731B (en) | Intelligent terminal navigation scene perception recognition method | |
CN116363733A (en) | Facial expression prediction method based on dynamic distribution fusion | |
CN116071570A (en) | 3D target detection method under indoor scene | |
CN115544137A (en) | Mixed attribute data conversion method based on multi-view depth metric learning | |
Liu et al. | Small target detection for UAV aerial images based on improved YOLOv3 | |
Zhang et al. | Object detection based on deep learning and b-spline level set in color images |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |