CN110222645B - Gesture misidentification feature discovery method

Gesture misidentification feature discovery method

Info

Publication number
CN110222645B
CN110222645B (application CN201910496416.XA)
Authority
CN
China
Prior art keywords
matrix
dimension
gesture
misidentification
frequency
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201910496416.XA
Other languages
Chinese (zh)
Other versions
CN110222645A (en)
Inventor
孙元功
孙凯云
冯志全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Jinan
Original Assignee
University of Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Jinan filed Critical University of Jinan
Priority to CN201910496416.XA priority Critical patent/CN110222645B/en
Publication of CN110222645A publication Critical patent/CN110222645A/en
Application granted granted Critical
Publication of CN110222645B publication Critical patent/CN110222645B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413 Classification techniques relating to the classification model based on distances to training or reference patterns
    • G06F18/24147 Distances to closest patterns, e.g. nearest neighbour classification
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 Static hand or arm
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G06V40/28 Recognition of hand or arm movements, e.g. recognition of deaf sign language

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a gesture misrecognition feature matrix discovery method. Suppose m index-finger gesture pictures correctly recognized by a convolutional neural network form a set A, and n index-finger gesture pictures incorrectly recognized by the network form a set B. The feature values of the 7th fully-connected layer are extracted through a Python interface and stored in a matrix V. For any two input pictures i and j, where i ∈ A and j ∈ B, the misrecognition feature matrix Q_i is computed by the following steps: a. extract the feature values of i and j at the 7th fully-connected layer and store them in the matrix V; b. compute the values Z_i and Z_j obtained by feeding i and j into the Softmax function; c. sort the data in ascending order, describe Z_i and Z_j with curves, find the sharply changing features and their corresponding original dimensions from the trend of the curves, and collect these dimensions into a set C; d. loop from C_1 to C_4096, counting the occurrence count and frequency of each dimension; e. store the dimensions whose frequency exceeds 90% in a matrix Q; f. end. The invention can effectively extract the misrecognition feature matrix.

Description

Gesture misidentification feature discovery method
Technical Field
The invention relates to the technical field of image recognition, in particular to dynamic gesture recognition, and specifically to a method for discovering gesture misrecognition features.
Background
Dynamic gesture recognition has been widely studied for decades because it can provide a high level of human-computer interaction, but few researchers have focused on the recognition of similar gestures. In 2006, a dynamic Bayesian classifier was proposed to recognize similar gestures by combining motion-based and posture-based features, achieving a competitive classification rate [1]. Elmezain et al. [2] proposed a real-time recognition system covering 36 gestures that improves the accuracy on similar gestures mainly by building a combined feature set of location, direction and speed features. Ding [3] observed that a database of the digits 0-9 and letters A-Z contains many gestures of similar shape, such as S and 5 or Z and 2, which have a low recognition rate, and therefore proposed a new way of distinguishing similar gestures: the motion trajectory information in three-dimensional space is captured first and quantized into motion features, after which the gestures are modeled and classified with a Hidden Markov Model (HMM). Experimental results show that this method achieves a high recognition rate and generalizes well. In summary, research on similar gestures is still limited, and most existing methods reduce similarity as far as possible by combining multiple gesture features or multiple classifiers, which improves the recognition rate to some extent but does not solve the similarity problem at its root.
At present, many gesture recognition methods have been studied, but whether they are based on geometric features or on machine learning, few works examine the error mechanism behind misrecognized gestures, and methods for automatically detecting and correcting erroneous gestures are therefore lacking. Correcting misrecognition is not only an effective way to improve the recognition rate but also a breakthrough point for revealing perceptual intelligence. The invention therefore proposes a gesture misrecognition feature discovery method built on convolutional-neural-network gesture recognition, so that the mechanism of erroneous gestures can be studied further on this basis.
The gesture recognition method of the invention trains its model on the AlexNet network. The AlexNet network model and the gesture model training method are introduced as follows:
AlexNet is the network structure introduced by Alex Krizhevsky of the University of Toronto in the paper "ImageNet Classification with Deep Convolutional Neural Networks". AlexNet was the first network to successfully apply tricks such as ReLU, Dropout and LRN in a CNN, and it also uses GPUs to accelerate computation. The method adopts the AlexNet network model and performs a limited number of training iterations with optimized solver parameters, while ensuring that the loss value decreases as the number of iterations increases. The optimal test network model is selected according to two quantities recorded at each iteration: the accuracy and the loss value. The convolution and fully-connected pipeline is shown as C1-FC7 in FIG. 1. The input image in FIG. 1 has size 227 × 227 × 3 (width × height × number of channels). The convolution kernels are drawn in dark color; their size includes width, height and thickness, the thickness being equal to the number of channels of the image being convolved, and the number of kernels equal to the number of channels output by the convolution. The figure makes clear that the image size changes after each convolution: for example, after the original image passes through C1 the size becomes 55 × 55 × 96, where 96 is the number of convolution kernels; the computation is split across two graphics cards, with 48 feature maps on each card. Assuming the image is an N × N matrix, the convolution kernel a K × K matrix, the edge-pixel padding P and the convolution stride S, the width and height M × M of the image after one such convolution layer is calculated according to equation (1):
$$M = \frac{N - K + 2P}{S} + 1 \qquad (1)$$
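As a quick check of equation (1), the following minimal Python sketch reproduces the 55 × 55 output size quoted above for C1; the stride 4 and padding 0 are assumptions taken from the standard AlexNet configuration, since Table 1 below does not list them:

```python
def conv_output_size(n, k, p, s):
    """Equation (1): output width/height M for an N x N input,
    a K x K kernel, padding P and stride S."""
    return (n - k + 2 * p) // s + 1

# C1: 227 x 227 input, 11 x 11 kernel (Table 1); stride 4 and
# padding 0 are assumed from the standard AlexNet configuration.
print(conv_output_size(n=227, k=11, p=0, s=4))  # 55 -> 55 x 55 x 96
```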
The convolutional layers are C1 … C5 in FIG. 1, five layers in total; their attributes are listed in Table 1. Every position of the same feature map is produced by the same convolution kernel, so weight sharing reduces the number of parameters to be trained and saves time. C5 is followed by the fully-connected layer FC6: the kernel size of the full convolution is 13 × 13 × 256, 4096 full-convolution operations are performed on the input by these kernels, and the result is a column vector of 4096 values. FC8 has 7 neurons, one for each of the 7 classes in the label set; each output value lies in the interval [0, 1] and represents the probability of the corresponding class, and the class with the maximum probability is taken as the final recognition result. The convolution kernel parameters adopted in FIG. 1 are shown in Table 1.
TABLE 1
Convolutional layer | Number of convolution kernels | Width | Height | Thickness
C1 | 96 | 11 | 11 | 3
C2 | 256 | 5 | 5 | 48
C3 | 384 | 3 | 3 | 256
C4 | 384 | 3 | 3 | 192
C5 | 256 | 3 | 3 | 192
In the invention, each class of training samples numbers 20k, and the validation set numbers 2k. The base learning rate is 0.01. Every 500 training iterations, the accuracy on the validation set is tested with the current parameters. Finally, the optimal model is selected according to the loss value and the accuracy. The above is the gesture model training process of the invention.
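The model-selection rule just described can be sketched as follows; the record format is hypothetical, since the patent only states that the optimal model is chosen from the accuracy and loss of the periodic validation tests:

```python
# Hypothetical records of (iteration, validation_accuracy, loss),
# one per 500-iteration validation test.
checkpoints = [
    (500, 0.91, 0.42),
    (1000, 0.94, 0.31),
    (1500, 0.94, 0.28),
]

# Select the highest accuracy, breaking ties by the lowest loss.
best = max(checkpoints, key=lambda c: (c[1], -c[2]))
print(best)  # (1500, 0.94, 0.28)
```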
In neural networks, the softmax function is mainly used in multi-class classification: it maps the outputs of multiple neurons into the interval [0, 1]. Suppose we extract the fully-connected data of an image at the last layer and represent it by a one-dimensional matrix V, where V_l denotes the value of the l-th element of V, the range of l is determined by the number of labels of the model, and W_l denotes the corresponding weight parameter in softmax. Softmax is then expressed by the following equation:
$$S_l = \frac{e^{Z_l}}{\sum_{l'} e^{Z_{l'}}} \qquad (2)$$
$$Z_l = \sum V_l \cdot W_l \qquad (3)$$
the Softmax classification result is to select the class with the maximum probability, namely the maximum value S in the formula (2) l The corresponding l. From this formula, S l Value of and Z l In a direct proportional relationship. Because the value on the denominator is invariant and exp () is an increasing function, Z l The size of the value determines the result of the final classification, corresponding to Z l The largest is the final classification result of the image. Wherein Z l Can be calculated by the formula (3), and W is known from the above definition l Is a weight parameter, V l Is a set of input feature data and a more intuitive classification process is shown in fig. 2.
According to the method, the same gesture data set is first divided into a correct class and an error class according to the output of the gesture recognition model. Based on the above analysis, the value of Z depends on V, so the feature value V of the 7th fully-connected layer of each image is extracted before classification, and its distribution is represented by a curve-fitting function. By observing the curve distributions of the two recognition outcomes for the same gesture class, the invention finds that V contains common feature dimensions T that strongly influence the recognition result, and defines the matrix formed by those dimensions whose frequency among these influential dimensions exceeds 90% as the misrecognition feature matrix Q.
How to obtain the misrecognition feature matrix Q is the technical problem to be solved by the invention.
Disclosure of Invention
Addressing the above technical problem, the invention provides a gesture misrecognition feature matrix discovery method by which the feature dimensions that strongly influence gesture misrecognition can be extracted.
The invention is realized by the following technical scheme. A gesture misrecognition feature matrix discovery method is provided: suppose m index-finger gesture pictures correctly recognized by a convolutional neural network form a set A and n index-finger gesture pictures incorrectly recognized by the network form a set B; the feature values of the 7th fully-connected layer are extracted through a Python interface and stored in a matrix V; for any two input pictures i and j, where i ∈ A and j ∈ B, the misrecognition feature matrix Q_i is computed by the following steps (see the sketch after this list):
a. extract the feature values of i and j at the 7th fully-connected layer, and store them in the matrix V;
b. compute the values Z_i and Z_j obtained by feeding i and j into the Softmax function;
c. sort the data in ascending order, describe Z_i and Z_j with curves, find the sharply changing features and their corresponding original dimensions from the trend of the curves, and collect these dimensions into a set C;
d. loop from C_1 to C_4096, counting the occurrence count and frequency of each dimension;
e. store the dimensions whose frequency exceeds 90% in the matrix Q;
f. end.
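For concreteness, a minimal Python/numpy sketch of steps a-e is given below. The FC7 feature extraction is stubbed out because it depends on the trained network and its Python interface; the "sharply changing" criterion is an assumption, implemented here as the first and last k positions of the sorted curve (k = 100, following the first-100/last-100 observation in the detailed description); function and variable names are illustrative:

```python
import numpy as np

def extract_fc7(picture):
    """Step a: return the 4096-dim FC7 feature vector of a picture.
    Stubbed out -- in the patent this uses the Python interface of
    the trained AlexNet model."""
    raise NotImplementedError

def z_values(v, w_label):
    """Step b: per-dimension contributions V_l * W_l for one label;
    their sum is Z in equation (3)."""
    return v * w_label

def sharp_dims(z, k=100):
    """Step c: sort ascending and return the original dimensions of
    the k smallest and k largest values, where the curve changes
    sharply (k = 100 is assumed from the embodiment)."""
    order = np.argsort(z)
    return set(order[:k].tolist()) | set(order[-k:].tolist())

def misrecognition_matrix(dim_sets, thresh=0.9):
    """Steps d-e: count how often each of the 4096 dimensions occurs
    across all sets C and keep those whose frequency exceeds 90%."""
    counts = np.zeros(4096)
    for c in dim_sets:
        counts[list(c)] += 1
    return np.flatnonzero(counts / len(dim_sets) > thresh)  # matrix Q
```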
Preferably, the value of m is 999, and the value of n is 999.
In conclusion, the method effectively extracts the misrecognition feature matrix, from which the feature dimensions that strongly influence the recognition result can be studied further.
Drawings
FIG. 1 is a schematic diagram of the AlexNet network structure of the present invention;
FIG. 2 is a schematic diagram of the softmax classification process of the present invention;
FIG. 3 is a diagram illustrating a variation curve of the misrecognition feature corresponding to the set A in the present invention;
FIG. 4 is a diagram illustrating a variation curve of the misrecognized feature corresponding to the set B in the present invention.
Detailed Description
In order to clearly illustrate the technical features of the present invention, the present invention is further illustrated by the following detailed description with reference to the accompanying drawings.
A gesture misrecognition feature matrix discovery method: suppose m index-finger gesture pictures correctly recognized by a convolutional neural network form a set A and n index-finger gesture pictures incorrectly recognized by the network form a set B; the feature values of the 7th fully-connected layer are extracted through a Python interface and stored in a matrix V; any two pictures i and j are input, where i ∈ A and j ∈ B, and the misrecognition feature matrix Q_i is computed by the following steps:
a. extract the feature values of i and j at the 7th fully-connected layer, and store them in the matrix V;
b. compute the values Z_i and Z_j obtained by feeding i and j into the Softmax function;
c. sort the data in ascending order, describe Z_i and Z_j with curves, find the sharply changing features and their corresponding original dimensions from the trend of the curves, and collect these dimensions into a set C;
d. loop from C_1 to C_4096, counting the occurrence count and frequency of each dimension;
e. store the dimensions whose frequency exceeds 90% in the matrix Q;
f. end.
In this embodiment, m = 999 and n = 999; that is, 999 pictures correctly recognized by the CNN are selected to form the set A, and 999 pictures mistakenly recognized by the CNN as a thumb gesture are selected to form the set B. To elaborate: first, Z_i is computed for each picture in the sets A and B, where the label i takes the value 2, representing the index finger. The Z_i of each set is stored in a 999 × 4096 matrix, in which a row indexes the picture and the columns hold that picture's Z_i values, 4096 data in total. To observe how the feature values change, the 4096 data are sorted by size and the change is described by curve fitting. FIGS. 3 and 4 show the feature variation derived from the pictures in the sets A and B respectively, with the horizontal axis the data index and the vertical axis the Z_i value; the regions where the slope of the curve is large are marked in dark color for both the A and B sets.
Since the sum of these values determines whether the result is the index finger (or another class), the larger the values, the greater the probability of a correct final recognition. Comparing FIG. 3 with FIG. 4 shows that the curvature of the curve changes dramatically in the first 100 and last 100 dimensions, while the values in the middle are relatively stable. Because the sharply changing values affect the final recognition result, the invention calls these dimensions the features with large influence factors. Finally, a 999 × 200 matrix is used to represent, for each picture, the original dimensions corresponding to these 200 positions. Although the 200 dimensions of each picture are not exactly the same, common dimensions exist, so the invention counts the dimensions whose frequency exceeds 90%, as shown in Table 2 below. The feature values at these dimensions influence the correctness of the final classification, so the invention calls the array formed by these dimensions the misrecognition feature matrix.
TABLE 2
(Table 2, listing the dimensions whose frequency exceeds 90%, is reproduced only as an image in the original document; its entries are not recoverable here.)
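To illustrate the counting step of this embodiment, the following snippet builds the 999 × 200 dimension matrix from randomly generated data (purely illustrative; with real features the per-picture sets share many dimensions) and extracts the dimensions with frequency above 90%:

```python
import numpy as np

rng = np.random.default_rng(1)
# 999 pictures, each contributing the 200 original dimensions found
# at the head and tail of its sorted curve (random placeholder data).
dim_matrix = np.stack([rng.choice(4096, size=200, replace=False)
                       for _ in range(999)])

counts = np.bincount(dim_matrix.ravel(), minlength=4096)
Q = np.flatnonzero(counts / 999 > 0.9)  # misrecognition feature matrix Q
print(Q)  # empty for random data; real pictures share common dimensions
```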
Finally, it should be noted that the invention is not limited to the above embodiments; technical features of the invention that are not described in detail may be implemented with the prior art and are not repeated here. The above embodiments and drawings are intended only to illustrate the technical solutions of the invention, not to limit it. Although the invention has been described in detail with reference to preferred embodiments, those skilled in the art should understand that changes, modifications, additions or substitutions made within the spirit and scope of the invention also fall within the scope of its claims. It should also be noted that the invention cites other gesture recognition methods; the cited documents are as follows:
[1] Aviles-Arriaga H H, Sucar L E, Mendoza C E. Visual recognition of similar gestures[C]. 18th International Conference on Pattern Recognition (ICPR'06). IEEE, 2006, 1: 1100-1103.
[2] Elmezain M, Al-Hamadi A, Michaelis B. Hand gesture recognition based on combined features extraction[J]. Journal of World Academy of Science, Engineering and Technology, 2009, 60: 395.
[3] Ding Z, Chen Y, Chen Y L, et al. Similar hand gesture recognition by automatically extracting distinctive features[J]. International Journal of Control, Automation and Systems, 2017, 15(4): 1770-1778.

Claims (2)

1. A gesture misrecognition feature discovery method, characterized in that: suppose m index-finger gesture pictures correctly recognized by a convolutional neural network form a set A and n index-finger gesture pictures incorrectly recognized by the network form a set B; the feature values of the 7th fully-connected layer are extracted through a Python interface and stored in a matrix V; any two pictures i and j are input, where i ∈ A and j ∈ B, and the misrecognition feature matrix Q_i is computed by the following steps:
a. extract the feature values of i and j at the 7th fully-connected layer, and store them in the matrix V;
b. compute the values Z_i and Z_j obtained by feeding i and j into the Softmax function;
c. sort the data in ascending order, describe Z_i and Z_j with curves, find the sharply changing features and their corresponding original dimensions from the trend of the curves, and collect these dimensions into a set C;
d. loop from C_1 to C_4096, counting the occurrence count and frequency of each dimension;
e. store the dimensions whose frequency exceeds 90% in the matrix Q;
f. end.
2. The method according to claim 1, wherein m is 999 and n is 999.
CN201910496416.XA 2019-06-10 2019-06-10 Gesture misidentification feature discovery method Expired - Fee Related CN110222645B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910496416.XA CN110222645B (en) 2019-06-10 2019-06-10 Gesture misidentification feature discovery method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910496416.XA CN110222645B (en) 2019-06-10 2019-06-10 Gesture misidentification feature discovery method

Publications (2)

Publication Number Publication Date
CN110222645A CN110222645A (en) 2019-09-10
CN110222645B true CN110222645B (en) 2022-09-27

Family

ID=67816148

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910496416.XA Expired - Fee Related CN110222645B (en) 2019-06-10 2019-06-10 Gesture misidentification feature discovery method

Country Status (1)

Country Link
CN (1) CN110222645B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112101236A (en) * 2020-09-17 2020-12-18 济南大学 Intelligent error correction method and system for elderly accompanying robot
CN112100075B (en) * 2020-09-24 2024-03-15 腾讯科技(深圳)有限公司 User interface playback method, device, equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529470A (en) * 2016-11-09 2017-03-22 济南大学 Gesture recognition method based on multistage depth convolution neural network
CN109190443A (en) * 2018-06-27 2019-01-11 济南大学 It is a kind of accidentally to know gestures detection and error correction method
WO2019080203A1 (en) * 2017-10-25 2019-05-02 南京阿凡达机器人科技有限公司 Gesture recognition method and system for robot, and robot

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106529470A (en) * 2016-11-09 2017-03-22 济南大学 Gesture recognition method based on multistage depth convolution neural network
WO2019080203A1 (en) * 2017-10-25 2019-05-02 南京阿凡达机器人科技有限公司 Gesture recognition method and system for robot, and robot
CN109190443A (en) * 2018-06-27 2019-01-11 济南大学 It is a kind of accidentally to know gestures detection and error correction method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Gesture recognition method based on convolutional neural networks; 杨文斌 et al.; Journal of Anhui Polytechnic University (《安徽工程大学学报》); 2018-02-15 (No. 01); full text *
Gesture recognition based on multi-column deep 3D convolutional neural networks; 易生 et al.; Computer Engineering (《计算机工程》); 2017-08-15 (No. 08); full text *

Also Published As

Publication number Publication date
CN110222645A (en) 2019-09-10

Similar Documents

Publication Publication Date Title
CN113378632B (en) Pseudo-label optimization-based unsupervised domain adaptive pedestrian re-identification method
US10289897B2 (en) Method and a system for face verification
CN106682598B (en) Multi-pose face feature point detection method based on cascade regression
Kim et al. Fusing aligned and non-aligned face information for automatic affect recognition in the wild: a deep learning approach
CN110321967B (en) Image classification improvement method based on convolutional neural network
CN102147858B (en) License plate character identification method
CN108664975B (en) Uyghur handwritten letter recognition method and system and electronic equipment
WO2014205231A1 (en) Deep learning framework for generic object detection
Zhang et al. Data driven feature selection for machine learning algorithms in computer vision
Richarz et al. Semi-supervised learning for character recognition in historical archive documents
CN105894050A (en) Multi-task learning based method for recognizing race and gender through human face image
CN109086660A (en) Training method, equipment and the storage medium of multi-task learning depth network
CN107704859A (en) A kind of character recognition method based on deep learning training framework
Escalera et al. Boosted Landmarks of Contextual Descriptors and Forest-ECOC: A novel framework to detect and classify objects in cluttered scenes
Dai Nguyen et al. Recognition of online handwritten math symbols using deep neural networks
CN110222645B (en) Gesture misidentification feature discovery method
CN112651323B (en) Chinese handwriting recognition method and system based on text line detection
EP2486518A1 (en) Method of computing global-to-local metrics for recognition
Korichi et al. Off-line Arabic handwriting recognition system based on ML-LPQ and classifiers combination
CN113420983B (en) Writing evaluation method, device, equipment and storage medium
Prasad et al. Multiple hidden Markov model post processed with support vector machine to recognize English handwritten numerals
Zhao Handwritten digit recognition and classification using machine learning
Elmezain et al. Posture and gesture recognition for human-computer interaction
Fan Efficient multiclass object detection by a hierarchy of classifiers
Rouabhi et al. Optimizing Handwritten Arabic Character Recognition: Feature Extraction, Concatenation, and PSO-Based Feature Selection.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20220927