CN110334584B - Gesture recognition method based on regional full convolution network - Google Patents
- Publication number
- CN110334584B (application CN201910419349.1A)
- Authority
- CN
- China
- Prior art keywords
- network
- candidate
- frame
- layer
- regional
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2413—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns
- G06F18/24133—Distances to prototypes
- G06F18/24137—Distances to cluster centroids
- G06F18/2414—Smoothing the distance, e.g. radial basis function networks [RBFN]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G06N3/084—Backpropagation, e.g. using gradient descent
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V10/00—Arrangements for image or video recognition or understanding
- G06V10/20—Image preprocessing
- G06V10/25—Determination of region of interest [ROI] or a volume of interest [VOI]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V40/00—Recognition of biometric, human-related or animal-related patterns in image or video data
- G06V40/20—Movements or behaviour, e.g. gesture recognition
- G06V40/28—Recognition of hand or arm movements, e.g. recognition of deaf sign language
Abstract
The invention discloses a gesture recognition method based on a regional full convolution network. Features are extracted from an input gesture image by a full convolution network to obtain a set of feature maps and generate candidate frames; a position-sensitive sub-network generates position-sensitive score maps, and a pooling layer scores each gesture category, so that the target gesture is both located and classified. The main characteristic of the invention is that the whole regional full convolution network is a shared fully convolutional structure trained end to end, which achieves a high recognition rate while avoiding heavy computation; combined with the OHEM technique, the network model rejects negative samples more reliably, which eases practical application and is of significance to the field of human-computer interaction.
Description
Technical Field
The invention relates to the technical field of computer vision, machine learning and pattern recognition, in particular to a method for realizing end-to-end gesture recognition by utilizing a regional full convolution network.
Background
Currently, with the increasing popularity of VR (Virtual Reality) and AR (Augmented Reality), human-computer interaction technology is receiving growing attention. Gestures, as the most direct and convenient mode of human-computer interaction, have attracted extensive research, and gesture recognition has gradually become an important research direction in computer vision. Accurately recognizing gestures is an essential part of any gesture-based human-computer interaction system. Because the human hand is a complex deformable object, gestures exhibit diversity, ambiguity and temporal variation, and they usually appear in complicated scenes, for example under over-bright or over-dark lighting, with multiple simultaneous gestures, or at varying distances between the hand and the device, so gesture recognition remains a substantial challenge.
Typical gesture recognition methods are mainly based on hidden Markov models, template matching, artificial neural networks and the like. These traditional methods share a drawback: features must be designed by hand and then extracted from the gesture for recognition, so the processing pipeline is complex and inefficient.
Disclosure of Invention
The invention aims to provide a gesture recognition method based on a regional full convolution network, so as to improve recognition efficiency and reduce calculation complexity.
In order to realize the task, the invention adopts the following technical scheme:
a gesture recognition method based on a regional full convolution network comprises the following steps:
Step 1, establishing a full convolution network
Using the residual network ResNet-34 architecture as a framework, changing the stride of the ResNet-34 network from 32 pixels to 16 pixels, deleting the average pooling layer and the fully connected layer of the ResNet-34 architecture, and then constructing a full convolution network from the convolution layers of ResNet-34 so as to extract features from the input image; the input image passes through the full convolution network and outputs a feature map, and each pixel point on the feature map generates a plurality of candidate frames for predicting the position of the coordinate frame;
step 2, establishing a regional candidate network
Establishing a regional candidate network, wherein the network comprises the last convolutional layer of the full convolution network, behind which two branches are arranged. One branch consists, in order, of a convolution layer, a first adjusting layer, a normalization layer and a second adjusting layer, and scores each candidate frame as foreground or background; the other branch is a convolution layer used to predict the offset between the candidate frame and the position of the real coordinate frame. The first and second adjusting layers change the dimensionality of the image, and the normalization layer performs the normalization operation;
Step 3, training the regional candidate network
Candidate frames are screened for training the regional candidate network according to the following rules:
if the overlapping rate of the candidate frame and the real coordinate frame is more than or equal to 0.7, the candidate frame is considered as a foreground; if the overlapping rate of the candidate frame and the real coordinate frame is less than 0.3, the candidate frame is considered as the background; training by taking candidate frames corresponding to the foreground and the background as training data of the regional candidate network, wherein the candidate frame corresponding to the foreground is a positive sample, and the candidate frame corresponding to the background is a negative sample; the loss function for the regional candidate network training is:
L=cls_loss+λ*reg_loss
wherein λ is an adjustable parameter; to train the regional candidate network, a binary class label is assigned to each candidate frame to be trained. Let p_i be the predicted probability that the i-th candidate frame belongs to the foreground and p_i* be its true label; then cls_loss is defined as:
cls_loss = -Σ_i [ p_i* · log(p_i) + (1 - p_i*) · log(1 - p_i) ]
reg_loss is used to regress the deviation between the candidate frame and the real coordinate frame, and is defined as:
reg_loss = Σ_{i ∈ {x, y, w, h}} smoothL1(t_i - t_i*), with smoothL1(x) = 0.5x² if |x| < 1, and |x| - 0.5 otherwise
where i ∈ {x, y, w, h}, t_i is the predicted offset between the candidate frame and the real coordinate frame in component i of [x, y, w, h], and t_i* is the true value of that offset; x and y denote coordinates, and w and h denote width and height;
the regional candidate network utilizes the loss function L to carry out end-to-end training by a back propagation and random gradient descent method, and weight is initialized by zero mean Gaussian distribution with standard deviation of 0.01;
step 4, constructing a position sensitive sub-network
The position-sensitive sub-network comprises a convolution layer connected after the last convolution layer of the full convolution network. After the input image is processed by the full convolution network, the output feature map is convolved by this layer to obtain the position-sensitive score maps; the convolution layer generates k²(c+1) position-sensitive score maps per gesture class set, where the k² score maps describe the relative positions of a k×k spatial grid and c is the number of classes of recognized objects;
step 5, pooling of location sensitive candidate frames
Outputting the deviation amount of the candidate frame and the real coordinate frame by the trained regional candidate network, wherein the deviation amount comprises the position information of the candidate frame region; according to the position information, corresponding the candidate frame to the position sensitive score map obtained in the step 4, wherein the candidate frame is divided into k × k sub-regions, and each sub-region corresponds to a region on the score map; the location sensitive subnetwork further comprises a pooling layer for implementing the following functions:
respectively extracting the position sensitive score map corresponding to each category from the candidate frame, respectively calculating the mean value of the extracted score maps, then forming a matrix according to the positions, and summing all values in the matrix to obtain a value; after all the categories are processed in the same way, all the obtained values jointly form an output vector, and the output vector is normalized, so that the category of the current candidate area is estimated;
and 6, training the network by using a database of the gesture pictures, and storing the trained network model for gesture classification.
The invention has the following technical characteristics:
the whole area full convolution network is a shared full convolution structure, the whole structure is end-to-end learning, high-precision recognition rate is achieved, meanwhile, complex calculation is avoided, and in combination with an OHEM technology, a network model has higher rejection rate on negative samples, and practical application is facilitated. The intelligent behavior analysis and post-processing method in the man-machine interaction system and the like has certain practical value for the intelligent construction in the fields of auxiliary automobile control systems, sign language recognition, personal wearing systems and the like.
Drawings
FIG. 1 is a block diagram of a network in the method of the present invention;
FIG. 2 is a block diagram of a regional candidate network;
FIG. 3 is a schematic diagram of candidate boxes obtained from a feature map;
FIG. 4 is a schematic diagram of seven gestures to be trained in an embodiment of the present invention;
FIG. 5 shows the correspondence between the 9 positions of the position sensitivity score map in gesture 1;
FIG. 6 shows the result of the gesture recognition test according to the present invention.
Detailed Description
The invention provides a gesture recognition method based on a regional full convolution network, which comprises the following steps of:
In this scheme, the residual network ResNet-34 architecture is used as a framework, the stride of the ResNet-34 network is changed from 32 pixels to 16 pixels, the average pooling layer and the fully connected layer of the ResNet-34 architecture are deleted, and a full convolution network is then constructed from the convolution layers of ResNet-34 to extract features from the input image.
As shown in fig. 1, the full convolution network in this scheme includes two parts: the first part is a convolution layer with a 7 × 7 kernel that processes the input image, and the second part is four groups of residual blocks of different depths built from 3 × 3 convolution kernels; residual blocks are the key structure the residual network uses to extract features.
After an input image passes through the full convolution network, the last convolution layer outputs a feature map, and each pixel point on the feature map generates 9 candidate frames for predicting the position of the coordinate frame; thus a total of w × h × 9 candidate frames are generated for a three-dimensional feature map of dimension w × h × d (width × height × depth). The candidate frames are rectangular and come in three shapes, with aspect ratios of 1:1, 1:2 and 2:1.
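The candidate-frame (anchor) layout above, 9 boxes per feature-map pixel at a 16-pixel stride, can be sketched as follows. The concrete scale values are illustrative assumptions; only the stride, the count and the aspect ratios come from the description:

```python
import numpy as np

def generate_anchors(feat_w, feat_h, stride=16,
                     scales=(64, 128, 256), ratios=(1.0, 0.5, 2.0)):
    """Return (feat_w * feat_h * 9, 4) candidate boxes as (x1, y1, x2, y2).

    Each feature-map pixel maps back to a stride x stride patch of the
    input image and spawns len(scales) * len(ratios) = 9 boxes.
    """
    anchors = []
    for y in range(feat_h):
        for x in range(feat_w):
            # center of the receptive patch in input-image coordinates
            cx, cy = x * stride + stride / 2, y * stride + stride / 2
            for s in scales:
                for r in ratios:
                    w, h = s * np.sqrt(r), s / np.sqrt(r)  # w/h == r
                    anchors.append([cx - w / 2, cy - h / 2,
                                    cx + w / 2, cy + h / 2])
    return np.asarray(anchors)
```

For a w × h feature map this yields exactly the w × h × 9 candidate frames described above.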
Step 2, establishing a regional candidate network
Establishing a regional candidate network, wherein the network comprises the last convolution layer of the full convolution network, behind which two branches are arranged. One branch consists, in order, of a convolution layer, a first adjusting layer, a normalization layer and a second adjusting layer, and is used to score, as foreground or background, the candidate frames generated by each pixel point on the feature map output by the convolution layer; the other branch is a convolution layer used to predict the offset between the position of the candidate frame and the position of the real coordinate frame. The first and second adjusting layers perform the Reshape operation, i.e. they change the dimensions of the image.
In this embodiment, the convolution layer of the first branch is a 1 × 1 convolution with 18 output channels. After the feature map output by the last convolution layer of the full convolution network passes through this layer, the resulting feature map has dimension (w, h, 9 × 2); it is then reshaped by the first adjustment layer, normalized by the normalization layer, and reshaped again by the second adjustment layer to yield the predicted foreground/background probabilities of the candidate frames. The convolution layer of the second branch is a 1 × 1 convolution with 36 output channels, and the feature map obtained after it can be represented as (w, h, 4 × 9), i.e. the offsets of the w × h × 9 candidate frames from the real coordinate frame position.
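A numpy sketch of the two branches, reading the 18 channels as 9 anchors × 2 foreground/background scores and the 36 channels as 9 anchors × 4 offsets. The random weights and the 14 × 14 × 512 feature-map size are placeholders, not values from the patent:

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1x1(feat, weights):
    """A 1x1 convolution is a per-pixel linear map over channels.
    feat: (h, w, d), weights: (d, out) -> (h, w, out)."""
    return feat @ weights

h, w, d = 14, 14, 512                       # backbone feature map (assumed)
feat = rng.standard_normal((h, w, d))

cls_logits = conv1x1(feat, rng.standard_normal((d, 18)))  # 9 anchors x 2
reg_offset = conv1x1(feat, rng.standard_normal((d, 36)))  # 9 anchors x 4

# Reshape + normalization + Reshape sequence: softmax over the two
# foreground/background scores of each anchor.
scores = cls_logits.reshape(h, w, 9, 2)
scores = np.exp(scores - scores.max(-1, keepdims=True))
scores /= scores.sum(-1, keepdims=True)     # fg/bg probability per anchor
```

The softmax here plays the role of the normalization layer sandwiched between the two Reshape (adjusting) layers.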
The screening candidate box is used for training the area candidate network, and the screening rule is as follows:
As shown in fig. 3, if the overlap rate of a candidate frame with the real coordinate frame is greater than or equal to 0.7, the candidate frame is considered foreground; if the overlap rate is less than 0.3, the candidate frame is considered background. The candidate frames corresponding to foreground and background are used as training data for the regional candidate network: a foreground candidate frame is a positive sample and corresponds to the category of the target gesture region, a background candidate frame is a negative sample, and the remaining candidate frames do not participate in training.
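The screening rule can be written directly from the stated thresholds. The IoU ("overlap rate") helper below is the standard intersection-over-union definition, assumed rather than quoted from the patent:

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def label_candidate(candidate, gt_box):
    """1 = foreground (positive), 0 = background (negative),
    None = overlap in [0.3, 0.7), excluded from training."""
    overlap = iou(candidate, gt_box)
    if overlap >= 0.7:
        return 1
    if overlap < 0.3:
        return 0
    return None
```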
The loss function of the regional candidate network training is divided into two parts: cls_loss and reg_loss.
cls_loss is used to classify a candidate frame as foreground/background. In this scheme, to train the regional candidate network, a binary class label (foreground: 1, background: 0) is assigned to each candidate frame to be trained. Let p_i be the predicted probability that the i-th candidate frame belongs to the foreground and p_i* be its true label (which can only be 0 or 1); then cls_loss is defined by the cross-entropy loss function:
cls_loss = -Σ_i [ p_i* · log(p_i) + (1 - p_i*) · log(1 - p_i) ]
reg_loss is used to regress the deviation between the candidate frame and the real coordinate frame; this regression task cannot use the cross-entropy loss function above, so the reg_loss function is defined as:
reg_loss = Σ_{i ∈ {x, y, w, h}} smoothL1(t_i - t_i*), with smoothL1(x) = 0.5x² if |x| < 1, and |x| - 0.5 otherwise
where i ∈ {x, y, w, h}, t_i is the predicted offset between the candidate frame and the real coordinate frame in component i of [x, y, w, h], and t_i* is the true value of that offset; x and y denote coordinates, and w and h denote width and height.
Because the two losses differ in order of magnitude, an adjustable parameter λ is used to balance them, so that both are weighted consistently in the total loss of the regional candidate network during training. The loss function L of the regional candidate network is defined as:
L=cls_loss+λ*reg_loss
The regional candidate network is trained end to end with the loss function L by back propagation and stochastic gradient descent, with weights initialized from a zero-mean Gaussian distribution with standard deviation 0.01; the parameters requiring initialization comprise the parameters of the full convolution network of step 1 and the parameters of the convolution layers in the regional candidate network.
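A minimal numpy version of the combined loss L = cls_loss + λ·reg_loss. Cross-entropy for cls_loss and a smooth-L1 penalty applied only to foreground candidates for reg_loss are standard-practice assumptions, since the equations are not reproduced legibly in this text:

```python
import numpy as np

def smooth_l1(x):
    """Smooth L1: quadratic near zero, linear in the tails."""
    ax = np.abs(x)
    return np.where(ax < 1, 0.5 * x * x, ax - 0.5)

def rpn_loss(p, p_star, t, t_star, lam=1.0):
    """L = cls_loss + lam * reg_loss for one minibatch of candidates.

    p:      (n,) predicted foreground probabilities
    p_star: (n,) ground-truth labels in {0, 1}
    t:      (n, 4) predicted offsets (x, y, w, h)
    t_star: (n, 4) ground-truth offsets
    lam:    balance parameter (its value is a tunable assumption)
    """
    eps = 1e-12
    cls_loss = -np.mean(p_star * np.log(p + eps)
                        + (1 - p_star) * np.log(1 - p + eps))
    # regression counts only for foreground (positive) candidates
    reg = smooth_l1(t - t_star).sum(axis=1)
    reg_loss = np.sum(p_star * reg) / max(p_star.sum(), 1)
    return cls_loss + lam * reg_loss
```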
Step 4, constructing a position sensitive sub-network
The position-sensitive sub-network comprises a convolution layer conv_L connected after the last convolution layer of the full convolution network; after the input image has been processed by the full convolution network, the output feature map is convolved by this layer to obtain the position-sensitive score maps. The convolution layer generates k²(c+1) position-sensitive score maps (c gesture classes plus 1 background class), where the k² score maps describe the relative positions of a k×k spatial grid.
The height and width of this convolution layer's output are the same as those of the last convolution layer of the full convolution network, but the number of channels is k²(c+1), where k is the number of grid cells per side and c is the number of categories of recognized objects, plus one background category. As shown in fig. 4, the recognition task of this scheme has seven gesture categories, so there are 8 categories in total, each with k² score maps. Taking gesture 1 as an example, each score map indicates which positions in the original input image contain a certain part of gesture 1, and responds strongly at positions containing the corresponding part. With k taken as 3, the original input image is divided into 9 different positions and has 9 position-sensitive score maps.
Step 5, pooling of location sensitive candidate frames
For each category, the corresponding position-sensitive score maps are extracted from the candidate frame region, the mean of each extracted score map is computed, the means are arranged into a matrix according to their positions, and all values in the matrix are summed to obtain a value S; after all categories are processed in the same way, the values S together form an output vector, which is normalized to estimate the category of the current candidate region.
In this embodiment, each category has 9 position-sensitive score maps. Taking gesture category 1 as an example, its 9 score maps are extracted from the candidate frame as shown in fig. 5; the extracted maps are averaged respectively, the averages are arranged into a 3 × 3 matrix according to position, and all values in the 3 × 3 matrix are summed to obtain a single value. Repeating these steps for categories 2-8 finally yields a 1 × 8 vector, which is normalized with softmax; the softmax response of each category is computed to estimate the category of the currently selected region, thereby outputting the prediction result.
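The position-sensitive pooling of steps 4-5 can be sketched as follows for k = 3 and 8 categories (7 gestures plus background). The H × W score-map size and the integer bin-edge rounding are illustrative assumptions:

```python
import numpy as np

def psroi_pool(score_maps, roi, k=3, num_classes=8):
    """Position-sensitive RoI pooling (sketch).

    score_maps: (H, W, k*k*num_classes) position-sensitive score maps
    roi:        (x1, y1, x2, y2) candidate box in score-map coordinates
    Returns a (num_classes,) vector: for each class, the k x k grid of
    bin averages is summed into a single value S.
    """
    x1, y1, x2, y2 = roi
    xs = np.linspace(x1, x2, k + 1).astype(int)   # bin edges along x
    ys = np.linspace(y1, y2, k + 1).astype(int)   # bin edges along y
    scores = np.zeros(num_classes)
    for c in range(num_classes):
        for gy in range(k):
            for gx in range(k):
                # bin (gy, gx) reads only its own score map for class c
                m = score_maps[:, :, c * k * k + gy * k + gx]
                cell = m[ys[gy]:max(ys[gy + 1], ys[gy] + 1),
                         xs[gx]:max(xs[gx + 1], xs[gx] + 1)]
                scores[c] += cell.mean()
    return scores

def softmax(v):
    """Normalize the 1 x num_classes vector into class probabilities."""
    e = np.exp(v - v.max())
    return e / e.sum()
```

The class with the largest softmax response is then taken as the category of the current candidate region.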
The candidate frame classification loss function in the position-sensitive sub-network is defined by the cross-entropy loss function as follows:
cls_loss(θ) = -Σ_{i=1..8} s_i · log(ŝ_i(θ))
where s_i is the true output belonging to class i in the 1 × 8 output vector, ŝ_i is the predicted output belonging to class i in the 1 × 8 output vector, and θ is the parameter set of the convolution layers of the whole network (full convolution network, regional candidate network, and position-sensitive sub-network).
And 6, training the whole network by using a database of gesture pictures, and storing the trained network model for gesture classification.
In this embodiment, the CGD database is used to train the network. The database contains thirty basic gesture actions, and picture sizes are normalized to 224 × 224. Since the regional candidate network and the position-sensitive sub-network share network parameters, the network only needs to be trained once; the network model is built with a deep learning framework, and 7 representative gestures are selected from the CGD database, as shown in fig. 4, with 8000 training images and 500 test images. The training period is set to 500; after the model is saved, an end-to-end test is performed, and when a picture containing a gesture is input, a result is output, as shown in fig. 6. As can be seen from the figure, both the type of the gesture and the position of its coordinate frame are recognized, so the method performs well.
The method adopts the OHEM (Online Hard Example Mining) technique: the loss function values of all gesture regions are computed, all regions are sorted by loss value, and the B regions with the largest losses are selected for back propagation.
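The OHEM selection step reduces to a sort-and-truncate over per-region losses. B and the loss values below are placeholders for illustration:

```python
import numpy as np

def ohem_select(losses, b):
    """Keep only the b hardest regions (largest loss) for backprop."""
    order = np.argsort(losses)[::-1]   # indices sorted by descending loss
    return order[:b]

losses = np.array([0.1, 2.3, 0.7, 1.5])  # per-region losses (placeholder)
hard = ohem_select(losses, 2)            # indices of the 2 hardest regions
```

Only the selected regions contribute gradients, which concentrates training on hard negatives and improves the model's rejection of negative samples.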
Claims (1)
1. A gesture recognition method based on a regional full convolution network is characterized by comprising the following steps:
step 1, establishing a full convolution network
Using the residual network ResNet-34 architecture as a framework, changing the stride of the ResNet-34 network from 32 pixels to 16 pixels, deleting the average pooling layer and the fully connected layer of the ResNet-34 architecture, and then constructing a full convolution network from the convolution layers of ResNet-34 so as to extract features from the input image; the input image passes through the full convolution network and then outputs a feature map, and each pixel point on the feature map generates a plurality of candidate frames for predicting the position of the coordinate frame;
step 2, establishing regional candidate network
Establishing a regional candidate network, wherein the network comprises the last convolutional layer of the full convolution network, behind which two branches are arranged. One branch consists, in order, of a convolution layer, a first adjusting layer, a normalization layer and a second adjusting layer, and scores each candidate frame as foreground or background; the other branch is a convolution layer used to predict the offset between the candidate frame and the position of the real coordinate frame. The first and second adjusting layers change the dimensionality of the image, and the normalization layer performs the normalization operation;
step 3, training the regional candidate network
The screening candidate box is used for training the regional candidate network, and the screening rule is as follows:
if the overlapping rate of the candidate frame and the real coordinate frame is more than or equal to 0.7, the candidate frame is regarded as a foreground; if the overlapping rate of the candidate frame and the real coordinate frame is less than 0.3, the candidate frame is considered as the background; training by taking candidate frames corresponding to the foreground and the background as training data of the regional candidate network, wherein the candidate frame corresponding to the foreground is a positive sample, and the candidate frame corresponding to the background is a negative sample; the loss function for the training of the regional candidate network is:
L=cls_loss+λ*reg_loss
wherein λ is an adjustable parameter; to train the regional candidate network, a binary class label is assigned to each candidate frame to be trained. Let p_i be the predicted probability that the i-th candidate frame belongs to the foreground and p_i* be its true label; then cls_loss is defined as:
cls_loss = -Σ_i [ p_i* · log(p_i) + (1 - p_i*) · log(1 - p_i) ]
reg_loss is used to regress the deviation between the candidate frame and the real coordinate frame, and is defined as:
reg_loss = Σ_{i ∈ {x, y, w, h}} smoothL1(t_i - t_i*), with smoothL1(x) = 0.5x² if |x| < 1, and |x| - 0.5 otherwise
where i ∈ {x, y, w, h}, t_i is the predicted offset between the candidate frame and the real coordinate frame in component i of [x, y, w, h], and t_i* is the true value of that offset; x and y denote coordinates, and w and h denote width and height;
the regional candidate network utilizes the loss function L to carry out end-to-end training by a back propagation and random gradient descent method, and weight is initialized by zero mean Gaussian distribution with standard deviation of 0.01;
step 4, constructing a position sensitive sub-network
The position-sensitive sub-network comprises a convolution layer connected after the last convolution layer of the full convolution network, and the position-sensitive score maps are obtained by convolving the output feature map after the input image is processed by the full convolution network; the convolution layer generates k²(c+1) position-sensitive score maps, where the k² score maps describe the relative positions of a k×k spatial grid and c is the number of classes of recognized objects;
step 5, pooling of location sensitive candidate frames
Outputting the deviation amount of the candidate frame and the real coordinate frame by the trained regional candidate network, wherein the deviation amount comprises the position information of the candidate frame region; according to the position information, corresponding the candidate frame to the position sensitive score map obtained in the step 4, wherein the candidate frame is divided into k × k sub-regions, and each sub-region corresponds to a region on the score map; the location sensitive subnetwork further comprises a pooling layer for implementing the following functions:
respectively extracting the position sensitive score map corresponding to each category from the candidate frame, respectively calculating the mean value of the extracted score maps, then forming a matrix according to the positions, and summing all values in the matrix to obtain a value; after all classes are processed in the same way, all obtained values jointly form an output vector, and the output vector is normalized, so that the class of the current candidate area is estimated;
and 6, training the network by utilizing a database of the gesture pictures, and storing the trained network model for gesture classification.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910419349.1A CN110334584B (en) | 2019-05-20 | 2019-05-20 | Gesture recognition method based on regional full convolution network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910419349.1A CN110334584B (en) | 2019-05-20 | 2019-05-20 | Gesture recognition method based on regional full convolution network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110334584A CN110334584A (en) | 2019-10-15 |
CN110334584B true CN110334584B (en) | 2023-01-20 |
Family
ID=68139443
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910419349.1A Active CN110334584B (en) | 2019-05-20 | 2019-05-20 | Gesture recognition method based on regional full convolution network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110334584B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111026898A (en) * | 2019-12-10 | 2020-04-17 | 云南大学 | Weak supervision image emotion classification and positioning method based on cross space pooling strategy |
CN111814626B (en) * | 2020-06-29 | 2021-01-26 | 中南民族大学 | Dynamic gesture recognition method and system based on self-attention mechanism |
CN112699837A (en) * | 2021-01-13 | 2021-04-23 | 新大陆数字技术股份有限公司 | Gesture recognition method and device based on deep learning |
CN113591764A (en) * | 2021-08-09 | 2021-11-02 | 广州博冠信息科技有限公司 | Gesture recognition method and device, storage medium and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108010049A (en) * | 2017-11-09 | 2018-05-08 | 华南理工大学 | Split the method in human hand region in stop-motion animation using full convolutional neural networks |
CN109299644A (en) * | 2018-07-18 | 2019-02-01 | 广东工业大学 | A kind of vehicle target detection method based on the full convolutional network in region |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106709532B (en) * | 2017-01-25 | 2020-03-10 | 京东方科技集团股份有限公司 | Image processing method and device |
- 2019-05-20: CN application CN201910419349.1A filed; granted as patent CN110334584B (status: active)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108010049A (en) * | 2017-11-09 | 2018-05-08 | 华南理工大学 | Split the method in human hand region in stop-motion animation using full convolutional neural networks |
CN109299644A (en) * | 2018-07-18 | 2019-02-01 | 广东工业大学 | A kind of vehicle target detection method based on the full convolutional network in region |
Non-Patent Citations (1)
Title |
---|
Crack detection for textured transparent plastics based on improved R-FCN; Guan Rizhao et al.; Computer Engineering and Applications (《计算机工程与应用》); 2019-03-15; Vol. 55, No. 6; pp. 1-2 *
Also Published As
Publication number | Publication date |
---|---|
CN110334584A (en) | 2019-10-15 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111310861B (en) | License plate recognition and positioning method based on deep neural network | |
CN110334584B (en) | Gesture recognition method based on regional full convolution network | |
CN111489358B (en) | Three-dimensional point cloud semantic segmentation method based on deep learning | |
CN109800689B (en) | Target tracking method based on space-time feature fusion learning | |
CN111310773B (en) | Efficient license plate positioning method of convolutional neural network | |
CN111428765B (en) | Target detection method based on global convolution and local depth convolution fusion | |
CN106845430A (en) | Pedestrian detection and tracking based on acceleration region convolutional neural networks | |
CN112907602B (en) | Three-dimensional scene point cloud segmentation method based on improved K-nearest neighbor algorithm | |
CN109285162A (en) | A kind of image, semantic dividing method based on regional area conditional random field models | |
CN109492596B (en) | Pedestrian detection method and system based on K-means clustering and regional recommendation network | |
CN109064389B (en) | Deep learning method for generating realistic images by hand-drawn line drawings | |
CN111126459A (en) | Method and device for identifying fine granularity of vehicle | |
CN113744311A (en) | Twin neural network moving target tracking method based on full-connection attention module | |
CN111626200A (en) | Multi-scale target detection network and traffic identification detection method based on Libra R-CNN | |
CN111311702B (en) | Image generation and identification module and method based on BlockGAN | |
CN112347970A (en) | Remote sensing image ground object identification method based on graph convolution neural network | |
CN107146219B (en) | Image significance detection method based on manifold regularization support vector machine | |
CN105809716A (en) | Superpixel and three-dimensional self-organizing background subtraction algorithm-combined foreground extraction method | |
CN107577983A (en) | It is a kind of to circulate the method for finding region-of-interest identification multi-tag image | |
CN115223017B (en) | Multi-scale feature fusion bridge detection method based on depth separable convolution | |
CN112396655A (en) | Point cloud data-based ship target 6D pose estimation method | |
CN116596966A (en) | Segmentation and tracking method based on attention and feature fusion | |
CN113988164B (en) | Lightweight point cloud target detection method for representative point self-attention mechanism | |
CN113032613B (en) | Three-dimensional model retrieval method based on interactive attention convolution neural network | |
KR20110037184A (en) | Pipelining computer system combining neuro-fuzzy system and parallel processor, method and apparatus for recognizing objects using the computer system in images |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||