CN113420848A - Neural network model training method and device and gesture recognition method and device


Info

Publication number
CN113420848A
Authority
CN
China
Prior art keywords
gesture
neural network
network model
loss value
predicted
Prior art date
Legal status
Pending
Application number
CN202110974865.8A
Other languages
Chinese (zh)
Inventor
钱程浩
黄雪峰
熊海飞
Current Assignee
Shenzhen Xinrun Fulian Digital Technology Co Ltd
Original Assignee
Shenzhen Xinrun Fulian Digital Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Xinrun Fulian Digital Technology Co Ltd filed Critical Shenzhen Xinrun Fulian Digital Technology Co Ltd
Priority to CN202110974865.8A
Publication of CN113420848A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/017 Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Biology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides a neural network model training method and apparatus and a gesture recognition method and apparatus. The training method of the neural network model comprises the following steps: inputting a sample image and the corresponding thermal image into a neural network model for feature extraction to obtain feature data, wherein the sample image includes a gesture and the feature data includes at least one of: a predicted gesture category, a predicted gesture calibration box, predicted gesture key points, and a predicted gesture thermodynamic diagram; obtaining a loss value between the feature data and the corresponding original data; and updating the neural network model based on the loss value and continuing to train the updated model until the loss value is smaller than a preset threshold. The method and apparatus address the problem of low gesture recognition accuracy in the prior art.

Description

Neural network model training method and device and gesture recognition method and device
Technical Field
The present application relates to the technical field of neural network models, and in particular, to a training method and apparatus for a neural network model, and a gesture recognition method and apparatus.
Background
Gestures are a form of non-verbal communication used in many areas, such as communication for the hearing- and speech-impaired, robot control, Human-Computer Interaction (HCI), home automation, and medical applications. Gesture recognition has taken many different forms, mainly including:
1) Template matching: the feature parameters of the gesture to be recognized are matched against pre-stored template feature parameters, and recognition is completed by measuring the similarity between the two. For example, the edge images of the gesture to be recognized and of the template gesture are transformed into Euclidean distance space, the (possibly corrected) Hausdorff distance between them is calculated, the similarity between the gesture and the template is represented by this distance value, and the template gesture with the minimum distance is taken as the recognition result.
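For illustration only (this describes the prior art, not the method claimed here), a minimal sketch of template matching by symmetric Hausdorff distance over edge-point sets; the template dictionary and edge extraction are assumed to exist:

```python
import numpy as np
from scipy.spatial.distance import directed_hausdorff

def hausdorff(a, b):
    # Symmetric Hausdorff distance between two (N, 2) edge-point sets.
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])

def match_template(query_edges, templates):
    # templates: dict mapping gesture label -> (N, 2) edge-point array.
    # The template with the minimum distance is taken as the recognition result.
    return min(templates, key=lambda label: hausdorff(query_edges, templates[label]))
```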
2) Statistical analysis: a classification method based on probability and statistics that builds a classifier from sample feature vectors. Fingertip and centroid features are extracted from each image, and the distance and included angle between them are computed; the distributions of these two quantities are gathered separately for each gesture, and the thresholds that separate the distances and angles of different gestures are obtained by Bayesian decision based on the minimum error rate. Once the classifier is obtained, captured gesture images can be classified and recognized.
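Again for illustration only, a minimal sketch of this statistical approach, assuming the (distance, angle) features have already been extracted and modelling each gesture class with a Gaussian, so the minimum-error-rate Bayesian decision reduces to picking the class with the highest prior-weighted likelihood:

```python
import numpy as np
from scipy.stats import multivariate_normal

class BayesGestureClassifier:
    def fit(self, feats_by_class):
        # feats_by_class: dict mapping label -> (N, 2) array of (distance, angle) features.
        n = sum(len(f) for f in feats_by_class.values())
        self.models = {
            label: (len(f) / n,  # class prior
                    multivariate_normal(f.mean(axis=0), np.cov(f.T)))
            for label, f in feats_by_class.items()
        }
        return self

    def predict(self, x):
        # Minimum-error-rate decision: maximize prior * likelihood (i.e. the posterior).
        return max(self.models, key=lambda c: self.models[c][0] * self.models[c][1].pdf(x))
```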
The above approaches to gesture recognition have the following problems: 1) template matching requires a large amount of hand-designed feature engineering, and under different environments and backgrounds the features to consider are varied, so the engineering effort is large, the system is complex to implement, and the gesture recognition rate is low; 2) statistical analysis, although it allows feature sets characterizing the different gesture classes to be defined, estimates only a locally optimal linear discriminator and recognizes gesture classes from a large number of features extracted from the gesture image; its learning efficiency is not high, and as the sample size grows the recognition rate improves only marginally, so the gesture recognition rate is likewise low.
Disclosure of Invention
An embodiment of the application aims to provide a neural network model training method and apparatus and a gesture recognition method and apparatus, so as to solve the problem of low gesture recognition accuracy in the prior art. The specific technical solutions are as follows:
In a first aspect of the present application, there is provided a method for training a neural network model, including: inputting a sample image and the corresponding thermal image into a neural network model for feature extraction to obtain feature data; wherein the sample image includes a gesture, and the feature data includes at least one of: a predicted gesture category, a predicted gesture calibration box, predicted gesture key points, and a predicted gesture thermodynamic diagram; obtaining a loss value between the feature data and the corresponding original data; and updating the neural network model based on the loss value and continuing to train the updated model until the loss value is smaller than a preset threshold.
In a second aspect of the present application, there is provided a method for performing gesture recognition based on the neural network model in the training method in the first aspect, including: acquiring image data to be identified; wherein the image data comprises a gesture; inputting the image data to be identified into the neural network model to obtain an output result; wherein the output result is used for representing the recognition result of the gesture.
In a third aspect of the present application, there is provided a training apparatus for a neural network model, including: a first processing module, configured to input the sample image and the corresponding thermal image into the neural network model for feature extraction to obtain feature data, wherein the sample image includes a gesture and the feature data includes at least one of: a predicted gesture category, a predicted gesture calibration box, predicted gesture key points, and a predicted gesture thermodynamic diagram; a first acquisition module, configured to acquire a loss value between the feature data and the corresponding original data; and a training module, configured to update the neural network model based on the loss value and continue training the updated model until the loss value is smaller than a preset threshold.
In a fourth aspect of the present application, there is provided an apparatus for performing gesture recognition based on the neural network model in the training apparatus in the third aspect, including: the second acquisition module is used for acquiring image data to be identified; wherein the image data comprises a gesture; the second processing module is used for inputting the image data to be identified into the neural network model to obtain an output result; wherein the output result is used for representing the recognition result of the gesture.
In a fifth aspect of the present application, there is provided a computer-readable storage medium having instructions stored therein which, when run on a computer, cause the computer to perform the method of any one of the first aspects above or the method of any one of the second aspects above.
In a sixth aspect of the embodiments of the present application, there is provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the method of any one of the first aspects above or the method of any one of the second aspects above.
In the embodiments of the present application, because the feature data includes at least one of a predicted gesture category, a predicted gesture calibration box, predicted gesture key points, and a predicted gesture thermodynamic diagram, and the neural network model is updated with the loss value between the feature data and the corresponding original data, the model pays more attention to the hand during training, which reduces cases where similar objects such as human faces are mistakenly recognized as gestures.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
FIG. 1 is a flow chart of a method for training a neural network model in an embodiment of the present application;
FIG. 2 is a schematic diagram of training a neural network model according to an embodiment of the present application;
FIG. 3 is a flow chart of a gesture recognition method in an embodiment of the present application;
FIG. 4 is a schematic structural diagram of an apparatus for training a neural network model according to an embodiment of the present disclosure;
FIG. 5 is a schematic structural diagram of a gesture recognition apparatus according to an embodiment of the present disclosure;
fig. 6 is a schematic structural diagram of an electronic device in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
The embodiment of the application provides a training method of a neural network model, as shown in fig. 1, the method includes the following steps:
Step 102, inputting a sample image and the corresponding thermal image into a neural network model for feature extraction to obtain feature data; wherein the sample image includes a gesture, and the feature data includes at least one of: a predicted gesture category, a predicted gesture calibration box, predicted gesture key points, and a predicted gesture thermodynamic diagram;
It should be noted that the gesture calibration box refers to the position area of the gesture in the image, and the gesture key points generally include 21 points on the hand, such as points on the finger joints and fingertips; in other application scenarios of the embodiments of the present application there may be more or fewer than 21 points, as determined by the actual situation.
In addition, the gesture category refers to the hand pose, such as an "ok" gesture, a "yeah" (V-sign) gesture, an "eight" gesture, and the like; the gesture calibration box and the gesture key points in the embodiments of the present application are used to determine the gesture category. A gesture thermodynamic diagram (heatmap) is a graphical representation that marks the gesture area of the image with special highlighting.
Step 104, acquiring a loss value between the feature data and the corresponding original data;
Step 106, updating the neural network model based on the loss value, and continuing to train the updated model until the loss value is smaller than a preset threshold.
It should be noted that the preset threshold in the embodiment of the present application may be set correspondingly according to an actual situation.
Through steps 102 to 106 above, because the feature data includes at least one of a predicted gesture category, a predicted gesture calibration box, predicted gesture key points, and a predicted gesture thermodynamic diagram, and the neural network model is updated with the loss value between the feature data and the corresponding original data, the model pays more attention to the hand during training, which reduces cases where similar objects such as human faces are mistakenly recognized as gestures.
In an optional implementation of the embodiments of the present application, the manner of obtaining the loss value between the feature data and the corresponding original data in step 104 includes at least one of the following:
1) Acquiring a first loss value between the predicted gesture thermodynamic diagram and the thermodynamic diagram corresponding to the sample image;
In one example, the first loss value is denoted Loss heat, i.e., the difference between the predicted thermodynamic diagram and the original thermodynamic diagram corresponding to the sample image. For example, if both the original and the predicted thermodynamic diagrams are 128x128 pixels, each of the 128x128 pixel positions has a value in both diagrams; subtracting the 128x128 values of the predicted diagram from those of the original diagram and squaring the differences yields Loss heat.
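A minimal PyTorch sketch of this computation (tensor contents are placeholders; averaging the squared per-pixel differences gives the MSE form used in the embodiment below):

```python
import torch
import torch.nn.functional as F

pred_heat = torch.rand(1, 1, 128, 128)  # predicted gesture thermodynamic diagram (placeholder)
orig_heat = torch.rand(1, 1, 128, 128)  # original thermodynamic diagram of the sample image
loss_heat = F.mse_loss(pred_heat, orig_heat)  # squared per-pixel differences, averaged
```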
2) Acquiring a second loss value between the predicted coordinates of the gesture key points and the coordinates of the gesture key points in the sample image;
In one example, the second loss value is denoted Loss point, the difference between the predicted gesture key points and the original key points in the sample image. For example, there are 21 original key points, i.e., 21 (x, y) coordinate pairs, and likewise 21 predicted key points; subtracting the corresponding coordinate pairs and squaring the differences yields the key-point Loss point.
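A corresponding sketch for the key-point loss, assuming the 21 (x, y) pairs of the example:

```python
import torch
import torch.nn.functional as F

pred_points = torch.rand(1, 21, 2)  # 21 predicted (x, y) key points (placeholder)
orig_points = torch.rand(1, 21, 2)  # 21 annotated key points of the sample image
loss_point = F.mse_loss(pred_points, orig_points)  # squared coordinate differences, averaged
```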
3) Acquiring a third loss value between the predicted gesture calibration box and the gesture calibration box in the sample image;
In one example, the third loss value is denoted Loss box, the difference between the calibration box of the predicted gesture location and the calibration box of the original gesture in the sample image. It should be noted that a calibration box may be represented as (x, y, w, h), where x, y are the coordinates of the center point of the gesture and w, h are the width and height of the box.
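And likewise for the calibration-box loss, with boxes encoded as (x, y, w, h) as described (the values are placeholders):

```python
import torch
import torch.nn.functional as F

pred_box = torch.tensor([[0.52, 0.48, 0.30, 0.41]])  # predicted center x, y and width, height
orig_box = torch.tensor([[0.50, 0.50, 0.32, 0.40]])  # annotated box of the sample image
loss_box = F.mse_loss(pred_box, orig_box)
```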
4) Acquiring a fourth loss value between the predicted gesture category and the gesture category in the sample image.
In one example, the fourth loss value is denoted Loss class, which may be computed with a cross-entropy loss function.
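A sketch of the class loss using PyTorch's multi-class cross entropy (the number of gesture categories is a placeholder):

```python
import torch
import torch.nn.functional as F

logits = torch.randn(1, 5)  # predicted scores over 5 gesture categories (placeholder)
target = torch.tensor([2])  # index of the annotated gesture category
loss_class = F.cross_entropy(logits, target)
```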
Based on the loss values in 1) to 4), the manner of updating the neural network model based on the loss value in step 106 may further include: updating the neural network model based on a sum of at least one of: the first loss value, the second loss value, the third loss value, and the fourth loss value.
In an example, if the loss value includes the first, second, third, and fourth loss values, then it is the sum of the four. That is, whichever of the loss values are present, their sum is used as the loss value for updating the neural network model.
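Putting these together, a minimal training-step sketch under the assumption that the model returns all four predictions; the model, data loader, dictionary keys, and threshold value are illustrative and not taken from the application:

```python
import torch
import torch.nn.functional as F

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)  # 'model' assumed defined elsewhere

def train_step(batch):
    # Forward pass over the sample image and its thermodynamic diagram (assumed signature).
    pred = model(batch["image"], batch["heatmap"])
    loss = (F.mse_loss(pred["heat"], batch["heatmap"])
            + F.mse_loss(pred["points"], batch["points"])
            + F.mse_loss(pred["box"], batch["box"])
            + F.cross_entropy(pred["class"], batch["label"]))
    optimizer.zero_grad()
    loss.backward()  # update the whole model with the sum of the four losses
    optimizer.step()
    return loss.item()

loss = float("inf")
while loss >= 0.01:       # preset threshold, set according to the actual situation
    for batch in loader:  # 'loader' assumed defined elsewhere
        loss = train_step(batch)
```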
In an optional implementation of the embodiments of the present application, the manner of obtaining the loss value between the feature data and the corresponding original data in step 104 may further include:
Step 11, determining the difference between the feature data and the corresponding original data;
Step 12, squaring the difference to obtain the loss value.
Through steps 11 and 12 above, the loss value is obtained as the square of the difference between the feature data and the corresponding original data; in this way the obtained loss value is more accurate, and hence the neural network model updated with it recognizes gestures more accurately.
In the embodiments of the present application, in the case that the feature data is a predicted gesture thermodynamic diagram, inputting the sample image and the corresponding thermal image into the neural network model for feature extraction to obtain the feature data may further include:
Step 21, inputting the thermodynamic diagram corresponding to the sample image into the neural network model, and reducing its size through the convolutional layers of the model;
Step 22, up-sampling the size-reduced thermodynamic diagram to obtain the predicted gesture thermodynamic diagram.
In the embodiments of the present application, training on the thermodynamic diagram lets the neural network model focus more attention on the hand, which reduces cases where similar objects such as human faces are mistakenly recognized as gestures and improves the accuracy of gesture recognition.
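A minimal sketch of such a thermodynamic-diagram branch, assuming strided convolutions for the size reduction and bilinear up-sampling back to the input resolution (layer widths are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HeatmapBranch(nn.Module):
    def __init__(self):
        super().__init__()
        # Strided convolutions reduce the spatial size of the input thermodynamic diagram.
        self.down = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 1, kernel_size=3, stride=2, padding=1),
        )

    def forward(self, heatmap):
        x = self.down(heatmap)  # e.g. 128x128 -> 32x32
        # Up-sample back to the original size to obtain the predicted gesture thermodynamic diagram.
        return F.interpolate(x, size=heatmap.shape[-2:], mode="bilinear", align_corners=False)

pred_heat = HeatmapBranch()(torch.rand(1, 1, 128, 128))  # -> shape (1, 1, 128, 128)
```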
The following illustrates the present application with reference to a specific embodiment. The embodiment provides a gesture recognition method; fig. 2 is a schematic diagram of neural network model training in an embodiment of the present application, and based on fig. 2 the method includes the following steps:
Step 201, the original image (sample image) containing the gesture and the generated thermodynamic diagram are fed into a convolutional neural network to obtain feature layers.
Step 202, the resolution of the thermodynamic diagram is reduced as it passes through the convolutional layers, and it is then restored to the original resolution by up-sampling to obtain the predicted gesture thermodynamic diagram.
Step 203, during training, the thermodynamic diagram generated from the original gesture is subtracted from the predicted thermodynamic diagram and the result is squared to obtain the MSE (mean square error) loss, which serves as the Loss heat of the thermodynamic-diagram prediction task. Predicting the gesture key points and the coordinates of the gesture location yields Loss point and Loss box. Predicting the gesture class with a common multi-class cross-entropy loss function yields Loss class. Finally, the whole neural network model is updated with the sum of the losses of the four tasks:
Total Loss = Loss heat + Loss point + Loss box + Loss class
Here Loss heat refers to the difference between the predicted and original thermodynamic diagrams, Loss point to the difference between the predicted and original key points, Loss box to the difference between the calibration box of the predicted gesture location and that of the original gesture, and Loss class to the difference in the predicted gesture classification.
In another embodiment of the present application, there is further provided a method for performing gesture recognition based on the neural network model in the training method in fig. 1, as shown in fig. 3, the method includes the steps of:
Step 302, acquiring image data to be identified; wherein the image data includes a gesture;
Step 304, inputting the image data to be identified into the neural network model to obtain an output result; wherein the output result is used to represent the recognition result of the gesture.
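A hedged sketch of these two steps; the saved-model path, input shape, and inference-time signature are assumptions (the thermodynamic branch is supervised only during training):

```python
import torch

model = torch.load("gesture_model.pt")  # illustrative path to the trained model
model.eval()

with torch.no_grad():
    image = torch.rand(1, 3, 256, 256)  # image data to be identified (placeholder shape)
    output = model(image)               # output represents the gesture recognition result
```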
It can be seen that the loss value used to update the neural network model is obtained from differences including at least one of: the difference in gesture category, the difference in gesture calibration box, the difference in gesture key points, and the difference in gesture thermodynamic diagram. Moreover, since the gesture key points describe the outline of the hand and the gesture calibration box locates its extent, gestures can be recognized more accurately; that is, the trained neural network model improves the accuracy of gesture recognition.
Corresponding to fig. 1, the present application also provides a training apparatus for a neural network model, as shown in fig. 4, the apparatus includes:
the first processing module 42 is configured to input the sample image and the corresponding thermal image into the neural network model for feature extraction, so as to obtain feature data; wherein, the sample image comprises gestures, and the characteristic data comprises at least one of the following items: a predicted gesture category, a predicted gesture scaling box, a predicted gesture key point, a predicted gesture thermodynamic diagram;
a first obtaining module 44, configured to obtain a loss value between the feature data and the corresponding original data;
and the training module 46 is configured to update the neural network model based on the loss value, and continue training the updated neural network model until the loss value is smaller than the preset threshold value.
With the apparatus of the embodiments of the present application, because the feature data includes at least one of a predicted gesture category, a predicted gesture calibration box, predicted gesture key points, and a predicted gesture thermodynamic diagram, and the neural network model is updated with the loss value between the feature data and the corresponding original data, the model pays more attention to the hand during training, which reduces cases where similar objects such as human faces are mistakenly recognized as gestures.
Optionally, the first obtaining module 44 in the embodiments of the present application includes at least one of: a first obtaining unit, configured to obtain a first loss value between the predicted gesture thermodynamic diagram and the thermodynamic diagram corresponding to the sample image; a second obtaining unit, configured to obtain a second loss value between the coordinates of the predicted gesture key points and the coordinates of the gesture key points in the sample image; a third obtaining unit, configured to obtain a third loss value between the predicted gesture calibration box and the gesture calibration box in the sample image; and a fourth obtaining unit, configured to obtain a fourth loss value between the predicted gesture category and the gesture category in the sample image.
Optionally, the training module 46 in the embodiment of the present application further includes: an updating unit for updating the neural network model based on a sum of at least one of: a first loss value, a second loss value, a third loss value, and a fourth loss value.
Optionally, the first obtaining module in this embodiment of the present application includes: a determining unit for determining a difference between the feature data and the corresponding original data; and the first processing unit is used for squaring the difference value to obtain a loss value.
Optionally, in the case that the feature data is a predicted gesture thermodynamic diagram, the first processing module 42 in the embodiment of the present application further includes: the second processing unit is used for inputting the thermodynamic diagrams corresponding to the sample images into the neural network model and reducing the thermodynamic diagram size corresponding to the sample images through the convolution layer in the neural network model; and the third processing unit is used for up-sampling the thermodynamic diagrams corresponding to the sample images with the reduced sizes to obtain the predicted gesture thermodynamic diagrams.
Based on the foregoing fig. 4, an embodiment of the present application further provides an apparatus for performing gesture recognition based on the neural network model in the training apparatus of fig. 4; as shown in fig. 5, the apparatus includes:
a second obtaining module 52, configured to obtain image data to be identified; wherein the image data comprises gestures;
the second processing module 54 is configured to input image data to be identified into the neural network model, so as to obtain an output result; and the output result is used for representing the recognition result of the gesture.
It can be seen that the loss value used to update the neural network model is obtained from differences including at least one of: the difference in gesture category, the difference in gesture calibration box, the difference in gesture key points, and the difference in gesture thermodynamic diagram. Moreover, since the gesture key points describe the outline of the hand and the gesture calibration box locates its extent, gestures can be recognized more accurately; that is, the trained neural network model improves the accuracy of gesture recognition.
An embodiment of the present application further provides an electronic device, as shown in fig. 6, comprising a processor 601, a communication interface 602, a memory 603, and a communication bus 604, where the processor 601, the communication interface 602, and the memory 603 communicate with each other through the communication bus 604;
the memory 603 is configured to store a computer program;
the processor 601 is configured to implement the method steps of fig. 1 or fig. 3 when executing the program stored in the memory 603.
The functions performed when implementing the method steps of fig. 1 or fig. 3 are similar to those described above and are not repeated here.
The communication bus mentioned in the above terminal may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 6, but this is not intended to represent only one bus or type of bus.
The communication interface is used for communication between the terminal and other equipment.
The memory may include a Random Access Memory (RAM) or a non-volatile memory, such as at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the processor.
The processor may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application-Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment provided by the present application, a computer-readable storage medium is further provided, which has instructions stored therein, and when the instructions are executed on a computer, the instructions cause the computer to perform any one of the above-mentioned neural network model training methods or any one of the above-mentioned gesture recognition methods.
In yet another embodiment provided by the present application, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the above described methods of training a neural network model or any of the above described methods of gesture recognition.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When loaded and executed on a computer, cause the processes or functions described in accordance with the embodiments of the application to occur, in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website site, computer, server, or data center to another website site, computer, server, or data center via wired (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims (10)

1. A training method of a neural network model is characterized by comprising the following steps:
inputting a sample image and the corresponding thermal image into a neural network model for feature extraction to obtain feature data; wherein the sample image includes a gesture, and the feature data includes at least one of: a predicted gesture category, a predicted gesture calibration box, predicted gesture key points, and a predicted gesture thermodynamic diagram;
obtaining a loss value between the feature data and the corresponding original data;
and updating the neural network model based on the loss value, and continuing training the updated neural network model until the loss value is smaller than a preset threshold value.
2. The method of claim 1, wherein the obtaining the loss value between the feature data and the corresponding original data comprises at least one of:
obtaining a first loss value between the predicted gesture thermodynamic diagram and a thermodynamic diagram corresponding to the sample image;
acquiring a second loss value between the predicted coordinates of the gesture key points and the coordinates of the gesture key points in the sample image;
obtaining a third loss value between the predicted gesture calibration box and the gesture calibration box in the sample image;
obtaining a fourth loss value between the predicted gesture category and the gesture category in the sample image.
3. The method of claim 2, wherein said updating the neural network model based on the loss value comprises:
updating the neural network model based on a sum of at least one of: the first loss value, the second loss value, the third loss value, and the fourth loss value.
4. The method of claim 1, wherein the obtaining the loss value between the feature data and the corresponding original data comprises:
determining the difference between the feature data and the corresponding original data;
and squaring the difference to obtain the loss value.
5. The method of claim 1, wherein in the case that the feature data is a predicted gesture thermodynamic diagram, inputting the sample image and the corresponding thermodynamic image into a neural network model for feature extraction, and obtaining the feature data comprises:
inputting the thermodynamic diagram corresponding to the sample image into the neural network model, and reducing the size of the thermodynamic diagram through convolutional layers in the neural network model;
and upsampling the thermodynamic diagram corresponding to the sample image with the reduced size to obtain the predicted gesture thermodynamic diagram.
6. A method for gesture recognition based on the neural network model in the training method of any one of claims 1 to 5, comprising:
acquiring image data to be identified; wherein the image data comprises a gesture;
inputting the image data to be identified into the neural network model to obtain an output result; wherein the output result is used for representing the recognition result of the gesture.
7. An apparatus for training a neural network model, comprising:
the first processing module is used for inputting the sample image and the corresponding thermal image into the neural network model for feature extraction to obtain feature data; wherein the sample image includes a gesture, and the feature data includes at least one of: a predicted gesture category, a predicted gesture calibration box, predicted gesture key points, and a predicted gesture thermodynamic diagram;
the first acquisition module is used for acquiring a loss value between the feature data and the corresponding original data;
and the training module is used for updating the neural network model based on the loss value and continuously training the updated neural network model until the loss value is smaller than a preset threshold value.
8. An apparatus for performing gesture recognition based on the neural network model in the training apparatus of claim 7, comprising:
the second acquisition module is used for acquiring image data to be identified; wherein the image data comprises a gesture;
the second processing module is used for inputting the image data to be identified into the neural network model to obtain an output result; wherein the output result is used for representing the recognition result of the gesture.
9. An electronic device, comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any one of claims 1 to 5 or the method steps of claim 6 when executing a program stored in the memory.
10. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the method of any one of claims 1 to 5 or the method steps of claim 6.
CN202110974865.8A 2021-08-24 2021-08-24 Neural network model training method and device and gesture recognition method and device Pending CN113420848A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110974865.8A CN113420848A (en) 2021-08-24 2021-08-24 Neural network model training method and device and gesture recognition method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110974865.8A CN113420848A (en) 2021-08-24 2021-08-24 Neural network model training method and device and gesture recognition method and device

Publications (1)

Publication Number Publication Date
CN113420848A (en) 2021-09-21

Family

ID=77719218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110974865.8A Pending CN113420848A (en) 2021-08-24 2021-08-24 Neural network model training method and device and gesture recognition method and device

Country Status (1)

Country Link
CN (1) CN113420848A (en)


Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112464904A (en) * 2020-12-15 2021-03-09 北京乐学帮网络技术有限公司 Classroom behavior analysis method and device, electronic equipment and storage medium
CN112699889A (en) * 2021-01-07 2021-04-23 浙江科技学院 Unmanned real-time road scene semantic segmentation method based on multi-task supervision
CN112699837A (en) * 2021-01-13 2021-04-23 新大陆数字技术股份有限公司 Gesture recognition method and device based on deep learning

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780478A (en) * 2021-10-26 2021-12-10 平安科技(深圳)有限公司 Activity classification model training method, classification method, apparatus, device and medium
CN113780478B (en) * 2021-10-26 2024-05-28 平安科技(深圳)有限公司 Activity classification model training method, classification method, device, equipment and medium
CN114035687A (en) * 2021-11-12 2022-02-11 郑州大学 Gesture recognition method and system based on virtual reality
CN114035687B (en) * 2021-11-12 2023-07-25 郑州大学 Gesture recognition method and system based on virtual reality
WO2024007938A1 (en) * 2022-07-04 2024-01-11 北京字跳网络技术有限公司 Multi-task prediction method and apparatus, electronic device, and storage medium

Similar Documents

Publication Publication Date Title
CN113420848A (en) Neural network model training method and device and gesture recognition method and device
WO2017152794A1 (en) Method and device for target tracking
US8542912B2 (en) Determining the uniqueness of a model for machine vision
US9971954B2 (en) Apparatus and method for producing image processing filter
CN111797078A (en) Data cleaning method, model training method, device, storage medium and equipment
JP6352512B1 (en) Signal processing apparatus, signal processing method, signal processing program, and data structure
CN107272899B (en) VR (virtual reality) interaction method and device based on dynamic gestures and electronic equipment
CN113792853B (en) Training method of character generation model, character generation method, device and equipment
CN114973300B (en) Component type identification method and device, electronic equipment and storage medium
US8542905B2 (en) Determining the uniqueness of a model for machine vision
JP2019086979A (en) Information processing device, information processing method, and program
CN114595352A (en) Image identification method and device, electronic equipment and readable storage medium
CN113312969A (en) Part identification and positioning method and system based on three-dimensional vision
CN111353514A (en) Model training method, image recognition method, device and terminal equipment
WO2024093665A1 (en) Identity recognition image processing method and apparatus, computer device, and storage medium
CN114139630A (en) Gesture recognition method and device, storage medium and electronic equipment
CN112465805A (en) Neural network training method for quality detection of steel bar stamping and bending
JP2019194788A (en) Learning device, recognition device, learning method and computer program
CN111583159A (en) Image completion method and device and electronic equipment
CN115909356A (en) Method and device for determining paragraph of digital document, electronic equipment and storage medium
US20230401809A1 (en) Image data augmentation device and method
CN112396057A (en) Character recognition method and device and electronic equipment
CN113033542B (en) Method and device for generating text recognition model
CN108776972A (en) A kind of method for tracing object and device
CN111640076B (en) Image complement method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210921