CN110728209B - Gesture recognition method and device, electronic equipment and storage medium - Google Patents

Gesture recognition method and device, electronic equipment and storage medium

Info

Publication number
CN110728209B
CN110728209B CN201910906271.6A
Authority
CN
China
Prior art keywords
human body
gesture
heatmap
image
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910906271.6A
Other languages
Chinese (zh)
Other versions
CN110728209A (en)
Inventor
刘梦源
陈宸
肖万鹏
鞠奇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201910906271.6A priority Critical patent/CN110728209B/en
Publication of CN110728209A publication Critical patent/CN110728209A/en
Application granted granted Critical
Publication of CN110728209B publication Critical patent/CN110728209B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

The present application relates to the field of computer technology, mainly to computer vision and machine learning technologies in artificial intelligence, and in particular to a gesture recognition method and apparatus, an electronic device, and a storage medium. A human body image area is determined in an image to be recognized; human body pose estimation is performed on the human body image area to obtain a pose feature heatmap corresponding to the human body image area; pose scores of the human body image area for each preset pose category are determined according to the pose feature heatmap; and a human pose recognition result for the human body image area is obtained based on the pose scores. Because recognition is performed using the pose feature heatmap, recognition accuracy can be improved.

Description

Gesture recognition method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a gesture recognition method, a gesture recognition device, an electronic device, and a storage medium.
Background
In practice, human body posture is a means by which users express themselves, and the information conveyed by human body actions can be analyzed and understood through human body posture recognition; it is therefore necessary to realize human body posture recognition.
In the prior art, human body posture recognition mainly consists of obtaining joint points through a human body pose estimation algorithm and then inputting the joint points into a classifier to judge the human body posture category. However, this approach depends on the accuracy of the estimated joint points; if the joint points are inaccurate or missing, the results of the subsequent classifier are necessarily inaccurate.
Disclosure of Invention
The embodiments of the present application provide a gesture recognition method and apparatus, an electronic device, and a storage medium, so as to improve the accuracy of gesture recognition.
The specific technical scheme provided by the embodiment of the application is as follows:
one embodiment of the present application provides a gesture recognition method, including:
determining a human body image area in an image to be identified;
performing human body pose estimation on the human body image area to obtain a pose feature heatmap corresponding to the human body image area;
determining pose scores of the human body image area for each preset pose category according to the pose feature heatmap;
and obtaining a human pose recognition result for the human body image area based on the pose scores.
Another embodiment of the present application provides a gesture recognition apparatus, including:
The detection module is used for determining a human body image area in the image to be identified;
the estimation module is used for performing human body pose estimation on the human body image area to obtain a pose feature heatmap corresponding to the human body image area;
the recognition module is used for determining pose scores of the human body image area for each preset pose category according to the pose feature heatmap, and obtaining a human pose recognition result for the human body image area based on the pose scores.
Another embodiment of the present application provides an electronic device including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the steps of any of the gesture recognition methods described above when the program is executed.
Another embodiment of the present application provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the gesture recognition methods described above.
In the embodiments of the present application, a human body image area is detected from the image to be recognized, human body pose estimation is performed on the human body image area to obtain a corresponding pose feature heatmap, pose scores of the human body image area for each preset pose category are determined according to the pose feature heatmap, and the human pose recognition result for the human body image area is obtained according to the pose scores. Because a pose feature heatmap contains more human pose information than joint points do, using the pose feature heatmap for recognition can improve accuracy. Moreover, the human body image area is detected first and its image is then taken as the input for computing the pose feature heatmap; since a human body image area generally contains a single human body, redundant background is reduced, which improves both the performance and the efficiency of pose estimation.
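As an illustrative sketch only (not part of the original disclosure), the overall flow summarized above can be expressed as follows. All function bodies are hypothetical stand-ins for the detector, pose estimator, and classifier; only the structure of the pipeline reflects the described method:

```python
import numpy as np

def detect_human_regions(image):
    # Stand-in for a human body detector (e.g. YOLOv3 in the text); here it
    # simply returns one box covering the whole image, mirroring the
    # fallback the text describes when no detection frame is found.
    h, w = image.shape[:2]
    return [(0, 0, w, h)]

def estimate_pose_heatmaps(region):
    # Stand-in for a pose estimation model (e.g. OpenPose): one heatmap
    # channel per joint/torso part. 19 channels is an arbitrary choice.
    h, w = region.shape[:2]
    return np.random.rand(19, h, w)

def score_pose_categories(heatmaps, num_categories=4):
    # Stand-in classifier head: pool each heatmap channel globally and
    # apply a linear scoring layer to get one score per preset category.
    pooled = heatmaps.mean(axis=(1, 2))
    weights = np.ones((num_categories, len(pooled))) / len(pooled)
    return weights @ pooled

def recognize_pose(image):
    results = []
    for (x0, y0, x1, y1) in detect_human_regions(image):
        region = image[y0:y1, x0:x1]          # crop the human body image area
        heatmaps = estimate_pose_heatmaps(region)
        scores = score_pose_categories(heatmaps)
        results.append(int(np.argmax(scores)))  # highest-scoring pose category
    return results

image = np.zeros((128, 96, 3))
print(recognize_pose(image))  # one category index per detected region
```

In a real system each stand-in would be replaced by a trained model; the point of the sketch is that classification consumes heatmaps rather than extracted joint coordinates.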
Drawings
FIG. 1 is a schematic diagram comparing joint points estimated by a human body pose estimation algorithm with the corresponding pose feature heatmaps in an embodiment of the present application;
fig. 2 is a schematic diagram of an application architecture of a gesture recognition method in an embodiment of the present application;
FIG. 3 is a flowchart of a gesture recognition method in an embodiment of the present application;
FIG. 4 is a schematic diagram of the distribution of human body joint points and torso in an embodiment of the present application;
FIG. 5 is a flowchart of the OpenPose principle in an embodiment of the present application;
FIG. 6 is a schematic flow chart of a gesture recognition method in an embodiment of the present application;
FIG. 7 is a diagram of a simple image sample example in an embodiment of the present application;
FIG. 8 is a diagram of an example of a difficult image sample in an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a joint feature classification network according to an embodiment of the present application;
FIG. 10 is a schematic diagram of a convolutional layer feature classification network in an embodiment of the present application;
fig. 11 is a schematic structural diagram of a gesture recognition apparatus in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present application, not all of them. All other embodiments obtained by a person of ordinary skill in the art based on the present disclosure without creative effort fall within the protection scope of the present disclosure.
When the embodiments of the present application are applied, the collection and processing of relevant data should strictly comply with the requirements of relevant national laws and regulations, the informed consent or separate consent of the personal information subject should be obtained, and subsequent data use and processing should be carried out within the scope authorized by laws and regulations and by the personal information subject.
For ease of understanding of embodiments of the present application, several concepts will be briefly described below:
closing node: the preset key points of the human body, such as a neck key point, an elbow key point, a wrist key point, a shoulder key point, a head key point and other joint positions, can be represented, and the joint points identified from the image in the embodiment of the present application can represent corresponding coordinate points of the human body joints in the image.
Torso: representing the connection of adjacent nodes of the human body.
Posture: representing a collective term for the articulation points and torso.
Gesture feature thermodynamic diagram: the middle layer characteristics when the positions of the joint points and the trunk are estimated based on the human body posture estimation algorithm are represented, the probability of the occurrence of the positions of the joint points and the trunk can be represented, and the posture characteristic thermodynamic diagram can be represented by circular gauss at the same position in a gray level diagram of the original image size of the human body joint points and the trunk, namely the probability that pixels in the input characteristic diagram belong to the human body joint points and the trunk is represented.
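The circular-Gaussian construction described above can be sketched as follows: a grayscale map the size of the input image, with a 2-D Gaussian centred on a joint position. The sigma value is an arbitrary illustrative choice, not one from the disclosure:

```python
import numpy as np

def joint_heatmap(height, width, cx, cy, sigma=6.0):
    """Grayscale map in which each pixel holds a probability-like response
    of belonging to a joint centred at (cx, cy): a circular 2-D Gaussian."""
    ys, xs = np.mgrid[0:height, 0:width]
    d2 = (xs - cx) ** 2 + (ys - cy) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

hm = joint_heatmap(64, 48, cx=20, cy=30)
print(hm.shape)            # (64, 48): same size as the input image
print(float(hm[30, 20]))   # peak response of 1.0 at the joint position
```

A torso heatmap can be built the same way by placing Gaussians along the line connecting two adjacent joint points.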
Human body detection algorithm: takes an image as input and outputs human body candidate frames for all persons in the image. Detection algorithms such as YOLO version 3 (You Only Look Once v3, YOLOv3), Fast Region-based Convolutional Network (Fast-RCNN), and Single Shot MultiBox Detector (SSD) can be used; the embodiments of the present application are not limited in this respect, and different human body detection algorithms can be selected as required.
Human body pose estimation algorithm: such algorithms can be divided into bottom-up and top-down designs. A bottom-up algorithm directly estimates the joint points of multiple persons from a single image, while a top-down algorithm first detects the positions of persons with a human body detection algorithm and then estimates joint points for each extracted single-person image in turn. For example, bottom-up algorithms include Convolutional Pose Machines (CPM) and OpenPose, and top-down algorithms include AlphaPose. Because OpenPose estimates human poses robustly in real-world scenes, the embodiments of the present application mainly take OpenPose as the example pose estimation algorithm, without being limited thereto.
Artificial Intelligence (AI) is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the capabilities of perception, reasoning, and decision-making.
The artificial intelligence technology is a comprehensive subject, and relates to the technology with wide fields, namely the technology with a hardware level and the technology with a software level. Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
Computer Vision (CV) is the science of studying how to make machines "see"; more specifically, it uses cameras and computers in place of human eyes to recognize, track, and measure targets, and further performs graphics processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies theories and technologies for building artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, optical character recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, three-dimensional (3D) techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as common biometric techniques such as face recognition and fingerprint recognition. For example, in the embodiments of the present application, image feature extraction and image classification can be realized through image semantic understanding: human body detection is performed on an image to generate a human body image area, human body pose information features are extracted from the image to obtain a pose feature heatmap, and the corresponding pose categories are then learned from the pose feature heatmap, thereby realizing pose classification based on pose feature heatmaps.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behavior to acquire new knowledge or skills, and how they reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, and inductive learning. For example, in the embodiments of the present application, the human body detection model used to detect the human body and obtain the human body image area, the human body pose estimation model used to obtain the pose feature heatmap or the joint points, and the classifier used to recognize the pose category from the pose feature heatmap or the joint points are all obtained through machine learning training.
With the research and progress of artificial intelligence technology, it has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, autonomous driving, unmanned aerial vehicles, robots, smart medical treatment, and smart customer service. With the development of technology, artificial intelligence will be applied in more fields and become more and more important.
The scheme provided by the embodiment of the application mainly relates to the technologies of computer vision, machine learning and the like of artificial intelligence, and is specifically described through the following embodiments:
Human body pose recognition has very broad application prospects in fields such as behavior recognition, human-computer interaction, games, and animation, and is a research direction of value in computer vision. In the prior art, the general framework of image-based human body pose recognition is divided into two parts: a human body pose estimation algorithm and a joint-point-based classifier. The coordinates of each human body joint point are estimated from the image by the pose estimation algorithm, and the joint point coordinates are then input as features into the classifier to judge the human body pose category. The classifier needs to be designed in advance so as to map joint points to pose categories; for example, it can use template matching, i.e., a nearest-neighbor-style algorithm, for classification, or the joint points can be connected into a human body torso, saved as a picture, and classified with a multi-layer convolutional network. However, under this strategy the effect of the classifier is limited by the accuracy of the estimated joint points; if joint point estimation is inaccurate or joint points are missing, the result of the subsequent classifier is necessarily inaccurate.
In the embodiments of the present application, it is observed that, in general, when estimating joint points with a human body pose estimation algorithm, a pose feature heatmap is obtained first, and a high-response position in the heatmap is then selected and output as the joint point; if the responses in the heatmap are relatively dispersed, the joint point is dropped, i.e., no joint point is obtained. It follows that even when an estimated joint point is missing, the corresponding pose feature heatmap still provides predictive information about that joint point, which is beneficial to pose recognition. For example, FIG. 1 compares the joint points estimated by a human body pose estimation algorithm with the corresponding pose feature heatmaps in an embodiment of the present application: although the leg joint points are missing in the second image of FIG. 1 (from left to right), the third to fifth pose feature heatmaps in FIG. 1 can still provide information for predicting the leg joint points. Therefore, in the embodiments of the present application, the pose feature heatmap is used as the input of the classifier, thereby improving classification accuracy.
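The joint-extraction behavior described above, taking the highest-response location as the joint and dropping it when the response is too weak or dispersed, can be sketched like this. The 0.3 confidence threshold is an illustrative assumption, not a value from the disclosure:

```python
import numpy as np

def heatmap_to_joint(heatmap, threshold=0.3):
    """Return the peak location of a heatmap as the joint point, or None
    when the peak response is below the threshold, i.e. the case where a
    joint is 'missing' even though the raw heatmap still carries
    predictive information usable by a heatmap-based classifier."""
    idx = np.unravel_index(np.argmax(heatmap), heatmap.shape)
    peak = float(heatmap[idx])
    if peak < threshold:
        return None                         # joint dropped
    return (int(idx[1]), int(idx[0]))       # (x, y) coordinates

sharp = np.zeros((32, 32)); sharp[10, 12] = 0.9   # concentrated response
flat = np.full((32, 32), 0.05)                    # dispersed response
print(heatmap_to_joint(sharp))  # (12, 10)
print(heatmap_to_joint(flat))   # None
```

The `None` case is exactly where a joint-point classifier loses information while a heatmap classifier does not.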
In addition, according to the image acquisition method, the objects processed by prior-art human body pose recognition algorithms can be divided into three types: color images, depth images, and fused color-depth images. For example, surveillance cameras, mobile phone cameras, and computer cameras capture color images, while depth sensors such as Kinect capture depth images. A depth image only senses the distance from the object surface to the camera and ignores complex texture information, so estimating human pose from depth images is easier than from color images and yields more stable joint points. However, compared with ordinary cameras, depth sensors are more expensive and have a narrower range of application; for example, they are not suitable for outdoor environments with strong illumination. Therefore, in practice, human pose estimation scenarios generally use color images as input, and the images in the embodiments of the present application refer to color images.
Referring to fig. 2, an application architecture diagram of a gesture recognition method in an embodiment of the present application includes a terminal 100 and a server 200.
The terminal 100 may be any intelligent device such as a smartphone, a tablet computer, or a portable personal computer. The terminal 100 can capture images, and an application (APP) or client involving human pose recognition, such as a game client or a client for retrieving images of different poses, may be installed on it. The terminal 100 can send the server 200 a recognition request for an image to be recognized, where the image to be recognized contains one or more human bodies, and can receive the pose recognition result returned by the server 200.
The server 200 is capable of providing various network services to the terminal 100, and the server 200 may be regarded as a background server providing corresponding network services for different applications on the terminal 100.
The server 200 may be a server, a server cluster formed by a plurality of servers, or a cloud computing center.
In particular, the server 200 may include a processor 210 (Central Processing Unit, CPU), a memory 220, an input device 230, an output device 240, etc. The input device 230 may include a keyboard, a mouse, a touch screen, etc., and the output device 240 may include a display device such as a liquid crystal display (Liquid Crystal Display, LCD) or a cathode ray tube (CRT).
Memory 220 may include Read Only Memory (ROM) and Random Access Memory (RAM) and provides processor 210 with program instructions and data stored in memory 220. In the embodiment of the present invention, the memory 220 may be used to store a program of any of the gesture recognition methods in the embodiment of the present invention.
The processor 210 is configured to call the program instructions stored in the memory 220 and execute, according to the obtained program instructions, the steps of any of the gesture recognition methods in the embodiments of the present invention.
It should be noted that the gesture recognition method in the embodiments of the present application is mainly performed on the server 200 side, and the pre-training of the classifier, the human body detection model, and the human body pose estimation model involved in the method is also performed on the server 200 side. After training is completed, the human body image area can be detected from the image to be recognized by the trained human body detection model, the pose feature heatmap of the human body can be obtained by the human body pose estimation model, and the heatmap can be input into the classifier to output the pose category of the human body in the image to be recognized.
The terminal 100 and the server 200 are connected to each other through a network and communicate with each other. Optionally, the network uses standard communication techniques and/or protocols. The network is typically the Internet, but may be any network including, but not limited to, a local area network (Local Area Network, LAN), a metropolitan area network (Metropolitan Area Network, MAN), a wide area network (Wide Area Network, WAN), a mobile, wired, or wireless network, a private network, or any combination of virtual private networks. In some embodiments, data exchanged over the network is represented using techniques and/or formats including HyperText Markup Language (HTML), Extensible Markup Language (XML), and the like. All or some of the links may also be encrypted using conventional encryption techniques such as Secure Socket Layer (SSL), Transport Layer Security (TLS), Virtual Private Network (VPN), Internet Protocol Security (IPsec), and the like. In other embodiments, custom and/or dedicated data communication techniques may also be used in place of, or in addition to, the data communication techniques described above.
It should be noted that, the application architecture diagram in the embodiment of the present application is to more clearly illustrate the technical solution in the embodiment of the present application, and does not limit the technical solution provided in the embodiment of the present application, and for other application architectures and service applications, the technical solution provided in the embodiment of the present application is also applicable to similar problems. In the following, in various embodiments of the present application, an application architecture shown in fig. 2 is schematically illustrated by using an application of the gesture recognition method.
Based on the foregoing embodiments, a gesture recognition method in an embodiment of the present application is described below, and referring to fig. 3, a flowchart of the gesture recognition method in an embodiment of the present application is shown, where the method includes:
step 300: and acquiring an image to be identified.
Wherein, the image to be identified can comprise one or more human bodies.
Step 310: and determining a human body image area in the image to be identified.
In the embodiments of the present application, the pose feature heatmap may be obtained through a human body pose estimation model; for example, the model may adopt OpenPose. However, taking the whole image, background included, as the input of OpenPose would cause performance to drop sharply. Therefore, in the embodiments of the present application, the human body image area is first detected from the image to be recognized, and the image corresponding to the human body image area is then cropped out as the input of OpenPose, i.e., the human body pose estimation model, so that both accuracy and performance can be improved.
When step 310 is executed, the method specifically includes:
s1, when an image to be identified does not meet abnormal image conditions, human body detection is carried out on the image to be identified based on a human body detection model obtained through pre-training, and a human body detection result is obtained.
The human body detection model may adopt a YoloV3 algorithm, which is not limited in the embodiment of the present application.
Here, an abnormal image is an image satisfying any of the following conditions: an aspect ratio exceeding a ratio threshold, or a width or height exceeding a pixel threshold (other conditions may also be used). An abnormal image satisfying these conditions usually contains no human body, or the human body is difficult to extract, so abnormal images can be judged in advance.
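A minimal sketch of this pre-check, under one plausible reading of the conditions (extreme aspect ratio, or a side exceeding a pixel limit). The concrete threshold values are illustrative assumptions, as the disclosure does not specify them:

```python
def is_abnormal_image(width, height,
                      max_aspect_ratio=4.0, max_side=4096):
    """Judge in advance whether an image is 'abnormal': its aspect ratio
    exceeds a ratio threshold, or its width/height exceeds a pixel
    threshold. Such images are skipped before human body detection."""
    aspect = max(width, height) / max(min(width, height), 1)
    return aspect > max_aspect_ratio or max(width, height) > max_side

print(is_abnormal_image(1920, 100))  # True: aspect ratio 19.2
print(is_abnormal_image(640, 480))   # False: an ordinary photo
```

An image flagged here would bypass detection entirely, since it is unlikely to contain an extractable human body.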
Thus, through the human body detection model, a human body can be detected from the image to be identified to obtain a human body detection result, and the human body detection result indicates whether a human body detection frame exists.
S2: when the human body detection result does not include a human body detection frame, the image to be identified is taken as the human body image area.
In the embodiments of the present application, if no human body detection frame is detected in the image to be identified, the whole image to be identified can be used as the input for subsequent processing: the pose feature heatmap is obtained and the pose category is output through the classifier. This prevents missed detections and improves the detection effect.
S3: when a human body detection frame exists in the human body detection result, abnormal human body detection frames are filtered out to obtain human body candidate frames, and the human body image area is determined according to the human body candidate frames.
In the embodiments of the present application, if the image to be identified contains multiple human bodies, multiple human body detection frames may be generated by the human body detection model, i.e., multiple human body image areas are obtained; pose estimation is then performed on the image of each human body image area separately, and the pose category corresponding to each human body image area is determined.
In addition, when for example the YoloV3 algorithm is used for detecting a human body, a generated human body detection frame may contain a single person or a crowd, and the determined human body image area may accordingly contain a single person or a crowd. In the embodiment of the present application, gesture recognition is mainly performed for a single person, so human body detection frames containing crowds need to be removed. Specifically, a possible implementation manner is provided in the embodiment of the present application: when human body detection frames exist in the human body detection result, the area of each human body detection frame is calculated; abnormal human body detection frames among the human body detection frames are determined according to the areas of the human body detection frames; and the abnormal human body detection frames are filtered out to obtain the human body candidate frames.
That is, in the embodiment of the present application, whether a human body detection frame contains a crowd may be determined according to its area. For example, suppose 3 human body detection frames are detected from the image to be identified and are sorted by area. If the area of the first (largest) human body detection frame exceeds the area of the second by a preset multiple, for example 10 times, the first human body detection frame may be considered to contain a crowd; it is an abnormal human body detection frame and is removed. The second and third human body detection frames may be considered to contain single human bodies and are determined as human body candidate frames, and subsequently only the human bodies in the second and third human body candidate frames are recognized.
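The area-ratio rule above can be sketched in a few lines. This is an illustrative implementation, not code from the patent; the function name and the `(x, y, w, h)` box format are assumptions.

```python
def filter_abnormal_boxes(boxes, ratio=10.0):
    """boxes: list of (x, y, w, h) tuples; returns single-person candidate boxes."""
    if len(boxes) < 2:
        return list(boxes)
    # sort by area, largest first
    ranked = sorted(boxes, key=lambda b: b[2] * b[3], reverse=True)
    if ranked[0][2] * ranked[0][3] > ratio * ranked[1][2] * ranked[1][3]:
        return ranked[1:]   # largest box is assumed to frame a crowd; drop it
    return ranked

# 100x100 box (10000 px^2) exceeds 10x the 20x30 runner-up (600 px^2), so it is dropped
boxes = [(0, 0, 100, 100), (10, 10, 20, 30), (50, 50, 25, 20)]
kept = filter_abnormal_boxes(boxes)
```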
Step 320: and estimating the human body posture in the human body image area to obtain a posture characteristic thermodynamic diagram corresponding to the human body image area.
When step 320 is executed, the method specifically includes: the method comprises the steps of inputting images corresponding to human body image areas into a pre-trained human body posture estimation model, detecting position information of each joint point in the images corresponding to the human body image areas, and outputting each joint point thermodynamic diagram and/or each trunk thermodynamic diagram.
The joint point thermodynamic diagram is used for representing the position information of the joint points, the trunk thermodynamic diagram is used for representing the position information of the trunk, and the trunk represents the connecting line of the adjacent joint points.
In this embodiment, a human body has a plurality of joint points. During training of the human body posture estimation model, each joint point corresponds to one joint point thermodynamic diagram, each joint point thermodynamic diagram reflecting the position information of that joint point; each trunk corresponds to a pair of trunk thermodynamic diagrams, each pair reflecting the position information of one trunk. Therefore, for each human body candidate frame, a plurality of posture feature thermodynamic diagrams can be output through the human body posture estimation model.
For example, referring to fig. 4, which is a schematic diagram of human body joint point and trunk distribution in the embodiment of the present application, take openPose as the human body posture estimation model. The posture feature thermodynamic diagrams obtained by openPose include joint point thermodynamic diagrams and trunk thermodynamic diagrams, and openPose estimates 18 joint points. As shown in fig. 4, the left graph is a schematic diagram of human body joint point distribution: there are 18 joint points, which correspond to 18 joint point thermodynamic diagrams. The right graph is a schematic diagram of trunk distribution: there are 19 trunks, each trunk being a connection line between adjacent joint points, and the 19 trunks correspond to 19 pairs of trunk thermodynamic diagrams.
The flow of obtaining the thermodynamic diagrams of each joint point and each trunk will be briefly described below by taking the human body posture estimation model as an example using openPose.
Referring to fig. 5, an openPose principle flowchart in the embodiment of the present application is shown. As shown in fig. 5, the attitude feature thermodynamic diagrams output by the openPose middle layers are denoted as S_t and L_t, where S_t is the joint point thermodynamic diagram, L_t is the trunk thermodynamic diagram, T is the iteration number of openPose, and t takes values from 1 to T. Comprehensively considering the accuracy of the attitude characteristic thermodynamic diagram and the algorithm performance, the preferred value of T is 4, which is not limited in the embodiment of the application.
As shown in fig. 5, when t=1, the first stage of openPose is run: the image F to be processed is input and passed through two branches, namely sub-network 1 and sub-network 2, which output the joint point thermodynamic diagram S_1 and the torso thermodynamic diagram L_1, respectively. When t=2, the second stage of openPose is run: the image F to be processed, the joint point thermodynamic diagram S_1 and the torso thermodynamic diagram L_1 are input to sub-network 1 and sub-network 2, which output the joint point thermodynamic diagram S_2 and the torso thermodynamic diagram L_2, respectively. This process is repeated until, after T iterations, the joint point thermodynamic diagram S_T and the torso thermodynamic diagram L_T are output. Thus, through T iterations, the characteristic information of the human body gesture in the image to be processed is continuously extracted; this is a coarse-to-fine gesture estimation process. The initially obtained joint point thermodynamic diagram S_1 and torso thermodynamic diagram L_1 may only roughly reflect the human body posture position information, while the finally obtained joint point thermodynamic diagram S_T and torso thermodynamic diagram L_T fuse the information of the gesture characteristic thermodynamic diagrams obtained in the previous T-1 stages and can therefore describe the human body posture position information finely. The joint point thermodynamic diagram S_T and/or the torso thermodynamic diagram L_T may then be used as input to the subsequent classifier.
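The two-branch, T-stage data flow described above can be sketched as follows. The sub-networks here are placeholders (they only produce arrays of the right shape), not real CNNs; the channel counts 18 and 38 follow the openPose joint/torso numbers cited later in this document, and the function names are illustrative.

```python
import numpy as np

def subnet_S(F, S_prev=None, L_prev=None):
    # stand-in for branch 1: joint point heatmaps, 18 channels assumed
    return np.zeros((18,) + F.shape[:2])

def subnet_L(F, S_prev=None, L_prev=None):
    # stand-in for branch 2: torso heatmaps, 38 channels (19 pairs) assumed
    return np.zeros((38,) + F.shape[:2])

def openpose_stages(F, T=4):
    S = subnet_S(F)              # stage 1: image features only
    L = subnet_L(F)
    for _ in range(2, T + 1):    # stages 2..T refine using (F, S, L)
        S = subnet_S(F, S, L)
        L = subnet_L(F, S, L)
    return S, L

F = np.zeros((19, 19, 3))        # toy feature map
S_T, L_T = openpose_stages(F)
```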
For example, if the longest edge of the image corresponding to the human body candidate box input to openPose is set to 150 pixels, the sizes of the joint point thermodynamic diagram S_T and the torso thermodynamic diagram L_T may be 19 x M x N and 38 x M x N, respectively, where 19 and 38 are related to the number of joint points and the number of torsos, respectively, and max(M, N) = 19, the values of M and N being determined by the input image size and the convolution and pooling structure of openPose.
In this way, through the human body posture estimation model, each posture characteristic thermodynamic diagram of the human body in the image corresponding to the human body image area can be obtained. Since the posture characteristic thermodynamic diagram contains richer human body posture information than the joint points, inputting each obtained posture characteristic thermodynamic diagram into the classifier to output the posture category can improve the accuracy of recognition and classification.
Step 330: according to the gesture characteristic thermodynamic diagram, determining gesture scores of the human body image areas corresponding to preset gesture categories respectively, and obtaining a human body gesture recognition result in the human body image areas based on the gesture scores.
Specifically, in step 330, according to the pose characteristic thermodynamic diagram, determining pose scores of the human body image areas corresponding to each preset pose category respectively includes: inputting the gesture characteristic thermodynamic diagram into a classifier, and determining gesture scores of human body image areas corresponding to preset gesture categories respectively.
In the embodiment of the application, when determining the gesture category according to the gesture feature thermodynamic diagram and the classifier, only the joint point thermodynamic diagram or only the trunk thermodynamic diagram may be input. Since both the joint point thermodynamic diagram and the trunk thermodynamic diagram contain more human gesture information than the joint points, the accuracy of human gesture category identification can be improved even if only one kind of gesture feature thermodynamic diagram is input. Of course, the joint point thermodynamic diagram and the trunk thermodynamic diagram may also be input at the same time, which can further improve the identification accuracy.
Further, before inputting each joint point thermodynamic diagram and/or torso thermodynamic diagram into the classifier, the method further comprises: and adjusting the dimension of each joint point thermodynamic diagram and/or each trunk thermodynamic diagram to a preset fixed dimension.
For example, S_T and L_T are adjusted to a fixed dimension by bilinear interpolation; for example, the adjusted fixed dimensions are 19 x 19 x 19 and 38 x 19 x 19, respectively, which is not limiting in this example.
Therefore, the purpose of adjusting to the preset fixed dimension is to ensure that the input dimension of the classifier is fixed, so that the network structure design of the classifier is facilitated.
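A minimal NumPy sketch of the bilinear resizing step, assuming per-channel 2-D interpolation (the patent does not specify the implementation; a library resize such as OpenCV's would work equally well):

```python
import numpy as np

def bilinear_resize(hm, out_h, out_w):
    """Resize one 2-D heatmap to (out_h, out_w) by bilinear interpolation."""
    in_h, in_w = hm.shape
    ys = np.linspace(0, in_h - 1, out_h)
    xs = np.linspace(0, in_w - 1, out_w)
    y0 = np.floor(ys).astype(int); y1 = np.minimum(y0 + 1, in_h - 1)
    x0 = np.floor(xs).astype(int); x1 = np.minimum(x0 + 1, in_w - 1)
    wy = (ys - y0)[:, None]; wx = (xs - x0)[None, :]
    top = hm[np.ix_(y0, x0)] * (1 - wx) + hm[np.ix_(y0, x1)] * wx
    bot = hm[np.ix_(y1, x0)] * (1 - wx) + hm[np.ix_(y1, x1)] * wx
    return top * (1 - wy) + bot * wy

# e.g. resize each channel of an 18 x 25 x 33 joint heatmap stack to 19 x 19
stack = np.random.rand(18, 25, 33)
fixed = np.stack([bilinear_resize(c, 19, 19) for c in stack])
```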
In the embodiment of the present application, a network structure of a classifier is designed, and in particular, when step 330 is executed, different classifiers may be designed according to the type of the input gesture feature thermodynamic diagram, and the following two cases may be classified:
first case: only the joint thermodynamic diagram or torso thermodynamic diagram is entered.
The method specifically comprises the following steps: 1) Each joint thermodynamic diagram or each torso thermodynamic diagram is input into a classifier.
The structure of the classifier at least comprises two-channel pooling, a first full-connection network, a nonlinear activation function, a second full-connection network and a normalization function, wherein the number of input neurons of the first full-connection network is an input dimension, the number of output neurons is a parameter value obtained by training, the number of input neurons of the second full-connection network is a dimension output by an upper network, and the number of output neurons is the number of preset gesture categories.
2) After feature extraction is carried out on each joint point thermodynamic diagram or each trunk thermodynamic diagram by the classifier, the determined pose scores of the human body image areas corresponding to the preset pose categories respectively are obtained, and the recognized pose categories in the human body image areas are output according to the pose scores.
The method specifically comprises the following steps: each joint point thermodynamic diagram or each trunk thermodynamic diagram is feature-compressed through two preset pooling modes in the classifier; the features compressed by the two preset pooling modes are spliced; feature extraction is performed on the spliced features sequentially through the first fully-connected network, the nonlinear activation function and the second fully-connected network to obtain the gesture scores of the human body image area corresponding to each preset gesture category; the gesture scores are normalized to a preset value range through the normalization function; and the gesture category identified in the human body image area is output according to the normalized gesture scores.
That is, when the joint point thermodynamic diagram or the trunk thermodynamic diagram is used alone, the network structure of the classifier may include two-channel pooling, a first fully-connected network, a nonlinear activation function, a second fully-connected network, and a normalization function. After the joint point thermodynamic diagram or the trunk thermodynamic diagram is input into the classifier, feature extraction is performed sequentially through the two-channel pooling process, the first fully-connected network, the nonlinear activation function, and the second fully-connected network, followed by the normalization function, finally obtaining the gesture score belonging to each preset gesture category. The identified gesture category is then output based on the gesture scores; specifically, the gesture category with the maximum gesture score is taken as the finally identified gesture category.
The dual-channel pooling is used for removing redundant features, performing feature compression on the joint point thermodynamic diagram or the trunk thermodynamic diagram while reducing the loss caused by feature compression; therefore, in the embodiment of the application, dual-channel pooling processing is adopted. The basic pooling method invoked by the dual-channel pooling mode can be Mean pooling, Max pooling, or Mean-Max pooling, wherein Mean-Max pooling refers to invoking Max pooling and Mean pooling respectively to realize the dual-channel pooling mode.
The number of input neurons of the first fully-connected network is determined by the input dimension, and the number of output neurons is a parameter value obtained by training: it is a hyperparameter that needs to be selected and optimized in the training stage. The number of input neurons of the second fully-connected network is the dimension output by the upper network, and the number of output neurons is the number of preset gesture categories.
The purpose of the nonlinear activation function is to increase the nonlinear relationship between the network structure layers, for example, a modified linear unit (Rectified linear unit, reLU) method may be used, which is not limited in the embodiment of the present application.
The normalization function is used for normalizing the gesture score to a preset value range, so that the gesture score belonging to each gesture category can be more intuitively evaluated, for example, a SoftMax method can be adopted, and the gesture score can be normalized to be between 0 and 1.
For example, take a torso thermodynamic diagram L_T with an input size of 38 x 19 x 19. First, the second dimension of 38 x 19 x 19 is pooled to obtain 38 x 1 x 19, which is equivalent to 38 x 19; then the third dimension of 38 x 19 x 19 is pooled to obtain 38 x 19 x 1, which is equivalent to 38 x 19; finally the two pooled results are spliced to obtain 38 x 38. Taking the Mean-Max pooling mode as an example, the Max pooling method and the Mean pooling method are invoked respectively, and the two resulting 38 x 38 features are spliced to form a 38 x 76 feature. That is, the torso thermodynamic diagram L_T of size 38 x 19 x 19 has dimension 38 x 76 after processing in the two-channel pooling mode. The 38 x 76 feature is then processed by the first fully-connected network, the nonlinear activation function, the second fully-connected network and the normalization function in sequence, and finally the corresponding gesture category is output.
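The dimension bookkeeping above can be checked numerically. A minimal sketch of the Mean-Max dual-channel pooling, with function names chosen for illustration:

```python
import numpy as np

def dual_channel_pool(hm, reduce_fn):
    # pool over axis 1 (38 x 1 x 19 -> 38 x 19) and axis 2 (38 x 19 x 1 -> 38 x 19),
    # then splice the two results into a 38 x 38 feature
    a = reduce_fn(hm, axis=1)
    b = reduce_fn(hm, axis=2)
    return np.concatenate([a, b], axis=1)

def mean_max_features(hm):
    # invoke Max pooling and Mean pooling separately and splice: 38 x 76
    return np.concatenate([dual_channel_pool(hm, np.max),
                           dual_channel_pool(hm, np.mean)], axis=1)

L_T = np.random.rand(38, 19, 19)   # torso heatmap stack
feat = mean_max_features(L_T)
```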
Second case: and simultaneously inputting an articulation point thermodynamic diagram and a trunk thermodynamic diagram.
The method specifically comprises the following steps: 1) Each joint point thermodynamic diagram and each torso thermodynamic diagram are input into a classifier.
The structure of the classifier at least comprises two-channel pooling, feature cascading processing, a first fully-connected network, a nonlinear activation function, a second fully-connected network and a normalization function, wherein the number of input neurons of the first fully-connected network is an input dimension, the number of output neurons is a parameter value obtained by training, the number of input neurons of the second fully-connected network is a dimension output by an upper network, and the number of output neurons is the number of preset gesture categories.
2) And after the classifier performs feature extraction on the joint point thermodynamic diagrams and the trunk thermodynamic diagrams, respectively corresponding the determined human body image areas to the gesture scores of the preset gesture categories, and outputting the gesture categories identified in the human body image areas according to the gesture scores.
The method specifically comprises the following steps: each joint point thermodynamic diagram and each trunk thermodynamic diagram are feature-compressed through two preset pooling modes in the classifier; the features compressed by the two preset pooling modes are spliced; the spliced features corresponding to the joint point thermodynamic diagrams and the spliced features corresponding to the trunk thermodynamic diagrams are concatenated through feature cascading; feature extraction is performed on the cascaded features sequentially through the first fully-connected network, the nonlinear activation function and the second fully-connected network to obtain the gesture scores of the human body image area corresponding to each preset gesture category; the gesture scores are normalized to a preset value range through the normalization function; and the gesture category identified in the human body image area is output according to the normalized gesture scores.
That is, when the node thermodynamic diagram and the trunk thermodynamic diagram are simultaneously input, the network structure of the classifier at least includes two-channel pooling, feature cascading, a first fully-connected network, a nonlinear activation function, a second fully-connected network, and a normalization function.
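The feature-cascading step can be sketched with NumPy. The 19 x 76 and 38 x 76 sizes here are assumptions extrapolated from the single-heatmap pooling example elsewhere in this document (19 joint-point channels and 38 torso channels, each pooled to 76 features); the patent does not state the cascaded dimension explicitly.

```python
import numpy as np

joint_feat = np.random.rand(19, 76)   # pooled joint-point heatmap features (assumed size)
torso_feat = np.random.rand(38, 76)   # pooled torso heatmap features (assumed size)

# cascade the two pooled feature maps along the channel axis, then flatten
# for the first fully-connected network
cascaded = np.concatenate([joint_feat, torso_feat], axis=0)
flat = cascaded.reshape(-1)
```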
Further, in the network structure of the classifier in the embodiment of the present application, the module structures after the two-channel pooling and feature cascading (that is, the first fully-connected network, the nonlinear activation function, the second fully-connected network, and the normalization function) are not limited, and other classifiers may be used instead, for example, a one-dimensional convolutional neural network, a support vector machine (Support Vector Machine, SVM), a decision tree, a nearest neighbor classifier, and the like.
In this way, in the embodiment of the application, for an image to be identified, a human body image area is obtained by detection from the image to be identified, and pose estimation is performed on the human body image area to obtain the gesture characteristic thermodynamic diagram corresponding to the human body image area, the gesture characteristic thermodynamic diagram comprising each joint point thermodynamic diagram and/or each trunk thermodynamic diagram. The gesture characteristic thermodynamic diagram is then input into the classifier, the gesture score belonging to each preset gesture category is determined, and the gesture category with the largest gesture score is determined as the human body gesture identification result in the human body image area. Since the gesture characteristic thermodynamic diagram is an intermediate feature of joint point estimation and contains richer human body gesture information than the joint points, using the gesture characteristic thermodynamic diagram instead of the joint points as the input of the classifier, and obtaining the gesture category through the gesture characteristic thermodynamic diagram, improves the classification accuracy.
Based on the above embodiments, in order to further evaluate the effect and accuracy of the gesture recognition method in the embodiment of the present application, image samples are obtained and classifiers are trained for two cases: using joint points as the classifier input, and using gesture feature thermodynamic diagrams as the classifier input. With the trained models, the advantages of the gesture recognition method in the embodiment of the present application over the prior-art method are verified. To ensure the reliability of the experiments, the network structures of the classifiers in the two cases designed in the embodiment of the present application are kept substantially consistent.
Referring specifically to fig. 6, which is a schematic flowchart of a gesture recognition method in the embodiment of the present application, taking the human body image area as a human body candidate frame as an example, as shown in fig. 6, the gesture recognition method is mainly divided into four parts, namely acquisition of image samples, generation of human body candidate frames, estimation of the human body gesture, and recognition of the human body gesture (the classifier), and specifically includes the following steps:
step 60: a set of image samples is acquired.
The image sample set initially acquired here may consist of images related to each preset gesture category; these images may be processed by the human body candidate frame generating module, after which the gesture category is labeled for each image containing a single person.
For example, the preset gesture categories include 10 categories: squatting, leg raising, kneeling, saluting, crawling, lying prone, lying flat, bending, standing and sitting. An image sample set related to these 10 human gesture categories is obtained; human body candidate frames are then obtained through the subsequent human body candidate frame generating module based on human body detection model detection; the images of the human body candidate frames are respectively cropped out, so that for each human body candidate frame a single-person image with a human body gesture is obtained; and gesture category labeling is performed.
Each type of labeled image can be split into simple image samples and difficult image samples. A simple image sample refers to an image whose human body gesture is easily distinguished by the human eye; fig. 7 shows an example of a simple image sample in the embodiment of the application. The simple image samples can be split into a training set and a validation set at a ratio of 8:2. Table 1 shows the total distribution of the simple image samples in the embodiment of the application, listing the 10 human body gesture categories and the number of image samples corresponding to each gesture category. A difficult sample refers to an image whose human body gesture is difficult to distinguish by the human eye; fig. 8 shows an example of a difficult image sample in the embodiment of the application. All difficult samples are used as the test set. Table 2 shows the total distribution of the difficult image samples in the embodiment of the application, likewise listing the 10 human body gesture categories and the corresponding numbers of image samples. It should be noted that only the image samples themselves are used in the actual training, validation and testing; the gestures are marked on the image sample figures only for illustration.
Table 1. Distribution of simple image samples (training and validation).

Label     0          1            2         3         4         5            6           7        8         9
Category  Squatting  Leg-raising  Kneeling  Saluting  Crawling  Lying prone  Lying flat  Bending  Standing  Sitting
Count     1005       872          660       504       383       432          358         512      2284      1974
Table 2. Distribution of difficult image samples (test set).

Label     0          1            2         3         4         5            6           7        8         9
Category  Squatting  Leg-raising  Kneeling  Saluting  Crawling  Lying prone  Lying flat  Bending  Standing  Sitting
Count     18         178          245       150       31        192          33          99       261       145
Step 61: and generating human body candidate frames.
The method specifically comprises the following steps: step 61.1: and filtering the abnormal image.
One possible implementation manner is provided in the embodiment of the present application: filtering out abnormal images in the image sample set, wherein an abnormal image represents an image meeting either of the following conditions: the aspect ratio exceeds the ratio threshold, or the width or height is less than the pixel threshold.
In the embodiment of the present application, further, in order to improve the efficiency of the gesture recognition method, abnormal images may be handled first when generating the human body candidate frames. Considering that an image with an abnormal aspect ratio generally does not contain a human body, and that a human body is generally difficult to detect in a low-resolution image, these two types of abnormal images may be filtered out, which also improves the overall response speed of the gesture recognition method.
For example, if the scale threshold is 10 times and the pixel threshold is 20, then images with aspect ratios exceeding 10 times, or images with widths or heights less than 20 pixels, can be filtered out.
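The filter with the example thresholds above can be sketched as follows; the function name and the decision to treat the aspect ratio as max(w, h)/min(w, h) are illustrative assumptions, not from the patent.

```python
def is_abnormal_image(width, height, ratio_threshold=10.0, pixel_threshold=20):
    """True if the image should be filtered out before human body detection."""
    if width < pixel_threshold or height < pixel_threshold:
        return True                     # too small to reliably detect a human body
    aspect = max(width, height) / min(width, height)
    return aspect > ratio_threshold     # extreme aspect ratios rarely contain a body

# (640, 480) is kept; (1000, 50) fails the 10x aspect rule; (15, 300) is under 20 px wide
images = [(640, 480), (1000, 50), (15, 300)]
kept = [wh for wh in images if not is_abnormal_image(*wh)]
```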
Step 61.2: and detecting to obtain a human body detection frame.
For example, the human detection frame may be detected according to the YoloV3 algorithm, without limitation.
Step 61.3: judging whether the human body detection frame is detected or not, if yes, executing the step 61.5, and if no, executing the step 61.4.
Step 61.4: the artwork is used.
In this embodiment of the present application, in order to prevent missing a human body, when no human body detection frame is detected, the whole image is directly adopted as input; that is, no cropping of the original image is required, and the original image is input to the next module as-is.
Step 61.5: and removing the abnormal human body detection frame.
Wherein the abnormal human body detection frame represents a human body detection frame satisfying the following condition: among the plurality of human body detection frames detected in an image, it is the largest human body detection frame, and its area exceeds the areas of the other human body detection frames by a preset multiple.
For each image, the human body detection frames detected in that image can be sorted by area; if the area of the first-ranked (largest) human body detection frame exceeds 10 times the area of the second-ranked frame, the first-ranked human body detection frame is considered an abnormal human body detection frame and is removed.
For example, the YoloV3 algorithm may be trained based on the open-source OpenImages database. Considering that this database labels crowds as "person" as well, the human body detection frames output by the trained YoloV3 may contain a single person or a crowd. To exclude human body detection frames containing crowds, the detection frames with abnormal areas need to be removed, ensuring that all human body candidate frames obtained after filtering finally contain single human bodies.
Step 61.6: and selecting a human body candidate frame with the largest area.
Namely, among the human body candidate frames remaining after the abnormal human body detection frames are filtered out, the human body candidate frame with the largest area is selected as the input of the subsequent human body posture estimation algorithm.
In this way, the algorithm performance can be ensured and the training and verification process is convenient. Alternatively, the plurality of human body candidate frames remaining after filtering can be sent to the subsequent modules in turn according to area size; this is not limited here.
Further, to improve the performance of the subsequent human body posture estimation algorithm, the edge size of the human body candidate frame may also be defined, for example, the maximum edge of the human body candidate frame may be scaled to 150 pixels.
And in the training stage, the images corresponding to the candidate frames of the human body obtained by matting can be marked in gesture categories and divided into a training set, a verification set and a test set, wherein the training set and the verification set are used for determining the optimal parameter configuration of each module in the gesture recognition method, and the test set is used for testing the training effect after training is completed.
Step 62: and estimating the human body posture.
Specifically, an image corresponding to a human body candidate frame is input into a human body posture estimation model, position information of each joint point in the image corresponding to the human body candidate frame is detected, and each joint point thermodynamic diagram and/or each trunk thermodynamic diagram is output.
Further, in order to compare the recognition effect of directly inputting joint points to the classifier with that of inputting the gesture feature thermodynamic diagram to the classifier, the human body gesture estimation model outputs not only the gesture feature thermodynamic diagrams of the middle-layer features but also the finally recognized joint points.
For example, the body posture estimation model employs openPose, through which 18 nodes and 19 pairs of torso can be identified.
Step 63: and (5) recognizing human body gestures.
The core of the human body gesture recognition module is the design of the classifier. In order to compare the recognition accuracy and effect of the gesture recognition method in the embodiment of the present application with the gesture recognition method in the prior art, the network structures of the classifiers for the different cases may be designed separately in the human body gesture recognition process, and these network structures should be substantially consistent. For convenience of distinction, the classifier that performs gesture recognition using joint points is called the joint feature classification network, and the classifier that performs gesture recognition using the gesture feature thermodynamic diagram is called the convolution layer feature classification network. Based on the different classifiers, the processing may be divided into two branches, specifically including:
Step 63.1: and removing the incomplete posture.
In the embodiment of the present application, if the number of joint points obtained by the human body posture estimation model is too small, it is impossible to determine which posture category they belong to. Therefore, in order to improve the response speed of the algorithm, a threshold may be set, and cases with too few joint points may be discarded.
For example, openPose estimates 18 joint points, including 5 on the head and 13 at other positions. The 4 joint points at the eyes and ears can be ignored when performing gesture recognition, because these 4 joint points are redundant for judging the integrity of the human body gesture; the gesture category can be recognized from the remaining joint points input into the classifier. Based on this, a threshold, for example 10, can be set: when the number of joint points estimated by the human body posture estimation model is less than 10, the case can be directly discarded without being input into the classifier for gesture recognition.
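The incomplete-pose filter can be sketched as below. The eye/ear index set follows the common openPose 18-keypoint convention but is an assumption here, as is representing undetected joints as `None`.

```python
EYE_EAR_INDICES = {14, 15, 16, 17}   # assumed openPose indices for eyes and ears

def is_complete_pose(joints, min_joints=10):
    """joints: list of 18 (x, y) tuples or None for undetected joint points."""
    detected = [j for i, j in enumerate(joints)
                if i not in EYE_EAR_INDICES and j is not None]
    return len(detected) >= min_joints   # discard poses with too few joints

full_pose = [(float(i), float(i)) for i in range(18)]   # all 18 detected
sparse_pose = [None] * 12 + [(1.0, 2.0)] * 6            # mostly undetected
```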
Step 63.2: a joint feature classification network.
Referring to fig. 9, a schematic structural diagram of the joint feature classification network in the embodiment of the present application is shown; the joint feature classification network is the classifier whose input is joint points. As shown in fig. 9, its network structure at least includes a first fully-connected network, a nonlinear activation function, a second fully-connected network, and a normalization function. The input joint features consist of the 18 joint points estimated by OpenPose, with size 18 x 2, representing the abscissa and ordinate of each of the 18 joint points. The number of input neurons of the first fully-connected network is determined by the dimension of the input features, and its number of output neurons is a hyperparameter that requires training, selection and optimization. The number of input neurons of the second fully-connected network is determined by the output of the preceding network, and its number of output neurons is the number of human posture categories. The nonlinear activation function may employ ReLU, and the normalization function, which normalizes the output of the second fully-connected network, may employ SoftMax.
Other classifiers may also be selected for the joint feature classification network, such as a convolutional neural network, an SVM, a decision tree, or a nearest neighbor classifier; the same options apply to the convolutional layer feature classification network.
In this way, the joint points obtained through the human body posture estimation model are filtered in step 63.1: if the number of estimated joints in the image corresponding to a human body candidate frame is smaller than the threshold value, the joints estimated for that image are filtered out. The joints of the image corresponding to each remaining human body candidate frame are then input into the joint feature classification network for posture recognition, and the posture category is output.
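The forward pass of the joint feature classification network (FC, ReLU, FC, SoftMax) can be sketched in numpy. This is a minimal illustration with random stand-in weights, not the trained model; the hidden width of 200 is borrowed from the experiment tables, and the 10-class output matches the preset posture categories.

```python
import numpy as np

# Sketch of the joint feature classification network: the 18 x 2 joint
# coordinates are flattened to 36 features, passed through a first FC layer
# plus ReLU, a second FC layer producing one score per posture category,
# and a SoftMax normalization. Weights are random stand-ins.

rng = np.random.default_rng(0)
IN_DIM, HIDDEN, CLASSES = 18 * 2, 200, 10   # HIDDEN is a tunable hyperparameter

W1 = rng.standard_normal((IN_DIM, HIDDEN)) * 0.01
b1 = np.zeros(HIDDEN)
W2 = rng.standard_normal((HIDDEN, CLASSES)) * 0.01
b2 = np.zeros(CLASSES)

def classify(joints):
    """joints: array of shape (18, 2) holding (x, y) per joint point."""
    x = joints.reshape(-1)                 # flatten to a 36-dim feature vector
    h = np.maximum(0.0, x @ W1 + b1)       # first fully-connected layer + ReLU
    logits = h @ W2 + b2                   # second FC: one score per category
    e = np.exp(logits - logits.max())      # SoftMax normalization
    return e / e.sum()

probs = classify(rng.uniform(0, 1, size=(18, 2)))
```

The output is a normalized score vector over the 10 preset posture categories.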
Step 63.3: convolutional layer feature classification networks.
Taking as an example an input gesture feature thermodynamic diagram that includes both the joint point thermodynamic diagrams and the torso thermodynamic diagrams, refer to fig. 10, which shows a schematic diagram of the convolutional layer feature classification network structure in the embodiment of the present application; the convolutional layer feature classification network is the classifier whose input is the gesture feature thermodynamic diagram. As shown in fig. 10, the structure at least includes two-channel pooling, feature cascading, a first fully-connected network, a nonlinear activation function, a second fully-connected network, and a normalization function. The joint point thermodynamic diagrams and the torso thermodynamic diagrams are each subjected to two-channel pooling to remove redundant features; the pooling results are then spliced by the feature cascade module and input to the first fully-connected network through a feature stretching module. The number of input neurons of the first fully-connected network is the input dimension, and its number of output neurons is a hyperparameter that requires training and optimization; the number of input neurons of the second fully-connected network is the dimension output by the preceding network, and its number of output neurons is the preset number of gesture categories. The nonlinear activation function may employ ReLU and the normalization function may employ SoftMax; the remaining settings are the same as those of the joint feature classification network.
That is, in the embodiment of the present application, the human body posture estimation model may also output each joint point thermodynamic diagram and each torso thermodynamic diagram; these are input into the convolutional layer feature classification network, which performs feature learning and extraction on them and outputs a posture category.
Of course, only the joint point thermodynamic diagrams or only the torso thermodynamic diagrams may be input. The network structure is then the same as that shown in fig. 10 except that the feature cascade module is removed, so a convolutional layer feature classification network for a single type of thermodynamic diagram may be trained during training and comparative verification.
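The convolutional-layer classifier pipeline described above can be sketched in numpy: two-channel pooling compresses each heatmap to its mean and maximum, the two branches are concatenated (feature cascade), then the FC/ReLU/FC/SoftMax head is applied. The map counts and sizes (18 joint maps, 17 torso maps, 46x46) and the random weights are illustrative assumptions, not values from the patent.

```python
import numpy as np

# Sketch of the convolutional layer feature classification network. Each
# thermodynamic diagram is reduced by two-channel pooling (mean + max over
# the map), the joint-point and torso branches are concatenated, and an
# FC -> ReLU -> FC -> SoftMax head produces per-category gesture scores.

rng = np.random.default_rng(0)

def two_channel_pool(maps):
    """maps: (n, H, W) heatmaps -> (2n,) mean- and max-pooled features."""
    return np.concatenate([maps.mean(axis=(1, 2)), maps.max(axis=(1, 2))])

joint_maps = rng.uniform(0, 1, size=(18, 46, 46))   # joint point heatmaps (assumed size)
torso_maps = rng.uniform(0, 1, size=(17, 46, 46))   # torso heatmaps (assumed count)

feat = np.concatenate([two_channel_pool(joint_maps),
                       two_channel_pool(torso_maps)])  # feature cascade: 70-dim

HIDDEN, CLASSES = 200, 10
W1 = rng.standard_normal((feat.size, HIDDEN)) * 0.01   # stand-in weights
W2 = rng.standard_normal((HIDDEN, CLASSES)) * 0.01

h = np.maximum(0.0, feat @ W1)        # first fully-connected layer + ReLU
logits = h @ W2                       # second FC: one score per category
scores = np.exp(logits - logits.max())
scores /= scores.sum()                # SoftMax normalization
```

Dropping either branch and the concatenation gives the single-input variant mentioned above.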
Step 63.4: outputting the gesture category with the largest gesture score.
In the embodiment of the application, the gesture score for each gesture category can be determined through the convolutional layer feature classification network or the joint feature classification network, and the gesture category with the largest gesture score is taken as the predicted gesture category. A gesture score threshold can also be set, so that only gesture categories whose score exceeds the threshold are considered predicted; if multiple scores exceed the threshold, the gesture category with the largest score can be taken as the final predicted gesture category.
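Step 63.4 reduces to an argmax with an optional confidence gate. A minimal sketch; the threshold value 0.5 is illustrative, since the description does not fix one:

```python
# Sketch of step 63.4: pick the category with the highest normalized gesture
# score, and only accept it when it clears the (assumed) score threshold.

def predict(scores, threshold=0.5):
    """scores: per-category normalized gesture scores (e.g. SoftMax output)."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return best if scores[best] > threshold else None  # None: no confident category

# Category 1 wins when its score clears the threshold; an indecisive score
# vector yields no prediction.
```

If several categories clear the threshold, `max` still returns the single largest one, matching the final-prediction rule above.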
In this embodiment of the present application, following the flowchart shown in fig. 6, the effect of the gesture recognition method in the embodiment of the present application relative to the gesture recognition method in the prior art may be compared through experiments.
In addition, in the embodiment of the application, each module in fig. 6 may be trained separately during the training process. That is, a human body detection model may first be trained on image samples to generate human body candidate frames, so that the image corresponding to each candidate frame, i.e. a single image containing a single gesture, can be obtained and gesture classification performed. A human body gesture estimation model is then trained on the candidate-frame images and the corresponding gesture classification information, and the gesture feature thermodynamic diagrams or joint points output by that model are obtained. Finally, the corresponding convolutional layer feature classification network is trained on the gesture feature thermodynamic diagrams, or the corresponding joint feature classification network is trained on the joint points.
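The staged training above can be sketched as a small orchestration. All three `train_*` callables and the detector/pose-model interfaces are hypothetical stand-ins used only to show the data flow between stages:

```python
# Hedged sketch of the three-stage training flow: detection model first,
# then pose estimation on the candidate-frame crops, then the matching
# classifier on the extracted features. The trainers are stand-ins.

def train_pipeline(image_samples, train_detector, train_pose_model, train_classifier):
    detector = train_detector(image_samples)          # 1) human body detection model
    crops = [detector(img) for img in image_samples]  # candidate-frame images, one gesture each
    pose_model = train_pose_model(crops)              # 2) human body pose estimation model
    features = [pose_model(c) for c in crops]         # heatmaps or joint points per crop
    return train_classifier(features)                 # 3) matching classification network

# Toy run with stand-in trainers, purely to exercise the data flow:
clf = train_pipeline(
    [1, 2],
    lambda samples: (lambda img: img * 10),   # "detector" producing crops
    lambda crops: (lambda c: c + 1),          # "pose model" producing features
    lambda feats: feats,                      # "classifier" recording its inputs
)
```

Training each stage on the previous stage's output is what lets the heatmap-based and joint-based classifiers be compared under otherwise identical conditions.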
Based on the above embodiments, the parameter settings and experimental results of the following experiments are briefly described.
Referring to Table 3, a comparison of experimental results in the embodiment of the present application is shown.
Table 3.

    Input      Batch_size   Mid_fea_num   Pool_method   Is_init   Val      Test
    Pose       32           200           -             No        90.96%   50.96%
    S_T        16           300           Max           No        89.73%   58.51%
    L_T        8            300           Mean_max      No        91.41%   62.43%
    S_T+L_T    8            200           Mean_max      Yes       91.80%   63.02%
As shown in Table 3, the experimental parameters were set as follows:
input characteristics: pose represents an off node, S T Representing a thermodynamic diagram of a joint point, L T Representing a torso thermodynamic diagram, S T +L T Representing the simultaneous use of a joint point thermodynamic diagram and a torso thermodynamic diagram as input features.
Parameter selection: batch_size represents the number of images input, and the selection range can be 8, 16 and 32; mid_fea_num represents the number of output neurons of the first fully connected network, and the selection range can be 50, 100, 200 and 300; pool_method represents a pooling method for input features, and the selection range can be Max, mean, mean _max; is_init represents whether to initialize the network, and the selection range can be Yes or No, wherein Yes represents that the initialization parameters accord with a set rule when the training Is just started, and No represents that the initialization parameters are randomly distributed when the training Is just started.
As can be seen from Table 3, in the embodiment of the present application, training is performed on a training set, the parameter configuration with the highest average classification accuracy over the gesture categories is then determined on a verification set, and the determined configuration is used on the test set to obtain the effects and recognition results. The values of Batch_size, Mid_fea_num, Pool_method and Is_init listed in Table 3 are the optimal configurations for the different input features. As shown in Table 3, inputting S_T, L_T or S_T+L_T performs better than inputting Pose. Specifically, using S_T, the average accuracy over the 10 classes is 58.51%, 7.55% higher than Pose; using L_T, it is 62.43%, 11.47% higher than Pose; and using S_T+L_T it is highest at 63.02%, 12.06% higher than Pose. That is, gesture recognition based on the gesture feature thermodynamic diagram is more accurate and classifies better than gesture recognition based on joint points.
To explain the comparison further, recognition within the individual posture categories is described for the cases where the joint points, and the joint point plus torso thermodynamic diagrams, are used as classifier input, with 10 preset posture categories.
Specifically, as shown in Tables 4 and 5: Table 4 is the confusion matrix obtained in the embodiment of the present application with joint points as input, and Table 5 is the confusion matrix obtained with S_T+L_T as input.
Table 4.
Table 5.

          0     1     2     3     4     5     6     7     8     9
    0     4     0     1     0     2     1     0     2     3     5
    1     5   104     0     0     0     0     3     0     4    62
    2     7     1    80     2    20     5     3    25    61    41
    3     0     0     0    93     1     1     1     0    52     2
    4     0     0     1     0    23     1     0     2     1     3
    5     2     0     0    15    43   119     1     2     2     8
    6     0     1     0     1     2     1    23     1     0     4
    7     1     0     0     0     1     0     0    78    16     3
    8     1     1     2    18     1     0     1     2   221    14
    9     6     4     3     3     4     0     1     3    14   107
As shown in Tables 4 and 5, the first row and the first column each list the 10 pose classes: the first column gives the true pose class of the images and the first row gives the recognized pose class, so a recognition is correct only when the row and column classes coincide; otherwise it is an error. The numbers in the remaining cells give the number of images in each case; for example, the number 3 in the 3rd column of the 4th row of Table 4 means that 3 images whose true pose class is 2 were recognized by the algorithm as pose class 1. That is, only the numbers on the diagonal (where the row and column pose classes coincide) count correctly classified images, and the numbers elsewhere count misclassified images. Comparing Tables 4 and 5 also shows that, over the 10 pose classes, recognition using the joint point and torso thermodynamic diagrams is more accurate than recognition using the joint points.
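Reading accuracy off a confusion matrix as described above amounts to summing the diagonal and dividing by the total. A minimal sketch on a small 3-class matrix with illustrative counts:

```python
# Entry [i][j] counts images of true class i recognized as class j, so the
# diagonal holds the correctly classified images. Counts are illustrative.
confusion = [
    [4, 1, 0],
    [2, 5, 1],
    [0, 0, 7],
]

correct = sum(confusion[i][i] for i in range(len(confusion)))
total = sum(sum(row) for row in confusion)
accuracy = correct / total
```

The same computation over the 10x10 diagonals of Tables 4 and 5 is what supports the comparison in the text.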
Based on the same inventive concept, the embodiment of the application also provides a gesture recognition device, which may be, for example, a server in the foregoing embodiment, and the gesture recognition device may be a hardware structure, a software module, or a hardware structure plus a software module. Based on the above embodiments, referring to fig. 11, the gesture recognition apparatus in the embodiment of the present application specifically includes:
a detection module 1110, configured to determine a human body image area in an image to be identified;
the estimation module 1120 is configured to perform human body pose estimation in the human body image area, and obtain a pose feature thermodynamic diagram corresponding to the human body image area;
the recognition module 1130 is configured to determine, according to the gesture feature thermodynamic diagram, gesture scores of the human body image areas corresponding to preset gesture categories respectively; and obtaining a human body gesture recognition result in the human body image area based on the gesture score.
Optionally, the detection module 1110 is specifically configured to: when the image to be identified does not meet the abnormal image condition, performing human body detection on the image to be identified based on a human body detection model obtained through pre-training to obtain a human body detection result;
When the human body detection result does not have a human body detection frame, taking the image to be identified as a human body image area;
when a human body detection frame exists in the human body detection result, filtering out abnormal human body detection frames in the human body detection frame to obtain human body candidate frames; and determining a human body image area according to the human body candidate frame.
Optionally, the apparatus further includes: a filtering module 1140, configured to calculate the area of each human body detection frame when human body detection frames exist in the human body detection result; determine abnormal human body detection frames among the human body detection frames according to their areas; and filter out the abnormal human body detection frames to obtain human body candidate frames.
Optionally, the pose feature thermodynamic diagram includes a joint point thermodynamic diagram and/or a torso thermodynamic diagram; the estimation module 1120 is specifically configured to:
inputting the image corresponding to the human body image area into a pre-trained human body posture estimation model, detecting the position information of each joint point in the image corresponding to the human body image area, and outputting each joint point thermodynamic diagram and/or each trunk thermodynamic diagram, wherein the joint point thermodynamic diagram is used for representing the position information of the joint point, the trunk thermodynamic diagram is used for representing the position information of the trunk, and the trunk represents the connecting line of the adjacent joint points.
Optionally, the identification module 1130 is specifically configured to: and inputting the gesture characteristic thermodynamic diagram into a classifier, and determining gesture scores of the human body image areas corresponding to preset gesture categories respectively.
Optionally, the pose feature thermodynamic diagram includes a joint point thermodynamic diagram and/or a torso thermodynamic diagram; the identification module 1130 is specifically configured to:
inputting the thermodynamic diagrams of all the joints or the thermodynamic diagrams of all the trunk into a classifier, wherein the structure of the classifier at least comprises a two-channel pooling, a first fully-connected network, a nonlinear activation function, a second fully-connected network and a normalization function, the number of input neurons of the first fully-connected network is an input dimension, the number of output neurons is a parameter value obtained by training, the number of input neurons of the second fully-connected network is a dimension output by an upper network, and the number of output neurons is a preset number of gesture categories;
and respectively carrying out feature compression on each joint point thermodynamic diagram or each trunk thermodynamic diagram by adopting two preset pooling modes through two channel pooling in the classifier, splicing the features compressed by the two preset pooling modes, sequentially carrying out feature extraction on the spliced features through a first full-connection network, a nonlinear activation function and a second full-connection network to obtain gesture scores of the human body image areas corresponding to each preset gesture category, normalizing the gesture scores to a preset value range through a normalization function, and outputting the gesture categories identified in the human body image areas according to the normalized gesture scores.
Optionally, the gesture feature thermodynamic diagram includes a joint point thermodynamic diagram and a torso thermodynamic diagram, and the identifying module 1130 is specifically configured to:
inputting the thermodynamic diagrams of each joint point and each trunk thermodynamic diagram into a classifier, wherein the structure of the classifier at least comprises two-channel pooling, feature cascading, a first fully-connected network, a nonlinear activation function, a second fully-connected network and a normalization function, the number of input neurons of the first fully-connected network is an input dimension, the number of output neurons is a parameter value obtained by training, the number of input neurons of the second fully-connected network is a dimension output by an upper network, and the number of output neurons is a preset number of gesture categories;
and carrying out feature compression on each joint point thermodynamic diagram and each trunk thermodynamic diagram by adopting a preset two-way pooling mode in the classifier, splicing the characteristics compressed by the preset two-way pooling mode, splicing the spliced characteristics corresponding to each joint point thermodynamic diagram and the spliced characteristics corresponding to each trunk thermodynamic diagram, carrying out cascading splicing through feature cascading, carrying out feature extraction on the characteristics after cascading splicing through a first full-connection network, a nonlinear activation function and a second full-connection network in sequence, obtaining gesture scores of the human body image region corresponding to each preset gesture category, normalizing the gesture scores to a preset value range through a normalization function, and outputting the gesture categories identified in the human body image region according to the normalized gesture scores.
The division of the modules in the embodiments of the present application is schematically only one logic function division, and there may be another division manner in actual implementation, and in addition, each functional module in the embodiments of the present application may be integrated in one processor, or may exist separately and physically, or two or more modules may be integrated in one module. The integrated modules may be implemented in hardware or in software functional modules.
Based on the above embodiments, in the embodiments of the present application, there is provided a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the gesture recognition method in any of the above method embodiments.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various modifications and variations can be made to the embodiments of the present application without departing from the spirit and scope of the embodiments of the present application. Thus, if such modifications and variations of the embodiments of the present application fall within the scope of the claims and the equivalents thereof, the present application is intended to encompass such modifications and variations.

Claims (8)

1. A gesture recognition method, comprising:
Determining a human body image area in an image to be identified;
estimating the human body posture in the human body image area to obtain a posture characteristic thermodynamic diagram corresponding to the human body image area;
inputting the gesture characteristic thermodynamic diagram into a classifier, and determining gesture scores of the human body image areas corresponding to preset gesture categories respectively;
based on the gesture score, a human gesture recognition result in the human image area is obtained;
wherein if the pose characteristic thermodynamic diagram comprises a joint point thermodynamic diagram or a torso thermodynamic diagram; inputting the gesture characteristic thermodynamic diagram into a classifier, determining gesture scores of the human body image areas corresponding to preset gesture categories respectively, and obtaining a human body gesture recognition result in the human body image areas based on the gesture scores, wherein the method specifically comprises the following steps:
inputting the thermodynamic diagrams of all the joints or the thermodynamic diagrams of all the trunk into a classifier, wherein the structure of the classifier at least comprises a two-channel pooling, a first fully-connected network, a nonlinear activation function, a second fully-connected network and a normalization function, the number of input neurons of the first fully-connected network is an input dimension, the number of output neurons is a parameter value obtained by training, the number of input neurons of the second fully-connected network is a dimension output by an upper network, and the number of output neurons is a preset number of gesture categories;
And respectively carrying out feature compression on each joint point thermodynamic diagram or each trunk thermodynamic diagram by adopting two preset pooling modes through two channel pooling in the classifier, splicing the features compressed by the two preset pooling modes, sequentially carrying out feature extraction on the spliced features through a first full-connection network, a nonlinear activation function and a second full-connection network to obtain gesture scores of the human body image areas corresponding to each preset gesture category, normalizing the gesture scores to a preset value range through a normalization function, and outputting the gesture categories identified in the human body image areas according to the normalized gesture scores.
2. The method of claim 1, wherein the determining the human body image region in the image to be identified comprises:
when the image to be identified does not meet the abnormal image condition, performing human body detection on the image to be identified based on a human body detection model obtained through pre-training to obtain a human body detection result;
when the human body detection result does not have a human body detection frame, taking the image to be identified as a human body image area;
when a human body detection frame exists in the human body detection result, filtering out abnormal human body detection frames in the human body detection frame to obtain human body candidate frames; and determining a human body image area according to the human body candidate frame.
3. The method according to claim 2, wherein when the human detection frame exists in the human detection result, filtering out abnormal human detection frames in the human detection frame to obtain human candidate frames, including:
when the human body detection frames exist in the human body detection results, calculating the area of each human body detection frame;
determining abnormal human body detection frames in the human body detection frames according to the areas of the human body detection frames;
and filtering the abnormal human body detection frame to obtain human body candidate frames.
4. The method of claim 1, wherein the pose feature thermodynamic diagram comprises a joint point thermodynamic diagram and/or a torso thermodynamic diagram; the human body posture estimation is carried out on the human body image area to obtain a posture characteristic thermodynamic diagram corresponding to the human body image area, which comprises the following steps:
inputting the image corresponding to the human body image area into a pre-trained human body posture estimation model, detecting the position information of each joint point in the image corresponding to the human body image area, and outputting each joint point thermodynamic diagram and/or each trunk thermodynamic diagram, wherein the joint point thermodynamic diagram is used for representing the position information of the joint point, the trunk thermodynamic diagram is used for representing the position information of the trunk, and the trunk represents the connecting line of the adjacent joint points.
5. The method of claim 1, wherein if the pose characteristic thermodynamic diagram includes a joint point thermodynamic diagram and a torso thermodynamic diagram, inputting the pose characteristic thermodynamic diagram into a classifier, determining pose scores of the human body image regions corresponding to respective preset pose categories, and obtaining a human body pose recognition result in the human body image regions based on the pose scores, specifically comprising:
inputting the thermodynamic diagrams of each joint point and each trunk thermodynamic diagram into a classifier, wherein the structure of the classifier at least comprises two-channel pooling, feature cascading, a first fully-connected network, a nonlinear activation function, a second fully-connected network and a normalization function, the number of input neurons of the first fully-connected network is an input dimension, the number of output neurons is a parameter value obtained by training, the number of input neurons of the second fully-connected network is a dimension output by an upper network, and the number of output neurons is a preset number of gesture categories;
and carrying out feature compression on each joint point thermodynamic diagram and each trunk thermodynamic diagram by adopting a preset two-way pooling mode in the classifier, splicing the characteristics compressed by the preset two-way pooling mode, splicing the spliced characteristics corresponding to each joint point thermodynamic diagram and the spliced characteristics corresponding to each trunk thermodynamic diagram, carrying out cascading splicing through feature cascading, carrying out feature extraction on the characteristics after cascading splicing through a first full-connection network, a nonlinear activation function and a second full-connection network in sequence, obtaining gesture scores of the human body image region corresponding to each preset gesture category, normalizing the gesture scores to a preset value range through a normalization function, and outputting the gesture categories identified in the human body image region according to the normalized gesture scores.
6. A gesture recognition apparatus, comprising:
the detection module is used for determining a human body image area in the image to be identified;
the estimation module is used for estimating the human body posture in the human body image area to obtain a posture characteristic thermodynamic diagram corresponding to the human body image area;
the recognition module is used for inputting the gesture characteristic thermodynamic diagram into a classifier and determining gesture scores of the human body image areas corresponding to preset gesture categories respectively; based on the gesture score, a human gesture recognition result in the human image area is obtained;
wherein if the pose characteristic thermodynamic diagram comprises a joint point thermodynamic diagram or a torso thermodynamic diagram; the identification module is specifically configured to:
inputting the thermodynamic diagrams of all the joints or the thermodynamic diagrams of all the trunk into a classifier, wherein the structure of the classifier at least comprises a two-channel pooling, a first fully-connected network, a nonlinear activation function, a second fully-connected network and a normalization function, the number of input neurons of the first fully-connected network is an input dimension, the number of output neurons is a parameter value obtained by training, the number of input neurons of the second fully-connected network is a dimension output by an upper network, and the number of output neurons is a preset number of gesture categories;
And respectively carrying out feature compression on each joint point thermodynamic diagram or each trunk thermodynamic diagram by adopting two preset pooling modes through two channel pooling in the classifier, splicing the features compressed by the two preset pooling modes, sequentially carrying out feature extraction on the spliced features through a first full-connection network, a nonlinear activation function and a second full-connection network to obtain gesture scores of the human body image areas corresponding to each preset gesture category, normalizing the gesture scores to a preset value range through a normalization function, and outputting the gesture categories identified in the human body image areas according to the normalized gesture scores.
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the method of any one of claims 1-5 when executing the program.
8. A computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a processor, implements the steps of the method of any one of claims 1-5.
CN201910906271.6A 2019-09-24 2019-09-24 Gesture recognition method and device, electronic equipment and storage medium Active CN110728209B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910906271.6A CN110728209B (en) 2019-09-24 2019-09-24 Gesture recognition method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910906271.6A CN110728209B (en) 2019-09-24 2019-09-24 Gesture recognition method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN110728209A CN110728209A (en) 2020-01-24
CN110728209B true CN110728209B (en) 2023-08-08

Family

ID=69219354

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910906271.6A Active CN110728209B (en) 2019-09-24 2019-09-24 Gesture recognition method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN110728209B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325144A (en) * 2020-02-19 2020-06-23 上海眼控科技股份有限公司 Behavior detection method and apparatus, computer device and computer-readable storage medium
CN111353302A (en) * 2020-03-03 2020-06-30 平安医疗健康管理股份有限公司 Medical word sense recognition method and device, computer equipment and storage medium
CN111652047B (en) * 2020-04-17 2023-02-28 福建天泉教育科技有限公司 Human body gesture recognition method based on color image and depth image and storage medium
CN111513745B (en) * 2020-04-21 2021-10-26 南通大学 Intelligent non-contact CT body position recognition device used in high-risk environment
CN111753643B (en) * 2020-05-09 2024-05-14 北京迈格威科技有限公司 Character gesture recognition method, character gesture recognition device, computer device and storage medium
WO2021229750A1 (en) * 2020-05-14 2021-11-18 日本電気株式会社 Image selection device, image selection method, and program
CN111611944A (en) * 2020-05-22 2020-09-01 创新奇智(北京)科技有限公司 Identity recognition method and device, electronic equipment and storage medium
CN111860213A (en) * 2020-06-29 2020-10-30 广州幻境科技有限公司 Augmented reality system and control method thereof
CN111898566B (en) * 2020-08-04 2023-02-03 成都井之丽科技有限公司 Attitude estimation method, attitude estimation device, electronic equipment and storage medium
CN112036307A (en) * 2020-08-31 2020-12-04 北京市商汤科技开发有限公司 Image processing method and device, electronic equipment and storage medium
CN114627546A (en) * 2020-11-26 2022-06-14 广州源动智慧体育科技有限公司 Running posture recognition method and device and computer equipment
CN112651345B (en) * 2020-12-29 2023-11-10 深圳市优必选科技股份有限公司 Human body posture recognition model optimization method and device and terminal equipment
CN113095129B (en) * 2021-03-01 2024-04-26 北京迈格威科技有限公司 Gesture estimation model training method, gesture estimation device and electronic equipment
CN113095228B (en) * 2021-04-13 2024-04-30 地平线(上海)人工智能技术有限公司 Method and device for detecting target in image and computer readable storage medium
JP2023020755A (en) * 2021-07-30 2023-02-09 富士通株式会社 Customer service detection program, customer service detection method and information processing apparatus
CN113378808B (en) * 2021-08-16 2021-11-23 北京赛搏体育科技股份有限公司 Person image recognition method and device, electronic equipment and computer readable medium
CN114565087B (en) * 2022-04-28 2022-07-22 苏州浪潮智能科技有限公司 Method, device and equipment for reasoning intention of people and storage medium
CN115527269B (en) * 2022-10-10 2023-05-16 动自由(北京)科技有限公司 Intelligent human body posture image recognition method and system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019023921A1 (en) * 2017-08-01 2019-02-07 华为技术有限公司 Gesture recognition method, apparatus, and device
CN109376663A (en) * 2018-10-29 2019-02-22 广东工业大学 A kind of human posture recognition method and relevant apparatus
CN109409297A (en) * 2018-10-30 2019-03-01 咪付(广西)网络技术有限公司 A kind of personal identification method based on binary channels convolutional neural networks
CN109543549A (en) * 2018-10-26 2019-03-29 北京陌上花科技有限公司 Image processing method and device, mobile end equipment, server for more people's Attitude estimations
CN109657631A (en) * 2018-12-25 2019-04-19 上海智臻智能网络科技股份有限公司 Human posture recognition method and device
CN110135375A (en) * 2019-05-20 2019-08-16 中国科学院宁波材料技术与工程研究所 More people's Attitude estimation methods based on global information integration
CN110163059A (en) * 2018-10-30 2019-08-23 腾讯科技(深圳)有限公司 More people's gesture recognition methods, device and electronic equipment


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Multi-Person Pose Estimation Based on Deep Learning; Fan Jiarou; China Master's Theses Full-text Database (Information Science and Technology); full text *

Also Published As

Publication number Publication date
CN110728209A (en) 2020-01-24

Similar Documents

Publication Publication Date Title
CN110728209B (en) Gesture recognition method and device, electronic equipment and storage medium
WO2021077984A1 (en) Object recognition method and apparatus, electronic device, and readable storage medium
CN111709409B (en) Face living body detection method, device, equipment and medium
CN109284733B (en) Shopping guide negative behavior monitoring method based on yolo and multitask convolutional neural network
WO2021139324A1 (en) Image recognition method and apparatus, computer-readable storage medium and electronic device
CN109902548B (en) Object attribute identification method and device, computing equipment and system
CN111274916B (en) Face recognition method and face recognition device
CN108288051B (en) Pedestrian re-recognition model training method and device, electronic equipment and storage medium
CN110659723B (en) Data processing method and device based on artificial intelligence, medium and electronic equipment
CN111754396B (en) Face image processing method, device, computer equipment and storage medium
CN112800903B (en) Dynamic expression recognition method and system based on space-time diagram convolutional neural network
CN110222718B (en) Image processing method and device
CN110163111A (en) Method, apparatus of calling out the numbers, electronic equipment and storage medium based on recognition of face
CN109740539B (en) 3D object identification method based on ultralimit learning machine and fusion convolution network
CN111291863B (en) Training method of face changing identification model, face changing identification method, device and equipment
CN112395979A (en) Image-based health state identification method, device, equipment and storage medium
CN111523421A (en) Multi-user behavior detection method and system based on deep learning and fusion of various interaction information
CN113254491A (en) Information recommendation method and device, computer equipment and storage medium
CN115223239A (en) Gesture recognition method and system, computer equipment and readable storage medium
CN117079339A (en) Animal iris recognition method, prediction model training method, electronic equipment and medium
CN117152844A (en) High-integrity worker construction attitude detection method and system based on computer vision
KR101961462B1 (en) Object recognition method and the device thereof
CN114038045A (en) Cross-modal face recognition model construction method and device and electronic equipment
CN114511877A (en) Behavior recognition method and device, storage medium and terminal
CN114118303B (en) Face key point detection method and device based on prior constraint

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40020142

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant