CN112101326A - Multi-person posture recognition method and device - Google Patents

Multi-person posture recognition method and device

Info

Publication number
CN112101326A
Authority
CN
China
Prior art keywords
feature map
image
fusion
recognized
key point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011291906.5A
Other languages
Chinese (zh)
Inventor
李宇欣
裘实
Current Assignee
Health Hope (beijing) Technology Co ltd
Original Assignee
Health Hope (beijing) Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Health Hope (beijing) Technology Co ltd filed Critical Health Hope (beijing) Technology Co ltd
Priority to CN202011291906.5A priority Critical patent/CN112101326A/en
Publication of CN112101326A publication Critical patent/CN112101326A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/53 Recognition of crowd images, e.g. recognition of crowd congestion

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a multi-person gesture recognition method and device. An image to be recognized is obtained, and multi-person gesture recognition is performed on it to obtain a first part affinity field and a first key point of the human body in the image. Feature fusion is performed on the first part affinity field and the first key point to obtain a first fusion feature map, and feature extraction is performed on the first fusion feature map to obtain a second part affinity field and a second key point. The first part affinity field and the second part affinity field are then fused to obtain a target part affinity field, and the first key point and the second key point are fused to obtain a target key point. Finally, the gesture recognition result of the image to be recognized is determined according to the target part affinity field and the target key point. Because the target part affinity field and the target key point are obtained by feature fusion, their features from different stages are correlated with one another, which can effectively improve the accuracy of gesture recognition.

Description

Multi-person posture recognition method and device
Technical Field
The invention relates to the technical field of computer vision, and in particular to a multi-person posture recognition method and device.
Background
At present, the multi-person gesture recognition technology comprises two schemes: a top-down scheme and a bottom-up scheme.
The top-down scheme first detects each person in the image to be recognized in the form of a bounding box, and then performs human key point detection on the person in each bounding box. The bottom-up scheme detects the human key points of all the people in the image to be recognized at once and simultaneously determines which person each key point belongs to.
It can be seen that, compared with the top-down scheme, the bottom-up scheme has higher processing efficiency but insufficient accuracy.
Disclosure of Invention
The technical problem to be solved by the invention is that the recognition accuracy of existing multi-person gesture recognition schemes is insufficient; in view of this defect in the prior art, a multi-person gesture recognition method and device are provided.
In order to solve the technical problem, the invention provides a multi-person gesture recognition method, which comprises the following steps:
acquiring an image to be identified;
carrying out multi-person gesture recognition on the image to be recognized to obtain a first part affinity field and a first key point of a human body in the image to be recognized;
performing feature fusion on the first part affinity field and the first key point to obtain a first fusion feature map;
performing feature extraction on the first fusion feature map to obtain a second part affinity field and a second key point;
performing feature fusion on the first part affinity field and the second part affinity field to obtain a target part affinity field;
performing feature fusion on the first key point and the second key point to obtain a target key point;
and determining the gesture recognition result of the image to be recognized according to the target part affinity field and the target key point.
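The staged fusion in the steps above can be sketched end to end as follows. This is a minimal illustration, not the patented implementation: every function is a hypothetical stand-in for a learned network module, feature maps are modeled as flat lists of floats, and fusion is approximated by elementwise averaging.

```python
# Minimal sketch of the staged-fusion pipeline described above.
# All function names are hypothetical; "fusion" is approximated
# by elementwise averaging of two equally sized feature maps.

def fuse(a, b):
    """Elementwise fusion (stand-in for the learned fusion layers)."""
    return [(x + y) / 2.0 for x, y in zip(a, b)]

def recognize_stage1(image):
    """Stand-in for the first recognition stage: returns (PAF, key points)."""
    paf1 = [float(p) for p in image]        # placeholder PAF features
    kp1 = [float(p) * 0.5 for p in image]   # placeholder key point features
    return paf1, kp1

def extract_stage2(fused):
    """Stand-in for feature extraction on the first fusion feature map."""
    paf2 = [f + 1.0 for f in fused]
    kp2 = [f - 1.0 for f in fused]
    return paf2, kp2

def multi_person_pose(image):
    paf1, kp1 = recognize_stage1(image)
    fused1 = fuse(paf1, kp1)            # first fusion feature map
    paf2, kp2 = extract_stage2(fused1)  # second-stage PAF and key points
    target_paf = fuse(paf1, paf2)       # target part affinity field
    target_kp = fuse(kp1, kp2)          # target key points
    return target_paf, target_kp
```

The point of the sketch is the data flow: the target outputs mix first-stage and second-stage features, so features from different stages are correlated rather than isolated.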
In a possible implementation manner, the performing multi-person gesture recognition on the image to be recognized to obtain a first part affinity field and a first key point of a human body in the image to be recognized includes:
carrying out multi-person posture recognition on the image to be recognized to obtain a high-dimensional feature map and a low-dimensional feature map of a human body in the image to be recognized;
and performing feature fusion on the high-dimensional feature map and the low-dimensional feature map to obtain a second fused feature map, wherein the second fused feature map comprises the first part affinity field and the first key point.
In a possible implementation manner, the performing multi-person gesture recognition on the image to be recognized to obtain a high-dimensional feature map and a low-dimensional feature map of a human body in the image to be recognized includes:
performing down-sampling processing on the image to be identified to obtain a first down-sampling feature map;
performing feature extraction on the first downsampling feature map to obtain a first extracted feature map;
performing downsampling processing on the first extracted feature map to obtain a second downsampled feature map;
performing feature extraction on the second downsampling feature map to obtain a second extracted feature map;
performing downsampling processing on the second extracted feature map to obtain a third downsampled feature map;
performing feature extraction on the third down-sampling feature map to obtain a third extracted feature map;
the first downsampled feature map, the second downsampled feature map and the third downsampled feature map are the low-dimensional feature map, and the third extracted feature map is the high-dimensional feature map.
In a possible implementation manner, the down-sampling processing of the image to be recognized to obtain a first downsampling feature map includes:
performing down-sampling processing on the image to be recognized by using four 3 × 3 convolution kernels to obtain a fourth downsampling feature map;
sequentially performing down-sampling processing and dimensionality reduction processing on the image to be recognized by using a 3 × 3 convolution kernel and a 1 × 1 convolution kernel to obtain a fifth downsampling feature map;
performing feature fusion on the fourth downsampling feature map and the fifth downsampling feature map to obtain the first downsampling feature map;
the downsampling processing of the first extracted feature map to obtain a second downsampled feature map includes:
performing down-sampling processing on the first extracted feature map by using four 3 × 3 convolution kernels to obtain a sixth downsampling feature map;
sequentially performing down-sampling processing and dimensionality reduction processing on the first extracted feature map by using a 3 × 3 convolution kernel and a 1 × 1 convolution kernel to obtain a seventh downsampling feature map;
performing feature fusion on the sixth downsampling feature map and the seventh downsampling feature map to obtain a second downsampling feature map;
the downsampling the second extracted feature map to obtain a third downsampled feature map includes:
performing down-sampling processing on the second extracted feature map by using four 3 × 3 convolution kernels to obtain an eighth downsampling feature map;
sequentially performing down-sampling processing and dimensionality reduction processing on the second extracted feature map by using a 3 × 3 convolution kernel and a 1 × 1 convolution kernel to obtain a ninth downsampling feature map;
and performing feature fusion on the eighth downsampling feature map and the ninth downsampling feature map to obtain the third downsampling feature map.
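Each down-sampling step above runs two branches whose outputs are fused, so both branches must produce feature maps of the same spatial size. The sketch below checks that consistency with the standard convolution output-size formula, under assumptions the patent does not state: one stride-2 convolution per branch and padding of 1 for the 3 × 3 kernels.

```python
# Sketch of the two-branch downsampling block: the four-3x3-conv branch
# and the 3x3-plus-1x1 branch must agree on output spatial size before
# their feature maps can be fused. Strides and padding are assumptions.

def conv_out(size, kernel, stride, pad):
    """Standard convolution output-size formula."""
    return (size + 2 * pad - kernel) // stride + 1

def branch_a(size):
    # four 3x3 convolutions; assume the first has stride 2, the rest stride 1
    size = conv_out(size, 3, 2, 1)
    for _ in range(3):
        size = conv_out(size, 3, 1, 1)
    return size

def branch_b(size):
    # one stride-2 3x3 convolution followed by a 1x1 (dimension-reduction) conv
    size = conv_out(size, 3, 2, 1)
    return conv_out(size, 1, 1, 0)
```

Under these assumptions both branches halve the input (e.g. 224 → 112), so the fourth/fifth, sixth/seventh and eighth/ninth map pairs line up for fusion.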
In a possible implementation manner, the performing feature fusion on the high-dimensional feature map and the low-dimensional feature map to obtain a second fused feature map includes:
utilizing three 5 × 5 convolution kernels to perform down-sampling processing on the high-dimensional feature map and the low-dimensional feature map to obtain a downsampling fusion feature map;
performing dimensionality reduction on the high-dimensional feature map and the low-dimensional feature map by using a 1 × 1 convolution kernel to obtain a dimensionality-reduction fusion feature map;
and performing feature fusion on the downsampling fusion feature map and the dimension reduction fusion feature map to obtain the second fusion feature map.
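The 1 × 1 convolution used for dimensionality reduction above is just a per-pixel linear combination across channels: it changes the channel count while leaving spatial resolution untouched. A minimal sketch with illustrative weights:

```python
# 1x1 convolution as a per-pixel channel mix. A feature map is modeled
# as [C][H*W] (one flat pixel list per channel); weights are [C_out][C_in].

def conv1x1(fmap, weights):
    """Mix channels at every pixel; spatial size is unchanged."""
    return [
        [sum(w * fmap[c][i] for c, w in enumerate(row))
         for i in range(len(fmap[0]))]
        for row in weights
    ]
```

For example, reducing a 2-channel map to 1 channel with weights `[[1, 1]]` simply sums the channels at each pixel.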
In a possible implementation manner, the determining a gesture recognition result of the image to be recognized according to the target portion affinity field and the target key point includes:
and inputting the target part affinity field and the target key points into a classifier, determining the posture scores of the human body in the image to be recognized corresponding to each preset posture category, and outputting the posture recognition result of the image to be recognized.
In a possible implementation manner, the inputting the target portion affinity field and the target keypoints into a classifier, determining pose scores of the human body in the image to be recognized corresponding to each preset pose category, and outputting a pose recognition result of the image to be recognized includes:
inputting the target part affinity field and the target key point into a classifier, wherein the structure of the classifier at least comprises a two-channel pooling function, a first fully connected network, a nonlinear activation function, a second fully connected network and a normalization function; the number of input neurons of the first fully connected network is the input dimension, and its number of output neurons is a parameter value obtained by training; the number of input neurons of the second fully connected network is the output dimension of the preceding network, and its number of output neurons is the preset number of posture categories;
through the two-channel pooling in the classifier, performing dimensionality reduction on the target part affinity field and the target key point with two preset pooling modes, and splicing the reduced features of the two pooling modes; performing feature extraction on the spliced features sequentially through the first fully connected network, the nonlinear activation function and the second fully connected network to obtain the posture score of the human body in the image to be recognized for each preset posture category; normalizing the posture scores to a preset value range through the normalization function; and outputting the posture recognition result of the image to be recognized according to the normalized posture scores.
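The classifier structure described above can be sketched in a few lines. Everything concrete here is an assumption rather than the patent's specification: the two pooling modes are taken to be global max- and average-pooling, the activation is ReLU, the normalization function is softmax, and all layer sizes and weights are illustrative.

```python
import math

# Minimal sketch of the classifier: dual pooling + splice, two fully
# connected layers with a nonlinear activation between them, and
# softmax normalization of the per-category pose scores.

def two_channel_pool(features):
    """Reduce each feature channel with both pooling modes, then splice
    (concatenate) the two pooled vectors."""
    maxes = [max(ch) for ch in features]
    means = [sum(ch) / len(ch) for ch in features]
    return maxes + means

def fully_connected(x, weights, bias):
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def softmax(scores):
    """Normalize pose scores into a probability distribution."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, w1, b1, w2, b2):
    x = two_channel_pool(features)          # dual-pool + splice
    x = relu(fully_connected(x, w1, b1))    # first FC + activation
    scores = fully_connected(x, w2, b2)     # one score per pose category
    return softmax(scores)                  # normalized pose scores
```

With toy identity-like weights, `classify` returns one normalized score per preset posture category, which is then thresholded or argmax-ed into the recognition result.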
The invention also provides a multi-person gesture recognition device, which comprises:
the acquisition module is used for acquiring an image to be identified;
the gesture recognition module is used for carrying out multi-person gesture recognition on the image to be recognized to obtain a first part affinity field and a first key point of a human body in the image to be recognized;
the first feature fusion module is used for performing feature fusion on the first part affinity field and the first key point to obtain a first fusion feature map;
the feature extraction module is used for performing feature extraction on the first fusion feature map to obtain a second part affinity field and a second key point;
the second feature fusion module is used for performing feature fusion on the first part affinity field and the second part affinity field to obtain a target part affinity field;
the third feature fusion module is used for performing feature fusion on the first key point and the second key point to obtain a target key point;
and the recognition result determining module is used for determining the gesture recognition result of the image to be recognized according to the target part affinity field and the target key point.
The invention also provides a multi-person gesture recognition device, which comprises: at least one memory and at least one processor;
the at least one memory to store a machine readable program;
the at least one processor is configured to invoke the machine readable program to perform the method as described above.
The invention also provides a computer readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to perform the method as described above.
The multi-person gesture recognition method and device of the present invention have the following beneficial effects:
the method comprises the steps of carrying out multi-person gesture recognition on an image to be recognized by obtaining the image to be recognized to obtain a first part affinity field and a first key point of a human body in the image to be recognized; then, performing feature fusion on the first part of affinity fields and the first key points to obtain a first fusion feature map; then, extracting the features of the first fusion feature map to obtain a second part of affinity field and a second key point; then, performing feature fusion on the first part of affinity field and the second part of affinity field to obtain a target part of affinity field; performing feature fusion on the first key point and the second key point to obtain a target key point; and finally, determining the gesture recognition result of the image to be recognized according to the target part affinity field and the target key point. Therefore, the target part affinity field and the target key points obtained in the feature fusion mode enable the features of the target part affinity field and the target key points in different stages to be correlated and not isolated, and therefore the accuracy of gesture recognition can be effectively improved.
Drawings
FIG. 1 is a flow chart of a multi-person gesture recognition method provided by an embodiment of the invention;
FIG. 2 is a diagram of a multi-person gesture recognition model provided by one embodiment of the present invention;
FIG. 3 is a block diagram of a downsampling module in a multi-person gesture recognition model according to an embodiment of the present invention;
FIG. 4 is a diagram of the CPM module in a multi-person gesture recognition model provided by one embodiment of the present invention;
FIG. 5 is a schematic diagram of an apparatus for multi-person gesture recognition according to an embodiment of the present invention;
fig. 6 is a schematic diagram of a multi-person gesture recognition apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without any inventive step based on the embodiments of the present invention, are within the scope of the present invention.
Before further detailed description of the embodiments of the present invention, terms and expressions mentioned in the embodiments of the present invention are explained, and the terms and expressions mentioned in the embodiments of the present invention are applied to the following explanations.
1) An artificial neural network is a mathematical model simulating the structure and function of a biological neural network. It comprises an input layer, intermediate layers and an output layer, each layer formed by a large number of interconnected processing units; each node processes its input data with an excitation function and passes the result to other nodes. Exemplary types of excitation function include the threshold type, the linear type and the S-shaped growth curve (sigmoid) type.
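As a concrete illustration of the sigmoid excitation function mentioned above: it squashes any real-valued input into the interval (0, 1), which is why it is described as an S-shaped growth curve.

```python
import math

# The sigmoid excitation function: maps any real input into (0, 1),
# with sigmoid(0) = 0.5 at the center of the S-curve.

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))
```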
2) Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory and other disciplines. It studies how a computer can simulate or realize human learning behaviour in order to acquire new knowledge or skills, and how to reorganize existing knowledge structures to improve performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent; it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning and inductive learning.
3) Deep Learning (DL) is a new research direction in the field of machine learning. Deep learning learns the internal rules and representation levels of sample data, with the ultimate aim of enabling a machine to analyse and learn like a human and to recognize data such as text, images and sounds. Deep learning is a complex machine learning algorithm.
4) A Convolutional Neural Network (CNN) is a kind of feed-forward neural network that includes convolution calculations and has a deep structure; it is one of the representative algorithms of deep learning.
5) A Part Affinity Field (PAF) is a set (of variable number) of vector (flow) field representations that encode the unstructured pairwise correspondence between human body parts; the PAF provides a confidence measure, for each pair of body parts, that they belong to the same human body.
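To make the confidence measure concrete, a PAF can score a candidate limb by sampling the field at points along the segment between two key points and dotting each sample with the segment's unit vector: a high average means the two points likely belong to the same person. This is a simplified sketch in the spirit of the bottom-up approach; the toy field and sampling scheme are our assumptions.

```python
import math

# Illustrative PAF association score for a candidate limb between
# two key points p1 and p2, given a vector field field(x, y) -> (vx, vy).

def paf_score(p1, p2, field, samples=10):
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    norm = math.hypot(dx, dy)
    if norm == 0:
        return 0.0
    ux, uy = dx / norm, dy / norm        # unit vector along the limb
    total = 0.0
    for i in range(samples):
        t = i / (samples - 1)
        x, y = p1[0] + t * dx, p1[1] + t * dy
        vx, vy = field(x, y)
        total += vx * ux + vy * uy       # alignment of field with limb
    return total / samples
```

A field that points along the limb yields a score near 1, while a field perpendicular to it yields a score near 0, so candidate connections can be ranked.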
6) A heat map (heatmap), also called a thermodynamic diagram, refers in the embodiments of the present invention specifically to an image in which the key points of the human posture are displayed as heat points.
With the research and progress of artificial intelligence technology, artificial intelligence has been researched and applied in many fields, for example smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, autonomous driving, robots, smart medical care and smart customer service. As the technology develops, artificial intelligence will be applied in more fields and play an increasingly important role; it can also be applied, for example, in the field of computer vision.
Here, it should be noted that artificial intelligence is a comprehensive technique of computer science, which attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
In addition, artificial intelligence technology is a comprehensive subject covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing and machine learning/deep learning.
In the process of implementing the invention, the inventor found that the schemes provided by the related art generally adopt either a top-down or a bottom-up method for human posture recognition. The top-down method first locates the approximate position of each human body and then specifically recognizes the posture of each one; for example, a bounding box for each person in the image to be recognized is detected, human posture key point detection is performed within each bounding box, and finally the posture of each person is obtained. The bottom-up method first detects the key points of all human postures in the image to be recognized and then clusters those key points onto the different human bodies; for example, each human posture key point is computed from a human posture heat map, the computed key points are connected using part affinity fields, and the posture of each person is finally obtained by solving a bipartite graph matching problem from graph theory.
However, the top-down method performs poorly when human bodies in the image to be recognized are close enough to interfere spatially. Since human posture recognition mostly involves scenes with multiple people, the bottom-up method is therefore often chosen. The bottom-up method first detects the key points of all human postures in the image and assigns them to each person, so obtaining accurate key points is crucial to the result of posture recognition. The quality of the key points in turn depends on the recognition capability and structure of the network model: such models generally distinguish easily confused limbs poorly, and because they process the image to be recognized by down-sampling and take the output of the last layer as the model output, pixel information is lost. The resulting human posture key points are therefore inaccurate, and posture recognition based on them has low accuracy.
Based on the above, the embodiment of the invention provides a human body posture recognition method and device, which can realize multi-person posture recognition on the basis of artificial intelligence and improve the accuracy of human body posture recognition.
As shown in fig. 1, a multi-person gesture recognition method provided by the embodiment of the present invention includes:
step 101, obtaining an image to be identified.
The image to be recognized is generated by shooting a plurality of persons, so that multi-person gesture recognition can subsequently be performed on it.
The image to be recognized may be an image acquired by the recognition end in real time (for example, when the recognition end is a smartphone equipped with a camera), or an image stored in advance by the recognition end (for example, when the recognition end is a server that obtains images by local reading or network transmission).
In other words, the multi-person gesture recognition apparatus deployed at the recognition end may obtain an image to be recognized collected in real time, so as to perform multi-person gesture recognition on it in real time; it may also obtain an image collected during a historical time period, so as to perform recognition when there are few processing tasks or under the instruction of an operator. This embodiment does not specifically limit this.
Further, if the camera assembly configured at the recognition end can serve as an independent device, such as a camera or a video recorder, it can be arranged around the environment where the plurality of people are located so as to shoot them from different angles, thereby obtaining images to be recognized that reflect the plurality of people from different angles, which helps ensure the accuracy of subsequent gesture recognition.
It should be noted that the shooting may be a single shooting or a continuous shooting, and accordingly, in the case of a single shooting, the obtained image to be recognized is a picture, and in the case of a continuous shooting, a video including a plurality of images to be recognized is obtained. Therefore, in each embodiment of the present invention, the image to be recognized for multi-person gesture recognition may be a single picture shot at a time, or may also be a certain image to be recognized in a section of video shot continuously, which is not specifically limited in the present invention.
102, carrying out multi-person gesture recognition on the image to be recognized to obtain a first part affinity field and a first key point of the human body in the image to be recognized.
When a recognition end (for example, a smartphone or a server) is used to perform multi-person gesture recognition, a multi-person gesture recognition model (as shown in fig. 2) is preset at the recognition end, and the image to be recognized is input into the model through the input module shown in fig. 2. Correspondingly, during training the multi-person gesture recognition model is trained on the human body posture key points preset by the recognition end. Specifically, for example, labelled human body key point images are shot, and more than 140 million pictures of various human body motions and postures, covering teenagers, middle-aged people and the elderly, are collected; image processing tools such as OpenCV are used to translate, flip, grey-scale and sharpen the pictures, improving the generalization capability of the multi-person gesture recognition model; and training is performed on eight GPU servers with 12 GB of video memory.
In addition, the human body posture key points include at least one key point, for example 14 key points: the head, neck, left shoulder, left elbow, left wrist, right shoulder, right elbow, right wrist, left hip, left knee, left ankle, right hip, right knee and right ankle.
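For reference, the 14 key points above can be written as a list; the ordering follows the text, and the snake_case names are our own labels rather than identifiers from the patent.

```python
# The 14 human-pose key points enumerated above, in the order given.
KEYPOINTS = [
    "head", "neck",
    "left_shoulder", "left_elbow", "left_wrist",
    "right_shoulder", "right_elbow", "right_wrist",
    "left_hip", "left_knee", "left_ankle",
    "right_hip", "right_knee", "right_ankle",
]
```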
In an embodiment of the present invention, step 102 may specifically include:
carrying out multi-person posture recognition on the image to be recognized to obtain a high-dimensional feature map and a low-dimensional feature map of a human body in the image to be recognized;
and performing feature fusion on the high-dimensional feature map and the low-dimensional feature map to obtain a second fused feature map, wherein the second fused feature map comprises the first part affinity field and the first key point.
In this embodiment, the high-dimensional features represent the overall features of the human body in the image to be recognized, while the low-dimensional features represent its local features and can capture image details. Fusing the high-dimensional and low-dimensional feature maps therefore effectively combines the global features of the human body with the local features of particular interest, which can greatly enhance the feature robustness of the second fused feature map.
In order to further improve the feature robustness of the second fused feature map, the high-dimensional feature map and the low-dimensional feature map can be obtained as follows:
performing down-sampling processing on an image to be identified to obtain a first down-sampling feature map;
performing feature extraction on the first downsampling feature map to obtain a first extracted feature map;
carrying out down-sampling processing on the first extracted feature map to obtain a second down-sampling feature map;
performing feature extraction on the second downsampling feature map to obtain a second extracted feature map;
performing down-sampling processing on the second extracted feature map to obtain a third down-sampling feature map;
performing feature extraction on the third down-sampling feature map to obtain a third extracted feature map;
the first downsampling feature map, the second downsampling feature map and the third downsampling feature map are all low-dimensional feature maps, and the third extracted feature map is a high-dimensional feature map.
In this embodiment, as shown in fig. 2, a first Block_SubSample module (i.e., a downsampling module) is used to downsample the image to be recognized to obtain a first downsampled feature map; a first DW_Block module (i.e., a convolution module) performs feature extraction on the first downsampled feature map to obtain a first extracted feature map; a second Block_SubSample module downsamples the first extracted feature map to obtain a second downsampled feature map; a second DW_Block module performs feature extraction on the second downsampled feature map to obtain a second extracted feature map; a third Block_SubSample module downsamples the second extracted feature map to obtain a third downsampled feature map; and a third DW_Block module performs feature extraction on the third downsampled feature map to obtain a third extracted feature map. With this arrangement, more features can be extracted through the three Block_SubSample modules and the three DW_Block modules, which improves the recognition accuracy of the multi-person posture recognition model.
It should be noted that, compared with two Block_SubSample modules and two DW_Block modules, three Block_SubSample modules and three DW_Block modules improve the recognition accuracy of the multi-person posture recognition model; compared with four Block_SubSample modules and four DW_Block modules, three of each reduce the running time of the model.
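The alternating downsample/extract pipeline above can be sketched as simple shape bookkeeping. The stride-2 assumption below is typical for downsampling modules but is not stated in the patent:

```python
def backbone_shapes(height, width, stages=3):
    """Track spatial sizes through the alternating Block_SubSample /
    DW_Block pipeline.  Each Block_SubSample halves height and width
    (stride-2 assumption); each DW_Block extracts features at the
    same size.  Returns the three low-dimensional feature-map sizes
    and the size of the final high-dimensional feature map."""
    low_dim_sizes = []
    for _ in range(stages):
        height, width = height // 2, width // 2   # Block_SubSample
        low_dim_sizes.append((height, width))
        # DW_Block: feature extraction, spatial size unchanged
    return low_dim_sizes, (height, width)
```

For a hypothetical 368 x 368 input, the three downsampled (low-dimensional) maps come out at 184 x 184, 92 x 92 and 46 x 46, and the high-dimensional map shares the 46 x 46 size of the last downsampled map.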
Further, each Block_SubSample module has the same structure; a schematic diagram of this structure is shown in fig. 3, and the feature fusion within it may be, for example, the Add fusion method.
Specifically, in an embodiment of the present invention, the downsampling an image to be identified to obtain a first downsampling feature map includes:
utilizing four convolution kernels of 3 x 3 to carry out down-sampling processing on the image to be identified to obtain a fourth down-sampling feature map;
sequentially performing down-sampling processing and dimensionality reduction processing on the image to be identified by using a convolution kernel of 3 x 3 and a convolution kernel of 1 x 1 to obtain a fifth down-sampling feature map;
performing feature fusion on the fourth down-sampling feature map and the fifth down-sampling feature map to obtain a first down-sampling feature map;
the method for performing downsampling processing on the first extracted feature map to obtain a second downsampled feature map comprises the following steps:
performing down-sampling processing on the first extracted feature map by using four convolution kernels of 3 x 3 to obtain a sixth down-sampling feature map;
sequentially performing down-sampling processing and dimensionality reduction processing on the first extracted feature map by using a convolution kernel of 3 x 3 and a convolution kernel of 1 x 1 to obtain a seventh down-sampling feature map;
performing feature fusion on the sixth downsampling feature map and the seventh downsampling feature map to obtain a second downsampling feature map;
and the downsampling processing of the second extracted feature map to obtain a third downsampled feature map comprises the following steps:
performing down-sampling processing on the second extracted feature map by using four convolution kernels of 3 x 3 to obtain an eighth down-sampling feature map;
sequentially performing down-sampling processing and dimensionality reduction processing on the second extracted feature map by using a convolution kernel of 3 x 3 and a convolution kernel of 1 x 1 to obtain a ninth down-sampling feature map;
and performing feature fusion on the eighth downsampling feature map and the ninth downsampling feature map to obtain a third downsampling feature map.
In the embodiment of the invention, a Block_SubSample module with this structure can improve the degree of fit on non-linear problems. Specifically, the 3 × 3 convolution kernels used in the Block_SubSample module increase feature extraction, the five 3 × 3 convolution kernels enlarge the receptive field, and the accuracy of the multi-person posture recognition model is maintained while the downsampling is performed. In addition, DW (Depthwise Separable) convolution kernels are adopted, which reduces the parameter count of the multi-person posture recognition model and increases its running speed.
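The parameter saving from DW convolution kernels can be checked with a short calculation. This is a sketch of the standard depthwise + pointwise decomposition; the channel counts are made-up example values, not taken from the patent:

```python
def standard_conv_params(c_in, c_out, k):
    """Weights in a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def dw_separable_params(c_in, c_out, k):
    """Weights in a depthwise separable convolution: one k x k filter
    per input channel, followed by a 1 x 1 pointwise projection."""
    return k * k * c_in + c_in * c_out

# Example: a 3 x 3 layer taking 64 channels to 128 channels.
standard = standard_conv_params(64, 128, 3)    # 73728 weights
separable = dw_separable_params(64, 128, 3)    # 8768 weights
```

For this example the separable form uses roughly 8x fewer weights, which is the source of both the smaller model and the faster running speed.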
Note that, in this embodiment, the fusion of features is performed by Add.
As shown in fig. 4, in order to further increase the running speed of the multi-person posture recognition model, the structure of the CPM (Convolutional Pose Machines) may be optimized, where the feature fusion may be, for example, the Add fusion method.
In an embodiment of the present invention, performing feature fusion on the high-dimensional feature map and the low-dimensional feature map to obtain a second fused feature map, includes:
utilizing three convolution kernels of 5 x 5 to carry out down-sampling processing on the high-dimensional feature map and the low-dimensional feature map to obtain a down-sampling fusion feature map;
performing dimensionality reduction on the high-dimensional feature map and the low-dimensional feature map by using a 1 x 1 convolution kernel to obtain a dimensionality reduction fusion feature map;
and performing feature fusion on the down-sampling fusion feature map and the dimension reduction fusion feature map to obtain a second fusion feature map.
In the embodiment of the invention, the CPM with three 5 × 5 DW convolution kernels improves both speed and accuracy and enlarges the receptive field of the model, and adopting DW convolution kernels reduces the parameter count of the model. In testing, using three 5 × 5 DW convolution kernels reduced computation time by 23% compared with using six 3 × 3 DW convolution kernels.
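One reason three 5 x 5 kernels can replace six 3 x 3 kernels without shrinking the receptive field is that stacked stride-1 convolutions grow the receptive field by (k - 1) per layer. A quick check using the standard receptive-field formula (this derivation is not in the patent):

```python
def stacked_receptive_field(kernel, layers):
    """Receptive field of `layers` stacked kernel x kernel convolutions
    with stride 1: it grows by (kernel - 1) per layer."""
    return 1 + layers * (kernel - 1)

# Three 5 x 5 layers and six 3 x 3 layers both see a 13 x 13 region,
# so the swap preserves the receptive field with half as many layers.
```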
Note that, in this embodiment, the feature fusion is performed by Add fusion.
And 103, performing feature fusion on the first part of affinity fields and the first key points to obtain a first fusion feature map.
As shown in fig. 2, a first feature fusion module (e.g., concat feature fusion) may be used to perform feature fusion on the first partial affinity field and the first key point.
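The two fusion methods used in this document, element-wise Add (inside Block_SubSample and the CPM) and channel-wise concat (here), differ in their shape requirements. A minimal pure-Python sketch on (channels, rows, cols) nested lists, for illustration only (real implementations operate on tensors):

```python
def add_fuse(a, b):
    """Add fusion: element-wise sum.  Both feature maps must share the
    same (channels, height, width) shape, which stays unchanged."""
    return [[[x + y for x, y in zip(row_a, row_b)]
             for row_a, row_b in zip(ch_a, ch_b)]
            for ch_a, ch_b in zip(a, b)]

def concat_fuse(a, b):
    """Concat fusion: stack along the channel axis.  Only the spatial
    sizes must match, and the channel count grows."""
    return a + b
```

Add keeps the channel count fixed, while concat preserves both inputs verbatim at the cost of more channels for the following layer to process.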
And 104, performing feature extraction on the first fusion feature map to obtain a second partial affinity field and a second key point.
As shown in fig. 2, the first fused feature map is subjected to feature extraction by one stage module (i.e., a transition module mainly used for feature extraction) to obtain the second partial affinity field, and by another stage module to obtain the second key point, so that the generated second partial affinity field and second key point better characterize the local features of the first fused feature map.
Step 105, performing feature fusion on the first part of affinity field and the second part of affinity field to obtain a target part of affinity field.
As shown in fig. 2, the first partial affinity field is the PAF output by the CPM and the second partial affinity field is the PAF output by the stage module; feature fusion of the two yields the target partial affinity field, which therefore contains both global and local features, effectively improving the accuracy of posture recognition.
And 106, performing feature fusion on the first key points and the second key points to obtain target key points.
As shown in fig. 2, the first key point is the Heatmap output by the CPM and the second key point is the Heatmap output by the stage module; feature fusion of the two yields the target key point, which therefore contains both global and local features, effectively improving the accuracy of posture recognition.
And step 107, determining the gesture recognition result of the image to be recognized according to the target part affinity field and the target key point.
In this step, the obtained target key points are connected through the target partial affinity field, and the human body posture recognition result of the image to be recognized is finally obtained by solving a bipartite graph matching problem from graph theory.
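The bipartite matching step can be approximated by a greedy assignment over PAF connection scores, as is common in PAF-based pipelines. This is a simplified stand-in for the graph-theoretic solver; the patent does not specify the exact algorithm:

```python
def greedy_match(scores):
    """Greedily connect candidates of one key-point type to candidates
    of the adjacent type.  scores[i][j] is the PAF line-integral score
    for joining candidate i to candidate j; a higher score means the
    limb more likely belongs to one person.  Each candidate is used
    at most once."""
    pairs = []
    used_i, used_j = set(), set()
    candidates = ((i, j) for i in range(len(scores))
                  for j in range(len(scores[i])))
    # Consider candidate pairs from best score downwards.
    for i, j in sorted(candidates, key=lambda ij: -scores[ij[0]][ij[1]]):
        if i not in used_i and j not in used_j:
            pairs.append((i, j))
            used_i.add(i)
            used_j.add(j)
    return pairs

# Two people: candidate 0 of each type belongs together, as does 1.
# greedy_match([[0.9, 0.1], [0.2, 0.8]]) -> [(0, 0), (1, 1)]
```

Repeating this matching limb by limb and chaining the resulting pairs assembles each person's skeleton from the detected key points.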
In one embodiment of the present invention, determining a gesture recognition result of an image to be recognized according to a target portion affinity field and a target key point includes:
and inputting the target part affinity field and the target key points into a classifier, determining the posture score of the human body in the image to be recognized corresponding to each preset posture category, and outputting the posture recognition result of the image to be recognized.
In the embodiment of the invention, because the target part affinity field and the target key point obtained by the multi-person posture recognition model contain more abundant characteristics, namely all characteristics and local characteristics, the obtained target part affinity field and the target key point are input into the classifier to output the posture category, so that the accuracy of posture recognition and classification can be improved.
It is understood that the preset gesture categories may include, for example, the following: squatting, raising legs, kneeling, offering, crawling, lying prone, lying down, bending down, standing and sitting, which are not limited herein.
In order to reduce the loss of feature dimension reduction and remove redundant features, in an embodiment of the present invention, the method includes inputting a target portion affinity field and target key points into a classifier, determining pose scores of a human body in an image to be recognized corresponding to each preset pose category, and outputting a pose recognition result of the image to be recognized, including:
inputting the target partial affinity field and the target key points into a classifier, wherein the structure of the classifier at least comprises a dual-channel pooling layer, a first fully-connected network, a non-linear activation function, a second fully-connected network and a normalization function, the number of input neurons of the first fully-connected network is the input dimension and its number of output neurons is a parameter value obtained by training, and the number of input neurons of the second fully-connected network is the dimension output by the upper network and its number of output neurons is the number of preset posture categories;
through the dual-channel pooling in the classifier, performing dimensionality reduction on the target partial affinity field and the target key points with two preset pooling modes and splicing the features reduced by the two pooling modes; performing feature extraction on the spliced features sequentially through the first fully-connected network, the non-linear activation function and the second fully-connected network to obtain the posture scores of the human body in the image to be recognized corresponding to each preset posture category; normalizing the posture scores to a preset value range through the normalization function; and outputting the posture recognition result of the image to be recognized according to the normalized posture scores.
In the embodiment of the present invention, dual-channel pooling is used to remove redundant features while reducing the loss caused by feature dimension reduction. The basic pooling method called by the dual-channel pooling mode may be mean (Mean) pooling, maximum (MAX) pooling, or mean-maximum (Mean-MAX) pooling, where Mean-MAX pooling calls MAX pooling and Mean pooling separately to implement the two channels.
The number of input neurons of the first fully-connected network is determined by the input dimension, and its number of output neurons is a parameter value obtained by training: it is a hyper-parameter whose value must be selected and optimized during the training stage. The number of input neurons of the second fully-connected network is the dimension of the upper-layer network output, and its number of output neurons is the number of preset posture categories.
The non-linear activation function introduces a non-linear relationship between the network layers; for example, a rectified linear unit (ReLU) may be adopted, which is not limited in the embodiment of the present invention.
The normalization function normalizes the posture scores to a preset value range, so that the score belonging to each posture category can be assessed more intuitively; for example, a SoftMax method can be adopted to normalize the posture scores to between 0 and 1.
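The pooling and normalization steps of the classifier head described above can be sketched as follows. This is a pure-Python illustration only: the fully-connected layers are omitted, and the Mean-MAX pairing is just one of the options the text allows:

```python
import math

def mean_max_pool(feature_channels):
    """Dual-channel pooling (Mean-MAX variant): call mean pooling and
    max pooling separately on each channel, then splice the two
    reduced vectors together, so less information is lost than with
    a single pooling pass."""
    means = [sum(ch) / len(ch) for ch in feature_channels]
    maxes = [max(ch) for ch in feature_channels]
    return means + maxes          # spliced feature vector

def softmax(pose_scores):
    """Normalization function: squash raw posture scores into (0, 1)
    so that the scores over all posture categories sum to 1."""
    peak = max(pose_scores)                     # for numeric stability
    exps = [math.exp(s - peak) for s in pose_scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The spliced vector from `mean_max_pool` would feed the first fully-connected network, and `softmax` would normalize the scores the second fully-connected network produces for each posture category.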
In summary, the multi-person posture recognition method provided by the embodiment of the invention acquires the image to be recognized and performs multi-person posture recognition on it to obtain the first partial affinity field and the first key point of the human body in the image; performs feature fusion on the first partial affinity field and the first key point to obtain a first fused feature map; performs feature extraction on the first fused feature map to obtain a second partial affinity field and a second key point; fuses the first and second partial affinity fields to obtain the target partial affinity field; fuses the first and second key points to obtain the target key point; and finally determines the posture recognition result of the image to be recognized according to the target partial affinity field and the target key point. The target partial affinity field and target key points obtained through feature fusion correlate the features of the different stages rather than leaving them isolated, which effectively improves the accuracy of posture recognition.
As shown in figs. 5 and 6, an embodiment of the present invention provides a multi-person gesture recognition apparatus and the device on which it is located. The apparatus embodiments may be implemented by software, by hardware, or by a combination of the two. From the hardware level, fig. 5 is a hardware structure diagram of the device on which the multi-person gesture recognition apparatus is located; in addition to the processor, memory, network interface and non-volatile storage shown in fig. 5, the device may also include other hardware, such as a forwarding chip responsible for processing packets. Taking a software implementation as an example, as shown in fig. 6, the apparatus is a logical apparatus formed by the CPU of the device reading the corresponding computer program instructions from non-volatile storage into memory and executing them.
As shown in fig. 6, the multi-person gesture recognition apparatus provided in this embodiment includes:
an obtaining module 601, configured to obtain an image to be identified;
a gesture recognition module 602, configured to perform multi-person gesture recognition on the image to be recognized, so as to obtain a first part affinity field and a first key point of a human body in the image to be recognized;
a first feature fusion module 603, configured to perform feature fusion on the first partial affinity field and the first keypoint to obtain a first fusion feature map;
a feature extraction module 604, configured to perform feature extraction on the first fused feature map to obtain a second partial affinity field and a second key point;
a second feature fusion module 605, configured to perform feature fusion on the first partial affinity field and the second partial affinity field to obtain a target partial affinity field;
a third feature fusion module 606, configured to perform feature fusion on the first key point and the second key point to obtain a target key point;
and the recognition result determining module 607 is configured to determine a gesture recognition result of the image to be recognized according to the target portion affinity field and the target key point.
In this embodiment of the present invention, the obtaining module 601 may be configured to perform step 101 in the foregoing method embodiment, and the gesture recognizing module 602 may be configured to perform step 102 in the foregoing method embodiment; the first feature fusion module 603 may be configured to perform step 103 in the above method embodiment; the feature extraction module 604 may be configured to perform step 104 in the above-described method embodiments; the second feature fusion module 605 may be configured to perform step 105 in the above method embodiment; the third feature fusion module 606 may be configured to perform step 106 in the above method embodiment; the recognition result determination module 607 may be configured to perform step 107 in the above-described method embodiment.
In an embodiment of the present invention, the gesture recognition module 602 is configured to perform the following operations:
carrying out multi-person posture recognition on the image to be recognized to obtain a high-dimensional feature map and a low-dimensional feature map of a human body in the image to be recognized;
and performing feature fusion on the high-dimensional feature map and the low-dimensional feature map to obtain a second fused feature map, wherein the second fused feature map comprises the first part of affinity fields and the first key points.
In an embodiment of the present invention, when performing the multi-person gesture recognition on the image to be recognized to obtain a high-dimensional feature map and a low-dimensional feature map of a human body in the image to be recognized, the gesture recognition module 602 is configured to perform the following operations:
performing down-sampling processing on the image to be identified to obtain a first down-sampling feature map;
performing feature extraction on the first downsampling feature map to obtain a first extracted feature map;
performing downsampling processing on the first extracted feature map to obtain a second downsampled feature map;
performing feature extraction on the second downsampling feature map to obtain a second extracted feature map;
performing downsampling processing on the second extracted feature map to obtain a third downsampled feature map;
performing feature extraction on the third down-sampling feature map to obtain a third extracted feature map;
the first downsampled feature map, the second downsampled feature map and the third downsampled feature map are the low-dimensional feature map, and the third extracted feature map is the high-dimensional feature map.
In an embodiment of the present invention, when the gesture recognition module 602 performs the downsampling process on the image to be recognized to obtain the first downsampling feature map, it is configured to:
utilizing four convolution kernels of 3 x 3 to carry out down-sampling processing on the image to be identified so as to obtain a fourth down-sampling feature map;
sequentially performing down-sampling processing and dimensionality reduction processing on the image to be identified by using a convolution kernel of 3 x 3 and a convolution kernel of 1 x 1 to obtain a fifth down-sampling feature map;
performing feature fusion on the fourth downsampling feature map and the fifth downsampling feature map to obtain the first downsampling feature map;
the downsampling processing of the first extracted feature map to obtain a second downsampled feature map includes:
performing down-sampling processing on the first extracted feature map by using four convolution kernels of 3 x 3 to obtain a sixth down-sampling feature map;
sequentially performing down-sampling processing and dimensionality reduction processing on the first extracted feature map by using a convolution kernel of 3 x 3 and a convolution kernel of 1 x 1 to obtain a seventh down-sampling feature map;
performing feature fusion on the sixth downsampling feature map and the seventh downsampling feature map to obtain a second downsampling feature map;
the downsampling the second extracted feature map to obtain a third downsampled feature map includes:
performing down-sampling processing on the second extracted feature map by using four convolution kernels of 3 x 3 to obtain an eighth down-sampling feature map;
sequentially performing down-sampling processing and dimensionality reduction processing on the second extracted feature map by using a convolution kernel of 3 x 3 and a convolution kernel of 1 x 1 to obtain a ninth down-sampling feature map;
and performing feature fusion on the eighth downsampling feature map and the ninth downsampling feature map to obtain the third downsampling feature map.
In an embodiment of the present invention, when performing the feature fusion on the high-dimensional feature map and the low-dimensional feature map to obtain a second fused feature map, the gesture recognition module 602 is configured to perform the following operations:
utilizing three convolution kernels of 5 x 5 to carry out down-sampling processing on the high-dimensional feature map and the low-dimensional feature map so as to obtain a down-sampling fusion feature map;
performing dimensionality reduction on the high-dimensional feature map and the low-dimensional feature map by using a 1-by-1 convolution kernel to obtain a dimensionality reduction fusion feature map;
and performing feature fusion on the downsampling fusion feature map and the dimension reduction fusion feature map to obtain the second fusion feature map.
In an embodiment of the present invention, the recognition result determining module 607 is configured to perform the following operations:
and inputting the target part affinity field and the target key points into a classifier, determining the posture scores of the human body in the image to be recognized corresponding to each preset posture category, and outputting the posture recognition result of the image to be recognized.
In an embodiment of the present invention, the recognition result determining module 607 is configured to perform the following operations when performing the inputting of the target portion affinity field and the target keypoints into the classifier, determining the pose scores of the human body in the image to be recognized corresponding to the preset pose categories, and outputting the pose recognition result of the image to be recognized:
inputting the target partial affinity field and the target key point into a classifier, wherein the structure of the classifier at least comprises a dual-channel pooling layer, a first fully-connected network, a non-linear activation function, a second fully-connected network and a normalization function, the number of input neurons of the first fully-connected network is the input dimension and its number of output neurons is a parameter value obtained by training, and the number of input neurons of the second fully-connected network is the dimension output by the upper network and its number of output neurons is the number of preset posture categories;
through the dual-channel pooling in the classifier, performing dimensionality reduction on the target partial affinity field and the target key point with two preset pooling modes and splicing the features reduced by the two pooling modes; performing feature extraction on the spliced features sequentially through the first fully-connected network, the non-linear activation function and the second fully-connected network to obtain the posture scores of the human body in the image to be recognized corresponding to each preset posture category; normalizing the posture scores to a preset value range through the normalization function; and outputting the posture recognition result of the image to be recognized according to the normalized posture scores.
It is to be understood that the illustrated structure of the embodiment of the present invention does not constitute a specific limitation to the multi-person gesture recognition apparatus. In other embodiments of the invention, the multi-person gesture recognition apparatus may include more or fewer components than shown, or combine certain components, or split certain components, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Because the content of information interaction, execution process, and the like among the modules in the device is based on the same concept as the method embodiment of the present invention, specific content can be referred to the description in the method embodiment of the present invention, and is not described herein again.
The embodiment of the invention also provides a multi-person gesture recognition device, which comprises: at least one memory and at least one processor;
the at least one memory to store a machine readable program;
the at least one processor is configured to invoke the machine readable program to perform the multi-person gesture recognition method in any embodiment of the invention.
Embodiments of the present invention also provide a computer-readable medium storing instructions for causing a computer to perform the multi-person gesture recognition method described herein. Specifically, a system or apparatus may be provided that is equipped with a storage medium on which software program code realizing the functions of any of the above-described embodiments is stored, and the computer (or CPU or MPU) of the system or apparatus reads out and executes the program code stored in the storage medium.
In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.
Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD + RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.
Further, it should be clear that the functions of any of the above-described embodiments can be implemented not only by executing the program code read out by the computer, but also by having an operating system or the like running on the computer perform part or all of the actual operations based on instructions of the program code.
Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A multi-person gesture recognition method, comprising:
acquiring an image to be identified;
carrying out multi-person gesture recognition on the image to be recognized to obtain a first part affinity field and a first key point of a human body in the image to be recognized;
performing feature fusion on the first part of affinity fields and the first key points to obtain a first fusion feature map;
performing feature extraction on the first fusion feature map to obtain a second partial affinity field and a second key point;
performing feature fusion on the first part of affinity fields and the second part of affinity fields to obtain target part affinity fields;
performing feature fusion on the first key point and the second key point to obtain a target key point;
and determining the gesture recognition result of the image to be recognized according to the target part affinity field and the target key point.
2. The method according to claim 1, wherein the performing multi-person gesture recognition on the image to be recognized to obtain a first part affinity field and a first key point of a human body in the image to be recognized comprises:
carrying out multi-person posture recognition on the image to be recognized to obtain a high-dimensional feature map and a low-dimensional feature map of a human body in the image to be recognized;
and performing feature fusion on the high-dimensional feature map and the low-dimensional feature map to obtain a second fused feature map, wherein the second fused feature map comprises the first part of affinity fields and the first key points.
3. The method according to claim 2, wherein the performing multi-person posture recognition on the image to be recognized to obtain a high-dimensional feature map and a low-dimensional feature map of a human body in the image to be recognized comprises:
performing downsampling processing on the image to be recognized to obtain a first downsampled feature map;
performing feature extraction on the first downsampled feature map to obtain a first extracted feature map;
performing downsampling processing on the first extracted feature map to obtain a second downsampled feature map;
performing feature extraction on the second downsampled feature map to obtain a second extracted feature map;
performing downsampling processing on the second extracted feature map to obtain a third downsampled feature map;
and performing feature extraction on the third downsampled feature map to obtain a third extracted feature map;
wherein the first downsampled feature map, the second downsampled feature map and the third downsampled feature map are the low-dimensional feature map, and the third extracted feature map is the high-dimensional feature map.
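The three downsample-then-extract rounds of claim 3 halve the spatial resolution each time (assuming stride-2 downsampling, which the claim does not state explicitly). A shape-only NumPy sketch, with 2×2 average pooling standing in for the convolutional downsampling blocks and a ReLU standing in for feature extraction:

```python
import numpy as np

def downsample(x):
    # 2x average pooling as a stand-in for the claim's convolutional
    # downsampling blocks; halves height and width
    c, h, w = x.shape
    return x.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))

def extract(x):
    # placeholder for a feature-extraction block; keeps the shape unchanged
    return np.maximum(x, 0.0)

image = np.random.rand(3, 368, 368)  # hypothetical input resolution
d1 = downsample(image)   # first downsampled feature map  (3, 184, 184)
e1 = extract(d1)         # first extracted feature map
d2 = downsample(e1)      # second downsampled feature map (3, 92, 92)
e2 = extract(d2)         # second extracted feature map
d3 = downsample(e2)      # third downsampled feature map  (3, 46, 46)
e3 = extract(d3)         # third extracted map = high-dimensional feature map
low_dim = (d1, d2, d3)   # the three downsampled maps = low-dimensional maps
```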
4. The method according to claim 3, wherein the performing downsampling processing on the image to be recognized to obtain a first downsampled feature map comprises:
performing downsampling processing on the image to be recognized by using four 3×3 convolution kernels to obtain a fourth downsampled feature map;
sequentially performing downsampling processing and dimension reduction processing on the image to be recognized by using a 3×3 convolution kernel and a 1×1 convolution kernel to obtain a fifth downsampled feature map;
and performing feature fusion on the fourth downsampled feature map and the fifth downsampled feature map to obtain the first downsampled feature map;
the performing downsampling processing on the first extracted feature map to obtain a second downsampled feature map comprises:
performing downsampling processing on the first extracted feature map by using four 3×3 convolution kernels to obtain a sixth downsampled feature map;
sequentially performing downsampling processing and dimension reduction processing on the first extracted feature map by using a 3×3 convolution kernel and a 1×1 convolution kernel to obtain a seventh downsampled feature map;
and performing feature fusion on the sixth downsampled feature map and the seventh downsampled feature map to obtain the second downsampled feature map;
the performing downsampling processing on the second extracted feature map to obtain a third downsampled feature map comprises:
performing downsampling processing on the second extracted feature map by using four 3×3 convolution kernels to obtain an eighth downsampled feature map;
sequentially performing downsampling processing and dimension reduction processing on the second extracted feature map by using a 3×3 convolution kernel and a 1×1 convolution kernel to obtain a ninth downsampled feature map;
and performing feature fusion on the eighth downsampled feature map and the ninth downsampled feature map to obtain the third downsampled feature map.
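Each downsampling step in claim 4 is therefore a two-branch block: a stack of 3×3 convolutions on one path, and a 3×3 convolution followed by a 1×1 dimension-reduction convolution on the other, with the two outputs fused at the end. A single-channel NumPy sketch of one such block (two 3×3 convolutions instead of four for brevity, and addition as the fusion operator — both are assumptions not fixed by the claim):

```python
import numpy as np

rng = np.random.default_rng(0)

def conv2d(x, k, stride=1, pad=0):
    """Naive single-channel 2-D convolution (cross-correlation)."""
    if pad:
        x = np.pad(x, pad)
    kh, kw = k.shape
    h = (x.shape[0] - kh) // stride + 1
    w = (x.shape[1] - kw) // stride + 1
    out = np.empty((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = (x[i*stride:i*stride+kh, j*stride:j*stride+kw] * k).sum()
    return out

x = rng.random((32, 32))                             # hypothetical input map
# Branch A: stacked 3x3 convolutions, the first strided for downsampling.
a = conv2d(x, rng.random((3, 3)), stride=2, pad=1)   # (16, 16)
a = conv2d(a, rng.random((3, 3)), stride=1, pad=1)   # (16, 16)
# Branch B: one strided 3x3 convolution, then a 1x1 "dimension reduction"
# convolution (trivial in this single-channel sketch).
b = conv2d(x, rng.random((3, 3)), stride=2, pad=1)   # (16, 16)
b = conv2d(b, rng.random((1, 1)))                    # (16, 16)
fused = a + b   # feature fusion of the two branches
```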
5. The method according to claim 2, wherein the performing feature fusion on the high-dimensional feature map and the low-dimensional feature map to obtain a second fusion feature map comprises:
performing downsampling processing on the high-dimensional feature map and the low-dimensional feature map by using three 5×5 convolution kernels to obtain a downsampled fusion feature map;
performing dimension reduction processing on the high-dimensional feature map and the low-dimensional feature map by using a 1×1 convolution kernel to obtain a dimension-reduced fusion feature map;
and performing feature fusion on the downsampled fusion feature map and the dimension-reduced fusion feature map to obtain the second fusion feature map.
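Claim 5 fuses the high- and low-dimensional maps along two branches: a downsampling branch (three 5×5 convolutions in the claim) and a 1×1 dimension-reduction branch. The sketch below is a loose NumPy interpretation: average pooling aligns the low-dimensional maps to one resolution in place of the 5×5 convolutions, and the 1×1 convolution is written as a per-pixel channel mixing; all channel counts and sizes are hypothetical.

```python
import numpy as np

def pool(x, factor):
    # average pooling by an integer factor; a stand-in for strided convolution
    c, h, w = x.shape
    return x.reshape(c, h // factor, factor, w // factor, factor).mean(axis=(2, 4))

high = np.random.rand(8, 46, 46)           # hypothetical high-dimensional map
lows = [np.random.rand(3, 184, 184),
        np.random.rand(3, 92, 92),
        np.random.rand(3, 46, 46)]         # hypothetical low-dimensional maps

# Downsampling branch: bring every low-dimensional map to the high-dim
# resolution, then stack all maps channel-wise.
aligned = [pool(m, m.shape[1] // 46) for m in lows]
down_branch = np.concatenate([high] + aligned)   # (17, 46, 46)

# Dimension-reduction branch: a 1x1 convolution is a per-pixel linear map
# over channels, written here as an einsum with hypothetical weights.
w = np.random.rand(17, down_branch.shape[0])
reduce_branch = np.einsum('oc,chw->ohw', w, down_branch)

fused = down_branch + reduce_branch        # second fusion feature map
```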
6. The method according to any one of claims 1-5, wherein the determining the posture recognition result of the image to be recognized according to the target part affinity field and the target key point comprises:
inputting the target part affinity field and the target key point into a classifier, determining a posture score of the human body in the image to be recognized for each preset posture category, and outputting the posture recognition result of the image to be recognized.
7. The method according to claim 6, wherein the inputting the target part affinity field and the target key point into a classifier, determining a posture score of the human body in the image to be recognized for each preset posture category, and outputting the posture recognition result of the image to be recognized comprises:
inputting the target part affinity field and the target key point into a classifier, wherein the structure of the classifier comprises at least a dual-channel pooling layer, a first fully connected network, a nonlinear activation function, a second fully connected network and a normalization function; the number of input neurons of the first fully connected network is the input dimension, and the number of its output neurons is a parameter value obtained by training; the number of input neurons of the second fully connected network is the output dimension of the preceding network, and the number of its output neurons is the preset number of posture categories;
performing dimension reduction processing on the target part affinity field and the target key point through the dual-channel pooling layer of the classifier by using two preset pooling modes, and splicing the features reduced in dimension by the two pooling modes; performing feature extraction on the spliced features sequentially through the first fully connected network, the nonlinear activation function and the second fully connected network to obtain a posture score of the human body in the image to be recognized for each preset posture category; and normalizing the posture scores to a preset value range through the normalization function, and outputting the posture recognition result of the image to be recognized according to the normalized posture scores.
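The classifier of claim 7 can be read as: dual-channel pooling (e.g. average and max, one plausible reading of "two preset pooling modes"), concatenation of the pooled features, FC → nonlinearity → FC, then softmax normalization. A NumPy sketch with untrained random weights and a hypothetical hidden width of 64:

```python
import numpy as np

rng = np.random.default_rng(0)

def classify(features, n_classes=5):
    """features: (channels, h, w) fused PAF and keypoint maps."""
    avg = features.mean(axis=(1, 2))   # pooling mode 1: channel-wise average
    mx = features.max(axis=(1, 2))     # pooling mode 2: channel-wise max
    x = np.concatenate([avg, mx])      # dual-channel pooling, spliced
    w1 = rng.standard_normal((64, x.size)) * 0.01  # first FC (hypothetical width)
    h = np.maximum(w1 @ x, 0.0)        # nonlinear activation (ReLU)
    w2 = rng.standard_normal((n_classes, 64)) * 0.01
    scores = w2 @ h                    # second FC: one score per posture category
    e = np.exp(scores - scores.max())
    return e / e.sum()                 # normalization function (softmax)

probs = classify(rng.random((57, 46, 46)))  # e.g. 38 PAF + 19 keypoint channels
```

The index of the largest normalized score would then be reported as the recognized posture category.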
8. A multi-person posture recognition apparatus, comprising:
an acquisition module, configured to acquire an image to be recognized;
a posture recognition module, configured to perform multi-person posture recognition on the image to be recognized to obtain a first part affinity field and a first key point of a human body in the image to be recognized;
a first feature fusion module, configured to perform feature fusion on the first part affinity field and the first key point to obtain a first fusion feature map;
a feature extraction module, configured to perform feature extraction on the first fusion feature map to obtain a second part affinity field and a second key point;
a second feature fusion module, configured to perform feature fusion on the first part affinity field and the second part affinity field to obtain a target part affinity field;
a third feature fusion module, configured to perform feature fusion on the first key point and the second key point to obtain a target key point;
and a recognition result determining module, configured to determine the posture recognition result of the image to be recognized according to the target part affinity field and the target key point.
9. A multi-person posture recognition apparatus, comprising: at least one memory and at least one processor;
the at least one memory is configured to store a machine-readable program;
and the at least one processor is configured to invoke the machine-readable program to perform the method of any one of claims 1-7.
10. A computer-readable medium having stored thereon computer instructions which, when executed by a processor, cause the processor to perform the method of any one of claims 1-7.
CN202011291906.5A 2020-11-18 2020-11-18 Multi-person posture recognition method and device Pending CN112101326A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011291906.5A CN112101326A (en) 2020-11-18 2020-11-18 Multi-person posture recognition method and device


Publications (1)

Publication Number Publication Date
CN112101326A true CN112101326A (en) 2020-12-18

Family

ID=73785277





Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109003301A (en) * 2018-07-06 2018-12-14 东南大学 A kind of estimation method of human posture and rehabilitation training system based on OpenPose and Kinect
CN109657631A (en) * 2018-12-25 2019-04-19 上海智臻智能网络科技股份有限公司 Human posture recognition method and device
WO2020171919A1 (en) * 2019-02-24 2020-08-27 Microsoft Technology Licensing, Llc Neural network for skeletons from input images
CN110084138A (en) * 2019-04-04 2019-08-02 高新兴科技集团股份有限公司 A kind of more people's Attitude estimation methods of 2D
CN111199207A (en) * 2019-12-31 2020-05-26 华南农业大学 Two-dimensional multi-human body posture estimation method based on depth residual error neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHE CAO et al.: "Realtime Multi-person 2D Pose Estimation Using Part Affinity Fields", 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR) *

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112651345A (en) * 2020-12-29 2021-04-13 深圳市优必选科技股份有限公司 Human body posture recognition model optimization method and device and terminal equipment
CN112651345B (en) * 2020-12-29 2023-11-10 深圳市优必选科技股份有限公司 Human body posture recognition model optimization method and device and terminal equipment
CN112862837A (en) * 2021-01-27 2021-05-28 南京信息工程大学 Image processing method and system based on convolutional neural network
CN112862837B (en) * 2021-01-27 2023-06-23 南京信息工程大学 Image processing method and system based on convolutional neural network
CN113326778A (en) * 2021-05-31 2021-08-31 中科计算技术西部研究院 Human body posture detection method and device based on image recognition and storage medium
CN113326778B (en) * 2021-05-31 2022-07-12 中科计算技术西部研究院 Human body posture detection method and device based on image recognition and storage medium
CN114093024A (en) * 2021-09-24 2022-02-25 张哲为 Human body action recognition method, device, equipment and storage medium
CN113892939A (en) * 2021-09-26 2022-01-07 燕山大学 Method for monitoring respiratory frequency of human body in resting state based on multi-feature fusion
CN114627560A (en) * 2022-05-13 2022-06-14 浙江大华技术股份有限公司 Motion recognition method, motion recognition model training method and related device

Similar Documents

Publication Publication Date Title
CN112101326A (en) Multi-person posture recognition method and device
CN110414432B (en) Training method of object recognition model, object recognition method and corresponding device
CN112784764B (en) Expression recognition method and system based on local and global attention mechanism
CN109558832B (en) Human body posture detection method, device, equipment and storage medium
CN110569795B (en) Image identification method and device and related equipment
WO2020103700A1 (en) Image recognition method based on micro facial expressions, apparatus and related device
CN110728209A (en) Gesture recognition method and device, electronic equipment and storage medium
KR20180057096A (en) Device and method to perform recognizing and training face expression
Kadam et al. Detection and localization of multiple image splicing using MobileNet V1
CN110222718B (en) Image processing method and device
CN111666919B (en) Object identification method and device, computer equipment and storage medium
CN111274994B (en) Cartoon face detection method and device, electronic equipment and computer readable medium
Yang et al. Facial expression recognition based on dual-feature fusion and improved random forest classifier
CN111339343A (en) Image retrieval method, device, storage medium and equipment
CN112131985A (en) Real-time light human body posture estimation method based on OpenPose improvement
CN110705566B (en) Multi-mode fusion significance detection method based on spatial pyramid pool
CN114937083B (en) Laser SLAM system and method applied to dynamic environment
CN113011253B (en) Facial expression recognition method, device, equipment and storage medium based on ResNeXt network
Nguyen et al. Meta transfer learning for facial emotion recognition
CN111694959A (en) Network public opinion multi-mode emotion recognition method and system based on facial expressions and text information
CN112906520A (en) Gesture coding-based action recognition method and device
CN115223020A (en) Image processing method, image processing device, electronic equipment and readable storage medium
CN111797705A (en) Action recognition method based on character relation modeling
CN112101314B (en) Human body posture recognition method and device based on mobile terminal
CN114626461A (en) Cross-domain target detection method based on domain self-adaptation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201218