CN116453226A - Human body posture recognition method and device based on artificial intelligence and related equipment - Google Patents

Human body posture recognition method and device based on artificial intelligence and related equipment

Info

Publication number
CN116453226A
Authority
CN
China
Prior art keywords
detection result
detection
gesture recognition
dimensional
human body
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310567461.6A
Other languages
Chinese (zh)
Inventor
李茜萌
陆进
陈远旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202310567461.6A
Publication of CN116453226A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Databases & Information Systems (AREA)
  • Human Computer Interaction (AREA)
  • Social Psychology (AREA)
  • Psychiatry (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the fields of artificial intelligence and financial technology, and provides a human body posture recognition method and device based on artificial intelligence and related equipment. The method comprises the following steps: updating each feature extraction network in a first gesture recognition model to obtain a second gesture recognition model; performing feature extraction processing on a picture to be recognized using the second gesture recognition model; inputting the feature map of each scale into a first target detection network and a second target detection network respectively; preprocessing the output first detection result and second detection result; and fusing the resulting third detection result and fourth detection result to obtain a human body posture recognition result. By updating each feature extraction network in the first gesture recognition model so that three-dimensional key points can be recognized, the invention improves the accuracy of human body posture recognition.

Description

Human body posture recognition method and device based on artificial intelligence and related equipment
Technical Field
The invention relates to the fields of artificial intelligence and financial technology, and in particular to a human body posture recognition method and device based on artificial intelligence and related equipment.
Background
With the times, it has become increasingly common to capture pictures or videos with cameras for analysis of human behavior: for example, a home camera may detect whether an elderly person has fallen, a fitness app may check whether an exercise is performed with correct form, or surveillance cameras may monitor whether an operator in a bank violates procedures.
However, in these scenarios, the prior art recognizes the pose of the target person in the camera images with two-dimensional person detection or pose estimation methods. Such methods are strongly affected by factors such as the position and orientation of the camera, and cannot accurately recognize the human pose of the target person in the picture.
Therefore, there is a need for a method that can quickly and accurately recognize the posture of a human body.
Disclosure of Invention
In view of the above, it is necessary to provide a human body posture recognition method and device based on artificial intelligence and related equipment that update each feature extraction network in a first gesture recognition model so that three-dimensional key points can be recognized, thereby improving the accuracy of human body posture recognition.
A first aspect of the present invention provides a human body posture recognition method based on artificial intelligence, the method comprising:
Acquiring a pre-trained first gesture recognition model, and updating each feature extraction network in the first gesture recognition model to obtain a second gesture recognition model;
responding to a received human body gesture recognition request, and acquiring a picture to be recognized;
performing feature extraction processing on the picture to be identified by using the second gesture recognition model to obtain feature images with multiple scales;
inputting the feature images of each scale into a first target detection network to obtain a first detection result of the feature images of the corresponding scale, and inputting the feature images of each scale into a second target detection network to obtain a second detection result of the feature images of the corresponding scale;
performing first preprocessing on each first detection result to obtain a third detection result, and performing second preprocessing on each second detection result to obtain a fourth detection result;
and carrying out fusion processing on the third detection result and the fourth detection result to obtain a human body posture recognition result.
Optionally, the updating each feature extraction network in the first gesture recognition model to obtain a second gesture recognition model includes:
adding a preset number of feature extraction channels in each feature extraction network of the first gesture recognition model to obtain a corresponding target feature extraction network;
and replacing the corresponding feature extraction network in the first gesture recognition model with the target feature extraction network to obtain a second gesture recognition model.
Optionally, the performing feature extraction processing on the to-be-identified picture by using the second gesture recognition model to obtain a feature map with multiple scales includes:
using the convolution layers in the target feature extraction networks of the second gesture recognition model to downsample the picture to be recognized, obtaining a feature map corresponding to the scale of each convolution layer.
Optionally, inputting the feature map of each scale into the first target detection network, and obtaining the first detection result of the feature map of the corresponding scale includes:
detecting the feature map of each scale by using the first target detection network;
and outputting a first detection frame, a two-dimensional first key point and a three-dimensional first key point of the target person in the feature map of each scale to obtain a first detection result.
Optionally, performing the first preprocessing on each first detection result to obtain a third detection result includes:
acquiring a plurality of first detection frames and confidence degrees of each first detection frame from each first detection result;
sorting the plurality of first detection frames by confidence to obtain a first detection frame list;
selecting a first detection frame with highest confidence from the first detection frame list, and adding the first detection frame to a preset first output list;
calculating a first overlapping degree between the first detection frame with the highest confidence and each remaining first detection frame in the first detection frame list;
and reserving each first detection frame with the first overlapping degree smaller than a preset overlapping degree threshold value, and adding the first detection frames to the preset first output list to obtain a third detection result.
Optionally, the fusing the third detection result and the fourth detection result to obtain a human body gesture recognition result includes:
traversing the two-dimensional second key points and the three-dimensional second key points in the fourth detection result;
calculating a first Euclidean distance between each two-dimensional second key point and the two-dimensional first key point at the corresponding position, and calculating a second Euclidean distance between each three-dimensional second key point and the three-dimensional first key point at the corresponding position;
respectively judging whether the calculated first Euclidean distance and second Euclidean distance meet preset replacement conditions;
If the calculated first Euclidean distance meets the preset replacement condition, replacing the coordinates of the two-dimensional first key point in the third detection result with the coordinates of the corresponding two-dimensional second key point in the fourth detection result, and/or if the calculated second Euclidean distance meets the preset replacement condition, replacing the coordinates of the three-dimensional first key point in the third detection result with the coordinates of the corresponding three-dimensional second key point in the fourth detection result, and obtaining a fifth detection result;
and performing third preprocessing on the fifth detection result and the fourth detection result to obtain a human body posture recognition result.
Optionally, performing the third preprocessing on the fifth detection result and the fourth detection result to obtain a human body posture recognition result includes:
merging and de-duplicating the fifth detection result and the fourth detection result to obtain a human body posture recognition result.
A second aspect of the present invention provides an artificial intelligence based human body posture recognition apparatus, the apparatus comprising:
the first acquisition module is used for acquiring a pre-trained first gesture recognition model and updating each feature extraction network in the first gesture recognition model to obtain a second gesture recognition model;
The second acquisition module is used for responding to the received human body gesture recognition request and acquiring a picture to be recognized;
the feature extraction module is used for carrying out feature extraction processing on the picture to be identified by utilizing the second gesture recognition model to obtain feature images with multiple scales;
the input module is used for inputting the feature images of each scale into a first target detection network to obtain a first detection result of the feature images of the corresponding scale, and inputting the feature images of each scale into a second target detection network to obtain a second detection result of the feature images of the corresponding scale;
the preprocessing module is used for performing first preprocessing on each first detection result to obtain a third detection result, and performing second preprocessing on each second detection result to obtain a fourth detection result;
and the fusion processing module is used for carrying out fusion processing on the third detection result and the fourth detection result to obtain a human body gesture recognition result.
A third aspect of the present invention provides an electronic device comprising a processor and a memory, the processor being adapted to implement the artificial intelligence based human gesture recognition method when executing a computer program stored in the memory.
A fourth aspect of the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the artificial intelligence based human body posture recognition method.
In summary, the human body posture recognition method, device and related equipment based on artificial intelligence can promote the construction of smart cities and can be applied to fields such as smart buildings, smart security, smart communities, smart life, the Internet of Things and financial technology. A pre-trained first gesture recognition model is acquired and each feature extraction network in it is updated to obtain a second gesture recognition model; by adding a preset number of feature extraction channels to each feature extraction network of the first gesture recognition model, the coordinate values of the three-dimensional key points of the target person, that is, the pose information of those three-dimensional key points, can be predicted, which solves the problem of inaccurate posture recognition caused by the lack of depth information in two-dimensional poses. Feature extraction is then performed on the picture to be recognized with the second gesture recognition model to obtain feature maps of multiple scales; the feature map of each scale is input into a first target detection network to obtain a first detection result of the feature map of the corresponding scale, and into a second target detection network to obtain a second detection result of the feature map of the corresponding scale; first preprocessing of each first detection result yields a third detection result, and second preprocessing of each second detection result yields a fourth detection result, eliminating redundant detection frames. This improves the accuracy of the obtained first detection frames and second detection frames and thus the accuracy of posture recognition.
Drawings
Fig. 1 is a flowchart of a human body gesture recognition method based on artificial intelligence according to an embodiment of the present invention.
Fig. 2 is a schematic diagram of an output detection result of a target detection network according to an embodiment of the present invention.
Fig. 3 is a block diagram of an artificial intelligence-based human body posture recognition device according to a second embodiment of the present invention.
Fig. 4 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will be more clearly understood, a more particular description of the invention will be rendered by reference to specific embodiments thereof which are illustrated in the appended drawings. It should be noted that, without conflict, the embodiments of the present invention and features in the embodiments may be combined with each other.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein in the description of the invention is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.
Example 1
Fig. 1 is a flowchart of a human body gesture recognition method based on artificial intelligence according to an embodiment of the present invention.
In this embodiment, the human body posture recognition method based on artificial intelligence may be applied to an electronic device. For an electronic device that needs to perform artificial intelligence based human body posture recognition, the recognition function provided by the method of the present invention may be directly integrated on the electronic device, or may run on the electronic device in the form of a software development kit (SDK).
The embodiment of the invention can acquire and process the related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, and mechatronics. Artificial intelligence software technologies mainly include computer vision, robotics, biometric recognition, speech processing, natural language processing, and machine learning/deep learning.
As shown in fig. 1, the human body posture recognition method based on artificial intelligence specifically includes the following steps; the order of the steps in the flowchart may be changed according to different requirements, and some steps may be omitted.
In the prior art, it has become increasingly common to capture pictures or videos with cameras for human posture and behavior analysis, for example using a home camera to detect whether an elderly person has fallen, a fitness app to check whether an exercise is performed with correct form, or surveillance to monitor whether an operator in a bank violates procedures. However, in these scenarios, accurately recognizing the pose of the target person beyond what a two-dimensional target detection frame or pose estimation method provides requires recognizing the depth information of the target person's key points and predicting the pose of the target person in three-dimensional space from it, so that the pose of the target person in the picture is predicted more accurately.
101, acquiring a pre-trained first gesture recognition model, and updating each feature extraction network in the first gesture recognition model to obtain a second gesture recognition model.
In this embodiment, a first gesture recognition model may be trained in advance. The first gesture recognition model may be an existing KAPAO (Keypoints And Poses As Objects) model: an acquired single picture is input into the KAPAO model, and the feature extraction network in the KAPAO model performs feature extraction on the single picture, outputting feature maps of different scales corresponding to the single picture. The feature extraction network is configured to perform feature extraction on the single picture and output feature maps of different scales corresponding to it.
In this embodiment, the first gesture recognition model includes a first prediction branch and a second prediction branch. The first prediction branch perceives the overall posture of the target person in each picture well, for example the positions of key points such as the head, left hand, right hand, left foot and right foot, but its prediction of local positions such as the eyes and mouth of the target person is not accurate enough. The second prediction branch focuses on local information of the target person in each picture, for example with stronger position perception of specific details such as the eyes and mouth.
In this embodiment, after the feature maps of different scales of each picture are obtained, the feature maps of each scale of each single picture are input into the first prediction branch and the second prediction branch of the first gesture recognition model. The first prediction branch predicts whether a detection frame of a target person exists in each picture and the two-dimensional key point positions of the target person; the second prediction branch treats the two-dimensional key points predicted by the first prediction branch as a target detection task and predicts the position of a rectangular frame centered on each two-dimensional key point. Finally, the pose prediction result of the target person is output according to the detection frame and two-dimensional key point positions predicted by the first prediction branch and the positions of the rectangular frames centered on the two-dimensional key points.
In this embodiment, since the existing KAPAO model outputs only two-dimensional pose information of the target person, it cannot accurately recognize the three-dimensional pose information of the target person in the picture. To improve the accuracy of pose prediction for the target person in the picture, the depth information of the target person's key points needs to be detected and the pose of the target person in three-dimensional space estimated, so that the human body pose of the target person can be predicted more accurately. Therefore, this embodiment builds a three-dimensional pose estimation network on the basis of the existing KAPAO model to accurately predict the human body posture of the target person in three-dimensional space.
In an optional embodiment, the updating each feature extraction network in the first gesture recognition model to obtain the second gesture recognition model includes:
adding a preset number of feature extraction channels in each feature extraction network of the first gesture recognition model to obtain a corresponding target feature extraction network;
and replacing the corresponding feature extraction network in the first gesture recognition model with the target feature extraction network to obtain a second gesture recognition model.
In this embodiment, a preset number of feature extraction channels are added to each feature extraction network of the first gesture recognition model, for example increasing the feature extraction channels of each feature extraction network from the original 6 + 3K to 6 + 6K, to obtain the target feature extraction network. The added 3K channels output the x, y, z values of the K three-dimensional key points of the target person, where K is the number of three-dimensional key points of the target person.
Specifically, adding a preset number of feature extraction channels to each feature extraction network of the first gesture recognition model to obtain the corresponding target feature extraction network includes: modifying the parameters of a convolution layer of each feature extraction network in the first gesture recognition model to increase the number of feature extraction channels, where the added feature extraction channels are used to predict the coordinate values of the three-dimensional key points of the target person in the feature map.
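The following PyTorch sketch shows one way such a channel expansion could look, assuming the output head of each feature extraction network is a single 1x1 convolution; the helper name expand_head, the channel sizes and the keypoint count are illustrative assumptions, not details from the patent.

```python
# Hedged sketch: widen a detection head from 6 + 3K to 6 + 6K output
# channels so it can also regress (x, y, z) for K three-dimensional
# key points. The single-1x1-conv head is an assumption.
import torch
import torch.nn as nn

def expand_head(old_head: nn.Conv2d, num_keypoints: int) -> nn.Conv2d:
    """Return a 1x1 conv head with 3 * num_keypoints extra channels.

    The original 6 + 3K channels keep their pretrained weights; the
    3K new channels (x, y, z per 3D key point) start from scratch.
    """
    old_out = old_head.out_channels             # 6 + 3K
    new_out = old_out + 3 * num_keypoints       # 6 + 6K
    new_head = nn.Conv2d(old_head.in_channels, new_out, kernel_size=1)
    with torch.no_grad():
        new_head.weight[:old_out].copy_(old_head.weight)
        new_head.bias[:old_out].copy_(old_head.bias)
    return new_head

K = 17                                          # assumed keypoint count
head = nn.Conv2d(256, 6 + 3 * K, kernel_size=1)
head = expand_head(head, num_keypoints=K)
print(head.out_channels)                        # 108 == 6 + 6 * 17
```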
In this embodiment, since the second gesture recognition model is obtained by building a three-dimensional pose estimation network on the basis of the existing KAPAO model, during training the second gesture recognition model adopts a loss of the following form for predicting the three-dimensional key points of the target person:

$$L_{3D} = \frac{1}{n} \sum_{s \in \{8,16,32,64\}} \sum_{O_p \in G_s} \sum_{k=1}^{K} \delta(v_k > 0)\, \lVert \hat{u}_k - u_k \rVert^2$$

where K is the number of three-dimensional key points of the target person; v_k indicates whether the k-th three-dimensional key point is visible; O_p is a predicted target person; G_s is the feature map of each scale output by the target feature extraction network, with s ∈ {8, 16, 32, 64} denoting the scales of the feature maps; n is the number of positive samples corresponding to the feature map of each scale; δ is a preset function with δ = 1 if v_k > 0 and δ = 0 otherwise; û_k is the predicted value of the coordinates of the three-dimensional key point; and u_k is the actual value of the coordinates of the three-dimensional key point.
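A minimal PyTorch sketch of this visibility-masked loss follows; the tensor shapes and the treatment of the first tensor dimension as the n positive samples are illustrative assumptions.

```python
# Hedged sketch: squared error between predicted and ground-truth 3D
# key point coordinates, counted only where v_k > 0 and averaged over
# the positive samples (here, the first tensor dimension).
import torch

def keypoint_3d_loss(pred: torch.Tensor,       # (n, K, 3) predicted x, y, z
                     target: torch.Tensor,     # (n, K, 3) actual x, y, z
                     visibility: torch.Tensor  # (n, K) visibility flags v_k
                     ) -> torch.Tensor:
    mask = (visibility > 0).unsqueeze(-1).float()   # delta(v_k > 0)
    n = max(pred.shape[0], 1)                       # positive sample count
    return ((pred - target) ** 2 * mask).sum() / n

# Samples without 3D labels (e.g. from the COCO 2D data) can pass an
# all-zero visibility mask, so their 3D key point loss term vanishes.
loss = keypoint_3d_loss(torch.randn(4, 17, 3), torch.randn(4, 17, 3),
                        torch.ones(4, 17))
```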
The training set for the second gesture recognition model is a multi-person three-dimensional pose estimation data set. Because the three-dimensional key point labels of target persons in such a data set must be collected with a professional motion capture system, the captured scenes and target persons are limited. To improve the generalization of the second gesture recognition model, the COCO two-dimensional pose estimation data set is added to the training set, and for samples without three-dimensional key point labels, the three-dimensional key point loss is ignored during training. Because the COCO two-dimensional pose estimation data set is added to the training set, generalization remains effective despite the ignored three-dimensional key point loss, and the accuracy of the key points recognized by the second gesture recognition model is ensured.
In this embodiment, a preset number of feature extraction channels are added to each feature extraction network of the first gesture recognition model to predict the coordinate values of the three-dimensional key points of the target person in each picture, that is, the pose of the target person's three-dimensional key points. This solves the problem of inaccurate posture recognition of the target person caused by the lack of depth information in the two-dimensional pose information recognized by the existing first gesture recognition model.
102, responding to the received human body gesture recognition request, and acquiring a picture to be recognized.
In this embodiment, the human body posture recognition request is used to request to recognize the human body posture of the target person in the picture.
In this embodiment, when the electronic device receives a human body gesture recognition request sent by a user terminal, a message of the human body gesture recognition request is parsed, and a picture to be recognized is obtained from the message, where the picture to be recognized may include one or more target characters.
In this embodiment, the picture to be recognized may be a video frame from an online face-to-face signing session between a borrower and a bank reviewer, a picture used for user identity verification, a picture captured during business handling, and so on, where the business may include applying for a loan or a credit card, or purchasing insurance or financial products.
103, performing feature extraction processing on the picture to be recognized by using the second gesture recognition model to obtain feature maps of multiple scales.
In this embodiment, the second gesture recognition model includes a plurality of scale target feature extraction networks, and each scale target feature extraction network outputs a feature map.
In an optional embodiment, the performing feature extraction processing on the to-be-identified picture by using the second gesture recognition model to obtain a feature map with multiple scales includes:
using the convolution layers in the target feature extraction networks of the second gesture recognition model to downsample the picture to be recognized, obtaining a feature map corresponding to the scale of each convolution layer.
In this embodiment, the second gesture recognition model can perform multi-scale feature prediction on the picture to be recognized to obtain feature maps of multiple scales. Specifically, the second gesture recognition model contains convolution layers of multiple scales, and the convolution layer of each scale performs a downsampling operation on the picture to be recognized to obtain the feature map of the corresponding scale.
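As an illustration of how stride-2 convolution layers can produce feature maps at the scales 8, 16, 32 and 64 mentioned above, consider the sketch below; the patent does not specify the backbone layers, so this layout is an assumption.

```python
# Hedged sketch: each stride-2 convolution halves the resolution,
# yielding feature maps at overall strides 8, 16, 32 and 64.
import torch
import torch.nn as nn

class MultiScaleBackbone(nn.Module):
    def __init__(self, channels: int = 64):
        super().__init__()
        # stem downsamples by 8 (three stride-2 convs) ...
        self.stem = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.SiLU(),
        )
        # ... then one extra stride-2 stage per additional scale
        self.stages = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(channels, channels, 3, stride=2, padding=1),
                nn.SiLU())
            for _ in range(3)
        ])

    def forward(self, x):
        maps = [self.stem(x)]              # stride 8
        for stage in self.stages:
            maps.append(stage(maps[-1]))   # strides 16, 32, 64
        return maps

for fm in MultiScaleBackbone()(torch.randn(1, 3, 512, 512)):
    print(fm.shape)    # spatial sizes 64, 32, 16, 8 for a 512x512 input
```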
104, inputting the feature map of each scale into a first target detection network to obtain a first detection result of the feature map of the corresponding scale, and inputting the feature map of each scale into a second target detection network to obtain a second detection result of the feature map of the corresponding scale.
In this embodiment, the second gesture recognition model includes two branches: the first branch is a first target detection network and the second branch is a second target detection network. After the feature maps of multiple scales of each picture are obtained, the feature map of each scale is input into the first target detection network and the second target detection network of the second gesture recognition model. The first target detection network detects the first detection frame, the two-dimensional first key points and the three-dimensional first key points of the target person in the feature map of each scale; the second target detection network detects the second detection frame, the two-dimensional second key points and the three-dimensional second key points of the target person in the feature map of each scale.
Referring to fig. 2, the detection frame 11 contains the detection frame information of the target person and represents the first detection frame or the second detection frame of the corresponding target person. The key point 12 contains two-dimensional and three-dimensional information of the target person: the two-dimensional information represents a two-dimensional first key point or a two-dimensional second key point, and the three-dimensional information represents a three-dimensional first key point or a three-dimensional second key point of the corresponding target person.
In an optional embodiment, the inputting the feature map of each scale into the first target detection network, and obtaining the first detection result of the feature map of the corresponding scale includes:
detecting the feature map of each scale by using the first target detection network;
and outputting a first detection frame, a two-dimensional first key point and a three-dimensional first key point of the target person in the feature map of each scale to obtain a first detection result.
In this embodiment, the first target detection network includes a multi-task convolutional neural network, which detects the target person, the detection frame of the target person and the position information of the key points of the target person in the feature map of each scale, and takes the detected first detection frame, two-dimensional first key points and three-dimensional first key points of the target person as the first detection result.
In an optional embodiment, the inputting the feature map of each scale into the second target detection network, and obtaining the second detection result of the feature map of the corresponding scale includes:
detecting the feature map of each scale by using the second target detection network;
and outputting a second detection frame, a two-dimensional second key point and a three-dimensional second key point of the target person in the feature map of each scale to obtain a second detection result.
In this embodiment, the process of obtaining the second detection result is the same as the process of obtaining the first detection result, which is not described in detail herein.
Specifically, referring to fig. 2, the first target detection network or the second target detection network predicts 6 + 6K values for the feature map of each scale: p_0 indicates whether a detection frame of the target person exists in the feature map of each scale, where p_0 = 0 means the probability that the target person is present in the feature map of any one scale is 0, and p_0 = 1 means that probability is 1; t_x, t_y, t_w, t_h indicate the position of the detection frame; c_1 ... c_{K+1} represent the preset categories of the target person, where c_1 indicates whether the target is a person and c_2 to c_{K+1} give the probability of each of the K key points belonging to a preset category (for example, hand key points, leg key points or head key points), with k denoting the index of a key point; v_{x1}, v_{y1} ... v_{xK}, v_{yK} are the x, y coordinates of the K two-dimensional key points; and u_{x1}, u_{y1}, u_{z1} ... u_{xK}, u_{yK}, u_{zK} are the x, y, z coordinates of the K three-dimensional key points relative to a preset position. For example, the preset position may be the hip bone.
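The slicing below illustrates how one such 6 + 6K prediction vector could be decoded into its fields (1 objectness value + 4 box values + (K + 1) class scores + 2K two-dimensional coordinates + 3K three-dimensional coordinates = 6 + 6K); the function and field names are illustrative assumptions.

```python
# Hedged sketch: split one 6 + 6K prediction vector into p0, box,
# class scores, 2D key points and 3D key points.
import torch

def decode_prediction(vec: torch.Tensor, num_kp: int) -> dict:
    assert vec.numel() == 6 + 6 * num_kp
    i = 0
    p0 = vec[0]; i += 1                            # objectness p_0
    box = vec[i:i + 4]; i += 4                     # t_x, t_y, t_w, t_h
    cls = vec[i:i + num_kp + 1]; i += num_kp + 1   # c_1 .. c_{K+1}
    kp2d = vec[i:i + 2 * num_kp].view(num_kp, 2)   # (x, y) per key point
    i += 2 * num_kp
    kp3d = vec[i:].view(num_kp, 3)                 # (x, y, z) vs. preset position
    return {"p0": p0, "box": box, "cls": cls, "kp2d": kp2d, "kp3d": kp3d}

out = decode_prediction(torch.randn(6 + 6 * 17), num_kp=17)
print(out["kp3d"].shape)                           # torch.Size([17, 3])
```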
In this embodiment, the three-dimensional key points of the target person are detected through the first target detection network and the second target detection network in the second gesture recognition model, and when the gesture recognition is performed subsequently, the three-dimensional first key points and the three-dimensional second key points detected by the first target detection network and the second target detection network are considered, so that the accuracy of the gesture recognition result is improved.
In this embodiment, the first target detection network of the second gesture recognition model is used to output the first detection frame, the two-dimensional first key point and the three-dimensional first key point of the target person, and the second target detection network is used to output the second detection frame, the two-dimensional second key point and the three-dimensional second key point of the target person, so that gesture recognition of the target person can be realized.
105, performing first preprocessing on each first detection result to obtain a third detection result, and performing second preprocessing on each second detection result to obtain a fourth detection result.
In this embodiment, the target detection process may generate a large number of mutually overlapping detection frames at the same position. To find the optimal target detection frame, the first preprocessing eliminates redundant first detection frames, that is, deletes the overlapping first detection frames in each first detection result; the second preprocessing eliminates redundant second detection frames, that is, deletes the overlapping second detection frames in each second detection result.
In an optional embodiment, the performing the first preprocessing on each first detection result to obtain a third detection result includes:
acquiring a plurality of first detection frames and confidence degrees of each first detection frame from each first detection result;
sorting the plurality of first detection frames by confidence to obtain a first detection frame list;
selecting the first detection frame with the highest confidence from the first detection frame list and adding it to a preset first output list;
calculating a first overlapping degree between the first detection frame with the highest confidence and each remaining first detection frame in the first detection frame list;
and retaining each first detection frame whose first overlapping degree is smaller than a preset overlapping degree threshold, adding it to the preset first output list, to obtain a third detection result.
In this embodiment, the confidence indicates the probability that the first target detection network's prediction of the first detection frame is correct. For example, a confidence of 0.9 for any first detection frame indicates a 90% predicted probability that it is a correct detection frame, where a correct detection frame means the target person is actually present in it.
In an optional embodiment, the performing the second preprocessing on each second detection result to obtain a fourth detection result includes:
acquiring a plurality of second detection frames and confidence degrees of each second detection frame from each second detection result;
sorting the plurality of second detection frames by confidence to obtain a second detection frame list;
selecting the second detection frame with the highest confidence from the second detection frame list and adding it to a preset second output list;
calculating a second overlapping degree between the second detection frame with the highest confidence and each remaining second detection frame in the second detection frame list;
and retaining each second detection frame whose second overlapping degree is smaller than the preset overlapping degree threshold, adding it to the preset second output list, to obtain a fourth detection result.
In this embodiment, the confidence is used to indicate a probability value that the second target detection network predicts that the second detection frame is correct.
In this embodiment, the detection frame list may be the first detection frame list or the second detection frame list, and the detection frame may be the first detection frame or the second detection frame. The overlapping degree refers to the Intersection over Union (IoU): the intersection of any one detection frame in the list with each remaining detection frame in the list, divided by their union. The IoU measures the degree of overlap of two sets: when the IoU is 0, the two frames do not overlap at all; when the IoU is 1, the two frames overlap completely; a value between 0 and 1 indicates the degree of overlap, with higher values meaning greater overlap.
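A plain-Python sketch of this confidence-sorted, IoU-thresholded filtering (standard non-maximum suppression) is given below; the (x1, y1, x2, y2) box format and all names are illustrative assumptions.

```python
# Hedged sketch: keep the highest-confidence box, drop remaining boxes
# whose IoU with it reaches the threshold, and repeat.
def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, iou_threshold=0.5):
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []                        # the "output list" of detection frames
    while order:
        best = order.pop(0)          # highest remaining confidence
        keep.append(best)
        order = [i for i in order
                 if iou(boxes[best], boxes[i]) < iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))   # [0, 2]: the overlapping second box is removed
```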
In this embodiment, an overlapping degree threshold may be preset, and redundant first detection frames and second detection frames may be eliminated based on the preset overlapping degree threshold, so as to keep the optimal first detection frames and second detection frames.
In this embodiment, when the first overlapping degree or the second overlapping degree is less than or equal to the preset overlapping degree threshold, it is determined that no overlapping detection frames appear at the same position, and no deletion is performed.
In this embodiment, each first detection result and each second detection result are preprocessed, and when the first overlapping degree or the second overlapping degree is greater than the preset overlapping degree threshold, the redundant detection frames are eliminated, which improves the accuracy of the obtained first detection frames and second detection frames and thus the accuracy of posture recognition.
106, performing fusion processing on the third detection result and the fourth detection result to obtain a human body posture recognition result.
In this embodiment, the third detection result includes the preprocessed first detection frames, two-dimensional first key points and three-dimensional first key points, and the fourth detection result includes the preprocessed second detection frames, two-dimensional second key points and three-dimensional second key points. The fusion processing fuses the two-dimensional and three-dimensional key points in the third detection result and the fourth detection result for the target person, obtaining the human body posture recognition result of the target person.
In an optional embodiment, the fusing the third detection result and the fourth detection result to obtain a human body gesture recognition result includes:
traversing the two-dimensional second key points and the three-dimensional second key points in the fourth detection result;
calculating a first Euclidean distance between each two-dimensional second key point and the two-dimensional first key point at the corresponding position, and calculating a second Euclidean distance between each three-dimensional second key point and the three-dimensional first key point at the corresponding position;
respectively judging whether the calculated first Euclidean distance and second Euclidean distance meet preset replacement conditions;
if the calculated first Euclidean distance meets the preset replacement condition, replacing the coordinates of the two-dimensional first key point in the third detection result with the coordinates of the corresponding two-dimensional second key point in the fourth detection result, and/or if the calculated second Euclidean distance meets the preset replacement condition, replacing the coordinates of the three-dimensional first key point in the third detection result with the coordinates of the corresponding three-dimensional second key point in the fourth detection result, and obtaining a fifth detection result;
and performing third preprocessing on the fifth detection result and the fourth detection result to obtain a human body posture recognition result.
Further, performing the third preprocessing on the fifth detection result and the fourth detection result to obtain a human body posture recognition result includes: merging and de-duplicating the fifth detection result and the fourth detection result to obtain a human body posture recognition result.
In this embodiment, the fifth detection result represents the third detection result after fusion processing, and the human body posture recognition result includes the vectors of 6 + 6K channels predicted by the second gesture recognition model, comprising the first detection frame, two-dimensional first key points and three-dimensional first key points in the fifth detection result, together with the second detection frame, two-dimensional second key points and three-dimensional second key points.
In this embodiment, the Euclidean distance, also known as the Euclidean metric, is used to measure the absolute distance between two points in a multidimensional space.
In this embodiment, a replacement condition may be preset, where the replacement condition may be set such that the first euclidean distance or the second euclidean distance is smaller than a preset distance threshold.
In this embodiment, the third preprocessing de-duplicates the two-dimensional and three-dimensional key points that are identical between the fifth detection result and the fourth detection result, and merges those that differ, to obtain the human body posture recognition result.
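A small sketch of such a merge-and-deduplicate step follows; since the duplicate criterion is not specified at this level of detail, exact coordinate equality is assumed here for illustration.

```python
# Hedged sketch: merge two key point lists and drop coordinate
# duplicates, keeping the first occurrence.
def merge_and_deduplicate(fifth: list, fourth: list) -> list:
    seen, merged = set(), []
    for kp in fifth + fourth:            # fifth result takes precedence
        key = tuple(round(c, 6) for c in kp)
        if key not in seen:
            seen.add(key)
            merged.append(kp)
    return merged

fifth = [(0.5, 0.5), (5.0, 5.0, 1.0)]    # mixed 2D and 3D key points
fourth = [(0.5, 0.5), (7.0, 2.0)]
print(merge_and_deduplicate(fifth, fourth))
# [(0.5, 0.5), (5.0, 5.0, 1.0), (7.0, 2.0)]  duplicate (0.5, 0.5) dropped
```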
Suppose the first target detection network of the second gesture recognition model outputs M key points of any one target person and the second target detection network outputs N key points of the same target person, where the key points may be two-dimensional or three-dimensional. Each key point output by the second target detection network is traversed; if the Euclidean distance between any one key point and the key point at the corresponding position output by the first target detection network is smaller than a preset distance threshold (for example, the distance between the nose coordinate values output by the two networks is smaller than the preset distance threshold), the nose coordinate value output by the first target detection network is replaced by the nose coordinate value output by the second target detection network, which is taken as the final nose coordinate value.
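The NumPy sketch below illustrates this replacement under a preset distance threshold; it assumes both networks output the same number of key points in the same order, which is an assumption made for illustration.

```python
# Hedged sketch: replace a first-network key point with the
# second-network key point when their Euclidean distance is under
# the preset threshold. Works for 2D (D = 2) and 3D (D = 3) alike.
import numpy as np

def fuse_keypoints(kps_first: np.ndarray,   # (K, D) from first network
                   kps_second: np.ndarray,  # (K, D) from second network
                   dist_threshold: float) -> np.ndarray:
    fused = kps_first.copy()
    dists = np.linalg.norm(kps_first - kps_second, axis=1)  # Euclidean
    replace = dists < dist_threshold        # preset replacement condition
    fused[replace] = kps_second[replace]
    return fused

a = np.array([[0.0, 0.0], [5.0, 5.0]])
b = np.array([[0.5, 0.5], [9.0, 9.0]])
print(fuse_keypoints(a, b, dist_threshold=1.0))  # only first point replaced
```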
In this embodiment, the first target detection network perceives the target well as a whole, but its predictions of relatively local key points such as the nose and eyes are not very accurate, so these key points need to be repaired; the repair relies on the key points output by the second target detection network, whose overall perception is inferior to the first target detection network's but whose perception of local key points is superior. If the Euclidean distance between a key point output by the first target detection network and the corresponding key point output by the second target detection network meets the replacement condition, the latter replaces the former. The final key points thus combine the first target detection network's overall perception with the second target detection network's local perception, improving the accuracy of the finally obtained key points and, in turn, the accuracy of human body posture recognition of the target person.
In this embodiment, the second gesture recognition model estimates the coordinate values of the three-dimensional key points of the target person through the added preset number of feature extraction channels, which solves the problem that two-dimensional target detection frame or pose estimation methods cannot accurately recognize the human body posture of the target person, and improves the accuracy of human body posture recognition of the target person.
In this embodiment, the second gesture recognition model is used to perform human body posture recognition on the target person in the picture to be recognized, so that the pose of the target person is recognized accurately. For example, if the picture to be recognized is a picture used for user identity verification, and the recognized human body posture of the target person indicates that a face image displayed on a mobile phone was presented during face recognition, identity verification is determined to have failed based on the human body posture recognition result, which improves the accuracy of identity verification.
In summary, in the human body posture recognition method based on artificial intelligence of this embodiment, a pre-trained first gesture recognition model is acquired and each feature extraction network in it is updated to obtain a second gesture recognition model. By adding a preset number of feature extraction channels to each feature extraction network of the first gesture recognition model, the coordinate values of the three-dimensional key points of the target person, that is, the pose information of those three-dimensional key points, are predicted, which solves the problem of inaccurate posture recognition caused by the lack of depth information in two-dimensional poses. Feature extraction is performed on the picture to be recognized with the second gesture recognition model to obtain feature maps of multiple scales; the feature map of each scale is input into a first target detection network to obtain a first detection result of the feature map of the corresponding scale, and into a second target detection network to obtain a second detection result of the feature map of the corresponding scale; first preprocessing of each first detection result yields a third detection result, and second preprocessing of each second detection result yields a fourth detection result, eliminating redundant detection frames. This improves the accuracy of the obtained first detection frames and second detection frames and thus the accuracy of posture recognition.
Example two
Fig. 3 is a block diagram of an artificial intelligence-based human body posture recognition device according to a second embodiment of the present invention.
In some embodiments, the artificial intelligence based human body posture recognition apparatus 20 may include a plurality of functional modules consisting of program code segments. The program code of each program segment in the artificial intelligence based human body posture recognition apparatus 20 may be stored in a memory of the electronic device and executed by at least one processor to perform the artificial intelligence based human body posture recognition functions (described in detail with reference to fig. 1 and 2).
In this embodiment, the artificial intelligence based human body posture recognition apparatus 20 may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: a first acquisition module 201, a second acquisition module 202, a feature extraction module 203, an input module 204, a preprocessing module 205 and a fusion processing module 206. A module referred to herein is a series of computer readable instructions stored in a memory and capable of being executed by at least one processor to perform a fixed function. The functions of the respective modules will be described in detail in the following embodiments.
The first obtaining module 201 is configured to obtain a pre-trained first gesture recognition model, and update each feature extraction network in the first gesture recognition model to obtain a second gesture recognition model.
The second obtaining module 202 is configured to obtain a picture to be identified in response to the received human gesture recognition request.
And the feature extraction module 203 is configured to perform feature extraction processing on the to-be-identified picture by using the second gesture recognition model, so as to obtain a feature map with multiple scales.
The input module 204 is configured to input the feature map of each scale into a first target detection network to obtain a first detection result of the feature map of the corresponding scale, and input the feature map of each scale into a second target detection network to obtain a second detection result of the feature map of the corresponding scale.
The preprocessing module 205 is configured to perform a first preprocessing on each first detection result to obtain a third detection result, and perform a second preprocessing on each second detection result to obtain a fourth detection result.
And the fusion processing module 206 is configured to perform fusion processing on the third detection result and the fourth detection result, so as to obtain a human body gesture recognition result.
In an alternative embodiment, the first obtaining module 201 is configured to: add a preset number of feature extraction channels in each feature extraction network of the first gesture recognition model to obtain a corresponding target feature extraction network; and replace the corresponding feature extraction network in the first gesture recognition model with the target feature extraction network to obtain a second gesture recognition model.
In an alternative embodiment, the feature extraction module 203 is configured to: use the convolution layers in the target feature extraction networks of the second gesture recognition model to downsample the picture to be recognized, obtaining a feature map corresponding to the scale of each convolution layer.
In an alternative embodiment, the input module 204 is configured to: detecting the feature map of each scale by using the first target detection network; and outputting a first detection frame, a two-dimensional first key point and a three-dimensional first key point of the target person in the feature map of each scale to obtain a first detection result.
In an alternative embodiment, the preprocessing module 205 is configured to: acquire, from each first detection result, a plurality of first detection frames and the confidence of each first detection frame; sort the first detection frames by confidence to obtain a first detection frame list; select the first detection frame with the highest confidence from the list and add it to a preset first output list; calculate the first overlap between that frame and each remaining first detection frame in the list; and retain each first detection frame whose first overlap is below a preset overlap threshold, adding those frames to the preset first output list to obtain the third detection result.
In an alternative embodiment, the preprocessing module 205 is further configured to: acquire, from each second detection result, a plurality of second detection frames and the confidence of each second detection frame; sort the second detection frames by confidence to obtain a second detection frame list; select the second detection frame with the highest confidence from the list and add it to a preset second output list; calculate the second overlap between that frame and each remaining second detection frame in the list; and retain each second detection frame whose second overlap is below the preset overlap threshold, adding those frames to the preset second output list to obtain the fourth detection result.
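Both preprocessing steps amount to confidence-sorted overlap suppression. The sketch below is one plausible reading of them: the literal text describes a single pass over the sorted list, whereas the sketch repeats the step until the list is empty, as classical non-maximum suppression does; the IoU overlap measure and the 0.5 default threshold are assumptions.

```python
import numpy as np

def overlap(box, boxes):
    """IoU-style overlap between one detection frame and an array of
    frames, all given as x1, y1, x2, y2."""
    x1 = np.maximum(box[0], boxes[:, 0])
    y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2])
    y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter + 1e-9)

def suppress(detections, overlap_threshold=0.5):
    """Sort the detection frames by confidence, keep the most
    confident one, retain only the remaining frames whose overlap
    with it stays below the threshold, and repeat on the remainder."""
    remaining = sorted(detections, key=lambda d: d.confidence, reverse=True)
    output = []  # the preset output list of the patent text
    while remaining:
        best = remaining.pop(0)
        output.append(best)
        if remaining:
            boxes = np.stack([d.box for d in remaining])
            keep = overlap(best.box, boxes) < overlap_threshold
            remaining = [d for d, k in zip(remaining, keep) if k]
    return output
```

Applied to the first detection results this yields the third detection result; applied to the second detection results it yields the fourth.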
In an alternative embodiment, the fusion processing module 206 is configured to: traverse the two-dimensional second key points and the three-dimensional second key points in the fourth detection result; calculate a first Euclidean distance between each two-dimensional second key point and the two-dimensional first key point at the corresponding position, and a second Euclidean distance between each three-dimensional second key point and the three-dimensional first key point at the corresponding position; determine whether the calculated first and second Euclidean distances meet a preset replacement condition; if the first Euclidean distance meets the condition, replace the coordinates of the two-dimensional first key point in the third detection result with the coordinates of the corresponding two-dimensional second key point in the fourth detection result, and/or, if the second Euclidean distance meets the condition, replace the coordinates of the three-dimensional first key point in the third detection result with the coordinates of the corresponding three-dimensional second key point in the fourth detection result, obtaining a fifth detection result; and perform third preprocessing on the fifth detection result and the fourth detection result to obtain the human body posture recognition result.
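A sketch of the key-point replacement step, reusing the Detection record above. Matching detections by list position and using "distance below a threshold" as the replacement condition are both assumptions: the patent only speaks of key points at corresponding positions and a preset replacement condition.

```python
import numpy as np

def fuse_keypoints(third, fourth, threshold_2d=8.0, threshold_3d=0.05):
    """Replace key points in the third detection result with their
    counterparts from the fourth detection result when the Euclidean
    distance between them satisfies the replacement condition; the
    return value plays the role of the fifth detection result."""
    for a, b in zip(third, fourth):  # pairing by position: an assumption
        d2 = np.linalg.norm(a.keypoints_2d - b.keypoints_2d, axis=1)
        d3 = np.linalg.norm(a.keypoints_3d - b.keypoints_3d, axis=1)
        m2 = d2 < threshold_2d  # 'distance below threshold': an assumption
        m3 = d3 < threshold_3d
        a.keypoints_2d[m2] = b.keypoints_2d[m2]
        a.keypoints_3d[m3] = b.keypoints_3d[m3]
    return third
```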
Further, performing the third preprocessing on the fifth detection result and the fourth detection result to obtain the human body posture recognition result includes: merging and de-duplicating the fifth detection result and the fourth detection result to obtain the human body posture recognition result.
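A sketch of this merge-and-deduplicate step, reusing the overlap helper from the suppression sketch above; treating "duplicate" as "very high box overlap" and the 0.9 threshold are assumptions, as the patent does not define the de-duplication criterion.

```python
import numpy as np

def merge_and_deduplicate(fifth, fourth, dup_threshold=0.9):
    """Combine the fifth and fourth detection results, skipping any
    detection whose frame almost coincides with an already-kept frame.
    `overlap` is the helper defined in the suppression sketch."""
    merged = list(fifth)
    for det in fourth:
        if merged:
            boxes = np.stack([m.box for m in merged])
            if overlap(det.box, boxes).max() >= dup_threshold:
                continue  # duplicate of a kept detection, drop it
        merged.append(det)
    return merged
```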
In summary, the artificial intelligence based human body posture recognition apparatus of this embodiment acquires a pre-trained first posture recognition model and updates each of its feature extraction networks to obtain a second posture recognition model: a preset number of feature extraction channels is added to each feature extraction network so that the model can predict the coordinates of the target person's three-dimensional key points, i.e., three-dimensional posture information, thereby alleviating the inaccuracy caused by the lack of depth information in two-dimensional postures. The apparatus then performs feature extraction on the picture to be recognized with the second posture recognition model to obtain feature maps at multiple scales; inputs the feature map of each scale into the first target detection network to obtain a first detection result and into the second target detection network to obtain a second detection result for that scale; and performs first preprocessing on each first detection result to obtain a third detection result and second preprocessing on each second detection result to obtain a fourth detection result, eliminating redundant detection frames. This improves the accuracy of the retained first and second detection frames and, in turn, the accuracy of posture recognition.
Example III
Fig. 4 is a schematic structural diagram of an electronic device according to a third embodiment of the present invention. In the preferred embodiment of the invention, the electronic device 3 comprises a memory 31, at least one processor 32, at least one communication bus 33 and a transceiver 34.
It will be appreciated by those skilled in the art that the configuration of the electronic device shown in Fig. 4 does not limit the embodiments of the present invention; either a bus-type or a star-type configuration is possible, and the electronic device 3 may include more or fewer hardware or software components than shown, or a different arrangement of components.
In some embodiments, the electronic device 3 is an electronic device capable of automatically performing numerical calculation and/or information processing according to a preset or stored instruction, and its hardware includes, but is not limited to, a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The electronic device 3 may further include a client device, where the client device includes, but is not limited to, any electronic product that can interact with a client by way of a keyboard, a mouse, a remote control, a touch pad, or a voice control device, such as a personal computer, a tablet computer, a smart phone, a digital camera, etc.
It should be noted that the electronic device 3 is only an example; other existing or future electronic products that are compatible with the present invention are likewise included within its scope.
In some embodiments, the memory 31 is used to store program code and various data, such as the artificial intelligence based human body posture recognition apparatus 20 installed in the electronic device 3, and to provide high-speed, automatic access to programs and data during operation of the electronic device 3. The memory 31 includes read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), one-time programmable read-only memory (OTPROM), electrically erasable programmable read-only memory (EEPROM), compact disc read-only memory (CD-ROM) or other optical storage, magnetic tape storage, or any other computer-readable medium that can be used to carry or store data.
In some embodiments, the at least one processor 32 may consist of a single packaged integrated circuit or of multiple integrated circuits with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital signal processing chips, graphics processors, combinations of various control chips, and the like. The at least one processor 32 is the control unit of the electronic device 3: it connects the components of the entire device using various interfaces and lines, and carries out the functions of the electronic device 3 and processes data by running or executing the programs or modules stored in the memory 31 and calling the data stored therein.
In some embodiments, the at least one communication bus 33 is arranged to enable connected communication between the memory 31 and the at least one processor 32 or the like.
Although not shown, the electronic device 3 may further include a power source (such as a battery) for powering the various components, and optionally, the power source may be logically connected to the at least one processor 32 via a power management device, thereby implementing functions such as managing charging, discharging, and power consumption by the power management device. The power supply may also include one or more of any of a direct current or alternating current power supply, recharging device, power failure detection circuit, power converter or inverter, power status indicator, etc. The electronic device 3 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described herein.
It should be understood that the described embodiments are for illustrative purposes only and that the scope of the patent application is not limited to this configuration.
The integrated units implemented in the form of software functional modules described above may be stored in a computer readable storage medium. The software functional modules described above are stored in a storage medium and include instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device, etc.) or a processor (processor) to perform portions of the methods described in the various embodiments of the invention.
In a further embodiment, in connection with Fig. 3, the at least one processor 32 may execute the operating system of the electronic device 3 as well as the various installed applications (such as the artificial intelligence based human body posture recognition apparatus 20), program code, and the like, for example the modules described above.
The memory 31 stores program code, and the at least one processor 32 can invoke the program code stored in the memory 31 to perform related functions. For example, the modules depicted in Fig. 3 are program code stored in the memory 31 and executed by the at least one processor 32 to perform their respective functions for artificial intelligence based human body posture recognition.
Illustratively, the program code may be partitioned into one or more modules/units that are stored in the memory 31 and executed by the processor 32 to complete the present application. The one or more modules/units may be a series of computer readable instruction segments capable of performing the specified functions, which instruction segments describe the execution of the program code in the electronic device 3. For example, the program code may be divided into a first acquisition module 201, a second acquisition module 202, a feature extraction module 203, an input module 204, a preprocessing module 205, and a fusion processing module 206.
In one embodiment of the invention, the memory 31 stores a plurality of computer-readable instructions that are executed by the at least one processor 32 to implement the artificial intelligence based human body posture recognition functionality.
Specifically, the specific implementation method of the above instruction by the at least one processor 32 may refer to descriptions of related steps in the corresponding embodiments of fig. 1 and fig. 2, which are not repeated herein.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus and method may be implemented in other manners. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be other manners of division when actually implemented.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical units, may be located in one place, or may be distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in the embodiments of the present invention may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units can be realized in a form of hardware or a form of hardware and a form of software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it will be obvious that the term "comprising" does not exclude other elements or that the singular does not exclude a plurality. The units or means stated in the invention may also be implemented by one unit or means, either by software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.
Finally, it should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications and equivalents may be made to the technical solution of the present invention without departing from the spirit and scope of the technical solution of the present invention.

Claims (10)

1. A human body posture recognition method based on artificial intelligence, the method comprising:
acquiring a pre-trained first posture recognition model, and updating each feature extraction network in the first posture recognition model to obtain a second posture recognition model;
acquiring a picture to be recognized in response to a received human body posture recognition request;
performing feature extraction processing on the picture to be recognized by using the second posture recognition model to obtain feature maps at multiple scales;
inputting the feature map of each scale into a first target detection network to obtain a first detection result of the feature map of the corresponding scale, and inputting the feature map of each scale into a second target detection network to obtain a second detection result of the feature map of the corresponding scale;
performing first preprocessing on each first detection result to obtain a third detection result, and performing second preprocessing on each second detection result to obtain a fourth detection result;
and performing fusion processing on the third detection result and the fourth detection result to obtain a human body posture recognition result.
2. The artificial intelligence based human body posture recognition method of claim 1, wherein updating each feature extraction network in the first posture recognition model to obtain a second posture recognition model comprises:
adding a preset number of feature extraction channels to each feature extraction network of the first posture recognition model to obtain a corresponding target feature extraction network;
and replacing the corresponding feature extraction network in the first posture recognition model with the target feature extraction network to obtain the second posture recognition model.
3. The artificial intelligence based human body posture recognition method of claim 1, wherein performing feature extraction processing on the picture to be recognized by using the second posture recognition model to obtain feature maps at multiple scales comprises:
using the convolution layers in the target feature extraction networks of the second posture recognition model to downsample the picture to be recognized, obtaining a feature map at the scale corresponding to each convolution layer.
4. The artificial intelligence based human body posture recognition method of claim 1, wherein inputting the feature map of each scale into the first target detection network to obtain a first detection result of the feature map of the corresponding scale comprises:
detecting the feature map of each scale by using the first target detection network;
and outputting a first detection frame, two-dimensional first key points and three-dimensional first key points of the target person in the feature map of each scale to obtain the first detection result.
5. The artificial intelligence based human body posture recognition method of claim 1, wherein the performing the first preprocessing on each first detection result to obtain a third detection result includes:
acquiring a plurality of first detection frames and confidence degrees of each first detection frame from each first detection result;
sequencing the confidence degrees of the plurality of first detection frames to obtain a first detection frame list;
selecting a first detection frame with highest confidence from the first detection frame list, and adding the first detection frame to a preset first output list;
calculating a first overlapping degree between the first detection frame with the highest confidence and each first detection frame in the rest of the first detection frame list;
and reserving each first detection frame with the first overlapping degree smaller than a preset overlapping degree threshold value, and adding the first detection frames to the preset first output list to obtain a third detection result.
6. The artificial intelligence based human body posture recognition method of claim 1, wherein performing fusion processing on the third detection result and the fourth detection result to obtain a human body posture recognition result includes:
traversing the two-dimensional second key points and the three-dimensional second key points in the fourth detection result;
calculating a first Euclidean distance between each two-dimensional second key point and the two-dimensional first key point at the corresponding position, and calculating a second Euclidean distance between each three-dimensional second key point and the three-dimensional first key point at the corresponding position;
respectively judging whether the calculated first Euclidean distance and second Euclidean distance meet a preset replacement condition;
if the calculated first Euclidean distance meets the preset replacement condition, replacing the coordinates of the two-dimensional first key point in the third detection result with the coordinates of the corresponding two-dimensional second key point in the fourth detection result, and/or, if the calculated second Euclidean distance meets the preset replacement condition, replacing the coordinates of the three-dimensional first key point in the third detection result with the coordinates of the corresponding three-dimensional second key point in the fourth detection result, to obtain a fifth detection result;
and performing third preprocessing on the fifth detection result and the fourth detection result to obtain a human body posture recognition result.
7. The artificial intelligence based human body posture recognition method of claim 6, wherein performing the third preprocessing on the fifth detection result and the fourth detection result to obtain a human body posture recognition result includes:
merging and de-duplicating the fifth detection result and the fourth detection result to obtain the human body posture recognition result.
8. An artificial intelligence based human posture recognition device, the device comprising:
the first acquisition module is used for acquiring a pre-trained first posture recognition model and updating each feature extraction network in the first posture recognition model to obtain a second posture recognition model;
the second acquisition module is used for acquiring a picture to be recognized in response to a received human body posture recognition request;
the feature extraction module is used for performing feature extraction processing on the picture to be recognized by using the second posture recognition model to obtain feature maps at multiple scales;
the input module is used for inputting the feature map of each scale into a first target detection network to obtain a first detection result of the feature map of the corresponding scale, and inputting the feature map of each scale into a second target detection network to obtain a second detection result of the feature map of the corresponding scale;
the preprocessing module is used for performing first preprocessing on each first detection result to obtain a third detection result, and performing second preprocessing on each second detection result to obtain a fourth detection result;
and the fusion processing module is used for performing fusion processing on the third detection result and the fourth detection result to obtain a human body posture recognition result.
9. An electronic device comprising a processor and a memory, wherein the processor is configured to implement the artificial intelligence based human body posture recognition method of any one of claims 1 to 7 when executing a computer program stored in the memory.
10. A computer-readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, implements the artificial intelligence based human body posture recognition method of any one of claims 1 to 7.
CN202310567461.6A 2023-05-18 2023-05-18 Human body posture recognition method and device based on artificial intelligence and related equipment Pending CN116453226A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310567461.6A CN116453226A (en) 2023-05-18 2023-05-18 Human body posture recognition method and device based on artificial intelligence and related equipment

Publications (1)

Publication Number Publication Date
CN116453226A true CN116453226A (en) 2023-07-18

Family

ID=87132241

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310567461.6A Pending CN116453226A (en) 2023-05-18 2023-05-18 Human body posture recognition method and device based on artificial intelligence and related equipment

Country Status (1)

Country Link
CN (1) CN116453226A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116665309A (en) * 2023-07-26 2023-08-29 山东睿芯半导体科技有限公司 Method, device, chip and terminal for identifying walking gesture features
CN116665309B (en) * 2023-07-26 2023-11-14 山东睿芯半导体科技有限公司 Method, device, chip and terminal for identifying walking gesture features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination