CN112329513A - High frame rate 3D (three-dimensional) posture recognition method based on convolutional neural network

High frame rate 3D (three-dimensional) posture recognition method based on convolutional neural network

Info

Publication number
CN112329513A
CN112329513A
Authority
CN
China
Prior art keywords
human body
neural network
convolutional neural
key point
frame rate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010856917.7A
Other languages
Chinese (zh)
Inventor
孙旭
朱晓勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Horus Technology Co ltd
Original Assignee
Suzhou Horus Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Horus Technology Co ltd filed Critical Suzhou Horus Technology Co ltd
Priority to CN202010856917.7A priority Critical patent/CN112329513A/en
Publication of CN112329513A publication Critical patent/CN112329513A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/64 Three-dimensional objects

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a high frame rate 3D posture recognition method based on a convolutional neural network, comprising the following steps: step one, a terminal acquires a video containing human motion, and orientation coordinates are established for the scene in the video, centered on the human body to be recognized; step two, human 2D key point detection and motion recognition are performed for each known human body position; step three, offline 3D key point detection is performed by comparing multiple images, based on the temporal information of the video; step four, the model is compressed for deployment on a mobile phone. The method applies convolutional neural networks and related techniques to 3D human key point recognition on mobile phones; through model compression and algorithmic innovation, the algorithm runs smoothly on devices with limited computing power such as mobile phones, raising the running frame rate and achieving good recognition results.

Description

High frame rate 3D (three-dimensional) posture recognition method based on convolutional neural network
Technical Field
The technology belongs to the field of computer software, and in particular relates to a high frame rate 3D posture recognition method for mobile phones based on a convolutional neural network.
Background
Human body posture judgment has broad application prospects. Several technologies for 3D human key point recognition already exist on the market, and human key points are now applied in a wide range of scenarios, such as detecting leg-stretching exercises. The topic has become a popular research field, with current work focused mainly on detecting human key points using cameras that provide depth information.
The general process of posture judgment is joint point localization, with the posture judged from the joint positioning data. To obtain a more accurate posture judgment, however, the joint coordinates often need to be 3D spatial coordinates. This places high demands not only on the capture equipment and depth data, but also on computing power, so such methods can hardly run in real-time applications. The monocular-camera 3D visual motion capture algorithms popular in the industry run at a low frame rate on mobile phones and give a poor user experience; for example, the well-known OpenPose project reaches only about 10 FPS.
Disclosure of Invention
In view of these technical problems, the invention applies convolutional neural networks and related techniques to 3D human key point recognition on mobile phones; through model compression and algorithmic innovation, the algorithm runs smoothly on devices with limited computing power such as mobile phones, raising the running frame rate and achieving a good recognition effect.
To achieve this purpose, the invention adopts the following technical scheme:
The invention provides a high frame rate 3D posture recognition method based on a convolutional neural network, comprising the following steps:
Step one: a terminal acquires a video containing human motion, and orientation coordinates are established for the scene in the video, centered on the human body to be recognized;
Step two: human 2D key point detection and motion recognition are performed for each known human body position;
Step three: offline 3D key point detection is performed by comparing multiple images, based on the temporal information of the video;
Step four: finally, the model is compressed for deployment on a mobile phone.
Further, step two is specifically:
Step 2.1: feature points of the human body appearing in the video are extracted by a convolutional neural network;
Step 2.2: a logistic regression algorithm is applied to the extracted feature points to obtain the human joint points;
Step 2.3: responses to objects that are not the recognition target are removed according to the center point localization;
Step 2.4: finally, the predicted heatmap is repeatedly fine-tuned to obtain the final result, namely the 2D key points;
Step 2.5: a state space search is performed on the obtained 2D key points. Based on ergonomic data, the 2D key points should satisfy typical ergonomic bone lengths and properties, and the range of the state space search is narrowed according to typical ergonomic values, so that the judgment of the motion on an individual conforms to typical ergonomic angles;
Step 2.6: the result of the previous step lies within a certain numerical range. 2D spatial bend-angle estimation is then performed over the whole range, an ergonomic bend-angle search is performed again, and the bend angles are constrained to an ergonomically reasonable range. Finally, the resulting bend angles and lengths are classified into a specific action type to determine the basic posture of the human body.
Further, step three is specifically:
Step 3.1: the cosine distances between the image feature vectors of adjacent frames of the video are compared; if the features of adjacent frames differ significantly, the two images are considered too different and are discarded; if the features of adjacent frames are close, 3D key point detection and recognition can be performed on those frames;
Step 3.2: a 3D key point recognition algorithm is applied to several adjacent frames, the 2D key point information is compared with the 3D ground truth, and the difference is recorded;
Step 3.3: the difference from step 3.2 is optimized with gradient descent, reducing the gap between the 2D key point information and the 3D ground truth.
Furthermore, in step 3.2, the 3D key point recognition algorithm lifts the 2D key points obtained in step 2.4 to 3D using a random forest, yielding the 3D key point information in the video frame. Random forest is a general-purpose algorithm.
Preferably, the posture recognition method further comprises a step five of optimizing the detected 3D key points with an ergonomic method.
Further, step five is specifically:
Step 5.1: the angle between the camera and the subject is marked and recorded as A;
Step 5.2: a state space search is performed on the obtained 3D key points. Based on ergonomic data, the 3D key points should satisfy typical ergonomic bone lengths and properties. The search range is narrowed according to typical ergonomic values and further narrowed using the angle A, so that the judgment of the motion on an individual conforms to typical ergonomic angles;
Step 5.3: the spatial bend angles of the 3D key points are calculated from the ergonomic data combined with the actual value of angle A. The ergonomic data includes the range of motion of each limb and the limb's most natural (preferred) motion pattern; the most natural angle in the search space is compared with the classified motions, and the motion is classified.
Further, the range of motion of a limb includes the possible angle range of the elbow, knee, hip, ankle, wrist, neck and shoulder from fully extended to fully flexed; the most natural motion pattern of a limb means that, given the 3D spatial coordinates of the limb, its most comfortable spatial coordinates are unique.
Preferably, in step 3.1, if the cosine distance is greater than or equal to a set threshold, the features of adjacent frames are considered to differ significantly, the two images are considered too different, and they are discarded; if the cosine distance is less than the set threshold, the features of adjacent frames are considered close.
The high frame rate 3D posture recognition method for mobile phones based on a convolutional neural network disclosed by the invention has the following beneficial effects:
1. In the prior art, 3D posture recognition is generally based on a 3D camera; this approach involves a large amount of computation and places high demands on sensors such as the camera, which is unfavorable for large-scale deployment of detection on mobile phones.
2. A relatively accurate 3D posture recognition algorithm is realized on devices with low computing power such as mobile phones. The 3D position of the human body can be estimated and detected from 2D images and video, reducing the demands on computing power and sensors; the algorithm runs smoothly on a mobile phone, generally reaching 30 FPS, a frame rate that appears smooth to the human eye.
3. By applying ergonomics, the corresponding human joint bend angles (for example, the bend angle of a limb relative to the torso, or of the upper arm relative to the forearm) are calculated and used for motion recognition, further improving the computation accuracy.
Drawings
To illustrate the technical solution of the invention more clearly, the drawings used in the description are briefly introduced below. The drawings described below are obviously only some embodiments of the invention; those skilled in the art can obtain other drawings from them without inventive labor.
FIG. 1 is a detection flow chart of the high frame rate 3D posture recognition method for mobile phones based on a convolutional neural network of the present invention;
FIG. 2 is a diagram of the neural network architecture employed by the present invention;
FIG. 3 is a diagram of the motion detection classification of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.
Example 1
This embodiment provides a high frame rate 3D posture recognition method based on a convolutional neural network. The method first needs to teach the machine; the teaching process is as follows:
1. A demonstrator records an accurate human skeleton (joint-to-joint distances).
a. Joint points include, but are not limited to: nose, left shoulder, right shoulder, left elbow, right elbow, left hand, right hand, left hip, right hip, left knee, right knee, left foot, right foot.
2. The demonstrator performs standard actions, filmed by multiple cameras.
a. Each camera views the demonstrator from a different angle.
b. The distortion of each camera lens is known or very small.
c. Other camera parameters are known and can be entered as optional parameters when adjusting the model. These parameters include, but are not limited to: camera distortion, focal length, height of the camera above the ground, and distance to the person.
d. After simultaneous multi-angle filming, human 3D model data can be obtained through data fusion, from which the 3D bend angles of the corresponding joints are derived.
3. The demonstrator repeats the actions, and several demonstrators perform them, to increase the amount of data.
4. The images are fed into a human joint 2D coordinate model to obtain the 2D coordinates of the joint points.
The scheme of this embodiment covers the following limb actions and postures, but the key points to be calibrated and their postures can also be chosen according to the motion mode and motion type to be recognized:
5. The included angles of the quadrilateral formed by the four torso nodes, the bend angles of the joints involved in the action, and the length of each bone are used as independent variables, with the action demonstrated by the demonstrator as the dependent variable; a convolutional neural network is constructed and the model is trained.
a. The four torso nodes are the left and right shoulders and the left and right hips.
b. The bend angles of the joints involved in the action may include, but are not limited to (the angle computation is sketched after this list):
i. left shoulder angle: left hip - left shoulder - left elbow
ii. left elbow angle: left shoulder - left elbow - left hand
iii. left hip angle: left shoulder - left hip - left knee
iv. left knee angle: left hip - left knee - left foot
v. right shoulder angle: right hip - right shoulder - right elbow
vi. right elbow angle: right shoulder - right elbow - right hand
vii. right hip angle: right shoulder - right hip - right knee
viii. right knee angle: right hip - right knee - right foot.
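To make these angle definitions concrete, the following is a minimal sketch (an illustration added here, not part of the original disclosure) of computing the bend angle at a joint from a key point triple; the function name and the example coordinates are illustrative assumptions:

```python
import numpy as np

def bend_angle(a, b, c):
    """Bend angle at joint b, in degrees, for the triple a-b-c
    (e.g. left shoulder - left elbow - left hand for the left elbow angle).
    Works for 2D or 3D coordinates."""
    a, b, c = np.asarray(a, float), np.asarray(b, float), np.asarray(c, float)
    u, v = a - b, c - b  # vectors from the joint to its two neighbors
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-8)
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

# Example: left elbow angle from left shoulder, left elbow, left hand (2D pixels)
print(bend_angle((120, 80), (140, 130), (180, 150)))
```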
Modeling and learning with scene data are carried out on a cloud server. After the algorithm is deployed, practical application comprises the following steps:
the method comprises the following steps that firstly, a terminal obtains a video containing human body actions, and orientation coordinates of a scene in the video with a human body to be identified as a center are established; the tester records accurate human bones (the distance from the joint to the joint, see the recording model part); the tester can (but does not require) pre-record standard actions, the process can be used for optimizing an algorithm and has better data output for the identification result of the tester, then the tester starts to move, and the camera continuously performs reciprocating detection in a given time period to make judgment;
a. if recorded, the action is added to the model as a new data point and a relatively large weight is obtained to optimize the local model. First, the algorithm will detect all the human bodies present in the image and position them accordingly, forming a rectangular frame with the top left corner (x 1, y 1), the bottom right corner (x 2, y 2), and the human body middle point (x 3, y 3).
b. If recorded, the 2D joint coordinates and dog-ears generated by the motion are stored in the system library, and the common model is trained again.
Step two: human 2D key point detection and motion recognition are performed for each known human body position. Step two is specifically:
Step 2.1: feature points of the human body appearing in the video are extracted by a convolutional neural network;
Step 2.2: a logistic regression algorithm is applied to the extracted feature points to obtain the human joint points; logistic regression is a general-purpose algorithm;
Step 2.3: responses to objects that are not the recognition target are removed according to the center point localization;
Step 2.4: finally, the predicted heatmap is repeatedly fine-tuned to obtain the final result, namely the 2D key points (see the sketch after step 2.6).
Step 2.5: a state space search is performed on the obtained 2D key points. Based on ergonomic data, the 2D key points should satisfy typical ergonomic bone lengths and properties. The range of the state space search is narrowed according to typical ergonomic values, so that the judgment of the motion on an individual conforms to typical ergonomic angles.
Step 2.6: the 2D spatial bend angles are calculated from the ergonomic data and classified into a specific action type to determine the basic posture of the human body.
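The heatmap-to-keypoint reading in step 2.4 can be illustrated as follows, under the assumption that the network predicts one heatmap per joint and that the peak of each heatmap is taken as that joint's 2D key point; the array shapes and function name are illustrative, not from the patent:

```python
import numpy as np

def heatmaps_to_keypoints(heatmaps):
    """heatmaps: (num_joints, H, W) array of predicted per-joint heatmaps.
    Returns (num_joints, 3): x, y and confidence for each 2D key point."""
    num_joints, h, w = heatmaps.shape
    kps = np.zeros((num_joints, 3))
    for j in range(num_joints):
        idx = np.argmax(heatmaps[j])          # flat index of the heatmap peak
        y, x = np.unravel_index(idx, (h, w))  # convert to row/column
        kps[j] = (x, y, heatmaps[j, y, x])    # peak location and its score
    return kps

# Toy example with random "heatmaps" for 13 joints
print(heatmaps_to_keypoints(np.random.rand(13, 64, 48)))
```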
In each detection, the camera captures the included angles of the quadrilateral formed by the four torso nodes, the bend angles of the joints involved in the action, and the pre-recorded bone lengths, and feeds them into the model to obtain a score for the intended action. If the intent score exceeds a threshold, the anticipated action is judged to have been made; otherwise it is not.
Step three is specifically:
Step 3.1: the cosine distances between the image feature vectors of adjacent frames of the video are compared. If the features of adjacent frames differ significantly, i.e. the cosine distance is greater than a specific threshold, the two images are considered too different and are discarded; if the features of adjacent frames are close, i.e. the cosine distance is smaller than the threshold, 3D key point detection and recognition can be performed on those frames (a sketch follows step 3.3);
Step 3.2: a 3D key point recognition algorithm is applied to several adjacent frames, the 2D key point information is compared with the 3D ground truth, and the difference is recorded;
Step 3.3: the difference from step 3.2 is optimized with gradient descent, reducing the gap between the 2D key point information and the 3D ground truth.
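A minimal sketch of the adjacent-frame filtering in step 3.1; the per-frame feature vectors are assumed to be given by some extractor, and the threshold value 0.2 is an illustrative assumption rather than a figure from the patent:

```python
import numpy as np

def cosine_distance(f1, f2):
    """Cosine distance between two image feature vectors."""
    f1, f2 = np.asarray(f1, float), np.asarray(f2, float)
    cos_sim = np.dot(f1, f2) / (np.linalg.norm(f1) * np.linalg.norm(f2) + 1e-8)
    return 1.0 - cos_sim

def select_stable_pairs(features, threshold=0.2):
    """Keep only adjacent-frame pairs whose features are close enough for
    3D key point detection; pairs that differ too much are discarded."""
    pairs = []
    for i in range(len(features) - 1):
        if cosine_distance(features[i], features[i + 1]) < threshold:
            pairs.append((i, i + 1))
    return pairs

# Toy example: 10 frames with 128-dimensional features
print(select_stable_pairs(np.random.rand(10, 128)))
```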
The model uses 3D bend angles for more accurate judgment:
i. the 3D joint angles can be calculated from ergonomic data;
ii. on some mobile phones, the rear multi-view camera or depth camera can be used to acquire the 3D spatial coordinates of the joints directly.
Furthermore, in step 3.2, the 3D key point recognition algorithm lifts the 2D key points obtained in step 2.4 to 3D using a random forest, yielding the 3D key point information in the video frame.
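As an illustration of this lifting step, the following sketch uses scikit-learn's general-purpose RandomForestRegressor; the feature layout (flattened 2D coordinates in, flattened 3D coordinates out) and the random training data are assumptions for demonstration only:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

NUM_JOINTS = 13  # e.g. nose, shoulders, elbows, hands, hips, knees, feet

# Illustrative training data: flattened 2D key points -> flattened 3D key points
X_train = np.random.rand(1000, NUM_JOINTS * 2)  # (samples, joints * (x, y))
y_train = np.random.rand(1000, NUM_JOINTS * 3)  # (samples, joints * (x, y, z))

lifter = RandomForestRegressor(n_estimators=100, random_state=0)
lifter.fit(X_train, y_train)  # multi-output regression: 2D -> 3D

# Lift the 2D key points of one frame to 3D
kps_2d = np.random.rand(1, NUM_JOINTS * 2)
kps_3d = lifter.predict(kps_2d).reshape(NUM_JOINTS, 3)
print(kps_3d)
```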
Step four: finally, the model is compressed for deployment on a mobile phone. In this embodiment, model compression uses methods such as knowledge distillation, which can greatly reduce the size of the model.
The posture recognition method further comprises a step five of optimizing the detected 3D key points with an ergonomic method.
Step five is specifically:
Step 5.1: the angle between the camera and the subject is marked and recorded as A;
Step 5.2: a state space search is performed on the obtained 3D key points. Based on ergonomic data, the 3D key points should satisfy typical ergonomic bone lengths and properties. The search range is narrowed according to typical ergonomic values and further narrowed using the angle A, so that the judgment of the motion on an individual conforms to typical ergonomic angles;
Step 5.3: the spatial bend angles of the 3D key points are calculated from the ergonomic data combined with the actual value of A. The ergonomic data includes the range of motion of each limb and the limb's most natural (preferred) motion pattern; the most natural angle in the search space is compared with the classified motions, and the motion is classified.
As an example, the action is detection of a flip (turning-over), and the input is a video segment. Through feature extraction and data labeling, the 3D key points are determined by the algorithms of steps one to five above. Once the 3D key points are determined, the specific action is recognized. In this example we mainly check: 1. whether a foot is above the head; 2. turning; 3. whether the back is in view. The specific action type is determined by computing and classifying the 3D key points for these three conditions (the first check is sketched below).
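The first check reduces to a comparison of 3D key point coordinates. A minimal sketch, assuming a dictionary of joint coordinates and an upward-pointing y axis (both illustrative assumptions):

```python
def foot_above_head(kps_3d, up_axis=1):
    """kps_3d: dict mapping joint name to (x, y, z).
    Returns True if either foot is above the head along the up axis."""
    head = kps_3d["nose"][up_axis]
    return (kps_3d["left_foot"][up_axis] > head or
            kps_3d["right_foot"][up_axis] > head)

kps = {"nose": (0.0, 1.6, 0.0),
       "left_foot": (0.2, 1.8, 0.1),   # foot raised above the head
       "right_foot": (-0.2, 0.0, 0.0)}
print(foot_above_head(kps))  # True
```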
Preferably, the range of motion of a limb includes the possible angle range of the elbow, knee, hip, ankle, wrist, neck and shoulder from fully extended to fully flexed; the most natural motion pattern of a limb means that, given the 3D spatial coordinates of the limb, its most comfortable spatial coordinates are unique.
The invention recognizes the posture of the human body in different motion states in order to report the subject's posture and, combined with standard action criteria, can guide and score the subject to a certain extent.
1) First, the algorithm detects all human bodies appearing in the image and localizes them, forming a rectangular box with top-left corner (x1, y1) and bottom-right corner (x2, y2), plus the human body center point (x3, y3).
2) At this point the algorithm knows the positions of all human bodies. Human 2D key point detection and motion recognition are performed for each known human body position:
a. features of each detected human body are extracted by a convolutional neural network;
b. a logistic regression algorithm is applied to each person's extracted features to find each person's joint points;
c. responses to other people are then removed based on the center point localization;
d. finally, the predicted heatmap is repeatedly fine-tuned to obtain the final result (the 2D key points);
3) 3D key point detection is performed offline by comparing multiple images, based on the temporal information of the video:
a. because video is continuous, we usually compare images of adjacent frames;
b. if the features of adjacent frames differ significantly, the two images are considered too different and are discarded;
c. if the features of adjacent frames are close, 3D key point detection and recognition can be performed on them;
d. the 3D key point recognition algorithm is run on 5 adjacent frames; it lifts the 2D key points to 3D using a random forest, compares the 2D key point information with the 3D ground truth, and records the difference;
e. the difference from the previous step is optimized with gradient descent, further reducing the gap between the 2D key point information and the 3D ground truth;
4) The detected 3D key points are optimized with an ergonomic method:
a. because the angle between the camera and the person can bias the readings, the data take the torso plane as reference, and the angle between the camera and the person is marked.
b. The actual bone lengths and properties of each subject are taken as reference and the model is adjusted, so that the judgment of the motion on the individual conforms to typical ergonomic angles.
c. The 3D spatial bend angles can be inferred from the ergonomic data. Ergonomic data include, but are not limited to, the possible range of limb motion, such as the possible angle range of the elbow from fully extended to fully flexed, and the most natural motion patterns of the limbs, for example the spatial coordinates of the elbow's most comfortable position, which are unique given the 3D spatial coordinates of the hand. A sketch of this ergonomic constraint follows.
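As an illustration of this ergonomic constraint, the following sketch clamps estimated bend angles into plausible ranges; the numeric ranges are rough illustrative values, not figures from the patent:

```python
# Illustrative ranges (degrees) from fully extended to fully flexed
ERGONOMIC_RANGES = {
    "elbow": (0.0, 150.0),
    "knee":  (0.0, 140.0),
    "hip":   (0.0, 120.0),
}

def clamp_to_ergonomic(angles):
    """Clamp estimated 3D bend angles into their ergonomic ranges,
    shrinking the state-space search to physically plausible postures."""
    out = {}
    for joint, angle in angles.items():
        lo, hi = ERGONOMIC_RANGES.get(joint, (0.0, 180.0))
        out[joint] = min(max(angle, lo), hi)
    return out

print(clamp_to_ergonomic({"elbow": 165.0, "knee": 95.0}))
# {'elbow': 150.0, 'knee': 95.0}
```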
5) Finally, model compression is performed by means of knowledge distillation, and the compressed model is deployed on a mobile phone. The specific compression steps are as follows (a sketch follows this list):
a. The original model trained in the cloud serves as the teacher model; the teacher model generally has a large computational cost. The first three layers of the teacher model are taken as the student model. The student model generally has a small computational cost and can be deployed on a mobile phone.
b. The same sample is fed into both the teacher model and the student model.
c. The output of the teacher model is compared with the output of the student model; the difference is called the model distillation loss.
d. The output of the student model is compared with the actual measurements; the difference is called the student model's learning loss.
e. A weighted average of the distillation loss and the student learning loss gives the student model's loss function. By minimizing this loss function with gradient descent or similar methods, we obtain the distilled student model.
f. The size of the student model is generally about 1/10 of the teacher model, effectively reducing the amount of computation.
g. The trained, small student model is deployed on the mobile phone, effectively reducing the amount of computation.
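Steps b. to e. correspond to a standard distillation objective. A minimal PyTorch sketch; the toy models, the MSE losses and the weighting factor alpha are illustrative assumptions, not the patent's specific networks:

```python
import torch
import torch.nn as nn

def distillation_step(teacher, student, x, y_true, alpha=0.5):
    """One training step of knowledge distillation.
    alpha weights the distillation loss against the student's own loss."""
    with torch.no_grad():
        y_teacher = teacher(x)          # b. same sample through the teacher
    y_student = student(x)              #    ... and through the student
    distill_loss = nn.functional.mse_loss(y_student, y_teacher)  # c.
    student_loss = nn.functional.mse_loss(y_student, y_true)     # d.
    return alpha * distill_loss + (1 - alpha) * student_loss     # e.

# Illustrative teacher/student for 13 joints * (x, y) in and * (x, y, z) out
teacher = nn.Sequential(nn.Linear(26, 256), nn.ReLU(), nn.Linear(256, 39))
student = nn.Sequential(nn.Linear(26, 32), nn.ReLU(), nn.Linear(32, 39))
loss = distillation_step(teacher, student, torch.rand(8, 26), torch.rand(8, 39))
loss.backward()  # gradient descent then updates the student only
```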
It will be understood by those skilled in the art that all or part of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium such as a read-only memory, a magnetic disk, or an optical disk.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (8)

1. A high frame rate 3D posture recognition method based on a convolutional neural network, characterized by comprising the following steps:
step one, a terminal acquires a video containing human motion, and orientation coordinates are established for the scene in the video, centered on the human body to be recognized;
step two, human 2D key point detection and motion recognition are performed for each known human body position;
step three, offline 3D key point detection is performed by comparing multiple images, based on the temporal information of the video;
step four, finally, the model is compressed for deployment on a mobile phone.
2. The convolutional neural network-based high frame rate 3D posture recognition method according to claim 1, wherein step two is specifically:
step 2.1, feature points of the human body appearing in the video are extracted by a convolutional neural network;
step 2.2, a logistic regression algorithm is applied to the extracted feature points to obtain the human joint points;
step 2.3, responses to objects that are not the recognition target are removed according to the center point localization;
step 2.4, finally, the predicted heatmap is repeatedly fine-tuned to obtain the final result, namely the 2D key points;
step 2.5, a state space search is performed on the obtained 2D key points; based on ergonomic data, the 2D key points should satisfy typical ergonomic bone lengths and properties, and the range of the state space search is narrowed according to typical ergonomic values, so that the judgment of the motion on an individual conforms to typical ergonomic angles;
step 2.6, the result of the previous step lies within a certain numerical range; 2D spatial bend-angle calculation is then performed over the whole range, an ergonomic bend-angle search is performed again, and the bend angles are constrained to an ergonomically reasonable range;
finally, the resulting bend angles and lengths are classified into a specific action type to determine the basic posture of the human body.
3. The convolutional neural network-based high frame rate 3D posture recognition method according to claim 1, wherein step three is specifically:
step 3.1, the cosine distances between the image feature vectors of adjacent frames of the video are compared; if the features of adjacent frames differ significantly, the two images are considered too different and are discarded; if the features of adjacent frames are close, 3D key point detection and recognition can be performed on those frames;
step 3.2, a 3D key point recognition algorithm is applied to several adjacent frames, the 2D key point information is compared with the 3D ground truth, and the difference is recorded;
step 3.3, the difference from step 3.2 is optimized with gradient descent, reducing the gap between the 2D key point information and the 3D ground truth.
4. The convolutional neural network-based high frame rate 3D posture recognition method according to claim 3, wherein in step 3.2 the 3D key point recognition algorithm lifts the 2D key points obtained in step 2.4 to 3D using a random forest, yielding the 3D key point information in the video frame.
5. The convolutional neural network-based high frame rate 3D posture recognition method according to claim 1, wherein the posture recognition method further comprises a step five of optimizing the detected 3D key points with an ergonomic method.
6. The convolutional neural network-based high frame rate 3D posture recognition method according to claim 5, wherein step five is specifically:
step 5.1, the angle between the camera and the subject is marked and recorded as A;
step 5.2, a state space search is performed on the obtained 3D key points; based on ergonomic data, the 3D key points should satisfy typical ergonomic bone lengths and properties; the search range is narrowed according to typical ergonomic values and further narrowed using the angle A, so that the judgment of the motion on an individual conforms to typical ergonomic angles;
step 5.3, the spatial bend angles of the 3D key points are calculated from the ergonomic data combined with the actual value of angle A;
the ergonomic data includes the range of motion of each limb and the limb's most natural (preferred) motion pattern; the most natural angle in the search space is compared with the classified motions, and the motion is classified.
7. The convolutional neural network-based high frame rate 3D posture recognition method according to claim 6, wherein the range of motion of a limb includes the possible angle range of the elbow, knee, hip, ankle, wrist, neck and shoulder from fully extended to fully flexed; the most natural motion pattern of a limb means that, given the 3D spatial coordinates of the limb, its most comfortable spatial coordinates are unique.
8. The method according to claim 3, wherein in step 3.1, if the cosine distance is greater than or equal to a set threshold, the features of adjacent frames are considered to differ significantly, the two images are considered too different, and they are discarded; if the cosine distance is less than the set threshold, the features of adjacent frames are considered close.
CN202010856917.7A 2020-08-24 2020-08-24 High frame rate 3D (three-dimensional) posture recognition method based on convolutional neural network Pending CN112329513A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010856917.7A CN112329513A (en) 2020-08-24 2020-08-24 High frame rate 3D (three-dimensional) posture recognition method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010856917.7A CN112329513A (en) 2020-08-24 2020-08-24 High frame rate 3D (three-dimensional) posture recognition method based on convolutional neural network

Publications (1)

Publication Number Publication Date
CN112329513A 2021-02-05

Family

ID=74303881

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010856917.7A Pending CN112329513A (en) 2020-08-24 2020-08-24 High frame rate 3D (three-dimensional) posture recognition method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN112329513A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114036969A (en) * 2021-03-16 2022-02-11 上海大学 3D human body action recognition algorithm under multi-view condition
CN112990089A (en) * 2021-04-08 2021-06-18 重庆大学 Method for judging human motion posture
CN112990089B (en) * 2021-04-08 2023-09-26 重庆大学 Method for judging human motion gesture
CN113128436A (en) * 2021-04-27 2021-07-16 北京百度网讯科技有限公司 Method and device for detecting key points
CN113128436B (en) * 2021-04-27 2022-04-01 北京百度网讯科技有限公司 Method and device for detecting key points
TWI797916B (en) * 2021-12-27 2023-04-01 博晶醫電股份有限公司 Human body detection method, human body detection device, and computer readable storage medium
CN116225031A (en) * 2023-05-09 2023-06-06 南京泛美利机器人科技有限公司 Three-body cooperative intelligent obstacle avoidance method and system for man-machine cooperative scene
CN116225031B (en) * 2023-05-09 2024-05-28 南京泛美利机器人科技有限公司 Three-body cooperative intelligent obstacle avoidance method and system for man-machine cooperative scene


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination