Disclosure of Invention
The embodiments of the invention provide a human body motion recognition method and device based on a neural network, which address two shortcomings of existing RGB-image-based real-time human motion capture methods: difficulty in supporting application scenarios of real-time motion migration, and low motion capture accuracy.
In a first aspect, an embodiment of the present invention provides a human body motion recognition method based on a neural network, including:
preprocessing an RGB-D image of the human body action to be recognized to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference;
inputting the RGB image without background pixel interference and the point cloud three-dimensional coordinate graph without background pixel interference into a posture parameter recognition model, and outputting posture parameters, morphological parameters and displacement parameters of the human body action to be recognized;
the posture parameter recognition model is obtained by training on sample RGB images without background pixel interference and sample point cloud three-dimensional coordinate graphs without background pixel interference, together with a predetermined 3D key point coordinate label and a predetermined three-dimensional point cloud label corresponding to each sample's RGB image without background pixel interference; the loss function used in training the posture parameter recognition model is formed on the basis of key point loss, smoothing loss and point cloud loss;
inputting the posture parameters, the morphological parameters and the displacement parameters of the human body action to be recognized into a parameterized model, and outputting the human body action result to be recognized.
Preferably, in the method,
the predetermined 3D key point coordinate label corresponding to each sample's RGB image without background pixel interference is obtained by inputting the sample's RGB image without background pixel interference into a labeling algorithm to obtain 2D key point coordinates, and then converting the 2D key point coordinates into 3D key point coordinates;
correspondingly, the predetermined three-dimensional point cloud label corresponding to each sample's RGB image without background pixel interference is obtained by converting the depth image corresponding to that RGB image into a three-dimensional point cloud based on the internal parameters of the camera that acquired the image.
Preferably, in the method,
the labeling algorithm is the AlphaPose 2D key point detection algorithm.
Preferably, in the method, the preprocessing is performed on the RGB-D image of the human body motion to be recognized, so as to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate map without background pixel interference, and specifically includes:
the RGB-D image of the human body action to be recognized comprises a color image, a depth image and a human body mask image;
masking the color image by using the human body mask image to obtain an RGB image without background pixel interference;
and converting the color image and the depth image, based on the internal parameters of the depth camera that collected the RGB-D image, to obtain a point cloud three-dimensional coordinate graph without background pixel interference.
Preferably, in the method, the loss function used in training the posture parameter recognition model is formed based on the key point loss, the smoothing loss and the point cloud loss, and specifically includes:
the loss function during training of the posture parameter recognition model is L_Total = λ1·L_2D + λ2·L_3D + λ3·L_Point + λ4·L_Smooth, where L_2D is the 2D key point loss, L_3D is the 3D key point loss, L_Point is the point cloud loss, L_Smooth is the smoothing loss, and λi (i = 1, 2, 3, 4) is the weight of the corresponding loss term.
Preferably, in the method,
the 2D key point loss L_2D is calculated from p_gt and p_l, where p_gt is the reference standard 2D key point information set corresponding to each sample's RGB image without background pixel interference, obtained by inputting that RGB image into the labeling algorithm, and p_l is the 2D key point information set for each sample's RGB image without background pixel interference as predicted by the neural network during training of the posture parameter recognition model;
the 3D key point loss L_3D is calculated from p_gt2, p_j and v, where p_gt2 is the reference standard 3D key point information set corresponding to each sample's RGB image without background pixel interference, obtained by converting the reference standard 2D key points produced by the labeling algorithm into reference standard 3D key point coordinates, p_j is the 3D key point information set for each sample's RGB image without background pixel interference as predicted by the neural network during training of the posture parameter recognition model, and v is a binary indicator vector composed of 0s and 1s describing self-occlusion of the human body;
the point cloud loss L_Point is calculated from V_gt, V_pred, n_gt and w, where V_gt is the set of mesh grid points corresponding to each three-dimensional point cloud label, V_pred is the predicted point cloud three-dimensional coordinate set without background pixel interference, n_gt is the set of normals at the grid points, and w is a binary indicator vector composed of 0s and 1s whose i-th element is 1 if a corresponding point can be found on the depth map without background pixel interference and 0 otherwise, i being a positive integer;
the smoothing loss L_Smooth is calculated from R_pre, T_pre, R_cur and T_cur, where R_pre and T_pre are the rotation and translation parameters predicted by the neural network during training for the previous frame sample of the same human body action without background pixel interference, and R_cur and T_cur are the rotation and translation parameters predicted for the current frame sample of the same human body action without background pixel interference.
In a second aspect, an embodiment of the present invention provides a human body motion recognition apparatus based on a neural network, including:
the preprocessing unit is used for preprocessing the RGB-D image of the human body action to be recognized to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference;
the recognition unit is used for inputting the RGB image without background pixel interference and the point cloud three-dimensional coordinate graph without background pixel interference into a posture parameter recognition model and outputting a posture parameter, a morphological parameter and a displacement parameter of the human body action to be recognized;
the posture parameter recognition model is obtained by training on sample RGB images without background pixel interference and sample point cloud three-dimensional coordinate graphs without background pixel interference, together with a predetermined 3D key point coordinate label and a predetermined three-dimensional point cloud label corresponding to each sample's RGB image without background pixel interference; the loss function used in training the posture parameter recognition model is formed on the basis of key point loss, smoothing loss and point cloud loss;
and the action unit is used for inputting the posture parameters, the morphological parameters and the displacement parameters of the human body action to be recognized into a parameterized model, and outputting the human body action result to be recognized.
Preferably, in the apparatus,
the predetermined 3D key point coordinate label corresponding to each sample's RGB image without background pixel interference is obtained by inputting the sample's RGB image without background pixel interference into a labeling algorithm to obtain 2D key point coordinates, and then converting the 2D key point coordinates into 3D key point coordinates;
correspondingly, the predetermined three-dimensional point cloud label corresponding to each sample's RGB image without background pixel interference is obtained by converting the depth image corresponding to that RGB image into a three-dimensional point cloud based on the internal parameters of the camera that acquired the image.
In a third aspect, an embodiment of the present invention provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the neural network-based human body motion recognition method according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the neural network-based human body motion recognition method provided in the first aspect.
The method and device provided by the embodiments of the invention first preprocess the RGB-D image of the human body action to be recognized to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference; these are then input into a posture parameter recognition model, which outputs the posture parameters, morphological parameters and displacement parameters of the human body action to be recognized; finally, these parameters are input into a parameterized model, which outputs the human body action result to be recognized. The posture parameter recognition model is trained on sample RGB images without background pixel interference and sample point cloud three-dimensional coordinate graphs without background pixel interference, together with the predetermined 3D key point coordinate label and three-dimensional point cloud label corresponding to each sample's RGB image without background pixel interference, and the loss function used in training is formed on the basis of key point loss, smoothing loss and point cloud loss. Training the posture parameter recognition model on a large number of sample images and point clouds in a deep learning manner guarantees the accuracy of the model and thus the accuracy of human body action recognition; taking key point loss, smoothing loss and point cloud loss into account when constructing the loss function further safeguards that accuracy; and using the trained model for real-time human body action recognition from RGB-D images reduces the complexity of the whole recognition process and ensures real-time performance. The method and device provided by the embodiments of the invention therefore support application scenarios of real-time action migration and improve the accuracy of action recognition.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
The existing methods for capturing human body actions in real time based on RGB images generally struggle to support application scenarios of real-time action migration and suffer from low motion capture accuracy. Therefore, an embodiment of the invention provides a human body motion recognition method based on a neural network. Fig. 1 is a schematic flow chart of the human body motion recognition method based on a neural network according to an embodiment of the present invention; as shown in Fig. 1, the method includes:
and 110, preprocessing the RGB-D image of the human body action to be recognized to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference.
Specifically, the RGB image of the human body action to be recognized is first subjected to background segmentation preprocessing, that is, the background behind the foreground human body is removed. Two background segmentation methods can be used. The first crops out the foreground human body based on human body key points: key points of the human body in the RGB image are identified, including the left ankle, right ankle, left knee, right knee, left hip edge, right hip edge, left waist, right waist, left elbow, right elbow, left shoulder, right shoulder, nose, left ear and right ear; the bounding box around these key points is expanded by a certain number of pixels to obtain a cropping frame, the background pixels outside it are cropped away, and the background behind the foreground human body is thereby removed. The second method uses the RGB-D camera directly: after the RGB-D image is acquired, the SDK of the camera is called to obtain a color image, a depth image and a human body mask image, and the color image is then masked with the human body mask image to obtain an RGB image without background pixel interference. Either segmentation method may be adopted; no particular limitation is imposed here. Then, based on the RGB image without background pixel interference and the depth image, a point cloud three-dimensional coordinate graph without background pixel interference can be obtained, as sketched below.
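As a concrete illustration of this preprocessing, the following minimal Python sketch applies the human body mask and back-projects the depth image through a pinhole camera model; the function names and the intrinsics layout (fx, fy, cx, cy) are assumptions for illustration only, not prescribed by the embodiment.

```python
import numpy as np

def remove_background(color, mask):
    """Suppress background pixels of the color image using the human body mask
    (mask is nonzero for human pixels, zero for background)."""
    rgb = color.copy()
    rgb[mask == 0] = 0
    return rgb

def depth_to_point_cloud_map(depth, mask, fx, fy, cx, cy):
    """Back-project each masked depth pixel into camera coordinates with the
    pinhole model: X = (u - cx) * Z / fx, Y = (v - cy) * Z / fy, Z = depth."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float32)
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    cloud = np.stack([x, y, z], axis=-1)  # H x W x 3 point cloud coordinate map
    cloud[mask == 0] = 0                  # discard background points
    return cloud
```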
Step 120, inputting the RGB image without background pixel interference and the point cloud three-dimensional coordinate graph without background pixel interference into a posture parameter recognition model, and outputting the posture parameters, morphological parameters and displacement parameters of the human body action to be recognized;
the attitude parameter identification model is obtained by training a RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference based on samples and a predetermined 3D key point coordinate label and three-dimensional point cloud label corresponding to the RGB image without background pixel interference of each sample, and a loss function during training of the attitude parameter identification model is formed on the basis of key point loss, smooth loss and point cloud loss.
Specifically, the RGB image without background pixel interference and the point cloud three-dimensional coordinate graph without background pixel interference obtained by preprocessing the RGB-D image of the human body action to be recognized are input into the pre-trained posture parameter recognition model, and the model outputs the posture parameters, morphological parameters and displacement parameters of the human body action to be recognized. The posture parameter recognition model is trained on sample RGB images without background pixel interference and sample point cloud three-dimensional coordinate graphs without background pixel interference, together with the predetermined 3D key point coordinate label and three-dimensional point cloud label corresponding to each sample's RGB image without background pixel interference. The training samples were formed as follows. RGB-D data were collected with a depth camera in an ordinary indoor scene, and data frames comprising a color image, a depth image and a human body mask image were extracted: 310 people were captured performing actions such as twisting, bowing, kicking, running, jumping, free walking and punching, yielding 508,170 sample RGB-D images. RGB-D data were then collected with the depth camera in a laboratory scene: 32 people were captured performing actions such as twisting, rotating in place, standing and bending forward, large-angle punching, large-angle kicking, leg pressing, playing basketball, playing football, bowling, playing billiards, bowing, weightlifting, playing baseball, playing volleyball and playing tennis, yielding 367,528 sample RGB-D images. The RGB-D preprocessing of step 110 was applied to all sample RGB-D images to obtain the corresponding sample RGB images without background pixel interference and point cloud three-dimensional coordinate graphs without background pixel interference. The labels for each sample are the 3D key point coordinate label and the three-dimensional point cloud label corresponding to that sample's RGB image without background pixel interference; they are obtained by applying a third-party labeling algorithm to the original RGB image corresponding to the sample's RGB image without background pixel interference. The third-party labeling algorithm may be any commonly used algorithm with high key point extraction accuracy and is not specifically limited here.
The loss function used in training the posture parameter recognition model is further defined to be formed from key point loss, smoothing loss and point cloud loss. It considers the difference between the key points predicted by the neural network and the predetermined key point labels, and the difference between the point cloud predicted by the neural network and the predetermined three-dimensional point cloud labels; it also includes a smoothing loss, namely a penalty on the translation and rotation between the preceding and following frames of the same human body action, which prevents jitter. The outputs of the model are the posture parameters, morphological parameters and displacement parameters of the human body action. These parameters are equivalent to the 3D key point coordinates and three-dimensional point cloud used as training labels: the posture, morphological and displacement parameters can be obtained from the human body's 3D key point coordinates and three-dimensional point cloud by an equivalent transformation, and the two parameterizations describe the human body action equally well.
Step 130, inputting the posture parameters, the morphological parameters and the displacement parameters of the human body action to be recognized into a parameterized model, and outputting the human body action result to be recognized.
Specifically, the posture parameters, morphological parameters and displacement parameters output by the posture parameter recognition model are input into a pre-established parameterized model, denoted the G model; the G model constructs the human body action from these parameters and outputs the human body action result to be recognized. The whole pipeline is sketched below.
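Taken together, steps 110 to 130 amount to the following inference sketch; pose_model stands for the trained posture parameter recognition model and G for the parameterized model, both placeholders here rather than a concrete API.

```python
def recognize_human_motion(color, depth, mask, pose_model, G, intrinsics):
    """Schematic end-to-end flow of steps 110-130 (uses the helpers sketched above)."""
    rgb = remove_background(color, mask)                        # step 110: segment background
    cloud = depth_to_point_cloud_map(depth, mask, *intrinsics)
    pose, shape, trans = pose_model(rgb, cloud)                 # step 120: predict parameters
    return G(pose, shape, trans)                                # step 130: reconstruct the action
```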
The method provided by the embodiment of the invention first preprocesses the RGB-D image of the human body action to be recognized to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference; these are then input into a posture parameter recognition model, which outputs the posture parameters, morphological parameters and displacement parameters of the human body action to be recognized; finally, these parameters are input into a parameterized model, which outputs the human body action result to be recognized. The posture parameter recognition model is trained on sample RGB images without background pixel interference and sample point cloud three-dimensional coordinate graphs without background pixel interference, together with the predetermined 3D key point coordinate label and three-dimensional point cloud label corresponding to each sample's RGB image without background pixel interference, and the loss function used in training is formed on the basis of key point loss, smoothing loss and point cloud loss. Training the posture parameter recognition model on a large number of sample images and point clouds in a deep learning manner guarantees the accuracy of the model and thus the accuracy of human body action recognition; taking key point loss, smoothing loss and point cloud loss into account when constructing the loss function further safeguards that accuracy; and using the trained model for real-time human body action recognition from RGB-D images reduces the complexity of the whole recognition process and ensures real-time performance. The method provided by the embodiment of the invention therefore supports application scenarios of real-time action migration and improves the accuracy of action recognition.
Based on the above-described embodiments, in this method,
the predetermined 3D key point coordinate label corresponding to each sample's RGB image without background pixel interference is obtained by inputting the sample's RGB image without background pixel interference into a labeling algorithm to obtain 2D key point coordinates, and then converting the 2D key point coordinates into 3D key point coordinates;
correspondingly, the predetermined three-dimensional point cloud label corresponding to each sample's RGB image without background pixel interference is obtained by converting the depth image corresponding to that RGB image into a three-dimensional point cloud based on the internal parameters of the camera that acquired the image.
Specifically, to determine the 3D key point coordinate label and the three-dimensional point cloud label corresponding to a sample's RGB image without background pixel interference, a labeling algorithm is first used to extract the 2D key point coordinates from the sample's RGB image without background pixel interference, and the 2D key point coordinates are then converted into 3D key point coordinates. The labeling algorithm may be any commonly used third-party algorithm with high 2D key point extraction accuracy and is not specifically limited here; converting 2D key point coordinates into 3D key point coordinates only requires the internal parameters of the camera that acquired the original image, as illustrated below. The depth image corresponding to the sample's RGB image without background pixel interference is then converted into a three-dimensional point cloud based on the same camera intrinsics, giving the three-dimensional point cloud label. The key points are those of the human body, namely the left ankle, right ankle, left knee, right knee, left hip edge, right hip edge, left waist, right waist, left elbow, right elbow, left shoulder, right shoulder, nose, left ear and right ear.
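The conversion from a 2D key point to its 3D label point only samples the aligned depth map and applies the camera intrinsics; a minimal sketch under the same pinhole assumption as above:

```python
def lift_keypoint_to_3d(u, v, depth, fx, fy, cx, cy):
    """Back-project a detected 2D key point (u, v) into a 3D coordinate label
    using the depth value at that pixel."""
    d = float(depth[int(round(v)), int(round(u))])  # depth at the key point pixel
    return ((u - cx) * d / fx, (v - cy) * d / fy, d)
```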
Based on any of the above embodiments, in the method, the labeling algorithm is the AlphaPose 2D key point detection algorithm.
Specifically, the labeling algorithm adopted is limited to the AlphaPose 2D key point detection algorithm, which is currently among the most accurate algorithms for extracting 2D key points; key points extracted with it are often used as the reference standard (ground truth).
Based on any of the above embodiments, in the method, the preprocessing is performed on the RGB-D image of the human body motion to be recognized, so as to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference, and specifically includes:
the RGB-D image of the human body action to be recognized comprises a color image, a depth image and a human body mask image;
masking the color image by using the human body mask image to obtain an RGB image without background pixel interference;
and converting the color image and the depth image, based on the internal parameters of the depth camera that collected the RGB-D image, to obtain a point cloud three-dimensional coordinate graph without background pixel interference.
Specifically, a concrete method is defined here for the background segmentation preprocessing of the RGB-D image of the human body action to be recognized, namely human body mask processing, whose flow is as follows. After data are collected by an RGB-D camera, the SDK of the camera is called to obtain the color image, the depth image and the human body mask image of the RGB-D image of the human body action to be recognized; the color image is then masked with the human body mask image, that is, the R, G and B values of invalid (background) pixels are all set to -255 using the human body mask. Preferably, the mask edge can be dilated by a certain number of pixels to obtain a slightly larger mask, which reduces the probability of losing key points when the action is very fast. Because the masked image contains no background pixel interference, and because the depth camera's mask is derived from depth information, this approach is more robust for scenes in which the human body's colors resemble the background. Compared with identifying key points and cropping the human body with an enlarged key point bounding box, separating the human body from the background with a human body mask is advantageous; the comparison results are shown in Table 1, which compares the key point accuracy when the human body is separated from the color map without a mask and with a mask:
TABLE 1 Comparison of key point accuracy when the human body is separated from the color map without a mask vs. with a mask

| Key point | Color map without mask | Color map with mask |
| --- | --- | --- |
| LAnkle | 0.9459 | 0.9560 |
| RAnkle | 0.9442 | 0.9514 |
| LKnee | 0.9509 | 0.9621 |
| RKnee | 0.9446 | 0.9567 |
| LHip | 0.9578 | 0.9738 |
| RHip | 0.9621 | 0.9749 |
| LWrist | 0.8330 | 0.8763 |
| RWrist | 0.8214 | 0.8620 |
| LElbow | 0.9234 | 0.9543 |
| RElbow | 0.9095 | 0.9515 |
| LShoulder | 0.9724 | 0.9844 |
| RShoulder | 0.9625 | 0.9861 |
| Nose | 0.9759 | 0.9879 |
| LEar | 0.9816 | 0.9907 |
| REar | 0.9823 | 0.9901 |
| Total | 0.9378 | 0.9523 |
The values for the respective body parts in Table 1 were computed with the PCK@0.2 (Percentage of Correct Keypoints) metric, i.e., the proportion of detected key points whose Euclidean distance to the reference standard (ground truth) is less than 0.2 × the torso diameter. The comparison in Table 1 shows that separating the human body with the color-map mask is more favorable to the accuracy of subsequent key point detection than separating it without the mask. The RGB-D preprocessing adopted by the embodiment of the invention therefore uses the human body mask: the background-free human body image it extracts is more conducive to subsequently extracting the human body's key points from the image and describing the human body's action.
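For reference, the PCK@0.2 score used in Table 1 can be computed as follows; the array shapes and the torso-diameter argument are assumptions matching the convention just described.

```python
import numpy as np

def pck_at_0_2(pred, gt, torso_diameter):
    """Fraction of key points whose Euclidean distance to the ground truth
    is below 0.2 x torso diameter (PCK@0.2); pred and gt are N x 2 arrays."""
    dist = np.linalg.norm(pred - gt, axis=-1)  # per-keypoint error
    return float(np.mean(dist < 0.2 * torso_diameter))
```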
Based on any one of the above embodiments, in the method, the loss function used in training the posture parameter recognition model is formed based on the key point loss, the smoothing loss and the point cloud loss, and specifically includes:
the loss function during training of the posture parameter recognition model is L_Total = λ1·L_2D + λ2·L_3D + λ3·L_Point + λ4·L_Smooth, where L_2D is the 2D key point loss, L_3D is the 3D key point loss, L_Point is the point cloud loss, L_Smooth is the smoothing loss, and λi (i = 1, 2, 3, 4) is the weight of the corresponding loss term.
Specifically, the loss function used in training the posture parameter recognition model considers the key point loss, the smoothing loss and the point cloud loss, where the key point loss consists of the 2D key point loss and the 3D key point loss. The loss function is L_Total = λ1·L_2D + λ2·L_3D + λ3·L_Point + λ4·L_Smooth, where L_2D is the 2D key point loss, L_3D is the 3D key point loss, L_Point is the point cloud loss, L_Smooth is the smoothing loss, λ1 is the weight of the 2D key point loss term, λ2 the weight of the 3D key point loss term, λ3 the weight of the point cloud loss term, and λ4 the weight of the smoothing loss term. A sketch of this weighted combination follows.
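In training code this combination is a single weighted sum; a minimal sketch, with placeholder weights rather than the values used in the embodiment:

```python
def total_loss(l_2d, l_3d, l_point, l_smooth, lambdas=(1.0, 1.0, 1.0, 1.0)):
    """L_Total = λ1·L_2D + λ2·L_3D + λ3·L_Point + λ4·L_Smooth."""
    l1, l2, l3, l4 = lambdas
    return l1 * l_2d + l2 * l_3d + l3 * l_point + l4 * l_smooth
```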
In accordance with any of the above embodiments, in the method,
the 2D keypoint loss L2DCalculated by the following formula:
wherein p isgtThe method comprises the steps that a reference standard 2D key point information set corresponding to an RGB image without background pixel interference of each sample is obtained by inputting the RGB image without background pixel interference of each sample into an annotation algorithm; p is a radical oflThe 2D key point information set of the RGB image of each sample without background pixel interference is predicted by the neural network during the training of the attitude parameter recognition model;
the 3D keypoint loss L3DCalculated by the following formula:
wherein p isgt2The method comprises the steps that a reference standard 3D key point information set corresponding to an RGB image without background pixel interference of each sample is obtained by converting a reference standard 2D key point obtained by inputting the RGB image without background pixel interference of each sample into a labeling algorithm into a reference standard 3D key point coordinate; p is a radical ofjThe 3D key point information set of the RGB image of each sample without background pixel interference is predicted by the neural network during the training of the attitude parameter recognition model; v is a one-hot vector composed of 0 and 1, and is used for describing the self-shielding of the human body;
the point cloud loss LPointCalculated by the following formula:
wherein, VgtIs a set of grid points, V, corresponding to each three-dimensional point cloud labelpredIs a point cloud three-dimensional coordinate set without background pixel interference, ngtFor the normal set of grid points, w is a one-hot vector consisting of 0 and 1, if one can find the corresponding to the R depth map without background pixel interferenceIf the ith element of the one-hot vector is a corresponding point of the ith element, the ith element is 1, otherwise, the ith element is 0, wherein i is a positive integer;
the smoothing loss LSmoothCalculated by the following formula:
wherein R ispreA rotation parameter, T, of an RGB image predicted by a neural network during the training of the attitude parameter recognition model and corresponding to a previous frame sample of the same human body action without background pixel interferencepreTranslation parameters, R, of RGB images predicted by the neural network during the training of the attitude parameter recognition model and corresponding to the same human body action and without background pixel interference of the previous frame samplecurA rotation parameter T of the RGB image predicted by the neural network during the training of the attitude parameter recognition model and corresponding to the same human body action without background pixel interferencecurAnd the translation parameters of the RGB image are predicted by the neural network during the training of the attitude parameter recognition model, and are corresponding to the previous frame of sample of the same human body action without background pixel interference.
Specifically, the 2D key point loss L_2D is calculated from p_gt and p_l. Here p_gt is the reference standard 2D key point information set corresponding to each sample's RGB image without background pixel interference, obtained by inputting that RGB image into the labeling algorithm; the labeling algorithm may be any commonly used third-party algorithm with high 2D key point extraction accuracy, preferably the AlphaPose 2D key point detection algorithm. p_l is the 2D key point information set for each sample's RGB image without background pixel interference as predicted by the neural network during training of the posture parameter recognition model.
The 3D key point loss L_3D is calculated from p_gt2, p_j and v. Here p_gt2 is the reference standard 3D key point information set corresponding to each sample's RGB image without background pixel interference, obtained by converting the reference standard 2D key points produced by the labeling algorithm (preferably AlphaPose) into reference standard 3D key point coordinates; this conversion only requires the internal parameters of the camera that acquired the original image. p_j is the 3D key point information set for each sample's RGB image without background pixel interference as predicted by the neural network during training of the posture parameter recognition model. v is a binary indicator vector composed of 0s and 1s that describes self-occlusion of the human body: the i-th element of v is 0 if the corresponding human body key point is occluded and 1 if it is not.
The point cloud loss L_Point is calculated from V_gt, V_pred, n_gt and w. Here V_gt is the set of mesh grid points corresponding to each three-dimensional point cloud label, obtained by meshing each three-dimensional point cloud; V_pred is the predicted point cloud three-dimensional coordinate set without background pixel interference; n_gt is the set of normals at the grid points; and w is a binary indicator vector composed of 0s and 1s whose i-th element is 1 if a corresponding point can be found on the depth map without background pixel interference and 0 otherwise, i being a positive integer.
The smoothing loss L_Smooth is calculated from R_pre, T_pre, R_cur and T_cur. In sample training, several consecutive frames of the same human body action are usually used as samples: R_pre and T_pre are the rotation and translation parameters predicted by the neural network during training for the previous frame sample of the same human body action without background pixel interference, and R_cur and T_cur are the rotation and translation parameters predicted for the current frame sample of the same human body action without background pixel interference.
Based on any of the above embodiments, an embodiment of the present invention provides a human body motion recognition device based on a neural network. Fig. 2 is a schematic structural diagram of the human body motion recognition device based on a neural network provided in the embodiment of the present invention. As shown in Fig. 2, the device includes a preprocessing unit 210, a recognition unit 220 and an action unit 230, wherein
the preprocessing unit 210 is configured to preprocess an RGB-D image of a human body motion to be recognized, so as to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate map without background pixel interference;
the recognition unit 220 is configured to input the RGB image without background pixel interference and the point cloud three-dimensional coordinate map without background pixel interference into an attitude parameter recognition model, and output an attitude parameter, a morphological parameter, and a displacement parameter of the human body motion to be recognized;
the gesture parameter recognition model is obtained by training a RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference based on samples and a predetermined 3D key point coordinate label and three-dimensional point cloud label corresponding to the RGB image without background pixel interference of each sample, and a loss function during training of the gesture parameter recognition model is formed on the basis of key point loss, smooth loss and point cloud loss;
the action unit 230 is configured to input the posture parameter, the form parameter, and the displacement parameter of the human body action to be recognized into a parameterized model, and output a result of the human body action to be recognized.
The device provided by the embodiment of the invention preprocesses the RGB-D image of the human body action to be recognized to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference; these are then input into a posture parameter recognition model, which outputs the posture parameters, morphological parameters and displacement parameters of the human body action to be recognized; finally, these parameters are input into a parameterized model, which outputs the human body action result to be recognized. The posture parameter recognition model is trained on sample RGB images without background pixel interference and sample point cloud three-dimensional coordinate graphs without background pixel interference, together with the predetermined 3D key point coordinate label and three-dimensional point cloud label corresponding to each sample's RGB image without background pixel interference, and the loss function used in training is formed on the basis of key point loss, smoothing loss and point cloud loss. Training the posture parameter recognition model on a large number of sample images and point clouds in a deep learning manner guarantees the accuracy of the model and thus the accuracy of human body action recognition; taking key point loss, smoothing loss and point cloud loss into account when constructing the loss function further safeguards that accuracy; and using the trained model for real-time human body action recognition from RGB-D images reduces the complexity of the whole recognition process and ensures real-time performance. The device provided by the embodiment of the invention therefore supports application scenarios of real-time action migration and improves the accuracy of action recognition.
In accordance with any of the above embodiments, in the apparatus,
the predetermined 3D key point coordinate label corresponding to each sample's RGB image without background pixel interference is obtained by inputting the sample's RGB image without background pixel interference into a labeling algorithm to obtain 2D key point coordinates, and then converting the 2D key point coordinates into 3D key point coordinates;
correspondingly, the predetermined three-dimensional point cloud label corresponding to each sample's RGB image without background pixel interference is obtained by converting the depth image corresponding to that RGB image into a three-dimensional point cloud based on the internal parameters of the camera that acquired the image.
In accordance with any of the above embodiments, in the apparatus,
the labeling algorithm is the AlphaPose 2D key point detection algorithm.
Based on any one of the above embodiments, in the apparatus, the preprocessing unit is specifically configured to:
the RGB-D image of the human body action to be recognized comprises a color image, a depth image and a human body mask image;
masking the color image by using the human body mask image to obtain an RGB image without background pixel interference;
and converting the color image and the depth image, based on the internal parameters of the depth camera that collected the RGB-D image, to obtain a point cloud three-dimensional coordinate graph without background pixel interference.
Based on any one of the above embodiments, in the apparatus, the loss function used in training the posture parameter recognition model is formed based on the key point loss, the smoothing loss and the point cloud loss, and specifically includes:
the loss function during training of the posture parameter recognition model is L_Total = λ1·L_2D + λ2·L_3D + λ3·L_Point + λ4·L_Smooth, where L_2D is the 2D key point loss, L_3D is the 3D key point loss, L_Point is the point cloud loss, L_Smooth is the smoothing loss, and λi (i = 1, 2, 3, 4) is the weight of the corresponding loss term.
In accordance with any of the above embodiments, in the apparatus,
the 2D key point loss L_2D is calculated from p_gt and p_l, where p_gt is the reference standard 2D key point information set corresponding to each sample's RGB image without background pixel interference, obtained by inputting that RGB image into the labeling algorithm, and p_l is the 2D key point information set for each sample's RGB image without background pixel interference as predicted by the neural network during training of the posture parameter recognition model;
the 3D key point loss L_3D is calculated from p_gt2, p_j and v, where p_gt2 is the reference standard 3D key point information set corresponding to each sample's RGB image without background pixel interference, obtained by converting the reference standard 2D key points produced by the labeling algorithm into reference standard 3D key point coordinates, p_j is the 3D key point information set for each sample's RGB image without background pixel interference as predicted by the neural network during training of the posture parameter recognition model, and v is a binary indicator vector composed of 0s and 1s describing self-occlusion of the human body;
the point cloud loss L_Point is calculated from V_gt, V_pred, n_gt and w, where V_gt is the set of mesh grid points corresponding to each three-dimensional point cloud label, V_pred is the predicted point cloud three-dimensional coordinate set without background pixel interference, n_gt is the set of normals at the grid points, and w is a binary indicator vector composed of 0s and 1s whose i-th element is 1 if a corresponding point can be found on the depth map without background pixel interference and 0 otherwise, i being a positive integer;
the smoothing loss L_Smooth is calculated from R_pre, T_pre, R_cur and T_cur, where R_pre and T_pre are the rotation and translation parameters predicted by the neural network during training for the previous frame sample of the same human body action without background pixel interference, and R_cur and T_cur are the rotation and translation parameters predicted for the current frame sample of the same human body action without background pixel interference.
Fig. 3 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention. As shown in Fig. 3, the electronic device may include: a processor 301, a communication interface 302, a memory 303 and a communication bus 304, wherein the processor 301, the communication interface 302 and the memory 303 communicate with one another through the communication bus 304. The processor 301 may call a computer program stored in the memory 303 and executable on the processor 301 to execute the neural network-based human body motion recognition method provided by the above embodiments, for example including: preprocessing an RGB-D image of the human body action to be recognized to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference; inputting the RGB image without background pixel interference and the point cloud three-dimensional coordinate graph without background pixel interference into a posture parameter recognition model, and outputting the posture parameters, morphological parameters and displacement parameters of the human body action to be recognized, wherein the posture parameter recognition model is obtained by training on sample RGB images without background pixel interference and sample point cloud three-dimensional coordinate graphs without background pixel interference, together with a predetermined 3D key point coordinate label and a predetermined three-dimensional point cloud label corresponding to each sample's RGB image without background pixel interference, and the loss function used in training the posture parameter recognition model is formed on the basis of key point loss, smoothing loss and point cloud loss; and inputting the posture parameters, the morphological parameters and the displacement parameters of the human body action to be recognized into a parameterized model, and outputting the human body action result to be recognized.
In addition, the logic instructions in the memory 303 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present invention may, in essence or in the part contributing to the prior art, be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disk.
An embodiment of the present invention further provides a non-transitory computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the neural network-based human body motion recognition method provided by the foregoing embodiments, for example including: preprocessing an RGB-D image of the human body action to be recognized to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference; inputting the RGB image without background pixel interference and the point cloud three-dimensional coordinate graph without background pixel interference into a posture parameter recognition model, and outputting the posture parameters, morphological parameters and displacement parameters of the human body action to be recognized, wherein the posture parameter recognition model is obtained by training on sample RGB images without background pixel interference and sample point cloud three-dimensional coordinate graphs without background pixel interference, together with a predetermined 3D key point coordinate label and a predetermined three-dimensional point cloud label corresponding to each sample's RGB image without background pixel interference, and the loss function used in training the posture parameter recognition model is formed on the basis of key point loss, smoothing loss and point cloud loss; and inputting the posture parameters, the morphological parameters and the displacement parameters of the human body action to be recognized into a parameterized model, and outputting the human body action result to be recognized.
The device embodiments described above are merely illustrative; the units described as separate parts may or may not be physically separate, and parts shown as units may or may not be physical units, i.e., they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. Persons of ordinary skill in the art can understand and implement the embodiments without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.