CN111723687A - Human body action recognition method and device based on neural network - Google Patents

Human body action recognition method and device based on neural network

Info

Publication number
CN111723687A
CN111723687A (application CN202010490878.3A)
Authority
CN
China
Prior art keywords
background pixel
human body
pixel interference
image
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010490878.3A
Other languages
Chinese (zh)
Inventor
户磊
李廷照
石彪
闫祥
张举勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Dilusense Technology Co Ltd
Original Assignee
Beijing Dilusense Technology Co Ltd
Hefei Dilusense Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dilusense Technology Co Ltd, Hefei Dilusense Technology Co Ltd filed Critical Beijing Dilusense Technology Co Ltd
Priority to CN202010490878.3A priority Critical patent/CN111723687A/en
Publication of CN111723687A publication Critical patent/CN111723687A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Psychiatry (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the invention provides a human body action recognition method and a human body action recognition device based on a neural network. The method comprises the following steps: preprocessing an RGB-D image of the human body action to be recognized to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate map without background pixel interference; inputting the RGB image without background pixel interference and the point cloud three-dimensional coordinate map without background pixel interference into an attitude parameter recognition model, and outputting attitude parameters, morphological parameters and displacement parameters of the human body action to be recognized, the attitude parameter recognition model having been trained on a large number of labeled samples; and inputting the attitude parameters, morphological parameters and displacement parameters of the human body action to be recognized into a parameterized model, and outputting the recognition result of the human body action. The method and the device provided by the embodiment of the invention support application scenarios such as real-time action migration and improve the accuracy of human body action recognition.

Description

Human body action recognition method and device based on neural network
Technical Field
The invention relates to the technical field of human body action recognition, in particular to a human body action recognition method and device based on a neural network.
Background
Three-dimensional human body reconstruction and attribute recognition have long been important research directions in machine vision. Current deep-learning-based human body reconstruction work in the academic field can be roughly divided into two categories: parameterized model reconstruction and non-parameterized model reconstruction. A representative non-parameterized work is CloudWalk's DenseBody, which unfolds the human body mesh into a UV map and then regresses the UV map with a convolutional network; its advantage is that this data representation suits convolution well and yields good results. Parameterized model reconstruction is typified by Berkeley's HMR, which uses a convolutional network to extract the human model parameters Beta (describing morphology) and Theta (describing pose) directly from the image, generates a human mesh through a parameterized model (e.g., SMPL-X or VAE-based models), and then performs point-to-point regression on the mesh; this is the method currently used in industry. RGB-image-based methods have the advantages of low data acquisition difficulty, better data diversity and richer actions.
Although the hardware for human motion capture already meets the requirements, the industry still has no technical scheme for capturing human motion in real time based on a single depth camera; the related techniques in the prior art either place extremely high demands on hardware or cannot achieve real-time performance.
At present, the applicable real-time human body motion capture technologies are mainly parameterized methods and non-parameterized methods. Non-parameterized methods such as the DenseBody mentioned above have the drawback that parameterizing their output is costly, so application scenarios such as real-time motion migration are difficult to support.
Therefore, how to overcome the difficulty that existing RGB-image-based real-time human motion capture methods have in supporting application scenarios of real-time motion migration, and their limited motion capture accuracy, remains a problem to be solved by those skilled in the art.
Disclosure of Invention
The embodiment of the invention provides a human body motion recognition method and device based on a neural network, which are used for solving the problems that an application scene of real-time motion migration is difficult to support and the motion capture accuracy is low in the existing human body motion real-time capturing method based on RGB images.
In a first aspect, an embodiment of the present invention provides a human body motion recognition method based on a neural network, including:
preprocessing an RGB-D image of the human body action to be recognized to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference;
inputting the RGB image without background pixel interference and the point cloud three-dimensional coordinate graph without background pixel interference into an attitude parameter identification model, and outputting attitude parameters, morphological parameters and displacement parameters of the human body action to be identified;
wherein the attitude parameter identification model is trained on sample RGB images without background pixel interference and sample point cloud three-dimensional coordinate maps without background pixel interference, together with the predetermined 3D key point coordinate labels and three-dimensional point cloud labels corresponding to each sample's RGB image without background pixel interference, and the loss function used during training of the attitude parameter identification model is formed on the basis of key point loss, smoothing loss and point cloud loss;
inputting the posture parameters, the morphological parameters and the displacement parameters of the human body action to be recognized into a parameterized model, and outputting the human body action result to be recognized.
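Taken together, the three steps above form a short inference pipeline. The sketch below expresses it in Python; every callable and name in it is a placeholder standing in for the preprocessing routine, the trained posture parameter recognition model and the parameterized model, not an API defined by this disclosure.

```python
# Minimal sketch of the claimed pipeline; all callables here are placeholders
# supplied by the caller, not APIs defined in this disclosure.
def recognize_action(color, depth, mask, preprocess_rgbd, pose_param_net, parameterized_model):
    # Step 1: remove background pixels and build the point cloud coordinate map.
    rgb_no_bg, xyz_map = preprocess_rgbd(color, depth, mask)
    # Step 2: regress posture (pose), morphological (shape) and displacement parameters.
    pose, shape, transl = pose_param_net(rgb_no_bg, xyz_map)
    # Step 3: drive the parameterized human model to obtain the action result (a posed mesh).
    return parameterized_model(pose, shape, transl)
```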
Preferably, in the method,
the predetermined 3D key point coordinate label corresponding to the RGB image without background pixel interference of each sample is obtained by inputting the RGB image without background pixel interference of the sample into an annotation algorithm to obtain 2D key point coordinates, and then converting the 2D key point coordinates into 3D key point coordinates;
correspondingly, the predetermined three-dimensional point cloud label corresponding to the RGB image without background pixel interference of each sample is obtained by converting the depth image corresponding to the RGB image without background pixel interference of the sample into the three-dimensional point cloud based on the camera internal parameters of the acquired image.
Preferably, in the method,
the labeling algorithm is the AlphaPose 2D key point detection algorithm.
Preferably, in the method, the preprocessing is performed on the RGB-D image of the human body motion to be recognized, so as to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate map without background pixel interference, and specifically includes:
the RGB-D image of the human body action to be recognized comprises a color image, a depth image and a human body mask image;
masking the color image by using the human body mask image to obtain an RGB image without background pixel interference;
and converting the color image and the depth image, based on the internal parameters of the depth camera used to collect the RGB-D image, to obtain a point cloud three-dimensional coordinate map without background pixel interference.
Preferably, in the method, the loss function in the training of the attitude parameter recognition model is formed based on the key point loss, the smoothing loss and the point cloud loss, and specifically includes:
the loss function during training of the attitude parameter recognition model is L_Total = λ1·L_2D + λ2·L_3D + λ3·L_Point + λ4·L_Smooth, wherein L_2D is the 2D key point loss, L_3D is the 3D key point loss, L_Point is the point cloud loss, L_Smooth is the smoothing loss, and λi (i = 1, 2, 3, 4) is the weight corresponding to each loss term.
Preferably, in the method,
the 2D keypoint loss L2DIs calculated by the following formulaCalculating:
Figure BDA0002520931610000031
wherein p_gt is the reference standard 2D key point information set corresponding to each sample's RGB image without background pixel interference, obtained by inputting that RGB image into the annotation algorithm; p_l is the 2D key point information set of each sample's RGB image without background pixel interference as predicted by the neural network during training of the attitude parameter recognition model;
the 3D keypoint loss L3DCalculated by the following formula:
Figure BDA0002520931610000032
wherein p_gt2 is the reference standard 3D key point information set corresponding to each sample's RGB image without background pixel interference, obtained by converting the reference standard 2D key points produced by the annotation algorithm into reference standard 3D key point coordinates; p_j is the 3D key point information set of each sample's RGB image without background pixel interference as predicted by the neural network during training of the attitude parameter recognition model; v is a one-hot vector composed of 0 and 1, used to describe the self-occlusion of the human body;
the point cloud loss L_Point is calculated by the following formula:
L_Point = Σ_i w(i)·|n_gt(i)·(V_gt(i) − V_pred(i))|
wherein V_gt is the set of grid points corresponding to each three-dimensional point cloud label, V_pred is the point cloud three-dimensional coordinate set without background pixel interference, n_gt is the set of normals of the grid points, and w is a one-hot vector composed of 0 and 1: if a corresponding point for the i-th element can be found on the depth map without background pixel interference, the i-th element is 1, otherwise it is 0, where i is a positive integer;
the smoothing loss L_Smooth is calculated by the following formula:
L_Smooth = ‖R_cur − R_pre‖² + ‖T_cur − T_pre‖²
wherein R_pre is the rotation parameter predicted by the neural network, during training of the attitude parameter recognition model, for the previous-frame sample RGB image (without background pixel interference) of the same human body action, T_pre is the translation parameter predicted for that previous-frame sample, R_cur is the rotation parameter predicted for the current-frame sample RGB image (without background pixel interference) of the same human body action, and T_cur is the translation parameter predicted for that current-frame sample.
In a second aspect, an embodiment of the present invention provides a human body motion recognition apparatus based on a neural network, including:
the preprocessing unit is used for preprocessing the RGB-D image of the human body action to be recognized to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference;
the recognition unit is used for inputting the RGB image without background pixel interference and the point cloud three-dimensional coordinate graph without background pixel interference into a posture parameter recognition model and outputting a posture parameter, a morphological parameter and a displacement parameter of the human body action to be recognized;
wherein the posture parameter recognition model is trained on sample RGB images without background pixel interference and sample point cloud three-dimensional coordinate maps without background pixel interference, together with the predetermined 3D key point coordinate labels and three-dimensional point cloud labels corresponding to each sample's RGB image without background pixel interference, and the loss function used during training of the posture parameter recognition model is formed on the basis of key point loss, smoothing loss and point cloud loss;
and the action unit is used for inputting the posture parameters, the form parameters and the displacement parameters of the human body action to be recognized into a parameterized model and outputting a human body action result to be recognized.
Preferably, in the apparatus,
the predetermined 3D key point coordinate label corresponding to the RGB image without background pixel interference of each sample is obtained by inputting the RGB image without background pixel interference of the sample into an annotation algorithm to obtain 2D key point coordinates, and then converting the 2D key point coordinates into 3D key point coordinates;
correspondingly, the predetermined three-dimensional point cloud label corresponding to the RGB image without background pixel interference of each sample is obtained by converting the depth image corresponding to the RGB image without background pixel interference of the sample into the three-dimensional point cloud based on the camera internal parameters of the acquired image.
In a third aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the program, the processor implements the steps of the neural network-based human motion recognition method according to the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the neural network-based human body motion recognition method as provided in the first aspect.
The method and the device provided by the embodiments of the invention first preprocess the RGB-D image of the human body action to be recognized to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate map without background pixel interference, then input the RGB image and the point cloud coordinate map into an attitude parameter recognition model, which outputs the attitude parameters, morphological parameters and displacement parameters of the human body action to be recognized, and finally input these parameters into a parameterized model, which outputs the recognition result of the human body action. The attitude parameter recognition model is trained on sample RGB images without background pixel interference and sample point cloud three-dimensional coordinate maps without background pixel interference, together with the predetermined 3D key point coordinate labels and three-dimensional point cloud labels corresponding to each sample's RGB image, and the loss function used during its training is defined to be formed from key point loss, smoothing loss and point cloud loss. Training the attitude parameter recognition model by deep learning on a large number of sample images and point cloud data guarantees the accuracy of the model and hence the accuracy of human body action recognition; taking key point loss, smoothing loss and point cloud loss into account when constructing the loss function further guarantees that accuracy; and using the trained model to recognize human body actions from RGB-D images in real time reduces the complexity of the whole recognition process and guarantees real-time performance. The method and the device provided by the embodiments of the invention therefore support application scenarios such as real-time action migration and improve the accuracy of human body action recognition.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, a brief description will be given below of the drawings required for the embodiments or the technical solutions in the prior art, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
Fig. 1 is a schematic flow chart of a human body motion recognition method based on a neural network according to an embodiment of the present invention;
fig. 2 is a schematic structural diagram of a human body motion recognition apparatus based on a neural network according to an embodiment of the present invention;
fig. 3 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be obtained by a person skilled in the art without inventive effort based on the embodiments of the present invention, are within the scope of the present invention.
The existing methods for capturing human body actions in real time based on RGB images generally have the problems of poor support for real-time action migration application scenarios and low action capture accuracy. Therefore, an embodiment of the invention provides a human body action recognition method based on a neural network. Fig. 1 is a schematic flow chart of a human body motion recognition method based on a neural network according to an embodiment of the present invention; as shown in fig. 1, the method includes:
and 110, preprocessing the RGB-D image of the human body action to be recognized to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference.
Specifically, for the RGB-D image of the human body action to be recognized, background segmentation preprocessing is first performed on the RGB image, i.e. the background behind the foreground human body is removed. Two background segmentation methods can be used. The first crops the foreground human body based on human body key points: key points of the human body in the RGB image are identified, including the left ankle, right ankle, left knee, right knee, left hip, right hip, left waist, right waist, left elbow, right elbow, left shoulder, right shoulder, nose, left ear and right ear; the key point region is directly expanded by a certain number of pixels to obtain a cropping frame, the background pixels outside the frame are cut away, and the background behind the foreground human body is thus removed. The second method uses the RGB-D camera: after the RGB-D image is acquired, the SDK of the acquisition device is called to obtain a color image, a depth image and a human body mask image, and the color image is masked with the human body mask image to obtain an RGB image without background pixel interference. Either segmentation method may be adopted; no specific limitation is made here. Then, based on the RGB image without background pixel interference and the depth image, a point cloud three-dimensional coordinate map without background pixel interference can be obtained.
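As a rough illustration of the first segmentation route, the sketch below crops a padded bounding box around detected 2D key points; the margin value and the array conventions are assumptions for illustration, not values given in the disclosure.

```python
import numpy as np

def crop_around_keypoints(rgb, keypoints_2d, margin=20):
    """Crop the color image to a box around the human 2D key points.

    rgb: H x W x 3 image; keypoints_2d: N x 2 array of (x, y) pixel coordinates
    (ankles, knees, hips, waist, elbows, shoulders, nose, ears).
    margin is an assumed padding in pixels, not a value from the disclosure.
    """
    h, w = rgb.shape[:2]
    x_min = max(int(keypoints_2d[:, 0].min()) - margin, 0)
    x_max = min(int(keypoints_2d[:, 0].max()) + margin, w)
    y_min = max(int(keypoints_2d[:, 1].min()) - margin, 0)
    y_max = min(int(keypoints_2d[:, 1].max()) + margin, h)
    return rgb[y_min:y_max, x_min:x_max]
```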
Step 120, inputting the RGB image without background pixel interference and the point cloud three-dimensional coordinate graph without background pixel interference into an attitude parameter identification model, and outputting attitude parameters, morphological parameters and displacement parameters of the human body action to be identified;
the attitude parameter identification model is obtained by training a RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference based on samples and a predetermined 3D key point coordinate label and three-dimensional point cloud label corresponding to the RGB image without background pixel interference of each sample, and a loss function during training of the attitude parameter identification model is formed on the basis of key point loss, smooth loss and point cloud loss.
Specifically, the RGB image without background pixel interference and the point cloud three-dimensional coordinate map without background pixel interference, obtained by preprocessing the RGB-D image of the human body action to be recognized, are input into a pre-trained attitude parameter recognition model, and the model outputs the attitude parameters, morphological parameters and displacement parameters of the human body action to be recognized. The attitude parameter recognition model is trained on sample RGB images without background pixel interference and sample point cloud three-dimensional coordinate maps without background pixel interference, together with the predetermined 3D key point coordinate labels and three-dimensional point cloud labels corresponding to each sample's RGB image. The training samples were formed as follows: RGB-D data were collected with a depth camera in an ordinary indoor scene, and data frames comprising a color image, a depth image and a human body mask image were extracted; 310 people were captured in the indoor scene performing actions such as twisting, bowing, kicking, running, jumping, free walking and punching, giving 508170 sample RGB-D images; RGB-D data were then collected with the depth camera in a laboratory scene, where 32 people were captured performing actions such as twisting, rotating in place, standing and bending forward, large-angle punching, large-angle kicking, leg pressing, playing basketball, playing football, bowling, playing billiards, bowing, weightlifting, playing baseball, playing volleyball and playing tennis, giving 367528 sample RGB-D images. All sample RGB-D images undergo the preprocessing of step 110 to obtain the corresponding sample RGB images without background pixel interference and point cloud three-dimensional coordinate maps without background pixel interference. The labels of each sample's RGB image and point cloud coordinate map are the 3D key point coordinate label and the three-dimensional point cloud label corresponding to that sample's RGB image without background pixel interference; they are obtained by applying a third-party labeling algorithm to the original RGB image corresponding to the sample's RGB image without background pixel interference. The third-party labeling algorithm may be any commonly used algorithm with high accuracy for extracting key points and is not specifically limited here.
Furthermore, the loss function used during training of the attitude parameter recognition model is defined to be formed from key point loss, smoothing loss and point cloud loss. The loss function considers the difference between the key points predicted by the neural network and the predetermined key point labels, and the difference between the three-dimensional point cloud predicted by the neural network and the predetermined three-dimensional point cloud labels; it also considers the smoothing loss, i.e. the degree of translation and rotation between the preceding and following frames of the same human body action, which is used to prevent jitter. The output of the model consists of the attitude parameters, morphological parameters and displacement parameters of the human body action to be recognized. These parameters are equivalent to the 3D key point coordinates and three-dimensional point cloud used as training labels: the attitude, morphological and displacement parameters of a human body action can be obtained by an equivalent transformation of the human 3D key point coordinates and three-dimensional point cloud, and the two sets of parameters describe the human body action equally well.
Step 130, inputting the posture parameters, the morphological parameters and the displacement parameters of the human body action to be recognized into a parameterized model, and outputting the result of the human body action to be recognized.
Specifically, the posture parameters, morphological parameters and displacement parameters output by the posture parameter recognition model are input into a pre-established parameterized model, i.e. the G model; the human body action result can be constructed through the G model, and the result of the human body action to be recognized is output.
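The parameterized model is referred to only as the G model here. As an illustration, assuming an SMPL-family body model such as the one exposed by the open-source smplx package (an assumption on our part, not a model named in the disclosure), turning the regressed parameters into a mesh could look roughly like the sketch below; the parameter shapes follow smplx conventions.

```python
import torch
import smplx  # open-source SMPL/SMPL-X implementation; assumed here, not named by the disclosure


def build_action_mesh(model_path, betas, body_pose, global_orient, transl):
    """Drive an SMPL-like parameterized model with the regressed parameters.

    betas: (1, 10) morphological (shape) parameters
    body_pose: (1, 69) axis-angle posture parameters for the body joints
    global_orient: (1, 3) global rotation
    transl: (1, 3) displacement parameters
    """
    model = smplx.create(model_path, model_type="smpl")
    out = model(betas=betas, body_pose=body_pose,
                global_orient=global_orient, transl=transl)
    # The posed mesh vertices and joints constitute the "human body action result".
    return out.vertices, out.joints


# Example call with neutral (zero) parameters; "models/" is a placeholder path.
# verts, joints = build_action_mesh("models/", torch.zeros(1, 10), torch.zeros(1, 69),
#                                   torch.zeros(1, 3), torch.zeros(1, 3))
```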
The method provided by the embodiment of the invention first preprocesses the RGB-D image of the human body action to be recognized to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate map without background pixel interference, then inputs the RGB image and the point cloud coordinate map into an attitude parameter recognition model, which outputs the attitude parameters, morphological parameters and displacement parameters of the human body action to be recognized, and finally inputs these parameters into a parameterized model, which outputs the recognition result of the human body action. The attitude parameter recognition model is trained on sample RGB images without background pixel interference and sample point cloud three-dimensional coordinate maps without background pixel interference, together with the predetermined 3D key point coordinate labels and three-dimensional point cloud labels corresponding to each sample's RGB image, and the loss function used during its training is defined to be formed from key point loss, smoothing loss and point cloud loss. Training the attitude parameter recognition model by deep learning on a large number of sample images and point cloud data guarantees the accuracy of the model and hence the accuracy of human body action recognition; taking key point loss, smoothing loss and point cloud loss into account when constructing the loss function further guarantees that accuracy; and using the trained model to recognize human body actions from RGB-D images in real time reduces the complexity of the whole recognition process and guarantees real-time performance. The method provided by the embodiment of the invention therefore supports application scenarios such as real-time action migration and improves the accuracy of action recognition.
Based on the above-described embodiments, in this method,
the predetermined 3D key point coordinate label corresponding to the RGB image without background pixel interference of each sample is obtained by inputting the RGB image without background pixel interference of the sample into an annotation algorithm to obtain 2D key point coordinates, and then converting the 2D key point coordinates into 3D key point coordinates;
correspondingly, the predetermined three-dimensional point cloud label corresponding to the RGB image without background pixel interference of each sample is obtained by converting the depth image corresponding to the RGB image without background pixel interference of the sample into the three-dimensional point cloud based on the camera internal parameters of the acquired image.
Specifically, to determine the 3D key point coordinate label and the three-dimensional point cloud label corresponding to a sample's RGB image without background pixel interference, a labeling algorithm is first used to extract the 2D key point coordinates of that RGB image, and the 2D key point coordinates are then converted into 3D key point coordinates. The labeling algorithm may be any commonly used third-party algorithm with high accuracy for extracting 2D key points and is not specifically limited here; converting the 2D key point coordinates into 3D key point coordinates only requires the internal parameters of the camera that acquired the original image, and is not described again here. The depth image corresponding to the sample's RGB image without background pixel interference is then converted into a three-dimensional point cloud based on the same camera internal parameters to obtain the three-dimensional point cloud label. The key points here refer to the human body's left ankle, right ankle, left knee, right knee, left hip, right hip, left waist, right waist, left elbow, right elbow, left shoulder, right shoulder, nose, left ear and right ear.
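As a rough sketch of the 2D-to-3D conversion described above, the following back-projects 2D key points through the camera internal parameters using the aligned depth map (a standard pinhole-model computation; the variable names and the assumption of an undistorted, metre-valued depth map are ours, not the disclosure's):

```python
import numpy as np

def keypoints_2d_to_3d(keypoints_2d, depth, fx, fy, cx, cy):
    """Back-project 2D key points to camera-frame 3D coordinates.

    keypoints_2d: N x 2 array of (u, v) pixel coordinates
    depth: H x W depth map (assumed in metres), aligned with the color image
    fx, fy, cx, cy: internal parameters of the acquiring camera
    """
    pts_3d = np.zeros((len(keypoints_2d), 3), dtype=np.float32)
    for i, (u, v) in enumerate(keypoints_2d):
        z = depth[int(round(v)), int(round(u))]
        pts_3d[i] = ((u - cx) * z / fx, (v - cy) * z / fy, z)
    return pts_3d
```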
Based on any of the above embodiments, in the method, the labeling algorithm is the AlphaPose 2D keypoint detection algorithm.
Specifically, the adopted labeling algorithm is limited to the AlphaPose 2D key point detection algorithm, which currently has high accuracy for extracting 2D key points; the key points it extracts are often used as the reference standard (ground truth).
Based on any of the above embodiments, in the method, the preprocessing is performed on the RGB-D image of the human body motion to be recognized, so as to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference, and specifically includes:
the RGB-D image of the human body action to be recognized comprises a color image, a depth image and a human body mask image;
masking the color image by using the human body mask image to obtain an RGB image without background pixel interference;
and converting the color image and the depth image, based on the internal parameters of the depth camera used to collect the RGB-D image, to obtain a point cloud three-dimensional coordinate map without background pixel interference.
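A minimal sketch of this preprocessing, assuming numpy arrays, a binary human body mask aligned with the color and depth images, and a pinhole model for the depth camera; the background fill value follows the -255 described below:

```python
import numpy as np

def preprocess_rgbd(color, depth, mask, fx, fy, cx, cy, bg_fill=-255.0):
    """Produce a background-free RGB image and a point cloud coordinate map.

    color: H x W x 3 image; depth: H x W depth map (assumed in metres);
    mask: H x W binary human body mask; fx, fy, cx, cy: internal parameters
    of the depth camera. bg_fill follows the -255 background value in the text.
    """
    fg = mask.astype(bool)
    rgb_fg = color.astype(np.float32).copy()
    rgb_fg[~fg] = bg_fill                      # remove background pixel interference

    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth * fg                             # zero depth outside the human mask
    xyz_map = np.stack(((u - cx) * z / fx,     # X
                        (v - cy) * z / fy,     # Y
                        z), axis=-1)           # Z -> H x W x 3 coordinate map
    return rgb_fg, xyz_map
```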
Specifically, the method for performing background segmentation preprocessing on the RGB-D image of the human body action to be recognized is specified here as the human body mask method, whose flow is as follows: after data are collected by the RGB-D camera, the SDK of the acquisition device is called to obtain the color image, depth image and human body mask image of the RGB-D image of the human body action to be recognized, and the color image is masked with the human body mask image, i.e. the R, G and B values of invalid (background) pixels are all set to -255 using the human body mask. Preferably, the mask edge can be dilated by a certain number of pixels to obtain a larger mask, which reduces the probability that key points are lost when the action is too fast; the masked image contains no background pixel interference, and for scenes where the human body color is similar to the background, the depth camera's mask uses depth information and is therefore more robust. Compared with cropping the human body with a bounding box enlarged from identified key points, separating the human body from the background with the human body mask gives the comparison results shown in Table 1, which compares key point accuracy when the human body is separated from the color map without a mask and with a mask:
TABLE 1 Comparison of key point accuracy with the human body separated from the color map without a mask vs. with a mask
Key point     Color map without mask    Color map with mask
LAnkle        0.9459                    0.9560
RAnkle        0.9442                    0.9514
LKnee         0.9509                    0.9621
RKnee         0.9446                    0.9567
LHip          0.9578                    0.9738
RHip          0.9621                    0.9749
LWrist        0.8330                    0.8763
RWrist        0.8214                    0.8620
LElbow        0.9234                    0.9543
RElbow        0.9095                    0.9515
LShoulder     0.9724                    0.9844
RShoulder     0.9625                    0.9861
Nose          0.9759                    0.9879
LEar          0.9816                    0.9907
REar          0.9823                    0.9901
Total         0.9378                    0.9523
The values in Table 1 for the respective human body parts are computed with the PCK@0.2 (Percentage of Correct Keypoints) metric, i.e. the fraction of detected key points whose Euclidean distance to the reference standard (ground truth) is less than 0.2 × the torso diameter. The comparison in Table 1 shows that separating the human body from the color map with the mask method is more beneficial to the accuracy of subsequent key point detection than separating it without a mask. Therefore, the RGB-D image preprocessing method adopted in the embodiment of the invention uses the human body mask; the background-free human body image extracted in this way is more conducive to subsequently extracting human body key points from the image and describing the human body action.
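For reference, the PCK@0.2 values of Table 1 correspond to a computation like the sketch below; treating the torso diameter as a per-image reference length supplied by the caller is an assumption, since the disclosure only states 0.2 × the torso diameter.

```python
import numpy as np

def pck_at_02(pred, gt, torso_diameter):
    """PCK@0.2: fraction of key points within 0.2 * torso diameter of ground truth.

    pred, gt: N x 2 arrays of 2D key point coordinates for one image;
    torso_diameter: scalar reference length for that image (an assumption here).
    """
    dists = np.linalg.norm(pred - gt, axis=1)
    return float(np.mean(dists < 0.2 * torso_diameter))
```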
Based on any one of the above embodiments, in the method, the loss function in the training of the attitude parameter recognition model is formed based on the key point loss, the smoothing loss, and the point cloud loss, and specifically includes:
the loss function during training of the attitude parameter recognition model is L_Total = λ1·L_2D + λ2·L_3D + λ3·L_Point + λ4·L_Smooth, wherein L_2D is the 2D key point loss, L_3D is the 3D key point loss, L_Point is the point cloud loss, L_Smooth is the smoothing loss, and λi (i = 1, 2, 3, 4) is the weight corresponding to each loss term.
Specifically, the loss function during training of the attitude parameter recognition model considers the key point loss, the smoothing loss and the point cloud loss, and the key point loss is composed of the 2D key point loss and the 3D key point loss. The specific loss function is expressed as L_Total = λ1·L_2D + λ2·L_3D + λ3·L_Point + λ4·L_Smooth, wherein L_2D is the 2D key point loss, L_3D is the 3D key point loss, L_Point is the point cloud loss, L_Smooth is the smoothing loss, λ1 is the weight corresponding to the 2D key point loss term, λ2 is the weight corresponding to the 3D key point loss term, λ3 is the weight corresponding to the point cloud loss term, and λ4 is the weight corresponding to the smoothing loss term.
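Expressed in code, the weighted combination could look like the sketch below; PyTorch-style tensors are assumed for the individual terms, which are computed as in the formulas that follow, and the default weights are placeholders since λ1..λ4 are not disclosed.

```python
def total_loss(l_2d, l_3d, l_point, l_smooth, weights=(1.0, 1.0, 1.0, 1.0)):
    """L_Total = λ1·L_2D + λ2·L_3D + λ3·L_Point + λ4·L_Smooth.

    The weights λ1..λ4 are training hyperparameters; the defaults here are
    placeholders, not values from the disclosure.
    """
    l1, l2, l3, l4 = weights
    return l1 * l_2d + l2 * l_3d + l3 * l_point + l4 * l_smooth
```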
In accordance with any of the above embodiments, in the method,
the 2D keypoint loss L2DCalculated by the following formula:
Figure BDA0002520931610000121
wherein p_gt is the reference standard 2D key point information set corresponding to each sample's RGB image without background pixel interference, obtained by inputting that RGB image into the annotation algorithm; p_l is the 2D key point information set of each sample's RGB image without background pixel interference as predicted by the neural network during training of the attitude parameter recognition model;
the 3D key point loss L_3D is calculated by the following formula:
L_3D = Σ_i v(i)·‖p_gt2(i) − p_j(i)‖²
wherein p_gt2 is the reference standard 3D key point information set corresponding to each sample's RGB image without background pixel interference, obtained by converting the reference standard 2D key points produced by the annotation algorithm into reference standard 3D key point coordinates; p_j is the 3D key point information set of each sample's RGB image without background pixel interference as predicted by the neural network during training of the attitude parameter recognition model; v is a one-hot vector composed of 0 and 1, used to describe the self-occlusion of the human body;
the point cloud loss L_Point is calculated by the following formula:
L_Point = Σ_i w(i)·|n_gt(i)·(V_gt(i) − V_pred(i))|
wherein V_gt is the set of grid points corresponding to each three-dimensional point cloud label, V_pred is the point cloud three-dimensional coordinate set without background pixel interference, n_gt is the set of normals of the grid points, and w is a one-hot vector composed of 0 and 1: if a corresponding point for the i-th element can be found on the depth map without background pixel interference, the i-th element is 1, otherwise it is 0, where i is a positive integer;
the smoothing loss L_Smooth is calculated by the following formula:
L_Smooth = ‖R_cur − R_pre‖² + ‖T_cur − T_pre‖²
wherein R_pre is the rotation parameter predicted by the neural network, during training of the attitude parameter recognition model, for the previous-frame sample RGB image (without background pixel interference) of the same human body action, T_pre is the translation parameter predicted for that previous-frame sample, R_cur is the rotation parameter predicted for the current-frame sample RGB image (without background pixel interference) of the same human body action, and T_cur is the translation parameter predicted for that current-frame sample.
Specifically, the 2D key point loss L_2D is calculated by the following formula:
L_2D = Σ_i ‖p_gt(i) − p_l(i)‖²
wherein p_gt is the reference standard 2D key point information set corresponding to each sample's RGB image without background pixel interference, obtained by inputting that RGB image into the annotation algorithm; the annotation algorithm may be any commonly used third-party algorithm with high accuracy for extracting 2D key points, preferably the AlphaPose 2D key point detection algorithm; p_l is the 2D key point information set of each sample's RGB image without background pixel interference as predicted by the neural network during training of the attitude parameter recognition model;
the 3D key point loss L_3D is calculated by the following formula:
L_3D = Σ_i v(i)·‖p_gt2(i) − p_j(i)‖²
wherein p_gt2 is the reference standard 3D key point information set corresponding to each sample's RGB image without background pixel interference, obtained by converting the reference standard 2D key points produced by the annotation algorithm into reference standard 3D key point coordinates; the annotation algorithm may be any commonly used third-party algorithm with high accuracy for extracting 2D key points, preferably the AlphaPose 2D key point detection algorithm, and converting the reference standard 2D key points into reference standard 3D key points only requires the internal parameters of the camera that acquired the original image; p_j is the 3D key point information set of each sample's RGB image without background pixel interference as predicted by the neural network during training of the attitude parameter recognition model; v is a one-hot vector composed of 0 and 1 used to describe the self-occlusion of the human body, where the value of the i-th element of v depends on whether the corresponding human body key point is occluded: it is 0 if the key point is occluded and 1 if it is not.
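A sketch of the two key point loss terms, assuming PyTorch and a squared L2 distance; the exact norm is given only as a formula image in the original publication, so this is an interpretation rather than the disclosed formula:

```python
import torch

def keypoint_loss_2d(pred_2d, gt_2d):
    """L_2D: distance between predicted and ground-truth 2D key points.

    pred_2d, gt_2d: (N, 2) tensors. Squared L2 distance is assumed here.
    """
    return ((pred_2d - gt_2d) ** 2).sum(dim=-1).mean()

def keypoint_loss_3d(pred_3d, gt_3d, visibility):
    """L_3D: distance between predicted and ground-truth 3D key points,
    masked by the self-occlusion vector v (1 = visible, 0 = occluded)."""
    per_point = ((pred_3d - gt_3d) ** 2).sum(dim=-1)          # (N,)
    return (visibility * per_point).sum() / visibility.sum().clamp(min=1)
```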
point cloud loss LPointCalculated by the following formula:
Figure BDA0002520931610000142
wherein V_gt is the set of grid points corresponding to each three-dimensional point cloud label, obtained by meshing each three-dimensional point cloud; V_pred is the point cloud three-dimensional coordinate set without background pixel interference; n_gt is the set of normals of the grid points; and w is a one-hot vector composed of 0 and 1: if a corresponding point for the i-th element can be found on the depth map without background pixel interference, the i-th element is 1, otherwise it is 0, where i is a positive integer;
the smoothing loss L_Smooth is calculated by the following formula:
L_Smooth = ‖R_cur − R_pre‖² + ‖T_cur − T_pre‖²
in the case of sample training, a plurality of frames before and after the same human body motion are usually used as samples for training, and R is a reference for trainingpreA rotation parameter, T, of an RGB image predicted by a neural network during the training of the attitude parameter recognition model and corresponding to a previous frame sample of the same human body action without background pixel interferencepreTranslation parameters, R, of RGB images predicted by the neural network during the training of the attitude parameter recognition model and corresponding to the same human body action and without background pixel interference of the previous frame samplecurA rotation parameter T of the RGB image predicted by the neural network during the training of the attitude parameter recognition model and corresponding to the same human body action without background pixel interferencecurAnd the translation parameters of the RGB image are predicted by the neural network during the training of the attitude parameter recognition model, and are corresponding to the previous frame of sample of the same human body action without background pixel interference.
Based on any of the above embodiments, an embodiment of the present invention provides a human body motion recognition device based on a neural network, and fig. 2 is a schematic structural diagram of the human body motion recognition apparatus based on the neural network provided in the embodiment of the present invention. As shown in fig. 2, the apparatus includes a preprocessing unit 210, a recognition unit 220 and an action unit 230, wherein:
the preprocessing unit 210 is configured to preprocess an RGB-D image of a human body motion to be recognized, so as to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate map without background pixel interference;
the recognition unit 220 is configured to input the RGB image without background pixel interference and the point cloud three-dimensional coordinate map without background pixel interference into an attitude parameter recognition model, and output an attitude parameter, a morphological parameter, and a displacement parameter of the human body motion to be recognized;
wherein the attitude parameter recognition model is trained on sample RGB images without background pixel interference and sample point cloud three-dimensional coordinate maps without background pixel interference, together with the predetermined 3D key point coordinate labels and three-dimensional point cloud labels corresponding to each sample's RGB image without background pixel interference, and the loss function used during training of the attitude parameter recognition model is formed on the basis of key point loss, smoothing loss and point cloud loss;
the action unit 230 is configured to input the posture parameter, the form parameter, and the displacement parameter of the human body action to be recognized into a parameterized model, and output a result of the human body action to be recognized.
The device provided by the embodiment of the invention first preprocesses the RGB-D image of the human body action to be recognized to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate map without background pixel interference, then inputs the RGB image and the point cloud coordinate map into an attitude parameter recognition model, which outputs the attitude parameters, morphological parameters and displacement parameters of the human body action to be recognized, and finally inputs these parameters into a parameterized model, which outputs the recognition result of the human body action. The attitude parameter recognition model is trained on sample RGB images without background pixel interference and sample point cloud three-dimensional coordinate maps without background pixel interference, together with the predetermined 3D key point coordinate labels and three-dimensional point cloud labels corresponding to each sample's RGB image, and the loss function used during its training is defined to be formed from key point loss, smoothing loss and point cloud loss. Training the attitude parameter recognition model by deep learning on a large number of sample images and point cloud data guarantees the accuracy of the model and hence the accuracy of human body action recognition; taking key point loss, smoothing loss and point cloud loss into account when constructing the loss function further guarantees that accuracy; and using the trained model to recognize human body actions from RGB-D images in real time reduces the complexity of the whole recognition process and guarantees real-time performance. The device provided by the embodiment of the invention therefore supports application scenarios such as real-time action migration and improves the accuracy of action recognition.
In accordance with any of the above embodiments, in the apparatus,
the predetermined 3D key point coordinate label corresponding to the RGB image without background pixel interference of each sample is obtained by inputting the RGB image without background pixel interference of the sample into an annotation algorithm to obtain 2D key point coordinates, and then converting the 2D key point coordinates into 3D key point coordinates;
correspondingly, the predetermined three-dimensional point cloud label corresponding to the RGB image without background pixel interference of each sample is obtained by converting the depth image corresponding to the RGB image without background pixel interference of the sample into the three-dimensional point cloud based on the camera internal parameters of the acquired image.
In accordance with any of the above embodiments, in the apparatus,
the labeling algorithm is the AlphaPose 2D key point detection algorithm.
Based on any one of the above embodiments, in the apparatus, the preprocessing unit is specifically configured to:
the RGB-D image of the human body action to be recognized comprises a color image, a depth image and a human body mask image;
masking the color image by using the human body mask image to obtain an RGB image without background pixel interference;
and converting the color image and the depth image, based on the internal parameters of the depth camera used to collect the RGB-D image, to obtain a point cloud three-dimensional coordinate map without background pixel interference.
Based on any one of the above embodiments, in the apparatus, the loss function in the training of the attitude parameter recognition model is formed based on the key point loss, the smoothing loss, and the point cloud loss, and specifically includes:
the loss function during training of the attitude parameter recognition model is L_Total = λ1·L_2D + λ2·L_3D + λ3·L_Point + λ4·L_Smooth, wherein L_2D is the 2D key point loss, L_3D is the 3D key point loss, L_Point is the point cloud loss, L_Smooth is the smoothing loss, and λi (i = 1, 2, 3, 4) is the weight corresponding to each loss term.
In accordance with any of the above embodiments, in the apparatus,
the 2D key point loss L_2D is calculated by the following formula:
L_2D = Σ_i ‖p_gt(i) − p_l(i)‖²
wherein p_gt is the reference standard 2D key point information set corresponding to each sample's RGB image without background pixel interference, obtained by inputting that RGB image into the annotation algorithm; p_l is the 2D key point information set of each sample's RGB image without background pixel interference as predicted by the neural network during training of the attitude parameter recognition model;
the 3D key point loss L_3D is calculated by the following formula:
L_3D = Σ_i v(i)·‖p_gt2(i) − p_j(i)‖²
wherein p_gt2 is the reference standard 3D key point information set corresponding to each sample's RGB image without background pixel interference, obtained by converting the reference standard 2D key points produced by the annotation algorithm into reference standard 3D key point coordinates; p_j is the 3D key point information set of each sample's RGB image without background pixel interference as predicted by the neural network during training of the attitude parameter recognition model; v is a one-hot vector composed of 0 and 1, used to describe the self-occlusion of the human body;
the point cloud loss L_Point is calculated by the following formula:
L_Point = Σ_i w(i)·|n_gt(i)·(V_gt(i) − V_pred(i))|
wherein V_gt is the set of grid points corresponding to each three-dimensional point cloud label, V_pred is the point cloud three-dimensional coordinate set without background pixel interference, n_gt is the set of normals of the grid points, and w is a one-hot vector composed of 0 and 1: if a corresponding point for the i-th element can be found on the depth map without background pixel interference, the i-th element is 1, otherwise it is 0, where i is a positive integer;
the smoothing loss L_Smooth is calculated by the following formula:
L_Smooth = ‖R_cur − R_pre‖² + ‖T_cur − T_pre‖²
wherein R_pre is the rotation parameter predicted by the neural network, during training of the attitude parameter recognition model, for the previous-frame sample RGB image (without background pixel interference) of the same human body action, T_pre is the translation parameter predicted for that previous-frame sample, R_cur is the rotation parameter predicted for the current-frame sample RGB image (without background pixel interference) of the same human body action, and T_cur is the translation parameter predicted for that current-frame sample.
Fig. 3 is a schematic diagram of the physical structure of an electronic device according to an embodiment of the present invention. As shown in Fig. 3, the electronic device may include: a processor (processor) 301, a communication interface (Communications Interface) 302, a memory (memory) 303 and a communication bus 304, wherein the processor 301, the communication interface 302 and the memory 303 communicate with one another through the communication bus 304. The processor 301 may call a computer program stored in the memory 303 and executable on the processor 301 to perform the neural network based human body action recognition method provided by the above embodiments, which includes, for example: preprocessing an RGB-D image of the human body action to be recognized to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference; inputting the RGB image without background pixel interference and the point cloud three-dimensional coordinate graph without background pixel interference into a posture parameter recognition model, and outputting posture parameters, morphological parameters and displacement parameters of the human body action to be recognized, wherein the posture parameter recognition model is obtained by training on RGB images without background pixel interference and point cloud three-dimensional coordinate graphs without background pixel interference of samples, together with the predetermined 3D key point coordinate labels and three-dimensional point cloud labels corresponding to the RGB image without background pixel interference of each sample, and the loss function during training of the posture parameter recognition model is formed from the key point loss, the smoothing loss and the point cloud loss; and inputting the posture parameters, the morphological parameters and the displacement parameters of the human body action to be recognized into a parameterized model, and outputting a recognition result of the human body action to be recognized.
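For orientation only, the recognition flow executed by the processor can be summarised by the following sketch; preprocess, pose_model and parametric_body_model are hypothetical stand-ins for the preprocessing step, the posture parameter recognition model and the parameterized human body model, and are not names used in this publication:

```python
# Hedged end-to-end sketch of the described recognition flow.
def recognize_action(rgbd_image, preprocess, pose_model, parametric_body_model):
    # 1. Remove background pixels and build the point cloud three-dimensional coordinate graph.
    rgb_clean, point_cloud_map = preprocess(rgbd_image)
    # 2. Predict posture, morphological and displacement parameters.
    posture, shape, displacement = pose_model(rgb_clean, point_cloud_map)
    # 3. Drive the parameterized human body model to obtain the recognized action.
    return parametric_body_model(posture, shape, displacement)
```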
In addition, the logic instructions in the memory 303 may be implemented in the form of software functional units and, when sold or used as an independent product, stored in a computer readable storage medium. Based on such understanding, the technical solutions of the embodiments of the present invention, in essence, or the part thereof contributing to the prior art, may be embodied in the form of a software product stored in a storage medium and including several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
An embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, the computer program, when executed by a processor, performing the neural network based human body action recognition method provided in the foregoing embodiments, which includes: preprocessing an RGB-D image of the human body action to be recognized to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference; inputting the RGB image without background pixel interference and the point cloud three-dimensional coordinate graph without background pixel interference into a posture parameter recognition model, and outputting posture parameters, morphological parameters and displacement parameters of the human body action to be recognized, wherein the posture parameter recognition model is obtained by training on RGB images without background pixel interference and point cloud three-dimensional coordinate graphs without background pixel interference of samples, together with the predetermined 3D key point coordinate labels and three-dimensional point cloud labels corresponding to the RGB image without background pixel interference of each sample, and the loss function during training of the posture parameter recognition model is formed from the key point loss, the smoothing loss and the point cloud loss; and inputting the posture parameters, the morphological parameters and the displacement parameters of the human body action to be recognized into a parameterized model, and outputting a recognition result of the human body action to be recognized.
The above-described system embodiments are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A human body action recognition method based on a neural network is characterized by comprising the following steps:
preprocessing an RGB-D image of the human body action to be recognized to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference;
inputting the RGB image without background pixel interference and the point cloud three-dimensional coordinate graph without background pixel interference into a posture parameter recognition model, and outputting posture parameters, morphological parameters and displacement parameters of the human body action to be recognized;
wherein the posture parameter recognition model is obtained by training on RGB images without background pixel interference and point cloud three-dimensional coordinate graphs without background pixel interference of samples, together with predetermined 3D key point coordinate labels and three-dimensional point cloud labels corresponding to the RGB image without background pixel interference of each sample, and a loss function during training of the posture parameter recognition model is formed from the key point loss, the smoothing loss and the point cloud loss;
and inputting the posture parameters, the morphological parameters and the displacement parameters of the human body action to be recognized into a parameterized model, and outputting a recognition result of the human body action to be recognized.
2. The neural network-based human body action recognition method according to claim 1,
the predetermined 3D key point coordinate label corresponding to the RGB image without background pixel interference of each sample is obtained by inputting the RGB image without background pixel interference of the sample into an annotation algorithm to obtain 2D key point coordinates, and then converting the 2D key point coordinates into 3D key point coordinates;
correspondingly, the predetermined three-dimensional point cloud label corresponding to the RGB image without background pixel interference of each sample is obtained by converting the depth image corresponding to the RGB image without background pixel interference of the sample into a three-dimensional point cloud based on the internal parameters of the camera that acquired the image.
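As an illustrative sketch only (claim 2 does not spell out the conversion, so the pinhole back-projection and the parameter names fx, fy, cx, cy below are assumptions), lifting annotated 2D key points to 3D with the depth image could be done as follows:

```python
import numpy as np

# Assumed 2D-to-3D key point lifting using the depth image and camera intrinsics.
def keypoints_2d_to_3d(kp_2d, depth, fx, fy, cx, cy):
    # kp_2d: (K, 2) pixel coordinates (u, v); depth: (H, W) depth map in metres
    u, v = kp_2d[:, 0], kp_2d[:, 1]
    z = depth[v.astype(int), u.astype(int)]    # depth sampled at each key point
    x = (u - cx) * z / fx                      # pinhole back-projection
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)        # (K, 3) camera-space coordinates
```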
3. The human body action recognition method based on the neural network as claimed in claim 2, wherein the annotation algorithm is the AlphaPose 2D key point detection algorithm.
4. The human body action recognition method based on the neural network as claimed in claim 1, wherein the preprocessing of the RGB-D image of the human body action to be recognized to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference specifically comprises:
the RGB-D image of the human body action to be recognized comprises a color image, a depth image and a human body mask image;
masking the color image by using the human body mask image to obtain an RGB image without background pixel interference;
and converting the color image and the depth image, based on the internal parameters of the depth camera that collects the RGB-D image, to obtain a point cloud three-dimensional coordinate graph without background pixel interference.
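As a hedged sketch of the preprocessing in this claim (fx, fy, cx, cy are assumed names for the depth camera's internal parameters, and the mask is assumed to be a 0/1 image), masking and back-projection could be implemented as:

```python
import numpy as np

# Assumed implementation of claim 4's preprocessing: zero out background pixels
# with the human body mask image and back-project the depth image to a point
# cloud three-dimensional coordinate graph using the depth camera intrinsics.
def preprocess(color, depth, mask, fx, fy, cx, cy):
    # color: (H, W, 3); depth: (H, W) in metres; mask: (H, W) in {0, 1}
    rgb_clean = color * mask[..., None]               # RGB image without background pixels
    ys, xs = np.indices(depth.shape)
    z = depth * mask                                  # keep only human body depth values
    x = (xs - cx) * z / fx                            # pinhole back-projection
    y = (ys - cy) * z / fy
    point_cloud_map = np.stack([x, y, z], axis=-1)    # (H, W, 3) coordinate graph
    return rgb_clean, point_cloud_map
```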
5. The human body action recognition method based on the neural network as claimed in claim 1, wherein the loss function during training of the posture parameter recognition model is formed from the key point loss, the smoothing loss and the point cloud loss, and specifically comprises:
the loss function during training of the posture parameter recognition model is L_Total = λ1·L_2D + λ2·L_3D + λ3·L_Point + λ4·L_Smooth, wherein L_2D is the 2D key point loss, L_3D is the 3D key point loss, L_Point is the point cloud loss, L_Smooth is the smoothing loss, and λi (i = 1, 2, 3, 4) is the weight corresponding to each loss term.
6. The neural network-based human body action recognition method of claim 5,
wherein the 2D key point loss L_2D is calculated by the following formula:
[Equation image in the original publication: formula for the 2D key point loss L_2D]
wherein p_gt is the reference standard 2D key point information set corresponding to the RGB image without background pixel interference of each sample, obtained by inputting the RGB image without background pixel interference of each sample into an annotation algorithm; p_l is the 2D key point information set of the RGB image without background pixel interference of each sample, predicted by the neural network during training of the posture parameter recognition model;
the 3D key point loss L_3D is calculated by the following formula:
[Equation image in the original publication: formula for the 3D key point loss L_3D]
wherein p_gt2 is the reference standard 3D key point information set corresponding to the RGB image without background pixel interference of each sample, obtained by converting the reference standard 2D key points, produced by inputting the RGB image without background pixel interference of each sample into the annotation algorithm, into reference standard 3D key point coordinates; p_j is the 3D key point information set of the RGB image without background pixel interference of each sample, predicted by the neural network during training of the posture parameter recognition model; v is a vector composed of 0s and 1s that describes the self-occlusion of the human body;
the point cloud loss L_Point is calculated by the following formula:
[Equation image in the original publication: formula for the point cloud loss L_Point]
wherein V_gt is the set of grid points corresponding to each three-dimensional point cloud label, V_pred is the point cloud three-dimensional coordinate set without background pixel interference, n_gt is the set of normals of the grid points, and w is a vector composed of 0s and 1s: the ith element is 1 if a corresponding point can be found for it on the depth map without background pixel interference, and 0 otherwise, where i is a positive integer;
the smoothing loss L_Smooth is calculated by the following formula:
[Equation image in the original publication: formula for the smoothing loss L_Smooth]
wherein R_pre and T_pre are the rotation and translation parameters, predicted by the neural network during training of the posture parameter recognition model, of the RGB image without background pixel interference of the previous frame sample of the same human body action, and R_cur and T_cur are the corresponding rotation and translation parameters predicted for the RGB image without background pixel interference of the current frame sample of the same human body action.
7. A human body action recognition device based on a neural network is characterized by comprising:
the preprocessing unit is used for preprocessing the RGB-D image of the human body action to be recognized to obtain an RGB image without background pixel interference and a point cloud three-dimensional coordinate graph without background pixel interference;
the recognition unit is used for inputting the RGB image without background pixel interference and the point cloud three-dimensional coordinate graph without background pixel interference into a posture parameter recognition model and outputting a posture parameter, a morphological parameter and a displacement parameter of the human body action to be recognized;
wherein the posture parameter recognition model is obtained by training on RGB images without background pixel interference and point cloud three-dimensional coordinate graphs without background pixel interference of samples, together with predetermined 3D key point coordinate labels and three-dimensional point cloud labels corresponding to the RGB image without background pixel interference of each sample, and a loss function during training of the posture parameter recognition model is formed from the key point loss, the smoothing loss and the point cloud loss;
and the action unit is used for inputting the posture parameters, the morphological parameters and the displacement parameters of the human body action to be recognized into a parameterized model and outputting a recognition result of the human body action to be recognized.
8. The neural network-based human body action recognition device of claim 7,
the predetermined 3D key point coordinate label corresponding to the RGB image without background pixel interference of each sample is obtained by inputting the RGB image without background pixel interference of the sample into an annotation algorithm to obtain 2D key point coordinates, and then converting the 2D key point coordinates into 3D key point coordinates;
correspondingly, the predetermined three-dimensional point cloud label corresponding to the RGB image without background pixel interference of each sample is obtained by converting the depth image corresponding to the RGB image without background pixel interference of the sample into the three-dimensional point cloud based on the camera internal parameters of the acquired image.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the steps of the neural network based human body action recognition method as claimed in any one of claims 1 to 6 when executing the program.
10. A non-transitory computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, implements the steps of the neural network-based human body action recognition method according to any one of claims 1 to 6.
CN202010490878.3A 2020-06-02 2020-06-02 Human body action recognition method and device based on neural network Pending CN111723687A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010490878.3A CN111723687A (en) 2020-06-02 2020-06-02 Human body action recognition method and device based on neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010490878.3A CN111723687A (en) 2020-06-02 2020-06-02 Human body action recognition method and device based on neural network

Publications (1)

Publication Number Publication Date
CN111723687A true CN111723687A (en) 2020-09-29

Family

ID=72565522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010490878.3A Pending CN111723687A (en) 2020-06-02 2020-06-02 Human body action recognition method and device based on neural network

Country Status (1)

Country Link
CN (1) CN111723687A (en)



Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107909027A (en) * 2017-11-14 2018-04-13 电子科技大学 It is a kind of that there is the quick human body target detection method for blocking processing
US20190242975A1 (en) * 2018-02-04 2019-08-08 KaiKuTek Inc. Gesture recognition system and gesture recognition method thereof
CN109670591A (en) * 2018-12-14 2019-04-23 深圳市商汤科技有限公司 A kind of training method and image matching method, device of neural network
CN110189397A (en) * 2019-03-29 2019-08-30 北京市商汤科技开发有限公司 A kind of image processing method and device, computer equipment and storage medium
CN110020633A (en) * 2019-04-12 2019-07-16 腾讯科技(深圳)有限公司 Training method, image-recognizing method and the device of gesture recognition model
CN110555412A (en) * 2019-09-05 2019-12-10 深圳龙岗智能视听研究院 End-to-end human body posture identification method based on combination of RGB and point cloud
CN111047630A (en) * 2019-11-13 2020-04-21 芯启源(上海)半导体科技有限公司 Neural network and target detection and depth prediction method based on neural network

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
SUN Penghui et al.: 《微小型仿生机器鼠设计与控制》 (Design and Control of Miniature Bionic Robot Rat), Beijing Institute of Technology Press, pages: 125 - 85 *

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112489129A (en) * 2020-12-18 2021-03-12 深圳市优必选科技股份有限公司 Pose recognition model training method and device, pose recognition method and terminal equipment
CN113079136A (en) * 2021-03-22 2021-07-06 广州虎牙科技有限公司 Motion capture method, motion capture device, electronic equipment and computer-readable storage medium
CN113079136B (en) * 2021-03-22 2022-11-15 广州虎牙科技有限公司 Motion capture method, motion capture device, electronic equipment and computer-readable storage medium
CN113111743A (en) * 2021-03-29 2021-07-13 北京工业大学 Personnel distance detection method and device
CN112801064A (en) * 2021-04-12 2021-05-14 北京的卢深视科技有限公司 Model training method, electronic device and storage medium
US20220405954A1 (en) * 2021-06-15 2022-12-22 Acronis International Gmbh Systems and methods for determining environment dimensions based on environment pose
CN113689541A (en) * 2021-07-23 2021-11-23 电子科技大学 Two-person three-dimensional human body shape optimization reconstruction method in interactive scene
CN113689541B (en) * 2021-07-23 2023-03-07 电子科技大学 Two-person three-dimensional human body shape optimization reconstruction method in interactive scene
CN114677572A (en) * 2022-04-08 2022-06-28 北京百度网讯科技有限公司 Object description parameter generation method and deep learning model training method
WO2024114500A1 (en) * 2022-11-30 2024-06-06 天翼数字生活科技有限公司 Human body pose recognition method and device

Similar Documents

Publication Publication Date Title
CN111723687A (en) Human body action recognition method and device based on neural network
CN110738101B (en) Behavior recognition method, behavior recognition device and computer-readable storage medium
CN108446694B (en) Target detection method and device
CN112381837B (en) Image processing method and electronic equipment
CN111080670B (en) Image extraction method, device, equipment and storage medium
CN104123749A (en) Picture processing method and system
CN105740945A (en) People counting method based on video analysis
CN114937232B (en) Wearing detection method, system and equipment for medical waste treatment personnel protective appliance
CN111401144A (en) Escalator passenger behavior identification method based on video monitoring
CN110827312B (en) Learning method based on cooperative visual attention neural network
CN110619316A (en) Human body key point detection method and device and electronic equipment
CN111353385B (en) Pedestrian re-identification method and device based on mask alignment and attention mechanism
CN109711268B (en) Face image screening method and device
CN110046574A (en) Safety cap based on deep learning wears recognition methods and equipment
CN112200056B (en) Face living body detection method and device, electronic equipment and storage medium
WO2022174523A1 (en) Method for extracting gait feature of pedestrian, and gait recognition method and system
CN112836625A (en) Face living body detection method and device and electronic equipment
CN114445651A (en) Training set construction method and device of semantic segmentation model and electronic equipment
CN111898571A (en) Action recognition system and method
CN112633221A (en) Face direction detection method and related device
CN114120389A (en) Network training and video frame processing method, device, equipment and storage medium
CN113191216A (en) Multi-person real-time action recognition method and system based on gesture recognition and C3D network
CN114708617A (en) Pedestrian re-identification method and device and electronic equipment
CN113378799A (en) Behavior recognition method and system based on target detection and attitude detection framework
CN112801020B (en) Pedestrian re-identification method and system based on background graying

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
Effective date of registration: 20220630
Address after: 230094 room 611-217, R & D center building, China (Hefei) international intelligent voice Industrial Park, No. 3333 Xiyou Road, high tech Zone, Hefei, Anhui Province
Applicant after: Hefei lushenshi Technology Co.,Ltd.
Address before: Room 3032, gate 6, block B, 768 Creative Industry Park, 5 Xueyuan Road, Haidian District, Beijing 100083
Applicant before: BEIJING DILUSENSE TECHNOLOGY CO.,LTD.
Applicant before: Hefei lushenshi Technology Co.,Ltd.
RJ01 Rejection of invention patent application after publication
Application publication date: 20200929