WO2024094227A1 - A gesture posture estimation method based on Kalman filtering and deep learning - Google Patents

A gesture posture estimation method based on Kalman filtering and deep learning

Info

Publication number
WO2024094227A1
WO2024094227A1 (PCT/CN2023/139747)
Authority
WO
WIPO (PCT)
Prior art keywords
gesture
posture
hand
angle
estimation
Prior art date
Application number
PCT/CN2023/139747
Other languages
English (en)
French (fr)
Inventor
马凤英
纪鹏
曹茂永
王先建
张慧
Original Assignee
齐鲁工业大学(山东省科学院)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 齐鲁工业大学(山东省科学院)
Publication of WO2024094227A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00: Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10: Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107: Static hand or arm
    • G06V40/11: Hand-related biometrics; Hand pose recognition
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00: Computer-aided design [CAD]
    • G06F30/20: Design optimisation, verification or simulation
    • G06F30/27: Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/08: Learning methods
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00: Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77: Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80: Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00: Arrangements for image or video recognition or understanding
    • G06V10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00: Scenes; Scene-specific elements
    • G06V20/60: Type of objects
    • G06V20/64: Three-dimensional objects

Definitions

  • The present invention relates to the technical fields of computer vision and human-computer interaction, and in particular to a hand gesture posture estimation method based on Kalman filtering and deep learning that fuses virtual and real information.
  • As an important medium through which the human body interacts with the outside world, gestures are widely used in human-computer interaction, augmented reality, virtual reality, gesture recognition, and other fields. As novel interaction methods develop toward greater naturalness and convenience, gesture-based human-computer interaction has very important research significance and prospects in gaming and entertainment, medical care, smart homes, and the defense industry. Accurate gesture posture estimation is a key link in using gestures for human-computer interaction and other applications.
  • At present, hand gesture posture estimation methods can be divided into methods based on wearable sensor devices and methods based on computer vision.
  • The wearable-sensor methods require the user to wear a data glove or other external auxiliary device fitted with sensors, and obtain the position coordinates of the hand joint points directly from the sensor components.
  • This type of method is not easily affected by natural environmental factors such as lighting and background, and offers good robustness and stability.
  • However, because the auxiliary equipment is generally expensive and highly precise, it requires cumbersome operating steps and maintenance and calibration; once worn, it constrains the hand's movement to some extent, so flexibility in use is low.
  • The other class of methods, based on computer vision, performs model learning or data matching on gesture images and can be divided into 3D and 2D posture estimation according to the spatial dimension of the prediction result.
  • Most research on 3D hand posture estimation is based on depth images, which carry the depth information of the target object and greatly facilitate posture estimation research.
  • However, the depth cameras that acquire depth images rely on structured light, binocular stereo vision, or time-of-flight imaging; they are very sensitive to lighting and other environmental factors, are unsuitable for outdoor and other special scenes, and are generally expensive and poorly portable. Compared with depth images, RGB images are more widely applicable, place low demands on the environment, and are easy to acquire.
  • RGB images, however, suffer from depth ambiguity, which is one of the difficulties in realizing 3D posture estimation.
  • The commonly used posture annotation method is to obtain the corresponding posture data with the help of an external physical sensor.
  • In practice, relative displacement between the sensor and the hand, and environmental effects on the sensor, readily introduce errors, so high-quality, high-precision datasets are relatively scarce.
  • The high degrees of freedom and self-occlusion of the human hand also remain problems that gesture posture estimation must overcome.
  • Moreover, posture annotation with a single external sensor device is still not accurate enough, first because of the accuracy limits of the sensor itself, and second because relative displacement between the sensor and the human hand is difficult to avoid during use; even a highly accurate sensor may still exhibit large errors.
  • To address these problems, the present invention proposes a gesture posture estimation method based on Kalman filtering and deep learning.
  • In this method, 3D posture estimation of a fixed hand shape is performed from dual-view RGB images.
  • Kalman filtering fuses the posture angle data output by the in-hand posture sensor when the gesture images are captured (the actual physical sensor observations) with the posture angle data predicted for the gesture images by a pre-trained gesture posture estimation model (the virtual sensor observations).
  • Through Kalman filter fusion of the observations of one actual sensor and one virtual sensor, measurement errors not attributable to sensor accuracy, such as those caused by relative displacement between the sensor and the target object during use, can be effectively corrected.
  • The main flow of the dataset production method in this scheme is as follows: first, a simulated hand model of the hand shape to be predicted is built in a 3D simulation environment; dual-view RGB gesture images of the simulated hand model rotating uniformly in three-dimensional space, together with the corresponding three-dimensional posture data, are collected, and a 3D posture estimation model of the simulated hand is trained on the collected images and posture data. In the real environment, a human hand holds a posture sensor while maintaining the same hand shape as in the simulation; two RGB cameras, with viewpoints similar to those in the simulation, collect dual-view RGB images of the real hand rotating uniformly in three-dimensional space, and the gesture posture data output by the posture sensor at each capture is recorded.
  • The collected dual-view real-hand RGB images are then fed to the trained simulated-hand posture estimation model for posture prediction, and a Kalman filter multi-sensor data fusion algorithm fuses the model-predicted posture data with the sensor-output posture data corresponding to the dual-view real-hand images, outputting high-precision posture annotations for those images.
  • A large number of dual-view real-hand RGB images are collected, and Kalman filtering fuses the posture data from these two different channels, yielding a gesture posture estimation dataset with high-precision posture annotations and solving the problem that RGB images are difficult to annotate for lack of depth information.
  • This application also discloses a method for 3D posture estimation of fixed hand shapes from dual-view RGB images, which combines the excellent automatic feature extraction of deep learning with the robust regression fitting of ensemble learning algorithms.
  • A CNN first extracts the deep features of the dual-view gesture images, and an ensemble learning algorithm then regresses the posture on those features.
  • This constructs a gesture posture estimation model that integrates the deep features of dual-view RGB gesture images; the method effectively overcomes the influence of gesture self-occlusion on prediction and solves the problem of 3D hand posture estimation on ordinary 2D images.
  • In the hand gesture posture estimation method based on Kalman filtering and deep learning, a posture-annotated dual-view gesture posture estimation dataset based on Kalman filter data fusion is first produced, comprising a first stage of simulated-hand posture estimation and a second stage of real gesture image acquisition and posture data fusion; second, 3D posture estimation is performed on the posture-annotated dual-view dataset, comprising the training and prediction stages of the gesture posture estimation model.
  • Steps 1-9 form the method for producing the posture estimation dataset with high-precision posture annotations based on Kalman filter data fusion:
  • steps 1-4 are the first stage, simulated-hand posture estimation;
  • steps 5-9 are the second stage, real gesture image acquisition and posture data fusion.
  • Steps 10-20 form the second part, the gesture posture estimation method based on deep learning and ensemble learning:
  • steps 10-14 are the first stage, training of the gesture posture estimation model;
  • steps 15-20 are the second stage, model prediction.
  • A high-quality dataset is a prerequisite for a learning-based posture estimation method to achieve the expected results.
  • When producing the dataset, simulated-hand posture estimation is performed first, followed by real gesture image acquisition and posture data fusion.
  • Simulated hand posture estimation includes the following steps:
  • Step 1: determine the fixed gesture form to be predicted, i.e., the fixed hand shape;
  • Step 2: for the fixed hand shape determined in step 1, perform 3D modeling with modeling and simulation software to generate a simulated hand model that approximates the physical appearance of the hand shape in form, skin color, texture, etc.;
  • Step 3: import the simulated hand model obtained in step 2 into the 3D simulation software and set up two cameras; then, in the 3D simulation environment, collect dual-view gesture images of the simulated hand model rotating in three-dimensional space together with its three-axis posture angle data, where φ is the roll angle, θ is the pitch angle, and ψ is the yaw angle, and produce the posture estimation dataset of the simulated hand model (an illustrative angle-label sketch follows step 4); the pose relationship between the two cameras and the simulated hand model in the 3D simulation software is the same as the pose relationship between human eyes and gestures;
  • Step 4: for the posture estimation dataset of the simulated hand model, use the gesture posture estimation method based on deep learning and ensemble learning to train the 3D posture estimation model of the simulated hand, so that it can predict the three-dimensional gesture posture of simulated hand model images; the specific operations are the same as steps 10-20.
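  • As a hedged illustration of the three-axis angle labels collected in step 3, the sketch below generates a roll/pitch/yaw label sequence for a model rotating at a constant angular velocity. The angular velocity, frame count, and the intrinsic z-y-x Euler convention are assumptions and must be matched to the simulator's own attitude output in practice.

```python
# Illustrative only: roll/pitch/yaw labels for a uniformly rotating model.
import numpy as np
from scipy.spatial.transform import Rotation as R

omega = np.deg2rad([5.0, 3.0, 2.0])    # constant per-step rotation (assumed)
rot = R.identity()
labels = []                            # one (roll, pitch, yaw) per captured frame
for k in range(100):
    rot = R.from_rotvec(omega) * rot   # apply one uniform rotation step
    yaw, pitch, roll = rot.as_euler("zyx", degrees=True)
    labels.append((roll, pitch, yaw))  # saved as the label of dual-view frame k
```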
  • Real gesture image acquisition and posture data fusion include the following steps:
  • Step 5: in a real environment, a real human hand maintains the hand shape to be predicted, with a posture sensor, i.e., a gyroscope, placed in the hand; collect the dual-view gesture image sequence of the real hand rotating in three-dimensional space and the three-axis posture angle sequence output by the posture sensor, the dual-view camera positions matching the dual-view setup of step 2.
  • The posture obtained in this process is called the sensor output posture;
  • Step 6: input the dual-view real-hand image frames collected in step 5 into the simulated-hand posture estimation model trained on simulated-hand images in step 4 and perform posture prediction.
  • This posture data is called the model predicted posture.
  • Step 7: fuse the sensor output posture corresponding to the dual-view images of step 6 with the model's predicted posture for those images using Kalman filtering; after the two uncertain posture data are fused, accurate three-dimensional gesture posture data is output, called the fused posture.
  • Kalman filtering here performs a multi-sensor posture data fusion operation: it fuses the gesture posture data from different sensors rather than correcting the accuracy of a sensor itself.
  • Step 8: use the fused posture generated in step 7 as the label of the gesture images of step 6 and save it;
  • Step 9: apply steps 6, 7, and 8 to all dual-view real gesture image frames and corresponding sensor output postures collected in step 5 to obtain a real-hand image sequence labeled with fused posture data, i.e., a gesture posture estimation dataset with high-precision posture annotations (a labeling-loop sketch follows).
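  • A hedged sketch of the labeling loop of steps 6-9 is given below. All four arguments are placeholders for components described in the text: frames and sensor_angles come from step 5, predict_pose wraps the step-4 simulated-hand model (returning Angle_S), and kalman_fuse is the serial fusion routine detailed in steps 701-711 below.

```python
# Sketch of steps 6-9: label each dual-view frame pair with the fused posture.
import numpy as np

def build_labeled_dataset(frames, sensor_angles, predict_pose, kalman_fuse):
    """frames: iterable of (view1_image, view2_image) pairs from step 5;
    sensor_angles: per-pair gyroscope output Angle_I recorded at capture;
    predict_pose: wrapper around the step-4 model, returns Angle_S;
    kalman_fuse: the serial Kalman fusion routine (steps 701-711)."""
    dataset = []
    for (img1, img2), angle_i in zip(frames, sensor_angles):
        angle_s = predict_pose(img1, img2)        # virtual-sensor observation
        angle_f = kalman_fuse(angle_i, angle_s)   # fused posture Angle_F
        dataset.append(((img1, img2), np.asarray(angle_f)))  # fused pose = label
    return dataset
```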
  • In step 3, the posture estimation dataset of the simulated hand model is produced as follows:
  • Step 31: import the 3D model of the simulated hand designed in step 2 into the 3D modeling and simulation software and set up the coordinate system;
  • Step 32: in the 3D modeling software, set up a vision sensor capable of capturing RGB simulated-hand images from two different viewpoints and a posture sensor capable of outputting the three-axis posture angles of the simulated hand model;
  • Step 33: program the simulated hand model to rotate about the three-dimensional coordinate axes in the 3D modeling software, capture the simulated-hand images from the dual-view sensor at fixed intervals, record the sensor output posture angles at each capture, and save those angles as the labels of the dual-view images.
  • Collecting a large number of gesture images and posture data in this way completes the production of the simulated hand model's posture estimation dataset.
  • The specific steps for collecting the dual-view gesture image sequence of the real hand and the corresponding three-dimensional posture data sequence in step 5 are as follows:
  • Step 51: maintain the gesture form to be predicted with the posture sensor placed in the hand; the sensor element and the hand must not move relative to each other while the hand rotates;
  • Step 52: set up two ordinary RGB cameras with the same viewpoints as in step 3;
  • Step 53: rotate the wrist at a uniform speed and capture the gesture images of the two view cameras at fixed intervals, recording the posture data output by the in-hand sensor at each capture; the uniform rotation speed is chosen at random, and the two cameras' gesture images are captured automatically (a minimal capture sketch follows).
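  • The following is a minimal capture sketch for steps 51-53, assuming two USB cameras on indices 0 and 1 and a hypothetical read_imu_angles() callable that returns the in-hand sensor's (roll, pitch, yaw); the 5 Hz capture period is likewise an assumption.

```python
# Sketch of steps 51-53: timed, automatic dual-view capture with IMU readout.
import time
import cv2

def capture_dual_view(read_imu_angles, n_frames=500, period_s=0.2):
    cam1, cam2 = cv2.VideoCapture(0), cv2.VideoCapture(1)
    samples = []
    for _ in range(n_frames):
        ok1, img1 = cam1.read()                   # view-1 RGB frame
        ok2, img2 = cam2.read()                   # view-2 RGB frame
        if ok1 and ok2:
            samples.append((img1, img2, read_imu_angles()))  # images + Angle_I
        time.sleep(period_s)                      # fixed capture interval
    cam1.release()
    cam2.release()
    return samples
```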
  • In the Kalman filter posture data fusion of step 7, the model fuses two uncertain sets of gesture posture data into one more accurate set of posture angles:
  • first, the three-axis posture angle values output by the in-hand posture sensor (Angle_I, the actual physical sensor observation angles);
  • second, the posture angle values predicted for the collected real gesture images by the simulated-hand posture estimation model trained in step 4 (Angle_S, the virtual sensor observation angles). Both sets of data carry a certain uncertainty.
  • Angle_I is uncertain first because of the accuracy limits of the posture sensor itself, and second because a hand-held or patch-mounted sensor undergoes some relative displacement as the hand rotates, so its measurements deviate somewhat from the hand's true posture; Angle_S is uncertain first because its model was trained on simulated-hand images yet predicts on real human-hand images, which inevitably introduces error, and second because the estimation is also affected by factors such as the brightness and resolution of the gesture images.
  • The posture data Angle_I collected by the posture sensor can be regarded as obtained by an actual sensor, while the posture data Angle_S predicted by the simulated-hand model for real-hand images can be regarded as obtained by a virtual sensor. Kalman filter multi-sensor data fusion is therefore used to fuse these two uncertain sensor observations, obtaining fused posture annotations closer to the true values of the real-hand gesture images.
  • The system state vector X(k) at time k is chosen as the gesture's three-axis posture angles,
  • with dimension 3x1.
  • The first observation Z1 is the posture data Angle_I output by the sensor, and the second observation Z2 is the posture data Angle_S predicted for the real-hand images by the simulated-hand posture estimation model: Z1 = Angle_I = H1·X(k), Z2 = Angle_S = H2·X(k);
  • w(k-1) is the process noise of the system at time k-1, P(w) ~ N(0, Q);
  • v1(k) is the measurement noise at time k when the sensor-output posture data serves as system observation Z1, P(v1) ~ N(0, R1);
  • v2(k) is the measurement noise at time k when the posture data predicted from the gesture images by the posture estimation model serves as system observation Z2, P(v2) ~ N(0, R2).
  • The state equation first gives a prior estimate of the gesture posture angles; the angle Angle_I output by the posture sensor, as system observation Z1, applies the first observation correction to the state estimate; the angle Angle_S predicted from the gesture images by the posture estimation model, as system observation Z2, then applies the second observation correction to the corrected state.
  • The output after the two observation updates is the final fusion of the two data sets, Angle_F.
  • Step 701: initialize the parameters of the Kalman filter posture data fusion system.
  • Initialize the system state X(0), the system uncertainty covariance matrix P(0), the system state noise covariance matrix Q, the noise covariance matrix R1 for the sensor-output angle Angle_I as system observation Z1, and the noise covariance matrix R2 for the model-predicted angle Angle_S as system observation Z2.
  • Step 702: estimate the posture angle at time k from the optimal posture angle at time k-1: X^-(k) = A·X(k-1).
  • Step 703: compute the prior estimate of the system uncertainty covariance matrix: P^-(k) = A·P(k-1)·A^T + Q, where T denotes the matrix transpose.
  • Step 704: compute the Kalman gain K(k) from the data of system observation Z1:
  • K(k) = P^-(k)·H1^T·[H1·P^-(k)·H1^T + R1]^(-1)
  • Step 705: update the posterior uncertainty covariance matrix P(k) of the system:
  • P(k) = [I - K(k)·H1]·P^-(k), where I is the identity matrix.
  • Step 706: use the sensor-output posture angle Angle_I as observation Z1 to apply the first update correction to the posture, where Z1(k) denotes the value of Z1 at time k: X(k) = X^-(k) + K(k)·[Z1(k) - H1·X^-(k)].
  • Step 707: the above steps yield the system state after the first observation update (the gesture posture angle) and the system uncertainty covariance matrix P(k); the angle Angle_S predicted from the gesture images by the posture estimation model is then used as observation Z2 to apply a second update correction to the system state.
  • Step 708: compute the Kalman gain K(k) from the data of system observation Z2:
  • K(k) = P(k)·H2^T·[H2·P(k)·H2^T + R2]^(-1)
  • Step 709: update the uncertainty covariance P(k) of the system:
  • P(k) = [I - K(k)·H2]·P(k)
  • Step 710: use the angle Angle_S predicted from the gesture images by the posture estimation model as observation Z2 to apply the second update correction to the gesture posture, where Z2(k) denotes the value of Z2 at time k: X(k) = X(k) + K(k)·[Z2(k) - H2·X(k)].
  • The result is the gesture posture angle Angle_F obtained by Kalman filter fusion of the two sets of observations; this fused angle data is output;
  • Step 711: iterate steps 702-710, continuously fusing the two data sets and outputting high-precision gesture posture angle values (a minimal numpy sketch of this serial fusion follows).
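  • A minimal numpy sketch of this serial two-observation fusion is given below. Taking A = H1 = H2 = I3 and the diagonal Q, R1, R2 values shown are our assumptions; the patent leaves these matrices to be set per system.

```python
# Sketch of steps 701-711: serial Kalman fusion of Angle_I and Angle_S.
import numpy as np

class SerialKalmanFusion:
    def __init__(self):
        I3 = np.eye(3)
        self.A, self.H1, self.H2 = I3, I3, I3  # transition/observation (assumed I3)
        self.Q = 1e-4 * I3                     # process noise covariance (assumed)
        self.R1 = 4e-2 * I3                    # physical-sensor noise, Angle_I (assumed)
        self.R2 = 9e-2 * I3                    # model-prediction noise, Angle_S (assumed)
        self.x = np.zeros(3)                   # step 701: state X(0), three-axis angles
        self.P = I3.copy()                     # step 701: uncertainty covariance P(0)

    def fuse(self, angle_i, angle_s):
        """One iteration: predict, then correct with Z1=Angle_I and Z2=Angle_S."""
        self.x = self.A @ self.x                          # step 702: prior state
        self.P = self.A @ self.P @ self.A.T + self.Q      # step 703: prior P^-(k)
        for H, R, z in ((self.H1, self.R1, angle_i),
                        (self.H2, self.R2, angle_s)):
            K = self.P @ H.T @ np.linalg.inv(H @ self.P @ H.T + R)  # steps 704/708
            self.x = self.x + K @ (np.asarray(z) - H @ self.x)      # steps 706/710
            self.P = (np.eye(3) - K @ H) @ self.P                   # steps 705/709
        return self.x                                     # fused angle Angle_F
```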
  • For 3D posture estimation, the gesture posture estimation model is first trained and then used for prediction;
  • The training stage of the gesture posture estimation model consists of the following steps:
  • Step 10: train the CNN-based feature extractor Fe1 on all view-1 images of the dual-view gesture posture estimation dataset;
  • Step 11: as in step 10, train the CNN-based feature extractor Fe2 on all view-2 images of the dataset;
  • Step 12: use the feature extractors Fe1 and Fe2 trained in steps 10 and 11 to extract the deep features Fv1 and Fv2 of the gesture images of the respective views;
  • Step 13: for the dual-view features Fv1 and Fv2 of dual-view images captured at the same instant, perform left-right serial concatenation to generate a combined feature Fv1|2;
  • Step 14: for the combined feature sequence obtained in step 13, construct an ensemble learning gesture posture regressor based on Bayesian optimization, perform posture regression with the ensemble learning regression algorithm, and save the trained ensemble learning posture regression model.
  • Step 15: train a hand detection model to screen camera-captured images before real-time posture estimation and discard invalid images containing no human hand (an example detection sketch follows step 20);
  • Step 16: capture dual-view test gesture image frames from the same viewpoints as in the dual-view gesture posture estimation dataset;
  • Step 17: use the hand detection model trained in step 15 to perform hand detection on the dual-view test frames captured in step 16 and confirm whether the images contain a human hand;
  • Step 18: use the feature extractors trained in steps 10 and 11 to extract the deep features Fv1 and Fv2 of the dual-view test images that contain a human hand after detection;
  • Step 19: as in step 13, concatenate the test-image features Fv1 and Fv2 extracted in step 18 left-right to obtain a combined feature Fv1|2;
  • Step 20: input the combined test-image features into the ensemble learning posture regression model trained in step 14 for posture prediction, and output the predicted three-dimensional posture of the gesture.
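  • One possible off-the-shelf screen for step 15 is sketched below; using MediaPipe Hands here is our substitution for illustration, not the detector specified by the patent.

```python
# Sketch of steps 15/17: screen out frames that contain no human hand.
import cv2
import mediapipe as mp

hands = mp.solutions.hands.Hands(static_image_mode=True, max_num_hands=1)

def contains_hand(bgr_image):
    """True if at least one hand is detected in the BGR image."""
    result = hands.process(cv2.cvtColor(bgr_image, cv2.COLOR_BGR2RGB))
    return result.multi_hand_landmarks is not None
```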
  • The CNN-based feature extractor is trained as follows:
  • Step 101: select a CNN architecture capable of extracting deep image features;
  • Step 102: set the fully connected layer of the CNN of step 101 to a regression layer with a three-dimensional output;
  • Step 103: with all gesture images of a single view as network input and the gesture's three-axis posture angle labels as output, train the CNN to fit the gesture images to the three-axis posture angles;
  • Step 104: once CNN training has converged within the set range, stop training and save the network weights with the highest accuracy.
  • In step 12, when the trained CNN model processes a gesture image, the features extracted are the output of the network's last convolutional layer.
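  • A hedged PyTorch sketch of steps 101-104 and the step-12 extraction follows. The ResNet18 backbone matches the ResNet suggestion made later in the text, while the MSE loss, Adam optimizer, learning rate, and epoch count are assumptions; train_loader is a placeholder yielding one view's (images, angle_labels) batches.

```python
# Sketch of steps 101-104 (training) and step 12 (feature extraction), one view.
import torch
import torch.nn as nn
from torchvision import models

def train_view_extractor(train_loader, epochs=10):
    model = models.resnet18(weights=None)
    model.fc = nn.Linear(model.fc.in_features, 3)   # step 102: 3-dim regression head
    loss_fn = nn.MSELoss()
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)
    for _ in range(epochs):                         # step 104: stop once converged
        for images, angles in train_loader:         # step 103: fit images to angles
            opt.zero_grad()
            loss = loss_fn(model(images), angles)
            loss.backward()
            opt.step()
    # Step 12: drop the regression head; the trunk ends at the last convolutional
    # block plus global pooling, giving a 512-d deep feature per image.
    extractor = nn.Sequential(*list(model.children())[:-1], nn.Flatten())
    return model, extractor
```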
  • Constructing the ensemble learning gesture posture regressor in step 14 means selecting an ensemble learning regression algorithm with strong regression ability to perform posture regression on the extracted deep features of the dual-view gesture images, fitting the dual-view image features to the corresponding posture angle values.
  • The specific steps are as follows:
  • Step 141: apply feature dimensionality reduction to the combined deep features of the extracted and concatenated dual-view gesture images;
  • Step 142: construct a new gesture posture regression dataset from the reduced image features and the posture angle data corresponding to the images;
  • Step 143: construct a gesture posture regression model based on an ensemble learning regression algorithm, i.e., fit the gesture image features to the posture angle data;
  • Step 144: with the set of hyperparameter value ranges of the ensemble learning regression algorithm as the search space χ and minimization of the posture angle regression error as the objective function f(x), use Bayesian optimization to search for the optimal hyperparameter combination x* of the ensemble learning posture regression model so that the objective function attains its minimum;
  • Step 145: train the model with the optimal posture regression hyperparameter combination found in step 144 and save it.
  • In step 20, before prediction with the ensemble learning posture regression model trained in step 14, the deep features of the dual-view test gesture images must undergo the same feature dimensionality reduction as in step 141 (see the sketch below).
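  • The sketch below ties steps 141-145 and the step-20 note together: PCA reduction, a LightGBM regressor (one of the algorithms the text suggests), and a small Optuna study standing in for the Bayesian optimization of step 144. The PCA size, search space, trial count, and multi-output wrapping are assumptions; Fv1, Fv2, and angles are placeholders for the extracted features and labels.

```python
# Sketch of steps 141-145 and step 20: PCA + Bayesian-optimized LightGBM.
import numpy as np
import optuna
from lightgbm import LGBMRegressor
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.multioutput import MultiOutputRegressor

def fit_pose_regressor(Fv1, Fv2, angles, n_components=64, n_trials=30):
    Fv = np.hstack([Fv1, Fv2])                    # step 13: left-right splice Fv1|2
    pca = PCA(n_components=n_components).fit(Fv)  # step 141: dimensionality reduction
    X = pca.transform(Fv)                         # step 142: regression dataset

    def objective(trial):                         # step 144: minimize regression error
        params = dict(
            num_leaves=trial.suggest_int("num_leaves", 15, 127),
            learning_rate=trial.suggest_float("learning_rate", 1e-3, 0.3, log=True),
            n_estimators=trial.suggest_int("n_estimators", 100, 800),
        )
        reg = MultiOutputRegressor(LGBMRegressor(**params))
        score = cross_val_score(reg, X, angles, cv=3,
                                scoring="neg_mean_squared_error").mean()
        return -score

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=n_trials)
    best = MultiOutputRegressor(LGBMRegressor(**study.best_params)).fit(X, angles)
    return pca, best                              # step 145: keep both for prediction

# Step 20: the SAME fitted PCA must transform test features before regression, e.g.
# pred = best.predict(pca.transform(np.hstack([Fv1_test, Fv2_test])))
```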
  • Compared with the prior art, the beneficial effects of this scheme include: 1. this application proposes a method for producing a dual-view gesture image posture estimation dataset with high-precision posture labels based on Kalman filter fusion of virtual and real information; it solves the difficulty of annotating ordinary RGB images with postures, effectively overcomes the errors introduced by a single sensor, and yields a more accurate posture estimation dataset;
  • 2. the proposed gesture posture estimation method trains and predicts on dual-view images, effectively overcoming gesture self-occlusion and improving the accuracy of model posture estimation;
  • 3. the proposed method realizes 3D posture estimation on ordinary RGB images, giving wider applicability with simple, convenient operation;
  • 4. the proposed method targets a given fixed gesture, can perform posture estimation for any fixed gesture, and combines well with low-degree-of-freedom gesture applications.
  • FIG. 1 is an overall block diagram of an embodiment of the present invention.
  • FIG. 2 is a flow chart of the method for producing the posture estimation dataset of the present invention.
  • FIG. 3 is a flow chart of the Kalman-filter-based hand gesture angle data fusion of the present invention.
  • FIG. 4 is a flow chart of the model training stage of the dual-view-RGB-image-based 3D hand posture estimation method of the present invention.
  • FIG. 5 is a flow chart of the model testing stage of the dual-view-RGB-image-based 3D hand posture estimation method of the present invention.
  • This solution includes two parts: one is the production of a posture estimation dataset with high-precision posture annotations based on Kalman filter data fusion; the other is 3D gesture posture estimation of dual-view RGB images based on deep learning and ensemble learning.
  • The 3D gesture posture estimation is divided into the training stage and the prediction stage of the gesture posture estimation model.
  • The gesture posture estimation method based on deep learning and ensemble learning proposed in this application is also needed during dataset production, as the observation correction in the Kalman filtering. The gesture posture estimation method and the high-precision-annotation dataset production method of this application are therefore closely linked, yet can also be used separately.
  • This scheme produces a dual-view gesture image posture estimation dataset with high-precision posture labels based on Kalman filtering.
  • The steps are as follows:
  • Step 1: determine the fixed gesture form to be predicted, such as a Cartesian-coordinate-system hand shape;
  • Step 2: for the fixed hand shape determined in step 1, model it with modeling and simulation software to generate a file of a simulated hand model approximating that hand shape in form, skin color, texture, and other physical appearance characteristics;
  • Step 3: import the simulated hand model obtained in step 2 into the 3D simulation software, set up two cameras there, and then collect dual-view gesture images and three-dimensional posture data of the simulated hand model in the 3D simulation environment to produce the posture estimation dataset of the simulated hand model; the pose relationship between the two cameras and the simulated hand model in the 3D simulation software is similar to the pose relationship between human eyes and gestures;
  • Step 4: for the posture estimation dataset of the simulated hand model, use the gesture posture estimation method based on deep learning and ensemble learning proposed in the second part of this application to train the 3D posture estimation model of the simulated hand so that it can predict the three-dimensional gesture posture of simulated hand model images.
  • The specific operations are the same as steps 10-20;
  • Step 5: in the real environment, the real human hand likewise maintains the hand shape to be predicted, with a gyroscope, i.e., a posture sensor, placed in the hand.
  • The dual-view gesture image sequence of the hand rotating in three-dimensional space and the three-dimensional posture data sequence output by the posture sensor are likewise collected. Note that the dual-view camera positions must be similar to the dual views of step 2.
  • The posture obtained in this process is called the sensor output posture;
  • Step 6: input the dual-view real-hand image frames collected in step 5 into the simulated-hand posture estimation model trained on simulated-hand images in step 4 to perform posture prediction.
  • This posture data is called the model predicted posture.
  • Step 7: as shown in FIG. 2, because the simulated-hand posture estimation model of step 4 was trained on simulated-hand images, predicting directly on real-hand images produces some error; in addition, the posture data output by the posture sensor on the real hand in step 5 also carries error from operational factors such as the sensor's accuracy and sensitivity and its relative movement with the hand during use. The sensor output posture and the model predicted posture corresponding to a real-hand image are therefore both uncertain.
  • The sensor output posture and model predicted posture of the same group of dual-view gesture images predicted in step 6 are fused by Kalman filtering.
  • The two uncertain posture data are fused through Kalman filtering to output one accurate three-dimensional gesture posture datum, called the fused posture.
  • Kalman filtering here performs a multi-sensor posture data fusion operation: what is fused is the gesture posture data from different sensors, not an accuracy correction of a sensor itself.
  • Step 8: use the fused posture generated in step 7 as the label of the gesture images predicted in step 6, and save the gesture images and labels;
  • Step 9: apply steps 6, 7, and 8 to all dual-view real-hand image frames and corresponding sensor output postures collected in step 5 to obtain a real-hand image sequence with fused posture labels, thereby generating a gesture posture estimation dataset with high-precision posture annotations.
  • The specific steps for producing the posture estimation dataset of the simulated hand model in step 3 are as follows:
  • Step 31: import the 3D model of the simulated hand designed in step 2 into the 3D modeling and simulation software and set up the coordinate system;
  • Step 32: in the 3D modeling software, set up a vision sensor capable of capturing RGB simulated-hand images from two different viewpoints and a posture sensor capable of outputting the three-axis posture of the simulated hand model;
  • Step 33: program the simulated hand model to rotate about the three-dimensional coordinate axes in the 3D modeling software, capture the simulated-hand images from the dual-view sensor at fixed intervals, and record the sensor output posture angles at each capture.
  • The posture angles are saved as the labels of the dual-view images.
  • The method for training the simulated-hand posture estimation model in step 4 is the gesture posture estimation method based on deep learning and ensemble learning proposed in this application; its specific operations are the same as steps 10-20 below.
  • The specific steps for collecting the dual-view gesture image sequence of the real hand and the corresponding three-dimensional posture data sequence described in step 5 are as follows:
  • Step 51: maintain the gesture form to be predicted with the posture sensor placed in the palm; the sensor and the hand must not move relative to each other;
  • Step 52: set up two ordinary RGB cameras with the same viewpoints as in step 3;
  • Step 53: with the wrist rotating uniformly at a randomly chosen speed, capture the two view cameras' gesture images automatically at fixed intervals under program control, and record the in-hand posture sensor's output at each capture.
  • The structure and workflow of the Kalman-filter-based multi-data fusion of gesture postures during dataset production are as follows. Because the posture estimation model needs some time to predict the posture from a gesture image, Angle_I and Angle_S are acquired with a certain time offset. Therefore, when Kalman filtering fuses these two observations, serial processing is adopted: the system state is updated and corrected with the two sets of posture observations in turn, yielding the final fused posture data Angle_F.
  • The system state vector X(k) at time k is chosen as the gesture's three-axis posture angles,
  • with dimension 3x1.
  • The first observation Z1 is the posture data Angle_I output by the sensor, and the second observation Z2 is the posture data Angle_S predicted for the real-hand images by the simulated-hand posture estimation model: Z1 = Angle_I = H1·X(k), Z2 = Angle_S = H2·X(k);
  • w(k-1) is the process noise of the system at time k-1, P(w) ~ N(0, Q);
  • v1(k) is the measurement noise at time k when the sensor-output posture data serves as system observation Z1, P(v1) ~ N(0, R1);
  • v2(k) is the measurement noise at time k when the posture data predicted from the gesture images by the posture estimation model serves as system observation Z2, P(v2) ~ N(0, R2).
  • The state equation first gives a prior estimate of the gesture posture angles; the angle Angle_I output by the posture sensor, as system observation Z1, applies the first observation correction to the state estimate; the angle Angle_S predicted from the gesture images by the posture estimation model, as system observation Z2, then applies a second observation correction to the state after the first correction. The output after the two observation updates is the final fusion of the two sets of data.
  • Step 701: initialize the parameters of the Kalman filter posture data fusion system:
  • initialize the system state X(0), the system uncertainty covariance matrix P(0), the system state noise covariance matrix Q, the noise covariance matrix R1 for the sensor-output angle Angle_I as system observation Z1, and the noise covariance matrix R2 for the model-predicted angle Angle_S as system observation Z2.
  • Step 702: estimate the posture angle at time k from the posture angle at time k-1 according to the prior estimate: X^-(k) = A·X(k-1); steps 703-705 then compute the prior covariance P^-(k) = A·P(k-1)·A^T + Q, the Kalman gain K(k) = P^-(k)·H1^T·[H1·P^-(k)·H1^T + R1]^(-1), and the posterior covariance P(k) = [I - K(k)·H1]·P^-(k), as above.
  • Step 706: the posture angle Angle_I output by the sensor is used as observation Z1 to apply the first update correction to the posture, where Z1(k) denotes the value of Z1 at time k: X(k) = X^-(k) + K(k)·[Z1(k) - H1·X^-(k)].
  • Step 707: the above steps yield the system state after the first observation update (the gesture posture angle) and the system uncertainty covariance matrix P(k).
  • The angle Angle_S predicted from the gesture images by the posture estimation model is then used as observation Z2 to apply a second update correction to the system state; steps 708-709 recompute the gain and covariance with H2 and R2, as above.
  • Step 710: the angle Angle_S predicted from the gesture images by the posture estimation model is used as observation Z2 to apply the second update correction to the gesture posture, where Z2(k) denotes the value of Z2 at time k: X(k) = X(k) + K(k)·[Z2(k) - H2·X(k)].
  • The result is the gesture posture angle Angle_F obtained by Kalman filter fusion of the two sets of observations; this fused angle data is output.
  • Step 711: iterate steps 702-710, continuously fusing the two data sets and outputting high-precision gesture posture angle values.
  • Through the above steps, the Kalman filtering method fuses the two uncertain sets of posture data of the dual-view gesture images into one set of posture labels of higher accuracy, closer to the true data.
  • Gesture posture estimation uses the method based on a convolutional neural network (CNN) and ensemble learning for dual-view gesture images; the same method is also used in producing the high-precision-label dataset proposed in this application.
  • The gesture posture estimation method is divided mainly into two parts: model training and prediction.
  • The steps of the training stage of the posture estimation model are as follows:
  • Step 10: denote the two views as view 1 and view 2, and train the CNN-based feature extractor Fe1 on all view-1 images of the dual-view gesture posture estimation dataset.
  • The CNN may be a deep convolutional neural network such as ResNet;
  • Step 11: train the CNN-based feature extractor Fe2 on all view-2 images of the dataset;
  • Step 12: use the feature extractors Fe1 and Fe2 trained in steps 10 and 11 to extract the deep features Fv1 and Fv2 of the gesture images of the respective views;
  • Step 13: for the dual-view features Fv1 and Fv2 of dual-view images captured at the same instant, perform left-right serial concatenation to generate a combined feature Fv1|2;
  • Step 14: for the combined feature sequence obtained in step 13, construct an ensemble learning gesture posture regressor based on Bayesian optimization and perform posture regression with the ensemble learning regression algorithm.
  • The ensemble learning algorithm may be one with excellent regression performance such as LightGBM or CatBoost; finally, save the trained ensemble learning posture regression model.
  • The CNN-based feature extractor is trained as follows:
  • Step 101: select a CNN architecture capable of extracting deep image features.
  • The CNN may be a deep convolutional neural network such as ResNet;
  • Step 102: set the fully connected layer of the CNN of step 101 to a regression layer with a three-dimensional output;
  • Step 103: with all gesture images of a single view as network input and the gesture's three-axis posture angle labels as output, train the CNN to fit the gesture images to the three-axis posture angles;
  • Step 104: once CNN training has converged within a certain range, stop training and save the network weights with the highest accuracy.
  • In step 12, when the trained CNN model processes a gesture image, the features extracted are the output of the network's last convolutional layer.
  • Constructing the ensemble learning gesture posture regressor in step 14 means selecting an ensemble learning regression algorithm with strong regression ability to perform posture regression on the extracted deep features of the dual-view gesture images, fitting the dual-view image features to the corresponding posture angle values.
  • The steps are as follows:
  • Step 141: apply PCA feature dimensionality reduction to the combined deep features of the extracted and concatenated dual-view gesture images;
  • Step 142: construct a new gesture posture regression dataset from the reduced image features and the posture angle data corresponding to the images;
  • Step 143: construct a gesture posture regression model based on an ensemble learning regression algorithm, i.e., fit the gesture image features to the posture angle data;
  • Step 144: with the set of hyperparameter value ranges of the ensemble learning regression algorithm as the search space χ and minimization of the posture regression error as the objective function f(x), use Bayesian optimization to search for the optimal hyperparameter combination x* of the ensemble learning posture regression model so that the objective function attains its minimum;
  • Step 145: train the model with the optimal posture regression hyperparameter combination found in step 144 and save it.
  • The steps of the prediction stage of the posture estimation model are as follows:
  • Step 15: train a hand detection model to screen camera-captured images before real-time posture estimation and discard invalid images containing no human hand;
  • Step 16: capture dual-view test gesture image frames from the same viewpoints as in the dual-view gesture posture estimation dataset;
  • Step 17: first use the hand detection model trained in step 15 to perform hand detection on the dual-view test frames captured in step 16 and confirm whether the images contain a human hand, as shown in FIG. 5;
  • Step 18: use the feature extractors trained in steps 10 and 11 to extract the deep features Fv1 and Fv2 of the dual-view test images that contain a human hand after detection;
  • Step 19: as in step 13, concatenate the test-image features Fv1 and Fv2 extracted in step 18 left-right to obtain a combined feature Fv1|2;
  • Step 20: apply the same feature dimensionality reduction as in step 141 to the combined test-image features, then input them into the ensemble learning posture regression model trained in step 14 for posture prediction, and output the predicted three-dimensional posture of the gesture.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Geometry (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Biomedical Technology (AREA)
  • Computer Hardware Design (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Graphics (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The present invention relates to the technical field of computer vision, and in particular to a gesture posture estimation method based on Kalman filtering and deep learning. The invention comprises two main parts: first, the production of a posture estimation dataset with high-precision posture annotations based on Kalman filter data fusion, comprising a first stage of simulated-hand posture estimation and a second stage of real gesture image acquisition and posture data fusion; second, 3D gesture posture estimation of dual-view RGB images based on deep learning and ensemble learning, divided into the training and prediction stages of the gesture posture estimation model. During dataset production, the gesture posture estimation method based on deep learning and ensemble learning proposed in this application is also used, as the observation correction in the Kalman filtering.

Description

A gesture posture estimation method based on Kalman filtering and deep learning
Technical Field
The present invention relates to the technical fields of computer vision and human-computer interaction, and in particular to a gesture posture estimation method based on Kalman filtering and deep learning that fuses virtual and real information.
Background Art
As an important medium through which the human body interacts with the outside world, gestures are widely used in human-computer interaction, augmented reality, virtual reality, gesture recognition, and other fields. As novel interaction methods develop toward greater naturalness and convenience, gesture-based human-computer interaction has very important research significance and prospects in gaming and entertainment, medical care, smart homes, and the defense industry; accurate gesture posture estimation is a key link in using gestures for human-computer interaction and other applications.
At present, gesture posture estimation methods can be divided into methods based on wearable sensor devices and methods based on computer vision. Wearable-sensor methods require the user to wear a data glove or other external auxiliary device fitted with sensors, and obtain the position coordinates of the gesture joint points directly from the sensor components. Such methods are not easily affected by natural environmental factors such as lighting and background, and offer good robustness and stability; however, because the auxiliary equipment is generally expensive and highly precise, it requires cumbersome operating steps and maintenance and calibration, and once worn it constrains the hand's movement to some extent, so flexibility in use is low. The other class of methods, based on computer vision, performs model learning or data matching on gesture images and can be divided into 3D and 2D posture estimation according to the spatial dimension of the prediction result. Most research on 3D gesture posture estimation is based on depth images, which carry the depth information of the target object and greatly facilitate posture estimation research. However, the depth cameras that acquire depth images rely on structured light, binocular stereo vision, or time-of-flight imaging; they are very sensitive to lighting and other environmental factors, are unsuitable for outdoor and other special scenes, and are generally expensive and poorly portable. Compared with depth images, RGB images are more widely applicable, place low demands on the environment, and are easy to acquire, but they suffer from depth ambiguity, one of the difficulties in realizing 3D posture estimation, and are also hard to annotate with accurate posture data. The commonly used annotation method is to obtain the corresponding posture data with the help of an external physical sensor, but in actual use relative displacement between the sensor and the hand, and environmental effects on the sensor, readily introduce errors, so high-quality, high-precision datasets are relatively scarce. In addition, the high degrees of freedom and self-occlusion of the human hand remain problems that gesture posture estimation must overcome.
In some applications of vision-based gesture human-computer interaction, such as mobile vehicle-mounted gimbal monitoring, special armed mobile reconnaissance robots, and various simple robotic-arm structures, the controlled objects have few degrees of freedom, and one or a few simple hand shapes suffice for the corresponding control. Realizing 3D posture estimation of a fixed hand shape on 2D images therefore has great research significance and broad application prospects.
Because RGB images lack depth information, annotation methods that rely on a single external sensor device are still not accurate enough, first because of the accuracy limits of the sensor itself, and second because relative displacement between the sensor and the human hand is difficult to avoid during use; even a highly accurate sensor may therefore still exhibit large errors.
Summary of the Invention
In view of the above problems, the present invention proposes a gesture posture estimation method based on Kalman filtering and deep learning. In this method, 3D gesture posture estimation of a fixed hand shape is performed from dual-view RGB images, and Kalman filtering fuses the posture angle data output by the in-hand posture sensor when the gesture images are captured (the actual physical sensor observations) with the posture angle data predicted for the gesture images by a pre-trained gesture posture estimation model (the virtual sensor observations). Through Kalman filter fusion of the observations of one actual sensor and one virtual sensor, measurement errors not attributable to sensor accuracy, such as those caused by relative displacement between the sensor and the target object during use, can be effectively corrected.
The main flow of the dataset production method in this scheme is as follows. First, a simulated hand model of the hand shape to be predicted is built in a 3D simulation environment; dual-view RGB gesture images of the simulated hand model rotating uniformly in three-dimensional space, together with the corresponding three-dimensional posture data, are collected, and a 3D posture estimation model of the simulated hand is trained on the collected images and posture data. In the real environment, a human hand holds a posture sensor while maintaining the same hand shape as in the simulation; two RGB cameras, with viewpoints similar to those in the simulation, collect dual-view RGB images of the real hand rotating uniformly in three-dimensional space, and the gesture posture data output by the posture sensor at each capture is recorded. The collected dual-view real-hand RGB images are then fed to the trained simulated-hand posture estimation model for posture prediction, and a Kalman filter multi-sensor data fusion algorithm fuses the model-predicted posture data with the sensor-output posture data corresponding to the dual-view real-hand images, outputting high-precision posture annotations for those images. By collecting a large number of dual-view real-hand RGB images and fusing the posture data from these two different channels with Kalman filtering, a gesture posture estimation dataset with high-precision posture annotations is obtained, solving the problem that RGB images are difficult to annotate for lack of depth information. In addition, this application discloses a method for 3D posture estimation of fixed hand shapes from dual-view RGB images, which combines the excellent automatic feature extraction of deep learning with the robust regression fitting of ensemble learning algorithms: a CNN first extracts the deep features of the dual-view gesture images, and an ensemble learning algorithm then regresses the posture on those features, constructing a gesture posture estimation model that integrates the deep features of dual-view RGB gesture images. This method effectively overcomes the influence of gesture self-occlusion on prediction and solves the problem of 3D gesture posture estimation on ordinary 2D images.
The present invention provides the following technical solution: a gesture posture estimation method based on Kalman filtering and deep learning. First, a posture-annotated dual-view gesture posture estimation dataset based on Kalman filter data fusion is produced, comprising a first stage of simulated-hand posture estimation and a second stage of real gesture image acquisition and posture data fusion; second, 3D posture estimation is performed on the posture-annotated dual-view dataset, comprising the training and prediction stages of the gesture posture estimation model.
Steps 1-9 form the method for producing the posture estimation dataset with high-precision posture annotations based on Kalman filter data fusion: steps 1-4 are the first stage, simulated-hand posture estimation, and steps 5-9 are the second stage, real gesture image acquisition and posture data fusion. Steps 10-20 form the second part, the gesture posture estimation method based on deep learning and ensemble learning: steps 10-14 are the first stage, training of the gesture posture estimation model, and steps 15-20 are the second stage, model prediction. A high-quality dataset is a prerequisite for a learning-based posture estimation method to achieve the expected results.
When producing the dual-view gesture posture estimation dataset, simulated-hand posture estimation is performed first, followed by real gesture image acquisition and posture data fusion.
Simulated-hand posture estimation comprises the following steps.
Step 1: determine the fixed gesture form to be predicted, i.e., the fixed hand shape;
Step 2: for the fixed hand shape determined in step 1, perform 3D modeling with modeling and simulation software to generate a simulated hand model that approximates the physical appearance of the hand shape in form, skin color, texture, etc.;
Step 3: import the simulated hand model obtained in step 2 into the 3D simulation software and set up two cameras; then, in the 3D simulation environment, collect dual-view gesture images of the simulated hand model rotating in three-dimensional space together with its three-axis posture angle data, where φ is the roll angle, θ is the pitch angle, and ψ is the yaw angle, and produce the posture estimation dataset of the simulated hand model; the pose relationship between the two cameras and the simulated hand model in the 3D simulation software is the same as the pose relationship between human eyes and gestures;
Step 4: for the posture estimation dataset of the simulated hand model, use the present gesture posture estimation method based on deep learning and ensemble learning to train the 3D posture estimation model of the simulated hand, so that it can predict the three-dimensional gesture posture of simulated hand model images; the specific operations are the same as steps 10-20.
Real gesture image acquisition and posture data fusion comprise the following steps.
Step 5: in the real environment, a real human hand maintains the hand shape to be predicted, with a posture sensor, i.e., a gyroscope, placed in the hand; collect the dual-view gesture image sequence of the real hand rotating in three-dimensional space and the three-axis posture angle sequence output by the posture sensor, the dual-view camera positions matching the dual-view setup of step 2; the posture obtained in this process is called the sensor output posture;
Step 6: input the dual-view real-hand image frames collected in step 5 into the simulated-hand posture estimation model trained on simulated-hand images in step 4 and perform posture prediction; this posture data is called the model predicted posture;
Step 7: fuse the sensor output posture corresponding to the dual-view images of step 6 with the model's predicted posture for those images using Kalman filtering; after the two uncertain posture data are fused, accurate three-dimensional gesture posture data is output, called the fused posture; in this process, Kalman filtering performs a multi-sensor posture data fusion operation, fusing the gesture posture data from different sensors rather than correcting the internal accuracy of a sensor itself;
Step 8: use the fused posture generated in step 7 as the label of the gesture images collected in step 6 and save it;
Step 9: apply steps 6, 7, and 8 to all dual-view real gesture image frames and corresponding sensor output postures collected in step 5 to obtain a real-hand image sequence labeled with fused posture data, i.e., a gesture posture estimation dataset with high-precision posture annotations.
The specific steps for producing the posture estimation dataset of the simulated hand model in step 3 are as follows:
Step 31: import the 3D model of the simulated hand designed in step 2 into the 3D modeling and simulation software and set up the coordinate system;
Step 32: in the 3D modeling software, set up a vision sensor capable of capturing RGB simulated-hand images from two different viewpoints and a posture sensor capable of outputting the three-axis posture angles of the simulated hand model;
Step 33: program the simulated hand model to rotate about the three-dimensional coordinate axes in the 3D modeling software, capture the simulated-hand images from the dual-view sensor at fixed intervals, record the sensor output posture angles at each capture, and save those angles as the labels of the dual-view images; collecting a large number of gesture images and posture data in this way completes the production of the simulated hand model's posture estimation dataset.
The specific steps for collecting the dual-view gesture image sequence of the real hand and the corresponding three-dimensional posture data sequence in step 5 are as follows:
Step 51: maintain the gesture form to be predicted with the posture sensor placed in the hand; the sensor element and the hand must not move relative to each other while the hand rotates;
Step 52: set up two ordinary RGB cameras with the same viewpoints as in step 3;
Step 53: rotate the wrist at a uniform speed and capture the gesture images of the two view cameras at fixed intervals, recording the posture data output by the in-hand sensor at each capture; the uniform rotation speed is chosen at random, and the two cameras' gesture images are captured automatically.
In the Kalman filter posture data fusion of step 7, the model fuses two uncertain sets of gesture posture data into one more accurate set of posture angles: first, the three-axis posture angle values output by the hand-held posture sensor when the real gesture images are captured (Angle_I, the actual physical sensor observation angles); second, the posture angle values predicted for the collected real gesture images by the simulated-hand posture estimation model trained in step 4 (Angle_S, the virtual sensor observation angles). Both sets of data carry a certain uncertainty. Angle_I is uncertain first because of the accuracy limits of the posture sensor itself, and second because a hand-held or patch-mounted sensor undergoes some relative displacement as the hand rotates, so its measurements deviate somewhat from the hand's true posture; Angle_S is uncertain first because its model was trained on simulated-hand images yet predicts on real human-hand images, which inevitably introduces error, and second because the estimation is also affected by factors such as the brightness and resolution of the gesture images. The posture data Angle_I collected by the posture sensor can be regarded as obtained by an actual sensor, while the posture data Angle_S predicted by the simulated-hand model for real-hand images can be regarded as obtained by a virtual sensor. Kalman filter multi-sensor data fusion is therefore used to fuse these two uncertain sensor observations, obtaining fused posture annotations closer to the true values of the real-hand gesture images.
Because the gesture posture estimation model needs a certain time to predict the posture from a gesture image, Angle_I and Angle_S are acquired with a certain time offset. Therefore, when Kalman filtering fuses these two observations, serial processing is adopted: the system state is updated and corrected with the two sets of posture observations in turn, yielding the final fused posture data.
The Kalman filter gesture posture data fusion model is analyzed as follows.
First, determine the state vector of the system:
since both observations are the three-axis posture angles of the gesture, the system state vector X(k) at time k is chosen as the gesture's three-axis posture angles, X(k) = [φ(k), θ(k), ψ(k)]^T, with dimension 3x1.
Establish the state equation of the system and determine the state transition matrix A:
since there is no control input U(k), B = 0.
The system has two observations: the first, Z1, is the gesture posture data Angle_I output by the sensor; the second, Z2, is the posture data Angle_S predicted for the real-hand images by the simulated-hand posture estimation model:
Z1 = Angle_I = H1·X(k)
Z2 = Angle_S = H2·X(k)
The Kalman filter gesture posture data fusion system has state and observation equations of the following form:
X(k) = A·X(k-1) + w(k-1)    (1)
Z1 = H1·X(k) + v1(k)    (2)
Z2 = H2·X(k) + v2(k)    (3)
where w(k-1) is the process noise of the system at time k-1, P(w) ~ N(0, Q); v1(k) is the measurement noise at time k when the sensor-output posture data serves as system observation Z1, P(v1) ~ N(0, R1); and v2(k) is the measurement noise at time k when the posture data predicted from the gesture images by the posture estimation model serves as system observation Z2, P(v2) ~ N(0, R2).
The state equation first gives a prior estimate of the gesture posture angles; the angle Angle_I output by the posture sensor, as system observation Z1, applies the first observation correction to the state estimate; the angle Angle_S predicted from the gesture images by the posture estimation model, as system observation Z2, then applies a second observation correction to the state after the first correction. The output after the two observation updates is the final fusion of the two data sets, Angle_F.
The steps of the serial Kalman filter data fusion are as follows:
Step 701: initialize the parameters of the Kalman filter gesture posture data fusion system,
initializing the system state X(0), the system uncertainty covariance matrix P(0), the system state noise covariance matrix Q, the noise covariance matrix R1 for the sensor-output posture angle Angle_I as system observation Z1, and the noise covariance matrix R2 for the model-predicted posture angle Angle_S as system observation Z2.
Step 702: estimate the posture angle at time k from the optimal posture angle at time k-1:
X^-(k) = A·X(k-1)
Step 703: compute the prior estimate of the system uncertainty covariance matrix P^-(k):
P^-(k) = A·P(k-1)·A^T + Q
where T denotes the matrix transpose.
Step 704: compute the Kalman gain K(k) from the data of system observation Z1:
K(k) = P^-(k)·H1^T·[H1·P^-(k)·H1^T + R1]^(-1)
Step 705: update the posterior uncertainty covariance matrix P(k) of the system:
P(k) = [I - K(k)·H1]·P^-(k)
where I is the identity matrix.
Step 706: use the sensor-output posture angle Angle_I as observation Z1 to apply the first update correction to the posture, where Z1(k) denotes the value of Z1 at time k:
X(k) = X^-(k) + K(k)·[Z1(k) - H1·X^-(k)]
giving the gesture posture angle after the first update.
Step 707: the above steps yield the system state after the first observation update (the gesture posture angle) and the system uncertainty covariance matrix P(k); the posture angle Angle_S predicted from the gesture images by the posture estimation model is then used as observation Z2 to apply a second update correction to the system state.
Step 708: compute the Kalman gain K(k) from the data of system observation Z2:
K(k) = P(k)·H2^T·[H2·P(k)·H2^T + R2]^(-1)
Step 709: update the system uncertainty covariance P(k):
P(k) = [I - K(k)·H2]·P(k)
Step 710: use the posture angle Angle_S predicted from the gesture images by the posture estimation model as observation Z2 to apply the second update correction to the gesture posture, where Z2(k) denotes the value of Z2 at time k:
X(k) = X(k) + K(k)·[Z2(k) - H2·X(k)]
This is the gesture posture angle value Angle_F obtained by Kalman filter fusion of the two sets of observations; output this fused angle data, i.e., the fused gesture posture angle value;
Step 711: iterate steps 702-710, continuously fusing the two data sets and outputting high-precision gesture posture angle values.
The following is the second part of this application: 3D posture estimation on the dual-view gesture posture estimation dataset with high-precision posture annotations generated by steps 1-9, i.e., the method of 3D posture estimation on dual-view RGB images based on deep learning and ensemble learning; its operations are steps 10-20.
For 3D posture estimation, the gesture posture estimation model is first trained and then used for prediction;
the training stage of the gesture posture estimation model proceeds as follows:
Step 10: train the CNN-based feature extractor Fe1 on all view-1 images of the dual-view gesture posture estimation dataset;
Step 11: as in step 10, train the CNN-based feature extractor Fe2 on all view-2 images of the dual-view gesture posture estimation dataset;
Step 12: use the feature extractors Fe1 and Fe2 trained in steps 10 and 11 to extract the deep features Fv1 and Fv2 of the gesture images of the respective views;
Step 13: for the dual-view features Fv1 and Fv2 of dual-view images captured at the same instant, perform left-right serial concatenation to generate a combined feature Fv1|2;
Step 14: for the combined feature sequence obtained in step 13, construct an ensemble learning gesture posture regressor based on Bayesian optimization, perform posture regression with the ensemble learning regression algorithm, and save the trained ensemble learning posture regression model.
The prediction stage of the gesture posture estimation model proceeds as follows:
Step 15: train a hand detection model to screen camera-captured images before real-time posture estimation and discard invalid images containing no human hand;
Step 16: capture dual-view test gesture image frames from the same viewpoints as in the dual-view gesture posture estimation dataset;
Step 17: use the hand detection model trained in step 15 to perform hand detection on the dual-view test frames captured in step 16 and confirm whether the images contain a human hand;
Step 18: use the feature extractors trained in steps 10 and 11 to extract the deep features Fv1 and Fv2 of the dual-view test images that contain a human hand after detection;
Step 19: as in step 13, concatenate the test-image features Fv1 and Fv2 extracted in step 18 left-right to obtain a combined feature Fv1|2;
Step 20: input the combined test-image features into the ensemble learning posture regression model trained in step 14 for posture prediction, and output the predicted three-dimensional posture of the gesture.
The CNN-based feature extractors in steps 10 and 11 are trained as follows:
Step 101: select a CNN architecture capable of extracting deep image features;
Step 102: set the fully connected layer of the CNN of step 101 to a regression layer with a three-dimensional output;
Step 103: with all gesture images of a single view as network input and the gesture's three-axis posture angle labels as output, train the CNN to fit the gesture images to the three-axis posture angles;
Step 104: once CNN training has converged within the set range, stop training and save the network weights with the highest accuracy.
In step 12, when the trained CNN model processes a gesture image, the features extracted are the output of the network's last convolutional layer.
Constructing the ensemble learning gesture posture regressor in step 14 means selecting an ensemble learning regression algorithm with strong regression ability to perform posture regression on the extracted deep features of the dual-view gesture images, fitting the dual-view image features to the corresponding posture angle values. The specific steps are as follows:
Step 141: apply feature dimensionality reduction to the combined deep features of the extracted and concatenated dual-view gesture images;
Step 142: construct a new gesture posture regression dataset from the reduced image features and the posture angle data corresponding to the images;
Step 143: construct a gesture posture regression model based on an ensemble learning regression algorithm, i.e., fit the gesture image features to the posture angle data;
Step 144: with the set of hyperparameter value ranges of the ensemble learning regression algorithm as the search space χ and minimization of the posture angle regression error as the objective function f(x), use Bayesian optimization to search for the optimal hyperparameter combination x* of the ensemble learning posture regression model so that its objective function attains the minimum;
Step 145: train the model with the optimal posture regression hyperparameter combination found in step 144 and save it.
In step 20, before prediction with the ensemble learning posture regression model trained in step 14, the deep features of the dual-view test gesture images must undergo the same feature dimensionality reduction as in step 141.
From the above description it can be seen that, compared with the prior art, the beneficial effects of this scheme include: 1. this application proposes a method for producing a dual-view gesture image posture estimation dataset with high-precision posture labels based on Kalman filter fusion of virtual and real information; it solves the difficulty of annotating ordinary RGB images with postures, effectively overcomes the errors introduced by a single sensor, and yields a more accurate posture estimation dataset; 2. the proposed gesture posture estimation method trains and predicts on dual-view images, effectively overcoming gesture self-occlusion and improving the accuracy of model posture estimation; 3. the proposed method realizes 3D posture estimation on ordinary RGB images, giving wider applicability with simple, convenient operation; 4. the proposed method targets a given fixed gesture, can perform posture estimation for any fixed gesture, and combines well with low-degree-of-freedom gesture applications.
Brief Description of the Drawings
FIG. 1 is an overall block diagram of an embodiment of the present invention.
FIG. 2 is a flow chart of the method for producing the posture estimation dataset of the present invention.
FIG. 3 is a flow chart of the Kalman-filter-based gesture posture angle data fusion of the present invention.
FIG. 4 is a flow chart of the model training stage of the dual-view-RGB-image-based 3D gesture posture estimation method of the present invention.
FIG. 5 is a flow chart of the model testing stage of the dual-view-RGB-image-based 3D gesture posture estimation method of the present invention.
Detailed Description
The technical solutions in the specific embodiments of the present invention will now be described clearly and completely with reference to the accompanying drawings. Obviously, the embodiment described is only one embodiment of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
Note that the terms used here are intended only to describe specific embodiments and are not intended to limit the exemplary embodiments according to this application. As used here, unless the context clearly indicates otherwise, the singular forms are also intended to include the plural forms; furthermore, it should be understood that when the terms "comprising" and/or "including" are used in this specification, they indicate the presence of features, steps, operations, devices, components, and/or combinations thereof.
As FIG. 1 shows, this scheme comprises two parts: first, the production of a posture estimation dataset with high-precision posture annotations based on Kalman filter data fusion; second, 3D gesture posture estimation of dual-view RGB images based on deep learning and ensemble learning, the latter divided into the training and prediction stages of the gesture posture estimation model. The gesture posture estimation method based on deep learning and ensemble learning proposed in this application is also needed during dataset production, as the observation correction in the Kalman filtering. The gesture posture estimation method and the high-precision-annotation dataset production method of this application are therefore closely linked, yet can also be used separately.
As FIG. 2 shows, this scheme produces the dual-view gesture image posture estimation dataset with high-precision posture labels based on Kalman filtering as follows:
Step 1: determine the fixed gesture form to be predicted, such as a Cartesian-coordinate-system hand shape;
Step 2: for the fixed hand shape determined in step 1, model it with modeling and simulation software to generate a file of a simulated hand model approximating that hand shape in form, skin color, texture, and other physical appearance characteristics;
Step 3: import the simulated hand model obtained in step 2 into the 3D simulation software, set up two cameras there, and then collect dual-view gesture images and three-dimensional posture data of the simulated hand model in the 3D simulation environment to produce the posture estimation dataset of the simulated hand model; the pose relationship between the two cameras and the simulated hand model in the 3D simulation software is similar to the pose relationship between human eyes and gestures;
Step 4: for the posture estimation dataset of the simulated hand model, use the gesture posture estimation method based on deep learning and ensemble learning proposed in the second part of this application to train the 3D posture estimation model of the simulated hand so that it can predict the three-dimensional gesture posture of simulated hand model images; the specific operations are the same as steps 10-20;
Step 5: as shown in FIG. 2, in the real environment, the real human hand likewise maintains the hand shape to be predicted, with a gyroscope, i.e., a posture sensor, placed in the hand; the dual-view gesture image sequence of the hand rotating in three-dimensional space and the three-dimensional posture data sequence output by the posture sensor are likewise collected; note that the dual-view camera positions must be similar to the dual views of step 2; the posture obtained in this process is called the sensor output posture;
Step 6: input the dual-view real-hand image frames collected in step 5 into the simulated-hand posture estimation model trained on simulated-hand images in step 4 to perform posture prediction; this posture data is called the model predicted posture;
Step 7: as shown in FIG. 2, because the simulated-hand posture estimation model trained in step 4 was trained on simulated-hand images, predicting directly on real-hand images produces some error; in addition, the posture data output by the posture sensor on the real hand in step 5 also carries error from operational factors such as the sensor's accuracy and sensitivity and its relative movement with the hand during use. The sensor output posture and the model predicted posture corresponding to a real-hand image are therefore both uncertain; the sensor output posture and model predicted posture of the same group of dual-view gesture images predicted in step 6 are fused by Kalman filtering, and the two uncertain posture data are fused into one accurate three-dimensional gesture posture datum, called the fused posture; in this process, Kalman filtering performs a multi-sensor posture data fusion operation, fusing the gesture posture data from different sensors rather than correcting the internal accuracy of a sensor itself;
Step 8: use the fused posture generated in step 7 as the label of the gesture images predicted in step 6, and save the gesture images and labels;
Step 9: apply steps 6, 7, and 8 to all dual-view real-hand image frames and corresponding sensor output postures collected in step 5 to obtain a real-hand image sequence with fused posture labels, thereby generating a gesture posture estimation dataset with high-precision posture annotations.
The specific steps for producing the posture estimation dataset of the simulated hand model in step 3 are as follows:
Step 31: import the 3D model of the simulated hand designed in step 2 into the 3D modeling and simulation software and set up the coordinate system;
Step 32: in the 3D modeling software, set up a vision sensor capable of capturing RGB simulated-hand images from two different viewpoints and a posture sensor capable of outputting the three-axis posture of the simulated hand model;
Step 33: program the simulated hand model to rotate about the three-dimensional coordinate axes in the 3D modeling software, capture the simulated-hand images from the dual-view sensor at fixed intervals, record the sensor output posture angles at each capture, and save those angles as the labels of the dual-view images; collecting a large number of gesture images and posture data completes the production of the simulated hand model's posture estimation dataset.
The method for training the simulated-hand posture estimation model in step 4 is exactly the gesture posture estimation method based on deep learning and ensemble learning proposed in this application; its specific operations are the same as steps 10-20 below.
The specific steps for collecting the dual-view gesture image sequence of the real hand and the corresponding three-dimensional posture data sequence described in step 5 are as follows:
Step 51: maintain the gesture form to be predicted with the posture sensor placed in the palm; the sensor must not move relative to the hand while the hand rotates;
Step 52: set up two ordinary RGB cameras with the same viewpoints as in step 3;
Step 53: with the wrist rotating uniformly at a randomly chosen speed, capture the two view cameras' gesture images automatically at fixed intervals under program control, and record the in-hand posture sensor's output at each capture.
如图3,给出了在数据集制作过程中,基于卡尔曼滤波的手势姿态多数据融合的结构和操作流程。由于手势姿态估计模型对手势图像进行姿态预测需要一定的时间,即AngleI和AngleS的获取具有一定的时间差。因此,使用卡尔曼滤波对此两种观测数据进行融合时采用串行处理的方式,即对两组手势姿态的观测数据依次对系统状态进行更新校正,获得最终的手势姿态融合数据AngleF
对该卡尔曼滤波手势姿态的数据融合预测模型的分析如下:
首先确定系统的状态向量:
由于两个观测数据均为手势的三轴姿态角度,所以第k时刻的系统状态向量X(k)选择为手势三轴姿态角度维度为3*1。
Establish the state equation of the system and determine the state transition matrix A; since there is no control input U(k), B = 0.
The system has two observations: the first observation $Z_1$ is the gesture posture data Angle_I output by the sensor, and the second observation $Z_2$ is the posture data Angle_S predicted by the simulated-hand posture estimation model on the real-hand image:
$Z_1 = \mathrm{Angle}_I = H_1\,\mathrm{Angle}(k)$
$Z_2 = \mathrm{Angle}_S = H_2\,\mathrm{Angle}(k)$
The Kalman-filter gesture posture data fusion system has state and observation equations of the following form:
$\mathrm{Angle}(k) = A\,\mathrm{Angle}(k-1) + w(k-1)$        (1)
$Z_1 = H_1\,\mathrm{Angle}(k) + v_1(k)$        (2)
$Z_2 = H_2\,\mathrm{Angle}(k) + v_2(k)$        (3)
where w(k-1) is the process noise of the system at time k-1, with P(w) ~ N(0, Q); $v_1(k)$ is the measurement noise at time k of the system observation $Z_1$, i.e., the sensor-output posture data, with P(v_1) ~ N(0, R_1); and $v_2(k)$ is the measurement noise at time k of the system observation $Z_2$, i.e., the posture data predicted by the gesture posture estimation model on the gesture image, with P(v_2) ~ N(0, R_2).
The state equation first produces a prior estimate of the gesture posture angle; the posture angle Angle_I output by the posture sensor then serves as observation $Z_1$ for the first observation correction of the state estimate; the gesture posture angle Angle_S predicted by the gesture posture estimation model on the gesture image then serves as observation $Z_2$ for the second observation correction of the state after the first correction. The output after the two observation updates is the final fusion of the two data sets.
Specifically, the serial Kalman-filter data fusion proceeds as follows:
Step 701: initialize the parameters of the Kalman-filter gesture posture data fusion system:
initialize the system state $\hat{\mathrm{Angle}}(0)$, the system uncertainty covariance matrix P(0), the system state noise covariance matrix Q, the noise covariance matrix $R_1$ of the system observation $Z_1$ (the posture angle Angle_I output by the posture sensor), and the noise covariance matrix $R_2$ of the system observation $Z_2$ (the gesture posture angle Angle_S predicted by the gesture posture estimation model on the gesture image).
Step 702: from the prior estimate, predict the gesture posture angle at time k from the gesture posture angle at time k-1:
$\hat{\mathrm{Angle}}(k)^- = A\,\hat{\mathrm{Angle}}(k-1)$
Step 703: compute the prior estimate of the system uncertainty covariance matrix $P(k)^-$:
$P(k)^- = A\,P(k-1)\,A^{T} + Q$
Step 704: compute the Kalman gain K(k) from the system observation $Z_1$:
$K(k) = P(k)^- H_1^{T}\,[H_1 P(k)^- H_1^{T} + R_1]^{-1}$
Step 705: update the posterior uncertainty covariance matrix P(k) of the system:
$P(k) = [I - K(k) H_1]\,P(k)^-$
Step 706: as in Fig. 3, use the sensor-output posture angle Angle_I as observation $Z_1$ for the first update correction of the posture, $Z_1(k)$ denoting the value of observation $Z_1$ at time k:
$\hat{\mathrm{Angle}}(k) = \hat{\mathrm{Angle}}(k)^- + K(k)\,[Z_1(k) - H_1\,\hat{\mathrm{Angle}}(k)^-]$
which gives the first-updated gesture posture angle $\hat{\mathrm{Angle}}(k)$.
Step 707: the above steps yield the system state (gesture posture angle) $\hat{\mathrm{Angle}}(k)$ and the system uncertainty covariance matrix P(k) after the first observation update; the gesture posture angle Angle_S predicted by the gesture posture estimation model on the gesture image is now used as observation $Z_2$ for the second update correction of the system state.
Step 708: compute the Kalman gain K(k) from the system observation $Z_2$:
$K(k) = P(k) H_2^{T}\,[H_2 P(k) H_2^{T} + R_2]^{-1}$
Step 709: update the system uncertainty covariance P(k):
$P(k) = [I - K(k) H_2]\,P(k)$
Step 710: as in Fig. 3, use the posture angle Angle_S predicted by the gesture posture estimation model on the gesture image as observation $Z_2$ for the second update correction of the gesture posture, $Z_2(k)$ denoting the value of observation $Z_2$ at time k:
$\hat{\mathrm{Angle}}(k) = \hat{\mathrm{Angle}}(k) + K(k)\,[Z_2(k) - H_2\,\hat{\mathrm{Angle}}(k)]$
This is the gesture posture angle Angle_F after Kalman-filter fusion of the two observations; output this fused angle data.
Step 711: iterate Steps 702-710, continuously fusing the two data streams and outputting high-precision gesture posture angle values.
Through the above steps, the Kalman filter fuses the two uncertain sets of posture data of the dual-view gesture images into one set of posture labels with higher precision, closer to the true data.
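To make Steps 701-711 concrete, the following NumPy sketch implements the serial two-observation update; it assumes $A = H_1 = H_2 = I$ (consistent with a state that is directly the three posture angles), and the covariance values are illustrative defaults rather than values prescribed by this application.

```python
import numpy as np

def serial_kalman_fusion(angle_sensor_seq, angle_model_seq,
                         Q=np.eye(3) * 1e-3,
                         R1=np.eye(3) * 1e-2,
                         R2=np.eye(3) * 5e-2):
    """Serially fuse sensor-output (Angle_I) and model-predicted (Angle_S)
    posture angles per Steps 701-711. A = H1 = H2 = I is assumed."""
    A = H1 = H2 = I = np.eye(3)
    x = np.asarray(angle_sensor_seq[0], dtype=float)  # Step 701: initial state
    P = np.eye(3)                                     # initial uncertainty
    fused = []
    for z1, z2 in zip(angle_sensor_seq, angle_model_seq):
        # Steps 702-703: prior prediction of state and covariance.
        x = A @ x
        P = A @ P @ A.T + Q
        # Steps 704-706: first correction with the sensor observation Z1.
        K = P @ H1.T @ np.linalg.inv(H1 @ P @ H1.T + R1)
        x = x + K @ (np.asarray(z1) - H1 @ x)
        P = (I - K @ H1) @ P
        # Steps 708-710: second correction with the model observation Z2.
        K = P @ H2.T @ np.linalg.inv(H2 @ P @ H2.T + R2)
        x = x + K @ (np.asarray(z2) - H2 @ x)
        P = (I - K @ H2) @ P
        fused.append(x.copy())                        # Angle_F at time k
    return np.array(fused)
```

Feeding it the sensor-output sequence Angle_I and the model-predicted sequence Angle_S returns the fused sequence Angle_F, one row of three angles per frame.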
As shown in Figs. 4 and 5, gesture posture estimation uses a dual-view gesture-image posture estimation method based on a convolutional neural network (CNN) and ensemble learning; this method is also used in the production of the high-precision posture-labeled dataset proposed in this application. The method consists of two parts: model training and model prediction.
As in Fig. 4, the training stage of the gesture posture estimation model proceeds as follows:
Step 10: record the two views as View 1 and View 2, and train a CNN-based feature extractor Fe1 on all View 1 images in the dual-view gesture posture estimation dataset; the CNN can be a deep convolutional neural network such as ResNet;
Step 11: train a CNN-based feature extractor Fe2 on all View 2 images in the dual-view gesture posture estimation dataset;
Step 12: use the feature extractors Fe1 and Fe2 trained in Steps 10 and 11 to extract the deep features Fv1 and Fv2 of the gesture images of the respective views in the dual-view gesture posture estimation dataset;
Step 13: concatenate, left to right, the dual-view features Fv1 and Fv2 of dual-view images collected at the same instant in the dataset, generating one combined feature Fv1|2;
Step 14: construct a Bayesian-optimization-based ensemble-learning gesture posture regressor on the combined feature sequence obtained in Step 13, and perform gesture posture regression with an ensemble-learning regression algorithm; the ensemble-learning algorithm can be one with excellent regression performance such as LightGBM or CatBoost; finally, save the trained ensemble-learning posture regression model.
The CNN-based feature extractors in Steps 10 and 11 are trained as follows:
Step 101: select a CNN architecture capable of extracting deep image features; the CNN can be a deep convolutional neural network such as ResNet;
Step 102: set the fully connected layer of the CNN from Step 101 as a regression layer with a 3-dimensional output;
Step 103: take all gesture images of a single view as the network input and the three-axis posture angle labels of the gesture as the output, and train the CNN to fit gesture images to three-axis posture angles;
Step 104: once CNN training has converged to within a certain range, stop training and save the network weights with the highest accuracy.
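A minimal PyTorch sketch of Steps 101-104, assuming torchvision's ResNet18 as the backbone and mean-squared error on the three posture angles as the training loss (both illustrative choices; the text only requires a deep CNN with a 3-output regression layer):

```python
import torch
import torch.nn as nn
from torchvision import models

def build_angle_regressor() -> nn.Module:
    """Steps 101-102: ResNet backbone with a 3-way regression head."""
    net = models.resnet18(weights=None)          # or pretrained weights
    net.fc = nn.Linear(net.fc.in_features, 3)    # (roll, pitch, yaw) output
    return net

def train_extractor(net, loader, epochs=30, lr=1e-4, device="cuda"):
    """Steps 103-104: fit images to three-axis angle labels; keep best weights."""
    net = net.to(device)
    opt = torch.optim.Adam(net.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    best = float("inf")
    for _ in range(epochs):
        total = 0.0
        for imgs, angles in loader:              # loader yields (B,C,H,W), (B,3)
            opt.zero_grad()
            loss = loss_fn(net(imgs.to(device)), angles.to(device))
            loss.backward()
            opt.step()
            total += loss.item()
        if total < best:                         # save the best-performing weights
            best = total
            torch.save(net.state_dict(), "extractor_best.pt")
    return net
```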
In Step 12, when the trained CNN model extracts features from a gesture image, it extracts the output features of the network's last convolutional layer.
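With the sketch above, one plausible way to expose the last convolutional stage's output is simply to truncate the network before its regression head; for ResNet18 this yields a 512-dimensional feature vector per image after the built-in global average pooling:

```python
import torch
import torch.nn as nn

def make_feature_extractor(net: nn.Module) -> nn.Module:
    """Drop the final fully connected layer so the forward pass stops after
    the last convolutional stage plus global pooling, per Step 12."""
    return nn.Sequential(*list(net.children())[:-1])

@torch.no_grad()
def extract_features(extractor: nn.Module, imgs: torch.Tensor) -> torch.Tensor:
    extractor.eval()
    feats = extractor(imgs)              # shape (B, C, 1, 1) after avg-pool
    return feats.flatten(start_dim=1)    # shape (B, C)
```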
Constructing the ensemble-learning gesture posture regressor in Step 14 means selecting an ensemble-learning regression algorithm with strong regression capability to perform posture regression on the extracted deep features of the dual-view gesture images, fitting the dual-view gesture image features to the corresponding gesture posture angle values, as follows:
Step 141: apply dimensionality reduction such as PCA to the combined deep features of the extracted and concatenated dual-view gesture images;
Step 142: construct a new gesture posture regression dataset from the dimensionality-reduced gesture image features and the posture angle data corresponding to the images;
Step 143: construct a gesture posture regression model based on an ensemble-learning regression algorithm, i.e., fit the gesture image features to the posture angle data;
Step 144: take the set of hyperparameter value ranges of the ensemble-learning regression algorithm as the search space χ and the minimization of the gesture posture regression error as the objective function f(x), and use Bayesian optimization to search for the optimal hyperparameter combination x* of the ensemble-learning gesture posture regression model that minimizes the objective function;
Step 145: train the model with the optimal gesture posture regression hyperparameter combination found in Step 144 and save it.
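A sketch of Steps 141-145 under the following assumptions: PCA with 128 components for Step 141, LightGBM as the ensemble regressor, three-fold cross-validated mean squared error as f(x), and Optuna as the optimizer (its default TPE sampler is a sequential model-based search in the spirit of the Bayesian optimization named here). None of these concrete choices are mandated by the text.

```python
import numpy as np
import optuna                                    # pip install optuna
from lightgbm import LGBMRegressor               # pip install lightgbm
from sklearn.decomposition import PCA
from sklearn.model_selection import cross_val_score
from sklearn.multioutput import MultiOutputRegressor

def fit_pose_regressor(F_combined: np.ndarray, angles: np.ndarray, n_trials=50):
    """Steps 141-145: reduce features, search hyperparameters, train, return."""
    pca = PCA(n_components=128).fit(F_combined)        # Step 141
    X = pca.transform(F_combined)                      # Step 142 dataset

    def objective(trial: optuna.Trial) -> float:       # Step 144: f(x) over χ
        params = {
            "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
            "num_leaves": trial.suggest_int("num_leaves", 15, 255),
            "learning_rate": trial.suggest_float("learning_rate",
                                                 1e-3, 0.3, log=True),
        }
        model = MultiOutputRegressor(LGBMRegressor(**params))
        # Negative MSE from sklearn -> minimize the positive regression error.
        score = cross_val_score(model, X, angles, cv=3,
                                scoring="neg_mean_squared_error").mean()
        return -score

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=n_trials)
    best = MultiOutputRegressor(LGBMRegressor(**study.best_params))  # Step 145
    best.fit(X, angles)
    return pca, best
```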
As in Fig. 5, the prediction stage of the gesture posture estimation model proceeds as follows:
Step 15: train a hand detection model to screen camera-captured images before real-time gesture posture estimation, discarding invalid images that contain no human hand;
Step 16: collect dual-view test gesture image frames with the same views as in the dual-view gesture posture estimation dataset;
Step 17: first use the hand detection model trained in Step 15 to perform hand detection on the dual-view test image frames collected in Step 16 and confirm whether the images contain a human hand, as shown in Fig. 5;
Step 18: for the dual-view test images that contain a human hand after hand detection, extract their deep features Fv1 and Fv2 with the feature extractors trained in Steps 10 and 11;
Step 19: as in Step 13, concatenate, left to right, the dual-view test image features Fv1 and Fv2 extracted in Step 18 to obtain the combined feature Fv1|2;
Step 20: apply the same dimensionality reduction as in Step 141 to the obtained combined test-image features, then input them into the ensemble-learning gesture posture regression model trained in Step 14 for posture prediction, and output the predicted three-dimensional posture of the gesture.
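Tying Steps 15-20 together, an inference pass might be sketched as follows, reusing extract_features from the Step 12 sketch and the pca/regressor pair from the Steps 141-145 sketch; detect_hand stands in for whatever hand detection model Step 15 produces and is purely hypothetical.

```python
import torch

def predict_pose(img_v1, img_v2, fe1, fe2, pca, regressor, detect_hand):
    """Steps 15-20: screen, extract, concatenate, reduce, regress (sketch)."""
    # Step 17: discard frames without a hand (detector is a stand-in).
    if not (detect_hand(img_v1) and detect_hand(img_v2)):
        return None
    # Step 18: per-view deep features from the trained extractors.
    f1 = extract_features(fe1, img_v1.unsqueeze(0))
    f2 = extract_features(fe2, img_v2.unsqueeze(0))
    # Step 19: left-right serial concatenation -> combined feature Fv1|2.
    f12 = torch.cat([f1, f2], dim=1).numpy()
    # Step 20: Step-141-style reduction, then ensemble regression.
    angles = regressor.predict(pca.transform(f12))
    return angles[0]          # predicted (roll, pitch, yaw)
```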
The above are only preferred embodiments of the present disclosure and are not intended to limit it; for those skilled in the art, the present disclosure may have various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present disclosure shall fall within the scope of protection of the present disclosure.

Claims (7)

  1. A gesture posture estimation method based on Kalman filtering and deep learning, characterized in that:
    a dual-view gesture posture estimation dataset with posture labels based on Kalman-filter data fusion is produced, comprising a first stage of simulated-hand posture estimation and a second stage of real gesture image collection and posture data fusion;
    3D posture estimation is performed for the posture-labeled dual-view gesture posture estimation dataset, comprising a training stage and a prediction stage of the gesture posture estimation model;
    when producing the dual-view gesture posture estimation dataset, simulated-hand posture estimation is performed first, followed by real gesture image collection and posture data fusion;
    the simulated-hand posture estimation comprises the following steps:
    Step 1: determining the fixed gesture shape to be predicted, i.e., a fixed hand shape;
    Step 2: for the fixed hand shape determined in Step 1, performing 3D modeling on the fixed hand shape with modeling and simulation software to generate a simulated hand model approximating the physical appearance characteristics of the hand shape, the physical appearance characteristics including shape, skin color, and texture;
    Step 3: importing the simulated hand model obtained in Step 2 into the 3D simulation software, setting up two cameras in the 3D simulation software, and then collecting, in the 3D simulation environment software, the dual-view gesture images and three-axis posture angle data of the simulated hand model as it rotates in three-dimensional space, where φ is the roll angle, θ is the pitch angle, and ψ is the yaw angle, to produce the posture estimation dataset of the simulated hand model; the pose relationship between the two cameras and the simulated hand model in the 3D simulation software is the same as that between human eyes and the hand gesture;
    Step 4: applying the gesture posture estimation method based on deep learning and ensemble learning to the posture estimation dataset of the simulated hand model, and training a 3D posture estimation model of the simulated hand so that the 3D posture estimation model can predict the three-dimensional gesture posture from simulated-hand model images;
    the real gesture image collection and posture data fusion comprises the following steps:
    Step 5: in a real environment, a real human hand holds the gesture shape to be predicted, with a posture sensor placed in the hand; the dual-view gesture image sequence and the three-axis posture angle data sequence output by the posture sensor are collected while the real human hand rotates in three-dimensional space; the positions of the dual-view camera views are the same as the dual-view setup in Step 3; the posture from this process is called the sensor-output posture;
    Step 6: inputting the dual-view real-hand image frames collected in Step 5 into the simulated-hand posture estimation model trained on simulated-hand images in Step 4 for posture prediction, the resulting posture data being called the model-predicted posture;
    Step 7: fusing, with a Kalman filter, the sensor-output posture corresponding to the dual-view images predicted in Step 6 and the model-predicted posture of those images; the two uncertain posture data are fused by the Kalman filter to output accurate three-dimensional gesture posture data, called the fused posture; in this process the Kalman filter performs multi-sensor posture data fusion, fusing gesture posture data from different sensors;
    Step 8: using the fused gesture posture generated in Step 7 as the label of the gesture images processed in Step 6, and saving them;
    Step 9: processing all dual-view real gesture image frames collected in Step 5 and the corresponding sensor-output postures according to Steps 6, 7 and 8 to obtain a real-hand image sequence with fused-posture data labels, i.e., generating a gesture posture estimation dataset with high-precision posture labels;
    when performing 3D posture estimation, the gesture posture estimation model is first trained and then used for prediction;
    the training stage of the gesture posture estimation model comprises the following steps:
    Step 10: recording the two views as View 1 and View 2, and training a feature extractor Fe1 based on a convolutional neural network (CNN) on all View 1 images in the dual-view gesture posture estimation dataset;
    Step 11: training a CNN-based feature extractor Fe2 on all View 2 images in the dual-view gesture posture estimation dataset;
    Step 12: using the feature extractors Fe1 and Fe2 trained in Steps 10 and 11 to extract the deep features Fv1 and Fv2 of the gesture images of the respective views in the dual-view gesture posture estimation dataset;
    Step 13: concatenating, left to right, the dual-view features Fv1 and Fv2 of dual-view images collected at the same instant in the dataset to generate one combined feature Fv1|2;
    Step 14: constructing a Bayesian-optimization-based ensemble-learning gesture posture regressor on the combined feature sequence obtained in Step 13, performing gesture posture regression with an ensemble-learning regression algorithm, and saving the trained ensemble-learning posture regression model;
    the prediction stage of the gesture posture estimation model comprises the following steps:
    Step 15: training a hand detection model for screening camera-captured images before real-time gesture posture estimation, discarding invalid images that contain no human hand;
    Step 16: collecting dual-view test gesture image frames with the same views as in the dual-view gesture posture estimation dataset;
    Step 17: using the hand detection model trained in Step 15 to perform hand detection on the dual-view test image frames collected in Step 16 and confirm whether the images contain a human hand;
    Step 18: for the dual-view images that contain a human hand after hand detection, extracting the deep features Fv1 and Fv2 of the dual-view test images with the feature extractors trained in Steps 10 and 11;
    Step 19: as in Step 13, concatenating, left to right, the dual-view test image features Fv1 and Fv2 extracted in Step 18 to obtain the combined feature Fv1|2;
    Step 20: inputting the obtained combined test-image features into the ensemble-learning gesture posture regression model trained in Step 14 for posture prediction, and outputting the predicted three-dimensional posture of the gesture.
  2. The gesture posture estimation method based on Kalman filtering and deep learning according to claim 1, characterized in that
    the specific steps of producing the posture estimation dataset of the simulated hand model in Step 3 are as follows:
    Step 31: importing the 3D model of the simulated hand designed in Step 2 into the 3D modeling and simulation software and setting up the coordinate system;
    Step 32: setting up, in the 3D modeling software, vision sensors capable of capturing RGB simulated-hand images from two different views and a posture sensor capable of outputting the three-axis posture angles of the simulated hand model;
    Step 33: rotating the simulated hand model around the three spatial coordinate axes in the 3D modeling software, periodically collecting the simulated-hand images captured by the dual-view sensors while recording the sensor-output posture angles at capture time, and saving the posture angles as labels of the dual-view images; collecting the gesture images and posture data completes the production of the posture estimation dataset of the simulated hand model.
  3. The gesture posture estimation method based on Kalman filtering and deep learning according to claim 2, characterized in that
    the specific steps of collecting the dual-view gesture image sequence of the real hand and the corresponding three-dimensional posture data sequence in Step 5 are as follows:
    Step 51: holding the gesture shape to be predicted with a posture sensor placed in the hand, the posture sensor element not moving relative to the hand while the hand rotates;
    Step 52: setting up two ordinary RGB cameras with the same views as in Step 3;
    Step 53: rotating the wrist at a uniform speed, capturing the gesture images of the two view cameras at timed intervals, and recording the posture data output by the posture sensor in the hand at capture time.
  4. The gesture posture estimation method based on Kalman filtering and deep learning according to claim 3, characterized in that
    the serial Kalman-filter data fusion comprises the following steps:
    Step 701: initializing the parameters of the Kalman-filter gesture posture data fusion system,
    initializing the system state $\hat{\mathrm{Angle}}(0)$, the system uncertainty covariance matrix P(0), the system state noise covariance matrix Q, the noise covariance matrix $R_1$ of the system observation $Z_1$ (the posture angle Angle_I output by the posture sensor), and the noise covariance matrix $R_2$ of the system observation $Z_2$ (the gesture posture angle Angle_S predicted by the gesture posture estimation model on the gesture image),
    Step 702: estimating the gesture posture angle $\hat{\mathrm{Angle}}(k)^-$ at time k from the optimal gesture posture angle at time k-1:
    $\hat{\mathrm{Angle}}(k)^- = A\,\hat{\mathrm{Angle}}(k-1)$
    Step 703: computing the prior estimate of the system uncertainty covariance matrix $P(k)^-$:
    $P(k)^- = A\,P(k-1)\,A^{T} + Q$
    where T denotes the matrix transpose,
    Step 704: computing the Kalman gain K(k) from the system observation $Z_1$:
    $K(k) = P(k)^- H_1^{T}\,[H_1 P(k)^- H_1^{T} + R_1]^{-1}$
    Step 705: updating the posterior uncertainty covariance matrix P(k) of the system:
    $P(k) = [I - K(k) H_1]\,P(k)^-$
    where I is the identity matrix,
    Step 706: using the sensor-output posture angle Angle_I as observation $Z_1$ for the first update correction of the posture, $Z_1(k)$ denoting the value of the observation at time k:
    $\hat{\mathrm{Angle}}(k) = \hat{\mathrm{Angle}}(k)^- + K(k)\,[Z_1(k) - H_1\,\hat{\mathrm{Angle}}(k)^-]$
    which gives the first-updated gesture posture angle $\hat{\mathrm{Angle}}(k)$,
    Step 707: obtaining, from the above steps, the system state after the first observation update and the system uncertainty covariance matrix P(k), and using the gesture posture angle Angle_S predicted by the gesture posture estimation model on the gesture image as observation $Z_2$ for the second update correction of the system state,
    Step 708: computing the Kalman gain K(k) from the system observation $Z_2$:
    $K(k) = P(k) H_2^{T}\,[H_2 P(k) H_2^{T} + R_2]^{-1}$
    Step 709: updating the system uncertainty covariance P(k):
    $P(k) = [I - K(k) H_2]\,P(k)$
    Step 710: using the posture angle Angle_S predicted by the gesture posture estimation model on the gesture image as observation $Z_2$ for the second update correction of the gesture posture, $Z_2(k)$ denoting the value of observation $Z_2$ at time k:
    $\hat{\mathrm{Angle}}(k) = \hat{\mathrm{Angle}}(k) + K(k)\,[Z_2(k) - H_2\,\hat{\mathrm{Angle}}(k)]$
    this being the gesture posture angle Angle_F after Kalman-filter fusion of the two observations, which is output as the fused gesture posture angle value,
    Step 711: iterating Steps 702-710, continuously fusing the two data streams and outputting high-precision gesture posture angle values.
  5. The gesture posture estimation method based on Kalman filtering and deep learning according to claim 1, characterized in that
    the CNN-based feature extractors are trained as follows:
    Step 101: selecting a CNN architecture capable of extracting deep image features;
    Step 102: setting the fully connected layer of the CNN from Step 101 as a regression layer with a 3-dimensional output;
    Step 103: taking all gesture images of a single view as the network input and the three-axis posture angle labels of the gesture as the output, and training the CNN to fit gesture images to three-axis posture angles;
    Step 104: once CNN training has converged to within the set range, stopping training and saving the network weights with the highest accuracy.
  6. The gesture posture estimation method based on Kalman filtering and deep learning according to claim 5, characterized in that
    in Step 12, when the trained CNN model extracts features from a gesture image, the output features of the network's last convolutional layer are extracted.
  7. The gesture posture estimation method based on Kalman filtering and deep learning according to claim 6, characterized in that
    the specific steps of constructing the ensemble-learning gesture posture regressor in Step 14 are as follows:
    Step 141: performing dimensionality reduction on the combined deep features of the extracted and concatenated dual-view gesture images;
    Step 142: constructing a new gesture posture regression dataset from the dimensionality-reduced gesture image features and the posture angle data corresponding to the images;
    Step 143: constructing a gesture posture regression model based on an ensemble-learning regression algorithm, i.e., fitting the gesture image features to the posture angle data;
    Step 144: taking the set of hyperparameter value ranges of the ensemble-learning regression algorithm as the search space χ and the minimization of the gesture posture angle regression error as the objective function f(x), and using Bayesian optimization to search for the optimal hyperparameter combination x* of the ensemble-learning gesture posture regression model that minimizes the objective function;
    Step 145: training the model with the optimal gesture posture regression hyperparameter combination found in Step 144 and saving it.


