CN113296604A - True 3D gesture interaction method based on convolutional neural network - Google Patents

True 3D gesture interaction method based on convolutional neural network

Info

Publication number
CN113296604A
CN113296604A (application CN202110564285.1A)
Authority
CN
China
Prior art keywords
gesture
model
neural network
real
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110564285.1A
Other languages
Chinese (zh)
Other versions
CN113296604B (en)
Inventor
王琼华
张力
李小伟
李大海
马孝铭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Beihang University
Original Assignee
Sichuan University
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University, Beihang University filed Critical Sichuan University
Priority to CN202110564285.1A priority Critical patent/CN113296604B/en
Publication of CN113296604A publication Critical patent/CN113296604A/en
Application granted granted Critical
Publication of CN113296604B publication Critical patent/CN113296604B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 - Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 - Static hand or arm
    • G06V40/113 - Recognition of static hand signs
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Abstract

The invention provides a true 3D gesture interaction method based on a convolutional neural network. Gesture data are captured with a Leap Motion controller, and the semantics of each gesture are obtained by combining the gesture instruction predicted by the Leap Motion with the gesture instruction predicted by a trained neural network model. The recognized gestures drive interaction with the 3D image, the interacted 3D model is rendered with backward ray tracing, a spatial bounding-box technique enables real-time rendering of the 3D model, and the rendered 3D model is finally displayed on an integrated imaging 3D display. The invention markedly improves the accuracy of gesture interaction and the user experience while presenting a real 3D scene.

Description

True 3D gesture interaction method based on convolutional neural network
1. Technical Field
The invention belongs to the technical field of interaction, and particularly relates to a true 3D gesture interaction method based on a convolutional neural network.
2. Background Art
With the rapid development of the digital information era and 3D display technology, traditional display platforms are gradually falling out of favor because of their single display form and poor user experience; they are being replaced by new media technologies that attract users' attention and enable interactive experiences. There is therefore a demand for a 3D image interaction technology that supports human-computer interaction. Gesture interaction is an ergonomic interaction mode that can quickly and naturally express simple user intentions. However, existing true 3D display platforms suffer from low gesture recognition accuracy and a poor user experience, and are not yet practical.
3. Summary of the Invention
The invention provides a true 3D gesture interaction method based on a convolutional neural network. A somatosensory interaction device acquires gesture data, and gesture images with their corresponding gesture semantics are fed into a purpose-designed convolutional neural network for training. The final gesture semantics are corrected by combining the semantics output by the Leap Motion with the semantics output by the trained convolutional neural network; the defined gestures are shown in FIG. 1. The gestures interact with the 3D image, the interacted model is rendered with backward ray tracing, a spatial bounding-box technique enables real-time rendering of the 3D model, and the rendered 3D model is finally displayed on an integrated imaging 3D display. While presenting a real 3D scene, the method offers the user a more accurate gesture interaction mode and improves the user experience. The method comprises three processes: gesture interaction with the 3D model together with gesture semantic correction, real-time rendering of the 3D model, and real-time display of the 3D image.
In the gesture interaction process of the 3D model, the scaling factors of the 3D model are controlled by the five-finger spread-and-pinch gesture (gesture 100 in FIG. 1), and the scaling operation on the three-dimensional affine coordinates can be expressed as:

$$\begin{bmatrix} L' \\ W' \\ H' \\ 1 \end{bmatrix} = \begin{bmatrix} S_1 & 0 & 0 & 0 \\ 0 & S_2 & 0 & 0 \\ 0 & 0 & S_3 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} L \\ W \\ H \\ 1 \end{bmatrix}$$

where L, W, and H are the length, width, and height of the maximum bounding box of the 3D model, and $S_1$, $S_2$, $S_3$ are the scaling factors applied to L, W, and H, respectively.
Translating the palm controls the movement of the 3D model (gesture 102 in FIG. 1), and the translation operation on the three-dimensional affine coordinates can be expressed as:

$$\begin{bmatrix} x' \\ y' \\ z' \\ 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & T_x \\ 0 & 1 & 0 & T_y \\ 0 & 0 & 1 & T_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}$$

where x, y, z are the coordinates of the 3D model centroid and $T_x$, $T_y$, $T_z$ are the offsets along x, y, and z, respectively.
Circling the index finger controls the rotation of the 3D model (gesture 101 in FIG. 1). The rotation operations of the three-dimensional affine coordinates about the x-axis, y-axis, and z-axis can be expressed respectively as:

$$R_x(\theta)=\begin{bmatrix}1&0&0&0\\0&\cos\theta&-\sin\theta&0\\0&\sin\theta&\cos\theta&0\\0&0&0&1\end{bmatrix},\quad R_y(\theta)=\begin{bmatrix}\cos\theta&0&\sin\theta&0\\0&1&0&0\\-\sin\theta&0&\cos\theta&0\\0&0&0&1\end{bmatrix},\quad R_z(\theta)=\begin{bmatrix}\cos\theta&-\sin\theta&0&0\\\sin\theta&\cos\theta&0&0\\0&0&1&0\\0&0&0&1\end{bmatrix}$$
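To make the three affine operations concrete, a minimal Python/NumPy sketch is given below. The matrix layouts follow the standard homogeneous-coordinate forms above; the function names and the example values are illustrative assumptions, not code from the patent.

```python
import numpy as np

def scale(s1, s2, s3):
    """4x4 homogeneous scaling matrix for factors S1, S2, S3."""
    return np.diag([s1, s2, s3, 1.0])

def translate(tx, ty, tz):
    """4x4 homogeneous translation matrix for offsets Tx, Ty, Tz."""
    m = np.eye(4)
    m[:3, 3] = [tx, ty, tz]
    return m

def rotate_x(theta):
    """4x4 homogeneous rotation about the x-axis by angle theta (radians)."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[1.0, 0.0, 0.0, 0.0],
                     [0.0,   c,  -s, 0.0],
                     [0.0,   s,   c, 0.0],
                     [0.0, 0.0, 0.0, 1.0]])

# Example: apply a gesture-driven transform to the model centroid (x, y, z).
p = np.array([0.2, -0.1, 0.5, 1.0])   # homogeneous point
p_new = translate(0.1, 0.0, 0.0) @ rotate_x(np.pi / 12) @ scale(1.2, 1.2, 1.2) @ p
```

Rotations about the y-axis and z-axis follow the same pattern with the corresponding matrices above.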
where θ is the change in the rotation angle of the palm. The gesture semantic correction process trains a gesture set with deep learning; the network structure is shown in FIG. 2. By capturing pictures of the gestures during interaction and attaching labels, a training set containing M samples is formed:

$$D=\{(I_i, y_i)\}_{i=1}^{M}$$

where $I_i$ is the i-th image and $y_i=\{y_{i0}, y_{i1}, y_{i2}, \ldots, y_{i(C-1)}\}$ is the corresponding annotation; if the sample is labeled as category c, then $y_{ic}=1$, and 0 otherwise. Given an image, the network produces a score vector $s_i=\{s_{i0}, s_{i1}, s_{i2}, \ldots, s_{i(C-1)}\}$, from which the corresponding probability vector $p_i=\{p_{i0}, p_{i1}, p_{i2}, \ldots, p_{i(C-1)}\}$ is computed by the softmax function, $p_{ic}=\mathrm{softmax}(s_{ic})$. Cross entropy is taken as the target loss function:

$$L=-\frac{1}{M}\sum_{i=1}^{M}\sum_{c=0}^{C-1} y_{ic}\log p_{ic}$$
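A small NumPy sketch of the softmax and cross-entropy computations just described; the variable names mirror the notation above ($s_i$, $p_i$, $y_i$), while the toy batch shapes are assumptions for illustration.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax over class scores s_i, yielding probabilities p_i."""
    z = scores - scores.max(axis=1, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels_onehot, eps=1e-12):
    """Mean cross-entropy loss L over M samples and C classes."""
    m = probs.shape[0]
    return -np.sum(labels_onehot * np.log(probs + eps)) / m

# Toy example: M = 2 samples, C = 3 gesture classes.
s = np.array([[2.0, 0.5, -1.0],
              [0.1, 1.5,  0.3]])
y = np.array([[1, 0, 0],
              [0, 1, 0]])
loss = cross_entropy(softmax(s), y)
```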
the network model is trained by loss L in an end-to-end mode, specifically, a labeled data set is used for training, an Adam optimizer is used for optimizing, when the error L reaches a stable state, the network is trained completely, the training is stopped, then probability vectors calculated by Leap Motion are synthesized, different weights are defined according to different instructions, and finally a predicted value is calculated, wherein the process is shown in the attached figure 3.
In the real-time rendering process of the 3D model, the lens parameters, the integrated imaging 3D display parameters, and the 3D model are input first. After the parameters are input, a three-dimensional scene group, a virtual camera, and an image plane are created to preprocess the input data. The interaction module then checks whether an interaction instruction has been detected; if so, the parameters of the integrated light-field visual model are changed and the pipeline enters the rendering module. In the rendering module, real-time rendering of the 3D model is achieved with ray tracing and the bounding-box technique.
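As an illustration of the bounding-box acceleration mentioned here, below is a minimal slab-method ray/AABB intersection test in Python. The patent does not publish its rendering code, so this is a generic sketch: rays whose bounding-box test fails can skip the more expensive ray/mesh intersection used for radiance evaluation.

```python
import numpy as np

def ray_aabb_hit(origin, direction, box_min, box_max):
    """Slab-method test: does the ray origin + t*direction hit the AABB?

    Returns (hit, t_near), where t_near is the entry distance along the ray.
    """
    inv_d = 1.0 / np.where(direction == 0.0, 1e-12, direction)  # avoid divide by zero
    t1 = (box_min - origin) * inv_d
    t2 = (box_max - origin) * inv_d
    t_near = np.max(np.minimum(t1, t2))
    t_far  = np.min(np.maximum(t1, t2))
    return (t_far >= max(t_near, 0.0)), t_near

hit, t = ray_aabb_hit(np.array([0.0, 0.0, -5.0]),
                      np.array([0.0, 0.0, 1.0]),
                      np.array([-1.0, -1.0, -1.0]),
                      np.array([ 1.0,  1.0,  1.0]))
```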
In the 3D image display process, the micro-image array is fed to the integrated imaging 3D display to present a true 3D image with stereoscopic vision. The overall setup is shown in FIG. 4, in which 400 is the Leap Motion gesture information acquisition device, 401 is the real-time information processing device, and 402 is the integrated imaging 3D display.
The invention addresses the gesture interaction problem of true 3D display by means of a convolutional neural network and provides a high-accuracy true 3D gesture interaction method.
4. Description of the Drawings
FIG. 1 is a diagram of an interaction gesture.
FIG. 2 is a diagram of a convolutional neural network architecture in accordance with the present invention.
FIG. 3 is a schematic diagram of a convolutional neural network-based gesture recognition system according to the present invention.
FIG. 4 is a diagram of the overall effect of true 3D gesture interaction based on a convolutional neural network.
Reference numerals in the drawings: 100, five-finger spread-and-pinch gesture; 101, index-finger circling gesture; 102, palm translation gesture; 400, Leap Motion gesture information acquisition device; 401, real-time information processing device; 402, integrated imaging 3D display.
It should be understood that the above-described figures are merely schematic and are not drawn to scale.
5. Detailed Description of the Invention
The following exemplary embodiment of the convolutional neural network-based true 3D gesture interaction method further details the present invention. It should be noted that the embodiment is for illustrative purposes only and should not be construed as limiting the scope of the invention; those skilled in the art may modify and vary the invention without departing from its scope.
The true 3D gesture interaction method based on integrated imaging specifically comprises three processes: gesture interaction with the 3D model, real-time rendering of the 3D model, and real-time display of the 3D image.
In the gesture interaction process of the 3D model, the interaction commands are three interaction gestures, defined by computing the moving direction, speed, and displacement of the hand as well as the changes of the pitch, roll, and yaw angles from the palm direction and normal vector, the center and radius of the palm sphere, and the direction and position of the fingers detected by the Leap Motion device, as shown in FIG. 1: the five-finger spread-and-pinch gesture 100, the index-finger circling gesture 101, and the palm translation gesture 102 (a sketch of deriving the palm angles follows the equations below). Scaling, moving, and rotating the 3D model are realized by matrix operations on the three-dimensional scene group of the 3D model. The scaling of the three-dimensional affine coordinates can be expressed as:
$$\begin{bmatrix} L' \\ W' \\ H' \\ 1 \end{bmatrix} = \begin{bmatrix} S_1 & 0 & 0 & 0 \\ 0 & S_2 & 0 & 0 \\ 0 & 0 & S_3 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} L \\ W \\ H \\ 1 \end{bmatrix}$$

where L, W, and H are the length, width, and height of the maximum bounding box of the 3D model, and $S_1$, $S_2$, $S_3$ are the scaling factors applied to L, W, and H, respectively.
The translation operation on the three-dimensional affine coordinates can be expressed as:

$$\begin{bmatrix} x' \\ y' \\ z' \\ 1 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 & T_x \\ 0 & 1 & 0 & T_y \\ 0 & 0 & 1 & T_z \\ 0 & 0 & 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}$$

where x, y, z are the coordinates of the 3D model centroid and $T_x$, $T_y$, $T_z$ are the offsets along x, y, and z, respectively.
The rotation operations of the three-dimensional affine coordinates about the x-axis, y-axis, and z-axis can be expressed respectively as:

$$R_x(\theta)=\begin{bmatrix}1&0&0&0\\0&\cos\theta&-\sin\theta&0\\0&\sin\theta&\cos\theta&0\\0&0&0&1\end{bmatrix},\quad R_y(\theta)=\begin{bmatrix}\cos\theta&0&\sin\theta&0\\0&1&0&0\\-\sin\theta&0&\cos\theta&0\\0&0&0&1\end{bmatrix},\quad R_z(\theta)=\begin{bmatrix}\cos\theta&-\sin\theta&0&0\\\sin\theta&\cos\theta&0&0\\0&0&1&0\\0&0&0&1\end{bmatrix}$$

where θ is the change in the rotation angle of the palm.
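As a sketch of how the pitch, yaw, and roll angles mentioned above might be derived from the palm direction and normal vectors reported by the sensor, the Python snippet below uses common Leap Motion-style conventions (y up, -z toward the screen). The exact conventions and formulas are assumptions for illustration, not the patent's own.

```python
import numpy as np

def palm_angles(direction, normal):
    """Estimate pitch, yaw, roll (radians) from palm direction/normal vectors.

    direction: unit vector from the palm toward the fingers
    normal:    unit vector pointing out of the palm
    The axis conventions here are assumptions for illustration only.
    """
    pitch = np.arctan2(direction[1], -direction[2])   # rotation about the x-axis
    yaw   = np.arctan2(direction[0], -direction[2])   # rotation about the y-axis
    roll  = np.arctan2(normal[0],    -normal[1])      # rotation about the z-axis
    return pitch, yaw, roll

pitch, yaw, roll = palm_angles(np.array([0.0, 0.1, -0.99]),
                               np.array([0.0, -0.99, -0.1]))
```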
the method mainly comprises two parts of data preprocessing and convolutional neural network training, wherein the network structure is shown in figure 2, the data preprocessing is to compress and graye an obtained gesture image, finally, corresponding instruction labels are attached, and a training sample containing M training samples is established
Figure BDA0003080315220000054
In which IiRepresenting the ith image, yi={yi0,yi1,yi2,...,yi(c-1)The data set entered for the corresponding annotation. The convolutional neural network framework is composed of 8 convolutional layers in which the convolutional kernel size is 3 × 3 and ReLU is used as an activation function, and two fully-connected layers and 2 pooling layers. The convolutional layer performs feature extraction on the image, and finally generates a feature map with an output channel of 1024. Then inputting the probability vector into a full connection layer, introducing a Dropout mechanism in the full connection process, preventing over-fitting of the network, enhancing the robustness of the network, and finally calculating a corresponding probability vector p through a softmax functioni={pi0,pi1,pi2,...,pi(c-1)During the training process, we adopt cross entropy as a loss function of the target:
Figure BDA0003080315220000055
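Under the architecture described (eight 3 × 3 convolution layers with ReLU, two pooling layers, two fully connected layers with Dropout, softmax output), one plausible PyTorch realization is sketched below. The channel progression, pooling positions, input size, and the global average pooling before the classifier are assumptions, since the text does not fix them; only the layer counts and the 1024-channel feature map come from the description.

```python
import torch
import torch.nn as nn

class GestureCNN(nn.Module):
    """Sketch of the 8-conv / 2-pool / 2-FC gesture classifier (C classes)."""
    def __init__(self, num_classes: int):
        super().__init__()
        def block(c_in, c_out):
            return nn.Sequential(nn.Conv2d(c_in, c_out, 3, padding=1), nn.ReLU(inplace=True))
        self.features = nn.Sequential(
            block(1, 64), block(64, 64), block(64, 128), block(128, 128),
            nn.MaxPool2d(2),                       # first pooling layer
            block(128, 256), block(256, 256), block(256, 512), block(512, 1024),
            nn.MaxPool2d(2),                       # second pooling layer
        )
        self.classifier = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(1024, 256), nn.ReLU(inplace=True),
            nn.Dropout(0.5),                       # Dropout against over-fitting
            nn.Linear(256, num_classes),           # scores s_i; softmax applied in the loss
        )

    def forward(self, x):
        return self.classifier(self.features(x))

model = GestureCNN(num_classes=3)
scores = model(torch.randn(4, 1, 64, 64))          # grayscale gesture crops (assumed size)
```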
the network model is trained in an end-to-end mode through loss L, specifically, a labeled data set is used for training, an Adam optimizer is used for optimizing, when the error L reaches a stable state, the fact that the network is trained is indicated, and the training is stopped. And then testing different instructions, weighting the predicted value of the trained network model and the predicted value of the Leap Motion, wherein different instructions have different weight values, and finally obtaining the predicted value of the gesture interaction system, wherein the schematic block diagram of the gesture interaction system is shown in the attached figure 3.
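A hedged sketch of the end-to-end training loop with an Adam optimizer and cross-entropy loss follows. The learning rate, epoch budget, and the loss-stability check are placeholders for the "error L reaches a stable state" criterion; `model` and `train_loader` are assumed to be defined elsewhere (for example, the CNN sketch above and a DataLoader over the labeled gesture set).

```python
import torch
import torch.nn as nn

# `model` and a DataLoader `train_loader` of (image, label) pairs are assumed to exist.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()          # softmax + cross-entropy in one step

prev_loss = float("inf")
for epoch in range(100):
    epoch_loss = 0.0
    for images, labels in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(images), labels)
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    # Stop once the loss has stabilized (placeholder for the patent's stopping criterion).
    if abs(prev_loss - epoch_loss) < 1e-4:
        break
    prev_loss = epoch_loss
```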
In the real-time rendering process of the 3D model, the lens parameters, the display screen parameters, and the 3D model are input first. After the parameters are input, a three-dimensional scene group, a virtual camera, and an image plane are created to preprocess the input data. The interaction module then checks whether an interaction instruction has been detected; if so, the parameters of the integrated light-field visual model are changed and the pipeline enters the rendering module, which applies ray tracing and the bounding-box technique.
In the display process of the 3D image, a true 3D image with interactive functionality is shown on the integrated imaging 3D display by tuning the relevant parameters between the display and the image elements. The overall setup is shown in FIG. 4, in which 400 is the Leap Motion gesture information acquisition device, 401 is the real-time information processing device, and 402 is the integrated imaging 3D display.

Claims (1)

1. A true 3D gesture interaction method based on a convolutional neural network, characterized in that a Leap Motion somatosensory controller is adopted to acquire gesture data information; the semantics of a gesture are output by combining the gesture instruction predicted by the Leap Motion with the gesture instruction predicted by a trained neural network model; the three-dimensional affine coordinates of a 3D model are changed so that zooming, translation, and rotation interactions are realized on the 3D model; the interacted 3D image is rendered with backward ray tracing; a spatial bounding-box technique achieves real-time rendering of the 3D model during rendering; and the rendered 3D model is finally displayed on an integrated imaging 3D display; the specific steps are as follows:
step one: a Leap Motion somatosensory controller acquires gesture data information; the semantics of a gesture are output by combining the gesture instruction predicted by the Leap Motion with the gesture instruction predicted by the trained neural network model; three interactive gestures, five-finger spread-and-pinch, palm translation, and index-finger circling, are defined; and the scaling, moving, and rotating of the 3D model are realized by matrix transformations of the three-dimensional affine coordinates of the 3D model;
step two: real-time rendering of the 3D model, wherein the rendering module uses ray tracing and the bounding-box technique to compute the incident radiance value at the closest intersection point between a ray emitted from the ray-emitting plane and the surface of the 3D model, taking it as the color value of the corresponding pixel of the elemental image array, thereby generating the micro-image array;
step three: real-time display of the 3D image, wherein the micro-image array generated in step two is fed to the integrated imaging 3D display to present a true 3D image with stereoscopic vision.
CN202110564285.1A 2021-05-24 2021-05-24 True 3D gesture interaction method based on convolutional neural network Active CN113296604B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110564285.1A CN113296604B (en) 2021-05-24 2021-05-24 True 3D gesture interaction method based on convolutional neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110564285.1A CN113296604B (en) 2021-05-24 2021-05-24 True 3D gesture interaction method based on convolutional neural network

Publications (2)

Publication Number Publication Date
CN113296604A true CN113296604A (en) 2021-08-24
CN113296604B CN113296604B (en) 2022-07-08

Family

ID=77324142

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110564285.1A Active CN113296604B (en) 2021-05-24 2021-05-24 True 3D gesture interaction method based on convolutional neural network

Country Status (1)

Country Link
CN (1) CN113296604B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104050859A (en) * 2014-05-08 2014-09-17 南京大学 Interactive digital stereoscopic sand table system
CN108182728A (en) * 2018-01-19 2018-06-19 武汉理工大学 A kind of online body-sensing three-dimensional modeling method and system based on Leap Motion
US20190107894A1 (en) * 2017-10-07 2019-04-11 Tata Consultancy Services Limited System and method for deep learning based hand gesture recognition in first person view
CN109657634A (en) * 2018-12-26 2019-04-19 中国地质大学(武汉) A kind of 3D gesture identification method and system based on depth convolutional neural networks
CN109933097A (en) * 2016-11-21 2019-06-25 清华大学深圳研究生院 A kind of robot for space remote control system based on three-dimension gesture

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104050859A (en) * 2014-05-08 2014-09-17 南京大学 Interactive digital stereoscopic sand table system
CN109933097A (en) * 2016-11-21 2019-06-25 清华大学深圳研究生院 A kind of robot for space remote control system based on three-dimension gesture
US20190107894A1 (en) * 2017-10-07 2019-04-11 Tata Consultancy Services Limited System and method for deep learning based hand gesture recognition in first person view
CN108182728A (en) * 2018-01-19 2018-06-19 武汉理工大学 A kind of online body-sensing three-dimensional modeling method and system based on Leap Motion
CN109657634A (en) * 2018-12-26 2019-04-19 中国地质大学(武汉) A kind of 3D gesture identification method and system based on depth convolutional neural networks

Also Published As

Publication number Publication date
CN113296604B (en) 2022-07-08

Similar Documents

Publication Publication Date Title
Liu et al. Semantic-aware implicit neural audio-driven video portrait generation
US11600013B2 (en) Facial features tracker with advanced training for natural rendering of human faces in real-time
Memo et al. Head-mounted gesture controlled interface for human-computer interaction
US11727596B1 (en) Controllable video characters with natural motions extracted from real-world videos
TWI654539B (en) Virtual reality interaction method, device and system
Dash et al. Designing of marker-based augmented reality learning environment for kids using convolutional neural network architecture
US20220301295A1 (en) Recurrent multi-task convolutional neural network architecture
CN111124117B (en) Augmented reality interaction method and device based on sketch of hand drawing
CN113255457A (en) Animation character facial expression generation method and system based on facial expression recognition
CN113034652A (en) Virtual image driving method, device, equipment and storage medium
CN112954292B (en) Digital museum navigation system and method based on augmented reality
WO2024001095A1 (en) Facial expression recognition method, terminal device and storage medium
CN115008454A (en) Robot online hand-eye calibration method based on multi-frame pseudo label data enhancement
Albanis et al. Dronepose: photorealistic uav-assistant dataset synthesis for 3d pose estimation via a smooth silhouette loss
Bhakar et al. A review on classifications of tracking systems in augmented reality
CN113296604B (en) True 3D gesture interaction method based on convolutional neural network
CN114049678B (en) Facial motion capturing method and system based on deep learning
CN113076918B (en) Video-based facial expression cloning method
KR20230078502A (en) Apparatus and method for image processing
CN115761143A (en) 3D virtual reloading model generation method and device based on 2D image
TW202311815A (en) Display of digital media content on physical surface
CN115222917A (en) Training method, device and equipment for three-dimensional reconstruction model and storage medium
CN114779942A (en) Virtual reality immersive interaction system, equipment and method
Okamoto et al. Assembly assisted by augmented reality (A 3 R)
EP4275176A1 (en) Three-dimensional scan registration with deformable models

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant