CN110362210B - Human-computer interaction method and device integrating eye movement tracking and gesture recognition in virtual assembly - Google Patents

Human-computer interaction method and device integrating eye movement tracking and gesture recognition in virtual assembly

Info

Publication number
CN110362210B
CN110362210B (application CN201910670994.0A)
Authority
CN
China
Prior art keywords
gesture
eye movement
gesture recognition
eye
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910670994.0A
Other languages
Chinese (zh)
Other versions
CN110362210A (en)
Inventor
杨晓晖
察晓磊
徐涛
冯志全
吕娜
范雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microcrystalline Shushi (Shandong) Equipment Technology Co.,Ltd.
Original Assignee
University of Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Jinan filed Critical University of Jinan
Priority to CN201910670994.0A priority Critical patent/CN110362210B/en
Publication of CN110362210A publication Critical patent/CN110362210A/en
Application granted granted Critical
Publication of CN110362210B publication Critical patent/CN110362210B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 - Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013 - Eye tracking input arrangements
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 - Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 - Static hand or arm
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 - Eye characteristics, e.g. of the iris
    • G06V40/197 - Matching; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2203/00 - Indexing scheme relating to G06F3/00 - G06F3/048
    • G06F2203/01 - Indexing scheme relating to G06F3/01
    • G06F2203/012 - Walk-in-place systems for allowing a user to walk in a virtual environment while constraining him to a given position in the physical environment

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Ophthalmology & Optometry (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The disclosure provides a human-computer interaction method and device that integrate eye-movement tracking and gesture recognition in virtual assembly. The fixation point is tracked from the acquired eye-movement data; gestures are recognized from the acquired gesture information; the gesture-recognition data and eye-movement data are labeled to form a training set, and a multi-stream convolutional neural network-long short-term memory (CNN-LSTM) network model is built and trained on that set by self-learning. The trained optimal network model is then applied to the virtual assembly process: eye-movement data and gesture information are acquired, eye-movement and gesture features are extracted, and the operator's behavior category is inferred from these features to complete the assembly task. The method resolves the misjudgment of similar behaviors under a single modality and exploits deep learning to recognize operator behavior in video with high accuracy, completing the virtual assembly task and realizing human-computer interaction.

Description

Human-computer interaction method and device integrating eye movement tracking and gesture recognition in virtual assembly
Technical Field
The disclosure belongs to the technical field of human-computer interaction, and particularly relates to a human-computer interaction method and device integrating eye tracking and gesture recognition in virtual assembly.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
With the development of information technology, human-computer interaction has gradually shifted from being machine-centered to being human-centered: in a virtual environment, the machine acts as an auxiliary tool that helps the human complete an interaction task. Virtual reality technology has also developed rapidly, and virtual assembly, as one of its applications, is widely used in product design verification and computer-aided assembly planning. Virtual assembly uses data processing, human-computer interaction, and visualization technologies so that product development can move directly from the design stage into a virtual product stage. Work-efficiency analysis, design, analysis, and evaluation are completed with a virtual human-machine model, yielding a reasonable assembly sequence and assembly path and allowing the assembly process and the accessibility, maintainability, and visibility of the assembly to be checked.
Virtual assembly greatly reduces the manpower and material cost of manual assembly and removes its operational hazards. In addition, virtual assembly is not tied to a particular time or place: as long as the equipment requirements are met, the assembly process can be completed anywhere, which greatly lowers the cost of manual assembly.
Interaction technology plays an important role in virtual assembly. The techniques commonly used today are either traditional methods, such as interacting with the virtual scene through a mouse, a touch screen, or an infrared pen, or gesture-recognition-based interaction, which supports moving, disassembling, and similar operations on objects in the virtual scene. These methods, however, still fall short in the naturalness and accuracy of interaction.
Disclosure of Invention
To address these problems, a human-computer interaction method and device that fuse eye tracking and gesture recognition in virtual assembly are provided. First, fusing the two modalities of eye tracking and gesture recognition reduces the misjudgment rate of similar behaviors under a single modality and improves the accuracy of virtual assembly. Second, a deep learning algorithm realizes the fusion: a multi-stream Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) model is built that learns eye-movement and gesture features by itself and exploits the strength of the LSTM in processing video sequences, improving the accuracy and naturalness of interaction in the virtual assembly scene. Third, the raw video data are preprocessed so that the network input consists of a video sequence containing only eye-movement information and a video sequence containing only gesture information; redundant information is removed, key features are extracted, and the complexity of the network model is reduced.
According to some embodiments, the following technical scheme is adopted in the disclosure:
a human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly comprises the following steps:
tracking a fixation point according to the obtained eye movement data;
recognizing the gesture according to the acquired gesture information;
labeling the obtained gesture-recognition data and eye-movement data to form a training set, and constructing a multi-stream convolutional neural network-long short-term memory (CNN-LSTM) network model that trains itself on the training set;
and applying the trained optimal network model to the virtual assembly process: acquiring eye-movement data and gesture information during virtual assembly, extracting eye-movement and gesture features, and inferring the operator's behavior category from these features to complete the assembly task.
As an alternative embodiment, the specific process of tracking the gaze point according to the acquired eye movement data includes:
acquiring an eye image;
identifying the eye region with the Haar algorithm, locating the Purkinje spot formed on the eye by the infrared light source with a thresholding method, and detecting the pupil center with the Hough transform;
establishing an angular mapping between the position of the pupil center relative to the Purkinje spot and the position of the fixation point relative to the infrared light source, so as to estimate the fixation point accurately;
and displaying the estimated fixation point in the virtual scene as a red marker point, as sketched below.
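Purely as an illustration of the detection steps above, the following Python sketch locates the eye region with an OpenCV Haar cascade, the Purkinje spot with a brightness threshold, and the pupil center with a Hough circle transform; the cascade file, threshold value, and Hough parameters are illustrative assumptions rather than values taken from the disclosure.

    import cv2

    def detect_pupil_and_glint(frame_gray):
        """Locate the eye region (Haar cascade), the Purkinje spot (bright-spot
        threshold) and the pupil center (Hough circle) in one grayscale frame."""
        eye_cascade = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_eye.xml")
        eyes = eye_cascade.detectMultiScale(frame_gray, scaleFactor=1.1, minNeighbors=5)
        if len(eyes) == 0:
            return None
        x, y, w, h = eyes[0]
        eye = frame_gray[y:y + h, x:x + w]

        # Purkinje spot: the corneal reflection of the IR source is the brightest blob.
        _, glint_mask = cv2.threshold(eye, 240, 255, cv2.THRESH_BINARY)
        m = cv2.moments(glint_mask)
        if m["m00"] == 0:
            return None
        glint = (m["m10"] / m["m00"], m["m01"] / m["m00"])

        # Pupil center: a dark circular region found with the Hough circle transform.
        circles = cv2.HoughCircles(cv2.medianBlur(eye, 5), cv2.HOUGH_GRADIENT,
                                   dp=1.2, minDist=w, param1=80, param2=20,
                                   minRadius=h // 10, maxRadius=h // 3)
        if circles is None:
            return None
        cx, cy, _ = circles[0][0]
        return (float(cx), float(cy)), glint   # pupil center, Purkinje spot

The pupil-to-glint offset returned here is the quantity that the angular mapping of the next step consumes.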
As an alternative embodiment, the specific process of recognizing the gesture according to the acquired gesture information includes:
a, recording gesture position information of an operator in a video;
b, segmenting the hand region by combining discrete-cosine-transform-based edge detection with a YCbCr skin-color model;
c, recognizing the gesture with a feature-matching algorithm based on the principal direction of the gesture and the spatial distribution of gesture coordinate points;
and d, constructing a hand model in the virtual scene through three-dimensional modeling and displaying it synchronously with the recognized gestures.
As a further limitation, the specific process of step b includes:
suppressing noise by masking the high-frequency part of the image with DCT (discrete cosine transform) filtering; detecting all edge points of the hand with edge detection; mapping skin color to the YCbCr space and segmenting the gesture with a skin-color model; and applying a morphological closing to the segmented skin-color region to fill holes and sharpen the contour, yielding the final hand region, as illustrated in the sketch below.
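As an illustration of this segmentation pipeline, the Python sketch below chains a DCT low-pass filter, Canny edge detection, a YCbCr skin-color threshold, and a morphological closing; the retained-coefficient ratio, Canny thresholds, and Cr/Cb ranges are assumed typical values, not parameters taken from the disclosure.

    import cv2
    import numpy as np

    def segment_hand(bgr, keep_ratio=0.25):
        """DCT low-pass filtering, edge detection, YCbCr skin segmentation and a
        morphological closing, returning the final hand region and its mask."""
        gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
        gray = gray[:gray.shape[0] // 2 * 2, :gray.shape[1] // 2 * 2]  # cv2.dct needs even sizes

        # Keep only the low-frequency (top-left) DCT coefficients to suppress noise.
        coeffs = cv2.dct(gray)
        mask = np.zeros_like(coeffs)
        mask[:int(coeffs.shape[0] * keep_ratio), :int(coeffs.shape[1] * keep_ratio)] = 1.0
        smoothed = cv2.idct(coeffs * mask)

        # Edge points of the hand on the smoothed image.
        edges = cv2.Canny(np.uint8(np.clip(smoothed, 0, 255)), 50, 150)

        # Skin-color segmentation in YCbCr (typical Cr 133-173, Cb 77-127 ranges).
        ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
        skin = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))

        # Morphological closing fills holes so the contour is cleaner.
        kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
        skin = cv2.morphologyEx(skin, cv2.MORPH_CLOSE, kernel)

        hand = cv2.bitwise_and(bgr, bgr, mask=skin)
        return hand, skin, edges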
As a further limitation, the specific process of step c includes:
extracting the feature vector of the gesture along its principal direction, measuring the similarity between the extracted features and a sample library, selecting the M most similar candidate samples, and identifying the final gesture among the M candidates by template matching with a modified Hausdorff-like distance, as illustrated in the sketch below.
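A minimal sketch of the template-matching stage follows, using the standard modified (mean-of-minima) Hausdorff distance as a stand-in for the Hausdorff-like distance named above; the principal-direction feature extraction is not reproduced, and the candidate gesture and templates are assumed to be 2-D contour point sets.

    import numpy as np

    def modified_hausdorff(a, b):
        """Modified (mean-of-minima) Hausdorff distance between two (N, 2) point sets."""
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
        return max(d.min(axis=1).mean(), d.min(axis=0).mean())

    def match_gesture(candidate_points, candidate_templates):
        """Return the label of the template closest to the observed gesture.
        `candidate_templates` maps gesture label -> (N, 2) contour point array,
        i.e. the M candidates pre-selected by the principal-direction similarity step."""
        best_label, best_dist = None, np.inf
        for label, template_points in candidate_templates.items():
            dist = modified_hausdorff(candidate_points, template_points)
            if dist < best_dist:
                best_label, best_dist = label, dist
        return best_label, best_dist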
As an alternative embodiment, when the data are labeled, the video information is given different labels according to the combination of the eye fixation point and the gesture recognition result.
As an alternative embodiment, the labels are assigned as follows: if the eye fixation point is uncertain, the gesture is a fist, and the hand is translating or stationary, the video is labeled "begin"; if the fixation point stays at a fixed position, the gesture is an extended index finger only, and the hand is stationary, the video is labeled "select object"; if the fixation point stays at a fixed position, the gesture is five bent fingers, and the hand is stationary, the video is labeled "grab object"; if the fixation point follows the hand along its trajectory and the gesture is five bent fingers, the video is labeled "move object"; if the fixation point stays at a fixed position, the gesture is five open fingers, and the hand is stationary, the video is labeled "release object"; if the fixation point is uncertain, the eyes are closed, or no hand is detected in the scene, the video is labeled "rest", meaning no operation is performed in the virtual scene.
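Encoded directly, the rule above amounts to a small lookup over the gaze state, the recognized gesture, and the hand motion; the sketch below is one such encoding, with the state names and label strings chosen for illustration only.

    def label_clip(gaze_state, gesture, hand_motion):
        """Assign a behavior label to a video clip following the rules above.
        gaze_state:  'uncertain' | 'fixed' | 'follows_hand' | 'closed'
        gesture:     'fist' | 'index_only' | 'five_bent' | 'five_open' | 'none'
        hand_motion: 'static' | 'translating'
        """
        if gaze_state == "closed" or gesture == "none":
            return "rest"                      # no operation in the virtual scene
        if gesture == "fist" and hand_motion in ("static", "translating"):
            return "begin"
        if gaze_state == "fixed" and hand_motion == "static":
            if gesture == "index_only":
                return "select_object"
            if gesture == "five_bent":
                return "grab_object"
            if gesture == "five_open":
                return "release_object"
        if gaze_state == "follows_hand" and gesture == "five_bent":
            return "move_object"
        return "rest"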
As an alternative embodiment, the specific process of building the multi-modal fused CNN-LSTM network model includes:
extracting the spatial features of each video frame with a CNN whose structure is adapted from the VGG-16 network, comprising 13 convolutional layers, 5 pooling layers, and 2 fully connected layers, with the SELU (scaled exponential linear unit) activation function used between the convolutional layers in place of the ReLU function;
extracting temporal features with an LSTM connected after the CNN part, the CNN-extracted features serving as the input of the LSTM network;
fusing the features extracted by the multi-stream CNN-LSTM network through two fully connected layers, assigning a corresponding weight to each branch of the network, and obtaining the final classification result with a softmax layer;
and training the network model with the labeled data to determine the optimal network model; a sketch of such a model follows.
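The sketch below is a minimal multi-stream CNN-LSTM written with the Keras API. The frame size, feature widths, stream names, and the plain concatenation used before the fusion layers are illustrative assumptions; the per-branch weighting is sketched later, alongside formula (1) in the detailed description.

    from tensorflow.keras import layers, models

    FRAMES, H, W, C, NUM_CLASSES = 10, 112, 112, 3, 6   # assumed input sizes
    VGG16_CFG = [64, 64, "P", 128, 128, "P", 256, 256, 256, "P",
                 512, 512, 512, "P", 512, 512, 512, "P"]  # 13 conv + 5 pool layers

    def frame_cnn(name):
        """VGG-16-style per-frame CNN with SELU activations and 2 fully connected layers."""
        inp = layers.Input(shape=(H, W, C))
        x = inp
        for v in VGG16_CFG:
            if v == "P":
                x = layers.MaxPooling2D(2)(x)
            else:
                x = layers.Conv2D(v, 3, padding="same", activation="selu")(x)
        x = layers.Flatten()(x)
        x = layers.Dense(1024, activation="selu")(x)
        x = layers.Dense(512, activation="selu")(x)
        return models.Model(inp, x, name=name)

    def stream(name):
        """One stream: the per-frame CNN applied to every frame, then an LSTM."""
        clip = layers.Input(shape=(FRAMES, H, W, C), name=f"{name}_clip")
        per_frame = layers.TimeDistributed(frame_cnn(f"{name}_cnn"))(clip)
        return clip, layers.LSTM(256, name=f"{name}_lstm")(per_frame)

    # Four streams: spatial/temporal eye-movement video and spatial/temporal gesture video.
    inputs, features = zip(*(stream(n) for n in
                             ("eye_spatial", "eye_temporal",
                              "gesture_spatial", "gesture_temporal")))
    fused = layers.Concatenate()(list(features))
    fused = layers.Dense(512, activation="selu")(fused)   # two fusion FC layers
    fused = layers.Dense(256, activation="selu")(fused)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

    model = models.Model(list(inputs), outputs, name="multi_stream_cnn_lstm")
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])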
A human-computer interaction device integrating eye tracking and gesture recognition in virtual assembly comprises:
a light source configured to illuminate an eye position of an operator to be captured;
a Kinect device configured to acquire eye movement data and gesture information;
a processor configured to receive the information collected by the Kinect device, track the fixation point from the eye-movement data, and recognize gestures from the gesture information;
to label the obtained gesture-recognition data and eye-movement data to form a training set, and to construct a multi-stream convolutional neural network-long short-term memory (CNN-LSTM) network model that trains itself on the training set;
and to apply the optimal network model obtained by training to the virtual assembly process and generate the corresponding assembly task from the extracted feature information.
A computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor of a terminal device to execute the above human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly.
A terminal device comprising a processor and a computer-readable storage medium, the processor being configured to implement instructions, and the computer-readable storage medium storing a plurality of instructions adapted to be loaded by the processor to execute the above human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly.
Compared with the prior art, the beneficial effects of the present disclosure are:
the method solves the problem of misjudgment of similar behaviors in a single mode, and utilizes the advantages of a deep learning algorithm to identify the behaviors of operators in the video at a high accuracy rate, complete a virtual assembly task and realize human-computer interaction.
According to the method and the device, only the video sequence only comprising the eye movement information and the video sequence only comprising the gesture information are extracted, redundant information in the video is eliminated, the key features are extracted, and the complexity of a network model is reduced.
When the virtual assembly task is executed, an operator only needs to perform eye movement and gesture movement, the operation is more natural, and the experience feeling is better.
The method utilizes the advantages of deep learning, extracts eye movement and gesture features through self learning, exerts the advantages of processing the video sequence by the LSTM, and further improves the accuracy and naturalness of interaction under the virtual assembly scene.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a general block diagram of a virtual fit method combining eye tracking and gesture recognition, with key elements of the present disclosure within dashed boxes;
FIG. 2 is a schematic view of a gaze point estimation;
FIG. 3(a) shows an original image for gesture segmentation; FIG. 3(b) shows the gesture segmentation result.
FIGS. 4(a)-(d) are gesture diagrams: (a) only the index finger extended; (b) a fist; (c) five fingers open; and (d) five fingers bent.
FIG. 5 is a schematic diagram of a multi-modal converged CNN-LSTM network model.
Detailed Description
the present disclosure is further illustrated by the following examples in conjunction with the accompanying drawings.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
To address the shortcoming of existing interaction methods in the naturalness of interaction noted in the background, a human-computer interaction method fusing eye tracking and gesture recognition is provided. First, fusing eye tracking and gesture recognition reduces the misjudgment rate of similar behaviors under a single modality and improves the accuracy of virtual assembly. Second, a deep learning algorithm realizes the fusion: a multi-stream Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) model is built that extracts eye-movement and gesture features by self-learning and exploits the LSTM's strength in processing video sequences, improving the accuracy and naturalness of interaction in the virtual assembly scene. Third, the raw video data are preprocessed so that the network input consists of a video sequence containing only eye-movement information and a video sequence containing only gesture information, removing redundant information, extracting key features, and reducing the complexity of the network model.
First, eye-movement data are acquired, processed, and displayed in the virtual scene as a red marker point; then gesture information is acquired, the gesture is recognized, and the operator's hand is displayed in the virtual scene through three-dimensional modeling; finally, the gesture recognition result and the eye tracking result are fused to interact and complete the virtual assembly. The key techniques are the fusion of the two modalities, eye tracking and gesture recognition, and a CNN-LSTM network model suited to multi-modal fusion.
A human-computer interaction method fusing eye tracking and gesture recognition in virtual assembly specifically comprises the following steps:
Step 1, acquiring and processing eye-movement data: Kinect V2 collects the operator's eye-movement information, and a feature-based method realizes fixation-point tracking;
Step 2, acquiring gesture information and recognizing gestures: Kinect V2 collects the operator's gesture information; the gesture is segmented with a YCbCr skin-color model, among other techniques, and recognized with a feature-matching algorithm;
and Step 3, fusing the gesture recognition result and the eye tracking result to interact: the gesture data and eye-movement data obtained above are labeled according to the rules, the labeled data are used to train the multi-stream CNN-LSTM network model, and the trained optimal network model is applied to the virtual assembly process to complete the assembly task and realize the interaction.
Step 1 is implemented as follows:
a. illuminating the operator's eyes with infrared light sources mounted at the upper-left and upper-right corners of the display, and acquiring eye images with Kinect V2;
b. identifying the eye region with the Haar algorithm, locating the Purkinje spot formed on the eye by the infrared light source with a thresholding method, and detecting the pupil center with the Hough transform;
c. establishing an angular mapping between the position of the pupil center relative to the Purkinje spot and the position of the fixation point relative to the infrared light source, so as to estimate the fixation point accurately (one possible calibration-based realization is sketched after this list);
d. displaying the estimated fixation point in the virtual scene as a red marker point.
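One common way to realize the angular mapping of step c is a calibration-time polynomial fit from the pupil-glint offset to screen coordinates; the sketch below shows such a fit, where the second-order form and the least-squares solution are assumptions rather than details taken from the disclosure.

    import numpy as np

    def fit_gaze_mapping(pupil_glint_offsets, screen_points):
        """Fit a second-order polynomial mapping from the pupil-glint offset (dx, dy)
        to screen coordinates, using calibration samples gathered beforehand."""
        dx, dy = pupil_glint_offsets[:, 0], pupil_glint_offsets[:, 1]
        A = np.column_stack([np.ones_like(dx), dx, dy, dx * dy, dx**2, dy**2])
        coeff_x, *_ = np.linalg.lstsq(A, screen_points[:, 0], rcond=None)
        coeff_y, *_ = np.linalg.lstsq(A, screen_points[:, 1], rcond=None)
        return coeff_x, coeff_y

    def estimate_gaze(offset, coeff_x, coeff_y):
        """Map one pupil-glint offset to an estimated fixation point on the display."""
        dx, dy = offset
        a = np.array([1.0, dx, dy, dx * dy, dx**2, dy**2])
        return float(a @ coeff_x), float(a @ coeff_y)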
Step 2 is implemented as follows:
a. recording the operator's gesture position information in the video with Kinect V2;
b. segmenting the hand region by combining DCT-based edge detection with a YCbCr skin-color model: DCT filtering first masks the high-frequency part of the image to suppress noise, edge detection then finds all edge points of the hand, and finally skin color is mapped to the YCbCr space and the gesture is segmented with a skin-color model; a morphological closing applied to the segmented skin-color region fills holes, sharpens the contour, and yields the final hand region;
c. recognizing the gesture with a matching algorithm based on the principal direction of the gesture and the spatial distribution of gesture coordinate points: the feature vector of the gesture is extracted along its principal direction, its similarity to a sample library is measured, the M most similar candidate samples are selected, and the final gesture is identified among the M candidates by template matching with a modified Hausdorff-like distance;
d. constructing a hand model in the virtual scene through three-dimensional modeling and displaying it synchronously with the recognized gestures.
Step 3 is implemented as follows:
a. determining the data labeling rule: if the eye fixation point is uncertain, the gesture is a fist, and the hand is translating or stationary, the video is labeled "begin"; if the fixation point stays at a fixed position, the gesture is an extended index finger only, and the hand is stationary, the video is labeled "select object"; if the fixation point stays at a fixed position, the gesture is five bent fingers, and the hand is stationary, the video is labeled "grab object"; if the fixation point follows the hand along its trajectory and the gesture is five bent fingers, the video is labeled "move object"; if the fixation point stays at a fixed position, the gesture is five open fingers, and the hand is stationary, the video is labeled "release object"; if the fixation point is uncertain, the eyes are closed, or no hand is detected in the scene, the video is labeled "rest", meaning no operation is performed in the virtual scene;
b. building the multi-modal fused CNN-LSTM network model: first, a CNN adapted from the VGG-16 network (13 convolutional layers, 5 pooling layers, and 2 fully connected layers) extracts the spatial features of each video frame; second, an LSTM connected after the CNN part extracts temporal features, with the CNN-extracted features as its input and the number of input video frames fixed at 10, so that the features of the key video sequence are captured without redundancy; then the features extracted by the multi-stream CNN-LSTM network are fused through two fully connected layers, and, to keep overfitting to a minimum, each branch of the network is assigned a corresponding weight to produce a more reliable final feature representation; finally, a softmax layer yields the final classification result;
c. training the network model with the labeled data to determine the optimal network model;
d. applying the optimal network model to the virtual assembly process, analyzing the eye-movement information and gesture information of subsequent operators, completing the virtual assembly task, and realizing the interaction.
As a specific embodiment, a human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly is provided. First, the operator's eye-movement information is acquired with Kinect V2 and eye tracking is realized with a feature-based method; then the operator's gesture information is acquired with Kinect V2, the gesture is segmented with a YCbCr skin-color model (among other techniques), and gesture recognition is realized with a feature-matching algorithm; finally, the gesture data and eye-movement data obtained above are used to train the multi-stream CNN-LSTM network model, and the gesture recognition result and the eye tracking result are fused to complete the assembly task and realize the interaction. FIG. 1 is a general block diagram of the virtual assembly method combining eye tracking and gesture recognition, that is, of the human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly, whose specific embodiments are as follows:
(1) The operator's eyes are illuminated with infrared light sources mounted at the upper-left and upper-right corners of the display; the angle between the position of the pupil center relative to the corneal reflection point and the position of the fixation point relative to the infrared light sources is obtained; the fixation point is estimated accurately from the mapping between the pupil center and the corneal reflection point; and the estimated fixation point is displayed in the virtual scene as a red marker point, as shown in FIG. 2.
(2) The operator's gesture position information in the video is recorded with Kinect V2; the gesture is segmented by combining DCT-based edge detection with a YCbCr skin-color model; gesture recognition is realized with a feature-matching algorithm based on the principal direction of the gesture and the spatial distribution of gesture coordinate points (the relevant gestures are shown in FIGS. 4(a)-(d)); and a hand model is built in the virtual scene through three-dimensional modeling and displayed synchronously with the recognized gesture.
(3) The data labeling rule is determined: if the eye fixation point is uncertain, the gesture is a fist, and the hand is translating or stationary, the video is labeled "begin"; if the fixation point stays at a fixed position, the gesture is an extended index finger only, and the hand is stationary, the video is labeled "select object"; if the fixation point stays at a fixed position, the gesture is five bent fingers, and the hand is stationary, the video is labeled "grab object"; if the fixation point follows the hand along its trajectory and the gesture is five bent fingers, the video is labeled "move object"; if the fixation point stays at a fixed position, the gesture is five open fingers, and the hand is stationary, the video is labeled "release object"; if the fixation point is uncertain, the eyes are closed, or no hand is detected in the scene, the video is labeled "rest", meaning no operation is performed in the virtual scene.
(4) The multi-modal fused CNN-LSTM network model is built (as shown in FIG. 5). First, a CNN adapted from the VGG-16 network (13 convolutional layers, 5 pooling layers, and 2 fully connected layers) extracts the spatial features of each video frame; to normalize the data internally and reduce gradient explosion or vanishing, the SELU activation function is used between the convolutional layers in place of ReLU. Second, an LSTM connected after the CNN part extracts temporal features, with the CNN-extracted features as its input and the number of input video frames fixed at 10, so that the features of the key video sequence are captured without redundancy. Then the features extracted by the multi-stream CNN-LSTM network are fused through two fully connected layers, and, to keep overfitting to a minimum, each branch of the network is assigned a corresponding weight to produce a more reliable final feature representation (formula (1)). Finally, a softmax layer yields the final classification result.
F = w_eo·F_eo + w_et·F_et + w_go·F_go + w_gt·F_gt    (1)
F denotes the final feature representation; w_eo and F_eo denote the weight and final feature of the spatial stream of the eye-movement information; w_et and F_et denote the weight and final feature of the temporal stream of the eye-movement information; w_go and F_go denote the weight and final feature of the spatial stream of the gesture information; and w_gt and F_gt denote the weight and final feature of the temporal stream of the gesture information. Each weight w_eo (w_et, w_go, and w_gt are defined in the same way) is given by formula (2), reproduced only as an image in the original publication, in terms of w, the weight of the fitted objective function during network training.
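A minimal sketch of the weighted fusion of formula (1), written as a Keras layer, is given below; the softmax normalization of the trainable branch weights stands in for formula (2), which is not legible in the source, and is therefore an assumption.

    import tensorflow as tf
    from tensorflow.keras import layers

    class WeightedStreamFusion(layers.Layer):
        """Fuse the four stream features as F = w_eo*F_eo + w_et*F_et + w_go*F_go
        + w_gt*F_gt (formula (1)), with trainable branch weights normalized by a
        softmax (an assumption standing in for formula (2))."""

        def build(self, input_shape):
            # One scalar weight per incoming stream.
            self.raw_w = self.add_weight(name="branch_weights",
                                         shape=(len(input_shape),),
                                         initializer="zeros", trainable=True)

        def call(self, streams):
            w = tf.nn.softmax(self.raw_w)            # w_eo, w_et, w_go, w_gt
            return tf.add_n([w[i] * f for i, f in enumerate(streams)])

    # Hypothetical usage with the four stream features of the earlier model sketch:
    # fused = WeightedStreamFusion()([f_eo, f_et, f_go, f_gt])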
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (8)

1. A human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly, characterized by comprising the following steps:
tracking a fixation point according to the obtained eye movement data;
recognizing the gesture according to the acquired gesture information;
labeling the obtained gesture-recognition data and eye-movement data to form a training set, and constructing a multi-stream convolutional neural network-long short-term memory (CNN-LSTM) network model that trains itself on the training set;
wherein, when the data are labeled, the video information is given different labels according to the combination of the eye fixation point and the gesture recognition result;
wherein the specific process of building the multi-modal fused CNN-LSTM network model comprises:
extracting the spatial features of each video frame with a CNN whose structure is adapted from the VGG-16 network, comprising 13 convolutional layers, 5 pooling layers, and 2 fully connected layers, with the SELU activation function used between the convolutional layers in place of the ReLU function;
extracting temporal features with an LSTM connected after the CNN part, the CNN-extracted features serving as the input of the LSTM network;
fusing the features extracted by the multi-stream CNN-LSTM network through two fully connected layers, assigning a corresponding weight to each branch of the network, and obtaining the final classification result with a softmax layer;
training the network model with the labeled data to determine an optimal network model;
and applying the trained optimal network model to the virtual assembly process, acquiring eye-movement data and gesture information during virtual assembly, extracting eye-movement and gesture features, and inferring the operator's behavior category from these features to complete the assembly task.
2. The human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly as claimed in claim 1, wherein: the specific process of tracking the fixation point according to the acquired eye movement data comprises the following steps:
acquiring an eye image;
identifying the eye region with the Haar algorithm, locating the Purkinje spot formed on the eye by the infrared light source with a thresholding method, and detecting the pupil center with the Hough transform;
establishing an angular mapping between the position of the pupil center relative to the Purkinje spot and the position of the fixation point relative to the infrared light source, so as to estimate the fixation point accurately;
and displaying the estimated fixation point in the virtual scene as a red marker point.
3. The human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly as claimed in claim 1, wherein: according to the acquired gesture information, the specific process of recognizing the gesture comprises the following steps:
a, recording gesture position information of an operator in a video;
b, segmenting the hand region by combining discrete-cosine-transform-based edge detection with a YCbCr skin-color model;
c, recognizing the gesture with a feature-matching algorithm based on the principal direction of the gesture and the spatial distribution of gesture coordinate points;
and d, constructing a hand model in the virtual scene through three-dimensional modeling and displaying it synchronously with the recognized gestures.
4. The human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly as claimed in claim 3, wherein: the specific process of the step b comprises the following steps:
the method comprises the steps of shielding a high-frequency part in an image by using DCT (discrete cosine transformation) filtering to further suppress noise, detecting all edge points of a hand by using edge detection, mapping skin colors to a YCbCr (YCbCr) space, segmenting gestures through a skin color model, performing morphological closed operation processing on segmented skin color areas, filling holes to enable the outline to be clearer, and obtaining a final hand area.
5. The human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly as claimed in claim 3, wherein: the specific process of the step c comprises the following steps:
extracting the feature vector of the gesture along its principal direction, measuring the similarity between the extracted features and a sample library, selecting the M most similar candidate samples, and identifying the final gesture among the M candidates by template matching with a modified Hausdorff-like distance.
6. A human-computer interaction device integrating eye tracking and gesture recognition in virtual assembly, characterized by comprising:
a light source configured to illuminate an eye position of an operator to be captured;
a Kinect device configured to acquire eye movement data and gesture information;
a processor configured to receive the information collected by the Kinect device, track the fixation point from the eye-movement data, and recognize gestures from the gesture information;
to label the obtained gesture-recognition data and eye-movement data to form a training set, and to construct a multi-stream convolutional neural network-long short-term memory (CNN-LSTM) network model that trains itself on the training set, wherein, when the data are labeled, the video information is given different labels according to the combination of the eye fixation point and the gesture recognition result;
wherein the specific process of building the multi-modal fused CNN-LSTM network model comprises:
extracting the spatial features of each video frame with a CNN whose structure is adapted from the VGG-16 network, comprising 13 convolutional layers, 5 pooling layers, and 2 fully connected layers, with the SELU activation function used between the convolutional layers in place of the ReLU function;
extracting temporal features with an LSTM connected after the CNN part, the CNN-extracted features serving as the input of the LSTM network;
fusing the features extracted by the multi-stream CNN-LSTM network through two fully connected layers, assigning a corresponding weight to each branch of the network, and obtaining the final classification result with a softmax layer;
training the network model with the labeled data to determine an optimal network model;
and applying the optimal network model obtained by training to the virtual assembly process and generating the corresponding assembly task from the extracted feature information.
7. A computer-readable storage medium, characterized in that it stores a plurality of instructions adapted to be loaded by a processor of a terminal device to execute the human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly according to any one of claims 1 to 5.
8. A terminal device, characterized by comprising a processor and a computer-readable storage medium, the processor being configured to implement instructions, and the computer-readable storage medium storing a plurality of instructions adapted to be loaded by the processor to execute the human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly according to any one of claims 1 to 5.
CN201910670994.0A 2019-07-24 2019-07-24 Human-computer interaction method and device integrating eye movement tracking and gesture recognition in virtual assembly Active CN110362210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910670994.0A CN110362210B (en) 2019-07-24 2019-07-24 Human-computer interaction method and device integrating eye movement tracking and gesture recognition in virtual assembly

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910670994.0A CN110362210B (en) 2019-07-24 2019-07-24 Human-computer interaction method and device integrating eye movement tracking and gesture recognition in virtual assembly

Publications (2)

Publication Number Publication Date
CN110362210A CN110362210A (en) 2019-10-22
CN110362210B true CN110362210B (en) 2022-10-11

Family

ID=68220850

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910670994.0A Active CN110362210B (en) 2019-07-24 2019-07-24 Human-computer interaction method and device integrating eye movement tracking and gesture recognition in virtual assembly

Country Status (1)

Country Link
CN (1) CN110362210B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111526118B (en) * 2019-10-29 2023-06-30 南京翱翔信息物理融合创新研究院有限公司 Remote operation guiding system and method based on mixed reality
CN111290575A (en) * 2020-01-21 2020-06-16 中国人民解放军空军工程大学 Multichannel interactive control system of air defense anti-pilot weapon
CN111368980B (en) * 2020-03-06 2023-11-07 京东科技控股股份有限公司 State detection method, device, equipment and storage medium
CN111625098B (en) * 2020-06-01 2022-11-18 广州市大湾区虚拟现实研究院 Intelligent virtual avatar interaction method and device based on multi-channel information fusion
CN112308116B (en) * 2020-09-28 2023-04-07 济南大学 Self-optimization multi-channel fusion method and system for old-person-assistant accompanying robot
CN112462940A (en) * 2020-11-25 2021-03-09 苏州科技大学 Intelligent home multi-mode man-machine natural interaction system and method thereof
CN112990153A (en) * 2021-05-11 2021-06-18 创新奇智(成都)科技有限公司 Multi-target behavior identification method and device, storage medium and electronic equipment
CN113283354B (en) * 2021-05-31 2023-08-18 中国航天科工集团第二研究院 Method, system and storage medium for analyzing eye movement signal behavior
CN113222712A (en) * 2021-05-31 2021-08-06 中国银行股份有限公司 Product recommendation method and device
CN113537335B (en) * 2021-07-09 2024-02-23 北京航空航天大学 Method and system for analyzing hand assembly skills
CN115111964A (en) * 2022-06-02 2022-09-27 中国人民解放军东部战区总医院 MR holographic intelligent helmet for individual training
CN116108549B (en) * 2023-04-12 2023-06-27 武昌理工学院 Green building component combined virtual assembly system and method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102749991A (en) * 2012-04-12 2012-10-24 广东百泰科技有限公司 Non-contact free space eye-gaze tracking method suitable for man-machine interaction
CN104766054A (en) * 2015-03-26 2015-07-08 济南大学 Vision-attention-model-based gesture tracking method in human-computer interaction interface
WO2018033154A1 (en) * 2016-08-19 2018-02-22 北京市商汤科技开发有限公司 Gesture control method, device, and electronic apparatus

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102749991A (en) * 2012-04-12 2012-10-24 广东百泰科技有限公司 Non-contact free space eye-gaze tracking method suitable for man-machine interaction
CN104766054A (en) * 2015-03-26 2015-07-08 济南大学 Vision-attention-model-based gesture tracking method in human-computer interaction interface
WO2018033154A1 (en) * 2016-08-19 2018-02-22 北京市商汤科技开发有限公司 Gesture control method, device, and electronic apparatus

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Application of a Leap Motion-based gesture recognition method to tree interaction; Wang Hongquan et al.; Computer Applications and Software (《计算机应用与软件》); 2018-10-12 (No. 10); full text *
Capturing user intention in virtual environments; Cheng Cheng et al.; Journal of Image and Graphics (《中国图象图形学报》); 2015-02-16 (No. 02); full text *

Also Published As

Publication number Publication date
CN110362210A (en) 2019-10-22

Similar Documents

Publication Publication Date Title
CN110362210B (en) Human-computer interaction method and device integrating eye movement tracking and gesture recognition in virtual assembly
Zhou et al. A novel finger and hand pose estimation technique for real-time hand gesture recognition
CN103941866B (en) Three-dimensional gesture recognizing method based on Kinect depth image
Islam et al. Real time hand gesture recognition using different algorithms based on American sign language
CN102831439B (en) Gesture tracking method and system
CN107168527B (en) The first visual angle gesture identification and exchange method based on region convolutional neural networks
JP6079832B2 (en) Human computer interaction system, hand-to-hand pointing point positioning method, and finger gesture determination method
US10664983B2 (en) Method for providing virtual reality interface by analyzing image acquired by single camera and apparatus for the same
CN105809144A (en) Gesture recognition system and method adopting action segmentation
CN104616028B (en) Human body limb gesture actions recognition methods based on space segmentation study
EP3035235B1 (en) Method for setting a tridimensional shape detection classifier and method for tridimensional shape detection using said shape detection classifier
Wu et al. Robust fingertip detection in a complex environment
CN107450714A (en) Man-machine interaction support test system based on augmented reality and image recognition
Pandey et al. Hand gesture recognition for sign language recognition: A review
CN103034851B (en) The hand tracking means based on complexion model of self study and method
CN109325408A (en) A kind of gesture judging method and storage medium
She et al. A real-time hand gesture recognition approach based on motion features of feature points
CN109543644A (en) A kind of recognition methods of multi-modal gesture
Liu et al. Temporal segmentation of fine-gained semantic action: A motion-centered figure skating dataset
CN114549557A (en) Portrait segmentation network training method, device, equipment and medium
Thabet et al. Fast marching method and modified features fusion in enhanced dynamic hand gesture segmentation and detection method under complicated background
CN111460858B (en) Method and device for determining finger tip point in image, storage medium and electronic equipment
Fakhfakh et al. Gesture recognition system for isolated word sign language based on key-point trajectory matrix
CN108108648A (en) A kind of new gesture recognition system device and method
Hasan et al. Gesture feature extraction for static gesture recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230928

Address after: Room 507, 5th Floor, Block A, Building A1, Industry University Research Base, No. 9 Haichuan Road, Liuhang Street, High tech Zone, Jining City, Shandong Province, 272199

Patentee after: Microcrystalline Shushi (Shandong) Equipment Technology Co.,Ltd.

Address before: 250022 No. 336, South Xin Zhuang West Road, Shizhong District, Ji'nan, Shandong

Patentee before: University of Jinan