CN110362210B - Human-computer interaction method and device integrating eye movement tracking and gesture recognition in virtual assembly - Google Patents

Human-computer interaction method and device integrating eye movement tracking and gesture recognition in virtual assembly

Info

Publication number
CN110362210B
CN110362210B (application CN201910670994.0A)
Authority
CN
China
Prior art keywords
gesture
eye movement
gesture recognition
eye
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910670994.0A
Other languages
Chinese (zh)
Other versions
CN110362210A (en)
Inventor
杨晓晖
察晓磊
徐涛
冯志全
吕娜
范雪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microcrystalline Shushi (Shandong) Equipment Technology Co.,Ltd.
Original Assignee
University of Jinan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Jinan filed Critical University of Jinan
Priority to CN201910670994.0A priority Critical patent/CN110362210B/en
Publication of CN110362210A publication Critical patent/CN110362210A/en
Application granted granted Critical
Publication of CN110362210B publication Critical patent/CN110362210B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 - Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013 - Eye tracking input arrangements
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/017 - Gesture based interaction, e.g. based on a set of recognized hand gestures
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/107 - Static hand or arm
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 - Eye characteristics, e.g. of the iris
    • G06V40/197 - Matching; Classification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/20 - Movements or behaviour, e.g. gesture recognition
    • G06V40/28 - Recognition of hand or arm movements, e.g. recognition of deaf sign language
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F2203/00 - Indexing scheme relating to G06F3/00 - G06F3/048
    • G06F2203/01 - Indexing scheme relating to G06F3/01
    • G06F2203/012 - Walk-in-place systems for allowing a user to walk in a virtual environment while constraining him to a given position in the physical environment

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Human Computer Interaction (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Ophthalmology & Optometry (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Image Analysis (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The disclosure provides a human-computer interaction method and device that integrate eye-movement tracking and gesture recognition in virtual assembly. The fixation point is tracked from the acquired eye-movement data; gestures are recognized from the acquired gesture information; the gesture-recognition data and eye-movement data are labeled to form a training set, and a multi-stream convolutional neural network-long short-term memory (CNN-LSTM) network model is built and trained on that set by self-learning. The trained optimal network model is then applied to the virtual assembly process: eye-movement data and gesture information are acquired, eye-movement and gesture features are extracted, and the operator's behavior category is inferred from these features to complete the assembly task. The method resolves the misjudgment of similar behaviors under a single modality and exploits deep learning to recognize operator behavior in video with high accuracy, completing the virtual assembly task and realizing human-computer interaction.

Description

Human-computer interaction method and device integrating eye movement tracking and gesture recognition in virtual assembly
Technical Field
The disclosure belongs to the technical field of human-computer interaction, and particularly relates to a human-computer interaction method and device integrating eye tracking and gesture recognition in virtual assembly.
Background
The statements in this section merely provide background information related to the present disclosure and may not constitute prior art.
With the development of information technology, human-computer interaction has gradually shifted from being machine-centered to being human-centered: in a virtual environment, the machine acts as an auxiliary tool that helps the human complete an interaction task. Virtual reality technology has also developed rapidly, and virtual assembly, as one of its applications, is widely used in product design verification and computer-aided assembly planning. Virtual assembly uses data processing, human-computer interaction, and visualization technologies so that product development can move directly from the design stage into a virtual product stage. Work-efficiency analysis, design, analysis, and evaluation are completed with a virtual human-machine model, yielding a reasonable assembly sequence and assembly path and allowing the assembly process and the accessibility, maintainability, and visibility of the assembly to be checked.
Virtual assembly greatly reduces the manpower and material cost of manual assembly and removes its operational hazards. In addition, virtual assembly is not tied to a particular time or place: as long as the equipment requirements are met, the assembly process can be completed anywhere, which greatly lowers the cost of manual assembly.
Interaction technology plays an important role in virtual assembly. The techniques commonly used today are either traditional methods, such as interacting with the virtual scene through a mouse, a touch screen, or an infrared pen, or gesture-recognition-based interaction, which supports moving, disassembling, and similar operations on objects in the virtual scene. These methods, however, still fall short in the naturalness and accuracy of interaction.
Disclosure of Invention
To address these problems, a human-computer interaction method and device that fuse eye tracking and gesture recognition in virtual assembly are provided. First, fusing the two modalities of eye tracking and gesture recognition reduces the misjudgment rate of similar behaviors under a single modality and improves the accuracy of virtual assembly. Second, a deep learning algorithm realizes the fusion: a multi-stream Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) model is built that learns eye-movement and gesture features by itself and exploits the strength of the LSTM in processing video sequences, improving the accuracy and naturalness of interaction in the virtual assembly scene. Third, the raw video data are preprocessed so that the network input consists of a video sequence containing only eye-movement information and a video sequence containing only gesture information; redundant information is removed, key features are extracted, and the complexity of the network model is reduced.
According to some embodiments, the following technical scheme is adopted in the disclosure:
a human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly comprises the following steps:
tracking a fixation point according to the obtained eye movement data;
recognizing the gesture according to the acquired gesture information;
labeling the obtained gesture-recognition data and eye-movement data to form a training set, and constructing a multi-stream convolutional neural network-long short-term memory (CNN-LSTM) network model that trains itself on the training set;
and applying the trained optimal network model to the virtual assembly process: acquiring eye-movement data and gesture information during virtual assembly, extracting eye-movement and gesture features, and inferring the operator's behavior category from these features to complete the assembly task.
As an alternative embodiment, the specific process of tracking the gaze point according to the acquired eye movement data includes:
acquiring an eye image;
identifying the eye region with the Haar algorithm, locating the Purkinje spot formed on the eye by the infrared light source with a thresholding method, and detecting the pupil center with the Hough transform;
establishing an angular mapping between the position of the pupil center relative to the Purkinje spot and the position of the fixation point relative to the infrared light source, so as to estimate the fixation point accurately;
and displaying the estimated fixation point in the virtual scene as a red marker point, as sketched below.
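Purely as an illustration of the detection steps above, the following Python sketch locates the eye region with an OpenCV Haar cascade, the Purkinje spot with a brightness threshold, and the pupil center with a Hough circle transform; the cascade file, threshold value, and Hough parameters are illustrative assumptions rather than values taken from the disclosure.

    import cv2

    def detect_pupil_and_glint(frame_gray):
        """Locate the eye region (Haar cascade), the Purkinje spot (bright-spot
        threshold) and the pupil center (Hough circle) in one grayscale frame."""
        eye_cascade = cv2.CascadeClassifier(
            cv2.data.haarcascades + "haarcascade_eye.xml")
        eyes = eye_cascade.detectMultiScale(frame_gray, scaleFactor=1.1, minNeighbors=5)
        if len(eyes) == 0:
            return None
        x, y, w, h = eyes[0]
        eye = frame_gray[y:y + h, x:x + w]

        # Purkinje spot: the corneal reflection of the IR source is the brightest blob.
        _, glint_mask = cv2.threshold(eye, 240, 255, cv2.THRESH_BINARY)
        m = cv2.moments(glint_mask)
        if m["m00"] == 0:
            return None
        glint = (m["m10"] / m["m00"], m["m01"] / m["m00"])

        # Pupil center: a dark circular region found with the Hough circle transform.
        circles = cv2.HoughCircles(cv2.medianBlur(eye, 5), cv2.HOUGH_GRADIENT,
                                   dp=1.2, minDist=w, param1=80, param2=20,
                                   minRadius=h // 10, maxRadius=h // 3)
        if circles is None:
            return None
        cx, cy, _ = circles[0][0]
        return (float(cx), float(cy)), glint   # pupil center, Purkinje spot

The pupil-to-glint offset returned here is the quantity that the angular mapping of the next step consumes.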
As an alternative embodiment, the specific process of recognizing the gesture according to the acquired gesture information includes:
a, recording gesture position information of an operator in a video;
b, segmenting the hand region by combining discrete-cosine-transform-based edge detection with a YCbCr skin-color model;
c, recognizing the gesture with a feature-matching algorithm based on the principal direction of the gesture and the spatial distribution of gesture coordinate points;
and d, constructing a hand model in the virtual scene through three-dimensional modeling and displaying it synchronously with the recognized gestures.
As a further limitation, the specific process of step b includes:
suppressing noise by masking the high-frequency part of the image with DCT (discrete cosine transform) filtering; detecting all edge points of the hand with edge detection; mapping skin color to the YCbCr space and segmenting the gesture with a skin-color model; and applying a morphological closing to the segmented skin-color region to fill holes and sharpen the contour, yielding the final hand region, as illustrated in the sketch below.
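As an illustration of this segmentation pipeline, the Python sketch below chains a DCT low-pass filter, Canny edge detection, a YCbCr skin-color threshold, and a morphological closing; the retained-coefficient ratio, Canny thresholds, and Cr/Cb ranges are assumed typical values, not parameters taken from the disclosure.

    import cv2
    import numpy as np

    def segment_hand(bgr, keep_ratio=0.25):
        """DCT low-pass filtering, edge detection, YCbCr skin segmentation and a
        morphological closing, returning the final hand region and its mask."""
        gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY).astype(np.float32)
        gray = gray[:gray.shape[0] // 2 * 2, :gray.shape[1] // 2 * 2]  # cv2.dct needs even sizes

        # Keep only the low-frequency (top-left) DCT coefficients to suppress noise.
        coeffs = cv2.dct(gray)
        mask = np.zeros_like(coeffs)
        mask[:int(coeffs.shape[0] * keep_ratio), :int(coeffs.shape[1] * keep_ratio)] = 1.0
        smoothed = cv2.idct(coeffs * mask)

        # Edge points of the hand on the smoothed image.
        edges = cv2.Canny(np.uint8(np.clip(smoothed, 0, 255)), 50, 150)

        # Skin-color segmentation in YCbCr (typical Cr 133-173, Cb 77-127 ranges).
        ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
        skin = cv2.inRange(ycrcb, (0, 133, 77), (255, 173, 127))

        # Morphological closing fills holes so the contour is cleaner.
        kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (7, 7))
        skin = cv2.morphologyEx(skin, cv2.MORPH_CLOSE, kernel)

        hand = cv2.bitwise_and(bgr, bgr, mask=skin)
        return hand, skin, edges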
As a further limitation, the specific process of step c includes:
extracting the feature vector of the gesture along its principal direction, measuring the similarity between the extracted features and a sample library, selecting the M most similar candidate samples, and identifying the final gesture among the M candidates by template matching with a modified Hausdorff-like distance, as illustrated in the sketch below.
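A minimal sketch of the template-matching stage follows, using the standard modified (mean-of-minima) Hausdorff distance as a stand-in for the Hausdorff-like distance named above; the principal-direction feature extraction is not reproduced, and the candidate gesture and templates are assumed to be 2-D contour point sets.

    import numpy as np

    def modified_hausdorff(a, b):
        """Modified (mean-of-minima) Hausdorff distance between two (N, 2) point sets."""
        d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # pairwise distances
        return max(d.min(axis=1).mean(), d.min(axis=0).mean())

    def match_gesture(candidate_points, candidate_templates):
        """Return the label of the template closest to the observed gesture.
        `candidate_templates` maps gesture label -> (N, 2) contour point array,
        i.e. the M candidates pre-selected by the principal-direction similarity step."""
        best_label, best_dist = None, np.inf
        for label, template_points in candidate_templates.items():
            dist = modified_hausdorff(candidate_points, template_points)
            if dist < best_dist:
                best_label, best_dist = label, dist
        return best_label, best_dist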
As an alternative embodiment, when the data are labeled, the video information is given different labels according to the combination of the eye fixation point and the gesture recognition result.
As an alternative embodiment, the labels are assigned as follows: if the eye fixation point is uncertain, the gesture is a fist, and the hand is translating or stationary, the video is labeled "begin"; if the fixation point stays at a fixed position, the gesture is an extended index finger only, and the hand is stationary, the video is labeled "select object"; if the fixation point stays at a fixed position, the gesture is five bent fingers, and the hand is stationary, the video is labeled "grab object"; if the fixation point follows the hand along its trajectory and the gesture is five bent fingers, the video is labeled "move object"; if the fixation point stays at a fixed position, the gesture is five open fingers, and the hand is stationary, the video is labeled "release object"; if the fixation point is uncertain, the eyes are closed, or no hand is detected in the scene, the video is labeled "rest", meaning no operation is performed in the virtual scene.
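Encoded directly, the rule above amounts to a small lookup over the gaze state, the recognized gesture, and the hand motion; the sketch below is one such encoding, with the state names and label strings chosen for illustration only.

    def label_clip(gaze_state, gesture, hand_motion):
        """Assign a behavior label to a video clip following the rules above.
        gaze_state:  'uncertain' | 'fixed' | 'follows_hand' | 'closed'
        gesture:     'fist' | 'index_only' | 'five_bent' | 'five_open' | 'none'
        hand_motion: 'static' | 'translating'
        """
        if gaze_state == "closed" or gesture == "none":
            return "rest"                      # no operation in the virtual scene
        if gesture == "fist" and hand_motion in ("static", "translating"):
            return "begin"
        if gaze_state == "fixed" and hand_motion == "static":
            if gesture == "index_only":
                return "select_object"
            if gesture == "five_bent":
                return "grab_object"
            if gesture == "five_open":
                return "release_object"
        if gaze_state == "follows_hand" and gesture == "five_bent":
            return "move_object"
        return "rest"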
As an alternative embodiment, the specific process of building the multi-modal fused CNN-LSTM network model includes:
extracting the spatial features of each video frame with a CNN whose structure is adapted from the VGG-16 network, comprising 13 convolutional layers, 5 pooling layers, and 2 fully connected layers, with the SELU (scaled exponential linear unit) activation function used between the convolutional layers in place of the ReLU function;
extracting temporal features with an LSTM connected after the CNN part, the CNN-extracted features serving as the input of the LSTM network;
fusing the features extracted by the multi-stream CNN-LSTM network through two fully connected layers, assigning a corresponding weight to each branch of the network, and obtaining the final classification result with a softmax layer;
and training the network model with the labeled data to determine the optimal network model; a sketch of such a model follows.
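The sketch below is a minimal multi-stream CNN-LSTM written with the Keras API. The frame size, feature widths, stream names, and the plain concatenation used before the fusion layers are illustrative assumptions; the per-branch weighting is sketched later, alongside formula (1) in the detailed description.

    from tensorflow.keras import layers, models

    FRAMES, H, W, C, NUM_CLASSES = 10, 112, 112, 3, 6   # assumed input sizes
    VGG16_CFG = [64, 64, "P", 128, 128, "P", 256, 256, 256, "P",
                 512, 512, 512, "P", 512, 512, 512, "P"]  # 13 conv + 5 pool layers

    def frame_cnn(name):
        """VGG-16-style per-frame CNN with SELU activations and 2 fully connected layers."""
        inp = layers.Input(shape=(H, W, C))
        x = inp
        for v in VGG16_CFG:
            if v == "P":
                x = layers.MaxPooling2D(2)(x)
            else:
                x = layers.Conv2D(v, 3, padding="same", activation="selu")(x)
        x = layers.Flatten()(x)
        x = layers.Dense(1024, activation="selu")(x)
        x = layers.Dense(512, activation="selu")(x)
        return models.Model(inp, x, name=name)

    def stream(name):
        """One stream: the per-frame CNN applied to every frame, then an LSTM."""
        clip = layers.Input(shape=(FRAMES, H, W, C), name=f"{name}_clip")
        per_frame = layers.TimeDistributed(frame_cnn(f"{name}_cnn"))(clip)
        return clip, layers.LSTM(256, name=f"{name}_lstm")(per_frame)

    # Four streams: spatial/temporal eye-movement video and spatial/temporal gesture video.
    inputs, features = zip(*(stream(n) for n in
                             ("eye_spatial", "eye_temporal",
                              "gesture_spatial", "gesture_temporal")))
    fused = layers.Concatenate()(list(features))
    fused = layers.Dense(512, activation="selu")(fused)   # two fusion FC layers
    fused = layers.Dense(256, activation="selu")(fused)
    outputs = layers.Dense(NUM_CLASSES, activation="softmax")(fused)

    model = models.Model(list(inputs), outputs, name="multi_stream_cnn_lstm")
    model.compile(optimizer="adam", loss="categorical_crossentropy", metrics=["accuracy"])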
A human-computer interaction device integrating eye tracking and gesture recognition in virtual assembly comprises:
a light source configured to illuminate an eye position of an operator to be captured;
a Kinect device configured to acquire eye movement data and gesture information;
a processor configured to receive the information collected by the Kinect device, track the fixation point from the eye-movement data, and recognize gestures from the gesture information;
to label the obtained gesture-recognition data and eye-movement data to form a training set, and to construct a multi-stream convolutional neural network-long short-term memory (CNN-LSTM) network model that trains itself on the training set;
and to apply the optimal network model obtained by training to the virtual assembly process and generate the corresponding assembly task from the extracted feature information.
A computer-readable storage medium storing a plurality of instructions adapted to be loaded by a processor of a terminal device to execute the above human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly.
A terminal device comprising a processor and a computer-readable storage medium, the processor being configured to implement instructions, and the computer-readable storage medium storing a plurality of instructions adapted to be loaded by the processor to execute the above human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly.
Compared with the prior art, the beneficial effects of the present disclosure are:
the method solves the problem of misjudgment of similar behaviors in a single mode, and utilizes the advantages of a deep learning algorithm to identify the behaviors of operators in the video at a high accuracy rate, complete a virtual assembly task and realize human-computer interaction.
According to the method and the device, only the video sequence only comprising the eye movement information and the video sequence only comprising the gesture information are extracted, redundant information in the video is eliminated, the key features are extracted, and the complexity of a network model is reduced.
When the virtual assembly task is executed, an operator only needs to perform eye movement and gesture movement, the operation is more natural, and the experience feeling is better.
The method utilizes the advantages of deep learning, extracts eye movement and gesture features through self learning, exerts the advantages of processing the video sequence by the LSTM, and further improves the accuracy and naturalness of interaction under the virtual assembly scene.
Drawings
The accompanying drawings, which are included to provide a further understanding of the disclosure, illustrate embodiments of the disclosure and together with the description serve to explain the disclosure and are not to limit the disclosure.
FIG. 1 is a general block diagram of a virtual fit method combining eye tracking and gesture recognition, with key elements of the present disclosure within dashed boxes;
FIG. 2 is a schematic view of a gaze point estimation;
FIG. 3(a) shows an original image for gesture segmentation; FIG. 3(b) shows the gesture segmentation result.
FIGS. 4(a)-(d) are gesture diagrams: (a) only the index finger extended; (b) a fist; (c) five fingers open; and (d) five fingers bent.
FIG. 5 is a schematic diagram of a multi-modal converged CNN-LSTM network model.
Detailed Description
the present disclosure is further illustrated by the following examples in conjunction with the accompanying drawings.
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the disclosure. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this disclosure belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of example embodiments according to the present disclosure. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, and it should be understood that when the terms "comprises" and/or "comprising" are used in this specification, they specify the presence of stated features, steps, operations, devices, components, and/or combinations thereof, unless the context clearly indicates otherwise.
To address the shortcoming of existing interaction methods in the naturalness of interaction noted in the background, a human-computer interaction method fusing eye tracking and gesture recognition is provided. First, fusing eye tracking and gesture recognition reduces the misjudgment rate of similar behaviors under a single modality and improves the accuracy of virtual assembly. Second, a deep learning algorithm realizes the fusion: a multi-stream Convolutional Neural Network-Long Short-Term Memory (CNN-LSTM) model is built that extracts eye-movement and gesture features by self-learning and exploits the LSTM's strength in processing video sequences, improving the accuracy and naturalness of interaction in the virtual assembly scene. Third, the raw video data are preprocessed so that the network input consists of a video sequence containing only eye-movement information and a video sequence containing only gesture information, removing redundant information, extracting key features, and reducing the complexity of the network model.
First, eye-movement data are acquired, processed, and displayed in the virtual scene as a red marker point; then gesture information is acquired, the gesture is recognized, and the operator's hand is displayed in the virtual scene through three-dimensional modeling; finally, the gesture recognition result and the eye tracking result are fused to interact and complete the virtual assembly. The key techniques are the fusion of the two modalities, eye tracking and gesture recognition, and a CNN-LSTM network model suited to multi-modal fusion.
A human-computer interaction method fusing eye tracking and gesture recognition in virtual assembly specifically comprises the following steps:
Step 1, acquiring and processing eye-movement data: Kinect V2 collects the operator's eye-movement information, and a feature-based method realizes fixation-point tracking;
Step 2, acquiring gesture information and recognizing gestures: Kinect V2 collects the operator's gesture information; the gesture is segmented with a YCbCr skin-color model, among other techniques, and recognized with a feature-matching algorithm;
and Step 3, fusing the gesture recognition result and the eye tracking result to interact: the gesture data and eye-movement data obtained above are labeled according to the rules, the labeled data are used to train the multi-stream CNN-LSTM network model, and the trained optimal network model is applied to the virtual assembly process to complete the assembly task and realize the interaction.
Step 1 is implemented as follows:
a. illuminating the operator's eyes with infrared light sources mounted at the upper-left and upper-right corners of the display, and acquiring eye images with Kinect V2;
b. identifying the eye region with the Haar algorithm, locating the Purkinje spot formed on the eye by the infrared light source with a thresholding method, and detecting the pupil center with the Hough transform;
c. establishing an angular mapping between the position of the pupil center relative to the Purkinje spot and the position of the fixation point relative to the infrared light source, so as to estimate the fixation point accurately (one possible calibration-based realization is sketched after this list);
d. displaying the estimated fixation point in the virtual scene as a red marker point.
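One common way to realize the angular mapping of step c is a calibration-time polynomial fit from the pupil-glint offset to screen coordinates; the sketch below shows such a fit, where the second-order form and the least-squares solution are assumptions rather than details taken from the disclosure.

    import numpy as np

    def fit_gaze_mapping(pupil_glint_offsets, screen_points):
        """Fit a second-order polynomial mapping from the pupil-glint offset (dx, dy)
        to screen coordinates, using calibration samples gathered beforehand."""
        dx, dy = pupil_glint_offsets[:, 0], pupil_glint_offsets[:, 1]
        A = np.column_stack([np.ones_like(dx), dx, dy, dx * dy, dx**2, dy**2])
        coeff_x, *_ = np.linalg.lstsq(A, screen_points[:, 0], rcond=None)
        coeff_y, *_ = np.linalg.lstsq(A, screen_points[:, 1], rcond=None)
        return coeff_x, coeff_y

    def estimate_gaze(offset, coeff_x, coeff_y):
        """Map one pupil-glint offset to an estimated fixation point on the display."""
        dx, dy = offset
        a = np.array([1.0, dx, dy, dx * dy, dx**2, dy**2])
        return float(a @ coeff_x), float(a @ coeff_y)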
Step 2 is implemented as follows:
a. recording the operator's gesture position information in the video with Kinect V2;
b. segmenting the hand region by combining DCT-based edge detection with a YCbCr skin-color model: DCT filtering first masks the high-frequency part of the image to suppress noise, edge detection then finds all edge points of the hand, and finally skin color is mapped to the YCbCr space and the gesture is segmented with a skin-color model; a morphological closing applied to the segmented skin-color region fills holes, sharpens the contour, and yields the final hand region;
c. recognizing the gesture with a matching algorithm based on the principal direction of the gesture and the spatial distribution of gesture coordinate points: the feature vector of the gesture is extracted along its principal direction, its similarity to a sample library is measured, the M most similar candidate samples are selected, and the final gesture is identified among the M candidates by template matching with a modified Hausdorff-like distance;
d. constructing a hand model in the virtual scene through three-dimensional modeling and displaying it synchronously with the recognized gestures.
Step 3 is implemented as follows:
a. determining the data labeling rule: if the eye fixation point is uncertain, the gesture is a fist, and the hand is translating or stationary, the video is labeled "begin"; if the fixation point stays at a fixed position, the gesture is an extended index finger only, and the hand is stationary, the video is labeled "select object"; if the fixation point stays at a fixed position, the gesture is five bent fingers, and the hand is stationary, the video is labeled "grab object"; if the fixation point follows the hand along its trajectory and the gesture is five bent fingers, the video is labeled "move object"; if the fixation point stays at a fixed position, the gesture is five open fingers, and the hand is stationary, the video is labeled "release object"; if the fixation point is uncertain, the eyes are closed, or no hand is detected in the scene, the video is labeled "rest", meaning no operation is performed in the virtual scene;
b. building the multi-modal fused CNN-LSTM network model: first, a CNN adapted from the VGG-16 network (13 convolutional layers, 5 pooling layers, and 2 fully connected layers) extracts the spatial features of each video frame; second, an LSTM connected after the CNN part extracts temporal features, with the CNN-extracted features as its input and the number of input video frames fixed at 10, so that the features of the key video sequence are captured without redundancy; then the features extracted by the multi-stream CNN-LSTM network are fused through two fully connected layers, and, to keep overfitting to a minimum, each branch of the network is assigned a corresponding weight to produce a more reliable final feature representation; finally, a softmax layer yields the final classification result;
c. training the network model with the labeled data to determine the optimal network model;
d. applying the optimal network model to the virtual assembly process, analyzing the eye-movement information and gesture information of subsequent operators, completing the virtual assembly task, and realizing the interaction.
As a specific embodiment, a human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly is provided. First, the operator's eye-movement information is acquired with Kinect V2 and eye tracking is realized with a feature-based method; then the operator's gesture information is acquired with Kinect V2, the gesture is segmented with a YCbCr skin-color model (among other techniques), and gesture recognition is realized with a feature-matching algorithm; finally, the gesture data and eye-movement data obtained above are used to train the multi-stream CNN-LSTM network model, and the gesture recognition result and the eye tracking result are fused to complete the assembly task and realize the interaction. FIG. 1 is a general block diagram of the virtual assembly method combining eye tracking and gesture recognition, that is, of the human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly, whose specific embodiments are as follows:
(1) The operator's eyes are illuminated with infrared light sources mounted at the upper-left and upper-right corners of the display; the angle between the position of the pupil center relative to the corneal reflection point and the position of the fixation point relative to the infrared light sources is obtained; the fixation point is estimated accurately from the mapping between the pupil center and the corneal reflection point; and the estimated fixation point is displayed in the virtual scene as a red marker point, as shown in FIG. 2.
(2) The operator's gesture position information in the video is recorded with Kinect V2; the gesture is segmented by combining DCT-based edge detection with a YCbCr skin-color model; gesture recognition is realized with a feature-matching algorithm based on the principal direction of the gesture and the spatial distribution of gesture coordinate points (the relevant gestures are shown in FIGS. 4(a)-(d)); and a hand model is built in the virtual scene through three-dimensional modeling and displayed synchronously with the recognized gesture.
(3) The data labeling rule is determined: if the eye fixation point is uncertain, the gesture is a fist, and the hand is translating or stationary, the video is labeled "begin"; if the fixation point stays at a fixed position, the gesture is an extended index finger only, and the hand is stationary, the video is labeled "select object"; if the fixation point stays at a fixed position, the gesture is five bent fingers, and the hand is stationary, the video is labeled "grab object"; if the fixation point follows the hand along its trajectory and the gesture is five bent fingers, the video is labeled "move object"; if the fixation point stays at a fixed position, the gesture is five open fingers, and the hand is stationary, the video is labeled "release object"; if the fixation point is uncertain, the eyes are closed, or no hand is detected in the scene, the video is labeled "rest", meaning no operation is performed in the virtual scene.
(4) The multi-modal fused CNN-LSTM network model is built (as shown in FIG. 5). First, a CNN adapted from the VGG-16 network (13 convolutional layers, 5 pooling layers, and 2 fully connected layers) extracts the spatial features of each video frame; to normalize the data internally and reduce gradient explosion or vanishing, the SELU activation function is used between the convolutional layers in place of ReLU. Second, an LSTM connected after the CNN part extracts temporal features, with the CNN-extracted features as its input and the number of input video frames fixed at 10, so that the features of the key video sequence are captured without redundancy. Then the features extracted by the multi-stream CNN-LSTM network are fused through two fully connected layers, and, to keep overfitting to a minimum, each branch of the network is assigned a corresponding weight to produce a more reliable final feature representation (formula (1)). Finally, a softmax layer yields the final classification result.
F = w_eo·F_eo + w_et·F_et + w_go·F_go + w_gt·F_gt    (1)
F denotes the final feature representation; w_eo and F_eo denote the weight and final feature of the spatial stream of the eye-movement information; w_et and F_et denote the weight and final feature of the temporal stream of the eye-movement information; w_go and F_go denote the weight and final feature of the spatial stream of the gesture information; and w_gt and F_gt denote the weight and final feature of the temporal stream of the gesture information. Each weight w_eo (w_et, w_go, and w_gt are defined in the same way) is given by formula (2), reproduced only as an image in the original publication, in terms of w, the weight of the fitted objective function during network training.
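A minimal sketch of the weighted fusion of formula (1), written as a Keras layer, is given below; the softmax normalization of the trainable branch weights stands in for formula (2), which is not legible in the source, and is therefore an assumption.

    import tensorflow as tf
    from tensorflow.keras import layers

    class WeightedStreamFusion(layers.Layer):
        """Fuse the four stream features as F = w_eo*F_eo + w_et*F_et + w_go*F_go
        + w_gt*F_gt (formula (1)), with trainable branch weights normalized by a
        softmax (an assumption standing in for formula (2))."""

        def build(self, input_shape):
            # One scalar weight per incoming stream.
            self.raw_w = self.add_weight(name="branch_weights",
                                         shape=(len(input_shape),),
                                         initializer="zeros", trainable=True)

        def call(self, streams):
            w = tf.nn.softmax(self.raw_w)            # w_eo, w_et, w_go, w_gt
            return tf.add_n([w[i] * f for i, f in enumerate(streams)])

    # Hypothetical usage with the four stream features of the earlier model sketch:
    # fused = WeightedStreamFusion()([f_eo, f_et, f_go, f_gt])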
As will be appreciated by one skilled in the art, embodiments of the present disclosure may be provided as a method, system, or computer program product. Accordingly, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present disclosure may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and so forth) having computer-usable program code embodied therein.
The present disclosure is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the disclosure. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present disclosure and is not intended to limit the present disclosure, and various modifications and changes may be made to the present disclosure by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present disclosure should be included in the protection scope of the present disclosure.
Although the present disclosure has been described with reference to specific embodiments, it should be understood that the scope of the present disclosure is not limited thereto, and those skilled in the art will appreciate that various modifications and changes can be made without departing from the spirit and scope of the present disclosure.

Claims (8)

1. A human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly, characterized by comprising the following steps:
tracking a fixation point according to the obtained eye movement data;
recognizing the gesture according to the acquired gesture information;
labeling the obtained gesture-recognition data and eye-movement data to form a training set, and constructing a multi-stream convolutional neural network-long short-term memory (CNN-LSTM) network model that trains itself on the training set;
wherein, when the data are labeled, the video information is given different labels according to the combination of the eye fixation point and the gesture recognition result;
wherein the specific process of building the multi-modal fused CNN-LSTM network model comprises:
extracting the spatial features of each video frame with a CNN whose structure is adapted from the VGG-16 network, comprising 13 convolutional layers, 5 pooling layers, and 2 fully connected layers, with the SELU activation function used between the convolutional layers in place of the ReLU function;
extracting temporal features with an LSTM connected after the CNN part, the CNN-extracted features serving as the input of the LSTM network;
fusing the features extracted by the multi-stream CNN-LSTM network through two fully connected layers, assigning a corresponding weight to each branch of the network, and obtaining the final classification result with a softmax layer;
training the network model with the labeled data to determine an optimal network model;
and applying the trained optimal network model to the virtual assembly process, acquiring eye-movement data and gesture information during virtual assembly, extracting eye-movement and gesture features, and inferring the operator's behavior category from these features to complete the assembly task.
2. The human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly as claimed in claim 1, wherein: the specific process of tracking the fixation point according to the acquired eye movement data comprises the following steps:
acquiring an eye image;
identifying the eye region with the Haar algorithm, locating the Purkinje spot formed on the eye by the infrared light source with a thresholding method, and detecting the pupil center with the Hough transform;
establishing an angular mapping between the position of the pupil center relative to the Purkinje spot and the position of the fixation point relative to the infrared light source, so as to estimate the fixation point accurately;
and displaying the estimated fixation point in the virtual scene as a red marker point.
3. The human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly as claimed in claim 1, wherein: according to the acquired gesture information, the specific process of recognizing the gesture comprises the following steps:
a, recording gesture position information of an operator in a video;
b, segmenting the hand region by combining discrete-cosine-transform-based edge detection with a YCbCr skin-color model;
c, recognizing the gesture with a feature-matching algorithm based on the principal direction of the gesture and the spatial distribution of gesture coordinate points;
and d, constructing a hand model in the virtual scene through three-dimensional modeling and displaying it synchronously with the recognized gestures.
4. The human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly as claimed in claim 3, wherein: the specific process of the step b comprises the following steps:
the method comprises the steps of shielding a high-frequency part in an image by using DCT (discrete cosine transformation) filtering to further suppress noise, detecting all edge points of a hand by using edge detection, mapping skin colors to a YCbCr (YCbCr) space, segmenting gestures through a skin color model, performing morphological closed operation processing on segmented skin color areas, filling holes to enable the outline to be clearer, and obtaining a final hand area.
5. The human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly as claimed in claim 3, wherein: the specific process of the step c comprises the following steps:
extracting the feature vector of the gesture along its principal direction, measuring the similarity between the extracted features and a sample library, selecting the M most similar candidate samples, and identifying the final gesture among the M candidates by template matching with a modified Hausdorff-like distance.
6. A human-computer interaction device integrating eye tracking and gesture recognition in virtual assembly, characterized by comprising:
a light source configured to illuminate an eye position of an operator to be captured;
a Kinect device configured to acquire eye movement data and gesture information;
a processor configured to receive the information collected by the Kinect device, track the fixation point from the eye-movement data, and recognize gestures from the gesture information;
to label the obtained gesture-recognition data and eye-movement data to form a training set, and to construct a multi-stream convolutional neural network-long short-term memory (CNN-LSTM) network model that trains itself on the training set, wherein, when the data are labeled, the video information is given different labels according to the combination of the eye fixation point and the gesture recognition result;
wherein the specific process of building the multi-modal fused CNN-LSTM network model comprises:
extracting the spatial features of each video frame with a CNN whose structure is adapted from the VGG-16 network, comprising 13 convolutional layers, 5 pooling layers, and 2 fully connected layers, with the SELU activation function used between the convolutional layers in place of the ReLU function;
extracting temporal features with an LSTM connected after the CNN part, the CNN-extracted features serving as the input of the LSTM network;
fusing the features extracted by the multi-stream CNN-LSTM network through two fully connected layers, assigning a corresponding weight to each branch of the network, and obtaining the final classification result with a softmax layer;
training the network model with the labeled data to determine an optimal network model;
and applying the optimal network model obtained by training to the virtual assembly process and generating the corresponding assembly task from the extracted feature information.
7. A computer-readable storage medium, characterized in that it stores a plurality of instructions adapted to be loaded by a processor of a terminal device to execute the human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly according to any one of claims 1 to 5.
8. A terminal device, characterized by comprising a processor and a computer-readable storage medium, the processor being configured to implement instructions, and the computer-readable storage medium storing a plurality of instructions adapted to be loaded by the processor to execute the human-computer interaction method integrating eye tracking and gesture recognition in virtual assembly according to any one of claims 1 to 5.
CN201910670994.0A 2019-07-24 2019-07-24 Human-computer interaction method and device integrating eye movement tracking and gesture recognition in virtual assembly Active CN110362210B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910670994.0A CN110362210B (en) 2019-07-24 2019-07-24 Human-computer interaction method and device integrating eye movement tracking and gesture recognition in virtual assembly

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910670994.0A CN110362210B (en) 2019-07-24 2019-07-24 Human-computer interaction method and device integrating eye movement tracking and gesture recognition in virtual assembly

Publications (2)

Publication Number Publication Date
CN110362210A CN110362210A (en) 2019-10-22
CN110362210B true CN110362210B (en) 2022-10-11

Family

ID=68220850

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910670994.0A Active CN110362210B (en) 2019-07-24 2019-07-24 Human-computer interaction method and device integrating eye movement tracking and gesture recognition in virtual assembly

Country Status (1)

Country Link
CN (1) CN110362210B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111526118B (en) * 2019-10-29 2023-06-30 南京翱翔信息物理融合创新研究院有限公司 Remote operation guiding system and method based on mixed reality
CN111290575A (en) * 2020-01-21 2020-06-16 中国人民解放军空军工程大学 Multichannel interactive control system of air defense anti-pilot weapon
CN111368980B (en) * 2020-03-06 2023-11-07 京东科技控股股份有限公司 State detection method, device, equipment and storage medium
CN111625098B (en) * 2020-06-01 2022-11-18 广州市大湾区虚拟现实研究院 Intelligent virtual avatar interaction method and device based on multi-channel information fusion
CN112308116B (en) * 2020-09-28 2023-04-07 济南大学 Self-optimization multi-channel fusion method and system for old-person-assistant accompanying robot
CN112462940A (en) * 2020-11-25 2021-03-09 苏州科技大学 Intelligent home multi-mode man-machine natural interaction system and method thereof
CN112990153A (en) * 2021-05-11 2021-06-18 创新奇智(成都)科技有限公司 Multi-target behavior identification method and device, storage medium and electronic equipment
CN113283354B (en) * 2021-05-31 2023-08-18 中国航天科工集团第二研究院 Method, system and storage medium for analyzing eye movement signal behavior
CN113222712A (en) * 2021-05-31 2021-08-06 中国银行股份有限公司 Product recommendation method and device
CN113537335B (en) * 2021-07-09 2024-02-23 北京航空航天大学 Method and system for analyzing hand assembly skills
CN115111964A (en) * 2022-06-02 2022-09-27 中国人民解放军东部战区总医院 MR holographic intelligent helmet for individual training
CN116108549B (en) * 2023-04-12 2023-06-27 武昌理工学院 Green building component combined virtual assembly system and method

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102749991A (en) * 2012-04-12 2012-10-24 广东百泰科技有限公司 Non-contact free space eye-gaze tracking method suitable for man-machine interaction
CN104766054A (en) * 2015-03-26 2015-07-08 济南大学 Vision-attention-model-based gesture tracking method in human-computer interaction interface
WO2018033154A1 (en) * 2016-08-19 2018-02-22 北京市商汤科技开发有限公司 Gesture control method, device, and electronic apparatus

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102749991A (en) * 2012-04-12 2012-10-24 广东百泰科技有限公司 Non-contact free space eye-gaze tracking method suitable for man-machine interaction
CN104766054A (en) * 2015-03-26 2015-07-08 济南大学 Vision-attention-model-based gesture tracking method in human-computer interaction interface
WO2018033154A1 (en) * 2016-08-19 2018-02-22 北京市商汤科技开发有限公司 Gesture control method, device, and electronic apparatus

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Application of a Leap Motion-based gesture recognition method to tree interaction; Wang Hongquan et al.; Computer Applications and Software (《计算机应用与软件》); 2018-10-12 (No. 10); full text *
Capturing user intention in virtual environments; Cheng Cheng et al.; Journal of Image and Graphics (《中国图象图形学报》); 2015-02-16 (No. 02); full text *

Also Published As

Publication number Publication date
CN110362210A (en) 2019-10-22

Similar Documents

Publication Publication Date Title
CN110362210B (en) Human-computer interaction method and device integrating eye movement tracking and gesture recognition in virtual assembly
Zhou et al. A novel finger and hand pose estimation technique for real-time hand gesture recognition
CN103941866B (en) Three-dimensional gesture recognizing method based on Kinect depth image
Islam et al. Real time hand gesture recognition using different algorithms based on American sign language
CN102831439B (en) Gesture tracking method and system
CN107168527B (en) The first visual angle gesture identification and exchange method based on region convolutional neural networks
JP6079832B2 (en) Human computer interaction system, hand-to-hand pointing point positioning method, and finger gesture determination method
US10664983B2 (en) Method for providing virtual reality interface by analyzing image acquired by single camera and apparatus for the same
CN105809144A (en) Gesture recognition system and method adopting action segmentation
CN104616028B (en) Human body limb gesture actions recognition methods based on space segmentation study
EP3035235B1 (en) Method for setting a tridimensional shape detection classifier and method for tridimensional shape detection using said shape detection classifier
Wu et al. Robust fingertip detection in a complex environment
CN107450714A (en) Man-machine interaction support test system based on augmented reality and image recognition
Pandey et al. Hand gesture recognition for sign language recognition: A review
CN103034851B (en) The hand tracking means based on complexion model of self study and method
CN109325408A (en) A kind of gesture judging method and storage medium
She et al. A real-time hand gesture recognition approach based on motion features of feature points
CN109543644A (en) A kind of recognition methods of multi-modal gesture
Liu et al. Temporal segmentation of fine-gained semantic action: A motion-centered figure skating dataset
CN114549557A (en) Portrait segmentation network training method, device, equipment and medium
Thabet et al. Fast marching method and modified features fusion in enhanced dynamic hand gesture segmentation and detection method under complicated background
CN111460858B (en) Method and device for determining finger tip point in image, storage medium and electronic equipment
Fakhfakh et al. Gesture recognition system for isolated word sign language based on key-point trajectory matrix
CN108108648A (en) A kind of new gesture recognition system device and method
Hasan et al. Gesture feature extraction for static gesture recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230928

Address after: Room 507, 5th Floor, Block A, Building A1, Industry University Research Base, No. 9 Haichuan Road, Liuhang Street, High tech Zone, Jining City, Shandong Province, 272199

Patentee after: Microcrystalline Shushi (Shandong) Equipment Technology Co.,Ltd.

Address before: 250022 No. 336, South Xin Zhuang West Road, Shizhong District, Ji'nan, Shandong

Patentee before: University of Jinan