CN107545302B - Eye direction calculation method for combination of left eye image and right eye image of human eye - Google Patents


Info

Publication number
CN107545302B
Authority
CN
China
Prior art keywords
eye
image
information
right eye
neural network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710650058.4A
Other languages
Chinese (zh)
Other versions
CN107545302A (en)
Inventor
陆峰
陈小武
赵沁平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beihang University
Original Assignee
Beihang University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beihang University filed Critical Beihang University
Priority to CN201710650058.4A priority Critical patent/CN107545302B/en
Publication of CN107545302A publication Critical patent/CN107545302A/en
Application granted granted Critical
Publication of CN107545302B publication Critical patent/CN107545302B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • A: HUMAN NECESSITIES
    • A61: MEDICAL OR VETERINARY SCIENCE; HYGIENE
    • A61B: DIAGNOSIS; SURGERY; IDENTIFICATION
    • A61B 3/00: Apparatus for testing the eyes; Instruments for examining the eyes
    • A61B 3/10: Objective types, i.e. instruments for examining the eyes independent of the patients' perceptions or reactions
    • A61B 3/113: Objective types, i.e. instruments for examining the eyes independent of the patients' perceptions or reactions, for determining or recording eye movement

Landscapes

  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Heart & Thoracic Surgery (AREA)
  • Molecular Biology (AREA)
  • Biophysics (AREA)
  • Ophthalmology & Optometry (AREA)
  • Biomedical Technology (AREA)
  • Human Computer Interaction (AREA)
  • Medical Informatics (AREA)
  • Physics & Mathematics (AREA)
  • Surgery (AREA)
  • Animal Behavior & Ethology (AREA)
  • General Health & Medical Sciences (AREA)
  • Public Health (AREA)
  • Veterinary Medicine (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a method for calculating the gaze direction from the combined left-eye and right-eye images of a human user, comprising the following parts: a binocular-information extraction model that takes a human eye image as input and automatically extracts the left-eye and right-eye information features through a dual-channel model; a joint-information extraction model that takes the user's binocular images as input and extracts joint human-eye information features by combining the binocular information; and a joint algorithm, disclosed by the invention, that calculates the three-dimensional gaze direction from the input feature information. One application of the invention is virtual reality and human-computer interaction: the user's gaze direction is calculated from captured images of the user's eyes, enabling interaction with an intelligent system interface or with virtual-reality objects. The invention can also be widely used in fields such as training, games and entertainment, video surveillance, and medical monitoring.

Description

Eye direction calculation method for combination of left eye image and right eye image of human eye
Technical Field
The invention relates to the field of computer vision and image processing, in particular to a method for calculating the combined sight direction of left and right eye images of human eyes.
Background
Gaze/eye tracking is of great importance for understanding user behavior and for efficient human-computer interaction. More than 80% of the information humans can perceive is received through the eyes, and more than 90% of it is processed by the visual system. Gaze is therefore an important cue reflecting how humans interact with the outside world. In recent years, the rapid development of virtual reality and human-computer interaction technology has gradually highlighted the application value of gaze tracking; at the same time, gaze direction calculation remains a very challenging problem in the field of computer vision.
Current gaze tracking technology falls into two fundamental categories: appearance-based and model-based. At present, most research effort is devoted to the model-based approach, because it tends to be highly accurate. Model-based gaze tracking requires the experimenter to provide a number of geometric features, such as the direction of the pupil, from which an eye model is built and the gaze direction is predicted. As a result, model-based gaze tracking suffers from several drawbacks. 1) It requires expensive instrumentation: because the gaze direction is predicted by building an eye model or other geometric model, specialized devices are needed to extract and model the participant's ocular geometry. 2) It must be performed in a strictly controlled indoor environment: the required geometric features are generally measured with infrared light, and interference sources such as sunlight contain so much infrared that they severely corrupt the instrument's measurements, so the equipment must be placed in a controlled indoor environment to avoid such interference. 3) It requires high-resolution images and therefore has a limited working distance, typically not exceeding 60 cm. For these reasons, model-based methods cannot be used universally in most common environments.
In contrast, appearance-based gaze tracking learns a mapping directly from human eye pictures to gaze, obtaining the gaze direction without the limitations of model-based techniques: it only needs eye pictures taken by an ordinary camera. This gives appearance-based gaze tracking general applicability and considerable application prospects. However, because its requirements on the sampling tools and sampling environment are loose, the input data carry many environmental factors, for example: illumination, the participant, head position, and so on. Illumination intensity makes a picture brighter or darker, and in an extremely dark environment even a person can hardly distinguish someone's eye region in a picture; similarly, head position strongly affects eye sampling, and the eye images obtained from a frontal photograph and a side photograph of the same person differ. Because of these factors, the input data of appearance-based gaze tracking are very noisy; this is the challenge the approach faces, and because of this drawback its accuracy still falls well short of model-based methods.
Meanwhile, current appearance-based gaze tracking methods usually take a monocular image of the user as input. In practical applications, however, each acquisition typically yields a binocular image of the user at a given moment; feeding the two eye images of the same moment separately ignores the correlation between them.
In recent years, many new models have been developed, among which the neural network stands out for its performance. The convolutional neural network (CNN) in deep learning is one category of neural network that has become particularly popular. Because of its local-perception property, a CNN extracts the local features of a picture well and retains the picture's local correlation information; and because of weight sharing, it does not require a large amount of time to train. CNNs therefore excel in various image processing tasks, such as image classification, object detection, and semantic segmentation. Meanwhile, the rapid development of hardware in recent years has further improved CNN performance in image processing. However, no comparable method for determining the direction of human gaze has yet been reported in the literature.
Disclosure of Invention
The invention addresses the following problem: by combining the image information of both eyes, it mitigates the high noise of single-eye image input in appearance-based gaze tracking, thereby achieving high-precision three-dimensional gaze direction prediction.
The technical scheme of the invention is as follows: a method for calculating the combined sight line direction of left and right eye images of human eyes comprises the following steps:
(1) shooting a facial image of a user, positioning a left eye or right eye area, preprocessing the human eye image, realizing the correction of the head position, and obtaining the human eye image with a fixed pixel size;
(2) establishing a two-channel model, respectively inputting image information of a left eye and a right eye in a human eye image, and respectively extracting and outputting information characteristics of the left eye and the right eye by using a deep neural network model;
(3) establishing a single-channel model, inputting image information of a left eye and a right eye, and extracting and outputting combined information characteristics of a left eye image and a right eye image by using a deep neural network model;
(4) predicting the three-dimensional gaze directions corresponding to the two eyes by regression analysis, combining the left-eye and right-eye information features with the joint information features of the left-eye and right-eye images through joint optimization; or, using the left-eye and right-eye information features alone, or the joint information features alone, with regression analysis, predicting the three-dimensional gaze directions corresponding to the two eyes after optimization.
Establishing a dual-channel model, respectively inputting image information of the left eye and the right eye in the human eye image, and respectively extracting and outputting information characteristics of the left eye and the right eye through the dual-channel model, wherein the specific process comprises the following steps:
(21) the corrected, fixed-size left-eye and right-eye images I_l and I_r are input into the two-channel model, and I_l and I_r are each processed by one channel;
(22) each channel is a deep neural network model, and the model performs convolution, pooling and full-connection operations on an input human eye image and outputs a feature vector with a fixed length;
(23) the fixed-length feature vector generated by each channel is the information feature of the corresponding input image after extraction by the deep neural network, and the information features generated by the two channels are connected to obtain the final information features of the left eye and the right eye.
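The two-channel extraction in steps (21)-(23) can be sketched as follows. This is a minimal numpy stand-in: each channel is reduced to a single hypothetical linear map instead of the patent's convolution/pooling/fully-connected stack, and the 36 × 36 input size and 1000-dimensional feature length are assumptions taken from the detailed description.

```python
import numpy as np

def channel_features(eye_img: np.ndarray, W: np.ndarray) -> np.ndarray:
    # Stand-in for one channel: flatten the fixed-size eye image and
    # apply a linear map to get a fixed-length feature vector.  The
    # patent's channel is a deep CNN (conv + pool + fully connected);
    # the single linear map here is a hypothetical simplification.
    return W @ eye_img.ravel()

rng = np.random.default_rng(0)
I_l = rng.random((36, 36))                   # corrected left-eye image
I_r = rng.random((36, 36))                   # corrected right-eye image
W_l = rng.standard_normal((1000, 36 * 36))   # left-channel weights
W_r = rng.standard_normal((1000, 36 * 36))   # right-channel weights

f_l = channel_features(I_l, W_l)             # left-eye information feature
f_r = channel_features(I_r, W_r)             # right-eye information feature
binocular = np.concatenate([f_l, f_r])       # step (23): connect both channels
print(binocular.shape)                       # (2000,)
```

The key structural point is that the two channels never mix: each eye's feature is produced independently and the outputs are only concatenated at the end.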
Establishing a single-channel model, inputting picture information of a left eye and a right eye, and extracting and outputting combined information characteristics of the left eye image and the right eye image by using the single-channel model in the step (3) as follows:
(31) inputting the corrected human eye image with fixed size into a single-channel model;
(32) respectively performing convolution, pooling and full-connection operation on the images of the left eye and the right eye by using a deep neural network model, and outputting simplified left and right eye information characteristics;
(33) and connecting the left and right eye information characteristics, adding a plurality of full connection layers behind the deep neural network model, and combining the information characteristics of the left and right eyes by using the full connection layers to finally obtain the left and right eye image combined information characteristics.
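A minimal sketch of the joint-feature fusion in steps (31)-(33), assuming 500-dimensional per-eye features and a single fully-connected fusion layer (the patent allows several); the weights and the ReLU nonlinearity here are illustrative stand-ins, not the patent's actual parameters.

```python
import numpy as np

def fuse_joint_features(f_l: np.ndarray, f_r: np.ndarray,
                        W_fc: np.ndarray) -> np.ndarray:
    # Step (33): connect the simplified per-eye features, then fuse
    # them with a fully-connected layer (one layer with ReLU here).
    concat = np.concatenate([f_l, f_r])
    return np.maximum(W_fc @ concat, 0.0)

rng = np.random.default_rng(1)
f_l = rng.random(500)                        # simplified left-eye feature, step (32)
f_r = rng.random(500)                        # simplified right-eye feature, step (32)
W_fc = rng.standard_normal((500, 1000))      # illustrative fusion weights
joint = fuse_joint_features(f_l, f_r, W_fc)  # joint information feature
print(joint.shape)                           # (500,)
```

Unlike the two-channel model, the fully-connected fusion lets every output dimension depend on both eyes, which is what captures the binocular correlation.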
The specific process of predicting the three-dimensional sight directions respectively corresponding to the two eyes by combining the information characteristics of the left eye and the right eye and the combined information characteristics of the left eye image and the right eye image by using a regression analysis method in the step (4) is as follows:
(41) inputting the corrected left-eye and right-eye images I_l and I_r, and the corresponding true left-eye gaze direction g_l and true right-eye gaze direction g_r;
(42) Extracting binocular information features and joint information features corresponding to the images by using the deep neural network models proposed in the step (2) and the step (3);
(43) connecting all the extracted information features, or using one feature alone, as the overall feature, and using regression analysis to obtain the predicted left-eye gaze direction f(I_l) and the predicted right-eye gaze direction f(I_r);
(44) The angle difference is used as an error value, a gradient descent method is used, and the model is subjected to iterative optimization, so that the predicted sight line direction is closer to the real sight line direction;
(45) and selecting the model with the predicted sight line direction closest to the real sight line direction as a final model, inputting the human eye image by the model to obtain the predicted sight line direction, and taking the sight line direction as a final prediction result.
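The iterative optimization of steps (41)-(45) can be sketched with a toy linear regressor. This is not the patent's deep network: the features, targets, and weights below are synthetic, and the gradient step minimizes a squared-error proxy while the angular deviation of step (44) is tracked as the evaluation measure.

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.random((200, 32))               # synthetic overall feature vectors
W_true = rng.standard_normal((32, 3))   # hidden "true" feature-to-gaze map
G = X @ W_true                          # true 3-D gaze directions

def mean_angle_deg(P: np.ndarray, G: np.ndarray) -> float:
    # Step (44) error value: mean angle between predicted and true
    # gaze vectors, via the normalized dot product.
    P = P / np.linalg.norm(P, axis=1, keepdims=True)
    Gn = G / np.linalg.norm(G, axis=1, keepdims=True)
    cos = np.clip(np.sum(P * Gn, axis=1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

W = 0.01 * rng.standard_normal((32, 3))    # random initialization
errs = []
for _ in range(200):                       # iterative optimization
    grad = 2 * X.T @ (X @ W - G) / len(X)  # squared-error proxy gradient
    W -= 0.05 * grad                       # gradient-descent step
    errs.append(mean_angle_deg(X @ W, G))
print(errs[-1] < errs[0])                  # deviation shrinks over iterations
```

Step (45) then amounts to keeping the iterate with the smallest recorded angular deviation as the final model.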
Compared with other sight tracking methods, the invention has the beneficial characteristics that:
(1) A binocular information-feature extraction model is invented that can extract the information features of both eyes; the gaze direction can be predicted using these binocular features alone, and the result is still superior to common monocular gaze tracking methods;
(2) Considering that the binocular images captured at the same moment are correlated, the invention provides an extraction model for joint human-eye information features that extracts the correlated features of the two eyes and effectively combines their image information; predicting the gaze direction using only these correlated binocular features still outperforms common monocular gaze tracking methods;
(3) A multi-path neural network is established that, given the feature information and a regression method, predicts the three-dimensional gaze directions of the two eyes more accurately, and at the same time effectively mitigates inaccurate predictions caused by large noise in an individual monocular image.
Drawings
FIG. 1 is a schematic diagram of the network architecture of the present invention, wherein a is the two-channel model in step (2) of the inventive content, b is the single-channel model in step (3) of the inventive content, and c is the line-of-sight prediction model in step (4) of the inventive content;
FIG. 2 is a schematic diagram of the basic neural network architecture of the present invention;
FIG. 3 is a general block diagram of the computing method of the present invention for matching gaze directions based on user binocular analysis;
FIG. 4 is a model training flow diagram of the present invention.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings.
The invention provides a method for calculating the gaze direction from the combined left-eye and right-eye images of a human user: it takes human-eye information features as input and predicts the gaze directions of both eyes, and it also provides single-channel and dual-channel deep neural network models for extracting the information features used by the method. The method places no extra requirements on the system, using only eye images taken by a single camera as input. Meanwhile, by combining the image information of both eyes, the invention can suppress error cases in which one eye's image is especially noisy, achieving better robustness than other comparable methods.
First, for eye image acquisition, the invention proceeds as follows. Using a single camera, an image containing the user's face region is captured. The left-eye or right-eye area is located using an existing face-analysis method. The extracted eye image is preprocessed to obtain a head-position-corrected eye image of fixed pixel size.
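A minimal sketch of this acquisition step, assuming the eye bounding box comes from some existing face-analysis method (which the patent does not specify) and using nearest-neighbour resizing to reach a hypothetical fixed 36 × 36 size matching the CNN input described later.

```python
import numpy as np

def crop_and_resize(face: np.ndarray, box: tuple, out: int = 36) -> np.ndarray:
    # Crop the located eye region and resize it (nearest neighbour)
    # to a fixed out x out pixel image.  `box` = (top, left, height,
    # width) is assumed to come from an existing face-analysis step.
    t, l, h, w = box
    eye = face[t:t + h, l:l + w]
    rows = np.arange(out) * h // out     # source row for each output row
    cols = np.arange(out) * w // out     # source column for each output column
    return eye[np.ix_(rows, cols)]

face = np.random.default_rng(3).random((480, 640))   # captured face image
eye = crop_and_resize(face, (200, 150, 60, 90))      # hypothetical eye box
print(eye.shape)                                     # (36, 36)
```

Head-position correction (a perspective warp of the eye patch) would happen before or during this step; it is omitted here since the patent leaves its details to the preprocessing stage.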
Secondly, a dual-channel deep neural network model that simultaneously extracts left-eye and right-eye information features is invented: given a binocular image as input, the left-eye and right-eye images each enter one channel, and the feature information of the left and right eye is obtained after each channel processes its image independently.
Furthermore, a single-channel deep neural network model that extracts joint human-eye information features from the binocular input is invented: the model processes the image information of the left and right eyes simultaneously, combines the processed information, and processes the combined information further, obtaining the joint information features of the two eyes.
Finally, by combining the two network models, the method for jointly determining the gaze direction from the left-eye and right-eye images of a human eye is obtained. The method establishes a multi-path deep neural network gaze-prediction model: by fusing the two feature-extraction models, it concatenates the obtained binocular information features and the joint binocular features, obtains the predicted gaze directions of both eyes by regression analysis, and evaluates the current model by the angular deviation between the predicted and true gaze directions. The model is optimized autonomously by gradient descent, using the formula:
$$\bar{\theta} = \frac{1}{2n} \sum_{i=1}^{n} \left[ \arccos\frac{f(I_l^i) \cdot g_l^i}{\|f(I_l^i)\|\,\|g_l^i\|} + \arccos\frac{f(I_r^i) \cdot g_r^i}{\|f(I_r^i)\|\,\|g_r^i\|} \right]$$
to calculate the angular deviation, where n denotes the number of input image pairs. The model is optimized continually with the goal of reducing this angular deviation: during joint optimization, each input pair of eye images with its true gaze directions triggers one iterative optimization of the model, and the optimization process finishes once all known image information has been input, yielding the final model. In practical application, the model receives a brand-new pair of eye images and directly predicts their gaze direction.
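Under the standard definition of the angle between two gaze vectors (arccos of the normalized dot product), the mean angular deviation over n image pairs can be computed as:

```python
import numpy as np

def mean_angular_deviation(pred: np.ndarray, true: np.ndarray) -> float:
    # Mean angle (degrees) between predicted and true 3-D gaze vectors
    # over n pairs: arccos of the normalized dot product per pair.
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    true = true / np.linalg.norm(true, axis=1, keepdims=True)
    cos = np.clip(np.sum(pred * true, axis=1), -1.0, 1.0)
    return float(np.degrees(np.arccos(cos)).mean())

pred = np.array([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0]])
true = np.array([[0.0, 1.0, 1.0], [1.0, 0.0, 0.0]])   # 45 deg and 0 deg apart
print(round(mean_angular_deviation(pred, true), 6))   # 22.5
```

The clip to [-1, 1] guards against floating-point round-off pushing the cosine slightly outside arccos's domain.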
Meanwhile, the gaze direction estimation method is reducible: the information features of the left and right eyes, or the joint information features, can each be used alone for regression analysis to obtain the predicted gaze directions of both eyes, with the mean angular deviation between the true gaze direction of the input image and the output predicted gaze direction used as the error term for adaptive adjustment of the prediction model. Similarly, the method is additive: after the binocular information features and the joint binocular features are obtained, other relevant information features can be added directly, and regression analysis is performed on all features as a whole.
In the following, a detailed description is given, referring to fig. 1, which shows a schematic diagram of the network structure of the present invention, specifically as follows:
fig. 1 (a) is a schematic structural diagram, TE-I, of the two-channel deep neural network model that extracts left-eye and right-eye information features simultaneously. Given the two eye images as input, the model processes them with a network based on the convolutional neural network (CNN) and outputs the feature information of the left and right eye; this feature information is called the binocular features, and the gaze directions of the two eyes can be predicted from the binocular features alone;
fig. 1 (b) is a structural diagram, TE-II, of the single-channel deep neural network model that extracts the joint left-eye and right-eye information features. After the binocular image is input, the images are first processed by the CNN-based network to obtain independent per-eye feature information, which is then fused through a fully connected layer to obtain the binocular correlation information; likewise, the gaze directions of the two eyes can be predicted from the binocular correlation information alone;
fig. 1 (c) is a schematic structural diagram, TE-A, of the network model of the method for jointly calculating the gaze direction from the left-eye and right-eye images of a human eye. By combining the two model structures of fig. 1 (a) and fig. 1 (b), the binocular features and the binocular correlation features are obtained at once, and with this information as the overall features, regression analysis yields the gaze directions of the two eyes.
In all three models, the head-position vector is also appended to the final feature set before the gaze direction is predicted. This is because differences in the user's head position deform the eye images differently when the eye appearance is photographed from the front by the camera; although the eye images are transformed beforehand to reduce the influence of head position, that influence cannot be completely eliminated.
Referring to fig. 2, a schematic diagram of the basic neural network structure of the present invention is shown. To extract good feature information from an image, and in view of the excellent performance of convolutional neural networks in image processing, a CNN is used as the basic feature-extraction network. The input of the network is a 36 × 36 grayscale picture and the output is an x-dimensional feature, where the specific value of x can be set freely. After the picture is input, it first passes through one convolution layer with kernel size 5 × 5 and 20 output channels, producing 20 feature maps of size 32 × 32. These then undergo 2 × 2 max pooling, yielding 20 maps of size 16 × 16. The 20 maps are convolved again, still with a 5 × 5 kernel, into 50 output channels, producing 50 maps of size 12 × 12, which undergo another 2 × 2 max pooling to obtain 50 maps of size 6 × 6. Finally, the 50 maps are flattened and passed through a fully connected layer to produce the x-dimensional output feature.
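Assuming 'valid' 5 × 5 convolutions and non-overlapping 2 × 2 pooling, the feature-map sizes in this network follow arithmetically and can be verified in a few lines:

```python
def conv_out(n: int, k: int) -> int:
    return n - k + 1        # 'valid' convolution output size

def pool_out(n: int, p: int) -> int:
    return n // p           # non-overlapping max pooling

n = 36                      # 36 x 36 grayscale input
n = conv_out(n, 5)          # conv1: 5 x 5 kernel, 20 channels -> 32 x 32
n = pool_out(n, 2)          # 2 x 2 max pool                   -> 16 x 16
n = conv_out(n, 5)          # conv2: 5 x 5 kernel, 50 channels -> 12 x 12
n = pool_out(n, 2)          # 2 x 2 max pool                   -> 6 x 6
print(n, n * n * 50)        # 6 1800: flattened input to the final FC layer
```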
Referring to fig. 3, a general structure diagram of the method for determining the combined gaze direction of the left-eye and right-eye images of a human eye according to the present invention is shown. The invention constructs its own neural network and predicts the three-dimensional gaze direction of the user's two eyes. The method takes fixed-size grayscale eye pictures and a head-angle vector as input, obtains a 1506-dimensional feature vector, and then obtains the 6-dimensional binocular gaze direction through regression analysis. The general structure also contains the networks of steps (2) and (3) in the summary of the invention. The overall network of step (2) corresponds to the upper two eye pictures used as input in fig. 3: the network takes the left-eye and right-eye images as input, convolves them with the CNN of fig. 2 with the final feature number x set to 1000, obtaining feature vectors of length 1000; each passes through a fully connected layer (FC) to obtain a 500-dimensional feature vector; finally, the two 500-dimensional vectors are simply concatenated as the output features of the first part. The overall network of step (3) corresponds to the lower two images used as input in fig. 3: the second part also takes left-eye and right-eye images as input, convolves them with the CNN with the final feature number x set to 500 to obtain 500-dimensional feature vectors, simply concatenates them into a 1000-dimensional vector, and fuses them through a fully connected layer into a 500-dimensional feature vector, which is the output of the second part. Similarly, the invention takes the head-position vector as the third input and adds it directly to the final feature vector without processing, since differences in head position have an effect on the image that cannot be fully removed. The three parts output 1000-dimensional, 500-dimensional, and 6-dimensional feature vectors respectively, and concatenating them yields the final 1506-dimensional feature.
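Assembling the final 1506-dimensional feature can be sketched as the concatenation of the three parts; the vectors below are filled with random placeholder values, with only the dimensions taken from the description.

```python
import numpy as np

rng = np.random.default_rng(4)
# Part 1 (two-channel model): two 500-dim per-eye vectors, concatenated.
part1 = np.concatenate([rng.random(500), rng.random(500)])
# Part 2 (single-channel model): one fused 500-dim joint vector.
part2 = rng.random(500)
# Part 3: 6-dim head-position vector, appended without processing.
head = rng.random(6)

overall = np.concatenate([part1, part2, head])
print(overall.shape)        # (1506,): 1000 + 500 + 6
```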
Referring to fig. 4, a model training flowchart of the present invention is shown. In conjunction with the techniques described above, the specific process of predicting the user's gaze direction from the user's eye images is described below.
First, the prediction model of fig. 3 is initialized by simply assigning random initial values. Subsequently, the processed eye images are input; each time a pair of eye images I_l and I_r is input, the network produces a pair of predicted three-dimensional gaze directions f(I_l) and f(I_r), representing the gaze directions of the left and right eye respectively. These are compared with the original three-dimensional gaze directions g_l and g_r to obtain the predicted angular deviation; the network is optimized continually by gradient descent with the goal of reducing this deviation, the network parameters undergoing one iterative adjustment per input image pair. When all images have been input, the final prediction model is obtained. With the final prediction model, the gaze direction corresponding to an image can be predicted simply by inputting the image.
The above description is only an exemplary embodiment of the present invention, and any equivalent changes made according to the technical solutions of the present invention should fall within the protection scope of the present invention.

Claims (3)

1. A method for calculating the combined sight line direction of left and right eye images of human eyes is characterized by comprising the following steps:
(1) shooting a facial image of a user, positioning a left eye or right eye area, preprocessing the human eye image, realizing the correction of the head position, and obtaining the human eye image with a fixed pixel size;
(2) establishing a two-channel model, respectively inputting image information of a left eye and a right eye in a human eye image, and respectively extracting and outputting information characteristics of the left eye and the right eye by using a deep neural network model;
(3) establishing a single-channel model, inputting image information of a left eye and a right eye, and extracting and outputting combined information characteristics of a left eye image and a right eye image by using a deep neural network model;
(4) predicting three-dimensional sight directions respectively corresponding to the two eyes by using a regression analysis method and combining the information characteristics of the left eye and the right eye and the combined information characteristics of the left eye image and the right eye image through combined optimization;
the specific process of the step (4) is as follows:
(41) inputting the corrected left-eye and right-eye images I_l and I_r, and the corresponding true left-eye gaze direction g_l and true right-eye gaze direction g_r;
(42) Extracting binocular information features and joint information features corresponding to the images by using the deep neural network models proposed in the step (2) and the step (3);
(43) connecting all the extracted binocular information features and the joint information features as the overall feature, and using regression analysis to obtain the predicted left-eye gaze direction f(I_l) and the predicted right-eye gaze direction f(I_r);
(44) The angle difference is used as an error value, a gradient descent method is used, and the integral deep neural network model is subjected to iterative optimization, so that the predicted sight line direction is closer to the real sight line direction;
(45) selecting a depth neural network model with the predicted sight direction closest to the real sight direction as a final depth neural network model, inputting a human eye image by the depth neural network model to obtain the predicted sight direction, and taking the sight direction as a final prediction result;
the deep neural network model is optimized by gradient descent, using the formula:
[Formula image FDA0002471437050000011: the angular-deviation formula, rendered as an image in the original]
calculating the angular deviation, where n denotes the number of input image pairs. The deep neural network model is continuously optimized with the goal of reducing this angular deviation: during joint optimization, each input pair of human-eye images with its true gaze directions triggers one iteration of optimization. After all known image information has been input, the optimization process is complete and the final deep neural network model is obtained; given a brand-new pair of human-eye images, the final model directly predicts their gaze direction.
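The error measure in steps (41)-(45) is an angular deviation between predicted and true 3D gaze vectors, averaged over the n input image pairs. A minimal NumPy sketch, assuming the standard mean-angular-error formulation (the patent's exact formula is only available as an image, and the function names here are illustrative):

```python
import numpy as np

def angular_error_deg(pred, true):
    """Per-sample angle, in degrees, between predicted and true 3D gaze vectors."""
    p = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    t = true / np.linalg.norm(true, axis=1, keepdims=True)
    cos = np.clip(np.sum(p * t, axis=1), -1.0, 1.0)  # clamp for numerical safety
    return np.degrees(np.arccos(cos))

def mean_binocular_error(pred_l, true_l, pred_r, true_r):
    """Mean angular deviation over n image pairs, averaged across both eyes."""
    errs = np.concatenate([angular_error_deg(pred_l, true_l),
                           angular_error_deg(pred_r, true_r)])
    return float(errs.mean())
```

During joint optimization, this scalar would serve as the loss that gradient descent reduces one image pair at a time.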
2. The method for calculating the gaze direction of combined left-eye and right-eye images of a human eye according to claim 1, wherein the specific process of establishing the two-channel model, inputting the image information of the left eye and the right eye in the human-eye image separately, and extracting and outputting the left-eye and right-eye information features through the two-channel model is as follows:
(21) inputting the corrected, fixed-size left-eye image I_l and right-eye image I_r into the two-channel model, where I_l and I_r are each processed through one channel;
(22) each channel is a deep neural network model that performs convolution, pooling and fully-connected operations on the input human-eye image and outputs a fixed-length feature vector;
(23) the fixed-length feature vector generated by each channel is the information feature extracted by the deep neural network from the corresponding input image; the information features generated by the two channels are concatenated to obtain the final left-eye and right-eye information features.
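The per-channel pipeline of steps (21)-(23) can be sketched as follows. This is a minimal NumPy illustration, not the patent's actual network: the 36x60 crop size, kernel size, 128-dimensional feature length, and the weights shared between channels are assumptions for brevity (in the patent, each channel is its own deep model):

```python
import numpy as np

def conv2d_valid(img, kernel):
    """'Valid' 2-D cross-correlation (what deep-learning frameworks call convolution)."""
    kh, kw = kernel.shape
    h, w = img.shape
    out = np.empty((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(img[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(x, s=2):
    """Non-overlapping s x s max pooling."""
    h, w = (x.shape[0] // s) * s, (x.shape[1] // s) * s
    return x[:h, :w].reshape(h // s, s, w // s, s).max(axis=(1, 3))

def channel_features(img, kernel, W, b):
    """One channel: convolution -> ReLU -> pooling -> fully-connected layer."""
    x = np.maximum(conv2d_valid(img, kernel), 0.0)
    x = max_pool(x).ravel()
    return np.maximum(W @ x + b, 0.0)  # fixed-length feature vector

rng = np.random.default_rng(0)
eye_l = rng.random((36, 60))                      # hypothetical corrected eye crops
eye_r = rng.random((36, 60))
kernel = rng.standard_normal((5, 5)) * 0.1
W, b = rng.standard_normal((128, 16 * 28)) * 0.01, np.zeros(128)

# each eye goes through its own channel; the two 128-d features are concatenated
feat = np.concatenate([channel_features(eye_l, kernel, W, b),
                       channel_features(eye_r, kernel, W, b)])
```

The concatenated `feat` corresponds to the "final information features of the left eye and right eye" in step (23).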
3. The method for calculating the gaze direction of combined left-eye and right-eye images of a human eye according to claim 1, wherein the specific process of establishing the single-channel model, inputting the image information of the left eye and the right eye, and extracting and outputting the joint information features of the left-eye and right-eye images through the single-channel model in step (3) is as follows:
(31) inputting the corrected, fixed-size human-eye images into the single-channel model;
(32) performing convolution, pooling and fully-connected operations on the left-eye and right-eye images respectively with a deep neural network model, and outputting condensed left-eye and right-eye information features;
(33) concatenating the left-eye and right-eye information features, appending several fully-connected layers to the deep neural network model, and merging the left-eye and right-eye information features through these fully-connected layers to finally obtain the joint information features of the left-eye and right-eye images.
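The merge-and-regress stage of step (33) and claim 1 can be sketched as follows: concatenate the per-eye features, merge them through fully-connected layers, and regress two 3D gaze vectors. A minimal NumPy illustration; the feature dimensions, layer sizes, and 6-value output head are assumptions, not the patent's specification:

```python
import numpy as np

def joint_features(feat_l, feat_r, W1, b1):
    """Concatenate per-eye features and merge them with a fully-connected layer."""
    x = np.concatenate([feat_l, feat_r])
    return np.maximum(W1 @ x + b1, 0.0)  # ReLU

def predict_gaze(joint, W2, b2):
    """Regress two unit 3-D gaze vectors (left eye, right eye) from the joint features."""
    out = W2 @ joint + b2                # 6 values: left and right gaze
    gl, gr = out[:3], out[3:]
    return gl / np.linalg.norm(gl), gr / np.linalg.norm(gr)

rng = np.random.default_rng(1)
feat_l, feat_r = rng.random(128), rng.random(128)  # hypothetical 128-d per-eye features
W1, b1 = rng.standard_normal((64, 256)) * 0.05, np.zeros(64)
W2, b2 = rng.standard_normal((6, 64)) * 0.05, np.zeros(6)

g_left, g_right = predict_gaze(joint_features(feat_l, feat_r, W1, b1), W2, b2)
```

In the patent's joint optimization, `W1`, `b1`, `W2`, `b2` (and the feature extractors) would all be updated together by gradient descent on the angular deviation.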
CN201710650058.4A 2017-08-02 2017-08-02 Eye direction calculation method for combination of left eye image and right eye image of human eye Active CN107545302B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710650058.4A CN107545302B (en) 2017-08-02 2017-08-02 Eye direction calculation method for combination of left eye image and right eye image of human eye

Publications (2)

Publication Number Publication Date
CN107545302A CN107545302A (en) 2018-01-05
CN107545302B true CN107545302B (en) 2020-07-07

Family

ID=60971290

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710650058.4A Active CN107545302B (en) 2017-08-02 2017-08-02 Eye direction calculation method for combination of left eye image and right eye image of human eye

Country Status (1)

Country Link
CN (1) CN107545302B (en)

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3749172B1 (en) 2018-02-09 2022-03-30 Pupil Labs GmbH Devices, systems and methods for predicting gaze-related parameters
US11194161B2 (en) 2018-02-09 2021-12-07 Pupil Labs Gmbh Devices, systems and methods for predicting gaze-related parameters
US11556741B2 (en) 2018-02-09 2023-01-17 Pupil Labs Gmbh Devices, systems and methods for predicting gaze-related parameters using a neural network
US10534982B2 (en) * 2018-03-30 2020-01-14 Tobii Ab Neural network training for three dimensional (3D) gaze prediction with calibration parameters
WO2020044180A2 (en) * 2018-08-31 2020-03-05 Eyeware Tech Sa Method and system for gaze estimation
CN109376777A (en) * 2018-10-18 2019-02-22 四川木牛流马智能科技有限公司 Cervical cancer tissues pathological image analysis method and equipment based on deep learning
CN109240510B (en) * 2018-10-30 2023-12-26 东北大学 Augmented reality man-machine interaction equipment based on sight tracking and control method
US11537202B2 (en) 2019-01-16 2022-12-27 Pupil Labs Gmbh Methods for generating calibration data for head-wearable devices and eye tracking system
CN109840019B (en) * 2019-02-22 2023-01-10 网易(杭州)网络有限公司 Virtual character control method, device and storage medium
US11676422B2 (en) 2019-06-05 2023-06-13 Pupil Labs Gmbh Devices, systems and methods for predicting gaze-related parameters
CN111325736B (en) * 2020-02-27 2024-02-27 成都航空职业技术学院 Eye differential image-based sight angle estimation method
CN112183200B (en) * 2020-08-25 2023-10-17 中电海康集团有限公司 Eye movement tracking method and system based on video image
CN112383765B (en) * 2020-11-10 2023-04-07 中移雄安信息通信科技有限公司 VR image transmission method and device
CN112711984B (en) * 2020-12-09 2022-04-12 北京航空航天大学 Fixation point positioning method and device and electronic equipment
CN115699112A (en) * 2021-05-28 2023-02-03 京东方科技集团股份有限公司 Sight tracking method, device and system
CN116052264B (en) * 2023-03-31 2023-07-04 广州视景医疗软件有限公司 Sight estimation method and device based on nonlinear deviation calibration
CN117742388B (en) * 2024-02-07 2024-04-30 广东海洋大学 Intelligent control method and system based on reading furniture

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105303170A (en) * 2015-10-16 2016-02-03 浙江工业大学 Human eye feature based sight line estimation method
CN106909220A (en) * 2017-02-21 2017-06-30 山东师范大学 A kind of sight line exchange method suitable for touch-control

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Appearance-Based Gaze Estimation in the Wild; Xucong Zhang et al.; 2015 IEEE Conference on Computer Vision and Pattern Recognition; 2015-12-31; pp. 4511-4520 *
Eye Tracking for Everyone; Kyle Krafka et al.; 2016 IEEE Conference on Computer Vision and Pattern Recognition; 2016-12-31; pp. 1-9 *
Real-time Eye Gaze Direction Classification Using Convolutional Neural Network; Anjith George et al.; 2016 International Conference on Signal Processing and Communications; 2016-12-31; pp. 1-5 *

Also Published As

Publication number Publication date
CN107545302A (en) 2018-01-05

Similar Documents

Publication Publication Date Title
CN107545302B (en) Eye direction calculation method for combination of left eye image and right eye image of human eye
CN104317391B (en) A kind of three-dimensional palm gesture recognition exchange method and system based on stereoscopic vision
CN107909061B (en) Head posture tracking device and method based on incomplete features
CN107004275B (en) Method and system for determining spatial coordinates of a 3D reconstruction of at least a part of a physical object
Joo et al. Panoptic studio: A massively multiview system for social motion capture
CN108388882B (en) Gesture recognition method based on global-local RGB-D multi-mode
CN110807364B (en) Modeling and capturing method and system for three-dimensional face and eyeball motion
CN109598242B (en) Living body detection method
CN111291885A (en) Near-infrared image generation method, network generation training method and device
CN110991281A (en) Dynamic face recognition method
CN109359514B (en) DeskVR-oriented gesture tracking and recognition combined strategy method
CN111046734B (en) Multi-modal fusion sight line estimation method based on expansion convolution
Sáez et al. Aerial obstacle detection with 3-D mobile devices
CN104821010A (en) Binocular-vision-based real-time extraction method and system for three-dimensional hand information
CN111209811B (en) Method and system for detecting eyeball attention position in real time
Gurbuz et al. Model free head pose estimation using stereovision
US10803604B1 (en) Layered motion representation and extraction in monocular still camera videos
WO2022068193A1 (en) Wearable device, intelligent guidance method and apparatus, guidance system and storage medium
US20140044342A1 (en) Method for generating 3d coordinates and mobile terminal for generating 3d coordinates
CN109559332A (en) A kind of sight tracing of the two-way LSTM and Itracker of combination
CN111259713A (en) Sight tracking method based on self-adaptive weighting
Perra et al. Adaptive eye-camera calibration for head-worn devices
CN114333046A (en) Dance action scoring method, device, equipment and storage medium
Amrutha et al. Human Body Pose Estimation and Applications
CN106778576B (en) Motion recognition method based on SEHM characteristic diagram sequence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant