CN114816060A - User fixation point estimation and precision evaluation method based on visual tracking - Google Patents

User fixation point estimation and precision evaluation method based on visual tracking

Info

Publication number
CN114816060A
Authority
CN
China
Prior art keywords
user
eye
eye movement
fixation point
offset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210432536.5A
Other languages
Chinese (zh)
Inventor
闫野
谢良
胡薇
印二威
张敬
张亚坤
罗治国
艾勇保
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Defense Technology Innovation Institute PLA Academy of Military Science
Original Assignee
National Defense Technology Innovation Institute PLA Academy of Military Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Defense Technology Innovation Institute PLA Academy of Military Science filed Critical National Defense Technology Innovation Institute PLA Academy of Military Science
Priority to CN202210432536.5A priority Critical patent/CN114816060A/en
Publication of CN114816060A publication Critical patent/CN114816060A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011 - Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013 - Eye tracking input arrangements
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G06F18/241 - Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/048 - Activation functions
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 - Eye characteristics, e.g. of the iris

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Computing Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Multimedia (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Human Computer Interaction (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Ophthalmology & Optometry (AREA)
  • Eye Examination Apparatus (AREA)

Abstract

The invention discloses a user fixation point estimation method based on visual tracking. A user wears a head-mounted eye movement interaction device; a fixation point extraction module obtains the user's fixation point coordinates; a residual estimation module calculates the residual between the user's eye movement offset and the fixation point coordinates; and the residual is sent to an offset adaptation module, which updates the fixation point coordinates to produce the final estimate of the user's fixation point. The invention also discloses a method for evaluating the precision of the fixation point estimation method: the user wears the head-mounted eye movement interaction device and gazes in turn at precision test points shown on the display interface, the user's fixation point coordinates are acquired, and the eye movement precision is calculated; the eye movement precision values of all sub-regions are then averaged to obtain the final eye movement precision evaluation value. The invention extracts an eye movement offset that adapts to individual differences, which is of great significance for improving the robustness and generalization capability of eye movement algorithms.

Description

User fixation point estimation and precision evaluation method based on visual tracking
Technical Field
The invention relates to the field of digital image processing, in particular to a user fixation point estimation and precision evaluation method based on visual tracking.
Background
Eyeball tracking, also called gaze tracking technology, is a technology for estimating gaze and fixation point coordinates by extracting eyeball motion-related parameters. With the continuous development of the eyeball tracking technology, the application scenes of the technology in the fields of human-computer interaction, behavior analysis and the like are also continuously enriched.
Because eyeball tracking collects and analyzes the physiological eye movement signals of the user through a head-mounted eye movement device, the technology is strongly affected by each user's individual eye physiology and by personalized habits of wearing the device. On the one hand, eye movement signals are physiological signals, and different individuals have different physiological eye structures, so an eye tracking algorithm must adapt to these individual differences to effectively guarantee eye movement precision. On the other hand, existing head-mounted eye movement devices, based on the optical recording method, illuminate the eyeball with an infrared light source and capture near-eye images with a high-speed eye camera. The image acquisition device is mounted in front of and close to the eye, at an angle to the horizontal plane of the eye, and different subjects wear the same eye tracking device with different habits; this directly affects the eye movement parameters that the device collects and can cause large fluctuations in eye movement precision. In addition, for users who wear glasses, the exit-pupil distance changes and the eyeball may be partially occluded, which also biases the inference results of the eye movement algorithm.
Disclosure of Invention
Aiming at the problem that existing eyeball tracking technology is strongly affected by differences in each user's eye physiology and by personalized habits of using the device, the invention discloses a user fixation point estimation and precision evaluation method based on visual tracking.
The invention discloses a user fixation point estimation method based on visual tracking, which specifically comprises the following steps:
the user wears the head-wearing eye movement interaction equipment, the fixation point coordinate of the user is obtained by the fixation point extraction module, the residual error between the eye movement offset of the user and the fixation point coordinate is calculated by the residual error estimation module, and then the obtained residual error is sent to the offset self-adaption module to update the fixation point coordinate, so that the final fixation point estimation value of the user is obtained.
The gaze point extraction module is implemented by a deep learning artificial neural network formed by stacking several dilated (expansion) convolutional layers after several deep convolutional layers. The binocular eye images of the user collected by the head-mounted eye movement interaction device are the input of the module, and the output of the module is the extracted gaze point coordinates of the user.
The method comprises the steps of obtaining the fixation point coordinates of a user by using a fixation point extraction module, firstly constructing a sample data set, then constructing a deep learning artificial neural network, training and testing the deep learning artificial neural network, using the trained deep learning artificial neural network as a fixation point extraction model, and obtaining the fixation point coordinates of the user by using the fixation point extraction model.
To construct the sample data set, several users wear the head-mounted eye movement interaction device and watch a continuously moving target anchor point on the device's display interface. The target anchor point traverses the display in a serpentine pattern, moving in turn to the pixel positions of every row and column, and changes among at least three different moving speeds during its motion. The head-mounted eye movement interaction device collects eye images while the user watches the moving target anchor point, and one round of sample data extraction is completed after one full serpentine traversal. During each round, as the target anchor point traverses the pixel positions of the display interface, the near-eye high-speed camera mounted on the head-mounted eye movement interaction device stores the user's binocular eye images together with the position coordinates of the target anchor point being watched at that moment; the binocular images and the corresponding anchor point coordinates serve as the samples and labels of the sample data set, completing its construction.
To construct the deep learning artificial neural network, deep convolutional layers are first used to extract features from the left-eye and right-eye images of the user's binocular pictures; each convolutional layer in this deep convolutional stack has a 3 × 3 kernel and a convolution stride of 2. Three dilated convolutional layers are stacked after the deep convolutional layers: the first has a 3 × 3 kernel with dilation rate (1, 2), the second a 3 × 3 kernel with dilation rate (2, 3), and the third a 3 × 3 kernel with dilation rate (4, 5); the convolution stride of all three dilated layers is 1. Dropout is applied to the final output of the dilated convolutional layers, which limits the parameter count of the network and preserves real-time inference. ReLU is used as the activation function, and the network parameters are normalized before activation.
To train and test the constructed deep learning artificial neural network, the sample data set is first standardized in size and pixel distribution: the resolution of the binocular images is reduced to a set value, all pixel values are divided by 256 so that they lie between 0 and 1 (normalizing the pixel values), and the pixel data are then standardized with a mean of 0.5 and a variance of 0.5. The standardized data are converted to tensors with the PyTorch framework and used as the network input. The network parameters are updated by stochastic gradient descent and optimized with the Adam optimizer; the sample data set is split into a training set and a test set at a 7:3 ratio using cross-validation; the L1 norm loss is used as the loss function, and Adam is used as the optimizer during training. The network is trained iteratively, and the set of network parameters with the best training result is taken as the final trained parameters, completing the training of the deep learning artificial neural network.
After the user gazes at the offset extraction identifier, the residual estimation module extracts the user's eye movement offset and calculates, with a first-order difference function, the residual between the eye movement offset and the fixation point coordinates.

The residual estimation module establishes a two-dimensional rectangular coordinate system in the display interface of the head-mounted eye movement interaction device and displays an offset extraction identifier at the center of the display interface; the position coordinates of the identifier are (x_0, y_0), and the identifier is a static picture or an animation. The user wearing the head-mounted eye movement interaction device gazes at the offset extraction identifier on the display interface, and the real-time fixation point coordinates (x_gi, y_gi) extracted by the gaze point extraction module at the i-th frame of the display interface are taken as the user's eye movement offset.
The frame rate of the display interface of the head-mounted eye movement interaction device is 30 fps and the offset extraction identifier is displayed for one second, so 30 gaze samples are collected. The residual [x_d, y_d] between the user's eye movement offset and the fixation point coordinates is calculated with a first-order difference function as:

x_d = (1/30) Σ_{i=0}^{29} (x_gi - x_0),  y_d = (1/30) Σ_{i=0}^{29} (y_gi - y_0)

where i is an integer from 0 to 29.
The user's eye movement offset comprises the user's usage-habit offset and the angle between the user's eye visual axis and eye optical axis. The usage-habit offset is a fixed value. For the angle between the visual axis and the optical axis, the visual axis is the line from the offset extraction identifier to the fovea of the macula, and the optical axis is the line from the pupil center to the center of the retina. Let P denote the coordinates of the pupil center, C the coordinates of the center of corneal curvature, V the direction vector of the eye's visual axis, U the direction vector from the pupil center to the offset extraction identifier, e the angle between the visual axis and the optical axis, and W the direction vector of the eye's optical axis, which is calculated as:

W = (P - C) / ||P - C||

The visual-axis direction vector V is obtained by applying the deviation correction (α, β) to the optical-axis direction vector W. The direction vector U is calculated from the position coordinates T of the offset extraction identifier as:

U = (T - P) / ||T - P||

The angle e between the eye's optical axis and visual axis is then e = arccos(U · V), completing the estimation of the angle between the eye's visual axis and optical axis.
The offset self-adapting module corrects the user fixation point coordinate acquired by the fixation point extracting module by using the residual error calculated by the residual error estimating module, and takes the corrected user fixation point coordinate as the final user fixation point estimated value.
The offset adaptation module is implemented by a deep learning artificial neural network comprising several deep convolutional layers, several dilated convolutional layers, an offset prediction branch and a fully connected layer, connected in sequence. The deep convolutional layers and dilated convolutional layers use the same structures as the corresponding layers in the gaze point extraction module. The loss function L_1new used in training this network is L_1new = L_1 + λ|b|, where L_1 is the L1 norm loss function used in the gaze point extraction module and λ|b| is a regularization term that adjusts the adaptive capability of the network, with λ the adjustment coefficient and b the residual extracted in the gaze point extraction module.
And calculating the residual error between the eye movement offset of the user and the fixation point coordinate through a residual error estimation module, and sending the obtained residual error into an offset self-adaption module to update the fixation point coordinate to obtain a final fixation point estimation value of the user.
The invention also discloses a method for evaluating the precision of the user fixation point estimation method, which specifically comprises the steps of rendering a plurality of precision test points on a display interface of the head-mounted eye movement interaction equipment in sequence according to preset positions, controlling the precision test points to be sequentially hidden after being sequentially displayed on the display interface according to a certain time sequence, and only displaying one precision test point at each moment;
the user wears the head-wearing eye movement interaction equipment, sequentially gazes at the precision test points displayed on the display interface, acquires the fixation point coordinates of the user by using the fixation point extraction module, and calculates and acquires the eye movement precision;
and calculating a plurality of eye movement precision values for each precision test point, and averaging the eye movement precision values to obtain the eye movement precision value of the precision test point. Dividing a display interface into a plurality of sub-regions, respectively setting a plurality of precision test points on each sub-region, respectively calculating and evaluating eye movement precision aiming at different sub-regions, averaging eye movement precision values obtained by all the precision test points in each sub-region to be used as the eye movement precision value of the sub-region, and averaging the eye movement precision values of all the sub-regions to obtain a final evaluation value of the eye movement precision.
The eye movement precision reflects the precision of the user's fixation point extraction and the concentration of the user's attention. It is obtained by calculating the angular deviation δ between a precision test point and the user fixation point coordinates obtained by the gaze point extraction module, where δ is the angle between the viewing rays toward the two points:

δ = arccos( (u · v) / (||u|| ||v||) ), with u = (x - W/2, y - H/2, Z) and v = (x_g - W/2, y_g - H/2, Z)

where (x, y) are the position coordinates of the precision test point, (x_g, y_g) are the user fixation point coordinates acquired by the gaze point extraction module, Z is the virtual screen depth of the display interface of the head-mounted eye movement interaction device, and W and H are the numbers of pixels in the horizontal and vertical directions of the display interface.
The invention has the beneficial effects that:
the invention is suitable for avoiding the complicated process of multi-point calibration and long waiting in the process of experiencing eye movement interaction by different users, and can also avoid a calibration page independently designed by developers for the purpose, thereby saving the storage and calculation resources of eye movement equipment. The eye movement offset extraction in the invention realizes the extraction of the eye movement offset of self-adaptive individual difference through the preset animation in the gaze interaction scene of the user, and the offset directly influences the precision of the gaze point deduced by the eye movement algorithm, thereby having very important significance for improving the robustness and generalization capability of the eye movement algorithm.
In the method, the offset extraction identifier is rendered in the VR eye movement device according to preset target position information; the user gazes at the identifier for a short time, from which the offset is extracted and the residual is calculated; the residual is then fed back to the algorithm for inference compensation. This effectively improves eye movement precision and makes the eye movement device easier and more efficient for different users to use.
Drawings
FIG. 1 is a flow chart of a method for estimating a user's gaze point based on visual tracking according to the present invention;
FIG. 2 is a schematic diagram of an offset extraction flag according to the present invention;
FIG. 3 is a schematic diagram of an angle between an optical axis of an eye and a visual axis according to the present invention;
FIG. 4 is a flowchart illustrating the accuracy evaluation of the user gaze point estimation method of the present invention;
FIG. 5 is a diagram illustrating an offset adaptation module according to the present invention.
Detailed Description
For a better understanding of the present disclosure, an example is given here.
Aiming at the problem that existing eyeball tracking technology is strongly affected by differences in each user's eye physiology and by personalized habits of using the device, the invention discloses a user fixation point estimation and precision evaluation method based on visual tracking.
The invention discloses a user fixation point estimation method based on visual tracking, which specifically comprises the following steps:
the user wears the head-wearing eye movement interaction equipment, the fixation point coordinate of the user is obtained by the fixation point extraction module, the residual error between the eye movement offset of the user and the fixation point coordinate is calculated by the residual error estimation module, and then the obtained residual error is sent to the offset self-adaption module to update the fixation point coordinate, so that the final fixation point estimation value of the user is obtained.
Because the extracted gaze point must be useful in real scenarios, the gaze point extraction module needs to guarantee real-time extraction while maintaining good gaze point precision, so the extraction is implemented with deep learning. The module is a deep learning artificial neural network formed by stacking several dilated (expansion) convolutional layers after several deep convolutional layers; the user's binocular eye images collected by the head-mounted eye movement interaction device are the input of the module, and the output is the extracted gaze point coordinates of the user.
The method comprises the steps of obtaining the fixation point coordinates of a user by using a fixation point extraction module, firstly constructing a sample data set, then constructing a deep learning artificial neural network, training and testing the deep learning artificial neural network, using the trained deep learning artificial neural network as a fixation point extraction model, and obtaining the fixation point coordinates of the user by using the fixation point extraction model.
To construct the sample data set, and considering the influence of differences in users' eye physiology and of personalized usage habits of the head-mounted eye movement interaction device on cross-subject model performance, several users wear the head-mounted eye movement interaction device and watch a continuously moving target anchor point on the device's display interface. The target anchor point traverses the display in a serpentine pattern, moving in turn to the pixel positions of every row and column, and changes among three or more different moving speeds during its motion. The head-mounted eye movement interaction device collects eye images while the user watches the moving target anchor point; one round of sample data extraction is completed after one full serpentine traversal, and each user contributes more than 10 rounds. During each round, as the target anchor point traverses the pixel positions of the display interface, the near-eye high-speed camera mounted on the device stores the user's binocular eye images together with the position coordinates of the target anchor point being watched at that moment. The resolution of the binocular images is 640 x 400, and the position coordinates are the x and y values in a two-dimensional rectangular coordinate system; the binocular images and the corresponding anchor point coordinates serve as the samples and labels of the sample data set, completing its construction. A sketch of the serpentine traversal is given below.
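For illustration, the serpentine traversal of the target anchor point can be sketched as follows. The display resolution, the row step and the three example speeds are assumptions made for this sketch; the description above only requires that every row and column position is visited and that at least three different speeds are used during one traversal.

```python
# Minimal sketch of a serpentine (boustrophedon) scan path for the target
# anchor point; resolution, row step and speed values are illustrative only.
def serpentine_path(width, height, row_step=1):
    """Yield (x, y) pixel positions row by row, reversing direction on each row."""
    for row, y in enumerate(range(0, height, row_step)):
        xs = range(width) if row % 2 == 0 else range(width - 1, -1, -1)
        for x in xs:
            yield (x, y)

# Pair each position with a movement speed so that the anchor point changes
# speed at least three times during one traversal.
speeds = [100, 200, 400]  # pixels per second (illustrative)
path = list(serpentine_path(1920, 1080))
schedule = [(x, y, speeds[(i * len(speeds)) // len(path)])
            for i, (x, y) in enumerate(path)]
```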
To construct the deep learning artificial neural network, deep convolutional layers are first used to extract features from the left-eye and right-eye images of the user's binocular pictures; each convolutional layer in this deep convolutional stack has a 3 × 3 kernel and a convolution stride of 2. Because dilated convolution has a larger receptive field and improves the efficiency of feature extraction, three dilated convolutional layers are stacked after the deep convolutional layers: the first has a 3 × 3 kernel with dilation rate (1, 2), the second a 3 × 3 kernel with dilation rate (2, 3), and the third a 3 × 3 kernel with dilation rate (4, 5); the convolution stride of all three dilated layers is 1. Dropout with a rate of 0.1 is applied to the final output of the dilated convolutional layers, which limits the parameter count of the network and preserves real-time inference. ReLU is used as the activation function, and the network parameters are normalized before activation.
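A minimal PyTorch sketch of such a network is given below. The kernel sizes, strides, dilation rates and the 0.1 dropout follow the description above; the number of strided layers, the channel widths, the use of batch normalization as the pre-activation normalization, and the fusion of the two eye branches by concatenation followed by a small regression head are assumptions made for the sketch.

```python
import torch
import torch.nn as nn


def conv_block(c_in, c_out, dilation=1, stride=2):
    """3 x 3 convolution -> normalization -> ReLU (parameters normalized before activation)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, stride=stride,
                  padding=dilation, dilation=dilation),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )


class GazePointNet(nn.Module):
    """Gaze-point regression from left and right near-eye images."""

    def __init__(self):
        super().__init__()
        # Strided 3 x 3 convolutions (stride 2) for per-eye feature extraction.
        self.eye_encoder = nn.Sequential(
            conv_block(1, 16), conv_block(16, 32),
            conv_block(32, 64), conv_block(64, 64),
        )
        # Three dilated 3 x 3 convolutions with stride 1 and dilation rates
        # (1, 2), (2, 3) and (4, 5), applied to the concatenated eye features.
        self.dilated = nn.Sequential(
            conv_block(128, 128, dilation=(1, 2), stride=1),
            conv_block(128, 128, dilation=(2, 3), stride=1),
            conv_block(128, 128, dilation=(4, 5), stride=1),
        )
        self.dropout = nn.Dropout(p=0.1)  # dropout on the final dilated output
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                  nn.Linear(128, 2))  # (x, y) gaze coordinates

    def forward(self, left_eye, right_eye):
        feats = torch.cat([self.eye_encoder(left_eye),
                           self.eye_encoder(right_eye)], dim=1)
        return self.head(self.dropout(self.dilated(feats)))


# Example: a batch of 8 grayscale eye-image pairs resized to 128 x 192.
net = GazePointNet()
gaze = net(torch.randn(8, 1, 128, 192), torch.randn(8, 1, 128, 192))
print(gaze.shape)  # torch.Size([8, 2])
```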
To train and test the constructed deep learning artificial neural network, and considering that an excessive parameter count would slow real-time processing while effective feature extraction by the convolutional network should be preserved, the sample data set is standardized in size and pixel distribution: the resolution of the binocular images is reduced to a set value (for example from 640 x 400 to 128 x 192), all pixel values are divided by 256 so that they lie between 0 and 1 (normalizing the pixel values), and the pixel data are then standardized with a mean of 0.5 and a variance of 0.5. The standardized data are converted to tensors with the PyTorch framework and used as the network input. The network parameters are updated by stochastic gradient descent and optimized with the Adam optimizer; the sample data set is split into a training set and a test set at a 7:3 ratio using cross-validation; the L1 norm loss is used as the loss function; the epoch value is set to 64 during training, and Adam is used as the optimizer. The initial learning rate is 1.0e-3 and is reduced to 1/10 every 35 epochs, and the model is trained for a total of 100 epochs. The network is trained iteratively, and the set of network parameters with the best training result is taken as the final trained parameters, completing the training of the deep learning artificial neural network.
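A condensed training sketch under the quoted settings (L1 loss, Adam, initial learning rate 1.0e-3 reduced tenfold every 35 epochs, 100 epochs, 7:3 split) could look as follows. The EyeGazeDataset wrapper, the batch size of 64 and the use of torchvision transforms for resizing and normalization are assumptions; the sketch keeps the parameter set with the lowest test error, as described above.

```python
import copy
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, random_split
from torchvision import transforms

preprocess = transforms.Compose([
    transforms.Resize((128, 192)),                # shrink the 640 x 400 eye images
    transforms.ToTensor(),                        # pixel values scaled into [0, 1]
    transforms.Normalize(mean=[0.5], std=[0.5]),  # standardized distribution
])

dataset = EyeGazeDataset(transform=preprocess)    # hypothetical dataset wrapper
n_train = int(0.7 * len(dataset))                 # 7:3 train/test split
train_set, test_set = random_split(dataset, [n_train, len(dataset) - n_train])
train_loader = DataLoader(train_set, batch_size=64, shuffle=True)
test_loader = DataLoader(test_set, batch_size=64)

model = GazePointNet()                            # network sketched above
criterion = nn.L1Loss()                           # L1 norm loss function
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=35, gamma=0.1)

best_err, best_state = float("inf"), None
for epoch in range(100):
    model.train()
    for left, right, target in train_loader:
        optimizer.zero_grad()
        loss = criterion(model(left, right), target)
        loss.backward()
        optimizer.step()
    scheduler.step()

    model.eval()                                  # keep the best parameter set
    with torch.no_grad():
        err = sum(criterion(model(l, r), t).item() for l, r, t in test_loader)
    if err < best_err:
        best_err, best_state = err, copy.deepcopy(model.state_dict())
```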
Because of differences in eye physiology between users and in their habits of wearing the head-mounted eye movement interaction device, the gaze point coordinates extracted by the gaze point extraction module carry a fairly consistent individual offset, and this offset can strongly affect the precision of the gaze point coordinates. After the user gazes at the offset extraction identifier, the residual estimation module extracts the user's eye movement offset and calculates, with a first-order difference function, the residual between the eye movement offset and the fixation point coordinates.

The residual estimation module establishes a two-dimensional rectangular coordinate system in the display interface of the head-mounted eye movement interaction device and displays an offset extraction identifier at the center of the display interface; the position coordinates of the identifier are (x_0, y_0), and the identifier is a static picture or an animation. The user wearing the head-mounted eye movement interaction device gazes at the offset extraction identifier on the display interface, and the real-time fixation point coordinates (x_gi, y_gi) extracted by the gaze point extraction module at the i-th frame of the display interface are taken as the user's eye movement offset.
The frame rate of the display interface of the head-mounted eye movement interaction device is 30 fps and the offset extraction identifier is displayed for one second, so 30 gaze samples are collected. The residual [x_d, y_d] between the user's eye movement offset and the fixation point coordinates is calculated with a first-order difference function as:

x_d = (1/30) Σ_{i=0}^{29} (x_gi - x_0),  y_d = (1/30) Σ_{i=0}^{29} (y_gi - y_0)

where i is an integer from 0 to 29.
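As a sketch, and assuming the first-order difference reduces to averaging, over the 30 frames collected in one second, the offset between each extracted gaze sample (x_gi, y_gi) and the marker position (x_0, y_0), the residual can be computed as follows.

```python
import numpy as np

def estimate_residual(gaze_samples, marker_xy):
    """gaze_samples: (30, 2) array of gaze points extracted while the user
    fixates the offset-extraction identifier; marker_xy: (x_0, y_0)."""
    gaze = np.asarray(gaze_samples, dtype=float)
    x0, y0 = marker_xy
    x_d = float(np.mean(gaze[:, 0] - x0))   # horizontal residual
    y_d = float(np.mean(gaze[:, 1] - y0))   # vertical residual
    return x_d, y_d
```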
The user's eye movement offset comprises the user's usage-habit offset and the angle between the user's eye visual axis and eye optical axis. The usage-habit offset is a fixed value. For the angle between the visual axis and the optical axis, the visual axis is the line from the offset extraction identifier to the fovea of the macula, and the optical axis is the line from the pupil center to the center of the retina. Let P denote the coordinates of the pupil center, C the coordinates of the center of corneal curvature, V the direction vector of the eye's visual axis, U the direction vector from the pupil center to the offset extraction identifier, e the angle between the visual axis and the optical axis, and W the direction vector of the eye's optical axis, which is calculated as:

W = (P - C) / ||P - C||

The visual-axis direction vector V is obtained by applying the deviation correction (α, β) to the optical-axis direction vector W. The direction vector U is calculated from the position coordinates T of the offset extraction identifier as:

U = (T - P) / ||T - P||

The angle e between the eye's optical axis and visual axis is then e = arccos(U · V), completing the estimation of the angle between the eye's visual axis and optical axis.
The offset adaptive module corrects the user gaze point coordinates obtained by the gaze point extraction module using the residual calculated by the residual estimation module as shown in fig. 5, and uses the corrected user gaze point coordinates as the final user gaze point estimation value.
The offset adaptation module is implemented by a deep learning artificial neural network comprising several deep convolutional layers, several dilated convolutional layers, an offset prediction branch and a fully connected layer, connected in sequence. The deep convolutional layers and dilated convolutional layers use the same structures as the corresponding layers in the gaze point extraction module. The loss function L_1new used in training this network is L_1new = L_1 + λ|b|, where L_1 is the L1 norm loss function used in the gaze point extraction module and λ|b| is a regularization term that adjusts the adaptive capability of the network, with λ the adjustment coefficient and b the residual extracted in the gaze point extraction module. The larger λ is, the greater the influence of the introduced residual on the model; the smaller λ is, the smaller that influence; when λ is set to 0 the model degenerates into the backbone model used in the gaze point extraction module. Experiments show that setting λ to 0.01 in the offset adaptation module yields output with higher gaze point precision.
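A minimal sketch of this regularized loss is shown below; passing the residual in as a tensor and interpreting |b| as the summed absolute value of the residual vector are assumptions made for the sketch.

```python
import torch
import torch.nn as nn

class OffsetAdaptiveLoss(nn.Module):
    """L_1new = L_1 + lambda * |b|, with lambda = 0.01 by default."""

    def __init__(self, lam=0.01):
        super().__init__()
        self.lam = lam
        self.l1 = nn.L1Loss()

    def forward(self, predicted_gaze, target_gaze, residual):
        return self.l1(predicted_gaze, target_gaze) + self.lam * residual.abs().sum()

# Example: residual [x_d, y_d] produced by the residual estimation module.
loss_fn = OffsetAdaptiveLoss()
pred = torch.randn(8, 2, requires_grad=True)
target = torch.randn(8, 2)
b = torch.tensor([1.7, -0.9])
loss = loss_fn(pred, target, b)
loss.backward()
```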
And calculating the residual error between the eye movement offset of the user and the fixation point coordinate through a residual error estimation module, and sending the obtained residual error into an offset self-adaption module to update the fixation point coordinate to obtain a final fixation point estimation value of the user.
The invention also discloses a method for evaluating the precision of the user fixation point estimation method, which specifically comprises the steps of rendering a plurality of precision test points on a display interface of the head-mounted eye movement interaction equipment in sequence according to preset positions, controlling the precision test points to be sequentially hidden after being sequentially displayed on the display interface according to a certain time sequence, and only displaying one precision test point at each moment;
the user wears the head-wearing eye movement interaction equipment, sequentially gazes at the precision test points displayed on the display interface, acquires the fixation point coordinates of the user by using the fixation point extraction module, and calculates and acquires the eye movement precision;
and calculating a plurality of eye movement precision values for each precision test point, and averaging the eye movement precision values to obtain the eye movement precision values of the precision test points. Dividing a display interface into a plurality of sub-regions, respectively setting a plurality of precision test points on each sub-region, respectively calculating and evaluating eye movement precision aiming at different sub-regions, averaging eye movement precision values obtained by all the precision test points in each sub-region to be used as the eye movement precision value of the sub-region, and averaging the eye movement precision values of all the sub-regions to obtain a final evaluation value of the eye movement precision.
The eye movement precision reflects the precision of the user's fixation point extraction and the concentration of the user's attention. It is obtained by calculating the angular deviation δ between a precision test point and the user fixation point coordinates obtained by the gaze point extraction module; the smaller the angular deviation, the higher the precision. The angular deviation δ is the angle between the viewing rays toward the two points:

δ = arccos( (u · v) / (||u|| ||v||) ), with u = (x - W/2, y - H/2, Z) and v = (x_g - W/2, y_g - H/2, Z)

where (x, y) are the position coordinates of the precision test point, (x_g, y_g) are the user fixation point coordinates acquired by the gaze point extraction module, Z is the virtual screen depth of the display interface of the head-mounted eye movement interaction device, and W and H are the numbers of pixels in the horizontal and vertical directions of the display interface.
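As a sketch, and assuming the angular deviation δ is computed as the angle between the viewing rays toward the test point and toward the estimated gaze point (built from screen-centred pixel coordinates and the virtual screen depth Z), the per-region and overall accuracy can be evaluated as follows.

```python
import numpy as np

def angular_deviation(test_xy, gaze_xy, Z, W, H):
    """Angle (degrees) between the rays from the eye to the precision test point
    and to the estimated gaze point on a W x H pixel virtual screen at depth Z."""
    u = np.array([test_xy[0] - W / 2, test_xy[1] - H / 2, Z], dtype=float)
    v = np.array([gaze_xy[0] - W / 2, gaze_xy[1] - H / 2, Z], dtype=float)
    cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.degrees(np.arccos(np.clip(cos, -1.0, 1.0))))

def region_accuracy(samples, Z, W, H):
    """samples: {region_id: [(test_xy, gaze_xy), ...]}. Returns the mean angular
    deviation per sub-region and the overall value (mean of the region means)."""
    per_region = {rid: float(np.mean([angular_deviation(t, g, Z, W, H) for t, g in pts]))
                  for rid, pts in samples.items()}
    return per_region, float(np.mean(list(per_region.values())))
```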
The above description is only an example of the present application and is not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (10)

1. A user fixation point estimation method based on visual tracking is characterized by specifically comprising the following steps:
the user wears the head-wearing eye movement interaction equipment, the fixation point coordinate of the user is obtained by the fixation point extraction module, the residual error between the eye movement offset of the user and the fixation point coordinate is calculated by the residual error estimation module, and then the obtained residual error is sent to the offset self-adaption module to update the fixation point coordinate, so that the final fixation point estimation value of the user is obtained.
2. The visual tracking-based user gaze point estimation method of claim 1, comprising in particular:
the gaze point extraction module is realized by a deep learning artificial neural network formed by superposing a plurality of deep convolutional neural layers on a plurality of expansion convolutional layers, the binocular pictures of the user, which are acquired by the head-mounted eye movement interaction equipment, are used as the input of the module, and the output of the module is the extracted gaze point coordinate values of the user;
the method comprises the steps of obtaining the fixation point coordinates of a user by using a fixation point extraction module, firstly constructing a sample data set, then constructing a deep learning artificial neural network, training and testing the deep learning artificial neural network, using the trained deep learning artificial neural network as a fixation point extraction model, and obtaining the fixation point coordinates of the user by using the fixation point extraction model.
3. The visual tracking-based user gaze point estimation method of claim 2, comprising in particular:
the method comprises the following steps that a sample data set is constructed, a plurality of users need to wear head-mounted eye movement interaction equipment, the users watch target anchor points which continuously move in a display interface of the equipment, the target anchor points sequentially move to the positions of pixel points of each row and each column of the display interface in a snake-shaped traversal mode, the target anchor points change more than three different moving speeds in the moving process, the head-mounted eye movement interaction equipment collects eye images of the target anchor points which continuously move and watched by the users, and one round of sample data extraction is completed after one-time snake-shaped traversal of the target anchor points is completed; in each round of sample data extraction process, when a near-eye high-speed camera carried on the head-mounted eye movement interaction equipment traverses each pixel point position on a display interface in a snake shape, the binocular image of the user and the position coordinate value of the target anchor point watched by the binocular image of the user at the moment are stored, and the binocular image of the user and the position coordinate value of the target anchor point watched by the binocular image of the user are used as the sample and the label of the sample data set, so that the construction of the sample data set is completed.
4. The visual tracking-based user gaze point estimation method of claim 2, comprising in particular:
firstly, extracting the characteristics of left and right eye diagrams of a user binocular picture by adopting a deep convolution neural layer, wherein the convolution kernel size of each convolution layer in the deep convolution neural layer is 3 multiplied by 3, and the convolution step length is 2; superposing three layers of expanded convolutional layers behind a deep convolutional neural layer, wherein the convolutional kernel size of the first layer of expanded convolutional layer is 3 multiplied by 3, the expansion rate is (1, 2), the convolutional kernel size of the second layer of expanded convolutional layer is 3 multiplied by 3, the expansion rate is (2, 3), the convolutional kernel size of the third layer of expanded convolutional layer is 3 multiplied by 3, the expansion rate is (4, 5), and the convolution step length of each three layer of expanded convolutional layer is 1; and performing inactivation treatment on the final output of the expansion convolutional layer, so that the parameter quantity of the deep learning artificial neural network is controlled, the real-time performance of the deep learning artificial neural network inference is ensured, and the parameters of the deep learning artificial neural network are normalized before the activation treatment by using the ReLU as an activation function.
5. The visual tracking-based user gaze point estimation method of claim 2, comprising in particular:
the built deep learning artificial neural network is trained and tested, the standard processing of size and pixel distribution is carried out on the sample data set, the resolution of the user binocular picture of the sample data set is reduced to a set value, all pixel values of the user binocular picture of the sample data set are divided by 256, the pixel values are distributed between 0 and 1, therefore the normalization of the pixel values is realized, and then the standard distribution processing is carried out on all pixel value data of the user binocular picture of the sample data set by taking 0.5 as a mean value and 0.5 as a variance; converting data after standardized distribution processing into tensor data by using a PyTorch frame, using the tensor data as input of the deep learning artificial neural network, updating parameters of the network by using a random gradient descent algorithm, optimizing the parameters of the network by using an Adam function, dividing a sample data set into a training set and a testing set by using a cross-validation method according to a data volume ratio of 7:3, using an L1 norm loss function as a loss function of the network, and using the Adam function as an optimizer when the network is trained; the set deep learning artificial neural network is subjected to iterative training, a group of network parameters with the best training result are taken as final parameters obtained by training the deep learning artificial neural network, and therefore training of the deep learning artificial neural network is completed.
6. The visual tracking-based user gaze point estimation method of claim 1, comprising in particular:
after the user gazes the offset extraction identifier, the residual error estimation module extracts the eye movement offset of the user and calculates the residual error between the eye movement offset of the user and the gazing point coordinate by using a first-order difference function.
7. The visual tracking-based user gaze point estimation method of claim 6, comprising in particular:
the residual error estimation module establishes two-dimensional plane straightness in a display interface of the head-mounted eye movement interaction equipmentAn angular coordinate system for displaying the offset extraction mark at the central position of the display interface, and the position coordinate is (x) 0 ,y 0 ) The offset extraction identifier is a static picture or animation; the user wearing the head-wearing eye movement interaction equipment gazes at the offset extraction identifier in the display interface of the user, and the real-time fixation point coordinate of the user extracted by the fixation point extraction module in the display interface appearing at the ith time is (x) gi ,y gi ) As the eye movement offset of the user;
the frame rate of a display interface of the head-mounted eye movement interaction equipment is 30fps, the time for displaying the offset extraction identification is set to be one second, and a first-order difference function is used for calculating the residual error between the eye movement offset of the user and the fixation point coordinate to be [ x ] d ,y d ]The calculation formula is as follows:
Figure FDA0003611491010000031
wherein i is an integer from 0 to 29;
the eye movement offset of the user comprises a usage-habit offset of the user and the included angle between the visual axis and the optical axis of the user's eye, the usage-habit offset being a fixed value; the estimation of the included angle comprises: the visual axis of the eye is the line from the offset extraction identifier to the fovea of the macula, and the optical axis of the eye is the line from the pupil center to the center of the retina; P denotes the coordinates of the pupil center, C the coordinates of the center of corneal curvature, V the direction vector of the visual axis, U the direction vector from the pupil center to the offset extraction identifier, e the included angle between the visual axis and the optical axis, and W the direction vector of the optical axis, which is calculated as:

W = (P - C) / ||P - C||

the visual-axis direction vector V is obtained by applying the deviation correction (α, β) to the optical-axis direction vector W; the direction vector U is calculated from the position coordinates T of the offset extraction identifier as:

U = (T - P) / ||T - P||

and the included angle e between the optical axis and the visual axis of the eye is calculated as e = arccos(U · V), completing the estimation of the included angle between the visual axis and the optical axis of the eye.
8. The visual tracking-based user gaze point estimation method of claim 1, comprising in particular:
the offset self-adapting module corrects the user fixation point coordinate acquired by the fixation point extracting module by using the residual error calculated by the residual error estimating module, and takes the corrected user fixation point coordinate as a final user fixation point estimated value;
the offset self-adaption module is realized through a deep learning artificial neural network, the deep learning artificial neural network comprises a plurality of deep convolution neural layers, a plurality of expansion convolution layers, an offset prediction branch and a full-connection layer, the four parts are connected in sequence, and a loss function L used in the training process of the deep learning artificial neural network 1new Is expressed as L 1new =L 1 + λ | b |, where L 1 For the L1 norm loss function used in the gaze point extraction module, λ | b | is a regularization term for adjusting the adaptive capability of the network, where λ is the adjustment coefficient and b is the residual extracted in the gaze point extraction module.
9. A method for performing precision evaluation on the user fixation point estimation method of any one of claims 1 to 8 is characterized in that a plurality of precision test points are rendered in sequence according to a preset position on a display interface of a head-mounted eye movement interaction device, the precision test points are controlled to be sequentially displayed on the display interface and then sequentially hidden according to a certain time sequence, and only one precision test point is displayed at each moment;
the user wears the head-wearing eye movement interaction equipment, sequentially gazes at the precision test points displayed on the display interface, acquires the fixation point coordinates of the user by using the fixation point extraction module, and calculates and acquires the eye movement precision;
calculating a plurality of eye movement precision values for each precision test point and averaging to obtain the eye movement precision values of the precision test points; dividing a display interface into a plurality of sub-regions, respectively setting a plurality of precision test points on each sub-region, respectively calculating and evaluating eye movement precision aiming at different sub-regions, averaging eye movement precision values obtained by all the precision test points in each sub-region to be used as the eye movement precision value of the sub-region, and averaging the eye movement precision values of all the sub-regions to obtain a final evaluation value of the eye movement precision.
10. The method for accuracy evaluation of a user gaze point estimation method according to claim 9, wherein the eye movement accuracy is used to reflect the accuracy of user gaze point extraction and the concentration of the user's attention, the eye movement accuracy is obtained by calculating an angle deviation δ between the accuracy test point and the user gaze point coordinates obtained by the gaze point extraction module, and the calculation formula of the angle deviation δ is:
δ = arccos( (u · v) / (||u|| ||v||) ), with u = (x - W/2, y - H/2, Z) and v = (x_g - W/2, y_g - H/2, Z),

wherein (x, y) represents the position coordinates of the precision test point, (x_g, y_g) represents the user fixation point coordinates acquired by the fixation point extraction module, Z represents the virtual screen depth of the display interface of the head-mounted eye movement interaction equipment, and W and H respectively represent the number of pixels in the horizontal and vertical directions of the display interface.
CN202210432536.5A 2022-04-23 2022-04-23 User fixation point estimation and precision evaluation method based on visual tracking Pending CN114816060A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210432536.5A CN114816060A (en) 2022-04-23 2022-04-23 User fixation point estimation and precision evaluation method based on visual tracking

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210432536.5A CN114816060A (en) 2022-04-23 2022-04-23 User fixation point estimation and precision evaluation method based on visual tracking

Publications (1)

Publication Number Publication Date
CN114816060A true CN114816060A (en) 2022-07-29

Family

ID=82508055

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210432536.5A Pending CN114816060A (en) 2022-04-23 2022-04-23 User fixation point estimation and precision evaluation method based on visual tracking

Country Status (1)

Country Link
CN (1) CN114816060A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117133043A (en) * 2023-03-31 2023-11-28 荣耀终端有限公司 Gaze point estimation method, electronic device, and computer-readable storage medium
CN116597288A (en) * 2023-07-18 2023-08-15 江西格如灵科技股份有限公司 Gaze point rendering method, gaze point rendering system, computer and readable storage medium
CN116597288B (en) * 2023-07-18 2023-09-12 江西格如灵科技股份有限公司 Gaze point rendering method, gaze point rendering system, computer and readable storage medium

Similar Documents

Publication Publication Date Title
JP6960494B2 (en) Collection, selection and combination of eye images
CN107929007B (en) Attention and visual ability training system and method using eye tracking and intelligent evaluation technology
Chen et al. Probabilistic gaze estimation without active personal calibration
Chen et al. A probabilistic approach to online eye gaze tracking without explicit personal calibration
CN114816060A (en) User fixation point estimation and precision evaluation method based on visual tracking
CN109712710B (en) Intelligent infant development disorder assessment method based on three-dimensional eye movement characteristics
JP2022527818A (en) Methods and systems for estimating geometric variables related to the user's eye
Nair et al. RIT-Eyes: Rendering of near-eye images for eye-tracking applications
John et al. An evaluation of pupillary light response models for 2D screens and VR HMDs
US20220175240A1 (en) A device and method for evaluating a performance of a visual equipment for a visual task
CN110472546B (en) Infant non-contact eye movement feature extraction device and method
Pizer et al. Fundamental properties of medical image perception
US9760772B2 (en) Eye image stimuli for eyegaze calibration procedures
CN110269586A (en) For capturing the device and method in the visual field of the people with dim spot
CN116503475A (en) VRAR binocular 3D target positioning method based on deep learning
CN112183160A (en) Sight estimation method and device
Chaudhary et al. : From real infrared eye-images to synthetic sequences of gaze behavior
Allen et al. Proximity and precision in spatial memory
Abbasov Features of the Perception and Recognition of Images in Art
Skowronek et al. Eye Tracking Using a Smartphone Camera and Deep Learning
Chugh An Eye Tracking System for a Virtual Reality Headset
US20240119594A1 (en) Determining Digital Markers Indicative of a Neurological Condition Using Eye Movement Parameters
Wang et al. Eye tracking method based on mobile big data in computer environment
Stengel Gaze-contingent Computer Graphics
Chaudhary Deep into the Eyes: Applying Machine Learning to Improve Eye-Tracking

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination