CN115862095A - Adaptive sight line estimation method, system, electronic equipment and storage medium

Adaptive sight line estimation method, system, electronic equipment and storage medium

Info

Publication number
CN115862095A
Authority
CN
China
Prior art keywords
sight line
network
module
adaptive
target person
Prior art date
Legal status
Pending
Application number
CN202211471537.7A
Other languages
Chinese (zh)
Inventor
赵梓辰
柯维涛
陆小锋
Current Assignee
University of Shanghai for Science and Technology
Original Assignee
University of Shanghai for Science and Technology
Priority date
Filing date
Publication date
Application filed by University of Shanghai for Science and Technology
Priority to CN202211471537.7A
Publication of CN115862095A
Legal status: Pending

Landscapes

  • Image Analysis (AREA)

Abstract

The invention discloses an adaptive sight line estimation method, system, electronic equipment and storage medium, and relates to the technical field of sight line estimation. The method comprises: processing a face image of a target person to obtain eye and face position information; and inputting the face image and the eye and face position information of the target person into an adaptive sight line estimation model to obtain a gaze estimation result for the target person. The adaptive sight line estimation model is obtained by training an adaptive sight line estimation network based on multi-scale feature fusion with first sample input data and the corresponding sample actual measurement results, and by updating the translation parameters and scaling parameters in that network with an eye guidance network. The invention fully utilizes the inherent feature relationship between the face and the eyes to achieve adaptive sight line estimation.

Description

Adaptive sight line estimation method, system, electronic equipment and storage medium
Technical Field
The invention relates to the technical field of sight line estimation, in particular to a self-adaptive sight line estimation method and system based on multi-scale feature fusion, electronic equipment and a storage medium.
Background
Sight line estimation is widely applied in fields such as human-computer interaction, psychology and disease diagnosis. Gaze, as a primary means by which humans obtain external information, can reveal cognitive processing and its deficiencies. Researchers study various psychological disorders, such as depression and autism, through gaze estimation.
Over the past several decades, many sight line estimation methods have been proposed. Methods based on a 3D human eye model rely on special equipment such as infrared cameras, depth cameras and high-resolution cameras, and many wearable eye-tracking devices have been developed along these lines. Appearance-based sight line estimation methods, by contrast, need only a single webcam to capture images and directly learn the mapping from images to gaze direction. Because their hardware requirements are low and mobile devices with webcams, such as smartphones and tablets, are increasingly widespread, appearance-based sight line estimation methods have better application prospects.
Appearance-based sight line estimation infers the gaze from a face image or eye images. Convolutional neural networks (CNNs) are applied throughout computer vision because of their powerful feature extraction capability. Some appearance-based gaze estimation methods use a CNN to estimate gaze from a single eye image or from both eye images, some estimate gaze from a face image, and some use the face image and the eye images together. However, these methods fuse the information from the face image and the eye images with simple techniques, for example by concatenation or through fully connected layers. Since gaze estimation is an inherently challenging task, such simple feature stitching does not support interactive modeling of the face and eye images and ignores the intrinsic relationship between the face and the eyes.
Disclosure of Invention
The invention aims to provide an adaptive sight line estimation method, system, electronic equipment and storage medium that fully utilize the inherent feature relationship between the face and the eyes to achieve adaptive sight line estimation.
In order to achieve the purpose, the invention provides the following scheme:
in a first aspect, the invention provides a self-adaptive sight line estimation method based on multi-scale feature fusion, which includes:
acquiring a face image of a target person;
processing the face image of the target person to obtain the eye and face position information of the target person; the eye and face position information comprises face bounding box information, left eye bounding box information and right eye bounding box information;
inputting the target person information into the adaptive sight line estimation model to obtain a sight line estimation result of the target person; the target person information comprises the face image and the eye and face position information of the target person;
the adaptive sight line estimation model is obtained by training a multi-scale feature fusion-based adaptive sight line estimation network by utilizing first sample input data and sample actual measurement results corresponding to the first sample input data, and updating translation parameters and scaling parameters in the multi-scale feature fusion-based adaptive sight line estimation network by utilizing an eye guidance network;
the eye guidance network is used for processing second sample input data corresponding to the first sample input data by adopting a deep learning algorithm to obtain a translation parameter and a scaling parameter;
the first sample input data comprise the face image and the eye and face position information required for model training; the second sample input data comprise the left eye image, the right eye image and the eye and face position information required for model training; and the sample actual measurement result is the measured sight line required for model training.
Optionally, the training process of the adaptive gaze estimation model is:
constructing a sample data set; the sample data set comprises a plurality of sample data; the sample data comprises first sample input data, corresponding second sample input data and a sample actual measurement result;
inputting the first sample input data into a self-adaptive sight estimation network based on multi-scale feature fusion to obtain a sample prediction result;
calculating a network loss value by using the sample prediction result and the sample actual measurement result;
updating the network parameters of the multi-scale feature fusion-based adaptive sight line estimation network by using the network loss value, updating the translation parameters and the scaling parameters in the updated network parameters by using the eye guidance network to obtain the updated multi-scale feature fusion-based adaptive sight line estimation network, performing iterative loop optimization until the iteration times reach the maximum iteration times or the network loss value is smaller than a set threshold value, and determining the last updated multi-scale feature fusion-based adaptive sight line estimation network as an adaptive sight line estimation model.
Optionally, the adaptive sight line estimation network based on multi-scale feature fusion includes a convolutional layer, a global average pooling layer without translation parameters and scaling parameters, a first translation scaling layer, a channel dimension splicing layer, a first multi-scale attention module, a second translation scaling layer, a second multi-scale attention module, and a full connection layer, which are connected in sequence; wherein the translation parameters and scaling parameters in the first translation scaling layer are determined by the eye guidance network; the translation parameters and scaling parameters in the second translation scaling layer are determined by the eye guidance network; the full connection layer comprises a first full connection block, a second full connection block and a third full connection block; the first full connection block is used for inputting the eye and face position information, and the second full connection block is used for inputting the features output by the second multi-scale attention module; the input end of the third full connection block is connected with the output end of the first full connection block and the output end of the second full connection block respectively; and the convolutional layer is used for inputting the face image.
Optionally, the eye guidance network includes a first branch network, a second branch network, a third branch network, and a full connection layer module connected to the output of the first branch network, the output of the second branch network, and the output of the third branch network; the first branch network comprises a first convolution block, a channel dimension splicing layer and a full connection layer which are connected in sequence; the first convolution block is used for inputting the right eye image; the second branch network comprises a second convolution block, a cross-view pooling layer and a full connection layer which are connected in sequence; the second convolution block is used for inputting the left eye image; the third branch network is used for inputting the eye and face position information; and the full connection layer module is used for outputting a translation parameter and a scaling parameter.
Optionally, the structure of the first multi-scale attention module and the structure of the second multi-scale attention module are the same;
the first multi-scale attention module comprises an SPC module, an SE module, a spatial attention map acquisition module and a summary module;
the input end of the SPC module is used for inputting the characteristics output by the channel dimension splicing layer, the output end of the SPC module is connected with the input end of the SE module, the first output end of the SE module is connected with the first input end of the summarizing module, the second output end of the SE module is connected with the input end of the space attention map acquiring module, the output end of the space attention map acquiring module is connected with the second input end of the summarizing module, and the third input end of the summarizing module is used for inputting the characteristics output by the channel dimension splicing layer; the summarizing module is used for outputting a characteristic diagram with multi-scale information under different receptive fields.
Optionally, the spatial attention map obtaining module is configured to:
performing convolution, global average pooling and dimension transformation on the feature map output by the SE module to obtain a first feature sub-map and a second feature sub-map;
performing a normalization operation on the first feature sub-map and then multiplying it with the second feature sub-map by two-dimensional tensor multiplication to obtain a two-dimensional feature;
and performing dimension transformation and an activation operation on the two-dimensional feature to obtain a spatial attention map.
Optionally, the processing the face image of the target person to obtain the eye and face position information of the target person specifically includes:
processing the face image of the target person to obtain a left eye image and a right eye image of the target person;
and obtaining the eye and face position information of the target person according to the left eye image, the right eye image and the face image of the target person.
In a second aspect, the present invention provides an adaptive gaze estimation system based on multi-scale feature fusion, including:
the face image acquisition module is used for acquiring a face image of the target person;
the eye and face position information calculation module is used for processing the face image of the target person to obtain the eye and face position information of the target person; the eye and face position information comprises face bounding box information, left eye bounding box information and right eye bounding box information;
the gaze estimation result prediction module is used for inputting the target person information into the adaptive gaze estimation model to obtain a gaze estimation result of the target person; the target person information comprises the face image and the eye and face position information of the target person;
the adaptive sight line estimation model is obtained by training a multi-scale feature fusion-based adaptive sight line estimation network by utilizing first sample input data and sample actual measurement results corresponding to the first sample input data, and updating translation parameters and scaling parameters in the multi-scale feature fusion-based adaptive sight line estimation network by utilizing an eye guidance network;
the eye guidance network is used for processing second sample input data corresponding to the first sample input data by adopting a deep learning algorithm to obtain a translation parameter and a scaling parameter;
the first sample input data comprise the face image and the eye and face position information required for model training; the second sample input data comprise the left eye image, the right eye image and the eye and face position information required for model training; and the sample actual measurement result is the measured sight line required for model training.
In a third aspect, the present invention provides an electronic device, comprising a memory for storing a computer program and a processor for executing the computer program to make the electronic device execute the adaptive gaze estimation method according to the first aspect.
In a fourth aspect, the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the adaptive gaze estimation method of the first aspect.
According to the specific embodiment provided by the invention, the invention discloses the following technical effects:
the invention outputs the sight estimation result through the self-adaptive sight estimation network based on the multi-scale feature fusion, and better excavates the global features of the face image. The eye guidance network fuses the characteristics of the binocular images, extracts characteristic parameters more concentrated on the fixation point, and dynamically guides the characteristic extraction of the face images, thereby realizing self-adaptive sight estimation.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art to obtain other drawings without inventive exercise.
FIG. 1 is a schematic flow chart of an adaptive gaze estimation method based on multi-scale feature fusion according to the present invention;
FIG. 2 is a block diagram of an adaptive gaze estimation network based on multi-scale feature fusion in accordance with the present invention;
FIG. 3 is a block diagram of an eye guidance network of the present invention;
FIG. 4 is a block diagram of a multi-scale attention module of the present invention;
FIG. 5 is a block diagram of an SPC module according to the present invention;
fig. 6 is a schematic structural diagram of an adaptive gaze estimation system based on multi-scale feature fusion according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Compared with the eye images, the face image contains richer global information, while the eye images focus on the gaze point. To fully utilize the inherent feature relationship between the face and the eyes, the invention provides an adaptive sight line estimation method, system, electronic equipment and storage medium based on multi-scale feature fusion.
In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, the present invention is described in detail with reference to the accompanying drawings and the detailed description thereof.
Example one
The embodiment provides a self-adaptive sight line estimation method based on multi-scale feature fusion, and the main invention points of the embodiment are as follows:
1. The facial feature extraction network is used as the backbone network to output the sight line estimation result, and a multi-scale attention mechanism is introduced to better mine the global features of the face image.
2. The eye feature extraction network is used as a guidance network: the features of the two eye images are fused, feature parameters more focused on the gaze point are extracted, and the feature extraction of the face image is dynamically guided, thereby realizing adaptive sight line estimation.
As shown in fig. 1, the present embodiment provides a method for adaptive gaze estimation based on multi-scale feature fusion, which specifically includes:
step 100: an image of the face of the target person is acquired.
Step 200: processing the face image of the target person to obtain the eye and face position information of the target person; the eye and face position information includes face bounding box information, left eye bounding box information, and right eye bounding box information.
Step 300: inputting the target person information into the adaptive sight line estimation model to obtain a sight line estimation result of the target person; the target person information includes the face image and the eye and face position information of the target person.
The adaptive sight line estimation model is obtained by training a multi-scale feature fusion-based adaptive sight line estimation network by using first sample input data and sample actual measurement results corresponding to the first sample input data, and updating translation parameters and scaling parameters in the multi-scale feature fusion-based adaptive sight line estimation network by using an eye guidance network.
The eye guidance network is used for processing second sample input data corresponding to the first sample input data by adopting a deep learning algorithm to obtain a translation parameter and a scaling parameter.
The first sample input data comprise the face image and the eye and face position information required for model training; the second sample input data comprise the left eye image, the right eye image and the eye and face position information required for model training; and the sample actual measurement result is the measured sight line required for model training.
The training process of the adaptive sight line estimation model comprises the following steps:
(1) Constructing a sample data set; the sample data set comprises a plurality of sample data; the sample data comprises first sample input data, corresponding second sample input data and a sample actual measurement result.
(2) And inputting the first sample input data into a self-adaptive sight estimation network based on multi-scale feature fusion to obtain a sample prediction result.
(3) And calculating a network loss value by using the sample prediction result and the sample actual measurement result.
(4) Updating the network parameters of the multi-scale feature fusion-based adaptive sight line estimation network by using the network loss value, and updating the translation parameters and scaling parameters among the updated network parameters by using the eye guidance network, to obtain the updated multi-scale feature fusion-based adaptive sight line estimation network; performing iterative loop optimization until the number of iterations reaches the maximum number of iterations or the network loss value is smaller than a set threshold value, and determining the last updated multi-scale feature fusion-based adaptive sight line estimation network as the adaptive sight line estimation model. A minimal sketch of one such training step is shown below.
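The following sketch illustrates one alternating update of the two networks described in steps (2) to (4). It is a minimal example rather than the patented implementation: the network call signatures, the 2-D gaze-point labels and the MSE loss are assumptions, since the loss function is not specified here.

```python
import torch
import torch.nn as nn

def train_step(face_net, eye_net, batch, optimizer, criterion=nn.MSELoss()):
    """One optimisation step: the eye guidance network produces Pa/Pm, the
    face network predicts the gaze point, and both are updated from the loss."""
    face_img, left_eye, right_eye, pos, gaze_gt = batch

    # Second sample input data -> translation (Pa) and scaling (Pm) parameters.
    pa, pm = eye_net(right_eye, left_eye, pos)

    # First sample input data plus the externally supplied Pa / Pm.
    gaze_pred = face_net(face_img, pos, pa, pm)

    loss = criterion(gaze_pred, gaze_gt)   # network loss value
    optimizer.zero_grad()
    loss.backward()                        # gradients reach both networks
    optimizer.step()
    return loss.item()
```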
One example is: the training process of the adaptive gaze estimation model is determined by the following steps.
Step 1: The GazeCapture dataset and the MPIIFaceGaze dataset were selected to train and test the model. The 1400 samples of the GazeCapture dataset (more than 2.4 million face images in total) were randomly divided into a training set, a validation set and a test set in the ratio 7:2:1. The training set is used for the model to learn the mapping from images to gaze points, the validation set is used to optimize the model during training, and the test set is used to evaluate the gaze-point prediction performance of the model. The pictures in the training, validation and test sets do not overlap. From the 15 samples of the MPIIFaceGaze dataset (37667 face pictures in total), 13 samples were selected as the training set and 2 samples as the test set. Because this dataset contains few samples, cross-validation was performed on MPIIFaceGaze (all samples were used as test sets) to better verify model performance, and the results of 8 experiments were averaged.
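A minimal sketch of such a random 7:2:1 split follows. Splitting by participant ID, so that no person's pictures appear in two sets, and the helper name split_by_subject are assumptions used only for illustration.

```python
import random

def split_by_subject(subject_ids, seed=0):
    """Randomly partition subjects into train/val/test at a 7:2:1 ratio."""
    ids = list(subject_ids)
    random.Random(seed).shuffle(ids)
    n = len(ids)
    n_train, n_val = int(0.7 * n), int(0.2 * n)
    return ids[:n_train], ids[n_train:n_train + n_val], ids[n_train + n_val:]

# Example: 1400 GazeCapture samples -> roughly 980 / 280 / 140 subjects.
train_ids, val_ids, test_ids = split_by_subject(range(1400))
```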
Step 2: According to the eye bounding-box coordinates and face bounding-box coordinates provided by the datasets, the face images in the GazeCapture and MPIIFaceGaze datasets were preprocessed: the left eye image and right eye image were cropped out and their positions relative to the face image were recorded; the face images were resized to 224 × 224 × 3 and the left and right eye images to 112 × 112 × 3 (length, width and number of RGB channels, respectively); and pixel values were normalized from [0, 255] to the [0, 1] interval.
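A minimal preprocessing sketch along these lines is given below. The (x, y, w, h) box format, the OpenCV-based cropping and the normalization of box coordinates by the face size are assumptions made for illustration, not details taken from this description.

```python
import numpy as np
import cv2

def preprocess_sample(face_img, face_box, left_box, right_box):
    """Crop the eye patches, resize, and normalize pixel values to [0, 1]."""
    def crop(img, box):
        x, y, w, h = box
        return img[y:y + h, x:x + w]

    face = cv2.resize(face_img, (224, 224)).astype(np.float32) / 255.0
    left = cv2.resize(crop(face_img, left_box), (112, 112)).astype(np.float32) / 255.0
    right = cv2.resize(crop(face_img, right_box), (112, 112)).astype(np.float32) / 255.0

    # Eye and face position information: bounding boxes scaled by the face size.
    fh, fw = face_img.shape[:2]
    pos = np.array([*face_box, *left_box, *right_box], dtype=np.float32)
    pos /= np.array([fw, fh, fw, fh] * 3, dtype=np.float32)
    return face, left, right, pos
```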
Step 3: The preprocessed images are input into the adaptive sight line estimation network based on multi-scale feature fusion for training. The structure of the adaptive sight line estimation network based on multi-scale feature fusion (i.e., the facial feature extraction network) is shown in FIG. 2, and the structure of the eye guidance network is shown in FIG. 3.
The input of the facial feature extraction network comprises a facial image, facial bounding box information and left and right eye bounding box information; the input of the eye guidance network is a left eye image, a right eye image, face bounding box information, and left and right eye bounding box information.
The facial feature extraction network serves as the backbone network. A face image is input, facial features are extracted by several convolutional layers and adaptively adjusted by a translation parameter (Pa) and a scaling parameter (Pm), then fed into the multi-scale attention modules; finally, the fully connected layers are spliced with the eye and face position information (the bounding-box coordinates of the left and right eye images and of the face image) and the gaze-point coordinates are output. The eye guidance network takes the left and right eye images and the eye and face position information as input; after feature extraction by the convolutional layers, the feature blocks are fused, and the fully connected layer finally outputs the two parameters Pa and Pm, which dynamically guide the facial feature extraction.
The adaptive sight line estimation network based on multi-scale feature fusion comprises a convolutional layer, a global average pooling layer without translation parameters and scaling parameters, a first translation scaling layer, a channel dimension splicing layer, a first multi-scale attention module, a second translation scaling layer, a second multi-scale attention module and a full connection layer which are connected in sequence; the translation parameters and scaling parameters in the first translation scaling layer are determined by the eye guidance network, and the translation parameters and scaling parameters in the second translation scaling layer are determined by the eye guidance network; the full connection layer comprises a first full connection block, a second full connection block and a third full connection block; the first full connection block is used for inputting the eye and face position information, and the second full connection block is used for inputting the features output by the second multi-scale attention module; the input end of the third full connection block is connected with the output end of the first full connection block and the output end of the second full connection block respectively; and the convolutional layer is used for inputting the face image.
After convolutional feature extraction of the facial image, the normalization layers carrying scaling and translation operations in the low-dimensional and high-dimensional features are replaced by a global average pooling layer without scaling and translation parameters, and those translation and scaling parameters are replaced by the two parameters generated by the eye guidance network, so that the facial features are re-extracted adaptively. The adjusted low-dimensional and high-dimensional features are then spliced along the channel dimension and input into the multi-scale attention module, which effectively uses the spatial information of features at different scales while building dependencies among the feature channels, so as to better capture the global information of the facial image. Finally, after the adaptive adjustment and the multi-scale attention modules, the fully connected layers are spliced with the eye and face position information and the gaze-point coordinates are output. In FIG. 2, GN(·) denotes the global average pooling layer without translation and scaling parameters, stack denotes the channel dimension splicing layer, and FC denotes a fully connected layer.
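A minimal sketch of the translation-scaling idea is given below. It assumes that the affine-free normalization denoted GN(·) behaves like GroupNorm with affine=False and that Pa and Pm arrive as per-channel vectors; both points, and the class name, are assumptions for illustration rather than details stated here.

```python
import torch
import torch.nn as nn

class TranslationScalingLayer(nn.Module):
    """Normalizes features without learnable affine parameters, then shifts and
    scales them with the Pa (translation) and Pm (scaling) parameters supplied
    by the eye guidance network."""
    def __init__(self, channels, groups=8):
        super().__init__()
        self.norm = nn.GroupNorm(groups, channels, affine=False)

    def forward(self, x, pa, pm):
        # x: (N, C, H, W); pa, pm: (N, C) vectors from the eye guidance network
        x = self.norm(x)
        return x * pm.unsqueeze(-1).unsqueeze(-1) + pa.unsqueeze(-1).unsqueeze(-1)
```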
The eye guidance network comprises a first branch network, a second branch network, a third branch network, and a full connection layer module connected to the output of the first branch network, the output of the second branch network and the output of the third branch network; the first branch network comprises a first convolution block, a channel dimension splicing layer and a full connection layer which are connected in sequence; the first convolution block is used for inputting the right eye image; the second branch network comprises a second convolution block, a cross-view pooling layer and a full connection layer which are connected in sequence; the second convolution block is used for inputting the left eye image; the third branch network is used for inputting the eye and face position information; and the full connection layer module is used for outputting a translation parameter and a scaling parameter.
Eye guidance network: because the left and right eyes are similar in shape and structure, feature fusion is performed on the left and right eye images. The right eye image and the horizontally flipped left eye image are input into the network, and the low-dimensional and high-dimensional features are spliced along the channel dimension, so that the feature maps extracted at lower layers retain more spatial information while those extracted at higher layers have stronger representational ability. Finally, together with the eye and face position information, the full connection layer outputs the two parameters Pa and Pm, which dynamically guide and adjust the facial feature extraction. cv-pool in FIG. 3 denotes the cross-view pooling layer.
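The sketch below reflects this description at a high level. It simplifies the two eye branches to identical small convolution stacks and fixes concrete channel sizes, the position-vector dimension and the Pa/Pm dimension; all of these, like the class and layer names, are assumptions for illustration.

```python
import torch
import torch.nn as nn

class EyeGuidanceNet(nn.Module):
    """Two eye branches plus a position branch; outputs Pa and Pm."""
    def __init__(self, pos_dim=12, out_dim=64):
        super().__init__()
        def eye_branch():
            return nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.right_branch = eye_branch()
        self.left_branch = eye_branch()
        self.pos_branch = nn.Sequential(nn.Linear(pos_dim, 64), nn.ReLU())
        self.fuse = nn.Sequential(nn.Linear(64 * 3, 128), nn.ReLU())
        self.pa_head = nn.Linear(128, out_dim)  # translation parameter Pa
        self.pm_head = nn.Linear(128, out_dim)  # scaling parameter Pm

    def forward(self, right_eye, left_eye, pos):
        left_eye = torch.flip(left_eye, dims=[3])  # horizontal flip of the left eye
        feats = torch.cat([self.right_branch(right_eye),
                           self.left_branch(left_eye),
                           self.pos_branch(pos)], dim=1)
        h = self.fuse(feats)
        return self.pa_head(h), self.pm_head(h)
```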
The first multi-scale attention module and the second multi-scale attention module are identical in structure. The first multi-scale attention module will now be described as an example.
The first multi-scale attention module comprises an SPC module, an SE module, a spatial attention map acquisition module and a summary module; the input end of the SPC module is used for inputting the characteristics output by the channel dimension splicing layer, the output end of the SPC module is connected with the input end of the SE module, the first output end of the SE module is connected with the first input end of the summarizing module, the second output end of the SE module is connected with the input end of the space attention map acquiring module, the output end of the space attention map acquiring module is connected with the second input end of the summarizing module, and the third input end of the summarizing module is used for inputting the characteristics output by the channel dimension splicing layer; the summarizing module is used for outputting a characteristic diagram with multi-scale information under different receptive fields.
The spatial attention map acquisition module is configured to:
performing convolution, global average pooling and dimension transformation on the feature map output by the SE module to obtain a first feature sub-map and a second feature sub-map;
performing a normalization operation on the first feature sub-map and then multiplying it with the second feature sub-map by two-dimensional tensor multiplication to obtain a two-dimensional feature;
and performing dimension transformation and an activation operation on the two-dimensional feature to obtain a spatial attention map.
As shown in fig. 4 and 5, the SPC module convolves the input feature X (size C × H × W) with convolution kernels of different sizes (3 × 3, 5 × 5, 7 × 7 and 9 × 9; each convolved feature has size C/4 × H × W) to obtain receptive fields of different scales, extracts information at these scales, and concatenates the results along the channel dimension. The SE module then extracts a weight for each group of channels (global average pooling compresses each channel into a scalar, giving a C × 1 vector); after softmax normalization, the weights are multiplied with the features along the channel dimension so as to weight the channels. The rescaled feature map Xc focuses on the useful channels, but pixels within the same channel still share the same weight, so a further computation based on Xc is carried out in the spatial dimension.
As shown in the figure, the feature Xc is fed into a 1 × 1 convolution layer, and global average pooling and dimension transformation yield feature maps Q (1 × C/2) and V (C/2 × HW), respectively. Q is softmax-normalized and multiplied with V by two-dimensional tensor multiplication to obtain a 1 × HW feature, which after dimension transformation and a sigmoid activation gives the spatial attention map As (1 × H × W). Finally, As is multiplied with Xc over the spatial dimensions and the result is added to the original input feature X, yielding a feature map with multi-scale information under different receptive fields.
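Putting the SPC, SE and spatial-attention steps together gives the following sketch. The use of two separate 1 × 1 convolutions for Q and V, the simplified SE weighting and the residual form are assumptions made for a self-contained example; the tensor shapes match those described above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleAttention(nn.Module):
    """SPC multi-scale convolutions, SE-style channel weights, then a spatial
    attention map built from Q (1 x C/2) and V (C/2 x HW)."""
    def __init__(self, channels):
        super().__init__()
        assert channels % 4 == 0
        c4 = channels // 4
        self.spc = nn.ModuleList(
            [nn.Conv2d(channels, c4, k, padding=k // 2) for k in (3, 5, 7, 9)])
        self.q_conv = nn.Conv2d(channels, channels // 2, 1)
        self.v_conv = nn.Conv2d(channels, channels // 2, 1)

    def forward(self, x):
        n, c, h, w = x.shape
        # SPC: kernels 3/5/7/9, each giving C/4 channels, concatenated to C.
        xs = torch.cat([conv(x) for conv in self.spc], dim=1)       # N x C x H x W
        # SE: per-channel weights from global average pooling + softmax.
        wch = F.softmax(F.adaptive_avg_pool2d(xs, 1), dim=1)        # N x C x 1 x 1
        xc = xs * wch                                               # rescaled feature Xc
        # Spatial attention: Q (1 x C/2), V (C/2 x HW), then sigmoid.
        q = F.adaptive_avg_pool2d(self.q_conv(xc), 1).flatten(1)    # N x C/2
        v = self.v_conv(xc).flatten(2)                              # N x C/2 x HW
        a = torch.bmm(F.softmax(q, dim=1).unsqueeze(1), v)          # N x 1 x HW
        a_s = torch.sigmoid(a).view(n, 1, h, w)                     # spatial map As
        return xc * a_s + x                                         # add original input X
```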
Experimental setup: the experiments were completed on a high-performance computing platform with Windows 10, an AMD 5800X CPU, an RTX 3080 GPU and 32 GB of memory.
Adaptive sight line estimation method
The method fully utilizes the feature relationship between the human face and the eyes: the facial feature extraction network serves as the backbone network and outputs the sight line estimation result, while the eye feature extraction network serves as the guidance network, fuses the features of the two eye images, extracts feature parameters more focused on the gaze point, and dynamically guides the facial feature extraction, thereby realizing adaptive sight line estimation.
Multi-scale attention mechanism
Since the whole-face image contains richer global information, a multi-scale attention module is designed that effectively uses the spatial information of features at different scales while building the dependency relationship among the feature channels, so as to better capture the global information of the face image.
Table 1: comparison test table
(The comparison results of Table 1 are provided as an image in the original publication.)
The experimental results are as follows: in comparison experiments on the MPIIFaceGaze and GazeCapture datasets, the proposed adaptive sight line estimation method outperforms mainstream appearance-based sight line estimation methods. On MPIIFaceGaze the error is 3.8 cm; on GazeCapture, whose acquisition devices are divided into mobile phones and tablets, the method achieves errors of 2.68 cm and 3.14 cm respectively.
Example two
In order to implement the corresponding method of the above embodiments to achieve the corresponding functions and technical effects, an adaptive gaze estimation system based on multi-scale feature fusion is provided below.
As shown in fig. 6, the adaptive gaze estimation system based on multi-scale feature fusion provided in this embodiment includes:
the face image acquisition module 1 is used for acquiring a face image of a target person;
the eye and face position information calculation module 2 is used for processing the face image of the target person to obtain the eye and face position information of the target person; the eye and face position information includes face bounding box information, left eye bounding box information and right eye bounding box information;
the gaze estimation result prediction module 3 is used for inputting the target person information into the adaptive gaze estimation model to obtain the gaze estimation result of the target person; the target person information comprises the face image and the eye and face position information of the target person;
the adaptive sight line estimation model is obtained by training a multi-scale feature fusion-based adaptive sight line estimation network by utilizing first sample input data and sample actual measurement results corresponding to the first sample input data, and updating translation parameters and scaling parameters in the multi-scale feature fusion-based adaptive sight line estimation network by utilizing an eye guidance network;
the eye guidance network is used for processing second sample input data corresponding to the first sample input data by adopting a deep learning algorithm to obtain a translation parameter and a scaling parameter;
the first sample input data comprise the face image and the eye and face position information required for model training; the second sample input data comprise the left eye image, the right eye image and the eye and face position information required for model training; and the sample actual measurement result is the measured sight line required for model training.
EXAMPLE III
An embodiment of the present invention provides an electronic device, which includes a memory and a processor, where the memory is used to store a computer program, and the processor runs the computer program to make the electronic device execute the adaptive gaze estimation method according to the first embodiment.
Alternatively, the electronic device may be a server.
In addition, an embodiment of the present invention further provides a computer-readable storage medium, which stores a computer program, and when the computer program is executed by a processor, the computer program implements the adaptive gaze estimation method of the first embodiment.
The embodiments in the present description are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same and similar parts among the embodiments are referred to each other. For the system disclosed by the embodiment, the description is relatively simple because the system corresponds to the method disclosed by the embodiment, and the relevant points can be referred to the method part for description.
The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims (10)

1. An adaptive sight line estimation method based on multi-scale feature fusion is characterized by comprising the following steps:
acquiring a face image of a target person;
processing the face image of the target person to obtain the eye and face position information of the target person; the eye and face position information includes face bounding box information, left eye bounding box information and right eye bounding box information;
inputting the target person information into the adaptive sight line estimation model to obtain a sight line estimation result of the target person; the target person information comprises the face image and the eye and face position information of the target person;
the adaptive sight line estimation model is obtained by training a multi-scale feature fusion-based adaptive sight line estimation network by utilizing first sample input data and sample actual measurement results corresponding to the first sample input data, and updating translation parameters and scaling parameters in the multi-scale feature fusion-based adaptive sight line estimation network by utilizing an eye guidance network;
the eye guidance network is used for processing second sample input data corresponding to the first sample input data by adopting a deep learning algorithm to obtain a translation parameter and a scaling parameter;
the first sample input data comprise the face image and the eye and face position information required for model training; the second sample input data comprise the left eye image, the right eye image and the eye and face position information required for model training; and the sample actual measurement result is the measured sight line required for model training.
2. The adaptive sight line estimation method based on multi-scale feature fusion according to claim 1, wherein the training process of the adaptive sight line estimation model is as follows:
constructing a sample data set; the sample data set comprises a plurality of sample data; the sample data comprises first sample input data, corresponding second sample input data and a sample actual measurement result;
inputting the first sample input data into a self-adaptive sight estimation network based on multi-scale feature fusion to obtain a sample prediction result;
calculating a network loss value by using the sample prediction result and the sample actual measurement result;
updating the network parameters of the multi-scale feature fusion-based adaptive sight line estimation network by using the network loss value, updating the translation parameters and the scaling parameters in the updated network parameters by using the eye guidance network to obtain the updated multi-scale feature fusion-based adaptive sight line estimation network, performing iterative loop optimization until the iteration times reach the maximum iteration times or the network loss value is smaller than a set threshold value, and determining the last updated multi-scale feature fusion-based adaptive sight line estimation network as an adaptive sight line estimation model.
3. The adaptive sight line estimation method based on multi-scale feature fusion, characterized in that the adaptive sight line estimation network based on multi-scale feature fusion comprises a convolutional layer, a global average pooling layer without translation parameters and scaling parameters, a first translation scaling layer, a channel dimension splicing layer, a first multi-scale attention module, a second translation scaling layer, a second multi-scale attention module and a full connection layer which are connected in sequence; wherein the translation parameters and scaling parameters in the first translation scaling layer are determined by the eye guidance network; the translation parameters and scaling parameters in the second translation scaling layer are determined by the eye guidance network; the full connection layer comprises a first full connection block, a second full connection block and a third full connection block; the first full connection block is used for inputting the eye and face position information, and the second full connection block is used for inputting the features output by the second multi-scale attention module; the input end of the third full connection block is connected with the output end of the first full connection block and the output end of the second full connection block respectively; and the convolutional layer is used for inputting the face image.
4. The adaptive sight line estimation method based on multi-scale feature fusion according to claim 1, wherein the eye guidance network comprises a first branch network, a second branch network, a third branch network, and a full connection layer module connected with the output end of the first branch network, the output end of the second branch network and the output end of the third branch network; the first branch network comprises a first convolution block, a channel dimension splicing layer and a full connection layer which are connected in sequence; the first convolution block is used for inputting the right eye image; the second branch network comprises a second convolution block, a cross-view pooling layer and a full connection layer which are connected in sequence; the second convolution block is used for inputting the left eye image; the third branch network is used for inputting the eye and face position information; and the full connection layer module is used for outputting a translation parameter and a scaling parameter.
5. The adaptive sight line estimation method based on multi-scale feature fusion according to claim 1, wherein the structure of the first multi-scale attention module and the structure of the second multi-scale attention module are the same;
the first multi-scale attention module comprises an SPC module, an SE module, a spatial attention map acquisition module and a summary module;
the input end of the SPC module is used for inputting the characteristics output by the channel dimension splicing layer, the output end of the SPC module is connected with the input end of the SE module, the first output end of the SE module is connected with the first input end of the summarizing module, the second output end of the SE module is connected with the input end of the space attention map acquiring module, the output end of the space attention map acquiring module is connected with the second input end of the summarizing module, and the third input end of the summarizing module is used for inputting the characteristics output by the channel dimension splicing layer; the summarizing module is used for outputting a characteristic diagram with multi-scale information under different receptive fields.
6. The adaptive sight line estimation method based on multi-scale feature fusion according to claim 5, wherein the spatial attention map acquisition module is used for:
performing convolution, global average pooling and dimension transformation on the feature map output by the SE module to obtain a first feature sub-map and a second feature sub-map;
performing a normalization operation on the first feature sub-map and then multiplying it with the second feature sub-map by two-dimensional tensor multiplication to obtain a two-dimensional feature;
and performing dimension transformation and an activation operation on the two-dimensional feature to obtain a spatial attention map.
7. The adaptive sight line estimation method based on multi-scale feature fusion according to claim 1, wherein the processing the face image of the target person to obtain the eye-face position information of the target person specifically comprises:
processing the face image of the target person to obtain a left eye image and a right eye image of the target person;
and obtaining the eye and face position information of the target person according to the left eye image, the right eye image and the face image of the target person.
8. An adaptive gaze estimation system based on multi-scale feature fusion, comprising:
the face image acquisition module is used for acquiring a face image of the target person;
the eye and face position information calculation module is used for processing the face image of the target person to obtain the eye and face position information of the target person; the eye and face position information comprises face bounding box information, left eye bounding box information and right eye bounding box information;
the gaze estimation result prediction module is used for inputting the target person information into the adaptive gaze estimation model to obtain a gaze estimation result of the target person; the target person information comprises the face image and the eye and face position information of the target person;
the adaptive sight line estimation model is obtained by training a multi-scale feature fusion-based adaptive sight line estimation network by utilizing first sample input data and sample actual measurement results corresponding to the first sample input data, and updating translation parameters and scaling parameters in the multi-scale feature fusion-based adaptive sight line estimation network by utilizing an eye guidance network;
the eye guidance network is used for processing second sample input data corresponding to the first sample input data by adopting a deep learning algorithm to obtain a translation parameter and a scaling parameter;
the first sample input data are used as face images and eyelid position information required by model training; the second sample input data is left eye image, right eye image and eyelid position information required by model training; and the actual measurement result of the sample is used as the actual measurement result of the sight line required by model training.
9. An electronic device, comprising a memory for storing a computer program and a processor for executing the computer program to cause the electronic device to perform the adaptive gaze estimation method according to any one of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores a computer program which, when executed by a processor, implements the adaptive gaze estimation method according to any one of claims 1 to 7.
CN202211471537.7A 2022-11-23 2022-11-23 Adaptive sight line estimation method, system, electronic equipment and storage medium Pending CN115862095A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211471537.7A CN115862095A (en) 2022-11-23 2022-11-23 Adaptive sight line estimation method, system, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115862095A true CN115862095A (en) 2023-03-28

Family

ID=85665217


Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117133043A (en) * 2023-03-31 2023-11-28 Honor Device Co., Ltd. Gaze point estimation method, electronic device, and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination