CN117523650B - Eyeball motion tracking method and system based on rotation target detection - Google Patents

Eyeball motion tracking method and system based on rotation target detection

Info

Publication number
CN117523650B
CN117523650B (application CN202410008039.1A)
Authority
CN
China
Prior art keywords
pupil
frame
image
frame image
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410008039.1A
Other languages
Chinese (zh)
Other versions
CN117523650A (en)
Inventor
沈益冉
张桐瑜
赵广荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University
Original Assignee
Shandong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University filed Critical Shandong University
Priority to CN202410008039.1A priority Critical patent/CN117523650B/en
Publication of CN117523650A publication Critical patent/CN117523650A/en
Application granted granted Critical
Publication of CN117523650B publication Critical patent/CN117523650B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 Eye characteristics, e.g. of the iris
    • G06V40/193 Preprocessing; Feature extraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/70 Labelling scene content, e.g. deriving syntactic or semantic representations
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 Eye characteristics, e.g. of the iris
    • G06V40/197 Matching; Classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 Road transport of goods or passengers
    • Y02T10/10 Internal combustion engine [ICE] based vehicles
    • Y02T10/40 Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Ophthalmology & Optometry (AREA)
  • Human Computer Interaction (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to the technical field of eye movement tracking and discloses an eyeball motion tracking method and system based on rotation target detection. The method comprises: acquiring an eye image sequence; and inputting each frame of the eye image sequence into a trained eye movement tracking model, which outputs the pupil positioning result for each frame. The trained model performs feature extraction and feature fusion on the T-th frame image to obtain its image features; determines, from the pupil occlusion degree of the previous frame, whether to perform feature fusion in the time domain for the current frame; performs pupil positioning on the T-th frame image with a rotating target detection model; and estimates the pupil occlusion degree of the T-th frame image with semantic segmentation, taking the pupil of the T-th frame image as a new template if it is unoccluded. The invention can remarkably improve the accuracy and stability of eye movement tracking.

Description

Eyeball motion tracking method and system based on rotation target detection
Technical Field
The invention relates to the technical field of eye movement tracking, in particular to an eyeball movement tracking method and system based on rotation target detection.
Background
The statements in this section merely relate to the background of the present disclosure and may not necessarily constitute prior art.
Eye tracking technology, which analyzes the gaze point and gaze direction of the human eye through pupil positioning, plays an important role in applications such as foreground rendering in virtual reality, human-computer interaction, virtual classroom teaching, identity verification, and psychological analysis in the biomedical field. In the core pipeline of eye movement tracking, accurate identification of the pupil region is a critical step. Deep-learning-based algorithms have shown better performance in this field than conventional methods, but they still have certain limitations.
First, existing deep learning algorithms rely mainly on semantic segmentation, which identifies the pupil region by binary classification of the pixels in the image and then fits an ellipse to the irregular predicted shape in a post-processing step. This type of method does not fully exploit the prior information that the pupil shape is in fact elliptical.
Second, existing algorithms do not effectively handle the influence of blinking on pupil detection accuracy. During a blink the eyelid occludes the pupil, causing a large deviation between the model's prediction and reality. Such deviations directly degrade the performance of eye tracking in the above application scenarios.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides an eyeball motion tracking method and system based on rotation target detection. By combining prior knowledge with an optimized algorithm flow, the accuracy and stability of eye movement tracking are expected to improve remarkably, providing more reliable technical support for virtual reality, human-computer interaction and other related fields.
In one aspect, an eye movement tracking method based on rotation target detection is provided, including: acquiring an eye image sequence; inputting each frame of the eye image sequence into a trained eye movement tracking model, which outputs the pupil positioning result for each frame: (1): performing feature extraction and feature fusion on the T-th frame image to obtain the image features of the T-th frame image; T is a positive integer greater than or equal to 1; (2): judging whether the current T-th frame image is the first frame of the eye image sequence, and if so, entering (3); if it is not the first frame, judging whether the pupil occlusion degree of frame T-1 is smaller than a first set threshold, and if so, entering (3); if it is not smaller than the first set threshold but is smaller than a second set threshold, performing feature fusion in the time domain between the T-th frame image and the template, and entering (3); if it is larger than the second set threshold, entering (4); (3): performing pupil positioning on the T-th frame image with a rotating target detection model; (4): estimating the pupil occlusion degree of the T-th frame image with semantic segmentation, and taking the pupil of the T-th frame image as a new template if it is unoccluded; (5): judging whether the T-th frame image is the last frame; if so, ending; if not, incrementing T by 1 and returning to (1).
In another aspect, an eye movement tracking system based on rotation target detection is provided, comprising: an acquisition module configured to acquire an eye image sequence; and a tracking module configured to input each frame of the eye image sequence into a trained eye movement tracking model, which outputs the pupil positioning result for each frame. The tracking module includes: a feature extraction and fusion unit configured to perform feature extraction and feature fusion on the T-th frame image to obtain its image features, where T is a positive integer greater than or equal to 1; a judgment unit configured to judge whether the current T-th frame image is the first frame of the eye image sequence and, if so, enter the pupil positioning unit; if it is not the first frame, to judge whether the pupil occlusion degree of frame T-1 is smaller than a first set threshold and, if so, enter the pupil positioning unit; if it is not smaller than the first set threshold but is smaller than a second set threshold, to perform feature fusion in the time domain between the T-th frame image and the template and enter the pupil positioning unit; and if it is larger than the second set threshold, to enter the occlusion degree estimation unit; a pupil positioning unit configured to perform pupil positioning on the T-th frame image with a rotating target detection model; an occlusion degree estimation unit configured to estimate the pupil occlusion degree of the T-th frame image with semantic segmentation and, if the pupil of the T-th frame image is unoccluded, take it as a new template; and a re-judgment unit configured to judge whether the T-th frame image is the last frame; if so, end; if not, increment T by 1 and return to the feature extraction and fusion unit.
The above technical scheme has the following advantages or beneficial effects: the invention maintains high detection accuracy when blinking occurs. It mainly exploits the prior knowledge that the pupil is essentially elliptical to obtain the ellipse corresponding to the pupil directly, avoiding the post-processing steps required by semantic-segmentation-based algorithms and making the algorithm more concise and elegant. The core mechanism is to apply a rotating target detection method to obtain the minimum circumscribed rectangle that shares its parameters with the pupil ellipse, thereby obtaining the center coordinates, major-axis length, minor-axis length and rotation angle of the pupil ellipse.
When handling partial pupil occlusion, the invention adopts a fusion technique in the time domain. Specifically, the features of a pupil image that was previously unoccluded by the eyelid are used as a template and fused in the time domain with the features of the current frame, so that even when the pupil is partially occluded by the eyelid the relevant pupil information can still be obtained accurately, achieving accurate pupil detection.
Meanwhile, the occlusion degree of the pupil is judged with a semantic segmentation technique, so that the model switches between different working modes according to how much of the pupil is occluded.
In addition, to address the sparse distribution of blink images in datasets caused by the high speed and low frequency of blinking, and to reduce the manual effort required for labelling partially occluded pupils, the invention provides an innovative data generation strategy. The strategy uses pupil images captured with the eye fully open to generate corresponding pupil images partially occluded by the eyelid, enriching the diversity of the dataset and improving the robustness and accuracy of the pupil detection algorithm in practical applications.
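The patent does not disclose the implementation details of this data generation strategy. Purely as a hypothetical illustration of the general idea (an assumption, not the patented procedure), the sketch below synthesizes a partially occluded training sample from an open-eye image by painting a flat eyelid-like band over the upper part of the pupil and clipping the visibility mask accordingly; realistic data generation would additionally need eyelid texture and curvature.

```python
import numpy as np

def synthesize_eyelid_occlusion(eye_img: np.ndarray, pupil_mask: np.ndarray,
                                cover_ratio: float = 0.5, lid_gray: int = 120):
    """Hypothetical sketch only: occlude the top `cover_ratio` of the pupil with a
    flat gray band standing in for the eyelid, and zero out the covered pupil pixels
    in the label mask so the sample is annotated as partially occluded."""
    ys = np.nonzero(pupil_mask)[0]
    if ys.size == 0:                               # no pupil in this image
        return eye_img, pupil_mask
    y_top, y_bot = int(ys.min()), int(ys.max())
    lid_edge = y_top + int(cover_ratio * (y_bot - y_top))   # lower edge of the fake eyelid
    img, mask = eye_img.copy(), pupil_mask.copy()
    img[:lid_edge, :] = lid_gray                   # paint the "eyelid" band
    mask[:lid_edge, :] = 0                         # these pupil pixels are now occluded
    return img, mask
```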
The invention makes full use of the prior information that the pupil is elliptical and avoids the post-processing operations required by semantic segmentation algorithms. It achieves good pupil detection results whether or not the pupil is partially occluded, and when the occlusion degree is high it achieves the best results compared with existing methods: when more than 80% of the pupil area is occluded, the invention improves the intersection-over-union (IoU) by 20% and the F1 score by 12.5% compared with the prior art. Moreover, ablation experiments show that the temporal fusion technique and the data generation strategy proposed by the invention are both effective.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the invention.
Fig. 1 is a flow chart of a method according to a first embodiment.
Detailed Description
It should be noted that the following detailed description is exemplary and is intended to provide further explanation of the invention. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
Embodiment 1
As shown in fig. 1, this embodiment provides an eye movement tracking method based on rotation target detection, including: S101: acquiring an eye image sequence; S102: inputting each frame of the eye image sequence into a trained eye movement tracking model, which outputs the pupil positioning result for each frame. Step S102 specifically includes: S102-1: performing feature extraction and feature fusion on the T-th frame image to obtain its image features; T is a positive integer greater than or equal to 1; S102-2: judging whether the current T-th frame image is the first frame of the eye image sequence; if so, go to S102-3; if not, judging whether the pupil occlusion degree of frame T-1 is smaller than a first set threshold; if it is smaller than the first set threshold, go to S102-3; if it is not smaller than the first set threshold but is smaller than a second set threshold, perform feature fusion in the time domain between the T-th frame image and the template and go to S102-3; if it is larger than the second set threshold, go to S102-4; S102-3: performing pupil positioning on the T-th frame image with a rotating target detection model; S102-4: estimating the pupil occlusion degree of the T-th frame image with semantic segmentation, and taking the pupil of the T-th frame image as a new template if it is unoccluded; S102-5: judging whether the T-th frame image is the last frame; if so, end; if not, increment T by 1 and return to S102-1.
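For clarity, the per-frame control flow of S102 can be summarized in the following Python sketch. It is a minimal illustration of the branching described above; the callables passed in (extract_and_fuse, temporal_fuse, rotated_detect, estimate_occlusion) are placeholders standing in for the modules of this embodiment, and the threshold values anticipate the 25% / 87.5% settings given further below.

```python
from typing import Callable, List, Optional, Sequence, Tuple

FIRST_THRESHOLD = 0.25    # first set threshold (25%)
SECOND_THRESHOLD = 0.875  # second set threshold (87.5%)

def track_sequence(
    frames: Sequence,
    extract_and_fuse: Callable,    # S102-1: multi-scale spatial feature extraction + fusion
    temporal_fuse: Callable,       # time-domain fusion of frame features with the template
    rotated_detect: Callable,      # S102-3: rotated-target pupil detection
    estimate_occlusion: Callable,  # S102-4: U-Net based estimate -> (degree, pupil_features)
) -> List[Optional[Tuple]]:
    """Per-frame control flow of step S102 (illustrative sketch only)."""
    template = None          # features of the most recent unoccluded pupil
    prev_occlusion = 0.0     # pupil occlusion degree of frame T-1
    results: List[Optional[Tuple]] = []
    for t, frame in enumerate(frames):
        feat = extract_and_fuse(frame)                               # S102-1
        if t == 0 or prev_occlusion < FIRST_THRESHOLD:               # S102-2: first frame / mild occlusion
            results.append(rotated_detect(feat))                     # S102-3
        elif prev_occlusion < SECOND_THRESHOLD:                      # moderate occlusion: fuse with template
            results.append(rotated_detect(temporal_fuse(feat, template)))
        else:                                                        # heavy occlusion: skip detection
            results.append(None)
        prev_occlusion, pupil_feat = estimate_occlusion(frame)       # S102-4
        if prev_occlusion == 0.0:                                    # unoccluded pupil -> new template
            template = pupil_feat
    return results                                                   # S102-5: ends after the last frame
```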
Further, the trained eye movement tracking model has the following structure: a multi-scale spatial feature extraction and fusion module, whose input is the eye image sequence and which performs feature extraction and feature fusion on the eye images to obtain primary fusion features; a judging module, whose input is the pupil occlusion degree of frame T-1 and which compares it with the first and second set thresholds: if the occlusion degree of frame T-1 is smaller than the first set threshold, the primary fusion features of the T-th frame image are input to the pupil positioning module; if it is not smaller than the first set threshold but is smaller than the second set threshold, the primary fusion features of the T-th frame image and the template features are input to the time-domain feature fusion module, which fuses them and passes the result to the pupil positioning module; if it is larger than the second set threshold, the T-th frame image is considered invalid; the input of the judging module is also connected to the output of the pupil occlusion degree estimation module; the pupil occlusion degree estimation module is used to estimate the pupil occlusion degree in the image; the pupil positioning module is used to determine the position of the pupil; and the time-domain feature fusion module is used to fuse features in the time domain.
Further, the training process of the trained eye movement tracking model includes: constructing a first training set, which is an eye image sequence with known pupil positions and shapes; and inputting the first training set into the eye movement tracking model, stopping training when the total loss function value of the model no longer decreases or the number of iterations exceeds the set number, to obtain the trained eye movement tracking model.
Illustratively, constructing the first training set includes: the invention adopts a picture sequence formed by near-eye 8-bit grayscale images continuously captured by a camera as input data, and pupils are positioned in these grayscale images.
Further, the multi-scale spatial feature extraction and fusion module uses a Swin Transformer network to perform feature extraction on the T-th frame image, and a feature pyramid network (FPN, Feature Pyramid Networks) to fuse the features of each stage extracted by the Swin Transformer network with the features of the next stage.
Further, S102-1, performing feature extraction and feature fusion on the T-th frame image to obtain the image features of the T-th frame image, specifically includes: extracting features at the different stages of the T-th frame image with a Swin Transformer network; and using a feature pyramid network (FPN, Feature Pyramid Networks) to fuse the features of each stage extracted by the Swin Transformer network with the features of the next stage.
Illustratively, the spatial features of the image are extracted with a Swin Transformer feature extractor, which computes self-attention within local windows of the feature map; the window size is 7×7. Meanwhile, lower-level features are merged through a patch-merging operation to generate higher-level features whose scale is only 1/2 that of the lower-level features; through this operation the semantic information of the feature map is extracted.
The feature pyramid FPN is used to fuse features of different scales in the spatial domain: each layer's feature map is generated by 2× upsampling the features of the layer above and adding them to the same-level features passed through a 1×1 convolution. The generated feature maps therefore carry both the rich detail information of the low-level features and the semantic information of the high-level features.
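A minimal PyTorch sketch of this top-down FPN fusion is given below. The backbone is assumed to be a Swin-style network that returns one feature map per stage; the stage channel counts 96/192/384/768 and the 256 output channels are typical Swin-T/FPN values and are assumptions here, not figures from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Top-down feature pyramid fusion: each higher-level map is 2x upsampled and
    added to the next lower level after a 1x1 lateral convolution, so the fused
    maps carry both low-level detail and high-level semantics."""

    def __init__(self, in_channels=(96, 192, 384, 768), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList(nn.Conv2d(c, out_channels, kernel_size=1) for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, stage_feats):
        # stage_feats: list of backbone stage outputs, highest resolution first
        laterals = [lat(f) for lat, f in zip(self.lateral, stage_feats)]
        for i in range(len(laterals) - 1, 0, -1):                    # top-down pathway
            laterals[i - 1] = laterals[i - 1] + F.interpolate(
                laterals[i], size=laterals[i - 1].shape[-2:], mode="nearest")
        return [sm(p) for sm, p in zip(self.smooth, laterals)]

# Example with dummy Swin-T-like stage outputs for a 224x224 input:
feats = [torch.randn(1, c, s, s) for c, s in zip((96, 192, 384, 768), (56, 28, 14, 7))]
fused = SimpleFPN()(feats)   # four fused maps, each with 256 channels
```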
Further, the pupil occlusion degree is obtained with a trained semantic segmentation network (U-Net) that identifies the number of pupil pixels in the image; the pupil occlusion degree is computed as

P = 1 - N_vis / N_full ; (1)

where N_full denotes the number of pupil pixels in the complete pupil region when the pupil is unoccluded, and N_vis denotes the number of pupil pixels the U-Net network identifies in the unoccluded part of the pupil when it is partially occluded.
Further, the training process of the trained semantic segmentation network U-Net includes: constructing a second training set comprising eye images with known pupil positions and shapes; and inputting the second training set into the semantic segmentation network U-Net and training it to obtain the trained network.
The invention uses semantic segmentation to judge the degree to which the pupil is occluded. Specifically, semantic segmentation of the pupil in the image yields the number of pupil pixels in the image. If the current pupil is unoccluded by the eyelid, or only slightly occluded, this count is compared directly with the pixel count of the complete pupil region; otherwise the segmented count N_vis is compared with the complete-pupil pixel count N_full predicted by the model to obtain the occlusion degree P of the pupil. The detection strategy for the next image in the sequence is then decided according to the occlusion degree of the pupil in the current image. If the current pupil is unoccluded, the template is updated with the features of the current pupil.
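As a concrete illustration, the occlusion degree of equation (1) can be computed from the U-Net output as below. Since equation (1) is reproduced only as an image in the source, the form P = 1 − N_vis / N_full used here is a reconstruction implied by the variable definitions above, and the probability-threshold handling of the mask is an assumption.

```python
import numpy as np

def occlusion_degree(pupil_prob: np.ndarray, full_pupil_pixels: int,
                     prob_threshold: float = 0.5) -> float:
    """Pupil occlusion degree following Eq. (1): P = 1 - N_vis / N_full.
    pupil_prob: per-pixel pupil probability map from the U-Net.
    full_pupil_pixels: pixel count of the complete, unoccluded pupil (N_full),
    taken from a recent frame in which the pupil was fully visible."""
    n_vis = int((pupil_prob > prob_threshold).sum())      # visible pupil pixels (N_vis)
    return max(0.0, 1.0 - n_vis / max(full_pupil_pixels, 1))

# Example: 3000 visible pupil pixels out of a 4000-pixel complete pupil gives P = 0.25,
# exactly the first set threshold mentioned below.
```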
Illustratively, the first set threshold is 25%; the second set threshold is 87.5%.
Further, the feature fusion in the time domain of the T-th frame image with the template comprises:

F_fused^k = F_T^k * F_tmpl^k ; (2)

where F_fused is the fused feature map, F_T is the feature of the T-th frame image, F_tmpl is the template feature, k denotes the k-th channel of the feature map, and * is the convolution operation. The template is allowed to be updated, and the initial template is an unoccluded pupil image in the image sequence.
This temporal feature fusion of the T-th frame image with the template is also the internal operation of the time-domain feature fusion module. If the pupil is partially occluded, the fused feature is used; if the pupil is not partially occluded, the feature of the current frame is used.
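The sketch below shows one way to realize equation (2) in PyTorch, interpreting the convolution '*' as a per-channel (depthwise) operation in which the k-th channel of the template acts as the kernel for the k-th channel of the current-frame features. This channel-wise interpretation is an assumption drawn from the variable definitions above, not an explicit statement of the patent.

```python
import torch
import torch.nn.functional as F

def temporal_fuse(frame_feat: torch.Tensor, template_feat: torch.Tensor) -> torch.Tensor:
    """Eq. (2) as a depthwise convolution: F_fused^k = F_T^k * F_tmpl^k.
    frame_feat: (1, C, H, W) features of the current (T-th) frame.
    template_feat: (1, C, h, w) features of the last unoccluded pupil (the template)."""
    c = frame_feat.shape[1]
    kernels = template_feat.reshape(c, 1, *template_feat.shape[-2:])   # one kernel per channel
    pad = (kernels.shape[-2] // 2, kernels.shape[-1] // 2)
    return F.conv2d(frame_feat, kernels, padding=pad, groups=c)

# Example: fusing 256-channel maps, with a template patch smaller than the frame map.
fused = temporal_fuse(torch.randn(1, 256, 64, 64), torch.randn(1, 256, 7, 7))
```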
Further, step S102-3, performing pupil positioning on the T-th frame image with a rotating target detection model, includes: the rotating target detection model comprises a classification subnet and a regression subnet in parallel, both implemented as convolutional neural networks; the classification subnet judges whether a prior anchor frame contains the pupil, its input is the feature of the T-th frame image, and its output is the confidence that the anchor frame contains the pupil; the regression subnet predicts the offsets between the prior anchor frame and the rotated rectangular frame corresponding to the pupil, its input is the feature of the T-th frame image, and its output is the offsets between the ellipse corresponding to the pupil and the anchor frame.
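A compact sketch of such parallel subnets is shown below; the number of anchors per location, the channel width, and the two intermediate 3×3 convolution layers are illustrative assumptions rather than values given by the patent.

```python
import torch
import torch.nn as nn

class RotatedDetectionHead(nn.Module):
    """Parallel classification and regression subnets over the (fused) feature map:
    the classification branch scores each prior anchor for containing the pupil,
    and the regression branch predicts five offsets (center x/y, two axes, angle)."""

    def __init__(self, in_channels: int = 256, num_anchors: int = 3):
        super().__init__()

        def subnet(out_per_anchor: int) -> nn.Sequential:
            return nn.Sequential(
                nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.ReLU(inplace=True),
                nn.Conv2d(in_channels, num_anchors * out_per_anchor, 3, padding=1),
            )

        self.cls_subnet = subnet(1)   # pupil-confidence per anchor
        self.reg_subnet = subnet(5)   # (dx, dy, dw, dh, dtheta) per anchor

    def forward(self, feat: torch.Tensor):
        cls_scores = torch.sigmoid(self.cls_subnet(feat))   # (B, A, H, W)
        reg_offsets = self.reg_subnet(feat)                  # (B, 5*A, H, W)
        return cls_scores, reg_offsets
```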
Further, the training process of the rotating target detection model in S102-3 includes: constructing a third training set of pupil-region images with known pupil position labels; and inputting the third training set into the rotating target detection model and training it, stopping when the loss function value no longer decreases or the number of iterations reaches the set number, to obtain the trained rotating target detection model.
Further, after the classification subnet obtains the anchor frame with the highest confidence, let the parameters of that anchor frame be the anchor center x-coordinate x_a, the anchor center y-coordinate y_a, the anchor width w_a, the anchor length h_a and the anchor rotation angle θ_a. From the offsets output by the regression subnet, the predicted rotated rectangular frame corresponding to the pupil is finally obtained according to formulas (3), (4), (5), (6), (7), (8) and (9): its center x-coordinate x, center y-coordinate y, length h, width w and rotation angle θ, where x and y are also the center coordinates of the pupil ellipse, h is also the major-axis length of the pupil ellipse, w is also the minor-axis length of the pupil ellipse, and θ is also the rotation angle of the pupil ellipse.
Formulas (3)-(9) involve two intermediate variables and the five regression offsets: the offsets of the predicted ellipse center x- and y-coordinates relative to the anchor center, the offset of the predicted major axis relative to the anchor major axis, the offset of the predicted minor axis relative to the anchor minor axis, and the offset of the predicted rotation angle relative to the anchor rotation angle.
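Because formulas (3)-(9) are reproduced only as images in the source text, the decoding below is not the patented equations; it is a common rotated-anchor decoding (RRPN/rotated-RetinaNet style), shown purely to illustrate how such offsets are typically turned into an ellipse's center, axes and rotation angle.

```python
import math
from typing import Tuple

def decode_rotated_anchor(anchor: Tuple[float, float, float, float, float],
                          offsets: Tuple[float, float, float, float, float]):
    """Illustrative decoding (assumed parameterization, not the patent's formulas (3)-(9)).
    anchor  = (xa, ya, wa, ha, theta_a): center, width, length, rotation of the anchor frame.
    offsets = (dx, dy, dw, dh, dtheta):  regression-subnet outputs."""
    xa, ya, wa, ha, theta_a = anchor
    dx, dy, dw, dh, dtheta = offsets
    x = xa + dx * wa                 # ellipse center x
    y = ya + dy * ha                 # ellipse center y
    w = wa * math.exp(dw)            # ellipse minor-axis length (box width)
    h = ha * math.exp(dh)            # ellipse major-axis length (box length)
    theta = theta_a + dtheta         # ellipse rotation angle
    return x, y, h, w, theta

# Example: an axis-aligned 20x40 anchor at (60, 60) with small predicted offsets.
print(decode_rotated_anchor((60.0, 60.0, 20.0, 40.0, 0.0), (0.1, -0.05, 0.0, 0.1, 0.05)))
```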
Embodiment 2
This embodiment provides an eye movement tracking system based on rotation target detection, comprising: an acquisition module configured to acquire an eye image sequence; and a tracking module configured to input each frame of the eye image sequence into a trained eye movement tracking model, which outputs the pupil positioning result for each frame. The tracking module includes: a feature extraction and fusion unit configured to perform feature extraction and feature fusion on the T-th frame image to obtain its image features, where T is a positive integer greater than or equal to 1; a judgment unit configured to judge whether the current T-th frame image is the first frame of the eye image sequence and, if so, enter the pupil positioning unit; if it is not the first frame, to judge whether the pupil occlusion degree of frame T-1 is smaller than a first set threshold and, if so, enter the pupil positioning unit; if it is not smaller than the first set threshold but is smaller than the second set threshold, to perform feature fusion in the time domain between the T-th frame image and the template and enter the pupil positioning unit; and if it is larger than the second set threshold, to enter the occlusion degree estimation unit; a pupil positioning unit configured to perform pupil positioning on the T-th frame image with a rotating target detection model; an occlusion degree estimation unit configured to estimate the pupil occlusion degree of the T-th frame image with semantic segmentation and, if the pupil of the T-th frame image is unoccluded, take it as a new template; and a re-judgment unit configured to judge whether the T-th frame image is the last frame; if so, end; if not, increment T by 1 and return to the feature extraction and fusion unit.
The above description is only of the preferred embodiments of the present invention and is not intended to limit the present invention, but various modifications and variations can be made to the present invention by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (7)

1. An eyeball motion tracking method based on rotation target detection, characterized by comprising the following steps:
acquiring an eye image sequence;
inputting each frame of image in the eye image sequence into a trained eye movement tracking model, outputting pupil positioning results of each frame of image by the model, wherein the trained eye movement tracking model is used for:
(1): extracting features and fusing features of the T frame image to obtain image features of the T frame image; t is a positive integer greater than or equal to 1;
(2): judging whether the current T-th frame image is the first frame image of the eye image sequence, and if so, entering (3); if it is not the first frame image, judging whether the pupil occlusion degree of frame T-1 is smaller than a first set threshold, and if it is smaller than the first set threshold, entering (3); if it is not smaller than the first set threshold but is smaller than a second set threshold, carrying out feature fusion in the time domain between the T-th frame image and the template, and entering (3); if it is larger than the second set threshold, entering (4);
the feature fusion in the time domain of the T-th frame image with the template comprises:

F_fused^k = F_T^k * F_tmpl^k ; (2)

wherein F_fused is the fused feature map, F_T is the feature of the T-th frame image, F_tmpl is the template feature, k denotes the k-th channel of the feature map, and * is the convolution operation; the template is allowed to be updated, and the initial template is a set unoccluded pupil image;
(3): pupil positioning is carried out on the T-th frame image by adopting a rotating target detection model;
wherein pupil positioning on the T-th frame image with the rotating target detection model comprises:
the rotating target detection model includes a classification subnet and a regression subnet in parallel, both realized by convolutional neural networks;
the classification subnet is used for judging whether a prior anchor frame contains the pupil; its input is the feature of the T-th frame image, and its output is the confidence that the anchor frame contains the pupil;
the regression subnet is used for predicting the offsets between the prior anchor frame and the rotated rectangular frame corresponding to the pupil; its input is the feature of the T-th frame image, and its output is the offsets between the ellipse corresponding to the pupil and the anchor frame;
after the classification subnet obtains the anchor frame with the highest confidence, the parameters of that anchor frame are taken as the anchor center x-coordinate x_a, the anchor center y-coordinate y_a, the anchor width w_a, the anchor length h_a and the anchor rotation angle θ_a; from the offsets output by the regression subnet, the predicted rotated rectangular frame corresponding to the pupil is finally obtained according to formulas (3), (4), (5), (6), (7), (8) and (9): its center x-coordinate x, center y-coordinate y, length h, width w and rotation angle θ, wherein x and y are also the center coordinates of the pupil ellipse, h is also the major-axis length of the pupil ellipse, w is also the minor-axis length of the pupil ellipse, and θ is also the rotation angle of the pupil ellipse;
wherein formulas (3)-(9) involve two intermediate variables and the five regression offsets: the offsets of the predicted ellipse center x- and y-coordinates relative to the anchor center, the offset of the predicted major axis relative to the anchor major axis, the offset of the predicted minor axis relative to the anchor minor axis, and the offset of the predicted rotation angle relative to the anchor rotation angle;
(4): estimating the pupil occlusion degree of the T-th frame image by semantic segmentation, and taking the pupil of the T-th frame image as a new template if it is unoccluded;
(5): judging whether the T-th frame image is the last frame image; if so, ending; if not, incrementing T by 1 and returning to (1).
2. The eye movement tracking method based on rotation target detection according to claim 1, wherein performing feature extraction and feature fusion on the T-th frame image to obtain the image features of the T-th frame image specifically comprises: extracting features of the T-th frame image with a feature extraction network; and using a feature pyramid network to fuse the features of each stage extracted by the feature extraction network with the features of the next stage.
3. The eye movement tracking method based on rotation target detection according to claim 1, wherein the pupil occlusion degree is obtained by using a trained semantic segmentation network to identify the number of pupil pixels in the image, and the pupil occlusion degree is calculated as

P = 1 - N_vis / N_full ; (1)

wherein N_full denotes the number of pupil pixels in the complete pupil region when the pupil is unoccluded, and N_vis denotes the number of pupil pixels identified by the semantic segmentation network in the unoccluded pupil region when the pupil is partially occluded.
4. The eye movement tracking method based on rotation target detection according to claim 3, wherein the training process of the trained semantic segmentation network comprises:
constructing a second training set comprising eye images with known pupil positions and shapes;
and inputting the second training set into the semantic segmentation network and training it, to obtain the trained semantic segmentation network.
5. A method of eye movement tracking based on rotational object detection as claimed in claim 1, wherein the trained eye movement tracking model comprises:
the input end of the multi-scale space feature extraction fusion module is used for inputting an eye image sequence, and the multi-scale space feature extraction fusion module performs feature extraction and feature fusion processing on the eye image to obtain primary fusion features;
a judging module, whose input is the pupil occlusion degree of frame T-1 and which compares it with a first set threshold and a second set threshold: if the pupil occlusion degree of frame T-1 is smaller than the first set threshold, the primary fusion features of the T-th frame image are input to the pupil positioning module; if it is not smaller than the first set threshold but is smaller than the second set threshold, the primary fusion features of the T-th frame image and the template features are input to a time-domain feature fusion module, which fuses them and inputs the fused result to the pupil positioning module; if it is larger than the second set threshold, the T-th frame image is considered an invalid image;
the input end of the judging module is also connected with the output end of the pupil occlusion degree estimation module;
the pupil occlusion degree estimation module is used for estimating the pupil occlusion degree in the image;
the pupil positioning module is used for determining the position of the pupil;
and the time-domain feature fusion module is used for fusing features in the time domain.
6. A method of eye movement tracking based on rotational object detection as claimed in claim 1, wherein the training process of the trained eye movement tracking model comprises:
constructing a first training set, wherein the first training set is an eye image sequence with known pupil positions and shapes;
and inputting the first training set into the eye movement tracking model, and stopping training when the total loss function value of the eye movement tracking model is not reduced, or the iteration number exceeds the set number, so as to obtain the trained eye movement tracking model.
7. An eye movement tracking system based on rotational target detection, comprising:
an acquisition module configured to: acquire an eye image sequence;
a tracking module configured to: input each frame of image in the eye image sequence into a trained eye movement tracking model, which outputs the pupil positioning result of each frame; wherein the tracking module includes:
a feature extraction fusion unit configured to: perform feature extraction and feature fusion on the T-th frame image to obtain the image features of the T-th frame image; T is a positive integer greater than or equal to 1;
a judgment unit configured to: judge whether the current T-th frame image is the first frame image of the eye image sequence, and if so, enter the pupil positioning unit; if it is not the first frame image, judge whether the pupil occlusion degree of frame T-1 is smaller than a first set threshold, and if it is smaller than the first set threshold, enter the pupil positioning unit; if it is not smaller than the first set threshold but is smaller than the second set threshold, carry out feature fusion in the time domain between the T-th frame image and the template, and enter the pupil positioning unit; if it is larger than the second set threshold, enter the occlusion degree estimation unit;
the feature fusion in the time domain of the T-th frame image with the template comprises:

F_fused^k = F_T^k * F_tmpl^k ; (2)

wherein F_fused is the fused feature map, F_T is the feature of the T-th frame image, F_tmpl is the template feature, k denotes the k-th channel of the feature map, and * is the convolution operation; the template is allowed to be updated, and the initial template is a set unoccluded pupil image;
a pupil positioning unit configured to: carry out pupil positioning on the T-th frame image by adopting a rotating target detection model;
wherein pupil positioning on the T-th frame image with the rotating target detection model comprises:
the rotating target detection model includes a classification subnet and a regression subnet in parallel, both realized by convolutional neural networks;
the classification subnet is used for judging whether a prior anchor frame contains the pupil; its input is the feature of the T-th frame image, and its output is the confidence that the anchor frame contains the pupil;
the regression subnet is used for predicting the offsets between the prior anchor frame and the rotated rectangular frame corresponding to the pupil; its input is the feature of the T-th frame image, and its output is the offsets between the ellipse corresponding to the pupil and the anchor frame;
after the classification subnet obtains the anchor frame with the highest confidence, the parameters of that anchor frame are taken as the anchor center x-coordinate x_a, the anchor center y-coordinate y_a, the anchor width w_a, the anchor length h_a and the anchor rotation angle θ_a; from the offsets output by the regression subnet, the predicted rotated rectangular frame corresponding to the pupil is finally obtained according to formulas (3), (4), (5), (6), (7), (8) and (9): its center x-coordinate x, center y-coordinate y, length h, width w and rotation angle θ, wherein x and y are also the center coordinates of the pupil ellipse, h is also the major-axis length of the pupil ellipse, w is also the minor-axis length of the pupil ellipse, and θ is also the rotation angle of the pupil ellipse;
wherein formulas (3)-(9) involve two intermediate variables and the five regression offsets: the offsets of the predicted ellipse center x- and y-coordinates relative to the anchor center, the offset of the predicted major axis relative to the anchor major axis, the offset of the predicted minor axis relative to the anchor minor axis, and the offset of the predicted rotation angle relative to the anchor rotation angle;
an occlusion degree estimation unit configured to: estimate the pupil occlusion degree of the T-th frame image by semantic segmentation, and take the pupil of the T-th frame image as a new template if it is unoccluded;
a re-judgment unit configured to: judge whether the T-th frame image is the last frame image; if so, end; if not, increment T by 1 and return to the feature extraction fusion unit.
CN202410008039.1A 2024-01-04 2024-01-04 Eyeball motion tracking method and system based on rotation target detection Active CN117523650B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410008039.1A CN117523650B (en) 2024-01-04 2024-01-04 Eyeball motion tracking method and system based on rotation target detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410008039.1A CN117523650B (en) 2024-01-04 2024-01-04 Eyeball motion tracking method and system based on rotation target detection

Publications (2)

Publication Number Publication Date
CN117523650A CN117523650A (en) 2024-02-06
CN117523650B true CN117523650B (en) 2024-04-02

Family

ID=89766788

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410008039.1A Active CN117523650B (en) 2024-01-04 2024-01-04 Eyeball motion tracking method and system based on rotation target detection

Country Status (1)

Country Link
CN (1) CN117523650B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006167256A (en) * 2004-12-17 2006-06-29 National Univ Corp Shizuoka Univ Pupil detecting apparatus
CN109857254A (en) * 2019-01-31 2019-06-07 京东方科技集团股份有限公司 Pupil positioning method and device, VR/AR equipment and computer-readable medium
CN110659674A (en) * 2019-09-05 2020-01-07 东南大学 Lie detection method based on sight tracking
CN113688733A (en) * 2021-08-25 2021-11-23 深圳龙岗智能视听研究院 Eye detection and tracking method, system, equipment and application based on event camera
CN113971834A (en) * 2021-10-23 2022-01-25 郑州大学 Eyeball tracking method and system based on virtual reality
WO2023001063A1 (en) * 2021-07-19 2023-01-26 北京鹰瞳科技发展股份有限公司 Target detection method and apparatus, electronic device, and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2006167256A (en) * 2004-12-17 2006-06-29 National Univ Corp Shizuoka Univ Pupil detecting apparatus
CN109857254A (en) * 2019-01-31 2019-06-07 京东方科技集团股份有限公司 Pupil positioning method and device, VR/AR equipment and computer-readable medium
CN110659674A (en) * 2019-09-05 2020-01-07 东南大学 Lie detection method based on sight tracking
WO2023001063A1 (en) * 2021-07-19 2023-01-26 北京鹰瞳科技发展股份有限公司 Target detection method and apparatus, electronic device, and storage medium
CN113688733A (en) * 2021-08-25 2021-11-23 深圳龙岗智能视听研究院 Eye detection and tracking method, system, equipment and application based on event camera
CN113971834A (en) * 2021-10-23 2022-01-25 郑州大学 Eyeball tracking method and system based on virtual reality

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research progress on human eye detection technology (人眼检测技术研究进展); 常胜江; 孟春宁; 韩建民; 林淑玲; 数据采集与处理; 2015-11-15 (06); full text *
Eye movement position recognition based on eye tracking (基于眼动追踪的眼动位置识别); 隋秀娟; 薛雷; 许翠单; 工业控制计算机; 2020-05-25 (05); full text *

Also Published As

Publication number Publication date
CN117523650A (en) 2024-02-06

Similar Documents

Publication Publication Date Title
CN107767405B (en) Nuclear correlation filtering target tracking method fusing convolutional neural network
CN106778664B (en) Iris image iris area segmentation method and device
WO2018188453A1 (en) Method for determining human face area, storage medium, and computer device
US11403874B2 (en) Virtual avatar generation method and apparatus for generating virtual avatar including user selected face property, and storage medium
CN108062525B (en) Deep learning hand detection method based on hand region prediction
Wang et al. Blink detection using Adaboost and contour circle for fatigue recognition
WO2021179471A1 (en) Face blur detection method and apparatus, computer device and storage medium
CN106650574A (en) Face identification method based on PCANet
CN111158491A (en) Gesture recognition man-machine interaction method applied to vehicle-mounted HUD
CN112101208A (en) Feature series fusion gesture recognition method and device for elderly people
CN112613579A (en) Model training method and evaluation method for human face or human head image quality and selection method for high-quality image
CN111158457A (en) Vehicle-mounted HUD (head Up display) human-computer interaction system based on gesture recognition
CN114549557A (en) Portrait segmentation network training method, device, equipment and medium
Saif et al. Robust drowsiness detection for vehicle driver using deep convolutional neural network
Wan et al. Robust and accurate pupil detection for head-mounted eye tracking
Kang et al. Real-time eye tracking for bare and sunglasses-wearing faces for augmented reality 3D head-up displays
CN112767440B (en) Target tracking method based on SIAM-FC network
CN117523650B (en) Eyeball motion tracking method and system based on rotation target detection
CN111898454A (en) Weight binarization neural network and transfer learning human eye state detection method and device
Yamamoto et al. Algorithm optimizations for low-complexity eye tracking
CN116403150A (en) Mask detection algorithm based on C3-CBAM (C3-CBAM) attention mechanism
CN115661894A (en) Face image quality filtering method
CN111898473B (en) Driver state real-time monitoring method based on deep learning
CN110675416B (en) Pupil center detection method based on abstract contour analysis
CN104102896B (en) A kind of method for recognizing human eye state that model is cut based on figure

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant