CN114120432A - Online learning attention tracking method based on sight estimation and application thereof

Online learning attention tracking method based on sight estimation and application thereof

Info

Publication number
CN114120432A
Authority
CN
China
Prior art keywords
image
dimensional
detected
learning
camera
Prior art date
Legal status
Pending
Application number
CN202111361427.0A
Other languages
Chinese (zh)
Inventor
刘婷婷
杨兵
刘海
张昭理
赵万里
安庆
黄正华
陈胜勇
李友福
Current Assignee
Hubei University
Central China Normal University
Original Assignee
Hubei University
Central China Normal University
Priority date
Filing date
Publication date
Application filed by Hubei University and Central China Normal University
Priority to CN202111361427.0A
Publication of CN114120432A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features

Abstract

The application discloses an online learning attention tracking method based on sight line estimation and an application thereof. The method comprises: acquiring a face image, an eye image and a binocular infrared image of an object to be detected, together with a scene image of the learning environment in which the object is located, the scene image containing an image of the learning device with which the object interacts; inputting the face image, the eye image and the binocular infrared image into a sight line estimation and recognition model to obtain the three-dimensional gazing directions of the object's two eyes in the camera coordinate system; converting the three-dimensional gazing direction in the camera coordinate system into a two-dimensional gaze point in the screen coordinate system of the learning device; and generating a current attention detection result for the object according to the positional relation between the two-dimensional gaze point and the learning region in the scene image. The invention makes the head posture and the binocular features complement one another, improves the accuracy of gaze estimation against complex backgrounds, and provides objective supporting data for improving students' online learning attention.

Description

Online learning attention tracking method based on sight estimation and application thereof
Technical Field
The present application relates to the field of intelligent human-computer interaction technologies, and in particular, to an online learning attention tracking method based on gaze estimation, a computer device, and a readable medium.
Background
With the rapid development of education informatization, artificial intelligence technology has been applied more and more widely in teaching. In recent years, the COVID-19 outbreak disrupted the traditional offline teaching mode, and online remote teaching was adopted on a large scale almost overnight. However, without close-range supervision by a teacher, how to realize remote or automatic supervision so as to guarantee the learning state and efficiency of learners studying independently has become a problem to be solved urgently. A method for tracking attention during online learning can assist learners in their online study and is therefore of great significance for improving their learning efficiency.
Detection and tracking of learning attention can begin with estimating the angle of the learner's gaze direction during learning. Sight line (gaze) estimation in the broad sense covers research related to the eyeball, eye movement, the line of sight and so on. Generally, gaze estimation methods fall into two broad categories: model-based and appearance-based. The basic idea of model-based methods is to estimate the gaze direction from features such as corneal reflections, combined with prior knowledge of the 3D eyeball. Appearance-based methods directly extract visual features of the eyes and train a regression model that maps appearance to gaze direction. Comparative analysis shows that model-based methods achieve higher accuracy but place higher demands on image quality and resolution; they typically require dedicated hardware and strongly restrict the mobility of the user's head posture. Appearance-based methods perform better on low-resolution, high-noise images, but training their models requires large amounts of data and is prone to over-fitting. With the development of deep learning and the release of large public data sets, appearance-based methods are receiving increasing attention.
Zhang et al., in the paper "Appearance-based gaze estimation in the wild", were among the first to use a neural network for gaze estimation: taking a monocular eye image as input, they learn with a shallow LeNet-style network, concatenate the head pose information with the extracted eye features, and obtain a 6.3° error on the MPIIGaze data set. Later, in "It's Written All Over Your Face: Full-Face Appearance-Based Gaze Estimation", Zhang et al. proposed a full-face gaze estimation method based on an attention mechanism, which increases the weight of the eye region, suppresses the weight of regions irrelevant to gaze, and adopts an end-to-end learning strategy to obtain the gaze direction directly in the camera coordinate system. Park et al., in "Deep Pictorial Gaze Estimation", proposed a gaze estimation method based on a pictorial eye representation, abstracting the eye into a pictorial eyeball representation through a deep network to improve accuracy, where the pictorial representation is generated by geometric extrapolation from the ground-truth gaze. In the same year, Yu et al., in "Deep Multitask Gaze Estimation with a Constrained Landmark-Gaze Model", proposed a gaze estimation method based on a constrained landmark-gaze model, following the idea of multi-task learning.
At present, although research on gaze estimation has advanced considerably, the accuracy obtained from a generic model and single-eye features alone is limited. In addition, large head movements strongly affect the measurement, reducing recognition accuracy and thereby limiting the accuracy of online learning attention tracking.
Disclosure of Invention
In view of at least one defect or improvement requirement in the prior art, the present invention provides an online learning attention tracking method based on sight line estimation and an application thereof, which realize the complementation of information from different modalities, with the aim of improving the accuracy of gaze estimation against complex backgrounds and thereby the reliability of online learning attention judgment.
To achieve the above object, according to a first aspect of the present invention, there is provided an online learning attention tracking method based on gaze estimation, comprising:
acquiring a face image, an eye image and a binocular infrared image of an object to be detected, and a scene image of a learning environment where the object to be detected is located; the scene image is acquired by a scene camera arranged at the head of the object to be detected, and the scene image comprises an image of learning equipment interacting with the object to be detected;
inputting the facial image, the eye image and the binocular infrared image into a trained sight line estimation and recognition model to obtain three-dimensional gazing directions of the two eyes of the object to be detected under a camera coordinate system;
converting the three-dimensional gazing direction under the camera coordinate system into a two-dimensional gazing point under a screen coordinate system where the learning equipment is located;
and generating a current attention detection result of the object to be detected according to the position relation between the two-dimensional gaze point and a pre-divided learning region in the scene image.
Preferably, in the above method for tracking attention during online learning, the inputting the facial image, the eye image and the binocular infrared image into a trained sight line estimation and recognition model to obtain three-dimensional gazing directions of both eyes of the object to be detected in a camera coordinate system specifically includes:
extracting the features of the facial image to generate a corresponding head posture feature vector;
extracting the features of the eye image to generate corresponding eye feature vectors;
extracting the characteristics of the eyes of the binocular infrared image to generate corresponding binocular infrared characteristic vectors;
and splicing the head posture characteristic vector, the eye characteristic vector and the binocular infrared characteristic vector to obtain a fusion characteristic vector, and generating the binocular three-dimensional gazing directions of the object to be detected in the camera coordinate system according to the fusion characteristic vector.
Preferably, in the above online learning attention tracking method, the converting the three-dimensional gaze direction in the camera coordinate system into a two-dimensional gaze point in the screen coordinate system where the learning device is located includes:
calculating a three-dimensional target fixation point of the two eyes of the object to be detected in a camera coordinate system according to the three-dimensional fixation direction and the three-dimensional fixation origin; the three-dimensional fixation origin point is the center of the face or the center of an eyeball of the object to be detected; the calculation formula is as follows:
g = (t - o) / ||t - o||
wherein g represents the three-dimensional gaze direction in the camera coordinate system, g = (gx, gy, gz), with gx, gy, gz the components of the gaze direction along the three axes; o represents the three-dimensional gaze origin; t represents the three-dimensional target fixation point under the camera coordinate system;
acquiring a space conversion matrix between a camera coordinate system and a screen coordinate system, and generating a two-dimensional fixation point of two eyes of an object to be detected under the screen coordinate system according to the space conversion matrix and a three-dimensional target fixation point; the calculation formula is as follows:
t = Rs·[u, v, 0]^T + Ts
wherein {Rs, Ts} represents the spatial transformation matrix, Rs being a rotation matrix and Ts a translation matrix; (u, v) represents the two-dimensional gaze point under the screen coordinate system; [·]^T denotes the matrix transpose.
Preferably, in the above online learning attention tracking method, the training process of the sight-line estimation recognition model includes:
obtaining a training sample, wherein the training sample has a real three-dimensional gazing direction label;
inputting the training sample and the three-dimensional gaze direction label corresponding to the training sample into a gaze estimation and recognition model, and outputting a predicted gaze direction corresponding to the training sample through the gaze estimation and recognition model to be trained;
and calculating a loss function according to the predicted sight direction and the three-dimensional watching direction label, and reversely adjusting the model parameters of the sight estimation and recognition model to be trained until the loss function is minimized to obtain the trained sight estimation and recognition model.
Preferably, in the above online learning attention tracking method, the loss function is:
L = (1/|D|) · Σ_{I∈D} ||g_p(I) - g_gt(I)||
wherein L represents the loss function; g_gt(I) represents the three-dimensional gaze direction label; g_p(I) represents the predicted gaze direction; I represents a training sample; D represents the training data set; and |·| is the cardinality operator.
Preferably, the online learning attention tracking method further includes:
in the training process of the sight estimation and identification model, a cosine annealing restart learning rate mechanism is used for dynamically adjusting the learning rate of the model.
According to a second aspect of the present invention, there is also provided an online learning attention tracking system based on gaze estimation, comprising:
the image acquisition unit is used for acquiring a face image, an eye image and a binocular infrared image of an object to be detected and a scene image of a learning environment in which the object to be detected is located; the scene image is acquired by a scene camera arranged at the head of the object to be detected, and the scene image comprises an image of learning equipment interacting with the object to be detected;
the detection unit is used for acquiring a face image, an eye image and a binocular infrared image of the object to be detected, inputting the face image, the eye image and the binocular infrared image into the trained sight line estimation and recognition model, and acquiring three-dimensional gazing directions of two eyes of the object to be detected under a camera coordinate system;
the conversion unit is used for converting the three-dimensional gazing direction under the camera coordinate system into a two-dimensional gazing point under a screen coordinate system where the learning equipment is located;
and the result output unit is used for generating a current attention detection result of the object to be detected according to the position relation between the two-dimensional gaze point and a pre-divided learning area in the scene image.
Preferably, in the above online learning attention tracking system, the image acquisition unit further includes an infrared miniature camera and a foreground camera;
the infrared miniature camera is used for collecting double-eye infrared images of an object to be detected in the learning process and comprises a left-eye infrared camera and a right-eye infrared camera;
the left eye infrared camera, the right eye infrared camera and the scene camera are integrated on the head-mounted equipment, and the left eye infrared camera and the right eye infrared camera are distributed on two sides of the scene camera;
the foreground camera is used for collecting a face image of an object to be detected in a learning process.
According to a third aspect of the present invention, there is also provided a computer device comprising at least one processing unit, and at least one memory unit, wherein the memory unit stores a computer program which, when executed by the processing unit, causes the processing unit to perform the steps of any of the above-mentioned online learning attention tracking methods.
According to a fourth aspect of the present invention, there is also provided a computer readable medium storing a computer program executable by a computer device, the computer program, when run on the computer device, causing the computer device to perform the steps of any of the above-mentioned online learning attention tracking methods.
In general, compared with the prior art, the above technical solution contemplated by the present invention can achieve the following beneficial effects:
(1) In the online learning attention tracking method based on sight line estimation provided by the invention, the face image, eye image and binocular infrared image of the object to be detected are collected and subjected to feature extraction to obtain the head posture feature, the eye feature and the binocular infrared feature vectors; features from the three different modalities are fused, and the three-dimensional gazing directions of the object's two eyes in the camera coordinate system are computed from the fused multi-modal feature. After concatenation and fusion, the information from the three modalities complements one another, greatly improving the accuracy of gaze estimation. After a coordinate-system conversion, the gaze estimation result is applied to online learning attention detection, which is of great significance for improving learning quality and assisting teaching.
(2) The left-eye infrared camera, the right-eye infrared camera and the scene camera are integrated on the head-mounted device, with the left-eye and right-eye infrared cameras distributed on the two sides of the scene camera. Collecting the eye images, binocular infrared images and scene images with this head-mounted device means the eye images can be obtained without being affected by illumination changes, the relative position of the person and the device, interference, occlusion and the like.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the embodiments will be briefly described below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings without creative efforts.
Fig. 1 is a schematic flowchart of an online learning attention tracking method based on gaze estimation according to this embodiment;
fig. 2 is a schematic network structure diagram of a sight line estimation recognition model provided in this embodiment;
fig. 3 is a schematic diagram of a coordinate system conversion process of the sight line direction provided in this embodiment;
fig. 4 is a schematic composition diagram of an image capturing unit provided in this embodiment;
fig. 5 is a schematic composition diagram of the computer device provided in this embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application.
The terms "first," "second," "third," and the like in the description and claims of this application and in the above-described drawings are used for distinguishing between different objects and not for describing a particular order. Furthermore, the terms "include" and "have," as well as any variations thereof, are intended to cover non-exclusive inclusions. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those steps or elements listed, but may alternatively include other steps or elements not listed, or inherent to such process, method, article, or apparatus.
Fig. 1 is a schematic flow chart of an online learning attention tracking method based on gaze estimation according to this embodiment, and as shown in fig. 1, the method mainly includes the following steps:
s1, acquiring a face image, an eye image and a binocular infrared image of the object to be detected and a scene image of the learning environment in which the object to be detected is located; the scene image is acquired by a scene camera arranged at the head of the object to be detected, and the scene image comprises an image of learning equipment interacting with the object to be detected;
In this embodiment, in the online learning scenario, an RGB face image, RGB eye images, and left- and right-eye infrared images of the object to be detected are first collected during the learning process. The RGB eye images may be obtained by cropping the RGB face image, or by separately capturing the eye region of the object to be detected. The RGB face image, RGB eye images and left- and right-eye infrared images are then preprocessed, for example by scaling: the RGB face image is converted into a 224 × 224 grayscale image, while the RGB eye images and the left- and right-eye infrared images are all converted into 36 × 60 grayscale images.
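For illustration only, this preprocessing step can be sketched as follows. OpenCV/NumPy and the function names are assumptions of this description rather than part of the embodiment; only the target sizes (224 × 224 face, 36 × 60 eye patches) come from the text above.

```python
# Minimal preprocessing sketch for the inputs described above (assumption: OpenCV).
import cv2

def to_gray(img):
    """Accept either a colour frame or an already single-channel (e.g. infrared) frame."""
    return cv2.cvtColor(img, cv2.COLOR_BGR2GRAY) if img.ndim == 3 else img

def preprocess_face(face_img):
    """Face image -> 224x224 grayscale network input."""
    return cv2.resize(to_gray(face_img), (224, 224))

def preprocess_eye(eye_img):
    """RGB or infrared eye patch -> 36x60 grayscale input.
    cv2.resize expects (width, height), so 36 rows x 60 columns becomes (60, 36)."""
    return cv2.resize(to_gray(eye_img), (60, 36))
```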
The scene camera is mainly used for collecting the learning scene of the object to be detected, and in the embodiment, the scene camera is arranged at the head of the object to be detected, so that the collected scene image contains the image of the learning equipment interacting with the object to be detected.
S2, inputting the face image, the eye image and the binocular infrared image into the trained sight line estimation and recognition model to obtain the three-dimensional gazing directions of the two eyes of the object to be detected under the camera coordinate system;
fig. 2 is a schematic network structure diagram of a gaze estimation recognition model provided in this embodiment, and as shown in fig. 2, the gaze estimation recognition model includes a first network branch, a second network branch, a third network branch, and a multi-layer perceptual feature fusion network;
the first network branch is a head posture information extraction network and is used for extracting features of RGB facial images and generating corresponding head posture feature vectors;
the RGB face image is used to extract three feature points, which are a nose and an eye, respectively, and form a triangular region, and the image features extracted from this region are used to form a head feature vector representing the head posture.
The second network branch is a left-eye and right-eye feature extraction network and is used for performing feature extraction on the RGB eye images and generating corresponding eye feature vectors;
the third network branch is a left-eye and right-eye infrared feature extraction network and is used for extracting the features of the eyes of the binocular infrared image and generating corresponding binocular infrared feature vectors;
the eye image is divided into images under infrared light and visible light, the characteristics of the pupil and the canthus are focused in the extracted eye characteristic vectors, and the characteristic vectors with the characteristics of the pupil and the canthus fused are formed through the extraction of a neural network.
The multi-layer perception feature fusion network is used to concatenate the information from the three modalities (the head posture feature vector, the eye feature vector and the binocular infrared feature vector) into a fused feature vector, and to generate from this fused feature vector the three-dimensional gazing directions of the two eyes of the object to be detected under the camera coordinate system.
Before the head posture feature vector, the eye feature vector and the binocular infrared feature vector are concatenated, each feature vector is normalized so that its values lie in a uniform bounded range, namely between 0 and 1. The network is then trained to minimize the Euclidean distance between its predicted value and the true value. In this way the network's prediction approximates the gaze direction, and the three-dimensional gaze direction is converted by a coordinate transformation into a two-dimensional coordinate vector on the screen.
In one specific example, the first network branch, the second network branch and the third network branch each comprise a 3D convolutional neural network and a fully connected layer. The 3D convolutional neural network mainly consists of convolutional layers and pooling layers. The convolutional layers extract image features through convolution: each convolutional layer is composed of several two-dimensional planes, and each plane convolves the previous layer with a different convolution kernel to form a feature map, so the input features are propagated layer by layer. The pooling layer, also known as a downsampling layer, is usually placed after a convolutional layer to downsample local regions of the feature response map: by aggregating the average or maximum of the features at different positions, the dimensionality of the input image is reduced, which lowers the amount of computation while avoiding over-fitting. The fully connected layer connects the feature maps extracted by the convolutional and pooling layers so that every input neuron is connected to every output neuron, expressing global features to the greatest extent; the extracted feature representation is then mapped to the sample label space to perform the regression task.
Referring to fig. 2, the RGB face image, the RGB eye images and the binocular infrared images are input to the sight line estimation and recognition model and enter, respectively, the head posture information extraction network, the left- and right-eye feature extraction network, and the left- and right-eye infrared feature extraction network. The left- and right-eye infrared feature extraction network, which has a 50-layer structure, performs feature extraction on the binocular infrared images: an eye image I of size 3 × 36 × 60 is input, a multi-dimensional feature array of size 2048 × 7 is output after the 50 convolutional blocks, and a 1 × 64 binocular infrared feature vector is output after one fully connected layer. Similarly, the head posture information extraction network and the left- and right-eye RGB feature extraction network adopt the same architecture and output a 1 × 128 head posture feature vector and a 1 × 64 eye feature vector respectively. The head posture feature vector, the eye feature vector and the binocular infrared feature vector are concatenated with the concat function of the PyTorch framework, giving a 1 × 256 fused feature vector. The fused feature vector then passes through 2 consecutive fully connected layers, and the three-dimensional gaze direction is obtained by training a linear regression.
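A minimal PyTorch sketch of this three-branch architecture is given below for illustration. Using a torchvision ResNet-50 as the 50-layer backbone, the input channel counts and all class and function names are assumptions made for readability; the embodiment itself fixes only the 50-layer branches, the 1 × 128 / 1 × 64 / 1 × 64 feature sizes, concat fusion into a 1 × 256 vector and the two final fully connected layers.

```python
# Illustrative sketch of the three-branch sight line estimation network (assumptions noted above).
import torch
import torch.nn as nn
from torchvision.models import resnet50

def make_branch(in_channels, out_dim):
    """One branch: a 50-layer convolutional backbone followed by a fully connected layer."""
    net = resnet50(weights=None)
    net.conv1 = nn.Conv2d(in_channels, 64, kernel_size=7, stride=2, padding=3, bias=False)
    net.fc = nn.Linear(net.fc.in_features, out_dim)  # 2048-d pooled features -> out_dim
    return net

class GazeEstimationNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.head_branch = make_branch(in_channels=1, out_dim=128)  # grayscale face -> 1x128
        self.eye_branch = make_branch(in_channels=2, out_dim=64)    # stacked L/R eye patches -> 1x64
        self.ir_branch = make_branch(in_channels=2, out_dim=64)     # stacked L/R infrared patches -> 1x64
        self.fusion = nn.Sequential(                                # two consecutive fully connected layers
            nn.Linear(128 + 64 + 64, 128),
            nn.ReLU(inplace=True),
            nn.Linear(128, 3),                                      # 3D gaze direction (gx, gy, gz)
        )

    def forward(self, face, eyes_rgb, eyes_ir):
        fused = torch.cat(                                          # 1x256 fused feature vector
            [self.head_branch(face), self.eye_branch(eyes_rgb), self.ir_branch(eyes_ir)], dim=1)
        return self.fusion(fused)
```

The forward pass returns a 3-vector that is interpreted as the gaze direction g in camera coordinates.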
In this embodiment, the training process of the sight line estimation recognition model includes:
acquiring a training set, wherein the training set comprises a plurality of training samples, and each training sample is provided with a real three-dimensional gazing direction label;
inputting the training sample and the three-dimensional gaze direction label corresponding to the training sample into the gaze estimation and recognition model, and outputting the predicted gaze direction corresponding to the training sample through the gaze estimation and recognition model to be trained;
calculating a loss function according to the predicted sight line direction and the three-dimensional watching direction label, and reversely adjusting the model parameters of the sight line estimation and recognition model to be trained until the loss function is minimized to obtain the trained sight line estimation and recognition model.
As a specific example, the loss function is expressed as:
L = (1/|D|) · Σ_{I∈D} ||g_p(I) - g_gt(I)||
wherein L represents the loss function; g_gt(I) represents the three-dimensional gaze direction label; g_p(I) represents the predicted gaze direction; I represents a training sample; D represents the training data set; and |·| is the cardinality operator.
As a preferred example, during the training of the sight line estimation and recognition model, the learning rate is adjusted dynamically with a cosine annealing restart mechanism. Specifically, an SGD optimizer is used for gradient optimization, the initial learning rate is set to 0.01 and the batch size to 128, and the cosine annealing restart mechanism makes the network training more stable. Concretely: with cawb_steps set to 50, the initial learning rate is multiplied by a scaling factor step_scale at each restart. By adjusting parameters such as step_scale and epoch_scale, the learning rate can be made to jump up or down at a restart, and the intermediate iteration steps can be adjusted so that an annealing cycle need not finish, keeping a higher learning rate and realizing a more complex learning-rate schedule.
Further, after the sight line estimation and recognition model is trained on the training set, it is fine-tuned on a validation sample set and then evaluated on a test sample set; in one specific example, the learning rate during fine-tuning and testing can be set to 1e-6.
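A simplified training loop consistent with the hyper-parameters above (SGD, initial learning rate 0.01, batch size 128, Euclidean-distance loss, cosine annealing with restarts) might look as follows. PyTorch's CosineAnnealingWarmRestarts scheduler is used here as an approximation of the cawb_steps / step_scale / epoch_scale mechanism, and the dataset wiring is an assumption of this description.

```python
# Simplified training loop sketch (assumptions: dataset layout, epoch count, scheduler choice).
import torch
from torch.utils.data import DataLoader

def train(model, train_set, epochs=100, device="cuda"):
    loader = DataLoader(train_set, batch_size=128, shuffle=True)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    scheduler = torch.optim.lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=50)
    model.to(device).train()
    for _ in range(epochs):
        for face, eyes_rgb, eyes_ir, gaze_gt in loader:
            face, eyes_rgb, eyes_ir, gaze_gt = (
                x.to(device) for x in (face, eyes_rgb, eyes_ir, gaze_gt))
            gaze_pred = model(face, eyes_rgb, eyes_ir)
            # L = (1/|D|) * sum_I ||g_p(I) - g_gt(I)||   (Euclidean-distance loss)
            loss = torch.linalg.norm(gaze_pred - gaze_gt, dim=1).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        scheduler.step()  # one annealing step per epoch; restarts every T_0 epochs
    return model
```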
S3, converting the three-dimensional gazing direction in the camera coordinate system into a two-dimensional gazing point in a screen coordinate system where the learning equipment is located;
in the embodiment, after the three-dimensional gazing directions of the two eyes of the object to be detected output by the model under the camera coordinate system are obtained, the three-dimensional gazing directions under the camera coordinate system need to be projected to a screen coordinate system, and a two-dimensional gazing point of the three-dimensional gazing directions on a screen of the learning equipment is obtained; specifically, the method comprises the following steps:
referring to fig. 3, first, a three-dimensional target fixation point of two eyes of an object to be detected in a camera coordinate system is calculated according to a three-dimensional fixation direction and a three-dimensional fixation origin; the three-dimensional fixation origin point can be the face center or eyeball center of the object to be detected; the calculation formula is as follows:
g = (t - o) / ||t - o||
wherein g represents the three-dimensional gaze direction in the camera coordinate system, g = (gx, gy, gz), with gx, gy, gz the components of the gaze direction along the three axes; t represents the three-dimensional target fixation point under the camera coordinate system; o denotes the three-dimensional gaze origin, which may be estimated by a keypoint detection algorithm or a stereo measurement method.
Then, acquiring a space transformation matrix between a camera coordinate system and a screen coordinate system, and generating a two-dimensional fixation point of the two eyes of the object to be detected under the screen coordinate system according to the space transformation matrix and the three-dimensional target fixation point; the calculation formula is as follows:
t = Rs·[u, v, 0]^T + Ts
wherein {Rs, Ts} represents the spatial transformation matrix between the camera and screen coordinate systems, Rs being a rotation matrix and Ts a translation matrix; (u, v) represents the two-dimensional gaze point under the screen coordinate system; [·]^T denotes the matrix transpose.
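One way to solve the two relations above for the two-dimensional gaze point is to intersect the gaze ray with the screen plane, as sketched below; the embodiment states the relations but not a solving procedure, so this NumPy sketch should be read as an assumption of this description.

```python
# Camera-to-screen conversion sketch: the gaze ray (origin o, direction g, both in camera
# coordinates) is intersected with the screen plane defined by the calibration {Rs, Ts},
# i.e. the set of points t = Rs @ [u, v, 0]^T + Ts.
import numpy as np

def gaze_point_on_screen(g, o, Rs, Ts):
    """Return the 2D gaze point (u, v) in screen coordinates."""
    g, o = np.asarray(g, float), np.asarray(o, float)
    Rs, Ts = np.asarray(Rs, float), np.asarray(Ts, float)
    o_s = Rs.T @ (o - Ts)   # ray origin in screen coordinates (Rs is a rotation, so Rs^-1 = Rs^T)
    g_s = Rs.T @ g          # ray direction in screen coordinates
    lam = -o_s[2] / g_s[2]  # intersect with the screen plane z = 0
    u, v = (o_s + lam * g_s)[:2]
    return u, v
```

The resulting (u, v) is then compared with the pre-divided learning region in step S4.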
And S4, generating a current attention detection result of the object to be detected according to the position relation between the two-dimensional gaze point and a pre-divided learning area in the scene image.
In this embodiment, it is determined whether the two-dimensional gaze points of the two eyes of the object to be detected fall within the pre-divided learning region in the scene image; typically the learning region is the screen area of the learning device. If they do, the object's current attention is focused; otherwise the object's attention is not focused, and corresponding intervention measures can be taken, such as issuing a voice warning, to bring the learner's concentration back and improve the learner's autonomy in online learning.
Furthermore, based on the above steps, the drop-point regions (i.e. the two-dimensional gaze points) of the object's two eyes at different moments are obtained and an attention data log is generated; data visualization and analysis techniques are then used to analyze and display the attention changes of the online learner in real time, as sketched below.
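For illustration, the attention decision and the attention data log can be sketched as follows, assuming the learning region is an axis-aligned rectangle in screen coordinates; the region bounds, timestamping and the warning hook are assumptions of this description, not prescribed by the embodiment.

```python
# Minimal sketch of the attention decision and attention data log described in S4.
import time

def attention_sample(u, v, region):
    """region = (u_min, v_min, u_max, v_max) of the pre-divided learning area."""
    u_min, v_min, u_max, v_max = region
    focused = (u_min <= u <= u_max) and (v_min <= v <= v_max)
    return {"timestamp": time.time(), "gaze": (u, v), "focused": focused}

def update_log(log, sample, warn=print):
    """Append one sample to the attention data log and trigger an intervention if needed."""
    log.append(sample)
    if not sample["focused"]:
        warn("Attention drift detected - issuing voice reminder")  # e.g. voice early warning
    return log
```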
It should be noted that although in the above-described embodiments, the operations of the methods of the embodiments of the present specification are described in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the steps depicted in the flowcharts may change the order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.
The embodiment also provides an online learning attention tracking system based on sight line estimation, which comprises an image acquisition unit, a detection unit, a conversion unit and a result output unit;
the image acquisition unit is mainly used for acquiring a face image, an eye image and a binocular infrared image of an object to be detected and a scene image of a learning environment in which the object to be detected is located;
fig. 4 is a schematic diagram of a composition of the image capturing unit provided in this embodiment, and referring to fig. 4, in this embodiment, the image capturing unit includes an infrared miniature camera, a scene camera, and a foreground camera;
the infrared micro camera is used for collecting double-eye infrared images of an object to be detected in the learning process and comprises a left-eye infrared camera and a right-eye infrared camera;
the scene camera is arranged at the head of the object to be detected and used for collecting the learning environment of the object to be detected, and the scene image comprises an image of learning equipment interacting with the object to be detected;
As a preferred example, the left-eye infrared camera, the right-eye infrared camera and the scene camera are integrated on the head-mounted device, with the left-eye and right-eye infrared cameras distributed on the two sides of the scene camera. More preferably, the scene camera is arranged on the centre line between the left-eye and right-eye infrared cameras; the positional relation between the scene camera and the two-dimensional gaze points of the object's two eyes is then computed from the scene images collected at this position, which gives better accuracy and reliability.
The foreground camera is used to collect the face image of the object to be detected during learning. It can be the built-in camera of the learning device (such as a tablet, computer or mobile phone), or an independent camera mounted on the learning device or placed in any other area from which the face image of the object to be detected can be collected.
The detection unit is used for acquiring a face image, an eye image and a binocular infrared image of the object to be detected and inputting the images into the trained sight estimation and recognition model to acquire three-dimensional gazing directions of the two eyes of the object to be detected in a camera coordinate system;
the conversion unit is used for converting the three-dimensional gazing direction under the camera coordinate system into a two-dimensional gazing point under a screen coordinate system where the learning equipment is located;
and the result output unit is used for generating a current attention detection result of the object to be detected according to the position relation between the two-dimensional gaze point and a pre-divided learning area in the scene image.
For the online learning attention tracking system, the network structure of the sight estimation recognition model, the training process thereof, and the specific function implementation of the conversion unit and the result output unit, which are provided by this embodiment, refer to the description of the online learning attention tracking method, and are not described herein again.
In the on-line learning attention tracking system, besides the image acquisition unit, the detection unit, the conversion unit and the result output unit can be realized in a software and/or hardware mode and can be integrated on computer equipment.
This embodiment further provides a computer device, see fig. 5, which comprises at least one processor and at least one memory, the memory storing a computer program which, when executed by the processor, causes the processor to perform the steps of the online learning attention tracking method; for the specific steps, refer to the foregoing embodiments, which are not repeated here. In this embodiment, the types of the processor and the memory are not particularly limited: for example, the processor may be a microprocessor, a digital signal processor, an on-chip programmable logic device or the like, and the memory may be volatile memory, non-volatile memory, a combination thereof, or the like.
Further, the computer device may communicate with one or more external devices (e.g. a keyboard, a pointing device, a display), with one or more terminals that enable a user to interact with the computer device, and/or with any device (e.g. a network card or modem) that enables the computer device to communicate with one or more other computing terminals. Such communication may take place through an input/output (I/O) interface, and the computer device may communicate with one or more networks (e.g. a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter.
It should be noted that the computer device may be an external monitoring device independent of the learning device, or it may be the learning device itself, in which case a computer program capable of executing the above online learning attention tracking method needs to be embedded in the learning device. The computer device preferably has a voice broadcasting function, so that a voice prompt can be given when the learner is detected to be inattentive.
This embodiment also provides a computer readable medium storing a computer program executable by a computer device; when run on the computer device, the program causes the computer device to perform the steps of the above online learning attention tracking method. Types of computer readable media include, but are not limited to, storage media such as SD cards, USB flash drives, fixed hard disks and removable hard disks.
It will be understood by those skilled in the art that the foregoing is only a preferred embodiment of the present invention, and is not intended to limit the invention, and that any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. An online learning attention tracking method based on sight line estimation is characterized by comprising the following steps:
acquiring a face image, an eye image and a binocular infrared image of an object to be detected, and a scene image of a learning environment where the object to be detected is located; the scene image is acquired by a scene camera arranged at the head of the object to be detected, and the scene image comprises an image of learning equipment interacting with the object to be detected;
inputting the facial image, the eye image and the binocular infrared image into a trained sight line estimation and recognition model to obtain three-dimensional gazing directions of the two eyes of the object to be detected under a camera coordinate system;
converting the three-dimensional gazing direction under the camera coordinate system into a two-dimensional gazing point under a screen coordinate system where the learning equipment is located;
and generating a current attention detection result of the object to be detected according to the position relation between the two-dimensional gaze point and a pre-divided learning region in the scene image.
2. The on-line learning attention tracking method according to claim 1, wherein the inputting of the facial image, the eye image and the binocular infrared image into a trained sight line estimation recognition model to obtain three-dimensional gazing directions of both eyes of the object to be detected in a camera coordinate system specifically comprises:
extracting the features of the facial image to generate a corresponding head posture feature vector;
extracting the features of the eye image to generate corresponding eye feature vectors;
extracting the characteristics of the eyes of the binocular infrared image to generate corresponding binocular infrared characteristic vectors;
and splicing the head posture characteristic vector, the eye characteristic vector and the binocular infrared characteristic vector to obtain a fusion characteristic vector, and generating the binocular three-dimensional gazing directions of the object to be detected in the camera coordinate system according to the fusion characteristic vector.
3. The method of on-line learning attention tracking according to claim 1, wherein said converting the three-dimensional gaze direction in a camera coordinate system to a two-dimensional gaze point in a screen coordinate system at which the learning device is located comprises:
calculating a three-dimensional target fixation point of the two eyes of the object to be detected in a camera coordinate system according to the three-dimensional fixation direction and the three-dimensional fixation origin; the three-dimensional fixation origin point is the center of the face or the center of an eyeball of the object to be detected; the calculation formula is as follows:
g = (t - o) / ||t - o||
wherein g represents the three-dimensional gaze direction in the camera coordinate system, g = (gx, gy, gz), with gx, gy, gz the components of the gaze direction along the three axes; o represents the three-dimensional gaze origin; t represents the three-dimensional target fixation point under the camera coordinate system;
acquiring a space conversion matrix between a camera coordinate system and a screen coordinate system, and generating a two-dimensional fixation point of two eyes of an object to be detected under the screen coordinate system according to the space conversion matrix and a three-dimensional target fixation point; the calculation formula is as follows:
t = Rs·[u, v, 0]^T + Ts
wherein {Rs, Ts} represents the spatial transformation matrix, Rs being a rotation matrix and Ts a translation matrix; (u, v) represents the two-dimensional gaze point under the screen coordinate system; [·]^T denotes the matrix transpose.
4. The on-line learning attention tracking method of claim 3, wherein the training process of the gaze estimation recognition model comprises:
obtaining a training sample, wherein the training sample has a real three-dimensional gazing direction label;
inputting the training sample and the three-dimensional gaze direction label corresponding to the training sample into a gaze estimation and recognition model, and outputting a predicted gaze direction corresponding to the training sample through the gaze estimation and recognition model to be trained;
and calculating a loss function according to the predicted sight direction and the three-dimensional watching direction label, and reversely adjusting the model parameters of the sight estimation and recognition model to be trained until the loss function is minimized to obtain the trained sight estimation and recognition model.
5. The online learning attention tracking method of claim 3, wherein the loss function is:
L = (1/|D|) · Σ_{I∈D} ||g_p(I) - g_gt(I)||
wherein L represents the loss function; g_gt(I) represents the three-dimensional gaze direction label; g_p(I) represents the predicted gaze direction; I represents a training sample; D represents the training data set; and |·| is the cardinality operator.
6. The online learning attention tracking method of claim 4, further comprising:
in the training process of the sight estimation and identification model, a cosine annealing restart learning rate mechanism is used for dynamically adjusting the learning rate of the model.
7. An online learning attention tracking system based on gaze estimation, comprising:
the image acquisition unit is used for acquiring a face image, an eye image and a binocular infrared image of an object to be detected and a scene image of a learning environment in which the object to be detected is located; the scene image is acquired by a scene camera arranged at the head of the object to be detected, and the scene image comprises an image of learning equipment interacting with the object to be detected;
the detection unit is used for acquiring a face image, an eye image and a binocular infrared image of the object to be detected, inputting the face image, the eye image and the binocular infrared image into the trained sight line estimation and recognition model, and acquiring three-dimensional gazing directions of two eyes of the object to be detected under a camera coordinate system;
the conversion unit is used for converting the three-dimensional gazing direction under the camera coordinate system into a two-dimensional gazing point under a screen coordinate system where the learning equipment is located;
and the result output unit is used for generating a current attention detection result of the object to be detected according to the position relation between the two-dimensional gaze point and a pre-divided learning area in the scene image.
8. The system for tracking attention for online learning of claim 7, wherein said image acquisition unit further comprises an infrared micro-camera and a foreground camera;
the infrared miniature camera is used for collecting double-eye infrared images of an object to be detected in the learning process and comprises a left-eye infrared camera and a right-eye infrared camera;
the left eye infrared camera, the right eye infrared camera and the scene camera are integrated on the head-mounted equipment, and the left eye infrared camera and the right eye infrared camera are distributed on two sides of the scene camera;
the foreground camera is used for collecting a face image of an object to be detected in a learning process.
9. A computer arrangement comprising at least one processing unit and at least one memory unit, wherein the memory unit stores a computer program that, when executed by the processing unit, causes the processing unit to carry out the steps of the method according to any one of claims 1 to 6.
10. A computer-readable medium, in which a computer program is stored which is executable by a computer device, and which, when run on the computer device, causes the computer device to carry out the steps of the method according to any one of claims 1 to 6.
CN202111361427.0A 2021-11-17 2021-11-17 Online learning attention tracking method based on sight estimation and application thereof Pending CN114120432A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111361427.0A CN114120432A (en) 2021-11-17 2021-11-17 Online learning attention tracking method based on sight estimation and application thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111361427.0A CN114120432A (en) 2021-11-17 2021-11-17 Online learning attention tracking method based on sight estimation and application thereof

Publications (1)

Publication Number Publication Date
CN114120432A true CN114120432A (en) 2022-03-01

Family

ID=80396073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111361427.0A Pending CN114120432A (en) 2021-11-17 2021-11-17 Online learning attention tracking method based on sight estimation and application thereof

Country Status (1)

Country Link
CN (1) CN114120432A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114967935A (en) * 2022-06-29 2022-08-30 深圳职业技术学院 Interaction method and device based on sight estimation, terminal equipment and storage medium
CN115120436A (en) * 2022-06-27 2022-09-30 广东技术师范大学 Eye-controlled intelligent medical bed and control method thereof
CN115482574A (en) * 2022-09-29 2022-12-16 珠海视熙科技有限公司 Screen fixation point estimation method, device, medium and equipment based on deep learning
CN115620385A (en) * 2022-11-07 2023-01-17 湖南苏科智能科技有限公司 Multivariate data-based security check worker attention detection method and system
CN115661913A (en) * 2022-08-19 2023-01-31 北京津发科技股份有限公司 Eye movement analysis method and system



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination