CN107392120B - Attention intelligent supervision method based on sight line estimation - Google Patents

Attention intelligent supervision method based on sight line estimation

Info

Publication number
CN107392120B
Authority
CN
China
Prior art keywords
area
user
watching
attention
learning
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710546644.4A
Other languages
Chinese (zh)
Other versions
CN107392120A (en)
Inventor
姬艳丽
胡玉晗
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201710546644.4A priority Critical patent/CN107392120B/en
Publication of CN107392120A publication Critical patent/CN107392120A/en
Application granted granted Critical
Publication of CN107392120B publication Critical patent/CN107392120B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 Eye characteristics, e.g. of the iris
    • G06V40/19 Sensors therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Ophthalmology & Optometry (AREA)
  • Human Computer Interaction (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Image Analysis (AREA)

Abstract

Aiming at the problems in the prior art, the invention provides an attention intelligent supervision method based on sight line estimation. The method adopts the concept of a gazing area and divides it into 9 areas. Face pictures of different collected subjects gazing at each of the 9 areas are acquired, the eyes are manually framed, and the gazing area is labeled; these data are used to train the configured Yolo network, yielding a gazing area estimation model based on the Yolo network. In use, the user's face image is acquired in real time and fed into the trained model to obtain the estimated gazing area, from which it is judged whether the user is focusing on area five. Because the judgment relies only on the division of the gazing area and the positions of the iris and pupil of the eye, the requirements on equipment are greatly reduced, the implementation cost is lowered, no constraint is imposed on the user's position, and the application range is widened, making the invention convenient to popularize and use.

Description

Attention intelligent supervision method based on sight line estimation
Technical Field
The invention belongs to the technical field of computer vision and human-computer interaction, and particularly relates to an intelligent attention supervision method based on sight estimation.
Background
The eyes occupy an important position among the human facial organs: about 80% of the information that a person acquires from the surrounding environment is obtained through the eyes. The eyes not only help people observe and recognize the external world, but also reflect psychological activity; they express a person's desires, needs, emotions and cognitive processes, and can serve as a channel of silent communication between people.
Gaze estimation is a technology that takes pictures of the human eyes as the input medium and, by acquiring gaze information from the eyes, reflects the user's perception, attention and distribution of areas of interest with respect to external devices.
According to the number of cameras used to collect data, sight line estimation can be divided into single-camera and multi-camera estimation. Single-camera estimation is less accurate than multi-camera estimation and allows a smaller range of head movement, but its application range is the widest: many mobile devices in daily life, such as mobile phones and personal notebook computers, carry a single camera, and monocular sight line estimation is also cheaper.
High-precision single-camera sight line estimation is already relatively mature; methods that use external light sources to detect glints on the eye can keep the error below 1°, but their high price limits commercial application. Most current methods also need calibration before use, which is inconvenient when the experimenter changes; because accurate eyeball information is required, the demands on the captured pictures are high, and some methods need the assistance of an external light source, which limits the application scenarios.
With the progress of society, people pay more and more attention to education, and various educational devices and educational robots with intelligent learning functions have been widely released. However, children learning with such equipment often lack supervision and their attention is often not focused; what is needed is a way to supervise the learning state of the user in real time while the user uses a learning interactive robot.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides an attention intelligent supervision method based on sight line estimation, which is convenient to use, reduces the requirements on equipment, reduces the implementation cost and enlarges the application range.
In order to achieve the above object, the attention intelligent supervision method based on sight line estimation of the present invention is characterized by comprising the following steps:
(1) Division of the gazing area
Dividing the whole gazing area in front of the user into 9 blocks; the area where the screen of the learning interactive robot is located is set as area five, and area five is set as the attention focusing area for the user learning with the learning interactive robot. The upper left of area five is area one, the upper side is area two, the upper right is area three, the left side is area four, the right side is area six, the lower left is area seven, the lower side is area eight, and the lower right is area nine;
(2) acquisition of training data
2.1) Acquiring training data with a color camera fixed in area five; the face of the collected subject (user) faces the color camera, and the subject then gazes at each of the 9 areas in turn, with the same number n of pictures acquired for each gazing area;
2.2) Repeating step 2.1) for different collected subjects, acquiring n pictures for each gazing area;
2.3) Classifying the collected pictures of all subjects by gazing area, obtaining training data for the 9 gazing areas;
(3) labeling of training data
The training data are labeled manually, and the labeling covers two aspects: the eye position, i.e. framing where in the picture the eyes are located; and the class of the eye information, i.e. which of the divided gazing areas the eye information (iris and pupil positions) selected by the frame corresponds to. In short, the labeling tells the network what an eye is and which gazing area such an eye corresponds to;
(4) construction and training of gaze region estimation model
A Yolo network is adopted as the gazing area estimation model. The input of the GoogleNet network within the Yolo network is adjusted from 224×224 to 448×448; the initial convolutional layer of the network extracts features from the training data, the other convolutional layers further extract features layer by layer, and the final fully-connected layers predict the gazing area class probabilities and the bounding box. As for the activation function, the Yolo network uses a logistic activation function in the last layer and ReLU (Rectified Linear Units) in all other layers;
In the training process, the neuron parameters of the official Yolo network model are selected as initial values and the first 23 layers of the model are retained; the parameters of the neurons of the last 3 layers (the last convolutional layer and the two fully-connected layers) are modified according to the error between the output obtained from the Yolo model on the training data and the labels, and the last convolutional layer is given 70 filters;
and setting the iteration times and the learning rate of training and updating the weight of the trained pictures once, and then sending the training data into the set Yolo network to obtain a fixation area estimation model based on the Yolo network.
(5) Detecting the direction of sight in real time
A kinect color camera (a color camera developed by Microsoft) on the learning interactive robot collects face images of the user in real time; the collected images are fed as input into the gazing area estimation model to obtain the bounding box, i.e. the eye position, and the corresponding gazing area class probability;
(6) eye gaze estimation-based attention detection
When the real-time detection result is not area five for a period of time, i.e. the user's attention has left the screen for a while, the user is considered to have left the learning state; a judgment is then made according to the user's learning time. If the learning time is below a set threshold, the user is prompted to concentrate until the gazing area returns to area five; if the learning time is above the set threshold, the user is prompted to rest, the rest time is recorded, and detection continues after the rest is finished.
The object of the invention is thus achieved.
Aiming at the problems in the prior art, the invention provides an attention intelligent supervision method based on sight line estimation. The method adopts the concept of a gazing area, divides it into 9 areas, acquires face pictures of different collected subjects gazing at each of the 9 areas, manually frames the eyes and labels the gazing area, and uses these data to train the configured Yolo network, namely the gazing area estimation model of the invention, obtaining a gazing area estimation model based on the Yolo network. In use, the user's face image is acquired in real time and fed into the trained model, yielding the estimated gazing area and thus a judgment of whether the user is focusing on area five. Because the judgment relies only on the division of the gazing area and the positions of the iris and pupil of the eye, the requirements on equipment are greatly reduced, the implementation cost is lowered, no constraint is imposed on the user's position, and the application range is widened, which makes the invention convenient to popularize and use.
Drawings
FIG. 1 is a schematic diagram of the principle of intelligent attention supervision;
FIG. 2 is a flow chart of an embodiment of the intelligent attention supervision method based on gaze estimation;
fig. 3 is a schematic diagram of the division of the gaze area;
FIG. 4 is a schematic diagram of a training data acquisition environment;
fig. 5 is a schematic view of the gaze region corresponding to different eye information;
FIG. 6 is a schematic view of a marked picture;
FIG. 7 is a schematic diagram of the structure of the Yolo network;
FIG. 8 is a schematic diagram of gaze direction estimation by the gaze region estimation model;
FIG. 9 is a schematic view of an attention detection process;
FIG. 10 is a schematic view of attention detection in an ideal state;
FIG. 11 is a schematic view of attention detection in a non-ideal state;
FIG. 12 is a schematic diagram of positioning a human body under a kinect coordinate system of a learning interactive robot;
FIG. 13 is a data acquisition schematic of a calibration process;
fig. 14 is a schematic diagram of the trimming process.
Detailed Description
The following description of the embodiments of the present invention is provided in order to better understand the present invention for those skilled in the art with reference to the accompanying drawings. It is to be expressly noted that in the following description, a detailed description of known functions and designs will be omitted when it may obscure the subject matter of the present invention.
The invention addresses these problems with a deep learning approach to sight line estimation for intelligent attention supervision. It is convenient to use, needs only simple equipment, has a low implementation cost, and can run on mobile devices, since results are obtained from nothing more than RGB images of the user's face; it thus overcomes the drawbacks of high cost and narrow application range of existing methods.
With the progress of society, people pay more and more attention to education, and various educational devices and educational robots with intelligent learning functions have been widely released. However, children learning with educational equipment often lack supervision. Aiming at this problem, the invention designs a learning attention supervision method for children's education that takes a sight line estimation algorithm as its core: it monitors whether the user's attention is focused when the user learns with educational equipment or an educational robot, and can supervise the learning state in real time while the user uses the learning interactive robot. The method does not need to obtain a sight line direction accurate to a specific angle; instead, the space is divided into several areas according to actual requirements and the sight line is estimated at the area level.
The invention divides the area facing the human eyes into 9 areas and estimates in which area the user's sight line falls. Since space extends without bound, the areas other than area five are unbounded; if the learning interactive robot were placed in one of them, larger errors would result, so the position of the learning interactive robot and the learning area are set as area five, as shown in FIG. 1. When the user's sight line area is detected to be area five, the user's attention is considered to be focused on the learning interactive robot; when any other area is detected, the user's attention is not focused on the learning interactive robot.
The invention is directed at learning attention supervision for children's education; the users are children. It monitors whether the user's attention is focused while the user learns with the interactive system, and can supervise the learning state in real time while the user uses the learning interactive robot.
The sight line estimation part needs a model trained with the yolo network; there is no ready-made database, so the database must be built by ourselves. In this embodiment, a scene of interactive learning on a relatively large screen is simulated; the sight line area, i.e. the gazing area, is divided into 9 blocks, sight line estimation is carried out over these 9 gazing areas, and data are acquired accordingly. Children generally study in the morning and afternoon, when light is sufficient, and the learning place is generally a well-lit room; the data acquisition time is therefore set to the morning and afternoon, and the place to a well-lit room.
FIG. 2 is a flow chart of an embodiment of the intelligent attention supervision method based on gaze estimation;
in this embodiment, as shown in fig. 2, the attention intelligent supervision method based on gaze estimation of the present invention includes two major parts, namely: off-line training and real-time detection specifically include:
step 101: division of a gaze area
In the present invention, as shown in fig. 3, the whole gazing area in front of the user is divided into 9 blocks; the area where the learning interactive robot screen is located is area five, which is set as the attention focusing area where the user learns with the learning interactive robot. The upper left of area five is area one, the upper side is area two, the upper right is area three, the left side is area four, the right side is area six, the lower left is area seven, the lower side is area eight, and the lower right is area nine.
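For illustration, this 3×3 layout of gazing areas can be sketched as follows; the helper below is hypothetical and not part of the patent.

```python
# Hypothetical helper (not part of the patent): the 3x3 layout of gazing areas,
# with area five at the centre, where the learning interactive robot screen sits.
GAZE_AREAS = [
    [1, 2, 3],   # upper-left, upper, upper-right of area five
    [4, 5, 6],   # left, screen / attention-focusing area, right
    [7, 8, 9],   # lower-left, lower, lower-right
]

def is_focused(area: int) -> bool:
    """Attention is considered focused only when the estimated area is five."""
    return area == 5
```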
Step 102: acquisition of training data
The method trains a model capable of estimating the user's gazing area, i.e. the gazing area estimation model, through the yolo network; training requires a database, and since none is available, it must be built by ourselves. The data acquisition environment is shown in fig. 4. In this embodiment, the color camera of the Kinect is used to acquire data; its position is fixed in area five, the face of the collected subject, i.e. the user, must face the Kinect color camera, and the subject then gazes at each of the 9 areas in turn while the Kinect color camera captures pictures of the subject as training data. The more collected subjects in the database the better, but the amount of data must be the same for different subjects, and each subject should contribute the same amount of data for each of the 9 gazing areas: n pictures are acquired per gazing area, and different subjects are acquired by the same method, n pictures per gazing area.
Then classifying the collected pictures of all the collected objects according to the watching areas to obtain training data of 9 watching areas;
step 103: labeling of training data
After the training data are collected, they need to be labeled manually, and the labeling covers two aspects: the eye position, i.e. framing where in the picture the eyes are located; and the class of the eye information, i.e. which gazing area the eye information (iris and pupil positions) selected by the frame corresponds to, as shown in fig. 5.
In short, the labeling tells the network what an eye is and which gazing area such an eye corresponds to; a labeled picture is shown in fig. 6.
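Purely for illustration (the patent does not prescribe a label file format), such a label could be written in the darknet/YOLO convention of one line per framed object: a class index followed by the box centre and size normalized to the picture dimensions.

```python
def yolo_label(area_index, eye_box, img_w, img_h):
    """Hypothetical label writer: darknet-style line 'class x_c y_c w h' with the
    eye bounding box normalized to the picture size; area_index 0..8 stands for
    gazing areas one..nine."""
    x_min, y_min, x_max, y_max = eye_box
    x_c = (x_min + x_max) / 2.0 / img_w
    y_c = (y_min + y_max) / 2.0 / img_h
    w = (x_max - x_min) / img_w
    h = (y_max - y_min) / img_h
    return f"{area_index} {x_c:.6f} {y_c:.6f} {w:.6f} {h:.6f}"

# Example: an eye framed at (300, 220)-(420, 280) in a 640x480 picture,
# labeled as gazing at area five (class index 4).
print(yolo_label(4, (300, 220, 420, 280), 640, 480))
```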
Step 104: construction and training of gaze region estimation model
The invention adopts a Yolo network as the gazing area estimation model. The Yolo network is implemented with a CNN; its structure is shown in fig. 7. The network has 24 convolutional layers and 2 fully-connected layers, the first 20 convolutional layers directly reuse the first 20 layers of GoogleNet, and the whole network makes extensive use of cascaded convolutions. In the invention, to obtain more accurate information from detection, the input of the GoogleNet network within the Yolo network is adjusted from 224×224 to 448×448; the initial convolutional layer extracts features from the training data, the other convolutional layers further extract features layer by layer, and the final fully-connected layers predict the gazing area class probabilities and the bounding box. As for the activation function, the Yolo network uses a logistic activation function in the last layer and ReLU (Rectified Linear Units) in all other layers.
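By way of a rough sketch only (the patent discloses no code; the feature-map size, hidden width and output encoding below are assumptions), the last three trainable layers described here could look as follows in PyTorch:

```python
import torch
import torch.nn as nn

class GazeHead(nn.Module):
    """Illustrative sketch of the last three trainable layers described in the
    patent: one convolutional layer with 70 filters followed by two fully-
    connected layers. The 7x7x1024 feature-map size, the 4096 hidden width and
    the output size are assumptions, not values disclosed in the patent."""
    def __init__(self, in_channels=1024, grid=7, hidden=4096, out_dim=7 * 7 * 70):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 70, kernel_size=3, padding=1)  # 70 filters
        self.relu = nn.ReLU(inplace=True)          # ReLU in all but the last layer
        self.fc1 = nn.Linear(70 * grid * grid, hidden)
        self.fc2 = nn.Linear(hidden, out_dim)      # last layer left linear/logistic

    def forward(self, x):
        x = self.relu(self.conv(x))
        x = torch.flatten(x, 1)
        x = self.relu(self.fc1(x))
        return self.fc2(x)

# Example: backbone features for a 448x448 input image (assumed to be 7x7x1024).
features = torch.randn(1, 1024, 7, 7)
predictions = GazeHead()(features)
```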
During the training process, the neuron parameters of the official yolo network model are selected as initial values and the first 23 layers of the model are retained; the parameters of the neurons of the last 3 layers (i.e. convolutional layer 24 and the two fully-connected layers) are modified according to the error between the output obtained from the model and the labels.
Since the sight line region is divided into 9 blocks, the classification result also has 9 classes, and 70 filters are set in the last convolutional layer.
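One plausible reading of the 70 filters, offered here only as an interpretation and not stated in the patent, is the YOLOv2-style detection-layer convention filters = boxes × (classes + 5): with 5 predicted boxes per grid cell and 9 gazing area classes, 5 × (9 + 5) = 70, each box carrying four coordinates, one confidence score and nine class scores.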
The number of iterations and the learning rate of model training, together with the number of training pictures after which the weights are updated once, are set, and the training data are then fed into the configured Yolo network to obtain the gazing area estimation model based on the Yolo network.
In the present embodiment, the number of iterations of model training is set to 50000 times, the learning rate is set to 0.00001, and the weight value is updated every 64 training images.
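These settings could be recorded, for example, in a small configuration mapping; the key names below are illustrative and not tied to any particular framework's API.

```python
# Illustrative record of the training settings in this embodiment; the key
# names are hypothetical and not tied to any particular framework's API.
TRAIN_CONFIG = {
    "max_iterations": 50000,   # iterations of model training
    "learning_rate": 1e-5,     # 0.00001
    "batch_size": 64,          # weights updated after every 64 training images
    "frozen_layers": 23,       # first 23 layers kept from the official model
}
```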
Training with these settings gives a model with good performance: on the test set the accuracy reaches 100%, and real-time detection is possible on a video stream at 30 frames per second with high accuracy. The offline model training is thus finished.
Step 201: color camera real-time acquisition of user face image
Face images of the user are collected in real time by the color camera; the collected images are fed as input into the gazing area estimation model to obtain the bounding box, i.e. the eye position, and the corresponding gazing area class probability.
Real-time detection of the user's attention requires real-time user data, i.e., a user's facial picture. In the embodiment, the Kinect color camera is used for collecting the face image of the user, and the collection frequency is 30 frames/second.
In this embodiment, face pictures of the user are acquired in real time by the Kinect color camera and, combined with the gazing area estimation model obtained from the previous training, the direction of the user's sight line can be estimated: the picture acquired in real time is used as input, the gazing area estimation model processes each frame, and the detection result for the picture is output: the bounding box, i.e. the eye position, and the corresponding gazing area class probability.
In this embodiment, the gazing area estimation model takes about 0.014 seconds to process one frame; since this is faster than the acquisition interval, the real-time detection task is completed comfortably. The gaze direction estimation process performed by the gazing area estimation model is shown in fig. 8.
Step 202: eye gaze estimation based attention detection
When the real-time detection result is not area five for a period of time, i.e. the user's attention has left the screen for a while, the user is considered to have left the learning state; a judgment is then made according to the user's learning time. If the learning time is below a set threshold, the user is prompted to concentrate until the gazing area returns to area five; if the learning time is above the set threshold, the user is prompted to rest, the rest time is recorded, and detection continues after the rest is finished. The specific attention detection is shown in fig. 9.
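A minimal sketch of this decision logic is given below; the tolerance and threshold values and the detect_area() helper are assumptions introduced only for illustration.

```python
import time

def supervise_attention(detect_area, learning_threshold_s=1200.0,
                        off_screen_tolerance_s=5.0):
    """Illustrative attention-supervision loop. detect_area() is an assumed
    helper returning the estimated gazing area (1-9) for the latest frame;
    the two thresholds are hypothetical values, not taken from the patent."""
    session_start = time.time()
    off_screen_since = None
    while True:
        if detect_area() == 5:                    # attention on the robot screen
            off_screen_since = None
        elif off_screen_since is None:
            off_screen_since = time.time()        # attention just left the screen
        elif time.time() - off_screen_since > off_screen_tolerance_s:
            learning_time = time.time() - session_start
            if learning_time < learning_threshold_s:
                print("Please focus on the screen.")   # prompt until back to area five
            else:
                print("Time for a break.")             # prompt a rest, record its length
                return learning_time
        time.sleep(1 / 30)                        # roughly the 30 fps acquisition rate
```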
Step 203: locating user position using Kinect
The offline-trained model assumes an ideal case, such as user 1 in fig. 10, whose position faces area five, i.e. the Kinect color camera of the learning interactive robot. When the person's position changes and a non-ideal state arises (the position does not face area five), i.e. the position of user 2 in fig. 11, using the model trained for the ideal case would cause large errors, so the user must be calibrated in the non-ideal case. The Kinect is therefore used to locate the user's position and judge whether the user is currently in the ideal or the non-ideal use state.
The user's position is located with the Kinect on the learning interactive robot, whose field of view is cone-shaped, as shown in FIG. 12. By combining the data acquired by the Kinect's depth camera and color camera, the coordinates of the user's joint points in the Kinect coordinate system can be obtained, and from these coordinates the user's position in space can be located.
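For illustration, one way the located head-joint coordinates might be turned into an ideal/non-ideal judgement is sketched below; the lateral tolerance is an assumed value, not specified in the patent.

```python
def is_ideal_position(head_xyz, lateral_tolerance_m=0.3):
    """Hypothetical check: the user is treated as being in the ideal state when
    the head joint, expressed in Kinect camera space (x right, y up, z forward,
    in metres), sits roughly in front of area five. The 0.3 m tolerance and the
    0.5 m minimum distance are assumptions."""
    x, _y, z = head_xyz
    return abs(x) <= lateral_tolerance_m and z > 0.5
```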
Step 204: judging whether the user is in the ideal use state or the non-ideal use state; if the state is ideal, return to step 202 and continue attention detection, otherwise go to step 205.
Step 205: user calibration
When the user is in a non-ideal state, the model trained in the ideal state would produce large errors, so it cannot be used for detection as it is.
In the calibration process, the user gazes at the 9 areas according to prompts from the learning interactive robot, and a small amount of data, i.e. pictures of the user gazing at the 9 areas from the current position, is collected; a schematic diagram of data acquisition during calibration is shown in fig. 13. The gazing area class must be recorded for each picture at collection time.
The collected data are used to modify the gazing area estimation model. The features extracted by the convolutional layers in the ideal state are still applicable in the non-ideal state, so only the fully-connected layers of the model need to be fine-tuned with these data (pictures). The fine-tuning process has two steps. ① Automatic labeling of the data: the data collected during calibration are input into the offline-trained gazing area estimation model, and the gazing area information in the resulting detections is replaced by the area information recorded at collection time, which completes the automatic labeling. ② Fine-tuning of the model: the model is trained with the automatically labeled data, the parameters of the first 24 layers are kept unchanged and only the parameters of the fully-connected layers are modified; the number of training iterations can be set small, and since only the fully-connected layers are fine-tuned, the training time is short and the user only needs to wait a short while. The fine-tuning process is shown in fig. 14.
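A minimal PyTorch-style sketch of this fine-tuning step follows; the attribute names model.backbone and model.head, and the epoch and learning-rate values, are assumptions introduced only for illustration.

```python
import torch

def finetune_fc(model, calibration_loader, loss_fn, epochs=5, lr=1e-5):
    """Illustrative calibration fine-tuning: freeze the convolutional layers
    (model.backbone, an assumed attribute holding the first 24 layers) and
    update only the fully-connected head (model.head, also assumed)."""
    for p in model.backbone.parameters():        # keep the first 24 layers unchanged
        p.requires_grad = False
    optimizer = torch.optim.SGD(model.head.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):                      # few iterations, so the wait is short
        for images, labels in calibration_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
    return model
```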
The fine-tuned model then replaces the offline-trained gazing area estimation model, so the attention of the user in the non-ideal state can be detected, and the process returns to step 202.
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand the invention, it should be understood that the invention is not limited to the scope of these embodiments. To those skilled in the art, various changes are permissible as long as they remain within the spirit and scope of the invention as defined by the appended claims, and all inventions that make use of the inventive concept are protected.

Claims (3)

1. An attention intelligent supervision method based on sight line estimation is characterized by comprising the following steps:
(1) Division of the gazing area
Dividing the whole gazing area in front of the user into 9 blocks; the area where the screen of the learning interactive robot is located is set as area five, and area five is set as the attention focusing area for the user learning with the learning interactive robot. The upper left of area five is area one, the upper side is area two, the upper right is area three, the left side is area four, the right side is area six, the lower left is area seven, the lower side is area eight, and the lower right is area nine;
(2) acquisition of training data
2.1) Acquiring training data with a color camera fixed in area five; the user's face faces the color camera, and the user then gazes at each of the 9 areas in turn, with the same number n of pictures acquired for each gazing area;
2.2) Repeating step 2.1) for different users, acquiring n pictures for each gazing area;
2.3) Classifying the collected pictures of all users by gazing area to obtain training data for the 9 gazing areas;
(3) labeling of training data
The training data are labeled manually, and the labeling covers two aspects: the eye position, i.e. framing where in the picture the eyes are located; and the class of the eye information, i.e. which of the divided gazing areas the eye information selected by the frame corresponds to. In short, the labeling tells the network what an eye is and which gazing area such an eye corresponds to;
(4) construction and training of gaze region estimation model
A Yolo network is adopted as the gazing area estimation model. The input of the GoogleNet network within the Yolo network is adjusted from 224×224 to 448×448; the initial convolutional layer of the network extracts features from the training data, the other convolutional layers further extract features layer by layer, and the final fully-connected layers predict the gazing area class probabilities and the bounding box. As for the activation function, the Yolo network uses a logistic activation function in the last layer and ReLU (Rectified Linear Units) in all other layers;
In the training process, the neuron parameters of the official Yolo network model are selected as initial values and the first 23 layers of the model are retained; the parameters of the neurons of the last 3 layers are modified according to the error between the output obtained from the Yolo model on the training data and the labels, and 70 filters are set in the last convolutional layer;
setting the iteration times and the learning rate of training and updating the weight of the trained pictures once, and then sending training data into a set Yolo network to obtain a fixation area estimation model based on the Yolo network;
(5) detecting the direction of sight in real time
Acquiring face images of the user in real time through a color camera on the learning interactive robot, and feeding the acquired images as input into the gazing area estimation model to obtain the bounding box, i.e. the eye position, and the corresponding gazing area class probability;
(6) eye gaze estimation-based attention detection
When the real-time detection result is not area five for a period of time, i.e. the user's attention has left the screen for a while, the user is considered to have left the learning state; a judgment is then made according to the user's learning time. If the learning time is below a set threshold, the user is prompted to concentrate until the gazing area returns to area five; if the learning time is above the set threshold, the user is prompted to rest, the rest time is recorded, and detection continues after the rest is finished.
2. The intelligent attention supervision method according to claim 1, wherein in step (4) the number of iterations of model training is set to 50000, the learning rate is set to 0.00001, and the weights are updated every 64 training images.
3. The intelligent attention supervision method according to claim 1, further comprising the steps of:
(7) locating the position of the user with a kinect color camera on the learning interactive robot, and judging whether the user is in the ideal use state or the non-ideal use state; if the state is ideal, return to step (6) and continue attention detection, otherwise go to step (8);
(8) user calibration
The user gazes at the 9 areas according to prompts from the learning interactive robot, and pictures of the user gazing at the 9 areas from the user's current position are collected; the collected pictures are used to modify the gazing area estimation model. Since the features extracted by the convolutional layers in the ideal state are still applicable in the non-ideal state, only the fully-connected layers of the model need to be fine-tuned with these data. The fine-tuning process can be divided into two steps: ① automatically labeling the data, in which the data collected during calibration are input into the offline-trained gazing area estimation model and the gazing area information in the resulting detections is replaced by the area information recorded during labeling, completing the automatic labeling;
replacing the offline-trained gazing area estimation model with the fine-tuned model, so that attention detection can be performed on the user in the non-ideal state, and returning to step (6).
CN201710546644.4A 2017-07-06 2017-07-06 Attention intelligent supervision method based on sight line estimation Active CN107392120B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710546644.4A CN107392120B (en) 2017-07-06 2017-07-06 Attention intelligent supervision method based on sight line estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710546644.4A CN107392120B (en) 2017-07-06 2017-07-06 Attention intelligent supervision method based on sight line estimation

Publications (2)

Publication Number Publication Date
CN107392120A CN107392120A (en) 2017-11-24
CN107392120B true 2020-04-14

Family

ID=60335468

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710546644.4A Active CN107392120B (en) 2017-07-06 2017-07-06 Attention intelligent supervision method based on sight line estimation

Country Status (1)

Country Link
CN (1) CN107392120B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108684B (en) * 2017-12-15 2020-07-17 杭州电子科技大学 Attention detection method integrating sight detection
CN108460700B (en) * 2017-12-28 2021-11-16 北京科教科学研究院 Intelligent student education management regulation and control system
CN108510062A (en) * 2018-03-29 2018-09-07 东南大学 A kind of robot irregular object crawl pose rapid detection method based on concatenated convolutional neural network
CN108595047A (en) * 2018-04-20 2018-09-28 北京硬壳科技有限公司 Touch control object recognition methods and device
CN108961679A (en) * 2018-06-27 2018-12-07 广州视源电子科技股份有限公司 A kind of attention based reminding method, device and electronic equipment
CN110479331A (en) * 2019-08-05 2019-11-22 江苏大学 A kind of preparation method and its usage of 3D printing monolithic catalyst
CN110543828A (en) * 2019-08-08 2019-12-06 南京励智心理大数据产业研究院有限公司 Student attention analysis system based on wearable device and multi-mode intelligent analysis
CN111508142A (en) * 2020-04-17 2020-08-07 深圳爱莫科技有限公司 Sight voice interaction automatic cigarette vending machine
CN111881830A (en) * 2020-07-28 2020-11-03 安徽爱学堂教育科技有限公司 Interactive prompting method based on attention concentration detection
CN112306832A (en) * 2020-10-27 2021-02-02 北京字节跳动网络技术有限公司 User state response method and device, electronic equipment and storage medium
CN113064485A (en) * 2021-03-17 2021-07-02 广东电网有限责任公司 Supervision method and system for training and examination
CN113705349B (en) * 2021-07-26 2023-06-06 电子科技大学 Attention quantitative analysis method and system based on line-of-sight estimation neural network
CN116214522B (en) * 2023-05-05 2023-08-29 中建科技集团有限公司 Mechanical arm control method, system and related equipment based on intention recognition

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593352A (en) * 2009-06-12 2009-12-02 浙江大学 Driving safety monitoring system based on face orientation and visual focus
CN102018519A (en) * 2009-09-15 2011-04-20 由田新技股份有限公司 Staff concentration degree monitoring system
CN103366381A (en) * 2013-08-06 2013-10-23 山东大学 Sight line tracking correcting method based on space position
CN103661375A (en) * 2013-11-25 2014-03-26 同济大学 Lane departure alarming method and system with driving distraction state considered
CN104460185A (en) * 2014-11-28 2015-03-25 小米科技有限责任公司 Automatic focusing method and device
CN104850228A (en) * 2015-05-14 2015-08-19 上海交通大学 Mobile terminal-based method for locking watch area of eyeballs
CN105005788A (en) * 2015-06-25 2015-10-28 中国计量学院 Target perception method based on emulation of human low level vision
CN106796449A (en) * 2014-09-02 2017-05-31 香港浸会大学 Eye-controlling focus method and device
CN106909220A (en) * 2017-02-21 2017-06-30 山东师范大学 A kind of sight line exchange method suitable for touch-control

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP5474202B2 (en) * 2009-09-29 2014-04-16 アルカテル−ルーセント Method and apparatus for detecting a gazing point based on face detection and image measurement

Patent Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593352A (en) * 2009-06-12 2009-12-02 浙江大学 Driving safety monitoring system based on face orientation and visual focus
CN102018519A (en) * 2009-09-15 2011-04-20 由田新技股份有限公司 Staff concentration degree monitoring system
CN103366381A (en) * 2013-08-06 2013-10-23 山东大学 Sight line tracking correcting method based on space position
CN103661375A (en) * 2013-11-25 2014-03-26 同济大学 Lane departure alarming method and system with driving distraction state considered
CN106796449A (en) * 2014-09-02 2017-05-31 香港浸会大学 Eye-controlling focus method and device
CN104460185A (en) * 2014-11-28 2015-03-25 小米科技有限责任公司 Automatic focusing method and device
CN104850228A (en) * 2015-05-14 2015-08-19 上海交通大学 Mobile terminal-based method for locking watch area of eyeballs
CN105005788A (en) * 2015-06-25 2015-10-28 中国计量学院 Target perception method based on emulation of human low level vision
CN106909220A (en) * 2017-02-21 2017-06-30 山东师范大学 A kind of sight line exchange method suitable for touch-control

Also Published As

Publication number Publication date
CN107392120A (en) 2017-11-24

Similar Documents

Publication Publication Date Title
CN107392120B (en) Attention intelligent supervision method based on sight line estimation
TWI741512B (en) Method, device and electronic equipment for monitoring driver's attention
WO2020125499A1 (en) Operation prompting method and glasses
US11844608B2 (en) Posture analysis systems and methods
CN109343700B (en) Eye movement control calibration data acquisition method and device
CN101587542A (en) Field depth blending strengthening display method and system based on eye movement tracking
CN106095089A (en) A kind of method obtaining interesting target information
CN104978548A (en) Visual line estimation method and visual line estimation device based on three-dimensional active shape model
CN105516280A (en) Multi-mode learning process state information compression recording method
WO2023011339A1 (en) Line-of-sight direction tracking method and apparatus
CN101383000A (en) Information processing apparatus, information processing method, and computer program
US10884494B1 (en) Eye tracking device calibration
CN112666705A (en) Eye movement tracking device and eye movement tracking method
CN111008542A (en) Object concentration analysis method and device, electronic terminal and storage medium
CN109559332A (en) A kind of sight tracing of the two-way LSTM and Itracker of combination
CN110472546B (en) Infant non-contact eye movement feature extraction device and method
CN110148092A (en) The analysis method of teenager's sitting posture based on machine vision and emotional state
CN106725531A (en) Children's concentration detecting and analysing system and method based on sight line
WO2023041940A1 (en) Gaze-based behavioural monitoring system
WO2020151430A1 (en) Air imaging system and implementation method therefor
CN116453198B (en) Sight line calibration method and device based on head posture difference
JP6819194B2 (en) Information processing systems, information processing equipment and programs
CN112861633A (en) Image recognition method and device based on machine learning and storage medium
CN113491502A (en) Eyeball tracking calibration inspection method, device, equipment and storage medium
CN112651270A (en) Gaze information determination method and apparatus, terminal device and display object

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant