CN116030512A - Gaze point detection method and device
- Publication number: CN116030512A
- Application number: CN202210932760.0A
- Authority: CN (China)
- Prior art keywords: face image, face, predicted, feature vector, gaze point
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The application provides a gaze point detection method and device. The method comprises the following steps: acquiring a face image of a calibration gaze point and a face image of a target gaze point; performing frontal face processing on the two face images by using a frontal face model to obtain predicted front face images of the two face images; processing the two predicted front face images by using a prediction model to obtain the offset between the target gaze point and the calibration gaze point; and obtaining the coordinates of the target gaze point according to the coordinates of the calibration gaze point and the offset. In this scheme, the frontal face processing is equivalent to filling in the eye information missing from a side face image, and if the input is already a frontal face image, its eye information becomes even richer. A calibration gaze point is then introduced on the basis of these information-rich images, and the coordinates of the point to be detected are obtained by solving for the offset between the calibration gaze point and the point to be detected. This reduces the error caused by side faces and individual differences and effectively improves the accuracy of gaze point detection.
Description
Technical Field
The application relates to the technical field of sight line detection and tracking, in particular to a gaze point detection method and a gaze point detection device.
Background
The eyes are not only the windows to the soul; they also reflect where a person's attention is currently focused, so they can serve as a simple and effective means of human-computer interaction. With the development of technologies such as line-of-sight detection and eye tracking, more and more gaze-controlled scenarios have appeared, that is, more and more human-computer interaction is performed through the line of sight. For example, during driving, line-of-sight detection can be used to estimate the driver's steering intention or to track an obstacle; when photographing, the terminal can be controlled by detecting eye actions; in addition, there are other scenarios in which human-computer interaction is performed based on the direction of the line of sight, the gaze point (i.e., the focal point of the line of sight), or eye actions, such as eye-controlled page turning and eye-controlled games.
In traditional schemes, line-of-sight detection is often realized by geometrically solving an eyeball model or by using a deep learning method to learn the appearance features of the eyeball. However, neither scheme takes the influence of individual differences into account, so the error is relatively large; in particular, for a side face, the missing information makes detection more difficult and further increases the error. Most traditional schemes can only detect the line-of-sight direction and find it difficult to accurately detect a specific gaze point; precisely because their errors are too large, they cannot meet the high-accuracy requirements of gaze point detection.
Therefore, how to improve the accuracy of gaze point detection is a technical problem to be solved.
Disclosure of Invention
The application provides a gaze point detection method and device, which can improve the accuracy of gaze point detection.
In a first aspect, a gaze point detection method is provided, the method comprising: acquiring a first face image of a calibration gaze point and a second face image of a target gaze point; performing frontal face processing on the first face image and the second face image by using a frontal face model to obtain a first predicted front face image and a second predicted front face image; processing the first predicted front face image and the second predicted front face image by using a prediction model to obtain the offset between the target gaze point and the calibration gaze point; and obtaining the coordinates of the target gaze point according to the coordinates of the calibration gaze point and the offset. The first predicted front face image represents the predicted front face image of the first face image, and the second predicted front face image represents the predicted front face image of the second face image.
In this technical scheme, the frontal face processing is equivalent to filling in the eye information missing from a side face image, and if the input is already a frontal face image, its eye information becomes even richer. A calibration gaze point is then introduced on the basis of these information-rich images, and the coordinates of the point to be detected are obtained by solving for the offset between the calibration gaze point and the point to be detected. This reduces the error caused by side faces and individual differences and effectively improves the accuracy of gaze point detection.
The first face image represents a face image captured while the user is looking at the calibration gaze point, so the gaze point of the person in the first face image is the calibration gaze point. In actual operation, the user may first be prompted to look at a certain calibration point, and a face image of the user is then collected to obtain the first face image; that calibration point is the calibration gaze point. The second face image represents a face image captured while the user is looking at the target gaze point, so the gaze point of the person in the second face image is the target gaze point. The target gaze point can be understood as the point the user is gazing at that one wishes to know; its coordinates are unknown and need to be detected, so it may also be called the gaze point to be detected.
In the embodiments of the application, the frontal face model is used to convert a face image input to it into a frontal face. The process of converting a face image into a frontal face is referred to simply as frontal face processing. It should be understood that the input face image may be a frontal face or a side face. If it is a side face, the frontal face processing not only turns the side face to the front but also fills in the information that is missing because of the side pose, so the side face is truly changed into a frontal face rather than merely undergoing a coordinate-system conversion; in other words, the side face is completed into a frontal face. In particular, in the embodiments of the application, the information of the eyes is supplemented. If the face image input to the frontal face model is already a frontal face, the frontal face processing makes the face information it contains even richer, which can be understood as a fine-tuning process.
In the embodiments of the application, the prediction model is mainly used to obtain the offset between the gaze points of the persons in the two face images input to it. It should be understood that there is no restriction on the two face images input to the prediction model; that is, each of the two images may be either a frontal face image or a side face image. When the prediction model processes two predicted front face images, its prediction of the offset is relatively more accurate, because the two predicted front face images contain relatively rich eye information.
With reference to the first aspect, in certain implementations of the first aspect, the frontal face model may include an encoder and a decoder. The encoder is used to extract the feature vector of the face image input to the frontal face model; the decoder is used to convert the feature vector into the front face image corresponding to that face image. Performing frontal face processing on the first face image and the second face image by using the frontal face model to obtain the first predicted front face image and the second predicted front face image may include the following operations (see the sketch after these operations):
the encoder performs feature extraction on the first face image to obtain the feature vector of the first face image, and the decoder processes the feature vector of the first face image to obtain the first predicted front face image;
the encoder performs feature extraction on the second face image to obtain the feature vector of the second face image, and the decoder processes the feature vector of the second face image to obtain the second predicted front face image.
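The patent does not provide code or concrete layer configurations; the following PyTorch sketch is only a minimal illustration of the encoder-decoder structure described above. The class name, the layer choices, the 128/64-dimensional split of the feature vector into a face feature vector and a line-of-sight feature vector, and the fixed output resolution are all assumptions.

```python
import torch
import torch.nn as nn

class FrontalFaceModel(nn.Module):
    """Sketch of the frontal face model: the encoder maps a face image to a
    face feature vector and a line-of-sight feature vector; the decoder maps
    the face feature vector back to a predicted front face image."""

    def __init__(self, face_dim=128, gaze_dim=64):
        super().__init__()
        # Encoder: shared convolutional backbone followed by two linear heads.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(4), nn.Flatten(),
        )
        self.face_head = nn.Linear(64 * 4 * 4, face_dim)  # face feature vector
        self.gaze_head = nn.Linear(64 * 4 * 4, gaze_dim)  # line-of-sight feature vector
        # Decoder: reconstructs a (fixed, illustrative) 32x32 front face image.
        self.decoder = nn.Sequential(
            nn.Linear(face_dim, 64 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (64, 8, 8)),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, face_image):
        feats = self.backbone(face_image)
        face_vec = self.face_head(feats)
        gaze_vec = self.gaze_head(feats)
        # In this variant the decoder consumes only the face feature vector.
        predicted_front_face = self.decoder(face_vec)
        return predicted_front_face, face_vec, gaze_vec
```

With such a model, the first and second predicted front face images would simply be the first output of the model applied to the first face image and to the second face image, respectively.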
It should be noted that, as described below, the feature vector may include a face feature vector and a line-of-sight feature vector; that is, the feature vector obtained when the encoder extracts features from a face image (for example, the first face image or the second face image) includes a face feature vector and a line-of-sight feature vector. When processing the feature vector to obtain the predicted front face image, the decoder may process only the face feature vector, or it may process both the face feature vector and the line-of-sight feature vector. If the decoder processes only the face feature vector, the amount of computation is relatively small; if it processes both vectors, the amount of computation is relatively large, but because the line-of-sight feature vector contains abundant information related to the line of sight, the obtained predicted front face image is better suited to the task of gaze point detection, and the accuracy of the detection result is further improved when such a predicted front face image is used to detect the gaze point.
With reference to the first aspect, in certain implementations of the first aspect, the prediction model may include a feature extraction network and a full-connection layer network; the feature extraction network is used to extract the sight line vectors of the two face images input to the prediction model, and the full-connection layer network is used to fuse the sight line vectors to obtain the offset between the gaze points corresponding to the two face images. Processing the first predicted front face image and the second predicted front face image by using the prediction model to obtain the offset between the target gaze point and the calibration gaze point may include the following operations (see the sketch after these operations):
the feature extraction network performs feature extraction on the first predicted front face image and the second predicted front face image to obtain the sight line vector of the first predicted front face image and the sight line vector of the second predicted front face image;
the full-connection layer network performs fusion processing on the sight line vector of the first predicted front face image and the sight line vector of the second predicted front face image to obtain the offset between the target gaze point and the calibration gaze point.
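As a minimal sketch only (the architecture, dimensions, and class name are assumptions, not specified by the patent), the prediction model described above could look as follows: a shared feature extraction network producing one sight line vector per predicted front face image, and a full-connection layer network that fuses the two vectors into a two-dimensional offset.

```python
import torch
import torch.nn as nn

class OffsetPredictionModel(nn.Module):
    """Sketch of the prediction model: extract a sight line vector from each
    input image and fuse the two vectors into an offset (dx, dy) between the
    target gaze point and the calibration gaze point."""

    def __init__(self, gaze_dim=64):
        super().__init__()
        self.feature_extractor = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, gaze_dim),
        )
        self.fusion = nn.Sequential(            # full-connection layer network
            nn.Linear(2 * gaze_dim, 128), nn.ReLU(),
            nn.Linear(128, 2),                  # offset (dx, dy)
        )

    def forward(self, calibration_front_face, target_front_face):
        v_calib = self.feature_extractor(calibration_front_face)
        v_target = self.feature_extractor(target_front_face)
        return self.fusion(torch.cat([v_target, v_calib], dim=1))
```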
With reference to the first aspect, in certain implementation manners of the first aspect, the method further includes: and performing operations at the target gaze point according to coordinates of the target gaze point, the operations including at least one of clicking, sliding, touching, unlocking, turning pages, focusing, or gaming operations.
With reference to the first aspect, in some implementations of the first aspect, the calibration gaze point is a screen center point.
In a second aspect, a training method of a frontal face model is provided. The training method includes: acquiring training data, wherein the training data comprises a front face image sample and a plurality of side face image samples corresponding to the front face image sample; and training the frontal face model by using the training data to obtain a target frontal face model, wherein the frontal face model comprises an encoder and a decoder, the encoder is used to extract the feature vector of the image input to the encoder, and the decoder is used to process the feature vector to obtain a predicted front face image.
The target frontal face model trained by any one of the training methods of the second aspect may be used in the frontal face processing step of the first aspect, that is, the step performed by using the frontal face model.
In the technical scheme of the application, the target frontal face model trained by the training method of the second aspect has the capability of frontal face processing, that is, it can convert a face image input to the frontal face model/target frontal face model into a frontal face image. Because the training method trains with the front face image and the side face images of the same person, the trained model can convert a side face into a frontal face; that is, while converting the coordinate system it also supplements the information missing from the side face, so the side face is converted into a frontal face in a real sense.
With reference to the second aspect, in certain implementations of the second aspect, the feature vector includes a face feature vector and a line-of-sight feature vector, and the decoder is specifically configured to process the face feature vector to obtain the predicted front face image. Training the frontal face model with the training data to obtain the target frontal face model may then include the following operations:
inputting a side face image sample to the encoder to obtain a face feature vector and a line-of-sight feature vector;
inputting the face feature vector to the decoder to obtain a predicted front face image;
and adjusting the parameters of the encoder and the decoder according to the predicted front face image and the front face image sample, thereby obtaining the target frontal face model.
In these implementations, the decoder processes only the face feature vector, but not the line of sight feature vector, so that the operation complexity can be reduced, and the training process can be simplified. It should be understood that the task of turning the side face to the front face can be achieved even if the line of sight feature vector is not processed, that is, the front face model trained by the implementation manner has the capability of turning the side face to the front face.
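The patent does not prescribe a loss function or optimizer; the following sketch only illustrates the basic reconstruction-style training described above, assuming the FrontalFaceModel sketched earlier and a dataloader that yields (side face sample, corresponding front face sample) pairs. The L1 loss, learning rate, and epoch count are assumptions.

```python
import torch
import torch.nn.functional as F

def train_frontal_face_model(model, dataloader, epochs=10, lr=1e-4):
    """Sketch: push the decoder output for each side face sample towards the
    front face image sample of the same person."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for side_face, front_face in dataloader:
            predicted_front_face, face_vec, gaze_vec = model(side_face)
            loss = F.l1_loss(predicted_front_face, front_face)  # reconstruction loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```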
With reference to the second aspect, in some implementations of the second aspect, the decoder is further configured to process a line of sight feature vector, that is, the decoder may be configured to process the face feature vector and the line of sight feature vector, and input the face feature vector to the decoder, so as to obtain a predicted front face image, where the method specifically includes: and inputting the face feature vector and the sight feature vector into a decoder, and processing the two vectors by the decoder to obtain a predicted face image. In these implementations, the decoder needs to process both the face feature vector and the line-of-sight feature vector, and the computational complexity is relatively high, but the trained decoder has the capability of processing both vectors. In the front face model, the obtained predicted front face image comprises richer sight-related information, so that the predicted front face image is more suitable for the task of gaze point detection, and the accuracy of a detection result is further improved when the gaze point is detected by using the predicted front face image. In short, the implementation manner further fits the task of gaze point detection on the basis of enabling the frontal face model to have the capability of turning the side face to the frontal face, so that the processing capability of the vision related features is enhanced, and the predicted frontal face image contains more vision related information.
With reference to the second aspect, in certain implementations of the second aspect, the front face image sample and the plurality of side face image samples share the same gaze point, and the training data also includes the known coordinates of that gaze point. A full-connection layer is used to map the line-of-sight feature vector to predicted gaze point coordinates. Training the frontal face model with the training data to obtain the target frontal face model may then include the following operations:
inputting the line-of-sight feature vector to the full-connection layer to obtain the coordinates of the predicted gaze point;
and adjusting the parameters of the encoder, the decoder and the full-connection layer according to the coordinates of the predicted gaze point and the known coordinates of the gaze point, thereby obtaining the target frontal face model.
In the implementation manner, the constraint of the gaze point is added in the training process of the face model, which is equivalent to adding a supervision learning, and the accuracy of the target face model can be further improved through the constraint of the gaze point.
It should be noted that, when the parameters of the encoder, the decoder and the full-connection layer are adjusted according to the coordinates of the predicted gaze point and the known coordinates of the gaze point, the back-propagation driven by the difference between the predicted value and the true value (label) of the gaze point directly adjusts the parameters of the encoder and the full-connection layer; and when the parameters of the encoder change, the parameters of the decoder also change because of the association between the encoder and the decoder. Thus, over a training process, the parameters of the encoder, the decoder and the full-connection layer are all ultimately adjusted because of the predicted gaze point coordinates and the known gaze point coordinates.
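A minimal sketch of this supervised variant is given below, reusing the FrontalFaceModel sketched earlier; the extra full-connection layer, the choice of losses, the loss weighting, and all dimensions are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Assumed full-connection layer mapping a 64-dim line-of-sight feature vector
# to predicted gaze point coordinates (x, y).
gaze_fc = nn.Linear(64, 2)

def supervised_training_step(model, optimizer, side_face, front_face, gaze_xy,
                             lambda_gaze=1.0):
    """One step combining the reconstruction loss with the gaze point constraint."""
    predicted_front_face, face_vec, gaze_vec = model(side_face)
    recon_loss = F.l1_loss(predicted_front_face, front_face)
    gaze_loss = F.mse_loss(gaze_fc(gaze_vec), gaze_xy)  # known gaze point coordinates as labels
    loss = recon_loss + lambda_gaze * gaze_loss
    optimizer.zero_grad()
    loss.backward()   # gradients flow into the encoder, the decoder and gaze_fc
    optimizer.step()
    return loss.item()

# The optimizer would cover all three parameter groups, for example:
# optimizer = torch.optim.Adam(list(model.parameters()) + list(gaze_fc.parameters()), lr=1e-4)
```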
In a third aspect, a training method of a prediction model is provided, the training method including: acquiring training data, the training data comprising: a first face image sample of a first gaze point, a second face image sample of a second gaze point, and offset labels of the first gaze point and the second gaze point; training the prediction model by using training data to obtain a target prediction model, wherein the prediction model is used for processing two face images input into the prediction model to obtain the offset between the gaze points corresponding to the two face images.
The target prediction model trained by any one of the training methods of the third aspect may be used in the step of determining the offset using the prediction model in the first aspect.
In the technical scheme of the application, the target prediction model trained by the training method of the third aspect has the capability of calculating the gaze point offset, that is, it can calculate the offset between the gaze points in the two face images input to the prediction model/target prediction model. Because the training method trains with face images whose gaze points are known, the trained model is not affected by individual differences. For example, although images of different people looking at the same point may differ, for the same person, solving for the offset between the gaze points in two images eliminates this individual difference. Moreover, the input images of the target prediction model obtained by the training method of the third aspect are not limited to being front faces or side faces.
With reference to the third aspect, in some implementations of the third aspect, the prediction model may include a feature extraction network and a full-connection layer network, the feature extraction network being configured to extract line-of-sight feature vectors of two face images input to the prediction model; the full-connection layer network is used for fusing the sight feature vectors to obtain the offset of the gaze point corresponding to the two face images.
With reference to the third aspect, in some implementations of the third aspect, training the prediction model with the training data to obtain the target prediction model may include the following operations (see the sketch after these operations):
respectively inputting the first face image sample and the second face image sample into a feature extraction network to respectively obtain a sight feature vector of the first face image sample and a sight feature vector of the second face image sample;
inputting the sight line feature vector of the first face image sample and the sight line feature vector of the second face image sample into a full-connection layer network to obtain a predicted offset, wherein the predicted offset is used for representing a predicted value of the offset between the first gaze point and the second gaze point;
and adjusting parameters of the feature extraction network and the full-connection layer according to the predicted offset and the offset label, thereby obtaining a target prediction model.
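A minimal training-loop sketch for the operations above is given below, assuming the OffsetPredictionModel sketched earlier and a dataloader yielding (first face image sample, second face image sample, offset label) triples; the MSE loss, the input ordering convention, the learning rate, and the epoch count are assumptions.

```python
import torch
import torch.nn.functional as F

def train_prediction_model(model, dataloader, epochs=10, lr=1e-4):
    """Sketch: regress the predicted offset towards the offset label between
    the first gaze point and the second gaze point."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for first_face, second_face, offset_label in dataloader:
            predicted_offset = model(first_face, second_face)
            loss = F.mse_loss(predicted_offset, offset_label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```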
In a fourth aspect, there is provided a gaze point detection apparatus comprising means for performing any one of the methods of the first aspect, comprised of software and/or hardware.
In a fifth aspect, a training apparatus for a frontal face model is provided, the apparatus comprising means for performing any one of the training methods of the second aspect, comprised of software and/or hardware.
In a sixth aspect, a training apparatus for a prediction model is provided, the apparatus comprising means for performing any one of the training methods of the third aspect, comprised of software and/or hardware.
In a seventh aspect, there is provided a terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor being capable of implementing any one of the methods of the first aspect when the computer program is executed.
In an eighth aspect, there is provided a computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor being capable of implementing any one of the training methods of the second or third aspects when the computer program is executed by the processor.
In a ninth aspect, there is provided a chip comprising a processor for reading and executing a computer program stored in a memory, the computer program being capable of implementing any one of the methods of the first, second or third aspects when executed by the processor.
Optionally, the chip further comprises a memory, the memory being electrically connected to the processor.
Optionally, the chip may further comprise a communication interface.
In a tenth aspect, there is provided a computer readable storage medium storing a computer program which when executed by a processor is capable of carrying out any one of the methods of the first, second or third aspects.
In an eleventh aspect, there is provided a computer program product comprising a computer program capable of carrying out any one of the methods of the first, second or third aspects when the computer program is executed by a processor.
Drawings
Fig. 1 is a schematic diagram of a scene of man-machine interaction using a line of sight according to an embodiment of the present application.
Fig. 2 is a schematic diagram of an interface of man-machine interaction according to an embodiment of the present application.
FIG. 3 is a schematic illustration of another human-machine interaction interface according to an embodiment of the present application.
Fig. 4 is a schematic flow chart of a gaze point detection method of an embodiment of the present application.
Fig. 5 is a schematic flow chart of a training method of a face model according to an embodiment of the present application.
FIG. 6 is a schematic diagram of another training process for a face model according to an embodiment of the present application.
FIG. 7 is a schematic flow chart diagram of a method of training a predictive model in accordance with an embodiment of the application.
FIG. 8 is a schematic diagram of a training process of another predictive model in accordance with an embodiment of the application.
Fig. 9 is a schematic flow chart of a line-of-sight detection process of an embodiment of the present application.
Fig. 10 is a schematic diagram of a gaze point detection apparatus in an embodiment of the present application.
Fig. 11 is a schematic diagram of a training device for a face model according to an embodiment of the present application.
Fig. 12 is a schematic diagram of a training apparatus for a predictive model according to an embodiment of the present application.
Fig. 13 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present application.
Fig. 14 is a schematic hardware structure of a computer device according to an embodiment of the present application.
Detailed Description
The following describes embodiments of the present application with reference to the drawings. The gaze point detection method provided by the application can be applied to various human-computer interaction scenarios such as eye-movement interaction, line-of-sight interaction, or gaze interaction. The gaze point detection method provided by the application can be used in devices such as terminal devices and computer devices that support human-computer interaction by using the eyes.
Fig. 1 is a schematic diagram of a scene of man-machine interaction using a line of sight according to an embodiment of the present application. As shown in fig. 1, in the scene, the user looks at a screen, which may be a screen or a projection screen of various types of computer devices, and performs a corresponding operation by a position at which the user looks, that is, a related operation at the gaze point according to the user's gaze point.
In fig. 1 (a), user A looks at point A on the screen. If the image of user A captured by the camera on the screen shows that the gaze point of user A is A, the operation at A can be performed accordingly. For example, if an application icon is at A, the application may be opened; if an interaction option is at A, that option may be selected. To illustrate: assuming a shooting game is running on the screen, user A shoots at whichever position he or she looks at, that is, a shot is fired at gaze point A according to the analysis of the image of user A captured by the camera. Assuming a treasure-hunting game is running on the screen, whichever treasure box user A looks at is opened, that is, the treasure box at gaze point A is opened according to the analysis of the image of user A captured by the camera. If a weather application, i.e., weather software, is at position A, the weather software is opened once the analysis of the image of user A captured by the camera shows that the gaze point of user A is A. Other examples are not listed one by one.
In fig. 1 (b), the user looks at B on the screen, and if the gaze point of the user is detected to be B, the operation at B can be performed accordingly. For related descriptions, refer to those of fig. 1 (a). For example, assuming that a close option box is at position B, when the analysis of the image of user A captured by the camera shows that the gaze point of user A is B, the current page is closed or the software currently open is exited. The interaction option can also be any of various interaction options such as a selection box, page turning, pull-down, or clicking, which are not listed one by one.
It should be appreciated that fig. 1 mainly gives one example of a line-of-sight interaction scenario, in which other situations may exist. For example, the user may be using virtual reality glasses, and the screen is a virtual projection and no longer a physical screen. For another example, the camera that captures the image may not be on-screen, but external. That is, the gaze point detection scheme provided in the embodiments of the present application may be applied to a scene where human-computer interaction can be performed by using eyes.
In the embodiment of the present application, the computer device may be a mobile phone, a computer, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (augmented reality, AR)/Virtual Reality (VR) device, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (personal digital assistant, PDA), a video phone, a projection device, a network television device, and other terminal devices. The intelligent terminal can be, for example, an intelligent television, an intelligent bracelet or an intelligent screen. The system can also be other computer devices such as a server, a computer, a centralized control box, a cloud device and the like which can execute eye movement interaction.
Fig. 2 is a schematic diagram of an interface of man-machine interaction according to an embodiment of the present application. Fig. 2 is presented using a mobile phone interface as an example, but it should be understood that other computer devices may implement the man-machine interaction process shown in fig. 2. As shown in fig. 2, a plurality of applications are included in the control interface 210. The eye movement control mode is triggered by a hard key, voice wakeup, or a touch screen clicking operation, and the display then jumps to the control interface 220. In the control interface 220, the user is reminded to look at the dot in the center of the screen, and the dot may be a relatively bright color such as red for reminding. Within the circle in the control interface 220 is a countdown, which reserves a preparation time for the user. After the countdown ends, the front camera is used to capture a face image of the user. The face image captured at this time is a face image of a calibration gaze point, such as the first face image herein, and the dot at the center of the screen in the control interface 220 is an example of the calibration gaze point, whose coordinates are known. The display then moves to the control interface 230. In the control interface 230, the user is reminded to look at the target gaze point (i.e., the gaze point to be detected). The front camera is used to capture a face image of the user again, and the face image captured at this time is the face image of the target gaze point, i.e., the face image of the gaze point to be detected, such as the second face image herein, whose coordinates are unknown. The coordinates of the target gaze point are obtained by using the scheme of the embodiments of the application. Assuming that the resulting coordinates of the target gaze point are at the icon of the calculator, as shown by the control interface 240, an operation of opening the calculator is performed, as shown in the control interface 250.
FIG. 3 is a schematic illustration of another human-machine interaction interface according to an embodiment of the present application. Fig. 3 is presented using a mobile phone interface as an example, but it should be understood that other computer devices may implement the man-machine interaction process shown in fig. 3. As shown in fig. 3, the control interface 310 displays the interface of a treasure-hunting game. The eye movement control mode is triggered by a hard key, voice wakeup, or a touch screen clicking operation, or the eye interaction mode is selected in the mode selection of the game, and the display then jumps to the control interface 320. In the control interface 320, the user is reminded to look at the dot in the center of the screen, and the dot may be a relatively bright color such as red for reminding. Within the circle in the control interface 320 is a countdown, which reserves a preparation time for the user. After the countdown ends, the front camera is used to capture a face image of the user. The face image captured at this time is a face image of a calibration gaze point, such as the first face image herein, and the dot at the center of the screen in the control interface 320 is an example of the calibration gaze point, whose coordinates are known. The display then moves to the control interface 330. In the control interface 330, the user is prompted to begin the game, i.e., to begin the treasure hunt. The front camera continuously captures face images of the user at preset time intervals, and the face images captured at this stage are face images of a target gaze point, i.e., face images of the gaze point to be detected, such as the second face image herein, whose coordinates are unknown. The coordinates of the target gaze point are obtained by using the scheme of the embodiments of the application. Assuming that the coordinates of the target gaze point are at the box numbered 3, as shown by the control interface 340, an operation of opening the box is performed, and the box numbered 3 shown in the control interface 350 is opened.
It should be understood that fig. 2 and fig. 3 only show two specific examples of human-computer interaction by eyes, and in practical application, the examples are not limited thereto, and are not listed individually for brevity.
Fig. 4 is a schematic flow chart of a gaze point detection method of an embodiment of the present application. The method of fig. 4 may be used in any of the scenarios shown in fig. 1-3, and for ease of understanding, the following description will take as an example that the gaze point is a point on the screen of the computer device.
S401, acquiring a first face image of the calibration gaze point and a second face image of the target gaze point.
The coordinates of the calibration gaze point are known. Assuming that the gaze point is on the screen of the computer device, the coordinates of the calibration gaze point are the relative position coordinates of that point on the screen of the computer device.
The first face image represents a face image captured while the user is looking at the calibration gaze point, so the gaze point of the person in the first face image is the calibration gaze point. In actual operation, the user may first be prompted to look at a certain calibration point, and a face image of the user is then collected to obtain the first face image; that calibration point is the calibration gaze point. For example, in the scenario shown in fig. 2, the dot at the center of the screen in the control interface 220 may serve as an example of the calibration gaze point, and the image captured at the end of the countdown in the control interface 220 is an example of the first face image. Likewise, in the scenario shown in fig. 3, the dot at the center of the screen in the control interface 320 may serve as an example of the calibration gaze point, and the image captured at the end of the countdown in the control interface 320 is an example of the first face image. It will be appreciated that the center of the screen is only one example of a calibration gaze point; in an actual scene, the calibration gaze point may be another known point, such as any point on the screen in the scenes shown in figures 2 and 3. When the center of the screen is taken as the calibration point, the coordinates of other points on the screen are represented more conveniently.
The second face image is used for representing a face image shot when the user looks at the target point of regard, so that the point of regard of the person in the second face image is the target point of regard.
The target gaze point may be understood as a point at which a user desiring to know is gazing, so that the coordinates of the target gaze point are not known and need to be detected, and may also be referred to as a gaze point to be detected.
It should be noted that one calibration gaze point may be used to detect more than one target gaze point. That is, the first face image of a calibration gaze point is not used only once; after the calibration image is acquired, it can be used for any target gaze point. For example, assume that T1, T2 and T3 are three different moments, T1 is earlier than T2, T2 is earlier than T3, and the coordinates of the gaze point at these three moments are different. The face images at the three moments may be acquired in real time at the respective moments, or may come from a stored image sequence. The image of the gaze point at time T1 is taken as the first face image, and the image of the gaze point at time T2 as the second face image. The gaze point at time T1 is a calibration gaze point and its coordinates are known; the gaze point at time T2 is a target gaze point and its coordinates are unknown; the gaze point at time T3 is another target gaze point whose coordinates differ from those of the gaze point at time T2 and are also unknown. The frontal face processing of S402 is performed on the images at the three moments T1, T2 and T3, which corresponds to obtaining predicted front face images at the three moments respectively; the offset prediction of S403 and the operation of determining the coordinates of the target gaze point in S404 are performed on the predicted front face images at the two moments T1 and T2 to obtain the coordinates of the target gaze point at time T2; and the operations of S403 and S404 are performed on the predicted front face images at the two moments T1 and T3 to obtain the coordinates of the target gaze point at time T3. In short, recalibration is not required every time a target gaze point needs to be detected.
For example, in the scenario shown in fig. 2, the image captured at the end of the countdown in the control interface 230 is an example of the second face image, and the user does not know where to look when capturing the image. Then, the gaze point detection method according to the embodiment of the present application is used to determine the position of the target gaze point as shown in the control interface 240, where the circle at the calculator in the control interface 240 represents the coordinate of the detection result of the target gaze point. For another example, in the scenario shown in fig. 3, the image captured at the end of the countdown in the control interface 330 is an example of the second face image, and the user does not know where to look when capturing the image. Then, the gaze point detection method in the embodiment of the present application is used to determine the position of the target gaze point as shown in the control interface 340, where the circle at the box numbered 3 in the control interface 340 represents the coordinate of the detection result of the target gaze point.
Alternatively, the coordinates of the calibration point of regard may be set to (0, 0), that is, as the origin, and the offset amount of the target point of regard is the coordinates thereof, for example, assuming that the offset amount is (Δx, Δy) obtained in step S403 described below by the method shown in fig. 4, the offset amount is the coordinates of the detection result of the target point of regard, that is, (0+Δx,0+Δy) = (Δx, Δy). Based on this example, assuming that the coordinates of the calibration gaze point are not the origin, and are (x, y), the coordinates of the detection result of the target gaze point are (x+Δx, y+Δy).
In some implementations, the calibration gaze point is a screen center point, so as to conveniently represent coordinates of the target gaze point.
In step S401, the face image may be acquired in real time by using an image acquisition device, or the stored face image may be read from a storage device, or the face image may be received from a network through a communication interface. The information of the calibration gaze point may be read from the device, for example, the application scenario shown in fig. 2 and fig. 3, where the calibration gaze point is a preset screen center point of the terminal device, and the information may be stored in an internal storage unit of the terminal device.
S402, performing frontal face processing on the first face image and the second face image by using the frontal face model to obtain a first predicted frontal face image and a second predicted frontal face image.
Wherein the first predicted front face image represents a predicted front face image of the first face image and the second predicted front face image represents a predicted front face image of the second face image.
In the embodiments of the application, the frontal face model is used to convert a face image input to it into a frontal face. The process of converting a face image into a frontal face is referred to simply as frontal face processing. It should be understood that the input face image may be a frontal face or a side face. If it is a side face, the frontal face processing not only turns the side face to the front but also fills in the information that is missing because of the side pose, so the side face is truly changed into a frontal face rather than merely undergoing a coordinate-system conversion; in other words, the side face is completed into a frontal face. In particular, in the embodiments of the application, the information of the eyes is supplemented. If the face image input to the frontal face model is already a frontal face, the frontal face processing makes the face information it contains even richer, which can be understood as a fine-tuning process.
It should also be understood that, in the embodiments of the present application, the front face and the side face are defined with respect to the camera coordinate system: the coordinate system of the front face is parallel to the camera coordinate system, while the coordinate system of the side face is not. Alternatively, it can be understood that the plane of the face in a front face image is parallel to the plane of the camera, whereas the plane of the face in a side face image is not parallel to the plane of the camera. The camera may also be a video camera or another device capable of capturing images. In actual operation, for example, the front face image is an image captured while the face is directly facing the lens, and the side face image is an image captured while the face is not directly facing the lens.
In the task of line-of-sight detection or gaze point detection, the frontal image includes more information about the eyes, so that the direction or position of the eye gaze can be detected relatively more accurately. However, in a practical scenario, the situation of the side face is very common, and may be even more than the front face. For example, when a camera shoots a person's activities in an entire room, there is little possibility that the face of the person faces the camera. If the conventional scheme is adopted for line-of-sight detection, the line-of-sight and the point-of-gaze are difficult to determine due to the side face in the image, so that the accuracy of the detection result is too low.
Aiming at the problems, the face image is subjected to frontal face processing by using the frontal face model, so that the predicted frontal face image obtained after processing contains more effective information, and the gaze point detection is performed based on the predicted frontal face image, so that the accuracy of a detection result can be effectively improved. Although there is a method of converting a side face into a front face in the conventional scheme, the accuracy is still low because the problem of information missing caused by the side face cannot be solved by the simple coordinate system conversion.
In some implementations, the frontal face model may include an encoder and a decoder. The encoder is used to extract the feature vector of the face image input to the frontal face model; the decoder is used to convert the feature vector into the front face image corresponding to that face image. Performing step S402 may include the following operations:
the encoder performs feature extraction on the first face image to obtain the feature vector of the first face image, and the decoder processes the feature vector of the first face image to obtain the first predicted front face image;
the encoder performs feature extraction on the second face image to obtain the feature vector of the second face image, and the decoder processes the feature vector of the second face image to obtain the second predicted front face image.
It should be noted that, as described below, the feature vector may include a face feature vector and a line-of-sight feature vector; that is, the feature vector obtained when the encoder extracts features from a face image (for example, the first face image or the second face image) includes a face feature vector and a line-of-sight feature vector. When processing the feature vector to obtain the predicted front face image, the decoder may process only the face feature vector, or it may process both the face feature vector and the line-of-sight feature vector. If the decoder processes only the face feature vector, the amount of computation is relatively small; if it processes both vectors, the amount of computation is relatively large, but because the line-of-sight feature vector contains abundant information related to the line of sight, the obtained predicted front face image is better suited to the task of gaze point detection, and the accuracy of the detection result is further improved when such a predicted front face image is used to detect the gaze point.
That is, the encoder may perform feature extraction on a face image (for example, the first face image or the second face image) input to the encoder, to obtain a face feature vector and a line of sight feature vector, and, taking the first face image as an example, the face feature vector of the first face image and the line of sight feature vector of the first face image are obtained through the encoder processing. In this case, in one example, the decoder may process only the face feature vector, for example, the first face image, that is, the decoder may process only the face feature vector of the first face image to obtain the first predicted front face image. In another example, the decoder may process both the face feature vector and the line of sight feature vector, for example, the first face image, that is, the decoder processes both the face feature vector of the first face image and the line of sight feature vector of the first face image to obtain the first predicted front face image. The processing procedure of the second face image is similar to that of the first face image, and will not be described again, so long as the first face image in the above example is replaced by the second face image.
In the above example, when the decoder processes only the face feature vector, the complexity of the computation is reduced and the inference speed is increased. It will be appreciated that the task of turning the side face to the front face is accomplished even without processing the line-of-sight feature vector; that is, in this example, the predicted front face image obtained by the decoder from the face feature vector alone already includes relatively rich eye information, because the eyes are an essential part of the face and the completion of the eyes is included in the completion of the information of the whole face. Therefore, compared with the traditional scheme in which only a coordinate conversion from side face to front face is carried out, the front face converted from the side face here is a truly complete face, avoiding the situation in the traditional scheme where the face information missing because of the side pose remains missing.
It should also be appreciated that in this example, since the decoder processes only face feature vectors, the encoder may extract only face feature vectors of the face image without extracting line-of-sight feature vectors thereof, thereby further reducing the amount of computation.
In the above example, when the decoder processes both the face feature vector and the line-of-sight feature vector, the amount of computation is larger than when only the face feature vector is processed, but the obtained predicted front face image includes richer line-of-sight-related information. Such a predicted front face image is better suited to the task of gaze point detection, so the accuracy of the detection result is further improved when it is used to detect the gaze point. In short, in this example, the task of gaze point detection is further fitted on the basis of giving the frontal face model the capability of turning the side face to the front face, enhancing the processing of line-of-sight-related features, so that the predicted front face image contains more line-of-sight-related information than when only the face feature vector is processed.
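As an illustration only, the variant in which the decoder consumes both vectors could be sketched as below, simply by concatenating the two vectors before decoding; the concatenation strategy and all dimensions are assumptions, and the placeholder tensors merely stand in for the encoder outputs.

```python
import torch
import torch.nn as nn

face_dim, gaze_dim = 128, 64   # assumed feature dimensions

# Decoder variant that consumes the face feature vector and the
# line-of-sight feature vector together.
decoder_both = nn.Sequential(
    nn.Linear(face_dim + gaze_dim, 64 * 8 * 8), nn.ReLU(),
    nn.Unflatten(1, (64, 8, 8)),
    nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
    nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
)

face_vec = torch.randn(1, face_dim)   # placeholder for the encoder's face feature vector
gaze_vec = torch.randn(1, gaze_dim)   # placeholder for the encoder's line-of-sight feature vector
predicted_front_face = decoder_both(torch.cat([face_vec, gaze_vec], dim=1))
```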
It should be further noted that, since the method shown in fig. 4 uses an existing face model (i.e., a pre-trained face model, such as the target face model below) to perform reasoning, that is, a process of application, the face feature vector and the line-of-sight feature vector belong to intermediate quantities, that is, process quantities, generated in the face model during the processing. For step S402, the frontal model is an integral body, the input is the first face image and the second face image, and the output is the first predicted frontal image and the second predicted frontal image, respectively.
S403, processing the first predicted front face image and the second predicted front face image by using the prediction model to obtain the offset of the target fixation point and the calibration fixation point.
In the embodiments of the application, the prediction model is mainly used to obtain the offset between the gaze points of the persons in the two face images input to it. It should be understood that there is no restriction on the two face images input to the prediction model; that is, each of the two images may be either a frontal face image or a side face image.
In some examples, the face image input to the prediction model may also be an image of only an eye portion, and since most of the gaze point related information is included in the eye portion and has a small correlation with other parts of the face, the amount of data to be processed can be reduced by predicting only the image of the eye portion, and the processing rate can be increased.
The prediction model in step S403 processes the two predicted front images obtained in step S402, and since the two predicted front images include relatively rich eye information, the prediction of the offset in step S403 will also be relatively more accurate.
In some implementations, before step S403, the two predicted front face images obtained in step S402 may be further processed to obtain images of eye portions of the two predicted front face images, and then the images of the eye portions of the two predicted front face images are input into a prediction model, so as to perform step S403.
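The patent does not specify how the eye portion is located; as a heavily simplified sketch, one might take a fixed horizontal band of the aligned predicted front face image as the eye region (the fixed fractions are assumptions, and a landmark-based crop could be used instead):

```python
import torch

def crop_eye_region(front_face_image, top=0.25, bottom=0.5):
    """Sketch: slice a fixed band of an aligned front face image as the eye region.
    front_face_image is a tensor shaped (..., H, W)."""
    h = front_face_image.shape[-2]
    return front_face_image[..., int(top * h):int(bottom * h), :]
```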
It should be noted that, as described above, one calibration gaze point may be used to detect coordinates of more than one target gaze point, so that it is not necessary to repeatedly process face images of the calibration gaze point when step S403 is performed. Continuing with the example of three different moments T1, T2 and T3, T1 is earlier than T2, T2 is earlier than T3, and the coordinates of the gaze point at the 3 moments are different. The face images at three moments may be acquired in real time at three moments respectively, or may be a stored image sequence. The image of the target point of gaze at time T1 is assumed to be the first face image and the image of the point of gaze at time T2 is assumed to be the second face image. The gaze point at time T1 is a calibrated gaze point and the coordinates are known; the point of regard at time T2 is the target point of regard and the coordinates are unknown; the point of gaze at time T3 is another target point of gaze different from the point of gaze at time T2 in terms of coordinates, and the coordinates are unknown. Therefore, the operation of performing the forward facing processing at S402 on the images at three times T1, T2, and T3 corresponds to obtaining predicted forward face images at three times, respectively. When executing S403, it may be that the prediction model processes the predicted front face images at two times of T1 and T2 to obtain the offset of T2 relative to the time of T1; and the predictive model processes the predictive front face images at the two moments of T1 and T3 to obtain the offset of T3 relative to the moment of T1. That is, if there is more than one target gaze point, the intermediate processing result corresponding to the calibration gaze point may be saved for detection of other target gaze points, where the intermediate processing result is a predicted face image at time T1.
In some implementations, the prediction model may include a feature extraction network and a full-connection layer network. The feature extraction network is used for extracting the line-of-sight vectors of the two face images input to the prediction model, and the full-connection layer network is used for fusing the line-of-sight vectors to obtain the offset between the gaze points corresponding to the two face images. When step S403 is performed, the following operations may be included:
the feature extraction network performs feature extraction on the first predicted front face image and the second predicted front face image to obtain a sight line vector of the first predicted front face image and a sight line vector of the second predicted front face image;
and the full-connection layer network performs fusion processing on the sight line vector of the first predicted front face image and the sight line vector of the second predicted front face image to obtain the offset between the target fixation point and the calibration fixation point.
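A minimal sketch of such a prediction model is given below in PyTorch-style Python; the convolutional backbone, the layer sizes and the 2-dimensional (x, y) offset output are illustrative assumptions and not values taken from this application:

    import torch
    import torch.nn as nn

    class GazeFeatureNet(nn.Module):
        # Feature extraction network: maps a predicted front face image to a line-of-sight feature vector.
        def __init__(self, feat_dim: int = 128):
            super().__init__()
            self.conv = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4),
            )
            self.fc1 = nn.Linear(64 * 4 * 4, feat_dim)

        def forward(self, img: torch.Tensor) -> torch.Tensor:
            return self.fc1(self.conv(img).flatten(1))

    class OffsetPredictor(nn.Module):
        # Prediction model: fuses two line-of-sight feature vectors into a 2-D gaze point offset.
        def __init__(self, feat_dim: int = 128):
            super().__init__()
            self.features = GazeFeatureNet(feat_dim)
            self.fc2 = nn.Linear(2 * feat_dim, 2)  # fusion layer, outputs (dx, dy)

        def forward(self, calib_img: torch.Tensor, target_img: torch.Tensor) -> torch.Tensor:
            v1 = self.features(calib_img)   # line-of-sight vector of the first predicted front face image
            v2 = self.features(target_img)  # line-of-sight vector of the second predicted front face image
            return self.fc2(torch.cat([v1, v2], dim=1))

The forward pass mirrors the two operations of step S403: feature extraction of each image, followed by fusion of the two line-of-sight vectors in the fully connected layer.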
As described above, if there is more than one target gaze point, intermediate processing results corresponding to the calibration gaze point may be saved for detection of other target gaze points. In some implementations, the intermediate processing result may be a line-of-sight vector obtained after the feature extraction network performs feature extraction on the predicted front face image corresponding to the calibration gaze point.
Continuing with the above three different moments T1, T2 and T3 as an example, when S403 is executed, the feature extraction network may process the predicted front face images at the three moments T1, T2 and T3 respectively to obtain the line-of-sight vectors at the three moments; the full-connection layer then processes the line-of-sight vectors at the two moments T1 and T2 to obtain the offset of T2 relative to T1, and processes the line-of-sight vectors at the two moments T1 and T3 to obtain the offset of T3 relative to T1. This avoids repeatedly processing the image of the calibration gaze point.
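Continuing the OffsetPredictor sketch above, reusing the calibration line-of-sight vector could look as follows; the three random tensors stand in for the predicted front face images at T1, T2 and T3:

    import torch

    front_t1, front_t2, front_t3 = (torch.rand(1, 3, 64, 64) for _ in range(3))
    model = OffsetPredictor().eval()
    with torch.no_grad():
        v_t1 = model.features(front_t1)  # extracted once for the calibration gaze point at T1
        offset_t2 = model.fc2(torch.cat([v_t1, model.features(front_t2)], dim=1))  # offset of T2 relative to T1
        offset_t3 = model.fc2(torch.cat([v_t1, model.features(front_t3)], dim=1))  # offset of T3 relative to T1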
S404, obtaining the coordinates of the target gaze point according to the coordinates of the calibration gaze point and the offset.
That is, the coordinates of the target gaze point can be obtained by adding the offset obtained in S403 to the coordinates of the calibration gaze point.
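In code this final step reduces to simple coordinate arithmetic; the numbers below are placeholders, and treating the coordinates as screen pixels is an assumption:

    calibration_point = (640.0, 360.0)  # known coordinates of the calibration gaze point
    offset = (35.5, -12.0)              # offset obtained in S403
    target_point = (calibration_point[0] + offset[0], calibration_point[1] + offset[1])
    print(target_point)                 # coordinates of the target gaze point, here (675.5, 348.0)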
In the method shown in fig. 4, the image is first subjected to frontal face processing. For a side face image, this is equivalent to filling in the missing eye information; for a front face image, the eye information becomes even richer. The calibration gaze point is then introduced on the basis of these information-rich images, and the coordinates of the point to be measured are obtained by solving for the offset between the calibration gaze point and the point to be measured. This avoids the influence caused by individual differences and effectively improves the accuracy of gaze point detection.
The eyes of different persons differ, so even when the same point is observed under otherwise identical conditions, different eye information is produced. In conventional schemes, designing a separate line-of-sight detection method for each person gives poor compatibility and excessive cost, while a unified, undifferentiated detection method cannot eliminate the errors caused by such individual differences. In the embodiment of the application, this problem is solved by introducing the calibration gaze point: because the calibration gaze point and the target gaze point are both acquired while the same person watches the screen, the offset is obtained first and the target gaze point is then determined by combining it with the known coordinates of the calibration gaze point, which effectively eliminates the influence of individual differences and improves accuracy. The method of performing frontal face processing, determining the offset by using the calibration gaze point, and then determining the coordinates of the target gaze point therefore meets the high-precision requirement of gaze point detection and enables more accurate eye movement interaction.
In one implementation, the method shown in fig. 4 may further include: performing, according to the coordinates of the target gaze point, an operation at the target gaze point. The corresponding operation may include at least one of a clicking, sliding, touching, unlocking, paging, focusing, or game operation. For example, in the scenario shown in fig. 1, the operation corresponding to gaze point A or gaze point B is performed; in the scenario shown in fig. 2, the operation of opening the calculator is performed, and the result of the execution is shown as the control interface 250; in the scenario shown in fig. 3, the operation of opening the box, which is an example of a game operation, is performed, and the result of the execution is shown as the control interface 350. Other operations are not listed one by one.
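As a hedged illustration of triggering such an operation, the following sketch maps the detected gaze point onto a hypothetical on-screen control and clicks it; the control layout and the click handler are assumptions made only for illustration:

    # Hypothetical screen layout: control name -> (x1, y1, x2, y2) rectangle in pixels.
    controls = {"calculator": (0.0, 0.0, 200.0, 200.0), "gallery": (200.0, 0.0, 400.0, 200.0)}

    def perform_operation_at(x: float, y: float) -> None:
        for name, (x1, y1, x2, y2) in controls.items():
            if x1 <= x < x2 and y1 <= y < y2:
                print(f"click {name}")  # stand-in for the real click/slide/unlock handler
                return

    perform_operation_at(120.0, 80.0)  # would report a click on the calculator control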
Fig. 5 is a schematic flow chart of a training method of a frontal face model according to an embodiment of the present application.
S501, training data is acquired, wherein the training data comprises a front face image sample and a plurality of side face image samples corresponding to the front face image sample.
Assume the training data includes face images of a plurality of persons, where each group consists of front face images and side face images of the same person looking at the same point, that is, the above-mentioned front face image sample and the plurality of side face image samples corresponding to it. The front face images and side face images may be collected while the same person watches a fixed point, and the collected images may be used as the training data. In other words, the plurality of side face image samples corresponding to a front face image sample may be understood as a plurality of side face images of the same person as in that front face image sample, with the same gaze point as that front face image sample.
It should be understood that the method shown in fig. 5 is mainly aimed at training a model with the frontal face processing capability, so the capability of turning a side face into a front face is the more important one. During training, the front face image samples therefore mainly serve as labels, that is, reference images, while the side face image samples serve as training images. For example, if 3 front face images and 10 side face images of user A gazing at point A are acquired, the reference image corresponding to the 10 side face images may be any one of the 3 front face images. Of course, the 3 front face images can also be used as training images; in that case the reference image corresponding to the 3 front face images may be one of the 3 front face images, or a separately captured front face image of user A gazing at point A may be used as the reference image. It should also be understood that the above values are merely for ease of understanding the scheme and are not limiting.
In some implementations, the front face image sample and the side face image sample may be images of only the human face portion; that is, the front face images and side face images may be preprocessed first, and the image areas of the face portion extracted as the image samples. This eliminates the interference of background elements in the image and reduces the processing of image data of the background portion, thereby improving training efficiency and model accuracy.
S502, training the frontal face model by using the training data to obtain a target frontal face model.
The frontal face model comprises an encoder and a decoder, wherein the encoder is used for extracting feature vectors of the image input to the encoder, and the decoder is used for processing the feature vectors to obtain a predicted front face image.
The frontal face model may also be called a front face conversion model or a side-face-to-front-face conversion model, and its main purpose is to turn the face in a face image input to the frontal face model into a front face. If the input face image is already a front face, the frontal face processing of the model is equivalent to fine-tuning, and the predicted front face image differs little from the originally input front face image. If the input face image is a side face, the processing is equivalent to converting the side face to a front face angle, that is, filling in the missing face information after the coordinate-system conversion, so as to obtain the predicted front face image.
The target frontal model may be used to perform step S402 described above.
In one implementation, the feature vectors processed by the encoder may include a face feature vector and a line-of-sight feature vector. The face feature vector may be understood as a feature vector that includes features such as the facial features and face shape of the face. The line-of-sight feature vector may be understood as a feature vector that includes features such as the pupil of the eye and the angle of the eye.
In one implementation, only the face feature vector may be input to the decoder for processing to obtain the predicted front face image. In this implementation, the decoder processes only the face feature vector and does not process the line-of-sight feature vector, which reduces the operation complexity and simplifies the training process, but the decoder obtained through training only has the capability of processing the face feature vector and not the capability of processing the line-of-sight feature vector. It should be understood that the task of turning a side face into a front face can still be achieved even if the line-of-sight feature vector is not processed, that is, the frontal face model trained in this implementation has the capability of turning a side face into a front face.
In another implementation, both the face feature vector and the line-of-sight feature vector may be input to the decoder for processing to obtain the predicted front face image. In this implementation, the decoder needs to process both the face feature vector and the line-of-sight feature vector, so the operation complexity is relatively high, but the decoder obtained by training has the capability of processing both vectors. The predicted front face image obtained by such a frontal face model contains richer line-of-sight-related information, so it is better suited to the task of gaze point detection, and the accuracy of the detection result is further improved when the gaze point is detected by using this predicted front face image. In short, this implementation, on the basis of giving the frontal face model the capability of turning a side face into a front face, further fits the task of gaze point detection by enhancing the processing of line-of-sight-related features, so that the predicted front face image contains more line-of-sight-related information.
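A minimal PyTorch-style sketch of such an encoder-decoder frontal face model is given below, following the second implementation above (the decoder consumes both feature vectors); the network depth, the 64x64 image size and the feature dimensions are illustrative assumptions:

    import torch
    import torch.nn as nn

    class FaceEncoder(nn.Module):
        # Splits the encoding of an input face image into a face feature vector and a line-of-sight feature vector.
        def __init__(self, face_dim: int = 256, gaze_dim: int = 64):
            super().__init__()
            self.backbone = nn.Sequential(
                nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(4), nn.Flatten(),
            )
            self.face_head = nn.Linear(128 * 4 * 4, face_dim)
            self.gaze_head = nn.Linear(128 * 4 * 4, gaze_dim)

        def forward(self, img: torch.Tensor):
            h = self.backbone(img)
            return self.face_head(h), self.gaze_head(h)

    class FrontFaceDecoder(nn.Module):
        # Reconstructs a predicted front face image from the two feature vectors.
        def __init__(self, face_dim: int = 256, gaze_dim: int = 64):
            super().__init__()
            self.fc = nn.Linear(face_dim + gaze_dim, 128 * 8 * 8)
            self.deconv = nn.Sequential(
                nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
            )

        def forward(self, face_vec: torch.Tensor, gaze_vec: torch.Tensor) -> torch.Tensor:
            h = self.fc(torch.cat([face_vec, gaze_vec], dim=1)).view(-1, 128, 8, 8)
            return self.deconv(h)  # 64x64 predicted front face image

Splitting the encoder output into two heads is only one simple way of obtaining separate face and line-of-sight feature vectors; the application does not prescribe a specific network structure.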
In one example, the decoder may be specifically configured to process only the face feature vector to obtain the predicted front face image. In this example, when step S502 is performed, the side face image samples may be input to the encoder, resulting in a face feature vector and a line-of-sight feature vector; the face feature vector is input to the decoder to obtain a predicted front face image; and the parameters of the encoder and the decoder are adjusted according to the predicted front face image and the front face image sample corresponding to the side face image sample (namely, the reference image of the side face image sample), thereby obtaining a trained frontal face model. Likewise, the front face image samples may be input to the encoder, resulting in a face feature vector and a line-of-sight feature vector; the face feature vector is input to the decoder to obtain a predicted front face image; and the parameters of the encoder and the decoder are adjusted according to the predicted front face image and the label image of the front face image sample (namely, the reference image of the front face image sample), thereby obtaining a trained frontal face model.
In another example, the decoder may be specifically configured to process the face feature vector and the line-of-sight feature vector to obtain the predicted front face image. In this example, when step S502 is performed, the side face image samples may be input to the encoder, resulting in a face feature vector and a line-of-sight feature vector; the face feature vector and the line-of-sight feature vector are input to the decoder to obtain a predicted front face image; and the parameters of the encoder and the decoder are adjusted according to the predicted front face image and the front face image sample corresponding to the side face image sample (namely, the reference image of the side face image sample), thereby obtaining a trained frontal face model. Likewise, the front face image samples may be input to the encoder, resulting in a face feature vector and a line-of-sight feature vector; the face feature vector and the line-of-sight feature vector are input to the decoder to obtain a predicted front face image; and the parameters of the encoder and the decoder are adjusted according to the predicted front face image and the label image of the front face image sample (namely, the reference image of the front face image sample), thereby obtaining a trained frontal face model.
In another implementation, the training data further includes the known coordinates of the gaze point, that is, the coordinate label of the common gaze point of the front face image sample and the plurality of side face image samples.
In one example, when step S502 is performed, the line-of-sight feature vector is also input to a fully connected (FC) layer to obtain the coordinates of the predicted gaze point; and the parameters of the encoder, the decoder and the fully connected layer are adjusted according to the coordinates of the predicted gaze point and the known coordinates of the gaze point (namely, the coordinate label of the common gaze point of the front face image sample and the plurality of side face image samples), thereby obtaining the trained frontal face model. In this example, the constraint of the gaze point is added, which is equivalent to adding supervised learning; through the constraint of the gaze point, the accuracy of the target frontal face model can be further improved.
It should be noted that, when the parameters of the encoder, the decoder and the fully connected layer are adjusted according to the coordinates of the predicted gaze point and the known coordinates of the gaze point, the backward pass directly adjusts the parameters of the encoder and the fully connected layer; that is, the difference between the predicted value and the true value (label) of the gaze point directly affects the parameters of the encoder and the fully connected layer. When the parameters of the encoder change, the parameters of the decoder also change because of the coupling between the encoder and the decoder. Thus, during one training process, the parameters of the encoder, the decoder and the fully connected layer are all ultimately adjusted according to the predicted gaze point coordinates and the known gaze point coordinates.
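A hedged sketch of one such training step is shown below, reusing the FaceEncoder and FrontFaceDecoder sketched earlier; the loss functions (L1 reconstruction plus MSE on the gaze point), the equal loss weighting, the optimizer and the random placeholder data are all assumptions:

    import torch
    import torch.nn as nn

    encoder, decoder = FaceEncoder(), FrontFaceDecoder()
    gaze_fc = nn.Linear(64, 2)  # FC layer mapping the line-of-sight feature vector to gaze point coordinates
    params = list(encoder.parameters()) + list(decoder.parameters()) + list(gaze_fc.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)

    side_faces = torch.rand(8, 3, 64, 64)    # side face image samples (training images)
    front_labels = torch.rand(8, 3, 64, 64)  # corresponding front face label images
    gaze_labels = torch.rand(8, 2)           # known coordinates of the common gaze point (normalized, an assumption)

    face_vec, gaze_vec = encoder(side_faces)
    pred_front = decoder(face_vec, gaze_vec)  # predicted front face images
    pred_gaze = gaze_fc(gaze_vec)             # predicted gaze point coordinates

    loss = nn.functional.l1_loss(pred_front, front_labels) \
        + nn.functional.mse_loss(pred_gaze, gaze_labels)  # reconstruction loss + gaze point constraint
    optimizer.zero_grad()
    loss.backward()  # gradients reach the encoder, the decoder and the FC layer
    optimizer.step()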
The front face model obtained after training is the target front face model.
It should be appreciated that in this example, the fully connected layer is provided only to add a constraint and may not be included in the target frontal face model. That is, the target frontal face model consists of the trained encoder and decoder, while the fully connected layer only participates in the training phase and does not participate in the frontal face processing in the inference phase.
The target frontal face model obtained by the training method shown in fig. 5 has the capability of frontal face processing, that is, it can convert a face image input to the frontal face model/target frontal face model into a front face image. Because the training method trains with front face images and side face images of the same person, the trained model can convert a side face into a front face in a real sense: while converting the coordinate system, it also supplements the information missing from the side face.
To further understand the scheme of fig. 5, the following description is provided in connection with fig. 6. FIG. 6 is a schematic diagram of another training process for a face model according to an embodiment of the present application. Fig. 6 may be regarded as a specific example of the method shown in fig. 5. The frontal model in fig. 6, which includes an encoder and a decoder, is a neural network model.
The training process using the side face samples (i.e., side face image samples) is shown in fig. 6 (a). As shown in fig. 6 (a), a side face sample is input to the encoder, and the face feature vector and line-of-sight feature vector of the side face sample are obtained through the feature extraction of the encoder. The face feature vector and the line-of-sight feature vector are input to the decoder, and the decoder processes them to obtain a predicted front face image, which can be understood as the result of performing frontal face processing on the side face sample. At this point, the parameters of the encoder and the decoder may be adjusted according to the difference between the predicted front face image and the front face image corresponding to the side face sample (i.e., the label image or reference image of the side face sample), thereby completing one training of the frontal face model. As shown in fig. 6 (a), the training process further includes processing for the gaze point, that is, the line-of-sight feature vector is input to the FC layer, and the predicted gaze point coordinates, i.e., the predicted value of the gaze point in the side face sample, are obtained after the FC layer processing. At this point, the parameters of the encoder, the decoder and the FC layer may be adjusted according to the difference between the predicted gaze point coordinates and the gaze point label of the side face sample (i.e., the known coordinates of the gaze point of the side face sample), thereby completing one training of the frontal face model.
The training process using the front face samples (i.e., front face image samples) is shown in fig. 6 (b). As shown in fig. 6 (b), a front face sample is input to the encoder, and the face feature vector and line-of-sight feature vector of the front face sample are obtained through the feature extraction of the encoder. The face feature vector and the line-of-sight feature vector are input to the decoder, and the decoder processes them to obtain a predicted front face image, which can be understood as the result of performing frontal face processing on the front face sample. At this point, the parameters of the encoder and the decoder may be adjusted according to the difference between the predicted front face image and the front face image corresponding to the front face sample (i.e., the label image or reference image of the front face sample), thereby completing one training of the frontal face model. As shown in fig. 6 (b), the training process further includes processing for the gaze point, that is, the line-of-sight feature vector is input to the FC layer, and the predicted gaze point coordinates, i.e., the predicted value of the gaze point in the front face sample, are obtained after the FC layer processing. At this point, the parameters of the encoder, the decoder and the FC layer may be adjusted according to the difference between the predicted gaze point coordinates and the gaze point label of the front face sample (i.e., the known coordinates of the gaze point of the front face sample), thereby completing one training of the frontal face model.
It can be seen that in the training process shown in fig. 6, the side face samples, the front face samples, the gaze point labels, and the label images are the training data that participate in training. The label image of a side face sample is the front face image corresponding to that side face sample, and the label image of a front face sample may be another front face image corresponding to it or the front face sample itself. The gaze point label is the known gaze point coordinates of the side face sample or of the front face sample. For example, suppose a user gazes at a specific point while a plurality of front face images and a plurality of side face images are collected: the coordinates of the specific point are the gaze point label, the front face images are the front face samples, the side face images are the side face samples, and the front face images can be used as the label images of both the front face samples and the side face samples.
FIG. 7 is a schematic flow chart diagram of a method of training a predictive model in accordance with an embodiment of the application.
S701, acquiring training data, wherein the training data comprises: a first face image sample of a first gaze point, a second face image sample of a second gaze point, and offset labels of the first gaze point and the second gaze point.
It will be appreciated that the training data comprises a plurality of face image samples of known gaze points. The samples are not limited to front faces or side faces; either the coordinates of the gaze point of each sample are known, or the offset between the gaze points of every two samples is known.
Alternatively, the face image sample may be an image of only the face portion, so as to exclude the influence of background elements in the sample. The face image sample may also be an image of only the eye portion: since the gaze point is mainly related to the eyes and has little relevance to other parts of the face, only the image of the eye portion may be used as the face image sample. However, if the prediction model is trained only on images of the eye portion, then in the inference stage the image to be detected needs to be preprocessed first, so that only the image of the eye portion is input into the prediction model.
The first and second gaze points are only for distinguishing between the two gaze points, and there are no other limitations, and the first and second face image samples are only for respectively corresponding to the first and second gaze points, and there are no other limitations.
If the training data includes face images of a plurality of persons, face images (possibly a front face or a side face) of the same person looking at different points can be used as the face images in the training data.
In order to improve the training effect, the face image samples of the same person may be collected from a fixed position with only slight changes of head posture.
It should be appreciated that the method shown in fig. 7 is mainly aimed at training a model with the capability of predicting the gaze point offset, so the face image samples may be either front faces or side faces during training. However, since a front face contains richer eye information, the embodiment of the application may make predictions on images after frontal face processing, that is, the face images input to the prediction model are all front faces (front face image samples, or predicted front face images obtained by frontal face processing). Therefore, to further improve the training effect, the change of head posture may be limited to horizontal movement, that is, the face coordinate system stays parallel to the coordinate system of the image acquisition device while the head posture changes, so that all the captured face images are front face images; in this case the face image samples in the training data are all front face samples.
It should be noted that the first face image sample is different from the first face image described above: the first face image sample is a sample image of the training phase, while the first face image is the image of the calibration gaze point acquired in the inference phase (i.e., the application phase of gaze point detection). Similarly, the second face image sample is different from the second face image described above: the second face image sample is a sample image of the training phase, while the second face image is the image of the target gaze point in the inference phase, i.e., the image to be processed in the application phase of gaze point detection, whose gaze point coordinates need to be inferred.
S702, training the prediction model by using training data to obtain a target prediction model.
The prediction model is used for processing two face images input into the prediction model to obtain the offset between the gaze points corresponding to the two face images.
The target prediction model may be used to perform step S403 described above.
In one implementation, the prediction model includes a feature extraction network and a full-connection layer network, the feature extraction network is used for extracting line-of-sight feature vectors of face images input to the prediction model, and the full-connection layer network is used for fusing the line-of-sight feature vectors to obtain offset of gaze points corresponding to the face images input to the prediction model.
In one example, when step S702 is performed, the first face image sample and the second face image sample may be respectively input to the feature extraction network, so as to obtain a line-of-sight feature vector of the first face image sample and a line-of-sight feature vector of the second face image sample, respectively; inputting the sight line feature vector of the first face image sample and the sight line feature vector of the second face image sample into a full-connection layer network to obtain a predicted offset, wherein the predicted offset is used for representing a predicted value of the offset between the first gaze point and the second gaze point; and adjusting parameters of the feature extraction network and the full-connection layer according to the predicted offset and the offset label, thereby obtaining a target prediction model.
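A hedged sketch of one training step under this example is shown below, reusing the OffsetPredictor sketched earlier in this description; the MSE loss, the optimizer, the batch size and the random placeholder data are assumptions:

    import torch

    model = OffsetPredictor()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    first_samples = torch.rand(8, 3, 64, 64)   # first face image samples (first gaze point)
    second_samples = torch.rand(8, 3, 64, 64)  # second face image samples (second gaze point)
    offset_labels = torch.rand(8, 2)           # offset labels of the first and second gaze points

    pred_offset = model(first_samples, second_samples)  # predicted offset
    loss = torch.nn.functional.mse_loss(pred_offset, offset_labels)
    optimizer.zero_grad()
    loss.backward()  # adjusts both the feature extraction network and the fusion FC layer
    optimizer.step()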
It should be understood that the fully connected layer in the prediction model and the fully connected layer in the frontal face model are the same type of structure, but are not the same layer.
Alternatively, the feature extraction network may be a convolutional neural network (convolutional neural networks, CNN), a deep neural network (deep neural networks, DNN), or another neural network capable of extracting line-of-sight features from images.
The target prediction model trained by the training method shown in fig. 7 has the capability of obtaining the gaze point offset, that is, the offset between the gaze points in the two face images input to the prediction model/target prediction model can be obtained. Because the training method trains with face images whose gaze points are known, the trained model is not affected by individual differences: although the images of different people viewing the same point may differ, the offset between two images of the same person viewing two points is free of such individual differences. Moreover, the input images of the target prediction model trained by the training method shown in fig. 7 are not limited to front faces or side faces.
To further understand the scheme of fig. 7, the following description is provided in connection with fig. 8. FIG. 8 is a schematic diagram of a training process of another predictive model in accordance with an embodiment of the application. Fig. 8 may be regarded as a specific example of the method shown in fig. 7. The predictive model in fig. 8, which includes a feature extraction network and FC layer, is a neural network model.
As shown in fig. 8, the calibration sample and the test sample are input into the feature extraction network to obtain line-of-sight feature vector #1 and line-of-sight feature vector #2, where line-of-sight feature vector #1 and line-of-sight feature vector #2 are the line-of-sight feature vectors corresponding to the calibration sample and the test sample, respectively.
The feature extraction network as shown in fig. 8 includes CNN and FC1 layers, that is, fig. 8 gives one example in which the feature extraction network may be a combination of CNN and FC1 layers. It should be appreciated that since the CNN itself includes at least one FC layer, the combination of CNN and FC1 layer can also be considered a new CNN. It should also be understood that fig. 8 is merely an example of a feature extraction network, and that in a practical scenario, other feature extraction networks may be selected or designed as needed, so long as line-of-sight features in an image can be extracted.
The calibration sample and the test sample are both face images of known gaze points. The calibration sample and the test sample may therefore be selected arbitrarily from the face image samples; the names are not limiting and are used only to help distinguish the two when describing the scheme. Either one of the calibration sample and the test sample may be regarded as an example of the first face image sample of the first gaze point, and the other may then be regarded as an example of the second face image sample of the second gaze point.
Line-of-sight feature vector #1 and line-of-sight feature vector #2 are input to the FC2 layer, and the FC2 layer fuses them to obtain a predicted offset, which can be understood as the predicted value of the offset between the gaze point of the person in the calibration sample and the gaze point of the person in the test sample. At this point, the parameters of the feature extraction network and the FC2 layer may be adjusted according to the difference between the predicted offset and the offset label (i.e., the known offset between the gaze points in the calibration sample and the test sample), thereby completing one training of the prediction model.
Fig. 9 is a schematic flow chart of a gaze point detection process according to an embodiment of the present application. Fig. 9 may be regarded as an example of fig. 4, that is, an example of performing inference with the trained models, or an example of the inference phase of gaze point detection using the trained models.
As shown in fig. 9, a calibration chart and a target chart (i.e., a chart to be detected or a chart to be processed) of known coordinates are input to a target frontal model, so as to obtain a predicted frontal chart of the calibration chart and a predicted frontal chart of the target chart.
The calibration map may be regarded as an example of the first face image of the above-mentioned calibration gaze point, and the target map may be regarded as an example of the second face image of the above-mentioned target gaze point. The gaze point coordinates of the calibration map are thus known, and the coordinates of the gaze point in the target map need to be detected. The predicted front face map of the calibration map may be regarded as an example of the first predicted front face image described above, and the predicted front face map of the target map may be regarded as an example of the second predicted front face image described above. In the figure, the predicted front face map of the calibration map is denoted as the calibration predicted front face map, and the predicted front face map of the target map is denoted as the target predicted front face map.
The target frontal model can be regarded as an example of the frontal model in step S402. The target frontal model may be a target frontal model obtained by using any one of the training methods shown in fig. 5 or fig. 6.
As shown in fig. 9, the predicted front face map of the calibration map and the predicted front face map of the target map are input into the target prediction model, and after the target prediction model is processed, a predicted offset amount, that is, an offset amount between the calibration gaze point corresponding to the calibration map and the target gaze point corresponding to the target map, can be output.
The target prediction model can be regarded as an example of the prediction model in step S403. The target prediction model may be obtained by using any one of the training methods shown in fig. 7 or fig. 8.
After the offset is obtained, the coordinates of the target gaze point can be determined in combination with the coordinates of the calibration gaze point.
It should be noted that, in the scenario shown in fig. 9, one calibration map may correspond to at least one target map. When one calibration map corresponds to one target map, the process is as described above and is not repeated here. When one calibration map corresponds to a plurality of target maps, the calibration map does not need to be processed by the target frontal face model and the target prediction model each time. Assuming the target maps include a target map 1 and a target map 2, in the process shown in fig. 9 the target frontal face model may process the calibration map, the target map 1 and the target map 2 respectively to obtain the calibration predicted front face map, the target predicted front face map 1 and the target predicted front face map 2; the target prediction model processes the calibration predicted front face map and the target predicted front face map 1 to obtain the predicted offset between the target map 1 and the calibration map; and the target prediction model processes the calibration predicted front face map and the target predicted front face map 2 to obtain the predicted offset between the target map 2 and the calibration map. In addition, since the target prediction model may include a feature extraction network and a fully connected layer, taking fig. 8 as an example for ease of understanding, in the process shown in fig. 9 the target frontal face model may process the calibration map, the target map 1 and the target map 2 respectively to obtain the calibration predicted front face map, the target predicted front face map 1 and the target predicted front face map 2; the feature extraction network in the target prediction model (for example, the CNN and FC1 layers in fig. 8) processes the calibration predicted front face map, the target predicted front face map 1 and the target predicted front face map 2 respectively to obtain the line-of-sight feature vector of the calibration map, the line-of-sight feature vector 1 of the target map 1 and the line-of-sight feature vector 2 of the target map 2; the fully connected layer in the target prediction model (such as the FC2 layer in fig. 8) processes the line-of-sight feature vector of the calibration map and the line-of-sight feature vector 1 of the target map 1 to obtain the predicted offset between the target map 1 and the calibration map; and the fully connected layer in the target prediction model (such as the FC2 layer in fig. 8) processes the line-of-sight feature vector of the calibration map and the line-of-sight feature vector 2 of the target map 2 to obtain the predicted offset between the target map 2 and the calibration map.
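Putting the pieces together, a hedged end-to-end sketch of the flow in fig. 9 could look as follows, reusing the FaceEncoder, FrontFaceDecoder and OffsetPredictor sketched earlier; the images, the coordinates, and the composition of the target frontal face model from the encoder and decoder are assumptions for illustration only:

    import torch

    def to_front(encoder, decoder, img):
        face_vec, gaze_vec = encoder(img)
        return decoder(face_vec, gaze_vec)  # predicted front face map

    encoder, decoder, predictor = FaceEncoder(), FrontFaceDecoder(), OffsetPredictor()
    calib_map, target_map_1, target_map_2 = (torch.rand(1, 3, 64, 64) for _ in range(3))
    calib_xy = torch.tensor([[640.0, 360.0]])  # known coordinates of the calibration gaze point

    with torch.no_grad():
        calib_front = to_front(encoder, decoder, calib_map)  # calibration predicted front face map, computed once
        for target_map in (target_map_1, target_map_2):
            target_front = to_front(encoder, decoder, target_map)
            offset = predictor(calib_front, target_front)    # predicted offset relative to the calibration map
            target_xy = calib_xy + offset                    # coordinates of the target gaze point

The calibration line-of-sight feature vector could also be cached, as in the earlier sketch, instead of recomputing it inside the loop.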
The foregoing description of the method of the embodiments of the present application is provided primarily with reference to the accompanying drawings. It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown in order, these steps are not necessarily performed in the order shown in the figures. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of steps or a plurality of stages, which are not necessarily performed at the same time, but may be performed at different times, and the order of the steps or stages is not necessarily performed sequentially, but may be performed alternately or alternately with at least some of the other steps or stages. The apparatus according to the embodiments of the present application will be described below with reference to the accompanying drawings.
Fig. 10 is a schematic diagram of a gaze point detection apparatus in an embodiment of the present application. As shown in fig. 10, the apparatus 1000 includes an acquisition unit 1001 and a processing unit 1002. The apparatus 1000 may be integrated into a computing device or may be a device independent of a computing device.
In one implementation, the apparatus 1000 may be or be provided in a computer device corresponding to the screen in fig. 1.
In another implementation, the apparatus 1000 may also be or be provided in the terminal device shown in fig. 2 or fig. 3.
The apparatus 1000 can be used to perform any of the gaze point detection methods described above. For example, the acquisition unit 1001 may be used to perform step S401, and the processing unit 1002 may be used to perform steps S402 to S404. For another example, the processing unit 1002 may be further configured to perform the operations performed by the target face model and the target prediction model in fig. 9. Also for example, the processing unit 1002 may be further configured to perform: according to the coordinates of the target gaze point, an operation at the target gaze point is performed. The respective operation may include at least one of clicking, sliding, touching, unlocking, paging, focusing, or gaming operations.
In one implementation, the apparatus 1000 may further include a storage unit, configured to store data such as a face image. The storage unit may be integrated in the processing unit 1002 or may be a unit independent of the acquisition unit 1001 and the processing unit 1002.
Fig. 11 is a schematic diagram of a training device for a face model according to an embodiment of the present application. As shown in fig. 11, the training apparatus 2000 includes an acquisition unit 2001 and a training unit 2002. The training apparatus 2000 may be a computer device capable of neural network training, such as a server, cloud device, or the like.
Because the training process places higher requirements on data processing capability and requires a larger amount of data to be stored and computed, terminal devices such as ordinary mobile phones and tablet computers cannot usually serve as the training apparatus; the training apparatus is often a data processing device with stronger data processing capability, such as a server or a cloud device. Such a data processing device can nevertheless also serve as a gaze point detection apparatus, as long as it is capable of human-machine interaction based on eye actions.
The apparatus 2000 can be used to perform any of the training methods of the face model above. For example, the acquisition unit 2001 may be used to perform step S501, and the training unit 2002 may be used to perform step S502. The apparatus 2000 can be used to perform the operations of the training process shown in fig. 6.
In one implementation, the training device 2000 may further include a storage unit for storing data such as training data. The storage unit may be integrated in the training unit 2002 or may be a unit independent of the acquisition unit 2001 and the training unit 2002.
Fig. 12 is a schematic diagram of a training apparatus for a predictive model according to an embodiment of the present application. As shown in fig. 12, the training apparatus 3000 includes an acquisition unit 3001 and a training unit 3002. The training apparatus 3000 may be a computer device capable of neural network training, such as a server, cloud device, or the like.
Because the training process places higher requirements on data processing capability and requires a larger amount of data to be stored and computed, terminal devices such as ordinary mobile phones and tablet computers cannot usually serve as the training apparatus; the training apparatus is often a data processing device with stronger data processing capability, such as a server or a cloud device. Such a data processing device can nevertheless also serve as a gaze point detection apparatus, as long as it is capable of human-machine interaction based on eye actions.
The apparatus 3000 can be used to perform any of the above predictive model training methods. For example, the acquisition unit 3001 may be used to perform step S701, and the training unit 3002 may be used to perform step S702. The apparatus 3000 can be used to perform the operations of the training process shown in fig. 8.
In one implementation, training device 3000 may further include a storage unit for storing training data and the like. The storage unit may be integrated in the training unit 3002, or may be a unit independent of the acquisition unit 3001 and the training unit 3002.
Fig. 13 is a schematic diagram of a hardware structure of a terminal device according to an embodiment of the present application. The terminal device 100 shown in fig. 13 can be used to perform any of the gaze point detection methods of the embodiments of the present application.
In one implementation, the terminal device 100 shown in fig. 13 may be the terminal device shown in fig. 2 or fig. 3.
As shown in fig. 13, the terminal device 100 includes: radio Frequency (RF) circuitry 110, memory 120, input unit 130, display unit 140, sensor 150, audio circuitry 160, wireless fidelity (wireless fidelity, wiFi) module 170, processor 180, and camera 190. It will be appreciated by those skilled in the art that the various structures shown in fig. 13 do not constitute a limitation of the terminal device 100, and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components.
As shown in fig. 13, the RF circuit 110 may be used for receiving and transmitting signals during information transmission and reception or during a call; for example, after downlink information of a base station is received, it may be processed by the processor 180, and uplink data to be sent is transmitted to the base station. Typically, the RF circuit includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (low noise amplifier, LNA), a duplexer, and the like. In addition, the RF circuit 110 may also communicate with networks and other devices via wireless communication. In the embodiment of the present application, the RF circuit 110 may be configured to perform the operation of acquiring a face image or gaze point coordinates; for example, when the RF circuit 110 is used to perform step S401, this corresponds to an example of acquiring the first face image and/or the second face image through a communication interface.
The memory 120 may be used to store software programs and modules, and the processor 180 performs various functional applications and data processing of the terminal device 100 by running the software programs and modules stored in the memory 120. The memory 120 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, a phonebook, etc.) created according to the use of the handset, etc. In addition, the memory 120 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device. In the embodiment of the present application, the memory 120 may be used to store face images, calibration gaze point coordinates, the frontal face model, the prediction model, and intermediate data or detection results generated in the gaze point detection process.
The input unit 130 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the terminal device 100. In particular, the input unit 130 may include a touch panel 131 and other input devices 132. The touch panel 131, also referred to as a touch screen, may collect touch operations on or near it by a user (e.g., operations of the user on the touch panel 131 or near it using any suitable object or accessory such as a finger or a stylus), and drive the corresponding connection device according to a predetermined program. Alternatively, the touch panel 131 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into touch point coordinates, and sends them to the processor 180, and can receive commands from the processor 180 and execute them. In addition, the touch panel 131 may be implemented in various types such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 130 may include other input devices 132 in addition to the touch panel 131. In particular, the other input devices 132 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, etc. Taking the scenario shown in fig. 2 as an example, the input unit 130 may be used to receive the operation of triggering the eye-control mode on the control interface 210.
The display unit 140 may be used to display information input by a user or information provided to the user and various menus of the terminal device 100. The display unit 140 may include a display panel 141, and alternatively, the display panel 141 may be configured in the form of a liquid crystal display (liquid crystal display, LCD), an organic light-emitting diode (OLED), or the like. Further, the touch panel 131 may cover the display panel 141, and when the touch panel 131 detects a touch operation thereon or thereabout, the touch panel is transferred to the processor 180 to determine the type of the touch event, and then the processor 180 provides a corresponding visual output on the display panel 141 according to the type of the touch event. Although in fig. 13, the touch panel 131 and the display panel 141 implement the input and output functions of the terminal device 100 as two independent components, in some embodiments, the touch panel 131 and the display panel 141 may be integrated to implement the input and output functions of the terminal device 100.
The terminal device 100 may also include at least one sensor 150, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel 141 according to the brightness of ambient light, and a proximity sensor that may turn off the display panel 141 and/or the backlight when the terminal device 100 moves to the ear. As one of the motion sensors, the accelerometer sensor can detect the acceleration in all directions (typically three axes), and can detect the gravity and direction when stationary, and can be used for applications of recognizing the gesture of the terminal device 100 (such as horizontal-vertical screen switching, related games, magnetometer gesture calibration), vibration recognition related functions (such as pedometer, knocking), and the like; other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc. that may also be configured with the terminal device 100 are not described in detail herein.
WiFi belongs to a short-range wireless transmission technology, and the terminal device 100 can help the user send and receive e-mails, browse web pages, access streaming media and the like through the WiFi module 170, providing the user with wireless broadband Internet access. Although fig. 13 shows the WiFi module 170, it is understood that it is not an essential part of the terminal device 100 and may be omitted as needed without changing the essence of the scheme.
The processor 180 is a control center of the terminal device 100, connects respective parts of the entire terminal device 100 using various interfaces and lines, and performs various functions of the terminal device 100 and processes data by running or executing software programs and/or modules stored in the memory 120 and calling data stored in the memory 120, thereby performing overall monitoring of the terminal device 100. Optionally, the processor 180 may include one or more processing units. Alternatively, the processor 180 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 180.
In the embodiment of the present application, the processor 180 may be used to perform steps S402 to S404, for example, as an example of the processing unit 1002 described above.
The terminal device 100 may also include a camera 190. Alternatively, the position of the camera 190 on the terminal device 100 may be front-mounted or rear-mounted, which is not limited in the embodiment of the present application. If detection of the gaze point in the scene of fig. 1-3 is implemented, the camera 190 is front-facing.
Alternatively, the camera 190 may include a single camera, a dual camera, or a triple camera, etc., which is not limited in this embodiment. For example, the cameras 190 may include three cameras, one of which is a main camera, one of which is a wide angle camera, and one of which is a tele camera. Alternatively, when the terminal device 100 includes a plurality of cameras, the plurality of cameras may be all front-mounted, all rear-mounted, or one part of front-mounted, another part of rear-mounted, which is not limited in the embodiment of the present application.
In the embodiment of the present application, the camera 190 may be an example of the above-described obtaining unit 1001, and the camera 190 may be used to perform the operation of obtaining the face image in the gaze point detection method of the embodiment of the present application, for example, may be used to perform S401. When the camera 190 is used to perform S401, it is equivalent to capturing a face image in real time, for example, capturing a first face image or a second face image, with the camera 190. Taking the scenario shown in fig. 2 as an example, the camera 190 is a front camera for capturing face images at the end of countdown in the control interface 220 and the control interface 230. Taking the scenario shown in fig. 3 as an example, the camera 190 is a front camera for capturing a face image at the end of countdown in the control interface 320 and the control interface 330.
It should be understood that fig. 13 is merely an example of a hardware structure of a terminal device, and in practice, the structure of the terminal device includes, but is not limited to, that shown in fig. 13, and may include only a part of the structure shown in fig. 13.
Fig. 14 is a schematic hardware structure of a computer device according to an embodiment of the present application. As shown in fig. 14, the computer device 4000 includes: at least one processor 4001 (only one is shown in fig. 14), a memory 4002, and a computer program 4003 stored in the memory 4002 and executable on the at least one processor 4001, the processor 4001 implementing the steps in any of the training methods described above when executing the computer program 4003. For example, the computer device 4000 shown in fig. 14 can be used to perform the steps of any of the training methods of fig. 5-8.
It will be appreciated by those skilled in the art that fig. 14 is merely an example of a computer device and is not intended to be limiting, and that in practice a computer device may include more or less components than those illustrated, or may combine certain components, or different components, such as input-output devices, network access devices, etc.
The processor 4001 may be a central processing unit (central processing unit, CPU), another general purpose processor, a digital signal processor (digital signal processor, DSP), an application specific integrated circuit (application specific integrated circuit, ASIC), a field-programmable gate array (field-programmable gate array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, discrete hardware components, or the like. A general purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
Memory 4002 may be an internal storage unit of computer device 4000 in some embodiments, such as a hard disk or memory of computer device 4000. The memory 4002 may also be an external storage device of the computer device 4000 in other embodiments, such as a plug-in hard disk provided on the computer device 4000, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, or the like. Alternatively, memory 4002 may also include both internal and external memory units of computer device 4000. The memory 4002 is used for storing an operating system, an application program, a boot loader, data, and other programs and the like, such as program codes of the computer programs. The memory 4002 can also be used for temporarily storing data that has been output or is to be output.
It should be noted that, because the content of information interaction and execution process between the above devices/units is based on the same concept as the method embodiment of the present application, specific functions and technical effects thereof may be referred to in the method embodiment section, and will not be described herein again.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
The embodiment of the application also provides a terminal device, which comprises: at least one processor, a memory and a computer program stored in the memory and executable on the at least one processor, which when executed by the processor implements the steps of any gaze point detection method described above.
The embodiment of the application also provides a computer device, which comprises: at least one processor, a memory, and a computer program stored in the memory and executable on the at least one processor, which when executed by the processor performs the steps of any of the training methods described above.
An embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the various method embodiments described above.
An embodiment of the present application further provides a computer program product which, when run on a mobile terminal, causes the mobile terminal to perform the steps of the various method embodiments described above.
The integrated unit, if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes of the methods in the above embodiments of the present application may be implemented by a computer program instructing related hardware. The computer program may be stored in a computer-readable storage medium, and when executed by a processor, may implement the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, some intermediate form, or the like. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to a photographing device/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, and a software distribution medium, for example, a USB flash drive, a removable hard disk, a magnetic disk, or an optical disk. In some jurisdictions, in accordance with legislation and patent practice, the computer-readable medium may not include electrical carrier signals or telecommunications signals.
Each of the foregoing embodiments has its own emphasis; for parts that are not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/network device and method may be implemented in other manners. For example, the apparatus/network device embodiments described above are merely illustrative; the division of the modules or units is only a logical functional division, and there may be other division manners in actual implementation, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be in electrical, mechanical, or other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in this specification and the appended claims, the term "if" may be interpreted, depending on the context, as "when", "once", "in response to determining", or "in response to detecting". Similarly, the phrases "if it is determined" or "if [a described condition or event] is detected" may be interpreted, depending on the context, as "upon determining", "in response to determining", "upon detecting [the described condition or event]", or "in response to detecting [the described condition or event]".
In addition, in the description of the present application and the appended claims, the terms "first," "second," "third," and the like are used merely to distinguish between descriptions and are not to be construed as indicating or implying relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the application. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
The above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall be included in the protection scope of the present application.
Claims (27)
1. A gaze point detection method, comprising:
acquiring a first face image of a target gaze point and a second face image of a calibration gaze point;
performing frontalization processing on the first face image and the second face image by using a frontalization model to obtain a first predicted frontal face image and a second predicted frontal face image, wherein the first predicted frontal face image represents the predicted frontal face image of the first face image, and the second predicted frontal face image represents the predicted frontal face image of the second face image;
processing the first predicted frontal face image and the second predicted frontal face image by using a prediction model to obtain an offset between the target gaze point and the calibration gaze point;
and obtaining coordinates of the target gaze point according to coordinates of the calibration gaze point and the offset.
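For illustration only (not part of the claims): a minimal PyTorch sketch of the detection flow in claim 1. The module names `frontalization_model` and `prediction_model`, the tensor shapes, and the assumption that the second image is captured while the user gazes at the known calibration point are illustrative assumptions rather than the patent's actual implementation; the only operations taken from the claim are frontalizing both images, predicting the offset, and adding it to the calibration coordinates.

```python
import torch

def detect_gaze_point(first_face, second_face, frontalization_model,
                      prediction_model, calibration_xy):
    """Sketch of claim 1: frontalize both face images, predict the offset
    between the target gaze point and the calibration gaze point, then add
    the offset to the known calibration-point coordinates.

    first_face, second_face: (1, 3, H, W) image tensors.
    calibration_xy: (1, 2) tensor with the calibration gaze point (x, y).
    """
    frontalization_model.eval()
    prediction_model.eval()
    with torch.no_grad():
        first_frontal = frontalization_model(first_face)     # predicted frontal face image
        second_frontal = frontalization_model(second_face)   # predicted frontal face image
        offset_xy = prediction_model(first_frontal, second_frontal)  # (1, 2) offset
    return calibration_xy + offset_xy  # coordinates of the target gaze point
```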
2. The method of claim 1, wherein the frontalization model comprises an encoder and a decoder, the encoder is configured to extract a feature vector of a face image input to the frontalization model, and the decoder is configured to convert the feature vector into a frontal face image corresponding to the face image input to the frontalization model; and the performing frontalization processing on the first face image and the second face image by using the frontalization model to obtain the first predicted frontal face image and the second predicted frontal face image comprises:
the encoder performs feature extraction on the first face image to obtain a feature vector of the first face image, and the decoder processes the feature vector of the first face image to obtain the first predicted frontal face image;
and the encoder performs feature extraction on the second face image to obtain a feature vector of the second face image, and the decoder processes the feature vector of the second face image to obtain the second predicted frontal face image.
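For illustration only: one possible reading of the claim-2 frontalization model as an encoder-decoder pair. The layer sizes, the 64x64 input resolution, and the feature dimension are placeholder choices and are not specified by the patent.

```python
import torch
import torch.nn as nn

class FrontalizationModel(nn.Module):
    """Encoder extracts a feature vector from the input face image;
    decoder converts that vector into a predicted frontal face image."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Encoder: face image (3, 64, 64) -> feature vector (feat_dim,)
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),    # 64 -> 32
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),   # 32 -> 16
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),  # 16 -> 8
            nn.Flatten(),
            nn.Linear(128 * 8 * 8, feat_dim),
        )
        # Decoder: feature vector -> predicted frontal face image (3, 64, 64)
        self.decoder = nn.Sequential(
            nn.Linear(feat_dim, 128 * 8 * 8), nn.ReLU(),
            nn.Unflatten(1, (128, 8, 8)),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Sigmoid(),
        )

    def forward(self, face_image: torch.Tensor) -> torch.Tensor:
        feature_vector = self.encoder(face_image)
        return self.decoder(feature_vector)
```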
3. The method according to claim 1, wherein the prediction model comprises a feature extraction network and a fully connected network, the feature extraction network is configured to extract line-of-sight vectors of two face images input to the prediction model, and the fully connected network is configured to fuse the line-of-sight vectors to obtain an offset between the gaze points corresponding to the two face images; and the processing the first predicted frontal face image and the second predicted frontal face image by using the prediction model to obtain the offset between the target gaze point and the calibration gaze point comprises:
the feature extraction network performs feature extraction on the first predicted frontal face image and the second predicted frontal face image to obtain a line-of-sight vector of the first predicted frontal face image and a line-of-sight vector of the second predicted frontal face image;
and the fully connected network performs fusion processing on the line-of-sight vector of the first predicted frontal face image and the line-of-sight vector of the second predicted frontal face image to obtain the offset between the target gaze point and the calibration gaze point.
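For illustration only: a hypothetical sketch of the claim-3 prediction model, with a shared convolutional feature extractor and a fully connected fusion head. The layer sizes and vector dimensions are assumptions made for this snippet.

```python
import torch
import torch.nn as nn

class PredictionModel(nn.Module):
    """Feature extraction network produces a line-of-sight vector for each
    predicted frontal face image; the fully connected network fuses the two
    vectors into a 2D offset between the corresponding gaze points."""

    def __init__(self, gaze_dim: int = 64):
        super().__init__()
        self.feature_extractor = nn.Sequential(      # shared by both images
            nn.Conv2d(3, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, gaze_dim),
        )
        self.fusion = nn.Sequential(                 # fully connected fusion
            nn.Linear(2 * gaze_dim, 128), nn.ReLU(),
            nn.Linear(128, 2),                       # (dx, dy) offset
        )

    def forward(self, first_frontal, second_frontal):
        v1 = self.feature_extractor(first_frontal)   # line-of-sight vector 1
        v2 = self.feature_extractor(second_frontal)  # line-of-sight vector 2
        return self.fusion(torch.cat([v1, v2], dim=1))
```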
4. The method according to any one of claims 1 to 3, further comprising:
executing an operation at the target gaze point according to the coordinates of the target gaze point, wherein the operation comprises at least one of clicking, sliding, touching, unlocking, page turning, focusing, or a game operation.
5. The method according to any one of claims 1 to 3, wherein the calibration gaze point is a center point of a screen.
6. A method for training a frontalization model, comprising:
acquiring training data, wherein the training data comprises a frontal face image sample and a plurality of side-face image samples corresponding to the frontal face image sample;
and training the frontalization model by using the training data to obtain a target frontalization model, wherein the frontalization model comprises an encoder and a decoder, the encoder is configured to extract a feature vector of an image input to the encoder, and the decoder is configured to process the feature vector to obtain a predicted frontal face image.
7. The training method of claim 6, wherein the feature vector comprises a face feature vector and a line-of-sight feature vector, and the decoder is specifically configured to process the face feature vector to obtain the predicted frontal face image;
the training the frontalization model by using the training data to obtain the target frontalization model comprises:
inputting a side-face image sample to the encoder to obtain the face feature vector and the line-of-sight feature vector;
inputting the face feature vector to the decoder to obtain the predicted frontal face image;
and adjusting parameters of the encoder and the decoder according to the predicted frontal face image and the frontal face image sample, so as to obtain the target frontalization model.
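For illustration only: a minimal training loop in the spirit of claims 6 and 7, assuming the encoder outputs a single vector that is split into a face feature vector and a line-of-sight feature vector, and that reconstruction is supervised with an L1 loss against the paired frontal face image sample. The split sizes, loss, optimizer, and epoch count are illustrative choices, not taken from the patent.

```python
import torch
import torch.nn as nn

def train_frontalization_model(encoder, decoder, loader, epochs=10, lr=1e-4,
                               face_dim=192, gaze_dim=64):
    """loader yields (side_face, frontal_face) pairs; the encoder output of
    size face_dim + gaze_dim is split into a face feature vector and a
    line-of-sight feature vector, and only the face part drives the decoder."""
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    recon_loss = nn.L1Loss()

    for _ in range(epochs):
        for side_face, frontal_face in loader:
            features = encoder(side_face)                        # (N, face_dim + gaze_dim)
            face_vec, gaze_vec = features.split([face_dim, gaze_dim], dim=1)
            predicted_frontal = decoder(face_vec)                # predicted frontal face image
            loss = recon_loss(predicted_frontal, frontal_face)   # compare with the frontal sample
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            # gaze_vec is unused here; the claim 8/9 variants also feed it to
            # the decoder or to a gaze-coordinate head.
    return encoder, decoder
```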
8. The training method of claim 7, wherein the decoder is further configured to process the line-of-sight feature vector, and the inputting the face feature vector to the decoder to obtain the predicted frontal face image comprises:
inputting the face feature vector and the line-of-sight feature vector to the decoder to obtain the predicted frontal face image.
9. The training method according to claim 7 or 8, wherein the frontal face image sample and the plurality of side-face image samples correspond to the same gaze point, and the training data further comprises known coordinates of the gaze point;
the training the frontalization model by using the training data to obtain the target frontalization model comprises:
inputting the line-of-sight feature vector into a fully connected layer to obtain coordinates of a predicted gaze point;
and adjusting parameters of the encoder, the decoder, and the fully connected layer according to the coordinates of the predicted gaze point and the known coordinates of the gaze point, so as to obtain the target frontalization model.
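For illustration only: claim 9 additionally supervises the line-of-sight feature vector with the known gaze-point coordinates through a fully connected layer. A hedged sketch of how the joint loss might be formed on top of the previous training loop; the vector dimension, the MSE term, and the 0.5 weighting are arbitrary illustrative choices.

```python
import torch.nn as nn

# Hypothetical gaze-coordinate head and joint loss for the claim-9 variant.
gaze_head = nn.Linear(64, 2)   # line-of-sight feature vector -> predicted (x, y)
coord_loss = nn.MSELoss()

def joint_loss(predicted_frontal, frontal_face, gaze_vec, known_xy, recon_loss):
    # Reconstruction term (claims 6 to 8) plus gaze-coordinate term (claim 9);
    # the encoder, the decoder, and gaze_head all receive gradients.
    predicted_xy = gaze_head(gaze_vec)
    return recon_loss(predicted_frontal, frontal_face) + 0.5 * coord_loss(predicted_xy, known_xy)
```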
10. A method for training a prediction model, comprising:
acquiring training data, wherein the training data comprises a first face image sample of a first gaze point, a second face image sample of a second gaze point, and an offset label of the first gaze point and the second gaze point;
and training the prediction model by using the training data to obtain a target prediction model, wherein the prediction model is configured to process two face images input to the prediction model to obtain an offset between the gaze points corresponding to the two face images.
11. The training method according to claim 10, wherein the prediction model comprises a feature extraction network and a fully connected network, the feature extraction network is configured to extract line-of-sight feature vectors of the two face images input to the prediction model, and the fully connected network is configured to fuse the line-of-sight feature vectors to obtain the offset between the gaze points corresponding to the two face images.
12. The training method of claim 11, wherein the training the prediction model by using the training data to obtain the target prediction model comprises:
inputting the first face image sample and the second face image sample into the feature extraction network to obtain a line-of-sight feature vector of the first face image sample and a line-of-sight feature vector of the second face image sample, respectively;
inputting the line-of-sight feature vector of the first face image sample and the line-of-sight feature vector of the second face image sample into the fully connected network to obtain a predicted offset, wherein the predicted offset represents a predicted value of the offset between the first gaze point and the second gaze point;
and adjusting parameters of the feature extraction network and the fully connected network according to the predicted offset and the offset label, so as to obtain the target prediction model.
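For illustration only: a training loop in the spirit of claims 10 to 12, assuming a model with the two-image interface sketched earlier and a loader yielding (first_sample, second_sample, offset_label) triples. The loss and optimizer are again illustrative assumptions.

```python
import torch
import torch.nn as nn

def train_prediction_model(model, loader, epochs=10, lr=1e-4):
    """model maps two face image batches to an (N, 2) predicted offset;
    offset_label is the (N, 2) labeled offset between the two gaze points."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    offset_loss = nn.SmoothL1Loss()

    for _ in range(epochs):
        for first_sample, second_sample, offset_label in loader:
            predicted_offset = model(first_sample, second_sample)
            loss = offset_loss(predicted_offset, offset_label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```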
13. A gaze point detection apparatus, comprising:
an acquisition unit, configured to acquire a first face image of a target gaze point and a second face image of a calibration gaze point;
a processing unit, configured to perform the following operations:
performing frontalization processing on the first face image and the second face image by using a frontalization model to obtain a first predicted frontal face image and a second predicted frontal face image, wherein the first predicted frontal face image represents the predicted frontal face image of the first face image, and the second predicted frontal face image represents the predicted frontal face image of the second face image;
processing the first predicted frontal face image and the second predicted frontal face image by using a prediction model to obtain an offset between the target gaze point and the calibration gaze point;
and obtaining coordinates of the target gaze point according to coordinates of the calibration gaze point and the offset.
14. The apparatus of claim 13, wherein the frontalization model comprises an encoder and a decoder, the encoder is configured to extract a feature vector of a face image input to the frontalization model, and the decoder is configured to convert the feature vector into a frontal face image corresponding to the face image input to the frontalization model; and the processing unit is specifically configured to:
perform feature extraction on the first face image by using the encoder to obtain a feature vector of the first face image, and process the feature vector of the first face image by using the decoder to obtain the first predicted frontal face image;
and perform feature extraction on the second face image by using the encoder to obtain a feature vector of the second face image, and process the feature vector of the second face image by using the decoder to obtain the second predicted frontal face image.
15. The apparatus according to claim 13, wherein the prediction model comprises a feature extraction network and a fully connected network, the feature extraction network is configured to extract line-of-sight vectors of two face images input to the prediction model, and the fully connected network is configured to fuse the line-of-sight vectors to obtain an offset between the gaze points corresponding to the two face images; and the processing unit is specifically configured to:
perform feature extraction on the first predicted frontal face image and the second predicted frontal face image by using the feature extraction network to obtain a line-of-sight vector of the first predicted frontal face image and a line-of-sight vector of the second predicted frontal face image;
and perform fusion processing on the line-of-sight vector of the first predicted frontal face image and the line-of-sight vector of the second predicted frontal face image by using the fully connected network to obtain the offset between the target gaze point and the calibration gaze point.
16. The apparatus according to any one of claims 13 to 15, wherein the processing unit is further configured to:
execute an operation at the target gaze point according to the coordinates of the target gaze point, wherein the operation comprises at least one of clicking, sliding, touching, unlocking, page turning, focusing, or a game operation.
17. The apparatus according to any one of claims 13 to 15, wherein the calibration gaze point is a center point of a screen.
18. A training device for a frontalization model, comprising:
an acquisition unit, configured to acquire training data, wherein the training data comprises a frontal face image sample and a plurality of side-face image samples corresponding to the frontal face image sample;
and a training unit, configured to train the frontalization model by using the training data to obtain a target frontalization model, wherein the frontalization model comprises an encoder and a decoder, the encoder is configured to extract a feature vector of an image input to the encoder, and the decoder is configured to process the feature vector to obtain a predicted frontal face image.
19. The training device of claim 18, wherein the feature vector comprises a face feature vector and a line-of-sight feature vector, the decoder is specifically configured to process the face feature vector to obtain the predicted frontal face image, and the training unit is specifically configured to:
input a side-face image sample to the encoder to obtain the face feature vector and the line-of-sight feature vector;
input the face feature vector to the decoder to obtain the predicted frontal face image;
and adjust parameters of the encoder and the decoder according to the predicted frontal face image and the frontal face image sample, so as to obtain the target frontalization model.
20. The training device according to claim 19, wherein the decoder is further configured to process the line-of-sight feature vector, and the training unit is specifically configured to: input the face feature vector and the line-of-sight feature vector to the decoder to obtain the predicted frontal face image.
21. The training device according to claim 19 or 20, wherein the frontal face image sample and the plurality of side-face image samples correspond to the same gaze point, the training data further comprises known coordinates of the gaze point, and the training unit is specifically configured to:
input the line-of-sight feature vector into a fully connected layer to obtain coordinates of a predicted gaze point;
and adjust parameters of the encoder, the decoder, and the fully connected layer according to the coordinates of the predicted gaze point and the known coordinates of the gaze point, so as to obtain the target frontalization model.
22. A training device for a prediction model, comprising:
an acquisition unit, configured to acquire training data, wherein the training data comprises a first face image sample of a first gaze point, a second face image sample of a second gaze point, and an offset label of the first gaze point and the second gaze point;
and a training unit, configured to train the prediction model by using the training data to obtain a target prediction model, wherein the prediction model is configured to process two face images input to the prediction model to obtain an offset between the gaze points corresponding to the two face images.
23. The training device according to claim 22, wherein the prediction model comprises a feature extraction network and a fully connected network, the feature extraction network is configured to extract line-of-sight feature vectors of the two face images input to the prediction model, and the fully connected network is configured to fuse the line-of-sight feature vectors to obtain the offset between the gaze points corresponding to the two face images.
24. The training device according to claim 23, wherein the training unit is specifically configured to:
input the first face image sample and the second face image sample into the feature extraction network to obtain a line-of-sight feature vector of the first face image sample and a line-of-sight feature vector of the second face image sample, respectively;
input the line-of-sight feature vector of the first face image sample and the line-of-sight feature vector of the second face image sample into the fully connected network to obtain a predicted offset, wherein the predicted offset represents a predicted value of the offset between the first gaze point and the second gaze point;
and adjust parameters of the feature extraction network and the fully connected network according to the predicted offset and the offset label, so as to obtain the target prediction model.
25. A terminal device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 5 when executing the computer program.
26. A computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the method of any one of claims 6 to 9 or any one of claims 10 to 12 when executing the computer program.
27. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method of any one of claims 1 to 5, any one of claims 6 to 9, or any one of claims 10 to 12.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210932760.0A CN116030512B (en) | 2022-08-04 | 2022-08-04 | Gaze point detection method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116030512A (en) | 2023-04-28 |
CN116030512B (en) | 2023-10-31 |
Family
ID=86080165
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210932760.0A Active CN116030512B (en) | 2022-08-04 | 2022-08-04 | Gaze point detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116030512B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117711054A (en) * | 2023-05-12 | 2024-03-15 | 荣耀终端有限公司 | Data checking method, electronic equipment and medium |
CN117711040A (en) * | 2023-05-24 | 2024-03-15 | 荣耀终端有限公司 | Calibration method and electronic equipment |
Citations (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011075082A1 (en) * | 2009-12-14 | 2011-06-23 | Agency For Science, Technology And Research | Method and system for single view image 3 d face synthesis |
CN102662476A (en) * | 2012-04-20 | 2012-09-12 | 天津大学 | Gaze estimation method |
CN105488834A (en) * | 2015-12-01 | 2016-04-13 | 网易(杭州)网络有限公司 | Character face orientation regulation method and apparatus |
JP2016099759A (en) * | 2014-11-20 | 2016-05-30 | 国立大学法人静岡大学 | Face detection method, face detection device, and face detection program |
CN109901716A (en) * | 2019-03-04 | 2019-06-18 | 厦门美图之家科技有限公司 | Sight line point prediction model method for building up, device and sight line point prediction technique |
CN110018733A (en) * | 2018-01-10 | 2019-07-16 | 北京三星通信技术研究有限公司 | Determine that user triggers method, equipment and the memory devices being intended to |
CN111428667A (en) * | 2020-03-31 | 2020-07-17 | 天津中科智能识别产业技术研究院有限公司 | Human face image correcting method for generating confrontation network based on decoupling expression learning |
WO2020186867A1 (en) * | 2019-03-18 | 2020-09-24 | 北京市商汤科技开发有限公司 | Method and apparatus for detecting gaze area and electronic device |
CN111723596A (en) * | 2019-03-18 | 2020-09-29 | 北京市商汤科技开发有限公司 | Gaze area detection and neural network training method, device and equipment |
CN112181128A (en) * | 2019-07-04 | 2021-01-05 | 北京七鑫易维信息技术有限公司 | Line-of-sight estimation method and device with adaptive computing resources |
WO2021024905A1 (en) * | 2019-08-02 | 2021-02-11 | オムロン株式会社 | Image processing device, monitoring device, control system, image processing method, computer program, and recording medium |
WO2021082635A1 (en) * | 2019-10-29 | 2021-05-06 | 深圳云天励飞技术股份有限公司 | Region of interest detection method and apparatus, readable storage medium and terminal device |
WO2021218238A1 (en) * | 2020-04-29 | 2021-11-04 | 华为技术有限公司 | Image processing method and image processing apparatus |
CN113743172A (en) * | 2020-05-29 | 2021-12-03 | 魔门塔(苏州)科技有限公司 | Method and device for detecting person fixation position |
CN113936324A (en) * | 2021-10-29 | 2022-01-14 | Oppo广东移动通信有限公司 | Gaze detection method, control method of electronic device and related device |
CN114283265A (en) * | 2021-12-03 | 2022-04-05 | 北京航空航天大学 | Unsupervised face correcting method based on 3D rotation modeling |
Patent Citations (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2011075082A1 (en) * | 2009-12-14 | 2011-06-23 | Agency For Science, Technology And Research | Method and system for single view image 3 d face synthesis |
CN102662476A (en) * | 2012-04-20 | 2012-09-12 | 天津大学 | Gaze estimation method |
JP2016099759A (en) * | 2014-11-20 | 2016-05-30 | 国立大学法人静岡大学 | Face detection method, face detection device, and face detection program |
CN105488834A (en) * | 2015-12-01 | 2016-04-13 | 网易(杭州)网络有限公司 | Character face orientation regulation method and apparatus |
CN110018733A (en) * | 2018-01-10 | 2019-07-16 | 北京三星通信技术研究有限公司 | Determine that user triggers method, equipment and the memory devices being intended to |
CN109901716A (en) * | 2019-03-04 | 2019-06-18 | 厦门美图之家科技有限公司 | Sight line point prediction model method for building up, device and sight line point prediction technique |
CN111723596A (en) * | 2019-03-18 | 2020-09-29 | 北京市商汤科技开发有限公司 | Gaze area detection and neural network training method, device and equipment |
WO2020186867A1 (en) * | 2019-03-18 | 2020-09-24 | 北京市商汤科技开发有限公司 | Method and apparatus for detecting gaze area and electronic device |
CN111723828A (en) * | 2019-03-18 | 2020-09-29 | 北京市商汤科技开发有限公司 | Watching region detection method and device and electronic equipment |
CN112181128A (en) * | 2019-07-04 | 2021-01-05 | 北京七鑫易维信息技术有限公司 | Line-of-sight estimation method and device with adaptive computing resources |
WO2021024905A1 (en) * | 2019-08-02 | 2021-02-11 | オムロン株式会社 | Image processing device, monitoring device, control system, image processing method, computer program, and recording medium |
WO2021082635A1 (en) * | 2019-10-29 | 2021-05-06 | 深圳云天励飞技术股份有限公司 | Region of interest detection method and apparatus, readable storage medium and terminal device |
CN111428667A (en) * | 2020-03-31 | 2020-07-17 | 天津中科智能识别产业技术研究院有限公司 | Human face image correcting method for generating confrontation network based on decoupling expression learning |
WO2021218238A1 (en) * | 2020-04-29 | 2021-11-04 | 华为技术有限公司 | Image processing method and image processing apparatus |
CN113743172A (en) * | 2020-05-29 | 2021-12-03 | 魔门塔(苏州)科技有限公司 | Method and device for detecting person fixation position |
CN113936324A (en) * | 2021-10-29 | 2022-01-14 | Oppo广东移动通信有限公司 | Gaze detection method, control method of electronic device and related device |
CN114283265A (en) * | 2021-12-03 | 2022-04-05 | 北京航空航天大学 | Unsupervised face correcting method based on 3D rotation modeling |
Non-Patent Citations (2)
Title |
---|
LÁSZLÓ A. JENI et al.: "Person-independent 3D Gaze Estimation using Face Frontalization", 2016 IEEE Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), vol. 2016, pp. 792-800, XP033027900, DOI: 10.1109/CVPRW.2016.104 *
XU Haiyue et al.: "Multi-pose face image frontalization method based on encoder-decoder networks", Science China (Scientia Sinica), vol. 49, no. 4, pp. 450-463 *
Also Published As
Publication number | Publication date |
---|---|
CN116030512B (en) | 2023-10-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110058694B (en) | Sight tracking model training method, sight tracking method and sight tracking device | |
EP3965003B1 (en) | Image processing method and device | |
CN109947886B (en) | Image processing method, image processing device, electronic equipment and storage medium | |
CN109785368B (en) | Target tracking method and device | |
CN116030512B (en) | Gaze point detection method and device | |
KR20210111833A (en) | Method and apparatus for acquiring positions of a target, computer device and storage medium | |
CN111432245B (en) | Multimedia information playing control method, device, equipment and storage medium | |
CN111526476B (en) | Data transmission method, data transmission device, storage medium and terminal equipment | |
WO2018171196A1 (en) | Control method, terminal and system | |
CN110516113B (en) | Video classification method, video classification model training method and device | |
CN110837858B (en) | Network model training method, device, computer equipment and storage medium | |
CN111209812A (en) | Target face picture extraction method and device and terminal equipment | |
CN114995628B (en) | Space gesture recognition method and related equipment thereof | |
US11106915B1 (en) | Generating in a gaze tracking device augmented reality representations for objects in a user line-of-sight | |
CN112818733B (en) | Information processing method, device, storage medium and terminal | |
CN113900519A (en) | Method and device for acquiring fixation point and electronic equipment | |
CN111310526B (en) | Parameter determination method and device for target tracking model and storage medium | |
US20230368453A1 (en) | Controlling computer-generated facial expressions | |
WO2023219672A1 (en) | Controlling computer-generated facial expressions | |
CN118135255A (en) | Training method of image matching model, image matching method and computer equipment | |
CN114895789A (en) | Man-machine interaction method and device, electronic equipment and storage medium | |
CN114648315A (en) | Virtual interview method, device, equipment and storage medium | |
CN116158077A (en) | Method for optimizing motion vector and related equipment thereof | |
CN112308878A (en) | Information processing method and device, electronic equipment and storage medium | |
CN115830110B (en) | Instant positioning and map construction method and device, terminal equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||