CN117671767A - Gaze point acquisition method, model training method, device and electronic equipment - Google Patents


Info

Publication number
CN117671767A
Authority
CN
China
Prior art keywords
extraction unit
feature extraction
data set
feature
face image
Prior art date
Legal status
Pending
Application number
CN202311687082.7A
Other languages
Chinese (zh)
Inventor
孟令宣
孙哲
邱榆清
Current Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Original Assignee
Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority date
Filing date
Publication date
Application filed by Guangdong Oppo Mobile Telecommunications Corp Ltd
Priority to CN202311687082.7A
Publication of CN117671767A
Legal status: Pending


Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a gaze point acquisition method, a model training method, a device and electronic equipment. The method comprises the following steps: acquiring a first feature through a face image of a target user and a target head pose feature extraction unit, and acquiring a second feature through an eye image of the target user and a target human eye feature extraction unit; fusing the first feature and the second feature to obtain a fused feature; and predicting the gaze point of the target user based on the fused feature. The target head pose feature extraction unit is obtained by training with a first data set and a second data set, and the target human eye feature extraction unit is obtained by training with the first data set and a third data set. In this way, the accuracy of gaze estimation based on the features obtained by the target head pose feature extraction unit and the target human eye feature extraction unit is effectively improved.

Description

Gaze point acquisition method, model training method, device and electronic equipment
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a gaze point acquisition method, a model training method, a device, and an electronic device.
Background
As technology advances, an electronic device may detect the position on a screen at which a user is gazing, so as to perform a corresponding operation according to the detected gaze position. However, related methods for detecting a user's gaze position still leave room for improvement in detection accuracy.
Disclosure of Invention
In view of the above problems, the present application proposes a gaze point acquisition method, a model training method, a device, and an electronic apparatus to address these problems.
In a first aspect, the present application provides a gaze point acquisition method, the method comprising: acquiring a first feature through a face image of a target user and a target head pose feature extraction unit, and acquiring a second feature through an eye image of the target user and a target human eye feature extraction unit; fusing the first feature and the second feature to obtain a fused feature; and predicting the gaze point of the target user based on the fused feature; wherein the target head pose feature extraction unit is obtained by training with a first data set and a second data set, the target human eye feature extraction unit is obtained by training with the first data set and a third data set, the first data set comprises a first face image whose label is a line-of-sight angle, the second data set comprises a second face image whose label is the position of a face key point, and the third data set comprises an eye image whose label is the position of an eye key point.
In a second aspect, the present application provides a model training method, the method comprising: training a head pose feature extraction unit to be trained through a first data set and a second data set to obtain a target head pose feature extraction unit, wherein the first data set comprises a first face image whose label is a line-of-sight angle, and the second data set comprises a second face image whose label is the position of a face key point; and training a human eye feature extraction unit to be trained through the first data set and a third data set to obtain a target human eye feature extraction unit, wherein the third data set comprises an eye image whose label is the position of an eye key point; the target head pose feature extraction unit and the target human eye feature extraction unit are used for identifying the gaze point of a target user according to a face image.
In a third aspect, the present application provides a gaze point acquisition device, the device comprising: a feature acquisition unit, configured to acquire a first feature through a face image of a target user and a target head pose feature extraction unit, and acquire a second feature through an eye image of the target user and a target human eye feature extraction unit; a feature fusion unit, configured to fuse the first feature and the second feature to obtain a fused feature; and a gaze point prediction unit, configured to predict the gaze point of the target user based on the fused feature; wherein the target head pose feature extraction unit is obtained by training with a first data set and a second data set, the target human eye feature extraction unit is obtained by training with the first data set and a third data set, the first data set comprises a first face image whose label is a line-of-sight angle, the second data set comprises a second face image whose label is the position of a face key point, and the third data set comprises an eye image whose label is the position of an eye key point.
In a fourth aspect, the present application provides a model training apparatus, the apparatus comprising: a first training unit, configured to train a head pose feature extraction unit to be trained through a first data set and a second data set to obtain a target head pose feature extraction unit, wherein the first data set comprises a first face image whose label is a line-of-sight angle, and the second data set comprises a second face image whose label is the position of a face key point; and a second training unit, configured to train a human eye feature extraction unit to be trained through the first data set and a third data set to obtain a target human eye feature extraction unit, wherein the third data set comprises an eye image whose label is the position of an eye key point; the target head pose feature extraction unit and the target human eye feature extraction unit are used for identifying the gaze point of a target user according to a face image.
In a fifth aspect, the present application provides an electronic device comprising one or more processors and a memory; one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the methods described above.
In a sixth aspect, the present application provides a computer readable storage medium having program code stored therein, wherein the method described above is performed when the program code is run.
According to the gaze point acquisition method, the model training method, the devices, and the electronic equipment provided by the present application, the first feature is obtained through the face image of the target user and the target head pose feature extraction unit, the second feature is obtained through the eye image of the target user and the target human eye feature extraction unit, and the first feature and the second feature are then fused to obtain the fused feature, so that the gaze point of the target user is predicted based on the fused feature. Since the target head pose feature extraction unit is trained with the first data set and the second data set, and the target human eye feature extraction unit is trained with the first data set and the third data set, additional face images (the second face images in the second data set) and eye data (the eye images in the third data set) are introduced in addition to the first data set. This strengthens the learning of head pose features and eye features and effectively improves the accuracy of gaze estimation based on the features obtained by the target head pose feature extraction unit and the target human eye feature extraction unit.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 shows a schematic diagram of an application scenario according to an embodiment of the present application;
fig. 2 shows a schematic diagram of another application scenario proposed in an embodiment of the present application;
fig. 3 shows a flowchart of a gaze point acquisition method according to an embodiment of the present application;
fig. 4 shows a flowchart of a gaze point acquisition method according to another embodiment of the present application;
FIG. 5 is a schematic diagram of masking eyes of a face image according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a face and eye feature decoupling network according to an embodiment of the present application;
FIG. 7 is a schematic diagram of training a head pose feature extraction unit in an embodiment of the present application;
FIG. 8 is a schematic diagram of performing auxiliary training on a human eye feature extraction unit according to an embodiment of the present application;
FIG. 9 is a flow chart illustrating a model training method according to an embodiment of the present application;
fig. 10 is a block diagram showing a structure of a gaze point acquisition device according to an embodiment of the present application;
FIG. 11 is a block diagram of a model training apparatus according to an embodiment of the present application;
FIG. 12 shows a block diagram of an electronic device as proposed herein;
fig. 13 shows a storage unit for storing or carrying program code for implementing the gaze point acquisition method or the model training method according to the embodiments of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
As technology advances, electronic devices may detect the position at which a user gazes at a screen, so that corresponding operations are performed according to the detected gaze position. For example, in an information browsing scenario, the electronic device may determine whether to update the browsed information, such as by turning a page, according to the detected position of the user's gaze point. Furthermore, in some scenarios, a control operation corresponding to a gazed-at key may be triggered according to that key.
However, in researching techniques for detecting a user's gaze position, the inventors have found that the detection accuracy of related methods is not high enough. Therefore, to improve on this, the embodiments of the present application provide a gaze point acquisition method, a model training method, a device, and an electronic apparatus. The method obtains a first feature through a face image of a target user and a target head pose feature extraction unit, obtains a second feature through an eye image of the target user and a target human eye feature extraction unit, and fuses the first feature with the second feature to obtain a fused feature, so as to predict the gaze point of the target user based on the fused feature.
In this way, since the target head pose feature extraction unit is trained with the first data set and the second data set, and the target human eye feature extraction unit is trained with the first data set and the third data set, additional face images (the second face images in the second data set) and eye data (the eye images in the third data set) are introduced in addition to the first data set. This strengthens the learning of head pose features and eye features and effectively improves the accuracy of gaze estimation based on the features obtained by the target head pose feature extraction unit and the target human eye feature extraction unit.
The application scenario according to the embodiment of the present application will be described first.
In the embodiment of the application, the provided gaze point acquisition method or model training method may be executed by the electronic device. In this manner performed by the electronic device, all steps in the gaze point acquisition method or the model training method provided by the embodiments of the present application may be performed by the electronic device. For example, as shown in fig. 1, all steps in the gaze point acquisition method or the model training method provided in the embodiments of the present application may be executed by the processor of the electronic device 100.
Alternatively, the gaze point acquisition method or the model training method provided in the embodiments of the present application may also be executed by the server. Correspondingly, in this manner executed by the server, the server may start executing steps in the gaze point acquisition method or the model training method provided by the embodiments of the present application in response to the trigger instruction. The triggering instruction may be sent by an electronic device used by a user, or may be triggered locally by a server in response to some automation event.
In addition, the gaze point acquisition method or the model training method provided by the embodiments of the present application may also be cooperatively executed by the electronic device and the server. In this cooperative manner, some steps in the gaze point acquisition method or the model training method are performed by the electronic device, and the remaining steps are performed by the server. For example, as shown in fig. 2, the electronic device 100 may obtain the first feature through the face image of the target user and the target head pose feature extraction unit, obtain the second feature through the eye image of the target user and the target human eye feature extraction unit, and then transmit the first feature and the second feature to the server 200; the server 200 may perform the subsequent steps to obtain the gaze point of the target user and return it to the electronic device 100, which then performs subsequent operations after receiving the gaze point.
In this way, the steps performed by the electronic device and the server are not limited to those described in the above examples, and in practical applications, the steps performed by the electronic device and the server may be dynamically adjusted according to practical situations.
It should be noted that the electronic device 100 may be a tablet computer, a smart watch, a smart voice assistant, or another device besides the smart phone shown in fig. 1 and 2. The server 200 may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud computing, cloud storage, network services, cloud communication, middleware services, CDNs (Content Delivery Networks), and artificial intelligence platforms. In the case where the method provided in the embodiments of the present application is executed by a server cluster or a distributed system formed by a plurality of physical servers, different steps of the method may be executed by different physical servers, or may be executed in a distributed manner by servers built on the distributed system.
Embodiments of the present application will be described below with reference to the accompanying drawings.
Referring to fig. 3, the method for obtaining a gaze point provided in the present application includes:
s110: the method comprises the steps of obtaining a first feature through a face image of a target user and a target head gesture feature extraction unit, and obtaining a second feature through an eye image of the target user and a target human eye feature extraction unit.
In the embodiment of the present application, the target user may be understood as the object for which gaze point acquisition is performed. The gaze point may refer to a gaze point within a specific area. The specific area may be a content display area (e.g., a screen or projection area of an electronic device) or a region in space. The target user may be determined in a number of ways. As one way, image acquisition may be performed by a camera of the electronic device, in which case a person in the acquired image may be taken as the target user. For example, where there are a plurality of persons in the captured image, each of them may be regarded as a target user. Alternatively, after the electronic device performs image acquisition through the camera, face recognition may first be performed on the acquired image, and an identified specific person may then be taken as the target user. For example, the specific person may be a user of the electronic device, or may be a person specified by the electronic device.
In the embodiment of the present application, the gaze point acquisition method may be executed by a target network model. The target network model may include a target head pose feature extraction unit and a target human eye feature extraction unit. The target head pose feature extraction unit may be adapted to derive features regarding the head pose from an input face image (e.g., a first face image or a second face image), in which case the first feature may be understood as a head pose feature. The head pose features may include, among others, head orientation, head angle, and the like. The target human eye feature extraction unit may be configured to extract eye features of the target user, in which case the second feature may be understood as an eye feature. The eye features may be location features of critical areas of the eye, such as the areas where the canthus, iris, and pupil are located.
It should be noted that, in some cases, the obtained face image of the target user may have some problems; for example, the face may be inclined. In this case, in order to improve the accuracy of gaze point acquisition, preprocessing may first be performed after obtaining the face image (e.g., a first face image or a second face image) of the target user. During preprocessing, the face position and the eye positions may be obtained through a key point detection network, and the preprocessed face image and eye images of the target user may then be obtained through cropping and rotation. The preprocessed face image may then be input to the target head pose feature extraction unit to obtain the first feature, and the preprocessed eye images of the target user may be input to the target human eye feature extraction unit to obtain the second feature.
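As a rough illustration of this preprocessing step, the sketch below (Python with OpenCV and NumPy) rotates a face image so that the eyes lie on a horizontal line. The use of the two eye centers as the alignment reference and the unspecified crop margins are assumptions; the text only states that cropping and rotation are driven by a key point detection network.

```python
import cv2
import numpy as np

def align_face(face_img: np.ndarray, left_eye: np.ndarray, right_eye: np.ndarray) -> np.ndarray:
    """Rotates the face image so the two eye centers lie on a horizontal line.

    left_eye, right_eye: (x, y) eye centers from a key point detection network
    (a hypothetical upstream step; the text does not name a specific detector).
    """
    dx, dy = right_eye[0] - left_eye[0], right_eye[1] - left_eye[1]
    angle = np.degrees(np.arctan2(dy, dx))                    # tilt of the eye line
    center = (float((left_eye[0] + right_eye[0]) / 2.0),      # rotate about the eye mid-point
              float((left_eye[1] + right_eye[1]) / 2.0))
    rot = cv2.getRotationMatrix2D(center, angle, 1.0)
    h, w = face_img.shape[:2]
    aligned = cv2.warpAffine(face_img, rot, (w, h))
    # Face and eye crops would then be cut from `aligned` around the rotated
    # key points; the exact crop margins are not specified in the text.
    return aligned
```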
S120: and fusing the first feature and the second feature to obtain a fused feature.
In the embodiment of the present application, the head pose of the target user and the positions of key regions of the eyes are combined to predict the gaze point of the target user. In this case, after the first feature and the second feature are obtained, they may be fused to obtain a fused feature. The fused feature may be understood as a feature that characterizes both the head pose of the target user and the positions of the critical areas of the eyes of the target user.
As one way, the first feature, the left-eye feature, and the right-eye feature can be fused to obtain an initial fusion feature, and the initial fusion feature is then processed through two fully connected layers to obtain the fused feature. The two fully connected layers that process the initial fusion feature serve to fuse the features and reduce their dimension. For example, the first feature may be a 256-dimensional head pose feature, and the second feature may include a 256-dimensional left-eye feature and a 256-dimensional right-eye feature; in this case, the first feature, the left-eye feature, and the right-eye feature may be fused in a cascade manner to obtain an initial fusion feature, and the initial fusion feature may then be processed through the two fully connected layers to obtain a 256-dimensional fused feature.
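The following PyTorch sketch illustrates one possible reading of this fusion step: the three 256-dimensional features are concatenated (cascaded) and passed through two fully connected layers that fuse them and reduce the dimension back to 256. The intermediate width of 512 and the ReLU activation are assumptions not stated in the text.

```python
import torch
import torch.nn as nn

class FeatureFusion(nn.Module):
    """Fuses the head pose feature with the left-eye and right-eye features."""

    def __init__(self, feat_dim: int = 256):
        super().__init__()
        # Two fully connected layers: fuse the concatenated features and
        # reduce them back to a feat_dim-dimensional fused feature.
        self.fc1 = nn.Linear(feat_dim * 3, feat_dim * 2)
        self.fc2 = nn.Linear(feat_dim * 2, feat_dim)

    def forward(self, head_feat: torch.Tensor,
                left_eye_feat: torch.Tensor,
                right_eye_feat: torch.Tensor) -> torch.Tensor:
        # Cascade (concatenate) the three 256-dimensional features.
        initial_fusion = torch.cat([head_feat, left_eye_feat, right_eye_feat], dim=-1)
        return self.fc2(torch.relu(self.fc1(initial_fusion)))
```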
In the case where the second feature includes a left-eye feature and a right-eye feature, the eye image input to the target human eye feature extraction unit may include a left-eye image and a right-eye image. For example, if the preprocessing described above is performed, a left-eye image and a right-eye image can be obtained from a face image of a target user during preprocessing.
S130: and predicting the gaze point of the target user based on the fusion characteristic, wherein the target head gesture characteristic extraction unit is obtained by training a first data set and a second data set, the target human eye characteristic extraction unit is obtained by training the first data set and a third data set, the first data set comprises a first human face image, the label of the first human face image is a view angle, the second data set comprises a second human face image, the label of the second human face image is the position of a human face key point, the third data set comprises an eye image, and the label of the eye image is the position of the eye key point.
In the embodiment of the present application, the gaze point of the user can be represented by a gaze yaw angle and a gaze pitch angle, which refer to the angles of the line of sight with respect to the horizontal and vertical planes, respectively. The gaze yaw angle can be understood as the deflection angle of the line of sight in the horizontal plane and can be used to describe rotation in the horizontal direction. The gaze pitch angle can be understood as the deflection angle of the line of sight in the vertical plane and can be used to describe rotation in the vertical direction.
As one way, after the fused feature is obtained, it may be processed by two fully connected layers to predict the gaze point of the target user.
Of the two fully connected layers adopted when acquiring the gaze point, the first fully connected layer is used for feature extraction and dimension reduction, and the second fully connected layer performs classification or regression based on the output of the first fully connected layer to obtain the predicted values (such as a gaze yaw angle and a gaze pitch angle). For example, if the fused feature is 256-dimensional, processing by the first fully connected layer can reduce it to 64 dimensions.
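A matching sketch of this prediction head, again in PyTorch: the first fully connected layer reduces the 256-dimensional fused feature to 64 dimensions, and the second regresses the two predicted values (gaze yaw and gaze pitch). Treating the second layer as a plain linear regression layer is an assumption.

```python
import torch
import torch.nn as nn

class GazeHead(nn.Module):
    """Predicts (yaw, pitch) of the line of sight from the fused feature."""

    def __init__(self, feat_dim: int = 256, hidden_dim: int = 64):
        super().__init__()
        self.fc1 = nn.Linear(feat_dim, hidden_dim)  # feature extraction + dimension reduction (256 -> 64)
        self.fc2 = nn.Linear(hidden_dim, 2)         # regression to (gaze yaw angle, gaze pitch angle)

    def forward(self, fused_feat: torch.Tensor) -> torch.Tensor:
        return self.fc2(torch.relu(self.fc1(fused_feat)))
```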
According to the gaze point acquisition method provided by this embodiment, since the target head pose feature extraction unit is obtained by training with the first data set and the second data set, and the target human eye feature extraction unit is obtained by training with the first data set and the third data set, additional face images (the second face images in the second data set) and eye data (the eye images in the third data set) are introduced in addition to the first data set. This strengthens the learning of head pose features and eye features and effectively improves the accuracy of gaze estimation based on the features obtained by the target head pose feature extraction unit and the target human eye feature extraction unit.
Referring to fig. 4, the method for acquiring a gaze point provided in the present application is applied to an electronic device, and includes:
s210: and obtaining a first loss value based on the first data set, the head posture feature extraction unit to be trained and the human eye feature extraction unit to be trained.
In the present embodiment, the first data set may be understood as a Gaze data set (gaze_dataset).
As one way, the first face image in the first data set is input to the head pose feature extraction unit to be trained to obtain a first head pose feature, the eye images of the first face image in the first data set are input to the human eye feature extraction unit to be trained to obtain a first eye feature, and a first loss value is calculated based on the first head pose feature, the first eye feature, and the label corresponding to the first face image.
After the first head pose feature and the first eye feature are obtained, they may be fused to obtain a fused feature, and the gaze point is then predicted based on the fused feature, so as to obtain a predicted gaze angle (for example, a predicted gaze yaw angle and a predicted gaze pitch angle). For the first data set, the label of the first face image is a line-of-sight angle, and the angle carried by the label can be understood as the real line-of-sight angle. In this case, the first loss value may be calculated from the predicted line-of-sight angle, the real line-of-sight angle, and a first loss function. Optionally, the first loss function may be an L1 loss function. L1 loss, also known as Mean Absolute Error, is a loss function that measures the difference between the model output and the real label by calculating the average of the absolute differences between the predicted and real values.
In the process of calculating the first loss value, a loss value may be calculated based on the real line-of-sight angle and the predicted line-of-sight angle of each individual first face image, so as to obtain the loss values corresponding to the plurality of first face images, and the loss values corresponding to the plurality of first face images are then averaged to obtain the first loss value. Illustratively, the formulas are as follows:
Loss_n=|Yaw_gt-Yaw_pred|+|Pitch_gt-Pitch_pred|
Loss_gaze=avg(Loss_1+Loss_2+Loss_3+…+Loss_N)
where Loss_n characterizes the loss value corresponding to a single first face image, Yaw_gt and Pitch_gt represent the real gaze yaw angle and the real gaze pitch angle, Yaw_pred and Pitch_pred represent the predicted gaze yaw angle and the predicted gaze pitch angle, and Loss_gaze represents the first loss value obtained by averaging the loss values corresponding to the plurality of first face images.
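The two formulas above can be written compactly as follows (PyTorch sketch; the (N, 2) tensor layout holding yaw and pitch per image is an assumption):

```python
import torch

def gaze_loss(pred_angles: torch.Tensor, gt_angles: torch.Tensor) -> torch.Tensor:
    """First loss value (Loss_gaze) for a batch of N first face images.

    pred_angles, gt_angles: shape (N, 2), with columns holding (yaw, pitch).
    """
    # Loss_n = |Yaw_gt - Yaw_pred| + |Pitch_gt - Pitch_pred| for each image
    per_image = (gt_angles - pred_angles).abs().sum(dim=1)
    # Loss_gaze = average of the per-image loss values
    return per_image.mean()
```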
Optionally, the eyes of the first face image in the first data set are covered to obtain a first face image with covered eyes, and the first face image with covered eyes is input to the head pose feature extraction unit to be trained to obtain the first head pose feature. An exemplary effect of masking the eyes of the first face image is shown in fig. 5. As shown in fig. 5, in the embodiment of the present application, covering the eyes of the first face image can also be understood as adding a mask to the eyes, so that no information about the eyes is involved in training the head pose feature extraction unit to be trained, thereby decoupling the head features from the eye features.
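A minimal sketch of this eye-covering operation (NumPy); deriving rectangular boxes from the detected eye key points and filling them with zeros are assumptions, since the text only requires that the eye information be removed.

```python
import numpy as np

def mask_eyes(face_img: np.ndarray, eye_boxes) -> np.ndarray:
    """Returns a copy of the face image with the eye regions covered.

    eye_boxes: iterable of (x1, y1, x2, y2) rectangles around each eye,
    e.g. derived from detected eye key points (assumed representation).
    """
    masked = face_img.copy()
    for x1, y1, x2, y2 in eye_boxes:
        masked[y1:y2, x1:x2] = 0  # cover the eye region so no eye information remains
    return masked
```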
S220: and obtaining a second loss value based on the second data set and the head posture feature extraction unit to be trained.
In the embodiment of the present application, the second data set may be understood as a Face data set (Face_dataset).
As one way, the second face image in the second data set is input to a head pose feature extraction unit to be trained, so as to obtain a second head pose feature, and a second loss value is calculated based on the second head pose feature and a label corresponding to the second face image.
The label of the second face image in the second data set is the position of the face key point, wherein the position of the face key point represented by the label can be understood as the position of the real face key point corresponding to the second face image. In the process of calculating the second loss value, after the second head posture feature is obtained, the predicted position of the face key point can be obtained based on the second head posture feature. In this case, for the same second face image, corresponding to both the position of the true face key point and the position of the predicted face key point, the second loss value may be calculated based on the second loss function. The second Loss function may be an L1 Loss function. As one way, for each of the second face images, a loss value may be calculated based on the positions of the corresponding real face key points and the predicted face key points, and then the second loss value may be calculated based on the loss values corresponding to the plurality of second face images. Alternatively, the loss values of each of the plurality of second face images may be averaged to obtain the second loss value.
Alternatively, in the embodiment of the present application, there may be multiple (for example, 8) face key point positions in a single second face image; in this case, the loss difference value of each face key point position may first be calculated separately, and the loss difference values of the face key points may then be averaged to obtain the loss value of the single second face image. Alternatively, the position of a face key point may be represented by its coordinates in the image; in this case, the difference between the abscissa of the real face key point and the abscissa of the predicted face key point may be taken as the abscissa loss difference, the difference between the ordinate of the real face key point and the ordinate of the predicted face key point may be taken as the ordinate loss difference, and the loss difference may then be obtained based on the abscissa loss difference and the ordinate loss difference; on this basis, the loss difference of each face key point may be calculated.
Optionally, in the process of calculating the second loss value, the loss difference values, the loss value corresponding to a single second face image, and the second loss value may be calculated after the positions of the face key points are normalized. It should be noted that the positions of the face key points may be represented by the coordinates of the face key points in the image; in this case, the abscissa of a face key point may be divided by the width of the second face image, and its ordinate may be divided by the height of the second face image, so as to normalize the positions of the face key points. Illustratively, the normalization formulas are as follows:
Norm_x=x/img_width;
Norm_y=y/img_height
Wherein, norm_x represents a normalized abscissa, norm_y represents a normalized ordinate, x represents an initial abscissa (or an abscissa not normalized), y represents an initial abscissa (or an ordinate not normalized), img_width represents a width of the second face image, and img_height represents a height of the second face image.
Optionally, the eyes of the second face image in the second data set are covered to obtain a second face image with covered eyes, and the second face image with covered eyes is input to the head pose feature extraction unit to be trained to obtain the second head pose feature. It should be noted that covering the eyes of the second face image and training the head pose feature extraction unit to be trained with the eye-covered image further facilitates the decoupling of head pose features and eye features.
S230: and obtaining a third loss value based on the third data set and the human eye feature extraction unit to be trained.
In the embodiment of the present application, the third data set may be understood as an Eye data set (Eye_dataset).
As one way, the eye image in the third dataset is input to a human eye feature extraction unit to be trained to obtain a second eye feature.
The label of the eye image in the third data set is the position of the eye key point, wherein the position of the eye key point represented by the label can be understood as the position of the real eye key point corresponding to the eye image. In calculating the third loss value, after the second eye feature is obtained, a predicted location of the eye keypoint may be obtained based on the second eye feature. In this case, for the same eye image, there is a correspondence of both the position of the true eye keypoint and the position of the predicted eye keypoint, and a third loss value is calculated based on a third loss function. The third Loss function may be an L1 Loss function. As one way, for each eye image, a loss value may be calculated based on the position of the corresponding real eye keypoint, and the predicted position of the eye keypoint, and then a third loss value may be calculated based on the loss values for each of the plurality of eye images. Alternatively, the loss values of each of the plurality of eye images may be averaged to obtain a third loss value.
Alternatively, in the embodiment of the present application, there may be a plurality (e.g., 6) of eye key point positions in a single eye image; in this case, the loss difference value of each eye key point position may first be calculated separately, and the loss difference values of the eye key points may then be averaged to obtain the loss value of the single eye image. Alternatively, the position of an eye key point may be represented by its coordinates in the image; in this case, the difference between the abscissa of the real eye key point and the abscissa of the predicted eye key point may be taken as the abscissa loss difference, the difference between the ordinate of the real eye key point and the ordinate of the predicted eye key point may be taken as the ordinate loss difference, and the loss difference may then be obtained based on the abscissa loss difference and the ordinate loss difference; on this basis, the loss difference of each eye key point may be calculated.
S240: and training the head posture feature extraction unit to be trained and the human eye feature extraction unit to be trained based on the first loss value, the second loss value and the third loss value to obtain a target head posture feature extraction unit and a target human eye feature extraction unit.
After the first loss value, the second loss value and the third loss value are obtained, a total loss value can be calculated based on the three loss values, so that the head posture feature extraction unit to be trained and the human eye feature extraction unit to be trained are trained according to the total loss value, and the target head posture feature extraction unit and the target human eye feature extraction unit are obtained.
As one way, the first loss value, the second loss value, and the third loss value may be directly added to obtain the total loss value. Alternatively, the second loss value and the third loss value may each be assigned a corresponding weight, and the total loss value may be obtained based on these weights. Illustratively, one formula for calculating the total loss value is as follows:
Loss_all=Loss_gaze+a*Loss_face+b*Loss_eye
where Loss_all characterizes the total loss value, Loss_gaze characterizes the first loss value, Loss_face characterizes the second loss value, and Loss_eye characterizes the third loss value; a represents the weight corresponding to the second loss value, and b represents the weight corresponding to the third loss value. Alternatively, both a and b may be 5.
S250: obtaining a first feature through the face image of the target user and the target head pose feature extraction unit, and obtaining a second feature through the eye image of the target user and the target human eye feature extraction unit.
S260: and fusing the first feature and the second feature to obtain a fused feature.
S270: predicting the gaze point of the target user based on the fused feature.
The target head posture feature extraction unit is obtained by training a first data set and a second data set, and the target human eye feature extraction unit is obtained by training the first data set and a third data set.
In the embodiment of the present application, the target head pose feature extraction unit and the target human eye feature extraction unit may be obtained after multiple rounds of training of the head pose feature extraction unit to be trained and the human eye feature extraction unit to be trained. In this case, in each round of training, the first loss value, the second loss value, and the third loss value may be calculated in turn, the total loss value corresponding to the current round is then obtained based on these three loss values, the head pose feature extraction unit to be trained and the human eye feature extraction unit to be trained are trained for the current round based on the total loss value, and the next round of training is then performed until training is completed.
Optionally, in the current training process, a first loss value is obtained based on a batch of first face images in the first data set, a head posture feature extraction unit to be trained, and a human eye feature extraction unit to be trained. And in the current training process, obtaining a second loss value based on a batch of second face images in the second data set and a head posture feature extraction unit to be trained. And in the current training process, obtaining a third loss value based on a batch of eye images in the third data set and the human eye feature extraction unit to be trained.
Based on the first loss value, the second loss value and the third loss value in the current training process, performing current training on the head posture feature extraction unit to be trained and the human eye feature extraction unit to be trained; if the current training does not meet the target condition, the next training process is carried out, and if the current training meets the target condition, the target head posture feature extraction unit and the target human eye feature extraction unit are obtained.
It should be noted that, in each round, the first face images may be understood as a batch of first face images, the second face images as a batch of second face images, and the eye images as a batch of eye images. The batch sizes of the first face images, the second face images, and the eye images may be the same or different. For example, one batch of first face images may contain 320 first face images, one batch of second face images may contain 100 second face images, and one batch of eye images may contain 100 eye images.
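Putting the three loss values and the total loss together, one round of the joint training described above might look like the sketch below. The module and loader names, the auxiliary key-point heads, and the exact batch structure are assumptions introduced for illustration; only the three-branch loss and the weighting Loss_all = Loss_gaze + a*Loss_face + b*Loss_eye follow the text. The face batches are assumed to already have their eyes covered as described earlier.

```python
def train_one_round(gaze_loader, face_loader, eye_loader,
                    head_pose_net, eye_net, fusion, gaze_head,
                    face_kpt_head, eye_kpt_head, optimizer,
                    a: float = 5.0, b: float = 5.0) -> None:
    """One training round over batches drawn from the three data sets (hypothetical structure)."""
    for (g_face, g_left, g_right, gt_angles), (f_face, f_kpts), (e_img, e_kpts) in zip(
            gaze_loader, face_loader, eye_loader):
        # First loss value: gaze angles predicted from the fused features (first data set).
        fused = fusion(head_pose_net(g_face), eye_net(g_left), eye_net(g_right))
        loss_gaze = (gaze_head(fused) - gt_angles).abs().sum(dim=1).mean()

        # Second loss value: face key points predicted from the head pose feature (second data set).
        loss_face = (face_kpt_head(head_pose_net(f_face)) - f_kpts).abs().mean()

        # Third loss value: eye key points predicted from the eye feature (third data set).
        loss_eye = (eye_kpt_head(eye_net(e_img)) - e_kpts).abs().mean()

        # Loss_all = Loss_gaze + a * Loss_face + b * Loss_eye, with a = b = 5 in the example.
        loss_all = loss_gaze + a * loss_face + b * loss_eye
        optimizer.zero_grad()
        loss_all.backward()
        optimizer.step()
```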
According to the gaze point acquisition method provided by this embodiment, the learning of head pose features and the learning of eye features are strengthened, and the accuracy of gaze estimation based on the features obtained by the target head pose feature extraction unit and the target human eye feature extraction unit is effectively improved. In addition, in this embodiment, the eyes of the first face image in the first data set and the eyes of the second face image in the second data set can be covered, so that head pose features and eye features are decoupled, which improves the accuracy of gaze point acquisition based on the head pose features and the eye features. Moreover, the robustness of the features learned by the head pose feature extraction unit can be enhanced by the second data set.
It should be noted that, in the case where the eyes in the first face image and the second face image are covered, the target neural network performing the gaze point acquisition method in the embodiment of the present application may be understood as a FEDNet (Face and Eye feature Decoupling Network). A block diagram of the FEDNet network may be illustrated in fig. 6, for example. For the functions of the units in fig. 6, reference may be made to the relevant content above, which will not be repeated here. During training, the head pose feature extraction unit and the human eye feature extraction unit shown in fig. 6 may be understood as the head pose feature extraction unit to be trained and the human eye feature extraction unit to be trained. After training is completed, they may be understood as the target head pose feature extraction unit and the target human eye feature extraction unit.
In training the FEDNet network, the calculation of the relevant loss values may be performed by the loss calculation unit. When the FEDNet network is applied to gaze point acquisition, the loss calculation unit may no longer be used.
The training using the second data set and the third data set described above may be understood as auxiliary training tasks. For the auxiliary training task using the second data set, its network structure may be as shown in fig. 7; the head pose feature extraction unit shown in fig. 7 can be understood as the head pose feature extraction unit shown in fig. 6. For the auxiliary training task using the third data set, its network structure may be as shown in fig. 8; the human eye feature extraction unit shown in fig. 8 can be understood as the human eye feature extraction unit shown in fig. 6.
Referring to fig. 9, the method for training a model provided in the present application includes:
s310: training the head posture feature extraction unit to be trained through a first data set and a second data set to obtain a target head posture feature extraction unit, wherein the first data set comprises a first face image, the label of the first face image is a line-of-sight angle, the second data set comprises a second face image, and the label of the second face image is the position of a face key point.
S320: training the human eye feature extraction unit to be trained through the first data set and a third data set to obtain a target human eye feature extraction unit, wherein the third data set comprises an eye image whose label is the position of an eye key point; the target head pose feature extraction unit and the target human eye feature extraction unit are used for identifying the gaze point of a target user according to a face image.
Referring to fig. 10, the present application provides a gaze point acquisition apparatus 400. The apparatus 400 includes:
the feature acquiring unit 410 is configured to acquire a first feature by using a face image of a target user and a target head pose feature extracting unit, and acquire a second feature by using an eye image of the target user and a target human eye feature extracting unit.
And a feature fusion unit 420, configured to fuse the first feature with the second feature to obtain a fused feature.
A gaze point prediction unit 430, configured to predict a gaze point of the target user based on the fusion feature; the target head posture feature extraction unit is obtained by training a first data set and a second data set, and the target human eye feature extraction unit is obtained by training the first data set and a third data set.
As one approach, the apparatus 400 further comprises:
the training unit 440 is configured to obtain a first loss value based on the first data set, the head pose feature extraction unit to be trained, and the human eye feature extraction unit to be trained; obtaining a second loss value based on the second data set and the head posture feature extraction unit to be trained; obtaining a third loss value based on the third data set and the human eye feature extraction unit to be trained; and training the head posture feature extraction unit to be trained and the human eye feature extraction unit to be trained based on the first loss value, the second loss value and the third loss value to obtain a target head posture feature extraction unit and a target human eye feature extraction unit.
Optionally, the training unit 440 is specifically configured to input the first facial image in the first dataset to the head pose feature extraction unit to be trained, so as to obtain a first head pose feature; inputting an eye image of a first face image in the first data set to a human eye feature extraction unit to be trained so as to obtain a first eye feature; a first loss value is calculated based on the first head pose feature and the first eye feature, and a label corresponding to the first face image.
Optionally, the training unit 440 is specifically configured to mask the eyes of the first face image in the first dataset to obtain a first face image of the masked eyes, and input the first face image of the masked eyes to the head pose feature extraction unit to be trained to obtain the first head pose feature.
Optionally, the training unit 440 is specifically configured to input the second face image in the second data set to the head pose feature extraction unit to be trained, so as to obtain a second head pose feature; and calculating a second loss value based on the second head posture feature and the label corresponding to the second face image.
Optionally, the training unit 440 is specifically configured to mask eyes of the second face image in the second data set to obtain a second face image with eyes masked; and inputting the second face image covering the eyes to a head posture feature extraction unit to be trained so as to obtain second head posture features.
Optionally, the training unit 440 is specifically configured to input the eye image in the third data set to the human eye feature extraction unit to be trained, so as to obtain a second eye feature; and calculating a third loss value based on the second eye feature and a label corresponding to the eye image.
As a way, the feature fusion unit 420 is specifically configured to fuse the first feature, the left-eye feature, and the right-eye feature to obtain an initial fusion feature; and processing the initial fusion characteristics through two full-connection layers to obtain fusion characteristics.
As a way, the gaze point prediction unit 430 is specifically configured to process the fusion feature through two fully-connected layers to predict the gaze point of the target user.
Referring to fig. 11, the present application provides a model training apparatus. The apparatus includes:
the first training unit 510 is configured to train the head pose feature extraction unit to be trained through a first data set and a second data set, so as to obtain a target head pose feature extraction unit, where the first data set includes a first face image, a label of the first face image is a line of sight angle, the second data set includes a second face image, and the label of the second face image is a position of a face key point;
the second training unit 520 is configured to train the human eye feature extraction unit to be trained through the first data set and a third data set, so as to obtain a target human eye feature extraction unit, where the third data set includes an eye image, and a label of the eye image is a position of an eye key point;
The target head gesture feature extraction unit and the target human eye feature extraction unit are used for identifying the fixation point of the target user according to the face image.
It should be noted that, for convenience and brevity, specific working procedures of the apparatus and units described above may refer to corresponding procedures in the foregoing method embodiments, and are not described herein again. In several embodiments provided herein, the coupling of the modules to each other may be electrical. In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules.
An electronic device provided in the present application will be described with reference to fig. 12.
Referring to fig. 12, based on the above methods and apparatuses, the embodiments of the present application further provide an electronic device 1000 capable of executing the gaze point acquisition method or the model training method described above. The electronic device 1000 includes one or more (only one shown in the figure) processors 102, a memory 104, a camera 106, and an audio acquisition device 108 coupled to each other. The memory 104 stores a program capable of executing the contents of the foregoing embodiments, and the processor 102 can execute the program stored in the memory 104.
Wherein the processor 102 may include one or more processing cores. The processor 102 utilizes various interfaces and lines to connect various portions of the overall electronic device 1000, perform various functions of the electronic device 1000, and process data by executing or executing instructions, programs, code sets, or instruction sets stored in the memory 104, and invoking data stored in the memory 104. Alternatively, the processor 102 may be implemented in hardware in at least one of digital signal processing (Digital Signal Processing, DSP), field programmable gate array (Field-Programmable Gate Array, FPGA), programmable logic array (Programmable Logic Array, PLA). The processor 102 may integrate one or a combination of several of a central processing unit (Central Processing Unit, CPU), an image processor (Graphics Processing Unit, GPU), and a modem, etc. The CPU mainly processes an operating system, a user interface, an application program and the like; the GPU is used for being responsible for rendering and drawing of display content; the modem is used to handle wireless communications. It will be appreciated that the modem may not be integrated into the processor 102 and may be implemented solely by a single communication chip. As one approach, the processor 102 may be a neural network chip. For example, an embedded neural network chip (NPU) may be provided.
The memory 104 may include random access memory (RAM) or read-only memory (ROM). The memory 104 may be used to store instructions, programs, code sets, or instruction sets. For example, the memory 104 may store the gaze point acquisition apparatus or the model training apparatus described above. The memory 104 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (e.g., a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the various method embodiments described above, etc.
Further, the electronic device 1000 may include a network module 110 and a sensor module 112 in addition to the devices shown above.
The network module 110 is configured to implement information interaction between the electronic device 1000 and other devices, for example, transmitting device control commands, manipulation request commands, and status information acquisition commands. Since the electronic device 1000 may be embodied as different devices, its corresponding network module 110 may also differ.
The sensor module 112 may include at least one sensor. Specifically, the sensor module 112 may include, but is not limited to: level gauges, light sensors, motion sensors, pressure sensors, infrared thermal sensors, distance sensors, acceleration sensors, and other sensors.
Wherein the pressure sensor may detect a pressure generated by pressing against the electronic device 1000. That is, the pressure sensor detects a pressure generated by contact or pressing between the user and the electronic device, for example, a pressure generated by contact or pressing between the user's ear and the mobile terminal. Thus, the pressure sensor may be used to determine whether contact or pressure has occurred between the user and the electronic device 1000, as well as the magnitude of the pressure.
The acceleration sensor may detect the acceleration in each direction (typically, three axes), and may detect the gravity and direction when stationary, and may be used for applications for recognizing the gesture of the electronic device 1000 (such as landscape/portrait screen switching, related games, magnetometer gesture calibration), vibration recognition related functions (such as pedometer, and knocking), and so on. In addition, the electronic device 1000 may further be configured with other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, etc., which will not be described herein.
The audio acquisition device 108 is used for acquiring audio signals. Optionally, the audio acquisition device 108 may include a plurality of audio acquisition units, which may be microphones.
As one way, the network module of the electronic device 1000 is a radio frequency module, and the radio frequency module is configured to receive and transmit electromagnetic waves, and implement mutual conversion between the electromagnetic waves and the electrical signals, so as to communicate with a communication network or other devices. The radio frequency module may include various existing circuit elements for performing these functions, such as an antenna, a radio frequency transceiver, a digital signal processor, an encryption/decryption chip, a Subscriber Identity Module (SIM) card, memory, and the like. For example, the radio frequency module may interact with external devices through transmitted or received electromagnetic waves. For example, the radio frequency module may send instructions to the target device.
Referring to fig. 13, a block diagram of a computer readable storage medium according to an embodiment of the present application is shown. The computer readable storage medium 800 stores program code which can be invoked by a processor to perform the methods described in the foregoing method embodiments.
The computer readable storage medium 800 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read-only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer readable storage medium 800 comprises a non-transitory computer-readable storage medium. The computer readable storage medium 800 has storage space for program code 810 that performs any of the method steps described above. The program code can be read from or written into one or more computer program products. The program code 810 may, for example, be compressed in a suitable form.
In summary, according to the gaze point acquisition method, the model training method, the device, and the electronic equipment provided by the application, a first feature is obtained through the face image of the target user and the target head posture feature extraction unit, a second feature is obtained through the eye image of the target user and the target human eye feature extraction unit, and the first feature and the second feature are then fused to obtain a fused feature, based on which the gaze point of the target user is predicted. Since the target head posture feature extraction unit is obtained by training with the first data set and the second data set, and the target human eye feature extraction unit is obtained by training with the first data set and the third data set, additional face images (the second face images in the second data set) and eye data (the eye images in the third data set) are introduced in addition to the first data set. This strengthens the learning of head posture features and of human eye features, and thereby effectively improves the accuracy of gaze estimation based on the features obtained by the target head posture feature extraction unit and the target human eye feature extraction unit.
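For illustration only, the inference path summarized above can be pictured with a minimal Python/PyTorch sketch. Every module, layer size, and image size below is an assumption made for the example; the sketch is not the application's actual network.

import torch
import torch.nn as nn

class GazePointModel(nn.Module):
    # Hypothetical sketch: two feature extractors, feature fusion, gaze point regression.
    def __init__(self, feat_dim=64):
        super().__init__()
        # stands in for the target head posture feature extraction unit (input: face image)
        self.head_posture_extractor = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim))
        # stands in for the target human eye feature extraction unit (input: eye image)
        self.eye_extractor = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, feat_dim))
        # fusion and regression of the 2-D gaze point on the screen
        self.head = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU(),
                                  nn.Linear(feat_dim, 2))

    def forward(self, face_img, eye_img):
        first_feature = self.head_posture_extractor(face_img)       # first feature
        second_feature = self.eye_extractor(eye_img)                # second feature
        fused = torch.cat([first_feature, second_feature], dim=1)   # fused feature
        return self.head(fused)                                     # predicted gaze point (x, y)

# usage: point = GazePointModel()(torch.randn(1, 3, 112, 112), torch.randn(1, 3, 36, 60))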
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art will appreciate that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (15)

1. A gaze point acquisition method, the method comprising:
acquiring a first feature through a face image of a target user and a target head posture feature extraction unit, and acquiring a second feature through an eye image of the target user and a target human eye feature extraction unit;
fusing the first feature and the second feature to obtain a fused feature;
predicting the gaze point of the target user based on the fused feature;
wherein the target head posture feature extraction unit is obtained by training with a first data set and a second data set, the target human eye feature extraction unit is obtained by training with the first data set and a third data set, the first data set comprises a first face image whose label is a line of sight angle, the second data set comprises a second face image whose label is the position of a face key point, and the third data set comprises an eye image whose label is the position of an eye key point.
2. The method according to claim 1, wherein the target head posture feature extraction unit is obtained by training with a first face image whose eyes are covered and a second face image whose eyes are covered.
3. The method according to claim 1, wherein the method further comprises:
obtaining a first loss value based on the first data set, the head posture feature extraction unit to be trained and the human eye feature extraction unit to be trained;
obtaining a second loss value based on the second data set and the head posture feature extraction unit to be trained;
obtaining a third loss value based on the third data set and the human eye feature extraction unit to be trained;
and training the head posture feature extraction unit to be trained and the human eye feature extraction unit to be trained based on the first loss value, the second loss value and the third loss value to obtain a target head posture feature extraction unit and a target human eye feature extraction unit.
4. The method according to claim 3, wherein the obtaining a first loss value based on the first data set, the head posture feature extraction unit to be trained, and the human eye feature extraction unit to be trained comprises: in the current training process, a first loss value is obtained based on a batch of first face images in the first data set, the head posture feature extraction unit to be trained, and the human eye feature extraction unit to be trained;
The obtaining a second loss value based on the second data set and the head posture feature extraction unit to be trained includes: in the current training process, a second loss value is obtained based on a batch of second face images in the second data set and a head posture feature extraction unit to be trained;
the obtaining a third loss value based on the third data set and the human eye feature extraction unit to be trained includes: in the current training process, a third loss value is obtained based on a batch of eye images in the third data set and a human eye feature extraction unit to be trained;
the training the head posture feature extraction unit to be trained and the human eye feature extraction unit to be trained based on the first loss value, the second loss value and the third loss value includes:
based on the first loss value, the second loss value and the third loss value, performing current training on the head posture feature extraction unit to be trained and the human eye feature extraction unit to be trained;
if the current training does not meet the target condition, the next training process is carried out, and if the current training meets the target condition, the target head posture feature extraction unit and the target human eye feature extraction unit are obtained.
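Claims 3 and 4 describe joint training in which each iteration draws one batch from each of the three data sets and combines three loss values. A minimal sketch of one such training step is given below (Python/PyTorch); the loss forms, equal loss weights, and auxiliary key point heads are assumptions for illustration, not the application's actual choices.

import torch
import torch.nn.functional as F

def train_step(head_unit, eye_unit, gaze_head, face_kpt_head, eye_kpt_head,
               batch1, batch2, batch3, optimizer):
    # One joint training iteration over a batch from each of the three data sets.
    face1, eye1, gaze_label = batch1        # first data set: face + eye images, gaze angle labels
    face2, face_kpt_label = batch2          # second data set: face images, face key point labels
    eye3, eye_kpt_label = batch3            # third data set: eye images, eye key point labels

    # first loss value: gaze angle predicted from the fused head posture and eye features
    fused = torch.cat([head_unit(face1), eye_unit(eye1)], dim=1)
    loss1 = F.l1_loss(gaze_head(fused), gaze_label)

    # second loss value: face key points predicted from the head posture features
    loss2 = F.l1_loss(face_kpt_head(head_unit(face2)), face_kpt_label)

    # third loss value: eye key points predicted from the human eye features
    loss3 = F.l1_loss(eye_kpt_head(eye_unit(eye3)), eye_kpt_label)

    total = loss1 + loss2 + loss3           # equal weights assumed for illustration
    optimizer.zero_grad()
    total.backward()
    optimizer.step()
    return total.item()

The loop would repeat this step until a target condition (for example, a maximum number of iterations or loss convergence) is met, after which head_unit and eye_unit serve as the target head posture feature extraction unit and the target human eye feature extraction unit.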
5. The method according to claim 3, wherein the obtaining a first loss value based on the first data set, the head posture feature extraction unit to be trained, and the human eye feature extraction unit to be trained comprises:
inputting the first face image in the first data set to a head posture feature extraction unit to be trained so as to obtain a first head posture feature;
inputting an eye image of a first face image in the first data set to a human eye feature extraction unit to be trained so as to obtain a first eye feature;
a first loss value is calculated based on the first head pose feature and the first eye feature, and a label corresponding to the first face image.
6. The method according to claim 5, wherein the inputting the first face image in the first data set to a head posture feature extraction unit to be trained to obtain a first head posture feature comprises:
covering the eyes of the first face image in the first data set to obtain a first face image with the eyes covered, and inputting the first face image with the eyes covered to the head posture feature extraction unit to be trained to obtain the first head posture feature.
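Claims 6 and 8 both rely on covering the eye regions of a face image before head posture feature extraction. As a purely illustrative sketch (the masking strategy, constant fill value, and box format below are assumptions, not taken from the application), such covering could be implemented as:

import numpy as np

def cover_eyes(face_img: np.ndarray, eye_boxes, fill_value=0) -> np.ndarray:
    # Return a copy of the face image (H x W x C) with the given eye regions filled with a constant.
    covered = face_img.copy()
    for x1, y1, x2, y2 in eye_boxes:        # one (x1, y1, x2, y2) box per eye, in pixel coordinates
        covered[y1:y2, x1:x2, :] = fill_value
    return covered

# usage with two hypothetical eye boxes:
# masked = cover_eyes(face, [(40, 50, 80, 70), (100, 50, 140, 70)])

A plausible design intent of such masking is to make the head posture branch rely on overall face geometry rather than on eye appearance, which is handled by the separate human eye feature extraction unit.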
7. The method according to claim 3, wherein the obtaining a second loss value based on the second data set and the head posture feature extraction unit to be trained comprises:
inputting the second face image in the second data set to a head posture feature extraction unit to be trained so as to obtain a second head posture feature;
and calculating a second loss value based on the second head posture feature and the label corresponding to the second face image.
8. The method according to claim 7, wherein the inputting the second face image in the second data set to the head posture feature extraction unit to be trained to obtain the second head posture feature comprises:
covering the eyes of the second face image in the second data set to obtain a second face image with the eyes covered;
and inputting the second face image with the eyes covered to the head posture feature extraction unit to be trained to obtain the second head posture feature.
9. The method according to claim 3, wherein the obtaining a third loss value based on the third data set and the human eye feature extraction unit to be trained comprises:
inputting the eye images in the third data set to the human eye feature extraction unit to be trained to obtain second eye features;
and calculating a third loss value based on the second eye feature and a label corresponding to the eye image.
10. The method according to any one of claims 1 to 9, wherein the second feature comprises a left eye feature and a right eye feature, and the fusing the first feature and the second feature to obtain a fused feature comprises:
fusing the first feature, the left eye feature, and the right eye feature to obtain an initial fused feature;
processing the initial fused feature through two fully connected layers to obtain the fused feature;
and the predicting the gaze point of the target user based on the fused feature comprises:
processing the fused feature through two fully connected layers to predict the gaze point of the target user.
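To make the two-plus-two fully connected layout of claim 10 concrete, here is a minimal sketch (Python/PyTorch; all layer widths and feature dimensions are assumed for illustration, not taken from the application):

import torch
import torch.nn as nn

class FusionHead(nn.Module):
    # Two fully connected layers for fusion, then two for gaze point prediction (widths assumed).
    def __init__(self, head_dim=64, eye_dim=64, hidden=128):
        super().__init__()
        self.fuse = nn.Sequential(
            nn.Linear(head_dim + 2 * eye_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.predict = nn.Sequential(
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 2))               # 2-D gaze point

    def forward(self, first_feature, left_eye_feature, right_eye_feature):
        initial = torch.cat([first_feature, left_eye_feature, right_eye_feature], dim=1)
        return self.predict(self.fuse(initial))

# usage: FusionHead()(torch.randn(1, 64), torch.randn(1, 64), torch.randn(1, 64))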
11. A method of model training, the method comprising:
training a head posture feature extraction unit to be trained through a first data set and a second data set to obtain a target head posture feature extraction unit, wherein the first data set comprises a first face image, a label of the first face image is a line-of-sight angle, the second data set comprises a second face image, and the label of the second face image is a position of a face key point;
training a human eye feature extraction unit to be trained through the first data set and a third data set to obtain a target human eye feature extraction unit, wherein the third data set comprises an eye image, and the label of the eye image is the position of an eye key point;
wherein the target head posture feature extraction unit and the target human eye feature extraction unit are used for identifying the gaze point of the target user according to the face image.
12. A gaze point acquisition device, the device comprising:
a feature acquisition unit, configured to acquire a first feature through a face image of a target user and a target head posture feature extraction unit, and acquire a second feature through an eye image of the target user and a target human eye feature extraction unit;
a feature fusion unit, configured to fuse the first feature and the second feature to obtain a fused feature;
a gaze point prediction unit, configured to predict the gaze point of the target user based on the fused feature;
wherein the target head posture feature extraction unit is obtained by training with a first data set and a second data set, the target human eye feature extraction unit is obtained by training with the first data set and a third data set, the first data set comprises a first face image whose label is a line of sight angle, the second data set comprises a second face image whose label is the position of a face key point, and the third data set comprises an eye image whose label is the position of an eye key point.
13. A model training apparatus, the apparatus comprising:
a first training unit, configured to train a head posture feature extraction unit to be trained through a first data set and a second data set to obtain a target head posture feature extraction unit, wherein the first data set comprises a first face image whose label is a line of sight angle, and the second data set comprises a second face image whose label is the position of a face key point;
a second training unit, configured to train a human eye feature extraction unit to be trained through the first data set and a third data set to obtain a target human eye feature extraction unit, wherein the third data set comprises an eye image whose label is the position of an eye key point;
wherein the target head posture feature extraction unit and the target human eye feature extraction unit are used for identifying the gaze point of the target user according to the face image.
14. An electronic device comprising one or more processors and memory;
one or more programs stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of any of claims 1-10.
15. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a program code, wherein the method according to any one of claims 1 to 10 is performed when the program code runs.
CN202311687082.7A 2023-12-08 2023-12-08 Gaze point acquisition method, model training method, device and electronic equipment Pending CN117671767A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311687082.7A CN117671767A (en) 2023-12-08 2023-12-08 Gaze point acquisition method, model training method, device and electronic equipment


Publications (1)

Publication Number Publication Date
CN117671767A 2024-03-08

Family

ID=90086126



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination