CN113822174A - Gaze estimation method, electronic device, and storage medium - Google Patents

Gaze estimation method, electronic device, and storage medium

Info

Publication number
CN113822174A
CN113822174A (application CN202111028654.1A)
Authority
CN
China
Prior art keywords
color image
estimation
subnet
network
point cloud
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202111028654.1A
Other languages
Chinese (zh)
Other versions
CN113822174B (en)
Inventor
赵欲苗
陈智超
朱海涛
户磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei Dilusense Technology Co Ltd
Original Assignee
Beijing Dilusense Technology Co Ltd
Hefei Dilusense Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dilusense Technology Co Ltd, Hefei Dilusense Technology Co Ltd filed Critical Beijing Dilusense Technology Co Ltd
Priority to CN202111028654.1A priority Critical patent/CN113822174B/en
Publication of CN113822174A publication Critical patent/CN113822174A/en
Application granted granted Critical
Publication of CN113822174B publication Critical patent/CN113822174B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention relates to the field of computers and discloses a gaze estimation method, an electronic device, and a storage medium. The gaze estimation method comprises the following steps: acquiring a face color image, a left eye color image and a right eye color image of a target object, and face point cloud data of the target object in a world coordinate system, according to a color image and a depth image of the target object; inputting the face point cloud data of the target object into a head pose subnet in a preset multi-modal gaze estimation network to obtain head pose data of the target object; and inputting the face color image, the left eye color image, the right eye color image and the head pose data into a gaze estimation subnet in the multi-modal gaze estimation network to obtain a gaze estimation result of the target object. With this embodiment, accurate head pose data can be acquired and the accuracy of gaze estimation is improved.

Description

Gaze estimation method, electronic device, and storage medium
Technical Field
Embodiments of the present invention relate to the field of computers, and in particular to a gaze estimation method, an electronic device, and a storage medium.
Background
Gaze estimation is used to estimate the direction of the human gaze. It is widely applied in many scenarios: an intelligent cockpit can judge whether a driver is fatigued or support human-vehicle interaction through the gaze; in medicine, it can be used to detect and diagnose mental or psychological disorders; in virtual reality, fine scene rendering can be performed only where the eyes are looking. Gaze estimation can also be applied to offline retail, human-computer interaction, and the like. Gaze estimation is closely related to a person's head pose: even if the two eye images are identical, the gaze direction varies greatly when the head pose differs. Current gaze estimation methods mainly include geometry-based methods, cross-ratio-based methods, and appearance-based methods. The basic idea of the geometry-based approach is to build a 3D eye model from the anatomy of the human eye and face and to compute the 3D gaze direction from the geometric relations between different eye and face features. This approach has the advantage of high accuracy, but it requires high-resolution eye images and a subject-specific calibration procedure. The cross-ratio-based approach obtains the gaze direction through a carefully designed lighting arrangement, using the invariance of the cross-ratio in projective geometry. A typical setup is as follows: four light sources are placed on the same plane, and the subject's eyes are imaged with a camera placed below the screen; the four corneal reflections and the pupil center of the eye are identified, and the gaze direction is calculated under the assumption that the cross-ratio formed by the elements in the scene plane is equal to the cross-ratio formed by the corresponding features in the camera imaging plane. The cross-ratio-based approach is easy to implement, uses a simple model, and requires no subject-specific calibration procedure; however, it also needs high-resolution eye images and requires precise localization of the corneal reflection points and the pupil center. The appearance-based approach has a simpler system structure and can obtain good results even for low-resolution images, but the gaze it estimates is difficult to bring to high precision, and its accuracy is closely tied to the accuracy of head pose estimation.
At present, appearance-based gaze estimation mostly computes the head pose from facial key points. However, because appearance differs from person to person (for example, the degree to which the mouth protrudes), facial key points also differ between people, and the key points of the same person change with facial expression even when the head pose does not. How to obtain an accurate head pose and thereby improve the accuracy of gaze estimation is a problem to be solved.
Disclosure of Invention
An object of embodiments of the present invention is to provide a method, an electronic device, and a storage medium for gaze estimation, which can obtain accurate head pose data and improve the accuracy of gaze estimation.
To solve the above technical problem, in a first aspect, an embodiment of the present application provides a gaze estimation method, including: acquiring a face color image, a left eye color image and a right eye color image of a target object, and face point cloud data of the target object in a world coordinate system, according to a color image and a depth image of the target object; inputting the face point cloud data of the target object into a head pose subnet in a preset multi-modal gaze estimation network to obtain head pose data of the target object; and inputting the face color image, the left eye color image, the right eye color image and the head pose data into a gaze estimation subnet in the multi-modal gaze estimation network to obtain a gaze estimation result of the target object.
In a second aspect, an embodiment of the present application further provides an electronic device, including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the above-described method of gaze estimation.
In a third aspect, embodiments of the present application further provide a computer-readable storage medium storing a computer program, which when executed by a processor implements the above-mentioned method for gaze estimation.
In the embodiment of the application, the face point cloud data of the target object in the world coordinate system is acquired from the depth image of the target object. In gaze estimation, the gaze direction is related not only to the eye state of the target object, such as the position of the eyes and the degree to which they are open, but also to the head pose of the target object. Because the face point cloud data of the target object carries rich geometric information, inputting it into the head pose subnet of the multi-modal gaze estimation network yields accurate head pose data of the target object; the face color image, the left eye color image, the right eye color image and the head pose data are then input into the gaze estimation subnet of the multi-modal gaze estimation network to obtain the gaze estimation result of the target object, realizing a multi-modal gaze estimation method. Because gaze estimation uses information from the two RGB-D modalities, the geometric information of the depth image and the appearance information of the color image can be fully exploited, improving the precision of gaze estimation.
In addition, before the face point cloud data of the target object is input into the head pose subnet in the preset multi-modal gaze estimation network to obtain the head pose data of the target object, the method further includes: fixing the network parameters of the head pose subnet, which has been trained to convergence in advance, wherein the output end of the head pose subnet is connected with the input end of an initial gaze estimation subnet; training the initial gaze estimation subnet to convergence to obtain the trained initial gaze estimation subnet; and after unfixing the network parameters of the head pose subnet, training the head pose subnet and the trained initial gaze estimation subnet again until both converge again, so as to obtain the multi-modal gaze estimation network. Because the output end of the head pose subnet is connected with the input end of the initial gaze estimation subnet, fixing the network parameters of the pre-trained, converged head pose subnet means that only the initial gaze estimation subnet is trained at this stage, which reduces the influence of the head pose subnet on the network parameters of the untrained initial gaze estimation subnet and also speeds up training; after the initial gaze estimation subnet has been trained to convergence, the network parameters of the head pose subnet are unfixed and the head pose subnet and the trained initial gaze estimation subnet are trained again, which further improves the accuracy of the multi-modal gaze estimation network.
In addition, before fixing the network parameters of the head pose subnet pre-trained to convergence, the method further includes: acquiring face point cloud data of a sample object; determining, according to the face point cloud data of the sample object, a rotation matrix corresponding to the face point cloud data of the sample object, wherein the rotation matrix indicates how the face point cloud data of the sample object is rotated; and training an initial head pose subnet until convergence according to the rotation matrix, the face point cloud data of the sample object and a preset head pose loss function, wherein the head pose loss function characterizes the difference between the frontal face pose point cloud data of the sample object output by the head pose subnet and the face point cloud data of the sample object under the rotation matrix. In the training of the head pose subnet, unsupervised training is performed through the difference between the frontal face pose point cloud data of the sample object and the face point cloud data of the sample object under the rotation matrix, so that no ground-truth labels need to be added to the samples, which reduces the labeling workload for the training set.
In addition, the expression of the head pose loss function is:
L_pose = (1/n) * Σ_{i=1..n} || D̂_i - R·d_i || + (1/n) * Σ_{i=1..n} || D̂_i - D_i ||
where D̂ represents the frontal face pose point cloud data of the sample object, d represents the input face point cloud data of the sample object, R represents the rotation matrix, n represents the number of points in the face point cloud data of the sample object, and D represents the frontal head point cloud data of a standard face.
In addition, training an initial gaze estimation subnet to convergence to obtain the trained initial gaze estimation subnet includes: inputting the face point cloud data of a sample object into the head pose subnet with fixed network parameters to acquire head pose data of the sample object; inputting the left eye color image of the sample object into a left eye feature extraction network in the initial gaze estimation subnet to obtain left eye features of the sample object, and inputting the right eye color image of the sample object into a right eye feature extraction network in the initial gaze estimation subnet to obtain right eye features of the sample object; splicing the left eye features and the right eye features of the sample object to obtain binocular spliced features of the sample object; sequentially passing the binocular spliced features of the sample object through two fully connected layers to generate a gaze vector of the sample object; and calculating a loss value of the multi-modal gaze estimation network according to the gaze vector, the head pose data of the sample object and a preset expression of a multi-modal loss function, wherein the initial gaze estimation subnet converges when the loss value is minimal, and the multi-modal loss function characterizes the difference between the real gaze of the sample object and the predicted gaze estimation result of the sample object.
In addition, the gaze estimation subnet includes a gaze conversion network layer, which fuses the features of the left eye color image and the features of the right eye color image using a 1 × 1 convolution and connects two fully connected layers to obtain a gaze offset; the gaze offset is used as output data in the multi-modal loss function. Fusing the features of the left eye and right eye color images through a 1 × 1 convolution and connecting two fully connected layers yields the gaze offset, which is the offset between different people's gaze directions caused by differences in the internal parameters of their eyeballs; adding the gaze offset to the gaze estimation network can solve the problem of inaccurate gaze estimation caused by differences in eyeball internal parameters.
In addition, the expression of the multi-modal loss function is:
L_gaze = arccos( (g · (R·ĝ + t)) / (||g|| · ||R·ĝ + t||) )
where g represents the annotated real gaze vector, || · || represents the modulus of a vector, R represents the rotation matrix, ĝ represents the gaze vector of the sample object, and t represents the gaze offset of the sample object.
In addition, acquiring the face color image, the left eye color image and the right eye color image of the target object and the face point cloud data of the target object in the world coordinate system according to the color image and the depth image of the target object includes: inputting the color image of the target object into a preset face recognition network to obtain the face color image of the target object; acquiring the position of each key point of the face color image of the target object; cropping the face color image according to the positions of the key points to obtain the left eye color image and the right eye color image; and converting the coordinates of the depth image into the face point cloud data of the target object in the world coordinate system. By preprocessing the images, the positions of the key points can be obtained accurately with the help of the depth information, and the left eye and right eye color images of the target object can then be segmented accurately, which improves the accuracy of gaze estimation.
Drawings
One or more embodiments are illustrated by way of example in the figures of the accompanying drawings, in which elements with the same reference numerals represent similar elements, and in which the figures are not drawn to scale unless otherwise specified.
Fig. 1 is a flow chart of a method of gaze estimation provided according to an embodiment of the application;
FIG. 2 is a schematic diagram of an embodiment of the present application for training a multi-modal gaze estimation network;
fig. 3 is a schematic network structure diagram of a gaze translation network provided in an embodiment of the present application;
fig. 4 is a schematic diagram of a specific implementation of a training head pose subnet provided in an embodiment of the present application;
fig. 5 is a schematic network structure diagram of a head pose subnet provided in an embodiment of the present application;
fig. 6 is a schematic network structure diagram of a multi-modal line of sight estimation network provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of an implementation of preprocessing an acquired image according to an embodiment of the present disclosure;
fig. 8 is a schematic diagram of face key points provided in an embodiment of the present application;
fig. 9 is a schematic structural diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments of the present invention will be described in detail below with reference to the accompanying drawings. Those of ordinary skill in the art will appreciate that numerous technical details are set forth in the embodiments in order to provide a better understanding of the present application; however, the technical solutions claimed in the present application can be implemented without these technical details, and with various changes and modifications based on the following embodiments.
The following embodiments are divided for convenience of description, and should not constitute any limitation to the specific implementation manner of the present invention, and the embodiments may be mutually incorporated and referred to without contradiction.
A first embodiment of the present application relates to a gaze estimation method. The gaze estimation method may be executed by an electronic device, and its flow is shown in fig. 1:
step 101: and acquiring a face color image, a left eye color image and a right eye color image of the target object and face point cloud data of the target object under a world coordinate system according to the color image and the depth image of the target object.
Specifically, the electronic device may be mounted with a color camera through which a color image of the target object is acquired, and a depth camera through which a depth image of the target object is acquired. Color and depth images of the target object may also be acquired from other acquisition devices. The target object may be a human, an animal with two eyes, e.g., a monkey, a dog; it may also be a robot with two eyes. In the present embodiment, the target object is described by taking a human as an example.
Because the gaze direction is related to the human eyes, the color image of the target object can be recognized by a face recognition network to obtain the face color image of the target object, and the left eye color image and the right eye color image are then cropped from the face color image according to the positions of the face key points. The color images in the embodiments of the present application are all RGB images.
Because the depth image is expressed in the coordinate system of the depth camera, to facilitate subsequent calculation the depth image can be converted into the face point cloud data of the target object in the world coordinate system; the conversion can be realized through the correspondence between the depth camera coordinate system and the world coordinate system, as sketched below.
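A minimal NumPy sketch of this conversion, assuming known depth-camera intrinsics (fx, fy, cx, cy) and a camera-to-world rigid transform (R_ext, t_ext); the function name, parameter names and depth scale are illustrative assumptions, not values from the patent.

```python
import numpy as np

def depth_to_world_points(depth, fx, fy, cx, cy, R_ext, t_ext, depth_scale=0.001):
    """Back-project a depth map (H x W, raw units) to a point cloud in world coordinates.

    fx, fy, cx, cy: depth-camera intrinsics; R_ext (3x3) and t_ext (3,) map camera
    coordinates to world coordinates; depth_scale converts raw depth values to meters.
    """
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    z = depth.astype(np.float32) * depth_scale
    valid = z > 0                                   # drop missing depth readings
    x = (u - cx) * z / fx                           # pinhole back-projection
    y = (v - cy) * z / fy
    pts_cam = np.stack([x[valid], y[valid], z[valid]], axis=1)   # N x 3, camera frame
    pts_world = pts_cam @ R_ext.T + t_ext           # rigid transform to the world frame
    return pts_world
```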
Step 102: and inputting the facial point cloud data of the target object into a head posture subnet in a preset multi-modal sight estimation network to obtain the head posture data of the target object.
Specifically, the direction of the human gaze is also generally correlated with the head pose: even if the eye images are identical, the gaze direction differs when the head poses differ. A head pose determined from a two-dimensional image carries only two-dimensional information, whereas point cloud data contains three-dimensional information; the face point cloud data therefore enriches the head pose information, so the head pose data of the target object determined from the face point cloud data is more accurate.
The training set of the head pose subnet can be data with head pose labels and facial point cloud data of each face, and the initial head pose subnet is trained according to the training set until convergence to obtain the head pose subnet.
Step 103: and inputting the face color image, the left eye color image, the right eye color image and the head posture data into a sight line estimation subnet in the multi-modal sight line estimation network to obtain a sight line estimation result of the target object.
Specifically, the multi-modal gaze estimation network comprises a head pose sub-network and a gaze estimation sub-network, the output of the head pose sub-network is connected with the input of the gaze estimation sub-network, i.e. the head pose data output by the head pose sub-network is used as the input data of the gaze estimation sub-network. And inputting the face color image, the left eye color image, the right eye color image and the head posture data into a sight line estimation subnet in the multi-modal sight line estimation network to obtain a sight line estimation result of the target object, wherein the sight line estimation result is the direction of the sight line.
In the embodiment of the application, the face point cloud data of the target object in the world coordinate system is obtained from the depth image of the target object. In gaze estimation, the gaze direction is related not only to the eye state of the target object, such as the eye position and the degree of eye opening, but also to the head pose of the target object. Because the face point cloud data of the target object carries rich geometric information, the head pose angles can be estimated accurately; inputting the face point cloud data of the target object into the head pose subnet of the multi-modal gaze estimation network therefore yields accurate head pose data of the target object, and inputting the face color image, the left eye color image, the right eye color image and the head pose data into the gaze estimation subnet of the multi-modal gaze estimation network yields the gaze estimation result of the target object, realizing a multi-modal gaze estimation method. Because gaze estimation uses information from the two RGB-D modalities, the geometric information of the depth image and the appearance information of the color image can be fully exploited, improving the precision of gaze estimation.
In one embodiment, the process of pre-training the multi-modal gaze estimation network may be as shown in fig. 2:
step 101-1: and fixing the network parameters of the head attitude subnet trained to be converged in advance, wherein the output end of the head attitude subnet is connected with the input end of the initial sight line estimation subnet.
Specifically, since the output of the head pose subnet in the multi-modal gaze estimation network is used as input to the gaze estimation subnet, the network parameters of the head pose subnet, which has been trained to convergence in advance, may be fixed in order to speed up training of the gaze estimation subnet. The head pose subnet can be trained separately in advance.
Step 101-2: and training the initial sight estimation subnet to be convergent to obtain the trained initial sight estimation subnet.
Specifically, the face point cloud data of the sample object is input into the head pose subnet with fixed network parameters to acquire the head pose data of the sample object; the left eye color image of the sample object is input into a left eye feature extraction network in the initial gaze estimation subnet to obtain the left eye features of the sample object, and the right eye color image of the sample object is input into a right eye feature extraction network in the initial gaze estimation subnet to obtain the right eye features of the sample object; the left eye features and the right eye features of the sample object are spliced to obtain the binocular spliced features of the sample object; the binocular spliced features of the sample object are sequentially input into two fully connected layers to generate the gaze vector of the sample object; and a loss value of the multi-modal gaze estimation network is calculated according to the gaze vector, the head pose data of the sample object and a preset expression of a multi-modal loss function, wherein the initial gaze estimation subnet converges when the loss value is minimal, and the multi-modal loss function represents the difference between the real gaze of the sample object and the predicted gaze estimation result of the sample object.
The training set includes the face color image, the left eye color image and the right eye color image of each sample object, and the face point cloud data of the sample objects. After the network parameters of the head pose subnet are fixed, the face point cloud data of the sample object is input into the head pose subnet to obtain the head pose data of the sample object.
The left eye color image of the sample object is input into the left eye feature extraction network in the initial gaze estimation subnet to obtain the left eye features of the sample object, and the right eye color image of the sample object is input into the right eye feature extraction network in the initial gaze estimation subnet to obtain the right eye features of the sample object. The left eye feature extraction network and the right eye feature extraction network can be RepVGG network layers, and the feature dimension output by the RepVGG network layer can be set as required; for example, the RepVGG network layer outputs 128-dimensional features. The left eye features and the right eye features of the sample object are spliced to form the binocular spliced features, which then pass through two fully connected layers in sequence, with output dimensions of 128 and 3 respectively; the final output is a 3 × 1 gaze vector ĝ. A sketch of this forward pass is given after this paragraph.
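The following PyTorch-style sketch illustrates this two-branch forward pass under stated assumptions: generic backbones stand in for the RepVGG layers, the ReLU between the two fully connected layers is an added assumption, the head pose input described elsewhere is omitted for brevity, and all class and variable names are hypothetical.

```python
import torch
import torch.nn as nn

class GazeEstimationSubnet(nn.Module):
    """Sketch of the gaze estimation subnet: two eye branches -> concatenation -> two FC layers."""
    def __init__(self, eye_backbone_left: nn.Module, eye_backbone_right: nn.Module,
                 feat_dim: int = 128):
        super().__init__()
        # Each backbone (e.g. a RepVGG-style CNN) maps a 36x60 RGB eye crop to a feat_dim vector.
        self.left_branch = eye_backbone_left
        self.right_branch = eye_backbone_right
        self.fc1 = nn.Linear(2 * feat_dim, 128)    # binocular spliced features -> 128
        self.fc2 = nn.Linear(128, 3)               # 128 -> 3-D gaze vector

    def forward(self, left_eye: torch.Tensor, right_eye: torch.Tensor) -> torch.Tensor:
        f_l = self.left_branch(left_eye)           # (B, feat_dim) left eye features
        f_r = self.right_branch(right_eye)         # (B, feat_dim) right eye features
        f = torch.cat([f_l, f_r], dim=1)           # binocular spliced features
        g_hat = self.fc2(torch.relu(self.fc1(f)))  # predicted 3 x 1 gaze vector per sample
        return g_hat
```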
The expression of the preset multi-modal loss function is shown in formula (1):
L_gaze = arccos( (g · ĝ) / (||g|| · ||ĝ||) )    (1)
where g represents the annotated real gaze vector, obtained by converting the annotated gaze Euler angles (p_i, y_i), with p_i and y_i being the pitch angle and yaw angle of the gaze, respectively; ĝ represents the predicted gaze vector; and || · || represents the modulus of a vector. The initial gaze estimation subnet converges when the loss value L_gaze is minimal; in other words, the loss is the included angle between the real gaze vector and the estimated gaze vector, which yields a robust gaze estimate.
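As a minimal sketch of such an angular loss (not the exact patented formula), the angle between the annotated and predicted gaze vectors can be computed as follows; the Euler-angle convention and the clamping constant are assumptions added for illustration.

```python
import torch

def angular_gaze_loss(g_true: torch.Tensor, g_pred: torch.Tensor) -> torch.Tensor:
    """Mean angle (radians) between annotated and predicted 3-D gaze vectors of shape (B, 3)."""
    cos_sim = torch.nn.functional.cosine_similarity(g_true, g_pred, dim=1)
    cos_sim = cos_sim.clamp(-1.0 + 1e-7, 1.0 - 1e-7)   # keep acos well-defined
    return torch.acos(cos_sim).mean()

def euler_to_gaze_vector(pitch: torch.Tensor, yaw: torch.Tensor) -> torch.Tensor:
    """One common pitch/yaw -> unit gaze vector convention (an assumption; conventions vary)."""
    x = -torch.cos(pitch) * torch.sin(yaw)
    y = -torch.sin(pitch)
    z = -torch.cos(pitch) * torch.cos(yaw)
    return torch.stack([x, y, z], dim=1)
```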
Step 101-3: and after the fixation of the network parameters of the head attitude subnet is removed, the head attitude subnet and the trained initial sight line estimation subnet are trained again until the head attitude subnet and the initial sight line estimation subnet are converged again, so that the multi-mode sight line estimation network is obtained.
Specifically, since the output of the head pose subnet influences the output of the initial gaze estimation subnet, the accuracy of the multi-modal gaze estimation network can be further improved as follows: after the initial gaze estimation subnet converges, the network parameters of the head pose subnet are unfixed, and the head pose subnet and the trained initial gaze estimation subnet are retrained until both converge again, thereby obtaining the multi-modal gaze estimation network. Jointly fine-tuning the network parameters of the head pose subnet and the initial gaze estimation subnet improves the accuracy of the multi-modal gaze estimation network, as sketched below.
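A minimal sketch of this two-stage schedule, assuming hypothetical head_pose_subnet and gaze_subnet modules, a data loader yielding point clouds, eye crops and annotated gaze, and a combined loss function; the optimizer choices, learning rates and epoch counts are illustrative only.

```python
import torch

def train_two_stage(head_pose_subnet, gaze_subnet, loader, loss_fn,
                    epochs_stage1=10, epochs_stage2=5):
    # Stage 1: freeze the pre-trained head pose subnet and train only the gaze subnet.
    for p in head_pose_subnet.parameters():
        p.requires_grad_(False)
    opt = torch.optim.Adam(gaze_subnet.parameters(), lr=1e-3)
    for _ in range(epochs_stage1):
        for face_pts, left_eye, right_eye, g_true in loader:
            pose = head_pose_subnet(face_pts)
            g_pred = gaze_subnet(left_eye, right_eye)
            loss = loss_fn(g_true, g_pred, pose)
            opt.zero_grad(); loss.backward(); opt.step()

    # Stage 2: unfreeze the head pose subnet and fine-tune both subnets jointly.
    for p in head_pose_subnet.parameters():
        p.requires_grad_(True)
    opt = torch.optim.Adam(
        list(head_pose_subnet.parameters()) + list(gaze_subnet.parameters()), lr=1e-4)
    for _ in range(epochs_stage2):
        for face_pts, left_eye, right_eye, g_true in loader:
            pose = head_pose_subnet(face_pts)
            g_pred = gaze_subnet(left_eye, right_eye)
            loss = loss_fn(g_true, g_pred, pose)
            opt.zero_grad(); loss.backward(); opt.step()
```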
In one embodiment, the gaze estimation subnet further comprises a gaze conversion network layer; the network structure of the gaze estimation subnet is shown in part A1 of fig. 3.
The gaze conversion network layer fuses the left eye features and the right eye features using a 1 × 1 convolution to obtain the gaze offset, and the gaze offset is used as output data in the multi-modal loss function. The input data of the gaze conversion network layer are the left eye features and the right eye features of the sample object; the input left eye and right eye features of the sample object can be fused using a 1 × 1 convolution, and the fused features are connected to two fully connected layers in sequence, whose output dimensions can be 128 and 3 respectively, finally yielding the gaze offset t of each sample object caused by differences in eyeball internal parameters, where the dimension of t is 3 × 1.
In this example, the expression of the multi-modal loss function can be as shown in formula (2):
L_gaze = arccos( (g · (R·ĝ + t)) / (||g|| · ||R·ĝ + t||) )    (2)
where g represents the annotated real gaze vector, || · || represents the modulus of a vector, R represents the rotation matrix, ĝ represents the gaze vector of the sample object, and t represents the gaze offset of the sample object.
In this embodiment, a gaze conversion network layer is added, and it outputs the gaze offset, which effectively addresses the problem that, due to differences in eyeball internal parameters, different people may have different gaze directions even with the same head pose and the same eye images. The method does not require a strict subject-specific calibration procedure, which greatly facilitates use.
In one embodiment, the process of training the head pose subnet is as shown in FIG. 4:
step 101-1-1: facial point cloud data of the sample object is acquired.
The left eye color image set is denoted Left = {l_1, l_2, …, l_n}, where l_i represents a single left eye color image. The right eye color image set is denoted Right = {r_1, r_2, …, r_n}, where r_i represents a single right eye color image. The set of face point cloud data in the world coordinate system is denoted Depth = {d_1, d_2, …, d_n}, where d_i represents the face point cloud data of a single face, and the set of gaze Euler angles is denoted Gaze = {(p_1, y_1), (p_2, y_2), …, (p_n, y_n)}, where p_i and y_i respectively represent the pitch angle and yaw angle of the gaze.
Step 101-1-2: and determining a rotation matrix corresponding to the facial point cloud data of the sample object according to the facial point cloud data of the sample object, wherein the rotation matrix is used for indicating the information of the rotation of the facial point cloud data of the sample object.
Step 101-1-3: and training the initial head attitude sub-network until convergence according to the rotation matrix, the facial point cloud data of the sample object and a preset head attitude loss function, wherein the head attitude loss function of the head attitude sub-network is used for representing the difference between the sample object front facial attitude point cloud data output by the head attitude sub-network and the facial point cloud data of the sample object under the rotation matrix.
The head pose subnet adopts an encoder-decoder network structure. It takes the face point cloud data d of the sample object as input and outputs the head pose data of the sample object, namely the head pose angles pitch, yaw and roll together with the frontal face pose point cloud data D̂ of the sample object. The rotation matrix R is determined from the head pose angles pitch, yaw and roll. The head pose subnet is shown in dashed lines in fig. 5. The head pose loss function of the head pose subnet characterizes the difference between the frontal face pose point cloud data of the sample object obtained by the head pose subnet and the face point cloud data of the sample object under the rotation matrix. When the head pose loss function is minimal, the head pose subnet converges.
The expression of the head pose loss function is:
L_pose = (1/n) * Σ_{i=1..n} || D̂_i - R·d_i || + (1/n) * Σ_{i=1..n} || D̂_i - D_i ||
where D̂ represents the frontal face pose point cloud data of the sample object, d represents the face point cloud data of the input sample object, R represents the rotation matrix, n represents the number of points in the face point cloud data of the sample object, and D represents the frontal head point cloud data of a standard face. The term || D̂_i - D_i || is used to ensure that the predicted D̂, like D, is point cloud head pose data that is as frontal as possible; D is obtained by performing three-dimensional reconstruction from multiple face images to obtain 3D point cloud head models and averaging these head models.
It should be noted that, in this embodiment, the network structure of the multi-modal gaze estimation network is shown in fig. 6, and includes a gaze estimation subnet A and a head pose estimation subnet B.
In this embodiment, an unsupervised head pose subnet is constructed, so that labeled head pose data does not need to be acquired for each sample object, which reduces the labeling work on the data set and improves training efficiency.
In one embodiment, the process of obtaining the face color image, the left eye color image, the right eye color image of the target object and the face point cloud data of the target object under the world coordinate system may be as shown in fig. 7:
step 1011: and inputting the color image of the target object into a preset face recognition network to obtain the face color image of the target object.
Specifically, a bounding box of the face is obtained through a color face detection network. Acquiring a boundary frame of a face returned by a color face detection network, expanding the width and the height of the boundary frame to be N times of the original width and the height, and cutting, wherein N is a positive number greater than 1, for example, N is 1.5; and taking the cut image as a face color image of the human face.
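A minimal sketch of this bounding-box expansion and cropping step, assuming an (x, y, w, h) box format returned by the face detector; the function name is hypothetical.

```python
import numpy as np

def expand_and_crop_face(image: np.ndarray, bbox, n: float = 1.5) -> np.ndarray:
    """Expand a face bounding box (x, y, w, h) by a factor n around its center and crop the image."""
    x, y, w, h = bbox
    cx, cy = x + w / 2.0, y + h / 2.0
    new_w, new_h = w * n, h * n
    x0 = int(max(cx - new_w / 2.0, 0))
    y0 = int(max(cy - new_h / 2.0, 0))
    x1 = int(min(cx + new_w / 2.0, image.shape[1]))
    y1 = int(min(cy + new_h / 2.0, image.shape[0]))
    return image[y0:y1, x0:x1]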
Step 1012: the positions of each key point of a face color image of a target object are acquired.
The face color image of the target object is input into a preset key point detection network to acquire the position of each key point of the face color image of the target object. For example, the positions of 106 face key points are obtained; the specific positions and order of the face key points are shown in fig. 8.
step 1013: and cutting the face color image according to the positions of the key points to obtain a left-eye color image and a right-eye color image.
Cutting human eyes through key points of the human eyes to respectively obtain a left eye image and a right eye image, scaling the human eye images to 36 × 60 in the same scale, and obtaining external parameters of rotation and translation matrixes aligned by RGB and Depth cameras through manual calibration
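A minimal sketch of the eye cropping and scaling step, assuming OpenCV is available and that the key points belonging to one eye have already been selected from the 106-point layout; the margin and aspect-ratio handling are assumptions.

```python
import cv2
import numpy as np

def crop_and_resize_eye(face_img: np.ndarray, eye_pts: np.ndarray,
                        out_size=(60, 36), margin: float = 0.4) -> np.ndarray:
    """Crop one eye around its key points (N x 2 array) and resize to 60x36 (width x height)."""
    x0, y0 = eye_pts.min(axis=0)
    x1, y1 = eye_pts.max(axis=0)
    pad = margin * (x1 - x0)                                   # loosen the crop around the eye
    x0, x1 = int(max(x0 - pad, 0)), int(min(x1 + pad, face_img.shape[1]))
    cy = int((y0 + y1) / 2)
    half_h = int((x1 - x0) * out_size[1] / out_size[0] / 2)    # keep the 60:36 aspect ratio
    y0, y1 = max(cy - half_h, 0), min(cy + half_h, face_img.shape[0])
    return cv2.resize(face_img[y0:y1, x0:x1], out_size)
```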
Step 1014: and converting the coordinates of the depth image into the facial point cloud data of the target object under the world coordinate system.
Specifically, the coordinate position of the RGB face frame cut and the position of the key point of the face are aligned to the depth map, and the depth map is cut to obtain the depth map of the face. In addition, the coordinates of the image coordinate system of the face depth image are converted into the coordinates of the world coordinate system, and finally point cloud data of the face of the world coordinate system are obtained. By data preprocessing, RGB face pictures, RGB left-eye pictures, RGB right-eye pictures, RGB 106 personal face key point positions and world coordinate system face point cloud data are obtained.
The steps of the above methods are divided for clarity of description; in implementation, they may be combined into one step, or a step may be split into multiple steps, and as long as the same logical relationship is preserved, such variations are within the protection scope of this patent. Adding insignificant modifications to the algorithm or process, or introducing insignificant design changes without changing the core design of the algorithm or process, is also within the scope of the patent.
A second embodiment of the present application relates to an electronic apparatus, as shown in fig. 9, including: at least one processor 201; and a memory 202 communicatively coupled to the at least one processor 201; the memory 202 stores instructions executable by the at least one processor 201, and the instructions are executed by the at least one processor 201 to enable the at least one processor 201 to perform the above-mentioned method for estimating the gaze.
The memory 202 and the processor 201 are connected by a bus, which may include any number of interconnected buses and bridges that link one or more of the various circuits of the processor 201 and the memory 202. The bus may also link various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. A bus interface provides an interface between the bus and the transceiver. The transceiver may be one element or a plurality of elements, such as a plurality of receivers and transmitters, providing a means for communicating with various other apparatus over a transmission medium. The data processed by the processor 201 is transmitted over a wireless medium through an antenna, which further receives the data and transmits the data to the processor 201.
The processor 201 is responsible for managing the bus and general processing and may also provide various functions including timing, peripheral interfaces, voltage regulation, power management, and other control functions. And the memory may be used to store data used by the processor in performing operations.
A third embodiment of the present application relates to a computer-readable storage medium storing a computer program which, when executed by a processor, implements the above-described gaze estimation method.
Those skilled in the art can understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing related hardware, where the program is stored in a storage medium and includes several instructions to enable a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a U disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and other media capable of storing program code.
It will be understood by those of ordinary skill in the art that the foregoing embodiments are specific examples for carrying out the invention, and that various changes in form and details may be made therein without departing from the spirit and scope of the invention in practice.

Claims (10)

1. A method of gaze estimation, comprising:
acquiring a face color image, a left eye color image and a right eye color image of a target object and face point cloud data of the target object under a world coordinate system according to the color image and the depth image of the target object;
inputting the facial point cloud data of the target object into a head posture sub-network in a preset multi-modal sight estimation network to obtain the head posture data of the target object;
and inputting the face color image, the left eye color image, the right eye color image and the head posture data into a sight line estimation subnet in the multi-modal sight line estimation network to obtain a sight line estimation result of the target object.
2. The gaze estimation method according to claim 1, wherein before inputting the point cloud data of the target object into a head pose sub-network in a preset multi-modal gaze estimation network and obtaining the head pose data of the target object, the method further comprises:
fixing the network parameters of the head attitude subnet trained to be convergent in advance, wherein the output end of the head attitude subnet is connected with the input end of the initial sight estimation subnet;
training an initial sight estimation subnet to be convergent, and obtaining the trained initial sight estimation subnet;
and after the fixation of the network parameters of the head attitude subnet is released, the head attitude subnet and the trained initial sight line estimation subnet are trained again until the head attitude subnet and the initial sight line estimation subnet are converged again, so that the multi-mode sight line estimation network is obtained.
3. The gaze estimation method of claim 2, wherein prior to the fixing the pre-trained network parameters to the converged head pose subnet, the method further comprises:
acquiring facial point cloud data of a sample object;
determining a rotation matrix corresponding to the facial point cloud data of the sample object according to the facial point cloud data of the sample object, wherein the rotation matrix is used for indicating information of the rotation of the facial point cloud data of the sample object;
training an initial head attitude sub-network until convergence according to the rotation matrix, the facial point cloud data of the sample object and a preset head attitude loss function, wherein the head attitude loss function of the head attitude sub-network is used for representing the difference between the sample object front facial attitude point cloud data output by the head attitude sub-network and the facial point cloud data of the sample object under the rotation matrix.
4. The gaze estimation method of claim 3, wherein the head pose loss function is expressed by:
L_pose = (1/n) * Σ_{i=1..n} || D̂_i - R·d_i || + (1/n) * Σ_{i=1..n} || D̂_i - D_i ||
where D̂ represents the frontal face pose point cloud data of the sample object, d represents the input face point cloud data of the sample object, R represents a rotation matrix, n represents the number of points in the face point cloud data of the sample object, and D represents the frontal head point cloud data of a standard face.
5. The method of gaze estimation according to claim 2, wherein said training an initial gaze estimation subnet to converge, obtaining said trained initial gaze estimation subnet, comprises:
inputting facial point cloud data of a sample object into a head posture network with fixed network parameters, and acquiring head posture data of the sample object;
inputting the left eye color image of the sample object into a left eye feature extraction network in the initial sight estimation subnet to obtain the left eye feature of the sample object, and inputting the right eye color image of the sample object into a right eye feature extraction network in the initial sight estimation subnet to obtain the right eye feature of the sample object;
splicing the left eye features of the sample object and the right eye features of the sample object to obtain the double-eye splicing features of the sample object;
sequentially inputting the double-eye splicing features of the sample object into two fully connected layers to generate a sight line vector of the sample object;
calculating a loss value of the multi-modal sight estimation network according to the sight line vector, the head posture data of the sample object and a preset expression of a multi-modal loss function, wherein when the loss value is minimum, the initial sight line estimation subnet is converged, and the multi-modal loss function represents a difference value between a real sight line of the sample object and a predicted sight line estimation result of the sample object.
6. The gaze estimation method of claim 5, wherein the gaze estimation sub-network comprises: a sight line conversion network layer, which is used for fusing the left eye features and the right eye features by utilizing a 1 × 1 convolution and connecting two fully connected layers to obtain a sight line offset; the sight line offset is used as output data in the multi-modal loss function.
7. The gaze estimation method of claim 6, wherein the multi-modal loss function is expressed by:
L_gaze = arccos( (g · (R·ĝ + t)) / (||g|| · ||R·ĝ + t||) )
where g represents the annotated real sight line vector, || · || represents the modulus of a vector, R represents the rotation matrix, ĝ represents the sight line vector of the sample object, and t represents the sight line offset of the sample object.
8. The gaze estimation method according to claim 1, wherein the acquiring a face color image, a left eye color image, a right eye color image of a target object and face point cloud data of the target object in a world coordinate system from a color image and a depth image of the target object comprises:
inputting the color image of the target object into a preset face recognition network to obtain a face color image of the target object;
acquiring each key point position of the face color image of the target object;
cutting the face color image according to the position of each key point to obtain a left-eye color image and a right-eye color image;
and converting the coordinates of the depth image into the facial point cloud data of the target object in a world coordinate system.
9. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of gaze estimation according to any of claims 1-8.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the method of gaze estimation according to any one of claims 1 to 8.
CN202111028654.1A 2021-09-02 2021-09-02 Sight line estimation method, electronic device and storage medium Active CN113822174B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111028654.1A CN113822174B (en) 2021-09-02 2021-09-02 Sight line estimation method, electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111028654.1A CN113822174B (en) 2021-09-02 2021-09-02 Sight line estimation method, electronic device and storage medium

Publications (2)

Publication Number Publication Date
CN113822174A true CN113822174A (en) 2021-12-21
CN113822174B CN113822174B (en) 2022-12-16

Family

ID=78923712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111028654.1A Active CN113822174B (en) 2021-09-02 2021-09-02 Sight line estimation method, electronic device and storage medium

Country Status (1)

Country Link
CN (1) CN113822174B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120135381A (en) * 2011-06-03 2012-12-13 한국기초과학지원연구원 Method of biometrics and device by using pupil geometry
CN108171218A (en) * 2018-01-29 2018-06-15 深圳市唯特视科技有限公司 A kind of gaze estimation method for watching network attentively based on appearance of depth
US20190362557A1 (en) * 2018-05-22 2019-11-28 Magic Leap, Inc. Transmodal input fusion for a wearable system
CN108985210A (en) * 2018-07-06 2018-12-11 常州大学 A kind of Eye-controlling focus method and system based on human eye geometrical characteristic
CN111414798A (en) * 2019-02-03 2020-07-14 沈阳工业大学 Head posture detection method and system based on RGB-D image
CN110046546A (en) * 2019-03-05 2019-07-23 成都旷视金智科技有限公司 A kind of adaptive line of sight method for tracing, device, system and storage medium
CN110472534A (en) * 2019-07-31 2019-11-19 厦门理工学院 3D object detection method, device, equipment and storage medium based on RGB-D data
CN111259713A (en) * 2019-09-16 2020-06-09 浙江工业大学 Sight tracking method based on self-adaptive weighting
CN111046734A (en) * 2019-11-12 2020-04-21 重庆邮电大学 Multi-modal fusion sight line estimation method based on expansion convolution
CN112270249A (en) * 2020-10-26 2021-01-26 湖南大学 Target pose estimation method fusing RGB-D visual features
CN113095274A (en) * 2021-04-26 2021-07-09 中山大学 Sight estimation method, system, device and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
ZIHENG ZHANG et al.: "RGB-D-based gaze point estimation via multi-column CNNs and facial landmarks global optimization", The Visual Computer *
LIU Mihan et al.: "Driver gaze zone estimation based on an RGB camera", Modern Computer *
CHEN Shun: "Gaze tracking technology based on deep multi-modal fusion", China Master's Theses Full-text Database, Information Science and Technology *

Also Published As

Publication number Publication date
CN113822174B (en) 2022-12-16

Similar Documents

Publication Publication Date Title
US11302064B2 (en) Method and apparatus for reconstructing three-dimensional model of human body, and storage medium
US11644898B2 (en) Eye tracking method and system
EP3992918A1 (en) Method for generating 3d expression base, voice interactive method, apparatus and medium
US11030455B2 (en) Pose recognition method, device and system for an object of interest to human eyes
WO2019161813A1 (en) Dynamic scene three-dimensional reconstruction method, apparatus and system, server, and medium
CN111325823A (en) Method, device and equipment for acquiring face texture image and storage medium
JP7015152B2 (en) Processing equipment, methods and programs related to key point data
CN109559332B (en) Sight tracking method combining bidirectional LSTM and Itracker
EP4002290A1 (en) Three-dimensional facial model generation method and apparatus, computer device and storage medium
US10706584B1 (en) Hand tracking using a passive camera system
CN113366491B (en) Eyeball tracking method, device and storage medium
CN111710036A (en) Method, device and equipment for constructing three-dimensional face model and storage medium
JP2023545200A (en) Parameter estimation model training method, parameter estimation model training apparatus, device, and storage medium
JP2021513175A (en) Data processing methods and devices, electronic devices and storage media
WO2022174594A1 (en) Multi-camera-based bare hand tracking and display method and system, and apparatus
CN111046734A (en) Multi-modal fusion sight line estimation method based on expansion convolution
CN114761997A (en) Target detection method, terminal device and medium
CN110348351B (en) Image semantic segmentation method, terminal and readable storage medium
CN116092178A (en) Gesture recognition and tracking method and system for mobile terminal
CN113538682B (en) Model training method, head reconstruction method, electronic device, and storage medium
WO2021098554A1 (en) Feature extraction method and apparatus, device, and storage medium
CN114093024A (en) Human body action recognition method, device, equipment and storage medium
CN113284184A (en) Robot RGBD visual perception oriented 6D pose estimation method and system
CN113822174B (en) Sight line estimation method, electronic device and storage medium
CN112183271A (en) Image processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220507

Address after: 230091 room 611-217, R & D center building, China (Hefei) international intelligent voice Industrial Park, 3333 Xiyou Road, high tech Zone, Hefei, Anhui Province

Applicant after: Hefei lushenshi Technology Co.,Ltd.

Address before: 100083 room 3032, North B, bungalow, building 2, A5 Xueyuan Road, Haidian District, Beijing

Applicant before: BEIJING DILUSENSE TECHNOLOGY CO.,LTD.

Applicant before: Hefei lushenshi Technology Co.,Ltd.

GR01 Patent grant