CN112308006A - Sight line area prediction model generation method and device, storage medium and electronic equipment - Google Patents

Sight line area prediction model generation method and device, storage medium and electronic equipment

Info

Publication number
CN112308006A
Authority
CN
China
Prior art keywords
prediction
sight line
eye
face image
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011244436.7A
Other languages
Chinese (zh)
Inventor
陶冶
张宏志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Horizon Robotics Science and Technology Co Ltd
Original Assignee
Shenzhen Horizon Robotics Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Horizon Robotics Science and Technology Co Ltd filed Critical Shenzhen Horizon Robotics Science and Technology Co Ltd
Priority to CN202011244436.7A priority Critical patent/CN112308006A/en
Publication of CN112308006A publication Critical patent/CN112308006A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 Eye characteristics, e.g. of the iris
    • G06V40/193 Preprocessing; Feature extraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168 Feature extraction; Face representation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 Eye characteristics, e.g. of the iris
    • G06V40/197 Matching; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Ophthalmology & Optometry (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The embodiment of the disclosure discloses a method and a device for generating a sight line region prediction model, a sight line region prediction method and device, a computer-readable storage medium and an electronic device, wherein the sight line region prediction model comprises an eye posture prediction module and a sight line region prediction module, and the generation method comprises the following steps: taking a first face image sample set with eye posture labels as the input of an initial model, and training the eye posture prediction module; taking a second face image sample set with sight line region labels as the input of the initial model to obtain eye posture prediction results; training the sight line region prediction module based on the second face image sample set; and determining the initial model after training as the sight line region prediction model. The embodiment of the disclosure realizes end-to-end model training and improves the prediction accuracy of the sight line region prediction model; when the sight line region prediction model is used, a face image can be input directly to obtain the sight line region information, which improves the efficiency of sight line region prediction.

Description

Sight line area prediction model generation method and device, storage medium and electronic equipment
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a method and an apparatus for generating a line-of-sight region prediction model, a line-of-sight region prediction method and apparatus, a computer-readable storage medium, and an electronic device.
Background
With the development of artificial intelligence technology, the sight line area prediction technology is applied to many fields, such as fatigue driving detection, sight line tracking, intelligent cockpit, intelligent human-computer interaction and the like. In the current sight line region prediction method, a plurality of models are generally used for performing prediction of different functions, and then the sight line region is further determined according to prediction results. For example, a gaze angle prediction model, a keypoint detection model, a human eye depth prediction model, a gaze three-dimensional region calculation model, etc. are trained, each model typically being trained separately.
Disclosure of Invention
The embodiment of the disclosure provides a method and a device for generating a sight line region prediction model, a computer-readable storage medium and electronic equipment.
The embodiment of the disclosure provides a method for generating a sight line region prediction model, wherein the sight line region prediction model comprises an eye posture prediction module and a sight line region prediction module, and the method comprises the following steps: taking a first face image sample set with an eye posture label as input of an initial model, and training an eye posture prediction module; taking a second face image sample set with a sight line area label as the input of an initial model to obtain an eye posture prediction result; training a sight line region prediction module based on a second face image sample set; and determining the initial model after the training is finished as a sight line area prediction model.
According to another aspect of an embodiment of the present disclosure, there is provided a gaze area prediction method including: acquiring a face image shot by a target camera; inputting a face image into a pre-trained sight region prediction model, and obtaining an eye posture prediction result based on an eye posture prediction module included in the sight region prediction model, wherein the sight region prediction model is obtained by training in advance based on a generation method of the sight region prediction model; and inputting the eye posture prediction result into a sight line region prediction module included in the sight line region prediction model to obtain sight line region information corresponding to the face image.
According to another aspect of the embodiments of the present disclosure, there is provided an apparatus for generating a gaze region prediction model, the gaze region prediction model including an eye pose prediction module and a gaze region prediction module, the apparatus including: the first training module is used for taking a first face image sample set with an eye posture label as the input of an initial model and training the eye posture prediction module; the first determining module is used for taking a second face image sample set with a sight line area label as the input of the initial model to obtain an eye posture prediction result; the second training module is used for training the sight line region prediction module based on a second face image sample set; and the second determining module is used for determining the initial model after the training is finished as the sight line area prediction model.
According to another aspect of an embodiment of the present disclosure, there is provided a sight-line region prediction apparatus including: the acquisition module is used for acquiring a face image shot by the target camera; the first prediction module is used for inputting the face image into a pre-trained sight region prediction model, and obtaining an eye posture prediction result based on an eye posture prediction module included in the sight region prediction model, wherein the sight region prediction model is obtained by training in advance based on a generation method of the sight region prediction model; and the second prediction module is used for inputting the eye posture prediction result into the sight line region prediction module included in the sight line region prediction model to obtain sight line region information corresponding to the face image.
According to another aspect of the embodiments of the present disclosure, there is provided a computer-readable storage medium storing a computer program for executing the above-described sight-line region prediction model generation method or sight-line region prediction method.
According to another aspect of the embodiments of the present disclosure, there is provided an electronic apparatus including: a processor; a memory for storing processor-executable instructions; and the processor is used for reading the executable instructions from the memory and executing the instructions to realize the generation method of the sight line region prediction model or the sight line region prediction method.
Based on the method, apparatus, computer-readable storage medium, and electronic device for generating the gaze region prediction model provided by the above embodiments of the present disclosure, the eye pose prediction module included in the initial model is trained using the first face image sample set; the gaze region prediction module included in the initial model is then trained using the second face image sample set together with the eye pose prediction results produced by the eye pose prediction module; and finally the trained initial model is taken as the gaze region prediction model. Because the eye pose prediction module and the gaze region prediction module are both included in the gaze region prediction model, the parameters of each component of the model are updated within the same training process, which implements end-to-end model training and improves the prediction accuracy of the gaze region prediction model. In addition, when the gaze region prediction model is used, a face image can be input directly to obtain the sight line area information, which improves the efficiency of sight line area prediction.
The technical solution of the present disclosure is further described in detail by the accompanying drawings and examples.
Drawings
The above and other objects, features and advantages of the present disclosure will become more apparent by describing in more detail embodiments of the present disclosure with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the disclosure and are incorporated in and constitute a part of this specification, illustrate embodiments of the disclosure and together with the description serve to explain the principles of the disclosure and not to limit the disclosure. In the drawings, like reference numbers generally represent like parts or steps.
Fig. 1 is a system diagram to which the present disclosure is applicable.
Fig. 2 is a flowchart illustrating a method for generating a gaze area prediction model according to an exemplary embodiment of the present disclosure.
Fig. 3 is a flowchart illustrating a method for generating a gaze area prediction model according to another exemplary embodiment of the present disclosure.
Fig. 4 is a flowchart illustrating a method for generating a gaze area prediction model according to another exemplary embodiment of the present disclosure.
Fig. 5 is an exemplary schematic diagram of a model training process of a generation method of a sight-line region prediction model according to the present disclosure.
Fig. 6 is a flowchart illustrating a method for predicting a gaze area according to an exemplary embodiment of the disclosure.
Fig. 7 is a schematic structural diagram of a device for generating a gaze region prediction model according to an exemplary embodiment of the present disclosure.
Fig. 8 is a schematic structural diagram of a device for generating a gaze region prediction model according to another exemplary embodiment of the present disclosure.
Fig. 9 is a schematic structural diagram of a sight-line region prediction apparatus according to an exemplary embodiment of the present disclosure.
Fig. 10 is a block diagram of an electronic device provided in an exemplary embodiment of the present disclosure.
Detailed Description
Hereinafter, example embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. It is to be understood that the described embodiments are merely a subset of the embodiments of the present disclosure and not all embodiments of the present disclosure, with the understanding that the present disclosure is not limited to the example embodiments described herein.
It should be noted that: the relative arrangement of the components and steps, the numerical expressions, and numerical values set forth in these embodiments do not limit the scope of the present disclosure unless specifically stated otherwise.
It will be understood by those skilled in the art that the terms "first," "second," and the like in the embodiments of the present disclosure are used merely to distinguish one element from another, and do not imply any particular technical meaning or any necessary logical order between them.
It is also understood that in embodiments of the present disclosure, "a plurality" may refer to two or more and "at least one" may refer to one, two or more.
It is also to be understood that any reference to any component, data, or structure in the embodiments of the disclosure, may be generally understood as one or more, unless explicitly defined otherwise or stated otherwise.
In addition, the term "and/or" in the present disclosure merely describes an association relationship between associated objects and indicates that three kinds of relationships may exist; for example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone. In addition, the character "/" in the present disclosure generally indicates that the former and latter associated objects are in an "or" relationship.
It should also be understood that the description of the various embodiments of the present disclosure emphasizes the differences between the various embodiments, and the same or similar parts may be referred to each other, so that the descriptions thereof are omitted for brevity.
Meanwhile, it should be understood that the sizes of the respective portions shown in the drawings are not drawn in an actual proportional relationship for the convenience of description.
The following description of at least one exemplary embodiment is merely illustrative in nature and is in no way intended to limit the disclosure, its application, or uses.
Techniques, methods, and apparatus known to those of ordinary skill in the relevant art may not be discussed in detail but are intended to be part of the specification where appropriate.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, further discussion thereof is not required in subsequent figures.
The disclosed embodiments may be applied to electronic devices such as terminal devices, computer systems, and servers, which are operational with numerous other general purpose or special purpose computing system environments or configurations. Examples of well-known terminal devices, computing systems, environments, and/or configurations that may be suitable for use with electronic devices such as terminal devices, computer systems, and servers include, but are not limited to: personal computer systems, server computer systems, thin clients, thick clients, hand-held or laptop devices, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputer systems, mainframe computer systems, distributed cloud computing environments that include any of the above systems, and the like.
Electronic devices such as terminal devices, computer systems, servers, etc. may be described in the general context of computer system-executable instructions, such as program modules, being executed by a computer system. Generally, program modules may include routines, programs, objects, components, logic, data structures, etc. that perform particular tasks or implement particular abstract data types. The computer system/server may be practiced in distributed cloud computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed cloud computing environment, program modules may be located in both local and remote computer system storage media including memory storage devices.
Summary of the application
In the existing scheme, calculating the sight line area involves fitting a plurality of models, and the optimization target of each model is not completely consistent with the final sight line area prediction result, so the optimization result deviates and the performance of the final sight line area prediction is affected. For example, current sight line region prediction schemes typically involve fitting four key models (a gaze angle prediction model, a keypoint detection model, an eye depth prediction model, and a gaze three-dimensional region calculation model), each of which needs to be fitted with labeled raw data. The accuracy requirement of each model is generally met by training each model independently, which omits the guidance of the final sight line region prediction error on each model's learning target and therefore reduces the prediction accuracy of the models.
Exemplary System
Fig. 1 illustrates an exemplary system architecture 100 to which the method or apparatus for generating a line-of-sight area prediction model, or the line-of-sight area prediction method or apparatus, of an embodiment of the present disclosure may be applied.
As shown in fig. 1, system architecture 100 may include terminal device 101, network 102, and server 103. Network 102 is the medium used to provide communication links between terminal devices 101 and server 103. Network 102 may include various connection types, such as wired, wireless communication links, or fiber optic cables, to name a few.
A user may use terminal device 101 to interact with server 103 over network 102 to receive or send messages and the like. Various communication client applications, such as a search-type application, a web browser application, a shopping-type application, an instant messaging tool, etc., may be installed on the terminal device 101.
The terminal device 101 may be various electronic devices including, but not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a PDA (personal digital assistant), a PAD (tablet computer), a PMP (portable multimedia player), a vehicle-mounted terminal (e.g., a car navigation terminal), etc., and a fixed terminal such as a digital TV, a desktop computer, etc.
The server 103 may be a server providing various services, such as a background server performing model training using a face image sample set uploaded by the terminal device 101 or performing gaze area prediction using a face image uploaded by the terminal device 101. The background server can perform model training by using the face image sample set to obtain a sight region prediction model, or perform sight region prediction on the received face image by using the sight region prediction model to obtain sight region information.
It should be noted that the method for generating the line-of-sight area prediction model or the method for predicting the line-of-sight area provided in the embodiment of the present disclosure may be executed by the server 103 or the terminal device 101, and accordingly, the device for generating the line-of-sight area prediction model or the device for predicting the line-of-sight area may be provided in the server 103 or the terminal device 101.
It should be understood that the number of terminal devices, networks, and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation. In the case that the face image sample set or the face image for performing the sight line region prediction does not need to be acquired from a remote place, the system architecture may not include a network, and only includes a server or a terminal device.
Exemplary method
Fig. 2 is a flowchart illustrating a method for generating a gaze area prediction model according to an exemplary embodiment of the present disclosure. The embodiment can be applied to an electronic device (such as the terminal device 101 or the server 103 shown in fig. 1), and as shown in fig. 2, the method includes the following steps:
step 201, taking a first face image sample set with an eye posture label as an input of an initial model, and training an eye posture prediction module.
In this embodiment, the electronic device may train the eye pose prediction module by using the first face image sample set with the eye pose tag as an input of the initial model. The face images in the first face image sample set comprise eye regions. The sight line region prediction model in the present embodiment includes an eye pose prediction module and a sight line region prediction module. The eye pose prediction module is used for determining the eye pose of a person indicated by the input human face image, such as the sight angle, the pupil depth and the like. The sight line region prediction module is used for determining a sight line region of a person indicated by the input face image, namely a region watched by human eyes.
The eye pose tag may be information with which the face images in the first face image sample set are labeled in advance, obtained in advance by an existing eye pose acquisition method. For example, the pupil depth label may be obtained by determining the position of the pupil in three-dimensional space and then computing the distance between the pupil and the camera; the gaze angle label may be obtained by determining the position of the pupil and the position of the gazed viewpoint in three-dimensional space and then computing the angle of the straight line connecting the two points.
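For illustration only, the label derivation described above could be sketched as follows; the coordinate convention (x right, y down, z forward, camera at the origin) and the helper names are assumptions, not part of the disclosure.

```python
import numpy as np

def gaze_angle_label(pupil_xyz, gaze_point_xyz):
    """Hypothetical helper: yaw/pitch (degrees) of the line from the pupil to the gazed viewpoint."""
    d = np.asarray(gaze_point_xyz, dtype=float) - np.asarray(pupil_xyz, dtype=float)
    yaw = np.degrees(np.arctan2(d[0], d[2]))                     # horizontal angle
    pitch = np.degrees(np.arctan2(-d[1], np.hypot(d[0], d[2])))  # vertical angle
    return yaw, pitch

def pupil_depth_label(pupil_xyz):
    """Hypothetical helper: distance from the camera origin to the pupil."""
    return float(np.linalg.norm(np.asarray(pupil_xyz, dtype=float)))
```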
The eye pose prediction module may typically be comprised of an artificial neural network, such as a convolutional neural network comprising convolutional layers, pooling layers, fully-connected layers, and the like. The electronic device may use a machine learning method to input a face image in the first face image sample set as an initial model, where the initial model may first extract features of an eye region in the face image, use the features of the eye region as inputs of an eye pose prediction module, use an eye pose tag corresponding to the input face image as an expected output (i.e., an object of optimizing model parameters), train the initial model, and obtain an actual output for the face image input by each training. And the actual output is data actually output by the initial model and is used for representing the eye posture. Then, the electronic device may adjust parameters of the eye posture prediction module included in the initial model based on the actual output and the expected output by using a gradient descent method and a back propagation method, use the model obtained after each parameter adjustment as the initial model for the next training, and end the training under the condition that a preset training end condition for the eye posture prediction module is satisfied, thereby obtaining the trained eye posture prediction module. The end-of-training conditions may include, but are not limited to, at least one of the following: the training times reach preset times, and the loss value of the loss function corresponding to the eye posture prediction module reaches a preset loss value.
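A minimal training-loop sketch of this step, assuming a PyTorch-style initial model that exposes a feature extractor and an eye pose head; the module names, optimizer, and loss choice are illustrative assumptions rather than the patent's reference implementation.

```python
import torch

def train_eye_pose_module(initial_model, loader, epochs=10, lr=1e-3):
    """Train the eye pose branch of a hypothetical PyTorch-style initial model."""
    criterion = torch.nn.L1Loss()                                  # regression loss for the eye pose
    optimizer = torch.optim.SGD(initial_model.parameters(), lr=lr)
    for _ in range(epochs):
        for face_images, eye_pose_labels in loader:                # first face image sample set
            eye_feat, face_feat = initial_model.extract_features(face_images)
            actual_output = initial_model.eye_pose_head(eye_feat, face_feat)
            loss = criterion(actual_output, eye_pose_labels)       # actual vs. expected output
            optimizer.zero_grad()
            loss.backward()                                        # back propagation
            optimizer.step()                                       # gradient descent update
    return initial_model
```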
Step 202, taking the second face image sample set with the sight line area label as the input of the initial model, and obtaining the eye posture prediction result.
In this embodiment, the electronic device may use the second face image sample set with the sight line region label as an input of the initial model, so as to obtain the eye pose prediction result. Specifically, the trained eye posture prediction module can be used for predicting the eye posture of the input face image to obtain an eye posture prediction result. Eye pose results may include, but are not limited to, gaze angle, pupil depth, and the like.
The sight line region label may be obtained by labeling the face image in advance. The line-of-sight region label typically includes a plurality of coordinate information for determining a region in three-dimensional space.
And step 203, training a sight line region prediction module based on the second face image sample set.
In this embodiment, the electronic device may train the gaze region prediction module based on the second face image sample set. The sight line area prediction module is used for obtaining sight line area information representing a fixation area of human eyes according to the eye posture prediction result.
In particular, the gaze area prediction module may generally consist of an artificial neural network, for example a convolutional neural network comprising convolutional layers, pooling layers, fully-connected layers, etc., which may classify the three-dimensional space, determining which areas are the areas at which the human eye is gazing.
The electronic device may use a machine learning method to input the face images in the second face image sample set as an initial model, obtain an eye pose prediction result through processing by the eye pose prediction module, use the eye pose prediction result as an input of the sight region prediction module, use a sight region label corresponding to the input face image as an expected output (i.e., an object for optimizing model parameters), train the initial model, and obtain an actual output for each training of the input face image. Wherein the actual output is data actually output by the initial model and used for representing the sight line area. Then, the electronic device may adjust parameters of the sight region prediction module included in the initial model based on the actual output and the expected output by using a gradient descent method and a back propagation method, use the model obtained after each parameter adjustment as the initial model for the next training, and end the training under the condition that a preset training end condition for the sight region prediction module is met, thereby obtaining the trained sight region prediction module. The end-of-training conditions may include, but are not limited to, at least one of the following: the training times reach preset times, and the loss value of the loss function corresponding to the sight line region prediction module reaches a preset loss value.
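A similar hedged sketch of this second stage, assuming the sight line region label is a class index over a fixed set of candidate regions and that the modules are chained as described; all names are illustrative.

```python
import torch
import torch.nn.functional as F

def train_gaze_region_module(model, loader, epochs=10, lr=1e-3):
    """Train the sight line region head on top of the already trained eye pose head."""
    optimizer = torch.optim.SGD(model.gaze_region_head.parameters(), lr=lr)
    for _ in range(epochs):
        for face_images, region_labels in loader:                   # second face image sample set
            eye_feat, face_feat = model.extract_features(face_images)
            eye_pose = model.eye_pose_head(eye_feat, face_feat)      # eye pose prediction result
            region_logits = model.gaze_region_head(eye_pose)         # scores over candidate regions
            loss = F.cross_entropy(region_logits, region_labels)     # classification loss
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
```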
And step 204, determining the initial model after the training is finished as a sight line area prediction model.
In this embodiment, the electronic device may determine the initial model after the training is finished as the gaze area prediction model. The sight line region prediction model may receive a face image including an eye region and output sight line region information.
According to the method provided by the embodiment of the disclosure, the eye posture prediction module included in the initial model is trained using the first face image sample set; the sight line area prediction module included in the initial model is then trained using the second face image sample set together with the eye posture prediction results produced by the eye posture prediction module; and finally the trained initial model is taken as the sight line area prediction model. Because both modules are contained in one model and their parameters are updated within the same training process, end-to-end model training is achieved, which improves the prediction accuracy of the sight line area prediction model.
In some alternative implementations, the first face image sample set includes a smaller number of face images than the second face image sample set. This is because labeling the sight line region label is simple, whereas collecting the eye pose label is complex and requires professional tools and personnel. For example, the eye pose tags may include a gaze angle tag and/or a pupil depth tag; the gaze angle tag may be obtained by determining in advance the position of the pupil and the position of the gazed viewpoint in three-dimensional space and computing the angle of the straight line between the two points, and the pupil depth label may be obtained by using a depth acquisition device, such as a binocular camera, a TOF camera, a structured light camera, or the like. Therefore, the first face image sample set with eye pose labels collected in this implementation contains relatively few face images: the eye pose prediction module can be preliminarily trained with fewer samples, while the sight region prediction module can be trained with more samples, and a good training effect can still be obtained. In this way, the difficulty of label acquisition is reduced and the model training efficiency is improved.
In some alternative implementations, the step 201 may be performed as follows:
in response to determining that the eye pose tag corresponding to the input face image comprises a gaze angle tag, a gaze angle prediction sub-module included by the eye pose prediction module is trained based on the gaze angle tag.
And in response to determining that the eye posture label corresponding to the input human face image comprises a pupil depth label, training a pupil depth prediction submodule comprised by the eye posture prediction module based on the pupil depth label.
The eye posture prediction module provided by the implementation manner may include a sight angle prediction submodule and/or a pupil depth prediction submodule. The sight angle submodule is used for determining the sight angle of a person indicated by the input face image. The pupil depth prediction submodule is used for determining the pupil depth of a person indicated by the input face image. The pupil depth represents the distance of the pupil from the camera. The gaze angle prediction sub-module and the pupil depth prediction sub-module may typically be comprised of artificial neural networks (e.g., convolutional neural networks).
It should be understood that the method for training the gaze angle prediction sub-module and the pupil depth prediction sub-module is substantially the same as step 201, and will not be further described here.
According to the implementation mode, the sight angle prediction submodule and the pupil depth prediction submodule are arranged, so that the eye posture can be determined more comprehensively, and the accuracy of sight region prediction is improved.
In some alternative implementations, step 201 may be performed as follows:
firstly, a face image included in a first face image sample set is input into a feature extraction module included in an initial model, and eye feature data and face feature data are obtained.
The feature extraction module is used for extracting eye feature data and face feature data. The ocular feature data may characterize the features of the eye, such as the size, proportion, pupil location, etc. of the eye; the face feature data is used to characterize features of the face, such as the orientation, size, etc. of the face. The feature extraction module may employ an existing multi-layer convolutional neural network structure.
Then, based on the eye pose labels, the eye pose prediction module is trained by using the eye feature data and the face feature data.
Specifically, in the training process, the module parameters may be continuously adjusted based on the loss values of the loss functions respectively corresponding to the eye posture prediction module and the sight line region prediction module. It should be noted that, if the eye posture prediction module includes the gaze angle prediction sub-module and/or the pupil depth prediction sub-module shown in the above optional implementation manners, the eye feature data and the face feature data may be used as the input of the gaze angle prediction sub-module and/or the pupil depth prediction sub-module, and the gaze angle label and/or the pupil depth label are used for training.
According to the implementation mode, the feature extraction module is arranged in the sight line region prediction model, so that the eye features and the face features of the input face image can be extracted, the eye pose prediction is performed by fully utilizing the eye features and the face features, and the accuracy of the eye pose prediction and the accuracy of the sight line region prediction are improved.
With further reference to fig. 3, a flow diagram of yet another embodiment of a method of generating a gaze region prediction model is shown. As shown in fig. 3, on the basis of the embodiment shown in fig. 2, step 201 may include the following steps:
in step 2011, in response to determining that the input sample face image further has a corresponding gaze region label, setting the weight of the eye pose loss function corresponding to the eye pose prediction module as a first weight, and setting the weight of the gaze region loss function corresponding to the gaze region prediction module as a second weight.
In this embodiment, the first weight is greater than the second weight.
Specifically, the face images in the first face image sample set in the present embodiment have corresponding eye posture labels and sight line region labels (some or all of the face images may have both). In training, a larger first weight (e.g., 0.9) may be set for the eye posture loss function and a smaller second weight (e.g., 0.1) may be set for the sight line region loss function. In the training process, because the weight of the sight line region loss function is smaller, the update amplitude of the parameters of the sight line region prediction module is smaller; that is, the update of the model parameters is concentrated in the eye posture prediction module. The eye posture loss function may be a regression loss function, such as an L1 norm loss function or an L2 norm loss function. The sight line region loss function may be a loss function for classification, such as a cross-entropy loss function.
Step 2012, a first loss value of the eye pose loss function and a second loss value of the sight line region loss function are determined based on the eye pose tag and the sight line region tag corresponding to the input sample face image.
In this embodiment, based on the eye pose loss function, a first loss value characterizing a difference between the predicted eye pose and the actual labeled eye pose may be determined during training. Based on the gaze region loss function, a second loss value characterizing the gap between the predicted gaze region and the actual labeled gaze region may be determined at the time of training.
Step 2013, training an eye posture prediction module based on the first weight, the second weight, the first loss value and the second loss value.
As an example, assuming that the eye pose loss function is l1, the gaze region loss function is l2, the first weight is a, and the second weight is b, the total loss function is l = a × l1 + b × l2. In the training process, parameters of the initial model are adjusted by using a gradient descent method and a back propagation method, so that the total loss value is continuously reduced until convergence.
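For instance, the weighted objective of this example might be computed as in the brief sketch below (the concrete loss functions and weight values are the illustrative choices from the text, not requirements of the disclosure).

```python
import torch.nn.functional as F

def total_loss(pred_pose, pose_label, region_logits, region_label, a=0.9, b=0.1):
    """l = a*l1 + b*l2, with a (first weight) > b (second weight)."""
    l1 = F.l1_loss(pred_pose, pose_label)                # eye pose loss (regression)
    l2 = F.cross_entropy(region_logits, region_label)    # sight line region loss (classification)
    return a * l1 + b * l2
```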
In some alternative implementations, step 2011 may be performed as follows:
and setting the sum of the weights of the sight angle loss function and the pupil depth loss function included in the eye posture loss function as a first weight. The eye posture prediction module comprises a sight angle prediction submodule and a pupil depth prediction submodule, wherein the sight angle loss function and the pupil depth loss function respectively correspond to the sight angle prediction submodule and the pupil depth prediction submodule which are included in the eye posture prediction module.
As an example, assuming that the gaze angle loss function is l11, the pupil depth loss function is l12, the gaze area loss function is l2, the weight of l11 is a1, the weight of l12 is a2, the first weight is a1 + a2, and the second weight is b, the total loss function is l = a1 × l11 + a2 × l12 + b × l2. The gaze angle prediction sub-module and the pupil depth prediction sub-module are substantially the same as those described in the embodiment corresponding to fig. 2, and are not repeated here.
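Continuing the previous sketch, the split objective could look as follows; the weight values are again illustrative assumptions rather than values prescribed by the disclosure.

```python
import torch.nn.functional as F

def total_loss_split(pred_angle, angle_label, pred_depth, depth_label,
                     region_logits, region_label, a1=0.5, a2=0.4, b=0.1):
    """l = a1*l11 + a2*l12 + b*l2, where a1 + a2 plays the role of the first weight."""
    l11 = F.l1_loss(pred_angle, angle_label)             # gaze angle loss
    l12 = F.l1_loss(pred_depth, depth_label)             # pupil depth loss
    l2 = F.cross_entropy(region_logits, region_label)    # sight line region loss
    return a1 * l11 + a2 * l12 + b * l2
```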
According to the implementation mode, the weight is set for the loss function of the sight angle prediction submodule and the pupil depth prediction submodule included in the eye posture prediction module, so that the contribution degree of the sight angle prediction submodule and the pupil depth prediction submodule to the sight area prediction can be adjusted more flexibly, and the pertinence of model training is improved.
In the method provided by the embodiment corresponding to fig. 3, the weight of the eye posture loss function is set to be larger, and the sight line region loss function is set to be smaller, so that the training of the sight line region prediction module is considered when the eye posture prediction module is trained, and the prediction accuracy of the trained sight line region prediction model is improved.
With further reference to fig. 4, a flow diagram of yet another embodiment of a method of generating a gaze region prediction model is shown. As shown in fig. 4, on the basis of the embodiment shown in fig. 2, step 203 may include the following steps:
step 2031, in response to determining that the input sample face image further has a corresponding eye pose tag, setting the weight of the eye pose loss function corresponding to the eye pose prediction module as a third weight, and setting the weight of the sight line area loss function corresponding to the sight line area prediction module as a fourth weight.
Wherein the third weight is less than the fourth weight.
Specifically, the face images in the second face image sample set in this embodiment have corresponding eye pose labels and sight line area labels (which may be provided for some or all of the face images). During training, a smaller third weight (e.g., 0.1) may be set for the eye pose loss function, and a larger fourth weight (e.g., 0.9) may be set for the sight line area loss function. Since the weight of the eye pose loss function is smaller, the update amplitude of the parameters of the eye pose prediction module is smaller; that is, the update of the model parameters is concentrated in the sight line area prediction module.
Step 2032, based on the eye pose tag and the sight line region tag corresponding to the input sample face image, determining a first loss value of the eye pose loss function and a second loss value of the sight line region loss function.
In this embodiment, based on the eye pose loss function, a first loss value characterizing a difference between the predicted eye pose and the actual labeled eye pose may be determined during training. Based on the gaze region loss function, a second loss value characterizing the gap between the predicted gaze region and the actual labeled gaze region may be determined at the time of training.
Step 2033, training the sight-line region prediction module based on the third weight, the fourth weight, and the first loss value and the second loss value.
As an example, assuming that the eye pose loss function is l1, the visual line region loss function is l2, the third weight is c, and the fourth weight is d, the total loss function is l = c × l1 + d × l2. In the training process, parameters of the initial model are adjusted by using a gradient descent method and a back propagation method, so that the total loss value is continuously reduced until convergence.
It should be understood that the eye pose module of this embodiment may also include a gaze angle prediction sub-module and a pupil depth prediction sub-module, and accordingly, the eye pose loss function may include a gaze angle loss function and a pupil depth loss function, and a sum of weights of the gaze angle loss function and the pupil depth loss function may be set as a third weight.
In the method provided by the embodiment corresponding to fig. 4, the weight of the eye pose loss function is set to be small, and the sight line area loss function is set to be large, so that training of the eye pose prediction module can be considered when the sight line area prediction module is trained, and the method is favorable for improving the prediction accuracy of the trained sight line area prediction model. Generally, the embodiments corresponding to fig. 3 and fig. 4 may be combined, that is, the weight of the loss function is adjusted, and the modules included in the initial model are trained at the same time, so that guidance of the sight region prediction error to the optimization target during learning of each module is achieved, the model training efficiency is improved, and the prediction accuracy of the trained sight region prediction model is higher.
Referring to fig. 5, an exemplary schematic diagram of a model training process of the generation method of the sight-line region prediction model according to the present disclosure is shown. In fig. 5, a face image 501 including an eye region is first input to the feature extraction module 5021 of the initial model 502, and eye feature data F1 and face feature data F2 are obtained. Then, F1 and F2 are input to the gaze angle prediction sub-module 5022 and the pupil depth prediction sub-module 5023 as two branches, resulting in gaze angle information O1 and pupil depth information O2. Next, O1 and O2 are input to the sight-line region prediction module 5024, and sight-line region information O3 is obtained. In the training process, the difference between the gaze angle information O1 and the gaze angle label 503 is determined by using a gaze angle loss function L1, the difference between the pupil depth information O2 and the pupil depth label 504 is determined by using a pupil depth loss function L2, the difference between the gaze area information O3 and the gaze area label 505 is determined by using a gaze area loss function L3, parameters of modules included in the initial model are adjusted based on the differences, so that the differences are reduced to be converged, and the final trained initial model is used as a gaze area prediction model.
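The training flow of fig. 5 could correspond to an architecture along the lines of the following sketch; the layer sizes, the number of candidate gaze regions, and all module names apart from their roles are assumptions made for illustration.

```python
import torch
import torch.nn as nn

class GazeRegionModel(nn.Module):
    """Sketch of the initial model 502: feature extraction, two eye pose branches, region head."""

    def __init__(self, num_regions=8):
        super().__init__()
        # Feature extraction module (5021): multi-layer CNN producing a shared representation.
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        self.eye_feat = nn.Linear(64, 64)      # eye feature data F1
        self.face_feat = nn.Linear(64, 64)     # face feature data F2
        self.gaze_angle = nn.Linear(128, 2)    # gaze angle prediction sub-module (5022)
        self.pupil_depth = nn.Linear(128, 1)   # pupil depth prediction sub-module (5023)
        self.gaze_region = nn.Linear(3, num_regions)  # gaze region prediction module (5024)

    def forward(self, face_image):
        shared = self.backbone(face_image)
        f1, f2 = self.eye_feat(shared), self.face_feat(shared)
        fused = torch.cat([f1, f2], dim=1)
        o1 = self.gaze_angle(fused)            # gaze angle information O1
        o2 = self.pupil_depth(fused)           # pupil depth information O2
        o3 = self.gaze_region(torch.cat([o1, o2], dim=1))  # gaze region information O3
        return o1, o2, o3
```

During training, the gaze angle loss L1, the pupil depth loss L2, and the gaze area loss L3 would then be applied to O1, O2, and O3 respectively and combined with the weighting schemes described above.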
With continuing reference to fig. 6, fig. 6 is a flowchart illustrating a gaze area prediction method provided by an exemplary embodiment of the present disclosure. The embodiment can be applied to an electronic device (such as the terminal device 101 or the server 103 shown in fig. 1), and as shown in fig. 6, the method includes the following steps:
step 601, acquiring a face image shot by a target camera.
In this embodiment, the electronic device may acquire the face image taken by the subject camera locally or remotely. Wherein the target camera may be a camera disposed at a specific position within a specific space. For example, the target camera may be disposed inside the vehicle for capturing the face of a person inside the vehicle to obtain a human face image.
Step 602, inputting the face image into a pre-trained sight line region prediction model, and obtaining an eye posture prediction result based on an eye posture prediction module included in the sight line region prediction model.
In this embodiment, the electronic device may input the face image into a pre-trained sight line region prediction model, and obtain an eye pose prediction result based on an eye pose prediction module included in the sight line region prediction model. The sight line region prediction model is obtained by training in advance based on the method described in the embodiment corresponding to fig. 2.
Step 603, inputting the eye posture prediction result into a sight line region prediction module included in the sight line region prediction model to obtain sight line region information corresponding to the face image.
In this embodiment, the electronic device may input the eye pose prediction result into a sight line region prediction module included in the sight line region prediction model, so as to obtain sight line region information corresponding to the face image. The sight line area information is used for representing the area of sight line fixation of a person indicated by the face image, and the sight line area information can comprise a plurality of coordinates which can represent the area in the three-dimensional space. Generally, after obtaining the sight line area information, the sight line area information may be output in various ways. For example, display on a display, or storage in a preset storage area, or transmission to other electronic devices, etc.
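A possible inference sketch under the same assumptions as the architecture sketch above; the output format and region indexing are hypothetical.

```python
import torch

@torch.no_grad()
def predict_gaze_region(model, face_image):
    """face_image: tensor of shape (1, 3, H, W) captured by the target camera."""
    model.eval()
    gaze_angle, pupil_depth, region_logits = model(face_image)
    region_id = int(region_logits.argmax(dim=1))       # index of the predicted gaze region
    return {"gaze_angle": gaze_angle.squeeze(0).tolist(),
            "pupil_depth": float(pupil_depth),
            "gaze_region_id": region_id}
```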
According to the sight line region prediction method provided by the embodiment of the disclosure, the sight line region information can be obtained by inputting the shot face image into the sight line region prediction model trained on the basis of the embodiment corresponding to fig. 2, so that the sight line region prediction is performed by effectively utilizing the model trained end to end, and the accuracy and efficiency of the sight line region prediction are improved.
Exemplary devices
Fig. 7 is a schematic structural diagram of a device for generating a gaze region prediction model according to an exemplary embodiment of the present disclosure. The embodiment can be applied to an electronic device, the gaze region prediction model includes an eye pose prediction module and a gaze region prediction module, and as shown in fig. 7, the generating device of the gaze region prediction model includes: a first training module 701, configured to train an eye pose prediction module by using a first face image sample set with an eye pose tag as an input of an initial model; a first determining module 702, configured to use a second face image sample set with a sight line region label as an input of an initial model, to obtain an eye pose prediction result; a second training module 703, configured to train a sight line region prediction module based on a second face image sample set; and a second determining module 704, configured to determine the initial model after the training is finished as the sight line region prediction model.
In this embodiment, the first training module 701 may train the eye pose prediction module by using the first face image sample set with the eye pose tag as an input of the initial model. The face images in the first face image sample set comprise eye regions. The sight line region prediction model in the present embodiment includes an eye pose prediction module and a sight line region prediction module. The eye pose prediction module is used for determining the eye pose of a person indicated by the input human face image, such as the sight angle, the pupil depth and the like. The sight line region prediction module is used for determining a sight line region of a person indicated by the input face image, namely a region watched by human eyes.
The eye pose tag may be information with which the face images in the first face image sample set are labeled in advance, obtained in advance by an existing eye pose acquisition method. For example, the pupil depth label may be obtained by determining the position of the pupil in three-dimensional space and then computing the distance between the pupil and the camera; the gaze angle label may be obtained by determining the position of the pupil and the position of the gazed viewpoint in three-dimensional space and then computing the angle of the straight line connecting the two points.
The eye pose prediction module may typically be comprised of an artificial neural network, such as a convolutional neural network comprising convolutional layers, pooling layers, fully-connected layers, and the like. The first training module 701 may use a machine learning method to take the face images in the first face image sample set as input of an initial model, where the initial model may first extract features of eye regions in the face images, take the features of the eye regions as input of an eye pose prediction module, take eye pose labels corresponding to the input face images as expected output (i.e., an object for optimizing model parameters), train the initial model, and obtain actual output for each training of the input face images. And the actual output is data actually output by the initial model and is used for representing the eye posture. Then, the first training module 701 may adjust parameters of the eye posture prediction module included in the initial model based on the actual output and the expected output by using a gradient descent method and a back propagation method, take the model obtained after each parameter adjustment as the initial model for the next training, and end the training under the condition that a preset training end condition for the eye posture prediction module is met, thereby obtaining the trained eye posture prediction module. The end-of-training conditions may include, but are not limited to, at least one of the following: the training times reach preset times, and the loss value of the loss function corresponding to the eye posture prediction module reaches a preset loss value.
In this embodiment, the first determining module 702 may use the second face image sample set with the sight line region label as an input of the initial model, and obtain the eye pose prediction result. Specifically, the trained eye posture prediction module can be used for predicting the eye posture of the input face image to obtain an eye posture prediction result. Eye pose results may include, but are not limited to, gaze angle, pupil depth, and the like.
The sight line region label may be obtained by labeling the face image in advance. The line-of-sight region label typically includes a plurality of coordinate information for determining a region in three-dimensional space.
In this embodiment, the second training module 703 may train the sight-line region prediction module based on the second face image sample set. The sight line area prediction module is used for obtaining sight line area information representing a fixation area of human eyes according to the eye posture prediction result.
In particular, the gaze area prediction module may generally consist of an artificial neural network, for example a convolutional neural network comprising convolutional layers, pooling layers, fully-connected layers, etc., which may classify the three-dimensional space, determining which areas are the areas at which the human eye is gazing.
The second training module 703 may use a machine learning method to input the face images in the second face image sample set as an initial model, obtain an eye pose prediction result through processing by the eye pose prediction module, use the eye pose prediction result as an input of the sight line region prediction module, use a sight line region label corresponding to the input face image as an expected output (i.e., an object for optimizing model parameters), train the initial model, and obtain an actual output for the face image input for each training. Wherein the actual output is data actually output by the initial model and used for representing the sight line area. Then, the second training module 703 may adjust parameters of the sight line region prediction module included in the initial model based on the actual output and the expected output by using a gradient descent method and a back propagation method, take the model obtained after each parameter adjustment as the initial model for the next training, and end the training under the condition that a preset training end condition for the sight line region prediction module is satisfied, thereby obtaining the trained sight line region prediction module. The end-of-training conditions may include, but are not limited to, at least one of the following: the training times reach preset times, and the loss value of the loss function corresponding to the sight line region prediction module reaches a preset loss value.
In this embodiment, the second determining module 704 may determine the initial model after the training is finished as the sight line region prediction model. The sight line region prediction model may receive a face image including an eye region and output sight line region information.
Referring to fig. 8, fig. 8 is a schematic structural diagram of a device for generating a gaze region prediction model according to another exemplary embodiment of the present disclosure.
In some optional implementations, the first training module 701 may include: a first training unit 7011, configured to train, in response to determining that the eye pose tag corresponding to the input human face image includes a gaze angle tag, a gaze angle prediction submodule included in the eye pose prediction module based on the gaze angle tag; a second training unit 7012, configured to train, in response to determining that the eye pose tag corresponding to the input human face image includes a pupil depth tag, a pupil depth prediction submodule included in the eye pose prediction module based on the pupil depth tag.
In some optional implementations, the first training module 701 may include: an extracting unit 7013, configured to input the face image included in the first face image sample set into the feature extracting module included in the initial model, so as to obtain eye feature data and face feature data; a third training unit 7014 is configured to train the eye pose prediction module using the eye feature data and the face feature data based on the eye pose tag.
In some optional implementations, the first training module 701 may include: a first setting unit 7015, configured to, in response to determining that the input sample face image further has a corresponding gaze region tag, set a weight of an eye pose loss function corresponding to the eye pose prediction module as a first weight, and set a weight of a gaze region loss function corresponding to the gaze region prediction module as a second weight, where the first weight is greater than the second weight; a first determining unit 7016, configured to determine a first loss value of the eye pose loss function and a second loss value of the sight line area loss function based on the eye pose tag and the sight line area tag corresponding to the input sample face image; a fourth training unit 7017, configured to train the eye pose prediction module based on the first weight, the second weight, and the first loss value and the second loss value.
In some optional implementations, the first setting unit 7015 may be further configured to set the sum of the weights of the sight angle loss function and the pupil depth loss function included in the eye pose loss function to the first weight, where the sight angle loss function and the pupil depth loss function correspond to the sight angle prediction submodule and the pupil depth prediction submodule included in the eye pose prediction module, respectively.
In some optional implementations, the second training module 703 may include: a second setting unit 7031, configured to, in response to determining that the input sample face image further has a corresponding eye pose label, set the weight of the eye pose loss function corresponding to the eye pose prediction module to a third weight and set the weight of the sight line region loss function corresponding to the sight line region prediction module to a fourth weight, where the third weight is smaller than the fourth weight; a second determining unit 7032, configured to determine a first loss value of the eye pose loss function and a second loss value of the sight line region loss function based on the eye pose label and the sight line region label corresponding to the input sample face image; and a fifth training unit 7033, configured to train the sight line region prediction module based on the third weight, the fourth weight, the first loss value, and the second loss value.
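In other words, the two stages differ mainly in which loss dominates; a tiny self-contained sketch of such a stage-dependent weight schedule, with assumed values of 0.8 and 0.2, is:

```python
def stage_weights(stage):
    # First stage: the eye pose loss dominates (first weight > second weight).
    if stage == "eye_pose":
        return {"pose": 0.8, "region": 0.2}
    # Second stage: the sight line region loss dominates (third weight < fourth weight).
    return {"pose": 0.2, "region": 0.8}
```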
In some optional implementations, the first face image sample set includes fewer face images than the second face image sample set.
According to the apparatus for generating a sight line region prediction model provided by the embodiment of the present disclosure, the eye pose prediction module included in the initial model is first trained using the first face image sample set, the sight line region prediction module included in the initial model is then trained using the second face image sample set together with the eye pose prediction result output by the eye pose prediction module, and the trained initial model is finally determined as the sight line region prediction model.
Fig. 9 is a schematic structural diagram of a sight line region prediction apparatus according to an exemplary embodiment of the present disclosure. The embodiment can be applied to an electronic device. As shown in fig. 9, the sight line region prediction apparatus includes: an obtaining module 901, configured to obtain a face image captured by a target camera; a first prediction module 902, configured to input the face image into a pre-trained sight line region prediction model and obtain an eye pose prediction result based on an eye pose prediction module included in the sight line region prediction model, where the sight line region prediction model is trained in advance based on the above method for generating a sight line region prediction model; and a second prediction module 903, configured to input the eye pose prediction result into a sight line region prediction module included in the sight line region prediction model to obtain sight line region information corresponding to the face image.
In this embodiment, the obtaining module 901 may obtain, locally or remotely, the face image captured by the target camera. The target camera may be a camera disposed at a specific position within a specific space. For example, the target camera may be disposed inside a vehicle and used to capture the face of a person inside the vehicle to obtain a face image.
In this embodiment, the first prediction module 902 may input the face image into a pre-trained sight line region prediction model, and obtain an eye pose prediction result based on an eye pose prediction module included in the sight line region prediction model. The sight line region prediction model is obtained by training in advance based on the method described in the embodiment corresponding to fig. 2.
In this embodiment, the second prediction module 903 may input the eye pose prediction result into the sight line region prediction module included in the sight line region prediction model to obtain the sight line region information corresponding to the face image. The sight line region information represents the region gazed at by the person indicated by the face image and may include a plurality of coordinates representing that region in three-dimensional space. Generally, after the sight line region information is obtained, it may be output in various ways, for example, displayed on a display, stored in a preset storage area, or sent to another electronic device.
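Put together, the prediction flow of the obtaining module 901, the first prediction module 902, and the second prediction module 903 could look like the following sketch; the attribute names on the model object are the same assumptions used in the earlier training sketch, not the disclosed API:

```python
import torch


@torch.no_grad()
def predict_gaze_region(model, face_image):
    feats = model.feature_extractor(face_image)   # eye and face feature data
    eye_pose = model.eye_pose_module(feats)       # eye pose prediction result
    region_info = model.region_module(eye_pose)   # sight line region information
    return region_info
```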
According to the sight line region prediction apparatus provided by the embodiment of the present disclosure, the sight line region information can be obtained by inputting the captured face image into the sight line region prediction model trained based on the embodiment corresponding to fig. 2, so that sight line region prediction is performed by effectively utilizing the end-to-end trained model, which improves the accuracy and efficiency of sight line region prediction.
Exemplary electronic device
Next, an electronic apparatus according to an embodiment of the present disclosure is described with reference to fig. 10. The electronic device may be either or both of the terminal device 101 and the server 103 as shown in fig. 1, or a stand-alone device separate from them, which may communicate with the terminal device 101 and the server 103 to receive the collected input signals therefrom.
FIG. 10 illustrates a block diagram of an electronic device in accordance with an embodiment of the disclosure.
As shown in fig. 10, the electronic device 1000 includes one or more processors 1001 and memory 1002.
The processor 1001 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device 1000 to perform desired functions.
Memory 1002 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. The volatile memory may include, for example, Random Access Memory (RAM) and/or cache memory. The non-volatile memory may include, for example, Read-Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 1001 to implement the method for generating a sight line region prediction model or the sight line region prediction method of the various embodiments of the present disclosure and/or other desired functions. Various contents such as a face image and a sight line region prediction model may also be stored in the computer-readable storage medium.
In one example, the electronic device 1000 may further include: an input device 1003 and an output device 1004, which are interconnected by a bus system and/or other form of connection mechanism (not shown).
For example, when the electronic device is the terminal device 101 or the server 103, the input device 1003 may be a camera, a mouse, a keyboard, or the like, and is used for inputting a face image. When the electronic device is a stand-alone device, the input device 1003 may be a communication network connector for receiving the input face image from the terminal device 101 and the server 103.
The output device 1004 may output various information including predicted line-of-sight region information to the outside. The output devices 1004 may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device 1000 relevant to the present disclosure are shown in fig. 10, omitting components such as buses, input/output interfaces, and the like. In addition, the electronic device 1000 may include any other suitable components depending on the particular application.
Exemplary computer program product and computer-readable storage medium
In addition to the above methods and apparatus, embodiments of the present disclosure may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in the method of generating a gaze area prediction model or the gaze area prediction method according to various embodiments of the present disclosure described in the "exemplary methods" section of this specification above.
The computer program product may include program code for carrying out operations of embodiments of the present disclosure, written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++ and conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's computing device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present disclosure may also be a computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, cause the processor to perform the steps in the method for generating a gaze area prediction model or the gaze area prediction method according to various embodiments of the present disclosure described in the "exemplary methods" section of this specification above.
The computer-readable storage medium may take any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present disclosure in conjunction with specific embodiments. However, it is noted that the advantages, effects, and the like mentioned in the present disclosure are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present disclosure. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description only and is not intended to limit the disclosure to the specific details so described.
In the present specification, the embodiments are described in a progressive manner, each embodiment focuses on differences from other embodiments, and the same or similar parts in the embodiments are referred to each other. For the system embodiment, since it basically corresponds to the method embodiment, the description is relatively simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The block diagrams of devices, apparatuses, and systems referred to in this disclosure are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, or configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, or configured in any manner, as will be appreciated by those skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "and" as used herein mean, and are used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as, but not limited to."
The methods and apparatus of the present disclosure may be implemented in a number of ways. For example, the methods and apparatus of the present disclosure may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present disclosure are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present disclosure may also be embodied as programs recorded in a recording medium, the programs including machine-readable instructions for implementing the methods according to the present disclosure. Thus, the present disclosure also covers a recording medium storing a program for executing the method according to the present disclosure.
It is also noted that in the devices, apparatuses, and methods of the present disclosure, each component or step can be decomposed and/or recombined. These decompositions and/or recombinations are to be considered equivalents of the present disclosure.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present disclosure. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the disclosure. Thus, the present disclosure is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit embodiments of the disclosure to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (10)

1. A method of generating a gaze region prediction model, the gaze region prediction model comprising an eye pose prediction module and a gaze region prediction module, the method comprising:
taking a first face image sample set with an eye posture label as an input of an initial model, and training the eye posture prediction module;
taking a second face image sample set with a sight line area label as the input of the initial model to obtain an eye posture prediction result;
training the sight line region prediction module based on the second face image sample set;
and determining the initial model after training as a sight line area prediction model.
2. The method of claim 1, wherein said training the eye pose prediction module using the first sample set of face images with eye pose labels as input to an initial model comprises:
in response to determining that the eye posture label corresponding to the input face image comprises a sight angle label, training a sight angle prediction submodule comprised by the eye posture prediction module based on the sight angle label;
in response to determining that the eye posture label corresponding to the input face image comprises a pupil depth label, training a pupil depth prediction sub-module comprised by the eye posture prediction module based on the pupil depth label.
3. The method of claim 1, wherein said training the eye pose prediction module using the first sample set of face images with eye pose labels as input to an initial model comprises:
inputting the face images included in the first face image sample set into a feature extraction module included in the initial model to obtain eye feature data and face feature data;
and training the eye posture prediction module by using the eye characteristic data and the human face characteristic data based on the eye posture label.
4. The method of claim 1, wherein the training the eye pose prediction module comprises:
in response to determining that the input sample face image further has a corresponding sight line region label, setting a weight of an eye pose loss function corresponding to the eye pose prediction module as a first weight and setting a weight of a sight line region loss function corresponding to the sight line region prediction module as a second weight, wherein the first weight is greater than the second weight;
determining a first loss value of the eye posture loss function and a second loss value of the sight line region loss function based on an eye posture label and a sight line region label corresponding to an input sample face image;
training the eye pose prediction module based on the first weight, the second weight, and the first loss value and the second loss value.
5. The method of claim 1, wherein the training the gaze region prediction module comprises:
in response to determining that the input sample face image further has a corresponding eye pose tag, setting a weight of an eye pose loss function corresponding to the eye pose prediction module as a third weight and setting a weight of a sight line area loss function corresponding to the sight line area prediction module as a fourth weight, wherein the third weight is smaller than the fourth weight;
determining a first loss value of the eye posture loss function and a second loss value of the sight line region loss function based on an eye posture label and a sight line region label corresponding to an input sample face image;
training the gaze region prediction module based on the third weight, the fourth weight, and the first loss value and the second loss value.
6. A gaze region prediction method, comprising:
acquiring a face image shot by a target camera;
inputting the face image into a pre-trained sight line region prediction model, and obtaining an eye posture prediction result based on an eye posture prediction module included in the sight line region prediction model, wherein the sight line region prediction model is obtained by pre-training based on the method of one of claims 1 to 5;
and inputting the eye posture prediction result into a sight line region prediction module included in the sight line region prediction model to obtain sight line region information corresponding to the face image.
7. An apparatus for generating a gaze region prediction model including an eye pose prediction module and a gaze region prediction module, the apparatus comprising:
the first training module is used for taking a first face image sample set with an eye posture label as the input of an initial model and training the eye posture prediction module;
the first determining module is used for taking a second face image sample set with a sight line area label as the input of the initial model to obtain an eye posture prediction result;
the second training module is used for training the sight line region prediction module based on the second face image sample set;
and the second determining module is used for determining the initial model after the training is finished as the sight line area prediction model.
8. A sight-line region prediction apparatus comprising:
the acquisition module is used for acquiring a face image shot by the target camera;
a first prediction module, configured to input the face image into a pre-trained sight line region prediction model, and obtain an eye pose prediction result based on an eye pose prediction module included in the sight line region prediction model, where the sight line region prediction model is obtained by being trained in advance based on the method according to any one of claims 1 to 5;
and the second prediction module is used for inputting the eye posture prediction result into the sight line region prediction module included in the sight line region prediction model to obtain sight line region information corresponding to the face image.
9. A computer-readable storage medium, the storage medium storing a computer program for performing the method of any of the preceding claims 1-6.
10. An electronic device, the electronic device comprising:
a processor;
a memory for storing the processor-executable instructions;
the processor is configured to read the executable instructions from the memory and execute the instructions to implement the method of any one of claims 1-6.
CN202011244436.7A 2020-11-10 2020-11-10 Sight line area prediction model generation method and device, storage medium and electronic equipment Pending CN112308006A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011244436.7A CN112308006A (en) 2020-11-10 2020-11-10 Sight line area prediction model generation method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN112308006A true CN112308006A (en) 2021-02-02

Family

ID=74325547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011244436.7A Pending CN112308006A (en) 2020-11-10 2020-11-10 Sight line area prediction model generation method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN112308006A (en)



Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20190063582A (en) * 2017-11-30 2019-06-10 재단법인대구경북과학기술원 Method for Estimating Driver's Gaze Zone by Transfer Learning
WO2019147677A1 (en) * 2018-01-24 2019-08-01 Kaban Technologies Llc Event camera-based gaze tracking using neural networks
CN109835260A (en) * 2019-03-07 2019-06-04 百度在线网络技术(北京)有限公司 A kind of information of vehicles display methods, device, terminal and storage medium
CN111709264A (en) * 2019-03-18 2020-09-25 北京市商汤科技开发有限公司 Driver attention monitoring method and device and electronic equipment
CN111723828A (en) * 2019-03-18 2020-09-29 北京市商汤科技开发有限公司 Watching region detection method and device and electronic equipment

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113506328A (en) * 2021-07-16 2021-10-15 北京地平线信息技术有限公司 Method and device for generating sight line estimation model and method and device for estimating sight line
CN113762393A (en) * 2021-09-08 2021-12-07 杭州网易智企科技有限公司 Model training method, gaze point detection method, medium, device, and computing device
CN113762393B (en) * 2021-09-08 2024-04-30 杭州网易智企科技有限公司 Model training method, gaze point detection method, medium, device and computing equipment
CN114449162A (en) * 2021-12-22 2022-05-06 天翼云科技有限公司 Method and device for playing panoramic video, computer equipment and storage medium
CN114449162B (en) * 2021-12-22 2024-04-30 天翼云科技有限公司 Method, device, computer equipment and storage medium for playing panoramic video
CN115390678A (en) * 2022-10-27 2022-11-25 科大讯飞股份有限公司 Virtual human interaction method and device, electronic equipment and storage medium
CN116030247A (en) * 2023-03-20 2023-04-28 之江实验室 Medical image sample generation method and device, storage medium and electronic equipment


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination