WO2020186867A1 - Method and apparatus for detecting gaze area and electronic device - Google Patents

Method and apparatus for detecting gaze area and electronic device

Info

Publication number
WO2020186867A1
WO2020186867A1 (PCT/CN2019/127833)
Authority
WO
WIPO (PCT)
Prior art keywords
gaze
face image
area
line
information
Prior art date
Application number
PCT/CN2019/127833
Other languages
French (fr)
Chinese (zh)
Inventor
黄诗尧 (Huang Shiyao)
王飞 (Wang Fei)
钱晨 (Qian Chen)
Original Assignee
北京市商汤科技开发有限公司 (Beijing SenseTime Technology Development Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 北京市商汤科技开发有限公司
Priority to JP2021540793A (JP7244655B2)
Priority to KR1020217022187A (KR20210104107A)
Publication of WO2020186867A1

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W40/00 - Estimation or calculation of non-directly measurable driving parameters for road vehicle drive control systems not related to the control of a particular sub unit, e.g. by using mathematical models
    • B60W40/08 - Estimation or calculation of such parameters related to drivers or passengers
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/24 - Classification techniques
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 - Machine learning
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/10 - Segmentation; Edge detection
    • G06T7/11 - Region-based segmentation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/774 - Generating sets of training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/59 - Context or environment of the image inside of a vehicle, e.g. relating to seat occupancy, driver state or inner lighting conditions
    • G06V20/597 - Recognising the driver's state or behaviour, e.g. attention or drowsiness
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161 - Detection; Localisation; Normalisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16 - Human faces, e.g. facial parts, sketches or expressions
    • G06V40/172 - Classification, e.g. identification
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 - Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 - Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 - Eye characteristics, e.g. of the iris
    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B60 - VEHICLES IN GENERAL
    • B60W - CONJOINT CONTROL OF VEHICLE SUB-UNITS OF DIFFERENT TYPE OR DIFFERENT FUNCTION; CONTROL SYSTEMS SPECIALLY ADAPTED FOR HYBRID VEHICLES; ROAD VEHICLE DRIVE CONTROL SYSTEMS FOR PURPOSES NOT RELATED TO THE CONTROL OF A PARTICULAR SUB-UNIT
    • B60W2540/00 - Input parameters relating to occupants
    • B60W2540/225 - Direction of gaze

Definitions

  • the present disclosure relates to the field of computer vision technology, and in particular to a method, device and electronic equipment for detecting a gaze area.
  • Gaze area detection can play an important role in applications such as intelligent driving, human-computer interaction, and security monitoring.
  • In human-computer interaction, by determining the three-dimensional position of the human eye in space, combined with the three-dimensional sight-line direction, the position of the human gaze point in three-dimensional space can be obtained and output to the machine for further interactive processing.
  • In attention detection, by estimating the gaze direction of the human eye, the person's gaze direction can be judged and the person's area of interest obtained, and it can then be judged whether the person's attention is concentrated.
  • According to a first aspect, a gaze area detection method is provided, comprising: acquiring a face image collected in a predetermined three-dimensional space; performing sight-line detection based on the face image to obtain a sight-line detection result; and using a gaze area classifier trained in advance for the predetermined three-dimensional space to detect, according to the sight-line detection result, the category of the target gaze area corresponding to the face image, wherein the target gaze area belongs to one of multiple classes of defined gaze areas obtained by dividing the predetermined three-dimensional space in advance.
  • According to a second aspect, a gaze area detection device is provided, comprising: an image acquisition module for acquiring a face image collected in a predetermined three-dimensional space; a sight-line detection module for performing sight-line detection based on the face image to obtain a sight-line detection result; and a gaze area detection module configured to use a gaze area classifier trained in advance for the predetermined three-dimensional space to detect, according to the sight-line detection result, the category of the target gaze area corresponding to the face image, wherein the target gaze area belongs to one of multiple classes of defined gaze areas obtained by dividing the predetermined three-dimensional space in advance.
  • A computer-readable storage medium is provided, having a computer program stored thereon; when the computer program is executed by a processor, the processor implements the method according to the above-mentioned first aspect.
  • An electronic device is provided, including a memory and a processor; the memory stores a computer program, and the processor implements the method according to the above-mentioned first aspect when executing the computer program.
  • For changes in the predetermined three-dimensional space, only the corresponding gaze area classifiers need to be trained for the different three-dimensional spaces. Since training the classifier does not require a large amount of data and is relatively fast, this can significantly reduce the time cost and technical difficulty of migrating the gaze area detection method between different three-dimensional spaces (such as the spaces of different car models).
  • Fig. 1 is a flowchart of a gaze area detection method according to an exemplary embodiment of the present disclosure
  • FIG. 2 is a flowchart of a method for training a gaze area classifier for a predetermined three-dimensional space in real time according to an exemplary embodiment of the present disclosure
  • FIG. 3 is a schematic diagram of multiple types of defined gaze regions according to an exemplary embodiment of the present disclosure.
  • FIG. 4 is a flowchart of a method for determining starting point information of a person's line of sight in a face image according to an exemplary embodiment of the present disclosure
  • FIG. 5 is a flowchart of a method for detecting line-of-sight direction information of a person in a face image according to an exemplary embodiment of the present disclosure
  • FIG. 6 is a flowchart of a method for detecting head posture information of a person in a face image according to an exemplary embodiment of the present disclosure
  • Fig. 7 is a flowchart of a method for detecting line-of-sight direction information of a person in a face image based on head posture information according to an exemplary embodiment of the present disclosure
  • FIG. 8A is a flowchart of a method for normalizing a face image to obtain a normalized face image according to an exemplary embodiment of the present disclosure
  • Fig. 8B is a schematic diagram of normalizing an acquired face image according to an exemplary embodiment of the present disclosure.
  • FIG. 9A is a schematic diagram of a classifier outputting a target gaze area category according to an exemplary embodiment of the present disclosure
  • FIG. 9B is a schematic diagram of the classifier outputting the name of the target gaze area according to an exemplary embodiment of the present disclosure.
  • FIG. 10 is a flowchart of a method of training a neural network for detecting a three-dimensional line of sight direction according to an exemplary embodiment of the present disclosure
  • Fig. 11 is a block diagram of a gaze area detecting device according to an exemplary embodiment of the present disclosure.
  • FIG. 12 is a block diagram of a line of sight detection module of the gaze area detection device according to an exemplary embodiment of the present disclosure
  • FIG. 13 is a block diagram of another line of sight detection module of the gaze area detection device according to an exemplary embodiment of the present disclosure
  • FIG. 14 is a block diagram of the eye position detection sub-module in FIGS. 12 and 13 according to an exemplary embodiment of the present disclosure
  • FIG. 15 is a block diagram of another line of sight detection module of the gaze area detection device according to an exemplary embodiment of the present disclosure.
  • FIG. 16 is a block diagram of a posture detection sub-module of the sight line detection module in FIG. 15 according to an exemplary embodiment of the present disclosure
  • FIG. 17 is a block diagram of a direction detection sub-module of the sight line detection module in FIG. 15 according to an exemplary embodiment of the present disclosure
  • FIG. 18 is a block diagram of an image processing unit of the direction detection sub-module in FIG. 17 according to an exemplary embodiment of the present disclosure
  • FIG. 19 is a block diagram of another gaze area detection device according to an exemplary embodiment of the present disclosure.
  • FIG. 20 is a block diagram of another gaze area detecting device according to an exemplary embodiment of the present disclosure.
  • FIG. 21 is a block diagram of another gaze area detection device according to an exemplary embodiment of the present disclosure.
  • FIG. 22 is a block diagram of another gaze area detection device according to an exemplary embodiment of the present disclosure.
  • FIG. 23 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
  • The terms "first", "second", "third", etc. may be used in this disclosure to describe various information, but the information should not be limited by these terms; these terms are only used to distinguish information of the same type from each other.
  • For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information.
  • The word "if" as used herein can be interpreted as "when", "while", or "in response to".
  • the present disclosure provides a gaze area detection method, which can be applied to scenarios such as intelligent driving, human-computer interaction, and security monitoring. This disclosure will take the gaze area detection method applied to an intelligent driving scene as an example for detailed description.
  • The execution subject involved may include: a computer system and a camera arranged in a predetermined three-dimensional space.
  • the camera set in the predetermined three-dimensional space can send the collected face image data of the user to the aforementioned computer system.
  • The computer system can use an artificial neural network to process the above face image data and detect which part of the predetermined three-dimensional space the user's attention is focused on, that is, detect the user's target gaze area, so that the computer system can output corresponding operation control information, such as instructions for an intelligent driving vehicle, according to the user's target gaze area.
  • The above-mentioned computer system may be deployed in a server, a server cluster, or a cloud platform, or may be the computer system of an electronic device such as a personal computer, a vehicle-mounted device, or a mobile terminal.
  • the aforementioned camera may be a vehicle-mounted device such as a camera in a driving recorder, a camera of a smart terminal, and the like.
  • the above-mentioned smart terminal may include electronic devices such as smart phones, PDAs (Personal Digital Assistants), tablet computers, and vehicle-mounted devices.
  • the camera and the computer system can be independent of each other, while being connected to each other to jointly implement the gaze area detection method provided by the embodiments of the present disclosure.
  • the following uses a computer system as an example to describe in detail the gaze area detection method provided by the present disclosure.
  • Fig. 1 is a flowchart of a gaze area detection method according to an exemplary embodiment of the present disclosure.
  • the method can be executed by a computer system and can be applied to various smart devices (for example, smart vehicles, smart robots, smart home devices, etc.). As shown in Figure 1, the method may include steps 11-13.
  • In step 11, a face image collected in a predetermined three-dimensional space is acquired.
  • the predetermined three-dimensional space is the space of the vehicle.
  • a camera can be fixedly installed in the internal space of the vehicle such as the center console.
  • The camera can collect the face image of the target object, such as the driver, in real time or at a preset period and provide it to the computer system, so that the computer system obtains the collected face image.
  • In step 12, sight-line detection is performed based on the face image to obtain a sight-line detection result.
  • the computer system can perform the line of sight detection of the human eye based on the aforementioned face image, and obtain the line of sight detection result.
  • the line of sight detection is based on analyzing the position and/or direction of the line of sight of the human eye in the face image to obtain the line of sight detection result.
  • the present disclosure does not limit the method of detecting the human eye. That is, the method mentioned in the embodiment of the present disclosure may be used to detect the human eye, or other traditional methods may be used to detect the human eye.
  • the above-mentioned line of sight detection result may include the starting point information and the line of sight direction information of the person in the face image, and may also include information such as the head posture of the person in the face image.
  • In step 13, a gaze area classifier that has been trained in advance for the predetermined three-dimensional space is used to detect the category of the target gaze area corresponding to the face image according to the sight-line detection result.
  • The target gaze area belongs to one of multiple classes of defined gaze areas obtained by dividing the predetermined three-dimensional space in advance. For example, each area that the driver may look at while the vehicle is traveling, such as the front windshield, the rear-view mirror, or other areas in the vehicle, can be used as a defined gaze area.
  • For example, the computer system can input the sight-line detection result into the gaze area classifier pre-trained for a model-M intelligent driving vehicle, thereby detecting the category of the target gaze area corresponding to the above face image, that is, detecting which area of the vehicle the person in the face image, such as the driver, was looking at when the image was collected.
  • The above gaze area classifier for the predetermined three-dimensional space is pre-trained by the computer system based on a training sample set for the predetermined three-dimensional space. The training sample set includes a plurality of gaze feature samples, each of which includes gaze starting point information, gaze direction information, and annotation information of the gaze area category corresponding to the sample; the annotated gaze area category belongs to one of the multiple classes of defined gaze areas into which the predetermined three-dimensional space is divided.
  • In the embodiments of the present disclosure, the three-dimensional space regions that the human eye may pay attention to in the predetermined three-dimensional space are finely divided to obtain multiple classes of defined gaze areas, and a classifier is trained on the training sample set corresponding to each class of defined gaze area to obtain the gaze area classifier for the predetermined three-dimensional space.
  • Subsequently, the gaze area classifier can accurately detect the target gaze area based on the sight-line detection result; the computation is simple and the misjudgment rate of the target gaze area is effectively reduced, thereby providing more accurate information for subsequent operations. A minimal sketch of the classification stage is shown below.
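  • The following is a minimal sketch of step 13 of this pipeline; the 6-D feature layout (a 3-D gaze start point concatenated with a 3-D gaze direction) and the classifier's predict interface are assumptions for illustration, not the patent's fixed format.

```python
import numpy as np

def classify_gaze_area(gaze_start, gaze_dir, classifier):
    """Step 13: map the sight-line detection result (step 12) to one of
    the defined gaze areas of the predetermined three-dimensional space."""
    feature = np.concatenate([np.asarray(gaze_start, dtype=float),
                              np.asarray(gaze_dir, dtype=float)])  # shape (6,)
    # The classifier was trained in advance for this specific space.
    return int(classifier.predict(feature[None, :])[0])
```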
  • It should be noted that the sight-line detection stage corresponding to step 12 is independent of the distribution of the multiple classes of defined gaze areas in the predetermined three-dimensional space, whereas the gaze area detection stage corresponding to step 13 is related to that distribution.
  • For example, the overall space of different vehicle models may differ in size, and the location of the same type of area, such as the glove box, may differ between vehicle spaces; the division into multiple classes of defined gaze areas may therefore also differ between three-dimensional spaces, for example in the number and types of defined gaze areas. Consequently, different gaze area classifiers need to be trained for different three-dimensional spaces; for example, different gaze area classifiers are trained for model-M and model-N cars with different spatial distributions.
  • In contrast, the same sight-line detection method can be used for different vehicle models; only the gaze area classifier needs to be retrained when changing models.
  • Training the gaze area classifier is relatively simple, requires less data, and is fast, so it can significantly reduce the time cost and technical difficulty of migrating the above gaze area detection method between different vehicle models.
  • the above-mentioned gaze area detection method may further include: before step 11, obtaining a gaze area classifier that has been trained for the predetermined three-dimensional space.
  • The following manner 1 or manner 2 may be used to obtain the gaze area classifier trained for the predetermined three-dimensional space.
  • the first way is to train a gaze area classifier for a predetermined three-dimensional space in real time when gaze area detection is required.
  • As shown in FIG. 2, training a gaze area classifier for a predetermined three-dimensional space in real time may include: step 101, inputting the gaze starting point information and gaze direction information of at least one gaze feature sample into the gaze area classifier to be trained to obtain the gaze area category prediction information corresponding to the gaze feature sample; and step 102, adjusting the parameters of the gaze area classifier according to the deviation between the gaze area category prediction information and the annotation information of the gaze area category corresponding to the gaze feature sample, so as to train the gaze area classifier.
  • the aforementioned predetermined three-dimensional space may be the space of a certain model of vehicle.
  • First, determine the fixed position of the camera used to collect face images; for example, fix the camera at the center console to collect the face image of the driver in the driving area. The face images used in the subsequent classifier training phase and detection phase are all collected by the above camera at this fixed position.
  • Then, the gaze areas are divided over different parts of the vehicle, mainly according to the areas the driver needs to pay attention to while driving: multiple classes of defined gaze areas are divided within the vehicle space, and category information is defined separately for each class of defined gaze area.
  • the multiple types of defined gaze areas obtained by dividing the vehicle space may include at least two of the following: left front windshield area, right front windshield area, instrument panel area, interior rearview mirror area, Center console area, left rearview mirror area, right rearview mirror area, visor area, shift lever area, under the steering wheel, co-pilot area, glove box area in front of the co-pilot.
  • FIG. 3 is a schematic diagram of multiple types of defined gaze areas according to an exemplary embodiment of the present disclosure.
  • the following multiple types of defined gaze areas can be determined: left front windshield, right front windshield, instrument panel, interior rearview mirror, center console, left rearview mirror, right rearview mirror, Sun visor, shift lever, mobile phone.
  • Corresponding category information can be preset for each type of defined gaze area, such as a category value represented by a number. The corresponding relationship between the multiple types of defined gaze areas and the preset category values can be shown in Table 1:
  • The category information can also be represented by preset letters, such as A, B, C, ..., J. An illustrative encoding is sketched below.
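  • The following dictionary is one possible encoding of such a correspondence, assuming the ten areas listed above are numbered 1 to 10 in order; the exact numbering of the patent's Table 1 is not reproduced here.

```python
# Illustrative category values for the defined gaze areas (assumed order).
GAZE_AREA_CATEGORIES = {
    1: "left front windshield",
    2: "right front windshield",
    3: "instrument panel",
    4: "interior rearview mirror",
    5: "center console",
    6: "left rearview mirror",
    7: "right rearview mirror",
    8: "sun visor",
    9: "shift lever",
    10: "mobile phone",
}
```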
  • The training sample set may include a plurality of gaze feature samples, where each gaze feature sample includes gaze starting point information, gaze direction information, and label information of the gaze area category corresponding to the sample; the labeled gaze area category belongs to one of the multiple classes of defined gaze areas divided for the predetermined three-dimensional space. How to determine the person's sight-line starting point information and sight-line direction information based on the face image will be described in detail later.
  • The following steps are performed iteratively to train the classifier for the predetermined three-dimensional space: the gaze starting point information and gaze direction information of a gaze feature sample in the training sample set are input into the gaze area classifier to be trained to obtain the gaze area category prediction information corresponding to that sample; the parameters of the gaze area classifier are then adjusted according to the deviation between the gaze area category prediction information and the gaze area category label information of the sample, so as to train the gaze area classifier.
  • The foregoing step 102 may include: obtaining a loss function value according to the difference between the predicted gaze area category and the labeled gaze area category of the same gaze feature sample; when the loss function value meets the preset training termination condition, terminating the training and taking the classifier at the current training stage as the trained classifier; otherwise, if the loss function value does not meet the preset training termination condition, adjusting the parameters of the gaze area classifier based on the loss function value.
  • the loss function is a mathematical expression used to measure the degree of misclassification of training samples by the classifier model during the training process.
  • The loss function value can be computed over the entire training sample set. The larger the loss function value, the greater the probability that the classifier at the current training stage misclassifies; conversely, the smaller the loss function value, the smaller that probability.
  • the aforementioned preset training termination condition is a condition for terminating the training of the gaze area classifier.
  • the foregoing preset training termination condition may be: the loss function value of the preset loss function is less than the preset threshold.
  • Ideally, the aforementioned preset training termination condition is that the loss function value equals 0, which means that all gaze area categories predicted by the current classifier are correct.
  • the above-mentioned preset threshold may be a preset empirical value.
  • Otherwise, the above loss function value can be used to adjust the relevant parameters of the gaze area classifier. The gaze area classifier with updated parameters is then used to iteratively execute steps 101 and 102 until the preset training termination condition is met, yielding the gaze area classifier trained for the predetermined three-dimensional space.
  • the computer system may use algorithms such as support vector machines, naive Bayes, decision trees, random forests, and K-means to train the above-mentioned gaze area classifier.
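  • A minimal training sketch using one of the listed algorithms (a random forest) is given below; the placeholder data, the 6-D feature layout, and the train/test split are illustrative assumptions, not the patent's training protocol.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 6))       # placeholder gaze features: start point + direction
y = rng.integers(1, 11, size=1000)   # placeholder area labels (categories 1-10)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)
clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```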
  • Because the training of the classifier does not require a large amount of data and the training speed is relatively fast, it can significantly reduce the time cost and technical difficulty of migrating the gaze area detection method between different three-dimensional spaces (such as the spaces of different car models).
  • In manner 2, the computer system may store the gaze area classifier trained for each predetermined three-dimensional space, in association with the space identifier of that predetermined three-dimensional space, in a designated storage resource, such as a cloud server, to form a preset gaze area classifier set.
  • the above-mentioned preset gaze area classifier set may include the correspondence between multiple vehicle models and gaze area classifiers, as shown in Table 2:
  • Before performing gaze area detection, the vehicle can automatically download the corresponding target gaze area classifier program, for example the computer program corresponding to the above-mentioned first classifier, from the cloud server according to its own model (for example, M01), so as to quickly realize gaze area detection. A lookup sketch follows.
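  • The following is a minimal sketch of such a classifier-set lookup; the model identifiers, file paths, and the use of joblib serialization are assumptions for illustration.

```python
import joblib  # commonly used to serialize scikit-learn models

# Hypothetical mapping from vehicle-model identifier to a stored classifier.
CLASSIFIER_SET = {
    "M01": "classifiers/m01_gaze_area.joblib",
    "N02": "classifiers/n02_gaze_area.joblib",
}

def load_gaze_area_classifier(space_id: str):
    """Fetch the gaze area classifier trained for the given space identifier."""
    return joblib.load(CLASSIFIER_SET[space_id])
```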
  • The sight-line detection result obtained in the above step 12 includes at least the sight-line starting point information and sight-line direction information of the person in the face image, and may also include the head posture information of the person in the face image.
  • As shown in FIG. 4, steps 1211-1212 may be executed to determine the starting point information of the person's line of sight in the face image.
  • In step 1211, the position of the eyes in the face image is detected.
  • the aforementioned eye position is the position of the human eye in the face image in the actual camera coordinate system.
  • the aforementioned actual camera coordinate system is a spatial rectangular coordinate system determined by the computer system based on the aforementioned camera.
  • the aforementioned camera is a camera that captures the aforementioned human face image in the aforementioned predetermined three-dimensional space, and may be marked as a camera C0.
  • the Z axis of the actual camera coordinate system is the optical axis of the aforementioned camera, and the optical center of the camera lens is the origin of the preset actual camera coordinate system.
  • The horizontal axis (X axis) and the vertical axis (Y axis) of the actual camera coordinate system are parallel to the lens plane of the camera.
  • the computer system can detect the eye position in the face image in any of the following ways:
  • The first way is to detect the eye position based on at least two frames of face images simultaneously collected by at least two cameras for the same target object, such as the above-mentioned driver, where the at least two cameras include the camera that collects the face image to be measured.
  • The second way is to detect the head posture information of the person in the face image, and to detect the position of the eyes in the face image based on the head posture information.
  • For example, the computer system can determine the above head posture information of the driver from a face image taken by one camera, using head posture estimation methods in the related art, such as flexible model methods and geometric methods.
  • Then, the 3D position of the target object's eyes in the preset actual camera coordinate system is acquired based on the head posture information, where the preset actual camera coordinate system is the camera coordinate system determined based on the camera C0.
  • In this way, the 3D position of the human eye can be determined using the face image collected by a single camera, that is, a monocular camera, so the hardware configuration cost of gaze area detection can be reduced.
  • In step 1212, the starting point information of the line of sight of the person in the face image is determined according to the eye position.
  • The eye position detected from the face image in step 1211 may include the position of a single eye of the target object in the face image, such as the driver, or may include the positions of both eyes (that is, the positions of the driver's left and right eyes).
  • The following manner 1 or manner 2 may be used to determine the starting point information of the person's line of sight in the face image.
  • Manner 1: determine the starting point of the person's line of sight in the face image according to the position of a single eye.
  • If the eye positions determined in step 1211 include the positions of both eyes, the sight-line starting point information of the person in the face image can be determined according to the position of either eye.
  • If the eye position determined in step 1211 includes the position of a single eye, the sight-line starting point information of the person in the face image is determined according to that single eye position.
  • Manner 2: the middle position of the two eyes is determined as the sight-line starting point information, where the middle position may be the midpoint of the line connecting the 3D coordinates of the two eyes, or another position on that line.
  • Using the above manner 2 to determine the sight-line starting point of the person in the face image, compared with manner 1, helps eliminate the inaccuracy of the starting point information caused by monocular detection error, thereby improving the accuracy of the sight-line detection result. Both manners are sketched below.
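  • A minimal sketch of the two manners, assuming the eye positions are 3-D coordinates in the actual camera coordinate system; the function names are illustrative.

```python
import numpy as np

def gaze_start_single(eye_pos):
    """Manner 1: use the 3-D position of one detected eye directly."""
    return np.asarray(eye_pos, dtype=float)

def gaze_start_midpoint(left_eye, right_eye):
    """Manner 2: use the midpoint of the line connecting the two eyes."""
    return (np.asarray(left_eye, dtype=float) +
            np.asarray(right_eye, dtype=float)) / 2.0
```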
  • As shown in FIG. 5, steps 1221-1222 can be executed to detect the sight-line direction information of the person in the face image.
  • In step 1221, the head posture information of the person in the face image is detected.
  • For example, the computer system can determine the head posture information of the driver from a face image taken by one camera, using head posture estimation methods in the related art such as the flexible model method and the geometric method.
  • The above-mentioned flexible model method refers to fitting a flexible model, such as the Active Shape Model (ASM), the Active Appearance Model (AAM), or an elastic graph matching model, to the head image and face structure in the image plane.
  • the geometric method refers to the use of the shape of the head and the accurate morphological information of the local feature points of the face, such as the relative positions of the eyes, nose, and mouth, to estimate the head posture.
  • the head posture of a person in the image can be estimated based on a single frame image collected by a monocular camera.
  • As shown in FIG. 6, the head posture information of the person in the face image can be detected by performing steps 1201 to 1202 (that is, step 1221).
  • In step 1201, multiple key points of the face in the face image are detected.
  • For example, the face key points can be detected by edge detection algorithms such as the Roberts operator and the Sobel operator, or by related models such as active contour models (e.g., the Snake model).
  • face key point detection may be performed by a neural network used for face key point detection.
  • A third-party application, such as the Dlib toolkit, can also be used for face key point detection.
  • a preset number (such as 160) of facial key point positions can be detected, which may include the position coordinates of the key points of the face such as the left eye corner, the right eye corner, the nose tip, the left mouth corner, the right mouth corner, and the lower jaw. It is understandable that the number of face key point position coordinates obtained may be different according to different face key point detection methods. For example, using the Dlib toolkit can detect 68 key points on the face.
  • In step 1202, based on the detected face key points and a preset average face model, the head posture information of the person in the face image is determined. A sketch of this step is given below.
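  • A minimal head-pose sketch from detected 2-D face key points and a generic 3-D "average face" model, using OpenCV's PnP solver. The six landmark coordinates below are illustrative values (in millimetres), not the average face model used by the patent.

```python
import cv2
import numpy as np

# Illustrative average-face landmarks: nose tip, chin, left/right eye
# corners, left/right mouth corners (generic values, in mm).
AVERAGE_FACE_3D = np.array([
    [0.0, 0.0, 0.0], [0.0, -63.6, -12.5],
    [-43.3, 32.7, -26.0], [43.3, 32.7, -26.0],
    [-28.9, -28.9, -24.1], [28.9, -28.9, -24.1]], dtype=np.float64)

def estimate_head_pose(landmarks_2d, camera_matrix):
    """Return the head rotation matrix and translation in camera coordinates."""
    dist_coeffs = np.zeros(4)  # assume an undistorted image
    ok, rvec, tvec = cv2.solvePnP(AVERAGE_FACE_3D,
                                  np.asarray(landmarks_2d, dtype=np.float64),
                                  camera_matrix, dist_coeffs,
                                  flags=cv2.SOLVEPNP_ITERATIVE)
    R, _ = cv2.Rodrigues(rvec)  # convert rotation vector to 3x3 matrix
    return R, tvec
```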
  • In step 1222, the sight-line direction information of the person in the face image is detected based on the head posture information.
  • a trained neural network may be used to detect the line of sight direction information of the person in the face image.
  • As shown in FIG. 7, step 1222 may include steps 12221 to 12223.
  • In step 12221, the face image is normalized according to the head posture information to obtain a normalized face image.
  • the position of the face area image in the entire image changes randomly, and the posture of the person's head in the image also changes randomly. If the face image directly collected by the camera is used as the sample image when training the above neural network, the training difficulty and training time of the neural network will be increased due to the randomness of the head posture and the image position of the face area.
  • Therefore, each sample image in the training sample set is normalized, so that the normalized sample image is equivalent to image data taken by a virtual camera facing the person's head; the normalized sample images are then used to train the neural network.
  • As shown in FIG. 8A, step 12221 may include steps 12-1 to 12-3.
  • In step 12-1, the head coordinate system of the person in the face image is determined according to the head posture information.
  • the X axis of the head coordinate system is parallel to the line connecting the left and right eye coordinates;
  • the Y axis of the head coordinate system is perpendicular to the X axis in the face plane;
  • the Z axis of the head coordinate system is perpendicular to the face plane;
  • the starting point of the line of sight of the human eye is the origin of the head coordinate system.
  • the computer system detects the head posture information of the target object based on the aforementioned face image, which is equivalent to the computer system predicting the three-dimensional head model of the target object.
  • the three-dimensional head model may represent the posture information of the head of the target object relative to the camera C0 when the camera C0 collects the aforementioned face image.
  • the computer system can determine the head coordinate system of the target object based on the head posture information.
  • the head coordinate system can be expressed as a spatial rectangular coordinate system.
  • the X axis of the head coordinate system may be parallel to the line connecting the 3D position coordinates of the two eyes in the three-dimensional head model.
  • the midpoint of the line of the 3D position coordinates of the two eyes, that is, the starting point of the line of sight of the human eye can be determined as the origin of the head coordinate system.
  • the Y axis of the head coordinate system is perpendicular to the X axis in the face plane.
  • the Z axis of the head coordinate system is perpendicular to the face plane.
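  • The following is a minimal sketch of this head coordinate system construction, assuming the 3-D eye positions and the face-plane normal are available from the predicted head model (all in the actual camera coordinate system).

```python
import numpy as np

def head_coordinate_system(left_eye, right_eye, face_normal):
    """Return the origin and the X/Y/Z axes (rows) of the head coordinate system."""
    left = np.asarray(left_eye, dtype=float)
    right = np.asarray(right_eye, dtype=float)
    origin = (left + right) / 2.0                 # sight-line starting point
    x_axis = (right - left) / np.linalg.norm(right - left)
    z_axis = np.asarray(face_normal, dtype=float)
    z_axis /= np.linalg.norm(z_axis)              # perpendicular to the face plane
    y_axis = np.cross(z_axis, x_axis)             # perpendicular to X, in the face plane
    return origin, np.stack([x_axis, y_axis, z_axis])
```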
  • In step 12-2, the actual camera coordinate system corresponding to the face image is rotated and translated based on the head coordinate system to obtain a virtual camera coordinate system.
  • the Z axis of the virtual camera coordinate system points to the origin of the head coordinate system
  • the X axis of the virtual camera coordinate system and the X axis of the head coordinate system are in the same plane
  • the origin of the virtual camera coordinate system and the origin of the head coordinate system are separated by a predetermined distance in the Z-axis direction of the virtual camera coordinate system.
  • Specifically, the camera can be rotated and translated with reference to the head coordinate system to determine a virtual camera, and the virtual camera coordinate system corresponding to the above-mentioned virtual camera is established based on that virtual camera.
  • the method for establishing the virtual camera coordinate system is similar to the method for establishing the preset actual camera coordinate system, that is, the Z axis of the virtual camera coordinate system is the optical axis of the virtual camera, and the X and Y axes of the virtual camera coordinate system are parallel to The lens plane of the virtual camera; the optical center of the virtual camera lens is the origin of the virtual camera coordinate system.
  • the positional relationship between the virtual camera coordinate system and the head coordinate system meets the following three conditions:
  • Condition 1: the Z axis of the virtual camera coordinate system points to the origin of the head coordinate system;
  • Condition 2: the X axis of the virtual camera coordinate system and the X axis of the head coordinate system are in the same plane;
  • Condition 3: the origin of the virtual camera coordinate system and the origin of the head coordinate system are separated by a predetermined distance in the Z-axis direction of the virtual camera coordinate system.
  • The above process is equivalent to determining a virtual camera by performing the following operations on the camera C0: rotating the camera C0 so that its Z axis points to the starting point of the person's three-dimensional line of sight in the face image, and making the X axis of the camera C0 lie in the same plane as the X axis of the head coordinate system; and then translating the rotated camera C0 along its Z axis so that the distance between the optical center of the lens and the origin of the head coordinate system is a preset length.
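  • A minimal sketch of a virtual camera rotation satisfying conditions 1 and 2 above; the row-vector axis convention is an assumption, and the head X axis must not be parallel to the direction toward the gaze origin.

```python
import numpy as np

def virtual_camera_rotation(gaze_origin, head_x_axis):
    """Rows of the returned matrix are the virtual camera's X/Y/Z axes."""
    z = np.asarray(gaze_origin, dtype=float)
    z /= np.linalg.norm(z)                  # condition 1: Z points at the gaze origin
    y = np.cross(z, np.asarray(head_x_axis, dtype=float))
    y /= np.linalg.norm(y)
    x = np.cross(y, z)                      # condition 2: X coplanar with the head X axis
    return np.stack([x, y, z])
```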
  • The computer system can determine the position transformation relationship between the actual camera coordinate system and the virtual camera coordinate system based on the positional relationship between the actual camera coordinate system and the head coordinate system, together with the positional relationship between the virtual camera coordinate system and the head coordinate system.
  • the virtual camera coordinate system is related to the head posture of the person in the face image. Therefore, different face images may correspond to different virtual camera coordinate systems.
  • In step 12-3, according to the position transformation relationship between the actual camera coordinate system and the virtual camera coordinate system, normalization processing is performed on the face image to obtain the corrected face image.
  • Specifically, the computer system can use the position transformation relationship between the actual camera coordinate system and the virtual camera coordinate system to perform rotation, affine, and scaling transformations on the face image, so as to obtain the corrected face image in the virtual camera coordinate system.
  • FIG. 8B shows a schematic diagram of the normalization processing of an acquired face image according to an exemplary embodiment, where image P0 is the face image collected by the actual vehicle camera C0 for the driver, and image P1 represents the corrected face image in the virtual camera coordinate system obtained after the above normalization processing, which is equivalent to a face image of the driver collected by a virtual camera C1 facing the driver's head.
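  • One common way to realize this warp, sketched below under a planar approximation, is the homography K_virt · R · K_real^-1, where R is the rotation from the actual to the virtual camera coordinate system; the output size is an illustrative choice.

```python
import cv2
import numpy as np

def normalize_face_image(image, K_real, K_virt, R, out_size=(224, 224)):
    """Warp the captured face image into the virtual camera's view."""
    H = K_virt @ R @ np.linalg.inv(K_real)   # plane-induced homography
    return cv2.warpPerspective(image, H, out_size)
```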
  • In step 12222, sight-line direction detection is performed based on the corrected face image to obtain the first detected sight-line direction.
  • the first detected line of sight direction is the three-dimensional line of sight direction information in the virtual camera coordinate system, and may be a three-dimensional direction vector.
  • the normalized face image that has undergone the above-mentioned normalization processing may be input to a trained neural network for detecting the line of sight direction to detect the three-dimensional line of sight information of the person in the above-mentioned corrected face image.
  • The aforementioned neural network for detecting the sight-line direction may be a deep neural network (DNN), such as a convolutional neural network (CNN).
  • In step 12223, coordinate inverse transformation processing is performed on the first detected sight-line direction to obtain the sight-line direction information of the person in the face image.
  • The sight-line direction information detected by the computer system, that is, the first detected sight-line direction, is expressed in the virtual camera coordinate system; the coordinate inverse transformation processing is therefore used to obtain the sight-line direction information in the actual camera coordinate system, as sketched below.
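  • A minimal sketch of this inverse transformation: with R the rotation from the actual to the virtual camera coordinate system, a direction predicted in virtual coordinates maps back through the inverse rotation (the transpose, since R is orthogonal).

```python
import numpy as np

def gaze_to_actual_coords(gaze_dir_virtual, R):
    """Map the first detected sight-line direction back to the actual camera frame."""
    return R.T @ np.asarray(gaze_dir_virtual, dtype=float)
```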
  • step 12 is equivalent to the process of determining the line of sight feature vector of the person in the face image, and the line of sight feature vector includes the start point information and the line of sight direction information of the person in the face image.
  • The artificial neural networks used in this stage, such as the neural network for detecting face key points and the neural network for detecting the sight-line direction, can be applied to different car models and have good transferability.
  • Then, the gaze starting point information and gaze direction information of the person in the face image determined in step 12 can be input into the gaze area classifier trained in advance for the predetermined three-dimensional space, to detect the category of the target gaze area corresponding to the face image.
  • The above step 13 may include: determining the target gaze area information according to the category of the target gaze area, and outputting the target gaze area information.
  • the classifier may output the category of the target gaze area, as shown in FIG. 9A, or directly output the name of the target gaze area, as shown in FIG. 9B.
  • the above-mentioned gaze area detection method may further include: before the above-mentioned step 11, training a neural network for detecting the direction of the line of sight.
  • This step corresponds to the training process of the 3D line of sight direction estimation model. It should be noted that this step and the process of real-time training of the gaze area classifier shown in FIG. 2 can be executed in different computer systems.
  • FIG. 10 is a flowchart of a method of training a neural network for detecting a three-dimensional line of sight direction according to an exemplary embodiment of the present disclosure.
  • the method may include steps 1001-1005.
  • In step 1001, an original sample set containing at least one face sample is determined, where each face sample includes a face image sample and sight-line direction label information.
  • the above-mentioned neural network may be trained by a supervised learning method.
  • Each sample in the sample set used to train the aforementioned neural network may include: input information used for prediction, that is, a face image sample; and the ground truth corresponding to the input information, that is, the sight-line direction information actually measured in the actual camera coordinate system.
  • The above actually measured sight-line direction information is also referred to as sight-line direction label information.
  • In step 1002, head posture information corresponding to each of the face image samples is determined according to the face key points and the average face model.
  • In step 1003, based on the head posture information and the actual camera coordinate system, the normalized face image sample corresponding to each face image sample and the sight-line direction label information in the virtual camera coordinate system are determined.
  • the implementation process of the foregoing step 1002 and step 1003 is similar to the foregoing step 1202 and steps 12-1 to 12-3, respectively, and will not be repeated here.
  • the computer system can convert the above-mentioned line-of-sight direction labeling information into virtual line-of-sight labeling information according to the position transformation relationship from the actual camera coordinate system to the virtual camera coordinate system.
  • In step 1004, each of the normalized face image samples is input into the three-dimensional sight-line direction detection neural network to be trained, to obtain three-dimensional sight-line direction prediction information; in step 1005, the parameters of the neural network are adjusted according to the deviation between the three-dimensional sight-line direction prediction information and the virtual sight-line direction label information, to obtain the neural network for detecting the sight-line direction.
  • The normalized face images processed into the virtual camera coordinate system are used as training sample data, which can reduce the difficulty of neural network training caused by head posture changes and improve the training efficiency of the neural network for detecting the sight-line direction. A training-loop sketch follows.
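  • A minimal training sketch for steps 1004-1005, assuming a dataloader yielding normalized face images with their virtual-coordinate direction labels; the angular (cosine) loss and the optimizer settings are illustrative choices, since the patent does not fix a specific loss or architecture.

```python
import torch
import torch.nn as nn

def angular_loss(pred, target):
    # Penalize the angle between predicted and labeled 3-D sight directions.
    cos = nn.functional.cosine_similarity(pred, target, dim=1)
    return (1.0 - cos).mean()

def train_gaze_network(model, loader, epochs=10, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, gaze_labels in loader:
            opt.zero_grad()
            loss = angular_loss(model(images), gaze_labels)  # step 1005 deviation
            loss.backward()
            opt.step()  # adjust network parameters
    return model
```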
  • the attention monitoring result of the person corresponding to the face image can be determined according to the detection result of the gaze area category.
  • the gaze area category detection result may be the gaze area detection category within a preset time period.
  • For example, the detection result of the gaze area category may be "during the preset time period, the driver's gaze area has always been area 2". Then, if area 2 is the right front windshield, this indicates that the driver is driving attentively; if area 2 is the glove box area in front of the co-pilot, it means that the driver is likely distracted and unable to concentrate.
  • the attention monitoring result may be output, for example, "driving is very attentive” may be displayed in a certain display area in the vehicle.
  • Alternatively, a distraction prompt message may be generated according to the attention monitoring result, prompting the driver to "please concentrate on driving and ensure driving safety" through a prominent display on the screen or a voice prompt.
  • That is, when specific information is output, at least one of the attention monitoring result and the distraction prompt information may be output. An aggregation sketch is given below.
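  • The following sketches one way to aggregate per-frame gaze-area detections over a time window into an attention monitoring result; the road-facing area numbers follow the illustrative Table 1 encoding given earlier, and the distraction threshold is an assumption.

```python
from collections import Counter

ROAD_AREAS = {1, 2}  # left / right front windshield (illustrative encoding)

def monitor_attention(area_detections, distraction_ratio=0.5):
    """area_detections: per-frame gaze-area categories within a preset time period."""
    counts = Counter(area_detections)
    off_road = sum(n for area, n in counts.items() if area not in ROAD_AREAS)
    if off_road > distraction_ratio * max(len(area_detections), 1):
        return "please concentrate on driving and ensure driving safety"
    return "driving is very attentive"
```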
  • The above takes the monitoring of the driver's attention in the intelligent driving application scenario as an example for description.
  • the detection of the gaze area can also have many other uses.
  • vehicle-machine interactive control based on gaze area detection can be performed.
  • Some electronic equipment, such as a multimedia player, can be installed in the vehicle; by detecting the gaze area of a person in the vehicle, the multimedia player can be automatically controlled to start its playback function according to the gaze area detection result.
  • the face image of the person (such as the driver or passenger) in the vehicle is captured by a camera deployed in the vehicle, and the detection result of the gaze area category is detected through a pre-trained neural network.
  • the detection result may be: within a period of time T, the gaze area of the person in the vehicle has been the area where the "gaze on" option on a certain multimedia player in the vehicle is located. According to the above detection result, it can be determined that the person in the vehicle wants to turn on the multimedia player, so that corresponding control instructions can be output to control the multimedia player to start playing.
  • Similarly, the face image of the controlling person can be collected, and the gaze area category detection result can be obtained through a pre-trained neural network.
  • the detection result may be: within a period of time T, the gaze area of the controller has been the area where the "gaze on" option on the smart air conditioner is located. According to the above detection results, it can be determined that the controller wants to start the smart air conditioner, so that a corresponding control command can be output to control the air conditioner to turn on.
  • the present disclosure may also provide embodiments of devices and electronic equipment corresponding to the foregoing method embodiments.
  • FIG. 11 is a block diagram of a gaze area detecting device 1100 according to an exemplary embodiment of the present disclosure.
  • the gaze area detection device 1100 may include an image acquisition module 21, a gaze detection module 22 and a gaze area detection module 23.
  • the image acquisition module 21 is used to acquire a face image collected in a predetermined three-dimensional space.
  • the sight line detection module 22 is configured to perform sight line detection based on the face image to obtain a sight line detection result.
  • the sight line detection result may include the start point information and the sight direction information of the person in the face image.
  • the gaze area detection module 23 is configured to use a gaze area classifier trained in advance for the predetermined three-dimensional space to detect the category of the target gaze area corresponding to the face image according to the line of sight detection result.
  • the target gaze area belongs to one of multiple types of defined gaze areas obtained by dividing the predetermined three-dimensional space in advance.
  • As shown in FIG. 12, a sight-line detection module 22 of the gaze area detection device may include: an eye position detection sub-module 221 for detecting the eye position in the face image; and a first starting point information determining sub-module 222 configured to determine the middle position of the two eyes as the sight-line starting point information when the eye position includes the positions of both eyes.
  • As shown in FIG. 13, another sight-line detection module 22 of the gaze area detection device may include: an eye position detection sub-module 221 for detecting the eye position in the face image; and a second starting point information determining sub-module 223 configured to determine the position of either eye as the sight-line starting point information when the eye position includes the positions of both eyes, or to determine the position of the single eye as the sight-line starting point information when the eye position includes the position of a single eye.
  • As shown in FIG. 14, the eye position detection sub-module 221 in FIGS. 12 and 13 may include: a posture detection unit 2211 for detecting the head posture information of the person in the face image; and a position determining unit 2212 configured to determine the position of the eyes in the face image according to the head posture information.
  • As shown in FIG. 15, another sight-line detection module 22 of the gaze area detection device may include: a posture detection sub-module 22-1 for detecting the head posture information of the person in the face image; and a direction detection sub-module 22-2 for detecting the sight-line direction information of the person in the face image based on the head posture information.
  • As shown in FIG. 16, the posture detection sub-module 22-1 in FIG. 15 may include: a key point detection unit 22-11 for detecting multiple face key points in the face image; and a posture determination unit 22-12 configured to determine the head posture information of the person in the face image based on the face key points and a preset average face model.
  • As shown in FIG. 17, the direction detection sub-module 22-2 in FIG. 15 may include: an image processing unit 22-21 configured to normalize the face image according to the head posture information to obtain the corrected face image; a first direction detection unit 22-22 configured to perform sight-line direction detection based on the corrected face image to obtain the first detected sight-line direction; and a direction determining unit 22-23 configured to perform coordinate inverse transformation processing on the first detected sight-line direction to obtain the sight-line direction information of the person in the face image.
  • As shown in FIG. 18, the image processing unit 22-21 in FIG. 17 may include: a head coordinate determination subunit 22-211 for determining the head coordinate system of the person in the face image according to the head posture information; a coordinate transformation subunit 22-212 for rotating and translating the actual camera coordinate system corresponding to the face image based on the head coordinate system to obtain a virtual camera coordinate system; and a processing subunit 22-213 configured to normalize the face image according to the position transformation relationship between the actual camera coordinate system and the virtual camera coordinate system to obtain the corrected face image.
  • the gaze area classifier may be trained in advance based on a training sample set for the predetermined three-dimensional space.
  • the training sample set may include a plurality of gaze feature samples, each of which includes gaze starting point information, gaze direction information, and annotation information of the gaze area category corresponding to the sample, where the annotated gaze area category belongs to one of the multiple categories of gaze areas defined for the predetermined three-dimensional space.
  • FIG. 19 is a block diagram of another gaze area detecting device 1900 according to an exemplary embodiment of the present disclosure. Compared with the gaze area detection device 1100 shown in FIG. 11, the gaze area detection device 1900 may further include a classifier training module 20.
  • the classifier training module 20 may include: a category prediction sub-module 201, configured to input the gaze starting point information and gaze direction information of at least one gaze feature sample into the gaze area classifier to be trained, so as to obtain gaze area category prediction information corresponding to the gaze feature sample; and a parameter adjustment sub-module 202, configured to adjust the parameters of the gaze area classifier based on the deviation between the gaze area category prediction information and the annotation information of the gaze area category corresponding to the gaze feature sample, so as to train the gaze area classifier.
  • FIG. 20 is a block diagram of another gaze area detecting device 2000 according to an exemplary embodiment of the present disclosure. Compared with the gaze area detection device 1100 shown in FIG. 11, the gaze area detection device 2000 may further include a classifier acquisition module 203.
  • the classifier acquisition module 203 may obtain the gaze area classifier corresponding to the space identifier from a preset gaze area classifier set according to the space identifier of the predetermined three-dimensional space.
  • the preset gaze area classifier set may include: gaze area classifiers respectively corresponding to the spatial identifiers of different three-dimensional spaces.
  • the predetermined three-dimensional space may include a vehicle space.
  • the face image may be determined based on the image collected for the driving area in the vehicle space.
  • the multiple types of defined gaze areas obtained by dividing the predetermined three-dimensional space may include at least two of the following: left front windshield area, right front windshield area, instrument panel area, interior rearview mirror area, center console area, left rearview mirror area, right rearview mirror area, sun visor area, shift lever area, area under the steering wheel, front passenger area, and glove box area in front of the front passenger seat.
  • FIG. 21 is a block diagram of another gaze area detecting device 2100 according to an exemplary embodiment of the present disclosure.
  • the gaze area detection device 2100 may further include: an attention monitoring module 24, configured to determine an attention monitoring result of the person corresponding to the face image according to the gaze area category detection result obtained by the gaze area detection module 23; and a monitoring result output module 25, configured to output the attention monitoring result and/or output distraction prompt information according to the attention monitoring result.
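The attention monitoring logic above can be illustrated with a minimal sketch. The set of "safe" gaze areas and the time threshold below are illustrative assumptions, not values taken from the disclosure:

```python
import time
from typing import Optional

# Hypothetical set of gaze areas considered attentive while driving.
SAFE_AREAS = {"left front windshield", "right front windshield",
              "interior rearview mirror", "left rearview mirror",
              "right rearview mirror", "instrument panel"}
DISTRACTION_SECONDS = 2.0  # assumed threshold, not taken from the disclosure

class AttentionMonitor:
    """Tracks how long the driver's gaze has stayed away from safe areas."""

    def __init__(self) -> None:
        self._off_road_since: Optional[float] = None

    def update(self, gaze_area: str, now: Optional[float] = None) -> str:
        """Return 'attentive' or 'distracted' for the latest detection result."""
        now = time.time() if now is None else now
        if gaze_area in SAFE_AREAS:
            self._off_road_since = None
            return "attentive"
        if self._off_road_since is None:
            self._off_road_since = now
        if now - self._off_road_since >= DISTRACTION_SECONDS:
            return "distracted"  # a real module would emit prompt info here
        return "attentive"
```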
  • FIG. 22 is a block diagram of another gaze area detecting device 2200 according to an exemplary embodiment of the present disclosure.
  • the gaze area detection device 2200 may further include: a control instruction determination module 26 for determining a control instruction corresponding to the gaze area category detection result obtained by the gaze area detection module 23;
  • the operation control module 27 is configured to control the electronic device to perform operations corresponding to the control instructions.
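As a rough illustration of the control instruction determination module 26 and the operation control module 27, the mapping below pairs detected gaze area category values with device operations; the specific pairings are invented for the example and are not prescribed by the disclosure:

```python
from typing import Callable, Dict

# Hypothetical mapping from gaze area category values (see Table 1 in the
# description below) to operations performed by the electronic device.
CONTROL_TABLE: Dict[int, Callable[[], None]] = {
    5: lambda: print("wake up center console screen"),  # center console
    3: lambda: print("brighten instrument panel"),      # instrument panel
}

def dispatch(gaze_category: int) -> None:
    """Execute the operation corresponding to the detected gaze category."""
    action = CONTROL_TABLE.get(gaze_category)
    if action is not None:
        action()
```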
  • for relevant details, reference may be made to the corresponding description in the method embodiments.
  • the device embodiments described above are merely illustrative, where the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units.
  • Those of ordinary skill in the art can select some or all of the modules according to actual needs to implement the embodiments of the present disclosure without creative work.
  • FIG. 23 is a block diagram of an electronic device 2300 according to an exemplary embodiment of the present disclosure.
  • the electronic device 2300 may include a processor, an internal bus, a network interface, a memory, and a non-volatile memory.
  • the processor can read the corresponding computer program from the non-volatile memory to run in the memory, thereby logically forming a gaze area detection device that implements the above gaze area detection method.
  • the present disclosure can be provided as a method, device, system, or computer program product. Therefore, the present disclosure may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware.
  • the present disclosure may also provide a computer-readable storage medium storing a computer program which, when executed by a processor, causes the processor to implement the gaze area detection method according to any of the foregoing method embodiments.
  • embodiments of the subject matter described herein can be implemented as one or more computer programs, that is, one or more modules of computer program instructions encoded on a tangible non-transitory program carrier to be executed by a data processing device or to control the operation of the data processing device.
  • alternatively or additionally, the program instructions may be encoded on an artificially generated propagated signal (such as a machine-generated electrical, optical, or electromagnetic signal) that is generated to encode information for transmission to a suitable receiver device for execution by a data processing device.
  • the computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
  • the processes and logic flows described herein can be performed by one or more programmable computers executing one or more computer programs, which perform the corresponding functions by operating on input data and generating output.
  • the processes and logic flows can also be performed by dedicated logic circuitry such as an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit), and the device can also be implemented as dedicated logic circuitry.
  • Computers suitable for executing computer programs include, for example, general-purpose or special-purpose microprocessors, or any other type of central processing unit.
  • the central processing unit will receive instructions and data from a read-only memory and/or random access memory.
  • the basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data.
  • a computer can include one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or the computer can be operatively coupled to such mass storage devices to receive data from them or transfer data to them.
  • the computer can be embedded in another device (such as a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, a universal serial bus (USB) flash drive, a portable storage device, etc.).
  • computer-readable media suitable for storing computer program instructions and data may include various forms of non-volatile memory, such as semiconductor memory devices (for example, erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and flash memory), magnetic disks (such as internal hard disks or removable disks), magneto-optical disks, compact disc read-only memory (CD-ROM), digital versatile discs (DVD), etc.
  • the processor and the memory can be supplemented by, or incorporated in, dedicated logic circuitry.


Abstract

A method and apparatus for detecting a gaze area and an electronic device. The method comprises: obtaining a face image collected in a predetermined three-dimensional space (11); performing line-of-sight detection based on the face image to obtain the line-of-sight detection result (12); and using a gaze area classifier trained for the predetermined three-dimensional space in advance to detect the category of a target gaze area corresponding to the face image according to the line-of-sight detection result (13).

Description

Gaze area detection method, device and electronic equipment
Cross-reference to related applications
The present disclosure claims priority to Chinese patent application No. 201910204793.1, filed on March 18, 2019 and entitled "Gaze Area Detection Method, Apparatus and Electronic Device", the entire contents of which are incorporated herein by reference.
Technical field
The present disclosure relates to the field of computer vision technology, and in particular to a method, device and electronic equipment for detecting a gaze area.
Background
Gaze area detection can play an important role in applications such as intelligent driving, human-computer interaction, and security monitoring. In human-computer interaction, by determining the three-dimensional position of the human eye in space and combining it with the three-dimensional line-of-sight direction, the position of the person's gaze point in three-dimensional space can be obtained and output to a machine for further interactive processing. In attention detection, by estimating the line-of-sight direction of the human eye, the person's gaze direction can be determined and the person's region of interest obtained, so that it can be judged whether the person's attention is focused.
Summary of the invention
According to a first aspect of the present disclosure, there is provided a gaze area detection method. The method includes: acquiring a face image collected in a predetermined three-dimensional space; performing line-of-sight detection based on the face image to obtain a line-of-sight detection result; and using a gaze area classifier trained in advance for the predetermined three-dimensional space to detect, according to the line-of-sight detection result, the category of the target gaze area corresponding to the face image, where the target gaze area belongs to one of multiple types of defined gaze areas obtained by dividing the predetermined three-dimensional space in advance.
According to a second aspect of the present disclosure, there is provided a gaze area detection device. The device includes: an image acquisition module, configured to acquire a face image collected in a predetermined three-dimensional space; a line-of-sight detection module, configured to perform line-of-sight detection based on the face image to obtain a line-of-sight detection result; and a gaze area detection module, configured to use a gaze area classifier trained in advance for the predetermined three-dimensional space to detect, according to the line-of-sight detection result, the category of the target gaze area corresponding to the face image, where the target gaze area belongs to one of multiple types of defined gaze areas obtained by dividing the predetermined three-dimensional space in advance.
According to a third aspect of the present disclosure, there is provided a computer-readable storage medium on which a computer program is stored, and when the computer program is executed by a processor, the processor implements the method according to the first aspect.
According to a fourth aspect of the present disclosure, there is provided an electronic device including a memory and a processor, where the memory stores a computer program and the processor implements the method according to the first aspect when executing the computer program.
According to the embodiments of the present disclosure, when the predetermined three-dimensional space changes, only a corresponding gaze area classifier needs to be trained for each different three-dimensional space. Since training the classifier does not require a large amount of data and is relatively fast, the time cost and technical difficulty of migrating the gaze area detection method between different three-dimensional spaces (such as the spaces of different car models) can be significantly reduced.
Description of the drawings
FIG. 1 is a flowchart of a gaze area detection method according to an exemplary embodiment of the present disclosure;
FIG. 2 is a flowchart of a method for training a gaze area classifier for a predetermined three-dimensional space in real time according to an exemplary embodiment of the present disclosure;
FIG. 3 is a schematic diagram of multiple types of defined gaze areas according to an exemplary embodiment of the present disclosure;
FIG. 4 is a flowchart of a method for determining the line-of-sight starting point information of a person in a face image according to an exemplary embodiment of the present disclosure;
FIG. 5 is a flowchart of a method for detecting the line-of-sight direction information of a person in a face image according to an exemplary embodiment of the present disclosure;
FIG. 6 is a flowchart of a method for detecting the head posture information of a person in a face image according to an exemplary embodiment of the present disclosure;
FIG. 7 is a flowchart of a method for detecting the line-of-sight direction information of a person in a face image based on head posture information according to an exemplary embodiment of the present disclosure;
FIG. 8A is a flowchart of a method for normalizing a face image to obtain a corrected face image according to an exemplary embodiment of the present disclosure;
FIG. 8B is a schematic diagram of normalizing an acquired face image according to an exemplary embodiment of the present disclosure;
FIG. 9A is a schematic diagram of a classifier outputting the category of a target gaze area according to an exemplary embodiment of the present disclosure;
FIG. 9B is a schematic diagram of a classifier outputting the name of a target gaze area according to an exemplary embodiment of the present disclosure;
FIG. 10 is a flowchart of a method for training a neural network for detecting a three-dimensional line-of-sight direction according to an exemplary embodiment of the present disclosure;
FIG. 11 is a block diagram of a gaze area detection device according to an exemplary embodiment of the present disclosure;
FIG. 12 is a block diagram of a line-of-sight detection module of the gaze area detection device according to an exemplary embodiment of the present disclosure;
FIG. 13 is a block diagram of another line-of-sight detection module of the gaze area detection device according to an exemplary embodiment of the present disclosure;
FIG. 14 is a block diagram of the eye position detection sub-module in FIGS. 12 and 13 according to an exemplary embodiment of the present disclosure;
FIG. 15 is a block diagram of another line-of-sight detection module of the gaze area detection device according to an exemplary embodiment of the present disclosure;
FIG. 16 is a block diagram of the posture detection sub-module of the line-of-sight detection module in FIG. 15 according to an exemplary embodiment of the present disclosure;
FIG. 17 is a block diagram of the direction detection sub-module of the line-of-sight detection module in FIG. 15 according to an exemplary embodiment of the present disclosure;
FIG. 18 is a block diagram of the image processing unit of the direction detection sub-module in FIG. 17 according to an exemplary embodiment of the present disclosure;
FIG. 19 is a block diagram of another gaze area detection device according to an exemplary embodiment of the present disclosure;
FIG. 20 is a block diagram of another gaze area detection device according to an exemplary embodiment of the present disclosure;
FIG. 21 is a block diagram of another gaze area detection device according to an exemplary embodiment of the present disclosure;
FIG. 22 is a block diagram of another gaze area detection device according to an exemplary embodiment of the present disclosure;
FIG. 23 is a block diagram of an electronic device according to an exemplary embodiment of the present disclosure.
Detailed description
Exemplary embodiments will be described in detail here, examples of which are shown in the accompanying drawings. When the following description refers to the drawings, unless otherwise indicated, the same numbers in different drawings indicate the same or similar elements. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure; rather, they are merely examples of devices and methods consistent with some aspects of the present disclosure as recited in the appended claims.
The terms used in the present disclosure are for the purpose of describing specific embodiments only, and are not intended to limit the present disclosure. The singular forms "a", "said" and "the" used in the present disclosure are also intended to include plural forms, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any one or all possible combinations of one or more associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in the present disclosure to describe various pieces of information, this information should not be limited by these terms. These terms are only used to distinguish information of the same type from each other. For example, without departing from the scope of the present disclosure, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if" as used herein can be interpreted as "at the time of", "when", or "in response to".
The present disclosure provides a gaze area detection method, which can be applied in scenarios such as intelligent driving, human-computer interaction, and security monitoring. The present disclosure will take the application of the gaze area detection method to an intelligent driving scenario as an example for detailed description.
In the embodiments of the present disclosure, the execution subjects involved may include a computer system and a camera arranged in a predetermined three-dimensional space. The camera arranged in the predetermined three-dimensional space can send the collected face image data of a user to the computer system. The computer system can process the face image data using an artificial neural network to detect which part of the predetermined three-dimensional space the user's attention is focused on, that is, to detect the user's target gaze area, so that the computer system can output corresponding operation control information, such as instructions for an intelligent driving vehicle, according to the user's target gaze area.
The above computer system may be deployed in a server, a server cluster, or a cloud platform, or may be a computer system in an electronic device such as a personal computer, a vehicle-mounted device, or a mobile terminal. The above camera may be a camera in a vehicle-mounted device such as a driving recorder, a camera of a smart terminal, and the like. The smart terminal may include electronic devices such as smart phones, PDAs (Personal Digital Assistants), tablet computers, and vehicle-mounted devices. In a specific implementation, the camera and the computer system can be independent of each other while remaining connected, so as to jointly implement the gaze area detection method provided by the embodiments of the present disclosure. The following takes a computer system as an example to describe the gaze area detection method provided by the present disclosure in detail.
FIG. 1 is a flowchart of a gaze area detection method according to an exemplary embodiment of the present disclosure. The method can be executed by a computer system and can be applied to various smart devices (for example, smart vehicles, smart robots, smart home devices, etc.). As shown in FIG. 1, the method may include steps 11 to 13.
In step 11, a face image collected in a predetermined three-dimensional space is acquired.
Taking a vehicle of model M as an example, the predetermined three-dimensional space is the space of the vehicle. A camera can be fixedly installed in the interior space of the vehicle, for example at the center console. The camera can collect face images of a target object such as the driver in real time or at preset time intervals and provide them to the computer system, so that the computer system acquires the collected face images.
In step 12, line-of-sight detection is performed based on the face image to obtain a line-of-sight detection result.
In the embodiments of the present disclosure, the computer system can perform human eye line-of-sight detection based on the above face image to obtain a line-of-sight detection result. Human eye line-of-sight detection analyzes the position of the human eye and/or the line-of-sight direction in the face image to obtain the line-of-sight detection result. The present disclosure does not limit the method of line-of-sight detection: the methods mentioned in the embodiments of the present disclosure may be used, or other conventional methods may be used. The line-of-sight detection result may include the line-of-sight starting point information and line-of-sight direction information of the person in the face image, and may also include information such as the head posture of the person in the face image.
In step 13, a gaze area classifier trained in advance for the predetermined three-dimensional space is used to detect the category of the target gaze area corresponding to the face image according to the line-of-sight detection result.
The target gaze area belongs to one of multiple types of defined gaze areas obtained by dividing the predetermined three-dimensional space in advance. For example, the spaces that the driver can look at while the vehicle is traveling, such as the front windshield, the rearview mirrors, or other spaces inside the vehicle, can be used as the predetermined three-dimensional space.
As in the above example, after obtaining the line-of-sight detection result of the person in the face image, the computer system can input the result into the pre-trained gaze area classifier for the model M intelligent driving vehicle, thereby detecting the category of the target gaze area corresponding to the face image, that is, detecting which area of the vehicle the person in the face image, such as the driver, was looking at when the image was collected. A minimal sketch of this inference step follows.
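The sketch below assumes the classifier was trained on 6-dimensional features formed by concatenating the 3D gaze starting point and the 3D gaze direction, and that it was saved beforehand with joblib; the file name and feature layout are assumptions for illustration, not specified by the disclosure:

```python
import numpy as np
import joblib  # pip install joblib

# Hypothetical file produced by the training stage (see the training sketches below).
classifier = joblib.load("gaze_area_classifier_M.joblib")

def detect_gaze_area(origin_3d: np.ndarray, direction_3d: np.ndarray) -> int:
    """Return the predicted gaze area category value for one detection result."""
    feature = np.concatenate([origin_3d, direction_3d]).reshape(1, -1)
    return int(classifier.predict(feature)[0])

# Example: gaze origin in the camera coordinate system (meters) and a unit
# direction vector, both coming from the line-of-sight detection stage.
category = detect_gaze_area(np.array([0.1, -0.05, 0.6]),
                            np.array([0.0, 0.1, -0.99]))
```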
In the present disclosure, the above gaze area classifier for the predetermined three-dimensional space is trained in advance by the computer system based on a training sample set for the predetermined three-dimensional space, where the training sample set includes a plurality of gaze feature samples, each of which includes gaze starting point information, gaze direction information, and annotation information of the gaze area category corresponding to the sample; the annotated gaze area category belongs to one of the multiple types of defined gaze areas divided for the predetermined three-dimensional space.
According to the embodiments of the present disclosure, before the gaze area classifier is trained for the predetermined three-dimensional space, the three-dimensional regions that the human eye may pay attention to in the predetermined three-dimensional space are finely classified to obtain multiple types of defined gaze areas, and the classifier is trained based on the training samples corresponding to each type of defined gaze area, yielding a gaze area classifier for the predetermined three-dimensional space. Subsequently, the gaze area classifier can accurately detect the target gaze area information based on the line-of-sight detection result; the computation is simple and the misjudgment rate of the target gaze area is effectively reduced, thereby providing more accurate information for subsequent operations.
The line-of-sight detection stage corresponding to step 12 is independent of the distribution of the multiple types of defined gaze areas in the predetermined three-dimensional space, while the gaze area detection stage corresponding to step 13 is related to that distribution. For example, since the overall space of different vehicle models may differ in size, and an area of the same category, such as the glove box, may be located differently in different vehicle spaces, the division of the multiple types of defined gaze areas may also differ between three-dimensional spaces, for example in the number and categories of defined gaze areas. Therefore, different gaze area classifiers need to be trained for different three-dimensional spaces; for example, different gaze area classifiers are trained for an M-model car and an N-model car with different spatial layouts.
Therefore, the same method can be used for line-of-sight detection on different vehicle models, and only the gaze area classifier needs to be retrained when the vehicle model changes. Compared with retraining an entire convolutional neural network in an end-to-end manner, training the gaze area classifier is relatively simple, requires much less data, and is fast, so the time cost and technical difficulty of migrating the gaze area detection method between different vehicle models can be significantly reduced.
In another embodiment of the present disclosure, the gaze area detection method may further include: before step 11, obtaining the gaze area classifier trained for the predetermined three-dimensional space. In the present disclosure, the trained gaze area classifier may be obtained in the following manner one or manner two.
Manner one: when gaze area detection is required, the gaze area classifier for the predetermined three-dimensional space is trained in real time.
As shown in FIG. 2, training a gaze area classifier for a predetermined three-dimensional space in real time may include: step 101, inputting the gaze starting point information and gaze direction information of at least one gaze feature sample into the gaze area classifier to be trained, to obtain gaze area category prediction information corresponding to the gaze feature sample; and step 102, adjusting the parameters of the gaze area classifier according to the deviation between the gaze area category prediction information and the annotation information of the gaze area category corresponding to the gaze feature sample, so as to train the gaze area classifier.
For example, the predetermined three-dimensional space may be the space of a vehicle of a certain model. First, the fixed position of the camera used to collect face images can be determined; for example, the camera is fixed at the center console to collect face images of the driver in the driving area, and the face images required in the subsequent classifier training stage and detection stage are all collected by this camera at this fixed position.
At the same time, the gaze areas are divided for different parts of the vehicle, mainly according to the regions that the driver's eyes need to pay attention to during driving; multiple types of defined gaze areas are divided in the vehicle space, and corresponding category information is set for each type of defined gaze area.
In an embodiment of the present disclosure, the multiple types of defined gaze areas obtained by dividing the vehicle space may include at least two of the following: left front windshield area, right front windshield area, instrument panel area, interior rearview mirror area, center console area, left rearview mirror area, right rearview mirror area, sun visor area, shift lever area, area under the steering wheel, front passenger area, and glove box area in front of the front passenger seat.
FIG. 3 is a schematic diagram of multiple types of defined gaze areas according to an exemplary embodiment of the present disclosure. For a vehicle of a preset model, the following multiple types of defined gaze areas can be determined: left front windshield, right front windshield, instrument panel, interior rearview mirror, center console, left rearview mirror, right rearview mirror, sun visor, shift lever, mobile phone. Corresponding category information, such as a numeric category value, can be preset for each type of defined gaze area. The correspondence between the above multiple types of defined gaze areas and the preset category values can be as shown in Table 1:
Table 1

Defined gaze area          Category value
Left front windshield      1
Right front windshield     2
Instrument panel           3
Interior rearview mirror   4
Center console             5
Left rearview mirror       6
Right rearview mirror      7
Sun visor                  8
Shift lever                9
Mobile phone               10
It should be noted that the above category information can also be represented by preset English letters such as A, B, C, ..., J. For illustration, a code representation of Table 1 is sketched below.
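The dictionary below simply transcribes the Table 1 correspondence; nothing in the disclosure prescribes this particular representation:

```python
# Category values from Table 1; keys are the defined gaze areas.
GAZE_AREA_CATEGORIES = {
    "left front windshield": 1,
    "right front windshield": 2,
    "instrument panel": 3,
    "interior rearview mirror": 4,
    "center console": 5,
    "left rearview mirror": 6,
    "right rearview mirror": 7,
    "sun visor": 8,
    "shift lever": 9,
    "mobile phone": 10,
}

# Reverse lookup: classifier output value -> human-readable area name.
CATEGORY_NAMES = {v: k for k, v in GAZE_AREA_CATEGORIES.items()}
```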
Afterwards, face image samples are collected to obtain a training sample set. The training sample set may include a plurality of gaze feature samples, where each gaze feature sample includes gaze starting point information, gaze direction information, and annotation information of the gaze area category corresponding to the sample; the annotated gaze area category belongs to one of the multiple types of defined gaze areas divided for the predetermined three-dimensional space. How the gaze starting point information and gaze direction information of a person are determined based on a face image will be described in detail later.
Then, based on the above training sample set, the classifier for the predetermined three-dimensional space is trained by iteratively performing the following steps: inputting the gaze starting point information and gaze direction information of one gaze feature sample in the training sample set into the gaze area classifier to be trained, to obtain prediction information of the gaze area category corresponding to the gaze feature sample; and adjusting the parameters of the gaze area classifier according to the deviation between the prediction information of the gaze area category and the annotation information of the gaze area category for the gaze feature sample, so as to train the gaze area classifier.
In an exemplary embodiment, the above step 102 may include: obtaining a loss function value according to the difference between the predicted gaze area category value and the annotated gaze area category value of the same gaze feature sample; when the loss function value satisfies a preset training termination condition, terminating the training and determining the classifier of the current training stage as the trained classifier; otherwise, if the loss function value does not satisfy the preset training termination condition, adjusting the parameters of the gaze area classifier based on the loss function value.
In the embodiments of the present disclosure, the loss function is a mathematical expression used during training to measure the degree to which the classifier model misclassifies the training samples. The loss function value can be computed over the entire training sample set: the larger the loss function value, the higher the misclassification probability of the classifier at the current training stage; conversely, the smaller the loss function value, the lower the misclassification probability.
The preset training termination condition is the condition for terminating the training of the gaze area classifier. In one embodiment, the preset training termination condition may be that the loss function value of the preset loss function is less than a preset threshold. Ideally, the training termination condition would be that the loss function value equals 0, meaning that all gaze area categories predicted by the current classifier are correct. In practice, considering the training efficiency and cost of the gaze area classifier, the preset threshold may be a preset empirical value.
As in the above example, if the current loss function value is greater than or equal to the preset threshold, the prediction accuracy of the classifier at the current training stage has not yet reached expectations; therefore, the loss function value can be used to adjust the relevant parameters of the gaze area classifier, and then steps 101 and 102 are iteratively executed with the updated classifier until the preset training termination condition is satisfied, yielding the trained gaze area classifier for the predetermined three-dimensional space. A minimal sketch of this loop follows.
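The sketch below implements steps 101 and 102 with a plain softmax classifier trained by gradient descent, terminating when the cross-entropy loss falls below a preset threshold. The classifier family, learning rate, and threshold value are illustrative assumptions rather than choices made by the disclosure:

```python
import numpy as np

def train_gaze_classifier(features, labels, num_classes,
                          lr=0.1, loss_threshold=0.05, max_iters=10000):
    """features: (N, 6) gaze origin + direction; labels: (N,) category indices."""
    n, d = features.shape
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(d, num_classes))
    b = np.zeros(num_classes)
    one_hot = np.eye(num_classes)[labels]

    for _ in range(max_iters):
        # Step 101: predict gaze area category probabilities for the samples.
        logits = features @ W + b
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        probs = np.exp(logits)
        probs /= probs.sum(axis=1, keepdims=True)

        # Cross-entropy loss measures the deviation between prediction and label.
        loss = -np.mean(np.log(probs[np.arange(n), labels] + 1e-12))
        if loss < loss_threshold:                    # preset termination condition
            break

        # Step 102: adjust parameters based on the deviation (gradient step).
        grad = (probs - one_hot) / n
        W -= lr * features.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b
```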
In the embodiments of the present disclosure, the computer system may train the above gaze area classifier using algorithms such as support vector machines, naive Bayes, decision trees, random forests, or K-nearest neighbors (KNN); training with an off-the-shelf implementation is sketched below.
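As an example of one of the classifier families named above, the snippet below fits a scikit-learn support vector machine on 6-dimensional gaze features. scikit-learn and the .npy sample files are assumptions made for the example, not tools named by the disclosure:

```python
import numpy as np
import joblib
from sklearn.svm import SVC  # pip install scikit-learn

# Hypothetical training set: each row is (origin_xyz, direction_xyz);
# each label is a category value from Table 1.
X = np.load("gaze_features.npy")  # shape (N, 6), assumed file
y = np.load("gaze_labels.npy")    # shape (N,), assumed file

clf = SVC(kernel="rbf", C=1.0)
clf.fit(X, y)

# Persist the classifier so the detection stage can load it (see earlier sketch).
joblib.dump(clf, "gaze_area_classifier_M.joblib")
```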
In the embodiments of the present application, when the predetermined three-dimensional space changes, it is only necessary to re-determine the training sample set and train the corresponding gaze area classifier. Since training the classifier does not require a large amount of data and is relatively fast, the time cost and technical difficulty of migrating the gaze area detection method between different three-dimensional spaces (such as the spaces of different car models) can be significantly reduced.
Manner two: when gaze area detection is required, the above gaze area classifier for the predetermined three-dimensional space is obtained directly from a preset storage resource.
In an embodiment of the present disclosure, the computer system may store the gaze area classifier trained for each predetermined three-dimensional space in a designated storage resource, such as a cloud server, in association with the space identifier of that predetermined three-dimensional space, forming a preset gaze area classifier set. In the above intelligent driving application scenario, the preset gaze area classifier set may include the correspondence between multiple vehicle models and gaze area classifiers, as shown in Table 2:
车辆型号Vehicle model 分类器Classifier
M01M01 第一分类器First classifier
M02M02 第二分类器Second classifier
M03M03 第三分类器Third classifier
If the computer system of a new car of a known model (for example, model M01) is not provided with a gaze area classifier program, then before performing gaze area detection, the vehicle can automatically download the corresponding target gaze area classifier program (for example, the computer program corresponding to the above first classifier) from the cloud server according to its own model (for example, M01), so as to quickly implement gaze area detection. A sketch of this lookup follows.
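A minimal sketch of manner two: look up (and, if missing, fetch and cache) the classifier by the space identifier. The URL scheme and file layout are invented for the example:

```python
from pathlib import Path
from urllib.request import urlretrieve

import joblib

# Hypothetical cloud endpoint serving the preset gaze area classifier set.
CLASSIFIER_URL = "https://example.com/gaze-classifiers/{model}.joblib"

def load_classifier_for(model: str, cache_dir: str = "classifiers"):
    """Return the gaze area classifier associated with a vehicle model."""
    path = Path(cache_dir) / f"{model}.joblib"
    if not path.exists():  # download once, then reuse the cached copy
        path.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(CLASSIFIER_URL.format(model=model), path)
    return joblib.load(path)

clf_m01 = load_classifier_for("M01")
```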
In the embodiments of the present disclosure, the line-of-sight detection result obtained in step 12 includes at least the line-of-sight starting point information and line-of-sight direction information of the person in the face image, and may further include the head posture information of the person in the face image.
According to the embodiments of the present disclosure, as shown in FIG. 4, the line-of-sight starting point information of the person in the face image can be determined by performing steps 1211 to 1212.
In step 1211, the position of the eyes in the face image is detected.
In the embodiments of the present disclosure, the above eye position is the position of the human eye in the face image in the actual camera coordinate system. The actual camera coordinate system is a spatial rectangular coordinate system determined by the computer system based on the camera that captures the face image in the predetermined three-dimensional space, which may be denoted camera C0.
The Z axis of the actual camera coordinate system is the optical axis of the camera, and the optical center of the camera lens is the origin of the actual camera coordinate system. The horizontal axis (X axis) and vertical axis (Y axis) of the actual camera coordinate system are parallel to the lens plane of the camera.
In the embodiments of the present disclosure, the computer system may detect the eye position in the face image in either of the following ways. In the first way, based on at least two frames of face images simultaneously collected by at least two cameras for the same target object, such as the above driver, the eye position in the face image is obtained using a camera calibration method, where the at least two cameras include the camera that collects the face image to be measured. In the second way, the head posture information of the person in the face image is detected, and the eye position in the face image is detected based on the head posture information.
In an embodiment of the present disclosure, the computer system can determine the driver's head posture information from a face image captured by one camera, using a head posture estimation method from the related art, such as a flexible model method or a geometric method, and obtain, based on the head posture information, the 3D position of the target object's eyes in the preset actual camera coordinate system, that is, the camera coordinate system determined based on the camera C0.
With the second eye position determination way above, the 3D position of the human eye can be determined from the face image collected by a single camera, that is, a monocular camera, which saves hardware configuration cost for gaze area detection.
In step 1212, the line-of-sight starting point information of the person in the face image is determined according to the eye position.
In the present disclosure, the eye position detected from the face image in step 1211 may include the position of a single eye of the target object in the face image, such as the driver, or may include the positions of both eyes (that is, the positions of the driver's left eye and right eye).
Correspondingly, in the embodiments of the present disclosure, the line-of-sight starting point information of the person in the face image can be determined in the following way one or way two.
Way one: the line-of-sight starting point information of the person in the face image is determined according to the position of a single eye. In one embodiment, if the eye position determined in step 1211 includes the positions of both eyes, the line-of-sight starting point information can be determined according to the position of either eye. In another embodiment, if the eye position determined in step 1211 includes the position of a single eye, the line-of-sight starting point information is determined according to the position of that single eye.
Way two: if the eye position determined in step 1211 includes the positions of both eyes, the middle position between the two eyes is determined as the line-of-sight starting point information, where the middle position may be the midpoint of the line connecting the 3D coordinates of the two eyes, or another position on that line.
In the embodiments of the present disclosure, determining the line-of-sight starting point information in way two, compared with way one, helps eliminate inaccuracy of the starting point information caused by single-eye detection errors, thereby improving the accuracy of the line-of-sight detection result. A sketch of way two is given below.
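A minimal sketch of way two, assuming the 3D positions of both eyes in the actual camera coordinate system are already available from step 1211:

```python
import numpy as np

def gaze_origin_from_eyes(left_eye_3d: np.ndarray, right_eye_3d: np.ndarray,
                          alpha: float = 0.5) -> np.ndarray:
    """Line-of-sight starting point on the segment between the two eyes.

    alpha = 0.5 gives the midpoint; other values pick another position
    on the line connecting the 3D coordinates of the two eyes.
    """
    return (1.0 - alpha) * left_eye_3d + alpha * right_eye_3d

# Example with illustrative 3D eye positions (meters, camera coordinates).
origin = gaze_origin_from_eyes(np.array([-0.03, 0.0, 0.55]),
                               np.array([0.03, 0.0, 0.55]))
```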
According to the embodiments of the present disclosure, as shown in FIG. 5, the line-of-sight direction information of the person in the face image can be detected by performing steps 1221 to 1222.
In step 1221, the head posture information of the person in the face image is detected.
As mentioned above, the computer system can determine the driver's head posture information from a face image captured by one camera, using a head posture estimation method from the related art, such as a flexible model method or a geometric method.
The flexible model method matches a flexible model, such as an Active Shape Model (ASM), an Active Appearance Model (AAM), or an Elastic Graph Matching (EGM) model, to the facial structure of the head image in the image plane, and obtains the final head posture estimate through feature comparison or the parameters of the model.
The geometric method estimates the head posture using the shape of the head and precise morphological information of local facial feature points, such as the relative positions of the eyes, nose, and mouth.
According to the embodiments of the present disclosure, the head posture of a person in an image can be estimated based on a single frame collected by a monocular camera.
According to the embodiments of the present disclosure, as shown in FIG. 6, the head posture information of the person in the face image can be detected by performing steps 1201 to 1202 (step 1221).
In step 1201, multiple face key points in the face image are detected.
In an embodiment of the present disclosure, face key point detection can be performed with edge detection algorithms such as the Roberts algorithm or the Sobel algorithm, or with related models such as an active contour model (for example, the Snake model).
In another embodiment of the present disclosure, face key point detection can be performed by a neural network for face key point detection. In addition, a third-party application (for example, the Dlib toolkit) can also be used for face key point detection.
With the above methods, a preset number (for example, 160) of face key point positions can be detected, which may include the position coordinates of face key points such as the left eye corner, right eye corner, nose tip, left mouth corner, right mouth corner, and jaw. It is understandable that different face key point detection methods may yield different numbers of key point position coordinates. For example, 68 face key point positions can be detected with the Dlib toolkit, as sketched below.
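A minimal sketch of detecting the 68 Dlib face key points mentioned above; it assumes the standard pre-trained model file shape_predictor_68_face_landmarks.dat has been downloaded from the Dlib project:

```python
import dlib  # pip install dlib
import numpy as np

detector = dlib.get_frontal_face_detector()
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

def detect_landmarks(gray_image: np.ndarray) -> np.ndarray:
    """Return an array of shape (68, 2) of face key points for the first face."""
    faces = detector(gray_image, 1)  # upsample once to find smaller faces
    shape = predictor(gray_image, faces[0])
    return np.array([[p.x, p.y] for p in shape.parts()])
```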
In step 1202, the head posture information of the person in the face image is determined based on the detected face key points and a preset average face model; one common way to do this is sketched below.
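As a sketch of step 1202 under common assumptions: pair a few detected 2D key points with the corresponding 3D points of an average face model and solve a PnP problem. OpenCV's solvePnP is an assumed tool here, and the 3D model coordinates below are rough illustrative values, not the disclosure's average face model:

```python
import cv2  # pip install opencv-python
import numpy as np

# Rough 3D positions (millimeters) of six landmarks on a generic average face
# model: nose tip, chin, left/right eye corners, left/right mouth corners.
MODEL_POINTS = np.array([
    [0.0,     0.0,    0.0],
    [0.0,   -63.6,  -12.8],
    [-43.3,  32.7,  -26.0],
    [43.3,   32.7,  -26.0],
    [-28.9, -28.9,  -24.1],
    [28.9,  -28.9,  -24.1],
], dtype=np.float64)

def estimate_head_pose(image_points: np.ndarray, camera_matrix: np.ndarray):
    """image_points: (6, 2) pixel coordinates of the same six key points.

    Returns (rvec, tvec): the head's rotation and translation relative to
    the actual camera coordinate system.
    """
    dist_coeffs = np.zeros((4, 1))  # assume negligible lens distortion
    ok, rvec, tvec = cv2.solvePnP(MODEL_POINTS,
                                  image_points.astype(np.float64),
                                  camera_matrix, dist_coeffs)
    if not ok:
        raise RuntimeError("head pose estimation failed")
    return rvec, tvec
```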
Referring back to FIG. 5, in step 1222, the line-of-sight direction information of the person in the face image is detected based on the head posture information.
In the embodiments of the present disclosure, a trained neural network can be used to detect the line-of-sight direction information of the person in the face image based on the head posture information.
Referring to FIG. 7, step 1222 may include steps 12221 to 12223.
在步骤12221,根据所述头部姿态信息对所述人脸图像进行规范化处理,获得转正人脸图像。In step 12221, normalize the face image according to the head posture information to obtain a normalized face image.
在实际操作中,对于摄像头C0在不同时刻采集的人脸图像,人脸区域图像在整个图像中的位置是随机变化的,图像中人的头部姿态也是随机变化的。若在训练上述神经网络时将摄像头直接采集的人脸图像作为样本图像,势必因人头部姿态以及人脸区域图像位置的随机性而增加神经网络的训练难度及训练时长。In actual operation, for the face images collected by the camera C0 at different times, the position of the face area image in the entire image changes randomly, and the posture of the person's head in the image also changes randomly. If the face image directly collected by the camera is used as the sample image when training the above neural network, the training difficulty and training time of the neural network will be increased due to the randomness of the head posture and the image position of the face area.
根据本公开的实施例,在训练上述用于检测视线方向的神经网络时,为了降低训练难度,首先对训练样本集中的各个样本图像数据做了规范化处理,使得规范化处理后的样本图像数据相当于虚拟摄像机正对着人头部拍摄的图像数据,然后利用规范化处理后的样本图像数据训练该神经网络。According to the embodiments of the present disclosure, when training the above neural network for detecting the direction of sight, in order to reduce the difficulty of training, firstly, each sample image data in the training sample set is normalized, so that the normalized sample image data is equivalent to The virtual camera is facing the image data taken by the human head, and then the normalized sample image data is used to train the neural network.
相应的，在该神经网络的应用阶段，为确保视线方向信息检测的准确性，也需要首先对人脸图像进行规范化处理，获得对应的虚拟相机坐标系下的转正人脸图像，以输入上述神经网络来检测视线方向信息。Correspondingly, in the application stage of the neural network, in order to ensure the accuracy of line-of-sight direction detection, the face image also needs to be normalized first to obtain the corrected face image in the corresponding virtual camera coordinate system, which is then input into the aforementioned neural network to detect the line-of-sight direction information.
参见图8A,上述步骤12221可以包括步骤12-1~12-3。Referring to FIG. 8A, the above step 12221 may include steps 12-1 to 12-3.
在步骤12-1，根据所述头部姿态信息确定所述人脸图像中人的头部坐标系。例如，所述头部坐标系的X轴平行于左右眼坐标的连线；所述头部坐标系的Y轴在人脸平面内垂直于所述X轴；所述头部坐标系的Z轴垂直于所述人脸平面；人眼视线的起点为所述头部坐标系的原点。In step 12-1, the head coordinate system of the person in the face image is determined according to the head posture information. For example, the X axis of the head coordinate system is parallel to the line connecting the coordinates of the left and right eyes; the Y axis of the head coordinate system is perpendicular to the X axis within the face plane; the Z axis of the head coordinate system is perpendicular to the face plane; and the starting point of the person's line of sight is the origin of the head coordinate system.
本公开实施例中,计算机系统根据上述人脸图像检测出目标对象的头部姿态信息,相当于计算机系统预测出了目标对象的三维头部模型。该三维头部模型可以表示在摄像头C0采集上述人脸图像时目标对象的头部相对于摄像头C0的姿态信息。在此基础上,计算机系统可以基于头部姿态信息确定目标对象的头部坐标系。In the embodiment of the present disclosure, the computer system detects the head posture information of the target object based on the aforementioned face image, which is equivalent to the computer system predicting the three-dimensional head model of the target object. The three-dimensional head model may represent the posture information of the head of the target object relative to the camera C0 when the camera C0 collects the aforementioned face image. On this basis, the computer system can determine the head coordinate system of the target object based on the head posture information.
该头部坐标系可以表示为空间直角坐标系。上述头部坐标系的X轴可以与上述三维头部模型中两眼的3D位置坐标的连线平行。可以将两眼的3D位置坐标连线的中点即上述人眼视线的起点确定为上述头部坐标系的原点。所述头部坐标系的Y轴在人脸平面内垂直于所述X轴。所述头部坐标系的Z轴垂直于人脸平面。The head coordinate system can be expressed as a spatial rectangular coordinate system. The X axis of the head coordinate system may be parallel to the line connecting the 3D position coordinates of the two eyes in the three-dimensional head model. The midpoint of that line, i.e., the starting point of the person's line of sight, can be determined as the origin of the head coordinate system. The Y axis of the head coordinate system is perpendicular to the X axis within the face plane, and the Z axis of the head coordinate system is perpendicular to the face plane.
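The construction just described can be sketched directly in code. The face-normal input is assumed to come from the head pose estimate (for instance, the third column of the head rotation matrix), and any residual non-orthogonality between the eye line and the normal is ignored for brevity.

```python
# Building the head coordinate system from the two 3D eye positions.
import numpy as np

def head_coordinate_system(left_eye_3d, right_eye_3d, face_normal):
    """Return (origin, axes) where the rows of axes are the X, Y, Z unit vectors."""
    origin = (left_eye_3d + right_eye_3d) / 2.0         # gaze starting point
    x_axis = right_eye_3d - left_eye_3d                 # along the eye line
    x_axis /= np.linalg.norm(x_axis)
    z_axis = face_normal / np.linalg.norm(face_normal)  # out of the face plane
    y_axis = np.cross(z_axis, x_axis)                   # in face plane, perpendicular to X
    y_axis /= np.linalg.norm(y_axis)
    return origin, np.stack([x_axis, y_axis, z_axis])
```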
在步骤12-2，基于所述头部坐标系对所述人脸图像对应的实际相机坐标系进行旋转及平移，获得虚拟相机坐标系。例如，所述虚拟相机坐标系的Z轴指向所述头部坐标系的原点，所述虚拟相机坐标系的X轴与所述头部坐标系的X轴处于同一平面内，所述虚拟相机坐标系的原点与所述头部坐标系的原点之间在所述虚拟相机坐标系的Z轴方向上间隔预设距离。In step 12-2, the actual camera coordinate system corresponding to the face image is rotated and translated based on the head coordinate system to obtain a virtual camera coordinate system. For example, the Z axis of the virtual camera coordinate system points to the origin of the head coordinate system, the X axis of the virtual camera coordinate system and the X axis of the head coordinate system are in the same plane, and the origin of the virtual camera coordinate system and the origin of the head coordinate system are separated by a preset distance in the Z-axis direction of the virtual camera coordinate system.
本公开实施例中，在计算机系统确定了目标对象的头部坐标系后，可以参照上述头部坐标系对上述摄像头进行旋转、平移操作以确定一个虚拟摄像头，并基于上述虚拟摄像头在头部坐标系下的位置，建立上述虚拟摄像头对应的虚拟相机坐标系。该虚拟相机坐标系的建立方法与上述预设实际相机坐标系的建立方法类似，即虚拟相机坐标系的Z轴为上述虚拟摄像头的光轴，上述虚拟相机坐标系的X轴、Y轴平行于该虚拟摄像头的镜头平面；虚拟摄像头镜头的光心为该虚拟相机坐标系的原点。In the embodiments of the present disclosure, after the computer system determines the head coordinate system of the target object, the camera can be rotated and translated with reference to the head coordinate system to determine a virtual camera, and the virtual camera coordinate system corresponding to the virtual camera can be established based on the position of the virtual camera in the head coordinate system. The method for establishing the virtual camera coordinate system is similar to the method for establishing the aforementioned preset actual camera coordinate system: the Z axis of the virtual camera coordinate system is the optical axis of the virtual camera, the X and Y axes of the virtual camera coordinate system are parallel to the lens plane of the virtual camera, and the optical center of the virtual camera lens is the origin of the virtual camera coordinate system.
上述虚拟相机坐标系与头部坐标系的位置关系满足以下三个条件:The positional relationship between the virtual camera coordinate system and the head coordinate system meets the following three conditions:
条件一、所述虚拟相机坐标系的Z轴指向所述头部坐标系的原点;Condition 1: The Z axis of the virtual camera coordinate system points to the origin of the head coordinate system;
条件二、所述虚拟相机坐标系的X轴与所述头部坐标系的X轴处于同一平面内，其中，虚拟相机坐标系的X轴与所述头部坐标系的X轴的相对位置关系包括但不限于平行关系；Condition 2: The X axis of the virtual camera coordinate system and the X axis of the head coordinate system are in the same plane, where the relative positional relationship between the X axis of the virtual camera coordinate system and the X axis of the head coordinate system includes, but is not limited to, a parallel relationship;
条件三、所述虚拟相机坐标系的原点与所述头部坐标系的原点在所述虚拟相机坐标系的Z轴方向上间隔预设距离。Condition 3: The origin of the virtual camera coordinate system and the origin of the head coordinate system are separated by a predetermined distance in the Z-axis direction of the virtual camera coordinate system.
上述过程相当于通过对上述摄像头C0进行以下操作而确定一个虚拟摄像头：旋转所述摄像头C0，使其Z轴指向人眼图像中人的三维视线的起点，同时使摄像头C0的X轴与上述头部坐标系的X轴处于同一平面内；将旋转后的摄像头C0沿其Z轴平移，使其镜头的光心到上述头部坐标系的原点的距离为预设长度。The above process is equivalent to determining a virtual camera by performing the following operations on the camera C0: rotating the camera C0 so that its Z axis points to the starting point of the person's three-dimensional line of sight in the human eye image, while keeping the X axis of the camera C0 in the same plane as the X axis of the aforementioned head coordinate system; and translating the rotated camera C0 along its Z axis so that the distance from the optical center of its lens to the origin of the head coordinate system equals a preset length.
至此，计算机系统可以根据实际相机坐标系与头部坐标系之间的位置关系、虚拟相机坐标系与上述头部坐标系之间的位置关系，确定实际相机坐标系与上述虚拟相机坐标系之间的位置变换关系。At this point, the computer system can determine the position transformation relationship between the actual camera coordinate system and the virtual camera coordinate system, based on the positional relationship between the actual camera coordinate system and the head coordinate system and the positional relationship between the virtual camera coordinate system and the head coordinate system.
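One common way to realize the three conditions above is sketched below, under the assumption that all directions are expressed in the actual camera frame; the disclosure does not prescribe this particular construction.

```python
# Deriving the actual-to-virtual rotation from the three conditions above.
import numpy as np

def virtual_camera_rotation(head_origin, head_x_axis):
    """head_origin: gaze starting point in the actual camera frame.
    Returns a 3x3 rotation whose rows are the virtual camera X, Y, Z axes."""
    z = head_origin / np.linalg.norm(head_origin)  # condition 1: Z aims at the head origin
    y = np.cross(z, head_x_axis)                   # perpendicular to both Z and head X
    y /= np.linalg.norm(y)
    x = np.cross(y, z)                             # condition 2: X coplanar with head X
    return np.stack([x, y, z])

# Condition 3 is met by placing the virtual camera origin a preset distance d
# from the head origin along the virtual Z axis, i.e., a translation of (0, 0, d).
```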
应当理解的是,本公开中,虚拟相机坐标系与人脸图像中人的头部姿态相关,因此,不同的人脸图像可能对应不同的虚拟相机坐标系。It should be understood that in the present disclosure, the virtual camera coordinate system is related to the head posture of the person in the face image. Therefore, different face images may correspond to different virtual camera coordinate systems.
在步骤12-3,根据所述实际相机坐标系与所述虚拟相机坐标系之间的位置变换关系,对所述人脸图像进行规范化处理,获得所述转正人脸图像。In step 12-3, according to the position transformation relationship between the actual camera coordinate system and the virtual camera coordinate system, normalization processing is performed on the face image to obtain the corrected face image.
本公开实施例中，计算机系统可以利用上述实际相机坐标系与虚拟相机坐标系之间的位置变换关系，对上述人脸图像进行旋转、仿射、缩放变换等处理，获得上述虚拟相机坐标系下的转正人脸图像。In the embodiments of the present disclosure, the computer system can use the position transformation relationship between the actual camera coordinate system and the virtual camera coordinate system to perform rotation, affine, and scaling transformations on the face image, so as to obtain the corrected face image in the virtual camera coordinate system.
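In practice, the normalization of step 12-3 can be applied as a single perspective warp. The sketch below assumes known intrinsic matrices K_real for the actual camera and K_virtual for the virtual camera; both names and the output size are illustrative.

```python
# Warping the captured face image into the virtual camera view.
import cv2
import numpy as np

def normalize_face_image(image, R, K_real, K_virtual, out_size=(224, 224)):
    """R: actual-to-virtual rotation from step 12-2.
    Returns the corrected (frontalized) face image."""
    # Homography mapping actual-camera pixels to virtual-camera pixels.
    warp = K_virtual @ R @ np.linalg.inv(K_real)
    return cv2.warpPerspective(image, warp, out_size)
```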
图8B示出了根据一示例性实施例的对所获取的人脸图像进行规范化处理的示意图，其中，图像P0为实际车载摄像头C0针对驾驶员采集的人脸图像，图像P1表示经过上述规范化处理后获得的虚拟相机坐标系下的转正人脸图像，即相当于正对着驾驶员头部的一台虚拟摄像头C1采集的驾驶员人脸图像。FIG. 8B shows a schematic diagram of normalizing an acquired face image according to an exemplary embodiment, where the image P0 is the face image of the driver collected by the actual vehicle-mounted camera C0, and the image P1 represents the corrected face image in the virtual camera coordinate system obtained after the above normalization, which is equivalent to a driver face image collected by a virtual camera C1 directly facing the driver's head.
返回参见图7,在步骤12222,基于所述转正人脸图像进行视线方向检测,获得第一检测视线方向。例如,上述第一检测视线方向为所述虚拟相机坐标系下的三维视线方向信息,可以是三维方向向量。Referring back to FIG. 7, in step 12222, the line of sight direction detection is performed based on the corrected face image to obtain the first detected line of sight direction. For example, the first detected line of sight direction is the three-dimensional line of sight direction information in the virtual camera coordinate system, and may be a three-dimensional direction vector.
本公开实施例中,可以将经过上述规范化处理的转正人脸图像输入已训练好的用于检测视线方向的神经网络,以检测出上述转正人脸图像中人的三维视线方向信息。上述用于检测视线方向的神经网络可以包括深度神经网络(deep neural network,DNN)如卷积神经网络(convolutional neural network,CNN)等。In the embodiment of the present disclosure, the normalized face image that has undergone the above-mentioned normalization processing may be input to a trained neural network for detecting the line of sight direction to detect the three-dimensional line of sight information of the person in the above-mentioned corrected face image. The aforementioned neural network for detecting the direction of the line of sight may include a deep neural network (DNN) such as a convolutional neural network (convolutional neural network, CNN), etc.
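For concreteness, the following is one illustrative CNN regressor of the kind described; the layer sizes and the unit-vector output convention are assumptions, and the disclosure only requires a trained network that maps the corrected face image to a 3D gaze direction.

```python
# An illustrative CNN mapping a corrected face image to a 3D gaze direction.
import torch
import torch.nn as nn

class GazeDirectionNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.head = nn.Linear(64, 3)   # 3D direction vector

    def forward(self, x):
        v = self.head(self.features(x).flatten(1))
        return v / v.norm(dim=1, keepdim=True)  # normalize to a unit vector
```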
在步骤12223,对所述第一检测视线方向进行坐标逆变换处理,获得所述人脸图像中人的视线方向信息。In step 12223, perform coordinate inverse transformation processing on the first detected line of sight direction to obtain the person's line of sight direction information in the face image.
在后续注视区域检测阶段，需要向注视区域分类器输入实际相机坐标系下的视线特征向量。因此，本公开中，在计算机系统检测出虚拟相机坐标系下的视线方向信息即上述第一检测视线方向之后，还需要对上述第一检测视线方向进行从虚拟相机坐标系到上述实际相机坐标系的坐标逆变换处理，获得上述实际相机坐标系下的视线方向信息。In the subsequent gaze area detection stage, the gaze feature vector in the actual camera coordinate system needs to be input into the gaze area classifier. Therefore, in the present disclosure, after the computer system detects the line-of-sight direction information in the virtual camera coordinate system, i.e., the above first detected line-of-sight direction, it also needs to perform a coordinate inverse transformation on the first detected line-of-sight direction from the virtual camera coordinate system to the actual camera coordinate system, so as to obtain the line-of-sight direction information in the actual camera coordinate system.
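Because a gaze direction is a free vector, the coordinate inverse transformation of step 12223 only needs the inverse of the rotation from step 12-2. A minimal sketch, assuming R is the orthonormal actual-to-virtual rotation:

```python
# Mapping the first detected gaze direction back to the actual camera frame.
import numpy as np

def to_actual_camera(gaze_virtual, R):
    """R rotates actual-frame vectors into the virtual frame; since R is
    orthonormal, its transpose is its inverse."""
    return R.T @ gaze_virtual
```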
返回参见图1,上述步骤12相当于确定人脸图像中人的视线特征向量的过程,该视线特征向量包括人脸图像中人的视线起点信息和视线方向信息。Referring back to FIG. 1, the above step 12 is equivalent to the process of determining the line of sight feature vector of the person in the face image, and the line of sight feature vector includes the start point information and the line of sight direction information of the person in the face image.
在比如智能驾驶的实际应用中，上述对人脸图像进行视线特征向量提取的过程并不会因车型的改变而改变，该阶段所使用的人工神经网络如用于检测人脸关键点的神经网络、用于检测视线方向的神经网络等可以适用于不同车型中，具有很好的迁移性。In practical applications such as intelligent driving, the above process of extracting the gaze feature vector from the face image does not change with the vehicle model; the artificial neural networks used at this stage, such as the neural network for detecting face key points and the neural network for detecting the line-of-sight direction, can be applied to different vehicle models and therefore have good transferability.
如上所述，根据本公开一实施例，在步骤13，可以将在步骤12确定的人脸图像中人的视线起点信息和视线方向信息输入预先针对预定三维空间训练完成的注视区域分类器中，以检测所述人脸图像对应的目标注视区域的类别。As described above, according to an embodiment of the present disclosure, in step 13, the gaze starting point information and gaze direction information of the person in the face image determined in step 12 can be input into the gaze area classifier trained in advance for the predetermined three-dimensional space, so as to detect the category of the target gaze area corresponding to the face image.
在本公开实施例中，上述步骤13可以包括：根据所述目标注视区域的类别确定目标注视区域信息，并输出所述目标注视区域信息。In the embodiment of the present disclosure, the above step 13 may include: determining target gaze area information according to the category of the target gaze area, and outputting the target gaze area information.
比如,分类器可以输出目标注视区域的类别,如图9A所示,或者,直接输出目标注视区域的名称,如图9B所示。For example, the classifier may output the category of the target gaze area, as shown in FIG. 9A, or directly output the name of the target gaze area, as shown in FIG. 9B.
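The inference of step 13 can be sketched as follows; the 6-dimensional feature layout, the scikit-learn-style classifier interface, and the area-name mapping are illustrative assumptions.

```python
# Feeding the gaze feature vector into a trained gaze area classifier.
import numpy as np

AREA_NAMES = {0: "left front windshield", 1: "right front windshield",
              2: "dashboard", 3: "interior rearview mirror"}  # assumed mapping

def detect_gaze_area(classifier, gaze_origin, gaze_direction):
    """gaze_origin, gaze_direction: 3D vectors in the actual camera frame."""
    feature = np.concatenate([gaze_origin, gaze_direction]).reshape(1, -1)
    category = int(classifier.predict(feature)[0])  # e.g. an sklearn SVC/MLP
    return category, AREA_NAMES.get(category, "unknown")
```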
在本公开另一实施例中,上述注视区域检测方法还可以包括:在上述步骤11之前,训练用于检测视线方向的神经网络。该步骤对应三维视线方向估计模型的训练过程。需要说明的是,该步骤与图2所示的实时训练注视区域分类器的过程可以在不同计算机系统中执行。In another embodiment of the present disclosure, the above-mentioned gaze area detection method may further include: before the above-mentioned step 11, training a neural network for detecting the direction of the line of sight. This step corresponds to the training process of the 3D line of sight direction estimation model. It should be noted that this step and the process of real-time training of the gaze area classifier shown in FIG. 2 can be executed in different computer systems.
图10是根据本公开的示例性实施例的训练用于检测三维视线方向的神经网络的方法的流程图。该方法可以包括步骤1001~1005。FIG. 10 is a flowchart of a method of training a neural network for detecting a three-dimensional line of sight direction according to an exemplary embodiment of the present disclosure. The method may include steps 1001-1005.
在步骤1001,确定包含至少一个人脸样本的原始样本集,其中,每个所述人脸样本包括人脸图像样本和视线方向标注信息。In step 1001, an original sample set containing at least one face sample is determined, where each face sample includes a face image sample and line-of-sight direction label information.
本公开实施例中，可以采用监督学习方法训练上述神经网络。相应的，用于训练上述神经网络的样本集中的每一个样本可以包含：用于预测的输入信息即人脸图像样本；和该输入信息相应的真实值即实际相机坐标系下实际测得的视线方向信息。本公开实施例中，也将上述实际测得的视线方向信息称为视线方向标注信息。In the embodiments of the present disclosure, the above neural network may be trained by a supervised learning method. Correspondingly, each sample in the sample set used to train the neural network may include: the input information used for prediction, i.e., a face image sample; and the ground-truth value corresponding to the input information, i.e., the line-of-sight direction information actually measured in the actual camera coordinate system. In the embodiments of the present disclosure, the actually measured line-of-sight direction information is also referred to as line-of-sight direction label information.
在步骤1002,根据人脸关键点和平均人脸模型,确定每一个所述人脸图像样本对应的头部姿态信息。In step 1002, according to the key points of the face and the average face model, head posture information corresponding to each of the face image samples is determined.
在步骤1003，基于所述头部姿态信息和所述实际相机坐标系，确定每一个所述人脸图像样本对应的转正人脸图像样本和所述视线方向标注信息在所述虚拟坐标系下的虚拟视线方向标注信息；In step 1003, based on the head posture information and the actual camera coordinate system, for each face image sample, the corresponding corrected face image sample and the virtual line-of-sight direction label information, i.e., the line-of-sight direction label information expressed in the virtual camera coordinate system, are determined;
上述步骤1002和步骤1003的实施过程分别与上述步骤1202、步骤12-1~12-3类似,此处不再赘述。同时,计算机系统可以根据实际相机坐标系到虚拟相机坐标系的位置变换关系,将上述视线方向标注信息转换为虚拟视线标注信息。The implementation process of the foregoing step 1002 and step 1003 is similar to the foregoing step 1202 and steps 12-1 to 12-3, respectively, and will not be repeated here. At the same time, the computer system can convert the above-mentioned line-of-sight direction labeling information into virtual line-of-sight labeling information according to the position transformation relationship from the actual camera coordinate system to the virtual camera coordinate system.
至此，获得虚拟相机坐标系下的样本集。然后，可以基于该样本集，通过以下步骤进行迭代训练，直到满足用于检测所述三维视线方向的神经网络的训练要求：在步骤1004，将每个所述转正人脸图像样本输入待训练的三维视线方向检测神经网络，获得三维视线方向预测信息；在步骤1005，根据所述三维视线方向预测信息和所述虚拟视线方向标注信息之间的偏差，对所述神经网络进行参数调整，获得用于检测视线方向信息的神经网络。Thus, a sample set in the virtual camera coordinate system is obtained. Then, based on this sample set, iterative training can be carried out through the following steps until the training requirements of the neural network for detecting the three-dimensional line-of-sight direction are met: in step 1004, each corrected face image sample is input into the three-dimensional line-of-sight direction detection neural network to be trained to obtain three-dimensional line-of-sight direction prediction information; in step 1005, the parameters of the neural network are adjusted according to the deviation between the three-dimensional line-of-sight direction prediction information and the virtual line-of-sight direction label information, so as to obtain the neural network for detecting line-of-sight direction information.
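Steps 1004 and 1005 amount to a standard supervised loop. The sketch below reuses the illustrative GazeDirectionNet above and a cosine-distance loss, which is a common but assumed choice for direction vectors.

```python
# An illustrative training loop for steps 1004-1005.
import torch

def train_gaze_net(model, loader, epochs=10, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for images, gaze_labels in loader:  # corrected samples + virtual-frame labels
            pred = model(images)                               # step 1004: prediction
            loss = (1 - torch.cosine_similarity(pred, gaze_labels)).mean()
            opt.zero_grad()
            loss.backward()                                    # step 1005:
            opt.step()                                         # adjust parameters
```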
本公开实施例中，采用在虚拟相机坐标系下规范化处理后的转正人脸图像作为训练样本数据，可以降低因头部姿态变化而导致的神经网络训练难度，提高用于检测视线方向的神经网络的训练效率。In the embodiments of the present disclosure, using the corrected face images normalized in the virtual camera coordinate system as training sample data can reduce the difficulty of neural network training caused by head posture changes and improve the training efficiency of the neural network for detecting the line-of-sight direction.
作为一个例子，在识别出驾驶员的注视区域后，可以根据该注视区域执行进一步的操作。例如，可以根据注视区域类别检测结果，确定人脸图像对应的人的注意力监控结果。比如，所述的注视区域类别检测结果可以是预设时间段内的注视区域检测类别。示例性的，该注视区域类别检测结果可以是"在预设时间段内，该驾驶员的注视区域一直是区域2"，那么，如果该区域2是右前挡风玻璃，说明该驾驶员的驾驶较为专心。如果该区域2是副驾驶前方的杂物箱区域，说明该驾驶员很有可能分心了，注意力不集中。As an example, after the driver's gaze area is identified, further operations can be performed according to the gaze area. For example, the attention monitoring result of the person corresponding to the face image can be determined according to the gaze area category detection result. The gaze area category detection result may be the gaze area categories detected within a preset time period. Exemplarily, the gaze area category detection result may be "within the preset time period, the driver's gaze area has always been area 2". Then, if area 2 is the right front windshield, it indicates that the driver is driving rather attentively; if area 2 is the glove box area in front of the front passenger seat, it indicates that the driver is very likely distracted and not concentrating.
在检测出注意力监控结果后,可以输出所述注意力监控结果,例如,可以在车辆内的某个显示区域显示“驾驶很专心”。或者,还可以根据所述注意力监控结果输出分心提示信息,通过显示屏醒目显示或语音提示等方式提示驾驶员“请集中注意力驾驶,确保行车安全”。当然,在具体信息输出时,可以输出注意力监控结果和分心提示信息中的至少一种信息。After the attention monitoring result is detected, the attention monitoring result may be output, for example, "driving is very attentive" may be displayed in a certain display area in the vehicle. Alternatively, it is also possible to output a distraction prompt message according to the attention monitoring result, and prompt the driver to "please concentrate on driving and ensure driving safety" through a prominent display on the display screen or voice prompts. Of course, when specific information is output, at least one of the attention monitoring result and the distraction prompt information may be output.
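A minimal sketch of the monitoring rule described above follows, with assumed area codes and an assumed tolerance threshold.

```python
# Deriving an attention monitoring result from gaze areas in a time window.
ATTENTIVE_AREAS = {0, 1}   # e.g. left/right front windshield (assumed codes)
DISTRACTED_RATIO = 0.5     # assumed tolerance for off-road glances

def monitor_attention(gaze_area_history):
    """gaze_area_history: gaze area categories over the preset time period."""
    if not gaze_area_history:
        return "unknown", None
    off_road = sum(a not in ATTENTIVE_AREAS for a in gaze_area_history)
    if off_road / len(gaze_area_history) > DISTRACTED_RATIO:
        return "distracted", "Please concentrate on driving to ensure safety."
    return "attentive", None
```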
通过根据注视区域类别检测确定人的注意力监控结果或者输出分心提示信息，对于驾驶员注意力监控有着重要的帮助，能够有效检测出驾驶员注意力不集中的情况，及时进行提醒，降低事故发生风险，确保行车安全。Determining the person's attention monitoring result or outputting distraction prompt information according to the gaze area category detection result provides important support for driver attention monitoring: it can effectively detect driver inattention, issue timely reminders, reduce the risk of accidents, and ensure driving safety.
上述示例的描述中,都是以在智能驾驶应用场景下监控驾驶员注意力为例进行说明。除此之外,注视区域的检测还可以有其它许多用途。In the description of the above examples, the monitoring of the driver's attention in the intelligent driving application scenario is taken as an example for description. In addition, the detection of the gaze area can also have many other uses.
例如,可以进行基于注视区域检测的车机交互控制。车辆内可以设置有一些电子设备,如多媒体播放器,可以通过检测车辆内人员的注视区域,根据注视区域的检测结果自动控制该多媒体播放器开启播放功能。For example, vehicle-machine interactive control based on gaze area detection can be performed. Some electronic equipment, such as a multimedia player, can be installed in the vehicle, which can automatically control the multimedia player to start the playback function according to the detection result of the gaze area by detecting the gaze area of the person in the vehicle.
示例性的,通过部署在车辆内的摄像头拍摄得到车内人员(如司机或乘客)的人脸图像,通过预先训练的神经网络检测出注视区域类别检测结果。例如,该检测结果可以是:在一段时间T内,该车内人员的注视区域一直是车辆内的某个多媒体播放器上的“注视开启”选项所在的区域。根据上述检测结果可以确定该车内人员要开启该多媒体播放器,从而可以输出相应的控制指令,控制该多媒体播放器开始进行播放。Exemplarily, the face image of the person (such as the driver or passenger) in the vehicle is captured by a camera deployed in the vehicle, and the detection result of the gaze area category is detected through a pre-trained neural network. For example, the detection result may be: within a period of time T, the gaze area of the person in the vehicle has been the area where the "gaze on" option on a certain multimedia player in the vehicle is located. According to the above detection result, it can be determined that the person in the vehicle wants to turn on the multimedia player, so that corresponding control instructions can be output to control the multimedia player to start playing.
除了与车辆相关的应用之外,还可以包括游戏控制、智能家居设备控制、广告推送等多种应用场景。以智能家居控制为例,可以采集控制人的人脸图像,通过预先训练的神经网络检测出注视区域类别检测结果。例如,该检测结果可以是:在一段时间T内,该控制人的注视区域一直是智能空调上的“注视开启”选项所在的区域。根据上述检测结果可以确定该控制人要启动智能空调,从而可以输出相应的控制指令,控制该空调开启。In addition to vehicle-related applications, it can also include multiple application scenarios such as game control, smart home device control, and advertising push. Taking smart home control as an example, the face image of the control person can be collected, and the gaze area category detection result can be detected through a pre-trained neural network. For example, the detection result may be: within a period of time T, the gaze area of the controller has been the area where the "gaze on" option on the smart air conditioner is located. According to the above detection results, it can be determined that the controller wants to start the smart air conditioner, so that a corresponding control command can be output to control the air conditioner to turn on.
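Both control examples above reduce to a dwell rule: trigger the device instruction when the gaze stays on the "gaze on" area for the whole period T. A sketch with assumed area code, dwell length, and command name:

```python
# A dwell-based rule for gaze-driven device control.
GAZE_ON_AREA = 7     # assumed category of the device's "gaze on" option area
DWELL_FRAMES = 30    # assumed length of the period T, in frames

def maybe_trigger(gaze_area_history, send_command):
    """send_command: callback that issues the corresponding control instruction."""
    recent = gaze_area_history[-DWELL_FRAMES:]
    if len(recent) == DWELL_FRAMES and all(a == GAZE_ON_AREA for a in recent):
        send_command("start_playback")  # e.g. control the player to start playing
```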
为了便于描述,前述的各方法实施例都被描述为一系列的动作组合。本领域技术人员应该知悉,本公开并不受所描述的动作顺序的限制。依据本公开,某些步骤可以采用其他顺序或者同时进行。For ease of description, the foregoing method embodiments are all described as a series of action combinations. Those skilled in the art should know that the present disclosure is not limited by the described sequence of actions. According to the present disclosure, certain steps can be performed in other order or simultaneously.
本公开还可以提供与前述方法实施例相对应的装置及电子设备的实施例。The present disclosure may also provide embodiments of devices and electronic equipment corresponding to the foregoing method embodiments.
图11是根据本公开的示例性实施例的一种注视区域检测装置1100的框图。注视区域检测装置1100可以包括图像获取模块21、视线检测模块22和注视区域检测模块23。FIG. 11 is a block diagram of a gaze area detecting device 1100 according to an exemplary embodiment of the present disclosure. The gaze area detection device 1100 may include an image acquisition module 21, a gaze detection module 22 and a gaze area detection module 23.
图像获取模块21用于获取在预定三维空间内采集到的人脸图像。视线检测模块22用于基于所述人脸图像进行视线检测以得到视线检测结果。在本公开一实施例中,所述视线检测结果可以包括所述人脸图像中人的视线起点信息和视线方向信息。注视区域检测模块23用于利用预先针对所述预定三维空间训练完成的注视区域分类器,根据所述视线检测结果检测所述人脸图像对应的目标注视区域的类别。所述目标注视区域属于预先对所述预定三维空间划分得到的多类定义注视区域之一。The image acquisition module 21 is used to acquire a face image collected in a predetermined three-dimensional space. The sight line detection module 22 is configured to perform sight line detection based on the face image to obtain a sight line detection result. In an embodiment of the present disclosure, the sight line detection result may include the start point information and the sight direction information of the person in the face image. The gaze area detection module 23 is configured to use a gaze area classifier trained in advance for the predetermined three-dimensional space to detect the category of the target gaze area corresponding to the face image according to the line of sight detection result. The target gaze area belongs to one of multiple types of defined gaze areas obtained by dividing the predetermined three-dimensional space in advance.
参见图12，根据本公开的示例性实施例的注视区域检测装置的一种视线检测模块22可以包括：眼睛位置检测子模块221，用于检测所述人脸图像中的眼睛位置；第一起点信息确定子模块222，用于在所述眼睛位置包括双眼的位置的情况下，确定所述双眼的中间位置为所述视线起点信息。Referring to FIG. 12, a line-of-sight detection module 22 of a gaze area detection device according to an exemplary embodiment of the present disclosure may include: an eye position detection sub-module 221 for detecting the eye position in the face image; and a first starting point information determining sub-module 222, configured to determine the middle position between the two eyes as the line-of-sight starting point information in a case where the eye position includes the positions of both eyes.
参见图13，根据本公开的示例性实施例的注视区域检测装置的另一种视线检测模块22可以包括：眼睛位置检测子模块221，用于检测所述人脸图像中的眼睛位置；第二起点信息确定子模块223，用于在所述眼睛位置包括双眼的位置的情况下，确定所述双眼中的任一只眼睛的位置为所述视线起点信息，或者，在所述眼睛位置包括单眼的位置的情况下，确定所述单眼的位置为所述视线起点信息。Referring to FIG. 13, another line-of-sight detection module 22 of the gaze area detection device according to an exemplary embodiment of the present disclosure may include: an eye position detection sub-module 221 for detecting the eye position in the face image; and a second starting point information determining sub-module 223, configured to determine, in a case where the eye position includes the positions of both eyes, the position of either one of the two eyes as the line-of-sight starting point information, or, in a case where the eye position includes the position of a single eye, the position of the single eye as the line-of-sight starting point information.
参见图14，根据本公开的示例性实施例的图12和图13中的眼睛位置检测子模块221可以包括：姿态检测单元2211，用于检测所述人脸图像中人的头部姿态信息；位置确定单元2212，用于依据所述头部姿态信息确定所述人脸图像中的眼睛位置。Referring to FIG. 14, the eye position detection sub-module 221 in FIGS. 12 and 13 according to an exemplary embodiment of the present disclosure may include: a posture detection unit 2211 for detecting the head posture information of the person in the face image; and a position determining unit 2212, configured to determine the eye position in the face image according to the head posture information.
参见图15，根据本公开的示例性实施例的注视区域检测装置的另一种视线检测模块22可以包括：姿态检测子模块22-1，用于检测所述人脸图像中人的头部姿态信息；方向检测子模块22-2，用于基于所述头部姿态信息检测所述人脸图像中人的视线方向信息。Referring to FIG. 15, another line-of-sight detection module 22 of the gaze area detection device according to an exemplary embodiment of the present disclosure may include: a posture detection sub-module 22-1 for detecting the head posture information of the person in the face image; and a direction detection sub-module 22-2, configured to detect the line-of-sight direction information of the person in the face image based on the head posture information.
参见图16，根据本公开的示例性实施例的图15中的姿态检测子模块22-1可以包括：关键点检测单元22-11，用于检测所述人脸图像中的多个人脸关键点；姿态确定单元22-12，用于基于所述人脸关键点和预设平均人脸模型，确定所述人脸图像中人的头部姿态信息。Referring to FIG. 16, the posture detection sub-module 22-1 in FIG. 15 according to an exemplary embodiment of the present disclosure may include: a key point detection unit 22-11 for detecting multiple face key points in the face image; and a posture determination unit 22-12, configured to determine the head posture information of the person in the face image based on the face key points and a preset average face model.
参见图17，根据本公开的示例性实施例的图15中的方向检测子模块22-2可以包括：图像处理单元22-21，用于根据所述头部姿态信息对所述人脸图像进行规范化处理，获得转正人脸图像；第一方向检测单元22-22，用于基于所述转正人脸图像进行视线方向检测，获得第一检测视线方向；方向确定单元22-23，用于对所述第一检测视线方向进行坐标逆变换处理，获得所述人脸图像中人的视线方向信息。Referring to FIG. 17, the direction detection sub-module 22-2 in FIG. 15 according to an exemplary embodiment of the present disclosure may include: an image processing unit 22-21, configured to normalize the face image according to the head posture information to obtain a corrected face image; a first direction detection unit 22-22, configured to perform line-of-sight direction detection based on the corrected face image to obtain a first detected line-of-sight direction; and a direction determining unit 22-23, configured to perform coordinate inverse transformation processing on the first detected line-of-sight direction to obtain the line-of-sight direction information of the person in the face image.
参见图18，根据本公开的示例性实施例的图17中的图像处理单元22-21可以包括：头部坐标确定子单元22-211，用于根据所述头部姿态信息确定所述人脸图像中人的头部坐标系；坐标变换子单元22-212，用于基于所述头部坐标系对所述人脸图像对应的实际相机坐标系进行旋转及平移，获得虚拟相机坐标系；图像处理子单元22-213，用于根据所述实际相机坐标系与所述虚拟相机坐标系之间的位置变换关系，对所述人脸图像进行规范化处理，获得所述转正人脸图像。Referring to FIG. 18, the image processing unit 22-21 in FIG. 17 according to an exemplary embodiment of the present disclosure may include: a head coordinate determination subunit 22-211, configured to determine the head coordinate system of the person in the face image according to the head posture information; a coordinate transformation subunit 22-212, configured to rotate and translate the actual camera coordinate system corresponding to the face image based on the head coordinate system to obtain a virtual camera coordinate system; and an image processing subunit 22-213, configured to normalize the face image according to the position transformation relationship between the actual camera coordinate system and the virtual camera coordinate system to obtain the corrected face image.
在本公开上述任一装置实施例中，所述注视区域分类器可以预先基于针对所述预定三维空间的训练样本集训练完成。所述训练样本集可以包括多个视线特征样本，每个所述视线特征样本包括视线起点信息、视线方向信息、以及该视线特征样本对应的注视区域类别的标注信息，标注的注视区域的类别属于针对所述预定三维空间划分的多类定义注视区域之一。In any of the foregoing device embodiments of the present disclosure, the gaze area classifier may be trained in advance based on a training sample set for the predetermined three-dimensional space. The training sample set may include a plurality of gaze feature samples, and each gaze feature sample includes gaze starting point information, gaze direction information, and label information of the gaze area category corresponding to the gaze feature sample, where the category of the labeled gaze area belongs to one of the multiple types of defined gaze areas divided for the predetermined three-dimensional space.
图19是根据本公开的示例性实施例的另一种注视区域检测装置1900的框图。与图11所示的注视区域检测装置1100相比,注视区域检测装置1900还可以包括分类器训练模块20。FIG. 19 is a block diagram of another gaze area detecting device 1900 according to an exemplary embodiment of the present disclosure. Compared with the gaze area detection device 1100 shown in FIG. 11, the gaze area detection device 1900 may further include a classifier training module 20.
分类器训练模块20可以包括：类别预测子模块201，用于将至少一个所述视线特征样本的所述视线起点信息和所述视线方向信息输入待训练的注视区域分类器，获得该视线特征样本对应的注视区域类别预测信息；参数调整子模块202，用于根据所述注视区域类别预测信息和该视线特征样本对应的注视区域类别的标注信息之间的偏差，对所述注视区域分类器进行参数调整，以训练所述注视区域分类器。The classifier training module 20 may include: a category prediction sub-module 201, configured to input the gaze starting point information and the gaze direction information of at least one gaze feature sample into the gaze area classifier to be trained to obtain the gaze area category prediction information corresponding to the gaze feature sample; and a parameter adjustment sub-module 202, configured to adjust the parameters of the gaze area classifier according to the deviation between the gaze area category prediction information and the label information of the gaze area category corresponding to the gaze feature sample, so as to train the gaze area classifier.
图20是根据本公开的示例性实施例的另一种注视区域检测装置2000的框图。与图11所示的注视区域检测装置1100相比,注视区域检测装置2000还可以包括分类器获取模块203。FIG. 20 is a block diagram of another gaze area detecting device 2000 according to an exemplary embodiment of the present disclosure. Compared with the gaze area detection device 1100 shown in FIG. 11, the gaze area detection device 2000 may further include a classifier acquisition module 203.
分类器获取模块203可以根据所述预定三维空间的空间标识从预设注视区域分类器集合中获取所述空间标识对应的注视区域分类器。所述预设注视区域分类器集合可以包括:不同三维空间的空间标识分别对应的注视区域分类器。The classifier obtaining module 203 may obtain the gaze area classifier corresponding to the space identifier from the preset gaze area classifier set according to the space identifier of the predetermined three-dimensional space. The preset gaze area classifier set may include: gaze area classifiers respectively corresponding to the spatial identifiers of different three-dimensional spaces.
在本公开上述任一装置实施例中，所述预定三维空间可以包括车辆空间。相应的，所述人脸图像可以基于针对所述车辆空间中的驾驶区域采集到的图像确定。所述对所述预定三维空间划分得到的多类定义注视区域可以包括下列中至少两类：左前挡风玻璃区域、右前挡风玻璃区域、仪表盘区域、车内后视镜区域、中控台区域、左后视镜区域、右后视镜区域、遮光板区域、换挡杆区域、方向盘下方区域、副驾驶区域、副驾驶前方的杂物箱区域。In any of the foregoing device embodiments of the present disclosure, the predetermined three-dimensional space may include a vehicle space. Correspondingly, the face image may be determined based on an image collected for the driving area in the vehicle space. The multiple types of defined gaze areas obtained by dividing the predetermined three-dimensional space may include at least two of the following: a left front windshield area, a right front windshield area, an instrument panel area, an interior rearview mirror area, a center console area, a left rearview mirror area, a right rearview mirror area, a visor area, a shift lever area, an area under the steering wheel, a front passenger area, and a glove box area in front of the front passenger seat.
图21是根据本公开的示例性实施例的另一种注视区域检测装置2100的框图。与图11所示的注视区域检测装置1100相比，注视区域检测装置2100还可以包括：注意力监控模块24，用于根据注视区域检测模块23得到的注视区域类别检测结果，确定所述人脸图像对应的人的注意力监控结果；监控结果输出模块25，用于输出所述注意力监控结果和/或根据所述注意力监控结果输出分心提示信息。FIG. 21 is a block diagram of another gaze area detecting device 2100 according to an exemplary embodiment of the present disclosure. Compared with the gaze area detection device 1100 shown in FIG. 11, the gaze area detection device 2100 may further include: an attention monitoring module 24, configured to determine the attention monitoring result of the person corresponding to the face image according to the gaze area category detection result obtained by the gaze area detection module 23; and a monitoring result output module 25, configured to output the attention monitoring result and/or output distraction prompt information according to the attention monitoring result.
图22是根据本公开的示例性实施例的另一种注视区域检测装置2200的框图。与图11所示的注视区域检测装置1100相比，注视区域检测装置2200还可以包括：控制指令确定模块26，用于确定与注视区域检测模块23得到的注视区域类别检测结果对应的控制指令；操作控制模块27，用于控制电子设备执行与所述控制指令相应的操作。FIG. 22 is a block diagram of another gaze area detecting device 2200 according to an exemplary embodiment of the present disclosure. Compared with the gaze area detection device 1100 shown in FIG. 11, the gaze area detection device 2200 may further include: a control instruction determination module 26, configured to determine a control instruction corresponding to the gaze area category detection result obtained by the gaze area detection module 23; and an operation control module 27, configured to control an electronic device to perform an operation corresponding to the control instruction.
对于装置实施例而言，由于其基本对应于方法实施例，所以相关之处参见方法实施例的部分说明即可。以上所描述的装置实施例仅仅是示意性的，其中，作为分离部件说明的单元可以是或者也可以不是物理上分开的，作为单元显示的部件可以是或者也可以不是物理单元，即可以位于一个地方，或者也可以分布到多个网络单元上。本领域普通技术人员在不付出创造性劳动的情况下，可以根据实际的需要选择其中的部分或者全部模块来实现本公开的实施例。For the device embodiments, since they basically correspond to the method embodiments, reference may be made to the relevant descriptions of the method embodiments. The device embodiments described above are merely illustrative, where the units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Those of ordinary skill in the art can select some or all of the modules according to actual needs to implement the embodiments of the present disclosure without creative work.
本公开还可以提供对应于上述的注视区域检测方法的电子设备。图23是根据本公开的一示例性实施例的电子设备2300的框图。例如,电子设备2300可以包括处理器、内部总线、网络接口、内存以及非易失性存储器。处理器可以从非易失性存储器中读取对应的计算机程序到内存中运行,从而在逻辑上形成实现上述注视区域检测方法的注视区域检测装置。The present disclosure may also provide an electronic device corresponding to the above-mentioned gaze area detection method. FIG. 23 is a block diagram of an electronic device 2300 according to an exemplary embodiment of the present disclosure. For example, the electronic device 2300 may include a processor, an internal bus, a network interface, a memory, and a non-volatile memory. The processor can read the corresponding computer program from the non-volatile memory to run in the memory, thereby logically forming a gaze area detection device that implements the above gaze area detection method.
本领域技术人员应明白,本公开可提供为方法、装置、系统或计算机程序产品。因此,本公开可采用完全硬件实施例、完全软件实施例或结合软件和硬件方面的实施例的形式。Those skilled in the art should understand that the present disclosure can be provided as a method, device, system, or computer program product. Therefore, the present disclosure may adopt the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware.
本公开还可以提供一种计算机可读存储介质,该存储介质上可以存储有计算机程序,所述计算机程序被处理器执行时,使该处理器实现根据上述任一方法实施例的注视区域检测方法。The present disclosure may also provide a computer-readable storage medium, the storage medium may store a computer program, and when the computer program is executed by a processor, the processor realizes the gaze area detection method according to any of the foregoing method embodiments .
本文中描述的主题及功能操作的实施例可以在以下中实现：数字电子电路、有形体现的计算机软件或固件、包括本文中公开的结构及其结构性等同物的计算机硬件、或者它们中的一个或多个的组合。本文中描述的主题的实施例可以实现为一个或多个计算机程序，即编码在有形非暂时性程序载体上以被数据处理装置执行或控制数据处理装置的操作的计算机程序指令中的一个或多个模块。可替代地或附加地，程序指令可以被编码在生成的传播信号(例如机器生成的电、光或电磁信号)上，该信号被生成以将信息编码并传输到合适的接收机装置以由数据处理装置执行。计算机存储介质可以是机器可读存储设备、机器可读存储基板、随机或串行存取存储器设备、或它们中的一个或多个的组合。The embodiments of the subject matter and functional operations described herein can be implemented in: digital electronic circuits, tangibly embodied computer software or firmware, computer hardware including the structures disclosed herein and their structural equivalents, or a combination of one or more of them. Embodiments of the subject matter described herein may be implemented as one or more computer programs, i.e., one or more modules of computer program instructions encoded on a tangible non-transitory program carrier to be executed by a data processing apparatus or to control the operation of the data processing apparatus. Alternatively or additionally, the program instructions may be encoded on a generated propagated signal (e.g., a machine-generated electrical, optical, or electromagnetic signal) that is generated to encode information and transmit it to a suitable receiver apparatus for execution by a data processing apparatus. The computer storage medium may be a machine-readable storage device, a machine-readable storage substrate, a random or serial access memory device, or a combination of one or more of them.
本文中描述的处理及逻辑流程可以由执行一个或多个计算机程序的一个或多个可编程计算机执行，以通过根据输入数据进行操作并生成输出来执行相应的功能。所述处理及逻辑流程还可以由专用逻辑电路例如FPGA(现场可编程门阵列)或ASIC(专用集成电路)来执行，并且装置也可以实现为专用逻辑电路。The processing and logic flows described herein can be performed by one or more programmable computers executing one or more computer programs to perform corresponding functions by operating on input data and generating output. The processing and logic flows can also be performed by special-purpose logic circuitry such as an FPGA (field programmable gate array) or an ASIC (application-specific integrated circuit), and the apparatus can also be implemented as special-purpose logic circuitry.
适合用于执行计算机程序的计算机包括例如通用或专用微处理器，或任何其他类型的中央处理单元。通常，中央处理单元将从只读存储器和/或随机存取存储器接收指令和数据。计算机的基本组件包括用于实施或执行指令的中央处理单元以及用于存储指令和数据的一个或多个存储器设备。通常，计算机可以包括用于存储数据的一个或多个大容量存储设备，例如磁盘、磁光盘或光盘等，或者计算机可以可操作地与此大容量存储设备耦接以从其接收数据或向其传送数据。此外，计算机可以嵌入在另一设备(例如移动电话机、个人数字助理(PDA)、移动音频或视频播放器、游戏操纵台、全球定位系统(GPS)接收机、或例如通用串行总线(USB)闪存驱动器的便携式存储设备等)中。Computers suitable for executing a computer program include, for example, general-purpose or special-purpose microprocessors, or any other type of central processing unit. Generally, the central processing unit will receive instructions and data from a read-only memory and/or a random access memory. The basic components of a computer include a central processing unit for implementing or executing instructions and one or more memory devices for storing instructions and data. Generally, a computer may include one or more mass storage devices for storing data, such as magnetic disks, magneto-optical disks, or optical disks, or the computer may be operatively coupled to such mass storage devices to receive data from or transfer data to them. In addition, the computer can be embedded in another device, for example, a mobile phone, a personal digital assistant (PDA), a mobile audio or video player, a game console, a global positioning system (GPS) receiver, or a portable storage device such as a universal serial bus (USB) flash drive.
适合于存储计算机程序指令和数据的计算机可读介质可以包括各种形式的非易失性存储器，例如半导体存储器设备(例如，可擦可编程只读存储器(Erasable Programmable Read Only Memory，EPROM)、电可擦可编程只读存储器(Electrically Erasable Programmable Read Only Memory，EEPROM)和闪存)、磁盘(例如内部硬盘或可移动盘)、磁光盘、光盘只读存储器(Compact Disc Read Only Memory，CD-ROM)、数字多功能光盘(Digital Versatile Disc，DVD)等。处理器和存储器可由专用逻辑电路补充或并入专用逻辑电路中。Computer-readable media suitable for storing computer program instructions and data may include various forms of non-volatile memory, such as semiconductor memory devices (e.g., erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), and flash memory), magnetic disks (e.g., internal hard disks or removable disks), magneto-optical disks, compact disc read-only memory (CD-ROM), digital versatile discs (DVD), and the like. The processor and the memory can be supplemented by, or incorporated in, special-purpose logic circuitry.
虽然本文包含许多具体实施细节，但是这些不应被解释为限制本公开的范围或所要求保护的范围，而是主要用于描述本公开的具体实施例的特征。在多个实施例中分别描述的某些特征也可以在单个实施例中被组合实施。另一方面，在单个实施例中描述的各种特征也可以在多个实施例中分开实施或以任何合适的子组合来实施。此外，虽然特征可以如上所述在某些组合中起作用并且甚至最初如此要求保护，但是来自所要求保护的组合中的一个或多个特征在一些情况下可以从该组合中去除，并且所要求保护的组合可以指向子组合或子组合的变型。Although this document contains many specific implementation details, these should not be construed as limiting the scope of the present disclosure or of what is claimed, but are mainly used to describe the features of specific embodiments of the present disclosure. Certain features described separately in multiple embodiments can also be implemented in combination in a single embodiment. Conversely, various features described in a single embodiment can also be implemented in multiple embodiments separately or in any suitable sub-combination. In addition, although features may function in certain combinations as described above and may even be initially claimed as such, one or more features from a claimed combination may in some cases be removed from the combination, and the claimed combination may be directed to a sub-combination or a variation of a sub-combination.
类似地，虽然在附图中以特定顺序描绘了操作，但是这不应被理解为要求这些操作以所示的特定顺序执行或顺次执行或者要求所有例示的操作被执行，以实现期望的结果。在某些情况下，多任务和并行处理可能是有利的。此外，上述实施例中的各种系统模块和组件的分离不应被理解为在所有实施例中均需要这样的分离，并且应当理解，所描述的程序组件和系统通常可以一起集成在单个软件产品中，或者封装成多个软件产品。Similarly, although operations are depicted in a specific order in the drawings, this should not be construed as requiring these operations to be performed in the specific order shown, or sequentially, or requiring all the illustrated operations to be performed, to achieve the desired result. In some cases, multitasking and parallel processing may be advantageous. In addition, the separation of various system modules and components in the foregoing embodiments should not be understood as requiring such separation in all embodiments, and it should be understood that the described program components and systems can usually be integrated together in a single software product or packaged into multiple software products.
以上所述仅为本公开的一些实施例,并不用以限制本公开。凡在本公开的精神和原则之内所做的任何修改、等同替换、改进等,均应包含在本公开的范围之内。The above descriptions are only some embodiments of the present disclosure, and are not used to limit the present disclosure. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present disclosure shall be included in the scope of the present disclosure.

Claims (34)

  1. 一种注视区域检测方法,所述方法包括:A method for detecting a gaze area, the method comprising:
    获取在预定三维空间内采集到的人脸图像;Acquiring a face image collected in a predetermined three-dimensional space;
    基于所述人脸图像进行视线检测以得到视线检测结果;Performing line of sight detection based on the face image to obtain a line of sight detection result;
    利用预先针对所述预定三维空间训练完成的注视区域分类器,根据所述视线检测结果检测所述人脸图像对应的目标注视区域的类别,Using a gaze area classifier trained in advance for the predetermined three-dimensional space to detect the category of the target gaze area corresponding to the face image according to the line of sight detection result,
    其中,所述目标注视区域属于预先对所述预定三维空间划分得到的多类定义注视区域之一。Wherein, the target gaze area belongs to one of multiple types of defined gaze areas obtained by dividing the predetermined three-dimensional space in advance.
  2. 根据权利要求1所述的方法，其中，所述视线检测结果包括：所述人脸图像中人的视线起点信息和视线方向信息。The method according to claim 1, wherein the line-of-sight detection result includes: line-of-sight starting point information and line-of-sight direction information of the person in the face image.
  3. 根据权利要求2所述的方法,其中,所述基于所述人脸图像进行视线检测以得到视线检测结果包括:The method according to claim 2, wherein the performing line of sight detection based on the face image to obtain a line of sight detection result comprises:
    检测所述人脸图像中的眼睛位置;Detecting eye positions in the face image;
    在所述眼睛位置包括双眼的位置的情况下,确定所述双眼的中间位置为所述视线起点信息。In a case where the eye position includes the positions of both eyes, it is determined that the middle position of the eyes is the line of sight starting point information.
  4. 根据权利要求2所述的方法,其中,所述基于所述人脸图像进行视线检测以得到视线检测结果包括:The method according to claim 2, wherein the performing line of sight detection based on the face image to obtain a line of sight detection result comprises:
    检测所述人脸图像中的眼睛位置;Detecting eye positions in the face image;
    在所述眼睛位置包括双眼的位置的情况下，确定所述双眼中的任一只眼睛的位置为所述视线起点信息，或者，在所述眼睛位置包括单眼的位置的情况下，确定所述单眼的位置为所述视线起点信息。In a case where the eye position includes the positions of both eyes, determining the position of either one of the two eyes as the line-of-sight starting point information, or, in a case where the eye position includes the position of a single eye, determining the position of the single eye as the line-of-sight starting point information.
  5. 根据权利要求3或4所述的方法,其中,所述检测所述人脸图像中的眼睛位置包括:The method according to claim 3 or 4, wherein the detecting the position of the eyes in the face image comprises:
    检测所述人脸图像中人的头部姿态信息;Detecting head posture information of the person in the face image;
    依据所述头部姿态信息确定所述人脸图像中的眼睛位置。The eye position in the face image is determined according to the head posture information.
  6. 根据权利要求2所述的方法,其中,所述基于所述人脸图像进行视线检测以得到视线检测结果包括:The method according to claim 2, wherein the performing line of sight detection based on the face image to obtain a line of sight detection result comprises:
    检测所述人脸图像中人的头部姿态信息;Detecting head posture information of the person in the face image;
    基于所述头部姿态信息检测所述人脸图像中人的视线方向信息。Detecting the line of sight direction information of the person in the face image based on the head posture information.
  7. 根据权利要求5或6所述的方法,其中,所述检测所述人脸图像中人的头部姿态信息包括:The method according to claim 5 or 6, wherein the detecting the head posture information of the person in the face image comprises:
    检测所述人脸图像中的多个人脸关键点;Detecting multiple face key points in the face image;
    基于所述人脸关键点和预设平均人脸模型,确定所述人脸图像中人的头部姿态信息。Based on the key points of the face and a preset average face model, the head posture information of the person in the face image is determined.
  8. 根据权利要求6或7所述的方法,其中,所述基于所述头部姿态信息检测所述人脸图像中人的视线方向信息包括:The method according to claim 6 or 7, wherein the detecting the line of sight direction information of the person in the face image based on the head posture information comprises:
    根据所述头部姿态信息对所述人脸图像进行规范化处理,获得转正人脸图像;Performing normalization processing on the face image according to the head posture information to obtain a normalized face image;
    基于所述转正人脸图像进行视线方向检测,获得第一检测视线方向;Performing line-of-sight direction detection based on the corrected face image to obtain the first detected line-of-sight direction;
    对所述第一检测视线方向进行坐标逆变换处理,获得所述人脸图像中人的视线方向信息。Perform coordinate inverse transformation processing on the first detected line of sight direction to obtain the line of sight direction information of the person in the face image.
  9. 根据权利要求8所述的方法,其中,所述根据所述头部姿态信息对所述人脸图像进行规范化处理,获得转正人脸图像,包括:8. The method according to claim 8, wherein said normalizing said face image according to said head posture information to obtain a normalized face image comprises:
    根据所述头部姿态信息确定所述人脸图像中人的头部坐标系;Determining the head coordinate system of the person in the face image according to the head posture information;
    基于所述头部坐标系对所述人脸图像对应的实际相机坐标系进行旋转及平移,获得虚拟相机坐标系;Rotate and translate the actual camera coordinate system corresponding to the face image based on the head coordinate system to obtain a virtual camera coordinate system;
    根据所述实际相机坐标系与所述虚拟相机坐标系之间的位置变换关系，对所述人脸图像进行规范化处理，获得所述转正人脸图像。Performing normalization processing on the face image according to the position transformation relationship between the actual camera coordinate system and the virtual camera coordinate system to obtain the corrected face image.
  10. 根据权利要求1-9中任一所述的方法，其中，所述注视区域分类器预先基于针对所述预定三维空间的训练样本集训练完成，其中，所述训练样本集包括多个视线特征样本，每个所述视线特征样本包括视线起点信息、视线方向信息、以及该视线特征样本对应的注视区域类别的标注信息，标注的注视区域的类别属于针对所述预定三维空间划分的所述多类定义注视区域之一。The method according to any one of claims 1-9, wherein the gaze area classifier is trained in advance based on a training sample set for the predetermined three-dimensional space, wherein the training sample set includes a plurality of gaze feature samples, each gaze feature sample includes gaze starting point information, gaze direction information, and label information of the gaze area category corresponding to the gaze feature sample, and the category of the labeled gaze area belongs to one of the multiple types of defined gaze areas divided for the predetermined three-dimensional space.
  11. 根据权利要求10所述的方法,所述方法还包括:在所述获取在预定三维空间内采集到的人脸图像之前,The method according to claim 10, further comprising: before said acquiring a face image collected in a predetermined three-dimensional space,
    将至少一个所述视线特征样本的所述视线起点信息和所述视线方向信息输入待训练的注视区域分类器,获得该视线特征样本对应的注视区域类别预测信息;Input the gaze starting point information and the gaze direction information of at least one gaze feature sample into a gaze area classifier to be trained to obtain the gaze area category prediction information corresponding to the gaze feature sample;
    根据所述注视区域类别预测信息和该视线特征样本对应的注视区域类别的标注信息之间的偏差,对所述注视区域分类器进行参数调整,以训练所述注视区域分类器。According to the deviation between the gaze area category prediction information and the annotation information of the gaze area category corresponding to the gaze feature sample, the gaze area classifier is adjusted to train the gaze area classifier.
  12. 根据权利要求10所述的方法，所述方法还包括：在所述获取在预定三维空间内采集到的人脸图像之前，根据所述预定三维空间的空间标识从预设注视区域分类器集合中获取所述空间标识对应的注视区域分类器，The method according to claim 10, further comprising: before the acquiring of the face image collected in the predetermined three-dimensional space, acquiring, from a preset gaze area classifier set, the gaze area classifier corresponding to the spatial identifier of the predetermined three-dimensional space according to the spatial identifier,
    其中,所述预设注视区域分类器集合包括:不同三维空间的空间标识分别对应的注视区域分类器。Wherein, the preset gaze area classifier set includes: gaze area classifiers respectively corresponding to spatial identifiers of different three-dimensional spaces.
  13. 根据权利要求1~12中任一所述的方法,其中,所述预定三维空间包括:车辆空间。The method according to any one of claims 1-12, wherein the predetermined three-dimensional space includes: a vehicle space.
  14. 根据权利要求13所述的方法,其中,The method according to claim 13, wherein:
    所述人脸图像基于针对所述车辆空间中的驾驶区域采集到的图像确定;The face image is determined based on the image collected for the driving area in the vehicle space;
    所述多类定义注视区域包括下列中至少两类：左前挡风玻璃区域、右前挡风玻璃区域、仪表盘区域、车内后视镜区域、中控台区域、左后视镜区域、右后视镜区域、遮光板区域、换挡杆区域、方向盘下方区域、副驾驶区域、副驾驶前方的杂物箱区域。The multiple types of defined gaze areas include at least two of the following: a left front windshield area, a right front windshield area, an instrument panel area, an interior rearview mirror area, a center console area, a left rearview mirror area, a right rearview mirror area, a visor area, a shift lever area, an area under the steering wheel, a front passenger area, and a glove box area in front of the front passenger seat.
  15. 根据权利要求1~14中任一所述的方法,所述方法还包括:The method according to any one of claims 1-14, the method further comprising:
    根据注视区域类别检测结果,确定所述人脸图像对应的人的注意力监控结果;Determine the attention monitoring result of the person corresponding to the face image according to the detection result of the gaze area category;
    输出所述注意力监控结果,和/或,根据所述注意力监控结果输出分心提示信息。Output the attention monitoring result, and/or output distraction prompt information according to the attention monitoring result.
  16. 根据权利要求1~15中任一所述的方法,所述方法还包括:The method according to any one of claims 1-15, the method further comprising:
    确定与注视区域类别检测结果对应的控制指令;Determine the control instruction corresponding to the detection result of the gaze area category;
    控制电子设备执行与所述控制指令相应的操作。The control electronic device executes the operation corresponding to the control instruction.
  17. 一种注视区域检测装置,所述装置包括:A gaze area detection device, the device comprising:
    图像获取模块,用于获取在预定三维空间内采集到的人脸图像;An image acquisition module for acquiring a face image collected in a predetermined three-dimensional space;
    视线检测模块,用于基于所述人脸图像进行视线检测以得到视线检测结果;A line of sight detection module, configured to perform line of sight detection based on the face image to obtain a line of sight detection result;
    注视区域检测模块,用于利用预先针对所述预定三维空间训练完成的注视区域分类器,根据所述视线检测结果检测所述人脸图像对应的目标注视区域的类别,The gaze area detection module is configured to use a gaze area classifier trained in advance for the predetermined three-dimensional space to detect the category of the target gaze area corresponding to the face image according to the line of sight detection result,
    其中,所述目标注视区域属于预先对所述预定三维空间划分得到的多类定义注视区域之一。Wherein, the target gaze area belongs to one of multiple types of defined gaze areas obtained by dividing the predetermined three-dimensional space in advance.
  18. 根据权利要求17所述的装置，其中，所述视线检测结果包括：所述人脸图像中人的视线起点信息和视线方向信息。The device according to claim 17, wherein the line-of-sight detection result includes: line-of-sight starting point information and line-of-sight direction information of the person in the face image.
  19. 根据权利要求18所述的装置,其中,所述视线检测模块包括:The device according to claim 18, wherein the line of sight detection module comprises:
    眼睛位置检测子模块,用于检测所述人脸图像中的眼睛位置;An eye position detection sub-module for detecting the eye position in the face image;
    第一起点信息确定子模块,用于在所述眼睛位置包括双眼的位置的情况下,确定所述双眼的中间位置为所述视线起点信息。The first starting point information determining sub-module is configured to determine the middle position of the two eyes as the line of sight starting point information when the eye position includes the positions of the two eyes.
  20. 根据权利要求18所述的装置,其中,所述视线检测模块包括:The device according to claim 18, wherein the line of sight detection module comprises:
    眼睛位置检测子模块,用于检测所述人脸图像中的眼睛位置;An eye position detection sub-module for detecting the eye position in the face image;
    第二起点信息确定子模块，用于在所述眼睛位置包括双眼的位置的情况下，确定所述双眼中的任一只眼睛的位置为所述视线起点信息，或者，在所述眼睛位置包括单眼的位置的情况下，确定所述单眼的位置为所述视线起点信息。The second starting point information determining sub-module is configured to determine, in a case where the eye position includes the positions of both eyes, the position of either one of the two eyes as the line-of-sight starting point information, or, in a case where the eye position includes the position of a single eye, the position of the single eye as the line-of-sight starting point information.
  21. 根据权利要求19或20所述的装置,其中,所述眼睛位置检测子模块包括:The device according to claim 19 or 20, wherein the eye position detection sub-module comprises:
    姿态检测单元,用于检测所述人脸图像中人的头部姿态信息;A posture detection unit for detecting head posture information of the person in the face image;
    位置确定单元,用于依据所述头部姿态信息确定所述人脸图像中的眼睛位置。The position determining unit is configured to determine the position of the eyes in the face image according to the head posture information.
  22. 根据权利要求18所述的装置,其中,所述视线检测模块包括:The device according to claim 18, wherein the line of sight detection module comprises:
    姿态检测子模块,用于检测所述人脸图像中人的头部姿态信息;A posture detection sub-module for detecting head posture information of the person in the face image;
    方向检测子模块,用于基于所述头部姿态信息检测所述人脸图像中人的视线方向信息。The direction detection sub-module is configured to detect the line of sight direction information of the person in the face image based on the head posture information.
  23. 根据权利要求22所述的装置,其中,所述姿态检测子模块包括:The device according to claim 22, wherein the posture detection sub-module comprises:
    关键点检测单元,用于检测所述人脸图像中的多个人脸关键点;A key point detection unit for detecting multiple face key points in the face image;
    姿态确定单元,用于基于所述人脸关键点和预设平均人脸模型,确定所述人脸图像中人的头部姿态信息。The posture determination unit is configured to determine the head posture information of the person in the face image based on the key points of the face and a preset average face model.
24. The device according to claim 22 or 23, wherein the direction detection sub-module comprises:
    an image processing unit configured to perform normalization processing on the face image according to the head posture information to obtain a frontalized face image;
    a first direction detection unit configured to perform line-of-sight direction detection based on the frontalized face image to obtain a first detected line-of-sight direction;
    a direction determining unit configured to perform an inverse coordinate transformation on the first detected line-of-sight direction to obtain the line-of-sight direction information of the person in the face image.
25. The device according to claim 24, wherein the image processing unit comprises:
    a head coordinate determination subunit configured to determine the head coordinate system of the person in the face image according to the head posture information;
    a coordinate transformation subunit configured to rotate and translate the actual camera coordinate system corresponding to the face image based on the head coordinate system to obtain a virtual camera coordinate system;
    an image processing subunit configured to perform normalization processing on the face image according to the positional transformation relationship between the actual camera coordinate system and the virtual camera coordinate system to obtain the frontalized face image.
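Claims 24 and 25 together read on the well-known gaze-normalization idea: warp the face as seen by a virtual camera aimed straight at the head, detect the direction in that frame, then rotate the result back. A minimal sketch under stated assumptions: pinhole intrinsics for both cameras, the distance-scaling component some normalization schemes add is omitted, and all names (`normalize_face`, `C_real`, `C_virt`) are inventions of this annotation.

```python
import cv2
import numpy as np

def normalize_face(img, R_head, face_center, C_real, C_virt, out_size=(224, 224)):
    """Warp the face image into the view of a virtual camera that looks
    straight at the face center (claims 24-25).

    R_head      : 3x3 head rotation in real-camera coordinates
    face_center : 3-vector, face center in real-camera coordinates
    C_real/C_virt : 3x3 intrinsics of the real and virtual cameras
    """
    z = face_center / np.linalg.norm(face_center)   # virtual optical axis
    x_head = R_head[:, 0]                           # head x-axis
    y = np.cross(z, x_head); y /= np.linalg.norm(y)
    x = np.cross(y, z)
    R_virt = np.stack([x, y, z])                    # real -> virtual rotation
    W = C_virt @ R_virt @ np.linalg.inv(C_real)     # image-plane homography
    return cv2.warpPerspective(img, W, out_size), R_virt

def gaze_to_real(gaze_dir_virt, R_virt):
    """Claim 24's inverse coordinate transformation: map the direction
    detected in the frontalized image back to real-camera coordinates."""
    return R_virt.T @ np.asarray(gaze_dir_virt, float)
```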
26. The device according to any one of claims 17-25, wherein the gaze area classifier is trained in advance on a training sample set for the predetermined three-dimensional space, the training sample set comprising a plurality of line-of-sight feature samples, each of which includes line-of-sight starting point information, line-of-sight direction information, and annotation information of the gaze area category corresponding to that sample, the annotated gaze area category being one of the multiple classes of defined gaze areas into which the predetermined three-dimensional space is divided.
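A line-of-sight feature sample of claim 26 carries exactly three fields. One illustrative encoding (the field names and the 3D tuple layout are choices made here):

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class GazeFeatureSample:
    """One line-of-sight feature sample as described in claim 26."""
    origin: Tuple[float, float, float]      # line-of-sight starting point
    direction: Tuple[float, float, float]   # line-of-sight direction (unit vector)
    region_label: int                       # annotated gaze-area class index
```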
27. The device according to claim 26, further comprising a classifier training module, the classifier training module comprising:
    a category prediction sub-module configured to input the line-of-sight starting point information and line-of-sight direction information of at least one line-of-sight feature sample into the gaze area classifier to be trained, to obtain gaze area category prediction information for that sample;
    a parameter adjustment sub-module configured to adjust the parameters of the gaze area classifier according to the deviation between the gaze area category prediction information and the annotated gaze area category of that sample, so as to train the gaze area classifier.
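Claim 27's predict-measure-adjust loop is ordinary supervised training; the claim does not fix a classifier family. As a sketch, a softmax linear classifier trained with cross-entropy, where the (prediction − label) deviation directly drives the parameter update; the feature layout follows the GazeFeatureSample sketch above (origin and direction concatenated):

```python
import numpy as np

def train_gaze_classifier(X, y, n_classes, lr=0.1, epochs=200, seed=0):
    """Minimal softmax classifier mirroring claim 27.

    X : (N, D) rows of concatenated origin + direction features
    y : (N,)   integer gaze-area labels
    """
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.01, size=(X.shape[1], n_classes))
    b = np.zeros(n_classes)
    onehot = np.eye(n_classes)[y]
    for _ in range(epochs):
        logits = X @ W + b
        logits -= logits.max(axis=1, keepdims=True)   # numerical stability
        p = np.exp(logits); p /= p.sum(axis=1, keepdims=True)
        grad = (p - onehot) / len(X)                  # deviation drives the update
        W -= lr * X.T @ grad
        b -= lr * grad.sum(axis=0)
    return W, b
```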
28. The device according to claim 26, further comprising:
    a classifier obtaining module configured to obtain, from a preset set of gaze area classifiers and according to the space identifier of the predetermined three-dimensional space, the gaze area classifier corresponding to that space identifier,
    wherein the preset set of gaze area classifiers comprises gaze area classifiers respectively corresponding to the space identifiers of different three-dimensional spaces.
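Claim 28 amounts to a registry lookup keyed by the space identifier. A trivial sketch; the registry structure and key format (e.g. a vehicle model code) are assumptions:

```python
# One trained classifier per 3D space, keyed by a space identifier.
GAZE_CLASSIFIERS = {}   # space_id -> trained gaze-area classifier

def register_classifier(space_id, classifier):
    GAZE_CLASSIFIERS[space_id] = classifier

def classifier_for_space(space_id):
    try:
        return GAZE_CLASSIFIERS[space_id]
    except KeyError:
        raise KeyError(f"no gaze-area classifier trained for space {space_id!r}")
```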
29. The device according to any one of claims 17-28, wherein the predetermined three-dimensional space comprises a vehicle space.
30. The device according to claim 29, wherein:
    the face image is determined based on an image captured of the driving area in the vehicle space; and
    the multiple classes of defined gaze areas include at least two of the following: left front windshield area, right front windshield area, instrument panel area, interior rearview mirror area, center console area, left rearview mirror area, right rearview mirror area, sun visor area, shift lever area, area below the steering wheel, front passenger area, and glove box area in front of the front passenger.
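One way to encode claim 30's defined areas as classifier labels; the identifier names and index order are arbitrary choices made here, not given by the patent:

```python
from enum import IntEnum

class VehicleGazeRegion(IntEnum):
    """Illustrative label encoding of the vehicle gaze areas of claim 30."""
    LEFT_WINDSHIELD = 0
    RIGHT_WINDSHIELD = 1
    INSTRUMENT_PANEL = 2
    INTERIOR_MIRROR = 3
    CENTER_CONSOLE = 4
    LEFT_MIRROR = 5
    RIGHT_MIRROR = 6
    SUN_VISOR = 7
    SHIFT_LEVER = 8
    BELOW_STEERING_WHEEL = 9
    FRONT_PASSENGER = 10
    GLOVE_BOX = 11
```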
31. The device according to any one of claims 17-30, further comprising:
    an attention monitoring module configured to determine an attention monitoring result for the person corresponding to the face image according to the gaze area category detection result obtained by the gaze area detection module;
    a monitoring result output module configured to output the attention monitoring result, and/or to output distraction prompt information according to the attention monitoring result.
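Claim 31 leaves the monitoring policy open; a common choice is a dwell-time rule over the per-frame gaze-area results. A toy sketch in which the safe-region set and the 2-second threshold are assumptions of this annotation:

```python
import time

class AttentionMonitor:
    """Toy monitor in the spirit of claim 31: report distraction when the
    detected gaze area stays outside driving-relevant areas too long."""

    DEFAULT_SAFE = {"left_windshield", "right_windshield", "instrument_panel",
                    "interior_mirror", "left_mirror", "right_mirror"}

    def __init__(self, safe_regions=None, max_off_road_s=2.0):
        self.safe_regions = set(safe_regions or self.DEFAULT_SAFE)
        self.max_off_road_s = max_off_road_s
        self._off_since = None          # when gaze first left the safe set

    def update(self, region, now=None):
        """Feed one per-frame gaze-area result; returns the monitoring result."""
        now = time.monotonic() if now is None else now
        if region in self.safe_regions:
            self._off_since = None
            return "attentive"
        if self._off_since is None:
            self._off_since = now
        if now - self._off_since > self.max_off_road_s:
            return "distracted"         # caller may emit a distraction prompt
        return "attentive"
```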
32. The device according to any one of claims 17-31, further comprising:
    a control instruction determination module configured to determine a control instruction corresponding to the gaze area category detection result obtained by the gaze area detection module;
    an operation control module configured to control an electronic device to execute the operation corresponding to the control instruction.
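Claim 32's mapping from gaze area to device operation can be as simple as a lookup table. A sketch with invented region keys, command names, and a hypothetical device.execute interface, none of which come from the patent:

```python
# Hypothetical mapping from a detected gaze-area class to a device command.
GAZE_COMMANDS = {
    "center_console": "wake_infotainment",
    "interior_mirror": "show_rear_camera_feed",
}

def on_gaze_region(region, device):
    """Determine the control instruction for a gaze-area result and have the
    electronic device execute the corresponding operation."""
    command = GAZE_COMMANDS.get(region)
    if command is not None:
        device.execute(command)
```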
33. A computer-readable storage medium on which a computer program is stored, wherein the computer program, when executed by a processor, causes the processor to implement the method according to any one of claims 1-16.
34. An electronic device comprising a memory and a processor, wherein a computer program is stored on the memory, and the processor, when executing the computer program, implements the method according to any one of claims 1-16.
PCT/CN2019/127833 2019-03-18 2019-12-24 Method and apparatus for detecting gaze area and electronic device WO2020186867A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
JP2021540793A JP7244655B2 (en) 2019-03-18 2019-12-24 Gaze Area Detection Method, Apparatus, and Electronic Device
KR1020217022187A KR20210104107A (en) 2019-03-18 2019-12-24 Gaze area detection method, apparatus and electronic device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910204793.1 2019-03-18
CN201910204793.1A CN111723828B (en) 2019-03-18 2019-03-18 Gaze area detection method and device and electronic equipment

Publications (1)

Publication Number Publication Date
WO2020186867A1

Family

ID=72519550

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/127833 WO2020186867A1 (en) 2019-03-18 2019-12-24 Method and apparatus for detecting gaze area and electronic device

Country Status (4)

Country Link
JP (1) JP7244655B2 (en)
KR (1) KR20210104107A (en)
CN (1) CN111723828B (en)
WO (1) WO2020186867A1 (en)


Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112308006A (en) * 2020-11-10 2021-02-02 深圳地平线机器人科技有限公司 Sight line area prediction model generation method and device, storage medium and electronic equipment
WO2022141114A1 (en) * 2020-12-29 2022-07-07 深圳市大疆创新科技有限公司 Line-of-sight estimation method and apparatus, vehicle, and computer-readable storage medium
CN112766097B (en) * 2021-01-06 2024-02-13 中国科学院上海微系统与信息技术研究所 Sight line recognition model training method, sight line recognition device and sight line recognition equipment
CN113627267A (en) * 2021-07-15 2021-11-09 中汽创智科技有限公司 Sight line detection method, device, equipment and medium
CN113569785A (en) * 2021-08-04 2021-10-29 上海汽车集团股份有限公司 Driving state sensing method and device
CN113807330B (en) * 2021-11-19 2022-03-08 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Three-dimensional sight estimation method and device for resource-constrained scene
KR20230101580A (en) * 2021-12-29 2023-07-06 삼성전자주식회사 Eye tracking method, apparatus and sensor for determining sensing coverage based on eye model


Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106293031B (en) * 2015-06-04 2019-05-21 北京智谷睿拓技术服务有限公司 Information processing method, information processing unit and user equipment
CN107590482A (en) * 2017-09-29 2018-01-16 百度在线网络技术(北京)有限公司 information generating method and device
CN107679490B (en) * 2017-09-29 2019-06-28 百度在线网络技术(北京)有限公司 Method and apparatus for detection image quality
CN108875524B (en) * 2018-01-02 2021-03-02 北京旷视科技有限公司 Sight estimation method, device, system and storage medium
CN108171218A (en) * 2018-01-29 2018-06-15 深圳市唯特视科技有限公司 A kind of gaze estimation method for watching network attentively based on appearance of depth

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107878326A (en) * 2016-09-30 2018-04-06 法乐第(北京)网络科技有限公司 Vehicle parking assistance device and vehicle drive auxiliary control method
CN106891811A (en) * 2017-03-15 2017-06-27 黄建平 A kind of automobile display system
US20180354509A1 (en) * 2017-06-08 2018-12-13 Daqri, Llc Augmented reality (ar) visualization of advanced driver-assistance system
CN109080641A (en) * 2017-06-08 2018-12-25 丰田自动车株式会社 Drive consciousness estimating device

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112434741A (en) * 2020-11-25 2021-03-02 杭州盛世传奇标识系统有限公司 Method, system, device and storage medium for using interactive introduction identifier
CN112329718A (en) * 2020-11-26 2021-02-05 北京沃东天骏信息技术有限公司 Method and apparatus for generating information
CN112580522A (en) * 2020-12-22 2021-03-30 北京每日优鲜电子商务有限公司 Method, device and equipment for detecting sleeper and storage medium
CN112733740A (en) * 2021-01-14 2021-04-30 深圳数联天下智能科技有限公司 Attention information generation method and device, terminal equipment and storage medium
CN112733740B (en) * 2021-01-14 2024-05-28 深圳数联天下智能科技有限公司 Attention information generation method and device, terminal equipment and storage medium
CN113115086B (en) * 2021-04-16 2023-09-19 浙江闪链科技有限公司 Method for collecting elevator media viewing information based on video line-of-sight identification
CN113115086A (en) * 2021-04-16 2021-07-13 安乐 Method for collecting elevator media viewing information based on video sight line identification
CN113692371A (en) * 2021-06-30 2021-11-23 华为技术有限公司 Target position determining method, determining device and determining system
CN114677476A (en) * 2022-03-30 2022-06-28 北京字跳网络技术有限公司 Face processing method and device, computer equipment and storage medium
CN114967935B (en) * 2022-06-29 2023-04-07 深圳职业技术学院 Interaction method and device based on sight estimation, terminal equipment and storage medium
CN114967935A (en) * 2022-06-29 2022-08-30 深圳职业技术学院 Interaction method and device based on sight estimation, terminal equipment and storage medium
CN116030512A (en) * 2022-08-04 2023-04-28 荣耀终端有限公司 Gaze point detection method and device
CN116030512B (en) * 2022-08-04 2023-10-31 荣耀终端有限公司 Gaze point detection method and device
CN115761871B (en) * 2022-12-01 2023-08-11 北京中科睿医信息科技有限公司 Detection image generation method, device, equipment and medium based on eye movement detection
CN115761871A (en) * 2022-12-01 2023-03-07 北京中科睿医信息科技有限公司 Detection image generation method, device, equipment and medium based on eye movement detection

Also Published As

Publication number Publication date
JP2022517254A (en) 2022-03-07
CN111723828B (en) 2024-06-11
KR20210104107A (en) 2021-08-24
CN111723828A (en) 2020-09-29
JP7244655B2 (en) 2023-03-22

Similar Documents

Publication Publication Date Title
WO2020186867A1 (en) Method and apparatus for detecting gaze area and electronic device
CN112590794B (en) Method and device for determining an estimated value of the ability of a vehicle driver to take over vehicle control
EP3033999B1 (en) Apparatus and method for determining the state of a driver
Seshadri et al. Driver cell phone usage detection on strategic highway research program (SHRP2) face view videos
CN110765807B (en) Driving behavior analysis and processing method, device, equipment and storage medium
US9881221B2 (en) Method and system for estimating gaze direction of vehicle drivers
CN111566612A (en) Visual data acquisition system based on posture and sight line
García et al. Driver monitoring based on low-cost 3-D sensors
WO2020177480A1 (en) Vehicle accident identification method and apparatus, and electronic device
WO2019184573A1 (en) Passenger-related item loss mitigation
JP2019040465A (en) Behavior recognition device, learning device, and method and program
US9606623B2 (en) Gaze detecting apparatus and method
CN110826370B (en) Method and device for identifying identity of person in vehicle, vehicle and storage medium
WO2020231401A1 (en) A neural network for head pose and gaze estimation using photorealistic synthetic data
US20220180109A1 (en) Devices and methods for monitoring drivers of vehicles
CN111027506B (en) Method and device for determining sight direction, electronic equipment and storage medium
Martin et al. Real time driver body pose estimation for novel assistance systems
US11062141B2 (en) Methods and apparatuses for future trajectory forecast
Shirpour et al. A probabilistic model for visual driver gaze approximation from head pose estimation
CN112926364A (en) Head posture recognition method and system, automobile data recorder and intelligent cabin
US20140368644A1 (en) Apparatus and method for tracking driver attentiveness using vector
US20230109171A1 (en) Operator take-over prediction
WO2023220916A1 (en) Part positioning method and apparatus
Peláez C., G. A., García, F., de la Escalera, A., & Armingol, J. M. (2014). Driver Monitoring Based on Low-Cost 3-D Sensors. IEEE Transactions on Intelligent Transportation Systems, 15(4).
Ordonez-Hurtado et al. Enabling the Evaluation of Driver Physiology Via Vehicle Dynamics

Legal Events

Date Code Title Description

121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19920034; Country of ref document: EP; Kind code of ref document: A1)

ENP Entry into the national phase (Ref document number: 2021540793; Country of ref document: JP; Kind code of ref document: A) (Ref document number: 20217022187; Country of ref document: KR; Kind code of ref document: A)

NENP Non-entry into the national phase (Ref country code: DE)

122 EP: PCT application non-entry in European phase (Ref document number: 19920034; Country of ref document: EP; Kind code of ref document: A1)