CN115223231A - Sight direction detection method and device

Sight direction detection method and device

Info

Publication number
CN115223231A
Authority
CN
China
Prior art keywords
image
eye
sight
face
feature vector
Prior art date
Legal status
Pending
Application number
CN202110405686.2A
Other languages
Chinese (zh)
Inventor
郑童方
张普
陈正华
王进
Current Assignee
Rainbow Software Co ltd
Original Assignee
Rainbow Software Co ltd
Application filed by Rainbow Software Co ltd
Priority to CN202110405686.2A
Publication of CN115223231A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/088 Non-supervised learning, e.g. competitive learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Molecular Biology (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a sight line direction detection method and device. The method comprises the following steps: acquiring a target image containing a target face; carrying out face region detection and face key point detection on the target image, and acquiring a face image and an eye region image according to the detection results, wherein the face image comprises a face region image and/or a face shape image, and the eye region image comprises a left-eye image and/or a right-eye image; inputting the face image and the eye region image into a pre-trained neural network to perform sight line detection on the target face and obtain a sight line detection result; and determining the sight line direction of the target face according to the sight line detection result. The invention solves the technical problem in the related art that the sight line prediction result has large deviations due to factors such as head posture, object occlusion and external light sources.

Description

Sight direction detection method and device
Technical Field
The invention relates to the technical field of image processing, in particular to a sight line direction detection method and device.
Background
In the related art, detecting the driver's sight line direction and gaze landing point in real time can provide important input for driving assistance and various product applications. For example, the angular direction of the sight line may be used to determine the current driver's observation area and to detect dangerous driving behavior such as driver distraction, while the gaze landing point may be used to determine the current user's target of attention and thereby adjust the display position of an augmented reality head-up display system.
Different from traditional head-mounted eye tracking, gaze angle estimation in a real vehicle scene has several distinctive characteristics: 1. in real vehicle products the in-vehicle camera is generally installed at a position that does not affect driving safety, such as the left A-pillar, the steering column or the rear-view mirror; relative to the camera, the driver's face therefore has a certain turning angle even when looking straight ahead; 2. the normal observation range of the driver's head spans from the left rear-view mirror to the right rear-view mirror, so the head turning range is very large; 3. the objects observed by the driver are farther away than in common scenes, and the viewpoints are distributed over a large range, from within 2 meters inside the cabin to dozens of meters outside the vehicle; 4. while the vehicle is moving, the environment outside constantly changes, sunlight is projected onto the driver from different angles, and the scene outside the vehicle is reflected on the driver's glasses, all of which make gaze estimation very difficult.
Existing gaze prediction methods treat the left and right eyes indiscriminately: they typically use the full-face image, or an arbitrarily chosen eye image, as input to estimate the gaze, or they crop the eye regions from face key point information and feed the left-eye and right-eye images into a network to directly regress the gaze angle. In practice, because of varying head poses, the quality of the two acquired eye images may differ; in particular, when the head is deflected by a large angle, one eye image can be far worse than the other. Accurate gaze estimation clearly relies on high-quality eye images. These methods, however, ignore problems that arise in practical situations such as face translation, face rotation, object occlusion and external light sources, so the prediction results often show large deviations in real vehicle scenes. No effective solution to these problems has been proposed.
Disclosure of Invention
The embodiments of the invention provide a sight line direction detection method and device, which at least solve the technical problem in the related art that the sight line prediction result has large deviations due to factors such as head posture, object occlusion and external light sources.
According to an aspect of an embodiment of the present invention, there is provided a gaze direction detection method including: acquiring a target image containing a target face; carrying out face region detection and face key point detection on the target image, and acquiring a face image and an eye region image according to a detection result, wherein the face image comprises a face region image and/or a face shape image, and the eye region image comprises a left eye image and/or a right eye image; inputting the face image and the eye area image into a pre-trained neural network to perform sight line detection on the target face to obtain a sight line detection result; and determining the sight direction of the target face according to the sight detection result.
Optionally, the pre-trained neural network includes: the feature extraction layer is used for extracting features of the face image and the eye region image to obtain a first feature vector and a second feature vector; and the classification layer is used for determining the sight line detection result according to the sight line regression of the first characteristic vector and the second characteristic vector.
Optionally, the second feature vector includes a left-eye feature vector and/or a right-eye feature vector; the classification layer includes: a binocular information branch; or a monocular information branch; or a binocular information branch, a monocular information branch, and a confidence branch.
Optionally, the determining, by the classification layer, the sight line detection result according to the sight line regression of the first feature vector and the second feature vector, includes: combining and splicing the first feature vector, the left-eye feature vector and/or the right-eye feature vector by each branch of the classification layer respectively to obtain a first description feature vector; and determining the sight line detection result according to the sight line regression of the corresponding first description feature vector by each branch of the classification layer.
Optionally, determining the sight line detection result according to the sight line regression of the corresponding first description feature vector by each branch of the classification layer includes: the monocular classifier included in the monocular information branch performs sight line regression according to the first left-eye feature vector and the first right-eye feature vector respectively to determine a first detection result; the binocular classifier included in the binocular information branch performs sight line regression according to the first binocular feature vector to determine a second detection result; and the confidence classifier included in the confidence branch performs binary classification on the confidence of the left-eye image and the right-eye image according to the left-eye feature vector and the right-eye feature vector respectively, to determine a third detection result.
Optionally, the first left-eye feature vector is generated by splicing the left-eye feature vector and the first feature vector; the first right-eye feature vector is generated by splicing the right-eye feature vector and the first feature vector; the first binocular feature vector is generated by splicing the left eye feature vector, the right eye feature vector and the first feature vector.
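For illustration only, a minimal sketch (assuming PyTorch-style tensors; the variable names and dimensions are placeholders, not taken from the patent) of how the description feature vectors could be assembled by concatenation:

```python
import torch

# Hypothetical feature vectors produced by the feature extraction layer.
face_feat = torch.randn(1, 128)   # first feature vector (face image)
left_feat = torch.randn(1, 64)    # left-eye feature vector
right_feat = torch.randn(1, 64)   # right-eye feature vector

# First left-eye / right-eye feature vectors: eye feature spliced with face feature.
first_left = torch.cat([left_feat, face_feat], dim=1)    # monocular branch input
first_right = torch.cat([right_feat, face_feat], dim=1)  # monocular branch input

# First binocular feature vector: left, right and face features spliced together.
first_binocular = torch.cat([left_feat, right_feat, face_feat], dim=1)
```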
Optionally, determining the gaze direction of the target face according to the gaze detection result includes: when the confidence branch is not included, using the first detection result or the second detection result as the sight line detection result for determining the sight line direction of the target face; when the confidence branch is included: if the third detection result indicates that both the left-eye image and the right-eye image are credible, determining the second detection result as the sight line direction of the target face; if the third detection result indicates that only the left-eye image or only the right-eye image is credible, determining the first detection result as the sight line direction of the target face; and if the third detection result indicates that neither the left-eye image nor the right-eye image is credible, producing no valid output.
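A minimal sketch of this fusion logic (all names are hypothetical; the branch outputs and per-eye confidence decisions are assumed to be available):

```python
def fuse_gaze_results(monocular_result, binocular_result,
                      left_credible, right_credible):
    """Select the final gaze direction from the branch outputs.

    monocular_result / binocular_result: gaze directions from the monocular and
    binocular branches; left_credible / right_credible: binary confidence
    decisions for the left-eye and right-eye images.
    """
    if left_credible and right_credible:
        # Both eye images are reliable: use the binocular (joint) regression.
        return binocular_result
    if left_credible or right_credible:
        # Only one eye image is reliable: fall back to the monocular branch.
        return monocular_result
    # Neither eye image is reliable: no valid output.
    return None
```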
Optionally, before inputting the face image and the eye region image into a pre-trained neural network to perform line-of-sight detection on the target face, and obtaining a line-of-sight detection result, the method further includes: and carrying out normalization processing on the eye area map.
Optionally, the normalization processing of the eye region map includes: transforming the image acquisition device coordinate system according to a scaling matrix and a rotation matrix to generate a virtual camera coordinate system; and mapping the eye region map into the virtual camera coordinate system to generate the normalized eye region map.
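A minimal sketch of such a normalization step, assuming the mapping is implemented as a perspective warp built from the real and virtual camera intrinsic matrices (an assumption; the patent only specifies a rotation and a scaling of the camera coordinate system):

```python
import cv2
import numpy as np

def normalize_eye_image(eye_img, K_real, K_virtual, R, S, out_size=(60, 36)):
    """Map an eye region image into a virtual camera coordinate system.

    K_real:    intrinsic matrix of the real image acquisition device
    K_virtual: intrinsic matrix of the virtual camera
    R, S:      3x3 rotation and scaling matrices defining the virtual camera
    """
    M = S @ R                                   # virtual-camera transform
    W = K_virtual @ M @ np.linalg.inv(K_real)   # image-to-image warp
    return cv2.warpPerspective(eye_img, W, out_size)
```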
Optionally, the method further includes: and collecting a plurality of sight calibration images of the target face, wherein the sight calibration images comprise calibration sight directions of the target face.
Optionally, the feature extraction layer is further configured to perform feature extraction on one calibrated face image and one calibrated eye region image to obtain a third feature vector and a fourth feature vector; the classification layer includes a pre-calibration branch for sequentially combining, one at a time, the third feature vector and the fourth feature vector of each of the plurality of sight line calibration images to determine the sight line detection result.
Optionally, each calibrated face image and each calibrated eye region image are obtained by sequentially performing face region detection and face key point detection on one of the sight line calibration images, where the calibrated face images include calibrated face region images and/or calibrated face shape images, and the calibrated eye region images include calibrated left-eye images and/or calibrated right-eye images.
Optionally, the pre-calibration branch sequentially combining the third feature vector and the fourth feature vector to determine the detection result includes: concatenating the first feature vector and the second feature vector, concatenating the third feature vector and the fourth feature vector of one sight line calibration image, and taking the difference of the two concatenated feature vector groups to obtain a second description feature vector; determining a candidate detection result by sight line regression on the second description feature vector; sequentially combining the remaining sight line calibration images in the same way, performing feature extraction, feature combination and sight line regression to obtain all candidate detection results; and determining the sight line detection result by a weighted average of all the candidate detection results.
Optionally, determining a candidate detection result by sight line regression on the second description feature vector includes: a fully connected layer regresses the sight line angle difference between the target image and one sight line calibration image from the second description feature vector; and determining the candidate detection result according to the sight line angle difference and the calibrated sight line direction contained in that sight line calibration image.
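The calibration procedure can be sketched as follows (a non-authoritative illustration; the feature differencing, regression head and uniform averaging weights are assumptions):

```python
import torch

def calibrated_gaze(target_desc, calib_samples, angle_diff_head):
    """Estimate gaze by combining the target image with several calibration images.

    target_desc:     concatenation of the first and second feature vectors
    calib_samples:   list of (calib_desc, calib_gaze) pairs, where calib_desc is
                     the concatenation of the third and fourth feature vectors and
                     calib_gaze is the calibrated gaze direction (pitch, yaw)
    angle_diff_head: fully connected layer regressing the gaze angle difference
    """
    candidates = []
    for calib_desc, calib_gaze in calib_samples:
        second_desc = target_desc - calib_desc   # second description feature vector
        diff = angle_diff_head(second_desc)      # gaze angle difference
        candidates.append(calib_gaze + diff)     # candidate detection result
    # Weighted average of all candidates (uniform weights assumed here).
    return torch.stack(candidates).mean(dim=0)
```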
Optionally, the method further includes: based on the gaze direction and a pre-divided region space, a gaze region is determined.
Optionally, the pre-divided region space includes at least one of the following: a left front window, a right front window, a left rear-view mirror area, a right rear-view mirror area, an interior rear-view mirror, a center console, an instrument panel area, a co-driver area, a gear shift lever area and a front window area.
Optionally, the pre-divided region space is described by a spatial plane equation of each region in the camera coordinate system, established as follows: constructing the spatial plane equation of the region in the camera coordinate system from a three-dimensional digital model of the vehicle; or describing the region space by a plurality of irregular polygons and determining the spatial plane equation of the region in the camera coordinate system according to the irregular polygons and the transformation relationship between the region space and the camera coordinate system.
Optionally, determining a gazing area based on the gaze direction and a pre-divided region space includes: determining the boundary of each pre-divided region space and determining the intersection point of the gaze direction with the region space; and when the intersection point lies within the boundary of one pre-divided region space, determining the corresponding region space as the gazing area.
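A minimal sketch of the intersection test, assuming each region is described by a plane (point p0, normal n) plus a polygon boundary check, and the gaze is treated as a ray starting at an eye position o along direction d in the camera coordinate system (the ray origin is an assumption):

```python
import numpy as np

def intersect_region(o, d, p0, n, inside_boundary, eps=1e-6):
    """Return the intersection of the gaze ray o + t*d with the region plane,
    or None if the ray misses the plane or the hit lies outside the boundary."""
    denom = np.dot(n, d)
    if abs(denom) < eps:            # ray parallel to the plane
        return None
    t = np.dot(n, p0 - o) / denom
    if t <= 0:                      # intersection behind the viewer
        return None
    hit = o + t * d
    return hit if inside_boundary(hit) else None
```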
Optionally, the training method of the pre-trained neural network includes: obtaining a sample image, wherein the sample image comprises sight line direction labeling information; performing face region detection and face key point detection on the sample image, and acquiring a face sample image and an eye region sample image according to the detection results, wherein the face sample image comprises a face region sample image and/or a face shape sample image, and the eye region sample image comprises a left-eye sample image and a right-eye sample image; and inputting the face sample image and the eye region sample image into an initial neural network for training to construct the pre-trained neural network, wherein the initial neural network comprises a first binocular information branch, a first monocular information branch and a first confidence branch.
Optionally, the inputting of the face sample image and the eye area sample image into an initial neural network for training, and the constructing of the pre-trained neural network include: inputting the face sample image and the eye area sample image into the initial neural network to obtain predicted sight direction information and predicted left and right eye visible information; adjusting parameters of the first binocular information branch and the first monocular information branch in the initial neural network according to a difference between the predicted gaze direction information and the gaze direction label information; and performing unsupervised learning training on the first confidence branch according to the predicted sight direction information and the predicted left-right eye visible information, and adjusting the parameters of the first confidence branch.
Optionally, adjusting the first binocular information branch and the first monocular information branch parameters of the initial neural network according to a difference between the predicted gaze direction information and the gaze direction label information, includes: the first binocular information branch inputs the predicted gaze direction information and the gaze direction label information into a first loss function module, and adjusts a parameter of the first binocular information branch according to a first loss value output by the first loss function module, wherein the first loss value is a sum of a first left-eye gaze loss value and a first right-eye gaze loss value; the first monocular information branch inputs the predicted gaze direction information and the gaze direction labeling information into a second loss function module, and the second loss function module outputs a second loss value, wherein the second loss value includes a second left-eye gaze loss value and a second right-eye gaze loss value, and the parameter of the first monocular information branch is updated and adjusted according to the second left-eye gaze loss value and the second right-eye gaze loss value.
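For illustration, a sketch of these supervised losses under the assumption that each branch outputs per-eye gaze angles and an L1 loss is used (the loss type is an assumption; the patent only specifies how per-eye losses are combined):

```python
import torch.nn.functional as F

def branch_losses(pred, label):
    """pred/label: dicts holding predicted and annotated per-eye gaze angles."""
    # First loss value: sum of left-eye and right-eye losses (binocular branch).
    loss_binocular = (F.l1_loss(pred["bino_left"], label["left"]) +
                      F.l1_loss(pred["bino_right"], label["right"]))
    # Second loss value: left-eye and right-eye losses kept separate, each used
    # to update the monocular information branch parameters.
    loss_mono_left = F.l1_loss(pred["mono_left"], label["left"])
    loss_mono_right = F.l1_loss(pred["mono_right"], label["right"])
    return loss_binocular, loss_mono_left, loss_mono_right
```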
Optionally, performing unsupervised learning training on the first confidence branch according to the predicted gaze direction information and the predicted left-eye and right-eye visibility information, and adjusting the parameters of the first confidence branch, includes: representing the visibility confidence of a sample by the left-eye and right-eye angle errors contained in the predicted gaze direction information, and mapping the visibility confidence of the sample into binary pseudo-labels; the confidence branch inputs the visibility confidence of the sample and the predicted left-eye and right-eye visibility information into a third loss function module, and adjusts the parameters of the first confidence branch according to a third loss value output by the third loss function module, wherein the third loss value is the sum of a third left-eye sight loss value and a third right-eye sight loss value.
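A sketch of how such pseudo-labels could be derived, assuming the per-eye angular error is thresholded into a binary visibility label and a cross-entropy loss is applied (the threshold value and loss choice are placeholders, not specified by the patent):

```python
import torch
import torch.nn.functional as F

def confidence_loss(left_err_deg, right_err_deg, left_logit, right_logit,
                    err_threshold=8.0):
    """Unsupervised training signal for the confidence branch.

    left_err_deg / right_err_deg: angular errors of the predicted per-eye gaze,
    used as a proxy for how reliable each eye image is.
    left_logit / right_logit: predicted left/right eye visibility logits.
    """
    # Map the visibility confidence to binary pseudo-labels: small error -> credible.
    left_pseudo = (left_err_deg < err_threshold).float()
    right_pseudo = (right_err_deg < err_threshold).float()
    # Third loss value: sum of the left-eye and right-eye confidence losses.
    return (F.binary_cross_entropy_with_logits(left_logit, left_pseudo) +
            F.binary_cross_entropy_with_logits(right_logit, right_pseudo))
```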
Optionally, obtaining a sample image that contains gaze direction labeling information includes: simultaneously capturing images of a user gazing at a viewpoint label with a first camera, a second depth camera and a third depth camera to obtain a first image, a second image and a third image, wherein the first camera and the second depth camera face the same direction, and the first camera and the third depth camera face opposite directions; determining, from the second image and the third image, the position of the user's pupil in the second depth camera coordinate system and the position of the viewpoint label in the third depth camera coordinate system, respectively; and determining the gaze direction labeling information of the user in the first camera coordinate system according to the pre-calibrated positional relationship among the first camera, the second depth camera and the third depth camera, the position of the user's pupil in the second depth camera coordinate system, and the position of the viewpoint label in the third depth camera coordinate system.
Optionally, the pre-calibrated positional relationship among the first camera, the second depth camera and the third depth camera is established as follows: establishing a first positional relationship between the first camera and the second depth camera through a single-sided calibration board; and establishing a second positional relationship between the first camera and the third depth camera through a double-sided calibration board.
Optionally, determining, according to the second image and the third image, a position of a pupil of the user in the second depth camera coordinate system and a position of the viewpoint label in the third depth camera coordinate system, respectively, includes: the third image comprises a third depth map and a third color map, and the second image comprises a second depth map and a second color map; detecting key points of the face of the second color image to obtain the coordinates of the pupil of the user in the coordinate system of the second color image; obtaining the coordinates of the user pupil on the second depth map according to the coordinates of the user pupil in the second color map coordinate system, and determining the position of the user pupil in the second depth camera coordinate system; identifying the viewpoint label in the third color map to obtain the coordinates of the viewpoint label in the third color map coordinate system; and obtaining the coordinates of the viewpoint label on the third depth map according to the coordinates of the viewpoint label in the third color map coordinate system, and determining the position of the viewpoint label in the third depth camera coordinate system.
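A minimal sketch of how the gaze direction label could be derived once the pupil and the viewpoint label have been located, assuming known rigid transforms (R, t) from the second and third depth camera coordinate systems into the first camera coordinate system (names and conventions are assumptions):

```python
import numpy as np

def gaze_label(pupil_cam2, viewpoint_cam3, R2, t2, R3, t3):
    """Compute the annotated gaze direction in the first camera coordinate system.

    pupil_cam2:     3D pupil position in the second depth camera coordinate system
    viewpoint_cam3: 3D viewpoint-label position in the third depth camera coordinate system
    (R2, t2), (R3, t3): pre-calibrated transforms into the first camera coordinate system
    """
    pupil_cam1 = R2 @ pupil_cam2 + t2
    viewpoint_cam1 = R3 @ viewpoint_cam3 + t3
    gaze = viewpoint_cam1 - pupil_cam1
    return gaze / np.linalg.norm(gaze)   # unit gaze direction vector
```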
Optionally, before the first camera, the second depth camera and the third depth camera simultaneously capture images of the user gazing at the viewpoint label, the method further includes: receiving and recognizing voice information, and when the voice information includes a capture command, simultaneously capturing images of the user gazing at the viewpoint label with the first camera, the second depth camera and the third depth camera.
According to another aspect of the embodiments of the present invention, there is also provided a gaze direction detection apparatus including: an image acquisition module, configured to acquire a target image containing a target face; a target detection module, configured to perform face region detection and face key point detection on the target image and acquire a face image and an eye region image according to the detection results, wherein the face image comprises a face region image and/or a face shape image, and the eye region image comprises a left-eye image and/or a right-eye image; a sight line detection module, configured to input the face image and the eye region image into a pre-trained neural network to perform sight line detection on the target face and obtain a sight line detection result; and a sight line determining module, configured to determine the sight line direction of the target face according to the sight line detection result.
Optionally, the pre-trained neural network includes: the feature extraction layer is used for extracting features of the face image and the eye region image to obtain a first feature vector and a second feature vector; and the classification layer is used for determining the sight line detection result according to the sight line regression of the first characteristic vector and the second characteristic vector.
Optionally, the second feature vector includes a left-eye feature vector and/or a right-eye feature vector; the classification layer includes: a binocular information branch; or a monocular information branch; or a binocular information branch, a monocular information branch, and a confidence branch.
Optionally, the classification layer includes: a first splicing unit, configured to separately combine and splice the first feature vector, the left-eye feature vector, and/or the right-eye feature vector in each branch of the classification layer to obtain a first description feature vector; and a first determining unit, configured to determine the sight line detection result according to the sight line regression of the corresponding first description feature vector for each branch of the classification layer.
Optionally, the first determining unit includes: a first detection subunit, configured for the monocular classifier included in the monocular information branch to perform sight line regression according to the first left-eye feature vector and the first right-eye feature vector respectively to determine the first detection result; a second detection subunit, configured for the binocular classifier included in the binocular information branch to perform sight line regression according to the first binocular feature vector to determine the second detection result; and a third detection subunit, configured for the confidence classifier included in the confidence branch to perform binary classification on the confidence of the left-eye image and the right-eye image according to the left-eye feature vector and the right-eye feature vector respectively, to determine the third detection result.
Optionally, the gaze determining module includes: a first sight line determining unit, configured to use the first detection result or the second detection result as the sight line detection result for determining the sight line direction of the target face when the confidence branch is not included; and a second sight line determining unit, configured to, when the confidence branch is included: if the third detection result indicates that both the left-eye image and the right-eye image are credible, determine the second detection result as the sight line direction of the target face; if the third detection result indicates that only the left-eye image or only the right-eye image is credible, determine the first detection result as the sight line direction of the target face; and if the third detection result indicates that neither the left-eye image nor the right-eye image is credible, produce no valid output.
Optionally, the device further includes a normalization processing module, configured to perform normalization processing before the eye area map is input to the eye line detection module.
Optionally, the apparatus further includes a calibration image acquisition module configured to acquire a plurality of sight calibration images of the target face, where the sight calibration images include a calibration sight direction of the target face.
Optionally, the feature extraction layer is further configured to perform feature extraction on one calibrated face image and one calibrated eye region image to obtain a third feature vector and a fourth feature vector; the classification layer includes a pre-calibration branch for sequentially combining, one at a time, the third feature vector and the fourth feature vector of each of the plurality of sight line calibration images to determine the sight line detection result.
Optionally, the pre-calibration branch includes: a first combination unit, configured to concatenate the first feature vector and the second feature vector, concatenate the third feature vector and the fourth feature vector of one sight line calibration image, and take the difference of the two concatenated feature vector groups to obtain a second description feature vector; a second determining unit, configured to determine a candidate detection result by sight line regression on the second description feature vector; a third determining unit, configured to sequentially combine the remaining sight line calibration images in the same way, performing feature extraction, feature combination and sight line regression to obtain all candidate detection results; and a fourth determining unit, configured to determine the sight line detection result by a weighted average of all the candidate detection results.
Optionally, the second determining unit includes: a second determining subunit, configured to perform line-of-sight regression on the full connection layer according to the second description feature vector to obtain a line-of-sight angle difference between the target image and one of the line-of-sight calibration images; a third determining subunit, configured to determine the candidate detection result according to the gaze angle difference and the calibrated gaze direction included in one of the gaze calibration images.
Optionally, the apparatus further includes a gazing area detecting module, configured to determine a gazing area based on the gaze direction and a pre-divided area spatial position.
Optionally, the apparatus further includes a network training module, configured to train the neural network, including: the system comprises a sample acquisition unit, a processing unit and a display unit, wherein the sample acquisition unit is used for acquiring a sample image, and the sample image comprises sight line direction marking information; a sample detection unit, configured to perform face region detection and face key point detection on the sample image, and obtain a face sample image and an eye region sample image according to a detection result, where the face image includes the face region sample image and/or a face shape sample image, and the eye region image includes a left eye sample image and a right eye sample image; and the first training unit is used for inputting the face sample image and the eye area sample image into an initial neural network for training and constructing the pre-trained neural network, wherein the initial neural network comprises a first binocular information branch, a first monocular information branch and a first confidence coefficient branch.
Optionally, the first training unit includes: a sample prediction subunit, configured to input the face sample map and the eye area sample map into the initial neural network, and obtain predicted gaze direction information and predicted left and right eye visible information; a first parameter adjustment subunit configured to adjust parameters of the first binocular information branch and the first monocular information branch of the initial neural network, based on a difference between the predicted gaze direction information and the gaze direction label information; and a second parameter adjusting subunit, configured to perform unsupervised learning training on the first confidence level branch according to the predicted gaze direction information and the predicted left-right eye visible information, so as to adjust a parameter of the first confidence level branch.
Optionally, the sample acquiring unit includes: a capturing subunit, configured to simultaneously capture images of a user gazing at a viewpoint label with a first camera, a second depth camera and a third depth camera to obtain a first image, a second image and a third image, wherein the first camera and the second depth camera face the same direction, and the first camera and the third depth camera face opposite directions; a positioning subunit, configured to determine, from the second image and the third image, the position of the user's pupil in the second depth camera coordinate system and the position of the viewpoint label in the third depth camera coordinate system, respectively; and a label determining subunit, configured to determine the gaze direction labeling information of the user in the first camera coordinate system according to the pre-calibrated positional relationship among the first camera, the second depth camera and the third depth camera, the position of the user's pupil in the second depth camera coordinate system, and the position of the viewpoint label in the third depth camera coordinate system.
Optionally, the label determining subunit includes: establishing a first positional relationship between the first camera and the second depth camera through a single-sided calibration board; and establishing a second positional relationship between the first camera and the third depth camera through a double-sided calibration plate.
Optionally, the sample acquiring unit includes: a voice unit, configured to receive and recognize voice information, wherein when the voice information includes a capture command, the first camera, the second depth camera and the third depth camera simultaneously capture images of the user gazing at the viewpoint label.
According to another aspect of the embodiments of the present invention, there is provided a storage medium, wherein the storage medium includes a stored program, and when the program runs, the apparatus where the storage medium is located is controlled to execute the gaze direction detecting method according to any one of claims 1 to 26.
According to another aspect of an embodiment of the present invention, there is also provided an electronic device, including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform the gaze direction detecting method of any one of claims 1 to 26 via execution of the executable instructions.
In the embodiment of the invention, the following steps are executed: acquiring a target image containing a target face; carrying out face region detection and face key point detection on a target image, and acquiring a face image and an eye region image according to a detection result, wherein the face image comprises a face region image and/or a face shape image, and the eye region image comprises a left eye image and/or a right eye image; inputting the face image and the eye area image into a pre-trained neural network to perform sight line detection on the target face, and determining the sight line direction of the target face; the gaze region is determined based on the gaze direction and the spatial position of the pre-divided region, and the technical problem that the gaze prediction result has large deviation due to factors such as head posture, object shielding and external light sources in the related technology is solved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow chart of an alternative gaze direction detection method in accordance with embodiments of the present invention;
FIG. 2 is a block diagram of an alternative pre-trained neural network in accordance with embodiments of the present invention;
FIG. 3 is a flow chart of another alternative gaze direction detection method according to embodiments of the present invention;
FIG. 4 is a flow chart of an alternative gaze direction detection method in accordance with embodiments of the present invention;
FIG. 5 is a schematic diagram of an alternative neural network architecture in accordance with embodiments of the present invention;
FIG. 6 is a flow chart of another alternative gaze direction detection method, in accordance with embodiments of the present invention;
FIG. 7 is a flow diagram of an alternative normalization process according to an embodiment of the invention;
FIG. 8 is a flow chart of an alternative gaze direction detection method in accordance with embodiments of the present invention;
FIG. 9 is a schematic diagram of an alternative neural network architecture in accordance with an embodiment of the present invention;
FIG. 10 is a flow chart of another alternative gaze direction detection method, in accordance with embodiments of the present invention;
FIG. 11 is an alternative gaze area division diagram in accordance with an embodiment of the present invention;
FIG. 12 is a schematic diagram of an alternative method of determining a gaze region in accordance with an embodiment of the present invention;
FIG. 13 is a flow diagram of an alternative neural network training method in accordance with an embodiment of the present invention;
FIG. 14 is a flow chart of an alternative sample image acquisition method according to an embodiment of the present invention;
FIG. 15 is a schematic diagram of an alternative sample image acquisition scenario in accordance with an embodiment of the present invention;
FIG. 16 is a block diagram of an alternative gaze direction detection apparatus in accordance with an embodiment of the present invention;
FIG. 17 is a block diagram of another alternative gaze direction detection apparatus according to an embodiment of the present invention;
FIG. 18 is a block diagram of an alternative network training module according to an embodiment of the present invention;
fig. 19 is a block diagram of an alternative sample image acquisition unit according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Moreover, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an embodiment of the present invention, there is provided a gaze direction detection method embodiment, it is noted that the steps illustrated in the flow chart of the figures may be performed in a computer system, such as a set of computer executable instructions, and that, although a logical order is illustrated in the flow chart, in some cases, the steps illustrated or described may be performed in an order different than here.
The embodiment of the invention provides a sight line direction detection method that can be applied to various scenarios such as gaze-activated screen wake-up, VR interaction, industrial inspection, financial security, retail, smart classrooms and head-up display systems, and is characterized by high processing precision, high processing speed, real-time performance and applicability to most practical use scenes.
The present invention is illustrated by the following detailed examples.
According to an aspect of the present invention, there is provided a gaze direction detection method. Referring to fig. 1, a flow chart of an alternative gaze direction detection method according to an embodiment of the invention is shown. As shown in fig. 1, the method comprises the following steps:
s100, acquiring a target image containing a target face;
s102, carrying out face region detection and face key point detection on a target image, and acquiring a face image and an eye region image according to a detection result, wherein the face image comprises a face region image and/or a face shape image, and the eye region image comprises a left eye image and/or a right eye image;
s104, inputting the face image and the eye area image into a pre-trained neural network to perform sight line detection on the target face to obtain a sight line detection result;
and S106, determining the sight direction of the target face according to the sight detection result.
In the embodiment of the invention, a target image containing a target face is obtained; carrying out face region detection and face key point detection on a target image, and acquiring a face image and an eye region image according to a detection result, wherein the face image comprises a face region image and/or a face shape image, and the eye region image comprises a left eye image and/or a right eye image; inputting the face image and the eye area image into a pre-trained neural network to perform sight line detection on the target face to obtain a sight line detection result; and determining the sight direction of the target face according to the sight detection result. Through the steps, the technical problem that the sight line prediction result has large deviation due to factors such as head postures, object shielding and external light sources in the related technology is solved.
The following is a detailed description of the above embodiments.
S100, acquiring a target image containing a target face;
in an alternative embodiment, the target image containing the target face may be acquired by the image acquisition device. The image acquisition device can be an independent camera or an electronic device such as a camera, a mobile phone, a vehicle data recorder and an auxiliary driving system integrated with the camera, and the types of the camera include an infrared structure light camera, a Time-of-flight (ToF) camera, an RGB camera, a fisheye camera and the like. The target image may be an image to be acquired in a specific target object gazing area detection scene, and there may be many gazing area detection scenes, for example, the gazing area of the user is detected to control the intelligent device; for another example, the attention of the driver is determined by detecting the gaze area of the driver. When the image acquisition device is installed in a target object gazing area detection scene, the position containing a target object face image can be acquired, such as the front side of a display screen, a transparent A column of a vehicle, VR glasses and the like.
The image data collected by the image acquisition device may include video data and non-video data. For video data, the video is split into frames and face detection is performed on each frame; an image frame in which the target face is detected is taken as a target image. Non-video data may include single-frame images, which are subjected to face detection directly without framing to obtain at least one target image containing the target face. If no target face is detected in the collected image data, the image acquisition device continues to collect image data.
And S102, carrying out face detection and face key point detection on the target image, and cutting and acquiring a face image and an eye area image according to the detection result, wherein the face image comprises a face area image and/or a face shape image, and the eye area image comprises a left eye image and/or a right eye image.
Optionally, the face map provides a priori information on head pose including head yaw angle, head pitch angle and head rotation angle.
Specifically, a face region frame is selected in the target image by means of face detection, and the target image is cropped according to the selected face region frame to obtain the face image. The face image can be a face region image that contains only the face region cropped from the target image, retaining all original information of the target image so as to ensure the accuracy and completeness of subsequent data processing. If interference from illumination and noise is significant, the face image may instead be a face shape image containing only face contour feature points, where the face contour feature points include but are not limited to: face contour points, eye contour points, nose contour points, eyebrow contour points, forehead contour points and lip contour points. The face shape image can assist in estimating angles; when interference such as illumination and noise is present, the face shape image can represent the face angle information better than the face region image. In addition, face key point detection is performed on the target image, and the target image is cropped according to the distribution of the left-eye and right-eye feature points to obtain a left-eye image and a right-eye image, collectively called the eye region image. The face key points may include face key point data corresponding to each face region, where the face regions include a left eyebrow region, a right eyebrow region, a left eye region, a right eye region, a nose region, a mouth region and the like.
And S104, inputting the face image and the eye area image into a pre-trained neural network to perform sight line detection on the target face, and obtaining a sight line detection result. Referring to FIG. 2, a block diagram of an alternative pre-trained neural network in accordance with an embodiment of the present invention is provided. As shown in fig. 2, the pre-trained neural network includes the following components:
s140, a feature extraction layer is used for extracting features of the face image and the eye region image to obtain a first feature vector and a second feature vector;
the size of the eye region image of the input neural network comprises the left eye image and the right eye image is the same, and the size of the face image is not limited, for example, the size of the clipped image is kept and the face image does not need to be adjusted to be consistent with the left eye image and the right eye image. The pre-trained neural network comprises a feature extraction layer which comprises a plurality of feature extraction branches for feature extraction, the structural composition of each feature extraction branch can be the same or different, and the image quality corresponding to the input branches can comprise different pooling layer quantities and different convolution layer quantities.
The face image and the eye region image are fed into the corresponding feature extraction branches of the feature extraction layer: the first feature extraction branch performs feature extraction on the face image to obtain a first feature vector. Because the input eye region image includes a left-eye image and/or a right-eye image, the second feature extraction branch performs feature extraction on the left-eye image to obtain a left-eye feature vector, and the third feature extraction branch performs feature extraction on the right-eye image to obtain a right-eye feature vector, so the second feature vector may include the left-eye feature vector and/or the right-eye feature vector. The face image and the left-eye and right-eye images contained in the eye region image can be input into the neural network simultaneously, with feature extraction performed by the three corresponding branches. In the camera coordinate system, the gaze direction depends not only on the state of the eyes, including the eyeball position and the degree of eye opening, but also on the head pose. For example, even when the eyes are turned sideways relative to the head, they may be looking straight ahead in the camera coordinate system; introducing the face image adds prior information such as head pose and increases feature diversity and representational power, laying the foundation for the neural network to determine a more accurate sight line detection result.
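A minimal PyTorch-style sketch of such a feature extraction layer with three branches (the layer sizes and structures are assumptions for illustration only, not the patent's architecture):

```python
import torch
import torch.nn as nn

def conv_branch(out_dim):
    # A small convolutional branch; depth and width are placeholders.
    return nn.Sequential(
        nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
        nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
        nn.Flatten(), nn.Linear(32, out_dim))

class FeatureExtractionLayer(nn.Module):
    def __init__(self):
        super().__init__()
        self.face_branch = conv_branch(128)   # first feature vector
        self.left_branch = conv_branch(64)    # left-eye feature vector
        self.right_branch = conv_branch(64)   # right-eye feature vector

    def forward(self, face_img, left_img, right_img):
        return (self.face_branch(face_img),
                self.left_branch(left_img),
                self.right_branch(right_img))
```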
And S142, a classification layer is used for determining the sight line detection result according to the sight line regression of the first feature vector and the second feature vector.
In an alternative embodiment, the second feature vector includes a left-eye feature vector and/or a right-eye feature vector; the classification layer includes: a binocular information branch; or a monocular information branch; or a binocular information branch, a monocular information branch, and a confidence branch.
In an optional embodiment, the classification layer is configured to determine a sight line detection result according to a sight line regression of the first feature vector and the second feature vector, and includes: combining and splicing the first characteristic vector, the left eye characteristic vector and/or the right eye characteristic vector by each branch of the classification layer respectively to obtain a first description characteristic vector; and each branch of the classification layer determines a sight detection result according to the sight regression of the corresponding first description feature vector.
In an optional embodiment, each branch of the classification layer determines the sight line detection result by sight line regression on the corresponding first description feature vector, which includes: the monocular classifier included in the monocular information branch performs sight line regression according to the first left-eye feature vector and the first right-eye feature vector respectively to determine a first detection result; the binocular classifier included in the binocular information branch performs sight line regression according to the first binocular feature vector to determine a second detection result; and the confidence classifier included in the confidence branch performs binary classification on the confidence of the left-eye image and the right-eye image according to the left-eye feature vector and the right-eye feature vector respectively, to determine a third detection result.
In an alternative embodiment, the first left-eye feature vector is generated by stitching a left-eye feature vector and the first feature vector; the first right-eye feature vector is generated by splicing the right-eye feature vector and the first feature vector; the first binocular feature vector is generated by splicing the left-eye feature vector, the right-eye feature vector and the first feature vector.
And S106, determining the sight direction of the target face according to the sight detection result.
In an optional embodiment, when no confidence branch is included, the first detection result or the second detection result is used as a sight line detection result for determining the sight line direction of the target face;
when confidence branches are included, the method comprises the following steps:
if the third detection result is that the left eye image and the right eye image are both credible, determining the second detection result as the sight line direction of the target face;
if the third detection result is that the left eye image or the right eye image is credible, determining the first detection result as the sight line direction of the target face;
and if the third detection result is that the left eye image and the right eye image are not credible, outputting the output as no effective output.
For the sake of a clearer description, the first sight line direction, the second sight line direction and the third sight line direction referred to in the following specific embodiments are all sight line directions determined from the sight line detection results described above.
Specifically, when the classification layer of the neural network trained in advance only includes binocular information branches or monocular information branches, the sight line detection result is determined according to the sight line regression of the first feature vector and the second feature vector. Referring to fig. 3, a flow chart of another alternative gaze direction detection method in accordance with embodiments of the present invention is provided. As shown in fig. 3, the gaze direction detection method includes the steps of:
s300, splicing the first feature vector and the second feature vector by the combination of the binocular information branch or the monocular information branch to obtain a first description feature vector;
the feature splicing is to splice a plurality of feature vectors together, splice the head pose information and the eye feature information included in the face image so as to improve the characterization capability of the network on the eye part appearance difference caused by other factors such as the head pose, and the like, and does not limit the combination sequence.
When the input eye region image contains a left-eye image and a right-eye image, the second feature vector includes a left-eye feature vector and a right-eye feature vector. The binocular information branch concatenates the first feature vector, the left-eye feature vector and the right-eye feature vector to obtain the first binocular feature vector; the monocular information branch concatenates the first feature vector with the left-eye feature vector to obtain the first left-eye feature vector, and concatenates the first feature vector with the right-eye feature vector to obtain the first right-eye feature vector.
When the input eye region image contains only a left-eye image or only a right-eye image, the second feature vector includes a left-eye feature vector or a right-eye feature vector. The monocular information branch concatenates the first feature vector with the left-eye feature vector to obtain the first left-eye feature vector, or concatenates the first feature vector with the right-eye feature vector to obtain the first right-eye feature vector.
S302, the binocular information branch or the monocular information branch performs sight line regression according to the corresponding first description feature vector, and determines a sight line detection result.
The classification layer included in the pre-trained neural network is trained on a face image set with gaze direction labeling information. The classification layer performs regression analysis on the pitch angle and yaw angle of the left eye and/or the right eye of the target face according to the first description feature vector, converts the obtained pitch and yaw angles into vectors, and finally outputs the sight line detection result of the target face in the camera coordinate system, where the sight line detection result includes the left-eye sight line direction and/or the right-eye sight line direction.
The monocular classifier included in the monocular information branch performs sight line regression according to the first left-eye feature vector and the first right-eye feature vector respectively, outputs the left-eye sight line direction and the right-eye sight line direction respectively, and determines them as the first detection result. The binocular classifier included in the binocular information branch performs sight line regression according to the first binocular feature vector, outputs the left-eye and right-eye sight line directions simultaneously, and determines them as the second detection result.
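The conversion from the regressed pitch and yaw angles to a gaze vector in the camera coordinate system can be sketched as follows (the axis convention is an assumption; different systems order and sign the components differently):

```python
import numpy as np

def angles_to_vector(pitch, yaw):
    """Convert regressed gaze angles (radians) into a unit 3D gaze direction."""
    x = -np.cos(pitch) * np.sin(yaw)
    y = -np.sin(pitch)
    z = -np.cos(pitch) * np.cos(yaw)
    return np.array([x, y, z])
```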
S304, determining a first sight direction of the target face according to the sight detection result.
And the monocular information branch determines the first sight direction of the target face by taking the first detection result as a sight detection result. And the binocular information branch determines a second sight direction of the target face by taking the second detection result as a sight detection result.
In the prior art, an independent calculation module is needed to estimate the head pose in advance, and when this module estimates the head pose inaccurately, the subsequent sight estimation is affected. In the embodiment of the invention, the pre-trained neural network comprises a feature extraction layer, and the plurality of feature extraction branches contained in the feature extraction layer respectively extract features from the face image and the eye region image to obtain a first feature vector and a second feature vector, where the second feature vector comprises a left eye feature vector and/or a right eye feature vector; the binocular information branch or the monocular information branch splices the first feature vector and the second feature vector to obtain a first description feature vector; the binocular information branch or the monocular information branch performs sight regression according to the corresponding first description feature vector to determine a sight detection result; and the sight direction of the target face is determined according to the sight detection result. The embodiment of the invention adopts an end-to-end neural network, uses the pose information contained in the face image as auxiliary information together with the eye feature information within the same deep learning framework, directly obtains the sight direction of the target face in the camera coordinate system, and improves fault tolerance, thereby improving the robustness of sight estimation. In addition, the embodiment of the invention supports monocular and binocular image input and is compatible with monocular and binocular sight tracking.
Furthermore, if the details of both eyes are physically lost due to reflection and occlusion, the gaze can only be estimated from the head pose, so it is temporarily difficult to improve accuracy in appearance-based schemes. When both eyes are clearly visible, splicing the feature vectors of the left eye, the right eye and the head pose information together can greatly improve sight regression precision. However, when one eye is visible and the details of the other eye are lost, extracting the features of both eyes and splicing both eye feature vectors interferes with the features and degrades the sight regression effect.
In order to solve the problem of low sight regression precision in single-eye reflection and occlusion scenes, in an alternative embodiment the neural network includes the following branches: a binocular information branch, a monocular information branch and a confidence branch. The neural network performs multi-branch joint learning with binocular information, monocular information and confidence information, and the output results of all branches are fused to determine the final sight estimate. Referring to fig. 4, a flow chart of an alternative gaze direction detection method of embodiments of the present invention is provided. The flow may be described in connection with the neural network architecture illustrated in fig. 5. As shown in fig. 4, the gaze direction detection method includes the steps of:
s400, the feature extraction layer performs feature extraction on the face image and the eye area image to obtain a first feature vector and a second feature vector, wherein the second feature vector comprises a left eye feature vector and a right eye feature vector;
S402, the monocular classifier included in the monocular information branch performs sight regression according to the first left eye feature vector and the first right eye feature vector respectively, outputs the left eye sight direction and the right eye sight direction respectively, and determines the first detection result;
Specifically, the monocular information branch respectively splices the first eigenvector and the left-eye eigenvector to obtain a first left-eye eigenvector, and splices the first eigenvector and the right-eye eigenvector to obtain a first right-eye eigenvector.
In addition, the monocular information branch performs line-of-sight regression based on the first left-eye feature vector and the first right-eye feature vector, respectively, and determines a first detection result, so the first detection result includes a left-eye line-of-sight direction and a right-eye line-of-sight direction.
S404, the binocular classifier included in the binocular information branch performs line-of-sight regression according to the first binocular feature vector, and simultaneously outputs the left eye line-of-sight direction and the right eye line-of-sight direction to determine a second detection result;
Specifically, the binocular information branch splices the first feature vector, the left eye feature vector and the right eye feature vector to obtain the first binocular feature vector. The binocular classifier included in the binocular information branch performs sight regression according to the first binocular feature vector and simultaneously outputs the left eye sight direction and the right eye sight direction, which are determined as the second detection result.
And S406, the confidence classifier included in the confidence branch performs binary classification on the credibility of the left eye image and the right eye image respectively according to the left eye feature vector and the right eye feature vector, and determines a third detection result.
Specifically, the confidence classifier included in the confidence branch performs binary classification on the credibility of the input image, so the third detection result classifies the input image as credible or not credible. Credible means that the eye, iris or pupil in the input image retains clear, visible details; otherwise the input image is not credible.
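A minimal sketch of such a binary credibility classifier is given below, assuming PyTorch; the layer sizes and feature dimension are illustrative and not specified by the patent.

```python
import torch
import torch.nn as nn

class ConfidenceBranch(nn.Module):
    """Binary credibility classifier applied to a single eye feature vector;
    the same head can be shared by the left and right eyes."""
    def __init__(self, eye_feat_dim=128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(eye_feat_dim, 64),
            nn.ReLU(inplace=True),
            nn.Linear(64, 1),
        )

    def forward(self, eye_feat):
        # probability that the eye image keeps clear, usable detail
        return torch.sigmoid(self.head(eye_feat))
```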
And S408, determining a second sight direction of the target face according to the first detection result, the second detection result and the third detection result.
The method specifically comprises the following conditions:
in the first case, if the third detection result is that the left eye image and the right eye image are both credible, the second detection result is determined as the second sight line direction of the target face;
in the second situation, if the third detection result is that only the left eye image or only the right eye image is credible, the first detection result is determined as the second sight line direction of the target face;
in the third situation, if the third detection result indicates that neither the left eye image nor the right eye image is credible, no valid result is output.
Therefore, through the above sight direction detection method, the pre-trained neural network in the embodiment of the invention jointly learns the binocular information branch, the monocular information branch and the confidence branch, and finally fuses the output results of all branches to produce a more accurate sight direction estimate. For example, if both eye images are credible, the second detection result output by the binocular information branch is adopted as the detection result of the neural network; if only the left eye is credible, the left eye sight direction in the first detection result output by the monocular information branch is taken as the output of the neural network. By introducing confidence information, the embodiment of the invention relies more on eye image information with high reliability and ignores eye image information with low reliability, which solves the problem of low sight regression precision in single-eye reflection and occlusion scenes and improves the robustness of sight estimation.
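The fusion rule described above can be written compactly as follows; this is only a sketch, and the dictionary layout and the credibility threshold are assumptions for illustration.

```python
def fuse_results(mono_result, bino_result, left_score, right_score, threshold=0.5):
    """Fuse the monocular (first) and binocular (second) detection results
    using the confidence branch's credibility scores (third detection result)."""
    left_credible = left_score >= threshold
    right_credible = right_score >= threshold
    if left_credible and right_credible:
        # both eyes credible: adopt the binocular branch output
        return bino_result
    if left_credible:
        # only the left eye credible: use its monocular direction
        return {"left": mono_result["left"]}
    if right_credible:
        # only the right eye credible: use its monocular direction
        return {"right": mono_result["right"]}
    return None  # neither eye credible: no valid output
```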
Highly robust and highly accurate gaze estimation can be obtained in conventional scenarios with S400-S408. However, in special scenes, for example when the face is at the edge of the image, even if the eyes do not move relative to the face and the rotation matrix between the face and the camera coordinate system is unchanged, the face deviates from the optical axis of the camera and there is an angle between the optical-center-to-face line and the optical axis, so the appearance of the eye image changes even though the sight line does not. When the sight line is regressed through the classification layer, regression analysis is performed only on the pitch angle and yaw angle of the left eye and/or the right eye of the target face, and such significant changes in the appearance of the input image, for example in its aspect ratio, are not considered; these unexpected changes in eye image appearance affect sight regression accuracy. The above embodiments may not solve this problem. In an optional embodiment, before the face map and the eye area map are input into the pre-trained neural network to perform sight detection on the target face and obtain a sight detection result, the method further includes: normalizing the eye area map. Referring to fig. 6, a flow chart of another alternative gaze direction detection method in accordance with embodiments of the present invention is provided. As shown in fig. 6, the gaze direction detection method includes the steps of:
s600, acquiring a target image containing a target face;
s602, carrying out face region detection and face key point detection on a target image, and acquiring a face image and an eye region image according to a detection result, wherein the face image comprises a face region image and/or a face shape image, and the eye region image comprises a left eye image and/or a right eye image;
s604, normalizing the eye region image to obtain a first eye region image, wherein the first eye region image comprises a first left eye region image and/or a first right eye region image;
s606, inputting the face image and the first eye region image into a pre-trained neural network to perform sight line detection on the target face to obtain a sight line detection result;
and S608, inversely transforming the sight line detection result, and determining the sight line direction of the target face.
In addition, the above steps S600, S602 and S606 are the same as the steps S100, S102 and S104 in fig. 1, and specifically refer to the corresponding description of fig. 1, and are not described in detail here. The embodiment described in fig. 6 is different from fig. 1 in that the gaze direction detection method further includes step S604, performing normalization processing on the eye region map to obtain a first eye region map, where the first eye region map includes a first left eye region map and/or a first right eye region map, and step S608, performing inverse transformation on the gaze detection result, and determining the gaze direction of the target human face.
In an alternative embodiment, normalizing the eye region map comprises: transforming the image acquisition device coordinate system according to the scaling matrix and the rotation matrix to generate a virtual camera coordinate system; and mapping the eye area image into the virtual camera coordinate system to generate the normalized eye area image.
Specifically, the eye region map is normalized, and fig. 7 is a flowchart of an optional normalization process according to an embodiment of the present invention, including the following steps:
s640, setting an image acquisition device coordinate system by taking the optical center of the image acquisition device for acquiring the target image as an origin, and setting a human face coordinate system by taking the eye key point in the eye area image as the origin;
S641, rotating the coordinate system of the image acquisition device so that its z-axis points to the origin of the face coordinate system and its x-axis is parallel to the x-axis of the face coordinate system, and obtaining a rotation matrix;
Rotating the coordinate system of the image acquisition device so that its z-axis points to the origin of the face coordinate system eliminates the eye area appearance differences caused by face translation; then rotating the coordinate system of the image acquisition device so that its x-axis is parallel to the x-axis of the face coordinate system eliminates the eye area appearance differences caused by head roll rotation, and the final rotation matrix is obtained. The face coordinate system can be established by taking an eye key point as the origin, for example the pupil center of the left eye or the right eye, or the midpoint of the line connecting the two pupils. Compared with the prior-art scheme of computing the face coordinate system from whole-face key points and performing the rotation adjustment on that basis, using the eye key points to obtain a corrected face coordinate system and performing the subsequent normalization on that basis solves the problem of low accuracy when computing the head pose in occlusion scenes such as wearing a mask, where key points of the nose tip and the mouth are difficult to detect accurately and the head pose computed from whole-face key points often has large errors. In addition, the sight line can be detected using only accurate eye key points, which guarantees sight precision while greatly improving the speed of sight detection and the robustness of sight regression.
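Under the usual data-normalization recipe, the rotation of steps S640-S641 can be constructed as below. This is a sketch under the assumption that the eye key point (the face-coordinate origin) and the face x-axis direction are already known in the original camera coordinate system; variable names are illustrative.

```python
import numpy as np

def normalization_rotation(eye_origin_cam, face_x_axis_cam):
    """Rotation whose z-axis points from the camera optical center to the
    face-coordinate origin (an eye key point) and whose x-axis is made
    parallel to the face x-axis (projected to stay orthogonal to z)."""
    z = eye_origin_cam / np.linalg.norm(eye_origin_cam)    # new z-axis
    x = face_x_axis_cam - np.dot(face_x_axis_cam, z) * z   # drop the z component
    x = x / np.linalg.norm(x)                              # new x-axis
    y = np.cross(z, x)                                     # right-handed y-axis
    # rows of R are the new axes expressed in the original camera coordinates
    return np.stack([x, y, z], axis=0)
```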
In the above embodiment, the z-axis of the coordinate system of the image capturing device is adjusted first, and then the x-axis of the coordinate system of the image capturing device is adjusted. However, the above sequence of steps is only an example and not a strict limitation, and it is only required that the adjusted z-axis and x-axis directions of the coordinate system of the image capturing device satisfy the above conditions at the same time, and those skilled in the art may perform corresponding adjustment and transformation based on the principles of the above embodiments, for example, first perform adjustment on the x-axis of the coordinate system of the image capturing device, and then perform adjustment on the z-axis of the coordinate system of the image capturing device.
S642, moving the image acquisition device coordinate system according to a scaling matrix, wherein the scaling matrix is determined according to the distance between the origin of the face coordinate system and the image acquisition device;
for objects with the same size, when the focal length of the image capturing device is fixed, if the imaging size is kept the same, the object distance, which is the distance between the three-dimensional coordinates of the target in the coordinate system of the image capturing device and the origin of the coordinate system of the image capturing device, needs to be kept the same. For a scene that the human face is at the edge of an image and the appearance of the eye image changes, such as the human face deviates from the optical axis of a camera, the distance between the center of a three-dimensional coordinate point corresponding to the center point of an eye area image in a camera coordinate system and an image acquisition device is changed, and the imaging size is influenced. In order to ensure the consistency of the imaging scale, a coordinate system of the image acquisition device is moved according to a scaling matrix, and the scaling matrix is determined according to the distance between the center of a three-dimensional coordinate point corresponding to the center point of the eye area and the image acquisition device. For example, if the three-dimensional coordinate point center corresponding to the eye area map center point is too far away from the image acquisition device, the image acquisition device coordinate system needs to be translated towards the eye area map coordinate system until the target face imaging size is the same.
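The scaling of step S642 is commonly expressed as a diagonal matrix that rescales the distance between the face origin and the camera to a fixed value; the snippet below is a sketch, and the chosen normalized distance is an assumption.

```python
import numpy as np

def scaling_matrix(face_distance, normalized_distance=0.6):
    """Scale along the (new) z-axis so the face origin ends up at a fixed
    distance from the virtual camera, keeping the imaging scale consistent."""
    return np.diag([1.0, 1.0, normalized_distance / face_distance])
```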
S643, determining a virtual camera coordinate system according to the image acquisition device coordinate system, the rotation matrix and the scaling matrix;
and adjusting the coordinate system of the image acquisition device according to the rotation matrix obtained in the step S641 and the scaling matrix obtained in the step S642 to obtain a virtual camera coordinate system, wherein the virtual camera coordinate system is the image acquisition device coordinate system subjected to the normalization processing.
S644, mapping the eye region map to a virtual camera coordinate system to generate a first eye region map.
And mapping the eye area map into the virtual camera coordinate system to obtain the first eye area map. Through the normalization algorithm, the eye area map is mapped into the virtual camera coordinate system, and the sight direction is transformed from a direction in the camera coordinate system to a direction in the virtual camera coordinate system. Because the head pose directly influences the result of sight estimation, even when the head pose varies over a very large range, the invention can still effectively eliminate the eye appearance differences caused by face translation and head roll rotation, as well as the influence of scenes such as face occlusion on sight regression precision.
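In practice the mapping of step S644 is often implemented as a single perspective warp built from the rotation and scaling matrices and the real and virtual camera intrinsics. The following is a sketch only; the intrinsic matrices, output size and use of OpenCV are assumptions rather than details given in the patent.

```python
import cv2
import numpy as np

def warp_to_virtual_camera(eye_img, K_real, K_virtual, R, S, out_size=(120, 72)):
    """Warp the eye region image into the virtual (normalized) camera.
    K_real / K_virtual: intrinsics of the real and virtual cameras.
    R, S: rotation and scaling matrices from the previous steps."""
    M = K_virtual @ S @ R @ np.linalg.inv(K_real)   # image-to-image homography
    return cv2.warpPerspective(eye_img, M, out_size)
```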
The pre-trained neural network is trained by adopting a face image set with sight direction labeling information. The first eye area map obtained after normalization is input into the neural network, and in step S606 the face map and the first eye area map are input into the pre-trained neural network to perform sight detection on the target face and obtain the sight detection result. The classification layer performs regression analysis on the pitch angle and the yaw angle of the left eye and/or the right eye of the target face according to the spliced feature vector, and finally outputs the sight detection result of the target face in the virtual camera coordinate system. Since the sight detection result output by the classifier is in the virtual camera coordinate system rather than the image acquisition device coordinate system, it is inversely transformed according to the rotation matrix from the normalization processing to obtain the sight direction in the image acquisition device coordinate system.
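The inverse transformation of step S608 then only needs the rotation matrix, since a pure scaling does not change a direction. A minimal sketch, assuming the gaze is represented as a unit vector:

```python
import numpy as np

def denormalize_gaze(gaze_virtual, R):
    """Bring a gaze direction predicted in the virtual camera coordinate
    system back into the image acquisition device coordinate system."""
    g = R.T @ gaze_virtual        # R is orthonormal, so R.T is its inverse
    return g / np.linalg.norm(g)
```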
Because individual differences are large, for example in interpupillary distance, eyeball size, pupil size and eye shape, the images input into the neural network differ noticeably, which affects the sight estimation accuracy of a general pre-trained neural network model and leads to poor generalization of the model.
In order to solve the problem of angle errors of sight estimation caused by individual differences and ensure the neural network sight regression and generalization capability, in an optional embodiment, before the neural network model is input for sight estimation, a plurality of sight calibration images of the target face are acquired, wherein the sight calibration images comprise the calibration sight direction of the target face. Referring to fig. 8, a flow chart of an alternative gaze direction detection method of embodiments of the present invention is provided. The flow may be described in connection with the neural network architecture illustrated in fig. 9. As shown in fig. 8, the gaze direction detecting method includes the steps of:
s800, acquiring a target image containing a target face, and collecting sight calibration images of a plurality of target faces, wherein the sight calibration images comprise calibration sight directions;
specifically, the plurality of sight line calibration images may be acquired by a sight line acquisition system, and the sight line calibration images include not only images when the user gazes at a viewpoint with a known coordinate, but also sight line directions when the user gazes at the viewpoint, which are acquired by the sight line acquisition system.
S801, detecting a face region and a face key point of a target image, acquiring a face image and an eye region image according to detection results, detecting the face region and the face key point of each sight line calibration image, and acquiring a calibration face image and a calibration eye region image according to detection results;
S802, inputting the calibration face map, the calibration eye area map, the face map and the eye area map into a pre-trained neural network to perform sight detection on the target face and obtain a sight detection result;
s803, determining the sight direction of the target face according to the sight detection result;
In addition, the above steps S800, S801 and S803 are the same as steps S100, S102 and S104 in fig. 1; refer to the corresponding description of fig. 1 for details. The face detection and face key point detection methods applied to the sight calibration images in S801 are also the same as those applied to the target image, and will not be described again here.
The embodiment described in fig. 8 is different from that in fig. 1 in that the gaze direction detection method further includes step S802, inputting the calibration face map, the calibration eye area map, the face map and the eye area map into a pre-trained neural network to perform gaze detection on the target face, so as to obtain a gaze detection result.
In an alternative embodiment, the pre-trained neural network comprises: the feature extraction layer is further configured to perform feature extraction on one calibrated face image and one calibrated eye region image to obtain a third feature vector and a fourth feature vector; the classification layer comprises a pre-calibration branch, and is used for sequentially connecting a third feature vector and a fourth feature vector of a plurality of sight calibration images to determine the sight detection result.
In an optional embodiment, each calibrated face image and each calibrated eye region image are obtained by sequentially performing face region detection and face key point detection on one vision calibration image, where the calibrated face image includes a calibrated face region image and/or a calibrated face shape image, and the eye region image includes a calibrated left eye image and/or a calibrated right eye image.
Specifically, the step in which the feature extraction layer extracts features from one calibration face map and one calibration eye area map is the same as the step in which the feature extraction layer extracts features from the face map and the eye area map in S140, and is not described again here. As shown in fig. 9, a pre-trained neural network generally includes only one feature extraction layer; to improve the efficiency of feature extraction, the pre-trained neural network in this embodiment includes two feature extraction layers. The calibration face map and the calibration eye area map are input into the first feature extraction layer to obtain the third feature vector and the fourth feature vector, and the face map and the eye area map are input into the pre-trained second feature extraction layer to obtain the first feature vector and the second feature vector. The first feature extraction layer and the second feature extraction layer have exactly the same structure, the parameter weights of the two network structures are shared, the size of the input face map is consistent with that of the calibration face map, and the size of the left/right eye map is consistent with that of the calibration left/right eye map. The face shape map can assist in estimating angles, and when interference such as illumination and noise exists, the face shape map represents the face angle information better than the face region map.
Referring to fig. 10, a flow chart of an alternative gaze direction detection method of embodiments of the present invention is provided. In an alternative embodiment, the pre-calibration branch sequentially connects the third feature vector and the fourth feature vector to determine the detection result, including:
S1000, splicing the first feature vector and the second feature vector, splicing the third feature vector and the fourth feature vector of one sight calibration image, and taking the difference of the spliced feature vectors to obtain a second description feature vector;
Specifically, the splicing of the first feature vector and the second feature vector of the target image is consistent with the description in S142, and so is the splicing of the third feature vector and the fourth feature vector of one sight calibration image; see the corresponding description. The first feature vector and the second feature vector are spliced to obtain a fifth feature vector, the third feature vector and the fourth feature vector of one sight calibration image are spliced to obtain a sixth feature vector, and the difference between the fifth feature vector and the sixth feature vector is taken to obtain the second description feature vector.
S1002, performing sight regression according to the second description feature vector to determine a candidate detection result;
In an optional embodiment, the fully connected layer obtains the sight angle difference between the target image and one sight calibration image through sight regression on the second description feature vector, and a candidate detection result is determined according to the sight angle difference and the calibration sight direction contained in that sight calibration image.
S1004, sequentially inputting the other sight calibration images and performing feature extraction, feature combination and sight regression to obtain all candidate detection results;
For the other collected sight calibration images, each calibration image is input in turn together with the target image, and feature extraction, feature combination and sight regression are performed to obtain all candidate detection results, which avoids the interference of an extreme value from any single collected sight calibration image on the sight estimation precision.
And S1006, carrying out weighted average on all candidate detection results, and determining a sight line detection result.
Different from conventional networks that learn absolute values, the embodiment of the invention estimates the sight direction of the target image by learning the angle difference between the target image and a calibration sight image. Specifically, the target image and one calibration sight image are input, and the sight angle difference between the current target image and the calibration sight image is output, so a relative value of the sight direction is obtained instead of an absolute value, which avoids the angle error caused by individual differences. The sight direction of the target face is obtained as a weighted average of all candidate sight directions, where the weight is related to the sight angle difference: the larger the difference, the lower the weight. This avoids interference on sight estimation from extreme values produced when the sight angle difference between a calibration sight image and the target image is too large, and further improves the precision of sight estimation.
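A sketch of this calibration-based fusion is shown below. The exponential weighting and its temperature are assumptions used only to illustrate "larger difference, lower weight"; the patent does not specify the exact weighting function.

```python
import numpy as np

def fuse_calibrated_estimates(angle_diffs, calib_angles, temperature=10.0):
    """angle_diffs: predicted (pitch, yaw) differences between the target
    image and each calibration image, shape (N, 2), in radians.
    calib_angles: known calibration gaze angles, shape (N, 2).
    Returns the weighted-average candidate gaze angles."""
    candidates = calib_angles + angle_diffs              # candidate (pitch, yaw)
    magnitude = np.linalg.norm(angle_diffs, axis=1)      # size of each angle difference
    weights = np.exp(-temperature * magnitude)           # larger difference -> lower weight
    weights /= weights.sum()
    return (weights[:, None] * candidates).sum(axis=0)
```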
The accurate sight direction of the user can be obtained using the sight direction detection methods of fig. 1 to 10. To achieve interaction, the viewpoint position at which the user's sight direction falls in the region space also needs to be determined, for example to light up the center control screen or to activate the transparent A-pillar effect.
In an optional embodiment, the gaze direction detection method further comprises: the gaze region is determined based on the gaze direction and the pre-divided region space.
The obtained sight direction is in the camera coordinate system; to determine the user's gaze region, the spatial positions of the fixed regions in the camera coordinate system need to be determined in advance so that the final gaze region can be obtained.
In an optional embodiment, the pre-divided region space is represented by the spatial plane equations of the regions in the camera coordinate system, established as follows: constructing the spatial plane equation of each region in the camera coordinate system from a three-dimensional digital model, or describing the region space with a plurality of irregular polygons and determining the spatial plane equation of each region in the camera coordinate system from the irregular polygons and the transformation relationship between the region space and the camera coordinate system. Specifically, the method applies to multiple scenes and devices, and the spatial plane equation of a region in the camera coordinate system can be constructed from a three-dimensional digital model of the scene or device provided by the manufacturer, where the digital model contains the positional relationship between the region space and the camera coordinate system. When a three-dimensional model with a known transformation relationship is not available, the region space is described by a plurality of irregular polygons, and the spatial plane equation of the region in the camera coordinate system is determined from the irregular polygons and the transformation relationship between the region space and the camera coordinate system. The irregular polygons and the transformation relationship between the region space and the camera coordinate system can be obtained by calibration in advance.
As shown in fig. 11, which illustrates a gaze region division provided by the invention, taking detection of the driver's gaze region inside a vehicle as an example, the vehicle space can be divided in advance into a plurality of different regions, including: the left front window, the right front window, the left rearview mirror region, the right rearview mirror region, the inside rearview mirror, the center console, the instrument panel region, the front passenger region, the gear shift lever region and the front window region. Because different vehicle models have different spatial layouts, the gaze region categories can be divided according to the vehicle model, or according to the distribution of actual viewpoints in the gaze regions.
In an alternative embodiment, determining the gaze region based on the gaze direction and the pre-divided spatial location of the region comprises: determining the boundary of each pre-divided region space, and determining the intersection point with the region space based on the sight line direction; and when the intersection point position is positioned in the boundary range of a pre-divided area space, determining the corresponding area space as a watching area.
Specifically, the problem of determining the user gazing area is the intersection point problem of a line of sight direction in a camera coordinate system and an area space plane. Since the preset areas are not in the same plane and the origin of coordinates of each area plane is not the upper left corner of each plane, it is necessary to determine the boundary of each area plane, and determine whether the viewpoint falls within the plane based on the boundary of the area plane, thereby determining the gazing area. Fig. 12 shows an alternative method for determining the gaze area, which is illustrated in fig. 12 by an irregular area plane.
For an irregular area plane, A, B, C and D denote the four boundary points of the irregular area plane. Taking one edge vector of the plane, for example AB, as the x-axis and an adjacent edge vector, for example AD, as the y-axis, the projections of the remaining boundary points onto the x-axis and the corresponding y-axis are solved, and the original irregular area space plane is expanded through these projections into a regular area space plane with origin G, width w and height h. In the camera coordinate system, the intersection of the sight direction with the regular area space plane is denoted as P. Whether the user's gaze region lies in this area space plane is then judged by checking whether the projections of AP onto the x-axis and the y-axis both fall within the width w and the height h of the regular area space plane.
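The in-bounds test described above amounts to a ray-plane intersection followed by a projection check. The sketch below is illustrative; it uses normalized projection coordinates in [0, 1] instead of comparing against w and h directly, which is equivalent.

```python
import numpy as np

def gaze_region_hit(eye_pos, gaze_dir, A, B, D):
    """Check whether the gaze ray from eye_pos along gaze_dir hits the
    rectangular region spanned at corner A by edge vectors AB and AD.
    All quantities are 3D points/vectors in the camera coordinate system."""
    x_axis, y_axis = B - A, D - A
    normal = np.cross(x_axis, y_axis)
    denom = np.dot(normal, gaze_dir)
    if abs(denom) < 1e-8:
        return False                          # ray parallel to the plane
    t = np.dot(normal, A - eye_pos) / denom
    if t <= 0:
        return False                          # plane is behind the eye
    P = eye_pos + t * gaze_dir                # intersection point P
    u = np.dot(P - A, x_axis) / np.dot(x_axis, x_axis)   # projection on the x-axis
    v = np.dot(P - A, y_axis) / np.dot(y_axis, y_axis)   # projection on the y-axis
    return 0.0 <= u <= 1.0 and 0.0 <= v <= 1.0
```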
The gaze direction and gaze region detection methods described above with respect to fig. 1-12 implement gaze detection based on a pre-trained neural network. Fig. 13 is an alternative neural network training method according to an embodiment of the present invention, where the training method for a pre-trained neural network includes:
s1300, obtaining a sample image, wherein the sample image comprises sight direction marking information;
S1302, performing face region detection and face key point detection on the sample images, and obtaining a face sample map and an eye region sample map according to the detection result, wherein the face sample map comprises a face region sample map and/or a face shape sample map, and the eye region sample map comprises a left eye sample map and a right eye sample map;
specifically, the face region detection and the face key point detection may be implemented according to step S102 as described in fig. 1.
And S1304, inputting the face sample image and the eye area sample image into an initial neural network for training, and constructing the pre-trained neural network, wherein the initial neural network comprises a first binocular information branch, a first monocular information branch and a first confidence coefficient branch.
In an alternative embodiment, in step S1304, the inputting of the face sample map and the eye area sample map into an initial neural network for training, and constructing the pre-trained neural network includes:
s1310, inputting the face sample image and the eye area sample image into an initial neural network to obtain predicted sight direction information and predicted left and right eye visible information;
s1320, adjusting parameters of the first binocular information branch and the first monocular information branch according to the difference between the predicted gaze direction information and the gaze direction label information;
In an optional embodiment, adjusting the parameters of the first binocular information branch and the first monocular information branch of the initial neural network according to the difference between the predicted gaze direction information and the gaze direction labeling information includes: the first binocular information branch inputs the predicted gaze direction information and the gaze direction labeling information into a first loss function module, and its parameters are adjusted according to the first loss value output by the first loss function module, where the first loss value is the sum of a first left eye sight loss value and a first right eye sight loss value; the first monocular information branch inputs the predicted gaze direction information and the gaze direction labeling information into a second loss function module, whose second loss value comprises a second left eye sight loss value and a second right eye sight loss value, and the parameters of the first monocular information branch are updated according to the second left eye sight loss value and the second right eye sight loss value respectively. The first loss function module and the second loss function module may be L1 loss functions.
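A minimal sketch of these two loss terms, assuming PyTorch and (pitch, yaw) tensors; the dictionary layout of the predictions is an assumption.

```python
import torch.nn.functional as F

def branch_losses(pred_bino, pred_mono_left, pred_mono_right, gt_left, gt_right):
    """L1 losses for the first binocular branch and the first monocular branch.
    All tensors hold (pitch, yaw) angles of shape (batch, 2)."""
    # first loss value: sum of the left-eye and right-eye L1 losses
    loss_bino = F.l1_loss(pred_bino["left"], gt_left) + F.l1_loss(pred_bino["right"], gt_right)
    # second loss value: left-eye and right-eye L1 losses kept separate
    loss_mono_left = F.l1_loss(pred_mono_left, gt_left)
    loss_mono_right = F.l1_loss(pred_mono_right, gt_right)
    return loss_bino, loss_mono_left, loss_mono_right
```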
And S1330, performing unsupervised learning training on the first confidence coefficient branch according to the predicted sight direction information and the predicted left-right eye visible information, and adjusting parameters of the first confidence coefficient branch.
In an alternative embodiment, performing unsupervised learning training on the first confidence branch according to the predicted gaze direction information and the predicted left-right eye visibility information to adjust its parameters includes: using the left-eye and right-eye angle errors contained in the predicted gaze direction information to represent the visibility confidence of the sample, and mapping the visibility confidence of the sample into binary pseudo labels; the first confidence branch inputs the visibility confidence of the sample and the predicted left-right eye visibility information into a third loss function module, and its parameters are adjusted according to the third loss value output by the third loss function module, where the third loss value is the sum of a third left eye sight loss value and a third right eye sight loss value.
Considering that the proportions of invisible left eyes and invisible right eyes in the sample data are unbalanced, the first confidence branch is trained with parameters shared between the left and right eyes: the features extracted from the left eye image and from the right eye image are each input into the first confidence branch once, and finally the left eye loss and the right eye loss are added and used together for the update. When training the first confidence branch, the invention adopts an unsupervised learning training approach. Considering that samples with low visibility generally have larger predicted angle errors while clearly visible samples have smaller prediction errors, the invention represents the visibility confidence of a sample by the left and right eye angle errors predicted by the first monocular information branch, and maps the visibility confidence of the sample into binary pseudo labels.
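The pseudo-label construction and the third loss can be sketched as follows, assuming PyTorch. The error threshold used to binarize the pseudo labels is an assumption; the patent only states that the angle error is mapped into binary pseudo labels.

```python
import torch.nn.functional as F

def confidence_pseudo_labels(pred_mono, gt, error_threshold=0.15):
    """Map the monocular branch's per-sample angle error into binary pseudo
    labels: small error -> credible (1.0), large error -> not credible (0.0)."""
    angle_error = (pred_mono - gt).abs().sum(dim=1)
    return (angle_error < error_threshold).float()

def confidence_loss(conf_left, conf_right, pseudo_left, pseudo_right):
    """Third loss value: sum of the left-eye and right-eye losses."""
    return (F.binary_cross_entropy(conf_left.squeeze(-1), pseudo_left)
            + F.binary_cross_entropy(conf_right.squeeze(-1), pseudo_right))
```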
According to the embodiment provided by the above steps, introducing an additional labeling task can be avoided: the first confidence branch is trained by an unsupervised learning method and optimized using the sight angle differences produced by the predictions of the first monocular information branch. The first binocular information branch, the first monocular information branch and the first confidence branch obtained after training on the sample data are the binocular information branch, the monocular information branch and the confidence branch of the pre-trained neural network.
The sight detection neural network requires a large number of sight data samples as a training basis. A sight data sample image includes not only an eye image but also sight direction labeling information, and unlike tasks such as classification and detection, three-dimensional sight data is difficult to label manually. In a sight acquisition system, the spatial position of each region space under the camera needs to be determined, and the spatial position of the user's pupil must be determined at the same time, so that the sight direction can be computed; this can be summarized as measuring the spatial position of a target. The traditional way of locating the spatial position of a target usually estimates the position in the camera coordinate system from the digital-model design information of the scene, involves no actual measurement, does not accurately meet the requirements of sight estimation, and cannot adapt to different vehicle models, application scenes and so on. In addition, methods based on synthetic data and methods based on domain-adaptive transfer learning can supplement the sight data samples, but it is currently difficult for them to match models trained on real data. The invention provides an automatic sample image identification and acquisition method: by constructing the positional relationships among multiple cameras, the spatial positions of the regions and the position of the user's eye pupil are identified automatically, the sight direction in the same camera coordinate system is determined, and a large number of accurate sample images are acquired.
Referring to fig. 14, a flow chart of an alternative sample image acquisition method according to an embodiment of the invention is shown. As shown in fig. 14, the method of sample image acquisition includes the steps of:
s1400, simultaneously capturing images of a user when gazing at a viewpoint label by a first camera, a second depth camera and a third depth camera to obtain a first image, a second image and a third image, wherein the first camera and the second depth camera have the same direction, and the first camera and the third depth camera have opposite directions;
s1402, respectively determining the position of the pupil of the user in the second depth camera coordinate system and the position of the viewpoint label in the third depth camera coordinate system according to the second image and the third image;
and S1404, according to the position relation among the pre-calibrated first camera, the second depth camera and the third depth camera, the position of the pupil of the user in the second depth camera coordinate system and the position of the viewpoint label in the third depth camera coordinate system, and determining the sight line direction marking information of the user in the first camera coordinate system.
Fig. 15 is a schematic diagram of an optional sample image capturing scene according to an embodiment of the present invention, taking the detection of the driver's sight line in the vehicle as an example. As shown in fig. 15, the camera mounted in the vehicle is typically an infrared camera, denoted as the first camera, and is usually mounted in front of the driver, for example at the left A-pillar, the dashboard or the rearview mirror position, to monitor the driver's state day and night. In order to measure the eye position of the driver in the first camera coordinate system, the invention uses a depth camera, denoted as the second depth camera; common depth cameras include the Kinect V2 and Intel's RealSense series. The second depth camera is placed above the center console so that it can see the driver's face and obtain a corresponding depth map. In a sight acquisition product, after the spatial position of the driver's eyeball is determined, the positions of the vehicle body regions in the first camera coordinate system also need to be determined. Since the first camera faces the driver and the vehicle body regions are not visible in its field of view, the positions of these regions in the first camera coordinate system are virtual. To measure the position of each region space in the first camera coordinate system, the invention uses a third depth camera, again for example a Kinect V2 or an Intel RealSense camera; the third depth camera is placed at the driver's seat position, simulating the driver observing the front of the vehicle body, so that it can see each region of the vehicle body and obtain a corresponding depth map. The viewpoint labels that the user gazes at appear randomly in each region; the viewpoint label in the figure is numbered 12, and the image of the user gazing at viewpoint label 12 is also acquired by the second depth camera.
In an alternative embodiment, the pre-calibrated positional relationship between the first camera, the second depth camera and the third depth camera includes: establishing a first positional relationship between the first camera and the second depth camera through a single-sided calibration board; and establishing a second positional relationship between the first camera and the third depth camera through a double-sided calibration board. Specifically, since the first camera and the third depth camera face in opposite directions, the first camera shooting from the vehicle body position toward the driver and the third depth camera shooting from the driver position toward the vehicle body, the positional relationship between the two cameras is established with a double-sided calibration board. The patterns of the single-sided and double-sided calibration boards are not limited; for example, a checkerboard may be used.
In an alternative embodiment, step S1402 of determining, according to the second image and the third image, a position of a pupil of the user in the second depth camera coordinate system and a position of the viewpoint label in the third depth camera coordinate system respectively includes:
s1412, the third image includes a third depth map and a third color map, and the second image includes a second depth map and a second color map;
specifically, the data obtained by the depth camera includes two paths of video streams, which are an RGB image and a depth map. Therefore, the third image acquired by the third depth camera comprises a third depth map and a third color map, and the second image acquired by the second depth camera comprises a second depth map and a second color map.
And S1422, detecting a face key point of the second color image to obtain coordinates of the user pupil in the second color image coordinate system, obtaining coordinates of the user pupil on the second depth image according to the coordinates of the user pupil in the second color image coordinate system, and determining the position of the user pupil in the second depth camera coordinate system.
S1432, identifying the viewpoint label in the third color map, and obtaining coordinates of the viewpoint label in a coordinate system of the third color map; and obtaining the coordinates of the viewpoint label on the third depth map according to the coordinates of the viewpoint label in the third color map coordinate system, and determining the position of the viewpoint label in the third depth camera coordinate system.
After the position of the user's pupil in the second depth camera coordinate system is obtained, it is combined with the pre-calibrated first positional relationship between the first camera and the second depth camera to determine the spatial position of the user's pupil in the first camera coordinate system. Similarly, the position of the viewpoint label in the third depth camera coordinate system is combined with the pre-calibrated second positional relationship between the first camera and the third depth camera to obtain the position of the viewpoint label in the first camera coordinate system at the same moment. The line connecting the user's pupil position and the viewpoint label in the first camera coordinate system is the finally acquired sight direction of the user.
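A sketch of the back-projection and coordinate transfer described above is given below; the intrinsic/extrinsic parameter names are illustrative assumptions, and the extrinsics (R, t) correspond to the pre-calibrated positional relationships.

```python
import numpy as np

def pixel_to_camera(u, v, depth, K):
    """Back-project pixel (u, v) with its depth value into a 3D point in the
    depth camera's own coordinate system; K is the camera intrinsic matrix."""
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    return np.array([(u - cx) * depth / fx, (v - cy) * depth / fy, depth])

def gaze_label_in_first_camera(pupil_cam2, label_cam3, R12, t12, R13, t13):
    """Transfer the pupil (second depth camera) and the viewpoint label
    (third depth camera) into the first camera coordinate system via the
    calibrated extrinsics, and take their connecting line as the gaze label."""
    pupil_cam1 = R12 @ pupil_cam2 + t12
    label_cam1 = R13 @ label_cam3 + t13
    direction = label_cam1 - pupil_cam1
    return direction / np.linalg.norm(direction)
```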
During sight acquisition, the moment at which the user gazes at a certain region of the vehicle body is likely not to be synchronized with the moment at which the sight images are saved. The invention therefore provides a method that fuses voice and images, saving the sight image results synchronously through automatic speech recognition. In an alternative embodiment, before the first camera, the second depth camera and the third depth camera simultaneously capture images of the user gazing at a viewpoint label, the method further includes: receiving and recognizing voice information, and when the voice information contains a capture instruction, having the first camera, the second depth camera and the third depth camera simultaneously capture images of the user gazing at the viewpoint label. For example, when the user says "12", the system automatically saves the pictures from the three cameras, so that the images and depth information of the first camera, the second depth camera and the third depth camera at the same moment are obtained and the problem of unsynchronized image acquisition is avoided.
In many applications, such as transparent a-pillars, line-of-sight bright screens, heads-up display systems, etc., it is desirable to know both the area and the spatial location of the user under each camera. Due to the fact that the field of view of the cameras is limited, a single camera can only shoot a part of scenes in the automobile, some cameras can only see face images of users, and some cameras can only see area coordinates of the automobile body. The acquisition method automatically identifies the position and the eye pupil position of the user by constructing the position relationship among the multiple cameras, and determines the spatial positions of the eyes and the sight line falling point in the same camera coordinate system, thereby determining the sight line direction of the user and finally obtaining the sight line direction marking information of the user in the first camera coordinate system. According to the invention, a set of simple and efficient sight line data acquisition system is set up, the sight line direction marking information can be obtained only by acquiring the image, the user operation is simple and efficient, and a large number of accurate user sight line calibration data samples can be obtained.
The embodiment of the device for detecting a gaze direction is further provided in this embodiment, and the device is used to implement the above embodiments and preferred embodiments, which have already been described and will not be described again. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the means described in the embodiments below are preferably implemented in software, an implementation in hardware, or a combination of software and hardware is also possible and contemplated.
Fig. 16 is a block diagram of an alternative gaze direction detection apparatus according to an embodiment of the present invention. As shown in fig. 16, the sight line direction detection apparatus 160 includes an image acquisition module 1601, an object detection module 1602, a sight line detection module 1603, and a sight line determination module 1604.
Each unit included in the gaze direction detection apparatus 160 is described in detail below.
An image acquisition module 1601, configured to acquire a target image including a target face;
In an alternative embodiment, the target image containing the target face may be acquired by the image acquisition module. The image acquisition module may be an independent camera, or an electronic device integrated with a camera such as a mobile phone, a driving recorder or a driver assistance system; camera types include an infrared structured-light camera, a Time-of-Flight (ToF) camera, an RGB camera, a fisheye camera and so on. The target image may be an image to be acquired in a specific target-object gaze region detection scene, and there are many such scenes; for example, the user's gaze region is detected to control a smart device, or the driver's gaze region is detected to judge the driver's attention. In a target-object gaze region detection scene, the image acquisition module is installed at a position where the target object's face image can be acquired, such as in front of a display screen or at the vehicle's transparent A-pillar.
The image acquisition module acquires image data which can comprise video data and non-video data, performs video image framing processing on the video data, and performs face detection on each frame of image, wherein at least one target image comprising a target face is an image frame containing the target face determined by performing the target face detection on each frame of image in the video; the non-video data can comprise a single-frame image, and the single-frame image is subjected to face detection without frame division processing to obtain at least one target image comprising a target face. And when the image data acquired by the image acquisition module does not detect the target face, the image acquisition module continues to acquire the image data.
A target detection module 1602, configured to perform face region detection and face key point detection on a target image, and obtain a face image and an eye region image according to a detection result, where the face image includes a face region image and/or a face shape image, and the eye region image includes a left-eye image and/or a right-eye image;
Optionally, the face map provides prior information on the head pose, including the head yaw angle, the head pitch angle and the head rotation angle.
Specifically, the face detection technology is adopted to select a face region frame in the target image, and the target image is cut according to the selected face region frame to obtain the face image. The face image can be a face region image which only contains a face region in the target image after being cut, and all original information of the target image is reserved so as to ensure the accuracy and the integrity of subsequent data processing. If the interference of illumination and noise is obvious, the face image may only include a face shape image of face contour feature points, where the face contour feature points include but are not limited to: face contour points, eye contour points, nose contour points, eyebrow contour points, forehead contour points, lip contour points. The human face shape graph can assist in estimating angles, and when interference information such as illumination, noise and the like exists, the human face shape graph can better represent the human face angle information compared with a human face region graph. And moreover, carrying out face key point detection on the target image, respectively cutting the target image according to the distribution of the characteristic points of the left eye and the right eye, and obtaining a left eye image and a right eye image which are collectively called as eye region images, wherein the face key points can comprise face key point data corresponding to each region of the face, and each region of the face comprises a left eye eyebrow region, a right eye eyebrow region, a left eye region, a right eye region, a nose region and a mouth region.
The sight detection module 1603 is used for inputting the face map and the eye area map into a pre-trained neural network to perform sight detection on the target face and obtain a sight detection result;
In an alternative embodiment, the pre-trained neural network included in the gaze detection module 1603 comprises:
a feature extraction layer 1630, configured to perform feature extraction on the face image and the eye region image to obtain a first feature vector and a second feature vector; the classification layer 1640 is configured to determine the sight line detection result by sight line regression based on the first feature vector and the second feature vector.
In an alternative embodiment, the classification layer 1640, comprises:
a first splicing unit 1641, configured to combine and splice the first feature vector, the left-eye feature vector, and/or the right-eye feature vector respectively for each branch of the classification layer to obtain a first description feature vector;
a first determining unit 1642, configured to determine a line-of-sight detection result according to line-of-sight regression of the corresponding first description feature vector for each branch of the classification layer.
In an alternative embodiment, the first determining unit 1642 comprises: a first detection subunit, configured so that the monocular classifier included in the monocular information branch performs line-of-sight regression according to the first left-eye feature vector and the first right-eye feature vector respectively to determine a first detection result; a second detection subunit, configured so that the binocular classifier included in the binocular information branch performs line-of-sight regression according to the first binocular feature vector to determine a second detection result; and a third detection subunit, configured so that the confidence classifier included in the confidence branch performs binary classification of the reliability of the left-eye image and the right-eye image according to the left-eye feature vector and the right-eye feature vector respectively, to determine a third detection result.
In an alternative embodiment, the first left-eye feature vector is generated by stitching a left-eye feature vector and the first feature vector; the first right-eye feature vector is generated by splicing the right-eye feature vector and the first feature vector; the first binocular feature vector is generated by splicing the left eye feature vector, the right eye feature vector and the first feature vector.
In an alternative embodiment, the gaze determination module 1604 includes:
a first sight line determining unit 1644, configured to, when no confidence level branch is included, take the first detection result or the second detection result as a sight line detection result, and determine a sight line direction of the target face;
a second line-of-sight determination unit 1645, configured to, when including a confidence branch, include:
if the third detection result is that the left eye image and the right eye image are both credible, determining the second detection result as the sight line direction of the target face;
if the third detection result is that the left eye image or the right eye image is credible, determining the first detection result as the sight line direction of the target face;
and if the third detection result is that neither the left-eye image nor the right-eye image is credible, no valid result is output.
The face map and the eye region maps are input into the corresponding feature extraction branches of the feature extraction layer: the first feature extraction branch extracts features from the face map to obtain the first feature vector. Since the input eye region map comprises a left-eye image and/or a right-eye image, the second feature extraction branch extracts features from the left-eye image to obtain a left-eye feature vector, the third feature extraction branch extracts features from the right-eye image to obtain a right-eye feature vector, and the second feature vector therefore comprises the left-eye feature vector and/or the right-eye feature vector. The face map and the left-eye and right-eye images contained in the eye region map can be input into the neural network simultaneously, with features extracted by the three corresponding branches. In the camera coordinate system, the line-of-sight direction depends not only on the state of the eyes, including the eyeball position and the degree of eye opening, but also on the head pose. For example, even when the eyes are turned sideways relative to the head, the gaze may still point straight ahead in the camera coordinate system; introducing the face map therefore adds prior information such as head pose, increasing feature diversity and representation power, and laying a foundation for the subsequent layers of the neural network to determine a more accurate sight detection result.
In the prior art, a separate computation module is needed to estimate the head pose in advance, and an inaccurate head pose estimate then degrades the subsequent gaze estimation. In the embodiment of the invention, the pre-trained neural network comprises a feature extraction layer whose feature extraction branches respectively extract features from the face map and the eye region map to obtain a first feature vector and a second feature vector, where the second feature vector comprises a left-eye feature vector and/or a right-eye feature vector; the binocular information branch or the monocular information branch splices the first feature vector with the left-eye feature vector and/or the right-eye feature vector to obtain a first description feature vector; the binocular information branch or the monocular information branch determines a sight detection result by line-of-sight regression on the corresponding first description feature vector; and the sight direction of the target face is determined according to the sight detection result. The embodiment of the invention adopts an end-to-end neural network in which the pose information contained in the face map serves as auxiliary information estimated jointly with the eye feature information in the same deep learning framework, directly obtaining the sight direction of the target face in the camera coordinate system, which improves fault tolerance and the robustness of gaze estimation. In addition, the embodiment of the invention supports monocular and binocular image input and is compatible with monocular and binocular gaze tracking. A sketch of such a multi-branch network is given below.
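The PyTorch sketch below shows one possible shape of the multi-branch structure described above: three feature extraction branches, splicing of face and eye feature vectors, and monocular, binocular and confidence heads. All layer sizes, the shared monocular head and the tiny backbones are assumptions made only for illustration.

```python
import torch
import torch.nn as nn


class GazeNet(nn.Module):
    """Sketch of the multi-branch gaze network (sizes are assumptions)."""

    def __init__(self, feat_dim=128):
        super().__init__()

        def branch():                     # one feature-extraction branch
            return nn.Sequential(
                nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, feat_dim))

        self.face_branch, self.left_branch, self.right_branch = branch(), branch(), branch()
        self.mono_head = nn.Linear(2 * feat_dim, 2)   # monocular branch: (pitch, yaw)
        self.bino_head = nn.Linear(3 * feat_dim, 2)   # binocular branch: (pitch, yaw)
        self.conf_head = nn.Linear(feat_dim, 2)       # confidence branch: reliable / unreliable

    def forward(self, face_img, left_img, right_img):
        f = self.face_branch(face_img)                # first feature vector (head-pose prior)
        l = self.left_branch(left_img)                # left-eye feature vector
        r = self.right_branch(right_img)              # right-eye feature vector
        gaze_left = self.mono_head(torch.cat([l, f], dim=1))     # first left-eye description vector
        gaze_right = self.mono_head(torch.cat([r, f], dim=1))    # first right-eye description vector
        gaze_both = self.bino_head(torch.cat([l, r, f], dim=1))  # first binocular description vector
        conf_left, conf_right = self.conf_head(l), self.conf_head(r)
        return gaze_left, gaze_right, gaze_both, conf_left, conf_right
```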
Furthermore, if the details of both eyes are physically lost due to reflections or occlusion, the gaze can only be estimated from the head pose, and it is temporarily difficult for appearance-based schemes to improve accuracy in that case. When both eyes are clearly visible, splicing the feature vectors of the left eye, the right eye and the head pose information together can greatly improve the precision of line-of-sight regression. However, when one eye is visible and the details of the other eye are lost, extracting and splicing the features of both eyes introduces interference into the features and degrades the line-of-sight regression.
In order to solve the problem of low line-of-sight regression precision in scenes with monocular reflections or occlusion, in an alternative embodiment the neural network includes the following branches: a binocular information branch, a monocular information branch and a confidence branch. The neural network performs multi-branch joint learning over binocular information, monocular information and confidence information, and the output results of the branches are fused to determine the final, more accurate gaze estimate. For example, if both eye images are credible, the second detection result output by the binocular information branch is taken as the detection result of the neural network; if only the left eye is credible, the left-eye gaze result in the first detection result output by the monocular information branch is taken as the output of the neural network; a fusion sketch follows this paragraph. By introducing confidence information, the embodiment of the invention relies on eye image information of high reliability and ignores eye image information of low reliability, thereby solving the problem of low line-of-sight regression precision in monocular reflection and occlusion scenes and improving the robustness of gaze estimation.
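A minimal sketch of the fusion rule, assuming the confidence branch has already been thresholded into two boolean reliability decisions:

```python
def fuse_gaze(gaze_left, gaze_right, gaze_both, left_ok, right_ok):
    """Select the final gaze from the branch outputs using the confidence branch.

    `left_ok`/`right_ok` are the binary reliability decisions for the two eye images.
    """
    if left_ok and right_ok:      # both eye images credible: use the binocular branch
        return gaze_both
    if left_ok:                   # only one eye credible: use the monocular branch
        return gaze_left
    if right_ok:
        return gaze_right
    return None                   # neither eye credible: no valid output
```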
The gaze direction detection device described above yields robust and accurate gaze estimation in conventional scenes. However, in special scenes, for example when the face is at the edge of the image, suppose the eyes do not move relative to the face: although the rotation matrix between the face and the camera coordinate system does not change, the face deviates from the optical axis of the camera, so the line from the optical center to the face makes an angle with the optical axis, and the appearance of the eye images changes even though the line of sight does not. When the classifier regresses the line of sight, it only performs regression analysis on the pitch angle and yaw angle of the left eye and/or right eye of the target face, without accounting for this marked change in the appearance and aspect of the input image, and the unexpected change in eye image appearance affects the regression accuracy. In this case the above embodiment may not solve the problem that eye-image appearance changes affect the line-of-sight regression accuracy. Therefore, the eye region map is aligned and normalized before being input into the pre-trained neural network. Referring to fig. 17, a block diagram of another alternative gaze direction detection apparatus according to an embodiment of the present invention is provided. As shown in fig. 17, the gaze direction detection apparatus 160 further includes a normalization processing module 1605, configured to normalize the eye region map before it is input into the gaze detection module 1603, obtaining a first eye region map, where the first eye region map includes a first left-eye region map and/or a first right-eye region map.
In an alternative embodiment, the normalization processing module 1605 includes:
a first normalizing subunit 1651, configured to set an image acquisition device coordinate system with an optical center of an image acquisition device for acquiring a target image as an origin, and set a face coordinate system with an eye key point in an eye region map as an origin;
a second normalizing subunit 1652, configured to rotate the image acquisition device coordinate system so that its z-axis points to the origin of the face coordinate system and its x-axis is parallel to the x-axis of the face coordinate system, to obtain a rotation matrix;
The image acquisition device coordinate system is rotated so that its z-axis points to the origin of the face coordinate system, which eliminates the eye-region appearance differences caused by face translation; it is then rotated so that its x-axis is parallel to the x-axis of the face coordinate system, which eliminates the eye-region appearance differences caused by head roll rotation, yielding the final rotation matrix. The face coordinate system can be established with an eye key point as the origin, for example the pupil center of the left eye or the right eye, or the midpoint of the line connecting the two pupils. Compared with the prior-art scheme of computing the face coordinate system from whole-face key points and rotating accordingly, which often produces large head-pose errors in occlusion scenes such as wearing a mask, where key points of the nose tip and mouth are difficult to detect accurately, using the eye key points to establish an improved face coordinate system and performing the subsequent normalization on that basis solves the problem of low head-pose accuracy in such occlusion scenes. In addition, the line of sight can be detected using only accurate eye key points, which preserves gaze precision while greatly improving detection speed and the robustness of line-of-sight regression.
In the embodiment, the z-axis of the coordinate system of the image capturing device is adjusted first, and then the x-axis of the coordinate system of the image capturing device is adjusted. However, the sequence of steps is only an example and not a strict limitation, and it is only necessary that the adjusted z-axis and x-axis directions of the first virtual coordinate system satisfy the condition at the same time, and those skilled in the art may perform corresponding adjustment and transformation based on the principle of the embodiment, for example, first perform adjustment on the x-axis of the coordinate system of the image capturing device, and then perform adjustment on the z-axis of the coordinate system of the image capturing device.
A third normalizing subunit 1653, configured to move the image acquisition device coordinate system according to the scaling matrix, where the scaling matrix is determined according to a distance between an origin of the face coordinate system and the image acquisition device;
For objects of the same size, when the focal length of the image acquisition device is fixed, keeping the imaging size the same requires keeping the object distance the same, i.e. the distance between the three-dimensional coordinates of the target in the image acquisition device coordinate system and the origin of that coordinate system. In a scene where the face is at the edge of the image and the eye-image appearance changes, such as when the face deviates from the optical axis of the camera, the distance between the image acquisition device and the three-dimensional point corresponding to the center of the eye region map changes, which affects the imaging size. To keep the imaging scale consistent, the coordinate system is moved according to a scaling matrix determined by the distance between the image acquisition device and the three-dimensional point corresponding to the center of the eye region map. For example, if that point is too far from the image acquisition device, the coordinate system is translated toward it until the imaging size of the target face is the same.
A fourth normalizing subunit 1654 configured to determine a virtual camera coordinate system according to the image capturing device camera coordinate system, the rotation matrix, and the scaling matrix;
the camera coordinate system of the image acquisition device is adjusted according to the rotation matrix obtained in the second normalizing subunit 1652 and the scaling matrix obtained in the third normalizing subunit 1653 to obtain a final virtual camera coordinate system, which is the image acquisition device coordinate system after normalization processing.
A fifth normalizing subunit 1655, configured to map the eye region map into the virtual camera coordinate system, generating a first eye region map.
The eye region map is mapped into the virtual camera coordinate system by the normalization algorithm to obtain the first eye region map, and the gaze direction is accordingly converted from the camera coordinate system into the virtual camera coordinate system, the virtual camera coordinate system being the normalized coordinate system. Because the head pose directly affects the gaze estimation result, even when the head pose varies over a wide range, the invention can still effectively eliminate the eye appearance differences caused by face translation and head roll rotation, as well as the impact of occlusions such as face masks on line-of-sight regression precision. A sketch of the normalization, under assumed parameter values, follows.
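The sketch below illustrates one way to build the rotation and scaling matrices and warp the eye patch into the virtual camera. Reusing the original intrinsic matrix as the virtual camera matrix, the reference distance and the output size are all assumptions of this sketch, not the claimed method.

```python
import numpy as np
import cv2


def normalize_eye(image, eye_center_3d, right_to_left_3d, camera_matrix,
                  reference_distance=0.6, out_size=(96, 64)):
    """Map an eye patch into the virtual (normalized) camera described above.

    `eye_center_3d` is the eye key point (origin of the face coordinate system) in the
    camera coordinate system; `right_to_left_3d` approximates the face x-axis.
    """
    # Rotation: virtual z-axis points at the eye, virtual x-axis parallel to the face x-axis.
    z = eye_center_3d / np.linalg.norm(eye_center_3d)
    x = right_to_left_3d - np.dot(right_to_left_3d, z) * z
    x /= np.linalg.norm(x)
    y = np.cross(z, x)
    R = np.stack([x, y, z])                 # rotation matrix (camera -> virtual camera)

    # Scaling: keep the imaging size consistent regardless of the true eye distance.
    S = np.diag([1.0, 1.0, reference_distance / np.linalg.norm(eye_center_3d)])

    # Image warp realizing the virtual camera; R is kept for de-normalizing the gaze result.
    W = camera_matrix @ S @ R @ np.linalg.inv(camera_matrix)
    eye_patch = cv2.warpPerspective(image, W, out_size)
    return eye_patch, R                     # a predicted gaze g_norm maps back as R.T @ g_norm
```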
The pre-trained neural network is trained on a face image set carrying gaze direction labeling information, and the first eye region map produced by the normalization processing module 1605 is input into the gaze detection module 1603 to perform gaze detection on the target face and obtain a gaze detection result. The classification layer performs regression analysis on the pitch angle and yaw angle of the left eye and/or right eye of the target face according to the spliced feature vectors, and outputs the gaze detection result of the target face in the virtual camera coordinate system. Since the gaze detection result output by the classifier is expressed in the virtual camera coordinate system rather than the image acquisition device coordinate system, it is inversely transformed using the rotation matrix from the normalization processing to obtain the gaze direction in the image acquisition device coordinate system.
Due to large individual differences, such as different distances between pupils, different sizes of eyeballs, different sizes of pupils and different shapes of eyes, the difference of images input into the neural network is significant, so that the accuracy of sight estimation of a general pre-trained neural network model is influenced, and the generalization of the neural network model is poor.
In order to solve the angle error of the sight estimation caused by the individual difference and ensure the neural network sight regression and generalization capability, in an optional embodiment, the sight direction detection apparatus 160 further includes a calibration image acquisition module 1606, configured to acquire a plurality of sight calibration images of the target face, where the sight calibration images include the calibrated sight direction of the target face.
In an optional embodiment, each calibrated face map and each calibrated eye region map is obtained by sequentially performing face region detection and face key point detection on one of the gaze calibration images, where the calibrated face map comprises a calibrated face region map and/or a calibrated face shape map, and the calibrated eye region map comprises a calibrated left-eye image and/or a calibrated right-eye image.
In an alternative embodiment, the pre-trained neural network included in the gaze detection module 1603 comprises: the feature extraction layer 1630, further configured to perform feature extraction on one calibrated face map and one calibrated eye region map to obtain a third feature vector and a fourth feature vector; and the classification layer 1640, which includes a pre-calibration branch 16400 configured to sequentially combine the third feature vectors and fourth feature vectors of the plurality of gaze calibration images and determine the gaze detection result.
Specifically, the step of the feature extraction layer extracting features for a calibrated face image and a calibrated eye region image is the same as the step of extracting features of the face image and the eye region image, and is not described in detail here.
In an alternative embodiment, pre-calibration branch 16400 comprises:
a first combination unit 16401, configured to splice the first feature vector and the second feature vector, splice the third feature vector and the fourth feature vector of one gaze calibration image, and take the difference of the two spliced feature vectors to obtain a second description feature vector;
a second determining unit 16402, configured to determine a candidate detection result according to the second descriptive feature vector line-of-sight regression;
In an optional embodiment, the second determining unit 16402 includes a second determining subunit, configured so that a fully connected layer performs line-of-sight regression on the second description feature vector to obtain the gaze angle difference between the target image and one gaze calibration image; and a third determining subunit, configured to determine a candidate detection result according to the gaze angle difference and the calibrated gaze direction contained in that gaze calibration image.
a third determining unit 16403, configured to sequentially combine the remaining gaze calibration images and perform feature extraction, feature combination and line-of-sight regression to obtain all candidate detection results;
For each of the other collected gaze calibration images, the calibration image and the target image are input in turn, and feature extraction, feature combination and line-of-sight regression are performed to obtain all candidate detection results, which avoids the interference of an extreme value from any single calibration image on the gaze estimation precision.
A fourth determining unit 16404 is configured to determine a gaze detection result by weighted averaging all candidate detection results.
Unlike a conventional network that learns absolute values, the gaze direction detection device estimates the gaze of the target image by learning the angle difference between the target image and a calibration image: the target image and one calibration image are input, and the gaze angle difference between them is output, i.e. a relative rather than an absolute gaze direction, which avoids the angle error caused by individual differences. The gaze direction of the target face is obtained as the weighted average of all candidate gaze directions, where the weight is computed from the gaze angle difference and is lower when the difference is larger; this avoids the interference of extreme values caused by an excessive angle difference between a calibration image and the target image, further improving the precision of gaze estimation. A sketch of this weighted-average scheme follows.
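A minimal sketch of the calibration-based estimate. The inverse-of-difference weighting rule is an assumption consistent with "larger difference, lower weight"; predict_angle_diff stands in for the pre-calibration branch.

```python
import numpy as np


def calibrated_gaze(predict_angle_diff, target_feats, calib_feats, calib_gazes, eps=1e-6):
    """Estimate the target gaze from several calibration images by learning differences.

    `predict_angle_diff(target_feats, calib_feats_i)` regresses the (pitch, yaw) difference
    between the target image and one calibration image; `calib_gazes[i]` is the calibrated
    gaze direction contained in that calibration image.
    """
    candidates, weights = [], []
    for feats_i, gaze_i in zip(calib_feats, calib_gazes):
        diff = predict_angle_diff(target_feats, feats_i)     # gaze angle difference
        candidates.append(gaze_i + diff)                     # candidate detection result
        weights.append(1.0 / (np.linalg.norm(diff) + eps))   # larger difference -> lower weight
    weights = np.asarray(weights) / np.sum(weights)
    return np.average(np.asarray(candidates), axis=0, weights=weights)
```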
In an alternative embodiment, the gaze direction detecting apparatus 160 further comprises a gaze region detecting module 1607 for determining a gaze region based on the gaze direction and the pre-divided region spatial location.
If the gaze direction is located in the camera coordinate system, the gaze region detection module 1607 is required to determine the spatial position of each fixed region in the camera coordinate system in advance to obtain the final gaze region.
In an optional embodiment, the pre-divided region space is expressed as the spatial plane equation of each region in the camera coordinate system, established as follows: the spatial plane equation of the region in the camera coordinate system is constructed from a three-dimensional digital model, or the region space is described by several irregular polygons and the spatial plane equation in the camera coordinate system is determined from those polygons and the transformation between the region space and the camera coordinate system. Specifically, for many scenes and devices, the spatial plane equation of the region in the camera coordinate system can be built from the three-dimensional digital model of the scene or device provided by the manufacturer, which contains the positional relationship between the region and the camera coordinate system. When no three-dimensional model with a known transformation is available, the region space is described by several irregular polygons, and the spatial plane equation in the camera coordinate system is determined from the irregular polygons and the transformation between the region space and the camera coordinate system; the irregular polygons and that transformation can be obtained by prior calibration.
For in-vehicle driver gaze region detection, the vehicle space can be pre-divided into a plurality of regions, for example: the left front window, the right front window, the left rear-view mirror region, the right rear-view mirror region, the interior rear-view mirror, the center console, the instrument panel region, the front passenger region, the gear lever region and the front windshield region. Because the spatial layout differs between vehicle models, the gaze region types can be divided per vehicle model, or according to the distribution of actual viewpoints within the gaze regions.
In an alternative embodiment, the gaze area detection module 1607 includes: determining the boundary of each pre-divided region space, and determining the intersection point with the region space based on the sight line direction; and when the intersection point position is positioned in the boundary range of a pre-divided area space, determining the corresponding area space as a watching area.
Specifically, determining the user's gaze region amounts to finding the intersection of the gaze direction with the region space planes in the camera coordinate system. Since the preset regions do not lie in one plane and the coordinate origin of each region plane is not the upper-left corner of that plane, the boundary of each region plane must be determined, and whether the viewpoint falls within the plane is judged against that boundary to determine the gaze region; a minimal sketch follows.
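The sketch below intersects the gaze ray with each region plane and checks the boundary. The axis-aligned bound check is a deliberate simplification of the point-in-polygon test, and the data layout of `regions` is an assumption for illustration.

```python
import numpy as np


def gaze_region(eye_origin, gaze_dir, regions):
    """Find which pre-divided region plane the gaze ray hits.

    `regions` maps a region name to (plane_point, plane_normal, polygon_3d), all expressed
    in the camera coordinate system; `polygon_3d` is an (N, 3) array of boundary vertices.
    """
    for name, (p0, n, polygon) in regions.items():
        denom = np.dot(n, gaze_dir)
        if abs(denom) < 1e-9:                    # gaze parallel to this region plane
            continue
        t = np.dot(n, p0 - eye_origin) / denom
        if t <= 0:                               # plane is behind the eye
            continue
        hit = eye_origin + t * gaze_dir          # intersection of gaze ray and plane
        lo, hi = polygon.min(axis=0), polygon.max(axis=0)
        if np.all(hit >= lo - 1e-3) and np.all(hit <= hi + 1e-3):
            return name                          # intersection falls inside the boundary
    return None
```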
In an alternative embodiment, the gaze direction detection apparatus 160 further comprises a network training module 1608 for training the neural network. Fig. 18 is a block diagram of an alternative network training module according to an embodiment of the present invention. In an alternative embodiment, the network training module 1608 comprises:
a sample obtaining unit 1681, configured to obtain a sample image, where the sample image includes gaze direction labeling information;
a sample detection unit 1682, configured to perform face region detection and face key point detection on the sample image, and obtain a face sample map and an eye region sample map according to the detection results, where the face sample map includes a face region sample map and/or a face shape sample map, and the eye region sample map includes a left-eye sample image and a right-eye sample image;
a first training unit 1683, configured to input the face sample map and the eye region sample map into an initial neural network for training, and construct the pre-trained neural network, where the initial neural network includes a first binocular information branch, a first monocular information branch, and a first confidence coefficient branch.
In an alternative embodiment, the first training unit 1683 further comprises:
a sample predictor 16831 for inputting the face sample image and the eye area sample image into an initial neural network to obtain predicted gaze direction information and predicted left and right eye visible information;
a first parameter adjustment subunit 16832, configured to adjust a first binocular information branch and a first monocular information branch parameter of the initial neural network according to a difference between the predicted gaze direction information and the gaze direction labeling information;
In an alternative embodiment, the first parameter adjustment subunit 16832 is configured so that: the first binocular information branch inputs the predicted gaze direction information and the gaze direction labeling information into a first loss function module and adjusts the parameters of the first binocular information branch according to the first loss value output by that module, where the first loss value is the sum of a first left-eye gaze loss value and a first right-eye gaze loss value; and the first monocular information branch inputs the predicted gaze direction information and the gaze direction labeling information into a second loss function module, whose second loss value comprises a second left-eye gaze loss value and a second right-eye gaze loss value, the parameters of the first monocular information branch being updated according to the second left-eye gaze loss value and the second right-eye gaze loss value respectively. The first loss function module and the second loss function module can be L1 loss functions; a sketch follows.
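A minimal sketch of these loss terms using L1 losses in PyTorch; feeding separate left-eye and right-eye annotations, and applying both to the binocular prediction, are assumptions of the sketch.

```python
import torch.nn.functional as F


def branch_losses(pred_both, pred_left, pred_right, label_left, label_right):
    """L1 losses for the first binocular and first monocular information branches.

    All tensors hold (pitch, yaw) gaze angles with shape (batch, 2).
    """
    # First loss value: sum of the left-eye and right-eye losses of the binocular branch.
    loss_bino = F.l1_loss(pred_both, label_left) + F.l1_loss(pred_both, label_right)
    # Second loss value: left-eye and right-eye parts, used to update the monocular branch separately.
    loss_mono_left = F.l1_loss(pred_left, label_left)
    loss_mono_right = F.l1_loss(pred_right, label_right)
    return loss_bino, loss_mono_left, loss_mono_right
```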
And a second parameter adjusting subunit 16833, configured to perform unsupervised learning training on the first confidence level branch according to the predicted gaze direction information and the predicted left-right eye visible information, and adjust a parameter of the first confidence level branch.
In an alternative embodiment, the second parameter adjustment subunit 16833 is configured so that: the left-eye and right-eye angle errors contained in the predicted gaze direction information are used to characterize the visibility confidence of a sample, and the visibility confidence is mapped into binary-classification pseudo labels; the first confidence branch inputs the sample visibility confidence and the predicted left/right-eye visibility information into a third loss function module, and adjusts the parameters of the first confidence branch according to the third loss value output by that module, where the third loss value is the sum of a third left-eye gaze loss value and a third right-eye gaze loss value.
Considering that the proportions of samples with the left eye invisible and with the right eye invisible are unbalanced in the sample data, the first confidence branch is trained with parameters shared between the left and right eyes: the features extracted from the left-eye image and from the right-eye image are each passed through the first confidence branch once, and the left-eye and right-eye losses are summed and used for a joint update. The first confidence branch is trained in an unsupervised manner: since the predicted angle error is generally larger for samples with low visibility and smaller for clearly visible samples, the invention uses the magnitude of the left-eye and right-eye angle errors predicted by the first monocular information branch to characterize the visibility confidence of a sample and maps it into binary-classification pseudo labels, as sketched below.
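One way such pseudo labels could be produced and used is sketched below; the error measure and the threshold value are assumptions, not the patented formulation.

```python
import torch.nn.functional as F


def confidence_pseudo_labels(pred_left, pred_right, label_left, label_right, threshold=5.0):
    """Map monocular prediction errors to binary visibility pseudo labels (unsupervised).

    An eye whose predicted gaze angle error is below `threshold` degrees is treated as
    "visible"; all tensors hold (pitch, yaw) angles with shape (batch, 2).
    """
    err_left = (pred_left - label_left).abs().sum(dim=1)     # per-sample left-eye angle error
    err_right = (pred_right - label_right).abs().sum(dim=1)  # per-sample right-eye angle error
    return (err_left < threshold).long(), (err_right < threshold).long()


def confidence_loss(conf_left_logits, conf_right_logits, pseudo_left, pseudo_right):
    """Third loss value: sum of the left-eye and right-eye classification losses."""
    return (F.cross_entropy(conf_left_logits, pseudo_left)
            + F.cross_entropy(conf_right_logits, pseudo_right))
```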
The embodiment provided by the above steps avoids introducing an additional labeling task: the first confidence branch is trained by an unsupervised learning method and optimized with pseudo labels generated from the gaze angle errors predicted by the first monocular information branch. The first binocular information branch, first monocular information branch and first confidence branch obtained after training on sample data are the binocular information branch, monocular information branch and confidence branch of the pre-trained neural network.
The gaze detection neural network requires a large number of gaze data samples as a training basis. A gaze data sample contains not only an eye image but also gaze direction labeling information, and unlike classification or detection tasks, three-dimensional gaze data is difficult to label manually. In a gaze acquisition system, the spatial position of each region space under the camera must be determined, and the spatial position of the user's pupil must be determined at the same time, so that the gaze direction can be computed; both reduce to measuring the spatial position of a target. The traditional method of locating the spatial position of a target estimates it in the camera coordinate system from the digital-model design information of the scene, involves no actual measurement, cannot meet the accuracy required for gaze, and cannot adapt to different vehicle models or application scenes. In addition, methods based on synthetic data and methods based on domain-adaptive transfer learning can supplement the gaze data samples, but currently they still fall short of models trained on real data. The invention provides a gaze direction detection device comprising a sample acquisition unit which, by constructing the positional relationships among multiple cameras, automatically identifies the spatial position of the region and the position of the user's pupils, thereby determining the gaze direction in a single camera coordinate system and acquiring a large number of accurate sample images.
Referring to fig. 19, a block diagram of an alternative sample image acquisition unit according to an embodiment of the present invention is shown. As shown in fig. 19, the sample acquiring unit 1681 includes:
a capture subunit 16811, configured to capture images of the user looking at the viewpoint label by the first camera, the second depth camera, and the third depth camera at the same time, and obtain a first image, a second image, and a third image, where the first camera and the second depth camera are oriented in the same direction, and the first camera and the third depth camera are oriented in opposite directions;
a positioning subunit 16812, configured to determine, according to the second image and the third image, a position of a pupil of the user in the second depth camera coordinate system and a position of the viewpoint label in the third depth camera coordinate system, respectively;
and the annotation determining subunit 16813 is configured to determine, according to the pre-calibrated position relationship among the first camera, the second depth camera, and the third depth camera, the position of the pupil of the user in the second depth camera coordinate system and the position of the viewpoint label in the third depth camera coordinate system, the gaze direction annotation information of the user in the first camera coordinate system.
In an alternative embodiment, the annotation determining subunit 16813 is configured to: establish a first positional relationship between the first camera and the second depth camera through a single-sided calibration plate; and establish a second positional relationship between the first camera and the third depth camera through a double-sided calibration plate. Specifically, because the first camera and the third depth camera face opposite directions, the first camera shooting from the vehicle body toward the driver and the third depth camera shooting from the driver position toward the vehicle body, the positional relationship between these two cameras is established with a double-sided calibration plate. The pattern of the single-sided and double-sided calibration plates is not limited, for example a checkerboard.
In an alternative embodiment, positioning subunit 16812 includes:
a first positioning subunit, configured to obtain a third depth map and a third color map from the third image, and a second depth map and a second color map from the second image;
specifically, the data obtained by the depth camera includes two video streams, which are an RGB image and a depth map. Therefore, the third image acquired by the third depth camera comprises a third depth map and a third color map, and the second image acquired by the second depth camera comprises a second depth map and a second color map.
And the second positioning subunit is used for detecting the key points of the face of the second color image, acquiring the coordinates of the pupil of the user in the second color image coordinate system, acquiring the coordinates of the pupil of the user on the second depth image according to the coordinates of the pupil of the user in the second color image coordinate system, and determining the position of the pupil of the user in the second depth camera coordinate system.
The third positioning subunit is used for identifying the viewpoint label in the third color map and obtaining the coordinate of the viewpoint label in the third color map coordinate system; and obtaining the coordinates of the viewpoint label on the third depth map according to the coordinates of the viewpoint label in the third color map coordinate system, and determining the position of the viewpoint label in the third depth camera coordinate system.
After the position of the user's pupil in the second depth camera coordinate system is obtained, it is combined with the pre-calibrated first positional relationship between the first camera and the second depth camera to determine the spatial position of the user's pupil under the first camera. Likewise, the position of the viewpoint label in the third depth camera coordinate system is combined with the pre-calibrated second positional relationship between the first camera and the third depth camera to obtain the position of the viewpoint label in the first camera coordinate system at the same moment. The line connecting the user's pupil position and the viewpoint label in the first camera coordinate system is the finally acquired gaze direction of the user, as sketched below.
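A minimal sketch of this coordinate transformation, assuming the two pre-calibrated positional relationships are available as 4x4 homogeneous extrinsic matrices:

```python
import numpy as np


def gaze_label_in_first_camera(pupil_cam2, label_cam3, T_cam2_to_cam1, T_cam3_to_cam1):
    """Compute the annotated gaze direction in the first camera coordinate system.

    `pupil_cam2` is the user's pupil in the second depth camera, `label_cam3` is the
    viewpoint label in the third depth camera; T_* are pre-calibrated 4x4 extrinsics.
    """
    def transform(T, p):
        return (T @ np.append(p, 1.0))[:3]

    pupil = transform(T_cam2_to_cam1, pupil_cam2)   # pupil position under the first camera
    label = transform(T_cam3_to_cam1, label_cam3)   # viewpoint label under the first camera
    direction = label - pupil                       # line from pupil to viewpoint label
    return direction / np.linalg.norm(direction)
```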
During gaze acquisition, the moment at which the user looks at a given region of the vehicle body may not be synchronized with the moment at which the gaze images are saved. The invention therefore provides a method that fuses voice and images, saving the gaze images synchronously through automatic speech recognition. In an alternative embodiment, the sample acquisition unit 1681 further includes a voice subunit 16814, configured to receive and recognize voice information and, when the voice information contains a capture instruction, have the first camera, the second depth camera and the third depth camera simultaneously capture images of the user gazing at the viewpoint label. For example, when the user says "12", the system automatically saves the pictures from the three cameras, so that images and depth information from the first camera, the second depth camera and the third depth camera at the same moment are obtained, avoiding unsynchronized image acquisition.
In many applications, such as transparent A-pillars, gaze-triggered screen wake-up and head-up display systems, both the gaze region and the spatial position of the user under each camera need to be known. Because the field of view of a camera is limited, a single camera can only capture part of the in-vehicle scene: some cameras can only see the user's face, while others can only see the region coordinates on the vehicle body. The acquisition method designed by the invention constructs the positional relationships among multiple cameras to automatically identify the region position and the user's pupil position, and determines the spatial positions of the eyes and the gaze point in the same camera coordinate system, thereby determining the user's gaze direction and finally obtaining the user's gaze direction labeling information in the first camera coordinate system. The invention thus sets up a simple and efficient gaze data acquisition system in which the gaze direction labeling information is obtained merely by capturing images; user operation is simple and efficient, and a large number of accurate user gaze calibration data samples can be obtained.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device, including: a processor; and a memory for storing executable instructions of the processor; wherein the processor is configured to perform any one of the gaze direction detection methods via execution of executable instructions.
According to another aspect of the embodiments of the present invention, there is also provided a storage medium including a stored program, wherein when the program is executed, the device in which the storage medium is located is controlled to execute any one of the gaze direction detection methods.
The sequence numbers of the embodiments of the present invention are merely for description, and do not represent the advantages or disadvantages of the embodiments.
In the embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described apparatus embodiments are merely illustrative; for example, the division into units may be a division by logical function, and an actual implementation may divide them differently, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or take another form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware or a form of software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that it is obvious to those skilled in the art that various modifications and improvements can be made without departing from the principle of the present invention, and these modifications and improvements should also be considered as the protection scope of the present invention.

Claims (45)

1. A gaze direction detection method, comprising:
acquiring a target image containing a target face;
carrying out face region detection and face key point detection on the target image, and acquiring a face image and an eye region image according to a detection result, wherein the face image comprises a face region image and/or a face shape image, and the eye region image comprises a left eye image and/or a right eye image;
inputting the face image and the eye region image into a pre-trained neural network to perform sight line detection on the target face to obtain a sight line detection result;
and determining the sight direction of the target face according to the sight detection result.
2. The gaze direction detection method of claim 1, wherein the pre-trained neural network comprises:
the feature extraction layer is used for extracting features of the face image and the eye region image to obtain a first feature vector and a second feature vector;
and the classification layer is used for determining the sight line detection result according to the sight line regression of the first feature vector and the second feature vector.
3. The gaze direction detection method according to claim 2, characterized in that the second feature vector includes a left-eye feature vector and/or a right-eye feature vector; the classification layer includes: a binocular information branch; or a monocular information branch; or a binocular information branch, a monocular information branch, and a confidence branch.
4. The gaze direction detection method of claim 3, wherein the classification layer determines the gaze detection result from the first feature vector and the second feature vector gaze regression, comprising:
combining and splicing the first feature vector, the left-eye feature vector and/or the right-eye feature vector by each branch of the classification layer respectively to obtain a first description feature vector;
and determining the sight line detection result by each branch of the classification layer according to the sight line regression of the corresponding first description feature vector.
5. The gaze direction detection method of claim 4, wherein determining the gaze detection result by each branch of the classification layer according to line-of-sight regression of the corresponding first description feature vector comprises:
the monocular classifier included in the monocular information branch performs sight line regression respectively according to the first left-eye feature vector and the first right-eye feature vector to determine a first detection result;
the binocular classifier included in the binocular information branch performs line of sight regression according to the first binocular feature vector to determine a second detection result;
and the confidence classifier included in the confidence branch performs binary classification of the reliability of the left-eye image and the right-eye image according to the left-eye feature vector and the right-eye feature vector respectively, and determines a third detection result.
6. The gaze direction detecting method according to claim 5,
the first left-eye feature vector is generated by splicing the left-eye feature vector and the first feature vector;
the first right-eye feature vector is generated by splicing the right-eye feature vector and the first feature vector;
the first binocular feature vector is generated by stitching the left-eye feature vector, the right-eye feature vector and the first feature vector.
7. The gaze direction detection method according to claim 5, wherein determining the gaze direction of the target face from the gaze detection result comprises:
when the confidence degree branch is not included, taking the first detection result or the second detection result as the sight line detection result, and determining the sight line direction of the target face;
when the confidence branch is included, the method comprises the following steps:
if the third detection result is that the left eye image and the right eye image are both credible, determining the second detection result as the sight line direction of the target face;
if the third detection result is that the left eye image or the right eye image is credible, determining the first detection result as the sight line direction of the target face;
and if the third detection result indicates that neither the left-eye image nor the right-eye image is credible, no valid result is output.
8. A gaze direction detecting method according to claim 1, wherein before inputting the face map and the eye region map into a pre-trained neural network to perform gaze detection on the target face and obtain a gaze detection result, the method further comprises: and carrying out normalization processing on the eye region map.
9. The gaze direction detection method according to claim 8, wherein the normalization processing of the eye region map includes:
the image acquisition device coordinate system moves according to the scaling matrix and the rotation matrix to generate a virtual camera coordinate system;
and mapping the eye area map to the virtual camera coordinate system to generate the normalized eye area map.
10. The gaze direction detection method of claim 2, further comprising: and acquiring a plurality of sight calibration images of the target face, wherein the sight calibration images comprise calibration sight directions of the target face.
11. The gaze direction detection method of claim 10,
the feature extraction layer is further used for extracting features of a calibrated face image and a calibrated eye region image to obtain a third feature vector and a fourth feature vector;
the classification layer comprises a pre-calibration branch, and the pre-calibration branch is used for sequentially combining the third feature vectors and the fourth feature vectors of a plurality of sight line calibration images and determining the sight line detection result.
12. The gaze direction detection method according to claim 11, wherein each of the calibrated face maps and the calibrated eye region maps is obtained by performing face region detection and face key point detection on one of the gaze calibration images in sequence, wherein the calibrated face maps comprise calibrated face region maps and/or calibrated face shape maps, and the eye region maps comprise calibrated left eye images and/or calibrated right eye images.
13. The gaze direction detection method according to claim 11, wherein the pre-calibration branch sequentially combines the third feature vector and the fourth feature vector to determine the detection result, comprising:
splicing the first feature vector and the second feature vector, splicing the third feature vector and the fourth feature vector of one sight line calibration image, and taking the difference of the two spliced feature vectors to obtain a second description feature vector;
determining a candidate detection result according to the second description feature vector line of sight regression;
sequentially combining the remaining sight line calibration images, and performing feature extraction, feature combination and sight line regression to obtain all candidate detection results;
and carrying out weighted average on all the candidate detection results to determine the sight line detection result.
14. The gaze direction detection method of claim 13, wherein determining candidate detection results from the second descriptive feature vector gaze regression comprises:
the full connection layer obtains the sight angle difference between the target image and one sight calibration image according to the sight regression of the second description feature vector;
and determining the candidate detection result according to the sight angle difference and the calibrated sight direction contained in the sight calibration image.
15. The gaze direction detection method of claim 1, further comprising: determining a gaze region based on the gaze direction and a pre-partitioned region space.
16. The gaze direction detection method of claim 15, wherein the pre-divided region space comprises at least one of: a left front window, a right front window, a left rear-view mirror region, a right rear-view mirror region, an interior rear-view mirror, a center console, an instrument panel region, a front passenger region, a gear lever region and a front windshield region.
17. The gaze direction detection method according to claim 16, wherein the pre-divided region space is a spatial plane equation of a region under a camera coordinate system, established by: constructing the spatial plane equation of the region under the camera coordinate system from a three-dimensional digital model, or describing the region space by a plurality of irregular polygons and determining the spatial plane equation of the region under the camera coordinate system according to the irregular polygons and the transformation relation between the region space and the camera coordinate system.
18. The gaze direction detection method of claim 17, wherein determining a gaze region based on the gaze direction and a pre-divided region space comprises:
determining a boundary of each of the pre-divided region spaces, and determining an intersection point with the region space based on the sight line direction;
and when the intersection point is located within the boundary range of one pre-divided region space, determining the corresponding region space as the gazing region.
19. A gaze direction detection method according to any of claims 1-18, characterised in that the training method of the pre-trained neural network comprises:
obtaining a sample image, wherein the sample image comprises sight line direction marking information;
carrying out face region detection and face key point detection on the sample image, and acquiring a face sample image and an eye region sample image according to the detection result, wherein the face sample image comprises a face region sample image and/or a face shape sample image, and the eye region sample image comprises a left eye sample image and a right eye sample image;
and inputting the face sample image and the eye area sample image into an initial neural network for training, and constructing the pre-trained neural network, wherein the initial neural network comprises a first binocular information branch, a first monocular information branch and a first confidence coefficient branch.
20. A gaze direction detection method according to claim 19, wherein the inputting of the face sample map and the eye region sample map into an initial neural network for training and the constructing of the pre-trained neural network comprises:
inputting the face sample image and the eye area sample image into the initial neural network to obtain predicted sight direction information and predicted left and right eye visible information;
adjusting parameters of the first binocular information branch and the first monocular information branch according to the difference between the predicted gaze direction information and the gaze direction labeling information;
and carrying out unsupervised learning training on the first confidence coefficient branch according to the predicted sight direction information and the predicted left-right eye visible information, and adjusting the parameters of the first confidence coefficient branch.
21. The gaze direction detection method according to claim 20, wherein adjusting the parameters of the first binocular information branch and the first monocular information branch of the initial neural network according to the difference between the predicted gaze direction information and the gaze direction labeling information comprises:
the first binocular information branch inputs the predicted gaze direction information and the gaze direction labeling information into a first loss function module, and adjusts parameters of the first binocular information branch according to a first loss value output by the first loss function module, wherein the first loss value is the sum of a first left-eye gaze loss value and a first right-eye gaze loss value;
and the first monocular information branch inputs the predicted sight direction information and the sight direction marking information into a second loss function module, the second loss function module outputs a second loss value, the second loss value comprises a second left-eye sight loss value and a second right-eye sight loss value, and the parameters of the first monocular information branch are updated and adjusted respectively according to the second left-eye sight loss value and the second right-eye sight loss value.
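A sketch of the first and second loss values of claim 21, written against the output tuple of the GazeNet sketch above. The choice of an L1 angle loss is an assumption; the patent only states that the first loss value sums a left-eye and a right-eye gaze loss and that the second loss value keeps the two eyes' losses separate.

import torch
import torch.nn.functional as F

def gaze_losses(pred, target_l, target_r):
    """Compute the first and second loss values (L1 angle loss assumed).

    pred : (both, gaze_l, gaze_r, vis_l, vis_r) as returned by the GazeNet sketch
    target_l, target_r : labelled left / right eye gaze directions, shape (B, 2)
    """
    both, gaze_l, gaze_r, _, _ = pred
    both_l, both_r = both[:, :2], both[:, 2:]
    # first loss value: sum of a left-eye and a right-eye gaze loss on the binocular branch
    first_loss = F.l1_loss(both_l, target_l) + F.l1_loss(both_r, target_r)
    # second loss value: separate left-eye and right-eye losses, each updating
    # the corresponding monocular branch parameters
    second_left = F.l1_loss(gaze_l, target_l)
    second_right = F.l1_loss(gaze_r, target_r)
    return first_loss, second_left, second_right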
22. The gaze direction detection method of claim 20, wherein performing unsupervised learning training on the first confidence branch based on the predicted gaze direction information and the predicted left and right eye visibility information, adjusting parameters of the first confidence branch comprises:
using the angle errors of the left eye and the right eye contained in the predicted sight direction information to characterize the visibility confidence of a sample, and mapping the visibility confidence of the sample into binary classification pseudo labels;
and the first confidence branch inputs the visible confidence of the sample and the predicted left-right eye visible information into a third loss function module, and adjusts the parameters of the first confidence branch according to a third loss value output by the third loss function module, wherein the third loss value is the sum of a third left-eye sight loss value and a third right-eye sight loss value.
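A sketch of the unsupervised confidence training in claim 22: the per-eye angle errors act as the sample's visibility confidence and are thresholded into binary pseudo labels for the confidence branch. The error threshold, the L1 error measure and the binary cross-entropy loss are assumptions.

import torch
import torch.nn.functional as F

def confidence_loss(pred, target_l, target_r, err_threshold=0.2):
    """Third loss value: left-eye plus right-eye confidence terms (sketch).

    pred : (both, gaze_l, gaze_r, vis_l, vis_r) as returned by the GazeNet sketch
    """
    _, gaze_l, gaze_r, vis_l, vis_r = pred
    with torch.no_grad():
        # per-eye angle error characterizes the sample's visibility confidence
        err_l = (gaze_l - target_l).abs().sum(dim=1, keepdim=True)
        err_r = (gaze_r - target_r).abs().sum(dim=1, keepdim=True)
        # map the visibility confidence to binary pseudo labels: small error -> visible
        pseudo_l = (err_l < err_threshold).float()
        pseudo_r = (err_r < err_threshold).float()
    return (F.binary_cross_entropy(vis_l, pseudo_l)
            + F.binary_cross_entropy(vis_r, pseudo_r))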
23. The gaze direction detecting method according to claim 19, wherein acquiring a sample image including gaze direction labeling information comprises:
capturing, by a first camera, a second depth camera and a third depth camera simultaneously, images of a user gazing at a viewpoint label to obtain a first image, a second image and a third image, wherein the first camera and the second depth camera face the same direction, and the first camera and the third depth camera face opposite directions;
according to the second image and the third image, respectively determining the position of the pupil of the user in the second depth camera coordinate system and the position of the viewpoint label in the third depth camera coordinate system;
and according to the pre-calibrated position relationship among the first camera, the second depth camera and the third depth camera, the position of the pupil of the user in the second depth camera coordinate system and the position of the viewpoint label in the third depth camera coordinate system, determining the sight line direction marking information of the user in the first camera coordinate system.
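The annotation step of claim 23 amounts to mapping the pupil and the viewpoint label into the first camera's coordinate system and taking the ray between them. A numpy sketch, assuming the pre-calibrated positional relationships are available as 4x4 homogeneous transforms (the transform names are placeholders):

import numpy as np

def gaze_label_in_first_camera(pupil_in_cam2, label_in_cam3, T_1_from_2, T_1_from_3):
    """Compute the gaze direction labeling information in the first camera coordinate system.

    pupil_in_cam2 : 3D pupil position in the second depth camera coordinate system
    label_in_cam3 : 3D viewpoint label position in the third depth camera coordinate system
    T_1_from_2, T_1_from_3 : pre-calibrated 4x4 transforms mapping points from the
        second / third depth camera coordinate systems into the first camera coordinate system
    """
    to_h = lambda p: np.append(np.asarray(p, dtype=float), 1.0)   # homogeneous coordinates
    pupil_1 = (T_1_from_2 @ to_h(pupil_in_cam2))[:3]
    label_1 = (T_1_from_3 @ to_h(label_in_cam3))[:3]
    gaze = label_1 - pupil_1                 # ray from the pupil to the fixation point
    return gaze / np.linalg.norm(gaze)       # unit gaze direction in the first camera frame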
24. The gaze direction detection method according to claim 23, wherein pre-calibrating the positional relationship among the first camera, the second depth camera and the third depth camera comprises:
establishing a first positional relationship between the first camera and the second depth camera through a single-sided calibration plate; establishing a second positional relationship between the first camera and the third depth camera via a two-sided calibration plate.
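One way the two positional relationships of claim 24 could be composed once the calibration-plate poses are known, sketched with 4x4 homogeneous transforms. The assumption that the plate poses come from a standard chessboard/PnP calibration, and that the rigid transform between the two faces of the double-sided plate is known, is illustrative rather than stated by the patent.

import numpy as np

def extrinsic_from_shared_plate(T_cam1_from_plate, T_cam2_from_plate):
    """First positional relationship: both cameras observe the same single-sided
    calibration plate, so composing their plate poses yields the camera-to-camera
    transform (plate poses assumed to come from a standard PnP calibration)."""
    return T_cam1_from_plate @ np.linalg.inv(T_cam2_from_plate)

def extrinsic_from_two_sided_plate(T_cam1_from_front, T_cam3_from_back, T_front_from_back):
    """Second positional relationship: the first and third cameras face opposite sides
    of a two-sided plate; T_front_from_back is the known rigid transform between the
    plate's two printed faces (an assumption)."""
    return T_cam1_from_front @ T_front_from_back @ np.linalg.inv(T_cam3_from_back)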
25. The gaze direction detecting method of claim 23, wherein determining, from the second image and the third image, a position of the user's pupil in the second depth camera coordinate system and a position of the viewpoint label in the third depth camera coordinate system, respectively, comprises:
the third image comprises a third depth map and a third color map, and the second image comprises a second depth map and a second color map;
carrying out face key point detection on the second color map to obtain the coordinates of the user pupil in the second color map coordinate system; obtaining the coordinates of the user pupil on the second depth map according to the coordinates of the user pupil in the second color map coordinate system, and determining the position of the user pupil in the second depth camera coordinate system;
identifying the viewpoint label in the third color map, and obtaining the coordinates of the viewpoint label in the third color map coordinate system; and obtaining the coordinates of the viewpoint label on the third depth map according to the coordinates of the viewpoint label in the third color map coordinate system, and determining the position of the viewpoint label in the third depth camera coordinate system.
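Going from a pixel coordinate plus a depth value to a 3D position in the depth camera coordinate system, as claim 25 requires for both the pupil and the viewpoint label, is a standard pinhole back-projection. A sketch assuming known intrinsics K and a depth map registered to the color map:

import numpy as np

def backproject(pixel_uv, depth_map, K):
    """Lift a pixel (e.g. the pupil or the viewpoint label) into the depth
    camera coordinate system using the pinhole model."""
    u, v = pixel_uv
    z = float(depth_map[int(round(v)), int(round(u))])   # depth sampled at the key point
    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.array([x, y, z])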
26. The gaze direction detection method according to claim 23, wherein before the first camera, the second depth camera and the third depth camera simultaneously capture images of the user gazing at the viewpoint label, the method further comprises: receiving and recognizing voice information, and when the voice information contains a capture instruction, simultaneously capturing, by the first camera, the second depth camera and the third depth camera, images of the user gazing at the viewpoint label.
27. A visual line direction detection device, characterized by comprising:
the image acquisition module is used for acquiring a target image containing a target face;
the target detection module is used for carrying out face region detection and face key point detection on the target image and acquiring a face image and an eye region image according to a detection result, wherein the face image comprises a face region image and/or a face shape image, and the eye region image comprises a left eye image and/or a right eye image;
the sight detection module is used for inputting the face image and the eye area image into a pre-trained neural network to perform sight detection on the target face to obtain a sight detection result;
and the sight line determining module is used for determining the sight line direction of the target face according to the sight line detection result.
28. The apparatus of claim 27, wherein the pre-trained neural network comprises:
the feature extraction layer is used for extracting features of the face image and the eye region image to obtain a first feature vector and a second feature vector;
and the classification layer is used for determining the sight line detection result according to the sight line regression of the first feature vector and the second feature vector.
29. The apparatus of claim 28, wherein the second feature vector comprises a left-eye feature vector and/or a right-eye feature vector; the classification layer comprises: a binocular information branch; or a monocular information branch; or a binocular information branch, a monocular information branch, and a confidence branch.
30. The apparatus of claim 29, wherein the classification layer comprises:
the first splicing unit is used for respectively combining and splicing the first feature vector, the left-eye feature vector and/or the right-eye feature vector by each branch of the classification layer to obtain a first description feature vector;
and the first determining unit is used for determining the sight line detection result according to the sight line regression of the corresponding first description feature vector by each branch of the classification layer.
31. The apparatus of claim 30, wherein the first determining unit comprises:
the first detection subunit is used for performing, by the monocular classifier included in the monocular information branch, line-of-sight regression according to the left-eye first description feature vector and the right-eye first description feature vector respectively, to determine a first detection result;
the second detection subunit is used for performing, by the binocular classifier included in the binocular information branch, line-of-sight regression according to the binocular first description feature vector to determine a second detection result;
and the third detection subunit is used for performing, by the confidence classifier included in the confidence branch, binary classification on the confidence of the left-eye image and the right-eye image respectively according to the second feature vector, to determine a third detection result.
32. The apparatus of claim 30, wherein the gaze determination module comprises:
a first sight line determining unit, configured to, when the confidence branch is not included, use the first detection result or the second detection result as the sight line detection result, to determine a sight line direction of the target face;
a second sight line determining unit, configured to, when the confidence branch is included:
if the third detection result is that the left eye image and the right eye image are both credible, determining the second detection result as the sight line direction of the target face;
if the third detection result is that the left eye image or the right eye image is credible, determining the first detection result as the sight line direction of the target face;
and if the third detection result indicates that neither the left eye image nor the right eye image is credible, outputting no valid result.
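A plain-Python sketch of the selection logic described by claim 32's second sight line determining unit; the argument layout (the two per-eye monocular results packed into a pair) is an assumption made for illustration.

def select_gaze_result(monocular_results, binocular_result, left_ok, right_ok):
    """Choose between the monocular and binocular detection results using the
    confidence branch output.

    monocular_results : (left_eye_gaze, right_eye_gaze) from the monocular branch
    binocular_result  : gaze from the binocular branch
    left_ok, right_ok : credibility of the left / right eye image
    """
    if left_ok and right_ok:
        return binocular_result          # both eyes credible: use the binocular result
    if left_ok:
        return monocular_results[0]      # only the left eye credible: monocular left result
    if right_ok:
        return monocular_results[1]      # only the right eye credible: monocular right result
    return None                          # neither eye credible: no valid output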
33. The apparatus of claim 27, further comprising a normalization module, configured to normalize the eye region image before it is input to the sight detection module.
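The patent does not spell out the normalization of claim 33; one plausible pre-processing, sketched with OpenCV, resizes the eye patch, converts it to grayscale, equalizes its histogram and scales it to [0, 1]. The patch size and each of these steps are assumptions.

import numpy as np
import cv2

def normalize_eye_patch(eye_img, size=(60, 36)):
    """Normalize an eye region image (assumed BGR, uint8) before gaze inference."""
    gray = cv2.cvtColor(eye_img, cv2.COLOR_BGR2GRAY)     # drop color information
    resized = cv2.resize(gray, size)                     # fixed network input size
    equalized = cv2.equalizeHist(resized)                # reduce lighting variation
    return equalized.astype(np.float32) / 255.0          # scale to [0, 1]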
34. The apparatus of claim 27, further comprising a calibration image acquisition module configured to acquire a plurality of line-of-sight calibration images of the target face, wherein the line-of-sight calibration images include a calibration line-of-sight direction of the target face.
35. The apparatus of claim 34,
the feature extraction layer is further used for extracting features of a calibrated face image and a calibrated eye region image to obtain a third feature vector and a fourth feature vector;
the classification layer comprises a pre-calibration branch, configured to sequentially combine the third feature vector and the fourth feature vector of each of the plurality of sight calibration images to determine the sight line detection result.
36. The apparatus of claim 35, wherein the pre-calibration branch comprises:
the first combination unit is used for concatenating the first feature vector and the second feature vector, concatenating the third feature vector and the fourth feature vector of one sight calibration image, and taking the difference of the two concatenated feature vectors to obtain a second description feature vector;
the second determining unit is used for determining a candidate detection result according to sight line regression of the second description feature vector;
the third determining unit is used for sequentially performing feature extraction, feature combination and sight line regression on the remaining sight calibration images in the same manner to obtain all candidate detection results;
and the fourth determining unit is used for weighting and averaging all the candidate detection results to determine the sight line detection result.
37. The apparatus of claim 36, wherein the second determining unit comprises:
the second determining subunit is used for obtaining, by a fully connected layer, the sight angle difference between the target image and one sight calibration image according to sight line regression of the second description feature vector;
a third determining subunit, configured to determine the candidate detection result according to the gaze angle difference and the calibration gaze direction included in one of the gaze calibration images.
38. The apparatus of claim 29, further comprising a gaze region detection module configured to determine a gaze region based on the gaze direction and a pre-divided region space.
39. The apparatus of claim 29, further comprising a network training module for training a neural network, comprising:
the sample acquisition unit is used for acquiring a sample image, wherein the sample image comprises sight line direction marking information;
the sample detection unit is used for carrying out face region detection and face key point detection on the sample image and acquiring a face sample image and an eye region sample image according to a detection result, wherein the face sample image comprises a face region sample image and/or a face shape sample image, and the eye region sample image comprises a left eye sample image and a right eye sample image;
and the first training unit is used for inputting the face sample image and the eye area sample image into an initial neural network for training and constructing the pre-trained neural network, wherein the initial neural network comprises a first binocular information branch, a first monocular information branch and a first confidence coefficient branch.
40. The apparatus of claim 39, wherein the first training unit comprises:
the sample prediction subunit is used for inputting the face sample image and the eye area sample image into the initial neural network to obtain predicted sight direction information and predicted left and right eye visible information;
a first parameter adjusting subunit, configured to adjust parameters of the first binocular information branch and the first monocular information branch of the initial neural network according to a difference between the predicted gaze direction information and the gaze direction labeling information;
and the second parameter adjusting subunit is used for performing unsupervised learning training on the first confidence coefficient branch according to the predicted sight direction information and the predicted left-right eye visible information, and adjusting the parameters of the first confidence coefficient branch.
41. The apparatus of claim 39, wherein the sample acquiring unit comprises:
a capturing subunit, configured to capture, by a first camera, a second depth camera and a third depth camera simultaneously, images of a user gazing at a viewpoint label to obtain a first image, a second image and a third image, wherein the first camera and the second depth camera face the same direction, and the first camera and the third depth camera face opposite directions;
a positioning subunit, configured to determine, according to the second image and the third image, a position of a pupil of the user in the second depth camera coordinate system and a position of the viewpoint label in the third depth camera coordinate system, respectively;
and the annotation determining subunit is configured to determine, according to the pre-calibrated position relationship among the first camera, the second depth camera, and the third depth camera, the position of the pupil of the user in the second depth camera coordinate system and the position of the viewpoint label in the third depth camera coordinate system, gaze direction annotation information of the user in the first camera coordinate system.
42. The apparatus of claim 41, wherein the label determination subunit comprises:
establishing a first positional relationship between the first camera and the second depth camera through a single-sided calibration plate;
establishing a second positional relationship between the first camera and the third depth camera via a two-sided calibration plate.
43. The apparatus of claim 39, wherein the sample acquiring unit comprises:
and the voice subunit is used for receiving and recognizing voice information, and when the voice information contains a capture instruction, causing the first camera, the second depth camera and the third depth camera to simultaneously capture images of the user gazing at the viewpoint label.
44. A storage medium, comprising a stored program, wherein when the program runs, a device in which the storage medium is located is controlled to execute the gaze direction detection method of any one of claims 1 to 26.
45. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the gaze direction detection method of any one of claims 1 to 26 via execution of the executable instructions.
CN202110405686.2A 2021-04-15 2021-04-15 Sight direction detection method and device Pending CN115223231A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110405686.2A CN115223231A (en) 2021-04-15 2021-04-15 Sight direction detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110405686.2A CN115223231A (en) 2021-04-15 2021-04-15 Sight direction detection method and device

Publications (1)

Publication Number Publication Date
CN115223231A (en) 2022-10-21

Family

ID=83605900

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110405686.2A Pending CN115223231A (en) 2021-04-15 2021-04-15 Sight direction detection method and device

Country Status (1)

Country Link
CN (1) CN115223231A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115909254A (en) * 2022-12-27 2023-04-04 钧捷智能(深圳)有限公司 DMS system based on camera original image and image processing method thereof
CN115909254B (en) * 2022-12-27 2024-05-10 钧捷智能(深圳)有限公司 DMS system based on camera original image and image processing method thereof
CN115953389A (en) * 2023-02-24 2023-04-11 广州视景医疗软件有限公司 Strabismus discrimination method and device based on face key point detection
CN115953389B (en) * 2023-02-24 2023-11-24 广州视景医疗软件有限公司 Strabismus judging method and device based on face key point detection

Similar Documents

Publication Publication Date Title
US8994558B2 (en) Automotive augmented reality head-up display apparatus and method
CN110187855B (en) Intelligent adjusting method for near-eye display equipment for avoiding blocking sight line by holographic image
CN107004275B (en) Method and system for determining spatial coordinates of a 3D reconstruction of at least a part of a physical object
US8958599B1 (en) Input method and system based on ambient glints
US20180101984A1 (en) Headset removal in virtual, augmented, and mixed reality using an eye gaze database
US20220058407A1 (en) Neural Network For Head Pose And Gaze Estimation Using Photorealistic Synthetic Data
US20190018236A1 (en) Varifocal aberration compensation for near-eye displays
CN106056534A (en) Obstruction perspective method and device based on smart glasses
CN109889807A (en) Vehicle-mounted projection adjusting method, device, equipment and storage medium
KR20130012629A (en) Augmented reality system for head-up display
WO2023071834A1 (en) Alignment method and alignment apparatus for display device, and vehicle-mounted display system
CN113518996A (en) Damage detection from multiview visual data
US20230059458A1 (en) Immersive displays
CN115223231A (en) Sight direction detection method and device
CN110895676B (en) dynamic object tracking
WO2017206042A1 (en) Method and apparatus for seeing through obstruction using smart glasses
US20210377515A1 (en) Information processing device, information processing method, and program
CN110547759A (en) Robust converged signals
CN113661495A (en) Sight line calibration method, sight line calibration device, sight line calibration equipment, sight line calibration system and sight line calibration vehicle
CN117916706A (en) Method for operating smart glasses in a motor vehicle during driving, correspondingly operable smart glasses and motor vehicle
US10992928B1 (en) Calibration system for concurrent calibration of device sensors
CN109791294B (en) Method and device for operating a display system with data glasses
US10896017B2 (en) Multi-panel display system and method for jointly displaying a scene
KR102235951B1 (en) Imaging Apparatus and method for Automobile
JP2021051764A (en) Information recording apparatus

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination