CN115410242A - Sight estimation method and device - Google Patents

Sight estimation method and device

Info

Publication number
CN115410242A
Authority
CN
China
Prior art keywords
image
user
information
sample
target image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110590686.4A
Other languages
Chinese (zh)
Inventor
罗飞 (Luo Fei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Zitiao Network Technology Co Ltd
Original Assignee
Beijing Zitiao Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zitiao Network Technology Co Ltd filed Critical Beijing Zitiao Network Technology Co Ltd
Priority to CN202110590686.4A priority Critical patent/CN115410242A/en
Publication of CN115410242A publication Critical patent/CN115410242A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/70 Determining position or orientation of objects or cameras
    • G06T7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10004 Still image; Photographic image

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

An embodiment of the disclosure provides a gaze estimation method and device. The method includes: acquiring a target image, wherein the target image presents a facial image of a user to be detected; performing facial feature detection on the target image to obtain facial feature information of the user to be detected; inputting the target image and the facial feature information into a machine learning model for gaze estimation to obtain a gaze estimation result output by the machine learning model, the machine learning model having been trained based on sample images, sample facial features (the facial features presented in the sample images) and sample gaze information (indicating the gaze presented in the sample images); and determining the viewpoint position of the user to be detected in the target image according to the gaze estimation result. Because the image of the user is processed by a pre-trained model to obtain the gaze estimation result, operational complexity is reduced and the efficiency and accuracy of gaze estimation are improved.

Description

Sight estimation method and device
Technical Field
The embodiment of the disclosure relates to the technical field of computer vision, in particular to a sight line estimation method and device.
Background
With the rapid development of science and technology, computer vision is applied in more and more fields, including human-computer interaction. For example, gaze estimation (Gaze Estimation) refers to estimating the gaze direction and viewpoint position of a user from captured images of the user's eyes, so as to enable operations on computers and other devices. In recent years, with the development of new-generation information technologies such as big data and artificial intelligence, higher demands have been placed on gaze estimation technology.
In the related art, the gaze point position is usually determined by estimating the eye gaze direction with the pupil-cornea reflection method: a plurality of light sources form a plurality of light spots on the user's eyeball, an image of the user's eye is acquired, and the gaze point position information is then obtained from the light spots corresponding to all the light sources detected in the eye image.
However, this existing method requires additional light sources and sensor equipment and is complex to operate. Moreover, when the user's eyeball or head moves over a large range, some of the light sources cannot form light spots in the eyes, so that gaze estimation becomes impossible or inaccurate.
Disclosure of Invention
The embodiment of the disclosure provides a sight line estimation method and a sight line estimation device, which are used for overcoming the technical problems of inaccurate sight line estimation and complex equipment operation in the prior art.
In a first aspect, an embodiment of the present disclosure provides a gaze estimation method, including:
acquiring a target image, wherein the target image presents a facial image of a user to be detected;
carrying out facial feature detection on the target image to obtain facial feature information of the user to be detected;
inputting the target image and the facial feature information into a machine learning model for sight line estimation to obtain a sight line estimation result output by the machine learning model; the machine learning model has been trained based on sample images, sample facial features that are facial features present in the sample images, and sample gaze information that indicates a gaze present in the sample images;
and determining the viewpoint position of the user to be detected in the target image according to the sight line estimation result.
In a second aspect, an embodiment of the present disclosure provides a gaze estimation device, including:
the image acquisition module is used for acquiring a target image, wherein the target image presents a facial image of a user to be detected;
the feature detection module is used for carrying out facial feature detection on the target image to obtain facial feature information of the user to be detected;
the sight line estimation module is used for inputting the target image and the facial feature information into a machine learning model for sight line estimation to obtain a sight line estimation result output by the machine learning model; the machine learning model has been trained based on sample images, sample facial features that are facial features present in the sample images, and sample gaze information that indicates a gaze present in the sample images;
and the viewpoint determining module is used for determining the viewpoint position of the user to be detected in the target image according to the sight line estimation result.
In a third aspect, an embodiment of the present disclosure provides an electronic device, including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executes the computer-executable instructions stored in the memory, causing the at least one processor to perform the gaze estimation method as described above in the first aspect and various possible designs of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a computer-readable storage medium, in which computer-executable instructions are stored, and when a processor executes the computer-executable instructions, the gaze estimation method according to the first aspect and various possible designs of the first aspect are implemented.
In a fifth aspect, embodiments of the present disclosure provide a computer program product comprising a computer program that, when executed by a processor, implements the gaze estimation method as described above in the first aspect and various possible designs of the first aspect.
The embodiments of the disclosure provide a gaze estimation method and device. A target image presenting the face of a user to be detected is acquired, and facial feature detection is performed on the target image to obtain facial feature information of the user to be detected. The target image and the facial feature information are then input into a machine learning model; because the machine learning model has been trained based on sample images, sample facial features (the facial features presented in the sample images) and sample gaze information (indicating the gaze presented in the sample images), the gaze estimation result of the user to be detected can be obtained directly from the model. Finally, the viewpoint position of the user to be detected in the target image is determined according to the gaze estimation result output by the machine learning model. Because the image of the user is processed by a pre-trained machine learning model to obtain the gaze estimation result, no additional equipment such as light sources and sensors is needed, operational complexity is reduced, and the efficiency and accuracy of gaze estimation are greatly improved.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present disclosure, and for those skilled in the art, other drawings can be obtained according to the drawings without inventive exercise.
Fig. 1 is a first application scenario diagram of a gaze estimation method provided by an embodiment of the present disclosure;
fig. 2 is a first flowchart illustrating a gaze estimation method according to an embodiment of the disclosure;
fig. 3 is a second application scenario diagram of the gaze estimation method provided by the embodiment of the present disclosure;
fig. 4 is a second flowchart illustrating a gaze estimation method according to an embodiment of the present disclosure;
fig. 5 is a third application scenario diagram of the gaze estimation method provided by the embodiment of the present disclosure;
fig. 6 is a fourth application scenario diagram of the gaze estimation method provided by the embodiment of the present disclosure;
fig. 7 is a schematic flow chart of a training method of a face detection model according to an embodiment of the present disclosure;
fig. 8 is a flowchart illustrating a training method of a machine learning model according to an embodiment of the present disclosure;
fig. 9 is a schematic structural diagram of a gaze estimation apparatus provided in an embodiment of the present disclosure;
fig. 10 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present disclosure.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure more clear, the technical solutions of the embodiments of the present disclosure will be described clearly and completely with reference to the drawings in the embodiments of the present disclosure, and it is obvious that the described embodiments are some, but not all embodiments of the present disclosure. All other embodiments, which can be derived by a person skilled in the art from the embodiments disclosed herein without making any creative effort, shall fall within the protection scope of the present disclosure.
In the related art, the gaze point position is usually determined by estimating the eye gaze direction with the pupil-cornea reflection (P-CR) method: a plurality of light sources form a plurality of light spots on the user's eyeball, an eye image of the user is acquired, the light spots corresponding to all the light sources are detected in the eye image to obtain a P-CR vector, and, while the user's head and eyeballs are stationary, the P-CR vector is mapped one-to-one to a gaze point vector on the computer screen facing the user, thereby realizing gaze estimation and obtaining the gaze point position information. However, this method requires additional light sources and sensor equipment and is complex to operate. When the user's eyeball or head moves over a large range, some of the light sources cannot form light spots in the eyes, and when no light spots or only some of them are formed, an accurate P-CR vector cannot be obtained, so that gaze estimation becomes impossible or inaccurate. To obtain a complete P-CR vector under large eyeball and head movements, a technician has to manually adjust the positions of the light sources so that every light source can form a light spot in the eyes. Under normal conditions the user's head and eyeballs are in continuous motion, so the light source positions have to be adjusted frequently, which greatly increases the operational complexity of the equipment, reduces the efficiency of gaze estimation, and results in low estimation accuracy.
In view of this drawback, the technical idea of the embodiments of the present disclosure is as follows. Two models are first trained by deep learning: a face detection model and a machine learning model. The face detection model performs face detection on an image containing the user's face information and extracts facial feature information of the user, such as the facial rectangular region, face key point data, the visibility of each face key point, the facial horizontal rotation angle, the facial pitch angle, the facial rotation angle, and the distance between the two eyes. The machine learning model performs gaze estimation based on the facial feature information and outputs results such as the user's left-eye position, right-eye position, and the average gaze direction of both eyes. When the user's gaze needs to be estimated, an image containing the user's face information is acquired by an image acquisition device and input into the face detection model, which directly outputs the user's facial feature information; the facial feature information is then input into the machine learning model, which directly outputs the gaze estimation result; finally, the viewpoint position of the user's gaze on a target plane is calculated from the gaze estimation result, and whether the gaze falls within a preset area of the target plane is judged from the viewpoint position. Because the user's image is processed by two pre-trained models, no additional equipment such as light sources and sensors is needed, operational complexity is reduced, and the efficiency and accuracy of gaze estimation are greatly improved.
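As a minimal illustration of this two-stage flow, the following Python sketch wires the two pre-trained models together; the callable interfaces and dictionary-shaped results are assumptions made for the sketch, not the actual interfaces of the models disclosed here.

```python
from typing import Any, Callable, Dict

def run_gaze_estimation(
    image: Any,
    face_detection_model: Callable[[Any], Dict[str, Any]],
    gaze_model: Callable[[Any, Dict[str, Any]], Dict[str, Any]],
) -> Dict[str, Any]:
    # Stage 1: the face detection model extracts facial feature information
    # (face rectangle, 106 key points, visibility, yaw/pitch/roll, eye distance).
    facial_features = face_detection_model(image)
    # Stage 2: the machine learning model estimates gaze from the target image
    # together with the facial features; the result is assumed to contain the
    # left/right eye positions and the average gaze direction of both eyes.
    return gaze_model(image, facial_features)
```

The viewpoint on the target plane and the preset-area judgment are then computed from this result, as described in the later sections.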
Fig. 1 is a first application scenario diagram of the gaze estimation method provided by the embodiment of the present disclosure.
As shown in fig. 1, the basic architecture of the application scenario provided by this embodiment mainly includes an image acquisition device 101 and a server 102. The image acquisition device acquires an image containing the user's face information and sends the acquired image to the server, and the server processes the image to obtain the gaze estimation result of the user.
It should be noted that the image capturing device in this embodiment may be a camera, or another device with an image capture function, such as a mobile phone, a tablet, a smart wearable device, or a display; this embodiment is not specifically limited in this respect.
Fig. 2 is a first schematic flowchart of a gaze estimation method provided in the embodiment of the present disclosure, where an execution subject of the method provided in this embodiment may be a server in the embodiment shown in fig. 1.
As shown in fig. 2, the method provided in the present embodiment may include the following steps.
S201, acquiring a target image, wherein the target image presents a facial image of a user to be detected.
In this step, an image containing the face information of the user to be detected is acquired by an image acquisition device such as a camera to obtain the target image, which is then actively sent to the server. Alternatively, after the image acquisition device such as a camera captures the image of the user to be detected, the image is stored in local memory, and the server then sends a corresponding acquisition instruction to the image acquisition device to obtain the target image of the user to be detected.
In some embodiments, as shown in fig. 3, the camera is located in a first direction of the user to be detected, where the first direction is the direction that the head of the user to be detected faces in its normal state (i.e., the L direction indicated in fig. 3), so that the camera can acquire an image containing the complete head and face of the user to be detected.
It should be noted that the target image in this embodiment may be an image captured by a camera, or may be a frame of image in a video recorded by the camera.
S202, carrying out facial feature detection on the target image to obtain facial feature information of the user to be detected.
In some embodiments, a face detection model may be used to detect the facial feature information of the user to be detected in the target image: the target image of the user to be detected is input into the face detection model as an input, and the face detection model performs facial feature detection on the target image to obtain the facial feature information of the user to be detected output by the face detection model.
In some embodiments, the facial feature information includes information such as facial rectangular region information of the user to be detected, human face key point data, visibility of each human face key point, a facial horizontal rotation angle, a facial pitch angle, a facial rotation angle, and a binocular distance.
S203, inputting the target image and the facial feature information into a machine learning model for sight line estimation, and obtaining a sight line estimation result output by the machine learning model.
The machine learning model has been trained based on sample images, sample facial features (the facial features presented in the sample images) and sample sight line information (indicating the sight line presented in the sample images), and performs sight line estimation according to the facial feature information output by the face detection model; it can therefore estimate the sight line of the user to be detected from the input target image and facial feature information. The detailed training process of the machine learning model is described in the related embodiments below and is not elaborated here.
Specifically, facial feature information such as facial rectangular region information, facial key point data, visibility of each facial key point, a facial horizontal rotation angle, a facial pitch angle, a facial rotation angle, a binocular distance and the like of a user to be detected is input into the machine learning model as input quantities, and sight line estimation results such as left eye position information, right eye position information, a binocular average sight line direction and the like output by the machine learning model are obtained.
And S204, determining the viewpoint position of the user to be detected in the target image according to the sight line estimation result.
In this step, the sight line estimation result includes: the position information of the left eye and the right eye in a three-dimensional space and the average sight line direction of both eyes, where the three-dimensional space is established with the position of a target point as the origin.
It should be noted that the target point may be any point in the space where the user is located.
Illustratively, as shown in fig. 3, the target plane 31 is located in the first direction (i.e., the L direction) of the user to be detected, and a three-dimensional coordinate system is established with the target point O as the origin. The average sight line direction of both eyes of the user to be detected is the G direction shown in fig. 3, and the vector of the average sight line direction of both eyes output by the machine learning model in step S203 is denoted G(g_x, g_y, g_z), where g_x is the angle between the sight line and the X axis of the three-dimensional coordinate system, g_y is the angle between the sight line and the Y axis, and g_z is the angle between the sight line and the Z axis. The viewpoint is denoted as point P, with coordinates P(p_x, p_y, p_z). Assume that in the sight line estimation result output by the machine learning model in this embodiment, the left-eye position information of the user to be detected is E_L(x_l, y_l, z_l) and the right-eye position information is E_R(x_r, y_r, z_r). The coordinates of the midpoint between the two eyes are determined from the left-eye and right-eye position information and taken as the eye coordinates of the user to be detected, denoted E(e_x, e_y, e_z), where
e_x = (x_l + x_r)/2, e_y = (y_l + y_r)/2, e_z = (z_l + z_r)/2
From the eye coordinates E(e_x, e_y, e_z) and the sight line vector G(g_x, g_y, g_z), the coordinates of the viewpoint P can be calculated according to the equation of a line in space:
(p_x - e_x)/g_x = (p_y - e_y)/g_y = (p_z - e_z)/g_z
In this embodiment, a target image presenting the face of the user to be detected is acquired, and facial feature detection is performed on the target image to obtain facial feature information of the user to be detected. The target image and the facial feature information are then input into a machine learning model; because the machine learning model has been trained based on sample images, sample facial features (the facial features presented in the sample images) and sample sight line information (indicating the sight line presented in the sample images), the sight line estimation result of the user to be detected can be obtained directly from the model. Finally, the viewpoint position of the user to be detected in the target image is determined according to the sight line estimation result output by the machine learning model. Because the image of the user is processed by a pre-trained machine learning model, no additional equipment such as light sources and sensors is needed, operational complexity is reduced, and the efficiency and accuracy of sight line estimation are greatly improved.
Fig. 4 is a schematic flowchart of a second method for estimating a line of sight according to an embodiment of the present disclosure, and the present embodiment further describes a line of sight estimation method based on the embodiment shown in fig. 2.
As shown in fig. 4, the method provided by the present embodiment includes the following steps.
S401, a target image is obtained, and the target image presents a facial image of a user to be detected.
In this step, an image containing the face information of the user to be detected is acquired by an image acquisition device such as a camera to obtain the target image, which is then actively sent to the server. Alternatively, after the image acquisition device such as a camera captures the image of the user to be detected, the image is stored in local memory, and the server then sends a corresponding acquisition instruction to the image acquisition device to obtain the target image of the user to be detected.
In some embodiments, as shown in fig. 3, the camera is located in a first direction of the user to be detected, where the first direction is the direction that the head of the user to be detected faces in its normal state (i.e., the L direction indicated in fig. 3), so that the camera can acquire an image containing the complete head and face of the user to be detected.
It should be noted that the target image in this embodiment may be an image captured by a camera, or may be a frame of image in a video recorded by the camera.
S402, obtaining image information of the target image, wherein the image information comprises the width and the height of the target image, the space occupied by each row of pixels in the target image, the color space of the target image and the direction information of the target image.
Specifically, after the server acquires the target image, the server extracts image information, including the width and height of the target image, the space stride occupied by each line of pixels in the target image, the color space of the target image, and the orientation information orientation of the target image.
The width and height of the target image are both expressed in pixels.
Specifically, the stride occupied by each row of pixels in the target image is determined from the number of pixels in the row and the space occupied by each pixel, that is, stride = the number of bytes occupied by each pixel × width; if stride is not a multiple of 4, it is padded to stride = stride + (4 - stride mod 4). For example, if a row has 11 pixels (i.e., width = 11), then for a 32-bit image (each pixel occupies 4 bytes), stride = 11 × 4 = 44. As another example, if a row has 11 pixels (width = 11), then for a 24-bit image (each pixel occupies 3 bytes), stride = 11 × 3 = 33, which is not a multiple of 4; to ensure byte alignment, stride = 33 + 3 = 36.
The orientation information of the target image is the angle by which the target image has been rotated clockwise or counterclockwise.
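As a small sketch of the stride calculation described above (each row padded to a multiple of 4 bytes); the function name is an illustrative choice.

```python
def row_stride(width_px: int, bytes_per_pixel: int) -> int:
    """Bytes occupied by one image row, padded up to a multiple of 4."""
    stride = width_px * bytes_per_pixel
    if stride % 4 != 0:
        stride += 4 - stride % 4
    return stride

# Examples matching the text: a 32-bit image with 11 pixels per row gives 44,
# and a 24-bit image with 11 pixels per row is padded from 33 to 36 bytes.
assert row_stride(11, 4) == 44
assert row_stride(11, 3) == 36
```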
And S403, inputting the image information of the target image into the face detection model obtained by pre-training to obtain the facial feature information of the user to be detected output by the face detection model.
The facial feature information comprises facial rectangular region information of the user to be detected, human face key point data, visibility of each human face key point, a facial horizontal corner, a facial pitch angle, a facial rotation angle and a binocular distance.
Specifically, the target image and the image information corresponding to the target image are input to a face detection model as input quantities, and the face detection model performs face feature detection to output face feature information.
Illustratively, the face detection result output by the face detection model (presented as a data structure in the original drawings) contains the following fields: bef_ai_rect, the rectangular face region of the user to be detected in the target image; score, the confidence of the target image; bef_ai_fpoint_array[106], the face key point data; visibility_array[106], the visibility of each key point; yaw, the horizontal face rotation angle; pitch, the face pitch angle; roll, the face rotation angle; eye_dist, the distance between the two eyes; and ID, the face identifier (faceID).
Specifically, in this embodiment the face key point data is an array of 106 face key points, covering the eyebrows, eyes, nose, mouth, ears and face contour. visibility_array[106] gives the visibility of each face key point and indicates whether the key point is occluded: if a key point is not occluded, its value is 1.0; if it is occluded, its value is 0.0. yaw is the horizontal face rotation angle, measured as negative to the left and positive to the right; that is, if the user's face in the target image is turned to the left, yaw is negative and its absolute value is the angle of the leftward turn (the angle between the leftward-turned facing direction and the facing direction of the face in its natural state), and if the face is turned to the right, yaw is positive and its value is the angle of the rightward turn. pitch is the face pitch angle, measured as negative upward and positive downward: if the user's face in the target image is tilted upward, pitch is negative and its absolute value is the upward tilt angle (the angle between the upward-looking direction and the natural facing direction), and if the face is tilted downward, pitch is positive and its value is the downward tilt angle. roll is the face rotation angle, measured as negative to the left and positive to the right: if the user's head in the target image is tilted and rotated to the left, roll is negative and its absolute value is the angle between the leftward-tilted head direction and the vertical direction; if the head is rotated to the right, roll is positive and its value is the angle between the rightward-tilted head direction and the vertical direction.
It should be noted that the ID is the unique identifier of the user: each detected face has a unique faceID, and a new faceID is generated if face tracking is lost and the face is detected again.
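For readability, the face detection output described above can be pictured as the following data structure; the field names follow the listing in the text, while the concrete types and defaults are assumptions.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class FaceDetectionResult:
    bef_ai_rect: Tuple[int, int, int, int]   # rectangular face region in the target image
    score: float                              # confidence
    bef_ai_fpoint_array: List[Tuple[float, float]] = field(default_factory=list)  # 106 key points
    visibility_array: List[float] = field(default_factory=list)  # 1.0 = visible, 0.0 = occluded
    yaw: float = 0.0        # horizontal face rotation angle (negative left, positive right)
    pitch: float = 0.0      # face pitch angle (negative up, positive down)
    roll: float = 0.0       # face rotation angle (negative left tilt, positive right tilt)
    eye_dist: float = 0.0   # distance between the two eyes
    ID: int = 0             # face identifier (faceID)
```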
S404, inputting the target image and the facial feature information into a machine learning model obtained through pre-training, and obtaining a sight line estimation result output by the machine learning model.
Specifically, the target image and the facial feature information are input into the machine learning model as input quantities to obtain the sight line estimation result output by the machine learning model, where the sight line estimation result includes: the position information of the left eye and the right eye in a three-dimensional space and the average sight line direction of both eyes, the three-dimensional space being established with the position of a target point as the origin.
Illustratively, the sight line estimation result output by the machine learning model (presented as a data structure in the original drawing) contains the positions of the left eye and the right eye in the three-dimensional space, each as a three-element array, together with three-element arrays for the left-eye sight line direction, the right-eye sight line direction and the average sight line direction of both eyes.
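The field names in the original listing are garbled by the translation, so the following sketch uses assumed names; only the meaning of the fields follows the text.

```python
from dataclasses import dataclass
from typing import Tuple

Vec3 = Tuple[float, float, float]

@dataclass
class GazeEstimationResult:
    leye_pos: Vec3   # left-eye position in the three-dimensional space
    reye_pos: Vec3   # right-eye position in the three-dimensional space
    leye_gaze: Vec3  # left-eye sight line direction
    reye_gaze: Vec3  # right-eye sight line direction
    mid_gaze: Vec3   # average sight line direction of both eyes
```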
S405, judging whether the sight of the user to be detected falls in a preset area on a target plane according to the sight estimation result.
Specifically, after the two-eye sight line direction of the user to be detected and the position information of the left eye and the right eye in the three-dimensional space are obtained, the viewpoint position information of the sight line of the user to be detected on the target plane can be determined, and whether the viewpoint falls in the preset area or not can be judged according to the viewpoint position information and the position information of the preset area in the three-dimensional space.
It should be noted that the target point (point O) may be any point in the space where the user to be detected is located. In this embodiment, for convenience of calculation, the target point is taken as a point on the target plane; the lines on which the abscissa axis (X axis) and the ordinate axis (Y axis) of the three-dimensional space lie are any two mutually perpendicular lines passing through the target point in the target plane, and the line on which the vertical axis (Z axis) of the three-dimensional space lies passes through the target point and is perpendicular to the target plane.
For example, a three-dimensional coordinate system established with a point on the target plane as the origin is shown in fig. 5; the target plane is located in the first direction (i.e., the L direction) of the user to be detected, the average sight line direction of both eyes of the user to be detected is the G direction shown in fig. 5, and the viewpoint of the sight line on the target plane is denoted as point P.
In one or more possible cases of this embodiment, the determining, according to the gaze estimation result, viewpoint position information of the gaze of the user to be measured on a target plane includes: determining the position information of the middle point between the left eye and the right eye according to the position information of the left eye and the right eye in the three-dimensional space; and determining the viewpoint position information of the sight of the user to be detected on the target plane according to the intermediate point position information, the average sight direction of the two eyes, the position information of the target plane in the three-dimensional space and a first formula.
Specifically, assume that in the sight line estimation result output by the machine learning model in this embodiment, the left-eye position information of the user to be detected is E_L(x_l, y_l, z_l) and the right-eye position information is E_R(x_r, y_r, z_r). The coordinates of the midpoint between the two eyes are determined from the left-eye and right-eye position information and taken as the eye coordinates of the user to be detected, denoted E(e_x, e_y, e_z), where
e_x = (x_l + x_r)/2, e_y = (y_l + y_r)/2, e_z = (z_l + z_r)/2
From the coordinates E(e_x, e_y, e_z) of the midpoint between the two eyes and the sight line vector G(g_x, g_y, g_z), the coordinates of the viewpoint P can be calculated according to a first formula:
(p_x - e_x)/g_x = (p_y - e_y)/g_y = (p_z - e_z)/g_z
where e_x, e_y and e_z respectively denote the abscissa, ordinate and vertical coordinate of the midpoint in the three-dimensional space, g_x, g_y and g_z respectively denote the angles between the average sight line direction of both eyes and the abscissa axis, the ordinate axis and the vertical axis of the three-dimensional space, and p_x, p_y and p_z respectively denote the abscissa, ordinate and vertical coordinate of the viewpoint in the three-dimensional space.
Further, because the target point (point O) is a point on the target plane, the lines on which the abscissa axis and the ordinate axis of the three-dimensional space lie are any two mutually perpendicular lines passing through the target point in the target plane, and the line on which the vertical axis of the three-dimensional space lies passes through the target point and is perpendicular to the target plane, the position of the target plane can be represented by Z = 0. The vertical coordinate of the viewpoint of the sight line of the user to be detected on the target plane is therefore p_z = 0, and substituting p_z = 0 into the equation above gives the first formula:
(p_x - e_x)/g_x = (p_y - e_y)/g_y = (0 - e_z)/g_z, i.e., p_x = e_x - (e_z/g_z)·g_x and p_y = e_y - (e_z/g_z)·g_y
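The calculation above can be sketched as follows; it treats g_x, g_y and g_z as the components of the direction vector G (the text also describes them as angles to the axes, in which case their cosines would be used as components), and the function name is illustrative.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]

def viewpoint_on_target_plane(left_eye: Vec3, right_eye: Vec3, gaze: Vec3) -> Vec3:
    """Intersect the line through the eye midpoint E along direction G with the plane z = 0."""
    e_x, e_y, e_z = ((l + r) / 2.0 for l, r in zip(left_eye, right_eye))
    g_x, g_y, g_z = gaze
    t = -e_z / g_z                      # parameter at which the line reaches z = 0
    return (e_x + t * g_x, e_y + t * g_y, 0.0)
```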
in one or more possible cases of this embodiment, after determining the viewpoint position coordinates, it may be further determined whether the viewpoint falls within a preset area on the target plane, specifically, determining whether the viewpoint is within the preset area according to the viewpoint position information of the line of sight on the target plane includes:
when the abscissa, the ordinate and the vertical coordinate of the viewpoint meet preset conditions, determining that the viewpoint is in the preset area, wherein the preset conditions comprise: the abscissa of the viewpoint is within a first range, the ordinate of the viewpoint is within a second range, and the vertical coordinate of the viewpoint is within a third range; the starting value and the ending value corresponding to each of the first range, the second range and the third range are determined according to the coordinates of the edge points of the preset area: the starting value and the ending value of the first range are respectively the minimum value and the maximum value of the abscissa of the edge points of the preset area, the starting value and the ending value of the second range are respectively the minimum value and the maximum value of the ordinate of the edge points, and the starting value and the ending value of the third range are respectively the minimum value and the maximum value of the vertical coordinate of the edge points.
Illustratively, as shown in fig. 5, the target area is the rectangular area enclosed by the dotted line in the target plane, and the coordinates of the four vertices of the rectangular area are (x_1, y_1, 0), (x_2, y_1, 0), (x_1, y_2, 0) and (x_2, y_2, 0). The first range, for the abscissa of the preset area, is then [x_1, x_2]; the second range, for the ordinate, is [y_1, y_2]; and the third range, for the vertical coordinate, is 0. It is judged whether the abscissa p_x of the viewpoint position P falls within the first range [x_1, x_2] and whether the ordinate p_y falls within the second range [y_1, y_2]; if so, the viewpoint is determined to be within the preset area, and otherwise it is not.
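A sketch of this preset-area test for the rectangular region of fig. 5, using the ranges [x_1, x_2] and [y_1, y_2] described above.

```python
from typing import Tuple

def viewpoint_in_region(p: Tuple[float, float, float],
                        x1: float, x2: float,
                        y1: float, y2: float) -> bool:
    """True if viewpoint P lies inside the rectangular preset area in the z = 0 plane."""
    p_x, p_y, p_z = p
    return x1 <= p_x <= x2 and y1 <= p_y <= y2 and p_z == 0.0
```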
In the embodiment, the sight line estimation result is obtained by processing the image of the user through the two pre-trained models, no additional light source, sensor and other equipment is needed in the whole process, the operation complexity is reduced, and the sight line estimation efficiency and accuracy are greatly improved.
It should be noted that the gaze estimation method provided by the embodiment of the present disclosure may be used in various application scenarios, such as an online intelligent education scenario, an artificial intelligence control scenario, and the like, and the following will describe in detail by taking an example of implementation of the present solution in a specific application scenario.
Fig. 6 is a fourth application scenario diagram of the sight line estimation method provided by the embodiment of the present disclosure. The application scenario provided by this embodiment is an online education scenario: a user studies online by watching content such as videos and text played on a display screen 61, a camera 62 located on one side of the display screen collects target images of the user in real time and sends them to a server, and the server estimates the user's sight line from each target image to determine whether the sight line falls within the display screen.
It is understood that the camera may be a separate camera mounted on the display screen, and it may be located at any position on one side of the screen. In this embodiment, to ensure that the acquired target image contains the user's complete face information, the camera 62 is disposed directly above the display screen 61 as shown in fig. 6, so that the user's face faces the camera when viewing the screen.
It should be noted that some display screens have cameras in the middle of their upper edges, and therefore, the cameras of the display screens can also be used to capture target images of users.
Specifically, the camera collects a target image of the user in front of the display screen in real time and sends it to the server 63. After acquiring the target image, the server 63 extracts its image information, including the width and height of the target image, the space stride occupied by each row of pixels, the color space, and the orientation information of the target image. The target image and its image information are then input into the face detection model as input quantities; the face detection model performs face detection and outputs facial feature information including the facial rectangular region information of the user to be detected, the face key point data, the visibility of each face key point, the facial horizontal rotation angle, the facial pitch angle, the facial rotation angle, the distance between the two eyes, and so on. The target image and the facial feature information are input into the machine learning model as input quantities to obtain the sight line estimation result output by the machine learning model, where the sight line estimation result includes: the position information of the left eye and the right eye in a three-dimensional space and the average sight line direction of both eyes, the three-dimensional space being established with the position of a target point as the origin. After the sight line direction of both eyes and the position information of the left eye and the right eye of the user to be detected in the three-dimensional space are obtained, the viewpoint position information of the sight line of the user to be detected on the target plane can be determined, and whether the viewpoint falls within the display screen can be judged from the viewpoint position information and the position information of the display screen in the three-dimensional space.
In a possible case of this embodiment, the position of the camera and the display screen are on the same plane, i.e. both on the target plane, and the three-dimensional space is a three-dimensional coordinate system established by using the position of the camera as an origin, as shown in fig. 6.
Furthermore, the total length of time during which the sight line falls within the display screen over the whole online learning session can be counted, and its ratio to the total online learning time calculated. If the ratio is greater than a preset threshold, the user's overall online learning efficiency is high; if the ratio is below the preset threshold, the user's sight line frequently left the display screen during online learning and the learning efficiency is low.
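A sketch of this learning-efficiency statistic over per-frame in-screen judgments; the threshold value is an illustrative assumption.

```python
from typing import Sequence

def attention_ratio(in_screen_flags: Sequence[bool]) -> float:
    """Fraction of sampled frames in which the viewpoint fell within the display screen."""
    return sum(in_screen_flags) / len(in_screen_flags) if in_screen_flags else 0.0

def learning_efficiency_high(in_screen_flags: Sequence[bool], threshold: float = 0.8) -> bool:
    return attention_ratio(in_screen_flags) > threshold
```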
It should be noted that, in this embodiment, the target plane is a plane where the display screen is located, and the preset area in the target plane is an area occupied by the display screen.
It should be noted that, parts that are not described in detail in this embodiment may refer to other detailed descriptions in other method embodiments of the present application, and descriptions thereof are not repeated here.
In the embodiment, in an application scene of online education, whether the sight of the user falls on the screen can be accurately judged through the sight estimation algorithm, and the learning efficiency of the user is further judged by counting the total time length of the sight falling in the display screen.
Fig. 7 is a schematic flow chart of a training method of a face detection model provided in the embodiment of the present disclosure, and in this embodiment, a description is mainly given of a training process of a face detection model used in the above method embodiment.
As shown in fig. 7, the training method for a face detection model provided in this embodiment may include the following steps.
S701, acquiring a first image set, wherein the first image set comprises a plurality of sample images, and the sample images show facial images of users.
Specifically, a large number of images containing the user's face information are acquired by an image acquisition device, such as a camera located in the direction that the user's head faces in its normal state, to obtain a first image set containing images of the user's face in various postures, such as horizontal left and right rotation, pitching, and tilting the head to the left or right; the first image set is then sent to the server.
S702, extracting image information of each sample image in the first image set, wherein the image information comprises the width and the height of the image, the space occupied by each row of pixels in the image, the color space of the image and the direction information of the image.
Specifically, for each sample image in the first image set, the server extracts image information including the width and height of the image, the space stride occupied by each row of pixels in the image, the color space of the image, and the orientation information orientation of the image.
The width and height of each sample image are both expressed in pixels.
Specifically, the stride occupied by each row of pixels in a sample image is determined from the number of pixels in the row and the space occupied by each pixel, that is, stride = the number of bytes occupied by each pixel × width; if stride is not a multiple of 4, it is padded to stride = stride + (4 - stride mod 4). For example, if a row has 11 pixels (i.e., width = 11), then for a 32-bit image (each pixel occupies 4 bytes), stride = 11 × 4 = 44. As another example, for a 24-bit image (each pixel occupies 3 bytes) with 11 pixels per row, stride = 11 × 3 = 33, which is not a multiple of 4; to ensure byte alignment, stride = 33 + 3 = 36.
The orientation information of each sample image is the angle by which the image has been rotated clockwise or counterclockwise.
And S703, according to the input labeling data, labeling a training label on each sample image in the first image set, wherein the training label comprises a sample facial feature corresponding to each sample image, and the sample facial feature comprises facial rectangular region information of a user, a face identifier, face key point data, visibility of each face key point, a face horizontal rotation angle, a face pitch angle, a face rotation angle, a binocular distance and an image confidence coefficient.
Specifically, the server sends the acquired images in the first image set to the display terminal for displaying, and the user inputs annotation data corresponding to the images through the display terminal, for example, the user selects a user face area in the image using a rectangular frame, and inputs a user face rotation direction and an image rotation direction in the image. And then, the server automatically identifies and marks face key point data and the like in the image according to the face rectangular region marked by the user to generate a training label corresponding to the image.
S704, generating first training data according to the first image set, the image information of each sample image in the first image set and the training label corresponding to each sample image.
S705, inputting the first training data into a first deep learning network established in advance for training to obtain a face detection model.
Specifically, the first image set, the image information of each image in the first image set, and the training label corresponding to each image can be used as first training data, and input to a first deep learning network established in advance for training to obtain a face detection model.
In this embodiment, the face detection model is obtained by generating the first training data for training, and the face detection model can be used to quickly and accurately perform face detection on the user target image acquired in real time to obtain the face key data.
Fig. 8 is a flowchart illustrating a method for training a machine learning model according to an embodiment of the present disclosure, where in this embodiment, a training process of the machine learning model used in the foregoing method embodiment is mainly described.
As shown in fig. 8, the training method of the machine learning model provided in this embodiment may include the following steps.
S801, determining sample sight information of a user corresponding to the face identification in each sample image in the first image set, wherein the sample sight information comprises: the left eye position information, the right eye position information, the left eye sight line direction, the right eye sight line direction and the average sight line direction of the two eyes of the user in the three-dimensional space.
Specifically, a three-dimensional coordinate system is established with the position of the camera as the origin, and the plane that contains the camera and faces the user is taken as the target plane; the lines on which the abscissa axis (X axis) and the ordinate axis (Y axis) of the three-dimensional coordinate system lie are any two mutually perpendicular lines passing through the target point in the target plane, and the line on which the vertical axis (Z axis) lies passes through the target point and is perpendicular to the target plane (refer to the positions of the target plane, the camera and the three-dimensional coordinate system in the embodiment shown in fig. 6). When each image in the first image set is collected, the left-eye position information, right-eye position information, left-eye sight line direction, right-eye sight line direction and average sight line direction of both eyes of the corresponding user in this three-dimensional coordinate system are determined at the time of collection, thereby obtaining the binocular sight line features of the user corresponding to each image in the first image set.
S802, second training data are generated according to the first training data and the sample sight line information of the user corresponding to each face identification.
And S803, inputting the second training data into a second deep learning network established in advance for training to obtain the machine learning model.
Specifically, the first image set, the image information corresponding to each image in the first image set, the face key data corresponding to each image, and the binocular sight line features of the user corresponding to the face identifier in each image are used as second training data, and the second training data are input into a second deep learning network established in advance for training to obtain the machine learning model. The machine learning model can then quickly and accurately output the sight line estimation result of the user corresponding to a target image, including the user's left-eye position information, right-eye position information, left-eye sight line direction, right-eye sight line direction, average sight line direction of both eyes, and so on.
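The disclosure does not specify the structure of the second deep learning network, so the following is only a deliberately simplified, self-contained sketch of training a regressor from the numeric facial features to the gaze labels; the image branch is omitted, and the architecture, input/output dimensions and hyperparameters are all assumptions.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Assumed shapes: 106 key points (x, y) + yaw/pitch/roll + eye distance = 216 inputs;
# labels: left-eye position (3) + right-eye position (3) + average gaze direction (3) = 9.
features = torch.randn(1000, 216)   # placeholder for the second-training-data features
gaze_labels = torch.randn(1000, 9)  # placeholder for the sample sight line information

model = nn.Sequential(
    nn.Linear(216, 128), nn.ReLU(),
    nn.Linear(128, 64), nn.ReLU(),
    nn.Linear(64, 9),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()
loader = DataLoader(TensorDataset(features, gaze_labels), batch_size=32, shuffle=True)

for epoch in range(10):
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```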
It should be noted that, in the training of the machine learning model, the coordinate system referenced by the eye position and sight line direction data in the second training data is the same three-dimensional coordinate system as the reference coordinate system of the sight line estimation results obtained with the machine learning model.
In this embodiment, the machine learning model is obtained by generating second training data for training. With the machine learning model, sight line detection can be performed quickly and accurately on a user target image acquired in real time, together with the face key data obtained by the face detection model from that image, to obtain the sight line estimation result.
Fig. 9 is a schematic structural diagram of a gaze estimation apparatus provided in the embodiment of the present disclosure, where the apparatus provided in the embodiment may be deployed in a server to implement corresponding method steps in the server.
As shown in fig. 9, the apparatus provided in the embodiment of the present disclosure includes: an image acquisition module 91, a feature detection module 92, a sight line estimation module 93 and a viewpoint determination module 94; wherein:
the image acquisition module is used for acquiring a target image, wherein the target image presents a facial image of a user to be detected;
the feature detection module is used for carrying out facial feature detection on the target image to obtain facial feature information of the user to be detected;
the sight line estimation module is used for inputting the target image and the facial feature information into a machine learning model for sight line estimation to obtain a sight line estimation result output by the machine learning model; the machine learning model has been trained based on sample images, sample facial features that are facial features present in the sample images, and sample gaze information that indicates a gaze present in the sample images;
and the viewpoint determining module is used for determining the viewpoint position of the user to be detected in the target image according to the sight line estimation result.
In one or more possible embodiments, the feature detection module is specifically configured to:
acquiring image information of the target image, wherein the image information comprises the width and the height of the target image, the space occupied by each row of pixels in the target image, the color space of the target image and the direction information of the target image;
inputting the image information of the target image into the face detection model obtained by pre-training to obtain the facial feature information of the user to be detected output by the face detection model, wherein the facial feature information comprises facial rectangular region information of the user to be detected, face key point data, visibility of each face key point, a facial horizontal rotation angle, a facial pitch angle, a facial rotation angle and a binocular distance.
In one or more possible embodiments, the gaze estimation result includes: position information of a left eye and a right eye in a three-dimensional space and an average sight line direction of two eyes, wherein the three-dimensional space is a three-dimensional space established by taking the position of a target point as an origin;
the viewpoint determining module is specifically configured to:
determining the position information of the middle point between the left eye and the right eye according to the position information of the left eye and the right eye in the three-dimensional space;
and determining the viewpoint position of the user to be detected according to the intermediate point position information, the average sight direction of the two eyes and a first formula.
In one or more possible embodiments, the first formula is:
$\frac{p_x - e_x}{\cos g_x} = \frac{p_y - e_y}{\cos g_y} = \frac{p_z - e_z}{\cos g_z}$

wherein e_x, e_y and e_z respectively represent the abscissa, the ordinate and the vertical coordinate of the intermediate point in the three-dimensional space; g_x, g_y and g_z respectively represent the angle between the average sight direction of the two eyes and the abscissa axis, the ordinate axis and the vertical axis of the three-dimensional space; and p_x, p_y and p_z respectively represent the abscissa, the ordinate and the vertical coordinate of the viewpoint in the three-dimensional space.
In one or more possible embodiments, the target point is any point on the target plane, the straight lines on which the abscissa axis and the ordinate axis of the three-dimensional space are located are any two mutually perpendicular straight lines that pass through the target point in the target plane, and the straight line on which the vertical axis of the three-dimensional space is located passes through the target point and is perpendicular to the target plane;
the first formula is then:
$p_x = e_x - e_z \cdot \frac{\cos g_x}{\cos g_z}, \quad p_y = e_y - e_z \cdot \frac{\cos g_y}{\cos g_z}, \quad p_z = 0$

wherein e_x, e_y and e_z respectively represent the abscissa, the ordinate and the vertical coordinate of the intermediate point in the three-dimensional space; g_x, g_y and g_z respectively represent the angle between the average sight direction of the two eyes and the abscissa axis, the ordinate axis and the vertical axis of the three-dimensional space; p_x and p_y respectively represent the abscissa and the ordinate of the viewpoint in the three-dimensional space; and the vertical coordinate of the viewpoint is 0.
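The following minimal sketch illustrates this special case numerically, assuming the target plane is the z = 0 plane of the three-dimensional space and that g_x, g_y and g_z are given in radians; the function and variable names are illustrative only and are not taken from the disclosure.

```python
# Illustrative sketch: intersect the binocular average sight ray with the
# target plane z = 0. The angle convention and all names are assumptions.
import math

def viewpoint_on_plane(e, g):
    """e = (e_x, e_y, e_z): midpoint between the eyes.
    g = (g_x, g_y, g_z): angles between the average sight direction and the
    x, y, z axes, in radians. Returns the viewpoint (p_x, p_y, 0.0)."""
    cx, cy, cz = (math.cos(a) for a in g)
    if abs(cz) < 1e-9:
        raise ValueError("sight direction is parallel to the target plane")
    p_x = e[0] - e[2] * cx / cz
    p_y = e[1] - e[2] * cy / cz
    return (p_x, p_y, 0.0)

if __name__ == "__main__":
    # Eyes 0.5 m in front of the plane, looking straight at it.
    left_eye, right_eye = (0.07, 0.0, 0.5), (0.13, 0.0, 0.5)
    midpoint = tuple((l + r) / 2 for l, r in zip(left_eye, right_eye))
    print(viewpoint_on_plane(midpoint, (math.pi / 2, math.pi / 2, math.pi)))
```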
In one or more possible embodiments, the gaze estimation apparatus further includes: a position determining module 95, configured to determine whether the viewpoint of the user to be detected is within a preset area according to the viewpoint position of the user to be detected, where the preset area is an area in a target plane.
In one or more possible embodiments, the position determining module is specifically configured to: when the abscissa, the ordinate and the vertical coordinate of the viewpoint meet preset conditions, determine that the viewpoint is in the preset area, wherein the preset conditions comprise: the abscissa of the viewpoint is within a first range, the ordinate of the viewpoint is within a second range, and the vertical coordinate of the viewpoint is within a third range;
wherein the start value and the end value corresponding to each of the first range, the second range and the third range are determined according to the coordinates of each edge point of the preset region, the start value and the end value of the first range are respectively the minimum value and the maximum value of the abscissa of each edge point of the preset region, the start value and the end value of the second range are respectively the minimum value and the maximum value of the ordinate of each edge point of the preset region, and the start value and the end value of the third range are respectively the minimum value and the maximum value of the vertical coordinate of each edge point of the preset region.
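For illustration, a containment check of this kind could be sketched as follows, assuming the preset area is described by a list of edge-point coordinates; the function and variable names are hypothetical.

```python
# Illustrative sketch: check whether a viewpoint lies inside the coordinate
# ranges spanned by the edge points of a preset area.
def viewpoint_in_preset_area(viewpoint, edge_points):
    """viewpoint: (x, y, z); edge_points: iterable of (x, y, z) edge coordinates."""
    for axis in range(3):
        values = [p[axis] for p in edge_points]
        if not (min(values) <= viewpoint[axis] <= max(values)):
            return False
    return True

if __name__ == "__main__":
    screen_corners = [(0.0, 0.0, 0.0), (0.6, 0.0, 0.0), (0.6, 0.34, 0.0), (0.0, 0.34, 0.0)]
    print(viewpoint_in_preset_area((0.3, 0.2, 0.0), screen_corners))  # True
    print(viewpoint_in_preset_area((0.7, 0.2, 0.0), screen_corners))  # False
```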
In one or more possible embodiments, the gaze estimation apparatus further includes: a model training module 96 for obtaining a first image set, the first image set including a plurality of sample images, the sample images presenting facial images of a user; generating first training data from the first set of images; and inputting the first training data into a first deep learning network established in advance for training to obtain a face detection model.
In one or more possible embodiments, the model training module is specifically configured to:
extracting image information of each sample image in the first image set, wherein the image information comprises the width and the height of the image, the space occupied by each row of pixels in the image, the color space of the image and the direction information of the image;
according to input labeling data, labeling each sample image in the first image set with a training label, wherein the training label comprises a sample facial feature corresponding to each sample image, and the sample facial feature comprises facial rectangular region information of a user, a face identifier, face key point data, visibility of each face key point, a face horizontal rotation angle, a face pitch angle, a face rotation angle, a binocular distance and an image confidence coefficient;
and generating first training data according to the first image set, the image information of each sample image in the first image set and the training label corresponding to each sample image.
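Purely as an illustration of what one record of the first training data might contain, the sketch below groups the image information and training label fields into a single structure; the field names and types are assumptions, not the claimed data format.

```python
# Hypothetical record layout for one sample of the first training data.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FirstTrainingRecord:
    image_path: str                       # sample image in the first image set
    width: int
    height: int
    bytes_per_row: int                    # space occupied by each row of pixels
    color_space: str                      # e.g. "RGB"
    orientation_deg: int                  # direction information of the image
    face_rect: Tuple[int, int, int, int]  # facial rectangular region (x, y, w, h)
    face_id: int                          # face identifier
    keypoints: List[Tuple[float, float]]  # face key point data
    keypoint_visibility: List[float]
    yaw_deg: float                        # face horizontal rotation angle
    pitch_deg: float                      # face pitch angle
    roll_deg: float                       # face rotation angle
    eye_distance_px: float                # binocular distance
    confidence: float                     # image confidence coefficient
```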
In one or more possible embodiments, the model training module is further configured to:
determine sample sight information of the user corresponding to the face identifier in each sample image in the first image set, wherein the sample sight information comprises: left eye position information, right eye position information, a left eye sight direction, a right eye sight direction and a binocular average sight direction of the user in the three-dimensional space;
generate second training data according to the first training data and the sample sight information of the user corresponding to each face identifier;
and input the second training data into a second deep learning network established in advance for training to obtain the machine learning model.
The apparatus provided in this embodiment may be used to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Referring to fig. 10, a schematic structural diagram of an electronic device 100 suitable for implementing the embodiments of the present disclosure is shown, where the electronic device 100 may be a terminal device or a server. The terminal device may include, but is not limited to, a mobile terminal such as a mobile phone, a notebook computer, a digital broadcast receiver, a personal digital assistant (PDA), a tablet computer (PAD), a portable multimedia player (PMP) or a vehicle-mounted terminal (e.g., a car navigation terminal), and a fixed terminal such as a digital TV or a desktop computer. The electronic device shown in fig. 10 is only an example, and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in fig. 10, the electronic device 100 may include a processing device (e.g., a central processing unit, a graphics processor, etc.) 1001, which may perform various suitable actions and processes according to a program stored in a read-only memory (ROM) 1002 or a program loaded from a storage device 1008 into a random access memory (RAM) 1003. The RAM 1003 also stores various programs and data necessary for the operation of the electronic device 1000. The processing device 1001, the ROM 1002 and the RAM 1003 are connected to each other by a bus 1004. An input/output (I/O) interface 1005 is also connected to the bus 1004.
Generally, the following devices may be connected to the I/O interface 1005: input devices 1006 including, for example, a touch screen, touch pad, keyboard, mouse, camera, microphone, accelerometer, gyroscope, etc.; an output device 1007 including, for example, a Liquid Crystal Display (LCD), a speaker, a vibrator, and the like; storage devices 1008 including, for example, magnetic tape, hard disk, and the like; and a communication device 1009. The communication device 1009 may allow the electronic device 1000 to communicate with other devices wirelessly or by wire to exchange data. While fig. 10 illustrates an electronic device 1000 having various means, it is to be understood that not all illustrated means are required to be implemented or provided. More or fewer devices may alternatively be implemented or provided.
In particular, according to an embodiment of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication means 1009, or installed from the storage means 1008, or installed from the ROM 1002. The computer program, when executed by the processing device 1001, performs the above-described functions defined in the methods of the embodiments of the present disclosure.
It should be noted that the computer readable medium in the present disclosure can be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present disclosure, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In contrast, in the present disclosure, a computer readable signal medium may comprise a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: electrical wires, optical cables, RF (radio frequency), etc., or any suitable combination of the foregoing.
The computer readable medium may be embodied in the electronic device; or may exist separately without being assembled into the electronic device.
The computer readable medium carries one or more programs which, when executed by the electronic device, cause the electronic device to perform the methods shown in the above embodiments.
Computer program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider).
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units described in the embodiments of the present disclosure may be implemented by software or hardware. Where the name of a unit does not in some cases constitute a limitation of the unit itself, for example, the first retrieving unit may also be described as a "unit for retrieving at least two internet protocol addresses".
The functions described herein above may be performed, at least in part, by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a first aspect, according to one or more embodiments of the present disclosure, there is provided a gaze estimation method, including:
acquiring a target image, wherein the target image presents a facial image of a user to be detected;
carrying out facial feature detection on the target image to obtain facial feature information of the user to be detected;
inputting the target image and the facial feature information into a machine learning model for sight line estimation to obtain a sight line estimation result output by the machine learning model; the machine learning model has been trained based on sample images, sample facial features that are facial features present in the sample images, and sample gaze information that indicates a gaze present in the sample images;
and determining the viewpoint position of the user to be detected in the target image according to the sight line estimation result.
According to one or more embodiments of the present disclosure, the performing facial feature detection on the target image to obtain facial feature information of the user to be detected includes:
acquiring image information of the target image, wherein the image information comprises the width and the height of the target image, the space occupied by each row of pixels in the target image, the color space of the target image and the direction information of the target image;
inputting the image information of the target image into the face detection model obtained by pre-training to obtain the facial feature information of the user to be detected output by the face detection model, wherein the facial feature information comprises facial rectangular region information of the user to be detected, face key point data, visibility of each face key point, a facial horizontal rotation angle, a facial pitch angle, a facial rotation angle and a binocular distance.
According to one or more embodiments of the present disclosure, the sight line estimation result includes: position information of a left eye and a right eye in a three-dimensional space and an average sight line direction of two eyes, wherein the three-dimensional space is a three-dimensional space established by taking the position of a target point as an origin;
determining the viewpoint position of the user to be detected in the target image according to the sight line estimation result, including:
determining the position information of the middle point between the left eye and the right eye according to the position information of the left eye and the right eye in the three-dimensional space;
and determining the viewpoint position of the user to be detected according to the intermediate point position information, the average sight direction of the two eyes and a first formula.
According to one or more embodiments of the present disclosure, the first formula is a spatial line equation.
According to one or more embodiments of the present disclosure, the method further comprises:
and judging whether the viewpoint of the user to be detected is in a preset area or not according to the viewpoint position of the user to be detected, wherein the preset area is an area in a target plane.
According to one or more embodiments of the present disclosure, the determining whether the viewpoint of the user to be detected is within a preset area according to the viewpoint position of the user to be detected includes:
when the abscissa, the ordinate and the vertical coordinate of the viewpoint meet preset conditions, determining that the viewpoint is in the preset area, wherein the preset conditions comprise: the abscissa of the viewpoint is within a first range, the ordinate of the viewpoint is within a second range, and the vertical coordinate of the viewpoint is within a third range;
wherein the start value and the end value corresponding to each of the first range, the second range and the third range are determined according to the coordinates of each edge point of the preset area, the start value and the end value of the first range are respectively the minimum value and the maximum value of the abscissa of each edge point of the preset area, the start value and the end value of the second range are respectively the minimum value and the maximum value of the ordinate of each edge point of the preset area, and the start value and the end value of the third range are respectively the minimum value and the maximum value of the vertical coordinate of each edge point of the preset area.
According to one or more embodiments of the present disclosure, the method further comprises:
acquiring a first image set, wherein the first image set comprises a plurality of sample images, and the sample images show facial images of a user;
generating first training data from the first set of images;
and inputting the first training data into a first deep learning network established in advance for training to obtain a face detection model.
According to one or more embodiments of the present disclosure, the generating first training data from the first set of images includes:
extracting image information of each sample image in the first image set, wherein the image information comprises the width and the height of the image, the space occupied by each row of pixels in the image, the color space of the image and the direction information of the image;
according to input labeling data, labeling each sample image in the first image set with a training label, wherein the training label comprises a sample facial feature corresponding to each sample image, and the sample facial feature comprises facial rectangular region information of a user, a face identifier, face key point data, visibility of each face key point, a face horizontal rotation angle, a face pitch angle, a face rotation angle, a binocular distance and an image confidence coefficient;
and generating first training data according to the first image set, the image information of each sample image in the first image set and the training label corresponding to each sample image.
According to one or more embodiments of the present disclosure, the method further comprises:
determining sample sight information of a user corresponding to face identification in each sample image in the first image set, the sample sight information comprising: left eye position information, right eye position information, a left eye sight line direction, a right eye sight line direction and a double-eye average sight line direction of a user in a three-dimensional space;
generating second training data according to the first training data and the sample sight information of the user corresponding to each face identification;
and inputting the second training data into a second deep learning network established in advance for training to obtain a machine learning model.
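As an illustrative aside, sample sight information of this shape could be represented as follows; deriving the binocular average sight direction by normalising the sum of the two unit eye-direction vectors is an assumption made for this sketch and is not stated in the disclosure.

```python
# Hypothetical container for sample sight information; the averaging rule
# below is an illustrative assumption, not the method claimed here.
from dataclasses import dataclass
from typing import Tuple
import math

Vec3 = Tuple[float, float, float]

@dataclass
class SampleSightInfo:
    left_eye_pos: Vec3    # left eye position in the three-dimensional space
    right_eye_pos: Vec3   # right eye position
    left_dir: Vec3        # left eye sight direction (unit vector)
    right_dir: Vec3       # right eye sight direction (unit vector)

    def average_direction(self) -> Vec3:
        """One plausible binocular average sight direction: normalised sum."""
        s = tuple(l + r for l, r in zip(self.left_dir, self.right_dir))
        norm = math.sqrt(sum(c * c for c in s))
        return tuple(c / norm for c in s)

if __name__ == "__main__":
    info = SampleSightInfo((-0.03, 0.0, 0.5), (0.03, 0.0, 0.5),
                           (0.0, 0.0, -1.0), (0.1, 0.0, -0.995))
    print(info.average_direction())
```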
In a second aspect, according to one or more embodiments of the present disclosure, there is provided a gaze estimation device, including:
the image acquisition module is used for acquiring a target image, wherein the target image presents a facial image of the user to be detected;
the feature detection module is used for carrying out facial feature detection on the target image to obtain facial feature information of the user to be detected;
the sight line estimation module is used for inputting the target image and the facial feature information into a machine learning model for sight line estimation to obtain a sight line estimation result output by the machine learning model; the machine learning model has been trained based on sample images, sample facial features that are facial features present in the sample images, and sample gaze information that indicates a gaze present in the sample images;
and the viewpoint determining module is used for determining the viewpoint position of the user to be detected in the target image according to the sight line estimation result.
According to one or more embodiments of the present disclosure, the feature detection module is specifically configured to:
acquiring image information of the target image, wherein the image information comprises the width and the height of the target image, the space occupied by each row of pixels in the target image, the color space of the target image and the direction information of the target image;
inputting the image information of the target image into the face detection model obtained by pre-training to obtain the facial feature information of the user to be detected output by the face detection model, wherein the facial feature information comprises facial rectangular region information of the user to be detected, face key point data, visibility of each face key point, a facial horizontal rotation angle, a facial pitch angle, a facial rotation angle and a binocular distance.
According to one or more embodiments of the present disclosure, the sight line estimation result includes: position information of a left eye and a right eye in a three-dimensional space and an average sight line direction of two eyes, wherein the three-dimensional space is a three-dimensional space established by taking the position of a target point as an origin;
the viewpoint determining module is specifically configured to:
determining the position information of the middle point between the left eye and the right eye according to the position information of the left eye and the right eye in the three-dimensional space;
and determining the viewpoint position of the user to be detected according to the intermediate point position information, the average sight direction of the two eyes and a first formula.
According to one or more embodiments of the present disclosure, the sight line estimation device further includes: and the position judgment module is used for judging whether the viewpoint of the user to be detected is in a preset area according to the viewpoint position of the user to be detected, wherein the preset area is an area in a target plane.
In one or more possible embodiments, the position determining module is specifically configured to: when the abscissa, the ordinate and the vertical coordinate of the viewpoint meet preset conditions, determine that the viewpoint is in the preset area, wherein the preset conditions comprise: the abscissa of the viewpoint is within a first range, the ordinate of the viewpoint is within a second range, and the vertical coordinate of the viewpoint is within a third range;
wherein the start value and the end value corresponding to each of the first range, the second range and the third range are determined according to the coordinates of each edge point of the preset area, the start value and the end value of the first range are respectively the minimum value and the maximum value of the abscissa of each edge point of the preset area, the start value and the end value of the second range are respectively the minimum value and the maximum value of the ordinate of each edge point of the preset area, and the start value and the end value of the third range are respectively the minimum value and the maximum value of the vertical coordinate of each edge point of the preset area.
According to one or more embodiments of the present disclosure, the gaze estimation apparatus further includes: a model training module for obtaining a first image set, the first image set comprising a plurality of sample images, the sample images presenting facial images of a user; generating first training data from the first set of images; and inputting the first training data into a pre-established first deep learning network for training to obtain a face detection model.
According to one or more embodiments of the present disclosure, the model training module is specifically configured to:
extracting image information of each sample image in the first image set, wherein the image information comprises the width and the height of the image, the space occupied by each row of pixels in the image, the color space of the image and the direction information of the image;
according to input labeling data, labeling each sample image in the first image set with a training label, wherein the training label comprises a sample facial feature corresponding to each sample image, and the sample facial feature comprises facial rectangular region information of a user, a face identifier, face key point data, visibility of each face key point, a face horizontal rotation angle, a face pitch angle, a face rotation angle, a binocular distance and an image confidence coefficient;
and generating first training data according to the first image set, the image information of each sample image in the first image set and the training label corresponding to each sample image.
In accordance with one or more embodiments of the present disclosure, the model training module is further configured to:
determine sample sight information of the user corresponding to the face identifier in each sample image in the first image set, wherein the sample sight information comprises: left eye position information, right eye position information, a left eye sight direction, a right eye sight direction and a binocular average sight direction of the user in the three-dimensional space;
generate second training data according to the first training data and the sample sight information of the user corresponding to each face identifier;
and input the second training data into a second deep learning network established in advance for training to obtain the machine learning model.
In a seventh aspect, according to one or more embodiments of the present disclosure, there is provided an electronic device including: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the gaze estimation method as described above in the first aspect and various possible designs of the first aspect.
In an eighth aspect, according to one or more embodiments of the present disclosure, there is provided a computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, implement the gaze estimation method as set forth in the first aspect and various possible designs of the first aspect.
A ninth aspect, according to one or more embodiments of the present disclosure, provides a computer program product comprising a computer program which, when executed by a processor, implements the gaze estimation method as described above in the first aspect and in various possible designs of the first aspect.
The foregoing description is only a description of the preferred embodiments of the present disclosure and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the disclosure is not limited to technical solutions formed by the particular combination of features described above, and also covers other technical solutions formed by any combination of the above features or their equivalents without departing from the concept of the disclosure, for example, technical solutions formed by mutually replacing the above features with (but not limited to) features having similar functions disclosed in the present disclosure.
Further, while operations are depicted in a particular order, this should not be understood as requiring that such operations be performed in the particular order shown or in sequential order. Under certain circumstances, multitasking and parallel processing may be advantageous. Likewise, while several specific implementation details are included in the above discussion, these should not be construed as limitations on the scope of the disclosure. Certain features that are described in the context of separate embodiments can also be implemented in combination in a single embodiment. Conversely, various features that are described in the context of a single embodiment can also be implemented in multiple embodiments separately or in any suitable subcombination.
Although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.

Claims (13)

1. A sight line estimation method, characterized by comprising:
acquiring a target image, wherein the target image presents a facial image of a user to be detected;
carrying out facial feature detection on the target image to obtain facial feature information of the user to be detected;
inputting the target image and the facial feature information into a machine learning model for sight line estimation to obtain a sight line estimation result output by the machine learning model; the machine learning model has been trained based on sample images, sample facial features that are facial features present in the sample images, and sample gaze information that indicates a gaze present in the sample images;
and determining the viewpoint position of the user to be detected in the target image according to the sight line estimation result.
2. The method according to claim 1, wherein the performing facial feature detection on the target image to obtain facial feature information of the user to be detected comprises:
acquiring image information of the target image, wherein the image information comprises the width and the height of the target image, the space occupied by each row of pixels in the target image, the color space of the target image and the direction information of the target image;
inputting the image information of the target image into a face detection model obtained by pre-training to obtain the facial feature information of the user to be detected output by the face detection model, wherein the facial feature information comprises facial rectangular region information of the user to be detected, face key point data, visibility of each face key point, a facial horizontal rotation angle, a facial pitch angle, a facial rotation angle and a binocular distance.
3. The method according to claim 1, wherein the gaze estimation result comprises: position information of a left eye and a right eye in a three-dimensional space and average sight directions of two eyes, wherein the three-dimensional space is a three-dimensional space established by taking the position of a target point as an origin;
determining the viewpoint position of the user to be detected in the target image according to the sight line estimation result, including:
determining the position information of the middle point between the left eye and the right eye according to the position information of the left eye and the right eye in the three-dimensional space;
and determining the viewpoint position of the user to be detected according to the intermediate point position information, the average sight direction of the two eyes and a first formula.
4. The method of claim 3, wherein the first formula is a spatial line equation.
5. The method according to any one of claims 1-4, further comprising:
and judging whether the viewpoint of the user to be detected is in a preset area according to the viewpoint position of the user to be detected, wherein the preset area is an area in a target plane.
6. The method of claim 5, wherein the determining whether the viewpoint of the user to be detected is within a preset area according to the viewpoint position of the user to be detected comprises:
when the abscissa, the ordinate and the vertical coordinate of the viewpoint meet preset conditions, determining that the viewpoint is in the preset area, wherein the preset conditions comprise: the abscissa of the viewpoint is within a first range, the ordinate of the viewpoint is within a second range, and the vertical coordinate of the viewpoint is within a third range;
wherein the start value and the end value corresponding to each of the first range, the second range and the third range are determined according to the coordinates of each edge point of the preset region, the start value and the end value of the first range are respectively the minimum value and the maximum value of the abscissa of each edge point of the preset region, the start value and the end value of the second range are respectively the minimum value and the maximum value of the ordinate of each edge point of the preset region, and the start value and the end value of the third range are respectively the minimum value and the maximum value of the vertical coordinate of each edge point of the preset region.
7. The method according to any one of claims 1-4, further comprising:
acquiring a first image set, wherein the first image set comprises a plurality of sample images, and the sample images show facial images of a user;
generating first training data from the first set of images;
and inputting the first training data into a first deep learning network established in advance for training to obtain a face detection model.
8. The method of claim 7, wherein generating first training data from the first set of images comprises:
extracting image information of each sample image in the first image set, wherein the image information comprises the width and the height of the image, the space occupied by each row of pixels in the image, the color space of the image and the direction information of the image;
according to input labeling data, labeling each sample image in the first image set with a training label, wherein the training label comprises a sample facial feature corresponding to each sample image, and the sample facial feature comprises facial rectangular region information of a user, a face identifier, face key point data, visibility of each face key point, a face horizontal rotation angle, a face pitch angle, a face rotation angle, a binocular distance and an image confidence coefficient;
and generating first training data according to the first image set, the image information of each sample image in the first image set and the training label corresponding to each sample image.
9. The method of claim 8, further comprising:
determining sample sight information of a user corresponding to face identification in each sample image in the first image set, the sample sight information comprising: left eye position information, right eye position information, a left eye sight line direction, a right eye sight line direction and a double-eye average sight line direction of a user in a three-dimensional space;
generating second training data according to the first training data and the sample sight information of the user corresponding to each face identification;
and inputting the second training data into a second deep learning network established in advance for training to obtain a machine learning model.
10. A gaze estimation device, comprising:
an image acquisition module, used for acquiring a target image, wherein the target image presents a facial image of a user to be detected;
the feature detection module is used for carrying out facial feature detection on the target image to obtain facial feature information of the user to be detected;
the sight line estimation module is used for inputting the target image and the facial feature information into a machine learning model for sight line estimation to obtain a sight line estimation result output by the machine learning model; the machine learning model has been trained based on sample images, sample facial features that are facial features present in the sample images, and sample gaze information that indicates a gaze present in the sample images;
and the viewpoint determining module is used for determining the viewpoint position of the user to be detected in the target image according to the sight line estimation result.
11. An electronic device, comprising: at least one processor and memory;
the memory stores computer-executable instructions;
the at least one processor executing the computer-executable instructions stored by the memory causes the at least one processor to perform the gaze estimation method of any of claims 1-9.
12. A computer-readable storage medium having stored therein computer-executable instructions that, when executed by a processor, implement the gaze estimation method according to any one of claims 1-9.
13. A computer program product comprising a computer program which, when executed by a processor, implements the gaze estimation method of any one of claims 1-9.
CN202110590686.4A 2021-05-28 2021-05-28 Sight estimation method and device Pending CN115410242A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110590686.4A CN115410242A (en) 2021-05-28 2021-05-28 Sight estimation method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110590686.4A CN115410242A (en) 2021-05-28 2021-05-28 Sight estimation method and device

Publications (1)

Publication Number Publication Date
CN115410242A true CN115410242A (en) 2022-11-29

Family

ID=84156110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110590686.4A Pending CN115410242A (en) 2021-05-28 2021-05-28 Sight estimation method and device

Country Status (1)

Country Link
CN (1) CN115410242A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115862124A (en) * 2023-02-16 2023-03-28 南昌虚拟现实研究院股份有限公司 Sight estimation method and device, readable storage medium and electronic equipment
CN115862124B (en) * 2023-02-16 2023-05-09 南昌虚拟现实研究院股份有限公司 Line-of-sight estimation method and device, readable storage medium and electronic equipment

Similar Documents

Publication Publication Date Title
CN110322500B (en) Optimization method and device for instant positioning and map construction, medium and electronic equipment
CN111602140B (en) Method of analyzing objects in images recorded by a camera of a head-mounted device
WO2019242262A1 (en) Augmented reality-based remote guidance method and device, terminal, and storage medium
CN108764071B (en) Real face detection method and device based on infrared and visible light images
US8831337B2 (en) Method, system and computer program product for identifying locations of detected objects
CN113822977A (en) Image rendering method, device, equipment and storage medium
CN110349212B (en) Optimization method and device for instant positioning and map construction, medium and electronic equipment
US20210041945A1 (en) Machine learning based gaze estimation with confidence
EP4105766A1 (en) Image display method and apparatus, and computer device and storage medium
CN108463840A (en) Information processing equipment, information processing method and recording medium
US11902677B2 (en) Patch tracking image sensor
CN111062981A (en) Image processing method, device and storage medium
WO2023124693A1 (en) Augmented reality scene display
CN111784765A (en) Object measurement method, virtual object processing method, object measurement device, virtual object processing device, medium, and electronic apparatus
CN114910052A (en) Camera-based distance measurement method, control method and device and electronic equipment
CN111836073A (en) Method, device and equipment for determining video definition and storage medium
CN115410242A (en) Sight estimation method and device
CN112153320B (en) Method and device for measuring size of article, electronic equipment and storage medium
US20200211275A1 (en) Information processing device, information processing method, and recording medium
US11726320B2 (en) Information processing apparatus, information processing method, and program
CN114895789A (en) Man-machine interaction method and device, electronic equipment and storage medium
CN112114659A (en) Method and system for determining a fine point of regard for a user
CN111080630A (en) Fundus image detection apparatus, method, device, and storage medium
CN111739098A (en) Speed measuring method and device, electronic equipment and storage medium
CN110287872A (en) A kind of recognition methods of point of interest direction of visual lines and device based on mobile terminal

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination