WO2024021742A9 - Gaze point estimation method and related device (Procédé d'estimation de point de fixation et dispositif associé) - Google Patents

Gaze point estimation method and related device

Info

Publication number
WO2024021742A9
WO2024021742A9 (application PCT/CN2023/092415; CN2023092415W)
Authority
WO
WIPO (PCT)
Prior art keywords
image
image block
face
size
feature map
Prior art date
Application number
PCT/CN2023/092415
Other languages
English (en)
Chinese (zh)
Other versions
WO2024021742A1 (fr)
Inventor
孙贻宝
Original Assignee
Honor Device Co., Ltd. (荣耀终端有限公司)
Priority date
Filing date
Publication date
Application filed by Honor Device Co., Ltd. (荣耀终端有限公司)
Publication of WO2024021742A1
Publication of WO2024021742A9

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F 3/013 Eye tracking input arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/10 Image acquisition
    • G06V 10/12 Details of acquisition arrangements; Constructional details thereof
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G06V 10/44 Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/161 Detection; Localisation; Normalisation
    • G06V 40/166 Detection; Localisation; Normalisation using acquisition arrangements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/16 Human faces, e.g. facial parts, sketches or expressions
    • G06V 40/168 Feature extraction; Face representation
    • G06V 40/171 Local features and components; Facial parts; Occluding parts, e.g. glasses; Geometrical relationships
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/18 Eye characteristics, e.g. of the iris
    • G06V 40/19 Sensors therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V 40/18 Eye characteristics, e.g. of the iris
    • G06V 40/193 Preprocessing; Feature extraction
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to fields such as deep learning and big data processing, and in particular to a gaze point estimation method and related equipment.
  • Gaze point estimation generally refers to inputting an image, calculating the gaze direction through eye/head features and mapping it to the gaze point. Gaze point estimation is mainly used in human-computer interaction and visualization display of smartphones, tablets, smart screens, and AR/VR glasses.
  • gaze point estimation methods can be divided into two categories: geometry-based methods and appearance-based methods.
  • the basic idea of estimating the gaze point coordinates through geometry-based methods is to restore the three-dimensional line of sight direction through some two-dimensional information (such as eye features such as the corner of the eye).
  • the basic idea of estimating the gaze point coordinates through appearance-based methods is to learn a model that maps the input image to the gaze point. Both methods have their own advantages and disadvantages.
  • the geometry-based method is relatively more accurate, but it places high demands on image quality and resolution and requires additional hardware support (for example, infrared sensors and multiple cameras), which may lead to high power consumption; the appearance-based method is relatively less accurate.
  • the appearance-based method requires a large amount of training data; moreover, the distance between the camera and the subject is not fixed, and the depth information of the input image may also vary.
  • the sizes of the facial images obtained from different input images may therefore differ considerably and fail to meet the model's input requirements. Scaling the input image can satisfy those requirements, but carries a risk of feature deformation, which reduces the accuracy of the gaze point estimation.
  • the present application provides a gaze point estimation method and related equipment.
  • an electronic device can collect images through a camera, and obtain face position information and eye position information in the collected image when the face detection result meets the preset face condition. Based on the face position information and eye position information, the electronic device can determine the gaze point coordinates of the target object through a gaze point estimation network model. It can be understood that the shooting subject in the image collected by the electronic device through the camera is the target object. It can be understood that the shooting subject mentioned in the present application refers to the main shooting object when the user uses the electronic device to shoot.
  • the electronic device can process the ROI of the target image block with the corresponding preset feature map size based on the region of interest pooling module therein to obtain a feature map.
  • the target image block is obtained by cropping the collected image.
  • the target image block includes at least one type of image block among a face image block, a left eye image block, and a right eye image block. Different types of image blocks each correspond to a preset feature map size.
  • the above method can unify the size of the feature map through the region of interest pooling module, avoid deformation of the target image block after scaling, and improve the accuracy of the gaze point estimation.
  • the present application provides a method for estimating a gaze point.
  • the method can be applied to an electronic device provided with a camera.
  • the method can include: the electronic device can collect a first image through a camera; when the face detection result meets the preset face condition, the electronic device can obtain the face position information and the eye position information in the first image.
  • the electronic device can process the region of interest ROI of the target image block with the corresponding preset feature map size based on the region of interest pooling module of the gaze point estimation network model to obtain a feature map.
  • the target image block includes at least one type of image block among a face image block, a left eye image block, and a right eye image block.
  • the face position information includes the coordinates of the relevant feature points of the face area
  • the eye position information includes the coordinates of the relevant feature points of the eye area.
  • the face image block is an image block obtained by cropping the face area in the first image based on the face position information.
  • the left eye image block is an image block obtained by cropping the left eye area in the first image based on the eye position information.
  • the right eye image block is an image block obtained by cropping the right eye area in the first image based on the eye position information.
  • the electronic device can determine the gaze point coordinates of the target object based on the gaze point estimation network model.
  • the electronic device can process the region of interest ROI of the target image block with the corresponding preset feature map size based on the region of interest pooling module to obtain a feature map.
  • the target image block includes at least one type of image block in the face image block, the left eye image block, and the right eye image block. Different types of image blocks in the target image block each correspond to a preset feature map size.
  • the size of the feature map corresponding to the same type of image blocks is the same, while the size of the feature map corresponding to different types of image blocks can be the same or different.
  • This method can unify the size of the feature map corresponding to the same type of image blocks through the region of interest pooling module, prepare for subsequent feature extraction, and avoid the feature deformation caused by adjusting the feature map size through scaling, thereby improving the accuracy of gaze point estimation. It can be understood that feature deformation may cause inaccurate feature extraction, thereby affecting the accuracy of gaze point estimation.
  • the electronic device may capture the first image through a front camera. It is understandable that the electronic device may acquire the first image in real time, and the details may refer to the relevant description in step S301 below, which will not be elaborated here.
  • the first image may be image I1.
  • the description of the facial position information and the eye position information can be referred to in the following text, which is not described in detail here.
  • the relevant feature points of the facial region can include the edge contour feature points of the face.
  • the relevant feature points of the eye region can include the corner feature points, and can also include the edge contour feature points of the eye region.
  • the description of the relevant feature points of the facial region and the relevant feature points of the eye region can be referred to in the following text, which is not described in detail here.
  • the electronic device can obtain face position information during face detection. Specifically, during face detection, the electronic device can perform feature point detection and determine feature points related to the face to obtain face position information.
  • the electronic device can complete the detection of eyes during the face detection process to obtain eye position information.
  • the eye-related feature points may include pupil coordinates.
  • the electronic device can perform eye detection to obtain eye position information.
  • eye detection can be found in the following text and will not be elaborated here.
  • the region of interest pooling module may include several region of interest pooling layers.
  • the region of interest pooling module may include a region of interest pooling layer-1 and may also include a region of interest pooling layer-2.
  • the gaze point estimation network model may unify the feature map for the same type of image blocks in the target image block and perform feature extraction on it.
  • the present application may also provide a gaze point estimation network model, the input of which may not include a face grid, pupil coordinates, fully connected layer-2, and fully connected layer-3.
  • the present application may also provide a gaze point estimation network model, the input of which may not include a face grid, pupil coordinates, fully connected layer-2, fully connected layer-5, fully connected layer-3, and fully connected layer-6.
  • the preset feature map size corresponding to the facial image block is a first preset feature map size
  • the preset feature map size corresponding to the left eye image block is a second preset feature map size
  • the preset feature map size corresponding to the right eye image block is a third preset feature map size.
  • the region of interest of the target image block is the entire target image block.
  • the ROI of the face image block is the entire face image block
  • the ROI of the left eye image block is the entire left eye image block
  • the ROI of the right eye image block is the entire right eye image block.
  • the method may further include: the electronic device may crop the facial area in the first image based on the facial position information to obtain the facial image block.
  • the method may further include: the electronic device may crop the left-eye area in the first image based on the eye position information to obtain the left-eye image block.
  • the method may further include: the electronic device may crop the right-eye area in the first image based on the eye position information to obtain the right-eye image block.
  • the electronic device processes the region of interest ROI of the target image block with the corresponding preset feature map size to obtain a feature map, which may specifically include: the electronic device may divide the ROI of the target image block based on the corresponding preset feature map size to obtain a number of block areas, and the electronic device may also perform maximum pooling processing on each block area in the ROI of the target image block to obtain a feature map.
  • the number of block areas in each row of the ROI of the target image block is the same as the width value in the corresponding preset feature map size
  • the number of block areas in each column of the ROI of the target image block is the same as the height value in the corresponding preset feature map size.
  • the electronic device can divide the ROI in the target image block based on the width value and height value in the corresponding preset feature map size to obtain a number of block areas, and perform maximum pooling processing on each block area to obtain a feature map of the target image block, as illustrated in the sketch below. Since the number of block areas is consistent with the dimensions of the feature map output by the region of interest pooling layer, this method can, for image blocks of different sizes, unify the feature maps corresponding to the image blocks, thereby avoiding feature deformation caused by scaling, improving the accuracy of feature extraction, and thus improving the accuracy of gaze point estimation.
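As a concrete illustration of the grid division and maximum pooling described above, the following minimal NumPy sketch maps a single-channel ROI of arbitrary size onto a preset feature map size; the 7x7 output and the example ROI sizes are arbitrary illustrative values, not values taken from the application.

```python
import numpy as np

def roi_max_pool(roi: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Divide an ROI into out_h x out_w block areas and max-pool each block.

    The number of blocks per row equals the width value of the preset feature
    map size and the number per column equals its height value, so image
    blocks of different sizes all yield feature maps of the same size.
    The ROI is assumed to be at least out_h x out_w pixels.
    """
    in_h, in_w = roi.shape[:2]
    # Row/column boundaries of the (roughly equal-sized) block areas.
    row_edges = np.linspace(0, in_h, out_h + 1).astype(int)
    col_edges = np.linspace(0, in_w, out_w + 1).astype(int)
    feature_map = np.empty((out_h, out_w), dtype=roi.dtype)
    for i in range(out_h):
        for j in range(out_w):
            block = roi[row_edges[i]:row_edges[i + 1],
                        col_edges[j]:col_edges[j + 1]]
            feature_map[i, j] = block.max()   # maximum pooling per block area
    return feature_map

# A 45x60 face ROI and a 32x52 left-eye ROI both map to 7x7 feature maps,
# without any scaling (and hence without feature deformation).
face_feat = roi_max_pool(np.random.rand(45, 60), 7, 7)
eye_feat = roi_max_pool(np.random.rand(32, 52), 7, 7)
assert face_feat.shape == eye_feat.shape == (7, 7)
```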
  • the ROI of the face image block in the target image block can be the face area in the face image block.
  • the ROI of the left eye image block in the target image block can be the left eye area in the left eye image block.
  • the ROI of the right eye image block in the target image block can be the right eye area in the right eye image block.
  • the electronic device performs maximum pooling processing on each block area in the ROI of the target image block to obtain a feature map, which may specifically include: the electronic device may perform maximum pooling processing on each face block area in the ROI of the face image block to obtain a first feature map, may perform maximum pooling processing on each left eye block area in the ROI of the left eye image block to obtain a second feature map, and may also perform maximum pooling processing on each right eye block area in the ROI of the right eye image block to obtain a third feature map.
  • the first feature map is a feature map corresponding to the ROI of the face image block
  • the second feature map is a feature map corresponding to the ROI of the left eye image block
  • the third feature map is a feature map corresponding to the ROI of the right eye image block.
  • the number of block areas in each row of the ROI of the target image block is the same as the width value in the corresponding preset feature map size, and the number of block areas in each column of the ROI of the target image block is the same as the height value in the corresponding preset feature map size, specifically including: the number of face block areas in each row of the ROI of the face image block is the same as the width value in the first preset feature map size, and the number of face block areas in each column of the ROI of the face image block is the same as the height value in the first preset feature map size; the number of left eye block areas in each row of the ROI of the left eye image block is the same as the width value in the second preset feature map size, and the number of left eye block areas in each column of the ROI of the left eye image block is the same as the height value in the second preset feature map size; the number of right eye block areas in each row of the ROI of the right eye image block is the same as the width value in the third preset feature map size, and the number of right eye block areas in each column of the ROI of the right eye image block is the same as the height value in the third preset feature map size.
  • the target image block may include a facial image block, a left eye image block, and a right eye image block.
  • the electronic device can unify the sizes of the feature maps corresponding to the facial image block, the left eye image block, and the right eye image block based on the gaze point estimation network model, and extract features based on the feature maps corresponding to the facial image block, the left eye image block, and the right eye image block. It is understandable that this method can unify the feature maps corresponding to the image blocks, thereby avoiding feature deformation caused by scaling, improving the accuracy of feature extraction, and thus improving the accuracy of gaze point estimation.
  • the second preset feature map size may be the same as the third preset feature map size.
  • the first preset feature map size may be the same as the second preset feature map size.
  • the first preset feature map size may be the same as the third preset feature map size.
  • the processing of the target image block performed by the electronic device based on the region of interest pooling module can be referred to above and will not be repeated here.
  • the subject of the first image is the target object.
  • the method may further include: when the face detection result meets the preset face condition, the electronic device may obtain the pupil coordinates in the first image; the electronic device may determine the position and size of the face area in the first image in the first image based on the face position information, and obtain the face grid corresponding to the first image.
  • the face grid is used to characterize the distance between the target object and the camera.
  • the method may further include: the electronic device may perform convolution processing on the feature map based on the convolution module of the gaze point estimation network model to extract eye features and/or facial features; the electronic device may also integrate the eye features and/or facial features, the face grid and the pupil coordinates based on the fusion module of the gaze point estimation network model to obtain the gaze point coordinates of the target object.
  • the electronic device can perform gaze point estimation based on more types of features (for example, facial features, eye features, depth information, pupil position, etc.), that is, the gaze point estimation can be performed based on more comprehensive feature information, which can improve the accuracy of gaze point estimation.
  • the face grid can represent the position and size of the face in the image, and can reflect the depth information of the target object in the image, that is, the distance between the target object and the camera that captures the image.
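The application does not spell out how the face grid is encoded; the sketch below assumes a common binary-grid encoding (a hypothetical 25x25 grid) in which the cells covered by the face area are set to 1, so that the position and relative size of the face, and therefore the distance between the target object and the camera, are reflected in the grid.

```python
import numpy as np

def make_face_grid(img_w: int, img_h: int,
                   face_box: tuple, grid_size: int = 25) -> np.ndarray:
    """Binary grid marking where the face area lies within the full image.

    face_box is (x, y, w, h) in pixels. Cells covered by the face are set
    to 1; a larger patch of ones corresponds to a face closer to the camera.
    The 25x25 resolution is an assumption, not a value from the application.
    """
    x, y, w, h = face_box
    grid = np.zeros((grid_size, grid_size), dtype=np.float32)
    # Map the face rectangle from pixel coordinates to grid-cell indices.
    x0 = int(np.floor(x / img_w * grid_size))
    y0 = int(np.floor(y / img_h * grid_size))
    x1 = int(np.ceil((x + w) / img_w * grid_size))
    y1 = int(np.ceil((y + h) / img_h * grid_size))
    grid[y0:y1, x0:x1] = 1.0
    return grid

# Example: a 640x480 frame whose face occupies a 200x200 box near the centre.
face_grid = make_face_grid(640, 480, (220, 140, 200, 200))
```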
  • the human face in the first image mentioned in the present application is the face of the target object in the first image.
  • the electronic device may input the facial image block, the left eye image block, the right eye image block, the facial grid and the pupil coordinates into the gaze point estimation network model, and the model may output the gaze point coordinates.
  • the gaze point estimation network model may include a region of interest pooling module, a convolution module and a fusion module.
  • the region of interest pooling module can be used to: process the region of interest ROI of the facial image block with a first preset feature map size to obtain a first feature map.
  • the region of interest pooling module can also be used to: process the ROI of the left eye image block with a second preset feature map size to obtain a second feature map, and process the ROI of the right eye image block with a third preset feature map size to obtain a third feature map.
  • the convolution module can be used to: perform convolution processing on the first feature map, the second feature map and the third feature map respectively, and extract facial features and eye features.
  • the fusion module can be used to: integrate facial features, eye features, facial grids and pupil coordinates to obtain the gaze point coordinates of the target object.
  • the size of the first feature map is the same as the size of the first preset feature map
  • the size of the second feature map is the same as the size of the second preset feature map
  • the size of the third feature map is the same as the size of the third preset feature map.
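The three modules just described can be put together as in the following minimal PyTorch sketch. This is not the application's actual network: the layer counts, channel widths, preset feature-map sizes and fully connected dimensions are illustrative assumptions. Because the ROI of each target image block is the entire block, the ROI pooling step is expressed here as adaptive max pooling onto the corresponding preset size.

```python
import torch
import torch.nn as nn

class GazeNetSketch(nn.Module):
    """Sketch of: ROI pooling module -> convolution module -> fusion module
    that also takes the face grid and pupil coordinates and outputs (x, y)."""

    def __init__(self, face_map=(64, 64), eye_map=(32, 32), grid_size=25):
        super().__init__()
        # ROI pooling module: the ROI is the whole image block, so pooling it
        # onto a fixed grid is equivalent to adaptive max pooling.
        self.face_pool = nn.AdaptiveMaxPool2d(face_map)    # first preset size
        self.left_pool = nn.AdaptiveMaxPool2d(eye_map)     # second preset size
        self.right_pool = nn.AdaptiveMaxPool2d(eye_map)    # third preset size

        # Convolution module (one small branch per image block type).
        def branch():
            return nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Flatten())
        self.face_conv, self.left_conv, self.right_conv = branch(), branch(), branch()

        # Fusion module: fully connected layers over the concatenated facial
        # features, eye features, face grid and pupil coordinates.
        face_dim = 32 * (face_map[0] // 4) * (face_map[1] // 4)
        eye_dim = 32 * (eye_map[0] // 4) * (eye_map[1] // 4)
        fused_dim = face_dim + 2 * eye_dim + grid_size * grid_size + 2
        self.fusion = nn.Sequential(
            nn.Linear(fused_dim, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, face, left_eye, right_eye, face_grid, pupil_xy):
        f = self.face_conv(self.face_pool(face))
        l = self.left_conv(self.left_pool(left_eye))
        r = self.right_conv(self.right_pool(right_eye))
        fused = torch.cat([f, l, r, face_grid.flatten(1), pupil_xy], dim=1)
        return self.fusion(fused)

# Image blocks of arbitrary sizes are accepted because the ROI pooling module
# maps them to the preset feature map sizes before convolution.
model = GazeNetSketch()
xy = model(torch.rand(1, 3, 240, 200),   # face image block
           torch.rand(1, 3, 60, 90),     # left eye image block
           torch.rand(1, 3, 58, 92),     # right eye image block
           torch.rand(1, 25, 25),        # face grid
           torch.rand(1, 2))             # pupil coordinates
```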
  • the face detection result satisfies a preset face condition, specifically including: a face is detected in the first image.
  • the electronic device can obtain facial position information and eye position information when a human face is detected in the first image.
  • the face detection result satisfies the preset face condition, which may specifically include: a face is detected in the first image, and the size of the face area in the first image meets the preset size requirement.
  • the method may also include: when a face is detected in the first image, and the size of the face area in the first image does not meet the preset size requirement, the electronic device may perform adaptive zoom, and re-capture the image based on the focal length after the adaptive zoom.
  • when the first image includes a face but the size of the face area in the first image does not meet the preset size requirement, the electronic device can perform adaptive zoom and re-capture the image based on the focal length after the adaptive zoom, so that the size of the face in subsequently captured images meets expectations. In this way, the electronic device can capture images containing faces of appropriate size: image details are not lost and subsequent feature extraction is not made difficult by a face that is too small in the captured image, nor is image information lost and feature extraction made difficult by a face that is too large. In other words, with the above method the features extracted by the electronic device are relatively accurate, so the accuracy of the gaze point estimation is also improved.
  • the size of the face region in the first image meets a preset size requirement, specifically including: the area of the face region in the first image is within a preset area range.
  • the size of the facial area in the first image meets a preset size requirement, specifically including: the height of the facial area in the first image is within a preset height range, and the width of the facial area in the first image is within a preset width range.
  • the electronic device can use adaptive zoom to adapt the distance scale of the input so that it becomes a simple, easy-to-process sample. In other words, the electronic device can capture images at a moderate shooting distance through adaptive zoom; a hypothetical version of this size check is sketched below.
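A hypothetical version of the size check driving adaptive zoom might look as follows; the preset area range and the way a zoom factor is derived are placeholders, since the application does not give concrete thresholds.

```python
def needs_adaptive_zoom(face_w, face_h, min_area=100 * 100, max_area=400 * 400):
    """Return a suggested zoom factor if the face area in the captured image
    falls outside the preset area range, or None if the size requirement is
    already met. The area thresholds here are hypothetical placeholders."""
    area = face_w * face_h
    if min_area <= area <= max_area:
        return None                       # size meets the preset requirement
    target_area = (min_area + max_area) / 2
    # A linear zoom scales the face area quadratically, hence the square root.
    return (target_area / area) ** 0.5

# Example: a 60x70 face is too small, so zoom in by the returned factor and
# re-capture the image at the new focal length.
zoom_factor = needs_adaptive_zoom(60, 70)
```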
  • the electronic device crops the face area in the first image based on the face position information, which may specifically include: the electronic device may determine the relevant feature points of the face area in the first image; the electronic device may determine the first circumscribed rectangle; the electronic device may also crop the first image based on the position of the first circumscribed rectangle in the first image.
  • the first circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the face area in the first image
  • the face image block is at the same position as the first circumscribed rectangle in the first image
  • the face image block is the same size as the first circumscribed rectangle.
  • the electronic device crops the left eye area in the first image based on the eye position information, which may specifically include: the electronic device may determine the relevant feature points of the left eye area in the first image; the electronic device may determine the second circumscribed rectangle, and crop the first image based on the position of the second circumscribed rectangle in the first image.
  • the second circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the left eye area in the first image
  • the left eye image block is at the same position as the second circumscribed rectangle in the first image
  • the left eye image block is the same size as the second circumscribed rectangle.
  • the electronic device crops the right eye region in the first image based on the eye position information, which may specifically include: the electronic device may determine the relevant feature points of the right eye region in the first image; the electronic device may determine the third circumscribed rectangle, and based on the position of the third circumscribed rectangle in the first image, crops the first image.
  • the third circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the right eye region in the first image, the right eye image block and the third circumscribed rectangle are at the same position in the first image, and the right eye image block and the third circumscribed rectangle are the same size.
  • the electronic device can obtain the facial image block, the left eye image block and the right eye image block based on the circumscribed rectangle of the relevant feature points of the facial area, the circumscribed rectangle of the relevant feature points of the left eye area and the circumscribed rectangle of the relevant feature points of the right eye area, respectively.
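A minimal sketch of cropping by the circumscribed rectangle of feature points, assuming the feature points are given as (x, y) pixel coordinates; the clamping to the image bounds is an added practical detail rather than something stated in the application.

```python
import numpy as np

def crop_by_feature_points(image: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Crop the circumscribed (axis-aligned bounding) rectangle of a set of
    feature points, e.g. face contour points or eye corner/contour points.

    `points` is an (N, 2) array of (x, y) pixel coordinates in the image."""
    x0, y0 = np.floor(points.min(axis=0)).astype(int)
    x1, y1 = np.ceil(points.max(axis=0)).astype(int)
    h, w = image.shape[:2]
    x0, y0 = max(x0, 0), max(y0, 0)          # keep the rectangle inside the image
    x1, y1 = min(x1, w), min(y1, h)
    return image[y0:y1, x0:x1]

# The face image block is the bounding rectangle of the face feature points;
# the eye image blocks are built the same way from the eye feature points.
frame = np.zeros((480, 640, 3), dtype=np.uint8)
face_block = crop_by_feature_points(frame, np.array([[200, 120], [430, 130], [300, 380]]))
```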
  • cropping the face area in the first image based on the face position information to obtain the face image block may specifically include: the electronic device may determine the face area in the first image based on the face position information; the electronic device may crop the first image with the face area as the center of the first cropping frame to obtain the face image block.
  • the size of the first cropping frame is the first preset cropping size.
  • the face image block is the same size as the first cropping frame.
  • Cropping the left eye area and the right eye area in the first image based on the eye position information to obtain the left eye image block and the right eye image block may specifically include: the electronic device determines the left eye area in the first image and the right eye area in the first image based on the eye position information; the electronic device may crop the first image with the left eye area as the center of the second cropping frame to obtain the left eye image block, and may also crop the first image with the right eye area as the center of the third cropping frame to obtain the right eye image block.
  • the size of the second cropping frame is the second preset cropping size.
  • the left eye image block is the same size as the second cropping frame.
  • the size of the third cropping frame is the third preset cropping size.
  • the right eye image block has the same size as the third cropping frame.
  • the electronic device can crop the first image based on the face position information and the preset face cropping size to obtain the face image block.
  • the electronic device can also crop the first image based on the eye position information and the preset eye cropping size to obtain the left eye image block and the right eye image block.
  • the first preset cropping size is a preset face cropping size.
  • the second preset cropping size and the third preset cropping size are preset eye cropping sizes.
  • the second preset cropping size and the third preset cropping size may be the same.
  • the preset eye cropping size may include a preset left eye cropping size and a preset right eye cropping size.
  • the second preset cropping size may be a preset left eye cropping size.
  • the third preset cropping size may be a preset right eye cropping size.
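The alternative cropping scheme, which uses preset cropping sizes centred on the detected regions, could be sketched as follows; the concrete sizes used in the example (256x256 for the face, 96x64 for each eye) are hypothetical placeholders, not values from the application.

```python
import numpy as np

def crop_centered(image: np.ndarray, center_xy, crop_w: int, crop_h: int) -> np.ndarray:
    """Crop a preset-size frame (crop_w x crop_h) centred on a region centre,
    e.g. the face area with the preset face cropping size, or an eye area
    with the preset eye cropping size."""
    cx, cy = center_xy
    h, w = image.shape[:2]
    # Shift the cropping frame so it stays within the image bounds.
    x0 = int(np.clip(cx - crop_w // 2, 0, max(w - crop_w, 0)))
    y0 = int(np.clip(cy - crop_h // 2, 0, max(h - crop_h, 0)))
    return image[y0:y0 + crop_h, x0:x0 + crop_w]

frame = np.zeros((480, 640, 3), dtype=np.uint8)
face_block = crop_centered(frame, (320, 240), 256, 256)      # preset face cropping size
left_eye_block = crop_centered(frame, (280, 200), 96, 64)    # preset left eye cropping size
right_eye_block = crop_centered(frame, (360, 200), 96, 64)   # preset right eye cropping size
```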
  • the gaze point estimation network model may further include several activation layers.
  • the region of interest pooling module may include several region of interest pooling layers.
  • the convolution module may include several convolution layers.
  • the fusion module includes several fully connected layers.
  • the gaze point estimation network model may include several region of interest pooling layers, several convolution layers, and may also include several activation layers.
  • the gaze point estimation network model may include several region of interest pooling layers, several convolutional layers and several pooling layers.
  • the gaze point estimation network model may also include several activation layers.
  • the present application provides an electronic device.
  • the electronic device may include a display screen, a camera, a memory, and one or more processors.
  • the memory is used to store computer programs.
  • the camera may be used to: collect a first image.
  • the processor may be used to: obtain face position information and eye position information in the first image when the face detection result meets the preset face condition; in the process of processing the target image block based on the gaze point estimation network model, the region of interest pooling module based on the gaze point estimation network model processes the region of interest ROI of the target image block with the corresponding preset feature map size to obtain a feature map.
  • the face position information includes the coordinates of the relevant feature points of the face area
  • the eye position information includes the coordinates of the relevant feature points of the eye area.
  • the target image block includes at least one type of image block among the face image block, the left eye image block, and the right eye image block. Different types of image blocks each correspond to a preset feature map size.
  • the face image block is an image block obtained by cropping the face area in the first image based on the face position information
  • the left eye image block is an image block obtained by cropping the left eye area in the first image based on the eye position information
  • the right eye image block is an image block obtained by cropping the right eye area in the first image based on the eye position information.
  • the processor when used to process the region of interest ROI of the target image block with the corresponding preset feature map size to obtain a feature map, can be specifically used to: divide the ROI of the target image block based on the corresponding preset feature map size to obtain a plurality of block regions; perform maximum pooling processing on each block region in the ROI of the target image block to obtain a feature map.
  • the number of each row of block regions in the ROI of the target image block is the same as the width value in the corresponding preset feature map size, and the number of each column of block regions in the ROI of the target image block is the same as the height value in the corresponding preset feature map size.
  • the processor when used to divide the ROI of the target image block based on the corresponding preset feature map size to obtain a number of block areas, can be specifically used to: determine the ROI of the facial image block, and divide the ROI of the facial image block based on the first preset feature map size to obtain a number of facial block areas; determine the ROI of the left eye image block, and divide the ROI of the left eye image block based on the second preset feature map size to obtain a number of left eye block areas; determine the ROI of the right eye image block, and divide the ROI of the right eye image block based on the third preset feature map size to obtain a number of right eye block areas.
  • the processor when used to perform maximum pooling processing on each block area in the ROI of the target image block to obtain a feature map, can be specifically used to: perform maximum pooling processing on each face block area in the ROI of the face image block to obtain a first feature map; perform maximum pooling processing on each left eye block area in the ROI of the left eye image block to obtain a second feature map; perform maximum pooling processing on each right eye block area in the ROI of the right eye image block to obtain a third feature map.
  • the first feature map is a feature map corresponding to the ROI of the face image block
  • the second feature map is a feature map corresponding to the ROI of the left eye image block
  • the third feature map is a feature map corresponding to the ROI of the right eye image block.
  • the number of block areas in each row of the ROI of the target image block is the same as the width value in the corresponding preset feature map size, and the number of block areas in each column of the ROI of the target image block is the same as the height value in the corresponding preset feature map size.
  • the number of face block areas in each row of the ROI of the facial image block is the same as the width value in the first preset feature map size, and the number of face block areas in each column of the ROI of the facial image block is the same as the height value in the first preset feature map size;
  • the number of left eye block areas in each row of the ROI of the left eye image block is the same as the width value in the second preset feature map size, and the number of left eye block areas in each column of the ROI of the left eye image block is the same as the height value in the second preset feature map size;
  • the number of right eye block areas in each row of the ROI of the right eye image block is the same as the width value in the third preset feature map size, and the number of right eye block areas in each column of the ROI of the right eye image block is the same as the height value in the third preset feature map size.
  • the subject of the first image is the target object.
  • the processor can also be used to: obtain the pupil coordinates in the first image when the face detection result meets the preset face condition; determine the position and size of the face area in the first image in the first image based on the face position information, and obtain the face grid corresponding to the first image.
  • the face grid is used to characterize the distance between the target object and the camera.
  • the processor can also be used to: perform convolution processing on the feature map based on the convolution module of the gaze point estimation network model to extract eye features and/or facial features; integrate the eye features and/or facial features, the face grid and the pupil coordinates based on the fusion module of the gaze point estimation network model to obtain the gaze point coordinates of the target object.
  • the face detection result satisfies a preset face condition, which may specifically include: a face is detected in the first image.
  • the face detection result satisfies the preset face condition, which may specifically include: a face is detected in the first image, and the size of the face area in the first image meets the preset size requirement.
  • the processor may also be used to: when a face is detected in the first image and the size of the face area in the first image does not meet the preset size requirement, perform adaptive zoom, and recapture the image based on the focal length after the adaptive zoom.
  • the processor when used to crop the face area in the first image based on the face position information, can be specifically used to: determine the relevant feature points of the face area in the first image; determine the first circumscribed rectangle; and crop the first image based on the position of the first circumscribed rectangle in the first image.
  • the first circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the face area in the first image.
  • the face image block and the first circumscribed rectangle have the same position in the first image.
  • the face image block and the first circumscribed rectangle have the same size.
  • the processor when used to crop the left eye area in the first image based on the eye position information, can be specifically used to: determine the relevant feature points of the left eye area in the first image; determine the second circumscribed rectangle; and crop the first image based on the position of the second circumscribed rectangle in the first image.
  • the second circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the left eye area in the first image.
  • the left eye image block and the second circumscribed rectangle have the same position in the first image.
  • the left eye image block and the second circumscribed rectangle have the same size.
  • the processor when used to crop the right eye region in the first image based on the eye position information, can be specifically used to: determine the relevant feature points of the right eye region in the first image; determine the third circumscribed rectangle; and crop the first image based on the position of the third circumscribed rectangle in the first image.
  • the third circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the right eye region in the first image.
  • the right eye image block and the third circumscribed rectangle have the same position in the first image.
  • the right eye image block and the third circumscribed rectangle have the same size.
  • the processor when used to crop the face area in the first image based on the face position information to obtain the face image block, can be specifically used to: determine the face area in the first image based on the face position information; crop the first image with the face area as the center of the first cropping frame to obtain the face image block.
  • the size of the first cropping frame is the first preset cropping size.
  • the face image block has the same size as the first cropping frame.
  • the processor when used to crop the left eye area and the right eye area in the first image based on the eye position information to obtain the left eye image block and the right eye image block, can be specifically used to: determine the left eye area in the first image and the right eye area in the first image based on the eye position information; crop the first image with the left eye area as the center of the second cropping frame to obtain the left eye image block, and crop the first image with the right eye area as the center of the third cropping frame to obtain the right eye image block.
  • the size of the second cropping frame is the second preset cropping size.
  • the left eye image block has the same size as the second cropping frame.
  • the size of the third cropping frame is a third preset cropping size.
  • the size of the right eye image block is the same as that of the third cropping frame.
  • the gaze point estimation network model may further include several activation layers.
  • the region of interest pooling module may include several region of interest pooling layers.
  • the convolution module may include several convolution layers.
  • the fusion module may include several fully connected layers.
  • the present application provides a computer storage medium, comprising computer instructions, which, when executed on an electronic device, enables the electronic device to execute any possible implementation of the first aspect.
  • an embodiment of the present application provides a chip, which can be applied to an electronic device.
  • the chip includes one or more processors, and the processor is used to call computer instructions to enable the electronic device to execute any possible implementation method of the above-mentioned first aspect.
  • an embodiment of the present application provides a computer program product comprising instructions, which, when executed on an electronic device, enables the electronic device to execute any possible implementation of the first aspect.
  • the electronic device provided in the second aspect, the computer storage medium provided in the third aspect, the chip provided in the fourth aspect, and the computer program product provided in the fifth aspect are all used to execute any possible implementation of the first aspect. Therefore, the beneficial effects that can be achieved can refer to the beneficial effects of any possible implementation of the first aspect, and will not be repeated here.
  • FIG1 is a schematic diagram of a scene of gaze point estimation provided by an embodiment of the present application.
  • FIGS. 2A-2D are schematic diagrams of a set of scenes for gaze point estimation provided in an embodiment of the present application.
  • FIG3 is a flow chart of a method for estimating a gaze point provided in an embodiment of the present application.
  • FIG4 is a schematic diagram of a cropping principle provided in an embodiment of the present application.
  • FIG5 is a schematic diagram of another cropping principle provided in an embodiment of the present application.
  • FIG6 is a schematic diagram of a face grid provided in an embodiment of the present application.
  • FIG7 is a schematic diagram of the architecture of a gaze point estimation network model provided in an embodiment of the present application.
  • FIG8 is a schematic diagram of the architecture of another gaze point estimation network model provided in an embodiment of the present application.
  • FIG9 is a schematic diagram of the architecture of another gaze point estimation network model provided in an embodiment of the present application.
  • FIGS. 10A and 10B are schematic diagrams of a principle of a region of interest pooling layer provided in an embodiment of the present application.
  • FIG11 is a schematic diagram of a ROI mapped onto a feature map provided in an embodiment of the present application.
  • FIG12 is a schematic diagram of the structure of a CNN-1 provided in an embodiment of the present application.
  • FIG13 is a schematic diagram of the structure of a CNN-3 provided in an embodiment of the present application.
  • FIG14 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present application.
  • FIG. 15 is a schematic diagram of a software structure of an electronic device provided in an embodiment of the present application.
  • the present application provides a method for estimating a gaze point.
  • the method for estimating a gaze point can be applied to an electronic device.
  • the electronic device can collect images through a front camera. If the collected image includes a face, the electronic device can crop the collected image based on the face position information obtained by face detection and the preset face cropping size to obtain a face image block. Similarly, the electronic device can also crop the collected image based on the eye position information obtained by eye detection and the preset eye cropping size to obtain a left eye image block and a right eye image block.
  • the electronic device can also determine a face grid based on the face position information, and determine the pupil coordinates by pupil positioning. Among them, the face grid is used to represent the position and size of the face in the entire image.
  • the face grid can reflect the distance between the face and the camera.
  • the electronic device can input the left eye image block, the right eye image block, the face image block, the face grid and the pupil coordinates into the gaze point estimation network model, and output the gaze point coordinates.
  • the gaze point estimation network model can include a region of interest pooling layer.
  • the region of interest pooling layer can be used to unify the size of the feature map to prepare for subsequent feature extraction.
  • the electronic device may determine whether the size of the face region in the captured image meets a preset size requirement. If the size of the face region does not meet the preset size requirement, the electronic device may ensure that the shooting distance is appropriate through adaptive zooming and recapture the image. If the size of the face region meets the preset size requirement, the electronic device may estimate the gaze point coordinates according to the above method.
  • the electronic device can estimate the gaze point coordinates based on the left eye image block, the right eye image block, the face image block, the face grid and the pupil coordinates, that is, a more comprehensive feature extraction is achieved.
  • the electronic device can control the size of the face area in the captured image through adaptive zoom, and unify the size of the feature map based on the pooling layer of the region of interest, which can avoid the deformation of the image block (for example, the left eye image block, the right eye image block and the face image block) after scaling, thereby improving the accuracy of the gaze point estimation.
  • the electronic device can obtain a user image and estimate the gaze point coordinates through the user image.
  • the electronic device can capture images through a front camera. If the captured image includes a face, the electronic device can crop the captured image to obtain a left eye image block, a right eye image block, and a face image block.
  • the electronic device can also determine a face grid based on the face position information obtained by face detection, and determine the pupil coordinates through pupil positioning. Among them, the face grid is used to represent the position and size of the face in the entire image. It can also be understood that the face grid can reflect the distance between the face and the camera.
  • the electronic device can input the left eye image block, the right eye image block, the face image block, the face grid, and the pupil coordinates into the gaze point estimation network model, and output the gaze point coordinates. It can be understood that the relevant description of the gaze point estimation network model can be referred to later, and will not be expanded here.
  • the electronic device may specifically be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a laptop computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA) or a dedicated camera (for example, a SLR camera, a card camera) and other devices.
  • the electronic device may trigger a corresponding operation based on the estimated gaze point coordinates. In this case, the user may more conveniently interact with the electronic device.
  • the electronic device may display a reading interface 100.
  • the reading interface 100 displays the first page of the e-book that the user is reading.
  • the e-book has a total of 243 pages.
  • the electronic device may estimate the gaze point coordinates in real time.
  • the electronic device may estimate the gaze point coordinates based on the captured image, and determine that the gaze point coordinates are located at the end of the first page of content displayed on the reading interface 100. In this case, the electronic device may trigger page turning.
  • the electronic device may display the reading interface 200 shown in FIG2C .
  • the reading interface 200 displays the second page of the e-book that the user is reading.
  • the electronic device may continue to estimate the gaze point coordinates in real time.
  • the real-time estimation of gaze point coordinates mentioned in the present application may include: the electronic device may capture a frame of image at regular intervals (for example, 10 ms) and estimate the gaze point coordinates based on the image.
  • the electronic device when a user browses information using an electronic device, the electronic device can collect the user's preference information based on the estimated gaze point coordinates, thereby providing services to the user more intelligently based on the collected user's preference information. For example, when a user browses information using an electronic device, the electronic device may recommend some content (e.g., videos, articles, etc.). In this case, the electronic device can estimate the user's gaze point coordinates to determine the recommended content that the user is interested in. In the subsequent process, the electronic device can recommend content related to the recommended content of interest to the user.
  • the electronic device may display a user interface 300.
  • the user interface 300 may include several videos or text information.
  • the user interface 300 may include recommended content 1, recommended content 2, recommended content 3, recommended content 4, and recommended content 5.
  • the electronic device may capture images in real time and estimate the gaze point coordinates based on the captured images.
  • the electronic device may also count the distribution of the gaze point coordinates during the process of the electronic device displaying the user interface 300, thereby determining the recommended content that the user is interested in.
  • if the electronic device determines that the user is interested in recommended content 2 in the user interface 300, the electronic device may intelligently recommend content related to recommended content 2 to the user, thereby saving the user the time spent excluding content that is not of interest and providing services to the user more intelligently.
  • a gaze point estimation method provided by the present application is introduced below.
  • Figure 3 is a flow chart of a method for estimating a gaze point provided in an embodiment of the present application.
  • the method for estimating a gaze point may include but is not limited to the following steps:
  • S301 The electronic device acquires an image I1.
  • the electronic device obtains the image I1 through the front camera of the electronic device.
  • the electronic device receives the image I1 acquired by other cameras.
  • the electronic device can acquire images in real time. That is, the image I1 is an image acquired by the electronic device in real time.
  • the electronic device can acquire a frame of image every time T.
  • the time T mentioned in the present application can be set according to actual needs. Exemplarily, the time T can be 1 millisecond (ms).
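A minimal sketch of such periodic acquisition, assuming an OpenCV camera capture; `estimate_gaze` is a placeholder for the rest of the pipeline (face detection, cropping, and the gaze point estimation network model), and the camera index and period are illustrative.

```python
import time
import cv2

def run_acquisition_loop(estimate_gaze, period_s: float = 0.001):
    """Acquire a frame roughly every `period_s` seconds (the time T, e.g. 1 ms
    in the text; the actual rate is bounded by the camera frame rate) and hand
    it to the gaze pipeline."""
    cap = cv2.VideoCapture(0)            # front/default camera
    try:
        while True:
            ok, frame = cap.read()
            if not ok:
                break
            estimate_gaze(frame)         # may discard the frame if no face is detected
            time.sleep(period_s)
    finally:
        cap.release()
```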
  • S302 The electronic device performs face detection on the image I1 to determine whether the image I1 includes a face.
  • the electronic device can perform face detection on the image I1 to determine whether the image I1 includes a face. If it is detected that the image I1 includes a face, the electronic device can continue to perform subsequent steps. If it is detected that the image I1 does not include a face, the electronic device can discard the image I1 and reacquire the image.
  • face detection refers to determining whether a face exists in a dynamic scene or against a complex background and, if so, separating it from the background. In other words, based on the search strategy included in face detection, any given image can be searched to determine whether it contains a face.
  • the electronic device may determine the degree of match (i.e., correlation) between the input image and one or more pre-set standard face templates, and then determine whether there is a face in the image based on the degree of match. For example, the electronic device may determine the magnitude relationship between the degree of match and a preset threshold, and determine whether there is a face in the image based on the magnitude relationship. Specifically, if the degree of match is greater than the preset threshold, the electronic device determines that there is a face in the image, otherwise, the electronic device determines that there is no face in the image.
  • the electronic device when determining the degree of match between an input image and one or more pre-set standard face templates, may specifically calculate the degree of match between the input image and the facial contour, nose, eyes, mouth and other parts in the standard face template.
  • the electronic device may include a template library, and the standard face templates may be stored in the template library.
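One way the degree of match between an input image and a standard face template could be computed is normalized cross-correlation, sketched below with OpenCV template matching; the threshold value and the assumption of 8-bit grayscale inputs are placeholders, and a practical detector would also handle scale, rotation and pose.

```python
import cv2
import numpy as np

def face_present(image_gray: np.ndarray, templates, threshold: float = 0.6) -> bool:
    """Compute the degree of match between the input image and each standard
    face template and compare the best score against a preset threshold.

    `image_gray` and every template are assumed to be 8-bit grayscale arrays,
    each template no larger than the image; 0.6 is a placeholder threshold."""
    best = 0.0
    for tpl in templates:
        scores = cv2.matchTemplate(image_gray, tpl, cv2.TM_CCOEFF_NORMED)
        best = max(best, float(scores.max()))
    # A match greater than the preset threshold is taken to mean a face exists.
    return best > threshold
```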
  • human faces have certain structural distribution characteristics.
  • Electronic devices can extract the structural distribution characteristics of human faces from a large number of samples and generate corresponding rules, and then judge whether there is a human face in the image based on the rules.
  • the structural distribution characteristics of human faces may include: two symmetrical eyes, two symmetrical ears, a nose, a mouth, and the positions and relative distances between the five facial features.
  • the sample learning method refers to the method of artificial neural network, that is, to generate a classifier by learning a face sample set and a non-face sample set.
  • the electronic device can train the neural network based on the sample.
  • the parameters of the neural network include the statistical characteristics of the face.
  • Feature detection refers to the use of the invariant characteristics of a human face for face detection.
  • a human face has some features that are robust to different postures. For example, a person's eyes and eyebrows are darker than the cheeks, the lips are darker than the surrounding area, the bridge of the nose is lighter than the sides, etc.
  • the electronic device can extract these features and create a statistical model that can describe the relationship between these features, and then determine whether there is a face in the image based on the statistical model. It can be understood that the features extracted by the electronic device can be represented as a one-dimensional vector in the image feature space of the face. When creating a statistical model that can describe the relationship between the features, the electronic device can transform the one-dimensional vector into a relatively simple feature space.
  • the above four face detection methods can be used in combination in actual detection.
  • factors such as individual differences (e.g., differences in hairstyles, opening and closing of eyes, etc.), occlusion of faces in the shooting environment (e.g., occlusion of faces by hair, glasses, etc.), the angle of faces facing the camera (e.g., the side of the face facing the camera), the shooting environment (e.g., objects around the face, etc.), and imaging conditions (e.g., lighting conditions, imaging equipment) can also be taken into account in face detection.
  • the above-mentioned face detection method is only an example given in the embodiment of the present application.
  • the electronic device may also use other face detection methods to perform face detection.
  • the above-mentioned face detection method should not be regarded as a limitation to the present application.
  • the electronic device detects the facial features when performing face detection. This means that the electronic device also performs eye detection when performing face detection.
  • feature points related to the eyes can be obtained.
  • the electronic device can detect the eyes in the image I1
  • the electronic device can obtain eye position information. It is understandable that the eye position information may include the coordinates of feature points related to the eyes. The relevant description of the eye position information can be found in the following text and will not be elaborated here.
  • the eye-related feature points obtained by the electronic device may include the pupil center point.
  • the electronic device may obtain the pupil center point coordinates.
  • S303 The electronic device obtains the facial position information in the image I1.
  • the electronic device may acquire and save the face position information in the image I1 .
  • the facial position information may include the coordinates of the face detection frame.
  • the facial position information may include the coordinates of relevant feature points of the face, for example, the coordinates of the edge contour feature points of the face, for example, the coordinates of the feature points in the facial region related to the eyes, nose, mouth and ears.
  • S304 The electronic device performs eye detection and pupil positioning on the image I1, and obtains eye position information and pupil coordinates.
  • when the electronic device detects that the image I1 includes a human face, the electronic device can perform eye detection and pupil positioning on the image I1, thereby obtaining eye position information and pupil coordinates in the image I1.
  • the eye position information may include the coordinates of feature points related to the eyes.
  • the electronic device may determine the feature points related to the eyes and obtain the coordinates of these feature points. For example, the two eye corner feature points of the left eye, the two eye corner feature points of the right eye, and the edge contour feature points of the eyes.
  • the electronic device may determine the eye position in the image I1 based on the coordinates of these feature points related to the eyes.
  • pupil coordinates are two-dimensional coordinates.
  • pupil coordinates may include pupil center coordinates.
  • pupil coordinates may also include other coordinates related to the pupil. For example, the coordinates of the pupil center of gravity, the coordinates of the pupil edge contour points, etc.
  • the pupil positioning method is briefly described below.
  • when the electronic device detects an eye in the image I1, the electronic device may blur the eye portion of the image I1, extract the pupil contour, and then determine the pupil centroid. It is understandable that the electronic device may use the coordinates of the pupil centroid as the pupil coordinates.
  • alternatively, when the electronic device detects an eye in the image I1, the electronic device may blur the eye portion of the image I1, compute the pixel values accumulated along each row and each column, and then select the index of the row with the lowest value and the index of the column with the lowest value as the vertical and horizontal coordinates of the pupil, respectively.
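  • as an illustration of the projection-based pupil positioning above, the following is a minimal sketch assuming NumPy and OpenCV; the blur kernel size is an assumption.

```python
import cv2
import numpy as np

# Hedged sketch of projection-based pupil positioning: blur the eye region,
# then take the darkest row and darkest column as the pupil coordinates.
def locate_pupil(eye_gray):
    blurred = cv2.GaussianBlur(eye_gray, (5, 5), 0)

    row_sums = blurred.sum(axis=1)   # one value per row (vertical projection)
    col_sums = blurred.sum(axis=0)   # one value per column (horizontal projection)

    pupil_y = int(np.argmin(row_sums))   # darkest row -> vertical coordinate
    pupil_x = int(np.argmin(col_sums))   # darkest column -> horizontal coordinate
    return pupil_x, pupil_y
```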
  • the electronic device may also adopt other pupil positioning methods, and this application does not limit this.
  • S305 The electronic device determines whether the size of the face area in the image I1 meets a preset size requirement.
  • the electronic device can determine the size of the face area in the image I1, and determine whether the size of the face area in the image I1 meets the preset size requirement.
  • the face area can include important features of the face, such as eyes, nose, and mouth.
  • the size of the face region refers to the area of the face region.
  • the area of the face region refers to the area of the face detection frame. In some other embodiments of the present application, the area of the face region refers to the area of the entire face region in the image detected by the electronic device.
  • the face detection frame can be used to select a facial region including important features, and is not necessarily used to select a complete facial region.
  • the face detection frame can be used to select a large portion of a facial region including features such as eyebrows, eyes, nose, mouth, and ears.
  • the shape of the face detection frame can be set according to actual needs.
  • the face detection frame can be a rectangle.
  • the size of the face area in the image I1 meets the preset size requirement, which means that the area of the face area in the image I1 is within the preset area range.
  • the preset area range can be [220px*220px, 230px*230px]. That is to say, the area of the face area is not less than 220px*220px and not more than 230px*230px.
  • the present application does not limit the specific value of the preset area range. It can be understood that px is the abbreviation of "pixel", the smallest unit of a picture or graphic.
  • the size of the facial area in the image I1 meets the preset size requirement, which means that: the height of the facial area in the image I1 is within the preset height range, and the width of the facial area in the image I1 is within the preset width range.
  • the preset height range can be [215px, 240px]
  • the preset width range can be [215px, 240px].
  • the preset height range and the preset width range can be inconsistent, and the present application does not limit the specific values of the preset height range and the preset width range.
  • the height of the facial area mentioned in the present application can be understood as the height of the face detection frame
  • the width of the facial area mentioned in the present application can be understood as the width of the face detection frame.
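  • as an illustration of the preset size requirement, the following is a minimal sketch using the exemplary area range and height/width ranges given above; the function names are assumptions.

```python
# Hedged sketch: check whether the face region in image I1 meets the preset
# size requirement, using the exemplary ranges given above.
def face_size_ok_by_area(face_w, face_h,
                         min_area=220 * 220, max_area=230 * 230):
    area = face_w * face_h          # area of the face detection frame, in px
    return min_area <= area <= max_area

def face_size_ok_by_hw(face_w, face_h,
                       h_range=(215, 240), w_range=(215, 240)):
    return (h_range[0] <= face_h <= h_range[1]
            and w_range[0] <= face_w <= w_range[1])
```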
  • when the size of the facial area included in the image I1 meets the preset size requirement, the electronic device can continue to execute subsequent steps; when the size of the facial area included in the image I1 does not meet the preset size requirement, the electronic device can perform adaptive zoom and re-acquire the image according to the focal length after adaptive zoom.
  • the smaller the focal length, the wider the framing range and the wider the field of view of the shot: more objects can be captured, but each object appears smaller in the picture.
  • the larger the focal length, the narrower the framing range and the smaller the field of view of the shot: fewer objects can be captured, but each object occupies a larger proportion of the picture.
  • the adaptive zoom method is described below.
  • the electronic device can determine the focal length when acquiring the image I1.
  • the focal length when the electronic device acquires the image I1 is recorded as the original focal length in this application.
  • the electronic device can determine whether the area of the facial region in the image I1 is smaller than the minimum value of the preset area range, or larger than the maximum value of the preset area range. If the area of the facial region in the image I1 is smaller than the minimum value of the preset area range, the electronic device can add J1 to the original focal length to obtain the focal length after adaptive zoom, and reacquire the image based on the focal length. If the area of the facial region in the image I1 is larger than the maximum value of the preset area range, the electronic device can subtract J1 from the original focal length to obtain the focal length after adaptive zoom, and reacquire the image based on the focal length.
  • J1 is the preset focal length adjustment step, and the specific value of J1 can be set according to actual needs.
  • the electronic device may determine the middle value of the preset area range, and determine the ratio of the area of the face region to the middle value.
  • the electronic device may multiply the ratio by the original focal length to obtain the focal length after adaptive zooming, and reacquire the image based on the focal length.
  • the electronic device can determine the preset area range based on the preset height range and the preset width range, and then perform adaptive zoom based on the area of the facial area in image I1, the preset area range, and the original focal length.
  • the electronic device can determine the middle value of the preset height range and the middle value of the preset width range, and then multiply the middle value of the preset height range by the middle value of the preset width range to obtain a preset area, and perform adaptive zoom based on the preset area, the area of the facial area in image I1, and the original focal length.
  • the adaptive zoom method may also include other specific methods, which are not limited by this application.
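  • as an illustration of the step-based adaptive zoom above, the following is a minimal sketch; the preset focal length adjustment step J1 and the preset area range are exemplary assumptions, and the units of the focal length depend on the camera's zoom control.

```python
# Hedged sketch of the step-based adaptive zoom described above.
def adaptive_zoom(original_focal_length, face_area,
                  min_area=220 * 220, max_area=230 * 230, j1=0.1):
    if face_area < min_area:
        # Face too small: increase the focal length to narrow the framing range.
        return original_focal_length + j1
    if face_area > max_area:
        # Face too large: decrease the focal length to widen the framing range.
        return original_focal_length - j1
    # Size already meets the preset requirement: keep the original focal length.
    return original_focal_length
```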
  • the size of the facial area can reflect the shooting distance (i.e., the distance between the camera and the face). It can also be understood that the size of the facial area contains the depth information of the shot. If the shooting distance is too large, the eye features in the image captured by the electronic device through the camera may be blurred, thereby affecting the accuracy of the gaze point estimation. If the shooting distance is too small, the facial features in the image captured by the electronic device through the camera may be incomplete, thereby affecting the accuracy of the gaze point estimation.
  • the electronic device can capture an image containing a suitable face size, thereby improving the accuracy of the gaze point estimation.
  • S306 The electronic device crops the image I1 based on the face position information to obtain a face image block, and crops the image I1 based on the eye position information to obtain a left eye image block and a right eye image block.
  • This embodiment of the application provides two implementation methods when the electronic device performs step S306:
  • the first implementation mode the electronic device determines the bounding rectangle of the face area in the image I1 based on the coordinates of the feature points included in the face position information, and crops the image I1 based on the bounding rectangle of the face area to obtain a face image block.
  • the electronic device can also determine the bounding rectangle of the left eye area and the bounding rectangle of the right eye area in the image I1 based on the coordinates of the feature points included in the eye position information, and crops the image I1 based on the bounding rectangle of the left eye area and the bounding rectangle of the right eye area to obtain a left eye image block and a right eye image block.
  • the bounding rectangle mentioned in the present application can be a minimum bounding rectangle.
  • the minimum bounding rectangle refers to the maximum extent of a two-dimensional shape (e.g., a point set, a line, a polygon) expressed in two-dimensional coordinates, that is, the rectangle whose sides are determined by the maximum horizontal coordinate, the minimum horizontal coordinate, the maximum vertical coordinate, and the minimum vertical coordinate of the vertices of the given two-dimensional shape.
  • the bounding rectangle of the face area can be understood as the minimum bounding rectangle of facial feature points (e.g., facial edge contour feature points).
  • the bounding rectangle of the left eye area can be understood as the minimum bounding rectangle of the left eye feature points (e.g., the 2 corners of the left eye feature points, the edge contour feature points of the left eye).
  • the bounding rectangle of the right eye area can be understood as the minimum bounding rectangle of the right eye feature points (e.g., the 2 corners of the right eye feature points, the edge contour feature points of the right eye).
  • the size of the face image block is the same as the size of the circumscribed rectangle of the face area in the image I1.
  • the size of the left eye image block is the same as the size of the circumscribed rectangle of the left eye area in the image I1.
  • the size of the right eye image block is the same as the size of the circumscribed rectangle of the right eye area in the image I1.
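  • as an illustration of the first implementation mode, the following is a minimal sketch that crops an image block by the minimum bounding rectangle of a set of feature points, assuming NumPy arrays of (x, y) coordinates.

```python
import numpy as np

# Hedged sketch: crop image I1 by the minimum bounding rectangle of a set of
# feature points (facial, left-eye or right-eye feature points).
def crop_by_bounding_rect(image, points):
    pts = np.asarray(points)                  # shape (N, 2), columns are (x, y)
    x_min, y_min = pts.min(axis=0).astype(int)
    x_max, y_max = pts.max(axis=0).astype(int)
    # The resulting image block has the same size as the bounding rectangle.
    return image[y_min:y_max + 1, x_min:x_max + 1]
```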
  • the electronic device may determine the bounding box of the facial feature points by using a bounding box algorithm.
  • the bounding box of the facial feature points may be understood as the optimal enclosing area of the facial feature points.
  • the electronic device may also crop the image I1 based on the bounding box of the facial feature points to obtain a facial image block.
  • the electronic device may determine the bounding boxes of the left eye feature points and the right eye feature points respectively by using a bounding box algorithm.
  • the bounding boxes of the left eye feature points and the right eye feature points may be understood as the optimal enclosing area of the left eye feature points and the optimal enclosing area of the right eye feature points respectively.
  • the electronic device may also crop the image I1 based on the bounding boxes of the left eye feature points and the right eye feature points respectively to obtain a left eye image block and a right eye image block.
  • the bounding box algorithm is an algorithm for solving the optimal bounding space of a discrete point set.
  • the basic idea is to use a slightly larger geometric body with simpler characteristics (called a bounding box) to approximately replace complex geometric objects.
  • for the relevant description of the bounding box, please refer to the relevant technical documents; this application will not elaborate on this.
  • the electronic device crops the image I1 based on the face position information and the preset face cropping size to obtain a face image block, and crops the image I1 based on the eye position information and the preset eye cropping size to obtain a left eye image block and a right eye image block.
  • the electronic device can determine the facial area in the image I1 based on the coordinates in the facial position information, and crop the image I1 based on the preset face cropping size with the facial area as the center, thereby obtaining a facial image block.
  • the size of the facial image block is the same as the preset face cropping size.
  • the facial area in the facial image block is located at the center of the facial image block.
  • the coordinates in the facial position information can include the coordinates of the edge contour feature points of the face, can also include the coordinates of the face detection frame, and can also include the coordinates of the feature points related to the eyes, nose, mouth and ears in the face.
  • the electronic device can also determine the left eye area and the right eye area in image I1 based on the coordinates in the eye position information, and crop image I1 based on the preset eye cropping size with the left eye area and the right eye area as the center, respectively, to obtain the left eye image block and the right eye image block.
  • the left eye area in the left eye image block is located at the center of the left eye image block.
  • the right eye area in the right eye image block is located at the center of the right eye image block.
  • the coordinates in the eye position information can include the 2 eye corner feature points of the left eye and the 2 eye corner feature points of the right eye, and can also include the edge contour feature points of the eyes.
  • the size of the left eye image block is the same as the preset eye cropping size
  • the size of the right eye image block is the same as the preset eye cropping size.
  • the preset eye cropping size is 60px*60px.
  • the sizes of the left eye image block and the right eye image block cropped by the electronic device are both 60px*60px.
  • the preset eye cropping size may include a preset left eye cropping size and a preset right eye cropping size.
  • the preset left eye cropping size may be inconsistent with the preset right eye cropping size.
  • the size of the left eye image block is the same as the preset left eye cropping size, and the size of the right eye image block is the same as the preset right eye cropping size.
  • the preset face cropping size and the preset eye cropping size can be set according to actual needs, and this application does not limit this.
  • the preset face cropping size can be 244px*244px
  • the preset eye cropping size can be 60px*60px.
  • the electronic device may determine the facial area based on the coordinates included in the facial position information (e.g., the coordinates of the edge contour feature points of the face, etc.), set the cropping frame according to a preset face cropping size, and then crop the image I1 with the facial area as the center of the cropping frame, thereby obtaining a facial image block.
  • the electronic device may determine the left eye area and the right eye area based on the coordinates included in the eye position information, and set the left eye cropping frame and the right eye cropping frame according to a preset eye cropping size, and then use the left eye area and the right eye area as the center of the left eye cropping frame and the right eye cropping frame, respectively, to crop the image I1 to obtain a left eye image block and a right eye image block, respectively.
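  • as an illustration of the second implementation mode, the following is a minimal sketch that crops a fixed-size image block centered on a detected region; the clamping behavior at the image border is an assumption.

```python
# Hedged sketch: crop a fixed-size block centered on a detected region
# (face or eye), following the exemplary preset cropping sizes above.
def center_crop(image, center_xy, crop_w, crop_h):
    h, w = image.shape[:2]
    cx, cy = int(center_xy[0]), int(center_xy[1])
    x0 = max(0, min(cx - crop_w // 2, w - crop_w))
    y0 = max(0, min(cy - crop_h // 2, h - crop_h))
    return image[y0:y0 + crop_h, x0:x0 + crop_w]

# Example (assumed variable names): a 244*244 face image block and
# 60*60 eye image blocks cropped around the detected region centers.
# face_block = center_crop(i1, face_center, 244, 244)
# left_block = center_crop(i1, left_eye_center, 60, 60)
```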
  • S307 The electronic device determines the face grid corresponding to the image I1 based on the face position information.
  • the face grid is used to indicate the position and size of the face in the entire image.
  • the electronic device can determine the position of the face area in the image I1 based on the coordinates included in the face position information (for example, the coordinates of the edge contour feature points of the face, etc.), thereby determining the face grid corresponding to the image I1.
  • the face grid can be used to represent the position and size of the face in the entire image. It is understood that the face grid can represent the distance between the face and the camera.
  • the face grid can be understood as a binary mask.
  • a binary mask can be understood as a binary matrix corresponding to an image, that is, a matrix whose elements are all 0 or 1. Generally speaking, an image (all or part) can be blocked by a binary mask. Binary masks can be used for region of interest extraction, shielding, structural feature extraction, etc.
  • the electronic device can determine the proportional relationship between the face area in the image I1 and the image I1 according to the coordinates included in the face position information, thereby obtaining the depth information of the face in the image I1.
  • the electronic device can also determine that the face in the image I1 is located at a lower center position in the image I1. Further, the electronic device can determine the face grid corresponding to the image I1.
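  • as an illustration of the face grid as a binary mask, the following is a minimal sketch; the 25*25 grid resolution is an assumption for illustration and is not specified by the present application.

```python
import numpy as np

# Hedged sketch of the face grid: a coarse binary mask over the whole image in
# which cells covered by the face detection frame are set to 1, representing
# the position and size of the face in the entire image.
def face_grid(image_w, image_h, face_box, grid_size=25):
    x, y, w, h = face_box                      # face detection frame in pixels
    grid = np.zeros((grid_size, grid_size), dtype=np.uint8)

    # Map the frame from pixel coordinates to grid-cell coordinates.
    x0 = int(x / image_w * grid_size)
    y0 = int(y / image_h * grid_size)
    x1 = int(np.ceil((x + w) / image_w * grid_size))
    y1 = int(np.ceil((y + h) / image_h * grid_size))

    grid[y0:y1, x0:x1] = 1                     # cells containing the face -> 1
    return grid
```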
  • S308 The electronic device inputs the left eye image block, the right eye image block, the face image block, the face grid and the pupil coordinates into the gaze point estimation network model, and outputs the gaze point coordinates.
  • the electronic device can input the left eye image block, the right eye image block, the face image block, the face grid and the pupil coordinates into the gaze point estimation network model, and output the two-dimensional coordinates.
  • the two-dimensional coordinates are the gaze point coordinates.
  • the gaze point estimation network model can be a neural network model including several branches. The gaze point estimation network model can extract corresponding features through the several branches it contains, and then estimate the gaze point coordinates by synthesizing the extracted features.
  • a neural network is a mathematical model or computational model that imitates the structure and function of a biological neural network (the central nervous system of an animal, especially the brain).
  • a neural network is composed of a large number of artificial neurons, and different networks are constructed according to different connection methods.
  • Neural networks can include convolutional neural networks, recurrent neural networks, etc.
  • the gaze point estimation network model may include several region of interest pooling layers, several convolutional layers, several pooling layers and several fully connected layers.
  • the region of interest pooling layer is used to unify the size of the feature map.
  • the convolutional layer is used to extract features.
  • the pooling layer is used to downsample to reduce the amount of data.
  • the fully connected layer is used to map the extracted features to the sample label space. In layman's terms, the fully connected layer is used to integrate the extracted features together and output them as a value.
  • the gaze point estimation network model may include a region of interest pooling (ROI pooling) layer-1, a region of interest pooling layer-2, CNN-1, CNN-2, CNN-3, a fully connected layer-1, a fully connected layer-2, a fully connected layer-3 and a fully connected layer-4.
  • the region of interest pooling layer-1 is used to unify the size of the feature map corresponding to the left eye image block, and unify the size of the feature map corresponding to the right eye image block.
  • the region of interest pooling layer-2 is used to unify the size of the feature map corresponding to the facial image block.
  • CNN-1, CNN-2 and CNN-3 are all convolutional neural networks (CNNs), which are used to extract left eye features, right eye features and facial features, respectively.
  • CNN-1, CNN-2 and CNN-3 may include several convolutional layers and several pooling layers, respectively.
  • CNN-1, CNN-2 and CNN-3 may also include one or more fully connected layers.
  • Fully connected layer-1 is used to integrate the extracted left eye features, right eye features and facial features.
  • Fully connected layer-2 and fully connected layer-3 are used to integrate the depth information represented by the face grid (i.e., the distance between the face and the camera) and the pupil position information represented by the pupil coordinates, respectively.
  • Fully connected layer-4 is used to integrate the left eye features, right eye features, facial features, depth information and pupil position information and output them as one value.
  • the electronic device can use the left eye image block and the right eye image block as the input of the region of interest pooling layer-1, and use the face image block as the input of the region of interest pooling layer-2.
  • the region of interest pooling layer-1 can output feature maps of the same size.
  • the region of interest pooling layer-2 can also output feature maps of the same size.
  • the electronic device can use the feature map corresponding to the left eye image block output by the region of interest pooling layer-1 as the input of CNN-1, and can also use the feature map corresponding to the right eye image block output by the region of interest pooling layer-1 as the input of CNN-2.
  • the electronic device can use the feature map output by the region of interest pooling layer-2 as the input of CNN-3.
  • the electronic device can use the output of CNN-1, the output of CNN-2, and the output of CNN-3 as the input of the fully connected layer-1.
  • the electronic device can also use the face grid and pupil coordinates as the input of the fully connected layer-2 and the fully connected layer-3, respectively.
  • the electronic device can use the output of the fully connected layer-1, the fully connected layer-2, and the fully connected layer-3 as the input of the fully connected layer-4.
  • the fully connected layer-4 can output two-dimensional coordinates.
  • the two-dimensional coordinates are the gaze point coordinates estimated by the electronic device.
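  • as an illustration of the FIG. 7-style data flow above, the following is a minimal sketch assuming PyTorch; the channel counts, feature dimensions, the pupil-coordinate dimension, and the use of adaptive max pooling to unify feature-map sizes (in the spirit of region of interest pooling when the ROI is the whole image block) are assumptions, not the exact model of the present application.

```python
import torch
import torch.nn as nn

# Hedged sketch of the FIG. 7-style gaze point estimation data flow.
class GazeNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        self.roi_pool_eye = nn.AdaptiveMaxPool2d((60, 60))     # region of interest pooling layer-1
        self.roi_pool_face = nn.AdaptiveMaxPool2d((244, 244))  # region of interest pooling layer-2

        def cnn(out_dim):  # CNN-1 / CNN-2 / CNN-3: convolution, activation, pooling (assumed sizes)
            return nn.Sequential(
                nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.AdaptiveAvgPool2d((4, 4)), nn.Flatten(),
                nn.Linear(32 * 4 * 4, out_dim), nn.ReLU())

        self.cnn_left, self.cnn_right, self.cnn_face = cnn(64), cnn(64), cnn(64)
        self.fc1 = nn.Sequential(nn.Linear(64 * 3, 128), nn.ReLU())   # left eye + right eye + face features
        self.fc2 = nn.Sequential(nn.Linear(25 * 25, 256), nn.ReLU())  # face grid (depth information)
        self.fc3 = nn.Sequential(nn.Linear(4, 256), nn.ReLU())        # pupil coordinates (both eyes, assumed)
        self.fc4 = nn.Linear(128 + 256 + 256, 2)                      # gaze point coordinates (x, y)

    def forward(self, left_eye, right_eye, face, face_grid, pupils):
        l = self.cnn_left(self.roi_pool_eye(left_eye))
        r = self.cnn_right(self.roi_pool_eye(right_eye))
        f = self.cnn_face(self.roi_pool_face(face))
        features = self.fc1(torch.cat([l, r, f], dim=1))
        g = self.fc2(face_grid.flatten(1))
        p = self.fc3(pupils.flatten(1))
        return self.fc4(torch.cat([features, g, p], dim=1))
```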
  • the gaze point estimation network model may include more region of interest pooling layers.
  • the electronic device may use the left eye image block and the right eye image block as inputs of different region of interest pooling layers, respectively. Accordingly, the electronic device may use the outputs of the different region of interest pooling layers as inputs of CNN-1 and CNN-2, respectively.
  • the gaze point estimation network model may include more fully connected layers. It is understandable that there may be more fully connected layers before and after the fully connected layer-2, and there may be more fully connected layers before and after the fully connected layer-3.
  • the electronic device can use the output of the fully connected layer-2 as the input of the fully connected layer-5, and the output of the fully connected layer-5 as the input of the fully connected layer-4.
  • the electronic device can use the output of the fully connected layer-3 as the input of the fully connected layer-6, and the output of the fully connected layer-6 as the input of the fully connected layer-4.
  • the electronic device can use the output of the fully connected layer-4 as the input of the fully connected layer-7, and the output of the fully connected layer-7 is the gaze point coordinate estimated by the electronic device.
  • FIG8 is a schematic diagram of the architecture of another gaze point estimation network model provided in an embodiment of the present application.
  • the gaze point estimation network model may include a region of interest pooling layer-1, a region of interest pooling layer-2, CNN-1, CNN-2, CNN-3, a fully connected layer-1, a fully connected layer-2, a fully connected layer-3, a fully connected layer-4, a fully connected layer-5, a fully connected layer-6, and a fully connected layer-7.
  • the functions of the region of interest pooling layer-1, the region of interest pooling layer-2, CNN-1, CNN-2, CNN-3, and the fully connected layer-1 can all refer to the above, and this application will not repeat them here.
  • the fully connected layer-2 and the fully connected layer-5 are used to integrate the depth information represented by the face grid.
  • the fully connected layer-3 and the fully connected layer-6 are used to integrate the pupil position information represented by the pupil coordinates.
  • the fully connected layer-4 and the fully connected layer-7 are used to integrate the information such as the left eye features, the right eye features, the facial features, the depth information, and the pupil position, and output them as a value.
  • the electronic device can use the output of fully connected layer-2 as the input of fully connected layer-5, use the output of fully connected layer-3 as the input of fully connected layer-6, and use the outputs of fully connected layer-1, fully connected layer-5, and fully connected layer-6 as the input of fully connected layer-4.
  • the electronic device can also use the output of fully connected layer-4 as the input of fully connected layer-7, and the output of fully connected layer-7 is the gaze point coordinate estimated by the electronic device.
  • FIG9 is a schematic diagram of the architecture of another gaze point estimation network model provided by an embodiment of the present application.
  • the gaze point estimation network model may include a region of interest pooling layer-1, a region of interest pooling layer-2, CNN-1, CNN-2, CNN-3, a fully connected layer-2, a fully connected layer-3, a fully connected layer-4, a fully connected layer-5, a fully connected layer-6, and a fully connected layer-7.
  • the functions of the region of interest pooling layer-1, the region of interest pooling layer-2, CNN-1, CNN-2, and CNN-3 can all refer to the above, and this application will not repeat them here.
  • Fully connected layer-2 and fully connected layer-5 are used to integrate the depth information represented by the face grid.
  • Fully connected layer-3 and fully connected layer-6 are used to integrate the pupil position information represented by the pupil coordinates.
  • Fully connected layer-4 and fully connected layer-7 are used to integrate information such as left eye features, right eye features, facial features, depth information, and pupil position, and output them as a value.
  • the electronic device can use the output of fully connected layer-2 as the input of fully connected layer-5, use the output of fully connected layer-3 as the input of fully connected layer-6, and use the outputs of fully connected layer-5 and fully connected layer-6 as the input of fully connected layer-4.
  • the electronic device can also use the output of fully connected layer-4 as the input of fully connected layer-7, and the output of fully connected layer-7 is the gaze point coordinate estimated by the electronic device.
  • the gaze point estimation network model may further include several activation layers.
  • an activation layer may be set between the fully connected layer-1 and the fully connected layer-4
  • an activation layer may be set between the fully connected layer-2 and the fully connected layer-4
  • an activation layer may be set between the fully connected layer-3 and the fully connected layer-4.
  • an activation layer may be set between the fully connected layer-1 and the fully connected layer-4, an activation layer may be set between the fully connected layer-2 and the fully connected layer-5, an activation layer may be set between the fully connected layer-5 and the fully connected layer-4, an activation layer may be set between the fully connected layer-3 and the fully connected layer-6, an activation layer may be set between the fully connected layer-6 and the fully connected layer-4, and an activation layer may be set between the fully connected layer-4 and the fully connected layer-7.
  • an activation layer can be set between the fully connected layer-2 and the fully connected layer-5, an activation layer can be set between the fully connected layer-5 and the fully connected layer-4, an activation layer can be set between the fully connected layer-3 and the fully connected layer-6, an activation layer can be set between the fully connected layer-6 and the fully connected layer-4, and an activation layer can be set between the fully connected layer-4 and the fully connected layer-7.
  • the following uses the gaze point estimation network model shown in FIG7 , FIG8 and FIG9 as an example to illustrate various parts of the gaze point estimation network model.
  • Region of interest refers to: in machine vision and image processing, the area to be processed is outlined from the image being processed in the form of a box, circle, ellipse, irregular polygon, etc.
  • the region of interest pooling layer is a type of pooling layer.
  • the electronic device can divide the ROI in the image input to the region of interest pooling layer into sections of the same size, and perform a maximum pooling operation on each section.
  • the processed feature map obtained is the output of the region of interest pooling layer.
  • the number of sections is consistent with the size of the feature map output by the region of interest pooling layer.
  • the following example illustrates the processing process in the region of interest pooling layer-1.
  • the electronic device can divide the ROI of the left eye image block-1 into 3*3 block areas of equal size, and perform maximum pooling processing (i.e., taking the maximum value of each block area) on each block area.
  • the electronic device can obtain the feature map-1 corresponding to the ROI after the maximum pooling processing.
  • the electronic device can use the feature map-1 as the output of the region of interest pooling layer-1.
  • the size of the feature map-1 is 3*3. That is, the feature map-1 can be understood as a 3*3 matrix. It can be understood that the ROI of the left eye image block-1 is the entire left eye image block-1.
  • the electronic device can divide the ROI of the left eye image block-2 into 3*3 block areas of equal size, and perform maximum pooling processing (i.e., taking the maximum value of each block area) on each block area.
  • the electronic device can obtain the feature map-2 corresponding to the ROI after the maximum pooling processing.
  • the electronic device can use the feature map-2 as the output of the region of interest pooling layer-1.
  • the size of the feature map-2 is 3*3. That is, the feature map-2 can be understood as a 3*3 matrix. It can be understood that the ROI of the left eye image block-2 is the entire left eye image block-2.
  • FIG. 10A and FIG. 10B show the processing process of one channel among the three RGB channels.
  • the ROI of the input image can be divided into a plurality of block regions.
  • Each block region contains data.
  • the data contained in the block regions mentioned here can be understood as the elements of the corresponding regions in the matrix corresponding to the ROI of the input image.
  • the electronic device may divide the ROI of the image input to the region of interest pooling layer based on the size of the preset feature map.
  • for example, if the size of the preset feature map is 10*10 and the size of the ROI of the image input to the region of interest pooling layer is 100*100, the electronic device may evenly divide the ROI into 10*10 block areas, each of which is 10*10 in size.
  • the electronic device may not be able to evenly divide the ROI.
  • the electronic device can perform a zero padding operation, or, while ensuring that most of the block areas are the same size, make one column or one row of block areas slightly larger or smaller.
  • for example, if the size of the preset feature map is 10*10 and the size of the ROI of the image input to the region of interest pooling layer is 101*101, the electronic device may divide the ROI into 81 (i.e., 9*9) block areas of size 10*10, 9 block areas of size 10*11, 9 block areas of size 11*10, and 1 block area of size 11*11.
  • the feature maps output by the region of interest pooling layer-1 all have the same size.
  • the feature maps output by the region of interest pooling layer-2 all have the same size.
  • the size of the feature map-1 and feature map-2 obtained by inputting the left eye image block-1 and the left eye image block-2 into the region of interest pooling layer-1 is 3*3.
  • the size of the feature map output by the region of interest pooling layer is not limited to the above example, and this application does not impose any restrictions on this.
  • the left eye image block-3 is an RGB image.
  • the left eye image block-3 can be represented as a 60*60*3 matrix.
  • the elements in the matrix include the values of the RGB three channels corresponding to each pixel in the left eye image block-3.
  • the electronic device can input the left eye image block-3 into the region of interest pooling layer-1, and can output 3 3*3 feature maps. These 3 3*3 feature maps correspond to the feature maps of the RGB three channels respectively.
  • for each of the three channels, one feature map is output, and the processing process of inputting the channel to the region of interest pooling layer-1 can refer to FIG. 10A.
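  • as an illustration of the region of interest pooling described above, the following is a minimal sketch for a single channel, assuming NumPy; uneven divisions are handled by letting some block areas be one row or column larger, which is one of the options mentioned above.

```python
import numpy as np

# Hedged sketch of ROI max pooling for one channel: divide the ROI into
# out_h*out_w block areas and take the maximum value of each block area.
def roi_max_pool(roi, out_h=3, out_w=3):
    rows = np.array_split(np.arange(roi.shape[0]), out_h)
    cols = np.array_split(np.arange(roi.shape[1]), out_w)
    out = np.empty((out_h, out_w), dtype=roi.dtype)
    for i, r in enumerate(rows):
        for j, c in enumerate(cols):
            out[i, j] = roi[r[0]:r[-1] + 1, c[0]:c[-1] + 1].max()
    return out

# Example (assumed variable name): a 60*60 left-eye channel pooled to a 3*3 feature map.
# feature_map_1 = roi_max_pool(left_eye_block[:, :, 0], 3, 3)
```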
  • CNN refers to convolutional neural network, which is a type of neural network.
  • CNN can include convolutional layers, pooling layers and fully connected layers.
  • each convolutional layer in the convolutional neural network is composed of several convolutional units.
  • the parameters of each convolutional unit are optimized by the back propagation algorithm.
  • the purpose of the convolution operation is to extract different features of the input.
  • the first convolutional layer may only extract some low-level features such as edges, lines and corners. More layers of the network can iteratively extract more complex features from low-level features.
  • the essence of pooling is downsampling.
  • the main function of the pooling layer is to reduce the amount of calculation by reducing the parameters of the network, and it can control overfitting to a certain extent.
  • the operations performed by the pooling layer generally include maximum pooling, mean pooling, etc.
  • CNN-1, CNN-2 and CNN-3 can include several convolutional layers and several pooling layers respectively. It can be understood that CNN-1, CNN-2 and CNN-3 can also include several activation layers.
  • the activation layer is also called the neuron layer; its key element is the choice of the activation function.
  • the activation function can include ReLU, PReLU and Sigmoid, etc.
  • in the activation layer, the electronic device can perform an activation operation on the input data, which can also be understood as a functional transformation.
  • CNN-1 may include 4 convolutional layers and 4 activation layers.
  • the 4 convolutional layers refer to: convolutional layer-1, convolutional layer-2, convolutional layer-3, and convolutional layer-4.
  • the 4 activation layers refer to: activation layer-1, activation layer-2, activation layer-3, and activation layer-4. It can be understood that the size of the convolution kernel (i.e., filter) of the 4 convolutional layers can be 3*3.
  • CNN-3 may include 4 convolutional layers, 4 activation layers, and 4 pooling layers.
  • the 4 convolutional layers refer to: convolutional layer-1, convolutional layer-2, convolutional layer-3, and convolutional layer-4.
  • the 4 activation layers refer to: activation layer-1, activation layer-2, activation layer-3, and activation layer-4.
  • the 4 pooling layers refer to: pooling layer-1, pooling layer-2, pooling layer-3, and pooling layer-4.
  • the size of the convolution kernel (i.e., filter) of the 4 convolutional layers can be 3*3.
  • the step size of the 4 pooling layers can be 2 (for example, the maximum pooling process is performed for every 2*2 "cells").
  • the feature map can also be padded with zeros in the convolutional layer.
  • for the zero-padding operation, please refer to the relevant technical documents; it will not be explained in detail here.
  • the structures of CNN-2 and CNN-1 may be the same. In some other embodiments of the present application, the structures of CNN-2, CNN-3 and CNN-1 may be the same.
  • CNN-1, CNN-2 and CNN-3 may also be other contents, not limited to the above examples, and this application does not impose any restrictions on this.
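  • as an illustration of the CNN structures described above, the following is a minimal sketch assuming PyTorch; the channel counts are assumptions, while the 3*3 convolution kernels, zero padding, and stride-2 pooling follow the exemplary description.

```python
import torch.nn as nn

# Hedged sketch of CNN-1 (4 convolutional layers + 4 activation layers) and
# CNN-3 (additionally 4 max-pooling layers of stride 2). Channel counts assumed.
def make_cnn1(channels=(3, 16, 32, 64, 64)):
    layers = []
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU()]
    return nn.Sequential(*layers)

def make_cnn3(channels=(3, 16, 32, 64, 64)):
    layers = []
    for c_in, c_out in zip(channels[:-1], channels[1:]):
        layers += [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU(),
                   nn.MaxPool2d(kernel_size=2, stride=2)]
    return nn.Sequential(*layers)
```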
  • the fully connected layer is used to map the extracted features to the sample label space.
  • the fully connected layer is used to integrate the extracted features together and output them as a value.
  • the number of neurons in the fully connected layer-1 is 128, the number of neurons in the fully connected layer-2 and the fully connected layer-3 are both 256, the number of neurons in the fully connected layer-5 and the fully connected layer-6 are both 128, the number of neurons in the fully connected layer-4 is 128, and the number of neurons in the fully connected layer-7 is 2.
  • the number of neurons in the fully connected layer in the gaze point estimation network model can also be other values, not limited to the above examples, and this application does not impose any restrictions on this.
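  • as an illustration of the exemplary neuron counts above, the following is a minimal sketch of the FIG. 8-style fully connected head assuming PyTorch; the input dimensions (feature size, face grid size, pupil-coordinate size) are assumptions.

```python
import torch
import torch.nn as nn

# Hedged sketch of the FIG. 8-style fully connected head with the exemplary
# neuron counts above (FC-1: 128, FC-2/FC-3: 256, FC-5/FC-6: 128, FC-4: 128, FC-7: 2).
class FcHeadSketch(nn.Module):
    def __init__(self, feat_dim=192, grid_dim=625, pupil_dim=4):
        super().__init__()
        self.fc1 = nn.Linear(feat_dim, 128)   # left/right eye + face features
        self.fc2 = nn.Linear(grid_dim, 256)   # face grid (depth information)
        self.fc5 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(pupil_dim, 256)  # pupil coordinates
        self.fc6 = nn.Linear(256, 128)
        self.fc4 = nn.Linear(128 + 128 + 128, 128)
        self.fc7 = nn.Linear(128, 2)          # gaze point coordinates

    def forward(self, feats, grid, pupils):
        a = torch.relu(self.fc1(feats))
        b = torch.relu(self.fc5(torch.relu(self.fc2(grid))))
        c = torch.relu(self.fc6(torch.relu(self.fc3(pupils))))
        return self.fc7(torch.relu(self.fc4(torch.cat([a, b, c], dim=1))))
```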
  • the electronic device can obtain eye position information and pupil coordinates during the face detection process, so the electronic device does not need to execute step S304.
  • the electronic device does not need to determine whether the size of the face area in the image I1 meets the preset size requirement. In other words, the electronic device does not need to perform step S305.
  • FIG. 14 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present application.
  • the electronic device may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, and a Subscriber Identification Module (SIM) card interface 195, etc.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
  • the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the electronic device.
  • the electronic device may include more or fewer components than shown in the figure, or combine certain components, or split certain components, or arrange the components differently.
  • the components shown in the figure may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units, for example, the processor 110 may include an application processor (AP), a modem processor, a graphics processor (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • Different processing units may be independent devices or integrated in one or more processors.
  • the controller can be the nerve center and command center of the electronic device.
  • the controller can generate operation control signals according to the instruction operation code and timing signal to complete the control of fetching and executing instructions.
  • the electronic device may execute the gaze point estimation method through the processor 110.
  • the processor 110 may also be provided with a memory for storing instructions and data.
  • the memory in the processor 110 is a cache memory.
  • the memory may store instructions or data that the processor 110 has just used or cyclically used. If the processor 110 needs to use the instruction or data again, it may be directly called from the memory. This avoids repeated access, reduces the waiting time of the processor 110, and thus improves the efficiency of the system.
  • the processor 110 may include one or more interfaces.
  • the USB interface 130 is an interface that complies with the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, etc.
  • the interface included in the processor 110 may also be used to connect other electronic devices, such as AR devices, etc.
  • the charging management module 140 is used to receive charging input from a charger. While the charging management module 140 is charging the battery 142 , it can also power the electronic device through the power management module 141 .
  • the wireless communication function of the electronic device can be implemented through antenna 1, antenna 2, mobile communication module 150, wireless communication module 160, modem processor and baseband processor.
  • Antenna 1 and antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in the electronic device can be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve the utilization of the antennas.
  • the mobile communication module 150 can provide solutions for wireless communications including 2G/3G/4G/5G, etc., applied in electronic devices.
  • the wireless communication module 160 can provide wireless communication solutions for application in electronic devices, including Wireless Local Area Networks (WLAN) (such as Wireless Fidelity (Wi-Fi) networks), Bluetooth (BT), Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), infrared technology (IR), etc.
  • antenna 1 of the electronic device is coupled to mobile communication module 150, and antenna 2 is coupled to wireless communication module 160, so that the electronic device can communicate with the network and other devices through wireless communication technology.
  • the electronic device implements the display function through a GPU, a display screen 194, and an application processor.
  • the GPU is a microprocessor for image processing, which connects the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • the processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
  • the display screen 194 is used to display images, videos, etc.
  • the display screen 194 includes a display panel.
  • the display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), etc.
  • the electronic device may include 1 or N display screens 194, where N is a positive integer greater than 1.
  • the electronic device can realize the acquisition function through ISP, camera 193, video codec, GPU, display screen 194 and application processor.
  • ISP is used to process the data fed back by camera 193. For example, when taking a photo, the shutter is opened, and the light is transmitted to the camera photosensitive element through the lens. The light signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to ISP for processing and converts it into an image or video visible to the naked eye. ISP can also perform algorithm optimization on the noise, brightness, and color of the image. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene. In some embodiments, ISP can be set in camera 193.
  • the camera 193 is used to capture still images or videos.
  • the object generates an optical image through the lens and projects it onto the photosensitive element.
  • the photosensitive element can be a charge coupled device (CCD) or a complementary metal oxide semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image or video signal.
  • the ISP outputs the digital image or video signal to the DSP for processing.
  • the DSP converts the digital image or video signal into an image or video signal in a standard RGB, YUV or other format.
  • Digital signal processors are used to process digital signals. In addition to processing digital images or video signals, they can also process other digital signals. For example, when an electronic device selects a frequency point, a digital signal processor is used to perform Fourier transform on the frequency point energy.
  • Video codecs are used to compress or decompress digital videos.
  • Electronic devices can support one or more video codecs. In this way, electronic devices can play or record videos in multiple encoding formats, such as Moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, etc.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function. For example, files such as music and videos can be saved in the external memory card.
  • the internal memory 121 can be used to store computer executable program codes, which include instructions.
  • the processor 110 executes various functional applications and data processing of the electronic device by running the instructions stored in the internal memory 121.
  • the internal memory 121 may include a program storage area and a data storage area.
  • the program storage area may store an operating system, an application required for at least one function (such as a sound playback function, an image and video playback function, etc.).
  • the data storage area may store data created during the use of the electronic device (such as audio data, a phone book, etc.).
  • the electronic device can implement audio functions such as music playing and recording through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone jack 170D, and the application processor.
  • the sensor module 180 may include one or more sensors, which may be of the same type or different types. It is understood that the sensor module 180 shown in FIG. 14 is only an exemplary division method, and there may be other division methods, which are not limited in the present application.
  • the pressure sensor 180A is used to sense the pressure signal and can convert the pressure signal into an electrical signal.
  • the pressure sensor 180A can be set on the display screen 194.
  • the electronic device detects the touch operation intensity according to the pressure sensor 180A.
  • the electronic device can also calculate the touch position according to the detection signal of the pressure sensor 180A.
  • touch operations acting on the same touch position but with different touch operation intensities can correspond to different operation instructions.
  • the gyro sensor 180B can be used to determine the motion posture of the electronic device. In some embodiments, the angular velocity of the electronic device around three axes (i.e., x, y, and z axes) can be determined by the gyro sensor 180B. The gyro sensor 180B can be used for anti-shake shooting.
  • the acceleration sensor 180E can detect the magnitude of the acceleration of the electronic device in all directions (generally three axes). When the electronic device is stationary, it can detect the magnitude and direction of gravity. It can also be used to identify the posture of the electronic device and is applied to applications such as horizontal and vertical screen switching and pedometers.
  • the distance sensor 180F is used to measure the distance.
  • the electronic device can measure the distance by infrared or laser. In some embodiments, when shooting a scene, the electronic device can use the distance sensor 180F to measure the distance to achieve fast focusing.
  • the touch sensor 180K is also called a "touch panel".
  • the touch sensor 180K can be set on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, also called a "touch screen".
  • the touch sensor 180K is used to detect touch operations acting on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • Visual output related to the touch operation can be provided through the display screen 194.
  • the touch sensor 180K may also be disposed on the surface of the electronic device, at a location different from that of the display screen 194 .
  • the air pressure sensor 180C is used to measure air pressure.
  • the magnetic sensor 180D includes a Hall sensor.
  • the proximity light sensor 180G may include, for example, a light emitting diode (LED) and a light detector, such as a photodiode.
  • the electronic device uses a photodiode to detect infrared reflected light from nearby objects.
  • the ambient light sensor 180L is used to sense the brightness of ambient light.
  • the fingerprint sensor 180H is used to obtain fingerprints.
  • the temperature sensor 180J is used to detect temperature.
  • the bone conduction sensor 180M can obtain vibration signals.
  • the key 190 includes a power button, a volume button, etc.
  • the key 190 can be a mechanical key. It can also be a touch key.
  • the electronic device can receive key input and generate key signal input related to the user settings and function control of the electronic device.
  • the motor 191 can generate a vibration prompt.
  • the motor 191 can be used for incoming call vibration prompts, and can also be used for touch vibration feedback.
  • the indicator 192 can be an indicator light, which can be used to indicate the charging status, power changes, messages, missed calls, notifications, etc.
  • the SIM card interface 195 is used to connect a SIM card.
  • FIG. 15 is a schematic diagram of a software structure of an electronic device provided in an embodiment of the present application.
  • the software framework of the electronic device involved in the present application may include an application layer, an application framework layer (framework, FWK), a system library, an Android runtime, a hardware abstraction layer and a kernel layer (kernel).
  • the application layer may include a series of application packages, such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message and other applications (also referred to as applications).
  • the camera is used to obtain images and videos.
  • for other applications in the application layer, please refer to the introduction and description in the conventional technology; they will not be elaborated in this application.
  • the application on the electronic device can be a native application (such as an application installed in the electronic device when the operating system is installed before the electronic device leaves the factory), or it can be a third-party application (such as an application downloaded and installed by the user through the application store), which is not limited in the embodiments of this application.
  • the application framework layer provides an application programming interface (API) and a programming framework for the applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer may include a window manager, a content provider, a view system, a telephony manager, a resource manager, a notification manager, and the like.
  • the window manager is used to manage window programs.
  • the window manager can obtain the display screen size, determine whether there is a status bar, lock the screen, capture the screen, etc.
  • Content providers are used to store and retrieve data and make it accessible to applications.
  • the data may include videos, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying images, etc.
  • the view system can be used to build applications.
  • a display interface can be composed of one or more views.
  • a display interface including a text notification icon can include a view for displaying text and a view for displaying images.
  • the phone manager is used to provide communication functions for electronic devices, such as management of call status (including answering, hanging up, etc.).
  • the resource manager provides various resources for applications, such as localized strings, icons, images, layout files, video files, and so on.
  • the notification manager enables applications to display notification information in the status bar. It can be used to convey notification-type messages, which can disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify download completion, message reminders, etc.
  • the notification manager can also present notifications that appear in the system top status bar in the form of a chart or scroll-bar text, such as notifications from applications running in the background, or notifications that appear on the screen in the form of a dialog interface. For example, a text message is displayed in the status bar, a reminder sound is played, the electronic device vibrates, or an indicator light flashes.
  • the runtime includes the core library and the virtual machine.
  • the runtime is responsible for the scheduling and management of the system.
  • the core library consists of two parts: one part is the function that the programming language (for example, Java language) needs to call, and the other part is the core library of the system.
  • the application layer and the application framework layer run in a virtual machine.
  • the virtual machine executes the programming files (for example, Java files) of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
  • the system library can include multiple functional modules, such as surface manager, media library, 3D graphics processing library (such as OpenGL ES), 2D graphics engine (such as SGL), etc.
  • the surface manager is used to manage the display subsystem and provides the fusion of two-dimensional (2D) and three-dimensional (3D) layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as static image files, etc.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing.
  • a 2D graphics engine is a drawing engine for 2D drawings.
  • the Hardware Abstraction Layer is an interface layer between the operating system kernel and the upper software, and its purpose is to abstract the hardware.
  • the hardware abstraction layer is an abstraction of the device kernel drivers and is used to implement the application programming interfaces that provide the higher-level Java API framework with access to the underlying devices.
  • the HAL contains multiple library modules, such as the camera HAL, display, Bluetooth, audio, etc. Each of these library modules implements an interface for a specific type of hardware component (see the second sketch after this list).
  • the Android operating system will load the library module for the hardware component.
  • the kernel layer is the foundation of the Android operating system. The final functions of the Android operating system are completed through the kernel layer.
  • the kernel layer at least includes display driver, camera driver, audio driver, sensor driver, and virtual card driver.
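
The mapping from touch intensity to operation instruction mentioned in the pressure sensor 180A items above can be illustrated with a short sketch. This is a purely conceptual example, not code from this application: the threshold value and the two actions are invented for illustration.

```python
# Hypothetical mapping: touches at the same position but with different pressures
# are dispatched to different operation instructions.
LIGHT_PRESS_THRESHOLD = 0.35  # invented normalized pressure threshold

def dispatch_touch(position, pressure):
    """Return the operation instruction for a touch at `position` with `pressure`."""
    if pressure < LIGHT_PRESS_THRESHOLD:
        return f"preview item at {position}"   # light press -> preview
    return f"open item at {position}"          # firm press  -> open

print(dispatch_touch((120, 640), 0.20))  # preview item at (120, 640)
print(dispatch_touch((120, 640), 0.80))  # open item at (120, 640)
```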
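
The hardware abstraction layer items above note that each HAL library module implements an interface for a specific type of hardware component and is loaded for that component. The second sketch below illustrates only that idea; real HAL modules are native shared libraries, and every name here is hypothetical.

```python
# Conceptual illustration of "one interface per hardware type, one module per component".
from abc import ABC, abstractmethod

class CameraHal(ABC):
    """Interface every camera HAL module is expected to implement (illustrative)."""
    @abstractmethod
    def open(self) -> None: ...
    @abstractmethod
    def capture_frame(self) -> bytes: ...

class FrontCameraModule(CameraHal):
    def open(self) -> None:
        print("front camera opened")
    def capture_frame(self) -> bytes:
        return b"\x00" * 16  # stand-in for raw frame data

HAL_REGISTRY = {"camera.front": FrontCameraModule}

def load_hal_module(component: str) -> CameraHal:
    """Instantiate the library module registered for a hardware component."""
    return HAL_REGISTRY[component]()

cam = load_hal_module("camera.front")
cam.open()
print(len(cam.capture_frame()), "bytes captured")
```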

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Ophthalmology & Optometry (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a gaze point estimation method and a related device. According to the method, an electronic device can capture an image through a camera, ensures, by means of adaptive zooming, that the input is a simple sample with an adaptive distance scale, and obtains face position information and eye position information in the captured image when a face detection result satisfies a preset face condition. Based on a region-of-interest (ROI) pooling module in a gaze point estimation network model, the electronic device can process an ROI of a target image block using the corresponding preset feature map size to obtain a feature map. The target image block is obtained by cropping the captured image. The target image block comprises a face image block and/or a left-eye image block and/or a right-eye image block. Different types of image blocks correspond to respective preset feature map sizes. The method can unify the feature map size by means of the ROI pooling module, which makes it possible to avoid deformation of the target image block after scaling and to improve the accuracy of gaze point estimation.
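
The ROI pooling step described in the abstract can be sketched as follows. This is a minimal sketch, assuming a PyTorch backbone and torchvision's roi_align as a stand-in for the ROI pooling module; the box coordinates and the preset output sizes (7x7 for the face block, 5x5 for each eye block) are illustrative assumptions, not values from this application.

```python
import torch
from torchvision.ops import roi_align

features = torch.randn(1, 64, 56, 56)  # backbone feature map for one captured frame

# Boxes are [batch_index, x1, y1, x2, y2] in feature-map coordinates (hypothetical values).
face_box      = torch.tensor([[0.,  4.,  4., 48., 52.]])
left_eye_box  = torch.tensor([[0., 10., 14., 24., 22.]])
right_eye_box = torch.tensor([[0., 28., 14., 42., 22.]])

# Each image-block type is pooled to its own preset feature-map size, so no block
# has to be rescaled (and possibly deformed) before the gaze regression head.
face_feat  = roi_align(features, face_box,      output_size=(7, 7))
l_eye_feat = roi_align(features, left_eye_box,  output_size=(5, 5))
r_eye_feat = roi_align(features, right_eye_box, output_size=(5, 5))

print(face_feat.shape, l_eye_feat.shape, r_eye_feat.shape)
# torch.Size([1, 64, 7, 7]) torch.Size([1, 64, 5, 5]) torch.Size([1, 64, 5, 5])
```

Unifying each block type to a fixed feature-map size in this way is what allows differently sized face and eye crops to feed the same regression head without being stretched beforehand.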
PCT/CN2023/092415 2022-07-29 2023-05-06 Procédé d'estimation de point de fixation et dispositif associé WO2024021742A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210910894.2A CN116048244B (zh) 2022-07-29 2022-07-29 一种注视点估计方法及相关设备
CN202210910894.2 2022-07-29

Publications (2)

Publication Number Publication Date
WO2024021742A1 WO2024021742A1 (fr) 2024-02-01
WO2024021742A9 true WO2024021742A9 (fr) 2024-05-16

Family

ID=86127878

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/092415 WO2024021742A1 (fr) 2022-07-29 2023-05-06 Procédé d'estimation de point de fixation et dispositif associé

Country Status (2)

Country Link
CN (1) CN116048244B (fr)
WO (1) WO2024021742A1 (fr)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116048244B (zh) * 2022-07-29 2023-10-20 荣耀终端有限公司 一种注视点估计方法及相关设备
CN117576298B (zh) * 2023-10-09 2024-05-24 中微智创(北京)软件技术有限公司 一种基于上下文分离3d透镜的战场态势目标突出显示方法
CN117472256B (zh) * 2023-12-26 2024-08-23 荣耀终端有限公司 图像处理方法及电子设备

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6377566B2 (ja) * 2015-04-21 2018-08-22 日本電信電話株式会社 視線計測装置、視線計測方法、およびプログラム
CN109492514A (zh) * 2018-08-28 2019-03-19 初速度(苏州)科技有限公司 一种单相机采集人眼视线方向的方法及系统
CN111723596B (zh) * 2019-03-18 2024-03-22 北京市商汤科技开发有限公司 注视区域检测及神经网络的训练方法、装置和设备
CN112000226B (zh) * 2020-08-26 2023-02-03 杭州海康威视数字技术股份有限公司 一种人眼视线估计方法、装置及视线估计系统
CN112329699A (zh) * 2020-11-19 2021-02-05 北京中科虹星科技有限公司 一种像素级精度的人眼注视点定位方法
US11947717B2 (en) * 2021-01-22 2024-04-02 Blink Technologies Inc. Gaze estimation systems and methods using relative points of regard
CN113642393B (zh) * 2021-07-07 2024-03-22 重庆邮电大学 基于注意力机制的多特征融合视线估计方法
CN116048244B (zh) * 2022-07-29 2023-10-20 荣耀终端有限公司 一种注视点估计方法及相关设备

Also Published As

Publication number Publication date
CN116048244A (zh) 2023-05-02
CN116048244B (zh) 2023-10-20
WO2024021742A1 (fr) 2024-02-01

Similar Documents

Publication Publication Date Title
CN111738122B (zh) 图像处理的方法及相关装置
WO2024021742A9 (fr) Procédé d'estimation de point de fixation et dispositif associé
WO2021078001A1 (fr) Procédé et appareil d'amélioration d'image
WO2021027585A1 (fr) Procédé de traitement d'images de visages humains et dispositif électronique
CN111782879B (zh) 模型训练方法及装置
CN111882642B (zh) 三维模型的纹理填充方法及装置
WO2021180046A1 (fr) Procédé et dispositif de conservation des couleurs des images
EP4325877A1 (fr) Procédé pour photographier et dispositif associé
CN111400605A (zh) 基于眼球追踪的推荐方法及装置
US20230162529A1 (en) Eye bag detection method and apparatus
CN111768352A (zh) 图像处理方法及装置
CN113538227B (zh) 一种基于语义分割的图像处理方法及相关设备
CN111612723B (zh) 图像修复方法及装置
CN115661912A (zh) 图像处理方法、模型训练方法、电子设备及可读存储介质
WO2022143314A1 (fr) Procédé et appareil d'enregistrement d'objet
CN115633255B (zh) 视频处理方法和电子设备
CN112528760B (zh) 图像处理方法、装置、计算机设备及介质
CN115150542B (zh) 一种视频防抖方法及相关设备
CN114697530B (zh) 一种智能取景推荐的拍照方法及装置
CN113642359B (zh) 人脸图像生成方法、装置、电子设备及存储介质
CN115587938A (zh) 视频畸变校正方法及相关设备
CN114399622A (zh) 图像处理方法和相关装置
CN114970576A (zh) 识别码的识别方法、相关电子设备及计算机可读存储介质
CN114693538A (zh) 一种图像处理方法及装置
CN117710697B (zh) 对象检测方法、电子设备、存储介质及程序产品

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23844957

Country of ref document: EP

Kind code of ref document: A1