WO2024021742A1 - Procédé d'estimation de point de fixation et dispositif associé - Google Patents

Procédé d'estimation de point de fixation et dispositif associé Download PDF

Info

Publication number
WO2024021742A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
face
image block
eye
size
Prior art date
Application number
PCT/CN2023/092415
Other languages
English (en)
Chinese (zh)
Other versions
WO2024021742A9 (fr)
Inventor
孙贻宝
Original Assignee
荣耀终端有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 荣耀终端有限公司 filed Critical 荣耀终端有限公司
Publication of WO2024021742A1 publication Critical patent/WO2024021742A1/fr
Publication of WO2024021742A9 publication Critical patent/WO2024021742A9/fr

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G06F3/013Eye tracking input arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/10Image acquisition
    • G06V10/12Details of acquisition arrangements; Constructional details thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/25Determination of region of interest [ROI] or a volume of interest [VOI]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/20Image preprocessing
    • G06V10/26Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/44Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; Connectivity analysis, e.g. of connected components
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/161Detection; Localisation; Normalisation
    • G06V40/166Detection; Localisation; Normalisation using acquisition arrangements
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • G06V40/171Local features and components; Facial parts ; Occluding parts, e.g. glasses; Geometrical relationships
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18Eye characteristics, e.g. of the iris
    • G06V40/19Sensors therefor
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18Eye characteristics, e.g. of the iris
    • G06V40/193Preprocessing; Feature extraction
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the fields of deep learning and big data processing, and in particular to a fixation point estimation method and related equipment.
  • Gaze point estimation generally refers to inputting an image, calculating the gaze direction through eye/head features and mapping it to the gaze point. Gaze point estimation is mainly used in human-computer interaction and visual display of smartphones, tablets, smart screens, and AR/VR glasses.
  • gaze point estimation methods can be divided into two categories: Geometry Based Methods and Appearance Based Methods.
  • the basic idea of estimating gaze point coordinates through a geometry-based method is to restore the three-dimensional line of sight direction through some two-dimensional information (such as eye features such as eye corners).
  • the basic idea of estimating gaze point coordinates through appearance-based methods is to learn a model that maps the input image to the gaze point. Both types of methods have their own advantages and disadvantages.
  • the geometry-based method is relatively more accurate, but it has high requirements for image quality and resolution and requires support from additional hardware (for example, infrared sensors and multiple cameras), which makes it costly; the appearance-based method is relatively less accurate.
  • appearance-based methods require training on a large amount of data; because the distance between the camera and the subject is not fixed, the depth information of the input image also varies. For example, the sizes of facial images obtained from different input images may differ greatly and fail to meet the model's input requirements. Scaling the input image can meet those requirements, but it risks feature deformation, which reduces the accuracy of gaze point estimation.
  • This application provides a gaze point estimation method and related equipment.
  • the electronic device can collect images through the camera, and when the face detection results meet the preset face conditions, obtain the face position information and eye position information in the collected images. Based on the facial position information and eye position information, the electronic device can determine the gaze point coordinates of the target object through the gaze point estimation network model. It can be understood that the subject in the image collected by the electronic device through the camera is the target object. It can be understood that the photographing subject mentioned in this application refers to the main photographing subject when the user uses the electronic device to photograph.
  • the electronic device can process the ROI of the target image block with the corresponding preset feature map size based on the region of interest pooling module to obtain the feature map.
  • the target image block is obtained by cropping the collected image.
  • the target image block includes at least one type of image block among a face image block, a left eye image block, and a right eye image block. Different types of image blocks each correspond to a preset feature map size.
  • the above method can unify the size of the feature maps through the region of interest pooling module, avoid deformation of the target image block caused by scaling, and improve the accuracy of gaze point estimation.
  • this application provides a gaze point estimation method.
  • This method can be applied to electronic devices equipped with cameras.
  • the method may include: the electronic device can collect the first image through the camera; and when the face detection result meets the preset face conditions, the electronic device can obtain the face position information and eye position information in the first image.
  • the electronic device can, based on the region of interest pooling module of the gaze point estimation network model, process the region of interest ROI of the target image block with the corresponding preset feature map size to obtain a feature map.
  • the target image block includes at least one type of image block among a face image block, a left eye image block, and a right eye image block.
  • the face position information includes the coordinates of relevant feature points in the face area
  • the eye position information includes the coordinates of relevant feature points in the eye area.
  • the face image block is an image block obtained by cropping the face area in the first image based on the face position information.
  • the left eye image block is an image block obtained by cropping the left eye area in the first image based on the eye position information.
  • the right eye image block is an image block obtained by cropping the right eye area in the first image based on the eye position information.
  • the electronic device can determine the gaze point coordinates of the target object based on the gaze point estimation network model.
  • the electronic device can process the region of interest ROI of the target image block with the corresponding preset feature map size based on the region of interest pooling module to obtain the feature map.
  • the target image block includes at least one type of image block among a face image block, a left eye image block, and a right eye image block. Different types of image blocks in the target image block each correspond to a preset feature map size.
  • the size of the feature maps corresponding to the same type of image blocks is the same, while the size of the feature maps corresponding to different types of image blocks can be the same or different.
  • This method can unify the size of feature maps corresponding to the same type of image blocks through the region of interest pooling module to prepare for subsequent feature extraction. It can also avoid the feature deformation caused by adjusting the size of the feature map through scaling, thereby improving the accuracy of gaze point estimation. It is understandable that feature deformation may cause inaccurate feature extraction, thereby affecting the accuracy of gaze point estimation.
  • the electronic device may collect the first image through a front-facing camera. It can be understood that the electronic device can acquire the first image in real time. For details, please refer to the relevant description in step S301 below, which will not be described here.
  • the first image may be image I1.
  • the relevant feature points of the facial area may include edge contour feature points of the human face.
  • the relevant feature points of the eye area may include eye corner feature points, and may also include edge contour feature points of the eye area.
  • the relevant descriptions of the relevant feature points in the face area and the relevant feature points in the eye area can be referred to later, and will not be explained here.
  • the electronic device can obtain facial location information during face detection. Specifically, during the face detection process, the electronic device can perform feature point detection and determine the feature points related to the face, thereby obtaining facial position information.
  • the electronic device can complete the detection of the eyes during the face detection process, thereby obtaining the eye position information.
  • the eye-related feature points may include pupil coordinates.
  • the electronic device can perform eye detection to obtain eye position information.
  • eye detection can be found later and will not be explained here.
  • the region of interest pooling module may include several region of interest pooling layers.
  • the region of interest pooling module may include region of interest pooling layer-1, and may also include region of interest pooling layer-2.
  • the gaze point estimation network model can unify feature maps for image blocks of the same type in the target image block and perform feature extraction on them.
  • this application can also provide a gaze point estimation network model.
  • the input of this gaze point estimation network model may not include the face mesh and the pupil coordinates, and the model may not include fully connected layer-2 and fully connected layer-3.
  • this application can also provide a gaze point estimation network model.
  • the input of this gaze point estimation network model may not include the face grid and the pupil coordinates, and the model may not include fully connected layer-2, fully connected layer-5, fully connected layer-3 and fully connected layer-6.
  • the preset feature map size corresponding to the face image block is the first preset feature map size
  • the preset feature map size corresponding to the left eye image block is the second preset feature map size
  • the preset feature map size corresponding to the right eye image block is the third preset feature map size.
  • the region of interest of the target image block is the entire target image block.
  • the ROI of the face image block is the entire face image block
  • the ROI of the left eye image block is the entire left eye image block
  • the ROI of the right eye image block is the entire right eye image block.
  • the method may also include: the electronic device may crop the face area in the first image based on the face position information to obtain the face image block.
  • the method may further include: the electronic device may crop the left eye area in the first image based on the eye position information to obtain the left eye image block.
  • the method may further include: the electronic device may crop the right eye area in the first image based on the eye position information to obtain the right eye image block.
  • the electronic device processes the region of interest ROI of the target image block with a corresponding preset feature map size to obtain a feature map.
  • Specifically, the electronic device may divide the ROI of the target image block based on the corresponding preset feature map size to obtain several block areas, and then perform maximum pooling processing on each block area in the ROI of the target image block to obtain the feature map.
  • the number of block areas in each row in the ROI of the target image block is the same as the width value in the corresponding preset feature map size, and the number of block areas in each column in the ROI of the target image block is the same as the height value in the corresponding preset feature map size.
  • the electronic device can divide the ROI of the target image block based on the width value and height value in the corresponding preset feature map size to obtain several block areas, and perform maximum pooling processing on each block area to obtain the feature map of the target image block. Since the number of block regions is consistent with the dimensions of the feature map output by the region of interest pooling layer, this method can unify the feature maps corresponding to image blocks of different sizes, thereby avoiding feature deformation caused by scaling, improving the accuracy of feature extraction, and thereby improving the accuracy of gaze point estimation.
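  • As an illustration of the division-and-max-pooling step described above (not the patent's implementation), the following minimal Python/NumPy sketch divides an ROI of arbitrary size into a preset number of block areas and max-pools each block, so that the number of blocks per row equals the width value of the preset feature map size and the number per column equals its height value; the 7x7 size used in the example is an arbitrary placeholder.

```python
import numpy as np

def roi_max_pool(roi: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Divide a 2-D ROI into out_h x out_w block areas and max-pool each block.

    ROIs of different sizes all map to a feature map of shape (out_h, out_w),
    which is what lets differently sized image blocks share one feature map size.
    """
    h, w = roi.shape
    # Block boundaries: split the ROI as evenly as possible along each axis.
    row_edges = np.linspace(0, h, out_h + 1).astype(int)
    col_edges = np.linspace(0, w, out_w + 1).astype(int)
    feature_map = np.empty((out_h, out_w), dtype=roi.dtype)
    for i in range(out_h):
        for j in range(out_w):
            block = roi[row_edges[i]:max(row_edges[i + 1], row_edges[i] + 1),
                        col_edges[j]:max(col_edges[j + 1], col_edges[j] + 1)]
            feature_map[i, j] = block.max()  # maximum pooling of this block area
    return feature_map

# A 45x60 face ROI and a 20x32 eye ROI both become 7x7 feature maps (no scaling).
print(roi_max_pool(np.random.rand(45, 60), 7, 7).shape)  # (7, 7)
print(roi_max_pool(np.random.rand(20, 32), 7, 7).shape)  # (7, 7)
```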
  • the ROI of the face image block in the target image block can be the face area in the face image block.
  • the ROI of the left eye image block in the target image block may be the left eye area in the left eye image block.
  • the ROI of the right eye image block in the target image block may be the right eye area in the right eye image block.
  • the electronic device divides the ROI of the target image block based on the corresponding preset feature map size to obtain several block areas. Specifically, the electronic device can determine the ROI of the face image block and divide it based on the first preset feature map size to obtain several face block areas; the electronic device can also determine the ROI of the left eye image block and divide it based on the second preset feature map size to obtain several left eye block areas; the electronic device can also determine the ROI of the right eye image block and divide it based on the third preset feature map size to obtain several right eye block areas.
  • the electronic device performs maximum pooling processing on each block area in the ROI of the target image block to obtain the feature map.
  • the electronic device can perform maximum pooling processing on each face block area in the ROI of the face image block to obtain the first feature map.
  • the maximum pooling process can be performed on each left-eye block area in the ROI of the left-eye image block to obtain the second feature map
  • the maximum pooling process can be performed on each right-eye block area in the ROI of the right-eye image block to obtain the third feature map.
  • the first feature map is a feature map corresponding to the ROI of the face image block
  • the second feature map is a feature map corresponding to the ROI of the left eye image block
  • the third feature map is a feature map corresponding to the ROI of the right eye image block.
  • the number of block areas in each row in the ROI of the target image block is the same as the width value in the corresponding preset feature map size, and the number of block areas in each column in the ROI of the target image block is the same as the height value in the corresponding preset feature map size. Specifically: the number of face block areas in each row in the ROI of the face image block is the same as the width value in the first preset feature map size, and the number of face block areas in each column in the ROI of the face image block is the same as the height value in the first preset feature map size; the number of left eye block areas in each row in the ROI of the left eye image block is the same as the width value in the second preset feature map size, and the number of left eye block areas in each column in the ROI of the left eye image block is the same as the height value in the second preset feature map size; the number of right eye block areas in each row in the ROI of the right eye image block is the same as the width value in the third preset feature map size, and the number of right eye block areas in each column in the ROI of the right eye image block is the same as the height value in the third preset feature map size.
  • the target image block may include a face image block, a left eye image block, and a right eye image block.
  • the electronic device can unify the sizes of the feature maps corresponding to the face image block, the left eye image block and the right eye image block based on the gaze point estimation network model, and extract features from the feature maps corresponding to the face image block, the left eye image block and the right eye image block respectively. It is understandable that this method can unify the feature maps corresponding to the image blocks, thereby avoiding feature deformation caused by scaling, improving the accuracy of feature extraction, and thereby improving the accuracy of gaze point estimation.
  • the second preset feature map size and the third preset feature map size may be the same.
  • the first preset feature map size and the second preset feature map size may be the same.
  • the first preset feature map size and the third preset feature map size may be the same.
  • the processing of the target image block by the electronic device based on the region of interest pooling module can be referred to the above, and will not be repeated here.
  • the shooting subject of the first image is the target object.
  • the method may also include: when the face detection result meets the preset face conditions, the electronic device may obtain the pupil coordinates in the first image; the electronic device may determine the position and size of the face area in the first image based on the face position information, and obtain the face grid corresponding to the first image.
  • the face mesh is used to represent the distance between the target object and the camera.
  • the method may also include: the electronic device may perform convolution processing on the feature map based on the convolution module of the gaze point estimation network model to extract eye features and/or facial features; the electronic device may also integrate the eye features and/or facial features, the face grid and the pupil coordinates based on the fusion module of the gaze point estimation network model to obtain the gaze point coordinates of the target object.
  • the electronic device can perform gaze point estimation based on more types of features (for example, facial features, eye features, depth information, pupil positions, etc.), that is, perform gaze point estimation based on more comprehensive feature information, which can improve the accuracy of gaze point estimation.
  • the face grid can represent the position and size of the face in the image, and can reflect the depth information of the target object in the image, that is, the distance between the target object and the camera that collects the image.
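  • The face grid described above is commonly encoded as a binary occupancy grid over the full image; the sketch below is one hypothetical way to build it (the 25x25 grid resolution and the example face box are assumptions, not values from the patent).

```python
import numpy as np

def make_face_grid(img_w: int, img_h: int, face_box, grid_size: int = 25) -> np.ndarray:
    """Mark which cells of a grid over the full image are covered by the face box.

    face_box is (x, y, w, h) in pixels. A face covering many cells is close to the
    camera; a face covering few cells is far away, so the grid reflects the
    distance between the target object and the camera.
    """
    x, y, w, h = face_box
    grid = np.zeros((grid_size, grid_size), dtype=np.float32)
    # Map the face box from pixel coordinates to grid-cell coordinates.
    x0 = max(int(np.floor(x / img_w * grid_size)), 0)
    y0 = max(int(np.floor(y / img_h * grid_size)), 0)
    x1 = min(int(np.ceil((x + w) / img_w * grid_size)), grid_size)
    y1 = min(int(np.ceil((y + h) / img_h * grid_size)), grid_size)
    grid[y0:y1, x0:x1] = 1.0
    return grid

grid = make_face_grid(1080, 1920, face_box=(300, 500, 420, 560))
print(int(grid.sum()))  # number of grid cells covered by the face area
```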
  • the human face in the first image mentioned in this application is the human face of the target object in the first image.
  • the electronic device can input the face image block, the left eye image block, the right eye image block, the face grid and the pupil coordinates into the gaze point estimation network model, and output the gaze point coordinates.
  • the gaze point estimation network model can include a region of interest pooling module, a convolution module and a fusion module.
  • the region of interest pooling module may be used to process the region of interest ROI of the face image block with the first preset feature map size to obtain the first feature map.
  • the region of interest pooling module can also be used to: process the ROI of the left eye image block with the second preset feature map size to obtain the second feature map, and process the ROI of the right eye image block with the third preset feature map size to obtain the third feature map.
  • the convolution module can be used to perform convolution processing on the first feature map, the second feature map and the third feature map respectively to extract facial features and eye features.
  • the fusion module can be used to: integrate facial features, eye features, facial mesh and pupil coordinates to obtain the gaze point coordinates of the target object.
  • the size of the first feature map is the same as the size of the first preset feature map
  • the size of the second feature map is the same as the size of the second preset feature map
  • the size of the third feature map is the same as the size of the third preset feature map.
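  • The patent does not disclose layer-by-layer dimensions, but the module structure described above (a region of interest pooling module, convolution modules for the face and eye image blocks, and a fusion module of fully connected layers that also takes the face grid and pupil coordinates) can be sketched roughly as follows in PyTorch. All channel counts, preset feature map sizes, layer depths, and the 4-value pupil-coordinate vector are placeholder assumptions; since the ROI here is the whole image block, adaptive max pooling to the preset size plays the role of the region of interest pooling layer.

```python
import torch
import torch.nn as nn

class GazeNetSketch(nn.Module):
    """Rough, illustrative sketch of the described module structure."""

    def __init__(self, face_fm: int = 8, eye_fm: int = 6, grid_size: int = 25):
        super().__init__()
        self.face_fm, self.eye_fm = face_fm, eye_fm  # preset feature map sizes
        # Convolution modules (placeholder depths and channel counts).
        self.face_cnn = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                      nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        self.eye_cnn = nn.Sequential(nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
                                     nn.Conv2d(32, 64, 3, padding=1), nn.ReLU())
        fused = (64 * face_fm * face_fm + 2 * 64 * eye_fm * eye_fm
                 + grid_size * grid_size + 4)
        # Fusion module: fully connected layers ending in (x, y) gaze coordinates.
        self.fusion = nn.Sequential(nn.Linear(fused, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, face, left_eye, right_eye, face_grid, pupils):
        # Region of interest pooling: unify each image block to its preset size.
        f = nn.functional.adaptive_max_pool2d(face, self.face_fm)
        l = nn.functional.adaptive_max_pool2d(left_eye, self.eye_fm)
        r = nn.functional.adaptive_max_pool2d(right_eye, self.eye_fm)
        f, l, r = self.face_cnn(f), self.eye_cnn(l), self.eye_cnn(r)
        feats = torch.cat([f.flatten(1), l.flatten(1), r.flatten(1),
                           face_grid.flatten(1), pupils], dim=1)
        return self.fusion(feats)  # predicted gaze point coordinates (x, y)

model = GazeNetSketch()
out = model(torch.rand(1, 3, 120, 96), torch.rand(1, 3, 40, 64),
            torch.rand(1, 3, 40, 64), torch.rand(1, 25, 25), torch.rand(1, 4))
print(out.shape)  # torch.Size([1, 2])
```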
  • the face detection result satisfies the preset face conditions, specifically including: a face is detected in the first image.
  • the electronic device can obtain face position information and eye position information when a human face is detected in the first image.
  • the face detection result satisfies the preset face conditions, which may specifically include: a face is detected in the first image, and the size of the face area in the first image meets the preset size requirements.
  • the method may also include: when a human face is detected in the first image and the size of the face area in the first image does not meet the preset size requirement, the electronic device may perform adaptive zoom and re-acquire images based on the focal length after the adaptive zoom.
  • the electronic device can perform adaptive zoom and re-acquire images based on the focal length after the adaptive zoom, so that the face size in subsequently captured images meets expectations.
  • the electronic device can capture images containing faces of appropriate size, avoiding both the loss of image detail and difficulty in subsequent feature extraction caused by the face in the captured image being too small, and the loss of image information and difficulty in subsequent feature extraction caused by the face in the image being too large.
  • the features extracted by the electronic device are more accurate, so that the accuracy of gaze point estimation is also improved.
  • the size of the face area in the first image meets the preset size requirement, specifically including: the area of the face area in the first image is within the preset area range.
  • the size of the facial area in the first image meets the preset size requirements, specifically including: the height of the facial area in the first image is within the preset height range, and the width of the facial area in the first image is within the preset width range.
  • the electronic device can, through adaptive zoom, ensure that the inputs are simple samples that are adaptive to the distance scale.
  • the electronic device can capture images at a moderate shooting distance through adaptive zoom.
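  • As an illustration of the size check that gates adaptive zoom, the hypothetical helper below decides whether a detected face area meets preset size requirements (area within a preset area range, and height and width within preset ranges) and suggests a zoom direction otherwise; every threshold here is a made-up value, and actual zoom control would go through the device's camera interface.

```python
def check_face_size(face_w: int, face_h: int,
                    area_range=(120 * 120, 400 * 400),
                    height_range=(120, 400), width_range=(100, 320)):
    """Return (meets_requirement, suggested_action) for a detected face region.

    All thresholds are illustrative; a real device would tune them to its camera
    and to the input expected by the gaze point estimation network model.
    """
    area = face_w * face_h
    area_ok = area_range[0] <= area <= area_range[1]
    hw_ok = (height_range[0] <= face_h <= height_range[1]
             and width_range[0] <= face_w <= width_range[1])
    if area_ok and hw_ok:
        return True, "keep current focal length"
    if area < area_range[0] or face_h < height_range[0] or face_w < width_range[0]:
        return False, "zoom in and re-acquire the image"   # face too small
    return False, "zoom out and re-acquire the image"      # face too large

print(check_face_size(80, 90))    # (False, 'zoom in and re-acquire the image')
print(check_face_size(200, 260))  # (True, 'keep current focal length')
```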
  • the electronic device crops the face area in the first image based on the face position information, which may specifically include: the electronic device may determine the relevant feature points of the face area in the first image; the electronic device may determine the first circumscribed rectangle; the electronic device may also crop the first image based on the position of the first circumscribed rectangle in the first image.
  • the first circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the face area in the first image; the face image block and the first circumscribed rectangle have the same position in the first image, and the face image block and the first circumscribed rectangle have the same size.
  • the electronic device crops the left eye area in the first image based on the eye position information.
  • the electronic device can determine relevant feature points of the left eye area in the first image; the electronic device can determine the second circumscribed rectangle, and crop the first image based on the position of the second circumscribed rectangle in the first image.
  • the second circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the left eye area in the first image
  • the left eye image block and the second circumscribed rectangle have the same position in the first image
  • the left eye image block and the second circumscribed rectangle are the same size.
  • the electronic device crops the right eye area in the first image based on the eye position information.
  • the electronic device can determine relevant feature points of the right eye area in the first image; the electronic device can determine a third circumscribed rectangle, and crop the first image based on the position of the third circumscribed rectangle in the first image.
  • the third circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the right eye area in the first image
  • the right eye image block and the third circumscribed rectangle have the same position in the first image
  • the right eye image block and the third circumscribed rectangle are the same size.
  • the electronic device can obtain the face image block, the left eye image block and the right eye image block based on the circumscribed rectangle of the relevant feature points in the face area, the circumscribed rectangle of the relevant feature points in the left eye area, and the circumscribed rectangle of the relevant feature points in the right eye area.
  • For details, please refer to step S306 in the following text, which will not be described here.
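  • A minimal sketch of the circumscribed-rectangle cropping described above: given the coordinates of the relevant feature points of a region (face, left eye, or right eye), take their axis-aligned circumscribed (bounding) rectangle and crop the image to it. This is only an illustration, and the example eye feature points are hypothetical.

```python
import numpy as np

def crop_by_landmarks(image: np.ndarray, points) -> np.ndarray:
    """Crop `image` to the circumscribed rectangle of the given feature points.

    `points` is an iterable of (x, y) pixel coordinates. The returned image block
    has the same position and size as the circumscribed rectangle.
    """
    pts = np.asarray(points, dtype=float)
    x0, y0 = np.floor(pts.min(axis=0)).astype(int)
    x1, y1 = np.ceil(pts.max(axis=0)).astype(int)
    h, w = image.shape[:2]
    x0, y0 = max(x0, 0), max(y0, 0)          # clamp to the image borders
    x1, y1 = min(x1, w), min(y1, h)
    return image[y0:y1, x0:x1]

# Hypothetical eye-corner / contour feature points of a left eye area.
left_eye_points = [(412, 530), (468, 522), (470, 548), (410, 551)]
left_eye_block = crop_by_landmarks(np.zeros((1920, 1080, 3), np.uint8), left_eye_points)
print(left_eye_block.shape)  # (29, 60, 3)
```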
  • the face area in the first image is cropped based on the face position information to obtain the face image block, which may specifically include: the electronic device may determine the face area in the first image based on the face position information; the electronic device can crop the first image with the face area as the center of the first cropping frame to obtain a face image block.
  • the size of the first cropping frame is the first preset cropping size.
  • the electronic device can crop the first image with the left eye area as the center of the second cropping frame to obtain the left eye image block, and can also crop the first image with the right eye area as the center of the third cropping frame to obtain the right eye image block.
  • the size of the second cropping frame is the second preset cropping size.
  • the left eye image patch has the same size as the second cropping box.
  • the size of the third cropping frame is the third preset cropping size.
  • the right eye image block has the same size as the third cropping box.
  • the electronic device can crop the first image based on the face position information and the preset face cropping size to obtain a face image block.
  • the electronic device can also crop the first image based on the eye position information and the preset eye cropping size to obtain a left eye image block and a right eye image block.
  • the first preset cropping size is a preset face cropping size.
  • the second preset cropping size and the third preset cropping size are preset eye cropping sizes.
  • the second preset cropping size and the third preset cropping size may be the same.
  • the preset eye cropping size may include a preset left eye cropping size and a preset right eye cropping size.
  • the second preset crop size may be a preset left eye crop size.
  • the third preset crop size may be a preset right eye crop size.
  • For details, please refer to step S306 in the following text, which will not be described here.
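  • The alternative cropping scheme described above (a cropping frame of a preset size centered on the face or eye area) could look like the sketch below; the preset cropping sizes and region centers are placeholder values.

```python
import numpy as np

def crop_centered(image: np.ndarray, center_xy, crop_w: int, crop_h: int) -> np.ndarray:
    """Crop a crop_w x crop_h block centered on the given region center.

    The cropping frame is shifted back inside the image if it would extend past
    the border, so the resulting block always has the preset cropping size.
    """
    h, w = image.shape[:2]
    cx, cy = center_xy
    x0 = min(max(int(round(cx - crop_w / 2)), 0), w - crop_w)
    y0 = min(max(int(round(cy - crop_h / 2)), 0), h - crop_h)
    return image[y0:y0 + crop_h, x0:x0 + crop_w]

image = np.zeros((1920, 1080, 3), np.uint8)
face_block = crop_centered(image, (540, 900), crop_w=320, crop_h=320)      # preset face cropping size
left_eye_block = crop_centered(image, (470, 820), crop_w=128, crop_h=96)   # preset eye cropping size
print(face_block.shape, left_eye_block.shape)  # (320, 320, 3) (96, 128, 3)
```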
  • the gaze point estimation network model may also include several activation layers.
  • the region of interest pooling module may include several region of interest pooling layers.
  • a convolutional module can include several convolutional layers.
  • the fusion module includes several fully connected layers.
  • the gaze point estimation network model may include several region-of-interest pooling layers and several convolutional layers.
  • the gaze estimation network model can also include several activation layers.
  • the gaze point estimation network model may include several region-of-interest pooling layers, several convolutional layers, and several pooling layers.
  • the gaze estimation network model can also include several activation layers.
  • the present application provides an electronic device.
  • the electronic device may include a display screen, a camera, memory, and one or more processors.
  • Memory is used to store computer programs.
  • the camera can be used to: collect the first image.
  • the processor can be used to: obtain the face position information and eye position information in the first image when the face detection result meets the preset face conditions; and, in the process of processing the target image block based on the gaze point estimation network model, process the region of interest ROI of the target image block with the corresponding preset feature map size based on the region of interest pooling module of the gaze point estimation network model to obtain a feature map.
  • the face position information includes the coordinates of relevant feature points in the face area
  • the eye position information includes the coordinates of relevant feature points in the eye area.
  • the target image block includes at least one type of image block among a face image block, a left eye image block, and a right eye image block. Different types of image blocks each correspond to a preset feature map size.
  • the face image block is an image block obtained by cropping the face area in the first image based on the face position information
  • the left eye image block is an image block obtained by cropping the left eye area in the first image based on the eye position information.
  • the right eye image block is an image block obtained by cropping the right eye area in the first image based on the eye position information.
  • when the processor is used to process the region of interest ROI of the target image block with the corresponding preset feature map size to obtain the feature map, it can be specifically used to: divide the ROI of the target image block based on the corresponding preset feature map size to obtain several block areas; and perform maximum pooling processing on each block area in the ROI of the target image block to obtain the feature map.
  • the number of block areas in each row in the ROI of the target image block is the same as the width value in the corresponding preset feature map size, and the number of block areas in each column in the ROI of the target image block is the same as the height value in the corresponding preset feature map size.
  • the processor is configured to divide the ROI of the target image block based on the corresponding preset feature map size to obtain several block areas.
  • when the processor is used to perform maximum pooling processing on each block area in the ROI of the target image block to obtain the feature map, it can be specifically used to: perform maximum pooling processing on each face block area in the ROI of the face image block to obtain the first feature map, on each left eye block area in the ROI of the left eye image block to obtain the second feature map, and on each right eye block area in the ROI of the right eye image block to obtain the third feature map.
  • the first feature map is a feature map corresponding to the ROI of the face image block
  • the second feature map is a feature map corresponding to the ROI of the left eye image block
  • the third feature map is a feature map corresponding to the ROI of the right eye image block.
  • the number of block areas in each row in the ROI of the target image block is the same as the width value in the corresponding preset feature map size, and the number of block areas in each column in the ROI of the target image block is the same as the height value in the corresponding preset feature map size, which may specifically include: the number of face block areas in each row in the ROI of the face image block is the same as the width value in the first preset feature map size, and the number of face block areas in each column in the ROI of the face image block is the same as the height value in the first preset feature map size; the number of left eye block areas in each row in the ROI of the left eye image block is the same as the width value in the second preset feature map size, and the number of left eye block areas in each column in the ROI of the left eye image block is the same as the height value in the second preset feature map size; the number of right eye block areas in each row in the ROI of the right eye image block is the same as the width value in the third preset feature map size, and the number of right eye block areas in each column in the ROI of the right eye image block is the same as the height value in the third preset feature map size.
  • the shooting subject of the first image is the target object.
  • the processor may also be used to: obtain the pupil coordinates in the first image when the face detection result meets the preset face conditions; and determine the position and size of the face area in the first image based on the face position information, to obtain the face grid corresponding to the first image.
  • the face mesh is used to represent the distance of the target object from the camera.
  • after being used to obtain the feature map, the processor can also be used to: perform convolution processing on the feature map based on the convolution module of the gaze point estimation network model to extract eye features and/or facial features; and integrate the eye features and/or facial features, the face grid and the pupil coordinates based on the fusion module of the gaze point estimation network model to obtain the gaze point coordinates of the target object.
  • the face detection result satisfies the preset face conditions, which may specifically include: a face is detected in the first image.
  • the face detection result satisfies the preset face conditions, which may specifically include: a face is detected in the first image, and the size of the face area in the first image meets the preset size requirements.
  • the processor may also be used to: when a human face is detected in the first image and the size of the face area in the first image does not meet the preset size requirement, perform adaptive zoom and re-acquire images based on the focal length after the adaptive zoom.
  • when the processor is used to crop the face area in the first image based on the face position information, the processor may be specifically used to: determine relevant feature points of the face area in the first image; determine a first circumscribed rectangle; and crop the first image based on the position of the first circumscribed rectangle in the first image.
  • the first circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the face area in the first image.
  • the face image block is at the same position in the first image as the first enclosing rectangle.
  • the face image block is the same size as the first bounding rectangle.
  • when used to crop the left eye area in the first image based on the eye position information, the processor may specifically be used to: determine relevant feature points of the left eye area in the first image; determine the second circumscribed rectangle; and crop the first image based on the position of the second circumscribed rectangle in the first image.
  • the second circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the left eye area in the first image.
  • the position of the left eye image block and the second circumscribed rectangle in the first image are the same.
  • the left eye image block is the same size as the second bounding rectangle.
  • when used to crop the right eye area in the first image based on the eye position information, the processor may specifically be used to: determine relevant feature points of the right eye area in the first image; determine a third circumscribed rectangle; and crop the first image based on the position of the third circumscribed rectangle in the first image.
  • the third circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the right eye area in the first image.
  • the position of the right eye image block and the third circumscribed rectangle in the first image are the same.
  • the right eye image patch is the same size as the third enclosing rectangle.
  • when the processor is used to crop the face area in the first image based on the face position information to obtain the face image block, it may be specifically used to: determine the face area in the first image based on the face position information; and crop the first image with the face area as the center of the first cropping frame to obtain the face image block.
  • the size of the first cropping frame is the first preset cropping size.
  • the face image block is the same size as the first cropping box.
  • when the processor is used to crop the left eye area and the right eye area in the first image based on the eye position information to obtain the left eye image block and the right eye image block, it can be specifically used to: determine the left eye area and the right eye area in the first image based on the eye position information; crop the first image with the left eye area as the center of the second cropping frame to obtain the left eye image block; and crop the first image with the right eye area as the center of the third cropping frame to obtain the right eye image block.
  • the size of the second cropping frame is the second preset cropping size.
  • the left eye image block is the same size as the second cropping box.
  • the size of the third cropping frame is the third preset cropping size.
  • the right eye image block has the same size as the third cropping box.
  • the gaze point estimation network model may also include several activation layers.
  • the region of interest pooling module may include several region of interest pooling layers.
  • a convolutional module can include several convolutional layers.
  • the fusion module can include several fully connected layers.
  • the present application provides a computer storage medium that includes computer instructions.
  • when the computer instructions are run on an electronic device, they cause the electronic device to execute any of the possible implementations of the first aspect.
  • embodiments of the present application provide a chip that can be applied to electronic equipment.
  • the chip includes one or more processors, and the processor is used to call computer instructions to cause the electronic device to execute any possible implementation of the first aspect.
  • embodiments of the present application provide a computer program product containing instructions, which when the computer program product is run on an electronic device, causes the electronic device to execute any of the possible implementations of the first aspect.
  • the electronic device provided by the second aspect, the computer storage medium provided by the third aspect, the chip provided by the fourth aspect, and the computer program product provided by the fifth aspect are all used to execute any possible implementation of the first aspect. Therefore, for the beneficial effects that can be achieved, reference may be made to the beneficial effects of any possible implementation of the first aspect, which will not be described again here.
  • Figure 1 is a schematic diagram of a gaze point estimation scene provided by an embodiment of the present application.
  • Figures 2A-2D are schematic diagrams of a set of gaze point estimation scenes provided by embodiments of the present application.
  • Figure 3 is a flow chart of a gaze point estimation method provided by an embodiment of the present application.
  • Figure 4 is a schematic diagram of a cutting principle provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of another cutting principle provided by the embodiment of the present application.
  • Figure 6 is a schematic diagram of a facial mesh provided by an embodiment of the present application.
  • Figure 7 is a schematic architectural diagram of a gaze point estimation network model provided by an embodiment of the present application.
  • Figure 8 is an architectural schematic diagram of another gaze point estimation network model provided by an embodiment of the present application.
  • Figure 9 is a schematic architectural diagram of another gaze point estimation network model provided by an embodiment of the present application.
  • Figures 10A and 10B are schematic diagrams of the principles of a region of interest pooling layer provided by embodiments of the present application.
  • Figure 11 is a schematic diagram of an ROI mapped to a feature map provided by an embodiment of the present application.
  • Figure 12 is a schematic structural diagram of a CNN-1 provided by an embodiment of the present application.
  • Figure 13 is a schematic structural diagram of a CNN-3 provided by an embodiment of the present application.
  • Figure 14 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the present application.
  • Figure 15 is a schematic diagram of the software structure of an electronic device provided by an embodiment of the present application.
  • Reference to "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment may be included in at least one embodiment of the application.
  • the appearances of this phrase in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein may be combined with other embodiments.
  • This application provides a gaze point estimation method.
  • This gaze point estimation method can be applied to electronic devices.
  • the electronic device can capture images through the front camera. If the collected image includes a human face, the electronic device can crop the collected image based on the facial position information obtained through face detection and the preset face cropping size to obtain a face image block. Similarly, the electronic device can also crop the collected image based on the eye position information obtained by eye detection and the preset eye cropping size to obtain a left eye image block and a right eye image block.
  • the electronic device can also determine the face grid based on the face position information and determine the pupil coordinates through pupil positioning. Among them, the face grid is used to represent the position and size of the human face in the entire image.
  • the face mesh can reflect the distance between the face and the camera.
  • the electronic device can input the left eye image block, the right eye image block, the face image block, the face grid and the pupil coordinates into the gaze point estimation network model, and output the gaze point coordinates.
  • the gaze point estimation network model may include a region of interest pooling layer. This region of interest pooling layer can be used to unify the size of the feature map to prepare for subsequent feature extraction.
  • the electronic device can determine whether the size of the facial area in the collected image meets the preset size requirement. When the size of the facial area does not meet the preset size requirements, the electronic device can ensure a moderate shooting distance through adaptive zoom and re-acquire images. When the size of the facial area meets the preset size requirement, the electronic device can estimate the coordinates of the gaze point according to the above method.
  • the electronic device can estimate the gaze point coordinates based on the left eye image block, the right eye image block, the face image block, the face grid and the pupil coordinates, that is, more comprehensive feature extraction is achieved. Moreover, the electronic device can control the size of the facial area in the captured image through adaptive zoom, and unify the size of the feature maps based on the region of interest pooling layer, which can prevent the image blocks (for example, the left eye image block, the right eye image block and the face image block) from being deformed after scaling, thereby improving the accuracy of gaze point estimation.
  • the electronic device can obtain the user's image and estimate the coordinates of the gaze point through the user's image.
  • the electronic device can collect images through the front camera. If the collected images include human faces, the electronic device can crop the collected images to obtain left eye image blocks, right eye image blocks and Face image blocks.
  • the electronic device can also determine the face grid based on the facial position information obtained by face detection, and determine the pupil coordinates through pupil positioning. Among them, the face grid is used to represent the position and size of the human face in the entire image. It can also be understood as: the face grid can reflect the distance between the face and the camera.
  • the electronic device can input the left eye image block, the right eye image block, the face image block, the face grid and the pupil coordinates into the gaze point estimation network model, and output the gaze point coordinates. It is understandable that the relevant description of the gaze point estimation network model can be referred to later, and will not be explained here.
  • the electronic device may be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a laptop, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA), or a specialized camera (for example, an SLR camera or a compact card camera).
  • the electronic device when a user uses an electronic device to browse information, the electronic device can trigger a corresponding operation based on the estimated coordinates of the gaze point. In this case, the user can interact with the electronic device more conveniently.
  • the electronic device can display a reading interface 100.
  • the reading interface 100 displays the first page of the e-book that the user is reading.
  • the e-book has a total of 243 pages.
  • the electronic device can estimate the coordinates of the gaze point in real time.
  • the electronic device can estimate the coordinates of the gaze point based on the collected image, and determine that the coordinates of the gaze point are located at the end of the first page of content displayed on the reading interface 100 .
  • electronic devices can trigger page turning.
  • the electronic device can display the reading interface 200 shown in Figure 2C.
  • the reading interface 200 displays the second page of the e-book that the user is reading.
  • the electronic device can continue to estimate gaze point coordinates in real time.
  • the real-time estimation of gaze point coordinates mentioned in this application may include: the electronic device may collect a frame of image every certain time (for example, 10 ms) and estimate the gaze point coordinates based on the image.
  • the electronic device when a user uses an electronic device to browse information, the electronic device can collect the user's preference information based on the estimated gaze point coordinates, thereby providing more intelligent services to the user based on the collected user's preference information. For example, when a user uses an electronic device to browse information, the electronic device may recommend some content (such as videos, articles, etc.). In this case, the electronic device can estimate the coordinates of the user's gaze point, thereby determining recommended content that the user is interested in. In the subsequent process, the electronic device can recommend content related to the recommended content of interest to the user.
  • the electronic device may display a user interface 300.
  • User interface 300 may include several videos or text messages.
  • the user interface 300 may include Recommended Content 1, Recommended Content 2, Recommended Content 3, Recommended Content 4, and Recommended Content 5.
  • the electronic device can collect images in real time and estimate gaze point coordinates based on the collected images.
  • the electronic device can also count the distribution of gaze point coordinates during the process of the electronic device displaying the user interface 300, thereby determining recommended content that the user is interested in.
  • for example, if the user is interested in recommended content 2 in the user interface 300, the electronic device can intelligently recommend content related to recommended content 2 to the user, thereby preventing the user from spending time excluding content that is not of interest, and providing services to the user more intelligently.
  • FIG. 3 is a flow chart of a gaze point estimation method provided by an embodiment of the present application.
  • the gaze point estimation method may include but is not limited to the following steps:
  • S301 The electronic device acquires the image I1.
  • the electronic device acquires the image I1 through the front camera of the electronic device.
  • the electronic device receives the image I1 acquired by other cameras.
  • the electronic device can acquire images in real time.
  • the image I1 is an image acquired by the electronic device in real time.
  • the electronic device can acquire a frame of image every time T.
  • the time T mentioned in this application can be set according to actual needs.
  • the time T may be 1 millisecond (ms).
  • S302 The electronic device performs face detection on the image I1 to determine whether the image I1 includes a human face.
  • the electronic device can perform face detection on the image I1 to determine whether the image I1 includes a human face. When it is detected that the image I1 includes a human face, the electronic device can continue to perform subsequent steps. When it is detected that the image I1 does not include a human face, the electronic device can discard the image I1 and reacquire an image.
  • face detection refers to determining whether there are faces in dynamic scenes and complex backgrounds, and separating them. That is, based on the search strategy included in face detection, any given image can be searched to determine whether it contains a human face.
  • the electronic device can determine the matching degree (ie, correlation) between the input image and one or several preset standard face templates, and then determine whether there is a face in the image based on the matching degree. For example, the electronic device can determine the size relationship between the matching degree and a preset threshold, and determine whether there is a face in the image based on the size relationship. Specifically, if the matching degree is greater than the preset threshold, the electronic device determines that a human face exists in the image; otherwise, the electronic device determines that a human face does not exist in the image.
  • when determining the matching degree between the input image and one or several preset standard face templates, the electronic device can specifically calculate the degree of matching between the facial contour, nose, eyes, mouth and other parts in the input image and the corresponding parts in the standard face template.
  • the electronic device may include a template library.
  • Standard face templates can be stored in this template library.
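  • As a toy illustration of the template-matching idea (computing a matching degree between the input image and one or several standard face templates and comparing it with a preset threshold), one could use OpenCV's normalized cross-correlation as shown below; the 0.6 threshold is an arbitrary example, and real detectors are considerably more sophisticated.

```python
import cv2
import numpy as np

def face_present(image_gray: np.ndarray, templates, threshold: float = 0.6) -> bool:
    """Return True if any standard face template matches above `threshold`.

    `templates` plays the role of the template library; each template is assumed
    to be a grayscale image no larger than `image_gray`.
    """
    for tmpl in templates:
        # Normalized cross-correlation as the "matching degree" (correlation).
        result = cv2.matchTemplate(image_gray, tmpl, cv2.TM_CCOEFF_NORMED)
        if result.max() > threshold:
            return True  # matching degree exceeds the preset threshold
    return False
```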
  • human faces have certain structural distribution characteristics.
  • Electronic devices can extract the structural distribution characteristics of faces from a large number of samples and generate corresponding rules, and then determine whether there is a face in the image based on the rules.
  • the structural distribution characteristics of the human face can include: two symmetrical eyes, two symmetrical ears, a nose, a mouth, as well as the position and relative distance between the facial features, etc.
  • the sample learning method refers to the artificial neural network method, which generates a classifier by learning face sample sets and non-face sample sets. That is, electronic devices can train neural networks based on samples.
  • the parameters of the neural network include the statistical characteristics of the face.
  • Feature detection method refers to using the invariant characteristics of faces for face detection.
  • Human faces have some features that are robust to different poses. For example, a person's eyes and eyebrows are darker than their cheeks, their lips are darker than their surroundings, and the bridge of their nose is lighter than the sides.
  • the electronic device can extract these features and create a statistical model that describes the relationship between these features, and then determine whether a face is present in the image based on this statistical model. It can be understood that the features extracted by the electronic device can be expressed as a one-dimensional vector in the image feature space of the human face. Electronic devices can transform this one-dimensional vector into a relatively simple feature space when creating a statistical model that can describe the relationship between features.
  • the above four face detection methods can be used comprehensively in actual detection.
• It can be understood that factors such as individual differences (for example, differences in hairstyle, whether the eyes are open or closed, etc.), occlusion of the face in the shooting environment (for example, the face being blocked by hair, glasses, etc.), the angle of the face relative to the camera (for example, the side of the face facing the camera), the shooting environment (for example, objects around the face, etc.), and imaging conditions (for example, lighting conditions and imaging equipment) can all affect face detection.
• It can be understood that, when performing face detection, the electronic device also detects facial features. That is, when the electronic device performs face detection, it also performs eye detection. When the electronic device performs eye detection, it can obtain feature points related to the eyes. In this case, if the electronic device can detect the eyes in the image I1, it can obtain the eye position information. It can be understood that the eye position information may include the coordinates of feature points related to the eyes. The relevant description of the eye position information can be found later and will not be repeated here.
  • the feature points related to the eyes obtained by the electronic device may include the pupil center point.
  • the electronic device can obtain the coordinates of the pupil center point.
  • S303 The electronic device obtains the facial position information in the image I1.
  • the electronic device can obtain and save the face position information in the image I1.
  • the face location information may include the coordinates of the face detection frame.
  • the facial position information may include coordinates of relevant feature points of the human face.
• for example, the relevant feature points may include the edge contour feature points of the face, as well as feature points related to the eyes, nose, mouth and ears in the face area.
  • S304 The electronic device performs eye detection and pupil positioning on the image I1, and obtains eye position information and pupil coordinates.
• when the electronic device detects that the image I1 includes a human face, it can perform eye detection and pupil positioning on the image I1, thereby obtaining the eye position information and pupil coordinates in the image I1.
  • the eye position information may include coordinates of feature points related to the eye.
• when the electronic device performs eye detection, it can determine the feature points related to the eyes and obtain the coordinates of these feature points, for example, the two eye-corner feature points of the left eye, the two eye-corner feature points of the right eye, and the edge contour feature points of the eyes.
  • the electronic device can determine the eye position in the image I1 based on the coordinates of these eye-related feature points.
  • pupil coordinates are two-dimensional coordinates.
  • the pupil coordinates may include pupil center point coordinates.
  • the pupil coordinates may also include other coordinates related to the pupil. For example, the coordinates of the pupil center of gravity, the coordinates of the pupil edge contour points, etc.
• when the electronic device detects the eyes in the image I1, it can blur the eye region in the image I1, extract the pupil contour, and then determine the pupil center of gravity. It can be understood that the electronic device can use the coordinates of the pupil center of gravity as the pupil coordinates.
• when the electronic device detects the eyes in the image I1, it can blur the eye region in the image I1, compute the pixel values along the horizontal and vertical directions, and then take the index of the row with the lowest pixel value and the index of the column with the lowest pixel value as the vertical and horizontal coordinates of the pupil, respectively.
  • the electronic device can also adopt other pupil positioning methods, which is not limited in this application.
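• As a rough sketch of the row/column-minimum idea described above (not the exact patented procedure; the grayscale conversion and the Gaussian blur kernel size are assumptions):

```python
import cv2
import numpy as np

def locate_pupil(eye_patch_bgr):
    """Estimate the pupil (x, y) inside an eye image block from row/column intensity minima."""
    gray = cv2.cvtColor(eye_patch_bgr, cv2.COLOR_BGR2GRAY)
    blurred = cv2.GaussianBlur(gray, (7, 7), 0)   # smooth out lashes and sensor noise

    row_values = blurred.sum(axis=1)   # one value per row
    col_values = blurred.sum(axis=0)   # one value per column

    # The pupil is the darkest structure, so the lowest sums index its row and column.
    pupil_y = int(np.argmin(row_values))
    pupil_x = int(np.argmin(col_values))
    return pupil_x, pupil_y
```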
  • S305 The electronic device determines whether the size of the face area in the image I1 meets the preset size requirement.
• when the electronic device detects that the image I1 includes a human face, it can determine the size of the face area in the image I1 and determine whether the size of the face area in the image I1 meets the preset size requirement.
  • the facial region may include important features of a human face. For example, eyes, nose, mouth, etc.
  • the size of the facial area refers to the area of the facial area.
• in some embodiments, the area of the facial area refers to the area of the face detection frame.
  • the area of the facial area refers to the area of the entire facial area in the image detected by the electronic device.
  • the face detection frame can be used to frame a facial area including important features, and is not necessarily used to frame a complete facial area.
  • face detection boxes can be used to frame large areas of the face including features such as eyebrows, eyes, nose, mouth, and ears.
  • the shape of the face detection frame can be set according to actual needs.
  • the face detection frame can be a rectangle.
  • the fact that the size of the facial area in the image I1 meets the preset size requirement means that the area of the facial area in the image I1 is within the preset area range.
  • the preset area range can be [220px*220px, 230px*230px].
  • the area of the face area is not less than 220px*220px, and not larger than 230px*230px.
• this application does not limit the specific value of the preset area range. It can be understood that px is short for "pixel", the smallest unit used to represent a picture or graphic.
• the size of the face area in image I1 meeting the preset size requirement may also mean: the height of the face area in image I1 is within the preset height range, and the width of the face area in image I1 is within the preset width range.
  • the preset height range can be [215px, 240px]
  • the preset width range can be [215px, 240px].
  • the preset height range and the preset width range may be inconsistent. This application does not limit the specific values of the preset height range and the preset width range. It can be understood that the height of the face area mentioned in this application can be understood as the height of the face detection frame, and the width of the face area mentioned in this application can be understood as the width of the face detection frame.
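• For illustration only, a minimal sketch of this height/width check, assuming the detection frame is represented as (x, y, width, height) and using the example ranges above:

```python
def face_size_ok(face_box, h_range=(215, 240), w_range=(215, 240)):
    """Check whether the face detection frame meets the preset height/width requirement."""
    x, y, w, h = face_box
    return h_range[0] <= h <= h_range[1] and w_range[0] <= w <= w_range[1]
```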
• when the size of the face area in the image I1 meets the preset size requirement, the electronic device can continue to perform subsequent steps; when the size of the face area in the image I1 does not meet the preset size requirement, the electronic device can perform adaptive zoom and reacquire an image according to the focal length after adaptive zoom.
• it can be understood that the smaller the focal length, the wider the viewing range and the wider the field of view of the photographed picture, so more objects can be photographed, but the smaller each object appears in the picture.
  • the adaptive zoom method will be described.
  • the electronic device can determine the focal length when acquiring the image I1.
  • the focal length when the electronic device acquires the image I1 is recorded as the original focal length.
  • the electronic device can determine whether the area of the face area in the image I1 is smaller than the minimum value of the preset area range, or larger than the maximum value of the preset area range. If the area of the face area in image I1 is less than the minimum value of the preset area range, the electronic device can add J1 to the original focal length to obtain the focal length after adaptive zoom, and reacquire the image based on this focal length. If the area of the face area in the image I1 is greater than the maximum value of the preset area range, the electronic device can subtract J1 from the original focal length to obtain the focal length after adaptive zoom, and reacquire the image based on the focal length.
  • J1 is the preset focal length adjustment step, and the specific value of J1 can be set according to actual needs.
  • the electronic device may determine a middle value of the preset area range and determine a ratio of the area of the face region to the middle value.
• the electronic device can multiply the ratio by the original focal length to obtain the focal length after adaptive zoom, and reacquire an image based on this focal length.
• if the size of the face area in image I1 meeting the preset size requirement means that the height of the face area in image I1 is within the preset height range and the width of the face area in image I1 is within the preset width range, the electronic device can determine the preset area range based on the preset height range and the preset width range, and then perform adaptive zoom based on the area of the face area in the image I1, the preset area range, and the original focal length.
• alternatively, the electronic device can determine the middle value of the preset height range and the middle value of the preset width range, multiply the middle value of the preset height range by the middle value of the preset width range to obtain a preset area, and then perform adaptive zoom based on the preset area, the area of the face area in the image I1, and the original focal length.
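• A minimal sketch of the step-based adaptive zoom described above; the focal-length step value J1 = 0.1 and the example area range are assumptions used only for illustration:

```python
def adaptive_zoom_focal(face_area, original_focal,
                        area_min=220 * 220, area_max=230 * 230, j1=0.1):
    """Adjust the focal length by the preset step J1 so that the face area
    moves toward the preset area range."""
    if face_area < area_min:
        # Face appears too small: increase the focal length (narrower view, larger face).
        return original_focal + j1
    if face_area > area_max:
        # Face appears too large: decrease the focal length (wider view, smaller face).
        return original_focal - j1
    return original_focal  # size already meets the requirement, no zoom needed
```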
• the size of the facial area can reflect the shooting distance (i.e., the distance between the camera and the human face). It can also be understood that the size of the facial area contains the depth information of the shot. If the shooting distance is too large, the eye features in the images collected by the electronic device through the camera may be blurred, thus affecting the accuracy of gaze point estimation. If the shooting distance is too small, the facial features in the images collected by the electronic device through the camera may be incomplete, thus affecting the accuracy of gaze point estimation.
  • the electronic device can collect images containing appropriate face sizes, improving the accuracy of gaze point estimation.
  • S306 The electronic device crops the image I1 based on the face position information to obtain a face image block, and crops the image I1 based on the eye position information to obtain a left eye image block and a right eye image block.
  • the embodiments of this application provide two implementation methods when the electronic device performs step S306:
• in the first implementation, the electronic device determines the circumscribed rectangle of the face area in the image I1 based on the coordinates of the feature points included in the face position information, and crops the image I1 based on the circumscribed rectangle of the face area to obtain the face image block.
• the electronic device can also determine the circumscribed rectangle of the left eye area and the circumscribed rectangle of the right eye area in the image I1 based on the coordinates of the feature points included in the eye position information, and crop the image I1 based on the circumscribed rectangle of the left eye area and the circumscribed rectangle of the right eye area, respectively, to obtain the left eye image block and the right eye image block.
  • the circumscribed rectangle mentioned in this application may be the minimum circumscribed rectangle.
• the minimum circumscribed rectangle refers to, for a two-dimensional shape (for example, a point, line, or polygon) represented by two-dimensional coordinates, the rectangle whose boundaries are determined by the maximum abscissa, the minimum abscissa, the maximum ordinate, and the minimum ordinate of the vertices of the given shape.
  • the circumscribed rectangle of the facial area can be understood as the minimum circumscribed rectangle of facial feature points (for example, facial edge contour feature points).
  • the circumscribed rectangle of the left eye area can be understood as the minimum circumscribed rectangle of the left eye feature points (for example, the two corner feature points of the left eye, the edge contour feature points of the left eye).
  • the circumscribed rectangle of the right eye area can be understood as the minimum circumscribed rectangle of the right eye feature points (for example, the two corner feature points of the right eye, the edge contour feature points of the right eye).
  • the size of the face image block is the same as the size of the circumscribing rectangle of the face area in image I1.
  • the size of the left eye image block is the same as the size of the bounding rectangle of the left eye area in image I1.
  • the size of the right eye image block is the same as the size of the enclosing rectangle of the right eye area in image I1.
  • the electronic device can determine the bounding box of the facial feature points through a bounding box algorithm.
  • the bounding box of facial feature points can be understood as the optimal surrounding area of facial feature points.
• the electronic device can also crop the image I1 based on the bounding box of the facial feature points to obtain the face image block.
  • the electronic device can determine the bounding boxes of the left eye feature point and the right eye feature point respectively through the bounding box algorithm.
  • the bounding boxes of the left eye feature point and the right eye feature point can be understood as the optimal surrounding area of the left eye feature point and the optimal surrounding area of the right eye feature point respectively.
  • the electronic device can also crop the image I1 based on the bounding boxes of the left eye feature points and the right eye feature points respectively to obtain the left eye image block and the right eye image block.
  • bounding box is an algorithm for solving the optimal bounding space of a discrete point set.
  • the basic idea is to approximately replace complex geometric objects with slightly larger geometric objects (called bounding boxes) with simple characteristics.
• for relevant descriptions of bounding boxes, please refer to relevant technical documents; they will not be described in this application.
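• To illustrate cropping by the minimum circumscribed rectangle of feature points, a simplified sketch; representing the landmarks as an (N, 2) array of (x, y) coordinates is an assumption:

```python
import numpy as np

def crop_by_bounding_rect(image, points):
    """Crop the image by the minimum axis-aligned circumscribed rectangle of the points."""
    points = np.asarray(points)
    x_min, y_min = np.floor(points.min(axis=0)).astype(int)
    x_max, y_max = np.ceil(points.max(axis=0)).astype(int)
    return image[y_min:y_max + 1, x_min:x_max + 1]

# Usage: cropping the face, left eye and right eye blocks from one image I1.
# face_block  = crop_by_bounding_rect(image_i1, face_contour_points)
# left_block  = crop_by_bounding_rect(image_i1, left_eye_points)
# right_block = crop_by_bounding_rect(image_i1, right_eye_points)
```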
• in the second implementation, the electronic device crops the image I1 based on the face position information and the preset face cropping size to obtain the face image block, and crops the image I1 based on the eye position information and the preset eye cropping size to obtain the left eye image block and the right eye image block.
• the electronic device can determine the facial area in the image I1 based on the coordinates in the facial position information, and, with the facial area as the center, crop the image I1 based on the preset face cropping size, thereby obtaining the face image block.
  • the size of the face image block is the same as the preset face cropping size.
  • the face area in the face image block is located at the center of the face image block.
• the coordinates in the facial position information may include the coordinates of the edge contour feature points of the human face, may also include the coordinates of the face detection frame, and may also include the coordinates of feature points related to the eyes, nose, mouth and ears.
• similarly, the electronic device can determine the left eye area and the right eye area in image I1 based on the coordinates in the eye position information, and, with the left eye area and the right eye area as the respective centers, crop the image I1 based on the preset eye cropping size, thereby obtaining the left eye image block and the right eye image block.
  • the left eye area in the left eye image block is located at the center of the left eye image block.
  • the right eye area in the right eye image block is located in the center of the right eye image block.
• the coordinates in the eye position information may include the two eye-corner feature points of the left eye and the two eye-corner feature points of the right eye, and may also include the edge contour feature points of the eyes.
  • the size of the left eye image block is the same as the preset eye crop size
  • the size of the right eye image block is the same as the preset eye crop size.
  • the default eye crop size is 60px*60px.
  • the sizes of the left-eye image block and the right-eye image block cropped by the electronic device are both 60px*60px.
  • the preset eye cropping size may include a preset left eye cropping size and a preset right eye cropping size.
  • the preset left eye crop size may be different from the preset right eye crop size.
  • the size of the left eye image block is the same as the preset left eye crop size, and the size of the right eye image block is the same as the preset right eye crop size.
  • the preset face cropping size and the preset eye cropping size can be set according to actual needs, and this application does not limit this.
  • the preset face cropping size may be 244px*244px
  • the preset eye cropping size may be 60px*60px.
  • the electronic device can determine the face area based on the coordinates included in the face position information (for example, the coordinates of the edge contour feature points of the face, etc.), and set it according to the preset face cropping size Cropping box, and then use the face area as the center of the cropping box to crop the image I1, thereby obtaining the face image block.
  • the electronic device can determine the left eye area and the right eye area based on the coordinates included in the eye position information, and set the left eye cropping frame and the right eye cropping frame according to the preset eye cropping size, and then The left eye area and the right eye area are respectively used as the centers of the left eye cropping frame and the right eye cropping frame to crop the image I1 to obtain the left eye image block and the right eye image block respectively.
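• A minimal sketch of this center-cropping step; the (x, y) center representation and the clamping of the crop window to the image borders are assumptions, while the 244px and 60px sizes follow the examples above:

```python
def center_crop(image, center_xy, crop_w, crop_h):
    """Crop a crop_w x crop_h patch centered on center_xy, clamped to the image borders."""
    h, w = image.shape[:2]
    cx, cy = int(center_xy[0]), int(center_xy[1])
    x0 = max(0, min(cx - crop_w // 2, w - crop_w))
    y0 = max(0, min(cy - crop_h // 2, h - crop_h))
    return image[y0:y0 + crop_h, x0:x0 + crop_w]

# face_block  = center_crop(image_i1, face_center,     244, 244)
# left_block  = center_crop(image_i1, left_eye_center,  60,  60)
# right_block = center_crop(image_i1, right_eye_center, 60,  60)
```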
• S307 The electronic device determines the face grid corresponding to the image I1 based on the face position information.
• the face grid is used to represent the position and size of the face in the overall image.
  • the electronic device can determine the position of the face area in the image I1 based on the coordinates included in the face position information (for example, the coordinates of the edge contour feature points of the human face, etc.), thereby determining the face grid corresponding to the image I1 .
  • a face grid can be used to represent the position and size of a face within the entire image. It can be understood that the face grid can represent the distance between the face and the camera.
  • the face grid can be understood as a binary mask.
  • a binary mask can be understood as a binary matrix corresponding to the image, that is, a matrix whose elements are all 0 or 1.
  • an image (whole or partially) can be occluded by a binary mask.
  • Binary masks can be used for region of interest extraction, masking, structural feature extraction, etc.
  • the electronic device can determine the proportional relationship between the face area in the image I1 and the image I1 according to the coordinates included in the face position information, thereby obtaining the depth information of the human face in the image I1.
  • the electronic device may also determine that the human face in the image I1 is located at a lower-center position in the image I1. Further, the electronic device can determine the facial grid corresponding to the image I1.
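• One possible way to build such a face grid as a binary mask is sketched below; the 25*25 grid resolution and the (x, y, width, height) face-box representation are assumptions, since the document only requires that the grid encode the position and size of the face within the whole image:

```python
import numpy as np

def face_grid(image_w, image_h, face_box, grid_size=25):
    """Build a grid_size x grid_size binary mask whose 1-cells cover the face detection frame."""
    x, y, w, h = face_box
    grid = np.zeros((grid_size, grid_size), dtype=np.uint8)

    # Map the face rectangle from pixel coordinates to grid-cell coordinates.
    x0 = int(x / image_w * grid_size)
    y0 = int(y / image_h * grid_size)
    x1 = int(np.ceil((x + w) / image_w * grid_size))
    y1 = int(np.ceil((y + h) / image_h * grid_size))

    grid[y0:y1, x0:x1] = 1   # cells overlapping the face are 1, the rest stay 0
    return grid
```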
  • S308 The electronic device inputs the left eye image block, the right eye image block, the face image block, the face grid and the pupil coordinates into the gaze point estimation network model, and outputs the gaze point coordinates.
  • the electronic device can input the left eye image block, right eye image block, face image block, face grid and pupil coordinates to the gaze point estimation network model, and output the two-dimensional coordinates.
  • the two-dimensional coordinates are the gaze point coordinates.
  • the gaze point estimation network model can be a neural network model including several branches.
• the gaze point estimation network model can extract corresponding features through the several branches it contains, and then fuse the extracted features to estimate the gaze point coordinates.
  • a neural network is a mathematical model or computational model that imitates the structure and function of a biological neural network (the central nervous system of animals, especially the brain).
  • Neural networks are composed of a large number of artificial neurons, and different networks are constructed according to different connection methods. Neural networks can include convolutional neural networks, recurrent neural networks, etc.
  • the gaze point estimation network model may include several region-of-interest pooling layers, several convolutional layers, several pooling layers, and several fully connected layers.
  • the area of interest pooling layer is used to unify the size of the feature map.
  • Convolutional layers are used to extract features.
  • the pooling layer is used for downsampling to reduce the amount of data.
  • the fully connected layer is used to map the extracted features to the sample label space. In layman's terms, the fully connected layer is used to integrate the extracted features and output it as a value.
  • the gaze point estimation network model can include region of interest pooling (ROI pooling) layer-1, region of interest pooling layer-2, CNN-1, CNN-2, CNN-3, fully connected layer-1, fully connected layer -2, fully connected layer-3 and fully connected layer-4.
  • the area of interest pooling layer-1 is used to unify the size of the feature map corresponding to the left eye image block, and unify the size of the feature map corresponding to the right eye image block.
  • the region of interest pooling layer-2 is used to unify the size of the feature maps corresponding to the face image blocks.
  • CNN-1, CNN-2 and CNN-3 are all convolutional neural networks (Convolutional Neural Network, CNN), which are used to extract left eye features, right eye features and facial features respectively.
  • CNN-1, CNN-2 and CNN-3 may include several convolutional layers and several pooling layers respectively.
  • CNN-1, CNN-2 and CNN-3 may also include one or more fully connected layers.
  • the fully connected layer-1 is used to integrate the extracted left eye features, right eye features and facial features.
  • Fully connected layer-2 and fully connected layer-3 are respectively used to integrate the depth information represented by the face grid (that is, the distance between the face and the camera), and the pupil position information represented by the pupil coordinates.
  • the fully connected layer-4 is used to integrate left eye features, right eye features, facial features, depth information, pupil position and other information, and output it as a value.
• the electronic device can use the left eye image block and the right eye image block as the input of the region of interest pooling layer-1, and the face image block as the input of the region of interest pooling layer-2.
  • the region of interest pooling layer-1 can output feature maps of the same size.
  • the region of interest pooling layer-2 can also output feature maps of the same size.
• the electronic device can use the feature map corresponding to the left eye image block output by the region of interest pooling layer-1 as the input of CNN-1, and the feature map corresponding to the right eye image block output by the region of interest pooling layer-1 as the input of CNN-2.
  • the electronic device can use the feature map output by the region of interest pooling layer-2 as the input of CNN-3.
  • the electronic device can use the output of CNN-1, the output of CNN-2 and the output of CNN-3 as the input of the fully connected layer-1.
  • the electronic device can also use the facial mesh and pupil coordinates as inputs to fully connected layer-2 and fully connected layer-3 respectively.
  • the electronic device can use the outputs of the fully connected layer-1, the fully connected layer-2 and the fully connected layer-3 as the input of the fully connected layer-4.
  • Fully connected layer-4 can output two-dimensional coordinates. The two-dimensional coordinates are the gaze point coordinates estimated by the electronic device.
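• The data flow just described can be sketched in PyTorch as follows. This is a simplified illustration rather than the patented architecture: the use of adaptive max pooling in place of a dedicated region of interest pooling layer, the encoder depths and channel counts, the 25*25 face grid, and the single pupil coordinate pair are all assumptions.

```python
import torch
import torch.nn as nn

class GazeNetSketch(nn.Module):
    def __init__(self, grid_size=25):
        super().__init__()
        # Region of interest pooling approximated by adaptive max pooling to a fixed 3x3 map.
        self.roi_pool_eye = nn.AdaptiveMaxPool2d(3)    # stands in for ROI pooling layer-1
        self.roi_pool_face = nn.AdaptiveMaxPool2d(3)   # stands in for ROI pooling layer-2

        def encoder():  # stands in for CNN-1 / CNN-2 / CNN-3
            return nn.Sequential(
                nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
                nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
                nn.Flatten())

        self.cnn_left, self.cnn_right, self.cnn_face = encoder(), encoder(), encoder()
        self.fc1 = nn.Linear(3 * 32 * 3 * 3, 128)          # fully connected layer-1
        self.fc2 = nn.Linear(grid_size * grid_size, 256)   # fully connected layer-2 (face grid)
        self.fc3 = nn.Linear(2, 256)                       # fully connected layer-3 (pupil coordinates)
        self.fc4 = nn.Linear(128 + 256 + 256, 2)           # fully connected layer-4 -> (x, y) gaze point

    def forward(self, left_eye, right_eye, face, face_grid, pupil_coords):
        l = self.cnn_left(self.roi_pool_eye(left_eye))
        r = self.cnn_right(self.roi_pool_eye(right_eye))
        f = self.cnn_face(self.roi_pool_face(face))
        eyes_face = torch.relu(self.fc1(torch.cat([l, r, f], dim=1)))
        grid_feat = torch.relu(self.fc2(face_grid.flatten(1)))
        pupil_feat = torch.relu(self.fc3(pupil_coords))
        return self.fc4(torch.cat([eyes_face, grid_feat, pupil_feat], dim=1))
```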
  • the gaze point estimation network model may include more region-of-interest pooling layers.
  • the electronic device may use the left-eye image block and the right-eye image block as inputs to different region-of-interest pooling layers respectively.
  • the electronic device can use the outputs of the pooling layers of the different regions of interest as inputs of CNN-1 and CNN-2 respectively.
  • the gaze point estimation network model may include more fully connected layers. It is understandable that there can be more fully connected layers before and after fully connected layer-2, and there can also be more fully connected layers before and after fully connected layer-3.
  • the electronic device may use the output of fully connected layer-2 as the input of fully connected layer-5, and the output of fully connected layer-5 as the input of fully connected layer-4.
  • the electronic device may use the output of fully connected layer-3 as the input of fully connected layer-6, and the output of fully connected layer-6 as the input of fully connected layer-4.
  • the electronic device can use the output of the fully connected layer-4 as the input of the fully connected layer-7, and the output of the fully connected layer-7 is the gaze point coordinate estimated by the electronic device.
• FIG. 8 is a schematic architectural diagram of yet another gaze point estimation network model provided by an embodiment of the present application.
  • the gaze point estimation network model can include area of interest pooling layer-1, area of interest pooling layer-2, CNN-1, CNN-2, CNN-3, fully connected layer-1, fully connected layer-2, fully connected layer Connected layer-3, fully connected layer-4, fully connected layer-5, fully connected layer-6 and fully connected layer-7.
  • the functions of the region of interest pooling layer-1, the region of interest pooling layer-2, CNN-1, CNN-2, CNN-3 and the fully connected layer-1 can all be referred to above, and this application will not describe them here.
  • Fully connected layer-2 and fully connected layer-5 are used to integrate the depth information represented by the face mesh.
  • the fully connected layer-3 and the fully connected layer-6 are used to integrate the pupil position information represented by the pupil coordinates.
  • Fully connected layer-4 and fully connected layer-7 are used to integrate left eye features, right eye features, facial features, depth information, pupil position and other information, and output them as a value.
  • the electronic device can use the output of fully connected layer-2 as the input of fully connected layer-5, the output of fully connected layer-3 as the input of fully connected layer-6, and the output of fully connected layer-1 , the outputs of fully connected layer-5 and fully connected layer-6 serve as the input of fully connected layer-4.
  • the electronic device can also use the output of the fully connected layer-4 as the input of the fully connected layer-7, and the output of the fully connected layer-7 is the gaze point coordinate estimated by the electronic device.
  • FIG. 9 is a schematic architectural diagram of yet another gaze point estimation network model provided by an embodiment of the present application.
  • the gaze point estimation network model can include area of interest pooling layer-1, area of interest pooling layer-2, CNN-1, CNN-2, CNN-3, fully connected layer-2, fully connected layer-3, fully connected layer Connected layer-4, fully connected layer-5, fully connected layer-6 and fully connected layer-7.
  • the functions of the region of interest pooling layer-1, the region of interest pooling layer-2, CNN-1, CNN-2, and CNN-3 can all be referred to above, and this application will not repeat them here.
  • Fully connected layer-2 and fully connected layer-5 are used to integrate the depth information represented by the face mesh.
  • the fully connected layer-3 and the fully connected layer-6 are used to integrate the pupil position information represented by the pupil coordinates.
  • Fully connected layer-4 and fully connected layer-7 are used to integrate left eye features, right eye features, facial features, depth information, pupil position and other information, and output them as a value.
  • the electronic device can use the output of fully connected layer-2 as the input of fully connected layer-5, the output of fully connected layer-3 as the input of fully connected layer-6, and the output of fully connected layer-5 and the output of fully connected layer-6 serves as the input of fully connected layer-4.
  • the electronic device can also use the output of the fully connected layer-4 as the input of the fully connected layer-7, and the output of the fully connected layer-7 is the gaze point coordinate estimated by the electronic device.
  • the gaze point estimation network model may also include several activation layers.
• for example, an activation layer can be set between fully connected layer-1 and fully connected layer-4, an activation layer can be set between fully connected layer-2 and fully connected layer-4, and an activation layer can also be set between fully connected layer-3 and fully connected layer-4.
  • an activation layer can be set between the fully connected layer-1 and the fully connected layer-4, and an activation layer can be set between the fully connected layer-2 and the fully connected layer-5.
• an activation layer can be set between fully connected layer-5 and fully connected layer-4, an activation layer can be set between fully connected layer-3 and fully connected layer-6, an activation layer can be set between fully connected layer-6 and fully connected layer-4, and an activation layer can be set between fully connected layer-4 and fully connected layer-7.
  • an activation layer can be set between the fully connected layer-2 and the fully connected layer-5, and an activation layer can be set between the fully connected layer-5 and the fully connected layer-4.
• an activation layer can be set between fully connected layer-3 and fully connected layer-6, an activation layer can be set between fully connected layer-6 and fully connected layer-4, and an activation layer can be set between fully connected layer-4 and fully connected layer-7.
  • Region of interest refers to: in machine vision and image processing, the area to be processed is outlined from the processed image in the form of boxes, circles, ellipses, irregular polygons, etc.
  • the region of interest pooling layer is a type of pooling layer.
  • the electronic device can divide the ROI in the image input to the region of interest pooling layer into block areas (sections) of the same size, and perform a maximum pooling operation on each block area.
• the resulting processed feature map is the output of the region of interest pooling layer. The number of block areas is consistent with the dimensions of the feature map output by the region of interest pooling layer.
  • the following example illustrates the processing process in the region of interest pooling layer-1.
• the electronic device can divide the ROI of the left eye image block-1 into 3*3 block areas of the same size, and perform maximum pooling on each block area (that is, take the maximum value of each block area).
  • the electronic device can obtain the feature map-1 corresponding to the ROI after the maximum pooling process.
  • the electronic device can use the feature map-1 as the output of the region-of-interest pooling layer-1.
  • the size of feature map-1 is 3*3. That is, the feature map-1 can be understood as a 3*3 matrix. It can be understood that the ROI of the left eye image block -1 is the entire left eye image block -1.
• similarly, the electronic device can divide the ROI of the left eye image block-2 into 3*3 block areas of the same size, and perform maximum pooling on each block area (that is, take the maximum value of each block area).
  • the electronic device can obtain the feature map-2 corresponding to the ROI after the maximum pooling process.
• the electronic device can take the feature map-2 as the output of the region of interest pooling layer-1.
  • the size of feature map-2 is 3*3. That is, feature map-2 can be understood as a 3*3 matrix. It can be understood that the ROI of the left eye image block -2 is the entire left eye image block -2.
  • Figure 10A and Figure 10B represent the processing process of one channel among the three RGB channels.
  • the ROI of the input image can be divided into several sub-regions.
• Each block area contains data.
  • the data contained in the block area mentioned here can be understood as elements of the corresponding area in the matrix corresponding to the ROI of the input image.
  • the electronic device can divide the ROI of the image input to the region of interest pooling layer based on the preset size of the feature map.
• for example, if the preset feature map size is 10*10 and the ROI size of the image input to the region of interest pooling layer is 100*100, the electronic device can evenly divide the ROI into 10*10 block areas, each of size 10*10.
• if the ROI cannot be divided evenly, the electronic device can perform a zero-padding operation, or make a certain column or row of block areas slightly larger or smaller while ensuring that most of the block areas are of the same size.
  • the size of the preset feature map may be 10*10.
  • the size of the ROI of the image input to the region of interest pooling layer is 101*101.
• the electronic device can divide the ROI into 9*9 block areas with a size of 10*10, 9 block areas with a size of 10*11, 9 block areas with a size of 11*10, and 1 block area with a size of 11*11.
  • the size of the feature maps output by the region of interest pooling layer-1 is the same.
  • the size of the feature maps output by the region of interest pooling layer-2 is the same.
• for example, after the left eye image block-1 and the left eye image block-2 are input into the region of interest pooling layer-1, the resulting feature map-1 and feature map-2 are both of size 3*3.
  • the size of the feature map output by the region of interest pooling layer is not limited to the above example, and this application does not limit this.
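• To make the block division and maximum pooling concrete, a single-channel sketch in NumPy (covering only the evenly divisible case; uneven division and zero padding as described above are omitted):

```python
import numpy as np

def roi_max_pool(roi, out_h=3, out_w=3):
    """Divide a single-channel ROI into out_h x out_w block areas and max-pool each block."""
    h, w = roi.shape
    assert h % out_h == 0 and w % out_w == 0, "sketch handles the evenly divisible case only"
    bh, bw = h // out_h, w // out_w
    # Reshape so that each block area gets its own pair of axes, then take the block-wise maximum.
    blocks = roi.reshape(out_h, bh, out_w, bw)
    return blocks.max(axis=(1, 3))

# Example: one 60x60 channel of an eye image block pooled to a 3x3 feature map.
feature_map = roi_max_pool(np.random.rand(60, 60), 3, 3)
print(feature_map.shape)   # (3, 3)
```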
  • left eye image block-3 is an RGB image.
  • the left-eye image block-3 can be expressed as a 60*60*3 matrix.
  • the elements in this matrix include the value of the RGB three channels corresponding to each pixel in the left eye image block-3.
  • the electronic device can input the left eye image block-3 into the area of interest pooling layer-1, and can output three 3*3 feature maps. These three 3*3 feature maps respectively correspond to the RGB three-channel feature maps.
  • the processing process of inputting it to the region of interest pooling layer-1 can refer to Figure 10A.
  • CNN refers to convolutional neural network, which is a type of neural network.
  • CNN can include convolutional layers, pooling layers and fully connected layers.
  • each convolution layer in the convolutional neural network consists of several convolution units.
  • the parameters of each convolution unit are optimized through the back-propagation algorithm.
  • the purpose of the convolution operation is to extract different features of the input.
  • the first convolutional layer may only be able to extract some low-level features such as edges, lines, and corners, and more layers of networks can iteratively extract more complex features from low-level features.
  • the essence of pooling is downsampling.
  • the main function of the pooling layer is to reduce the amount of calculation by reducing the parameters of the network, and to control overfitting to a certain extent.
  • the operations performed by the pooling layer generally include maximum pooling, mean pooling, etc.
  • CNN-1, CNN-2 and CNN-3 may include several convolutional layers and several pooling layers respectively. It can be understood that CNN-1, CNN-2 and CNN-3 can also include several activation layers.
• the activation layer is also called the neuron layer; its key element is the choice of activation function. Activation functions can include ReLU, PReLU, Sigmoid, etc. The electronic device can perform an activation operation on the input data, which can also be understood as a function transformation.
  • CNN-1 may include 4 convolutional layers and 4 activation layers.
  • the 4 convolutional layers refer to: convolutional layer-1, convolutional layer-2, convolutional layer-3 and convolutional layer-4.
• the 4 activation layers refer to: activation layer-1, activation layer-2, activation layer-3 and activation layer-4. It can be understood that the size of the convolution kernels (i.e., filters) of the four convolutional layers can be 3*3.
  • CNN-3 may include 4 convolutional layers, 4 activation layers, and 4 pooling layers.
• the 4 convolutional layers refer to: convolutional layer-1, convolutional layer-2, convolutional layer-3 and convolutional layer-4.
• the 4 activation layers refer to: activation layer-1, activation layer-2, activation layer-3 and activation layer-4.
  • the 4 pooling layers refer to: pooling layer-1, pooling layer-2, pooling layer-3 and pooling layer-4.
• the stride of the 4 pooling layers can be 2 (for example, max pooling is performed on every 2*2 "cells"). It can be understood that zero padding can also be performed on the feature map in the convolutional layers. For the relevant description of the zero-padding operation, please refer to relevant technical documents; it will not be explained here.
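• Based on this description, CNN-3 might look roughly like the sketch below. The 3*3 kernels and stride-2 pooling follow the text, while the channel widths, the padding, and the assumption that the input feature map is large enough for four stride-2 poolings (for example 48*48) are illustrative only.

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """One convolutional layer (3x3 kernel) + one activation layer + one pooling layer (stride 2)."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(kernel_size=2, stride=2))

# CNN-3: 4 convolutional layers, 4 activation layers and 4 pooling layers.
cnn3 = nn.Sequential(
    conv_block(3, 16),
    conv_block(16, 32),
    conv_block(32, 64),
    conv_block(64, 128))
```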
  • the structures of CNN-2 and CNN-1 may be the same. In some embodiments of the present application, the structures of CNN-2, CNN-3 and CNN-1 may be the same.
  • CNN-1, CNN-2 and CNN-3 can also be other contents and are not limited to the above examples, and this application is not limited to this.
  • the fully connected layer is used to map the extracted features to the sample label space.
  • the fully connected layer is used to integrate the extracted features and output it as a value.
• the number of neurons in fully connected layer-1 is 128, the number of neurons in fully connected layer-2 and fully connected layer-3 is both 256, the number of neurons in fully connected layer-5 and fully connected layer-6 is both 128, the number of neurons in fully connected layer-4 is 128, and the number of neurons in fully connected layer-7 is 2.
  • the number of neurons in the fully connected layer in the gaze estimation network model can also be other values, which are not limited to the above examples, and this application is not limited to this.
  • the electronic device can obtain eye position information and pupil coordinates during the face detection process. Therefore, the electronic device does not need to perform step S304.
  • the electronic device does not need to determine whether the size of the facial area in the image I1 meets the preset size requirement. That is to say, the electronic device does not need to perform step S305.
  • Figure 14 is a schematic diagram of the hardware structure of an electronic device provided by an embodiment of the present application.
  • the electronic device may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, Mobile communication module 150, wireless communication module 160, audio module 170, speaker 170A, receiver 170B, microphone 170C, headphone interface 170D, sensor module 180, button 190, motor 191, indicator 192, camera 193, display screen 194, and user Identification module (Subscriber Identification Module, SIM) card interface 195, etc.
  • the sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, and ambient light. Sensor 180L, bone conduction sensor 180M, etc.
  • the structures illustrated in the embodiments of the present invention do not constitute specific limitations on the electronic equipment.
  • the electronic device may include more or less components than shown in the figures, or some components may be combined, some components may be separated, or some components may be arranged differently.
  • the components illustrated may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units.
• the processor 110 may include an application processor (Application Processor, AP), a modem processor, a graphics processing unit (Graphics Processing Unit, GPU), an image signal processor (Image Signal Processor, ISP), a controller, a memory, a video codec, a digital signal processor (Digital Signal Processor, DSP), a baseband processor, and/or a neural-network processing unit (Neural-network Processing Unit, NPU), etc.
  • different processing units can be independent devices or integrated in one or more processors.
  • the controller can be the nerve center and command center of the electronic device.
  • the controller can generate operation control signals based on the instruction operation code and timing signals to complete the control of fetching and executing instructions.
  • the electronic device can execute the gaze point estimation method through the processor 110 .
  • the processor 110 may also be provided with a memory for storing instructions and data.
  • the memory in processor 110 is cache memory. This memory may hold instructions or data that have been recently used or recycled by processor 110 . If the processor 110 needs to use the instructions or data again, it can be called directly from the memory. Repeated access is avoided and the waiting time of the processor 110 is reduced, thus improving the efficiency of the system.
  • processor 110 may include one or more interfaces.
  • the USB interface 130 is an interface that complies with the USB standard specification, and may be a Mini USB interface, a Micro USB interface, a USB Type C interface, etc.
  • the interface included in the processor 110 can also be used to connect other electronic devices, such as AR devices.
  • the charging management module 140 is used to receive charging input from the charger. While the charging management module 140 charges the battery 142, it can also provide power to the electronic device through the power management module 141.
  • the wireless communication function of the electronic device can be realized through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor and the baseband processor.
  • Antenna 1 and Antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in an electronic device can be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve antenna utilization.
  • the mobile communication module 150 can provide wireless communication solutions including 2G/3G/4G/5G applied to electronic devices.
• the wireless communication module 160 can provide wireless communication solutions applied to the electronic device, including Wireless Local Area Network (WLAN) (such as a Wireless Fidelity (Wi-Fi) network), Bluetooth (BT), Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), Infrared (IR), etc.
  • the antenna 1 of the electronic device is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, so that the electronic device can communicate with the network and other devices through wireless communication technology.
  • the electronic device implements display functions through the GPU, display screen 194, and application processor.
  • the GPU is an image processing microprocessor and is connected to the display screen 194 and the application processor. GPUs are used to perform mathematical and geometric calculations for graphics rendering.
  • Processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
  • the display screen 194 is used to display images, videos, etc.
  • Display 194 includes a display panel.
• the display panel can use a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (Active-Matrix Organic Light-Emitting Diode, AMOLED), etc.
  • the electronic device may include 1 or N display screens 194, where N is a positive integer greater than 1.
  • Electronic devices can achieve acquisition functions through ISPs, cameras 193, video codecs, GPUs, display screens 194, and application processors.
  • the ISP is used to process the data fed back by the camera 193. For example, when taking a photo, the shutter is opened, the light is transmitted to the camera sensor through the lens, the light signal is converted into an electrical signal, and the camera sensor passes the electrical signal to the ISP for processing, and converts it into an image or video visible to the naked eye. ISP can also perform algorithm optimization on image noise, brightness, and color. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene. In some embodiments, the ISP may be provided in the camera 193.
  • Camera 193 is used to capture still images or video.
  • the object passes through the lens to produce an optical image that is projected onto the photosensitive element.
• the photosensitive element may be a Charge Coupled Device (CCD) or a Complementary Metal-Oxide-Semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then passes the electrical signal to the ISP for conversion into a digital image or video signal.
  • ISP outputs digital images or video signals to DSP for processing.
  • DSP converts digital images or video signals into standard RGB, YUV and other formats.
  • a digital signal processor is used to process digital signals. In addition to processing digital images or video signals, it can also process other digital signals. For example, when the electronic device selects a frequency point, the digital signal processor is used to perform Fourier transform on the frequency point energy.
  • Video codecs are used to compress or decompress digital video.
  • Electronic devices may support one or more video codecs. In this way, electronic devices can play or record videos in multiple encoding formats, such as: Moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, etc.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device.
• the external memory card communicates with the processor 110 through the external memory interface 120 to implement the data storage function, such as saving music, video and other files in the external memory card.
  • Internal memory 121 may be used to store computer executable program code, which includes instructions.
  • the processor 110 executes instructions stored in the internal memory 121 to execute various functional applications and data processing of the electronic device.
  • the internal memory 121 may include a program storage area and a data storage area.
  • the stored program area can store an operating system, at least one application program required for a function (such as a sound playback function, an image video playback function, etc.), etc.
  • the storage data area can store data created during the use of electronic equipment (such as audio data, phone books, etc.).
  • the electronic device can implement audio functions through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone interface 170D, and the application processor. Such as music playback, recording, etc.
  • the sensor module 180 may include one or more sensors, which may be of the same type or different types. It can be understood that the sensor module 180 shown in FIG. 14 is only an exemplary division method, and other division methods are possible, and this application is not limited thereto.
  • the pressure sensor 180A is used to sense pressure signals and can convert the pressure signals into electrical signals.
  • pressure sensor 180A may be disposed on display screen 194 .
  • the electronic device detects the strength of the touch operation according to the pressure sensor 180A.
  • the electronic device may also calculate the touched position based on the detection signal of the pressure sensor 180A.
  • touch operations acting on the same touch location but with different touch operation intensities may correspond to different operation instructions.
  • the gyro sensor 180B can be used to determine the motion posture of the electronic device. In some embodiments, the angular velocity of the electronic device about three axes (ie, x, y, and z axes) may be determined by gyro sensor 180B. The gyro sensor 180B can be used for image stabilization.
  • the acceleration sensor 180E can detect the acceleration of the electronic device in various directions (generally three axes). When the electronic device is stationary, the magnitude and direction of gravity can be detected. It can also be used to identify the posture of electronic devices and be used in horizontal and vertical screen switching, pedometer and other applications.
  • Distance sensor 180F for measuring distance.
  • Electronic devices can measure distance via infrared or laser. In some embodiments, when shooting a scene, the electronic device can utilize the distance sensor 180F to measure distance to achieve fast focusing.
  • Touch sensor 180K also called “touch panel”.
  • the touch sensor 180K can be disposed on the display screen 194.
  • the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch screen”.
  • the touch sensor 180K is used to detect a touch operation on or near the touch sensor 180K.
  • the touch sensor can pass the detected touch operation to the application processor to determine the touch event type.
  • Visual output related to the touch operation may be provided through display screen 194 .
  • the touch sensor 180K can also be disposed on the surface of the electronic device at a location different from that of the display screen 194 .
  • Air pressure sensor 180C is used to measure air pressure.
  • Magnetic sensor 180D includes a Hall sensor.
  • Proximity light sensor 180G may include, for example, a light emitting diode (LED) and a light detector, such as a photodiode. Electronic devices use photodiodes to detect infrared reflected light from nearby objects.
  • the ambient light sensor 180L is used to sense ambient light brightness.
  • Fingerprint sensor 180H is used to acquire fingerprints.
  • Temperature sensor 180J is used to detect temperature.
  • Bone conduction sensor 180M can acquire vibration signals.
  • the buttons 190 include a power button, a volume button, etc.
  • Key 190 may be a mechanical key. It can also be a touch button.
  • the electronic device can receive key input and generate key signal input related to user settings and function control of the electronic device.
  • the motor 191 can generate vibration prompts.
  • the motor 191 can be used for vibration prompts for incoming calls and can also be used for touch vibration feedback.
  • the indicator 192 may be an indicator light, which may be used to indicate charging status, power changes, or may be used to indicate messages, missed calls, notifications, etc.
  • the SIM card interface 195 is used to connect a SIM card.
  • Figure 15 is a schematic diagram of the software structure of an electronic device provided by an embodiment of the present application.
  • the software framework of the electronic device involved in this application may include an application layer, an application framework layer (framework, FWK), a system library, an Android runtime, a hardware abstraction layer and a kernel layer (kernel).
  • the application layer can include a series of application packages, such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message, and other applications (also called apps).
  • cameras are used to acquire images and videos.
  • for other applications in the application layer, please refer to the introduction and description in conventional technologies; this application does not elaborate on them.
  • the application on the electronic device may be a native application (for example, an application installed on the electronic device when the operating system was installed before the device left the factory), or a third-party application (for example, an application that the user downloads and installs through an application store); this is not limited in the embodiments of this application.
  • the application framework layer provides application programming interfaces (Application Programming Interface, API) and programming frameworks for applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer can include a window manager, content provider, view system, phone manager, resource manager, notification manager, etc.
  • a window manager is used to manage window programs.
  • the window manager can obtain the display size, determine whether there is a status bar, lock the screen, capture the screen, etc.
  • Content providers are used to store and retrieve data and make this data accessible to applications.
  • Said data can include videos, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
  • the view system includes visual controls, such as controls that display text, controls that display pictures, etc.
  • a view system can be used to build applications.
  • the display interface can be composed of one or more views.
  • a display interface including a text message notification icon may include a view for displaying text and a view for displaying pictures.
  • The phone manager is used to provide communication functions of the electronic device, for example, call status management (including connected, hung up, etc.).
  • the resource manager provides various resources to applications, such as localized strings, icons, pictures, layout files, video files, etc.
  • the notification manager allows an application to display notification information in the status bar; the notification can be used to convey a notification-type message and can automatically disappear after a short stay without user interaction.
  • the notification manager is used to notify download completion, message reminders, etc.
  • the notification manager can also present notifications in the status bar at the top of the system in the form of graphs or scrolling text, such as notifications for applications running in the background, or present notifications on the screen in the form of a dialog interface; for example, text information is prompted in the status bar, a beep sounds, the electronic device vibrates, or the indicator light flashes.
  • The runtime includes core libraries and a virtual machine. The runtime is responsible for the scheduling and management of the system.
  • the core library consists of two parts: one part is the functions that the programming language (for example, the Java language) needs to call, and the other part is the core library of the system.
  • the application layer and the application framework layer run in the virtual machine.
  • the virtual machine executes the program files (for example, Java files) of the application layer and the application framework layer by converting them into binary files.
  • the virtual machine is used to perform object life cycle management, stack management, thread management, security and exception management, and garbage collection and other functions.
  • System libraries can include multiple functional modules. For example: Surface Manager (Surface Manager), Media Libraries (Media Libraries), 3D graphics processing library (for example: OpenGL ES), 2D graphics engine (for example: SGL), etc.
  • the surface manager is used to manage the display subsystem and provides the fusion of two-dimensional (2-Dimensional, 2D) and three-dimensional (3-Dimensional, 3D) layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as static image files, etc.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, composition, and layer processing.
  • 2D Graphics Engine is a drawing engine for 2D drawing.
  • the hardware abstraction layer is the interface layer between the operating system kernel and upper-layer software. Its purpose is to abstract the hardware.
  • the hardware abstraction layer is an abstract interface to the device kernel drivers, and it provides the higher-level Java API framework with application programming interfaces for accessing the underlying devices.
  • HAL contains multiple library modules, such as camera HAL, display, Bluetooth, audio, etc. Each of these library modules implements an interface for a specific type of hardware component.
  • when an upper-layer API needs to access device hardware, the Android operating system loads the library module for that hardware component.
  • the kernel layer is the foundation of the Android operating system, and the final functions of the Android operating system are completed through the kernel layer.
  • the kernel layer at least includes display driver, camera driver, audio driver, sensor driver, and virtual card driver.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Human Computer Interaction (AREA)
  • General Health & Medical Sciences (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Engineering & Computer Science (AREA)
  • Ophthalmology & Optometry (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to a gaze point estimation method and a related device. According to the method, an electronic device can acquire an image by means of a camera, ensures the input of a simple sample having an adaptive distance scale by means of adaptive zooming, and obtains face position information and eye position information in the acquired image when a face detection result meets a preset face condition. On the basis of a region of interest (ROI) pooling module in a gaze point estimation network model, the electronic device can process an ROI of a target image block by using a corresponding preset feature map size, so as to obtain a feature map. The target image block is obtained by cropping the acquired image. The target image block comprises a face image block and/or a left-eye image block and/or a right-eye image block. Different types of image blocks respectively correspond to preset feature map sizes. The method can unify the feature map size by means of the ROI pooling module, thereby avoiding deformation of the target image block after scaling and improving the accuracy of gaze point estimation.
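To make the ROI pooling idea in the abstract concrete, the following is a minimal sketch only. It assumes a generic convolutional feature map, hypothetical preset output sizes, and torchvision's roi_pool operator as a stand-in for the ROI pooling module described above; it is not the patented gaze point estimation network, and all names, sizes, and box values are illustrative assumptions.

```python
# A minimal sketch (not the patented model): pool the face / left-eye / right-eye
# ROIs to block-type-specific preset feature-map sizes, so crops with different
# aspect ratios do not have to be warped by direct resizing.
import torch
from torchvision.ops import roi_pool

# Hypothetical preset feature-map sizes, one per image-block type.
PRESET_SIZES = {"face": (7, 7), "left_eye": (3, 5), "right_eye": (3, 5)}

def pool_image_blocks(feature_map: torch.Tensor,
                      boxes: dict[str, torch.Tensor],
                      spatial_scale: float) -> dict[str, torch.Tensor]:
    """feature_map: [1, C, H, W] backbone output for one camera frame.
    boxes: one [1, 4] box (x1, y1, x2, y2) per block type, in input-image pixels.
    spatial_scale: feature-map resolution divided by input-image resolution."""
    pooled = {}
    for block_type, box in boxes.items():
        # roi_pool takes a list of [L, 4] box tensors (one entry per batch image)
        # and returns a tensor whose spatial size is fixed by output_size.
        pooled[block_type] = roi_pool(
            feature_map, [box],
            output_size=PRESET_SIZES[block_type],
            spatial_scale=spatial_scale,
        )  # -> [1, C, h, w] with (h, w) preset for this block type
    return pooled

if __name__ == "__main__":
    # Toy data: a 1/8-resolution feature map of a 640x480 camera frame.
    feats = torch.randn(1, 64, 60, 80)
    boxes = {
        "face": torch.tensor([[200.0, 80.0, 440.0, 360.0]]),
        "left_eye": torch.tensor([[250.0, 150.0, 310.0, 190.0]]),
        "right_eye": torch.tensor([[330.0, 150.0, 390.0, 190.0]]),
    }
    for name, fmap in pool_image_blocks(feats, boxes, spatial_scale=1 / 8).items():
        print(name, tuple(fmap.shape))  # e.g. face -> (1, 64, 7, 7)
```

Because the pooled output size is fixed per image-block type, face and eye crops of different aspect ratios feed fixed-size feature maps to the downstream branches without being warped by direct resizing, which is the deformation-avoidance effect the abstract describes.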
PCT/CN2023/092415 2022-07-29 2023-05-06 Gaze point estimation method and related device WO2024021742A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210910894.2 2022-07-29
CN202210910894.2A CN116048244B (zh) 2022-07-29 2022-07-29 一种注视点估计方法及相关设备

Publications (2)

Publication Number Publication Date
WO2024021742A1 true WO2024021742A1 (fr) 2024-02-01
WO2024021742A9 WO2024021742A9 (fr) 2024-05-16

Family

ID=86127878

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/092415 WO2024021742A1 (fr) 2022-07-29 2023-05-06 Procédé d'estimation de point de fixation et dispositif associé

Country Status (2)

Country Link
CN (1) CN116048244B (fr)
WO (1) WO2024021742A1 (fr)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116048244B (zh) * 2022-07-29 2023-10-20 荣耀终端有限公司 Gaze point estimation method and related device
CN117576298B (zh) * 2023-10-09 2024-05-24 中微智创(北京)软件技术有限公司 Battlefield situation target highlighting method based on a context-separation 3D lens
CN117472256A (zh) * 2023-12-26 2024-01-30 荣耀终端有限公司 Image processing method and electronic device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111723596A (zh) * 2019-03-18 2020-09-29 北京市商汤科技开发有限公司 Gaze area detection and neural network training method, apparatus and device
CN112000226A (zh) * 2020-08-26 2020-11-27 杭州海康威视数字技术股份有限公司 Human eye gaze estimation method and apparatus, and gaze estimation system
CN113642393A (zh) * 2021-07-07 2021-11-12 重庆邮电大学 Multi-feature fusion gaze estimation method based on an attention mechanism
US20220236797A1 (en) * 2021-01-22 2022-07-28 Blink O.G. Ltd. Gaze estimation systems and methods using relative points of regard
CN116048244A (zh) * 2022-07-29 2023-05-02 荣耀终端有限公司 Gaze point estimation method and related device

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6377566B2 (ja) * 2015-04-21 2018-08-22 日本電信電話株式会社 Gaze measurement device, gaze measurement method, and program
CN109492514A (zh) * 2018-08-28 2019-03-19 初速度(苏州)科技有限公司 Method and system for acquiring the human eye gaze direction with a single camera
CN112329699A (zh) * 2020-11-19 2021-02-05 北京中科虹星科技有限公司 Human eye gaze point positioning method with pixel-level accuracy

Also Published As

Publication number Publication date
CN116048244A (zh) 2023-05-02
WO2024021742A9 (fr) 2024-05-16
CN116048244B (zh) 2023-10-20

Similar Documents

Publication Publication Date Title
WO2024021742A1 (fr) Gaze point estimation method and related device
CN111738122B (zh) Image processing method and related apparatus
WO2021078001A1 (fr) Image enhancement method and apparatus
CN111782879B (zh) Model training method and apparatus
CN110570460B (zh) Target tracking method and apparatus, computer device, and computer-readable storage medium
WO2021027585A1 (fr) Human face image processing method and electronic device
CN112749613B (zh) Video data processing method and apparatus, computer device, and storage medium
CN112036331A (zh) Liveness detection model training method and apparatus, device, and storage medium
CN111882642B (zh) Texture filling method and apparatus for a three-dimensional model
WO2024007715A1 (fr) Photographing method and related device
CN111612723B (zh) Image restoration method and apparatus
WO2021180046A1 (fr) Image color retention method and device
US20230162529A1 (en) Eye bag detection method and apparatus
CN116916151B (zh) Photographing method, electronic device, and storage medium
CN113642359B (zh) Face image generation method and apparatus, electronic device, and storage medium
CN113538227B (zh) Image processing method based on semantic segmentation and related device
CN117132515A (zh) Image processing method and electronic device
WO2022143314A1 (fr) Object registration method and apparatus
CN114697530B (zh) Photographing method and apparatus with intelligent framing recommendation
CN112528760B (zh) Image processing method and apparatus, computer device, and medium
CN114399622A (zh) Image processing method and related apparatus
CN114693538A (zh) Image processing method and apparatus
CN115880348B (zh) Face depth determination method, electronic device, and storage medium
CN116091572B (zh) Method for obtaining image depth information, electronic device, and storage medium
CN116109828B (zh) Image processing method and electronic device

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23844957

Country of ref document: EP

Kind code of ref document: A1