WO2024021742A9 - Fixation point estimation method and related device - Google Patents


Info

Publication number
WO2024021742A9
Authority
WO
WIPO (PCT)
Prior art keywords
image
image block
face
size
feature map
Prior art date
Application number
PCT/CN2023/092415
Other languages
French (fr)
Chinese (zh)
Other versions
WO2024021742A1 (en)
Inventor
孙贻宝
Original Assignee
荣耀终端有限公司 (Honor Device Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by Honor Device Co., Ltd. (荣耀终端有限公司)
Publication of WO2024021742A1 publication Critical patent/WO2024021742A1/en
Publication of WO2024021742A9 publication Critical patent/WO2024021742A9/en

Classifications

    • G06F 3/013 — Eye tracking input arrangements (input arrangements for interaction between user and computer, e.g. for user immersion in virtual reality)
    • G06V 10/12 — Details of image acquisition arrangements; constructional details thereof
    • G06V 10/25 — Image preprocessing: determination of region of interest [ROI] or a volume of interest [VOI]
    • G06V 10/26 — Image preprocessing: segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V 10/44 — Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 40/166 — Human faces: detection; localisation; normalisation using acquisition arrangements
    • G06V 40/171 — Human faces: local features and components; facial parts; occluding parts, e.g. glasses; geometrical relationships
    • G06V 40/19 — Eye characteristics, e.g. of the iris: sensors therefor
    • G06V 40/193 — Eye characteristics: preprocessing; feature extraction
    • Y02D 10/00 — Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • the present application relates to fields such as deep learning and big data processing, and in particular to a gaze point estimation method and related equipment.
  • Gaze point estimation generally refers to taking an image as input, computing the gaze direction from eye and head features, and mapping that direction to a gaze point. Gaze point estimation is mainly used in human-computer interaction and visual display on smartphones, tablets, smart screens, and AR/VR glasses.
  • gaze point estimation methods can be divided into two categories: geometry-based methods and appearance-based methods.
  • the basic idea of estimating the gaze point coordinates through geometry-based methods is to restore the three-dimensional line of sight direction through some two-dimensional information (such as eye features such as the corner of the eye).
  • the basic idea of estimating the gaze point coordinates through appearance-based methods is to learn a model that maps the input image to the gaze point. Both methods have their own advantages and disadvantages.
  • the geometry-based method is relatively more accurate, but it places high demands on image quality and resolution and requires additional hardware (for example, infrared sensors and multiple cameras) for support, which may lead to high power consumption; the appearance-based method is relatively less accurate.
  • the appearance-based method requires a large amount of training data; moreover, because the distance between the camera and the subject is not fixed, the depth information of the input image may also vary.
  • as a result, the sizes of the face images obtained from different input images may differ considerably and fail to meet the model's input requirements. Scaling the input image can satisfy those requirements, but risks deforming features, which reduces the accuracy of the gaze point estimation.
  • the present application provides a gaze point estimation method and related equipment.
  • an electronic device can collect images through a camera and, when the face detection result meets the preset face condition, obtain face position information and eye position information in the collected image. Based on the face position information and the eye position information, the electronic device can determine the gaze point coordinates of the target object through a gaze point estimation network model. It can be understood that the shooting subject in the image collected through the camera is the target object, where the shooting subject refers to the main object photographed when the user shoots with the electronic device.
  • the electronic device can process the ROI of the target image block with the corresponding preset feature map size based on the region of interest pooling module therein to obtain a feature map.
  • the target image block is obtained by cropping the collected image.
  • the target image block includes at least one type of image block among a face image block, a left eye image block, and a right eye image block. Different types of image blocks each correspond to a preset feature map size.
  • the above method can unify the sizes of the feature maps through the region of interest pooling module, avoid the deformation that scaling the target image block would cause, and improve the accuracy of the gaze point estimation.
  • the present application provides a method for estimating a gaze point.
  • the method can be applied to an electronic device provided with a camera.
  • the method can include: the electronic device can collect a first image through a camera; when the face detection result meets the preset face condition, the electronic device can obtain the face position information and the eye position information in the first image.
  • the electronic device can process the region of interest ROI of the target image block with the corresponding preset feature map size based on the region of interest pooling module of the gaze point estimation network model to obtain a feature map.
  • the target image block includes at least one type of image block among a face image block, a left eye image block, and a right eye image block.
  • the face position information includes the coordinates of the relevant feature points of the face area
  • the eye position information includes the coordinates of the relevant feature points of the eye area.
  • the face image block is an image block obtained by cropping the face area in the first image based on the face position information.
  • the left eye image block is an image block obtained by cropping the left eye area in the first image based on the eye position information.
  • the right eye image block is an image block obtained by cropping the right eye area in the first image based on the eye position information.
  • the electronic device can determine the gaze point coordinates of the target object based on the gaze point estimation network model.
  • the electronic device can process the region of interest ROI of the target image block with the corresponding preset feature map size based on the region of interest pooling module to obtain a feature map.
  • the target image block includes at least one type of image block in the face image block, the left eye image block, and the right eye image block. Different types of image blocks in the target image block each correspond to a preset feature map size.
  • the size of the feature map corresponding to the same type of image blocks is the same, while the size of the feature map corresponding to different types of image blocks can be the same or different.
  • This method can unify the size of the feature maps corresponding to the same type of image blocks through the region of interest pooling module, preparing for subsequent feature extraction, and can also avoid the feature deformation caused by adjusting the feature map size through scaling, thereby improving the accuracy of gaze point estimation. It can be understood that feature deformation may cause inaccurate feature extraction, thereby affecting the accuracy of gaze point estimation.
  • the electronic device may capture the first image through a front camera. It is understandable that the electronic device may acquire the first image in real time, and the details may refer to the relevant description in step S301 below, which will not be elaborated here.
  • the first image may be image I1.
  • the description of the facial position information and the eye position information can be referred to in the following text, which is not described in detail here.
  • the relevant feature points of the facial region can include the edge contour feature points of the face.
  • the relevant feature points of the eye region can include the corner feature points, and can also include the edge contour feature points of the eye region.
  • the description of the relevant feature points of the facial region and the relevant feature points of the eye region can be referred to in the following text, which is not described in detail here.
  • the electronic device can obtain face position information during face detection. Specifically, during face detection, the electronic device can perform feature point detection and determine feature points related to the face to obtain face position information.
  • the electronic device can complete the detection of eyes during the face detection process to obtain eye position information.
  • the eye-related feature points may include pupil coordinates.
  • the electronic device can perform eye detection to obtain eye position information.
  • eye detection can be found in the following text and will not be elaborated here.
  • the region of interest pooling module may include several region of interest pooling layers.
  • the region of interest pooling module may include a region of interest pooling layer-1 and may also include a region of interest pooling layer-2.
  • the gaze point estimation network model may unify the feature map for the same type of image blocks in the target image block and perform feature extraction on it.
  • the present application may also provide a gaze point estimation network model whose input may not include the face grid and the pupil coordinates, and which may not include fully connected layer-2 and fully connected layer-3.
  • the present application may also provide a gaze point estimation network model whose input may not include the face grid and the pupil coordinates, and which may not include fully connected layer-2, fully connected layer-5, fully connected layer-3, and fully connected layer-6.
  • the preset feature map size corresponding to the facial image block is a first preset feature map size
  • the preset feature map size corresponding to the left eye image block is a second preset feature map size
  • the preset feature map size corresponding to the right eye image block is a third preset feature map size.
  • the region of interest of the target image block is the entire target image block.
  • the ROI of the face image block is the entire face image block
  • the ROI of the left eye image block is the entire left eye image block
  • the ROI of the right eye image block is the entire right eye image block.
  • the method may further include: the electronic device may crop the facial area in the first image based on the facial position information to obtain the facial image block.
  • the method may further include: the electronic device may crop the left-eye area in the first image based on the eye position information to obtain the left-eye image block.
  • the method may further include: the electronic device may crop the right-eye area in the first image based on the eye position information to obtain the right-eye image block.
  • the electronic device processes the region of interest ROI of the target image block with the corresponding preset feature map size to obtain a feature map, which may specifically include: the electronic device may divide the ROI of the target image block based on the corresponding preset feature map size to obtain a number of block areas, and the electronic device may also perform maximum pooling processing on each block area in the ROI of the target image block to obtain a feature map.
  • the number of block areas in each row of the ROI of the target image block is the same as the width value in the corresponding preset feature map size
  • the number of block areas in each column of the ROI of the target image block is the same as the height value in the corresponding preset feature map size.
  • the electronic device can divide the ROI in the target image block based on the width value and height value in the corresponding preset feature map size to obtain a number of block areas, and perform maximum pooling processing on each block area to obtain a feature map of the target image block. Since the number of block areas matches the dimensions of the feature map output by the region of interest pooling layer, this method can, for image blocks of different sizes, unify the corresponding feature maps, thereby avoiding feature deformation caused by scaling, improving the accuracy of feature extraction, and thus improving the accuracy of gaze point estimation; a minimal sketch of this pooling step is given below.
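The division-and-pooling step just described can be illustrated with a short NumPy sketch. This is a minimal, non-authoritative illustration: the patent does not fix the exact bin-boundary arithmetic, so the near-equal split via linspace below is one plausible choice.

```python
# Minimal sketch of region-of-interest max pooling: divide a block of
# arbitrary size into a fixed grid of bin areas and max-pool each bin,
# so every block maps to the same preset feature-map size without any
# resizing (and hence without scaling-induced feature deformation).
import numpy as np

def roi_max_pool(block: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Max-pool a 2-D block into a fixed (out_h, out_w) feature map."""
    h, w = block.shape
    # Bin edges: rows/columns of bins cover near-equal slices of the ROI,
    # so each row holds out_w bins (the preset width value) and each
    # column out_h bins (the preset height value).
    row_edges = np.linspace(0, h, out_h + 1).astype(int)
    col_edges = np.linspace(0, w, out_w + 1).astype(int)
    out = np.empty((out_h, out_w), dtype=block.dtype)
    for i in range(out_h):
        for j in range(out_w):
            cell = block[row_edges[i]:max(row_edges[i + 1], row_edges[i] + 1),
                         col_edges[j]:max(col_edges[j + 1], col_edges[j] + 1)]
            out[i, j] = cell.max()
    return out

# Face blocks of different sizes both pool to the same preset size.
small_face = np.random.rand(60, 48)
large_face = np.random.rand(200, 160)
assert roi_max_pool(small_face, 12, 12).shape == roi_max_pool(large_face, 12, 12).shape
```

Because the bin grid, not the input size, determines the output dimensions, blocks of different sizes yield identically shaped feature maps without resizing.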
  • the ROI of the face image block in the target image block can be the face area in the face image block.
  • the ROI of the left eye image block in the target image block can be the left eye area in the left eye image block.
  • the ROI of the right eye image block in the target image block can be the right eye area in the right eye image block.
  • the electronic device performs maximum pooling processing on each block area in the ROI of the target image block to obtain a feature map, which may specifically include: the electronic device may perform maximum pooling processing on each face block area in the ROI of the face image block to obtain a first feature map, may perform maximum pooling processing on each left eye block area in the ROI of the left eye image block to obtain a second feature map, and may also perform maximum pooling processing on each right eye block area in the ROI of the right eye image block to obtain a third feature map.
  • the first feature map is a feature map corresponding to the ROI of the face image block
  • the second feature map is a feature map corresponding to the ROI of the left eye image block
  • the third feature map is a feature map corresponding to the ROI of the right eye image block.
  • the number of block areas in each row of the ROI of the target image block is the same as the width value in the corresponding preset feature map size, and the number of block areas in each column of the ROI of the target image block is the same as the height value in the corresponding preset feature map size, specifically including: the number of face block areas in each row of the ROI of the face image block is the same as the width value in the first preset feature map size, and the number of face block areas in each column of the ROI of the face image block is the same as the height value in the first preset feature map size; the number of left eye block areas in each row of the ROI of the left eye image block is the same as the width value in the second preset feature map size, and the number of left eye block areas in each column of the ROI of the left eye image block is the same as the height value in the second preset feature map size; the number of right eye block areas in each row of the ROI of the right eye image block is the same as the width value in the third preset feature map size, and the number of right eye block areas in each column of the ROI of the right eye image block is the same as the height value in the third preset feature map size.
  • the target image block may include a facial image block, a left eye image block, and a right eye image block.
  • the electronic device can unify the sizes of the feature maps corresponding to the facial image block, the left eye image block, and the right eye image block based on the gaze point estimation network model, and extract features based on the feature maps corresponding to the facial image block, the left eye image block, and the right eye image block. It is understandable that this method can unify the feature maps corresponding to the image blocks, thereby avoiding feature deformation caused by scaling, improving the accuracy of feature extraction, and thus improving the accuracy of gaze point estimation.
  • the second preset feature map size may be the same as the third preset feature map size.
  • the first preset feature map size may be the same as the second preset feature map size.
  • the first preset feature map size may be the same as the third preset feature map size.
  • the processing of the target image block performed by the electronic device based on the region of interest pooling module can be referred to above and will not be repeated here.
  • the subject of the first image is the target object.
  • the method may further include: when the face detection result meets the preset face condition, the electronic device may obtain the pupil coordinates in the first image; the electronic device may determine the position and size of the face area in the first image in the first image based on the face position information, and obtain the face grid corresponding to the first image.
  • the face grid is used to characterize the distance between the target object and the camera.
  • the method may further include: the electronic device may perform convolution processing on the feature map based on the convolution module of the gaze point estimation network model to extract eye features and/or facial features; the electronic device may also integrate the eye features and/or facial features, the face grid and the pupil coordinates based on the fusion module of the gaze point estimation network model to obtain the gaze point coordinates of the target object.
  • the electronic device can perform gaze point estimation based on more types of features (for example, facial features, eye features, depth information, pupil position, etc.), that is, the gaze point estimation can be performed based on more comprehensive feature information, which can improve the accuracy of gaze point estimation.
  • features for example, facial features, eye features, depth information, pupil position, etc.
  • the face grid can represent the position and size of the face in the image, and can reflect the depth information of the target object in the image, that is, the distance between the target object and the camera that captures the image.
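As a hedged sketch of the face grid idea: a coarse binary mask over the full frame marks which cells the face bounding box covers, so a nearer (larger) face lights up more cells than a farther (smaller) one. The 25x25 resolution is an assumption borrowed from common gaze-estimation practice; the patent only states that the grid encodes the position and size of the face area.

```python
# Build a coarse binary face grid from a face bounding box. The cell
# count (25x25) and the hard 0/1 fill are illustrative assumptions.
import numpy as np

def make_face_grid(img_w, img_h, face_x, face_y, face_w, face_h, grid=25):
    g = np.zeros((grid, grid), dtype=np.float32)
    # Map the face rectangle from pixel coordinates to grid cells.
    x0 = int(face_x / img_w * grid)
    y0 = int(face_y / img_h * grid)
    x1 = int(np.ceil((face_x + face_w) / img_w * grid))
    y1 = int(np.ceil((face_y + face_h) / img_h * grid))
    g[max(y0, 0):min(y1, grid), max(x0, 0):min(x1, grid)] = 1.0
    return g  # a nearer face covers more cells, a farther face fewer

# Example: a 640x480 frame with a face box at (200, 120) of size 180x220.
print(make_face_grid(640, 480, 200, 120, 180, 220).sum())  # covered cells
```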
  • the human face in the first image mentioned in the present application is the face of the target object in the first image.
  • the electronic device may input the facial image block, the left eye image block, the right eye image block, the facial grid and the pupil coordinates into the gaze point estimation network model, and output the gaze point coordinates.
  • the gaze point estimation network model may include a region of interest pooling module, a convolution module and a fusion module.
  • the region of interest pooling module can be used to: process the region of interest ROI of the facial image block with a first preset feature map size to obtain a first feature map.
  • the region of interest pooling module can also be used to: process the ROI of the left eye image block with a second preset feature map size to obtain a second feature map, and process the ROI of the right eye image block with a third preset feature map size to obtain a third feature map.
  • the convolution module can be used to: perform convolution processing on the first feature map, the second feature map and the third feature map respectively, and extract facial features and eye features.
  • the fusion module can be used to: integrate facial features, eye features, facial grids and pupil coordinates to obtain the gaze point coordinates of the target object.
  • the size of the first feature map is the same as the size of the first preset feature map
  • the size of the second feature map is the same as the size of the second preset feature map
  • the size of the third feature map is the same as the size of the third preset feature map.
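Because the ROI of each input block is stated above to be the entire block, the ROI pooling step for whole-block ROIs behaves like adaptive max pooling to a preset output size. The PyTorch sketch below uses that equivalence to show the three-branch shape of the model; every channel count, preset feature-map size, layer depth, grid resolution, and the pupil-coordinate layout is an illustrative assumption, not a hyperparameter from the patent.

```python
# Sketch of the pooling -> convolution -> fusion pipeline described above.
import torch
import torch.nn as nn

class GazeNetSketch(nn.Module):
    def __init__(self, face_size=(12, 12), eye_size=(6, 6), grid=25):
        super().__init__()
        # ROI pooling over a whole-block ROI == adaptive max pooling.
        self.face_pool = nn.AdaptiveMaxPool2d(face_size)
        self.eye_pool = nn.AdaptiveMaxPool2d(eye_size)
        self.face_conv = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.eye_conv = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        fused = (16 * face_size[0] * face_size[1]
                 + 2 * 16 * eye_size[0] * eye_size[1]
                 + grid * grid + 4)  # + (x, y) pupil coordinates, both eyes
        self.fusion = nn.Sequential(nn.Linear(fused, 128), nn.ReLU(), nn.Linear(128, 2))

    def forward(self, face, left_eye, right_eye, face_grid, pupils):
        f = self.face_conv(self.face_pool(face)).flatten(1)
        l = self.eye_conv(self.eye_pool(left_eye)).flatten(1)
        r = self.eye_conv(self.eye_pool(right_eye)).flatten(1)
        x = torch.cat([f, l, r, face_grid.flatten(1), pupils.flatten(1)], dim=1)
        return self.fusion(x)  # estimated gaze point coordinates (x, y)

# Image blocks of different sizes are accepted without resizing.
net = GazeNetSketch()
out = net(torch.rand(1, 3, 180, 160), torch.rand(1, 3, 40, 60),
          torch.rand(1, 3, 44, 58), torch.rand(1, 25, 25), torch.rand(1, 2, 2))
print(out.shape)  # torch.Size([1, 2])
```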
  • the face detection result satisfies a preset face condition, specifically including: a face is detected in the first image.
  • the electronic device can obtain facial position information and eye position information when a human face is detected in the first image.
  • the face detection result satisfies the preset face condition, which may specifically include: a face is detected in the first image, and the size of the face area in the first image meets the preset size requirement.
  • the method may also include: when a face is detected in the first image, and the size of the face area in the first image does not meet the preset size requirement, the electronic device may perform adaptive zoom, and re-capture the image based on the focal length after the adaptive zoom.
  • when the first image includes a face and the size of the face area in the first image does not meet the preset size requirement, the electronic device can perform adaptive zoom and re-capture the image based on the focal length after adaptive zoom, so that the size of the face in subsequently captured images meets expectations. In this way, the electronic device can capture images containing faces of appropriate size: image details are not lost and subsequent feature extraction is not hindered by a face that is too small, nor is image information lost by a face that is too large. In other words, through the above method, the features extracted by the electronic device are relatively accurate, so the accuracy of the gaze point estimation is also improved.
  • the size of the face region in the first image meets a preset size requirement, specifically including: the area of the face region in the first image is within a preset area range.
  • the size of the facial area in the first image meets a preset size requirement, specifically including: the height of the facial area in the first image is within a preset height range, and the width of the facial area in the first image is within a preset width range.
  • through adaptive zoom, the electronic device can keep the input at a distance scale that yields easy-to-process samples; in other words, the electronic device can capture images at a moderate shooting distance through adaptive zoom. A sketch of the size check that gates this zoom follows.
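The sketch below shows the preset size requirement in both variants described above (by area, and by separate width/height ranges); the numeric thresholds are placeholders rather than values from the patent.

```python
# Check whether a detected face region meets a preset size requirement.
# All thresholds below are illustrative assumptions.
def face_size_ok_by_area(w, h, area_range=(96 * 96, 400 * 400)):
    return area_range[0] <= w * h <= area_range[1]

def face_size_ok_by_sides(w, h, w_range=(96, 400), h_range=(96, 400)):
    return w_range[0] <= w <= w_range[1] and h_range[0] <= h <= h_range[1]

# If the check fails, the device would adaptively zoom and recapture.
print(face_size_ok_by_area(180, 220), face_size_ok_by_sides(60, 80))
```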
  • the electronic device crops the face area in the first image based on the face position information, which may specifically include: the electronic device may determine the relevant feature points of the face area in the first image; the electronic device may determine the first circumscribed rectangle; the electronic device may also crop the first image based on the position of the first circumscribed rectangle in the first image.
  • the first circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the face area in the first image
  • the face image block is at the same position as the first circumscribed rectangle in the first image
  • the face image block is the same size as the first circumscribed rectangle.
  • the electronic device crops the left eye area in the first image based on the eye position information, which may specifically include: the electronic device may determine the relevant feature points of the left eye area in the first image; the electronic device may determine the second circumscribed rectangle, and crop the first image based on the position of the second circumscribed rectangle in the first image.
  • the second circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the left eye area in the first image
  • the left eye image block is at the same position as the second circumscribed rectangle in the first image
  • the left eye image block is the same size as the second circumscribed rectangle.
  • the electronic device crops the right eye region in the first image based on the eye position information, which may specifically include: the electronic device may determine the relevant feature points of the right eye region in the first image; the electronic device may determine the third circumscribed rectangle, and based on the position of the third circumscribed rectangle in the first image, crops the first image.
  • the third circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the right eye region in the first image, the right eye image block and the third circumscribed rectangle are at the same position in the first image, and the right eye image block and the third circumscribed rectangle are the same size.
  • the electronic device can obtain the facial image block, the left eye image block and the right eye image block based on the circumscribed rectangle of the relevant feature points of the facial area, the circumscribed rectangle of the relevant feature points of the left eye area and the circumscribed rectangle of the relevant feature points of the right eye area, respectively.
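A minimal sketch of the circumscribed-rectangle cropping just described: the image block is the axis-aligned bounding rectangle of the region's feature points. The landmark coordinates below are illustrative.

```python
# Crop an image to the circumscribed (bounding) rectangle of a set of
# feature points, as done for the face, left eye, and right eye blocks.
import numpy as np

def crop_by_landmarks(image: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Crop `image` (H x W x C) to the bounding rectangle of `points` (N x 2, x/y)."""
    x0, y0 = points.min(axis=0)
    x1, y1 = points.max(axis=0)
    return image[int(y0):int(np.ceil(y1)), int(x0):int(np.ceil(x1))]

frame = np.zeros((480, 640, 3), dtype=np.uint8)
left_eye_points = np.array([[250, 200], [290, 195], [300, 210], [255, 215]])
print(crop_by_landmarks(frame, left_eye_points).shape)  # (20, 50, 3)
```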
  • cropping the face area in the first image based on the face position information to obtain the face image block may specifically include: the electronic device may determine the face area in the first image based on the face position information; the electronic device may crop the first image with the face area as the center of the first cropping frame to obtain the face image block.
  • the size of the first cropping frame is the first preset cropping size.
  • the face image block is the same size as the first cropping frame.
  • Cropping the left eye area and the right eye area in the first image based on the eye position information to obtain the left eye image block and the right eye image block may specifically include: the electronic device determines the left eye area in the first image and the right eye area in the first image based on the eye position information; the electronic device may crop the first image with the left eye area as the center of the second cropping frame to obtain the left eye image block, and may also crop the first image with the right eye area as the center of the third cropping frame to obtain the right eye image block.
  • the size of the second cropping frame is the second preset cropping size.
  • the left eye image block is the same size as the second cropping frame.
  • the size of the third cropping frame is the third preset cropping size.
  • the right eye image block has the same size as the third cropping frame.
  • the electronic device can crop the first image based on the face position information and the preset face cropping size to obtain the face image block.
  • the electronic device can also crop the first image based on the eye position information and the preset eye cropping size to obtain the left eye image block and the right eye image block.
  • the first preset cropping size is a preset face cropping size.
  • the second preset cropping size and the third preset cropping size are preset eye cropping sizes.
  • the second preset cropping size and the third preset cropping size may be the same.
  • the preset eye cropping size may include a preset left eye cropping size and a preset right eye cropping size.
  • the second preset cropping size may be a preset left eye cropping size.
  • the third preset cropping size may be a preset right eye cropping size.
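The fixed-size cropping variant can be sketched as follows: the crop frame has a preset size and is centered on the detected region. Clamping the frame to the image borders is an assumption; the patent does not specify boundary handling.

```python
# Crop a preset-size frame centered on a region (face or eye) center.
import numpy as np

def center_crop(image: np.ndarray, cx: int, cy: int, crop_w: int, crop_h: int) -> np.ndarray:
    h, w = image.shape[:2]
    # Clamp the frame so it stays inside the image (assumed behavior).
    x0 = min(max(cx - crop_w // 2, 0), w - crop_w)
    y0 = min(max(cy - crop_h // 2, 0), h - crop_h)
    return image[y0:y0 + crop_h, x0:x0 + crop_w]

frame = np.zeros((480, 640, 3), dtype=np.uint8)
left_eye_block = center_crop(frame, cx=270, cy=205, crop_w=96, crop_h=64)
print(left_eye_block.shape)  # (64, 96, 3)
```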
  • the gaze point estimation network model may further include several activation layers.
  • the region of interest pooling module may include several region of interest pooling layers.
  • the convolution module may include several convolution layers.
  • the fusion module includes several fully connected layers.
  • the gaze point estimation network model may include several region of interest pooling layers, several convolution layers, and may also include several activation layers.
  • the gaze point estimation network model may include several region of interest pooling layers, several convolutional layers and several pooling layers.
  • the gaze point estimation network model may also include several activation layers.
  • the present application provides an electronic device.
  • the electronic device may include a display screen, a camera, a memory, and one or more processors.
  • the memory is used to store computer programs.
  • the camera may be used to: collect a first image.
  • the processor may be used to: obtain face position information and eye position information in the first image when the face detection result meets the preset face condition; in the process of processing the target image block based on the gaze point estimation network model, the region of interest pooling module based on the gaze point estimation network model processes the region of interest ROI of the target image block with the corresponding preset feature map size to obtain a feature map.
  • the face position information includes the coordinates of the relevant feature points of the face area
  • the eye position information includes the coordinates of the relevant feature points of the eye area.
  • the target image block includes at least one type of image block among the face image block, the left eye image block, and the right eye image block. Different types of image blocks each correspond to a preset feature map size.
  • the face image block is an image block obtained by cropping the face area in the first image based on the face position information
  • the left eye image block is an image block obtained by cropping the left eye area in the first image based on the eye position information
  • the right eye image block is an image block obtained by cropping the right eye area in the first image based on the eye position information.
  • the processor when used to process the region of interest ROI of the target image block with the corresponding preset feature map size to obtain a feature map, can be specifically used to: divide the ROI of the target image block based on the corresponding preset feature map size to obtain a plurality of block regions; perform maximum pooling processing on each block region in the ROI of the target image block to obtain a feature map.
  • the number of each row of block regions in the ROI of the target image block is the same as the width value in the corresponding preset feature map size, and the number of each column of block regions in the ROI of the target image block is the same as the height value in the corresponding preset feature map size.
  • the processor when used to divide the ROI of the target image block based on the corresponding preset feature map size to obtain a number of block areas, can be specifically used to: determine the ROI of the facial image block, and divide the ROI of the facial image block based on the first preset feature map size to obtain a number of facial block areas; determine the ROI of the left eye image block, and divide the ROI of the left eye image block based on the second preset feature map size to obtain a number of left eye block areas; determine the ROI of the right eye image block, and divide the ROI of the right eye image block based on the third preset feature map size to obtain a number of right eye block areas.
  • the processor when used to perform maximum pooling processing on each block area in the ROI of the target image block to obtain a feature map, can be specifically used to: perform maximum pooling processing on each face block area in the ROI of the face image block to obtain a first feature map; perform maximum pooling processing on each left eye block area in the ROI of the left eye image block to obtain a second feature map; perform maximum pooling processing on each right eye block area in the ROI of the right eye image block to obtain a third feature map.
  • the first feature map is a feature map corresponding to the ROI of the face image block
  • the second feature map is a feature map corresponding to the ROI of the left eye image block
  • the third feature map is a feature map corresponding to the ROI of the right eye image block.
  • the number of block areas in each row of the ROI of the target image block is the same as the width value in the corresponding preset feature map size, and the number of block areas in each column of the ROI of the target image block is the same as the height value in the corresponding preset feature map size.
  • the number of face block areas in each row of the ROI of the facial image block is the same as the width value in the first preset feature map size, and the number of face block areas in each column of the ROI of the facial image block is the same as the height value in the first preset feature map size;
  • the number of left eye block areas in each row of the ROI of the left eye image block is the same as the width value in the second preset feature map size, and the number of left eye block areas in each column of the ROI of the left eye image block is the same as the height value in the second preset feature map size;
  • the number of right eye block areas in each row of the ROI of the right eye image block is the same as the width value in the third preset feature map size, and the number of right eye block areas in each column of the ROI of the right eye image block is the same as the height value in the third preset feature map size.
  • the subject of the first image is the target object.
  • the processor can also be used to: obtain the pupil coordinates in the first image when the face detection result meets the preset face condition; determine the position and size of the face area in the first image in the first image based on the face position information, and obtain the face grid corresponding to the first image.
  • the face grid is used to characterize the distance between the target object and the camera.
  • the processor can also be used to: perform convolution processing on the feature map based on the convolution module of the gaze point estimation network model to extract eye features and/or facial features; integrate the eye features and/or facial features, the face grid and the pupil coordinates based on the fusion module of the gaze point estimation network model to obtain the gaze point coordinates of the target object.
  • the face detection result satisfies a preset face condition, which may specifically include: a face is detected in the first image.
  • the face detection result satisfies the preset face condition, which may specifically include: a face is detected in the first image, and the size of the face area in the first image meets the preset size requirement.
  • the processor may also be used to: when a face is detected in the first image and the size of the face area in the first image does not meet the preset size requirement, perform adaptive zoom, and recapture the image based on the focal length after the adaptive zoom.
  • the processor when used to crop the face area in the first image based on the face position information, can be specifically used to: determine the relevant feature points of the face area in the first image; determine the first circumscribed rectangle; and crop the first image based on the position of the first circumscribed rectangle in the first image.
  • the first circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the face area in the first image.
  • the face image block and the first circumscribed rectangle have the same position in the first image.
  • the face image block and the first circumscribed rectangle have the same size.
  • the processor when used to crop the left eye area in the first image based on the eye position information, can be specifically used to: determine the relevant feature points of the left eye area in the first image; determine the second circumscribed rectangle; and crop the first image based on the position of the second circumscribed rectangle in the first image.
  • the second circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the left eye area in the first image.
  • the left eye image block and the second circumscribed rectangle have the same position in the first image.
  • the left eye image block and the second circumscribed rectangle have the same size.
  • the processor when used to crop the right eye region in the first image based on the eye position information, can be specifically used to: determine the relevant feature points of the right eye region in the first image; determine the third circumscribed rectangle; and crop the first image based on the position of the third circumscribed rectangle in the first image.
  • the third circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the right eye region in the first image.
  • the right eye image block and the third circumscribed rectangle have the same position in the first image.
  • the right eye image block and the third circumscribed rectangle have the same size.
  • the processor when used to crop the face area in the first image based on the face position information to obtain the face image block, can be specifically used to: determine the face area in the first image based on the face position information; crop the first image with the face area as the center of the first cropping frame to obtain the face image block.
  • the size of the first cropping frame is the first preset cropping size.
  • the face image block has the same size as the first cropping frame.
  • the processor when used to crop the left eye area and the right eye area in the first image based on the eye position information to obtain the left eye image block and the right eye image block, can be specifically used to: determine the left eye area in the first image and the right eye area in the first image based on the eye position information; crop the first image with the left eye area as the center of the second cropping frame to obtain the left eye image block, and crop the first image with the right eye area as the center of the third cropping frame to obtain the right eye image block.
  • the size of the second cropping frame is the second preset cropping size.
  • the left eye image block has the same size as the second cropping frame.
  • the size of the third cropping frame is a third preset cropping size.
  • the size of the right eye image block is the same as that of the third cropping frame.
  • the gaze point estimation network model may further include several activation layers.
  • the region of interest pooling module may include several region of interest pooling layers.
  • the convolution module may include several convolution layers.
  • the fusion module may include several fully connected layers.
  • the present application provides a computer storage medium, comprising computer instructions, which, when executed on an electronic device, enables the electronic device to execute any possible implementation of the first aspect.
  • an embodiment of the present application provides a chip, which can be applied to an electronic device.
  • the chip includes one or more processors, and the processor is used to call computer instructions to enable the electronic device to execute any possible implementation method of the above-mentioned first aspect.
  • an embodiment of the present application provides a computer program product comprising instructions, which, when executed on an electronic device, enables the electronic device to execute any possible implementation of the first aspect.
  • the electronic device provided in the second aspect, the computer storage medium provided in the third aspect, the chip provided in the fourth aspect, and the computer program product provided in the fifth aspect are all used to execute any possible implementation of the first aspect. Therefore, the beneficial effects that can be achieved can refer to the beneficial effects of any possible implementation of the first aspect, and will not be repeated here.
  • FIG. 1 is a schematic diagram of a gaze point estimation scene provided in an embodiment of the present application.
  • FIGS. 2A-2D are schematic diagrams of a set of gaze point estimation scenes provided in an embodiment of the present application.
  • FIG. 3 is a flow chart of a gaze point estimation method provided in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a cropping principle provided in an embodiment of the present application.
  • FIG. 5 is a schematic diagram of another cropping principle provided in an embodiment of the present application.
  • FIG. 6 is a schematic diagram of a face grid provided in an embodiment of the present application.
  • FIG. 7 is a schematic diagram of the architecture of a gaze point estimation network model provided in an embodiment of the present application.
  • FIG. 8 is a schematic diagram of the architecture of another gaze point estimation network model provided in an embodiment of the present application.
  • FIG. 9 is a schematic diagram of the architecture of another gaze point estimation network model provided in an embodiment of the present application.
  • FIGS. 10A and 10B are schematic diagrams of the principle of a region of interest pooling layer provided in an embodiment of the present application.
  • FIG. 11 is a schematic diagram of an ROI mapped onto a feature map provided in an embodiment of the present application.
  • FIG. 12 is a schematic diagram of the structure of a CNN-1 provided in an embodiment of the present application.
  • FIG. 13 is a schematic diagram of the structure of a CNN-3 provided in an embodiment of the present application.
  • FIG. 14 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present application.
  • FIG. 15 is a schematic diagram of the software structure of an electronic device provided in an embodiment of the present application.
  • the present application provides a method for estimating a gaze point.
  • the method for estimating a gaze point can be applied to an electronic device.
  • the electronic device can collect images through a front camera. If the collected image includes a face, the electronic device can crop the collected image based on the face position information obtained by face detection and the preset face cropping size to obtain a face image block. Similarly, the electronic device can also crop the collected image based on the eye position information obtained by eye detection and the preset eye cropping size to obtain a left eye image block and a right eye image block.
  • the electronic device can also determine a face grid based on the face position information, and determine the pupil coordinates by pupil positioning. Among them, the face grid is used to represent the position and size of the face in the entire image.
  • the face grid can reflect the distance between the face and the camera.
  • the electronic device can input the left eye image block, the right eye image block, the face image block, the face grid and the pupil coordinates into the gaze point estimation network model, and output the gaze point coordinates.
  • the gaze point estimation network model can include a region of interest pooling layer.
  • the region of interest pooling layer can be used to unify the size of the feature map to prepare for subsequent feature extraction.
  • the electronic device may determine whether the size of the face region in the captured image meets a preset size requirement. If the size of the face region does not meet the preset size requirement, the electronic device may ensure that the shooting distance is appropriate through adaptive zooming and recapture the image. If the size of the face region meets the preset size requirement, the electronic device may estimate the gaze point coordinates according to the above method.
  • the electronic device can estimate the gaze point coordinates based on the left eye image block, the right eye image block, the face image block, the face grid and the pupil coordinates, that is, a more comprehensive feature extraction is achieved.
  • the electronic device can control the size of the face area in the captured image through adaptive zoom, and unify the size of the feature map based on the pooling layer of the region of interest, which can avoid the deformation of the image block (for example, the left eye image block, the right eye image block and the face image block) after scaling, thereby improving the accuracy of the gaze point estimation.
  • the electronic device can obtain a user image and estimate the gaze point coordinates through the user image.
  • the electronic device can capture images through a front camera. If the captured image includes a face, the electronic device can crop the captured image to obtain a left eye image block, a right eye image block, and a face image block.
  • the electronic device can also determine a face grid based on the face position information obtained by face detection, and determine the pupil coordinates through pupil positioning. Among them, the face grid is used to represent the position and size of the face in the entire image. It can also be understood that the face grid can reflect the distance between the face and the camera.
  • the electronic device can input the left eye image block, the right eye image block, the face image block, the face grid, and the pupil coordinates into the gaze point estimation network model, and output the gaze point coordinates. It can be understood that the relevant description of the gaze point estimation network model can be referred to later, and will not be expanded here.
  • the electronic device may specifically be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a laptop computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA) or a dedicated camera (for example, an SLR camera or a compact camera), among other devices.
  • the electronic device may trigger a corresponding operation based on the estimated gaze point coordinates. In this case, the user may more conveniently interact with the electronic device.
  • the electronic device may display a reading interface 100.
  • the reading interface 100 displays the first page of the e-book that the user is reading.
  • the e-book has a total of 243 pages.
  • the electronic device may estimate the gaze point coordinates in real time.
  • the electronic device may estimate the gaze point coordinates based on the captured image, and determine that the gaze point coordinates are located at the end of the first page of content displayed on the reading interface 100. In this case, the electronic device may trigger page turning.
  • the electronic device may display the reading interface 200 shown in FIG2C .
  • the reading interface 200 displays the second page of the e-book that the user is reading.
  • the electronic device may continue to estimate the gaze point coordinates in real time.
  • the real-time estimation of gaze point coordinates mentioned in the present application may include: the electronic device may capture a frame of image at regular intervals (for example, 10 ms) and estimate the gaze point coordinates based on the image.
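A minimal sketch of this periodic capture-and-estimate loop, using OpenCV for capture; `estimate_gaze` is a hypothetical stand-in for the gaze point estimation pipeline described in this document, and the frame cap exists only to keep the sketch finite.

```python
# Capture a frame every `period_ms` milliseconds and estimate the gaze
# point on each captured frame.
import time
import cv2

def run_realtime(estimate_gaze, period_ms=10, max_frames=1000):
    cap = cv2.VideoCapture(0)  # index 0: typically the default (front) camera
    try:
        for _ in range(max_frames):
            ok, frame = cap.read()
            if ok:
                gaze_xy = estimate_gaze(frame)  # hypothetical pipeline entry point
                print(gaze_xy)
            time.sleep(period_ms / 1000.0)
    finally:
        cap.release()
```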
  • the electronic device when a user browses information using an electronic device, the electronic device can collect the user's preference information based on the estimated gaze point coordinates, thereby providing services to the user more intelligently based on the collected user's preference information. For example, when a user browses information using an electronic device, the electronic device may recommend some content (e.g., videos, articles, etc.). In this case, the electronic device can estimate the user's gaze point coordinates to determine the recommended content that the user is interested in. In the subsequent process, the electronic device can recommend content related to the recommended content of interest to the user.
  • the electronic device may display a user interface 300.
  • the user interface 300 may include several videos or text information.
  • the user interface 300 may include recommended content 1, recommended content 2, recommended content 3, recommended content 4, and recommended content 5.
  • the electronic device may capture images in real time and estimate the gaze point coordinates based on the captured images.
  • the electronic device may also count the distribution of the gaze point coordinates during the process of the electronic device displaying the user interface 300, thereby determining the recommended content that the user is interested in.
  • if the user is determined to be interested in recommended content 2 in the user interface 300, the electronic device may intelligently recommend content related to recommended content 2 to the user, sparing the user the time spent filtering out uninteresting content and providing services more intelligently.
  • a gaze point estimation method provided by the present application is introduced below.
  • Figure 3 is a flow chart of a method for estimating a gaze point provided in an embodiment of the present application.
  • the method for estimating a gaze point may include but is not limited to the following steps:
  • S301 The electronic device acquires an image I1.
  • the electronic device obtains the image I1 through the front camera of the electronic device.
  • the electronic device receives the image I1 acquired by other cameras.
  • the electronic device can acquire images in real time. That is, the image I1 is an image acquired by the electronic device in real time.
  • the electronic device can acquire a frame of image every time T.
  • the time T mentioned in the present application can be set according to actual needs. Exemplarily, the time T can be 1 millisecond (ms).
  • S302 The electronic device performs face detection on the image I1 to determine whether the image I1 includes a face.
  • the electronic device can perform face detection on the image I1 to determine whether the image I1 includes a face. If it is detected that the image I1 includes a face, the electronic device can continue to perform subsequent steps. If it is detected that the image I1 does not include a face, the electronic device can discard the image I1 and reacquire the image.
  • face detection refers to determining whether a face is present in a dynamic scene or against a complex background, and separating it out. In other words, using the search strategy involved in face detection, any given image can be searched to determine whether it contains a face.
  • the electronic device may determine the degree of match (i.e., correlation) between the input image and one or more preset standard face templates, and then determine whether a face is present in the image based on that degree of match. For example, the electronic device may compare the degree of match with a preset threshold: if the degree of match is greater than the threshold, the electronic device determines that a face is present in the image; otherwise, it determines that no face is present.
  • the electronic device when determining the degree of match between an input image and one or more pre-set standard face templates, may specifically calculate the degree of match between the input image and the facial contour, nose, eyes, mouth and other parts in the standard face template.
  • the electronic device may include a template library, and the standard face templates may be stored in the template library.
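As a hedged illustration of this template-matching check, the sketch below uses OpenCV's matchTemplate; the matching method, the threshold, and the random placeholder images are all assumptions for demonstration.

```python
# Decide whether an image matches a standard face template by normalized
# cross-correlation: scores near 1.0 indicate a strong match.
import cv2
import numpy as np

def matches_face_template(gray_image: np.ndarray, template: np.ndarray,
                          threshold: float = 0.6) -> bool:
    scores = cv2.matchTemplate(gray_image, template, cv2.TM_CCOEFF_NORMED)
    _, best, _, _ = cv2.minMaxLoc(scores)  # best score over all positions
    return best > threshold

img = np.random.randint(0, 256, (480, 640), dtype=np.uint8)
tpl = np.random.randint(0, 256, (64, 64), dtype=np.uint8)
print(matches_face_template(img, tpl))
```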
  • human faces have certain structural distribution characteristics.
  • Electronic devices can extract the structural distribution characteristics of human faces from a large number of samples and generate corresponding rules, and then judge whether there is a human face in the image based on the rules.
  • the structural distribution characteristics of human faces may include: two symmetrical eyes, two symmetrical ears, a nose, a mouth, and the positions and relative distances between the five facial features.
  • the sample learning method refers to an artificial neural network method, that is, generating a classifier by learning from a face sample set and a non-face sample set.
  • the electronic device can train the neural network based on the sample.
  • the parameters of the neural network include the statistical characteristics of the face.
  • Feature detection refers to the use of the invariant characteristics of a human face for face detection.
  • a human face has some features that are robust to different postures. For example, a person's eyes and eyebrows are darker than the cheeks, the lips are darker than the surrounding area, the bridge of the nose is lighter than the sides, etc.
  • the electronic device can extract these features and create a statistical model that can describe the relationship between these features, and then determine whether there is a face in the image based on the statistical model. It can be understood that the features extracted by the electronic device can be represented as a one-dimensional vector in the image feature space of the face. When creating a statistical model that can describe the relationship between the features, the electronic device can transform the one-dimensional vector into a relatively simple feature space.
  • the above four face detection methods can be used in combination in actual detection.
  • factors such as individual differences (e.g., differences in hairstyles, open or closed eyes), occlusion of the face in the shooting environment (e.g., occlusion by hair or glasses), the angle of the face relative to the camera (e.g., the side of the face facing the camera), the shooting environment (e.g., objects around the face), and imaging conditions (e.g., lighting conditions, imaging equipment) can also be taken into account in face detection.
  • the above-mentioned face detection method is only an example given in the embodiment of the present application.
  • the electronic device may also use other face detection methods to perform face detection.
  • the above-mentioned face detection method should not be regarded as a limitation to the present application.
  • in some embodiments, the electronic device detects the facial features when performing face detection, which means that the electronic device also performs eye detection when performing face detection.
  • in this case, the electronic device can detect the eyes in the image I1, and feature points related to the eyes can be obtained.
  • accordingly, the electronic device can obtain eye position information. It is understandable that the eye position information may include the coordinates of feature points related to the eyes. The relevant description of the eye position information can be found in the following text and will not be elaborated here.
  • the eye-related feature points obtained by the electronic device may include the pupil center point.
  • the electronic device may obtain the pupil center point coordinates.
  • S303 The electronic device obtains the facial position information in the image I1.
  • the electronic device may acquire and save the face position information in the image I1 .
  • the facial position information may include the coordinates of the face detection frame.
  • the facial position information may include the coordinates of relevant feature points of the face, for example, the coordinates of the edge contour feature points of the face, for example, the coordinates of the feature points in the facial region related to the eyes, nose, mouth and ears.
  • S304 The electronic device performs eye detection and pupil positioning on the image I1, and obtains eye position information and pupil coordinates.
  • when the electronic device detects that the image I1 includes a human face, the electronic device can perform eye detection and pupil positioning on the image I1, thereby obtaining eye position information and pupil coordinates in the image I1.
  • the eye position information may include the coordinates of feature points related to the eyes.
  • the electronic device may determine the feature points related to the eyes and obtain the coordinates of these feature points, for example, the two eye corner feature points of the left eye, the two eye corner feature points of the right eye, and the edge contour feature points of the eyes.
  • the electronic device may determine the eye position in the image I1 based on the coordinates of these feature points related to the eyes.
  • pupil coordinates are two-dimensional coordinates.
  • pupil coordinates may include pupil center coordinates.
  • pupil coordinates may also include other coordinates related to the pupil. For example, the coordinates of the pupil center of gravity, the coordinates of the pupil edge contour points, etc.
  • the pupil positioning method is briefly described below.
  • in some embodiments, when the electronic device detects an eye on the image I1, the electronic device may blur the eye portion on the image I1, extract the pupil contour, and then determine the pupil centroid. It is understandable that the electronic device may use the coordinates of the pupil centroid as the pupil coordinates.
  • in some other embodiments, when the electronic device detects an eye on the image I1, the electronic device may blur the eye portion on the image I1, calculate the pixel-value sums in the horizontal and vertical directions, and then select the index of the row with the lowest sum and the index of the column with the lowest sum as the vertical and horizontal coordinates of the pupil.
  • the electronic device may also adopt other pupil positioning methods, and this application does not limit this.
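  • the following is a minimal sketch of the projection-based pupil positioning described above, assuming a grayscale eye region in which the pupil is the darkest structure:

```python
import numpy as np

def locate_pupil(eye_gray: np.ndarray) -> tuple[int, int]:
    """Return (row, col) of the pupil as the darkest row/column indices."""
    # Blurring (e.g., a Gaussian filter) could be applied first to suppress noise.
    row_sums = eye_gray.sum(axis=1)   # total intensity of each row
    col_sums = eye_gray.sum(axis=0)   # total intensity of each column
    pupil_row = int(np.argmin(row_sums))  # darkest row gives the vertical coordinate
    pupil_col = int(np.argmin(col_sums))  # darkest column gives the horizontal coordinate
    return pupil_row, pupil_col
```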
  • S305 The electronic device determines whether the size of the face area in the image I1 meets a preset size requirement.
  • the electronic device can determine the size of the face area in the image I1, and determine whether the size of the face area in the image I1 meets the preset size requirement.
  • the face area can include important features of the face, such as eyes, nose, and mouth.
  • the size of the face region refers to the area of the face region.
  • the area of the face region refers to the area of the face detection frame. In some other embodiments of the present application, the area of the face region refers to the area of the entire face region in the image detected by the electronic device.
  • the face detection frame can be used to select a facial region including important features, and is not necessarily used to select a complete facial region.
  • the face detection frame can be used to select a large portion of a facial region including features such as eyebrows, eyes, nose, mouth, and ears.
  • the shape of the face detection frame can be set according to actual needs.
  • the face detection frame can be a rectangle.
  • the size of the face area in the image I1 meets the preset size requirement, which means that the area of the face area in the image I1 is within the preset area range.
  • the preset area range can be [220px*220px, 230px*230px]. That is to say, the area of the face area is not less than 220px*220px and not more than 230px*230px.
  • the present application does not limit the specific value of the preset area range. It can be understood that px is short for "pixel", the smallest unit of an image or graphic.
  • the size of the facial area in the image I1 meets the preset size requirement, which means that: the height of the facial area in the image I1 is within the preset height range, and the width of the facial area in the image I1 is within the preset width range.
  • the preset height range can be [215px, 240px]
  • the preset width range can be [215px, 240px].
  • the preset height range and the preset width range can be inconsistent, and the present application does not limit the specific values of the preset height range and the preset width range.
  • the height of the facial area mentioned in the present application can be understood as the height of the face detection frame
  • the width of the facial area mentioned in the present application can be understood as the width of the face detection frame.
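  • the following is a minimal sketch of the preset-size check, using the exemplary height and width ranges given above:

```python
PRESET_HEIGHT_RANGE = (215, 240)  # px, exemplary preset height range
PRESET_WIDTH_RANGE = (215, 240)   # px, exemplary preset width range

def face_size_ok(face_height: int, face_width: int) -> bool:
    """Check whether the face detection frame falls within the preset ranges."""
    h_ok = PRESET_HEIGHT_RANGE[0] <= face_height <= PRESET_HEIGHT_RANGE[1]
    w_ok = PRESET_WIDTH_RANGE[0] <= face_width <= PRESET_WIDTH_RANGE[1]
    return h_ok and w_ok
```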
  • when the size of the facial area included in the image I1 meets the preset size requirement, the electronic device can continue to execute subsequent steps; when the size of the facial area included in the image I1 does not meet the preset size requirement, the electronic device can perform adaptive zoom and reacquire the image according to the focal length after adaptive zoom.
  • the smaller the focal length, the wider the framing range and the larger the field of view of the shot: more objects can be captured, but each object occupies a smaller proportion of the picture.
  • the larger the focal length, the narrower the framing range and the smaller the field of view of the shot: fewer objects can be captured, but each object occupies a larger proportion of the picture.
  • the adaptive zoom method is described.
  • the electronic device can determine the focal length when acquiring the image I1.
  • the focal length when the electronic device acquires the image I1 is recorded as the original focal length in this application.
  • the electronic device can determine whether the area of the facial region in the image I1 is smaller than the minimum value of the preset area range, or larger than the maximum value of the preset area range. If the area of the facial region in the image I1 is smaller than the minimum value of the preset area range, the electronic device can add J1 to the original focal length to obtain the focal length after adaptive zoom, and reacquire the image based on the focal length. If the area of the facial region in the image I1 is larger than the maximum value of the preset area range, the electronic device can subtract J1 from the original focal length to obtain the focal length after adaptive zoom, and reacquire the image based on the focal length.
  • J1 is the preset focal length adjustment step, and the specific value of J1 can be set according to actual needs.
  • the electronic device may determine the middle value of the preset area range, and determine the ratio of the area of the face region to the middle value.
  • the electronic device may multiply the ratio by the original focal length to obtain the focal length after adaptive zooming, and reacquire the image based on the focal length.
  • the electronic device can determine the preset area range based on the preset height range and the preset width range, and then perform adaptive zoom based on the area of the facial area in image I1, the preset area range, and the original focal length.
  • the electronic device can determine the middle value of the preset height range and the middle value of the preset width range, and then multiply the middle value of the preset height range by the middle value of the preset width range to obtain a preset area, and perform adaptive zoom based on the preset area, the area of the facial area in image I1, and the original focal length.
  • the adaptive zoom method may also include other specific methods, which are not limited by this application.
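  • the following is a minimal sketch of the two adaptive zoom strategies described above, assuming the exemplary preset area range and a hypothetical adjustment step J1:

```python
J1 = 0.1                # hypothetical preset focal-length adjustment step
AREA_MIN = 220 * 220    # exemplary preset area range, in px^2
AREA_MAX = 230 * 230

def zoom_by_step(face_area: float, focal_length: float) -> float:
    """Strategy 1: nudge the original focal length by the fixed step J1."""
    if face_area < AREA_MIN:
        return focal_length + J1  # face too small, zoom in
    if face_area > AREA_MAX:
        return focal_length - J1  # face too large, zoom out
    return focal_length

def zoom_by_ratio(face_area: float, focal_length: float) -> float:
    """Strategy 2: scale the original focal length by the ratio of the face area
    to the middle value of the preset area range, as described above."""
    middle_area = (AREA_MIN + AREA_MAX) / 2
    return focal_length * (face_area / middle_area)
```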
  • the size of the facial area can reflect the shooting distance (i.e., the distance between the camera and the face). It can also be understood that the size of the facial area contains the depth information of the shot. If the shooting distance is too large, the eye features in the image captured by the electronic device through the camera may be blurred, thereby affecting the accuracy of the gaze point estimation. If the shooting distance is too small, the facial features in the image captured by the electronic device through the camera may be incomplete, thereby affecting the accuracy of the gaze point estimation.
  • by performing adaptive zoom, the electronic device can capture an image containing a face of suitable size, thereby improving the accuracy of the gaze point estimation.
  • S306 The electronic device crops the image I1 based on the face position information to obtain a face image block, and crops the image I1 based on the eye position information to obtain a left eye image block and a right eye image block.
  • This embodiment of the application provides two implementation methods when the electronic device performs step S306:
  • The first implementation: the electronic device determines the bounding rectangle of the face area in the image I1 based on the coordinates of the feature points included in the face position information, and crops the image I1 based on the bounding rectangle of the face area to obtain a face image block.
  • the electronic device can also determine the bounding rectangle of the left eye area and the bounding rectangle of the right eye area in the image I1 based on the coordinates of the feature points included in the eye position information, and crops the image I1 based on the bounding rectangle of the left eye area and the bounding rectangle of the right eye area to obtain a left eye image block and a right eye image block.
  • the bounding rectangle mentioned in the present application can be a minimum bounding rectangle.
  • the minimum bounding rectangle refers to the rectangle bounding the maximum extent of a two-dimensional shape (e.g., a point set, a line, a polygon) represented by two-dimensional coordinates, that is, the rectangle whose boundaries are defined by the maximum horizontal coordinate, the minimum horizontal coordinate, the maximum vertical coordinate, and the minimum vertical coordinate of the vertices of the given two-dimensional shape.
  • the bounding rectangle of the face area can be understood as the minimum bounding rectangle of facial feature points (e.g., facial edge contour feature points).
  • the bounding rectangle of the left eye area can be understood as the minimum bounding rectangle of the left eye feature points (e.g., the 2 corners of the left eye feature points, the edge contour feature points of the left eye).
  • the bounding rectangle of the right eye area can be understood as the minimum bounding rectangle of the right eye feature points (e.g., the 2 corners of the right eye feature points, the edge contour feature points of the right eye).
  • the size of the face image block is the same as the size of the bounding rectangle of the face area in the image I1.
  • the size of the left eye image block is the same as the size of the bounding rectangle of the left eye area in the image I1.
  • the size of the right eye image block is the same as the size of the bounding rectangle of the right eye area in the image I1.
  • the electronic device may determine the bounding box of the facial feature points by using a bounding box algorithm.
  • the bounding box of the facial feature points may be understood as the optimal enclosing area of the facial feature points.
  • the electronic device may also crop the image I1 based on the bounding box of the facial feature points to obtain a facial image block.
  • the electronic device may determine the bounding boxes of the left eye feature points and the right eye feature points respectively by using a bounding box algorithm.
  • the bounding boxes of the left eye feature points and the right eye feature points may be understood as the optimal enclosing area of the left eye feature points and the optimal enclosing area of the right eye feature points respectively.
  • the electronic device may also crop the image I1 based on the bounding boxes of the left eye feature points and the right eye feature points respectively to obtain a left eye image block and a right eye image block.
  • the bounding box algorithm is an algorithm for solving the optimal bounding space of a discrete point set.
  • the basic idea is to use a slightly larger geometric body with simpler characteristics (called a bounding box) to approximately replace complex geometric objects.
  • for the relevant description of the bounding box algorithm, please refer to the relevant technical documents; this application will not elaborate on this. A sketch of the first implementation is given below.
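  • the following is a minimal sketch of cropping by minimum bounding rectangle, assuming the feature points are given as (x, y) pixel coordinates; the same helper yields the face image block, the left eye image block, and the right eye image block:

```python
import numpy as np

def crop_by_bounding_rect(image: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Crop the axis-aligned minimum bounding rectangle of Nx2 feature points."""
    x_min, y_min = points.min(axis=0)  # smallest horizontal/vertical coordinates
    x_max, y_max = points.max(axis=0)  # largest horizontal/vertical coordinates
    return image[int(y_min):int(y_max) + 1, int(x_min):int(x_max) + 1]

# Usage (names hypothetical):
# face_block = crop_by_bounding_rect(image_i1, face_contour_points)
# left_eye_block = crop_by_bounding_rect(image_i1, left_eye_points)
# right_eye_block = crop_by_bounding_rect(image_i1, right_eye_points)
```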
  • The second implementation: the electronic device crops the image I1 based on the face position information and the preset face cropping size to obtain a face image block, and crops the image I1 based on the eye position information and the preset eye cropping size to obtain a left eye image block and a right eye image block.
  • the electronic device can determine the facial area in the image I1 based on the coordinates in the facial position information, and crop the image I1 based on the preset face cropping size with the facial area as the center, thereby obtaining a facial image block.
  • the size of the facial image block is the same as the preset face cropping size.
  • the facial area in the facial image block is located at the center of the facial image block.
  • the coordinates in the facial position information can include the coordinates of the edge contour feature points of the face, can also include the coordinates of the face detection frame, and can also include the coordinates of the feature points related to the eyes, nose, mouth and ears in the face.
  • the electronic device can also determine the left eye area and the right eye area in image I1 based on the coordinates in the eye position information, and crop image I1 based on the preset eye cropping size with the left eye area and the right eye area as the center, respectively, to obtain the left eye image block and the right eye image block.
  • the left eye area in the left eye image block is located at the center of the left eye image block.
  • the right eye area in the right eye image block is located at the center of the right eye image block.
  • the coordinates in the eye position information can include the coordinates of the 2 eye corner feature points of the left eye and the 2 eye corner feature points of the right eye, and can also include the coordinates of the edge contour feature points of the eyes.
  • the size of the left eye image block is the same as the preset eye cropping size
  • the size of the right eye image block is the same as the preset eye cropping size.
  • the preset eye cropping size is 60px*60px.
  • the sizes of the left eye image block and the right eye image block cropped by the electronic device are both 60px*60px.
  • the preset eye cropping size may include a preset left eye cropping size and a preset right eye cropping size.
  • the preset left eye cropping size may be inconsistent with the preset right eye cropping size.
  • the size of the left eye image block is the same as the preset left eye cropping size, and the size of the right eye image block is the same as the preset right eye cropping size.
  • the preset face cropping size and the preset eye cropping size can be set according to actual needs, and this application does not limit this.
  • the preset face cropping size can be 244px*244px
  • the preset eye cropping size can be 60px*60px.
  • the electronic device may determine the facial area based on the coordinates included in the facial position information (e.g., the coordinates of the edge contour feature points of the face, etc.), set the cropping frame according to a preset face cropping size, and then crop the image I1 with the facial area as the center of the cropping frame, thereby obtaining a facial image block.
  • the electronic device may determine the left eye area and the right eye area based on the coordinates included in the eye position information, and set the left eye cropping frame and the right eye cropping frame according to a preset eye cropping size, and then use the left eye area and the right eye area as the center of the left eye cropping frame and the right eye cropping frame, respectively, to crop the image I1 to obtain a left eye image block and a right eye image block, respectively.
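  • the following is a minimal sketch of the second implementation, cropping fixed-size blocks centered on a region; the sizes shown are the exemplary cropping sizes given above:

```python
import numpy as np

def center_crop(image: np.ndarray, center_xy: tuple[float, float],
                size_wh: tuple[int, int]) -> np.ndarray:
    """Crop a (w, h) block whose center lies at center_xy, clamped to the image."""
    cx, cy = center_xy
    w, h = size_wh
    x0 = int(round(cx - w / 2))
    y0 = int(round(cy - h / 2))
    x0 = max(0, min(x0, image.shape[1] - w))  # clamp so the crop stays inside the image
    y0 = max(0, min(y0, image.shape[0] - h))
    return image[y0:y0 + h, x0:x0 + w]

# Usage with the exemplary sizes (names hypothetical):
# face_block = center_crop(image_i1, face_center, (244, 244))
# left_eye_block = center_crop(image_i1, left_eye_center, (60, 60))
# right_eye_block = center_crop(image_i1, right_eye_center, (60, 60))
```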
  • S307 The electronic device determines the face grid corresponding to the image I1 based on the face position information.
  • the face grid is used to indicate the position and size of the face in the entire image.
  • the electronic device can determine the position of the face area in the image I1 based on the coordinates included in the face position information (for example, the coordinates of the edge contour feature points of the face, etc.), thereby determining the face grid corresponding to the image I1.
  • the face grid can be used to represent the position and size of the face in the entire image. It is understood that the face grid can represent the distance between the face and the camera.
  • the face grid can be understood as a binary mask.
  • a binary mask can be understood as a binary matrix corresponding to an image, that is, a matrix whose elements are all 0 or 1. Generally speaking, an image (all or part) can be blocked by a binary mask. Binary masks can be used for region of interest extraction, shielding, structural feature extraction, etc.
  • the electronic device can determine the proportional relationship between the face area in the image I1 and the image I1 according to the coordinates included in the face position information, thereby obtaining the depth information of the face in the image I1.
  • the electronic device can also determine that the face in the image I1 is located at a lower center position in the image I1. Further, the electronic device can determine the face grid corresponding to the image I1.
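  • the following is a minimal sketch of building the face grid as a binary mask, assuming a hypothetical grid resolution of 25*25 onto which the face detection frame is projected:

```python
import numpy as np

def face_grid(image_wh: tuple[int, int],
              face_box_xywh: tuple[int, int, int, int],
              grid_size: int = 25) -> np.ndarray:
    """Return a grid_size*grid_size 0/1 mask marking where the face lies in the image."""
    img_w, img_h = image_wh
    x, y, w, h = face_box_xywh
    grid = np.zeros((grid_size, grid_size), dtype=np.uint8)
    # Map the face detection frame from pixel coordinates to grid cells.
    c0 = int(x / img_w * grid_size)
    r0 = int(y / img_h * grid_size)
    c1 = int(np.ceil((x + w) / img_w * grid_size))
    r1 = int(np.ceil((y + h) / img_h * grid_size))
    grid[r0:r1, c0:c1] = 1  # cells covered by the face are set to 1, others remain 0
    return grid
```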
  • S308 The electronic device inputs the left eye image block, the right eye image block, the face image block, the face grid and the pupil coordinates into the gaze point estimation network model, and outputs the gaze point coordinates.
  • the electronic device can input the left eye image block, the right eye image block, the face image block, the face grid and the pupil coordinates into the gaze point estimation network model, and output the two-dimensional coordinates.
  • the two-dimensional coordinates are the gaze point coordinates.
  • the gaze point estimation network model can be a neural network model including several branches. The gaze point estimation network model can extract corresponding features through the several branches it contains, and then estimate the gaze point coordinates by synthesizing the extracted features.
  • a neural network is a mathematical model or computational model that imitates the structure and function of a biological neural network (the central nervous system of an animal, especially the brain).
  • a neural network is composed of a large number of artificial neurons, and different networks are constructed according to different connection methods.
  • Neural networks can include convolutional neural networks, recurrent neural networks, etc.
  • the gaze point estimation network model may include several region of interest pooling layers, several convolutional layers, several pooling layers and several fully connected layers.
  • the region of interest pooling layer is used to unify the size of the feature map.
  • the convolutional layer is used to extract features.
  • the pooling layer is used to downsample to reduce the amount of data.
  • the fully connected layer is used to map the extracted features to the sample label space. In layman's terms, the fully connected layer is used to integrate the extracted features together and output them as a value.
  • exemplarily, as shown in FIG. 7, the gaze point estimation network model may include a region of interest pooling (ROI pooling) layer-1, a region of interest pooling layer-2, CNN-1, CNN-2, CNN-3, a fully connected layer-1, a fully connected layer-2, a fully connected layer-3 and a fully connected layer-4.
  • the region of interest pooling layer-1 is used to unify the size of the feature map corresponding to the left eye image block, and unify the size of the feature map corresponding to the right eye image block.
  • the region of interest pooling layer-2 is used to unify the size of the feature map corresponding to the facial image block.
  • CNN-1, CNN-2 and CNN-3 are all convolutional neural networks (CNNs), which are used to extract left eye features, right eye features and facial features, respectively.
  • CNN-1, CNN-2 and CNN-3 may include several convolutional layers and several pooling layers, respectively.
  • CNN-1, CNN-2 and CNN-3 may also include one or more fully connected layers.
  • Fully connected layer-1 is used to integrate the extracted left eye features, right eye features and facial features.
  • Fully connected layer-2 and fully connected layer-3 are used to integrate the depth information represented by the facial mesh (i.e., the distance between the face and the camera) and the pupil position information represented by the pupil coordinates.
  • Fully connected layer-4 is used to integrate the left eye features, right eye features, facial features, depth information and pupil position information and output them as one value.
  • the electronic device can use the left eye image block and the right eye image block as the input of the region of interest pooling layer-1, and use the face image block as the input of the region of interest pooling layer-2.
  • the region of interest pooling layer-1 can output feature maps of the same size.
  • the region of interest pooling layer-2 can also output feature maps of the same size.
  • the electronic device can use the feature map corresponding to the left eye image block output by the region of interest pooling layer-1 as the input of CNN-1, and can also use the feature map corresponding to the right eye image block output by the region of interest pooling layer-1 as the input of CNN-2.
  • the electronic device can use the feature map output by the region of interest pooling layer-2 as the input of CNN-3.
  • the electronic device can use the output of CNN-1, the output of CNN-2, and the output of CNN-3 as the input of the fully connected layer-1.
  • the electronic device can also use the face grid and pupil coordinates as the input of the fully connected layer-2 and the fully connected layer-3, respectively.
  • the electronic device can use the output of the fully connected layer-1, the fully connected layer-2, and the fully connected layer-3 as the input of the fully connected layer-4.
  • the fully connected layer-4 can output two-dimensional coordinates.
  • the two-dimensional coordinates are the gaze point coordinates estimated by the electronic device.
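  • the following is a minimal PyTorch sketch of the FIG. 7 architecture, with hypothetical channel and neuron counts; since the ROI of each image block is the entire block, the region of interest pooling layers are approximated here by adaptive max pooling to a fixed size:

```python
import torch
import torch.nn as nn

def make_cnn() -> nn.Sequential:
    """A small eye/face branch: 4 conv layers with 3*3 kernels, each followed by ReLU."""
    layers, in_ch = [], 3
    for out_ch in (32, 64, 128, 128):  # hypothetical channel counts
        layers += [nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1), nn.ReLU()]
        in_ch = out_ch
    return nn.Sequential(*layers)

class GazeNet(nn.Module):
    def __init__(self, roi_size: int = 12, grid_size: int = 25):
        super().__init__()
        self.roi_pool_eyes = nn.AdaptiveMaxPool2d(roi_size)  # stands in for ROI pooling layer-1
        self.roi_pool_face = nn.AdaptiveMaxPool2d(roi_size)  # stands in for ROI pooling layer-2
        self.cnn1, self.cnn2, self.cnn3 = make_cnn(), make_cnn(), make_cnn()
        feat = 128 * roi_size * roi_size
        self.fc1 = nn.Linear(3 * feat, 128)               # integrates eye + face features
        self.fc2 = nn.Linear(grid_size * grid_size, 256)  # face grid (depth information)
        self.fc3 = nn.Linear(2, 256)                      # pupil coordinates
        self.fc4 = nn.Linear(128 + 256 + 256, 2)          # outputs the gaze point (x, y)

    def forward(self, left_eye, right_eye, face, face_grid, pupil_xy):
        l = self.cnn1(self.roi_pool_eyes(left_eye)).flatten(1)
        r = self.cnn2(self.roi_pool_eyes(right_eye)).flatten(1)
        f = self.cnn3(self.roi_pool_face(face)).flatten(1)
        fused = torch.relu(self.fc1(torch.cat([l, r, f], dim=1)))
        g = torch.relu(self.fc2(face_grid.flatten(1)))
        p = torch.relu(self.fc3(pupil_xy))
        return self.fc4(torch.cat([fused, g, p], dim=1))
```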
  • the gaze point estimation network model may include more region of interest pooling layers.
  • the electronic device may use the left eye image block and the right eye image block as inputs of different region of interest pooling layers, respectively. Accordingly, the electronic device may use the outputs of the different region of interest pooling layers as inputs of CNN-1 and CNN-2, respectively.
  • the gaze point estimation network model may include more fully connected layers. It is understandable that there may be more fully connected layers before and after the fully connected layer-2, and there may be more fully connected layers before and after the fully connected layer-3.
  • the electronic device can use the output of the fully connected layer-2 as the input of the fully connected layer-5, and the output of the fully connected layer-5 as the input of the fully connected layer-4.
  • the electronic device can use the output of the fully connected layer-3 as the input of the fully connected layer-6, and the output of the fully connected layer-6 as the input of the fully connected layer-4.
  • the electronic device can use the output of the fully connected layer-4 as the input of the fully connected layer-7, and the output of the fully connected layer-7 is the gaze point coordinate estimated by the electronic device.
  • FIG. 8 is a schematic diagram of the architecture of another gaze point estimation network model provided in an embodiment of the present application.
  • the gaze point estimation network model may include a region of interest pooling layer-1, a region of interest pooling layer-2, CNN-1, CNN-2, CNN-3, a fully connected layer-1, a fully connected layer-2, a fully connected layer-3, a fully connected layer-4, a fully connected layer-5, a fully connected layer-6, and a fully connected layer-7.
  • the functions of the region of interest pooling layer-1, the region of interest pooling layer-2, CNN-1, CNN-2, CNN-3, and the fully connected layer-1 can all refer to the above, and this application will not repeat them here.
  • the fully connected layer-2 and the fully connected layer-5 are used to integrate the depth information represented by the face grid.
  • the fully connected layer-3 and the fully connected layer-6 are used to integrate the pupil position information represented by the pupil coordinates.
  • the fully connected layer-4 and the fully connected layer-7 are used to integrate the information such as the left eye features, the right eye features, the facial features, the depth information, and the pupil position, and output them as a value.
  • the electronic device can use the output of fully connected layer-2 as the input of fully connected layer-5, use the output of fully connected layer-3 as the input of fully connected layer-6, and use the outputs of fully connected layer-1, fully connected layer-5, and fully connected layer-6 as the input of fully connected layer-4.
  • the electronic device can also use the output of fully connected layer-4 as the input of fully connected layer-7, and the output of fully connected layer-7 is the gaze point coordinate estimated by the electronic device.
  • FIG. 9 is a schematic diagram of the architecture of another gaze point estimation network model provided by an embodiment of the present application.
  • the gaze point estimation network model may include a region of interest pooling layer-1, a region of interest pooling layer-2, CNN-1, CNN-2, CNN-3, a fully connected layer-2, a fully connected layer-3, a fully connected layer-4, a fully connected layer-5, a fully connected layer-6, and a fully connected layer-7.
  • the functions of the region of interest pooling layer-1, the region of interest pooling layer-2, CNN-1, CNN-2, and CNN-3 can all refer to the above, and this application will not repeat them here.
  • Fully connected layer-2 and fully connected layer-5 are used to integrate the depth information represented by the face grid.
  • Fully connected layer-3 and fully connected layer-6 are used to integrate the pupil position information represented by the pupil coordinates.
  • Fully connected layer-4 and fully connected layer-7 are used to integrate information such as left eye features, right eye features, facial features, depth information, and pupil position, and output them as a value.
  • the electronic device can use the output of fully connected layer-2 as the input of fully connected layer-5, use the output of fully connected layer-3 as the input of fully connected layer-6, and use the outputs of fully connected layer-5 and fully connected layer-6 as the input of fully connected layer-4.
  • the electronic device can also use the output of fully connected layer-4 as the input of fully connected layer-7, and the output of fully connected layer-7 is the gaze point coordinate estimated by the electronic device.
  • the gaze point estimation network model may further include several activation layers.
  • an activation layer may be set between the fully connected layer-1 and the fully connected layer-4
  • an activation layer may be set between the fully connected layer-2 and the fully connected layer-4
  • an activation layer may be set between the fully connected layer-3 and the fully connected layer-4.
  • an activation layer may be set between the fully connected layer-1 and the fully connected layer-4, an activation layer may be set between the fully connected layer-2 and the fully connected layer-5, an activation layer may be set between the fully connected layer-5 and the fully connected layer-4, an activation layer may be set between the fully connected layer-3 and the fully connected layer-6, an activation layer may be set between the fully connected layer-6 and the fully connected layer-4, and an activation layer may be set between the fully connected layer-4 and the fully connected layer-7.
  • an activation layer can be set between the fully connected layer-2 and the fully connected layer-5, an activation layer can be set between the fully connected layer-5 and the fully connected layer-4, an activation layer can be set between the fully connected layer-3 and the fully connected layer-6, an activation layer can be set between the fully connected layer-6 and the fully connected layer-4, and an activation layer can be set between the fully connected layer-4 and the fully connected layer-7.
  • the following uses the gaze point estimation network models shown in FIG. 7, FIG. 8 and FIG. 9 as examples to illustrate various parts of the gaze point estimation network model.
  • a region of interest (ROI) refers to an area to be processed that is outlined, in the form of a box, circle, ellipse, irregular polygon, etc., from the image being processed in machine vision and image processing.
  • the region of interest pooling layer is a type of pooling layer.
  • the electronic device can divide the ROI in the image input to the region of interest pooling layer into sections of the same size, and perform a maximum pooling operation on each section.
  • the processed feature map obtained is the output of the region of interest pooling layer.
  • the number of sections is consistent with the dimension of the feature map output by the region of interest pooling layer.
  • the following example illustrates the processing process in the region of interest pooling layer-1.
  • the electronic device can divide the ROI of the left eye image block-1 into 3*3 block areas of equal size, and perform maximum pooling processing (i.e., taking the maximum value of each block area) on each block area.
  • the electronic device can obtain the feature map-1 corresponding to the ROI after the maximum pooling processing.
  • the electronic device can use the feature map-1 as the output of the region of interest pooling layer-1.
  • the size of the feature map-1 is 3*3. That is, the feature map-1 can be understood as a 3*3 matrix. It can be understood that the ROI of the left eye image block-1 is the entire left eye image block-1.
  • the electronic device can divide the ROI of the left eye image block-2 into 3*3 block areas of equal size, and perform maximum pooling processing (i.e., taking the maximum value of each block area) on each block area.
  • the electronic device can obtain the feature map-2 corresponding to the ROI after the maximum pooling processing.
  • the electronic device can use the feature map-2 as the output of the region of interest pooling layer-1.
  • the size of the feature map-2 is 3*3. That is, the feature map-2 can be understood as a 3*3 matrix. It can be understood that the ROI of the left eye image block-2 is the entire left eye image block-2.
  • FIG. 10A and FIG. 10B show the processing process of one channel among the three RGB channels.
  • the ROI of the input image can be divided into a plurality of block regions.
  • Each block region contains data.
  • the data contained in the block regions mentioned here can be understood as the elements of the corresponding regions in the matrix corresponding to the ROI of the input image.
  • the electronic device may divide the ROI of the image input to the region of interest pooling layer based on the size of the preset feature map.
  • exemplarily, the size of the preset feature map may be 10*10. If the size of the ROI of the image input to the region of interest pooling layer is 100*100, the electronic device may evenly divide the ROI into 10*10 block areas, each of which is 10*10 in size.
  • the electronic device may not be able to evenly divide the ROI.
  • the electronic device can perform a zero padding operation, or, while ensuring that most of the block areas are the same size, divide a column block area or a row block area into slightly larger or smaller ones.
  • exemplarily, if the size of the preset feature map is 10*10 and the size of the ROI of the image input to the region of interest pooling layer is 101*101, the electronic device may divide the ROI into 9*9 block areas of size 10*10, 9 block areas of size 10*11, 9 block areas of size 11*10, and 1 block area of size 11*11.
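  • the following is a minimal sketch of the region of interest max pooling described above, spreading the block boundaries as evenly as possible so that some blocks are one pixel larger, as in the 101*101 example:

```python
import numpy as np

def roi_max_pool(roi: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Divide the ROI into out_h*out_w block areas and take the maximum of each block."""
    h, w = roi.shape
    # Evenly spread boundaries; for h = 101, out_h = 10 this yields nine 10-pixel
    # blocks and one 11-pixel block per dimension, matching the example above.
    row_edges = np.linspace(0, h, out_h + 1).astype(int)
    col_edges = np.linspace(0, w, out_w + 1).astype(int)
    out = np.empty((out_h, out_w), dtype=roi.dtype)
    for i in range(out_h):
        for j in range(out_w):
            block = roi[row_edges[i]:row_edges[i + 1], col_edges[j]:col_edges[j + 1]]
            out[i, j] = block.max()  # maximum pooling of each block area
    return out

# Example: a 101*101 ROI pooled to a 10*10 feature map.
# fmap = roi_max_pool(np.random.rand(101, 101), 10, 10)
```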
  • the feature maps output by the region of interest pooling layer-1 all have the same size, and the feature maps output by the region of interest pooling layer-2 all have the same size.
  • for example, the feature map-1 and feature map-2 obtained by inputting the left eye image block-1 and the left eye image block-2 into the region of interest pooling layer-1 are both of size 3*3.
  • the size of the feature map output by the region of interest pooling layer is not limited to the above example, and this application does not impose any restrictions on this.
  • the left eye image block-3 is an RGB image.
  • the left eye image block-3 can be represented as a 60*60*3 matrix.
  • the elements in the matrix include the values of the RGB three channels corresponding to each pixel in the left eye image block-3.
  • the electronic device can input the left eye image block-3 into the region of interest pooling layer-1, and can output 3 3*3 feature maps. These 3 3*3 feature maps correspond to the feature maps of the RGB three channels respectively.
  • for each channel, one feature map is output; the processing of inputting each channel to the region of interest pooling layer-1 can refer to FIG. 10A.
  • CNN refers to convolutional neural network, which is a type of neural network.
  • CNN can include convolutional layers, pooling layers and fully connected layers.
  • each convolutional layer in the convolutional neural network is composed of several convolutional units.
  • the parameters of each convolutional unit are optimized by the back propagation algorithm.
  • the purpose of the convolution operation is to extract different features of the input.
  • the first convolutional layer may only extract some low-level features such as edges, lines and corners. More layers of the network can iteratively extract more complex features from low-level features.
  • the essence of pooling is downsampling.
  • the main function of the pooling layer is to reduce the amount of calculation by reducing the parameters of the network, and it can control overfitting to a certain extent.
  • the operations performed by the pooling layer generally include maximum pooling, mean pooling, etc.
  • CNN-1, CNN-2 and CNN-3 can include several convolutional layers and several pooling layers respectively. It can be understood that CNN-1, CNN-2 and CNN-3 can also include several activation layers.
  • the activation layer is also called the neuron layer, and the most important thing is the setting of the activation function.
  • the activation function can include ReLU, PReLU and Sigmoid, etc.
  • through the activation layer, the electronic device can perform an activation operation on the input data, which can also be understood as a function transformation.
  • CNN-1 may include 4 convolutional layers and 4 activation layers.
  • the 4 convolutional layers refer to: convolutional layer-1, convolutional layer-2, convolutional layer-3, and convolutional layer-4.
  • the 4 activation layers refer to: activation layer-1, activation layer-2, activation layer-3, and activation layer-4. It can be understood that the size of the convolution kernel (i.e., filter) of the 4 convolutional layers can be 3*3.
  • CNN-3 may include 4 convolutional layers, 4 activation layers, and 4 pooling layers.
  • the 4 convolutional layers refer to: convolutional layer-1, convolutional layer-2, convolutional layer-3, and convolutional layer-4.
  • the 4 activation layers refer to: activation layer-1, activation layer-2, activation layer-3, and activation layer-4.
  • the 4 pooling layers refer to: pooling layer-1, pooling layer-2, pooling layer-3, and pooling layer-4.
  • the size of the convolution kernel (i.e., filter) of the 4 convolutional layers can be 3*3.
  • the step size of the 4 pooling layers can be 2 (for example, the maximum pooling process is performed for every 2*2 "cells").
  • the feature map can also be padded with zeros in the convolutional layer.
  • for the zero-padding operation, please refer to the relevant technical documents; it will not be explained in detail here. A sketch of the CNN-3 stack is given below.
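  • the following is a minimal sketch of the exemplary CNN-3 stack described above (4 convolutional layers with 3*3 kernels and zero padding, each followed by an activation layer and a pooling layer with step size 2); the channel counts are hypothetical:

```python
import torch.nn as nn

def make_cnn3(in_ch: int = 3) -> nn.Sequential:
    layers = []
    for out_ch in (32, 64, 128, 128):  # hypothetical channel counts
        layers += [
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),  # 3*3 kernel, zero padding
            nn.ReLU(),                                           # activation layer
            nn.MaxPool2d(kernel_size=2, stride=2),               # 2*2 max pooling, step size 2
        ]
        in_ch = out_ch
    return nn.Sequential(*layers)
```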
  • the structures of CNN-2 and CNN-1 may be the same. In some other embodiments of the present application, the structures of CNN-2, CNN-3 and CNN-1 may be the same.
  • CNN-1, CNN-2 and CNN-3 may also be other contents, not limited to the above examples, and this application does not impose any restrictions on this.
  • the fully connected layer is used to map the extracted features to the sample label space.
  • the fully connected layer is used to integrate the extracted features together and output them as a value.
  • the number of neurons in the fully connected layer-1 is 128, the number of neurons in the fully connected layer-2 and the fully connected layer-3 are both 256, the number of neurons in the fully connected layer-5 and the fully connected layer-6 are both 128, the number of neurons in the fully connected layer-4 is 128, and the number of neurons in the fully connected layer-7 is 2.
  • the number of neurons in the fully connected layer in the gaze point estimation network model can also be other values, not limited to the above examples, and this application does not impose any restrictions on this.
  • the electronic device can obtain eye position information and pupil coordinates during the face detection process, so the electronic device does not need to execute step S304.
  • the electronic device does not need to determine whether the size of the face area in the image I1 meets the preset size requirement. In other words, the electronic device does not need to perform step S305.
  • FIG. 14 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present application.
  • the electronic device may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, and a Subscriber Identification Module (SIM) card interface 195, etc.
  • the sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, etc.
  • the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the electronic device.
  • the electronic device may include more or fewer components than shown in the figure, or combine certain components, or split certain components, or arrange the components differently.
  • the components shown in the figure may be implemented in hardware, software, or a combination of software and hardware.
  • the processor 110 may include one or more processing units, for example, the processor 110 may include an application processor (AP), a modem processor, a graphics processor (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), etc.
  • Different processing units may be independent devices or integrated in one or more processors.
  • the controller can be the nerve center and command center of the electronic device.
  • the controller can generate operation control signals according to the instruction operation code and timing signal to complete the control of fetching and executing instructions.
  • the electronic device may execute the gaze point estimation method through the processor 110.
  • the processor 110 may also be provided with a memory for storing instructions and data.
  • the memory in the processor 110 is a cache memory.
  • the memory may store instructions or data that the processor 110 has just used or cyclically used. If the processor 110 needs to use the instruction or data again, it may be directly called from the memory. This avoids repeated access, reduces the waiting time of the processor 110, and thus improves the efficiency of the system.
  • the processor 110 may include one or more interfaces.
  • the USB interface 130 is an interface that complies with the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, etc.
  • the interface included in the processor 110 may also be used to connect other electronic devices, such as AR devices, etc.
  • the charging management module 140 is used to receive charging input from a charger. While the charging management module 140 is charging the battery 142 , it can also power the electronic device through the power management module 141 .
  • the wireless communication function of the electronic device can be implemented through antenna 1, antenna 2, mobile communication module 150, wireless communication module 160, modem processor and baseband processor.
  • Antenna 1 and antenna 2 are used to transmit and receive electromagnetic wave signals.
  • Each antenna in the electronic device can be used to cover a single or multiple communication frequency bands. Different antennas can also be reused to improve the utilization of the antennas.
  • the mobile communication module 150 can provide solutions for wireless communications including 2G/3G/4G/5G, etc., applied in electronic devices.
  • the wireless communication module 160 can provide wireless communication solutions for application in electronic devices, including Wireless Local Area Networks (WLAN) (such as Wireless Fidelity (Wi-Fi) networks), Bluetooth (BT), Global Navigation Satellite System (GNSS), Frequency Modulation (FM), Near Field Communication (NFC), infrared technology (IR), etc.
  • antenna 1 of the electronic device is coupled to mobile communication module 150, and antenna 2 is coupled to wireless communication module 160, so that the electronic device can communicate with the network and other devices through wireless communication technology.
  • the electronic device implements the display function through a GPU, a display screen 194, and an application processor.
  • the GPU is a microprocessor for image processing, which connects the display screen 194 and the application processor.
  • the GPU is used to perform mathematical and geometric calculations for graphics rendering.
  • the processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
  • the display screen 194 is used to display images, videos, etc.
  • the display screen 194 includes a display panel.
  • the display panel can be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), etc.
  • the electronic device may include 1 or N display screens 194, where N is a positive integer greater than 1.
  • the electronic device can realize the acquisition function through ISP, camera 193, video codec, GPU, display screen 194 and application processor.
  • ISP is used to process the data fed back by camera 193. For example, when taking a photo, the shutter is opened, and the light is transmitted to the camera photosensitive element through the lens. The light signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to ISP for processing and converts it into an image or video visible to the naked eye. ISP can also perform algorithm optimization on the noise, brightness, and color of the image. ISP can also optimize the exposure, color temperature and other parameters of the shooting scene. In some embodiments, ISP can be set in camera 193.
  • the camera 193 is used to capture still images or videos.
  • the object generates an optical image through the lens and projects it onto the photosensitive element.
  • the photosensitive element can be a charge coupled device (CCD) or a complementary metal oxide semiconductor (CMOS) phototransistor.
  • the photosensitive element converts the optical signal into an electrical signal, and then transmits the electrical signal to the ISP to convert it into a digital image or video signal.
  • the ISP outputs the digital image or video signal to the DSP for processing.
  • the DSP converts the digital image or video signal into an image or video signal in a standard RGB, YUV or other format.
  • Digital signal processors are used to process digital signals. In addition to processing digital images or video signals, they can also process other digital signals. For example, when an electronic device selects a frequency point, a digital signal processor is used to perform Fourier transform on the frequency point energy.
  • Video codecs are used to compress or decompress digital videos.
  • Electronic devices can support one or more video codecs. In this way, electronic devices can play or record videos in multiple encoding formats, such as Moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, etc.
  • the external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device.
  • the external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function. For example, files such as music and videos can be saved in the external memory card.
  • the internal memory 121 can be used to store computer executable program codes, which include instructions.
  • the processor 110 executes various functional applications and data processing of the electronic device by running the instructions stored in the internal memory 121.
  • the internal memory 121 may include a program storage area and a data storage area.
  • the program storage area may store an operating system, an application required for at least one function (such as a sound playback function, an image and video playback function, etc.).
  • the data storage area may store data created during the use of the electronic device (such as audio data, a phone book, etc.).
  • the electronic device can implement audio functions such as music playing and recording through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headphone jack 170D, and the application processor.
  • the sensor module 180 may include one or more sensors, which may be of the same type or different types. It is understood that the sensor module 180 shown in FIG. 14 is only an exemplary division method, and there may be other division methods, which are not limited in the present application.
  • the pressure sensor 180A is used to sense the pressure signal and can convert the pressure signal into an electrical signal.
  • the pressure sensor 180A can be set on the display screen 194.
  • the electronic device detects the touch operation intensity according to the pressure sensor 180A.
  • the electronic device can also calculate the touch position according to the detection signal of the pressure sensor 180A.
  • touch operations acting on the same touch position but with different touch operation intensities can correspond to different operation instructions.
  • the gyro sensor 180B can be used to determine the motion posture of the electronic device. In some embodiments, the angular velocity of the electronic device around three axes (i.e., x, y, and z axes) can be determined by the gyro sensor 180B. The gyro sensor 180B can be used for anti-shake shooting.
  • the acceleration sensor 180E can detect the magnitude of the acceleration of the electronic device in all directions (generally three axes). When the electronic device is stationary, it can detect the magnitude and direction of gravity. It can also be used to identify the posture of the electronic device and is applied to applications such as horizontal and vertical screen switching and pedometers.
  • the distance sensor 180F is used to measure the distance.
  • the electronic device can measure the distance by infrared or laser. In some embodiments, when shooting a scene, the electronic device can use the distance sensor 180F to measure the distance to achieve fast focusing.
  • the touch sensor 180K is also called a "touch panel".
  • the touch sensor 180K can be set on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, also called a "touchscreen".
  • the touch sensor 180K is used to detect touch operations acting on or near it.
  • the touch sensor can pass the detected touch operation to the application processor to determine the type of touch event.
  • Visual output related to the touch operation can be provided through the display screen 194.
  • the touch sensor 180K may also be disposed on the surface of the electronic device, at a location different from that of the display screen 194 .
  • the air pressure sensor 180C is used to measure air pressure.
  • the magnetic sensor 180D includes a Hall sensor.
  • the proximity light sensor 180G may include, for example, a light emitting diode (LED) and a light detector, such as a photodiode.
  • the electronic device uses a photodiode to detect infrared reflected light from nearby objects.
  • the ambient light sensor 180L is used to sense the brightness of ambient light.
  • the fingerprint sensor 180H is used to obtain fingerprints.
  • the temperature sensor 180J is used to detect temperature.
  • the bone conduction sensor 180M can obtain vibration signals.
  • the key 190 includes a power button, a volume button, etc.
  • the key 190 can be a mechanical key. It can also be a touch key.
  • the electronic device can receive key input and generate key signal input related to the user settings and function control of the electronic device.
  • the motor 191 can generate a vibration prompt.
  • the motor 191 can be used for incoming call vibration prompts, and can also be used for touch vibration feedback.
  • the indicator 192 can be an indicator light, which can be used to indicate the charging status, power changes, messages, missed calls, notifications, etc.
  • the SIM card interface 195 is used to connect a SIM card.
  • FIG. 15 is a schematic diagram of a software structure of an electronic device provided in an embodiment of the present application.
  • the software framework of the electronic device involved in the present application may include an application layer, an application framework layer (framework, FWK), a system library, an Android runtime, a hardware abstraction layer and a kernel layer (kernel).
  • the application layer may include a series of application packages, such as camera, gallery, calendar, call, map, navigation, WLAN, Bluetooth, music, video, short message and other applications (also referred to as applications).
  • the camera is used to obtain images and videos.
  • for the other applications of the application layer, refer to the introduction and description in the conventional technology; they are not elaborated in this application.
  • the application on the electronic device can be a native application (such as an application installed in the electronic device when the operating system is installed before the electronic device leaves the factory), or it can be a third-party application (such as an application downloaded and installed by the user through the application store), which is not limited in the embodiments of this application.
  • the application framework layer provides application programming interface (API) and programming framework for the applications in the application layer.
  • the application framework layer includes some predefined functions.
  • the application framework layer may include a window manager, a content provider, a view system, a telephony manager, a resource manager, a notification manager, and the like.
  • the window manager is used to manage window programs.
  • the window manager can obtain the display screen size, determine whether there is a status bar, lock the screen, capture the screen, etc.
  • Content providers are used to store and retrieve data and make it accessible to applications.
  • the data may include videos, images, audio, calls made and received, browsing history and bookmarks, phone books, etc.
  • the view system includes visual controls, such as controls for displaying text, controls for displaying images, etc.
  • the view system can be used to build applications.
  • a display interface can be composed of one or more views.
  • a display interface including a text notification icon can include a view for displaying text and a view for displaying images.
  • the phone manager is used to provide communication functions for electronic devices, such as management of call status (including answering, hanging up, etc.).
  • the resource manager provides various resources for applications, such as localized strings, icons, images, layout files, video files, and so on.
  • the notification manager enables applications to display notification information in the status bar. It can be used to convey notification-type messages, which can disappear automatically after a short stay without user interaction. For example, the notification manager is used to notify of download completion, message reminders, and the like.
  • the notification manager can also present notifications that appear in the system top status bar in the form of a chart or scroll-bar text, such as notifications of applications running in the background, or notifications that appear on the screen in the form of a dialog interface. For example, text information is displayed in the status bar, a prompt tone is emitted, the electronic device vibrates, or an indicator light flashes.
  • the runtime includes the core library and the virtual machine.
  • the runtime is responsible for the scheduling and management of the system.
  • the core library consists of two parts: one part is the function that the programming language (for example, Java language) needs to call, and the other part is the core library of the system.
  • the application layer and the application framework layer run in a virtual machine.
  • the virtual machine executes the programming files (e.g., java files) of the application layer and the application framework layer as binary files.
  • the virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
  • the system library can include multiple functional modules, such as surface manager, media library, 3D graphics processing library (such as OpenGL ES), 2D graphics engine (such as SGL), etc.
  • the surface manager is used to manage the display subsystem and provides the fusion of two-dimensional (2D) and three-dimensional (3D) layers for multiple applications.
  • the media library supports playback and recording of a variety of commonly used audio and video formats, as well as static image files, etc.
  • the media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
  • the 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing.
  • a 2D graphics engine is a drawing engine for 2D drawings.
  • the Hardware Abstraction Layer is an interface layer between the operating system kernel and the upper software, and its purpose is to abstract the hardware.
  • the hardware abstraction layer is an abstract interface to the device kernel drivers, and is used to provide the higher-level Java API framework with application programming interfaces for accessing the underlying hardware.
  • HAL contains multiple library modules, such as camera HAL, display, Bluetooth, audio, etc. Each of these library modules implements an interface for a specific type of hardware component.
  • when access to a hardware component is required, the Android operating system loads the library module for that hardware component.
  • the kernel layer is the foundation of the Android operating system. The final functions of the Android operating system are completed through the kernel layer.
  • the kernel layer at least includes display driver, camera driver, audio driver, sensor driver, and virtual card driver.

Abstract

The present application provides a fixation point estimation method and a related device. According to the method, an electronic device can acquire an image by means of a camera and, by means of adaptive zooming, ensure that the input is a simple sample having an adaptive distance scale; when a face detection result meets a preset face condition, the electronic device obtains face position information and eye position information in the acquired image. On the basis of a region of interest (ROI) pooling module in a fixation point estimation network model, the electronic device can process an ROI of a target image block using a corresponding preset feature map size to obtain a feature map. The target image block is obtained by cropping the acquired image, and comprises at least one of a face image block, a left eye image block, and a right eye image block. Different types of image blocks respectively correspond to preset feature map sizes. The method can unify the size of the feature maps by means of the ROI pooling module, thereby avoiding deformation of the target image block after scaling and improving the accuracy of fixation point estimation.

Description

A gaze point estimation method and related device
This application claims priority to the Chinese patent application No. 202210910894.2, entitled "A gaze point estimation method and related device", filed with the China Patent Office on July 29, 2022, which is incorporated herein by reference in its entirety.
Technical Field
The present application relates to fields such as deep learning and big data processing, and in particular to a gaze point estimation method and related device.
Background
Gaze point estimation generally refers to taking an image as input, computing the gaze direction from eye/head features, and mapping it to a gaze point. Gaze point estimation is mainly applied to human-computer interaction and visual display on smartphones, tablets, smart screens, and AR/VR glasses.
Generally speaking, gaze point estimation methods can be divided into two categories: geometry-based methods and appearance-based methods. The basic idea of a geometry-based method is to recover the three-dimensional gaze direction from two-dimensional information (for example, eye features such as the eye corners). The basic idea of an appearance-based method is to learn a model that maps an input image to a gaze point. Each category has its advantages and disadvantages. Geometry-based methods are relatively more accurate, but they place high demands on image quality and resolution and require support from additional hardware (for example, infrared sensors and multiple cameras), which may lead to high power consumption; appearance-based methods are relatively less accurate. It can be understood that an appearance-based method needs to be trained on a large amount of data, the distance between the camera and the photographed subject is not fixed, and the depth information of the input image also varies. For example, the sizes of the face images obtained from different input images may differ considerably and thus fail to meet the model requirements. Scaling the input image may satisfy the model requirements, but it carries a risk of feature deformation, which reduces the accuracy of gaze point estimation.
Therefore, how to improve the accuracy of gaze point estimation while keeping power consumption low is a problem that urgently needs to be solved.
Summary of the Invention
The present application provides a gaze point estimation method and related device. According to the gaze point estimation method, an electronic device can capture an image through a camera and, when the face detection result meets a preset face condition, obtain face position information and eye position information in the captured image. On the basis of the face position information and the eye position information, the electronic device can determine the gaze point coordinates of a target object through a gaze point estimation network model. It can be understood that the photographed subject in the image captured by the electronic device through the camera is the target object, and that the photographed subject mentioned in the present application refers to the main subject photographed when the user shoots with the electronic device. In the process of processing a target image block based on the gaze point estimation network model, the electronic device can, based on the region of interest pooling module therein, process the ROI of the target image block with the corresponding preset feature map size to obtain a feature map. The target image block is obtained by cropping the captured image. The target image block includes at least one type of image block among a face image block, a left eye image block, and a right eye image block. Different types of image blocks each correspond to a preset feature map size. The above method can unify the sizes of the feature maps through the region of interest pooling module, avoid deformation of the target image block after scaling, and improve the accuracy of gaze point estimation.
In a first aspect, the present application provides a gaze point estimation method. The method can be applied to an electronic device provided with a camera. The method may include: the electronic device may capture a first image through the camera; when the face detection result meets a preset face condition, the electronic device may obtain face position information and eye position information in the first image. In the process of processing a target image block based on a gaze point estimation network model, the electronic device may, based on a region of interest pooling module of the gaze point estimation network model, process the region of interest ROI of the target image block with the corresponding preset feature map size to obtain a feature map. It can be understood that the target image block includes at least one type of image block among a face image block, a left eye image block, and a right eye image block, and that different types of image blocks each correspond to a preset feature map size. The face position information includes the coordinates of relevant feature points of the face region, and the eye position information includes the coordinates of relevant feature points of the eye region. The face image block is an image block obtained by cropping the face region in the first image based on the face position information. The left eye image block is an image block obtained by cropping the left eye region in the first image based on the eye position information. The right eye image block is an image block obtained by cropping the right eye region in the first image based on the eye position information.
In the solution provided in the present application, the electronic device can determine the gaze point coordinates of the target object based on the gaze point estimation network model. In this process, the electronic device can, based on the region of interest pooling module, process the region of interest ROI of the target image block with the corresponding preset feature map size to obtain a feature map. It can be understood that the target image block includes at least one type of image block among a face image block, a left eye image block, and a right eye image block, and that the different types of image blocks in the target image block each correspond to a preset feature map size. In other words, feature maps corresponding to image blocks of the same type have the same size, while feature maps corresponding to image blocks of different types may or may not have the same size. This method can unify, through the region of interest pooling module, the sizes of the feature maps corresponding to image blocks of the same type, in preparation for subsequent feature extraction, and can also avoid the feature deformation caused by adjusting the feature map size through scaling, thereby improving the accuracy of gaze point estimation. It can be understood that feature deformation may cause inaccurate feature extraction, thereby affecting the accuracy of gaze point estimation.
In some embodiments of the present application, the electronic device may capture the first image through a front camera. It can be understood that the electronic device may acquire the first image in real time; for details, refer to the related description in step S301 below, which is not elaborated here.
In some embodiments of the present application, the first image may be the image I1.
It can be understood that the related descriptions of the face position information and the eye position information can be found below and are not elaborated here. The relevant feature points of the face region may include edge contour feature points of the face; the relevant feature points of the eye region may include eye corner feature points, and may also include edge contour feature points of the eye region. The related descriptions of these feature points can likewise be found below.
In some embodiments of the present application, the electronic device may obtain the face position information in the course of face detection. Specifically, in the course of face detection, the electronic device may perform feature point detection, determine the feature points related to the face, and then obtain the face position information.
In some embodiments of the present application, the electronic device may complete the detection of the eyes in the course of face detection, so as to obtain the eye position information; for details, refer to the description below. In a possible implementation, the eye-related feature points may include pupil coordinates.
In some embodiments of the present application, the electronic device may perform eye detection to obtain the eye position information. The related description of eye detection can be found below and is not elaborated here.
In some embodiments of the present application, the region of interest pooling module may include several region of interest pooling layers. For example, the region of interest pooling module may include region of interest pooling layer-1, and may also include region of interest pooling layer-2; for details, refer to FIG. 7, FIG. 8 and FIG. 9.
In some embodiments of the present application, the gaze point estimation network model may unify the feature maps for image blocks of the same type in the target image block and perform feature extraction on them. For example, compared with the gaze point estimation network model shown in FIG. 7, the present application may also provide a gaze point estimation network model whose input does not include the face grid and the pupil coordinates and which does not include fully connected layer-2 and fully connected layer-3. For another example, compared with the gaze point estimation network model shown in FIG. 8, the present application may also provide a gaze point estimation network model whose input does not include the face grid and the pupil coordinates and which does not include fully connected layer-2, fully connected layer-5, fully connected layer-3 and fully connected layer-6. For yet another example, compared with the gaze point estimation network model shown in FIG. 9, the present application may also provide a gaze point estimation network model whose input does not include the face grid and the pupil coordinates and which does not include fully connected layer-2, fully connected layer-5, fully connected layer-3 and fully connected layer-6.
In some embodiments of the present application, the preset feature map size corresponding to the face image block is a first preset feature map size, the preset feature map size corresponding to the left eye image block is a second preset feature map size, and the preset feature map size corresponding to the right eye image block is a third preset feature map size.
In some embodiments of the present application, the region of interest of the target image block is the entire target image block. For example, the ROI of the face image block is the entire face image block, the ROI of the left eye image block is the entire left eye image block, and the ROI of the right eye image block is the entire right eye image block.
It can be understood that, when the target image block includes a face image block, the method may further include: the electronic device may crop the face region in the first image based on the face position information to obtain the face image block. Similarly, when the target image block includes a left eye image block, the method may further include: the electronic device may crop the left eye region in the first image based on the eye position information to obtain the left eye image block. When the target image block includes a right eye image block, the method may further include: the electronic device may crop the right eye region in the first image based on the eye position information to obtain the right eye image block.
With reference to the first aspect, in a possible implementation, the electronic device processing the region of interest ROI of the target image block with the corresponding preset feature map size to obtain a feature map may specifically include: the electronic device may divide the ROI of the target image block based on the corresponding preset feature map size to obtain several block regions, and the electronic device may also perform maximum pooling on each block region in the ROI of the target image block to obtain the feature map. The number of block regions in each row of the ROI of the target image block is the same as the width value in the corresponding preset feature map size, and the number of block regions in each column of the ROI of the target image block is the same as the height value in the corresponding preset feature map size.
In the solution provided in the present application, the electronic device can divide the ROI in the target image block into several block regions based on the width value and the height value in the corresponding preset feature map size, and perform maximum pooling on each block region to obtain the feature map of the target image block. Since the number of block regions is consistent with the dimensions of the feature map output by the region of interest pooling layer, this approach unifies the feature maps corresponding to image blocks of any size, thereby avoiding the feature deformation caused by scaling, improving the accuracy of feature extraction, and thus improving the accuracy of gaze point estimation.
It can be understood that the ROI of the face image block in the target image block may be the face region in the face image block. Similarly, the ROI of the left eye image block in the target image block may be the left eye region in the left eye image block, and the ROI of the right eye image block in the target image block may be the right eye region in the right eye image block.
It can be understood that the specific implementation of the above content can be found below, especially in the descriptions related to FIG. 10A, FIG. 10B and FIG. 11, and is not elaborated here.
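As an illustration of the division-and-pooling step described above, the following is a minimal NumPy sketch (an illustration, not code from the present application; the even-split bin boundaries and the array names are assumptions) that divides a single-channel ROI into a grid matching the preset feature map size and max-pools each block region:

```python
import numpy as np

def roi_max_pool(roi: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Divide an H x W ROI into out_h x out_w block regions and max-pool each.

    The output shape is always (out_h, out_w) regardless of the ROI's
    original size, so feature map sizes are unified without scaling the
    image block (and hence without deforming its features).
    Assumes the ROI is at least out_h x out_w pixels.
    """
    in_h, in_w = roi.shape
    # Bin boundaries: each row/column of block regions covers roughly
    # in_h / out_h (resp. in_w / out_w) pixels of the ROI.
    row_edges = np.linspace(0, in_h, out_h + 1).astype(int)
    col_edges = np.linspace(0, in_w, out_w + 1).astype(int)
    feature_map = np.empty((out_h, out_w), dtype=roi.dtype)
    for i in range(out_h):
        for j in range(out_w):
            block = roi[row_edges[i]:row_edges[i + 1],
                        col_edges[j]:col_edges[j + 1]]
            feature_map[i, j] = block.max()
    return feature_map
```

A multi-channel image block would apply the same pooling per channel; the key property is that the number of block regions per row and per column equals the width and height values of the preset feature map size.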
With reference to the first aspect, in a possible implementation, when the target image block includes a face image block, a left eye image block and a right eye image block, the electronic device dividing the ROI of the target image block based on the corresponding preset feature map size to obtain several block regions may specifically include: the electronic device may determine the ROI of the face image block, and divide the ROI of the face image block based on the first preset feature map size to obtain several face block regions; the electronic device may also determine the ROI of the left eye image block, and divide the ROI of the left eye image block based on the second preset feature map size to obtain several left eye block regions; the electronic device may also determine the ROI of the right eye image block, and divide the ROI of the right eye image block based on the third preset feature map size to obtain several right eye block regions. The electronic device performing maximum pooling on each block region in the ROI of the target image block to obtain a feature map may specifically include: the electronic device may perform maximum pooling on each face block region in the ROI of the face image block to obtain a first feature map, may perform maximum pooling on each left eye block region in the ROI of the left eye image block to obtain a second feature map, and may perform maximum pooling on each right eye block region in the ROI of the right eye image block to obtain a third feature map. The first feature map is the feature map corresponding to the ROI of the face image block, the second feature map is the feature map corresponding to the ROI of the left eye image block, and the third feature map is the feature map corresponding to the ROI of the right eye image block.
That the number of block regions in each row of the ROI of the target image block is the same as the width value in the corresponding preset feature map size, and that the number of block regions in each column is the same as the height value, specifically includes: the number of face block regions in each row of the ROI of the face image block is the same as the width value in the first preset feature map size, and the number of face block regions in each column is the same as the height value in the first preset feature map size; the number of left eye block regions in each row of the ROI of the left eye image block is the same as the width value in the second preset feature map size, and the number of left eye block regions in each column is the same as the height value in the second preset feature map size; the number of right eye block regions in each row of the ROI of the right eye image block is the same as the width value in the third preset feature map size, and the number of right eye block regions in each column is the same as the height value in the third preset feature map size.
In the solution provided in the present application, the target image block may include a face image block, a left eye image block and a right eye image block. In this case, the electronic device can, based on the gaze point estimation network model, separately unify the sizes of the feature maps corresponding to the face image block, the left eye image block and the right eye image block, and extract features from the corresponding feature maps respectively. It can be understood that this approach can unify the feature maps corresponding to the image blocks, thereby avoiding the feature deformation caused by scaling, improving the accuracy of feature extraction, and thus improving the accuracy of gaze point estimation.
In some embodiments of the present application, the second preset feature map size may be the same as the third preset feature map size.
In some embodiments of the present application, the first preset feature map size may be the same as the second preset feature map size, and the first preset feature map size may be the same as the third preset feature map size.
In some embodiments of the present application, when the target image block includes one or two of the face image block, the left eye image block and the right eye image block, the processing performed by the electronic device on the target image block based on the region of interest pooling module can be understood with reference to the above and is not repeated here. A usage sketch of the per-type processing follows.
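Continuing the `roi_max_pool` sketch above, each block type would simply call the same pooling routine with its own preset size; the concrete sizes and the `face_roi`/`left_eye_roi`/`right_eye_roi` arrays below are illustrative assumptions, not values from the application:

```python
# Illustrative preset feature map sizes as (height, width); the second and
# third sizes may be equal, as noted above.
FACE_FM_SIZE = (8, 8)    # first preset feature map size
LEFT_FM_SIZE = (5, 5)    # second preset feature map size
RIGHT_FM_SIZE = (5, 5)   # third preset feature map size

first_feature_map = roi_max_pool(face_roi, *FACE_FM_SIZE)
second_feature_map = roi_max_pool(left_eye_roi, *LEFT_FM_SIZE)
third_feature_map = roi_max_pool(right_eye_roi, *RIGHT_FM_SIZE)
```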
With reference to the first aspect, in a possible implementation, the photographed subject of the first image is the target object. After the electronic device captures the first image through the camera, the method may further include: when the face detection result meets the preset face condition, the electronic device may obtain the pupil coordinates in the first image; the electronic device may determine, based on the face position information, the position and size of the face region of the first image within the first image, to obtain the face grid corresponding to the first image. The face grid is used to characterize the distance between the target object and the camera. After the electronic device obtains the feature map, the method may further include: the electronic device may perform convolution processing on the feature map based on the convolution module of the gaze point estimation network model, to extract eye features and/or facial features; the electronic device may also integrate the eye features and/or facial features, the face grid and the pupil coordinates based on the fusion module of the gaze point estimation network model, to obtain the gaze point coordinates of the target object.
In the solution provided in the present application, the electronic device can perform gaze point estimation based on more types of features (for example, facial features, eye features, depth information and pupil position), that is, based on more comprehensive feature information, which can improve the accuracy of gaze point estimation.
It can be understood that the face grid can represent the position and size of the face within the image, and can reflect the depth information of the target object in the image, that is, the distance between the target object and the camera that captured the image.
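The construction of the face grid is not detailed at this point in the text; the following is a minimal sketch of one common construction (a binary occupancy grid over the image, in the style of iTracker-type models; the 25 x 25 grid size and the box-to-cell mapping are assumptions, not necessarily the application's exact method):

```python
import numpy as np

def face_grid(img_w, img_h, face_x, face_y, face_w, face_h, grid=25):
    """Binary grid marking which cells the face bounding box covers.

    A face that is large in the frame (user close to the camera) covers
    many cells; a small face covers few, so the grid encodes the
    position and size of the face, and hence the shooting distance.
    """
    g = np.zeros((grid, grid), dtype=np.float32)
    # Map the face box from pixel coordinates to grid-cell coordinates.
    x0 = int(face_x / img_w * grid)
    y0 = int(face_y / img_h * grid)
    x1 = int(np.ceil((face_x + face_w) / img_w * grid))
    y1 = int(np.ceil((face_y + face_h) / img_h * grid))
    g[y0:y1, x0:x1] = 1.0
    return g
```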
It can be understood that the face in the first image mentioned in the present application is the face of the target object in the first image.
In some embodiments of the present application, the electronic device may input the face image block, the left eye image block, the right eye image block, the face grid and the pupil coordinates into the gaze point estimation network model, and output the gaze point coordinates. The gaze point estimation network model may include a region of interest pooling module, a convolution module and a fusion module. The region of interest pooling module may be used to process the region of interest ROI of the face image block with the first preset feature map size to obtain a first feature map; it may also be used to process the ROI of the left eye image block with the second preset feature map size to obtain a second feature map, and to process the ROI of the right eye image block with the third preset feature map size to obtain a third feature map. The convolution module may be used to perform convolution processing on the first feature map, the second feature map and the third feature map respectively, to extract facial features and eye features. The fusion module may be used to integrate the facial features, the eye features, the face grid and the pupil coordinates to obtain the gaze point coordinates of the target object. The size of the first feature map is the same as the first preset feature map size, the size of the second feature map is the same as the second preset feature map size, and the size of the third feature map is the same as the third preset feature map size.
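To make the module wiring concrete, the following PyTorch sketch connects a region of interest pooling module, a convolution module, and a fusion module in the flow just described. The layer counts, channel widths, grid size, and the use of `nn.AdaptiveMaxPool2d` as a stand-in for ROI pooling over the whole image block are all illustrative assumptions; the application's actual architecture is defined by its FIG. 7 to FIG. 9:

```python
import torch
import torch.nn as nn

class GazeNetSketch(nn.Module):
    def __init__(self):
        super().__init__()
        # ROI pooling module: unify per-type feature map sizes.
        self.face_pool = nn.AdaptiveMaxPool2d((8, 8))  # first preset size
        self.eye_pool = nn.AdaptiveMaxPool2d((5, 5))   # second/third preset size
        # Convolution module: extract facial and eye features.
        self.face_conv = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        self.eye_conv = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU())
        # Fusion module: fully connected layers integrating all cues.
        fused_dim = 16 * 8 * 8 + 2 * 16 * 5 * 5 + 25 * 25 + 4
        self.fuse = nn.Sequential(nn.Linear(fused_dim, 128), nn.ReLU(),
                                  nn.Linear(128, 2))  # (x, y) gaze point

    def forward(self, face, left_eye, right_eye, face_grid, pupils):
        # Image blocks may have different sizes; pooling unifies them.
        f = self.face_conv(self.face_pool(face)).flatten(1)
        l = self.eye_conv(self.eye_pool(left_eye)).flatten(1)
        r = self.eye_conv(self.eye_pool(right_eye)).flatten(1)
        x = torch.cat([f, l, r, face_grid.flatten(1), pupils.flatten(1)], dim=1)
        return self.fuse(x)
```

Sharing one eye branch between the left and right eye blocks is a design choice of this sketch, not something stated by the application.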
With reference to the first aspect, in a possible implementation, the face detection result meeting the preset face condition specifically includes: a face is detected in the first image.
In the solution provided in the present application, the electronic device can obtain the face position information and the eye position information when a face is detected in the first image.
With reference to the first aspect, in a possible implementation, the face detection result meeting the preset face condition may specifically include: a face is detected in the first image, and the size of the face region in the first image meets a preset size requirement. After the electronic device captures the first image through the camera, the method may further include: when a face is detected in the first image but the size of the face region in the first image does not meet the preset size requirement, the electronic device may perform adaptive zooming and recapture an image based on the focal length after the adaptive zooming.
In the solution provided in the present application, when the first image includes a face and the size of the face region in the first image does not meet the preset size requirement, the electronic device can perform adaptive zooming and recapture an image based on the focal length after the adaptive zooming, so that the size of the face in subsequently captured images meets expectations. In this way, the electronic device can capture images containing a face of a suitable size, without the loss of image detail and the difficulty in subsequent feature extraction caused by a face that is too small in the captured image, and without the loss of image information and the difficulty in subsequent feature extraction caused by a face that is too large. In other words, with the above method, the features extracted by the electronic device are relatively accurate, which also improves the accuracy of gaze point estimation.
In some embodiments of the present application, the size of the face region in the first image meeting the preset size requirement specifically includes: the area of the face region in the first image is within a preset area range.
In some embodiments of the present application, the size of the face region in the first image meeting the preset size requirement specifically includes: the height of the face region in the first image is within a preset height range, and the width of the face region in the first image is within a preset width range.
It can be understood that, through adaptive zooming, the electronic device can ensure that the input is a simple sample with an adaptive distance scale. In other words, through adaptive zooming, the electronic device can capture images as if at a moderate shooting distance.
It can be understood that the related descriptions of the preset size requirement and adaptive zooming can be found below and are not elaborated here.
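As a sketch of the adaptive zooming loop described above (the area thresholds, the square-root zoom rule, and the `camera.capture`/`camera.set_zoom`/`detect_face` interfaces are all hypothetical, introduced only for illustration):

```python
import math

def capture_with_adaptive_zoom(camera, detect_face,
                               min_area=120 * 120, max_area=360 * 360,
                               max_retries=3):
    """Recapture with an adjusted focal length until the face size fits."""
    img = camera.capture()
    for _ in range(max_retries):
        face = detect_face(img)  # returns a box with .w and .h, or None
        if face is None:
            return img, None
        area = face.w * face.h
        if min_area <= area <= max_area:
            return img, face  # preset size requirement met
        # Face too small -> zoom in; face too large -> zoom out.
        # Scaling the focal length by sqrt(target / current) roughly
        # scales the face area to the target.
        target = (min_area + max_area) / 2
        camera.set_zoom(camera.zoom * math.sqrt(target / area))
        img = camera.capture()
    return img, detect_face(img)
```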
With reference to the first aspect, in a possible implementation, the electronic device cropping the face region in the first image based on the face position information may specifically include: the electronic device may determine the relevant feature points of the face region in the first image; the electronic device may determine a first circumscribed rectangle; the electronic device may also crop the first image based on the position of the first circumscribed rectangle in the first image. The first circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the face region in the first image; the face image block has the same position in the first image as the first circumscribed rectangle, and the face image block has the same size as the first circumscribed rectangle. The electronic device cropping the left eye region in the first image based on the eye position information may specifically include: the electronic device may determine the relevant feature points of the left eye region in the first image; the electronic device may determine a second circumscribed rectangle, and crop the first image based on the position of the second circumscribed rectangle in the first image. The second circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the left eye region in the first image; the left eye image block has the same position in the first image as the second circumscribed rectangle, and the left eye image block has the same size as the second circumscribed rectangle. The electronic device cropping the right eye region in the first image based on the eye position information may specifically include: the electronic device may determine the relevant feature points of the right eye region in the first image; the electronic device may determine a third circumscribed rectangle, and crop the first image based on the position of the third circumscribed rectangle in the first image. The third circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the right eye region in the first image; the right eye image block has the same position in the first image as the third circumscribed rectangle, and the right eye image block has the same size as the third circumscribed rectangle.
In the solution provided in the present application, the electronic device can obtain the face image block, the left eye image block and the right eye image block based on the circumscribed rectangle of the relevant feature points of the face region, the circumscribed rectangle of the relevant feature points of the left eye region, and the circumscribed rectangle of the relevant feature points of the right eye region, respectively.
It can be understood that the specific implementation of the above content can be found in the related description of step S306 below and is not elaborated here.
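A minimal sketch of the circumscribed-rectangle cropping described above (the landmark format and the NumPy H x W x C image layout are assumptions):

```python
import numpy as np

def crop_by_landmarks(image: np.ndarray, points) -> np.ndarray:
    """Crop the axis-aligned circumscribed rectangle of feature points.

    `image` is an H x W x C array; `points` is an iterable of (x, y)
    landmark coordinates for one region (face, left eye, or right eye).
    The resulting image block has the same position and size as the
    circumscribed rectangle of the points.
    """
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    x0, x1 = int(min(xs)), int(np.ceil(max(xs)))
    y0, y1 = int(min(ys)), int(np.ceil(max(ys)))
    return image[y0:y1, x0:x1]
```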
With reference to the first aspect, in a possible implementation, cropping the face region in the first image based on the face position information to obtain the face image block may specifically include: the electronic device may determine the face region in the first image based on the face position information; the electronic device may crop the first image with the face region at the center of a first cropping frame to obtain the face image block. The size of the first cropping frame is a first preset cropping size, and the face image block has the same size as the first cropping frame. Cropping the left eye region and the right eye region in the first image based on the eye position information to obtain the left eye image block and the right eye image block may specifically include: the electronic device determines the left eye region and the right eye region in the first image based on the eye position information; the electronic device may crop the first image with the left eye region at the center of a second cropping frame to obtain the left eye image block, and may also crop the first image with the right eye region at the center of a third cropping frame to obtain the right eye image block. The size of the second cropping frame is a second preset cropping size, and the left eye image block has the same size as the second cropping frame. The size of the third cropping frame is a third preset cropping size, and the right eye image block has the same size as the third cropping frame.
In the solution provided in the present application, the electronic device can crop the first image based on the face position information and a preset face cropping size to obtain the face image block. The electronic device can also crop the first image based on the eye position information and a preset eye cropping size to obtain the left eye image block and the right eye image block.
In some embodiments of the present application, the first preset cropping size is the preset face cropping size.
In some embodiments of the present application, the second preset cropping size and the third preset cropping size are preset eye cropping sizes. The second preset cropping size and the third preset cropping size may be the same.
In some embodiments of the present application, the preset eye cropping size may include a preset left eye cropping size and a preset right eye cropping size. The second preset cropping size may be the preset left eye cropping size, and the third preset cropping size may be the preset right eye cropping size.
It can be understood that the specific implementation of the above content can be found in the related description of step S306 below and is not elaborated here.
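A minimal sketch of the fixed-size, center-anchored cropping described above (clamping the cropping frame at the image borders is an added assumption, not something the application specifies):

```python
import numpy as np

def center_crop(image: np.ndarray, cx: float, cy: float,
                crop_w: int, crop_h: int) -> np.ndarray:
    """Crop a crop_w x crop_h block centered on the region center (cx, cy).

    The cropping frame is shifted to stay inside the image when the
    region center lies near a border, so the output size stays fixed
    at the preset cropping size.
    """
    img_h, img_w = image.shape[:2]
    x0 = int(round(cx - crop_w / 2))
    y0 = int(round(cy - crop_h / 2))
    x0 = max(0, min(x0, img_w - crop_w))
    y0 = max(0, min(y0, img_h - crop_h))
    return image[y0:y0 + crop_h, x0:x0 + crop_w]
```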
With reference to the first aspect, in a possible implementation, the gaze point estimation network model may further include several activation layers. The region of interest pooling module may include several region of interest pooling layers. The convolution module may include several convolution layers. The fusion module includes several fully connected layers.
In some embodiments of the present application, the gaze point estimation network model may include several region of interest pooling layers and several convolution layers, and may further include several activation layers.
In some embodiments of the present application, the gaze point estimation network model may include several region of interest pooling layers, several convolution layers and several pooling layers, and may further include several activation layers.
In a second aspect, the present application provides an electronic device. The electronic device may include a display screen, a camera, a memory, and one or more processors. The memory is used to store a computer program. The camera may be used to capture a first image. The processor may be used to: obtain face position information and eye position information in the first image when the face detection result meets a preset face condition; and, in the process of processing a target image block based on a gaze point estimation network model, process, based on the region of interest pooling module of the gaze point estimation network model, the region of interest ROI of the target image block with the corresponding preset feature map size to obtain a feature map. The face position information includes the coordinates of relevant feature points of the face region, and the eye position information includes the coordinates of relevant feature points of the eye region. The target image block includes at least one type of image block among a face image block, a left eye image block, and a right eye image block. Different types of image blocks each correspond to a preset feature map size. The face image block is an image block obtained by cropping the face region in the first image based on the face position information, the left eye image block is an image block obtained by cropping the left eye region in the first image based on the eye position information, and the right eye image block is an image block obtained by cropping the right eye region in the first image based on the eye position information.
With reference to the second aspect, in a possible implementation, the processor, when used to process the region of interest ROI of the target image block with the corresponding preset feature map size to obtain a feature map, may specifically be used to: divide the ROI of the target image block based on the corresponding preset feature map size to obtain several block regions; and perform maximum pooling on each block region in the ROI of the target image block to obtain the feature map. The number of block regions in each row of the ROI of the target image block is the same as the width value in the corresponding preset feature map size, and the number of block regions in each column of the ROI of the target image block is the same as the height value in the corresponding preset feature map size.
With reference to the second aspect, in a possible implementation, when the target image block includes a face image block, a left eye image block and a right eye image block, the processor, when used to divide the ROI of the target image block based on the corresponding preset feature map size to obtain several block regions, may specifically be used to: determine the ROI of the face image block, and divide the ROI of the face image block based on the first preset feature map size to obtain several face block regions; determine the ROI of the left eye image block, and divide the ROI of the left eye image block based on the second preset feature map size to obtain several left eye block regions; and determine the ROI of the right eye image block, and divide the ROI of the right eye image block based on the third preset feature map size to obtain several right eye block regions. The processor, when used to perform maximum pooling on each block region in the ROI of the target image block to obtain a feature map, may specifically be used to: perform maximum pooling on each face block region in the ROI of the face image block to obtain a first feature map; perform maximum pooling on each left eye block region in the ROI of the left eye image block to obtain a second feature map; and perform maximum pooling on each right eye block region in the ROI of the right eye image block to obtain a third feature map. The first feature map is the feature map corresponding to the ROI of the face image block; the second feature map is the feature map corresponding to the ROI of the left eye image block; the third feature map is the feature map corresponding to the ROI of the right eye image block.
That the number of block regions in each row of the ROI of the target image block is the same as the width value in the corresponding preset feature map size, and that the number of block regions in each column is the same as the height value, may specifically include: the number of face block regions in each row of the ROI of the face image block is the same as the width value in the first preset feature map size, and the number of face block regions in each column is the same as the height value in the first preset feature map size; the number of left eye block regions in each row of the ROI of the left eye image block is the same as the width value in the second preset feature map size, and the number of left eye block regions in each column is the same as the height value in the second preset feature map size; the number of right eye block regions in each row of the ROI of the right eye image block is the same as the width value in the third preset feature map size, and the number of right eye block regions in each column is the same as the height value in the third preset feature map size.
结合第二方面,在一种可能的实现方式中,第一图像的拍摄主体为目标对象。在摄像头用于采集第一图像之后,处理器,还可以用于:在人脸检测结果满足预设人脸条件的情况下,获取第一图像中的瞳孔坐标;基于脸部位置信息确定第一图像中的脸部区域在第一图像中的位置和大小,得到第一图像对应的脸部网格。脸部网格用于表征目标对象与摄像头的距离。处理器,在用于得到特征图之后,还可以用于:基于注视点估计网络模型的卷积模块对特征图进行卷积处理,提取眼部特征和/或脸部特征;基于注视点估计网络模型的融合模块对眼部特征和/或脸部特征、脸部网格和瞳孔坐标进行整合,得到目标对象的注视点坐标。In combination with the second aspect, in a possible implementation, the subject of the first image is the target object. After the camera is used to capture the first image, the processor can also be used to: obtain the pupil coordinates in the first image when the face detection result meets the preset face condition; determine the position and size of the face area in the first image in the first image based on the face position information, and obtain the face grid corresponding to the first image. The face grid is used to characterize the distance between the target object and the camera. After being used to obtain the feature map, the processor can also be used to: perform convolution processing on the feature map based on the convolution module of the gaze point estimation network model to extract eye features and/or facial features; integrate the eye features and/or facial features, the face grid and the pupil coordinates based on the fusion module of the gaze point estimation network model to obtain the gaze point coordinates of the target object.
In combination with the second aspect, in a possible implementation, the face detection result meeting the preset face condition may specifically include: a face is detected in the first image.
In combination with the second aspect, in a possible implementation, the face detection result meeting the preset face condition may specifically include: a face is detected in the first image, and the size of the face area in the first image meets the preset size requirement. After the camera captures the first image, the processor may further be configured to: when a face is detected in the first image but the size of the face area in the first image does not meet the preset size requirement, perform adaptive zooming and recapture an image at the focal length obtained by the adaptive zooming.
In combination with the second aspect, in a possible implementation, the processor, when cropping the face area in the first image based on the face position information, may specifically be configured to: determine the relevant feature points of the face area in the first image; determine a first circumscribed rectangle; and crop the first image based on the position of the first circumscribed rectangle within it. The first circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the face area in the first image. The face image block has the same position in the first image, and the same size, as the first circumscribed rectangle. The processor, when cropping the left-eye area in the first image based on the eye position information, may specifically be configured to: determine the relevant feature points of the left-eye area in the first image; determine a second circumscribed rectangle; and crop the first image based on the position of the second circumscribed rectangle within it. The second circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the left-eye area in the first image. The left-eye image block has the same position in the first image, and the same size, as the second circumscribed rectangle.
The processor, when cropping the right-eye area in the first image based on the eye position information, may specifically be configured to: determine the relevant feature points of the right-eye area in the first image; determine a third circumscribed rectangle; and crop the first image based on the position of the third circumscribed rectangle within it. The third circumscribed rectangle is the circumscribed rectangle of the relevant feature points of the right-eye area in the first image. The right-eye image block has the same position in the first image, and the same size, as the third circumscribed rectangle.
In combination with the second aspect, in a possible implementation, the processor, when cropping the face area in the first image based on the face position information to obtain the face image block, may specifically be configured to: determine the face area in the first image based on the face position information; and crop the first image with the face area at the center of a first cropping frame to obtain the face image block. The size of the first cropping frame is the first preset cropping size, and the face image block has the same size as the first cropping frame. The processor, when cropping the left-eye area and the right-eye area in the first image based on the eye position information to obtain the left-eye image block and the right-eye image block, may specifically be configured to: determine the left-eye area and the right-eye area in the first image based on the eye position information; crop the first image with the left-eye area at the center of a second cropping frame to obtain the left-eye image block; and crop the first image with the right-eye area at the center of a third cropping frame to obtain the right-eye image block. The size of the second cropping frame is the second preset cropping size, and the left-eye image block has the same size as the second cropping frame; the size of the third cropping frame is the third preset cropping size, and the right-eye image block has the same size as the third cropping frame.
In combination with the second aspect, in a possible implementation, the gaze point estimation network model may further include several activation layers. The region of interest pooling module may include several region of interest pooling layers, the convolution module may include several convolution layers, and the fusion module may include several fully connected layers.
In a third aspect, the present application provides a computer storage medium comprising computer instructions which, when run on an electronic device, cause the electronic device to execute any possible implementation of the first aspect.
In a fourth aspect, an embodiment of the present application provides a chip applicable to an electronic device. The chip includes one or more processors configured to invoke computer instructions to cause the electronic device to execute any possible implementation of the first aspect.
In a fifth aspect, an embodiment of the present application provides a computer program product comprising instructions which, when run on an electronic device, cause the electronic device to execute any possible implementation of the first aspect.
It can be understood that the electronic device provided in the second aspect, the computer storage medium provided in the third aspect, the chip provided in the fourth aspect and the computer program product provided in the fifth aspect are all used to execute a possible implementation of the first aspect. The beneficial effects they can achieve are therefore those of the corresponding implementation of the first aspect and are not repeated here.
BRIEF DESCRIPTION OF THE DRAWINGS
FIG. 1 is a schematic diagram of a gaze point estimation scene provided in an embodiment of the present application;
FIGS. 2A-2D are schematic diagrams of a set of gaze point estimation scenes provided in an embodiment of the present application;
FIG. 3 is a flowchart of a gaze point estimation method provided in an embodiment of the present application;
FIG. 4 is a schematic diagram of a cropping principle provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of another cropping principle provided in an embodiment of the present application;
FIG. 6 is a schematic diagram of a face grid provided in an embodiment of the present application;
FIG. 7 is a schematic diagram of the architecture of a gaze point estimation network model provided in an embodiment of the present application;
FIG. 8 is a schematic diagram of the architecture of another gaze point estimation network model provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of the architecture of yet another gaze point estimation network model provided in an embodiment of the present application;
FIGS. 10A and 10B are schematic diagrams of the principle of a region of interest pooling layer provided in an embodiment of the present application;
FIG. 11 is a schematic diagram of an ROI mapped onto a feature map provided in an embodiment of the present application;
FIG. 12 is a schematic diagram of the structure of a CNN-1 provided in an embodiment of the present application;
FIG. 13 is a schematic diagram of the structure of a CNN-3 provided in an embodiment of the present application;
FIG. 14 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present application;
FIG. 15 is a schematic diagram of the software structure of an electronic device provided in an embodiment of the present application.
DETAILED DESCRIPTION OF EMBODIMENTS
The technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. In the description of the embodiments of the present application, unless otherwise specified, "/" means "or"; for example, A/B may mean A or B. "And/or" in the text merely describes an association between associated objects, indicating that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. In addition, in the description of the embodiments of the present application, "multiple" means two or more than two.
It should be understood that the terms "first", "second" and the like in the specification, claims and drawings of this application are used to distinguish different objects rather than to describe a specific order. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device comprising a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units not listed, or optionally also includes other steps or units inherent to the process, method, product or device.
Reference to an "embodiment" in this application means that a particular feature, structure or characteristic described in conjunction with the embodiment may be included in at least one embodiment of the present application. The appearance of this phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described in this application may be combined with other embodiments.
The present application provides a gaze point estimation method, which can be applied to an electronic device. According to the method, the electronic device captures an image through a front-facing camera. If the captured image includes a face, the electronic device crops the image based on the face position information obtained by face detection and a preset face cropping size to obtain a face image block. Similarly, the electronic device crops the image based on the eye position information obtained by eye detection and a preset eye cropping size to obtain a left-eye image block and a right-eye image block. The electronic device also determines a face grid based on the face position information and determines the pupil coordinates through pupil localization. The face grid represents the position and size of the face within the entire image; in other words, it reflects the distance between the face and the camera. The electronic device inputs the left-eye image block, the right-eye image block, the face image block, the face grid and the pupil coordinates into a gaze point estimation network model, which outputs the gaze point coordinates. The gaze point estimation network model may include a region of interest pooling layer, which unifies the sizes of the feature maps in preparation for subsequent feature extraction. In a possible implementation, the electronic device determines whether the size of the face area in the captured image meets a preset size requirement. If it does not, the electronic device ensures a suitable shooting distance through adaptive zooming and recaptures the image; if it does, the electronic device estimates the gaze point coordinates according to the above method.
Through the above method, the electronic device estimates the gaze point coordinates based on the left-eye image block, the right-eye image block, the face image block, the face grid and the pupil coordinates, achieving more comprehensive feature extraction. Moreover, by controlling the size of the face area in the captured image through adaptive zooming and unifying the feature map sizes with the region of interest pooling layer, the electronic device avoids the deformation that scaling would introduce into the image blocks (for example, the left-eye, right-eye and face image blocks), improving the accuracy of gaze point estimation.
Some scenarios provided by this application are introduced below.
As shown in FIG. 1, when a user uses an electronic device, the device can capture an image of the user and estimate the gaze point coordinates from it. Specifically, the electronic device captures images through its front-facing camera. If a captured image includes a face, the electronic device crops it to obtain a left-eye image block, a right-eye image block and a face image block. The electronic device also determines a face grid based on the face position information obtained by face detection, and determines the pupil coordinates through pupil localization. The face grid represents the position and size of the face within the entire image; equivalently, it reflects the distance between the face and the camera. The electronic device inputs the left-eye image block, the right-eye image block, the face image block, the face grid and the pupil coordinates into the gaze point estimation network model, which outputs the gaze point coordinates. It can be understood that the description of the gaze point estimation network model is given later and is not expanded here.
It can be understood that the electronic device may specifically be a mobile phone, a tablet computer, a wearable device, a vehicle-mounted device, an augmented reality (AR)/virtual reality (VR) device, a laptop computer, an ultra-mobile personal computer (UMPC), a netbook, a personal digital assistant (PDA) or a dedicated camera (for example, a single-lens reflex camera or a compact camera). This application places no restriction on the specific type of electronic device.
In some embodiments of the present application, when a user browses information on an electronic device, the device can trigger corresponding operations based on the estimated gaze point coordinates. In this case, the user can interact with the electronic device more conveniently.
For example, as shown in FIG. 2A, the electronic device may display a reading interface 100 showing page 1 of the e-book the user is reading; the e-book has 243 pages in total. While the user reads, the electronic device can estimate the gaze point coordinates in real time. As shown in FIG. 2B, the electronic device can estimate the gaze point coordinates from the captured image and determine that they fall at the end of the page-1 content displayed on the reading interface 100. In this case, the electronic device can trigger a page turn and accordingly display the reading interface 200 shown in FIG. 2C, which shows page 2 of the e-book. The electronic device can then continue to estimate the gaze point coordinates in real time.
It can be understood that the real-time estimation of gaze point coordinates mentioned in this application may include: the electronic device captures one frame of image at a fixed interval (for example, every 10 ms) and estimates the gaze point coordinates from that image.
In a possible implementation, the electronic device triggers the page turn when it determines, from x consecutive captured frames, that the gaze point coordinates all fall at the end of the page-1 content displayed on the reading interface 100. It can be understood that the specific value of x can be set according to actual needs, which this application does not restrict; for example, x = 5.
In some embodiments of the present application, when a user browses information on an electronic device, the device can collect the user's preference information based on the estimated gaze point coordinates, and thus serve the user more intelligently based on the collected preferences. For example, while the user browses information, the electronic device may recommend content (such as videos or articles). In this case, the device can estimate the user's gaze point coordinates to determine which recommended content interests the user, and subsequently recommend content related to that recommended content.
For example, as shown in FIG. 2D, the electronic device may display a user interface 300 containing several pieces of video or text information: recommended content 1, recommended content 2, recommended content 3, recommended content 4 and recommended content 5. The device can capture images in real time and estimate the gaze point coordinates from them. It can also tally the distribution of the gaze point coordinates while the user interface 300 is displayed, thereby determining the recommended content that interests the user, say recommended content 2 in the user interface 300. Subsequently, the device can intelligently recommend content related to recommended content 2, sparing the user the time spent ruling out uninteresting content and serving the user more intelligently.
It should be noted that the gaze point estimation method provided by this application can also be applied to other scenarios, which this application does not restrict.
A gaze point estimation method provided by this application is introduced below.
Please refer to FIG. 3, a flowchart of a gaze point estimation method provided in an embodiment of the present application. The method may include, but is not limited to, the following steps:
S301: The electronic device acquires an image I1.
In some embodiments of the present application, the electronic device acquires the image I1 through its front-facing camera.
In some embodiments of the present application, the electronic device receives the image I1 acquired by another camera.
In some embodiments of the present application, the electronic device can acquire images in real time; that is, the image I1 is an image acquired in real time. For example, the electronic device may acquire one frame every time interval T. The time T mentioned in this application can be set according to actual needs; for example, T may be 1 millisecond (ms).
S302: The electronic device performs face detection on the image I1 to determine whether the image I1 includes a face.
It can be understood that the electronic device can perform face detection on the image I1 to determine whether it includes a face. If a face is detected, the electronic device continues with the subsequent steps; if not, it discards the image I1 and acquires a new image.
It can be understood that face detection means judging whether a face exists in a dynamic scene or against a complex background, and separating it out. That is, based on the search strategy that face detection embodies, any given image can be searched to determine whether it contains a face.
The face detection methods are briefly introduced below.
(1) Template matching
The electronic device can determine the degree of match (i.e., correlation) between the input image and one or more preset standard face templates, and then judge from that degree whether a face is present in the image. For example, the electronic device can compare the degree of match with a preset threshold and judge from that comparison whether the image contains a face: specifically, if the degree of match exceeds the threshold, the electronic device determines that a face is present; otherwise, it determines that no face is present.
In some embodiments of the present application, when determining the degree of match between the input image and the preset standard face templates, the electronic device may specifically compute the match between the input image and the facial contour, nose, eyes, mouth and other parts of the standard face template.
It can be understood that the electronic device may include a template library in which the standard face templates are stored.
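A minimal sketch of the template matching idea, assuming OpenCV, grayscale inputs and normalized cross-correlation as the degree of match (the threshold value and the function name are illustrative):

```python
import cv2

def detect_face_by_template(image_gray, templates, threshold: float = 0.7) -> bool:
    """Return True if any standard face template matches the image well enough.

    templates: list of grayscale face templates, e.g. from a template library.
    """
    for tpl in templates:
        # Normalized cross-correlation: the degree of match (i.e., correlation).
        scores = cv2.matchTemplate(image_gray, tpl, cv2.TM_CCOEFF_NORMED)
        if scores.max() > threshold:  # match exceeds the preset threshold
            return True               # a face is deemed present
    return False                      # no template matched: no face
```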
(2) Face-rule method
It can be understood that human faces have certain structural distribution characteristics. The electronic device can extract these characteristics from a large number of samples and generate corresponding rules, and then judge from those rules whether an image contains a face. The structural distribution characteristics of a face may include: two symmetrical eyes, two symmetrical ears, one nose, one mouth, and the positions of and relative distances between the facial features.
(3) Sample learning
Sample learning refers to artificial neural network methods, i.e., generating a classifier by learning from a set of face samples and a set of non-face samples. In other words, the electronic device can train a neural network on samples; the parameters of that network encode the statistical characteristics of faces.
(4) Feature detection
Feature detection uses the invariant characteristics of faces for face detection. A face has features that are robust to different poses: for example, a person's eyes and eyebrows are darker than the cheeks, the lips are darker than their surroundings, and the bridge of the nose is lighter than its sides. The electronic device can extract these features, build a statistical model describing the relationships between them, and then determine from that model whether an image contains a face. It can be understood that the extracted features can be represented as a one-dimensional vector in the image feature space of the face; when building the statistical model, the electronic device may transform this vector into a relatively simple feature space.
It is worth noting that the above four face detection methods can be used in combination in actual detection. In addition, on top of them, face detection can also take into account individual differences (for example, differences in hairstyle, whether the eyes are open or closed), occlusion of the face in the shooting environment (for example, by hair or glasses), the angle of the face relative to the camera (for example, the face in profile toward the camera), the shooting environment (for example, objects around the face) and imaging conditions (for example, lighting conditions and imaging equipment).
It should be noted that the above face detection methods are merely examples given in the embodiments of this application; the electronic device may also use other face detection methods, and the above methods should not be taken as limiting this application.
In some embodiments of the present application, the electronic device detects the facial features when performing face detection, which means that it also performs eye detection. When performing eye detection, the electronic device can obtain feature points related to the eyes. In this case, if the electronic device can detect the eyes in the image I1, it can obtain the eye position information. It can be understood that the eye position information may include the coordinates of the eye-related feature points; its description can be found later and is not expanded here.
In some embodiments of the present application, the eye-related feature points obtained by the electronic device may include the pupil center point, in which case the electronic device can obtain the pupil center coordinates.
S303: The electronic device obtains the face position information in the image I1.
Specifically, while detecting the image I1, if the electronic device detects that the image includes a face, it can obtain and save the face position information in the image I1.
In some embodiments of the present application, the face position information may include the coordinates of the face detection frame.
In some embodiments of the present application, the face position information may include the coordinates of relevant facial feature points, for example the coordinates of the edge contour feature points of the face, or the coordinates of feature points in the face area related to the eyes, nose, mouth and ears.
S304: The electronic device performs eye detection and pupil localization on the image I1, and obtains the eye position information and pupil coordinates.
Specifically, when the electronic device detects that the image I1 includes a face, it can perform eye detection and pupil localization on the image, thereby obtaining the eye position information and pupil coordinates in the image I1.
In some embodiments of the present application, the eye position information may include the coordinates of eye-related feature points. When performing eye detection, the electronic device can determine these feature points and obtain their coordinates, for example the two corner points of the left eye, the two corner points of the right eye, and the edge contour feature points of the eyes. From the coordinates of these eye-related feature points, the electronic device can determine the eye positions in the image I1.
It can be understood that, as with face detection, eye localization and detection can also use template matching, rule-based, sample learning and feature detection methods; refer to the relevant technical literature for details, which are not expanded here.
It can be understood that the pupil coordinates are two-dimensional coordinates. In some embodiments of the present application, the pupil coordinates may include the pupil center coordinates. Of course, they may also include other pupil-related coordinates, for example the coordinates of the pupil centroid or of the pupil edge contour points.
The pupil localization methods are briefly described below.
In some embodiments of the present application, when the electronic device detects the eyes in the image I1, it can blur the eye portions of the image, extract the pupil contour, and then determine the pupil centroid. It can be understood that the electronic device can take the coordinates of the pupil centroid as the pupil coordinates.
In some embodiments of the present application, when the electronic device detects the eyes in the image I1, it can blur the eye portions of the image, compute the pixel values in the horizontal and vertical directions, and then take the index of the row with the lowest pixel value and the index of the column with the lowest pixel value as the ordinate and abscissa of the pupil coordinates.
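A minimal sketch of this row/column approach, assuming OpenCV and NumPy, a grayscale eye patch, and per-row/per-column intensity sums as the "pixel values" (the text does not pin down how they are aggregated; the function name is illustrative):

```python
import cv2
import numpy as np

def locate_pupil(eye_gray: np.ndarray) -> tuple[int, int]:
    """Estimate the pupil (x, y) inside a grayscale eye patch."""
    blurred = cv2.GaussianBlur(eye_gray, (7, 7), 0)  # blur the eye portion first
    row_vals = blurred.sum(axis=1)   # one aggregated pixel value per row
    col_vals = blurred.sum(axis=0)   # one aggregated pixel value per column
    y = int(np.argmin(row_vals))     # lowest row value -> ordinate
    x = int(np.argmin(col_vals))     # lowest column value -> abscissa
    return x, y
```

The dark pupil drags down the values of the row and column crossing it, so their minima land near the pupil center.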
Of course, the electronic device may also adopt other pupil localization methods, which this application does not restrict.
S305: The electronic device determines whether the size of the face area in the image I1 meets the preset size requirement.
Specifically, when the electronic device detects that the image I1 includes a face, it can determine the size of the face area in the image I1 and whether that size meets the preset size requirement. It can be understood that the face area can include the important features of the face, for example the eyes, nose and mouth.
It can be understood that the size of the face area refers to the area of the face region. In some embodiments of the present application, this refers to the area of the face detection frame; in other embodiments, it refers to the area of the entire face region in the image detected by the electronic device.
In some embodiments of the present application, the face detection frame may be used to select a face region containing the important features, not necessarily the complete face region; for example, the greater part of the face including the eyebrows, eyes, nose, mouth and ears. It can be understood that the shape of the face detection frame can be set according to actual needs; for example, it may be a rectangle.
In some embodiments of the present application, the size of the face area in the image I1 meeting the preset size requirement means that the area of the face region lies within a preset area range, for example [220px*220px, 230px*230px]; that is, the area is no smaller than 220px*220px and no larger than 230px*230px. Of course, this application places no restriction on the specific values of the preset area range. It can be understood that px is short for "pixel", the smallest unit of a picture or graphic.
In other embodiments of the present application, the size requirement means that the height of the face area in the image I1 lies within a preset height range and its width within a preset width range, for example [215px, 240px] for each. Of course, the preset height range and the preset width range need not coincide, and this application places no restriction on their specific values. It can be understood that the height and width of the face area mentioned in this application can be understood as the height and width of the face detection frame.
Of course, the preset size requirement may be defined in other ways as well, which this application does not restrict.
It should be noted that if the size of the face area in the image I1 meets the preset size requirement, the electronic device can continue with the subsequent steps; if it does not, the electronic device can perform adaptive zooming and reacquire an image at the adjusted focal length.
The adaptive zoom methods are briefly introduced below.
First, the relationship between focal length and object size in the image: generally, the shorter the focal length, the wider the framing range and the wider the field of view, so more objects can be captured but each appears smaller in the picture; conversely, the longer the focal length, the narrower the framing range and the smaller the field of view, so fewer objects can be captured but each occupies a larger share of the picture.
The adaptive zoom methods are described here for the case where meeting the preset size requirement means that the area of the face region in the image I1 lies within the preset area range.
It can be understood that the electronic device can determine the focal length at which the image I1 was acquired; for ease of description, this application records it as the original focal length.
Method one:
The electronic device can judge whether the area of the face region in the image I1 is smaller than the minimum of the preset area range or larger than its maximum. If the area is smaller than the minimum, the electronic device can add J1 to the original focal length to obtain the adjusted focal length and reacquire an image at that focal length. If the area is larger than the maximum, the electronic device can subtract J1 from the original focal length to obtain the adjusted focal length and reacquire an image at that focal length. J1 is a preset focal length adjustment step whose specific value can be set according to actual needs.
Method two:
The electronic device can determine the midpoint of the preset area range and the ratio of the face area to that midpoint, multiply the ratio by the original focal length to obtain the adjusted focal length, and reacquire an image at that focal length.
Similarly, when the preset size requirement is expressed as a preset height range and a preset width range, the electronic device can derive a preset area range from those ranges and then perform adaptive zooming based on the face area in the image I1, the derived area range and the original focal length; refer to the description of method one above, which is not repeated here.
Similarly, in that case the electronic device can also determine the midpoints of the preset height range and the preset width range, multiply them to obtain a preset area, and perform adaptive zooming based on that preset area, the face area in the image I1 and the original focal length; refer to the description of method two above, which is not repeated here.
It can be understood that the above is merely an example provided by this application; adaptive zooming may be realized in other specific ways, which this application does not restrict.
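As a minimal sketch of method one under the area-range formulation (the function name and the surrounding recapture loop are illustrative, not taken from the text):

```python
def adaptive_zoom(face_area: float, focal: float,
                  area_min: float, area_max: float, j1: float) -> float:
    """Method one: nudge the focal length by the preset adjustment step J1 (= j1)."""
    if face_area < area_min:
        return focal + j1   # face too small: lengthen the focal length
    if face_area > area_max:
        return focal - j1   # face too large: shorten the focal length
    return focal            # size already within the preset area range

# The device would recapture at the returned focal length and re-check the
# face size, repeating until the preset size requirement is met.
```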
It can be understood that, with other shooting conditions fixed, the size of the face area reflects the shooting distance (i.e., the distance between the camera and the face); equivalently, the size of the face area carries the depth information of the shot. If the shooting distance is too large, the eye features in the image captured through the camera may be blurred, affecting the accuracy of gaze point estimation; if the shooting distance is too small, the face features in the captured image may be incomplete, likewise affecting the accuracy. By judging the size of the face area and zooming adaptively as described above, the electronic device can capture images containing a suitably sized face, improving the accuracy of gaze point estimation.
S306: The electronic device crops the image I1 based on the face position information to obtain the face image block, and crops the image I1 based on the eye position information to obtain the left-eye image block and the right-eye image block.
The embodiments of this application provide two implementations of step S306:
First implementation: based on the coordinates of the feature points included in the face position information, the electronic device determines the circumscribed rectangle of the face area in the image I1 and crops the image along that rectangle to obtain the face image block. Similarly, based on the coordinates of the feature points included in the eye position information, the electronic device can determine the circumscribed rectangle of the left-eye area and that of the right-eye area in the image I1, and crop the image along each to obtain the left-eye image block and the right-eye image block.
It can be understood that the circumscribed rectangles mentioned in this application may be minimum bounding rectangles. A minimum bounding rectangle is the maximum extent of a set of two-dimensional shapes (for example, points, lines, polygons) given in two-dimensional coordinates, i.e., the rectangle whose boundary is set by the maximum abscissa, minimum abscissa, maximum ordinate and minimum ordinate among the vertices of the given shapes. The circumscribed rectangle of the face area can thus be understood as the minimum bounding rectangle of the facial feature points (for example, the face edge contour points); that of the left-eye area as the minimum bounding rectangle of the left-eye feature points (for example, the two left-eye corner points and the left-eye edge contour points); and that of the right-eye area as the minimum bounding rectangle of the right-eye feature points (for example, the two right-eye corner points and the right-eye edge contour points).
It can be understood that the size of the face image block equals that of the circumscribed rectangle of the face area in the image I1, the size of the left-eye image block equals that of the circumscribed rectangle of the left-eye area, and the size of the right-eye image block equals that of the circumscribed rectangle of the right-eye area.
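A minimal sketch of this first implementation, assuming NumPy, feature points given as (x, y) pairs, and row-major (y, x) image indexing; the function name is illustrative:

```python
import numpy as np

def crop_min_bounding_rect(image: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Crop an image along the minimum bounding rectangle of feature points.

    points: array of shape (N, 2) holding the (x, y) feature point coordinates.
    """
    x_min, y_min = points.min(axis=0)  # minimum abscissa / ordinate
    x_max, y_max = points.max(axis=0)  # maximum abscissa / ordinate
    # These four extremes bound the minimum bounding rectangle.
    return image[int(y_min):int(y_max) + 1, int(x_min):int(x_max) + 1]
```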
In a possible implementation, the electronic device can determine the bounding box of the facial feature points with a bounding box algorithm. The bounding box of the facial feature points can be understood as their optimal enclosing region, and the electronic device can crop the image I1 along it to obtain the face image block. Similarly, the electronic device can determine the bounding boxes of the left-eye feature points and the right-eye feature points, understood respectively as their optimal enclosing regions, and crop the image I1 along each to obtain the left-eye image block and the right-eye image block.
It can be understood that the bounding box comes from an algorithm for finding the optimal enclosing space of a discrete point set; the basic idea is to approximate a complex geometric object with a slightly larger geometric body of simple characteristics (called the bounding box). For details, refer to the relevant technical literature, which this application does not expand on.
Second implementation: the electronic device crops the image I1 based on the face position information and a preset face cropping size to obtain the face image block, and crops the image I1 based on the eye position information and a preset eye cropping size to obtain the left-eye image block and the right-eye image block.
Specifically, when the size of the face area in the image I1 meets the preset size requirement, the electronic device can determine the face area from the coordinates in the face position information and crop the image I1 around that area with the preset face cropping size, obtaining the face image block. It can be understood that the face image block has the preset face cropping size and the face area sits at its center. As noted above, the coordinates in the face position information may include the coordinates of the face edge contour points, the coordinates of the face detection frame, and the coordinates of the feature points related to the eyes, nose, mouth and ears.
Similarly, when the size of the face area in the image I1 meets the preset size requirement, the electronic device can determine the left-eye area and the right-eye area from the coordinates in the eye position information and crop the image I1 around each of them with the preset eye cropping size, obtaining the left-eye image block and the right-eye image block. The left-eye area sits at the center of the left-eye image block, and the right-eye area at the center of the right-eye image block. As noted above, the coordinates in the eye position information may include the two corner points of the left eye and the two corner points of the right eye, as well as the edge contour feature points of the eyes.
In some embodiments of the present application, the left-eye image block and the right-eye image block both have the preset eye cropping size. For example, if the preset eye cropping size is 60px*60px, the cropped left-eye and right-eye image blocks are each 60px*60px.
In some embodiments of the present application, the preset eye cropping size may include a preset left-eye cropping size and a preset right-eye cropping size, which need not coincide; the left-eye image block then has the preset left-eye cropping size, and the right-eye image block the preset right-eye cropping size.
It can be understood that the preset face cropping size and the preset eye cropping size can be set according to actual needs, which this application does not restrict. For example, the preset face cropping size may be 244px*244px and the preset eye cropping size 60px*60px.
For example, as shown in FIG. 4, the electronic device can determine the face area from the coordinates included in the face position information (for example, the coordinates of the face edge contour points), set a cropping frame of the preset face cropping size, and crop the image I1 with the face area at the center of that frame, obtaining the face image block.
For example, as shown in FIG. 5, the electronic device can determine the left-eye area and the right-eye area from the coordinates included in the eye position information, set a left-eye cropping frame and a right-eye cropping frame of the preset eye cropping size, and crop the image I1 with the left-eye and right-eye areas at the centers of their respective frames, obtaining the left-eye image block and the right-eye image block.
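A minimal sketch of this second implementation's centered crop, assuming NumPy, a region center already derived from the position information, and clamping at the image border (which the text above does not specify; the function name is illustrative):

```python
import numpy as np

def crop_centered(image: np.ndarray, center_xy, crop_w: int, crop_h: int) -> np.ndarray:
    """Crop a block of the preset cropping size centered on a region.

    center_xy: (x, y) center of the face or eye region; crop_w, crop_h:
    the preset cropping size (e.g. 244x244 for the face, 60x60 for an eye).
    """
    cx, cy = int(center_xy[0]), int(center_xy[1])
    x0 = max(cx - crop_w // 2, 0)
    y0 = max(cy - crop_h // 2, 0)
    # Shift the frame back inside the image while keeping its preset size.
    x0 = min(x0, image.shape[1] - crop_w)
    y0 = min(y0, image.shape[0] - crop_h)
    return image[y0:y0 + crop_h, x0:x0 + crop_w]
```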
S307: The electronic device determines, based on the face position information, a face grid corresponding to image I1. The face grid is used to indicate the position and size of the face within the entire image.
It can be understood that the electronic device may determine the position of the face region in image I1 based on the coordinates included in the face position information (for example, the coordinates of the edge contour feature points of the face), thereby determining the face grid corresponding to image I1. The face grid may be used to represent the position and size of the face within the entire image. It can be understood that the face grid can thus represent the distance between the face and the camera.
It can be understood that the face grid can be understood as a binary mask. A binary mask can be understood as a binary matrix corresponding to an image, that is, a matrix whose elements are all 0 or 1. Generally speaking, an image (in whole or in part) can be masked out by a binary mask. Binary masks can be used for region-of-interest extraction, occlusion, structural feature extraction, and the like.
Exemplarily, as shown in FIG. 6, the electronic device may determine the proportional relationship between the face region in image I1 and image I1 itself according to the coordinates included in the face position information, thereby obtaining depth information of the face in image I1. The electronic device may also determine that the face in image I1 is located slightly below the center of image I1. Further, the electronic device may determine the face grid corresponding to image I1.
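For illustration, the face grid can be sketched as a small rasterization routine that turns the face bounding box into a binary mask. This is a minimal sketch; the 25*25 grid resolution is an illustrative assumption, since the embodiments above do not fix a particular resolution.

    import numpy as np

    def face_grid(image_size, face_box, grid_size=25):
        """Rasterize the face bounding box into a coarse binary mask.

        image_size: (width, height) of image I1; face_box: (x, y, w, h) of
        the face region in pixels. Cells covered by the face are set to 1,
        all other cells stay 0, so the mask encodes both the position and
        the size of the face within the entire image.
        """
        img_w, img_h = image_size
        x, y, w, h = face_box
        grid = np.zeros((grid_size, grid_size), dtype=np.uint8)
        # Map pixel coordinates to grid-cell indices.
        c0 = int(x / img_w * grid_size)
        r0 = int(y / img_h * grid_size)
        c1 = int(np.ceil((x + w) / img_w * grid_size))
        r1 = int(np.ceil((y + h) / img_h * grid_size))
        grid[r0:r1, c0:c1] = 1
        return grid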
S308: The electronic device inputs the left-eye image block, the right-eye image block, the face image block, the face grid, and the pupil coordinates into a gaze point estimation network model, which outputs the gaze point coordinates.
It can be understood that the electronic device may input the left-eye image block, the right-eye image block, the face image block, the face grid, and the pupil coordinates into the gaze point estimation network model, which outputs two-dimensional coordinates. These two-dimensional coordinates are the gaze point coordinates. The gaze point estimation network model may be a neural network model comprising several branches. The model may extract the corresponding features through its branches and then combine the extracted features to estimate the gaze point coordinates.
It can be understood that a neural network is a mathematical or computational model that imitates the structure and function of a biological neural network (the central nervous system of an animal, in particular the brain). A neural network is composed of a large number of artificial neurons, and different networks are constructed according to different connection schemes. Neural networks include convolutional neural networks, recurrent neural networks, and the like.
In some embodiments of the present application, the gaze point estimation network model may include several region-of-interest pooling layers, several convolutional layers, several pooling layers, and several fully connected layers. The region-of-interest pooling layers are used to unify the size of feature maps. The convolutional layers are used to extract features. The pooling layers are used to downsample and thus reduce the amount of data. The fully connected layers are used to map the extracted features to the sample label space; in plain terms, a fully connected layer integrates the extracted features and outputs a single value.
Please refer to FIG. 7, which is a schematic architecture diagram of a gaze point estimation network model provided in an embodiment of the present application. The gaze point estimation network model may include region-of-interest (ROI) pooling layer-1, ROI pooling layer-2, CNN-1, CNN-2, CNN-3, fully connected layer-1, fully connected layer-2, fully connected layer-3, and fully connected layer-4. ROI pooling layer-1 is used to unify the size of the feature maps corresponding to the left-eye image block and the size of the feature maps corresponding to the right-eye image block. ROI pooling layer-2 is used to unify the size of the feature maps corresponding to the face image block. CNN-1, CNN-2, and CNN-3 are all convolutional neural networks (CNNs) and are used to extract left-eye features, right-eye features, and face features, respectively. CNN-1, CNN-2, and CNN-3 may each include several convolutional layers and several pooling layers. In some embodiments of the present application, CNN-1, CNN-2, and CNN-3 may also include one or more fully connected layers. Fully connected layer-1 is used to integrate the extracted left-eye features, right-eye features, and face features. Fully connected layer-2 and fully connected layer-3 are used to integrate, respectively, the depth information represented by the face grid (that is, the distance between the face and the camera) and the pupil position information represented by the pupil coordinates. Fully connected layer-4 is used to integrate the left-eye features, right-eye features, face features, depth information, and pupil position information, and to output them as a single value.
Specifically, as shown in FIG. 7, the electronic device may feed the left-eye image block and the right-eye image block into ROI pooling layer-1, and feed the face image block into ROI pooling layer-2. ROI pooling layer-1 outputs feature maps of a uniform size, and ROI pooling layer-2 likewise outputs feature maps of a uniform size. The electronic device may feed the feature map output by ROI pooling layer-1 for the left-eye image block into CNN-1, and the feature map output by ROI pooling layer-1 for the right-eye image block into CNN-2. Similarly, the electronic device may feed the feature map output by ROI pooling layer-2 into CNN-3. Further, the electronic device may feed the outputs of CNN-1, CNN-2, and CNN-3 into fully connected layer-1, and may feed the face grid and the pupil coordinates into fully connected layer-2 and fully connected layer-3, respectively. Still further, the electronic device may feed the outputs of fully connected layer-1, fully connected layer-2, and fully connected layer-3 into fully connected layer-4. Fully connected layer-4 outputs two-dimensional coordinates, which are the gaze point coordinates estimated by the electronic device.
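For illustration, the FIG. 7 data flow can be sketched as a small PyTorch module. This is a minimal sketch under stated assumptions: the channel widths, the small CNN bodies, the four pupil-coordinate inputs (x, y per eye), the 25*25 face grid, and the use of AdaptiveMaxPool2d as a stand-in for the ROI pooling layers are all illustrative choices, not the patented configuration.

    import torch
    import torch.nn as nn

    class GazeNet(nn.Module):
        """Sketch of the FIG. 7 topology; dimensions are assumptions."""
        def __init__(self, grid_size=25):
            super().__init__()
            self.roi_pool_eye = nn.AdaptiveMaxPool2d((3, 3))   # ROI pooling layer-1
            self.roi_pool_face = nn.AdaptiveMaxPool2d((3, 3))  # ROI pooling layer-2
            def cnn():  # stand-in for CNN-1 / CNN-2 / CNN-3
                return nn.Sequential(
                    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(),
                    nn.Flatten())
            self.cnn_left, self.cnn_right, self.cnn_face = cnn(), cnn(), cnn()
            self.fc1 = nn.Linear(3 * 32 * 3 * 3, 128)         # fuses the three CNN outputs
            self.fc2 = nn.Linear(grid_size * grid_size, 128)  # face-grid branch
            self.fc3 = nn.Linear(4, 128)                      # pupil-coordinate branch
            self.fc4 = nn.Linear(128 * 3, 2)                  # final 2-D gaze point

        def forward(self, left, right, face, grid, pupils):
            l = self.cnn_left(self.roi_pool_eye(left))
            r = self.cnn_right(self.roi_pool_eye(right))
            f = self.cnn_face(self.roi_pool_face(face))
            e1 = self.fc1(torch.cat([l, r, f], dim=1))
            e2 = self.fc2(grid.flatten(1).float())
            e3 = self.fc3(pupils)
            return self.fc4(torch.cat([e1, e2, e3], dim=1))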
In some embodiments of the present application, the gaze point estimation network model may include more ROI pooling layers. For example, the electronic device may feed the left-eye image block and the right-eye image block into different ROI pooling layers. Correspondingly, the electronic device may feed the outputs of these different ROI pooling layers into CNN-1 and CNN-2, respectively.
In some embodiments of the present application, the gaze point estimation network model may include more fully connected layers. It can be understood that there may be additional fully connected layers before and after fully connected layer-2, and likewise before and after fully connected layer-3. For example, the electronic device may feed the output of fully connected layer-2 into fully connected layer-5, and the output of fully connected layer-5 into fully connected layer-4. For example, the electronic device may feed the output of fully connected layer-3 into fully connected layer-6, and the output of fully connected layer-6 into fully connected layer-4. For another example, the electronic device may feed the output of fully connected layer-4 into fully connected layer-7, and the output of fully connected layer-7 is then the gaze point coordinates estimated by the electronic device.
Exemplarily, FIG. 8 is a schematic architecture diagram of another gaze point estimation network model provided in an embodiment of the present application. The gaze point estimation network model may include ROI pooling layer-1, ROI pooling layer-2, CNN-1, CNN-2, CNN-3, fully connected layer-1, fully connected layer-2, fully connected layer-3, fully connected layer-4, fully connected layer-5, fully connected layer-6, and fully connected layer-7. For the functions of ROI pooling layer-1, ROI pooling layer-2, CNN-1, CNN-2, CNN-3, and fully connected layer-1, reference may be made to the description above, which is not repeated here. Fully connected layer-2 and fully connected layer-5 are used to integrate the depth information represented by the face grid. Fully connected layer-3 and fully connected layer-6 are used to integrate the pupil position information represented by the pupil coordinates. Fully connected layer-4 and fully connected layer-7 are used to integrate the left-eye features, right-eye features, face features, depth information, and pupil position information, and to output them as a single value. As shown in FIG. 8, the electronic device may feed the output of fully connected layer-2 into fully connected layer-5, the output of fully connected layer-3 into fully connected layer-6, and the outputs of fully connected layer-1, fully connected layer-5, and fully connected layer-6 into fully connected layer-4. The electronic device may further feed the output of fully connected layer-4 into fully connected layer-7, and the output of fully connected layer-7 is then the gaze point coordinates estimated by the electronic device.
Exemplarily, FIG. 9 is a schematic architecture diagram of yet another gaze point estimation network model provided in an embodiment of the present application. The gaze point estimation network model may include ROI pooling layer-1, ROI pooling layer-2, CNN-1, CNN-2, CNN-3, fully connected layer-2, fully connected layer-3, fully connected layer-4, fully connected layer-5, fully connected layer-6, and fully connected layer-7. For the functions of ROI pooling layer-1, ROI pooling layer-2, CNN-1, CNN-2, and CNN-3, reference may be made to the description above, which is not repeated here. Fully connected layer-2 and fully connected layer-5 are used to integrate the depth information represented by the face grid. Fully connected layer-3 and fully connected layer-6 are used to integrate the pupil position information represented by the pupil coordinates. Fully connected layer-4 and fully connected layer-7 are used to integrate the left-eye features, right-eye features, face features, depth information, and pupil position information, and to output them as a single value. As shown in FIG. 9, the electronic device may feed the output of fully connected layer-2 into fully connected layer-5, the output of fully connected layer-3 into fully connected layer-6, and the outputs of fully connected layer-5 and fully connected layer-6 into fully connected layer-4. The electronic device may further feed the output of fully connected layer-4 into fully connected layer-7, and the output of fully connected layer-7 is then the gaze point coordinates estimated by the electronic device.
In some embodiments of the present application, the gaze point estimation network model may further include several activation layers. For example, in the gaze point estimation network model shown in FIG. 7, an activation layer may be placed between fully connected layer-1 and fully connected layer-4, between fully connected layer-2 and fully connected layer-4, and between fully connected layer-3 and fully connected layer-4. For another example, in the gaze point estimation network model shown in FIG. 8, an activation layer may be placed between fully connected layer-1 and fully connected layer-4, between fully connected layer-2 and fully connected layer-5, between fully connected layer-5 and fully connected layer-4, between fully connected layer-3 and fully connected layer-6, between fully connected layer-6 and fully connected layer-4, and between fully connected layer-4 and fully connected layer-7. For yet another example, in the gaze point estimation network model shown in FIG. 9, an activation layer may be placed between fully connected layer-2 and fully connected layer-5, between fully connected layer-5 and fully connected layer-4, between fully connected layer-3 and fully connected layer-6, between fully connected layer-6 and fully connected layer-4, and between fully connected layer-4 and fully connected layer-7.
The following describes the various parts of the gaze point estimation network model, taking the models shown in FIG. 7, FIG. 8, and FIG. 9 as examples.
1. Region of Interest Pooling Layer
A region of interest (ROI) is, in machine vision and image processing, a region to be processed that is outlined on the image being processed in the form of a box, circle, ellipse, irregular polygon, or the like.
The ROI pooling layer is a type of pooling layer. The electronic device may divide the ROI of an image input to the ROI pooling layer into sections of the same size and perform a max pooling operation on each section; the resulting processed feature map is the output of the ROI pooling layer. The number of sections is consistent with the dimensions of the feature map output by the ROI pooling layer.
The following example illustrates the processing in ROI pooling layer-1.
Exemplarily, as shown in FIG. 10A, after left-eye image block-1 is input to ROI pooling layer-1 of the gaze point estimation network model in the electronic device, the electronic device may divide the ROI of left-eye image block-1 into 3*3 sections of the same size and perform max pooling on each section (that is, take the maximum value of each section). The electronic device thereby obtains feature map-1 corresponding to the max-pooled ROI and may use feature map-1 as the output of ROI pooling layer-1. The size of feature map-1 is 3*3; that is, feature map-1 can be understood as a 3*3 matrix. It can be understood that the ROI of left-eye image block-1 is the entire left-eye image block-1.
Exemplarily, as shown in FIG. 10B, after left-eye image block-2 is input to ROI pooling layer-1 of the gaze point estimation network model in the electronic device, the electronic device may divide the ROI of left-eye image block-2 into 3*3 sections of the same size and perform max pooling on each section (that is, take the maximum value of each section). The electronic device thereby obtains feature map-2 corresponding to the max-pooled ROI and may use feature map-2 as the output of ROI pooling layer-1. The size of feature map-2 is 3*3; that is, feature map-2 can be understood as a 3*3 matrix. It can be understood that the ROI of left-eye image block-2 is the entire left-eye image block-2.
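For illustration, the section-wise max pooling described above can be sketched with NumPy. This is a minimal single-channel sketch; np.array_split approximates the division behavior described here in that, when the ROI size is not evenly divisible, some sections are one element larger than the others.

    import numpy as np

    def roi_max_pool(roi, out_h=3, out_w=3):
        """Divide a 2-D ROI into out_h*out_w sections and max-pool each."""
        rows = np.array_split(roi, out_h, axis=0)
        return np.array([[block.max()
                          for block in np.array_split(row, out_w, axis=1)]
                         for row in rows])

    # A 6*6 single-channel patch pooled to a 3*3 feature map:
    patch = np.arange(36).reshape(6, 6)
    print(roi_max_pool(patch))  # each entry is the max of one 2*2 section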
It can be understood that, when left-eye image block-1 and left-eye image block-2 are RGB images, FIG. 10A and FIG. 10B show the processing of one of the three RGB channels.
In some embodiments of the present application, the ROI of the input image may be divided into several sections, each of which contains data. The data contained in a section can be understood as the elements of the corresponding region of the matrix corresponding to the ROI of the input image.
In some embodiments of the present application, the electronic device may divide the ROI of an image input to the ROI pooling layer based on a preset feature map size. For example, the preset feature map size may be 10*10. If the size of the ROI of the image input to the ROI pooling layer is 100*100, the electronic device may evenly divide the ROI into 10*10 sections, each of size 10*10.
It can be understood that, in one possible implementation, if either the width of the ROI is not evenly divisible by the width of the preset feature map, or the height of the ROI is not evenly divisible by the height of the preset feature map, the electronic device may be unable to divide the ROI evenly. In this case, the electronic device may perform a zero-padding operation, or, while keeping most sections the same size, make a certain column or row of sections slightly larger or smaller.
Exemplarily, the preset feature map size may be 10*10 and the size of the ROI of the image input to the ROI pooling layer may be 101*101. The electronic device may divide the ROI into 9*9 sections of size 10*10, 9 sections of size 10*11, 9 sections of size 11*10, and 1 section of size 11*11.
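A short NumPy check of this uneven division, for illustration only (np.array_split places the larger section first, whereas the position of the larger section in the embodiment above may differ):

    import numpy as np

    # Splitting 101 elements into 10 sections yields one section of 11
    # elements and nine sections of 10, matching the sizes described above.
    sections = np.array_split(np.arange(101), 10)
    print([len(s) for s in sections])  # [11, 10, 10, 10, 10, 10, 10, 10, 10, 10]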
It can be understood that the feature maps output by ROI pooling layer-1 are all of the same size. Similarly, the feature maps output by ROI pooling layer-2 are all of the same size. Exemplarily, as shown in FIG. 10A and FIG. 10B, feature map-1 and feature map-2, obtained by inputting left-eye image block-1 and left-eye image block-2 into ROI pooling layer-1, are both of size 3*3.
It should be noted that the size of the feature map output by the ROI pooling layer is not limited to the above examples, and this application does not restrict it.
It can be understood that, when the image input to the ROI pooling layer is an RGB image, three feature maps are output. As shown in FIG. 11, left-eye image block-3 is an RGB image. When the size of left-eye image block-3 is 60*60, left-eye image block-3 can be represented as a 60*60*3 matrix whose elements include the values of the three RGB channels of each pixel in left-eye image block-3. The electronic device may input left-eye image block-3 into ROI pooling layer-1, which may output three 3*3 feature maps corresponding to the three RGB channels.
It can be understood that, when the image input to the ROI pooling layer is a grayscale image, one feature map is output. For example, when left-eye image block-1 is a grayscale image, the processing of inputting it into ROI pooling layer-1 can be seen in FIG. 10A.
2. CNN
CNN refers to a convolutional neural network, which is a type of neural network. A CNN may include convolutional layers, pooling layers, and fully connected layers. Each convolutional layer in a convolutional neural network is composed of several convolution units, and the parameters of each convolution unit are optimized by a back-propagation algorithm. The purpose of the convolution operation is to extract different features of the input. The first convolutional layer may only be able to extract low-level features such as edges, lines, and corners; networks with more layers can iteratively extract more complex features from the low-level features. Pooling is, in essence, downsampling. The main function of a pooling layer is to reduce the amount of computation by reducing the parameters of the network, and it can control overfitting to a certain extent. The operations performed by a pooling layer generally include max pooling, average pooling, and the like.
As described above, CNN-1, CNN-2, and CNN-3 may each include several convolutional layers and several pooling layers. It can be understood that CNN-1, CNN-2, and CNN-3 may also include several activation layers. An activation layer, also called a neuron layer, is chiefly characterized by the choice of activation function, which may include ReLU, PReLU, Sigmoid, and the like. In an activation layer, the electronic device applies an activation operation, which can also be understood as a function transformation, to the input data.
Exemplarily, as shown in FIG. 12, CNN-1 may include 4 convolutional layers and 4 activation layers. The 4 convolutional layers are: convolutional layer-1, convolutional layer-2, convolutional layer-3, and convolutional layer-4. The 4 activation layers are: activation layer-1, activation layer-2, activation layer-3, and activation layer-4. It can be understood that the convolution kernels (i.e., filters) of these 4 convolutional layers may be of size 3*3.
Exemplarily, as shown in FIG. 13, CNN-3 may include 4 convolutional layers, 4 activation layers, and 4 pooling layers. The 4 convolutional layers are: convolutional layer-1, convolutional layer-2, convolutional layer-3, and convolutional layer-4. The 4 activation layers are: activation layer-1, activation layer-2, activation layer-3, and activation layer-4. The 4 pooling layers are: pooling layer-1, pooling layer-2, pooling layer-3, and pooling layer-4. It can be understood that the convolution kernels (i.e., filters) of these 4 convolutional layers may be of size 3*3, and the stride of the 4 pooling layers may be 2 (for example, max pooling is performed over every 2*2 "cells"). It can be understood that the convolutional layers may also apply zero padding to the feature maps; for a description of the zero-padding operation, reference may be made to the relevant technical literature, which is not expanded upon here.
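For illustration, the CNN-1 and CNN-3 bodies described above can be sketched in PyTorch. The 3*3 kernels, the conv/activation pairing, and the stride-2 pooling follow the text; the ReLU activations and the channel widths (16/32/64/64) are illustrative assumptions.

    import torch.nn as nn

    def conv_act(c_in, c_out):
        # one 3*3 convolutional layer followed by one activation layer
        return [nn.Conv2d(c_in, c_out, kernel_size=3, padding=1), nn.ReLU()]

    # CNN-1 (FIG. 12): four convolutional layers, each followed by an
    # activation layer.
    cnn_1 = nn.Sequential(
        *conv_act(3, 16), *conv_act(16, 32), *conv_act(32, 64), *conv_act(64, 64))

    # CNN-3 (FIG. 13): the same four conv+activation pairs, each additionally
    # followed by a stride-2 max pooling layer (one value per 2*2 cells).
    cnn_3 = nn.Sequential(
        *conv_act(3, 16), nn.MaxPool2d(2),
        *conv_act(16, 32), nn.MaxPool2d(2),
        *conv_act(32, 64), nn.MaxPool2d(2),
        *conv_act(64, 64), nn.MaxPool2d(2))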
In some embodiments of the present application, the structure of CNN-2 may be the same as that of CNN-1. In other embodiments of the present application, the structures of CNN-2, CNN-3, and CNN-1 may all be the same.
It can be understood that the structures of CNN-1, CNN-2, and CNN-3 may also take other forms, are not limited to the above examples, and are not restricted by this application.
3. Fully Connected Layer
As described above, the fully connected layer is used to map the extracted features to the sample label space; in plain terms, a fully connected layer integrates the extracted features and outputs a single value.
Exemplarily, in the gaze point estimation network model shown in FIG. 9, the number of neurons in fully connected layer-1 is 128, the numbers of neurons in fully connected layer-2 and fully connected layer-3 are both 256, the numbers of neurons in fully connected layer-5 and fully connected layer-6 are both 128, the number of neurons in fully connected layer-4 is 128, and the number of neurons in fully connected layer-7 is 2.
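For illustration, a fully connected head with these neuron counts can be sketched as follows. This is a minimal sketch; the input widths (864 fused CNN features, a 25*25 face grid, four pupil coordinates) are assumptions, and since the counts above mention fully connected layer-1 while the FIG. 9 description feeds only layers-5 and -6 into layer-4, the sketch follows the FIG. 8 wiring, in which layer-1 is also fused.

    import torch.nn as nn

    # Neuron counts as stated above: fc1 = 128, fc2 = fc3 = 256,
    # fc5 = fc6 = 128, fc4 = 128, fc7 = 2.
    fc1 = nn.Linear(864, 128)      # fused left-eye / right-eye / face features
    fc2 = nn.Linear(25 * 25, 256)  # face-grid (depth information) branch
    fc5 = nn.Linear(256, 128)
    fc3 = nn.Linear(4, 256)        # pupil-coordinate branch
    fc6 = nn.Linear(256, 128)
    fc4 = nn.Linear(128 * 3, 128)  # fuses the fc1, fc5, and fc6 outputs
    fc7 = nn.Linear(128, 2)        # final two-dimensional gaze point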
It can be understood that the number of neurons in the fully connected layers of the gaze point estimation network model may also take other values, is not limited to the above examples, and is not restricted by this application.
It is worth noting that, in some embodiments of the present application, the electronic device can obtain the eye position information and the pupil coordinates during face detection, in which case the electronic device does not need to perform step S304.
In some embodiments of the present application, the electronic device does not need to determine whether the size of the face region in image I1 meets the preset size requirement. That is, the electronic device does not need to perform step S305.
The following describes the apparatus involved in the embodiments of the present application.
FIG. 14 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present application.
The electronic device may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (USB) interface 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, a headset jack 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a subscriber identification module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, a barometric pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It can be understood that the structure illustrated in this embodiment of the present invention does not constitute a specific limitation on the electronic device. In other embodiments of the present application, the electronic device may include more or fewer components than shown, combine certain components, split certain components, or arrange the components differently. The illustrated components may be implemented in hardware, in software, or in a combination of software and hardware.
The processor 110 may include one or more processing units. For example, the processor 110 may include an application processor (AP), a modem processor, a graphics processing unit (GPU), an image signal processor (ISP), a controller, a memory, a video codec, a digital signal processor (DSP), a baseband processor, and/or a neural-network processing unit (NPU), among others. Different processing units may be independent devices or may be integrated in one or more processors.
The controller may be the nerve center and command center of the electronic device. The controller may generate operation control signals according to instruction operation codes and timing signals, completing the control of instruction fetching and instruction execution.
In the embodiments provided in the present application, the electronic device may execute the gaze point estimation method through the processor 110.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache. This memory may store instructions or data that the processor 110 has just used or uses cyclically. If the processor 110 needs to use the instructions or data again, it can call them directly from this memory, which avoids repeated accesses, reduces the waiting time of the processor 110, and thus improves the efficiency of the system.
In some embodiments, the processor 110 may include one or more interfaces. The USB interface 130 is an interface that complies with the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type-C interface, or the like. The interfaces included in the processor 110 may also be used to connect other electronic devices, such as AR devices.
The charging management module 140 is used to receive charging input from a charger. While charging the battery 142, the charging management module 140 can also supply power to the electronic device through the power management module 141.
The wireless communication function of the electronic device can be implemented through the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, the modem processor, the baseband processor, and so on.
The antenna 1 and the antenna 2 are used to transmit and receive electromagnetic wave signals. Each antenna in the electronic device can be used to cover a single communication frequency band or multiple communication frequency bands. Different antennas can also be multiplexed to improve antenna utilization.
The mobile communication module 150 can provide solutions for wireless communication, including 2G/3G/4G/5G, applied to the electronic device.
The wireless communication module 160 can provide solutions for wireless communication applied to the electronic device, including wireless local area networks (WLAN) (such as wireless fidelity (Wi-Fi) networks), Bluetooth (BT), the global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), infrared (IR) technology, and the like.
In some embodiments, the antenna 1 of the electronic device is coupled to the mobile communication module 150, and the antenna 2 is coupled to the wireless communication module 160, so that the electronic device can communicate with networks and other devices through wireless communication technologies.
The electronic device implements display functions through the GPU, the display screen 194, the application processor, and so on. The GPU is a microprocessor for image processing and connects the display screen 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display screen 194 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini LED, a Micro LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the electronic device may include 1 or N display screens 194, where N is a positive integer greater than 1.
The electronic device can implement the image capture function through the ISP, the camera 193, the video codec, the GPU, the display screen 194, the application processor, and so on.
The ISP is used to process the data fed back by the camera 193. For example, when taking a photo, the shutter is opened and light is transmitted through the lens to the photosensitive element of the camera; the light signal is converted into an electrical signal, and the photosensitive element of the camera passes the electrical signal to the ISP for processing, converting it into an image or video visible to the naked eye. The ISP can also perform algorithmic optimization of the noise, brightness, and color of the image, and can optimize parameters such as the exposure and color temperature of the shooting scene. In some embodiments, the ISP may be provided in the camera 193.
The camera 193 is used to capture still images or videos. An object generates an optical image through the lens, which is projected onto the photosensitive element. The photosensitive element may be a charge-coupled device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal and then passes the electrical signal to the ISP to be converted into a digital image or video signal. The ISP outputs the digital image or video signal to the DSP for processing. The DSP converts the digital image or video signal into an image or video signal in a standard format such as RGB or YUV.
The digital signal processor is used to process digital signals; in addition to digital image or video signals, it can also process other digital signals. For example, when the electronic device selects a frequency point, the digital signal processor is used to perform a Fourier transform or the like on the frequency point energy.
Video codecs are used to compress or decompress digital video. The electronic device may support one or more video codecs, so that the electronic device can play or record videos in multiple encoding formats, for example: Moving Picture Experts Group (MPEG) 1, MPEG2, MPEG3, MPEG4, and so on.
The external memory interface 120 can be used to connect an external memory card, such as a Micro SD card, to expand the storage capacity of the electronic device. The external memory card communicates with the processor 110 through the external memory interface 120 to implement a data storage function, for example saving files such as music and videos on the external memory card.
The internal memory 121 can be used to store computer-executable program code, which includes instructions. The processor 110 executes the various functional applications and data processing of the electronic device by running the instructions stored in the internal memory 121. The internal memory 121 may include a program storage area and a data storage area. The program storage area may store the operating system and the applications required by at least one function (such as a sound playback function or an image and video playback function). The data storage area may store data created during the use of the electronic device (such as audio data and a phone book).
The electronic device can implement audio functions, such as music playback and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the headset jack 170D, the application processor, and so on.
The sensor module 180 may include one or more sensors, which may be of the same type or of different types. It can be understood that the sensor module 180 shown in FIG. 14 is only an exemplary division; other divisions are possible, and this application does not limit them.
The pressure sensor 180A is used to sense pressure signals and can convert pressure signals into electrical signals. In some embodiments, the pressure sensor 180A may be provided on the display screen 194. When a touch operation acts on the display screen 194, the electronic device detects the intensity of the touch operation through the pressure sensor 180A. The electronic device may also calculate the touch position from the detection signal of the pressure sensor 180A. In some embodiments, touch operations acting on the same touch position but with different intensities may correspond to different operation instructions.
The gyroscope sensor 180B may be used to determine the motion attitude of the electronic device. In some embodiments, the angular velocities of the electronic device about three axes (i.e., the x, y, and z axes) may be determined through the gyroscope sensor 180B. The gyroscope sensor 180B may be used for image stabilization during shooting.
The acceleration sensor 180E can detect the magnitude of the acceleration of the electronic device in various directions (generally along three axes). When the electronic device is stationary, it can detect the magnitude and direction of gravity. It can also be used to recognize the attitude of the electronic device, and is applied to landscape/portrait switching, pedometers, and similar applications.
The distance sensor 180F is used to measure distance. The electronic device can measure distance by infrared or laser. In some embodiments, when shooting a scene, the electronic device can use the distance sensor 180F to measure distance so as to achieve fast focusing.
The touch sensor 180K is also called a "touch panel". The touch sensor 180K may be provided on the display screen 194; the touch sensor 180K and the display screen 194 form a touchscreen, also called a "touch screen". The touch sensor 180K is used to detect touch operations acting on or near it. The touch sensor can pass the detected touch operation to the application processor to determine the type of touch event. Visual output related to the touch operation can be provided through the display screen 194. In some other embodiments, the touch sensor 180K may also be disposed on the surface of the electronic device, at a position different from that of the display screen 194.
The barometric pressure sensor 180C is used to measure air pressure. The magnetic sensor 180D includes a Hall sensor. The proximity light sensor 180G may include, for example, a light-emitting diode (LED) and a light detector such as a photodiode; the electronic device uses the photodiode to detect infrared light reflected from nearby objects. The ambient light sensor 180L is used to sense the brightness of ambient light. The fingerprint sensor 180H is used to capture fingerprints. The temperature sensor 180J is used to detect temperature. The bone conduction sensor 180M can acquire vibration signals.
The button 190 includes a power button, volume buttons, and the like. The button 190 may be a mechanical button or a touch button. The electronic device can receive button input and generate key signal input related to the user settings and function control of the electronic device. The motor 191 can generate vibration alerts; it can be used for incoming-call vibration alerts as well as for touch vibration feedback. The indicator 192 may be an indicator light, which can be used to indicate the charging status and changes in battery level, and can also be used to indicate messages, missed calls, notifications, and so on. The SIM card interface 195 is used to connect a SIM card.
FIG. 15 is a schematic diagram of the software structure of an electronic device provided in an embodiment of the present application.
As shown in FIG. 15, the software framework of the electronic device involved in the present application may include an application layer, an application framework layer (FWK), system libraries, the Android runtime, a hardware abstraction layer, and a kernel layer.
The application layer may include a series of application packages, for example camera, gallery, calendar, phone, maps, navigation, WLAN, Bluetooth, music, video, and messaging applications (which may also be called apps). The camera is used to capture images and videos. For the other applications of the application layer, reference may be made to the introductions and descriptions in conventional technology, which are not expanded upon in this application. In this application, an application on the electronic device may be a native application (for example, an application installed on the electronic device when the operating system was installed, before the electronic device left the factory) or a third-party application (for example, an application downloaded and installed by the user through an application store); the embodiments of this application do not limit this.
The application framework layer provides an application programming interface (API) and a programming framework for the applications in the application layer. The application framework layer includes some predefined functions.
As shown in FIG. 15, the application framework layer may include a window manager, content providers, a view system, a telephony manager, a resource manager, a notification manager, and so on.
The window manager is used to manage window programs. The window manager can obtain the display screen size, determine whether there is a status bar, lock the screen, capture the screen, and so on.
Content providers are used to store and retrieve data and make that data accessible to applications. The data may include videos, images, audio, calls made and received, browsing history and bookmarks, phone books, and so on.
The view system includes visual controls, such as controls for displaying text and controls for displaying images. The view system can be used to build applications. A display interface may be composed of one or more views; for example, a display interface including a text-message notification icon may include a view for displaying text and a view for displaying images.
The telephony manager is used to provide the communication functions of the electronic device, for example the management of call states (including connected, hung up, and so on).
The resource manager provides applications with various resources, such as localized strings, icons, images, layout files, and video files.
The notification manager enables applications to display notification information in the status bar. It can be used to convey notification-type messages, which can disappear automatically after a short stay without user interaction. For example, the notification manager is used to give notice of download completion, message reminders, and so on. The notification manager may also present notifications that appear in the status bar at the top of the system in the form of charts or scroll-bar text, such as notifications from applications running in the background, or notifications that appear on the screen in the form of a dialog interface. For example, text information may be shown in the status bar, an alert sound may be emitted, the electronic device may vibrate, or an indicator light may flash.
运行时(Runtime)包括核心库和虚拟机。Runtime负责系统的调度和管理。 The runtime includes the core library and the virtual machine. The runtime is responsible for the scheduling and management of the system.
核心库包含两部分:一部分是编程语言(例如,java语言)需要调用的功能函数,另一部分是系统的核心库。The core library consists of two parts: one part is the function that the programming language (for example, Java language) needs to call, and the other part is the core library of the system.
应用程序层和应用程序框架层运行在虚拟机中。虚拟机将应用程序层和应用程序框架层的编程文件(例如,java文件)执行为二进制文件。虚拟机用于执行对象生命周期的管理,堆栈管理,线程管理,安全和异常的管理,以及垃圾回收等功能。The application layer and the application framework layer run in a virtual machine. The virtual machine executes the programming files (e.g., java files) of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object life cycle management, stack management, thread management, security and exception management, and garbage collection.
系统库可以包括多个功能模块。例如:表面管理器(Surface Manager),媒体库(Media Libraries),三维图形处理库(例如:OpenGL ES),二维图形引擎(例如:SGL)等。The system library can include multiple functional modules, such as surface manager, media library, 3D graphics processing library (such as OpenGL ES), 2D graphics engine (such as SGL), etc.
表面管理器用于对显示子系统进行管理,并且为多个应用程序提供了二维(2-Dimensional,2D)和三维(3-Dimensional,3D)图层的融合。The surface manager is used to manage the display subsystem and provides the fusion of two-dimensional (2D) and three-dimensional (3D) layers for multiple applications.
媒体库支持多种常用的音频,视频格式回放和录制,以及静态图像文件等。媒体库可以支持多种音视频编码格式,例如:MPEG4,H.264,MP3,AAC,AMR,JPG,PNG等。The media library supports playback and recording of a variety of commonly used audio and video formats, as well as static image files, etc. The media library can support a variety of audio and video encoding formats, such as: MPEG4, H.264, MP3, AAC, AMR, JPG, PNG, etc.
三维图形处理库用于实现3D图形绘图,图像渲染,合成,和图层处理等。The 3D graphics processing library is used to implement 3D graphics drawing, image rendering, compositing, and layer processing.
2D图形引擎是2D绘图的绘图引擎。A 2D graphics engine is a drawing engine for 2D drawings.
硬件抽象层(HAL)是位于操作系统内核与上层软件之间的接口层,其目的在于将硬件抽象化。硬件抽象层是设备内核驱动的抽象接口,用于实现向更高级别的Java API框架提供访问底层设备的应用编程接口。HAL包含多个库模块,例如相机HAL、显示屏、蓝牙、音频等。其中每个库模块都为特定类型的硬件组件实现一个接口。当系统框架层API要求访问便携设备的硬件时,Android操作系统将为该硬件组件加载库模块。The Hardware Abstraction Layer (HAL) is an interface layer between the operating system kernel and the upper software, and its purpose is to abstract the hardware. The hardware abstraction layer is an abstract interface driven by the device kernel, which is used to implement the application programming interface that provides access to the underlying device to the higher-level Java API framework. HAL contains multiple library modules, such as camera HAL, display, Bluetooth, audio, etc. Each of these library modules implements an interface for a specific type of hardware component. When the system framework layer API requires access to the hardware of a portable device, the Android operating system will load the library module for the hardware component.
内核层是Android操作系统的基础,Android操作系统最终的功能都是通过内核层完成。内核层至少包含显示驱动,摄像头驱动,音频驱动,传感器驱动,虚拟卡驱动。The kernel layer is the foundation of the Android operating system. The final functions of the Android operating system are completed through the kernel layer. The kernel layer at least includes display driver, camera driver, audio driver, sensor driver, and virtual card driver.
It should be noted that the software structure diagram of the electronic device shown in FIG. 15 of this application is provided only as an example and does not limit the specific division of modules in the different layers of the Android operating system; for details, reference may be made to the introduction of the Android operating system software structure in the conventional art. In addition, the method provided in this application can also be implemented based on other operating systems, which are not enumerated here one by one.
The above embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of this application.

Claims (11)

  1. A gaze point estimation method, wherein the method is applied to an electronic device provided with a camera, and the method comprises:
    collecting a first image through the camera;
    when a face detection result satisfies a preset face condition, acquiring face position information and eye position information in the first image, wherein the face position information comprises coordinates of relevant feature points of a face region, and the eye position information comprises coordinates of relevant feature points of an eye region; and
    in a process of processing a target image block based on a gaze point estimation network model, processing a region of interest (ROI) of the target image block at a corresponding preset feature map size, based on a region-of-interest pooling module of the gaze point estimation network model, to obtain a feature map, wherein the target image block comprises at least one type of image block among a face image block, a left-eye image block, and a right-eye image block, and each type of image block corresponds to its own preset feature map size;
    wherein the face image block is an image block obtained by cropping the face region in the first image based on the face position information, the left-eye image block is an image block obtained by cropping a left-eye region in the first image based on the eye position information, and the right-eye image block is an image block obtained by cropping a right-eye region in the first image based on the eye position information.
  2. The method according to claim 1, wherein processing the ROI of the target image block at the corresponding preset feature map size to obtain the feature map specifically comprises:
    dividing the ROI of the target image block based on the corresponding preset feature map size to obtain a number of block regions; and
    performing max pooling on each block region in the ROI of the target image block to obtain the feature map;
    wherein the number of block regions in each row of the ROI of the target image block equals the width value of the corresponding preset feature map size, and the number of block regions in each column of the ROI of the target image block equals the height value of the corresponding preset feature map size.
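As an illustration of the pooling described in claim 2, the following is a minimal sketch in Python/NumPy. The function name roi_max_pool and the rounding used for uneven bin boundaries are assumptions for illustration, not details fixed by the claim.

```python
import numpy as np

def roi_max_pool(roi: np.ndarray, out_h: int, out_w: int) -> np.ndarray:
    """Divide an ROI (H x W) into out_h x out_w block regions and max-pool
    each block, so each row holds out_w blocks (the width value of the
    preset feature map size) and each column holds out_h blocks (the
    height value). Assumes the ROI is at least out_h x out_w pixels."""
    h, w = roi.shape
    # Rounded bin edges; how uneven splits are handled is an assumption here.
    row_edges = np.linspace(0, h, out_h + 1).round().astype(int)
    col_edges = np.linspace(0, w, out_w + 1).round().astype(int)
    out = np.empty((out_h, out_w), dtype=roi.dtype)
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = roi[row_edges[i]:row_edges[i + 1],
                            col_edges[j]:col_edges[j + 1]].max()
    return out
```

A fixed output size regardless of the input ROI size is what allows image blocks of varying crop sizes to feed the same downstream convolution layers.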
  3. The method according to claim 2, wherein, when the target image block comprises the face image block, the left-eye image block, and the right-eye image block, dividing the ROI of the target image block based on the corresponding preset feature map size to obtain a number of block regions specifically comprises:
    determining the ROI of the face image block, and dividing the ROI of the face image block based on a first preset feature map size to obtain a number of face block regions;
    determining the ROI of the left-eye image block, and dividing the ROI of the left-eye image block based on a second preset feature map size to obtain a number of left-eye block regions; and
    determining the ROI of the right-eye image block, and dividing the ROI of the right-eye image block based on a third preset feature map size to obtain a number of right-eye block regions;
    performing max pooling on each block region in the ROI of the target image block to obtain the feature map specifically comprises:
    performing max pooling on each face block region in the ROI of the face image block to obtain a first feature map, the first feature map being the feature map corresponding to the ROI of the face image block;
    performing max pooling on each left-eye block region in the ROI of the left-eye image block to obtain a second feature map, the second feature map being the feature map corresponding to the ROI of the left-eye image block; and
    performing max pooling on each right-eye block region in the ROI of the right-eye image block to obtain a third feature map, the third feature map being the feature map corresponding to the ROI of the right-eye image block; and
    the numbers of block regions in each row and each column of the ROI of the target image block matching the width and height values of the corresponding preset feature map size specifically comprises:
    the number of face block regions in each row of the ROI of the face image block equals the width value of the first preset feature map size, and the number of face block regions in each column equals the height value of the first preset feature map size; the number of left-eye block regions in each row of the ROI of the left-eye image block equals the width value of the second preset feature map size, and the number of left-eye block regions in each column equals the height value of the second preset feature map size; and the number of right-eye block regions in each row of the ROI of the right-eye image block equals the width value of the third preset feature map size, and the number of right-eye block regions in each column equals the height value of the third preset feature map size.
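Continuing the sketch above, claim 3 simply applies the same pooling three times with per-block-type output sizes. The concrete sizes below, and the already-extracted ROI arrays face_roi, left_eye_roi, and right_eye_roi, are hypothetical placeholders.

```python
# Hypothetical preset feature map sizes (height, width) per image block type.
FACE_FMAP, LEFT_FMAP, RIGHT_FMAP = (7, 7), (5, 5), (5, 5)

first_feature_map = roi_max_pool(face_roi, *FACE_FMAP)        # face ROI
second_feature_map = roi_max_pool(left_eye_roi, *LEFT_FMAP)   # left-eye ROI
third_feature_map = roi_max_pool(right_eye_roi, *RIGHT_FMAP)  # right-eye ROI
```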
  4. The method according to any one of claims 1 to 3, wherein the subject of the first image is a target object, and after the first image is collected through the camera, the method further comprises:
    when the face detection result satisfies the preset face condition, acquiring pupil coordinates in the first image; and
    determining the position and size of the face region in the first image based on the face position information to obtain a face grid corresponding to the first image, the face grid being used to represent the distance between the target object and the camera;
    and after the feature map is obtained, the method further comprises:
    performing convolution on the feature map based on a convolution module of the gaze point estimation network model to extract eye features and/or face features; and
    integrating the eye features and/or face features, the face grid, and the pupil coordinates based on a fusion module of the gaze point estimation network model to obtain gaze point coordinates of the target object.
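A rough sketch of the face grid and the fusion step of claim 4, written in PyTorch. The grid resolution, the layer widths, and the four-value pupil vector (x, y for each eye) are assumptions for illustration; the claim itself fixes only which inputs are integrated.

```python
import math
import torch
import torch.nn as nn

def face_grid(face_box, img_w, img_h, grid=25):
    """Binary grid marking where the face box lies in the frame; the area
    it covers serves as a proxy for the target object's distance to the
    camera. grid=25 is an assumed resolution."""
    x, y, w, h = face_box  # face region position and size in pixels
    g = torch.zeros(grid, grid)
    x0, y0 = int(x / img_w * grid), int(y / img_h * grid)
    x1 = min(grid, math.ceil((x + w) / img_w * grid))
    y1 = min(grid, math.ceil((y + h) / img_h * grid))
    g[y0:y1, x0:x1] = 1.0
    return g.flatten()

class FusionModule(nn.Module):
    """Fully connected layers integrating eye/face features, the face grid,
    and the pupil coordinates into (x, y) gaze point coordinates."""
    def __init__(self, feat_dim=256, grid_dim=25 * 25, pupil_dim=4):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(feat_dim + grid_dim + pupil_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 2),  # gaze point (x, y) on the screen
        )

    def forward(self, feats, grid, pupils):
        # Inputs are batched 2-D tensors concatenated along the feature axis.
        return self.fc(torch.cat([feats, grid, pupils], dim=1))
```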
  5. The method according to any one of claims 1 to 4, wherein the face detection result satisfying the preset face condition specifically comprises: a face is detected in the first image.
  6. The method according to any one of claims 1 to 4, wherein the face detection result satisfying the preset face condition specifically comprises: a face is detected in the first image, and the size of the face region in the first image meets a preset size requirement;
    and after the first image is collected through the camera, the method further comprises:
    when a face is detected in the first image but the size of the face region in the first image does not meet the preset size requirement, performing adaptive zooming, and recapturing an image based on the focal length after the adaptive zooming.
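One plausible reading of the adaptive zooming in claim 6, sketched below. The target ratio and the use of face height as the size measure are assumptions, not details fixed by the claim.

```python
def adaptive_zoom_factor(face_h: float, img_h: float,
                         target_ratio: float = 0.3) -> float:
    """Return the factor to apply to the current focal length so the face
    height approaches target_ratio of the frame height; 1.0 means the
    preset size requirement is already met and no zoom is needed."""
    ratio = face_h / img_h
    return 1.0 if ratio >= target_ratio else target_ratio / ratio
```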
  7. The method according to any one of claims 1 to 6, wherein cropping the face region in the first image based on the face position information specifically comprises:
    determining the relevant feature points of the face region in the first image;
    determining a first circumscribed rectangle, the first circumscribed rectangle being the circumscribed rectangle of the relevant feature points of the face region in the first image; and
    cropping the first image based on the position of the first circumscribed rectangle in the first image;
    wherein the face image block has the same position in the first image as the first circumscribed rectangle, and the face image block has the same size as the first circumscribed rectangle;
    cropping the left-eye region in the first image based on the eye position information specifically comprises:
    determining the relevant feature points of the left-eye region in the first image;
    determining a second circumscribed rectangle, the second circumscribed rectangle being the circumscribed rectangle of the relevant feature points of the left-eye region in the first image; and
    cropping the first image based on the position of the second circumscribed rectangle in the first image;
    wherein the left-eye image block has the same position in the first image as the second circumscribed rectangle, and the left-eye image block has the same size as the second circumscribed rectangle;
    and cropping the right-eye region in the first image based on the eye position information specifically comprises:
    determining the relevant feature points of the right-eye region in the first image;
    determining a third circumscribed rectangle, the third circumscribed rectangle being the circumscribed rectangle of the relevant feature points of the right-eye region in the first image; and
    cropping the first image based on the position of the third circumscribed rectangle in the first image;
    wherein the right-eye image block has the same position in the first image as the third circumscribed rectangle, and the right-eye image block has the same size as the third circumscribed rectangle.
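A minimal sketch of the circumscribed-rectangle cropping of claim 7, assuming NumPy and an (N, 2) array of (x, y) feature points. The clamping at the image border is an added safeguard, not part of the claim.

```python
import numpy as np

def crop_circumscribed(img: np.ndarray, points: np.ndarray) -> np.ndarray:
    """Crop the axis-aligned circumscribed rectangle of the given feature
    points, yielding an image block with the rectangle's position and size."""
    x0, y0 = np.floor(points.min(axis=0)).astype(int)
    x1, y1 = np.ceil(points.max(axis=0)).astype(int)
    h, w = img.shape[:2]
    x0, y0 = max(0, x0), max(0, y0)  # keep the rectangle inside the frame
    x1, y1 = min(w, x1), min(h, y1)
    return img[y0:y1, x0:x1]

# The same helper serves all three rectangles (variable names hypothetical):
# face_block = crop_circumscribed(first_image, face_points)
# left_eye_block = crop_circumscribed(first_image, left_eye_points)
# right_eye_block = crop_circumscribed(first_image, right_eye_points)
```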
  8. The method according to any one of claims 1 to 6, wherein cropping the face region in the first image based on the face position information to obtain the face image block specifically comprises:
    determining the face region in the first image based on the face position information; and
    cropping the first image with the face region as the center of a first cropping frame to obtain the face image block, the size of the first cropping frame being a first preset cropping size, and the face image block having the same size as the first cropping frame;
    and cropping the left-eye region and the right-eye region in the first image based on the eye position information to obtain the left-eye image block and the right-eye image block specifically comprises:
    determining the left-eye region in the first image and the right-eye region in the first image based on the eye position information;
    cropping the first image with the left-eye region as the center of a second cropping frame to obtain the left-eye image block, the size of the second cropping frame being a second preset cropping size, and the left-eye image block having the same size as the second cropping frame; and
    cropping the first image with the right-eye region as the center of a third cropping frame to obtain the right-eye image block, the size of the third cropping frame being a third preset cropping size, and the right-eye image block having the same size as the third cropping frame.
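Claim 8, by contrast, crops at fixed preset sizes centered on each region. A sketch follows; shifting the window to stay inside the frame is an assumed border policy, and the concrete cropping sizes are hypothetical.

```python
import numpy as np

def center_crop(img: np.ndarray, center, size) -> np.ndarray:
    """Crop a window of preset size centered on a region; the window is
    shifted to stay inside the frame when the center is near a border."""
    cx, cy = center          # center of the face/eye region in pixels
    crop_w, crop_h = size    # preset cropping size (width, height)
    h, w = img.shape[:2]
    x0 = min(max(0, int(cx - crop_w // 2)), max(0, w - crop_w))
    y0 = min(max(0, int(cy - crop_h // 2)), max(0, h - crop_h))
    return img[y0:y0 + crop_h, x0:x0 + crop_w]

# Hypothetical preset cropping sizes per block type:
# face_block = center_crop(first_image, face_center, (224, 224))
# left_eye_block = center_crop(first_image, left_eye_center, (112, 112))
# right_eye_block = center_crop(first_image, right_eye_center, (112, 112))
```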
  9. The method according to claim 4, wherein the gaze point estimation network model further comprises a number of activation layers; the region-of-interest pooling module comprises a number of region-of-interest pooling layers; the convolution module comprises a number of convolution layers; and the fusion module comprises a number of fully connected layers.
  10. An electronic device, comprising a display screen, a camera, a memory, and one or more processors, wherein the memory is configured to store a computer program, and the processor is configured to invoke the computer program to cause the electronic device to perform the method according to any one of claims 1 to 9.
  11. A computer storage medium, comprising computer instructions, wherein, when the computer instructions are run on an electronic device, the electronic device is caused to perform the method according to any one of claims 1 to 9.
PCT/CN2023/092415 2022-07-29 2023-05-06 Fixation point estimation method and related device WO2024021742A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210910894.2 2022-07-29
CN202210910894.2A CN116048244B (en) 2022-07-29 2022-07-29 Gaze point estimation method and related equipment

Publications (2)

Publication Number Publication Date
WO2024021742A1 WO2024021742A1 (en) 2024-02-01
WO2024021742A9 true WO2024021742A9 (en) 2024-05-16

Family

ID=86127878

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/092415 WO2024021742A1 (en) 2022-07-29 2023-05-06 Fixation point estimation method and related device

Country Status (2)

Country Link
CN (1) CN116048244B (en)
WO (1) WO2024021742A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116048244B (en) * 2022-07-29 2023-10-20 荣耀终端有限公司 Gaze point estimation method and related equipment
CN117576298B (en) * 2023-10-09 2024-05-24 中微智创(北京)软件技术有限公司 Battlefield situation target highlighting method based on context separation 3D lens
CN117472256B (en) * 2023-12-26 2024-08-23 荣耀终端有限公司 Image processing method and electronic equipment

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6377566B2 (en) * 2015-04-21 2018-08-22 日本電信電話株式会社 Line-of-sight measurement device, line-of-sight measurement method, and program
CN109492514A (en) * 2018-08-28 2019-03-19 初速度(苏州)科技有限公司 A kind of method and system in one camera acquisition human eye sight direction
CN111723596B (en) * 2019-03-18 2024-03-22 北京市商汤科技开发有限公司 Gaze area detection and neural network training method, device and equipment
CN112000226B (en) * 2020-08-26 2023-02-03 杭州海康威视数字技术股份有限公司 Human eye sight estimation method, device and sight estimation system
CN112329699A (en) * 2020-11-19 2021-02-05 北京中科虹星科技有限公司 Method for positioning human eye fixation point with pixel-level precision
US11947717B2 (en) * 2021-01-22 2024-04-02 Blink Technologies Inc. Gaze estimation systems and methods using relative points of regard
CN113642393B (en) * 2021-07-07 2024-03-22 重庆邮电大学 Attention mechanism-based multi-feature fusion sight estimation method
CN116048244B (en) * 2022-07-29 2023-10-20 荣耀终端有限公司 Gaze point estimation method and related equipment

Also Published As

Publication number Publication date
CN116048244A (en) 2023-05-02
WO2024021742A1 (en) 2024-02-01
CN116048244B (en) 2023-10-20

Similar Documents

Publication Publication Date Title
CN111738122B (en) Image processing method and related device
WO2024021742A9 (en) Fixation point estimation method and related device
WO2021078001A1 (en) Image enhancement method and apparatus
CN111782879B (en) Model training method and device
CN111400605A (en) Recommendation method and device based on eyeball tracking
CN111882642B (en) Texture filling method and device for three-dimensional model
WO2021180046A1 (en) Image color retention method and device
EP4325877A1 (en) Photographing method and related device
US20230162529A1 (en) Eye bag detection method and apparatus
CN113538227B (en) Image processing method based on semantic segmentation and related equipment
CN111612723B (en) Image restoration method and device
CN111768352A (en) Image processing method and device
CN113538321B (en) Vision-based volume measurement method and terminal equipment
CN115661912A (en) Image processing method, model training method, electronic device and readable storage medium
WO2022143314A1 (en) Object registration method and apparatus
CN115633255B (en) Video processing method and electronic equipment
CN112528760B (en) Image processing method, device, computer equipment and medium
CN114697530B (en) Photographing method and device for intelligent view finding recommendation
CN113642359B (en) Face image generation method and device, electronic equipment and storage medium
CN115587938A (en) Video distortion correction method and related equipment
CN114693538A (en) Image processing method and device
CN114399622A (en) Image processing method and related device
CN114970576A (en) Identification code identification method, related electronic equipment and computer readable storage medium
CN117710697B (en) Object detection method, electronic device, storage medium, and program product
CN115880348B (en) Face depth determining method, electronic equipment and storage medium

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23844957

Country of ref document: EP

Kind code of ref document: A1