CN115049819A - Gaze region identification method and device

Info

Publication number
CN115049819A
Authority
CN
China
Prior art keywords
probability, queue, fixation, gaze, point
Prior art date
Legal status
Pending
Application number
CN202110221018.4A
Other languages
Chinese (zh)
Inventor
车慧敏
李志刚
刘腾
杨雨
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN202110221018.4A
Publication of CN115049819A

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis

Abstract

The application relates to a gaze region identification method and device. The method comprises the following steps: performing feature extraction on at least one acquired first image to obtain feature data of each first image, wherein the feature data comprises face feature point data, head pose data and eye feature data, and the first image comprises a face of a user; inputting the feature data into a trained gaze estimation model to obtain gaze point coordinates and a gaze probability map corresponding to each first image, wherein the gaze probability map comprises probability values of the gaze line falling in each gaze region of the screen watched by the user; processing the gaze point coordinates and gaze probability maps of the first images to form a gaze point queue and a probability map queue; and determining a user gaze region from the plurality of gaze regions according to the gaze point queue and the probability map queue. The method and device provided by the embodiments of the application estimate the user gaze region with small error, low jitter and low cost, and can be deployed in all scenarios.

Description

Gaze area identification method and device
Technical Field
The application relates to the technical field of vision, and in particular to a gaze region identification method and device.
Background
Eye movement technology has extremely high market value. As a novel interaction mode, eye movement control has great exploration and research value in fields such as reading, games, psychology and marketing, and its value in assisting people with disabilities is widely recognized, with some initial research results. Eye movement technology includes gaze point control based on screen-equipped terminal devices such as mobile phones and tablets, and gaze direction estimation control based on three-dimensional space. However, gaze estimation in the related art has the following problems. In some technical solutions, the gaze estimation error is small enough to meet usage requirements, but gaze estimation can only be achieved with dedicated hardware (such as a near-infrared camera), which is expensive and cannot be deployed in all scenarios. Other technical solutions implement gaze estimation by analyzing images in the RGB color mode (RGB for short), but the gaze estimation error is large and jitter in the estimation result is difficult to eliminate. Moreover, RGB images are strongly affected by the environment in which they are captured, so the captured RGB images are noisy, which further increases the gaze estimation error and fails to meet the usage requirements of terminal devices. Reducing the error, cost and jitter of gaze estimation while enabling deployment of gaze estimation in all scenarios is an urgent technical problem to be solved.
Disclosure of Invention
In view of this, a method and an apparatus for identifying a gazing area are provided.
In a first aspect, an embodiment of the present application provides a gaze area identification method, including:
performing feature extraction on at least one acquired first image to obtain feature data of each first image, wherein the feature data comprises face feature point data, head pose data and eye feature data, and the first image comprises a face of a user;
inputting the feature data into a trained gaze estimation model to obtain gaze point coordinates and a gaze probability map corresponding to each first image, wherein the gaze point coordinates represent the coordinates of the intersection of the lines of sight of the two eyes, the gaze probability map comprises probability values of the gaze line falling in each gaze region of the screen watched by the user, the gaze line represents a line passing through the eye and the gaze point, and the screen comprises a plurality of pre-divided gaze regions;
processing the fixation point coordinates and the fixation probability maps of the first images to form a fixation point queue and a probability map queue;
and determining a user gazing area from the plurality of gazing areas according to the gazing point queue and the probability map queue.
The method provided by the first aspect enables gaze region estimation based on RGB images. The method can be applied to a terminal device to estimate the region of the screen the user is gazing at; the estimated user gaze region has small error and low jitter, the method is low in cost, and it can be deployed in all scenarios.
According to the first aspect, in a first possible implementation manner of the method, processing the gaze point coordinates and the gaze probability map of each first image to form a gaze point queue and a probability map queue includes:
determining a probability value of a fixation point coordinate corresponding to the fixation probability map according to the fixation probability map;
screening effective fixation point coordinates from the fixation point coordinates according to the probability value of the fixation point coordinates of each first image, and determining an effective fixation probability graph corresponding to the effective fixation point coordinates;
and forming a fixation point queue and a probability map queue according to the effective fixation probability map and the effective fixation point coordinate.
According to the first aspect, in a second possible implementation manner of the method, determining a user gazing area from the multiple gazing areas according to the gazing point queue and the probability map queue includes:
filtering the fixation point queue and the probability map queue respectively to obtain a filtered fixation point queue and a probability heat map;
and determining a user gazing area from the plurality of gazing areas according to the filtered gazing point queue and the probability heat map.
In this way, screening is performed before the gaze point queue and the probability map queue are formed, and the queues are then formed from the effective gaze point coordinates and the effective gaze probability maps, which improves the reliability and stability of the data in the gaze point queue and the probability map queue and reduces the error and jitter in determining the user gaze region.
According to a second possible implementation manner of the first aspect, in a third possible implementation manner of the method, the step of filtering the gaze point queue and the probability map queue to obtain a filtered gaze point queue and a probability heat map includes:
filtering the fixation point queue to obtain a filtered fixation point queue;
and overlapping and multiplying the gazing probability maps in the probability map queue, and then performing normalization processing to form a probability heat map.
In this way, the reliability and stability of the data can be improved by filtering.
In a fourth possible implementation form of the method according to the second possible implementation form of the first aspect, before determining the user gaze region from the plurality of gaze regions according to the filtered sequence of gaze points and the probability heat map, the method further includes:
determining a probability value corresponding to the gaze point coordinates in the filtered gaze point queue according to the probability heat map;
screening according to the probability value of each fixation point coordinate, removing the fixation point coordinate with abnormal probability value in the fixation point queue after filtering, and
and updating the probability heat map according to the gazing probability map corresponding to the gazing point coordinate with the abnormal probability value.
In a fifth possible implementation form of the method according to the first aspect, the method further comprises:
training a sight line estimation model by using the characteristic data of the training image to obtain the trained sight line estimation model;
updating model parameters of the sight estimation model by using the loss of the sight estimation model determined according to a preset loss function in training;
wherein the preset loss function is determined according to a regression loss function and a cross entropy loss function.
In a second aspect, an embodiment of the present application provides a gaze area identification apparatus, the apparatus comprising:
a feature extraction module, configured to perform feature extraction on at least one acquired first image to obtain feature data of each first image, wherein the feature data comprises face feature point data, head pose data and eye feature data, and the first image comprises a face of a user;
a gaze estimation module, configured to input the feature data into a trained gaze estimation model to obtain gaze point coordinates and a gaze probability map corresponding to each first image, wherein the gaze point coordinates represent the coordinates of the intersection of the lines of sight of the two eyes, the gaze probability map comprises probability values of the gaze line falling in each gaze region of the screen watched by the user, the gaze line represents a line passing through the eye and the gaze point, and the screen comprises a plurality of pre-divided gaze regions;
the queue creating module is used for processing the fixation point coordinates and the fixation probability map of each first image to form a fixation point queue and a probability map queue;
and the area determining module is used for determining a user gazing area from the plurality of gazing areas according to the gazing point queue and the probability map queue.
In a first possible implementation manner of the apparatus according to the second aspect, the queue creating module includes:
the probability value determining submodule is used for determining the probability value of the fixation point coordinate corresponding to the fixation probability map according to the fixation probability map;
the screening submodule is used for screening effective fixation point coordinates from the fixation point coordinates according to the probability value of the fixation point coordinates of each first image and determining an effective fixation probability graph corresponding to the effective fixation point coordinates;
and the queue forming submodule is used for forming a fixation point queue and a probability map queue according to the effective fixation probability map and the effective fixation point coordinate.
In a second possible implementation manner of the apparatus according to the second aspect, the region determining module includes:
the filtering submodule is used for respectively carrying out filtering processing on the fixation point queue and the probability map queue to obtain a filtered fixation point queue and a filtered probability heat map;
and the area determining sub-module is used for determining a user gazing area from the plurality of gazing areas according to the filtered gazing point queue and the probability heat map.
In a third possible implementation manner of the apparatus according to the second possible implementation manner of the second aspect, the filtering sub-module includes:
the first filtering submodule is used for filtering the fixation point queue to obtain a filtered fixation point queue;
and the second filtering sub-module is used for performing normalization processing after the gazing probability maps in the probability map queue are subjected to superposition to form a probability heat map.
In a fourth possible implementation manner of the apparatus according to the second possible implementation manner of the second aspect, the apparatus further includes:
the determining module is used for determining a probability value corresponding to the fixation point coordinates in the filtered fixation point queue according to the probability heat map;
a screening and updating module for screening according to the probability value of each fixation point coordinate, removing the fixation point coordinate with abnormal probability value in the filtered fixation point queue, and
and updating the probability heat map according to the gazing probability map corresponding to the gazing point coordinate with the abnormal probability value.
In a fifth possible implementation form of the apparatus according to the second aspect, the apparatus further includes:
the model training module is used for training the sight line estimation model by utilizing the characteristic data of the training image to obtain the trained sight line estimation model;
updating model parameters of the sight estimation model by using the loss of the sight estimation model determined according to a preset loss function in training;
wherein the preset loss function is determined according to a regression loss function and a cross entropy loss function.
In a third aspect, an embodiment of the present application provides a gaze region identification apparatus, including: a processor; and a memory for storing processor-executable instructions; wherein the processor is configured to execute the instructions to implement the method of the first aspect or any one of its possible implementations.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium having stored thereon computer program instructions that, when executed by a processor, implement the method of the first aspect or any one of its possible implementations.
In a fifth aspect, embodiments of the present application provide a computer program product, which includes computer-readable code or a non-volatile computer-readable storage medium carrying computer-readable code; when the computer-readable code runs in an electronic device, a processor in the electronic device executes the method of the first aspect or any one of its possible implementations.
These and other aspects of the present application will be more readily apparent from the following description of the embodiment(s).
Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification, illustrate exemplary embodiments, features, and aspects of the application and, together with the description, serve to explain the principles of the application.
Fig. 1 illustrates a schematic structure diagram of an electronic device 100.
Fig. 2 shows a flow chart of a gaze region identification method according to an embodiment of the application.
Fig. 3 shows a schematic diagram of human face feature points according to an embodiment of the present application.
Fig. 4 shows a schematic diagram of head pose extraction according to an embodiment of the present application.
FIG. 5 shows a schematic diagram of a user gaze screen according to an embodiment of the present application.
Fig. 6 shows a schematic diagram of a gaze probability map according to an embodiment of the application.
Fig. 7 shows a schematic diagram of a line-of-sight estimation model according to an embodiment of the present application.
Fig. 8 shows a schematic diagram of a filtering process of a probability map queue according to an embodiment of the application.
Fig. 9 shows a flow diagram of a gaze region identification method according to an embodiment of the application.
Fig. 10 shows a block diagram of a gaze region identification apparatus according to an embodiment of the present application.
Detailed Description
Various exemplary embodiments, features and aspects of the present application will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers can indicate functionally identical or similar elements. While the various aspects of the embodiments are presented in drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used exclusively herein to mean "serving as an example, embodiment, or illustration. Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
Furthermore, in the following detailed description, numerous specific details are set forth in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, methods, means, elements and circuits that are well known to those skilled in the art have not been described in detail so as not to obscure the present application.
In order to solve the technical problem, the application provides a gazing area identification method and a gazing area identification device. The method can be applied to terminal equipment, so that the watching area of a user when watching a screen can be estimated, the estimated watching area of the user has small error and low jitter, the method is low in cost, and full-scene deployment can be realized.
Fig. 1 illustrates a schematic structure diagram of an electronic device 100.
The terminal device of the present application may be an electronic device having a screen. The electronic device 100 may include at least one of a mobile phone, a foldable electronic device, a tablet computer, a desktop computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a Personal Digital Assistant (PDA), an Augmented Reality (AR) device, a Virtual Reality (VR) device, an Artificial Intelligence (AI) device, a wearable device, a vehicle-mounted device, a smart home device, or a smart city device. The embodiment of the present application does not particularly limit the specific type of the electronic device 100.
The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a Universal Serial Bus (USB) connector 130, a charging management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, a button 190, a motor 191, an indicator 192, a camera 193, a display screen 194, a Subscriber Identity Module (SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyroscope sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity light sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It is to be understood that the illustrated structure of the embodiment of the present application does not specifically limit the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or some components may be combined, some components may be split, or a different arrangement of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
Processor 110 may include one or more processing units, such as: the processor 110 may include an Application Processor (AP), a modem processor, a Graphics Processor (GPU), an Image Signal Processor (ISP), a controller, a video codec, a Digital Signal Processor (DSP), a baseband processor, and/or a neural-Network Processor (NPU), among others. The different processing units may be separate devices or may be integrated into one or more processors.
The processor can generate an operation control signal according to the instruction operation code and the timing signal to complete the control of instruction fetching and instruction execution. The gaze region identification method provided by the application can be realized by a processor. The processor can execute the gazing area identification method to determine the gazing area of the user.
A memory may also be provided in processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 may be a cache memory. The memory may store instructions or data that have been used or used more frequently by the processor 110. If the processor 110 needs to use the instruction or data, it can be called directly from the memory. Avoiding repeated accesses reduces the latency of the processor 110, thereby increasing the efficiency of the system.
In some embodiments, processor 110 may include one or more interfaces. The interface may include an integrated circuit (I2C) interface, an integrated circuit built-in audio (I2S) interface, a Pulse Code Modulation (PCM) interface, a universal asynchronous receiver/transmitter (UART) interface, a Mobile Industry Processor Interface (MIPI), a general-purpose input/output (GPIO) interface, a Subscriber Identity Module (SIM) interface, and/or a Universal Serial Bus (USB) interface, etc. The processor 110 may be connected to modules such as a touch sensor, an audio module, a wireless communication module, a display, a camera, etc. through at least one of the above interfaces.
It should be understood that the interface connection relationship between the modules illustrated in the embodiments of the present application is only an illustration, and does not limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also adopt different interface connection manners or a combination of multiple interface connection manners in the above embodiments.
The electronic device 100 may implement display functionality via the GPU, the display screen 194, and the application processor, among other things. The GPU is a microprocessor for image processing, and is connected to the display screen 194 and an application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The processor 110 may include one or more GPUs that execute program instructions to generate or alter display information.
The display screen 194 is used to display images, video, and the like. The display screen 194 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the electronic device 100 may include 1 or more display screens 194. The display screen 194 is also referred to below simply as the screen. At least one control that the user can operate by gazing can be displayed in the display screen 194; when the user gaze region determined by the gaze region identification method of the present application is the same as, or partially coincides with, the region where a certain control is located, the control is determined to be triggered, and the processor 110 or other components of the electronic device 100 can execute the operation corresponding to the triggered control.
The electronic device 100 may implement a camera function through the camera module 193, the ISP, the video codec, the GPU, the display screen 194, the application processor AP, the neural network processor NPU, and the like. The electronic device 100 may utilize its own camera function to capture an image when detecting that a user gazes at the display screen 194 or receives a photographing instruction, so as to obtain a first image. And sends the first image to the processor 110 so that the processor 110 can execute the gaze area identification method provided by the present application to determine the user gaze area.
The camera module 193 can be used to collect color image data and depth data of a subject. The ISP can be used to process color image data collected by the camera module 193. For example, when a photo is taken, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electrical signal, and the camera photosensitive element transmits the electrical signal to the ISP for processing and converting into an image visible to naked eyes. The ISP can also carry out algorithm optimization on the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in the camera module 193.
In some embodiments, the camera module 193 may be composed of a color camera module and a 3D sensing module.
In some embodiments, the light sensing element of the camera of the color camera module may be a Charge Coupled Device (CCD) or a complementary metal-oxide-semiconductor (CMOS) phototransistor. The light sensing element converts the optical signal into an electrical signal, which is then passed to the ISP where it is converted into a digital image signal. And the ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, YUV and other formats.
In some embodiments, the 3D sensing module may be a time-of-flight (TOF) 3D sensing module or a structured light 3D sensing module. Structured light 3D sensing is an active depth sensing technology, and the basic components of the structured light 3D sensing module may include an infrared (IR) emitter, an IR camera module, and the like. The working principle of the structured light 3D sensing module is to project light spots with a specific pattern onto the photographed object, receive the coded light spot pattern (light coding) on the surface of the object, compare it with the originally projected light spots, and calculate the three-dimensional coordinates of the object using the triangulation principle. The three-dimensional coordinates include the distance from the electronic device 100 to the photographed object. The TOF 3D sensing module is likewise an active depth sensing technology, and its basic components may include an infrared emitter, an IR camera module, and the like. The working principle of the TOF 3D sensing module is to calculate the distance (i.e., depth) between the TOF 3D sensing module and the photographed object from the round-trip time of the infrared light, so as to obtain a 3D depth map.
The structured light 3D sensing module can also be applied to the fields of face recognition, motion sensing game machines, industrial machine vision detection and the like. The TOF 3D sensing module can also be applied to the fields of game machines, Augmented Reality (AR)/Virtual Reality (VR), and the like.
In other embodiments, the camera module 193 may also be composed of two or more cameras. The two or more cameras may include color cameras that may be used to collect color image data of the object being photographed. The two or more cameras may employ stereoscopic vision (stereo vision) technology to acquire depth data of a photographed object. The stereoscopic vision technology is based on the principle of human eye parallax, and obtains distance information, i.e., depth information, between the electronic device 100 and an object to be photographed by photographing images of the same object from different angles through two or more cameras under a natural light source and performing calculations such as triangulation.
In some embodiments, the electronic device 100 may include 1 or more camera modules 193. Specifically, the electronic device 100 may include 1 front camera module 193 and 1 rear camera module 193. The front camera module 193 can be generally used to collect the color image data and depth data of the photographer facing the display screen 194, and the rear camera module can be used to collect the color image data and depth data of the photographed object (such as people and scenery) facing the photographer.
Fig. 2 shows a flow chart of a gaze region identification method according to an embodiment of the application. The present application provides a gaze region identification method as shown in fig. 2, which may be applied to a terminal device, the method including steps S11 to S14. The process of determining the user's gaze area of the present application may be applied in the following scenarios: the user controls the terminal equipment by watching, controls which can be operated by the user can be displayed in a screen of the terminal equipment, and the terminal equipment can execute the operation corresponding to the controls when the recognized watching area of the user is the same as the position of one of the corresponding controls in the screen. The method can also be applied to a scene for monitoring the eye movement process of the user, the first image can be continuously shot on the user watching screen within a period of time, and the eye movement process of the user is determined according to the determined change condition of the user watching area within the period of time.
In step S11, feature extraction is performed on at least one acquired first image to obtain feature data of each first image, where the feature data includes face feature point data, head pose data, and eye feature data, and the first image includes a face of a user.
In the present embodiment, the face feature point data may be data of a plurality of feature points representing the features of the user's face, and the feature points may include points representing the user's eyebrows, eyes, nose, mouth and face contour. Fig. 3 is a schematic diagram of face feature points according to an embodiment of the present application; the face feature point data obtained in this embodiment may be data of part or all of the 68 feature points shown in fig. 3 (the positions of the 68 points numbered 1-68 in fig. 3 are the positions of the face feature points). The image coordinates of the feature points in the extracted face feature point data may be expressed as a 1 × n matrix (x1, y1, x2, y2, …, xj, yj), where j is the number of face feature points and n = 2j. For example, the face feature point data may be the data of the points representing the eyes, nose and mouth in the face, that is, the 41 feature points numbered 28 to 68 in the figure, and the image coordinates of these 41 feature points may be represented as a 1 × 82 matrix (x1, y1, x2, y2, …, x41, y41), where x1, y1 are the image coordinates of the feature point numbered 28; x2, y2 are the image coordinates of the feature point numbered 29; and so on, until x41, y41 are the image coordinates of the feature point numbered 68. A face detection model can be created according to the number and positions of the face feature points to be extracted, and trained with sample images to obtain a face detection model capable of extracting the face feature points. The first image is then input into the trained face detection model to obtain the face feature point data, namely a matrix formed by the image coordinates of the feature points.
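For illustration only, the following Python sketch shows how a set of detected landmarks could be flattened into the 1 × 2j row vector described above; the array shapes and the flatten_landmarks helper are assumptions, not part of the patent.

```python
import numpy as np

def flatten_landmarks(landmarks):
    """Flatten j facial feature points into a 1 x (2*j) row vector
    (x1, y1, x2, y2, ..., xj, yj)."""
    pts = np.asarray(landmarks, dtype=np.float32)  # shape (j, 2)
    return pts.reshape(1, -1)                      # shape (1, 2*j)

# e.g. the 41 points numbered 28-68 yield a 1 x 82 feature vector
landmarks_41 = np.zeros((41, 2), dtype=np.float32)  # placeholder detector output
feature_vec = flatten_landmarks(landmarks_41)
assert feature_vec.shape == (1, 82)
```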
In the present embodiment, the head pose data may be data representing the pose of the user's face in the first image. Fig. 4 shows a schematic diagram of head pose extraction according to an embodiment of the present application. As shown in fig. 4, a plurality of coordinate points may be selected to form a head pose frame M for extracting the head pose, the head pose frame M is used to determine the angles of the three dimensions of the head, i.e., the pitch angle pitch, the yaw angle yaw, and the roll angle roll, and then the image coordinates of a plurality of feature points constituting the head pose frame M in the first image and the values of the three dimensions of the pitch, yaw, and roll are used as the head pose data. The plurality of coordinate points constituting the head pose frame may be partially or entirely human face feature points. For example, as shown in fig. 4, 8 coordinate points may be selected to form a head pose frame M, where the head pose frame M is a three-dimensional structure shaped like a frustum of a prism, and four vertices of one bottom surface (a rectangle formed by a dotted line shown in fig. 4, and a rectangle formed by a dotted line below) of the frustum of a prism and four vertices of the other bottom surface (a rectangle formed by a solid line shown in fig. 4, and a rectangle formed by a solid line below) of the frustum of a prism are the 8 coordinate points forming the head pose frame M. Further, the actual coordinates of the 8 coordinate points of the head pose frame M can be adjusted for the heads of different users, and as shown in fig. 4, one vertex coordinate of the dotted rectangle of the head pose frame M can be determined by the recognized face feature points 37 and 40, the other vertex coordinate of the dotted rectangle of the head pose frame M can be determined by the recognized face feature points 42 and 45, and the coordinates of the four vertices of the dotted rectangle can be determined by the lengths of the long sides of the dotted rectangle determined by the recognized face feature points 42 and 58. And further determining the coordinates of four vertexes of the solid line rectangle according to the preset proportional relation between the solid line rectangle and the dotted line rectangle or by combining with human face characteristic points, and finally obtaining the head pose frame M. The head pose frame M may also be in other shapes, and those skilled in the art may set the frame according to actual needs, which is not limited in this application. A pose detection model for extracting head pose data can be created in advance according to the determined head pose extraction mode, and the pose detection model is trained by utilizing the sample image to obtain the trained pose detection model.
In this embodiment, the eye feature data may be data representing the viewing features of the eyes of the user, including the sizes of the white, iris and pupil of the eyes of the user, the relative position relationship between the white, iris and pupil with respect to the eye socket, and the like. In step S11, the first image may be cropped to obtain an eye image in the first image, and then the eye image may be identified by using a pre-trained eye detection model to obtain eye feature data.
It should be noted that, the present application exemplarily provides an implementation manner for extracting features of the first images to obtain face feature point data, head pose data, and eye feature data of each first image, and a person skilled in the art may set an implementation manner for extracting the face feature point data, the head pose data, and the eye feature data according to actual needs, which is not limited in this application.
In this embodiment, the first image may be captured by a terminal device with a screen that the user is watching. The capture duration and capture frequency of the first images may be set; within a certain range, the longer the capture duration and the higher the capture frequency, the smaller the error of the identified user gaze region and the higher the identification accuracy. The terminal device that captures the first image and the terminal device that executes the gaze region identification method of the present application may be the same device or different devices. For example, a first terminal device is provided with a screen and captures the first images while the user gazes at the screen; the first terminal device then sends the captured first images to a second terminal device; after receiving the first images, the second terminal device performs steps S11-S14 to obtain the user gaze region, and then sends information about the user gaze region to the first terminal device. The first terminal device performs corresponding operations based on the determined user gaze region, including responding to the control corresponding to the user gaze region, displaying corresponding content to the user, making a call, playing a song, closing software and the like, so that the user can control the first terminal device to perform the corresponding operations by gazing.
In step S12, the feature data is input into the trained gaze estimation model, and a gaze point coordinate and a gaze probability map corresponding to each first image are obtained, the gaze point coordinate represents a coordinate of an intersection of the two eye gaze lines, the gaze probability map includes a probability that the gaze line is located in each gaze area of the screen, the gaze line represents a connection line passing through the eye and the gaze point, and the screen includes a plurality of gaze areas divided in advance.
In this embodiment, before the face feature point data, the head pose data, and the eye feature data are input into the gaze estimation model, the face feature point data and the head pose data may be standardized respectively, so that the gaze estimation model can process the feature data, and the accuracy of the result is ensured.
FIG. 5 shows a schematic diagram of a user gaze screen according to an embodiment of the present application. As shown in fig. 5, during the process of the user gazing on the screen through both eyes, the gazing lines of the left and right eyes are focused on the screen to form a gazing point. The coordinates of the gazing point are the coordinates (x, y) of the gazing point on the screen, and the coordinates of the gazing point on the screen can be determined according to the mapping relation based on the coordinates of the gazing point on the first image due to the mapping relation between the first image and the screen size. In addition, the screen can be divided into regions according to the requirements of different use scenes on the identification precision, the screen is divided into a plurality of watching regions 1-1 and 1-2 … m-n, and the identification precision is more accurate when the number of the watching regions is larger. The plurality of gaze areas divided by the screen may be in a grid shape as shown in fig. 5 and 6, each gaze area may be in a rectangular shape, a regular shape such as a triangle or a square, an irregular shape, or the plurality of gaze areas may be the same or different in shape. The areas of the plurality of gazing areas can be the same or different, and the areas of the gazing areas can be set according to the distance between the gazing areas and the central position of the screen, for example, the closer to the central position of the screen, the smaller the area of the gazing areas. Or, the area of the corresponding gazing area is determined according to the user use frequency in different areas of the screen, for example, the area of the gazing area is smaller in the area with higher user use frequency. In this way, the estimation error can be reduced by adjusting the area of the attention area in accordance with the position of the attention area, the corresponding frequency of use, and the like. And dividing the screen into multiple gaze areas may improve the confidence in the determination of the user's gaze area. Meanwhile, the screen is divided into a plurality of watching areas, and the problem of low data reliability caused by errors of the coordinates of the watching point marked by the training image in the process of training the line-of-sight estimation model can be solved.
In the present embodiment, the gaze probability map may be a probability value capable of indicating that the gaze point is in each gaze area in the screen, and fig. 6 illustrates a schematic diagram of the gaze probability map according to an embodiment of the present application. As shown in fig. 6, from the gaze probability map it can be determined: the probability value of the fixation point being in fixation area 1-1 is 0.000, the probability value of the fixation point being in fixation area 1-2 is 0.000 …, the probability value of the fixation point being in fixation area 2-4 is 0.700 …, and the probability value of the fixation point being in fixation area m-n is 0.000.
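As a concrete but hypothetical illustration of the grid of gaze regions and the gaze probability map, the sketch below divides the screen into an m × n grid of equal-area rectangular regions and looks up the probability of the region containing a given gaze point; the screen size, grid dimensions and the region_index helper are illustrative assumptions.

```python
import numpy as np

def region_index(x, y, screen_w, screen_h, m, n):
    """Map an on-screen gaze point (x, y) to the 0-based (row, col) index of
    the m x n grid of gaze regions (equal-area rectangles assumed)."""
    row = min(int(y / (screen_h / m)), m - 1)
    col = min(int(x / (screen_w / n)), n - 1)
    return row, col

# A gaze probability map is an m x n array whose entries sum to 1.
prob_map = np.zeros((4, 5), dtype=np.float32)  # e.g. m = 4 rows, n = 5 columns
prob_map[1, 3] = 0.7                           # region "2-4" of Fig. 6 (1-based labels)
prob_map[1, 2] = 0.3

row, col = region_index(x=880, y=600, screen_w=1200, screen_h=1920, m=4, n=5)
p = prob_map[row, col]  # probability that the gaze point lies in its own region (0.7 here)
```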
In a possible implementation manner, before performing the step S12, the method further includes: and training the sight line estimation model by using the characteristic data of the training image to obtain the trained sight line estimation model. And updating the model parameters of the sight line estimation model by using the loss of the sight line estimation model determined according to a preset loss function in training. Wherein the predetermined loss function is determined based on a regression loss function and a Cross Entropy (Cross Entropy) loss function.
In the implementation mode, the preset loss function can be determined by weighting and adding the regression loss function and the cross entropy loss function, so that the accuracy of the trained sight estimation model can be ensured, and the estimation error of the finally output user watching region can be reduced. The regression loss function may be a Mean Square Error (MSE) function used in the regression problem. For example, the predetermined loss function L may be the following formula:
[Formula not reproducible in the text: L is a weighted combination of L_MSE and L_cross-entropy with a regularization term.]
where L_cross-entropy represents the cross-entropy loss function, L_MSE represents the mean square error function, σ1 and σ2 represent the weights, and log σ1σ2 represents a regularization term.
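The exact form of the formula is not recoverable from the text above; the following PyTorch sketch shows one plausible instantiation under the assumption of an uncertainty-style weighting in which σ1 and σ2 are learnable and log σ1σ2 acts as the regularization term. The class name and coefficients are assumptions, not the patent's definition.

```python
import torch
import torch.nn as nn

class CombinedGazeLoss(nn.Module):
    """Hypothetical combined loss: MSE on the gaze point coordinates plus
    cross entropy on the gaze-region scores, weighted by learnable sigma_1,
    sigma_2 and regularized by log(sigma_1 * sigma_2)."""
    def __init__(self):
        super().__init__()
        self.log_sigma1 = nn.Parameter(torch.zeros(()))  # log sigma_1 (assumed learnable)
        self.log_sigma2 = nn.Parameter(torch.zeros(()))  # log sigma_2 (assumed learnable)
        self.mse = nn.MSELoss()
        self.ce = nn.CrossEntropyLoss()

    def forward(self, pred_xy, true_xy, region_logits, region_labels):
        l_mse = self.mse(pred_xy, true_xy)            # regression loss L_MSE
        l_ce = self.ce(region_logits, region_labels)  # classification loss L_cross-entropy
        s1 = torch.exp(self.log_sigma1)
        s2 = torch.exp(self.log_sigma2)
        # assumed weighting; the patent only states a weighted sum plus log(sigma1*sigma2)
        return l_mse / (2 * s1 ** 2) + l_ce / (2 * s2 ** 2) + torch.log(s1 * s2)
```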
Fig. 7 shows a schematic diagram of a gaze estimation model according to an embodiment of the present application. As shown in fig. 7, the present application provides an example of a gaze estimation model. The gaze estimation model may be a neural network model comprising a plurality of operation nodes, such as 'cascade 1', 'cascade 2' and 'full connection 1, 2, 3 and 4'. The process of processing the face feature point data, the head pose data and the eye feature data with the gaze estimation model is as follows. The terminal device standardizes the face feature point data and the head pose data respectively and inputs them into the gaze estimation model; the operation node 'cascade 1' in the gaze estimation model cascades the standardized face feature point data and head pose data (that is, determines the mapping relationship between the face feature point data and the head pose data) to obtain a first intermediate result and inputs it into the operation node 'full connection 2'. The operation node 'full connection 2' then performs full-connection processing on the first intermediate result to obtain a second intermediate result, which is input into the operation node 'cascade 2'. For the eye feature data, after the terminal device inputs the eye feature data into the gaze estimation model, the operation node 'full connection 1' performs full-connection processing on the eye feature data to obtain a third intermediate result, which is input into the operation node 'cascade 2'. The operation node 'cascade 2' cascades the third intermediate result and the second intermediate result (that is, determines the mapping relationship between the second intermediate result and the third intermediate result) to obtain a fourth intermediate result, which is input into the operation nodes 'full connection 3' and 'full connection 4' respectively. The operation node 'full connection 3' performs full-connection processing on the fourth intermediate result to obtain the gaze point coordinates. The operation node 'full connection 4' performs full-connection processing on the fourth intermediate result to obtain the probability value and coordinate range of each gaze region of the screen. The gaze probability map is then determined from the probability value and coordinate range of each gaze region.
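A minimal PyTorch sketch of the two-branch structure described above is given below; all layer widths, input dimensions and the number of gaze regions are assumptions for illustration, not values from the patent.

```python
import torch
import torch.nn as nn

class GazeEstimationNet(nn.Module):
    """Illustrative two-branch network following Fig. 7; every dimension here
    (82 landmark values, 19 head-pose values, 128 eye-feature values, 20 gaze
    regions, hidden width 64) is an assumption, not from the patent."""
    def __init__(self, n_landmark=82, n_pose=19, n_eye=128, n_regions=20):
        super().__init__()
        self.fc1 = nn.Linear(n_eye, 64)                # "full connection 1"
        self.fc2 = nn.Linear(n_landmark + n_pose, 64)  # "full connection 2"
        self.fc3 = nn.Linear(128, 2)                   # "full connection 3": gaze point (x, y)
        self.fc4 = nn.Linear(128, n_regions)           # "full connection 4": per-region scores

    def forward(self, landmarks, head_pose, eye_feat):
        cascade1 = torch.cat([landmarks, head_pose], dim=1)     # "cascade 1"
        branch_face = torch.relu(self.fc2(cascade1))
        branch_eye = torch.relu(self.fc1(eye_feat))
        cascade2 = torch.cat([branch_eye, branch_face], dim=1)  # "cascade 2"
        gaze_xy = self.fc3(cascade2)                            # gaze point coordinates
        region_prob = torch.softmax(self.fc4(cascade2), dim=1)  # gaze probability map (flattened)
        return gaze_xy, region_prob
```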
In step S13, the gaze point coordinates and gaze probability map of each first image are processed to form a gaze point queue and a probability map queue.
In this embodiment, after the gaze point coordinates and the gaze probability maps of the plurality of first images are obtained, the gaze point coordinates and the gaze probability maps corresponding to the plurality of first images may be sorted according to the order from the morning to the evening of the shooting time of the first images, and a gaze point queue and a probability map queue may be formed.
In one possible implementation, step S13 may include: determining a probability value of a fixation point coordinate corresponding to the fixation probability map according to the fixation probability map; screening effective fixation point coordinates from the fixation point coordinates according to the probability values of the fixation point coordinates of the first images, and determining effective fixation probability graphs corresponding to the effective fixation point coordinates; and forming a fixation point queue and a probability map queue according to the effective fixation probability map and the effective fixation point coordinate.
In this implementation, the probability value of the gazing area where the gazing point coordinate determined in step S12 is located may be determined according to the gazing probability map, as shown in fig. 6, the gazing point (x, y) is located in the gazing area 2-4, and the probability value is 0.700. And then screening the coordinates of each fixation point according to preset screening conditions to obtain the effective fixation point coordinates. The screening condition may be that the probability value corresponding to the gaze point coordinate is greater than or equal to a probability threshold, for example, the probability threshold may be 0.5, the gaze point coordinate with the probability value greater than 0.5 is the effective gaze point coordinate, and the gaze probability map corresponding to the effective gaze point coordinate is the effective gaze probability map. And forming a fixation point queue and a probability graph queue according to the effective fixation point coordinates and the effective fixation probability graph.
Therefore, before the fixation point queue and the probability map queue are formed, screening is carried out, and then the fixation point queue and the probability map queue are formed according to the effective fixation point coordinates and the effective fixation probability map, so that the reliability and stability of data in the fixation point queue and the probability map queue are improved, and errors and jitters determined by a user fixation area are reduced.
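A rough Python sketch of this screening step, assuming equal-area rectangular gaze regions and the 0.5 probability threshold mentioned above; the function and parameter names are illustrative assumptions.

```python
def screen_valid_points(gaze_points, prob_maps, screen_w, screen_h, threshold=0.5):
    """Keep only gaze points whose probability value in their own gaze
    probability map is >= threshold; the kept items form the gaze point
    queue and the probability map queue."""
    valid_points, valid_maps = [], []
    for (x, y), pmap in zip(gaze_points, prob_maps):
        m, n = pmap.shape                          # grid of gaze regions
        row = min(int(y / (screen_h / m)), m - 1)
        col = min(int(x / (screen_w / n)), n - 1)
        if pmap[row, col] >= threshold:            # probability of the point's own region
            valid_points.append((x, y))
            valid_maps.append(pmap)
    return valid_points, valid_maps
```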
In step S14, a user gazing area is determined from the plurality of gazing areas according to the gazing point queue and the probability map queue.
In one possible implementation, step S14 may include: filtering the fixation point queue and the probability map queue respectively to obtain a filtered fixation point queue and a probability heat map; and determining a user gazing area from the plurality of gazing areas according to the filtered gazing point queue and the probability heat map.
For the filtering processing of the fixation point queue, the fixation point queue may be processed in an arithmetic mean filtering method, a recursive mean filtering method (also called a sliding mean filtering method), a median mean filtering method (also called an anti-pulse interference mean filtering method), an anti-jitter filtering method, a kalman filtering method (non-extended kalman filtering) and the like, so as to achieve smooth output of data and increase reliability and stability of the data in the fixation point queue after filtering.
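For example, a sliding (recursive) mean filter over the gaze point queue could look like the following sketch; the window size is an arbitrary illustrative choice.

```python
import numpy as np

def sliding_mean_filter(gaze_points, window=5):
    """Sliding (recursive) mean filter over the gaze point queue: each output
    point is the mean of the last `window` points, which smooths jitter."""
    pts = np.asarray(gaze_points, dtype=np.float32)  # shape (N, 2)
    filtered = np.empty_like(pts)
    for i in range(len(pts)):
        start = max(0, i - window + 1)
        filtered[i] = pts[start:i + 1].mean(axis=0)
    return filtered
```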
Fig. 8 shows a schematic diagram of the filtering process of the probability map queue according to an embodiment of the application. As shown in fig. 8, the filtering of the probability map queue may be performed by first superposing all the gaze probability maps in the probability map queue to obtain a superposition probability map, and then normalizing the superposition probability map to obtain a probability heat map. The probability heat map may record probability values as shown in fig. 8, and may also visually distinguish the probability values of different gaze regions by filling each gaze region with a shade corresponding to its probability value.
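A small sketch of this fusion step is shown below. Fig. 8 and step S26' describe superposition followed by normalization, while the claim wording also mentions element-wise multiplication, so both variants are included as options; the function name is an assumption.

```python
import numpy as np

def build_probability_heatmap(prob_map_queue, multiply=False):
    """Fuse the gaze probability maps in the queue into one probability heat
    map: accumulate them element-wise (sum by default, product optionally)
    and normalize so the result sums to 1."""
    acc = np.ones_like(prob_map_queue[0]) if multiply else np.zeros_like(prob_map_queue[0])
    for pmap in prob_map_queue:
        acc = acc * pmap if multiply else acc + pmap
    total = acc.sum()
    return acc / total if total > 0 else acc
```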
In one possible implementation, before determining a user gaze region from the plurality of gaze regions according to the filtered gaze point queue and the probability heat map, the method may further include: determining probability values corresponding to fixation point coordinates in the filtered fixation point queue according to the probability heatmap; and screening according to the probability value of each fixation point coordinate, removing the fixation point coordinate with the abnormal probability value in the fixation point queue after filtering, and updating the probability heat map according to the fixation probability map corresponding to the fixation point coordinate with the abnormal probability value.
In this implementation, the probability value of the gaze region in which each gaze point coordinate is located may be determined from the probability heat map. The gaze point coordinates in the filtered gaze point queue are then screened according to preset screening conditions, and the gaze point coordinates with abnormal probability values are removed. A new probability heat map is then generated from the gaze probability maps corresponding to the gaze point coordinates with abnormal probability values and the previously generated probability heat map. The screening condition may be that the probability value corresponding to the gaze point coordinate is smaller than or equal to a probability threshold; for example, the probability threshold may be 0.25, and the gaze point coordinates with probability values smaller than or equal to 0.25 are the gaze point coordinates with abnormal probability values.
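The second outlier filtering could be sketched as follows, reusing the build_probability_heatmap helper from the sketch above and the 0.25 threshold given as an example; all names and the equal-area-grid assumption are illustrative.

```python
def remove_abnormal_points(gaze_points, prob_maps, heatmap, screen_w, screen_h,
                           threshold=0.25):
    """Second outlier filtering: drop gaze points whose heat-map probability is
    <= threshold and rebuild the heat map from the remaining probability maps
    (build_probability_heatmap is the helper sketched above)."""
    kept_points, kept_maps = [], []
    m, n = heatmap.shape
    for (x, y), pmap in zip(gaze_points, prob_maps):
        row = min(int(y / (screen_h / m)), m - 1)
        col = min(int(x / (screen_w / n)), n - 1)
        if heatmap[row, col] > threshold:          # keep only points with normal probability
            kept_points.append((x, y))
            kept_maps.append(pmap)
    updated_heatmap = build_probability_heatmap(kept_maps) if kept_maps else heatmap
    return kept_points, updated_heatmap
```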
In this embodiment, according to the filtered gaze point queue and the probability heat map, the user gaze region may be determined from the gaze regions by voting or the like. For example, the user gaze region is determined by voting as follows: assume there are 100 gaze point coordinates in the filtered gaze point queue. The 100 gaze point coordinates are traversed, and the vote value of each gaze point coordinate for each gaze region is calculated according to whether the gaze point coordinate lies in the gaze region, the probability value of the gaze region, and the like, to obtain the total vote value of each gaze region. For example, the vote value T(i-j) of the i-th gaze point coordinate for gaze region j may be: T(i-j) = a × Pi, where a = 1 when the i-th gaze point coordinate is in gaze region j, and a = 0 when it is not, and Pi is the probability value of the i-th gaze point coordinate in gaze region j, obtained from the probability heat map. The total vote value T(j) of the 100 gaze point coordinates for gaze region j may then be: T(j) = T(1-j) + T(2-j) + … + T(100-j). The gaze region with the largest total vote value is selected as the user gaze region. Determining the user gaze region by voting can increase the confidence of the determined user gaze region.
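A sketch of the voting step under the same equal-area-grid assumption; the function name is illustrative.

```python
import numpy as np

def vote_for_gaze_region(filtered_points, heatmap, screen_w, screen_h):
    """Voting: each gaze point adds the heat-map probability of the region it
    falls in to that region's total vote, i.e. T(i-j) = a * Pi with a = 1 only
    for the region containing point i; the region with the largest total wins."""
    m, n = heatmap.shape
    votes = np.zeros((m, n), dtype=np.float64)
    for x, y in filtered_points:
        row = min(int(y / (screen_h / m)), m - 1)
        col = min(int(x / (screen_w / n)), n - 1)
        votes[row, col] += heatmap[row, col]
    r, c = np.unravel_index(np.argmax(votes), votes.shape)
    return r, c  # 0-based index of the user gaze region with the largest total vote
```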
And after the watching area with the maximum total voting value is determined as the user watching area, the range of the user watching area can be corrected according to the watching point coordinates in the filtered watching point queue, so that the accuracy of the user watching area is improved. For example, a regular or irregular gazing point region may be determined according to the gazing point coordinates in the filtered gazing point queue, and then a portion of the determined user gazing region that belongs to the gazing point region may be determined as the modified user gazing region.
To further describe the method for identifying a gaze region provided by the present application, a flowchart of the method for identifying a gaze region according to an embodiment of the present application is shown in conjunction with fig. 9, and the implementation process of the method is described in its entirety. As shown in fig. 9, the gaze area identification method example includes steps S21 to S28. This example is applied to a terminal device with a screen.
In step S21, after determining that the gaze region identification is required, the terminal device turns on the camera to capture images, and obtains a plurality of first images, where the first images include the face of the user.
In step S22, after the terminal device captures a first image, feature extraction is performed on the first image to obtain face feature point data, head pose data, and eye feature data. The implementation process is referred to the above step S11, and is not described herein.
In step S23, the terminal device inputs the face feature point data, the head pose data, and the eye feature data into a pre-trained gaze estimation model, and obtains gaze point coordinates and gaze probability maps of the first images. The implementation process is described with reference to fig. 7 and related text, and is not described herein again.
In step S24, a first round of abnormal-point filtering is performed to obtain the effective gaze point coordinates and the effective gaze probability maps (see the determination process of the effective gaze point coordinates and effective gaze probability maps described above).
In step S25, a gaze point queue is formed from all the effective gaze point coordinates, and step S26 is then performed to filter the gaze point queue, obtaining a filtered gaze point queue.
In step S25', a probability map queue is formed from all the effective gaze probability maps, and step S26' is then performed to superpose the gaze probability maps in the probability map queue into a superposed probability map and to normalize the superposed probability map into a probability heat map.
In step S27, a second round of abnormal-point filtering is performed: gaze point coordinates with abnormal probability values are removed from the filtered gaze point queue, and the probability heat map is updated according to the gaze probability maps corresponding to those removed coordinates.
In step S28, a user gaze region is determined from the plurality of gaze regions using the filtered gaze point queue and the updated probability heat map obtained after step S27. A code-level sketch that ties steps S21 to S26' together is given below.
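The following sketch ties steps S21 to S26' together under stated assumptions: capture, feature extraction and the gaze estimation model are passed in as placeholder callables (they are not interfaces defined by this application), each gaze probability map is a per-region probability vector, and the first outlier filtering is approximated by a simple threshold on the largest probability value.

import numpy as np

def build_queues(frames, extract_features, gaze_model, min_prob=0.1):
    """frames: the captured first images (S21).
    extract_features, gaze_model: placeholder callables standing in for S22 and S23.
    Returns the gaze point queue, the probability map queue and the probability heat map."""
    gaze_queue, prob_queue = [], []
    for frame in frames:
        feats = extract_features(frame)               # S22: face feature points, head pose, eye features
        (x, y), prob_map = gaze_model(feats)          # S23: gaze point coordinates + gaze probability map
        if np.max(prob_map) >= min_prob:              # S24: first outlier filtering (assumed rule)
            gaze_queue.append((x, y))                 # S25: gaze point queue
            prob_queue.append(np.asarray(prob_map))   # S25': probability map queue
    superposed = np.sum(prob_queue, axis=0)           # S26': superpose the gaze probability maps
    heat_map = superposed / superposed.sum()          # S26': normalize into a probability heat map
    return gaze_queue, prob_queue, heat_map

Steps S26 (filtering of the gaze point queue), S27 and S28 then proceed as sketched elsewhere in this description.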
Fig. 10 shows a block diagram of a gaze region identification apparatus according to an embodiment of the present application. As shown in fig. 10, the apparatus includes: a feature extraction module 41, a gaze estimation module 42, a queue creation module 43 and a region determination module 44.
The feature extraction module 41 is configured to perform feature extraction on the acquired at least one first image to obtain feature data of each first image, where the feature data includes face feature point data, head pose data, and eye feature data, and the first image includes a face of a user.
The gazing estimation module 42 is configured to input the feature data into a trained sight line estimation model to obtain the gazing point coordinates and the gazing probability map corresponding to each first image, where the gazing point coordinates represent the coordinates of the intersection point of the sight lines of the two eyes, the gazing probability map includes a probability value of the gazing sight line falling in each gazing area of the screen watched by the user, the gazing sight line represents a line passing through the eyes and the gazing point, and the screen includes a plurality of pre-divided gazing areas.
And the queue creating module 43 is configured to process the gazing point coordinates and the gazing probability map of each first image to form a gazing point queue and a probability map queue.
And the area determining module 44 is configured to determine a user gazing area from the multiple gazing areas according to the gazing point queue and the probability map queue.
In one possible implementation, the queue creating module 43 may include:
the probability value determining submodule is used for determining the probability value of the fixation point coordinate corresponding to the fixation probability map according to the fixation probability map;
the screening submodule is used for screening effective fixation point coordinates from the fixation point coordinates according to the probability value of the fixation point coordinates of each first image and determining an effective fixation probability graph corresponding to the effective fixation point coordinates;
and the queue forming submodule is used for forming a fixation point queue and a probability map queue according to the effective fixation probability map and the effective fixation point coordinate.
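A sketch of this screening under the assumption that the probability value of a gaze point coordinate is the probability of the gaze region in which it falls, and that a coordinate is effective only when that probability reaches a threshold; the threshold, the rectangle representation of the regions and the helper names are illustrative assumptions.

def region_of(x, y, regions):
    """Index of the gaze region containing (x, y), or None if the point is off-screen."""
    for j, (x0, y0, x1, y1) in enumerate(regions):
        if x0 <= x <= x1 and y0 <= y <= y1:
            return j
    return None

def screen_valid(points, prob_maps, regions, min_prob=0.1):
    """points: raw gaze point coordinates, one per first image.
    prob_maps: the corresponding gaze probability maps (one value per gaze region)."""
    gaze_queue, prob_queue = [], []
    for (x, y), prob_map in zip(points, prob_maps):
        j = region_of(x, y, regions)
        if j is not None and prob_map[j] >= min_prob:  # effective gaze point coordinate
            gaze_queue.append((x, y))
            prob_queue.append(prob_map)                # effective gaze probability map
    return gaze_queue, prob_queue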
In one possible implementation, the region determining module 44 may include:
the filtering submodule is used for respectively carrying out filtering processing on the fixation point queue and the probability map queue to obtain a filtered fixation point queue and a filtered probability heat map;
and the area determining sub-module is used for determining a user gazing area from the plurality of gazing areas according to the filtered gazing point queue and the probability heat map.
In one possible implementation manner, the filtering submodule includes:
the first filtering submodule is used for filtering the fixation point queue to obtain a filtered fixation point queue;
and the second filtering sub-module is used for performing normalization processing after the gazing probability maps in the probability map queue are subjected to superposition to form a probability heat map.
In one possible implementation, the apparatus may further include:
the determining module is used for determining a probability value corresponding to the fixation point coordinates in the filtered fixation point queue according to the probability heat map;
the screening and updating module is used for performing screening according to the probability value of each fixation point coordinate, removing the fixation point coordinates with abnormal probability values from the filtered fixation point queue, and
updating the probability heat map according to the fixation probability maps corresponding to the fixation point coordinates with abnormal probability values.
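A sketch of this module, assuming that "abnormal" means the heat-map probability of the region containing a gaze point falls below a threshold, and that updating the heat map means removing the contribution of the corresponding gaze probability maps before renormalizing; the threshold and the region_of helper reuse the illustrative assumptions above.

import numpy as np

def second_outlier_filter(filtered_points, prob_queue, heat_map, regions, min_prob=0.05):
    """filtered_points: gaze point coordinates of the filtered gaze point queue.
    prob_queue: the gaze probability maps corresponding to those coordinates."""
    superposed = np.sum(prob_queue, axis=0)
    kept_points = []
    for (x, y), prob_map in zip(filtered_points, prob_queue):
        j = region_of(x, y, regions)
        if j is None or heat_map[j] < min_prob:  # abnormal probability value
            superposed = superposed - prob_map   # take this map out of the superposition
        else:
            kept_points.append((x, y))
    return kept_points, superposed / superposed.sum()  # updated probability heat map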
In one possible implementation manner, the apparatus may further include:
the model training module is used for training the sight line estimation model by utilizing the characteristic data of the training image to obtain the trained sight line estimation model;
updating model parameters of the sight estimation model by using the loss of the sight estimation model determined according to a preset loss function in training;
wherein the preset loss function is determined according to a regression loss function and a cross entropy loss function.
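The combined loss could be sketched as below; the use of PyTorch, the choice of Smooth L1 as the regression loss, and the weighting factor alpha are assumptions, since the description only states that the preset loss is determined from a regression loss function and a cross-entropy loss function.

import torch.nn as nn

class GazeLoss(nn.Module):
    """Regression loss on the gaze point coordinates plus cross-entropy loss
    on the per-region gaze probabilities."""
    def __init__(self, alpha=1.0):
        super().__init__()
        self.alpha = alpha
        self.regression = nn.SmoothL1Loss()         # regression loss on the (x, y) gaze point
        self.cross_entropy = nn.CrossEntropyLoss()  # cross-entropy loss on the region logits

    def forward(self, pred_xy, true_xy, region_logits, true_region):
        return (self.regression(pred_xy, true_xy)
                + self.alpha * self.cross_entropy(region_logits, true_region))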
With this apparatus, gaze region estimation based on RGB images can be realized, so that the region of the screen at which a user is gazing can be estimated with small error and low jitter; the approach is low in cost, and full-scene deployment can be realized.
An embodiment of the present application provides a gaze area identification device, including: a processor and a memory for storing processor-executable instructions; wherein the processor is configured to implement the above method when executing the instructions.
Embodiments of the present application provide a non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
Embodiments of the present application provide a computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, wherein when the computer readable code runs in a processor of an electronic device, the processor in the electronic device performs the above method.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an erasable Programmable Read-Only Memory (EPROM or flash Memory), a Static Random Access Memory (SRAM), a portable Compact Disc Read-Only Memory (CD-ROM), a Digital Versatile Disc (DVD), a Memory stick, a floppy disk, a mechanical coding device, a punch card or an in-groove protrusion structure, for example, having instructions stored thereon, and any suitable combination of the foregoing.
The computer readable program instructions or code described herein may be downloaded to the respective computing/processing device from a computer readable storage medium, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present application may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine-related instructions, microcode, firmware instructions, state setting data, or source or object code written in any combination of one or more programming languages, including an object oriented programming language such as Smalltalk, C++ or the like, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, electronic circuitry such as programmable logic circuits, Field-Programmable Gate Arrays (FPGAs), or Programmable Logic Arrays (PLAs) can be personalized by utilizing state information of the computer-readable program instructions, and this electronic circuitry can execute the computer-readable program instructions to implement aspects of the present application.
Various aspects of the present application are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
It is also noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by hardware (e.g., an electronic Circuit or an ASIC (Application Specific Integrated Circuit)) for performing the corresponding functions or acts, or combinations of hardware and software, such as firmware.
While the invention has been described in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a review of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the word "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
Having described embodiments of the present application, the foregoing description is intended to be exemplary, not exhaustive, and not limited to the disclosed embodiments. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terminology used herein is chosen in order to best explain the principles of the embodiments, the practical application, or improvements made to the technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (15)

1. A gaze region identification method, the method comprising:
performing feature extraction on at least one acquired first image to obtain feature data of each first image, wherein the feature data comprises face feature point data, head pose data and eye feature data, and the first image comprises a face of a user;
inputting the characteristic data into a trained sight estimation model to obtain a fixation point coordinate and a fixation probability graph corresponding to each first image, wherein the fixation point coordinate represents a coordinate of a cross point of the sight of the eyes, the fixation probability graph comprises a probability value of the fixation sight in each fixation area of a screen fixed by a user, the fixation sight represents a connection line passing through the eyes and the fixation point, and the screen comprises a plurality of fixation areas which are divided in advance;
processing the fixation point coordinates and the fixation probability map of each first image to form a fixation point queue and a probability map queue;
and determining a user gazing area from the plurality of gazing areas according to the gazing point queue and the probability map queue.
2. The method of claim 1, wherein processing the gaze point coordinates and the gaze probability map for each first image to form a gaze point queue and a probability map queue comprises:
determining a probability value of a fixation point coordinate corresponding to the fixation probability map according to the fixation probability map;
screening effective fixation point coordinates from the fixation point coordinates according to the probability values of the fixation point coordinates of the first images, and determining effective fixation probability graphs corresponding to the effective fixation point coordinates;
and forming a fixation point queue and a probability map queue according to the effective fixation probability map and the effective fixation point coordinate.
3. The method of claim 1, wherein determining a user gaze region from the plurality of gaze regions based on the gaze point queue and the probability map queue comprises:
filtering the fixation point queue and the probability map queue respectively to obtain a filtered fixation point queue and a probability heat map;
and determining a user gazing area from the plurality of gazing areas according to the filtered gazing point queue and the probability heat map.
4. The method of claim 3, wherein filtering the point of regard queue and the probability map queue to obtain a filtered point of regard queue and a probability heat map comprises:
filtering the point of regard queue to obtain a filtered point of regard queue;
and overlapping and multiplying the gazing probability maps in the probability map queue, and then performing normalization processing to form a probability heat map.
5. The method of claim 3, wherein prior to determining a user gaze region from the plurality of gaze regions based on the filtered sequence of gaze points and the probability heat map, the method further comprises:
determining probability values corresponding to fixation point coordinates in the filtered fixation point queue according to the probability heatmap;
screening according to the probability value of each fixation point coordinate, removing fixation point coordinates with abnormal probability values in the filtered fixation point queue, and
and updating the probability heat map according to the gazing probability map corresponding to the gazing point coordinate with the abnormal probability value.
6. The method of claim 1, further comprising:
training a sight line estimation model by using the characteristic data of the training image to obtain the trained sight line estimation model;
updating model parameters of the sight estimation model by using the loss of the sight estimation model determined according to a preset loss function in training;
wherein the preset loss function is determined according to a regression loss function and a cross entropy loss function.
7. A gaze region identification apparatus, characterized in that the apparatus comprises:
the system comprises a feature extraction module, a feature extraction module and a feature extraction module, wherein the feature extraction module is used for performing feature extraction on at least one acquired first image to obtain feature data of each first image, the feature data comprises face feature point data, head pose data and eye feature data, and the first image comprises a face of a user;
the gaze estimation module is used for inputting the characteristic data into a trained gaze estimation model to obtain gaze point coordinates and a gaze probability map corresponding to each first image, wherein the gaze point coordinates represent coordinates of a cross point of binocular gaze lines, the gaze probability map comprises probability values of gaze areas of gaze lines in a screen watched by a user, the gaze lines represent connection lines passing through eyes and gaze points, and the screen comprises a plurality of pre-divided gaze areas;
the queue creating module is used for processing the fixation point coordinates and the fixation probability map of each first image to form a fixation point queue and a probability map queue;
and the area determining module is used for determining a user gazing area from the plurality of gazing areas according to the gazing point queue and the probability map queue.
8. The apparatus of claim 7, wherein the queue creation module comprises:
the probability value determining submodule is used for determining the probability value of the fixation point coordinate corresponding to the fixation probability map according to the fixation probability map;
the screening submodule is used for screening effective fixation point coordinates from the fixation point coordinates according to the probability value of the fixation point coordinates of each first image and determining an effective fixation probability graph corresponding to the effective fixation point coordinates;
and the queue forming submodule is used for forming a fixation point queue and a probability map queue according to the effective fixation probability map and the effective fixation point coordinate.
9. The apparatus of claim 7, wherein the region determining module comprises:
the filtering submodule is used for respectively carrying out filtering processing on the fixation point queue and the probability map queue to obtain a filtered fixation point queue and a filtered probability heat map;
and the area determining sub-module is used for determining a user gazing area from the plurality of gazing areas according to the filtered gazing point queue and the probability heat map.
10. The apparatus of claim 9, wherein the filtering sub-module comprises:
the first filtering submodule is used for carrying out filtering processing on the point of regard queue to obtain a filtered point of regard queue;
and the second filtering sub-module is used for performing normalization processing after the gazing probability maps in the probability map queue are subjected to superposition to form a probability heat map.
11. The apparatus of claim 9, further comprising:
the determining module is used for determining a probability value corresponding to the fixation point coordinates in the filtered fixation point queue according to the probability heat map;
a screening and updating module for screening according to the probability value of each fixation point coordinate, removing the fixation point coordinate with abnormal probability value in the filtered fixation point queue, and
and updating the probability heat map according to the fixation probability map corresponding to the fixation point coordinate with the abnormal probability value.
12. The apparatus of claim 7, further comprising:
the model training module is used for training the sight line estimation model by utilizing the characteristic data of the training image to obtain the trained sight line estimation model;
updating model parameters of the sight line estimation model by using the loss of the sight line estimation model determined according to a preset loss function in training;
wherein the preset loss function is determined according to a regression loss function and a cross entropy loss function.
13. A gaze region identification apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the method of any of claims 1-6 when executing the instructions.
14. A non-transitory computer readable storage medium having computer program instructions stored thereon, wherein the computer program instructions, when executed by a processor, implement the method of any of claims 1-6.
15. A computer program product comprising computer readable code or a non-transitory computer readable storage medium carrying computer readable code which, when run in an electronic device, a processor in the electronic device performs the method of any of claims 1-6.
CN202110221018.4A 2021-02-26 2021-02-26 Watching region identification method and device Pending CN115049819A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110221018.4A CN115049819A (en) 2021-02-26 2021-02-26 Watching region identification method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110221018.4A CN115049819A (en) 2021-02-26 2021-02-26 Watching region identification method and device

Publications (1)

Publication Number Publication Date
CN115049819A true CN115049819A (en) 2022-09-13

Family

ID=83156595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110221018.4A Pending CN115049819A (en) 2021-02-26 2021-02-26 Watching region identification method and device

Country Status (1)

Country Link
CN (1) CN115049819A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115862124A (en) * 2023-02-16 2023-03-28 南昌虚拟现实研究院股份有限公司 Sight estimation method and device, readable storage medium and electronic equipment
CN116704589A (en) * 2022-12-01 2023-09-05 荣耀终端有限公司 Gaze point estimation method, electronic device and computer readable storage medium

Similar Documents

Publication Publication Date Title
US11948282B2 (en) Image processing apparatus, image processing method, and storage medium for lighting processing on image using model data
US10891473B2 (en) Method and device for use in hand gesture recognition
US11526704B2 (en) Method and system of neural network object recognition for image processing
CN111242090B (en) Human face recognition method, device, equipment and medium based on artificial intelligence
WO2016165614A1 (en) Method for expression recognition in instant video and electronic equipment
WO2023168957A1 (en) Pose determination method and apparatus, electronic device, storage medium, and program
US11284020B2 (en) Apparatus and method for displaying graphic elements according to object
CN115049819A (en) Watching region identification method and device
CN114339102B (en) Video recording method and equipment
CN112116525B (en) Face recognition method, device, equipment and computer readable storage medium
CN115209057B (en) Shooting focusing method and related electronic equipment
US11948280B2 (en) System and method for multi-frame contextual attention for multi-frame image and video processing using deep neural networks
CN112802081A (en) Depth detection method and device, electronic equipment and storage medium
JP2005149370A (en) Imaging device, personal authentication device and imaging method
CN116048244A (en) Gaze point estimation method and related equipment
CN112949467B (en) Face detection method, device, electronic equipment and storage medium
CN111814811A (en) Image information extraction method, training method and device, medium and electronic equipment
CN111385481A (en) Image processing method and device, electronic device and storage medium
WO2021233051A1 (en) Interference prompting method and device
US11330166B2 (en) Method of automatically photographing an image, image processing device and image processing system performing the same
CN112750157B (en) Depth image generation method and device
CN112950641A (en) Image processing method and device, computer readable storage medium and electronic device
CN114841863A (en) Image color correction method and device
CN114207669A (en) Human face illumination image generation device and method
CN116109828B (en) Image processing method and electronic device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination