CN117690180A - Eyeball fixation recognition method and electronic equipment - Google Patents


Info

Publication number: CN117690180A
Authority: CN (China)
Prior art keywords: feature, image, eye, layer, downsampler
Legal status: Pending
Application number: CN202310793444.4A
Other languages: Chinese (zh)
Inventor: 龚少庆 (Gong Shaoqing)
Current Assignee: Honor Device Co Ltd
Original Assignee: Honor Device Co Ltd
Application filed by Honor Device Co Ltd
Priority to CN202310793444.4A
Publication of CN117690180A


Abstract

The application provides an eye gaze recognition method and an electronic device. The method can be applied to electronic devices such as mobile phones and tablet computers. By implementing the method, the electronic device can crop an image containing a portrait captured by a camera to obtain sub-images such as a left-eye image, a right-eye image, and a face image; extract the features of each sub-image with a preset feature extractor such as CNN-A, skip-CNN, or a hybrid Transformer; splice the features of the sub-images; and determine the gaze point of the user's eyes on the screen from the spliced features, thereby improving the accuracy of the identified gaze point and the eye gaze recognition effect.

Description

Eyeball fixation recognition method and electronic equipment
Technical Field
The application relates to the field of terminals, in particular to an eyeball fixation recognition method and electronic equipment.
Background
With the rise of mobile terminals and the maturation of communication technology, people have begun to explore novel human-computer interaction modes that move away from the mouse and keyboard, such as voice control, gesture recognition control, and eye gaze control, providing users with more diverse and more convenient interaction experiences. Eye gaze control refers to identifying the gaze point of a user's eyes on a screen and performing a corresponding interactive operation based on the position of that gaze point on the screen. Eye gaze control is fast and convenient to operate and can support interaction control in almost any scenario. At present, however, insufficient extraction of eye features results in low accuracy of the identified gaze point and a poor eye gaze control effect.
Disclosure of Invention
The embodiment of the application provides an eye gaze recognition method and an electronic device. An electronic device implementing the method can crop an image containing a portrait captured by the camera, acquire the features of each cropped sub-image, splice the features of the sub-images, and determine the gaze point of the user's eyes on the screen from the spliced features, thereby improving the accuracy of the identified gaze point and the eye gaze recognition effect.
In a first aspect, the present application provides an eye gaze recognition method. The method is applied to an electronic device comprising a screen. The method comprises the following steps: acquiring a first image; acquiring a first left-eye image, a first right-eye image, and a first face image from the first image; acquiring a first left-eye feature from the first left-eye image, a first right-eye feature from the first right-eye image, and a first face feature from the first face image; combining the first left-eye feature, the first right-eye feature, and the first face feature to obtain a first feature; and determining a target gaze point on the screen, wherein the target gaze point is obtained according to the first feature.
By implementing the method provided in the first aspect, the electronic device may crop an image containing a portrait captured by the camera to obtain sub-images such as a left-eye image, a right-eye image, and a face image, acquire the features of the cropped sub-images respectively, splice the features of the sub-images, and determine the gaze point of the user's eyes on the screen from the spliced features. Compared with the existing approach of extracting features from the whole image directly, cropping before feature extraction and then determining the eye gaze point from the spliced features can improve the recognition accuracy of the eye gaze point, that is, the eye gaze recognition effect.
With reference to the method provided in the first aspect, in some embodiments, the method further includes: encoding the first left-eye feature and the first right-eye feature with a transform encoder. In this case, combining the first left-eye feature, the first right-eye feature, and the first face feature to obtain the first feature specifically includes: combining the encoded first left-eye feature, the encoded first right-eye feature, and the first face feature to obtain the first feature.
By implementing the method provided by this embodiment, after obtaining the left-eye feature and the right-eye feature, the electronic device may use the transform encoder (i.e., Transformer encoder) to encode, that is, to reconstruct, the eye features, and then use the encoded eye features together with the face feature to determine the user's eye gaze point. Performing eye gaze point recognition with the eye features reconstructed by the transform encoder can further improve the eye gaze recognition effect.
In combination with the method provided in the first aspect, in some embodiments, the electronic device is preset with a first feature extractor and a second feature extractor, where a first left-eye feature is acquired from a first left-eye image, a first right-eye feature is acquired from a first right-eye image, and a first face feature is acquired from a first face image, which specifically includes: acquiring a first left-eye feature from a first left-eye image by using a first feature extractor, and acquiring a first right-eye feature from a first right-eye image; a first face feature is obtained from the first face image using a second feature extractor.
Preferably, the first feature extractor is different from the second feature extractor. That is, the electronic device may construct separate feature extractors for the eye images and the face image according to the differences between them, so as to improve the quality of the extracted eye features and face features as much as possible and improve the eye gaze recognition effect.
In some embodiments, the first feature extractor and the second feature extractor may also be the same feature extractor. This reduces the size of the eye gaze recognition algorithm and saves storage space.
In some embodiments, the first feature extractor is established based on a convolutional neural network.
Preferably, the first feature extractor comprises a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer, and a third convolution layer; wherein the size of the convolution kernel of the first convolution layer is 7×7, and the step size is 1; the size of the convolution kernel of the second convolution layer is 5×5, and the step size is 1; the size of the convolution kernel of the third convolution layer is 3×3, and the step size is 1; the sizes of the pooling cores of the first pooling layer and the second pooling layer are 2×2, and the step size is 2.
In some embodiments, the first feature extractor includes a plurality of processing layers and one or more downsamplers, the processing layers including a convolution layer and a pooling layer.
The plurality of processing layers includes a first processing layer, and the one or more downsamplers include a downsampler i. Taking the first processing layer and the downsampler i as an example, in the first feature extractor, the processing layers and the downsampler structurally satisfy: the input of the downsampler i is the same as the input of the first processing layer, and the output of the downsampler i is spliced with the output of a second processing layer; the second processing layer is the first processing layer itself, or a processing layer after the first processing layer.
By implementing the method provided by this embodiment, a first feature extractor that combines convolution layers, pooling layers, and downsamplers can acquire more eye features, enrich the feature space, and further improve the eye gaze recognition effect.
In combination with the method provided in the above embodiments, in some embodiments, the plurality of processing layers includes a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer, and a third convolution layer.
On the basis of the above-mentioned first convolution layer, first pooling layer, second convolution layer, second pooling layer, and third convolution layer, in some embodiments, the plurality of downsamplers includes a first downsampler, a second downsampler, and a third downsampler. The input of the first downsampler is the same as the input of the first convolution layer, and the output of the first downsampler is spliced with the output of the first convolution layer and input to the first pooling layer; the input of the second downsampler is the same as the input of the first pooling layer, and the output of the second downsampler is spliced with the output of the second convolution layer and input to the second pooling layer; the input of the third downsampler is the same as the input of the second pooling layer, and the output of the third downsampler is spliced with the output of the third convolution layer, and the spliced result is output.
On the basis of the first convolution layer, the first pooling layer, the second convolution layer, the second pooling layer, and the third convolution layer, in some embodiments, the plurality of downsamplers includes or further includes a fourth downsampler and a fifth downsampler. The input of the fourth downsampler is the same as the input of the first convolution layer, and the output of the fourth downsampler is spliced with the output of the second convolution layer and input to the second pooling layer; the input of the fifth downsampler is the same as the input of the first pooling layer, and the output of the fifth downsampler is spliced with the output of the third convolution layer, and the spliced result is output.
On the basis of the above-mentioned first convolution layer, first pooling layer, second convolution layer, second pooling layer, and third convolution layer, in some embodiments, the plurality of downsamplers includes or further includes a sixth downsampler. The input of the sixth downsampler is the same as the input of the first convolution layer, and the output of the sixth downsampler is spliced with the output of the third convolution layer, and the spliced result is output.
In some embodiments, the first feature extractor has a number of input channels of 1 and a number of output channels of 168; alternatively, the first feature extractor has a number of input channels of 3 and a number of output channels of 184.
In combination with the method provided in the first aspect, in some embodiments, before acquiring the first image, the method further includes: acquiring a second image; acquiring a second left-eye image, a second right-eye image, and a second face image from the second image; acquiring a second left-eye feature from the second left-eye image, a second right-eye feature from the second right-eye image, and a second face feature from the second face image; and combining the second left-eye feature, the second right-eye feature, and the second face feature to obtain a second feature; wherein the target gaze point is also obtained according to the second feature.
The second image, which is acquired earlier, is also referred to as a reference image with a calibration. The calibration is a known gaze point. That the target gaze point is further obtained according to the second feature specifically includes: the target gaze point is obtained from the calibration and a gaze point distance, wherein the gaze point distance is determined based on the first feature and the second feature.
By implementing the method provided by this embodiment, the electronic device can utilize the features of the two images, in particular the difference between the current image (i.e., the first image) and the reference image, to improve the recognition effect of the eye gaze point.
Preferably, the electronic device also uses the first feature extractor when acquiring the second left-eye feature from the second left-eye image and the second right-eye feature from the second right-eye image.
With reference to the method provided in the first aspect, in some embodiments, the method further includes: determining a hot zone corresponding to the target fixation point; and executing a first action corresponding to the hot zone.
After the eye gaze point is identified, the electronic device can determine a hot zone where the eye gaze point is located according to the position of the eye gaze point in the display screen, and further execute interactive control operation matched with the hot zone, so as to provide an interactive mode of eye gaze control for a user. For example, after determining that the eye gaze point is within an application icon hotspot of the camera application, the electronic device may open the camera application, display a main interface of the camera application, and provide the user with a wide variety of photography services. For another example, after determining that the user's eye gaze point is within the notification bar hot zone, the electronic device 100 may display a notification interface for the user to view the various notifications.
In a second aspect, the present application provides an electronic device comprising one or more processors and one or more memories; wherein the one or more memories are coupled to the one or more processors, the one or more memories being operable to store computer program code comprising computer instructions that, when executed by the one or more processors, cause the electronic device to perform the method as described in the first aspect and any possible implementation of the first aspect.
In a third aspect, embodiments of the present application provide a chip system for application to an electronic device, the chip system comprising one or more processors for invoking computer instructions to cause the electronic device to perform a method as described in the first aspect and any possible implementation of the first aspect.
In a fourth aspect, the present application provides a computer readable storage medium comprising instructions which, when run on an electronic device, cause the electronic device to perform a method as described in the first aspect and any possible implementation of the first aspect.
In a fifth aspect, the present application provides a computer program product comprising instructions which, when run on an electronic device, cause the electronic device to perform a method as described in the first aspect and any possible implementation of the first aspect.
It will be appreciated that the electronic device provided in the second aspect, the chip system provided in the third aspect, the computer storage medium provided in the fourth aspect, and the computer program product provided in the fifth aspect are all configured to perform the method provided in the present application. Therefore, for the advantageous effects they can achieve, reference may be made to the advantageous effects of the corresponding method, which are not repeated here.
Drawings
Fig. 1 is a schematic diagram of an eyeball gaze recognition algorithm according to an embodiment of the present application;
FIGS. 2A-2B are schematic diagrams of a set of acquired left-eye, right-eye, and face images provided in embodiments of the present application;
Fig. 3 is a schematic diagram of a network structure of CNN-A according to an embodiment of the present application;
FIG. 4 is a schematic diagram of feature stitching provided by an embodiment of the present application;
FIGS. 5A-5B are a set of user interfaces for interaction based on eye gaze recognition provided in an embodiment of the present application;
FIGS. 6A-6B are another set of user interfaces for interaction based on eye gaze recognition provided in an embodiment of the present application;
fig. 7 is a schematic diagram of a network structure of skip-CNN according to an embodiment of the present application;
FIG. 8 is a flowchart of the electronic device 100 reconstructing an eye feature using a transform encoder provided in an embodiment of the present application;
FIG. 9 is a schematic diagram of a transform encoder provided in an embodiment of the present application;
FIG. 10 is a schematic diagram of another eye gaze identification algorithm provided by an embodiment of the present application;
FIGS. 11A-11B are schematic diagrams of a set of cumulative distribution functions (Cumulative distribution function, CDF) provided in an embodiment of the present application;
fig. 12 shows a hardware configuration diagram of the electronic device 100.
Detailed Description
The terminology used in the following embodiments of the application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
One or more front-facing cameras are typically disposed in the electronic device 100, such as a mobile phone or a tablet computer. The electronic device 100 may collect a portrait via one or more front-facing cameras as described above for self-photographing, face unlocking, etc.
In the embodiments of the present application, the electronic device 100 may further perform eye gaze recognition using the portrait images collected by the one or more front-facing cameras, determine the gaze point of the current user's gaze on the display screen of the electronic device 100, determine the corresponding hot zone according to the position of the gaze point on the display screen, and perform the interaction control operation matched with that hot zone, thereby providing the user with an eye-gaze-controlled interaction mode.
The electronic device 100 is not limited to a mobile phone or a tablet computer; it may also be a desktop computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (UMPC), a netbook, a cellular phone, a personal digital assistant (PDA), an augmented reality (AR) device, a virtual reality (VR) device, an artificial intelligence (AI) device, a wearable device, a vehicle-mounted device, a smart home device, and/or a smart city device. The embodiments of the present application do not particularly limit the specific type of the terminal.
Fig. 1 is a schematic diagram of an eyeball gaze recognition algorithm according to an embodiment of the present application.
First, S101, the electronic device 100 acquires one frame of image, which is denoted as image X.
The electronic device 100 may acquire the image X through a front camera. The specific time when the electronic device 100 acquires the image X is not limited, that is, the specific scene under which the image X is acquired and the eyeball gaze recognition is performed is not limited. In some embodiments, the electronic device 100 may start the front camera to obtain a frame of image X including a face when displaying the lock screen interface. In other embodiments, the electronic device 100 may also start the front camera to obtain a frame of image X including the face during the desktop display process. In other embodiments, the electronic device 100 may also start the front camera after starting a certain application, and acquire a frame of image X including a face.
The electronic device 100 may be configured with different types of front-facing cameras, including but not limited to wide-angle cameras, ultra-wide-angle cameras, telephoto cameras, structured-light depth cameras, infrared cameras, and the like. The wide-angle camera, the ultra-wide-angle camera, and the telephoto camera can acquire and output three-channel RGB images; the structured-light depth camera can output four-channel RGBD images; the infrared camera can output a single-channel infrared (IR) image reflecting the intensity of infrared light. It follows that the image X may be a single-channel, three-channel, or other multi-channel image, depending on the type of camera used. Any type of image acquired by any type of camera may be used as the input image of the eye gaze recognition algorithm of the present application, that is, the image X described above; this is not limited in the embodiments of the present application.
After obtaining the image X, S102, the electronic device 100 may obtain a left-eye image, a right-eye image, and a face image from the image X.
The electronic device 100 may determine a face position in the image X through a face detection algorithm, and further obtain a face image based on the face position. The electronic device 100 may determine the binocular positions in the image X through a face keypoint detection algorithm, wherein the binocular positions include a left eye position and a right eye position. Further, the electronic device 100 may acquire a left eye image based on the left eye position, and acquire a right eye image based on the right eye position.
Fig. 2A-2B are schematic diagrams of a set of acquired left eye images, right eye images, and face images provided in an embodiment of the present application.
First, fig. 2A is a schematic diagram of acquiring a face image according to an embodiment of the present application.
The electronic device 100 may input the acquired image X into a face detection algorithm to perform face detection. The face detection algorithm can recognize the face in the image and output a face frame. As shown in fig. 2A, the face frame marks the position of the face in the image X. The image corresponding to the area enclosed by the face frame is the face image. The electronic device 100 may then crop the image X according to the face frame to obtain an independent face image.
Fig. 2B is a schematic diagram of acquiring left-eye and right-eye images provided in an embodiment of the present application.
The electronic device 100 may input the acquired image X into a face key point detection algorithm to perform face key point detection. The face key point detection algorithm can identify face key points in the image. Face key points include, but are not limited to, a left eye point, a right eye point, a nose point, a left lip point, and a right lip point. As shown in fig. 2B, the electronic device 100 may obtain the face key points: left eye point a, right eye point b, nose point c, left lip point d, and right lip point e. The electronic device 100 may determine a rectangular area centered on the left eye point a; the image corresponding to this rectangular area is the left-eye image. The electronic device 100 may then crop the image X according to this rectangular area to obtain the left-eye image. Similarly, the electronic device 100 may determine a rectangular area centered on the right eye point b; the image corresponding to this rectangular area is the right-eye image. The electronic device 100 may then crop the image X according to this rectangular area to obtain the right-eye image.
It will be appreciated that the electronic device 100 may perform face detection and face key point detection in parallel, thereby acquiring the left-eye image, the right-eye image, and the face image in parallel. That is, the process of acquiring the face image does not affect the acquisition of the left-eye image and the right-eye image by the electronic device 100, and the process of acquiring the left-eye image and the right-eye image does not affect the acquisition of the face image.
In some embodiments, before the left-eye image, the right-eye image, and the face image are acquired, the electronic device 100 may further perform rotation correction on the image X to ensure that the face of the user is correct, thereby improving the quality of the left-eye image, the right-eye image, and the face image, and improving the recognition effect of eye gaze recognition.
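As an illustration of the cropping step above, the following is a minimal Python sketch, assuming the face frame and the eye key points have already been produced by the face detection and face key point detection algorithms; the square eye-patch size EYE_CROP is a hypothetical value, since the embodiment does not fix the size of the rectangular eye areas.

```python
import numpy as np

EYE_CROP = 64  # hypothetical eye-patch side length; the embodiment does not fix this value

def crop_subimages(image_x: np.ndarray, face_box, left_eye_pt, right_eye_pt):
    """Cut the face image, left-eye image and right-eye image out of image X.

    face_box: (x1, y1, x2, y2) output by a face detection algorithm.
    left_eye_pt / right_eye_pt: (x, y) key points a and b.
    """
    img_h, img_w = image_x.shape[:2]
    x1, y1, x2, y2 = face_box
    face_img = image_x[y1:y2, x1:x2]

    def crop_eye(center):
        cx, cy = center
        half = EYE_CROP // 2
        top, left = max(cy - half, 0), max(cx - half, 0)
        return image_x[top:min(cy + half, img_h), left:min(cx + half, img_w)]

    return crop_eye(left_eye_pt), crop_eye(right_eye_pt), face_img
```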
After the left-eye image, the right-eye image, and the face image are acquired, S103, the electronic device 100 may input the left-eye image, the right-eye image, and the face image into the feature extractor, respectively, to acquire corresponding left-eye features, right-eye features, and face features.
The left-eye image and the right-eye image may be collectively referred to as eye images. Correspondingly, the left-eye features obtained from the left-eye image and the right-eye features obtained from the right-eye image may be collectively referred to as eye features. As shown in fig. 1, in the embodiment of the present application, the electronic device 100 may be preset with a feature extractor A and a feature extractor B. The feature extractor A may be used to process the eye images and acquire the eye features; the feature extractor B may be used to process the face image and acquire the face features.
In one implementation, the feature extractor A may be a neural network model built based on a convolutional neural network (Convolutional Neural Network, CNN). Thus, the feature extractor A may also be denoted as CNN-A. The feature extraction algorithm used by the feature extractor B is not limited in the embodiments of the present application. In the scenario where the feature extractor B also employs a CNN, the network structure and parameter settings of the feature extractor B (also referred to as CNN-B) are preferably different from those of CNN-A, i.e., CNN-B is different from CNN-A. Alternatively, CNN-B may also employ the same network structure and parameter settings as CNN-A, i.e., CNN-B is identical to CNN-A.
Fig. 3 is a schematic diagram of a network structure of CNN-A according to an embodiment of the present application.
As shown in fig. 3, the network structure of CNN-A may include 3 convolution layers C1-C3 and 2 pooling layers P1-P2, with the convolution layers and pooling layers arranged alternately. In one specific implementation, the convolution kernel of the convolution layer C1 has a size of 7×7 and a step size of 1; the convolution kernel of the convolution layer C2 has a size of 5×5 and a step size of 1; the convolution kernel of the convolution layer C3 has a size of 3×3 and a step size of 1; the pooling kernels of the pooling layers P1 and P2 both have a size of 2×2 and a step size of 2.
After the left eye image is input to CNN-A, CNN-A may obtain the left eye feature through the above-described processing of the convolution layer and pooling layer. After the right eye image is input to CNN-A, CNN-A may obtain the right eye feature through the above-described processing of the convolution layer and pooling layer.
The size of the features output by CNN-A (e.g., the left-eye features and right-eye features described above) is determined by the size of the input image (e.g., the left-eye image and right-eye image described above) and by the sizes and step sizes of the kernels of the layers in the network structure. Taking an input image of H1×W1 as an example, CNN-A outputs features of size H2×W2 after the processing of the convolution layers and pooling layers described above. In the scenario without edge padding (padding), with the kernel sizes and step sizes exemplified above, the relationship between H1×W1 and H2×W2 is as follows:
H2=(H1-25)/4;W2=(W1-25)/4。
The size of the features output by each layer in the network structure of CNN-A can likewise be determined from the sizes and step sizes of the kernels of the convolution layers and pooling layers the data has passed through before that layer, which is not described in detail here.
CNN-A also has input channels and output channels.
The number of input channels generally corresponds to the color structure of the input image. For example, when the input image is a three-channel RGB image, the number of input channels of CNN-A is 3; when the input image is a single-channel IR image, the number of input channels of CNN-A is 1. Specifically, CNN-A may first identify the color structure of the input image and determine its number of color channels, and then set its number of input channels to match the number of color channels. For example, when the input image is a three-channel RGB image, CNN-A may determine that the number of color channels of the input image is 3 and correspondingly set the number of input channels to 3.
The number of output channels is the number of output features. The number of output channels is typically greater than the number of input channels. For example, the number of output channels of CNN-A may be 60. When the number of output channels is 60, CNN-A extracts and outputs 60 sets of features from the input image. Specifically, CNN-A can use multiple sets of convolution-kernel (or pooling-kernel) values in each layer to obtain far more feature sets than the number of input channels. The more output channels, the more features CNN-A outputs, and at the same time the greater the computational cost of CNN-A. The number of output channels can be set empirically to obtain as many features as possible while controlling the computational cost.
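A minimal PyTorch sketch of a CNN-A-style extractor with the kernel and step sizes given above is shown below. The intermediate channel widths (32 and 48) and the choice of max pooling are illustrative assumptions; the text only fixes the kernel sizes, the step sizes, and the example output channel count of 60.

```python
import torch
import torch.nn as nn

class CNNA(nn.Module):
    """Sketch of a CNN-A-style feature extractor: C1-P1-C2-P2-C3.
    Kernel/step sizes follow the text; intermediate channel widths are assumptions."""
    def __init__(self, in_channels: int = 1, out_channels: int = 60):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=7, stride=1),   # C1: 7x7, step 1
            nn.MaxPool2d(kernel_size=2, stride=2),                 # P1: 2x2, step 2
            nn.Conv2d(32, 48, kernel_size=5, stride=1),            # C2: 5x5, step 1
            nn.MaxPool2d(kernel_size=2, stride=2),                 # P2: 2x2, step 2
            nn.Conv2d(48, out_channels, kernel_size=3, stride=1),  # C3: 3x3, step 1
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)

# A 64x64 single-channel IR eye image yields a 60-channel feature map.
left_eye_feature = CNNA(in_channels=1, out_channels=60)(torch.randn(1, 1, 64, 64))
print(left_eye_feature.shape)  # torch.Size([1, 60, 10, 10])
```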
In a scenario where the number of output channels of CNN-A is N1 and the number of output channels of CNN-B is N2, after inputting the left-eye image and the right-eye image into CNN-A, the electronic device 100 may obtain N1 sets of left-eye features and N1 sets of right-eye features; after inputting the face image into CNN-B, the electronic device 100 may obtain N2 sets of face features. Then, in S104, the electronic device 100 may splice the left-eye features and the right-eye features to obtain the eye features, and then splice the eye features and the face features to obtain the query feature.
The splicing operation is also called a combining operation. The spliced features are the union of the two groups of features before splicing. Therefore, the eye features obtained after splicing are all of the features obtained from the eye images of the image X, and the query feature obtained after splicing contains all of the features obtained from the image X.
Fig. 4 is a schematic diagram of feature stitching provided in an embodiment of the present application. As shown in fig. 4, the upper-left gray three-layer grid may represent the N1 sets of left-eye features output by CNN-A for the left-eye image, and the upper-right white three-layer grid may represent the N1 sets of right-eye features output by CNN-A for the right-eye image; the six-layer grid below may represent the eye features obtained after splicing: 2×N1 sets, including the N1 sets of left-eye features and the N1 sets of right-eye features described above.
Further, after splicing the 2×N1 sets of eye features and the N2 sets of face features, the electronic device 100 obtains the query feature: (2×N1+N2) sets.
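The splicing of S104 amounts to channel-wise concatenation, as in the hedged sketch below; the assumption that the eye feature maps and the face feature map share the same spatial size is made only for illustration.

```python
import torch

N1, N2 = 60, 60  # illustrative output channel counts of CNN-A and CNN-B
left_eye  = torch.randn(1, N1, 10, 10)
right_eye = torch.randn(1, N1, 10, 10)
face      = torch.randn(1, N2, 10, 10)  # assumed to share the eye features' spatial size

eye_feature   = torch.cat([left_eye, right_eye], dim=1)  # 2*N1 sets of eye features
query_feature = torch.cat([eye_feature, face], dim=1)    # (2*N1 + N2) sets in total
print(query_feature.shape)  # torch.Size([1, 180, 10, 10])
```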
After obtaining the query feature, in S105, the electronic device 100 may input the query feature into a fully connected (full connection, FC) layer and obtain the gaze point (x, y) through FC prediction. The gaze point (x, y) is the gaze point determined by the eye gaze recognition algorithm from the image X.
After obtaining the gaze point (x, y), the electronic device 100 determines the hot zone matching the gaze point (x, y) and triggers the function corresponding to that hot zone.
A hot zone is a region of fixed shape and size on the display screen. For example, in a scenario where the desktop is displayed, the electronic device 100 may be provided with application hot zones. The display area corresponding to any application icon on the desktop may be referred to as an application hot zone. After detecting a user operation on an application hot zone, the electronic device 100 may start or configure the application program corresponding to that hot zone. The area of an application hot zone is larger than or equal to the display area of the corresponding application icon and completely covers the application icon. The application hot zones corresponding to different application icons do not overlap each other.
Fig. 5A shows a desktop of the electronic device 100 provided in an embodiment of the present application. As shown in fig. 5A, one or more application icons, such as a phone application icon, a browser application icon, a camera application icon, and a weather application icon, may be displayed on the desktop of the electronic device 100. Taking the camera application as an example, region 501 may represent the camera hot zone. The camera hot zone completely covers the camera application icon. After determining the user's eye gaze point (x, y) by the method shown in S101-S105, the electronic device 100 may determine whether the gaze point (x, y) is within the camera hot zone, i.e., within the region 501. After determining that the gaze point (x, y) is within the camera hot zone and the gaze duration reaches a preset duration, the electronic device 100 may open the camera application. Referring to fig. 5B, the electronic device 100 may display the photographing interface of the camera application.
Referring to fig. 6A, the electronic device 100 may also be provided with an area 601, and the area 601 may be referred to as a notification bar hot zone. After determining the user' S eye gaze point (x, y) by the method shown in S101-S105, the electronic device 100 may determine whether the gaze point (x, y) is within the notification bar hot zone, i.e., within the region 601. After determining that the gaze point (x, y) is within the notification bar hot zone and the gaze duration reaches the preset duration, the electronic device 100 may display a notification interface, referring to fig. 6B.
In this way, the user can realize interactive control of the electronic device 100 by eye gaze operation instead of touch operation on the screen.
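The hot zone matching described above can be sketched as a simple hit test plus a dwell check; the rectangle representation, the action identifiers, and the 1-second threshold are illustrative assumptions (the embodiments only speak of a preset duration).

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class HotZone:
    left: float
    top: float
    right: float
    bottom: float
    action: str  # e.g. "open_camera", "show_notifications" (hypothetical identifiers)

def match_hot_zone(gaze_xy, hot_zones: List[HotZone], dwell_s: float,
                   preset_duration_s: float = 1.0) -> Optional[str]:
    """Return the action of the hot zone containing the gaze point,
    provided the gaze has stayed there for the preset duration."""
    x, y = gaze_xy
    for zone in hot_zones:
        if zone.left <= x <= zone.right and zone.top <= y <= zone.bottom:
            return zone.action if dwell_s >= preset_duration_s else None
    return None
```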
Further, in order to improve the accuracy of the eye gaze recognition algorithm and reduce the error between the gaze point predicted by the algorithm and the gaze point the user actually gazes at, the electronic device 100 may use a skip-connection feature extractor instead of the original CNN-A.
The skip-connection feature extractor introduces a plurality of downsamplers between the convolution layers and the pooling layers on the basis of CNN-A. Thus, the skip-connection feature extractor may also be referred to as skip-CNN. A downsampler is used to downsample its input image and reduce its size. The compressed image matrix output by a downsampler may also be regarded as a set of features describing the image. skip-CNN combines the features output by the downsamplers with the features output after the processing of the convolution layers and pooling layers, enriching the feature space and further improving the recognition effect of the gaze point.
Fig. 7 is a schematic diagram of a network structure of skip-CNN according to an embodiment of the present application.
As shown in fig. 7, skip-CNN also includes convolutional layers C1-C3, pooling layers P1-P2. The connection relation between the convolutional layers C1-C3 and the pooling layers P1-P2 in skip-CNN, and the sizes and step sizes of the cores of the convolutional layers C1-C3 and the pooling layers P1-P2 are the same as CNN-A, and are not described herein. The embodiment of the application specifically describes the positions of a plurality of downsamplers introduced by skip-CNN between a convolution layer and a pooling layer, and the combination mode among the convolution layer, the pooling layer and the downsamplers.
The electronic device 100 may divide the network structure formed by the convolution layer and the pooling layer shown in fig. 3 into 3 levels with the convolution layer as a boundary, referring to table 1:
TABLE 1
Level 1: convolution layer C1
Level 2: pooling layer P1, convolution layer C2
Level 3: pooling layer P2, convolution layer C3
As shown in fig. 7, the electronic device 100 may be provided with a downsampler corresponding to each level, namely downsamplers 1-3. Further, the electronic device 100 may also be provided with downsamplers spanning two levels, namely downsamplers 4-5, and a downsampler spanning three levels, namely downsampler 6. It will be appreciated that when the original CNN-A includes more convolution layers and pooling layers, the corresponding skip-CNN may also include more downsamplers.
(1) The input size and output size configuration of the downsamplers.
The input size and output size of each downsampler differ according to the level to which the downsampler corresponds. The input size refers to the size of the input data (input image or input feature), and the output size refers to the size of the output data (output image or output feature). In the embodiments of the present application, the input size of a downsampler is equal to the input size of the level to which it corresponds, and its output size is equal to the output size of that level.
First, table 2 shows the input size and output size of each convolution layer and pooling layer in CNN-A provided in the embodiments of the present application:
TABLE 2
Convolution layer C1: input H1×W1, output H11×W11
Pooling layer P1: input H11×W11, output H12×W12
Convolution layer C2: input H12×W12, output H13×W13
Pooling layer P2: input H13×W13, output H14×W14
Convolution layer C3: input H14×W14, output H2×W2
Where H1×W1 is known, H11×W11 may be determined from H1×W1 and the size and step size of the convolution kernel of the convolution layer C1; H12×W12 may be determined from H11×W11 and the size and step size of the pooling kernel of the pooling layer P1; and so on, so that H13×W13, H14×W14, and H2×W2 can all be determined.
Based on the input size and output size of each convolution layer and pooling layer shown in table 2, the input size and output size of the corresponding downsamplers 1 to 6 can be referred to table 3:
TABLE 3
Downsampler    Level    Input size    Output size
Downsampler 1    1    H1×W1    H11×W11
Downsampler 2    2    H11×W11    H13×W13
Downsampler 3    3    H13×W13    H2×W2
Downsampler 4    1&2    H1×W1    H13×W13
Downsampler 5    2&3    H11×W11    H2×W2
Downsampler 6    1&2&3    H1×W1    H2×W2
Thus, based on the same output size, after each level, skip-CNN can splice the features output by that level with the features output by the corresponding downsampler.
(2) The channel configuration of the skip-CNN downsamplers.
The number of input channels of skip-CNN is determined according to the color structure of the input image, which is the same as CNN-A and will not be described here again.
Table 4 is configuration information of the number of output channels for a single channel image (e.g., IR image) in a skip-CNN provided in an embodiment of the present application.
TABLE 4
Here, [A/B] indicates that the number of input channels of an entity (a convolution layer, a pooling layer, or a downsampler) is A and its number of output channels is B; [C] indicates that both the number of input channels and the number of output channels of an entity are C.
As shown in table 4, for a single-channel input image, the number of input channels of skip-CNN is 1, and correspondingly the number of input channels of the convolution layer C1, the downsampler 1, the downsampler 4, and the downsampler 6 is 1. The number of output channels of the convolution layer C1 may be set to 32. The number of input channels of each downsampler is identical to its number of output channels. At this point, the convolution layer C1 may output 32 sets of features. After the 32 sets of features output by the convolution layer C1 and the 1 set of features output by the downsampler 1 are spliced, skip-CNN obtains 33 sets of features. Thus, the number of input channels of the pooling layer P1, the downsampler 2, and the downsampler 5 is 33. The number of output channels of the convolution layer C2 may be set to be the same as its number of input channels, i.e., 33. At this point, the convolution layer C2, the downsampler 2, and the downsampler 5 may each output 33 sets of features. After the 33 sets of features output by the convolution layer C2, the 33 sets of features output by the downsampler 2, and the 1 set of features output by the downsampler 4 are spliced, skip-CNN obtains 67 sets of features. Accordingly, the number of input channels of the pooling layer P2 and the downsampler 3 is 67. The number of output channels of the convolution layer C3 may be set to 67. At this point, the convolution layer C3 and the downsampler 3 may each output 67 sets of features. After the 67 sets of features output by the convolution layer C3, the 67 sets of features output by the downsampler 3, the 33 sets of features output by the downsampler 5, and the 1 set of features output by the downsampler 6 are spliced, skip-CNN obtains 168 sets of features.
Table 5 is configuration information of the number of output channels for a three-channel image (e.g., RGB image) in a skip-CNN according to an embodiment of the present application.
TABLE 5
For the number of output channels of each entity in the three-channel case, and the composition of the final number of output channels, reference may be made to the description of table 4; details are not repeated here.
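A hedged PyTorch sketch of the skip-CNN connection pattern and channel bookkeeping above follows. Using adaptive average pooling as the downsampler is an assumption (the text only requires that a downsampler shrink its input to the output size of its level). With a single-channel input the module outputs 168 feature sets, and with a three-channel input it outputs 184, consistent with the channel counts mentioned earlier.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SkipCNN(nn.Module):
    """Sketch of skip-CNN: C1-P1-C2-P2-C3 plus downsamplers 1-6.
    Adaptive average pooling stands in for the downsampler (an assumption)."""
    def __init__(self, in_ch: int = 1):
        super().__init__()
        self.c1 = nn.Conv2d(in_ch, 32, kernel_size=7, stride=1)               # C1: [in_ch/32]
        self.p1 = nn.MaxPool2d(2, 2)
        self.c2 = nn.Conv2d(32 + in_ch, 32 + in_ch, kernel_size=5, stride=1)  # C2: [33] for IR
        self.p2 = nn.MaxPool2d(2, 2)
        c3_ch = 2 * (32 + in_ch) + in_ch                                       # 67 for IR
        self.c3 = nn.Conv2d(c3_ch, c3_ch, kernel_size=3, stride=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        def down(t, ref):  # downsampler: shrink t to the spatial size of ref
            return F.adaptive_avg_pool2d(t, ref.shape[-2:])
        c1 = self.c1(x)
        f1 = torch.cat([c1, down(x, c1)], dim=1)                # C1 + downsampler 1
        p1 = self.p1(f1)
        c2 = self.c2(p1)
        f2 = torch.cat([c2, down(p1, c2), down(x, c2)], dim=1)  # C2 + ds 2 + ds 4
        p2 = self.p2(f2)
        c3 = self.c3(p2)
        # C3 + downsampler 3 + downsampler 5 + downsampler 6
        return torch.cat([c3, down(p2, c3), down(p1, c3), down(x, c3)], dim=1)

print(SkipCNN(in_ch=1)(torch.randn(1, 1, 64, 64)).shape)  # torch.Size([1, 168, 10, 10])
print(SkipCNN(in_ch=3)(torch.randn(1, 3, 64, 64)).shape)  # torch.Size([1, 184, 10, 10])
```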
Further, after obtaining the left-eye feature and the right-eye feature using skip-CNN, the electronic device 100 may further reconstruct the left-eye feature and the right-eye feature output by skip-CNN using a transform encoder (Transformer encoder), and finally output the reconstructed left-eye feature and right-eye feature. Then, the electronic device 100 may obtain the reconstructed eye feature based on the reconstructed left-eye feature and right-eye feature, and perform prediction using the reconstructed eye feature together with the face feature to determine the gaze point, thereby further improving the accuracy of gaze point recognition. The combination of skip-CNN and the transform encoder is also referred to as a hybrid Transformer.
Fig. 8 is a flowchart of the electronic device 100 reconstructing an eye feature using a transform encoder according to an embodiment of the present application.
S201, converting the eye feature in the image format into a sequence format.
S202, performing position encoding on the eye feature in the image format during the conversion process, and acquiring the position tensor of the eye feature.
S203, inputting the eye feature in the sequence format and the corresponding position tensor into the transform encoder to obtain the reconstructed eye feature.
A tensor is used to indicate the organization form of data. In a tensor [b, c, H, W], b (batch size) indicates the number of samples, c (channel) indicates the number of channels, and H and W indicate the height and width of the data, respectively.
First, the tensor of the eye image originally input to skip-CNN can be expressed as image[b, c, H, W]. Illustratively, the tensor of a 64×64 RGB image may be represented as image[1, 3, 64, 64]. After the skip-CNN processing, the tensor of the eye feature output by skip-CNN can be expressed as feature_map[b, c, H, W]. Features in the feature_map[b, c, H, W] format are also referred to as features in the image format.
In performing S201, the electronic device 100 may first convert feature_map[b, c, H, W] into feature_map1[b, c, H×W], and then rearrange the dimensions of feature_map1 to obtain feature_map2[H×W, b, c]. feature_map2 is also referred to as a feature in the sequence format. Meanwhile, the electronic device 100 may randomly generate a special character cls_token[1, b, c]. Then, the electronic device 100 may splice feature_map2 and cls_token to obtain feature_maps3[H×W+1, b, c].
Based on the H×W+1 dimension of feature_maps3, the electronic device 100 may perform ordered position encoding over the H×W+1 positions to obtain a position tensor pos_feature[H×W+1, c].
The electronic device 100 may then input feature_maps3 and the position tensor pos_feature, both in the sequence format, into the transform encoder. After the processing of the transform encoder, the electronic device 100 may obtain the reconstructed eye feature, whose tensor may be denoted as feature_seq[H×W+1, b, c]. The electronic device 100 may then rearrange the dimensions of the reconstructed eye feature; the tensor of the rearranged reconstructed eye feature may be denoted as feature_out[b, c, H×W+1].
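The following sketch mirrors S201-S203 for a 168-channel IR feature map. nn.TransformerEncoder configured with the encoder parameters listed later (4 layers, 2 heads, hidden dimension 512) is used as a stand-in for the transform encoder; the random initialisation of cls_token and the position tensor, and the concatenation order, are assumptions.

```python
import torch
import torch.nn as nn

b, c, H, W = 1, 168, 10, 10
feature_map = torch.randn(b, c, H, W)                        # feature_map[b, c, H, W]

feature_map1 = feature_map.flatten(2)                        # [b, c, H*W]
feature_map2 = feature_map1.permute(2, 0, 1)                 # [H*W, b, c], sequence format
cls_token = torch.randn(1, b, c)                             # special character cls_token
feature_maps3 = torch.cat([cls_token, feature_map2], dim=0)  # [H*W+1, b, c]
pos_feature = torch.randn(H * W + 1, 1, c)                   # position tensor, broadcast over b

encoder_layer = nn.TransformerEncoderLayer(
    d_model=c, nhead=2, dim_feedforward=512, norm_first=True)  # heads N=2, hidden dim 512
encoder = nn.TransformerEncoder(encoder_layer, num_layers=4)    # layers L=4

feature_seq = encoder(feature_maps3 + pos_feature)           # [H*W+1, b, c]
feature_out = feature_seq.permute(1, 2, 0)                   # [b, c, H*W+1]
print(feature_out.shape)  # torch.Size([1, 168, 101])
```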
Table 6 exemplarily shows the tensor change process of the 64×64 IR image and the 64×64 RGB image in the feature reconstruction process.
TABLE 6
Fig. 9 is a schematic diagram of a transform encoder according to an embodiment of the present application.
As shown in fig. 9, the transform encoder includes Norm, multi-head attention (multi-head attention), and MLP modules. Norm uses the LayerNorm method for data normalization. Multi-head attention comprises multiple (N) scaled dot-product self-attention (Scaled dot-product attention) heads, which enrich the feature space and increase feature diversity. The MLP is composed of two FC layers and is used to reduce the dimension expanded by multi-head attention. The transform encoder stacks L layers of this Norm, multi-head attention, and MLP structure. The transform encoder is an existing structure, and the function and data processing procedure of each of its modules are not repeated here.
In the embodiments of the present application, the main parameters of the transform encoder are shown in the table below:
TABLE 7
Layers L Heads N dim of Q,K,V dim of hidden linear layers
4 2 168(IR)/184(RGB) 512
The scaled dot-product self-attention used above follows the standard formula: Attention(Q, K, V) = softmax(QK^T/√d_k)·V, where d_k is the dimension of Q, K, and V.
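A minimal sketch of this scaled dot-product attention computation (generic, not tied to the exact head layout inside the encoder):

```python
import math
import torch

def scaled_dot_product_attention(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """softmax(Q K^T / sqrt(d_k)) V, with q, k, v of shape [..., seq_len, d_k]."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)
    return torch.softmax(scores, dim=-1) @ v
```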
fig. 10 is a schematic diagram of another eyeball gaze recognition algorithm provided by an embodiment of the present application.
As shown in fig. 10, the electronic device 100 may also acquire an image Y with a calibration before acquiring the image X. The calibration is the gaze point (x0, y0) corresponding to the image Y, also called the reference gaze point.
The electronic device 100 may process the left-eye image and the right-eye image in the image Y using any of the feature extractors described above (CNN-A, skip-CNN, or the hybrid Transformer) to obtain the left-eye feature and the right-eye feature of the image Y, and then splice them with the obtained face feature to obtain the reference feature. The reference feature comprises all of the features acquired from the image Y.
When acquiring the query feature of the image X, the electronic device 100 may acquire the left-eye feature, the right-eye feature, and the face feature of the image X following the same procedure. In particular, the electronic device 100 needs to use the same feature extractor for the image X as for the image Y when acquiring the left-eye feature and the right-eye feature. For example, when the electronic device 100 acquires the left-eye feature and the right-eye feature of the image Y using CNN-A, it also needs to use CNN-A to acquire the left-eye feature and the right-eye feature of the image X after acquiring the image X; when the electronic device 100 acquires the left-eye feature and the right-eye feature of the image Y using skip-CNN, it also needs to use skip-CNN to acquire the left-eye feature and the right-eye feature of the image X after acquiring the image X.
Then, the electronic device 100 may splice the query feature and the reference feature, and input the spliced feature into the FC layer to obtain the gaze point distance (Δx, Δy) between the target gaze point and the reference gaze point. The target gaze point is the gaze point corresponding to the image X that the eye gaze recognition algorithm needs to predict. Thus, combining the reference gaze point and the gaze point distance, the electronic device 100 may determine the target gaze point:
(x, y) = (x0, y0) + (Δx, Δy).
The FC layer shown in fig. 10 is trained on different training data from the FC layer shown in fig. 1, and its input and output in the use phase are also different. Specifically, in the method shown in fig. 1, the input of the FC layer is the query feature of the image X and the output is the target gaze point; in the method shown in fig. 10, the input of the FC layer is the query feature of the image X and the reference feature of the image Y, and the output is the gaze point distance. Combining the reference gaze point corresponding to the image Y with this gaze point distance, the electronic device 100 further determines the target gaze point.
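A hedged sketch of this calibrated prediction path: the query feature and the reference feature are spliced, an FC layer regresses the gaze point distance, and the target gaze point is the reference gaze point plus that distance. The flattened feature length, the FC layer shape, and the example calibration values are illustrative assumptions.

```python
import torch
import torch.nn as nn

feat_len = 168 * 101                      # illustrative flattened feature length per image
fc = nn.Linear(2 * feat_len, 2)           # regresses the gaze point distance (dx, dy)

query_feature = torch.randn(1, feat_len)      # from image X
reference_feature = torch.randn(1, feat_len)  # from image Y (the calibrated reference)
x0, y0 = 3.0, 5.0                             # reference gaze point (x0, y0), illustrative

dx, dy = fc(torch.cat([query_feature, reference_feature], dim=1))[0]
target_gaze_point = (x0 + dx.item(), y0 + dy.item())
print(target_gaze_point)
```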
Fig. 11A-11B are schematic diagrams of a set of cumulative distribution functions (Cumulative distribution function, CDF) provided by implementations of the present application.
Fig. 11A shows the error CDFs of the eye gaze recognition algorithms employing CNN-A and skip-CNN provided in the present application.
In fig. 11A, the abscissa represents the error between the identified gaze point and the actual gaze point, and the ordinate represents the cumulative distribution percentage. For example, a cumulative distribution percentage P corresponding to an error X means that the proportion of predictions whose error is within X is P. A higher P means a better gaze point recognition effect.
A gaze point error within 1.9 cm is acceptable. As shown in fig. 11A, at an error of 1.9 cm, the curves ranked by cumulative distribution percentage from high to low are:
a skip-CNN (RGB) curve (i.e., delta_compare_skip_connection_all_data_0215-RGB);
CNN-A (RGB) curves (i.e., delta_compare_all_data_0215-RGB);
a skip-CNN (IR) curve (i.e., delta_compare_skip_connection_all_data_0215-IR);
CNN-A (IR) curve (i.e., delta_compare_all_data_0215-IR);
The corresponding cumulative distribution percentages are 0.567, 0.547, 0.513, and 0.503, respectively.
It can be seen that, whether for IR images or RGB images, the cumulative distribution percentage of the eye gaze recognition algorithm using skip-CNN is higher than that of the algorithm using CNN-A, i.e., the gaze point recognition effect is better.
Fig. 11B shows the error CDFs of the eye gaze recognition algorithms employing skip-CNN and the transform encoder provided in the present application.
Also taking an error of 1.9 cm as an example, the curves ranked by cumulative distribution percentage at 1.9 cm from high to low are:
the skip-CNN+Transformer (RGB) curve (i.e., delta_compact_hybrid_transducer_all_data_0215-RGB);
the skip-CNN (RGB) curve (i.e., delta_compare_skip_connection_all_data_0215-RGB);
the skip-CNN+Transformer (IR) curve (i.e., delta_compare_hybrid_transducer_all_data_0215-IR);
the skip-CNN (IR) curve (i.e., delta_compare_skip_connection_all_data_0215-IR);
The corresponding cumulative distribution percentages are 0.591, 0.567, 0.544, and 0.513, respectively.
It can be seen that, whether for IR images or RGB images, the cumulative distribution percentage of the eye gaze recognition algorithm using the hybrid Transformer is higher than that of the method using only skip-CNN, i.e., the gaze point recognition effect is better.
In the above embodiment:
the image X may be referred to as a first image; the left-eye image cropped from the image X may be referred to as a first left-eye image, the right-eye image as a first right-eye image, and the face image as a first face image. The left-eye feature obtained by processing the left-eye image of the image X using CNN-A, skip-CNN, or the hybrid Transformer may be referred to as a first left-eye feature, the right-eye feature obtained by processing the right-eye image of the image X may be referred to as a first right-eye feature, and the face feature obtained by processing the face image of the image X may be referred to as a first face feature. The query feature obtained by splicing the left-eye feature, the right-eye feature, and the face feature of the image X may be referred to as a first feature;
the image Y may be referred to as a second image; the left-eye image cropped from the image Y may be referred to as a second left-eye image, the right-eye image as a second right-eye image, and the face image as a second face image. The left-eye feature obtained by processing the left-eye image of the image Y using CNN-A, skip-CNN, or the hybrid Transformer may be referred to as a second left-eye feature, the right-eye feature obtained by processing the right-eye image of the image Y may be referred to as a second right-eye feature, and the face feature obtained by processing the face image of the image Y may be referred to as a second face feature. The reference feature obtained by splicing the left-eye feature, the right-eye feature, and the face feature of the image Y may be referred to as a second feature;
wherein CNN-A, skip-CNN, and the hybrid Transformer used for the eye features may each be referred to as the first feature extractor, and CNN-B used for extracting features from the face image may be referred to as the second feature extractor;
the convolutional layer C1 shown in fig. 3 may be referred to as a first convolutional layer, the convolutional layer C2 may be referred to as a second convolutional layer, the convolutional layer C3 may be referred to as a third convolutional layer, the pooling layer P1 may be referred to as a first pooling layer, and the pooling layer P2 may be referred to as a second pooling layer; any of the above convolutional or pooling layers may be referred to as a processing layer;
the downsampler 1 shown in fig. 7 may be referred to as a first downsampler, the downsampler 2 may be referred to as a second downsampler, the downsampler 3 may be referred to as a third downsampler, the downsampler 4 may be referred to as a fourth downsampler, the downsampler 5 may be referred to as a fifth downsampler, and the downsampler 6 may be referred to as a sixth downsampler;
In the application scenario shown in fig. 5A-5B, the action of opening the camera application and displaying the camera application main interface shown in fig. 5B may be referred to as a first action; in the application scenario shown in fig. 6A-6B, the action of displaying the notification bar shown in fig. 6B may also be referred to as a first action.
Fig. 12 shows a hardware configuration diagram of the electronic device 100.
The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display 194, and a subscriber identity module (subscriber identification module, SIM) card interface 195, etc. The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It should be understood that the structure illustrated in the embodiment of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, the electronic device 100 may include more or fewer components than shown, some components may be combined, some components may be split, or the components may be arranged differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
The controller can generate operation control signals according to instruction operation codes and timing signals to complete the control of instruction fetching and instruction execution. A memory may also be provided in the processor 110 for storing instructions and data.
The charge management module 140 is configured to receive a charge input from a charger. The power management module 141 is used for connecting the battery 142, and the charge management module 140 and the processor 110.
The wireless communication function of the electronic device 100 may be implemented by the antenna 1, the antenna 2, the mobile communication module 150, the wireless communication module 160, a modem processor, a baseband processor, and the like.
The electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. The display screen 194 is used to display images, videos, and the like. In the embodiment of the present application, the electronic device 100 displays the interactive interface, such as the user interfaces shown in fig. 5A-5B and fig. 6A-6B, through the GPU, the display screen 194, and the display function provided by the application processor.
The electronic device 100 may implement photographing functions through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like. The camera 193 is used to capture still images or video. In the embodiment of the present application, the camera 193 includes, but is not limited to, a wide-angle camera, an ultra-wide-angle camera, a telephoto camera, a structured-light depth-sensing camera, an infrared camera, and the like. Based on these different types of cameras, the electronic device 100 may acquire input images in different color formats.
The NPU is a neural-network (NN) computing processor, and can rapidly process input information and continuously perform self-learning by referring to a biological neural network structure. The NPU can implement applications such as intelligent cognition of the electronic device 100. In the embodiment of the present application, the electronic device 100 may execute an eye gaze recognition algorithm through the NPU, and further recognize the eye gaze position of the user through the acquired facial image of the user.
The internal memory 121 may include one or more random access memories (random access memory, RAM) and one or more non-volatile memories (NVM).
The RAM can be read and written directly by the processor 110. It may be used to store executable programs (e.g., machine instructions) of an operating system or of other running programs, and may also be used to store data of users and applications, and the like. The NVM may also store executable programs and data of users and applications, which may be loaded into the RAM in advance for direct reading and writing by the processor 110. In the embodiment of the present application, the application code corresponding to the eye gaze recognition algorithm may be stored in the NVM. When the eye gaze recognition algorithm is run to recognize the eye gaze point, this application code may be loaded into the RAM.
The external memory interface 120 may be used to connect external non-volatile memory to enable expansion of the memory capabilities of the electronic device 100.
The electronic device 100 may implement audio functions, such as music playing and recording, through the audio module 170, the speaker 170A, the receiver 170B, the microphone 170C, the earphone interface 170D, the application processor, and the like.
The pressure sensor 180A is used to sense a pressure signal, and may convert the pressure signal into an electrical signal. The gyro sensor 180B may be used to determine the angular velocity of the electronic device 100 about three axes (i.e., the x, y, and z axes). The acceleration sensor 180E may detect the magnitude of the acceleration of the electronic device 100 in various directions (typically along three axes). The air pressure sensor 180C is used to measure air pressure. The magnetic sensor 180D includes a Hall sensor; the opening and closing of a flip cover can be detected by the magnetic sensor 180D. The distance sensor 180F is used to measure distance. The proximity light sensor 180G may include, for example, a light-emitting diode (LED) and a light detector, and may be used to detect that the user is holding the electronic device 100 close to the body. The ambient light sensor 180L is used to sense the ambient light level. The fingerprint sensor 180H is used to collect a fingerprint. The temperature sensor 180J is used to detect temperature. The bone conduction sensor 180M may acquire a vibration signal. The touch sensor 180K is also referred to as a "touch device"; it may be disposed on the display screen 194, and together the touch sensor 180K and the display screen 194 form a touch screen, also called a "touchscreen". The keys 190 include a power key, volume keys, and the like. The motor 191 may generate a vibration cue. The indicator 192 may be an indicator light. The SIM card interface 195 is used to connect a SIM card.
The term "user interface (UI)" in the description, claims, and drawings of the present application refers to a medium interface for interaction and information exchange between an application program or an operating system and a user; it implements conversion between an internal form of information and a form acceptable to the user. The user interface of an application program is source code written in a specific computer language such as Java or the extensible markup language (XML); the interface source code is parsed and rendered on the terminal device and is finally presented as content that the user can recognize, such as pictures, text, and buttons. Controls, also known as widgets, are the basic elements of a user interface; typical controls include toolbars, menu bars, text boxes, buttons, scroll bars, pictures, and text. The properties and content of the controls in an interface are defined by tags or nodes; for example, XML specifies the controls contained in an interface through nodes such as <Textview>, <ImgView>, and <VideoView>. A node corresponds to a control or an attribute in the interface, and the node is presented as user-visible content after being parsed and rendered. In addition, the interfaces of many applications, such as hybrid applications, typically include web pages. A web page, also referred to as a page, can be understood as a special control embedded in an application program interface; it is source code written in a specific computer language, such as the hypertext markup language (HTML), cascading style sheets (CSS), or JavaScript (JS), and the web page source code may be loaded and displayed as user-recognizable content by a browser or by a web page display component with browser-like functionality. The specific content contained in a web page is also defined by tags or nodes in the web page source code; for example, HTML defines the elements and attributes of a web page through <p>, <img>, <video>, and <canvas>.
A commonly used presentation form of the user interface is a graphical user interface (GUI), which refers to a user interface that is related to computer operations and displayed in a graphical manner. A GUI may be an interface element such as an icon, a window, or a control displayed on the display screen of the terminal device, where the control may include visible interface elements such as icons, buttons, menus, tabs, text boxes, dialog boxes, status bars, navigation bars, and widgets.
As used in the specification and the appended claims of this application, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in this application refers to and encompasses any and all possible combinations of one or more of the listed items. As used in the above embodiments, the term "when ..." may be interpreted to mean "if ...", "after ...", "in response to determining ...", or "in response to detecting ...", depending on the context. Similarly, the phrase "when it is determined ..." or "if (a stated condition or event) is detected" may be interpreted to mean "if it is determined ...", "in response to determining ...", "when (a stated condition or event) is detected", or "in response to detecting (a stated condition or event)", depending on the context.
The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (e.g., coaxial cable, optical fiber, digital subscriber line) or wirelessly (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk), etc.
Those of ordinary skill in the art will appreciate that all or part of the processes in the above-described method embodiments may be implemented by a computer program instructing related hardware. The program may be stored in a computer-readable storage medium, and when the program is executed, the processes of the above-described method embodiments may be performed. The aforementioned storage medium includes: a ROM, a random access memory (RAM), a magnetic disk, an optical disc, or the like.

Claims (16)

1. An eye gaze recognition method applied to an electronic device comprising a screen, the method comprising:
acquiring a first image;
acquiring a first left eye image, a first right eye image and a first face image from the first image;
acquiring a first left-eye feature from the first left-eye image, acquiring a first right-eye feature from the first right-eye image, and acquiring a first face feature from the first face image;
combining the first left eye feature, the first right eye feature and the first face feature to obtain a first feature;
and determining a target fixation point on the screen, wherein the target fixation point is obtained according to the first feature.
2. The method according to claim 1, wherein the method further comprises:
encoding the first left-eye feature and the first right-eye feature with a rotary encoder;
the combining the first left eye feature, the first right eye feature and the first face feature to obtain a first feature specifically includes:
and combining the encoded first left-eye feature, the encoded first right-eye feature and the first face feature to obtain the first feature.
3. The method according to claim 1 or 2, wherein the electronic device is preset with a first feature extractor and a second feature extractor, the acquiring a first left-eye feature from the first left-eye image, acquiring a first right-eye feature from the first right-eye image, and acquiring a first face feature from the first face image specifically includes:
acquiring the first left-eye feature from the first left-eye image and the first right-eye feature from the first right-eye image by using the first feature extractor; and acquiring the first face feature from the first face image by using the second feature extractor.
4. The method of claim 3, wherein the first feature extractor comprises a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer, a third convolution layer;
Wherein the size of the convolution kernel of the first convolution layer is 7×7, and the step size is 1; the size of the convolution kernel of the second convolution layer is 5×5, and the step length is 1; the size of the convolution kernel of the third convolution layer is 3×3, and the step length is 1; the size of the pooling core of the first pooling layer and the second pooling layer is 2×2, and the step size is 2.
5. The method of claim 3, wherein the first feature extractor comprises a plurality of processing layers and one or more downsamplers, the processing layers comprising a convolution layer and a pooling layer.
6. The method of claim 5, wherein the plurality of processing layers comprises a first processing layer, wherein the one or more downsamplers comprises a downsampler i having an input identical to the input of the first processing layer, and wherein an output of the downsampler i is used for stitching with an output of a second processing layer; the second processing layer is the same as the first processing layer, or the second processing layer is a processing layer after the first processing layer.
7. The method of claim 5, wherein the plurality of processing layers comprises a first convolution layer, a first pooling layer, a second convolution layer, a second pooling layer, and a third convolution layer.
8. The method of claim 7, wherein the plurality of downsamplers comprises a first downsampler, a second downsampler, a third downsampler;
the input of the first downsampler is the same as the input of the first convolution layer, and the output of the first downsampler is spliced with the output of the first convolution layer and input to the first pooling layer; the input of the second downsampler is the same as the input of the first pooling layer, and the output of the second downsampler is spliced with the output of the second convolution layer and input to the second pooling layer; the input of the third downsampler is the same as the input of the second convolution layer, and the output of the third downsampler is spliced with the output of the third convolution layer and then output.
9. The method of claim 7 or 8, wherein the plurality of downsamplers further comprises a fourth downsampler, a fifth downsampler;
the input of the fourth downsampler is the same as the input of the first convolution layer, and the output of the fourth downsampler is spliced with the output of the second convolution layer and input to the second pooling layer; the input of the fifth downsampler is the same as the input of the first pooling layer, and the output of the fifth downsampler is spliced with the output of the third convolution layer and then output.
10. The method of any of claims 7-9, wherein the plurality of downsamplers further comprises a sixth downsampler having an input identical to the input of the first convolution layer, wherein the output of the sixth downsampler is spliced with the output of the third convolution layer and then output.
11. The method according to any one of claims 3 to 10, wherein,
the number of input channels of the first feature extractor is 1, and the number of output channels is 168;
alternatively, the first feature extractor has a number of input channels of 3 and a number of output channels of 184.
12. The method of any one of claims 1-11, wherein prior to the acquiring the first image, the method further comprises:
acquiring a second image;
acquiring a second left eye image, a second right eye image and a second face image from the second image;
acquiring a second left eye feature from the second left eye image, acquiring a second right eye feature from the second right eye image, and acquiring a second face feature from the second face image;
combining the second left eye feature, the second right eye feature and the second face feature to obtain a second feature;
Wherein the target gaze point is further derived from the second feature.
13. The method according to any one of claims 1-12, further comprising:
determining a hot zone corresponding to the target fixation point;
and executing a first action corresponding to the hot zone.
14. An electronic device comprising one or more processors and one or more memories; wherein the one or more memories are coupled to the one or more processors, the one or more memories for storing computer program code comprising computer instructions that, when executed by the one or more processors, cause the method of any of claims 1-13 to be performed.
15. A chip system for application to an electronic device, the chip system comprising one or more processors configured to invoke computer instructions to cause performance of the method of any of claims 1-13.
16. A computer readable storage medium comprising instructions which, when run on an electronic device, cause the method of any one of claims 1-13 to be performed.
CN202310793444.4A 2023-06-29 2023-06-29 Eyeball fixation recognition method and electronic equipment Pending CN117690180A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310793444.4A CN117690180A (en) 2023-06-29 2023-06-29 Eyeball fixation recognition method and electronic equipment

Publications (1)

Publication Number Publication Date
CN117690180A true CN117690180A (en) 2024-03-12

Family

ID=90127252

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310793444.4A Pending CN117690180A (en) 2023-06-29 2023-06-29 Eyeball fixation recognition method and electronic equipment

Country Status (1)

Country Link
CN (1) CN117690180A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020199593A1 (en) * 2019-04-04 2020-10-08 平安科技(深圳)有限公司 Image segmentation model training method and apparatus, image segmentation method and apparatus, and device and medium
CN112183200A (en) * 2020-08-25 2021-01-05 中电海康集团有限公司 Eye movement tracking method and system based on video image
CN114743277A (en) * 2022-04-22 2022-07-12 南京亚信软件有限公司 Living body detection method, living body detection device, electronic apparatus, storage medium, and program product
CN115424318A (en) * 2022-08-09 2022-12-02 华为技术有限公司 Image identification method and device
CN115209057A (en) * 2022-08-19 2022-10-18 荣耀终端有限公司 Shooting focusing method and related electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
RATHEESH KALAROT ET AL: "Component Attention Guided Face Super-Resolution Network: CAGFace", IEEE Xplore, 31 December 2020 (2020-12-31), pages 370-380 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination