CN114973391A - Eyeball tracking method, device and equipment applied to the metaverse - Google Patents

Eyeball tracking method, device and equipment applied to the metaverse

Info

Publication number
CN114973391A
CN114973391A (application number CN202210759801.0A)
Authority
CN
China
Prior art keywords
feature map
frame
eyeball
eye image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202210759801.0A
Other languages
Chinese (zh)
Other versions
CN114973391B (en)
Inventor
郝炯辉
李茂林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Superred Technology Co Ltd
Original Assignee
Beijing Superred Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Superred Technology Co Ltd filed Critical Beijing Superred Technology Co Ltd
Priority to CN202210759801.0A
Publication of CN114973391A
Application granted
Publication of CN114973391B
Legal status: Active (Current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 Eye characteristics, e.g. of the iris
    • G06V40/193 Preprocessing; Feature extraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features
    • G06V10/46 Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462 Salient features, e.g. scale invariant feature transforms [SIFT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10 Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/18 Eye characteristics, e.g. of the iris
    • G06V40/19 Sensors therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Computing Systems (AREA)
  • Ophthalmology & Optometry (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Image Analysis (AREA)

Abstract

Embodiments of the present application provide methods, apparatuses, devices and computer-readable storage media for eye tracking applied to the metaverse. The method comprises: acquiring an eye image of a user through a metaverse capture device; extracting features from the eye image frame by frame to obtain a first feature map; processing the first feature map based on a self-attention mechanism to obtain a second feature map; performing keypoint detection on the second feature map to obtain an eyeball position; and drawing the eyeball's keypoint trajectory based on the eyeball position, predicting the eyeball position of the next frame, and completing eye tracking. In this way, the eye is tracked quickly and accurately.

Description

Eyeball tracking method, device and equipment applied to the metaverse
Technical Field
Embodiments of the present application relate to the field of iris image processing, and in particular to an eyeball tracking method, apparatus, device, and computer-readable storage medium applied to the metaverse.
Background
With the continuous development of virtual reality and the emergence of the metaverse concept, many researchers have taken virtual reality technology as a hot research topic. Eye tracking is a field in which the metaverse concept can achieve an important breakthrough: applied in virtual reality scenes, and without any loss of accuracy, it can improve the processing speed of images in the scene and the user's sense of immersion, while reducing performance consumption and the user's dizziness and fatigue.
The Transformer architecture has achieved excellent performance in natural language processing tasks, and vision tasks have since explored its capability in image processing. The Transformer structure can effectively capture the degree of correlation among multiple related vectors, and the eye-tracking task is highly correlated across frames in the time dimension.
Therefore, how to better use the Transformer structure for eye tracking, so as to obtain the eyeball's correlation in the time dimension, draw its motion trajectory, and predict the eyeball position, is an urgent problem to be solved.
Disclosure of Invention
According to embodiments of the present application, an eye tracking scheme applied to the metaverse is provided.
In a first aspect of the present application, an eye tracking method applied to the metaverse is provided. The method comprises the following steps:
acquiring an eye image of a user through a metaverse capture device;
extracting features from the eye image frame by frame to obtain a first feature map;
processing the first feature map based on a self-attention mechanism to obtain a second feature map;
performing keypoint detection on the second feature map to obtain an eyeball position; and drawing the eyeball's keypoint trajectory based on the eyeball position, predicting the eyeball position of the next frame, and completing eye tracking.
Further, extracting features from the eye image frame by frame to obtain a first feature map includes:
extracting features from the eye image frame by frame through a MobileNetV2 network to obtain an eye image feature map;
detecting the eye image feature map; if the eye is closed in the current frame, performing no tracking; if the eye is open, constructing the first feature map from the eye image feature map;
wherein the first convolution of the MobileNetV2 network is an ECB convolution block.
Further, processing the first feature map based on the self-attention mechanism to obtain a second feature map includes:
if the first feature map is the feature map of a single-frame image, performing self-attention calculation on it alone to obtain the second feature map;
and if the first feature map is the feature map of a multi-frame image, performing self-attention calculation within the current frame and between the current frame and the other frames, and fusing the calculation results to obtain the second feature map.
Further, if the first feature map is the feature map of a multi-frame image, performing self-attention calculation within the current frame and between the current frame and the other frames and fusing the calculation results to obtain the second feature map includes:
flattening the feature maps of the current frame and of at least one frame before the current frame into vectors of a preset size, obtaining a vector for each feature map and an embedding matrix for each frame, wherein the image of each frame comprises n feature maps and n is a positive integer;
and position-coding the vectors in the embedding matrix according to the time sequence, and inputting the position-coded embedding matrix into a preset Transformer encoder to obtain the second feature map.
Further, inputting the position-coded embedding matrix into the preset Transformer encoder to obtain the second feature map includes:
the Transformer encoder consists of projection matrices, multi-head self-attention, a residual block, normalization, and convolution;
inputting the position-coded embedding matrix into the preset Transformer encoder and computing the Q, K, V matrices through the projection matrices;
inputting the Q, K, V matrices of the current frame and the K, V matrices of the other frames into the multi-head self-attention to obtain a first output result;
adding the first output result and the embedding matrix of the current frame in the residual block;
normalizing the sum to obtain a second output result;
and inputting the second output result into a convolution block to obtain the second feature map.
Further, the multi-head self-attention includes:
z = MSA(q_t, (k_t, k_{t-1}), (v_t, v_{t-1}))
wherein z is the multi-head self-attention calculation result; q_t, k_t and v_t respectively denote the q, k, v matrices of the t-th frame; (v_t, v_{t-1}) denotes splicing the two matrices v_t and v_{t-1}; and (k_t, k_{t-1}) denotes splicing the two matrices k_t and k_{t-1}.
Further, performing keypoint detection on the second feature map to obtain the eyeball position includes:
applying one, two and three convolutions to the second feature map, respectively, to obtain outputs at three scales;
and splicing the outputs of the three scales and applying full connection to obtain the eyeball position.
In a second aspect of the present application, an eye tracking apparatus applied to the metaverse is provided. The apparatus comprises:
an acquisition module, used for acquiring an eye image of a user through a metaverse capture device;
an extraction module, used for extracting features from the eye image frame by frame to obtain a first feature map;
a processing module, used for processing the first feature map based on a self-attention mechanism to obtain a second feature map;
a tracking module, used for performing keypoint detection on the second feature map to obtain the eyeball position, drawing the eyeball's keypoint trajectory based on the eyeball position, predicting the eyeball position of the next frame, and completing eye tracking.
In a third aspect of the present application, an electronic device is provided. The electronic device includes: a memory having a computer program stored thereon and a processor implementing the method as described above when executing the program.
In a fourth aspect of the present application, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the method according to the first aspect of the present application.
According to the eye tracking method applied to the metaverse provided herein, an eye image of the user is acquired through the metaverse capture device; features are extracted from the eye image frame by frame to obtain a first feature map; the first feature map is processed based on a self-attention mechanism to obtain a second feature map; keypoint detection is performed on the second feature map to obtain the eyeball position; and, based on the eyeball position, the eyeball's keypoint trajectory is drawn and the eyeball position of the next frame is predicted, completing eye tracking while improving its efficiency and accuracy.
It should be understood that what is described in this summary section is not intended to limit key or critical features of the embodiments of the application, nor is it intended to limit the scope of the application. Other features of the present application will become apparent from the following description.
Drawings
The above and other features, advantages and aspects of various embodiments of the present application will become more apparent by referring to the following detailed description when taken in conjunction with the accompanying drawings. In the drawings, like or similar reference characters designate like or similar elements, and wherein:
FIG. 1 shows a flowchart of an eye tracking method applied to the metaverse according to an embodiment of the application;
FIG. 2 shows a schematic diagram of the structure of an ECB convolution block according to an embodiment of the present application;
FIG. 3 shows a schematic structural diagram of a backbone network according to an embodiment of the present application;
FIG. 4 shows a feature extraction flow diagram for an eye image according to an embodiment of the application;
FIGS. 5a to 5d show overall network structure schematics according to embodiments of the present application;
FIG. 6 shows a schematic structural diagram of a Transformer encoder according to an embodiment of the present application;
FIG. 7 shows a block diagram of an eye tracking apparatus applied to the metaverse according to an embodiment of the present application;
FIG. 8 shows a schematic structural diagram of a terminal device or a server suitable for implementing the embodiments of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present disclosure clearer, the technical solutions of the embodiments will be described clearly and completely with reference to the drawings; it is apparent that the described embodiments are some, but not all, embodiments of the present disclosure. All other embodiments derived by a person skilled in the art from the embodiments disclosed herein without creative effort shall fall within the protection scope of the present disclosure.
In addition, the term "and/or" herein merely describes an association relationship between associated objects, indicating that three relationships may exist; for example, "A and/or B" may mean: A alone, both A and B, or B alone. The character "/" herein generally indicates an "or" relationship between the preceding and following objects.
Fig. 1 shows a flowchart of an eye tracking method applied to the metaverse according to an embodiment of the present disclosure. The method comprises the following steps:
S110, acquiring an eye image of the user through a metaverse capture device.
In some embodiments, images of the user's eyes are captured by a head-mounted camera, a VR device camera, and/or another metaverse capture device.
S120, extracting features from the eye image frame by frame to obtain a first feature map.
In some embodiments, the eye image is input into a feature extraction network, and features in the eye image are extracted frame by frame to obtain the first feature map.
The feature extraction network adopts MobileNetV2 as its backbone. The first convolution of the MobileNetV2 network is an ECB convolution block (edge-oriented convolution block), and the last bottleneck and all subsequent convolution, pooling and fully-connected operations of the conventional MobileNetV2 network are discarded.
Further, as shown in fig. 2, the ECB convolution block is used to guide the network to learn the edge information of the image; its output is the sum of a 3 × 3 convolution, a dilated convolution, a Sobel edge extraction and a Laplacian operator.
Specifically, as shown in fig. 3, the eye image is input into the MobileNetV2 network: the input passes through the ECB convolution block, is resized and raised in dimension by a 1 × 1 convolution, is fused with the MobileNetV2 input, and finally passes through a 3 × 3 convolution to give the output of the network block; the features then pass through several bottleneck layers, whose dimension changes reduce the number of parameters and hence the amount of computation, to obtain the eye image feature map. A sketch of the ECB block follows.
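By way of illustration, below is a minimal PyTorch sketch of such an edge-oriented convolution block; the channel sizes, the dilation rate, the single Sobel direction and the per-branch 1 × 1 projections are assumptions for the sketch rather than details fixed by this description:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ECB(nn.Module):
    """Edge-oriented convolution block (cf. fig. 2): the output is the sum of a
    plain 3x3 convolution, a dilated convolution, a Sobel edge branch and a
    Laplacian branch."""
    def __init__(self, in_ch: int, out_ch: int):
        super().__init__()
        self.in_ch = in_ch
        self.conv3x3 = nn.Conv2d(in_ch, out_ch, 3, padding=1)
        self.dilated = nn.Conv2d(in_ch, out_ch, 3, padding=2, dilation=2)
        # Fixed edge filters, applied depthwise and then projected to out_ch.
        sobel = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
        laplace = torch.tensor([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])
        self.register_buffer("sobel", sobel.repeat(in_ch, 1, 1, 1))
        self.register_buffer("laplace", laplace.repeat(in_ch, 1, 1, 1))
        self.proj_sobel = nn.Conv2d(in_ch, out_ch, 1)
        self.proj_laplace = nn.Conv2d(in_ch, out_ch, 1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        edge_s = F.conv2d(x, self.sobel, padding=1, groups=self.in_ch)
        edge_l = F.conv2d(x, self.laplace, padding=1, groups=self.in_ch)
        # The four branches are added to form the block output.
        return (self.conv3x3(x) + self.dilated(x)
                + self.proj_sobel(edge_s) + self.proj_laplace(edge_l))
```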
Further, as shown in fig. 4, the eye image feature map is detected. If the eye is closed in the current frame, that is, there is no eyeball, no tracking is performed; the image is re-shot and the next frame is processed. If the eye is not closed, it is checked whether several consecutive frames exist: if so, the feature map is stored for subsequent frames, and the features of the current frame and the other (consecutive) frames are fused to obtain the first feature map; if not, the current frame's feature map is taken as the first feature map. A sketch of this gating follows.
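The gating just described can be sketched as follows; `is_eye_open` stands in for the closed-eye detection on the eye image feature map, and the buffer length of three consecutive frames is an assumption:

```python
from collections import deque

feature_buffer = deque(maxlen=3)   # feature maps of recent consecutive frames

def build_first_feature_map(frame_features, is_eye_open):
    """Returns the first feature map for the current frame, or None when the
    eye is closed and tracking is skipped (cf. the flow of fig. 4)."""
    if not is_eye_open:
        feature_buffer.clear()     # the run of consecutive frames is broken
        return None                # re-shoot the image and handle the next frame
    feature_buffer.append(frame_features)
    if len(feature_buffer) > 1:
        # Current frame plus stored frames form a multi-frame first feature map.
        return list(feature_buffer)
    return [frame_features]        # single-frame first feature map
```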
S130, processing the first feature map based on a self-attention mechanism to obtain a second feature map.
In some embodiments, if the first feature map is the feature map of a single-frame image, self-attention calculation is performed on it alone to obtain the second feature map;
and if the first feature map is the feature map of a multi-frame image, self-attention calculation is performed within the current frame and between the current frame and the other frames; after several rounds of inter-frame attention calculation, the inter-frame information is fused and the relatively important information (set according to the application scene) is extracted to obtain the second feature map.
Specifically, as shown in figs. 5a to 5d, the feature maps of the current frame (t) and of at least one frame before it (t-1, t-2) are each flattened into vectors of a preset size, giving a vector for each feature map.
Preferably, the preset size is 1 × d. From the 1 × d vector of each feature map, an n × d embedding matrix is obtained for each frame; the image of each frame comprises n feature maps, n being a positive integer.
Further, the vectors in the embedding matrix are position-coded according to the time sequence, giving an n × (d + e) embedding matrix, which is input into a preset Transformer encoder to obtain the second feature map. The structure of the Transformer encoder, as shown in fig. 6, consists of projection matrices, multi-head self-attention, a residual block, normalization, and convolution.
That is, the Q, K, V matrices are computed through the projection matrices, and the Q, K, V matrices of the current frame together with the K, V matrices of the other (consecutive) frames are input into the multi-head self-attention, giving a first output result. The first output result and the embedding matrix of the current frame are added in the residual block, the sum is normalized to obtain a second output result, and the second output result is input into a convolution block to obtain the output of the Transformer encoder.
Further, the output of the Transformer encoder is re-encoded (converted into feature-map form) according to the eyeball position to obtain the second feature map. A sketch of the embedding construction follows.
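As a sketch of the embedding construction just described (the constant per-frame position code, and concatenating the code onto the vectors rather than adding it, are assumptions; any time-ordered code, e.g. sinusoidal, would fit the description):

```python
import torch

def embed_frame(frame_maps: torch.Tensor, t: int, e: int) -> torch.Tensor:
    """Flattens the n feature maps (h x w) of frame t into an n x d embedding
    matrix (d = h * w) and appends an e-dimensional time-order position code,
    giving the n x (d + e) matrix fed to the Transformer encoder."""
    n, h, w = frame_maps.shape
    vectors = frame_maps.reshape(n, h * w)     # each feature map -> 1 x d vector
    pos = torch.full((n, e), float(t))         # position code from the time sequence
    return torch.cat([vectors, pos], dim=1)    # n x (d + e) embedding matrix
```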
In some embodiments, the multi-head self-attention comprises:
z = MSA(q_t, (k_t, k_{t-1}), (v_t, v_{t-1}))
wherein z is the multi-head self-attention calculation result; q_t, k_t and v_t respectively denote the q, k, v matrices of the t-th frame; (v_t, v_{t-1}) denotes splicing the two matrices v_t and v_{t-1}; and (k_t, k_{t-1}) denotes splicing the two matrices k_t and k_{t-1}.
Furthermore, when the multi-head self-attention is calculated, the number of spliced matrices is not limited: the first frame of the video has only one matrix and is not spliced, while other frames may be spliced with one or more matrices.
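A minimal sketch of the encoder step follows, under the assumption that PyTorch's built-in multi-head attention stands in for the multi-head self-attention above; the head count and the 1 × 1 convolution standing in for the convolution block are illustrative:

```python
import torch
import torch.nn as nn

class CrossFrameEncoder(nn.Module):
    """Sketch of the Transformer encoder of fig. 6: projection matrices produce
    Q, K, V; the current frame's Q attends to K/V spliced from the current and
    previous frames; a residual add, normalization and a convolution follow."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.k_proj = nn.Linear(dim, dim)
        self.v_proj = nn.Linear(dim, dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=1)

    def forward(self, cur: torch.Tensor, prev: list) -> torch.Tensor:
        # cur: (n, dim) embedding of the current frame; prev: list of (n, dim).
        q = self.q_proj(cur).unsqueeze(0)
        kv_src = torch.cat([cur] + prev, dim=0)   # splice frames along the rows
        k = self.k_proj(kv_src).unsqueeze(0)
        v = self.v_proj(kv_src).unsqueeze(0)
        out, _ = self.attn(q, k, v)               # first output result
        out = self.norm(out.squeeze(0) + cur)     # residual add + normalization
        # Convolution block applied across the embedding dimension.
        return self.conv(out.t().unsqueeze(0)).squeeze(0).t()
```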
S140, performing keypoint detection on the second feature map to obtain an eyeball position; drawing the eyeball's keypoint trajectory based on the eyeball position, predicting the eyeball position of the next frame, and completing eye tracking.
In some embodiments, keypoint detection is performed on the second feature map to obtain the eyeball position: one, two and three convolutions are applied to the second feature map, respectively, to obtain outputs at three scales, and the outputs of the three scales are spliced and passed through a fully connected layer to obtain the eyeball position. A sketch of this keypoint head follows.
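A sketch of such a keypoint head, assuming each branch keeps the spatial size and the head regresses a two-dimensional (x, y) position; both are assumptions of the sketch:

```python
import torch
import torch.nn as nn

class KeypointHead(nn.Module):
    """The second feature map passes through one, two and three convolutions to
    give three outputs, which are spliced and fully connected to regress the
    eyeball position."""
    def __init__(self, ch: int, flat_dim: int):
        super().__init__()
        def conv():
            return nn.Conv2d(ch, ch, 3, padding=1)
        self.branch1 = conv()
        self.branch2 = nn.Sequential(conv(), conv())
        self.branch3 = nn.Sequential(conv(), conv(), conv())
        self.fc = nn.Linear(3 * flat_dim, 2)   # spliced outputs -> (x, y)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # flat_dim must equal ch * h * w of the input feature map x.
        outs = [b(x).flatten(1) for b in (self.branch1, self.branch2, self.branch3)]
        return self.fc(torch.cat(outs, dim=1))
```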
Further, the eyeball position is associated with the previous eyeball positions, the eyeball's keypoint trajectory is drawn, the eyeball position of the next frame is predicted, and eye tracking is completed. A sketch of the next-frame prediction follows.
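The description does not fix the predictor, so the sketch below uses simple constant-velocity extrapolation from the last two trajectory points as an illustrative assumption:

```python
def predict_next_position(trajectory):
    """Given the keypoint trajectory (a list of (x, y) eyeball positions in
    time order), predicts the eyeball position of the next frame."""
    if len(trajectory) < 2:
        return trajectory[-1]                # not enough history: hold position
    (x1, y1), (x2, y2) = trajectory[-2], trajectory[-1]
    return (2 * x2 - x1, 2 * y2 - y1)        # constant-velocity extrapolation
```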
According to the embodiment of the disclosure, the following technical effects are achieved:
Through the optimized MobileNetV2 network and the Transformer encoder, the eyeball's association in the time dimension is captured in the metaverse application scene, and the efficiency and accuracy of eye tracking are improved.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present application is not limited by the order of acts described, as some steps may occur in other orders or concurrently depending on the application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are exemplary embodiments and that the acts and modules referred to are not necessarily required in this application.
The above is a description of method embodiments, and the embodiments of the present application are further described below by way of apparatus embodiments.
Fig. 7 shows a block diagram of an eye tracking apparatus 700 applied to the metaverse according to an embodiment of the application. As shown in fig. 7, the apparatus 700 comprises:
an obtaining module 710, configured to obtain an eye image of a user through a metaverse capture device;
an extracting module 720, configured to extract features from the eye image frame by frame to obtain a first feature map;
a processing module 730, configured to process the first feature map based on a self-attention mechanism to obtain a second feature map;
a tracking module 740, configured to perform keypoint detection on the second feature map to obtain an eyeball position, draw the eyeball's keypoint trajectory based on the eyeball position, predict the eyeball position of the next frame, and complete eye tracking.
It can be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the described module may refer to the corresponding process in the foregoing method embodiment, and is not described herein again.
Fig. 8 shows a schematic structural diagram of a terminal device or a server suitable for implementing the embodiments of the present application.
As shown in fig. 8, the terminal device or server 800 includes a central processing unit (CPU) 801 that can perform various appropriate actions and processes in accordance with a program stored in a read-only memory (ROM) 802 or a program loaded from a storage section 808 into a random access memory (RAM) 803. The RAM 803 also stores various programs and data necessary for the operation of the system 800. The CPU 801, ROM 802, and RAM 803 are connected to each other via a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
The following components are connected to the I/O interface 805: an input section 806 including a keyboard, a mouse, and the like; an output section 807 including a display such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker; a storage section 808 including a hard disk and the like; and a communication section 809 including a network interface card such as a LAN card or a modem. The communication section 809 performs communication processing via a network such as the Internet. A drive 810 is also connected to the I/O interface 805 as necessary. A removable medium 811, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 810 as necessary, so that a computer program read therefrom is installed into the storage section 808 as needed.
In particular, the above method flow steps may be implemented as a computer software program according to embodiments of the present application. For example, embodiments of the present application include a computer program product comprising a computer program embodied on a machine-readable medium, the computer program comprising program code for performing the method illustrated in the flow chart. In such an embodiment, the computer program can be downloaded and installed from a network through the communication section 809 and/or installed from the removable medium 811. The computer program executes the above-described functions defined in the system of the present application when executed by the Central Processing Unit (CPU) 801.
It should be noted that the computer readable medium shown in the present application may be a computer readable signal medium or a computer readable storage medium or any combination of the two. A computer readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples of the computer readable storage medium may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In the present application, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. In this application, however, a computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device. Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to: wireless, wire, fiber optic cable, RF, etc., or any suitable combination of the foregoing.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
The units or modules described in the embodiments of the present application may be implemented by software or hardware. The described units or modules may also be provided in a processor. Wherein the designation of a unit or module does not in some way constitute a limitation of the unit or module itself.
As another aspect, the present application also provides a computer-readable storage medium, which may be included in the electronic device described in the above embodiments; or may be separate and not incorporated into the electronic device. The computer readable storage medium stores one or more programs that, when executed by one or more processors, perform the methods described herein.
The foregoing description is merely illustrative of the preferred embodiments of the application and of the principles of the technology employed. It will be appreciated by those skilled in the art that the scope of the application is not limited to embodiments formed by the particular combination of the above features, but also encompasses other embodiments formed by any combination of the above features or their equivalents without departing from the spirit of the application; for example, the above features may be replaced with (but not limited to) features having similar functions disclosed in this application.

Claims (10)

1. An eye tracking method applied to the metaverse, comprising:
acquiring an eye image of a user through a metaverse capture device;
extracting features from the eye image frame by frame to obtain a first feature map;
processing the first feature map based on a self-attention mechanism to obtain a second feature map;
performing keypoint detection on the second feature map to obtain an eyeball position; and drawing a keypoint trajectory of the eyeball based on the eyeball position, predicting the eyeball position of the next frame, and completing eye tracking.
2. The method of claim 1, wherein extracting features from the eye image frame by frame to obtain a first feature map comprises:
extracting features from the eye image frame by frame through a MobileNetV2 network to obtain an eye image feature map;
detecting the eye image feature map; if the eye is closed in the current frame, performing no tracking; if the eye is open, constructing the first feature map from the eye image feature map;
wherein the first convolution of the MobileNetV2 network is an ECB convolution block.
3. The method of claim 2, wherein processing the first feature map based on the self-attention mechanism to obtain a second feature map comprises:
if the first feature map is the feature map of a single-frame image, performing self-attention calculation on it alone to obtain the second feature map;
and if the first feature map is the feature map of a multi-frame image, performing self-attention calculation within the current frame and between the current frame and the other frames, and fusing the calculation results to obtain the second feature map.
4. The method according to claim 3, wherein, if the first feature map is the feature map of a multi-frame image, performing self-attention calculation within the current frame and between the current frame and the other frames and fusing the calculation results to obtain the second feature map comprises:
flattening the feature maps of the current frame and of at least one frame before the current frame into vectors of a preset size, obtaining a vector for each feature map and an embedding matrix for each frame, wherein the image of each frame comprises n feature maps and n is a positive integer;
and position-coding the vectors in the embedding matrix according to the time sequence, and inputting the position-coded embedding matrix into a preset Transformer encoder to obtain the second feature map.
5. The method of claim 4, wherein inputting the position-coded embedding matrix into the preset Transformer encoder to obtain the second feature map comprises:
the Transformer encoder consists of projection matrices, multi-head self-attention, a residual block, normalization, and convolution;
inputting the position-coded embedding matrix into the preset Transformer encoder and computing the Q, K, V matrices through the projection matrices;
inputting the Q, K, V matrices of the current frame and the K, V matrices of the other frames into the multi-head self-attention to obtain a first output result;
adding the first output result and the embedding matrix of the current frame in the residual block;
normalizing the sum to obtain a second output result;
and inputting the second output result into a convolution block to obtain the second feature map.
6. The method of claim 5, wherein the multi-headed self-attention comprises:
z = MSA(q_t, (k_t, k_{t-1}), (v_t, v_{t-1}))
wherein z is the multi-head self-attention calculation result; q_t, k_t and v_t respectively denote the q, k, v matrices of the t-th frame; (v_t, v_{t-1}) denotes splicing the two matrices v_t and v_{t-1}; and (k_t, k_{t-1}) denotes splicing the two matrices k_t and k_{t-1}.
7. The method according to claim 6, wherein performing keypoint detection on the second feature map to obtain the eyeball position comprises:
applying one, two and three convolutions to the second feature map, respectively, to obtain outputs at three scales;
and splicing the outputs of the three scales and applying full connection to obtain the eyeball position.
8. An eye tracking apparatus applied to the metaverse, comprising:
an acquisition module, used for acquiring an eye image of a user through a metaverse capture device;
an extraction module, used for extracting features from the eye image frame by frame to obtain a first feature map;
a processing module, used for processing the first feature map based on a self-attention mechanism to obtain a second feature map;
a tracking module, used for performing keypoint detection on the second feature map to obtain the eyeball position, drawing the eyeball's keypoint trajectory based on the eyeball position, predicting the eyeball position of the next frame, and completing eye tracking.
9. An electronic device comprising a memory and a processor, the memory having stored thereon a computer program, wherein the processor, when executing the computer program, implements the method of any of claims 1-7.
10. A computer-readable storage medium, on which a computer program is stored which, when executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202210759801.0A 2022-06-30 2022-06-30 Eyeball tracking method, device and equipment applied to the metaverse Active CN114973391B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210759801.0A CN114973391B (en) 2022-06-30 2022-06-30 Eyeball tracking method, device and equipment applied to the metaverse

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210759801.0A CN114973391B (en) 2022-06-30 2022-06-30 Eyeball tracking method, device and equipment applied to the metaverse

Publications (2)

Publication Number Publication Date
CN114973391A true CN114973391A (en) 2022-08-30
CN114973391B CN114973391B (en) 2023-03-21

Family

ID=82966680

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210759801.0A Active CN114973391B (en) 2022-06-30 Eyeball tracking method, device and equipment applied to the metaverse

Country Status (1)

Country Link
CN (1) CN114973391B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115984952A (en) * 2023-03-20 2023-04-18 杭州叶蓁科技有限公司 Eye movement tracking system and method based on bulbar conjunctiva blood vessel image recognition

Patent Citations (21)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105917277A (en) * 2014-01-07 2016-08-31 视瑞尔技术公司 Display device for holographic reconstruction
CN106464673A (en) * 2014-05-02 2017-02-22 诺克诺克实验公司 Enhanced security for registration of authentication devices
CN110301143A (en) * 2016-12-30 2019-10-01 英特尔公司 Method and apparatus for radio communication
CN108542404A (en) * 2018-03-16 2018-09-18 成都虚实梦境科技有限责任公司 Attention assessment method, apparatus, VR device and readable storage medium
US20210158023A1 (en) * 2018-05-04 2021-05-27 Northeastern University System and Method for Generating Image Landmarks
CN110502100A (en) * 2019-05-29 2019-11-26 中国人民解放军军事科学院军事医学研究院 Virtual reality exchange method and device based on eye-tracking
CN110378264A (en) * 2019-07-08 2019-10-25 Oppo广东移动通信有限公司 Method for tracking target and device
US20210012525A1 (en) * 2019-07-09 2021-01-14 David Kind, Inc. System and method for eyewear sizing
CN112748797A (en) * 2019-10-31 2021-05-04 Oppo广东移动通信有限公司 Eyeball tracking method and related equipment
KR20210129503A (en) * 2020-04-20 2021-10-28 연세대학교 산학협력단 Object tracking apparatus and method using self-attention
WO2022003013A1 (en) * 2020-07-03 2022-01-06 Inivation Ag Eye tracking device, eye tracking method, and computer-readable medium
CN112184852A (en) * 2020-09-10 2021-01-05 珠海格力电器股份有限公司 Auxiliary drawing method and device based on virtual imaging, storage medium and electronic device
CN114565953A (en) * 2020-11-27 2022-05-31 北京三星通信技术研究有限公司 Image processing method, image processing device, electronic equipment and computer readable storage medium
CN113255719A (en) * 2021-04-01 2021-08-13 北京迈格威科技有限公司 Target detection method, target detection device, electronic equipment and computer-readable storage medium
CN113361540A (en) * 2021-05-25 2021-09-07 商汤集团有限公司 Image processing method and device, electronic equipment and storage medium
CN113359448A (en) * 2021-06-03 2021-09-07 清华大学 Autonomous underwater vehicle track tracking control method aiming at time-varying dynamics
CN113946211A (en) * 2021-10-14 2022-01-18 网易有道信息技术(江苏)有限公司 Method for interacting multiple objects based on the metaverse and related equipment
CN114373217A (en) * 2022-01-20 2022-04-19 天津大学 High-robustness pupil positioning method
CN114494347A (en) * 2022-01-21 2022-05-13 北京科技大学 Single-camera multi-mode sight tracking method and device and electronic equipment
CN114548692A (en) * 2022-01-25 2022-05-27 浙江大学 Regional energy system multi-future scheduling optimization method and system based on the metaverse
CN114248893A (en) * 2022-02-28 2022-03-29 中国农业大学 Operation type underwater robot for sea cucumber fishing and control method thereof

Non-Patent Citations (7)

* Cited by examiner, † Cited by third party
Title
JIEWEI YU 等: ""Error Analysis and Calibration Improvement of the Imaging Section in a Mueller Matrix Microscope"", 《APPLIED SCIENCES》 *
NAH, S. 等: ""NTIRE 2020 Challenge on Image and Video Deblurring"", 《ARXIV:2005.01244V1 [CS.CV]》 *
SHANG LI 等: ""Conference Issue: Intelligent Media Computing Technology and Applications for Mobile Internet"", 《HINDAWI WIRELESS COMMUNICATIONS AND MOBILE COMPUTING》 *
SRIVASTAVA, HARSHVARDHAN: ""Poirot at CMCL 2022 Shared Task: Zero Shot Crosslingual Eye-Tracking Data Prediction using Multilingual Transformer Models"", 《ARXIV ABS/2203.16474》 *
ZHANG Tangkun: "Research on Lightweight Environment Perception and Understanding Algorithms for Smart Devices", China Master's Theses Full-text Database (Information Science and Technology) *
LI Haifeng et al.: "Metaverse + Education: A New Form of Future Education Development Integrating the Virtual and the Real", Modern Distance Education *
GOU Chao et al.: "Advances and Prospects of Eye Tracking Research", Acta Automatica Sinica (online first) *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115984952A (en) * 2023-03-20 2023-04-18 杭州叶蓁科技有限公司 Eye movement tracking system and method based on bulbar conjunctiva blood vessel image recognition
CN115984952B (en) * 2023-03-20 2023-11-24 杭州叶蓁科技有限公司 Eye movement tracking system and method based on bulbar conjunctiva blood vessel image recognition

Also Published As

Publication number Publication date
CN114973391B (en) 2023-03-21

Similar Documents

Publication Publication Date Title
WO2020199931A1 (en) Face key point detection method and apparatus, and storage medium and electronic device
US10410315B2 (en) Method and apparatus for generating image information
CN111368685B (en) Method and device for identifying key points, readable medium and electronic equipment
CN106664467B Method, system, medium and device for video data stream capture and summarization
CN111054080B (en) Method, device and equipment for intelligently detecting perspective plug-in and storage medium thereof
US20190138816A1 (en) Method and apparatus for segmenting video object, electronic device, and storage medium
US20230030431A1 (en) Method and apparatus for extracting feature, device, and storage medium
CN113435365B (en) Face image migration method and device
CN115861462B (en) Training method and device for image generation model, electronic equipment and storage medium
CN114973391B (en) Eyeball tracking method, device and equipment applied to the metaverse
CN111723707A (en) Method and device for estimating fixation point based on visual saliency
JP2023530796A (en) Recognition model training method, recognition method, device, electronic device, storage medium and computer program
CN115497139A (en) Method for detecting and identifying face covered by mask and integrating attention mechanism
CN117237761A (en) Training method of object re-recognition model, object re-recognition method and device
CN116664603A (en) Image processing method, device, electronic equipment and storage medium
US20230115765A1 (en) Method and apparatus of transferring image, and method and apparatus of training image transfer model
CN112287945A (en) Screen fragmentation determination method and device, computer equipment and computer readable storage medium
CN116309158A (en) Training method, three-dimensional reconstruction method, device, equipment and medium of network model
CN113591838B (en) Target detection method, device, electronic equipment and storage medium
CN112861687B (en) Mask wearing detection method, device, equipment and medium for access control system
CN113177483B (en) Video object segmentation method, device, equipment and storage medium
CN113761965B (en) Motion capture method, motion capture device, electronic equipment and storage medium
CN114463734A (en) Character recognition method and device, electronic equipment and storage medium
WO2024051690A1 (en) Image restoration method and apparatus, and electronic device
CN115908982B (en) Image processing method, model training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant