CN116883436A - Auxiliary understanding method and system based on sight estimation - Google Patents

Auxiliary understanding method and system based on sight estimation

Info

Publication number
CN116883436A
Authority
CN
China
Prior art keywords
pupil
eye
user
target
dimensional
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310667676.5A
Other languages
Chinese (zh)
Inventor
白晓伟
谢良
闫森
闫野
印二威
张敬
张亚坤
张皓洋
赵少楷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National Defense Technology Innovation Institute PLA Academy of Military Science
Original Assignee
National Defense Technology Innovation Institute PLA Academy of Military Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by National Defense Technology Innovation Institute, PLA Academy of Military Science
Priority to CN202310667676.5A
Publication of CN116883436A
Legal status: Pending


Classifications

    • G — PHYSICS; G06 — COMPUTING; CALCULATING OR COUNTING; G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/12 — Image analysis; Segmentation; Edge detection; Edge-based segmentation
    • G06T 19/00 — Manipulating 3D models or images for computer graphics
    • G06T 7/13 — Image analysis; Segmentation; Edge detection
    • G06T 7/62 — Image analysis; Analysis of geometric attributes of area, perimeter, diameter or volume
    • G06T 7/75 — Image analysis; Determining position or orientation of objects or cameras using feature-based methods involving models

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Geometry (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an auxiliary understanding method and system based on sight estimation, comprising the following steps: acquiring eye images of a user wearing an AR/VR/MR device through a near-eye infrared camera of the device; processing the eye images to determine three-dimensional sight vectors; mapping the three-dimensional sight vectors of the user's left and right eyes onto a virtual screen as the user's gaze point; and acquiring the gaze time of the gaze point and, when the gaze point and the gaze time meet preset conditions, determining the category of the target the user is gazing at from the gazed scene picture and a pre-trained DETR algorithm, then controlling the AR/VR/MR device to display that category, where the category information includes the shape, the color and the mutual positional relations among objects. By describing the objects of the scene image on AR/VR/MR and similar devices, the method can help infants, the elderly and adults whose intelligence is impaired by disease to perform human-computer interaction through eye movement and assists them in understanding the people or objects in the scene image.

Description

Auxiliary understanding method and system based on sight estimation
Technical Field
The invention belongs to the technical fields of computer vision, sight line estimation and eye movement interaction, and relates to an auxiliary understanding method and system based on sight line estimation.
Background
In recent years, gaze estimation has been widely used in fields such as human-computer interaction, Virtual Reality (VR), Augmented Reality (AR), Mixed Reality (MR), medical diagnosis and psychological analysis, where the current user's attention is analyzed and judged from the estimated gaze. Object detection, in turn, is a multi-class task that finds the objects of interest to the user in an image; it differs from ordinary classification in that the size and location of each object must also be determined. Object detection is therefore widely applied in automatic driving, human-computer interaction, target tracking, instance segmentation and the like.
In addition, most gaze estimation methods at the present stage are appearance-based: binocular eye images are fed into a network model, and the estimated gaze point position is obtained by regression of the network model. Meanwhile, existing techniques that combine gaze estimation with object detection are simple integrations: at the gaze drop point, only the name and confidence of the detected object are displayed on the target object, with no further description of the object of interest such as its shape, color and positional relations to other objects. In the prior art, gaze estimation is generally used to select a target and the interaction is performed manually, but manual human-computer interaction is too demanding for infants, the elderly and persons with upper-limb disabilities; integrating eye-movement interaction technology can overcome this drawback. To solve these problems, the invention fuses gaze estimation, target detection, eye-movement interaction and related technologies in AR/VR/MR and similar devices to provide textual descriptions of the objects in scene images.
Disclosure of Invention
The invention aims at the defects of the prior art: existing gaze estimation methods are mostly appearance-based, feeding binocular eye images into a network model and obtaining the estimated gaze point position by regression of the network model; such methods are simple but not highly accurate and require head movement to be restricted. To address these defects, an auxiliary understanding method and system based on sight estimation are provided.
In order to achieve the above purpose, the present invention adopts the following technical scheme.
In one aspect, the present invention provides a gaze estimation-based aided understanding method, comprising:
acquiring an eye image of a user wearing the AR/VR/MR device by a near-eye infrared camera of the AR/VR/MR device;
processing the eye images to determine three-dimensional sight vectors;
mapping the three-dimensional sight vectors of the left eye and the right eye of the user onto a virtual screen to serve as a gaze point of the user;
and acquiring the gaze time of the gaze point and, when the gaze point and the gaze time meet preset conditions, determining the category information corresponding to the target gazed at by the user according to the scene picture gazed at by the user and a pre-trained DETR algorithm, and controlling the AR/VR/MR device to execute the operation of displaying the category information corresponding to the target, wherein the category information comprises the shape, the color and the mutual positional relations among the objects.
In some embodiments, the processing the eye image to determine a three-dimensional line-of-sight vector comprises:
inputting the eye image into a RITnet network segmentation model, and segmenting the eye image to obtain segmentation graphs of edge information of four parts of a background, a sclera, an iris and a pupil respectively;
fitting the obtained segmentation map to a pupil ellipse through an edge detector and a least square method, and determining parameter information of the pupil ellipse;
determining parameters of a camera, and determining a camera coordinate system of a three-dimensional space through the parameters of the camera;
fitting a 3D eyeball model according to pupil ellipses of a plurality of eye images of the user under a camera coordinate system, and obtaining a pupil circle tangent to the 3D eyeball model according to the fitted pupil ellipses by a triangulation method; and determining 3D eyeball model parameters according to the eye images, pupil circle information corresponding to each eye image and a least square method, and determining a three-dimensional sight line vector.
In some embodiments, fitting the obtained segmentation map to a pupil ellipse by an edge detector and a least square method and determining parameter information of the pupil ellipse includes:
the method comprises the steps of extracting pupil boundaries after determining threshold values through an edge detector from a segmentation graph comprising four parts of a background, a sclera, an iris and a pupil, fitting boundary information by using a least square method to obtain parameter information of a pupil ellipse, drawing the pupil ellipse and determining a mathematical expression of a general equation of the pupil ellipse, wherein the mathematical expression comprises the following formula:
G(x, y) = Ax^2 + Bxy + Cy^2 + Dx + Ey + F
wherein: the parameters of the general equation of the pupil ellipse are (A, B, C, D, E, F), and the center of the pupil ellipse is (X_e, Y_e).
In some embodiments, under a camera coordinate system, fitting a 3D eyeball model according to pupil ellipses of a plurality of eye images of the user, and obtaining a pupil circle tangent to the 3D eyeball model according to a triangulation method by using the fitted pupil ellipses, wherein the method comprises the following steps:
constructing a three-dimensional cone taking the camera as a vertex according to a triangularized geometry method by taking the camera as a focus, wherein pupil ellipses obtained by segmentation in an eye image are intersecting lines of the eye image and the cone;
under a full perspective projection model, the pupil ellipse is back projected onto a 3D eyeball model, the back projected pupil circle is ensured to be tangent to the 3D eyeball model, the pupil circle projected on the surface of the 3D eyeball model is regarded as the cross section of a cone taking a camera focus as a vertex, and the pupil ellipse boundary of an eye image is a two-dimensional plane intersection point intersecting with the cone; when the pupil circle is determined, pupil circle information is obtained:
Pupil circle = (p, m, r)
wherein: p represents three-dimensional pupil coordinates; m represents the normal vector of the pupil; r represents the pupil circle radius;
the method for determining 3D eyeball model parameters according to a plurality of eye images, pupil circle information corresponding to each eye image and a least square method and determining three-dimensional sight vectors comprises the following steps:
Obtaining 3D eyeball model parameters of each person, including a 3D eyeball model sphere center and an average sphere radius, according to eye images of a plurality of testees;
according to the pupil circle information constructed from the plurality of eye images, the intersection point of the pupil-circle normal vectors is determined to be the sphere center of the 3D eyeball model; since, owing to numerical measurement errors, all the intersection points are unlikely to lie at exactly the same point, the sphere center is determined by the least squares method;
after the sphere center of the 3D eyeball model is determined, calculating the radiuses obtained by all eye images according to the pupil circles and the sphere center of the 3D eyeball model, and averaging the radiuses obtained by all eye images to obtain a final 3D eyeball radius R;
R = mean(R_i) = mean(||p_i - c||)
wherein: R_i is the radius obtained, for each eye image, from the distance between the pupil circle and the sphere center; c represents the sphere center of the 3D eyeball model; p_i is the pupil coordinate of the i-th eye image;
and obtaining the pupil center projected on the 3D eyeball model according to the pupil center obtained by the eye image and the 3D eyeball model parameters, and determining the connecting line between the sphere center of the 3D eyeball model and the pupil center of each eye image back projection as a three-dimensional sight vector.
In some embodiments, mapping three-dimensional gaze vectors of left and right eyes of a user onto a virtual screen as gaze points of the user, respectively, includes:
Through the fitted 3D eyeball model and the three-dimensional sight vector, when the user gazes at a known point S on the screen, the conversion relation between the real gaze vector and the estimated gaze vector can be determined through a simple calibration:
g_true = M · g_pred
wherein M is a rotation matrix; g_pred is the predicted three-dimensional sight vector; g_true is the true three-dimensional sight vector;
the coordinates of the intersection point between the straight line through the calculated 3D eyeball center O = (a, b, c), whose direction is the estimated three-dimensional sight vector (m, n, p), and the two-dimensional screen plane Ax + By + Cz + D = 0 are then calculated;
the parameter equation of the straight line is:
x=mt+a
y=nt+b
z=pt+c
substituting into the plane equation gives:
t = -(A·a + B·b + C·c + D) / (A·m + B·n + C·p)
substituting this t back into the parameter equation of the straight line gives the intersection point coordinates P(x, y, z) of the estimated sight vector with the two-dimensional screen;
because the difference between the left eye and the right eye causes a deviation between the two-dimensional plane coordinates estimated for each eye, the final estimated two-dimensional plane gaze point coordinates are obtained by combining the intersection point coordinates of the left and right eyes;
the rotation matrix M can be solved, through the conversion relation given above, from the known points S_i (i = 1, 2, 3) on the screen and the gaze points P_i (i = 1, 2, 3) predicted on the screen from the three-dimensional sight vectors.
in some embodiments, before determining the category information corresponding to the target at which the user gazes according to the scene picture at which the user gazes and the pre-trained DETR algorithm, the method further includes:
Detecting each target in the scene pictures in the public data set, and marking the target frame selection area of each target in the scene pictures;
and sending each scene picture into DETR, a Transformer-based end-to-end target detection algorithm, for training; after the feature extraction, encoding, decoding and prediction stages, each region of the scene picture is classified as a target category or no category, and each target frame-selection area with a target category is matched with the category information describing that target.
In some embodiments, determining category information corresponding to a target at which the user gazes according to a scene picture at which the user gazes and a pre-trained DETR algorithm includes:
taking a scene picture watched by the user as input of a DETR algorithm;
determining category information matched with a target frame selection area according to the target frame selection area in which the gaze point of the user falls;
and controlling the AR/VR/MR device to execute the operation of displaying the category information corresponding to the target, wherein the operation comprises the following steps:
and controlling the AR/VR/MR device to execute the central display of the category information corresponding to the target in the target frame selection area in which the gaze point falls.
In some embodiments, the preset condition is that the gaze point falls into any one preset target frame selection area, the gaze time is greater than a first threshold, and the fluctuation of the gaze point within the first threshold does not exceed a preset gaze point fluctuation range.
In some embodiments, after the controlling the AR/VR/MR device to perform the operation of displaying the category information corresponding to the target, the method further includes:
acquiring the eye closing time of a user;
and if the eye closing time is greater than a second threshold, determining that the user is autonomous eye closing, and controlling the AR/VR/MR equipment to execute the operation of stopping displaying the category information corresponding to the target.
In a second aspect, the present invention also provides an auxiliary understanding system based on line of sight estimation, comprising:
the eye image acquisition module is used for acquiring eye images of a user wearing the AR/VR/MR device through a near-eye infrared camera of the AR/VR/MR device;
the eye image processing module is used for processing the eye image and determining a three-dimensional eye vector;
the gaze point mapping module is used for mapping the three-dimensional sight line vectors of the left eye and the right eye of the user onto the virtual screen to serve as a gaze point of the user;
and the eye movement interaction module is used for acquiring the gazing time of the gazing point, determining category information corresponding to a target gazed by the user according to a scene picture gazed by the user and a pre-trained DETR algorithm when the gazing point and the gazing time meet preset conditions, and controlling the AR/VR/MR equipment to execute the operation of displaying the category information corresponding to the target, wherein the category information comprises the shape, the color and the mutual position relation among objects.
In the embodiment of the invention, the AR/VR/MR equipment can be controlled to display the category information corresponding to the target, and a series of operations such as switching, returning and text box generation of scene pictures are realized in an eye movement interaction mode. The method integrates related technologies such as sight estimation, target detection, eye movement interaction and the like, realizes auxiliary understanding of scene picture objects observed in mobile terminal equipment in an eye movement interaction mode, is used for describing objects in scene images on equipment such as AR/VR/MR and the like, can be used for helping infants, old people, adults with impaired intelligence due to diseases and the like to perform man-machine interaction in an eye movement mode, and assists understanding of people or objects in the scene images. With this system, when the point of regard falls on the target object of interest to the user, not only the name and confidence of the target detection object but also further descriptions such as the shape, color, and mutual positional relationship between the objects, etc., of the object of interest to the user are displayed.
Drawings
Fig. 1 is a flowchart of an auxiliary understanding method based on line of sight estimation according to an embodiment of the present invention;
fig. 2 is a flowchart of step S102 in fig. 1 according to an embodiment of the present invention;
FIG. 3 is a schematic view of a fitted model eye of a single eye image in an assisted understanding method based on gaze estimation according to the present invention;
fig. 4 is a flowchart of DETR algorithm in a vision estimation-based auxiliary understanding method according to an embodiment of the present invention;
FIG. 5 is a flowchart for generating and canceling a visual text box in an auxiliary understanding text based on line of sight estimation according to an embodiment of the present invention;
fig. 6 is a diagram of expected effects of a vision estimation-based auxiliary understanding method implemented in accordance with an embodiment of the present invention;
fig. 7 is a block diagram of the components and connections of an auxiliary understanding system based on line of sight estimation according to an embodiment of the present invention.
Detailed Description
Example embodiments will be described more fully hereinafter with reference to the accompanying drawings, but may be embodied in various forms and should not be construed as limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed rules.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used herein, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
Embodiments described herein may be described with reference to plan and/or cross-sectional views with the aid of idealized schematic diagrams of the present disclosure. Accordingly, the example illustrations may be modified in accordance with manufacturing techniques and/or tolerances. Thus, the embodiments are not limited to the embodiments shown in the drawings, but include modifications of the configuration formed based on the manufacturing process. Thus, the regions illustrated in the figures have schematic properties and the shapes of the regions illustrated in the figures illustrate the particular shapes of the regions of the elements, but are not intended to be limiting.
For describing objects in a scene image on an AR/VR/MR device and the like, the method and the system can be used for assisting infants, old people, adults with impaired intelligence due to diseases and the like in understanding people or objects in the scene image, and the embodiment of the disclosure provides an auxiliary understanding method and system based on sight estimation. The following detailed description is provided with reference to the accompanying drawings of the embodiments of the invention.
Example 1
In a first aspect, as shown in fig. 1, an embodiment of the present invention provides an auxiliary understanding method based on line of sight estimation, including:
Step S101, acquiring an eye image of a user wearing the AR/VR/MR device by means of a near-eye infrared camera of the AR/VR/MR device;
Step S102, processing the eye images to determine three-dimensional sight vectors;
step S103, mapping the three-dimensional sight line vectors of the left eye and the right eye of the user onto a virtual screen to serve as a fixation point of the user;
step S104, obtaining the gazing time of the gazing point, when the gazing point and the gazing time meet the preset condition, determining the category information corresponding to the target gazed by the user according to the scene picture gazed by the user and the pre-trained DETR algorithm, controlling the AR/VR/MR device to execute the operation of displaying the category information corresponding to the target,
wherein the category information includes the shape, color, and positional relationship of the objects to each other.
It should be noted that the eye images are acquired by a near-eye infrared camera of the AR/VR/MR device, and the eye image acquisition module may be configured in the AR/VR/MR device. First, each position point to be used is visualized on the virtual screen of the visualization interface; next, the user gazes at each visualized position point and K eye images are collected for each position, and the true-value coordinates of the position points are recorded for the subsequent calibration work; finally, hardware parameters of the infrared near-eye camera such as focal length, frame rate and photosensitive-element size are determined, which facilitates constructing the spatial coordinate system when the eyeball model is fitted later. Step S101 is implemented by the eye image acquisition module.
The range of K of the present invention is 10 to 50.
In a preferred embodiment, the category information corresponding to the target is displayed by a text box. Inputting the visualized fixation point and the scene picture into a virtual screen of the AR/VR/MR equipment through a display interface module, and comparing the visualized fixation point with the information detected by the object target through sight estimation to generate a text box describing the text information;
First, the gaze point is visualized: the two-dimensional estimated gaze point position is obtained from the three-dimensional sight vector through the mapping method, and a visualized point centered on it is drawn by the program deployed on the mobile terminal. Second, the scene picture is visualized: each time the user uses the system, a certain number of pictures can be randomly selected for textual description of the targets in the scene pictures. Third, the generation of the text box is visualized: after the gaze point is matched with a target frame-selection area, the eye movement behavior determines whether a text box is generated, and the text box is generated by writing the annotated object information into it according to the mobile-terminal deployment configuration. Finally, keys are configured: when the next or previous picture is needed, the scene picture is switched by gazing at a button above the virtual screen.
It will be appreciated that displaying the category information corresponding to the target is not limited to a text box; a voice prompt or the like may also be used, and this is not limited herein.
In the embodiment of the invention, the AR/VR/MR device can be controlled to display the category information corresponding to the target, and a series of operations such as switching and returning of scene pictures and generation of text boxes are realized through eye-movement interaction. The method fuses gaze estimation, target detection, eye-movement interaction and related technologies, and realizes, through eye-movement interaction, auxiliary understanding of the scene picture objects observed on the mobile terminal device; by describing the objects of the scene image on AR/VR/MR and similar devices, it can help infants, the elderly, adults whose intelligence is impaired by disease and others to perform human-computer interaction through eye movement and assists them in understanding the people or objects in the scene image. With this system, when the gaze point falls on a target object of interest to the user, not only the name and confidence of the detected object are displayed, but also further descriptions of the object of interest such as its shape, color and positional relations to other objects.
In some embodiments, as shown in fig. 2, the processing the eye image to determine a three-dimensional line-of-sight vector (i.e., step S102) includes:
Inputting the eye image into a RITnet network segmentation model, and segmenting the eye image to obtain segmentation graphs of edge information of four parts of a background, a sclera, an iris and a pupil respectively;
fitting the obtained segmentation map to a pupil ellipse through an edge detector and a least square method, and determining parameter information of the pupil ellipse;
determining parameters of a camera, and determining a camera coordinate system of a three-dimensional space through the parameters of the camera;
fitting a 3D eyeball model according to pupil ellipses of a plurality of eye images of the user under a camera coordinate system, and obtaining a pupil circle tangent to the 3D eyeball model according to the fitted pupil ellipses by a triangulation method; and determining 3D eyeball model parameters according to the eye images, pupil circle information corresponding to each eye image and a least square method, and determining a three-dimensional sight line vector.
In the embodiment of the invention, a three-dimensional cone with the camera as its vertex is constructed, with the camera as the focus, according to the triangulation geometry method; the pupil ellipse obtained by segmentation of the eye image is the intersection line of the image plane and this cone. The acquired eye images are fitted to an eyeball model to obtain the sphere-center coordinates and the sphere radius, then the sight vector is computed for each eye image, and the three-dimensional sight vector is finally output.
The RITnet network segmentation model comprises five down-sampling modules and four up-sampling modules. Each down-sampling module comprises 5 convolution layers and a LeakyReLU activation function, with 2×2 pooling layers throughout, and each up-sampling module comprises 4 convolution layers and a LeakyReLU activation function. The number of channels of each sampling module is 32; to reduce the number of parameters, all convolution layers share connections from the preceding convolution layers.
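For illustration, a minimal sketch of one such down-sampling block is given below in PyTorch; it follows the description above (five convolutions with LeakyReLU, dense connections inside the block, 2×2 pooling, 32 output channels), but the exact layer arrangement is an assumption rather than the authors' released RITnet implementation.

```python
# Hedged sketch of a RITnet-style down-sampling block; layer details beyond the
# text above (kernel sizes, pooling type) are illustrative assumptions.
import torch
import torch.nn as nn

class DownBlock(nn.Module):
    def __init__(self, in_ch: int, out_ch: int = 32, n_convs: int = 5):
        super().__init__()
        self.convs = nn.ModuleList()
        ch = in_ch
        for _ in range(n_convs):
            self.convs.append(nn.Sequential(
                nn.Conv2d(ch, out_ch, kernel_size=3, padding=1),
                nn.LeakyReLU(inplace=True),
            ))
            ch += out_ch          # dense connection: next conv sees all previous features
        self.pool = nn.AvgPool2d(kernel_size=2)   # 2x2 pooling halves the resolution

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        features = [x]
        for conv in self.convs:
            out = conv(torch.cat(features, dim=1))
            features.append(out)
        return self.pool(out)     # 32-channel output at half spatial resolution

# Usage sketch: one grayscale near-eye image of size 192x256
# y = DownBlock(in_ch=1)(torch.randn(1, 1, 192, 256))   # -> (1, 32, 96, 128)
```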
In some embodiments, fitting the obtained segmentation map to a pupil ellipse by an edge detector and a least square method and determining parameter information of the pupil ellipse includes:
the method comprises the steps of extracting pupil boundaries after determining threshold values through an edge detector from a segmentation graph comprising four parts of a background, a sclera, an iris and a pupil, fitting boundary information by using a least square method to obtain parameter information of a pupil ellipse, drawing the pupil ellipse and determining a mathematical expression of a general equation of the pupil ellipse, wherein the mathematical expression comprises the following formula:
G(x, y) = Ax^2 + Bxy + Cy^2 + Dx + Ey + F
wherein: the parameters of the general equation of the pupil ellipse are (A, B, C, D, E, F), and the center of the pupil ellipse is (X_e, Y_e).
Through the steps, pupil ellipse fitting is realized.
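A minimal sketch of this pupil-ellipse fitting step is shown below, assuming OpenCV is available and that the pupil class in the segmentation map carries label 3; cv2.fitEllipse performs the least-squares ellipse fit, and the helper name fit_pupil_ellipse is illustrative only.

```python
# Sketch of the edge-detection + least-squares ellipse fit described above.
import cv2
import numpy as np

def fit_pupil_ellipse(seg_map: np.ndarray, pupil_label: int = 3):
    """seg_map: HxW label map {0: background, 1: sclera, 2: iris, 3: pupil} (assumed labels)."""
    pupil_mask = (seg_map == pupil_label).astype(np.uint8) * 255
    edges = cv2.Canny(pupil_mask, 50, 150)                     # pupil boundary pixels
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_NONE)
    if not contours:
        return None
    boundary = max(contours, key=len)                          # longest boundary contour
    if len(boundary) < 5:                                      # fitEllipse needs >= 5 points
        return None
    (xe, ye), (major, minor), angle = cv2.fitEllipse(boundary) # least-squares ellipse fit
    # (xe, ye) is the ellipse centre (X_e, Y_e); the general-equation coefficients
    # (A..F) can be recovered from centre, axes and angle if required.
    return {"center": (xe, ye), "axes": (major, minor), "angle": angle}
```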
In some embodiments, under a camera coordinate system, fitting a 3D eyeball model according to pupil ellipses of a plurality of eye images of the user, and obtaining a pupil circle tangent to the 3D eyeball model according to a triangulation method by using the fitted pupil ellipses, wherein the method comprises the following steps:
constructing a three-dimensional cone taking the camera as a vertex according to a triangularized geometry method by taking the camera as a focus, wherein pupil ellipses obtained by segmentation in an eye image are intersecting lines of the eye image and the cone;
under the full perspective projection model, the pupil ellipse is back projected onto the 3D eyeball model, the back projected pupil circle is ensured to be tangent to the 3D eyeball model, the pupil circle projected on the surface of the 3D eyeball model is regarded as the cross section of a cone taking the focus of a camera as the vertex, and the pupil ellipse boundary of an eye image is a two-dimensional plane intersection point intersecting with the cone (as shown in fig. 3); when the pupil circle is determined, pupil circle information is obtained:
Pupil circle = (p, m, r)
wherein: p represents three-dimensional pupil coordinates; m represents the normal vector of the pupil; r represents the pupil circle radius;
the method for determining 3D eyeball model parameters according to a plurality of eye images, pupil circle information corresponding to each eye image and a least square method and determining three-dimensional sight vectors comprises the following steps:
Obtaining 3D eyeball model parameters of each person, including a 3D eyeball model sphere center and an average sphere radius, according to eye images of a plurality of testees;
according to the pupil circle information constructed from the plurality of eye images, the intersection point of the pupil-circle normal vectors is determined to be the sphere center of the 3D eyeball model; since, owing to numerical measurement errors, all the intersection points are unlikely to lie at exactly the same point, the sphere center is determined by the least squares method;
after the sphere center of the 3D eyeball model is determined, calculating the radiuses obtained by all eye images according to the pupil circles and the sphere center of the 3D eyeball model, and averaging the radiuses obtained by all eye images to obtain a final 3D eyeball radius R;
R = mean(R_i) = mean(||p_i - c||)
wherein: R_i is the radius obtained, for each eye image, from the distance between the pupil circle and the sphere center; c represents the sphere center of the 3D eyeball model; p_i is the pupil coordinate of the i-th eye image;
and obtaining the pupil center projected on the 3D eyeball model according to the pupil center obtained by the eye image and the 3D eyeball model parameters, and determining the connecting line between the sphere center of the 3D eyeball model and the pupil center of each eye image back projection as a three-dimensional sight vector.
Through the steps, the generation and visualization of the three-dimensional sight vector are realized.
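The eyeball-model fitting described above can be sketched as follows: the sphere center is taken as the least-squares "intersection" of the pupil-circle normal lines (which do not meet exactly because of measurement error), the radius is the mean distance from the center to the pupil coordinates, and the gaze vector joins the sphere center to the back-projected pupil center. The function names and the NumPy-based formulation are illustrative assumptions.

```python
# Sketch of fitting the 3D eyeball model from per-image pupil circles (p_i, m_i).
import numpy as np

def fit_eyeball(pupil_centers: np.ndarray, pupil_normals: np.ndarray):
    """pupil_centers: (N, 3) 3D pupil coordinates p_i; pupil_normals: (N, 3) pupil normals m_i."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for p, m in zip(pupil_centers, pupil_normals):
        m = m / np.linalg.norm(m)
        P = np.eye(3) - np.outer(m, m)        # projector orthogonal to the normal line
        A += P
        b += P @ p
    center = np.linalg.solve(A, b)            # least-squares intersection of the normal lines
    radius = np.mean(np.linalg.norm(pupil_centers - center, axis=1))  # R = mean(||p_i - c||)
    return center, radius

def gaze_vector(pupil_center: np.ndarray, eyeball_center: np.ndarray) -> np.ndarray:
    v = pupil_center - eyeball_center         # line from sphere center to pupil center
    return v / np.linalg.norm(v)              # unit three-dimensional sight vector
```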
In some embodiments, mapping three-dimensional gaze vectors of left and right eyes of a user onto a virtual screen as gaze points of the user, respectively (i.e. step S103), includes:
Through the fitted 3D eyeball model and the three-dimensional sight vector, when the user gazes at a known point S on the screen, the conversion relation between the real gaze vector and the estimated gaze vector can be determined through a simple calibration:
g_true = M · g_pred
wherein M is a rotation matrix; g_pred is the predicted three-dimensional sight vector; g_true is the true three-dimensional sight vector;
the coordinates of the intersection point between the straight line through the calculated 3D eyeball center O = (a, b, c), whose direction is the estimated three-dimensional sight vector (m, n, p), and the two-dimensional screen plane Ax + By + Cz + D = 0 are then calculated;
the parameter equation of the straight line is:
x=mt+a
y=nt+b
z=pt+c
substituting into the plane equation gives:
t = -(A·a + B·b + C·c + D) / (A·m + B·n + C·p)
substituting this t back into the parameter equation of the straight line gives the intersection point coordinates P(x, y, z) of the estimated sight vector with the two-dimensional screen;
because the difference between the left eye and the right eye causes a deviation between the two-dimensional plane coordinates estimated for each eye, the final estimated two-dimensional plane gaze point coordinates are obtained by combining the intersection point coordinates of the left and right eyes;
the rotation matrix M can be solved, through the conversion relation given above, from the known points S_i (i = 1, 2, 3) on the screen and the gaze points P_i (i = 1, 2, 3) predicted on the screen from the three-dimensional sight vectors.
and establishing the gaze point mapping model through the steps. In step S103, the three-dimensional line-of-sight vector generated by the line-of-sight estimation module is mapped onto the virtual screen as an estimated point, and the estimated coordinate point and the real coordinate point are compared and calculated to obtain the accuracy error of the line-of-sight estimation algorithm.
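A minimal sketch of this gaze-point mapping is given below: a calibration matrix M is solved from the calibration points, the gaze ray from the eyeball center O = (a, b, c) with direction (m, n, p) is intersected with the screen plane Ax + By + Cz + D = 0, and the left-eye and right-eye results are combined. Averaging the two eyes is an assumption consistent with the text rather than a quoted formula.

```python
# Sketch of calibration, line-plane intersection and left/right fusion.
import numpy as np

def solve_calibration(pred_vecs: np.ndarray, true_vecs: np.ndarray) -> np.ndarray:
    """Columns of pred_vecs / true_vecs are the 3 predicted / true gaze vectors."""
    return true_vecs @ np.linalg.inv(pred_vecs)     # M such that g_true = M @ g_pred

def ray_plane_intersection(O, d, plane):
    """O: eyeball centre (a, b, c); d: gaze direction (m, n, p); plane: (A, B, C, D)."""
    A, B, C, D = plane
    t = -(A * O[0] + B * O[1] + C * O[2] + D) / (A * d[0] + B * d[1] + C * d[2])
    return np.asarray(O) + t * np.asarray(d)        # P(x, y, z) on the screen plane

def gaze_point(O_left, d_left, O_right, d_right, plane):
    p_l = ray_plane_intersection(O_left, d_left, plane)
    p_r = ray_plane_intersection(O_right, d_right, plane)
    return 0.5 * (p_l + p_r)                        # assumed left/right average
```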
In some embodiments, before determining the category information corresponding to the target at which the user gazes according to the scene picture at which the user gazes and the pre-trained DETR algorithm, the method further includes:
detecting each target in the scene pictures in the public data set, and marking the target frame selection area of each target in the scene pictures;
and sending each scene picture into DETR, a Transformer-based end-to-end target detection algorithm, for training; after the feature extraction, encoding, decoding and prediction stages, each region of the scene picture is classified as a target category or no category, and each target frame-selection area with a target category is matched with the category information describing that target.
The category information includes the shape, color and mutual position relation of the objects, and the object name, the position of the target frame selection area and other contents are stored in the text file as mark information.
For the object detection task, the relations between objects help improve the detection effect. The system therefore adopts the DETR algorithm: traditional target detection algorithms, although they use relations between targets, do not exploit the attention mechanism well, probably because the number, category and size of targets differ from picture to picture; DETR models the relations between different targets with a Transformer and integrates this relation information into the feature values, thereby truly realizing end-to-end target detection.
As shown in fig. 4, the DETR algorithm is divided into four stages: feature extraction, encoding, decoding and prediction:
1. Feature extraction: a conventional CNN backbone extracts the features of the input picture, and the positional encoding commonly used in the NLP field is added to generate a batch of serialized data;
2. Encoding: after flattening, the serialized data are sent to a Transformer encoder, which uses the attention mechanism to extract features from the data;
3. Decoding: a small, fixed number of learned positional embeddings of the Transformer decoder (called object queries) are used as input, i.e. N randomly initialized vectors are fed into the Transformer decoder; each object query attends to a different position of the picture, and the decoder finally outputs N vectors;
4. Prediction: after the decoding stage, each vector corresponds to one detected object. Finally, the N vectors are input into an FFN (feed-forward network), which predicts for each vector either a target category together with its relative position, or no category.
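For reference, a hedged sketch of running a pre-trained DETR model is shown below, assuming the publicly released facebookresearch/detr torch.hub entry point; the patent's own model is trained on its annotated scene pictures, so the image path and confidence threshold here are illustrative.

```python
# Sketch of DETR inference on one scene picture using the public torch.hub model.
import torch
import torchvision.transforms as T
from PIL import Image

model = torch.hub.load("facebookresearch/detr", "detr_resnet50", pretrained=True)
model.eval()

preprocess = T.Compose([
    T.Resize(800),
    T.ToTensor(),
    T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225]),
])

img = Image.open("scene.jpg").convert("RGB")        # scene picture gazed at by the user
with torch.no_grad():
    out = model(preprocess(img).unsqueeze(0))

probs = out["pred_logits"].softmax(-1)[0, :, :-1]   # drop the "no object" class
keep = probs.max(-1).values > 0.7                   # keep confident target boxes only
boxes = out["pred_boxes"][0, keep]                  # normalised (cx, cy, w, h) per target
labels = probs[keep].argmax(-1)                     # class index per target box
```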
In the embodiment of the invention, the detection and marking of the target object in the scene picture in the public data set are realized through the target detection module.
In some embodiments, in step S104, determining, according to the scene picture at which the user gazes and the DETR algorithm trained in advance, category information corresponding to the target at which the user gazes includes:
Taking a scene picture watched by the user as input of a DETR algorithm;
determining category information matched with a target frame selection area according to the target frame selection area in which the gaze point of the user falls;
and controlling the AR/VR/MR device to execute the operation of displaying the category information corresponding to the target, wherein the operation comprises the following steps:
and controlling the AR/VR/MR device to execute the central display of the category information corresponding to the target in the target frame selection area in which the gaze point falls.
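A minimal sketch of matching the gaze point to a target frame-selection area and retrieving the category information to display is given below; the box format and the structure of the targets list are assumptions made for illustration.

```python
# Sketch: find the target frame-selection area containing the gaze point.
def match_gaze_to_target(gaze_xy, targets):
    """gaze_xy: (x, y) on the virtual screen; targets: list of dicts with
    'box' = (x_min, y_min, x_max, y_max) and 'category_info' describing the
    object's shape, colour and relations to neighbouring objects (assumed keys)."""
    x, y = gaze_xy
    for t in targets:
        x0, y0, x1, y1 = t["box"]
        if x0 <= x <= x1 and y0 <= y <= y1:
            return t["category_info"]      # to be shown centred in the highlighted box
    return None                            # gaze point lies outside every target box
```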
In order to display category information of a target watched by a user in the form of a text box, the display interface module mainly comprises a gaze point visualization, text box generation and the like. FIG. 5 is a flow chart for visual text box generation and cancellation.
1. Gaze point visualization:
according to the gaze point model, one gaze point corresponding to the two-dimensional plane is visualized into a red solid dot tracked along with the line of sight through a visualization interface; after the fixation point position is determined, a solid circle with the radius of 5mm is drawn by taking the fixation point coordinate as the circle center, and the visualized fixation point can be used for initial calibration and tracking the sight so as to facilitate the selection of the target by a subsequent user.
2. Generating an auxiliary understanding text box:
When the gaze point falls in a target frame-selection area and the gaze time is greater than the first threshold, the target frame-selection area of that target object is highlighted and a text box pops up at its center, as shown in fig. 6, an expected-effect diagram of displaying the category information of the target gazed at by the user in the form of a text box. In the invention, a scene picture is input; when the user's gaze drop point is on an object in the picture, text information describing that object pops up at the center of the object through eye-movement gazing, so that the target is described in text and the user is helped to understand the scene image.
In some embodiments, the preset condition is that the gaze point falls into any one preset target frame selection area, the gaze time is greater than a first threshold, and the fluctuation of the gaze point within the first threshold does not exceed a preset gaze point fluctuation range.
Step S104 is implemented by an eye movement interaction module. The module takes the eye movement signal as an operation signal, so that a user can conveniently operate the region of interest; the first eye movement signal is gazing, that is, when the gazing point stays in a range for a certain time (greater than a set first threshold), the module determines that the current user is gazing at the point, and the corresponding operation is executed.
Specifically, because of the unique physiological responses of the human eye, behaviors such as saccades and blinks occur involuntarily when a person observes an object, so an eye movement interaction module with preset eye movement behaviors is needed to judge the current and subsequent eye movement behaviors.
When the human eye looks at an object, the line of sight is focused for a long time at a point and its vicinity.
Based on the above feature, a gaze point threshold is set to determine whether the human eye is gazing at an object. The gaze point threshold is the first threshold: when the dwell time of the gaze point at one point or within a set nearby area is greater than t = 1500 ms, gazing eye movement behavior is judged to exist;
dividing the fluctuation range of the gaze point: during long gazing the gaze point coordinates fluctuate within a certain range; gazing is judged when, within the first threshold time, the fluctuation of the gaze point stays inside the metering circle of radius r and does not exceed it. The gaze model is shown as follows:
Gaze = (X_g, Y_g, r, t)
wherein: (X_g, Y_g) are the initial gaze point position coordinates; r represents the radius of the circular metering area; t represents the gaze time threshold.
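A minimal sketch of this gazing judgment is given below: a fixation is reported when the gaze point stays inside a circle of radius r around the initial gaze point for longer than the first threshold t (1500 ms in the text). The sample format is an assumption.

```python
# Sketch of the fixation test Gaze = (X_g, Y_g, r, t).
import math

def is_fixation(samples, radius_px: float, t_threshold_ms: float = 1500.0) -> bool:
    """samples: list of (timestamp_ms, x, y) gaze points, oldest first."""
    if not samples:
        return False
    t0, xg, yg = samples[0]                 # (X_g, Y_g): initial gaze point
    for ts, x, y in samples:
        if math.hypot(x - xg, y - yg) > radius_px:
            return False                    # fluctuation left the metering circle
    return samples[-1][0] - t0 >= t_threshold_ms
```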
In some embodiments, after the controlling the AR/VR/MR device to perform the operation of displaying the category information corresponding to the target, the method further includes:
Acquiring the eye closing time of a user;
and if the eye closing time is greater than a second threshold, determining that the user is autonomous eye closing, and controlling the AR/VR/MR equipment to execute the operation of stopping displaying the category information corresponding to the target.
Step S104 is implemented by the eye movement interaction module. The module takes the eye movement signal as an operation signal, making it convenient for the user to operate on the region of interest. The second eye movement signal is eye closure, and the main judgment basis is the eye-closing time: when the eye-closing time is shorter than the set second threshold, a non-autonomous eye closure is judged; when the eye-closing time is longer than the set second threshold, an autonomous eye closure is judged; the module performs different operations according to the different judgment results.
Due to the unique physiological responses of the human eye, involuntary behaviors such as blinking occur when a person observes an object, so an eye movement interaction module with preset eye movement behaviors is needed to judge the current and subsequent eye movement behaviors.
The eyes feel dry after long-term use, so involuntary blinking is unavoidable when using the system, and frequent changes of the gaze point caused by blinking need to be reduced. The eye-closure time of an involuntary blink is usually very short, so only one time threshold Td = 500 ms (the second threshold) needs to be set to determine whether the time between the current frame and the eye-closure image frame exceeds this threshold; if T < Td, it is an involuntary blink and no operation is performed; if T > Td, it is an autonomous eye closure and a return operation is performed. The eye-closing time is calculated as follows:
T = (F_k - F_n) / frame
wherein T represents the eye-closure time; F_k and F_n represent the frame numbers at which the eyes open at time k and close at time n, respectively; frame represents the frame rate.
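A minimal sketch of this eye-closure judgment is shown below: closure durations shorter than the second threshold Td are treated as involuntary blinks and ignored, while longer closures are treated as autonomous eye closures that cancel the text box. The function name and return values are illustrative.

```python
# Sketch of classifying an eye closure using T = (F_k - F_n) / frame.
def eye_closure_action(frame_open_k: int, frame_close_n: int, frame_rate: float,
                       td_s: float = 0.5) -> str:
    T = (frame_open_k - frame_close_n) / frame_rate    # closure time in seconds
    if T < td_s:
        return "ignore"          # involuntary blink: no operation
    return "cancel_text_box"     # autonomous eye closure: return to initial state
```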
In the embodiment of the invention, the text box is visualized mainly by placing an elliptical dialog box above the scene image through the system visualization interface; the text content of the dialog box must match the annotation information of the currently detected object, thereby describing the object in the scene picture. To cancel the current dialog box, the user only needs to close the eyes autonomously anywhere in the current image, i.e. with an eye-closing time greater than the second threshold; the text box is then cancelled, the highlighted target frame-selection area disappears, and the system returns to the initial state (fig. 5 and 6). The invention designs a visualized interface of the text interaction system and realizes, through eye-movement interaction, a series of operations such as switching and returning of scene pictures and generation of text boxes.
The auxiliary understanding method based on the sight line estimation provided by the invention is described below by way of example:
s1: the user needs to wear the AR head display correctly, an option button for checking the eye image is arranged on a display interface of the head display, and the user can adjust the current wearing posture according to the eye image transmitted back by the real-time camera, so that the pupils, the irises and the like in the eye image are ensured to be clearly visible;
S2: after wearing, acquiring eye images of a wearer by using a near-eye camera of the AR head display device, calibrating each left eye image and each right eye image respectively while acquiring a large number of eye images of the user to acquire real two-dimensional plane fixation point position information, and outputting the information as a text file;
s3: the acquired eye images of the left eye and the right eye are parallelly sent into a RITnet segmentation model, namely the left eye and the right eye are simultaneously and respectively input into a segmentation network model;
s4: obtaining edge information of four parts of a background, a sclera, an iris and a pupil through a RITnet segmentation model;
s5: after the segmentation map is obtained, determining a pupil boundary pixel point threshold, namely judging that the pupil does not belong to a pupil region when the pixels in the eye image are higher than the threshold;
s6: the rough area of the pupil can be obtained after the threshold value is judged, then the pupil of the eye image after the threshold value is judged is separated through a Canny edge detector, the rough boundary of the pupil is determined, finally, an elliptic pupil is fitted through a least square method according to rough boundary information, and parameters of the pupil ellipse are obtained and drawn;
s7: parameters such as a focal length of a camera, a size of a photosensitive element and the like are determined, and a camera coordinate system of a three-dimensional space is determined through the parameters of the camera;
S8: fitting a 3D eyeball model according to pupil ellipses of a plurality of eye images of each user, and obtaining a pupil circle tangent to the 3D eyeball model according to the fitted pupil ellipses by a triangulation method;
s9: according to a fitting method of a single eye image, determining the intersection point of a plurality of pupil circle normal vectors as the spherical center of the 3D eyeball model, wherein the average value of the connecting line distances between the spherical center and the circle centers of pupil circles of the eyeball model and each tangent is the radius of the spherical center;
s10: after the sphere center of the 3D eyeball model is estimated, a vector projected between the pupil center of the surface of the eyeball model and the sphere center of the eyeball model according to each eye image is a three-dimensional vision estimation vector;
s11: the steps S8-S10 are executed in parallel by the left eye and the right eye in the same way;
s12: mapping the three-dimensional sight estimation vectors of the left eye and the right eye onto a 2D plane according to the mapping relation to obtain sight-point coordinates;
s13: after the calibration action is completed, a visual fixation point for tracking the real-time vision line appears on the virtual two-dimensional plane;
s14: according to the image and text information of the object detection task of the public COCO data set, selecting an object and describing a text of the gaze point object, wherein the object comprises attributes such as shape, color and the like of the object and the relative relation of surrounding objects;
S15: selecting a scene picture watched by a user as input of a target detection algorithm, and marking a target frame selection area of each object in the scene picture;
s16: matching the position coordinates of the target frame selection area with text information corresponding to the description of the target through a target detection algorithm;
s17: based on the threshold value for determining the gazing time and range, when the user gazes at one point for a long time or fluctuates in the gazing point wave band range in the first threshold value, judging that the current gazing point is unchanged, and executing gazing eye movement behavior operation;
S18: when the gazing time exceeds the set first threshold, if the gaze point is within a target frame-selection area at that moment, text information describing the target gazed at by the user pops up from the center of that target frame-selection area; if the gaze point is not within a target frame-selection area, no operation is performed;
s19: and canceling the text box operation, when the user needs to cancel the currently presented text box, and the eye closing time exceeds a second threshold value, the system judges that the autonomous eye closing cancels the current text box and changes the highlighted target box selection area into an initial state.
In a second aspect, as shown in fig. 7, an embodiment of the present invention further provides an auxiliary understanding system based on line of sight estimation, including:
An eye image acquisition module 11 for acquiring an eye image of a user wearing the AR/VR/MR device by a near-eye infrared camera of the AR/VR/MR device;
a sight line estimation module 12, configured to process the eye image and determine a three-dimensional sight line vector;
a gaze point mapping module 13 for mapping three-dimensional gaze vectors of left and right eyes of a user onto a virtual screen as a gaze point of the user;
the eye movement interaction module 14 is configured to obtain a gaze time of the gaze point, determine category information corresponding to a target gazed by the user according to a scene picture gazed by the user and a DETR algorithm trained in advance when the gaze point and the gaze time meet preset conditions, and control the AR/VR/MR device to perform an operation of displaying the category information corresponding to the target, where the category information includes a shape, a color, and a mutual positional relationship between objects.
As shown in fig. 7, the display interface module inputs the visualized gaze point and the scene picture on a virtual screen of the AR/VR/MR device, compares the visualized gaze point with the detected information of the object target through the line of sight estimation to generate a text box describing the target gazed by the user, and displays the category information of the target in the form of the text box. And the target detection module detects and marks a target object in the scene picture in the public data set.
It is to be understood that the above embodiments are merely illustrative of the application of the principles of the present invention, but not in limitation thereof. Various modifications and improvements may be made by those skilled in the art without departing from the spirit and substance of the invention, and are also considered to be within the scope of the invention.

Claims (10)

1. A gaze estimation-based aided understanding method, comprising:
acquiring an eye image of a user wearing the AR/VR/MR device by a near-eye infrared camera of the AR/VR/MR device;
processing the eye images to determine three-dimensional sight vectors;
mapping the three-dimensional sight vectors of the left eye and the right eye of the user onto a virtual screen to serve as a gaze point of the user;
and acquiring the gaze time of the gaze point, and, when the gaze point and the gaze time meet preset conditions, determining the category information corresponding to the target gazed at by the user according to the scene picture gazed at by the user and a pre-trained DETR Transformer-based end-to-end target detection algorithm, and controlling the AR/VR/MR device to execute the operation of displaying the category information corresponding to the target, wherein the category information comprises the shape, the color and the mutual positional relations among objects.
2. The gaze-estimation-based aided understanding method of claim 1, wherein said processing the eye image to determine a three-dimensional gaze vector comprises:
inputting the eye image into a RITnet network segmentation model, and segmenting the eye image to obtain segmentation graphs of edge information of four parts of a background, a sclera, an iris and a pupil respectively;
fitting the obtained segmentation map to a pupil ellipse through an edge detector and a least square method, and determining parameter information of the pupil ellipse;
determining parameters of a camera, and determining a camera coordinate system of a three-dimensional space through the parameters of the camera;
fitting a 3D eyeball model according to pupil ellipses of a plurality of eye images of the user under a camera coordinate system, and obtaining a pupil circle tangent to the 3D eyeball model according to the fitted pupil ellipses by a triangulation method; and determining 3D eyeball model parameters according to the eye images, pupil circle information corresponding to each eye image and a least square method, and determining a three-dimensional sight line vector.
3. The gaze-estimation-based aided understanding method of claim 1, wherein fitting the obtained segmentation map to a pupil ellipse through an edge detector and a least squares method and determining the parameter information of the pupil ellipse comprises:
extracting the pupil boundary from the segmentation map comprising the four parts (background, sclera, iris and pupil) with an edge detector after determining a threshold value, fitting the boundary points by the least squares method to obtain the parameters of the pupil ellipse, drawing the pupil ellipse, and determining the general equation of the pupil ellipse, which has the form:
G(x, y) = Ax² + Bxy + Cy² + Dx + Ey + F
wherein: (A, B, C, D, E, F) are the parameters of the general equation of the pupil ellipse, and (X_e, Y_e) is the center of the pupil ellipse.
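A possible realization of this fitting step is sketched below with OpenCV, whose fitEllipse routine performs a least-squares ellipse fit on the extracted boundary; the conversion to the general-conic coefficients (A, B, C, D, E, F) follows the equation above. The use of OpenCV and the edge-detector thresholds are assumptions of the sketch, not values from the patent.

```python
import cv2
import numpy as np

def fit_pupil_ellipse(pupil_mask: np.ndarray):
    """pupil_mask: HxW uint8 binary mask; returns (center, semi-axes, rotation in radians)."""
    edges = cv2.Canny(pupil_mask * 255, 50, 150)                    # pupil boundary pixels
    contours, _ = cv2.findContours(edges, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)
    boundary = max(contours, key=cv2.contourArea)                   # keep the pupil contour
    (xc, yc), (w, h), angle_deg = cv2.fitEllipse(boundary)          # least-squares ellipse fit
    return (xc, yc), (w / 2.0, h / 2.0), np.deg2rad(angle_deg)

def ellipse_to_conic(center, semi_axes, theta):
    """Convert (center, semi-axes, rotation) to the coefficients of G(x, y) = 0."""
    (xc, yc), (a, b) = center, semi_axes
    A = a**2 * np.sin(theta)**2 + b**2 * np.cos(theta)**2
    B = 2.0 * (b**2 - a**2) * np.sin(theta) * np.cos(theta)
    C = a**2 * np.cos(theta)**2 + b**2 * np.sin(theta)**2
    D = -2.0 * A * xc - B * yc
    E = -B * xc - 2.0 * C * yc
    F = A * xc**2 + B * xc * yc + C * yc**2 - a**2 * b**2
    return A, B, C, D, E, F
```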
4. The gaze-estimation-based aided understanding method of claim 2, wherein fitting a 3D eyeball model from the pupil ellipses of a plurality of eye images of the user in the camera coordinate system, and obtaining a pupil circle tangent to the 3D eyeball model from the fitted pupil ellipses by a triangulation method, comprises:
constructing a three-dimensional cone whose apex is the camera focal point according to triangulation geometry, wherein the pupil ellipse segmented from an eye image is the intersection curve of the image plane and the cone;
under a full perspective projection model, back-projecting the pupil ellipse onto the 3D eyeball model such that the back-projected pupil circle is tangent to the 3D eyeball model; the pupil circle projected on the surface of the 3D eyeball model is regarded as a cross-section of the cone whose apex is the camera focal point, and the pupil ellipse boundary in the eye image is the intersection of that cone with the two-dimensional image plane; once the pupil circle is determined, the pupil circle information is obtained as:
PupilCircle = (p, m, r)
wherein: p is the three-dimensional pupil position; m is the normal vector of the pupil; r is the pupil circle radius;
and wherein determining the 3D eyeball model parameters according to the plurality of eye images, the pupil circle information corresponding to each eye image and the least squares method, and determining the three-dimensional sight vector, comprises:
obtaining the per-person 3D eyeball model parameters, including the center of the 3D eyeball model and an average eyeball radius, from a plurality of eye images of each subject;
determining the center of the 3D eyeball model as the intersection point of the normal vectors of the pupil circles constructed from the plurality of eye images; because measurement error makes it unlikely that all the normals intersect at exactly one point, the intersection is estimated by the least squares method;
after the center of the 3D eyeball model is determined, computing a radius for each eye image from its pupil circle and the eyeball center, and averaging the radii over all eye images to obtain the final 3D eyeball radius R;
R = mean(R_i), with R_i = ||p_i − c||
wherein: R_i is the radius obtained from the i-th eye image as the distance between its pupil position and the eyeball center; c is the center of the 3D eyeball model; p_i is the pupil position of the i-th eye image;
and obtaining the pupil center projected onto the 3D eyeball model from the pupil center detected in each eye image and the 3D eyeball model parameters, and determining the line connecting the center of the 3D eyeball model with the back-projected pupil center of each eye image as the three-dimensional sight vector.
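The least-squares estimation of the eyeball center and radius described in this claim can be sketched as follows. The pupil circles (p, m, r) are assumed to come from the unprojection step above, and the closed-form linear solve is one standard way to intersect the normal lines, not necessarily the patent's exact procedure.

```python
import numpy as np

def fit_eyeball(pupil_circles):
    """pupil_circles: list of (p, m, r) with p the 3D pupil position and m the unit normal."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for p, m, _ in pupil_circles:
        m = m / np.linalg.norm(m)
        P = np.eye(3) - np.outer(m, m)        # projector orthogonal to this normal line
        A += P
        b += P @ p
    center = np.linalg.solve(A, b)             # least-squares intersection of all normals
    radius = float(np.mean([np.linalg.norm(p - center) for p, _, _ in pupil_circles]))
    return center, radius                      # c and R = mean(||p_i - c||)

def sight_vector(center, pupil_point):
    """Three-dimensional sight vector: from the eyeball center through the back-projected pupil center."""
    v = pupil_point - center
    return v / np.linalg.norm(v)
```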
5. The gaze-estimation-based aided understanding method of claim 1, wherein mapping the three-dimensional sight vectors of the left eye and the right eye of the user onto the virtual screen as the gaze point of the user comprises:
using the fitted 3D eyeball model and the three-dimensional sight vector, when the user gazes at a known point S on the screen, the conversion relation between the true sight vector and the estimated sight vector can be determined through a simple calibration:
v_true = M · v_pred
wherein: M is a rotation matrix; v_pred is the predicted three-dimensional sight vector; v_true is the true three-dimensional sight vector;
computing the coordinates of the intersection point between the two-dimensional screen plane ax + by + cz + d = 0 and the line emitted from the calculated 3D eyeball center O = (a, b, c) along the estimated three-dimensional sight vector;
the parametric equations of this line are:
x=mt+a
y=nt+b
z=pt+c
substituting the parametric equations into the plane equation and solving for t;
substituting t back into the parametric equations of the line yields the intersection point coordinates P(x, y, z) of the estimated sight vector with the two-dimensional screen;
and because the difference between the left eye and the right eye causes a deviation between their estimated two-dimensional plane coordinates, obtaining the final estimated two-dimensional gaze point coordinates by combining the left-eye and right-eye estimates;
wherein the rotation matrix M may be solved, through the conversion relation above, from known points S_i (i = 1, 2, 3) on the screen and the gaze points P_i (i = 1, 2, 3) predicted on the screen from the three-dimensional sight vectors.
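The mapping of this claim can be illustrated as below: a calibration matrix M is fitted from three calibration pairs, the corrected sight vector is intersected with the screen plane, and the left-eye and right-eye estimates are combined. Treating the combination as a simple average is an assumption, since the claim does not spell out that formula, and the linear solve for M is a sketch rather than the patent's exact calibration procedure.

```python
import numpy as np

def solve_conversion(true_vecs, pred_vecs):
    """Fit M in v_true = M @ v_pred from three (true, predicted) sight-vector pairs."""
    P = np.column_stack(pred_vecs)           # 3x3, predicted sight vectors as columns
    T = np.column_stack(true_vecs)           # 3x3, true sight vectors as columns
    return T @ np.linalg.inv(P)

def intersect_screen(eye_center, direction, plane):
    """Ray X(t) = eye_center + t*direction meets the screen plane a*x + b*y + c*z + d = 0."""
    n, d = np.asarray(plane[:3], dtype=float), float(plane[3])
    t = -(n @ eye_center + d) / (n @ direction)   # substitute the ray into the plane, solve for t
    return eye_center + t * direction              # intersection point P(x, y, z)

def binocular_gaze_point(left, right):
    """left/right: (eye_center, corrected_sight_vector, plane). Averaging both eyes is an assumption."""
    p_l = intersect_screen(*left)
    p_r = intersect_screen(*right)
    return 0.5 * (p_l + p_r)
```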
6. The gaze-estimation-based aided understanding method of claim 1, further comprising, before determining the category information corresponding to the target gazed at by the user according to the scene picture gazed at by the user and the pre-trained DETR algorithm:
detecting each target in the scene pictures of the public data set, and marking the target frame selection area (bounding box) of each target in the scene pictures;
and feeding each scene picture into the DETR algorithm for training; after the feature extraction, decoding and prediction stages, each region of the scene picture is classified either as a target category or as no category, and each target frame selection area that has a target category is matched with the category information describing that target.
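For illustration, the sketch below runs the publicly released DETR model (facebookresearch/detr via torch.hub, pretrained on COCO) as a stand-in for the patent's own training on a public data set; the confidence threshold and post-processing follow the conventions of the DETR demo and are not taken from the patent.

```python
import torch
import torchvision.transforms as T

# Pretrained COCO DETR as a stand-in detector (assumes network access to torch.hub).
model = torch.hub.load("facebookresearch/detr", "detr_resnet50", pretrained=True).eval()
preprocess = T.Compose([T.Resize(800), T.ToTensor(),
                        T.Normalize([0.485, 0.456, 0.406], [0.229, 0.224, 0.225])])

@torch.no_grad()
def detect_objects(pil_image, score_threshold=0.7):
    """Returns normalized (cx, cy, w, h) boxes and class indices for confident detections."""
    x = preprocess(pil_image).unsqueeze(0)
    out = model(x)
    probs = out["pred_logits"].softmax(-1)[0, :, :-1]   # drop the "no object" class
    keep = probs.max(-1).values > score_threshold
    boxes = out["pred_boxes"][0, keep]                   # normalized box centers and sizes
    labels = probs[keep].argmax(-1)                      # category indices
    return boxes, labels
```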
7. The gaze-estimation-based aided understanding method of claim 6, wherein determining the category information corresponding to the target gazed at by the user according to the scene picture gazed at by the user and the pre-trained DETR algorithm comprises:
taking a scene picture watched by the user as input of a DETR algorithm;
determining category information matched with a target frame selection area according to the target frame selection area in which the gaze point of the user falls;
and wherein controlling the AR/VR/MR device to perform the operation of displaying the category information corresponding to the target comprises:
controlling the AR/VR/MR device to display the category information corresponding to the target at the center of the target frame selection area in which the gaze point falls.
8. The gaze-estimation-based aided understanding method of claim 6, wherein the preset conditions are that the gaze point falls within any one of the preset target frame selection areas, the gaze time is greater than a first threshold, and the fluctuation of the gaze point during that time does not exceed a preset gaze point fluctuation range.
9. The gaze-estimation-based aided understanding method of claim 1, further comprising, after controlling the AR/VR/MR device to perform the operation of displaying the category information corresponding to the target:
acquiring the eye closing time of a user;
and if the eye closing time is greater than a second threshold, determining that the user has closed the eyes voluntarily, and controlling the AR/VR/MR device to perform the operation of stopping the display of the category information corresponding to the target.
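A minimal sketch of this eye-closure dismissal is given below, assuming hypothetical helpers read_eye, pupil_visible and hide_text_box; the threshold value is illustrative rather than the patent's.

```python
import time

CLOSE_THRESHOLD_S = 0.8   # "second threshold": duration treated as voluntary eye closure

def monitor_eye_closure(read_eye, pupil_visible, hide_text_box):
    """pupil_visible and hide_text_box are hypothetical helpers; returns once the display is dismissed."""
    closed_since = None
    while True:
        if pupil_visible(read_eye()):
            closed_since = None                       # eyes open again: reset the timer
        elif closed_since is None:
            closed_since = time.time()                # eyes just closed: start timing
        elif time.time() - closed_since > CLOSE_THRESHOLD_S:
            hide_text_box()                           # stop displaying the category information
            return
```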
10. An assisted understanding system based on gaze estimation, comprising:
the eye image acquisition module is used for acquiring eye images of a user wearing the AR/VR/MR device through a near-eye infrared camera of the AR/VR/MR device;
the eye image processing module is used for processing the eye images and determining the three-dimensional sight vectors;
the gaze point mapping module is used for mapping the three-dimensional sight line vectors of the left eye and the right eye of the user onto the virtual screen to serve as a gaze point of the user;
and the eye movement interaction module is used for acquiring the gaze time of the gaze point, determining, under the condition that the gaze point and the gaze time meet preset conditions, the category information corresponding to the target gazed at by the user according to the scene picture gazed at by the user and the pre-trained DETR (Transformer-based end-to-end object detection) algorithm, and controlling the AR/VR/MR device to perform the operation of displaying the category information corresponding to the target, wherein the category information comprises the shape, the color and the mutual positional relationship between objects.
CN202310667676.5A 2023-06-07 2023-06-07 Auxiliary understanding method and system based on sight estimation Pending CN116883436A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310667676.5A CN116883436A (en) 2023-06-07 2023-06-07 Auxiliary understanding method and system based on sight estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310667676.5A CN116883436A (en) 2023-06-07 2023-06-07 Auxiliary understanding method and system based on sight estimation

Publications (1)

Publication Number Publication Date
CN116883436A true CN116883436A (en) 2023-10-13

Family

ID=88255775

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310667676.5A Pending CN116883436A (en) 2023-06-07 2023-06-07 Auxiliary understanding method and system based on sight estimation

Country Status (1)

Country Link
CN (1) CN116883436A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117909839A (en) * 2024-03-19 2024-04-19 苏州元脑智能科技有限公司 Gaze behavior recognition method, model training method, device, equipment and medium
CN117909839B (en) * 2024-03-19 2024-07-02 苏州元脑智能科技有限公司 Gaze behavior recognition method, model training method, device, equipment and medium


Similar Documents

Publication Publication Date Title
US11320904B2 (en) Interactive motion-based eye tracking calibration
CN108427503B (en) Human eye tracking method and human eye tracking device
Alberto Funes Mora et al. Geometric generative gaze estimation (g3e) for remote rgb-d cameras
Ji et al. Real-time eye, gaze, and face pose tracking for monitoring driver vigilance
Ji et al. Real time visual cues extraction for monitoring driver vigilance
CN105094300B (en) A kind of sight line tracking system and method based on standardization eye image
JP2019519859A (en) System and method for performing gaze tracking
CN113808160B (en) Sight direction tracking method and device
CN104809424B (en) Method for realizing sight tracking based on iris characteristics
JPH11175246A (en) Sight line detector and method therefor
JP7168953B2 (en) Gaze measurement device for automatic calibration, Gaze measurement method and Gaze measurement program
EP3893090B1 (en) Method for eye gaze tracking
WO2015192879A1 (en) A gaze estimation method and apparatus
JP2011090702A (en) Sight line direction estimating device, sight line direction estimating method, and program for executing the sight line direction estimating method by computer
Jafari et al. Gaze estimation using Kinect/PTZ camera
WO2010142455A2 (en) Method for determining the position of an object in an image, for determining an attitude of a persons face and method for controlling an input device based on the detection of attitude or eye gaze
US20240119594A1 (en) Determining Digital Markers Indicative of a Neurological Condition Using Eye Movement Parameters
Kaminski et al. Single image face orientation and gaze detection
CN111279353A (en) Detection of eye pose
EP2261772A1 (en) Method for controlling an input device based on the detection of attitude or eye gaze
Khan et al. A new 3D eyeball tracking system to enhance the usability of page scrolling
EP2261857A1 (en) Method for determining the position of an object in an image, for determining an attitude of a persons face and method for controlling an input device based on the detection of attitude or eye gaze
CN116883436A (en) Auxiliary understanding method and system based on sight estimation
CN114740966A (en) Multi-modal image display control method and system and computer equipment
US12026309B2 (en) Interactive motion-based eye tracking calibration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination