CN113947804A - Target fixation identification method and system based on sight line estimation - Google Patents


Info

Publication number
CN113947804A
CN113947804A (application number CN202111047180.5A)
Authority
CN
China
Prior art keywords
network
original image
features
gaze
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111047180.5A
Other languages
Chinese (zh)
Inventor
孙晓
高升
汪萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Original Assignee
Hefei University of Technology
Institute of Artificial Intelligence of Hefei Comprehensive National Science Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology, Institute of Artificial Intelligence of Hefei Comprehensive National Science Center filed Critical Hefei University of Technology
Priority to CN202111047180.5A
Publication of CN113947804A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/08: Learning methods

Abstract

The invention provides a target gaze identification method and system based on sight line estimation, and relates to the technical field of target gaze prediction. The method extracts the facial features and head position features of an original image and of its horizontally flipped copy, splices the facial features and head position features to obtain a two-dimensional feature, derives a gaze-region feature map from that two-dimensional feature, and finally obtains a gaze thermodynamic diagram from the gaze-region feature map and the original image using a feature pyramid network built on a BoTNet backbone, thereby identifying and detecting the target a person in the image is gazing at. The invention is not limited by the application scene, imposes few hardware constraints, is simple to operate in practice, and produces accurate target gaze identification and detection results.

Description

Target fixation identification method and system based on sight line estimation
Technical Field
The invention relates to target gaze prediction technology, and in particular to a target gaze identification method and system based on sight line estimation.
Background
With the rapid development of computer vision, artificial intelligence and digitization technologies, eye tracking has become a hot research field and is widely applied in human-computer interaction. In real life, there is a need to determine from a third-person viewpoint which target a person in an image is gazing at, so that target gaze identification reveals what people in a scene are attending to and achieves the purpose of detecting the gaze target of a person in an image.
Currently, target gaze prediction is mainly achieved by two classes of techniques: face-based and sight-line-based. The face-based approach predicts by extracting eye features and facial features: generally, eye images and face coordinates are extracted from a photo, a model is built, features are extracted, and target gaze prediction is then performed directly. The sight-line-based approach infers the person's gaze direction from an eye picture or a face picture: a model is built, facial or eye features are extracted from the picture and spliced, and target gaze prediction is then performed.
However, the face-based target fixation technique requires the image to contain a complete face and needs eye position information, otherwise the prediction is inaccurate, so its application scenes are limited. The sight-line-based target fixation technique needs additional modules for eye detection and head pose estimation, is complex to operate in practice, imposes many constraints, and its predictions are likewise inaccurate. There is therefore a need for a new target fixation method that overcomes at least the above problems of the prior art.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides a target gaze identification method and system based on sight line estimation, solving the prior-art problem that an accurate target gaze identification result cannot be obtained when information is insufficient and hardware equipment is lacking.
(II) technical scheme
In order to achieve the purpose, the invention is realized by the following technical scheme:
in a first aspect, the present invention first provides a target gaze identification method based on gaze estimation, where the method includes:
extracting facial features and head position features from the original image, the facial features comprising the facial features of the original image and the flipped facial features of its horizontally flipped copy;
splicing the facial features and the head position features to obtain a two-dimensional feature;
acquiring a gaze-region feature map based on the two-dimensional feature;
acquiring a gaze thermodynamic diagram from the gaze-region feature map and the original image by using a feature pyramid network based on a BoTNet network; the BoTNet network is a ResNet network in which the 3x3 convolutions are replaced by multi-head Self-Attention.
Preferably, the extracting facial features in the original image includes:
and extracting facial features in the original image by using a ResNet model.
Preferably, the extracting facial features in the original image by using the ResNet model includes:
S11, adding and subtracting 0.15 to/from the eye coordinates in the original image to obtain the face coordinates, and then cropping according to the face coordinates to obtain a face image;
s12, extracting facial features in the original image based on the facial image by using a ResNet model;
and S13, horizontally flipping the original image to obtain a flipped image of the original image, and extracting the flipped facial features of the flipped image through steps S11 and S12.
Preferably, acquiring the gaze thermodynamic diagram from the gaze-region feature map and the original image by using the feature pyramid network based on the BoTNet network, where the BoTNet network is a ResNet network in which the 3x3 convolutions are replaced by multi-head Self-Attention, includes:
replacing the 3x3 convolutions in a ResNet network with multi-head Self-Attention to obtain the BoTNet network, and jointly sending the gaze-region feature map and the original image into the Heatmap pathway of a feature pyramid network taking the BoTNet network as its backbone, to generate the gaze thermodynamic diagram.
Preferably, the method further comprises:
the BoTNet network-based feature pyramid network is trained prior to acquiring the gaze thermodynamic diagram.
In a second aspect, the present invention further provides a gaze recognition system for an object based on gaze estimation, the system comprising:
the feature extraction module is used for extracting facial features and head position features from the original image, the facial features comprising the facial features of the original image and the flipped facial features of its horizontally flipped copy;
the feature splicing module is used for splicing the facial features and the head position features to obtain a two-dimensional feature;
the gaze-region feature map acquisition module is used for acquiring a gaze-region feature map based on the two-dimensional feature;
the gaze thermodynamic diagram acquisition module is used for acquiring a gaze thermodynamic diagram from the gaze-region feature map and the original image by using a feature pyramid network based on a BoTNet network; the BoTNet network is a ResNet network in which the 3x3 convolutions are replaced by multi-head Self-Attention.
Preferably, the extracting facial features in the original image by the feature extraction module includes:
and extracting facial features in the original image by using a ResNet model.
Preferably, the extracting facial features in the original image by the feature extraction module using the ResNet model includes:
S11, adding and subtracting 0.15 to/from the eye coordinates in the original image to obtain the face coordinates, and then cropping according to the face coordinates to obtain a face image;
S12, extracting the facial features of the original image from the face image by using a ResNet model;
and S13, horizontally flipping the original image to obtain a flipped image of the original image, and extracting the flipped facial features of the flipped image through steps S11 and S12.
Preferably, the gaze thermodynamic diagram acquisition module acquiring the gaze thermodynamic diagram from the gaze-region feature map and the original image by using the feature pyramid network based on the BoTNet network, where the BoTNet network is a ResNet network in which the 3x3 convolutions are replaced by multi-head Self-Attention, includes:
replacing the 3x3 convolutions in a ResNet network with multi-head Self-Attention to obtain the BoTNet network, and jointly sending the gaze-region feature map and the original image into the Heatmap pathway of a feature pyramid network taking the BoTNet network as its backbone, to generate the gaze thermodynamic diagram.
Preferably, the system further comprises:
and the model training module is used for training the BoTNet network-based feature pyramid network before the gazing thermodynamic diagram is acquired.
(III) advantageous effects
The invention provides a target fixation identification method and a target fixation identification system based on sight line estimation. Compared with the prior art, the method has the following beneficial effects:
1. The method first extracts the facial features and head position features of the original image and of its horizontally flipped copy, splices the facial features and head position features to obtain a two-dimensional feature, then obtains a gaze-region feature map from that two-dimensional feature, and finally obtains a gaze thermodynamic diagram from the gaze-region feature map and the original image using a BoTNet-based feature pyramid network, thereby identifying and detecting the target a person in the image is gazing at. The invention can accurately identify and detect a person's gaze target in an image even when image information, eye position information and the like are insufficient, and needs no extra eye or head-pose detection modules. It is not limited by the application scene, imposes few hardware constraints, is simple to operate in practice, and yields accurate target gaze identification and detection results.
2. The method obtains a BoTNet network by replacing the 3x3 convolutions in a ResNet network with multi-head Self-Attention, uses a feature pyramid network with the BoTNet network as its backbone as the Heatmap pathway, jointly sends the obtained gaze-region feature map and the original image into that pathway, and finally generates the gaze thermodynamic diagram. Exploiting the similarity between the attention mechanism and human visual attention, the method locates the target region that deserves focused attention and obtains the attention focus, gathering more detail about the target of interest while suppressing other useless information; it thus extracts features effectively and makes the sight-line-estimation-based target gaze identification result more accurate.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a target gaze identification method based on gaze estimation according to an embodiment of the present invention;
FIG. 2 is a block diagram of a ResNet network in an embodiment of the present invention;
FIG. 3 is a schematic view of a gaze area in an embodiment of the present invention;
fig. 4 is a view of the gaze area characteristic when γ takes different values in an embodiment of the present invention;
FIG. 5 is a comparison of the BoTNet network and the ResNet network in an embodiment of the present invention;
FIG. 6 is a schematic diagram of a feature pyramid network in an embodiment of the invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention are clearly and completely described, and it is obvious that the described embodiments are a part of the embodiments of the present invention, but not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The embodiment of the application provides a target gaze identification method and system based on sight line estimation, solves the prior-art problem that the target gaze identification result cannot be accurately obtained when information is insufficient and hardware equipment is lacking, and achieves flexible, convenient and accurate target gaze identification and detection.
In order to solve the technical problems, the general idea of the embodiment of the application is as follows:
in order to solve the prior-art problem that the gaze target of a person in an image cannot be accurately identified and detected when image information, eye position information and the like are insufficient or eye and head-pose detection modules are lacking, the facial features and head position features of the original image and of its horizontally flipped copy are first extracted; the facial features and head position features are spliced to obtain a two-dimensional feature; a gaze-region feature map is then obtained from that two-dimensional feature; and finally a gaze thermodynamic diagram is obtained from the gaze-region feature map and the original image using a BoTNet-based feature pyramid network. The invention is not limited by the application scene, imposes few hardware constraints, is simple to operate in practice, and yields accurate target gaze identification and detection results.
In order to better understand the technical solution, the technical solution will be described in detail with reference to the drawings and the specific embodiments.
Example 1:
in a first aspect, the present invention first proposes a target gaze identification method based on gaze estimation, and referring to fig. 1, the method includes:
s1, extracting facial features and head position features from the original image; the facial features comprise the facial features of the original image and the flipped facial features of its horizontally flipped copy;
s2, splicing the facial features and the head position features to obtain a two-dimensional feature;
s3, acquiring a gaze-region feature map based on the two-dimensional feature;
s4, acquiring a gaze thermodynamic diagram from the gaze-region feature map and the original image by using a feature pyramid network based on a BoTNet network; the BoTNet network is a ResNet network in which the 3x3 convolutions are replaced by multi-head Self-Attention.
Therefore, the method first extracts the facial features and head position features of the original image and of its horizontally flipped copy, splices them to obtain a two-dimensional feature, then obtains a gaze-region feature map from that feature, and finally obtains a gaze thermodynamic diagram from the gaze-region feature map and the original image using a BoTNet-based feature pyramid network, thereby identifying and detecting the target a person in the image is gazing at. The invention can accurately identify and detect the gaze target even when image information, eye position information and the like are insufficient, needs no extra eye or head-pose detection modules, is not limited by the application scene, imposes few hardware constraints, is simple to operate in practice, and yields accurate target gaze identification and detection results.
The following describes the implementation of one embodiment of the present invention in detail with reference to the explanation of specific steps S1-S4.
S1, extracting facial features and head position features from the original image; the facial features comprise the facial features of the original image and the flipped facial features of its horizontally flipped copy.
The facial features of the original image and the flipped facial features of its flipped copy are extracted with a ResNet model.
The original image (i.e., the scene content) and the eye coordinates in it are obtained from the annotations of an existing dataset, a face image is obtained from the eye coordinates, and the facial features in the original image are finally extracted. Specifically:
this embodiment uses the GazeFollowing dataset as the input source; for each person in an image it annotates the coordinates of the point being looked at (i.e., the real gaze point coordinates) and the eye coordinates. An original image is taken from the dataset, 0.15 is added to and subtracted from the eye coordinates to obtain the face coordinates, and the face is cropped according to those coordinates to obtain a face image. When the original image is acquired it is also horizontally flipped, and the same operations as above yield the flipped face image.
The resolutions of the resulting face image and flipped face image are both adjusted to 224x224; the pixel values of these images are read and converted into the tensor form required by the ResNet model, rescaled from the range 0-255 to 0-1, and then each of the three RGB channels is normalized (i.e., to mean 0 and standard deviation 1) to speed up the convergence of the model.
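A minimal sketch of this preprocessing, assuming the PyTorch/torchvision stack named later in this embodiment; the file names and the ImageNet normalization statistics are illustrative assumptions (the patent states only that each RGB channel is normalized):

```python
from PIL import Image
from torchvision import transforms

# Resize to 224x224, convert 0-255 pixels to float tensors in [0, 1],
# then normalize each RGB channel. The mean/std values below are the
# common ImageNet statistics, an assumption; the patent only says the
# channels are normalized to mean 0 and standard deviation 1.
preprocess = transforms.Compose([
    transforms.Resize((224, 224)),
    transforms.ToTensor(),                       # HxWxC uint8 -> CxHxW float in [0, 1]
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

face = preprocess(Image.open("face.jpg").convert("RGB"))                  # 3x224x224
flipped_face = preprocess(Image.open("face_flipped.jpg").convert("RGB"))  # 3x224x224
```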
In the present embodiment, the model used to extract facial features is a ResNet model. The face image is input into the ResNet model, which outputs a 2048-dimensional feature; a fully connected layer then converts it to 512 dimensions. Synchronously, the same operations are performed on the flipped face image, giving two 512-dimensional features, which are spliced into a 1024-dimensional feature.
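A sketch of this extraction step, assuming PyTorch; using resnet50 (whose pooled output is 2048-dimensional, matching the text) is an assumption, since the patent says only "ResNet model":

```python
import torch
import torch.nn as nn
from torchvision import models

class FaceFeatureExtractor(nn.Module):
    """ResNet backbone followed by a 2048 -> 512 fully connected layer."""
    def __init__(self):
        super().__init__()
        resnet = models.resnet50(weights=None)   # depth is an assumption
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop classifier head
        self.fc = nn.Linear(2048, 512)           # 2048-d feature -> 512-d

    def forward(self, x):                        # x: B x 3 x 224 x 224
        feat = self.backbone(x).flatten(1)       # B x 2048
        return self.fc(feat)                     # B x 512

extractor = FaceFeatureExtractor()
face, flipped = torch.randn(1, 3, 224, 224), torch.randn(1, 3, 224, 224)
face_feat = torch.cat([extractor(face), extractor(flipped)], dim=1)  # B x 1024
```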
In this embodiment, the ResNet network is designed as H(x) = F(x) + x, as shown in fig. 2. This can be converted into learning a residual function F(x) = H(x) - x; if F(x) = 0, it becomes an identity mapping H(x) = x, and fitting the residual is easier. ResNet thus provides two mappings, the identity mapping and the residual mapping: if the network has reached its optimal state, deepening it further pushes the residual mapping toward 0, leaving only the identity mapping, so in theory the network always stays optimal and its performance does not degrade as depth increases.
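For illustration, a minimal residual block in this form, assuming PyTorch; the layer composition of F(x) shown here is a generic choice, not the patent's exact block:

```python
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Basic residual block: the output is H(x) = F(x) + x, so the
    stacked layers only have to learn the residual F(x) = H(x) - x."""
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(                       # the residual mapping F(x)
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)               # identity shortcut + residual
```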
And extracting head position features in the original image.
The head position is sent into three fully connected layers to extract the head position feature. Specifically, the head position is represented by the eye position coordinates in the dataset. In the GazeFollowing dataset, the eye coordinates in the image are already annotated; assuming the annotated eye coordinates are H(hx, hy), the head position coordinate is also the two-dimensional H(hx, hy), and three fully connected layers map it into a 256-dimensional feature space. Although a high dimensionality can help learn more features, it also increases the amount of computation, so a suitable dimensionality should be chosen according to the actual situation. After comprehensive consideration, this embodiment selects a 256-dimensional feature space; that is, a 256-dimensional head position feature is extracted after the head position passes through the three fully connected layers.
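A minimal sketch of this mapping, assuming PyTorch; the hidden widths (64 and 128) are assumptions, as the patent fixes only the 2-dimensional input and the 256-dimensional output:

```python
import torch
import torch.nn as nn

# Three fully connected layers mapping the 2-d head position H(hx, hy)
# into a 256-d feature space, as described above.
head_mlp = nn.Sequential(
    nn.Linear(2, 64), nn.ReLU(inplace=True),    # hidden widths are assumptions
    nn.Linear(64, 128), nn.ReLU(inplace=True),
    nn.Linear(128, 256),
)

head_pos = torch.tensor([[0.42, 0.31]])          # illustrative normalized eye coordinates
head_feat = head_mlp(head_pos)                   # B x 256
```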
And S2, obtaining two-dimensional features by feature splicing the face features and the head position features.
The obtained 256-dimensional head position feature is spliced with the two 512-dimensional facial features and sent into a fully connected layer to obtain a 256-dimensional feature. That 256-dimensional feature is passed through a ReLU activation function and a further fully connected layer, finally yielding a 2-dimensional feature. This 2-dimensional feature is the predicted gaze direction.
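A sketch of this fusion step, assuming PyTorch and the dimensions stated above (256 + 2x512 = 1280 concatenated input):

```python
import torch
import torch.nn as nn

class GazeDirectionHead(nn.Module):
    """Fuses the 1024-d face feature and 256-d head-position feature into
    the 2-d predicted gaze direction: concat -> FC -> ReLU -> FC."""
    def __init__(self):
        super().__init__()
        self.fc1 = nn.Linear(1024 + 256, 256)
        self.relu = nn.ReLU(inplace=True)
        self.fc2 = nn.Linear(256, 2)

    def forward(self, face_feat, head_feat):          # B x 1024, B x 256
        x = torch.cat([face_feat, head_feat], dim=1)  # B x 1280
        return self.fc2(self.relu(self.fc1(x)))       # B x 2 predicted gaze direction

head = GazeDirectionHead()
gaze_dir = head(torch.randn(1, 1024), torch.randn(1, 256))
```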
And S3, acquiring a characteristic diagram of the gaze area based on the two-dimensional characteristics.
If the predicted gaze direction of the target person is correct, the gaze point generally lies along that direction. The target person's field of view is usually reduced to a cone with the head position H(hx, hy) as its apex. FIG. 3 is a schematic view of the gaze region; referring to FIG. 3, given a point P(px, py), the probability that P is the gaze point should be related to the angle θ between the line L_HP and the predicted gaze direction: the smaller θ is, the greater the probability. In the present embodiment, a cosine function describes the mapping from angle to probability value, and the probability distribution over candidate gaze points P is called the gaze region. The gaze region is thus effectively a probability map: the intensity value of each point on this map represents the probability that the point is the gaze point, so the larger the brightness value, the larger the probability, and the smaller the brightness value, the smaller the probability. Specifically,
the direction of the line L_HP is computed as G = (px - hx, py - hy). Denoting the predicted gaze direction by d̂ (the 2-dimensional feature obtained in S2), the probability that point P is the gaze point is computed as the cosine of θ:

Sim(P) = (G · d̂) / (|G| · |d̂|)
If the predicted gaze direction is correct, the probability profile along the gaze direction is expected to be sharp; if it is wrong, the profile should be flat along that direction. The sharpness of the gaze region is therefore controlled by:

Sim(P, γ) = [Sim(P)]^γ

where γ is a parameter controlling the aperture of the cone: the larger γ is, the smaller the aperture. Empirically, γ can be set to 5, 3 and 1 respectively; see fig. 4, which shows the three gaze-region feature maps for γ1 = 5, γ2 = 3 and γ3 = 1.
S4, acquiring a gaze thermodynamic diagram from the gaze-region feature map and the original image by using a feature pyramid network based on a BoTNet network; the BoTNet network is a ResNet network in which the 3x3 convolutions are replaced by multi-head Self-Attention.
The three gaze-region feature maps and the original image are combined and sent into the Heatmap pathway together, finally generating the gaze thermodynamic diagram. Each gaze-region feature map is 224x224 with 1 channel, and the original image is 224x224 with 3 channels; splicing the three gaze-region feature maps with the original image gives a 224x224 map with 6 channels, which is then used as the input of the Heatmap pathway.
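A brief sketch of S3 and this input construction, assuming PyTorch, coordinates normalized to [0, 1], and clamping of negative cosines to zero (the patent does not state how negative similarities are handled):

```python
import torch

def gaze_cone_maps(head_pos, gaze_dir, gammas=(5, 3, 1), size=224):
    """One gaze-region probability map per gamma: the cosine between
    G = P - H and the predicted gaze direction, sharpened as
    Sim(P, gamma) = Sim(P) ** gamma."""
    ys, xs = torch.meshgrid(torch.linspace(0, 1, size),
                            torch.linspace(0, 1, size), indexing="ij")
    gx, gy = xs - head_pos[0], ys - head_pos[1]           # G = (px - hx, py - hy)
    norm = torch.sqrt(gx**2 + gy**2).clamp(min=1e-6)
    d = gaze_dir / gaze_dir.norm().clamp(min=1e-6)        # unit predicted direction
    cos = ((gx * d[0] + gy * d[1]) / norm).clamp(min=0)   # Sim(P), negatives -> 0
    return torch.stack([cos ** g for g in gammas])        # 3 x 224 x 224

cone = gaze_cone_maps(torch.tensor([0.42, 0.31]), torch.tensor([0.5, 0.2]))
image = torch.randn(3, 224, 224)                          # preprocessed original image
heatmap_input = torch.cat([cone, image], dim=0)           # 6 x 224 x 224 pathway input
```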
In this embodiment, the Heatmap pathway performs target detection with a feature pyramid network that uses a BoTNet network as its backbone. The last layer of the Heatmap pathway is followed by a Sigmoid function, which computes the probability value of every point in the thermodynamic diagram and finally yields the visual attention map: a 56x56 thermodynamic diagram whose values represent the probability of each point being the gaze point.
Referring to fig. 5, BoTNet only replaces the 3x3 convolution in ResNet with Multi-Head Self-Attention (MHSA) and makes no other changes. Since the attention mechanism resembles human visual attention, Multi-Head Self-Attention can locate the target region that needs focused attention and obtain the attention focus, gathering more detailed information about the target of interest while suppressing other useless information; replacing the 3x3 convolution with Multi-Head Self-Attention therefore extracts features more effectively.
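A minimal sketch of this substitution, assuming PyTorch; it treats each spatial position of the feature map as a token and omits the relative position encodings that the full BoTNet block uses:

```python
import torch
import torch.nn as nn

class MHSA2d(nn.Module):
    """Multi-Head Self-Attention over a 2-d feature map, sketching the
    BoTNet replacement for the 3x3 convolution in a ResNet bottleneck."""
    def __init__(self, dim, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                     # x: B x C x H x W
        b, c, h, w = x.shape
        seq = x.flatten(2).transpose(1, 2)    # B x HW x C, one token per position
        out, _ = self.attn(seq, seq, seq)     # self-attention across all positions
        return out.transpose(1, 2).reshape(b, c, h, w)

x = torch.randn(1, 512, 14, 14)
print(MHSA2d(512)(x).shape)                   # torch.Size([1, 512, 14, 14])
```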
The feature pyramid network can combine shallow features and deep features to make predictions on feature maps with different resolutions respectively, and is a classic method in the field of computer vision. Referring to fig. 6, the construction of the feature pyramid involves a bottom-up path, a top-down path, and a lateral connection.
The bottom-up pathway is the feed-forward computation of the backbone network, which computes a feature hierarchy containing feature maps at several scales with a scaling step of 2. Many layers typically produce feature maps of the same size; these layers are said to belong to the same network stage. When constructing the feature pyramid, one pyramid level is defined per stage, and the output of the last layer of each stage is selected as the reference feature map set.
The top-down pathway hallucinates higher-resolution features by upsampling feature maps from higher pyramid levels, which are spatially coarser but semantically stronger. These features are then enhanced with features from the bottom-up pathway through the lateral connections: each lateral connection merges feature maps of the same spatial size from the bottom-up and top-down pathways.
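A minimal sketch of these three components, assuming PyTorch; the channel widths and three-level layout are illustrative, not the patent's exact configuration:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MiniFPN(nn.Module):
    """Minimal feature pyramid: 1x1 lateral convolutions project the
    bottom-up maps to a common width, and the top-down pathway upsamples
    by 2 and adds each lateral map of the same spatial size."""
    def __init__(self, in_channels=(256, 512, 1024), out_channels=256):
        super().__init__()
        self.laterals = nn.ModuleList(nn.Conv2d(c, out_channels, 1)
                                      for c in in_channels)
        self.smooth = nn.ModuleList(nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                    for _ in in_channels)

    def forward(self, feats):                       # [c3, c4, c5], coarsest last
        laterals = [l(f) for l, f in zip(self.laterals, feats)]
        for i in range(len(laterals) - 2, -1, -1):  # top-down pathway
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], scale_factor=2, mode="nearest")
        return [s(p) for s, p in zip(self.smooth, laterals)]

fpn = MiniFPN()
c3, c4, c5 = (torch.randn(1, 256, 56, 56),
              torch.randn(1, 512, 28, 28),
              torch.randn(1, 1024, 14, 14))
p3, p4, p5 = fpn([c3, c4, c5])                      # all with 256 channels
```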
After the feature pyramid network is constructed, the three gaze-region maps and the original image are sent into it together to obtain the gaze thermodynamic diagram of the person in the image.
To ensure accuracy of target gaze recognition of the model, the method further comprises:
and S5, training the ResNet model and the BoTNet network-based feature pyramid network before obtaining the fixation thermodynamic diagram.
The ResNet model used above and the feature pyramid network with BoTNet as its backbone are trained using the PyTorch framework. During training, the face images, head positions and original images of the GazeFollowing dataset are the inputs of the model, and the gaze thermodynamic diagram is the output. To guide the training process in a good direction, two loss functions measure the quality of the model's predictions; the better the training effect, the smaller the loss of the whole model. The loss function of the whole model is the sum of two losses, the gaze direction loss and the thermodynamic diagram loss. Specifically, the gaze loss function is:
Lgaze = 1 - (d · d̂) / (|d| · |d̂|)

where d is the real gaze direction, obtained by subtracting the eye coordinates from the real gaze point coordinates in the dataset, and d̂ is the predicted gaze direction obtained in S2 above.
The thermodynamic diagram loss function is:
Lheatmap = (1/N) Σ (Hi - Ĥi)^2, summed over i = 1..N

where Hi is the ith point of the real thermodynamic diagram (generated with a Gaussian kernel from the annotated coordinates in the dataset), Ĥi is the ith point of the predicted thermodynamic diagram, and N is the size of the thermodynamic diagram, 56x56.
After the values of the two loss functions are obtained respectively, the values corresponding to the two loss functions are added to obtain the loss function of the whole model.
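A sketch of the combined objective, assuming PyTorch. The exact functional forms are not fully given in the translated text; a cosine-similarity direction loss (consistent with the cosine gaze-region construction) and a pixel-wise MSE over the 56x56 map are assumptions here:

```python
import torch
import torch.nn.functional as F

def total_loss(pred_dir, true_dir, pred_heatmap, true_heatmap):
    """Sum of the gaze-direction loss and the thermodynamic-diagram loss.
    Both functional forms are assumptions, as noted in the lead-in."""
    cos = F.cosine_similarity(pred_dir, true_dir, dim=1)
    gaze_loss = (1 - cos).mean()                  # smaller angle -> smaller loss
    heatmap_loss = ((pred_heatmap - true_heatmap) ** 2).mean()
    return gaze_loss + heatmap_loss

loss = total_loss(torch.randn(4, 2), torch.randn(4, 2),
                  torch.rand(4, 56, 56), torch.rand(4, 56, 56))
```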
In order to evaluate the prediction effect of the trained model, two evaluation indexes are used: AUC and the minimum angular error (MAng). Specifically, AUC estimates the difference between the predicted gaze point and the real gaze point: the larger its value, the better the model, and conversely the worse. MAng measures the minimum angle between the predicted gaze direction and the real gaze direction: the smaller its value, the better the model, and conversely the worse.
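A sketch of the MAng computation under the same assumptions; support for several annotated directions per image (taking the minimum over them) is a hypothetical convention for datasets with multiple annotators:

```python
import torch

def min_angular_error_deg(pred_dir, true_dirs):
    """Minimum angular error (MAng): the smallest angle, in degrees,
    between the predicted gaze direction and the annotated direction(s)."""
    pred = pred_dir / pred_dir.norm().clamp(min=1e-6)
    true = true_dirs / true_dirs.norm(dim=1, keepdim=True).clamp(min=1e-6)
    cos = (true @ pred).clamp(-1, 1)          # cosine with each annotation
    return torch.rad2deg(torch.acos(cos)).min().item()

err = min_angular_error_deg(torch.tensor([0.5, 0.2]),
                            torch.tensor([[0.6, 0.1], [0.4, 0.3]]))
```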
Thus, the whole process of the target fixation identification method based on sight line estimation is completed.
Example 2:
in a second aspect, the present invention also provides a gaze recognition system for an object based on gaze estimation, the system comprising:
the feature extraction module is used for extracting facial features and head position features from the original image, the facial features comprising the facial features of the original image and the flipped facial features of its horizontally flipped copy;
the feature splicing module is used for splicing the facial features and the head position features to obtain a two-dimensional feature;
the gaze-region feature map acquisition module is used for acquiring a gaze-region feature map based on the two-dimensional feature;
the gaze thermodynamic diagram acquisition module is used for acquiring a gaze thermodynamic diagram from the gaze-region feature map and the original image by using a feature pyramid network based on a BoTNet network; the BoTNet network is a ResNet network in which the 3x3 convolutions are replaced by multi-head Self-Attention.
Optionally, the extracting facial features in the original image by the feature extraction module includes:
and extracting facial features in the original image by using a ResNet model.
Optionally, the extracting facial features in the original image by the feature extraction module using the ResNet model includes:
S11, adding and subtracting 0.15 to/from the eye coordinates in the original image to obtain the face coordinates, and then cropping according to the face coordinates to obtain a face image;
S12, extracting the facial features of the original image from the face image by using a ResNet model;
and S13, horizontally flipping the original image to obtain a flipped image of the original image, and extracting the flipped facial features of the flipped image through steps S11 and S12.
Optionally, the gaze thermodynamic diagram acquisition module acquiring the gaze thermodynamic diagram from the gaze-region feature map and the original image by using the feature pyramid network based on the BoTNet network, where the BoTNet network is a ResNet network in which the 3x3 convolutions are replaced by multi-head Self-Attention, includes:
replacing the 3x3 convolutions in a ResNet network with multi-head Self-Attention to obtain the BoTNet network, and jointly sending the gaze-region feature map and the original image into the Heatmap pathway of a feature pyramid network taking the BoTNet network as its backbone, to generate the gaze thermodynamic diagram.
Optionally, the system further includes:
and the model training module is used for training the BoTNet network-based feature pyramid network before the gazing thermodynamic diagram is acquired.
It can be understood that, the target gaze identification system based on gaze estimation provided by the embodiment of the present invention corresponds to the above target gaze identification method based on gaze estimation, and the explanations, examples, and beneficial effects of the relevant contents thereof may refer to the corresponding contents in the target gaze identification method based on gaze estimation, and are not described herein again.
In summary, compared with the prior art, the method has the following beneficial effects:
1. The method first extracts the facial features and head position features of the original image and of its horizontally flipped copy, splices the facial features and head position features to obtain a two-dimensional feature, then obtains a gaze-region feature map from that two-dimensional feature, and finally obtains a gaze thermodynamic diagram from the gaze-region feature map and the original image using a BoTNet-based feature pyramid network, thereby identifying and detecting the target a person in the image is gazing at. The invention can accurately identify and detect a person's gaze target in an image even when image information, eye position information and the like are insufficient, and needs no extra eye or head-pose detection modules. It is not limited by the application scene, imposes few hardware constraints, is simple to operate in practice, and yields accurate target gaze identification and detection results.
2. The method obtains a BoTNet network by replacing the 3x3 convolutions in a ResNet network with multi-head Self-Attention, uses a feature pyramid network with the BoTNet network as its backbone as the Heatmap pathway, jointly sends the obtained gaze-region feature map and the original image into that pathway, and finally generates the gaze thermodynamic diagram. Exploiting the similarity between the attention mechanism and human visual attention, the method locates the target region that deserves focused attention and obtains the attention focus, gathering more detail about the target of interest while suppressing other useless information; it thus extracts features effectively and makes the sight-line-estimation-based target gaze identification result more accurate.
It is noted that, herein, relational terms such as first and second may be used solely to distinguish one entity or action from another without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A gaze-estimation-based target gaze identification method, the method comprising:
extracting facial features and head position features from the original image, the facial features comprising the facial features of the original image and the flipped facial features of its horizontally flipped copy;
splicing the facial features and the head position features to obtain a two-dimensional feature;
acquiring a gaze-region feature map based on the two-dimensional feature;
and acquiring a gaze thermodynamic diagram from the gaze-region feature map and the original image by using a feature pyramid network based on a BoTNet network, the BoTNet network being a ResNet network in which the 3x3 convolutions are replaced by multi-head Self-Attention.
2. The method of claim 1, wherein extracting facial features in the original image comprises:
and extracting facial features in the original image by using a ResNet model.
3. The method of claim 2, wherein extracting facial features in the original image using the ResNet model comprises:
S11, adding and subtracting 0.15 to/from the eye coordinates in the original image to obtain the face coordinates, and then cropping according to the face coordinates to obtain a face image;
S12, extracting the facial features of the original image from the face image by using a ResNet model;
and S13, horizontally flipping the original image to obtain a flipped image of the original image, and extracting the flipped facial features of the flipped image through steps S11 and S12.
4. The method of claim 1, wherein acquiring the gaze thermodynamic diagram from the gaze-region feature map and the original image by using the feature pyramid network based on the BoTNet network, the BoTNet network being a ResNet network in which the 3x3 convolutions are replaced by multi-head Self-Attention, comprises:
replacing the 3x3 convolutions in a ResNet network with multi-head Self-Attention to obtain the BoTNet network, and jointly sending the gaze-region feature map and the original image into the Heatmap pathway of a feature pyramid network taking the BoTNet network as its backbone, to generate the gaze thermodynamic diagram.
5. The method of claim 1, wherein the method further comprises:
the BoTNet network-based feature pyramid network is trained prior to acquiring the gaze thermodynamic diagram.
6. A gaze recognition system for objects based on gaze estimation, the system comprising:
the feature extraction module is used for extracting facial features and head position features from the original image, the facial features comprising the facial features of the original image and the flipped facial features of its horizontally flipped copy;
the feature splicing module is used for splicing the facial features and the head position features to obtain a two-dimensional feature;
the gaze-region feature map acquisition module is used for acquiring a gaze-region feature map based on the two-dimensional feature;
and the gaze thermodynamic diagram acquisition module is used for acquiring a gaze thermodynamic diagram from the gaze-region feature map and the original image by using a feature pyramid network based on a BoTNet network, the BoTNet network being a ResNet network in which the 3x3 convolutions are replaced by multi-head Self-Attention.
7. The system of claim 6, wherein the feature extraction module to extract facial features in the original image comprises:
and extracting facial features in the original image by using a ResNet model.
8. The system of claim 7, wherein the feature extraction module using the ResNet model to extract facial features in the original image comprises:
S11, adding and subtracting 0.15 to/from the eye coordinates in the original image to obtain the face coordinates, and then cropping according to the face coordinates to obtain a face image;
S12, extracting the facial features of the original image from the face image by using a ResNet model;
and S13, horizontally flipping the original image to obtain a flipped image of the original image, and extracting the flipped facial features of the flipped image through steps S11 and S12.
9. The system of claim 6, wherein the gaze thermodynamic diagram acquisition module acquiring the gaze thermodynamic diagram from the gaze-region feature map and the original image by using the feature pyramid network based on the BoTNet network, the BoTNet network being a ResNet network in which the 3x3 convolutions are replaced by multi-head Self-Attention, comprises:
replacing the 3x3 convolutions in a ResNet network with multi-head Self-Attention to obtain the BoTNet network, and jointly sending the gaze-region feature map and the original image into the Heatmap pathway of a feature pyramid network taking the BoTNet network as its backbone, to generate the gaze thermodynamic diagram.
10. The system of claim 6, wherein the system further comprises:
and the model training module is used for training the BoTNet network-based feature pyramid network before the gazing thermodynamic diagram is acquired.
CN202111047180.5A, filed 2021-09-08 (priority 2021-09-08): Target fixation identification method and system based on sight line estimation. Status: Pending. Publication: CN113947804A.

Priority Applications (1)

Application Number: CN202111047180.5A; Priority/Filing Date: 2021-09-08; Title: Target fixation identification method and system based on sight line estimation

Publications (1)

Publication Number: CN113947804A; Publication Date: 2022-01-18

Family ID: 79328161

Country Status (1): CN, CN113947804A (en)


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination