CN117746164A - Gaze target estimation method based on progressive view cone - Google Patents

Gaze target estimation method based on progressive view cone

Info

Publication number
CN117746164A
CN117746164A (application CN202410100320.8A / CN202410100320A)
Authority
CN
China
Prior art keywords
training
gaze
image
target
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410100320.8A
Other languages
Chinese (zh)
Inventor
郭丹
刘飞扬
李坤
汪萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202410100320.8A priority Critical patent/CN117746164A/en
Publication of CN117746164A publication Critical patent/CN117746164A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Abstract

The invention discloses a gaze target estimation method based on a progressive view cone, comprising the following steps: 1. estimate the line-of-sight direction from a head image of the target person; 2. construct a progressive relationship centered on the target person using the depth image; 3. generate a high-quality view cone image from the line-of-sight direction and the progressive relationship; 4. extract the saliency features of potential gaze targets by combining the view cone image, the scene RGB image and the scene depth map; 5. estimate a block-level gaze distribution and combine it with the saliency features to enrich the feature representation mapped to the gaze position heat map; 6. judge whether the gaze target is inside the image using the refined saliency features and generate the gaze heat map. The method can quickly and accurately locate the gaze target position of the target person in the picture, and the view cone generated from the progressive relationship effectively excludes background regions irrelevant to gaze, thereby improving the accuracy of gaze target estimation.

Description

Gaze target estimation method based on progressive view cone
Technical Field
The invention belongs to the field of image processing and computer vision, and mainly relates to a gaze target estimation method based on a progressive view cone.
Background
With the progress of society and the development of science and technology, people's modes of social interaction are constantly changing and upgrading. In public places, schools, workplaces and even home environments, a person's gaze behavior and choice of target often reflect their intent and emotion. The ability to estimate gaze targets is therefore a key factor in enabling computer systems to understand what people are doing in a scene and what they intend. Gaze target estimation, i.e., determining where a person is looking by analyzing and understanding the direction of their gaze and its focal position, has become an important research topic in the field of computer vision. The technology can not only improve the naturalness of human-computer interaction, but can also be used for early diagnosis and treatment of autism. For example, in social interaction, people's intent and emotion can be better understood by understanding and predicting their gaze targets; in human-computer interaction, a machine can provide a more natural, intuitive interaction experience by understanding a person's gaze target. Research on and application of gaze target estimation is therefore of great significance to social life.
With the development of modern image processing technology, gaze target estimation methods have also advanced greatly, but they still face the following problems:
First: existing methods lack an understanding of the spatial information of the scene, so the positional relationship between the target object and other objects in space cannot be truly reflected, and the gaze position of the target person cannot be accurately estimated.
For example, in 2018 Dongze Lian et al. published the paper "Believe It or Not, We Know What You Are Looking At!" at the Springer Asian Conference on Computer Vision (ACCV). The paper estimates the gaze target by combining planar view cone images at multiple scales; however, the planar view cone only covers the field of view and lacks perception of the geometric positional relationships of objects within it. The generated saliency heat map is therefore hard to concentrate on the key area, so the position of the gaze target cannot be accurately estimated.
Second: many gaze target estimation algorithms that do take spatial information into account rely on a large amount of prior knowledge, which requires complex pre-training and large data sources and is difficult to transfer to new scenes, making them unsuitable for application in real scenes.
For example, in 2022 Jun Bao et al. published "ESCNet: Gaze Target Detection with the Understanding of 3D Scenes" at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). This paper proposes constructing 3D point cloud information of the scene as a spatial-information supplement for gaze target estimation, but the method requires additional resources (e.g., a 3D pose dataset and a dense human body pose dataset) to generate reliable 3D point cloud information for the scene, making it difficult to apply in real-world scenes.
Disclosure of Invention
To overcome the shortcomings of existing methods, the invention provides a gaze target estimation method based on a progressive view cone that does not rely on excessive prior knowledge and effectively reflects the spatial information of the scene in the view cone, thereby improving the accuracy of gaze target estimation.
The invention adopts the following technical scheme for solving the technical problems:
The invention discloses a gaze target estimation method based on a progressive view cone, which is characterized by comprising the following steps:
step 1, data preprocessing:
Step 1.1: acquire a gaze target estimation dataset and denote any nth picture in it as I_n; generate the corresponding normalized depth image D_n from the nth picture I_n using a monocular depth estimation method, where 1 ≤ n ≤ N and N is the number of pictures in the gaze target estimation dataset;
Mark the bounding box P_n of the head position of any target person in the nth picture I_n, and crop the head image C_n of the corresponding target person from the nth picture I_n according to the bounding box P_n;
Step 1.2: construct a binary image B_n of the same size as the nth picture I_n; if a pixel of the nth picture I_n lies inside the bounding box P_n, set the pixel at the corresponding position in B_n to 1, otherwise set it to 0;
Step 1.3: pair B_n and D_n using formula (1) to generate the head depth image of the corresponding target person, whose pixel value at any ith row and jth column is given by formula (1);
In formula (1), B_(i,j) denotes the pixel value at the ith row and jth column of B_n, D_(i,j) denotes the pixel value at the ith row and jth column of D_n, and the index set in formula (1) denotes the set of all pixel indices within the bounding box P_n;
Step 1.4: if the gaze target of the target person in the nth picture I_n is not inside I_n, set the gaze label to indicate an out-of-image target; otherwise set the gaze label to indicate an in-image target and mark the position point G_n of the gaze target of the target person; then, centered on G_n, generate the gaze heat map corresponding to the target person with a Gaussian kernel function, divide the gaze heat map into blocks, and take the maximum pixel value in each block as the score of that block, obtaining the gaze score distribution of the nth picture I_n;
Step 2: establish a network model F composed of a line-of-sight feature extractor, a saliency feature extractor, a heat map regression encoder-decoder and an in-frame/out-of-frame classifier, where the heat map regression encoder-decoder consists of convolution layers and deconvolution layers, and the in-frame/out-of-frame classifier consists of convolution layers and a fully connected layer;
Step 2.1: define the current training iteration as t and initialize t = 1;
Step 2.2: use the line-of-sight feature extractor of the t-th training to process C_n and obtain the optimized view cone image of the t-th training;
Step 2.3: input I_n, D_n and the optimized view cone image into the saliency feature extractor of the t-th training, which processes the scene saliency features to obtain the refined scene saliency features of the t-th training;
Step 2.4: input the refined scene saliency features into the heat map regression encoder-decoder of the t-th training to obtain the predicted gaze heat map of the t-th training;
Step 2.5: input the refined scene saliency features into the in-frame/out-of-frame classifier of the t-th training to obtain the predicted label indicating whether the gaze target is inside the image at the t-th training;
Step 3: train the network model F_t of the t-th training by gradient descent to obtain the trained network model F_t, and judge whether the total loss function has converged; if it has, the parameters ε_t of the trained network model F_t are the optimal parameters ε*, and the network model with the optimal parameters ε* serves as the optimal model for final estimation of the gaze target position; otherwise, take the trained network model F_t as the network model F_{t+1} to be trained at the (t+1)-th training, assign t+1 to t, and return to step 2.2.
The gaze target estimation method based on a progressive view cone of the invention is also characterized in that step 2.2 comprises:
Step 2.2.1: input the head image C_n of the target person in the nth picture I_n into the line-of-sight feature extractor of the t-th training to obtain the line-of-sight features of the t-th training, where C, H, W denote the number of channels, height and width of the line-of-sight features respectively;
Step 2.2.2: compute the planar gaze vector of the target person at the t-th training using formula (2);
In formula (2), Tanh(·) and ReLU(·) denote the Tanh and ReLU activation functions respectively, two linear functions are applied, and an adaptive average pooling layer is used;
Step 2.2.3: compute the pixel values of the planar view cone image of the t-th training using formula (3), where H_0, W_0 denote the height and width of the planar view cone image;
In formula (3), (h_x, h_y) is the index of the head center position of the target person in the binary image B_n, and α is the angle threshold of the view cone;
Step 2.2.4: compute the pixel value at any ith row and jth column of the progressive image of the t-th training using formula (4);
In formula (4), the symbol N denotes the total number of all pixels in P_i;
Step 2.2.5: obtain the optimized view cone image of the t-th training using formula (5);
In formula (5), the operator denotes pixel-wise multiplication.
Step 2.3 comprises:
Step 2.3.1: input I_n, D_n and the optimized view cone image into the saliency feature extractor of the t-th training to obtain the scene saliency features of the t-th training;
Step 2.3.2: compute the block-level gaze distribution using formula (6);
In formula (6), Sigmoid(·) denotes the Sigmoid activation function, another two linear functions are applied, and Norm(·) denotes the normalization operation;
Step 2.3.3: obtain the refined scene saliency features of the t-th training using formula (7).
The total loss function in step 3 is obtained as follows:
Step 3.1: construct the line-of-sight loss at the t-th training using formula (8);
In formula (8), (g_x, g_y) is the corresponding real gaze location, (h_x, h_y) is the corresponding head center position, and (g_x - h_x, g_y - h_y) is the real line-of-sight direction;
Step 3.2: construct the gaze distribution loss at the t-th training using formula (9);
In formula (9), k denotes the index of any one of the H×W blocks, and the two score terms denote the predicted score and the true score of the kth block respectively;
Step 3.3: construct the gaze heat map loss at the t-th training using formula (10);
Step 3.4: construct the in/out label loss at the t-th training using formula (11);
Step 3.5: construct the total loss function Loss(ε_t) at the t-th training using formula (12);
In formula (12), ε_t denotes the parameters of the network model F at the t-th training.
The invention provides an electronic device comprising a memory and a processor, characterized in that the memory is configured to store a program that supports the processor in executing the gaze target estimation method, and the processor is configured to execute the program stored in the memory.
The invention also relates to a computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, performs the steps of the gaze target estimation method.
Compared with the prior art, the invention has the beneficial effects that:
1. The method does not depend on excessive modal information; using only 2D information, it effectively models the geometric positions of objects in space, so in practical applications it can be adapted to real scenes more efficiently and transferred to unknown scenes.
2. The invention discloses a gaze cone generation method based on spatial proximity, which uses depth information to construct a gaze cone that simulates human visual preference and depth perception, effectively removes regions irrelevant to gaze, and highlights potential gaze targets.
3. The progressive scene optimization method of the invention operates in a coarse-to-fine manner, enriches the feature representation in the gaze position heat map and alleviates the limited generalization ability of heat map regression, which enables more accurate prediction of the gaze target position of the target person.
Drawings
FIG. 1 is a schematic diagram of a network model according to the present invention;
fig. 2 is a diagram illustrating training of a network model according to the present invention.
Detailed Description
In this embodiment, a gaze target estimation method based on a progressive cone of view is performed according to the following steps:
step 1, data preprocessing:
Step 1.1: acquire a gaze target estimation dataset, downloaded from the Internet. The publicly available gaze target estimation datasets mainly include:
(1) The GazeFollow dataset, which covers diverse human activities performed by 130,339 annotated people and is classified by activity type. It consists of 122,143 images, and the dataset focuses on gaze targets inside the image.
(2) The VideoAttentionTarget dataset, which consists of 1,131 video samples from 50 different YouTube programs. The duration of these samples varies from 1 second to 80 seconds. The picture quality is relatively high, and it includes cases where the gaze target is outside the image.
Denote any nth picture in the gaze target estimation dataset as I_n, and generate the corresponding normalized depth image D_n from the nth picture I_n using a monocular depth estimation method, where 1 ≤ n ≤ N and N is the number of pictures in the gaze target estimation dataset;
Mark the bounding box P_n of the head position of any target person in the nth picture I_n, and crop the head image C_n of the corresponding target person from the nth picture I_n according to the bounding box P_n;
Step 1.2: construct a binary image B_n of the same size as the nth picture I_n; if a pixel of the nth picture I_n lies inside the bounding box P_n, set the pixel at the corresponding position in B_n to 1, otherwise set it to 0;
Step 1.3: pair B_n and D_n using formula (1) to generate the head depth image of the corresponding target person, whose pixel value at any ith row and jth column is given by formula (1);
In formula (1), B_(i,j) denotes the pixel value at the ith row and jth column of B_n, D_(i,j) denotes the pixel value at the ith row and jth column of D_n, and the index set in formula (1) denotes the set of all pixel indices within the bounding box P_n.
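By way of illustration only, steps 1.2 and 1.3 can be sketched as follows in Python/NumPy. Because formula (1) is not reproduced in this text, the rule used inside the bounding box (simply masking the depth map with B_n) is an assumption rather than the claimed formula:

```python
import numpy as np

def head_depth_image(depth: np.ndarray, bbox: tuple) -> np.ndarray:
    """Pair the binary head mask B_n with the normalized depth map D_n.

    depth : normalized depth image D_n, shape (H, W), values in [0, 1]
    bbox  : head bounding box P_n as (x1, y1, x2, y2) in pixel coordinates

    Outside the bounding box the value is 0; inside the box it equals the
    depth value (an assumption, since formula (1) is not reproduced here).
    """
    x1, y1, x2, y2 = bbox
    mask = np.zeros_like(depth)       # binary image B_n, same size as I_n
    mask[y1:y2, x1:x2] = 1.0          # pixels inside P_n are set to 1
    return mask * depth               # element-wise pairing of B_n and D_n
```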
Step 1.4, if the nth picture I n The fixation object of the target person is not in I n In the middle, let watch the labelOtherwise, let fixation label->And marks the position point of the fixation target of the target person +.>Thereby taking G as n For the center, a gazing heat map corresponding to the target person is generated by using a Gaussian kernel function>And is about the gaze heat map>Performing block division processing, and calculating the maximum pixel value in each block as the score of the corresponding block to obtain an nth picture I n Gaze score distribution +.>A gaussian kernel size of 3 is used for all datasets to obtain the real gaze heatmap.
Step 2, as shown in fig. 1: establish a network model F composed of a line-of-sight feature extractor, a saliency feature extractor, a heat map regression encoder-decoder and an in-frame/out-of-frame classifier, where the heat map regression encoder-decoder consists of 2 convolution layers and 3 deconvolution layers, the in-frame/out-of-frame classifier consists of 2 convolution layers and 1 fully connected layer, and the inputs of the network comprise the scene image I_n, the depth image D_n and the cropped head image C_n.
Step 2.1: define the current training iteration as t and initialize t = 1;
Step 2.2: input the head image C_n of the target person in the nth picture I_n into the line-of-sight feature extractor of the t-th training to obtain the line-of-sight features of the t-th training, where C, H, W denote the number of channels, height and width of the line-of-sight features respectively, here 2048×7×7.
Step 2.3, calculating the plane gaze vector of the target person at the t-th training time by using the method (2)
In the formula (2), tanh (·) and ReLU (·) represent a Tanh activation function and a ReLU activation function, respectively,andrepresents 2 linear functions, +.>Representing an adaptive average pooling layer; />Respectively indicate->A horizontal component and a vertical component of (a);
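The gaze-direction head of formula (2) is described only by its operations (adaptive average pooling, two linear functions, ReLU and Tanh). A minimal PyTorch sketch consistent with that description is shown below; the hidden width of 512 and the exact ordering of the operations are assumptions, since the formula itself is not reproduced:

```python
import torch
import torch.nn as nn

class GazeDirectionHead(nn.Module):
    """Predict the 2-D planar gaze vector from the 2048x7x7 line-of-sight
    features using adaptive average pooling, two linear layers, ReLU and Tanh."""
    def __init__(self, in_channels: int = 2048, hidden: int = 512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # pool the 7x7 spatial grid to 1x1
        self.fc1 = nn.Linear(in_channels, hidden)
        self.fc2 = nn.Linear(hidden, 2)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        x = self.pool(feats).flatten(1)          # (B, 2048)
        x = torch.relu(self.fc1(x))
        return torch.tanh(self.fc2(x))           # planar gaze vector (gx, gy) in [-1, 1]
```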
Step 2.4: compute the pixel values of the planar view cone image of the t-th training using formula (3), where H_0, W_0 denote the height and width of the planar view cone image, here 224×224;
In formula (3), (h_x, h_y) is the index of the head center position of the target person in the binary image B_n, and α is the angle threshold of the view cone. Specifically, the cosine between the predicted line-of-sight vector and the vector formed by the head position and any pixel is first computed; for pixels whose angle is below the set threshold, the smaller the angle (i.e., the larger the cosine, by the nature of the cosine function), the larger the value assigned to the corresponding pixel of the view cone image.
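An illustrative implementation of the planar view cone of step 2.4 follows, built directly from the cosine/threshold description above; using the raw cosine as the pixel value and the 60° default threshold are assumptions, since formula (3) is not reproduced here:

```python
import numpy as np

def plane_view_cone(gaze_vec, head_xy, size=(224, 224), alpha_deg=60.0):
    """Planar view cone image: each pixel stores the cosine between the
    predicted gaze vector and the direction from the head centre (h_x, h_y)
    to that pixel, kept only where the angle is below the threshold alpha."""
    h, w = size
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    dx, dy = xs - head_xy[0], ys - head_xy[1]
    norm = np.sqrt(dx ** 2 + dy ** 2) + 1e-6
    g = np.asarray(gaze_vec, dtype=np.float32)
    g = g / (np.linalg.norm(g) + 1e-6)
    cos = (dx * g[0] + dy * g[1]) / norm                     # cosine at each pixel
    return np.where(cos > np.cos(np.radians(alpha_deg)), cos, 0.0)
```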
Step 2.5: compute the pixel value at any ith row and jth column of the progressive image of the t-th training using formula (4);
In formula (4), the symbol N denotes the total number of all pixels in P_i.
Step 2.7: obtain the optimized view cone image using formula (5);
In formula (5), the operator denotes pixel-wise multiplication.
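Since formula (4) is not reproduced in this text, the sketch below only illustrates the idea of steps 2.5 and 2.7: a depth-based progressive weight centred on the target person (here, an assumed weight that decays with the distance from the mean head depth) is multiplied pixel-wise into the planar view cone, per formula (5):

```python
import numpy as np

def progressive_view_cone(plane_cone, depth, head_depth_img):
    """Combine the planar view cone with a depth-based progressive image.
    plane_cone and depth are assumed to share the same spatial size; the
    progressive weight below is an illustrative assumption, not formula (4)."""
    head_pixels = head_depth_img[head_depth_img > 0]
    ref = head_pixels.mean() if head_pixels.size else depth.mean()
    progressive = 1.0 - np.abs(depth - ref)      # depth in [0, 1], so weight in [0, 1]
    progressive = np.clip(progressive, 0.0, 1.0)
    return plane_cone * progressive              # pixel-wise multiplication, formula (5)
```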
Step 2.8: input I_n, D_n and the optimized view cone image into the saliency feature extractor of the t-th training to obtain the scene saliency features of the t-th training.
Step 2.9: compute the block-level gaze distribution using formula (6);
In formula (6), Sigmoid(·) denotes the Sigmoid activation function, another two linear functions are applied, and Norm(·) denotes the normalization operation.
Step 2.10: obtain the refined scene saliency features of the t-th training using formula (7).
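A PyTorch sketch of steps 2.9 and 2.10 is given below. The block-level gaze distribution follows the named operations of formula (6) (two linear functions, Sigmoid, normalization); re-weighting the saliency features spatially with the upsampled distribution is an assumption standing in for formula (7), and the channel width and 7×7 grid are assumed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveSceneRefinement(nn.Module):
    """Predict a block-level gaze distribution from the scene saliency
    features and use it to refine (re-weight) those features."""
    def __init__(self, channels=1024, grid=7, hidden=256):
        super().__init__()
        self.grid = grid
        self.pool = nn.AdaptiveAvgPool2d(grid)
        self.fc1 = nn.Linear(channels * grid * grid, hidden)  # first linear function
        self.fc2 = nn.Linear(hidden, grid * grid)              # second linear function

    def forward(self, sal):
        b, _, h, w = sal.shape
        x = self.pool(sal).flatten(1)
        dist = torch.sigmoid(self.fc2(self.fc1(x)))            # Sigmoid activation
        dist = dist / (dist.sum(dim=1, keepdim=True) + 1e-8)   # Norm(.)
        weight = dist.view(b, 1, self.grid, self.grid)
        weight = F.interpolate(weight, size=(h, w), mode='bilinear', align_corners=False)
        refined = sal * (1.0 + weight)                         # enrich, keep original signal
        return refined, dist.view(b, self.grid, self.grid)
```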
Step 2.11: input the refined scene saliency features into the heat map regression encoder-decoder of the t-th training to obtain the predicted gaze heat map of the t-th training; in this embodiment, the size of the heat map is 64×64.
Step 2.12: input the refined scene saliency features into the in-frame/out-of-frame classifier of the t-th training to obtain the predicted label indicating whether the gaze target is inside the image at the t-th training.
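The two output heads of steps 2.11 and 2.12 are specified only by their layer counts (2 convolutions plus 3 deconvolutions for the heat map codec, 2 convolutions plus 1 fully connected layer for the in-frame/out-of-frame classifier). The sketch below is one way to realise that description; the 1024×8×8 input size, channel widths and kernel sizes are assumptions chosen so that three stride-2 deconvolutions reach the stated 64×64 heat map:

```python
import torch
import torch.nn as nn

class OutputHeads(nn.Module):
    """Heat map regression encoder-decoder (2 conv + 3 deconv layers) and
    in-frame/out-of-frame classifier (2 conv layers + 1 fully connected)."""
    def __init__(self, in_channels: int = 1024):
        super().__init__()
        self.codec = nn.Sequential(                                  # input assumed 8x8
            nn.Conv2d(in_channels, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),  # 8 -> 16
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),   # 16 -> 32
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1),                            # 32 -> 64
        )
        self.classifier = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),   # 8 -> 4
            nn.Conv2d(128, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),            # 4 -> 2
            nn.Flatten(),
            nn.Linear(32 * 2 * 2, 1),             # logit: is the gaze target inside the image?
        )

    def forward(self, refined_sal: torch.Tensor):
        heatmap = self.codec(refined_sal)         # (B, 1, 64, 64) predicted gaze heat map
        in_out = self.classifier(refined_sal)     # (B, 1) in-frame/out-of-frame logit
        return heatmap, in_out
```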
Step 3, the t-th training of the network model F_t:
Step 3.1: construct the line-of-sight loss at the t-th training using formula (8);
In formula (8), (g_x, g_y) is the corresponding real gaze location, (h_x, h_y) is the corresponding head center position, and (g_x - h_x, g_y - h_y) is the real line-of-sight direction;
Step 3.2: construct the gaze distribution loss at the t-th training using formula (9);
In formula (9), k denotes the index of any one of the H×W blocks, with k ranging from 1 to 49, and the two score terms are the predicted score and the true score of the kth block;
Step 3.3: construct the gaze heat map loss at the t-th training using formula (10);
Step 3.4: construct the in/out label loss at the t-th training using formula (11);
Step 3.5: construct the total loss Loss(ε_t) at the t-th training using formula (12);
In formula (12), ε_t denotes the parameters of the network model F at the t-th training.
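Formulas (8)-(12) are not reproduced in this text, so the sketch below only illustrates how the four losses of steps 3.1-3.5 might be combined; the concrete loss functions (cosine distance for the line of sight, mean squared error for the block scores and the heat map, binary cross-entropy for the in/out label) and the unit weights are assumptions:

```python
import torch.nn.functional as F

def total_loss(pred_dir, true_dir, pred_scores, true_scores,
               pred_heatmap, true_heatmap, pred_inout, true_inout,
               w=(1.0, 1.0, 1.0, 1.0)):
    """Illustrative weighted sum of the four training losses."""
    l_gaze = 1.0 - F.cosine_similarity(pred_dir, true_dir, dim=-1).mean()  # line-of-sight loss
    l_dist = F.mse_loss(pred_scores, true_scores)                          # block-level gaze distribution loss
    l_heat = F.mse_loss(pred_heatmap, true_heatmap)                        # gaze heat map loss
    l_inout = F.binary_cross_entropy_with_logits(pred_inout, true_inout)   # in/out label loss
    return w[0] * l_gaze + w[1] * l_dist + w[2] * l_heat + w[3] * l_inout
```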
Step 3.6, as shown in FIG. 2: train the network model F_t of the t-th training by gradient descent to obtain the trained network model F_t, and judge whether Loss(ε_t) has converged. If it has, the parameters ε_t of the trained network model F_t are the optimal parameters ε*, and the network model with the optimal parameters ε* serves as the optimal model for final estimation of the gaze target position; otherwise, take the trained network model F_t as the network model F_{t+1} to be trained at the (t+1)-th training, assign t+1 to t, and return to step 2.2.
In this embodiment, an electronic device comprises a memory and a processor, wherein the memory is configured to store a program that supports the processor in executing the above method, and the processor is configured to execute the program stored in the memory.
In this embodiment, a computer-readable storage medium stores a computer program that, when executed by a processor, performs the steps of the method described above.

Claims (6)

1. A gaze target estimation method based on a progressive view cone, characterized by comprising the following steps:
step 1, data preprocessing:
Step 1.1: acquire a gaze target estimation dataset and denote any nth picture in it as I_n; generate the corresponding normalized depth image D_n from the nth picture I_n using a monocular depth estimation method, where 1 ≤ n ≤ N and N is the number of pictures in the gaze target estimation dataset;
Mark the bounding box P_n of the head position of any target person in the nth picture I_n, and crop the head image C_n of the corresponding target person from the nth picture I_n according to the bounding box P_n;
Step 1.2: construct a binary image B_n of the same size as the nth picture I_n; if a pixel of the nth picture I_n lies inside the bounding box P_n, set the pixel at the corresponding position in B_n to 1, otherwise set it to 0;
Step 1.3: pair B_n and D_n using formula (1) to generate the head depth image of the corresponding target person, whose pixel value at any ith row and jth column is given by formula (1);
In formula (1), B_(i,j) denotes the pixel value at the ith row and jth column of B_n, D_(i,j) denotes the pixel value at the ith row and jth column of D_n, and the index set in formula (1) denotes the set of all pixel indices within the bounding box P_n;
Step 1.4: if the gaze target of the target person in the nth picture I_n is not inside I_n, set the gaze label to indicate an out-of-image target; otherwise set the gaze label to indicate an in-image target and mark the position point G_n of the gaze target of the target person; then, centered on G_n, generate the gaze heat map corresponding to the target person with a Gaussian kernel function, divide the gaze heat map into blocks, and take the maximum pixel value in each block as the score of that block, obtaining the gaze score distribution of the nth picture I_n;
Step 2: establish a network model F composed of a line-of-sight feature extractor, a saliency feature extractor, a heat map regression encoder-decoder and an in-frame/out-of-frame classifier, where the heat map regression encoder-decoder consists of convolution layers and deconvolution layers, and the in-frame/out-of-frame classifier consists of convolution layers and a fully connected layer;
Step 2.1: define the current training iteration as t and initialize t = 1;
Step 2.2: use the line-of-sight feature extractor of the t-th training to process C_n and obtain the optimized view cone image of the t-th training;
Step 2.3: input I_n, D_n and the optimized view cone image into the saliency feature extractor of the t-th training, which processes the scene saliency features to obtain the refined scene saliency features of the t-th training;
Step 2.4: input the refined scene saliency features into the heat map regression encoder-decoder of the t-th training to obtain the predicted gaze heat map of the t-th training;
Step 2.5: input the refined scene saliency features into the in-frame/out-of-frame classifier of the t-th training to obtain the predicted label indicating whether the gaze target is inside the image at the t-th training;
Step 3: train the network model F_t of the t-th training by gradient descent to obtain the trained network model F_t, and judge whether the total loss function has converged; if it has, the parameters ε_t of the trained network model F_t are the optimal parameters ε*, and the network model with the optimal parameters ε* serves as the optimal model for final estimation of the gaze target position; otherwise, take the trained network model F_t as the network model F_{t+1} to be trained at the (t+1)-th training, assign t+1 to t, and return to step 2.2.
2. The gaze target estimation method based on a progressive view cone according to claim 1, characterized in that step 2.2 comprises:
Step 2.2.1: input the head image C_n of the target person in the nth picture I_n into the line-of-sight feature extractor of the t-th training to obtain the line-of-sight features of the t-th training, where C, H, W denote the number of channels, height and width of the line-of-sight features respectively;
Step 2.2.2: compute the planar gaze vector of the target person at the t-th training using formula (2);
In formula (2), Tanh(·) and ReLU(·) denote the Tanh and ReLU activation functions respectively, two linear functions are applied, and an adaptive average pooling layer is used;
Step 2.2.3: compute the pixel values of the planar view cone image of the t-th training using formula (3), where H_0, W_0 denote the height and width of the planar view cone image;
In formula (3), (h_x, h_y) is the index of the head center position of the target person in the binary image B_n, and α is the angle threshold of the view cone;
Step 2.2.4: compute the pixel value at any ith row and jth column of the progressive image of the t-th training using formula (4);
In formula (4), the symbol N denotes the total number of all pixels in P_i;
Step 2.2.5: obtain the optimized view cone image of the t-th training using formula (5);
In formula (5), the operator denotes pixel-wise multiplication.
3. The gaze target estimation method based on a progressive view cone according to claim 2, characterized in that step 2.3 comprises:
Step 2.3.1: input I_n, D_n and the optimized view cone image into the saliency feature extractor of the t-th training to obtain the scene saliency features of the t-th training;
Step 2.3.2: compute the block-level gaze distribution using formula (6);
In formula (6), Sigmoid(·) denotes the Sigmoid activation function, another two linear functions are applied, and Norm(·) denotes the normalization operation;
Step 2.3.3: obtain the refined scene saliency features of the t-th training using formula (7).
4. The gaze target estimation method based on a progressive view cone according to claim 3, characterized in that the total loss function in step 3 is obtained as follows:
Step 3.1: construct the line-of-sight loss at the t-th training using formula (8);
In formula (8), (g_x, g_y) is the corresponding real gaze location, (h_x, h_y) is the corresponding head center position, and (g_x - h_x, g_y - h_y) is the real line-of-sight direction;
Step 3.2: construct the gaze distribution loss at the t-th training using formula (9);
In formula (9), k denotes the index of any one of the H×W blocks, and the two score terms denote the predicted score and the true score of the kth block respectively;
Step 3.3: construct the gaze heat map loss at the t-th training using formula (10);
Step 3.4: construct the in/out label loss at the t-th training using formula (11);
Step 3.5: construct the total loss function Loss(ε_t) at the t-th training using formula (12);
In formula (12), ε_t denotes the parameters of the network model F at the t-th training.
5. An electronic device comprising a memory and a processor, characterized in that the memory is configured to store a program that supports the processor in performing the gaze target estimation method of any one of claims 1-4, and the processor is configured to execute the program stored in the memory.
6. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, performs the steps of the gaze target estimation method of any one of claims 1-4.
CN202410100320.8A 2024-01-24 2024-01-24 Gaze target estimation method based on progressive view cone Pending CN117746164A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410100320.8A CN117746164A (en) 2024-01-24 2024-01-24 Gaze target estimation method based on progressive view cone

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410100320.8A CN117746164A (en) 2024-01-24 2024-01-24 Gaze target estimation method based on progressive view cone

Publications (1)

Publication Number Publication Date
CN117746164A true CN117746164A (en) 2024-03-22

Family

ID=90279740

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410100320.8A Pending CN117746164A (en) 2024-01-24 2024-01-24 Gaze target estimation method based on progressive view cone

Country Status (1)

Country Link
CN (1) CN117746164A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination