CN117746164A - Gaze target estimation method based on progressive view cone - Google Patents

Gaze target estimation method based on progressive view cone

Info

Publication number
CN117746164A
CN117746164A (application CN202410100320.8A / CN202410100320A)
Authority
CN
China
Prior art keywords
training
gaze
image
target
formula
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410100320.8A
Other languages
Chinese (zh)
Inventor
郭丹
刘飞扬
李坤
汪萌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hefei University of Technology
Original Assignee
Hefei University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hefei University of Technology filed Critical Hefei University of Technology
Priority to CN202410100320.8A priority Critical patent/CN117746164A/en
Publication of CN117746164A publication Critical patent/CN117746164A/en
Pending legal-status Critical Current

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T 10/00: Road transport of goods or passengers
    • Y02T 10/10: Internal combustion engine [ICE] based vehicles
    • Y02T 10/40: Engine management systems

Abstract

The invention discloses a gaze target estimation method based on a progressive view cone, comprising the following steps: 1. estimate the line-of-sight direction from a head image of the target person; 2. construct a progressive relationship centered on the target person using the depth image; 3. generate a high-quality view cone image from the line-of-sight direction and the progressive relationship; 4. extract the saliency features of potential gaze targets by combining the view cone image, the scene RGB image and the scene depth map; 5. estimate a block-level gaze distribution and combine it with the saliency features to enrich the feature representation mapped to the gaze position heat map; 6. judge whether the gaze target is inside the image using the refined saliency features and generate the gaze heat map. The method can quickly and accurately locate the gaze target position of the target person in the picture, and the view cone generated from the progressive relationship effectively excludes background regions irrelevant to gaze, thereby improving the accuracy of gaze target estimation.

Description

Gaze target estimation method based on progressive view cone
Technical Field
The invention belongs to the field of image processing and computer vision, and mainly relates to a gaze target estimation method based on a progressive view cone.
Background
With the progress of society and the development of science and technology, people's modes of social interaction are constantly changing and upgrading. In public places, schools, workplaces and even home environments, a person's gaze behavior and choice of target often reflect their intent and emotion. The ability to estimate gaze targets is therefore a key factor in enabling computer systems to understand what people are doing in a scene and what they intend. Gaze target estimation, i.e., determining where a person is looking by analyzing and understanding the direction of their gaze and its focal position, has become an important research topic in the field of computer vision. The technology can not only improve the naturalness of human-computer interaction, but can also be used for early diagnosis and treatment of autism. For example, in social interaction, people's intent and emotion can be better understood by understanding and predicting their gaze targets; in human-computer interaction, a machine can provide a more natural, intuitive interaction experience by understanding a person's gaze target. Research on and application of gaze target estimation is therefore of great significance to social life.
With the development of modern image processing technology, gaze target estimation methods have also advanced greatly, but they still face the following problems:
First: existing methods lack an understanding of the spatial information of the scene, so the positional relationship between the target object and other objects in space cannot be truly reflected, and the gaze position of the target person cannot be accurately estimated.
For example, in 2018 Dongze Lian et al. published the paper "Believe It or Not, We Know What You Are Looking At!" at the Springer Asian Conference on Computer Vision (ACCV). The paper estimates the gaze target by combining planar view cone images at multiple scales; however, the planar view cone only covers the field of view and lacks perception of the geometric positional relationships of objects within it. The generated saliency heat map is therefore hard to concentrate on the key area, so the position of the gaze target cannot be accurately estimated.
Second: many gaze target estimation algorithms that do take spatial information into account rely on a large amount of prior knowledge, which requires complex pre-training and large data sources and is difficult to transfer to new scenes, making them unsuitable for application in real scenes.
For example, in 2022 Jun Bao et al. published "ESCNet: Gaze Target Detection with the Understanding of 3D Scenes" at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). This paper proposes constructing 3D point cloud information of the scene as a spatial-information supplement for gaze target estimation, but the method requires additional resources (e.g., a 3D pose dataset and a dense human body pose dataset) to generate reliable 3D point cloud information for the scene, making it difficult to apply in real-world scenes.
Disclosure of Invention
To overcome the shortcomings of existing methods, the invention provides a gaze target estimation method based on a progressive view cone that does not rely on excessive prior knowledge and effectively reflects the spatial information of the scene in the view cone, thereby improving the accuracy of gaze target estimation.
The invention adopts the following technical scheme for solving the technical problems:
The invention discloses a gaze target estimation method based on a progressive view cone, which is characterized by comprising the following steps:
step 1, data preprocessing:
Step 1.1: acquire a gaze target estimation dataset and denote any nth picture in it as I_n; generate the corresponding normalized depth image D_n from the nth picture I_n using a monocular depth estimation method, where 1 ≤ n ≤ N and N is the number of pictures in the gaze target estimation dataset;
Mark the bounding box P_n of the head position of any target person in the nth picture I_n, and crop the head image C_n of the corresponding target person from the nth picture I_n according to the bounding box P_n;
Step 1.2: construct a binary image B_n of the same size as the nth picture I_n; if a pixel of the nth picture I_n lies inside the bounding box P_n, set the pixel at the corresponding position in B_n to 1, otherwise set it to 0;
Step 1.3: pair B_n and D_n using formula (1) to generate the head depth image of the corresponding target person, whose pixel value at any ith row and jth column is given by formula (1);
In formula (1), B_(i,j) denotes the pixel value at the ith row and jth column of B_n, D_(i,j) denotes the pixel value at the ith row and jth column of D_n, and the index set in formula (1) denotes the set of all pixel indices within the bounding box P_n;
Step 1.4: if the gaze target of the target person in the nth picture I_n is not inside I_n, set the gaze label to indicate an out-of-image target; otherwise set the gaze label to indicate an in-image target and mark the position point G_n of the gaze target of the target person; then, centered on G_n, generate the gaze heat map corresponding to the target person with a Gaussian kernel function, divide the gaze heat map into blocks, and take the maximum pixel value in each block as the score of that block, obtaining the gaze score distribution of the nth picture I_n;
Step 2: establish a network model F composed of a line-of-sight feature extractor, a saliency feature extractor, a heat map regression encoder-decoder and an in-frame/out-of-frame classifier, where the heat map regression encoder-decoder consists of convolution layers and deconvolution layers, and the in-frame/out-of-frame classifier consists of convolution layers and a fully connected layer;
Step 2.1: define the current training iteration as t and initialize t = 1;
Step 2.2: use the line-of-sight feature extractor of the t-th training to process C_n and obtain the optimized view cone image of the t-th training;
Step 2.3: input I_n, D_n and the optimized view cone image into the saliency feature extractor of the t-th training, which processes the scene saliency features to obtain the refined scene saliency features of the t-th training;
Step 2.4: input the refined scene saliency features into the heat map regression encoder-decoder of the t-th training to obtain the predicted gaze heat map of the t-th training;
Step 2.5: input the refined scene saliency features into the in-frame/out-of-frame classifier of the t-th training to obtain the predicted label indicating whether the gaze target is inside the image at the t-th training;
Step 3: train the network model F_t of the t-th training by gradient descent to obtain the trained network model F_t, and judge whether the total loss function has converged; if it has, the parameters ε_t of the trained network model F_t are the optimal parameters ε*, and the network model with the optimal parameters ε* serves as the optimal model for final estimation of the gaze target position; otherwise, take the trained network model F_t as the network model F_{t+1} to be trained at the (t+1)-th training, assign t+1 to t, and return to step 2.2.
The gaze target estimation method based on a progressive view cone of the invention is also characterized in that step 2.2 comprises:
Step 2.2.1: input the head image C_n of the target person in the nth picture I_n into the line-of-sight feature extractor of the t-th training to obtain the line-of-sight features of the t-th training, where C, H, W denote the number of channels, height and width of the line-of-sight features respectively;
Step 2.2.2: compute the planar gaze vector of the target person at the t-th training using formula (2);
In formula (2), Tanh(·) and ReLU(·) denote the Tanh and ReLU activation functions respectively, two linear functions are applied, and an adaptive average pooling layer is used;
Step 2.2.3: compute the pixel values of the planar view cone image of the t-th training using formula (3), where H_0, W_0 denote the height and width of the planar view cone image;
In formula (3), (h_x, h_y) is the index of the head center position of the target person in the binary image B_n, and α is the angle threshold of the view cone;
Step 2.2.4: compute the pixel value at any ith row and jth column of the progressive image of the t-th training using formula (4);
In formula (4), the symbol N denotes the total number of all pixels in P_i;
Step 2.2.5: obtain the optimized view cone image of the t-th training using formula (5);
In formula (5), the operator denotes pixel-wise multiplication.
Step 2.3 comprises:
Step 2.3.1: input I_n, D_n and the optimized view cone image into the saliency feature extractor of the t-th training to obtain the scene saliency features of the t-th training;
Step 2.3.2: compute the block-level gaze distribution using formula (6);
In formula (6), Sigmoid(·) denotes the Sigmoid activation function, another two linear functions are applied, and Norm(·) denotes the normalization operation;
Step 2.3.3: obtain the refined scene saliency features of the t-th training using formula (7).
The total loss function in step 3 is obtained as follows:
Step 3.1: construct the line-of-sight loss at the t-th training using formula (8);
In formula (8), (g_x, g_y) is the corresponding real gaze location, (h_x, h_y) is the corresponding head center position, and (g_x - h_x, g_y - h_y) is the real line-of-sight direction;
Step 3.2: construct the gaze distribution loss at the t-th training using formula (9);
In formula (9), k denotes the index of any one of the H×W blocks, and the two score terms denote the predicted score and the true score of the kth block respectively;
Step 3.3: construct the gaze heat map loss at the t-th training using formula (10);
Step 3.4: construct the in/out label loss at the t-th training using formula (11);
Step 3.5: construct the total loss function Loss(ε_t) at the t-th training using formula (12);
In formula (12), ε_t denotes the parameters of the network model F at the t-th training.
The invention provides an electronic device comprising a memory and a processor, characterized in that the memory is configured to store a program that supports the processor in executing the gaze target estimation method, and the processor is configured to execute the program stored in the memory.
The invention also relates to a computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, performs the steps of the gaze target estimation method.
Compared with the prior art, the invention has the beneficial effects that:
1. The method does not depend on excessive modal information; using only 2D information, it effectively models the geometric positions of objects in space, so in practical applications it can be adapted to real scenes more efficiently and transferred to unknown scenes.
2. The invention discloses a gaze cone generation method based on spatial proximity, which uses depth information to construct a gaze cone that simulates human visual preference and depth perception, effectively removes regions irrelevant to gaze, and highlights potential gaze targets.
3. The progressive scene optimization method of the invention operates in a coarse-to-fine manner, enriches the feature representation in the gaze position heat map and alleviates the limited generalization ability of heat map regression, which enables more accurate prediction of the gaze target position of the target person.
Drawings
FIG. 1 is a schematic diagram of a network model according to the present invention;
fig. 2 is a diagram illustrating training of a network model according to the present invention.
Detailed Description
In this embodiment, a gaze target estimation method based on a progressive cone of view is performed according to the following steps:
step 1, data preprocessing:
Step 1.1: acquire a gaze target estimation dataset, downloaded from the Internet. The publicly available gaze target estimation datasets mainly include:
(1) The GazeFollow dataset, which covers diverse human activities performed by 130,339 annotated people and is classified by activity type. It consists of 122,143 images, and the dataset focuses on gaze targets inside the image.
(2) The VideoAttentionTarget dataset, which consists of 1,131 video samples from 50 different YouTube programs. The duration of these samples varies from 1 second to 80 seconds. The picture quality is relatively high, and it includes cases where the gaze target is outside the image.
Denote any nth picture in the gaze target estimation dataset as I_n, and generate the corresponding normalized depth image D_n from the nth picture I_n using a monocular depth estimation method, where 1 ≤ n ≤ N and N is the number of pictures in the gaze target estimation dataset;
Mark the bounding box P_n of the head position of any target person in the nth picture I_n, and crop the head image C_n of the corresponding target person from the nth picture I_n according to the bounding box P_n;
Step 1.2: construct a binary image B_n of the same size as the nth picture I_n; if a pixel of the nth picture I_n lies inside the bounding box P_n, set the pixel at the corresponding position in B_n to 1, otherwise set it to 0;
Step 1.3: pair B_n and D_n using formula (1) to generate the head depth image of the corresponding target person, whose pixel value at any ith row and jth column is given by formula (1);
In formula (1), B_(i,j) denotes the pixel value at the ith row and jth column of B_n, D_(i,j) denotes the pixel value at the ith row and jth column of D_n, and the index set in formula (1) denotes the set of all pixel indices within the bounding box P_n.
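By way of illustration only, steps 1.2 and 1.3 can be sketched as follows in Python/NumPy. Because formula (1) is not reproduced in this text, the rule used inside the bounding box (simply masking the depth map with B_n) is an assumption rather than the claimed formula:

```python
import numpy as np

def head_depth_image(depth: np.ndarray, bbox: tuple) -> np.ndarray:
    """Pair the binary head mask B_n with the normalized depth map D_n.

    depth : normalized depth image D_n, shape (H, W), values in [0, 1]
    bbox  : head bounding box P_n as (x1, y1, x2, y2) in pixel coordinates

    Outside the bounding box the value is 0; inside the box it equals the
    depth value (an assumption, since formula (1) is not reproduced here).
    """
    x1, y1, x2, y2 = bbox
    mask = np.zeros_like(depth)       # binary image B_n, same size as I_n
    mask[y1:y2, x1:x2] = 1.0          # pixels inside P_n are set to 1
    return mask * depth               # element-wise pairing of B_n and D_n
```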
Step 1.4, if the nth picture I n The fixation object of the target person is not in I n In the middle, let watch the labelOtherwise, let fixation label->And marks the position point of the fixation target of the target person +.>Thereby taking G as n For the center, a gazing heat map corresponding to the target person is generated by using a Gaussian kernel function>And is about the gaze heat map>Performing block division processing, and calculating the maximum pixel value in each block as the score of the corresponding block to obtain an nth picture I n Gaze score distribution +.>A gaussian kernel size of 3 is used for all datasets to obtain the real gaze heatmap.
Step 2, as shown in fig. 1: establish a network model F composed of a line-of-sight feature extractor, a saliency feature extractor, a heat map regression encoder-decoder and an in-frame/out-of-frame classifier, where the heat map regression encoder-decoder consists of 2 convolution layers and 3 deconvolution layers, the in-frame/out-of-frame classifier consists of 2 convolution layers and 1 fully connected layer, and the inputs of the network comprise the scene image I_n, the depth image D_n and the cropped head image C_n.
Step 2.1: define the current training iteration as t and initialize t = 1;
Step 2.2: input the head image C_n of the target person in the nth picture I_n into the line-of-sight feature extractor of the t-th training to obtain the line-of-sight features of the t-th training, where C, H, W denote the number of channels, height and width of the line-of-sight features respectively, here 2048×7×7.
Step 2.3, calculating the plane gaze vector of the target person at the t-th training time by using the method (2)
In the formula (2), tanh (·) and ReLU (·) represent a Tanh activation function and a ReLU activation function, respectively,andrepresents 2 linear functions, +.>Representing an adaptive average pooling layer; />Respectively indicate->A horizontal component and a vertical component of (a);
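The gaze-direction head of formula (2) is described only by its operations (adaptive average pooling, two linear functions, ReLU and Tanh). A minimal PyTorch sketch consistent with that description is shown below; the hidden width of 512 and the exact ordering of the operations are assumptions, since the formula itself is not reproduced:

```python
import torch
import torch.nn as nn

class GazeDirectionHead(nn.Module):
    """Predict the 2-D planar gaze vector from the 2048x7x7 line-of-sight
    features using adaptive average pooling, two linear layers, ReLU and Tanh."""
    def __init__(self, in_channels: int = 2048, hidden: int = 512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)      # pool the 7x7 spatial grid to 1x1
        self.fc1 = nn.Linear(in_channels, hidden)
        self.fc2 = nn.Linear(hidden, 2)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        x = self.pool(feats).flatten(1)          # (B, 2048)
        x = torch.relu(self.fc1(x))
        return torch.tanh(self.fc2(x))           # planar gaze vector (gx, gy) in [-1, 1]
```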
Step 2.4: compute the pixel values of the planar view cone image of the t-th training using formula (3), where H_0, W_0 denote the height and width of the planar view cone image, here 224×224;
In formula (3), (h_x, h_y) is the index of the head center position of the target person in the binary image B_n, and α is the angle threshold of the view cone. Specifically, the cosine between the predicted line-of-sight vector and the vector formed by the head position and any pixel is first computed; for pixels whose angle is below the set threshold, the smaller the angle (i.e., the larger the cosine, by the nature of the cosine function), the larger the value assigned to the corresponding pixel of the view cone image.
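An illustrative implementation of the planar view cone of step 2.4 follows, built directly from the cosine/threshold description above; using the raw cosine as the pixel value and the 60° default threshold are assumptions, since formula (3) is not reproduced here:

```python
import numpy as np

def plane_view_cone(gaze_vec, head_xy, size=(224, 224), alpha_deg=60.0):
    """Planar view cone image: each pixel stores the cosine between the
    predicted gaze vector and the direction from the head centre (h_x, h_y)
    to that pixel, kept only where the angle is below the threshold alpha."""
    h, w = size
    ys, xs = np.mgrid[0:h, 0:w].astype(np.float32)
    dx, dy = xs - head_xy[0], ys - head_xy[1]
    norm = np.sqrt(dx ** 2 + dy ** 2) + 1e-6
    g = np.asarray(gaze_vec, dtype=np.float32)
    g = g / (np.linalg.norm(g) + 1e-6)
    cos = (dx * g[0] + dy * g[1]) / norm                     # cosine at each pixel
    return np.where(cos > np.cos(np.radians(alpha_deg)), cos, 0.0)
```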
Step 2.5: compute the pixel value at any ith row and jth column of the progressive image of the t-th training using formula (4);
In formula (4), the symbol N denotes the total number of all pixels in P_i.
Step 2.7: obtain the optimized view cone image using formula (5);
In formula (5), the operator denotes pixel-wise multiplication.
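Since formula (4) is not reproduced in this text, the sketch below only illustrates the idea of steps 2.5 and 2.7: a depth-based progressive weight centred on the target person (here, an assumed weight that decays with the distance from the mean head depth) is multiplied pixel-wise into the planar view cone, per formula (5):

```python
import numpy as np

def progressive_view_cone(plane_cone, depth, head_depth_img):
    """Combine the planar view cone with a depth-based progressive image.
    plane_cone and depth are assumed to share the same spatial size; the
    progressive weight below is an illustrative assumption, not formula (4)."""
    head_pixels = head_depth_img[head_depth_img > 0]
    ref = head_pixels.mean() if head_pixels.size else depth.mean()
    progressive = 1.0 - np.abs(depth - ref)      # depth in [0, 1], so weight in [0, 1]
    progressive = np.clip(progressive, 0.0, 1.0)
    return plane_cone * progressive              # pixel-wise multiplication, formula (5)
```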
Step 2.8: input I_n, D_n and the optimized view cone image into the saliency feature extractor of the t-th training to obtain the scene saliency features of the t-th training.
Step 2.9: compute the block-level gaze distribution using formula (6);
In formula (6), Sigmoid(·) denotes the Sigmoid activation function, another two linear functions are applied, and Norm(·) denotes the normalization operation.
Step 2.10: obtain the refined scene saliency features of the t-th training using formula (7).
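A PyTorch sketch of steps 2.9 and 2.10 is given below. The block-level gaze distribution follows the named operations of formula (6) (two linear functions, Sigmoid, normalization); re-weighting the saliency features spatially with the upsampled distribution is an assumption standing in for formula (7), and the channel width and 7×7 grid are assumed:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ProgressiveSceneRefinement(nn.Module):
    """Predict a block-level gaze distribution from the scene saliency
    features and use it to refine (re-weight) those features."""
    def __init__(self, channels=1024, grid=7, hidden=256):
        super().__init__()
        self.grid = grid
        self.pool = nn.AdaptiveAvgPool2d(grid)
        self.fc1 = nn.Linear(channels * grid * grid, hidden)  # first linear function
        self.fc2 = nn.Linear(hidden, grid * grid)              # second linear function

    def forward(self, sal):
        b, _, h, w = sal.shape
        x = self.pool(sal).flatten(1)
        dist = torch.sigmoid(self.fc2(self.fc1(x)))            # Sigmoid activation
        dist = dist / (dist.sum(dim=1, keepdim=True) + 1e-8)   # Norm(.)
        weight = dist.view(b, 1, self.grid, self.grid)
        weight = F.interpolate(weight, size=(h, w), mode='bilinear', align_corners=False)
        refined = sal * (1.0 + weight)                         # enrich, keep original signal
        return refined, dist.view(b, self.grid, self.grid)
```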
Step 2.11: input the refined scene saliency features into the heat map regression encoder-decoder of the t-th training to obtain the predicted gaze heat map of the t-th training; in this embodiment, the size of the heat map is 64×64.
Step 2.12: input the refined scene saliency features into the in-frame/out-of-frame classifier of the t-th training to obtain the predicted label indicating whether the gaze target is inside the image at the t-th training.
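The two output heads of steps 2.11 and 2.12 are specified only by their layer counts (2 convolutions plus 3 deconvolutions for the heat map codec, 2 convolutions plus 1 fully connected layer for the in-frame/out-of-frame classifier). The sketch below is one way to realise that description; the 1024×8×8 input size, channel widths and kernel sizes are assumptions chosen so that three stride-2 deconvolutions reach the stated 64×64 heat map:

```python
import torch
import torch.nn as nn

class OutputHeads(nn.Module):
    """Heat map regression encoder-decoder (2 conv + 3 deconv layers) and
    in-frame/out-of-frame classifier (2 conv layers + 1 fully connected)."""
    def __init__(self, in_channels: int = 1024):
        super().__init__()
        self.codec = nn.Sequential(                                  # input assumed 8x8
            nn.Conv2d(in_channels, 512, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(512, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(inplace=True),  # 8 -> 16
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(inplace=True),   # 16 -> 32
            nn.ConvTranspose2d(64, 1, 4, stride=2, padding=1),                            # 32 -> 64
        )
        self.classifier = nn.Sequential(
            nn.Conv2d(in_channels, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),   # 8 -> 4
            nn.Conv2d(128, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),            # 4 -> 2
            nn.Flatten(),
            nn.Linear(32 * 2 * 2, 1),             # logit: is the gaze target inside the image?
        )

    def forward(self, refined_sal: torch.Tensor):
        heatmap = self.codec(refined_sal)         # (B, 1, 64, 64) predicted gaze heat map
        in_out = self.classifier(refined_sal)     # (B, 1) in-frame/out-of-frame logit
        return heatmap, in_out
```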
Step 3, the t-th training of the network model F_t:
Step 3.1: construct the line-of-sight loss at the t-th training using formula (8);
In formula (8), (g_x, g_y) is the corresponding real gaze location, (h_x, h_y) is the corresponding head center position, and (g_x - h_x, g_y - h_y) is the real line-of-sight direction;
Step 3.2: construct the gaze distribution loss at the t-th training using formula (9);
In formula (9), k denotes the index of any one of the H×W blocks, with k ranging from 1 to 49, and the two score terms are the predicted score and the true score of the kth block;
Step 3.3: construct the gaze heat map loss at the t-th training using formula (10);
Step 3.4: construct the in/out label loss at the t-th training using formula (11);
Step 3.5: construct the total loss Loss(ε_t) at the t-th training using formula (12);
In formula (12), ε_t denotes the parameters of the network model F at the t-th training.
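Formulas (8)-(12) are not reproduced in this text, so the sketch below only illustrates how the four losses of steps 3.1-3.5 might be combined; the concrete loss functions (cosine distance for the line of sight, mean squared error for the block scores and the heat map, binary cross-entropy for the in/out label) and the unit weights are assumptions:

```python
import torch.nn.functional as F

def total_loss(pred_dir, true_dir, pred_scores, true_scores,
               pred_heatmap, true_heatmap, pred_inout, true_inout,
               w=(1.0, 1.0, 1.0, 1.0)):
    """Illustrative weighted sum of the four training losses."""
    l_gaze = 1.0 - F.cosine_similarity(pred_dir, true_dir, dim=-1).mean()  # line-of-sight loss
    l_dist = F.mse_loss(pred_scores, true_scores)                          # block-level gaze distribution loss
    l_heat = F.mse_loss(pred_heatmap, true_heatmap)                        # gaze heat map loss
    l_inout = F.binary_cross_entropy_with_logits(pred_inout, true_inout)   # in/out label loss
    return w[0] * l_gaze + w[1] * l_dist + w[2] * l_heat + w[3] * l_inout
```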
Step 3.6, as shown in FIG. 2: train the network model F_t of the t-th training by gradient descent to obtain the trained network model F_t, and judge whether Loss(ε_t) has converged. If it has, the parameters ε_t of the trained network model F_t are the optimal parameters ε*, and the network model with the optimal parameters ε* serves as the optimal model for final estimation of the gaze target position; otherwise, take the trained network model F_t as the network model F_{t+1} to be trained at the (t+1)-th training, assign t+1 to t, and return to step 2.2.
In this embodiment, an electronic device comprises a memory and a processor, wherein the memory is configured to store a program that supports the processor in executing the above method, and the processor is configured to execute the program stored in the memory.
In this embodiment, a computer-readable storage medium stores a computer program that, when executed by a processor, performs the steps of the method described above.

Claims (6)

1. A gaze target estimation method based on a progressive view cone, characterized by comprising the following steps:
step 1, data preprocessing:
Step 1.1: acquire a gaze target estimation dataset and denote any nth picture in it as I_n; generate the corresponding normalized depth image D_n from the nth picture I_n using a monocular depth estimation method, where 1 ≤ n ≤ N and N is the number of pictures in the gaze target estimation dataset;
Mark the bounding box P_n of the head position of any target person in the nth picture I_n, and crop the head image C_n of the corresponding target person from the nth picture I_n according to the bounding box P_n;
Step 1.2: construct a binary image B_n of the same size as the nth picture I_n; if a pixel of the nth picture I_n lies inside the bounding box P_n, set the pixel at the corresponding position in B_n to 1, otherwise set it to 0;
Step 1.3: pair B_n and D_n using formula (1) to generate the head depth image of the corresponding target person, whose pixel value at any ith row and jth column is given by formula (1);
In formula (1), B_(i,j) denotes the pixel value at the ith row and jth column of B_n, D_(i,j) denotes the pixel value at the ith row and jth column of D_n, and the index set in formula (1) denotes the set of all pixel indices within the bounding box P_n;
Step 1.4: if the gaze target of the target person in the nth picture I_n is not inside I_n, set the gaze label to indicate an out-of-image target; otherwise set the gaze label to indicate an in-image target and mark the position point G_n of the gaze target of the target person; then, centered on G_n, generate the gaze heat map corresponding to the target person with a Gaussian kernel function, divide the gaze heat map into blocks, and take the maximum pixel value in each block as the score of that block, obtaining the gaze score distribution of the nth picture I_n;
Step 2: establish a network model F composed of a line-of-sight feature extractor, a saliency feature extractor, a heat map regression encoder-decoder and an in-frame/out-of-frame classifier, where the heat map regression encoder-decoder consists of convolution layers and deconvolution layers, and the in-frame/out-of-frame classifier consists of convolution layers and a fully connected layer;
Step 2.1: define the current training iteration as t and initialize t = 1;
Step 2.2: use the line-of-sight feature extractor of the t-th training to process C_n and obtain the optimized view cone image of the t-th training;
Step 2.3: input I_n, D_n and the optimized view cone image into the saliency feature extractor of the t-th training, which processes the scene saliency features to obtain the refined scene saliency features of the t-th training;
Step 2.4: input the refined scene saliency features into the heat map regression encoder-decoder of the t-th training to obtain the predicted gaze heat map of the t-th training;
Step 2.5: input the refined scene saliency features into the in-frame/out-of-frame classifier of the t-th training to obtain the predicted label indicating whether the gaze target is inside the image at the t-th training;
Step 3: train the network model F_t of the t-th training by gradient descent to obtain the trained network model F_t, and judge whether the total loss function has converged; if it has, the parameters ε_t of the trained network model F_t are the optimal parameters ε*, and the network model with the optimal parameters ε* serves as the optimal model for final estimation of the gaze target position; otherwise, take the trained network model F_t as the network model F_{t+1} to be trained at the (t+1)-th training, assign t+1 to t, and return to step 2.2.
2. The gaze target estimation method based on a progressive view cone according to claim 1, characterized in that step 2.2 comprises:
Step 2.2.1: input the head image C_n of the target person in the nth picture I_n into the line-of-sight feature extractor of the t-th training to obtain the line-of-sight features of the t-th training, where C, H, W denote the number of channels, height and width of the line-of-sight features respectively;
Step 2.2.2: compute the planar gaze vector of the target person at the t-th training using formula (2);
In formula (2), Tanh(·) and ReLU(·) denote the Tanh and ReLU activation functions respectively, two linear functions are applied, and an adaptive average pooling layer is used;
Step 2.2.3: compute the pixel values of the planar view cone image of the t-th training using formula (3), where H_0, W_0 denote the height and width of the planar view cone image;
In formula (3), (h_x, h_y) is the index of the head center position of the target person in the binary image B_n, and α is the angle threshold of the view cone;
Step 2.2.4: compute the pixel value at any ith row and jth column of the progressive image of the t-th training using formula (4);
In formula (4), the symbol N denotes the total number of all pixels in P_i;
Step 2.2.5: obtain the optimized view cone image of the t-th training using formula (5);
In formula (5), the operator denotes pixel-wise multiplication.
3. The gaze target estimation method based on a progressive view cone according to claim 2, characterized in that step 2.3 comprises:
Step 2.3.1: input I_n, D_n and the optimized view cone image into the saliency feature extractor of the t-th training to obtain the scene saliency features of the t-th training;
Step 2.3.2: compute the block-level gaze distribution using formula (6);
In formula (6), Sigmoid(·) denotes the Sigmoid activation function, another two linear functions are applied, and Norm(·) denotes the normalization operation;
Step 2.3.3: obtain the refined scene saliency features of the t-th training using formula (7).
4. The gaze target estimation method based on a progressive view cone according to claim 3, characterized in that the total loss function in step 3 is obtained as follows:
Step 3.1: construct the line-of-sight loss at the t-th training using formula (8);
In formula (8), (g_x, g_y) is the corresponding real gaze location, (h_x, h_y) is the corresponding head center position, and (g_x - h_x, g_y - h_y) is the real line-of-sight direction;
Step 3.2: construct the gaze distribution loss at the t-th training using formula (9);
In formula (9), k denotes the index of any one of the H×W blocks, and the two score terms denote the predicted score and the true score of the kth block respectively;
Step 3.3: construct the gaze heat map loss at the t-th training using formula (10);
Step 3.4: construct the in/out label loss at the t-th training using formula (11);
Step 3.5: construct the total loss function Loss(ε_t) at the t-th training using formula (12);
In formula (12), ε_t denotes the parameters of the network model F at the t-th training.
5. An electronic device comprising a memory and a processor, characterized in that the memory is configured to store a program that supports the processor in performing the gaze target estimation method of any one of claims 1-4, and the processor is configured to execute the program stored in the memory.
6. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, performs the steps of the gaze target estimation method of any one of claims 1-4.
CN202410100320.8A 2024-01-24 2024-01-24 Gaze target estimation method based on progressive view cone Pending CN117746164A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410100320.8A CN117746164A (en) 2024-01-24 2024-01-24 Gaze target estimation method based on progressive view cone

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410100320.8A CN117746164A (en) 2024-01-24 2024-01-24 Gaze target estimation method based on progressive view cone

Publications (1)

Publication Number Publication Date
CN117746164A true CN117746164A (en) 2024-03-22

Family

ID=90279740

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410100320.8A Pending CN117746164A (en) 2024-01-24 2024-01-24 Gaze target estimation method based on progressive view cone

Country Status (1)

Country Link
CN (1) CN117746164A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination