CN113313123A - Semantic inference based glance path prediction method - Google Patents

Semantic inference based glance path prediction method

Info

Publication number
CN113313123A
CN113313123A (application CN202110652817.7A)
Authority
CN
China
Prior art keywords
semantic
image
glance
path
decoder
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110652817.7A
Other languages
Chinese (zh)
Other versions
CN113313123B (en)
Inventor
夏辰 (Xia Chen)
钟文琦 (Zhong Wenqi)
韩军伟 (Han Junwei)
郭雷 (Guo Lei)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202110652817.7A
Publication of CN113313123A
Application granted
Publication of CN113313123B
Legal status: Active
Anticipated expiration

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 - Arrangements for image or video recognition or understanding
    • G06V 10/20 - Image preprocessing
    • G06V 10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 - Segmentation by performing operations on regions, e.g. growing, shrinking or watersheds
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 - Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 - Pattern recognition
    • G06F 18/20 - Analysing
    • G06F 18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/04 - Architecture, e.g. interconnection topology
    • G06N 3/049 - Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 - Computing arrangements based on biological models
    • G06N 3/02 - Neural networks
    • G06N 3/08 - Learning methods

Abstract

The invention relates to a saccade path prediction method based on semantic inference, and belongs to the field of image saccade path prediction. An image saccade path training set is constructed, and a trained CNN maps each image into a semantic space to obtain the semantic vector corresponding to each fixation point. A saccade path prediction model with an encoder-decoder framework is constructed: the encoder encodes the global information of the image into a coding vector, which serves as the initial value of the decoder, and the decoder learns the semantic inference relation between fixation points. The Euclidean distance from the predicted fixation-point semantic vector to the ground-truth fixation-point semantic vector is used as the loss function, and the encoder-decoder network is optimized to minimize this loss. Feeding a test image into the optimized network yields its saccade path.

Description

Semantic inference based glance path prediction method
Technical Field
The invention relates to the field of image saccade path prediction, and in particular to a saccade path prediction method based on semantic inference.
Background
At every moment the human eye receives far more visual data than the brain can process, yet the human visual system locates the important regions of complex physical scenes, allowing humans to rapidly extract useful information from massive visual data with limited computational resources. Studying how the human visual system rapidly extracts useful information is therefore of great importance. Existing research falls into two areas: visual saliency, the probability of fixation on visual data, which characterizes the static properties of vision; and the visual saccade path, the temporal and spatial sequence of changes in the human gaze point, which reflects both the static and the dynamic properties of vision. Research on saccade paths thus helps build a fuller understanding of how human vision works, with broad application prospects in recommendation systems, crowd recognition, virtual reality, and other fields.
The earliest saccade path prediction was proposed by Itti et al. in "A Model of Saliency-Based Visual Attention for Rapid Scene Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254-1259, 1998. They used the feature integration theory to interpret the visual search strategy, decomposing the input image into a series of shallow feature maps to predict image saliency, and then determining the position of the next fixation point with an inhibition-of-return mechanism and a winner-take-all mechanism. Giuseppe Boccignone et al. then used the Langevin equation in "Modelling gaze shift as a constrained random walk," Physica A: Statistical Mechanics and its Applications, vol. 331, no. 1, pp. 207-218, 2004, to model the saccade path as a constrained random walk in a saliency field, with saccade length obeying a Lévy distribution. These traditional methods do not extract high-level semantic information from the image when predicting the saliency map, so semantic targets are predicted inaccurately, and predicting fixation points from the saliency map depends on the accuracy of the underlying physiological modeling.
With the development of deep learning methods such as the Convolutional Neural Network (CNN) and the Long Short-Term Memory (LSTM) network, researchers have turned to predicting saccade paths with deep learning. Thuyen Ngo et al. (T. Ngo and B. S. Manjunath, "Saccade gaze prediction using a recurrent neural network," 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 3435-3439) proposed extracting image features with a CNN, feeding fixation-point features into an LSTM network, and predicting the probability of the next fixation location with the LSTM. However, that LSTM mines the relation from fixation-point features to the next fixation location, modeling a statistical property of fixation positions while lacking semantic understanding of the fixation points. The saccade path prediction method and device based on machine learning (publication number CN109447096A) adopts a CNN as encoder and an LSTM as decoder: the encoder extracts features of the fixation point, the decoder directly predicts the fixation position, and an attention mechanism lets each decoding step focus on different information. This method models the positional correlation between fixation points, but the fixation-point features lack feature information from other positions, so it is difficult to accurately predict the next fixation position from them alone. Moreover, saccade paths differ across populations, and the physiological mechanisms underlying saccades remain partly unclear, so existing models that rely on a single physiological model cannot adapt to the saccade paths of different populations. The invention therefore observes that fixation-point semantics are strongly correlated, introduces the global information of the image, and proposes to mine the semantic correlation between fixation points with an encoder-decoder framework to predict the saccade path.
Disclosure of Invention
Technical problem to be solved
Some studies have addressed saccade path prediction with deep learning, but they focus mainly on better modeling certain physiological mechanisms and overlook the following problems. First, the semantic information of fixation points is correlated: the probability of the next fixation is influenced by the semantics of the current fixation point and of all preceding ones, yet existing models do not model this semantic correlation. Second, the saccade paths of different populations differ in semantic cognition; for example, children and adults understand semantic targets differently, and autistic patients show semantic deficits. Third, the physiological mechanism of the saccade path is ambiguous, and existing models relying on a single physiological model cannot adapt to the saccade paths of different populations. The invention therefore aims to overcome these shortcomings of the prior art. Taking semantic correlation as the measure, it treats saccade path prediction as a search problem over the semantic range of the whole image. A CNN performs semantic extraction on the image, and an encoder-decoder framework mines the semantic correlation of fixation points: the encoder is a CNN that encodes the global information of the image and introduces it as the initial value of the decoder; the decoder is an LSTM network that learns the semantic association between fixation points, modeling the sequential characteristics of the saccade path. Because the model is data-driven, it can learn the semantic information of different populations from their saccade paths and establish the corresponding semantic correlations, without relying on excessive physiological mechanisms.
Technical scheme
A saccade path prediction method based on semantic inference, the method comprising:
constructing an image saccade path training set;
constructing a semantic extractor to extract fixation-point semantic features;
constructing a saccade path prediction model with an encoder-decoder framework;
training the saccade path prediction model;
predicting the saccade path of an image.
In a further aspect of the invention, constructing the image saccade path training set specifically comprises collecting images and their corresponding saccade paths, transforming all images to a uniform size, and computing the pixel points corresponding to the fixation points of the transformed saccade paths.
In a further aspect of the invention, extracting fixation-point semantic features with the semantic extractor specifically comprises: mapping the image into a semantic space with a trained semantic extractor to obtain the semantic vector corresponding to each fixation point.
In a further aspect of the invention, the semantic extractor is a CNN model.
In a further aspect of the invention, constructing the saccade path prediction model of the encoder-decoder framework specifically comprises: the encoder encodes the global information of the image and outputs a coding vector, which is used as the initial value of the decoder, and the decoder learns the fixation-point semantic inference relation.
In a further aspect of the invention, the encoder-decoder framework is a CNN model-LSTM network.
In a further aspect of the invention, training the saccade path prediction model specifically comprises: using the Euclidean distance from the predicted fixation-point semantic vector to the ground-truth fixation-point semantic vector as the loss function and optimizing the encoder-decoder network to minimize this loss.
In a further aspect of the invention, predicting the saccade path comprises: feeding the test image into the optimized network to obtain its saccade path.
A saccade path prediction apparatus based on semantic inference, comprising:
an image processing module for constructing an image saccade path training set;
a feature extraction module for extracting fixation-point semantic features;
a training module for building and training a saccade path prediction model;
a prediction module for predicting a saccade path.
Advantageous effects
The saccade path prediction method based on semantic inference provided by the invention has the following beneficial effects:
1) The invention establishes an end-to-end learning model with an encoder-decoder framework and adopts a data-driven approach, without simulating overly complex physiological phenomena. The encoder uses a CNN to encode the global information of the image, and the decoder exploits the strengths of the LSTM network for sequence modeling, better revealing the dynamic properties of the saccade path.
2) The invention uses a CNN to extract fixation-point semantic features, adopting a network model pre-trained on a large-scale dataset to extract high-level semantic information of the fixation points. Compared with earlier work that trains directly on eye-movement datasets, this addresses the difficulty of extracting semantic information from scarce eye-movement data; compared with raw image pixel blocks, it has stronger semantic abstraction and representation capability. In addition, considering the semantic differences between populations, the deep-learning-based semantic extractor can be trained on samples of a specific population, so that the semantic cognition of that population is extracted and saccade path prediction for different populations is achieved.
3) The invention uses the LSTM network to learn the jump relation between fixation points from the perspective of semantic inference. Compared with earlier work such as that of Thuyen Ngo et al., which correlates fixation-point features with positions and lacks semantic understanding of the fixation points, the invention observes that the semantic information of fixation points is correlated, uses the LSTM network to directly mine the semantic correlation between the preceding fixation points and the next one, and realizes inference from fixation-point semantics to fixation-point semantics, which better matches human cognitive characteristics.
4) Compared with earlier work using a winner-take-all mechanism, the method of the invention accounts for the fact that, owing to the physiological structure of the human eye, saccade paths follow characteristic distributions in amplitude and angle. These distributions are combined with a distance map formed from the inferred semantics and all semantic features of the image to predict the fixation position, which better matches the statistical characteristics of human eye movements.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
FIG. 1 is a general flow chart of an implementation of the present invention.
FIG. 2 is a general framework diagram of the model of the present invention.
FIG. 3 is a schematic diagram of semantic vector sequence extraction in the present invention.
FIG. 4 is a schematic diagram of decoder LSTM network mining semantic relations in the invention.
FIG. 5 illustrates experimental results in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The scheme of the invention is as follows: construct an image saccade path training set, and map each image into a semantic space with a trained CNN to obtain the semantic vector corresponding to each fixation point. Construct a saccade path prediction model with an encoder-decoder framework: the encoder encodes the global information of the image into a coding vector, which serves as the initial value of the decoder, and the decoder learns the fixation-point semantic inference relation. Using the Euclidean distance from the predicted fixation-point semantic vector to the ground-truth fixation-point semantic vector as the loss function, optimize the encoder-decoder network to minimize the loss. Feeding a test image into the optimized network yields its saccade path. As shown in FIG. 1, the implementation comprises the following steps:
(1) Constructing the image saccade path training set
Collect images and their corresponding saccade paths, and transform all images to a uniform size of h × w, where h is the image height and w is the image width. Compute the pixel points corresponding to the fixation points of the transformed saccade paths.
(2) Constructing a semantic extractor to extract fixation-point semantic features
The semantic extractor shown in FIG. 2 implements the mapping from image-block pixels to high-level semantic information. Since CNNs are well suited to image data, a common CNN model such as VGG or ResNet is selected as the semantic extractor and trained on tasks such as saliency prediction and classification to obtain the semantic extraction model; for saccade path prediction on different populations, training on the saliency maps of each population can account for the semantic differences between populations. As shown in FIG. 3, the semantic extraction model converts an h × w × 3 image into an h' × w' × N_s semantic vector map, where N_s is the dimension of the semantic vector; the image compression coefficient is h/h', and the desired compression coefficient can be obtained by reducing the number of network layers. The image blocks containing the fixation points correspond one-to-one with points of the semantic space, so the fixation-point position sequence is converted into a semantic vector sequence.
(3) Constructing the saccade path prediction model of the encoder-decoder framework
The encoder shown in FIG. 2 encodes the global information of the image, converting the image into a fixed-length vector. The encoder adopts a CNN model, such as the common VGG or ResNet; since the images have a uniform size of h × w, appending a fully connected layer to the model yields a coding vector of fixed dimension, and the dimension of the image coding vector can be adjusted. As shown in FIG. 4, the decoder uses the LSTM network to mine the association between the semantic vectors of the fixation points along the saccade path, in order to infer the semantic vector of the next fixation point.
The LSTM network introduces a gating mechanism to control the memorization and updating of information, which effectively alleviates vanishing and exploding gradients and models long-range dependencies, making it well suited to sequence modeling. Since the dimension of the LSTM output vector equals the hidden-layer dimension, a linear layer is appended to the LSTM network so that the output feature dimension matches the input semantic vector dimension.
(4) Training the saccade path prediction model
The input of the encoder-decoder model is the image and its corresponding sequence of saccade path semantic vectors; the output is the predicted fixation-point semantic vector, and the ground truth is the real fixation-point semantic vector at the next moment. One image has multiple fixation-point semantic vector sequences, which are fed into the LSTM network in parallel to speed up computation. The Euclidean distances between each predicted semantic vector and the corresponding real semantic vector are computed and summed as the loss function; with minimizing this loss as the optimization objective, the encoder-decoder model is trained with the Adam (adaptive moment estimation) algorithm.
(5) Predicting the saccade path of an image
After the trained saccade path prediction model is obtained, the test image is fed into the encoder to obtain the image coding vector, which is passed to the decoder, and the test image is also fed into the semantic extractor to map it into the semantic space. The initial point is chosen as the image center, and its semantic vector, obtained through the semantic extractor, is fed into the LSTM as the initial value. The semantic distances between the LSTM output (after the linear layer) and each image block are computed to obtain a distance map, which is normalized so that image blocks at larger semantic distances receive smaller jump probabilities. A Gaussian prior template and an inhibition-of-return (IOR) prior are added, the probability maps are multiplied together, and the point of maximum probability is selected as the jump position. After the predicted fixation position is obtained, its corresponding semantic vector is selected as the input of the LSTM network at the next moment; after T time steps a saccade path sequence of length T is obtained.
Example 1:
step 1, constructing a test image library
Collecting images and collecting corresponding sweep paths of the images, uniformly transforming the sizes of all the images to h multiplied by w, wherein h represents the height of the images, w represents the width of the images, and each image corresponds to a size transformation coefficient rx、ry. Calculating the pixel points corresponding to the fixation points of the converted saccade path, and ordering (q)1,q2,…,qn) Representing a sequence of glance path gaze point coordinates.
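For illustration only (not part of the patent text), the resizing and fixation-point rescaling of this step could be sketched as follows in Python; the function name build_training_pair and the use of OpenCV are assumptions:

    import cv2  # assumption: OpenCV is used for resizing

    def build_training_pair(image, fixations, h=512, w=512):
        # Resize an image to h x w and rescale its fixation coordinates.
        H, W = image.shape[:2]
        r_y, r_x = h / H, w / W                      # size-transformation coefficients
        resized = cv2.resize(image, (w, h))          # cv2.resize takes (width, height)
        # Map each fixation (x, y) from original pixels to the resized image.
        scaled = [(round(x * r_x), round(y * r_y)) for (x, y) in fixations]
        return resized, scaled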
Step 2: construct the semantic extractor to extract fixation-point semantic features
The semantic extractor implements the mapping from image-block pixels to high-level semantic information. Since CNNs are well suited to image data, the common CNN model VGG-16 is selected as the semantic extractor, with model parameters obtained by training on a classification task. With an image-size compression factor h/h' of 8, the first 10 convolutional layers of the VGG-16 model, including 3 pooling layers, are selected. Passing the image I through the selected VGG network gives the feature map:
F_semantic = VGG_semantic(I)
where VGG_semantic denotes the selected VGG network and F_semantic the resulting feature map. The feature map F_semantic has 512 channels; to give the input semantic vector x_t a dimension of N_s, the average is taken over groups of channels along the channel dimension, namely:
S = mean(F_semantic)
where S denotes the semantic vector map corresponding to image I: an image I of size h × w × 3 passes through the semantic extractor to give an h' × w' × N_s semantic vector map. Specifically, the j-th component of the semantic vector x_t of the t-th fixation point is computed as:
x_t(j) = (1/m) Σ_{k=(j-1)m+1}^{jm} f_t(k),  j = 1, …, N_s
where m = 512/d(x_t), d(x_t) = N_s is the dimension of x_t, and f_t is the vector in the feature map corresponding to the t-th fixation point. The semantic extractor thus converts the fixation-point position sequence into a semantic vector sequence, which serves as the input vector sequence of the LSTM.
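As a minimal PyTorch sketch of this step (an illustration, not the patent's reference implementation), the truncated VGG-16 and the channel-group averaging could look as follows; the class name and the use of torchvision's pretrained VGG-16 are assumptions:

    import torch
    import torch.nn as nn
    from torchvision import models

    class SemanticExtractor(nn.Module):
        # First 10 conv layers of VGG-16 (3 max-pools -> stride 8), followed by
        # group-wise channel averaging 512 -> N_s.
        def __init__(self, n_s=64):
            super().__init__()
            vgg = models.vgg16(pretrained=True)
            self.features = vgg.features[:23]   # conv1_1 ... conv4_3 + ReLU
            self.n_s = n_s
            self.m = 512 // n_s                 # channels averaged per component

        def forward(self, img):                 # img: (B, 3, h, w)
            f = self.features(img)              # (B, 512, h', w') with h' = h / 8
            b, _, hp, wp = f.shape
            # x_t(j) = mean of channels (j-1)m+1 ... jm, at every spatial location
            return f.view(b, self.n_s, self.m, hp, wp).mean(dim=2)  # (B, N_s, h', w')

The semantic vector x_t of a fixation point would then be read off this map at the fixation coordinates divided by the compression factor of 8.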
Step 3: construct the saccade path prediction model of the encoder-decoder framework
The encoder encodes the global information of the image, converting the image into a fixed-length vector. The encoder adopts a VGG model; since the images have a uniform size of h × w, appending a fully connected layer yields a coding vector of fixed dimension, whose dimensionality can be adjusted:
F_encoder = VGG_encoder(I)
h_0 = FC(F_encoder)
where F_encoder is the feature map of the image passed through the VGG_encoder network, FC denotes the fully connected layer, and h_0 is the hidden-layer initialization value of the decoder LSTM network.
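A corresponding illustrative sketch of the encoder (the hidden size of 512 and the 16 × 16 feature map for 512 × 512 inputs are assumptions):

    import torch.nn as nn
    from torchvision import models

    class Encoder(nn.Module):
        # Full VGG-16 convolutional backbone plus a fully connected layer
        # producing the decoder's initial hidden state h_0.
        def __init__(self, hidden_size=512):
            super().__init__()
            vgg = models.vgg16(pretrained=True)
            self.backbone = vgg.features                     # 5 max-pools -> stride 32
            self.fc = nn.Linear(512 * 16 * 16, hidden_size)  # 512x512 input -> 16x16 map

        def forward(self, img):                              # img: (B, 3, 512, 512)
            f = self.backbone(img)                           # (B, 512, 16, 16)
            return self.fc(f.flatten(1))                     # fixed-length coding vector h_0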
The decoder uses the LSTM network to mine the association between the semantic vectors of the fixation points along the saccade path, in order to infer the semantic vector of the next fixation point. The LSTM network is computed as follows:
f_t = σ(W_f · [h_{t-1}, x_t] + b_f)
i_t = σ(W_i · [h_{t-1}, x_t] + b_i)
c̃_t = tanh(W_c · [h_{t-1}, x_t] + b_c)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t
o_t = σ(W_o · [h_{t-1}, x_t] + b_o)
h_t = o_t ⊙ tanh(c_t)
where h_t denotes the hidden-layer state at time t, c_t the cell state at time t, and i_t, f_t, c̃_t, o_t the input gate, forget gate, cell gate, and output gate of the LSTM network, respectively; σ(·) denotes the sigmoid function, tanh(·) the hyperbolic tangent function, ⊙ the Hadamard (element-wise) product, and W_f, W_i, W_c, W_o (with biases b_f, b_i, b_c, b_o) are the parameter matrices of the LSTM network. The gating mechanism controls the memorization and updating of information, effectively alleviating vanishing and exploding gradients and modeling long-range dependencies.
Since the dimension of the LSTM output vector equals the hidden-layer dimension, a linear layer is appended to the LSTM network so that the output feature dimension matches the input semantic vector dimension, namely:
y_t = FC(h_t)
In summary, the saccade path prediction model of the encoder-decoder framework outputs a semantic vector at each moment, forming the predicted semantic vector sequence (y_1, y_2, …, y_{n-1}).
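An illustrative sketch of the decoder under the same assumptions; nn.LSTM stands in for the gate equations above, and the linear layer implements y_t = FC(h_t):

    import torch
    import torch.nn as nn

    class Decoder(nn.Module):
        # LSTM over fixation-point semantic vectors, initialized with the encoder
        # output, plus a linear layer mapping h_t back to dimension N_s.
        def __init__(self, n_s=64, hidden_size=512):
            super().__init__()
            self.lstm = nn.LSTM(n_s, hidden_size, batch_first=True)
            self.fc = nn.Linear(hidden_size, n_s)   # y_t = FC(h_t)

        def forward(self, x, h0):                    # x: (B, n-1, N_s), h0: (B, hidden)
            c0 = torch.zeros_like(h0)                # assumed zero initial cell state
            out, _ = self.lstm(x, (h0.unsqueeze(0), c0.unsqueeze(0)))
            return self.fc(out)                      # predicted semantics (B, n-1, N_s)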
Step 4: train the saccade path prediction model
The input of the encoder-decoder model is the image and its corresponding sequence of saccade path semantic vectors; the output is the predicted fixation-point semantic vector, and the ground truth is the real fixation-point semantic vector at the next moment. One image has multiple fixation-point semantic vector sequences, which are fed into the LSTM network in parallel to speed up computation. The Euclidean distances between each predicted semantic vector and the corresponding real semantic vector are computed and summed as the loss function; with minimizing this loss as the optimization objective, the loss is computed from the predicted semantic vector sequence and the real semantic vector sequence over the first n-1 moments:
L = Σ_{i=1}^{n-1} ||y_i - S(q_{i+1})||_2 + α Σ_{i=1}^{n-1} [ Σ_{s∈S, s≠S(q_{i+1})} ||y_i - s||_2 ]^{-1}
where α is a hyper-parameter, s denotes a semantic vector in the semantic vector map S, and q_i denotes the coordinates of the i-th fixation point. The first term of the loss is the distance between the predicted semantic vector sequence and the real semantic vector sequence over the first n-1 moments, and the second term is the reciprocal of the sum of distances between each predicted semantic vector and all non-ground-truth semantic vectors in the image. The model parameters are optimized to minimize the loss, i.e., so that each predicted semantic vector is as close as possible to the true semantic vector and as far as possible from the non-true semantic vectors.
After the loss function is defined, the encoder-decoder model is trained with the Adam (adaptive moment estimation) algorithm.
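A sketch of this loss (illustrative only; the value of alpha and the tensor layout are assumptions):

    def scanpath_loss(y, s_true, s_all, alpha=0.1):
        # y:      (B, n-1, N_s) predicted semantic vectors
        # s_true: (B, n-1, N_s) ground-truth next-fixation semantic vectors
        # s_all:  (B, P, N_s)   all semantic vectors of the image (P = h' * w')
        d_true = (y - s_true).norm(dim=-1)                          # (B, n-1)
        # distances from each prediction to every semantic vector in the image
        d_all = (y.unsqueeze(2) - s_all.unsqueeze(1)).norm(dim=-1)  # (B, n-1, P)
        d_non_true = d_all.sum(dim=2) - d_true                      # exclude the true one
        return d_true.sum() + alpha * (1.0 / d_non_true).sum()

Training would then proceed with torch.optim.Adam over the encoder and decoder parameters.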
Step 5, predicting the saccade path
After the trained saccade path prediction model is obtained, the test image is fed into the encoder to obtain the image coding vector, which is passed to the decoder, and the test image is also fed into the semantic extractor to map it into the semantic space. The initial point is chosen as the image center:
q_0 = (h'/2, w'/2)
and its semantic vector, obtained through the semantic extractor, is fed into the LSTM as the initial value:
x_0 = S(q_0)
The semantic vector y_i is obtained through the LSTM network and the fully connected layer, and the semantic distance between y_i and each image block is computed to obtain a distance map:
D_i(m, n) = ||y_i - s_{m,n}||_2
where m and n are the horizontal and vertical coordinates of the distance map. The distance map is normalized and converted to a probability map:
P_i(m, n) = 1 - D_i(m, n)/D_i^max
where D_i^max = max_{m,n} D_i(m, n) denotes the maximum value in the distance map, so that image blocks at larger semantic distances receive smaller jump probabilities.
A Gaussian prior template is added:
G_i(m, n) = exp(-((m - q_{i,x} - η_x)^2 + (n - q_{i,y} - η_y)^2) / (2β^2))
where η = (η_x, η_y) and β characterize the angular and amplitude distributions of human eye movements, respectively. An inhibition-of-return (IOR) prior is also added:
R_i(m, n) = 0 if ||(m, n) - q_j||_2 ≤ r for some j ≤ i, and R_i(m, n) = 1 otherwise, where r denotes the IOR radius.
These probability maps are multiplied together:
P̂_i(m, n) = P_i(m, n) · G_i(m, n) · R_i(m, n)
The point of maximum probability is selected as the jump position:
q_{i+1} = argmax_{(m,n)} P̂_i(m, n)
After the predicted fixation position is obtained, its corresponding semantic vector is selected as the input of the LSTM network at the next moment; after T time steps a saccade path sequence of length T is obtained.
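Finally, an illustrative sketch of the Step 5 roll-out, tying the pieces together; gaussian_prior is a hypothetical helper standing in for the Gaussian template G_i (its form and beta are assumptions), and r is the IOR radius in semantic-map cells (4 cells at stride 8 corresponds to min(h, w) × 1/16 = 32 pixels):

    import torch

    def gaussian_prior(q, hp, wp, beta=8.0):
        # Hypothetical Gaussian template centred on the current fixation q.
        yy = torch.arange(hp).view(-1, 1)               # row coordinates
        xx = torch.arange(wp).view(1, -1)               # column coordinates
        return torch.exp(-((yy - q[0]) ** 2 + (xx - q[1]) ** 2) / (2 * beta ** 2))

    def predict_scanpath(img, extractor, encoder, decoder, T=10, r=4):
        s_map = extractor(img)[0]                       # semantic map (N_s, h', w')
        _, hp, wp = s_map.shape
        h = encoder(img).unsqueeze(0)                   # initial hidden state (1, 1, hidden)
        c = torch.zeros_like(h)
        yy = torch.arange(hp).view(-1, 1)
        xx = torch.arange(wp).view(1, -1)
        q = (hp // 2, wp // 2)                          # start at the image centre
        path = [q]
        for _ in range(T - 1):
            x = s_map[:, q[0], q[1]].view(1, 1, -1)     # semantics of current fixation
            out, (h, c) = decoder.lstm(x, (h, c))
            y = decoder.fc(out)[0, 0]                   # inferred next semantics y_i
            d = (s_map - y.view(-1, 1, 1)).norm(dim=0)  # distance map D_i
            p = 1.0 - d / d.max()                       # normalised probability map P_i
            p = p * gaussian_prior(q, hp, wp)           # Gaussian prior G_i
            for (qy, qx) in path:                       # inhibition of return R_i
                p[((yy - qy) ** 2 + (xx - qx) ** 2) <= r ** 2] = 0.0
            q = divmod(int(p.argmax()), wp)             # jump to max-probability cell
            path.append(q)
        return path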
The method is implemented on the Ubuntu 16.04.4 operating system with the PyTorch 1.6 deep learning framework. Input images are resized to a uniform h × w of 512 × 512, and the semantic vector dimension N_s is 64. The model is built according to the above steps and trained on the training set to obtain the model parameters, and the saccade paths of the test-set images are then predicted: the initial fixation point is selected at the image center, the predicted saccade path length T is 10, and the IOR prior radius is min(h, w) × 1/16. The saccade paths predicted by the method are visualized in FIG. 5.
The saccade paths predicted by the method of the invention are compared with those predicted by the ICIP method of Thuyen Ngo et al.; the visualized results show that the method conforms to the patterns of human saccade paths. Evaluation uses three metrics: MultiMatch (MM), Hausdorff Distance (HD), and Mean Minimum Distance (MMD); higher MM scores are better, while lower HD and MMD scores are better. On the MM metric, the method scores 0.9424 (Shape), 0.6616 (Direction), 0.9440 (Length), and 0.8423 (Position), versus 0.9108, 0.6421, 0.9142, and 0.7822 for the method of Thuyen Ngo et al.; on the HD and MMD metrics, the method scores 121.2675 and 95.5071, versus 204.6523 and 144.9966 for Thuyen Ngo et al. By these evaluation criteria, the method of the invention is objectively superior to the other methods.
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications or substitutions can be easily made by those skilled in the art within the technical scope of the present disclosure.

Claims (9)

1. A saccade path prediction method based on semantic inference, the method comprising:
constructing an image saccade path training set;
constructing a semantic extractor to extract fixation-point semantic features;
constructing a saccade path prediction model with an encoder-decoder framework;
training the saccade path prediction model;
predicting the saccade path of an image.
2. The saccade path prediction method based on semantic inference of claim 1, wherein constructing the image saccade path training set comprises collecting images and their corresponding saccade paths, transforming all images to a uniform size, and computing the pixel points corresponding to the fixation points of the transformed saccade paths.
3. The saccade path prediction method based on semantic inference of claim 1, wherein constructing the semantic extractor to extract fixation-point semantic features specifically comprises: mapping the image into a semantic space with a trained semantic extractor to obtain the semantic vector corresponding to each fixation point.
4. The saccade path prediction method based on semantic inference of claim 3, wherein the semantic extractor is a CNN model.
5. The saccade path prediction method based on semantic inference of claim 1, wherein constructing the saccade path prediction model of the encoder-decoder framework specifically comprises: the encoder encodes the global information of the image and outputs a coding vector, which is used as the initial value of the decoder, and the decoder learns the fixation-point semantic inference relation.
6. The saccade path prediction method based on semantic inference of claim 5, wherein the encoder-decoder framework is a CNN model-LSTM network.
7. The saccade path prediction method based on semantic inference of claim 1, wherein training the saccade path prediction model specifically comprises: using the Euclidean distance from the predicted fixation-point semantic vector to the ground-truth fixation-point semantic vector as the loss function and optimizing the encoder-decoder network to minimize this loss.
8. The saccade path prediction method based on semantic inference of claim 1, wherein predicting the saccade path of the image comprises: feeding the test image into the optimized network to obtain its saccade path.
9. A saccade path prediction apparatus based on semantic inference, comprising:
an image processing module for constructing an image saccade path training set;
a feature extraction module for extracting fixation-point semantic features;
a training module for building and training a saccade path prediction model;
a prediction module for predicting a saccade path.
CN202110652817.7A 2021-06-11 2021-06-11 Glance path prediction method based on semantic inference Active CN113313123B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110652817.7A CN113313123B (en) 2021-06-11 2021-06-11 Glance path prediction method based on semantic inference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110652817.7A CN113313123B (en) 2021-06-11 2021-06-11 Glance path prediction method based on semantic inference

Publications (2)

Publication Number Publication Date
CN113313123A true CN113313123A (en) 2021-08-27
CN113313123B CN113313123B (en) 2024-04-02

Family

ID=77378522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110652817.7A Active CN113313123B (en) 2021-06-11 2021-06-11 Glance path prediction method based on semantic inference

Country Status (1)

Country Link
CN (1) CN113313123B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762393A (en) * 2021-09-08 2021-12-07 杭州网易智企科技有限公司 Model training method, gaze point detection method, medium, device, and computing device
CN115037962A (en) * 2022-05-31 2022-09-09 咪咕视讯科技有限公司 Video adaptive transmission method, device, terminal equipment and storage medium
CN116343012A (en) * 2023-05-29 2023-06-27 江西财经大学 Panoramic image glance path prediction method based on depth Markov model
CN116563524A (en) * 2023-06-28 2023-08-08 南京航空航天大学 Glance path prediction method based on multi-vision memory unit

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447096A (en) * 2018-04-13 2019-03-08 西安电子科技大学 A kind of pan path prediction technique and device based on machine learning
US20190080623A1 (en) * 2017-09-14 2019-03-14 Massachusetts Institute Of Technology Eye Tracking As A Language Proficiency Test
US20190096125A1 (en) * 2017-09-28 2019-03-28 Nec Laboratories America, Inc. Generating occlusion-aware bird eye view representations of complex road scenes
CN110298303A (en) * 2019-06-27 2019-10-01 西北工业大学 A kind of crowd recognition method based on the long pan of memory network in short-term path learning

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190080623A1 (en) * 2017-09-14 2019-03-14 Massachusetts Institute Of Technology Eye Tracking As A Language Proficiency Test
US20190096125A1 (en) * 2017-09-28 2019-03-28 Nec Laboratories America, Inc. Generating occlusion-aware bird eye view representations of complex road scenes
CN109447096A (en) * 2018-04-13 2019-03-08 西安电子科技大学 A kind of pan path prediction technique and device based on machine learning
CN110298303A (en) * 2019-06-27 2019-10-01 西北工业大学 A kind of crowd recognition method based on the long pan of memory network in short-term path learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Li Na; Zhao Xinbo: "A visual attention model integrating semantic object features," Journal of Harbin Institute of Technology (哈尔滨工业大学学报), no. 05 *
Gong Sihong: "A new method for predicting human-eye saccade paths," Electronic Technology & Software Engineering (电子技术与软件工程), no. 03 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762393A (en) * 2021-09-08 2021-12-07 杭州网易智企科技有限公司 Model training method, gaze point detection method, medium, device, and computing device
CN115037962A (en) * 2022-05-31 2022-09-09 咪咕视讯科技有限公司 Video adaptive transmission method, device, terminal equipment and storage medium
CN115037962B (en) * 2022-05-31 2024-03-12 咪咕视讯科技有限公司 Video self-adaptive transmission method, device, terminal equipment and storage medium
CN116343012A (en) * 2023-05-29 2023-06-27 江西财经大学 Panoramic image glance path prediction method based on depth Markov model
CN116343012B (en) * 2023-05-29 2023-07-21 江西财经大学 Panoramic image glance path prediction method based on depth Markov model
CN116563524A (en) * 2023-06-28 2023-08-08 南京航空航天大学 Glance path prediction method based on multi-vision memory unit
CN116563524B (en) * 2023-06-28 2023-09-29 南京航空航天大学 Glance path prediction method based on multi-vision memory unit

Also Published As

Publication number Publication date
CN113313123B (en) 2024-04-02

Similar Documents

Publication Publication Date Title
Zheng et al. A novel background subtraction algorithm based on parallel vision and Bayesian GANs
CN106407889B (en) Method for recognizing human body interaction in video based on optical flow graph deep learning model
CN113313123A (en) Semantic inference based glance path prediction method
CN110210429B (en) Method for generating network based on optical flow, image and motion confrontation to improve recognition accuracy rate of anxiety, depression and angry expression
CN112949647B (en) Three-dimensional scene description method and device, electronic equipment and storage medium
CN111160294B (en) Gait recognition method based on graph convolution network
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN110298303B (en) Crowd identification method based on long-time memory network glance path learning
Harley et al. Learning from unlabelled videos using contrastive predictive neural 3d mapping
CN110334656A (en) Multi-source Remote Sensing Images Clean water withdraw method and device based on information source probability weight
CN111626152B (en) Space-time line-of-sight direction estimation prototype design method based on Few-shot
Xiong et al. Contextual sa-attention convolutional LSTM for precipitation nowcasting: A spatiotemporal sequence forecasting view
CN114419732A (en) HRNet human body posture identification method based on attention mechanism optimization
Ning et al. Deep Spatial/temporal-level feature engineering for Tennis-based action recognition
CN116524593A (en) Dynamic gesture recognition method, system, equipment and medium
CN116543021A (en) Siamese network video single-target tracking method based on feature fusion
Huang et al. Football players’ shooting posture norm based on deep learning in sports event video
CN115359550A (en) Gait emotion recognition method and device based on Transformer, electronic device and storage medium
CN114140524A (en) Closed loop detection system and method for multi-scale feature fusion
Ammar et al. Comparative Study of latest CNN based Optical Flow Estimation
CN113760091A (en) Mobile terminal perception computing technology and Internet of things technology application system
Zhou et al. Motion balance ability detection based on video analysis in virtual reality environment
Jin A three-dimensional animation character dance movement model based on the edge distance random matrix
Li A new physical posture recognition method based on feature complement-oriented convolutional neural network
van Staden et al. An Evaluation of YOLO-Based Algorithms for Hand Detection in the Kitchen

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant