CN113313123A - Semantic inference based glance path prediction method - Google Patents
Semantic inference based glance path prediction method
- Publication number: CN113313123A (application CN202110652817.7A)
- Authority: CN (China)
- Prior art keywords: semantic, image, glance, path, decoder
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V10/267—Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N3/045—Combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Learning methods
Abstract
The invention relates to a glance path prediction method based on semantic inference, belonging to the field of image glance path prediction. An image glance path training set is constructed, and a trained CNN maps the image to a semantic space to obtain the semantic vector corresponding to each fixation point. A glance path prediction model with an encoder-decoder framework is constructed: the encoder encodes the global information of the image into a coding vector, the coding vector serves as the initial state of the decoder, and the decoder learns the semantic inference relation between fixation points. The Euclidean distance from the predicted fixation-point semantic vector to the true fixation-point semantic vector is used as the loss function, and the encoder-decoder network is optimized to minimize this loss. A test image is then input into the optimized network to obtain its glance path.
Description
Technical Field
The invention relates to the field of image glance path prediction, and in particular to a glance path prediction method based on semantic inference.
Background
At every moment the human eye receives far more visual data than the brain can process, yet the human visual system can locate important regions in complex physical scenes, allowing humans to rapidly acquire useful information from massive visual data with limited computing resources. Studying the human visual system is therefore of great value for rapidly extracting useful information from large amounts of visual data. Existing research falls into two areas: visual saliency, which is the probability of fixation on visual data and represents the static characteristics of vision; and the visual glance (saccade) path, which is the spatio-temporal sequence of changes of the human fixation point and thus reflects both the static and the dynamic characteristics of vision. Research on the glance path therefore helps to understand the working mechanism of human vision more fully, and has broad application prospects in fields such as recommendation systems, crowd identification, and virtual reality.
The earliest glance path prediction was proposed by Itti et al. in "A Model of Saliency-Based Visual Attention for Rapid Scene Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254-1259, 1998. It used feature integration theory to interpret the visual search strategy: the input image is decomposed into a series of shallow feature maps to predict image saliency, and an inhibition-of-return mechanism together with a winner-take-all mechanism determines the position of the next saccade point. Giuseppe Boccignone et al. then used a Langevin-type equation in "Modelling gaze shift as a constrained random walk," Phys. A, Statist. Mech. Appl., vol. 331, no. 1, pp. 207-218, 2004, to model the glance path as a constrained random walk over a saliency field, where saccade lengths obey a Levy distribution. These traditional methods do not extract high-level semantic information from the image when predicting the saliency map, so semantic targets are predicted inaccurately, and predicting fixation points from the saliency map depends on the accuracy of the underlying physiological modeling.
With the development of deep learning methods such as the Convolutional Neural Network (CNN) and the Long Short-Term Memory (LSTM) network, researchers have turned to predicting glance paths with deep learning. Thuyen Ngo et al. (T. Ngo and B. S. Manjunath, "Saccade gaze prediction using a recurrent neural network," 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 3435-3439) proposed using a CNN to extract image features, feeding fixation-point features into an LSTM network, and predicting the probability of the next fixation location with the LSTM. However, that LSTM mines the relation between fixation-point features and the next fixation location, modeling only a statistical property of fixation positions, and lacks semantic understanding of the fixation point. A machine-learning-based glance path prediction method and device (publication number CN109447096A) adopts a CNN as encoder and an LSTM as decoder: the encoder extracts features of the fixation point, the decoder directly predicts fixation positions, and an attention mechanism lets each decoding step focus on different information. This method models the positional correlation between fixation points, but the fixation-point features lack feature information from other positions, so the next fixation position is difficult to predict accurately from fixation-point features alone. Moreover, the glance paths of different populations differ, the physiological mechanism of the glance path is not fully understood, and existing models that rely on a single physiological model cannot adapt to the glance paths of different populations.
Therefore, considering that fixation-point semantics are strongly correlated, the invention introduces the global information of the image and proposes to use an encoder-decoder framework to mine the semantic correlation between fixation points in order to predict the glance path.
Disclosure of Invention
Technical problem to be solved
Some studies address glance path prediction with deep learning, but they focus mainly on better modeling certain physiological mechanisms and ignore the following problems. First, the semantic information of fixation points is correlated: the probability of the next fixation is influenced by the semantics of the current fixation point and of all previous ones, and existing models lack modeling of this semantic correlation. Second, the glance paths of different populations differ in semantic cognition; for example, children and adults understand semantic targets differently, and autism patients show semantic deficits. Third, the physiological mechanism of the glance path is ambiguous, and existing models relying on a single physiological model cannot adapt to the glance paths of different populations. The invention therefore aims to overcome these shortcomings of the prior art. Taking semantic correlation as the measure, it treats glance path prediction as a search problem over the semantic space of the whole image: a CNN performs semantic extraction on the image, and an encoder-decoder framework mines the semantic correlation between fixation points. The encoder is a CNN that encodes the global information of the image, which is introduced as the initial state of the decoder; the decoder is an LSTM network that learns the semantic association between fixation points, modeling the sequential characteristics of the glance path. Because the model is data driven, it can learn the semantic information of different populations from their glance paths and establish the corresponding semantic correlations, without relying on excessive physiological mechanisms.
Technical scheme
A method for glance path prediction based on semantic inference, the method comprising:
constructing an image glance path training set;
constructing a semantic extractor to extract the semantic features of fixation points;
constructing a glance path prediction model with an encoder-decoder framework;
training the glance path prediction model;
predicting the glance path of an image.
In a further technical scheme of the invention, constructing the training set specifically comprises: collecting images, collecting the glance paths corresponding to the images, resizing all images to a consistent size, and computing the pixel coordinates of the fixation points of the resized glance paths.
In a further technical scheme of the invention, extracting the semantic features of fixation points with the semantic extractor specifically comprises: mapping the image to a semantic space with a trained semantic extractor to obtain the semantic vector corresponding to each fixation point.
In a further technical scheme of the invention, the semantic extractor is a CNN model.
In a further technical scheme of the invention, constructing the glance path prediction model of the encoder-decoder framework specifically comprises: the encoder encodes the global information of the image and outputs a coding vector, the coding vector serves as the initial state of the decoder, and the decoder learns the semantic inference relation between fixation points.
In a further technical scheme of the invention, the encoder-decoder framework is a CNN model with an LSTM network.
In a further technical scheme of the invention, training the glance path prediction model specifically comprises: using the Euclidean distance from the predicted fixation-point semantic vector to the true fixation-point semantic vector as the loss function, and optimizing the encoder-decoder network to minimize the loss function.
In a further technical scheme of the invention, predicting the glance path specifically comprises: inputting a test image into the optimized network to obtain its glance path.
A semantic inference based glance path prediction apparatus, comprising:
an image processing module for constructing an image glance path training set;
a feature extraction module for extracting the semantic features of fixation points;
a training module for establishing and training a glance path prediction model;
a prediction module for predicting the glance path.
Advantageous effects
The glance path prediction method based on semantic inference provided by the invention has the following beneficial effects:
1) The invention establishes an end-to-end learning model with an encoder-decoder framework and adopts a data-driven approach, without simulating overly complex physiological phenomena. The encoder uses a CNN to encode the global information of the image, and the decoder uses the strength of the LSTM network in sequence modeling, better revealing the dynamic properties of the glance path.
2) The invention uses a CNN to extract the semantic features of fixation points and adopts a network model pre-trained on a large-scale dataset to extract high-level semantic information of the fixation point. Compared with previous research that trains directly on eye movement datasets, this avoids the problem that eye movement data are scarce and semantic information is hard to extract; compared with raw image pixel blocks, it has stronger semantic abstraction and representation capability. In addition, considering the semantic differences between populations, the deep-learning-based semantic extractor can be trained on samples from a specific population to capture that population's semantic cognition, realizing glance path prediction for different populations.
3) The invention uses an LSTM network to learn the jump relation between fixation points from the perspective of semantic inference. Compared with previous research such as that of Thuyen Ngo et al., which correlates fixation-point features with positions and lacks semantic understanding of the fixation point, the invention observes that the semantic information of fixation points is correlated, uses the LSTM network to directly mine the semantic correlation between the preceding fixation points and the next one, and realizes inference from fixation-point semantics to fixation-point semantics, which better matches human cognitive characteristics.
4) Owing to the physiological structure of the human eye, saccades follow certain distributions in amplitude and angle. Compared with previous research using a winner-take-all mechanism, the method proposed by the invention takes these distributions into account and predicts the fixation position by combining the inferred semantics with a distance map over all semantic features of the image, which better matches the statistical characteristics of human eye movements.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
Figure 1 is a general flow chart of an implementation of the present invention.
FIG. 2 is a general framework diagram of the model of the present invention.
FIG. 3 is a schematic diagram of semantic vector sequence extraction in the present invention.
FIG. 4 is a schematic diagram of decoder LSTM network mining semantic relations in the invention.
FIG. 5 illustrates experimental results in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The scheme of the invention is as follows: construct an image glance path training set, and map the image to a semantic space with a trained CNN to obtain the semantic vector corresponding to each fixation point. Construct a glance path prediction model with an encoder-decoder framework: the encoder encodes the global information of the image into a coding vector, the coding vector serves as the initial state of the decoder, and the decoder learns the semantic inference relation between fixation points. Use the Euclidean distance from the predicted fixation-point semantic vector to the true fixation-point semantic vector as the loss function, and optimize the encoder-decoder network to minimize it. Input a test image into the optimized network to obtain its glance path. As shown in fig. 1, the implementation comprises the following steps:
(1) Constructing the image glance path training set
Collect images and their corresponding glance paths, and uniformly resize all images to h × w, where h is the image height and w the image width. Compute the pixel coordinates of the fixation points of the resized glance paths.
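As a minimal sketch of this preprocessing step (the target size, function name, and rounding policy are illustrative assumptions, not specified in the patent), the fixation coordinates can be rescaled together with the image:

```python
import numpy as np

def rescale_fixations(fixations, orig_hw, target_hw=(512, 512)):
    """Rescale fixation points (row, col) from the original image size
    to the common target size h x w, as required by step (1)."""
    h0, w0 = orig_hw
    h, w = target_hw
    ry, rx = h / h0, w / w0          # size-transformation coefficients
    pts = np.asarray(fixations, dtype=float)
    scaled = np.stack([pts[:, 0] * ry, pts[:, 1] * rx], axis=1)
    # round to the nearest pixel and clip into the resized image
    return np.clip(np.rint(scaled).astype(int), [0, 0], [h - 1, w - 1])
```

For example, a fixation at (100, 200) in a 600 × 800 image maps to (85, 128) after resizing to 512 × 512.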
(2) Constructing the semantic extractor to extract fixation-point semantic features
The semantic extractor shown in FIG. 2 implements the mapping from image-block pixels to high-level semantic information. Since CNNs are suited to processing image data, a common CNN model such as VGG or ResNet is chosen as the semantic extractor, and the model is trained on tasks such as saliency prediction or classification to obtain the semantic extraction model; if saliency map prediction for different populations is adopted, the semantic differences between populations in glance path prediction can be handled. As shown in FIG. 3, the semantic extraction model converts an h × w × 3 image into an h' × w' × N_s semantic vector map, where N_s is the dimension of the semantic vector; the image compression coefficient is h/h', and the desired compression coefficient can be obtained by reducing the number of network layers. The image blocks containing the fixation points correspond one-to-one to points in the semantic space, so the fixation position sequence is converted into a semantic vector sequence.
(3) Constructing the glance path prediction model of the encoder-decoder framework
The encoder shown in fig. 2 encodes the global information of the image, converting the image into a fixed-length vector. The encoder adopts a CNN model (common choices are VGG, ResNet, etc.); because all images share the size h × w, adding a fully connected layer at the end of the model yields a coding vector of fixed dimension, and this dimension can be adjusted. As shown in fig. 4, the decoder uses an LSTM network to mine the association between the semantic vectors of the fixation points along the glance path, inferring the semantic vector of the next fixation point.
The LSTM network introduces a gating mechanism to control what information is remembered and updated; it effectively alleviates the vanishing- and exploding-gradient problems, models long-range dependencies, and is suited to sequence modeling. Because the dimension of the LSTM output vector is the hidden-layer dimension, a linear layer must be added after the LSTM network so that the output feature dimension equals the input semantic vector dimension.
(4) Training the glance path prediction model
The input of the encoder-decoder model is an image and its corresponding glance path semantic vector sequence; the output is the predicted fixation-point semantic vector, and the ground truth is the real fixation-point semantic vector at the next moment. One image has several fixation-point semantic vector sequences, and these sequences are fed into the LSTM network in parallel to speed up model computation. The Euclidean distances between each predicted semantic vector and the corresponding real semantic vector are computed and summed as the loss function; with minimizing this loss as the optimization objective, the encoder-decoder model is trained with the Adam (adaptive moment estimation) algorithm.
(5) Predicting the glance path of an image
After the trained glance path prediction model is obtained, the test image is input into the encoder to obtain the image coding vector, which initializes the decoder, and the test image is also input into the semantic extractor to be mapped to the semantic space. The image center point is chosen as the initial point, and its semantic vector, obtained through the semantic extractor, is fed into the LSTM as the initial input. The semantic distance between the LSTM output (after the linear layer) and each image block is computed to obtain a distance map; the distance map is normalized into probabilities so that image blocks at larger distances have smaller jump probabilities; a Gaussian prior template and an Inhibition of Return (IOR) prior are added, the probability maps are multiplied, and the point of maximum probability is selected as the jump position. After the predicted fixation position is obtained, its corresponding semantic vector is selected as the LSTM input at the next moment; after T time steps, a glance path sequence of length T is obtained.
Example 1:
Step 1, constructing the image glance path training set
Collect images and their corresponding glance paths, and uniformly resize all images to h × w, where h is the image height and w the image width; each image corresponds to size-transformation coefficients r_x, r_y. Compute the pixel coordinates of the fixation points of the resized glance paths, and let (q_1, q_2, …, q_n) denote the sequence of fixation coordinates along the glance path.
Step 2, constructing the semantic extractor to extract fixation-point semantic features
The semantic extractor implements the mapping from image-block pixels to high-level semantic information. Since CNNs are suited to processing image data, the common CNN model VGG-16 is chosen as the semantic extractor, with model parameters obtained by training on a classification task. With a compression coefficient h/h' of 8, the first 10 convolutional layers of the VGG-16 model are selected, including 3 pooling layers. The image I passes through the selected VGG network to obtain a feature map:
F_semantic = VGG_semantic(I)
where VGG_semantic denotes the selected VGG network and F_semantic the resulting feature map. F_semantic has 512 channels; to give the input semantic vector x_t a dimension of N_s, averages are taken over groups of channels along the channel dimension, i.e.:
S = mean(F_semantic)
where S denotes the semantic vector map corresponding to image I: an image I of size h × w × 3 yields through the semantic extractor an h' × w' × N_s semantic vector map. Specifically, the semantic vector x_t of the t-th fixation point is computed as
x_t(j) = (1/m) Σ_{k=(j−1)m+1}^{jm} f_t(k),  j = 1, …, N_s
where m = 512/d(x_t) = 512/N_s, and f_t is the vector in the feature map corresponding to the t-th fixation point. The semantic extractor thus converts the fixation position sequence into a semantic vector sequence, which serves as the input vector sequence of the LSTM.
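A minimal numpy sketch of this grouped channel averaging (the function name and the grouping into consecutive blocks of m channels are illustrative assumptions; 512 channels are reduced to N_s = 64 as in the embodiment):

```python
import numpy as np

def semantic_vector(f_t, n_s=64):
    """Reduce a 512-dim feature vector f_t to an N_s-dim semantic vector
    by averaging consecutive groups of m = 512 / N_s channels."""
    c = f_t.shape[0]
    assert c % n_s == 0, "channel count must be divisible by N_s"
    m = c // n_s
    return f_t.reshape(n_s, m).mean(axis=1)
```

Applied at every spatial position of the 512-channel feature map, this yields the h' × w' × N_s semantic vector map S.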
Step 3, constructing the glance path prediction model of the encoder-decoder framework
The encoder encodes the global information of the image, converting the image into a fixed-length vector. The encoder adopts a VGG model; because all images share the size h × w, adding a fully connected layer at the end of the model yields a coding vector of fixed dimension, whose dimensionality can be adjusted:
F_encoder = VGG_encoder(I)
h_0 = FC(F_encoder)
where F_encoder is the feature map of the image through the VGG_encoder network, FC denotes the fully connected layer, and h_0 is the hidden-state initialization value of the decoder LSTM network.
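A minimal numpy sketch of the fully connected projection that produces the decoder's initial hidden state (names and shapes are illustrative assumptions; in practice this is a trained layer, e.g. torch.nn.Linear):

```python
import numpy as np

def encode_initial_state(f_encoder, W_fc, b_fc):
    """Flatten the encoder feature map and project it with a fully
    connected layer to the decoder LSTM's hidden dimension, giving h_0."""
    v = f_encoder.reshape(-1)           # flatten the h' x w' x C feature map
    return W_fc @ v + b_fc              # fixed-dimension coding vector h_0
```

Because all images are resized to h × w, the flattened feature map always has the same length, so a single weight matrix W_fc suffices.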
The decoder uses the LSTM network to mine the association between the semantic vectors of the fixation points along the glance path, inferring the semantic vector of the next fixation point. The LSTM network is computed as follows:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ⊙ tanh(c_t)
where h_t is the hidden state at time t, c_t the cell state at time t, and f_t, i_t, c̃_t, o_t respectively the forget gate, input gate, cell candidate, and output gate of the LSTM network; σ(·) denotes the sigmoid function, tanh(·) the hyperbolic tangent, ⊙ the Hadamard product, and W_f, W_i, W_c, W_o are the parameter matrices of the LSTM network. The gating mechanism controls what information is remembered and updated, effectively alleviating the vanishing- and exploding-gradient problems and modeling long-range dependencies.
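A minimal numpy sketch of one LSTM step following these equations (the stacked weight layout and the concatenation order [h_{t−1}, x_t] are the standard convention assumed here, not mandated by the patent):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_{t-1}, x_t] to the four stacked gate
    pre-activations (f, i, c-candidate, o), each of hidden size H."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[:H])            # forget gate
    i = sigmoid(z[H:2*H])         # input gate
    c_hat = np.tanh(z[2*H:3*H])   # cell candidate
    o = sigmoid(z[3*H:])          # output gate
    c = f * c_prev + i * c_hat    # new cell state
    h = o * np.tanh(c)            # new hidden state
    return h, c
```

Iterating this step over the semantic vector sequence (x_1, x_2, …), with h_0 initialized from the image coding vector, realizes the decoder.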
Because the output vector dimension of the LSTM network is the hidden-layer dimension, a linear layer must be added after the LSTM network so that the output feature dimension equals the input semantic vector dimension, i.e.:
y_t = FC(h_t)
In summary, the glance path prediction model of the encoder-decoder framework produces a semantic vector output at each moment, forming the predicted semantic vector sequence (y_1, y_2, …, y_n).
Step 4, training a glance path prediction model
The input of the encoder-decoder model is the sequence of glance path semantic vectors for the image and its corresponding, the output is the predicted gaze point semantic vector, and the true value is the real gaze point semantic vector at the next moment. One image has a plurality of fixation point semantic vector sequences, and the plurality of semantic vector sequences are input into the LSTM network in parallel to improve the model calculation speed. Calculating Euclidean distances between each predicted semantic vector and each real semantic vector and summing the Euclidean distances to serve as a loss function, taking a minimized loss function as an optimization target, and calculating the loss function according to the predicted semantic vector sequence and the real semantic vector sequence at the previous n-1 moment:
where α is a hyper-parameter, S represents a semantic vector in a semantic vector graph S, qiIndicating the coordinates corresponding to the ith gaze point. And the first term in the loss function represents the distance between the predicted semantic vector sequence and the real semantic vector sequence at the previous n-1 moment, and the second term represents the reciprocal of the sum of the distances between the predicted semantic vector and all non-real semantic vector sequences in the image. The model parameters are optimized such that the loss function is minimized, i.e. such that the predicted semantic vector is as close as possible to the true semantic vector and as far as possible from the non-true semantic vector.
After the loss function is defined, the Adam (Adaptive Moment Estimation) algorithm is adopted to train the encoder-decoder model.
Step 5, predicting the saccade path
After the trained glance path prediction model is obtained, the test image is fed to the encoder to produce the image coding vector, which is passed to the decoder; the test image is also fed to the semantic extractor, mapping it to the semantic space. The initial point q_0 is chosen as the image center point, and its semantic vector, obtained through the semantic extractor, is fed to the LSTM as the initial value:
x_0 = S(q_0)
The semantic vector y_i is obtained through the LSTM network and the fully connected layer, and a distance map is computed from the semantic distances between y_i and the image blocks:
D_i(m, n) = ||y_i - s_{m,n}||_2
where m and n are the horizontal and vertical coordinates of the distance map, respectively. The distance map is then normalized and converted into a probability map:
A Gaussian prior template is then added:
where η = (η_x, η_y) and β represent the eye-movement angle difference and amplitude difference, respectively. An inhibition-of-return (IOR) prior is also added:
These probability maps are multiplied together:
and the point with the maximum probability value is selected as the saccade landing position:
After the predicted gaze point position is obtained, the semantic vector corresponding to that position is used as the LSTM input at the next moment; after T time steps, a glance path sequence of length T is obtained.
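The per-step selection described above (distance map, probabilization, Gaussian prior, IOR prior, maximum-probability point) can be sketched as follows. Since the patent's normalization and prior formulas are not reproduced in the extracted text, the exact forms used here (exponential probabilization, isotropic Gaussian, hard IOR disc) are assumptions:

```python
import numpy as np

def predict_next_fixation(y_i, S, prev_fixations, eta=(0.0, 0.0), beta=8.0,
                          ior_radius=4):
    """One saccade-selection step (a sketch; normalization and prior forms
    are assumed). y_i: predicted semantic vector; S: H x W x d semantic
    vector map; prev_fixations: list of (row, col) fixations visited."""
    H, W, _ = S.shape
    # distance map D_i(m, n) = ||y_i - s_{m,n}||_2
    D = np.linalg.norm(S - y_i, axis=-1)
    # normalize and turn into a probability map (small distance -> high prob.)
    P = np.exp(-(D - D.min()) / (D.max() - D.min() + 1e-8))
    P /= P.sum()
    # Gaussian prior centred on the last fixation, offset eta, width beta
    r0, c0 = prev_fixations[-1]
    rr, cc = np.mgrid[0:H, 0:W]
    G = np.exp(-(((rr - r0 - eta[0]) ** 2) + ((cc - c0 - eta[1]) ** 2))
               / (2 * beta ** 2))
    # inhibition-of-return: suppress a disc around every earlier fixation
    IOR = np.ones((H, W))
    for (r, c) in prev_fixations:
        IOR[(rr - r) ** 2 + (cc - c) ** 2 <= ior_radius ** 2] = 0.0
    # multiply the maps and take the maximum-probability point
    prob = P * G * IOR
    return np.unravel_index(np.argmax(prob), prob.shape)
```

Iterating this function T times, feeding the semantic vector at each predicted fixation back into the LSTM, yields a glance path of length T as described above.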
The method is implemented on the Ubuntu 16.04.4 operating system with the PyTorch 1.6 deep learning framework. Input images are resized to a uniform size of h × w = 512 × 512, and the dimensionality of the semantic vector is 64. The model is built according to the above steps and trained on the training set to obtain the model parameters; glance paths are then predicted on the test-set images. The initial fixation point is selected at the image center, the predicted glance path length is T = 10, and the IOR prior radius is min(h, w) × 1/16. The visualized glance path prediction results of the method are shown in FIG. 5.
The glance paths predicted by the method of the invention are compared with those predicted by the method of Thuyen Ngo et al. (ICIP); the visualized results show that the method conforms to the pattern of human glance paths. Evaluation uses three indices: MultiMatch (MM), Hausdorff Distance (HD) and Mean Minimum Distance (MMD), where a higher MM score is better and lower HD and MMD scores are better. The scores are:

Metric | Method of the invention | Thuyen Ngo et al.
---|---|---
MM Shape | 0.9424 | 0.9108
MM Direction | 0.6616 | 0.6421
MM Length | 0.9440 | 0.9142
MM Position | 0.8423 | 0.7822
HD | 121.2675 | 204.6523
MMD | 95.5071 | 144.9966

From these evaluation criteria it can be concluded that the method of the invention outperforms the compared method in objective evaluation.
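The HD and MMD scores above can be computed for two fixation sequences with straightforward NumPy. MMD definitions vary slightly across papers; the symmetric form below is one common choice and is an assumption, not necessarily the exact formula used here:

```python
import numpy as np

def hausdorff_distance(P, Q):
    """Hausdorff distance between two scanpaths P and Q, each an
    (n, 2) array of fixation coordinates."""
    # pairwise Euclidean distances between every fixation in P and Q
    D = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    # max over both directed nearest-neighbour distances
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def mean_minimum_distance(P, Q):
    """Mean minimum distance: average of the mean nearest-neighbour
    distance taken in both directions between the two scanpaths."""
    D = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    return 0.5 * (D.min(axis=1).mean() + D.min(axis=0).mean())
```

Both metrics are zero for identical scanpaths and grow as the predicted path drifts from the ground-truth path, which is why lower scores indicate a better prediction.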
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications or substitutions can be easily made by those skilled in the art within the technical scope of the present disclosure.
Claims (9)
1. A method for glance path prediction based on semantic inference, the method comprising:
constructing an image glance path training set;
constructing a semantic extractor to extract the semantic features of the fixation point;
constructing a glance path prediction model of an encoder-decoder framework;
training a glance path prediction model;
predicting a glance path of the image.
2. The method of claim 1, wherein constructing the image glance path training set comprises collecting images and the corresponding glance paths of the images, transforming all the images to a uniform size, and calculating the pixel points corresponding to the fixation points of the transformed glance paths.
3. The method of claim 1, wherein constructing a semantic extractor to extract the gaze point semantic features specifically comprises: mapping the image to a semantic space with a trained semantic extractor to obtain the semantic vector corresponding to the fixation point.
4. A semantic inference based glance path prediction method as claimed in claim 3, wherein said semantic extractor is a CNN model.
5. The method of claim 1, wherein constructing a glance path prediction model of an encoder-decoder framework specifically comprises: the encoder outputs a coding vector that encodes the global information of the image; the coding vector is used as the initial value of the decoder, and the decoder learns the gaze point semantic inference relation.
6. A semantic inference based glance path prediction method as in claim 5, wherein the encoder-decoder framework is a CNN model-LSTM network.
7. The method of claim 1, wherein training the glance path prediction model specifically comprises: optimizing the encoder-decoder network to minimize the loss function, with the Euclidean distance from the predicted gaze point semantic vector to the true gaze point semantic vector as the loss function.
8. A semantic inference based glance path prediction method according to claim 1, wherein predicting the glance path of the image comprises: inputting the image into the optimized network for testing to obtain the glance path.
9. A semantic inference based glance path prediction apparatus, comprising:
an image processing module for constructing an image glance path training set;
the characteristic extraction module is used for extracting the semantic characteristics of the fixation point;
a training module for establishing and training a glance path prediction model;
a prediction module to predict a saccade path.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110652817.7A CN113313123B (en) | 2021-06-11 | 2021-06-11 | Glance path prediction method based on semantic inference |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113313123A true CN113313123A (en) | 2021-08-27 |
CN113313123B CN113313123B (en) | 2024-04-02 |
Family
ID=77378522
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110652817.7A Active CN113313123B (en) | 2021-06-11 | 2021-06-11 | Glance path prediction method based on semantic inference |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113313123B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109447096A (en) * | 2018-04-13 | 2019-03-08 | 西安电子科技大学 | A kind of pan path prediction technique and device based on machine learning |
US20190080623A1 (en) * | 2017-09-14 | 2019-03-14 | Massachusetts Institute Of Technology | Eye Tracking As A Language Proficiency Test |
US20190096125A1 (en) * | 2017-09-28 | 2019-03-28 | Nec Laboratories America, Inc. | Generating occlusion-aware bird eye view representations of complex road scenes |
CN110298303A (en) * | 2019-06-27 | 2019-10-01 | 西北工业大学 | A kind of crowd recognition method based on the long pan of memory network in short-term path learning |
Non-Patent Citations (2)
Title |
---|
Li Na; Zhao Xinbo: "A visual attention model integrating semantic object features", Journal of Harbin Institute of Technology, no. 05 *
Gong Sihong: "A new method for predicting human eye saccade paths", Electronic Technology & Software Engineering, no. 03 *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113762393A (en) * | 2021-09-08 | 2021-12-07 | 杭州网易智企科技有限公司 | Model training method, gaze point detection method, medium, device, and computing device |
CN115037962A (en) * | 2022-05-31 | 2022-09-09 | 咪咕视讯科技有限公司 | Video adaptive transmission method, device, terminal equipment and storage medium |
CN115037962B (en) * | 2022-05-31 | 2024-03-12 | 咪咕视讯科技有限公司 | Video self-adaptive transmission method, device, terminal equipment and storage medium |
CN116343012A (en) * | 2023-05-29 | 2023-06-27 | 江西财经大学 | Panoramic image glance path prediction method based on depth Markov model |
CN116343012B (en) * | 2023-05-29 | 2023-07-21 | 江西财经大学 | Panoramic image glance path prediction method based on depth Markov model |
CN116563524A (en) * | 2023-06-28 | 2023-08-08 | 南京航空航天大学 | Glance path prediction method based on multi-vision memory unit |
CN116563524B (en) * | 2023-06-28 | 2023-09-29 | 南京航空航天大学 | Glance path prediction method based on multi-vision memory unit |
Also Published As
Publication number | Publication date |
---|---|
CN113313123B (en) | 2024-04-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zheng et al. | A novel background subtraction algorithm based on parallel vision and Bayesian GANs | |
CN106407889B (en) | Method for recognizing human body interaction in video based on optical flow graph deep learning model | |
CN113313123A (en) | Semantic inference based glance path prediction method | |
CN110210429B (en) | Method for generating network based on optical flow, image and motion confrontation to improve recognition accuracy rate of anxiety, depression and angry expression | |
CN112949647B (en) | Three-dimensional scene description method and device, electronic equipment and storage medium | |
CN111160294B (en) | Gait recognition method based on graph convolution network | |
CN114049381A (en) | Twin cross target tracking method fusing multilayer semantic information | |
CN110298303B (en) | Crowd identification method based on long-time memory network glance path learning | |
Harley et al. | Learning from unlabelled videos using contrastive predictive neural 3d mapping | |
CN110334656A (en) | Multi-source Remote Sensing Images Clean water withdraw method and device based on information source probability weight | |
CN111626152B (en) | Space-time line-of-sight direction estimation prototype design method based on Few-shot | |
Xiong et al. | Contextual sa-attention convolutional LSTM for precipitation nowcasting: A spatiotemporal sequence forecasting view | |
CN114419732A (en) | HRNet human body posture identification method based on attention mechanism optimization | |
Ning et al. | Deep Spatial/temporal-level feature engineering for Tennis-based action recognition | |
CN116524593A (en) | Dynamic gesture recognition method, system, equipment and medium | |
CN116543021A (en) | Siamese network video single-target tracking method based on feature fusion | |
Huang et al. | Football players’ shooting posture norm based on deep learning in sports event video | |
CN115359550A (en) | Gait emotion recognition method and device based on Transformer, electronic device and storage medium | |
CN114140524A (en) | Closed loop detection system and method for multi-scale feature fusion | |
Ammar et al. | Comparative Study of latest CNN based Optical Flow Estimation | |
CN113760091A (en) | Mobile terminal perception computing technology and Internet of things technology application system | |
Zhou et al. | Motion balance ability detection based on video analysis in virtual reality environment | |
Jin | A three-dimensional animation character dance movement model based on the edge distance random matrix | |
Li | A new physical posture recognition method based on feature complement-oriented convolutional neural network | |
van Staden et al. | An Evaluation of YOLO-Based Algorithms for Hand Detection in the Kitchen |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||