CN113313123B - Glance path prediction method based on semantic inference - Google Patents

Glance path prediction method based on semantic inference

Info

Publication number
CN113313123B
CN113313123B (application CN202110652817.7A)
Authority
CN
China
Prior art keywords
semantic
glance
image
path
gaze point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110652817.7A
Other languages
Chinese (zh)
Other versions
CN113313123A (en)
Inventor
夏辰
钟文琦
韩军伟
郭雷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northwestern Polytechnical University
Original Assignee
Northwestern Polytechnical University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northwestern Polytechnical University
Priority to CN202110652817.7A
Publication of CN113313123A
Application granted
Publication of CN113313123B
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/20 Image preprocessing
    • G06V 10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V 10/267 Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Abstract

The invention relates to a glance path prediction method based on semantic inference, belonging to the field of image glance path prediction. An image glance path training set is constructed, and a trained CNN maps each image into a semantic space to obtain the semantic vector corresponding to each gaze point. A glance path prediction model with an encoder-decoder framework is constructed: the encoder encodes the global information of the image and outputs a coding vector, which serves as the initial value of the decoder; the decoder learns the gaze point semantic inference relation. The Euclidean distance from the predicted gaze point semantic vector to the true gaze point semantic vector is used as the loss function, and the encoder-decoder network is optimized to minimize it. Test images are input into the optimized network to obtain their glance paths.

Description

Glance path prediction method based on semantic inference
Technical Field
The invention relates to the field of image glance path prediction, in particular to a glance path prediction method based on semantic inference, which maps gaze points into a semantic space and establishes the semantic jump relation between gaze points through supervised learning to predict the glance path.
Background
At any moment the human eye receives far more visual data than the human brain can process. The human visual system can locate important regions in complex physical scenes, which enables a person to quickly acquire useful information from massive visual data with limited computing resources. Research on the human visual system is therefore of great significance for extracting useful information rapidly from large amounts of visual data. Existing research falls into two areas: first, visual saliency, the probability of fixating on visual data, which represents the static characteristics of vision; second, the visual glance path, the spatio-temporal sequence of changes of the eye's gaze point, which reflects both the static and the dynamic characteristics of vision. Research on the glance path therefore helps to understand the working mechanism of human vision more fully, and has broad application prospects in fields such as recommendation systems, crowd identification, and virtual reality.
The earliest glance path prediction was by Itti et al. in "A Model of Saliency-Based Visual Attention for Rapid Scene Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254-1259, 1998, which used feature integration theory to explain visual search strategies, decomposed an input image into a series of shallow feature maps to predict image saliency, and then used an inhibition-of-return mechanism and a winner-take-all mechanism to determine the location of the next gaze point. Giuseppe Boccignone et al. then modeled the glance path as a constrained random walk in the saliency field using the Langevin equation in "Modelling gaze shift as a constrained random walk," Phys. A, Statist. Mech. Appl., vol. 331, no. 1, pp. 207-218, 2004, where the jump length obeys a Lévy distribution. These traditional methods do not extract high-level semantic information of the image when predicting the saliency map, so they predict semantic targets inaccurately, and they rely on the accuracy of physiological-mechanism modeling when predicting gaze points from the saliency map.
With the development of deep learning methods such as convolutional neural networks (CNN) and long short-term memory (LSTM) networks, researchers have turned to deep learning for glance path prediction. Thuyen Ngo et al., in T. Ngo and B. S. Manjunath, "Saccade gaze prediction using a recurrent neural network," 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 3435-3439, propose extracting image features with a CNN, taking gaze point features as input to an LSTM network, and predicting the probability of the next gaze point location through the LSTM network. However, the LSTM network there mines the relationship between gaze point features and the next gaze point location, modeling a statistical property of gaze point positions, and lacks semantic understanding of gaze points. The machine-learning-based glance path prediction method and device (publication No. CN109447096A) adopts a structure with a CNN as the encoder and an LSTM as the decoder: the encoder extracts features at the gaze point, the decoder directly predicts the gaze point position, and an attention mechanism is introduced so that each decoding step focuses on different information. That invention models the positional correlation between gaze points, but the gaze point features lack feature information from other positions, so it is difficult to accurately predict the next gaze point position from them. Moreover, different people have different glance paths, the physiological mechanism of the glance path remains unclear, and existing models relying on a single physiological model cannot adapt to the glance paths of different people. Therefore, the present invention observes that gaze point semantics are strongly correlated, introduces global image information, and proposes to use an encoder-decoder framework to mine the semantic correlation between gaze points in order to predict the glance path.
Disclosure of Invention
Technical problem to be solved
There has been some research on glance path prediction based on deep learning, but it focuses mainly on better modeling certain physiological mechanisms and does not consider the following problems. First, the semantic information of gaze points is correlated: the probability of the next gaze point is influenced by the semantics of the current gaze point and of all preceding gaze points, yet existing models lack modeling of the correlation between gaze point semantics. Second, the glance paths of different crowds differ with their semantic cognition; for example, children and adults understand semantic targets differently, and autism patients exhibit semantic-deficit phenomena. Third, the physiological mechanism of the glance path is ambiguous, and existing models relying on a single physiological model cannot adapt to the glance paths of different people. Therefore, to overcome the shortcomings of the prior art, the invention treats the glance path prediction problem as a search problem measured by semantic correlation over the semantic range of the whole image, and proposes to extract image semantics with a CNN while an encoder-decoder framework mines the semantic correlation of gaze points: the encoder, a CNN, encodes the global information of the image and introduces it as the initial value of the decoder; the decoder, an LSTM network, learns the semantic associations between gaze points and models the sequential characteristics of the glance path. Because the model is data-driven, it can learn the semantic information of different crowds and establish the corresponding semantic correlations from their glance paths, without relying on excessive physiological mechanisms.
Technical proposal
A method of glance path prediction based on semantic inference, the method comprising:
constructing an image glance path training set;
a semantic extractor is constructed to extract gaze point semantic features;
constructing a glance path prediction model of an encoder-decoder framework;
training a glance path prediction model;
the image glance path is predicted.
The invention further adopts the technical scheme that: the construction of the image glance path training set specifically comprises the steps of collecting images, collecting glance paths corresponding to the images, transforming the sizes of all the images to be consistent, and calculating pixel points corresponding to fixation points of the glance paths after transformation.
The invention further adopts the technical scheme that: the construction of the semantic extractor for extracting the gaze point semantic features specifically comprises the following steps: and mapping the image to a semantic space by adopting a trained semantic extractor to obtain a semantic vector corresponding to the gaze point.
The invention further adopts the technical scheme that: the semantic extractor is a CNN model.
The invention further adopts the technical scheme that: the construction of the glance path prediction model of the encoder-decoder framework specifically comprises: the encoder encodes global information of the image and outputs an encoded vector, the encoded vector is used as an initial value of a decoder, and the decoder learns a gaze point semantic inference relation.
The invention further adopts the technical scheme that: the encoder-decoder framework is a CNN model-LSTM network.
The invention further adopts the technical scheme that: training the glance path prediction model specifically includes: taking the Euclidean distance from the predicted gaze point semantic vector to the true gaze point semantic vector as the loss function and optimizing the encoder-decoder network to minimize it.
The invention further adopts the technical scheme that: predicting the image glance path specifically includes: inputting test images into the optimized network to obtain the glance path.
A semantic inference based glance path prediction apparatus, the apparatus comprising:
the image processing module is used for constructing an image glance path training set;
the feature extraction module is used for extracting the semantic features of the fixation point;
the training module is used for building and training a glance path prediction model;
and the prediction module is used for predicting the image glance path.
Advantageous effects
The glance path prediction method based on semantic inference has the following beneficial effects:
1) The invention establishes an end-to-end learning model with an encoder-decoder framework and adopts a data-driven approach, without simulating overly complex physiological phenomena. The encoder uses a CNN to encode the global information of the image, and the decoder exploits the strength of the LSTM network in sequence modeling to better reveal the dynamic properties of the glance path.
2) The invention uses a CNN to extract gaze point semantic features, adopting a network model pre-trained on a large-scale data set to extract high-level semantic information of the gaze point. Compared with earlier studies that train directly on eye movement data sets, this alleviates the problem that eye movement data are scarce and semantic information is hard to extract, and offers stronger semantic abstraction and representation than raw image pixel blocks. In addition, considering the semantic differences between crowds, the deep-learning-based semantic extractor can be trained on samples of a specific crowd to capture its semantic cognition, thus realizing glance path prediction for different crowds.
3) Compared with the work of Thuyen Ngo et al., which mines the correlation between gaze point features and positions and lacks semantic understanding of gaze points, the invention observes the correlation of semantic information between gaze points and uses the LSTM network to directly mine the semantic correlation between the preceding gaze points and the next one, realizing inference from gaze point semantics to gaze point semantics, in line with human cognitive characteristics.
4) Compared with earlier studies using a winner-take-all mechanism, the invention notes that, owing to the physiological structure of the human eye, glance paths follow certain distributions in amplitude and angle. Taking these distributions into account, the gaze point position is predicted by combining the inferred semantics with the distance map formed from all semantic features in the image, which better matches the statistical characteristics of human eye movements.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, like reference numerals being used to refer to like parts throughout the several views.
FIG. 1 is a general flow chart for the implementation of the present invention.
Figure 2 is a diagram of the overall framework of the model of the present invention.
FIG. 3 is a schematic diagram of semantic vector sequence extraction in the present invention.
FIG. 4 is a schematic diagram of the semantic relationship mining of the decoder LSTM network of the present invention.
Fig. 5 is an example of experimental results in the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions, and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. In addition, the technical features of the embodiments described below may be combined with each other as long as they do not conflict.
The scheme of the invention is as follows: construct an image glance path training set and map each image into a semantic space with a trained CNN to obtain the semantic vector corresponding to each gaze point. Construct a glance path prediction model with an encoder-decoder framework: the encoder encodes the global information of the image and outputs a coding vector, which serves as the initial value of the decoder; the decoder learns the gaze point semantic inference relation. The Euclidean distance from the predicted gaze point semantic vector to the true gaze point semantic vector is used as the loss function, and the encoder-decoder network is optimized to minimize it. Test images are input into the optimized network to obtain their glance paths. As shown in fig. 1, the implementation comprises the following steps:
(1) Constructing the image glance path training set
Collect images and acquire the corresponding glance path of each image, and uniformly resize all images to h × w, where h denotes the image height and w the image width. Compute the pixel locations corresponding to the glance-path gaze points after the transformation.
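For illustration, this resizing and gaze-point rescaling could be sketched as follows (a minimal Python example using OpenCV; the default size h = w = 512, the function name, and the coordinate convention are illustrative assumptions, not part of the patent):

import cv2

def build_training_pair(image, gaze_points, h=512, w=512):
    """Resize an image to h x w and rescale its gaze-point coordinates.

    image:       H x W x 3 array
    gaze_points: list of (row, col) fixations recorded on the original image
    returns:     the resized image and the transformed gaze-point pixels
    """
    H, W = image.shape[:2]
    r_y, r_x = h / H, w / W                    # size-transformation coefficients
    resized = cv2.resize(image, (w, h))        # cv2.resize takes (width, height)
    scaled = [(int(round(p * r_y)), int(round(q * r_x)))
              for (p, q) in gaze_points]
    return resized, scaled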
(2) Constructing a semantic extractor to extract gaze point semantic features
The semantic extractor shown in fig. 2 maps image block pixels to high-level semantic information. Since CNNs are well suited to processing image data, a common CNN model such as VGG or ResNet is selected as the semantic extractor and trained on tasks such as saliency prediction or classification to obtain the semantic extraction model; for glance path prediction on different crowds, saliency map prediction for that crowd can be chosen to address the semantic differences between crowds. As shown in fig. 3, the semantic extraction model converts the h × w × 3 image into an h' × w' × N_s semantic vector map, where N_s is the dimension of the semantic vectors; the image compression coefficient is h/h', and a desired compression coefficient can be achieved by removing network layers. The image blocks containing gaze points correspond one-to-one with points of the semantic space, so the gaze point position sequence is converted into a semantic vector sequence.
(3) Constructing a glance path prediction model for an encoder-decoder framework
The encoder shown in fig. 2 encodes the global information of the image, converting the image into a fixed-length vector. The encoder adopts a CNN model, such as the common VGG or ResNet; since all images have the uniform size h × w, adding a fully-connected layer at the end of the model yields a coding vector of fixed, adjustable dimension. As shown in fig. 4, the decoder uses an LSTM network to mine the correlation between the glance-path gaze point semantic vectors and infer the semantic vector of the next gaze point.
The LSTM network uses a gating mechanism to control the memorization and updating of information, which effectively alleviates the vanishing- and exploding-gradient problems and models long-range dependencies, making it well suited to sequence modeling. Since the output vector dimension of the LSTM equals the image coding vector dimension, a linear layer is added at the back end of the LSTM network so that the output hidden-feature dimension matches the input semantic vector dimension.
(4) Training glance path prediction models
The input of the encoder-decoder model is the glance-path semantic vector sequence of an image; the output is the predicted gaze point semantic vector, and the ground truth is the real gaze point semantic vector at the next time step. One image has several gaze point semantic vector sequences, which are input into the LSTM network in parallel to improve the computation speed of the model. The Euclidean distances between each predicted semantic vector and the corresponding real semantic vector are computed and summed as the loss function; with minimizing this loss as the optimization target, the encoder-decoder model is trained with the Adam (adaptive moment estimation) algorithm.
(5) Predicting the image glance path
After the trained glance path prediction model is obtained, the test image is input into the encoder to obtain the image coding vector, which is passed to the decoder, and the test image is input into the semantic extractor to map it into the semantic space. The initial point is chosen as the image center point, and its semantic vector, obtained through the semantic extractor, is input to the LSTM as the initial value. The semantic distance between the LSTM output (after the linear layer) and each image block is computed to obtain a distance map, which is normalized so that image blocks at larger semantic distance receive smaller jump probability. A Gaussian prior template and an inhibition-of-return (IOR) prior are added; these probability maps are multiplied, and the point of maximum probability is selected as the jump position. After the predicted gaze point position is obtained, its corresponding semantic vector is selected as the LSTM input at the next time step; after T time steps, a glance path sequence of length T is obtained.
Example 1:
step 1, constructing a test image library
Collect images and their corresponding glance paths, and uniformly resize all images to h × w, where h denotes the image height and w the image width; each image has size-transformation coefficients r_x and r_y. Compute the pixel locations of the glance-path gaze points after the transformation, and let (q_1, q_2, …, q_n) denote the glance-path gaze point coordinate sequence.
Step 2, constructing a semantic extractor to extract the semantic features of the gaze point
The semantic extractor maps image block pixels to high-level semantic information. Since CNNs are well suited to image data, the common VGG-16 model is selected as the semantic extractor, with model parameters obtained by training on a classification task. For the selected image size the compression coefficient is h/h' = 8, so the first 10 convolutional layers of the VGG-16 model are kept, including 3 pooling layers. Passing image I through the selected VGG network gives the feature map:
F_semantic = VGG_semantic(I)

where VGG_semantic denotes the selected VGG network and F_semantic the resulting feature map. F_semantic has 512 channels; to give the input semantic vectors x_t the dimension N_s, averages are taken over the channel dimension in groups:

S = mean(F_semantic)

where S denotes the semantic vector map corresponding to image I: the h × w × 3 image I passes through the semantic extractor to give an h' × w' × N_s semantic vector map. Specifically, the semantic vector x_t of the t-th gaze point is computed as

x_t(j) = (1/M) Σ_{k=(j−1)M+1}^{jM} f_t(k),  j = 1, …, N_s

where M = 512/N_s and f_t is the vector in the feature map corresponding to the t-th gaze point. The semantic extractor thus converts the gaze point position sequence into a semantic vector sequence, which serves as the input vector sequence of the LSTM.
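A minimal PyTorch sketch of this step, assuming torchvision's ImageNet-pretrained VGG-16 stands in for the classification-trained model and using the channel-group averaging reconstructed above (the truncation index and all names are illustrative):

import torch
import torchvision

class SemanticExtractor(torch.nn.Module):
    """Truncated VGG-16 followed by channel-group averaging to N_s dimensions."""

    def __init__(self, n_s=64):
        super().__init__()
        vgg = torchvision.models.vgg16(pretrained=True).features
        # The first 10 convolutional layers of VGG-16 (through conv4_3) span
        # 3 max-pooling layers, giving compression h/h' = 8 and 512 channels.
        self.backbone = vgg[:23]
        self.n_s = n_s

    def forward(self, image):                       # image: B x 3 x h x w
        f = self.backbone(image)                    # B x 512 x h' x w'
        b, c, hh, ww = f.shape
        m = c // self.n_s                           # M = 512 / N_s
        # Average each group of M consecutive channels -> N_s-dim semantics.
        return f.view(b, self.n_s, m, hh, ww).mean(dim=2)   # B x N_s x h' x w'

def gaze_semantics(s_map, gaze_points, compress=8):
    """Look up the semantic vector of each gaze point (in pixel coordinates)."""
    return torch.stack([s_map[0, :, p // compress, q // compress]
                        for (p, q) in gaze_points])         # n x N_s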
Step 3, constructing a panning path prediction model of the encoder-decoder framework
The encoder encodes the global information of the image into a fixed-length vector. It adopts a VGG model; since all images have the uniform size h × w, adding a fully-connected layer at the end of the model yields a coding vector of fixed, adjustable dimension:
F_encoder = VGG_encoder(I)

h_0 = FC(F_encoder)

where F_encoder is the feature map of the image through the VGG_encoder network, FC denotes the fully-connected layer, and h_0 is the initial hidden-layer value of the decoder LSTM network.
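A sketch of this encoder under the same assumptions (the hidden dimension of 512 and the use of torch.nn.LazyLinear, which infers the flattened input size on first use, are illustrative choices):

import torch
import torchvision

class GlobalEncoder(torch.nn.Module):
    """Encode the whole image into a fixed-length vector h_0 for the decoder."""

    def __init__(self, hidden_dim=512):
        super().__init__()
        self.backbone = torchvision.models.vgg16(pretrained=True).features
        # Every input has the uniform size h x w, so the flattened feature size
        # is constant and one FC layer yields a coding vector of adjustable dim.
        self.fc = torch.nn.LazyLinear(hidden_dim)

    def forward(self, image):                 # image: B x 3 x h x w
        f = self.backbone(image)              # B x 512 x h/32 x w/32
        return self.fc(f.flatten(1))          # h_0: B x hidden_dim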
The decoder adopts an LSTM network to mine the correlation between the glance-path gaze point semantic vectors and infer the semantic vector of the next gaze point. The LSTM network is computed as follows:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)

i_t = σ(W_i · [h_{t−1}, x_t] + b_i)

c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)

c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t

o_t = σ(W_o · [h_{t−1}, x_t] + b_o)

h_t = o_t ⊙ tanh(c_t)

where h_t denotes the hidden state at time t, c_t the cell state at time t, and i_t, f_t, c̃_t, o_t the input gate, forget gate, cell gate, and output gate of the LSTM network; σ(·) denotes the sigmoid function, tanh(·) the hyperbolic tangent, ⊙ the Hadamard product, and W_f, W_i, W_c, W_o are the parameter matrices of the LSTM network. The gating mechanism controls the memorization and updating of information, effectively alleviating the vanishing- and exploding-gradient problems and modeling long-range dependencies.
Since the output vector dimension of the LSTM network is the hidden-layer dimension, a linear layer is added at the back end of the LSTM network so that the output feature dimension matches the input semantic vector dimension, namely:

y_t = FC(h_t)

In summary, the glance path prediction model of the encoder-decoder framework outputs a semantic vector at each time step, forming the predicted semantic vector sequence (y_1, y_2, …, y_n).
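The decoder can then be sketched as follows (a single-layer, batch-first LSTM and a zero-initialized cell state are assumptions; the patent specifies only that h_0 initializes the hidden state):

import torch

class SemanticDecoder(torch.nn.Module):
    """LSTM over gaze-point semantic vectors, predicting the next semantics."""

    def __init__(self, n_s=64, hidden_dim=512):
        super().__init__()
        self.lstm = torch.nn.LSTM(n_s, hidden_dim, batch_first=True)
        self.fc = torch.nn.Linear(hidden_dim, n_s)   # linear layer back to N_s

    def forward(self, x, h0):
        # x: B x T x N_s semantic sequence; h0: B x hidden_dim from the encoder
        h0 = h0.unsqueeze(0)                         # 1 x B x hidden_dim
        c0 = torch.zeros_like(h0)                    # cell state starts at zero
        out, _ = self.lstm(x, (h0, c0))              # out: B x T x hidden_dim
        return self.fc(out)                          # (y_1, ..., y_T): B x T x N_s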
Step 4, training a glance path prediction model
The input of the encoder-decoder model is the glance-path semantic vector sequence of an image; the output is the predicted gaze point semantic vector, and the ground truth is the real gaze point semantic vector at the next time step. One image has several gaze point semantic vector sequences, which are input into the LSTM network in parallel to improve the computation speed of the model. The Euclidean distances between each predicted semantic vector and the corresponding real semantic vector are computed and summed as the loss function, and minimizing this loss is the optimization target. The loss function is computed from the predicted semantic vector sequence and the real semantic vector sequence at the first n−1 time steps:

L = Σ_{i=1}^{n−1} ||y_i − S(q_{i+1})||_2 + α Σ_{i=1}^{n−1} 1 / ( Σ_{s ≠ S(q_{i+1})} ||y_i − s||_2 )

where α is a hyper-parameter, s denotes a semantic vector in the semantic vector map S, and q_i denotes the coordinates of the i-th gaze point. The first term is the distance between the predicted and real semantic vector sequences over the first n−1 time steps; the second term is the reciprocal of the summed distances between each predicted semantic vector and all non-true semantic vectors in the image. Optimizing the model parameters to minimize the loss makes each predicted semantic vector as close as possible to the true semantic vector and as far as possible from the non-true semantic vectors.
After the loss function is obtained, the encoder-decoder model is trained using the Adam (Adaptive moment estimation) algorithm.
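A sketch of this loss and one Adam training step, following the loss reconstruction given above (the value of alpha, treating every semantic-map location as a candidate vector, and the module names from the earlier sketches are assumptions):

import torch

def scanpath_loss(pred, target, s_map, alpha=0.1):
    """Loss over one scanpath.

    pred, target: (n-1) x N_s predicted / true next-gaze semantic vectors
    s_map:        N_s x h' x w' semantic vector map of the image
    """
    true_dist = (pred - target).norm(dim=1)          # ||y_i - S(q_{i+1})||_2
    all_vecs = s_map.flatten(1).t()                  # (h'*w') x N_s
    dists = torch.cdist(pred, all_vecs)              # (n-1) x (h'*w')
    # Sum of distances to all NON-true vectors = total sum minus the true one.
    term2 = (1.0 / (dists.sum(dim=1) - true_dist)).sum()
    return true_dist.sum() + alpha * term2

# One optimization step (encoder/decoder/extractor as sketched earlier):
# optimizer = torch.optim.Adam(
#     list(encoder.parameters()) + list(decoder.parameters()))
# y = decoder(x_seq, encoder(img))            # x_seq: 1 x n x N_s
# loss = scanpath_loss(y[0, :-1], x_seq[0, 1:], extractor(img)[0])
# optimizer.zero_grad(); loss.backward(); optimizer.step()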
Step 5, predicting the image glance path
After the trained glance path prediction model is obtained, the test image is input into the encoder to obtain the image coding vector, which is fed to the decoder, and the test image is input into the semantic extractor to map it into the semantic space. The initial point is selected as the image center point:

q_0 = (h/2, w/2)

and its semantic vector, obtained through the semantic extractor, is input to the LSTM as the initial value:

x_0 = S(q_0)
obtaining semantic vector y through LSTM network and full connection layer i Calculating semantic vector y i The semantic distance between each image block is used for obtaining a distance graph:
D i (m,n)=||y i -s m,n || 2
m and n are the abscissa and ordinate of the distance map, respectively. Normalizing and probability the distance map:
wherein the method comprises the steps ofRepresenting the maximum value in the distance map.
A Gaussian prior template P^G is then added, in which η = (η_x, η_y) and β characterize the angular and amplitude distribution of eye movements, respectively, together with an inhibition-of-return (IOR) prior P^IOR that suppresses returning to previously fixated image blocks.
multiplying these probability maps:
and selecting the maximum probability value point as the jump position:
after the predicted gaze point position is obtained, a semantic vector corresponding to the predicted gaze point position is selected as input of the LSTM network at the next moment, and a glance path sequence with the sequence length of T can be obtained after T time steps.
The invention is implemented on the Ubuntu 16.04.4 operating system with the PyTorch 1.6 deep learning framework. Input images are uniformly resized to h × w = 512 × 512 and the semantic vector dimension is N_s = 64. The model is built according to the above steps and trained on the training set to obtain the model parameters, and glance paths are then predicted for the test-set images. The initial gaze point is selected at the image center, the predicted glance path length is T = 10, and the IOR prior radius is min(h, w) × 1/16. The glance paths predicted by the method of the invention are visualized in fig. 5.
The glance paths predicted by the method are compared with those predicted by the method of Thuyen Ngo et al. in ICIP; the visualized results in the figure show that the method conforms to the regularities of human glance paths. Three indicators are used for evaluation: MultiMatch (MM), Hausdorff Distance (HD), and Mean Minimum Distance (MMD); higher MM scores are better, while lower HD and MMD scores are better. On the MM sub-indicators Shape, Direction, Length, and Position, the method of the invention scores 0.9424, 0.6616, 0.9440, and 0.8423 respectively, while the method of Thuyen Ngo et al. scores 0.9108, 0.6421, 0.9142, and 0.7822; on the HD and MMD indicators the method of the invention scores 121.2675 and 95.5071 respectively, while the method of Thuyen Ngo et al. scores 204.6523 and 144.9966. From these evaluation criteria it follows that the method of the invention outperforms the other method in objective evaluation.
While the invention has been described with reference to certain preferred embodiments, it will be understood by those skilled in the art that various changes and substitutions of equivalents may be made without departing from the spirit and scope of the invention.

Claims (4)

1. A method of glance path prediction based on semantic inference, the method comprising:
constructing an image glance path training set;
a semantic extractor is constructed to extract gaze point semantic features; the construction of the semantic extractor for extracting the gaze point semantic features specifically comprises: mapping the image to a semantic space by adopting a trained semantic extractor to obtain the semantic vector corresponding to the gaze point; the semantic extractor is a CNN model;
constructing a glance path prediction model of an encoder-decoder framework; the construction of the glance path prediction model of the encoder-decoder framework specifically comprises: the encoder encodes global information of the image and outputs an encoded vector, the encoded vector is used as an initial value of the decoder, and the decoder learns the gaze point semantic inference relation; the encoder-decoder framework is a CNN model-LSTM network;
training a glance path prediction model;
the image glance path is predicted.
2. The method for predicting the glance path based on semantic inference as claimed in claim 1, wherein constructing the image glance path training set specifically includes collecting images and collecting glance paths corresponding to the images, transforming the sizes of all the images to be consistent, and calculating pixel points corresponding to the fixation points of the glance paths after transformation.
3. The method for predicting a glance path based on semantic inference of claim 1, wherein training the glance path prediction model specifically comprises: taking the Euclidean distance from the predicted gaze point semantic vector to the true gaze point semantic vector as the loss function and optimizing the encoder-decoder network to minimize it.
4. The method of glance path prediction based on semantic inference of claim 1, wherein predicting the image glance path specifically comprises: inputting test images into the optimized network to obtain the glance path.
CN202110652817.7A 2021-06-11 2021-06-11 Glance path prediction method based on semantic inference Active CN113313123B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110652817.7A CN113313123B (en) 2021-06-11 2021-06-11 Glance path prediction method based on semantic inference

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110652817.7A CN113313123B (en) 2021-06-11 2021-06-11 Glance path prediction method based on semantic inference

Publications (2)

Publication Number Publication Date
CN113313123A CN113313123A (en) 2021-08-27
CN113313123B 2024-04-02

Family

ID=77378522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110652817.7A Active CN113313123B (en) 2021-06-11 2021-06-11 Glance path prediction method based on semantic inference

Country Status (1)

Country Link
CN (1) CN113313123B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762393A (en) * 2021-09-08 2021-12-07 杭州网易智企科技有限公司 Model training method, gaze point detection method, medium, device, and computing device
CN115037962B (en) * 2022-05-31 2024-03-12 咪咕视讯科技有限公司 Video self-adaptive transmission method, device, terminal equipment and storage medium
CN116343012B (en) * 2023-05-29 2023-07-21 江西财经大学 Panoramic image glance path prediction method based on depth Markov model
CN116563524B (en) * 2023-06-28 2023-09-29 南京航空航天大学 Glance path prediction method based on multi-vision memory unit

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447096A (en) * 2018-04-13 2019-03-08 西安电子科技大学 A kind of pan path prediction technique and device based on machine learning
CN110298303A (en) * 2019-06-27 2019-10-01 西北工业大学 A kind of crowd recognition method based on the long pan of memory network in short-term path learning

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10997367B2 (en) * 2017-09-14 2021-05-04 Massachusetts Institute Of Technology Eye tracking as a language proficiency test
US10678257B2 (en) * 2017-09-28 2020-06-09 Nec Corporation Generating occlusion-aware bird eye view representations of complex road scenes

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109447096A (en) * 2018-04-13 2019-03-08 西安电子科技大学 A kind of pan path prediction technique and device based on machine learning
CN110298303A (en) * 2019-06-27 2019-10-01 西北工业大学 A kind of crowd recognition method based on the long pan of memory network in short-term path learning

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A visual attention model integrating semantic object features; Li Na, Zhao Xinbo; Journal of Harbin Institute of Technology, no. 5; full text *
A new method for predicting human eye glance paths; Gong Sihong; Electronic Technology & Software Engineering, no. 3; full text *

Also Published As

Publication number Publication date
CN113313123A (en) 2021-08-27

Similar Documents

Publication Publication Date Title
CN113313123B (en) Glance path prediction method based on semantic inference
CN111259786B (en) Pedestrian re-identification method based on synchronous enhancement of appearance and motion information of video
CN106407889B (en) Method for recognizing human body interaction in video based on optical flow graph deep learning model
CN110210429B (en) Method for generating network based on optical flow, image and motion confrontation to improve recognition accuracy rate of anxiety, depression and angry expression
CN111709311B (en) Pedestrian re-identification method based on multi-scale convolution feature fusion
CN109743642B (en) Video abstract generation method based on hierarchical recurrent neural network
CN109443382A (en) Vision SLAM closed loop detection method based on feature extraction Yu dimensionality reduction neural network
CN111695457B (en) Human body posture estimation method based on weak supervision mechanism
CN112949647B (en) Three-dimensional scene description method and device, electronic equipment and storage medium
CN108764019A (en) A kind of Video Events detection method based on multi-source deep learning
CN114049381A (en) Twin cross target tracking method fusing multilayer semantic information
CN110334656A (en) Multi-source Remote Sensing Images Clean water withdraw method and device based on information source probability weight
Xiong et al. Contextual sa-attention convolutional LSTM for precipitation nowcasting: A spatiotemporal sequence forecasting view
Yang et al. An improving faster-RCNN with multi-attention ResNet for small target detection in intelligent autonomous transport with 6G
Jin et al. Cvt-assd: convolutional vision-transformer based attentive single shot multibox detector
CN114419732A (en) HRNet human body posture identification method based on attention mechanism optimization
Ning et al. Deep Spatial/temporal-level feature engineering for Tennis-based action recognition
CN115063428B (en) Spatial dim small target detection method based on deep reinforcement learning
CN116596966A (en) Segmentation and tracking method based on attention and feature fusion
CN116524593A (en) Dynamic gesture recognition method, system, equipment and medium
Zhu et al. Dance Action Recognition and Pose Estimation Based on Deep Convolutional Neural Network.
Huang et al. Football players’ shooting posture norm based on deep learning in sports event video
CN113920170A (en) Pedestrian trajectory prediction method and system combining scene context and pedestrian social relationship and storage medium
Ammar et al. Comparative Study of latest CNN based Optical Flow Estimation
Teršek et al. Re-evaluation of the CNN-based state-of-the-art crowd-counting methods with enhancements

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant