CN113313123A - Semantic inference based glance path prediction method - Google Patents
Semantic inference based glance path prediction method
- Publication number: CN113313123A (application CN202110652817.7A)
- Authority: CN (China)
- Prior art keywords: semantic, image, glance, path, decoder
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06V10/267—Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06N3/045—Combinations of networks
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
- G06N3/08—Learning methods
Abstract
The invention relates to a glance path prediction method based on semantic inference, belonging to the field of image glance path prediction. An image glance path training set is constructed, and a trained CNN maps the image to a semantic space to obtain the semantic vector corresponding to each fixation point. A glance path prediction model with an encoder-decoder framework is constructed: the encoder encodes the global information of the image into a coding vector, the coding vector serves as the initial state of the decoder, and the decoder learns the semantic inference relation between fixation points. The Euclidean distance from the predicted fixation-point semantic vector to the true fixation-point semantic vector is used as the loss function, and the encoder-decoder network is optimized to minimize this loss. A test image is then input into the optimized network to obtain its glance path.
Description
Technical Field
The invention relates to the field of image glance path prediction, and in particular to a glance path prediction method based on semantic inference.
Background
At every moment the human eye receives far more visual data than the brain can process, yet the human visual system can locate important regions in complex physical scenes, allowing humans to rapidly acquire useful information from massive visual data with limited computing resources. Studying the human visual system is therefore of great value for rapidly extracting useful information from large amounts of visual data. Existing research falls into two areas: visual saliency, which is the probability of fixation on visual data and represents the static characteristics of vision; and the visual glance (saccade) path, which is the spatio-temporal sequence of changes of the human fixation point and thus reflects both the static and the dynamic characteristics of vision. Research on the glance path therefore helps to understand the working mechanism of human vision more fully, and has broad application prospects in fields such as recommendation systems, crowd identification, and virtual reality.
The earliest glance path prediction was proposed by Itti et al. in "A Model of Saliency-Based Visual Attention for Rapid Scene Analysis," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 20, no. 11, pp. 1254-1259, 1998. It used feature integration theory to interpret the visual search strategy: the input image is decomposed into a series of shallow feature maps to predict image saliency, and an inhibition-of-return mechanism together with a winner-take-all mechanism determines the position of the next saccade point. Giuseppe Boccignone et al. then used a Langevin-type equation in "Modelling gaze shift as a constrained random walk," Phys. A, Statist. Mech. Appl., vol. 331, no. 1, pp. 207-218, 2004, to model the glance path as a constrained random walk over a saliency field, where saccade lengths obey a Levy distribution. These traditional methods do not extract high-level semantic information from the image when predicting the saliency map, so semantic targets are predicted inaccurately, and predicting fixation points from the saliency map depends on the accuracy of the underlying physiological modeling.
With the development of deep learning methods such as the Convolutional Neural Network (CNN) and the Long Short-Term Memory (LSTM) network, researchers have turned to predicting glance paths with deep learning. Thuyen Ngo et al. (T. Ngo and B. S. Manjunath, "Saccade gaze prediction using a recurrent neural network," 2017 IEEE International Conference on Image Processing (ICIP), 2017, pp. 3435-3439) proposed using a CNN to extract image features, feeding fixation-point features into an LSTM network, and predicting the probability of the next fixation location with the LSTM. However, that LSTM mines the relation between fixation-point features and the next fixation location, modeling only a statistical property of fixation positions, and lacks semantic understanding of the fixation point. A machine-learning-based glance path prediction method and device (publication number CN109447096A) adopts a CNN as encoder and an LSTM as decoder: the encoder extracts features of the fixation point, the decoder directly predicts fixation positions, and an attention mechanism lets each decoding step focus on different information. This method models the positional correlation between fixation points, but the fixation-point features lack feature information from other positions, so the next fixation position is difficult to predict accurately from fixation-point features alone. Moreover, the glance paths of different populations differ, the physiological mechanism of the glance path is not fully understood, and existing models that rely on a single physiological model cannot adapt to the glance paths of different populations.
Therefore, considering that fixation-point semantics are strongly correlated, the invention introduces the global information of the image and proposes to use an encoder-decoder framework to mine the semantic correlation between fixation points in order to predict the glance path.
Disclosure of Invention
Technical problem to be solved
Some studies address glance path prediction with deep learning, but they focus mainly on better modeling certain physiological mechanisms and ignore the following problems. First, the semantic information of fixation points is correlated: the probability of the next fixation is influenced by the semantics of the current fixation point and of all previous ones, and existing models lack modeling of this semantic correlation. Second, the glance paths of different populations differ in semantic cognition; for example, children and adults understand semantic targets differently, and autism patients show semantic deficits. Third, the physiological mechanism of the glance path is ambiguous, and existing models relying on a single physiological model cannot adapt to the glance paths of different populations. The invention therefore aims to overcome these shortcomings of the prior art. Taking semantic correlation as the measure, it treats glance path prediction as a search problem over the semantic space of the whole image: a CNN performs semantic extraction on the image, and an encoder-decoder framework mines the semantic correlation between fixation points. The encoder is a CNN that encodes the global information of the image, which is introduced as the initial state of the decoder; the decoder is an LSTM network that learns the semantic association between fixation points, modeling the sequential characteristics of the glance path. Because the model is data driven, it can learn the semantic information of different populations from their glance paths and establish the corresponding semantic correlations, without relying on excessive physiological mechanisms.
Technical scheme
A method for glance path prediction based on semantic inference, the method comprising:
constructing an image glance path training set;
constructing a semantic extractor to extract the semantic features of fixation points;
constructing a glance path prediction model with an encoder-decoder framework;
training the glance path prediction model;
predicting the glance path of an image.
In a further technical scheme of the invention, constructing the training set specifically comprises: collecting images, collecting the glance paths corresponding to the images, resizing all images to a consistent size, and computing the pixel coordinates of the fixation points of the resized glance paths.
In a further technical scheme of the invention, extracting the semantic features of fixation points with the semantic extractor specifically comprises: mapping the image to a semantic space with a trained semantic extractor to obtain the semantic vector corresponding to each fixation point.
In a further technical scheme of the invention, the semantic extractor is a CNN model.
In a further technical scheme of the invention, constructing the glance path prediction model of the encoder-decoder framework specifically comprises: the encoder encodes the global information of the image and outputs a coding vector, the coding vector serves as the initial state of the decoder, and the decoder learns the semantic inference relation between fixation points.
In a further technical scheme of the invention, the encoder-decoder framework is a CNN model with an LSTM network.
In a further technical scheme of the invention, training the glance path prediction model specifically comprises: using the Euclidean distance from the predicted fixation-point semantic vector to the true fixation-point semantic vector as the loss function, and optimizing the encoder-decoder network to minimize the loss function.
In a further technical scheme of the invention, predicting the glance path specifically comprises: inputting a test image into the optimized network to obtain its glance path.
A semantic inference based glance path prediction apparatus, comprising:
an image processing module for constructing an image glance path training set;
a feature extraction module for extracting the semantic features of fixation points;
a training module for establishing and training a glance path prediction model;
a prediction module for predicting the glance path.
Advantageous effects
The glance path prediction method based on semantic inference provided by the invention has the following beneficial effects:
1) The invention establishes an end-to-end learning model with an encoder-decoder framework and adopts a data-driven approach, without simulating overly complex physiological phenomena. The encoder uses a CNN to encode the global information of the image, and the decoder uses the strength of the LSTM network in sequence modeling, better revealing the dynamic properties of the glance path.
2) The invention uses a CNN to extract the semantic features of fixation points and adopts a network model pre-trained on a large-scale dataset to extract high-level semantic information of the fixation point. Compared with previous research that trains directly on eye movement datasets, this avoids the problem that eye movement data are scarce and semantic information is hard to extract; compared with raw image pixel blocks, it has stronger semantic abstraction and representation capability. In addition, considering the semantic differences between populations, the deep-learning-based semantic extractor can be trained on samples from a specific population to capture that population's semantic cognition, realizing glance path prediction for different populations.
3) The invention uses an LSTM network to learn the jump relation between fixation points from the perspective of semantic inference. Compared with previous research such as that of Thuyen Ngo et al., which correlates fixation-point features with positions and lacks semantic understanding of the fixation point, the invention observes that the semantic information of fixation points is correlated, uses the LSTM network to directly mine the semantic correlation between the preceding fixation points and the next one, and realizes inference from fixation-point semantics to fixation-point semantics, which better matches human cognitive characteristics.
4) Owing to the physiological structure of the human eye, saccades follow certain distributions in amplitude and angle. Compared with previous research using a winner-take-all mechanism, the method proposed by the invention takes these distributions into account and predicts the fixation position by combining the inferred semantics with a distance map over all semantic features of the image, which better matches the statistical characteristics of human eye movements.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
Figure 1 is a general flow chart of an implementation of the present invention.
FIG. 2 is a general framework diagram of the model of the present invention.
FIG. 3 is a schematic diagram of semantic vector sequence extraction in the present invention.
FIG. 4 is a schematic diagram of decoder LSTM network mining semantic relations in the invention.
FIG. 5 illustrates experimental results in the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail below with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. In addition, the technical features involved in the embodiments of the present invention described below may be combined with each other as long as they do not conflict with each other.
The scheme of the invention is as follows: construct an image glance path training set, and map the image to a semantic space with a trained CNN to obtain the semantic vector corresponding to each fixation point. Construct a glance path prediction model with an encoder-decoder framework: the encoder encodes the global information of the image into a coding vector, the coding vector serves as the initial state of the decoder, and the decoder learns the semantic inference relation between fixation points. Use the Euclidean distance from the predicted fixation-point semantic vector to the true fixation-point semantic vector as the loss function, and optimize the encoder-decoder network to minimize it. Input a test image into the optimized network to obtain its glance path. As shown in fig. 1, the implementation comprises the following steps:
(1) Constructing the image glance path training set
Collect images and their corresponding glance paths, and uniformly resize all images to h × w, where h is the image height and w the image width. Compute the pixel coordinates of the fixation points of the resized glance paths.
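As a minimal sketch of this preprocessing step (the target size, function name, and rounding policy are illustrative assumptions, not specified in the patent), the fixation coordinates can be rescaled together with the image:

```python
import numpy as np

def rescale_fixations(fixations, orig_hw, target_hw=(512, 512)):
    """Rescale fixation points (row, col) from the original image size
    to the common target size h x w, as required by step (1)."""
    h0, w0 = orig_hw
    h, w = target_hw
    ry, rx = h / h0, w / w0          # size-transformation coefficients
    pts = np.asarray(fixations, dtype=float)
    scaled = np.stack([pts[:, 0] * ry, pts[:, 1] * rx], axis=1)
    # round to the nearest pixel and clip into the resized image
    return np.clip(np.rint(scaled).astype(int), [0, 0], [h - 1, w - 1])
```

For example, a fixation at (100, 200) in a 600 × 800 image maps to (85, 128) after resizing to 512 × 512.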
(2) Constructing the semantic extractor to extract fixation-point semantic features
The semantic extractor shown in FIG. 2 implements the mapping from image-block pixels to high-level semantic information. Since CNNs are suited to processing image data, a common CNN model such as VGG or ResNet is chosen as the semantic extractor, and the model is trained on tasks such as saliency prediction or classification to obtain the semantic extraction model; if saliency map prediction for different populations is adopted, the semantic differences between populations in glance path prediction can be handled. As shown in FIG. 3, the semantic extraction model converts an h × w × 3 image into an h' × w' × N_s semantic vector map, where N_s is the dimension of the semantic vector; the image compression coefficient is h/h', and the desired compression coefficient can be obtained by reducing the number of network layers. The image blocks containing the fixation points correspond one-to-one to points in the semantic space, so the fixation position sequence is converted into a semantic vector sequence.
(3) Constructing the glance path prediction model of the encoder-decoder framework
The encoder shown in fig. 2 encodes the global information of the image, converting the image into a fixed-length vector. The encoder adopts a CNN model (common choices are VGG, ResNet, etc.); because all images share the size h × w, adding a fully connected layer at the end of the model yields a coding vector of fixed dimension, and this dimension can be adjusted. As shown in fig. 4, the decoder uses an LSTM network to mine the association between the semantic vectors of the fixation points along the glance path, inferring the semantic vector of the next fixation point.
The LSTM network introduces a gating mechanism to control what information is remembered and updated; it effectively alleviates the vanishing- and exploding-gradient problems, models long-range dependencies, and is suited to sequence modeling. Because the dimension of the LSTM output vector is the hidden-layer dimension, a linear layer must be added after the LSTM network so that the output feature dimension equals the input semantic vector dimension.
(4) Training the glance path prediction model
The input of the encoder-decoder model is an image and its corresponding glance path semantic vector sequence; the output is the predicted fixation-point semantic vector, and the ground truth is the real fixation-point semantic vector at the next moment. One image has several fixation-point semantic vector sequences, and these sequences are fed into the LSTM network in parallel to speed up model computation. The Euclidean distances between each predicted semantic vector and the corresponding real semantic vector are computed and summed as the loss function; with minimizing this loss as the optimization objective, the encoder-decoder model is trained with the Adam (adaptive moment estimation) algorithm.
(5) Predicting the glance path of an image
After the trained glance path prediction model is obtained, the test image is input into the encoder to obtain the image coding vector, which initializes the decoder, and the test image is also input into the semantic extractor to be mapped to the semantic space. The image center point is chosen as the initial point, and its semantic vector, obtained through the semantic extractor, is fed into the LSTM as the initial input. The semantic distance between the LSTM output (after the linear layer) and each image block is computed to obtain a distance map; the distance map is normalized into probabilities so that image blocks at larger distances have smaller jump probabilities; a Gaussian prior template and an Inhibition of Return (IOR) prior are added, the probability maps are multiplied, and the point of maximum probability is selected as the jump position. After the predicted fixation position is obtained, its corresponding semantic vector is selected as the LSTM input at the next moment; after T time steps, a glance path sequence of length T is obtained.
Example 1:
Step 1, constructing the image glance path training set
Collect images and their corresponding glance paths, and uniformly resize all images to h × w, where h is the image height and w the image width; each image corresponds to size-transformation coefficients r_x, r_y. Compute the pixel coordinates of the fixation points of the resized glance paths, and let (q_1, q_2, …, q_n) denote the sequence of fixation coordinates along the glance path.
Step 2, constructing the semantic extractor to extract fixation-point semantic features
The semantic extractor implements the mapping from image-block pixels to high-level semantic information. Since CNNs are suited to processing image data, the common CNN model VGG-16 is chosen as the semantic extractor, with model parameters obtained by training on a classification task. With a compression coefficient h/h' of 8, the first 10 convolutional layers of the VGG-16 model are selected, including 3 pooling layers. The image I passes through the selected VGG network to obtain a feature map:
F_semantic = VGG_semantic(I)
where VGG_semantic denotes the selected VGG network and F_semantic the resulting feature map. F_semantic has 512 channels; to give the input semantic vector x_t a dimension of N_s, averages are taken over groups of channels along the channel dimension, i.e.:
S = mean(F_semantic)
where S denotes the semantic vector map corresponding to image I: an image I of size h × w × 3 yields through the semantic extractor an h' × w' × N_s semantic vector map. Specifically, the semantic vector x_t of the t-th fixation point is computed as
x_t(j) = (1/m) Σ_{k=(j−1)m+1}^{jm} f_t(k),  j = 1, …, N_s
where m = 512/d(x_t) = 512/N_s, and f_t is the vector in the feature map corresponding to the t-th fixation point. The semantic extractor thus converts the fixation position sequence into a semantic vector sequence, which serves as the input vector sequence of the LSTM.
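A minimal numpy sketch of this grouped channel averaging (the function name and the grouping into consecutive blocks of m channels are illustrative assumptions; 512 channels are reduced to N_s = 64 as in the embodiment):

```python
import numpy as np

def semantic_vector(f_t, n_s=64):
    """Reduce a 512-dim feature vector f_t to an N_s-dim semantic vector
    by averaging consecutive groups of m = 512 / N_s channels."""
    c = f_t.shape[0]
    assert c % n_s == 0, "channel count must be divisible by N_s"
    m = c // n_s
    return f_t.reshape(n_s, m).mean(axis=1)
```

Applied at every spatial position of the 512-channel feature map, this yields the h' × w' × N_s semantic vector map S.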
Step 3, constructing the glance path prediction model of the encoder-decoder framework
The encoder encodes the global information of the image, converting the image into a fixed-length vector. The encoder adopts a VGG model; because all images share the size h × w, adding a fully connected layer at the end of the model yields a coding vector of fixed dimension, whose dimensionality can be adjusted:
F_encoder = VGG_encoder(I)
h_0 = FC(F_encoder)
where F_encoder is the feature map of the image through the VGG_encoder network, FC denotes the fully connected layer, and h_0 is the hidden-state initialization value of the decoder LSTM network.
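A minimal numpy sketch of the fully connected projection that produces the decoder's initial hidden state (names and shapes are illustrative assumptions; in practice this is a trained layer, e.g. torch.nn.Linear):

```python
import numpy as np

def encode_initial_state(f_encoder, W_fc, b_fc):
    """Flatten the encoder feature map and project it with a fully
    connected layer to the decoder LSTM's hidden dimension, giving h_0."""
    v = f_encoder.reshape(-1)           # flatten the h' x w' x C feature map
    return W_fc @ v + b_fc              # fixed-dimension coding vector h_0
```

Because all images are resized to h × w, the flattened feature map always has the same length, so a single weight matrix W_fc suffices.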
The decoder uses the LSTM network to mine the association between the semantic vectors of the fixation points along the glance path, inferring the semantic vector of the next fixation point. The LSTM network is computed as follows:
f_t = σ(W_f · [h_{t−1}, x_t] + b_f)
i_t = σ(W_i · [h_{t−1}, x_t] + b_i)
c̃_t = tanh(W_c · [h_{t−1}, x_t] + b_c)
c_t = f_t ⊙ c_{t−1} + i_t ⊙ c̃_t
o_t = σ(W_o · [h_{t−1}, x_t] + b_o)
h_t = o_t ⊙ tanh(c_t)
where h_t is the hidden state at time t, c_t the cell state at time t, and f_t, i_t, c̃_t, o_t respectively the forget gate, input gate, cell candidate, and output gate of the LSTM network; σ(·) denotes the sigmoid function, tanh(·) the hyperbolic tangent, ⊙ the Hadamard product, and W_f, W_i, W_c, W_o are the parameter matrices of the LSTM network. The gating mechanism controls what information is remembered and updated, effectively alleviating the vanishing- and exploding-gradient problems and modeling long-range dependencies.
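A minimal numpy sketch of one LSTM step following these equations (the stacked weight layout and the concatenation order [h_{t−1}, x_t] are the standard convention assumed here, not mandated by the patent):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps [h_{t-1}, x_t] to the four stacked gate
    pre-activations (f, i, c-candidate, o), each of hidden size H."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([h_prev, x_t]) + b
    f = sigmoid(z[:H])            # forget gate
    i = sigmoid(z[H:2*H])         # input gate
    c_hat = np.tanh(z[2*H:3*H])   # cell candidate
    o = sigmoid(z[3*H:])          # output gate
    c = f * c_prev + i * c_hat    # new cell state
    h = o * np.tanh(c)            # new hidden state
    return h, c
```

Iterating this step over the semantic vector sequence (x_1, x_2, …), with h_0 initialized from the image coding vector, realizes the decoder.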
Because the output vector dimension of the LSTM network is the hidden-layer dimension, a linear layer must be added after the LSTM network so that the output feature dimension equals the input semantic vector dimension, i.e.:
y_t = FC(h_t)
In summary, the glance path prediction model of the encoder-decoder framework produces a semantic vector output at each moment, forming the predicted semantic vector sequence (y_1, y_2, …, y_n).
Step 4, training a glance path prediction model
The input of the encoder-decoder model is the sequence of glance path semantic vectors for the image and its corresponding, the output is the predicted gaze point semantic vector, and the true value is the real gaze point semantic vector at the next moment. One image has a plurality of fixation point semantic vector sequences, and the plurality of semantic vector sequences are input into the LSTM network in parallel to improve the model calculation speed. Calculating Euclidean distances between each predicted semantic vector and each real semantic vector and summing the Euclidean distances to serve as a loss function, taking a minimized loss function as an optimization target, and calculating the loss function according to the predicted semantic vector sequence and the real semantic vector sequence at the previous n-1 moment:
where α is a hyper-parameter, S represents a semantic vector in a semantic vector graph S, qiIndicating the coordinates corresponding to the ith gaze point. And the first term in the loss function represents the distance between the predicted semantic vector sequence and the real semantic vector sequence at the previous n-1 moment, and the second term represents the reciprocal of the sum of the distances between the predicted semantic vector and all non-real semantic vector sequences in the image. The model parameters are optimized such that the loss function is minimized, i.e. such that the predicted semantic vector is as close as possible to the true semantic vector and as far as possible from the non-true semantic vector.
After the loss function is defined, the Adam (Adaptive Moment Estimation) algorithm is adopted to train the encoder-decoder model.
Step 5, predicting the saccade path
After the trained glance path prediction model is obtained, the test image is fed to the encoder to produce the image coding vector, which is passed to the decoder; the test image is also fed to the semantic extractor, mapping it to the semantic space. The initial point q_0 is chosen as the image center point, and its semantic vector, obtained through the semantic extractor, is fed to the LSTM as the initial value:
x_0 = S(q_0)
The semantic vector y_i is obtained through the LSTM network and the fully connected layer, and a distance map is computed from the semantic distances between y_i and the image blocks:
D_i(m, n) = ||y_i - s_{m,n}||_2
where m and n are the horizontal and vertical coordinates of the distance map, respectively. The distance map is then normalized and converted into a probability map:
A Gaussian prior template is then added:
where η = (η_x, η_y) and β represent the eye-movement angle difference and amplitude difference, respectively. An inhibition-of-return (IOR) prior is also added:
These probability maps are multiplied together:
and the point with the maximum probability value is selected as the saccade landing position:
After the predicted gaze point position is obtained, the semantic vector corresponding to that position is used as the LSTM input at the next moment; after T time steps, a glance path sequence of length T is obtained.
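The per-step selection described above (distance map, probabilization, Gaussian prior, IOR prior, maximum-probability point) can be sketched as follows. Since the patent's normalization and prior formulas are not reproduced in the extracted text, the exact forms used here (exponential probabilization, isotropic Gaussian, hard IOR disc) are assumptions:

```python
import numpy as np

def predict_next_fixation(y_i, S, prev_fixations, eta=(0.0, 0.0), beta=8.0,
                          ior_radius=4):
    """One saccade-selection step (a sketch; normalization and prior forms
    are assumed). y_i: predicted semantic vector; S: H x W x d semantic
    vector map; prev_fixations: list of (row, col) fixations visited."""
    H, W, _ = S.shape
    # distance map D_i(m, n) = ||y_i - s_{m,n}||_2
    D = np.linalg.norm(S - y_i, axis=-1)
    # normalize and turn into a probability map (small distance -> high prob.)
    P = np.exp(-(D - D.min()) / (D.max() - D.min() + 1e-8))
    P /= P.sum()
    # Gaussian prior centred on the last fixation, offset eta, width beta
    r0, c0 = prev_fixations[-1]
    rr, cc = np.mgrid[0:H, 0:W]
    G = np.exp(-(((rr - r0 - eta[0]) ** 2) + ((cc - c0 - eta[1]) ** 2))
               / (2 * beta ** 2))
    # inhibition-of-return: suppress a disc around every earlier fixation
    IOR = np.ones((H, W))
    for (r, c) in prev_fixations:
        IOR[(rr - r) ** 2 + (cc - c) ** 2 <= ior_radius ** 2] = 0.0
    # multiply the maps and take the maximum-probability point
    prob = P * G * IOR
    return np.unravel_index(np.argmax(prob), prob.shape)
```

Iterating this function T times, feeding the semantic vector at each predicted fixation back into the LSTM, yields a glance path of length T as described above.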
The method is implemented on the Ubuntu 16.04.4 operating system with the PyTorch 1.6 deep learning framework. Input images are resized to a uniform size of h × w = 512 × 512, and the dimensionality of the semantic vector is 64. The model is built according to the above steps and trained on the training set to obtain the model parameters; glance paths are then predicted on the test-set images. The initial fixation point is selected at the image center, the predicted glance path length is T = 10, and the IOR prior radius is min(h, w) × 1/16. The visualized glance path prediction results of the method are shown in FIG. 5.
The glance paths predicted by the method of the invention are compared with those predicted by the method of Thuyen Ngo et al. (ICIP); the visualized results show that the method conforms to the pattern of human glance paths. Evaluation uses three indices: MultiMatch (MM), Hausdorff Distance (HD) and Mean Minimum Distance (MMD), where a higher MM score is better and lower HD and MMD scores are better. The scores are:

Metric | Method of the invention | Thuyen Ngo et al.
---|---|---
MM Shape | 0.9424 | 0.9108
MM Direction | 0.6616 | 0.6421
MM Length | 0.9440 | 0.9142
MM Position | 0.8423 | 0.7822
HD | 121.2675 | 204.6523
MMD | 95.5071 | 144.9966

From these evaluation criteria it can be concluded that the method of the invention outperforms the compared method in objective evaluation.
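The HD and MMD scores above can be computed for two fixation sequences with straightforward NumPy. MMD definitions vary slightly across papers; the symmetric form below is one common choice and is an assumption, not necessarily the exact formula used here:

```python
import numpy as np

def hausdorff_distance(P, Q):
    """Hausdorff distance between two scanpaths P and Q, each an
    (n, 2) array of fixation coordinates."""
    # pairwise Euclidean distances between every fixation in P and Q
    D = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    # max over both directed nearest-neighbour distances
    return max(D.min(axis=1).max(), D.min(axis=0).max())

def mean_minimum_distance(P, Q):
    """Mean minimum distance: average of the mean nearest-neighbour
    distance taken in both directions between the two scanpaths."""
    D = np.linalg.norm(P[:, None, :] - Q[None, :, :], axis=-1)
    return 0.5 * (D.min(axis=1).mean() + D.min(axis=0).mean())
```

Both metrics are zero for identical scanpaths and grow as the predicted path drifts from the ground-truth path, which is why lower scores indicate a better prediction.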
While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications or substitutions can be easily made by those skilled in the art within the technical scope of the present disclosure.
Claims (9)
1. A method for glance path prediction based on semantic inference, the method comprising:
constructing an image glance path training set;
constructing a semantic extractor to extract the semantic features of the fixation point;
constructing a glance path prediction model of an encoder-decoder framework;
training a glance path prediction model;
predicting a glance path of the image.
2. The method of claim 1, wherein constructing the image glance path training set comprises collecting images and the corresponding glance paths of the images, transforming all the images to a uniform size, and calculating the pixel points corresponding to the fixation points of the transformed glance paths.
3. The method of claim 1, wherein constructing a semantic extractor to extract the gaze point semantic features specifically comprises: mapping the image to a semantic space with a trained semantic extractor to obtain the semantic vector corresponding to the fixation point.
4. A semantic inference based glance path prediction method as claimed in claim 3, wherein said semantic extractor is a CNN model.
5. The method of claim 1, wherein constructing a glance path prediction model of an encoder-decoder framework specifically comprises: the encoder outputs a coding vector that encodes the global information of the image; the coding vector is used as the initial value of the decoder, and the decoder learns the gaze point semantic inference relation.
6. A semantic inference based glance path prediction method as in claim 5, wherein the encoder-decoder framework is a CNN model-LSTM network.
7. The method of claim 1, wherein training the glance path prediction model specifically comprises: optimizing the encoder-decoder network to minimize the loss function, with the Euclidean distance from the predicted gaze point semantic vector to the true gaze point semantic vector as the loss function.
8. A semantic inference based glance path prediction method according to claim 1, wherein predicting the glance path of the image comprises: inputting the image into the optimized network for testing to obtain the glance path.
9. A semantic inference based glance path prediction apparatus, comprising:
an image processing module for constructing an image glance path training set;
the characteristic extraction module is used for extracting the semantic characteristics of the fixation point;
a training module for establishing and training a glance path prediction model;
a prediction module to predict a saccade path.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110652817.7A CN113313123B (en) | 2021-06-11 | 2021-06-11 | Glance path prediction method based on semantic inference |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113313123A true CN113313123A (en) | 2021-08-27 |
CN113313123B CN113313123B (en) | 2024-04-02 |
Family
ID=77378522
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110652817.7A Active CN113313123B (en) | 2021-06-11 | 2021-06-11 | Glance path prediction method based on semantic inference |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113313123B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109447096A (en) * | 2018-04-13 | 2019-03-08 | 西安电子科技大学 | A kind of pan path prediction technique and device based on machine learning |
US20190080623A1 (en) * | 2017-09-14 | 2019-03-14 | Massachusetts Institute Of Technology | Eye Tracking As A Language Proficiency Test |
US20190096125A1 (en) * | 2017-09-28 | 2019-03-28 | Nec Laboratories America, Inc. | Generating occlusion-aware bird eye view representations of complex road scenes |
CN110298303A (en) * | 2019-06-27 | 2019-10-01 | 西北工业大学 | A kind of crowd recognition method based on the long pan of memory network in short-term path learning |
Non-Patent Citations (2)
Title |
---|
Li Na; Zhao Xinbo: "A visual attention model integrating semantic object features", Journal of Harbin Institute of Technology, no. 05 *
Gong Sihong: "A new method for predicting human eye saccade paths", Electronic Technology & Software Engineering, no. 03 *
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113762393A (en) * | 2021-09-08 | 2021-12-07 | 杭州网易智企科技有限公司 | Model training method, gaze point detection method, medium, device, and computing device |
CN115037962A (en) * | 2022-05-31 | 2022-09-09 | 咪咕视讯科技有限公司 | Video adaptive transmission method, device, terminal equipment and storage medium |
CN115037962B (en) * | 2022-05-31 | 2024-03-12 | 咪咕视讯科技有限公司 | Video self-adaptive transmission method, device, terminal equipment and storage medium |
CN116343012A (en) * | 2023-05-29 | 2023-06-27 | 江西财经大学 | Panoramic image glance path prediction method based on depth Markov model |
CN116343012B (en) * | 2023-05-29 | 2023-07-21 | 江西财经大学 | Panoramic image glance path prediction method based on depth Markov model |
CN116563524A (en) * | 2023-06-28 | 2023-08-08 | 南京航空航天大学 | Glance path prediction method based on multi-vision memory unit |
CN116563524B (en) * | 2023-06-28 | 2023-09-29 | 南京航空航天大学 | Glance path prediction method based on multi-vision memory unit |
Also Published As
Publication number | Publication date |
---|---|
CN113313123B (en) | 2024-04-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zheng et al. | A novel background subtraction algorithm based on parallel vision and Bayesian GANs | |
CN106407889B (en) | Method for recognizing human body interaction in video based on optical flow graph deep learning model | |
CN113313123A (en) | Semantic inference based glance path prediction method | |
CN110210429B (en) | Method for generating network based on optical flow, image and motion confrontation to improve recognition accuracy rate of anxiety, depression and angry expression | |
CN112949647B (en) | Three-dimensional scene description method and device, electronic equipment and storage medium | |
CN111160294B (en) | Gait recognition method based on graph convolution network | |
CN114049381A (en) | Twin cross target tracking method fusing multilayer semantic information | |
CN110298303B (en) | Crowd identification method based on long-time memory network glance path learning | |
Harley et al. | Learning from unlabelled videos using contrastive predictive neural 3d mapping | |
CN110334656A (en) | Multi-source Remote Sensing Images Clean water withdraw method and device based on information source probability weight | |
CN111626152B (en) | Space-time line-of-sight direction estimation prototype design method based on Few-shot | |
Xiong et al. | Contextual sa-attention convolutional LSTM for precipitation nowcasting: A spatiotemporal sequence forecasting view | |
CN114419732A (en) | HRNet human body posture identification method based on attention mechanism optimization | |
Ning et al. | Deep Spatial/temporal-level feature engineering for Tennis-based action recognition | |
CN116524593A (en) | Dynamic gesture recognition method, system, equipment and medium | |
CN116543021A (en) | Siamese network video single-target tracking method based on feature fusion | |
Huang et al. | Football players’ shooting posture norm based on deep learning in sports event video | |
CN115359550A (en) | Gait emotion recognition method and device based on Transformer, electronic device and storage medium | |
CN114140524A (en) | Closed loop detection system and method for multi-scale feature fusion | |
Ammar et al. | Comparative Study of latest CNN based Optical Flow Estimation | |
CN113760091A (en) | Mobile terminal perception computing technology and Internet of things technology application system | |
Zhou et al. | Motion balance ability detection based on video analysis in virtual reality environment | |
Jin | A three-dimensional animation character dance movement model based on the edge distance random matrix | |
Li | A new physical posture recognition method based on feature complement-oriented convolutional neural network | |
van Staden et al. | An Evaluation of YOLO-Based Algorithms for Hand Detection in the Kitchen |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||