CN109447096B - Glance path prediction method and device based on machine learning - Google Patents

Glance path prediction method and device based on machine learning

Info

Publication number
CN109447096B
CN109447096B (application CN201810332835.5A)
Authority
CN
China
Prior art keywords
LSTM network
information
image
training
obtaining
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810332835.5A
Other languages
Chinese (zh)
Other versions
CN109447096A (en)
Inventor
齐飞 (Qi Fei)
高帅 (Gao Shuai)
石光明 (Shi Guangming)
夏朝辉 (Xia Zhaohui)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University filed Critical Xidian University
Priority to CN201810332835.5A priority Critical patent/CN109447096B/en
Publication of CN109447096A publication Critical patent/CN109447096A/en
Application granted granted Critical
Publication of CN109447096B publication Critical patent/CN109447096B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: Physics
    • G06: Computing; calculating or counting
    • G06T: Image data processing or generation, in general
    • G06T 7/00: Image analysis
    • G06T 7/0002: Inspection of images, e.g. flaw detection
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/20: Special algorithmic details
    • G06T 2207/20081: Training; learning
    • G06T 2207/20084: Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention provides a saccade (glance) path prediction method and device based on machine learning, relating to the field of computer technology. The method comprises the following steps: obtaining an image data set to be processed, wherein each image information in the image data set has corresponding truth value information; making training samples of the image data set according to the truth value information; obtaining image feature representation information of the image information according to the image information; constructing and training an LSTM network according to the image feature representation information and the eye movement data samples; and predicting a saccade path according to the LSTM network. The method solves the technical problems that in the prior art the predicted fixation point depends too heavily on a static saliency map and the predicted saccade path in natural-scene pictures is deficient; it eliminates the model's dependence on the saliency map, takes the temporal order of the fixation points into account, and achieves good results on several public data sets.

Description

Glance path prediction method and device based on machine learning
Technical Field
The invention relates to the technical field of image processing, in particular to a method and a device for predicting a saccade path based on machine learning.
Background
With the rapid development of information technology, we have entered an era of large-scale data growth. Digital images and videos have become important carriers of information, and massive image data is a major source from which information is obtained; how to effectively select the most valuable information from images has gradually become a hot topic in the field of image processing.
In the prior art, fixation point prediction depends too heavily on a static saliency map, and existing methods also have many deficiencies when predicting saccade paths in natural-scene pictures.
Disclosure of Invention
The embodiment of the invention provides a saccade path prediction method and device based on machine learning, which solve the problems that the predicted fixation point depends too heavily on a static saliency map in the prior art and that the predicted saccade path in natural-scene pictures is deficient; they eliminate the model's dependence on the saliency map, take the temporal order of the fixation points into consideration, and achieve good results on several public data sets.
In view of the foregoing, embodiments of the present application are proposed to provide a method and apparatus for predicting a glance path based on machine learning.
In a first aspect, the present invention provides a saccade path prediction method based on machine learning, including: obtaining an image data set to be processed, wherein each image information in the image data set has corresponding truth value information; making a training sample of the image data set according to the truth value information; obtaining image feature representation information of the image information according to the image information; constructing and training an LSTM network according to the image feature representation information and the eye movement data samples; and predicting a saccade path according to the LSTM network.
Preferably, the making a training sample of the image data set according to the truth value information specifically includes: processing the truth value information to obtain eye movement data information of N observers; performing boundary processing on the eye movement data of the N observers; normalizing the eye movement data of the N observers after the boundary processing; and combining the eye movement data of the N observers to obtain the training sample, wherein N is a positive integer.
Preferably, the obtaining of the image feature representation information of the image information specifically includes: obtaining a training set and a testing set according to the image data set; cutting the image information of the training set into a standard size; constructing a convolutional neural network, and loading the trained model parameters; and taking the image information as the input of a convolutional neural network, and outputting image feature representation information of the image information.
Preferably, the constructing and training of the LSTM network specifically includes: obtaining the coordinates of the LSTM network, and defining a corresponding weight matrix according to the coordinates; taking the image feature representation information and the weight matrix corresponding to the coordinates as the input of the LSTM network; applying the input gate, forget gate, and output gate operations to the input using a forward propagation method; decoding the LSTM network output with a deep output layer; and inputting the image feature representation information into the LSTM network and training it with a back propagation algorithm.
Preferably, the method further comprises: loading the LSTM network and inputting the training samples into the LSTM network; obtaining an output feature vector of the LSTM network by using a forward propagation algorithm; and inputting the output characteristic vector and the truth value information into the LSTM network, and obtaining the fixation point coordinate by using a forward propagation algorithm.
In a second aspect, the present invention provides a machine learning based glance path prediction apparatus, comprising:
a first obtaining unit, configured to obtain an image dataset to be processed, where each image information in the image dataset has corresponding true value information;
a first production unit, configured to produce a training sample of the image data set according to the truth value information;
a second obtaining unit configured to obtain image feature representation information of the image information based on the image information;
a first constructing unit, configured to construct and train an LSTM network according to the image feature representation information and the eye movement data samples;
a first prediction unit to predict a saccade path according to the LSTM network.
Preferably, the apparatus further comprises:
a third obtaining unit, configured to process the truth value information to obtain eye movement data information of the N observers;
a first processing unit for performing boundary processing on the eye movement data of the N observers;
a first normalization unit, configured to normalize the eye movement data of the N observers after the boundary processing;
a first merging unit, configured to merge the eye movement data of the N observers to obtain the training sample.
Preferably, the apparatus further comprises:
a fourth obtaining unit, configured to obtain a training set and a test set according to the image data set;
a first cropping unit for cropping the training set image information to a standard size;
the first construction unit is used for constructing a convolutional neural network and loading the trained model parameters;
a first output unit configured to output image feature representation information of the image information using the image information as an input to a convolutional neural network.
Preferably, the apparatus further comprises:
a fifth obtaining unit, configured to obtain coordinates of the LSTM network, and define a corresponding weight matrix according to the coordinates;
a first input unit configured to take the image feature representation information and the weight matrix corresponding to the coordinates as the input of the LSTM network;
a first operation unit for applying the input gate, forget gate, and output gate operations to the input using a forward propagation method;
a first decoding unit to decode the LSTM network output according to a deep output layer;
a first training unit for inputting the image feature representation information to the LSTM network, and training the LSTM network using a back propagation algorithm.
Preferably, the apparatus further comprises:
a second input unit, configured to load the LSTM network and input the training samples into the LSTM network;
a sixth obtaining unit, configured to obtain an output feature vector of the LSTM network using a forward propagation algorithm;
a seventh obtaining unit, configured to input the output feature vector and the truth value information into the LSTM network, and obtain the fixation point coordinates using a forward propagation algorithm.
In a third aspect, the present invention provides a machine learning based saccade path prediction apparatus comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the program: obtaining an image data set to be processed, wherein each image information in the image data set has corresponding truth value information; making a training sample of the image data set according to the truth value information; obtaining image feature representation information of the image information according to the image information; constructing and training an LSTM network according to the image feature representation information and the eye movement data samples; and predicting a saccade path according to the LSTM network.
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
1. The machine learning based saccade path prediction method and device of the embodiments obtain an image data set to be processed, wherein each piece of image information in the data set has corresponding truth value information; make training samples of the image data set according to the truth value information; obtain image feature representation information of the image information; construct and train an LSTM network according to the image feature representation information and the eye movement data samples; and predict a saccade path according to the LSTM network. This solves the technical problems that in the prior art the predicted fixation point depends too heavily on a static saliency map and the predicted saccade path in natural-scene pictures is deficient; it eliminates the model's dependence on the saliency map, takes the temporal order of the fixation points into account, and achieves good results on several public data sets.
2. The image features are extracted by adopting the convolutional neural network, the convolutional neural network has strong capability of representing learning and can learn higher-level features by using a layer-by-layer learning strategy, the defects of a manual selection or combined multi-dimensional feature selection method in the prior art are overcome, and the method has better universality and expandability.
3. The invention estimates the saccade path by constructing an LSTM network, whose structure is well suited to processing time series. The LSTM network is trained on the currently attended image region together with the fixation points generated so far, simulating the saccade stage of human visual processing and the propagation and prediction of information in the visual cortex. This makes the model consistent with the human saccade process at the level of biological mechanism and yields saccade path results consistent with human eye movement data.
4. According to the invention, by introducing an attention mechanism into the network, each step of network output allows a decoder to pay attention to different parts of the image, and finally, the trained model can learn which part of the image should be paid attention to, so as to guide the decoding of the network output.
The foregoing description is only an overview of the technical solutions of the present invention, and the embodiments of the present invention are described below in order to make the technical means of the present invention more clearly understood and to make the above and other objects, features, and advantages of the present invention more clearly understandable.
Drawings
FIG. 1 is a flow chart of a method for predicting a glance path based on machine learning according to an embodiment of the present invention;
FIG. 2 is a block diagram of a convolutional neural network in accordance with an embodiment of the present invention;
FIG. 3 is a schematic diagram of a glance path predicting device based on machine learning according to an embodiment of the present invention;
fig. 4 is a schematic structural diagram of another glance path predicting device based on machine learning according to an embodiment of the present invention.
The reference numbers illustrate: a bus 300, a receiver 301, a processor 302, a transmitter 303, a memory 304, a bus interface 306.
Detailed Description
The embodiment of the invention provides a saccade path prediction method and device based on machine learning, which are used to solve the problems that the predicted fixation point is overly dependent on a static saliency map in the prior art and that the predicted saccade path in natural-scene pictures is deficient. The general idea of the technical solution provided by the invention is as follows:
In the technical solution of the embodiment of the invention, an image data set to be processed is obtained, wherein each image information in the image data set has corresponding truth value information; training samples of the image data set are made according to the truth value information; image feature representation information is obtained from the image information; an LSTM network is constructed and trained according to the image feature representation information and the eye movement data samples; and a saccade path is predicted according to the LSTM network. This eliminates the model's dependence on the saliency map, takes the temporal order of the fixation points into account, and achieves good results on several public data sets.
The technical solutions of the present invention are described in detail below with reference to the drawings and specific embodiments, and it should be understood that the specific features in the embodiments and examples of the present invention are described in detail in the technical solutions of the present application, and are not limited to the technical solutions of the present application, and the technical features in the embodiments and examples of the present application may be combined with each other without conflict.
The term "and/or" herein is merely an association describing an associated object, meaning that three relationships may exist, e.g., a and/or B, may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter associated objects are in an "or" relationship.
In order to more clearly disclose a glance path prediction method based on machine learning provided by the embodiments of the present application, some terms are described below.
A Convolutional Neural Network (CNN) is a feed-forward neural network whose artificial neurons respond to a local region of the input (their receptive field); it performs well for large-scale image processing. It consists of convolutional layers and pooling layers.
LSTM (Long Short-Term Memory) is an improved recurrent neural network, first published in 1997. Owing to its unique design, the LSTM is well suited to processing and predicting important events with very long intervals and delays in a time series.
TensorFlow is a second-generation machine learning system developed by Google on the basis of DistBelief, and its name comes from its operating principle: a tensor is an N-dimensional array, and flow denotes computation on a dataflow graph, so TensorFlow describes tensors flowing from one end of a dataflow graph to the other. TensorFlow feeds complex data structures into artificial neural networks for analysis and processing.
BasicLSTMCell is the basic LSTM recurrent network cell in TensorFlow.
Example 1
Fig. 1 is a flowchart illustrating a glance path prediction method based on machine learning according to an embodiment of the present invention. As shown in fig. 1, the method includes:
step 110: obtaining an image data set to be processed, wherein each image information in the image data set has corresponding true value information;
Specifically, the image data set to be processed is a set of pictures to be processed, and the corresponding truth value information is the fixation point coordinates of each image, used as its label.
Step 120: according to the truth value information, making a training sample of the image data set;
further, the making of the training sample of the image data set according to the truth information specifically includes: processing the truth value information to obtain eye movement data information of N observers; performing boundary processing on the eye movement data of the N observers; normalizing the eye movement data of the N observers after the boundary processing; and combining the eye movement data of the N observers to obtain the training sample, wherein N is a positive integer.
Specifically, regarding the processing of the truth values: each image data set has corresponding eye movement data, with eye movement data from N observers per picture. The eye movement data is first boundary-processed, mapping every point outside the image onto the image boundary. The corresponding eye movement data is then selected via a dictionary each time and normalized, and the data of the N observers is merged to obtain the training sequences. Each sequence consists of 8 fixation point coordinates; flattened to one dimension, a sequence contains 16 numbers.
Several experiments are performed on each data set. For example, the MIT1003 data set contains 1003 pictures, each viewed by 15 observers. The picture sequence numbers are mapped to the numbers 0-1003 to obtain a dictionary, and in each experiment 900 pictures are selected from the dictionary in order, with 0-900 used for training and 900-1003 for testing. Truth values are then selected from the truth data in the same way as the pictures, i.e., 900 × 15 = 13,500 eye movement sequences serve as labels. Training samples and labels are made in this way for each experiment.
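As an illustration of this sample-making step, the boundary processing and normalization described above can be sketched in NumPy (a minimal sketch; the function name and the per-image layout are assumptions, not taken from the patent):

```python
import numpy as np

def make_training_sequence(fixations, width, height):
    """Clip gaze points to the image boundary, normalize them to [0, 1],
    and flatten 8 (x, y) fixations into a 16-number sequence."""
    pts = np.asarray(fixations, dtype=float)       # shape (8, 2)
    pts[:, 0] = np.clip(pts[:, 0], 0, width - 1)   # boundary processing: points
    pts[:, 1] = np.clip(pts[:, 1], 0, height - 1)  # outside map to the boundary
    pts[:, 0] /= width - 1                         # normalization
    pts[:, 1] /= height - 1
    return pts.reshape(-1)                         # one sequence of 16 numbers

# A fixation at (-5, 10) is clipped to the boundary before normalizing.
seq = make_training_sequence([[-5, 10], [300, 40]] + [[100, 100]] * 6, 256, 192)
```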
Step 130: obtaining image characteristic representation information of the image information according to the image information;
further, the obtaining of the image feature representation information of the image information specifically includes: obtaining a training set and a test set according to the image data set; cutting the image information of the training set into a standard size; constructing a convolutional neural network, and loading the trained model parameters; and taking the image information as the input of a convolutional neural network, and outputting image feature representation information of the image information.
Specifically, the image data set has corresponding numbers: the picture names in the data set are mapped to numbers, and in each run pictures are selected by number as the training set and the test set. The structure of the convolutional neural network is described below with reference to Fig. 2. The saccade path model established by the invention mainly comprises an encoding network, a decoding network, and an output layer. The encoding network consists of a convolutional neural network with five parts: the first part has two convolutional layers; the second, two; the third, four; the fourth, four; and the fifth, four. Each part comprises convolution operations followed by a pooling operation; all convolution kernels are of size 3 × 3, and the activation function of every convolutional layer is the rectified linear unit. The convolutional layers have 64 kernels in the first part, 128 in the second, 256 in the third, and 512 in the last two. Extracting image features with a convolutional neural network exploits its strong representation learning ability: through a layer-by-layer learning strategy it learns progressively higher-level features, overcoming the shortcomings of the manually selected or combined multi-dimensional feature selection methods of the prior art, with better generality and extensibility.
The embodiment of the application adopts model parameters pre-trained with VGG19; since the VGG19 network is trained on the large ImageNet data set, it can extract accurate image features, so the VGG19 model parameters are loaded for feature extraction. The image, of size 224 × 224 × 3, is the input of the convolutional neural network, and the output is the set of image feature vectors a = {α_1, ..., α_L}, α_i ∈ R^D, where L = 196 and D = 512. For each picture, the network extracts L vectors, each corresponding to a region of the image.
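The shape bookkeeping here is easy to verify: a 224 × 224 input passed through a VGG-style encoder yields a 14 × 14 × 512 feature map, which flattens into L = 196 region vectors of dimension D = 512 (a sketch with a random stand-in for the real convolutional features):

```python
import numpy as np

# Stand-in for the feature map of a VGG-style encoder on a 224 x 224 x 3
# input: a 14 x 14 spatial grid with 512 channels.
feature_map = np.random.rand(14, 14, 512)

# Flatten into the region vectors a = {a_1, ..., a_L}, a_i in R^D.
alpha = feature_map.reshape(-1, 512)
L, D = alpha.shape
```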
Step 140: constructing and training an LSTM network according to the image feature representation information and the eye movement data sample;
Further, the constructing and training of the LSTM network specifically includes: obtaining the coordinates of the LSTM network, and defining a corresponding weight matrix according to the coordinates; taking the image feature representation information and the weight matrix corresponding to the coordinates as the input of the LSTM network; applying the input gate, forget gate, and output gate operations to the input using a forward propagation method; decoding the LSTM network output with a deep output layer; and inputting the image feature representation information into the LSTM network and training it with a back propagation algorithm.
Specifically, the LSTM network consists of BasicLSTMCell units in TensorFlow, where the number of units is H = 1024. We define the coordinate sequence produced by the generative model as

y = {y_1, ..., y_C}, y_i ∈ R^K,

where y_i is a 1 × K vector, K is the size of the coordinate base, and C is the length of the generated sequence; in the experiments C = 8, i.e., eight fixation points are generated per image.
In Fig. 2, i_t is the input gate, f_t the forget gate, o_t the output gate, and g_t a candidate vector controlled by the input gate; h_{t-1} denotes the hidden state at the previous time step, ẑ_t the context vector at time t, and E y_{t-1} the embedding vector of the output at time t-1, obtained through the embedding matrix E. The embedding matrix E takes the total weight matrix and the true coordinates corresponding to x and y as the input of the function embedding_lookup(ids), yielding the weight matrix corresponding to x and y. By introducing an attention mechanism into the network, each output step allows the decoder to attend to different parts of the image, so that the trained model finally learns which part of the image should be attended to, guiding the decoding of the network output.
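The gate operations named above amount to the standard LSTM forward step; a minimal NumPy sketch follows (the fused weight layout is an assumption for illustration — TensorFlow's BasicLSTMCell orders its gates differently):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One forward step with input gate i, forget gate f, output gate o, and
    candidate vector g.  W has shape (dim_x + H, 4H); b has shape (4H,)."""
    z = np.concatenate([x, h_prev]) @ W + b
    H = h_prev.size
    i = sigmoid(z[:H])            # input gate i_t
    f = sigmoid(z[H:2 * H])       # forget gate f_t
    o = sigmoid(z[2 * H:3 * H])   # output gate o_t
    g = np.tanh(z[3 * H:])        # candidate vector g_t
    c = f * c_prev + i * g        # new cell state
    h = o * np.tanh(c)            # new hidden state
    return h, c

rng = np.random.default_rng(0)
dim_x, H = 8, 4                   # toy sizes (the patent uses H = 1024)
W = 0.1 * rng.standard_normal((dim_x + H, 4 * H))
h, c = lstm_step(rng.standard_normal(dim_x), np.zeros(H), np.zeros(H),
                 W, np.zeros(4 * H))
```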
The output formula of the LSTM network is as follows:

p(y_t | a, y_{t-1}) ∝ exp( L_o ( E y_{t-1} + L_h h_t + L_z ẑ_t ) )
This is realized through a deep output layer network comprising two neural network layers. The first layer applies dropout to the hidden state, obtains an output h_logits by logistic regression, adds the context information and the previously generated coordinate information to h_logits, and applies a tanh activation followed by dropout; the second layer obtains the output out_logits from the first layer's output by logistic regression. Dropout means that during the training of a neural network, in each iteration some neurons are temporarily dropped from the network with a certain probability; the dropped nodes can temporarily be regarded as not being part of the network structure, but their weights are retained.
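A sketch of such a two-layer deep output (the names h_logits/out_logits follow the description above; the exact shapes and the inverted-dropout form are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)

def deep_output_layer(h, z_ctx, ey_prev, W_h, W_out, drop_p=0.5, train=False):
    """First layer: project the hidden state, add context and the embedded
    previous coordinate, tanh; second layer: project to coordinate logits."""
    h_logits = h @ W_h                          # logistic-regression projection
    h_logits = np.tanh(h_logits + z_ctx + ey_prev)
    if train:                                   # dropout only during training
        keep = rng.random(h_logits.shape) >= drop_p
        h_logits = h_logits * keep / (1.0 - drop_p)
    return h_logits @ W_out                     # out_logits over the K-way base

H, M, K = 8, 8, 10                              # toy sizes
logits = deep_output_layer(rng.standard_normal(H), np.zeros(M), np.zeros(M),
                           rng.standard_normal((H, M)),
                           rng.standard_normal((M, K)))
```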
Further, before training of the LSTM network begins, all weight parameters are randomly initialized to numbers close to 0 and all biases to 0. The initial hidden state h and cell state c are obtained through two separate multilayer perceptrons, which take the mean of the image region features as input:

c_0 = f_init,c( (1/L) Σ_{i=1}^{L} α_i )

h_0 = f_init,h( (1/L) Σ_{i=1}^{L} α_i )
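These two initializers can be sketched as perceptrons over the mean region feature (a single tanh layer per initializer is an assumption):

```python
import numpy as np

def init_states(alpha, W_c, b_c, W_h, b_h):
    """h_0 and c_0 from the mean of the L image-region features, each through
    its own perceptron f_init (one tanh layer assumed here)."""
    a_mean = alpha.mean(axis=0)        # (1/L) * sum_i alpha_i
    c0 = np.tanh(a_mean @ W_c + b_c)
    h0 = np.tanh(a_mean @ W_h + b_h)
    return h0, c0

rng = np.random.default_rng(2)
alpha = rng.random((196, 4))           # toy feature dim instead of D = 512
h0, c0 = init_states(alpha,
                     rng.standard_normal((4, 3)), np.zeros(3),
                     rng.standard_normal((4, 3)), np.zeros(3))
```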
The training samples are randomly divided into smaller batches; the batch size chosen in the experiment is 25. In each training iteration, one batch of image feature vectors and truth value data is fed in.
the cost of the LSTM network is calculated as follows:
l_ti = -[ y_ti ln a_ti + (1 - y_ti) ln(1 - a_ti) ]

loss_t = Σ_{i=1}^{N} l_ti

loss = Σ_t loss_t

where y_ti denotes the target (truth) value, a_ti the actual output of the network, and l_ti the value of the loss function for the i-th sample at time t. Each time the network is trained on N samples, the loss values of the N samples at each time t are summed to obtain the loss loss_t of all samples at time t; summing over all training time steps t then gives the total loss of the N samples.
According to this cost, the cost function of the LSTM network is optimized with the gradient descent optimization algorithm RMSProp, and the model parameters of the LSTM network are updated layer by layer through the back propagation algorithm. The LSTM network is trained to convergence, and its model and parameters are persisted. The invention estimates the saccade path by constructing an LSTM network, whose structure is suited to processing time series; the network is trained on the currently attended image region together with the fixation points generated so far, simulating the saccade stage of human visual processing and the propagation and prediction of information in the visual cortex, achieving consistency with the human saccade process at the level of biological mechanism and producing saccade path results consistent with human eye movement data.
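For reference, one RMSProp parameter update looks as follows (the learning rate and decay values are illustrative defaults, not taken from the patent):

```python
import numpy as np

def rmsprop_update(param, grad, cache, lr=1e-3, decay=0.9, eps=1e-8):
    """Scale the gradient by a running root-mean-square of past gradients."""
    cache = decay * cache + (1.0 - decay) * grad ** 2
    param = param - lr * grad / (np.sqrt(cache) + eps)
    return param, cache

p, cache = 1.0, 0.0
for _ in range(10):                    # constant positive gradient
    p, cache = rmsprop_update(p, 0.5, cache)
```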
Step 150: predicting a saccade path according to the LSTM network.
Further, loading the LSTM network, and inputting the training samples into the LSTM network; obtaining an output feature vector of the LSTM network by using a forward propagation algorithm; and inputting the output characteristic vector and the truth value information into the LSTM network, and obtaining the fixation point coordinate by using a forward propagation algorithm.
Specifically, the LSTM network is loaded, the prepared sample pictures are input into it, and the output feature vector of the LSTM network is computed with a forward propagation algorithm. The feature vector and the sample truth values are then input into the LSTM network, and the fixation point coordinates are obtained with a forward propagation algorithm.
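Prediction is thus a decoding loop: each predicted fixation is fed back as the next input until 8 points are produced. A sketch with the per-step forward pass abstracted into `step_fn` (an assumed interface wrapping the trained LSTM plus the deep output layer):

```python
import numpy as np

def predict_scanpath(alpha, step_fn, h0, c0, n_points=8):
    """Decode a saccade path of n_points (x, y) fixations, feeding each
    prediction back in as the previous coordinate."""
    h, c = h0, c0
    y_prev = np.zeros(2)               # initial coordinate (assumption)
    path = []
    for _ in range(n_points):
        y_prev, h, c = step_fn(alpha, y_prev, h, c)
        path.append(y_prev)
    return np.array(path)

def dummy_step(alpha, y_prev, h, c):   # stand-in for the trained network
    return y_prev + 0.1, h, c

path = predict_scanpath(None, dummy_step, 0.0, 0.0)
```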
Example 2
The effect of the present invention will be further described with reference to simulation experiments.
1. Simulation conditions are as follows:
In the simulation experiments of the invention, the computer system is Ubuntu 16.04, the machine learning framework is TensorFlow version 1.1.0, and the Python version is 2.7. The embedding matrix is of size V × M, where V is adjusted for each data set and M = 512; C = 16, i.e., the flattened coordinate sequence of 16 numbers represents 8 fixation points.
2. Simulation content:
In the simulation experiments, the picture names are mapped to Arabic numerals to form a dictionary; for each data set, the training set and test set pictures are selected by number, and the corresponding eye movement data is processed to obtain labels. The LSTM network is trained on these samples with the gradient descent optimization algorithm RMSProp, and training stops when the cost of the LSTM network converges. The trained LSTM network is then used to estimate the saccade path of an image and is tested on the test set samples, of which each data set has about 100.
3. Simulation result analysis:
the estimated saccade path includes 8 fixation point coordinates. The evaluation indexes of the method comprise three indexes: HD (Hausdorff distance), MMD (the mean minimum distance), SS (sequence score), wherein the first two indices are used to measure the similarity between two sequences, with smaller distances representing more similar sequences; the SS describes the sequences from several angles of the gaze point position, the direction and distance of gaze point movement, and the order of panning, the closer the value is to 1, the higher the degree of similarity of the sequences.
On HD and MMD, the saccade path estimated by the model scores lower than the classical algorithms and is close to the curve computed from the human-eye ground truth; on SS, it scores higher than the classical algorithms and is closer to the ground truth, indicating better performance.
Example 3
Based on the same inventive concept as the glance path prediction method based on machine learning in the foregoing embodiment, the present invention further provides a glance path prediction apparatus based on machine learning, as shown in fig. 3, including:
a first obtaining unit, configured to obtain an image dataset to be processed, where each image information in the image dataset has corresponding true value information;
a first production unit, configured to produce a training sample of the image data set according to the truth value information;
a second obtaining unit configured to obtain image feature representation information of the image information based on the image information;
the first construction unit is used for constructing and training an LSTM network according to the image feature representation information and the eye movement data samples;
a first prediction unit to predict a scan path according to the LSTM network.
Further, the apparatus further comprises:
a third obtaining unit, configured to process the truth value information to obtain eye movement data information of N observers;
a first processing unit for performing boundary processing on the eye movement data of the N observers;
a first normalization unit, configured to normalize the eye movement data of the N observers after the boundary processing;
a first merging unit, configured to merge the eye movement data of the N observers to obtain the training sample.
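The sample-preparation path implemented by these units — boundary processing, normalization, and merging — can be sketched as below, assuming pixel fixation coordinates and simple concatenation as the merging rule (the exact merging rule is not specified here):

```python
import numpy as np

def make_training_sample(observers, width, height):
    """Boundary-clamp each observer's fixation coordinates to the image,
    normalize them to [0, 1], and merge the N observers into one sample.
    Concatenation is an assumed merging rule for illustration."""
    processed = []
    for fixations in observers:                    # each: (num_points, 2) in pixels
        f = np.asarray(fixations, dtype=float)
        f[:, 0] = np.clip(f[:, 0], 0, width - 1)   # boundary processing
        f[:, 1] = np.clip(f[:, 1], 0, height - 1)
        f[:, 0] /= width - 1                       # normalization to [0, 1]
        f[:, 1] /= height - 1
        processed.append(f)
    return np.concatenate(processed)               # merge the N observers

# Two observers, with some fixations recorded outside the 640 x 480 image.
sample = make_training_sample(
    [[(10, 20), (700, 50)], [(-5, 30), (100, 900)]], width=640, height=480)
print(sample.shape)   # (4, 2), all values in [0, 1]
```

Normalizing after clamping keeps out-of-bounds recordings (a common eye-tracker artifact) from producing coordinates outside the unit square.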
Further, the apparatus further comprises:
a fourth obtaining unit, configured to obtain a training set and a test set according to the image data set;
a first cropping unit for cropping the training set image information to a standard size;
a second construction unit for constructing a convolutional neural network and loading the trained model parameters;
a first output unit configured to take the image information as the input of the convolutional neural network and output image feature representation information of the image information.
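The cropping and feature-extraction steps handled by these units can be sketched as follows. The center crop to an assumed standard size of 224 × 224 is illustrative, and the "CNN" here is a trivial random-projection stand-in for the pretrained convolutional network whose parameters would actually be loaded:

```python
import numpy as np

def center_crop(image, size=224):
    """Crop an H x W x 3 array to the size x size input expected by the
    feature extractor (224 is an assumed standard size)."""
    h, w = image.shape[:2]
    top, left = (h - size) // 2, (w - size) // 2
    return image[top:top + size, left:left + size]

rng = np.random.default_rng(0)
proj = rng.standard_normal((3, 64)) * 0.1   # stand-in "CNN" parameters

def extract_features(image):
    """Crude 64-dim feature: mean colour of the crop through a fixed
    projection. A real system would run a pretrained CNN here."""
    patch = center_crop(image).astype(float) / 255.0
    return patch.reshape(-1, 3).mean(axis=0) @ proj

img = rng.integers(0, 256, size=(480, 640, 3))
print(extract_features(img).shape)   # (64,)
```

Only the cropped training-set images pass through this step; the resulting feature vectors become the LSTM network's input.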
Further, the apparatus further comprises:
a fifth obtaining unit, configured to obtain coordinates of the LSTM network, and define a corresponding weight matrix according to the coordinates;
a first input unit configured to input the image feature representation information and a weight matrix corresponding to the coordinates as an LSTM network;
a first operation unit for performing operations of an input gate, a forgetting gate, and an output gate on the input using a forward propagation method;
a first decoding unit to decode the LSTM network output according to a deep output layer;
a first training unit for inputting the image feature representation information to the LSTM network, and training the LSTM network using a back propagation algorithm.
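The gate operations performed by the first operation unit follow the standard LSTM forward equations, which can be written out directly (illustrative sizes; `W`, `U`, `b` stand in for the weight matrices defined from the coordinates):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, U, b):
    """One LSTM step: input gate i, forget gate f, output gate o, and
    candidate cell value g, computed from the current input x and the
    previous hidden state h (standard LSTM forward equations)."""
    z = x @ W + h @ U + b                 # all four gate pre-activations
    i, f, o, g = np.split(z, 4)
    i, f, o, g = sigmoid(i), sigmoid(f), sigmoid(o), np.tanh(g)
    c_new = f * c + i * g                 # forget old memory, write new
    h_new = o * np.tanh(c_new)            # expose gated memory as output
    return h_new, c_new

D, H = 8, 4                               # invented input and hidden sizes
rng = np.random.default_rng(1)
W = rng.standard_normal((D, 4 * H)) * 0.1
U = rng.standard_normal((H, 4 * H)) * 0.1
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
for x in rng.standard_normal((5, D)):     # run five timesteps
    h, c = lstm_step(x, h, c, W, U, b)
print(h.shape, c.shape)
```

In training, these same equations are unrolled over the timesteps and differentiated by the backpropagation algorithm referenced by the first training unit.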
Further, the apparatus further comprises:
a second input unit, configured to load the LSTM network and input the training samples into the LSTM network;
a sixth obtaining unit, configured to obtain an output feature vector of the LSTM network using a forward propagation algorithm;
a seventh obtaining unit, configured to input the output feature vector and the truth value information into the LSTM network, and obtain the fixation point coordinates using a forward propagation algorithm.
The various modifications and specific embodiments of the machine learning based glance path prediction method in Embodiment 1 of FIG. 1 are equally applicable to the machine learning based glance path prediction apparatus of this embodiment. From the foregoing detailed description of the method, those skilled in the art will clearly understand how the apparatus is implemented, so for brevity of description it is not described in detail here.
Example 4
Based on the same inventive concept as the machine learning based glance path prediction method in the previous embodiment, the present invention further provides a machine learning based glance path prediction apparatus comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the program, when executed by the processor, implementing the steps of any one of the above-mentioned machine learning based glance path prediction methods.
In FIG. 4, a bus architecture (represented by bus 300) is shown. Bus 300 may include any number of interconnected buses and bridges, linking together various circuits including one or more processors (represented by processor 302) and memory (represented by memory 304). Bus 300 may also link together various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore are not described further herein. A bus interface 306 provides an interface between bus 300 and a receiver 301 and a transmitter 303. The receiver 301 and the transmitter 303 may be the same element, i.e., a transceiver, providing a means for communicating with various other apparatus over a transmission medium.
The processor 302 is responsible for managing the bus 300 and general processing, and the memory 304 may be used for storing information used by the processor 302 in performing operations.
One or more technical solutions in the embodiments of the present application have at least one or more of the following technical effects:
1. According to the method and device for predicting the saccade path based on machine learning in the embodiments of the present application, an image data set to be processed is obtained, wherein each piece of image information in the image data set has corresponding truth value information; training samples of the image data set are made according to the truth value information; image feature representation information of the image information is obtained according to the image information; an LSTM network is constructed and trained according to the image feature representation information and the eye movement data samples; and a scanning path is predicted according to the LSTM network. This solves the technical problems in the prior art that the predicted fixation points depend too heavily on a static saliency map and that saccade path prediction in natural scene pictures is insufficiently accurate; it eliminates the model's dependence on the saliency map, takes the temporal order among fixation points into account, and achieves good results on several public data sets.
2. The image features are extracted by a convolutional neural network. A convolutional neural network has a strong capability for representation learning and can learn higher-level features through a layer-by-layer learning strategy, which overcomes the shortcomings of the manual selection or combined multi-dimensional feature selection methods in the prior art and offers better universality and extensibility.
3. The invention estimates the saccade path by constructing an LSTM network, whose structure is well suited to processing time sequences. The LSTM network is trained on the combination of the currently input image region and the fixation points generated so far, simulating the saccade stage of human visual processing and the transmission and prediction of information on the visual cortex. This gives the model biological consistency with the human saccade process and yields saccade path results consistent with human eye movement data.
4. According to the invention, by introducing an attention mechanism into the network, each output step allows the decoder to attend to a different part of the image; the trained model can thus learn which part of the image should be attended to, guiding the decoding of the network output.
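The attention mechanism described above can be illustrated with a Bahdanau-style additive attention sketch (all shapes and parameters are invented for illustration; the exact attention form used is not specified here): at each decoding step, the image regions are scored against the decoder state and softmax-weighted into a context vector.

```python
import numpy as np

def additive_attention(regions, query, Wa, Ua, va):
    """Bahdanau-style additive attention: score each image region against
    the decoder state (query), softmax the scores over regions, and
    return the weighted context vector together with the weights."""
    scores = np.tanh(regions @ Wa + query @ Ua) @ va   # (num_regions,)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                           # softmax over regions
    return weights @ regions, weights                  # context, attention map

R, D, A = 6, 8, 5            # 6 regions, 8-dim features, 5-dim attention space
rng = np.random.default_rng(2)
ctx, w = additive_attention(
    rng.standard_normal((R, D)), rng.standard_normal(D),
    rng.standard_normal((D, A)), rng.standard_normal((D, A)),
    rng.standard_normal(A))
print(ctx.shape, round(float(w.sum()), 6))   # (8,) 1.0
```

The attention weights `w` are exactly the "which part of the image to attend to" signal: inspecting them per output step shows where the model looked before emitting each fixation point.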
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable information processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable information processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable information processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable information processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (5)

1. A method for machine learning based glance path prediction, the method comprising:
obtaining an image data set to be processed, wherein each image information in the image data set has corresponding true value information;
according to the truth value information, making a training sample of the image data set;
according to the image information, obtaining image feature representation information of the image information;
constructing and training an LSTM network according to the image characteristic representation information and the eye movement data sample;
predicting a scanning path according to the LSTM network;
the constructing and training of the LSTM network specifically comprises the following steps:
obtaining the coordinates of the LSTM network, and defining a corresponding weight matrix according to a total weight matrix and a true value coordinate corresponding to the coordinates;
taking the image feature representation information and a weight matrix corresponding to the coordinates as the input of an LSTM network;
carrying out the operations of an input gate, a forgetting gate and an output gate on the input by using a forward propagation method;
decoding the LSTM network output according to a deep output layer;
inputting the image feature representation information into the LSTM network, and training the LSTM network by using a back propagation algorithm;
predicting a scanning path according to the LSTM network, which specifically comprises the following steps:
loading the LSTM network and inputting the training samples into the LSTM network;
obtaining an output feature vector of the LSTM network by using a forward propagation algorithm;
and inputting the output feature vector and the truth value information into the LSTM network, and obtaining the fixation point coordinates by using a forward propagation algorithm.
2. The method of claim 1, wherein the making training samples of the image dataset based on the truth information comprises:
processing the truth value information to obtain eye movement data information of N observers;
performing boundary processing on the eye movement data of the N observers;
normalizing the eye movement data of the N observers after the boundary processing;
and combining the eye movement data of the N observers to obtain the training sample, wherein N is a positive integer.
3. The method according to claim 1, wherein the obtaining image feature representation information of the image information specifically includes:
obtaining a training set and a test set according to the image data set;
cutting the image information of the training set into a standard size;
constructing a convolutional neural network, and loading the trained model parameters;
and taking the image information as the input of a convolutional neural network, and outputting image feature representation information of the image information.
4. A machine learning based glance path prediction apparatus, comprising:
a first obtaining unit, configured to obtain an image dataset to be processed, where each image information in the image dataset has corresponding true value information;
a first production unit, configured to produce a training sample of the image data set according to the truth value information;
a second obtaining unit configured to obtain image feature representation information of the image information based on the image information;
the first construction unit is used for constructing and training an LSTM network according to the image feature representation information and the eye movement data samples;
a first prediction unit for predicting a scan path according to the LSTM network;
wherein the apparatus further comprises:
a fifth obtaining unit, configured to obtain a coordinate of the LSTM network, and define a corresponding weight matrix according to a total weight matrix and a true value coordinate corresponding to the coordinate;
a first input unit configured to input the image feature representation information and a weight matrix corresponding to the coordinates as an LSTM network;
a first operation unit for performing operations of an input gate, a forgetting gate, and an output gate on the input using a forward propagation method;
a first decoding unit to decode the LSTM network output according to a deep output layer;
a first training unit, configured to input the image feature representation information to the LSTM network, and train the LSTM network using a back propagation algorithm;
a second input unit, configured to load the LSTM network and input the training samples into the LSTM network;
a sixth obtaining unit, configured to obtain an output feature vector of the LSTM network using a forward propagation algorithm;
a seventh obtaining unit, configured to input the output feature vector and the truth value information into the LSTM network, and obtain the fixation point coordinates using a forward propagation algorithm.
5. A machine learning based glance path prediction apparatus comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor when executing the program implements the steps of:
obtaining an image data set to be processed, wherein each image information in the image data set has corresponding true value information;
according to the truth value information, making a training sample of the image data set;
according to the image information, obtaining image feature representation information of the image information;
constructing and training an LSTM network according to the image characteristic representation information and the eye movement data sample;
predicting a scanning path according to the LSTM network;
the constructing and training of the LSTM network specifically comprises the following steps:
obtaining the coordinates of the LSTM network, and defining a corresponding weight matrix according to a total weight matrix and a true value coordinate corresponding to the coordinates;
taking the image feature representation information and a weight matrix corresponding to the coordinates as the input of an LSTM network;
carrying out the operations of an input gate, a forgetting gate and an output gate on the input by using a forward propagation method;
decoding the LSTM network output according to a deep output layer;
inputting the image feature representation information into the LSTM network, and training the LSTM network by using a back propagation algorithm;
predicting a scanning path according to the LSTM network, which specifically comprises the following steps:
loading the LSTM network and inputting the training samples into the LSTM network;
obtaining an output feature vector of the LSTM network by using a forward propagation algorithm;
and inputting the output feature vector and the truth value information into the LSTM network, and obtaining the fixation point coordinates by using a forward propagation algorithm.
CN201810332835.5A 2018-04-13 2018-04-13 Glance path prediction method and device based on machine learning Active CN109447096B (en)

Publications (2)

Publication Number Publication Date
CN109447096A CN109447096A (en) 2019-03-08
CN109447096B true CN109447096B (en) 2022-05-06


Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110245660B (en) * 2019-06-03 2022-04-22 西北工业大学 Webpage glance path prediction method based on saliency feature fusion
CN110298303B (en) * 2019-06-27 2022-03-25 西北工业大学 Crowd identification method based on long-time memory network glance path learning
CN111461974B (en) * 2020-02-17 2023-04-25 天津大学 Image scanning path control method based on LSTM model from coarse to fine
CN111723707B (en) * 2020-06-09 2023-10-17 天津大学 Gaze point estimation method and device based on visual saliency
CN113313123B (en) * 2021-06-11 2024-04-02 西北工业大学 Glance path prediction method based on semantic inference

Citations (4)

CN106491129A (en) * 2016-10-10 2017-03-15 安徽大学 A kind of Human bodys' response system and method based on EOG
CN106959749A (en) * 2017-02-20 2017-07-18 浙江工业大学 A kind of vision attention behavior cooperating type method for visualizing and system based on eye-tracking data
CN107515466A (en) * 2017-08-14 2017-12-26 华为技术有限公司 A kind of eyeball tracking system and eyeball tracking method
CN107644401A (en) * 2017-08-11 2018-01-30 西安电子科技大学 Multiplicative noise minimizing technology based on deep neural network

Family Cites Families (6)

WO2010062883A1 (en) * 2008-11-26 2010-06-03 Bioptigen, Inc. Methods, systems and computer program products for biometric identification by tissue imaging using optical coherence tomography (oct)
US9916538B2 (en) * 2012-09-15 2018-03-13 Z Advanced Computing, Inc. Method and system for feature detection
WO2017025487A1 (en) * 2015-08-07 2017-02-16 SensoMotoric Instruments Gesellschaft für innovative Sensorik mbH System and method for displaying a stream of images
CN105678735A (en) * 2015-10-13 2016-06-15 中国人民解放军陆军军官学院 Target salience detection method for fog images
CN106970615B (en) * 2017-03-21 2019-10-22 西北工业大学 A kind of real-time online paths planning method of deeply study
CN107808132A (en) * 2017-10-23 2018-03-16 重庆邮电大学 A kind of scene image classification method for merging topic model

Non-Patent Citations (4)

Daniel Simon et al., "Automatic Scanpath Generation with Deep Recurrent Neural Networks," Proceedings of the ACM Symposium, Jul. 22, 2016, p. 130.
D. Bahdanau et al., "Neural Machine Translation by Jointly Learning to Align and Translate," Computer Science, May 19, 2016, pp. 1-15.
Thuyen Ngo et al., "Saccade gaze prediction using a recurrent neural network," 2017 IEEE International Conference on Image Processing (ICIP), Sep. 1, 2017, pp. 3436-3437 and FIG. 2.
Yan Yanmei, "An Eye-Movement Study of Repeated Scanning Paths for Pictures," China Masters' Theses Full-text Database (Philosophy and Humanities), No. 12, Dec. 15, 2006, F102-15.


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant