CN116091792A

CN116091792A - Method, system, terminal and medium for constructing visual attention prediction model

Info

Publication number: CN116091792A
Application number: CN202310007698.9A
Authority: CN
Inventors: 段会展; 刘志
Original assignee: University of Shanghai for Science and Technology
Current assignee: University of Shanghai for Science and Technology
Priority date: 2023-01-04
Filing date: 2023-01-04
Publication date: 2023-05-09

Abstract

The invention provides a method for constructing a visual attention prediction model, which is oriented to autism population and comprises the following steps: constructing a visual attention prediction model based on atypical salient region enhancement; pre-training a vision attention prediction model based on atypical significant region enhancement by adopting a known eye movement data set, and correcting the model by adopting an eye movement data set of an autism group to finish end-to-end training of the vision attention prediction model based on atypical significant region enhancement; and testing the trained visual attention prediction model based on atypical salient region enhancement by using a test image in the known eye movement data set, and constructing a final visual attention prediction model. Meanwhile, a corresponding construction system, an application method, a terminal and a medium are provided. The method starts from the special visual preference of the autism patient, and has the characteristics of high prediction efficiency, low cost, easy realization, very flexible deployment and the like.

Description

Method, system, terminal and medium for constructing visual attention prediction model

Technical Field

The invention relates to the technical field of visual attention prediction, in particular to a method, a system, a terminal and a medium for constructing a visual attention prediction model for autism groups.

Background

The ability of the human visual system to quickly select and focus on important areas in visual stimuli enables humans to selectively process large amounts of information into the field of view, thereby efficiently receiving and processing primary information while ignoring extraneous information, a selective mechanism called the visual attention mechanism. Visual attention prediction (also called visual saliency prediction, gaze point prediction) is a technique that simulates the visual attention mechanism of the human eye, and the final computed saliency map can quantitatively represent the attention distribution, wherein the higher the brightness of a region is, the greater the probability that the region attracts the attention of the human eye. It has very important applications in many visually related tasks such as object segmentation, object tracking, image compression and video compression.

Autism spectrum disorder is a genetic neurological disorder, and there is a computational model and neuroimaging evidence that autism patients exhibit atypical visual attention behavior in the face of visual stimuli, unlike normal controls. In short, when a scene is observed, the normal control group tends to focus on objects having higher-order semantic attributes, such as faces, texts, and the like; autistic patients are often attracted by background areas and areas with low-order properties, which are called atypical significant areas. The existing special visual attention prediction method for autism is mostly inspired by a general visual attention prediction method for a conventional group, and the atypical gazing behavior and special visual preference of the autism patient, which are obviously different from those of a normal control group, are ignored, and the method simply shifts on an eye movement data set of the autism patient, lacks pertinence and has poor prediction performance on a gazing point.

Disclosure of Invention

Aiming at the defects in the prior art, the invention provides a method, a system, a terminal and a medium for constructing a visual attention prediction model for an autism group.

According to an aspect of the present invention, there is provided a method of constructing a visual attention prediction model, including:

constructing a visual attention prediction model based on atypical salient region enhancement;

pre-training the vision attention prediction model based on atypical salient region enhancement by adopting a known eye movement data set, and correcting the vision attention prediction model based on atypical salient region enhancement by adopting an eye movement data set of an autism population to finish end-to-end training of the vision attention prediction model based on atypical salient region enhancement;

and testing the trained visual attention prediction model based on atypical salient region enhancement by using a test image in a known eye movement data set, and constructing a final visual attention prediction model.

Optionally, the constructing the visual attention prediction model based on atypical salient region enhancement includes:

constructing a feature extraction network layer for extracting features of an input image and outputting a multi-order feature map;

Constructing a multi-scale enhancement network layer, which is used for carrying out multi-scale enhancement on the highest-order feature diagram in the multi-order feature diagram so as to improve the detection capability of salient regions with different scales;

constructing an atypical salient region enhancement network layer, wherein the atypical salient region enhancement network layer is used for carrying out residual fusion on the multi-order feature map from top to bottom by taking the highest-order feature map as an initial prediction result to obtain an enhanced atypical salient region feature map;

constructing a global semantic stream network layer, which is used for extracting context semantic information of the highest-order feature map from the angles of space and channels respectively to obtain a global semantic stream, introducing the global semantic stream into the atypical salient region enhancement network layer, and guiding residual fusion, and adaptively supplementing diluted semantic information at the same time;

and constructing a saliency map reading network layer, and compressing and normalizing the reinforced atypical salient region feature map along the channel dimension to obtain a visual attention prediction result.

Optionally, the feature extraction network layer constructs a backbone network of the feature extraction network layer by adopting a full convolution form of a pre-trained object recognition network based on deep learning, and is used for extracting features of the input image and outputting a multi-order feature map; wherein the feature extraction network layer comprises: convolution layer, pooling layer and ReLu activation layer.

Optionally, the multi-scale enhancement network layer explicitly introduces multi-scale information by adopting a plurality of parallel convolution layers with convolution kernels of different sizes, performs multi-scale enhancement on the highest-order feature map, and the enhancement result acts on the residual fusion process and is used for improving feature extraction capability of different scale significant regions.

Optionally, the atypical significant region enhances a network layer, including: a background enhancement network layer, a foreground enhancement network layer and a residual fusion network layer; and taking the highest-order feature map as an initial prediction result, carrying out residual fusion on the features in the multi-order feature map from top to bottom to obtain an enhanced atypical significant region feature map, wherein the method comprises the following steps:

inverting the higher-order features in the multi-order feature map by utilizing the background enhancement network layer to obtain background features, and normalizing the background features to obtain a background weight map;

using the foreground enhancement network layer to carry out foreground weighting on low-order features adjacent to the high-order features to obtain enhanced low-order features;

the enhanced low-order features are weighted and fused by the background weight graph through the residual fusion network layer to obtain residual features; performing self-adaptive fusion on the high-order features and the residual features to obtain a new prediction result;

And taking the new prediction result as a new high-order feature, and continuing residual fusion with an adjacent low-order feature until a final reinforced atypical significant region feature map is obtained.

Optionally, the global semantic stream network layer comprises a channel enhancement network layer and a spatial location enhancement network layer; extracting context semantic information of the feature map from the space and channel angles respectively to obtain a global semantic stream, wherein the method comprises the following steps:

the highest-order features in the feature map are transformed and compressed through a convolution layer to obtain network layer input features;

the channel enhancement network layer adopts global average pooling to obtain global priori, and carries out 1*1 convolution layer transformation and normalization to obtain a channel weighted graph of the network layer input characteristics, and carries out channel enhancement on the network layer input characteristics by utilizing the channel weighted graph to obtain channel enhancement characteristics;

the spatial position enhancement network layer adopts a self-attention mechanism to fully capture the correlation among pixels of the network layer input characteristics to obtain a spatial position weighted graph, and performs position enhancement on the network layer input characteristics by using the spatial position weighted graph to obtain position enhancement characteristics;

Fusing the channel enhancement features and the position enhancement features to obtain a global semantic stream;

the global semantic stream is introduced into the atypical salient region enhancement network layer, and the weight of the global semantic stream is adaptively adjusted to be used for adaptively supplementing global information in the residual fusion process.

Optionally, the saliency map readout network layer includes a 3*3 convolution layer and a sigmoid activation function.

Optionally, the pre-training the vision attention prediction model based on atypical salient region enhancement using a known eye movement data set and correcting the vision attention prediction model based on atypical salient region enhancement using an eye movement data set of an autism population, to complete end-to-end training of the vision attention prediction model based on atypical salient region enhancement, including:

pretraining the atypical salient region-based enhanced visual attention prediction model with the disclosed eye movement data sets sallicon and MIT1003, and correcting the atypical salient region-based enhanced visual attention prediction model with the eye movement data set Saliency4ASD of the autism population;

Setting initialization parameters of the visual attention prediction model based on atypical salient region enhancement;

determining a loss function of the atypical salient region-based enhanced visual attention prediction model;

determining relevant hyper-parameters in the atypical salient region-based enhanced visual attention prediction model;

end-to-end training of the atypical salient region-based enhanced visual attention prediction model is accomplished through the above steps.

Optionally, the pre-training the atypical salient region-based enhanced visual attention prediction model using the disclosed eye movement data sets sallicon and MIT1003 and the modifying the atypical salient region-based enhanced visual attention prediction model using the eye movement data set Saliency4ASD of the autism population includes:

acquiring a public eye movement data set SALICON and MIT1003 and an eye movement data set Saliency4ASD of an autism group, and clustering eye movement position sampling points of image data in the eye movement data set to generate a mat file containing a fixation point; normalizing the mat file, and converting the mat file to generate a fixation point density map serving as a truth map;

inputting images of the eye movement data sets SALICON and MIT1003 as models, using truth diagrams corresponding to the images of the eye movement data sets SALICON and MIT1003 as labels, training the visual attention prediction model based on atypical salient region enhancement in an end-to-end mode, and enabling the model to automatically learn a mapping relation between an original image and the truth diagrams to obtain feature distribution related to human eyes;

And (3) taking an image of the eye movement data set Saliency4ASD of the autism group as a model input, taking a truth diagram corresponding to the image of the eye movement data set Saliency4ASD of the autism group as a label, fine-tuning the vision attention prediction model based on the atypical salient region enhancement in an end-to-end mode, enabling the model to automatically learn a mapping relation between an original image and the truth diagram, obtaining eye movement characteristics of the autism group, and correcting the model.

Optionally, the setting the initialization parameters of the visual attention prediction model based on atypical significant region enhancement includes:

the visual attention prediction model based on atypical salient region enhancement comprises: a feature extraction network layer, a multi-scale enhancement network layer, an atypical salient region enhancement network layer, a global semantic stream network layer and a salient map reading network layer; wherein:

the feature extraction network layer adopts parameters obtained by pre-training the feature extraction network layer on an ImageNet data set as initialization parameters; the initial parameters of the other network layers are random initialization parameters.

Optionally, the determining the loss function of the visual attention prediction model based on atypical significant region enhancement includes:

The loss function employs a weighted linear combination of three significance performance assessment indicators KL, CC, NSS.

Optionally, the determining the relevant hyper-parameters in the visual attention prediction model based on atypical salient region enhancement includes:

in the pre-training process, a random gradient descent algorithm is adopted, and the initial learning rate is 10 ^-4 And 3 epochs per iteration drop by a factor of 10, the batch size is 10, and the pre-training process requires 20 epochs to iterate until the model converges.

Optionally, the testing the trained visual attention prediction model based on atypical salient region enhancement using the test images in the known eye movement dataset includes:

the trained visual attention prediction model based on atypical salient region enhancement was tested using benchmarks provided in the public datasets SALICON, MIT1003, and Saliency4ASD to evaluate the performance of the model.

According to another aspect of the present invention, there is provided a construction system of a visual attention prediction model, including:

a predictive model construction module for constructing a visual attention predictive model based on atypical salient region enhancement;

a model training module for pre-training the atypical salient region-based enhanced visual attention prediction model using a known eye movement data set and correcting the atypical salient region-based enhanced visual attention prediction model using an eye movement data set of an autism population to complete end-to-end training of the atypical salient region-based enhanced visual attention prediction model;

And the model test module is used for testing the trained visual attention prediction model based on atypical salient region enhancement by using known test images and evaluating the performance of the constructed model.

According to a third aspect of the present invention, there is provided a visual attention prediction method, wherein a visual attention prediction model constructed by a method or a system for constructing a visual attention prediction model as described above is used, and a visual attention prediction result is outputted by taking any one of images as an input of the model.

According to a fourth aspect of the present invention there is provided a computer terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor being operable to perform the method of any one of the preceding claims when executing the program.

According to a fifth aspect of the present invention there is provided a computer readable storage medium having stored thereon a computer program which when executed by a processor is operable to perform a method as claimed in any one of the preceding claims.

Due to the adoption of the technical scheme, compared with the prior art, the invention has at least one of the following beneficial effects:

According to the method, the system, the terminal and the medium for constructing the visual attention prediction model, the visual attention prediction model based on the atypical significant region enhancement is adopted, the atypical visual attention model and the unique visual preference specific to the autism patient are fully considered, and the excellent performance is obtained on the Saliency4ASD benchmark on the autism eye movement data set. The implementation of the performance mainly depends on the implementation of atypical salient region enhancement technology, and the adopted cross-stage background enhancement operation effectively utilizes the property of the feature extraction network, so that the specific visual characteristics of the autism patient can be more fully learned under the supervision of a truth diagram, and excellent performance is achieved.

According to the method, the system, the terminal and the medium for constructing the visual attention prediction model, disclosed by the invention, the residual fusion of each order is guided by utilizing the global semantic flow technology, the adverse effect of noise contained in low-order characteristics on the model performance is reduced, and the accuracy and the robustness of the visual attention prediction model are improved.

The method, the system, the terminal and the medium for constructing the visual attention prediction model are high in efficiency, low in cost, easy to realize, flexible and capable of being deployed on a backbone network with less parameter quantity (considering efficiency) or a backbone network with better performance (considering performance) according to actual needs.

Drawings

Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, given with reference to the accompanying drawings in which:

FIG. 1 is a flowchart showing a visual attention prediction method in accordance with a preferred embodiment of the present invention.

FIG. 2 is a schematic diagram showing the components of a visual attention prediction system according to a preferred embodiment of the present invention.

Fig. 3 is a partial subjective experimental result obtained on a public dataset in the field of vision attention prediction for autism according to the vision attention prediction method and system provided in a preferred embodiment of the present invention.

Detailed Description

The following describes embodiments of the present invention in detail: the embodiment is implemented on the premise of the technical scheme of the invention, and detailed implementation modes and specific operation processes are given. It should be noted that variations and modifications can be made by those skilled in the art without departing from the spirit of the invention, which falls within the scope of the invention.

The embodiment of the invention provides a method for constructing a visual attention prediction model, which is oriented to an autism group and realizes the construction of the visual attention prediction model based on atypical significant region enhancement based on special visual preference of an autism patient.

As shown in fig. 1, the method for constructing the visual attention prediction model provided in this embodiment may include:

s1, constructing a visual attention prediction model based on atypical salient region enhancement;

s2, pre-training a vision attention prediction model based on atypical significant region enhancement by adopting a known eye movement data set, and correcting the vision attention prediction model based on atypical significant region enhancement by adopting an eye movement data set of an autism group to finish end-to-end training of the vision attention prediction model based on atypical significant region enhancement;

and S3, testing the trained visual attention prediction model based on atypical salient region enhancement by adopting a test image in a known eye movement data set, and constructing a final visual attention prediction model.

In a preferred embodiment of S1, constructing the visual attention prediction model based on atypical salient region enhancement may include:

constructing a visual attention prediction model based on atypical salient region enhancement, wherein the model mainly comprises a feature extraction network layer, a multi-scale enhancement network layer, an atypical salient region enhancement network layer, a global semantic stream network layer and a salient map reading network layer, and performing end-to-end visual attention prediction; wherein:

S101, constructing a feature extraction network layer for extracting features of an input image and outputting a multi-order feature map. The network layer uses a pre-trained object recognition network based on deep learning (the back full-connection layer is removed, and only the front full-convolution part is reserved) as a main network to extract characteristics, and the network layer mainly comprises a convolution layer, a pooling layer and a ReLu activation layer, and is input into an image and output into a characteristic map. For ease of illustration, the outputs of the five convolution blocks of the backbone are respectively represented as: f (F) ⁱ (i.epsilon. {1,2,3,4,5 }). In addition, because the atypical significant region enhancement network layer and the global semantic stream network layer provided by the embodiment of the invention can be flexibly deployed on different backbones, the visual attention prediction model can select VGG with less parameter quantity or DenseNet with better performance as the backbone network according to actual needs;

s102, constructing a multi-scale enhancement network layer, which is used for carrying out multi-scale enhancement on the highest-order feature map in the multi-order feature map so as to improve the detection capability of the saliency areas with different scales. The explicit introduction of multi-scale enhancement modules in the network layer can enhance the detection capability of the model for different scale salient regions. In a specific application example, a similar acceptance structure is used, i.e. a plurality of parallel convolution layers with convolution kernels of different sizes are used to introduce multi-scale information. To control the parameters, the network layer is implemented using hole convolution with different hole rates. To the highest order feature F ⁵ Outputting multi-scale enhanced features F as multi-scale enhanced network layer inputs ⁵ '；

And S103, constructing an atypical salient region enhancement network layer, wherein the atypical salient region enhancement network layer is used for carrying out residual fusion on the multi-order feature map from top to bottom by taking the highest-order feature map as an initial prediction result to obtain an enhanced atypical salient region feature map. The network layer mainly comprises a foreground enhancement network layer of low-order characteristics, a background enhancement network layer of a cross-order and a residual fusion network layer. In particular, since the feature extraction network layer as the backbone network is modified from the pre-trained object recognition network, the output of the backbone network from the shallow layer to the deep layer is increasingly focused on having higher-order languageOn the object of the sense attribute. For the conventional population, in most cases, objects (e.g., faces, texts, etc.) with high-order semantic attributes are visually significant regions, so the semantic bias nature of the features output by the backbone network promotes the improvement of the performance of the general visual attention prediction model for the conventional population. However, the visual attention of the autism patient is dominated by atypical significance, the attention of the group to areas without social properties such as background areas is obviously higher than that of the conventional group, so that a large gap exists between the semantic bias property of the output characteristics of the backbone network and the atypical visual attention of the autism patient, and the conventional direct fusion method is poor in performance on the visual attention prediction task facing the autism group. In the step, the highest-order output of the backbone network is taken as an initial prediction result, and residual fusion is carried out from top to bottom to strengthen atypical significant regions. The background enhancement network layer acts between adjacent two-order features. In one embodiment, F ⁵ ' and F ⁴ The following are examples:

first, F is to ⁵ ' upsampling to low-order features F adjacent thereto ⁴ The same size gives

Then, the background enhanced network layer pair +.>

Performing inversion to obtain background characteristics Bf ⁵ Further normalizing to obtain a background weight graph BAM ⁵ ：

BAM ⁵ ＝σ(Bf ⁵ )

Where σ represents the sigmoid activation function.

And then, the residual fusion network layer uses a background weight graph to carry out weighted fusion on adjacent low-order features, and the obtained residual features comprise atypical significant regions which are not detected in initial prediction but are successfully detected by the adjacent low-order features.

It is noted that the low-order features contain more noise, and the adverse effect of noise is more obvious in this residual fusion manner. Therefore, before background weighting, the foreground enhancement network layer performs foreground weighting on the low-order features, enhances the foreground significant region and reduces the influence of noise contained in the low-order features, so as to obtain enhanced low-order features:

FAM ⁴ ＝σ(Conv(F ⁴ ))

wherein, FAM ⁴ Conv (-) represents a convolution operation for the resulting enhanced low-order feature.

Finally, the residual fusion network layer is used for fusing the original high-order characteristics (namely the original prediction result

) Carrying out self-adaptive fusion with residual error characteristics to obtain a new prediction result F ⁴ '：

Wherein, the liquid crystal display device comprises a liquid crystal display device,

to F ⁵ Processed by the multi-scale enhanced network layer described in S202 and up-sampled, the ∈A is obtained>

And

representing pixel-by-pixel dot multiplication, pixel-by-pixel addition, and stitching operations along the channel dimension, respectively, T represents 3 consecutive convolution-batch normalization-ReLu activation operations.

Thereafter, F ⁴ ' as a new prediction result, continuing with the neighboring low-order features F ³ Repeating the above operation until the final prediction result is obtained.

Therefore, compared with a direct fusion mode commonly used in a general visual attention prediction model for a conventional group, the atypical significant region enhancement module provided by the embodiment of the invention can make the model pay more attention to significant regions which are not detected originally, and further the atypical significant regions are gradually detected in a residual fusion process from top to bottom, so that the obtained prediction result is more complete;

s104, constructing a global semantic stream network layer, which is used for extracting context semantic information of the highest-order feature map from the space and channel angles respectively to obtain a global semantic stream, introducing the global semantic stream into an atypical salient region enhancement network layer, and guiding residual fusion, and adaptively supplementing diluted semantic information. Considering that in the feature fusion process from top to bottom, semantic information from a deep layer is continuously diluted, and the adverse effect of noise of a shallow layer is larger and larger, the step constructs a global semantic stream network layer, extracts context semantic information from the angles of space and channels respectively, and is further used for guiding residual fusion in a feature fusion stage, and adaptively supplementing diluted semantic information. The network layer mainly comprises a channel enhancement network layer and a spatial position enhancement network layer. First, the highest order feature F ⁵ The number of compressed channels is converted through a convolution layer to obtain

As an input to the network layer. For the channel enhancement network layer, global prior is obtained by global average pooling, and a channel weighting map CAM is obtained through 1*1 convolution layer transformation and normalization ⁵ And based on channel weighting maps CAM ⁵ For original characteristics->

Channel enhancement to give the output channel enhancement feature +.>

Wherein GAP stands for global average pooling operation,

representing a per-channel multiplication operation.

For the spatial location enhancement network layer, a self-attention (self-attention) mechanism is used to fully capture the correlation between pixels to obtain a spatial location weighted graph SM ⁵ And based on a spatial position weighted graph SM ⁵ Enhanced location enhancement features

Wherein Q, K, V are all original features

Is obtained through convolution transformation.

Finally, the two parts of characteristics weighted by the channel and the space position are fused to obtain a global semantic stream

The global semantic stream is used for guiding residual fusion operation of each order, and the corresponding fusion operation in S203 is modified as follows:

to->

Up-sampling, wherein b is a learnable parameter used for adaptively adjusting the weight of the global semantic stream so as to adaptively supplement global information;

s105, constructing a saliency map reading network layer, and compressing and normalizing the reinforced atypical salient region feature map along the channel dimension to obtain a visual attention prediction result. The network layer consists of a 3*3 convolution layer and a sigmoid activation function. The function is to compress and normalize the output of the above module along the channel dimension to obtain the final prediction result.

In a preferred embodiment of S2, pre-training the visual attention prediction model based on atypical salient region enhancement using the known eye movement data set and modifying the visual attention prediction model based on atypical salient region enhancement using the eye movement data set of the autism population to complete end-to-end training of the visual attention prediction model based on atypical salient region enhancement may comprise:

s201, the constructed visual attention prediction model based on atypical salient region enhancement is pre-trained using public data sets SALICON and MIT1003, and then fine-tuned (corrected) on the autism-specific eye movement data set Saliency4ASD data set. Wherein:

acquiring an eye movement data set SalLICON and MIT1003 and an eye movement data set Saliency4ASD of an autism group, clustering eye movement position sampling points of image data in the eye movement data set to generate a mat file containing a fixation point, performing normalization processing on the mat file for convenient processing, converting the mat file to generate a fixation point density map, taking the mat file as a truth map, and training a constructed model in a subsequent step;

taking images of eye movement data sets SALICON and MIT1003 as input, taking a truth diagram corresponding to the images as a label, and training the proposed visual attention prediction model based on atypical salient region enhancement in an end-to-end mode to enable the model to automatically learn a mapping relation between an original image and the truth diagram so as to obtain feature distribution related to human eyes; inputting an image of an eye movement data set Saliency4ASD of the autism group as a model, using a truth diagram corresponding to the image of the eye movement data set Saliency4ASD of the autism group as a label, fine-tuning a visual attention prediction model based on atypical salient region enhancement in an end-to-end mode, enabling the model to automatically learn a mapping relation between an original image and the truth diagram, obtaining eye movement characteristics of the autism group, and correcting the model; specifically:

SALICON is a generic eye movement dataset that simulates the gaze point of the human eye using mouse clicks, which is also the largest published dataset in the field of visual attention prediction, so the proposed atypical visual attention prediction model is first pre-trained on this dataset to provide good initialization parameters for subsequent training. The MIT1003 data set is a data set acquired and constructed using an eye tracker, on which a model pre-trained on SALICON is continued to be trained so that the model can learn the feature distribution associated with the human eye. Finally, because the Saliency4ASD data set is smaller in size, in order to prevent overfitting, the model is initially trained by using the eye movement data sets SALICON and MIT1003, and then the model is finely tuned by using the Saliency4ASD data set, so that the model fully learns the eye movement characteristics of the autism population;

s202, setting initialization parameters for the model. The backbone network uses parameters obtained by pre-training the backbone network on an ImageNet data set as initialization parameters, and the initialization parameters of other network layers are initialized randomly;

s203, determining a loss function. The model training employs a loss function that is a weighted linear combination of three of the significance performance metrics KL, CC, NSS. The weight of each index is determined according to the experimental result, so that the influence of each index on the performance of the model is balanced better; for the KL index, the better the performance of the model is, the smaller the value of the KL index is, so that the coefficient of the KL index in the loss function is a negative number, and the coefficients of the other indexes are positive numbers;

S204, determining relevant super parameters in the model. The gradient descent algorithm used in the training process is random gradient descent, and is initially learnedThe learning rate is 10 ^-4 And 3 epochs per iteration drops by a factor of 10, the batch size is 10, training typically requires 20 epochs to iterate until the model converges.

In a preferred embodiment of S3, the trained visual attention prediction model based on atypical salient region enhancement is tested using the test image in the known eye movement dataset, and the constructing to obtain the final visual attention prediction model may include:

s301, testing the performance of the model proposed by the invention by using 3 common public data sets in the field of visual attention prediction, namely SALICON, MIT1003 and Saliency4ASD. These 3 published data sets all provide benchmark to facilitate fair performance comparisons by researchers. During testing, the test image is input into a trained visual attention prediction model based on atypical salient region enhancement in S2 to obtain a prediction result, and the prediction result is compared with a corresponding truth diagram and the performance is calculated, so that a final visual attention prediction model is constructed.

An embodiment of the invention provides a system for constructing a visual attention prediction model.

As shown in fig. 2, the system for constructing a visual attention prediction model provided in this embodiment may include:

the model training module is used for pre-training the vision attention prediction model based on atypical salient region enhancement by adopting a known eye movement data set, correcting the vision attention prediction model based on atypical salient region enhancement by adopting an eye movement data set of an autism group, and completing end-to-end training of the vision attention prediction model based on atypical salient region enhancement;

In a preferred embodiment, the visual attention prediction model based on atypical salient region enhancement may comprise:

the feature extraction module is used for extracting a feature map of the input image and outputting a multi-order feature map;

the multi-scale enhancement module is used for carrying out multi-scale enhancement on the highest-order feature map in the multi-order feature map so as to improve the detection capability of the salient regions with different scales;

The atypical salient region enhancement module is used for carrying out residual fusion on the features in the feature map from top to bottom by taking the highest-order feature map as an initial prediction result to obtain an enhanced atypical salient region feature map; further, the module may include: a front Jing Zeng enhancer module, a background enhancer module, and a residual fusion subunit;

the global semantic stream module is used for extracting the context semantic information of the highest-order feature map from the space and channel angles respectively to obtain a global semantic stream, introducing the global semantic stream into an atypical salient region enhancement network layer, and guiding residual fusion, and adaptively supplementing diluted semantic information at the same time; further, the module may include: a channel enhancer module and a spatial location enhancer module;

and the salient map reading module is used for compressing and normalizing the reinforced atypical salient region characteristic map along the channel dimension to obtain a visual attention prediction result.

It should be noted that, the steps in the method provided by the present invention may be implemented by using corresponding modules, devices, units, etc. in the system, and those skilled in the art may refer to a technical solution of the method to implement the composition of the system, that is, the embodiment in the method may be understood as a preferred example of constructing the system, which is not described herein.

An embodiment of the present invention provides a visual attention prediction method, which uses the method or the system for constructing a visual attention prediction model according to any one of the above embodiments to construct a visual attention prediction model, and uses any one image as an input of the model to output a visual attention prediction result.

An embodiment of the present invention provides a computer terminal including a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor when executing the program being operable to perform a method or system according to any one of the above embodiments of the present invention.

Optionally, a memory for storing a program; memory, which may include volatile memory (english) such as random-access memory (RAM), such as static random-access memory (SRAM), double data rate synchronous dynamic random-access memory (Double Data Rate Synchronous Dynamic Random Access Memory, DDR SDRAM), and the like; the memory may also include a non-volatile memory (English) such as a flash memory (English). The memory is used to store computer programs (e.g., application programs, functional modules, etc. that implement the methods described above), computer instructions, etc., which may be stored in one or more memories in a partitioned manner. And the above-described computer programs, computer instructions, data, etc. may be invoked by a processor.

The computer programs, computer instructions, etc. described above may be stored in one or more memories in partitions. And the above-described computer programs, computer instructions, data, etc. may be invoked by a processor.

And a processor for executing the computer program stored in the memory to implement the steps in the method or the modules of the system according to the above embodiments. Reference may be made in particular to the description of the previous method and system embodiments.

The processor and the memory may be separate structures or may be integrated structures that are integrated together. When the processor and the memory are separate structures, the memory and the processor may be connected by a bus coupling.

An embodiment of the present invention provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, is operable to perform a method or system of any of the above embodiments of the present invention.

The method, the system, the terminal and the medium for constructing the visual attention prediction model provided by the embodiment of the invention are used for constructing the visual attention prediction model based on atypical salient region enhancement, and the multi-scale enhancement part is used for improving the detection capability of the model on salient regions with different scales; an atypical salient region enhancement part for performing cross-order residual fusion between multi-order features output by the backbone network by utilizing semantic object bias characteristics of the backbone network to enhance atypical salient regions so as to better simulate the special atypical visual attention of the autism patient; a global semantic stream part for guiding residual fusion in a feature fusion stage, and adaptively supplementing diluted semantic information; and the saliency map reading part compresses and normalizes the output of the module along the channel dimension to obtain a final prediction result.

According to the method, the system, the terminal and the medium for constructing the visual attention prediction model, which are provided by the embodiment of the invention, different from a common feature direct fusion mode in a general visual attention prediction method, according to the characteristics of a feature extraction network layer and the special visual attention mode of an autism patient, an atypical significant region enhancement technology is provided, and the cross-order features are subjected to effective residual fusion, so that the model is more concerned about the atypical significant region which is not detected initially. In addition, the embodiment of the invention also provides a global semantic flow technology, and simultaneously extracts context semantic information from space and channel dimensions and guides feature fusion, so that atypical significant regions are gradually detected in a residual fusion process from top to bottom, and the obtained prediction result is more complete.

The method, the system, the terminal and the medium for constructing the visual attention prediction model provided by the embodiment of the invention can be oriented to autism groups, and the visual attention prediction model based on atypical salient region enhancement is constructed, so that the visual attention prediction model is high in efficiency, low in cost, easy to realize and flexible, and can be deployed on a backbone network with less parameters (consideration efficiency) or a backbone network with better performance (consideration performance) according to actual needs.

The flow diagrams in the figures illustrate the method functions and operations according to the preferred embodiment of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the flowchart illustrations, and combinations of blocks in the flowchart illustrations, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions. It is well known to those skilled in the art that implementation by hardware, implementation by software, and implementation by a combination of software and hardware are all equivalent.

Those skilled in the art will appreciate that the invention provides a system and its individual devices that can be implemented entirely by logic programming of method steps, in the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, etc., in addition to the system and its individual devices being implemented in pure computer readable program code. Therefore, the system and various devices thereof provided by the present invention may be considered as a hardware component, and the devices included therein for implementing various functions may also be considered as structures within the hardware component; means for achieving the various functions may also be considered as being either a software module that implements the method or a structure within a hardware component.

The foregoing embodiments of the present invention are not all well known in the art.

The foregoing describes specific embodiments of the present invention. It is to be understood that the invention is not limited to the particular embodiments described above, and that various changes and modifications may be made by one skilled in the art within the scope of the claims without affecting the spirit of the invention.

Claims

1. A method of constructing a visual attention prediction model, comprising:

2. The method of constructing a visual attention prediction model as recited in claim 1, wherein said constructing a visual attention prediction model based on atypical significant region enhancement includes:

3. The method of constructing a visual attention prediction network layer of claim 2, further comprising any one or more of:

-the feature extraction network layer, constructing a backbone network of the feature extraction network layer by using a full convolution form of a pre-trained deep learning-based object recognition network, for extracting features of the input image, and outputting a multi-order feature map; wherein the feature extraction network layer comprises: a convolution layer, a pooling layer and a ReLu activation layer;

-the multi-scale enhancement network layer explicitly introducing multi-scale information by adopting a plurality of parallel convolution layers with convolution kernels of different sizes, performing multi-scale enhancement on the highest-order feature map, wherein the enhancement result acts on the residual fusion process and is used for improving feature extraction capability of different scale significant regions;

-the atypical significant region enhancement network layer comprising: a background enhancement network layer, a foreground enhancement network layer and a residual fusion network layer; and taking the highest-order feature map as an initial prediction result, carrying out residual fusion on the features in the multi-order feature map from top to bottom to obtain an enhanced atypical significant region feature map, wherein the method comprises the following steps:

taking the new prediction result as a new high-order feature, and continuing residual fusion with an adjacent low-order feature until a final reinforced atypical significant region feature map is obtained;

-the global semantic stream network layer comprising a channel enhancement network layer and a spatial location enhancement network layer; extracting context semantic information of the feature map from the space and channel angles respectively to obtain a global semantic stream, wherein the method comprises the following steps:

the global semantic stream is introduced into the atypical salient region enhancement network layer, and the weight of the global semantic stream is adaptively adjusted to be used for adaptively supplementing global information in the residual fusion process;

-the saliency map readout network layer comprising a 3*3 convolution layer and a sigmoid activation function.

4. The method of claim 1, wherein the pre-training the atypical salient region-based enhanced visual attention prediction model with a known eye movement data set and correcting the atypical salient region-based enhanced visual attention prediction model with an eye movement data set of an autism population to complete an end-to-end training of the atypical salient region-based enhanced visual attention prediction model, comprising:

5. The method of constructing a visual attention prediction model as recited in claim 4, further comprising any one or more of:

-said pre-training said atypical salient region-enhanced-based visual attention prediction model with the disclosed eye movement data sets sallicon and MIT1003 and correcting said atypical salient region-enhanced-based visual attention prediction model with the eye movement data set Saliency4ASD of the autism population, comprising:

inputting an image of an eye movement data set Saliency4ASD of the autism group as a model, using a truth diagram corresponding to the image of the eye movement data set Saliency4ASD of the autism group as a label, fine-tuning the vision attention prediction model based on the atypical salient region enhancement in an end-to-end mode, enabling the model to automatically learn a mapping relation between an original image and the truth diagram, obtaining eye movement characteristics of the autism group, and correcting the model;

-said setting of initialization parameters of said atypical salient region-based enhanced visual attention prediction model comprises:

the feature extraction network layer adopts parameters obtained by pre-training the feature extraction network layer on an ImageNet data set as initialization parameters; the initial parameters of other network layers are random initialization parameters;

-said determining a loss function of said atypical significant region-based enhanced visual attention prediction model comprising:

the loss function adopts weighted linear combination of three significance performance evaluation indexes KL, CC, NSS;

-said determining relevant hyper-parameters in said atypical significant region-based enhanced visual attention prediction model comprising:

in the pre-training process, a random gradient descent algorithm is adopted, and the initial learning rate is 10 ^-4 And 3 epochs per iteration decrease by a factor of 10, the Batchsize is 10, the pre-training process requiresThe 20 epochs are iterated until the model converges.

6. The method according to claim 1, wherein the testing the trained visual attention prediction model based on atypical salient region enhancement using the test image in the known eye movement dataset comprises:

7. A system for constructing a visual attention prediction model, comprising:

8. A visual attention prediction method, characterized in that a visual attention prediction model constructed by the method for constructing a visual attention prediction model according to any one of claims 1 to 6 or the system for constructing a visual attention prediction model according to claim 7 is used, and a visual attention prediction result is obtained by outputting any one of images as an input of the model.

9. A computer terminal comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor is operable to perform the method of any of claims 1-6 or 8 when the program is executed.

10. A computer readable storage medium having stored thereon a computer program, which when executed by a processor is operable to perform the method of any of claims 1-6 or 8.