CN108805866B - Image fixation point detection method based on quaternion wavelet transform depth vision perception - Google Patents


Info

Publication number
CN108805866B
Authority
CN
China
Prior art keywords
layer
convolution
image
wavelet transform
detail
Prior art date
Legal status
Active
Application number
CN201810500003.XA
Other languages
Chinese (zh)
Other versions
CN108805866A (en)
Inventor
Li Ce
Wan Yuqi
Zhang Dong
Jia Shengze
Liu Hao
Zhang Yachao
Lan Tian
Current Assignee
Lanzhou University of Technology
Original Assignee
Lanzhou University of Technology
Priority date
2018-05-23
Filing date
2018-05-23
Publication date
2022-03-25
Application filed by Lanzhou University of Technology
Priority to CN201810500003.XA
Publication of CN108805866A
Application granted
Publication of CN108805866B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/0002 Inspection of images, e.g. flaw detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20048 Transform domain processing
    • G06T 2207/20064 Wavelet transform [DWT]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]

Abstract

The image fixation point detection method based on quaternion wavelet transform depth vision perception comprises the following steps: step 1, quaternion wavelet transform; step 2, reducing the dimensionality of the feature maps to be trained; and step 3, computing the fixation map. The quaternion wavelet transform of the image yields 12 detail sub-band images; a dimensionality-reducing convolutional network built from 1 × 1 convolution kernels reduces the dimensionality of the feature maps to be trained; a deep convolutional neural network then extracts the fixation point features of the image, several feature maps are fused, and the fixation points of the image are detected to obtain the fixation map. The invention learns and extracts fixation point feature information from the detail sub-band images generated by the quaternion wavelet transform of the image and uses it for fixation point detection, achieving a good detection effect with clear theoretical significance and practical value.

Description

Image fixation point detection method based on quaternion wavelet transform depth vision perception
Technical Field
The invention relates to the technical fields of image processing, deep learning and computer vision, and in particular to an image fixation point detection method based on quaternion wavelet transform depth vision perception.
Background
Through the visual system, humans can easily pick out the important information in an image, whereas traditional machine vision struggles to detect the fixation point positions of an image well. A fixation point position is the location on an image where the human gaze falls, drawn to an interesting region by the visual attention mechanism, when a person observes the image. With the popularity of electronic products and the resulting flood of image information, people increasingly need computer assistance to detect the fixation points in images quickly. Fixation point detection can serve fields such as target detection and recognition, so image fixation point detection has become a research hotspot.
Deep convolutional neural networks can learn the features that characterize the fixation points of the human eye and use them to locate regions of interest. The quaternion wavelet transform of an image produces detail sub-band images over several channels and several directions, which reflect the detail characteristics of the image well. Based on this analysis, the invention proposes a fixation point detection method based on quaternion wavelet transform depth vision perception, which performs deep learning on the detail sub-band images generated by the quaternion wavelet transform of an image, extracts the features representing fixation point information in the image, and uses them to detect fixation points.
Disclosure of Invention
The invention provides a fixation point detection method based on quaternion wavelet transform depth vision perception. After the quaternion wavelet transform of an image, 12 detail sub-band images reflecting the image detail information are generated. A deep convolutional neural network is used to learn the feature information that characterizes the fixation points. Because the 12 detail sub-band images carry a large data volume, a network built from 1 × 1 convolution kernels first reduces their dimensionality and extracts low-dimensional feature maps to be trained, which improves the training efficiency of the deep convolutional neural network. The deep convolutional neural network is then trained on the low-dimensional feature maps. Finally, the trained network structure extracts the fixation point information of the image and detects the fixation points to obtain the fixation map.
The purpose of the invention is realized by the following technical scheme.
A fixation point detection method based on quaternion wavelet transform depth vision perception comprises the following steps:
step 1, performing a one-level quaternion wavelet decomposition of a natural scene image: the row pixels and column pixels of the image are filtered with a low-pass filter and a high-pass filter in different combinations, yielding 4 channels, namely low-pass/low-pass, low-pass/high-pass, high-pass/low-pass and high-pass/high-pass, with 12 detail sub-band images in three directions, namely horizontal, vertical and diagonal, and 4 approximation images across the 4 channels;
step 2, using a dimensionality-reducing convolutional network structure built from 1 × 1 convolution kernels to reduce the dimensionality of the 12 detail sub-band images, extracting from them 3 detail feature maps that better represent the image detail information, for training the deep convolutional neural network structure that extracts the image fixation points;
and step 3, training a deep convolutional neural network on the detail feature maps extracted by the dimensionality-reducing convolutional network, establishing a mapping network between the detail feature maps and the image fixation points, and detecting the fixation points with the trained dimensionality-reducing convolutional network and the trained deep convolutional neural network.
Preferably, step 1 further comprises: the quaternion wavelet transform in the invention refers to the dual-tree quaternion two-dimensional discrete wavelet transform. The quaternion wavelet transform of an image is formed from a real wavelet transform and the two-dimensional Hilbert transform; the two-dimensional Hilbert transform constructs an orthonormal basis for the quaternion wavelet transform, and performing the quaternion wavelet transform on an image yields the wavelet coefficients of four channels, namely 12 detail sub-band images and 4 approximation images. This is realized by the following steps:
1) With φ(x)φ(y) and ψ denoting the wavelet scale function and the wavelet basis function of the quaternion wavelet transform respectively, the Hilbert transforms of φ(x)φ(y) in the horizontal direction x, the vertical direction y and the diagonal direction xy can be expressed as:

$$H^{x}[\phi(x)\phi(y)]=\phi_{h}(x)\phi(y),\qquad H^{y}[\phi(x)\phi(y)]=\phi(x)\phi_{h}(y),\qquad H^{xy}[\phi(x)\phi(y)]=\phi_{h}(x)\phi_{h}(y) \tag{1}$$

where H denotes the Hilbert transform and the subscript h marks the Hilbert transform of the corresponding one-dimensional function; φ(x)φ(y) and the results of the transformation in equation (1) together form a set of orthonormal bases.
2) Analogously to step 1), applying the Hilbert transform separately to the wavelet functions φ(x)ψ(y), ψ(x)φ(y) and ψ(x)ψ(y) constructs the four sets of orthonormal bases contained in the quaternion wavelet transform, which can be expressed as the matrix G:

$$G=\begin{bmatrix}\phi(x)\phi(y)&\phi_{h}(x)\phi(y)&\phi(x)\phi_{h}(y)&\phi_{h}(x)\phi_{h}(y)\\ \phi(x)\psi(y)&\phi_{h}(x)\psi(y)&\phi(x)\psi_{h}(y)&\phi_{h}(x)\psi_{h}(y)\\ \psi(x)\phi(y)&\psi_{h}(x)\phi(y)&\psi(x)\phi_{h}(y)&\psi_{h}(x)\phi_{h}(y)\\ \psi(x)\psi(y)&\psi_{h}(x)\psi(y)&\psi(x)\psi_{h}(y)&\psi_{h}(x)\psi_{h}(y)\end{bmatrix} \tag{2}$$
3) A one-level quaternion wavelet decomposition of the image then yields the wavelet decomposition coefficients on the four channels, represented by the matrix F:

$$F=\begin{bmatrix}A_{LL}&A_{LH}&A_{HL}&A_{HH}\\ D^{H}_{LL}&D^{H}_{LH}&D^{H}_{HL}&D^{H}_{HH}\\ D^{V}_{LL}&D^{V}_{LH}&D^{V}_{HL}&D^{V}_{HH}\\ D^{D}_{LL}&D^{D}_{LH}&D^{D}_{HL}&D^{D}_{HH}\end{bmatrix} \tag{3}$$

where LL denotes the low-pass/low-pass channel, LH the low-pass/high-pass channel, HL the high-pass/low-pass channel and HH the high-pass/high-pass channel. In the matrix F, the first row (A) is the coefficient matrix of the approximation part, i.e. the 4 approximation images; the second, third and fourth rows (D^H, D^V, D^D) are the detail coefficient matrices in the horizontal, vertical and diagonal directions respectively, i.e. the 12 detail sub-band images.
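To make the basis construction concrete, here is a minimal numerical sketch (not part of the patent): it samples an off-the-shelf orthogonal wavelet (db4, an arbitrary choice) with PyWavelets and approximates the Hilbert pair with SciPy's analytic-signal routine; a faithful dual-tree implementation would instead use specially designed filter pairs.

```python
# Illustrative sketch only: approximate the Hilbert pair of a real wavelet,
# the ingredient used to build the quaternion orthonormal basis G in (2).
import numpy as np
import pywt
from scipy.signal import hilbert

# Sample the scale function phi and wavelet psi of db4 on a dyadic grid.
phi, psi, x = pywt.Wavelet("db4").wavefun(level=8)

# Imaginary part of the analytic signal = (approximate) Hilbert transform.
phi_h = np.imag(hilbert(phi))
psi_h = np.imag(hilbert(psi))

# A separable 2-D basis element from G, e.g. psi_h(x)psi(y), is an outer
# product of the 1-D samplings (Hilbert transform along x only).
basis_hx = np.outer(psi_h, psi)
```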
Preferably, step 2 further comprises: the 12 detail sub-band images obtained in step 1 carry a large data volume, and training the deep convolutional neural network on them directly would take a long time. To improve the training efficiency, a convolutional neural network built from 1 × 1 convolution kernels performs the dimensionality reduction on the 12 detail sub-band images.
Preferably, step 2 further comprises: the convolutional neural network that reduces the dimensionality of the data to be trained comprises 1 input layer, 3 convolutional layers and 1 output layer, connected as follows: input layer → convolutional layer 1 → convolutional layer 2 → convolutional layer 3 → output layer. The output of each convolutional layer passes through one batch normalization (BatchNorm) and one activation function 1 (ReLU) before entering the next adjacent layer. The input layer feeds the 12 detail sub-band images into the dimensionality-reducing convolutional neural network, and the low-dimensional feature maps to be trained are obtained after the multi-layer convolution processing. The ReLU function has the following form:
$$f(x)=\max(0,x) \tag{4}$$
preferably, step 3 further includes inputting the feature map to be trained after dimensionality reduction into a deep convolutional neural network, training the network, and detecting the image fixation point by using the trained network structure. The deep convolutional neural network is further divided into a network for extracting the characteristics of the fixation point and a network for detecting the fixation point, and the specific network structure and implementation steps are as follows:
1) The network that extracts fixation point features comprises 1 input layer, 5 convolution stages and 1 output layer. The first two convolution stages each contain 2 convolutional layers and 1 pooling layer, the next two convolution stages each contain 3 convolutional layers and 1 pooling layer, and the last convolution stage contains only 3 convolutional layers. The specific connections are: input layer → convolution stage 1 (convolutional layer 1_1 → convolutional layer 1_2 → pooling layer 1) → convolution stage 2 (convolutional layer 2_1 → convolutional layer 2_2 → pooling layer 2) → convolution stage 3 (convolutional layer 3_1 → convolutional layer 3_2 → convolutional layer 3_3 → pooling layer 3) → convolution stage 4 (convolutional layer 4_1 → convolutional layer 4_2 → convolutional layer 4_3 → pooling layer 4) → convolution stage 5 (convolutional layer 5_1 → convolutional layer 5_2 → convolutional layer 5_3) → output layer. The output of each convolutional layer passes through activation function 1 (ReLU) before entering the next adjacent layer.
Each convolutional layer adopts a small-scale convolution kernel; compared with a large-scale kernel, convolving with a small-scale kernel reduces the parameters of the network structure. The input layer feeds the 3 detail feature maps to be trained into the feature-extraction network, which outputs the fixation point feature information after the 5 convolution stages.
2) The network structure that detects the fixation map comprises 3 deconvolution layers, 1 convolutional layer and 1 output layer, connected as follows: the feature information output by convolutional layer 3_3, convolutional layer 4_3 and convolutional layer 5_3 of the feature-extraction network is fed into separate deconvolution layers; each result then undergoes one cropping (Crop) operation, giving 3 feature maps whose size matches the original image; these pass through 1 convolutional layer to output 1 feature map representing the image fixation point information, which passes through activation function 2 (Sigmoid) to output the fixation map of the image.
The feature information output by each convolutional layer differs, so to improve the detection effect, the features output by convolutional layers 3_3, 4_3 and 5_3 are fused for fixation point detection. Because the fixation point features are extracted through several convolutional and pooling layers, the feature maps output by different convolutional layers have different sizes; the feature maps output by convolutional layers 3_3, 4_3 and 5_3 are therefore deconvolved before being fused. The fused fixation point feature information then passes through the Sigmoid function, which computes the saliency value of each pixel and thereby yields the detected fixation map. The Sigmoid function has the following form:
$$f(x)=\frac{1}{1+e^{-x}} \tag{5}$$
Drawings
FIG. 1 is an overall flow chart of the fixation point detection method based on quaternion wavelet transform depth vision perception according to the invention; FIG. 2 is a block diagram of the dimensionality-reducing convolutional neural network of the invention; FIG. 3 is a diagram of the deep convolutional neural network architecture of the invention; FIG. 4 shows the final fixation point detection results of the invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and a detailed embodiment, but the embodiment is given by way of illustration only and is not intended to limit the scope of the invention.
The invention relates to a fixation point detection method based on quaternion wavelet transform depth vision perception; FIG. 1 is the overall flow block diagram of the method. The specific implementation steps are as follows:
1. Acquisition of the sub-band images
The quaternion wavelet transform performs a dual-tree quaternion two-dimensional discrete wavelet transform on the image. After the quaternion wavelet transform, the image produces 12 detail sub-band images and 4 approximation images, and the image features used for fixation point detection are extracted from the 12 detail sub-band images. The specific steps of the quaternion wavelet transform are as follows:
1) Construct the wavelet functions: with φ(x)φ(y) and ψ denoting the wavelet scale function and the wavelet basis function of the quaternion wavelet transform respectively, the Hilbert transforms of φ(x)φ(y) in the x direction, the y direction and the xy direction can be expressed as:

$$H^{x}[\phi(x)\phi(y)]=\phi_{h}(x)\phi(y),\qquad H^{y}[\phi(x)\phi(y)]=\phi(x)\phi_{h}(y),\qquad H^{xy}[\phi(x)\phi(y)]=\phi_{h}(x)\phi_{h}(y) \tag{1}$$
2) Construct the orthonormal bases of the quaternion wavelet transform using the Hilbert transform: analogously to step 1), applying the Hilbert transform separately to the wavelet functions φ(x)ψ(y), ψ(x)φ(y) and ψ(x)ψ(y) constructs the four sets of orthonormal bases contained in the quaternion wavelet transform, expressed as the matrix G:

$$G=\begin{bmatrix}\phi(x)\phi(y)&\phi_{h}(x)\phi(y)&\phi(x)\phi_{h}(y)&\phi_{h}(x)\phi_{h}(y)\\ \phi(x)\psi(y)&\phi_{h}(x)\psi(y)&\phi(x)\psi_{h}(y)&\phi_{h}(x)\psi_{h}(y)\\ \psi(x)\phi(y)&\psi_{h}(x)\phi(y)&\psi(x)\phi_{h}(y)&\psi_{h}(x)\phi_{h}(y)\\ \psi(x)\psi(y)&\psi_{h}(x)\psi(y)&\psi(x)\psi_{h}(y)&\psi_{h}(x)\psi_{h}(y)\end{bmatrix} \tag{2}$$
3) Perform the quaternion wavelet transform on the image: the wavelet decomposition coefficients on the four channels are obtained, represented by the matrix F:

$$F=\begin{bmatrix}A_{LL}&A_{LH}&A_{HL}&A_{HH}\\ D^{H}_{LL}&D^{H}_{LH}&D^{H}_{HL}&D^{H}_{HH}\\ D^{V}_{LL}&D^{V}_{LH}&D^{V}_{HL}&D^{V}_{HH}\\ D^{D}_{LL}&D^{D}_{LH}&D^{D}_{HL}&D^{D}_{HH}\end{bmatrix} \tag{3}$$

In the matrix F, the first row is the coefficient matrix of the approximation part, i.e. the 4 approximation images; the second, third and fourth rows are the detail coefficient matrices in the horizontal, vertical and diagonal directions respectively, i.e. the 12 detail sub-band images. As shown in FIG. 1(a), the wavelet coefficients (detail sub-band images) selected by the invention are the second, third and fourth rows of the matrix F.
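The following runnable sketch (my own illustration, with its assumptions flagged in the comments) shows the bookkeeping of the matrix F: the two filter trees of the dual-tree transform are emulated with a real wavelet h = db4 and a crude Hilbert-pair wavelet g derived from it, and the four channel decompositions are computed with per-axis wavelets. A production quaternion wavelet transform would use properly designed dual-tree (q-shift) filters instead.

```python
# Approximate one-level quaternion wavelet decomposition: 4 channels x
# (1 approximation + 3 detail sub-bands) = 4 approximation images and
# 12 detail sub-band images, as in matrix F.
import numpy as np
import pywt
from scipy.signal import hilbert

h = pywt.Wavelet("db4")                       # tree 1: a real wavelet
g = pywt.Wavelet("g_db4", filter_bank=[       # tree 2: crude Hilbert pair
    np.imag(hilbert(h.dec_lo)).tolist(),
    np.imag(hilbert(h.dec_hi)).tolist(),
    np.imag(hilbert(h.dec_lo)).tolist()[::-1],
    np.imag(hilbert(h.dec_hi)).tolist()[::-1]])

def qwt_level1(image):
    """Return {channel: {subband: array}} for the 4 channels of F.

    Sub-band keys: 'aa' approximation, 'ad'/'da' horizontal/vertical
    detail (orientation labels are convention-dependent), 'dd' diagonal.
    """
    trees = {"LL": [h, h], "LH": [h, g], "HL": [g, h], "HH": [g, g]}
    return {name: pywt.dwtn(image, wavelet=pair)   # per-axis wavelets
            for name, pair in trees.items()}

coeffs = qwt_level1(np.random.rand(256, 256))
details = [coeffs[c][k] for c in coeffs for k in ("ad", "da", "dd")]
assert len(details) == 12                     # the 12 detail sub-band images
```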
2. Acquisition of the feature maps to be trained
The invention builds a dimensionality-reducing convolutional network from 1 × 1 convolution kernels, as shown in FIG. 1(b), which extracts the low-dimensional feature maps to be trained from the 12 detail sub-band images. The inter-layer connections of the dimensionality-reducing convolutional network are shown in FIG. 2: input layer → convolutional layer 1 → convolutional layer 2 → convolutional layer 3 → output layer. Convolutional layer 1 uses a 1 × 1 × 16 convolution kernel, so the 12 input detail sub-band images yield a 16-layer feature map after the first convolution; convolutional layer 2 uses a 1 × 1 × 8 convolution kernel and outputs an 8-layer feature map after the second convolution; convolutional layer 3 uses a 1 × 1 × 3 convolution kernel and outputs a 3-layer feature map after the third convolution. All convolutional layers have stride 1, and the output of each convolutional layer undergoes one batch normalization (BatchNorm) and one activation function 1 (ReLU) operation before entering the next adjacent layer. The output of the dimensionality-reducing convolutional network is the feature map to be trained and is fed into the deep convolutional neural network, as shown in FIG. 1(c). The ReLU activation function has the following form:
$$f(x)=\max(0,x) \tag{4}$$
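As a concrete illustration, a minimal PyTorch sketch of this dimensionality-reducing network follows. It is not the patent's own code, and the 224 × 224 input size is an arbitrary assumption, but the 1 × 1 kernels, the channel widths 12 → 16 → 8 → 3, stride 1 and the BatchNorm + ReLU ordering follow the description above.

```python
# Dimensionality-reducing network of FIG. 2 (sketch).
import torch
import torch.nn as nn

class ReduceDim(nn.Sequential):
    def __init__(self):
        def block(c_in, c_out):
            return [nn.Conv2d(c_in, c_out, kernel_size=1, stride=1),
                    nn.BatchNorm2d(c_out),      # batch normalization
                    nn.ReLU(inplace=True)]      # activation function 1
        super().__init__(*block(12, 16),        # convolutional layer 1
                         *block(16, 8),         # convolutional layer 2
                         *block(8, 3))          # convolutional layer 3

subbands = torch.randn(1, 12, 224, 224)  # the 12 detail sub-band images
features = ReduceDim()(subbands)         # 3 feature maps to be trained
```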
3. Acquisition of the fixation map
The dimensionality-reduced feature maps to be trained are input into the deep convolutional neural network, the network is trained, and the trained network structure detects the image fixation points to obtain the fixation map. The inter-layer connections of the deep convolutional neural network are shown in FIG. 3; it divides further into a network that extracts fixation point features and a network that detects fixation points. The specific network structure and implementation steps are as follows:
1) The network that extracts fixation point features comprises 1 input layer, 5 convolution stages and 1 output layer. The first two convolution stages each contain 2 convolutional layers and 1 pooling layer, the next two convolution stages each contain 3 convolutional layers and 1 pooling layer, and the last convolution stage contains only 3 convolutional layers. The specific connections are: input layer → convolution stage 1 (convolutional layer 1_1 → convolutional layer 1_2 → pooling layer 1) → convolution stage 2 (convolutional layer 2_1 → convolutional layer 2_2 → pooling layer 2) → convolution stage 3 (convolutional layer 3_1 → convolutional layer 3_2 → convolutional layer 3_3 → pooling layer 3) → convolution stage 4 (convolutional layer 4_1 → convolutional layer 4_2 → convolutional layer 4_3 → pooling layer 4) → convolution stage 5 (convolutional layer 5_1 → convolutional layer 5_2 → convolutional layer 5_3) → output layer. The output of each convolutional layer passes through activation function 1 (ReLU) before entering the next adjacent layer.
In the network that extracts fixation point features, every convolutional layer adopts a 3 × 3 convolution kernel; compared with a large-scale kernel, this reduces the parameters of the network structure during the convolution operations. All convolution kernels in the first convolution stage are 3 × 3 × 64, all in the second stage are 3 × 3 × 128, all in the third stage are 3 × 3 × 256, all in the fourth stage are 3 × 3 × 512, and all in the fifth stage are 3 × 3 × 512; all pooling layers perform 2 × 2 max pooling. The input layer feeds the 3 detail feature maps to be trained, output by the dimensionality-reducing convolutional network, into the network that extracts fixation point features, and the fixation point feature information is output after the 5 convolution stages; a sketch of this extractor follows.
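The PyTorch sketch below mirrors this extractor. The stage layout, 3 × 3 kernels, channel widths and 2 × 2 max pooling follow the text; `padding=1` is my assumption (the patent does not state the padding), chosen so each convolution preserves the spatial size.

```python
# Feature-extraction network (five convolution stages, VGG-16-like).
import torch
import torch.nn as nn

def _convs(c_in, c_out, n):
    layers = []
    for i in range(n):
        layers += [nn.Conv2d(c_in if i == 0 else c_out, c_out,
                             kernel_size=3, padding=1),  # padding assumed
                   nn.ReLU(inplace=True)]                # activation 1
    return nn.Sequential(*layers)

class GazeFeatures(nn.Module):
    """Taps f3/f4/f5 are the outputs of convolutional layers 3_3, 4_3
    and 5_3, which the detection network consumes."""
    def __init__(self):
        super().__init__()
        self.stage1 = nn.Sequential(_convs(3, 64, 2), nn.MaxPool2d(2))
        self.stage2 = nn.Sequential(_convs(64, 128, 2), nn.MaxPool2d(2))
        self.convs3, self.pool3 = _convs(128, 256, 3), nn.MaxPool2d(2)
        self.convs4, self.pool4 = _convs(256, 512, 3), nn.MaxPool2d(2)
        self.convs5 = _convs(512, 512, 3)      # stage 5 has no pooling

    def forward(self, x):
        x = self.stage2(self.stage1(x))
        f3 = self.convs3(x)                    # conv layer 3_3 output
        f4 = self.convs4(self.pool3(f3))       # conv layer 4_3 output
        f5 = self.convs5(self.pool4(f4))       # conv layer 5_3 output
        return f3, f4, f5
```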
2) The network structure that detects the fixation map comprises 3 deconvolution layers, 1 convolutional layer and 1 output layer, connected as follows: the feature information output by convolutional layer 3_3, convolutional layer 4_3 and convolutional layer 5_3 of the feature-extraction network is fed into deconvolution layer 1, deconvolution layer 2 and deconvolution layer 3 respectively; each result undergoes one cropping (Crop) operation, giving 3 feature maps whose size matches the original image; these pass through convolutional layer 6 to output 1 feature map representing the image fixation point information, which passes through activation function 2 (Sigmoid) to output the fixation map of the image.
To keep the size of the output feature maps consistent with the original image, the invention uses the 3 deconvolution layers to enlarge the feature maps output by convolutional layers 3_3, 4_3 and 5_3 respectively; the enlarged feature maps are cropped to the size of the original image, giving 3 feature maps whose size matches the original; after processing by one 1 × 1 × 1 convolution kernel, the 3 feature maps are fused into 1 feature map representing the image fixation point information, and the fused fixation point feature map passes through the Sigmoid function, which computes the saliency value of each pixel to obtain the detected fixation map. The Sigmoid function has the following form:
$$f(x)=\frac{1}{1+e^{-x}} \tag{5}$$
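A sketch of this detection network, reusing `GazeFeatures` from the previous sketch, is given below. The topology (three transposed convolutions, crop, 1 × 1 fusion convolution, Sigmoid) follows the text; the transposed-convolution strides (4/8/16 for f3/f4/f5 under the pooling schedule sketched earlier) and kernel sizes are my assumptions, since the patent fixes only the overall structure.

```python
# Detection network: upsample, crop, fuse, Sigmoid -> fixation map.
import torch
import torch.nn as nn

def _crop(t, h, w):
    return t[:, :, :h, :w]           # crop to the original image size

class GazeDetector(nn.Module):
    def __init__(self):
        super().__init__()
        # deconvolution layers 1-3 (one per tapped feature map)
        self.up3 = nn.ConvTranspose2d(256, 1, kernel_size=8, stride=4, padding=2)
        self.up4 = nn.ConvTranspose2d(512, 1, kernel_size=16, stride=8, padding=4)
        self.up5 = nn.ConvTranspose2d(512, 1, kernel_size=32, stride=16, padding=8)
        self.fuse = nn.Conv2d(3, 1, kernel_size=1)   # the 1x1x1 fusion conv

    def forward(self, f3, f4, f5, size):
        h, w = size
        maps = [_crop(self.up3(f3), h, w),
                _crop(self.up4(f4), h, w),
                _crop(self.up5(f5), h, w)]
        fused = self.fuse(torch.cat(maps, dim=1))
        return torch.sigmoid(fused)  # activation function 2 (Sigmoid)

# End-to-end wiring of the sketched modules
# (ReduceDim and GazeFeatures are defined in the sketches above):
x = torch.randn(1, 3, 224, 224)                 # 3 detail feature maps
f3, f4, f5 = GazeFeatures()(x)
fixation_map = GazeDetector()(f3, f4, f5, size=(224, 224))
```

With 224 × 224 inputs the three upsampled maps already match the original size, so the crop is a no-op there; it matters for input sizes that the pooling stages do not divide exactly.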

Claims (3)

1. An image fixation point detection method based on quaternion wavelet transform depth vision perception, characterized by comprising the following steps:
step 1, performing a one-level quaternion wavelet decomposition of a natural scene image, filtering the row pixels and column pixels of the image with a low-pass filter and a high-pass filter in different combinations to obtain 4 channels, namely low-pass/low-pass, low-pass/high-pass, high-pass/low-pass and high-pass/high-pass, with 12 detail sub-band images in three directions, namely horizontal, vertical and diagonal, and 4 approximation images across the 4 channels;
step 2, using a dimensionality-reducing convolutional network structure built from 1 × 1 convolution kernels to reduce the dimensionality of the 12 detail sub-band images, and extracting from them 3 detail feature maps that better represent the image detail information, for training the deep convolutional neural network structure that extracts the image fixation points;
step 3, training a deep convolutional neural network on the detail feature maps extracted by the dimensionality-reducing convolutional network, establishing a mapping network between the detail feature maps and the image fixation points, and detecting the fixation points with the trained dimensionality-reducing convolutional network and the trained deep convolutional neural network; wherein: the deep convolutional neural network is trained with the 3 detail feature maps extracted in step 2; step 3 further comprises:
3.1 constructing the feature-extraction network structure, comprising 1 input layer, 5 convolution stages and 1 output layer, wherein the first two convolution stages each comprise 2 convolutional layers and 1 pooling layer, the next two convolution stages each comprise 3 convolutional layers and 1 pooling layer, and the last convolution stage comprises only 3 convolutional layers, connected as follows: input layer → convolutional layer 1_1 → convolutional layer 1_2 → pooling layer 1 → convolutional layer 2_1 → convolutional layer 2_2 → pooling layer 2 → convolutional layer 3_1 → convolutional layer 3_2 → convolutional layer 3_3 → pooling layer 3 → convolutional layer 4_1 → convolutional layer 4_2 → convolutional layer 4_3 → pooling layer 4 → convolutional layer 5_1 → convolutional layer 5_2 → convolutional layer 5_3 → output layer, the output of each convolutional layer being calculated by the activation function (1) before being input to the next adjacent layer;
the input layer feeds the 3 detail feature maps to be trained into the feature-extraction network, which outputs the feature maps after the five convolution stages;
3.2, constructing the network structure that detects the fixation map: the feature information output by the last convolutional layer of the third, fourth and fifth convolution stages of step 3.1 is fed into separate deconvolution layers; each result undergoes one cropping operation, giving 3 feature maps whose size matches the original image; these pass through 1 convolutional layer to output 1 feature map representing the image fixation point information, and the fixation map of the image is output after the activation function (2) is calculated; the activation function (2) has the following form, where x denotes the value of each pixel of the feature map input to the activation function (2):
$$f(x)=\frac{1}{1+e^{-x}} \tag{2}$$
2. The method of claim 1, wherein: in step 1, the quaternion wavelet transform has good information-localization properties, the detail sub-band images generated by the quaternion wavelet transform of the image reflect the detail characteristics of the image well, and the information representing the image fixation points is obtained from the detail sub-band images.
3. The method of claim 1, wherein: in step 2, the dimensionality of the feature maps to be trained is reduced with several 1 × 1 convolution operations, and a dimensionality-reducing convolutional network structure is built comprising 1 input layer, 3 convolutional layers and 1 output layer, connected as follows: input layer → convolutional layer 1 → convolutional layer 2 → convolutional layer 3 → output layer, the output of each convolutional layer being processed by batch normalization and the activation function (1) before being input to the next adjacent layer;
wherein the input layer feeds the 12 detail sub-band images generated by the quaternion wavelet transform into the dimensionality-reducing convolutional network, and after the multi-layer convolution processing 3 detail feature maps are extracted for training the deep convolutional neural network, improving the training efficiency; the activation function (1) has the following form, where x denotes the value of each pixel of the feature map input to the activation function:

$$f(x)=\max(0,x) \tag{1}$$
CN201810500003.XA 2018-05-23 2018-05-23 Image fixation point detection method based on quaternion wavelet transform depth vision perception Active CN108805866B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810500003.XA 2018-05-23 2018-05-23 Image fixation point detection method based on quaternion wavelet transform depth vision perception


Publications (2)

Publication Number Publication Date
CN108805866A CN108805866A (en) 2018-11-13
CN108805866B (en) 2022-03-25

Family

ID=64091384


Country Status (1)

Country Link
CN (1) CN108805866B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110472530B (en) * 2019-07-29 2023-10-31 中山大学 Retina OCT image classification method based on wavelet transformation and migration learning
CN111292284B (en) * 2020-02-04 2024-03-01 淮阴师范学院 Color image fusion method based on dual-tree-quaternion wavelet transformation
CN111382795B (en) * 2020-03-09 2023-05-05 交叉信息核心技术研究院(西安)有限公司 Image classification processing method of neural network based on frequency domain wavelet base processing
CN111833360B (en) * 2020-07-14 2024-03-26 腾讯科技(深圳)有限公司 Image processing method, device, equipment and computer readable storage medium
CN112098358B (en) * 2020-09-07 2021-12-17 燕山大学 Near infrared spectrum parallel fusion quantitative detection method based on quaternion convolution neural network
CN114757930B (en) * 2022-04-26 2022-12-06 西安电子科技大学 Chip hardware Trojan detection method based on heat transfer

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106407931A (en) * 2016-09-19 2017-02-15 杭州电子科技大学 Novel deep convolution neural network moving vehicle detection method
CN106447658A (en) * 2016-09-26 2017-02-22 西北工业大学 Significant target detection method based on FCN (fully convolutional network) and CNN (convolutional neural network)
CN107423747A (en) * 2017-04-13 2017-12-01 中国人民解放军国防科学技术大学 A kind of conspicuousness object detection method based on depth convolutional network
CN107292256A (en) * 2017-06-14 2017-10-24 西安电子科技大学 Depth convolved wavelets neutral net expression recognition method based on secondary task
CN107748895A (en) * 2017-10-29 2018-03-02 北京工业大学 UAV Landing landforms image classification method based on DCT CNN models

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Salient object detection algorithm based on global and local low-rank matrix decomposition; Li Ce et al.; Journal of Lanzhou University of Technology; 2015-12-30; Vol. 41, No. 6; full text *
Method based on quaternion wavelet amplitude and phase representation; Xu Yonghong et al.; Application Research of Computers; 2010-10-30; Vol. 27, No. 10; full text *
Visual saliency object detection algorithm under weakly supervised learning; Li Ce et al.; Computer Engineering and Design; 2017-05-30; Vol. 38, No. 5; full text *
Saliency detection with deep convolutional neural networks; Li Yueyun et al.; Journal of Image and Graphics; 2016-01-30; full text *


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant