CN112308005A - Traffic video saliency prediction method based on GAN - Google Patents

Traffic video saliency prediction method based on GAN

Info

Publication number
CN112308005A
Authority
CN
China
Prior art keywords
image
traffic
network
driving
residual
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011241840.9A
Other languages
Chinese (zh)
Inventor
颜红梅
刘秩铭
田晗
秦龙
蒋莲芳
卓义轩
杨晓青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Publication of CN112308005A
Legal status: Pending (current)


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/41 Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/25 Determination of region of interest [ROI] or a volume of interest [VOI]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/40 Scenes; Scene-specific elements in video content
    • G06V20/48 Matching video sequences

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Multimedia (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a GAN-based traffic video saliency prediction method, belonging to the technical field of computer vision. The method combines the selective attention mechanism used in driving with deep learning, designing a progressively growing GAN model with multi-step discrimination that can predict, in real time, the salient regions of traffic scene videos shot by a dashboard camera. Based on this GAN model, the regions a driver would visually search in a traffic driving environment, as well as emergencies around the vehicle, can be effectively estimated, and important objects worth attention, such as traffic signs, can be identified. By combining visual attention mechanisms with a saliency detection model to understand and predict the information relevant to the driving task in traffic scenes, the invention can provide a useful theoretical basis and visual-perception techniques for future intelligent vehicles, driving training, driver assistance systems, and the like.

Description

Traffic video saliency prediction method based on GAN
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a method for detecting image saliency in a traffic scene.
Background
In traffic scenarios, the conditions around a vehicle are complex and changeable. The surroundings of a moving vehicle are filled with a large amount of information: other vehicles, pedestrians, obstacles, traffic signs, and so on. For driving safety, a driver assistance system should recognize not only generic objects around the vehicle but also the objects most relevant to driving safety. Because a driver must maintain a high level of attention while driving and the brain can process only a limited amount of information, highlighting important information during driving is essential. Research shows that experienced drivers search for and extract targets faster in simulated driving, that their visual-perception saliency maps are more concentrated, and that top-down task drive extracts and processes task-specific environmental information more efficiently. Important information in traffic scenes can be extracted from dashboard cameras and intersection traffic monitoring by learning behavior patterns to extract vehicle boxes and movement trends, and to detect pedestrian positions and paths and the spatial positions of traffic signs. Eye movement data recorded by an eye tracker represent the visually perceived salient regions in a traffic scene, and these can be used as labels of important information to predict the salient regions of traffic scenes.
In recent years, neural networks have been applied increasingly to images and video, with good results. Convolutional neural networks can extract low- and high-level features from an image and perform the corresponding recognition and classification. A generative adversarial network (GAN) is an unsupervised learning framework that learns nonlinear feature representations and can generate target images; as a generative model, it can also be applied to video. The residual network (ResNet) structure uses the guidance of prior information to solve the vanishing-gradient problem in deep neural networks, making it possible to deepen the network while reducing the difficulty of parameter learning. CycleGAN is a variant of the generative adversarial network whose training data need not have a one-to-one mapping relationship, which effectively enlarges the training set and enables functions such as unpaired image-to-image style transfer. By combining these methods, a high-resolution traffic scene image can be generated, as can a saliency prediction image, achieving a better gaze-region prediction effect.
Accordingly, the inventors propose a method for predicting the salient regions of a driver's visual search in traffic driving scenes, based on a selective attention mechanism and a deep neural network model. The method detects the driver's attention behavior and the salient regions according to the eye movement characteristics of selective attention during driving, so as to remind the driver of important information useful for driving.
Disclosure of Invention
The invention aims to address the above problems by providing a network model that combines CNN, CycleGAN, ResNet, and GAN and performs progressive training and discrimination, thereby improving the performance of traffic video saliency prediction.
The network model of the present invention for traffic video saliency prediction contains two main networks:
1) the generation network (generator), which generates a traffic image containing a plausible gaze region;
2) the discrimination network (discriminator), which distinguishes generated images from real images.
The generation network adopts the PG-GAN idea: the scale of the generated image grows gradually, so that a high-resolution image is generated progressively from low resolution. Each scale-change stage of the generation process contains a residual network structure block, and the discrimination network discriminates the generated image at each scale in multiple steps, gradually correcting the generated image quality and optimizing the result. The final output of the generation network is the predicted salient region.
Specifically, the GAN-based traffic video saliency prediction method of the invention comprises the following steps:
Step 1: construct a GAN model for predicting the salient regions of traffic video images;
the GAN model comprises a generation network and a discrimination network;
the generation network adopts a U-shaped network structure comprising an encoder and a decoder, with three sequentially connected convolutional layers of identical structure between them;
the encoder comprises N sequentially connected sub-blocks, each containing a convolutional layer, a max-pooling layer, and a residual module connected in sequence; the N residual modules of the encoder are alike (their network levels are the same), the image scale output by each residual module decreases one by one, and each encoder residual module's output scale is half its input scale;
the decoder comprises N sequentially connected sub-blocks, each containing an image scaling layer, a convolutional layer, and a residual module connected in sequence; the N residual modules of the decoder are alike, the image scale output by each residual module increases one by one, and each decoder residual module's output scale is double its input scale, so that the generated image grows step by step to the size of the original input image; that is, the N residual modules of the decoder and the N residual modules of the encoder are arranged symmetrically, and residual modules at symmetric sub-block positions have the same structure;
each residual module of the decoder of the generation network outputs a generated image at one scale, yielding N generated images of different scales;
each residual module of the encoder and decoder comprises two convolutional hidden layers, and the sum of the second hidden layer's output and the module's initial input (i.e., the input of the first hidden layer) is taken as the module's output;
the image scaling layer enlarges the input image, for example with a bilinear-interpolation resize, which retains more image information than deconvolution.
The discrimination network performs multi-step discrimination on the N differently scaled images generated by the generation network (the outputs of the decoder's N residual modules): the N scale images are input at different hidden-layer positions of the discrimination network; after multilayer convolution, batch normalization, and LReLU activation, the hidden layer's output tensor is finally convolved to output a two-dimensional vector;
the discrimination network adopts the loss discrimination scheme of CycleGAN;
Step 2: train the GAN model as a deep neural network on the collected training sample set to obtain the trained GAN model;
wherein the sample labels of the training samples are obtained by preprocessing image data collected in a preset eye movement experiment;
Step 3: preprocess the traffic video image to be predicted so that it matches the input of the trained GAN model, and obtain the salient region of the image from the model's output.
In summary, thanks to the above technical scheme, the invention has the following beneficial effects: the constructed GAN model is trained with the described procedure for extracting training sample labels, and the trained model improves the prediction accuracy of salient regions in traffic video images. By combining visual attention mechanisms with a saliency detection model to understand and predict the information relevant to the driving task in traffic scenes, the invention can provide a useful theoretical basis and visual-perception techniques for future intelligent vehicles, driving training, driver assistance systems, and the like.
Drawings
Fig. 1 is a schematic diagram illustrating the principle of the GAN-based traffic video saliency prediction method of the present invention.
Fig. 2 is a diagram of a GAN network-based generation network model of the present invention.
Fig. 3 is a diagram of a GAN network-based discriminant network model according to the present invention.
FIG. 4 is a diagram illustrating the prediction results of traffic video using the algorithm of the present invention.
FIG. 5 shows the evaluation index result of the traffic video prediction using the algorithm of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings.
The invention discloses a traffic scene saliency prediction method based on the combination of a selective attention mechanism and a generative adversarial network (GAN); the attention mechanism used is a top-down visual selective attention mechanism. The method combines the selective attention mechanism used in driving with deep learning, constructing a progressively growing GAN model with multi-step discrimination that can predict, in real time, the salient regions of traffic scene videos shot by a dashboard camera. The GAN consists mainly of a generation model G and a discrimination model D. The generation model adopts a progressive structure, transitioning from low to high resolution: the scale of the generated images in the network increases gradually, and each stage contains a residual learning structure block to reduce learning difficulty. The discrimination model performs multi-step discrimination on each scale of the generated image, gradually correcting the generated image quality until a high-resolution image is produced. The GAN model can effectively estimate the regions a driver would visually search in a traffic driving environment, as well as emergencies around the vehicle, and can identify important objects worth attention, such as traffic signs. By combining visual attention mechanisms with a saliency detection model to understand and predict the information relevant to the driving task in traffic scenes, the invention can provide a useful theoretical basis and visual-perception techniques for future intelligent vehicles, driving training, driver assistance systems, and the like.
For the purpose of video saliency prediction, the specific steps of the invention are as follows:
1. According to a top-down visual attention mechanism and a prior mechanism, saliency information in the traffic scene is extracted through a designed eye movement experiment; after data processing, the driver's attention saliency map is obtained and used as the sample label.
Since experienced drivers can effectively search for and process target scene information in a driving task, drivers with some driving experience are selected as subjects for the eye movement experiment.
2. The generative adversarial network (GAN) model of the invention consists mainly of two parts: the generation network and the discrimination network. Both networks have progressively growing structures, generating high-quality output from low resolution to high resolution.
The GAN model of the invention is specifically as follows:
(1) The generation network.
The generation network of the present invention adopts a U-shaped network structure, divided into an encoder part and a decoder part. The encoder is divided into four residual blocks; each block has a similar structure and halves the image scale, as shown in fig. 2. Three convolutional layers in series are placed between the encoder and the decoder. Each convolution operation in the network is followed by batch normalization and a ReLU activation function.
Residual module: the input of the generation network is a single video frame, and the encoding part performs convolution operations on the input image to extract image features and understand the video content. The convolution operation is followed by pooling and batch normalization, and the result is passed through the ReLU activation function. Each residual module contains two convolutional hidden layers whose output is added to the module's initial input to form the output.
Residual structure block output function: H(x) = F(x) + x, where x denotes the original input of the residual block, i.e., the input image, and F(x) denotes the result of the convolution operations applied to the input.
Batch normalization standardizes the computation results of each training batch. In the GAN model of the invention, every convolution operation is followed by BN (batch normalization), to overcome the difficulty of training deep neural networks and to address gradient problems during backpropagation. Because increasing network depth can cause gradient dispersion, batch normalization increases convergence speed, prevents untrainable conditions such as gradient explosion, and improves model accuracy. Every step in the model is therefore followed by batch normalization.
ReLU is an activation function that performs better than the traditional sigmoid-based activation functions. ReLU more accurately models how brain neurons receive signals and differs from the sigmoid model in three main ways: unilateral suppression, a relatively broad excitation boundary, and sparse activation. Using the ReLU activation function avoids the vanishing-gradient problem.
The four residual modules of the decoder mirror the structure of the encoder, and the generated image size increases step by step to the size of the original input image. Each time the image scale is doubled, a bilinear-interpolation resize is used, which retains more image information than deconvolution. The generated images at the four scales are judged by the discrimination network, progressively producing high-quality output.
(2) The discrimination network.
The discrimination network of the present invention is a deep convolutional network, as shown in fig. 3.
The four scale images produced by the generation network are input at different hidden-layer positions of the discrimination network; after multilayer convolution, batch normalization, and LReLU activation, the hidden layer's output tensor is finally convolved to produce a two-dimensional vector, which serves as one discrimination result.
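A hedged sketch of such a multi-step discriminator is given below; the per-scale 1x1 input projections (a PG-GAN-style "from_rgb" layer), the channel widths, and the pooled linear head are assumptions used to make the example runnable:

```python
import torch.nn as nn

class MultiScaleDiscriminator(nn.Module):
    """Progressive discriminator: the image generated at scale s enters the
    convolutional stack at the hidden-layer depth matching its resolution."""
    def __init__(self, channels=(32, 64, 128, 256)):
        super().__init__()
        # 1x1 projections so a 3-channel image can enter at any depth
        self.from_rgb = nn.ModuleList(
            nn.Conv2d(3, c, kernel_size=1) for c in channels)
        stages = []
        for i, c in enumerate(channels):
            c_next = channels[i + 1] if i + 1 < len(channels) else c
            stages.append(nn.Sequential(
                nn.Conv2d(c, c_next, 3, stride=2, padding=1),  # halves the scale
                nn.BatchNorm2d(c_next),
                nn.LeakyReLU(0.2)))
        self.stages = nn.ModuleList(stages)
        self.head = nn.Linear(channels[-1], 2)  # two-dimensional output vector

    def forward(self, img, scale_idx):
        """scale_idx: 0 for the full-resolution image, 3 for the smallest."""
        x = self.from_rgb[scale_idx](img)
        for stage in self.stages[scale_idx:]:
            x = stage(x)
        x = x.mean(dim=(2, 3))  # global average over spatial positions
        return self.head(x)
```

Smaller generated images skip the earlier high-resolution stages, so each of the four scales is judged at the depth where its resolution fits.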
(3) CycleGAN discrimination mode.
The invention adopts the loss discrimination scheme of CycleGAN. The total generation loss of the method comprises four parts: the two unidirectional generation losses of the two generation networks, and the real/fake discrimination losses on the results of the two generation networks. The CycleGAN scheme trains the two mapping directions jointly and superposes the losses of the several generation networks, so that the prediction model achieves a better effect after training.
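The four-part loss might be composed as in the sketch below; the L1 cycle criterion, the binary cross-entropy adversarial criterion, the weight lam, and the assignment of roles to the two generators are assumptions, since the text only states that the four parts are combined:

```python
import torch
import torch.nn.functional as F

def total_generation_loss(g_ab, g_ba, d_a, d_b, real_a, real_b, lam=10.0):
    """Four-part loss: two unidirectional generation (cycle) losses plus
    real/fake discrimination losses on both generators' outputs.
    Assumed roles: g_ab maps frames to saliency images, g_ba maps back."""
    fake_b = g_ab(real_a)                     # frame -> generated saliency image
    fake_a = g_ba(real_b)                     # saliency image -> generated frame
    # unidirectional generation (cycle-consistency) losses
    cycle_a = F.l1_loss(g_ba(fake_b), real_a)
    cycle_b = F.l1_loss(g_ab(fake_a), real_b)
    # real/fake discrimination losses on the two generation results
    logits_b = d_b(fake_b)
    logits_a = d_a(fake_a)
    adv_b = F.binary_cross_entropy_with_logits(logits_b, torch.ones_like(logits_b))
    adv_a = F.binary_cross_entropy_with_logits(logits_a, torch.ones_like(logits_a))
    return adv_a + adv_b + lam * (cycle_a + cycle_b)
```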
The invention adopts a deep learning method comprising a training stage and a testing stage. Training and test sets are prepared before training the model: traffic video serves as the training samples, and eye movement data collected in the eye movement experiment serve as the sample labels. The preprocessed training and test samples serve as the initial input data of the whole saliency prediction model.
Referring to fig. 1, the specific implementation steps of the method for predicting the traffic video saliency area based on GAN in the present invention are as follows:
the method selects several random frames of color traffic fee scene videos from the test set as an example test description.
The first step: create a training set and a test set to train the model.
Each frame of the traffic video is decomposed into a traffic image that serves as input data for the network model. All videos are divided into training and test sets in a 3:1 ratio. The eye movement data collected while watching the traffic videos are matched with each video frame, and the data are preprocessed as follows:
(1) Eye movement data to image: the initially extracted eye movement data are a series of data points matching the traffic video size (1280 × 720). Each frame of the 16 video segments is converted into a 1280 × 720 image used as the original traffic scene image. The eye movement data corresponding to each video frame are mapped into a grayscale image of size 1280 × 720, with each fixation position matched to the original traffic image. The result is a set of eye movement data images matched one by one to the original video frames.
(2) Fixation points expanded into dots: the extracted eye movement data are single fixation points, each mapping to an isolated pixel in the image, which is unsuitable for analysis and calculation. Therefore, in this embodiment, each fixation point is taken as a circle center and expanded into a circular gaze area of a preset radius; the gray value of the circular area decays gradually from the center to the edge, inversely proportional to the distance from the center.
For example, individual fixation points are expanded into circular gaze areas with a radius of 21 pixels.
(3) Gaussian smoothing: the fixation points in each frame are mostly distributed within a specific area, so each frame is Gaussian-filtered to form a gaze area for subsequent calculation. Testing showed that a Gaussian distribution parameter of 5 is optimal, giving a salient region of just the right size.
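Steps (2) and (3) can be sketched as follows with NumPy and OpenCV; the linear radial falloff is an assumption consistent with "inversely proportional to the distance from the center", while the radius 21 and the Gaussian parameter 5 follow the text:

```python
import cv2
import numpy as np

def fixation_map(points, h=720, w=1280, radius=21, sigma=5):
    """Expand each fixation point (x, y) into a circular dot whose gray value
    decays with distance from the center, then Gaussian-smooth the result."""
    sal = np.zeros((h, w), dtype=np.float32)
    yy, xx = np.mgrid[0:h, 0:w]
    for (px, py) in points:
        dist = np.sqrt((xx - px) ** 2 + (yy - py) ** 2)
        dot = np.clip(1.0 - dist / radius, 0.0, 1.0)  # linear radial falloff (assumed)
        sal = np.maximum(sal, dot)
    sal = cv2.GaussianBlur(sal, (0, 0), sigma)  # Gaussian parameter 5 from the text
    return sal / (sal.max() + 1e-8)
```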
The resulting gray Gaussian image is then converted to color, with the color gradient running from red to green using the HSV standard colormap (in this example, the first 44 rows of the gradient matrix are taken; this value can be adjusted for the actual application scene). To make the color gradient smooth, the length of the color gradient matrix is expanded 4 times (experiments showed that once the gradient resolution exceeds 4 times the original, its precision exceeds that of the gray values), finally achieving a stable color transition whose brightness suits viewing by the human visual system.
(4) Fusing the eye movement data with the original image: each color Gaussian gaze-area image obtained in the previous step is fused with the corresponding traffic scene video frame, yielding a fused image that is a training sample image with its sample label; that is, the color Gaussian gaze area is the corresponding sample label. In general, the gaze areas concentrate on important traffic information such as the road and vehicles ahead, vehicles entering from side lanes, traffic signs, and signal lights.
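The coloring and fusion of steps (3) and (4) might look like the sketch below; the 44-row cut and the 4x expansion follow the text, while the matplotlib colormap source and the blending weights are assumptions:

```python
import cv2
import numpy as np
import matplotlib.pyplot as plt

def colorize_and_fuse(sal_gray, frame, alpha=0.5):
    """Map a [0,1] gray saliency image through a red-to-green segment of the
    HSV colormap (first 44 rows, expanded 4x) and blend it with the frame."""
    table = plt.get_cmap('hsv')(np.linspace(0, 1, 256))[:, :3]  # 256 x 3 RGB table
    grad = table[:44]                                    # red -> green segment
    idx = np.linspace(0, len(grad) - 1, len(grad) * 4)   # 4x expansion
    grad4 = np.stack([np.interp(idx, np.arange(len(grad)), grad[:, c])
                      for c in range(3)], axis=1)
    colors = (grad4[(sal_gray * (len(grad4) - 1)).astype(int)] * 255).astype(np.uint8)
    colors = cv2.cvtColor(colors, cv2.COLOR_RGB2BGR)     # OpenCV frames are BGR
    return cv2.addWeighted(frame, 1 - alpha, colors, alpha, 0)
```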
The second step: set up the deep neural network model structure and train the model.
The model is trained in stages according to the different scales. At the start, model parameters are initialized randomly and the smallest-scale generation model is trained; using a chosen optimization method (such as Adam), the trained parameters are saved after the loss converges, for use in subsequent training. When the second image scale is trained, the model parameters are initialized, the parameters saved at the previous scale are imported, training proceeds with the same method, and the resulting parameters are saved. Each scale is trained in turn until the largest video scale is finished, at which point the prediction model is fully trained and its parameters are saved for testing.
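A hedged outline of this staged schedule follows; the scale list, epoch count, learning rate, and the model.loss helper are illustrative assumptions, while the hand-over of saved parameters between scales follows the text:

```python
import torch

def train_progressively(build_model, loader_for_scale, scales=(64, 128, 256, 512),
                        epochs_per_scale=20, lr=2e-4):
    """Train one scale at a time; each stage starts from the parameters
    saved by the previous stage (scale values are assumptions)."""
    prev_ckpt = None
    for scale in scales:
        model = build_model(scale)                 # randomly initialized
        if prev_ckpt is not None:
            # import the parameters saved at the previous scale
            model.load_state_dict(torch.load(prev_ckpt), strict=False)
        opt = torch.optim.Adam(model.parameters(), lr=lr, betas=(0.5, 0.999))
        for _ in range(epochs_per_scale):
            for frames, labels in loader_for_scale(scale):
                opt.zero_grad()
                loss = model.loss(frames, labels)  # assumed helper on the model
                loss.backward()
                opt.step()
        prev_ckpt = f"gan_scale_{scale}.pt"
        torch.save(model.state_dict(), prev_ckpt)
    return prev_ckpt  # checkpoint of the full-scale model
```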
The third step: test and evaluate the model.
Trained model parameters are imported, random video frames from the test set are input to test the prediction effect, and evaluation indices are computed to assess the predictions quantitatively. Fig. 4 shows the saliency prediction results for the example: the first column is a random video frame, the second column is the eye movement saliency map corresponding to the frame (i.e., the ground truth), and the third column is the prediction of the method of the present invention. Fig. 5 shows the evaluation indices used for the prediction results: AUC_Borji, AUC_Judd (Area Under the Curve, Judd variant), NSS (Normalized Scanpath Saliency), CC (linear correlation coefficient), SIM (similarity), KL-divergence (relative entropy), and EMD (Earth Mover's Distance).
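Two of these indices, NSS and CC, are compact enough to sketch directly (standard saliency-benchmark definitions; the remaining indices follow their usual formulations):

```python
import numpy as np

def nss(sal_map, fixations):
    """Normalized Scanpath Saliency: mean of the z-scored saliency map
    at the ground-truth fixation locations (list of (x, y) pixels)."""
    z = (sal_map - sal_map.mean()) / (sal_map.std() + 1e-8)
    return float(np.mean([z[y, x] for (x, y) in fixations]))

def cc(sal_map, gt_map):
    """Linear correlation coefficient between prediction and ground truth."""
    a = (sal_map - sal_map.mean()) / (sal_map.std() + 1e-8)
    b = (gt_map - gt_map.mean()) / (gt_map.std() + 1e-8)
    return float((a * b).mean())
```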
After the trained GAN model is obtained, the traffic video image to be predicted is preprocessed (size normalization, etc.) so that it matches the model's input, and the salient region of the image is obtained from the model's output.
While the invention has been described with reference to specific embodiments, any feature disclosed in this specification may be replaced by alternative features serving the same, equivalent or similar purpose, unless expressly stated otherwise; all of the disclosed features, or all of the method or process steps, may be combined in any combination, except mutually exclusive features and/or steps.

Claims (7)

1. The traffic video saliency prediction method based on the GAN is characterized by comprising the following steps:
step 1: constructing a GAN model for predicting salient regions of traffic video images;
the GAN model comprises a generation network and a discrimination network;
the generation network adopts a U-shaped network structure comprising an encoder and a decoder, with three sequentially connected convolutional layers of identical structure between them;
the encoder comprises N sequentially connected sub-blocks, each containing a convolutional layer, a max-pooling layer, and a residual module connected in sequence; the N residual modules of the encoder are alike, and the image scale processed by each encoder residual module is half the input image scale;
the decoder comprises N sequentially connected sub-blocks, each containing an image scaling layer, a convolutional layer, and a residual module connected in sequence, the image scaling layer being used to enlarge the input image; the N residual modules of the decoder are alike, and the image scale processed by each decoder residual module is double the input image scale; that is, the N residual modules of the decoder and the N residual modules of the encoder are symmetrically arranged, and residual modules at symmetric sub-block positions have the same structure;
each residual module of the decoder of the generation network outputs a generated image at one scale, yielding N generated images of different scales;
the discrimination network performs multi-step discrimination on the N differently scaled generated images: after multilayer convolution, batch normalization, and leaky ReLU (LReLU) activation, the hidden layer's output tensor is finally convolved to output a two-dimensional vector;
the discrimination network adopts the loss discrimination scheme of CycleGAN;
step 2: training the GAN model as a deep neural network on the collected training sample set to obtain the trained GAN model;
wherein the sample labels of the training samples are obtained by preprocessing image data collected in a preset eye movement experiment;
step 3: preprocessing the traffic video image to be predicted so that it matches the input of the trained GAN model, and obtaining the salient region of the image from the model's output.
2. The method of claim 1, wherein each residual module of the encoder and decoder comprises two convolutional hidden layers, and the sum of the output of the second convolutional hidden layer and the input of the first convolutional hidden layer is used as the output of the residual module.
3. The method of claim 1, wherein the image scaling layer enlarges the input image using a bilinear-interpolation resize method.
4. The method of claim 1, 2 or 3, wherein in step 2 the sample labels of the training samples are set as follows:
each frame of the traffic video is decomposed into a traffic image serving as input data of the GAN model; all videos are divided into a training set and a test set in a certain ratio; the eye movement data collected while watching the traffic videos are matched with each video frame; and the following data preprocessing is performed:
(1) eye movement data to image:
each frame of the traffic video is size-normalized and used as the original traffic scene image, the normalized image size being defined as m × n;
the eye movement data corresponding to each video frame are mapped into an m × n grayscale image, each fixation position being matched to the original traffic scene image, yielding eye movement data images matched one by one to the original traffic scene images;
(2) the fixation point is expanded into a circular point:
each fixation point in the eye movement data image is taken as a circle center and expanded into a circular gaze area of a preset radius, the gray value of the circular gaze area decaying gradually from the center to the edge, inversely proportional to the distance from the fixation point to the center;
(3) Gaussian smoothing:
each eye movement data image with its fixation points expanded into dots is Gaussian-smoothed; then, using the HSV standard colormap, the resulting gray Gaussian image is converted to color, yielding a color Gaussian gaze-area image;
(4) fusing the eye movement data with the original image:
each color Gaussian gaze-area image is fused with the corresponding original traffic scene image to obtain a fused image, yielding the training sample with its sample label.
5. A method as claimed in claim 1, 2 or 3, wherein the circle radius used when expanding the fixation points into dots is 21 pixels.
6. A method as claimed in claim 1, 2 or 3, wherein in step (3), when the obtained gray Gaussian image is converted to color, the color gradient runs from red to green, and the length of the color gradient matrix used is expanded 4 times.
7. The method of claim 1, 2 or 3, wherein in step (1) the normalized image size is 1280 × 720.
CN202011241840.9A 2019-11-15 2020-11-09 Traffic video significance prediction method based on GAN Pending CN112308005A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN2019111194681 2019-11-15
CN201911119468 2019-11-15

Publications (1)

Publication Number Publication Date
CN112308005A (en) 2021-02-02

Family

ID=74325306

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011241840.9A Pending CN112308005A (en) 2019-11-15 2020-11-09 Traffic video significance prediction method based on GAN

Country Status (1)

Country Link
CN (1) CN112308005A (en)

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109684969A (en) * 2018-12-18 2019-04-26 上海科技大学 Stare location estimation method, computer equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘秩铭 (Liu Zhiming): "Prediction of visually salient regions in traffic scenes based on generative adversarial networks", Wanfang Database, 17 September 2019 (2019-09-17), pages 12-43 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112861733A (en) * 2021-02-08 2021-05-28 电子科技大学 Night traffic video significance detection method based on space-time double coding
CN113487855A (en) * 2021-05-25 2021-10-08 浙江工业大学 Traffic flow prediction method based on EMD-GAN neural network structure
CN113487855B (en) * 2021-05-25 2022-12-20 浙江工业大学 Traffic flow prediction method based on EMD-GAN neural network structure
CN114190891A (en) * 2021-12-02 2022-03-18 杭州极智医疗科技有限公司 Unilateral neglect evaluation system based on eye movement tracking and immersive driving platform
CN114190891B (en) * 2021-12-02 2023-11-10 杭州极智医疗科技有限公司 Unilateral neglect evaluation system based on eye tracking and immersive driving platform
CN115762199A (en) * 2022-09-20 2023-03-07 东南大学 Traffic light control method based on deep reinforcement learning and inverse reinforcement learning
CN115762199B (en) * 2022-09-20 2023-09-29 东南大学 Traffic light control method based on deep reinforcement learning and inverse reinforcement learning


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20210202