CN112861733A - Night traffic video significance detection method based on space-time double coding - Google Patents

Night traffic video significance detection method based on space-time double coding Download PDF

Info

Publication number
CN112861733A
Authority
CN
China
Prior art keywords
convolution
space
time
blocks
coding structure
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110183195.8A
Other languages
Chinese (zh)
Other versions
CN112861733B (en)
Inventor
颜红梅
蒋莲芳
田晗
高港耀
吴江
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN202110183195.8A priority Critical patent/CN112861733B/en
Publication of CN112861733A publication Critical patent/CN112861733A/en
Application granted granted Critical
Publication of CN112861733B publication Critical patent/CN112861733B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/41Higher-level, semantic clustering, classification or understanding of video scenes, e.g. detection, labelling or Markovian modelling of sport events or news items
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/049Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T9/00Image coding
    • G06T9/002Image coding using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • G06V10/46Descriptors for shape, contour or point-related descriptors, e.g. scale invariant feature transform [SIFT] or bags of words [BoW]; Salient regional features
    • G06V10/462Salient features, e.g. scale invariant feature transforms [SIFT]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • G06V20/46Extracting features or characteristics from the video content, e.g. video fingerprints, representative shots or key frames
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/52Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V20/54Surveillance or monitoring of activities, e.g. for recognising suspicious objects of traffic, e.g. cars on the road, trains or boats

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Molecular Biology (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Image Analysis (AREA)
  • Compression Or Coding Systems Of Tv Signals (AREA)

Abstract

The invention discloses a night traffic video saliency detection method based on space-time double coding, applied to the technical field of computer vision and aimed at the shortcomings of the prior art in saliency detection for night traffic scenes. The network model of the invention comprises three parts: a space-time double encoder, an attention fusion module and a decoder. The temporal coding module adopts a convolutional LSTM to learn the temporal information across the continuous frame sequence of the night traffic video and highlights the motion features in the traffic video; the spatial coding module extracts spatial features under different receptive fields with pyramid dilated convolution (PDC); the Attention module fuses the temporal and spatial features while enhancing the features that contribute most to the driving task; finally, the decoder accurately predicts the salient regions that are important for the night driving task.

Description

Night traffic video significance detection method based on space-time double coding
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a video saliency detection technology in a night traffic scene.
Background
Over the past few decades, advanced driver assistance systems have played an increasingly important role in automobile driving, achieving good results in easing the driving task and ensuring safe driving. However, the traffic environment is a complex and constantly changing dynamic scene flooded with a large amount of information. A road traffic environment contains not only driving-related information, such as traffic lights, signs and pedestrians, but also driving-unrelated distractions, such as billboards and neon lights. The brain's capacity to process information is limited, so a driver must remain highly focused; distracted driving greatly increases the chance of an accident. Under dim light, mixed light sources and low visibility, a driver in a night traffic scene is prone to visual fatigue and distraction during prolonged driving and may miss or overlook important targets. The traffic accident and fatality rates caused by night driving are high, so real-time reminders of important information during driving are particularly important. Visual saliency detection for traffic scenes computes the regions and objects that a driver should attend to while driving, which are important for the driving task. Learning the attention distribution of experienced, alert drivers during night driving can support saliency detection in night traffic scenes and thereby improve driving safety.
Night driving scenes are more complex than daytime ones. A night scene is complicated by: 1. insufficient illumination and low contrast; 2. cluttered lights, with visual interference increased by uneven brightness; 3. heavy noise interference and severely blurred details; 4. color distortion, and so on. These factors greatly increase the difficulty of processing nighttime images. Saliency detection in night traffic scenes is therefore one of the challenges to be addressed.
To address these problems, the invention designs a double-coding neural network model to predict the salient regions of the driver's visual search in night traffic video scenes, with the aim of reminding the driver in real time to attend to important information that is useful for driving. The salient regions predicted by the model match the driver's real attention distribution closely.
Disclosure of Invention
In order to solve the technical problem, the invention provides a night traffic video saliency detection method based on space-time double coding.
The technical scheme adopted by the invention is as follows: a night traffic video saliency detection method based on space-time dual coding comprises the following steps:
S1, acquiring a standard fixation point saliency map;
S2, establishing a network model, wherein the network model is used for performing saliency detection on the input standard fixation point saliency map;
the network model comprises a space-time coding structure, an Attention fusion module and a decoding module, wherein the space-time coding structure is used for extracting the spatial features and the temporal features of the input standard fixation point saliency map, the Attention fusion module is used for fusing the extracted spatial and temporal features, and the decoding module computes the saliency map from the fusion result;
and S3, training the network model, and detecting image saliency by adopting the trained network model.
The space-time coding structure comprises a spatial coding structure and a temporal coding structure; the spatial coding structure is used for extracting the spatial features of the input standard fixation point saliency map, and the temporal coding structure is used for extracting the temporal features of the input standard fixation point saliency map.
In the process of training the network model: the current frame is used for extracting the spatial features, and the current frame and the previous 5 frames form a continuous sequence for extracting the temporal features.
The spatial coding structure comprises: 4 groups of convolution blocks and a pyramid dilated convolution block;
each group of convolution blocks specifically comprises 2 convolution operation layers, wherein each convolution operation layer comprises a 3 × 3 convolution, a batch normalization unit and a rectified linear unit; each group of convolution blocks comprises a 2 × 2 max pooling layer with a stride of 2;
the pyramid dilated convolution block acquires spatial features through a parallel architecture of dilated convolutions with different dilation rates.
The temporal coding structure comprises: 4 groups of convolution blocks and a convolutional long short-term memory network;
each group of convolution blocks specifically comprises 2 convolution operation layers, wherein each convolution operation layer comprises a 3 × 3 convolution, a batch normalization unit and a rectified linear unit; each group of convolution blocks comprises a 2 × 2 max pooling layer with a stride of 2.
The temporal coding structure extracts the features of a continuous sequence through the 4 groups of convolution blocks, and the extracted features are then input into the convolutional long short-term memory network to learn information across preceding and following frames.
The structure of the decoding module sequentially comprises: 3 upsampling layers, 3 groups of convolution blocks, one 1 × 1 convolution layer and a Sigmoid layer, wherein a ×2 upsampling layer is arranged in front of each group of convolution blocks;
each group of convolution blocks specifically comprises 2 convolution operation layers, wherein each convolution operation layer comprises a 3 × 3 convolution, a batch normalization unit and a rectified linear unit; each group of convolution blocks comprises a 2 × 2 max pooling layer with a stride of 2.
The beneficial effects of the invention are as follows: the invention is the first to propose top-down saliency detection for night traffic scenes; the model extracts both temporal and spatial information and selectively strengthens the space-time information through an integrated attention mechanism, so that the saliency detection map is better and the predicted region is more accurate.
Drawings
FIG. 1 is a flow chart of an eye movement experiment provided by the present invention;
FIG. 2 is a schematic diagram of a network architecture employed in the present invention;
fig. 3 is a diagram illustrating an example of a night traffic video image saliency prediction according to an embodiment of the present invention;
fig. 3(a) shows an input original image, fig. 3(b) shows a standard eye movement saliency map, and fig. 3(c) shows a model prediction map according to the present invention.
Detailed Description
In order to facilitate the understanding of the technical contents of the present invention by those skilled in the art, the present invention will be further explained with reference to the accompanying drawings.
The method comprises two main steps: calculation of the standard eye movement saliency map and training of the network model:
A. calculation of standard eye movement saliency map:
step A1: eye movement data (including fixation point information of each frame) of 30 drivers with driving ages of two years and more are recorded by an eye tracker, and the experimental process is shown in fig. 1. And eliminating abnormal data and integrating all tested eye movement data to each frame.
Step A2: a blank matrix of the same size as the input image is generated, and the position corresponding to each fixation point of the frame is assigned the value 1, yielding a binary image, namely the standard gaze point binary map. Next, 2-dimensional Gaussian smoothing (δ = 30) is applied to the binary image to obtain the standard fixation point saliency map, which is used as the label for network training (one standard fixation point saliency map for each input picture).
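For illustration, a minimal Python sketch of this label-generation step, assuming the fixation points of a frame are given as (row, column) pixel coordinates and using SciPy's Gaussian filter; the final normalization is an assumption added for convenient use as a training label.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def fixation_saliency_map(fixations, height, width, sigma=30.0):
    """Build the standard fixation point saliency map for one frame.

    fixations: iterable of (row, col) gaze positions pooled over all subjects.
    Returns a float map normalized to [0, 1].
    """
    binary = np.zeros((height, width), dtype=np.float32)
    for r, c in fixations:
        if 0 <= r < height and 0 <= c < width:       # discard out-of-frame points
            binary[int(r), int(c)] = 1.0              # standard gaze point binary map
    smoothed = gaussian_filter(binary, sigma=sigma)   # 2-D Gaussian smoothing
    if smoothed.max() > 0:
        smoothed /= smoothed.max()                    # normalize for use as a label
    return smoothed
```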
B. Training a network model:
B1. The model designed by the invention mainly comprises three parts: a space-time coding structure (divided into a spatial coding structure and a temporal coding structure), an Attention fusion module and a decoding module.
The spatial coding structure is used for extracting the spatial features of an image. Specifically:
the spatial features are very important features of the traffic scene, and the convolution operation can effectively extract the spatial characteristics of the image. The spatial coding structure consists of 4 sets of rolling blocks and a pyramid void rolling block (PDC). Each set of volume blocks consists of two convolution operation layers, each convolution operation layer comprising a 3 × 3 convolution, a batch normalization unit (BN), and a correction linear unit Relu. There is a2 x 2 maximum pooling layer of step 2 between the volume blocks.
The PDC module is aimed at detecting regions of different sizes and acquires spatial features with a parallel architecture of dilated convolutions with different dilation rates. In this embodiment, four dilated convolutions with dilation rates of 1, 2, 4 and 8 are used to obtain local information, and global features are obtained with global average pooling (GAP). A final convolution operation layer follows.
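A sketch of the PDC idea described here: parallel dilated convolutions with rates 1, 2, 4 and 8 for local information, a global-average-pooling branch for global features, and a final convolution layer fusing the branches. The 3 × 3 kernel, the branch channel count and the bilinear upsampling of the global branch are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidDilatedConv(nn.Module):
    """Parallel dilated convolutions (rates 1, 2, 4, 8) plus a global average pooling branch."""
    def __init__(self, in_ch, branch_ch=128, rates=(1, 2, 4, 8)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=r, dilation=r),
                nn.BatchNorm2d(branch_ch),
                nn.ReLU(inplace=True),
            )
            for r in rates
        ])
        # Global branch: GAP -> 1x1 conv, later upsampled back to the feature size.
        self.global_branch = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(in_ch, branch_ch, kernel_size=1),
            nn.ReLU(inplace=True),
        )
        # Final convolution operation layer fusing all branches.
        self.project = nn.Sequential(
            nn.Conv2d(branch_ch * (len(rates) + 1), branch_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(branch_ch),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        h, w = x.shape[-2:]
        local_feats = [b(x) for b in self.branches]               # local information
        g = F.interpolate(self.global_branch(x), size=(h, w),
                          mode='bilinear', align_corners=False)   # global features
        return self.project(torch.cat(local_feats + [g], dim=1))
```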
The temporal coding structure is used for extracting the temporal features of the image. Specifically:
the temporal coding structure consists of 4 sets of volume blocks and a convLSTM (Convolitional Long Short-Term Memory Network Convolutional Long Short-Term Memory Network). The volume block and the spatial coding structure are the same. There is a2 x 2 maximum pooling layer of step 2 between the volume blocks. Compared with FC-LSTM, convLSTM effectively preserves the spatial structure of the image when learning temporal features, and is therefore better suited to processing video sequences.
Unlike the spatial coding, the input to the temporal coding structure is a video sequence of T frames (1 < T < 10). The features Z_t, ..., Z_{t-T} of the T frames of the video sequence are extracted by the 4 groups of convolution blocks, and Z_t, ..., Z_{t-T} are then input to the convolutional LSTM to learn information across preceding and following frames. To obtain the maximum dynamic information of the T consecutive frames, the feature H_{t-T} of the last time step is finally retained.
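To make the role of the convolutional LSTM concrete, a sketch of a standard ConvLSTM cell and of running it over the per-frame features Z_t, ..., Z_{t-T}, keeping the last hidden state; the gate layout is the usual ConvLSTM formulation, and the kernel size and zero-initialized states are assumptions.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Standard ConvLSTM cell: the gates are computed by convolutions, so the spatial
    structure of the feature maps is preserved while learning temporal dynamics."""
    def __init__(self, in_ch, hidden_ch, kernel_size=3):
        super().__init__()
        self.hidden_ch = hidden_ch
        # One convolution produces all four gates at once.
        self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch,
                               kernel_size, padding=kernel_size // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o)
        c = f * c + i * torch.tanh(g)   # updated cell state (a feature map)
        h = o * torch.tanh(c)           # new hidden state
        return h, c

def temporal_encode(cell, clip_features):
    """clip_features: (B, T, C, H, W) per-frame outputs of the convolution blocks,
    i.e. Z_t, ..., Z_{t-T}. Returns the hidden state after the last time step."""
    b, t, _, hgt, wid = clip_features.shape
    h = clip_features.new_zeros(b, cell.hidden_ch, hgt, wid)
    c = clip_features.new_zeros(b, cell.hidden_ch, hgt, wid)
    for step in range(t):
        h, c = cell(clip_features[:, step], (h, c))
    return h
```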
The Attention fusion module is used for fusing the space-time features. Specifically:
the fusion module fuses the time and space characteristics on the channel by applying an Attention mechanism. The method mainly calculates the weight of the channel through the correlation on the characteristic diagram channel, and then weights the weight to the image characteristic to update the characteristic, so that the channel which is more important to the detection result is more prominent. Splicing the output of the time code and the output of the space code to obtain a characteristic F, and deforming the characteristic F through a shape function to obtain F1Then transpose to obtain F2。F1And F2Multiplying the matrix and obtaining F through softmax3。F3And weighting to F as the channel weight to obtain the final fusion result.
The decoding module is configured to calculate a final saliency map, specifically:
the decoding structure consists of 3 sets of convolutional blocks, 3 upsampled layers, one layer of 1 x 1 convolutional layers, and one Sigmoid layer. Wherein the convolutional blocks are identical to those of the spatial coding structure. Each set of convolutional blocks is preceded by a x 2 upsampling. The last layer of Sigmoid function controls the output value to be in the range of [0,1 ]. The predicted driver saliency map is a grayscale map of the same size as the input image.
B2. The data set is divided into a training set, a validation set and a test set at a ratio of approximately 8 : 2 : 3. To shorten the training time, the input pictures are resized to 320 × 192 × 3 (height H × width W × number of channels C).
B3. First, the parameters of the network model are randomly initialized (see fig. 2 for the network model; the network features are all denoted as H × W × C). A training-set picture F_t (320 × 192) is input to the spatial encoder, while the current frame and its previous 5 frames, i.e., F_t, ..., F_{t-5}, form a continuous time sequence that is input to the temporal encoder. The BCE (binary cross-entropy) function is used to compute the loss between the predicted saliency map and the corresponding label (the standard fixation point saliency map). An Adam optimizer with a learning rate of 10^-3, a momentum value of 0.9 and a decay rate of 10^-4 is used to update the parameters, and the model parameters are saved after each epoch.
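A hedged sketch of this training step in PyTorch, interpreting the stated momentum of 0.9 as Adam's beta1 and the decay rate of 10^-4 as weight decay; the data-loader format and the model's two-input signature are assumptions for illustration.

```python
import torch
import torch.nn as nn

def train(model, loader, epochs, device='cuda'):
    """Training sketch: each batch yields the current frame, its clip of the current
    frame plus the previous 5 frames, and the standard fixation point saliency map."""
    model.to(device)
    criterion = nn.BCELoss()  # predictions are already in [0, 1] via the final Sigmoid
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3,
                                 betas=(0.9, 0.999), weight_decay=1e-4)
    for epoch in range(epochs):
        model.train()
        for frame, clip, label in loader:      # frame: (B,3,192,320), clip: (B,6,3,192,320)
            frame, clip, label = frame.to(device), clip.to(device), label.to(device)
            pred = model(frame, clip)          # assumed signature: spatial + temporal inputs
            loss = criterion(pred, label)      # BCE against the standard saliency map
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
        torch.save(model.state_dict(), f'model_epoch_{epoch:03d}.pth')  # save every epoch
```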
B4. After each epoch of training, the model is verified on the validation set. The training step is repeated iteratively until the fluctuation of the loss value is essentially stable, i.e., the parameters in the network are essentially stable, yielding the optimal model parameters.
The network model of the present invention is verified with specific data as follows:
step 1: and B4, importing the optimal model parameters in the step B4 into the model, and randomly inputting test set data to obtain a prediction result.
Step 2: to verify the performance of the model, the results are analyzed qualitatively and quantitatively. In the qualitative analysis, for a more intuitive comparison and evaluation, the predicted grayscale map is colorized and then superimposed on the original image for comparison with the standard eye movement saliency map. The qualitative effect is shown in fig. 3: fig. 3(a) shows the input original image, and the distribution of the model prediction map shown in fig. 3(c) is similar to that of the standard eye movement saliency map shown in fig. 3(b), indicating good prediction performance. In the quantitative analysis, the main evaluation indexes include: AUC_Borji, AUC_Judd, NSS (normalized scanpath saliency), CC (linear correlation coefficient), KLD (Kullback-Leibler divergence), EMD (earth mover's distance) and SIM (similarity). The quantitative results are shown in Table 1. These indexes evaluate the model effect, i.e., how accurately the salient region is predicted: lower KLD and EMD values indicate a better model, while higher AUC_Borji, AUC_Judd, NSS, CC and SIM values indicate a better model.
Table 1: evaluation index results of the method of the invention on night traffic video images
(Table 1 is provided as an image in the original publication; its numerical values are not reproduced here.)
Those skilled in the art should note that, in Table 1, ↑ indicates that a higher value is better and ↓ indicates that a lower value is better.
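For reference, minimal NumPy sketches of three of the listed indexes (CC, KLD and NSS), following their common definitions in the saliency literature; these are illustrative and not necessarily the exact implementations used to produce Table 1.

```python
import numpy as np

def cc(pred, gt):
    """Linear correlation coefficient between predicted and ground-truth saliency maps."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    g = (gt - gt.mean()) / (gt.std() + 1e-8)
    return float((p * g).mean())

def kld(pred, gt, eps=1e-8):
    """Kullback-Leibler divergence of the ground-truth distribution from the prediction."""
    p = pred / (pred.sum() + eps)
    g = gt / (gt.sum() + eps)
    return float((g * np.log(eps + g / (p + eps))).sum())

def nss(pred, fixation_binary):
    """Normalized scanpath saliency: mean of the standardized prediction at fixated pixels."""
    p = (pred - pred.mean()) / (pred.std() + 1e-8)
    mask = fixation_binary > 0
    return float(p[mask].mean()) if mask.any() else 0.0
```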
It will be appreciated by those of ordinary skill in the art that the embodiments described herein are intended to assist the reader in understanding the principles of the invention and are to be construed as being without limitation to such specifically recited embodiments and examples. Various modifications and alterations to this invention will become apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the scope of the claims of the present invention.

Claims (6)

1. A night traffic video saliency detection method based on space-time dual coding is characterized by comprising the following steps:
S1, acquiring a standard fixation point saliency map;
S2, establishing a network model, wherein the network model is used for performing saliency detection on the input standard fixation point saliency map;
the network model comprises a space-time coding structure, an Attention fusion module and a decoding module, wherein the space-time coding structure is used for extracting the spatial features and the temporal features of the input standard fixation point saliency map, the Attention fusion module is used for fusing the extracted spatial and temporal features, and the decoding module computes the saliency map from the fusion result;
and S3, training the network model, and detecting image saliency by adopting the trained network model.
The space-time coding structure comprises a spatial coding structure and a temporal coding structure; the spatial coding structure is used for extracting the spatial features of the input standard fixation point saliency map, and the temporal coding structure is used for extracting the temporal features of the input standard fixation point saliency map.
2. The method for detecting the saliency of night traffic video based on space-time dual coding according to claim 1, wherein in the training process of the network model: the current frame is used for extracting the spatial features, and the current frame and the previous 5 frames form a continuous sequence for extracting the temporal features.
3. The method according to claim 2, wherein the spatial coding structure comprises: 4 groups of convolution blocks and a pyramid dilated convolution block;
each group of convolution blocks specifically comprises 2 convolution operation layers, wherein each convolution operation layer comprises a 3 × 3 convolution, a batch normalization unit and a rectified linear unit; each group of convolution blocks comprises a 2 × 2 max pooling layer with a stride of 2;
the pyramid dilated convolution block acquires spatial features through a parallel architecture of dilated convolutions with different dilation rates.
4. The method according to claim 3, wherein the temporal coding structure comprises: 4 groups of convolution blocks and a convolutional long short-term memory network;
each group of convolution blocks specifically comprises 2 convolution operation layers, wherein each convolution operation layer comprises a 3 × 3 convolution, a batch normalization unit and a rectified linear unit; each group of convolution blocks comprises a 2 × 2 max pooling layer with a stride of 2.
5. The method as claimed in claim 4, wherein the temporal coding structure extracts the features of a continuous sequence through the 4 groups of convolution blocks, and the extracted features are then input into the convolutional long short-term memory network to learn information across preceding and following frames.
6. The method as claimed in claim 5, wherein the decoding module sequentially comprises: 3 upsampling layers, 3 groups of convolution blocks, one 1 × 1 convolution layer and a Sigmoid layer, wherein a ×2 upsampling layer is arranged in front of each group of convolution blocks;
each group of convolution blocks specifically comprises 2 convolution operation layers, wherein each convolution operation layer comprises a 3 × 3 convolution, a batch normalization unit and a rectified linear unit; each group of convolution blocks comprises a 2 × 2 max pooling layer with a stride of 2.
CN202110183195.8A 2021-02-08 2021-02-08 Night traffic video significance detection method based on space-time double coding Active CN112861733B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110183195.8A CN112861733B (en) 2021-02-08 2021-02-08 Night traffic video significance detection method based on space-time double coding

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110183195.8A CN112861733B (en) 2021-02-08 2021-02-08 Night traffic video significance detection method based on space-time double coding

Publications (2)

Publication Number Publication Date
CN112861733A true CN112861733A (en) 2021-05-28
CN112861733B CN112861733B (en) 2022-09-02

Family

ID=75988373

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110183195.8A Active CN112861733B (en) 2021-02-08 2021-02-08 Night traffic video significance detection method based on space-time double coding

Country Status (1)

Country Link
CN (1) CN112861733B (en)

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR101305735B1 (en) * 2012-06-15 2013-09-06 성균관대학교산학협력단 Method and apparatus for providing of tactile effect
US20160210528A1 (en) * 2014-02-24 2016-07-21 Beijing University Of Technology Method for detecting visual saliencies of video image based on spatial and temporal features
US20180285683A1 (en) * 2017-03-30 2018-10-04 Beihang University Methods and apparatus for image salient object detection
CN109376611A (en) * 2018-09-27 2019-02-22 方玉明 A kind of saliency detection method based on 3D convolutional neural networks
CN110705566A (en) * 2019-09-11 2020-01-17 浙江科技学院 Multi-mode fusion significance detection method based on spatial pyramid pool
CN110909594A (en) * 2019-10-12 2020-03-24 杭州电子科技大学 Video significance detection method based on depth fusion
CN112308005A (en) * 2019-11-15 2021-02-02 电子科技大学 Traffic video significance prediction method based on GAN
CN111461043A (en) * 2020-04-07 2020-07-28 河北工业大学 Video significance detection method based on deep network
CN111563418A (en) * 2020-04-14 2020-08-21 浙江科技学院 Asymmetric multi-mode fusion significance detection method based on attention mechanism
CN112040222A (en) * 2020-08-07 2020-12-04 深圳大学 Visual saliency prediction method and equipment
CN112016476A (en) * 2020-08-31 2020-12-01 山东大学 Method and system for predicting visual saliency of complex traffic guided by target detection

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
KYUNG HWA CHAE et al.: "Visual tracking of objects for unmanned surface vehicle navigation", 2016 16th International Conference on Control, Automation and Systems (ICCAS) *
YANG TIAN: "Research on eye movement distribution patterns under global scene perception and a gaze region prediction model", China Master's Theses Full-text Database *

Also Published As

Publication number Publication date
CN112861733B (en) 2022-09-02

Similar Documents

Publication Publication Date Title
CN113688723B (en) Infrared image pedestrian target detection method based on improved YOLOv5
CN109753913B (en) Multi-mode video semantic segmentation method with high calculation efficiency
CN111639524B (en) Automatic driving image semantic segmentation optimization method
CN110363770B (en) Training method and device for edge-guided infrared semantic segmentation model
CN113642390B (en) Street view image semantic segmentation method based on local attention network
Zhang et al. Lightweight and efficient asymmetric network design for real-time semantic segmentation
CN113762209A (en) Multi-scale parallel feature fusion road sign detection method based on YOLO
Fang et al. Traffic accident detection via self-supervised consistency learning in driving scenarios
CN111191608A (en) Improved traffic sign detection and identification method based on YOLOv3
CN112308005A (en) Traffic video significance prediction method based on GAN
CN116343144B (en) Real-time target detection method integrating visual perception and self-adaptive defogging
Cheng et al. A highway traffic image enhancement algorithm based on improved GAN in complex weather conditions
CN112861733B (en) Night traffic video significance detection method based on space-time double coding
Cho et al. Modified perceptual cycle generative adversarial network-based image enhancement for improving accuracy of low light image segmentation
CN116740362A (en) Attention-based lightweight asymmetric scene semantic segmentation method and system
CN116993987A (en) Image semantic segmentation method and system based on lightweight neural network model
CN116704194A (en) Street view image segmentation algorithm based on BiSeNet network and attention mechanism
CN111612803A (en) Vehicle image semantic segmentation method based on image definition
CN113343903B (en) License plate recognition method and system in natural scene
CN114283288B (en) Method, system, equipment and storage medium for enhancing night vehicle image
CN113673527B (en) License plate recognition method and system
Yuan et al. RM-IQA: A new no-reference image quality assessment framework based on range mapping method
Liu et al. Deep memory and prediction neural network for video prediction
CN112487986A (en) Driving assistance recognition method based on high-precision map
CN112101382A (en) Space-time combined model and video significance prediction method based on space-time combined model

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant