CN112233079A - Method and system for fusing images of multiple sensors

Info

Publication number: CN112233079A (application CN202011084849.3A); granted publication CN112233079B
Authority: CN (China)
Prior art keywords: loss, fusion, image, network, mask
Legal status: Granted; currently Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis)
Application number: CN202011084849.3A
Other languages: Chinese (zh)
Other versions: CN112233079B (en)
Inventor: Keke Geng (耿可可)
Current Assignee: Southeast University (the listed assignees may be inaccurate; Google has not performed a legal analysis)
Original Assignee: Southeast University
Application filed by Southeast University
Priority to CN202011084849.3A
Publication of CN112233079A; application granted; publication of CN112233079B

Classifications

    • G06T 7/0002: Image analysis; inspection of images, e.g. flaw detection
    • G06F 18/251: Pattern recognition; fusion techniques of input or preprocessed data
    • G06N 3/045: Neural networks; architectures; combinations of networks
    • G06V 20/56: Scenes; context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06T 2207/10032: Image acquisition modality; satellite or aerial image; remote sensing
    • G06T 2207/10044: Image acquisition modality; radar image
    • G06T 2207/30168: Subject of image; image quality inspection


Abstract

The invention discloses a method and a system for fusing multi-sensor images. It relates to the technical field of image processing and solves the technical problem that existing image processing techniques provide low robustness and effectiveness in environment perception.

Description

Method and system for fusing images of multiple sensors
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to a method and a system for multi-sensor image fusion.
Background
Environment perception is one of the key technologies for realizing autonomous driving of unmanned vehicles. In multi-sensor information fusion perception, the vehicle perceives its environment by fusing the information of multiple sensors such as cameras and laser radar, which safeguards driving safety and intelligence. Such fusion perception acts as the eyes of the unmanned vehicle and is a necessary condition for realizing unmanned driving.
At present, environment perception for unmanned vehicles mostly relies on perception methods based on deep learning networks, with RGB images and distance information used as network inputs for feature extraction. One of the most common deep learning networks is the convolutional neural network, which has representation learning capability and can perform translation-invariant classification of input information according to its hierarchical structure. One of the most representative convolutional neural networks is ResNet (Residual Neural Network); because the input can be connected directly to the output, the whole network only needs to learn a residual, which simplifies the learning objective and reduces its difficulty.
The key to multi-sensor fusion and environment perception lies in the quality of the image data. Cameras are widely used in environment perception systems because of their lower cost and rich image features, including color, texture, brightness and orientation; however, illumination changes, motion blur and strong noise greatly degrade image quality, which is highly detrimental to the effectiveness and robustness of image-based traffic object classification algorithms. Therefore, how to make reasonable use of different sensor data so as to improve the robustness and effectiveness of environment perception is a problem that urgently needs to be solved.
Disclosure of Invention
The present disclosure provides a method and a system for multi-sensor image fusion, which aim to make reasonable use of different sensor data and thereby improve the robustness and effectiveness of environment perception.
The technical purpose of the present disclosure is achieved by the following technical solutions:
a method of multi-sensor image fusion, comprising:
evaluating the RGB image by using a visible light image quality evaluation network IQAN to obtain an IQA score of the RGB image;
obtaining a weight fusion function of the RGB image by using a weighted evaluation network IQA, so as to obtain weight coefficients of the laser radar image and the RGB image;
respectively extracting the features of the RGB image and the laser radar image by using a ResNet101+ FPN network to respectively obtain RGB image features and laser radar image features;
performing feature fusion on the RGB image features and the laser radar image features through the weight fusion function to obtain first fusion features;
performing feature fusion on the first fusion feature through an FPN network to obtain a second fusion feature;
predicting results from the second fusion feature by using a prediction network to obtain a prediction result;
and training a network model by using the joint loss function and the prediction result to obtain a deep learning network (these steps are sketched in code after this list).
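For orientation, the following minimal Python sketch shows how these steps are assumed to chain together in one training pass. Every identifier (iqan, weight_fusion, backbone_rgb, backbone_lidar, fuse, fpn, heads, joint_loss) is a hypothetical placeholder for the corresponding component described above, not a name taken from the patent.

```python
def forward_and_loss(rgb_image, lidar_image, targets, modules):
    """One training step of the fusion pipeline sketched above.

    `modules` is assumed to hold the sub-networks described in the text; all
    attribute names are hypothetical stand-ins, not identifiers from the patent.
    """
    iq_score = modules.iqan(rgb_image)                    # IQA score of the RGB image
    w_rgb, w_lidar = modules.weight_fusion(iq_score)      # weight coefficients, w_lidar = 1 - w_rgb
    f_rgb = modules.backbone_rgb(rgb_image)               # ResNet101+FPN features of the RGB image
    f_lidar = modules.backbone_lidar(lidar_image)         # ResNet101+FPN features of the laser radar image
    fused = modules.fuse(f_rgb, f_lidar, w_rgb, w_lidar)  # first fusion feature (weighted fusion)
    pyramid = modules.fpn(fused)                          # second fusion feature across scales
    predictions = modules.heads(pyramid)                  # classification, bounding box and mask predictions
    return modules.joint_loss(predictions, targets)       # Loss_total used to train the network
```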
Further, the visible light image quality evaluation network IQAN includes two convolutional layers, an activation layer, a pooling layer, and two fully-connected layers.
Further, the expression of the weight fusion function includes:
[Formula shown as image BDA0002719995530000021 in the original publication]
where w_RGB represents the weight coefficient of the RGB image, δ represents the relative error, ε represents the effect parameter, IQ_RGB represents the IQA score of each RGB image, IQ_T represents the threshold of the IQA score, and the weight coefficient of the laser radar image is w_LIDAR = 1 - w_RGB.
Further, the performing feature extraction on the RGB image and the lidar image by using a ResNet101+ FPN network includes:
sampling the laser radar point cloud within the camera field of view, and performing projection conversion on the samples, where the projection function is as follows:
[Formula shown as image BDA0002719995530000022 in the original publication]
where α and β represent the azimuth and zenith angles at which the lidar point cloud is observed, Δα and Δβ represent the average horizontal and vertical angular resolutions between successive beam emitters, (r, c) represents the two-dimensional map position index of the lidar point cloud on the projected image, and (x, y, z) represents the coordinates of the lidar point cloud in a Cartesian coordinate system; the element at (r, c) is filled with the two-channel data (d, z), where
[Formula shown as image BDA0002719995530000023 in the original publication]
and respectively inputting the converted laser radar point cloud data (r, c) and the RGB image into a network formed by ResNet101+ FPN for feature extraction.
Further, the joint loss function includes:
Loss_total = λ_cls·Loss_cls + λ_bbox·Loss_bbox + λ_mask·Loss_mask
where Loss_total represents the total loss, Loss_cls represents the classification loss, Loss_bbox represents the bounding box regression loss, Loss_mask represents the mask prediction loss, and λ_cls, λ_bbox, λ_mask are the corresponding weight coefficients;
using P_i and P_i* to represent the ground-truth classification and the predicted classification respectively, the classification loss is expressed as:
[Formula shown as image BDA0002719995530000024 in the original publication]
where N_cls represents the number of proposed regions and L_cls is a multi-class cross-entropy function;
using B_i and B_i* to represent the ground-truth bounding box and the predicted bounding box respectively, the bounding box regression loss is expressed as:
[Formula shown as image BDA0002719995530000031 in the original publication]
where N_reg represents the size of the feature map and L_bbox represents the smooth-L1 loss function;
using M_i and M_i* to represent the ground-truth segmentation mask and the predicted segmentation mask respectively, the mask prediction loss is expressed as:
[Formula shown as image BDA0002719995530000032 in the original publication]
where N_mask represents the pixel-level positions of the feature map and L_seg represents a cross-entropy function.
A system for multi-sensor image fusion, comprising:
the evaluation module is used for evaluating the RGB image by using a visible light image quality evaluation network IQAN to obtain an IQA score of the RGB image;
the weight acquisition module is used for obtaining a weight fusion function of the RGB image by using a weighted evaluation network IQA, so as to obtain weight coefficients of the laser radar image and the RGB image;
the feature extraction module is used for respectively extracting features of the RGB image and the laser radar image by using a ResNet101+ FPN network to respectively obtain RGB image features and laser radar image features;
the first fusion module is used for performing feature fusion on the RGB image features and the laser radar image features through the weight fusion function to obtain first fusion features;
the second fusion module is used for carrying out feature fusion on the first fusion feature through the FPN network to obtain a second fusion feature;
the prediction module is used for predicting results from the second fusion feature by using a prediction network to obtain a prediction result;
and the training module is used for training a network model by using the joint loss function and the prediction result to obtain a deep learning network.
The beneficial effects of the present disclosure are as follows: the method and system for multi-sensor image fusion evaluate RGB images with the visible light image quality evaluation network IQAN, fuse the features of the RGB image and the laser radar image through the feature weight expression, and train the network model with the joint loss function to obtain a deep learning network based on image quality evaluation.
Drawings
FIG. 1 is a flow chart of the disclosed method;
FIG. 2 is a schematic view of the disclosed system;
FIG. 3 is a flow chart of data fusion;
FIG. 4 is a graph comparing the results of the examples.
Detailed Description
The technical scheme of the disclosure will be described in detail with reference to the accompanying drawings. In the description of the present disclosure, it is to be understood that the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implying any number of technical features indicated, but merely as distinguishing between different components.
FIG. 1 is a flow chart of the method of the present disclosure. As shown in FIG. 1, step 100: first, the RGB image is evaluated using the visible light image quality evaluation network IQAN (Image Quality Assessment Network).
The overall structure of the IQAN includes two convolutional layers, an activation layer, a pooling layer, and two fully-connected layers. An input RGB image is first resized to 64×64 pixels; the first convolutional layer generates 50 feature maps of size 60×60, the second convolutional layer then generates 50 feature maps of size 56×56, and after the activation layer and the pooling layer the data reaches the two fully-connected layers, whose linear regression gives a one-dimensional output: the IQA (Image Quality Assessment) score.
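A minimal PyTorch sketch of a network with this layer layout is given below. The 5×5 kernels follow from the stated map sizes (64 to 60 to 56); the ReLU activation, 2×2 max pooling and the 512-unit hidden width are assumptions, since the text does not specify them.

```python
import torch
import torch.nn as nn

class IQAN(nn.Module):
    """Sketch of the visible-light image quality assessment network described above."""
    def __init__(self):
        super().__init__()
        self.conv1 = nn.Conv2d(3, 50, kernel_size=5)    # 64x64 -> 60x60, 50 feature maps
        self.conv2 = nn.Conv2d(50, 50, kernel_size=5)   # 60x60 -> 56x56, 50 feature maps
        self.act = nn.ReLU()                            # activation layer (assumed ReLU)
        self.pool = nn.MaxPool2d(2)                     # pooling layer (assumed 2x2 max pooling)
        self.fc1 = nn.Linear(50 * 28 * 28, 512)         # first fully-connected layer (width assumed)
        self.fc2 = nn.Linear(512, 1)                    # linear regression to a scalar IQA score

    def forward(self, x):                               # x: (N, 3, 64, 64) batch of resized RGB images
        x = self.conv1(x)
        x = self.conv2(x)
        x = self.pool(self.act(x))                      # 56x56 -> 28x28
        x = torch.flatten(x, 1)
        x = self.act(self.fc1(x))
        return self.fc2(x)                              # (N, 1) IQA scores
```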
101: and obtaining a weight fusion function of the RGB image by using a weighted evaluation network IQA so as to obtain the weight coefficients of the laser radar image and the RGB image. The expression of the weight fusion function includes:
Figure BDA0002719995530000041
wherein, wRGBRepresenting the weight coefficients of an RGB image, delta representing the relative error, epsilon representing the effect parameter, IQRGBRepresenting IQA score, IQ, of each RGB imageTThe threshold value of the IQA score is represented, and the weight coefficient of the laser radar image is wLIDAR=1-wRGB。wRGBAnd wLIDARThe dependency of the example segmentation of traffic objects (cars, etc.) on the RGB image and the lidar image, respectively, is described.
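Since the exact expression is reproduced only as an image, the following sketch shows one plausible form built from the quantities named in the text: the relative error δ between IQ_RGB and IQ_T, the effect parameter ε, and w_LIDAR = 1 - w_RGB. The logistic shape is an assumption, not the patent's formula.

```python
import math

def fusion_weights(iq_rgb, iq_threshold, epsilon=1.0):
    """Hedged sketch of a weight fusion function; the exact patent formula is not reproduced here."""
    delta = (iq_rgb - iq_threshold) / iq_threshold     # relative error of the IQA score
    w_rgb = 1.0 / (1.0 + math.exp(-epsilon * delta))   # assumed saturating (logistic) mapping
    return w_rgb, 1.0 - w_rgb                          # w_lidar = 1 - w_rgb, as stated in the text

# Example: a score below the threshold shifts weight toward the laser radar branch.
print(fusion_weights(iq_rgb=0.4, iq_threshold=0.6, epsilon=5.0))
```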
102: feature extraction is performed on the RGB image and the lidar image respectively by using a ResNet101+ FPN network. ResNet101 is one of ResNet, FPN (feature pyramid networks) is a feature pyramid network, and the specific process of feature extraction includes: firstly, sampling a laser radar point cloud in a camera field, converting the sampling into a pixel-level dense spherical depth image, namely performing projection conversion on the sampling, wherein the projection function is as follows:
Figure BDA0002719995530000051
wherein α and β represent azimuth and zenith angles at which the lidar point cloud is observed, Δ α and Δ β represent average horizontal angular resolution and average vertical angular resolution between successive beam emitters, (r, c) represent two-dimensional map position indices of the lidar point cloud on the projected image, (x, y, z) represent coordinates of the lidar point cloud in a cartesian coordinate system, and the transformation is performed at (r, c) with a two-channel data (d, z) fill element,
Figure BDA0002719995530000052
and respectively inputting the converted laser radar point cloud data (r, c) and the RGB image into a network formed by ResNet101+ FPN for feature extraction, and respectively obtaining RGB image features and laser radar image features, namely obtaining feature maps of camera data and laser radar data.
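The projection formulas themselves appear only as images; the sketch below implements the common spherical (range-image) projection that matches the quantities described (azimuth α, zenith β, resolutions Δα and Δβ, map index (r, c), two-channel fill (d, z)). Sign conventions, offsets and the exact definition of β are assumptions.

```python
import numpy as np

def spherical_projection(points, d_alpha, d_beta, height, width):
    """Hedged sketch of the lidar-to-image projection; not the patent's exact formula."""
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    d = np.sqrt(x**2 + y**2 + z**2)                    # range of each point (assumed meaning of d)
    alpha = np.arctan2(y, x)                           # azimuth angle
    beta = np.arcsin(z / np.maximum(d, 1e-6))          # elevation/zenith-related angle (assumed)
    c = np.floor(alpha / d_alpha).astype(int) % width  # column index from azimuth
    r = np.clip(np.floor(beta / d_beta).astype(int), 0, height - 1)  # row index

    image = np.zeros((height, width, 2), dtype=np.float32)
    image[r, c, 0] = d                                 # channel 0: range d
    image[r, c, 1] = z                                 # channel 1: height z
    return image
```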
103: performing feature fusion on the RGB image features and the laser radar image features through a weight fusion function to obtain first fusion features, namely multiplying the feature graph of the RGB image and the feature graph of the laser radar image by respective weight coefficients wRGB、wLIDARThen serially connecting to perform feature fusion to obtainTo the first fused feature. Fig. 3 is a flow chart of data fusion, and the RGB image and the lidar image are respectively put into a ResNet101+ FPN network for feature extraction, and then feature fusion is performed on the respective extracted features.
104: and performing feature fusion on the first fusion feature through the FPN network to obtain a second fusion feature. A shallow layer with high-resolution features and a deep layer with rich semantic information are fused by using an FPN network, and a feature pyramid with strong semantic information on all scales is constructed.
105: and predicting the result of the second fusion characteristic by using a prediction network to obtain a prediction result. The output of the FPN network is subjected to bounding box regression, classification and mask prediction using a prediction network.
106: and training a network model by using the joint loss function and the prediction result to obtain a deep learning network. The proposed network model is trained using the following joint loss function: losstotal=λclsLossclsbboxLossbboxmaskLossmask
Therein, LosstotalRepresents total Loss, LossclsRepresents the Loss of classification, LossbboxLoss of expression, LossmaskDenotes the mask prediction loss, λcls、λbbox、λmaskRespectively, corresponding weight coefficients.
Using PiAnd Pi *Representing the ground truth classification and the prediction classification respectively, the classification loss is represented as:
Figure BDA0002719995530000053
wherein N isclsIndicating the number of suggested regions, LclsAs a multi-class cross entropy function.
Using BiAnd Bi *Respectively representing a ground real bounding box and a prediction bounding box, and then the bounding box regression loss is represented as:
Figure BDA0002719995530000054
wherein N isregSize of the element map, LbboxRepresenting the loss function smooth-L1.
Using MiAnd Mi *Respectively representing a ground real segmented mask and a predicted segmented mask, the mask prediction loss is represented as:
Figure BDA0002719995530000061
wherein N ismaskIndicating the position of the element map at pixel level, LsegRepresenting a cross entropy function.
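The individual loss terms are shown only as images in the original; the sketch below uses the standard forms consistent with the descriptions (multi-class cross entropy, smooth-L1, per-pixel cross entropy) and should be read as an assumed reference implementation rather than the patent's exact definition.

```python
import torch
import torch.nn.functional as F

def joint_loss(pred_cls, gt_cls, pred_bbox, gt_bbox, pred_mask, gt_mask,
               lambda_cls=1.0, lambda_bbox=1.0, lambda_mask=1.0):
    """Hedged sketch of Loss_total = lambda_cls*Loss_cls + lambda_bbox*Loss_bbox + lambda_mask*Loss_mask.

    The term forms and the normalisation (averaging over proposals, map elements and pixels,
    i.e. N_cls, N_reg, N_mask) are assumptions consistent with the text.
    """
    loss_cls = F.cross_entropy(pred_cls, gt_cls)                        # multi-class cross entropy over proposals
    loss_bbox = F.smooth_l1_loss(pred_bbox, gt_bbox)                    # smooth-L1 bounding box regression loss
    loss_mask = F.binary_cross_entropy_with_logits(pred_mask, gt_mask)  # per-pixel cross entropy for the mask
    return lambda_cls * loss_cls + lambda_bbox * loss_bbox + lambda_mask * loss_mask
```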
Table 1 shows the data used in practicing the method of the present disclosure, as follows:
Environment   Sunny   Rainy   Foggy   Night
Number        4369    2315    3907    4061
TABLE 1
As can be seen from Table 2, fusion perception of camera data and laser radar data weighted by the IQAN-based evaluation performs better than perception using only camera data without the IQAN network. Meanwhile, the dual-modal image perception deep neural network using ResNet-101+FPN as the backbone outperforms MASK-RCNN and Retina-RCNN, which are current state-of-the-art instance segmentation networks.
Method       Backbone network   Modality   FPS    AP     AP50   AP75
MASK-RCNN    ResNet-101-FPN     Single     13.5   35.7   58.0   37.8
Retina-RCN   ResNet-101-FPN     Single     11.2   34.7   55.4   36.9
YOLACT       ResNet-101-FPN     Single     30.0   29.8   48.5   31.2
IQAN         ResNet-18-FPN      Dual       37.3   28.7   46.8   30.0
IQAN         ResNet-50-FPN      Dual       35.5   31.2   50.6   32.8
IQAN         ResNet-101-FPN     Dual       27.0   39.1   59.7   39.8
TABLE 2
Here FPS (frames per second) denotes the number of frames processed per second, and AP denotes the average prediction accuracy (average precision). Single modality means that only RGB images are input to the backbone network, while Dual modality means that the input includes both RGB images and laser radar images. AP50 and AP75 denote the average prediction accuracy at intersection ratios of 50% and 75% respectively, and AP denotes the mean of the average prediction accuracies over the nine intersection ratios 50%, 55%, 60%, 65%, 70%, 75%, 80%, 85% and 90%. The intersection ratio (intersection over union) denotes the overlap ratio between the generated prediction region (candidate box) and the real target region (original labeled box).
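For clarity, the intersection ratio used for AP50/AP75 and the nine-threshold AP can be computed as in the short sketch below (axis-aligned boxes in (x1, y1, x2, y2) form; an illustrative helper, not code from the patent):

```python
def iou(box_a, box_b):
    """Intersection over union ("intersection ratio") between a predicted box and a labeled box."""
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# AP in Table 2 averages the per-threshold precision over these nine IoU thresholds.
thresholds = [0.50 + 0.05 * i for i in range(9)]   # 0.50, 0.55, ..., 0.90
```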
FIG. 4 shows detections performed on the test data set of Table 1 under different weather conditions, where (a) is a rainy scene, (b) is a foggy scene, (c) is a night scene with street lamps, and (d) is a night scene without street lamps. Combining the labeled results under these conditions with Table 2, it is clear that the environment perception result obtained after IQAN-based feature fusion of the RGB image and the laser radar image is the most accurate.
FIG. 2 is a schematic diagram of the system of the present disclosure. The multi-sensor image fusion system of the present disclosure includes an evaluation module, a weight acquisition module, a feature extraction module, a first fusion module, a second fusion module, a prediction module, and a training module; the specific functions of each module follow the method of the present disclosure and are not described again.
The foregoing is an exemplary embodiment of the present disclosure, and the scope of the present disclosure is defined by the claims and their equivalents.

Claims (10)

1. A method of multi-sensor image fusion, comprising:
evaluating the RGB image by using a visible light image quality evaluation network IQAN to obtain an IQA score of the RGB image;
obtaining a weight fusion function of the RGB image by using a weighted evaluation network IQA, so as to obtain weight coefficients of the laser radar image and the RGB image;
respectively extracting the features of the RGB image and the laser radar image by using a ResNet101+ FPN network to respectively obtain RGB image features and laser radar image features;
performing feature fusion on the RGB image features and the laser radar image features through the weight fusion function to obtain first fusion features;
performing feature fusion on the first fusion feature through an FPN network to obtain a second fusion feature;
predicting results from the second fusion feature by using a prediction network to obtain a prediction result;
and training a network model by using the joint loss function and the prediction result to obtain a deep learning network.
2. The method of multi-sensor image fusion of claim 1, wherein the visible light image quality assessment network IQAN includes two convolutional layers, an activation layer, a pooling layer, and two fully-connected layers.
3. The method of multi-sensor image fusion of claim 1, wherein the expression of the weight fusion function comprises:
[Formula shown as image FDA0002719995520000011 in the original publication]
wherein w_RGB represents the weight coefficient of the RGB image, δ represents the relative error, ε represents the effect parameter, IQ_RGB represents the IQA score of each RGB image, IQ_T represents the threshold of the IQA score, and the weight coefficient of the laser radar image is w_LIDAR = 1 - w_RGB.
4. The method of multi-sensor image fusion of claim 3, wherein said feature extracting the RGB image and the lidar image using a ResNet101+ FPN network comprises:
sampling a laser radar point cloud in a camera field, and performing projection conversion on the sampling, wherein the projection function is as follows:
[Formula shown as image FDA0002719995520000012 in the original publication]
wherein α and β represent azimuth and zenith angles at which the lidar point cloud is viewed, Δ α and Δ β represent average horizontal angular resolution and average vertical angular resolution between successive beam emitters, (r, c) represent a two-dimensional map position index of the lidar point cloud on the projected image, (x, y, z) represent coordinates of the lidar point cloud in a cartesian coordinate system, and the transformation is performed at (r, c) with a two-channel data (d, z) fill element,
[Formula shown as image FDA0002719995520000021 in the original publication]
and respectively inputting the converted laser radar point cloud data (r, c) and the RGB image into a network formed by ResNet101+ FPN for feature extraction.
5. The method of multi-sensor image fusion of claim 4, wherein the joint loss function comprises:
Loss_total = λ_cls·Loss_cls + λ_bbox·Loss_bbox + λ_mask·Loss_mask
wherein Loss_total represents the total loss, Loss_cls represents the classification loss, Loss_bbox represents the bounding box regression loss, Loss_mask represents the mask prediction loss, and λ_cls, λ_bbox, λ_mask are the corresponding weight coefficients;
using P_i and P_i* to represent the ground-truth classification and the predicted classification respectively, the classification loss is expressed as:
[Formula shown as image FDA0002719995520000022 in the original publication]
wherein N_cls represents the number of proposed regions and L_cls is a multi-class cross-entropy function;
using B_i and B_i* to represent the ground-truth bounding box and the predicted bounding box respectively, the bounding box regression loss is expressed as:
[Formula shown as image FDA0002719995520000023 in the original publication]
wherein N_reg represents the size of the feature map and L_bbox represents the smooth-L1 loss function;
using M_i and M_i* to represent the ground-truth segmentation mask and the predicted segmentation mask respectively, the mask prediction loss is expressed as:
[Formula shown as image FDA0002719995520000024 in the original publication]
wherein N_mask represents the pixel-level positions of the feature map and L_seg represents a cross-entropy function.
6. A multi-sensor image fusion system, comprising:
the evaluation module is used for evaluating the RGB image by using a visible light image quality evaluation network IQAN to obtain an IQA score of the RGB image;
the weight acquisition module is used for obtaining a weight fusion function of the RGB image by using a weighted evaluation network IQA, so as to obtain weight coefficients of the laser radar image and the RGB image;
the feature extraction module is used for respectively extracting features of the RGB image and the laser radar image by using a ResNet101+ FPN network to respectively obtain RGB image features and laser radar image features;
the first fusion module is used for performing feature fusion on the RGB image features and the laser radar image features through the weight fusion function to obtain first fusion features;
the second fusion module is used for carrying out feature fusion on the first fusion feature through the FPN network to obtain a second fusion feature;
the prediction module is used for predicting results from the second fusion feature by using a prediction network to obtain a prediction result;
and the training module is used for training a network model by using the joint loss function and the prediction result to obtain a deep learning network.
7. The multi-sensor image fusion system of claim 6, wherein the visible light image quality assessment network IQAN includes two convolutional layers, an activation layer, a pooling layer, and two fully-connected layers.
8. The multi-sensor image fusion system of claim 6, wherein the expression of feature weights comprises:
[Formula shown as image FDA0002719995520000031 in the original publication]
wherein w_RGB represents the weight coefficient of the RGB image, δ represents the relative error, ε represents the effect parameter, IQ_RGB represents the IQA score of each RGB image, IQ_T represents the threshold of the IQA score, and the weight coefficient of the laser radar image is w_LIDAR = 1 - w_RGB.
9. The multi-sensor image fusion system of claim 8, wherein the feature extraction module is configured to: sample the laser radar point cloud in the camera field of view and perform projection conversion on the samples, wherein the projection function is as follows:
[Formula shown as image FDA0002719995520000032 in the original publication]
wherein α and β represent azimuth and zenith angles at which the lidar point cloud is viewed, Δ α and Δ β represent average horizontal angular resolution and average vertical angular resolution between successive beam emitters, (r, c) represent a two-dimensional map position index of the lidar point cloud on the projected image, (x, y, z) represent coordinates of the lidar point cloud in a cartesian coordinate system, and the transformation is performed at (r, c) with a two-channel data (d, z) fill element,
[Formula shown as image FDA0002719995520000033 in the original publication]
and respectively inputting the converted laser radar point cloud data (r, c) and the RGB image into a network formed by ResNet101+ FPN for feature extraction.
10. The multi-sensor image fusion system of claim 9, wherein the joint loss function comprises:
Loss_total = λ_cls·Loss_cls + λ_bbox·Loss_bbox + λ_mask·Loss_mask
wherein Loss_total represents the total loss, Loss_cls represents the classification loss, Loss_bbox represents the bounding box regression loss, Loss_mask represents the mask prediction loss, and λ_cls, λ_bbox, λ_mask are the corresponding weight coefficients;
using P_i and P_i* to represent the ground-truth classification and the predicted classification respectively, the classification loss is expressed as:
[Formula shown as image FDA0002719995520000034 in the original publication]
wherein N_cls represents the number of proposed regions and L_cls is a multi-class cross-entropy function;
using B_i and B_i* to represent the ground-truth bounding box and the predicted bounding box respectively, the bounding box regression loss is expressed as:
[Formula shown as image FDA0002719995520000041 in the original publication]
wherein N_reg represents the size of the feature map and L_bbox represents the smooth-L1 loss function;
using M_i and M_i* to represent the ground-truth segmentation mask and the predicted segmentation mask respectively, the mask prediction loss is expressed as:
[Formula shown as image FDA0002719995520000042 in the original publication]
wherein N_mask represents the pixel-level positions of the feature map and L_seg represents a cross-entropy function.
CN202011084849.3A 2020-10-12 2020-10-12 Method and system for fusing images of multiple sensors Active CN112233079B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011084849.3A CN112233079B (en) 2020-10-12 2020-10-12 Method and system for fusing images of multiple sensors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011084849.3A CN112233079B (en) 2020-10-12 2020-10-12 Method and system for fusing images of multiple sensors

Publications (2)

Publication Number Publication Date
CN112233079A (en) 2021-01-15
CN112233079B CN112233079B (en) 2022-02-11

Family

ID=74113304

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011084849.3A Active CN112233079B (en) 2020-10-12 2020-10-12 Method and system for fusing images of multiple sensors

Country Status (1)

Country Link
CN (1) CN112233079B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221659A (en) * 2021-04-13 2021-08-06 天津大学 Double-light vehicle detection method and device based on uncertain sensing network
CN117864412A (en) * 2023-08-10 2024-04-12 中国人民解放军海军航空大学 Onboard electronic equipment test signal trigger mechanism based on laser point cloud information

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102289808A (en) * 2011-07-22 2011-12-21 清华大学 Method and system for evaluating image fusion quality
US20130330001A1 (en) * 2012-06-06 2013-12-12 Apple Inc. Image Fusion Using Intensity Mapping Functions
CN106897986A (en) * 2017-01-23 2017-06-27 浙江大学 A kind of visible images based on multiscale analysis and far infrared image interfusion method
CN108549874A (en) * 2018-04-19 2018-09-18 广州广电运通金融电子股份有限公司 A kind of object detection method, equipment and computer readable storage medium
WO2019196539A1 (en) * 2018-04-11 2019-10-17 杭州海康威视数字技术股份有限公司 Image fusion method and apparatus

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102289808A (en) * 2011-07-22 2011-12-21 清华大学 Method and system for evaluating image fusion quality
US20130330001A1 (en) * 2012-06-06 2013-12-12 Apple Inc. Image Fusion Using Intensity Mapping Functions
CN106897986A (en) * 2017-01-23 2017-06-27 浙江大学 A kind of visible images based on multiscale analysis and far infrared image interfusion method
WO2019196539A1 (en) * 2018-04-11 2019-10-17 杭州海康威视数字技术股份有限公司 Image fusion method and apparatus
CN108549874A (en) * 2018-04-19 2018-09-18 广州广电运通金融电子股份有限公司 A kind of object detection method, equipment and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
KEKE GENG et al.: "Deep Dual-Modal Traffic Objects Instance Segmentation Method Using Camera and LIDAR Data for Autonomous Driving", Remote Sensing *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221659A (en) * 2021-04-13 2021-08-06 天津大学 Double-light vehicle detection method and device based on uncertain sensing network
CN113221659B (en) * 2021-04-13 2022-12-23 天津大学 Double-light vehicle detection method and device based on uncertain sensing network
CN117864412A (en) * 2023-08-10 2024-04-12 中国人民解放军海军航空大学 Onboard electronic equipment test signal trigger mechanism based on laser point cloud information

Also Published As

Publication number Publication date
CN112233079B (en) 2022-02-11

Similar Documents

Publication Publication Date Title
CN110188696B (en) Multi-source sensing method and system for unmanned surface equipment
CN110956651B (en) Terrain semantic perception method based on fusion of vision and vibrotactile sense
CN110570429B (en) Lightweight real-time semantic segmentation method based on three-dimensional point cloud
CN115082674B (en) Multi-mode data fusion three-dimensional target detection method based on attention mechanism
CN110738121A (en) front vehicle detection method and detection system
CN109919026B (en) Surface unmanned ship local path planning method
CN114266977B (en) Multi-AUV underwater target identification method based on super-resolution selectable network
CN114639115B (en) Human body key point and laser radar fused 3D pedestrian detection method
CN112233079B (en) Method and system for fusing images of multiple sensors
CN111738071B (en) Inverse perspective transformation method based on motion change of monocular camera
CN116129233A (en) Automatic driving scene panoramic segmentation method based on multi-mode fusion perception
CN112257668A (en) Main and auxiliary road judging method and device, electronic equipment and storage medium
CN114972968A (en) Tray identification and pose estimation method based on multiple neural networks
CN115830265A (en) Automatic driving movement obstacle segmentation method based on laser radar
CN116129234A (en) Attention-based 4D millimeter wave radar and vision fusion method
Zhang et al. Change detection between digital surface models from airborne laser scanning and dense image matching using convolutional neural networks
CN117372697A (en) Point cloud segmentation method and system for single-mode sparse orbit scene
CN115984646B (en) Remote sensing cross-satellite observation oriented distributed target detection method and device and satellite
CN111353481A (en) Road obstacle identification method based on laser point cloud and video image
CN115187959B (en) Method and system for landing flying vehicle in mountainous region based on binocular vision
CN116386003A (en) Three-dimensional target detection method based on knowledge distillation
CN115861948A (en) Lane line detection method, lane line detection device, lane line early warning method, lane line early warning system and medium
Zheng et al. Research on environmental feature recognition algorithm of emergency braking system for autonomous vehicles
CN113762195A (en) Point cloud semantic segmentation and understanding method based on road side RSU
CN113221643B (en) Lane line classification method and system adopting cascade network

Legal Events

Code   Title
PB01   Publication
SE01   Entry into force of request for substantive examination
GR01   Patent grant