CN112801027A - Vehicle target detection method based on event camera - Google Patents

Vehicle target detection method based on event camera

Info

Publication number
CN112801027A
CN112801027A (application CN202110182127.XA)
Authority
CN
China
Prior art keywords
dvs
aps
image
event
loss
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110182127.XA
Other languages
Chinese (zh)
Inventor
孙艳丰
刘萌允
齐娜
施云惠
尹宝才
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202110182127.XA priority Critical patent/CN112801027A/en
Publication of CN112801027A publication Critical patent/CN112801027A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/50Context or environment of the image
    • G06V20/56Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/25Fusion techniques
    • G06F18/253Fusion techniques of extracted features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/40Extraction of image or video features
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07Target detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00Indexing scheme relating to image or video recognition or understanding
    • G06V2201/08Detecting or categorising vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses a vehicle target detection method based on an event camera, developed using deep learning technology. An event camera can generate frames and event data asynchronously, which greatly helps to overcome motion blur and extreme lighting conditions. First, the events are converted into an event image; then the frame image and the event image are fed into a fusion convolutional neural network simultaneously, with convolutional layers added to extract features from the event image. In the middle layers of the network, the features of the two inputs are fused by a fusion module. Finally, the loss function is redesigned to improve the effectiveness of vehicle target detection. The method compensates for the shortcomings of using only frame images for target detection in extreme scenes: by fusing event images with frame images in the fusion convolutional neural network, vehicle target detection in extreme scenes is enhanced.

Description

Vehicle target detection method based on event camera
Technical Field
The invention discloses a vehicle target detection method for extreme scenes, based on an event camera and implemented with deep learning technology. It belongs to the field of computer vision and particularly relates to technologies such as deep learning and target detection.
Background
With the rapid development of the automobile industry, autonomous driving technology has received extensive attention from academia and industry in recent years. Vehicle target detection is a challenging task in autonomous driving and an important application in autonomous driving and intelligent transportation systems, where it plays a key role. The purpose of vehicle target detection is to accurately locate the other vehicles in the surrounding environment and thus avoid accidents with them.
A great deal of current target detection research uses deep neural networks to enhance detection systems. These studies generally use a frame-based camera, the Active Pixel Sensor (APS). As a result, most of the detected objects are stationary or slowly moving, and the lighting conditions are favorable. In practice, vehicles encounter a variety of complex and extreme scenarios. Under extreme lighting and fast motion, images produced by conventional frame-based cameras suffer from overexposure and blur, which presents a significant challenge to target detection.
Dynamic Vision Sensors (DVS) have the key features of high dynamic range and low latency. These characteristics enable them to capture environmental information and generate images faster than standard cameras. At the same time, they are not affected by motion blur, which complements frame cameras in extreme cases. Furthermore, their low latency and short response time can make autonomous vehicles more responsive. The Dynamic and Active Pixel Vision Sensor (DAVIS) can output regular grayscale frames and asynchronous events through its APS and DVS channels, respectively. Regular grayscale frames provide the main information for target detection, while asynchronous events provide information about fast motion and illumination changes. Following this insight, detection performance can be improved by combining the two kinds of data.
In recent years, deep learning algorithms have achieved great success and are widely used in image classification and target detection. Deep neural networks have excellent feature extraction capability and strong learning capability, and can identify target categories and locate target positions in a recognition task. A Convolutional Neural Network (CNN) based on bounding-box regression can directly regress the position and class of a target from an input image without searching for candidate regions. However, this requires that the objects to be discriminated in the image fed into the CNN are sharp, whereas objects in images generated in extreme scenes may be blurred. Using a CNN alone on frame images generated in extreme scenes therefore cannot meet the requirement.
The CNN-based vehicle detection method proposed here fuses the frame and event data output by a DAVIS camera. The event data are reconstructed into an image, the frame image and the event image are fed into a convolutional neural network simultaneously, and the features extracted from the event image are fused with the features extracted from the frame image in the intermediate layers of the network through a fusion module. At the final detection layer, the loss function of the network is redesigned and a loss term is added for the DVS features. The dataset used in the experiments is a self-built vehicle target detection dataset (Dataset of APS and DVS, DAD). A comparison of different input modes shows that the vehicle detection results are clearly improved under different environmental conditions. Meanwhile, compared with other methods, such as networks using a single image input and networks using both kinds of data as input, the method proposed here achieves a significant improvement.
Disclosure of Invention
The invention provides a vehicle target detection method based on an event camera, using deep learning technology. Since an ordinary camera produces motion blur, overexposure, or underexposure in fast-moving and extreme-brightness scenes, the event data generated by an event camera are used to enhance the detection effect. The event camera asynchronously outputs events for changes in brightness, each containing the pixel coordinates, the brightness polarity, and a timestamp, so the events are first converted into images. This is because image-based target detection technology is mature, and detection on events can thus be realized with image detection techniques. The frame image (APS) and the event image (DVS) are fed simultaneously into a fusion convolutional network framework (ADF) for convolution operations, and feature extraction and feature fusion are performed within this framework. In this way, the features of both images are extracted, and the finally extracted features carry effective information from both. Finally, the loss function of the model is modified so that, in addition to the loss terms on the APS, loss terms on the DVS are also added. The overall framework of the method is shown in Figure 1, and the method can be divided into the following four steps: converting the event data into an event image, extracting features with the overall framework of the fusion convolutional neural network, fusing the features through the fusion module, and performing target detection on the extracted features through the detection layer.
(1) Converting event data into an event image
Considering that current target detection algorithms for images are relatively mature, the event data of the DVS channel are converted into an image and then sent into the network together with the APS image for target detection. Each event consists of the pixel abscissa x, the pixel ordinate y, the brightness polarity (+1 for an increase, -1 for a decrease), and a timestamp. According to the changes of pixel coordinates and polarity, the event data within an accumulation time are converted into an event image of the same size as the frame image.
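For illustration only, a single DVS event as described above can be represented by the following minimal Python structure; the field names and the microsecond timestamp unit are assumptions made for this example, not details fixed by the patent.

from typing import NamedTuple

class Event(NamedTuple):
    """One asynchronous DVS event (illustrative field names)."""
    x: int          # pixel abscissa
    y: int          # pixel ordinate
    polarity: int   # +1 for a brightness increase, -1 for a decrease
    timestamp: int  # time stamp, assumed here to be in microseconds

# Example: a brightness increase at pixel (120, 64).
e = Event(x=120, y=64, polarity=+1, timestamp=1_000_250)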
(2) Overall framework for feature extraction
The invention uses Darknet-53 as the basic framework and, in addition to the convolution operations performed on the APS image, adds convolutional layers for extracting features from the DVS image. Because the data of the DVS channel are sparse, fewer convolutional layers are used to extract features at the different resolutions. As in Darknet-53, the DVS channel still uses successive 3 × 3 and 1 × 1 convolutional layers. The specific number of convolutional layers is shown in Table 1.
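For concreteness, the following PyTorch sketch shows one pair of successive 1 × 1 and 3 × 3 convolutional layers of the kind used on the DVS channel; the channel widths, batch normalization, and LeakyReLU activation follow common Darknet-53 practice and are assumptions rather than parameters fixed by the patent.

import torch
import torch.nn as nn

class ConvBNLeaky(nn.Module):
    """Convolution + BatchNorm + LeakyReLU, the basic unit of a Darknet-style backbone."""
    def __init__(self, c_in, c_out, k):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, k, padding=k // 2, bias=False),
            nn.BatchNorm2d(c_out),
            nn.LeakyReLU(0.1),
        )

    def forward(self, x):
        return self.block(x)

class DVSBlock(nn.Module):
    """One successive 1x1 / 3x3 convolution pair for the sparse DVS channel (illustrative widths)."""
    def __init__(self, channels):
        super().__init__()
        self.reduce = ConvBNLeaky(channels, channels // 2, k=1)
        self.expand = ConvBNLeaky(channels // 2, channels, k=3)

    def forward(self, x):
        return self.expand(self.reduce(x))

# Usage: extract DVS features from a single-channel 416 x 416 event image (size is an example).
stem = ConvBNLeaky(1, 32, k=3)
block = DVSBlock(32)
features = block(stem(torch.randn(1, 1, 416, 416)))  # -> shape (1, 32, 416, 416)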
(3) Fusion module
In the network structure, a fusion module is designed with reference to ResNet. The fusion module extracts the main features of the DVS at different resolutions and then fuses them with APS features of the same size, so as to guide the network to learn more detailed features of both the APS and the DVS simultaneously. The fusion module is shown in Figure 2.
(4) Target detection on the extracted features through the detection layer
The loss function of the network is modified at the detection layer. The loss function for the APS features adopts a cross-entropy loss, including the losses of coordinates, classes, and confidences. On this basis, a cross-entropy loss is also computed on the DVS features. Finally, the detection results of the APS and of the DVS are combined. A result obtained from the APS alone or the DVS alone may still be correct, so taking only the intersection of the two results would lose many correct detections. Taking the union of the two results reduces errors and improves accuracy.
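As a minimal sketch of taking the union of the two detection results, the snippet below keeps every APS detection and adds any DVS detection that does not overlap an APS detection of the same class; the box format, IoU threshold, and de-duplication rule are assumptions for illustration, not the patent's prescribed procedure.

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def merge_detections(aps_dets, dvs_dets, iou_thr=0.5):
    """Union of APS and DVS detections: keep all APS boxes, add non-duplicate DVS boxes."""
    merged = list(aps_dets)
    for box_d, cls_d, score_d in dvs_dets:
        duplicate = any(cls_a == cls_d and iou(box_a, box_d) > iou_thr
                        for box_a, cls_a, _ in aps_dets)
        if not duplicate:
            merged.append((box_d, cls_d, score_d))
    return merged

# Example: a fast-moving car seen only by the DVS channel is kept in the union.
aps = [((10, 10, 50, 40), "car", 0.9)]
dvs = [((10, 10, 52, 41), "car", 0.6), ((80, 30, 120, 60), "car", 0.7)]
print(merge_detections(aps, dvs))  # the shared box once, plus the DVS-only box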
Compared with the prior art, the invention has the following obvious advantages and beneficial effects:
Based on the APS images and DVS data generated by an event camera, the invention adopts convolutional neural network technology to detect vehicles in extreme scenes. Compared with using only traditional APS images, the event data are converted into event images, which are then recognized with mature deep learning techniques. A fusion module is added to the convolutional neural network to perform feature-level fusion of the two kinds of information. Finally, by modifying the loss function, the ability of the network to identify targets when the image suffers from target blur, unsuitable illumination, and similar problems is improved, achieving good results in extreme scenes.
Drawings
FIG. 1 is a block diagram of an overall network architecture;
FIG. 2 is a schematic diagram of a fusion module;
FIG. 3 is a graph of experimental results;
Detailed Description
In light of the above description, a specific implementation flow is as follows, but the scope of protection of this patent is not limited to this implementation flow.
Step 1: event data converted into event image
Based on the generation mechanism of the event, there are three reconstruction methods to convert the event into the frame. They are a fixed event number method, a leaky integrator method, and a fixed time interval method, respectively. In the present invention, it is an object to be able to detect fast moving objects. The event reconstruction is set to a fixed frame length of 10ms using a fixed time interval method. In each time interval, according to the pixel position generated by the event, at the corresponding pixel point generated with polarity, the event with the polarity increased is drawn as a white pixel, the event with the polarity decreased is drawn as a black pixel, and the background color of the image is gray. And finally generating an event image with the same size as the APS image.
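A minimal NumPy sketch of this fixed-time-interval reconstruction is given below; it assumes the events arrive as (x, y, polarity, timestamp) tuples with timestamps in microseconds, and the 260 × 346 frame size in the usage example is illustrative, both being assumptions rather than details fixed by the patent.

import numpy as np

def events_to_image(events, height, width, t_start_us, interval_us=10_000):
    """Accumulate the events of one 10 ms window into a gray-background image.

    events: iterable of (x, y, polarity, timestamp_us) with polarity +1 or -1.
    Returns a uint8 image: gray background (128), white (255) where the
    brightness increased, black (0) where it decreased, same size as the APS frame.
    """
    img = np.full((height, width), 128, dtype=np.uint8)  # gray background
    t_end_us = t_start_us + interval_us
    for x, y, polarity, t in events:
        if t_start_us <= t < t_end_us:
            img[y, x] = 255 if polarity > 0 else 0
    return img

# Usage: one 10 ms event image matching the APS frame size.
demo_events = [(10, 20, +1, 1_000), (11, 20, -1, 2_500)]
event_image = events_to_image(demo_events, height=260, width=346, t_start_us=0)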
Step 2: feature extraction via a network ensemble framework
APS images and DVS images are simultaneously input into a network framework, and features are extracted through respective 3 × 3 and 1 × 1 convolutional layers, except that the number of convolutional layers for extracting the features is different, and the DVS is less than that of the APS. The network (2) predicts the input APS image and also predicts the DVS image. Both APS and DVS images are divided into S × S grids, each grid predicts B bounding boxes, and predicts C classes altogether. Each bbox was introduced into the Gaussian model, predicting 8 coordinate values, μ _ x, ε _ x, μ _ y, ε _ y, μ _ w, ε _ w, μ _ h, ε _ h. A confidence score p is also predicted. So at the last input detection layer of the network is a tensor of 2 × S × B × (C + 9). The three size tensors of the APS channel and the three same size tensors of the DVS channel are fed into the detection layer, respectively.
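The PyTorch sketch below illustrates the layout of that prediction tensor for one channel and one scale (S × S grid, B prior boxes, eight Gaussian coordinate values, one confidence, and C classes); the 256-channel input feature width is an assumed placeholder, not a value taken from the patent.

import torch
import torch.nn as nn

class GaussianHead(nn.Module):
    """Predicts B * (C + 9) values per grid cell: 8 Gaussian box parameters
    (mean and uncertainty for x, y, w, h), 1 confidence score, and C class scores."""
    def __init__(self, in_channels, num_anchors_b, num_classes_c):
        super().__init__()
        self.b, self.c = num_anchors_b, num_classes_c
        self.pred = nn.Conv2d(in_channels, num_anchors_b * (num_classes_c + 9), kernel_size=1)

    def forward(self, feat):                          # feat: (N, in_channels, S, S)
        out = self.pred(feat)                         # (N, B*(C+9), S, S)
        n, _, s, _ = out.shape
        return out.view(n, self.b, self.c + 9, s, s)  # (N, B, C+9, S, S)

# One of the three scales for one channel (APS or DVS); the APS and DVS channels
# together give the 2 x S x S x B x (C + 9) tensor described above.
head = GaussianHead(in_channels=256, num_anchors_b=3, num_classes_c=1)
aps_out = head(torch.randn(1, 256, 13, 13))
print(aps_out.shape)  # torch.Size([1, 3, 10, 13, 13])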
And step 3: fusion module
Passing APS and DVS through respective convolution layers to obtain characteristic FapsAnd FdvsFeeding into a fusion model, and first FapsAnd FdvsF → U, F ∈ R, U ∈ RM×N×C,U=[u1,u2,…,uC]To obtain a transformation characteristic UapsAnd UdvsWherein u iscIs a feature matrix of size M × N for the C-th channel among the C channels. Briefly, the Tc operation is taken as a convolution operation;
obtaining transformation characteristics UdvsThen, we consider the global information of all channels in the feature, compress this global information into one channel to get the aggregation information zc. Operating Tst (U) by global average poolingdvs) To accomplish this formally expressed as:
Figure BDA0002941730490000041
wherein u isc(i, j) is the (i, j) th value in the feature matrix. In order to utilize the aggregate information z in the compression operationcExcitation operation is carried out, convolution characteristic information of each channel is fused, and a dependency relation s on the channels is obtained, namely:
s=Tex(z,E)=δ(E2σ(E1z))#(2)
where σ denotes a ReLU activation function, δ denotes a sigmoid activation function, E1And E2Two weights. This is achieved using two fully connected layers;
using s to activate switch U through Tscale operationapsObtaining a feature block U':
U′=Tscale(Uaps,s)=Uaps·s#(3)
finally, the DVS feature block is fused with the APS feature to obtain the final fusion feature Faps′:
Figure BDA0002941730490000051
Splicing operation is adopted in specific implementation.
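A minimal PyTorch sketch of this squeeze-excitation-style fusion module is shown below; it assumes that T_c is a 1 × 1 convolution, that the two fully connected layers use a reduction ratio of 16, and that T_fuse is the channel-wise concatenation stated above. These hyperparameters are illustrative and not fixed by the patent.

import torch
import torch.nn as nn

class FusionModule(nn.Module):
    """Squeeze the DVS feature into channel statistics, excite them into channel
    weights s, rescale the APS feature (U' = U_aps * s), then concatenate with
    the DVS feature to obtain the fused feature F'_aps."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.t_c_aps = nn.Conv2d(channels, channels, kernel_size=1)  # T_c for APS (assumed 1x1 conv)
        self.t_c_dvs = nn.Conv2d(channels, channels, kernel_size=1)  # T_c for DVS (assumed 1x1 conv)
        self.squeeze = nn.AdaptiveAvgPool2d(1)                       # T_sq: global average pooling, Eq. (1)
        self.excite = nn.Sequential(                                 # T_ex: two FC layers, Eq. (2)
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, f_aps, f_dvs):
        u_aps, u_dvs = self.t_c_aps(f_aps), self.t_c_dvs(f_dvs)
        z = self.squeeze(u_dvs).flatten(1)            # aggregated information z, Eq. (1)
        s = self.excite(z).view(z.size(0), -1, 1, 1)  # channel dependency s, Eq. (2)
        u_prime = u_aps * s                           # T_scale, Eq. (3)
        return torch.cat([u_prime, u_dvs], dim=1)     # splicing (concatenation), Eq. (4)

# Usage: fuse same-sized APS and DVS feature maps.
fusion = FusionModule(channels=64)
f_out = fusion(torch.randn(2, 64, 52, 52), torch.randn(2, 64, 52, 52))
print(f_out.shape)  # torch.Size([2, 128, 52, 52])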
And 4, step 4: target detection is carried out on the extracted features through a detection layer
The same as APS part, DVS detection result is added in the detection layer, binary cross entropy loss is carried out on the objects and classes detected by DVS, and the negative log likelihood loss function (NLL) of the coordinate frame is as follows:
Figure BDA0002941730490000052
wherein
Figure BDA0002941730490000053
Is the NLL loss of the x coordinate of the DVS. W and H are the number of grids for each width and height, respectively, and K is the prior frame number. The output of the detection layer at the kth prior box of the (i, j) grid is:
Figure BDA0002941730490000054
and
Figure BDA0002941730490000055
Figure BDA0002941730490000056
the coordinates of x are shown as such,
Figure BDA0002941730490000057
representing the uncertainty of the x coordinate.
Figure BDA0002941730490000058
Is the group Truth in x-coordinate, which is calculated from the width and height of the adjusted image in Gaussian yollov 3 and the kth prior box prior. ξ is a fixed value of 10-9.
Figure BDA0002941730490000059
The same as the x coordinate, represents the loss of the remaining coordinates y, w, h.
γ_ijk = ω_scale × δ^obj_ijk   (6)
ω_scale = 2 − w^G × h^G   (7)
ω_scale provides different weights according to the size (w^G, h^G) of the object during training. δ^obj_ijk in (6) is a parameter that is applied in the loss only when the prior box contains the anchor that best fits the current object; its value is 1 or 0, determined by the intersection over union (IOU) between the ground truth and the k-th prior box of grid cell (i, j).
[Equation (8): the confidence (objectness) loss of the DVS channel, a binary cross-entropy over all W × H grid cells and K prior boxes.]
The value of C_ijk depends on whether the bounding box of the grid cell fits the predicted object: if it fits, C_ijk = 1; otherwise, C_ijk = 0. τ^noobj_ijk indicates that the k-th prior box of the grid cell does not fit the target. An additional indicator denotes the correct category, and another indicates that the k-th prior box of the grid cell is not responsible for predicting the target.
The class loss is as follows:
[Equation (9): the class loss of the DVS channel, a binary cross-entropy over the predicted class probabilities P_ij.]
P_ij denotes the probability that the currently detected object belongs to the correct class.
The loss function of the DVS part is:
L_DVS = L^DVS_coord + L^DVS_class + L^DVS_conf   (10)
where L_DVS is the sum of the coordinate loss, class loss, and confidence loss of the DVS channel. L_APS takes the same form as L_DVS, so the overall network loss function is:
L = L_APS + L_DVS   (11)
By adding the loss function of the DVS channel, the model becomes more robust to data from extreme environments, and the accuracy of the algorithm is improved.
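A minimal PyTorch sketch of the Gaussian negative log-likelihood coordinate loss of Eq. (5) and the combined loss of Eq. (11) is given below; it covers a single coordinate and treats the masks, confidence, and class terms in a simplified way, so it illustrates the principle rather than the full Gaussian YOLOv3 formulation used by the patent.

import torch

def gaussian_nll_coord_loss(mu, sigma, target, gamma, xi=1e-9):
    """Eq. (5)-style NLL loss for one coordinate (x, y, w, or h).

    mu, sigma: predicted mean and uncertainty, shape (W, H, K).
    target:    ground-truth coordinate x^G, same shape.
    gamma:     per-box weight gamma_ijk = omega_scale * delta_obj, Eqs. (6)-(7).
    """
    var = sigma.clamp_min(xi) ** 2
    gauss = torch.exp(-(target - mu) ** 2 / (2 * var)) / torch.sqrt(2 * torch.pi * var)
    return -(gamma * torch.log(gauss + xi)).sum()

# Toy example: a 13 x 13 grid with K = 3 prior boxes (vehicle class only).
W = H = 13
K = 3
mu = torch.rand(W, H, K)
sigma = torch.rand(W, H, K) * 0.5 + 0.1
target = torch.rand(W, H, K)
gamma = (torch.rand(W, H, K) > 0.9).float() * 1.5   # omega_scale * delta_obj (illustrative values)

l_x_dvs = gaussian_nll_coord_loss(mu, sigma, target, gamma)

# Eq. (11): the total loss is the sum of the APS and DVS channel losses, each of
# which sums its coordinate, class, and confidence terms.
l_aps = torch.tensor(0.0)   # placeholder for the full APS loss
l_total = l_aps + l_x_dvs   # L = L_APS + L_DVS (DVS part reduced to its x term here)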
To verify the validity of the proposed solution, experiments were first performed on the self-built dataset. Comparative experiments were conducted for different input modes: inputting only the APS image, inputting only the DVS image, inputting the pixel-wise superposition of the APS and DVS images, and inputting both images simultaneously. The experimental results are shown in Table 2, and the effect of the different input modes is shown in Figure 3, where each column corresponds to one input mode. Four scenes (fast motion, over-bright illumination, over-dark illumination, and normal conditions) were selected for each method. In a scene with a fast-moving object, the DVS-only input can detect the fast-moving vehicle but may miss a relatively stationary one; conversely, the APS-only input can detect a relatively stationary vehicle but not a fast-moving one. The effect of inputting the superposition of APS and DVS pixels is comparable to that of inputting the APS image alone. When the two images are input simultaneously, a good detection effect is obtained whether the vehicle is moving fast or stationary. Under too-strong or too-dark illumination, neither the APS-only input nor the superposed image gives a good detection effect, whereas inputting the APS and DVS images simultaneously fuses the two kinds of features well, so the shortcomings of the APS can be compensated by the DVS. The DVS-only input performs worst in normal scenes, because only brightness changes generate information and regions without brightness change correspond to the background and cannot be recognized. Overall, the method of fusing the two images within the ADF network is significantly superior to the other methods.
Several state-of-the-art single-input networks were also selected for comparison, as shown in Table 3, which compares the results of single-image-input networks on the self-built dataset. As can be seen from the table, when the model of the present invention receives only a single image, its performance is not as good as that of the other networks, because the network itself is designed for dual input. When the model receives frames and events simultaneously, the experimental results improve, demonstrating the benefit of using event data for recognition.
In addition, the method is compared on the PKU-DDD17-CAR dataset with the JDF network, which also takes both kinds of data as input; the results are shown in Table 4. The event data in this dataset are converted into images and then fed into the ADF network. The results of inputting only the frame image and of inputting the frame image and event data simultaneously are compared separately. Although the proposed network is inferior to the JDF network when only a frame image is input, it outperforms the JDF network when both kinds of data are input simultaneously.
TABLE 1 number of convolution layers in network framework
[Table 1 is provided as an image in the original publication.]
Table 2 experimental results on custom data set
[Table 2 is provided as an image in the original publication.]
TABLE 3 comparison with Single image input network
[Table 3 is provided as an image in the original publication.]
TABLE 4 comparison of two data inputs into different networks
[Table 4 is provided as an image in the original publication.]

Claims (5)

1. A vehicle target detection method based on an event camera, characterized by comprising the following steps: based on the APS images and DVS data generated by an event camera, adopting convolutional neural network technology to detect vehicle targets in extreme scenes, and converting the event data into event images; according to the changes of pixel coordinates and polarity, converting the event data within an accumulation time into an event image of the same size as the frame image; using a mature convolutional neural network based on the Darknet-53 framework, adding, in addition to the convolution operations performed on the APS image, convolutional layers for extracting features from the DVS image, the DVS channel still adopting successive 3 × 3 and 1 × 1 convolutional layers; then adding a fusion module into the convolutional neural network, which extracts DVS features at different resolutions and uses them to weight APS features of the same size, so as to guide the network to learn more detailed features of both the APS and the DVS simultaneously; and modifying the loss function of the network at the detection layer, wherein the loss function of the APS features is a cross-entropy loss comprising the losses of coordinates, classes, and confidences, and the cross-entropy loss function also performs a loss calculation on the DVS features.
2. The event camera-based vehicle target detection method of claim 1, wherein the events are converted into an image by the fixed time interval method; to achieve detection at a speed of 100 frames per second (FPS), the frame reconstruction is set to a fixed frame length of 10 ms; in each time interval, at the pixel positions where events occur, events with increasing polarity are drawn as white pixels and events with decreasing polarity as black pixels, and the background color of the image is gray; an event image of the same size as the APS image is finally generated.
3. The event camera-based vehicle target detection method of claim 1, wherein successive 3 × 3 and 1 × 1 convolutional layers that extract features from the DVS image are added; the APS image and the DVS image are input into the network framework simultaneously and features are extracted through their respective 3 × 3 and 1 × 1 convolutional layers, the difference being that the DVS channel uses fewer feature-extraction convolutional layers than the APS channel; the network predicts both the input APS image and the DVS image; both the APS image and the DVS image are divided into S × S grids, each grid predicts B bounding boxes, and C classes are predicted; each bounding box is introduced into a Gaussian model, predicting eight coordinate values μ_x, ε_x, μ_y, ε_y, μ_w, ε_w, μ_h, ε_h; a confidence score p is also predicted; the input to the final detection layer of the network is therefore a tensor of size 2 × S × S × B × (C + 9); and the three tensors of the APS channel and the three tensors of the same sizes from the DVS channel are fed into the detection layer separately.
4. The event camera-based vehicle target detection method of claim 1, wherein the two parts of features are effectively fused in a fusion module; the APS and DVS features F_aps and F_dvs, obtained from their respective convolutional layers, are fed into the fusion module; first, F_aps and F_dvs are passed through a given transformation operation T_c: F → U, with U ∈ R^(M×N×C), U = [u_1, u_2, …, u_C], to obtain the transformed features U_aps and U_dvs, where u_c is the M × N feature matrix of the c-th of the C channels; for simplicity, the T_c operation is taken to be a convolution operation;
after obtaining the transformed feature U_dvs, the global information of all channels in the feature is considered and compressed per channel to obtain the aggregated information z_c; this is done through a global average pooling operation T_sq(U_dvs), formally expressed as:
z_c = T_sq(u_c) = (1 / (M × N)) Σ_{i=1}^{M} Σ_{j=1}^{N} u_c(i, j)   (1)
where u_c(i, j) is the (i, j)-th value of the feature matrix; to make use of the aggregated information z_c from the squeeze operation, an excitation operation is performed to fuse the convolutional feature information of each channel and obtain the channel dependency s:
s = T_ex(z, E) = δ(E_2 σ(E_1 z))   (2)
where σ denotes the ReLU activation function, δ denotes the sigmoid activation function, and E_1 and E_2 are two weight matrices; this is implemented with two fully connected layers;
using s, the transformed APS feature U_aps is rescaled through the T_scale operation to obtain the feature block U′:
U′ = T_scale(U_aps, s) = U_aps · s   (3)
finally, the DVS feature block is fused with the rescaled APS feature to obtain the final fused feature F′_aps:
F′_aps = T_fuse(U′, U_dvs)   (4)
in the specific implementation, the fusion is a splicing (concatenation) operation.
5. The event camera-based vehicle target detection method of claim 1, wherein a loss term for the DVS features is added at the detection layer; in the same way as for the APS part, the DVS detection results are added at the detection layer: binary cross-entropy loss is computed for the objects and classes detected by the DVS, and the negative log-likelihood (NLL) loss of the coordinate box is:
L_x^DVS = − Σ_{i=1}^{W} Σ_{j=1}^{H} Σ_{k=1}^{K} γ_ijk log( N(x^G_ijk | μ_x(x_ijk), ε_x(x_ijk)) + ξ )   (5)
where L_x^DVS is the NLL loss of the x coordinate of the DVS; W and H are the numbers of grid cells along the width and height, respectively, and K is the number of prior boxes; μ_x(x_ijk) and ε_x(x_ijk) are the outputs of the detection layer at the k-th prior box of grid cell (i, j): μ_x(x_ijk) represents the x coordinate, and ε_x(x_ijk) represents the uncertainty of the x coordinate; x^G_ijk is the ground truth of the x coordinate, calculated, as in Gaussian YOLOv3, from the width and height of the resized image and the k-th prior box; ξ is a fixed value of 10^-9; the losses of the remaining coordinates y, w, and h take the same form as that of the x coordinate;
γ_ijk = ω_scale × δ^obj_ijk   (6)
ω_scale = 2 − w^G × h^G   (7)
ω_scale provides different weights according to the size (w^G, h^G) of the object during training; δ^obj_ijk in (6) is a parameter that is applied in the loss only when the prior box contains the anchor that best fits the current object; its value is 1 or 0, determined by the intersection over union (IOU) between the ground truth and the k-th prior box of grid cell (i, j);
[Equation (8): the confidence (objectness) loss of the DVS channel, a binary cross-entropy over all W × H grid cells and K prior boxes.]
the value of C_ijk depends on whether the bounding box of the grid cell fits the predicted object: if it fits, C_ijk = 1; otherwise, C_ijk = 0; τ^noobj_ijk indicates that the k-th prior box of the grid cell does not fit the target; an additional indicator denotes the correct category, and another indicates that the k-th prior box of the grid cell is not responsible for predicting the target;
the class loss is as follows:
[Equation (9): the class loss of the DVS channel, a binary cross-entropy over the predicted class probabilities P_ij.]
P_ij denotes the probability that the currently detected object belongs to the correct class;
the loss function of the DVS part is:
L_DVS = L^DVS_coord + L^DVS_class + L^DVS_conf   (10)
where L_DVS is the sum of the coordinate loss, class loss, and confidence loss of the DVS channel; L_APS takes the same form as L_DVS; the overall network loss function is therefore:
L = L_APS + L_DVS   (11).
CN202110182127.XA 2021-02-09 2021-02-09 Vehicle target detection method based on event camera Pending CN112801027A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110182127.XA CN112801027A (en) 2021-02-09 2021-02-09 Vehicle target detection method based on event camera

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110182127.XA CN112801027A (en) 2021-02-09 2021-02-09 Vehicle target detection method based on event camera

Publications (1)

Publication Number Publication Date
CN112801027A true CN112801027A (en) 2021-05-14

Family

ID=75815068

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110182127.XA Pending CN112801027A (en) 2021-02-09 2021-02-09 Vehicle target detection method based on event camera

Country Status (1)

Country Link
CN (1) CN112801027A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113762409A (en) * 2021-09-17 2021-12-07 北京航空航天大学 Unmanned aerial vehicle target detection method based on event camera
CN115497028A (en) * 2022-10-10 2022-12-20 中国电子科技集团公司信息科学研究院 Event-driven dynamic hidden target detection and identification method and device
CN115631407A (en) * 2022-11-10 2023-01-20 中国石油大学(华东) Underwater transparent biological detection based on event camera and color frame image fusion
WO2023025185A1 (en) * 2021-08-24 2023-03-02 The University Of Hong Kong Event-based auto-exposure for digital photography
CN116206196A (en) * 2023-04-27 2023-06-02 吉林大学 Ocean low-light environment multi-target detection method and detection system thereof
CN116416602A (en) * 2023-04-17 2023-07-11 江南大学 Moving object detection method and system based on combination of event data and image data
CN116682000A (en) * 2023-07-28 2023-09-01 吉林大学 Underwater frogman target detection method based on event camera

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685152A (en) * 2018-12-29 2019-04-26 北京化工大学 A kind of image object detection method based on DC-SPP-YOLO
CN111461083A (en) * 2020-05-26 2020-07-28 青岛大学 Rapid vehicle detection method based on deep learning
CN112163602A (en) * 2020-09-14 2021-01-01 湖北工业大学 Target detection method based on deep neural network

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109685152A (en) * 2018-12-29 2019-04-26 北京化工大学 A kind of image object detection method based on DC-SPP-YOLO
CN111461083A (en) * 2020-05-26 2020-07-28 青岛大学 Rapid vehicle detection method based on deep learning
CN112163602A (en) * 2020-09-14 2021-01-01 湖北工业大学 Target detection method based on deep neural network

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023025185A1 (en) * 2021-08-24 2023-03-02 The University Of Hong Kong Event-based auto-exposure for digital photography
CN113762409A (en) * 2021-09-17 2021-12-07 北京航空航天大学 Unmanned aerial vehicle target detection method based on event camera
CN115497028A (en) * 2022-10-10 2022-12-20 中国电子科技集团公司信息科学研究院 Event-driven dynamic hidden target detection and identification method and device
CN115497028B (en) * 2022-10-10 2023-11-07 中国电子科技集团公司信息科学研究院 Event-driven-based dynamic hidden target detection and recognition method and device
CN115631407A (en) * 2022-11-10 2023-01-20 中国石油大学(华东) Underwater transparent biological detection based on event camera and color frame image fusion
CN115631407B (en) * 2022-11-10 2023-10-20 中国石油大学(华东) Underwater transparent biological detection based on fusion of event camera and color frame image
CN116416602A (en) * 2023-04-17 2023-07-11 江南大学 Moving object detection method and system based on combination of event data and image data
CN116416602B (en) * 2023-04-17 2024-05-24 江南大学 Moving object detection method and system based on combination of event data and image data
CN116206196A (en) * 2023-04-27 2023-06-02 吉林大学 Ocean low-light environment multi-target detection method and detection system thereof
CN116206196B (en) * 2023-04-27 2023-08-08 吉林大学 Ocean low-light environment multi-target detection method and detection system thereof
CN116682000A (en) * 2023-07-28 2023-09-01 吉林大学 Underwater frogman target detection method based on event camera
CN116682000B (en) * 2023-07-28 2023-10-13 吉林大学 Underwater frogman target detection method based on event camera

Similar Documents

Publication Publication Date Title
CN112801027A (en) Vehicle target detection method based on event camera
CN108133188B (en) Behavior identification method based on motion history image and convolutional neural network
CN113052210B (en) Rapid low-light target detection method based on convolutional neural network
CN111814621B (en) Attention mechanism-based multi-scale vehicle pedestrian detection method and device
CN113688723B (en) Infrared image pedestrian target detection method based on improved YOLOv5
CN111461083A (en) Rapid vehicle detection method based on deep learning
CN112686207B (en) Urban street scene target detection method based on regional information enhancement
CN108416780B (en) Object detection and matching method based on twin-region-of-interest pooling model
CN110705412A (en) Video target detection method based on motion history image
CN111832453B (en) Unmanned scene real-time semantic segmentation method based on two-way deep neural network
WO2023030182A1 (en) Image generation method and apparatus
CN114202743A (en) Improved fast-RCNN-based small target detection method in automatic driving scene
Wang et al. Multi-stage fusion for multi-class 3d lidar detection
CN116342894A (en) GIS infrared feature recognition system and method based on improved YOLOv5
CN115035298A (en) City streetscape semantic segmentation enhancement method based on multi-dimensional attention mechanism
CN112329861A (en) Layered feature fusion method for multi-target detection of mobile robot
CN116246059A (en) Vehicle target recognition method based on improved YOLO multi-scale detection
CN116258940A (en) Small target detection method for multi-scale features and self-adaptive weights
CN113963333B (en) Traffic sign board detection method based on improved YOLOF model
Tang et al. HIC-YOLOv5: Improved YOLOv5 For Small Object Detection
CN110503049B (en) Satellite video vehicle number estimation method based on generation countermeasure network
Sun et al. UAV image detection algorithm based on improved YOLOv5
CN116597144A (en) Image semantic segmentation method based on event camera
CN116311154A (en) Vehicle detection and identification method based on YOLOv5 model optimization
CN115909276A (en) Improved YOLOv 5-based small traffic sign target detection method in complex weather

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination