CN113901897A - Parking lot vehicle detection method based on DARFNet model - Google Patents


Info

Publication number
CN113901897A
Authority
CN
China
Prior art keywords
network
layer
prediction
parking lot
image
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111118334.5A
Other languages
Chinese (zh)
Inventor
陈志华
嵇恒铭
周小兵
公海涛
张景轩
王喆
吴宇迪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
East China University of Science and Technology
Original Assignee
East China University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by East China University of Science and Technology filed Critical East China University of Science and Technology
Priority to CN202111118334.5A priority Critical patent/CN113901897A/en
Publication of CN113901897A publication Critical patent/CN113901897A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)
  • Traffic Control Systems (AREA)

Abstract

The invention relates to the technical field of video image processing and provides a parking lot vehicle detection method based on a DARFNet model, comprising the following steps: a. providing a parking lot image to be detected and performing feature extraction on it to obtain corresponding preliminary image features; b. building a lightweight network and inputting the parking lot image to be detected into it to obtain further preliminary features; c. building a multi-channel hybrid fusion self-attention network and inputting the preliminary features into it to obtain multi-channel fusion features; d. building a prediction network comprising a classification network, an IOU network, and a regression network, and inputting the multi-channel fusion features into the prediction network to obtain the prediction result. The embodiment of the invention effectively improves vehicle detection accuracy in dense small-target parking lot scenes.

Description

Parking lot vehicle detection method based on DARFNet model
Technical Field
The invention relates to the technical field of image processing, in particular to a vehicle detection method of a parking lot image.
Background
With the rapid development of computer vision technology, many characteristics of image targets can now be obtained easily through deep learning techniques. Densely packed target environments have become an important research topic; the dense scenes of most current interest include crowded pedestrian detection, crowded parking lot vehicle detection, and dense shopping mall item detection. Such datasets differ from conventional detection datasets, such as COCO and VOC, mainly in that target objects overlap each other to a greater degree and small target objects are far more numerous. Under these circumstances, it is increasingly necessary to identify targets quickly and accurately, so a new detection algorithm must be designed for the dense-scene target detection problem. The core difficulty is that small-target and multi-scale detection remain among the hardest problems in target detection. In addition, while conventional NMS algorithms provide satisfactory accuracy for most isolated-target detection tasks, they can over-suppress prediction boxes in scenes with many overlapping objects, lowering both precision and recall.
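The over-suppression behavior described above can be reproduced with a short sketch. Below is a minimal pure-Python implementation of greedy NMS; the box coordinates, scores, and the 0.5 threshold are illustrative values, not taken from the patent:

```python
def iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def nms(boxes, scores, thresh):
    # Greedy NMS: repeatedly keep the highest-scoring box and drop
    # every remaining box whose IoU with it exceeds `thresh`.
    order = sorted(range(len(boxes)), key=lambda i: -scores[i])
    keep = []
    while order:
        m = order.pop(0)
        keep.append(m)
        order = [j for j in order if iou(boxes[m], boxes[j]) <= thresh]
    return keep

# Two tightly parked cars: the boxes genuinely overlap (IoU ~ 0.54),
# yet NMS with the common 0.5 threshold keeps only one of them.
boxes = [(0, 0, 10, 10), (3, 0, 13, 10)]
scores = [0.9, 0.85]
print(nms(boxes, scores, 0.5))
```

Raising the threshold keeps both boxes but reintroduces duplicate detections elsewhere, which is exactly the trade-off that motivates the IOU prediction branch proposed below.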
Target detection is a fundamental problem in computer vision, and vehicle detection is an important instance of it, widely applied to traffic information acquisition and urban road network planning in intelligent traffic systems. Compared with general image target detection tasks, detection in dense-target images is a real challenge for two main reasons. One is that objects typically have a variety of shapes, colors, and poses. The other is that the detection model is easily affected by varying weather conditions and lighting. Texture also affects the detection task: patterns on rocks, buildings, and road surfaces, as well as traffic signals, may interfere with vehicle detection. Moreover, vehicles in unmanned aerial vehicle images are usually small, and detection of moving vehicles is often degraded by motion blur, target occlusion, and the like. The drone vehicle detection datasets studied here contain many images with ambient background disturbances and differing lighting environments. These problems increase the difficulty of designing the algorithm.
Academia has proposed many methods for the general target detection task. These methods can be broadly divided into two categories: two-stage models and one-stage models. In two-stage models, a feature pyramid network with lateral connections and a region proposal network (RPN) are used to filter negative examples, after which classification and regression branches generate the final prediction boxes. One-stage models instead handle the problem with a simplified feature pyramid without top-down connections; these networks typically use special loss functions or sampling methods to keep the large number of background boxes from disrupting training. While both types of model achieve high-precision detection on generic datasets, they do not perform well in crowded environments, particularly on datasets containing many small targets.
Disclosure of Invention
Aiming at the shortcomings of existing target detection techniques for small-target detection in parking lots, a DARF network module is proposed to expand the receptive field of the general SSD model, together with an IOU prediction branch that supplies a threshold preventing the traditional NMS algorithm from removing too many proposal boxes. The DARF network module we use consists of a densely connected RFB module. We use dense connections similar to those in DenseNet and DenseASPP to concatenate the feature maps produced by different dilated convolution kernels, further expanding the receptive field. Building on the predicted IOU value, we further propose an improved NMS algorithm. In summary, our method makes the following contributions: channel-wise dense connections further expand the receptive field, while a channel attention module filters out part of the interference and reduces feature interference from similar background objects; a new multi-dilation convolution branch module, DARF, is proposed; and, to address the problem that target boxes with high overlap rates are easily deleted by the NMS post-processing step in overly dense scenes, a branch predicting the IOU is proposed, with the product of the classification branch's predicted confidence and the predicted IOU used as the confidence in the NMS algorithm. The main modules are described as follows: the DARF module consists of a dense receptive field block and a channel attention block. A typical RFB module has three branches, each containing convolutions with kernels of a different size. Before the dilated convolution, the feature map is processed by a convolution layer. In addition, dense connections yield denser scale and pixel sampling, which helps feature acquisition for small targets.
After merging the feature maps obtained from the different dilated convolution kernels, we propose a simplified channel attention module to highlight the channels of interest and suppress noise information. To avoid introducing excessive parameters, the two fully-connected layers common in general channel attention modules are not used here; only one fully-connected layer learns the feature transformation. Attention weights are then obtained with a Sigmoid activation layer and multiplied with the original feature maps. Finally, the feature map before the DARF module and the feature map after the dilated convolution are combined by accumulation.
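The simplified channel attention just described (global average pooling, a single fully-connected layer instead of the usual two, a Sigmoid, and a multiply with the input) can be sketched as follows; this is a minimal pure-Python illustration, and the weight matrix and bias are hypothetical stand-ins for learned parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_attention(features, weights, bias):
    # features: list of C channel maps, each a list of rows of floats.
    # weights (C x C) and bias (length C) play the role of the single
    # fully-connected layer's learned parameters.
    C = len(features)
    # Global average pooling: one scalar per channel.
    pooled = [sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
              for ch in features]
    # One fully-connected layer followed by a Sigmoid activation.
    attn = [sigmoid(sum(weights[c][k] * pooled[k] for k in range(C)) + bias[c])
            for c in range(C)]
    # Multiply each channel of the original feature map by its weight.
    out = [[[a * v for v in row] for row in ch]
           for ch, a in zip(features, attn)]
    return out, attn

# Two 2x2 channels; identity weights and zero bias for illustration.
feats = [[[1.0, 1.0], [1.0, 1.0]],
         [[4.0, 4.0], [4.0, 4.0]]]
out, attn = channel_attention(feats, [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

With a single learned transform the parameter count is C x C + C rather than the 2 x (C x C/r) of the usual squeeze-and-excitation design, which is the parameter saving motivating this simplification.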
In summary, there is provided a real-time detector for parking lot detection, comprising the steps of:
step S1: providing a parking lot image to be detected, and extracting features of the parking lot image by using a backbone network to obtain corresponding image features extracted by the backbone network.
Step S2: building a lightweight network, wherein the network comprises four convolutional layers; and inputting the parking lot image to be detected into the lightweight network to obtain the initial features extracted by the corresponding lightweight network.
Step S3: building a multi-channel fusion self-attention network, wherein the network comprises a connecting layer, a fusion layer and a convolution layer; and inputting the image preliminary features respectively obtained in the S1 and the S2 into a multi-channel fusion self-attention network to obtain multi-channel fusion features.
Step S4: building a DARF network module, wherein the network comprises a connecting layer, a fusion layer, a convolution layer and an average pooling layer; and inputting the mapping features obtained in the S3 into the DARF network to obtain multi-channel fusion features.
Step S5: building a network module of a fusion layer, wherein the network comprises a connecting layer, the fusion layer and a convolution layer; and inputting the mapping feature obtained in the step S4 into a feature fusion network to obtain a multi-channel fusion feature.
Step S6: building a prediction network, wherein the prediction network comprises a classification network, an IOU network and a regression network; and inputting the multi-channel fusion characteristics obtained in the step S5 into the prediction network to obtain a prediction result.
Optionally, in an embodiment of the present invention, the input of step S1 is a parking lot image of H × W (H, W respectively indicate the length and width of the parking lot image), and the multi-layer residual mapping feature of the image is extracted through an SSD network and indicated by R-Conv-1-4.
Optionally, in an embodiment of the present invention, the lightweight network of step S2 does not undergo a pre-training process. The network is divided into four layers, each outputting features of a different level, denoted Q-Conv-1~4.
Optionally, in an embodiment of the present invention, the multi-channel fusion self-attention network of step S3 has two inputs, namely the output features of steps S1 and S2. Product and convolution operations are performed on the two input features. A total of four pairs of inputs yields four outputs, denoted C-Conv-1~4.
Optionally, in an embodiment of the present invention, the input of step S4 is the mapping features C-Conv-1~4 obtained in step S3. This network module is called DARF; the DARF network includes convolution layers with three kernels of different sizes. It can be expressed as:
C_i = F_{3,i}(A), i = 1
C_i = F_{3,i}(Concat(A, C_{i-1})), i = 2, 3
where C_i denotes the output features at different levels, and F_{3,i} denotes dilated (hole) convolutions of different sizes applied to the features C-Conv-1~4 obtained in step S3. Finally, the outputs are obtained and denoted B-Conv-1~4.
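The recurrence above can be illustrated with a 1-D toy implementation. This is a sketch only: the three-tap kernels, dilation rates, and single-channel input are hypothetical stand-ins for the actual F_{3,i} layers, and channel concatenation is modeled as list concatenation:

```python
def dilated_conv1d(channels, kernel, dilation):
    # Sum the input channels, then apply a short kernel with the given
    # dilation and zero padding ("same" output length). A minimal
    # stand-in for the dilated-convolution layers F_{3,i}.
    x = [sum(vals) for vals in zip(*channels)]
    n, k = len(x), len(kernel)
    out = []
    for i in range(n):
        s = 0.0
        for t in range(k):
            j = i + (t - k // 2) * dilation
            if 0 <= j < n:
                s += kernel[t] * x[j]
        out.append(s)
    return out

def darf_dense(A, kernels_dilations):
    # C_1 = F(A); C_i = F(Concat(A, C_{i-1})) for i >= 2,
    # mirroring the dense connection in the recurrence above.
    outputs = []
    prev = None
    for kernel, d in kernels_dilations:
        inp = A if prev is None else A + [prev]   # channel concat
        prev = dilated_conv1d(inp, kernel, d)
        outputs.append(prev)
    return outputs

A = [[0, 0, 1, 0, 0, 0, 0]]          # one channel with a single spike
C = darf_dense(A, [([1, 1, 1], 1), ([1, 1, 1], 2), ([1, 1, 1], 3)])
```

Tracing the spike through the chain shows the effective receptive field widening at each level: the nonzero support of C_2 is strictly larger than that of C_1, which is the point of concatenating earlier outputs before each successive dilated convolution.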
Optionally, in an embodiment of the invention, after B-Conv-1~4 are obtained, an attention mechanism module follows, in which only one fully-connected layer is used to learn the feature transformation. Attention weights are then obtained with a Sigmoid activation layer and multiplied with the original feature map. Finally, the feature map before the DARF module and the feature map after the dilated convolution are combined by accumulation.
Optionally, in an embodiment of the present invention, the IOU prediction branch in step S6 consists of a convolution layer with a 3x3 kernel and a Sigmoid activation layer. The activation layer ensures that the predicted IOU value lies in the range [0, 1].
Let f_i be the confidence predicted by the classification branch, IOU_i the IOU value predicted by the IOU branch, a the parameter used to adjust the product ratio, and score_i the final predicted score. The confidence of each prediction box can then be expressed as:
score_i = f_i^a x IOU_i^(1-a)
the confidence level of the original prediction is replaced by the one with the largest score in the iteration of the NMS algorithm. May particularly be expressed as
Figure BDA0003273709220000061
In the formula bmIs a detection box with higher confidence selected in the iterative process of NMS algorithm, scorejIs the corresponding confidence, scorej' is the adjusted confidence and Ω is the threshold.
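A minimal sketch of how the fused confidence and the adjusted suppression might work together (pure Python; the exponent a = 0.5, the threshold Ω = 0.5, and the linear decay used for the adjustment are illustrative assumptions, since the patent's equation images are not reproduced):

```python
def box_iou(a, b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def fused_score(f, iou_pred, a=0.5):
    # score_i = f_i**a * IOU_i**(1 - a): a weighted product of the
    # classification confidence and the predicted IOU.
    return (f ** a) * (iou_pred ** (1.0 - a))

def nms_adjusted(boxes, cls_conf, iou_pred, a=0.5, omega=0.5):
    # NMS that decays overlapping scores (an assumed linear decay)
    # instead of deleting boxes outright, so dense true positives
    # survive with reduced confidence.
    scores = [fused_score(f, p, a) for f, p in zip(cls_conf, iou_pred)]
    keep, alive = [], list(range(len(boxes)))
    while alive:
        m = max(alive, key=lambda i: scores[i])
        keep.append(m)
        alive.remove(m)
        for j in alive:
            ov = box_iou(boxes[m], boxes[j])
            if ov >= omega:
                scores[j] *= (1.0 - ov)   # decay, do not delete
    return keep, scores

# A well-classified but poorly localized box (cls 0.81, predicted IOU
# 0.25) ranks below a balanced one (0.64, 0.64).
print(fused_score(0.81, 0.25), fused_score(0.64, 0.64))
```

The fused score thus demotes boxes whose localization quality is predicted to be poor even when their class confidence is high, which is the stated purpose of the IOU branch.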
The invention provides a novel vehicle detector for parking lot images. The proposed DARFNet model improves the accuracy of the detection results and markedly reduces the number of missed targets by using a specific feature fusion module and an IOU prediction branch module. The invention also achieves better performance than current state-of-the-art real-time target detection models. Visualization results show that the algorithm adapts to different scenes, both dense and sparse, and to different illumination conditions. The model can be widely applied to remote sensing detection tasks such as traffic monitoring and military target identification. In summary, our method makes the following contributions:
1) The invention further studies the feature fusion problem in target detection models and proposes the DARF module to fuse feature layers obtained by different convolutions. The module consists of a receptive field block with several densely connected layers and a lightweight channel attention module, so a larger receptive field can be obtained in the final feature map.
2) The invention provides a new IOU-value prediction branch: the predicted overlap ratio is obtained in the prediction stage when detection results are generated, and the product of the predicted overlap ratio and the classification branch's predicted value serves as the confidence of the prediction box. This branch avoids removing too many prediction boxes.
3) Comparison experiments between similar models and the model provided by the invention show that adopting the specific feature fusion module and the IOU branch module markedly improves detection performance. Test results on the CARPK and PUCPR datasets show that the proposed model performs best while retaining real-time processing speed.
Drawings
FIG. 1 shows a schematic flow diagram of the parking lot image real-time detector model of the present invention;
FIG. 2 shows a simplified flow diagram of the DARF module of the present invention, which includes a feature fusion block and a channel attention block;
FIG. 3 shows a visual comparison of the present invention with different real-time object detection methods on a parking lot dataset.
Detailed description of the invention
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to elements that are the same or similar in function throughout. The embodiments described below with reference to the drawings are exemplary, intended to explain the invention, and are not to be construed as limiting it. Referring to fig. 1, the method of an embodiment of the present invention operates as follows: S1, providing a parking lot image to be detected and performing feature extraction on it to obtain corresponding preliminary image features; S2, building a lightweight network and inputting the parking lot image to be detected into it to obtain further preliminary features; S3, building a multi-channel hybrid fusion self-attention network and inputting the preliminary features into it to obtain multi-channel fusion features; S4, building a prediction network comprising a classification network, an IOU network, and a regression network, and inputting the multi-channel fusion features into the prediction network to obtain the prediction result.
In step S1, a parking lot image of H x W (H and W respectively denote the length and width of the image) is input, and the multi-layer residual mapping features of the image are extracted through an SSD network and denoted R-Conv-1~4.
For step S2, the lightweight network is divided into four layers, each outputting features of a different level, denoted Q-Conv-1~4. For step S3, the fusion network has two inputs, namely the output features of steps S1 and S2.
For step S4, the IOU prediction branch consists of a convolution layer with a 3x3 kernel and a Sigmoid activation layer. The activation layer ensures that the predicted IOU value lies in the range [0, 1].
Let f_i be the confidence predicted by the classification branch, IOU_i the IOU value predicted by the IOU branch, a the parameter used to adjust the product ratio, and score_i the final predicted score. The confidence of each prediction box can then be expressed as:
score_i = f_i^a x IOU_i^(1-a)
the confidence level of the original prediction is replaced by the one with the largest score in the iteration of the NMS algorithm. May particularly be expressed as
Figure BDA0003273709220000082
In the formula bmIs a detection box with higher confidence selected in the iterative process of NMS algorithm, scorejIs the corresponding confidence, scorej' is the adjusted confidence and Ω is the threshold.

Claims (7)

1. A method for detecting vehicles in a parking lot based on a DARFNet model is characterized by comprising the following steps:
step S1: providing a parking lot image to be detected, and extracting features of the parking lot image by using a backbone network to obtain corresponding image features extracted by the backbone network.
Step S2: building a lightweight network, wherein the network comprises four convolutional layers; and inputting the parking lot image to be detected into the lightweight network to obtain the initial features extracted by the corresponding lightweight network.
Step S3: building a multi-channel fusion self-attention network, wherein the network comprises a connecting layer, a fusion layer and a convolution layer; and inputting the image preliminary features respectively obtained in the S1 and the S2 into a multi-channel fusion self-attention network to obtain multi-channel fusion features.
Step S4: building a DARF network module, wherein the network comprises a connecting layer, a fusion layer, a convolution layer and an average pooling layer; and inputting the mapping features obtained in the S3 into the DARF network to obtain multi-channel fusion features.
Step S5: building a network module of a fusion layer, wherein the network comprises a connecting layer, the fusion layer and a convolution layer; and inputting the mapping feature obtained in the step S4 into a feature fusion network to obtain a multi-channel fusion feature.
Step S6: building a prediction network, wherein the prediction network comprises a classification network, an IOU network and a regression network; and inputting the multi-channel fusion characteristics obtained in the step S5 into the prediction network to obtain a prediction result.
2. The method according to claim 1, wherein the input of step S1 is H × W (H, W respectively represents the length and width of the parking lot image) parking lot image, and the multi-layer residual mapping feature of the image is extracted through SSD network and represented by R-Conv-1 ~ 4.
3. The method according to claim 1, wherein the lightweight network of step S2 is not pre-trained. The network is divided into four layers, each outputting features of a different level, denoted Q-Conv-1~4.
4. The method according to claim 1, wherein the multi-channel fusion self-attention network of step S3 has two inputs, namely the output features of steps S1 and S2. Product and convolution operations are performed on the two input features. A total of four pairs of inputs yields four outputs, denoted C-Conv-1~4.
5. The method according to claim 1, wherein the input of step S4 is the mapping features C-Conv-1~4 obtained in step S3. This network module is called DARF; the DARF network includes convolution layers with three kernels of different sizes. It can be expressed as:
C_i = F_{3,i}(A), i = 1
C_i = F_{3,i}(Concat(A, C_{i-1})), i = 2, 3
wherein C_i denotes the output features at different levels, and F_{3,i} denotes dilated (hole) convolutions of different sizes applied to the features C-Conv-1~4 obtained in step S3. Finally, the outputs are obtained and denoted B-Conv-1~4.
6. The method of claim 5, wherein after B-Conv-1~4 are obtained, an attention mechanism module follows, in which only one fully-connected layer is used to learn the feature transformation. Attention weights are then obtained with a Sigmoid activation layer and multiplied with the original feature map. Finally, the feature map before the DARF module and the feature map after the dilated convolution are combined by accumulation.
7. The method of claim 1, wherein the IOU prediction branch of step S6 consists of a convolution layer with a 3x3 kernel and a Sigmoid activation layer. The activation layer ensures that the predicted IOU value lies in the range [0, 1].
Let f_i be the confidence predicted by the classification branch, IOU_i the IOU value predicted by the IOU branch, a the parameter used to adjust the product ratio, and score_i the final predicted score. The confidence of each prediction box can then be expressed as:
score_i = f_i^a x IOU_i^(1-a)
the confidence level of the original prediction is replaced by the one with the largest score in the iteration of the NMS algorithm. May particularly be expressed as
Figure FDA0003273709210000032
In the formula bmIs a detection box with higher confidence selected in the iterative process of NMS algorithm, scorejIs the corresponding confidence, score'jIs the adjusted confidence level and Ω is the threshold.
CN202111118334.5A 2021-09-22 2021-09-22 Parking lot vehicle detection method based on DARFNet model Pending CN113901897A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111118334.5A CN113901897A (en) 2021-09-22 2021-09-22 Parking lot vehicle detection method based on DARFNet model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111118334.5A CN113901897A (en) 2021-09-22 2021-09-22 Parking lot vehicle detection method based on DARFNet model

Publications (1)

Publication Number Publication Date
CN113901897A true CN113901897A (en) 2022-01-07

Family

ID=79029216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111118334.5A Pending CN113901897A (en) 2021-09-22 2021-09-22 Parking lot vehicle detection method based on DARFNet model

Country Status (1)

Country Link
CN (1) CN113901897A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114495060A (en) * 2022-01-25 2022-05-13 青岛海信网络科技股份有限公司 Road traffic marking identification method and device
CN114495060B (en) * 2022-01-25 2024-03-26 青岛海信网络科技股份有限公司 Road traffic marking recognition method and device
CN115410189A (en) * 2022-10-31 2022-11-29 松立控股集团股份有限公司 Complex scene license plate detection method
CN115908298A (en) * 2022-11-10 2023-04-04 苏州慧维智能医疗科技有限公司 Method for predicting polyp target in endoscopic image, model and storage medium
CN115908298B (en) * 2022-11-10 2023-10-10 苏州慧维智能医疗科技有限公司 Target prediction method, model and storage medium for polyp in endoscopic image

Similar Documents

Publication Publication Date Title
CN110059558B (en) Orchard obstacle real-time detection method based on improved SSD network
CN113901897A (en) Parking lot vehicle detection method based on DARFNet model
CN108875608B (en) Motor vehicle traffic signal identification method based on deep learning
JP2022515895A (en) Object recognition method and equipment
CN111222396A (en) All-weather multispectral pedestrian detection method
CN112990065B (en) Vehicle classification detection method based on optimized YOLOv5 model
CN110348383B (en) Road center line and double line extraction method based on convolutional neural network regression
CN107944354B (en) Vehicle detection method based on deep learning
KR101908481B1 (en) Device and method for pedestraian detection
CN112966747A (en) Improved vehicle detection method based on anchor-frame-free detection network
CN111461221A (en) Multi-source sensor fusion target detection method and system for automatic driving
CN115631344B (en) Target detection method based on feature self-adaptive aggregation
CN112115871B (en) High-low frequency interweaving edge characteristic enhancement method suitable for pedestrian target detection
CN113962281A (en) Unmanned aerial vehicle target tracking method based on Siamese-RFB
CN116597411A (en) Method and system for identifying traffic sign by unmanned vehicle in extreme weather
CN116630932A (en) Road shielding target detection method based on improved YOLOV5
CN115115973A (en) Weak and small target detection method based on multiple receptive fields and depth characteristics
CN114220087A (en) License plate detection method, license plate detector and related equipment
CN114049532A (en) Risk road scene identification method based on multi-stage attention deep learning
CN111160282B (en) Traffic light detection method based on binary Yolov3 network
CN113627481A (en) Multi-model combined unmanned aerial vehicle garbage classification method for smart gardens
CN113077496A (en) Real-time vehicle detection and tracking method and system based on lightweight YOLOv3 and medium
CN111062384A (en) Vehicle window accurate positioning method based on deep learning
CN111476075A (en) Object detection method and device based on CNN (convolutional neural network) by utilizing 1x1 convolution
CN112446292B (en) 2D image salient object detection method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination