CN110210350B - Rapid parking space detection method based on deep learning - Google Patents
Info
- Publication number
- CN110210350B (application CN201910429977.8A)
- Authority
- CN
- China
- Prior art keywords
- parking space
- image
- convolution
- dimension
- features
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/58—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
- G06V20/586—Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of parking space
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V20/00—Scenes; Scene-specific elements
- G06V20/50—Context or environment of the image
- G06V20/56—Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
- G06V20/588—Recognition of the road, e.g. of lane markings; Recognition of the vehicle driving pattern in relation to the road
Abstract
The invention relates to a rapid parking space detection method based on deep learning, belonging to the technical field of driving and intended to solve the problems of poor environmental adaptability and heavy model computation in parking space detection. The method comprises an offline step: acquiring image data containing parking spaces offline, establishing training and verification data sets, and training, evaluating and optimizing a neural network model, the neural network model being used for semantic segmentation of parking space sidelines in the image data; and an online step: acquiring image data containing parking spaces online, performing parking space sideline semantic segmentation with the trained neural network model to obtain sideline masks, fitting, clustering and combining the obtained sideline masks into geometric shapes composed of sidelines, and screening the geometric shapes by set shape discrimination conditions to determine the parking spaces. The invention has strong environmental adaptability; the adopted model is small, computationally light and undemanding of computing resources; the system is low-cost and has large-scale application potential.
Description
Technical Field
The invention relates to the technical field of driving, in particular to a rapid parking space detection method based on deep learning.
Background
Parking space detection and positioning are the basis of automatic parking and parking assistance systems. Among existing methods, non-deep-learning approaches detect parking spaces by manually extracting parking space sideline features, and such detection systems fail when the sideline markings are unclear, when building shadows or reflections from standing water are present, when the camera is blurred, and so on. Deep-learning-based methods, on the other hand, generally have large model sizes, heavy computation, high requirements on computing devices and high system cost, which is unfavorable for large-scale application and popularization on vehicles. A method with a robust detection rate and low system cost therefore has significant application value.
Disclosure of Invention
In view of the above analysis, the present invention aims to provide a rapid parking space detection method based on deep learning that solves the problems of poor environmental adaptability and heavy model computation in parking space detection.
The purpose of the invention is mainly realized by the following technical scheme:
a rapid parking space detection method based on deep learning comprises the following steps:
an off-line step: acquiring image data including parking spaces offline, and establishing a training and verification data set; training, evaluating and optimizing a neural network model; the neural network model is used for performing semantic segmentation on a parking space sideline in the image data;
an online step: acquiring image data containing parking spaces online, performing parking space sideline semantic segmentation by using the trained neural network model to obtain parking space sideline masks, and fitting, clustering and combining the obtained sideline masks to obtain geometric shapes composed of sidelines; and screening the geometric shapes according to set shape discrimination conditions to determine the parking spaces.
Further, the offline step specifically includes:
1) acquiring a plurality of groups of picture data containing parking spaces in an off-line manner, marking the side line areas of the parking spaces in the pictures, and constructing a training and verifying data set;
2) constructing a lightweight deep learning semantic segmentation model based on a channel compression convolution mode, and performing model parameter training by using a training data set;
3) establishing an evaluation standard, evaluating the trained model by using a verification data set, and adjusting model parameters;
4) and optimizing and accelerating the evaluated model.
Further, the lightweight deep learning semantic segmentation model based on the channel compression convolution mode comprises:
a preprocessing unit for performing convolution and maximum pooling on an input image of size W_1×H_1×3, reducing the image in the width and height dimensions, concatenating the convolution result with the maximum-pooling result, and outputting a preprocessed image of size W_2×H_2×N_2;
a down-sampling feature extraction unit for sequentially performing two stages of down-sampling processing on the preprocessed image, reducing the width and height dimensions of the image, extracting sideline semantic features, and outputting a down-sampled image of size W_3×H_3×N_3;
an up-sampling feature extraction unit for sequentially performing two stages of up-sampling processing on the down-sampled image, increasing the width and height dimensions of the image, recovering sideline semantic features, and outputting an up-sampled binary image of size W_2×H_2×2;
a model output unit for performing interpolation processing on the up-sampled binary image and outputting a binary image of size W_1×H_1×2;
the two-stage up-sampling process corresponds to the two-stage down-sampling process; wherein the first stage upsampling process corresponds to the second stage downsampling process, and the second stage upsampling process corresponds to the first stage downsampling process.
Furthermore, in each stage of down-sampling processing, the trunk structure first reduces the channel dimension of the input features through a 1×1 convolution kernel, then reduces the width and height dimensions through a 3×3 convolution kernel, and finally expands the channel dimension through a 1×1 convolution kernel to obtain the down-sampling trunk output; the lateral structure of each stage of down-sampling processing first reduces the width and height dimensions through a pooling operation and then expands the channel dimension through a 1×1 convolution kernel to obtain the down-sampling lateral output; finally, the trunk output and the lateral output are added element by element to obtain the down-sampling result.
Furthermore, in each stage of up-sampling processing, the trunk structure first reduces the channel dimension of the input features with a 1×1 convolution kernel, then extracts features and increases the width and height dimensions with a 3×3 deconvolution, and then expands the channel dimension with a 1×1 convolution to obtain the up-sampling trunk output; the input features of the lateral connection of each stage of up-sampling processing are the features of matching width and height dimensions output by the corresponding down-sampling processing, and these features are added element by element to the trunk output to obtain fused feature information from different levels.
Furthermore, each convolution layer is followed in sequence by a batch normalization layer, a linear mapping layer and a linear rectification layer, which perform batch normalization, linear mapping and linear rectification on the convolution result to realize normalization and nonlinear transformation of the output features.
Further, optimizing and accelerating the neural network model comprises:
a. extracting and fusing the parameters of all convolution layers, batch normalization layers and linear mapping layers in the model; the fused parameters are:
w_new = γ·w_old / √(var + ε)
b_new = γ·(b_old − mean) / √(var + ε) + β
wherein w_old is the convolution layer weight before fusion and b_old is the convolution layer bias before fusion;
γ and β are the parameters of the linear mapping layer;
mean and var are the mean and variance of all features in the normalization layer;
ε is a minimum value greater than 0;
b. quantizing the model to low precision using FP16.
Further, the online step specifically includes:
1) during the driving of the vehicle, a camera mounted on the vehicle is used to collect image information containing parking spaces online;
2) performing parking space sideline semantic segmentation on the picture information by using the neural network model;
3) scanning the parking space sideline semantic segmentation result line by line, extracting the center points of continuous regions in the segmentation result, and on this basis performing straight-line fitting with the Hough transform to obtain each sideline of the parking spaces contained in the picture;
4) and clustering and combining the sidelines to form a geometric area, and judging the geometric area meeting the parking space judging condition as the parking space according to the set parking space judging condition.
Further, clustering the sidelines includes:
1) judging whether the included angle Δθ between two straight line segments of the sidelines is smaller than a clustering angle threshold θ_T;
2) judging whether the distance between the two straight line segments is smaller than the pixel width of a parking space sideline in the image, the distance between two straight line segments being the distance from the center point of either segment to the line containing the other segment;
3) judging whether the distance between the nearest points of the two straight line segments is smaller than a threshold d_T, which is set according to L_s, the pixel length of the short side of the parking space in the image;
4) clustering straight line segments satisfying 1) to 3).
Further, the image data including the parking space is acquired by a calibrated camera installed on the vehicle, and the calibrating of the camera includes:
1) off-line calibration is carried out on the internal and external parameters of the camera, and the parameters are used for eliminating image distortion caused by imaging of a camera lens;
2) and off-line calibration is carried out on the inverse perspective transformation matrix of the camera, so that the forward-looking image is converted into a top view, and the shape distortion of the parking space caused by the perspective transformation imaging of the camera is eliminated.
The scheme of the invention can realize at least one of the following beneficial effects:
1. The method detects parking spaces with deep learning, giving strong environmental adaptability and good detection results under shadow, occlusion, ground reflection, worn parking space marking lines and similar conditions.
2. The model structure adopts a lightweight convolution module design based on channel compression; compared with a network built from standard convolution operations, the model is small, computationally light and undemanding of computing resources, so detection is efficient and the hardware performance requirements are low.
3. Merging network weights further improves network speed, reaching real-time detection on an embedded platform, so the system is low-cost and has large-scale application potential.
Drawings
The drawings are only for purposes of illustrating particular embodiments and are not to be construed as limiting the invention, wherein like reference numerals are used to designate like parts throughout.
Fig. 1 is a flowchart of a fast parking space detection method according to an embodiment of the present invention;
FIG. 2 is a diagram of a parking space segmentation network model architecture according to an embodiment of the present invention;
FIG. 3 is a diagram illustrating an overall calculation process of a standard convolution according to an embodiment of the present invention;
FIG. 4 is a diagram illustrating a single step standard convolution calculation process according to an embodiment of the present invention;
FIG. 5 is a diagram of a standard convolution wide high dimension information flow structure in an embodiment of the present invention;
fig. 6 is a diagram of a standard convolution channel information flow structure in an embodiment of the present invention.
Detailed Description
The preferred embodiments of the present invention will now be described in detail with reference to the accompanying drawings, which form a part hereof, and which together with the embodiments of the invention serve to explain the principles of the invention.
The embodiment discloses a rapid parking space detection method based on deep learning, as shown in fig. 1, comprising the following steps:
step S1, offline step: acquiring image data including parking spaces offline, and establishing a training and verification data set; training, evaluating and optimizing a neural network model; the neural network model is used for performing semantic segmentation on a parking space sideline in the image data;
the establishing process comprises the following steps:
1) acquiring a plurality of groups of image data containing parking spaces in an off-line manner, marking the side line areas of the parking spaces in the images, and constructing a training and verifying data set;
2) constructing a lightweight deep learning semantic segmentation model based on a channel compression convolution mode, and performing model parameter training by using a training data set;
3) establishing an evaluation standard, evaluating the trained model by using a verification data set, and adjusting model parameters;
4) and optimizing and accelerating the evaluated model.
Specifically, to make the trained neural network model more accurate, image data covering as many types of parking spaces and surroundings as possible are collected when constructing the training and verification data sets; moreover, the sample ratio of the training data set to the validation data set is approximately 5:1;
for example, 10000-30000 pictures are collected as training samples for the deep learning model training data set; meanwhile, 2000-6000 pictures are collected as verification samples for the model verification data set;
since the top view is output after the calibration of the camera in step S1, the samples in the training and verification data sets need to be converted into the top view, and the conversion method adopts the same inverse perspective transformation method as that in step S1.
Specifically, the parking space sideline regions of the samples in the training and verification data sets are marked manually in the top view;
in the marking process, open-source software such as the labelme tool is used for image pixel-level annotation: the parking space sideline regions are labeled 1 and the background regions are labeled 0.
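As an illustration of this annotation step, the sketch below rasterizes a labelme JSON file into the 0/1 sideline mask described above. It assumes labelme's standard JSON layout (a "shapes" list of polygons with "points"); the label name "parking_line" is a hypothetical convention, not something fixed by the patent.

```python
import json
import numpy as np
import cv2

def labelme_to_mask(json_path, height, width, line_label="parking_line"):
    """Rasterize labelme polygon annotations into a binary mask:
    parking space sideline pixels = 1, background pixels = 0."""
    with open(json_path, "r") as f:
        ann = json.load(f)
    mask = np.zeros((height, width), dtype=np.uint8)
    for shape in ann["shapes"]:           # standard labelme JSON layout
        if shape["label"] != line_label:  # hypothetical label name
            continue
        pts = np.array(shape["points"], dtype=np.int32)
        cv2.fillPoly(mask, [pts], 1)      # sideline region -> 1
    return mask
```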
Specifically, the lightweight deep learning semantic segmentation model based on the channel compression convolution mode is constructed on the open-source deep learning framework Caffe, and specifically includes:
a preprocessing unit for performing convolution and maximum pooling on an input image of size W_1×H_1×3, reducing the image in the width and height dimensions, concatenating the convolution result with the maximum-pooling result, and outputting a preprocessed image of size W_2×H_2×N_2; reducing the width and height dimensions in preprocessing greatly reduces the computation of subsequent processing.
A down-sampling feature extraction unit for sequentially performing two stages of down-sampling processing on the preprocessed image, reducing the width and height dimensions of the image, extracting sideline semantic features, and outputting a down-sampled image of size W_3×H_3×N_3;
an up-sampling feature extraction unit for sequentially performing two stages of up-sampling processing on the down-sampled image, increasing the width and height dimensions of the image, recovering sideline semantic features, and outputting an up-sampled binary image of size W_2×H_2×2;
a model output unit for performing interpolation processing on the up-sampled binary image and outputting a binary image of size W_1×H_1×2;
the two-stage up-sampling process corresponds to the two-stage down-sampling process; wherein the first stage upsampling process corresponds to the second stage downsampling process, and the second stage upsampling process corresponds to the first stage downsampling process.
For each stage of down-sampling processing, the trunk structure first reduces the channel dimension of the input features through a 1×1 convolution kernel, then reduces the width and height dimensions through a 3×3 convolution kernel, and finally expands the channel dimension through a 1×1 convolution kernel to obtain the down-sampling trunk output; the lateral structure first reduces the width and height dimensions through a pooling operation, then expands the channel dimension through a 1×1 convolution kernel to obtain the down-sampling lateral output; finally, the trunk output and the lateral output are added element by element to obtain the down-sampling result.
Preferably, after the first-stage downsampling processing, the image feature data is subjected to feature extraction through a certain number of serial same-dimensional feature extraction modules, and after output, the second-stage downsampling processing is performed; after the second-stage down-sampling processing, feature extraction is carried out on the image feature data through a certain number of serial same-dimensional feature extraction modules;
for each stage of up-sampling processing, the trunk structure first reduces the channel dimension of the input features with a 1×1 convolution kernel, then extracts features and raises the width and height dimensions with a 3×3 deconvolution, and then expands the channel dimension with a 1×1 convolution to obtain the up-sampling trunk output; the input features of the lateral connection of each stage of up-sampling processing are the features of matching width and height dimensions output by the corresponding down-sampling processing, and these features are added element by element to the trunk output to obtain fused feature information from different levels.
Preferably, after the first-stage up-sampling processing, feature extraction is performed on the image feature data through a certain number of series same-dimensional feature extraction modules, and after output, second-stage up-sampling processing is performed; and after the second-stage up-sampling processing, performing feature extraction on the image feature data through a same-dimension feature extraction module.
Preferably, after each convolution layer which is subjected to convolution operation, a batch normalization layer, a linear mapping layer and a linear rectification layer are sequentially connected, and batch normalization operation, linear mapping operation and linear rectification operation are carried out on convolution operation results to realize normalization of output characteristics, so that convergence speed of network training is accelerated.
The model will be described below by taking as an example an RGB color image having an input image width and height of 448 × 448 (i.e., an image size of 448 × 448 × 3). The network structure using the model is shown in fig. 2;
1) the preprocessing unit extracts features and reduces dimensions of the input image with a 3×3 standard convolution of stride 2, outputting features of size 224×224×13; in parallel, maximum pooling yields features of size 224×224×3; concatenating the two gives output features of size 224×224×16, which, after the batch normalization operation (BatchNorm), linear mapping operation and linear rectification operation, form the preprocessing output features.
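Purely as an illustration (the patent builds its network in Caffe), a PyTorch sketch of this preprocessing unit with the dimensions quoted above; the pooling kernel and the convolution padding are assumptions. Note that PyTorch's BatchNorm2d already contains affine scale/shift terms, which stand in for the separate batch normalization and linear mapping (Scale) layers.

```python
import torch
import torch.nn as nn

class PreprocessStem(nn.Module):
    """Preprocessing unit of the 448x448 example: a 3x3/stride-2 convolution
    producing 13 channels in parallel with a 2x max pooling of the 3 input
    channels; concatenation gives 13 + 3 = 16 channels at 224x224."""
    def __init__(self):
        super().__init__()
        self.conv = nn.Conv2d(3, 13, kernel_size=3, stride=2, padding=1,
                              bias=False)
        self.pool = nn.MaxPool2d(kernel_size=2, stride=2)
        self.bn = nn.BatchNorm2d(16)       # batch norm + linear mapping
        self.relu = nn.ReLU(inplace=True)  # linear rectification

    def forward(self, x):                  # x: (B, 3, 448, 448)
        y = torch.cat([self.conv(x), self.pool(x)], dim=1)
        return self.relu(self.bn(y))       # (B, 16, 224, 224)
```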
2) The down-sampling feature extraction unit extracts features and reduces the feature width and height dimensions. It contains two stages of down-sampling modules: the first-stage module takes features of scale 224×224×16 and outputs features of scale 112×112×64; it is followed by 3 serial conventional feature extraction modules of the same-dimension type, which extract features of scale 112×112×64 and feed them to the second-stage down-sampling module; the second-stage module outputs features of scale 56×56×128 and is followed by 16 serial same-dimension conventional feature extraction modules, whose 56×56×128 output is passed to the up-sampling feature extraction unit.
The down-sampling module is composed as follows: the trunk structure first uses 1×1 convolution to reduce the channel dimension to 1/4 of the input feature channel dimension, then performs a standard 3×3 convolution with stride 2, so the width and height of the output features fall to half those of the input while the channel dimension stays equal; 1×1 convolution then expands the feature channel dimension without changing the width and height. In the lateral connection structure, a max-pooling operation with stride 2 first reduces the feature width and height dimensions, and a 1×1 convolution then expands the feature channel dimension. The features output by the trunk structure and the lateral connection structure are then added element by element to realize feature fusion. After each convolution operation, batch normalization, linear mapping and linear rectification normalize the output features, accelerating the convergence of network training. In addition, the input features of this module can be connected to the features of corresponding scale in the later up-sampling module, fusing features of different levels and improving segmentation precision. (A code sketch of this module follows the conventional feature extraction module below.)
The conventional feature extraction module is constituted as follows: in the trunk structure, 1 × 1 convolution is firstly adopted for channel dimension reduction, then 3 × 3 standard convolution is adopted for feature extraction, and then a 1 × 1 convolution mode is adopted for channel dimension expansion; in the lateral connection structure, the pixel-by-pixel addition operation is directly carried out on the input features and the convolution output features of the main network; moreover, batch normalization operation is used after each convolution operation, and normalization of output characteristics is realized through linear mapping operation and linear rectification operation, so that convergence speed of network training is accelerated.
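Continuing the illustrative PyTorch sketch, the down-sampling module and the same-dimension conventional feature extraction module just described might look as follows; the 1/4 compression ratio in the conventional module is assumed to match the down-sampling module, and the kernel/padding choices are assumptions.

```python
import torch.nn as nn

def conv_bn_relu(cin, cout, k, stride=1):
    """Convolution -> batch normalization -> linear rectification, the
    per-convolution ordering used throughout the network (BatchNorm2d's
    affine terms supply the linear mapping / Scale step)."""
    return nn.Sequential(
        nn.Conv2d(cin, cout, k, stride=stride, padding=k // 2, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class DownsampleBlock(nn.Module):
    """Channel-compression down-sampling: trunk = 1x1 reduce to cin//4,
    3x3 convolution with stride 2, 1x1 expand; lateral = stride-2 max
    pooling plus 1x1 expansion; the two outputs are summed element-wise."""
    def __init__(self, cin, cout):
        super().__init__()
        mid = cin // 4
        self.trunk = nn.Sequential(
            conv_bn_relu(cin, mid, 1),
            conv_bn_relu(mid, mid, 3, stride=2),  # halves width/height
            conv_bn_relu(mid, cout, 1),
        )
        self.lateral = nn.Sequential(
            nn.MaxPool2d(2, stride=2),
            conv_bn_relu(cin, cout, 1),
        )

    def forward(self, x):
        return self.trunk(x) + self.lateral(x)

class SameDimBlock(nn.Module):
    """Same-dimension conventional feature extraction: a channel-compressed
    trunk (1x1 reduce -> 3x3 -> 1x1 expand) added element-wise to the
    unmodified input, which acts as the lateral connection."""
    def __init__(self, channels):
        super().__init__()
        mid = channels // 4                       # assumed compression ratio
        self.trunk = nn.Sequential(
            conv_bn_relu(channels, mid, 1),
            conv_bn_relu(mid, mid, 3),
            conv_bn_relu(mid, channels, 1),
        )

    def forward(self, x):
        return x + self.trunk(x)
```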
3) The up-sampling feature extraction unit expands the feature width and height dimensions and extracts features. It contains two stages of up-sampling modules: the first-stage module takes features of scale 56×56×128 and outputs features of scale 112×112×64; it is followed by 3 serial same-dimension conventional feature extraction modules, which extract features of scale 112×112×64 and feed them to the second-stage up-sampling module, whose input scale is 112×112×64 and output scale 224×224×16; the second-stage up-sampling module is followed by one serial conventional feature extraction module of the channel-dimension-reducing type, which extracts features of scale 224×224×2 and outputs them to the model output unit.
The up-sampling module is composed as follows: the trunk structure first uses 1×1 convolution to reduce the channel dimension to 1/4 of the input channel dimension, then uses 3×3 deconvolution to extract features and raise the width and height dimensions, and then expands the channel dimension with 1×1 convolution; the input features of the lateral connection structure are the features of corresponding width and height dimensions in the down-sampling module, which are added element by element to the features produced by the trunk convolutions, obtaining fused feature information from different levels and improving the segmentation accuracy of the network model. After each convolution operation, batch normalization, linear mapping and linear rectification realize normalization and nonlinear transformation of the output features, accelerating the convergence of network training.
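The up-sampling module in the same illustrative style, reusing conv_bn_relu from the previous sketch; the deconvolution padding/output-padding values are assumptions chosen so the width and height exactly double.

```python
import torch.nn as nn

class UpsampleBlock(nn.Module):
    """Channel-compression up-sampling: trunk = 1x1 reduce to cin//4,
    3x3 stride-2 deconvolution doubling width/height, 1x1 expand; the
    lateral input is the matching-resolution feature map from the
    down-sampling path, fused by element-wise addition."""
    def __init__(self, cin, cout):
        super().__init__()
        mid = cin // 4
        self.trunk = nn.Sequential(
            conv_bn_relu(cin, mid, 1),   # conv_bn_relu from sketch above
            nn.ConvTranspose2d(mid, mid, 3, stride=2, padding=1,
                               output_padding=1, bias=False),
            nn.BatchNorm2d(mid),
            nn.ReLU(inplace=True),
            conv_bn_relu(mid, cout, 1),
        )

    def forward(self, x, skip):
        return self.trunk(x) + skip      # fuse down-path features
```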
4) The model output unit performs interpolation processing on the input features of scale 224×224×2 and outputs a binary image of size 448×448×2;
through the network model, a 448×448 RGB color image is thus converted into a 448×448×2 binary image, in which parking space sideline pixels are 1 and non-sideline pixels are 0.
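Tying the pieces together, a sketch of the full 448×448 example assembled from the PreprocessStem, DownsampleBlock, SameDimBlock and UpsampleBlock classes above, with the module counts from the description; the output unit's interpolation is rendered here as bilinear up-sampling, and the plain 1×1 head is a stand-in for the channel-reducing conventional module.

```python
import torch.nn as nn
import torch.nn.functional as F

class ParkingLineSegNet(nn.Module):
    """Illustrative assembly of the segmentation network of fig. 2."""
    def __init__(self):
        super().__init__()
        self.stem = PreprocessStem()                  # 448 -> 224, 16 ch
        self.down1 = DownsampleBlock(16, 64)          # 224 -> 112
        self.stage1 = nn.Sequential(*[SameDimBlock(64) for _ in range(3)])
        self.down2 = DownsampleBlock(64, 128)         # 112 -> 56
        self.stage2 = nn.Sequential(*[SameDimBlock(128) for _ in range(16)])
        self.up1 = UpsampleBlock(128, 64)             # 56 -> 112
        self.stage3 = nn.Sequential(*[SameDimBlock(64) for _ in range(3)])
        self.up2 = UpsampleBlock(64, 16)              # 112 -> 224
        self.head = nn.Conv2d(16, 2, kernel_size=1)   # channel-reducing head

    def forward(self, x):                             # (B, 3, 448, 448)
        s = self.stem(x)                              # (B, 16, 224, 224)
        d1 = self.stage1(self.down1(s))               # (B, 64, 112, 112)
        d2 = self.stage2(self.down2(d1))              # (B, 128, 56, 56)
        u1 = self.stage3(self.up1(d2, d1))            # lateral: 112x112x64
        u2 = self.up2(u1, s)                          # lateral: 224x224x16
        out = self.head(u2)                           # (B, 2, 224, 224)
        # model output unit: interpolate back to the input resolution;
        # per-pixel argmax over the 2 channels yields the 0/1 sideline mask
        return F.interpolate(out, scale_factor=2,
                             mode="bilinear", align_corners=False)
```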
The feature channel dimensionality reduction is carried out by a large amount of 1 multiplied by 1 convolution in the network model, so that the calculated amount of a feature extraction unit in the neural network model is greatly reduced, and meanwhile, a good detection effect can be guaranteed. The specific analysis is as follows:
In deep convolutional neural networks, a large part of the computation comes from convolution layers and fully connected layers; in a semantic segmentation network, which uses few fully connected layers, convolution operations occupy most of the computation. The standard convolution calculation process is shown in fig. 3 and fig. 4. In a standard convolution operation, the width, height and channel number of the input features are denoted W, H and N respectively, the convolution layer has M kernels, and each kernel has dimension K×K×N, where K is the kernel scale and N matches the channels of the input vector. The kernel slides along the width and height directions of the image with the set stride, multiplying pixel by pixel with the corresponding image patch and summing; the result at each position represents the kernel's response over that local region of the input, and traversing all positions of the input image gives that kernel's output feature. With stride 1, the convolution result has the same width and height as the input features, i.e. H×W×1 per kernel, and the output of all kernels in the convolution layer has dimension H×W×M.
For standard convolution with a step size of 1, the multiplications of one convolution kernel over the input amount to H×W×N×K^2;
therefore, in the convolution layer, the total multiplication computation performed on the input features is H×W×N×K^2×M.
Taking a 3×3 convolution operation as an example, the information flows in the image width-height spatial dimensions and the channel dimension are shown in fig. 5 and fig. 6.
The connection in the image width and height dimensions is local (within a 3×3 pixel neighborhood), while the connection in the channel dimension is dense, i.e. all channels are connected to each other, so the computation in the channel dimension is the same as in a fully connected layer. For this reason, the convolution mode adopted in this embodiment is a convolution method based on a channel-compression module, which greatly reduces the computation, as follows:
with an input feature dimension of H×W×N, an output feature dimension of H×W×M and a compressed channel dimension of H×W×C, the computation of the compression convolution, the intermediate conventional convolution and the expansion convolution is H×W×N×C, H×W×K^2×C^2 and H×W×C×M respectively;
compared with standard convolution, the ratio of computation is therefore (H×W×N×C + H×W×K^2×C^2 + H×W×C×M) / (H×W×N×K^2×M) = (N×C + K^2×C^2 + C×M) / (N×K^2×M).
In the case of 64 input and output channels and a compressed channel dimension of 16, the method based on channel-compressed convolution amounts to 9.7% of the standard convolution. The channel-compression convolution method thus has a far more economical computational overhead than the standard convolution method, and compared with mainstream segmentation networks, the network structure designed in this patent has a smaller model size and higher computing speed.
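As a worked check of these counts (the H×W factor cancels in the ratio), the small script below evaluates both formulas for the quoted configuration. Under this straightforward counting the ratio comes to about 11.8%, the same order as the 9.7% reported above, whose exact counting convention is not spelled out; treat the script as illustrative.

```python
def standard_conv_mults(H, W, N, M, K):
    """Multiplications of a stride-1 standard KxK convolution layer:
    H*W positions x M kernels x N*K*K products each."""
    return H * W * N * K * K * M

def compressed_conv_mults(H, W, N, M, C, K):
    """Channel-compression module: 1x1 compress (N -> C), KxK convolution
    on C channels, 1x1 expand (C -> M)."""
    return H * W * (N * C + K * K * C * C + C * M)

N = M = 64      # input/output channels from the example above
C, K = 16, 3    # compressed channel dimension, kernel scale
H = W = 112     # cancels in the ratio
ratio = (compressed_conv_mults(H, W, N, M, C, K)
         / standard_conv_mults(H, W, N, M, K))
print(f"{ratio:.1%}")   # ~11.8% under this counting
```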
Through tests, without network acceleration optimization the lightweight neural network model reaches 20 ms/frame on an NVIDIA GTX1060 graphics card, runs at 103 ms/frame on the embedded artificial intelligence platform NVIDIA TX2, and has a network model size of 2.7 MB. After the subsequent neural network optimization, the running speed reaches about 30 ms/frame on the embedded NVIDIA TX2 platform.
Preferably, during the training process the neural network model is evaluated on the verification data set using pixel segmentation accuracy, i.e. the ratio of the number of pixels correctly segmented by the network to the number of all pixels, and the network is trained with this as the optimization target. The training platform is NVIDIA TITAN X, the training solver is the Adam solver, the number of training steps is 70000-100000, and 6 images are input to the neural network per step, the number depending on the performance of the graphics card used for training.
Preferably, the optimization and acceleration of the neural network model includes:
a. the parameters of all convolution layers, batch normalization layers (BatchNorm layers) and linear mapping layers in the model are extracted and fused, which greatly reduces the network computation. The fusion principle is as follows:
the batch normalization layer (BatchNorm layer) and the linear mapping layer (Scale layer) accelerate the training and convergence of the neural network through data normalization. At deployment, however, the network only performs forward inference and no longer back-propagates parameter updates; the batch normalization and linear mapping layers then amount to fixed linear transformations of the data, producing redundant computation that slows the network. Since every batch normalization layer and linear mapping layer in the network follows a convolution layer, their parameters can be fused directly into the convolution layer parameters.
The calculation formula in the batch normalization layer is x_norm = (x − mean) / √(var + ε), in which mean and var are the mean and variance of all the features in the normalization layer, and ε is a minimum value larger than 0 that prevents the denominator from being 0;
the data is linearly transformed in the linear mapping layer (Scale layer) with the formula y = γ·x_norm + β, in which γ and β are the linear mapping layer parameters;
the fused parameters are preferably:
w_new = γ·w_old / √(var + ε)
b_new = γ·(b_old − mean) / √(var + ε) + β
in which w_old is the convolution layer weight before fusion and b_old is the convolution layer bias before fusion.
Meanwhile, once model training is finished the convolution layer parameters are fixed, so they can be multiplied directly into the combined result of the two layers, merging the three network layers into one. After parameter merging the model is effectively compressed: the network model size drops from 2.7 MB to 1.8 MB, a compression ratio of 33%; inference time falls from 103 ms to about 78 ms, a speed-up of nearly 24.3%, an obvious improvement.
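A sketch of the fusion computation for one convolution layer, directly implementing the w_new/b_new formulas above (numpy; the array shapes in the docstring are the usual convolution-layer layout). In a Caffe deployment, the fused weights simply replace the original convolution parameters and the BatchNorm/Scale layers are dropped from the network definition.

```python
import numpy as np

def fuse_conv_bn_scale(w_old, b_old, mean, var, gamma, beta, eps=1e-5):
    """Fold batch-norm statistics (mean, var) and Scale parameters
    (gamma, beta) into the preceding convolution:
        w_new = gamma * w_old / sqrt(var + eps)
        b_new = gamma * (b_old - mean) / sqrt(var + eps) + beta
    Shapes: w_old (M, N, K, K); b_old, mean, var, gamma, beta (M,)."""
    scale = gamma / np.sqrt(var + eps)
    w_new = w_old * scale[:, None, None, None]  # scale each output channel
    b_new = scale * (b_old - mean) + beta
    return w_new, b_new
```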
b. The model is quantized to low precision using FP16, further reducing the model computation and memory demand so the model can run detection in real time on an embedded platform. Network training uses 32-bit floating point operations; because the many local connections of the neural network give it strong adaptive capacity, replacing the parameters with low-precision 16-bit half-precision floating point roughly doubles the computation speed without an obvious drop in the segmentation result. After the TensorRT-based half-precision quantization acceleration, the network's running time on the NVIDIA TX2 platform is compressed from 78 ms to 31 ms, a segmentation frame rate above 30 frames per second.
Step S2, online step: acquiring image data containing parking spaces online, performing parking space sideline semantic segmentation by using the trained neural network model to obtain parking space sideline masks, and fitting, clustering and combining the obtained sideline masks to obtain geometric shapes composed of sidelines; and screening the geometric shapes according to set shape discrimination conditions to determine the parking spaces.
The method specifically comprises the following steps:
1) during the driving of the vehicle, a camera mounted on the vehicle is used to collect image information containing parking spaces online;
2) performing parking space sideline semantic segmentation on the picture information with the neural network model established in the offline step S1 to obtain parking space sideline masks;
3) scanning the parking space sideline semantic segmentation result line by line, extracting the center points of continuous regions in the segmentation result, and on this basis performing straight-line fitting with the Hough transform to obtain each sideline of the parking spaces contained in the picture.
4) Clustering and combining sidelines and judging the geometrical shape of the parking space:
In the image, the same sideline may be split into several straight line segments by occlusion, shadow or sideline wear; the straight line segments fitted by the Hough transform are therefore clustered, and segments satisfying all of the following positional relations are merged into the same straight line segment, reducing the computation and the mismatching rate when combining parking space sidelines (a code sketch of these tests follows the criteria below):
a. the two straight line segments are approximately parallel to each other (|Δθ| < θ_T, θ_T = 5°);
b. the distance between the straight line segments is smaller than the pixel width of a parking space sideline in the image, the distance between straight line segments being defined as the distance from the center point of either segment to the line containing the other segment;
c. the distance between the nearest points of the two straight line segments is smaller than a threshold d_T, which is set according to L_s, the pixel length of the short side of the parking space in the image.
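An illustrative sketch of the fitting and clustering tests above: row-wise center-point extraction, OpenCV Hough line fitting, and the pairwise criteria a-c. All numeric thresholds are placeholders, and d_gap stands in for the L_s-based threshold d_T.

```python
import numpy as np
import cv2

def fit_sideline_segments(mask):
    """Row-by-row center points of continuous mask runs, then Hough
    straight-line fitting. mask: uint8 HxW, sideline pixels nonzero."""
    centres = np.zeros_like(mask)
    for r, row in enumerate(mask):
        cols = np.flatnonzero(row)
        if cols.size == 0:
            continue
        # split the row into continuous runs and keep each run's center
        runs = np.split(cols, np.where(np.diff(cols) > 1)[0] + 1)
        for run in runs:
            centres[r, int(run.mean())] = 255
    segs = cv2.HoughLinesP(centres, 1, np.pi / 180, threshold=30,
                           minLineLength=40, maxLineGap=10)
    return [] if segs is None else [s[0] for s in segs]  # (x1, y1, x2, y2)

def same_sideline(seg_a, seg_b, theta_t_deg=5.0, d_line=8.0, d_gap=60.0):
    """Clustering test per criteria a-c: near-parallel, small
    center-to-line distance, small nearest-endpoint gap."""
    ax1, ay1, ax2, ay2 = map(float, seg_a)
    bx1, by1, bx2, by2 = map(float, seg_b)
    ang_a = np.arctan2(ay2 - ay1, ax2 - ax1)
    ang_b = np.arctan2(by2 - by1, bx2 - bx1)
    d_theta = abs((ang_a - ang_b + np.pi / 2) % np.pi - np.pi / 2)
    if np.degrees(d_theta) >= theta_t_deg:                 # criterion a
        return False
    cx, cy = (bx1 + bx2) / 2, (by1 + by2) / 2              # center of b
    dist = (abs((ax2 - ax1) * (ay1 - cy) - (ax1 - cx) * (ay2 - ay1))
            / np.hypot(ax2 - ax1, ay2 - ay1))
    if dist >= d_line:                                     # criterion b
        return False
    gap = min(np.hypot(pa[0] - pb[0], pa[1] - pb[1])       # criterion c
              for pa in ((ax1, ay1), (ax2, ay2))
              for pb in ((bx1, by1), (bx2, by2)))
    return gap < d_gap
```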
In an actual scene, a parking space is a rectangle or parallelogram enclosed by 4 or 3 sidelines (specially shaped parking markings are outside the detection scope of the technology in this patent), and the enclosed area, the parallelism and the distance between opposite sides all conform to certain standards. Therefore, the scale of a standard parking space is first calibrated according to the size of actual parking spaces in the image; combinations of 4 sidelines and of 3 sidelines are then selected by full enumeration from the extracted straight line segments, and the parking space geometric shapes are screened by the following rules to obtain the parking spaces conforming to the geometric relations (a code sketch follows these rules):
a. the area enclosed by the sidelines is near the calibrated area, i.e. |S − S_C| < S_d, in which S_C is the calibrated standard area, equal to the arithmetic mean of the areas of 10-30 randomly selected standard parking spaces in the top-view image, and S_d is the threshold for the difference between the detected parking space area and the calibrated area, equal to half the range of the areas of those 10-30 standard parking spaces;
b. opposite sidelines are parallel: with θ_d denoting the included angle between the sidelines, |θ_d| < 5°; in the case of three sidelines, only the single pair of opposite sides is judged by this condition;
c. the distance between opposite sidelines is within a certain range: for the pair of long sides, |D_l − D_Cl| < D_dl, in which D_Cl is the calibrated standard distance between the long sides, equal to the arithmetic mean, over 10-30 randomly selected standard parking spaces in the top-view image, of the distance from the center point of either long side to the line containing the other long side, and D_dl is the threshold for the difference between the detected and calibrated long-side distances, equal to half the range of the long-side distances of those 10-30 standard parking spaces; the same constraint applies to the distance between the short sides, i.e. |D_s − D_Cs| < D_ds.
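A sketch of rules a-c for the four-sideline case (the three-sideline case, which checks a single pair of opposite sides, is omitted); the calibrated values s_c, s_d, d_cl, d_dl, d_cs, d_ds would come from the 10-30 calibration parking spaces described above.

```python
import numpy as np

def is_parking_space(corners, s_c, s_d, d_cl, d_dl, d_cs, d_ds):
    """Screen a 4-corner candidate: area near the calibrated standard
    (rule a), opposite sides parallel within 5 degrees (rule b), and
    opposite-side spacings near the calibrated long/short-side distances
    (rule c). corners: four (x, y) vertices in order around the shape."""
    p = np.asarray(corners, dtype=float)
    x, y = p[:, 0], p[:, 1]
    area = 0.5 * abs(np.dot(x, np.roll(y, -1)) - np.dot(y, np.roll(x, -1)))
    if abs(area - s_c) >= s_d:                              # rule a
        return False
    sides = [p[(i + 1) % 4] - p[i] for i in range(4)]
    for a, b in ((0, 2), (1, 3)):                           # rule b
        d = (np.arctan2(sides[a][1], sides[a][0])
             - np.arctan2(sides[b][1], sides[b][0]))
        d = abs((d + np.pi / 2) % np.pi - np.pi / 2)        # fold to [0, pi/2]
        if np.degrees(d) >= 5.0:
            return False

    def spacing(i, j):
        """Distance from the center of side i to the line through side j."""
        c = (p[i] + p[(i + 1) % 4]) / 2
        q, v = p[j], sides[j]
        return (abs(v[0] * (c[1] - q[1]) - v[1] * (c[0] - q[0]))
                / np.hypot(v[0], v[1]))

    s02, s13 = spacing(0, 2), spacing(1, 3)                 # rule c
    return ((abs(s02 - d_cl) < d_dl and abs(s13 - d_cs) < d_ds) or
            (abs(s13 - d_cl) < d_dl and abs(s02 - d_cs) < d_ds))
```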
Preferably, the image data including the parking space is acquired by a calibrated camera mounted on the vehicle; by calibrating the camera, the image distortion caused by the camera and the shooting angle is removed, and the problem of parking space deformation caused by the perspective effect is solved;
the method specifically comprises the following steps:
1) carrying out offline calibration on internal and external parameters of the camera:
the camera introduces image distortion when capturing pictures; the internal and external parameters of the camera are obtained by calibration using images captured by the camera, the images are corrected, and the image distortion caused by camera lens imaging is removed.
2) Calibrating an inverse perspective transformation matrix of the camera:
because the camera is mounted on the vehicle, the imaged shape of a parking space is distorted by the perspective effect; calibrating the inverse perspective transformation matrix of the camera converts the forward-looking image into an overlooking top view, solving the parking space shape distortion caused by the perspective imaging of the camera.
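An illustrative OpenCV sketch of both calibration uses: undistortion with offline-calibrated intrinsics, then the inverse perspective mapping to a top view. The intrinsic values and the four point correspondences are placeholders; in practice they come from the offline calibration (e.g., cv2.calibrateCamera on checkerboard images and measured ground points).

```python
import numpy as np
import cv2

# Placeholder intrinsics and distortion coefficients from offline calibration
camera_matrix = np.array([[700.0, 0.0, 320.0],
                          [0.0, 700.0, 240.0],
                          [0.0, 0.0, 1.0]])
dist_coeffs = np.array([-0.3, 0.1, 0.0, 0.0, 0.0])

def to_top_view(frame, src_pts, dst_pts, out_size=(448, 448)):
    """Undistort a forward-looking frame, then warp it to a top view.
    src_pts: four ground points in the undistorted image; dst_pts: their
    top-view positions (both 4x2, float32)."""
    undist = cv2.undistort(frame, camera_matrix, dist_coeffs)
    ipm = cv2.getPerspectiveTransform(np.float32(src_pts),
                                      np.float32(dst_pts))
    return cv2.warpPerspective(undist, ipm, out_size)
```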
More preferably, the online step may employ two cameras mounted on the left and right sides of the vehicle, detecting the parking spaces on each side and projecting them into the vehicle body coordinate system, thereby enlarging the detection range. This scheme detects both perpendicular and parallel parking spaces well.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention.
Claims (7)
1. A rapid parking space detection method based on deep learning is characterized by comprising the following steps:
an off-line step: acquiring image data including parking spaces offline, and establishing a training and verification data set; training, evaluating and optimizing a neural network model; the neural network model is used for performing semantic segmentation on a parking space sideline in the image data; the neural network model is a lightweight deep learning semantic segmentation model;
an online step: acquiring image data containing parking spaces online, performing parking space sideline semantic segmentation by using the trained neural network model to obtain parking space sideline masks, and fitting, clustering and combining the obtained sideline masks to obtain geometric shapes composed of sidelines; screening the geometric shapes according to set shape discrimination conditions to determine the parking spaces;
clustering the sidelines includes:
1) judging whether the included angle Δθ between two straight line segments of the sidelines is smaller than a clustering angle threshold θ_T;
2) judging whether the distance between the two straight line segments is smaller than the pixel width of a parking space sideline in the image, the distance between two straight line segments being the distance from the center point of either segment to the line containing the other segment;
3) judging whether the distance between the nearest points of the two straight line segments is smaller than a threshold d_T, which is set according to L_s, the pixel length of the short side of the parking space in the image;
4) clustering straight line segments satisfying 1) to 3);
the lightweight deep learning semantic segmentation model comprises a preprocessing unit, a downsampling feature extraction unit, an upsampling feature extraction unit and a model output unit;
the preprocessing unit is used for reducing the dimensions of the width dimension and the height dimension of the input image;
the down-sampling feature extraction unit is used for extracting features and reducing the feature width and height dimensions; the down-sampling feature extraction unit has two stages of down-sampling modules;
each stage of the down-sampling module is composed as follows: a trunk structure first uses 1×1 convolution to reduce the channel dimension to 1/4 of the input feature channel dimension, then performs a standard 3×3 convolution with stride 2, so the width and height of the output features fall to half those of the input while the channel dimension stays equal to the input; 1×1 convolution then expands the feature channel dimension without changing the width and height; a lateral connection structure first reduces the feature width and height dimensions with a max-pooling operation of stride 2 and then expands the feature channel dimension with a 1×1 convolution; the features output by the trunk structure and the lateral connection structure are then added element by element to realize feature fusion;
an up-sampling feature extraction unit for expanding the width and height dimensions of the features and extracting features, obtaining an up-sampled binary image with the same width and height as the image output by the preprocessing unit; the up-sampling feature extraction unit has two stages of up-sampling modules;
each stage of the up-sampling module is composed as follows: a trunk structure first uses 1×1 convolution to reduce the channel dimension to 1/4 of the input channel dimension, then uses 3×3 deconvolution to extract features and raise the width and height dimensions, and then expands the channel dimension with 1×1 convolution; the input features of the lateral connection are the features of matching width and height dimensions from the down-sampling module, which are added element by element to the features produced by the trunk convolutions, obtaining fused feature information from different levels;
and the model output unit is used for performing interpolation processing on the up-sampled binary image to output a binary image with the same width and height as the input image, wherein parking space sideline pixels in the binary image are 1 and non-sideline pixels are 0.
2. The rapid parking space detection method according to claim 1,
the offline step specifically comprises:
1) acquiring a plurality of groups of picture data containing parking spaces in an off-line manner, marking the side line areas of the parking spaces in the pictures, and constructing a training and verifying data set;
2) constructing a lightweight deep learning semantic segmentation model based on a channel compression convolution mode, and performing model parameter training by using a training data set;
3) establishing an evaluation standard, evaluating the trained model by using a verification data set, and adjusting model parameters;
4) and optimizing and accelerating the evaluated model.
3. The rapid parking space detection method according to claim 2, wherein the lightweight deep learning semantic segmentation model based on the channel compression convolution mode comprises:
a preprocessing unit for performing convolution and maximum pooling on an input image of size W_1×H_1×3, reducing the image in the width and height dimensions, concatenating the convolution result with the maximum-pooling result, and outputting a preprocessed image of size W_2×H_2×N_2;
a down-sampling feature extraction unit for sequentially performing two stages of down-sampling processing on the preprocessed image, reducing the width and height dimensions of the image, extracting sideline semantic features, and outputting a down-sampled image of size W_3×H_3×N_3;
an up-sampling feature extraction unit for sequentially performing two stages of up-sampling processing on the down-sampled image, increasing the width and height dimensions of the image, recovering sideline semantic features, and outputting an up-sampled binary image of size W_2×H_2×2;
a model output unit for performing interpolation processing on the up-sampled binary image and outputting a binary image of size W_1×H_1×2;
the two-stage up-sampling process corresponds to the two-stage down-sampling process; wherein the first stage upsampling process corresponds to the second stage downsampling process, and the second stage upsampling process corresponds to the first stage downsampling process.
4. The rapid parking space detection method according to any one of claims 1 to 3, wherein each convolution layer that performs convolution operation is sequentially connected with a batch normalization layer, a linear mapping layer and a linear rectification layer, and the batch normalization operation, the linear mapping operation and the linear rectification operation are performed on the convolution operation result to realize normalization and nonlinear transformation of output characteristics.
5. The rapid parking space detection method according to claim 4,
the optimization and acceleration of the neural network model includes,
a. extracting and fusing the parameters of all convolution layers, batch normalization layers and linear mapping layers in the model; the fused parameters are:
w_new = γ·w_old / √(var + ε)
b_new = γ·(b_old − mean) / √(var + ε) + β
wherein w_old is the convolution layer weight before fusion and b_old is the convolution layer bias before fusion;
γ and β are the parameters of the linear mapping layer;
mean and var are the mean and variance of all features in the normalization layer;
ε is a minimum value greater than 0;
b. quantizing the model to low precision using FP16.
6. The rapid parking space detection method according to claim 1,
the online steps specifically include:
1) during the driving of the vehicle, a camera mounted on the vehicle is used to collect image information containing parking spaces online;
2) performing parking space sideline semantic segmentation on the picture information by using the neural network model;
3) scanning the parking space sideline semantic segmentation result line by line, extracting the center points of continuous regions in the segmentation result, and on this basis performing straight-line fitting with the Hough transform to obtain each sideline of the parking spaces contained in the picture;
4) and clustering and combining the sidelines to form a geometric area, and judging the geometric area meeting the parking space judging condition as the parking space according to the set parking space judging condition.
7. The rapid parking space detection method according to claim 1, wherein the image data including the parking space is acquired by a calibrated camera mounted on the vehicle, and the calibration of the camera comprises:
1) off-line calibration is carried out on the internal and external parameters of the camera, and the parameters are used for eliminating image distortion caused by imaging of a camera lens;
2) and off-line calibration is carried out on the inverse perspective transformation matrix of the camera, so that the forward-looking image is converted into a top view, and the shape distortion of the parking space caused by the perspective transformation imaging of the camera is eliminated.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910429977.8A CN110210350B (en) | 2019-05-22 | 2019-05-22 | Rapid parking space detection method based on deep learning |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910429977.8A CN110210350B (en) | 2019-05-22 | 2019-05-22 | Rapid parking space detection method based on deep learning |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110210350A CN110210350A (en) | 2019-09-06 |
CN110210350B true CN110210350B (en) | 2021-12-21 |
Family ID: 67788167
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910429977.8A Active CN110210350B (en) | 2019-05-22 | 2019-05-22 | Rapid parking space detection method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110210350B (en) |
Families Citing this family (14)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110991452B (en) * | 2019-12-03 | 2023-09-19 | 深圳市捷顺科技实业股份有限公司 | Parking space frame detection method, device, equipment and readable storage medium |
JP7346267B2 (en) * | 2019-12-04 | 2023-09-19 | キヤノン株式会社 | Information processing device, relay device, system, information processing device control method, relay device control method and program |
CN110992267A (en) * | 2019-12-05 | 2020-04-10 | 北京科技大学 | Abrasive particle identification method based on DPSR and Lightweight CNN |
CN111179272B (en) * | 2019-12-10 | 2024-01-05 | 中国科学院深圳先进技术研究院 | Rapid semantic segmentation method for road scene |
CN111178236B (en) * | 2019-12-27 | 2023-06-06 | 清华大学苏州汽车研究院(吴江) | Parking space detection method based on deep learning |
CN111368846B (en) * | 2020-03-19 | 2022-09-09 | 中国人民解放军国防科技大学 | Road ponding identification method based on boundary semantic segmentation |
CN112365434B (en) * | 2020-11-10 | 2022-10-21 | 大连理工大学 | Unmanned aerial vehicle narrow passage detection method based on double-mask image segmentation |
CN112600221B (en) * | 2020-12-08 | 2023-03-03 | 深圳供电局有限公司 | Reactive compensation device configuration method, device, equipment and storage medium |
CN112560945B (en) * | 2020-12-14 | 2024-08-09 | 珠海格力电器股份有限公司 | Equipment control method and system based on emotion recognition |
CN112991171B (en) * | 2021-03-08 | 2023-07-28 | Oppo广东移动通信有限公司 | Image processing method, device, electronic equipment and storage medium |
CN113283429B (en) * | 2021-07-21 | 2021-09-21 | 四川泓宝润业工程技术有限公司 | Liquid level meter reading method based on deep convolutional neural network |
CN113658268B (en) * | 2021-08-04 | 2024-07-12 | 智道网联科技(北京)有限公司 | Verification method and device for camera calibration result, electronic equipment and storage medium |
CN113762272B (en) * | 2021-09-10 | 2024-06-14 | 北京精英路通科技有限公司 | Road information determining method and device and electronic equipment |
CN114882727B (en) * | 2022-03-15 | 2023-09-05 | 深圳市德驰微视技术有限公司 | Parking space detection method based on domain controller, electronic equipment and storage medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106373426B (en) * | 2016-09-29 | 2019-02-12 | 成都通甲优博科技有限责任公司 | Parking stall based on computer vision and violation road occupation for parking monitoring method |
CN107516110B (en) * | 2017-08-22 | 2020-02-18 | 华南理工大学 | Medical question-answer semantic clustering method based on integrated convolutional coding |
- 2019-05-22: CN application CN201910429977.8A, granted as CN110210350B, status Active
Also Published As
Publication number | Publication date |
---|---|
CN110210350A (en) | 2019-09-06 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||