CN113269040A - Driving environment sensing method combining image recognition and laser radar point cloud segmentation - Google Patents

Driving environment sensing method combining image recognition and laser radar point cloud segmentation

Info

Publication number
CN113269040A
CN113269040A (application CN202110445391.8A)
Authority
CN
China
Prior art keywords
point cloud
image
laser radar
category
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110445391.8A
Other languages
Chinese (zh)
Inventor
俞扬
詹德川
周志华
余德丛
袁雷
余峰
黄军富
陈雄辉
张云天
庞竟成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University
Original Assignee
Nanjing University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University
Priority to CN202110445391.8A
Publication of CN113269040A
Legal status: Pending

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/26 - Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 - Segmentation of patterns in the image field by performing operations on regions, e.g. growing, shrinking or watersheds

Abstract

The invention discloses a driving environment perception method combining image recognition and laser radar point cloud segmentation, which comprises the following steps: (1) collecting ground laser radar point cloud data and image data on a real road; (2) using the collected image data as a reference, calibrating the laser radar point cloud data against the image data and labelling the collected laser radar point cloud data; (3) initializing a point cloud segmentation network, training it on the labelled laser radar point cloud data and updating the network parameters; (4) transplanting the trained network to the industrial personal computer of the unmanned vehicle to obtain the category of the object to which each point belongs; (5) recognizing the image data; (6) fusing the segmented laser radar point cloud data with the recognized image data to obtain the accurate positions of the road and surrounding objects. The invention perceives the environment in real time and overcomes the poor recognition performance of image recognition in bad weather and poor light.

Description

Driving environment sensing method combining image recognition and laser radar point cloud segmentation
Technical Field
The invention relates to a driving environment perception method combining image recognition and laser radar point cloud segmentation, and belongs to the technical field of unmanned driving environment perception.
Background
As artificial intelligence has moved into applications such as speech recognition, recommendation systems and intelligent robots, the demand for applying it to unmanned driving has become increasingly urgent. A prerequisite of unmanned driving technology is the ability to perceive the surrounding environment accurately, identifying drivable areas and objects "like a person" does. Conventional methods rely mainly on image recognition to identify drivable areas and objects. Although image recognition is fast and generally effective for this task, it performs poorly on distant objects and under bad weather or poor lighting.
In recent years, both academia and industry have begun to use lidar for environment perception. A lidar emits laser beams into the surrounding environment; the beams return when they hit an obstacle, and the distance and reflection intensity of the target are computed from the time difference between emission and return. Lidar-based perception is therefore unaffected by weather and light and produces stable output, which has made lidar essential hardware in the environment perception modules of many unmanned-driving companies.
Existing laser radar point cloud segmentation algorithms have several disadvantages: 1. a lack of labelled data; 2. large network structures and excessive running time; 3. loss of the spatial structure information between points.
Disclosure of Invention
The purpose of the invention is as follows: to overcome the poor anti-interference capability of image recognition and its poor recognition of distant objects in unmanned-driving environment perception, the invention provides a driving environment perception method combining image recognition and laser radar point cloud segmentation. The algorithm has high recognition accuracy and a good, stable recognition effect.
In addition, an image recognition algorithm is combined with the RangeNet++ laser radar point cloud segmentation algorithm: the point cloud is segmented with RangeNet++, image recognition is performed on the camera images, and the two recognition results are fused, so that laser radar point cloud segmentation compensates for the shortcomings of image recognition.
The technical scheme is as follows: a driving environment perception method combining image recognition and laser radar point cloud segmentation collects laser radar point cloud data while the unmanned vehicle is driven manually, labels the point cloud by category, and trains a classification model with a deep learning network; the network judges the category of each point from its (x, y, z, remission) information. After training, the network is deployed on an unmanned vehicle equipped with a laser radar, a camera and an industrial personal computer. During unmanned driving the network is loaded, surrounding point cloud and image data are collected in real time, the point cloud categories are predicted by the network, the image data is recognized with an image recognition algorithm, and the results of the two modules are fused to perceive the environment accurately. The invention comprises the following steps:
(1) manually driving an unmanned vehicle to run on a real road, and collecting ground laser radar point cloud data and image data;
(2) using the collected image data as a reference, calibrating the laser radar point cloud data and the image data, and marking the collected laser radar point cloud data;
(3) initializing a point cloud segmentation network, training it on the labelled laser radar point cloud data with the goal of reducing the error between the true categories and the predicted categories, and updating the network parameters until the network converges or the maximum number of training iterations is reached;
(4) transplanting the trained network to the industrial personal computer of the unmanned vehicle, and segmenting the laser radar point cloud collected in real time with the point cloud segmentation network while the unmanned vehicle drives, to obtain the category of the object to which each point belongs;
(5) identifying the image data using an image recognition algorithm;
(6) fusing the segmented laser radar point cloud data with the recognized image data to obtain the positions of roads and objects, so as to perceive the environment accurately.
To segment the laser radar point cloud collected in real time, the three-dimensional point cloud is mapped to a two-dimensional plane image, and the two-dimensional plane image is segmented with a deep learning technique to obtain the category of each pixel in the two-dimensional plane image; the category of each pixel is then mapped back to the three-dimensional point cloud, and a clustering operation is performed on the mapped three-dimensional point cloud categories to eliminate the burrs and shadows generated during mapping, yielding the segmented laser radar point cloud.
In the data collection stage, the unmanned vehicle is equipped with a laser radar and a camera; the vehicle is driven manually and ground laser radar point cloud data and picture data are collected. Each frame of laser radar point cloud data is stored in a 4 × H × W format, where the first dimension of size 4 holds the (x, y, z, remission) information, H represents the vertical resolution of the laser radar and W represents its horizontal resolution.
The labelled point cloud data has shape 5 × H × W, where the first dimension holds the (x, y, z, remission, label) information.
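To make the storage layout concrete, the following NumPy sketch (an illustration, not code from the patent; the resolutions H = 64 and W = 2048 are assumed values for a 64-beam sensor) packs one sweep into the 4 × H × W format and appends the label channel to form the 5 × H × W labelled tensor.

```python
import numpy as np

H, W = 64, 2048                                   # assumed lidar vertical/horizontal resolution
frame = np.zeros((4, H, W), dtype=np.float32)     # channels: x, y, z, remission
labels = np.zeros((1, H, W), dtype=np.float32)    # per-point class id added during annotation

labelled_frame = np.concatenate([frame, labels], axis=0)   # (x, y, z, remission, label)
assert labelled_frame.shape == (5, H, W)
```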
The labelled three-dimensional point cloud data is mapped to a two-dimensional plane image. For a point with coordinates (x, y, z), its pixel coordinates (u, v) in the plane image are computed as

u = (1/2) · (1 − arctan2(y, x)/π) · w
v = (1 − (arcsin(z/r) + f_down)/f) · h

where u and v respectively represent the abscissa and ordinate of the point after mapping to the two-dimensional plane image, w and h respectively represent the width and height of the mapped image, r represents the range from the point to the origin, f_up and f_down represent the absolute values of the maximum and minimum laser radar ray pitch angles, and f represents their sum, so as to obtain a tensor of shape (w, h, 5).
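A NumPy sketch of this spherical projection is given below. It follows the RangeNet++-style mapping the patent builds on; the image size and the vertical field of view (3° up, 25° down) are assumptions for a 64-beam lidar, not values taken from the patent.

```python
import numpy as np

def spherical_projection(points, h=64, w=2048, fov_up_deg=3.0, fov_down_deg=25.0):
    """Project one lidar sweep (N, 4) of (x, y, z, remission) onto a 2D range image."""
    x, y, z, remission = points[:, 0], points[:, 1], points[:, 2], points[:, 3]
    r = np.linalg.norm(points[:, :3], axis=1) + 1e-8     # range to the origin
    f_up, f_down = np.radians(fov_up_deg), np.radians(fov_down_deg)
    f = f_up + f_down                                    # total vertical field of view

    # pixel coordinates from the projection formula above
    u = 0.5 * (1.0 - np.arctan2(y, x) / np.pi) * w
    v = (1.0 - (np.arcsin(z / r) + f_down) / f) * h
    u = np.clip(np.floor(u), 0, w - 1).astype(np.int64)
    v = np.clip(np.floor(v), 0, h - 1).astype(np.int64)

    image = np.zeros((5, h, w), dtype=np.float32)        # channels: x, y, z, r, remission
    image[:, v, u] = np.stack([x, y, z, r, remission])   # later points overwrite earlier ones
    return image
```

Points that land on the same pixel simply overwrite one another in this sketch; in practice they are usually sorted by decreasing range first so that the closest point wins.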
The deep learning technique used for laser radar point cloud segmentation segments the point cloud data with a segmentation network whose main body consists of down-sampling coding blocks, up-sampling coding blocks and a category calculation. The down-sampling coding blocks down-sample the plane image, reducing the time required for processing; the up-sampling coding blocks up-sample the plane image and add the previously down-sampled feature maps of matching dimensions to restore detail, gradually recovering the input dimensions; the category calculation processes the recovered plane image to obtain the category of each pixel, the error between the output categories and the true categories is computed, and the network is updated until convergence or the maximum number of training iterations is reached.
The down-sampling coding block is realized by the following steps:
11) The two-dimensional plane image and the ranges of the corresponding points are concatenated into a tensor of 5 × h × w, where h and w respectively represent the height and width of the plane image; the 1st dimension holds the x, y, z, r and remission information of the point corresponding to each pixel, x, y and z respectively represent the three-dimensional coordinates of the point, r represents the distance from the point to the host vehicle, and remission represents the reflection intensity of the point.
12) The input image is down-sampled with a sampling step size of 2 × 1; h (the height) of the down-sampled image is unchanged and w (the width) becomes one half of the original value.
13) The input image is convolved: the down-sampled image is processed with a residual block consisting of two CONV + BN + ReLU layers.
14) Repeating the steps 12) to 13) for a preset number of times.
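A minimal PyTorch sketch of one such down-sampling stage is shown below. The use of a strided 3 × 3 convolution for the 2 × 1 down-sampling and the channel counts are assumptions; only the overall structure (halve the width, keep the height, then a two-layer CONV + BN + ReLU residual block) follows the steps above.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Two CONV + BN + ReLU layers with a skip connection (step 13 above)."""
    def __init__(self, channels):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return x + self.body(x)

class DownBlock(nn.Module):
    """One down-sampling stage: halve the width only, keep the height, then refine."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        # stride (1, 2) in PyTorch's (height, width) order keeps h and halves w,
        # matching the 2 x 1 step with h unchanged described in step 12 above
        self.down = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=(1, 2), padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.res = ResidualBlock(out_ch)

    def forward(self, x):
        return self.res(self.down(x))

# Example: a (1, 5, 64, 2048) range image becomes (1, 32, 64, 1024)
features = DownBlock(5, 32)(torch.randn(1, 5, 64, 2048))
```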
The up-sampling coding block is realized by the following steps:
21) The input image is up-sampled with a sampling step size of 2 × 1; the height of the up-sampled image is unchanged and its width is twice that of the input. In addition, the data of the same dimensions saved during the down-sampling process (the corresponding level) is added back to recover the detail lost by down-sampling.
22) The input image is convolved: the up-sampled image is processed with a residual block consisting of two CONV + BN + ReLU layers.
23) Repeat 21) -22) until the data is restored to the dimensions of the original image.
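The corresponding up-sampling stage could look like the sketch below, which reuses the ResidualBlock from the previous sketch; nearest-neighbour interpolation for the 2 × 1 up-sampling and the addition of the skip tensor are assumptions consistent with steps 21) to 23), not code from the patent.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpBlock(nn.Module):
    """One up-sampling stage: double the width, add the skip tensor saved during
    down-sampling, then refine with the two-layer residual block."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.reduce = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch),
            nn.ReLU(inplace=True),
        )
        self.res = ResidualBlock(out_ch)   # ResidualBlock from the previous sketch

    def forward(self, x, skip):
        x = F.interpolate(x, scale_factor=(1, 2), mode="nearest")  # keep height, double width
        x = self.reduce(x)
        x = x + skip                       # restore detail lost during down-sampling
        return self.res(x)

# Example: (1, 32, 64, 1024) features plus a (1, 16, 64, 2048) skip tensor -> (1, 16, 64, 2048)
out = UpBlock(32, 16)(torch.randn(1, 32, 64, 1024), torch.randn(1, 16, 64, 2048))
```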
The category calculation comprises the following steps:
31) A 1 × 1 convolution is applied to the image restored to the original dimensions to generate an n × h × w image, where n represents the total number of categories.
32) The probability of each category is computed for the output image with the softmax function,

ŷ_c = exp(logit_c) / Σ_{c'} exp(logit_{c'}),

and the category with the highest probability is selected as the category of the pixel, where c represents a category, logit_c represents the value output by the convolution for category c at the corresponding pixel, and ŷ_c represents the probability of that category after the softmax calculation.
33) The network parameters are updated according to the loss function

L = − Σ_c w_c · y_c · log(ŷ_c),  with  w_c = 1 / log(f_c + ε),

where f_c indicates the frequency of occurrence of category c, ε is a very small number that keeps (f_c + ε) away from zero so that the logarithm is defined, y_c indicates whether the true category of the corresponding point is c (1 if yes, 0 if not), and ŷ_c represents the probability that the output category of the corresponding pixel is c. The loss function aims to reduce the error between the output category and the true category while addressing class imbalance.
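The weighted cross-entropy above maps directly onto PyTorch's CrossEntropyLoss, as in the sketch below; the class frequencies, the value of ε and the tensor shapes are made-up numbers for illustration, and f_c is taken here as the number of labelled points of class c so that log(f_c + ε) stays positive.

```python
import torch
import torch.nn as nn

def class_weights(counts, eps=1e-3):
    # w_c = 1 / log(f_c + eps), as in the loss formula above
    return 1.0 / torch.log(counts + eps)

counts = torch.tensor([7.0e6, 2.5e6, 5.0e5])                   # hypothetical per-class point counts f_c
criterion = nn.CrossEntropyLoss(weight=class_weights(counts))  # softmax + weighted negative log-likelihood

logits = torch.randn(2, 3, 64, 2048, requires_grad=True)       # network output: (batch, n classes, h, w)
labels = torch.randint(0, 3, (2, 64, 2048))                    # true class of every pixel
loss = criterion(logits, labels)                               # -sum_c w_c * y_c * log(softmax_c), averaged
predictions = logits.argmax(dim=1)                             # class with the highest probability per pixel
loss.backward()                                                # gradients used to update the network
```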
The neural network classification results are clustered with a KNN method, which removes the burrs and shadows generated when the two-dimensional plane is mapped back to the three-dimensional space: for each pixel of the two-dimensional plane, its category is determined from the categories of the surrounding pixels.
The two-dimensional plane is mapped back to the three-dimensional space to obtain the category of each point in the original three-dimensional space. The error between the true category and the output category of each point in the three-dimensional space is used as the loss function, and the network is optimized with minimizing this loss as the training objective until the network converges or the maximum number of training iterations is reached.
The laser radar and the camera are calibrated. The camera is calibrated with the Autoware Calibration Toolkit to obtain the camera parameters (f_u, f_v, u_0, v_0), where (f_u, f_v) are the scale factors along the X and Y axes of the image plane and (u_0, v_0) is the centre point of the plane. The camera and the laser radar are then jointly calibrated with the Calibration Toolkit to obtain a rotation matrix R and a translation vector t. The radar point cloud coordinates are converted to image coordinates by

z_c · [u, v, 1]^T = K · (R · [x, y, z]^T + t),  with  K = [[f_u, 0, u_0], [0, f_v, v_0], [0, 0, 1]],

where (x, y, z) are the three-dimensional coordinates of a point in the point cloud, (u, v) are its coordinates after conversion into the image, and z_c is the depth of the point in the camera coordinate system.
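A sketch of this conversion under the usual pinhole camera model is given below; after a point is transformed by R and t, the formula above reduces to u = f_u · x_c / z_c + u_0 and v = f_v · y_c / z_c + v_0. Variable names are illustrative, not taken from the patent.

```python
import numpy as np

def lidar_to_image(points_xyz, R, t, fu, fv, u0, v0):
    """Project lidar points (N, 3) into pixel coordinates using the calibration above."""
    cam = points_xyz @ R.T + t          # rotate and translate into the camera frame
    zc = cam[:, 2]
    in_front = zc > 1e-6                # keep only points in front of the image plane
    u = fu * cam[:, 0] / zc + u0
    v = fv * cam[:, 1] / zc + v0
    return np.stack([u, v], axis=1), in_front
```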
The trained radar point cloud segmentation model is used to segment the laser radar point cloud of the unmanned vehicle, image recognition is used to recognize the image data, and the two results are fused by converting the laser radar point cloud into the image according to the formula above, so that accurate environment perception for unmanned driving is achieved.
Compared with the prior art, the invention has the following advantages:
the method is not interfered by factors such as weather, light and the like, and the identification effect is stable. The method can still show a good recognition effect under the conditions of poor weather, light and the like.
The objects identified by the method carry three-dimensional spatial information: they have depth information as well as planar information, and the recognition quality for near and distant objects differs little.
The method can segment one frame within the single-frame interval of the laser radar, achieving real-time performance.
Drawings
Fig. 1 is an overall frame diagram of the present invention.
FIG. 2 is a schematic view of the lidar-camera fusion of the present invention.
FIG. 3 is a flow chart of network training according to the present invention.
Detailed Description
The present invention is further illustrated by the following examples, which are intended to be purely exemplary and are not intended to limit the scope of the invention, as various equivalent modifications of the invention will occur to those skilled in the art upon reading the present disclosure and fall within the scope of the appended claims.
The driving environment perception method combining image recognition and laser radar point cloud segmentation provides a laser radar point cloud segmentation algorithm: the laser radar point cloud is segmented with this algorithm, and the image data is recognized with a mature image recognition algorithm. The segmented point cloud and the recognized image are then fused to obtain accurate environment perception.
In the laser radar point cloud segmentation algorithm, a method is provided for mapping the three-dimensional point cloud to a two-dimensional plane image; the two-dimensional plane image is segmented with a deep learning technique to obtain the category of each pixel in the two-dimensional plane image; the category of each pixel is then mapped back to the three-dimensional point cloud, and a clustering operation is performed on the mapped three-dimensional point cloud categories to eliminate the burrs and shadows generated during mapping, yielding the segmented laser radar point cloud.
The method for mapping the three-dimensional point cloud to the two-dimensional plane image is as follows: for a point with coordinates (x, y, z), its coordinates in the two-dimensional plane are computed as

u = (1/2) · (1 − arctan2(y, x)/π) · w
v = (1 − (arcsin(z/r) + f_down)/f) · h

where u and v respectively represent the abscissa and ordinate of the point after mapping to the two-dimensional plane image, w and h respectively represent the width and height of the mapped image, r represents the distance from the point to the host vehicle, f_up and f_down respectively represent the absolute values of the maximum and minimum laser radar ray pitch angles, and f represents the sum of the two.
The deep learning technique used for laser radar point cloud segmentation segments the point cloud data with a segmentation network whose main body consists of down-sampling coding blocks, up-sampling coding blocks and a category calculation. The down-sampling coding blocks down-sample the plane image, reducing the time required for processing; the up-sampling coding blocks up-sample the plane image and add the previously down-sampled feature maps of matching dimensions to restore detail, gradually recovering the input dimensions; the category calculation processes the recovered plane image to obtain the category of each pixel, the error between the output categories and the true categories is computed, and the network is updated until convergence or the maximum number of training iterations is reached.
The implementation of the down-sampling coding block comprises the following steps:
(1) The obtained two-dimensional plane image and the ranges of the corresponding points are concatenated into a tensor of 5 × h × w, where h and w respectively represent the height and width of the plane image; the 1st dimension holds the x, y, z, r and remission information of the point corresponding to each pixel, x, y and z respectively represent the three-dimensional coordinates of the point, r represents the distance from the point to the host vehicle, and remission represents the reflection intensity of the point.
(2) The input image is down-sampled with a sampling step size of 1 × 2; the height of the down-sampled image is unchanged and its width becomes one half of the input.
(3) The input image is convolved: the down-sampled image is processed with a residual block using a two-layer CONV + BN + ReLU structure.
(4) Steps (2) and (3) are repeated a number of times chosen from training experience.
The implementation of the upsampling coding block comprises the following steps:
(1) The input image is up-sampled with a sampling step size of 1 × 2; the height of the up-sampled image is unchanged and its width is 2 times the input width. In addition, the feature map of the same dimensions from the down-sampling process is added back to restore the detail lost by down-sampling.
(2) The input image is convolved: the up-sampled image is processed with a residual block using a two-layer CONV + BN + ReLU structure.
(3) Steps (1) and (2) are repeated until the dimensions of the original image are restored.
The category calculation comprises the following steps:
(1) A 1 × 1 convolution is applied to the image restored to the original dimensions to generate an n × h × w image, where n represents the total number of categories.
(2) The probability of each category is computed for the output image with the softmax function,

ŷ_c = exp(logit_c) / Σ_{c'} exp(logit_{c'}),

where c represents a category, logit_c represents the value output by the convolution for category c at the corresponding pixel, and ŷ_c represents the probability of that category after the softmax calculation.
(3) The network parameters are updated according to the loss function

L = − Σ_c w_c · y_c · log(ŷ_c),  with  w_c = 1 / log(f_c + ε),

where f_c indicates the frequency of occurrence of category c, ε is a very small number that keeps (f_c + ε) away from zero so that the logarithm is defined, y_c represents the true category of the corresponding point, and ŷ_c represents the output probability of the corresponding pixel for category c. The loss function aims to reduce the error between the output category and the true category while addressing class imbalance.
For the clustering of the three-dimensional point cloud categories, each pixel in the segmented plane image is taken as the centre pixel of an S × S sliding window, the S² surrounding pixels are stored, and the absolute value of the difference between the range of each pixel and the range of the centre pixel is computed; the probability of each pixel being selected is computed from two standard normal distributions, the absolute range difference is multiplied by this probability to obtain a distance, the pixels are sorted by this distance, the categories of the first K points are counted, and the category with the largest number of occurrences is taken as the final category.
In the image recognition algorithm, an open-source Fast-RCNN framework and model are used to process the images received from the camera in real time and obtain detection results for objects, drivable areas and driving paths.
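As an example of this step, the sketch below runs an off-the-shelf detector from torchvision as a stand-in for the open-source Fast-RCNN framework mentioned above; the exact model, weights and confidence threshold are assumptions, not the patent's configuration.

```python
import torch
import torchvision

model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 480, 640)              # one RGB camera frame, values in [0, 1]
with torch.no_grad():
    detections = model([image])[0]           # dict with 'boxes', 'labels', 'scores'
keep = detections["scores"] > 0.5            # illustrative confidence threshold
boxes = detections["boxes"][keep]            # (M, 4) boxes as (x1, y1, x2, y2)
labels = detections["labels"][keep]          # class index of each kept box
```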
The image recognition results and the laser radar point cloud segmentation results are fused: the laser radar and the camera are calibrated to obtain the transformation parameters from the laser radar to the camera, and the point cloud segmentation results are converted into the camera coordinate system, realizing the fusion of the two sensors.
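A minimal fusion rule under these conventions might look as follows: each segmented lidar point is projected into the image with the lidar_to_image function from the earlier sketch, and points falling inside a detected 2D box adopt (or confirm) that box's class. This is an illustrative rule, not the patent's exact fusion procedure.

```python
import numpy as np

def fuse_labels(uv, in_front, point_labels, boxes, box_labels):
    """Overwrite point classes with the class of the 2D detection box they project into."""
    fused = point_labels.copy()
    for (x1, y1, x2, y2), cls in zip(boxes, box_labels):
        inside = in_front & (uv[:, 0] >= x1) & (uv[:, 0] <= x2) \
                          & (uv[:, 1] >= y1) & (uv[:, 1] <= y2)
        fused[inside] = cls
    return fused
```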
Combining image recognition and lidar point cloud segmentation for unmanned environment perception also requires the following hardware: (1) a laser radar, for collecting point cloud information of the surrounding environment; (2) a monocular camera, for collecting picture data of the surrounding environment; (3) an industrial personal computer, for radar point cloud segmentation, image recognition and fusion.
FIG. 1 is the overall framework diagram of the present invention: point cloud data is collected with the laser radar and preprocessed, the preprocessed data with categories is used to train the point cloud segmentation network, the real-time laser radar point cloud is segmented with the trained network, the image is recognized with an image recognition algorithm, and the two recognition results are fused.
Fig. 2 is a schematic diagram of the integration of a laser radar and a camera in the present invention.
FIG. 3 is the network training flowchart of the present invention, which is divided into three steps: the first is the down-sampling encoding process, the second is the up-sampling encoding process, and the third is the category calculation. In the down-sampling encoding process, the preprocessed image is first convolved (with a two-layer residual block), then down-sampled, then convolved again, and down-sampling is repeated an appropriate number of times. In the up-sampling encoding process, the encoding result is up-sampled, the image data of matching dimensions from the corresponding down-sampling step is added, a convolution is applied, and so on until the original dimensions are restored; a final 1 × 1 convolution produces an output of dimensions C × h × w, where C is the total number of labels, the probability of each label is computed with softmax, and the category with the highest probability is selected for each pixel.
The following is the pseudocode of the KNN algorithm of the present invention: an S × S sliding window collects the S² pixels around each pixel of the segmentation plane; for each of the S² pixels, its selection probability is multiplied by the absolute difference between its range and that of the centre pixel to obtain a value; the values are sorted, and the label with the highest number of occurrences among the first K elements is selected as the label of the pixel.
(KNN pseudocode given as figures in the original.)
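A plain NumPy sketch of that post-processing is given below for reference; the window size, the value of K, and the interpretation of the "two standard normal distributions" as a separable Gaussian weight over the window offsets are assumptions, not details taken from the patent figures.

```python
import numpy as np

def knn_postprocess(range_image, label_image, S=5, K=5):
    """Re-label each pixel by a majority vote over its K range-nearest neighbours.

    range_image: (h, w) float ranges; label_image: (h, w) non-negative integer class ids.
    """
    h, w = range_image.shape
    half = S // 2
    offs = np.arange(-half, half + 1)
    g = np.exp(-0.5 * offs ** 2) / np.sqrt(2.0 * np.pi)     # standard normal pdf over offsets
    prob = np.outer(g, g)
    prob /= prob.sum()                                      # selection probability of each window pixel

    out = label_image.copy()
    pad_r = np.pad(range_image, half, mode="edge")
    pad_l = np.pad(label_image, half, mode="edge")
    for i in range(h):
        for j in range(w):
            win_r = pad_r[i:i + S, j:j + S]
            win_l = pad_l[i:i + S, j:j + S]
            dist = np.abs(win_r - range_image[i, j]) * prob  # weighted range difference, as described
            nearest = np.argsort(dist, axis=None)[:K]        # indices of the K smallest values
            votes = win_l.reshape(-1)[nearest].astype(np.int64)
            out[i, j] = np.bincount(votes).argmax()          # most frequent class wins
    return out
```

A practical implementation vectorizes this window loop (or runs it on the GPU) so that the clean-up fits within the lidar frame interval mentioned above.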

Claims (10)

1. A driving environment perception method combining image recognition and laser radar point cloud segmentation, characterized by comprising the following steps:
(1) manually driving an unmanned vehicle to run on a real road, and collecting ground laser radar point cloud data and image data;
(2) using the collected image data as a reference, calibrating the laser radar point cloud data and the image data, and marking the collected laser radar point cloud data;
(3) initializing a point cloud segmentation network, and training marked laser radar point cloud data until the network converges or runs to the maximum training times;
(4) transplanting the trained network to the industrial personal computer of the unmanned vehicle to obtain the category of the object to which each point of the point cloud belongs;
(5) identifying the image data by using an image identification algorithm;
(6) fusing the segmented laser radar point cloud data with the image data after image recognition to obtain the positions of the road and objects.
2. The driving environment sensing method combining image recognition and laser radar point cloud segmentation as claimed in claim 1, wherein the laser radar point cloud collected in real time is segmented by mapping the three-dimensional point cloud to a two-dimensional plane image and segmenting the two-dimensional plane image with a deep learning technique to obtain the category of each pixel in the two-dimensional plane image; the category of each pixel is then mapped back to the three-dimensional point cloud, and a clustering operation is performed on the mapped three-dimensional point cloud categories to eliminate the burrs and shadows generated during mapping, yielding the segmented laser radar point cloud.
3. The method as claimed in claim 1, wherein the step of collecting data includes installing a laser radar and a camera on the unmanned vehicle, manually driving the unmanned vehicle, and collecting ground laser radar point cloud data and picture data, each frame of laser radar point cloud data being stored in a 4 × H × W format, wherein the first dimension of size 4 holds the (x, y, z, remission) information, H represents the vertical resolution of the laser radar, and W represents the horizontal resolution of the laser radar.
4. The method of claim 1, wherein the labeled three-dimensional point cloud data is mapped to a two-dimensional plane image; for a point with coordinates (x, y, z), its pixel coordinates (u, v) in the plane image are computed as

u = (1/2) · (1 − arctan2(y, x)/π) · w
v = (1 − (arcsin(z/r) + f_down)/f) · h

where u and v respectively represent the abscissa and ordinate of the point after mapping to the two-dimensional plane image, w and h respectively represent the width and height of the mapped image, r represents the range from the point to the origin, f_up and f_down represent the absolute values of the maximum and minimum laser radar ray pitch angles, and f represents their sum, so as to obtain a tensor of shape (w, h, 5).
5. The method as claimed in claim 2, wherein the deep learning technique segments the point cloud data with a segmentation network whose main part comprises a down-sampling coding block, an up-sampling coding block and a category calculation; the down-sampling coding block down-samples the plane image, reducing the time required for processing; the up-sampling coding block up-samples the plane image, adds the previously down-sampled feature map of matching dimensions to restore detail, and gradually recovers the input dimensions; and the category calculation processes the recovered plane image to obtain the category of each pixel of the plane, the error between the output category and the true category is computed, and the network is updated until convergence or the maximum number of training iterations is reached.
6. The method for sensing driving environment in combination with image recognition and lidar point cloud segmentation as claimed in claim 5, wherein the down-sampling encoding block is implemented by:
11) the two-dimensional plane image and the ranges of the corresponding points are concatenated into a tensor of 5 × h × w; h and w respectively represent the height and width of the plane image, the 1st dimension holds the x, y, z, r and remission information of the point corresponding to each pixel, x, y and z respectively represent the three-dimensional coordinates of the point, r represents the distance from the point to the host vehicle, and remission represents the reflection intensity of the point;
12) down-sampling the input image, wherein the sampling step length is 2 multiplied by 1, h of the down-sampled image is unchanged, and w is changed to be one half of the original value;
13) performing convolution on an input image, and performing convolution on a downsampled image by using a residual block, wherein the residual block uses two layers of CONV + BN + ReLU networks;
14) repeating the steps 12) to 13) for a preset number of times.
7. The method for sensing driving environment in combination with image recognition and lidar point cloud segmentation as claimed in claim 5, wherein the upsampling coding block is implemented by:
21) the input image is up-sampled with a sampling step size of 2 × 1, the height of the up-sampled image being unchanged and its width being twice that of the input; in addition, the data of the same dimensions saved during the down-sampling process is added back, reducing the loss of detail caused by down-sampling;
22) performing convolution on an input image, and performing convolution on an up-sampled image by using a residual block, wherein the residual block uses a two-layer CONV + BN + ReLU network;
23) repeat 21) -22) until the data is restored to the dimensions of the original image.
8. The method of claim 5, wherein the class calculation comprises the steps of:
31) performing 1 × 1 convolution on the image restored to the original dimension to generate an n × h × w image, wherein n represents the total number of categories;
32) the probability of each category is computed for the output image with the softmax function,

ŷ_c = exp(logit_c) / Σ_{c'} exp(logit_{c'}),

and the category with the highest probability is selected as the category of the pixel, where c represents a category, logit_c represents the value output by the convolution for category c at the corresponding pixel, and ŷ_c represents the probability of that category after the softmax calculation;
33) the network parameters are updated according to the loss function

L = − Σ_c w_c · y_c · log(ŷ_c),  with  w_c = 1 / log(f_c + ε),

where f_c indicates the frequency of occurrence of category c, ε is a very small number that keeps (f_c + ε) away from zero so that the logarithm is defined, y_c indicates whether the true category of the corresponding point is c, and ŷ_c represents the probability that the output category of the corresponding pixel is c.
9. The method as claimed in claim 2, wherein the three-dimensional point cloud categories are clustered as follows: each pixel in the segmented plane image is taken as the centre pixel of an S × S sliding window, the S² surrounding pixels are stored, and the absolute value of the difference between the range of each pixel and the range of the centre pixel is computed; the probability of each pixel being selected is computed from two standard normal distributions, the absolute range difference is multiplied by this probability to obtain a distance, the pixels are sorted by this distance, the categories of the first K points are counted, and the category with the largest number of occurrences is taken as the final category.
10. The method of claim 1, wherein the method requires the following hardware for operation: (1) a laser radar, for collecting point cloud information of the surrounding environment; (2) a monocular camera, for collecting picture data of the surrounding environment; (3) an industrial personal computer, for radar point cloud segmentation, image recognition and fusion.
CN202110445391.8A 2021-04-25 2021-04-25 Driving environment sensing method combining image recognition and laser radar point cloud segmentation Pending CN113269040A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110445391.8A CN113269040A (en) 2021-04-25 2021-04-25 Driving environment sensing method combining image recognition and laser radar point cloud segmentation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110445391.8A CN113269040A (en) 2021-04-25 2021-04-25 Driving environment sensing method combining image recognition and laser radar point cloud segmentation

Publications (1)

Publication Number Publication Date
CN113269040A true CN113269040A (en) 2021-08-17

Family

ID=77229378

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110445391.8A Pending CN113269040A (en) 2021-04-25 2021-04-25 Driving environment sensing method combining image recognition and laser radar point cloud segmentation

Country Status (1)

Country Link
CN (1) CN113269040A (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705655A (en) * 2021-08-24 2021-11-26 北京建筑大学 Full-automatic classification method for three-dimensional point cloud and deep neural network model
CN113762413A (en) * 2021-09-30 2021-12-07 智道网联科技(北京)有限公司 Point cloud data and image data fusion method and storage medium
CN113838030A (en) * 2021-09-24 2021-12-24 北京杰迈科技股份有限公司 Turnout state detection method
CN115148040A (en) * 2022-06-28 2022-10-04 东莞中科云计算研究院 Unmanned vehicle control method and system for closed road environment
CN115145272A (en) * 2022-06-21 2022-10-04 大连华锐智能化科技有限公司 Coke oven vehicle environment sensing system and method
CN115240093A (en) * 2022-09-22 2022-10-25 山东大学 Automatic power transmission channel inspection method based on visible light and laser radar point cloud fusion
WO2023045044A1 (en) * 2021-09-27 2023-03-30 北京大学深圳研究生院 Point cloud coding method and apparatus, electronic device, medium, and program product
WO2024044887A1 (en) * 2022-08-29 2024-03-07 Huawei Technologies Co., Ltd. Vision-based perception system
CN113838030B (en) * 2021-09-24 2024-05-14 北京杰迈科技股份有限公司 Switch state detection method

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596860A (en) * 2018-05-10 2018-09-28 芜湖航飞科技股份有限公司 A kind of ground point cloud dividing method based on three-dimensional laser radar
CN109670411A (en) * 2018-11-30 2019-04-23 武汉理工大学 Based on the inland navigation craft point cloud data depth image processing method and system for generating confrontation network
CN109934230A (en) * 2018-09-05 2019-06-25 浙江大学 A kind of radar points cloud dividing method of view-based access control model auxiliary
CN110738200A (en) * 2019-12-23 2020-01-31 广州赛特智能科技有限公司 Lane line 3D point cloud map construction method, electronic device and storage medium
CN110853037A (en) * 2019-09-26 2020-02-28 西安交通大学 Lightweight color point cloud segmentation method based on spherical projection
CN111026127A (en) * 2019-12-27 2020-04-17 南京大学 Automatic driving decision method and system based on partially observable transfer reinforcement learning
CN111274976A (en) * 2020-01-22 2020-06-12 清华大学 Lane detection method and system based on multi-level fusion of vision and laser radar
CN112396650A (en) * 2020-03-30 2021-02-23 青岛慧拓智能机器有限公司 Target ranging system and method based on fusion of image and laser radar
WO2021041854A1 (en) * 2019-08-30 2021-03-04 Nvidia Corporation Object detection and classification using lidar range images for autonomous machine applications
DE102019127282A1 (en) * 2019-10-10 2021-04-15 Valeo Schalter Und Sensoren Gmbh System and method for analyzing a three-dimensional environment through deep learning

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108596860A (en) * 2018-05-10 2018-09-28 芜湖航飞科技股份有限公司 A kind of ground point cloud dividing method based on three-dimensional laser radar
CN109934230A (en) * 2018-09-05 2019-06-25 浙江大学 A kind of radar points cloud dividing method of view-based access control model auxiliary
CN109670411A (en) * 2018-11-30 2019-04-23 武汉理工大学 Based on the inland navigation craft point cloud data depth image processing method and system for generating confrontation network
WO2021041854A1 (en) * 2019-08-30 2021-03-04 Nvidia Corporation Object detection and classification using lidar range images for autonomous machine applications
CN110853037A (en) * 2019-09-26 2020-02-28 西安交通大学 Lightweight color point cloud segmentation method based on spherical projection
DE102019127282A1 (en) * 2019-10-10 2021-04-15 Valeo Schalter Und Sensoren Gmbh System and method for analyzing a three-dimensional environment through deep learning
CN110738200A (en) * 2019-12-23 2020-01-31 广州赛特智能科技有限公司 Lane line 3D point cloud map construction method, electronic device and storage medium
CN111026127A (en) * 2019-12-27 2020-04-17 南京大学 Automatic driving decision method and system based on partially observable transfer reinforcement learning
CN111274976A (en) * 2020-01-22 2020-06-12 清华大学 Lane detection method and system based on multi-level fusion of vision and laser radar
CN112396650A (en) * 2020-03-30 2021-02-23 青岛慧拓智能机器有限公司 Target ranging system and method based on fusion of image and laser radar

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
STEVEN K. FILIPPELLI;MICHAEL A. LEFSKY;MONIQUE E. ROCCA: "Comparison and integration of lidar and photogrammetric point clouds for mapping pre-fire forest structure", REMOTE SENSING OF ENVIRONMENT, vol. 224 *
YANG, ZHANG;ZHEN, LIU;XIANG, LI;YU, ZANG: "Data-Driven Point Cloud Objects Completion.", SENSORS (BASEL, SWITZERLAND), vol. 19, no. 7 *
孔德明; 张娜; 黄紫双; 陈晓玉; 沈阅: "Research on a method for measuring cargo volume in train carriages based on lidar detection technology", Journal of Yanshan University, no. 02
谢波; 赵亚男; 高利; 高峰: "An enhanced semantic segmentation method for small targets based on lidar point clouds", Laser Journal, no. 04
钱煜; 俞扬; 周志华: "A reward shaping method based on self-generated sample learning", Journal of Software, no. 11

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113705655A (en) * 2021-08-24 2021-11-26 北京建筑大学 Full-automatic classification method for three-dimensional point cloud and deep neural network model
CN113705655B (en) * 2021-08-24 2023-07-18 北京建筑大学 Three-dimensional point cloud full-automatic classification method and deep neural network model
CN113838030A (en) * 2021-09-24 2021-12-24 北京杰迈科技股份有限公司 Turnout state detection method
CN113838030B (en) * 2021-09-24 2024-05-14 北京杰迈科技股份有限公司 Switch state detection method
WO2023045044A1 (en) * 2021-09-27 2023-03-30 北京大学深圳研究生院 Point cloud coding method and apparatus, electronic device, medium, and program product
CN113762413B (en) * 2021-09-30 2023-12-26 智道网联科技(北京)有限公司 Point cloud data and image data fusion method and storage medium
CN113762413A (en) * 2021-09-30 2021-12-07 智道网联科技(北京)有限公司 Point cloud data and image data fusion method and storage medium
CN115145272A (en) * 2022-06-21 2022-10-04 大连华锐智能化科技有限公司 Coke oven vehicle environment sensing system and method
CN115145272B (en) * 2022-06-21 2024-03-29 大连华锐智能化科技有限公司 Coke oven vehicle environment sensing system and method
CN115148040A (en) * 2022-06-28 2022-10-04 东莞中科云计算研究院 Unmanned vehicle control method and system for closed road environment
WO2024044887A1 (en) * 2022-08-29 2024-03-07 Huawei Technologies Co., Ltd. Vision-based perception system
CN115240093B (en) * 2022-09-22 2022-12-23 山东大学 Automatic power transmission channel inspection method based on visible light and laser radar point cloud fusion
CN115240093A (en) * 2022-09-22 2022-10-25 山东大学 Automatic power transmission channel inspection method based on visible light and laser radar point cloud fusion

Similar Documents

Publication Publication Date Title
CN113269040A (en) Driving environment sensing method combining image recognition and laser radar point cloud segmentation
CN111798475B (en) Indoor environment 3D semantic map construction method based on point cloud deep learning
CN110675418B (en) Target track optimization method based on DS evidence theory
CN107563372B (en) License plate positioning method based on deep learning SSD frame
CN110222626B (en) Unmanned scene point cloud target labeling method based on deep learning algorithm
CN110570429B (en) Lightweight real-time semantic segmentation method based on three-dimensional point cloud
CN111626217A (en) Target detection and tracking method based on two-dimensional picture and three-dimensional point cloud fusion
CN111598098B (en) Water gauge water line detection and effectiveness identification method based on full convolution neural network
CN112949633B (en) Improved YOLOv 3-based infrared target detection method
CN111046781B (en) Robust three-dimensional target detection method based on ternary attention mechanism
CN114359181B (en) Intelligent traffic target fusion detection method and system based on image and point cloud
CN109919026B (en) Surface unmanned ship local path planning method
CN112329623A (en) Early warning method for visibility detection and visibility safety grade division in foggy days
CN111967373B (en) Self-adaptive enhanced fusion real-time instance segmentation method based on camera and laser radar
CN113095152B (en) Regression-based lane line detection method and system
CN111339830A (en) Target classification method based on multi-modal data features
CN109492700A (en) A kind of Target under Complicated Background recognition methods based on multidimensional information fusion
CN112861700A (en) DeepLabv3+ based lane line network identification model establishment and vehicle speed detection method
CN116385958A (en) Edge intelligent detection method for power grid inspection and monitoring
CN115861619A (en) Airborne LiDAR (light detection and ranging) urban point cloud semantic segmentation method and system of recursive residual double-attention kernel point convolution network
CN115359474A (en) Lightweight three-dimensional target detection method, device and medium suitable for mobile terminal
CN112613392A (en) Lane line detection method, device and system based on semantic segmentation and storage medium
CN112288667A (en) Three-dimensional target detection method based on fusion of laser radar and camera
CN115100741A (en) Point cloud pedestrian distance risk detection method, system, equipment and medium
CN114266947A (en) Classification method and device based on fusion of laser point cloud and visible light image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination