CN114445592B - Bird's eye view semantic segmentation label generation method based on inverse perspective transformation and point cloud projection - Google Patents
- Publication number
- CN114445592B CN114445592B CN202210111850.3A CN202210111850A CN114445592B CN 114445592 B CN114445592 B CN 114445592B CN 202210111850 A CN202210111850 A CN 202210111850A CN 114445592 B CN114445592 B CN 114445592B
- Authority
- CN
- China
- Prior art keywords
- camera
- bev
- point cloud
- coordinate system
- vehicle body
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G06T17/20—Finite element generation, e.g. wire-frame surface description, tesselation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
Abstract
The invention relates to a bird's eye view semantic segmentation label generation method based on inverse perspective transformation and point cloud projection, comprising the following steps: data acquisition, in which a synchronization signal is used to synchronize the camera and laser radar data at the same moment, so that the timestamp difference of all camera and laser radar sensor data at each moment does not exceed a set value; data labeling, in which the m images and n point cloud frames at the same moment are labeled jointly, static road surface areas are marked on the images, and 3D bounding boxes of dynamic objects are marked in the point cloud; and generation of the road surface area of the BEV label by inverse perspective transformation, in which the road surface semantic segmentation labels of each camera view angle are projected onto the BEV canvas based on affine-geometry inverse perspective transformation, stitched, and the stitched picture is refined. The invention directly generates accurate bird's eye view semantic segmentation labels from the original images and point cloud synchronized at a given moment, avoiding acquiring and labeling the bird's eye view by means of unmanned aerial vehicle aerial photography of the road surface, and thereby reducing cost.
Description
Technical Field
The invention relates to the technical field of automatic driving surrounding sensing of automobiles, in particular to a bird's eye view semantic segmentation label generation method based on inverse perspective transformation and point cloud projection.
Background
The automatic driving system is one of the core systems of current intelligent automobiles. It mainly comprises three modules: a perception and fusion module, a decision and planning module, and a control module. Perception and fusion serves as the front-end module of the other two, and its perception accuracy directly determines the performance of the whole automatic driving system.
At present, many automatic driving companies are paying attention to surround perception. Specifically, multiple cameras are distributed around the vehicle body; the arrangement shown in fig. 1, with six cameras, is the most common. The six cameras collect image information from different viewing angles, which is then fed into a perception model that directly outputs bird's eye view (BEV) semantic information. The bird's eye view is the view obtained by looking down at the vehicle from above, and BEV semantic information refers to the semantic segmentation of the bird's eye view, mainly comprising four types of semantic information: pedestrians, vehicles, drivable areas, and lane lines.
In order to train such a bird's eye view semantic segmentation model, it is necessary to acquire the corresponding bird's eye view semantic segmentation labels (hereinafter abbreviated as BEV labels). Currently, two main ways are used in industry to obtain BEV labels:
The first way is: and generating a high-precision map offline, and then generating corresponding BEV labels through semantic information elements of the high-precision map. The method has two defects, namely, the BEV label is easily affected by the precision of the high-precision map, the acquisition cost of the high-precision map is higher, the period is longer, and the geographic range of the high-precision map is limited, so that the diversity of data is restricted.
The second way is: and synchronously taking aerial views through the unmanned aerial vehicle, and then manually marking the aerial views. The biggest disadvantage of this kind of mode is that data acquisition car can't carry out data acquisition in unmanned aerial vehicle controlled no-fly zone, and data acquisition's scope will be limited like this to the scene diversity is limited, and this kind of collection model simultaneously, does not have the method to carry out automatic acquisition through the shadow mode, has influenced the iterative renewal of model.
Disclosure of Invention
The invention aims to provide a bird's eye view semantic segmentation label generation method based on inverse perspective transformation and point cloud projection, which addresses the following technical problems. Currently, there are two main ways of obtaining BEV labels in the industry. The first is to generate a high-precision map offline and then generate the corresponding BEV labels from the semantic information elements of the high-precision map; this approach is limited by the precision, acquisition cost, acquisition period and geographic coverage of the high-precision map, which restricts data diversity.
Secondly, the aerial view is captured synchronously by an unmanned aerial vehicle and then labeled manually. The defect of this approach is that the data acquisition vehicle cannot collect data in no-fly zones where unmanned aerial vehicles are restricted, so the range of data acquisition and therefore scene diversity is limited; moreover, this collection mode cannot be automated through shadow-mode acquisition, which affects the iterative updating of the model.
In order to solve the technical problems, the invention adopts the following technical scheme: a bird's eye view semantic segmentation label generation method based on inverse perspective transformation and point cloud projection comprises the following steps:
S01: data acquisition: a synchronization signal is used to synchronize the camera and laser radar data at the same moment, so that the timestamp difference of all camera and laser radar sensor data at each moment does not exceed a set value;
S02: data labeling: the m images and n point cloud frames at the same moment are labeled jointly, static road surface areas are marked on the images, and 3D bounding boxes of dynamic objects are marked in the point cloud;
S03: generation of the road surface area of the BEV label by inverse perspective transformation: based on affine-geometry inverse perspective transformation, the road surface semantic segmentation labels of each camera view angle are projected onto the BEV canvas and stitched, and the stitched picture is refined;
S04: generation of the BEV label moving targets by point cloud projection: the point cloud is converted into the vehicle body coordinate system by rigid body transformation, and the 3D bounding boxes are then projected onto the BEV canvas by point projection transformation;
S05: combining the road surface and the moving objects: the semantic segmentation labels generated in S03 and S04 are fused to obtain the complete high-precision BEV label.
Preferably,
S01 further comprises sensor configuration and calibration, wherein the sensors are cameras and laser radars and are arranged on the data acquisition vehicle; the calibration is to calibrate the external parameters and the internal parameters of each camera relative to the vehicle body by using a camera calibration plate, and calibrate the external parameters of the laser radar relative to the vehicle body by using a camera and laser radar combined calibration method.
Preferably,
The cameras are distributed around the vehicle body, and the viewing angles of the cameras partially overlap; the laser radar is mounted on the top of the vehicle body, with a horizontal FOV of 360 degrees and a vertical FOV of -20 degrees to 20 degrees; the external parameters are the yaw angle yaw, pitch angle pitch, roll angle roll and translation distances tx, ty, tz of the camera relative to the vehicle body; the internal parameters are the pixel scales $f_x$, $f_y$ and the pixel center $p_x$, $p_y$ of the camera in the x and y directions; the projection matrix from the vehicle body coordinate system to the camera pixel coordinate system is calculated from the external and internal parameters, and the transformation is derived as:

$$R = R_{yaw} R_{pitch} R_{roll}, \qquad T = \begin{bmatrix} t_x & t_y & t_z \end{bmatrix}^T, \qquad K = \begin{bmatrix} f_x & 0 & p_x \\ 0 & f_y & p_y \\ 0 & 0 & 1 \end{bmatrix}$$

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} R & T \end{bmatrix} \begin{bmatrix} X_{ego} \\ Y_{ego} \\ Z_{ego} \\ 1 \end{bmatrix} = P \begin{bmatrix} X_{ego} \\ Y_{ego} \\ Z_{ego} \\ 1 \end{bmatrix}$$

wherein R is a rigid body rotation matrix whose rotation direction is from the vehicle body coordinate system to the camera coordinate system, T is a translation matrix whose translation direction is from the vehicle body coordinate system to the camera coordinate system, and K is the internal parameter matrix formed by the camera internal parameters; R, T and K together form the 3×4 projection matrix $P = K[R \mid T]$, which projects the homogeneous coordinates $[X_{ego}, Y_{ego}, Z_{ego}, 1]^T$ of a point in the vehicle body coordinate system to the pixel coordinates of the camera pixel plane, where $Z_c$ is the depth of this point in the camera coordinate system.
Preferably,
In step S02, 6 images and 1 frame of point cloud data at the same moment are labeled jointly; only the road surface area, including the drivable area and the road surface lane lines, is marked on the images, while the moving targets (pedestrians and vehicles) are marked in the point cloud.
Preferably,
For a point (u, v) of the road surface area on the camera image, the corresponding road surface point in the vehicle body coordinate system is $(X_r, Y_r, Z_r)$, where the subscript r denotes road and $Z_r = 0$. The pixel coordinates $(u_r, v_r)$ of the canvas and the road surface point $(X_r, Y_r, 0)$ of the vehicle body coordinate system are related by

$$\begin{bmatrix} X_r \\ Y_r \\ 0 \\ 1 \end{bmatrix} = M \begin{bmatrix} u_r \\ v_r \\ 1 \end{bmatrix}$$

wherein M is the 4×3 matrix determined by $W_{target}$, $H_{target}$ (the width and height of the BEV canvas) and $ppx_{target}$, $ppy_{target}$ (the number of pixels per meter of the BEV canvas in the width and height directions).

The relationship between the camera road surface point pixel coordinates and the BEV road surface point pixel coordinates is

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = P M \begin{bmatrix} u_r \\ v_r \\ 1 \end{bmatrix}$$

PM is a 3×3 square matrix and is reversible, so

$$\begin{bmatrix} \tilde{u}_r \\ \tilde{v}_r \\ \tilde{w} \end{bmatrix} = (PM)^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$

wherein $P \in \{P_1, P_2, P_3, P_4, P_5, P_6\}$, namely the projection matrices of the 6 surround cameras; the P and M matrices are known quantities, and (u, v) is a given road surface pixel coordinate point on the original camera image, whose corresponding projected pixel point on the BEV canvas has homogeneous coordinates $(\tilde{u}_r, \tilde{v}_r, \tilde{w})$; the pixel coordinates are

$$u_r = \tilde{u}_r / \tilde{w}, \qquad v_r = \tilde{v}_r / \tilde{w}$$
Preferably,
In S04, the coordinates of the 4 grounding points of the 3D bounding box marked for each moving object are converted into a vehicle body coordinate system through rigid body transformation, and the conversion is performed according to the following formula:
The formula converts 4 grounding points of a bounding box of each moving object of a laser radar coordinate system into a vehicle
4 Ground points in the body coordinate system, projecting the 4 ground points onto the BEV map, and generating on the map
Forming an bounding matrix, and further generating a label of the moving object on the BEV canvas.
Preferably,
The generated static road surface BEV image and the generated moving target image are superposed and fused to generate a BEV label image comprising the attributes of drivable area, lane lines, vehicles and pedestrians.
By adopting the above technical scheme, the invention has the following beneficial technical effects: the invention draws on the way a high-precision map is generated, but focuses not on generating a high-precision map itself but on a low-cost algorithm for automatically generating BEV labels. It designs a low-cost automatic BEV label generation pipeline that avoids the complexity of unmanned aerial vehicles and high-precision maps and obtains relatively high-precision BEV labels directly from the original images and point cloud. Specifically, accurate bird's eye view semantic segmentation labels are generated directly from the original images and point cloud synchronized at a given moment, which avoids acquiring and labeling the bird's eye view by means of unmanned aerial vehicle aerial photography, greatly reduces the cost of data labeling, and at the same time expands the range of scenes from which data can be collected (including scenes where unmanned aerial vehicles are restricted).
Drawings
FIG. 1 is a schematic diagram of the data acquisition vehicle sensor configuration;
FIG. 2 is a labeling example (a: original picture);
FIG. 3 is a labeling example (b: mask generated from the label of the original picture);
FIG. 4 is a schematic diagram of BEV label generation by inverse perspective transformation;
FIG. 5 is a schematic diagram of BEV label generation by point cloud projection (a: BEV projection of a pedestrian);
FIG. 6 is a schematic diagram of BEV label generation by point cloud projection (b: BEV projection of a vehicle);
FIG. 7 is a schematic diagram of a generated BEV label;
FIG. 8 is a flowchart of the overall automatic BEV label generation algorithm.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
The invention discloses a method for generating a bird's eye view semantic segmentation label based on inverse perspective transformation and point cloud projection, which comprises the following specific implementation steps:
Step one, configuring the data acquisition vehicle: as shown in fig. 1, a data acquisition vehicle is configured on which six 2-megapixel cameras are distributed around the vehicle body for surround sensing, with the viewing angles of adjacent cameras partially overlapping, ensuring 360-degree environment sensing without blind spots. The laser radar (Lidar) is mounted on the top of the vehicle body, with a horizontal FOV of 360 degrees and a vertical FOV of about -20 to 20 degrees.
Second step, calibrating the sensors: the external parameters (extrinsics) and internal parameters (intrinsics) of each camera relative to the vehicle body (ego vehicle) are calibrated using a camera calibration plate. The external parameters are the yaw angle yaw, pitch angle pitch, roll angle roll and translation distances tx, ty, tz of the camera relative to the vehicle body; the internal parameters are the pixel scales $f_x$, $f_y$ and the pixel center $p_x$, $p_y$ of the camera in the x and y directions. The camera and the Lidar are then jointly calibrated to obtain the external parameters of the Lidar relative to the vehicle body. The projection transformation matrix from the vehicle body coordinate system to the camera pixel coordinate system can be calculated from the external and internal parameters; the specific transformation is derived in formulas (1)-(8). With $R_{yaw}$, $R_{pitch}$, $R_{roll}$ the elementary rotations about the vehicle body z-, y- and x-axes (formulas (1)-(3)):

$$R = R_{yaw} R_{pitch} R_{roll} \qquad (4)$$

$$T = \begin{bmatrix} t_x & t_y & t_z \end{bmatrix}^T \qquad (5)$$

$$\begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} = R \begin{bmatrix} X_{ego} \\ Y_{ego} \\ Z_{ego} \end{bmatrix} + T \qquad (6)$$

$$K = \begin{bmatrix} f_x & 0 & p_x \\ 0 & f_y & p_y \\ 0 & 0 & 1 \end{bmatrix} \qquad (7)$$

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} X_c \\ Y_c \\ Z_c \end{bmatrix} \qquad (8)$$

wherein R is a rigid body rotation matrix whose rotation direction is from the vehicle body coordinate system to the camera coordinate system, T is a translation matrix whose translation direction is from the vehicle body coordinate system to the camera coordinate system, and K is the internal parameter matrix formed by the camera internal parameters. Combining formulas (6) and (8) gives formula (9):

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} R & T \end{bmatrix} \begin{bmatrix} X_{ego} \\ Y_{ego} \\ Z_{ego} \\ 1 \end{bmatrix} = P \begin{bmatrix} X_{ego} \\ Y_{ego} \\ Z_{ego} \\ 1 \end{bmatrix} \qquad (9)$$

wherein R, T and K together form the 3×4 projection matrix $P = K[R \mid T]$, which projects the homogeneous coordinates $[X_{ego}, Y_{ego}, Z_{ego}, 1]^T$ of a point in the vehicle body coordinate system to the pixel coordinates of the camera pixel plane, where $Z_c$ is the depth of this point in the camera coordinate system.
For the Lidar, similarly, its transformation from the vehicle body coordinate system to its own coordinate system satisfies formula (6); since the Lidar has no internal parameters, only the external parameter matrices R and T are considered. In the following discussion, the projection matrix of the i-th camera is denoted $P_i$ and the external parameter matrices of the Lidar are denoted $R_{lidar}$, $T_{lidar}$.
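To make the calibration step concrete, the following is a minimal NumPy sketch of formulas (4)-(9). It assumes yaw, pitch and roll are elementary rotations about the body z-, y- and x-axes; the function names and the example calibration values are illustrative and not taken from the patent.

```python
import numpy as np

def rotation_from_ypr(yaw, pitch, roll):
    """R = R_yaw @ R_pitch @ R_roll (formula (4)): elementary rotations about z, y, x."""
    cy, sy = np.cos(yaw), np.sin(yaw)
    cp, sp = np.cos(pitch), np.sin(pitch)
    cr, sr = np.cos(roll), np.sin(roll)
    R_yaw   = np.array([[cy, -sy, 0], [sy, cy, 0], [0, 0, 1]])
    R_pitch = np.array([[cp, 0, sp], [0, 1, 0], [-sp, 0, cp]])
    R_roll  = np.array([[1, 0, 0], [0, cr, -sr], [0, sr, cr]])
    return R_yaw @ R_pitch @ R_roll

def projection_matrix(yaw, pitch, roll, tx, ty, tz, fx, fy, px, py):
    """P = K [R | T] (formula (9)), mapping ego-frame homogeneous points to pixels."""
    R = rotation_from_ypr(yaw, pitch, roll)
    T = np.array([[tx], [ty], [tz]])
    K = np.array([[fx, 0, px], [0, fy, py], [0, 0, 1]])
    return K @ np.hstack([R, T])          # 3x4 projection matrix

def project_point(P, point_ego):
    """Project an ego-frame 3D point to camera pixel coordinates (u, v)."""
    uvw = P @ np.append(point_ego, 1.0)   # equals Z_c * [u, v, 1]
    return uvw[:2] / uvw[2]

# Example usage with illustrative (made-up) calibration values:
P_example = projection_matrix(yaw=0.0, pitch=0.02, roll=0.0,
                              tx=0.0, ty=0.0, tz=-1.5,
                              fx=1000.0, fy=1000.0, px=960.0, py=540.0)
print(project_point(P_example, np.array([0.5, 0.2, 5.0])))
```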
Thirdly, data acquisition: in the data acquisition process, the most important is the synchronization before all 6 paths of cameras and laser radars. The synchronous mode of this patent is that Lidar sweeps a camera all the way, and this camera triggers the exposure, so when Lidar sweeps 360 degrees, all cameras are exposed once. The scanning frequency of Lidar is 20Hz, so that the rotation of Lidar needs about 50ms, so that the maximum time difference of 6 paths of camera synchronization is (50/6) x5=41.6ms, and the requirement of less than 45ms is met.
Fourth, data marking: the 6 images and 1 point cloud data at the same moment are labeled jointly, wherein only the road surface area, including the drivable area and the road surface lane line, is marked on the image, and pedestrians and vehicles of moving targets are marked on the point cloud, and an labeling example is shown in fig. 2 and 3.
Fifth step, the inverse perspective transformation generates the road surface area of the BEV label, followed by label refinement: for a point (u, v) of the road surface area on the camera image, the corresponding road surface point in the vehicle body coordinate system is $(X_r, Y_r, Z_r)$, where the subscript r denotes road. Since the height of the road surface area relative to the vehicle body is 0, $Z_r = 0$. As shown in fig. 4, according to the coordinate system setting of the canvas, the pixel coordinates $(u_r, v_r)$ of the canvas and the road surface point $(X_r, Y_r, 0)$ of the vehicle body coordinate system are related by

$$\begin{bmatrix} X_r \\ Y_r \\ 0 \\ 1 \end{bmatrix} = M \begin{bmatrix} u_r \\ v_r \\ 1 \end{bmatrix} \qquad (10)$$

where $W_{target}$ and $H_{target}$ represent the width and height of the BEV canvas, respectively, $ppx_{target}$ and $ppy_{target}$ represent the number of pixels per meter of the BEV canvas in the width direction (x-direction) and height direction (y-direction), and M is the 4×3 matrix determined by these quantities.
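As one possible concrete form of the matrix M in formula (10), the sketch below assumes a canvas convention that the patent does not spell out: the ego vehicle sits at the canvas center, ego x (forward) points toward the top of the canvas, and ego y (left) toward the left of the canvas. Under a different convention only the entries of M change.

```python
import numpy as np

def bev_pixel_to_ego_matrix(w_target, h_target, ppx_target, ppy_target):
    """Build the 4x3 matrix M of formula (10), mapping homogeneous BEV canvas
    pixels [u_r, v_r, 1] to homogeneous ego road points [X_r, Y_r, 0, 1].
    Assumed convention: ego origin at the canvas center, x forward = up, y left = left."""
    return np.array([
        [0.0,               -1.0 / ppy_target,  h_target / (2.0 * ppy_target)],  # X_r
        [-1.0 / ppx_target,  0.0,               w_target / (2.0 * ppx_target)],  # Y_r
        [0.0,                0.0,               0.0],                            # Z_r = 0
        [0.0,                0.0,               1.0],                            # homogeneous 1
    ])

M = bev_pixel_to_ego_matrix(w_target=800, h_target=800, ppx_target=10, ppy_target=10)
print(M @ np.array([400.0, 400.0, 1.0]))  # canvas center -> [0, 0, 0, 1]
```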
Combining formula (9) and formula (10) gives the relationship between the camera road surface point pixel coordinates and the BEV road surface point pixel coordinates:

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = P M \begin{bmatrix} u_r \\ v_r \\ 1 \end{bmatrix} \qquad (11)$$

Since P is a 3×4 matrix and M is a 4×3 matrix, PM is a 3×3 matrix and is reversible, so that

$$\begin{bmatrix} \tilde{u}_r \\ \tilde{v}_r \\ \tilde{w} \end{bmatrix} = (PM)^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} \qquad (12)$$

where $P \in \{P_1, P_2, P_3, P_4, P_5, P_6\}$ is the projection matrix of one of the 6 surround cameras. Formula (12) is the inverse perspective transformation formula, in which the P and M matrices are known quantities and (u, v) is a given road surface pixel coordinate point on the original camera image, so the homogeneous coordinates $(\tilde{u}_r, \tilde{v}_r, \tilde{w})$ of the corresponding projected pixel point on the BEV canvas can be computed in reverse. From this, the pixel coordinates are:

$$u_r = \tilde{u}_r / \tilde{w} \qquad (13)$$

$$v_r = \tilde{v}_r / \tilde{w} \qquad (14)$$
First, road surface area marking is performed on the original pictures captured by the 6 surround cameras, in which the lane lines and the drivable area are marked, generating mask labels as shown in fig. 3. The mask labels of each view angle are then projected onto the same BEV canvas using formulas (12)-(14). As shown in fig. 4, the mask labels of each view angle are stitched together in sequence on the BEV canvas to produce a complete BEV projection map. Because camera shake and vehicle body vibration change the camera external parameters, the stitched BEV projection map contains misalignments, so the stitched BEV road surface label needs to be repaired manually, as shown in fig. 4.
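A compact sketch of this fifth step, assuming the per-camera masks and projection matrices are already available and reusing the 4×3 matrix M built in the earlier sketch. OpenCV's warpPerspective performs the per-pixel inverse perspective mapping of formula (12); the fill-empty-pixels stitching rule is an illustrative choice, not the patent's prescription.

```python
import cv2
import numpy as np

def warp_mask_to_bev(mask, P, M, w_target, h_target):
    """Warp one camera-view road-surface mask onto the BEV canvas with
    H = (P M)^-1 from formula (12). mask: HxW uint8 label image of one camera;
    P: that camera's 3x4 projection matrix; M: the 4x3 BEV-pixel-to-ego matrix."""
    H = np.linalg.inv(P @ M)                     # maps camera pixels -> BEV pixels
    return cv2.warpPerspective(mask, H, (w_target, h_target),
                               flags=cv2.INTER_NEAREST)  # nearest keeps label ids intact

def stitch_bev_road(masks, projections, M, w_target, h_target):
    """Stitch the 6 per-camera BEV projections; later cameras only fill empty pixels."""
    bev = np.zeros((h_target, w_target), dtype=np.uint8)
    for mask, P in zip(masks, projections):
        warped = warp_mask_to_bev(mask, P, M, w_target, h_target)
        bev = np.where(bev == 0, warped, bev)
    return bev
```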
As shown in figs. 5 and 6, in the sixth step, the BEV label moving objects are generated by point cloud projection: for each moving target, the coordinates of the 4 grounding points of its marked 3D bounding box are converted into the vehicle body coordinate system through rigid body transformation according to formula (15):

$$\begin{bmatrix} X_{ego} \\ Y_{ego} \\ Z_{ego} \end{bmatrix} = R_{lidar}^{-1} \left( \begin{bmatrix} X_{lidar} \\ Y_{lidar} \\ Z_{lidar} \end{bmatrix} - T_{lidar} \right) \qquad (15)$$

This formula converts the 4 grounding points of the bounding box of each moving object from the Lidar coordinate system into 4 grounding points in the vehicle body coordinate system; the 4 grounding points are then projected onto the BEV map and an enclosing rectangle is generated on the map, thereby generating the labels of the moving objects on the BEV canvas.
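A minimal sketch of this sixth step, assuming the same canvas convention as the earlier M-matrix sketch and that R_lidar, T_lidar are the body-to-lidar extrinsics obtained in the second step; the helper names and the integer label id are illustrative.

```python
import cv2
import numpy as np

def lidar_to_ego(points_lidar, R_lidar, T_lidar):
    """Formula (15): convert Nx3 lidar-frame points to the ego (vehicle body) frame,
    with R_lidar, T_lidar the body-to-lidar extrinsics from calibration."""
    return (np.linalg.inv(R_lidar) @ (points_lidar - T_lidar.reshape(1, 3)).T).T

def draw_object_on_bev(bev, ground_points_lidar, R_lidar, T_lidar,
                       w_target, h_target, ppx_target, ppy_target, label_id):
    """Project the 4 ground points of one 3D bounding box onto the BEV canvas and
    fill the enclosed rectangle with the object's label id. Canvas convention as in
    the earlier sketch: ego at the canvas center, x forward = up, y left = left."""
    pts_ego = lidar_to_ego(ground_points_lidar, R_lidar, T_lidar)
    u = w_target / 2.0 - ppx_target * pts_ego[:, 1]   # Y_ego -> canvas column
    v = h_target / 2.0 - ppy_target * pts_ego[:, 0]   # X_ego -> canvas row
    polygon = np.stack([u, v], axis=1).astype(np.int32)
    cv2.fillPoly(bev, [polygon], color=int(label_id))
    return bev
```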
Seventh step, combining the road surface and the moving objects: the static road surface BEV image generated in the fifth step and the moving target image generated in the sixth step are simply superposed and fused, producing a BEV label image containing the 4 attributes of drivable area, lane lines, vehicles and pedestrians, as shown in fig. 7. The whole BEV label generation flow is shown in fig. 8: collect the 6 channels of camera data, mark the road surface areas of the 6 images, generate masks from the marks, apply the inverse perspective transformation to the masks, stitch the transformed masks, and refine the BEV road surface label; collect the laser radar data, mark the 3D bounding boxes of the moving targets in the point cloud, rigidly transform the 4 grounding points to the vehicle body coordinate system, and generate the moving target masks by point cloud projection; finally, fuse the dynamic and static masks to generate the BEV label.
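A minimal sketch of the fusion in this seventh step; the label ids are illustrative, and the rule that dynamic objects overwrite the static road surface follows the simple superposition described above.

```python
import numpy as np

# Illustrative label ids (not taken from the patent).
BACKGROUND, DRIVABLE, LANE_LINE, VEHICLE, PEDESTRIAN = 0, 1, 2, 3, 4

def fuse_bev_labels(static_bev, dynamic_bev):
    """Superpose the moving-target mask onto the road-surface mask: wherever the
    dynamic mask is non-background, it overwrites the static label."""
    return np.where(dynamic_bev != BACKGROUND, dynamic_bev, static_bev)

# Example usage on dummy 800x800 canvases:
static = np.full((800, 800), DRIVABLE, dtype=np.uint8)
dynamic = np.zeros_like(static)
dynamic[390:410, 390:410] = VEHICLE
fused = fuse_bev_labels(static, dynamic)
```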
Claims (7)
1. A bird's eye view semantic segmentation label generation method based on inverse perspective transformation and point cloud projection is characterized by comprising the following steps:
S01: data acquisition: a synchronization signal is used to synchronize the camera and laser radar data at the same moment, so that the timestamp difference of all camera and laser radar sensor data at each moment does not exceed a set value;
S02: data labeling: the m images and n point cloud frames at the same moment are labeled jointly, static road surface areas are marked on the images, and 3D bounding boxes of dynamic objects are marked in the point cloud;
S03: generation of the road surface area of the BEV label by inverse perspective transformation: based on affine-geometry inverse perspective transformation, the road surface semantic segmentation labels of each camera view angle are projected onto the BEV canvas and stitched, and the stitched picture is refined;
S04: generation of the BEV label moving targets by point cloud projection: the point cloud is converted into the vehicle body coordinate system by rigid body transformation, and the 3D bounding boxes are then projected onto the BEV canvas by point projection transformation;
S05: combining the road surface and the moving objects: the semantic segmentation labels generated in S03 and S04 are fused to obtain the complete high-precision BEV label.
2. The method for generating the bird's eye view semantic segmentation labels based on the inverse perspective transformation and the point cloud projection according to claim 1, wherein,
S01 further comprises sensor configuration and calibration, wherein the sensors are cameras and laser radars and are arranged on the data acquisition vehicle; the calibration is to calibrate the external parameters and the internal parameters of each camera relative to the vehicle body by using a camera calibration plate, and calibrate the external parameters of the laser radar relative to the vehicle body by using a camera and laser radar combined calibration method.
3. The method for generating the bird's eye view semantic segmentation labels based on the inverse perspective transformation and the point cloud projection according to claim 2, wherein,
The cameras are distributed around the vehicle body, and the viewing angles of the cameras partially overlap; the laser radar is mounted on the top of the vehicle body, with a horizontal FOV of 360 degrees and a vertical FOV of -20 degrees to 20 degrees; the external parameters are the yaw angle yaw, pitch angle pitch, roll angle roll and translation distances tx, ty, tz of the camera relative to the vehicle body; the internal parameters are the pixel scales $f_x$, $f_y$ and the pixel center $p_x$, $p_y$ of the camera in the x and y directions; the projection matrix from the vehicle body coordinate system to the camera pixel coordinate system is calculated from the external and internal parameters, and the transformation is derived as:

$$R = R_{yaw} R_{pitch} R_{roll}, \qquad T = \begin{bmatrix} t_x & t_y & t_z \end{bmatrix}^T, \qquad K = \begin{bmatrix} f_x & 0 & p_x \\ 0 & f_y & p_y \\ 0 & 0 & 1 \end{bmatrix}$$

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = K \begin{bmatrix} R & T \end{bmatrix} \begin{bmatrix} X_{ego} \\ Y_{ego} \\ Z_{ego} \\ 1 \end{bmatrix} = P \begin{bmatrix} X_{ego} \\ Y_{ego} \\ Z_{ego} \\ 1 \end{bmatrix}$$

wherein R is a rigid body rotation matrix whose rotation direction is from the vehicle body coordinate system to the camera coordinate system, T is a translation matrix whose translation direction is from the vehicle body coordinate system to the camera coordinate system, and K is the internal parameter matrix formed by the camera internal parameters; R, T and K together form the 3×4 projection matrix $P = K[R \mid T]$, which projects the homogeneous coordinates $[X_{ego}, Y_{ego}, Z_{ego}, 1]^T$ of a point in the vehicle body coordinate system to the pixel coordinates of the camera pixel plane, where $Z_c$ is the depth of this point in the camera coordinate system.
4. The method for generating the bird's eye view semantic segmentation labels based on the inverse perspective transformation and the point cloud projection according to claim 1, wherein,
In step S02, 6 images and 1 frame of point cloud data at the same moment are labeled jointly; only the road surface area, including the drivable area and the road surface lane lines, is marked on the images, while the moving targets (pedestrians and vehicles) are marked in the point cloud.
5. The method for generating the bird's eye view semantic segmentation labels based on the inverse perspective transformation and the point cloud projection according to claim 3, wherein,
For a point (u, v) of the road surface area on the camera image, the corresponding road surface point in the vehicle body coordinate system is $(X_r, Y_r, Z_r)$, where the subscript r denotes road and $Z_r = 0$; the pixel coordinates $(u_r, v_r)$ of the canvas and the road surface point $(X_r, Y_r, 0)$ of the vehicle body coordinate system are related by

$$\begin{bmatrix} X_r \\ Y_r \\ 0 \\ 1 \end{bmatrix} = M \begin{bmatrix} u_r \\ v_r \\ 1 \end{bmatrix}$$

wherein M is the 4×3 matrix determined by $W_{target}$, $H_{target}$ (the width and height of the BEV canvas) and $ppx_{target}$, $ppy_{target}$ (the number of pixels per meter of the BEV canvas in the width and height directions); the relationship between the camera road surface point pixel coordinates and the BEV road surface point pixel coordinates is

$$Z_c \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = P M \begin{bmatrix} u_r \\ v_r \\ 1 \end{bmatrix}$$

PM is a 3×3 square matrix and is reversible, so

$$\begin{bmatrix} \tilde{u}_r \\ \tilde{v}_r \\ \tilde{w} \end{bmatrix} = (PM)^{-1} \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}$$

wherein $P \in \{P_1, P_2, P_3, P_4, P_5, P_6\}$, namely the projection matrices of the 6 surround cameras; the P and M matrices are known quantities, and (u, v) is a given road surface pixel coordinate point on the original camera image, whose corresponding projected pixel point on the BEV canvas has homogeneous coordinates $(\tilde{u}_r, \tilde{v}_r, \tilde{w})$; the pixel coordinates are

$$u_r = \tilde{u}_r / \tilde{w}, \qquad v_r = \tilde{v}_r / \tilde{w}$$
6. The method for generating the bird's eye view semantic segmentation labels based on the inverse perspective transformation and the point cloud projection according to claim 5, wherein,
In S04, the coordinates of the 4 grounding points of the 3D bounding box marked for each moving object are converted into the vehicle body coordinate system through rigid body transformation, performed according to the following formula:

$$\begin{bmatrix} X_{ego} \\ Y_{ego} \\ Z_{ego} \end{bmatrix} = R_{lidar}^{-1} \left( \begin{bmatrix} X_{lidar} \\ Y_{lidar} \\ Z_{lidar} \end{bmatrix} - T_{lidar} \right)$$

wherein $R_{lidar}$, $T_{lidar}$ are the external parameters of the laser radar relative to the vehicle body. This formula converts the 4 grounding points of the bounding box of each moving object from the laser radar coordinate system into 4 grounding points in the vehicle body coordinate system; the 4 grounding points are projected onto the BEV map and an enclosing rectangle is generated on the map, thereby generating the label of the moving object on the BEV canvas.
7. The method for generating the bird's eye view semantic segmentation labels based on the inverse perspective transformation and the point cloud projection according to claim 6, wherein,
The generated static road surface BEV image and the generated moving target image are superposed and fused to generate a BEV label image comprising the attributes of drivable area, lane lines, vehicles and pedestrians.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210111850.3A CN114445592B (en) | 2022-01-29 | 2022-01-29 | Bird's eye view semantic segmentation label generation method based on inverse perspective transformation and point cloud projection |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202210111850.3A CN114445592B (en) | 2022-01-29 | 2022-01-29 | Bird's eye view semantic segmentation label generation method based on inverse perspective transformation and point cloud projection |
Publications (2)
Publication Number | Publication Date |
---|---|
CN114445592A CN114445592A (en) | 2022-05-06 |
CN114445592B true CN114445592B (en) | 2024-09-20 |
Family
ID=81371700
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202210111850.3A Active CN114445592B (en) | 2022-01-29 | 2022-01-29 | Bird's eye view semantic segmentation label generation method based on inverse perspective transformation and point cloud projection |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN114445592B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
TWI823819B (en) * | 2023-05-15 | 2023-11-21 | 先進車系統股份有限公司 | Driving assistance system and driving assistance computation method |
CN116309943B (en) * | 2023-05-24 | 2023-08-08 | 联友智连科技有限公司 | Parking lot semantic map road network construction method and device and electronic equipment |
CN117078800B (en) * | 2023-07-31 | 2024-08-02 | 零束科技有限公司 | Method and device for synthesizing ground identification based on BEV image |
CN118012838B (en) * | 2024-04-10 | 2024-06-11 | 成都纺织高等专科学校 | Unmanned aerial vehicle airborne radar signal-oriented data synchronization method and system |
CN118154688B (en) * | 2024-05-11 | 2024-08-02 | 擎翌(上海)智能科技有限公司 | Pose correction method and device based on multi-source data matching and electronic equipment |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110569704A (en) * | 2019-05-11 | 2019-12-13 | 北京工业大学 | Multi-strategy self-adaptive lane line detection method based on stereoscopic vision |
CN113409459A (en) * | 2021-06-08 | 2021-09-17 | 北京百度网讯科技有限公司 | Method, device and equipment for producing high-precision map and computer storage medium |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112116654B (en) * | 2019-06-20 | 2024-06-07 | 杭州海康威视数字技术股份有限公司 | Vehicle pose determining method and device and electronic equipment |
KR102306083B1 (en) * | 2021-02-03 | 2021-09-29 | 국방과학연구소 | Method and apparatus for identifying driving area of vehicle using image and lidar |
- 2022-01-29 CN CN202210111850.3A patent/CN114445592B/en active Active
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110569704A (en) * | 2019-05-11 | 2019-12-13 | 北京工业大学 | Multi-strategy self-adaptive lane line detection method based on stereoscopic vision |
CN113409459A (en) * | 2021-06-08 | 2021-09-17 | 北京百度网讯科技有限公司 | Method, device and equipment for producing high-precision map and computer storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN114445592A (en) | 2022-05-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN114445592B (en) | Bird's eye view semantic segmentation label generation method based on inverse perspective transformation and point cloud projection | |
CN114445593B (en) | Bird's eye view semantic segmentation label generation method based on multi-frame semantic point cloud splicing | |
KR101265667B1 (en) | Device for 3d image composition for visualizing image of vehicle around and method therefor | |
CN116685873A (en) | Vehicle-road cooperation-oriented perception information fusion representation and target detection method | |
CN109688392A (en) | AR-HUD optical projection system and mapping relations scaling method and distortion correction method | |
CN112270713A (en) | Calibration method and device, storage medium and electronic device | |
CN112577517A (en) | Multi-element positioning sensor combined calibration method and system | |
CN112233188B (en) | Calibration method of data fusion system of laser radar and panoramic camera | |
CN112819903A (en) | Camera and laser radar combined calibration method based on L-shaped calibration plate | |
CN112308927B (en) | Fusion device of panoramic camera and laser radar and calibration method thereof | |
CN113985405A (en) | Obstacle detection method and obstacle detection equipment applied to vehicle | |
Busch et al. | Lumpi: The leibniz university multi-perspective intersection dataset | |
CN110750153A (en) | Dynamic virtualization device of unmanned vehicle | |
CN115079143B (en) | Multi-radar external parameter quick calibration method and device for double-bridge steering mine card | |
CN116977806A (en) | Airport target detection method and system based on millimeter wave radar, laser radar and high-definition array camera | |
CN114782548B (en) | Global image-based radar data calibration method, device, equipment and medium | |
CN114494466B (en) | External parameter calibration method, device and equipment and storage medium | |
CN116027283A (en) | Method and device for automatic calibration of a road side sensing unit | |
CN114119682A (en) | Laser point cloud and image registration method and registration system | |
CN115588127B (en) | Method for fusing airborne laser point cloud and multispectral image | |
CN117058051A (en) | Method and device based on fusion of laser point cloud and low-light-level image | |
CN114581748B (en) | Multi-agent perception fusion system based on machine learning and implementation method thereof | |
CN114677658B (en) | Billion-pixel dynamic large scene image acquisition and multi-target detection method and device | |
CN114170323B (en) | Parameter calibration method, equipment and storage medium of image shooting device | |
Dai et al. | Roadside Edge Sensed and Fused Three-dimensional Localization using Camera and LiDAR |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||