CN116895059A - BEV space target detection method and device for multi-view perspective image - Google Patents

BEV space target detection method and device for multi-view perspective image

Info

Publication number
CN116895059A
CN116895059A
Authority
CN
China
Prior art keywords
bev
feature map
depth
topk
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310844740.2A
Other languages
Chinese (zh)
Inventor
居聪
刘国清
杨广
王启程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Youjia Innovation Technology Co ltd
Original Assignee
Shenzhen Youjia Innovation Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Youjia Innovation Technology Co ltd filed Critical Shenzhen Youjia Innovation Technology Co ltd
Priority to CN202310844740.2A priority Critical patent/CN116895059A/en
Publication of CN116895059A publication Critical patent/CN116895059A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G06N 3/0455 Auto-encoder networks; Encoder-decoder networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/70 Determining position or orientation of objects or cameras
    • G06T 7/73 Determining position or orientation of objects or cameras using feature-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/40 Extraction of image or video features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/764 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/77 Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V 10/7715 Feature extraction, e.g. by transforming the feature space, e.g. multi-dimensional scaling [MDS]; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/82 Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/30 Subject of image; Context of image processing
    • G06T 2207/30248 Vehicle exterior or interior
    • G06T 2207/30252 Vehicle exterior; Vicinity of vehicle
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/07 Target detection

Abstract

The invention discloses a BEV space target detection method and device for multi-view perspective images. In the method, the multi-view perspective images are input into a preset BEV target detector; an image feature extractor generates the corresponding perspective image feature maps, which are fed into a built-in depth feature extractor and a foreground-background mask extractor. The depth feature extractor generates a depth feature map and the foreground-background mask extractor generates a foreground-background mask feature map. The features of the foreground pixels are then projected into BEV space according to the perspective image feature map, the foreground-background mask feature map and the depth feature map to generate a BEV feature map, and the position and category information of the target to be detected is detected from the BEV feature map. By implementing the embodiments of the invention, the amount of computation can be reduced and the efficiency and accuracy of target detection can be improved.

Description

BEV space target detection method and device for multi-view perspective image
Technical Field
The invention relates to the fields of computer vision and autonomous driving, and in particular to a BEV space target detection method and device for multi-view perspective images.
Background
In the automotive field, targets such as surrounding obstacles, vehicles and pedestrians need to be sensed accurately. Although image-based 2D object detectors are a mature technology, images undergo a perspective effect during imaging, so the orientation and distance of a target relative to the ego vehicle are difficult to judge from the image detection result alone. A BEV object detector can detect targets directly in the BEV space using only the perspective images acquired by cameras mounted around the vehicle, without high-cost equipment such as lidar. BEV object detection represents the multiple perspective images in a unified BEV space, which in itself solves the problem of associating targets across views.
However, existing BEV object detectors project all image features along the imaging line of sight into the BEV space, so the amount of computation is large and target detection is inefficient.
Disclosure of Invention
The embodiments of the invention provide a BEV space target detection method and device for multi-view perspective images, which separate the foreground and background of the images during target detection and project only the foreground features. This reduces the amount of computation and improves detection efficiency; at the same time, the BEV feature map projected into the BEV space is free of background information, which is more conducive to training and learning of foreground targets and further improves the accuracy of target detection.
An embodiment of the present invention provides a BEV space target detection method for multi-view perspective images, including: acquiring multi-view perspective images of a target to be detected;
inputting the multi-view perspective images into a preset BEV target detector, so that the BEV target detector identifies the position and category information of the target to be detected;
wherein the BEV target detector identifies the position and category information of the target to be detected by:
performing feature extraction on the multi-view perspective images through an image feature extractor in the BEV target detector to generate perspective image feature maps;
inputting the perspective image feature maps to a depth feature extractor in the BEV target detector, so that the depth feature extractor extracts depth features of each pixel in the perspective image feature maps to generate a depth feature map;
inputting the perspective image feature maps to a foreground-background mask extractor in the BEV target detector, so that the foreground-background mask extractor extracts mask features of foreground pixels and background pixels in the perspective image feature maps to generate a foreground-background mask feature map;
generating, from the foreground-background mask feature map, a binarized foreground-background mask feature map that distinguishes foreground pixels from background pixels;
performing topK processing on the depth categories of each pixel in the depth feature map to obtain a topK probability feature map and a topK index feature map; calculating a probability-weighted feature map from the topK probability feature map and the perspective image feature map; calculating a topK depth feature map from the topK index feature map; generating BEV mapping coordinates of each pixel in a BEV coordinate system from the depth values of each pixel in the topK depth feature map and the pixel coordinates of each pixel;
extracting, according to the positions of foreground pixels in the binarized mask feature map, the BEV mapping coordinates of the foreground pixels among all BEV mapping coordinates and the probability-weighted features of the foreground pixels in the probability-weighted feature map, and then projecting the foreground features to generate a BEV feature map;
and detecting the position and category information of the target to be detected from the BEV feature map.
Further, the depth feature extractor extracts depth features of each pixel in the perspective image feature map and generates a depth feature map by:
determining the probability value of each depth category corresponding to each pixel of the perspective image feature map, and generating the depth feature map from the probability values of the depth categories of all pixels;
wherein the depth categories of each pixel are generated as follows: the preset depth range corresponding to the pixel is divided at a preset interval to generate a plurality of depth categories for that pixel.
Further, generating, from the foreground-background mask feature map, a binarized foreground-background mask feature map that distinguishes foreground pixels from background pixels includes:
extracting the mask value of each pixel in the foreground-background mask feature map;
comparing the mask value of each pixel with a preset threshold;
and updating the mask values of pixels whose mask value is greater than the preset threshold to a preset foreground mask value, and updating the mask values of pixels whose mask value is less than the preset threshold to a preset background mask value, to generate the binarized foreground-background mask feature map.
Further, performing topK processing on the depth categories of each pixel in the depth feature map to obtain a topK probability feature map and a topK index feature map includes:
for each pixel, sorting the probability values of the depth categories of that pixel in descending order and taking the first K probability values;
generating the topK probability feature map from the first K probability values of all pixels and the depth categories corresponding to those probability values;
for each pixel, determining the index values of its first K probability values from the index positions those probability values occupied before sorting; and generating the topK index feature map from the index values of the first K probability values of all pixels.
Further, calculating a probability-weighted feature map from the topK probability feature map and the perspective image feature map includes:
multiplying the K probability values of each pixel in the topK probability feature map with the feature values of each channel of the corresponding pixel in the perspective image feature map, to generate the probability-weighted feature map.
Further, calculating a topK depth feature map from the topK index feature map includes:
for each pixel in the topK index feature map, determining the depth category corresponding to each of the index values of its first K probability values; taking the centre point of the depth range of each such depth category to generate the K depth values of that pixel;
and generating the topK depth feature map from all depth values of all pixels.
Further, generating BEV mapping coordinates of each pixel in a BEV coordinate system from the depth values of each pixel in the topK depth feature map and the pixel coordinates of each pixel includes:
combining the depth values of each pixel in the topK depth feature map with the pixel coordinates of the corresponding pixel to generate three-dimensional space coordinates;
performing depth inverse normalization on the three-dimensional space coordinates to generate inverse-normalized coordinates;
mapping the inverse-normalized coordinates into the camera coordinate system of the corresponding camera to generate camera three-dimensional coordinates in the camera coordinate system;
and mapping the camera three-dimensional coordinates into the radar coordinate system to generate the BEV mapping coordinates of each pixel in the BEV coordinate system.
Further, extracting the BEV mapping coordinates of the foreground pixels among all BEV mapping coordinates and the probability-weighted features of the foreground pixels in the probability-weighted feature map according to the positions of the foreground pixels in the binarized mask feature map, and then projecting the foreground features to generate a BEV feature map, includes:
extracting, according to the positions of the foreground mask values in the binarized mask feature map, the BEV mapping coordinates of the foreground pixels among all BEV mapping coordinates and the probability-weighted features of the foreground pixels in the probability-weighted feature map, to obtain the BEV mapping coordinates of the foreground pixels and the probability-weighted feature map of the foreground pixels respectively;
rasterizing a preset BEV space into a plurality of BEV grid cells, filtering out the BEV mapping coordinates of foreground pixels that do not fall inside any BEV grid cell, and updating the BEV mapping coordinates of the foreground pixels and the probability-weighted feature map of the foreground pixels according to the filtered coordinates;
generating the coordinates of the valid BEV mapping points and the features of the valid BEV mapping points from the updated BEV mapping coordinates of the foreground pixels and the updated probability-weighted feature map of the foreground pixels;
and projecting the foreground features according to the coordinates and the features of the valid BEV mapping points to generate the BEV feature map.
Further, detecting the position and category information of the target to be detected from the BEV feature map includes:
inputting the BEV feature map to a BEV encoder in the BEV target detector, so that the BEV encoder encodes the BEV feature map to generate a BEV encoded feature map;
inputting the BEV encoded feature map to a BEV decoder in the BEV target detector, so that the BEV decoder decodes the BEV encoded feature map to generate a BEV decoded feature map;
and generating the position and category information of the target to be detected from the BEV decoded feature map.
On the basis of the above method embodiments, the invention correspondingly provides device embodiments.
An embodiment of the present invention provides a BEV space target detection device for multi-view perspective images, including a multi-view perspective image acquisition module and a target recognition module; the target recognition module includes a perspective image feature map generating unit, a depth feature map generating unit, a foreground-background mask feature map generating unit, a binarized foreground-background mask feature map generating unit, a BEV mapping coordinate generating unit, a BEV feature map generating unit, and a detection unit;
the multi-view perspective image acquisition module is used for acquiring multi-view perspective images of the target to be detected;
the target recognition module is used for inputting the multi-view perspective images into a preset BEV target detector, so that the BEV target detector identifies the position and category information of the target to be detected;
the perspective image feature map generating unit is used for performing feature extraction on the multi-view perspective images through an image feature extractor in the BEV target detector to generate perspective image feature maps;
the depth feature map generating unit is used for inputting the perspective image feature maps into a depth feature extractor in the BEV target detector, so that the depth feature extractor extracts depth features of each pixel in the perspective image feature maps to generate a depth feature map;
the foreground-background mask feature map generating unit is used for inputting the perspective image feature maps into a foreground-background mask extractor in the BEV target detector, so that the foreground-background mask extractor extracts mask features of foreground pixels and background pixels in the perspective image feature maps to generate a foreground-background mask feature map;
the binarized foreground-background mask feature map generating unit is used for generating, from the foreground-background mask feature map, a binarized foreground-background mask feature map that distinguishes foreground pixels from background pixels;
the BEV mapping coordinate generating unit is used for performing topK processing on the depth categories of each pixel in the depth feature map to obtain a topK probability feature map and a topK index feature map, calculating a probability-weighted feature map from the topK probability feature map and the perspective image feature map, calculating a topK depth feature map from the topK index feature map, and generating BEV mapping coordinates of each pixel in a BEV coordinate system from the depth values of each pixel in the topK depth feature map and the pixel coordinates of each pixel;
the BEV feature map generating unit is used for extracting, according to the positions of foreground pixels in the binarized mask feature map, the BEV mapping coordinates of the foreground pixels among all BEV mapping coordinates and the probability-weighted features of the foreground pixels in the probability-weighted feature map, and then projecting the foreground features to generate a BEV feature map;
the detection unit is used for detecting the position and category information of the target to be detected from the BEV feature map.
The invention has the following beneficial effects:
The embodiments of the invention provide a BEV space target detection method and device for multi-view perspective images. In the method, the multi-view perspective images are input into a preset BEV target detector; the BEV target detector generates the corresponding perspective image feature maps through a built-in image feature extractor, and the perspective image feature maps are then fed into a built-in depth feature extractor and a foreground-background mask extractor, the former generating a depth feature map and the latter a foreground-background mask feature map. A binarized foreground-background mask feature map that distinguishes foreground pixels from background pixels is generated from the foreground-background mask feature map, and the corresponding probability-weighted feature map and the BEV mapping coordinates of each pixel are generated from the depth feature map. According to the positions of foreground pixels in the binarized foreground-background mask feature map, the BEV mapping coordinates of the foreground pixels among all BEV mapping coordinates and the probability-weighted features of the foreground pixels in the probability-weighted feature map are then extracted, and the foreground features are projected to generate a BEV feature map. Finally, the position and category information of the target to be detected is detected from the BEV feature map. Compared with the prior art, only the foreground features are projected to generate the BEV feature map during detection, and the background features of the images need not be projected, which reduces the amount of computation and improves detection efficiency; at the same time, the BEV feature map projected into the BEV space is free of background information, which is more conducive to training and learning of foreground targets and further improves the accuracy of target detection.
Drawings
FIG. 1 is a flow chart of a method for detecting BEV space objects of a multi-view perspective image according to an embodiment of the present invention;
FIG. 2 is a flow chart of a BEV object detector provided by an embodiment of the present invention for identifying position and category information of an object to be detected;
FIG. 3 is a schematic diagram of a BEV spatial object detection apparatus for multi-view perspective images according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort shall fall within the scope of the invention.
As shown in fig. 1, an embodiment of the present invention provides a BEV space object detection method for a multi-view perspective image, which at least includes the following steps:
s1, acquiring a multi-view perspective image of an object to be detected.
In an embodiment of the invention, the pictures captured by cameras arranged around the vehicle body can be acquired to obtain the multi-view perspective images of the target to be detected.
Illustratively, 6 images are taken from 6 cameras that respectively perceive the front, front-left, front-right, rear-left, rear and rear-right of the vehicle.
S2, inputting the multi-view perspective images into a preset BEV object detector, so that the BEV object detector identifies the position and category information of the object to be detected.
In an embodiment of the present invention, the multi-view perspective images may be input to a trained BEV object detector. The input corresponding to the multi-view perspective images is a high-dimensional matrix of 6x3xHxW, where 6 represents the 6 images, 3 represents the 3 RGB channels of each image, and H and W are the height and width of the images. The BEV object detector receives the input multi-view perspective images and then identifies the position and category information of the object to be detected.
As shown in fig. 2, in an embodiment of the present invention, the BEV object detector identifies the position and type information of the object to be detected, and may include the following steps:
S21, performing feature extraction on the multi-view perspective images through an image feature extractor in the BEV object detector to generate perspective image feature maps.
In one embodiment of the present invention, an image feature extractor is provided in the BEV object detector, and the input multi-view images are processed by the image feature extractor to obtain a perspective image feature map of size 6xCx(H//8)x(W//8). The feature extraction process is a forward pass of a CNN. The notation //8 means that the CNN downsamples by a factor of 8, i.e. the resolution of the feature map is reduced 8 times; C is the number of channels of the feature map, i.e. the feature vector dimension.
S22, inputting the perspective image feature map into a depth feature extractor in the BEV object detector, so that the depth feature extractor extracts depth features of each pixel in the perspective image feature map to generate a depth feature map.
In an embodiment of the present invention, a depth feature extractor is also provided in the BEV object detector, and the depth features of each pixel in the perspective image feature map are extracted by the depth feature extractor to generate a depth feature map.
In a preferred embodiment, the depth feature extractor extracts depth features of each pixel in the perspective image feature map and generates a depth feature map by: determining the probability value of each depth category corresponding to each pixel of the perspective image feature map, and generating the depth feature map from the probability values of the depth categories of all pixels; the depth categories of each pixel are generated as follows: the preset depth range corresponding to the pixel is divided at a preset interval to generate a plurality of depth categories for that pixel.
Specifically, the depth feature extractor consists of an ordinary convolutional neural network and performs depth estimation as a classification problem: a preset depth range, for example 2-52 meters, is assumed for each pixel, and this range is divided at a preset interval, for example 0.5 meters, to generate (52-2)/0.5 = 100 depth categories. In the depth feature map, the D values of each pixel of each camera therefore represent the probability values of the 100 depth categories of that pixel, and the depth feature map of size 6xDx(H//8)x(W//8) is generated from the probability values of the 100 depth categories at all pixel positions.
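The depth-bin construction and a convolutional depth head can be sketched as follows. This is a minimal illustration only; PyTorch, the layer structure and names such as DepthHead are assumptions for the example and not the patented network.

```python
import torch
import torch.nn as nn

D_MIN, D_MAX, D_STEP = 2.0, 52.0, 0.5
NUM_BINS = int((D_MAX - D_MIN) / D_STEP)   # (52 - 2) / 0.5 = 100 depth categories

# Centre of each depth bin, used later when an index is turned back into a metric depth.
bin_centers = D_MIN + D_STEP * (torch.arange(NUM_BINS) + 0.5)   # 2.25, 2.75, ..., 51.75

class DepthHead(nn.Module):
    """Plain convolutional head mapping a CxHxW perspective feature map to
    per-pixel probabilities over NUM_BINS depth categories (illustrative)."""
    def __init__(self, in_channels: int, num_bins: int = NUM_BINS):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(in_channels, num_bins, 1),
        )

    def forward(self, feat):                    # feat: (6, C, H//8, W//8)
        return self.conv(feat).softmax(dim=1)   # (6, D=100, H//8, W//8)
```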
S23, inputting the perspective image feature map into a foreground-background mask extractor in the BEV object detector, so that the foreground-background mask extractor extracts mask features of foreground pixels and background pixels in the perspective image feature map to generate a foreground-background mask feature map.
In an embodiment of the present invention, a foreground-background mask extractor is also provided in the BEV object detector, and the mask features of foreground pixels and background pixels in the perspective image feature map are extracted by it to generate a foreground-background mask feature map of size 6x1x(H//8)x(W//8); preferably, the foreground-background mask extractor consists of an ordinary convolutional neural network.
S24, generating, from the foreground-background mask feature map, a binarized foreground-background mask feature map that distinguishes foreground pixels from background pixels.
After the foreground-background mask feature map is generated, it is binarized in order to distinguish the foreground from the background, thereby generating the binarized foreground-background mask feature map.
In a preferred embodiment, generating the binarized foreground-background mask feature map that distinguishes foreground pixels from background pixels according to the foreground-background mask feature map includes:
extracting the mask value of each pixel in the foreground-background mask feature map; comparing the mask value of each pixel with a preset threshold; and updating the mask values of pixels whose mask value is greater than the preset threshold to a preset foreground mask value and the mask values of pixels whose mask value is less than the preset threshold to a preset background mask value, to generate the binarized foreground-background mask feature map.
Specifically, when the preset threshold is set to 0.5, the preset foreground mask value is set to 1 and the background mask value is set to 0, the binarization updates the mask values of pixels whose mask value is less than 0.5 in the foreground-background mask feature map to 0, and the mask values of pixels whose mask value is greater than or equal to 0.5 to 1, generating the binarized foreground-background mask feature map, in which pixels with mask value 1 are foreground pixels and pixels with mask value 0 are background pixels.
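A minimal sketch of this binarization step, assuming the 0.5 threshold and the 1/0 foreground/background values given above (the function name and tensor layout are illustrative):

```python
import torch

def binarize_mask(fg_bg_mask: torch.Tensor, threshold: float = 0.5) -> torch.Tensor:
    """Turn the (6, 1, H//8, W//8) foreground-background mask feature map into a
    binary map: 1 = foreground pixel, 0 = background pixel."""
    return (fg_bg_mask >= threshold).to(fg_bg_mask.dtype)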
S25, performing topK processing on the depth categories of each pixel in the depth feature map to obtain a topK probability feature map and a topK index feature map; calculating a probability-weighted feature map from the topK probability feature map and the perspective image feature map; calculating a topK depth feature map from the topK index feature map; and generating BEV mapping coordinates of each pixel in a BEV coordinate system from the depth values of each pixel in the topK depth feature map and the pixel coordinates of each pixel.
In a preferred embodiment, performing topK processing on the depth categories of each pixel in the depth feature map to obtain a topK probability feature map and a topK index feature map includes:
for each pixel, sorting the probability values of the depth categories of that pixel in descending order and taking the first K probability values; generating the topK probability feature map from the first K probability values of all pixels and the depth categories corresponding to those probability values; for each pixel, determining the index values of its first K probability values from the index positions those probability values occupied before sorting; and generating the topK index feature map from the index values of the first K probability values of all pixels.
Following the example above, each pixel corresponds to 100 depth categories and therefore has 100 probability values. These 100 probability values are sorted in descending order, the first K are taken, and a topK probability feature map of size 6xKx(H//8)x(W//8) is generated from the K probability values and their corresponding depth categories. Preferably, K may be set to 5. Taking only the first K probability values reduces the amount of computation: since these probability values take part in a series of subsequent operations, using 5 values instead of 100 greatly reduces the computation.
At the same time, the corresponding index values are determined from the index positions the K probability values occupied before sorting, and a topK index feature map of size 6xKx(H//8)x(W//8) is generated from the determined index values.
The topK probability feature map describes the first K probability values obtained from the D probability values after topK processing. The topK index feature map describes the index positions of these first K probability values within the original D probability values, i.e. which of the D values they are.
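A minimal sketch of the topK step, assuming PyTorch tensors with the layouts described above (the helper name is illustrative); torch.topk directly yields both the topK probability feature map and the topK index feature map:

```python
import torch

def topk_depth(depth_feat: torch.Tensor, k: int = 5):
    """depth_feat: (6, D, H//8, W//8) per-pixel probabilities over D depth bins.
    Returns the topK probability feature map and the topK index feature map,
    both shaped (6, k, H//8, W//8)."""
    probs, indices = torch.topk(depth_feat, k=k, dim=1)   # sorted, largest first
    return probs, indices
```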
In a preferred embodiment, calculating the probability-weighted feature map from the topK probability feature map and the perspective image feature map includes:
multiplying the K probability values of each pixel in the topK probability feature map with the feature values of each channel of the corresponding pixel in the perspective image feature map, to generate the probability-weighted feature map.
Specifically, the K probability values of each pixel of the topK probability feature map (6xKx(H//8)x(W//8)) are multiplied by the feature values of the C channels of the same pixel in the perspective image feature map (6xCx(H//8)x(W//8)) to obtain a probability-weighted feature map of size 6xCxKx(H//8)x(W//8).
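A minimal sketch of this probability weighting, assuming the tensor layouts given above; broadcasting is one possible realization, not necessarily the one used in the patent:

```python
import torch

def weight_features(img_feat: torch.Tensor, topk_probs: torch.Tensor) -> torch.Tensor:
    """img_feat:   (6, C, H//8, W//8) perspective image feature map
       topk_probs: (6, K, H//8, W//8) topK probability feature map
       returns:    (6, C, K, H//8, W//8) probability-weighted feature map,
       i.e. each of the K probabilities scales all C feature channels of its pixel."""
    return img_feat.unsqueeze(2) * topk_probs.unsqueeze(1)
```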
In a preferred embodiment, calculating the topK depth feature map from the topK index feature map includes:
for each pixel in the topK index feature map, determining the depth category corresponding to each of its K index values; taking the centre point of the depth range of each such depth category to generate the K depth values of that pixel; and generating the topK depth feature map from all depth values of all pixels.
Following the example above, the K index values of each pixel of each camera in the topK index feature map are used to look up the corresponding depth category among the 100 predefined depth categories, and the centre point of the depth range represented by that category is taken as the depth value: if 2-2.5 meters is the depth range of a category, its centre point is 2.25, so 2.25 is taken as the depth value. Finally a topK depth feature map of size 6xKx(H//8)x(W//8)x1 is generated from all depth values of all pixels.
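A minimal sketch of turning the topK index feature map into the topK depth feature map by looking up the bin centres; bin_centers is the vector of depth-bin centre points defined in the earlier sketch (names are illustrative):

```python
import torch

def topk_depth_values(topk_indices: torch.Tensor, bin_centers: torch.Tensor) -> torch.Tensor:
    """topk_indices: (6, K, H//8, W//8) indices into the D predefined depth bins
       bin_centers:  (D,) centre depth of each bin, e.g. 2.25 for the 2-2.5 m bin
       returns:      (6, K, H//8, W//8) metric depths, the topK depth feature map."""
    return bin_centers.to(topk_indices.device)[topk_indices]
```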
In a preferred embodiment, generating BEV mapping coordinates of each pixel in the BEV coordinate system from the depth values of each pixel in the topK depth feature map and the pixel coordinates of each pixel includes:
combining the depth values of each pixel in the topK depth feature map with the pixel coordinates of the corresponding pixel to generate three-dimensional space coordinates; performing depth inverse normalization on the three-dimensional space coordinates to generate inverse-normalized coordinates; mapping the inverse-normalized coordinates into the camera coordinate system of the corresponding camera to generate camera three-dimensional coordinates; and mapping the camera three-dimensional coordinates into the radar coordinate system to generate the BEV mapping coordinates of each pixel in the BEV coordinate system.
Specifically, each pixel of each camera in the topK depth feature map corresponds to K depth values; each depth value is combined with its pixel coordinates (u, v) to obtain three-dimensional space coordinates of size 6xKx(H//8)x(W//8)x3, where the first and second values of the last dimension are the pixel coordinates (u, v) and the third value is the depth value z. Depth inverse normalization is then performed, i.e. the first and second values are multiplied by the depth value to obtain (uz, vz, z), giving new inverse-normalized coordinates of size 6xKx(H//8)x(W//8)x3.
Then, using the pre-calibrated intrinsic matrix of each camera, the inverse-normalized coordinates are mapped into the camera coordinate system by matrix multiplication, giving three-dimensional coordinates of size 6xKx(H//8)x(W//8)x3 in the camera coordinate system.
Finally, using the pre-calibrated extrinsic matrix, the camera three-dimensional coordinates are mapped by matrix multiplication into the radar coordinate system, i.e. the BEV coordinate system or the vehicle body coordinate system (the three coordinate systems can be unified), giving BEV mapping coordinates of size 6xKx(H//8)x(W//8)x3, i.e. the BEV mapping coordinates of each pixel in the BEV coordinate system.
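A minimal sketch of the whole coordinate chain, from feature-map pixels through depth inverse normalization and the camera intrinsics to the radar/BEV coordinate system. The handling of the 8x feature-map stride, the half-pixel offset and the homogeneous extrinsics are assumptions added to make the example self-contained:

```python
import torch

def pixel_to_bev_coords(depth: torch.Tensor, intrinsics: torch.Tensor,
                        cam_to_lidar: torch.Tensor, stride: int = 8) -> torch.Tensor:
    """depth:        (6, K, h, w) topK depth feature map (h = H//stride, w = W//stride)
       intrinsics:   (6, 3, 3) per-camera intrinsic matrices (full-resolution pixels)
       cam_to_lidar: (6, 4, 4) per-camera extrinsics mapping camera to radar/BEV frame
       returns:      (6, K, h, w, 3) BEV mapping coordinates."""
    n, k, h, w = depth.shape
    v, u = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    u = (u.float() + 0.5) * stride                 # back to image-plane pixel coordinates
    v = (v.float() + 0.5) * stride
    uv = torch.stack((u, v), dim=-1).expand(n, k, h, w, 2)
    # depth inverse normalization: (u, v, z) -> (u*z, v*z, z)
    pts = torch.cat((uv * depth.unsqueeze(-1), depth.unsqueeze(-1)), dim=-1)   # (6,K,h,w,3)
    # image plane -> camera frame via the inverse intrinsics
    cam = torch.einsum("nij,nkhwj->nkhwi", torch.inverse(intrinsics), pts)
    # camera frame -> radar/BEV frame via the pre-calibrated extrinsics
    cam_h = torch.cat((cam, torch.ones_like(cam[..., :1])), dim=-1)            # homogeneous
    bev = torch.einsum("nij,nkhwj->nkhwi", cam_to_lidar, cam_h)[..., :3]
    return bev
```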
S26, extracting, according to the positions of foreground pixels in the binarized mask feature map, the BEV mapping coordinates of the foreground pixels among all BEV mapping coordinates and the probability-weighted features of the foreground pixels in the probability-weighted feature map, and then projecting the foreground features to generate the BEV feature map.
In a preferred embodiment, extracting the BEV mapping coordinates of the foreground pixels among all BEV mapping coordinates and the probability-weighted features of the foreground pixels in the probability-weighted feature map according to the positions of the foreground pixels in the binarized mask feature map, and then projecting the foreground features to generate a BEV feature map, includes:
extracting, according to the positions of the foreground mask values in the binarized mask feature map, the BEV mapping coordinates of the foreground pixels among all BEV mapping coordinates and the probability-weighted features of the foreground pixels in the probability-weighted feature map, to obtain the BEV mapping coordinates of the foreground pixels and the probability-weighted feature map of the foreground pixels respectively;
rasterizing a preset BEV space into a plurality of BEV grid cells, filtering out the BEV mapping coordinates of foreground pixels that do not fall inside any BEV grid cell, and updating the BEV mapping coordinates of the foreground pixels and the probability-weighted feature map of the foreground pixels according to the filtered coordinates;
generating the coordinates of the valid BEV mapping points and the features of the valid BEV mapping points from the updated BEV mapping coordinates of the foreground pixels and the updated probability-weighted feature map of the foreground pixels;
and projecting the foreground features according to the coordinates and the features of the valid BEV mapping points to generate the BEV feature map.
Specifically, the probability-weighted feature map and the BEV mapping coordinates are filtered using the positions of the foreground pixels in the binarized mask feature map, i.e. the positions with value 1, to obtain the probability-weighted feature map of the foreground (the foreground probability-weighted feature map, of size 6xCxKxM) and the BEV mapping coordinates of the foreground pixels (the foreground BEV mapping coordinates, of size 6xKxMx3). M is the number of foreground pixels among the (H//8)x(W//8) pixel coordinates, and the M entries correspond one to one.
The BEV space is then rasterized and the mapping points that fall within it are preserved. The multiple perspective images contain surround-view information of the ego vehicle; a three-dimensional cube covering 50 meters ahead, 46 meters behind, 32 meters to the left and right, 5 meters above and 3 meters below is built in the vehicle body coordinate system to obtain the preset BEV space, with the xy plane parallel to the ground and the vertical axis as the z axis. This space is rasterized in units of 0.5 meters in the x and y directions and 8 meters in the z direction, giving [96/0.5, 64/0.5, 8/8] = 192x128x1 BEV grid cells, each corresponding to a sub-region of the three-dimensional space. Points that do not fall inside any BEV grid cell are then filtered out and the valid points are preserved: according to the 3-dimensional coordinates of the 6xKxM points in the foreground BEV mapping coordinates [6xKxMx3] (the BEV space uniformly represents the 6 images), points whose landing positions are outside the grid are discarded and the rest are kept, where both keeping and discarding are applied to the foreground probability-weighted feature map and the foreground BEV mapping coordinates. This yields the valid BEV mapping points Nx3 and the valid BEV mapping point features CxN (the foreground probability-weighted feature map is regarded as the features of 6xKxM points, each with feature vector dimension C), where N is the number of valid points obtained by filtering the 6xKxM points.
The N points are traversed and the corresponding features are placed into the BEV grid cells according to their three-dimensional coordinates; if several mapping points fall into the same cell, their C-dimensional feature vectors are added, and the C-dimensional feature of any cell that receives no feature is initialized to 0. Finally a BEV feature map of size Cx192x128 is obtained, where 192 = H_BEV and 128 = W_BEV. The generated BEV feature map contains only the features of foreground pixels, which greatly reduces the amount of computation; at the same time, the BEV feature map projected into the BEV space is free of background information, which is more conducive to training and learning of foreground targets and further improves the accuracy of target detection.
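A minimal sketch of the foreground feature projection described above, assuming grid ranges consistent with the 50/46/32 m figures and using a scatter-add to sum the features of mapping points that fall into the same BEV cell (axis conventions and function names are illustrative):

```python
import torch

def splat_foreground_features(bev_xyz: torch.Tensor, weighted_feat: torch.Tensor,
                              fg_mask: torch.Tensor,
                              x_range=(-46.0, 50.0), y_range=(-32.0, 32.0),
                              z_range=(-3.0, 5.0), cell=0.5) -> torch.Tensor:
    """bev_xyz:       (6, K, h, w, 3) BEV mapping coordinates
       weighted_feat: (6, C, K, h, w) probability-weighted feature map
       fg_mask:       (6, 1, h, w) binarized mask, 1 = foreground
       Returns a (C, 192, 128) BEV feature map."""
    n, c, k, h, w = weighted_feat.shape
    feat = weighted_feat.permute(0, 2, 3, 4, 1).reshape(-1, c)      # (6*K*h*w, C)
    xyz = bev_xyz.reshape(-1, 3)
    fg = fg_mask.bool().expand(n, k, h, w).reshape(-1)              # keep foreground points only
    feat, xyz = feat[fg], xyz[fg]

    nx = int((x_range[1] - x_range[0]) / cell)                      # 96 / 0.5 = 192
    ny = int((y_range[1] - y_range[0]) / cell)                      # 64 / 0.5 = 128
    ix = ((xyz[:, 0] - x_range[0]) / cell).long()
    iy = ((xyz[:, 1] - y_range[0]) / cell).long()
    valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny) \
            & (xyz[:, 2] >= z_range[0]) & (xyz[:, 2] < z_range[1])  # drop points outside the grid
    ix, iy, feat = ix[valid], iy[valid], feat[valid]

    bev = torch.zeros(c, nx * ny, dtype=feat.dtype)                 # cells with no points stay 0
    bev.index_add_(1, ix * ny + iy, feat.t())                       # sum features per cell
    return bev.reshape(c, nx, ny)                                   # (C, 192, 128)
```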
S27, detecting the position and category information of the target to be detected from the BEV feature map.
In a preferred embodiment, detecting the position and category information of the target to be detected from the BEV feature map includes:
inputting the BEV feature map to a BEV encoder in the BEV object detector, so that the BEV encoder encodes the BEV feature map to generate a BEV encoded feature map; inputting the BEV encoded feature map to a BEV decoder in the BEV object detector, so that the BEV decoder decodes the BEV encoded feature map to generate a BEV decoded feature map; and generating the position and category information of the target to be detected from the BEV decoded feature map.
In an embodiment of the invention, the BEV object detector is further provided with a BEV encoder and a BEV decoder, both of which are CNN networks. The BEV feature map is encoded by the BEV encoder to generate a BEV encoded feature map, the BEV encoded feature map is decoded by the BEV decoder, and finally the BEV decoded feature map is post-processed to obtain the position and category information of the detected target.
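A minimal placeholder sketch of a convolutional BEV encoder/decoder with simple classification and regression heads; the actual network depth, channel widths and post-processing are not specified by the description above and are assumptions here:

```python
import torch.nn as nn

class BEVEncoderDecoder(nn.Module):
    def __init__(self, c_in: int, c_mid: int = 128, num_classes: int = 10):
        super().__init__()
        self.encoder = nn.Sequential(               # encodes the C x H_BEV x W_BEV feature map
            nn.Conv2d(c_in, c_mid, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(c_mid, c_mid, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.decoder = nn.Sequential(               # decodes back to BEV resolution
            nn.ConvTranspose2d(c_mid, c_mid, 2, stride=2), nn.ReLU(inplace=True),
            nn.Conv2d(c_mid, c_mid, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.cls_head = nn.Conv2d(c_mid, num_classes, 1)   # per-cell category scores
        self.reg_head = nn.Conv2d(c_mid, 7, 1)             # e.g. x, y, z, w, l, h, yaw

    def forward(self, bev_feat):                    # bev_feat: (B, C, 192, 128)
        dec = self.decoder(self.encoder(bev_feat))
        return self.cls_head(dec), self.reg_head(dec)
```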
To better illustrate the invention, the training process of the BEV object detector is described below from the perspective of model training. It comprises the following steps:
1. Prepare the training data, the BEV annotations corresponding to the training data, and the annotation processing. The BEV annotations of the training data contain the target position information and category information in BEV space; in practical applications they can be labelled manually from the point clouds of a roof-mounted radar or predicted by a point cloud detection model.
The annotation processing includes processing the training data annotations to obtain foreground-background mask labels, and processing the point cloud data to obtain the sparse depth label corresponding to each image.
The radar coordinate system is the unified coordinate system of the BEV space, and the mapping relation between the radar coordinate system and the pixel coordinate system of each camera image can be obtained by calibration in advance.
When processing the training data annotations, specifically, the target position coordinates in the radar coordinate system can be projected into the camera coordinate system through the extrinsic matrix, and then from the camera coordinate system into the pixel coordinate system of the corresponding image through the intrinsic matrix, so that the region of each BEV-space target in each image is known; this region is enclosed by a rectangle to obtain the ROI of the BEV-space target in each image. An all-zero mask annotation matrix of size 6x1xHxW is initialized, and the regions of the mask matrix corresponding to all target ROIs of the 6 images are set to 1, indicating that those pixel positions contain a foreground target.
When processing the point cloud data, all point cloud coordinates (which are in the radar coordinate system) are projected into the camera coordinate system through the extrinsic matrix and then into the pixel coordinate system of each image through the intrinsic matrix. Combining the depth of each point in the camera coordinate system with its pixel position, the depth of each pixel position of each image is known; if a pixel corresponds to no radar point, its depth is 0. A depth annotation matrix of size 6x1xHxW is thus obtained. Depth here is the distance of each real-world point from the optical centre of the camera; the camera imaging process itself loses this depth information.
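A minimal sketch of building the sparse depth label for one camera by projecting the radar/lidar points through the extrinsic and intrinsic matrices, as described above (names and the single-camera interface are illustrative):

```python
import torch

def sparse_depth_label(points_lidar: torch.Tensor, lidar_to_cam: torch.Tensor,
                       intrinsics: torch.Tensor, H: int, W: int) -> torch.Tensor:
    """points_lidar: (N, 3) radar/lidar points in the unified BEV coordinate system
       lidar_to_cam: (4, 4) extrinsics for one camera; intrinsics: (3, 3)
       Returns an (H, W) sparse depth label: pixel value = depth of the projected
       point, 0 where no point falls."""
    pts_h = torch.cat((points_lidar, torch.ones(len(points_lidar), 1)), dim=1)
    cam = (lidar_to_cam @ pts_h.t()).t()[:, :3]            # radar frame -> camera frame
    cam = cam[cam[:, 2] > 0]                               # keep points in front of the camera
    uvz = (intrinsics @ cam.t()).t()                       # perspective projection
    u = (uvz[:, 0] / uvz[:, 2]).long()
    v = (uvz[:, 1] / uvz[:, 2]).long()
    depth = cam[:, 2]
    label = torch.zeros(H, W)
    inside = (u >= 0) & (u < W) & (v >= 0) & (v < H)
    label[v[inside], u[inside]] = depth[inside]            # later points overwrite earlier ones
    return label
```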
2. Multiple perspective image samples captured by the cameras mounted around the vehicle body are input (e.g. 6 images perceiving the front, front-left, front-right, rear-left, rear and rear-right respectively), with the perception areas of adjacent cameras overlapping.
3. The input multi-view perspective image samples pass through the image feature extractor, which extracts the perspective image feature maps corresponding to the training samples.
4. The perspective image feature maps corresponding to the training samples are fed into the depth feature extractor and the foreground-background mask extractor respectively, to obtain the depth feature map and the foreground-background mask feature map corresponding to the training samples.
5. The mask annotation matrix is used to supervise the foreground-background mask feature map corresponding to the training samples, and the mask loss is calculated.
6. The depth annotation matrix is used to supervise the depth feature map corresponding to the training samples, and the depth loss is calculated.
7. The depth feature map, the perspective image feature map and the foreground-background mask feature map corresponding to the training samples are used to transform the features into BEV space, obtaining the BEV feature map corresponding to the training samples.
8. The BEV encoder is applied to the BEV feature map corresponding to the training samples to obtain the corresponding BEV encoded feature map.
9. The BEV decoder is applied to the BEV encoded feature map corresponding to the training samples to obtain the corresponding BEV decoded feature map.
10. The BEV annotations corresponding to the training data are used to supervise the BEV decoded feature map, and the target detection loss is calculated.
11. All CNN model parameters are updated and trained with the sum of the mask loss, the depth loss and the target detection loss using the back-propagation algorithm (a minimal sketch of such a combined training step is given below).
12. The above process is repeated until the network converges, thereby obtaining a trained BEV object detector.
Preferably, in the BEV space object detection method for multi-view perspective images provided by the embodiments of the present invention, object detection is performed by the BEV object detector trained in steps 1 to 12. It will be understood that the specific processing flow of steps 2 to 9 during model training is consistent with that of the corresponding steps when the model is applied for target detection; to avoid redundant description, it is not described further.
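As referenced in step 11, a minimal sketch of one combined training step follows. The concrete loss functions are assumptions, since the description only states that the mask loss, depth loss and target detection loss are summed and back-propagated through all CNN parameters:

```python
import torch
import torch.nn.functional as F

def training_step(pred_mask, mask_label, depth_logits, depth_bin_label,
                  cls_logits, cls_label, reg_pred, reg_label, optimizer):
    """pred_mask:       (B, 1, h, w) foreground-background mask prediction in [0, 1]
       mask_label:      (B, 1, h, w) 0/1 mask annotation matrix (at feature resolution)
       depth_logits:    (B, D, h, w) per-pixel depth-bin logits
       depth_bin_label: (B, h, w) ground-truth bin index, -1 where no radar point fell
       cls_logits/cls_label, reg_pred/reg_label: BEV detection head outputs and labels."""
    mask_loss = F.binary_cross_entropy(pred_mask, mask_label)
    depth_loss = F.cross_entropy(depth_logits, depth_bin_label, ignore_index=-1)
    det_loss = F.cross_entropy(cls_logits, cls_label) + F.l1_loss(reg_pred, reg_label)
    loss = mask_loss + depth_loss + det_loss        # mask loss + depth loss + detection loss
    optimizer.zero_grad()
    loss.backward()                                 # back-propagation through all CNN parameters
    optimizer.step()
    return loss.detach()
```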
On the basis of the above method embodiments, the invention correspondingly provides device embodiments:
As shown in fig. 3, an embodiment of the present invention provides a BEV space object detection apparatus for multi-view perspective images, including a multi-view perspective image acquisition module and a target recognition module; the target recognition module includes a perspective image feature map generating unit, a depth feature map generating unit, a foreground-background mask feature map generating unit, a binarized foreground-background mask feature map generating unit, a BEV mapping coordinate generating unit, a BEV feature map generating unit, and a detection unit;
the multi-view perspective image acquisition module is used for acquiring multi-view perspective images of the target to be detected;
the target recognition module is used for inputting the multi-view perspective images into a preset BEV target detector, so that the BEV target detector identifies the position and category information of the target to be detected;
the perspective image feature map generating unit is used for performing feature extraction on the multi-view perspective images through an image feature extractor in the BEV target detector to generate perspective image feature maps;
the depth feature map generating unit is used for inputting the perspective image feature maps into a depth feature extractor in the BEV target detector, so that the depth feature extractor extracts depth features of each pixel in the perspective image feature maps to generate a depth feature map;
the foreground-background mask feature map generating unit is used for inputting the perspective image feature maps into a foreground-background mask extractor in the BEV target detector, so that the foreground-background mask extractor extracts mask features of foreground pixels and background pixels in the perspective image feature maps to generate a foreground-background mask feature map;
the binarized foreground-background mask feature map generating unit is used for generating, from the foreground-background mask feature map, a binarized foreground-background mask feature map that distinguishes foreground pixels from background pixels;
the BEV mapping coordinate generating unit is used for performing topK processing on the depth categories of each pixel in the depth feature map to obtain a topK probability feature map and a topK index feature map, calculating a probability-weighted feature map from the topK probability feature map and the perspective image feature map, calculating a topK depth feature map from the topK index feature map, and generating BEV mapping coordinates of each pixel in a BEV coordinate system from the depth values of each pixel in the topK depth feature map and the pixel coordinates of each pixel;
the BEV feature map generating unit is used for extracting, according to the positions of foreground pixels in the binarized mask feature map, the BEV mapping coordinates of the foreground pixels among all BEV mapping coordinates and the probability-weighted features of the foreground pixels in the probability-weighted feature map, and then projecting the foreground features to generate a BEV feature map;
the detection unit is used for detecting the position and category information of the target to be detected from the BEV feature map.
It should be noted that the above-described apparatus embodiments are merely illustrative. The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment. In addition, in the drawings of the apparatus embodiments provided by the invention, the connection relation between modules indicates that they are communicatively connected, which may specifically be implemented as one or more communication buses or signal lines. Those of ordinary skill in the art can understand and implement the invention without inventive effort.
While the foregoing describes the preferred embodiments of the present invention, it will be appreciated by those skilled in the art that changes and modifications may be made without departing from the principles of the invention, and such changes and modifications are also intended to fall within the scope of the invention.

Claims (10)

1. A BEV space target detection method for multi-view perspective images, comprising:
acquiring multi-view perspective images of a target to be detected;
inputting the multi-view perspective images into a preset BEV target detector, so that the BEV target detector identifies the position and category information of the target to be detected;
wherein the BEV target detector identifies the position and category information of the target to be detected by:
performing feature extraction on the multi-view perspective images through an image feature extractor in the BEV target detector to generate perspective image feature maps;
inputting the perspective image feature maps to a depth feature extractor in the BEV target detector, so that the depth feature extractor extracts depth features of each pixel in the perspective image feature maps to generate a depth feature map;
inputting the perspective image feature maps to a foreground-background mask extractor in the BEV target detector, so that the foreground-background mask extractor extracts mask features of foreground pixels and background pixels in the perspective image feature maps to generate a foreground-background mask feature map;
generating, from the foreground-background mask feature map, a binarized foreground-background mask feature map that distinguishes foreground pixels from background pixels;
performing topK processing on the depth categories of each pixel in the depth feature map to obtain a topK probability feature map and a topK index feature map; calculating a probability-weighted feature map from the topK probability feature map and the perspective image feature map; calculating a topK depth feature map from the topK index feature map; generating BEV mapping coordinates of each pixel in a BEV coordinate system from the depth values of each pixel in the topK depth feature map and the pixel coordinates of each pixel;
extracting, according to the positions of foreground pixels in the binarized mask feature map, the BEV mapping coordinates of the foreground pixels among all BEV mapping coordinates and the probability-weighted features of the foreground pixels in the probability-weighted feature map, and then projecting the foreground features to generate a BEV feature map;
and detecting the position and category information of the target to be detected from the BEV feature map.
2. The BEV space target detection method for a multi-view perspective image according to claim 1, wherein the depth feature extractor extracting depth features of each pixel point in the perspective image feature map to generate the depth feature map comprises:
determining probability values of the depth categories corresponding to each pixel point of the perspective image feature map, and generating the depth feature map according to the probability values of the depth categories corresponding to all the pixel points;
wherein the depth categories corresponding to each pixel point are generated by dividing a preset depth range corresponding to the pixel point at preset intervals to obtain a plurality of depth categories corresponding to the pixel point.
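As a worked illustration of the depth-category construction in claim 2, the sketch below divides an assumed preset depth range of [1 m, 61 m) at an assumed interval of 1 m, giving 60 depth categories whose centers can later serve as candidate depth values.

import torch

d_min, d_max, step = 1.0, 61.0, 1.0                    # assumed preset depth range and interval
bin_edges = torch.arange(d_min, d_max + step, step)    # cut points over the depth range
bin_centers = 0.5 * (bin_edges[:-1] + bin_edges[1:])   # center of each depth category (60 in total)
num_depth_categories = bin_centers.numel()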
3. The BEV space target detection method for a multi-view perspective image according to claim 2, wherein generating, according to the foreground-background mask feature map, the binarized foreground-background mask feature map for distinguishing foreground pixel points from background pixel points comprises:
extracting mask values of all pixel points in the foreground-background mask feature map;
comparing the mask value of each pixel point with a preset threshold value;
and updating the mask values of the pixel points whose mask values are greater than the preset threshold value to a preset foreground mask value, updating the mask values of the pixel points whose mask values are smaller than the preset threshold value to a preset background mask value, and generating the binarized foreground-background mask feature map.
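The binarization of claim 3 amounts to a thresholding step; in the sketch below the threshold 0.5, the foreground mask value 1, the background mask value 0 and the tensor shape are assumed values.

import torch

mask = torch.rand(1, 1, 32, 88)                    # foreground-background mask feature map (assumed shape)
threshold = 0.5                                    # preset threshold (assumption)
binary_mask = torch.where(mask > threshold,
                          torch.ones_like(mask),   # preset foreground mask value
                          torch.zeros_like(mask))  # preset background mask value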
4. The BEV space target detection method for a multi-view perspective image according to claim 3, wherein performing topK processing on each depth category corresponding to each pixel point in the depth feature map to obtain the topK probability feature map and the topK index feature map comprises:
for each pixel point, sorting the probability values of the depth categories corresponding to the pixel point in descending order, and then taking the first K probability values;
generating the topK probability feature map according to the first K probability values of all the pixel points and the depth categories corresponding to the first K probability values;
for each pixel point, determining the index values of its first K probability values according to the index positions of these probability values before sorting; and generating the topK index feature map according to the index values of the first K probability values of all the pixel points.
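In PyTorch, torch.topk returns both the K largest probabilities and their original index positions along the depth dimension, which corresponds directly to the topK probability feature map and the topK index feature map of claim 4; K = 5 and the tensor shape are assumptions.

import torch

depth_prob = torch.rand(1, 60, 32, 88).softmax(dim=1)   # depth feature map: (B, D, H, W)
K = 5                                                    # assumed K
topk_prob, topk_idx = torch.topk(depth_prob, K, dim=1)   # each of shape (B, K, H, W)
# topk_prob is the topK probability feature map, topk_idx the topK index feature map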
5. The BEV space target detection method for a multi-view perspective image according to claim 4, wherein computing the probability-weighted feature map from the topK probability feature map and the perspective image feature map comprises:
multiplying the K probability values of each pixel point in the topK probability feature map by the feature values of each channel of the corresponding pixel point in the perspective image feature map, respectively, to generate the probability-weighted feature map.
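The channel-wise weighting of claim 5 can be written as a single broadcast multiplication of the K probabilities with the C feature channels of each pixel; the resulting (B, K, C, H, W) layout is one possible convention, not something fixed by the claim.

import torch

feats = torch.rand(1, 64, 32, 88)                        # perspective image feature map: (B, C, H, W)
topk_prob = torch.rand(1, 5, 32, 88)                     # topK probability feature map: (B, K, H, W)
weighted = topk_prob.unsqueeze(2) * feats.unsqueeze(1)   # probability-weighted feature map: (B, K, C, H, W)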
6. The BEV space target detection method for a multi-view perspective image according to claim 5, wherein computing the topK depth feature map from the topK index feature map comprises:
for each pixel point in the topK index feature map, determining the depth category corresponding to each index value according to the index values of the first K probability values of the pixel point, and extracting the center point of the depth range corresponding to each depth category to generate K depth values corresponding to the pixel point;
and generating the topK depth feature map according to all depth values of all the pixel points.
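Looking up precomputed bin centers with the topK indices yields the K candidate depth values per pixel described in claim 6; bin_centers below matches the example 60-category binning sketched under claim 2, and all shapes are assumptions.

import torch

bin_centers = torch.linspace(1.5, 60.5, 60)          # assumed center point of each depth category
topk_idx = torch.randint(0, 60, (1, 5, 32, 88))      # topK index feature map: (B, K, H, W)
topk_depth = bin_centers[topk_idx]                    # topK depth feature map: (B, K, H, W)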
7. The BEV space target detection method for a multi-view perspective image according to claim 6, wherein generating the BEV mapping coordinates of each pixel point in the BEV coordinate system from the depth value of each pixel point in the topK depth feature map and the pixel coordinates of each pixel point comprises:
combining the depth values of all the pixel points in the topK depth feature map with the pixel coordinates of the corresponding pixel points to generate three-dimensional space coordinates;
performing depth inverse normalization on the three-dimensional space coordinates to generate inverse-normalized three-dimensional coordinates;
mapping the inverse-normalized three-dimensional coordinates to the camera coordinate system of the corresponding camera to generate camera three-dimensional coordinates in the camera coordinate system;
and mapping the camera three-dimensional coordinates to a radar coordinate system to generate the BEV mapping coordinates of each pixel point in the BEV coordinate system.
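The coordinate chain of claim 7 (pixel coordinates plus depth, then camera frame, then radar frame) follows the usual pinhole back-projection; the intrinsic matrix K_cam, the camera-to-radar extrinsic T_cam2radar and the feature-map size below are assumptions, and the depth inverse normalization is simplified here by treating the looked-up depths as already metric.

import torch

H, W, K = 32, 88, 5
K_cam = torch.tensor([[500., 0., 44.], [0., 500., 16.], [0., 0., 1.]])   # assumed camera intrinsics
T_cam2radar = torch.eye(4)                                               # assumed camera-to-radar extrinsic
v, u = torch.meshgrid(torch.arange(H).float(), torch.arange(W).float(), indexing="ij")
depth = torch.rand(K, H, W) * 60.0                                       # K candidate depths per pixel
uv1 = torch.stack([u, v, torch.ones_like(u)], dim=0).reshape(3, -1)      # homogeneous pixel coordinates
rays = torch.linalg.inv(K_cam) @ uv1                                     # viewing rays in the camera frame
cam = rays.unsqueeze(0) * depth.reshape(K, 1, -1)                        # camera three-dimensional coordinates
cam_h = torch.cat([cam, torch.ones(K, 1, H * W)], dim=1)                 # homogeneous camera coordinates
bev_coords = (T_cam2radar @ cam_h)[:, :3]                                # BEV mapping coordinates (radar frame)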
8. The BEV space target detection method for a multi-view perspective image according to claim 7, wherein extracting, according to the positions of the foreground pixel points in the binarized foreground-background mask feature map, the BEV mapping coordinates of the foreground pixel points from all BEV mapping coordinates and the probability-weighted features of the foreground pixel points from the probability-weighted feature map, and then performing foreground feature placement to generate the BEV feature map comprises:
extracting, according to the positions of the foreground mask values in the binarized foreground-background mask feature map, the BEV mapping coordinates of the foreground pixel points from all BEV mapping coordinates and the probability-weighted features of the foreground pixel points from the probability-weighted feature map, so as to respectively obtain the BEV mapping coordinates of the foreground pixel points and the probability-weighted feature map of the foreground pixel points;
rasterizing a preset BEV space to obtain a plurality of BEV grids, filtering out the BEV mapping coordinates of the foreground pixel points that fall outside the BEV grids, and updating the BEV mapping coordinates of the foreground pixel points and the probability-weighted feature map of the foreground pixel points according to the filtered BEV mapping coordinates;
generating coordinates of effective BEV mapping points and features of the effective BEV mapping points according to the updated BEV mapping coordinates of the foreground pixel points and the updated probability-weighted feature map of the foreground pixel points;
and performing foreground feature placement according to the coordinates of the effective BEV mapping points and the features of the effective BEV mapping points to generate the BEV feature map.
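Foreground selection, BEV rasterization and the foreground feature placement of claim 8 can be sketched as a range filter followed by a scatter into a BEV grid; the grid extent, the 0.5 m cell size and the point/feature shapes are assumptions, and colliding points are resolved here by simple overwrite rather than any particular pooling rule.

import torch

C, P = 64, 1000
coords = torch.rand(P, 3) * 120 - 60                     # BEV mapping coordinates of foreground points (x, y, z)
feats = torch.rand(P, C)                                  # probability-weighted features of the same points
x_rng, y_rng, cell = (-50.0, 50.0), (-50.0, 50.0), 0.5    # assumed BEV extent and grid resolution
nx, ny = int((x_rng[1] - x_rng[0]) / cell), int((y_rng[1] - y_rng[0]) / cell)
ix = torch.floor((coords[:, 0] - x_rng[0]) / cell).long()
iy = torch.floor((coords[:, 1] - y_rng[0]) / cell).long()
valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)     # drop points falling outside the BEV grids
bev_feat = torch.zeros(C, ny, nx)
bev_feat[:, iy[valid], ix[valid]] = feats[valid].t()      # place effective foreground features into the BEV feature map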
9. The BEV space target detection method for a multi-view perspective image according to claim 8, wherein detecting the position and category information of the target to be detected according to the BEV feature map comprises:
inputting the BEV feature map to a BEV encoder in the BEV target detector, so that the BEV encoder encodes the BEV feature map to generate a BEV encoded feature map;
inputting the BEV encoded feature map to a BEV decoder in the BEV target detector, so that the BEV decoder decodes the BEV encoded feature map to generate a BEV decoded feature map;
and generating the position and category information of the target to be detected according to the BEV decoded feature map.
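Claim 9's encoding, decoding and detection over the BEV feature map could take the form of a small convolutional encoder, an up-sampling decoder and a head predicting per-cell class scores and box parameters; the layer sizes and the 10-class, 7-parameter output split are assumptions.

import torch.nn as nn

bev_encoder = nn.Sequential(nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU())   # BEV encoder
bev_decoder = nn.Sequential(nn.ConvTranspose2d(128, 64, 2, stride=2), nn.ReLU())     # BEV decoder
det_head = nn.Conv2d(64, 10 + 7, 1)   # e.g. 10 class scores + 7 box parameters per BEV cell (assumed)
# usage: predictions = det_head(bev_decoder(bev_encoder(bev_feat)))  with bev_feat of shape (B, 64, ny, nx)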
10. A BEV space target detection device for a multi-view perspective image, comprising: a multi-view perspective image acquisition module and a target recognition module; the target recognition module comprises: a perspective image feature map generating unit, a depth feature map generating unit, a foreground-background mask feature map generating unit, a binarized foreground-background mask feature map generating unit, a BEV mapping coordinate generating unit, a BEV feature map generating unit and a detection unit;
the multi-view perspective image acquisition module is used for acquiring multi-view perspective images of the target to be detected;
the target recognition module is used for inputting the multi-view perspective images into a preset BEV target detector, so that the BEV target detector identifies the position and category information of the target to be detected;
the perspective image feature map generating unit is used for performing feature extraction on the multi-view perspective images through an image feature extractor in the BEV target detector to generate a perspective image feature map;
the depth feature map generating unit is used for inputting the perspective image feature map into a depth feature extractor in the BEV target detector, so that the depth feature extractor extracts depth features of all pixel points in the perspective image feature map to generate a depth feature map;
the foreground-background mask feature map generating unit is used for inputting the perspective image feature map to a foreground-background mask extractor in the BEV target detector, so that the foreground-background mask extractor extracts mask features of foreground pixel points and background pixel points in the perspective image feature map to generate a foreground-background mask feature map;
the binarized foreground-background mask feature map generating unit is used for generating, according to the foreground-background mask feature map, a binarized foreground-background mask feature map for distinguishing foreground pixel points from background pixel points;
the BEV mapping coordinate generating unit is used for performing topK processing on each depth category corresponding to each pixel point in the depth feature map to obtain a topK probability feature map and a topK index feature map; computing a probability-weighted feature map from the topK probability feature map and the perspective image feature map; computing a topK depth feature map from the topK index feature map; and generating BEV mapping coordinates of each pixel point in a BEV coordinate system from the depth value of each pixel point in the topK depth feature map and the pixel coordinates of each pixel point;
the BEV feature map generating unit is used for extracting, according to the positions of the foreground pixel points in the binarized foreground-background mask feature map, the BEV mapping coordinates of the foreground pixel points from all BEV mapping coordinates and the probability-weighted features of the foreground pixel points from the probability-weighted feature map, and then performing foreground feature placement to generate a BEV feature map;
the detection unit is used for detecting the position and category information of the target to be detected according to the BEV feature map.
CN202310844740.2A 2023-07-11 2023-07-11 BEV space target detection method and device for multi-view perspective image Pending CN116895059A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310844740.2A CN116895059A (en) 2023-07-11 2023-07-11 BEV space target detection method and device for multi-view perspective image

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310844740.2A CN116895059A (en) 2023-07-11 2023-07-11 BEV space target detection method and device for multi-view perspective image

Publications (1)

Publication Number Publication Date
CN116895059A true CN116895059A (en) 2023-10-17

Family

ID=88310270

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310844740.2A Pending CN116895059A (en) 2023-07-11 2023-07-11 BEV space target detection method and device for multi-view perspective image

Country Status (1)

Country Link
CN (1) CN116895059A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117274749A (en) * 2023-11-22 2023-12-22 电子科技大学 Fused 3D target detection method based on 4D millimeter wave radar and image
CN117274749B (en) * 2023-11-22 2024-01-23 电子科技大学 Fused 3D target detection method based on 4D millimeter wave radar and image


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination