CN116466320A - Target detection method and device - Google Patents

Target detection method and device

Info

Publication number
CN116466320A
Authority
CN
China
Prior art keywords
information
feature
column
characteristic information
attention mechanism
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310297687.9A
Other languages
Chinese (zh)
Inventor
Wang Chunwei (王春微)
Ye Chaoqiang (叶超强)
Xu Hang (徐航)
Zeng Yihan (曾艺涵)
Zhang Wei (张维)
Zhang Li (张力)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fudan University
Huawei Technologies Co Ltd
Original Assignee
Fudan University
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fudan University and Huawei Technologies Co Ltd
Priority to CN202310297687.9A
Publication of CN116466320A
Legal status: Pending


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/24: Aligning, centring, orientation detection or correction of the image
    • G06V 10/245: Aligning, centring, orientation detection or correction of the image by locating a pattern; special marks for positioning
    • G06V 10/26: Segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06V 10/40: Extraction of image or video features
    • G06V 10/44: Local feature extraction by analysis of parts of the pattern, e.g. by detecting edges, contours, loops, corners, strokes or intersections; connectivity analysis, e.g. of connected components
    • G06V 10/70: Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/766: Using regression, e.g. by projecting features on hyperplanes
    • G06V 10/82: Using neural networks
    • G06V 20/00: Scenes; scene-specific elements
    • G06V 20/50: Context or environment of the image
    • G06V 20/56: Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/58: Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A: TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 90/00: Technologies having an indirect contribution to adaptation to climate change
    • Y02A 90/10: Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Abstract

The application relates to a target detection method and device. The method includes: voxelizing point cloud data acquired by a lidar in polar coordinates to obtain processed data; performing feature extraction on the processed data to obtain a first 2D feature map; aligning the first 2D feature map based on global information interaction to obtain a second 2D feature map; performing 2D feature extraction on the second 2D feature map to obtain first 2D feature information; adjusting and aggregating the first 2D feature information based on its geometric information and instance-level information to obtain second 2D feature information; and performing target detection based on the second 2D feature information to obtain a detection result. The method and device can perform interaction of global feature information while reducing the computation of the global interaction, and align the feature information. After feature extraction, geometric cues and object-level information can be introduced into the features, improving regression capability.

Description

Target detection method and device
Technical Field
The present application relates to the field of unmanned driving and assisted driving, and in particular to a target detection method and device.
Background
With economic development, the number of automobiles worldwide keeps growing, and the rate of traffic accidents has risen with it, posing a serious threat to people's lives and property. Human error is the main cause of traffic accidents, so reducing human error is an important topic for improving driving safety. Accordingly, Advanced Driving Assistance Systems (ADAS) and Autonomous Driving Systems (ADS) are attracting attention from companies worldwide, and related enterprises are investing heavily in developing and deploying these technologies. How to provide a more accurate target detection method is a technical problem to be solved.
Disclosure of Invention
In view of this, a target detection method and apparatus are provided.
In a first aspect, embodiments of the present application provide a target detection method, the method including:
performing voxelization on point cloud data acquired by a lidar in polar coordinates to obtain processed data;
performing feature extraction on the processed data to obtain a first 2D feature map;
aligning the first 2D feature map based on global information interaction to obtain a second 2D feature map;
performing 2D feature extraction on the second 2D feature map to obtain first 2D feature information;
adjusting and aggregating the first 2D feature information based on geometric information and instance-level information of the first 2D feature information to obtain second 2D feature information;
and performing target detection based on the second 2D feature information to obtain a detection result.
According to the target detection method provided in the first aspect, the polar representation conforms to the characteristics of lidar scanning: the voxel density distribution of the processed data obtained after voxelization is more uniform, and nearby objects receive more attention, i.e. the detection performance for nearby objects degrades slowly as resolution decreases. The interaction of global feature information reduces the computation of global interaction while helping to mitigate the deformation problem and align the feature information. After feature extraction, the first 2D feature information is adjusted and aggregated based on its geometric information and instance-level information, so geometric cues and object-level information can be introduced into the features, improving regression capability.
In one possible implementation, performing voxelization on point cloud data acquired by a lidar in polar coordinates to obtain processed data includes:
performing 3D space division on the point cloud data by distance, angle and height to obtain processed data.
According to the first aspect or a possible implementation thereof, the method is suitable for streaming detection and conforms to the characteristics of lidar scanning; for example, the voxel density of the processed data obtained after voxelization may be distributed more uniformly, and nearby objects may receive more attention, i.e. the detection performance for nearby objects may degrade slowly when resolution is reduced.
In one possible implementation, aligning the first 2D feature map based on global information interaction to obtain a second 2D feature map includes:
determining a plurality of first key points of the column features of each column in the first 2D feature map based on neighbor non-maximum suppression;
compressing the column features of each column into the feature information of the first key points of that column based on a first attention mechanism to obtain a plurality of second key points, wherein the first attention mechanism uses each first key point feature as a query and the column features of its column as keys and values;
dividing the second key points corresponding to the columns into a plurality of non-overlapping key point windows, and performing information interaction on the feature information of the second key points within each key point window based on a second attention mechanism to obtain a plurality of third key points, wherein the second attention mechanism uses the second key point features as queries, keys and values;
and diffusing the feature information of the third key points in each column back into the column features of that column based on a third attention mechanism to obtain the second 2D feature map.
According to the first aspect or a possible implementation thereof, selecting key points to represent the column features for the interaction of global feature information reduces the computation of global interaction. The information interaction helps to mitigate the deformation problem, while window shifting enlarges the receptive field, so that feature information at both short and long ranges is aligned as far as possible.
In one possible implementation, adjusting and aggregating the first 2D feature information based on geometric information and instance-level information of the first 2D feature information to obtain second 2D feature information includes:
predicting segmentation information of the first 2D feature information using a segmentation branch, and predicting a center offset of the first 2D feature information using a regression branch;
determining the geometric information corresponding to the first 2D feature information according to the segmentation information, the center offset and the position encoding corresponding to the first 2D feature information;
dividing the geometric information and the first 2D feature information into a plurality of non-overlapping information windows, and performing information interaction between the geometric information and the first 2D feature information within each information window based on a fourth attention mechanism to obtain the second 2D feature information, wherein the fourth attention mechanism is a self-attention mechanism.
According to the first aspect or a possible implementation thereof, adjusting and aggregating the first 2D feature information based on its geometric information and instance-level information introduces geometric cues and object-level information into the features, improving regression capability.
In a second aspect, embodiments of the present application provide a target detection apparatus, the apparatus comprising:
a global feature alignment module, configured to perform voxelization on point cloud data acquired by a lidar in polar coordinates to obtain processed data, perform feature extraction on the processed data to obtain a first 2D feature map, and align the first 2D feature map based on global information interaction to obtain a second 2D feature map;
a 2D feature extraction module, configured to perform 2D feature extraction on the second 2D feature map to obtain first 2D feature information;
a geometry-aware detection head, configured to adjust and aggregate the first 2D feature information based on geometric information and instance-level information of the first 2D feature information to obtain second 2D feature information;
and a target detection module, configured to perform target detection based on the second 2D feature information to obtain a detection result.
In one possible implementation, performing voxelization on point cloud data acquired by a lidar in polar coordinates to obtain processed data includes:
performing 3D space division on the point cloud data by distance, angle and height to obtain processed data.
In one possible implementation, aligning the first 2D feature map based on global information interaction to obtain a second 2D feature map includes:
determining a plurality of first key points of the column features of each column in the first 2D feature map based on neighbor non-maximum suppression;
compressing the column features of each column into the feature information of the first key points of that column based on a first attention mechanism to obtain a plurality of second key points, wherein the first attention mechanism uses each first key point feature as a query and the column features of its column as keys and values;
dividing the second key points corresponding to the columns into a plurality of non-overlapping key point windows, and performing information interaction on the feature information of the second key points within each key point window based on a second attention mechanism to obtain a plurality of third key points, wherein the second attention mechanism uses the second key point features as queries, keys and values;
and diffusing the feature information of the third key points in each column back into the column features of that column based on a third attention mechanism to obtain the second 2D feature map.
In one possible implementation, adjusting and aggregating the first 2D feature information based on geometric information and instance-level information of the first 2D feature information to obtain second 2D feature information includes:
predicting segmentation information of the first 2D feature information using a segmentation branch, and predicting a center offset of the first 2D feature information using a regression branch;
determining the geometric information corresponding to the first 2D feature information according to the segmentation information, the center offset and the position encoding corresponding to the first 2D feature information;
dividing the geometric information and the first 2D feature information into a plurality of non-overlapping information windows, and performing information interaction between the geometric information and the first 2D feature information within each information window based on a fourth attention mechanism to obtain the second 2D feature information, wherein the fourth attention mechanism is a self-attention mechanism.
The advantages of the target detection apparatus provided in the second aspect and its possible implementations are the same as those of the target detection method provided in the first aspect and its possible implementations, and are not repeated here.
In a third aspect, embodiments of the present application provide a target detection apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to, when executing the instructions, implement the target detection method of the first aspect or of one or more of its possible implementations.
In a fourth aspect, embodiments of the present application provide a non-transitory computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the target detection method of the first aspect or of one or more of its possible implementations.
In a fifth aspect, embodiments of the present application provide a computer program product comprising computer-readable code, or a non-transitory computer-readable storage medium carrying computer-readable code, which, when run on an electronic device, causes a processor in the electronic device to perform the target detection method of the first aspect or of one or more of its possible implementations.
These and other aspects of the application will be apparent from and elucidated with reference to the embodiment(s) described hereinafter.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate exemplary embodiments, features and aspects of the present application and together with the description, serve to explain the principles of the present application.
Fig. 1 shows a schematic diagram of a related-art unmanned driving system.
Fig. 2 shows a schematic diagram of a model framework provided by the first related art.
Fig. 3 shows a graph of the performance of the first related art at different resolutions.
Fig. 4 shows a schematic diagram of a model framework provided by the second related art.
Fig. 5a shows a schematic diagram of the point cloud distribution after voxelization based on a Cartesian coordinate system.
Fig. 5b shows a schematic diagram of the distribution of the point cloud after voxelization based on the polar coordinate system.
Fig. 6a shows a schematic diagram of a full scan lidar delay.
Fig. 6b shows a schematic diagram of a quarter-scan lidar delay.
Fig. 7a shows a schematic diagram of streaming detection based on a rectangular coordinate system.
Fig. 7b shows a schematic diagram of streaming detection based on a polar coordinate system.
Fig. 8a shows a schematic diagram of the feature deformation caused by polar-coordinate voxelization.
Fig. 8b shows another schematic diagram of the feature deformation caused by polar-coordinate voxelization.
Fig. 9a shows a flow chart of a target detection method according to an embodiment of the present application.
Fig. 9b shows a flow diagram of a target detection method according to an embodiment of the present application.
Fig. 10 shows a flowchart of step S103 in the target detection method according to an embodiment of the present application.
Fig. 11 shows a flowchart of step S105 in the target detection method according to an embodiment of the present application.
Fig. 12a shows a schematic diagram of detection performance of the target detection method at different resolutions according to an embodiment of the present application.
Fig. 12b shows a schematic diagram of the detection performance of the target detection method at different resolutions according to an embodiment of the present application.
Fig. 13 shows a schematic structural diagram of an electronic device 100.
Detailed Description
Various exemplary embodiments, features and aspects of the present application will be described in detail below with reference to the accompanying drawings. In the drawings, like reference numbers indicate identical or functionally similar elements. Although various aspects of the embodiments are illustrated in the accompanying drawings, the drawings are not necessarily drawn to scale unless specifically indicated.
The word "exemplary" is used herein to mean "serving as an example, embodiment, or illustration." Any embodiment described herein as "exemplary" is not necessarily to be construed as preferred or advantageous over other embodiments.
In addition, numerous specific details are set forth in the following detailed description in order to provide a better understanding of the present application. It will be understood by those skilled in the art that the present application may be practiced without some of these specific details. In some instances, methods, means, elements, and circuits have not been described in detail as not to unnecessarily obscure the present application.
Fig. 1 shows a schematic diagram of a related-art unmanned driving system. As shown in fig. 1, an unmanned or assisted driving system mainly comprises five components: sensing data acquisition, target detection, target tracking, multi-sensor fusion, and planning and control.
The sensing data acquisition may include collecting driving-environment data with a variety of devices such as cameras, lidars, ibeo sensors, etc. Lidar scanning typically collects returns from the external environment at 10 FPS, while cameras typically capture external scene information at 25 or 30 FPS. Target detection may include detection of dynamic obstacles (pedestrians, vehicles) as well as static targets (traffic signs, lane lines, free space, etc.). Target tracking can smooth the detection results, measure speed, and predict the motion trajectory of a target. Multi-sensor fusion lets each sensor play its role, so that the fused result is superior to that of any single sensor. Planning and control uses the fused obstacle information output by the multiple sensors to perform reasonable path planning and control of the ego vehicle's driving state; this component, which decides how the ego vehicle should proceed, is the control center of the unmanned vehicle.
The target detection method provided in the present application can be applied to the perception data acquisition part of autonomous or assisted driving products, and can also be applied to target detection in other scenarios; the application is not limited in this respect. The problems of the first and second related arts, and the target detection method and apparatus provided by the present application, are schematically described below, taking autonomous driving as an example.
3D object detection is an important component of the perception system in autonomous driving. With falling costs, lidar sensors are now widely used in autonomous vehicles to obtain accurate positioning information. Unlike the 2D images acquired by camera sensors, the point cloud obtained by lidar scanning is unordered and sparse, so selecting an effective representation for processing point cloud data is one of the key problems in point-cloud 3D target detection.
Related art one:
fig. 2 shows a schematic diagram of a model framework provided by the related art. As shown in fig. 2, in the first related art, a model frame based on center point is provided, where a 3D point cloud is first subjected to voxel processing under a cartesian coordinate system, then subjected to feature extraction through a 3D backhaul (backbone network), the 3D feature is converted into a bird's eye view angle, and the 3D frame is predicted through 2D feature extraction and a detection head. Fig. 3 shows a graph of the performance of the related art at different resolutions. As shown in fig. 3, the related art has the following problems: the point cloud data is represented by a Cartesian coordinate system, voxel density distribution after voxelization is uneven, the size of the voxels can be increased, namely, when resolution is reduced, the detection performance shows an exponential decline trend.
Related art two:
fig. 4 shows a schematic diagram of a model framework provided by the second related art. As shown in fig. 4, in the second related art, a model framework based on the polar stream method is provided, and a stream detection scheme and a polar coordinate system are adopted. In order to solve the problems of the Cartesian coordinate system, the polar coordinate system is adopted to voxelate the 3D point cloud in the related technology II, the polar coordinate system can effectively relieve the problems under the Cartesian coordinate system, the polar coordinate system divides the 3D space according to the distance and the angle, the characteristics of a laser radar scanning environment are more met, the density distribution of the voxels is more uniform, the division mode focuses on the near object, the detection performance of the near object is slowly reduced when the resolution is reduced, and the characterization obtained based on the voxelization of the polar coordinate system is more suitable for a flow detection scheme.
Fig. 5a shows the point cloud distribution after voxelization in a Cartesian coordinate system, and fig. 5b shows the distribution after voxelization in a polar coordinate system. As shown in figs. 5a and 5b, because the point cloud is generally much denser nearby and sparser far away, uniform voxel division in Cartesian coordinates yields a less uniform voxel density distribution than polar division, and the detection performance of Cartesian voxelization degrades significantly when resolution is lowered. Polar voxelization therefore alleviates the significant performance drop that Cartesian voxelization suffers at reduced resolution.
Fig. 6a shows a schematic diagram of the latency of a full-sweep lidar, and fig. 6b shows the latency of a quarter-sweep lidar. Fig. 6a corresponds to non-streaming detection, where the lidar must complete a full revolution to produce one frame of point cloud data, so the system latency is long. Fig. 6b shows a streaming detection scheme, which recognizes that the point cloud acquired by a conventional rotating lidar is in fact streaming data (i.e. data arrives while scanning). The notion of a point cloud "frame" can therefore be dropped: the data is divided into a number of slices (in fig. 6b, the data from one quarter of a revolution), and as soon as each slice has been scanned, that portion is processed immediately for model inference. With this setting the system latency can be reduced significantly; ideally it is divided by the number of slices (e.g. at 10 FPS a full sweep takes 100 ms, so with four slices the ideal latency is about 25 ms). Fig. 7a shows streaming detection in a rectangular coordinate system and fig. 7b in a polar coordinate system. As shown in fig. 7a, because the data is divided into sectors, using an ordinary Cartesian grid wastes memory and computation on blank areas. As shown in fig. 7b, the polar coordinate system suits streaming detection with a rotating lidar and avoids wasting memory and computation on blank areas.
As shown in fig. 4, since the second related art represents the sliced point cloud data in polar coordinates, it proposes, for the deformation problem after polar voxelization, a strategy of interpolating Cartesian positions together with range-stratified convolution and normalization. The interpolation strategy is used in the classification head: the Cartesian positions corresponding to the feature map are encoded, and the 2D feature map undergoes a linear transformation. Range-stratified convolution and normalization process features at different distances with different convolution kernels. The second related art thus mitigates the feature deformation problem, but does not solve it well. The main problem of polar detectors remains feature deformation: because the voxel division is non-uniform, objects at different distances deform differently. Figs. 8a and 8b show schematic diagrams of the feature deformation caused by polar voxelization. As shown in figs. 8a and 8b, the same object deforms differently at different distances and orientations and occupies different numbers of pixels, which makes such non-rectangular features difficult for a translation-invariant CNN to handle, causing misaligned global features and difficult regression. The second related art has the following problems: the interpolation strategy applies a linear transformation to the 2D feature map, whereas the feature deformation is a nonlinear transformation, so the deformation problem cannot be fully solved; moreover, processing features at different distances with different convolution kernels is somewhat cumbersome, and features at the boundaries may be discontinuous.
To solve the above technical problems, the present application provides a target detection method and apparatus. The polar coordinate system conforms to the characteristics of lidar scanning: the voxel density distribution of the processed data obtained after voxelization is more uniform, and nearby objects receive more attention, i.e. the detection performance for nearby objects degrades slowly when resolution is reduced. Selecting key points to represent the column features for the interaction of global feature information reduces the computation of global interaction, and the information interaction helps mitigate the deformation problem and align the feature information. After feature extraction, the first 2D feature information is adjusted and aggregated based on its geometric information and instance-level information, so geometric cues and object-level information can be introduced into the features, improving regression capability.
Fig. 9a shows a flow chart of a target detection method according to an embodiment of the present application. Fig. 9b shows a flow diagram of a target detection method according to an embodiment of the present application. As shown in fig. 9a and 9b, the target detection method provided in the present application includes steps S101 to S106, and may be applied to an electronic device.
In step S101, voxelization is performed in polar coordinates on the point cloud data acquired by the lidar to obtain processed data.
In one possible implementation, step S101 may include: performing 3D space division on the point cloud data by distance, angle and height to obtain the processed data.
The point cloud data may be acquired by a lidar. It can be the data of one full lidar revolution, i.e. one frame, or the slice-shaped point cloud data used in streaming detection. Streaming detection does not wait for the lidar to complete a revolution and produce a full frame; the scanned point cloud data can be processed while scanning continues. The point cloud data is then voxelized in a polar coordinate system to obtain the processed data; for example, the cylindrical 3D point cloud space (i.e. the point cloud data) may be divided into a plurality of polar columns (i.e. the processed data) as shown in fig. 4. Voxelizing the point cloud data in polar coordinates suits streaming detection and conforms to the characteristics of lidar scanning: the voxel density of the processed data is distributed more uniformly, and nearby objects receive more attention, i.e. the detection performance for nearby objects degrades slowly when resolution is reduced.
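As an illustration, the following is a minimal sketch of such polar voxelization (assuming numpy; the bin counts, range limits and the function name polar_voxelize are assumptions for illustration, not the implementation of the present application):

    import numpy as np

    def polar_voxelize(points, r_max=75.0, r_bins=512, a_bins=512,
                       z_min=-3.0, z_max=3.0, z_bins=32):
        """points: (N, 3) array of x, y, z lidar returns. Returns integer voxel indices."""
        x, y, z = points[:, 0], points[:, 1], points[:, 2]
        r = np.sqrt(x ** 2 + y ** 2)                       # distance from the sensor
        a = np.arctan2(y, x)                               # azimuth angle in (-pi, pi]
        keep = (r < r_max) & (z >= z_min) & (z < z_max)    # drop points outside the grid
        r_idx = (r[keep] / r_max * r_bins).astype(int)     # distance bin
        a_idx = ((a[keep] + np.pi) / (2 * np.pi) * a_bins).astype(int) % a_bins  # angle bin
        z_idx = ((z[keep] - z_min) / (z_max - z_min) * z_bins).astype(int)       # height bin
        return np.stack([r_idx, a_idx, z_idx], axis=1)

    pts = np.random.randn(1000, 3) * 20                    # toy point cloud
    print(polar_voxelize(pts).shape)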
In step S102, feature extraction is performed on the processed data, so as to obtain a first 2D feature map.
3D feature extraction may be performed on the processed data obtained in step S101 through 3D sparse convolution to obtain 3D point cloud features. The height dimension of the 3D point cloud features (i.e. the height of the polar columns) may then be merged with the feature dimension, converting the 3D point cloud features into a first 2D feature map from a bird's-eye view. The 3D point cloud features may deform during this conversion, and the deformation can affect the accuracy of 2D feature extraction; therefore, after the conversion, information interaction may be added before 2D feature extraction to align the global features and mitigate the deformation problem.
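A small sketch of this height-to-channel merge under assumed tensor shapes (the layout and sizes are illustrative assumptions):

    import torch

    feat_3d = torch.randn(1, 32, 4, 128, 128)    # (batch, channel, height, range, azimuth), assumed layout
    b, c, z, h, w = feat_3d.shape
    bev = feat_3d.reshape(b, c * z, h, w)        # fold height into channels: the BEV first 2D feature map
    print(bev.shape)                             # torch.Size([1, 128, 128, 128])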
In step S103, the first 2D feature map is aligned based on global information interaction, and a second 2D feature map is obtained.
Fig. 10 shows a flowchart of step S103 in the target detection method according to an embodiment of the present application. As shown in fig. 10, step S103 may include an information compression step, an information interaction step and an information diffusion step, which are described below with reference to fig. 10.
In the "information compression step", a plurality of first keypoints of column features of columns in the first 2D feature map may be determined based on neighbor non-maximum suppression.
The plurality of key points with the largest feature value in each column of the first 2D feature map may be determined according to the column information of each column of the first 2D feature map. The method comprises the steps of selecting a first key point with a maximum characteristic value, wherein the first key point with the maximum characteristic value can be selected firstly by performing neighbor non-maximum value inhibition operation along the radial direction, but the first key points adjacent to the selected first key point are not selected later, namely, the first key points with the maximum characteristic values selected last are not adjacent to each other, so that the first key points are prevented from being selected on different targets, and the first key points are ensured to be selected on different targets.
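A minimal sketch of such neighbor non-maximum suppression along one column follows (the suppression radius of one cell, the sizes and the function name select_keypoints are illustrative assumptions):

    import numpy as np

    def select_keypoints(column_scores, k=2):
        """column_scores: (A,) feature responses along the radial direction of one column."""
        scores = column_scores.astype(float).copy()
        chosen = []
        for _ in range(k):
            i = int(np.argmax(scores))
            if scores[i] == -np.inf:
                break                                   # nothing selectable remains
            chosen.append(i)
            lo, hi = max(i - 1, 0), min(i + 2, len(scores))
            scores[lo:hi] = -np.inf                     # suppress the point and its neighbours
        return chosen

    print(select_keypoints(np.array([0.1, 0.9, 0.8, 0.2, 0.7])))   # [1, 4]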
As shown in fig. 10, in the "information interaction step", the column features of each column may be compressed into the feature information of the first key points of that column based on the first attention mechanism, yielding a plurality of second key points. In the first attention mechanism, each first key point feature serves as the query, and the column features of its column serve as the keys and values.
The column features of each column may be determined from the column information of that column of the first 2D feature map. For each column, the first attention mechanism concentrates and merges the column features onto the first key points to obtain a plurality of second key points; the feature information of each second key point comprises the feature information of the corresponding first key point together with the column-feature information merged into it. The number of first key points selected per column may be set according to actual needs, which is not limited in this application. For example, as shown in fig. 10, the two first key points with the largest feature values may be selected for each column, and the column features of each column are then merged onto those two first key points.
Assume f_i1 ∈ R^(N×C) is the feature information of the first key points in the i-th column and f'_i ∈ R^(A×C) is the column feature of the i-th column. The first attention mechanism may be expressed as:

    Q_i1 = f_i1 W_q1,  K_i1 = f'_i W_k1,  V_i1 = f'_i W_v1

    f_i2 = Softmax(Q_i1 K_i1^T / √d + E(p)) V_i1

where Q_i1, K_i1 and V_i1 are the query, keys and values for the i-th column; W_q1, W_k1 and W_v1 are the linear mappings for the query, key and value respectively; √d is the scaling by the feature dimension, so Q_i1 K_i1^T / √d is the normalized similarity between the query and the keys; and f_i2 is the feature information of the second key points in the i-th column. Q_i1, K_i1 and V_i1 can be computed from W_q1, W_k1, W_v1, f_i1 and f'_i. E(p) is a relative position encoding, which can be expressed as:

    E(p) = ReLU((p_i1 - p'_i) × W_pos)

where p_i1 is the coordinate position, in the Cartesian coordinate system, of the first key points of the i-th column; p'_i is the coordinate position, in the polar coordinate system, of the column features of the i-th column; and W_pos is a linear transformation of the coordinate positions. ReLU is an activation function commonly used in artificial neural networks (often called a ramp function in mathematics), which outputs a nonlinear result after the linear transformation between layers. Through the first attention mechanism, each first key point attends to the column features of its column, merging those column features into the key point, so that each second key point f_i2 contains the column features of its corresponding column.
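For illustration, the following is a sketch of this compression under assumed dimensions; the relative position encoding E(p) is simplified to a zero bias, and all sizes and names are assumptions rather than the implementation of the present application:

    import torch
    import torch.nn as nn

    d = 32                                    # feature dimension C (assumed)
    N, A = 2, 16                              # key points per column, cells per column (assumed)
    Wq, Wk, Wv = (nn.Linear(d, d, bias=False) for _ in range(3))

    f_kp = torch.randn(N, d)                  # first key point features f_i1
    f_col = torch.randn(A, d)                 # column features f'_i
    pos_bias = torch.zeros(N, A)              # stands in for the encoding E(p)

    Q, K, V = Wq(f_kp), Wk(f_col), Wv(f_col)
    attn = torch.softmax(Q @ K.T / d ** 0.5 + pos_bias, dim=-1)   # (N, A)
    f_kp2 = attn @ V                          # second key points f_i2
    print(f_kp2.shape)                        # torch.Size([2, 32])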
As shown in fig. 10, in the "information interaction step", the second key points corresponding to the columns may be divided into a plurality of non-overlapping key point windows, and the feature information of the second key points within each key point window undergoes information interaction based on a second attention mechanism, yielding a plurality of third key points. The second attention mechanism uses the second key point features as the queries, keys and values.
A plurality of second key points corresponding to the point cloud data are obtained from the preceding step. These second key points are divided into non-overlapping key point windows of a preset size, and a second attention mechanism performs feature-information interaction among the second key points within each window. The second attention mechanism may be a self-attention mechanism, which can determine the correlations between different parts of the whole input. As can be seen from fig. 5, the feature deformation of the polar coordinate system is larger along the azimuth direction, so this information interaction is applied along azimuth to introduce a sufficient receptive field, which facilitates aligning the features.
Assume f_i2 ∈ R^(N×C) is the feature information of the second key points in the i-th column and f'_i ∈ R^(A×C) is the column feature of the i-th column. The second attention mechanism may be expressed as:

    Q_i2 = f_i2 W_q2,  K_i2 = f'_i W_k2,  V_i2 = f'_i W_v2

    f_i3 = Softmax(Q_i2 K_i2^T / √d + E(p)) V_i2

where Q_i2, K_i2 and V_i2 are the query, keys and values for the i-th column; W_q2, W_k2 and W_v2 are the corresponding linear mappings for the query, key and value; √d again denotes the scaling by the feature dimension, so Q_i2 K_i2^T / √d is the normalized similarity between the query and the keys; and f_i3 is the feature information of the third key points in the i-th column. Q_i2, K_i2 and V_i2 can be computed from W_q2, W_k2, W_v2, f_i2 and f'_i. E(p) is a relative position encoding, which can be expressed as:

    E(p) = ReLU((p_i2 - p'_i) × W_pos)

where p_i2 is the coordinate position, in the Cartesian coordinate system, of the second key points of the i-th column; p'_i is the coordinate position, in the polar coordinate system, of the column features of the i-th column; and W_pos is a linear transformation of the coordinate positions. Through the second attention mechanism, the second key points interact with one another's feature information, yielding third key points f_i3 that have interacted with the feature information of the other second key points.
In an embodiment of the present application, the number of times the "information interaction step" is executed may be set as required, and the key point window sizes used in different executions may be the same or different; this application is not limited in this respect. For example, the "information interaction step" may be performed twice: in the second execution, the third key points obtained in the first execution are treated as new second key points and divided into non-overlapping key point windows, where the partition may differ from that of the first execution so as to enlarge the receptive field; this is known as window shifting. The second attention mechanism is then applied within each new key point window, where the new second key points interact their feature information with one another, yielding third key points that have again interacted with the feature information of the other new second key points.
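A sketch of the windowed interaction with window shifting follows, under assumed sizes (the window size, the head count and the use of a standard multi-head attention layer are illustrative assumptions):

    import torch
    import torch.nn as nn

    d, n_kp, win = 32, 16, 4                      # dimensions and window size (assumed)
    attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)

    kp = torch.randn(1, n_kp, d)                  # second key points ordered along azimuth

    def window_self_attention(x, win, shift=0):
        x = torch.roll(x, shifts=-shift, dims=1)           # shifted-window variant
        wins = x.reshape(-1, win, x.shape[-1])             # (n_windows, win, d)
        out, _ = attn(wins, wins, wins)                    # query = key = value
        return torch.roll(out.reshape_as(x), shifts=shift, dims=1)

    kp = window_self_attention(kp, win)                    # first interaction
    kp = window_self_attention(kp, win, shift=win // 2)    # second interaction, shifted windows
    print(kp.shape)                                        # torch.Size([1, 16, 32])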
As shown in fig. 10, in the "information diffusion step", the feature information of the third key points in each column may be diffused back into the column features of that column based on the third attention mechanism, yielding the second 2D feature map.
The feature information in the key points may be diffused back into the corresponding columns. In an embodiment of the present application, a third attention mechanism diffuses the feature information of each third key point obtained in the preceding steps back into the corresponding column, yielding a second 2D feature map with globally aligned features.
In detail, whereas in the first attention mechanism the key point features serve as the queries and the column features serve as the keys and values, in the third attention mechanism the roles are swapped: the key point features serve as the keys and values, and the column features serve as the queries. The third attention mechanism is otherwise similar to the first, differing only in the values substituted into the formula, and is not repeated here. In this way, by computing over the third key points with the third attention mechanism, each column obtains feature information that has been diffused back and has interacted with the surrounding column features, forming the second 2D feature map. In the present application, selecting key points to represent the column features for the interaction of global feature information reduces the computation of global interaction; the information interaction along azimuth helps mitigate the deformation problem, while window shifting enlarges the receptive field, so that feature information at both short and long ranges is aligned as far as possible.
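A corresponding sketch of the diffusion, with the query and key/value roles swapped relative to the compression sketch above (the residual update and the sizes are assumptions):

    import torch
    import torch.nn as nn

    d, N, A = 32, 2, 16                           # assumed sizes as before
    Wq, Wk, Wv = (nn.Linear(d, d, bias=False) for _ in range(3))

    f_col = torch.randn(A, d)                     # column features: the queries here
    f_kp3 = torch.randn(N, d)                     # third key points: the keys and values here

    attn = torch.softmax(Wq(f_col) @ Wk(f_kp3).T / d ** 0.5, dim=-1)   # (A, N)
    f_col_aligned = f_col + attn @ Wv(f_kp3)      # residual update is an assumption of the sketch
    print(f_col_aligned.shape)                    # torch.Size([16, 32])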
In step S104, 2D feature extraction is performed on the second 2D feature map, so as to obtain first 2D feature information.
Since the second 2D feature map obtained through the global information interaction alignment mitigates the feature deformation of the first 2D feature map, more accurate 2D feature information can be obtained when 2D feature extraction is performed on it.
In step S105, the first 2D feature information is adjusted and aggregated based on the geometric information and the instance level information of the first 2D feature information, so as to obtain second 2D feature information.
Fig. 11 shows a flowchart of step S105 in the target detection method according to an embodiment of the present application. As shown in fig. 11, step S105 may include geometry-aware prediction and geometry-aware aggregation, which are described below with reference to fig. 11.
In the "geometric sense prediction", the partition information and the center offset of the first 2D feature information may be predicted.
After the second 2D feature map obtains the first 2D feature information through 2D feature extraction, the first 2D feature information may be input into the geometric sense detection head. The geometric perception detection head can adjust and aggregate the first 2D characteristic information based on the geometric information and the instance level information of the first 2D characteristic information so as to introduce geometric clues and object level information into the characteristics and improve regression capability. The geometric sense detection head may include auxiliary branches, which may include segmentation branches and regression branches. Wherein the segmentation branch can predict the segmentation information F in the second 2D feature map according to the first 2D feature information seg The segmentation information may include a class of each pixel. The regression branch can predict the center offset F of the second 2D feature map according to the first 2D feature information offset The center offset may include an offset of a center point of the object in the second 2D feature map. The two branches can be optimized by adopting a focal loss (a loss function for solving the problem of extremely unbalanced number of positive and negative samples in target detection) and a smooth L1 loss (a loss function for solving outlier gradient explosion) respectively.
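A sketch of the two auxiliary branches as 1×1 convolutions follows, under assumed channel counts (the branch structure is an illustrative assumption):

    import torch
    import torch.nn as nn

    c_in, n_classes = 64, 3                                 # assumed channel counts
    seg_branch = nn.Conv2d(c_in, n_classes, kernel_size=1)  # predicts F_seg: per-pixel class scores
    offset_branch = nn.Conv2d(c_in, 2, kernel_size=1)       # predicts F_offset: (dx, dy) to the centre

    feats = torch.randn(1, c_in, 128, 128)                  # first 2D feature information
    f_seg, f_offset = seg_branch(feats), offset_branch(feats)
    print(f_seg.shape, f_offset.shape)                      # (1, 3, 128, 128) (1, 2, 128, 128)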
In the "geometric sense aggregation", the segmentation information and the center offset obtained in the "geometric sense prediction" may be used as prediction information, and geometric information corresponding to the first 2D feature information may be determined according to the prediction information and the position code corresponding to the first 2D feature information. The position codes represent the relative positions of the individual pixels between the cartesian and polar coordinate systems.
Wherein the geometric information, i.e. the fusion partition information F, can be determined first seg Center offset F offset And position coding F pos As geometric information F geo Can be expressed as the following formula:
F geo =MLP(Cat([F seg ,F offset ,F pos ]))
the MLP is a multi-layer perceptron, which can be mutually connected by a plurality of perceptrons, has a plurality of layers, can be seen as a basic form of a neural network, and has strong fitting capability.
In one possible implementation, similar to the "information interaction step" of step S103, the geometric information and the first 2D feature information may be divided into a plurality of non-overlapping information windows of a preset size, and the geometric information and the first 2D feature information within each information window interact based on a fourth attention mechanism, yielding the second 2D feature information. The fourth attention mechanism may be a self-attention mechanism.
Specifically, the geometric information F_geo and the first 2D feature information F_neck are divided into non-overlapping information windows, and the fourth attention mechanism performs information interaction between the geometric information and the first 2D feature information of each window to obtain the second 2D feature information, which thus contains the information interacted from the geometric information.
The fourth attention mechanism may be a self-attention mechanism that performs information interaction within each information window. Assume F_neck^j is the first 2D feature information in the j-th information window and F_geo^j is the geometric information of the j-th window; for the j-th window, the fourth attention mechanism may be expressed as:

    Q_j = F_neck^j W_q,  K_j = F_geo^j W_k,  V_j = F_geo^j W_v

    F'_neck^j = Softmax(Q_j K_j^T / √d) V_j

where Q_j, K_j and V_j are the query, keys and values of the j-th window; W_q, W_k and W_v are the corresponding linear mappings for the query, key and value; and √d denotes the scaling by the feature dimension, so Q_j K_j^T / √d is the normalized similarity within the window. Through the fourth attention mechanism, the geometric information and the first 2D feature information of each information window are computed together, fusing the geometric information into the first 2D feature information to obtain the second 2D feature information F'_neck^j.
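A sketch of this windowed aggregation under assumed shapes follows; the residual connection and the single-head form are assumptions of the sketch:

    import torch
    import torch.nn as nn

    d, win, n_win = 64, 16, 4                       # feature dim, window size, windows (assumed)
    Wq, Wk, Wv = (nn.Linear(d, d, bias=False) for _ in range(3))

    f_neck = torch.randn(n_win, win, d)             # first 2D feature information per window
    f_geo = torch.randn(n_win, win, d)              # geometric information per window

    Q, K, V = Wq(f_neck), Wk(f_geo), Wv(f_geo)
    attn = torch.softmax(Q @ K.transpose(-2, -1) / d ** 0.5, dim=-1)   # (n_win, win, win)
    f_neck2 = f_neck + attn @ V                     # second 2D feature information
    print(f_neck2.shape)                            # torch.Size([4, 16, 64])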
Similar to the global information interaction alignment, the information interaction between the geometric information and the first 2D feature information within each information window may be repeated several times in the detection head, and the information window sizes may be the same or different across executions; this application is not limited in this respect. For example, the interaction may be performed twice: in the second execution, the geometric information and the 2D feature information obtained from the first interaction are divided into non-overlapping information windows that differ from those of the first interaction so as to enlarge the receptive field, and the fourth attention mechanism then continues fusing the geometric information into the 2D feature information of each window, yielding fully interacted second 2D feature information.
In step S106, target detection is performed based on the second 2D feature information, so as to obtain a detection result.
The second 2D feature information obtained after the information interaction may be input into the key-point-based detection head of the CenterPoint method; classification and regression results are obtained from the segmentation branch and the regression branch respectively.
In one embodiment of the present application, the method is compared with the CenterPoint and PolarStream methods on the public Waymo and ONCE datasets, as shown in tables 1 and 2 respectively. On the Waymo dataset, the method of the present application outperforms the Cartesian 3D detector CenterPoint. Because of feature deformation, directly converting CenterPoint to a polar coordinate system causes a marked performance drop; compared with the polar CenterPoint method and the PolarStream method, the method of the present application gains 4.93% and 4.02% L2 mAP respectively. Similar conclusions hold on the ONCE dataset: the method achieves a 2.43% mAP improvement over the Cartesian 3D detector CenterPoint, and also shows significant gains over the polar CenterPoint and PolarStream methods.
Table 1. 3D detection results on the Waymo dataset
Table 1 lists the 3D detection results on the Waymo dataset for four methods: the CenterPoint method in a Cartesian coordinate system (first row), the CenterPoint method in a polar coordinate system (second row), the PolarStream method, and the target detection method of the present application (PARTNER). The backbone input is the voxelization result, and the Coordinate column indicates the coordinate system used; the PolarStream method and the method of the present application use polar coordinates. Taking the first row as an example, CenterPoint voxelized in a Cartesian coordinate system achieves 75.58% mAP (the detection metric of the 3D target detection model) and 75.01% APH (the Waymo dataset evaluation metric) on LEVEL 1 vehicle detection (Vehicle LEVEL 1), and 67.00% mAP and 66.52% APH on LEVEL 2 vehicle detection (Vehicle LEVEL 2).
Table 2. 3D detection results on the ONCE dataset
Table 2 lists the 3D detection results on the ONCE dataset for four methods: the Cartesian CenterPoint method (CenterPoint-Cartesian, first row), the polar CenterPoint method (CenterPoint-Polar, second row), the PolarStream method, and the target detection method of the present application (PARTNER). The detected targets are vehicles, pedestrians and cyclists, reporting the overall detection accuracy as well as the accuracy for targets at 0-30 m, at 30-50 m, and beyond 50 m.
Fig. 12a and Fig. 12b are schematic diagrams of the detection performance of the target detection method at different resolutions according to an embodiment of the present application. They compare the CenterPoint method in the Cartesian and polar coordinate systems with the method provided herein across resolutions. The Cartesian detector's performance degrades exponentially as resolution decreases, while the polar detector's degrades only linearly, because the polar grid partitions the near field more finely, so performance on nearby objects holds up better when resolution is reduced. Conversely, distortion in the polar coordinate system degrades detector performance at high resolution, whereas the method provided by the present application maintains relatively optimal performance throughout. This fully demonstrates the advantage of the polar detector at low resolution.
In addition, the polar coordinate system provides a more suitable representation for streaming detection. As shown in Table 3, under the streaming detection scheme, the method outperforms the PolarStream method for every number of sector divisions, and detection performance after slicing even shows a certain gain.
TABLE 3 Performance comparison of polar-coordinate methods under a streaming detection scheme
As can be seen, the global information interaction alignment (i.e., step S103) alleviates the feature-deformation problem that follows polar-coordinate voxelization and yields a better 2D feature map representation. As shown in Table 4, adding global information interaction alignment to the polar detector baseline brings a performance gain of 2.27% L2 mAPH. Adjusting and aggregating the 2D feature information in the detection head based on geometric information and instance-level information improves regression capability; as Table 4 also shows, adopting the geometry-aware detection head contributes a further gain of 2.85% L2 mAPH.
TABLE 4 Gain of each module design on detection performance
Embodiments of the present application provide an object detection apparatus including a global feature alignment module, a 2D feature extraction module, a geometry-aware detection head, and a target detection module.
The global feature alignment module is configured to voxelize, according to polar coordinates, the point cloud data acquired by the laser radar to obtain processed data; extract features from the processed data to obtain a first 2D feature map; and align the first 2D feature map based on global information interaction to obtain a second 2D feature map.
The 2D feature extraction module is configured to perform 2D feature extraction on the second 2D feature map to obtain first 2D feature information.
The geometry-aware detection head is configured to adjust and aggregate the first 2D feature information based on its geometric information and instance-level information to obtain second 2D feature information.
The target detection module is configured to perform target detection based on the second 2D feature information to obtain a detection result.
In one possible implementation, the voxelizing the point cloud data acquired by the laser radar according to polar coordinates to obtain processed data includes:
performing 3D spatial division of the point cloud data by distance, angle, and height to obtain the processed data (a minimal sketch of this division follows).
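A minimal NumPy sketch of this distance-angle-height division is given below; the bin counts and clipping ranges are placeholder assumptions, not values from the patent.

import numpy as np

def polar_voxel_indices(points, r_max=75.0, z_min=-3.0, z_max=3.0,
                        r_bins=512, a_bins=512, z_bins=32):
    # points: (N, 3) Cartesian xyz from the lidar. Convert each point to
    # (range, azimuth, height) and bucket it into a regular polar grid.
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    r = np.hypot(x, y)
    a = np.arctan2(y, x)  # azimuth in [-pi, pi]
    keep = (r < r_max) & (z >= z_min) & (z < z_max)
    r_idx = np.floor(r[keep] / r_max * r_bins).astype(np.int64)
    a_idx = np.clip(np.floor((a[keep] + np.pi) / (2 * np.pi) * a_bins),
                    0, a_bins - 1).astype(np.int64)
    z_idx = np.floor((z[keep] - z_min) / (z_max - z_min) * z_bins).astype(np.int64)
    return np.stack([r_idx, a_idx, z_idx], axis=1)

pts = np.random.uniform(-50.0, 50.0, size=(1000, 3))
voxels = polar_voxel_indices(pts)  # one (r, a, z) index triple per kept point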
In one possible implementation, the aligning the first 2D feature map based on global information interaction to obtain a second 2D feature map includes (a sketch of the three attention stages follows this list):
determining a plurality of first keypoints from the column features of each column in the first 2D feature map based on neighbor non-maximum suppression;
compressing, based on a first attention mechanism, the column features of each column into the feature information of each of that column's first keypoints to obtain a plurality of second keypoints, wherein the first attention mechanism uses each first keypoint feature as the query and the column's column features as the keys and values;
dividing the second keypoints corresponding to the columns into a plurality of non-overlapping keypoint windows, and performing information interaction among the feature information of the second keypoints within each keypoint window based on a second attention mechanism to obtain a plurality of third keypoints, wherein the second attention mechanism uses the second keypoint features as the queries, keys, and values;
and diffusing, based on a third attention mechanism, the feature information of the third keypoints in each column back into that column's column features to obtain the second 2D feature map.
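The PyTorch sketch below condenses the three attention stages into one module operating on a single column, treating that column's keypoints as one window; the shapes, module names, and this single-window simplification are assumptions for illustration, not the patent's implementation.

import torch
import torch.nn as nn

class ColumnAlignment(nn.Module):
    # compress: keypoints query the whole column (first attention mechanism);
    # interact: keypoints attend to one another (second attention mechanism);
    # diffuse: column positions query the updated keypoints (third mechanism).
    def __init__(self, dim, heads=4):
        super().__init__()
        self.compress = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.interact = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.diffuse = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, col, kp):
        # col: (B, L, C) one column's features; kp: (B, K, C) its keypoints
        kp, _ = self.compress(kp, col, col)  # column features -> keypoints
        kp, _ = self.interact(kp, kp, kp)    # keypoint-to-keypoint exchange
        out, _ = self.diffuse(col, kp, kp)   # keypoints -> column features
        return out

align = ColumnAlignment(dim=128)
aligned_col = align(torch.randn(2, 64, 128), torch.randn(2, 4, 128))

In the method itself, the interact stage runs over keypoints from several columns grouped into non-overlapping windows; collapsing it to one column here only keeps the sketch short.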
In one possible implementation, the adjusting and aggregating the first 2D feature information based on geometric information and instance-level information of the first 2D feature information to obtain second 2D feature information includes (see the sketch after this list):
predicting segmentation information for the first 2D feature information using a segmentation branch, and predicting a center offset for the first 2D feature information using a regression branch;
determining the geometric information corresponding to the first 2D feature information from the segmentation information, the center offset, and the position code corresponding to the first 2D feature information;
and dividing the geometric information and the first 2D feature information into a plurality of non-overlapping information windows, and performing information interaction between the geometric information and the first 2D feature information within each information window based on a fourth attention mechanism to obtain the second 2D feature information, wherein the fourth attention mechanism is a self-attention mechanism.
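A sketch of this window-level self-attention over the combined geometric and 2D features follows; the concatenate-then-project fusion and all dimensions are assumptions, not the patent's design.

import torch
import torch.nn as nn

class GeoWindowSelfAttention(nn.Module):
    # Fuse geometric cues into the 2D features, split the result into
    # non-overlapping windows, and run self-attention inside each window
    # (the fourth attention mechanism is a self-attention mechanism).
    def __init__(self, feat_dim, geo_dim, heads=4):
        super().__init__()
        self.fuse = nn.Linear(feat_dim + geo_dim, feat_dim)
        self.attn = nn.MultiheadAttention(feat_dim, heads, batch_first=True)

    def forward(self, feat, geo, window):
        # feat: (B, L, Cf) first 2D feature information;
        # geo: (B, L, Cg) geometric information; L divisible by window.
        x = self.fuse(torch.cat([feat, geo], dim=-1))
        B, L, C = x.shape
        w = x.view(B * (L // window), window, C)  # non-overlapping windows
        out, _ = self.attn(w, w, w)               # per-window self-attention
        return out.reshape(B, L, C)

layer = GeoWindowSelfAttention(feat_dim=128, geo_dim=16)
second_feat = layer(torch.randn(2, 64, 128), torch.randn(2, 64, 16), window=8)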
For the implementation and the beneficial effects of each module and of the detection head of the target detection apparatus provided in this embodiment, refer to the description of the corresponding steps in the target detection method; to avoid redundancy, they are not repeated here.
The object detection device may be an electronic apparatus 100, and fig. 13 shows a schematic structural diagram of the electronic apparatus 100. The electronic device 100 may include at least one of a cell phone, a foldable electronic device, a tablet computer, a desktop computer, a laptop computer, a handheld computer, a notebook computer, an ultra-mobile personal computer (ultra-mobile personal computer, UMPC), a netbook, a personal digital assistant (personal digital assistant, PDA), an augmented reality (augmented reality, AR) device, a Virtual Reality (VR) device, an artificial intelligence (artificial intelligence, AI) device, a wearable device, a vehicle-mounted device, a smart home device, or a smart city device. The embodiment of the present application does not particularly limit the specific type of the electronic device 100.
The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) connector 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display 194, a lidar (not shown), a subscriber identity module (subscriber identification module, SIM) card interface 195, and the like. The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It is to be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
The processor can generate operation control signals according to the instruction opcode and timing signals, controlling instruction fetch and instruction execution, and thereby carry out the above target detection method once the point cloud data from the laser radar has been acquired.
An embodiment of the present application provides an object detection apparatus, including: a processor and a memory for storing processor-executable instructions; wherein the processor is configured to implement the above-described method when executing the instructions.
Embodiments of the present application provide a non-transitory computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the above-described method.
Embodiments of the present application provide a computer program product, including computer readable code or a non-transitory computer readable storage medium carrying computer readable code; when the computer readable code runs in a processor of an electronic device, the processor performs the above method.
The computer readable storage medium may be a tangible device that can hold and store instructions for use by an instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic storage device, a magnetic storage device, an optical storage device, an electromagnetic storage device, a semiconductor storage device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium include: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a static random access memory (SRAM), a portable compact disc read-only memory (CD-ROM), a digital versatile disc (DVD), a memory stick, a floppy disk, a mechanical coding device such as a punch card or an in-groove protrusion structure having instructions stored thereon, and any suitable combination of the foregoing.
The computer readable program instructions or code described herein may be downloaded from a computer readable storage medium to a respective computing/processing device or to an external computer or external storage device over a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmissions, wireless transmissions, routers, firewalls, switches, gateway computers and/or edge servers. The network interface card or network interface in each computing/processing device receives computer readable program instructions from the network and forwards the computer readable program instructions for storage in a computer readable storage medium in the respective computing/processing device.
Computer program instructions for carrying out operations of the present application may be assembly instructions, instruction set architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including object-oriented programming languages such as Smalltalk or C++ and conventional procedural programming languages such as the "C" language or similar languages. The computer readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the latter case, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present application are implemented by customizing electronic circuitry, such as programmable logic circuitry, field-programmable gate arrays (FPGA), or programmable logic arrays (PLA), with state information of the computer readable program instructions.
Various aspects of the present application are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer readable program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer readable program instructions may also be stored in a computer readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer readable medium having the instructions stored therein includes an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, systems, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by hardware (e.g., circuits or ASICs (Application Specific Integrated Circuit, application specific integrated circuits)) which perform the corresponding functions or acts, or combinations of hardware and software, such as firmware, etc.
Although the invention is described herein in connection with various embodiments, other variations to the disclosed embodiments can be understood and effected by those skilled in the art in practicing the claimed invention, from a study of the drawings, the disclosure, and the appended claims. In the claims, the word "comprising" does not exclude other elements or steps, and the "a" or "an" does not exclude a plurality. A single processor or other unit may fulfill the functions of several items recited in the claims. The mere fact that certain measures are recited in mutually different dependent claims does not indicate that a combination of these measures cannot be used to advantage.
The embodiments of the present application have been described above, the foregoing description is exemplary, not exhaustive, and not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the various embodiments described. The terminology used herein was chosen in order to best explain the principles of the embodiments, the practical application, or the improvement of technology in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.

Claims (11)

1. A method of target detection, the method comprising:
performing voxelization processing on point cloud data acquired by a laser radar according to polar coordinates to obtain processed data;
extracting features of the processed data to obtain a first 2D feature map;
aligning the first 2D feature map based on global information interaction to obtain a second 2D feature map;
performing 2D feature extraction on the second 2D feature map to obtain first 2D feature information;
performing adjustment and aggregation on the first 2D feature information based on geometric information and instance-level information of the first 2D feature information to obtain second 2D feature information;
and performing target detection based on the second 2D characteristic information to obtain a detection result.
2. The method of claim 1, wherein the voxelizing the point cloud data acquired by the laser radar according to polar coordinates to obtain processed data comprises:
performing 3D spatial division of the point cloud data by distance, angle, and height to obtain the processed data.
3. The method of claim 1, wherein the aligning the first 2D feature map based on global information interaction to obtain a second 2D feature map comprises:
determining a plurality of first keypoints from the column features of each column in the first 2D feature map based on neighbor non-maximum suppression;
compressing, based on a first attention mechanism, the column features of each column into the feature information of each of that column's first keypoints to obtain a plurality of second keypoints, wherein the first attention mechanism uses each first keypoint feature as the query and the column's column features as the keys and values;
dividing the second keypoints corresponding to the columns into a plurality of non-overlapping keypoint windows, and performing information interaction among the feature information of the second keypoints within each keypoint window based on a second attention mechanism to obtain a plurality of third keypoints, wherein the second attention mechanism uses the second keypoint features as the queries, keys, and values;
and diffusing, based on a third attention mechanism, the feature information of the third keypoints in each column back into that column's column features to obtain the second 2D feature map.
4. The method of claim 1, wherein the adjusting and aggregating the first 2D feature information based on geometric information and instance-level information of the first 2D feature information to obtain second 2D feature information comprises:
predicting segmentation information for the first 2D feature information using a segmentation branch, and predicting a center offset for the first 2D feature information using a regression branch;
determining the geometric information corresponding to the first 2D feature information from the segmentation information, the center offset, and the position code corresponding to the first 2D feature information;
and dividing the geometric information and the first 2D feature information into a plurality of non-overlapping information windows, and performing information interaction between the geometric information and the first 2D feature information within each information window based on a fourth attention mechanism to obtain the second 2D feature information, wherein the fourth attention mechanism is a self-attention mechanism.
5. An object detection device, the device comprising:
a global feature alignment module, configured to voxelize, according to polar coordinates, point cloud data acquired by a laser radar to obtain processed data; extract features from the processed data to obtain a first 2D feature map; and align the first 2D feature map based on global information interaction to obtain a second 2D feature map;
a 2D feature extraction module, configured to perform 2D feature extraction on the second 2D feature map to obtain first 2D feature information;
a geometry-aware detection head, configured to adjust and aggregate the first 2D feature information based on geometric information and instance-level information of the first 2D feature information to obtain second 2D feature information;
and a target detection module, configured to perform target detection based on the second 2D feature information to obtain a detection result.
6. The device of claim 5, wherein the voxelizing the point cloud data acquired by the laser radar according to polar coordinates to obtain processed data comprises:
performing 3D spatial division of the point cloud data by distance, angle, and height to obtain the processed data.
7. The device of claim 5, wherein the aligning the first 2D feature map based on global information interaction to obtain a second 2D feature map comprises:
determining a plurality of first keypoints from the column features of each column in the first 2D feature map based on neighbor non-maximum suppression;
compressing, based on a first attention mechanism, the column features of each column into the feature information of each of that column's first keypoints to obtain a plurality of second keypoints, wherein the first attention mechanism uses each first keypoint feature as the query and the column's column features as the keys and values;
dividing the second keypoints corresponding to the columns into a plurality of non-overlapping keypoint windows, and performing information interaction among the feature information of the second keypoints within each keypoint window based on a second attention mechanism to obtain a plurality of third keypoints, wherein the second attention mechanism uses the second keypoint features as the queries, keys, and values;
and diffusing, based on a third attention mechanism, the feature information of the third keypoints in each column back into that column's column features to obtain the second 2D feature map.
8. The device of claim 5, wherein the adjusting and aggregating the first 2D feature information based on geometric information and instance-level information of the first 2D feature information to obtain second 2D feature information comprises:
predicting segmentation information for the first 2D feature information using a segmentation branch, and predicting a center offset for the first 2D feature information using a regression branch;
determining the geometric information corresponding to the first 2D feature information from the segmentation information, the center offset, and the position code corresponding to the first 2D feature information;
and dividing the geometric information and the first 2D feature information into a plurality of non-overlapping information windows, and performing information interaction between the geometric information and the first 2D feature information within each information window based on a fourth attention mechanism to obtain the second 2D feature information, wherein the fourth attention mechanism is a self-attention mechanism.
9. An object detection apparatus, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to implement the method of any of claims 1-4 when executing the instructions.
10. A non-transitory computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1-4.
11. A computer program product comprising computer readable code, or a non-transitory computer readable storage medium carrying computer readable code, characterized in that a processor in an electronic device performs the method of any one of claims 1-4 when the computer readable code is run in the electronic device.
CN202310297687.9A 2023-03-17 2023-03-17 Target detection method and device Pending CN116466320A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310297687.9A CN116466320A (en) 2023-03-17 2023-03-17 Target detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310297687.9A CN116466320A (en) 2023-03-17 2023-03-17 Target detection method and device

Publications (1)

Publication Number Publication Date
CN116466320A true CN116466320A (en) 2023-07-21

Family

ID=87176202

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310297687.9A Pending CN116466320A (en) 2023-03-17 2023-03-17 Target detection method and device

Country Status (1)

Country Link
CN (1) CN116466320A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116912890A (en) * 2023-09-14 2023-10-20 国网江苏省电力有限公司常州供电分公司 Method and device for detecting birds in transformer substation
CN116912890B (en) * 2023-09-14 2023-11-24 国网江苏省电力有限公司常州供电分公司 Method and device for detecting birds in transformer substation


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination