CN117576665B - Automatic driving-oriented single-camera three-dimensional target detection method and system - Google Patents

Automatic driving-oriented single-camera three-dimensional target detection method and system

Info

Publication number
CN117576665B
Authority
CN
China
Prior art keywords
depth
dimensional
uncertainty
target
predicted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410077692.3A
Other languages
Chinese (zh)
Other versions
CN117576665A (en)
Inventor
徐小龙
周鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202410077692.3A priority Critical patent/CN117576665B/en
Publication of CN117576665A publication Critical patent/CN117576665A/en
Application granted granted Critical
Publication of CN117576665B publication Critical patent/CN117576665B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/50 - Context or environment of the image
    • G06V20/56 - Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V20/58 - Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 - Image analysis
    • G06T7/50 - Depth or shape recovery
    • G06T7/55 - Depth or shape recovery from multiple images
    • G06T7/593 - Depth or shape recovery from multiple images from stereo images
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/20 - Image preprocessing
    • G06V10/25 - Determination of region of interest [ROI] or a volume of interest [VOI]
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/764 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using classification, e.g. of video objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/77 - Processing image or video features in feature spaces; using data integration or data reduction, e.g. principal component analysis [PCA] or independent component analysis [ICA] or self-organising maps [SOM]; Blind source separation
    • G06V10/80 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level
    • G06V10/806 - Fusion, i.e. combining data from various sources at the sensor level, preprocessing level, feature extraction level or classification level of extracted features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/70 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82 - Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 - Scenes; Scene-specific elements
    • G06V20/60 - Type of objects
    • G06V20/64 - Three-dimensional objects
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V2201/00 - Indexing scheme relating to image or video recognition or understanding
    • G06V2201/07 - Target detection
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T - CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00 - Road transport of goods or passengers
    • Y02T10/10 - Internal combustion engine [ICE] based vehicles
    • Y02T10/40 - Engine management systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)
  • Image Processing (AREA)

Abstract

The invention discloses a single-camera three-dimensional target detection method and system for automatic driving. The method comprises the following steps: inputting the obtained monocular image into a feature extraction network and outputting a two-dimensional detection result; cropping RoI features from the two-dimensional detection result by a RoIAlign method; concatenating the normalized coordinate map of the monocular image with each cropped RoI feature map along the channel dimension to form the final RoI features; predicting three-dimensional detection information from the final RoI features; calculating the target depth with a geometric projection formula from the predicted two-dimensional frame height in the two-dimensional detection result and the predicted three-dimensional frame height in the three-dimensional detection information; fusing the depth directly predicted in the three-dimensional detection information with the depth calculated by the geometric projection formula and obtaining the final depth through uncertainty-weighted fusion; and combining the predicted three-dimensional detection information with the final depth obtained by the weighted fusion to output the prediction information of the target.

Description

Automatic driving-oriented single-camera three-dimensional target detection method and system
Technical Field
The invention relates to a single-camera three-dimensional target detection method and system for automatic driving, and belongs to the technical field of three-dimensional target detection.
Background
Three-dimensional object detection has long been an important problem in automatic driving; its main task is to compute the three-dimensional position, the size and the yaw angle of vehicles.
In computer vision applications for automatic driving, three-dimensional target detection algorithms that recover the three-dimensional spatial information of vehicles are essential. Within this three-dimensional spatial information, depth estimation is the most important branch. However, accurately acquiring the depth of a target from a single camera is theoretically very difficult, and inaccurate depth prediction is the main cause of performance degradation. Existing single-camera three-dimensional target detection methods for automatic driving mainly comprise radar-based methods, methods based on pre-trained depth, and direct regression methods; the first two rely heavily on additional information, and their computation and labor costs are high. In recent years, computer vision researchers have proposed many direct-regression methods, which greatly reduce research cost and improve detection speed.
However, most of these methods are single-depth-estimation methods: during model training, the depth is estimated either directly by a neural network from the texture information of the vehicle, or from the height information through a geometric projection formula, so the image information cannot be comprehensively utilized.
Disclosure of Invention
The invention aims to provide a single-camera three-dimensional target detection method and system for automatic driving, so as to overcome the defect that most existing methods are single-depth-estimation methods that cannot comprehensively utilize image information and therefore predict depth inaccurately.
An automatic driving-oriented single-camera three-dimensional target detection method, comprising the following steps:
inputting the obtained monocular image into a feature extraction network, and outputting a two-dimensional detection result;
cutting out RoI features of the two-dimensional detection result by adopting a RoIAlign method;
Concatenating the normalized coordinate map of the monocular image with each cut-out RoI feature map along the channel dimension to form the final RoI features;
Predicting three-dimensional detection information according to the final RoI features;
Calculating the target depth with a geometric projection formula from the predicted two-dimensional frame height in the two-dimensional detection result and the predicted three-dimensional frame height in the three-dimensional detection information;
Fusing the depth directly predicted in the three-dimensional detection information with the target depth calculated by the geometric projection formula, and obtaining the final depth through uncertainty-weighted fusion;
And combining the predicted three-dimensional detection information with the final depth obtained by the weighted fusion, and outputting the prediction information of the target.
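The geometric projection in the depth-calculation step above rests on the pinhole relation z = f · H_3D / H_2D: the depth is the metric object height divided by its pixel height and scaled by the focal length. A small numeric sketch (the roughly 720-pixel focal length is only an illustrative, KITTI-like value):

    def depth_from_height(focal_px: float, h3d_m: float, h2d_px: float) -> float:
        # Geometric projection: an object of metric height H_3D imaged with focal
        # length f (pixels) spans H_2D = f * H_3D / z pixels, hence z = f * H_3D / H_2D.
        return focal_px * h3d_m / h2d_px

    # e.g. a ~1.5 m tall car whose 2D box is 60 px tall under a ~720 px focal length:
    # depth_from_height(720.0, 1.5, 60.0) -> 18.0 (meters)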
Further, the two-dimensional detection result includes four parts:
Heatmap: predicting the class scores of the targets and the coarse coordinates of the 2D box centers;
Offset_2D: predicting the offset between the projected 3D bounding box center point and the 2D bounding box center coordinates after downsampling;
Size_2D: the height and width of the 2D frame, in pixels;
Residual_2D: the residual of the 2D bounding box center coordinates after downsampling.
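A short PyTorch sketch of how these four outputs could be realized as convolutional heads on the shared feature map; the channel count and number of classes are illustrative assumptions, since the text only fixes what each head predicts:

    import torch
    from torch import nn

    class Detect2DHead(nn.Module):
        def __init__(self, in_ch: int = 64, num_classes: int = 3):
            super().__init__()
            def head(out_ch: int) -> nn.Sequential:
                return nn.Sequential(nn.Conv2d(in_ch, in_ch, 3, padding=1),
                                     nn.ReLU(inplace=True),
                                     nn.Conv2d(in_ch, out_ch, 1))
            self.heatmap = head(num_classes)   # class scores / coarse 2D centers
            self.offset_2d = head(2)           # projected 3D center vs. 2D center offset
            self.size_2d = head(2)             # 2D box height and width (pixels)
            self.residual_2d = head(2)         # residual of the downsampled 2D center

        def forward(self, feat: torch.Tensor) -> dict:
            return {"heatmap": self.heatmap(feat).sigmoid(),
                    "offset_2d": self.offset_2d(feat),
                    "size_2d": self.size_2d(feat),
                    "residual_2d": self.residual_2d(feat)}

    # feat = torch.randn(1, 64, 96, 320)   # illustrative downsampled feature map
    # outputs = Detect2DHead()(feat)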
Further, the three-dimensional detection information includes:
Angle: the angle prediction output adopts a multi-bin strategy and is divided into 24 intervals, wherein the first 12 are classification outputs and the last 12 are regression outputs;
Direct_Depth: directly predicting the depth information of the target by using the feature extraction network and outputting two columns of information, wherein the first column is the depth value and the second column is the uncertainty;
Offset_3D: the residual of the projected 3D bounding box center point after downsampling;
Size_3D: the size of the 3D bounding box; what is actually predicted is the deviation of the size, and the predicted deviation is added to the average size of the targets in the dataset to obtain the predicted size;
Depth_bias: the deviation of the predicted depth, which compensates the depth prediction of truncated targets.
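The per-branch output widths implied by the list above can be summarized as follows; only the 12 + 12 angle layout and the two-column direct depth are stated explicitly, so the remaining widths are common choices and should be read as assumptions:

    # Output channels of each 3D prediction branch on the final RoI feature.
    ROI_BRANCH_CHANNELS = {
        "angle": 12 + 12,     # multi-bin: 12 bin-classification outputs + 12 regression outputs
        "direct_depth": 2,    # column 1: depth value, column 2: uncertainty
        "offset_3d": 2,       # residual of the projected 3D box center after downsampling (assumed 2-D)
        "size_3d": 3,         # deviation from the dataset mean size (assumed height, width, length)
        "depth_bias": 2,      # depth deviation (assumed mean and scale of its Laplace distribution)
    }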
Further, the loss function of the feature extraction network is:
Loss_all = ∑ w_i · Loss_i
where the weight of the two-dimensional detection portion is initially set to w_i = 1 and that of the three-dimensional detection portion to w_i = 0; Loss_all is expressed as the overall loss; and Loss_i is denoted as the loss of each predicted branch.
Further, the target depth is calculated from the depth directly predicted in the three-dimensional detection information and from the geometric projection formula, and the final depth is obtained through uncertainty-weighted fusion, as follows:
Direct depth estimation is performed on the final RoI feature:
(z_d, σ_d) = Head(Direct_Depth(RoI))
wherein Head(Direct_Depth(·)) is the prediction branch in the three-dimensional information used for estimating depth and uncertainty; z_d denotes the direct depth estimation result, ε denotes a set parameter, and σ_d denotes the heteroscedastic random uncertainty in modeling the depth estimation;
Bringing the height of the three-dimensional frame, which obeys the Laplace distribution La(μ_H, λ_H), into the geometric projection formula, the depth predicted according to the geometric projection is:
z = f · H_3D / H_2D = f · (μ_H + λ_H · X) / H_2D
wherein f denotes the focal length, H_2D denotes the two-dimensional frame height, X obeys the standard Laplace distribution La(0, 1), H_3D denotes the three-dimensional frame height, λ_H denotes the scale parameter, and μ_H denotes the mean of the three-dimensional frame height;
Meanwhile, a depth deviation obeying the Laplace distribution La(μ_b, σ_b) is also predicted in the three-dimensional detection information, and the depth and uncertainty of the final geometric projection prediction are obtained by utilizing the additivity of the Laplace distribution:
z_p = μ_z + μ_b,  σ_p² = σ_z² + σ_b²
where σ_b denotes the variance of the depth deviation, μ_b denotes the mean of the depth deviation, σ_p is the uncertainty based on the geometric projection, and z_p is the depth based on the geometric projection; μ_z = f · μ_H / H_2D and σ_z = f · λ_H / H_2D;
The direct depth z_d obtained on the RoI feature and the geometric-projection depth z_p are fused using uncertainty guidance; the weights ω_i (i = d, p) are computed as:
ω_i = σ_i² / ∑_j σ_j²  (j = d, p)
where d denotes the direct depth estimation, p denotes the depth estimation based on geometric projection, ∑ σ_j² denotes the sum of squares of the uncertainties of the two estimates, and σ_i is the uncertainty of the direct depth estimation or of the geometric-projection depth estimation;
The final target depth z_c and uncertainty σ_c are calculated as:
z_c = ∑ ω_i · z_i,  σ_c² = ∑ ω_i · σ_i²
Because the target depth also obeys the Laplace distribution, the loss function for the target depth information is the corresponding Laplace negative log-likelihood, where z* denotes the label true value, z_c denotes the target depth, σ_c denotes its uncertainty, z_i denotes the two depth estimates, and σ_i denotes the uncertainty corresponding to each depth estimate.
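As a concrete illustration of this uncertainty-guided fusion, a small PyTorch sketch (the numeric values in the usage comment are illustrative):

    import torch

    def fuse_depth(z_d, sigma_d, z_p, sigma_p):
        # w_i = sigma_i^2 / sum_j sigma_j^2  (the less certain estimate gets the larger weight),
        # z_c = sum_i w_i * z_i,  sigma_c^2 = sum_i w_i * sigma_i^2
        var_d, var_p = sigma_d ** 2, sigma_p ** 2
        total = var_d + var_p
        w_d, w_p = var_d / total, var_p / total
        z_c = w_d * z_d + w_p * z_p
        sigma_c = torch.sqrt(w_d * var_d + w_p * var_p)
        return z_c, sigma_c

    # e.g. direct estimate 17.2 m (sigma 0.8) and projected estimate 18.5 m (sigma 0.4):
    # fuse_depth(torch.tensor(17.2), torch.tensor(0.8),
    #            torch.tensor(18.5), torch.tensor(0.4))   # -> (17.46, ...)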
Further, the predicting three-dimensional detection information according to the final RoI characteristic includes:
The RoI feature is passed through convolution, group normalization, activation, adaptive average pooling and convolution operations, and the predicted three-dimensional detection information is output.
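A short sketch of one such prediction branch, following the stated sequence of convolution, group normalization, activation, adaptive average pooling and convolution; the 256-channel, 32-group setting reuses the group-normalization configuration given in the embodiment below, and the output width is a placeholder:

    from torch import nn

    def roi_branch(in_ch: int = 256, out_ch: int = 2) -> nn.Sequential:
        # convolution -> group normalization -> activation -> adaptive average pooling -> convolution
        return nn.Sequential(
            nn.Conv2d(in_ch, 256, kernel_size=3, padding=1),
            nn.GroupNorm(num_groups=32, num_channels=256, eps=1e-5),
            nn.ReLU(inplace=True),
            nn.AdaptiveAvgPool2d(1),                # collapse each RoI to 1x1
            nn.Conv2d(256, out_ch, kernel_size=1),  # branch output, e.g. (depth, uncertainty)
        )

    # roi_feat of shape (N, 256, 7, 7) -> roi_branch()(roi_feat).flatten(1) of shape (N, 2)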
Further, the feature extraction network comprises a DLA-34 backbone network and a Neck network; the DLA-34 backbone adopts the CenterNet framework, the last 4 of the 6 feature maps output by the DLA-34 backbone are input into the Neck network, and the Neck network outputs one feature map from the 4 input feature maps as the two-dimensional detection result.
Further, the predicted information of the target includes three-dimensional center point coordinates, a size, and a yaw angle.
Further, the RoI features include only object level features and no background noise.
The second aspect of the invention provides a single-camera three-dimensional target detection system for automatic driving, which comprises:
The feature extraction module is used for acquiring a monocular image, inputting the monocular image into the feature extraction network and outputting a two-dimensional detection result;
the feature clipping module is used for clipping RoI features of the two-dimensional detection result by adopting a RoIAlign method;
the normalization module is used for concatenating the normalized coordinate map of the monocular image with each cropped RoI feature map along the channel dimension to form the final RoI features;
the three-dimensional detection module is used for predicting three-dimensional detection information according to the final RoI characteristics;
The algorithm module is used for calculating the target depth by adopting a geometric projection formula from the predicted two-dimensional frame height in the two-dimensional detection result and the predicted three-dimensional frame height in the three-dimensional detection information;
The uncertainty fusion module is used for calculating the target depth from the depth directly obtained from the three-dimensional detection information and the geometric projection formula, and obtaining the final depth through uncertainty weighted fusion;
and the fusion module is used for combining the predicted three-dimensional detection information with the final depth obtained by the weighted fusion and outputting the predicted information of the target.
Compared with the prior art, the invention has the beneficial effects that:
1. The method integrates direct depth estimation and geometric projection-based depth estimation through uncertainty guidance, comprehensively utilizes the texture and geometric characteristics of the image, provides more accurate depth estimation, and has better robustness;
2. The invention distributes higher weight values to branches with unstable depth prediction through depth fusion, which is helpful for improving the stability of the whole depth estimation;
3. In order to better assist the three-dimensional detection task, the invention adds two-dimensional detection task branches and applies group normalization within each channel group, which preserves positional information among channels, facilitates the learning of spatial information in three-dimensional target detection, and accelerates the network training process;
4. The invention adopts two-stage detection and performs further prediction on the RoI features; it is faster than most single-stage methods and, while meeting the real-time requirement of single-camera three-dimensional target detection for automatic driving, achieves higher detection accuracy than current methods of each category.
Drawings
FIG. 1 is a three-dimensional spatial information diagram of a detection target of the method of the present invention;
FIG. 2 is a schematic diagram of a network structure of the method of the present invention;
FIG. 3 is a schematic diagram of a network prediction branch of the method of the present invention.
Detailed Description
The invention is further described in connection with the following detailed description, in order to make the technical means, the creation characteristics, the achievement of the purpose and the effect of the invention easy to understand.
Example 1
The invention discloses a single-camera three-dimensional target detection method for automatic driving, wherein three-dimensional space information is shown in fig. 1, and the method comprises the following steps:
inputting the obtained monocular image into a feature extraction network, and outputting a two-dimensional detection result;
cutting out RoI features of the two-dimensional detection result by adopting a RoIAlign method;
Concatenating the normalized coordinate map of the monocular image with each cut-out RoI feature map along the channel dimension to form the final RoI features;
Predicting three-dimensional detection information according to the final RoI characteristics;
Calculating the target depth by adopting a geometric projection formula from the predicted two-dimensional frame height in the two-dimensional detection result and the predicted three-dimensional frame height in the three-dimensional detection information;
Fusing the depth directly predicted in the three-dimensional detection information with the target depth calculated by the geometric projection formula, and obtaining the final depth through uncertainty-weighted fusion;
Combining the predicted three-dimensional detection information with the final depth obtained by the weighted fusion, and outputting the prediction information of the target, wherein the prediction information of the target comprises the three-dimensional center point coordinates, the size and the yaw angle;
The feature extraction network comprises a DLA-34 backbone network and a Neck network; the DLA-34 backbone adopts the CenterNet framework, the last 4 of the 6 feature maps output by the DLA-34 backbone are input into the Neck network, and the Neck network outputs one feature map from the 4 input feature maps as the two-dimensional detection result.
Here the task of three-dimensional object detection is decoupled. For a monocular image, the task is to find each object of interest in the picture and estimate its class and three-dimensional box; the main object in the KITTI dataset is the car. The three-dimensional box information is divided into the three-dimensional center point coordinates (x, y, z), the size (h, w, l) and the yaw angle θ of the target, as shown in fig. 1. After the target depth z is found, the projection point (u, v) of the three-dimensional frame center is used to recover x and y with the following formula:
x = (u - c_u) · z / f,  y = (v - c_v) · z / f
wherein (c_u, c_v) is the principal point and f is the focal length, so that the 3D center point can be predicted. The size and yaw angle are output by the other associated prediction branches.
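A small sketch of this back-projection, using the same symbols as above; the numeric values in the usage comment are illustrative only:

    def backproject_center(u: float, v: float, z: float,
                           f: float, c_u: float, c_v: float):
        # Invert the pinhole projection u = f * x / z + c_u, v = f * y / z + c_v
        x = (u - c_u) * z / f
        y = (v - c_v) * z / f
        return x, y, z

    # e.g. a projected center at (700, 200) px with depth 18 m, f = 720 px and
    # principal point (620, 190): backproject_center(700, 200, 18.0, 720.0, 620.0, 190.0)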
The main predictive branches of the present invention are shown in figure 2.
1) The prediction branches are divided into a two-dimensional detection part and a three-dimensional detection part; the three-dimensional detection is performed on the RoI features, and the final three-dimensional frame is formed from the two-dimensional and three-dimensional detection information. The group normalization module is set with num_groups=32, num_channels=256 and the default eps of 1e-5. The two-dimensional center guides the regression of the projected three-dimensional center point, which connects the two-dimensional and three-dimensional tasks so that the learning of the different tasks is mutually promoted. The two-dimensional bounding box width and height prediction branches of the two-dimensional detection module let the model learn features that are helpful for depth estimation, and the two-dimensional detection module is required for the three-dimensional detection task because, by the imaging principle, objects generally appear larger when near and smaller when far in the image. The two-dimensional detection output is improved on the basis of CenterNet: the last 4 feature maps output by the backbone are fed into the Neck, and the final feature map is output as the output of the whole network; it comprises four parts.
Heatmap: predicts the class score of the object and the coarse coordinates of the 2D box center; the coarse coordinates are supervised with the projection of the 3D bounding box center, which helps perceive 3D geometric information and is associated with the task of estimating the 3D object center.
Offset_2D: predicts the offset between the projected 3D bounding box center point and the 2D bounding box center coordinates after downsampling (s = 4).
Size_2D: the height and width of the 2D frame, in pixels.
Residual_2D: the residual of the 2D bounding box center coordinates after downsampling.
2) To better focus on each object, the RoI features are extracted by RoIAlign cropping, and the normalized coordinate map is concatenated with each RoI feature map along the channel dimension to obtain the final RoI features (see the sketch after the following list); some information of the three-dimensional box is then predicted from the extracted final RoI features.
Angle: the angle prediction output adopts a multi-bin strategy and is divided into 24 intervals, where the first 12 are classification outputs and the last 12 are regression outputs.
Direct_Depth: directly predicts the depth information of the target, i.e. the target distance z (depth) in the camera coordinate system, using the backbone neural network model. Two columns of information are output, the first column being the depth value and the second column the uncertainty (in log-variance form).
Offset_3D: the residual of the projected 3D bounding box center point after downsampling.
Size_3D: the size information of the 3D bounding box; what is actually predicted is the deviation of the size, and the predicted deviation is added to the average size of the targets in the dataset to obtain the predicted size.
Depth_bias: the deviation value of the predicted depth, which compensates for the depth prediction error of truncated targets.
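A short sketch of the RoIAlign cropping and coordinate-map concatenation of step 2), using torchvision's roi_align. Here the normalized coordinate channels are attached before cropping so that each RoI keeps its image-position information, an ordering the text leaves open; the box format and tensor sizes are illustrative:

    import torch
    from torchvision.ops import roi_align

    def crop_roi_features(feat, boxes, out_size=7, spatial_scale=0.25):
        # feat: (1, C, H, W) downsampled feature map; boxes: (N, 4) 2D boxes
        # (x1, y1, x2, y2) in input-image coordinates; spatial_scale matches s = 4.
        n, c, h, w = feat.shape
        ys = torch.linspace(0, 1, h, device=feat.device).view(1, 1, h, 1).expand(n, 1, h, w)
        xs = torch.linspace(0, 1, w, device=feat.device).view(1, 1, 1, w).expand(n, 1, h, w)
        feat_coord = torch.cat([feat, xs, ys], dim=1)                    # append normalized (x, y)
        rois = torch.cat([torch.zeros(len(boxes), 1, device=feat.device), boxes], dim=1)
        return roi_align(feat_coord, rois, output_size=out_size,
                         spatial_scale=spatial_scale, aligned=True)      # (N, C + 2, 7, 7)

    # feat = torch.randn(1, 254, 96, 320); boxes = torch.tensor([[40., 60., 200., 180.]])
    # final_roi = crop_roi_features(feat, boxes)   # 254 feature + 2 coordinate channels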
The invention adopts two-stage detection; the two-dimensional detection stage is a front-end task of the 3D detection, and the final depth estimation task depends on both the front-end two-dimensional detection and the three-dimensional detection tasks. The total loss function is:
Loss_all = ∑ w_i · Loss_i
where the weight of the two-dimensional detection portion is initially set to w_i = 1 and that of the three-dimensional detection portion to w_i = 0; Loss_all is expressed as the overall loss and Loss_i as the loss of each predicted branch. A hierarchical task learning strategy observes the learning state of each task and the local trend of the loss of its front-end tasks; if a front-end task tends to converge, the weight of the dependent task is increased. As training progresses, the weight of the 3D detection branches gradually increases from 0 to 1. The loss weight of each term can thus dynamically reflect the learning state of the front-end tasks, making training more stable.
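A small sketch of this hierarchical task learning schedule; the convergence test on the recent 2D-loss trend and the linear ramp are assumptions, since the text does not specify the exact rule:

    def hierarchical_weights(step, total_steps, loss2d_history, window=50, tol=0.01):
        # 2D branches keep weight 1; 3D branches ramp from 0 to 1 once the recent
        # trend of the 2D loss suggests the front-end task has converged (assumed rule).
        w2d = 1.0
        if len(loss2d_history) < window:
            return w2d, 0.0
        recent = loss2d_history[-window:]
        trend = (recent[0] - recent[-1]) / max(abs(recent[0]), 1e-6)
        w3d = min(1.0, step / max(total_steps, 1)) if trend < tol else 0.0
        return w2d, w3d

    def total_loss(losses_2d, losses_3d, w2d, w3d):
        # Loss_all = sum_i w_i * Loss_i over all prediction branches
        return w2d * sum(losses_2d) + w3d * sum(losses_3d)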
In the method of the invention, the main process is solving the depth, and the specific process is shown in fig. 3, and the steps are as follows:
1) Direct depth estimation based on uncertainty theory relies on the appearance of the target and its surrounding pixels, and the RoI features contain only object features without background noise. Direct depth estimation is performed here on the RoI features:
(z_d, σ_d) = Head(Direct_Depth(RoI))
The Direct_Depth branch is used to estimate depth and uncertainty. The depth is estimated from the first channel using an inverse sigmoid transform, which maps the continuous output to a positive range; ε indicates a set parameter, a small number that ensures numerical stability, taken as 1e-6 in this example. σ_d denotes the heteroscedastic uncertainty in the modeled depth estimate.
2) In the geometric projection, it is assumed here that the three-dimensional box height of the target obeys the Laplace distribution La(μ_H, λ_H), whose parameters are predicted end-to-end by the Size_3D branch:
H_3D = μ_H + λ_H · X,  X ~ La(0, 1)
where X obeys the standard Laplace distribution La(0, 1). Thus, the loss function of the 3D height can be expressed as the negative log-likelihood of this distribution:
Loss_H = |μ_H - H*_3D| / λ_H + log λ_H
where H*_3D is the true height. The loss makes the predicted height μ_H as close as possible to the true height H*_3D, which lets the network learn more accurate height predictions, while the regularization term log λ_H facilitates the joint optimization of the height and the uncertainty predictions.
3) Substituting the Laplace-distributed 3D height into the geometric projection formula gives:
z = f · H_3D / H_2D = f · (μ_H + λ_H · X) / H_2D
where f denotes the focal length, H_2D the two-dimensional frame height, X obeys the standard Laplace distribution La(0, 1), H_3D denotes the three-dimensional frame height, λ_H the scale parameter, and μ_H the mean of the three-dimensional frame height.
The projection depth therefore also obeys a Laplace distribution, with mean μ_z = f · μ_H / H_2D and standard deviation σ_z = f · λ_H / H_2D. The network additionally predicts a depth bias to help achieve more accurate depth results.
Likewise, the depth deviation obeys the Laplace distribution La(μ_b, σ_b). Exploiting the additivity of the Laplace distribution, the depth and uncertainty of the final geometric projection prediction are:
z_p = μ_z + μ_b,  σ_p² = σ_z² + σ_b²
where σ_b denotes the variance of the depth deviation, μ_b denotes the mean of the depth deviation, σ_p is the uncertainty based on the geometric projection, and z_p is the depth based on the geometric projection.
4) The direct depth z_d obtained on the RoI features and the depth z_p based on geometric projection are fused together using uncertainty guidance. The weights ω_i (i = d, p) are computed as:
ω_i = σ_i² / ∑_j σ_j²  (j = d, p)
where d denotes the direct depth estimation, p denotes the depth estimation based on geometric projection, and σ_i is the uncertainty of the direct depth estimation or of the geometric-projection depth estimation.
5) The final target depth z_c and uncertainty σ_c are computed as:
z_c = ∑ ω_i · z_i,  σ_c² = ∑ ω_i · σ_i²
Since the fused depth also obeys the Laplace distribution, the depth loss is taken as the corresponding Laplace negative log-likelihood of the fused depth z_c (with uncertainty σ_c) and of the two individual estimates z_i (with uncertainties σ_i) with respect to the real depth value.
The overall loss drives the predicted depth closer to the real depth value, and the uncertainties of the three-dimensional frame height and of the depth deviation are trained during the same optimization process. The depth fusion formula dynamically assigns weights by observing their changes and favors the depth prediction branch whose training is unstable: a depth estimate with higher uncertainty receives a higher weight, so even a less certain estimate still has some impact on the final depth estimate, which helps improve the stability of the overall depth estimation, since the higher-uncertainty estimate has more impact on the final result. For example, when the uncertainty calculated from the height is larger than the uncertainty of the directly estimated depth, the network leans more toward the height-based depth prediction and raises the corresponding weight, so the depth prediction is comprehensively optimized and the fault tolerance is enhanced.
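A small sketch of such a depth loss, assuming the standard Laplace negative log-likelihood form |z - z*| / σ + log σ for the fused depth and for each individual estimate; the exact constants and weighting used by the invention are not reproduced here:

    import torch

    def laplace_nll(z_pred, sigma, z_gt):
        # Laplace negative log-likelihood with location z_pred and scale sigma
        # (constant terms dropped): |z_pred - z*| / sigma + log sigma
        return torch.abs(z_pred - z_gt) / sigma + torch.log(sigma)

    def depth_loss(z_d, sigma_d, z_p, sigma_p, z_c, sigma_c, z_gt):
        # Supervise the fused depth and both individual estimates
        # (equal weighting of the three terms is an assumption).
        return (laplace_nll(z_c, sigma_c, z_gt)
                + laplace_nll(z_d, sigma_d, z_gt)
                + laplace_nll(z_p, sigma_p, z_gt)).mean()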
Depth prediction is important for the subsequent inference process. The depth fusion model represents the uncertainty of the depth well; to obtain the final three-dimensional box confidence, the fused depth uncertainty is further mapped through an exponential function to a value between 0 and 1, which serves as the depth confidence and provides a more accurate confidence for each projected depth.
Let p_3D be the probability that the target is correctly detected (the three-dimensional box confidence), p_2D the classification heatmap score, and p_3D|2D the conditional three-dimensional box confidence. Previous approaches typically use the two-dimensional confidence p_2D as the final score and do not consider features in three-dimensional space, or model p_3D|2D with the three-dimensional box IoU; but since the average three-dimensional box IoU of the model is larger in the training phase than in the validation phase, the latter performs poorly at validation time. Here, the conditional three-dimensional box confidence is expressed as the depth confidence, and the final confidence is obtained by the probability chain rule as:
p_3D = p_2D · p_3D|2D
The final score reflects both the 2D detection confidence and the fused depth confidence, which guides more reliable detection. This calculation introduces the uncertainty of the direct depth estimation and the prior information of the projection model, so that depth errors caused by three-dimensional frame height errors are well reflected in the confidence computation.
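A small sketch of this confidence computation, taking exp(-σ_c) as the depth confidence; the text only states that an exponential function maps the fused uncertainty into a value between 0 and 1, so this particular mapping is an assumption:

    import torch

    def box_confidence(heatmap_score, sigma_c):
        # p_3D = p_2D * p_3D|2D, with the conditional term taken as the
        # depth confidence exp(-sigma_c) in (0, 1]
        return heatmap_score * torch.exp(-sigma_c)

    # e.g. heatmap score 0.9 and fused depth uncertainty 0.35:
    # box_confidence(torch.tensor(0.9), torch.tensor(0.35))   # ~0.63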
In this embodiment, the method and the model are tested on the KITTI dataset and compared with mainstream single-camera three-dimensional target detection methods. The overall performance comparison is shown in Table 1, where our model is MonoCoDe. The best results are shown in bold and the second-best in italics; E denotes direct depth estimation and H denotes depth estimation from height. AP (mean Average Precision) is the most important index for measuring accuracy in target detection algorithms; the experimental evaluation metric is the 40-point interpolated AP on the car class at the moderate difficulty level, with a sample counted as correct when its IoU (the intersection over union of prediction and ground truth) reaches the required threshold.
Table 1: target detection overall performance comparison result
As can be seen from Table 1, the present invention performs better than the other methods on the car class (the data for each method are taken from the corresponding publications), including the methods that use additional information. The car class is the object of most interest in the KITTI three-dimensional object detection benchmark, and the moderate level is the main basis for ranking. Apart from one difficulty level, the method herein exceeds MonoCon (a direct depth estimation method using auxiliary learning, the 2022 monocular 3D object detection SOTA method). The method herein also performs better than the SOTA models of the other categories. For example, at the moderate level of three-dimensional detection, the method herein is 2.67% higher than MonoFlex (a relative improvement of about 20% over that method). In addition, the running speed of the method is 38 fps, which meets the requirement of real-time detection and is much faster than methods that rely on additional information, reflecting the advantage of a single-camera three-dimensional target detection method that does not depend on any auxiliary information.
Example 2
The invention also discloses a single-camera three-dimensional target detection system for automatic driving, which comprises:
The feature extraction module is used for inputting the acquired monocular image into a feature extraction network and outputting a two-dimensional detection result;
the feature clipping module is used for clipping RoI features of the two-dimensional detection result by adopting a RoIAlign method;
the normalization module is used for concatenating the normalized coordinate map of the monocular image with each cropped RoI feature map along the channel dimension to form the final RoI features;
the three-dimensional detection module is used for predicting three-dimensional detection information according to the final RoI characteristics;
The algorithm module is used for calculating the target depth by adopting a geometric projection formula from the predicted two-dimensional frame height in the two-dimensional detection result and the predicted three-dimensional frame height in the three-dimensional detection information;
The uncertainty fusion module is used for calculating the target depth from the depth directly obtained from the three-dimensional detection information and the geometric projection formula, and obtaining the final depth through uncertainty weighted fusion;
and the fusion module is used for combining the predicted three-dimensional detection information with the final depth obtained by the weighted fusion and outputting the predicted information of the target.
The foregoing is merely a preferred embodiment of the present invention, and it should be noted that modifications and variations could be made by those skilled in the art without departing from the technical principles of the present invention, and such modifications and variations should also be regarded as being within the scope of the invention.

Claims (8)

1. An automatic driving-oriented single-camera three-dimensional target detection method is characterized by comprising the following steps of:
inputting the obtained monocular image into a feature extraction network, and outputting a two-dimensional detection result;
cutting out RoI features of the two-dimensional detection result by adopting a RoIAlign method;
Connecting the coordinate graph normalized by the monocular image with the map of each cut-out RoI feature in a channel mode to form a final RoI feature;
Predicting three-dimensional detection information according to the final RoI characteristics;
Calculating the target depth by adopting a geometric projection formula from the predicted two-dimensional frame height in the two-dimensional detection result and the predicted three-dimensional frame height in the three-dimensional detection information;
calculating the target depth by directly solving the depth and a geometric projection formula in the three-dimensional detection information, and obtaining the final depth through uncertainty weighted fusion;
Fusing the predicted three-dimensional detection information with the final depth obtained by weighting fusion, and outputting the predicted information of the target;
the loss function of the feature extraction network is:
Loss_all = ∑ w_i · Loss_i
Initially setting a weight w_i = 1 for the two-dimensional detection portion and w_i = 0 for the three-dimensional detection portion; Loss_all being represented as the overall loss; Loss_i represents the loss of each predicted branch;
The method for calculating the target depth by directly solving the depth and the geometric projection formula in the three-dimensional detection information and obtaining the final depth by uncertainty weighted fusion comprises the following steps:
direct depth estimation is performed at the final RoI feature:
(z_d, σ_d) = Head(Direct_Depth(RoI))   (1)
The Head(Direct_Depth(·)) is a prediction branch in the three-dimensional information and is used for estimating depth and uncertainty; z_d represents the direct depth estimation result, ε is a set parameter, and σ_d represents the heteroscedastic uncertainty in modeling the depth estimation;
Bringing the height of the three-dimensional frame, which obeys the Laplace distribution La(μ_H, λ_H), into the geometric projection formula, the depth predicted according to the geometric projection is:
z = f · H_3D / H_2D = f · (μ_H + λ_H · X) / H_2D
wherein f represents the focal length, H_2D represents the two-dimensional frame height, X obeys the standard Laplace distribution La(0, 1), H_3D represents the three-dimensional frame height, λ_H represents the scale parameter, and μ_H represents the mean value of the three-dimensional frame height;
Meanwhile, the depth deviation obeying the Laplace distribution La(μ_b, σ_b) is also predicted in the three-dimensional detection information, and the depth and uncertainty of the final geometric projection prediction are obtained by utilizing the additivity of the Laplace distribution:
z_p = μ_z + μ_b,  σ_p² = σ_z² + σ_b²
wherein σ_b denotes the variance of the depth deviation, μ_b denotes the mean of the depth deviation, σ_p is the uncertainty based on the geometric projection, and z_p is the depth based on the geometric projection; μ_z = f · μ_H / H_2D and σ_z = f · λ_H / H_2D;
Fusing the direct depth z_d obtained on the RoI feature and the depth z_p based on geometric projection by using uncertainty guidance; the weight ω_i (i = d, p) calculation formula is:
ω_i = σ_i² / ∑_j σ_j²  (j = d, p)
where d represents the direct depth estimate, p represents the geometric-projection-based depth estimate, ∑ σ_j² represents the sum of squares of the uncertainties of the direct depth estimate and the geometric-projection-based depth estimate, and σ_i is expressed as the uncertainty of the direct depth estimate or the uncertainty based on the geometric projection depth;
The final target depth z_c and uncertainty σ_c are calculated by the formulas:
z_c = ∑ ω_i · z_i,  σ_c² = ∑ ω_i · σ_i²
because the target depth also obeys the Laplace distribution, the loss function for the target depth information is the corresponding Laplace negative log-likelihood, where z* represents the label true value, z_c represents the target depth, σ_c represents the uncertainty, z_i represents the two depth estimates, and σ_i represents the uncertainty corresponding to each depth estimate.
2. The autopilot-oriented single-camera three-dimensional target detection method of claim 1 wherein the two-dimensional detection result comprises four parts:
Heatmap: predicting class scores of the targets and coarse coordinates of the centers of the 2D frames;
Offset_2d: predicting the offset between the projected 3D bounding box center point and the 2D bounding box center coordinates after downsampling;
Size_2d: height and width of the 2D frame, in pixels;
Residual_2d: the residual of the 2D bounding box center coordinates after downsampling.
3. The autopilot-oriented single-camera three-dimensional object detection method of claim 1 wherein the three-dimensional detection information comprises:
Angle: the angle prediction output is divided into 24 intervals by adopting a multi-bin strategy, wherein the first 12 are used for classification prediction output and the last 12 are used for regression prediction output;
Direct_depth: directly predicting the depth information of the target by using the feature extraction network, and outputting two columns of information, wherein the first column is the depth value and the second column is the uncertainty;
Offset_3d: the residual of the projected 3D bounding box center point after downsampling;
Size_3d: the size of the 3D bounding box; what is actually predicted is the deviation of the size, and the average size of the targets in the dataset is added to the predicted deviation to obtain the predicted size;
Depth_bias: the deviation value of the predicted depth, to make up for the deviation of the truncated target depth prediction.
4. The method for detecting a three-dimensional object by using a single camera for automatic driving according to claim 1, wherein predicting three-dimensional detection information according to a final RoI characteristic comprises:
And carrying out convolution, group normalization, activation, adaptive average pooling and convolution operation on the RoI characteristic, and outputting predicted three-dimensional detection information.
5. The method for detecting the three-dimensional target of the single camera facing the automatic driving according to claim 1, wherein the feature extraction network comprises a DLA-34 backbone network and a Neck network; the DLA-34 backbone adopts the CenterNet framework; the last 4 of the 6 feature maps output by the DLA-34 backbone are input into the Neck network; and the Neck network outputs one feature map from the 4 input feature maps as the two-dimensional detection result.
6. The autopilot-oriented single camera three-dimensional target detection method of claim 1 wherein the predicted information of the target includes three-dimensional center point coordinates, dimensions, and yaw angle.
7. The autopilot-oriented single camera three-dimensional object detection method of claim 1 wherein the RoI features include only object level features.
8. An autopilot-oriented single-camera three-dimensional target detection system, the system comprising:
The feature extraction module is used for acquiring a monocular image, inputting the monocular image into the feature extraction network and outputting a two-dimensional detection result;
the feature clipping module is used for clipping RoI features of the two-dimensional detection result by adopting a RoIAlign method;
the normalization module is used for connecting the coordinate graph normalized by the monocular image with the map of each cut RoI feature in a channel mode to form a final RoI feature;
the three-dimensional detection module is used for predicting three-dimensional detection information according to the final RoI characteristics;
The algorithm module is used for calculating the target depth by adopting a geometric projection formula from the predicted two-dimensional frame height in the two-dimensional detection result and the predicted three-dimensional frame height in the three-dimensional detection information;
The uncertainty fusion module is used for calculating the target depth from the depth directly obtained from the three-dimensional detection information and the geometric projection formula, and obtaining the final depth through uncertainty weighted fusion;
the fusion module is used for combining the predicted three-dimensional detection information with the final depth obtained by the weighted fusion and outputting the predicted information of the target;
the loss function of the feature extraction network is:
Loss_all = ∑ w_i · Loss_i
Initially setting a weight w_i = 1 for the two-dimensional detection portion and w_i = 0 for the three-dimensional detection portion; Loss_all being represented as the overall loss; Loss_i represents the loss of each predicted branch;
Calculating the target depth by directly solving the depth and the geometric projection formula in the three-dimensional detection information, and obtaining the final depth by uncertainty weighted fusion comprises the following steps:
direct depth estimation is performed at the final RoI feature:
(z_d, σ_d) = Head(Direct_Depth(RoI))   (1)
The Head(Direct_Depth(·)) is a prediction branch in the three-dimensional information and is used for estimating depth and uncertainty; z_d represents the direct depth estimation result, ε is a set parameter, and σ_d represents the heteroscedastic uncertainty in modeling the depth estimation;
Bringing the height of the three-dimensional frame, which obeys the Laplace distribution La(μ_H, λ_H), into the geometric projection formula, the depth predicted according to the geometric projection is:
z = f · H_3D / H_2D = f · (μ_H + λ_H · X) / H_2D
wherein f represents the focal length, H_2D represents the two-dimensional frame height, X obeys the standard Laplace distribution La(0, 1), H_3D represents the three-dimensional frame height, λ_H represents the scale parameter, and μ_H represents the mean value of the three-dimensional frame height;
Meanwhile, the depth deviation obeying the Laplace distribution La(μ_b, σ_b) is also predicted in the three-dimensional detection information, and the depth and uncertainty of the final geometric projection prediction are obtained by utilizing the additivity of the Laplace distribution:
z_p = μ_z + μ_b,  σ_p² = σ_z² + σ_b²
wherein σ_b denotes the variance of the depth deviation, μ_b denotes the mean of the depth deviation, σ_p is the uncertainty based on the geometric projection, and z_p is the depth based on the geometric projection; μ_z = f · μ_H / H_2D and σ_z = f · λ_H / H_2D;
Fusing the direct depth z_d obtained on the RoI feature and the depth z_p based on geometric projection by using uncertainty guidance; the weight ω_i (i = d, p) calculation formula is:
ω_i = σ_i² / ∑_j σ_j²  (j = d, p)
where d represents the direct depth estimate, p represents the geometric-projection-based depth estimate, ∑ σ_j² represents the sum of squares of the uncertainties of the direct depth estimate and the geometric-projection-based depth estimate, and σ_i is expressed as the uncertainty of the direct depth estimate or the uncertainty based on the geometric projection depth;
The final target depth z_c and uncertainty σ_c are calculated by the formulas:
z_c = ∑ ω_i · z_i,  σ_c² = ∑ ω_i · σ_i²
because the target depth also obeys the Laplace distribution, the loss function for the target depth information is the corresponding Laplace negative log-likelihood, where z* represents the label true value, z_c represents the target depth, σ_c represents the uncertainty, z_i represents the two depth estimates, and σ_i represents the uncertainty corresponding to each depth estimate.
CN202410077692.3A 2024-01-19 2024-01-19 Automatic driving-oriented single-camera three-dimensional target detection method and system Active CN117576665B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410077692.3A CN117576665B (en) 2024-01-19 2024-01-19 Automatic driving-oriented single-camera three-dimensional target detection method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410077692.3A CN117576665B (en) 2024-01-19 2024-01-19 Automatic driving-oriented single-camera three-dimensional target detection method and system

Publications (2)

Publication Number Publication Date
CN117576665A CN117576665A (en) 2024-02-20
CN117576665B true CN117576665B (en) 2024-04-16

Family

ID=89890470

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410077692.3A Active CN117576665B (en) 2024-01-19 2024-01-19 Automatic driving-oriented single-camera three-dimensional target detection method and system

Country Status (1)

Country Link
CN (1) CN117576665B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118447468B (en) * 2024-07-08 2024-09-20 山西省财政税务专科学校 Monocular three-dimensional detection method and device based on spatial relationship between adjacent targets

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325794A (en) * 2020-02-23 2020-06-23 哈尔滨工业大学 Visual simultaneous localization and map construction method based on depth convolution self-encoder
US11004233B1 (en) * 2020-05-01 2021-05-11 Ynjiun Paul Wang Intelligent vision-based detection and ranging system and method
CN113159151A (en) * 2021-04-12 2021-07-23 中国科学技术大学 Multi-sensor depth fusion 3D target detection method for automatic driving
CN115222789A (en) * 2022-07-15 2022-10-21 杭州飞步科技有限公司 Training method, device and equipment for instance depth estimation model
CN116580085A (en) * 2023-03-13 2023-08-11 联通(上海)产业互联网有限公司 Deep learning algorithm for 6D pose estimation based on attention mechanism

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111325794A (en) * 2020-02-23 2020-06-23 哈尔滨工业大学 Visual simultaneous localization and map construction method based on depth convolution self-encoder
US11004233B1 (en) * 2020-05-01 2021-05-11 Ynjiun Paul Wang Intelligent vision-based detection and ranging system and method
CN113159151A (en) * 2021-04-12 2021-07-23 中国科学技术大学 Multi-sensor depth fusion 3D target detection method for automatic driving
CN115222789A (en) * 2022-07-15 2022-10-21 杭州飞步科技有限公司 Training method, device and equipment for instance depth estimation model
CN116580085A (en) * 2023-03-13 2023-08-11 联通(上海)产业互联网有限公司 Deep learning algorithm for 6D pose estimation based on attention mechanism

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Research on Three-dimensional Object Detection Algorithms for Autonomous Driving Based on Monocular Images; Qiao Dewen; China Master's Theses Full-text Database, Engineering Science and Technology II; 2024-01-15; C035-406 *

Also Published As

Publication number Publication date
CN117576665A (en) 2024-02-20

Similar Documents

Publication Publication Date Title
US9990736B2 (en) Robust anytime tracking combining 3D shape, color, and motion with annealed dynamic histograms
CN114565900A (en) Target detection method based on improved YOLOv5 and binocular stereo vision
CN111201451A (en) Method and device for detecting object in scene based on laser data and radar data of scene
CN117576665B (en) Automatic driving-oriented single-camera three-dimensional target detection method and system
CN110197106A (en) Object designation system and method
KR20210090384A (en) Method and Apparatus for Detecting 3D Object Using Camera and Lidar Sensor
US20220129685A1 (en) System and Method for Determining Object Characteristics in Real-time
CN110992424B (en) Positioning method and system based on binocular vision
CN113092807B (en) Urban overhead road vehicle speed measuring method based on multi-target tracking algorithm
CN113281718B (en) 3D multi-target tracking system and method based on laser radar scene flow estimation
CN114495064A (en) Monocular depth estimation-based vehicle surrounding obstacle early warning method
CN114372523A (en) Binocular matching uncertainty estimation method based on evidence deep learning
CN111862147B (en) Tracking method for multiple vehicles and multiple lines of human targets in video
CN116310673A (en) Three-dimensional target detection method based on fusion of point cloud and image features
CN114608522B (en) Obstacle recognition and distance measurement method based on vision
CN113112547A (en) Robot, repositioning method thereof, positioning device and storage medium
CN115937520A (en) Point cloud moving target segmentation method based on semantic information guidance
CN117523514A (en) Cross-attention-based radar vision fusion data target detection method and system
CN115909268A (en) Dynamic obstacle detection method and device
CN112699748B (en) Human-vehicle distance estimation method based on YOLO and RGB image
CN114220138A (en) Face alignment method, training method, device and storage medium
CN111709269B (en) Human hand segmentation method and device based on two-dimensional joint information in depth image
CN112712062A (en) Monocular three-dimensional object detection method and device based on decoupling truncated object
CN116740519A (en) Three-dimensional target detection method, system and storage medium for close-range and long-range multi-dimensional fusion
CN114140497A (en) Target vehicle 3D real-time tracking method and system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Common terms for all licensing records below: Application publication date 20240220; Granted publication date 20240416; Assignor: Nanjing University of Posts and Telecommunications; Denomination of invention: A single camera 3D object detection method and system for autonomous driving; License type: Common License.

Assignee | Contract record no. | Record date
Nanjing Benli Information Technology Co.,Ltd. | X2024980016890 | 20240927
Nanjing Eryuefei Network Technology Co.,Ltd. | X2024980016831 | 20240927
Nanjing Zhujin Intelligent Technology Co.,Ltd. | X2024980017765 | 20241010
Nanjing Shangyao Electronic Technology Co.,Ltd. | X2024980017764 | 20241010
Nanjing Qida Network Technology Co.,Ltd. | X2024980017763 | 20241010
Nanjing Donglai Information Technology Co.,Ltd. | X2024980017666 | 20241009
Nanjing Zijin Information Technology Co.,Ltd. | X2024980017766 | 20241010
Nanjing Yuanshen Intelligent Technology R&D Co.,Ltd. | X2024980018301 | 20241012
Nanjing Yuze Robot Technology Co.,Ltd. | X2024980018300 | 20241012
Nanjing Zhongyang Information Technology Co.,Ltd. | X2024980018299 | 20241012
Nanjing Fangtai Intelligent Technology Co.,Ltd. | X2024980018298 | 20241012
Nanjing Gaoxi Information Technology Co.,Ltd. | X2024980018297 | 20241012
Nanjing Fuliang Network Technology Co.,Ltd. | X2024980018296 | 20241012
Nanjing Yixun Intelligent Equipment Co.,Ltd. | X2024980018292 | 20241012
Nanjing Yihe Information Technology Co.,Ltd. | X2024980018291 | 20241012
Nanjing Xingzhuo Intelligent Equipment Co.,Ltd. | X2024980018289 | 20241012
Nanjing Tichi Information Technology Co.,Ltd. | X2024980018288 | 20241012
Nanjing Jindong Technology Co.,Ltd. | X2024980018286 | 20241012
Nanjing Jinsheng Artificial Intelligence Technology Co.,Ltd. | X2024980018283 | 20241012
Nanjing Jingda Environmental Protection Technology Co.,Ltd. | X2024980018281 | 20241012
Nanjing Hancong Robot Technology Co.,Ltd. | X2024980018278 | 20241012
Jiangsu Huida Information Technology Industry Development Research Institute Co.,Ltd. | X2024980018270 | 20241012
Nanjing Extreme New Materials Research Co.,Ltd. | X2024980018268 | 20241012
Nanjing Youqi Intelligent Technology Co.,Ltd. | X2024980018261 | 20241012
Nanjing Haohang Intelligent Technology Co.,Ltd. | X2024980018249 | 20241012
Nanjing Pengjia Robot Technology Co.,Ltd. | X2024980018246 | 20241012
Nanjing Nuoyan Intelligent Technology Co.,Ltd. | X2024980018241 | 20241012
Nanjing Junshang Network Technology Co.,Ltd. | X2024980018234 | 20241012