CN113436239A - Monocular image three-dimensional target detection method based on depth information estimation - Google Patents
Monocular image three-dimensional target detection method based on depth information estimation
- Publication number: CN113436239A
- Application number: CN202110541790.4A
- Authority
- CN
- China
- Prior art keywords
- dimensional
- estimation
- monocular image
- depth information
- target
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06T7/50: Image analysis; depth or shape recovery
- G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06N3/045: Neural networks; combinations of networks
- G06N3/048: Neural networks; activation functions
- G06N3/08: Neural networks; learning methods
- G06T2207/10004: Still image; photographic image
- G06T2207/10012: Stereo images
- G06T2207/10028: Range image; depth image; 3D point clouds
- G06T2207/20081: Training; learning
Abstract
The invention provides a monocular image three-dimensional target detection method based on depth information estimation, which needs only a monocular image as input and uses a Faster R-CNN network model for end-to-end training and prediction, completing the three-dimensional detection task for road targets. The framework of the monocular image three-dimensional target detection method can be roughly divided into three parts: a candidate region proposal part, a depth estimation branch network, and a parameter estimation and prediction part. The beneficial effects of the invention are: through multi-task training and learning, the utilization efficiency of computing resources is improved; the loss of the target's position information in the convolutional-neural-network feature-extraction stage is remedied; and the final detection precision is improved.
Description
Technical Field
The invention relates to the technical field of image processing, in particular to a monocular image three-dimensional target detection method based on depth information estimation.
Background
The task of target detection is to identify all objects of interest in a given image and determine their category and location; it can be applied to a wide variety of scenes. Three-dimensional target detection generally addresses detection tasks in which vehicles are the main objects in an autonomous-driving environment, and provides three-dimensional detection results for vehicle targets and the like, including the target's class, two-dimensional detection frame, and three-dimensional detection frame. One key point in research on the active safety of autonomous vehicles is road environment perception technology, and accurate detection of road targets is its core. Only by completing the road target detection task well can the accuracy and timeliness of the vehicle's perception of the road environment be guaranteed, so as to accurately guide the decision control of the intelligent vehicle and ensure the safety of automatic driving.
The problems of traditional target detection methods fall mainly into two aspects: first, the sliding-window region-selection strategy is untargeted, has high complexity, and produces redundant windows; second, manually designed features are not robust to the appearance variations of diverse targets. Thus, with the development of artificial neural networks and deep learning, mainstream target detection methods have shifted to approaches based on deep convolutional neural networks.
At present, target detection methods for road environments can be classified, according to their input data, into lidar-based methods and image-based methods. The image-based approach has more practical application value because it uses plane images acquired by an ordinary camera. It is currently more mature and more widely applied in two-dimensional target detection, where it addresses target classification and two-dimensional bounding-box tasks; however, it cannot acquire three-dimensional information about the target to be detected and therefore cannot meet the requirement of autonomous vehicles for real-world three-dimensional information. In view of this shortfall of two-dimensional detection results and the high cost of lidar-based methods, research on monocular-image-based three-dimensional target detection has been promoted in recent years.
In the prior art, the three-dimensional target detection task in road environments is completed by directly using or indirectly combining lidar point cloud data; purely image-based methods are rare, as most approaches exploit the advantages of point cloud data in the current three-dimensional target detection field to some degree, and the high cost of lidar limits mass deployment of such methods on actual autonomous vehicles. Meanwhile, image-based methods often fail to make full use of the target's position information in the image; its loss in the convolutional-neural-network feature-extraction stage affects the final estimation and prediction of the target position. Furthermore, some existing methods require large amounts of training time, storage space, and computing resources.
Compared with the above two approaches, three-dimensional target detection based on monocular images started late. The invention provides a monocular image three-dimensional target detection method based on depth information estimation, which mainly solves the following problems:
(1) Existing monocular-image-based three-dimensional target detection methods generally have low detection precision. To address this, a monocular image three-dimensional target detection method based on depth information estimation is provided, improving the precision of the detection results to a certain extent compared with existing methods.
(2) When a plane image is processed by a convolutional neural network, some of the target's position information in the image is lost in the feature-extraction stage that produces the feature map, which strongly affects the final estimation and prediction of the target position. Existing methods often fail to fully utilize the target's position information in the image; the invention introduces a depth information estimation branch to alleviate this problem.
(3) Existing methods require large amounts of training time, storage space, and computing resources. To address this, the target detection network model provided by the invention can perform end-to-end joint training, and multi-task training and learning improve the utilization efficiency of computing resources.
Disclosure of Invention
In view of the above problems and defects, the invention provides a monocular image three-dimensional target detection method based on depth information estimation that completes the three-dimensional target detection task with only a monocular image as input, improving the precision of the detection results to a certain extent compared with existing methods. A depth information estimation branch is introduced to make full use of the target's position information in the image, alleviating the loss of that position information in the convolutional-neural-network feature-extraction stage and improving the final detection precision. In addition, to address the need for large amounts of training time, storage space, and computing resources, the proposed target detection neural network model can perform end-to-end joint training, and multi-task training and learning improve the utilization efficiency of computing resources.
The invention provides a monocular image three-dimensional target detection method based on depth information estimation, which mainly comprises the following steps:
s1: inputting the acquired monocular image, and obtaining candidate regions of the target by using a Faster R-CNN network model and its Region Proposal Network (RPN);
s2: a MonoDepth algorithm is used for constructing a depth information estimation branch network, the monocular image is input into the depth information estimation branch network, parallax information is output, then depth information is obtained, and a point cloud is constructed by obtaining three-dimensional coordinate information of each pixel point in the image, so that a corresponding area is obtained;
s3: pooling is carried out on the candidate region in the step S1 and the corresponding region in the step S2, then the features obtained after pooling are fused, estimation and prediction of each parameter of the target are carried out on the fused features by using a convolutional neural network, and the monocular image three-dimensional target detection process is completed after the prediction is finished.
Further, the process of obtaining the candidate regions of the target is as follows: the region proposal network generates a series of proposed regions containing targets from the convolutional feature map through an anchor mechanism, generating two-dimensional anchors with preset scales and aspect ratios in each rectangular region, and then outputs the final candidate regions through objectness score prediction and two-dimensional bounding-box regression.
Further, the coordinates of a pixel point in three-dimensional space under the camera coordinate system are obtained through the following formulas:

z = f * C_b / I_d
x = (I_x - C_x) * z / f
y = (I_y - C_y) * z / f

wherein (I_x, I_y) are the coordinates of the pixel point in the monocular image, I_d is the predicted parallax, f is the camera focal length, C_b is the baseline distance of the binocular camera, and (C_x, C_y) are the coordinates of the image principal point;
by the above method, the three-dimensional coordinate information of each pixel point in the image is acquired, a point cloud is constructed for the whole scene from this information, and the predicted point cloud is encoded into a three-channel-input corresponding region.
Further, the candidate area of step S1 is subjected to the maximum pooling process.
Further, the average pooling process is performed on the corresponding area of step S2.
Further, after the pooled candidate region and corresponding region have been brought to a consistent size, fusion processing is performed, with the corresponding region concatenated directly after the candidate region.
Further, the estimation and prediction of each parameter specifically includes a category and two-dimensional detection box, scale estimation, direction estimation and three-dimensional position estimation.
The technical scheme provided by the invention has the beneficial effects that:
1. the three-dimensional target detection task is completed only by means of the input monocular image, and the more mainstream laser radar point cloud data is not needed.
2. By introducing depth information estimation aiming at the position information of the target in the image, which is easy to lose through the convolutional neural network, the position information of the target in the image can be fully utilized, and the result detection precision is improved.
3. The whole target detection network model can carry out end-to-end joint training, the utilization efficiency of computing resources is improved through multi-task training learning, and the problem that a large amount of training time, storage space and computing resources are needed is improved to a certain extent.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
fig. 1 is a frame diagram of a monocular image three-dimensional target detection method based on depth information estimation in an embodiment of the present invention.
FIG. 2 illustrates the Faster R-CNN network model and Region Proposal Network (RPN) according to an embodiment of the present invention.
Detailed Description
For a more clear understanding of the technical features, objects and effects of the present invention, embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
The embodiment of the invention provides a monocular image three-dimensional target detection method based on depth information estimation: only a monocular image needs to be input, and the network performs end-to-end training and prediction to complete the three-dimensional detection task for road targets. Overall, the method is a detection model based on machine learning / deep learning, and the model framework can be roughly divided into three parts: a candidate region proposal part, a depth estimation branch network, and a parameter estimation and prediction part. The output of the detection model is the estimation and prediction of each parameter, and the detection model is trained with a training set.
Referring to fig. 1-2: fig. 1 is a frame diagram of the monocular image three-dimensional target detection method based on depth information estimation according to an embodiment of the present invention, and fig. 2 shows the Faster R-CNN network model and Region Proposal Network (RPN) according to an embodiment of the present invention. The monocular image three-dimensional target detection method specifically includes the following steps:
s1, candidate area proposition
The Faster R-CNN network model from two-dimensional target detection is selected, and its Region Proposal Network (RPN) is used to obtain candidate regions of the target; the class and two-dimensional bounding box of the target are predicted directly from the extracted RoI features and are output together with the other parameters to be predicted subsequently.
The RPN generates a series of proposed regions containing targets from the convolutional feature map through an anchor mechanism, generating two-dimensional anchors with preset scales and aspect ratios in each rectangular region. The network then outputs the final candidate regions through objectness score prediction and two-dimensional bounding-box regression.
The Faster R-CNN network model is shown in FIG. 2, and the overall two-dimensional target detection framework flow is roughly as follows:
a. inputting the whole picture into a CNN for feature extraction;
b. generating proposed regions using the Region Proposal Network (RPN), about 300 proposed regions per picture;
c. projecting the proposed regions onto the last convolutional feature map of the convolutional neural network;
d. generating a fixed-size feature map for each proposed region through the RoI pooling layer;
e. jointly training and predicting the classification probability and bounding-box regression using a loss function.
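As an illustrative sketch (not code from the patent), the preset-scale-and-aspect-ratio anchors of step b can be generated as follows; the base size, scale, and ratio values, and the helper name `generate_anchors`, are assumptions for illustration:

```python
import numpy as np

def generate_anchors(base_size=16, scales=(8, 16, 32), ratios=(0.5, 1.0, 2.0)):
    """Generate two-dimensional anchors with preset scales and aspect ratios,
    centered on one base_size x base_size feature-map cell (values assumed)."""
    anchors = []
    for scale in scales:
        for ratio in ratios:
            # Keep the anchor area fixed for a given scale while varying shape;
            # ratio is interpreted here as height / width.
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)
            h = w * ratio
            cx = cy = base_size / 2.0
            anchors.append([cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2])
    return np.array(anchors)

anchors = generate_anchors()  # 3 scales x 3 ratios = 9 anchors per location
```

In a full RPN these anchors are tiled over every feature-map location and scored for objectness before bounding-box regression.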
S2. depth estimation branch network
Different from the estimation of the other three-dimensional information, acquiring the target's three-dimensional position is difficult, and the depth estimation branch network is designed mainly to improve the detection accuracy of the target's three-dimensional position.

In the target detection network model of this method, the position information of the target in the image is difficult to utilize because of the RoI pooling in the convolutional neural network. First, RoI pooling transforms candidate regions of different sizes into features of a uniform size, ignoring projection cues such as the size of the same target in the camera coordinate system. Second, RoI pooling extracts only the candidate-region part of the global feature map, so each target is predicted separately, losing the targets' relative position information and their positional relationship within the whole image.
To address the impact of RoI pooling on position prediction, monocular depth estimation is introduced. By introducing depth estimation, the coordinates of each pixel point of the monocular image in the camera coordinate system can be obtained, yielding a three-channel map with the same width and height as the input image. When a two-dimensional candidate region is extracted, the corresponding region of this three-channel map can be extracted at the same time, so that the three-dimensional position information of each point can be fully utilized. The result of the depth information estimation can be regarded as a global prior.
(1) Depth information estimation algorithm
The MonoDepth algorithm is used to construct the depth information estimation branch network: binocular parallax is predicted by an unsupervised method based on left-right image consistency, and the depth information is then determined from it. Although the network model is trained with binocular images, the final trained model needs only a single-side monocular image as input to predict parallax information and obtain the required depth information.
Suppose the coordinates of a pixel point in the image are (I_x, I_y) and the predicted parallax is I_d; then the coordinates of the pixel point in three-dimensional space under the camera coordinate system are:

z = f * C_b / I_d
x = (I_x - C_x) * z / f
y = (I_y - C_y) * z / f

wherein f is the focal length of the camera, C_b is the baseline distance of the binocular camera, and (C_x, C_y) are the principal point coordinates.
By the above method, the three-dimensional coordinate information of each pixel point in the image can be obtained, a point cloud can be constructed for the whole scene from these pixel coordinates, and the estimated point cloud is encoded into a three-channel input map; the corresponding region extracted from it has the same size as the candidate region, which facilitates subsequent operations such as fusing the two sets of features.
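The per-pixel triangulation described above can be sketched as follows. This is a minimal illustration assuming a dense disparity map; the helper name `disparity_to_xyz` and the zero-disparity guard are not from the patent:

```python
import numpy as np

def disparity_to_xyz(disparity, f, Cb, Cx, Cy):
    """Turn a predicted disparity map into a three-channel map of (x, y, z)
    camera-frame coordinates, one 3-D point per pixel.
    f: focal length in pixels, Cb: stereo baseline, (Cx, Cy): principal point."""
    h, w = disparity.shape
    Iy, Ix = np.mgrid[0:h, 0:w]                 # per-pixel image coordinates
    z = f * Cb / np.maximum(disparity, 1e-6)    # guard against zero disparity
    x = (Ix - Cx) * z / f
    y = (Iy - Cy) * z / f
    return np.stack([x, y, z], axis=-1)         # shape (H, W, 3)

# With a constant disparity of 2 px, f = 100 px, baseline 0.5, depth z = 25.
xyz = disparity_to_xyz(np.full((4, 4), 2.0), f=100.0, Cb=0.5, Cx=1.5, Cy=1.5)
```

The resulting (H, W, 3) map is the three-channel input whose regions are cropped alongside the two-dimensional candidate regions.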
(2) Feature fusion
In actual operation, different pooling operations need to be applied to the candidate region of step S1 and the corresponding region of step S2. The pooling applied to the candidate region is the usual RoI max pooling: for each sub-region, max pooling selects the maximum element, which for a candidate-region feature map corresponds to the strongest response of the neural network, so max pooling is a reasonable choice there. However, each value in the coordinate-value feature map represents the absolute coordinate of a pixel point in the three-dimensional coordinate system; taking the maximum would alter this information and introduce errors in the result. Therefore, RoI average pooling is selected for the coordinate-value feature map of step S2: instead of taking the maximum, the elements of each sub-region are averaged, which retains sufficient background information from the image.
After the region features on both sides have been extracted and brought to a consistent size, the fusion operation can be performed. Because the sizes are consistent, the two sets of features are directly concatenated: in the detection network, the three coordinate-value feature maps are connected in series directly after the candidate-region feature maps, increasing the overall depth by three channels. Compared with the original candidate-region features, the features fused with the coordinate values carry additional three-dimensional position information about the target, providing more information and improving the characterization capability of the features. Meanwhile, the fused region features can be used not only for estimating the three-dimensional position of the target but also in the estimation branches of the other parameters, so the detection precision of each parameter can be improved to a certain extent.
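A minimal numpy sketch of the two pooling choices and the series fusion; the channel counts and the `pool_region` helper are illustrative assumptions, not the patent's implementation:

```python
import numpy as np

def pool_region(region, out_size=2, mode="max"):
    """Pool a (C, H, W) region into a (C, out_size, out_size) grid:
    max over each sub-region for appearance features, mean for coordinate maps."""
    c, h, w = region.shape
    out = np.zeros((c, out_size, out_size))
    hs, ws = h // out_size, w // out_size
    for i in range(out_size):
        for j in range(out_size):
            sub = region[:, i * hs:(i + 1) * hs, j * ws:(j + 1) * ws]
            out[:, i, j] = sub.max(axis=(1, 2)) if mode == "max" else sub.mean(axis=(1, 2))
    return out

# Max-pool the candidate-region features, average-pool the coordinate map,
# then concatenate along the channel dimension (series connection): 256 + 3 channels.
feat = pool_region(np.random.rand(256, 8, 8), mode="max")
coords = pool_region(np.random.rand(3, 8, 8), mode="avg")
fused = np.concatenate([feat, coords], axis=0)
```

Averaging the coordinate map keeps the pooled values interpretable as positions, which is the rationale stated above for not max-pooling it.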
S3, parameter estimation and prediction part
After the fused features are obtained, a convolutional neural network is used to estimate and predict each parameter of the target.
(1) Class and two-dimensional detection frame
Because the basic target detection network used is Faster R-CNN, these two detection outputs are obtained accurately by the convolutional neural network of step S1 through the standard two-dimensional target detection pipeline, and are output together with the other three-dimensional parameter predictions in the overall detection framework.
(2) Scale estimation
Because targets of the same class are similar in size, while direct estimation from the region features produced by the neural network cannot guarantee precision, the method does not regress these parameters directly but instead regresses the difference between the relevant data parameters and reference parameters. The reference parameters are obtained by clustering the same class of data in the training set, and their values are not fixed. The detection model outputs a difference value, which is combined with the reference parameters to complete the scale estimation. The loss function of this estimation branch is as follows:

L_d = SL1(P_d - (P*_d - D_t))

wherein P_d represents the predicted size residual, P*_d represents the true size in the label, D_t represents the trained reference dimensions, and SL1() represents the SmoothL1 function.
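The residual scale regression can be illustrated as follows; the reference dimensions and the Smooth L1 form are a sketch under assumed values, not the patent's exact implementation:

```python
import numpy as np

def smooth_l1(x):
    """Elementwise Smooth L1 (quadratic below 1, linear above), summed."""
    ax = np.abs(x)
    return float(np.sum(np.where(ax < 1.0, 0.5 * x ** 2, ax - 0.5)))

def scale_loss(pred_residual, true_size, ref_size):
    """The branch regresses the offset between the true size and the
    per-class reference size rather than the absolute dimensions."""
    return smooth_l1(pred_residual - (true_size - ref_size))

pred = np.array([0.1, -0.05, 0.2])      # predicted residual
ref = np.array([1.6, 1.5, 3.9])         # clustered per-class reference (assumed values)
size_est = ref + pred                   # final size estimate adds the reference back
loss = scale_loss(pred, np.array([1.65, 1.48, 4.05]), ref)
```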
(3) Direction estimation
For the estimation of the target direction, only the yaw angle is considered, because the roll and pitch angles are almost zero under ideal conditions. However, within the same camera coordinate system, the target's yaw angle has no direct correspondence with the target's appearance in the image, so the yaw angle cannot be obtained by direct regression. Considering that the local observation angle of the target (the angle between the target's global direction and the ray from the camera center through the target center) is consistent with changes in the target's appearance in the image, the yaw angle and local observation angle satisfy the following relationship:
yaw=α+arctan2(cx,cy)
where yaw represents the yaw angle, α represents the local observation angle, and (cx, cy) represents the coordinates of the target center point. Therefore, direction estimation is completed by taking the local observation angle predicted by the detection model and calculating the yaw angle with the above formula.
Directly regressing the local angle with a neural network yields low precision, so the invention uses the MultiBin method to regress the angle. The whole angular space is first divided into n regions, denoted bins, each covering one interval. For each local angle, classification is performed first to determine the bin it belongs to; after the class is determined, the residual Δθ between the local angle and the center of that bin is computed. Prediction of the local observation angle is thus split into two parts: a classification problem of which bin the angle belongs to, and a regression problem on the residual Δθ from that bin's center. The two loss functions are as follows:
The loss function of the first part, the classification problem, is:

L_conf = CE(σ(P_conf), c*)

wherein c* represents the class of the bin to which the true α angle belongs, P_conf is the prediction of the fully-connected network, σ() represents the Sigmoid function, and CE is the cross entropy.

The loss function of the second part, the residual regression problem, is:

L_reg = (1/n) Σ SL1(P_reg - (cos Δα, sin Δα)*)

wherein (cos Δα, sin Δα)* represents the true residual vector, P_reg is the corresponding prediction, SL1() represents the SmoothL1 function, and n represents the number of bins to which α belongs.
The loss function of the entire local observation angle α is as follows, where ω determines the relative weight of the two parts:

L_α = L_conf + ω * L_reg
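A small sketch of the MultiBin targets and the yaw recovery; the number of bins is an assumed value, `multibin_targets` is a hypothetical helper, and the yaw relation follows the formula given above:

```python
import numpy as np

def multibin_targets(alpha, n_bins=4):
    """Assign a local observation angle alpha (radians, in [-pi, pi)) to a bin
    and return (bin index, (cos, sin) of the residual to the bin centre)."""
    bin_width = 2 * np.pi / n_bins
    idx = int((alpha + np.pi) // bin_width) % n_bins
    center = -np.pi + (idx + 0.5) * bin_width   # centre of the chosen bin
    delta = alpha - center                      # residual the network regresses
    return idx, np.array([np.cos(delta), np.sin(delta)])

def yaw_from_alpha(alpha, cx, cy):
    """Recover the global yaw from the local observation angle and the
    target centre point, per yaw = alpha + arctan2(cx, cy)."""
    return alpha + np.arctan2(cx, cy)

idx, residual = multibin_targets(0.0)
yaw = yaw_from_alpha(0.5, 1.0, 1.0)
```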
(4) Three-dimensional position estimation

For the estimation of the three-dimensional position of the target, the design mainly introduces the depth information estimation branch network, which has already been set forth in step S2.
(5) Multitask learning
In order to optimize the detection model framework of the whole monocular image three-dimensional target detection method based on depth information estimation, joint training is adopted: on the basis of Faster R-CNN, the whole network is trained end to end. The multiple prediction branches share the weights of the convolutional neural network, and each branch corresponds to different parameters and their loss functions. The total loss function of the whole network is:

L = w_2d * L_2d + w_d * L_d + w_α * L_α + w_loc * L_loc

wherein L_2d, L_d, L_α, and L_loc respectively represent the loss functions of the target's two-dimensional detection frame, scale, direction, and three-dimensional position, and each coefficient w determines the weight of the corresponding part.
L is the overall loss function of the detection model of the method. By adjusting the weight of each part and determining the allowable range of the loss function according to actual requirements, the final detection model is obtained; monocular image three-dimensional target detection based on depth information estimation is thereby completed, and the detection result is output.
In the whole monocular image three-dimensional target detection method based on depth information estimation, the key points are as follows:
1. On the basis of a target detection network model, depth information estimation is introduced to complete the three-dimensional target detection task from a monocular image, making full use of the position and depth information of the target in the image.
2. Through multi-task learning, the whole target detection network model can be jointly trained end to end, which improves both the utilization efficiency of computing resources and the overall detection precision.
Compared with the prior art, the invention has the beneficial effects that:
1. The three-dimensional target detection task is completed relying only on the input monocular image, without the laser radar point cloud data used by most mainstream methods.
2. By introducing depth information estimation for the position information of the target in the image, which is easily lost when features pass through the convolutional neural network, this position information can be fully utilized and the detection precision of the results improved.
3. The whole target detection network model can be jointly trained end to end; multi-task training improves the utilization efficiency of computing resources and alleviates, to a certain extent, the heavy demand for training time, storage space and computing resources.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (7)
1. A monocular image three-dimensional target detection method based on depth information estimation is characterized in that: the method comprises the following steps:
s1: inputting the acquired monocular image, and obtaining a target candidate area by using a Faster R-CNN network model and an area proposal network thereof;
s2: a MonoDepth algorithm is used for constructing a depth information estimation branch network, the monocular image is input into the depth information estimation branch network, parallax information is output, then depth information is obtained, and a point cloud is constructed by obtaining three-dimensional coordinate information of each pixel point in the image, so that a corresponding area is obtained;
s3: pooling is carried out on the candidate region in the step S1 and the corresponding region in the step S2, then the features obtained after pooling are fused, estimation and prediction of each parameter of the target are carried out on the fused features by using a convolutional neural network, and the monocular image three-dimensional target detection process is completed after the prediction is finished.
2. The method for detecting the monocular image three-dimensional target based on depth information estimation as claimed in claim 1, wherein: in step S1, the process of obtaining the candidate region of the target is: the region proposal network generates a series of proposal regions containing targets through a convolution feature map and an anchor mechanism, generates two-dimensional anchors with preset scales and aspect ratios in each rectangular region, and then outputs the final candidate regions through objectness score prediction and two-dimensional bounding box regression.
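The anchor generation described in claim 2 can be sketched as follows; the base size, scales and aspect ratios are illustrative RPN-style defaults, not values stated in the patent:

```python
import numpy as np

def generate_anchors(base_size=16, ratios=(0.5, 1.0, 2.0), scales=(8, 16, 32)):
    # Generate two-dimensional anchors (x1, y1, x2, y2) centred at the origin,
    # one per (scale, ratio) pair, keeping the anchor area fixed per scale
    # while varying the aspect ratio h/w.
    anchors = []
    for scale in scales:
        for ratio in ratios:
            area = (base_size * scale) ** 2
            w = np.sqrt(area / ratio)
            h = w * ratio
            anchors.append([-w / 2, -h / 2, w / 2, h / 2])
    return np.array(anchors)
```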
3. The method for detecting the monocular image three-dimensional target based on depth information estimation as claimed in claim 1, wherein: in step S2, the coordinates (x, y, z) of a certain pixel point in three-dimensional space under the camera coordinate system are obtained by the following formulas:
z = f * Cb / Id,  x = (Ix − Cx) * z / f,  y = (Iy − Cy) * z / f
wherein (Ix, Iy) are the coordinates of the pixel point in the monocular image, Id is the predicted parallax, f is the camera focal length, Cb is the baseline distance of the binocular camera, and (Cx, Cy) are the coordinates of the image principal point;
in this way, the three-dimensional coordinate information of each pixel point in the image is obtained, a point cloud is constructed for the whole scene according to this coordinate information, and the predicted point cloud is encoded as a three-channel input to form the corresponding area.
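The back-projection in claim 3 follows the standard stereo relation between disparity and depth; a sketch, with illustrative names:

```python
def pixel_to_camera(ix, iy, disparity, f, cb, cx, cy):
    # Standard stereo back-projection: depth z = f * Cb / Id, then the pinhole
    # model gives x = (Ix - Cx) * z / f and y = (Iy - Cy) * z / f.
    z = f * cb / disparity
    x = (ix - cx) * z / f
    y = (iy - cy) * z / f
    return x, y, z
```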
4. The method for detecting the monocular image three-dimensional target based on depth information estimation as claimed in claim 1, wherein: in step S3, the candidate region of step S1 is subjected to maximum pooling.
5. The method for detecting the monocular image three-dimensional target based on depth information estimation as claimed in claim 1, wherein: in step S3, the average pooling process is performed on the corresponding area in step S2.
6. The method for detecting the monocular image three-dimensional target based on depth information estimation as claimed in claim 1, wherein: in step S3, the sizes of the pooled candidate region and the corresponding region are matched, and then fusion is performed by directly concatenating the corresponding region features after the candidate region features.
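A sketch of the pooling-and-fusion step of claims 4–6, assuming (C, H, W) feature maps whose spatial size divides evenly into the output grid; the function names and output size are illustrative:

```python
import numpy as np

def block_pool(feat, out_h, out_w, mode="max"):
    # Pool a (C, H, W) feature map to (C, out_h, out_w) over regular blocks;
    # assumes H and W are divisible by out_h and out_w.
    c, h, w = feat.shape
    blocks = feat.reshape(c, out_h, h // out_h, out_w, w // out_w)
    if mode == "max":
        return blocks.max(axis=(2, 4))
    return blocks.mean(axis=(2, 4))

def fuse_regions(candidate_feat, depth_feat, out_h=7, out_w=7):
    # Max-pool the candidate-region features (claim 4), average-pool the
    # corresponding depth-region features (claim 5), then concatenate the
    # depth features after the candidate features along the channel axis
    # (claim 6).
    a = block_pool(candidate_feat, out_h, out_w, mode="max")
    b = block_pool(depth_feat, out_h, out_w, mode="mean")
    return np.concatenate([a, b], axis=0)
```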
7. The method for detecting the monocular image three-dimensional target based on depth information estimation as claimed in claim 1, wherein: in step S3, the estimation and prediction of each parameter specifically includes a category and two-dimensional detection box, scale estimation, direction estimation, and three-dimensional position estimation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110541790.4A CN113436239A (en) | 2021-05-18 | 2021-05-18 | Monocular image three-dimensional target detection method based on depth information estimation |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113436239A true CN113436239A (en) | 2021-09-24 |
Family
ID=77803321
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114723827A (en) * | 2022-04-28 | 2022-07-08 | 哈尔滨理工大学 | Grabbing robot target positioning system based on deep learning |
TWI803328B (en) * | 2022-05-24 | 2023-05-21 | 鴻海精密工業股份有限公司 | Depth image generation method, system, electronic equipment and readable storage media |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110689008A (en) * | 2019-09-17 | 2020-01-14 | 大连理工大学 | Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction |
CN111815665A (en) * | 2020-07-10 | 2020-10-23 | 电子科技大学 | Single image crowd counting method based on depth information and scale perception information |
Non-Patent Citations (1)
Title |
---|
Xu Bin: "Research on Three-Dimensional Object Detection Based on Monocular Images", China Master's Theses Full-Text Database, Information Science and Technology (Monthly) *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021233029A1 (en) | Simultaneous localization and mapping method, device, system and storage medium | |
CN111583369B (en) | Laser SLAM method based on facial line angular point feature extraction | |
US20210142095A1 (en) | Image disparity estimation | |
CN111429514A (en) | Laser radar 3D real-time target detection method fusing multi-frame time sequence point clouds | |
CN110688905B (en) | Three-dimensional object detection and tracking method based on key frame | |
CN112258618A (en) | Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map | |
Berrio et al. | Camera-LIDAR integration: Probabilistic sensor fusion for semantic mapping | |
CN111201451A (en) | Method and device for detecting object in scene based on laser data and radar data of scene | |
CN112541460B (en) | Vehicle re-identification method and system | |
CN115049821A (en) | Three-dimensional environment target detection method based on multi-sensor fusion | |
CN113436239A (en) | Monocular image three-dimensional target detection method based on depth information estimation | |
CN117274749B (en) | Fused 3D target detection method based on 4D millimeter wave radar and image | |
Cui et al. | Dense depth-map estimation based on fusion of event camera and sparse LiDAR | |
CN113989758A (en) | Anchor guide 3D target detection method and device for automatic driving | |
CN113920254B (en) | Monocular RGB (Red Green blue) -based indoor three-dimensional reconstruction method and system thereof | |
CN115909268A (en) | Dynamic obstacle detection method and device | |
CN117576665B (en) | Automatic driving-oriented single-camera three-dimensional target detection method and system | |
CN114608522A (en) | Vision-based obstacle identification and distance measurement method | |
CN112037282B (en) | Aircraft attitude estimation method and system based on key points and skeleton | |
CN115965961B (en) | Local-global multi-mode fusion method, system, equipment and storage medium | |
CN112950786A (en) | Vehicle three-dimensional reconstruction method based on neural network | |
CN117237411A (en) | Pedestrian multi-target tracking method based on deep learning | |
CN113569803A (en) | Multi-mode data fusion lane target detection method and system based on multi-scale convolution | |
CN114140659A (en) | Social distance monitoring method based on human body detection under view angle of unmanned aerial vehicle | |
CN117523428B (en) | Ground target detection method and device based on aircraft platform |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication |
Application publication date: 20210924 |