CN113436239A - Monocular image three-dimensional target detection method based on depth information estimation - Google Patents

Monocular image three-dimensional target detection method based on depth information estimation

Info

Publication number
CN113436239A
Authority
CN
China
Prior art keywords
dimensional
estimation
monocular image
depth information
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110541790.4A
Other languages
Chinese (zh)
Inventor
叶青松
刘玮
马云
段帅东
高明强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China University of Geosciences
Original Assignee
China University of Geosciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China University of Geosciences filed Critical China University of Geosciences
Priority to CN202110541790.4A priority Critical patent/CN113436239A/en
Publication of CN113436239A publication Critical patent/CN113436239A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10004 Still image; Photographic image
    • G06T2207/10012 Stereo images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/10 Image acquisition modality
    • G06T2207/10028 Range image; Depth image; 3D point clouds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides a monocular image three-dimensional target detection method based on depth information estimation, which requires only a monocular image as input and uses a Faster R-CNN network model to carry out end-to-end training and prediction and complete the three-dimensional detection task for road targets. The framework of the whole monocular image three-dimensional target detection method can be roughly divided into three parts: a candidate region proposal part, a depth estimation branch network and a parameter estimation and prediction part. The invention has the beneficial effects that: through multi-task training and learning, the utilization efficiency of computing resources is improved, the problem that the target loses position information after the convolutional neural network feature extraction stage is alleviated, and the final detection precision is improved.

Description

Monocular image three-dimensional target detection method based on depth information estimation
Technical Field
The invention relates to the technical field of image processing, in particular to a monocular image three-dimensional target detection method based on depth information estimation.
Background
The task of object detection is to identify all objects of interest in a given image and determine their category and location; it can be applied to a wide variety of scenes. Generally, three-dimensional object detection mainly addresses detection tasks in autonomous driving environments in which vehicles are the main objects, and provides three-dimensional detection results for vehicle targets, including the object category, a two-dimensional detection frame, a three-dimensional detection frame and the like. One of the key research topics in active safety technology for autonomous vehicles is road environment perception, of which accurate road target detection is the core. Therefore, only by completing the road target detection task well can the accuracy and timeliness of the vehicle's perception of the road environment be guaranteed, so that the decision control of the intelligent vehicle is accurately guided and the safety of automatic driving is ensured.
The problems of traditional target detection methods mainly lie in two aspects: first, the region selection strategy based on a sliding window is untargeted, has high complexity and produces redundant windows; second, manually designed features are not robust to diverse variations in appearance. Thus, with the development of artificial neural networks and deep learning, mainstream target detection methods are now based on deep convolutional neural networks.
At present, target detection methods for road environments can be divided, according to the input data, into methods based on lidar and methods based on images. The image-based methods have greater practical value because they use plane images acquired by ordinary cameras. However, such methods are currently more mature and more widely applied in two-dimensional target detection: they focus on target classification and two-dimensional bounding, cannot acquire three-dimensional information about the target to be detected, and therefore cannot meet the demand of autonomous vehicles for real-world three-dimensional information. In view of this shortfall of two-dimensional detection results with respect to three-dimensional information and the high cost of lidar-based methods, research on three-dimensional target detection based on monocular images has been promoted in recent years.
In the prior art, the three-dimensional target detection task in road environments is completed by directly using or indirectly combining lidar point cloud data; even the image-based methods more or less rely on the advantages of point cloud data in the current three-dimensional target detection field, and the high cost of lidar limits the mass deployment of such methods on actual autonomous vehicles. Meanwhile, image-based methods often fail to fully utilize the position information of the target in the image, and the loss of this information in the convolutional neural network feature extraction stage affects the final estimation and prediction of the target position. Furthermore, some existing methods require a relatively large amount of training time, memory space and computational resources.
Compared with these two classes of methods, three-dimensional target detection based on monocular images started relatively late. The invention provides a monocular image three-dimensional target detection method based on depth information estimation, which mainly solves the following problems:
(1) the existing monocular image-based three-dimensional target detection method is generally low in detection precision, and a monocular image three-dimensional target detection method based on depth information estimation is provided for solving the problem, so that the detection result precision is improved to a certain extent compared with the existing method.
(2) After a plane image is processed by a convolutional neural network, a considerable amount of target position information in the image is lost in the stage of obtaining the feature map through feature extraction, which greatly affects the estimation and prediction of the final target position. Existing methods often fail to fully utilize the position information of the target in the image, and the invention introduces a depth information estimation branch to improve on and solve this problem.
(3) Existing methods often require a large amount of training time, storage space and computing resources. To address this problem, the target detection network model provided by the invention can perform end-to-end joint training, and the utilization efficiency of computing resources is improved through multi-task training and learning.
Disclosure of Invention
Aiming at the problems or the defects, the invention provides a monocular image three-dimensional target detection method based on depth information estimation, only a monocular image is input to complete a three-dimensional target detection task, and compared with the existing method, the detection result precision is improved to a certain extent. A depth information estimation branch is introduced to fully utilize the position information of the target in the image, so that the problem that the target loses the position information after passing through a convolutional neural network feature extraction link is solved, and the final detection precision is improved. In addition, aiming at the problems of needing a large amount of training time, storage space and computing resources, the target detection neural network model provided by the invention can carry out end-to-end joint training, and the utilization efficiency of the computing resources is improved through multi-task training and learning.
The invention provides a monocular image three-dimensional target detection method based on depth information estimation, which mainly comprises the following steps:
s1: inputting the acquired monocular image, and obtaining a target candidate region by using a Faster R-CNN network model and its Region Proposal Network (RPN);
s2: a MonoDepth algorithm is used for constructing a depth information estimation branch network, the monocular image is input into the depth information estimation branch network, parallax information is output, then depth information is obtained, and a point cloud is constructed by obtaining three-dimensional coordinate information of each pixel point in the image, so that a corresponding area is obtained;
s3: pooling is carried out on the candidate region in the step S1 and the corresponding region in the step S2, then the features obtained after pooling are fused, estimation and prediction of each parameter of the target are carried out on the fused features by using a convolutional neural network, and the monocular image three-dimensional target detection process is completed after the prediction is finished.
Further, the process of obtaining the candidate region of the target is as follows: the region proposal network generates a series of proposal regions containing targets through a convolution feature map and an anchor mechanism, generates two-dimensional anchors with preset scales and aspect ratios in each rectangular region, and then outputs the final candidate regions through target score prediction and two-dimensional bounding box regression.
Further, the coordinates of a certain pixel point in the three-dimensional space under the camera coordinate system are obtained through the following formula:
z = f * Cb / Id
x = (Ix - Cx) * z / f
y = (Iy - Cy) * z / f
wherein (Ix, Iy) are the coordinates of a certain pixel point in the monocular image, Id is the predicted parallax, f is the camera focal length, Cb is the baseline distance of the binocular camera, and (Cx, Cy) are the coordinates of the image principal point;
by the method, the three-dimensional coordinate information of each pixel point in the image is further acquired, a point cloud is constructed for the whole scene according to the three-dimensional coordinate information, and the predicted point cloud is encoded into a three-channel input map, from which the corresponding area is obtained.
Further, the candidate area of step S1 is subjected to the maximum pooling process.
Further, the average pooling process is performed on the corresponding area of step S2.
Further, after the pooled candidate region and the pooled corresponding region are brought to the same size, fusion processing is carried out, and the corresponding region is directly concatenated behind the candidate region.
Further, the estimation and prediction of each parameter specifically includes a category and two-dimensional detection box, scale estimation, direction estimation and three-dimensional position estimation.
The technical scheme provided by the invention has the beneficial effects that:
1. the three-dimensional target detection task is completed only by means of the input monocular image, and the more mainstream laser radar point cloud data is not needed.
2. By introducing depth information estimation aiming at the position information of the target in the image, which is easy to lose through the convolutional neural network, the position information of the target in the image can be fully utilized, and the result detection precision is improved.
3. The whole target detection network model can carry out end-to-end joint training, the utilization efficiency of computing resources is improved through multi-task training learning, and the problem that a large amount of training time, storage space and computing resources are needed is improved to a certain extent.
Drawings
The invention will be further described with reference to the accompanying drawings and examples, in which:
fig. 1 is a frame diagram of a monocular image three-dimensional target detection method based on depth information estimation in an embodiment of the present invention.
FIG. 2 illustrates the Faster R-CNN network model and the Region Proposal Network (RPN) according to an embodiment of the present invention.
Detailed Description
For a more clear understanding of the technical features, objects and effects of the present invention, embodiments of the present invention will now be described in detail with reference to the accompanying drawings.
The embodiment of the invention provides a monocular image three-dimensional target detection method based on depth information estimation, only a monocular image needs to be input, and a network can carry out end-to-end training and prediction to complete a three-dimensional detection task of a road target. The method is a detection model based on machine learning/deep learning on the whole, and the whole model framework can be roughly divided into three parts: the method comprises a candidate region proposing part, a depth estimation branch network part and a parameter estimation predicting part, wherein the output of a detection model is the estimation prediction of each parameter, and the detection model is trained by utilizing a training set.
Referring to fig. 1-2, fig. 1 is a frame diagram of a monocular image three-dimensional target detection method based on depth information estimation according to an embodiment of the present invention, and fig. 2 shows the Faster R-CNN network model and the Region Proposal Network (RPN) according to an embodiment of the present invention. The monocular image three-dimensional target detection method specifically includes the following steps:
s1, candidate area proposition
The Faster R-CNN network model from two-dimensional target detection is selected, and its Region Proposal Network (RPN) is used to obtain the candidate regions of the target. The category and the two-dimensional bounding box of the target are predicted directly from the extracted RoI features, and the two-dimensional bounding box is output together with the other parameters that need to be predicted subsequently.
The RPN generates a series of proposed regions containing targets through a convolution feature map and an anchor point mechanism, and generates two-dimensional anchor points with preset proportion and aspect ratio in each rectangular region. The network may then output the final candidate region by object score prediction and two-dimensional bounding box regression.
The network model of Faster R-CNN is shown in FIG. 2, and the overall two-dimensional target detection implementation flow is roughly as follows (see the code sketch after this list):
a. inputting the whole picture into CNN for feature extraction;
b. generating proposed regions using the Region Proposal Network (RPN), about 300 proposed regions per picture;
c. projecting the suggested region onto a last layer of a convolution feature map of the convolution neural network;
d. generating a feature map with a fixed size for each proposed region through the RoI pooling layer;
e. and performing joint training and prediction on the classification probability and the frame regression by using a loss function.
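The snippet below is a minimal sketch of this two-stage flow using the publicly available torchvision Faster R-CNN; it is only an illustration of steps a-e, not the patent's own network, and the image size and pretrained weights are assumed placeholders (a recent torchvision version is assumed).

```python
import torch
import torchvision

# Illustrative sketch of steps a-e: backbone feature extraction, RPN proposals,
# RoI pooling, and the classification / box-regression heads.
model = torchvision.models.detection.fasterrcnn_resnet50_fpn(weights="DEFAULT")
model.eval()

image = torch.rand(3, 375, 1242)           # placeholder for a road-scene RGB frame
with torch.no_grad():
    detections = model([image])[0]         # proposals -> RoI pooling -> heads

print(detections["boxes"].shape)           # predicted 2D boxes (N, 4)
print(detections["labels"].shape, detections["scores"].shape)
```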
S2. depth estimation branch network
Unlike the estimation of other three-dimensional information, acquiring the three-dimensional position of the target is difficult, and the depth estimation branch network is designed mainly to improve the detection accuracy of the target's three-dimensional position.
In the target detection network model of this method, it is difficult to utilize the position information of the target in the image because of the RoI pooling in the convolutional neural network. First, RoI pooling transforms candidate regions of different sizes into features of uniform size, ignoring projection information of the object, such as its size, in the camera coordinate system. Second, RoI pooling extracts only the candidate region part of the global feature map, so each object is predicted separately, which loses the relative position information between objects and their positional relationship within the whole image.
To address the impact of RoI pooling on location prediction, monocular image-based depth estimation is introduced. The introduction of the depth estimation can obtain the coordinates of each pixel point in the monocular image in a camera coordinate system, so that a three-channel input image with the same width and height as the input image can be obtained. When the two-dimensional candidate area is extracted, the corresponding area of the three-channel input map can be extracted at the same time, so that the three-dimensional position information of each point can be fully utilized. The result of the depth information estimation can be considered a global prior.
(1) Depth information estimation algorithm
A MonoDepth algorithm is used for constructing a depth information estimation branch network, binocular parallax is predicted through an unsupervised method according to the consistency principle of left and right images, and then depth information is further determined. Although the network model is trained by using binocular images, the finally trained network model can predict parallax information and then obtain required depth information only by inputting a monocular image on one side.
Suppose the coordinates of a certain pixel point in the image are (Ix, Iy) and the predicted parallax is Id; then the coordinates of the pixel point in the three-dimensional space under the camera coordinate system are:
z = f * Cb / Id
x = (Ix - Cx) * z / f
y = (Iy - Cy) * z / f
wherein f is the focal length of the camera, Cb is the baseline distance of the binocular camera, and (Cx, Cy) are the coordinates of the image principal point.
By the method, the three-dimensional coordinate information of each pixel point in the image can be obtained, a point cloud can be constructed for the whole scene according to the obtained pixel coordinates, and the estimated point cloud is encoded into a three-channel input map. The corresponding areas obtained from this map have the same size as the candidate areas, which facilitates subsequent operations such as the fusion of the two kinds of features.
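A minimal NumPy sketch of this back-projection is given below; the intrinsics (f, Cb, Cx, Cy) and the constant disparity map are illustrative placeholders, not values from the patent.

```python
import numpy as np

def disparity_to_xyz(disp, f, Cb, Cx, Cy, eps=1e-6):
    """Back-project a predicted disparity map into a 3-channel (x, y, z)
    coordinate map in the camera frame, following the formula above."""
    h, w = disp.shape
    Iy, Ix = np.mgrid[0:h, 0:w]                 # per-pixel image coordinates
    z = f * Cb / np.maximum(disp, eps)          # depth from disparity
    x = (Ix - Cx) * z / f
    y = (Iy - Cy) * z / f
    return np.stack([x, y, z], axis=0)          # shape (3, H, W)

# Assumed KITTI-like intrinsics and a dummy constant disparity, for illustration only
xyz_map = disparity_to_xyz(np.full((375, 1242), 30.0),
                           f=721.5, Cb=0.54, Cx=609.6, Cy=172.9)
print(xyz_map.shape)  # (3, 375, 1242): the three-channel coordinate input map
```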
(2) Feature fusion
In actual operation, different pooling operations are applied to the candidate region from step S1 and the corresponding region from step S2. The pooling operation performed on the candidate region is the usual RoI max pooling: for each sub-region, max pooling selects the maximum element value, and for a candidate region feature map this maximum element is the feature value to which the neural network responds most strongly, so choosing max pooling for the candidate region feature map is reasonable. However, the values in each sub-region of the coordinate value feature map represent the absolute coordinates of pixel points in the three-dimensional coordinate system; taking the maximum would alter the information of those points and introduce errors in the result. Therefore, for the coordinate value feature map representing three-dimensional coordinate information in step S2, RoI average pooling is selected instead of the maximum calculation: the element values in each sub-region are averaged, which retains sufficient background information from the image.
After the candidate region features on the two sides have been extracted and brought to the same size, the fusion operation can be performed. Because the sizes are consistent, the two kinds of features are directly concatenated: the three feature maps representing coordinate values in the coordinate value feature map are concatenated directly behind the candidate region feature maps in the detection network, increasing the channel dimension by three overall. The candidate region features fused with the coordinate value features carry more three-dimensional position information about the object than the original candidate region features, so they provide more information and improve the feature characterization capability. Meanwhile, the fused region features can be used not only for estimating and predicting the three-dimensional position of the target, but also in the estimation and prediction tasks of the other parameter branches, so the detection precision of each parameter can be improved to a certain extent.
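The following PyTorch sketch illustrates the two pooling paths and the channel-wise fusion; the tensor shapes, feature stride and box values are illustrative assumptions, and torchvision's roi_align (which averages bilinearly sampled values) stands in for RoI average pooling.

```python
import torch
from torchvision.ops import roi_pool, roi_align

feat_map  = torch.randn(1, 256, 47, 156)   # backbone feature map (assumed stride 8)
coord_map = torch.randn(1, 3, 375, 1242)   # 3-channel (x, y, z) map from the depth branch
boxes = [torch.tensor([[100., 120., 300., 250.]])]  # one candidate region (x1, y1, x2, y2)

# RoI max pooling on the candidate-region features (roi_pool takes the max per sub-region)
roi_feat = roi_pool(feat_map, boxes, output_size=(7, 7), spatial_scale=1 / 8)

# RoI average pooling on the coordinate map (roi_align averages sampled values,
# which preserves the absolute coordinate values better than a max would)
roi_xyz = roi_align(coord_map, boxes, output_size=(7, 7), spatial_scale=1.0)

# Fusion: concatenate the 3 coordinate channels behind the candidate-region features
fused = torch.cat([roi_feat, roi_xyz], dim=1)   # shape (1, 256 + 3, 7, 7)
print(fused.shape)
```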
S3, parameter estimation and prediction part
And after the characteristics obtained by fusion are obtained, estimating and predicting each parameter of the target by using a convolutional neural network.
(1) Class and two-dimensional detection frame
Because the basic target detection network used is Faster R-CNN, these two detection tasks are completed by the convolutional neural network in step S1 through the basic two-dimensional target detection algorithm flow, and they are output together with the other three-dimensional parameter predictions in the whole detection framework.
(2) Scale estimation
Targets of the same category are similar in size, but directly estimating and predicting the dimensions from the region features obtained by the neural network cannot guarantee precision. Therefore, the method does not regress these parameters directly; instead, it regresses the difference between the relevant data parameters and reference parameters. The reference parameters are obtained by clustering and analysing data of the same category in the training set, and their values are not fixed. The detection model outputs the difference value, which is combined with the reference parameters to complete the scale estimation. The loss function of this estimation branch is as follows:
Ld = SL1(Pd - (D* - Dt))
wherein Pd represents the predicted size, D* represents the true size in the label, Dt represents the trained reference dimensions, and SL1() represents the SmoothL1 function.
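A small PyTorch sketch of this branch under the reconstruction above; the reference dimensions, ground-truth values and predicted offset are illustrative placeholders (e.g. mean car dimensions assumed to come from clustering the training set).

```python
import torch
import torch.nn.functional as F

# Illustrative sketch: the network regresses an offset Pd from class-wise
# reference dimensions Dt obtained by clustering the training set.
ref_dims    = torch.tensor([3.88, 1.63, 1.53])   # assumed reference (l, w, h) for "car"
gt_dims     = torch.tensor([4.20, 1.70, 1.60])   # true size D* from the label
pred_offset = torch.tensor([0.25, 0.05, 0.08], requires_grad=True)  # network output Pd

loss_d = F.smooth_l1_loss(pred_offset, gt_dims - ref_dims)  # SL1(Pd - (D* - Dt))
pred_dims = ref_dims + pred_offset.detach()                 # recovered size at inference
```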
(3) Direction estimation
For the estimation of the target direction, because the roll angle and the pitch angle are almost 0 under ideal conditions, only the yaw angle is considered. However, within the same camera coordinate system, the target yaw angle has no direct correspondence with the appearance of the target in the image, so the yaw angle cannot be obtained by direct regression. Considering that the local observation angle of the target (the angle between the global direction and the ray from the camera center through the target center) varies consistently with the appearance of the target in the image, the yaw angle and the local observation angle satisfy the following relationship:
yaw=α+arctan2(cx,cy)
where yaw represents the yaw angle, α represents the local observation angle, and (cx, cy) represents the coordinates of the target center point. Therefore, the direction estimation can be completed by taking the local observation angle predicted by the detection model and computing the yaw angle with the above formula.
Directly regressing the local angle with a neural network yields low precision, so the invention uses the MultiBin method to regress the angle. The whole angular space is first divided into n regions, denoted bins, and the range of each region can be represented by an interval. For each local angle, it is first classified to determine the interval it belongs to, and once the bin category is determined, the residual Δθ between the local angle and the center of that bin is calculated. Therefore, the actual prediction of the local observation angle is divided into two parts: a classification problem of predicting which bin the local observation angle belongs to, and a regression problem of predicting the residual Δθ between the local observation angle and the center of that bin. The two loss functions are as follows:
Lconf = CE(σ(Pconf), b*)
Lconf is the loss function of the first part, the classification problem, where b* represents the class of the bin to which the true α angle belongs, Pconf is the prediction of a fully-connected network, σ() represents the Sigmoid function, and CE is the cross entropy.
Lreg = (1/n) * Σ SL1(Preg - (cos(Δα), sin(Δα)))
Lreg is the loss function of the second part, the residual regression problem, where (cos(Δα), sin(Δα)) is the true residual vector, Preg is the corresponding prediction, SL1() represents the SmoothL1 function, and n represents the number of bins to which α belongs.
The loss function for the entire local viewing angle α is as follows, where ω determines the relative weights of the two components:
Lα=Lconf+ω*Lreg
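A hedged PyTorch sketch of this MultiBin angle loss follows; the bin layout, helper names and the weight ω are illustrative assumptions rather than the patent's exact settings.

```python
import math
import torch
import torch.nn.functional as F

# Assumed bin layout: n equal bins over [-pi, pi); values are illustrative.
n_bins = 4
bin_width = 2 * math.pi / n_bins
bin_centers = -math.pi + bin_width * (torch.arange(n_bins) + 0.5)

def multibin_angle_loss(alpha_gt, conf_logits, reg_pred, omega=1.0):
    # classification part: which bin the true local observation angle falls into
    gt_bin = torch.argmin(torch.abs(bin_centers - alpha_gt))
    l_conf = F.cross_entropy(conf_logits.unsqueeze(0), gt_bin.unsqueeze(0))

    # regression part: SmoothL1 on the (cos, sin) residual to that bin's center
    delta = alpha_gt - bin_centers[gt_bin]
    target = torch.stack([torch.cos(delta), torch.sin(delta)])
    l_reg = F.smooth_l1_loss(reg_pred[gt_bin], target)

    return l_conf + omega * l_reg   # L_alpha = L_conf + omega * L_reg

loss_alpha = multibin_angle_loss(alpha_gt=torch.tensor(0.8),
                                 conf_logits=torch.randn(n_bins),
                                 reg_pred=torch.randn(n_bins, 2))
```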
(4) three-dimensional position estimation
For the estimation of the three-dimensional position of the target, the design mainly introduces a depth information estimation branch network, which is already set forth in the step S2.
(5) Multitask learning
In order to optimize a detection model framework of the whole monocular image three-dimensional target detection method based on depth information estimation, joint training is adopted, and on the basis of Faster R-CNN, the whole network is trained end to end. The multiple prediction branches share weights of the convolutional neural network, each branch corresponds to different parameter objects and loss functions thereof, and the total loss function of the whole network is as follows:
L=w2d*L2d+wd*Ld+wα*Lα+wloc*Lloc
wherein L2d, Ld, Lα and Lloc denote the loss functions of the target two-dimensional detection frame, the scale, the direction and the three-dimensional position respectively, and the coefficients w2d, wd, wα and wloc in the above formula determine the weight proportion of each part.
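A minimal sketch of the joint objective L above; the branch losses are placeholder scalars and the weights are illustrative assumptions, not the patent's tuned values.

```python
import torch

# Placeholder branch losses standing in for L2d, Ld, Lalpha and Lloc.
loss_2d  = torch.tensor(0.7, requires_grad=True)
loss_d   = torch.tensor(0.2, requires_grad=True)
loss_a   = torch.tensor(0.4, requires_grad=True)
loss_loc = torch.tensor(0.9, requires_grad=True)

w2d, wd, wa, wloc = 1.0, 1.0, 0.5, 1.0          # illustrative weight proportions
total = w2d * loss_2d + wd * loss_d + wa * loss_a + wloc * loss_loc
total.backward()   # a single backward pass trains all branches jointly end to end
```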
L is the overall loss function of the detection model of the method. By adjusting the weight of each part and determining the allowable range of the loss function according to actual requirements, the final detection model can be obtained, the monocular image three-dimensional target detection based on depth information estimation is completed, and the detection result is output.
In the whole monocular image three-dimensional target detection method based on depth information estimation, the key points are as follows:
1. on the basis of a target detection network model, depth information estimation is introduced to complete a three-dimensional target detection task based on a monocular image, and position information/depth information of a target in the image is fully utilized.
2. By means of multi-task learning, the whole target detection network model can perform end-to-end joint training, and through multi-task training learning, the utilization efficiency of computing resources is improved, and the overall detection precision is improved.
Compared with the prior art, the invention has the beneficial effects that:
1. the three-dimensional target detection task is completed only by means of the input monocular image, and the more mainstream laser radar point cloud data is not needed.
2. By introducing depth information estimation aiming at the position information of the target in the image, which is easy to lose through the convolutional neural network, the position information of the target in the image can be fully utilized, and the result detection precision is improved.
3. The whole target detection network model can carry out end-to-end joint training, the utilization efficiency of computing resources is improved through multi-task training learning, and the problem that a large amount of training time, storage space and computing resources are needed is improved to a certain extent.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (7)

1. A monocular image three-dimensional target detection method based on depth information estimation is characterized in that: the method comprises the following steps:
s1: inputting the acquired monocular image, and obtaining a target candidate area by using a Faster R-CNN network model and an area proposal network thereof;
s2: a MonoDepth algorithm is used for constructing a depth information estimation branch network, the monocular image is input into the depth information estimation branch network, parallax information is output, then depth information is obtained, and a point cloud is constructed by obtaining three-dimensional coordinate information of each pixel point in the image, so that a corresponding area is obtained;
s3: pooling is carried out on the candidate region in the step S1 and the corresponding region in the step S2, then the features obtained after pooling are fused, estimation and prediction of each parameter of the target are carried out on the fused features by using a convolutional neural network, and the monocular image three-dimensional target detection process is completed after the prediction is finished.
2. The method for detecting the monocular image three-dimensional target based on depth information estimation as claimed in claim 1, wherein: in step S1, the process of obtaining the candidate region of the target is: the region proposal network generates a series of proposal regions containing targets through a convolution feature map and an anchor mechanism, generates two-dimensional anchors with preset scales and aspect ratios in each rectangular region, and then outputs the final candidate regions through target score prediction and two-dimensional bounding box regression.
3. The method for detecting the monocular image three-dimensional target based on depth information estimation as claimed in claim 1, wherein: in step S2, the coordinates of a certain pixel point in the three-dimensional space under the camera coordinate system are obtained by the following formula:
z = f * Cb / Id
x = (Ix - Cx) * z / f
y = (Iy - Cy) * z / f
wherein (Ix, Iy) are the coordinates of a certain pixel point in the monocular image, Id is the predicted parallax, f is the camera focal length, Cb is the baseline distance of the binocular camera, and (Cx, Cy) are the coordinates of the image principal point;
by the method, the three-dimensional coordinate information of each pixel point in the image is further acquired, a point cloud is constructed for the whole scene according to the three-dimensional coordinate information, and the predicted point cloud is encoded into a three-channel input map, from which the corresponding area is obtained.
4. The method for detecting the monocular image three-dimensional target based on depth information estimation as claimed in claim 1, wherein: in step S3, the candidate region of step S1 is subjected to maximum pooling.
5. The method for detecting the monocular image three-dimensional target based on depth information estimation as claimed in claim 1, wherein: in step S3, the average pooling process is performed on the corresponding area in step S2.
6. The method for detecting the monocular image three-dimensional target based on depth information estimation as claimed in claim 1, wherein: in step S3, the sizes of the pooled candidate regions and the corresponding regions are matched, and then fusion processing is performed to directly connect the corresponding regions in series behind the candidate regions.
7. The method for detecting the monocular image three-dimensional target based on depth information estimation as claimed in claim 1, wherein: in step S3, the estimation and prediction of each parameter specifically includes a category and two-dimensional detection box, scale estimation, direction estimation, and three-dimensional position estimation.
CN202110541790.4A 2021-05-18 2021-05-18 Monocular image three-dimensional target detection method based on depth information estimation Pending CN113436239A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110541790.4A CN113436239A (en) 2021-05-18 2021-05-18 Monocular image three-dimensional target detection method based on depth information estimation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110541790.4A CN113436239A (en) 2021-05-18 2021-05-18 Monocular image three-dimensional target detection method based on depth information estimation

Publications (1)

Publication Number Publication Date
CN113436239A true CN113436239A (en) 2021-09-24

Family

ID=77803321

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110541790.4A Pending CN113436239A (en) 2021-05-18 2021-05-18 Monocular image three-dimensional target detection method based on depth information estimation

Country Status (1)

Country Link
CN (1) CN113436239A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114723827A (en) * 2022-04-28 2022-07-08 哈尔滨理工大学 Grabbing robot target positioning system based on deep learning
TWI803328B (en) * 2022-05-24 2023-05-21 鴻海精密工業股份有限公司 Depth image generation method, system, electronic equipment and readable storage media

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689008A (en) * 2019-09-17 2020-01-14 大连理工大学 Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction
CN111815665A (en) * 2020-07-10 2020-10-23 电子科技大学 Single image crowd counting method based on depth information and scale perception information

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110689008A (en) * 2019-09-17 2020-01-14 大连理工大学 Monocular image-oriented three-dimensional object detection method based on three-dimensional reconstruction
CN111815665A (en) * 2020-07-10 2020-10-23 电子科技大学 Single image crowd counting method based on depth information and scale perception information

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
徐斌: "基于单目图像的三维物体检测研究", 《中国优秀硕士学位论文全文数据库信息科技辑(月刊)》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114723827A (en) * 2022-04-28 2022-07-08 哈尔滨理工大学 Grabbing robot target positioning system based on deep learning
TWI803328B (en) * 2022-05-24 2023-05-21 鴻海精密工業股份有限公司 Depth image generation method, system, electronic equipment and readable storage media

Similar Documents

Publication Publication Date Title
WO2021233029A1 (en) Simultaneous localization and mapping method, device, system and storage medium
CN111583369B (en) Laser SLAM method based on facial line angular point feature extraction
US20210142095A1 (en) Image disparity estimation
CN111429514A (en) Laser radar 3D real-time target detection method fusing multi-frame time sequence point clouds
CN110688905B (en) Three-dimensional object detection and tracking method based on key frame
CN112258618A (en) Semantic mapping and positioning method based on fusion of prior laser point cloud and depth map
Berrio et al. Camera-LIDAR integration: Probabilistic sensor fusion for semantic mapping
CN111201451A (en) Method and device for detecting object in scene based on laser data and radar data of scene
CN112541460B (en) Vehicle re-identification method and system
CN115049821A (en) Three-dimensional environment target detection method based on multi-sensor fusion
CN113436239A (en) Monocular image three-dimensional target detection method based on depth information estimation
CN117274749B (en) Fused 3D target detection method based on 4D millimeter wave radar and image
Cui et al. Dense depth-map estimation based on fusion of event camera and sparse LiDAR
CN113989758A (en) Anchor guide 3D target detection method and device for automatic driving
CN113920254B (en) Monocular RGB (Red Green blue) -based indoor three-dimensional reconstruction method and system thereof
CN115909268A (en) Dynamic obstacle detection method and device
CN117576665B (en) Automatic driving-oriented single-camera three-dimensional target detection method and system
CN114608522A (en) Vision-based obstacle identification and distance measurement method
CN112037282B (en) Aircraft attitude estimation method and system based on key points and skeleton
CN115965961B (en) Local-global multi-mode fusion method, system, equipment and storage medium
CN112950786A (en) Vehicle three-dimensional reconstruction method based on neural network
CN117237411A (en) Pedestrian multi-target tracking method based on deep learning
CN113569803A (en) Multi-mode data fusion lane target detection method and system based on multi-scale convolution
CN114140659A (en) Social distance monitoring method based on human body detection under view angle of unmanned aerial vehicle
CN117523428B (en) Ground target detection method and device based on aircraft platform

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210924

RJ01 Rejection of invention patent application after publication