CN111784680B - Detection method based on consistency of key points of left and right eye views of binocular camera - Google Patents
Detection method based on consistency of key points of left and right eye views of binocular camera
- Publication number: CN111784680B (application CN202010645495.9A)
- Authority: CN (China)
- Legal status: Expired - Fee Related (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06T 7/0002 — Image analysis; inspection of images, e.g. flaw detection
- G06N 3/045 — Neural networks; architecture; combinations of networks
- G06V 10/25 — Determination of region of interest [ROI] or a volume of interest [VOI]
- G06V 10/50 — Extraction of image or video features using histograms, e.g. histogram of oriented gradients [HoG]
- G06T 2207/10012 — Image acquisition modality: stereo images
Abstract
The invention discloses a detection method based on the consistency of key points of the left and right eye views of a binocular camera, comprising the following steps: extract histogram of oriented gradients features using a deterministic network; combining the extracted features, perform 2D object detection on the left and right eye views with a stereo region proposal network to obtain left- and right-view candidate regions; predict key points of the left- and right-view candidate regions with the internal key point prediction module of the stereo region convolutional neural network; match the key points predicted from the left and right views for consistency, establish the corresponding loss function, and minimize each task's loss through training; estimate a 3D box from the predicted key points, then perform pixel-level matching through dense 3D box alignment to further refine the estimated box. The invention exploits the consistency of the left and right key points to improve three-dimensional detection accuracy.
Description
Technical Field
The invention relates to the field of binocular-camera three-dimensional detection, and in particular to a method for 3D detection with a binocular camera based on the consistency of key points of the left and right eye views.
Background
Object detection is an important part of computer vision and has been a research focus almost since the field's inception. 2D object detection has developed greatly, with marked improvements in both accuracy and speed, and with that progress researchers' attention has turned to 3D object detection. 3D object detection also has great practical significance: the field of unmanned driving, for example, cannot do without it. Since 3D object detection still has large room for development, developing 3D object detection algorithms is very important.
Most 3D object detection algorithms are based on lidar, binocular cameras, or monocular cameras, and the methods used differ with the device. Lidar-based algorithms are the most numerous of the three and can now achieve extremely high accuracy. However, lidar is expensive, easily affected by weather (particularly rain and snow), and can damage human eyes, which is fatal to the popularization of unmanned driving. A monocular camera is inexpensive and unaffected by weather, overcoming those drawbacks of lidar, but its 3D detection error is large and its results unsatisfactory, so monocular 3D detection is also unsuitable for popularization. By comparison, considering accuracy, cost, and efficiency, the binocular camera is the best-rounded of the three: it can obtain relatively accurate depth values. 3D detection based on the binocular camera therefore has great research significance. The Stereo Region Convolutional Neural Network (Stereo R-CNN), a 3D object detection algorithm, is both accurate and fast; however, its 3D detection accuracy still has room for improvement.
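The depth advantage of the binocular camera noted above comes from triangulation over the stereo baseline, z = f·B/d. A minimal sketch of that relation (the focal length and baseline below are illustrative KITTI-like values, not taken from the patent):

```python
def depth_from_disparity(focal_px: float, baseline_m: float, disparity_px: float) -> float:
    """Metric depth of a point observed with the given pixel disparity: z = f * B / d."""
    if disparity_px <= 0:
        raise ValueError("disparity must be positive for a finite depth")
    return focal_px * baseline_m / disparity_px

# Illustrative setup: ~721 px focal length, ~0.54 m baseline.
z = depth_from_disparity(721.0, 0.54, 30.0)  # about 13 m
```

A 1-pixel disparity error at this range shifts the depth by roughly z/d, which is why sub-pixel (dense) alignment matters for far objects.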
It is therefore of interest to propose an efficient method for binocular vision 3D detection.
Disclosure of Invention
The invention provides a detection method based on the consistency of key points of the left and right eye views of a binocular camera, which uses the consistency of the left and right key points to improve three-dimensional detection accuracy, described in detail as follows:

A detection method based on the consistency of key points of the left and right eye views of a binocular camera comprises the following steps:

extracting histogram of oriented gradients features using a deterministic network;

combining the extracted features, performing 2D object detection on the left and right eye views with a stereo region proposal network to obtain left- and right-view candidate regions;

predicting key points of the left- and right-view candidate regions with the internal key point prediction module of the stereo region convolutional neural network;

matching the key points predicted from the left and right views for consistency, establishing the corresponding loss function, and minimizing each task's loss through training;

estimating a 3D box from the predicted key points, then performing pixel-level matching through dense 3D box alignment to further refine the box estimated in the previous step.
Wherein the loss function is specifically:

L = w_cls^p · L_cls^p + w_reg^p · L_reg^p + w_cls^r · L_cls^r + w_box^r · L_box^r + w_α^r · L_α^r + w_dim^r · L_dim^r + β · w_kpl^r · L_kpl^r + γ · w_kpr^r · L_kpr^r

where w_cls^p and L_cls^p are the classification weight and loss in the RPN; w_reg^p and L_reg^p are the weight and loss of the regression task in the RPN; w_cls^r and L_cls^r are the classification weight and loss in the R-CNN; w_box^r and L_box^r are the weight and loss of the stereo box task in the R-CNN; w_α^r and L_α^r are the weight and loss of the viewpoint angle in the R-CNN; w_dim^r and L_dim^r are the weight and loss of the dimensions in the R-CNN; w_kpl^r and L_kpl^r are the weight and loss of the left key point in the R-CNN; w_kpr^r and L_kpr^r are the weight and loss of the right key point in the R-CNN; superscript p denotes the RPN stage and r the R-CNN stage; and β and γ are the coefficients of the left and right key point terms, with β + γ = 1.
The technical scheme provided by the invention has the following beneficial effects:

1. the method corrects the key points using the consistency of the left and right eye views' key points, improving the accuracy of three-dimensional detection;

2. the method modifies the objective function, improving the effect of the 3D object detection algorithm;

3. the invention combines several ideas to achieve the best effect, and is particularly suitable for binocular-camera-based 3D object detection.
Drawings
Fig. 1 is a flowchart of a detection method based on consistency of key points of left and right eye views of a binocular camera.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention are described in further detail below.
Example 1
The embodiment of the invention provides a detection method based on the consistency of key points of the left and right eye views of a binocular camera, comprising the following steps:

101: extract Histogram of Oriented Gradients (HOG) features using a deterministic network (DetNet);

102: combining the extracted HOG features, perform 2D object detection on the left and right eye views with a stereo region proposal network (Stereo RPN) to obtain left- and right-view candidate regions;

103: predict key points of the left- and right-view candidate regions with the internal key point prediction module of Stereo R-CNN;

104: match the key points predicted from the left and right views for consistency, establish the corresponding loss function, and minimize each task's loss through training to improve the accuracy of the classification and detection tasks;

105: estimate a 3D box from the predicted key points, then perform pixel-level matching through dense 3D box alignment to further refine the box estimated in the previous step.
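The dense 3D box alignment in step 105 can be sketched as a photometric search: pixels inside the detected box in the left view should match the right view shifted by the disparity implied by a candidate depth, so searching for the disparity that minimizes the photometric error refines the box. A minimal one-row sketch (the brute-force search and all names are illustrative assumptions, not the patent's implementation):

```python
import numpy as np

def photometric_cost(left_row: np.ndarray, right_row: np.ndarray,
                     xs: np.ndarray, disparity: int) -> float:
    """Mean squared intensity error between left pixels xs and their right matches."""
    xr = xs - disparity                       # corresponding right-image columns
    valid = (xr >= 0) & (xr < right_row.size)
    diff = left_row[xs[valid]] - right_row[xr[valid]]
    return float(np.mean(diff ** 2))

def best_disparity(left_row, right_row, xs, candidates):
    """Brute-force the candidate disparity with the lowest photometric cost."""
    return min(candidates, key=lambda d: photometric_cost(left_row, right_row, xs, d))

# Toy example: the right row equals the left row shifted by 5 pixels.
left = np.arange(50, dtype=float) ** 1.5
right = np.roll(left, -5)
xs = np.arange(10, 40)
d = best_disparity(left, right, xs, range(0, 10))  # recovers 5
```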
Example 2
The scheme in example 1 is further described below with calculation formulas and examples:

201: extract HOG features from the KITTI (Karlsruhe Institute of Technology and Toyota Technological Institute at Chicago) dataset using DetNet for subsequent processing;
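As a toy illustration of what the HOG feature in step 201 encodes, the following sketch computes a single-cell orientation histogram; a real pipeline would use a full HOG implementation (e.g. block normalization over many cells), and all names here are illustrative:

```python
import numpy as np

def cell_hog(cell: np.ndarray, n_bins: int = 9) -> np.ndarray:
    """Histogram of unsigned gradient orientations (0-180 deg) weighted by magnitude."""
    gy, gx = np.gradient(cell.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.degrees(np.arctan2(gy, gx)) % 180.0      # unsigned orientation
    hist, _ = np.histogram(ang, bins=n_bins, range=(0.0, 180.0), weights=mag)
    norm = np.linalg.norm(hist)
    return hist / norm if norm > 0 else hist

# A vertical step edge concentrates all gradient energy in the 0-degree bin.
cell = np.tile(np.r_[np.zeros(4), np.ones(4)], (8, 1))  # 8x8 step edge
h = cell_hog(cell)
```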
202: a region proposal network (RPN) uses a sliding window to select regions of interest and keeps the best results through non-maximum suppression. Because a binocular camera is used, the RPN is extended into a two-branch stereo region proposal network (Stereo RPN). The Stereo RPN generates left and right target RoI regions, which are filtered by non-maximum suppression into left and right target candidate regions;
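The non-maximum suppression in step 202 is the standard greedy procedure; a minimal sketch (the threshold and boxes below are illustrative, not values from the patent):

```python
import numpy as np

def iou(a: np.ndarray, b: np.ndarray) -> float:
    """Intersection-over-union of two [x1, y1, x2, y2] boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter)

def nms(boxes: np.ndarray, scores: np.ndarray, thresh: float = 0.7):
    """Keep the highest-scoring box of each overlapping cluster (greedy NMS)."""
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size:
        i = order[0]
        keep.append(int(i))
        # drop remaining boxes that overlap the kept one too much
        order = order[1:][[iou(boxes[i], boxes[j]) < thresh for j in order[1:]]]
    return keep

boxes = np.array([[0, 0, 10, 10], [0.5, 0.5, 10.5, 10.5], [20, 20, 30, 30]], float)
scores = np.array([0.9, 0.8, 0.7])
kept = nms(boxes, scores)  # box 1 overlaps box 0 and is suppressed
```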
203: predict key points from the left and right target candidate regions; the predicted key points are used for the subsequent stereo estimation;

204: since the left and right target views are theoretically consistent and differ only by disparity information, corresponding key points in the left and right target views are also consistent. The two are matched and a corresponding loss function is established.
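The consistency in step 204 can be sketched as a rectified-epipolar check: a key point and its match lie on the same image row, and their column difference is a non-negative disparity. A minimal sketch (the tolerances are illustrative assumptions, not the patent's values):

```python
def consistent(kp_left: tuple, kp_right: tuple,
               row_tol: float = 2.0, max_disp: float = 192.0) -> bool:
    """True if (x, y) keypoints satisfy the rectified stereo consistency constraint."""
    (xl, yl), (xr, yr) = kp_left, kp_right
    disparity = xl - xr                        # left x >= right x for points in front
    return abs(yl - yr) <= row_tol and 0.0 <= disparity <= max_disp

ok = consistent((120.0, 55.0), (100.0, 55.4))   # same row, 20 px disparity
bad = consistent((120.0, 55.0), (100.0, 80.0))  # rows disagree
```

A predicted pair failing this check cannot correspond to the same physical point, which is what lets the matching step discard or correct bad key points.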
The loss function during training is as follows:
L = w_cls^p · L_cls^p + w_reg^p · L_reg^p + w_cls^r · L_cls^r + w_box^r · L_box^r + w_α^r · L_α^r + w_dim^r · L_dim^r + β · w_kpl^r · L_kpl^r + γ · w_kpr^r · L_kpr^r

where w_cls^p and L_cls^p are the classification weight and loss in the RPN; w_reg^p and L_reg^p are the weight and loss of the regression task in the RPN; w_cls^r and L_cls^r are the classification weight and loss in the R-CNN; w_box^r and L_box^r are the weight and loss of the stereo box task in the R-CNN; w_α^r and L_α^r are the weight and loss of the viewpoint angle in the R-CNN; w_dim^r and L_dim^r are the weight and loss of the dimensions in the R-CNN; w_kpl^r and L_kpl^r are the weight and loss of the left key point in the R-CNN; w_kpr^r and L_kpr^r are the weight and loss of the right key point in the R-CNN; superscript p denotes the RPN stage and r the R-CNN stage.
To keep the key point weight consistent with those of the stereo box, viewpoint angle, and dimensions, the left and right key point terms carry coefficients β and γ respectively, with β + γ = 1.

Each loss term L is weighted by uncertainty. Experiments show that β = 0.8 and γ = 0.2 gives the best results.
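Under these definitions, the combined training loss can be sketched as follows (the dictionary keys and example values are illustrative; the per-task weights w are assumed to be given, e.g. learned from uncertainty):

```python
def total_loss(w: dict, L: dict, beta: float = 0.8, gamma: float = 0.2) -> float:
    """Weighted multi-task loss with left/right keypoint coefficients beta + gamma = 1."""
    assert abs(beta + gamma - 1.0) < 1e-9
    shared = ["cls_p", "reg_p", "cls_r", "box_r", "alpha_r", "dim_r"]
    loss = sum(w[k] * L[k] for k in shared)
    loss += beta * w["kp_l"] * L["kp_l"] + gamma * w["kp_r"] * L["kp_r"]
    return loss

keys = ["cls_p", "reg_p", "cls_r", "box_r", "alpha_r", "dim_r", "kp_l", "kp_r"]
w = {k: 1.0 for k in keys}                     # uniform weights for illustration
L = {"cls_p": 0.5, "reg_p": 0.4, "cls_r": 0.3, "box_r": 0.2,
     "alpha_r": 0.1, "dim_r": 0.1, "kp_l": 0.5, "kp_r": 0.5}
total = total_loss(w, L)                       # 1.6 + 0.8*0.5 + 0.2*0.5 = 2.1
```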
205: estimate the 3D box from the obtained predicted key points.

The following test experiments are given for the method of the invention, which performs binocular-camera 3D detection based on the consistency of the left and right eye views' key points:
The detection performance of the embodiment is measured by Average Precision (AP); the evaluation covers 2D and 3D detection, and the test images are divided by difficulty into easy, moderate, and hard modes. Average Precision summarizes the detector's precision-recall curve as the area under it. 2D detection covers the left, right, and stereo views at an Intersection-over-Union (IoU) threshold of 0.7 (table 1); 3D detection covers bird's-eye-view detection and 3D box detection at IoU thresholds of 0.5 (table 2) and 0.7 (table 3).
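The Average Precision metric can be sketched as the area under the interpolated precision-recall curve; the recall/precision values below are illustrative, not numbers from the experiments:

```python
def average_precision(recalls, precisions) -> float:
    """AP = sum over recall steps of (delta recall) * interpolated precision."""
    # make precision monotonically non-increasing from the right (standard interpolation)
    prec = list(precisions)
    for i in range(len(prec) - 2, -1, -1):
        prec[i] = max(prec[i], prec[i + 1])
    ap, prev_r = 0.0, 0.0
    for r, p in zip(recalls, prec):
        ap += (r - prev_r) * p
        prev_r = r
    return ap

ap = average_precision([0.2, 0.5, 1.0], [1.0, 0.8, 0.5])  # 0.2 + 0.24 + 0.25 = 0.69
```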
To evaluate the performance of the method, the embodiment uses the 7481 image pairs of the KITTI dataset, randomly divided into two groups of roughly equal size, one for training and the other for testing. During evaluation only the Car label is considered; other labels (including bus and other labels similar to the Car label) are not considered.
TABLE 1
TABLE 2
TABLE 3
As can be seen from table 1, for 2D detection (IoU = 0.7) the proposed method's accuracy improvement in the easy and moderate modes is insignificant, but the hard mode improves by about 2%. For 3D detection, at both IoU = 0.5 and IoU = 0.7 the bird's-eye-view detection accuracy improves by about 1% to 3%, and the 3D box detection accuracy also improves. These experimental results demonstrate the effectiveness of the method.
In the embodiments of the present invention, unless a specific device model is stated, the model of each device is not limited, as long as the device can perform the functions described above.
Those skilled in the art will appreciate that the drawings are only schematic illustrations of preferred embodiments, and the above-mentioned serial numbers of the embodiments of the present invention are only for description and do not represent the merits of the embodiments.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.
Claims (1)
1. A detection method based on the consistency of key points of the left and right eye views of a binocular camera, characterized by comprising the following steps:

extracting histogram of oriented gradients features using a deterministic network;

combining the extracted features, performing 2D object detection on the left and right eye views with a stereo region proposal network to obtain left- and right-view candidate regions;

predicting key points of the left- and right-view candidate regions with the internal key point prediction module of the stereo region convolutional neural network;

matching the key points predicted from the left and right views for consistency, establishing the corresponding loss function, and minimizing each task's loss through training;

estimating a 3D box from the predicted key points, then performing pixel-level matching through dense 3D box alignment to further refine the box estimated in the previous step;

wherein the consistency matching of the key points predicted from the left and right eye views is as follows: the left and right target views are theoretically consistent and differ only by disparity information, so corresponding key points in the left and right target views are consistent, and the key points predicted from the left and right target views are matched accordingly;
the loss function is specifically:

L = w_cls^p · L_cls^p + w_reg^p · L_reg^p + w_cls^r · L_cls^r + w_box^r · L_box^r + w_α^r · L_α^r + w_dim^r · L_dim^r + β · w_kpl^r · L_kpl^r + γ · w_kpr^r · L_kpr^r

where w_cls^p and L_cls^p are the classification weight and loss in the RPN; w_reg^p and L_reg^p are the weight and loss of the regression task in the RPN; w_cls^r and L_cls^r are the classification weight and loss in the R-CNN; w_box^r and L_box^r are the weight and loss of the stereo box task in the R-CNN; w_α^r and L_α^r are the weight and loss of the viewpoint angle in the R-CNN; w_dim^r and L_dim^r are the weight and loss of the dimensions in the R-CNN; w_kpl^r and L_kpl^r are the weight and loss of the left key point in the R-CNN; w_kpr^r and L_kpr^r are the weight and loss of the right key point in the R-CNN; superscript p denotes the RPN stage and r the R-CNN stage; and β and γ are the coefficients of the left and right key point terms, with β + γ = 1.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010645495.9A CN111784680B (en) | 2020-07-06 | 2020-07-06 | Detection method based on consistency of key points of left and right eye views of binocular camera |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010645495.9A CN111784680B (en) | 2020-07-06 | 2020-07-06 | Detection method based on consistency of key points of left and right eye views of binocular camera |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111784680A CN111784680A (en) | 2020-10-16 |
CN111784680B true CN111784680B (en) | 2022-06-28 |
Family
ID=72758031
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010645495.9A Expired - Fee Related CN111784680B (en) | 2020-07-06 | 2020-07-06 | Detection method based on consistency of key points of left and right eye views of binocular camera |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111784680B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114743045B (en) * | 2022-03-31 | 2023-09-26 | 电子科技大学 | Small sample target detection method based on double-branch area suggestion network |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108317953A (en) * | 2018-01-19 | 2018-07-24 | 东北电力大学 | A kind of binocular vision target surface 3D detection methods and system based on unmanned plane |
CN108335331A (en) * | 2018-01-31 | 2018-07-27 | 华中科技大学 | A kind of coil of strip binocular visual positioning method and apparatus |
CN110110793A (en) * | 2019-05-10 | 2019-08-09 | 中山大学 | Binocular image fast target detection method based on double-current convolutional neural networks |
CN110322507A (en) * | 2019-06-04 | 2019-10-11 | 东南大学 | A method of based on depth re-projection and Space Consistency characteristic matching |
CN111126269A (en) * | 2019-12-24 | 2020-05-08 | 京东数字科技控股有限公司 | Three-dimensional target detection method, device and storage medium |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10970425B2 (en) * | 2017-12-26 | 2021-04-06 | Seiko Epson Corporation | Object detection and tracking |
-
2020
- 2020-07-06 CN CN202010645495.9A patent/CN111784680B/en not_active Expired - Fee Related
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108317953A (en) * | 2018-01-19 | 2018-07-24 | 东北电力大学 | A kind of binocular vision target surface 3D detection methods and system based on unmanned plane |
CN108335331A (en) * | 2018-01-31 | 2018-07-27 | 华中科技大学 | A kind of coil of strip binocular visual positioning method and apparatus |
CN110110793A (en) * | 2019-05-10 | 2019-08-09 | 中山大学 | Binocular image fast target detection method based on double-current convolutional neural networks |
CN110322507A (en) * | 2019-06-04 | 2019-10-11 | 东南大学 | A method of based on depth re-projection and Space Consistency characteristic matching |
CN111126269A (en) * | 2019-12-24 | 2020-05-08 | 京东数字科技控股有限公司 | Three-dimensional target detection method, device and storage medium |
Non-Patent Citations (3)
Title |
---|
Unsupervised Monocular Depth Estimation with Left-Right Consistency; Clément Godard et al.; 2017 IEEE Conference on Computer Vision and Pattern Recognition; 2017 *
Stereo R-CNN based 3D Object Detection for Autonomous Driving; Peiliang Li et al.; 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR); 20191231; entire document *
3D Object Detection Based on Iterative Autonomous Learning; Wang Kangru et al.; Acta Optica Sinica; 20200510 (No. 09); entire document *
Also Published As
Publication number | Publication date |
---|---|
CN111784680A (en) | 2020-10-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP7106665B2 (en) | MONOCULAR DEPTH ESTIMATION METHOD AND DEVICE, DEVICE AND STORAGE MEDIUM THEREOF | |
US11145078B2 (en) | Depth information determining method and related apparatus | |
CN108648161B (en) | Binocular vision obstacle detection system and method of asymmetric kernel convolution neural network | |
CN107491071B (en) | Intelligent multi-robot cooperative mapping system and method thereof | |
CN109887021B (en) | Cross-scale-based random walk stereo matching method | |
JP6574611B2 (en) | Sensor system for obtaining distance information based on stereoscopic images | |
CN110246151B (en) | Underwater robot target tracking method based on deep learning and monocular vision | |
US11822621B2 (en) | Systems and methods for training a machine-learning-based monocular depth estimator | |
CN102156995A (en) | Video movement foreground dividing method in moving camera | |
CN110992424B (en) | Positioning method and system based on binocular vision | |
US20220083789A1 (en) | Real-Time Target Detection And 3d Localization Method Based On Single Frame Image | |
dos Santos Rosa et al. | Sparse-to-continuous: Enhancing monocular depth estimation using occupancy maps | |
CN112686952A (en) | Image optical flow computing system, method and application | |
CN116310673A (en) | Three-dimensional target detection method based on fusion of point cloud and image features | |
CN111784680B (en) | Detection method based on consistency of key points of left and right eye views of binocular camera | |
US11868438B2 (en) | Method and system for self-supervised learning of pillar motion for autonomous driving | |
Tian et al. | Monocular depth estimation based on a single image: a literature review | |
CN111695480B (en) | Real-time target detection and 3D positioning method based on single frame image | |
Ji et al. | Stereo 3D object detection via instance depth prior guidance and adaptive spatial feature aggregation | |
US9323999B2 (en) | Image recoginition device, image recognition method, and image recognition program | |
CN110864670A (en) | Method and system for acquiring position of target obstacle | |
CN113284221B (en) | Target detection method and device and electronic equipment | |
JP2016004382A (en) | Motion information estimation device | |
CN115272450A (en) | Target positioning method based on panoramic segmentation | |
Akın et al. | Challenges in determining the depth in 2-d images |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| CB03 | Change of inventor or designer information | Inventors after change: Yu Jiexiao, Zhang Meiqi, Jing Peiguang, Su Yuting. Inventors before change: Yu Jiexiao, Jing Peiguang, Zhang Meiqi, Su Yuting |
| GR01 | Patent grant | ||
| CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20220628 |